From eli at dev.mellanox.co.il Thu May 1 00:01:25 2008
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Thu, 01 May 2008 10:01:25 +0300
Subject: [ofa-general] [PATCH] IB/ipoib: fix net queue lockup
In-Reply-To: References: <1209577156.1790.11.camel@mtls03>
Message-ID: <1209625285.1790.24.camel@mtls03>

On Wed, 2008-04-30 at 20:05 -0700, Roland Dreier wrote:
> > we have seen a few other cases where a large tx queue is needed. I
> > think we should choose a larger default value than the current 64.
>
> maybe yes, maybe no... what are the cases where it is needed?
>
> The send queue is basically acting as a "shock absorber" for bursty
> traffic. If the queue is filling up because of a steady traffic rate,
> then making the queue bigger means it will just take a little longer to
> fill. The way a longer send queue helps, I guess, is if the send queue is
> emptying out before the transmit queue is woken up...

I agree, but I want to have a larger buffer to absorb larger peaks. For
example, after applying this patch I tested how many times the net queue
is stopped and woken up when running four streams of netperf, udp, small
packets. When using the default 64 tx queue size it happened 500 times.
When I used a 256 tx queue size it happened only 37 times. This makes me
think that we have larger peaks that a larger queue size can help handle.
Also, looking for example at the Broadcom bnx2 driver on my machine, it
uses a 1000 tx queue len.

From jackm at dev.mellanox.co.il Thu May 1 00:04:27 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 1 May 2008 10:04:27 +0300
Subject: [ofa-general] Re: [PATCH] ib_mthca: use log values instead of numeric values when specifying HCA resource maxes in module parameters
In-Reply-To: References: <200804291822.57820.jackm@dev.mellanox.co.il>
Message-ID: <200805011004.27307.jackm@dev.mellanox.co.il>

On Tuesday 29 April 2008 19:48, Roland Dreier wrote:
> given that mthca has had the old interface for nearly a year and a half,
> what do we gain from changing it now?

We gain clarity and consistency. The mlx4 driver in OFED 1.3 uses log
values in the module parameters (patch for mlx4 that I submitted in
October 2007).

> > I put a check in the patch for detecting if the user specified a log or not,
> > to make the transition from the old method (of numbers instead of logs)
> > easier.
>
> Yes, that is nice. Would the plan be just to allow both methods?

Good idea, but it cannot be done for all parameters. "max rdb per qp" is by
default 4 rdb's per qp (log = 2). If an administrator supplies ONLY this
parameter in an "options" line for ib_mthca, how can I tell if the value is
a log or a number? (Say the administrator places the following line:
"options ib_mthca num_rdb=4" -- how will I know whether the admin means a
log, or just a number?)

Maybe the best solution is to change the parameter name from "num_xxx" to
"log_num_xxx". That way, if the administrator is using an old
/etc/modprobe.conf file, with lines like "options ib_mthca num_xxxx=20000",
then ib_mthca will fail to load, and there will be lines in
/var/log/messages like "ib_mthca: Unknown parameter `num_cq'".

Please note also that very few customers are using this module-parameter
capability as yet.

> it would make sense for mlx4 to allow setting parameter
> values by value and not by log, and then we end up with all the same
> code in both places, and so why not just have mlx4 set by value the same
> way as mthca?

OFED 1.3 has the patch I submitted for mlx4 in October 07, and this already
uses logs, not values. We would then be confusing Hermon customers if we
change this to values.

I think it is healthiest to:
1. Use the ib_mthca patch I submitted, but change the parameter names from
"num_xxx" to "log_num_xxx"
2. Take the mlx4 patch as is (maybe adding a check that values are <31).

- Jack
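As a sketch of what Jack's rename might look like in practice (the parameter
name log_num_qp and its default here are hypothetical, not the actual
ib_mthca patch): an old "options ib_mthca num_qp=..." line then fails with
"Unknown parameter", and the handler bounds the log value the way Jack
suggests for mlx4.

#include <linux/module.h>
#include <linux/moduleparam.h>

static int log_num_qp = 16;	/* hypothetical default: 2^16 QPs */

static int set_log_num_qp(const char *val, struct kernel_param *kp)
{
	int ret = param_set_int(val, kp);	/* parse the log value */

	if (ret)
		return ret;
	if (log_num_qp < 0 || log_num_qp > 30)	/* keep 1 << log in int range */
		return -EINVAL;			/* module load fails loudly */
	return 0;
}

module_param_call(log_num_qp, set_log_num_qp, param_get_int,
		  &log_num_qp, 0444);
MODULE_PARM_DESC(log_num_qp, "log2 of the number of QPs to allocate (0..30)");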
From kensandars at hotmail.com Thu May 1 01:24:15 2008
From: kensandars at hotmail.com (Ken Sandars)
Date: Thu, 1 May 2008 18:24:15 +1000
Subject: [Ips] [Stgt-devel] [ofa-general] Re: Calculating the VA in iSER header
In-Reply-To: <39C75744D164D948A170E9792AF8E7CAF60D50@exil.voltaire.com>
References: <4804B03C.6060507@voltaire.com> <694d48600804160122l1cc97b8aka8986ee6deb7dec8@mail.gmail.com> <20080416144830.GC23861@osc.edu> <694d48600804170413g4d54cd9g447abd345a1f6301@mail.gmail.com> <20080429170516.GA8857@osc.edu> <39C75744D164D948A170E9792AF8E7CAF60D50@exil.voltaire.com>
Message-ID:

>> [Ken] It appears the current Linux iSER initiator does not send the HELLO
>> message when the connection transits to full feature phase. The stgt
>> target also ignores this message (if it were to appear).

[Ken] The IBTA document does not mention the HELLO/HELLOREPLY messages.
Implementing this message exchange gives a distinction between the current
implementations and those that will correctly calculate the write_va (as per
Pete Wyckoff's option 3).

>> [Ken] Both of these implementations use a non-conformant iSER header (they
>> add write_va and read_va fields, which incidentally do not appear to be
>> used). Are these changes documented anywhere in the IB domain, or are
>> these variations needed for another reason?
>
> [Erez] Take a look at the iSER for IB annex:
> http://www.infinibandta.org/members/spec/Annex_iSER.PDF

[Ken] Ouch. That link requires a username/password. Looks like it is only
available to members of the InfiniBand Trade Association. Fortunately I
gained access to it with username "open" and password "standard". ;-)

[Ken] Neither of these implementations sends or examines the iSER CM REQ/REP
message private data. The document doesn't define what action to take when
this message is absent. Interestingly, when the target reports that "ZBVA
shall be used for this connection" and "the target shall issue Send with
Invalidate as needed", then it appears the iSER header specified in RFC 5046
should be used for control-type PDUs. Is there any plan to conform with the
list of requirements for IBTA compliance?

Cheers,
Ken

From kliteyn at dev.mellanox.co.il Thu May 1 04:48:25 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 01 May 2008 14:48:25 +0300
Subject: [ofa-general] [PATCH] opensm/osm_qos_policy.c: log matched QoS criteria
Message-ID: <4819AE09.2020509@dev.mellanox.co.il>

Adding log messages for matched criteria of the QoS policy rule.
Signed-off-by: Yevgeny Kliteynik
---
 opensm/opensm/osm_qos_policy.c |   18 +++++++++++++++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c
index 6c81872..185ccc0 100644
--- a/opensm/opensm/osm_qos_policy.c
+++ b/opensm/opensm/osm_qos_policy.c
@@ -624,6 +624,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
+		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"__qos_policy_get_match_rule_by_params: "
+			"Source port matched.\n");
 	}

 	/* If a match rule has Destination groups, PR request dest. has to be in this list */
@@ -637,6 +640,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
+		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"__qos_policy_get_match_rule_by_params: "
+			"Destination port matched.\n");
 	}

 	/* If a match rule has QoS classes, PR request HAS
@@ -655,7 +661,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
-
+		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"__qos_policy_get_match_rule_by_params: "
+			"QoS Class matched.\n");
 	}

 	/* If a match rule has Service IDs, PR request HAS
@@ -675,7 +683,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
-
+		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"__qos_policy_get_match_rule_by_params: "
+			"Service ID matched.\n");
 	}

 	/* If a match rule has PKeys, PR request HAS
@@ -694,7 +704,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
-
+		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"__qos_policy_get_match_rule_by_params: "
+			"PKey matched.\n");
 	}

 	/* if we got here, then this match-rule matched this PR request */
--
1.5.1.4

From tziporet at mellanox.co.il Thu May 1 05:30:33 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Thu, 1 May 2008 15:30:33 +0300
Subject: [ofa-general] Reminder: OFED 1.3.1-rc1 is planned for next week
Message-ID: <6C2C79E72C305246B504CBA17B5500C903EBFA77@mtlexch01.mtl.com>

Hi All,
OFED 1.3.1-rc1 is planned for Tuesday next week (May 6).
Please send all your patches/new packages (e.g. MPI) by the end of this week
so we will be able to integrate them and have rc1 on time.
Thanks,
Tziporet

From kliteyn at dev.mellanox.co.il Thu May 1 05:36:46 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 01 May 2008 15:36:46 +0300
Subject: [ofa-general] [PATCH] opensm/osm_qos_policy.c: log matched QoS criteria
In-Reply-To: <4819AE09.2020509@dev.mellanox.co.il>
References: <4819AE09.2020509@dev.mellanox.co.il>
Message-ID: <4819B95E.8080801@dev.mellanox.co.il>

Hi Sasha,

Please ignore this patch - it is using the old osm_log.
I'll repost v2 of this patch.

-- Yevgeny
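For readers comparing the v1 quoted below with the v2 reposted later in this
digest: the only change is the logging style. v1 calls the old osm_log() and
hard-codes the function name into every message, while the newer OSM_LOG()
helper presumably supplies the caller's name itself, roughly along these
lines (a hypothetical reconstruction, not the actual OpenSM definition):

/* v1: the function name is repeated by hand at each call site, e.g.
 *   osm_log(p_log, OSM_LOG_DEBUG,
 *           "__qos_policy_get_match_rule_by_params: Source port matched.\n");
 *
 * v2: a wrapper like this can prepend __func__ automatically, so the
 * call site shrinks to
 *   OSM_LOG(p_log, OSM_LOG_DEBUG, "Source port matched.\n");
 */
#define OSM_LOG(log, level, fmt, ...) \
	osm_log(log, level, "%s: " fmt, __func__, ## __VA_ARGS__)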
Yevgeny Kliteynik wrote:
> Adding log messages for matched criteria of
> the QoS policy rule.
>
> Signed-off-by: Yevgeny Kliteynik
> ---
>  opensm/opensm/osm_qos_policy.c |   18 +++++++++++++++---
>  1 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c
> index 6c81872..185ccc0 100644
> --- a/opensm/opensm/osm_qos_policy.c
> +++ b/opensm/opensm/osm_qos_policy.c
> @@ -624,6 +624,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
>  			list_iterator = cl_list_next(list_iterator);
>  			continue;
>  		}
> +		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
> +			"__qos_policy_get_match_rule_by_params: "
> +			"Source port matched.\n");
>  	}
>
>  	/* If a match rule has Destination groups, PR request dest. has to be in this list */
> @@ -637,6 +640,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
>  			list_iterator = cl_list_next(list_iterator);
>  			continue;
>  		}
> +		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
> +			"__qos_policy_get_match_rule_by_params: "
> +			"Destination port matched.\n");
>  	}
>
>  	/* If a match rule has QoS classes, PR request HAS
> @@ -655,7 +661,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
>  			list_iterator = cl_list_next(list_iterator);
>  			continue;
>  		}
> -
> +		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
> +			"__qos_policy_get_match_rule_by_params: "
> +			"QoS Class matched.\n");
>  	}
>
>  	/* If a match rule has Service IDs, PR request HAS
> @@ -675,7 +683,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
>  			list_iterator = cl_list_next(list_iterator);
>  			continue;
>  		}
> -
> +		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
> +			"__qos_policy_get_match_rule_by_params: "
> +			"Service ID matched.\n");
>  	}
>
>  	/* If a match rule has PKeys, PR request HAS
> @@ -694,7 +704,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
>  			list_iterator = cl_list_next(list_iterator);
>  			continue;
>  		}
> -
> +		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
> +			"__qos_policy_get_match_rule_by_params: "
> +			"PKey matched.\n");
>  	}
>
>  	/* if we got here, then this match-rule matched this PR request */

From monis at Voltaire.COM Thu May 1 05:48:27 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Thu, 01 May 2008 15:48:27 +0300
Subject: [ofa-general] Re: [PATCH] IB/core: handle race between elements in work queues after event
In-Reply-To: References: <48187E5A.7040809@Voltaire.COM>
Message-ID: <4819BC1B.1040909@Voltaire.COM>

Thanks for the comments. I'll resend soon.

From monis at Voltaire.COM Thu May 1 05:52:57 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Thu, 01 May 2008 15:52:57 +0300
Subject: [ofa-general] [PATCH] IB/core: handle race between elements in work queues after event
In-Reply-To: <48187E5A.7040809@Voltaire.COM>
References: <48187E5A.7040809@Voltaire.COM>
Message-ID: <4819BD29.7080002@Voltaire.COM>

I made some changes according to Or and Roland's comments.

This patch solves a race between work elements that are carried out after an
event occurs. When the SM address handle becomes invalid and needs an update,
it is handled by a work item in the global workqueue. On the other hand, this
event is also handled in ib_ipoib by queuing a work item in the
ipoib_workqueue that does the mcast join. Although queuing is in the right
order, it is done to two different workqueues, and so there is no guarantee
that the first to be queued is the first to be executed.
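A toy sketch of that ordering hazard (hypothetical names, not the actual
sa_query/ipoib code): both work items are queued back to back, but nothing
orders their execution because they sit on different workqueues.

#include <linux/workqueue.h>

/* assume both work items were set up with INIT_WORK() at init time */
static struct workqueue_struct *ipoib_wq;	/* stands in for ipoib_workqueue */
static struct work_struct update_ah_work;	/* would run update_sm_ah() */
static struct work_struct mcast_join_work;	/* would run the mcast join */

static void handle_port_event(void)
{
	/* queued first, on the global (keventd) workqueue */
	schedule_work(&update_ah_work);
	/* queued second, on ipoib's private workqueue */
	queue_work(ipoib_wq, &mcast_join_work);
	/*
	 * mcast_join_work may nevertheless run first and use the stale
	 * SM address handle -- the race the patch below closes by
	 * NULLing sm_ah and returning -EAGAIN until it is rebuilt.
	 */
}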
The patch sets the SM address handle to NULL, and until update_sm_ah() is
called, any request that needs sm_ah is rejected with -EAGAIN.

For consumers, the patch doesn't make things worse. Before the patch, MADs
were sent to the wrong SM; now they are blocked before they are sent.
Consumers can be improved if they examine the return code and respond to
EAGAIN properly, but even without an improvement the situation is not
getting worse, and in some cases it gets better. Specifically for this
issue:

* Callers of ib_sa_mcmember_rec_query() seem to handle the error returns
  properly, but without checking specifically for EAGAIN.
* Callers of ib_sa_path_rec_get() handle error returns, but not with a retry.
* I didn't find any caller of ib_sa_service_rec_query().

Signed-off-by: Moni Levy
Signed-off-by: Moni Shoua
---
 drivers/infiniband/core/sa_query.c |   26 ++++++++++++++++++++++----
 1 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index cf474ec..a2e61d7 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -413,9 +413,20 @@ static void ib_sa_event(struct ib_event_
 	    event->event == IB_EVENT_PKEY_CHANGE ||
 	    event->event == IB_EVENT_SM_CHANGE ||
 	    event->event == IB_EVENT_CLIENT_REREGISTER) {
-		struct ib_sa_device *sa_dev;
-		sa_dev = container_of(handler, typeof(*sa_dev), event_handler);
-
+		unsigned long flags;
+		struct ib_sa_device *sa_dev =
+			container_of(handler, typeof(*sa_dev), event_handler);
+		struct ib_sa_port *port =
+			&sa_dev->port[event->element.port_num - sa_dev->start_port];
+		struct ib_sa_sm_ah *sm_ah;
+
+		spin_lock_irqsave(&port->ah_lock, flags);
+		sm_ah = port->sm_ah;
+		port->sm_ah = NULL;
+		spin_unlock_irqrestore(&port->ah_lock, flags);
+
+		if (sm_ah)
+			kref_put(&sm_ah->ref, free_sm_ah);
 		schedule_work(&sa_dev->port[event->element.port_num -
 					    sa_dev->start_port].update_task);
 	}
@@ -663,6 +674,8 @@ int ib_sa_path_rec_get(struct ib_sa_clie
 		return -ENODEV;

 	port = &sa_dev->port[port_num - sa_dev->start_port];
+	if (!port->sm_ah)
+		return -EAGAIN;
 	agent = port->agent;

 	query = kmalloc(sizeof *query, gfp_mask);
@@ -780,6 +793,9 @@ int ib_sa_service_rec_query(struct ib_sa
 		return -ENODEV;

 	port = &sa_dev->port[port_num - sa_dev->start_port];
+	if (!port->sm_ah)
+		return -EAGAIN;
+
 	agent = port->agent;

 	if (method != IB_MGMT_METHOD_GET &&
@@ -877,8 +893,10 @@ int ib_sa_mcmember_rec_query(struct ib_s
 		return -ENODEV;

 	port = &sa_dev->port[port_num - sa_dev->start_port];
-	agent = port->agent;
+	if (!port->sm_ah)
+		return -EAGAIN;

+	agent = port->agent;
 	query = kmalloc(sizeof *query, gfp_mask);
 	if (!query)
 		return -ENOMEM;
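As a sketch of the consumer-side improvement suggested above (a hypothetical
wrapper, not part of this patch): a caller can treat -EAGAIN as "the SM
address handle is being refreshed" and simply retry a little later instead
of failing the operation.

#include <linux/errno.h>
#include <linux/jiffies.h>
#include <linux/workqueue.h>

/* set up once with INIT_DELAYED_WORK(&ctx->work, sa_retry_fn) */
struct sa_retry_ctx {
	struct delayed_work work;
	int (*issue_query)(void *priv);	/* wraps e.g. ib_sa_path_rec_get() */
	void *priv;
};

static void sa_retry_fn(struct work_struct *work)
{
	struct sa_retry_ctx *ctx =
		container_of(work, struct sa_retry_ctx, work.work);

	/* sm_ah still NULL: SM info is being refreshed, try again shortly */
	if (ctx->issue_query(ctx->priv) == -EAGAIN)
		schedule_delayed_work(&ctx->work, msecs_to_jiffies(100));
}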
From David.Shue.ctr at rl.af.mil Thu May 1 06:09:12 2008
From: David.Shue.ctr at rl.af.mil (Shue, David CTR USAF AFMC AFRL/RITB)
Date: Thu, 1 May 2008 09:09:12 -0400
Subject: [ofa-general] Infiniband Card Trouble
Message-ID:

Hello,

I have used the OFED-1.3 software to communicate with the current cards I
have. These cards come up as "MT23108" in the logs, and I am not sure who
the manufacturer is. I was able to program the cards, and even install
MPICH2 and run tests.

I have recently obtained new IB cards from HP, "HP PCI-X 2-port 4X Fabric
(HPC) Adapter"
http://h20000.www2.hp.com/bizsupport/TechSupport/Home.jsp?lang=en&cc=id&prodTypeId=12883&prodSeriesId=460713&lang=en&cc=id
and these cards do not work the same. The machine boots up fine with the
card in, and also shows this card as a Mellanox "MT23108", yet the two cards
are visibly different in every way. Is the MT23108 a certain platform for
IB? I am new to the entire IB technology.

This is the history of what I did.
1) Staged the machine with RHEL v5
2) Installed the IB card
3) Booted the machine up
4) Can see the card looking at "lspci" and "dmesg", but nothing in the
network area or under "ifconfig" (just like with the first cards)
5) I then installed the OFED-1.3 software to communicate with and configure
the card
6) When I go to start the card (instead of rebooting, but I have tried both
ways) with /etc/init.d/openib start, it all fails. I then look in the log
file and see a bunch of "unknown symbol..." and "disagrees..." messages for
all items: ib_uverbs, ib_umad, iw_cxgb3, ib_path, mlx_ib, and so on.
7) When I reboot, the machine reaches "UDEV" of the reboot stage, hangs for
a little bit, and then many errors show and the machine won't boot, unless I
take the card out. If I uninstall the OFED software, it will reboot fine
with the card still in.

The card from HP giving me problems does not appear to have any drivers for
it. It looks like HP supports it on Windows and HP-UX.

I'm looking for any help you can provide.
Thanks in advance,
Dave

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
David Shue
Systems Specialist
Computer Sciences Corporation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

From dorfman.eli at gmail.com Thu May 1 06:50:55 2008
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Thu, 1 May 2008 16:50:55 +0300
Subject: [ofa-general] Re: [Ips] Calculating the VA in iSER header
In-Reply-To: <20080429170516.GA8857@osc.edu>
References: <4804B03C.6060507@voltaire.com> <694d48600804160122l1cc97b8aka8986ee6deb7dec8@mail.gmail.com> <20080416144830.GC23861@osc.edu> <694d48600804170413g4d54cd9g447abd345a1f6301@mail.gmail.com> <20080429170516.GA8857@osc.edu>
Message-ID: <694d48600805010650l75c2f70bx662456ce85e7e9a5@mail.gmail.com>

On Tue, Apr 29, 2008 at 8:05 PM, Pete Wyckoff wrote:
> dorfman.eli at gmail.com wrote on Thu, 17 Apr 2008 14:13 +0300:
> > > On Wed, Apr 16, 2008 at 6:46 PM, Roland Dreier wrote:
> > > > Agree with the interpretation of the spec, and it's probably a bit
> > > > clearer that way too. But we have working initiators and targets
> > > > that do it the "wrong" way.
> > >
> > > Yes... I guess the key question is whether there are any initiators that
> > > do things the "right" way.
> > >
> > > > 1. Flag day: all initiators and targets change at the same time.
> > > > Will see data corruption if someone unluckily runs one or the other
> > > > using old non-fixed code.
> > >
> > > Seems unacceptable to me... it doesn't make sense at all to break every
> > > setup in the world just to be "right" according to the spec.
> >
> > This will break only when both initiator and target will use
> > InitialR2T=No, which means allow unsolicited data.
> > As far as I know, STGT is not very common (and its version in RHEL5.1
> > is considered experimental). Its default is also InitialR2T=Yes.
> > Voltaire's iSCSI over iSER target also uses default InitialR2T=Yes.
> > So it seems that nothing will break.
>
> I finally got a chance to look at this just now. I think you mean
> default is InitialR2T=No above, which means no unsolicited data.
> That is the default case, and true, the two different meanings
> of the initiator-supplied VA coincide.

InitialR2T=Yes means that R2T is required, hence no unsolicited data.
Only if both sides, initiator and target, agree on InitialR2T=No is the
first data burst unsolicited.

> But you missed the impact of immediate data. We run with the
> defaults (I think) that say the first write request packet should be
> filled with a bit of the coming data stream. From iscsid.conf:
>
> # To enable immediate data (i.e., the initiator sends unsolicited data
> # with the iSCSI command packet), uncomment the following line:
> #
> # The default is Yes
> node.session.iscsi.ImmediateData = Yes
>
> Looking at the offset printed out by your patch, it is indeed
> non-zero for the first RDMA read. Please correct me if I am
> mistaken about this---you must have tested all four variations of
> with and without the patches on initiator and target side, but I did
> not.

You are right about the ImmediateData=Yes. I really missed that, so this
patch will after all break the current target implementation and cause
data corruption.

I suggest postponing this patch until we implement the iSER HELLO message,
and then adding this patch together with the corresponding target patch.
This will allow the current initiator to work with the current target, and
a new initiator to work with a new target. I still think we should do that,
since future iSER implementations will probably rely on the spec.

> Hence I am still a bit unhappy about having to deal with the
> fallout, with no way to detect it. For our local use, I'll keep an
> older version of stgt in use until we switch to a new kernel, then
> merge up the target side change. It is a bother, but I can deal
> with it. For other institutions, this lockstep upgrade requirement
> will not be obvious until they debug the resulting data corruption.
>
> Still, I do understand why it would be nice to conform to the spec,
> and it is maybe a bit cleaner that way too. Maybe you can help with
> the bug reports on stgt-devel during the transition, and maintain
> and publish a patch to let it work with old kernels.
>
> -- Pete
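To make the two VA conventions in this thread concrete, here is a
hypothetical target-side helper (not stgt code): the "old" scheme trusts the
advertised VA, which the initiator has already advanced past the unsolicited
data, and keeps a private cursor; the spec-conformant scheme advertises the
start of the buffer and statelessly adds the R2T data offset to every read.

#include <stdint.h>

/* old convention: advertised VA already skips the unsolicited data */
static uint64_t read_addr_old(uint64_t advertised_va, uint64_t *cursor,
			      uint32_t len)
{
	uint64_t addr = advertised_va + *cursor;

	*cursor += len;		/* only works if segments are read in order */
	return addr;
}

/* spec convention: advertised VA is the start of the buffer */
static uint64_t read_addr_spec(uint64_t advertised_va, uint32_t r2t_data_offset)
{
	return advertised_va + r2t_data_offset;	/* stateless, any order */
}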
From kliteyn at dev.mellanox.co.il Thu May 1 07:11:08 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 01 May 2008 17:11:08 +0300
Subject: [ofa-general] [PATCH v2] opensm/osm_qos_policy.c: log matched QoS criteria
Message-ID: <4819CF7C.6040606@dev.mellanox.co.il>

Adding log messages for matched criteria of the QoS policy rule.

Signed-off-by: Yevgeny Kliteynik
---
 opensm/opensm/osm_qos_policy.c |   18 +++++++++++++++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c
index 6c81872..ebe3a7f 100644
--- a/opensm/opensm/osm_qos_policy.c
+++ b/opensm/opensm/osm_qos_policy.c
@@ -598,10 +598,13 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 {
 	osm_qos_match_rule_t *p_qos_match_rule = NULL;
 	cl_list_iterator_t list_iterator;
+	osm_log_t * p_log = &p_qos_policy->p_subn->p_osm->log;

 	if (!cl_list_count(&p_qos_policy->qos_match_rules))
 		return NULL;

+	OSM_LOG_ENTER(p_log);
+
 	/* Go over all QoS match rules and find the one
 	   that matches the request */

 	list_iterator = cl_list_head(&p_qos_policy->qos_match_rules);
@@ -624,6 +627,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
+		OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"Source port matched.\n");
 	}

 	/* If a match rule has Destination groups, PR request dest.
 	   has to be in this list */
@@ -637,6 +642,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
+		OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"Destination port matched.\n");
 	}

 	/* If a match rule has QoS classes, PR request HAS
@@ -655,7 +662,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
-
+		OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"QoS Class matched.\n");
 	}

 	/* If a match rule has Service IDs, PR request HAS
@@ -675,7 +683,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
-
+		OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"Service ID matched.\n");
 	}

 	/* If a match rule has PKeys, PR request HAS
@@ -694,13 +703,16 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
-
+		OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"PKey matched.\n");
 	}

 	/* if we got here, then this match-rule matched this PR request */

 	break;
 	}

+	OSM_LOG_EXIT(p_log);
+
 	if (list_iterator == cl_list_end(&p_qos_policy->qos_match_rules))
 		return NULL;
--
1.5.1.4

From tziporet at dev.mellanox.co.il Thu May 1 07:13:23 2008
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 01 May 2008 17:13:23 +0300
Subject: [ofa-general] Infiniband Card Trouble
In-Reply-To: References:
Message-ID: <4819D003.70508@mellanox.co.il>

Shue, David CTR USAF AFMC AFRL/RITB wrote:
>
> Hello,
>
> I have used the OFED-1.3 software to communicate with the current
> cards I have. These cards come up as “MT23108” in the logs, and I am
> not sure who the manufacturer is. I was able to program the cards,
> and even install MPICH2 and run tests.
>
> I have recently obtained new IB cards from HP “*HP PCI-X 2-port 4X
> Fabric (HPC) Adapter”
> http://h20000.www2.hp.com/bizsupport/TechSupport/Home.jsp?lang=en&cc=id&prodTypeId=12883&prodSeriesId=460713&lang=en&cc=id
>
> *and these cards do not work the same. The machine boots up fine with
> the card in, and shows the card as Mellanox “MT23108” also? The two
> cards are visibly different in every way. Is the MT23108 a certain
> platform for IB?
>
Yes, it is. They are Mellanox PCI-X cards.
Maybe you need to upgrade the FW for the new card.
You can get the new FW and burn it using the instructions on the Mellanox
web site:
http://www.mellanox.com/support/firmware_download.php
Your card is a dual-port InfiniHost PCI-X HCA card (Cougar Cub).

> This is the history of what I did.
>
> 1) Staged the machine RH EL v5
>
> 2) Install the IB card
>
> 3) Boot machine up
>
> 4) Can see the card looking at “lspci” and “dmesg” but nothing in the
> network area or under “ifconfig” (Just like with the first cards)
>
Can you send the output of lspci -vv?

> 5) I then install the OFED-1.3 software to communicate and configure
> the card
>
> 6) When I go to start the card (instead of reboot but have tried both
> ways) /etc/init.d/openib start, it all fails. I then look in the log
> file and see a bunch of “unknown symbol…” and “disagrees…” for all
> items of ib_uverbs, ib_umad, iw_cxgb3, ib_path, mlx_ib, and so on.
>
> 7) When I reboot, the machine reaches “UDEV” of the reboot stage,
> hangs for a little bit, and then many errors show and the machine
> won’t boot, unless I take the card out. If I uninstall the OFED
> software, it will reboot fine with the card still in.
> The card from HP giving me problems, does not appear to have any
> drivers for it. It looks like HP supports it to work on Windows, and
> HPUX.
>
What is the machine type you use? Is it IA64?

Tziporet

From dorfman.eli at gmail.com Thu May 1 07:18:45 2008
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Thu, 1 May 2008 17:18:45 +0300
Subject: [Ips] [Stgt-devel] [ofa-general] Re: Calculating the VA in iSER header
In-Reply-To: References: <4804B03C.6060507@voltaire.com> <694d48600804160122l1cc97b8aka8986ee6deb7dec8@mail.gmail.com> <20080416144830.GC23861@osc.edu> <694d48600804170413g4d54cd9g447abd345a1f6301@mail.gmail.com> <20080429170516.GA8857@osc.edu> <39C75744D164D948A170E9792AF8E7CAF60D50@exil.voltaire.com>
Message-ID: <694d48600805010718n7f02a30ev38d35c50e926d02@mail.gmail.com>

On Thu, May 1, 2008 at 11:24 AM, Ken Sandars wrote:
>
> >> [Ken] It appears the current Linux iSER initiator does not send the
> >> HELLO message when the connection transits to full feature phase. The
> >> stgt target also ignores this message (if it were to appear).
>
> [Ken] The IBTA document does not mention the HELLO/HELLOREPLY messages.
> Implementing this message exchange gives a distinction between the
> current implementations and those that will correctly calculate the
> write_va (as per Pete Wyckoff's option 3).

I agree.

> >> [Ken] Both of these implementations use a non-conformant iSER header
> >> (they add write_va and read_va fields, which incidentally do not
> >> appear to be used). Are these changes documented anywhere in the IB
> >> domain, or are these variations needed for another reason?
> >
> > [Erez] Take a look at the iSER for IB annex:
> > http://www.infinibandta.org/members/spec/Annex_iSER.PDF
>
> [Ken] Ouch. That link requires a username/password. Looks like it is only
> available to members of the InfiniBand Trade Association. Fortunately I
> gained access to it with username "open" and password "standard". ;-)
>
> [Ken] Neither of these implementations sends or examines the iSER CM
> REQ/REP message private data. The document doesn't define what action to
> take when this message is absent. Interestingly, when the target reports
> that "ZBVA shall be used for this connection" and "the target shall issue
> Send with Invalidate as needed", then it appears the iSER header
> specified in RFC 5046 should be used for control-type PDUs. Is there any
> plan to conform with the list of requirements for IBTA compliance?

At the moment these capabilities (ZBVA, Send with Invalidate) are not
supported in the driver, though they seem to be supported by the ConnectX
HCA. Hence, the iSER implementations do not send/examine them. This may be
added to the CM REQ/REP with the current defaults, but in order to use
these capabilities, code would need to be added to the HCA driver and iSER.

> Cheers,
> Ken
From pw at osc.edu Thu May 1 07:26:18 2008
From: pw at osc.edu (Pete Wyckoff)
Date: Thu, 1 May 2008 10:26:18 -0400
Subject: [ofa-general] Re: [Ips] Calculating the VA in iSER header
In-Reply-To: <694d48600805010650l75c2f70bx662456ce85e7e9a5@mail.gmail.com>
References: <4804B03C.6060507@voltaire.com> <694d48600804160122l1cc97b8aka8986ee6deb7dec8@mail.gmail.com> <20080416144830.GC23861@osc.edu> <694d48600804170413g4d54cd9g447abd345a1f6301@mail.gmail.com> <20080429170516.GA8857@osc.edu> <694d48600805010650l75c2f70bx662456ce85e7e9a5@mail.gmail.com>
Message-ID: <20080501142618.GA19304@osc.edu>

dorfman.eli at gmail.com wrote on Thu, 01 May 2008 16:50 +0300:
> InitialR2T=Yes means that R2T is required, hence no unsolicited data.
> Only if both sides, initiator and target, agree on InitialR2T=No is the
> first data burst unsolicited.

Thanks for the explanation. I keep getting that backwards.

> On Tue, Apr 29, 2008 at 8:05 PM, Pete Wyckoff wrote:
> > But you missed the impact of immediate data. We run with the
> > defaults (I think) that say the first write request packet should be
> > filled with a bit of the coming data stream. From iscsid.conf:
> >
> > # To enable immediate data (i.e., the initiator sends unsolicited data
> > # with the iSCSI command packet), uncomment the following line:
> > #
> > # The default is Yes
> > node.session.iscsi.ImmediateData = Yes
> >
> > Looking at the offset printed out by your patch, it is indeed
> > non-zero for the first RDMA read. Please correct me if I am
> > mistaken about this---you must have tested all four variations of
> > with and without the patches on initiator and target side, but I did
> > not.
>
> You are right about the ImmediateData=Yes. I really missed that, so this
> patch will after all break the current target implementation and cause
> data corruption.
> I suggest postponing this patch until we implement the iSER HELLO
> message, and then adding this patch together with the corresponding
> target patch. This will allow the current initiator to work with the
> current target, and a new initiator to work with a new target.
> I still think we should do that, since future iSER implementations will
> probably rely on the spec.

We might as well do the Hello message exchange anyway. As Ken points
out, the spec would approve. We could even use this opportunity to set
the IRD and ORD too, but I'm not sure exactly how that would work in IB
once the connection is up.

Here we're not proposing a new bit in the Hello message to indicate "VA
starts before unsol data"; rather, the lack of a Hello message indicates
an old initiator that gets the VA wrong. That will be easy to detect in
targets.

I wonder if, by supporting Hello, we could remove the use of private
data as specified in the IBTA annex? These negotiated parameters (ZBVA
and Send w/Inval) could be in the Hello exchange. We still have the
need to put VAs in the iSER header to support the non-ZBVA case
(pre-ConnectX IB), though.

Once we get the Hello worked out, it might be time to update RFC 5046
to encompass this hardware model too.

-- Pete
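A sketch of the detection Pete describes, with hypothetical field names (the
real fix would live in the target's iSER code): a target that has seen a
Hello from the initiator can use the spec interpretation directly, while the
absence of a Hello marks an old initiator whose advertised VA is already
advanced past the unsolicited data, assuming the target can recover that
length from the command.

#include <stdbool.h>
#include <stdint.h>

struct iser_conn_compat {
	bool saw_hello;	/* set when a Hello arrives at full feature phase */
};

static uint64_t write_base_va(const struct iser_conn_compat *c,
			      uint64_t advertised_va, uint32_t unsol_len)
{
	if (c->saw_hello)
		return advertised_va;		/* spec: start of buffer */
	return advertised_va - unsol_len;	/* old: undo the skew */
}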
From dorfman.eli at gmail.com Thu May 1 07:32:13 2008
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Thu, 1 May 2008 17:32:13 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iSER: Use offset from r2t header for rdma
In-Reply-To: <694d48600804270555i6ee55843x51c416294fec6397@mail.gmail.com>
References: <694d48600804270555i6ee55843x51c416294fec6397@mail.gmail.com>
Message-ID: <694d48600805010732p43bed1a7q75dd8d8512b275f2@mail.gmail.com>

On Sun, Apr 27, 2008 at 3:55 PM, Eli Dorfman wrote:
> Use offset from r2t header for rdma instead of using
> the internal offset counter.
>
> Signed-off-by: Eli Dorfman
> ---
>  usr/iscsi/iscsi_rdma.c |   16 +++++-----------
>  1 files changed, 5 insertions(+), 11 deletions(-)
>
> diff --git a/usr/iscsi/iscsi_rdma.c b/usr/iscsi/iscsi_rdma.c
> index d46ddff..84f5949 100644
> --- a/usr/iscsi/iscsi_rdma.c
> +++ b/usr/iscsi/iscsi_rdma.c
> @@ -1447,28 +1447,22 @@ static int iscsi_rdma_rdma_read(struct iscsi_connection *conn)
>  	struct iscsi_r2t_rsp *r2t = (struct iscsi_r2t_rsp *) &conn->rsp.bhs;
>  	uint8_t *buf;
>  	uint32_t len;
> +	uint32_t offset;
>  	int ret;
>
>  	buf = (uint8_t *) task->data + task->offset;
>  	len = be32_to_cpu(r2t->data_length);
> +	offset = be32_to_cpu(r2t->data_offset);
>
> -	dprintf("len %u stag %x va %llx\n",
> +	dprintf("len %u stag %x va %llx offset %x\n",
>  		len, itask->rem_write_stag,
> -		(unsigned long long) itask->rem_write_va);
> +		(unsigned long long) itask->rem_write_va, offset);
>
>  	ret = iser_post_rdma_wr(ci, task, buf, len, IBV_WR_RDMA_READ,
> -				itask->rem_write_va, itask->rem_write_stag);
> +				itask->rem_write_va + offset, itask->rem_write_stag);
>  	if (ret < 0)
>  		return ret;
>
> -	/*
> -	 * Initiator registers the entire buffer, but gives us a VA that
> -	 * is advanced by immediate + unsolicited data amounts. Advance
> -	 * rem_va as we read, knowing that the target always grabs segments
> -	 * in order.
> -	 */
> -	itask->rem_write_va += len;
> -
>  	return 0;
> }
>
> --
> 1.5.5

Please do not apply this patch until we decide how to sync this with the
initiator side. See the following discussion for details:
http://www.ietf.org/mail-archive/web/ips/current/msg02506.html

I tend to agree with Pete's option (3): implementing the iSER HELLO message
in the initiator and target, then adding this patch and the corresponding
initiator patch, so that we have:
Old initiator working with old target, AND
New initiator working with new target.

Eli

From dorfman.eli at gmail.com Thu May 1 07:35:48 2008
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Thu, 1 May 2008 17:35:48 +0300
Subject: [ofa-general] Re: [PATCH 1/2] IB/iSER: Do not add unsolicited data offset to VA in iSER header
In-Reply-To: <694d48600804270553u36b776ame9695a8858dd278@mail.gmail.com>
References: <694d48600804270553u36b776ame9695a8858dd278@mail.gmail.com>
Message-ID: <694d48600805010735k4836e955jabde51ddaf85d645@mail.gmail.com>

On Sun, Apr 27, 2008 at 3:53 PM, Eli Dorfman wrote:
> iSER initiator sends a VA (in the iSER header) which includes
> an offset for the unsolicited data (which is wrong according to the spec).
> > Signed-off-by: Eli Dorfman > Signed-off-by: Erez Zilber > --- > drivers/infiniband/ulp/iser/iser_initiator.c | 6 +++--- > 1 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c > b/drivers/infiniband/ulp/iser/iser_initiator.c > index 08dc81c..5c2bbc6 100644 > --- a/drivers/infiniband/ulp/iser/iser_initiator.c > +++ b/drivers/infiniband/ulp/iser/iser_initiator.c > @@ -154,12 +154,12 @@ iser_prepare_write_cmd(struct iscsi_cmd_task *ctask, > if (unsol_sz < edtl) { > hdr->flags |= ISER_WSV; > hdr->write_stag = cpu_to_be32(regd_buf->reg.rkey); > - hdr->write_va = cpu_to_be64(regd_buf->reg.va + unsol_sz); > + hdr->write_va = cpu_to_be64(regd_buf->reg.va); > > iser_dbg("Cmd itt:%d, WRITE tags, RKEY:%#.4X " > - "VA:%#llX + unsol:%d\n", > + "VA:%#llX\n", > ctask->itt, regd_buf->reg.rkey, > - (unsigned long long)regd_buf->reg.va, unsol_sz); > + (unsigned long long)regd_buf->reg.va); > } > > if (imm_sz > 0) { > -- > 1.5.5 > Please do not apply this patch until we decide how to sync this with the target side. Thanks, Eli From shemminger at vyatta.com Thu May 1 07:56:06 2008 From: shemminger at vyatta.com (Stephen Hemminger) Date: Thu, 1 May 2008 07:56:06 -0700 Subject: [ofa-general] Re: [PATCH 08/13] QLogic VNIC: sysfs interface implementation for the driver In-Reply-To: <20080430171955.31725.7771.stgit@localhost.localdomain> References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430171955.31725.7771.stgit@localhost.localdomain> Message-ID: <20080501075606.4963afa3@extreme> On Wed, 30 Apr 2008 22:49:55 +0530 Ramachandra K wrote: > From: Amar Mudrankit > > The sysfs interface for the QLogic VNIC driver is implemented through > this patch. > > Signed-off-by: Ramachandra K > Signed-off-by: Poornima Kamath > --- > > drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c | 1127 +++++++++++++++++++++++++++ > drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h | 62 + > 2 files changed, 1189 insertions(+), 0 deletions(-) > create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c > create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h > > diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c > new file mode 100644 > index 0000000..7e70b0c > --- /dev/null > +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c > @@ -0,0 +1,1127 @@ > +/* > + * Copyright (c) 2006 QLogic, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > + > +#include > +#include > +#include > + > +#include "vnic_util.h" > +#include "vnic_config.h" > +#include "vnic_ib.h" > +#include "vnic_viport.h" > +#include "vnic_main.h" > +#include "vnic_stats.h" > + > +/* > + * target eiocs are added by writing > + * > + * ioc_guid=,dgid=,pkey=,name= > + * to the create_primary sysfs attribute. > + */ > +enum { > + VNIC_OPT_ERR = 0, > + VNIC_OPT_IOC_GUID = 1 << 0, > + VNIC_OPT_DGID = 1 << 1, > + VNIC_OPT_PKEY = 1 << 2, > + VNIC_OPT_NAME = 1 << 3, > + VNIC_OPT_INSTANCE = 1 << 4, > + VNIC_OPT_RXCSUM = 1 << 5, > + VNIC_OPT_TXCSUM = 1 << 6, > + VNIC_OPT_HEARTBEAT = 1 << 7, > + VNIC_OPT_IOC_STRING = 1 << 8, > + VNIC_OPT_IB_MULTICAST = 1 << 9, > + VNIC_OPT_ALL = (VNIC_OPT_IOC_GUID | > + VNIC_OPT_DGID | VNIC_OPT_NAME | VNIC_OPT_PKEY), > +}; > + > +static match_table_t vnic_opt_tokens = { > + {VNIC_OPT_IOC_GUID, "ioc_guid=%s"}, > + {VNIC_OPT_DGID, "dgid=%s"}, > + {VNIC_OPT_PKEY, "pkey=%x"}, > + {VNIC_OPT_NAME, "name=%s"}, > + {VNIC_OPT_INSTANCE, "instance=%d"}, > + {VNIC_OPT_RXCSUM, "rx_csum=%s"}, > + {VNIC_OPT_TXCSUM, "tx_csum=%s"}, > + {VNIC_OPT_HEARTBEAT, "heartbeat=%d"}, > + {VNIC_OPT_IOC_STRING, "ioc_string=\"%s"}, > + {VNIC_OPT_IB_MULTICAST, "ib_multicast=%s"}, > + {VNIC_OPT_ERR, NULL} > +}; > NO 1. Most of this shouldn't be done via sysfs (rx_csum, tx_csum, ...) 2. Sysfs is one value per file not name=value From shemminger at vyatta.com Thu May 1 07:58:16 2008 From: shemminger at vyatta.com (Stephen Hemminger) Date: Thu, 1 May 2008 07:58:16 -0700 Subject: [ofa-general] Re: [PATCH 11/13] QLogic VNIC: Driver utility file - implements various utility macros In-Reply-To: <20080430172126.31725.48554.stgit@localhost.localdomain> References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430172126.31725.48554.stgit@localhost.localdomain> Message-ID: <20080501075816.1010ec3a@extreme> On Wed, 30 Apr 2008 22:51:26 +0530 Ramachandra K wrote: > From: Poornima Kamath > > This patch adds the driver utility file which mainly contains utility > macros for debugging of QLogic VNIC driver. > > Signed-off-by: Ramachandra K > Signed-off-by: Amar Mudrankit > --- > > drivers/infiniband/ulp/qlgc_vnic/vnic_util.h | 251 ++++++++++++++++++++++++++ > 1 files changed, 251 insertions(+), 0 deletions(-) > create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_util.h > > diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_util.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_util.h > new file mode 100644 > index 0000000..4d7d540 > --- /dev/null > +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_util.h > @@ -0,0 +1,251 @@ > +/* > + * Copyright (c) 2006 QLogic, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. 
You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > + > +#ifndef VNIC_UTIL_H_INCLUDED > +#define VNIC_UTIL_H_INCLUDED > + > +#define MODULE_NAME "QLGC_VNIC" > + > +#define VNIC_MAJORVERSION 1 > +#define VNIC_MINORVERSION 1 > + > +#define is_power_of2(value) (((value) & ((value - 1))) == 0) > +#define ALIGN_DOWN(x, a) ((x)&(~((a)-1))) In kernel.h already > +extern u32 vnic_debug; Use msg level macros instead? > + > +enum { > + DEBUG_IB_INFO = 0x00000001, > + DEBUG_IB_FUNCTION = 0x00000002, > + DEBUG_IB_FSTATUS = 0x00000004, > + DEBUG_IB_ASSERTS = 0x00000008, > + DEBUG_CONTROL_INFO = 0x00000010, > + DEBUG_CONTROL_FUNCTION = 0x00000020, > + DEBUG_CONTROL_PACKET = 0x00000040, > + DEBUG_CONFIG_INFO = 0x00000100, > + DEBUG_DATA_INFO = 0x00001000, > + DEBUG_DATA_FUNCTION = 0x00002000, > + DEBUG_NETPATH_INFO = 0x00010000, > + DEBUG_VIPORT_INFO = 0x00100000, > + DEBUG_VIPORT_FUNCTION = 0x00200000, > + DEBUG_LINK_STATE = 0x00400000, > + DEBUG_VNIC_INFO = 0x01000000, > + DEBUG_VNIC_FUNCTION = 0x02000000, > + DEBUG_MCAST_INFO = 0x04000000, > + DEBUG_MCAST_FUNCTION = 0x08000000, > + DEBUG_SYS_INFO = 0x10000000, > + DEBUG_SYS_VERBOSE = 0x40000000 > +}; > + > +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_DEBUG > +#define PRINT(level, x, fmt, arg...) \ > + printk(level "%s: %s: %s, line %d: " fmt, \ > + MODULE_NAME, x, __FILE__, __LINE__, ##arg) > + > +#define PRINT_CONDITIONAL(level, x, condition, fmt, arg...) \ > + do { \ > + if (condition) \ > + printk(level "%s: %s: %s, line %d: " fmt, \ > + MODULE_NAME, x, __FILE__, __LINE__, \ > + ##arg); \ > + } while (0) > +#else > +#define PRINT(level, x, fmt, arg...) \ > + printk(level "%s: " fmt, MODULE_NAME, ##arg) > + > +#define PRINT_CONDITIONAL(level, x, condition, fmt, arg...) \ > + do { \ > + if (condition) \ > + printk(level "%s: %s: " fmt, \ > + MODULE_NAME, x, ##arg); \ > + } while (0) > +#endif /*CONFIG_INFINIBAND_QLGC_VNIC_DEBUG*/ > + > +#define IB_PRINT(fmt, arg...) \ > + PRINT(KERN_INFO, "IB", fmt, ##arg) > +#define IB_ERROR(fmt, arg...) \ > + PRINT(KERN_ERR, "IB", fmt, ##arg) > + > +#define IB_FUNCTION(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "IB", \ > + (vnic_debug & DEBUG_IB_FUNCTION), \ > + fmt, ##arg) > + > +#define IB_INFO(fmt, arg...) 
\ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "IB", \ > + (vnic_debug & DEBUG_IB_INFO), \ > + fmt, ##arg) > + > +#define IB_ASSERT(x) \ > + do { \ > + if ((vnic_debug & DEBUG_IB_ASSERTS) && !(x)) \ > + panic("%s assertion failed, file: %s," \ > + " line %d: ", \ > + MODULE_NAME, __FILE__, __LINE__) \ > + } while (0) > + > +#define CONTROL_PRINT(fmt, arg...) \ > + PRINT(KERN_INFO, "CONTROL", fmt, ##arg) > +#define CONTROL_ERROR(fmt, arg...) \ > + PRINT(KERN_ERR, "CONTROL", fmt, ##arg) > + > +#define CONTROL_INFO(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "CONTROL", \ > + (vnic_debug & DEBUG_CONTROL_INFO), \ > + fmt, ##arg) > + > +#define CONTROL_FUNCTION(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "CONTROL", \ > + (vnic_debug & DEBUG_CONTROL_FUNCTION), \ > + fmt, ##arg) > + > +#define CONTROL_PACKET(pkt) \ > + do { \ > + if (vnic_debug & DEBUG_CONTROL_PACKET) \ > + control_log_control_packet(pkt); \ > + } while (0) > + > +#define CONFIG_PRINT(fmt, arg...) \ > + PRINT(KERN_INFO, "CONFIG", fmt, ##arg) > +#define CONFIG_ERROR(fmt, arg...) \ > + PRINT(KERN_ERR, "CONFIG", fmt, ##arg) > + > +#define CONFIG_INFO(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "CONFIG", \ > + (vnic_debug & DEBUG_CONFIG_INFO), \ > + fmt, ##arg) > + > +#define DATA_PRINT(fmt, arg...) \ > + PRINT(KERN_INFO, "DATA", fmt, ##arg) > +#define DATA_ERROR(fmt, arg...) \ > + PRINT(KERN_ERR, "DATA", fmt, ##arg) > + > +#define DATA_INFO(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "DATA", \ > + (vnic_debug & DEBUG_DATA_INFO), \ > + fmt, ##arg) > + > +#define DATA_FUNCTION(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "DATA", \ > + (vnic_debug & DEBUG_DATA_FUNCTION), \ > + fmt, ##arg) > + > + > +#define MCAST_PRINT(fmt, arg...) \ > + PRINT(KERN_INFO, "MCAST", fmt, ##arg) > +#define MCAST_ERROR(fmt, arg...) \ > + PRINT(KERN_ERR, "MCAST", fmt, ##arg) > + > +#define MCAST_INFO(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "MCAST", \ > + (vnic_debug & DEBUG_MCAST_INFO), \ > + fmt, ##arg) > + > +#define MCAST_FUNCTION(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "MCAST", \ > + (vnic_debug & DEBUG_MCAST_FUNCTION), \ > + fmt, ##arg) > + > +#define NETPATH_PRINT(fmt, arg...) \ > + PRINT(KERN_INFO, "NETPATH", fmt, ##arg) > +#define NETPATH_ERROR(fmt, arg...) \ > + PRINT(KERN_ERR, "NETPATH", fmt, ##arg) > + > +#define NETPATH_INFO(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "NETPATH", \ > + (vnic_debug & DEBUG_NETPATH_INFO), \ > + fmt, ##arg) > + > +#define VIPORT_PRINT(fmt, arg...) \ > + PRINT(KERN_INFO, "VIPORT", fmt, ##arg) > +#define VIPORT_ERROR(fmt, arg...) \ > + PRINT(KERN_ERR, "VIPORT", fmt, ##arg) > + > +#define VIPORT_INFO(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "VIPORT", \ > + (vnic_debug & DEBUG_VIPORT_INFO), \ > + fmt, ##arg) > + > +#define VIPORT_FUNCTION(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "VIPORT", \ > + (vnic_debug & DEBUG_VIPORT_FUNCTION), \ > + fmt, ##arg) > + > +#define LINK_STATE(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "LINK", \ > + (vnic_debug & DEBUG_LINK_STATE), \ > + fmt, ##arg) > + > +#define VNIC_PRINT(fmt, arg...) \ > + PRINT(KERN_INFO, "NIC", fmt, ##arg) > +#define VNIC_ERROR(fmt, arg...) \ > + PRINT(KERN_ERR, "NIC", fmt, ##arg) > +#define VNIC_INIT(fmt, arg...) \ > + PRINT(KERN_INFO, "NIC", fmt, ##arg) > + > +#define VNIC_INFO(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "NIC", \ > + (vnic_debug & DEBUG_VNIC_INFO), \ > + fmt, ##arg) > + > +#define VNIC_FUNCTION(fmt, arg...) 
\ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "NIC", \ > + (vnic_debug & DEBUG_VNIC_FUNCTION), \ > + fmt, ##arg) > + > +#define SYS_PRINT(fmt, arg...) \ > + PRINT(KERN_INFO, "SYS", fmt, ##arg) > +#define SYS_ERROR(fmt, arg...) \ > + PRINT(KERN_ERR, "SYS", fmt, ##arg) > + > +#define SYS_INFO(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "SYS", \ > + (vnic_debug & DEBUG_SYS_INFO), \ > + fmt, ##arg) > + > +#endif /* VNIC_UTIL_H_INCLUDED */ Many of these are already in standard macros pr_info, pr_err etc. From rdreier at cisco.com Thu May 1 08:07:18 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 01 May 2008 08:07:18 -0700 Subject: [ofa-general] [PATCH] IB/ipoib: fix net queue lockup In-Reply-To: <1209625285.1790.24.camel@mtls03> (Eli Cohen's message of "Thu, 01 May 2008 10:01:25 +0300") References: <1209577156.1790.11.camel@mtls03> <1209625285.1790.24.camel@mtls03> Message-ID: > I agree, but I want to have a larger buffer to absorb larger picks. For > example, after applying this patch I tested how many times the net queue > is stopped and woken up when running four streams of netperf, udp, small > packets. When using the default 64 tx queue size it happened 500 times. > When I used a 256 tx queue size it happened only 37 times. This makes me > think that we have larger picks that a larger queue size can help > handle. OK, that makes sense -- although did you see any performance difference? > Also looking for example on Broadcom bnx2 driver on my machine, it uses > a 1000 tx queue len. Isn't that the software queue above the hardware? (That's what txqueuelen in ifconfig is reporting) From rdreier at cisco.com Thu May 1 08:08:56 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 01 May 2008 08:08:56 -0700 Subject: [ofa-general] Infiniband Card Trouble In-Reply-To: (David Shue's message of "Thu, 1 May 2008 09:09:12 -0400") References: Message-ID: > 7) When I reboot, the machine reaches "UDEV" of the reboot stage, > hangs for a little bit, and then many errors show and the machine won't > boot, unless I take the card out. If I uninstall the OFED software, it > will reboot fine with the card still in. The card from HP giving me > problems, does not appear to have any drivers for it. It looks like HP > supports it to work on Windows, and HPUX. What are the errors? This is too vague to solve without the actual console output. From michael.heinz at qlogic.com Thu May 1 08:20:31 2008 From: michael.heinz at qlogic.com (Mike Heinz) Date: Thu, 1 May 2008 10:20:31 -0500 Subject: [ofa-general] Infiniband Card Trouble In-Reply-To: References: Message-ID: #6 makes it sound like it's an ofed installation issue rather than the HCA itself. Could you post the relevant /var/log/messages? Messages from ib_mthca would be especially important. In addition, the output from mstflint -d q could also be useful. -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Shue, David CTR USAF AFMC AFRL/RITB Sent: Thursday, May 01, 2008 9:09 AM To: general at lists.openfabrics.org Subject: [ofa-general] Infiniband Card Trouble Hello, I have used the OFED-1.3 software to communicate with the current cards I have. These cards come up as "MT23108" in the logs, and I am not sure whom the manufacturer is. I was able to program the cards, and even install MPICH2 and run tests. 
I have recently obtained new IB cards from HP, "HP PCI-X 2-port 4X Fabric
(HPC) Adapter"
http://h20000.www2.hp.com/bizsupport/TechSupport/Home.jsp?lang=en&cc=id&prodTypeId=12883&prodSeriesId=460713&lang=en&cc=id
and these cards do not work the same. The machine boots up fine with the
card in, and also shows this card as a Mellanox "MT23108", yet the two cards
are visibly different in every way. Is the MT23108 a certain platform for
IB? I am new to the entire IB technology.

This is the history of what I did.
1) Staged the machine with RHEL v5
2) Installed the IB card
3) Booted the machine up
4) Can see the card looking at "lspci" and "dmesg", but nothing in the
network area or under "ifconfig" (just like with the first cards)
5) I then installed the OFED-1.3 software to communicate with and configure
the card
6) When I go to start the card (instead of rebooting, but I have tried both
ways) with /etc/init.d/openib start, it all fails. I then look in the log
file and see a bunch of "unknown symbol..." and "disagrees..." messages for
all items: ib_uverbs, ib_umad, iw_cxgb3, ib_path, mlx_ib, and so on.
7) When I reboot, the machine reaches "UDEV" of the reboot stage, hangs for
a little bit, and then many errors show and the machine won't boot, unless I
take the card out. If I uninstall the OFED software, it will reboot fine
with the card still in.

The card from HP giving me problems does not appear to have any drivers for
it. It looks like HP supports it on Windows and HP-UX.

I'm looking for any help you can provide.
Thanks in advance,
Dave

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
David Shue
Systems Specialist
Computer Sciences Corporation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

From eli at dev.mellanox.co.il Thu May 1 08:37:06 2008
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Thu, 01 May 2008 18:37:06 +0300
Subject: [ofa-general] [PATCH] IB/ipoib: fix net queue lockup
In-Reply-To: References: <1209577156.1790.11.camel@mtls03> <1209625285.1790.24.camel@mtls03>
Message-ID: <1209656226.1790.39.camel@mtls03>

On Thu, 2008-05-01 at 08:07 -0700, Roland Dreier wrote:
> OK, that makes sense -- although did you see any performance difference?

Yes. With four streams on a 4-core machine, the senders sum up to:
898 * 10^6 bits/sec @ 256 tx queue length
756 * 10^6 bits/sec @ 64 tx queue length

> > Also, looking for example at the Broadcom bnx2 driver on my machine, it
> > uses a 1000 tx queue len.
>
> Isn't that the software queue above the hardware? (That's what
> txqueuelen in ifconfig is reporting)

I wrongly assumed this was the hardware queue size. Now I looked at the
driver code, and if I did not make any mistake in the calculations, the
hardware queue size is... 256.
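The stop/wake events Eli is counting come from the usual tx-ring pattern,
sketched here for a hypothetical driver (not the actual ipoib code): the
ring is the "shock absorber", and every time it fills, the stack pays a
stop/wake round trip.

#include <linux/netdevice.h>

#define TX_RING_SIZE 256	/* the queue length under discussion */

struct toy_priv {
	unsigned int tx_head, tx_tail;	/* producer/consumer counters */
};

static int toy_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct toy_priv *p = netdev_priv(dev);

	/* ... post skb into ring slot tx_head % TX_RING_SIZE ... */
	if (++p->tx_head - p->tx_tail == TX_RING_SIZE)
		netif_stop_queue(dev);	/* ring full: stop before overflow */
	return NETDEV_TX_OK;
}

static void toy_tx_complete(struct net_device *dev)
{
	struct toy_priv *p = netdev_priv(dev);

	p->tx_tail++;			/* one send completion reaped */
	if (netif_queue_stopped(dev) &&
	    p->tx_head - p->tx_tail <= TX_RING_SIZE / 2)
		netif_wake_queue(dev);	/* hysteresis: wake at half-empty */
}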
From ramachandra.kuchimanchi at qlogic.com Thu May 1 09:02:14 2008
From: ramachandra.kuchimanchi at qlogic.com (Kuchimanchi, Ramachandra (Contractor - ))
Date: Thu, 1 May 2008 11:02:14 -0500
Subject: [ofa-general] RE: [PATCH 08/13] QLogic VNIC: sysfs interface implementation for the driver
References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430171955.31725.7771.stgit@localhost.localdomain> <20080501075606.4963afa3@extreme>
Message-ID:

Stephen,

Stephen Hemminger [mailto:shemminger at vyatta.com] wrote:

>> On Wed, 30 Apr 2008 22:49:55 +0530
>> Ramachandra K wrote:

>> +static match_table_t vnic_opt_tokens = {
>> +	{VNIC_OPT_IOC_GUID, "ioc_guid=%s"},
>> +	{VNIC_OPT_DGID, "dgid=%s"},
>> +	{VNIC_OPT_PKEY, "pkey=%x"},
>> +	{VNIC_OPT_NAME, "name=%s"},
>> +	{VNIC_OPT_INSTANCE, "instance=%d"},
>> +	{VNIC_OPT_RXCSUM, "rx_csum=%s"},
>> +	{VNIC_OPT_TXCSUM, "tx_csum=%s"},
>> +	{VNIC_OPT_HEARTBEAT, "heartbeat=%d"},
>> +	{VNIC_OPT_IOC_STRING, "ioc_string=\"%s"},
>> +	{VNIC_OPT_IB_MULTICAST, "ib_multicast=%s"},
>> +	{VNIC_OPT_ERR, NULL}
>> +};

> NO
> 1. Most of this shouldn't be done via sysfs (rx_csum, tx_csum, ...)
> 2. Sysfs is one value per file not name=value

The VNIC driver needs multiple parameters (IOC GUID, DGID, etc.) from user
space to connect to the EVIC. For this, the "name=value" mechanism is used
for a write-only sysfs file as an input method to the driver.

The driver follows the one-value-per-file sysfs rule when it returns any
data, with each readable file returning only a single value.

Regards,
Ram

From ramachandra.kuchimanchi at qlogic.com Thu May 1 09:18:53 2008
From: ramachandra.kuchimanchi at qlogic.com (Kuchimanchi, Ramachandra (Contractor - ))
Date: Thu, 1 May 2008 11:18:53 -0500
Subject: [ofa-general] RE: [PATCH 11/13] QLogic VNIC: Driver utility file - implements various utility macros
References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430172126.31725.48554.stgit@localhost.localdomain> <20080501075816.1010ec3a@extreme>
Message-ID:

Stephen,

Stephen Hemminger [mailto:shemminger at vyatta.com] wrote:

> Ramachandra K wrote:

>> +#define is_power_of2(value) (((value) & ((value - 1))) == 0)
>> +#define ALIGN_DOWN(x, a) ((x)&(~((a)-1)))

> In kernel.h already

Will fix this. Thanks.

>> +extern u32 vnic_debug;

> Use msg level macros instead?

I am sorry, I did not understand this comment.

> +
> +#define SYS_INFO(fmt, arg...) \
> +	PRINT_CONDITIONAL(KERN_INFO, \
> +			  "SYS", \
> +			  (vnic_debug & DEBUG_SYS_INFO), \
> +			  fmt, ##arg)
> +
> +#endif /* VNIC_UTIL_H_INCLUDED */

> Many of these are already in standard macros pr_info, pr_err etc.

These macros are for providing a debug log level functionality through
the vnic_debug module parameter.

Regards,
Ram
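For context, "msg level macros" presumably refers to the stock netif_msg_*
infrastructure in netdevice.h, which gives the same runtime filtering as
vnic_debug without driver-private PRINT wrappers. A minimal sketch with
hypothetical driver names, not the VNIC code:

#include <linux/module.h>
#include <linux/netdevice.h>

static int debug = -1;			/* -1 lets netif_msg_init() choose */
module_param(debug, int, 0);
MODULE_PARM_DESC(debug, "message level bitmap (see NETIF_MSG_*)");

struct toyvnic_priv {
	u32 msg_enable;			/* tested by the netif_msg_*() macros */
	struct net_device *netdev;
};

static void toyvnic_init_msg(struct toyvnic_priv *priv)
{
	priv->msg_enable = netif_msg_init(debug,
					  NETIF_MSG_DRV | NETIF_MSG_LINK);

	if (netif_msg_link(priv))	/* would replace e.g. LINK_STATE(...) */
		printk(KERN_INFO "%s: link is up\n", priv->netdev->name);
}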
On Thu, May 1, 2008 at 9:32 PM,  wrote:
>
> Stephen,
>
> Stephen Hemminger [mailto:shemminger at vyatta.com] wrote:
>
> >> On Wed, 30 Apr 2008 22:49:55 +0530
> >> Ramachandra K wrote:
>
> >> +static match_table_t vnic_opt_tokens = {
> >> +	{VNIC_OPT_IOC_GUID, "ioc_guid=%s"},
> >> +	{VNIC_OPT_DGID, "dgid=%s"},
> >> +	{VNIC_OPT_PKEY, "pkey=%x"},
> >> +	{VNIC_OPT_NAME, "name=%s"},
> >> +	{VNIC_OPT_INSTANCE, "instance=%d"},
> >> +	{VNIC_OPT_RXCSUM, "rx_csum=%s"},
> >> +	{VNIC_OPT_TXCSUM, "tx_csum=%s"},
> >> +	{VNIC_OPT_HEARTBEAT, "heartbeat=%d"},
> >> +	{VNIC_OPT_IOC_STRING, "ioc_string=\"%s"},
> >> +	{VNIC_OPT_IB_MULTICAST, "ib_multicast=%s"},
> >> +	{VNIC_OPT_ERR, NULL}
> >> +};
> >>
>
> > NO
> > 1. Most of this shouldn't be done via sysfs (rx_csum, tx_csum, ...)
> > 2. Sysfs is one value per file not name=value
>
> The VNIC driver needs multiple parameters (IOC GUID, DGID, etc.) from
> user space to connect to the EVIC. For this, the "name=value"
> mechanism is used on a single write-only sysfs file as the input
> method to the driver.
>
> The driver does follow the one-value-per-file sysfs rule whenever it
> returns data: each readable file returns only a single value.
>
> Regards,
> Ram

From ramachandra.kuchimanchi at qlogic.com  Thu May  1 10:01:10 2008
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Thu, 1 May 2008 22:31:10 +0530
Subject: [ofa-general] RE: [PATCH 11/13] QLogic VNIC: Driver utility file -
	implements various utility macros
In-Reply-To: 
References: <20080430171028.31725.86190.stgit@localhost.localdomain>
	<20080430172126.31725.48554.stgit@localhost.localdomain>
	<20080501075816.1010ec3a@extreme>
Message-ID: <71d336490805011001he7359dfw831470986d87b385@mail.gmail.com>

Sorry for the resend. Original mail bounced from netdev.

On Thu, May 1, 2008 at 9:48 PM  wrote:
> Stephen,
>
> Stephen Hemminger [mailto:shemminger at vyatta.com] wrote:
>
> > Ramachandra K wrote:
>
> >> +#define is_power_of2(value) (((value) & ((value - 1))) == 0)
> >> +#define ALIGN_DOWN(x, a) ((x)&(~((a)-1)))
>
> > In kernel.h already
>
> Will fix this. Thanks.
>
> >> +extern u32 vnic_debug;
>
> > Use msg level macros instead?
>
> I am sorry, I did not understand this comment.
>
> > +
> > +#define SYS_INFO(fmt, arg...) \
> > +	PRINT_CONDITIONAL(KERN_INFO, \
> > +			  "SYS", \
> > +			  (vnic_debug & DEBUG_SYS_INFO), \
> > +			  fmt, ##arg)
> > +
> > +#endif	/* VNIC_UTIL_H_INCLUDED */
>
> > Many of these are already in standard macros pr_info, pr_err etc.
>
> These macros provide a debug log level facility controlled through the
> vnic_debug module parameter.
>
> Regards,
> Ram

From David.Shue.ctr at rl.af.mil  Thu May  1 10:09:28 2008
From: David.Shue.ctr at rl.af.mil (Shue, David CTR USAF AFMC AFRL/RITB)
Date: Thu, 1 May 2008 13:09:28 -0400
Subject: [ofa-general] RE: HP PCI-X 2-port 4X Fabric (HPC) Adapter
In-Reply-To: <48188364.9030809@systemfabricworks.com>
References: <00d101c8aac6$1e06ae60$6401a8c0@YOURCB10AA3FFD>
	<48188364.9030809@systemfabricworks.com>
Message-ID: 

ALL:

I appreciate everyone's help. This is where I stand; I am putting the
info inline in the email in case the firewall decides to block an
attachment.

I wanted to update the FW as some have suggested. I tried using both
"flint" and "mstflint", and both give the same output, shown below,
which does not report any PSID, so I cannot pick out the correct
Mellanox FW update. I am also including the dmesg, lspci, and mst
status information.

****flint****

./flint -d /dev/mst/mt23108_pci_cr0 query
Image type:      Failsafe
I.S. Version:    1
Chip Revision:   A1
Description:     Node             Port1            Port2            Sys image
GUIDs:           0008f10403972174 0008f10403972175 0008f10403972176 0008f10403972177
Board ID:
VSD:
PSID:

***** MST STATUS *****

mst status
MST modules:
------------
MST PCI module loaded
MST PCI configuration module loaded
MST Calibre (I2C) module is not loaded

MST devices:
------------
/dev/mst/mt23108_pciconf0  - PCI configuration cycles access.
                             bus:dev.fn=09:01.0 addr.reg=88 data.reg=92
                             Chip revision is: A1
/dev/mst/mt23108_pci_cr0   - PCI direct access.
                             bus:dev.fn=0a:00.0 bar=0xd8000000 size=0x100000
                             Chip revision is: A1
/dev/mst/mt23108_pci_ddr0  - PCI direct access.
                             bus:dev.fn=0a:00.0 bar=0xc0000000 size=0x8000000
/dev/mst/mt23108_pci_uar0  - PCI direct access.
                             bus:dev.fn=0a:00.0 bar=0xc8000000 size=0x800000

****DMESG****

ib_mthca 0000:0a:00.0: HCA FW version 3.3.2 is old (3.4.0 is current).
ib_mthca 0000:0a:00.0: If you have problems, try updating your HCA FW.

****LSPCI****

09:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1)
0a:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)

The original card (manufactured in Israel), which works under OFED,
returned "HP_0040000001" when I ran "ibv_devinfo". These boards were
ordered at different times, but the same part number was used. The one
that does not work is clearly labeled "HP" and says it was manufactured
in the USA. The IB card that works does not clearly state a
manufacturer, but it would appear to be HP, seeing that the board_id is
"HP_0040000001".

Does anyone have any ideas where I should go with all this at this
point? If you require any further logs, please let me know. I
appreciate your help GREATLY.

-Dave

-----Original Message-----
From: David McMillen [mailto:davem at systemfabricworks.com]
Sent: Wednesday, April 30, 2008 10:34 AM
To: Shue, David CTR USAF AFMC AFRL/RITB
Subject: Re: HP PCI-X 2-port 4X Fabric (HPC) Adapter

________________________________

	From: Shue, David CTR USAF AFMC AFRL/RITB
[mailto:David.Shue.ctr at rl.af.mil]
	Sent: Wednesday, April 30, 2008 6:17 AM
	To: membership at openfabrics.org
	Subject: HP PCI-X 2-port 4X Fabric (HPC) Adapter

	I have used the OFED-1.3 software to communicate to the Mellanox
HPC I use. However, the OFED-1.3 does not appear to work with the
subject HPC card. The card is an HPC 380299-B21. Is there any
information you may provide in how to communicate to this card? Thank
you.

	>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	David Shue
	Systems Specialist
	Computer Sciences Corporation
	<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

This appears to be an older PCI-X card, and it may have a firmware
level that is too old to be supported. There may be some information
printed by the driver on startup, so look at dmesg or
/var/log/messages.

According to the OFED 1.3 release notes, the firmware level needs to
be 3.5.000, which happens to be the latest released by Mellanox.

Does the ibv_devinfo command give any output? If so, it will show the
firmware level. Otherwise, perhaps the tvflash -i command will tell
you. If that does not work, please send me the output of both "lspci"
and "lspci -n" and I will see if there is any obvious reason from the
PCI identification.

You can get new firmware from Mellanox at
http://www.mellanox.com/support/firmware_table_IH.php

Be very careful to match up the PSID of your existing card with the
firmware, as there are enough differences between the models that the
wrong firmware might render your card useless.

Let me know how this works out for you, or if you need more information.
Dave McMillen

From arlin.r.davis at intel.com  Thu May  1 10:52:08 2008
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Thu, 1 May 2008 10:52:08 -0700
Subject: [ofa-general] [ANNOUNCE] dapl-1.2.6 and dapl-2.0.8 release
Message-ID: <000401c8abb4$1334a7a0$f9b7020a@amr.corp.intel.com>

New releases for uDAPL v1 (1.2.6) and v2 (2.0.8) are available at:

http://www.openfabrics.org/downloads/dapl

md5sum: 752ae54a93b4883c88b41241f52db4ab  dapl-1.2.6.tar.gz
md5sum: a48f9da59318c395bcc6ad170226764a  dapl-2.0.8.tar.gz

Vlad, please pull into OFED 1.3.1 using package spec files and installing:

dapl-1.2.6-1
dapl-devel-1.2.6-1
dapl-2.0.8-1
dapl-utils-2.0.8-1
dapl-devel-2.0.8-1
dapl-debuginfo-2.0.8-1

tags: dapl-1.2.6-1, dapl-2.0.8-1

Summary of changes since the last release:

v2    - add private data exchange with reject
v1,v2 - better error reporting in non-debug builds
v1,v2 - update only OFA entries in dat.conf, cooperate with non-OFA providers
v1,v2 - support for zero byte operations, iov==NULL
v1,v2 - multi-transport support for inline data and private data differences
v1,v2 - fix memory leaks and other reported bugs since OFED 1.3

Thanks,

-arlin

From andrea at qumranet.com  Thu May  1 11:12:56 2008
From: andrea at qumranet.com (Andrea Arcangeli)
Date: Thu, 1 May 2008 20:12:56 +0200
Subject: [ofa-general] mmu notifier-core v14->v15 diff for review
In-Reply-To: <20080426164511.GJ9514@duo.random>
References: <20080426164511.GJ9514@duo.random>
Message-ID: <20080501181256.GK8150@duo.random>

Hello everyone,

this is the v14 to v15 difference of the mmu-notifier-core patch. This
is just for review of the difference; I'll post the full v15 soon.
Please review the diff in the meantime.

Many of these cleanups are thanks to Andrew's review of
mmu-notifier-core in v14. He also spotted the GFP_KERNEL allocation
under spin_lock, where DEBUG_SPINLOCK_SLEEP failed to catch it until I
enabled PREEMPT (GFP_KERNEL there was perfectly safe with the whole
patchset applied, but not ok if only mmu-notifier-core was applied).
As usual, that bug couldn't hurt anybody unless the mmu notifiers were
armed.

I also wrote a proper changelog for the mmu-notifier-core patch, which
I will append before the v14->v15 diff:

Subject: mmu-notifier-core

With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to
pages. There are secondary MMUs (with secondary sptes and secondary
tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
spte in mmu-notifier context, I mean "secondary pte". In the GRU case
there's no actual secondary pte and there's only a secondary tlb,
because the GRU secondary MMU has no knowledge about sptes and every
secondary tlb miss event in the MMU always generates a page fault that
has to be resolved by the CPU (this is not the case with KVM, where a
secondary tlb miss will walk sptes in hardware and will refill the
secondary tlb transparently to software if the corresponding spte is
present).

The same way zap_page_range has to invalidate the pte before freeing
the page, the spte (and secondary tlb) must also be invalidated before
any page is freed and reused.

Currently we take a page_count pin on every page mapped by sptes, but
that means the pages can't be swapped whenever they're mapped by any
spte, because they're part of the guest working set.
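(To illustrate the pinning pattern just described -- a sketch only,
using the current get_user_pages() interface; this is not kvm's actual
code:)

#include <linux/mm.h>
#include <linux/sched.h>

/* Hypothetical secondary-MMU driver: to keep an spte pointing at a
 * user page it must hold a page reference, so the page can never be
 * swapped while the spte exists. */
static int pin_and_map_spte(unsigned long addr, int write)
{
	struct page *page;
	int ret;

	down_read(&current->mm->mmap_sem);
	ret = get_user_pages(current, current->mm, addr, 1, write, 0,
			     &page, NULL);
	up_read(&current->mm->mmap_sem);
	if (ret != 1)
		return -EFAULT;

	/* ... install the spte pointing at page ... */

	/* the page_count pin is only dropped when the spte is torn
	 * down; this is exactly what mmu notifiers make unnecessary */
	return 0;
}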
Furthermore a spte unmap event can immediately lead to a page being
freed when the pin is released (so requiring the same complex and
relatively slow tlb_gather smp-safe logic we have in zap_page_range,
which can be avoided completely if the spte unmap event doesn't
require an unpin of the page previously mapped in the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and
know when the VM is swapping or freeing or doing anything on the
primary MMU, so that the secondary MMU code can drop sptes before the
pages are freed, avoiding all page pinning and allowing 100% reliable
swapping of guest physical address space. Furthermore it avoids
requiring the code that tears down the mappings of the secondary MMU
to implement a logic like tlb_gather in zap_page_range, which would
require many IPIs to flush other cpu tlbs for each fixed number of
sptes unmapped.

To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings
will be invalidated, and the next secondary-mmu page fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establish an updated
spte or secondary-tlb mapping on the copied page. Or it will set up a
readonly spte or readonly tlb mapping if it's a guest read, if it
calls get_user_pages with write=0. This is just an example.

This allows mapping any page pointed to by any pte (and in turn
visible in the primary CPU MMU) into a secondary MMU (be it a pure tlb
like GRU, or a full MMU with both sptes and a secondary tlb like the
shadow-pagetable layer in kvm), or a remote DMA in software like XPMEM
(hence the need to schedule in the XPMEM code to send the invalidate
to the remote node, while there is no need to schedule in kvm/GRU as
it's an immediate event, like invalidating a primary-mmu pte).

At least for KVM, without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.

Dependencies:

1) Introduces list_del_init_rcu and documents it (fixes a comment for
list_del_rcu too)

2) mm_lock() to register the mmu notifier when the whole VM isn't
doing anything with "mm". This allows mmu notifier users to keep track
of whether the VM is in the middle of the invalidate_range_begin/end
critical section with an atomic counter increased in range_begin and
decreased in range_end. No secondary MMU page fault is allowed to map
any spte or secondary tlb reference while the VM is in the middle of
range_begin/end, as any page returned by get_user_pages in that
critical section could later immediately be freed without any further
->invalidate_page notification (invalidate_range_begin/end works on
ranges and ->invalidate_page isn't called immediately before freeing
the page). To stop all page freeing and pagetable overwrites, the
mmap_sem must be taken in write mode and all other anon_vma/i_mmap
locks must be taken in virtual address order. The order is critical to
avoid mm_lock(mm1) and mm_lock(mm2) running concurrently and
triggering lock inversion deadlocks.

3) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled
if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage
of mmu notifiers, but this already allows compiling a KVM external
module against a kernel with mmu notifiers enabled, and from the next
pull from kvm.git we'll start using them.
And GRU/XPMEM will also be able to continue development by enabling
KVM=m in their config, until they submit all the GRU/XPMEM GPLv2 code
to the mainline kernel. Then they can also enable MMU_NOTIFIERS in the
same way KVM does (even if KVM=n). This guarantees nobody selects
MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n.

The mmu_notifier_register call can fail because mm_lock may not be
able to allocate the required vmalloc space. See the comment on top of
the mm_lock() implementation for the worst case memory requirements.
Because mmu_notifier_register is used when a driver starts up, a
failure can be gracefully handled. Here is an example of the change
applied to kvm to register the mmu notifiers. Usually when a driver
starts up, other allocations are required anyway and -ENOMEM failure
paths exist already.

 struct kvm *kvm_arch_create_vm(void)
 {
 	struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+	int err;

 	if (!kvm)
 		return ERR_PTR(-ENOMEM);

 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

+	kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+	err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+	if (err) {
+		kfree(kvm);
+		return ERR_PTR(err);
+	}
+
 	return kvm;
 }

mmu_notifier_unregister returns void and it's reliable.

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM
 	select PREEMPT_NOTIFIERS
+	select MMU_NOTIFIER
 	select ANON_INODES
 	---help---
 	  Support hosting fully virtualized guest machines using hardware
diff --git a/include/linux/list.h b/include/linux/list.h
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -739,7 +739,7 @@ static inline void hlist_del(struct hlis
  * or hlist_del_rcu(), running on this same list.
  * However, it is perfectly legal to run concurrently with
  * the _rcu list-traversal primitives, such as
- * hlist_for_each_entry().
+ * hlist_for_each_entry_rcu().
  */
 static inline void hlist_del_rcu(struct hlist_node *n)
 {
@@ -755,6 +755,26 @@ static inline void hlist_del_init(struct
 	}
 }
 
+/**
+ * hlist_del_init_rcu - deletes entry from hash list with re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: list_unhashed() on entry does return true after this. It is
+ * useful for RCU based read lockfree traversal if the writer side
+ * must know if the list entry is still hashed or already unhashed.
+ *
+ * In particular, it means that we can not poison the forward pointers
+ * that may still be used for walking the hash list and we can only
+ * zero the pprev pointer so list_unhashed() will return true after
+ * this.
+ *
+ * The caller must take whatever precautions are necessary (such as
+ * holding appropriate locks) to avoid racing with another
+ * list-mutation primitive, such as hlist_add_head_rcu() or
+ * hlist_del_rcu(), running on this same list. However, it is
+ * perfectly legal to run concurrently with the _rcu list-traversal
+ * primitives, such as hlist_for_each_entry_rcu().
+ */ static inline void hlist_del_init_rcu(struct hlist_node *n) { if (!hlist_unhashed(n)) { diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1050,18 +1050,6 @@ extern int install_special_mapping(struc unsigned long addr, unsigned long len, unsigned long flags, struct page **pages); -/* - * mm_lock will take mmap_sem writably (to prevent all modifications - * and scanning of vmas) and then also takes the mapping locks for - * each of the vma to lockout any scans of pagetables of this address - * space. This can be used to effectively holding off reclaim from the - * address space. - * - * mm_lock can fail if there is not enough memory to store a pointer - * array to all vmas. - * - * mm_lock and mm_unlock are expensive operations that may take a long time. - */ struct mm_lock_data { spinlock_t **i_mmap_locks; spinlock_t **anon_vma_locks; diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -4,17 +4,24 @@ #include #include #include +#include struct mmu_notifier; struct mmu_notifier_ops; #ifdef CONFIG_MMU_NOTIFIER -#include +/* + * The mmu notifier_mm structure is allocated and installed in + * mm->mmu_notifier_mm inside the mm_lock() protected critical section + * and it's released only when mm_count reaches zero in mmdrop(). + */ struct mmu_notifier_mm { + /* all mmu notifiers registerd in this mm are queued in this list */ struct hlist_head list; + /* srcu structure for this mm */ struct srcu_struct srcu; - /* to serialize mmu_notifier_unregister against mmu_notifier_release */ + /* to serialize the list modifications and hlist_unhashed */ spinlock_t lock; }; @@ -23,8 +30,8 @@ struct mmu_notifier_ops { * Called either by mmu_notifier_unregister or when the mm is * being destroyed by exit_mmap, always before all pages are * freed. It's mandatory to implement this method. This can - * run concurrently to other mmu notifier methods and it - * should teardown all secondary mmu mappings and freeze the + * run concurrently with other mmu notifier methods and it + * should tear down all secondary mmu mappings and freeze the * secondary mmu. */ void (*release)(struct mmu_notifier *mn, @@ -43,9 +50,10 @@ struct mmu_notifier_ops { /* * Before this is invoked any secondary MMU is still ok to - * read/write to the page previously pointed by the Linux pte - * because the old page hasn't been freed yet. If required - * set_page_dirty has to be called internally to this method. + * read/write to the page previously pointed to by the Linux + * pte because the page hasn't been freed yet and it won't be + * freed until this returns. If required set_page_dirty has to + * be called internally to this method. */ void (*invalidate_page)(struct mmu_notifier *mn, struct mm_struct *mm, @@ -53,20 +61,18 @@ struct mmu_notifier_ops { /* * invalidate_range_start() and invalidate_range_end() must be - * paired and are called only when the mmap_sem is held and/or - * the semaphores protecting the reverse maps. Both functions + * paired and are called only when the mmap_sem and/or the + * locks protecting the reverse maps are held. Both functions * may sleep. The subsystem must guarantee that no additional - * references to the pages in the range established between - * the call to invalidate_range_start() and the matching call - * to invalidate_range_end(). 
+ * references are taken to the pages in the range established + * between the call to invalidate_range_start() and the + * matching call to invalidate_range_end(). * - * Invalidation of multiple concurrent ranges may be permitted - * by the driver or the driver may exclude other invalidation - * from proceeding by blocking on new invalidate_range_start() - * callback that overlap invalidates that are already in - * progress. Either way the establishment of sptes to the - * range can only be allowed if all invalidate_range_stop() - * function have been called. + * Invalidation of multiple concurrent ranges may be + * optionally permitted by the driver. Either way the + * establishment of sptes is forbidden in the range passed to + * invalidate_range_begin/end for the whole duration of the + * invalidate_range_begin/end critical section. * * invalidate_range_start() is called when all pages in the * range are still mapped and have at least a refcount of one. @@ -187,6 +193,14 @@ static inline void mmu_notifier_mm_destr __mmu_notifier_mm_destroy(mm); } +/* + * These two macros will sometime replace ptep_clear_flush. + * ptep_clear_flush is impleemnted as macro itself, so this also is + * implemented as a macro until ptep_clear_flush will converted to an + * inline function, to diminish the risk of compilation failure. The + * invalidate_page method over time can be moved outside the PT lock + * and these two macros can be later removed. + */ #define ptep_clear_flush_notify(__vma, __address, __ptep) \ ({ \ pte_t __pte; \ diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -193,7 +193,3 @@ config VIRT_TO_BUS config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS - -config MMU_NOTIFIER - def_bool y - bool "MMU notifier, for paging KVM/RDMA" diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -613,6 +613,12 @@ int copy_page_range(struct mm_struct *ds if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); + /* + * We need to invalidate the secondary MMU mappings only when + * there could be a permission downgrade on the ptes of the + * parent mm. And a permission downgrade will only happen if + * is_cow_mapping() returns true. + */ if (is_cow_mapping(vma->vm_flags)) mmu_notifier_invalidate_range_start(src_mm, addr, end); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2329,7 +2329,36 @@ static inline void __mm_unlock(spinlock_ * operations that could ever happen on a certain mm. This includes * vmtruncate, try_to_unmap, and all page faults. The holder * must not hold any mm related lock. A single task can't take more - * than one mm lock in a row or it would deadlock. + * than one mm_lock in a row or it would deadlock. + * + * The mmap_sem must be taken in write mode to block all operations + * that could modify pagetables and free pages without altering the + * vma layout (for example populate_range() with nonlinear vmas). + * + * The sorting is needed to avoid lock inversion deadlocks if two + * tasks run mm_lock at the same time on different mm that happen to + * share some anon_vmas/inodes but mapped in different order. + * + * mm_lock and mm_unlock are expensive operations that may have to + * take thousand of locks. Thanks to sort() the complexity is + * O(N*log(N)) where N is the number of VMAs in the mm. The max number + * of vmas is defined in /proc/sys/vm/max_map_count. + * + * mm_lock() can fail if memory allocation fails. 
The worst case + * vmalloc allocation required is 2*max_map_count*sizeof(spinlock *), + * so around 1Mbyte, but in practice it'll be much less because + * normally there won't be max_map_count vmas allocated in the task + * that runs mm_lock(). + * + * The vmalloc memory allocated by mm_lock is stored in the + * mm_lock_data structure that must be allocated by the caller and it + * must be later passed to mm_unlock that will free it after using it. + * Allocating the mm_lock_data structure on the stack is fine because + * it's only a couple of bytes in size. + * + * If mm_lock() returns -ENOMEM no memory has been allocated and the + * mm_lock_data structure can be freed immediately, and mm_unlock must + * not be called. */ int mm_lock(struct mm_struct *mm, struct mm_lock_data *data) { @@ -2350,6 +2379,13 @@ int mm_lock(struct mm_struct *mm, struct return -ENOMEM; } + /* + * When mm_lock_sort_anon_vma/i_mmap returns zero it + * means there's no lock to take and so we can free + * the array here without waiting mm_unlock. mm_unlock + * will do nothing if nr_i_mmap/anon_vma_locks is + * zero. + */ data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks); data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks); @@ -2374,7 +2410,17 @@ static void mm_unlock_vfree(spinlock_t * vfree(locks); } -/* avoid memory allocations for mm_unlock to prevent deadlock */ +/* + * mm_unlock doesn't require any memory allocation and it won't fail. + * + * All memory has been previously allocated by mm_lock and it'll be + * all freed before returning. Only after mm_unlock returns, the + * caller is allowed to free and forget the mm_lock_data structure. + * + * mm_unlock runs in O(N) where N is the max number of VMAs in the + * mm. The max number of vmas is defined in + * /proc/sys/vm/max_map_count. + */ void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data) { if (mm->map_count) { diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -21,12 +21,12 @@ * This function can't run concurrently against mmu_notifier_register * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap * runs with mm_users == 0. Other tasks may still invoke mmu notifiers - * in parallel despite there's no task using this mm anymore, through - * the vmas outside of the exit_mmap context, like with + * in parallel despite there being no task using this mm any more, + * through the vmas outside of the exit_mmap context, such as with * vmtruncate. This serializes against mmu_notifier_unregister with * the mmu_notifier_mm->lock in addition to SRCU and it serializes * against the other mmu notifiers with SRCU. struct mmu_notifier_mm - * can't go away from under us as exit_mmap holds a mm_count pin + * can't go away from under us as exit_mmap holds an mm_count pin * itself. */ void __mmu_notifier_release(struct mm_struct *mm) @@ -41,7 +41,7 @@ void __mmu_notifier_release(struct mm_st hlist); /* * We arrived before mmu_notifier_unregister so - * mmu_notifier_unregister will do nothing else than + * mmu_notifier_unregister will do nothing other than * to wait ->release to finish and * mmu_notifier_unregister to return. */ @@ -66,7 +66,11 @@ void __mmu_notifier_release(struct mm_st spin_unlock(&mm->mmu_notifier_mm->lock); /* - * Wait ->release if mmu_notifier_unregister is running it. 
+ * synchronize_srcu here prevents mmu_notifier_release to + * return to exit_mmap (which would proceed freeing all pages + * in the mm) until the ->release method returns, if it was + * invoked by mmu_notifier_unregister. + * * The mmu_notifier_mm can't go away from under us because one * mm_count is hold by exit_mmap. */ @@ -144,8 +148,9 @@ void __mmu_notifier_invalidate_range_end * Must not hold mmap_sem nor any other VM related lock when calling * this registration function. Must also ensure mm_users can't go down * to zero while this runs to avoid races with mmu_notifier_release, - * so mm has to be current->mm or the mm should be pinned safely like - * with get_task_mm(). mmput can be called after mmu_notifier_register + * so mm has to be current->mm or the mm should be pinned safely such + * as with get_task_mm(). If the mm is not current->mm, the mm_users + * pin should be released by calling mmput after mmu_notifier_register * returns. mmu_notifier_unregister must be always called to * unregister the notifier. mm_count is automatically pinned to allow * mmu_notifier_unregister to safely run at any time later, before or @@ -155,29 +160,29 @@ int mmu_notifier_register(struct mmu_not int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) { struct mm_lock_data data; + struct mmu_notifier_mm * mmu_notifier_mm; int ret; BUG_ON(atomic_read(&mm->mm_users) <= 0); + ret = -ENOMEM; + mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL); + if (unlikely(!mmu_notifier_mm)) + goto out; + + ret = init_srcu_struct(&mmu_notifier_mm->srcu); + if (unlikely(ret)) + goto out_kfree; + ret = mm_lock(mm, &data); if (unlikely(ret)) - goto out; + goto out_cleanup; if (!mm_has_notifiers(mm)) { - mm->mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), - GFP_KERNEL); - ret = -ENOMEM; - if (unlikely(!mm_has_notifiers(mm))) - goto out_unlock; - - ret = init_srcu_struct(&mm->mmu_notifier_mm->srcu); - if (unlikely(ret)) { - kfree(mm->mmu_notifier_mm); - mmu_notifier_mm_init(mm); - goto out_unlock; - } - INIT_HLIST_HEAD(&mm->mmu_notifier_mm->list); - spin_lock_init(&mm->mmu_notifier_mm->lock); + INIT_HLIST_HEAD(&mmu_notifier_mm->list); + spin_lock_init(&mmu_notifier_mm->lock); + mm->mmu_notifier_mm = mmu_notifier_mm; + mmu_notifier_mm = NULL; } atomic_inc(&mm->mm_count); @@ -192,8 +197,14 @@ int mmu_notifier_register(struct mmu_not spin_lock(&mm->mmu_notifier_mm->lock); hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list); spin_unlock(&mm->mmu_notifier_mm->lock); -out_unlock: + mm_unlock(mm, &data); +out_cleanup: + if (mmu_notifier_mm) + cleanup_srcu_struct(&mmu_notifier_mm->srcu); +out_kfree: + /* kfree() does nothing if mmu_notifier_mm is NULL */ + kfree(mmu_notifier_mm); out: BUG_ON(atomic_read(&mm->mm_users) <= 0); return ret; From shemminger at vyatta.com Thu May 1 11:22:52 2008 From: shemminger at vyatta.com (Stephen Hemminger) Date: Thu, 1 May 2008 11:22:52 -0700 Subject: [ofa-general] Re: [RESEND] RE: [PATCH 08/13] QLogic VNIC: sysfs interface implementation for the driver In-Reply-To: <71d336490805010943s79f01e01u9b4566165c4fba3f@mail.gmail.com> References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430171955.31725.7771.stgit@localhost.localdomain> <20080501075606.4963afa3@extreme> <71d336490805010943s79f01e01u9b4566165c4fba3f@mail.gmail.com> Message-ID: <20080501112252.167695c7@extreme> On Thu, 1 May 2008 22:13:08 +0530 "Ramachandra K" wrote: > Sorry for the resend. Original mail got bounced from netdev. 
> 
> On Thu, May 1, 2008 at 9:32 PM,  wrote:
> >
> >  Stephen,
> >
> >  Stephen Hemminger [mailto:shemminger at vyatta.com] wrote:
> >
> >  >> On Wed, 30 Apr 2008 22:49:55 +0530
> >  >> Ramachandra K wrote:
> >
> >  >> +static match_table_t vnic_opt_tokens = {
> >  >> +	{VNIC_OPT_IOC_GUID, "ioc_guid=%s"},
> >  >> +	{VNIC_OPT_DGID, "dgid=%s"},
> >  >> +	{VNIC_OPT_PKEY, "pkey=%x"},
> >  >> +	{VNIC_OPT_NAME, "name=%s"},
> >  >> +	{VNIC_OPT_INSTANCE, "instance=%d"},
> >  >> +	{VNIC_OPT_RXCSUM, "rx_csum=%s"},
> >  >> +	{VNIC_OPT_TXCSUM, "tx_csum=%s"},
> >  >> +	{VNIC_OPT_HEARTBEAT, "heartbeat=%d"},
> >  >> +	{VNIC_OPT_IOC_STRING, "ioc_string=\"%s"},
> >  >> +	{VNIC_OPT_IB_MULTICAST, "ib_multicast=%s"},
> >  >> +	{VNIC_OPT_ERR, NULL}
> >  >> +};
> >  >>
> >
> >  > NO
> >  > 1. Most of this shouldn't be done via sysfs (rx_csum, tx_csum, ...)
> >  > 2. Sysfs is one value per file not name=value
> >
> >  The VNIC driver needs multiple parameters (IOC GUID, DGID, etc.) from
> >  user space to connect to the EVIC. For this, the "name=value"
> >  mechanism is used on a single write-only sysfs file as the input
> >  method to the driver.
> >
> >  The driver does follow the one-value-per-file sysfs rule whenever it
> >  returns data: each readable file returns only a single value.
> >
> >  Regards,
> >  Ram

The undocumented style rule of sysfs is one value (ascii) per file.

From shemminger at vyatta.com  Thu May  1 11:26:33 2008
From: shemminger at vyatta.com (Stephen Hemminger)
Date: Thu, 1 May 2008 11:26:33 -0700
Subject: [ofa-general] RE: [PATCH 11/13] QLogic VNIC: Driver utility file -
	implements various utility macros
In-Reply-To: <71d336490805011001he7359dfw831470986d87b385@mail.gmail.com>
References: <20080430171028.31725.86190.stgit@localhost.localdomain>
	<20080430172126.31725.48554.stgit@localhost.localdomain>
	<20080501075816.1010ec3a@extreme>
	<71d336490805011001he7359dfw831470986d87b385@mail.gmail.com>
Message-ID: <20080501112633.7e272dcc@extreme>

On Thu, 1 May 2008 22:31:10 +0530
"Ramachandra K"  wrote:

> Sorry for the resend. Original mail bounced from netdev.
> 
> On Thu, May 1, 2008 at 9:48 PM  wrote:
> > Stephen,
> >
> > Stephen Hemminger [mailto:shemminger at vyatta.com] wrote:
> >
> > > Ramachandra K wrote:
> >
> > >> +#define is_power_of2(value) (((value) & ((value - 1))) == 0)
> > >> +#define ALIGN_DOWN(x, a) ((x)&(~((a)-1)))
> >
> > > In kernel.h already
> >
> > Will fix this. Thanks.
> >
> > >> +extern u32 vnic_debug;
> > > Use msg level macros instead?

There is an ethtool mechanism to set the message level for the debug
method; read other network drivers and look at netif_msg_timer(x),
netif_msg_probe(x), etc. in include/linux/netdevice.h.

The goal here is to not have any special configuration for each
different type of hardware. It is bad user-interface design to have
each hardware vendor choosing a different mechanism and different
values to enable debugging. You can argue that the existing
infrastructure is inadequate, in which case extend the infrastructure
for all devices.
From makc at sgi.com  Thu May  1 11:43:34 2008
From: makc at sgi.com (Max Matveev)
Date: Fri, 2 May 2008 04:43:34 +1000
Subject: [ofa-general] mapping IP addresses to GIDs across IP subnets
In-Reply-To: <20080430213051.GX24525@obsidianresearch.com>
References: <18456.56771.908062.459625@kuku.melbourne.sgi.com>
	<20080430213051.GX24525@obsidianresearch.com>
Message-ID: <18458.3926.633446.715678@kuku.melbourne.sgi.com>

On Wed, 30 Apr 2008 15:30:51 -0600, Jason Gunthorpe wrote:

 JG> Well, you can't just assume that a AAAA record associated with the
 JG> reverse of a IPv4 is a GID - it could be a legitimate IPv6 address.
 JG> The GID space and IPv6 space are completely distinct, despite the same
 JG> format of the address.

You can also make an administrative decision to make the IB prefix the
same as the IPv6 prefix, in which case the IPv6 address and the GID
become the same.

 JG> The only way I could see to do this with DNS is to introduce a new
 JG> record type for GIDs..

 JG> Alternatively, you could use DNS to manage a mapping table, ala the
 JG> reverse map:

 JG> 1.0.0.10.ipv4.ibta-addr. AAAA fd83:609c:bdc8:1:213:72ff:fe29:e65d

That could work too, but the resolver would need to be modified to ask
for the different TLA.

max

From weiny2 at llnl.gov  Thu May  1 15:50:45 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 1 May 2008 15:50:45 -0700
Subject: [ofa-general] [PATCH] infiniband-diags/scripts/iblinkinfo.pl: fix
	printing of switch name when port 1 is down.
Message-ID: <20080501155045.4aa3ef2c.weiny2@llnl.gov>

I found a bug in the printing of the names of switches in
iblinkinfo.pl. The name of the switch was being pulled from the first
port's "link" structure. The problem is that if the first port is
down, there is no structure available. This patch gets the switch name
from the first available link structure and prints the name correctly.

Ira

>From 9b69c0ff4c7785be78157ab78e4a4892d64e2fb2 Mon Sep 17 00:00:00 2001
From: Ira K. Weiny 
Date: Thu, 1 May 2008 15:46:25 -0700
Subject: [PATCH] infiniband-diags/scripts/iblinkinfo.pl: fix printing of
 switch name when port 1 is down.

Signed-off-by: Ira K. Weiny 

---
 infiniband-diags/scripts/iblinkinfo.pl |   14 +++++++++++++-
 1 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/scripts/iblinkinfo.pl b/infiniband-diags/scripts/iblinkinfo.pl
index 890567c..6077ded 100755
--- a/infiniband-diags/scripts/iblinkinfo.pl
+++ b/infiniband-diags/scripts/iblinkinfo.pl
@@ -139,11 +139,23 @@ sub main
 	foreach my $port (1 .. $num_ports) {
 		my $hr = $IBswcountlimits::link_ends{$switch}{$port};
 		if ($switch_prompt eq "no" && !$line_mode) {
+			my $switch_name = "";
+			my $tmp_port = $port;
+			while ($switch_name eq "" && $tmp_port <= $num_ports) {
+				# the first port is down find switch name with up port
+				my $hr = $IBswcountlimits::link_ends{$switch}{$tmp_port};
+				$switch_name = $hr->{loc_desc};
+				$tmp_port++;
+			}
+			if ($switch_name eq "") {
+				printf(
+					"WARNING: Switch Name not found for $switch\n");
+			}
 			push(
 				@output_lines,
 				sprintf(
 					"Switch %18s %s%s:\n",
-					$switch, $hr->{loc_desc}, $pkt_life_prompt
+					$switch, $switch_name, $pkt_life_prompt
 				)
 			);
 			$switch_prompt = "yes";
-- 
1.5.1

From rajib.majumder at credit-suisse.com  Fri May  2 02:37:38 2008
From: rajib.majumder at credit-suisse.com (Majumder, Rajib)
Date: Fri, 2 May 2008 17:37:38 +0800
Subject: [ofa-general] RDMA vs Shared Memory
Message-ID: <0175FAC12977B047809C1BACA25881AD239C7E@ESNG17P32002A.csfb.cs-group.com>

Hello,

I was trying to find the fastest IPC mechanism where the data source
and sink run on the same host, on SLES 10, running ConnectX and OFED
1.3.0. It seems RDMA is performing much better than shared memory.

Mode   Msg size   Latency (microsecs)   Throughput
-----  ---------  --------------------  -----------
SHM    32 bytes   20                    17 Mbps
RDMA   32 bytes   1.07                  1.2 Gbps
SHM    32k        70                    5.2 Gbps
RDMA   32k        30                    8.5 Gbps

Can someone explain why RDMA is giving better performance on the same
host? Is it only kernel bypass and zcopy?

Thanks

Rajib

==============================================================================
Please access the attached hyperlink for an important electronic communications disclaimer:

http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
==============================================================================
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From andrea at qumranet.com  Fri May  2 08:05:05 2008
From: andrea at qumranet.com (Andrea Arcangeli)
Date: Fri, 02 May 2008 17:05:05 +0200
Subject: [ofa-general] [PATCH 02 of 11] get_task_mm
In-Reply-To: 
Message-ID: 

# HG changeset patch
# User Andrea Arcangeli 
# Date 1209740185 -7200
# Node ID c85c85c4be165eb6de16136bb97cf1fa7fd5c88f
# Parent  1489529e7b53d3f2dab8431372aa4850ec821caa
get_task_mm

get_task_mm should not succeed if mmput() is running and has reduced
the mm_users count to zero. This can occur if a processor follows a
task's pointer to an mm struct, because that pointer is only cleared
after the mmput().

If get_task_mm() succeeds after mmput() has reduced mm_users to zero,
then we have the lovely situation where one portion of the kernel is
doing all the teardown work for an mm while another portion is happily
using it.
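(To illustrate the contract -- a hypothetical caller, not code from
this patch; with the fix below, get_task_mm() simply returns NULL for
an mm that is being torn down:)

#include <linux/sched.h>
#include <linux/mm.h>

static int inspect_task_mm(struct task_struct *task)
{
	struct mm_struct *mm = get_task_mm(task);

	if (!mm)
		return -EINVAL;	/* no mm, or mmput() already ran */

	down_read(&mm->mmap_sem);
	/* ... safely walk mm->mmap here ... */
	up_read(&mm->mmap_sem);

	mmput(mm);	/* drop the mm_users reference we took */
	return 0;
}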
Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -465,7 +465,8 @@ struct mm_struct *get_task_mm(struct tas if (task->flags & PF_BORROWED_MM) mm = NULL; else - atomic_inc(&mm->mm_users); + if (!atomic_inc_not_zero(&mm->mm_users)) + mm = NULL; } task_unlock(task); return mm; From andrea at qumranet.com Fri May 2 08:05:06 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:06 +0200 Subject: [ofa-general] [PATCH 03 of 11] invalidate_page outside PT lock In-Reply-To: Message-ID: # HG changeset patch # User Andrea Arcangeli # Date 1209740185 -7200 # Node ID ea8fc9187b6d3ef2742061b4f62598afe55281cf # Parent c85c85c4be165eb6de16136bb97cf1fa7fd5c88f invalidate_page outside PT lock Moves all mmu notifier methods outside the PT lock (first and not last step to make them sleep capable). Signed-off-by: Andrea Arcangeli diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -193,35 +193,6 @@ static inline void mmu_notifier_mm_destr __mmu_notifier_mm_destroy(mm); } -/* - * These two macros will sometime replace ptep_clear_flush. - * ptep_clear_flush is impleemnted as macro itself, so this also is - * implemented as a macro until ptep_clear_flush will converted to an - * inline function, to diminish the risk of compilation failure. The - * invalidate_page method over time can be moved outside the PT lock - * and these two macros can be later removed. - */ -#define ptep_clear_flush_notify(__vma, __address, __ptep) \ -({ \ - pte_t __pte; \ - struct vm_area_struct *___vma = __vma; \ - unsigned long ___address = __address; \ - __pte = ptep_clear_flush(___vma, ___address, __ptep); \ - mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ - __pte; \ -}) - -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ -({ \ - int __young; \ - struct vm_area_struct *___vma = __vma; \ - unsigned long ___address = __address; \ - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ - __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \ - ___address); \ - __young; \ -}) - #else /* CONFIG_MMU_NOTIFIER */ static inline void mmu_notifier_release(struct mm_struct *mm) @@ -257,9 +228,6 @@ static inline void mmu_notifier_mm_destr { } -#define ptep_clear_flush_young_notify ptep_clear_flush_young -#define ptep_clear_flush_notify ptep_clear_flush - #endif /* CONFIG_MMU_NOTIFIER */ #endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -188,11 +188,13 @@ __xip_unmap (struct address_space * mapp if (pte) { /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush_notify(vma, address, pte); + pteval = ptep_clear_flush(vma, address, pte); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); pte_unmap_unlock(pte, ptl); + /* must invalidate_page _before_ freeing the page */ + mmu_notifier_invalidate_page(mm, address); page_cache_release(page); } } diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -1714,9 +1714,10 @@ static int do_wp_page(struct mm_struct * */ page_table = pte_offset_map_lock(mm, pmd, address, &ptl); - page_cache_release(old_page); + new_page = NULL; if (!pte_same(*page_table, orig_pte)) goto unlock; + page_cache_release(old_page); page_mkwrite = 1; } @@ -1732,6 +1733,7 @@ static int do_wp_page(struct mm_struct * if (ptep_set_access_flags(vma, address, page_table, entry,1)) update_mmu_cache(vma, address, entry); ret |= VM_FAULT_WRITE; + old_page = new_page = NULL; goto unlock; } @@ -1776,7 +1778,7 @@ gotten: * seen in the presence of one thread doing SMC and another * thread doing COW. */ - ptep_clear_flush_notify(vma, address, page_table); + ptep_clear_flush(vma, address, page_table); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); @@ -1788,12 +1790,18 @@ gotten: } else mem_cgroup_uncharge_page(new_page); - if (new_page) +unlock: + pte_unmap_unlock(page_table, ptl); + + if (new_page) { + if (new_page == old_page) + /* cow happened, notify before releasing old_page */ + mmu_notifier_invalidate_page(mm, address); page_cache_release(new_page); + } if (old_page) page_cache_release(old_page); -unlock: - pte_unmap_unlock(page_table, ptl); + if (dirty_page) { if (vma->vm_file) file_update_time(vma->vm_file); diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -275,7 +275,7 @@ static int page_referenced_one(struct pa unsigned long address; pte_t *pte; spinlock_t *ptl; - int referenced = 0; + int referenced = 0, clear_flush_young = 0; address = vma_address(page, vma); if (address == -EFAULT) @@ -288,8 +288,11 @@ static int page_referenced_one(struct pa if (vma->vm_flags & VM_LOCKED) { referenced++; *mapcount = 1; /* break early from loop */ - } else if (ptep_clear_flush_young_notify(vma, address, pte)) - referenced++; + } else { + clear_flush_young = 1; + if (ptep_clear_flush_young(vma, address, pte)) + referenced++; + } /* Pretend the page is referenced if the task has the swap token and is in the middle of a page fault. */ @@ -299,6 +302,10 @@ static int page_referenced_one(struct pa (*mapcount)--; pte_unmap_unlock(pte, ptl); + + if (clear_flush_young) + referenced += mmu_notifier_clear_flush_young(mm, address); + out: return referenced; } @@ -458,7 +465,7 @@ static int page_mkclean_one(struct page pte_t entry; flush_cache_page(vma, address, pte_pfn(*pte)); - entry = ptep_clear_flush_notify(vma, address, pte); + entry = ptep_clear_flush(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -466,6 +473,10 @@ static int page_mkclean_one(struct page } pte_unmap_unlock(pte, ptl); + + if (ret) + mmu_notifier_invalidate_page(mm, address); + out: return ret; } @@ -717,15 +728,14 @@ static int try_to_unmap_one(struct page * If it's recently referenced (perhaps page_referenced * skipped over this mm) then we should reactivate it. 
*/ - if (!migration && ((vma->vm_flags & VM_LOCKED) || - (ptep_clear_flush_young_notify(vma, address, pte)))) { + if (!migration && (vma->vm_flags & VM_LOCKED)) { ret = SWAP_FAIL; goto out_unmap; } /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); - pteval = ptep_clear_flush_notify(vma, address, pte); + pteval = ptep_clear_flush(vma, address, pte); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -780,6 +790,8 @@ static int try_to_unmap_one(struct page out_unmap: pte_unmap_unlock(pte, ptl); + if (ret != SWAP_FAIL) + mmu_notifier_invalidate_page(mm, address); out: return ret; } @@ -818,7 +830,7 @@ static void try_to_unmap_cluster(unsigne spinlock_t *ptl; struct page *page; unsigned long address; - unsigned long end; + unsigned long start, end; address = (vma->vm_start + cursor) & CLUSTER_MASK; end = address + CLUSTER_SIZE; @@ -839,6 +851,8 @@ static void try_to_unmap_cluster(unsigne if (!pmd_present(*pmd)) return; + start = address; + mmu_notifier_invalidate_range_start(mm, start, end); pte = pte_offset_map_lock(mm, pmd, address, &ptl); /* Update high watermark before we lower rss */ @@ -850,12 +864,12 @@ static void try_to_unmap_cluster(unsigne page = vm_normal_page(vma, address, *pte); BUG_ON(!page || PageAnon(page)); - if (ptep_clear_flush_young_notify(vma, address, pte)) + if (ptep_clear_flush_young(vma, address, pte)) continue; /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush_notify(vma, address, pte); + pteval = ptep_clear_flush(vma, address, pte); /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) @@ -871,6 +885,7 @@ static void try_to_unmap_cluster(unsigne (*mapcount)--; } pte_unmap_unlock(pte - 1, ptl); + mmu_notifier_invalidate_range_end(mm, start, end); } static int try_to_unmap_anon(struct page *page, int migration) From andrea at qumranet.com Fri May 2 08:05:07 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:07 +0200 Subject: [ofa-general] [PATCH 04 of 11] free-pgtables In-Reply-To: Message-ID: <14e9f5a12bb1657fa675.1209740707@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1209740185 -7200 # Node ID 14e9f5a12bb1657fa6756e18d5dac71d4ad1a55e # Parent ea8fc9187b6d3ef2742061b4f62598afe55281cf free-pgtables Move the tlb flushing into free_pgtables. The conversion of the locks taken for reverse map scanning would require taking sleeping locks in free_pgtables() and we cannot sleep while gathering pages for a tlb flush. Move the tlb_gather/tlb_finish call to free_pgtables() to be done for each vma. This may add a number of tlb flushes depending on the number of vmas that cannot be coalesced into one. The first pointer argument to free_pgtables() can then be dropped. 
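(A sketch of the per-vma pattern this describes -- simplified, with
the floor/ceiling computation elided; the actual change is in the diff
below:)

static void free_one_vma_pgtables(struct vm_area_struct *vma,
				  unsigned long floor, unsigned long ceiling)
{
	struct mmu_gather *tlb;

	/* gather and flush for this vma only, so that callers may
	 * take sleeping locks between vmas */
	tlb = tlb_gather_mmu(vma->vm_mm, 0);
	free_pgd_range(&tlb, vma->vm_start, vma->vm_end, floor, ceiling);
	tlb_finish_mmu(tlb, vma->vm_start, vma->vm_end);
}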
Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -772,8 +772,8 @@ int walk_page_range(const struct mm_stru void *private); void free_pgd_range(struct mmu_gather **tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); -void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma, - unsigned long floor, unsigned long ceiling); +void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor, + unsigned long ceiling); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma); void unmap_mapping_range(struct address_space *mapping, diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -272,9 +272,11 @@ void free_pgd_range(struct mmu_gather ** } while (pgd++, addr = next, addr != end); } -void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma, - unsigned long floor, unsigned long ceiling) +void free_pgtables(struct vm_area_struct *vma, unsigned long floor, + unsigned long ceiling) { + struct mmu_gather *tlb; + while (vma) { struct vm_area_struct *next = vma->vm_next; unsigned long addr = vma->vm_start; @@ -286,7 +288,8 @@ void free_pgtables(struct mmu_gather **t unlink_file_vma(vma); if (is_vm_hugetlb_page(vma)) { - hugetlb_free_pgd_range(tlb, addr, vma->vm_end, + tlb = tlb_gather_mmu(vma->vm_mm, 0); + hugetlb_free_pgd_range(&tlb, addr, vma->vm_end, floor, next? next->vm_start: ceiling); } else { /* @@ -299,9 +302,11 @@ void free_pgtables(struct mmu_gather **t anon_vma_unlink(vma); unlink_file_vma(vma); } - free_pgd_range(tlb, addr, vma->vm_end, + tlb = tlb_gather_mmu(vma->vm_mm, 0); + free_pgd_range(&tlb, addr, vma->vm_end, floor, next? next->vm_start: ceiling); } + tlb_finish_mmu(tlb, addr, vma->vm_end); vma = next; } } diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1759,9 +1759,9 @@ static void unmap_region(struct mm_struc update_hiwater_rss(mm); unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS, + tlb_finish_mmu(tlb, start, end); + free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? next->vm_start: 0); - tlb_finish_mmu(tlb, start, end); } /* @@ -2060,8 +2060,8 @@ void exit_mmap(struct mm_struct *mm) /* Use -1 here to ensure all VMAs in the mm are unmapped */ end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); tlb_finish_mmu(tlb, 0, end); + free_pgtables(vma, FIRST_USER_ADDRESS, 0); /* * Walk the list again, actually closing and freeing it, From andrea at qumranet.com Fri May 2 08:05:04 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:04 +0200 Subject: [ofa-general] [PATCH 01 of 11] mmu-notifier-core In-Reply-To: Message-ID: <1489529e7b53d3f2dab8.1209740704@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1209740175 -7200 # Node ID 1489529e7b53d3f2dab8431372aa4850ec821caa # Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8 mmu-notifier-core With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages. There are secondary MMUs (with secondary sptes and secondary tlbs) too. sptes in the kvm case are shadow pagetables, but when I say spte in mmu-notifier context, I mean "secondary pte". 
In the GRU case there's no actual secondary pte and there's only a
secondary tlb, because the GRU secondary MMU has no knowledge about
sptes and every secondary tlb miss event in the MMU always generates a
page fault that has to be resolved by the CPU (this is not the case
with KVM, where a secondary tlb miss will walk sptes in hardware and
will refill the secondary tlb transparently to software if the
corresponding spte is present).

The same way zap_page_range has to invalidate the pte before freeing
the page, the spte (and secondary tlb) must also be invalidated before
any page is freed and reused.

Currently we take a page_count pin on every page mapped by sptes, but
that means the pages can't be swapped whenever they're mapped by any
spte, because they're part of the guest working set.

Furthermore a spte unmap event can immediately lead to a page being
freed when the pin is released (so requiring the same complex and
relatively slow tlb_gather smp-safe logic we have in zap_page_range,
which can be avoided completely if the spte unmap event doesn't
require an unpin of the page previously mapped in the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and
know when the VM is swapping or freeing or doing anything on the
primary MMU, so that the secondary MMU code can drop sptes before the
pages are freed, avoiding all page pinning and allowing 100% reliable
swapping of guest physical address space. Furthermore it avoids
requiring the code that tears down the mappings of the secondary MMU
to implement a logic like tlb_gather in zap_page_range, which would
require many IPIs to flush other cpu tlbs for each fixed number of
sptes unmapped.

To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings
will be invalidated, and the next secondary-mmu page fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establish an updated
spte or secondary-tlb mapping on the copied page. Or it will set up a
readonly spte or readonly tlb mapping if it's a guest read, if it
calls get_user_pages with write=0. This is just an example.

This allows mapping any page pointed to by any pte (and in turn
visible in the primary CPU MMU) into a secondary MMU (be it a pure tlb
like GRU, or a full MMU with both sptes and a secondary tlb like the
shadow-pagetable layer in kvm), or a remote DMA in software like XPMEM
(hence the need to schedule in the XPMEM code to send the invalidate
to the remote node, while there is no need to schedule in kvm/GRU as
it's an immediate event, like invalidating a primary-mmu pte).

At least for KVM, without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.

Dependencies:

1) Introduces list_del_init_rcu and documents it (fixes a comment for
list_del_rcu too)

2) mm_lock() to register the mmu notifier when the whole VM isn't
doing anything with "mm". This allows mmu notifier users to keep track
of whether the VM is in the middle of the invalidate_range_begin/end
critical section with an atomic counter increased in range_begin and
decreased in range_end.
No secondary MMU page fault is allowed to map any spte or secondary tlb reference, while the VM is in the middle of range_begin/end as any page returned by get_user_pages in that critical section could later immediately be freed without any further ->invalidate_page notification (invalidate_range_begin/end works on ranges and ->invalidate_page isn't called immediately before freeing the page). To stop all page freeing and pagetable overwrites the mmap_sem must be taken in write mode and all other anon_vma/i_mmap locks must be taken in virtual address order. The order is critical to avoid mm_lock(mm1) and mm_lock(mm2) running concurrently to trigger lock inversion deadlocks. 3) It'd be a waste to add branches in the VM if nobody could possibly run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of mmu notifiers, but this already allows to compile a KVM external module against a kernel with mmu notifiers enabled and from the next pull from kvm.git we'll start using them. And GRU/XPMEM will also be able to continue the development by enabling KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n). This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n. The mmu_notifier_register call can fail because mm_lock may not allocate the required vmalloc space. See the comment on top of mm_lock() implementation for the worst case memory requirements. Because mmu_notifier_reigster is used when a driver startup, a failure can be gracefully handled. Here an example of the change applied to kvm to register the mmu notifiers. Usually when a driver startups other allocations are required anyway and -ENOMEM failure paths exists already. struct kvm *kvm_arch_create_vm(void) { struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); + int err; if (!kvm) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops; + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm); + if (err) { + kfree(kvm); + return ERR_PTR(err); + } + return kvm; } mmu_notifier_unregister returns void and it's reliable. Signed-off-by: Andrea Arcangeli Signed-off-by: Nick Piggin Signed-off-by: Christoph Lameter diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -21,6 +21,7 @@ config KVM tristate "Kernel-based Virtual Machine (KVM) support" depends on HAVE_KVM select PREEMPT_NOTIFIERS + select MMU_NOTIFIER select ANON_INODES ---help--- Support hosting fully virtualized guest machines using hardware diff --git a/include/linux/list.h b/include/linux/list.h --- a/include/linux/list.h +++ b/include/linux/list.h @@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis * or hlist_del_rcu(), running on this same list. * However, it is perfectly legal to run concurrently with * the _rcu list-traversal primitives, such as - * hlist_for_each_entry(). + * hlist_for_each_entry_rcu(). */ static inline void hlist_del_rcu(struct hlist_node *n) { @@ -760,6 +760,34 @@ static inline void hlist_del_init(struct if (!hlist_unhashed(n)) { __hlist_del(n); INIT_HLIST_NODE(n); + } +} + +/** + * hlist_del_init_rcu - deletes entry from hash list with re-initialization + * @n: the element to delete from the hash list. + * + * Note: list_unhashed() on entry does return true after this. 
Signed-off-by: Andrea Arcangeli Signed-off-by: Nick Piggin Signed-off-by: Christoph Lameter diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -21,6 +21,7 @@ config KVM tristate "Kernel-based Virtual Machine (KVM) support" depends on HAVE_KVM select PREEMPT_NOTIFIERS + select MMU_NOTIFIER select ANON_INODES ---help--- Support hosting fully virtualized guest machines using hardware diff --git a/include/linux/list.h b/include/linux/list.h --- a/include/linux/list.h +++ b/include/linux/list.h @@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis * or hlist_del_rcu(), running on this same list. * However, it is perfectly legal to run concurrently with * the _rcu list-traversal primitives, such as - * hlist_for_each_entry(). + * hlist_for_each_entry_rcu(). */ static inline void hlist_del_rcu(struct hlist_node *n) { @@ -760,6 +760,34 @@ static inline void hlist_del_init(struct if (!hlist_unhashed(n)) { __hlist_del(n); INIT_HLIST_NODE(n); + } +} + +/** + * hlist_del_init_rcu - deletes entry from hash list with re-initialization + * @n: the element to delete from the hash list. + * + * Note: list_unhashed() on entry does return true after this. It is + * useful for RCU based read lockfree traversal if the writer side + * must know if the list entry is still hashed or already unhashed. + * + * In particular, it means that we can not poison the forward pointers + * that may still be used for walking the hash list and we can only + * zero the pprev pointer so list_unhashed() will return true after + * this. + * + * The caller must take whatever precautions are necessary (such as + * holding appropriate locks) to avoid racing with another + * list-mutation primitive, such as hlist_add_head_rcu() or + * hlist_del_rcu(), running on this same list. However, it is + * perfectly legal to run concurrently with the _rcu list-traversal + * primitives, such as hlist_for_each_entry_rcu(). + */ +static inline void hlist_del_init_rcu(struct hlist_node *n) +{ + if (!hlist_unhashed(n)) { + __hlist_del(n); + n->pprev = NULL; } } diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1084,6 +1084,15 @@ extern int install_special_mapping(struc unsigned long addr, unsigned long len, unsigned long flags, struct page **pages); +struct mm_lock_data { + spinlock_t **i_mmap_locks; + spinlock_t **anon_vma_locks; + size_t nr_i_mmap_locks; + size_t nr_anon_vma_locks; +}; +extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data); +extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data); + extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -10,6 +10,7 @@ #include #include #include +#include #include #include @@ -19,6 +20,7 @@ #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1)) struct address_space; +struct mmu_notifier_mm; #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS typedef atomic_long_t mm_counter_t; @@ -235,6 +237,9 @@ struct mm_struct { struct file *exe_file; unsigned long num_exe_file_vmas; #endif +#ifdef CONFIG_MMU_NOTIFIER + struct mmu_notifier_mm *mmu_notifier_mm; +#endif }; #endif /* _LINUX_MM_TYPES_H */ diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h new file mode 100644 --- /dev/null +++ b/include/linux/mmu_notifier.h @@ -0,0 +1,265 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +#include +#include +#include +#include + +struct mmu_notifier; +struct mmu_notifier_ops; + +#ifdef CONFIG_MMU_NOTIFIER + +/* + * The mmu_notifier_mm structure is allocated and installed in + * mm->mmu_notifier_mm inside the mm_lock() protected critical section + * and it's released only when mm_count reaches zero in mmdrop(). + */ +struct mmu_notifier_mm { + /* all mmu notifiers registered in this mm are queued in this list */ + struct hlist_head list; + /* srcu structure for this mm */ + struct srcu_struct srcu; + /* to serialize the list modifications and hlist_unhashed */ + spinlock_t lock; +}; + +struct mmu_notifier_ops { + /* + * Called either by mmu_notifier_unregister or when the mm is + * being destroyed by exit_mmap, always before all pages are + * freed. It's mandatory to implement this method. This can + * run concurrently with other mmu notifier methods and it + * should tear down all secondary mmu mappings and freeze the + * secondary mmu.
+ */ + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + + /* + * clear_flush_young is called after the VM is + * test-and-clearing the young/accessed bitflag in the + * pte. This way the VM will provide proper aging to the + * accesses to the page through the secondary MMUs and not + * only to the ones through the Linux pte. + */ + int (*clear_flush_young)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * Before this is invoked any secondary MMU is still ok to + * read/write to the page previously pointed to by the Linux + * pte because the page hasn't been freed yet and it won't be + * freed until this returns. If required set_page_dirty has to + * be called internally to this method. + */ + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * invalidate_range_start() and invalidate_range_end() must be + * paired and are called only when the mmap_sem and/or the + * locks protecting the reverse maps are held. Both functions + * may sleep. The subsystem must guarantee that no additional + * references are taken to the pages in the range established + * between the call to invalidate_range_start() and the + * matching call to invalidate_range_end(). + * + * Invalidation of multiple concurrent ranges may be + * optionally permitted by the driver. Either way the + * establishment of sptes is forbidden in the range passed to + * invalidate_range_start/end for the whole duration of the + * invalidate_range_start/end critical section. + * + * invalidate_range_start() is called when all pages in the + * range are still mapped and have at least a refcount of one. + * + * invalidate_range_end() is called when all pages in the + * range have been unmapped and the pages have been freed by + * the VM. + * + * The VM will remove the page table entries and potentially + * the page between invalidate_range_start() and + * invalidate_range_end(). If the page must not be freed + * because of pending I/O or other circumstances then the + * invalidate_range_start() callback (or the initial mapping + * by the driver) must make sure that the refcount is kept + * elevated. + * + * If the driver increases the refcount when the pages are + * initially mapped into an address space then either + * invalidate_range_start() or invalidate_range_end() may + * decrease the refcount. If the refcount is decreased on + * invalidate_range_start() then the VM can free pages as page + * table entries are removed. If the refcount is only + * dropped on invalidate_range_end() then the driver itself + * will drop the last refcount but it must take care to flush + * any secondary tlb before doing the final free on the + * page. Pages will no longer be referenced by the Linux + * address space but may still be referenced by sptes until + * the last refcount is dropped. + */ + void (*invalidate_range_start)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_range_end)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); +}; + +/* + * The notifier chains are protected by mmap_sem and/or the reverse map + * semaphores. Notifier chains are only changed when all reverse maps and + * the mmap_sem locks are taken. + * + * Therefore notifier chains can only be traversed when either + * + * 1. mmap_sem is held. + * 2. One of the reverse map locks is held (i_mmap_sem or anon_vma->sem). + * 3.
No other concurrent thread can access the list (release) + */ +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_ops *ops; +}; + +static inline int mm_has_notifiers(struct mm_struct *mm) +{ + return unlikely(mm->mmu_notifier_mm); +} + +extern int mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void __mmu_notifier_mm_destroy(struct mm_struct *mm); +extern void __mmu_notifier_release(struct mm_struct *mm); +extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end); +extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end); + + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_release(mm); +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + return __mmu_notifier_clear_flush_young(mm, address); + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_page(mm, address); +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_start(mm, start, end); +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_end(mm, start, end); +} + +static inline void mmu_notifier_mm_init(struct mm_struct *mm) +{ + mm->mmu_notifier_mm = NULL; +} + +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_mm_destroy(mm); +} + +/* + * These two macros will eventually replace ptep_clear_flush. + * ptep_clear_flush is implemented as a macro itself, so this also is + * implemented as a macro until ptep_clear_flush is converted to an + * inline function, to diminish the risk of compilation failure. The + * invalidate_page method over time can be moved outside the PT lock + * and these two macros can be later removed.
+ */ +#define ptep_clear_flush_notify(__vma, __address, __ptep) \ +({ \ + pte_t __pte; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __pte = ptep_clear_flush(___vma, ___address, __ptep); \ + mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ + __pte; \ +}) + +#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ +({ \ + int __young; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ + __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \ + ___address); \ + __young; \ +}) + +#else /* CONFIG_MMU_NOTIFIER */ + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_mm_init(struct mm_struct *mm) +{ +} + +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) +{ +} + +#define ptep_clear_flush_young_notify ptep_clear_flush_young +#define ptep_clear_flush_notify ptep_clear_flush + +#endif /* CONFIG_MMU_NOTIFIER */ + +#endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/include/linux/srcu.h b/include/linux/srcu.h --- a/include/linux/srcu.h +++ b/include/linux/srcu.h @@ -27,6 +27,8 @@ #ifndef _LINUX_SRCU_H #define _LINUX_SRCU_H +#include + struct srcu_struct_array { int c[2]; }; diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -53,6 +53,7 @@ #include #include #include +#include #include #include @@ -385,6 +386,7 @@ static struct mm_struct * mm_init(struct if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; + mmu_notifier_mm_init(mm); return mm; } @@ -417,6 +419,7 @@ void __mmdrop(struct mm_struct *mm) BUG_ON(mm == &init_mm); mm_free_pgd(mm); destroy_context(mm); + mmu_notifier_mm_destroy(mm); free_mm(mm); } EXPORT_SYMBOL_GPL(__mmdrop); diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -205,3 +205,6 @@ config VIRT_TO_BUS config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MMU_NOTIFIER + bool diff --git a/mm/Makefile b/mm/Makefile --- a/mm/Makefile +++ b/mm/Makefile @@ -33,4 +33,5 @@ obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -188,7 +188,7 @@ __xip_unmap (struct address_space * mapp if (pte) { /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); diff --git a/mm/fremap.c b/mm/fremap.c --- a/mm/fremap.c +++ b/mm/fremap.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns spin_unlock(&mapping->i_mmap_lock); } + mmu_notifier_invalidate_range_start(mm, start, start + size); err = populate_range(mm, vma, start, size, pgoff); + mmu_notifier_invalidate_range_end(mm, start, start + size); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include @@ -823,6 +824,7 @@ void __unmap_hugepage_range(struct vm_ar BUG_ON(start & ~HPAGE_MASK); BUG_ON(end & ~HPAGE_MASK); + mmu_notifier_invalidate_range_start(mm, start, end); spin_lock(&mm->page_table_lock); for (address = start; address < end; address += HPAGE_SIZE) { ptep = huge_pte_offset(mm, address); @@ -843,6 +845,7 @@ void __unmap_hugepage_range(struct vm_ar } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mmu_notifier_invalidate_range_end(mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -51,6 +51,7 @@ #include #include #include +#include #include #include @@ -632,6 +633,7 @@ int copy_page_range(struct mm_struct *ds unsigned long next; unsigned long addr = vma->vm_start; unsigned long end = vma->vm_end; + int ret; /* * Don't copy ptes where a page fault will fill them correctly. @@ -647,17 +649,33 @@ int copy_page_range(struct mm_struct *ds if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); + /* + * We need to invalidate the secondary MMU mappings only when + * there could be a permission downgrade on the ptes of the + * parent mm. And a permission downgrade will only happen if + * is_cow_mapping() returns true. + */ + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_start(src_mm, addr, end); + + ret = 0; dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(src_pgd)) continue; - if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, - vma, addr, next)) - return -ENOMEM; + if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, + vma, addr, next))) { + ret = -ENOMEM; + break; + } } while (dst_pgd++, src_pgd++, addr = next, addr != end); - return 0; + + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_end(src_mm, + vma->vm_start, end); + return ret; } static unsigned long zap_pte_range(struct mmu_gather *tlb, @@ -861,7 +879,9 @@ unsigned long unmap_vmas(struct mmu_gath unsigned long start = start_addr; spinlock_t *i_mmap_lock = details? 
details->i_mmap_lock: NULL; int fullmm = (*tlbp)->fullmm; + struct mm_struct *mm = vma->vm_mm; + mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { unsigned long end; @@ -912,6 +932,7 @@ unsigned long unmap_vmas(struct mmu_gath } } out: + mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ } @@ -1541,10 +1562,11 @@ int apply_to_page_range(struct mm_struct { pgd_t *pgd; unsigned long next; - unsigned long end = addr + size; + unsigned long start = addr, end = addr + size; int err; BUG_ON(addr >= end); + mmu_notifier_invalidate_range_start(mm, start, end); pgd = pgd_offset(mm, addr); do { next = pgd_addr_end(addr, end); @@ -1552,6 +1574,7 @@ int apply_to_page_range(struct mm_struct if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier_invalidate_range_end(mm, start, end); return err; } EXPORT_SYMBOL_GPL(apply_to_page_range); @@ -1753,7 +1776,7 @@ gotten: * seen in the presence of one thread doing SMC and another * thread doing COW. */ - ptep_clear_flush(vma, address, page_table); + ptep_clear_flush_notify(vma, address, page_table); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -26,6 +26,9 @@ #include #include #include +#include +#include +#include #include #include @@ -2048,6 +2051,7 @@ void exit_mmap(struct mm_struct *mm) /* mm's last user has gone, and its about to be pulled down */ arch_exit_mmap(mm); + mmu_notifier_release(mm); lru_add_drain(); flush_cache_mm(mm); @@ -2255,3 +2259,190 @@ int install_special_mapping(struct mm_st return 0; } + +static int mm_lock_cmp(const void *a, const void *b) +{ + unsigned long _a = (unsigned long)*(spinlock_t **)a; + unsigned long _b = (unsigned long)*(spinlock_t **)b; + + cond_resched(); + if (_a < _b) + return -1; + if (_a > _b) + return 1; + return 0; +} + +static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks, + int anon) +{ + struct vm_area_struct *vma; + size_t i = 0; + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + if (anon) { + if (vma->anon_vma) + locks[i++] = &vma->anon_vma->lock; + } else { + if (vma->vm_file && vma->vm_file->f_mapping) + locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock; + } + } + + if (!i) + goto out; + + sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL); + +out: + return i; +} + +static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm, + spinlock_t **locks) +{ + return mm_lock_sort(mm, locks, 1); +} + +static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm, + spinlock_t **locks) +{ + return mm_lock_sort(mm, locks, 0); +} + +static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock) +{ + spinlock_t *last = NULL; + size_t i; + + for (i = 0; i < nr; i++) + /* Multiple vmas may use the same lock. */ + if (locks[i] != last) { + BUG_ON((unsigned long) last > (unsigned long) locks[i]); + last = locks[i]; + if (lock) + spin_lock(last); + else + spin_unlock(last); + } +} + +static inline void __mm_lock(spinlock_t **locks, size_t nr) +{ + mm_lock_unlock(locks, nr, 1); +} + +static inline void __mm_unlock(spinlock_t **locks, size_t nr) +{ + mm_lock_unlock(locks, nr, 0); +} + +/* + * This operation locks against the VM for all pte/vma/mm related + * operations that could ever happen on a certain mm. This includes + * vmtruncate, try_to_unmap, and all page faults. 
The holder + * must not hold any mm related lock. A single task can't take more + * than one mm_lock in a row or it would deadlock. + * + * The mmap_sem must be taken in write mode to block all operations + * that could modify pagetables and free pages without altering the + * vma layout (for example populate_range() with nonlinear vmas). + * + * The sorting is needed to avoid lock inversion deadlocks if two + * tasks run mm_lock at the same time on different mms that happen to + * share some anon_vmas/inodes but map them in a different order. + * + * mm_lock and mm_unlock are expensive operations that may have to + * take thousands of locks. Thanks to sort() the complexity is + * O(N*log(N)) where N is the number of VMAs in the mm. The max number + * of vmas is defined in /proc/sys/vm/max_map_count. + * + * mm_lock() can fail if memory allocation fails. The worst case + * vmalloc allocation required is 2*max_map_count*sizeof(spinlock_t *), + * so around 1Mbyte, but in practice it'll be much less because + * normally there won't be max_map_count vmas allocated in the task + * that runs mm_lock(). + * + * The vmalloc memory allocated by mm_lock is stored in the + * mm_lock_data structure that must be allocated by the caller and it + * must later be passed to mm_unlock, which will free it after using it. + * Allocating the mm_lock_data structure on the stack is fine because + * it's only a few dozen bytes in size. + * + * If mm_lock() returns -ENOMEM no memory has been allocated and the + * mm_lock_data structure can be freed immediately, and mm_unlock must + * not be called. + */ +int mm_lock(struct mm_struct *mm, struct mm_lock_data *data) +{ + spinlock_t **anon_vma_locks, **i_mmap_locks; + + down_write(&mm->mmap_sem); + if (mm->map_count) { + anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); + if (unlikely(!anon_vma_locks)) { + up_write(&mm->mmap_sem); + return -ENOMEM; + } + + i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); + if (unlikely(!i_mmap_locks)) { + up_write(&mm->mmap_sem); + vfree(anon_vma_locks); + return -ENOMEM; + } + + /* + * When mm_lock_sort_anon_vma/i_mmap returns zero it + * means there's no lock to take and so we can free + * the array here without waiting for mm_unlock. mm_unlock + * will do nothing if nr_i_mmap/anon_vma_locks is + * zero. + */ + data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks); + data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks); + + if (data->nr_anon_vma_locks) { + __mm_lock(anon_vma_locks, data->nr_anon_vma_locks); + data->anon_vma_locks = anon_vma_locks; + } else + vfree(anon_vma_locks); + + if (data->nr_i_mmap_locks) { + __mm_lock(i_mmap_locks, data->nr_i_mmap_locks); + data->i_mmap_locks = i_mmap_locks; + } else + vfree(i_mmap_locks); + } + return 0; +} + +static void mm_unlock_vfree(spinlock_t **locks, size_t nr) +{ + __mm_unlock(locks, nr); + vfree(locks); +} + +/* + * mm_unlock doesn't require any memory allocation and it won't fail. + * + * All memory has been previously allocated by mm_lock and it will all + * be freed before returning. Only after mm_unlock returns is the + * caller allowed to free and forget the mm_lock_data structure. + * + * mm_unlock runs in O(N) where N is the max number of VMAs in the + * mm. The max number of vmas is defined in + * /proc/sys/vm/max_map_count.
+ */ +void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data) +{ + if (mm->map_count) { + if (data->nr_anon_vma_locks) + mm_unlock_vfree(data->anon_vma_locks, + data->nr_anon_vma_locks); + if (data->nr_i_mmap_locks) + mm_unlock_vfree(data->i_mmap_locks, + data->nr_i_mmap_locks); + } + up_write(&mm->mmap_sem); +} diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c new file mode 100644 --- /dev/null +++ b/mm/mmu_notifier.c @@ -0,0 +1,269 @@ +/* + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. + * Copyright (C) 2008 SGI + * Christoph Lameter + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include +#include +#include +#include +#include +#include +#include + +/* + * This function can't run concurrently with mmu_notifier_register + * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap + * runs with mm_users == 0. Other tasks may still invoke mmu notifiers + * in parallel despite there being no task using this mm any more, + * through the vmas outside of the exit_mmap context, such as with + * vmtruncate. This serializes against mmu_notifier_unregister with + * the mmu_notifier_mm->lock in addition to SRCU and it serializes + * against the other mmu notifiers with SRCU. struct mmu_notifier_mm + * can't go away from under us as exit_mmap holds an mm_count pin + * itself. + */ +void __mmu_notifier_release(struct mm_struct *mm) +{ + struct mmu_notifier *mn; + int srcu; + + spin_lock(&mm->mmu_notifier_mm->lock); + while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) { + mn = hlist_entry(mm->mmu_notifier_mm->list.first, + struct mmu_notifier, + hlist); + /* + * We arrived before mmu_notifier_unregister so + * mmu_notifier_unregister will do nothing other than + * to wait for ->release to finish and for + * mmu_notifier_unregister to return. + */ + hlist_del_init_rcu(&mn->hlist); + /* + * SRCU here will block mmu_notifier_unregister until + * ->release returns. + */ + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + spin_unlock(&mm->mmu_notifier_mm->lock); + /* + * if ->release runs before mmu_notifier_unregister it + * must be handled as it's the only way for the driver + * to flush all existing sptes and stop the driver + * from establishing any more sptes before all the + * pages in the mm are freed. + */ + mn->ops->release(mn, mm); + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + spin_lock(&mm->mmu_notifier_mm->lock); + } + spin_unlock(&mm->mmu_notifier_mm->lock); + + /* + * synchronize_srcu here prevents mmu_notifier_release from + * returning to exit_mmap (which would proceed freeing all pages + * in the mm) until the ->release method returns, if it was + * invoked by mmu_notifier_unregister. + * + * The mmu_notifier_mm can't go away from under us because one + * mm_count is held by exit_mmap. + */ + synchronize_srcu(&mm->mmu_notifier_mm->srcu); +} + +/* + * If no young bitflag is supported by the hardware, ->clear_flush_young can + * unmap the address and return 1 or 0 depending on whether the mapping + * previously existed or not.
+ */ +int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int young = 0, srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->clear_flush_young) + young |= mn->ops->clear_flush_young(mn, mm, address); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + + return young; +} + +void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_page) + mn->ops->invalidate_page(mn, mm, address); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_range_start) + mn->ops->invalidate_range_start(mn, mm, start, end); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_range_end) + mn->ops->invalidate_range_end(mn, mm, start, end); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +/* + * Must not hold mmap_sem nor any other VM related lock when calling + * this registration function. Must also ensure mm_users can't go down + * to zero while this runs to avoid races with mmu_notifier_release, + * so mm has to be current->mm or the mm should be pinned safely such + * as with get_task_mm(). If the mm is not current->mm, the mm_users + * pin should be released by calling mmput after mmu_notifier_register + * returns. mmu_notifier_unregister must always be called to + * unregister the notifier. mm_count is automatically pinned to allow + * mmu_notifier_unregister to safely run at any time later, before or + * after exit_mmap. ->release will always be called before exit_mmap + * frees the pages. + */ +int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + struct mm_lock_data data; + struct mmu_notifier_mm * mmu_notifier_mm; + int ret; + + BUG_ON(atomic_read(&mm->mm_users) <= 0); + + ret = -ENOMEM; + mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL); + if (unlikely(!mmu_notifier_mm)) + goto out; + + ret = init_srcu_struct(&mmu_notifier_mm->srcu); + if (unlikely(ret)) + goto out_kfree; + + ret = mm_lock(mm, &data); + if (unlikely(ret)) + goto out_cleanup; + + if (!mm_has_notifiers(mm)) { + INIT_HLIST_HEAD(&mmu_notifier_mm->list); + spin_lock_init(&mmu_notifier_mm->lock); + mm->mmu_notifier_mm = mmu_notifier_mm; + mmu_notifier_mm = NULL; + } + atomic_inc(&mm->mm_count); + + /* + * Serialize the update against mmu_notifier_unregister. A + * side note: mmu_notifier_release can't run concurrently with + * us because we hold the mm_users pin (either implicitly as + * current->mm or explicitly with get_task_mm() or similar).
+ * We can't race against any other mmu notifiers either thanks + * to mm_lock(). + */ + spin_lock(&mm->mmu_notifier_mm->lock); + hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list); + spin_unlock(&mm->mmu_notifier_mm->lock); + + mm_unlock(mm, &data); +out_cleanup: + if (mmu_notifier_mm) + cleanup_srcu_struct(&mmu_notifier_mm->srcu); +out_kfree: + /* kfree() does nothing if mmu_notifier_mm is NULL */ + kfree(mmu_notifier_mm); +out: + BUG_ON(atomic_read(&mm->mm_users) <= 0); + return ret; +} +EXPORT_SYMBOL_GPL(mmu_notifier_register); + +/* this is called after the last mmu_notifier_unregister() returned */ +void __mmu_notifier_mm_destroy(struct mm_struct *mm) +{ + BUG_ON(!hlist_empty(&mm->mmu_notifier_mm->list)); + cleanup_srcu_struct(&mm->mmu_notifier_mm->srcu); + kfree(mm->mmu_notifier_mm); + mm->mmu_notifier_mm = LIST_POISON1; /* debug */ +} + +/* + * This releases the mm_count pin automatically and frees the mm + * structure if it was the last user of it. It serializes against + * running mmu notifiers with SRCU and against mmu_notifier_unregister + * with the unregister lock + SRCU. All sptes must be dropped before + * calling mmu_notifier_unregister. ->release or any other notifier + * method may be invoked concurrently with mmu_notifier_unregister, + * and only after mmu_notifier_unregister has returned are we guaranteed + * that ->release or any other method can't run anymore. + */ +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) +{ + BUG_ON(atomic_read(&mm->mm_count) <= 0); + + spin_lock(&mm->mmu_notifier_mm->lock); + if (!hlist_unhashed(&mn->hlist)) { + int srcu; + + hlist_del_rcu(&mn->hlist); + + /* + * SRCU here will force exit_mmap to wait for ->release to + * finish before freeing the pages. + */ + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + spin_unlock(&mm->mmu_notifier_mm->lock); + /* + * exit_mmap will block in mmu_notifier_release to + * guarantee ->release is called before freeing the + * pages. + */ + mn->ops->release(mn, mm); + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + } else + spin_unlock(&mm->mmu_notifier_mm->lock); + + /* + * Wait for any running method to finish, of course including + * ->release if it was run by mmu_notifier_release instead of us.
+ */ + synchronize_srcu(&mm->mmu_notifier_mm->srcu); + + BUG_ON(atomic_read(&mm->mm_count) <= 0); + + mmdrop(mm); +} +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); diff --git a/mm/mprotect.c b/mm/mprotect.c --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -198,10 +199,12 @@ success: dirty_accountable = 1; } + mmu_notifier_invalidate_range_start(mm, start, end); if (is_vm_hugetlb_page(vma)) hugetlb_change_protection(vma, start, end, vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); + mmu_notifier_invalidate_range_end(mm, start, end); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); return 0; diff --git a/mm/mremap.c b/mm/mremap.c --- a/mm/mremap.c +++ b/mm/mremap.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -74,7 +75,11 @@ static void move_ptes(struct vm_area_str struct mm_struct *mm = vma->vm_mm; pte_t *old_pte, *new_pte, pte; spinlock_t *old_ptl, *new_ptl; + unsigned long old_start; + old_start = old_addr; + mmu_notifier_invalidate_range_start(vma->vm_mm, + old_start, old_end); if (vma->vm_file) { /* * Subtle point from Rajesh Venkatasubramanian: before @@ -116,6 +121,7 @@ static void move_ptes(struct vm_area_str pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) spin_unlock(&mapping->i_mmap_lock); + mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); } #define LATENCY_LIMIT (64 * PAGE_SIZE) diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -49,6 +49,7 @@ #include #include #include +#include #include @@ -287,7 +288,7 @@ static int page_referenced_one(struct pa if (vma->vm_flags & VM_LOCKED) { referenced++; *mapcount = 1; /* break early from loop */ - } else if (ptep_clear_flush_young(vma, address, pte)) + } else if (ptep_clear_flush_young_notify(vma, address, pte)) referenced++; /* Pretend the page is referenced if the task has the @@ -457,7 +458,7 @@ static int page_mkclean_one(struct page pte_t entry; flush_cache_page(vma, address, pte_pfn(*pte)); - entry = ptep_clear_flush(vma, address, pte); + entry = ptep_clear_flush_notify(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -717,14 +718,14 @@ static int try_to_unmap_one(struct page * skipped over this mm) then we should reactivate it. */ if (!migration && ((vma->vm_flags & VM_LOCKED) || - (ptep_clear_flush_young(vma, address, pte)))) { + (ptep_clear_flush_young_notify(vma, address, pte)))) { ret = SWAP_FAIL; goto out_unmap; } /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -849,12 +850,12 @@ static void try_to_unmap_cluster(unsigne page = vm_normal_page(vma, address, *pte); BUG_ON(!page || PageAnon(page)); - if (ptep_clear_flush_young(vma, address, pte)) + if (ptep_clear_flush_young_notify(vma, address, pte)) continue; /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* If nonlinear, store the file page offset in the pte. 
*/ if (page->index != linear_page_index(vma, address)) From andrea at qumranet.com Fri May 2 08:05:08 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:08 +0200 Subject: [ofa-general] [PATCH 05 of 11] unmap vmas tlb flushing In-Reply-To: Message-ID: # HG changeset patch # User Andrea Arcangeli # Date 1209740186 -7200 # Node ID a8ac53b928dfcea0ccb326fb7d71f908f0df85f4 # Parent 14e9f5a12bb1657fa6756e18d5dac71d4ad1a55e unmap vmas tlb flushing Move the tlb flushing inside of unmap vmas. This saves us from passing a pointer to the TLB structure around and simplifies the callers. Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -744,8 +744,7 @@ struct page *vm_normal_page(struct vm_ar unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *); -unsigned long unmap_vmas(struct mmu_gather **tlb, - struct vm_area_struct *start_vma, unsigned long start_addr, +unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr, unsigned long end_addr, unsigned long *nr_accounted, struct zap_details *); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -849,7 +849,6 @@ static unsigned long unmap_page_range(st /** * unmap_vmas - unmap a range of memory covered by a list of vma's - * @tlbp: address of the caller's struct mmu_gather * @vma: the starting vma * @start_addr: virtual address at which to start unmapping * @end_addr: virtual address at which to end unmapping @@ -861,20 +860,13 @@ static unsigned long unmap_page_range(st * Unmap all pages in the vma list. * * We aim to not hold locks for too long (for scheduling latency reasons). - * So zap pages in ZAP_BLOCK_SIZE bytecounts. This means we need to - * return the ending mmu_gather to the caller. + * So zap pages in ZAP_BLOCK_SIZE bytecounts. * * Only addresses between `start' and `end' will be unmapped. * * The VMA list must be sorted in ascending virtual address order. - * - * unmap_vmas() assumes that the caller will flush the whole unmapped address - * range after unmap_vmas() returns. So the only responsibility here is to - * ensure that any thus-far unmapped pages are flushed before unmap_vmas() - * drops the lock and schedules. */ -unsigned long unmap_vmas(struct mmu_gather **tlbp, - struct vm_area_struct *vma, unsigned long start_addr, +unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr, unsigned long end_addr, unsigned long *nr_accounted, struct zap_details *details) { @@ -883,9 +875,14 @@ unsigned long unmap_vmas(struct mmu_gath int tlb_start_valid = 0; unsigned long start = start_addr; spinlock_t *i_mmap_lock = details? 
details->i_mmap_lock: NULL; - int fullmm = (*tlbp)->fullmm; + int fullmm; + struct mmu_gather *tlb; struct mm_struct *mm = vma->vm_mm; + lru_add_drain(); + tlb = tlb_gather_mmu(mm, 0); + update_hiwater_rss(mm); + fullmm = tlb->fullmm; mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { unsigned long end; @@ -912,7 +909,7 @@ unsigned long unmap_vmas(struct mmu_gath (HPAGE_SIZE / PAGE_SIZE); start = end; } else - start = unmap_page_range(*tlbp, vma, + start = unmap_page_range(tlb, vma, start, end, &zap_work, details); if (zap_work > 0) { @@ -920,22 +917,23 @@ unsigned long unmap_vmas(struct mmu_gath break; } - tlb_finish_mmu(*tlbp, tlb_start, start); + tlb_finish_mmu(tlb, tlb_start, start); if (need_resched() || (i_mmap_lock && spin_needbreak(i_mmap_lock))) { if (i_mmap_lock) { - *tlbp = NULL; + tlb = NULL; goto out; } cond_resched(); } - *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm); + tlb = tlb_gather_mmu(vma->vm_mm, fullmm); tlb_start_valid = 0; zap_work = ZAP_BLOCK_SIZE; } } + tlb_finish_mmu(tlb, start_addr, end_addr); out: mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ @@ -951,18 +949,10 @@ unsigned long zap_page_range(struct vm_a unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *details) { - struct mm_struct *mm = vma->vm_mm; - struct mmu_gather *tlb; unsigned long end = address + size; unsigned long nr_accounted = 0; - lru_add_drain(); - tlb = tlb_gather_mmu(mm, 0); - update_hiwater_rss(mm); - end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); - if (tlb) - tlb_finish_mmu(tlb, address, end); - return end; + return unmap_vmas(vma, address, end, &nr_accounted, details); } /* diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1751,15 +1751,10 @@ static void unmap_region(struct mm_struc unsigned long start, unsigned long end) { struct vm_area_struct *next = prev? prev->vm_next: mm->mmap; - struct mmu_gather *tlb; unsigned long nr_accounted = 0; - lru_add_drain(); - tlb = tlb_gather_mmu(mm, 0); - update_hiwater_rss(mm); - unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL); + unmap_vmas(vma, start, end, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - tlb_finish_mmu(tlb, start, end); free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? next->vm_start: 0); } @@ -2044,7 +2039,6 @@ EXPORT_SYMBOL(do_brk); /* Release all mmaps. 
*/ void exit_mmap(struct mm_struct *mm) { - struct mmu_gather *tlb; struct vm_area_struct *vma = mm->mmap; unsigned long nr_accounted = 0; unsigned long end; @@ -2055,12 +2049,11 @@ void exit_mmap(struct mm_struct *mm) lru_add_drain(); flush_cache_mm(mm); - tlb = tlb_gather_mmu(mm, 1); + /* Don't update_hiwater_rss(mm) here, do_exit already did */ /* Use -1 here to ensure all VMAs in the mm are unmapped */ - end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL); + end = unmap_vmas(vma, 0, -1, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - tlb_finish_mmu(tlb, 0, end); free_pgtables(vma, FIRST_USER_ADDRESS, 0); /* From andrea at qumranet.com Fri May 2 08:05:09 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:09 +0200 Subject: [ofa-general] [PATCH 06 of 11] rwsem contended In-Reply-To: Message-ID: <74b873f3ea07012e2fc8.1209740709@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1209740186 -7200 # Node ID 74b873f3ea07012e2fc864f203edf1179865feb1 # Parent a8ac53b928dfcea0ccb326fb7d71f908f0df85f4 rwsem contended Add a function to rw_semaphores to check if there are any processes waiting for the semaphore. Add rwsem_needbreak to sched.h that works in the same way as spinlock_needbreak().
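For illustration, a long-running loop that holds an rwsem could use the new helper to voluntarily break the lock, along these lines (a hypothetical sketch; more_work() and do_some_work() are made-up placeholders, and patch 07 applies exactly this pattern to i_mmap_sem in unmap_vmas):

	down_read(&sem);
	while (more_work()) {
		do_some_work();
		/* back off if somebody is waiting for the semaphore */
		if (need_resched() || rwsem_needbreak(&sem)) {
			up_read(&sem);
			cond_resched();
			down_read(&sem);
		}
	}
	up_read(&sem);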
Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h --- a/include/linux/rwsem.h +++ b/include/linux/rwsem.h @@ -57,6 +57,8 @@ extern void up_write(struct rw_semaphore */ extern void downgrade_write(struct rw_semaphore *sem); +extern int rwsem_is_contended(struct rw_semaphore *sem); + #ifdef CONFIG_DEBUG_LOCK_ALLOC /* * nested locking. NOTE: rwsems are not allowed to recurse diff --git a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2030,6 +2030,15 @@ static inline int spin_needbreak(spinloc #endif } +static inline int rwsem_needbreak(struct rw_semaphore *sem) +{ +#ifdef CONFIG_PREEMPT + return rwsem_is_contended(sem); +#else + return 0; +#endif +} + /* * Reevaluate whether the task has signals pending delivery. * Wake the task if so. diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c --- a/lib/rwsem-spinlock.c +++ b/lib/rwsem-spinlock.c @@ -305,6 +305,18 @@ void __downgrade_write(struct rw_semapho spin_unlock_irqrestore(&sem->wait_lock, flags); } +int rwsem_is_contended(struct rw_semaphore *sem) +{ + /* + * Racy check for an empty list. False positives or negatives + * would be okay. False positives may cause a useless dropping of + * locks. False negatives may cause locks to be held a bit + * longer until the next check. + */ + return !list_empty(&sem->wait_list); +} + +EXPORT_SYMBOL(rwsem_is_contended); EXPORT_SYMBOL(__init_rwsem); EXPORT_SYMBOL(__down_read); EXPORT_SYMBOL(__down_read_trylock); diff --git a/lib/rwsem.c b/lib/rwsem.c --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -251,6 +251,18 @@ asmregparm struct rw_semaphore *rwsem_do return sem; } +int rwsem_is_contended(struct rw_semaphore *sem) +{ + /* + * Racy check for an empty list. False positives or negatives + * would be okay. False positives may cause a useless dropping of + * locks. False negatives may cause locks to be held a bit + * longer until the next check. + */ + return !list_empty(&sem->wait_list); + } + +EXPORT_SYMBOL(rwsem_is_contended); EXPORT_SYMBOL(rwsem_down_read_failed); EXPORT_SYMBOL(rwsem_down_write_failed); EXPORT_SYMBOL(rwsem_wake); From andrea at qumranet.com Fri May 2 08:05:03 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:03 +0200 Subject: [ofa-general] [PATCH 00 of 11] mmu notifier #v15 Message-ID: Hello everyone, 1/11 is the latest version of the mmu-notifier-core patch. As usual all later 2-11/11 patches follow, but those aren't meant for 2.6.26. Thanks! Andrea From andrea at qumranet.com Fri May 2 08:05:10 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:10 +0200 Subject: [ofa-general] [PATCH 07 of 11] i_mmap_rwsem In-Reply-To: Message-ID: # HG changeset patch # User Andrea Arcangeli # Date 1209740186 -7200 # Node ID de28c85baef11b90c993047ca851a2f52c85a5be # Parent 74b873f3ea07012e2fc864f203edf1179865feb1 i_mmap_rwsem The conversion to a rwsem allows notifier callbacks during rmap traversal for files. A rw style lock also allows concurrent walking of the reverse map, so that multiple processors can expire pages in the same memory area of the same process. This increases the potential concurrency. Signed-off-by: Andrea Arcangeli Signed-off-by: Christoph Lameter diff --git a/Documentation/vm/locking b/Documentation/vm/locking --- a/Documentation/vm/locking +++ b/Documentation/vm/locking @@ -66,7 +66,7 @@ expand_stack(), it is hard to come up wi expand_stack(), it is hard to come up with a destructive scenario without having the vmlist protection in this case. -The page_table_lock nests with the inode i_mmap_lock and the kmem cache +The page_table_lock nests with the inode i_mmap_sem and the kmem cache c_spinlock spinlocks. This is okay, since the kmem code asks for pages after dropping c_spinlock.
The page_table_lock also nests with pagecache_lock and pagemap_lru_lock spinlocks, and no code asks for memory with these locks diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c --- a/arch/x86/mm/hugetlbpage.c +++ b/arch/x86/mm/hugetlbpage.c @@ -69,7 +69,7 @@ static void huge_pmd_share(struct mm_str if (!vma_shareable(vma, addr)) return; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) { if (svma == vma) continue; @@ -94,7 +94,7 @@ static void huge_pmd_share(struct mm_str put_page(virt_to_page(spte)); spin_unlock(&mm->page_table_lock); out: - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); } /* diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -454,10 +454,10 @@ static int hugetlb_vmtruncate(struct ino pgoff = offset >> PAGE_SHIFT; i_size_write(inode, offset); - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); if (!prio_tree_empty(&mapping->i_mmap)) hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff); - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); truncate_hugepages(inode, offset); return 0; } diff --git a/fs/inode.c b/fs/inode.c --- a/fs/inode.c +++ b/fs/inode.c @@ -210,7 +210,7 @@ void inode_init_once(struct inode *inode INIT_LIST_HEAD(&inode->i_devices); INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC); rwlock_init(&inode->i_data.tree_lock); - spin_lock_init(&inode->i_data.i_mmap_lock); + init_rwsem(&inode->i_data.i_mmap_sem); INIT_LIST_HEAD(&inode->i_data.private_list); spin_lock_init(&inode->i_data.private_lock); INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap); diff --git a/include/linux/fs.h b/include/linux/fs.h --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -502,7 +502,7 @@ struct address_space { unsigned int i_mmap_writable;/* count VM_SHARED mappings */ struct prio_tree_root i_mmap; /* tree of private and shared mappings */ struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ - spinlock_t i_mmap_lock; /* protect tree, count, list */ + struct rw_semaphore i_mmap_sem; /* protect tree, count, list */ unsigned int truncate_count; /* Cover race condition with truncate */ unsigned long nrpages; /* number of total pages */ pgoff_t writeback_index;/* writeback starts here */ diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -735,7 +735,7 @@ struct zap_details { struct address_space *check_mapping; /* Check page->mapping if set */ pgoff_t first_index; /* Lowest page->index to unmap */ pgoff_t last_index; /* Highest page->index to unmap */ - spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */ + struct rw_semaphore *i_mmap_sem; /* For unmap_mapping_range: */ unsigned long truncate_count; /* Compare vm_truncate_count */ }; diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -297,12 +297,12 @@ static int dup_mmap(struct mm_struct *mm atomic_dec(&inode->i_writecount); /* insert tmp into the share list, just after mpnt */ - spin_lock(&file->f_mapping->i_mmap_lock); + down_write(&file->f_mapping->i_mmap_sem); tmp->vm_truncate_count = mpnt->vm_truncate_count; flush_dcache_mmap_lock(file->f_mapping); vma_prio_tree_add(tmp, mpnt); flush_dcache_mmap_unlock(file->f_mapping); - spin_unlock(&file->f_mapping->i_mmap_lock); + up_write(&file->f_mapping->i_mmap_sem); } /* diff --git a/mm/filemap.c b/mm/filemap.c --- a/mm/filemap.c +++ b/mm/filemap.c @@ -61,16 +61,16 @@ 
generic_file_direct_IO(int rw, struct ki /* * Lock ordering: * - * ->i_mmap_lock (vmtruncate) + * ->i_mmap_sem (vmtruncate) * ->private_lock (__free_pte->__set_page_dirty_buffers) * ->swap_lock (exclusive_swap_page, others) * ->mapping->tree_lock * * ->i_mutex - * ->i_mmap_lock (truncate->unmap_mapping_range) + * ->i_mmap_sem (truncate->unmap_mapping_range) * * ->mmap_sem - * ->i_mmap_lock + * ->i_mmap_sem * ->page_table_lock or pte_lock (various, mainly in memory.c) * ->mapping->tree_lock (arch-dependent flush_dcache_mmap_lock) * @@ -87,7 +87,7 @@ generic_file_direct_IO(int rw, struct ki * ->sb_lock (fs/fs-writeback.c) * ->mapping->tree_lock (__sync_single_inode) * - * ->i_mmap_lock + * ->i_mmap_sem * ->anon_vma.lock (vma_adjust) * * ->anon_vma.lock diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -178,7 +178,7 @@ __xip_unmap (struct address_space * mapp if (!page) return; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { mm = vma->vm_mm; address = vma->vm_start + @@ -198,7 +198,7 @@ __xip_unmap (struct address_space * mapp page_cache_release(page); } } - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); } /* diff --git a/mm/fremap.c b/mm/fremap.c --- a/mm/fremap.c +++ b/mm/fremap.c @@ -206,13 +206,13 @@ asmlinkage long sys_remap_file_pages(uns } goto out; } - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); flush_dcache_mmap_lock(mapping); vma->vm_flags |= VM_NONLINEAR; vma_prio_tree_remove(vma, &mapping->i_mmap); vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear); flush_dcache_mmap_unlock(mapping); - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); } mmu_notifier_invalidate_range_start(mm, start, start + size); diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -814,7 +814,7 @@ void __unmap_hugepage_range(struct vm_ar struct page *page; struct page *tmp; /* - * A page gathering list, protected by per file i_mmap_lock. The + * A page gathering list, protected by per file i_mmap_sem. The * lock is used to avoid list corruption from multiple unmapping * of the same page since we are using page->lru. */ @@ -864,9 +864,9 @@ void unmap_hugepage_range(struct vm_area * do nothing in this case. */ if (vma->vm_file) { - spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); + down_write(&vma->vm_file->f_mapping->i_mmap_sem); __unmap_hugepage_range(vma, start, end); - spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); + up_write(&vma->vm_file->f_mapping->i_mmap_sem); } } @@ -1111,7 +1111,7 @@ void hugetlb_change_protection(struct vm BUG_ON(address >= end); flush_cache_range(vma, address, end); - spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); + down_write(&vma->vm_file->f_mapping->i_mmap_sem); spin_lock(&mm->page_table_lock); for (; address < end; address += HPAGE_SIZE) { ptep = huge_pte_offset(mm, address); @@ -1126,7 +1126,7 @@ void hugetlb_change_protection(struct vm } } spin_unlock(&mm->page_table_lock); - spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); + up_write(&vma->vm_file->f_mapping->i_mmap_sem); flush_tlb_range(vma, start, end); } diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -874,7 +874,7 @@ unsigned long unmap_vmas(struct vm_area_ unsigned long tlb_start = 0; /* For tlb_finish_mmu */ int tlb_start_valid = 0; unsigned long start = start_addr; - spinlock_t *i_mmap_lock = details? 
details->i_mmap_lock: NULL; + struct rw_semaphore *i_mmap_sem = details? details->i_mmap_sem: NULL; int fullmm; struct mmu_gather *tlb; struct mm_struct *mm = vma->vm_mm; @@ -920,8 +920,8 @@ unsigned long unmap_vmas(struct vm_area_ tlb_finish_mmu(tlb, tlb_start, start); if (need_resched() || - (i_mmap_lock && spin_needbreak(i_mmap_lock))) { - if (i_mmap_lock) { + (i_mmap_sem && rwsem_needbreak(i_mmap_sem))) { + if (i_mmap_sem) { tlb = NULL; goto out; } @@ -1829,7 +1829,7 @@ unwritable_page: /* * Helper functions for unmap_mapping_range(). * - * __ Notes on dropping i_mmap_lock to reduce latency while unmapping __ + * __ Notes on dropping i_mmap_sem to reduce latency while unmapping __ * * We have to restart searching the prio_tree whenever we drop the lock, * since the iterator is only valid while the lock is held, and anyway @@ -1848,7 +1848,7 @@ unwritable_page: * can't efficiently keep all vmas in step with mapping->truncate_count: * so instead reset them all whenever it wraps back to 0 (then go to 1). * mapping->truncate_count and vma->vm_truncate_count are protected by - * i_mmap_lock. + * i_mmap_sem. * * In order to make forward progress despite repeatedly restarting some * large vma, note the restart_addr from unmap_vmas when it breaks out: @@ -1898,7 +1898,7 @@ again: restart_addr = zap_page_range(vma, start_addr, end_addr - start_addr, details); - need_break = need_resched() || spin_needbreak(details->i_mmap_lock); + need_break = need_resched() || rwsem_needbreak(details->i_mmap_sem); if (restart_addr >= end_addr) { /* We have now completed this vma: mark it so */ @@ -1912,9 +1912,9 @@ again: goto again; } - spin_unlock(details->i_mmap_lock); + up_write(details->i_mmap_sem); cond_resched(); - spin_lock(details->i_mmap_lock); + down_write(details->i_mmap_sem); return -EINTR; } @@ -2008,9 +2008,9 @@ void unmap_mapping_range(struct address_ details.last_index = hba + hlen - 1; if (details.last_index < details.first_index) details.last_index = ULONG_MAX; - details.i_mmap_lock = &mapping->i_mmap_lock; + details.i_mmap_sem = &mapping->i_mmap_sem; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); /* Protect against endless unmapping loops */ mapping->truncate_count++; @@ -2025,7 +2025,7 @@ void unmap_mapping_range(struct address_ unmap_mapping_range_tree(&mapping->i_mmap, &details); if (unlikely(!list_empty(&mapping->i_mmap_nonlinear))) unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details); - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); } EXPORT_SYMBOL(unmap_mapping_range); diff --git a/mm/migrate.c b/mm/migrate.c --- a/mm/migrate.c +++ b/mm/migrate.c @@ -211,12 +211,12 @@ static void remove_file_migration_ptes(s if (!mapping) return; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) remove_migration_pte(vma, old, new); - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); } /* diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -189,7 +189,7 @@ error: } /* - * Requires inode->i_mapping->i_mmap_lock + * Requires inode->i_mapping->i_mmap_sem */ static void __remove_shared_vm_struct(struct vm_area_struct *vma, struct file *file, struct address_space *mapping) @@ -217,9 +217,9 @@ void unlink_file_vma(struct vm_area_stru if (file) { struct address_space *mapping = file->f_mapping; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); __remove_shared_vm_struct(vma, file, mapping); - 
spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); } } @@ -445,7 +445,7 @@ static void vma_link(struct mm_struct *m mapping = vma->vm_file->f_mapping; if (mapping) { - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); vma->vm_truncate_count = mapping->truncate_count; } anon_vma_lock(vma); @@ -455,7 +455,7 @@ static void vma_link(struct mm_struct *m anon_vma_unlock(vma); if (mapping) - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); mm->map_count++; validate_mm(mm); @@ -542,7 +542,7 @@ again: remove_next = 1 + (end > next-> mapping = file->f_mapping; if (!(vma->vm_flags & VM_NONLINEAR)) root = &mapping->i_mmap; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); if (importer && vma->vm_truncate_count != next->vm_truncate_count) { /* @@ -626,7 +626,7 @@ again: remove_next = 1 + (end > next-> if (anon_vma) spin_unlock(&anon_vma->lock); if (mapping) - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); if (remove_next) { if (file) { @@ -2068,7 +2068,7 @@ void exit_mmap(struct mm_struct *mm) /* Insert vm structure into process list sorted by address * and into the inode's i_mmap tree. If vm_file is non-NULL - * then i_mmap_lock is taken here. + * then i_mmap_sem is taken here. */ int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma) { diff --git a/mm/mremap.c b/mm/mremap.c --- a/mm/mremap.c +++ b/mm/mremap.c @@ -88,7 +88,7 @@ static void move_ptes(struct vm_area_str * and we propagate stale pages into the dst afterward. */ mapping = vma->vm_file->f_mapping; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); if (new_vma->vm_truncate_count && new_vma->vm_truncate_count != vma->vm_truncate_count) new_vma->vm_truncate_count = 0; @@ -120,7 +120,7 @@ static void move_ptes(struct vm_area_str pte_unmap_nested(new_pte - 1); pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); } diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -24,7 +24,7 @@ * inode->i_alloc_sem (vmtruncate_range) * mm->mmap_sem * page->flags PG_locked (lock_page) - * mapping->i_mmap_lock + * mapping->i_mmap_sem * anon_vma->lock * mm->page_table_lock or pte_lock * zone->lru_lock (in mark_page_accessed, isolate_lru_page) @@ -373,14 +373,14 @@ static int page_referenced_file(struct p * The page lock not only makes sure that page->mapping cannot * suddenly be NULLified by truncation, it makes sure that the * structure at mapping cannot be freed and reused yet, - * so we can safely take mapping->i_mmap_lock. + * so we can safely take mapping->i_mmap_sem. */ BUG_ON(!PageLocked(page)); - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); /* - * i_mmap_lock does not stabilize mapcount at all, but mapcount + * i_mmap_sem does not stabilize mapcount at all, but mapcount * is more likely to be accurate if we note it after spinning. 
*/ mapcount = page_mapcount(page); @@ -403,7 +403,7 @@ static int page_referenced_file(struct p break; } - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); return referenced; } @@ -490,12 +490,12 @@ static int page_mkclean_file(struct addr BUG_ON(PageAnon(page)); - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { if (vma->vm_flags & VM_SHARED) ret += page_mkclean_one(page, vma); } - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); return ret; } @@ -930,7 +930,7 @@ static int try_to_unmap_file(struct page unsigned long max_nl_size = 0; unsigned int mapcount; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { ret = try_to_unmap_one(page, vma, migration); if (ret == SWAP_FAIL || !page_mapped(page)) @@ -967,7 +967,6 @@ static int try_to_unmap_file(struct page mapcount = page_mapcount(page); if (!mapcount) goto out; - cond_resched_lock(&mapping->i_mmap_lock); max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK; if (max_nl_cursor == 0) @@ -989,7 +988,6 @@ static int try_to_unmap_file(struct page } vma->vm_private_data = (void *) max_nl_cursor; } - cond_resched_lock(&mapping->i_mmap_lock); max_nl_cursor += CLUSTER_SIZE; } while (max_nl_cursor <= max_nl_size); @@ -1001,7 +999,7 @@ static int try_to_unmap_file(struct page list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) vma->vm_private_data = NULL; out: - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); return ret; } From andrea at qumranet.com Fri May 2 08:05:11 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:11 +0200 Subject: [ofa-general] [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: Message-ID: <0be678c52e540d5f5d5f.1209740711@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1209740186 -7200 # Node ID 0be678c52e540d5f5d5fd9af549b57b9bb018d32 # Parent de28c85baef11b90c993047ca851a2f52c85a5be anon-vma-rwsem Convert the anon_vma spinlock to a rw semaphore. This allows concurrent traversal of reverse maps for try_to_unmap() and page_mkclean(). It also allows the calling of sleeping functions from reverse map traversal as needed for the notifier callbacks. It includes possible concurrency. Rcu is used in some context to guarantee the presence of the anon_vma (try_to_unmap) while we acquire the anon_vma lock. We cannot take a semaphore within an rcu critical section. Add a refcount to the anon_vma structure which allow us to give an existence guarantee for the anon_vma structure independent of the spinlock or the list contents. The refcount can then be taken within the RCU section. If it has been taken successfully then the refcount guarantees the existence of the anon_vma. The refcount in anon_vma also allows us to fix a nasty issue in page migration where we fudged by using rcu for a long code path to guarantee the existence of the anon_vma. I think this is a bug because the anon_vma may become empty and get scheduled to be freed but then we increase the refcount again when the migration entries are removed. The refcount in general allows a shortening of RCU critical sections since we can do a rcu_unlock after taking the refcount. This is particularly relevant if the anon_vma chains contain hundreds of entries. However: - Atomic overhead increases in situations where a new reference to the anon_vma has to be established or removed. 
Overhead also increases when a speculative reference is used (try_to_unmap, page_mkclean, page migration). - There is the potential for more frequent processor change due to up_xxx letting waiting tasks run first. This results in f.e. the Aim9 brk performance test to got down by 10-15%. Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/include/linux/rmap.h b/include/linux/rmap.h --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -25,7 +25,8 @@ * pointing to this anon_vma once its vma list is empty. */ struct anon_vma { - spinlock_t lock; /* Serialize access to vma list */ + atomic_t refcount; /* vmas on the list */ + struct rw_semaphore sem;/* Serialize access to vma list */ struct list_head head; /* List of private "related" vmas */ }; @@ -43,18 +44,31 @@ static inline void anon_vma_free(struct kmem_cache_free(anon_vma_cachep, anon_vma); } +struct anon_vma *grab_anon_vma(struct page *page); + +static inline void get_anon_vma(struct anon_vma *anon_vma) +{ + atomic_inc(&anon_vma->refcount); +} + +static inline void put_anon_vma(struct anon_vma *anon_vma) +{ + if (atomic_dec_and_test(&anon_vma->refcount)) + anon_vma_free(anon_vma); +} + static inline void anon_vma_lock(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - spin_lock(&anon_vma->lock); + down_write(&anon_vma->sem); } static inline void anon_vma_unlock(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - spin_unlock(&anon_vma->lock); + up_write(&anon_vma->sem); } /* diff --git a/mm/migrate.c b/mm/migrate.c --- a/mm/migrate.c +++ b/mm/migrate.c @@ -235,15 +235,16 @@ static void remove_anon_migration_ptes(s return; /* - * We hold the mmap_sem lock. So no need to call page_lock_anon_vma. + * We hold either the mmap_sem lock or a reference on the + * anon_vma. So no need to call page_lock_anon_vma. */ anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON); - spin_lock(&anon_vma->lock); + down_read(&anon_vma->sem); list_for_each_entry(vma, &anon_vma->head, anon_vma_node) remove_migration_pte(vma, old, new); - spin_unlock(&anon_vma->lock); + up_read(&anon_vma->sem); } /* @@ -630,7 +631,7 @@ static int unmap_and_move(new_page_t get int rc = 0; int *result = NULL; struct page *newpage = get_new_page(page, private, &result); - int rcu_locked = 0; + struct anon_vma *anon_vma = NULL; int charge = 0; if (!newpage) @@ -654,16 +655,14 @@ static int unmap_and_move(new_page_t get } /* * By try_to_unmap(), page->mapcount goes down to 0 here. In this case, - * we cannot notice that anon_vma is freed while we migrates a page. + * we cannot notice that anon_vma is freed while we migrate a page. * This rcu_read_lock() delays freeing anon_vma pointer until the end * of migration. File cache pages are no problem because of page_lock() * File Caches may use write_page() or lock_page() in migration, then, * just care Anon page here. */ - if (PageAnon(page)) { - rcu_read_lock(); - rcu_locked = 1; - } + if (PageAnon(page)) + anon_vma = grab_anon_vma(page); /* * Corner case handling: @@ -681,10 +680,7 @@ static int unmap_and_move(new_page_t get if (!PageAnon(page) && PagePrivate(page)) { /* * Go direct to try_to_free_buffers() here because - * a) that's what try_to_release_page() would do anyway - * b) we may be under rcu_read_lock() here, so we can't - * use GFP_KERNEL which is what try_to_release_page() - * needs to be effective. 
+ * that's what try_to_release_page() would do anyway */ try_to_free_buffers(page); } @@ -705,8 +701,8 @@ static int unmap_and_move(new_page_t get } else if (charge) mem_cgroup_end_migration(newpage); rcu_unlock: - if (rcu_locked) - rcu_read_unlock(); + if (anon_vma) + put_anon_vma(anon_vma); unlock: diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -570,7 +570,7 @@ again: remove_next = 1 + (end > next-> if (vma->anon_vma) anon_vma = vma->anon_vma; if (anon_vma) { - spin_lock(&anon_vma->lock); + down_write(&anon_vma->sem); /* * Easily overlooked: when mprotect shifts the boundary, * make sure the expanding vma has anon_vma set if the @@ -624,7 +624,7 @@ again: remove_next = 1 + (end > next-> } if (anon_vma) - spin_unlock(&anon_vma->lock); + up_write(&anon_vma->sem); if (mapping) up_write(&mapping->i_mmap_sem); diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -69,7 +69,7 @@ int anon_vma_prepare(struct vm_area_stru if (anon_vma) { allocated = NULL; locked = anon_vma; - spin_lock(&locked->lock); + down_write(&locked->sem); } else { anon_vma = anon_vma_alloc(); if (unlikely(!anon_vma)) @@ -81,6 +81,7 @@ int anon_vma_prepare(struct vm_area_stru /* page_table_lock to protect against threads */ spin_lock(&mm->page_table_lock); if (likely(!vma->anon_vma)) { + get_anon_vma(anon_vma); vma->anon_vma = anon_vma; list_add_tail(&vma->anon_vma_node, &anon_vma->head); allocated = NULL; @@ -88,7 +89,7 @@ int anon_vma_prepare(struct vm_area_stru spin_unlock(&mm->page_table_lock); if (locked) - spin_unlock(&locked->lock); + up_write(&locked->sem); if (unlikely(allocated)) anon_vma_free(allocated); } @@ -99,14 +100,17 @@ void __anon_vma_merge(struct vm_area_str { BUG_ON(vma->anon_vma != next->anon_vma); list_del(&next->anon_vma_node); + put_anon_vma(vma->anon_vma); } void __anon_vma_link(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; - if (anon_vma) + if (anon_vma) { + get_anon_vma(anon_vma); list_add_tail(&vma->anon_vma_node, &anon_vma->head); + } } void anon_vma_link(struct vm_area_struct *vma) @@ -114,36 +118,32 @@ void anon_vma_link(struct vm_area_struct struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) { - spin_lock(&anon_vma->lock); + get_anon_vma(anon_vma); + down_write(&anon_vma->sem); list_add_tail(&vma->anon_vma_node, &anon_vma->head); - spin_unlock(&anon_vma->lock); + up_write(&anon_vma->sem); } } void anon_vma_unlink(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; - int empty; if (!anon_vma) return; - spin_lock(&anon_vma->lock); + down_write(&anon_vma->sem); list_del(&vma->anon_vma_node); - - /* We must garbage collect the anon_vma if it's empty */ - empty = list_empty(&anon_vma->head); - spin_unlock(&anon_vma->lock); - - if (empty) - anon_vma_free(anon_vma); + up_write(&anon_vma->sem); + put_anon_vma(anon_vma); } static void anon_vma_ctor(struct kmem_cache *cachep, void *data) { struct anon_vma *anon_vma = data; - spin_lock_init(&anon_vma->lock); + init_rwsem(&anon_vma->sem); + atomic_set(&anon_vma->refcount, 0); INIT_LIST_HEAD(&anon_vma->head); } @@ -157,9 +157,9 @@ void __init anon_vma_init(void) * Getting a lock on a stable anon_vma from a page off the LRU is * tricky: page_lock_anon_vma rely on RCU to guard against the races. 
*/ -static struct anon_vma *page_lock_anon_vma(struct page *page) +struct anon_vma *grab_anon_vma(struct page *page) { - struct anon_vma *anon_vma; + struct anon_vma *anon_vma = NULL; unsigned long anon_mapping; rcu_read_lock(); @@ -170,17 +170,26 @@ static struct anon_vma *page_lock_anon_v goto out; anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); - spin_lock(&anon_vma->lock); - return anon_vma; + if (!atomic_inc_not_zero(&anon_vma->refcount)) + anon_vma = NULL; out: rcu_read_unlock(); - return NULL; + return anon_vma; +} + +static struct anon_vma *page_lock_anon_vma(struct page *page) +{ + struct anon_vma *anon_vma = grab_anon_vma(page); + + if (anon_vma) + down_read(&anon_vma->sem); + return anon_vma; } static void page_unlock_anon_vma(struct anon_vma *anon_vma) { - spin_unlock(&anon_vma->lock); - rcu_read_unlock(); + up_read(&anon_vma->sem); + put_anon_vma(anon_vma); } /* From andrea at qumranet.com Fri May 2 08:05:13 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:13 +0200 Subject: [ofa-general] [PATCH 10 of 11] export zap_page_range for XPMEM In-Reply-To: Message-ID: <4f462fb3dff614cd7d97.1209740713@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1209740229 -7200 # Node ID 4f462fb3dff614cd7d971219c3feaef0b43359c1 # Parent 721c3787cd42043734331e54a42eb20c51766f71 export zap_page_range for XPMEM XPMEM would have used sys_madvise() except that madvise_dontneed() returns an -EINVAL if VM_PFNMAP is set, which is always true for the pages XPMEM imports from other partitions and is also true for uncached pages allocated locally via the mspec allocator. XPMEM needs zap_page_range() functionality for these types of pages as well as 'normal' pages. Signed-off-by: Dean Nelson Signed-off-by: Andrea Arcangeli diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -954,6 +954,7 @@ unsigned long zap_page_range(struct vm_a return unmap_vmas(vma, address, end, &nr_accounted, details); } +EXPORT_SYMBOL_GPL(zap_page_range); /* * Do a quick page-table lookup for a single page. From andrea at qumranet.com Fri May 2 08:05:12 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:12 +0200 Subject: [ofa-general] [PATCH 09 of 11] mm_lock-rwsem In-Reply-To: Message-ID: <721c3787cd4204373433.1209740712@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1209740226 -7200 # Node ID 721c3787cd42043734331e54a42eb20c51766f71 # Parent 0be678c52e540d5f5d5fd9af549b57b9bb018d32 mm_lock-rwsem Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock conversion. 
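The core deadlock-avoidance idea is "sort, then lock": collect the
anon_vma/i_mmap semaphore pointers of all vmas in the mm, sort them so
there is a single global acquisition order, and take each distinct one
in ascending address order. A minimal standalone sketch of that
pattern (helper names here are illustrative, not the patch's API):

#include <linux/rwsem.h>
#include <linux/sort.h>

static int sem_cmp(const void *a, const void *b)
{
	unsigned long _a = (unsigned long)*(struct rw_semaphore **)a;
	unsigned long _b = (unsigned long)*(struct rw_semaphore **)b;

	return _a < _b ? -1 : (_a > _b ? 1 : 0);
}

/*
 * Take every distinct semaphore in ascending address order, so two
 * tasks locking overlapping sets can never deadlock against each
 * other. Duplicates are skipped because several vmas may share one
 * anon_vma or one i_mmap_sem.
 */
static void lock_sems_sorted(struct rw_semaphore **sems, size_t nr)
{
	struct rw_semaphore *last = NULL;
	size_t i;

	sort(sems, nr, sizeof(struct rw_semaphore *), sem_cmp, NULL);
	for (i = 0; i < nr; i++)
		if (sems[i] != last) {
			last = sems[i];
			down_write(last);
		}
}

Unlocking walks the same array with up_write(); the order no longer
matters once all the locks are held.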
Signed-off-by: Andrea Arcangeli diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1084,10 +1084,10 @@ extern int install_special_mapping(struc unsigned long flags, struct page **pages); struct mm_lock_data { - spinlock_t **i_mmap_locks; - spinlock_t **anon_vma_locks; - size_t nr_i_mmap_locks; - size_t nr_anon_vma_locks; + struct rw_semaphore **i_mmap_sems; + struct rw_semaphore **anon_vma_sems; + size_t nr_i_mmap_sems; + size_t nr_anon_vma_sems; }; extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data); extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2255,8 +2255,8 @@ int install_special_mapping(struct mm_st static int mm_lock_cmp(const void *a, const void *b) { - unsigned long _a = (unsigned long)*(spinlock_t **)a; - unsigned long _b = (unsigned long)*(spinlock_t **)b; + unsigned long _a = (unsigned long)*(struct rw_semaphore **)a; + unsigned long _b = (unsigned long)*(struct rw_semaphore **)b; cond_resched(); if (_a < _b) @@ -2266,7 +2266,7 @@ static int mm_lock_cmp(const void *a, co return 0; } -static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks, +static unsigned long mm_lock_sort(struct mm_struct *mm, struct rw_semaphore **sems, int anon) { struct vm_area_struct *vma; @@ -2275,59 +2275,59 @@ static unsigned long mm_lock_sort(struct for (vma = mm->mmap; vma; vma = vma->vm_next) { if (anon) { if (vma->anon_vma) - locks[i++] = &vma->anon_vma->lock; + sems[i++] = &vma->anon_vma->sem; } else { if (vma->vm_file && vma->vm_file->f_mapping) - locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock; + sems[i++] = &vma->vm_file->f_mapping->i_mmap_sem; } } if (!i) goto out; - sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL); + sort(sems, i, sizeof(struct rw_semaphore *), mm_lock_cmp, NULL); out: return i; } static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm, - spinlock_t **locks) + struct rw_semaphore **sems) { - return mm_lock_sort(mm, locks, 1); + return mm_lock_sort(mm, sems, 1); } static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm, - spinlock_t **locks) + struct rw_semaphore **sems) { - return mm_lock_sort(mm, locks, 0); + return mm_lock_sort(mm, sems, 0); } -static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock) +static void mm_lock_unlock(struct rw_semaphore **sems, size_t nr, int lock) { - spinlock_t *last = NULL; + struct rw_semaphore *last = NULL; size_t i; for (i = 0; i < nr; i++) /* Multiple vmas may use the same lock. */ - if (locks[i] != last) { - BUG_ON((unsigned long) last > (unsigned long) locks[i]); - last = locks[i]; + if (sems[i] != last) { + BUG_ON((unsigned long) last > (unsigned long) sems[i]); + last = sems[i]; if (lock) - spin_lock(last); + down_write(last); else - spin_unlock(last); + up_write(last); } } -static inline void __mm_lock(spinlock_t **locks, size_t nr) +static inline void __mm_lock(struct rw_semaphore **sems, size_t nr) { - mm_lock_unlock(locks, nr, 1); + mm_lock_unlock(sems, nr, 1); } -static inline void __mm_unlock(spinlock_t **locks, size_t nr) +static inline void __mm_unlock(struct rw_semaphore **sems, size_t nr) { - mm_lock_unlock(locks, nr, 0); + mm_lock_unlock(sems, nr, 0); } /* @@ -2351,10 +2351,10 @@ static inline void __mm_unlock(spinlock_ * of vmas is defined in /proc/sys/vm/max_map_count. * * mm_lock() can fail if memory allocation fails. 
The worst case - * vmalloc allocation required is 2*max_map_count*sizeof(spinlock_t *), - * so around 1Mbyte, but in practice it'll be much less because - * normally there won't be max_map_count vmas allocated in the task - * that runs mm_lock(). + * vmalloc allocation required is 2*max_map_count*sizeof(struct + * rw_semaphore *), so around 1Mbyte, but in practice it'll be much + * less because normally there won't be max_map_count vmas allocated + * in the task that runs mm_lock(). * * The vmalloc memory allocated by mm_lock is stored in the * mm_lock_data structure that must be allocated by the caller and it @@ -2368,20 +2368,20 @@ static inline void __mm_unlock(spinlock_ */ int mm_lock(struct mm_struct *mm, struct mm_lock_data *data) { - spinlock_t **anon_vma_locks, **i_mmap_locks; + struct rw_semaphore **anon_vma_sems, **i_mmap_sems; down_write(&mm->mmap_sem); if (mm->map_count) { - anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); - if (unlikely(!anon_vma_locks)) { + anon_vma_sems = vmalloc(sizeof(struct rw_semaphore *) * mm->map_count); + if (unlikely(!anon_vma_sems)) { up_write(&mm->mmap_sem); return -ENOMEM; } - i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); - if (unlikely(!i_mmap_locks)) { + i_mmap_sems = vmalloc(sizeof(struct rw_semaphore *) * mm->map_count); + if (unlikely(!i_mmap_sems)) { up_write(&mm->mmap_sem); - vfree(anon_vma_locks); + vfree(anon_vma_sems); return -ENOMEM; } @@ -2389,31 +2389,31 @@ int mm_lock(struct mm_struct *mm, struct * When mm_lock_sort_anon_vma/i_mmap returns zero it * means there's no lock to take and so we can free * the array here without waiting mm_unlock. mm_unlock - * will do nothing if nr_i_mmap/anon_vma_locks is + * will do nothing if nr_i_mmap/anon_vma_sems is * zero. */ - data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks); - data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks); + data->nr_anon_vma_sems = mm_lock_sort_anon_vma(mm, anon_vma_sems); + data->nr_i_mmap_sems = mm_lock_sort_i_mmap(mm, i_mmap_sems); - if (data->nr_anon_vma_locks) { - __mm_lock(anon_vma_locks, data->nr_anon_vma_locks); - data->anon_vma_locks = anon_vma_locks; + if (data->nr_anon_vma_sems) { + __mm_lock(anon_vma_sems, data->nr_anon_vma_sems); + data->anon_vma_sems = anon_vma_sems; } else - vfree(anon_vma_locks); + vfree(anon_vma_sems); - if (data->nr_i_mmap_locks) { - __mm_lock(i_mmap_locks, data->nr_i_mmap_locks); - data->i_mmap_locks = i_mmap_locks; + if (data->nr_i_mmap_sems) { + __mm_lock(i_mmap_sems, data->nr_i_mmap_sems); + data->i_mmap_sems = i_mmap_sems; } else - vfree(i_mmap_locks); + vfree(i_mmap_sems); } return 0; } -static void mm_unlock_vfree(spinlock_t **locks, size_t nr) +static void mm_unlock_vfree(struct rw_semaphore **sems, size_t nr) { - __mm_unlock(locks, nr); - vfree(locks); + __mm_unlock(sems, nr); + vfree(sems); } /* @@ -2430,12 +2430,12 @@ void mm_unlock(struct mm_struct *mm, str void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data) { if (mm->map_count) { - if (data->nr_anon_vma_locks) - mm_unlock_vfree(data->anon_vma_locks, - data->nr_anon_vma_locks); - if (data->nr_i_mmap_locks) - mm_unlock_vfree(data->i_mmap_locks, - data->nr_i_mmap_locks); + if (data->nr_anon_vma_sems) + mm_unlock_vfree(data->anon_vma_sems, + data->nr_anon_vma_sems); + if (data->nr_i_mmap_sems) + mm_unlock_vfree(data->i_mmap_sems, + data->nr_i_mmap_sems); } up_write(&mm->mmap_sem); } From andrea at qumranet.com Fri May 2 08:05:14 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 
17:05:14 +0200 Subject: [ofa-general] [PATCH 11 of 11] mmap sems In-Reply-To: Message-ID: # HG changeset patch # User Andrea Arcangeli # Date 1209740229 -7200 # Node ID b4bf6df98bc00bfbef9423b0dd31cfdba63a5eeb # Parent 4f462fb3dff614cd7d971219c3feaef0b43359c1 mmap sems This patch adds a lock ordering rule to avoid a potential deadlock when multiple mmap_sems need to be locked. Signed-off-by: Dean Nelson Signed-off-by: Andrea Arcangeli diff --git a/mm/filemap.c b/mm/filemap.c --- a/mm/filemap.c +++ b/mm/filemap.c @@ -79,6 +79,9 @@ generic_file_direct_IO(int rw, struct ki * * ->i_mutex (generic_file_buffered_write) * ->mmap_sem (fault_in_pages_readable->do_page_fault) + * + * When taking multiple mmap_sems, one should lock the lowest-addressed + * one first proceeding on up to the highest-addressed one. * * ->i_mutex * ->i_alloc_sem (various) From swise at opengridcomputing.com Fri May 2 09:17:41 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 02 May 2008 11:17:41 -0500 Subject: [ofa-general] [PATCH 2.6.26 1/3] RDMA/cxgb3: QP flush fixes Message-ID: <20080502161741.30500.95337.stgit@dell3.ogc.int> - Flush the QP only after the HW disables the connection. Currently we flush the QP when transitioning to CLOSING. This exposes a race condition where the HW can complete a RECV WR, for instance, -and- the SW can flush that same WR. - Only call CQ event handlers on flush IFF we actually flushed anything. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/cxio_hal.c | 13 ++++++++++--- drivers/infiniband/hw/cxgb3/cxio_hal.h | 4 ++-- drivers/infiniband/hw/cxgb3/iwch_qp.c | 13 ++++++++----- 3 files changed, 20 insertions(+), 10 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index 3de0fbf..8a86960 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -359,9 +359,10 @@ static void insert_recv_cqe(struct t3_wq *wq, struct t3_cq *cq) cq->sw_wptr++; } -void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count) +int cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count) { u32 ptr; + int flushed = 0; PDBG("%s wq %p cq %p\n", __FUNCTION__, wq, cq); @@ -369,8 +370,11 @@ void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count) PDBG("%s rq_rptr %u rq_wptr %u skip count %u\n", __FUNCTION__, wq->rq_rptr, wq->rq_wptr, count); ptr = wq->rq_rptr + count; - while (ptr++ != wq->rq_wptr) + while (ptr++ != wq->rq_wptr) { insert_recv_cqe(wq, cq); + flushed++; + } + return flushed; } static void insert_sq_cqe(struct t3_wq *wq, struct t3_cq *cq, @@ -394,9 +398,10 @@ static void insert_sq_cqe(struct t3_wq *wq, struct t3_cq *cq, cq->sw_wptr++; } -void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count) +int cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count) { __u32 ptr; + int flushed = 0; struct t3_swsq *sqp = wq->sq + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2); ptr = wq->sq_rptr + count; @@ -405,7 +410,9 @@ void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count) insert_sq_cqe(wq, cq, sqp); sqp++; ptr++; + flushed++; } + return flushed; } /* diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.h b/drivers/infiniband/hw/cxgb3/cxio_hal.h index 2bcff7f..69ab08e 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.h +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.h @@ -173,8 +173,8 @@ u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp); void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid); int __init cxio_hal_init(void); void __exit 
cxio_hal_exit(void); -void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count); -void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count); +int cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count); +int cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count); void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count); void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count); void cxio_flush_hw_cq(struct t3_cq *cq); diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c index b0e5aea..353fbb3 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -655,6 +655,7 @@ static void __flush_qp(struct iwch_qp *qhp, unsigned long *flag) { struct iwch_cq *rchp, *schp; int count; + int flushed; rchp = get_chp(qhp->rhp, qhp->attr.rcq); schp = get_chp(qhp->rhp, qhp->attr.scq); @@ -669,20 +670,22 @@ static void __flush_qp(struct iwch_qp *qhp, unsigned long *flag) spin_lock(&qhp->lock); cxio_flush_hw_cq(&rchp->cq); cxio_count_rcqes(&rchp->cq, &qhp->wq, &count); - cxio_flush_rq(&qhp->wq, &rchp->cq, count); + flushed = cxio_flush_rq(&qhp->wq, &rchp->cq, count); spin_unlock(&qhp->lock); spin_unlock_irqrestore(&rchp->lock, *flag); - (*rchp->ibcq.comp_handler)(&rchp->ibcq, rchp->ibcq.cq_context); + if (flushed) + (*rchp->ibcq.comp_handler)(&rchp->ibcq, rchp->ibcq.cq_context); /* locking heirarchy: cq lock first, then qp lock. */ spin_lock_irqsave(&schp->lock, *flag); spin_lock(&qhp->lock); cxio_flush_hw_cq(&schp->cq); cxio_count_scqes(&schp->cq, &qhp->wq, &count); - cxio_flush_sq(&qhp->wq, &schp->cq, count); + flushed = cxio_flush_sq(&qhp->wq, &schp->cq, count); spin_unlock(&qhp->lock); spin_unlock_irqrestore(&schp->lock, *flag); - (*schp->ibcq.comp_handler)(&schp->ibcq, schp->ibcq.cq_context); + if (flushed) + (*schp->ibcq.comp_handler)(&schp->ibcq, schp->ibcq.cq_context); /* deref */ if (atomic_dec_and_test(&qhp->refcnt)) @@ -880,7 +883,6 @@ int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp, ep = qhp->ep; get_ep(&ep->com); } - flush_qp(qhp, &flag); break; case IWCH_QP_STATE_TERMINATE: qhp->attr.state = IWCH_QP_STATE_TERMINATE; @@ -911,6 +913,7 @@ int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp, } switch (attrs->next_state) { case IWCH_QP_STATE_IDLE: + flush_qp(qhp, &flag); qhp->attr.state = IWCH_QP_STATE_IDLE; qhp->attr.llp_stream_handle = NULL; put_ep(&qhp->ep->com); From swise at opengridcomputing.com Fri May 2 09:17:43 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 02 May 2008 11:17:43 -0500 Subject: [ofa-general] [PATCH 2.6.26 2/3] RDMA/cxgb3: Silently ignore close reply after abort. In-Reply-To: <20080502161741.30500.95337.stgit@dell3.ogc.int> References: <20080502161741.30500.95337.stgit@dell3.ogc.int> Message-ID: <20080502161743.30500.27928.stgit@dell3.ogc.int> Remove bad BUG_ON() in close_con_rpl(). It is possible to get a close_rpl message on a dead connection. The sequence is: host refs ep for close exchange host posts close_req hw posts PEER_ABORT from incoming RST host marks ep DEAD host posts ABORT_RPL and releases ep resources hw posts CLOSE_RPL host derefs ep and ep freed. 
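Since the ep can thus legitimately be DEAD by the time the CLOSE_RPL
arrives, the state switch has to treat DEAD like ABORTING and ignore
the message instead of hitting the BUG_ON(). An abbreviated sketch of
the resulting handling (the actual one-line fix is in the diff below):

	switch (ep->com.state) {
	case ABORTING:
	case DEAD:
		/* A CLOSE_RPL can arrive after a PEER_ABORT has
		 * already marked the ep DEAD -- ignore it silently. */
		break;
	default:
		BUG_ON(1);
		break;
	}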
Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cm.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index f4f3c9e..b2db0a9 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -1650,8 +1650,8 @@ static int close_con_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) release = 1; break; case ABORTING: - break; case DEAD: + break; default: BUG_ON(1); break; From swise at opengridcomputing.com Fri May 2 09:17:45 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 02 May 2008 11:17:45 -0500 Subject: [ofa-general] [PATCH 2.6.26 3/3] RDMA/cxgb3: Bump up the mpa connection setup timeout. In-Reply-To: <20080502161741.30500.95337.stgit@dell3.ogc.int> References: <20080502161741.30500.95337.stgit@dell3.ogc.int> Message-ID: <20080502161745.30500.99485.stgit@dell3.ogc.int> Testing on large clusters shows its way too short at 10 secs. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cm.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index b2db0a9..9ea3a07 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -67,10 +67,10 @@ int peer2peer = 0; module_param(peer2peer, int, 0644); MODULE_PARM_DESC(peer2peer, "Support peer2peer ULPs (default=0)"); -static int ep_timeout_secs = 10; +static int ep_timeout_secs = 60; module_param(ep_timeout_secs, int, 0644); MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout " - "in seconds (default=10)"); + "in seconds (default=60)"); static int mpa_rev = 1; module_param(mpa_rev, int, 0644); From rdreier at cisco.com Fri May 2 10:58:10 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 02 May 2008 10:58:10 -0700 Subject: [ofa-general] Re: [PATCH 2.6.26 1/3] RDMA/cxgb3: QP flush fixes In-Reply-To: <20080502161741.30500.95337.stgit@dell3.ogc.int> (Steve Wise's message of "Fri, 02 May 2008 11:17:41 -0500") References: <20080502161741.30500.95337.stgit@dell3.ogc.int> Message-ID: thanks, applied all 3... (1/3 was against a slightly old tree, so I had to fix up __FUNCTION__ -> __func__ in the context of the diffs) From rdreier at cisco.com Fri May 2 11:15:10 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 02 May 2008 11:15:10 -0700 Subject: [ofa-general] Re: [PATCH 01/13] QLogic VNIC: Driver - netdev implementation In-Reply-To: <20080430171624.31725.98475.stgit@localhost.localdomain> (Ramachandra K.'s message of "Wed, 30 Apr 2008 22:46:24 +0530") References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430171624.31725.98475.stgit@localhost.localdomain> Message-ID: > From: Ramachandra K > > QLogic Virtual NIC Driver. This patch implements netdev registration, > netdev functions and state maintenance of the QLogic Virtual NIC > corresponding to the various events associated with the QLogic Ethernet > Virtual I/O Controller (EVIC/VEx) connection. > > Signed-off-by: Poornima Kamath > Signed-off-by: Amar Mudrankit For the next submission please clean up the From and Signed-off-by lines. As it stands now you are saying that you (Ramachandra K) are the author of the patch, and that Poornima and Amar signed off on it (ie forwarded it), but you as the person sending the email did not sign off on it. 
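For illustration only (which of the three actually authored the patch
is for you to say; addresses elided here as elsewhere in this archive),
the conventional layout when someone other than the sender wrote the
patch is:

	From: Poornima Kamath <address>

	[patch description]

	Signed-off-by: Poornima Kamath <address>
	Signed-off-by: Amar Mudrankit <address>
	Signed-off-by: Ramachandra K <address>

That is, the From: line at the top of the body names the author, and
the person sending the email adds his or her own Signed-off-by last in
the chain.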
> +#include

I would like to kill off the caching support in the IB core, so adding
new users of the API is not desirable. However, your code doesn't seem
to call any functions from this header anyway, so I guess you can just
delete the include.

> +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath)
> +{
> +	VNIC_FUNCTION("vnic_stop_xmit()\n");
> +	if (netpath == vnic->current_path) {
> +		if (vnic->xmit_started) {
> +			netif_stop_queue(vnic->netdevice);
> +			vnic->xmit_started = 0;
> +		}
> +
> +		vnic_stop_xmit_stats(vnic);
> +	}
> +}

Do you have sufficient locking here? Could vnic->current_path or
vnic->xmit_started change after they are tested, leading to bad
results? Also, do you get anything from having an xmit_started flag
that you couldn't get just by testing with netif_queue_stopped()?

> +	vnic = (struct vnic *)device->priv;

All this device->priv usage should probably be netdev_priv() instead,
and without a cast (since a cast from void * is not needed).

> +	if (jiffies > netpath->connect_time +
> +			vnic->config->no_path_timeout) {

You want to use time_after() for jiffies comparisons to avoid problems
with jiffies wrap.

> +	vnic->netdevice = alloc_netdev((int) 0, config->name, vnic_setup);
> +	vnic->netdevice->priv = (void *)vnic;

Not sure this is even kosher to do any more. Anyway, I think it's much
cleaner if you just allocate everything with alloc_netdev instead of
trying to stick your own structure in the priv pointer.

> +extern cycles_t recv_ref;

Seems like too generic a name to make global. What the heck are you
using cycles_t to keep track of anyway?

> +/* This array should be kept next to enum above since a change to npevent_type
> +   enum affects this array. */
> +static const char *const vnic_npevent_str[] = {
> +	"PRIMARY CONNECTED",
> +	"PRIMARY DISCONNECTED",
> +	"PRIMARY CARRIER",

Putting this in a header means every file that uses it gets a private
copy.

From xavier at tddft.org Fri May 2 11:21:15 2008
From: xavier at tddft.org (Xavier Andrade)
Date: Fri, 2 May 2008 20:21:15 +0200 (CEST)
Subject: [ofa-general] Loading of ib_mthca fails
In-Reply-To: <4815BA98.8000802@mellanox.co.il>
References: <4815BA98.8000802@mellanox.co.il>
Message-ID:

Hi Tziporet,

On Mon, 28 Apr 2008, Tziporet Koren wrote:
>
> Attached is the ini file for this PSID.
> Please create a binary using the MFT package on our web site and try to burn
> it.
> If you have more issues please work with Todd, who is cc'ed on this mail.
>

I have generated the firmware with the .ini you sent me; I burned it,
and for the moment it seems to work.

Thanks,
Xavier

From swise at opengridcomputing.com Fri May 2 13:16:23 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 02 May 2008 15:16:23 -0500
Subject: [ofa-general] Re: [PATCH 2.6.26 1/3] RDMA/cxgb3: QP flush fixes
In-Reply-To:
References: <20080502161741.30500.95337.stgit@dell3.ogc.int>
Message-ID: <481B7697.2080500@opengridcomputing.com>

Roland Dreier wrote:
> thanks, applied all 3... (1/3 was against a slightly old tree, so I had
> to fix up __FUNCTION__ -> __func__ in the context of the diffs)
>

Sorry about that. Someday I'll learn __func__. ;-)

From steiner at sgi.com Sat May 3 04:09:04 2008
From: steiner at sgi.com (Jack Steiner)
Date: Sat, 3 May 2008 06:09:04 -0500
Subject: [ofa-general] Re: [PATCH 00 of 11] mmu notifier #v15
In-Reply-To:
References:
Message-ID: <20080503110904.GA19688@sgi.com>

On Fri, May 02, 2008 at 05:05:03PM +0200, Andrea Arcangeli wrote:
> Hello everyone,
>
> 1/11 is the latest version of the mmu-notifier-core patch.
> > As usual all later 2-11/11 patches follows but those aren't meant for 2.6.26. > Not sure why -mm is different, but I get compile errors w/o the following... --- jack Index: linux/mm/mmu_notifier.c =================================================================== --- linux.orig/mm/mmu_notifier.c 2008-05-02 16:54:52.780576831 -0500 +++ linux/mm/mmu_notifier.c 2008-05-02 16:56:38.817719509 -0500 @@ -16,6 +16,7 @@ #include #include #include +#include /* * This function can't run concurrently against mmu_notifier_register From swise at opengridcomputing.com Sat May 3 09:05:26 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 03 May 2008 11:05:26 -0500 Subject: [ofa-general] [ GIT PULL ofed-1.3.1] - more cxgb3 fixes for -rc1 Message-ID: <481C8D46.3050305@opengridcomputing.com> Vlad, Please pull these additional upstream bug fixes into ofed-1.3.1. Pull from git://git.openfabrics.org/~swise/ofed-1.3.git ofed_kernel Shortlog: Steve Wise (4): RDMA/cxgb3: Program hardware IRD with correct value RDMA/cxgb3: QP flush fixes RDMA/cxgb3: Silently ignore close reply after abort. RDMA/cxgb3: Bump up the mpa connection setup timeout. From vlad at dev.mellanox.co.il Sun May 4 00:43:43 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 04 May 2008 10:43:43 +0300 Subject: [ofa-general] Re: [GIT PULL ofed-1.3.1] - chelsio changes for ofed-1.3.1 In-Reply-To: <4818E2C5.7060907@opengridcomputing.com> References: <4818E2C5.7060907@opengridcomputing.com> Message-ID: <481D692F.5080503@dev.mellanox.co.il> Steve Wise wrote: > Vlad, > > Please pull from: > > git://git.openfabrics.org/~swise/ofed-1.3 ofed_kernel > > This will sync up ofed-1.3.1 with all the important upstream fixes since > ofed-1.3. The patch files added are: > > kernel_patches/fixes/iw_cxgb3_0080_Fail_Loopback_Connections.patch > kernel_patches/fixes/iw_cxgb3_0090_Fix_shift_calc_in_build_phys_page_list_for_1-entry_page_lists.patch > > kernel_patches/fixes/iw_cxgb3_0100_Return_correct_max_inline_data_when_creating_a_QP.patch > > kernel_patches/fixes/iw_cxgb3_0110_Fix_iwch_create_cq_off-by-one_error.patch > > kernel_patches/fixes/iw_cxgb3_0120_Dont_access_a_cm_id_after_dropping_reference.patch > > kernel_patches/fixes/iw_cxgb3_0130_Correctly_set_the_max_mr_size_device_attribute.patch > > kernel_patches/fixes/iw_cxgb3_0140_Correctly_serialize_peer_abort_path.patch > > kernel_patches/fixes/iw_cxgb3_0150_Support_peer-2-peer_connection_setup.patch > > > > Thanks, > > Steve. > Done, Regards, Vladimir From vlad at dev.mellanox.co.il Sun May 4 00:45:49 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 04 May 2008 10:45:49 +0300 Subject: [ofa-general] Re: [GIT PULL ofed-1.3.1] libcxgb3 version 1.2.0 In-Reply-To: <4818E35C.4050206@opengridcomputing.com> References: <4818E35C.4050206@opengridcomputing.com> Message-ID: <481D69AD.9070404@dev.mellanox.co.il> Steve Wise wrote: > Vlad, > > Please pull in version 1.2.0 of libcxgb3. This is needed for the > ofed-1.3.1 kernel drivers. > > Pull from: > > git://git.openfabrics.org/~swise/libcxgb3 ofed_1_3_1 > > Thanks, > > Steve. 
> Done, Regards, Vladimir From vlad at dev.mellanox.co.il Sun May 4 00:49:21 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 04 May 2008 10:49:21 +0300 Subject: [ofa-general] Re: [ GIT PULL ofed-1.3.1] - more cxgb3 fixes for -rc1 In-Reply-To: <481C8D46.3050305@opengridcomputing.com> References: <481C8D46.3050305@opengridcomputing.com> Message-ID: <481D6A81.3060705@dev.mellanox.co.il> Steve Wise wrote: > Vlad, > > Please pull these additional upstream bug fixes into ofed-1.3.1. Pull from > > git://git.openfabrics.org/~swise/ofed-1.3.git ofed_kernel > > Shortlog: > > Steve Wise (4): > RDMA/cxgb3: Program hardware IRD with correct value > RDMA/cxgb3: QP flush fixes > RDMA/cxgb3: Silently ignore close reply after abort. > RDMA/cxgb3: Bump up the mpa connection setup timeout. > Done, Regards, Vladimir From vlad at dev.mellanox.co.il Sun May 4 00:52:59 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 04 May 2008 10:52:59 +0300 Subject: [ofa-general] Re: [ewg] [ANNOUNCE] dapl-1.2.6 and dapl-2.0.8 release In-Reply-To: <000401c8abb4$1334a7a0$f9b7020a@amr.corp.intel.com> References: <000401c8abb4$1334a7a0$f9b7020a@amr.corp.intel.com> Message-ID: <481D6B5B.20208@dev.mellanox.co.il> Arlin Davis wrote: > New release for uDAPL v1 (1.2.6) and v2 (2.0.8) is available at: > > http://www.openfabrics.org/downloads/dapl > > md5sum: 752ae54a93b4883c88b41241f52db4ab dapl-1.2.6.tar.gz > md5sum: a48f9da59318c395bcc6ad170226764a dapl-2.0.8.tar.gz > > Vlad, please pull into OFED 1.3.1 using package spec files and installing: > > dapl-1.2.6-1 > dapl-devel-1.2.6-1 > dapl-2.0.8-1 > dapl-utils-2.0.8-1 > dapl-devel-2.0.8-1 > dapl-debuginfo-2.0.8-1 > > tags: dapl-1.2.6-1, dapl-2.0.8-1 > > Summary of changes since last release: > > v2 - add private data exchange with reject > v1,v2 - better error reporting in non-debug builds > v1,v2 - update only OFA entries in dat.conf, cooperate with non-ofa providers > v1,v2 - support for zero byte operations, iov==NULL > v1,v2 - multi-transport support for inline data and private data differences > v1,v2 - fix memory leaks and other reported bugs since OFED 1.3 > > Thanks, > > -arlin > Done, Regards, Vladimir From kliteyn at dev.mellanox.co.il Sun May 4 02:57:30 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 04 May 2008 12:57:30 +0300 Subject: [ofa-general] [PATCH 0/4] opensm: Unicast Routing Cache Message-ID: <481D888A.7080608@dev.mellanox.co.il> Hi Sasha, The following series of 4 patches implements unicast routing cache in OpenSM. None of the current routing engines is scalable when we're talking about big clusters. On ~5K cluster with ~1.3K switches, it takes about two minutes to calculate the routing. The problem is, each time the routing is calculated from scratch. Incremental routing (which is on my to-do list) aims to address this problem when there is some "local" change in fabric (e.g. single switch failure, single link failure, link added, etc). In such cases we can use the routing that was already calculated in the previous heavy sweep, and then we just have to modify it according to the change. For instance, if some switch has disappeared from the fabric, we can use the routing that existed with this switch, take a step back from this switch and see if it is possible to route all the lids that were routed through this switch some other way (which is usually the case). To implement incremental routing, we need to create some kind of unicast routing cache, which is what these patches implement. 
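To give the idea in code form, the decision the unicast manager makes
on every heavy sweep boils down to something like the sketch below
(all names are hypothetical, not the actual patch API; the tolerated
cases are exactly the ones listed further down):

	enum topo_diff {
		TOPO_SAME,            /* no topology change */
		TOPO_CA_MISSING,      /* one or more CAs disappeared */
		TOPO_LEAF_SW_MISSING, /* one or more leaf switches disappeared */
		TOPO_OTHER            /* any other change */
	};

	/* Non-zero means the cached LFTs can be written to the switches
	 * as-is; zero means invalidate the cache, recalculate the
	 * routing as usual, and cache the new result. */
	static int cache_usable(enum topo_diff diff)
	{
		return diff == TOPO_SAME ||
		       diff == TOPO_CA_MISSING ||
		       diff == TOPO_LEAF_SW_MISSING;
	}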
In addition to being a step toward incremental routing, the routing
cache is useful by itself. This cache can save us a routing
calculation when the change is in the leaf switches or in the hosts.
For instance, if some node is rebooted, OpenSM would start a heavy
sweep with a full routing recalculation when the HCA goes down, and
another one when the HCA is brought back up, when in fact both of
these routing calculations can be replaced by using the unicast
routing cache.

Unicast routing cache comprises the following:
 - Topology: a data structure with all the switches and CAs of the fabric
 - LFTs: each switch has an LFT cached
 - Lid matrices: each switch has lid matrices cached, which are needed
   for multicast routing (which is not cached).

There is a topology matching function that compares the current
topology with the cached one to find out whether the cache is usable
(valid) or not.

The cache is used the following way:
 - SM is executed
 - it starts the first routing calculation
 - the calculated routing is stored in the cache
 - at some point a new heavy sweep is triggered
 - the unicast manager checks whether the cache can be used instead of
   a new routing calculation.

In one of the following cases we can use the cached routing:
 + there is no topology change
 + one or more CAs disappeared (they exist in the cached topology
   model, but are missing in the newly discovered fabric)
 + one or more leaf switches disappeared

In these cases the cached routing is written to the switches as is
(unless the switch doesn't exist). If there is any other topology
change:
 - the existing cache is invalidated
 - the topology is cached
 - the routing is calculated as usual
 - the routing is cached

My simulations show that where the usual routing phase of the heavy
sweep on the topology I mentioned above takes ~2 minutes, the cached
routing reduces this time to 6 seconds (which is nice, if you ask
me...).

Of all the cases when the cache is valid, the most painful and
"complainable" case is when a compute node reboot (which happens
pretty often) causes two heavy sweeps with two full routing
calculations. The unicast routing cache is aimed at solving this
problem (again, in addition to being a step toward incremental
routing).

-- Yevgeny

From kliteyn at dev.mellanox.co.il Sun May 4 02:59:33 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 04 May 2008 12:59:33 +0300
Subject: [ofa-general] [PATCH 1/4] opensm/osm_ucast_cache.{c, h}: ucast routing cache implementation
Message-ID: <481D8905.1010207@dev.mellanox.co.il>

Unicast routing cache implementation.

Unicast routing cache comprises the following:
 - Topology: a data structure with all the switches and CAs of the fabric
 - LFTs: each switch has an LFT cached
 - Lid matrices: each switch has lid matrices cached, which are needed
   for multicast routing (which is not cached).

There is also a topology matching function that compares the current
topology with the cached one to find out whether the cache is usable
(valid) or not.

Signed-off-by: Yevgeny Kliteynik
---
 opensm/include/opensm/osm_ucast_cache.h | 319 ++++++++
 opensm/opensm/osm_ucast_cache.c | 1197 +++++++++++++++++++++++++++++++
 2 files changed, 1516 insertions(+), 0 deletions(-)
 create mode 100644 opensm/include/opensm/osm_ucast_cache.h
 create mode 100644 opensm/opensm/osm_ucast_cache.c

diff --git a/opensm/include/opensm/osm_ucast_cache.h b/opensm/include/opensm/osm_ucast_cache.h
new file mode 100644
index 0000000..a3b40f9
--- /dev/null
+++ b/opensm/include/opensm/osm_ucast_cache.h
@@ -0,0 +1,319 @@
+/*
+ * Copyright (c) 2002-2008 Voltaire, Inc.
All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +/* + * Abstract: + * Declaration of osm_ucast_cache_t. + * This object represents the Unicast Cache object. + * + * Environment: + * Linux User Mode + * + * $Revision: 1.4 $ + */ + +#ifndef _OSM_UCAST_CACHE_H_ +#define _OSM_UCAST_CACHE_H_ + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +struct _osm_ucast_mgr; + +#define UCAST_CACHE_TOPOLOGY_MATCH 0x0000 +#define UCAST_CACHE_TOPOLOGY_LESS_SWITCHES 0x0001 +#define UCAST_CACHE_TOPOLOGY_LINK_TO_LEAF_SW_MISSING 0x0002 +#define UCAST_CACHE_TOPOLOGY_LINK_TO_CA_MISSING 0x0004 +#define UCAST_CACHE_TOPOLOGY_MORE_SWITCHES 0x0008 +#define UCAST_CACHE_TOPOLOGY_NEW_LID 0x0010 +#define UCAST_CACHE_TOPOLOGY_LINK_TO_SW_MISSING 0x0020 +#define UCAST_CACHE_TOPOLOGY_LINK_ADDED 0x0040 +#define UCAST_CACHE_TOPOLOGY_NEW_SWITCH 0x0080 +#define UCAST_CACHE_TOPOLOGY_NEW_CA 0x0100 +#define UCAST_CACHE_TOPOLOGY_NO_MATCH 0x0200 + +/****h* OpenSM/Unicast Manager/Unicast Cache +* NAME +* Unicast Cache +* +* DESCRIPTION +* The Unicast Cache object encapsulates the information +* needed to cache and write unicast routing of the subnet. +* +* The Unicast Cache object is NOT thread safe. +* +* This object should be treated as opaque and should be +* manipulated only through the provided functions. +* +* AUTHOR +* Yevgeny Kliteynik, Mellanox +* +*********/ + + +/****s* OpenSM: Unicast Cache/osm_ucast_cache_t +* NAME +* osm_ucast_cache_t +* +* DESCRIPTION +* Unicast Cache structure. +* +* This object should be treated as opaque and should +* be manipulated only through the provided functions. +* +* SYNOPSIS +*/ +typedef struct osm_ucast_cache_t_ { + struct _osm_ucast_mgr * p_ucast_mgr; + cl_qmap_t sw_tbl; + cl_qmap_t ca_tbl; + boolean_t topology_valid; + boolean_t routing_valid; + boolean_t need_update; +} osm_ucast_cache_t; +/* +* FIELDS +* p_ucast_mgr +* Pointer to the Unicast Manager for this subnet. +* +* sw_tbl +* Cached switches table. 
+* +* ca_tbl +* Cached CAs table. +* +* topology_valid +* TRUE if the cache is populated with the fabric topology. +* +* routing_valid +* TRUE if the cache is populated with the unicast routing +* in addition to the topology. +* +* need_update +* TRUE if the cached routing needs to be updated. +* +* SEE ALSO +* Unicast Manager object +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_construct +* NAME +* osm_ucast_cache_construct +* +* DESCRIPTION +* This function constructs a Unicast Cache object. +* +* SYNOPSIS +*/ +osm_ucast_cache_t * +osm_ucast_cache_construct(struct _osm_ucast_mgr * const p_mgr); +/* +* PARAMETERS +* p_mgr +* [in] Pointer to a Unicast Manager object. +* +* RETURN VALUE +* This function return the created Ucast Cache object on success, +* or NULL on any error. +* +* NOTES +* Allows osm_ucast_cache_destroy +* +* Calling osm_ucast_mgr_construct is a prerequisite to +* calling any other method. +* +* SEE ALSO +* Unicast Cache object, osm_ucast_cache_destroy +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_destroy +* NAME +* osm_ucast_cache_destroy +* +* DESCRIPTION +* The osm_ucast_cache_destroy function destroys the object, +* releasing all resources. +* +* SYNOPSIS +*/ +void osm_ucast_cache_destroy(osm_ucast_cache_t * p_cache); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the object to destroy. +* +* RETURN VALUE +* This function does not return any value. +* +* NOTES +* Performs any necessary cleanup of the specified +* Unicast Cache object. +* Further operations should not be attempted on the +* destroyed object. +* This function should only be called after a call to +* osm_ucast_cache_construct. +* +* SEE ALSO +* Unicast Cache object, osm_ucast_cache_construct +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_refresh_topo +* NAME +* osm_ucast_cache_refresh_topo +* +* DESCRIPTION +* The osm_ucast_cache_refresh_topo function re-reads the +* updated topology. +* +* SYNOPSIS +*/ +void osm_ucast_cache_refresh_topo(osm_ucast_cache_t * p_cache); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the cache object to refresh. +* +* RETURN VALUE +* This function does not return any value. +* +* NOTES +* This function invalidates the existing unicast cache +* and re-reads the updated topology. +* +* SEE ALSO +* Unicast Cache object, osm_ucast_cache_construct +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_refresh_lid_matrices +* NAME +* osm_ucast_cache_refresh_lid_matrices +* +* DESCRIPTION +* The osm_ucast_cache_refresh_topo function re-reads the +* updated lid matrices. +* +* SYNOPSIS +*/ +void osm_ucast_cache_refresh_lid_matrices(osm_ucast_cache_t * p_cache); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the cache object to refresh. +* +* RETURN VALUE +* This function does not return any value. +* +* NOTES +* This function re-reads the updated lid matrices. +* +* SEE ALSO +* Unicast Cache object, osm_ucast_cache_construct +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_apply +* NAME +* osm_ucast_cache_apply +* +* DESCRIPTION +* The osm_ucast_cache_apply function tries to apply +* the cached unicast routing on the subnet switches. +* +* SYNOPSIS +*/ +int osm_ucast_cache_apply(osm_ucast_cache_t * p_cache); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the cache object to be used. +* +* RETURN VALUE +* 0 if unicast cache was successfully written to switches, +* non-zero for any error. 
+* +* NOTES +* Compares the current topology to the cached topology, +* and if the topology matches, or if changes in topology +* have no impact on routing tables, writes the cached +* unicast routing to the subnet switches. +* +* SEE ALSO +* Unicast Cache object +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_set_sw_fwd_table +* NAME +* osm_ucast_cache_set_sw_fwd_table +* +* DESCRIPTION +* The osm_ucast_cache_set_sw_fwd_table function sets +* (caches) linear forwarding table for the specified +* switch. +* +* SYNOPSIS +*/ +void +osm_ucast_cache_set_sw_fwd_table(osm_ucast_cache_t * p_cache, + uint8_t * ucast_mgr_lft_buf, + osm_switch_t * p_osm_sw); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the cache object to be used. +* +* ucast_mgr_lft_buf +* [in] LFT to set. +* +* p_osm_sw +* [in] pointer to the switch that the LFT refers to. +* +* RETURN VALUE +* This function does not return any value. +* +* NOTES +* +* SEE ALSO +* Unicast Cache object +*********/ + +END_C_DECLS +#endif /* _OSM_UCAST_MGR_H_ */ + diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c new file mode 100644 index 0000000..4ad7c30 --- /dev/null +++ b/opensm/opensm/osm_ucast_cache.c @@ -0,0 +1,1197 @@ +/* + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +/* + * Abstract: + * Implementation of OpenSM Cached routing + * + * Environment: + * Linux User Mode + * + */ + +#if HAVE_CONFIG_H +# include +#endif + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct cache_sw_t_; +struct cache_ca_t_; +struct cache_port_t_; + +typedef union cache_sw_or_ca_ { + struct cache_sw_t_ * p_sw; + struct cache_ca_t_ * p_ca; +} cache_node_t; + +typedef struct cache_port_t_ { + uint8_t remote_node_type; + cache_node_t remote_node; +} cache_port_t; + +typedef struct cache_ca_t_ { + cl_map_item_t map_item; + uint16_t lid_ho; +} cache_ca_t; + +typedef struct cache_sw_t_ { + cl_map_item_t map_item; + uint16_t lid_ho; + uint16_t max_lid_ho; + osm_switch_t *p_osm_sw; /* pointer to the updated switch object */ + uint8_t num_ports; + cache_port_t ** ports; + uint8_t **lid_matrix; + uint8_t * lft_buff; + boolean_t is_leaf; +} cache_sw_t; + +/********************************************************************** + **********************************************************************/ + +static osm_switch_t * +__ucast_cache_get_starting_osm_sw(osm_ucast_cache_t * p_cache) +{ + osm_port_t * p_osm_port; + osm_node_t * p_osm_node; + osm_physp_t * p_osm_physp; + + CL_ASSERT(p_cache->p_ucast_mgr); + + /* find the OSM node */ + p_osm_port = osm_get_port_by_guid( + p_cache->p_ucast_mgr->p_subn, + p_cache->p_ucast_mgr->p_subn->sm_port_guid); + CL_ASSERT(p_osm_port); + + p_osm_node = p_osm_port->p_node; + switch (osm_node_get_type(p_osm_node)) { + case IB_NODE_TYPE_SWITCH: + /* OpenSM runs on switch - we're done */ + return p_osm_node->sw; + + case IB_NODE_TYPE_CA: + /* SM runs on CA - get the switch + that CA is connected to. */ + p_osm_physp = p_osm_port->p_physp; + p_osm_physp = osm_physp_get_remote(p_osm_physp); + p_osm_node = osm_physp_get_node_ptr(p_osm_physp); + CL_ASSERT(p_osm_node); + return p_osm_node->sw; + + default: + /* SM runs on some other node - not supported */ + return NULL; + } +} /* __ucast_cache_get_starting_osm_sw() */ + +/********************************************************************** + **********************************************************************/ + +static cache_sw_t * +__ucast_cache_get_sw(osm_ucast_cache_t * p_cache, + uint16_t lid_ho) +{ + cache_sw_t * p_sw; + + p_sw = (cache_sw_t *) cl_qmap_get(&p_cache->sw_tbl, lid_ho); + if (p_sw == (cache_sw_t *) cl_qmap_end(&p_cache->sw_tbl)) + return NULL; + + return p_sw; +} /* __ucast_cache_get_sw() */ + +/********************************************************************** + **********************************************************************/ + +static cache_ca_t * +__ucast_cache_get_ca(osm_ucast_cache_t * p_cache, + uint16_t lid_ho) +{ + cache_ca_t * p_ca; + + p_ca = (cache_ca_t *) cl_qmap_get(&p_cache->ca_tbl, lid_ho); + if (p_ca == (cache_ca_t *) cl_qmap_end(&p_cache->ca_tbl)) + return NULL; + + return p_ca; +} /* __ucast_cache_get_ca() */ + +/********************************************************************** + **********************************************************************/ + +static cache_port_t * +__ucast_cache_add_port(osm_ucast_cache_t * p_cache, + uint8_t remote_node_type, + uint16_t lid_ho) +{ + cache_port_t * p_port = (cache_port_t *) malloc(sizeof(cache_port_t)); + memset(p_port, 0, sizeof(cache_port_t)); + + p_port->remote_node_type = remote_node_type; + if (remote_node_type == IB_NODE_TYPE_SWITCH) + { + cache_sw_t * p_sw = __ucast_cache_get_sw( + p_cache, 
lid_ho); + CL_ASSERT(p_sw); + p_port->remote_node.p_sw = p_sw; + } + else { + cache_ca_t * p_ca = __ucast_cache_get_ca( + p_cache, lid_ho); + CL_ASSERT(p_ca); + p_port->remote_node.p_ca = p_ca; + } + + return p_port; +} /* __ucast_cache_add_port() */ + +/********************************************************************** + **********************************************************************/ + +static cache_sw_t * +__ucast_cache_add_sw(osm_ucast_cache_t * p_cache, + osm_switch_t * p_osm_sw) +{ + cache_sw_t *p_sw = (cache_sw_t*)malloc(sizeof(cache_sw_t)); + memset(p_sw, 0, sizeof(cache_sw_t)); + + p_sw->p_osm_sw = p_osm_sw; + + p_sw->lid_ho = + cl_ntoh16(osm_node_get_base_lid(p_osm_sw->p_node, 0)); + + p_sw->num_ports = osm_node_get_num_physp(p_osm_sw->p_node); + p_sw->ports = (cache_port_t **) + malloc(p_sw->num_ports * sizeof(cache_port_t *)); + memset(p_sw->ports, 0, p_sw->num_ports * sizeof(cache_port_t *)); + + cl_qmap_insert(&p_cache->sw_tbl, p_sw->lid_ho, &p_sw->map_item); + return p_sw; +} /* __ucast_cache_add_sw() */ + +/********************************************************************** + **********************************************************************/ + +static cache_ca_t * +__ucast_cache_add_ca(osm_ucast_cache_t * p_cache, + uint16_t lid_ho) +{ + cache_ca_t *p_ca = (cache_ca_t*)malloc(sizeof(cache_ca_t)); + memset(p_ca, 0, sizeof(cache_ca_t)); + + p_ca->lid_ho = lid_ho; + + cl_qmap_insert(&p_cache->ca_tbl, p_ca->lid_ho, &p_ca->map_item); + return p_ca; +} /* __ucast_cache_add_ca() */ + +/********************************************************************** + **********************************************************************/ + +static void +__cache_port_destroy(cache_port_t * p_port) +{ + if (!p_port) + return; + free(p_port); +} + +/********************************************************************** + **********************************************************************/ + +static void +__cache_sw_destroy(cache_sw_t * p_sw) +{ + int i; + + if (!p_sw) + return; + + if (p_sw->ports) { + for (i = 0; i < p_sw->num_ports; i++) + if (p_sw->ports[i]) + __cache_port_destroy(p_sw->ports[i]); + free(p_sw->ports); + } + + if (p_sw->lid_matrix) { + for (i = 0; i <= p_sw->max_lid_ho; i++) + if (p_sw->lid_matrix[i]) + free(p_sw->lid_matrix[i]); + free(p_sw->lid_matrix); + } + + if (p_sw->lft_buff) + free(p_sw->lft_buff); + + free(p_sw); +} /* __cache_sw_destroy() */ + +/********************************************************************** + **********************************************************************/ + +static void +__cache_ca_destroy(cache_ca_t * p_ca) +{ + if (!p_ca) + return; + free(p_ca); +} + +/********************************************************************** + **********************************************************************/ + +static int +__ucast_cache_populate(osm_ucast_cache_t * p_cache) +{ + cl_list_t sw_bfs_list; + osm_switch_t * p_osm_sw; + osm_switch_t * p_remote_osm_sw; + osm_node_t * p_osm_node; + osm_node_t * p_remote_osm_node; + osm_physp_t * p_osm_physp; + osm_physp_t * p_remote_osm_physp; + cache_sw_t * p_sw; + cache_sw_t * p_remote_sw; + cache_ca_t * p_remote_ca; + uint16_t remote_lid_ho; + unsigned num_ports; + unsigned i; + int res = 0; + osm_log_t * p_log = p_cache->p_ucast_mgr->p_log; + + OSM_LOG_ENTER(p_log); + + cl_list_init(&sw_bfs_list, 10); + + /* Use management switch or switch that is connected + to management CA as a BFS scan starting point */ + + p_osm_sw = __ucast_cache_get_starting_osm_sw(p_cache); + if 
(!p_osm_sw) { + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 3A51: " + "failed getting cache population starting point\n"); + res = 1; + goto Exit; + } + + /* switch is cached BEFORE entering to the BFS list, + so we will know whether this switch was "visited" */ + + p_sw = __ucast_cache_add_sw(p_cache, p_osm_sw); + cl_list_insert_tail(&sw_bfs_list, p_sw); + + /* Create cached switches in the BFS order. + This will ensure that the fabric scan is done each + time the same way and will allow accurate matching + between the current fabric and the cached one. */ + while (!cl_is_list_empty(&sw_bfs_list)) { + p_sw = (cache_sw_t *) cl_list_remove_head(&sw_bfs_list); + p_osm_sw = p_sw->p_osm_sw; + p_osm_node = p_osm_sw->p_node; + num_ports = osm_node_get_num_physp(p_osm_node); + + /* skipping port 0 on switches */ + for (i = 1; i < num_ports; i++) { + p_osm_physp = osm_node_get_physp_ptr(p_osm_node, i); + if (!p_osm_physp || + !osm_physp_is_valid(p_osm_physp) || + !osm_link_is_healthy(p_osm_physp)) + continue; + + p_remote_osm_physp = osm_physp_get_remote(p_osm_physp); + if (!p_remote_osm_physp || + !osm_physp_is_valid(p_remote_osm_physp) || + !osm_link_is_healthy(p_remote_osm_physp)) + continue; + + p_remote_osm_node = + osm_physp_get_node_ptr(p_remote_osm_physp); + if (!p_remote_osm_node) { + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 3A52: " + "no node for remote port\n"); + res = 1; + goto Exit; + } + + if (osm_node_get_type(p_remote_osm_node) == + IB_NODE_TYPE_SWITCH) { + + remote_lid_ho = cl_ntoh16( + osm_node_get_base_lid( + p_remote_osm_node, 0)); + + p_remote_osm_sw = p_remote_osm_node->sw; + CL_ASSERT(p_remote_osm_sw); + + p_remote_sw = __ucast_cache_get_sw( + p_cache, + remote_lid_ho); + + /* If the remote switch hasn't been + cached yet, add it to the cache + and insert it into the BFS list */ + + if (!p_remote_sw) { + p_remote_sw = __ucast_cache_add_sw( + p_cache, + p_remote_osm_sw); + cl_list_insert_tail(&sw_bfs_list, + p_remote_sw); + } + } + else { + remote_lid_ho = cl_ntoh16( + osm_physp_get_base_lid( + p_remote_osm_physp)); + + p_sw->is_leaf = TRUE; + p_remote_ca = __ucast_cache_add_ca( + p_cache, remote_lid_ho); + + /* no need to add this node to BFS list */ + } + + /* cache this port */ + p_sw->ports[i] = __ucast_cache_add_port( + p_cache, + osm_node_get_type(p_remote_osm_node), + remote_lid_ho); + } + } + + cl_list_destroy(&sw_bfs_list); + p_cache->topology_valid = TRUE; + + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "cache populated (%u SWs, %u CAs)\n", + cl_qmap_count(&p_cache->sw_tbl), + cl_qmap_count(&p_cache->ca_tbl)); + + Exit: + OSM_LOG_EXIT(p_log); + return res; +} /* __ucast_cache_populate() */ + +/********************************************************************** + **********************************************************************/ + +static void +__ucast_cache_read_sw_lid_matrix(cl_map_item_t * const p_map_item, + void *context) +{ + cache_sw_t *p_sw = (cache_sw_t * const)p_map_item; + uint16_t target_lid_ho; + uint8_t port_num; + + if (!p_sw->p_osm_sw) + return; + + /* allocate lid matrices buffer: + lid_matrix[target_lids][port_nums] */ + CL_ASSERT(!p_sw->lid_matrix); + p_sw->lid_matrix = (uint8_t **) + malloc((p_sw->max_lid_ho + 1) * sizeof(uint8_t*)); + + for (target_lid_ho = 0; + target_lid_ho <= p_sw->max_lid_ho; target_lid_ho++){ + + /* set hops for this target through every switch port */ + + p_sw->lid_matrix[target_lid_ho] = + (uint8_t *)malloc(p_sw->num_ports); + memset(p_sw->lid_matrix[target_lid_ho], + OSM_NO_PATH, p_sw->num_ports); + + for (port_num = 1; port_num < 
p_sw->num_ports; port_num++) + p_sw->lid_matrix[target_lid_ho][port_num] = + osm_switch_get_hop_count(p_sw->p_osm_sw, + target_lid_ho, + port_num); + } +} /* __ucast_cache_read_sw_lid_matrix() */ + +/********************************************************************** + **********************************************************************/ + +static void +__ucast_cache_write_sw_routing(cl_map_item_t * const p_map_item, + void * context) +{ + cache_sw_t *p_sw = (cache_sw_t * const)p_map_item; + osm_ucast_cache_t * p_cache = (osm_ucast_cache_t *) context; + uint8_t *ucast_mgr_lft_buf = p_cache->p_ucast_mgr->lft_buf; + uint16_t target_lid_ho; + uint8_t port_num; + uint8_t hops; + osm_log_t * p_log = p_cache->p_ucast_mgr->p_log; + + OSM_LOG_ENTER(p_log); + + if (!p_sw->p_osm_sw) { + /* some switches (leaf switches) may exist in the + cache, but not exist in the current topology */ + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "cached switch 0x%04x doesn't exist in the fabric\n", + p_sw->lid_ho); + goto Exit; + } + + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "writing routing for cached switch 0x%04x, " + "max_lid_ho = 0x%04x\n", + p_sw->lid_ho, p_sw->max_lid_ho); + + /* write cached LFT to this switch: clear existing + ucast mgr lft buffer, write the cached lft to the + ucast mgr buffer, and set this lft on switch */ + CL_ASSERT(p_sw->lft_buff); + memset(ucast_mgr_lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); + if (p_sw->max_lid_ho > 0) + memcpy(ucast_mgr_lft_buf, p_sw->lft_buff, + p_sw->max_lid_ho + 1); + + p_sw->p_osm_sw->max_lid_ho = p_sw->max_lid_ho; + osm_ucast_mgr_set_fwd_table(p_cache->p_ucast_mgr,p_sw->p_osm_sw); + + /* write cached lid matrix to this switch */ + + osm_switch_prepare_path_rebuild(p_sw->p_osm_sw, p_sw->max_lid_ho); + + /* set hops to itself */ + osm_switch_set_hops(p_sw->p_osm_sw,p_sw->lid_ho,0,0); + + for (target_lid_ho = 0; + target_lid_ho <= p_sw->max_lid_ho; target_lid_ho++){ + /* port 0 on switches lid matrices is used + for storing minimal hops to the target + lid, so we iterate from port 1 */ + for (port_num = 1; port_num < p_sw->num_ports; port_num++) { + hops = p_sw->lid_matrix[target_lid_ho][port_num]; + if (hops != OSM_NO_PATH) + osm_switch_set_hops(p_sw->p_osm_sw, + target_lid_ho, port_num, hops); + } + } + Exit: + OSM_LOG_EXIT(p_log); +} /* __ucast_cache_write_sw_routing() */ + +/********************************************************************** + **********************************************************************/ + +static void +__ucast_cache_clear_sw_routing(cl_map_item_t * const p_map_item, + void *context) +{ + cache_sw_t *p_sw = (cache_sw_t * const)p_map_item; + unsigned lid; + + if(p_sw->lft_buff) { + free(p_sw->lft_buff); + p_sw->lft_buff = NULL; + } + + if(p_sw->lid_matrix) { + for (lid = 0; lid < p_sw->max_lid_ho; lid++) + if (p_sw->lid_matrix[lid]) + free(p_sw->lid_matrix[lid]); + free(p_sw->lid_matrix); + p_sw->lid_matrix = NULL; + } + + p_sw->max_lid_ho = 0; +} /* __ucast_cache_clear_sw_routing() */ + +/********************************************************************** + **********************************************************************/ + +static void +__ucast_cache_clear_routing(osm_ucast_cache_t * p_cache) +{ + cl_qmap_apply_func(&p_cache->sw_tbl, __ucast_cache_clear_sw_routing, + (void *)p_cache); + p_cache->routing_valid = FALSE; +} + +/********************************************************************** + **********************************************************************/ + +static void +__ucast_cache_invalidate(osm_ucast_cache_t * 
p_cache) +{ + cache_sw_t * p_sw; + cache_sw_t * p_next_sw; + cache_ca_t * p_ca; + cache_ca_t * p_next_ca; + + p_next_sw = (cache_sw_t *) cl_qmap_head(&p_cache->sw_tbl); + while (p_next_sw != (cache_sw_t *) cl_qmap_end(&p_cache->sw_tbl)) { + p_sw = p_next_sw; + p_next_sw = (cache_sw_t *) cl_qmap_next(&p_sw->map_item); + __cache_sw_destroy(p_sw); + } + cl_qmap_remove_all(&p_cache->sw_tbl); + + p_next_ca = (cache_ca_t *) cl_qmap_head(&p_cache->ca_tbl); + while (p_next_ca != (cache_ca_t *) cl_qmap_end(&p_cache->ca_tbl)) { + p_ca = p_next_ca; + p_next_ca = (cache_ca_t *) cl_qmap_next(&p_ca->map_item); + __cache_ca_destroy(p_ca); + } + cl_qmap_remove_all(&p_cache->ca_tbl); + + p_cache->routing_valid = FALSE; + p_cache->topology_valid = FALSE; + p_cache->need_update = FALSE; +} /* __ucast_cache_invalidate() */ + +/********************************************************************** + **********************************************************************/ + +static int +__ucast_cache_read_topology(osm_ucast_cache_t * p_cache) +{ + CL_ASSERT(p_cache && p_cache->p_ucast_mgr); + + return __ucast_cache_populate(p_cache); +} + +/********************************************************************** + **********************************************************************/ + +static void +__ucast_cache_read_lid_matrices(osm_ucast_cache_t * p_cache) +{ + CL_ASSERT(p_cache && p_cache->p_ucast_mgr && + p_cache->topology_valid); + + if (p_cache->routing_valid) + __ucast_cache_clear_routing(p_cache); + + cl_qmap_apply_func(&p_cache->sw_tbl, + __ucast_cache_read_sw_lid_matrix, + (void *)p_cache); + p_cache->routing_valid = TRUE; +} + +/********************************************************************** + **********************************************************************/ + +static void +__ucast_cache_write_routing(osm_ucast_cache_t * p_cache) +{ + CL_ASSERT(p_cache && p_cache->p_ucast_mgr && + p_cache->topology_valid && p_cache->routing_valid); + + cl_qmap_apply_func(&p_cache->sw_tbl, + __ucast_cache_write_sw_routing, + (void *)p_cache); +} + +/********************************************************************** + **********************************************************************/ + +static void +__ucast_cache_sw_clear_osm_ptr(cl_map_item_t * const p_map_item, + void *context) +{ + ((cache_sw_t * const)p_map_item)->p_osm_sw = NULL; +} + +/********************************************************************** + **********************************************************************/ + +static int +__ucast_cache_validate(osm_ucast_cache_t * p_cache) +{ + osm_switch_t * p_osm_sw; + osm_node_t * p_osm_node; + osm_node_t * p_remote_osm_node; + osm_physp_t * p_osm_physp; + osm_physp_t * p_remote_osm_physp; + cache_sw_t * p_sw; + cache_sw_t * p_remote_sw; + cache_ca_t * p_remote_ca; + uint16_t lid_ho; + uint16_t remote_lid_ho; + uint8_t remote_node_type; + unsigned num_ports; + unsigned i; + int res = UCAST_CACHE_TOPOLOGY_MATCH; + boolean_t fabric_link_exists; + osm_log_t * p_log = p_cache->p_ucast_mgr->p_log; + cl_qmap_t * p_osm_sw_guid_tbl; + + OSM_LOG_ENTER(p_log); + + p_osm_sw_guid_tbl = &p_cache->p_ucast_mgr->p_subn->sw_guid_tbl; + + if (cl_qmap_count(p_osm_sw_guid_tbl) > + cl_qmap_count(&p_cache->sw_tbl)) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "current subnet has more switches than the cache - " + "cache is invalid\n"); + res |= UCAST_CACHE_TOPOLOGY_MORE_SWITCHES; + goto Exit; + } + + if (cl_qmap_count(p_osm_sw_guid_tbl) < + cl_qmap_count(&p_cache->sw_tbl)) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + 
"current subnet has less switches than the cache - " + "continuing validation\n"); + res |= UCAST_CACHE_TOPOLOGY_LESS_SWITCHES; + } + + /* Clear the pointers to osm switch on all the cached switches. + These pointers might be invalid right now: some cached switch + might be missing in the real subnet, and some missing switch + might reappear, such as in case of switch reboot. */ + cl_qmap_apply_func(&p_cache->sw_tbl, __ucast_cache_sw_clear_osm_ptr, + NULL); + + + for (p_osm_sw = (osm_switch_t *) cl_qmap_head(p_osm_sw_guid_tbl); + p_osm_sw != (osm_switch_t *) cl_qmap_end(p_osm_sw_guid_tbl); + p_osm_sw = (osm_switch_t *) cl_qmap_next(&p_osm_sw->map_item)) { + + lid_ho = cl_ntoh16(osm_node_get_base_lid(p_osm_sw->p_node,0)); + p_sw = __ucast_cache_get_sw(p_cache, lid_ho); + if (!p_sw) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "new lid (0x%04x)is in the fabric - " + "cache is invalid\n", lid_ho); + res |= UCAST_CACHE_TOPOLOGY_NEW_LID; + goto Exit; + } + + p_sw->p_osm_sw = p_osm_sw; + + /* scan all the ports and check if the cache is valid */ + + p_osm_node = p_osm_sw->p_node; + num_ports = osm_node_get_num_physp(p_osm_node); + + /* skipping port 0 on switches */ + for (i = 1; i < num_ports; i++) { + p_osm_physp = osm_node_get_physp_ptr(p_osm_node, i); + + fabric_link_exists = FALSE; + if (p_osm_physp && + osm_physp_is_valid(p_osm_physp) && + osm_link_is_healthy(p_osm_physp)) { + p_remote_osm_physp = + osm_physp_get_remote(p_osm_physp); + if (p_remote_osm_physp && + osm_physp_is_valid(p_remote_osm_physp) && + osm_link_is_healthy(p_remote_osm_physp)) + fabric_link_exists = TRUE; + } + + if (!fabric_link_exists && !p_sw->ports[i]) + continue; + + if (fabric_link_exists && !p_sw->ports[i]) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, port %d, link exists " + "in the fabric, but not cached - " + "cache is invalid\n", + lid_ho, i); + res |= UCAST_CACHE_TOPOLOGY_LINK_ADDED; + goto Exit; + } + + if (!fabric_link_exists && p_sw->ports[i]){ + /* + * link exists in cache, but missing + * in current fabric + */ + if (p_sw->ports[i]->remote_node_type == + IB_NODE_TYPE_SWITCH) { + p_remote_sw = + p_sw->ports[i]->remote_node.p_sw; + /* cache is allowed to have a + leaf switch that is missing + in the current subnet */ + if (!p_remote_sw->is_leaf) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, port %d, " + "fabric is missing a link " + "to non-leaf switch - " + "cache is invalid\n", + lid_ho, i); + res |= UCAST_CACHE_TOPOLOGY_LINK_TO_SW_MISSING; + goto Exit; + } + else { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, port %d, " + "fabric is missing a link " + "to leaf switch - " + "continuing validation\n", + lid_ho, i); + res |= UCAST_CACHE_TOPOLOGY_LINK_TO_LEAF_SW_MISSING; + continue; + } + } + else { + /* this means that link to + non-switch node is missing */ + CL_ASSERT(p_sw->is_leaf); + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, port %d, " + "fabric is missing a link " + "to CA - " + "continuing validation\n", + lid_ho, i); + res |= UCAST_CACHE_TOPOLOGY_LINK_TO_CA_MISSING; + continue; + } + } + + /* + * Link exists both in fabric and in cache. + * Compare remote nodes. + */ + + p_remote_osm_node = + osm_physp_get_node_ptr(p_remote_osm_physp); + if (!p_remote_osm_node) { + /* No node for remote port! + Something wrong is going on here, + so we better not use cache... 
*/ + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 3A53: " + "lid 0x%04x, port %d, " + "no node for remote port - " + "cache mismatch\n", + lid_ho, i); + res |= UCAST_CACHE_TOPOLOGY_NO_MATCH; + goto Exit; + } + + remote_node_type = + osm_node_get_type(p_remote_osm_node); + + if (remote_node_type != + p_sw->ports[i]->remote_node_type) { + /* remote node type in the current fabric + differs from the cached one - looks like + node was replaced by something else */ + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, port %d, " + "remote node type mismatch - " + "cache is invalid\n", + lid_ho, i); + res |= UCAST_CACHE_TOPOLOGY_NO_MATCH; + goto Exit; + } + + if (remote_node_type == IB_NODE_TYPE_SWITCH) { + remote_lid_ho = + cl_ntoh16(osm_node_get_base_lid( + p_remote_osm_node, 0)); + + p_remote_sw = __ucast_cache_get_sw( + p_cache, + remote_lid_ho); + + if (!p_remote_sw) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, " + "new switch in the fabric - " + "cache is invalid\n", + remote_lid_ho); + res |= UCAST_CACHE_TOPOLOGY_NEW_SWITCH; + goto Exit; + } + + if (p_sw->ports[i]->remote_node.p_sw != + p_remote_sw) { + /* remote cached switch that pointed + by the port is not equal to the + switch that was obtained for the + remote lid - link was changed */ + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, port %d, " + "link location changed " + "(remote node mismatch) - " + "cache is invalid\n", + lid_ho, i); + res |= UCAST_CACHE_TOPOLOGY_NO_MATCH; + goto Exit; + } + } + else { + if (!p_sw->is_leaf) { + /* remote node type is CA, but the + cached switch is not marked as + leaf - something has changed */ + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, port %d, " + "link changed - " + "cache is invalid\n", + lid_ho, i); + res |= UCAST_CACHE_TOPOLOGY_NO_MATCH; + goto Exit; + } + + remote_lid_ho = + cl_ntoh16(osm_physp_get_base_lid( + p_remote_osm_physp)); + + p_remote_ca = __ucast_cache_get_ca( + p_cache, remote_lid_ho); + + if (!p_remote_ca) { + /* new lid is in the fabric - + cache is invalid */ + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, port %d, " + "new CA in the fabric " + "(lid 0x%04x) - " + "cache is invalid\n", + lid_ho, i, remote_lid_ho); + res |= UCAST_CACHE_TOPOLOGY_NEW_CA; + goto Exit; + } + + if (p_sw->ports[i]->remote_node.p_ca != + p_remote_ca) { + /* remote cached CA that pointed + by the port is not equal to the + CA that was obtained for the + remote lid - link was changed */ + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, port %d, " + "link to CA (lid 0x%04x) " + "has changed - " + "cache is invalid\n", + lid_ho, i, remote_lid_ho); + res |= UCAST_CACHE_TOPOLOGY_NO_MATCH; + goto Exit; + } + } + } /* done comparing the ports of the switch */ + } /* done comparing all the switches */ + + /* At this point we have four possible flags on: + 1. UCAST_CACHE_TOPOLOGY_MATCH + We have a perfect topology match to the cache + 2. UCAST_CACHE_TOPOLOGY_LESS_SWITCHES + Cached topology has one or more switches that do not exist + in the current topology. There are two types of such switches: + leaf switches and the regular switches. But if some regular + switch was missing, we would exit the comparison with the + UCAST_CACHE_TOPOLOGY_LINK_TO_SW_MISSING flag, so if some switch + in the topology is missing, it has to be leaf switch. + 3. UCAST_CACHE_TOPOLOGY_LINK_TO_LEAF_SW_MISSING + One or more link to leaf switches are missing in the current + topology. + 4. UCAST_CACHE_TOPOLOGY_LINK_TO_CA_MISSING + One or more CAs are missing in the current topology. 
+ In all these cases the cache is perfectly usable - it just might + have routing to nonexistent lids. */ + + if (res & UCAST_CACHE_TOPOLOGY_LESS_SWITCHES) { + /* if there are switches in the cache that don't exist + in the current topology, make sure that they are + all leaf switches, otherwise cache is useless */ + for (p_sw = (cache_sw_t *) cl_qmap_head(&p_cache->sw_tbl); + p_sw != (cache_sw_t *) cl_qmap_end(&p_cache->sw_tbl); + p_sw = (cache_sw_t *) cl_qmap_next(&p_sw->map_item)) { + if (!p_sw->p_osm_sw && !p_sw->is_leaf) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "non-leaf switch in the fabric is " + "missing - cache is invalid\n"); + res |= UCAST_CACHE_TOPOLOGY_NO_MATCH; + goto Exit; + } + } + } + + if ((res & UCAST_CACHE_TOPOLOGY_LINK_TO_LEAF_SW_MISSING) && + !(res & UCAST_CACHE_TOPOLOGY_LESS_SWITCHES)) { + /* some link to leaf switch is missing, but there are + no missing switches - link failure or topology + changes, which means that we probably shouldn't + use the cache here */ + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "topology change - cache is invalid\n"); + res |= UCAST_CACHE_TOPOLOGY_NO_MATCH; + goto Exit; + } + + Exit: + OSM_LOG_EXIT(p_log); + return res; + +} /* __ucast_cache_validate() */ + +/********************************************************************** + **********************************************************************/ + +int +osm_ucast_cache_apply(osm_ucast_cache_t * p_cache) +{ + int res = 0; + osm_log_t * p_log; + + if (!p_cache) + return 1; + + p_log = p_cache->p_ucast_mgr->p_log; + + OSM_LOG_ENTER(p_log); + if (!p_cache->topology_valid) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "unicast cache is empty - can't " + "use it on this sweep\n"); + res = UCAST_CACHE_TOPOLOGY_NO_MATCH; + goto Exit; + } + + if (!p_cache->routing_valid) { + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 3A55: " + "cached routing invalid\n"); + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "invalidating cache\n"); + __ucast_cache_invalidate(p_cache); + res = UCAST_CACHE_TOPOLOGY_NO_MATCH; + goto Exit; + } + + res = __ucast_cache_validate(p_cache); + + if ((res & UCAST_CACHE_TOPOLOGY_NO_MATCH ) || + (res & UCAST_CACHE_TOPOLOGY_MORE_SWITCHES ) || + (res & UCAST_CACHE_TOPOLOGY_LINK_ADDED ) || + (res & UCAST_CACHE_TOPOLOGY_LINK_TO_SW_MISSING) || + (res & UCAST_CACHE_TOPOLOGY_NEW_SWITCH ) || + (res & UCAST_CACHE_TOPOLOGY_NEW_CA ) || + (res & UCAST_CACHE_TOPOLOGY_NEW_LID )) { + /* The change in topology doesn't allow us to use the + existing cache. The cache should be invalidated, and a new + cache should be built after the routing recalculation. */ + OSM_LOG(p_log, OSM_LOG_INFO, + "changes in topology (0x%x) - " + "invalidating cache\n", res); + __ucast_cache_invalidate(p_cache); + goto Exit; + } + + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "cache is valid (status 0x%04x) - using the cached routing\n",res); + + /* existing cache can be used - write back the cached routing */ + __ucast_cache_write_routing(p_cache); + + /* + * ToDo: Detailed result of the topology comparison will + * ToDo: be needed later for the Incremental Routing, + * ToDo: where based on this result, the routing algorithm + * ToDo: will try to route "around" the missing components. + * ToDo: For now - reset the result whenever the cache + * ToDo: is valid.
+ */ + res = 0; + + Exit: + OSM_LOG_EXIT(p_log); + return res; +} /* osm_ucast_cache_apply() */ + +/********************************************************************** + **********************************************************************/ + +void osm_ucast_cache_set_sw_fwd_table(osm_ucast_cache_t * p_cache, + uint8_t * ucast_mgr_lft_buf, + osm_switch_t * p_osm_sw) +{ + uint16_t lid_ho = + cl_ntoh16(osm_node_get_base_lid(p_osm_sw->p_node, 0)); + cache_sw_t * p_sw = __ucast_cache_get_sw(p_cache, lid_ho); + + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); + + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, + "caching lft for switch 0x%04x\n", + lid_ho); + + if (!p_sw || !p_sw->p_osm_sw) { + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_ERROR, + "ERR 3A57: " + "fabric switch 0x%04x %s in the unicast cache\n", + lid_ho, + (p_sw) ? "is not initialized" : "doesn't exist"); + goto Exit; + } + + CL_ASSERT(p_sw->p_osm_sw == p_osm_sw); + CL_ASSERT(!p_sw->lft_buff); + + p_sw->max_lid_ho = p_osm_sw->max_lid_ho; + + /* allocate linear forwarding table buffer and fill it */ + p_sw->lft_buff = (uint8_t *)malloc(IB_LID_UCAST_END_HO + 1); + memcpy(p_sw->lft_buff, p_cache->p_ucast_mgr->lft_buf, + IB_LID_UCAST_END_HO + 1); + + Exit: + OSM_LOG_EXIT(p_cache->p_ucast_mgr->p_log); +} /* osm_ucast_cache_set_sw_fwd_table() */ + +/********************************************************************** + **********************************************************************/ + +void osm_ucast_cache_refresh_topo(osm_ucast_cache_t * p_cache) +{ + osm_log_t * p_log = p_cache->p_ucast_mgr->p_log; + OSM_LOG_ENTER(p_log); + + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "starting ucast cache topology refresh\n"); + + if (p_cache->topology_valid) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "invalidating existing ucast cache\n"); + __ucast_cache_invalidate(p_cache); + } + + OSM_LOG(p_log, OSM_LOG_VERBOSE, "caching topology\n"); + + if (__ucast_cache_read_topology(p_cache) != 0) { + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 3A56: " + "cache population failed\n"); + __ucast_cache_invalidate(p_cache); + goto Exit; + } + + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "ucast cache topology refresh done\n"); + Exit: + OSM_LOG_EXIT(p_log); +} /* osm_ucast_cache_refresh_topo() */ + +/********************************************************************** + **********************************************************************/ + +void osm_ucast_cache_refresh_lid_matrices(osm_ucast_cache_t * p_cache) +{ + osm_log_t * p_log = p_cache->p_ucast_mgr->p_log; + OSM_LOG_ENTER(p_log); + + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "starting ucast cache lid matrices refresh\n"); + + if (!p_cache->topology_valid) { + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 3A54: " + "cached topology is invalid\n"); + goto Exit; + } + + if (p_cache->routing_valid) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "invalidating existing ucast routing cache\n"); + __ucast_cache_clear_routing(p_cache); + } + + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "caching lid matrices\n"); + + __ucast_cache_read_lid_matrices(p_cache); + + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "ucast cache lid matrices refresh done\n"); + Exit: + OSM_LOG_EXIT(p_log); +} /* osm_ucast_cache_refresh_lid_matrices() */ + +/********************************************************************** + **********************************************************************/ + +osm_ucast_cache_t * +osm_ucast_cache_construct(osm_ucast_mgr_t * const p_mgr) +{ + if (p_mgr->p_subn->opt.lmc > 0) { + OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A50: " + "Unicast cache is not 
supported for LMC>0\n"); + return NULL; + } + + osm_ucast_cache_t * p_cache = + (osm_ucast_cache_t*)malloc(sizeof(osm_ucast_cache_t)); + if (!p_cache) + return NULL; + + memset(p_cache, 0, sizeof(osm_ucast_cache_t)); + + cl_qmap_init(&p_cache->sw_tbl); + cl_qmap_init(&p_cache->ca_tbl); + p_cache->p_ucast_mgr = p_mgr; + + return p_cache; +} + +/********************************************************************** + **********************************************************************/ + +void +osm_ucast_cache_destroy(osm_ucast_cache_t * p_cache) +{ + if (!p_cache) + return; + __ucast_cache_invalidate(p_cache); + free(p_cache); +} + +/********************************************************************** + **********************************************************************/ -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Sun May 4 03:00:36 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 04 May 2008 13:00:36 +0300 Subject: [ofa-general] [PATCH 2/4] opensm: adding ucast cache option Message-ID: <481D8944.10003@dev.mellanox.co.il> Adding ucast cache option to OpenSM command line arguments: -F or --ucast_cache. Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_subnet.h | 6 +++++- opensm/opensm/main.c | 33 +++++++++++++++++++++++++++++++-- opensm/opensm/osm_subnet.c | 11 ++++++++++- 3 files changed, 46 insertions(+), 4 deletions(-) diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index b1dd659..cffbe5e 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -256,6 +256,7 @@ typedef struct _osm_subn_opt { boolean_t sweep_on_trap; char *routing_engine_name; boolean_t connect_roots; + boolean_t use_ucast_cache; char *lid_matrix_dump_file; char *ucast_dump_file; char *root_guid_file; @@ -441,6 +442,9 @@ typedef struct _osm_subn_opt { * up/down routing engine (even if this violates "pure" deadlock * free up/down algorithm) * +* use_ucast_cache +* When TRUE enables unicast routing cache. +* * lid_matrix_dump_file * Name of the lid matrix dump file from where switch * lid matrices (min hops tables) will be loaded diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index fb41d50..71deacb 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -183,6 +183,17 @@ static void show_usage(void) " and in this way be IBA compliant. In many cases,\n" " this can violate \"pure\" deadlock free algorithm, so\n" " use it carefully.\n\n"); + printf("-F\n" + "--ucast_cache\n" + " This option enables unicast routing cache to prevent\n" + " routing recalculation (which is a heavy task in a\n" + " large cluster) when there was no topology change\n" + " detected during the heavy sweep, or when the topology\n" + " change does not require new routing calculation,\n" + " e.g. 
in case of host reboot.\n" + " This option becomes very handy when the cluster size\n" + " is thousands of nodes.\n" + " Unicast cache is not supported for LMC > 0.\n\n"); printf("-M\n" "--lid_matrix_file \n" " This option specifies the name of the lid matrix dump file\n" @@ -599,7 +610,7 @@ int main(int argc, char *argv[]) char *ignore_guids_file_name = NULL; uint32_t val; const char *const short_option = - "i:f:ed:g:l:L:s:t:a:u:m:R:zM:U:S:P:Y:NBIQvVhorcyxp:n:q:k:C:"; + "i:f:ed:g:l:L:s:t:a:u:m:R:zM:U:S:P:Y:FNBIQvVhorcyxp:n:q:k:C:"; /* In the array below, the 2nd parameter specifies the number @@ -634,6 +645,7 @@ int main(int argc, char *argv[]) {"smkey", 1, NULL, 'k'}, {"routing_engine", 1, NULL, 'R'}, {"connect_roots", 0, NULL, 'z'}, + {"ucast_cache", 0, NULL, 'F'}, {"lid_matrix_file", 1, NULL, 'M'}, {"ucast_file", 1, NULL, 'U'}, {"sadb_file", 1, NULL, 'S'}, @@ -805,6 +817,12 @@ int main(int argc, char *argv[]) "ERROR: LMC must be 7 or less."); return (-1); } + if (opt.use_ucast_cache && temp > 0) { + fprintf(stderr, + "ERROR: Unicast routing cache is " + "not supported for LMC > 0\n"); + return (-1); + } opt.lmc = (uint8_t) temp; printf(" LMC = %d\n", temp); break; @@ -891,6 +909,17 @@ int main(int argc, char *argv[]) printf(" Connect roots option is on\n"); break; + case 'F': + if (opt.lmc > 0) { + fprintf(stderr, + "ERROR: Unicast routing cache is " + "not supported for LMC > 0\n"); + return (-1); + } + opt.use_ucast_cache = TRUE; + printf(" Unicast routing cache option is on\n"); + break; + case 'M': opt.lid_matrix_dump_file = optarg; printf(" Lid matrix dump file is \'%s\'\n", optarg); diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 47d735f..dc55e72 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -461,6 +461,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->sweep_on_trap = TRUE; p_opt->routing_engine_name = NULL; p_opt->connect_roots = FALSE; + p_opt->use_ucast_cache = FALSE; p_opt->lid_matrix_dump_file = NULL; p_opt->ucast_dump_file = NULL; p_opt->root_guid_file = NULL; @@ -1290,6 +1291,9 @@ ib_api_status_t osm_subn_parse_conf_file(IN osm_subn_opt_t * const p_opts) opts_unpack_boolean("connect_roots", p_key, p_val, &p_opts->connect_roots); + opts_unpack_boolean("use_ucast_cache", + p_key, p_val, &p_opts->use_ucast_cache); + opts_unpack_charp("log_file", p_key, p_val, &p_opts->log_file); opts_unpack_uint32("log_max_size", @@ -1543,6 +1547,11 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) "# Connect roots (use FALSE if unsure)\n" "connect_roots %s\n\n", p_opts->connect_roots ? "TRUE" : "FALSE"); + if (p_opts->use_ucast_cache) + fprintf(opts_file, + "# Use unicast routing cache (use FALSE if unsure)\n" + "use_ucast_cache %s\n\n", + p_opts->use_ucast_cache ? 
"TRUE" : "FALSE"); if (p_opts->lid_matrix_dump_file) fprintf(opts_file, "# Lid matrix dump file name\n" -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Sun May 4 03:02:03 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 04 May 2008 13:02:03 +0300 Subject: [ofa-general] [PATCH 3/4] opensm: compile new ucast cache files Message-ID: <481D899B.5010206@dev.mellanox.co.il> Include new ucast cache c/h files in the makefiles. Signed-off-by: Yevgeny Kliteynik --- opensm/include/Makefile.am | 1 + opensm/opensm/Makefile.am | 1 + 2 files changed, 2 insertions(+), 0 deletions(-) diff --git a/opensm/include/Makefile.am b/opensm/include/Makefile.am index 48264ff..a6791d4 100644 --- a/opensm/include/Makefile.am +++ b/opensm/include/Makefile.am @@ -33,6 +33,7 @@ EXTRA_DIST = \ $(srcdir)/opensm/osm_sm.h \ $(srcdir)/opensm/osm_lin_fwd_tbl.h \ $(srcdir)/opensm/osm_ucast_mgr.h \ + $(srcdir)/opensm/osm_ucast_cache.h \ $(srcdir)/opensm/osm_db.h \ $(srcdir)/opensm/osm_mad_pool.h \ $(srcdir)/opensm/osm_remote_sm.h \ diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am index acd0b1d..ec6c5b0 100644 --- a/opensm/opensm/Makefile.am +++ b/opensm/opensm/Makefile.am @@ -57,6 +57,7 @@ opensm_SOURCES = main.c osm_console_io.c osm_console.c osm_db_files.c \ osm_prtn.c osm_prtn_config.c osm_qos.c osm_router.c \ osm_trap_rcv.c osm_ucast_mgr.c osm_ucast_updn.c \ osm_ucast_lash.c osm_ucast_file.c osm_ucast_ftree.c \ + osm_ucast_cache.c \ osm_vl15intf.c osm_vl_arb_rcv.c \ st.c osm_perfmgr.c osm_perfmgr_db.c \ osm_event_plugin.c osm_dump.c \ -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Sun May 4 03:03:14 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 04 May 2008 13:03:14 +0300 Subject: [ofa-general] [PATCH 4/4] opensm/osm_ucast_mgr.{c, h}: integrate ucast cache Message-ID: <481D89E2.1090303@dev.mellanox.co.il> Integrating unicast routing cache into the unicast manager. The cache is used the following way: - SM is executed - it starts first routing calculation - calculated routing is stored in the cache - at some point new heavy sweep is triggered - unicast manager checks whether the cache can be used instead of new routing calculation. In one of the following cases we can use cached routing + there is no topology change + one or more CAs disappeared (they exist in the cached topology model, but missing in the newly discovered fabric) + one or more leaf switches disappeared In these cases cached routing is written to the switches as is (unless the switch doesn't exist). If there is any other topology change: - existing cache is invalidated - topology is cached - routing is calculated as usual - routing is cached Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_ucast_mgr.h | 7 +++- opensm/opensm/osm_ucast_mgr.c | 79 ++++++++++++++++++++++----------- 2 files changed, 59 insertions(+), 27 deletions(-) diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h index 0317c93..33e164b 100644 --- a/opensm/include/opensm/osm_ucast_mgr.h +++ b/opensm/include/opensm/osm_ucast_mgr.h @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* * This software is available to you under a choice of one of two @@ -53,6 +53,7 @@ #include #include #include +#include #include #ifdef __cplusplus @@ -103,6 +104,7 @@ typedef struct _osm_ucast_mgr { boolean_t is_dor; boolean_t any_change; boolean_t some_hop_count_set; + osm_ucast_cache_t * p_cache; uint8_t *lft_buf; } osm_ucast_mgr_t; /* @@ -132,6 +134,9 @@ typedef struct _osm_ucast_mgr { * tables calculation iteration cycle, set to TRUE to indicate * that some hop count changes were done. * +* p_cache +* Pointer to the unicast cache object. +* * lft_buf * LFT buffer - used during LFT calculation/setup. * diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index 938db84..d854fa9 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -81,6 +81,9 @@ void osm_ucast_mgr_destroy(IN osm_ucast_mgr_t * const p_mgr) if (p_mgr->lft_buf) free(p_mgr->lft_buf); + if (p_mgr->p_cache) + osm_ucast_cache_destroy(p_mgr->p_cache); + OSM_LOG_EXIT(p_mgr->p_log); } @@ -104,6 +107,9 @@ osm_ucast_mgr_init(IN osm_ucast_mgr_t * const p_mgr, IN osm_sm_t * sm) if (!p_mgr->lft_buf) return IB_INSUFFICIENT_MEMORY; + if (p_mgr->p_subn->opt.use_ucast_cache) + p_mgr->p_cache = osm_ucast_cache_construct(p_mgr); + OSM_LOG_EXIT(p_mgr->p_log); return (status); } @@ -375,6 +381,10 @@ osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr, CL_ASSERT(p_path); + if (p_mgr->p_cache && p_mgr->p_cache->need_update) + osm_ucast_cache_set_sw_fwd_table(p_mgr->p_cache, + p_mgr->lft_buf, p_sw); + /* Set the top of the unicast forwarding table. */ @@ -688,33 +698,50 @@ osm_signal_t osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr) p_mgr->any_change = FALSE; - if (!p_routing_eng->build_lid_matrices || - (blm = p_routing_eng->build_lid_matrices(p_routing_eng->context))) - osm_ucast_mgr_build_lid_matrices(p_mgr); + if (p_mgr->p_cache && (osm_ucast_cache_apply(p_mgr->p_cache) == 0)) + OSM_LOG(p_mgr->p_log, OSM_LOG_INFO, + "configured switch tables using cached routing\n"); + else { + if (p_mgr->p_cache) { + /* ucast cache is enabled - refresh + topology and mark routing for update */ + p_mgr->p_cache->need_update = TRUE; + osm_ucast_cache_refresh_topo(p_mgr->p_cache); + } + + if (!p_routing_eng->build_lid_matrices || + (blm = p_routing_eng->build_lid_matrices(p_routing_eng->context))) + osm_ucast_mgr_build_lid_matrices(p_mgr); - /* - Now that the lid matrices have been built, we can - build and download the switch forwarding tables. 
- */ - if (!p_routing_eng->ucast_build_fwd_tables || - (ubft = - p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context))) - cl_qmap_apply_func(p_sw_guid_tbl, __osm_ucast_mgr_process_tbl, - p_mgr); - - /* 'file' routing engine has one unique logic corner case */ - if (p_routing_eng->name && (strcmp(p_routing_eng->name, "file") == 0) - && (!blm || !ubft)) - p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_FILE; - else if (!blm && !ubft) - p_osm->routing_engine_used = - osm_routing_engine_type(p_routing_eng->name); - else - p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP; + /* + Now that the lid matrices have been built, we can + build and download the switch forwarding tables. + */ + if (!p_routing_eng->ucast_build_fwd_tables || + (ubft = + p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context))) + cl_qmap_apply_func(p_sw_guid_tbl, __osm_ucast_mgr_process_tbl, + p_mgr); + + /* 'file' routing engine has one unique logic corner case */ + if (p_routing_eng->name && (strcmp(p_routing_eng->name, "file") == 0) + && (!blm || !ubft)) + p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_FILE; + else if (!blm && !ubft) + p_osm->routing_engine_used = + osm_routing_engine_type(p_routing_eng->name); + else + p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP; + + OSM_LOG(p_mgr->p_log, OSM_LOG_INFO, + "%s tables configured on all switches\n", + osm_routing_engine_type_str(p_osm->routing_engine_used)); - OSM_LOG(p_mgr->p_log, OSM_LOG_INFO, - "%s tables configured on all switches\n", - osm_routing_engine_type_str(p_osm->routing_engine_used)); + if (p_mgr->p_cache) { + osm_ucast_cache_refresh_lid_matrices(p_mgr->p_cache); + p_mgr->p_cache->need_update = FALSE; + } + } if (p_mgr->any_change) { signal = OSM_SIGNAL_DONE_PENDING; -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Sun May 4 03:08:51 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 04 May 2008 13:08:51 +0300 Subject: [ofa-general] [PATCH 0/4] opensm: Unicast Routing Cache In-Reply-To: <481D888A.7080608@dev.mellanox.co.il> References: <481D888A.7080608@dev.mellanox.co.il> Message-ID: <481D8B33.4000803@dev.mellanox.co.il> One thing I need to add here: ucast cache is currently supported for LMC=0 only. -- Yevgeny Yevgeny Kliteynik wrote: > Hi Sasha, > > The following series of 4 patches implements unicast routing cache > in OpenSM. > > None of the current routing engines is scalable when we're talking > about big clusters. On a ~5K-node cluster with ~1.3K switches, it takes > about two minutes to calculate the routing. The problem is, each > time the routing is calculated from scratch. > > Incremental routing (which is on my to-do list) aims to address this > problem when there is some "local" change in the fabric (e.g. single > switch failure, single link failure, link added, etc). > In such cases we can use the routing that was already calculated in > the previous heavy sweep, and then we just have to modify it according > to the change. > > For instance, if some switch has disappeared from the fabric, we can > use the routing that existed with this switch, take a step back from > this switch and see if it is possible to route all the lids that were > routed through this switch some other way (which is usually the case). > > To implement incremental routing, we need to create some kind of unicast > routing cache, which is what these patches implement. In addition to being > a step toward the incremental routing, the routing cache is useful by itself. > > This cache can save us a routing calculation in case of a change in the leaf > switches or in hosts. For instance, if some node is rebooted, OpenSM would > start a heavy sweep with full routing recalculation when the HCA is going > down, and another one when the HCA is brought up, when in fact both of these > routing calculations can be replaced by use of the unicast routing cache. > > Unicast routing cache comprises the following: > - Topology: a data structure with all the switches and CAs of the fabric > - LFTs: each switch has an LFT cached > - Lid matrices: each switch has lid matrices cached, which is needed for > multicast routing (which is not cached). > > There is a topology matching function that compares the current topology > with the cached one to find out whether the cache is usable (valid) or not. > > The cache is used the following way: > - SM is executed - it starts first routing calculation > - calculated routing is stored in the cache > - at some point new heavy sweep is triggered > - unicast manager checks whether the cache can be used instead > of new routing calculation. > In one of the following cases we can use cached routing > + there is no topology change > + one or more CAs disappeared (they exist in the cached topology > model, but are missing in the newly discovered fabric) > + one or more leaf switches disappeared > In these cases cached routing is written to the switches as is > (unless the switch doesn't exist). > If there is any other topology change: > - existing cache is invalidated > - topology is cached > - routing is calculated as usual > - routing is cached > > My simulations show that when the usual routing phase of the heavy > sweep on the topology that I mentioned above takes ~2 minutes, > cached routing reduces this time to 6 seconds (which is nice, if you > ask me...). > > Of all the cases when the cache is valid, the most painful and > "complainable" case is when a compute node reboot (which happens pretty > often) causes two heavy sweeps with two full routing calculations. > Unicast Routing Cache is aimed to solve this problem (again, in addition > to being a step toward the incremental routing). > > -- Yevgeny > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >
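For anyone who wants to try the series: the cache is switched on through the options added in patch 2/4, either on the command line or in the OpenSM options file. A short usage sketch (the option names are exactly the ones added in that patch; note the LMC = 0 restriction):

	opensm -F
	opensm --ucast_cache        # long form of the same option

	# or in the OpenSM options file:
	use_ucast_cache TRUE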
From holt at sgi.com Sun May 4 12:13:45 2008 From: holt at sgi.com (Robin Holt) Date: Sun, 4 May 2008 14:13:45 -0500 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <1489529e7b53d3f2dab8.1209740704@duo.random> References: <1489529e7b53d3f2dab8.1209740704@duo.random> Message-ID: <20080504191345.GD18857@sgi.com> > diff --git a/mm/Kconfig b/mm/Kconfig > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -205,3 +205,6 @@ config VIRT_TO_BUS > config VIRT_TO_BUS > def_bool y > depends on !ARCH_NO_VIRT_TO_BUS > + > +config MMU_NOTIFIER > + bool Without some text following the bool keyword, I am not even asked for this config setting on my ia64 build. Thanks, Robin
From hrosenstock at xsigo.com Sun May 4 14:00:52 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Sun, 04 May 2008 14:00:52 -0700 Subject: [ofa-general] [PATCH][TRIVIAL] ibsim/sim_net.c: Fix some typos Message-ID: <1209934852.20493.182.camel@hrosenstock-ws.xsigo.com> ibsim/sim_net.c: Fix some typos Signed-off-by: Hal Rosenstock diff --git a/ibsim/sim_net.c b/ibsim/sim_net.c index 888c91c..1873187 100644 --- a/ibsim/sim_net.c +++ b/ibsim/sim_net.c @@ -270,7 +270,7 @@ static int new_hca(Node * nd) static int build_nodeid(char *nodeid, char *base) { if (strchr(base, '#') || strchr(base, '@')) { - IBWARN("bad nodeid \"%s\": '#' & '@' characters are resereved", + IBWARN("bad nodeid \"%s\": '#' & '@' characters are reserved", base); return -1; } @@ -649,7 +649,7 @@ static int parse_port(char *line, Node * node, int type, int maxports) build_alias(port->remotealias, s, 0); expand_name(s, remotenodeid, &sp); - PDEBUG("remotenodid %s s %s sp %s", remotenodeid, s, sp); + PDEBUG("remotenodeid %s s %s sp %s", remotenodeid, s, sp); s += strlen(s) + 1; if (!sp && *s == '[') @@ -791,7 +791,7 @@ static int parse_guidbase(int fd, char *line, int type) char *s; if (!(s = strchr(line, '=')) && !(s = strchr(line, '+'))) { - IBWARN("bad assignemnt: missing '=|+' sign"); + IBWARN("bad assignment: missing '=|+' sign"); return -1; } @@ -805,7 +805,7 @@ guidbase = 0; } guids[type] = absguids[type] + guidbase; - PDEBUG("new guidbase for %s: base %" PRIx64 " current %" PRIx64, + PDEBUG("new guidbase for %s: base 0x%" PRIx64 " current 0x%" PRIx64, node_type_name(type), absguids[type], guids[type]); return 1; @@ -816,7 +816,7 @@ static int parse_devid(int fd, char *line) char *s; if (!(s = strchr(line, '='))) { - IBWARN("bad assignemnt: missing '=' sign"); + IBWARN("bad assignment: missing '=' sign"); return -1; } @@ -831,7 +831,7 @@ static int parse_width(int fd, char *line) int width; if (!(s = strchr(line, '='))) { - IBWARN("bad assignemnt: missing '=' sign"); + IBWARN("bad assignment: missing '=' sign"); return -1; } @@ -851,7 +851,7 @@ static int parse_speed(int fd, char *line) int speed; if (!(s = strchr(line, '='))) { - IBWARN("bad assignemnt: missing '=' sign"); + IBWARN("bad assignment: missing '=' sign"); return -1; } @@ -870,7 +870,7 @@ static int parse_netprefix(int fd, char *line) char *s; if (!(s = strchr(line, '='))) { - IBWARN("bad assignemnt: missing '=' sign"); + IBWARN("bad assignment: missing '=' sign"); return -1; } @@ -907,7 +907,7 @@ static int set_var(char *line, int *var) char *s; if (!(s = strchr(line, '='))) { - IBWARN("bad assignemnt: missing '=' sign"); + IBWARN("bad assignment: missing '=' sign"); return -1; } From hrosenstock at xsigo.com Sun May 4 14:01:07 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Sun, 04 May 2008 14:01:07 -0700 Subject: [ofa-general] [PATCH] ibsim/sim.h: Fix NodeDescription size Message-ID: <1209934867.20493.183.camel@hrosenstock-ws.xsigo.com> ibsim/sim.h: Fix NodeDescription size so a maximum size NodeDescription per the IBA spec can be used rather than truncated Signed-off-by: Hal Rosenstock diff --git a/ibsim/sim.h b/ibsim/sim.h index bea136a..dbf1220 100644 --- a/ibsim/sim.h +++ b/ibsim/sim.h @@ -67,7 +67,7 @@ #define NODEIDBASE 20 #define NODEPREFIX 20 -#define NODEIDLEN (NODEIDBASE+NODEPREFIX+1) +#define NODEIDLEN 65 #define ALIASLEN 40 #define MAXHOPS 16 From hrosenstock at xsigo.com Sun May 4 14:01:35 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Sun, 04 May 2008 14:01:35 -0700
Subject: [ofa-general] [PATCH][TRIVIAL] ibsim/ibsim.c: Fix usage display Message-ID: <1209934895.20493.184.camel@hrosenstock-ws.xsigo.com> ibsim/ibsim.c: Fix usage display Signed-off-by: Hal Rosenstock diff --git a/ibsim/ibsim.c b/ibsim/ibsim.c index d0f1c30..5b996fd 100644 --- a/ibsim/ibsim.c +++ b/ibsim/ibsim.c @@ -698,7 +698,7 @@ Client *find_client(Port * port, int response, int qp, uint64_t trid) void usage(char *prog_name) { fprintf(stderr, - "Usage: %s [-f outfile -d debug_level -p parse_debug -s(tart) -v(erbose) " + "Usage: %s [-f outfile -d(ebug) -p(arse_debug) -s(tart) -v(erbose) " "-I(gnore_duplicate) -N nodes -S switchs -P ports -L linearcap" " -M mcastcap -r(emote_mode) -l(isten_to_port) ] \n", prog_name); From hrosenstock at xsigo.com Sun May 4 14:01:48 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Sun, 04 May 2008 14:01:48 -0700 Subject: [ofa-general] [PATCH] ibsim/README: Clarify point of attachment/SIM_HOST use Message-ID: <1209934908.20493.186.camel@hrosenstock-ws.xsigo.com> ibsim/README: Clarify point of attachment/SIM_HOST use Signed-off-by: Hal Rosenstock diff --git a/README b/README index f782fbe..6f3c711 100644 --- a/README +++ b/README @@ -90,6 +90,9 @@ Building and using ibsim. - in order to run OpenSM as non-privileged user you may need to export OSM_CACHE_DIR variable and to use '-f' option in order to specify writable path to OpenSM log file. + - Point of attachment is indicated by SIM_HOST environment variable. + If not specified, first entry in topology file is used. For OpenSM, + if -g option is used, it must be the same as this. 5. Enjoy and comment. From hrosenstock at xsigo.com Sun May 4 14:02:24 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Sun, 04 May 2008 14:02:24 -0700 Subject: [ofa-general] ibsim parsing question Message-ID: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> Hi Sasha, I have a question on ibsim parsing: In sim_net.c:parse_port, there is the following code: parse_opt: line = s; while (s && (s = strchr(s + 1, '='))) { char *opt = s; while (opt && !isalpha(*opt)) opt--; if (!opt || parse_port_opt(port, opt, s + 1) < 0) { IBWARN("bad port option"); return -1; } line = s + 1; } port options appear to include w for link width and s for link speed. An issue is that this parsing starts inside the NodeDescription. = is a valid character there and causes an invalid port option. There seem to me to be two choices here: 1. Either ignore unknown options in parse_port_opt and the rule becomes w= and s= are invalid in the NodeDescription (which is artificial and not really per the spec). or 2. Find some way to start this port option parsing past the end of the NodeDescription. As I'm not sure about all the formats supported, I don't know how to determine a "solid" way to get past the end of the NodeDescription in the topology format. Do you? -- Hal
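One possible shape for choice 2, assuming the node id/description in the topology line is enclosed in double quotes - a hypothetical helper, not verified against all the file formats ibsim accepts:

	/* Advance past a double-quoted node description so that '='
	 * characters inside it are not mistaken for port options.
	 * Returns the position after the closing quote, or the
	 * original position if no quoted string is found. */
	static char *skip_quoted_desc(char *line)
	{
		char *s = strchr(line, '"');	/* opening quote */

		if (s)
			s = strchr(s + 1, '"');	/* closing quote */
		return s ? s + 1 : line;
	}

parse_port could then start its '=' scan from skip_quoted_desc(line) instead of from the raw line.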
From andrea at qumranet.com Sun May 4 15:08:25 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Mon, 5 May 2008 00:08:25 +0200 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080504191345.GD18857@sgi.com> References: <1489529e7b53d3f2dab8.1209740704@duo.random> <20080504191345.GD18857@sgi.com> Message-ID: <20080504220824.GA21051@duo.random> On Sun, May 04, 2008 at 02:13:45PM -0500, Robin Holt wrote: > > diff --git a/mm/Kconfig b/mm/Kconfig > > --- a/mm/Kconfig > > +++ b/mm/Kconfig > > @@ -205,3 +205,6 @@ config VIRT_TO_BUS > > config VIRT_TO_BUS > > def_bool y > > depends on !ARCH_NO_VIRT_TO_BUS > > + > > +config MMU_NOTIFIER > > + bool > > Without some text following the bool keyword, I am not even asked for > this config setting on my ia64 build. Yes, this was explicitly asked by Andrew after his review. This is the explanation pasted from the changelog. 3) It'd be a waste to add branches in the VM if nobody could possibly run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of mmu notifiers, but this already allows compiling a KVM external module against a kernel with mmu notifiers enabled and from the next pull from kvm.git we'll start using them. And GRU/XPMEM will also be able to continue the development by enabling KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can also enable MMU_NOTIFIER in the same way KVM does it (even if KVM=n). This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n. From holt at sgi.com Sun May 4 19:25:46 2008 From: holt at sgi.com (Robin Holt) Date: Sun, 4 May 2008 21:25:46 -0500 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080504220824.GA21051@duo.random> References: <1489529e7b53d3f2dab8.1209740704@duo.random> <20080504191345.GD18857@sgi.com> <20080504220824.GA21051@duo.random> Message-ID: <20080505022546.GE18857@sgi.com> On Mon, May 05, 2008 at 12:08:25AM +0200, Andrea Arcangeli wrote: > On Sun, May 04, 2008 at 02:13:45PM -0500, Robin Holt wrote: > > > diff --git a/mm/Kconfig b/mm/Kconfig > > > --- a/mm/Kconfig > > > +++ b/mm/Kconfig > > > @@ -205,3 +205,6 @@ config VIRT_TO_BUS > > > config VIRT_TO_BUS > > > def_bool y > > > depends on !ARCH_NO_VIRT_TO_BUS > > > + > > > +config MMU_NOTIFIER > > > + bool > > > > Without some text following the bool keyword, I am not even asked for > > this config setting on my ia64 build. > > Yes, this was explicitly asked by Andrew after his review. This is the > explanation pasted from the changelog. > > 3) It'd be a waste to add branches in the VM if nobody could possibly > run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled > if CONFIG_KVM=m/y. In the current kernel kvm won't yet take > advantage of mmu notifiers, but this already allows compiling a > KVM external module against a kernel with mmu notifiers enabled and > from the next pull from kvm.git we'll start using them. And > GRU/XPMEM will also be able to continue the development by enabling > KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code > to the mainline kernel. Then they can also enable MMU_NOTIFIER in > the same way KVM does it (even if KVM=n). This guarantees nobody > selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n. Ah, so Andrew wants users of KVM to do a select of MMU_NOTIFIER. That makes sense. I will change (fix) my Kconfig changes. Thanks, Robin
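The pattern being settled on here, sketched in Kconfig terms (illustrative only - the exact prompts and dependencies of the real KVM entry differ):

	# mm/Kconfig: no prompt text, so the option never appears in
	# menuconfig and can only be turned on by a 'select'
	config MMU_NOTIFIER
		bool

	# a consumer's Kconfig (KVM here; GRU/XPMEM would do the same)
	config KVM
		tristate "Kernel-based Virtual Machine (KVM) support"
		select MMU_NOTIFIER

This keeps MMU_NOTIFIER=n unless at least one consumer is enabled, which is exactly the guarantee described above.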
Thanks, Robin From keshetti85-student at yahoo.co.in Sun May 4 21:55:46 2008 From: keshetti85-student at yahoo.co.in (Keshetti Mahesh) Date: Mon, 5 May 2008 10:25:46 +0530 Subject: [ofa-general] Install IPoIB separately .. Message-ID: <829ded920805042155r20df506l2ced783586664b@mail.gmail.com> While installing OFED-1.3 on a machine, I forgot to select the IPoIB module. Now, is it possible to build IPoIB module separately and install it without affecting the earlier installation? -Mahesh From ogerlitz at voltaire.com Sun May 4 23:37:29 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 05 May 2008 09:37:29 +0300 Subject: [ofa-general] Re: infiniband hot unplug In-Reply-To: <481A7173.3030007@cs.wisc.edu> References: <47E7DBEA.9030704@cs.wisc.edu> <4818AEB2.9050407@cs.wisc.edu> <4819D1BF.6090002@Voltaire.COM> <4819DF25.1010202@cs.wisc.edu> <481A7173.3030007@cs.wisc.edu> Message-ID: <481EAB29.5090901@voltaire.com> Mike Christie wrote: > > Oh yeah, I was just checking to see how infiniband handled hot > unplugging the card and sparks started to shoot out. Hi Mike, Maybe you can drop an email to Roland with cc to the OpenFabrics general list in order to initiate a discussion on the matter? Or. From vlad at dev.mellanox.co.il Sun May 4 23:39:38 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Mon, 05 May 2008 09:39:38 +0300 Subject: [ofa-general] Install IPoIB separately .. In-Reply-To: <829ded920805042155r20df506l2ced783586664b@mail.gmail.com> References: <829ded920805042155r20df506l2ced783586664b@mail.gmail.com> Message-ID: <481EABAA.6080105@dev.mellanox.co.il> Keshetti Mahesh wrote: > While installing OFED-1.3 on a machine, I forgot to select the IPoIB module. > Now, is it possible to build IPoIB module separately and install it without > affecting the earlier installation? > > -Mahesh Not with OFED-1.3 installation script. If you run install.pl, then it will: 1. Uninstall the current OFED installation. 2. Rebuild and install kernel-ib RPM (with IPoIB). 3. Install other binary RPMs following your selection using binary RPMs that were created during the previous install. Regards, Vladimir From tziporet at mellanox.co.il Mon May 5 04:26:54 2008 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 5 May 2008 14:26:54 +0300 Subject: [ofa-general] Agenda for the OFED meeting today (May 5) Message-ID: <6C2C79E72C305246B504CBA17B5500C903EC0879@mtlexch01.mtl.com> Hi, This is the agenda for the OFED meeting today: 1. OFED 1.3.1: 1.1 Planned changes: ULPs changes: IB-bonding - done SRP failover - done SDP crashes - in progress (not clear if we will have something on time) RDS fixes for RDMA API - done librdmacm 1.0.7 - done uDAPL updates - done Open MPI 1.2.6 - done MVAPICH 1.0.1 - done IPoIB - 2 bugs are fixed. There is still one issue that should be resolved.
Low level drivers: Changes that were already committed: nes mlx4 cxgb3 ehca 1.2 Schedule: rc1 - will be released tomorrow rc2 - May 20 GA - May 29 Daily builds of 1.3.1 are already available at: http://www.openfabrics.org/builds/ofed-1.3.1 2. OFED 1.4: Delayed the work on the kernel rebase and will do it now on 2.6.26-rc1. Will have the new tree ready next week. Reason: Many fixes are already applied on the 2.6.26 tree and in this way we can do all the work only once. 3. Open discussion - Open SuSE build system - if Yiftah is able to update on progress - Other topics ... Tziporet From keshetti85-student at yahoo.co.in Mon May 5 05:29:48 2008 From: keshetti85-student at yahoo.co.in (Keshetti Mahesh) Date: Mon, 5 May 2008 17:59:48 +0530 Subject: [ofa-general] Re: [mvapich-discuss] Using RDMA CM with MVAPICH2 In-Reply-To: References: <829ded920805050401v16961f10y4cae507d58c74cfe@mail.gmail.com> Message-ID: <829ded920805050529k19dee08o3ad19ce9d07bd2ca@mail.gmail.com> On Mon, May 5, 2008 at 5:54 PM, Dhabaleswar Panda wrote: > RDMA CM is designed by OpenFabrics. MVAPICH2 uses this. If you post your > questions to OpenFabrics General mailing list, the designers of RDMA CM > component will be able to provide detailed answers regarding how it works. > > DK Hi all, I want to use the RDMA CM option of MVAPICH2. The procedure described in the user guide is not very informative. Can anyone here give me the detailed procedure for using the RDMA CM option? Also, I'll be glad if someone can give me a document describing how it works in detail. Actually I have some doubts, like how the IP addresses (???) are resolved into IB addresses and what happens in the case of nodes with two HCAs (or 1 HCA with two ports)? In the MVAPICH2 user guide it is mentioned that "RDMA CM device needs to be setup, configured with an IP address and connected to the network". Is this the same as configuring an IPoIB device? -Mahesh From olaf.kirch at oracle.com Mon May 5 05:50:08 2008 From: olaf.kirch at oracle.com (Olaf Kirch) Date: Mon, 5 May 2008 14:50:08 +0200 Subject: [ofa-general] Re: [ewg] Agenda for the OFED meeting today (May 5) In-Reply-To: <6C2C79E72C305246B504CBA17B5500C903EC0879@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C903EC0879@mtlexch01.mtl.com> Message-ID: <200805051450.09752.olaf.kirch@oracle.com> Hi Tziporet, > RDS fixes for RDMA API - done As an update, I just sent Vlad a bugfix for an RDMA-related crash in RDS. It would be cool if that could be included in 1.3.1. I am also currently testing three more bugfix patches; two of them related to dma_sync issues, and one patch to reduce the latency of RDS RDMA notifications (a process expects a notification from the kernel that tells it when it's okay to release the RDMA buffer - the current code tries to give a reliable status at the expense of one round-trip; this turns out to be too slow for some purposes). It is not yet clear, however, which (if any) of these three pending patches will make OFED 1.3.1.
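Coming back to Keshetti's RDMA CM question above: librdmacm resolves an IP address to IB addressing through the normal IP routing tables, which is also what picks between two HCAs (or two ports of one HCA) -- the port whose IPoIB interface routes to the destination gets used. A minimal sketch of the two resolution steps (timeout values arbitrary, event-channel handling elided):

#include <rdma/rdma_cma.h>

static int resolve_peer(struct rdma_cm_id *id, struct sockaddr *dst)
{
        int err;

        /* map the dst IP to a GID and bind id to the matching local port */
        err = rdma_resolve_addr(id, NULL, dst, 2000);
        if (err)
                return err;
        /* ... wait for RDMA_CM_EVENT_ADDR_RESOLVED ... */

        /* query the SA for a path record to the resolved GID */
        return rdma_resolve_route(id, 2000);
        /* ... then wait for RDMA_CM_EVENT_ROUTE_RESOLVED ... */
}

So the "RDMA CM device" setup mentioned in the MVAPICH2 guide does in practice amount to configuring the IPoIB interface.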
Regards, Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From jackm at dev.mellanox.co.il Mon May 5 08:20:49 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 5 May 2008 18:20:49 +0300 Subject: [ofa-general] [PATCH] mlx4_core: Support creation of FMRs with pages smaller than 4K Message-ID: <200805051820.49796.jackm@dev.mellanox.co.il> From: Oren Duer mlx4_core: Support creation of FMRs with pages smaller than 4K Actual smallest page size is given by device capabilities. Signed-off-by: Oren Duer Signed-off-by: Jack Morgenstein --- Roland, The device minimum page size should be taken from the device capabilities, and not hard-coded. This hard-coding has led to problems with new mlx4 firmware. This is for your 2.6.26 git tree. Jack diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c index 79b317b..7123463 100644 --- a/drivers/net/mlx4/mr.c +++ b/drivers/net/mlx4/mr.c @@ -551,7 +551,7 @@ int mlx4_fmr_alloc(struct mlx4_dev *dev, u32 pd, u32 access, int max_pages, u64 mtt_seg; int err = -ENOMEM; - if (page_shift < 12 || page_shift >= 32) + if (page_shift < (ffs(dev->caps.page_size_cap) - 1) || page_shift >= 32) return -EINVAL; /* All MTTs must fit in the same page */ From ramachandra.kuchimanchi at qlogic.com Mon May 5 08:36:58 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 5 May 2008 21:06:58 +0530 Subject: [ofa-general] Re: [PATCH 01/13] QLogic VNIC: Driver - netdev implementation In-Reply-To: References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430171624.31725.98475.stgit@localhost.localdomain> Message-ID: <71d336490805050836o68d745f2k4bab68edcfe1da50@mail.gmail.com> Roland, Thanks for the review. Your comments make sense and we will fix the things you pointed out. Please see some clarifications in-line. On Fri, May 2, 2008 at 11:45 PM, Roland Dreier wrote: > > From: Ramachandra K > > > > Signed-off-by: Poornima Kamath > > Signed-off-by: Amar Mudrankit > > For the next submission please clean up the From and Signed-off-by > lines. As it stands now you are saying that you (Ramachandra K) are the > author of the patch, and that Poornima and Amar signed off on it (ie > forwarded it), but you as the person sending the email did not sign off > on it. > I will make sure to sign off on all patches. Should I also drop the From line for the patches which I developed, since I am mailing them myself? I am using the Signed-off-by line to indicate the people who were involved in the development of the patches at some stage. > > > > +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath) > > +{ > > + VNIC_FUNCTION("vnic_stop_xmit()\n"); > > + if (netpath == vnic->current_path) { > > + if (vnic->xmit_started) { > > + netif_stop_queue(vnic->netdevice); > > + vnic->xmit_started = 0; > > + } > > + > > + vnic_stop_xmit_stats(vnic); > > + } > > +} > > Do you have sufficient locking here? Could vnic->current_path or > vnic->xmit_started change after they are tested, leading to bad results? > Also do you get anything from having a xmit_started flag that you > couldn't get just by testing with netif_queue_stopped()? > You are right, xmit_started might not be required and we will look at the locking issue too. > > > > +extern cycles_t recv_ref; > > seems like too generic a name to make global. What the heck are you > using cycle_t to keep track of anyway?
> This is being used as part of the driver internal statistics collection to keep track of the time elapsed between a message arriving from the EVIC indicating that it has done an RDMA write of an Ethernet packet to the driver memory and the driver giving the packet to the network stack. Will fix the variable name. Regards, Ram From steiner at sgi.com Mon May 5 09:21:13 2008 From: steiner at sgi.com (Jack Steiner) Date: Mon, 5 May 2008 11:21:13 -0500 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <1489529e7b53d3f2dab8.1209740704@duo.random> References: <1489529e7b53d3f2dab8.1209740704@duo.random> Message-ID: <20080505162113.GA18761@sgi.com> On Fri, May 02, 2008 at 05:05:04PM +0200, Andrea Arcangeli wrote: > # HG changeset patch > # User Andrea Arcangeli > # Date 1209740175 -7200 > # Node ID 1489529e7b53d3f2dab8431372aa4850ec821caa > # Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8 > mmu-notifier-core I upgraded to the latest mmu notifier patch & hit a deadlock. (Sorry - I should have seen this earlier but I haven't tracked the last couple of patches). The GRU does the registration/deregistration of mmu notifiers from mmap/munmap. At this point, the mmap_sem is already held writeable. I hit a deadlock in mm_lock. A quick fix would be to do one of the following: - move the mmap_sem locking to the caller of the [de]registration routines. Since the first/last thing done in mm_lock/mm_unlock is to acquire/release mmap_sem, this change does not cause major changes. - add a flag to mmu_notifier_[un]register routines to indicate if mmap_sem is already locked. I've temporarily deleted the mm_lock locking of mmap_sem and am continuing to test. More later.... --- jack From hrosenstock at xsigo.com Mon May 5 09:22:13 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 05 May 2008 09:22:13 -0700 Subject: [ofa-general] [PATCHv2] ibsim/README: Clarify point of attachment/SIM_HOST use Message-ID: <1210004533.20493.255.camel@hrosenstock-ws.xsigo.com> Fix typo in original version of this patch Resend as got bounce on general list -- ibsim/README: Clarify point of attachment/SIM_HOST use Signed-off-by: Hal Rosenstock diff --git a/README b/README index f782fbe..b7615aa 100644 --- a/README +++ b/README @@ -90,6 +90,9 @@ Building and using ibsim. - in order to run OpenSM as non-privileged user you may need to export OSM_CACHE_DIR variable and to use '-f' option in order to specify writable path to OpenSM log file. + - Point of attachment is indicated by SIM_HOST environment variable. + If not specified, first entry in topology file is used. For OpenSM, + if -g option is used, it must be the same as this. 5. Enjoy and comment. From chu11 at llnl.gov Mon May 5 09:32:49 2008 From: chu11 at llnl.gov (Al Chu) Date: Mon, 05 May 2008 09:32:49 -0700 Subject: [ofa-general] [PATCH 0/4] opensm: Unicast Routing Cache In-Reply-To: <481D8B33.4000803@dev.mellanox.co.il> References: <481D888A.7080608@dev.mellanox.co.il> <481D8B33.4000803@dev.mellanox.co.il> Message-ID: <1210005169.11133.374.camel@cardanus.llnl.gov> Hey Yevgeny, This looks like a great idea. But is there a reason it's only supported for LMC=0? Since the caching is handled at the ucast-mgr level (rather than in the routing algorithm code), I don't quite see why LMC=0 matters. Maybe it is b/c of future incremental routing on your todo? If that's the case, instead of only caching when LMC=0, perhaps initial incremental routing should only work under LMC=0. Later on incremental routing for LMC > 0 could be added.
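For readers following the LMC question: with LMC l, each port answers to 2^l consecutive LIDs, so anything keyed per LID grows by that factor. The arithmetic, as a trivial illustrative helper (not from the patches):

static inline unsigned int lids_per_port(unsigned int lmc)
{
        return 1u << lmc;       /* LMC 0 -> 1 LID, LMC 7 -> 128 LIDs */
}

With LMC 0 every CA is a single target LID, which keeps a cached-topology comparison simple; with LMC > 0 each cached CA entry would have to cover a whole LID range.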
Al On Sun, 2008-05-04 at 13:08 +0300, Yevgeny Kliteynik wrote: > One thing I need to add here: ucast cache is currently supported > for LMC=0 only. > > -- Yevgeny > > Yevgeny Kliteynik wrote: > > Hi Sasha, > > > > The following series of 4 patches implements unicast routing cache > > in OpenSM. > > > > None of the current routing engines is scalable when we're talking > > about big clusters. On ~5K cluster with ~1.3K switches, it takes > > about two minutes to calculate the routing. The problem is, each > > time the routing is calculated from scratch. > > > > Incremental routing (which is on my to-do list) aims to address this > > problem when there is some "local" change in fabric (e.g. single > > switch failure, single link failure, link added, etc). > > In such cases we can use the routing that was already calculated in > > the previous heavy sweep, and then we just have to modify it according > > to the change. > > > > For instance, if some switch has disappeared from the fabric, we can > > use the routing that existed with this switch, take a step back from > > this switch and see if it is possible to route all the lids that were > > routed through this switch some other way (which is usually the case). > > > > To implement incremental routing, we need to create some kind of unicast > > routing cache, which is what these patches implement. In addition to being > > a step toward the incremental routing, routing cache is usefull by itself. > > > > This cache can save us routing calculation in case of change in the leaf > > switches or in hosts. For instance, if some node is rebooted, OpenSM would > > start a heavy sweep with full routing recalculation when the HCA is going > > down, and another one when HCA is brought up, when in fact both of these > > routing calculation can be replaced by using of unicast routing cache. > > > > Unicast routing cache comprises the following: > > - Topology: a data structure with all the switches and CAs of the fabric > > - LFTs: each switch has an LFT cached > > - Lid matrices: each switch has lid matrices cached, which is needed for > > multicast routing (which is not cached). > > > > There is a topology matching function that compares the current topology > > with the cached one to find out whether the cache is usable (valid) or not. > > > > The cache is used the following way: > > - SM is executed - it starts first routing calculation > > - calculated routing is stored in the cache > > - at some point new heavy sweep is triggered > > - unicast manager checks whether the cache can be used instead > > of new routing calculation. > > In one of the following cases we can use cached routing > > + there is no topology change > > + one or more CAs disappeared (they exist in the cached topology > > model, but missing in the newly discovered fabric) > > + one or more leaf switches disappeared > > In these cases cached routing is written to the switches as is > > (unless the switch doesn't exist). > > If there is any other topology change: > > - existing cache is invalidated > > - topology is cached > > - routing is calculated as usual > > - routing is cached > > > > My simulations show that when the usual routing phase of the heavy > > sweep on the topology that I mentioned above takes ~2 minutes, > > cached routing reduces this time to 6 seconds (which is nice, if you > > ask me...). 
> > > > Of all the cases when the cache is valid, the most painful and > > "complainable" case is when a compute node reboot (which happens pretty > > often) causes two heavy sweeps with two full routing calculations. > > Unicast Routing Cache is aimed to solve this problem (again, in addition > > to being a step toward the incremental routing). > > > > -- Yevgeny > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From chu11 at llnl.gov Mon May 5 09:39:08 2008 From: chu11 at llnl.gov (Al Chu) Date: Mon, 05 May 2008 09:39:08 -0700 Subject: [ofa-general] Re: [PATCH 2/4] opensm: adding ucast cache option In-Reply-To: <481D8944.10003@dev.mellanox.co.il> References: <481D8944.10003@dev.mellanox.co.il> Message-ID: <1210005548.11133.377.camel@cardanus.llnl.gov> Hey Yevgeny, Tiny nit, there is no manpage entry :-) Al On Sun, 2008-05-04 at 13:00 +0300, Yevgeny Kliteynik wrote: > Adding ucast cache option to OpenSM command line > arguments: -F or --ucast_cache. > > Signed-off-by: Yevgeny Kliteynik > --- > opensm/include/opensm/osm_subnet.h | 6 +++++- > opensm/opensm/main.c | 33 +++++++++++++++++++++++++++++++-- > opensm/opensm/osm_subnet.c | 11 ++++++++++- > 3 files changed, 46 insertions(+), 4 deletions(-) > > diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h > index b1dd659..cffbe5e 100644 > --- a/opensm/include/opensm/osm_subnet.h > +++ b/opensm/include/opensm/osm_subnet.h > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > * > * This software is available to you under a choice of one of two > @@ -256,6 +256,7 @@ typedef struct _osm_subn_opt { > boolean_t sweep_on_trap; > char *routing_engine_name; > boolean_t connect_roots; > + boolean_t use_ucast_cache; > char *lid_matrix_dump_file; > char *ucast_dump_file; > char *root_guid_file; > @@ -441,6 +442,9 @@ typedef struct _osm_subn_opt { > * up/down routing engine (even if this violates "pure" deadlock > * free up/down algorithm) > * > +* use_ucast_cache > +* When TRUE enables unicast routing cache. > +* > * lid_matrix_dump_file > * Name of the lid matrix dump file from where switch > * lid matrices (min hops tables) will be loaded > diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c > index fb41d50..71deacb 100644 > --- a/opensm/opensm/main.c > +++ b/opensm/opensm/main.c > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
> * > * This software is available to you under a choice of one of two > @@ -183,6 +183,17 @@ static void show_usage(void) > " and in this way be IBA compliant. In many cases,\n" > " this can violate \"pure\" deadlock free algorithm, so\n" > " use it carefully.\n\n"); > + printf("-F\n" > + "--ucast_cache\n" > + " This option enables unicast routing cache to prevent\n" > + " routing recalculation (which is a heavy task in a\n" > + " large cluster) when there was no topology change\n" > + " detected during the heavy sweep, or when the topology\n" > + " change does not require new routing calculation,\n" > + " e.g. in case of host reboot.\n" > + " This option becomes very handy when the cluster size\n" > + " is thousands of nodes.\n" > + " Unicast cache is not supported for LMC > 0.\n\n"); > printf("-M\n" > "--lid_matrix_file \n" > " This option specifies the name of the lid matrix dump file\n" > @@ -599,7 +610,7 @@ int main(int argc, char *argv[]) > char *ignore_guids_file_name = NULL; > uint32_t val; > const char *const short_option = > - "i:f:ed:g:l:L:s:t:a:u:m:R:zM:U:S:P:Y:NBIQvVhorcyxp:n:q:k:C:"; > + "i:f:ed:g:l:L:s:t:a:u:m:R:zM:U:S:P:Y:FNBIQvVhorcyxp:n:q:k:C:"; > > /* > In the array below, the 2nd parameter specifies the number > @@ -634,6 +645,7 @@ int main(int argc, char *argv[]) > {"smkey", 1, NULL, 'k'}, > {"routing_engine", 1, NULL, 'R'}, > {"connect_roots", 0, NULL, 'z'}, > + {"ucast_cache", 0, NULL, 'F'}, > {"lid_matrix_file", 1, NULL, 'M'}, > {"ucast_file", 1, NULL, 'U'}, > {"sadb_file", 1, NULL, 'S'}, > @@ -805,6 +817,12 @@ int main(int argc, char *argv[]) > "ERROR: LMC must be 7 or less."); > return (-1); > } > + if (opt.use_ucast_cache && temp > 0) { > + fprintf(stderr, > + "ERROR: Unicast routing cache is " > + "not supported for LMC > 0\n"); > + return (-1); > + } > opt.lmc = (uint8_t) temp; > printf(" LMC = %d\n", temp); > break; > @@ -891,6 +909,17 @@ int main(int argc, char *argv[]) > printf(" Connect roots option is on\n"); > break; > > + case 'F': > + if (opt.lmc > 0) { > + fprintf(stderr, > + "ERROR: Unicast routing cache is " > + "not supported for LMC > 0\n"); > + return (-1); > + } > + opt.use_ucast_cache = TRUE; > + printf(" Unicast routing cache option is on\n"); > + break; > + > case 'M': > opt.lid_matrix_dump_file = optarg; > printf(" Lid matrix dump file is \'%s\'\n", optarg); > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > index 47d735f..dc55e72 100644 > --- a/opensm/opensm/osm_subnet.c > +++ b/opensm/opensm/osm_subnet.c > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
> * > * This software is available to you under a choice of one of two > @@ -461,6 +461,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) > p_opt->sweep_on_trap = TRUE; > p_opt->routing_engine_name = NULL; > p_opt->connect_roots = FALSE; > + p_opt->use_ucast_cache = FALSE; > p_opt->lid_matrix_dump_file = NULL; > p_opt->ucast_dump_file = NULL; > p_opt->root_guid_file = NULL; > @@ -1290,6 +1291,9 @@ ib_api_status_t osm_subn_parse_conf_file(IN osm_subn_opt_t * const p_opts) > opts_unpack_boolean("connect_roots", > p_key, p_val, &p_opts->connect_roots); > > + opts_unpack_boolean("use_ucast_cache", > + p_key, p_val, &p_opts->use_ucast_cache); > + > opts_unpack_charp("log_file", p_key, p_val, &p_opts->log_file); > > opts_unpack_uint32("log_max_size", > @@ -1543,6 +1547,11 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) > "# Connect roots (use FALSE if unsure)\n" > "connect_roots %s\n\n", > p_opts->connect_roots ? "TRUE" : "FALSE"); > + if (p_opts->use_ucast_cache) > + fprintf(opts_file, > + "# Use unicast routing cache (use FALSE if unsure)\n" > + "use_ucast_cache %s\n\n", > + p_opts->use_ucast_cache ? "TRUE" : "FALSE"); > if (p_opts->lid_matrix_dump_file) > fprintf(opts_file, > "# Lid matrix dump file name\n" -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From andrea at qumranet.com Mon May 5 10:14:34 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Mon, 5 May 2008 19:14:34 +0200 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080505162113.GA18761@sgi.com> References: <1489529e7b53d3f2dab8.1209740704@duo.random> <20080505162113.GA18761@sgi.com> Message-ID: <20080505171434.GF8470@duo.random> On Mon, May 05, 2008 at 11:21:13AM -0500, Jack Steiner wrote: > The GRU does the registration/deregistration of mmu notifiers from mmap/munmap. > At this point, the mmap_sem is already held writeable. I hit a deadlock > in mm_lock. It'd been better to know about this detail earlier, but frankly this is a minor problem, the important thing is we all agree together on the more difficult parts ;). > A quick fix would be to do one of the following: > > - move the mmap_sem locking to the caller of the [de]registration routines. > Since the first/last thing done in mm_lock/mm_unlock is to > acquire/release mmap_sem, this change does not cause major changes. I don't like this solution very much. Neither GRU nor KVM will call mmu_notifier_register inside the mmap_sem protected sections, so I think the default mmu_notifier_register should be smp safe by itself without requiring additional locks to be artificially taken externally (especially because the need for mmap_sem in write mode is a very mmu_notifier internal detail). > - add a flag to mmu_notifier_[un]register routines to indicate > if mmap_sem is already locked. The interface would change like this: #define MMU_NOTIFIER_REGISTER_MMAP_SEM (1<<0) void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long mmu_notifier_flags); A third solution is to add: /* * This can be called instead of mmu_notifier_register after * taking the mmap_sem in write mode (read mode isn't enough). */ void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm); Do you still prefer the bitflag, or do you prefer __mmu_notifier_register? It's ok either way, except __mmu_notifier_register could be removed in a backwards-compatible way, the bitflag can't.
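Spelled out, the third option might look roughly like this (an illustration of the intent, not the actual patch):

void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm);

void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
{
        /* the plain variant takes mmap_sem itself; callers like GRU
         * that already hold it for writing use the __ variant directly */
        down_write(&mm->mmap_sem);
        __mmu_notifier_register(mn, mm);
        up_write(&mm->mmap_sem);
}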
> I've temporarily deleted the mm_lock locking of mmap_sem and am continuing to > test. More later.... Sure! In the meantime go ahead this way. Another very minor change I've been thinking about is to make ->release not mandatory. It happens that with KVM ->release isn't strictly required because after mm_users reaches 0, no guest could possibly run anymore. So I'm using ->release only for debugging by placing -1UL in the root shadow pagetable, to be sure ;). So because at least one user won't strictly require ->release, being consistent in having all methods optional may be nicer. Alternatively we could make them all mandatory and if somebody doesn't need one of the methods it should implement it as a dummy function. Both ways have pros and cons, but they don't make any difference to us in practice. If I've to change the patch for the mmap_sem taken during registration I may as well clean up this minor bit. Also note the rculist.h patch you sent earlier won't work against mainline so I can't incorporate it in my patchset; Andrew will have to apply it as mmu-notifier-core-mm after incorporating mmu-notifier-core into -mm. Until a new update is released, mmu-notifier-core v15 remains ok for merging, no known bugs, here we're talking about a new and simple feature and a tiny cleanup that nobody can notice anyway. From steiner at sgi.com Mon May 5 10:25:06 2008 From: steiner at sgi.com (Jack Steiner) Date: Mon, 5 May 2008 12:25:06 -0500 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080505171434.GF8470@duo.random> References: <1489529e7b53d3f2dab8.1209740704@duo.random> <20080505162113.GA18761@sgi.com> <20080505171434.GF8470@duo.random> Message-ID: <20080505172506.GA9247@sgi.com> On Mon, May 05, 2008 at 07:14:34PM +0200, Andrea Arcangeli wrote: > On Mon, May 05, 2008 at 11:21:13AM -0500, Jack Steiner wrote: > > The GRU does the registration/deregistration of mmu notifiers from mmap/munmap. > > At this point, the mmap_sem is already held writeable. I hit a deadlock > > in mm_lock. > > It'd been better to know about this detail earlier, Agree. My apologies... I should have caught it. > but frankly this > is a minor problem, the important thing is we all agree together on > the more difficult parts ;). > > > A quick fix would be to do one of the following: > > > > - move the mmap_sem locking to the caller of the [de]registration routines. > > Since the first/last thing done in mm_lock/mm_unlock is to > > acquire/release mmap_sem, this change does not cause major changes. > > I don't like this solution very much. Neither GRU nor KVM will call > mmu_notifier_register inside the mmap_sem protected sections, so I > think the default mmu_notifier_register should be smp safe by itself > without requiring additional locks to be artificially taken externally > (especially because the need for mmap_sem in write mode is a very > mmu_notifier internal detail). > > > - add a flag to mmu_notifier_[un]register routines to indicate > > if mmap_sem is already locked. > > The interface would change like this: > > #define MMU_NOTIFIER_REGISTER_MMAP_SEM (1<<0) > void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm, > unsigned long mmu_notifier_flags); That works... > > A third solution is to add: > > /* > * This can be called instead of mmu_notifier_register after > * taking the mmap_sem in write mode (read mode isn't enough).
> */ > void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm); > > Do you still prefer the bitflag, or do you prefer > __mmu_notifier_register? It's ok either way, except > __mmu_notifier_register could be removed in a backwards-compatible > way, the bitflag can't. __mmu_notifier_register/__mmu_notifier_unregister seems like a better way to go, although either is ok. > > Sure! In the meantime go ahead this way. > > Another very minor change I've been thinking about is to make > ->release not mandatory. It happens that with KVM ->release isn't > strictly required because after mm_users reaches 0, no guest could > possibly run anymore. So I'm using ->release only for debugging by > placing -1UL in the root shadow pagetable, to be sure ;). So because > at least one user won't strictly require ->release, being consistent in > having all methods optional may be nicer. Alternatively we could make > them all mandatory and if somebody doesn't need one of the methods it > should implement it as a dummy function. Both ways have pros and cons, > but they don't make any difference to us in practice. If I've to > change the patch for the mmap_sem taken during registration I may as > well clean up this minor bit. Let me finish my testing. At one time, I did not use ->release but with all the locking & teardown changes, I need to do some reverification. --- jack From andrea at qumranet.com Mon May 5 11:34:05 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Mon, 5 May 2008 20:34:05 +0200 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080505172506.GA9247@sgi.com> References: <1489529e7b53d3f2dab8.1209740704@duo.random> <20080505162113.GA18761@sgi.com> <20080505171434.GF8470@duo.random> <20080505172506.GA9247@sgi.com> Message-ID: <20080505183405.GI8470@duo.random> On Mon, May 05, 2008 at 12:25:06PM -0500, Jack Steiner wrote: > Agree. My apologies... I should have caught it. No problem. > > __mmu_notifier_register/__mmu_notifier_unregister seems like a better way to > > go, although either is ok. > > If you also like __mmu_notifier_register more I'll go with it. The bitflags seem like a bit of overkill as I can't see the need of any other bitflag other than this one and they also can't be removed as easily in case you'll find a way to call it outside the lock later. > Let me finish my testing. At one time, I did not use ->release but > with all the locking & teardown changes, I need to do some reverification. If you didn't implement it you shall apply this patch but you shall read carefully the comment I wrote that covers that usage case. diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -29,10 +29,25 @@ struct mmu_notifier_ops { /* * Called either by mmu_notifier_unregister or when the mm is * being destroyed by exit_mmap, always before all pages are - * freed. It's mandatory to implement this method. This can - * run concurrently with other mmu notifier methods and it + * freed. This can run concurrently with other mmu notifier + * methods (the ones invoked outside the mm context) and it * should tear down all secondary mmu mappings and freeze the - * secondary mmu. + * secondary mmu.
If this method isn't implemented you've to + * be sure that nothing could possibly write to the pages + * through the secondary mmu by the time the last thread with + * tsk->mm == mm exits. + * + * As side note: the pages freed after ->release returns could + * be immediately reallocated by the gart at an alias physical + * address with a different cache model, so if ->release isn't + * implemented because all memory accesses through the + * secondary mmu implicitly are terminated by the time the + * last thread of this mm quits, you've also to be sure that + * speculative hardware operations can't allocate dirty + * cachelines in the cpu that could not be snooped and made + * coherent with the other read and write operations happening + * through the gart alias address, leading to memory + * corruption. */ void (*release)(struct mmu_notifier *mn, struct mm_struct *mm); diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -59,7 +59,8 @@ void __mmu_notifier_release(struct mm_st * from establishing any more sptes before all the * pages in the mm are freed. */ - mn->ops->release(mn, mm); + if (mn->ops->release) + mn->ops->release(mn, mm); srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); spin_lock(&mm->mmu_notifier_mm->lock); } @@ -251,7 +252,8 @@ void mmu_notifier_unregister(struct mmu_ * guarantee ->release is called before freeing the * pages. */ - mn->ops->release(mn, mm); + if (mn->ops->release) + mn->ops->release(mn, mm); srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); } else spin_unlock(&mm->mmu_notifier_mm->lock); From steiner at sgi.com Mon May 5 12:46:25 2008 From: steiner at sgi.com (Jack Steiner) Date: Mon, 5 May 2008 14:46:25 -0500 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080505183405.GI8470@duo.random> References: <1489529e7b53d3f2dab8.1209740704@duo.random> <20080505162113.GA18761@sgi.com> <20080505171434.GF8470@duo.random> <20080505172506.GA9247@sgi.com> <20080505183405.GI8470@duo.random> Message-ID: <20080505194625.GA17734@sgi.com> On Mon, May 05, 2008 at 08:34:05PM +0200, Andrea Arcangeli wrote: > On Mon, May 05, 2008 at 12:25:06PM -0500, Jack Steiner wrote: > > Agree. My apologies... I should have caught it. > > No problem. > > > __mmu_notifier_register/__mmu_notifier_unregister seems like a better way to > > go, although either is ok. > > If you also like __mmu_notifier_register more I'll go with it. The > bitflags seems like a bit of overkill as I can't see the need of any > other bitflag other than this one and they also can't be removed as > easily in case you'll find a way to call it outside the lock later. > > > Let me finish my testing. At one time, I did not use ->release but > > with all the locking & teardown changes, I need to do some reverification. I finished testing & everything looks good. I do use the ->release callout but mainly as a performance hint that teardown is in progress & that TLB flushing is no longer needed. (GRU TLB entries are tagged with a task-specific ID that will not be reused until a full TLB purge is done. This eliminates the requirement to purge at task-exit.) Normally, a notifier is registered when a GRU segment is mmaped, and unregistered when the segment is unmapped. Well behaved tasks will not have a GRU or a notifier when exit starts. If a task fails to unmap a GRU segment, they still exist at the start of exit. On the ->release callout, I set a flag in the container of my mmu_notifier that exit has started. 
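Sketched in code, that container arrangement might look like the following (struct and field names are illustrative, based on the description here rather than on the attached file):

struct gru_mm_struct {
        struct mmu_notifier     ms_notifier;
        atomic_t                ms_refcnt;      /* shared by all GRU segments of the mm */
        int                     ms_released;    /* set once ->release fires */
};

static void gru_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
        struct gru_mm_struct *gms =
                container_of(mn, struct gru_mm_struct, ms_notifier);

        gms->ms_released = 1;   /* later TLB flushes become no-ops */
}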
As VMA are cleaned up, TLB flushes are skipped because the flag is set. When the GRU VMA is deleted, I free my structure containing the notifier. I _think_ it works. Do you see any problems? I should also mention that I have an open-coded function that possibly belongs in mmu_notifier.c. A user is allowed to have multiple GRU segments. Each GRU has a couple of data structures linked to the VMA. All, however, need to share the same notifier. I currently open code a function that scans the notifier list to determine if a GRU notifier already exists. If it does, I update a refcnt & use it. Otherwise, I register a new one. All of this is protected by the mmap_sem. Just in case I mangled the above description, I'll attach a copy of the GRU mmuops code. --- jack -------------- next part -------------- A non-text attachment was scrubbed... Name: z Type: application/x-compress Size: 2662 bytes Desc: not available URL: From rdreier at cisco.com Mon May 5 13:28:59 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 May 2008 13:28:59 -0700 Subject: [ofa-general] Re: [PATCH] mlx4_core: Support creation of FMRs with pages smaller than 4K In-Reply-To: <200805051820.49796.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 5 May 2008 18:20:49 +0300") References: <200805051820.49796.jackm@dev.mellanox.co.il> Message-ID: > The device minimum page size should be taken from the device > capabilities, and not hard-coded. This hard-coding has led to problems with new mlx4 firmware. Please don't expect me to guess what kind of problems... what changed with new firmware, what breaks, and why? From rdreier at cisco.com Mon May 5 13:42:15 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 May 2008 13:42:15 -0700 Subject: [ofa-general] Re: [PATCH 01/13] QLogic VNIC: Driver - netdev implementation In-Reply-To: <71d336490805050836o68d745f2k4bab68edcfe1da50@mail.gmail.com> (Ramachandra K.'s message of "Mon, 5 May 2008 21:06:58 +0530") References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430171624.31725.98475.stgit@localhost.localdomain> <71d336490805050836o68d745f2k4bab68edcfe1da50@mail.gmail.com> Message-ID: > I will make sure to sign off on all patches. Should I also drop the From line > for the patches which I developed, since I am mailing them myself? It doesn't hurt to include a From: line if it is the same as the one for the email itself, but it isn't necessary. When I import a patch the last From: line will be used. > I am using the Signed-off-by line to indicate the people who were > involved in the development of the patches at some stage. That's fine. You can read Documentation/SubmittingPatches to see the precise legal meaning of Signed-off-by, and make sure that it applies to everyone whose signoff you are including. You can also add less formal text like "X helped develop this patch" in the changelog entry. > > > +extern cycles_t recv_ref; > > > > seems like too generic a name to make global. What the heck are you > > using cycle_t to keep track of anyway? > > > > This is being used as part of the driver internal statistics > collection to keep track of the time > elapsed between a message arriving from the EVIC indicating that it > has done an RDMA write of > an Ethernet packet to the driver memory and the driver giving the packet > to the network stack. cycles don't track time (eg x86 TSC might stop for a while). Do you *really* need to use cycles, or are jiffies a better replacement? - R.
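A sketch of the jiffies-based alternative (names invented for illustration, not taken from the QLogic driver):

#include <linux/jiffies.h>

static unsigned long vnic_recv_stamp;

static void vnic_stamp_recv(void)
{
        vnic_recv_stamp = jiffies;      /* EVIC completion arrived */
}

static unsigned int vnic_recv_latency_ms(void)
{
        /* called when the packet is handed to the network stack */
        return jiffies_to_msecs(jiffies - vnic_recv_stamp);
}

The tradeoff is resolution: jiffies advances at HZ (4 ms per tick at HZ=250), while a cycle counter is fine-grained but, as noted, not a dependable time base.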
From kliteyn at dev.mellanox.co.il Mon May 5 13:52:28 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 05 May 2008 23:52:28 +0300 Subject: [ofa-general] [PATCH 0/4] opensm: Unicast Routing Cache In-Reply-To: <1210005169.11133.374.camel@cardanus.llnl.gov> References: <481D888A.7080608@dev.mellanox.co.il> <481D8B33.4000803@dev.mellanox.co.il> <1210005169.11133.374.camel@cardanus.llnl.gov> Message-ID: <481F738C.3040804@dev.mellanox.co.il> Al Chu wrote: > Hey Yevgeny, > > This looks like a great idea. But is there a reason its only supported > for LMC=0? Since the caching is handled at the ucast-mgr level (rather > than in the routing algorithm code), I don't quite see why LMC=0 > matters. No particular reason - I'll enhance it for LMC>0, just didn't find the time to do it right now. The cached topology model is based on LIDs, so I just need to check that LMC>0 doesn't break anything. I also had a more complex topology and routing model, where I wasn't relying on LIDs - I had what I called "Virtual LIDs", and at every heavy sweep the topology model was built and Virtual LIDs were matched to LIDs to create VLID <-> LID mapping, so that the cache won't depend on fabric LIDs, and there I had some problems with LMC (can't remember what exactly), but that model proved to be useless. > Maybe it is b/c of future incremental routing on your todo? If that's > the case, instead of only caching when LMC=0, perhaps initial > incremental routing should only work under LMC=0. Later on incremental > routing for LMC > 0 could be added. Agree, that is what I eventually should do. -- Yevgeny > Al > > On Sun, 2008-05-04 at 13:08 +0300, Yevgeny Kliteynik wrote: >> One thing I need to add here: ucast cache is currently supported >> for LMC=0 only. >> >> -- Yevgeny >> >> Yevgeny Kliteynik wrote: >>> Hi Sasha, >>> >>> The following series of 4 patches implements unicast routing cache >>> in OpenSM. >>> >>> None of the current routing engines is scalable when we're talking >>> about big clusters. On ~5K cluster with ~1.3K switches, it takes >>> about two minutes to calculate the routing. The problem is, each >>> time the routing is calculated from scratch. >>> >>> Incremental routing (which is on my to-do list) aims to address this >>> problem when there is some "local" change in fabric (e.g. single >>> switch failure, single link failure, link added, etc). >>> In such cases we can use the routing that was already calculated in >>> the previous heavy sweep, and then we just have to modify it according >>> to the change. >>> >>> For instance, if some switch has disappeared from the fabric, we can >>> use the routing that existed with this switch, take a step back from >>> this switch and see if it is possible to route all the lids that were >>> routed through this switch some other way (which is usually the case). >>> >>> To implement incremental routing, we need to create some kind of unicast >>> routing cache, which is what these patches implement. In addition to being >>> a step toward the incremental routing, routing cache is usefull by itself. >>> >>> This cache can save us routing calculation in case of change in the leaf >>> switches or in hosts. For instance, if some node is rebooted, OpenSM would >>> start a heavy sweep with full routing recalculation when the HCA is going >>> down, and another one when HCA is brought up, when in fact both of these >>> routing calculation can be replaced by using of unicast routing cache. 
>>> >>> Unicast routing cache comprises the following: >>> - Topology: a data structure with all the switches and CAs of the fabric >>> - LFTs: each switch has an LFT cached >>> - Lid matrices: each switch has lid matrices cached, which is needed for >>> multicast routing (which is not cached). >>> >>> There is a topology matching function that compares the current topology >>> with the cached one to find out whether the cache is usable (valid) or not. >>> >>> The cache is used the following way: >>> - SM is executed - it starts first routing calculation >>> - calculated routing is stored in the cache >>> - at some point new heavy sweep is triggered >>> - unicast manager checks whether the cache can be used instead >>> of new routing calculation. >>> In one of the following cases we can use cached routing >>> + there is no topology change >>> + one or more CAs disappeared (they exist in the cached topology >>> model, but missing in the newly discovered fabric) >>> + one or more leaf switches disappeared >>> In these cases cached routing is written to the switches as is >>> (unless the switch doesn't exist). >>> If there is any other topology change: >>> - existing cache is invalidated >>> - topology is cached >>> - routing is calculated as usual >>> - routing is cached >>> >>> My simulations show that when the usual routing phase of the heavy >>> sweep on the topology that I mentioned above takes ~2 minutes, >>> cached routing reduces this time to 6 seconds (which is nice, if you >>> ask me...). >>> >>> Of all the cases when the cache is valid, the most painful and >>> "complainable" case is when a compute node reboot (which happens pretty >>> often) causes two heavy sweeps with two full routing calculations. >>> Unicast Routing Cache is aimed to solve this problem (again, in addition >>> to being a step toward the incremental routing). >>> >>> -- Yevgeny >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Mon May 5 14:12:39 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 05 May 2008 14:12:39 -0700 Subject: [ofa-general] [PATCH 0/4] opensm: Unicast Routing Cache In-Reply-To: <481D888A.7080608@dev.mellanox.co.il> References: <481D888A.7080608@dev.mellanox.co.il> Message-ID: <1210021959.27137.89.camel@hrosenstock-ws.xsigo.com> On Sun, 2008-05-04 at 12:57 +0300, Yevgeny Kliteynik wrote: > My simulations show that when the usual routing phase of the heavy > sweep on the topology that I mentioned above takes ~2 minutes, > cached routing reduces this time to 6 seconds (which is nice, if you > ask me...). Cool! 
-- Hal From hrosenstock at xsigo.com Mon May 5 14:12:41 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 05 May 2008 14:12:41 -0700 Subject: [ofa-general] Re: [PATCH 1/4] opensm/osm_ucast_cache.{c, h}: ucast routing cache implementation In-Reply-To: <481D8905.1010207@dev.mellanox.co.il> References: <481D8905.1010207@dev.mellanox.co.il> Message-ID: <1210021961.27137.90.camel@hrosenstock-ws.xsigo.com> Hi Yevgeny, On Sun, 2008-05-04 at 12:59 +0300, Yevgeny Kliteynik wrote: I haven't yet had a chance to review this in detail but think that router ports need to be accommodated in the subnet (I think this is a firm requirement as router ports on the subnet are already supported) and also think that nothing should be introduced precluding the running of OpenSM on a router port. From the latter standpoint, it looks much like a CA port. -- Hal From kliteyn at dev.mellanox.co.il Mon May 5 14:35:28 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 06 May 2008 00:35:28 +0300 Subject: [ofa-general] Re: [PATCH 2/4] opensm: adding ucast cache option In-Reply-To: <1210005548.11133.377.camel@cardanus.llnl.gov> References: <481D8944.10003@dev.mellanox.co.il> <1210005548.11133.377.camel@cardanus.llnl.gov> Message-ID: <481F7DA0.6090707@dev.mellanox.co.il> Al Chu wrote: > Hey Yevgeny, > > Tiny nit, there is no manpage entry :-) Right, thanks :) -- Yevgeny > Al > > On Sun, 2008-05-04 at 13:00 +0300, Yevgeny Kliteynik wrote: >> Adding ucast cache option to OpenSM command line >> arguments: -F or --ucast_cache. >> >> Signed-off-by: Yevgeny Kliteynik >> --- >> opensm/include/opensm/osm_subnet.h | 6 +++++- >> opensm/opensm/main.c | 33 +++++++++++++++++++++++++++++++-- >> opensm/opensm/osm_subnet.c | 11 ++++++++++- >> 3 files changed, 46 insertions(+), 4 deletions(-) >> >> diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h >> index b1dd659..cffbe5e 100644 >> --- a/opensm/include/opensm/osm_subnet.h >> +++ b/opensm/include/opensm/osm_subnet.h >> @@ -1,6 +1,6 @@ >> /* >> * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. >> - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. >> + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. >> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>> * >> * This software is available to you under a choice of one of two >> @@ -183,6 +183,17 @@ static void show_usage(void) >> " and in this way be IBA compliant. In many cases,\n" >> " this can violate \"pure\" deadlock free algorithm, so\n" >> " use it carefully.\n\n"); >> + printf("-F\n" >> + "--ucast_cache\n" >> + " This option enables unicast routing cache to prevent\n" >> + " routing recalculation (which is a heavy task in a\n" >> + " large cluster) when there was no topology change\n" >> + " detected during the heavy sweep, or when the topology\n" >> + " change does not require new routing calculation,\n" >> + " e.g. in case of host reboot.\n" >> + " This option becomes very handy when the cluster size\n" >> + " is thousands of nodes.\n" >> + " Unicast cache is not supported for LMC > 0.\n\n"); >> printf("-M\n" >> "--lid_matrix_file \n" >> " This option specifies the name of the lid matrix dump file\n" >> @@ -599,7 +610,7 @@ int main(int argc, char *argv[]) >> char *ignore_guids_file_name = NULL; >> uint32_t val; >> const char *const short_option = >> - "i:f:ed:g:l:L:s:t:a:u:m:R:zM:U:S:P:Y:NBIQvVhorcyxp:n:q:k:C:"; >> + "i:f:ed:g:l:L:s:t:a:u:m:R:zM:U:S:P:Y:FNBIQvVhorcyxp:n:q:k:C:"; >> >> /* >> In the array below, the 2nd parameter specifies the number >> @@ -634,6 +645,7 @@ int main(int argc, char *argv[]) >> {"smkey", 1, NULL, 'k'}, >> {"routing_engine", 1, NULL, 'R'}, >> {"connect_roots", 0, NULL, 'z'}, >> + {"ucast_cache", 0, NULL, 'F'}, >> {"lid_matrix_file", 1, NULL, 'M'}, >> {"ucast_file", 1, NULL, 'U'}, >> {"sadb_file", 1, NULL, 'S'}, >> @@ -805,6 +817,12 @@ int main(int argc, char *argv[]) >> "ERROR: LMC must be 7 or less."); >> return (-1); >> } >> + if (opt.use_ucast_cache && temp > 0) { >> + fprintf(stderr, >> + "ERROR: Unicast routing cache is " >> + "not supported for LMC > 0\n"); >> + return (-1); >> + } >> opt.lmc = (uint8_t) temp; >> printf(" LMC = %d\n", temp); >> break; >> @@ -891,6 +909,17 @@ int main(int argc, char *argv[]) >> printf(" Connect roots option is on\n"); >> break; >> >> + case 'F': >> + if (opt.lmc > 0) { >> + fprintf(stderr, >> + "ERROR: Unicast routing cache is " >> + "not supported for LMC > 0\n"); >> + return (-1); >> + } >> + opt.use_ucast_cache = TRUE; >> + printf(" Unicast routing cache option is on\n"); >> + break; >> + >> case 'M': >> opt.lid_matrix_dump_file = optarg; >> printf(" Lid matrix dump file is \'%s\'\n", optarg); >> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c >> index 47d735f..dc55e72 100644 >> --- a/opensm/opensm/osm_subnet.c >> +++ b/opensm/opensm/osm_subnet.c >> @@ -1,6 +1,6 @@ >> /* >> * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. >> - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. >> + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. >> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
>> * >> * This software is available to you under a choice of one of two >> @@ -461,6 +461,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) >> p_opt->sweep_on_trap = TRUE; >> p_opt->routing_engine_name = NULL; >> p_opt->connect_roots = FALSE; >> + p_opt->use_ucast_cache = FALSE; >> p_opt->lid_matrix_dump_file = NULL; >> p_opt->ucast_dump_file = NULL; >> p_opt->root_guid_file = NULL; >> @@ -1290,6 +1291,9 @@ ib_api_status_t osm_subn_parse_conf_file(IN osm_subn_opt_t * const p_opts) >> opts_unpack_boolean("connect_roots", >> p_key, p_val, &p_opts->connect_roots); >> >> + opts_unpack_boolean("use_ucast_cache", >> + p_key, p_val, &p_opts->use_ucast_cache); >> + >> opts_unpack_charp("log_file", p_key, p_val, &p_opts->log_file); >> >> opts_unpack_uint32("log_max_size", >> @@ -1543,6 +1547,11 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) >> "# Connect roots (use FALSE if unsure)\n" >> "connect_roots %s\n\n", >> p_opts->connect_roots ? "TRUE" : "FALSE"); >> + if (p_opts->use_ucast_cache) >> + fprintf(opts_file, >> + "# Use unicast routing cache (use FALSE if unsure)\n" >> + "use_ucast_cache %s\n\n", >> + p_opts->use_ucast_cache ? "TRUE" : "FALSE"); >> if (p_opts->lid_matrix_dump_file) >> fprintf(opts_file, >> "# Lid matrix dump file name\n" -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From Nathan.Dauchy at noaa.gov Mon May 5 14:49:29 2008 From: Nathan.Dauchy at noaa.gov (Nathan Dauchy) Date: Mon, 05 May 2008 15:49:29 -0600 Subject: [ofa-general] Install IPoIB separately .. In-Reply-To: <481EABAA.6080105@dev.mellanox.co.il> References: <829ded920805042155r20df506l2ced783586664b@mail.gmail.com> <481EABAA.6080105@dev.mellanox.co.il> Message-ID: <481F80E9.9010009@noaa.gov> Vladimir Sokolovsky wrote: > Keshetti Mahesh wrote: >> While installing OFED-1.3 on a machine, I forgot to select the IPoIB >> module. >> Now, is it possible to build IPoIB module separately and install it >> without >> affecting the earlier installation? >> >> -Mahesh > > Not with OFED-1.3 installation script. > If you run install.pl, then it will: > 1. Uninstall the current OFED installation. > 2. Rebuild and install kernel-ib RPM (with IPoIB). > 3. Install other binary RPMs following your selection using binary RPMs > that were created during the previous install. > > Regards, > Vladimir This coupling of install and build steps complicates life for users and seems like a step backwards from OFED-1.2. From the "OFED Aug 13 meeting summary", this change was made in part because the previous build method and manner of handling dependencies did not follow standard RPM usage. I don't think that uninstalling multiple RPMs and rebuilding them in order to add another RPM is standard RPM usage either. Can this be put on the "to-do" list for OFED-1.3.2? Is "install.pl" the only way to reliably build OFED-1.3?
Thanks, Nathan From kliteyn at dev.mellanox.co.il Mon May 5 14:58:13 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 06 May 2008 00:58:13 +0300 Subject: [ofa-general] Re: [PATCH 1/4] opensm/osm_ucast_cache.{c, h}: ucast routing cache implementation In-Reply-To: <1210021961.27137.90.camel@hrosenstock-ws.xsigo.com> References: <481D8905.1010207@dev.mellanox.co.il> <1210021961.27137.90.camel@hrosenstock-ws.xsigo.com> Message-ID: <481F82F5.8030505@dev.mellanox.co.il> Hal Rosenstock wrote: > Hi Yevgeny, > > On Sun, 2008-05-04 at 12:59 +0300, Yevgeny Kliteynik wrote: > > I haven't yet had a chance to review this in detail but think that > router ports need to be accommodated in the subnet (I think this is a > firm requirement as router ports on the subnet are already supported) > and also think that nothing should be introduced precluding the running > of OpenSM on a router port. From the latter standpoint, it looks much > like a CA port. This is exactly how I implemented it - any non-switch port is treated as CA, which is just a target LID. Well, I mean I intended to implement it that way - I reviewed it again, and it appears that the cache is fine with routing to routers and running from switches, but there will be a problem when SM runs on a router node - cache will complain and fall back to usual routing. That can be easily fixed. However, I'm not sure how the OpenSM will behave in general when running from switch or router - I've never tried it. Has anyone tried it? -- Yevgeny > -- Hal > > From hrosenstock at xsigo.com Mon May 5 15:20:44 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 05 May 2008 15:20:44 -0700 Subject: [ofa-general] Re: [PATCH 1/4] opensm/osm_ucast_cache.{c, h}: ucast routing cache implementation In-Reply-To: <481F82F5.8030505@dev.mellanox.co.il> References: <481D8905.1010207@dev.mellanox.co.il> <1210021961.27137.90.camel@hrosenstock-ws.xsigo.com> <481F82F5.8030505@dev.mellanox.co.il> Message-ID: <1210026044.27137.121.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-06 at 00:58 +0300, Yevgeny Kliteynik wrote: > Hal Rosenstock wrote: > > Hi Yevgeny, > > > > On Sun, 2008-05-04 at 12:59 +0300, Yevgeny Kliteynik wrote: > > > > I haven't yet had a chance to review this in detail but think that > > router ports need to be accommodated in the subnet (I think this is a > > firm requirement as router ports on the subnet are already supported) > > and also think that nothing should be introduced precluding the running > > of OpenSM on a router port. From the latter standpoint, it looks much > > like a CA port. > > This is exactly how I implemented it - any non-switch port is > treated as CA, which is just a target LID. > > Well, I mean I intended to implement it that way - I reviewed it again, > and it appears that the cache is fine with routing to routers and running > from switches, then it's just the variable names which indicate ca if routers are grouped with cas. > but there will be a problem when SM runs on a router node - > cache will complain and fall back to usual routing. > That can be easily fixed. Right; the one thing I saw was in _ucast_cache_get_starting_osm_sw where routers were not supported. I think a one-line change is all that's needed there. Not sure if there are other places. > However, I'm not sure how the OpenSM will behave in general when running > from switch or router - I've never tried it. Has anyone tried it? I'm not sure either but would be interested to hear.
I think there are some using it on switch port 0 and also think others have tried it on routers. In terms of switches, it used to work and there is some support in the vendor directory for this. -- Hal > -- Yevgeny > > > -- Hal > > > > > From rdreier at cisco.com Mon May 5 15:54:49 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 May 2008 15:54:49 -0700 Subject: [ofa-general] [PATCH] IB/core: handle race between elements in work queues after event In-Reply-To: <4819BD29.7080002@Voltaire.COM> (Moni Shoua's message of "Thu, 01 May 2008 15:52:57 +0300") References: <48187E5A.7040809@Voltaire.COM> <4819BD29.7080002@Voltaire.COM> Message-ID: > For consumers, the patch doesn't make things worse. Before the patch > mads are sent to the wrong SM and now they are blocked before > they are sent. Consumers can be improved if they examine the return > code and respond to EAGAIN properly but even without an improvement > the situation is not getting worse and in some cases it gets better. I guess I can believe things don't get worse but I still don't know how this makes things better. With the current code the request is lost because it goes to the wrong SM; with the new code the request is failed by the SA layer. So in both cases the consumer just has to try again. So is there some practical benefit we see by adding this code? - R. From rdreier at cisco.com Mon May 5 15:55:33 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 May 2008 15:55:33 -0700 Subject: [ofa-general] Re: [PATCH] mlx4_core: Support creation of FMRs with pages smaller than 4K In-Reply-To: <200805051820.49796.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 5 May 2008 18:20:49 +0300") References: <200805051820.49796.jackm@dev.mellanox.co.il> Message-ID: > mlx4_core: Support creation of FMRs with pages smaller than 4K never mind, the subject makes sense now. applied. From rdreier at cisco.com Mon May 5 16:00:18 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 May 2008 16:00:18 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get fixes for: - mlx4 breakage introduced (by me) with the CQ resize support - mlx4 breakage with new FW that supports smaller adapter pages - IPoIB breakage introduced with the separate send CQ support - longstanding cxgb3 breakage uncovered by stress testing - ehca minor messiness Eli Cohen (1): IB/ipoib: Fix transmit queue stalling forever Oren Duer (1): mlx4_core: Support creation of FMRs with pages smaller than 4K Roland Dreier (1): IB/mlx4: Fix off-by-one errors in calls to mlx4_ib_free_cq_buf() Stefan Roscher (1): IB/ehca: Fix function return types Steve Wise (3): RDMA/cxgb3: QP flush fixes RDMA/cxgb3: Silently ignore close reply after abort. RDMA/cxgb3: Bump up the MPA connection setup timeout.
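The IPoIB commit in this pull ("IB/ipoib: Fix transmit queue stalling forever") uses the common "stop the queue, arm the send CQ, wake on completion" pattern. A minimal sketch of the pattern, with hypothetical names (drv_priv, drv_xmit, drv_send_comp_handler, poll_tx and sendq_size are placeholders, not the actual IPoIB code; the real hunks are in ipoib_ib.c below):

	static int drv_xmit(struct sk_buff *skb, struct net_device *dev)
	{
		struct drv_priv *priv = netdev_priv(dev);

		if (++priv->tx_outstanding == priv->sendq_size) {
			/*
			 * The ring just became full: request a completion
			 * event before stopping the queue, so a completion
			 * racing with the stop still generates a wakeup.
			 */
			if (ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP))
				printk(KERN_WARNING "%s: request notify on send CQ failed\n",
				       dev->name);
			netif_stop_queue(dev);
		}

		/* ... post the send WR here; undo the accounting on failure ... */

		return 0;
	}

	/* completion handler registered with ib_create_cq() for the send CQ */
	static void drv_send_comp_handler(struct ib_cq *cq, void *dev_ptr)
	{
		struct net_device *dev = dev_ptr;
		struct drv_priv *priv = netdev_priv(dev);

		while (poll_tx(priv))	/* reap completed sends */
			; /* nothing */

		if (netif_queue_stopped(dev))
			netif_wake_queue(dev);	/* ring drained - restart TX */
	}

The actual fix additionally arms a one-jiffy timer when the queue is still stopped after draining, as a backstop against a lost completion event.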
drivers/infiniband/hw/cxgb3/cxio_hal.c | 13 ++++++-- drivers/infiniband/hw/cxgb3/cxio_hal.h | 4 +- drivers/infiniband/hw/cxgb3/iwch_cm.c | 6 ++-- drivers/infiniband/hw/cxgb3/iwch_qp.c | 13 +++++--- drivers/infiniband/hw/ehca/ehca_hca.c | 7 ++-- drivers/infiniband/hw/mlx4/cq.c | 4 +- drivers/infiniband/ulp/ipoib/ipoib.h | 2 + drivers/infiniband/ulp/ipoib/ipoib_ib.c | 47 +++++++++++++++++++++++++--- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 3 +- drivers/net/mlx4/mr.c | 2 +- 10 files changed, 75 insertions(+), 26 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index ed2ee4b..5fd8506 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -359,9 +359,10 @@ static void insert_recv_cqe(struct t3_wq *wq, struct t3_cq *cq) cq->sw_wptr++; } -void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count) +int cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count) { u32 ptr; + int flushed = 0; PDBG("%s wq %p cq %p\n", __func__, wq, cq); @@ -369,8 +370,11 @@ void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count) PDBG("%s rq_rptr %u rq_wptr %u skip count %u\n", __func__, wq->rq_rptr, wq->rq_wptr, count); ptr = wq->rq_rptr + count; - while (ptr++ != wq->rq_wptr) + while (ptr++ != wq->rq_wptr) { insert_recv_cqe(wq, cq); + flushed++; + } + return flushed; } static void insert_sq_cqe(struct t3_wq *wq, struct t3_cq *cq, @@ -394,9 +398,10 @@ static void insert_sq_cqe(struct t3_wq *wq, struct t3_cq *cq, cq->sw_wptr++; } -void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count) +int cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count) { __u32 ptr; + int flushed = 0; struct t3_swsq *sqp = wq->sq + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2); ptr = wq->sq_rptr + count; @@ -405,7 +410,9 @@ void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count) insert_sq_cqe(wq, cq, sqp); sqp++; ptr++; + flushed++; } + return flushed; } /* diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.h b/drivers/infiniband/hw/cxgb3/cxio_hal.h index 2bcff7f..69ab08e 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.h +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.h @@ -173,8 +173,8 @@ u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp); void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid); int __init cxio_hal_init(void); void __exit cxio_hal_exit(void); -void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count); -void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count); +int cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count); +int cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count); void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count); void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count); void cxio_flush_hw_cq(struct t3_cq *cq); diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index d44a6df..c325c44 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -67,10 +67,10 @@ int peer2peer = 0; module_param(peer2peer, int, 0644); MODULE_PARM_DESC(peer2peer, "Support peer2peer ULPs (default=0)"); -static int ep_timeout_secs = 10; +static int ep_timeout_secs = 60; module_param(ep_timeout_secs, int, 0644); MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout " - "in seconds (default=10)"); + "in seconds (default=60)"); static int mpa_rev = 1; module_param(mpa_rev, int, 0644); @@ -1650,8 +1650,8 @@ static int close_con_rpl(struct 
t3cdev *tdev, struct sk_buff *skb, void *ctx) release = 1; break; case ABORTING: - break; case DEAD: + break; default: BUG_ON(1); break; diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c index 9b4be88..79dbe5b 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -655,6 +655,7 @@ static void __flush_qp(struct iwch_qp *qhp, unsigned long *flag) { struct iwch_cq *rchp, *schp; int count; + int flushed; rchp = get_chp(qhp->rhp, qhp->attr.rcq); schp = get_chp(qhp->rhp, qhp->attr.scq); @@ -669,20 +670,22 @@ static void __flush_qp(struct iwch_qp *qhp, unsigned long *flag) spin_lock(&qhp->lock); cxio_flush_hw_cq(&rchp->cq); cxio_count_rcqes(&rchp->cq, &qhp->wq, &count); - cxio_flush_rq(&qhp->wq, &rchp->cq, count); + flushed = cxio_flush_rq(&qhp->wq, &rchp->cq, count); spin_unlock(&qhp->lock); spin_unlock_irqrestore(&rchp->lock, *flag); - (*rchp->ibcq.comp_handler)(&rchp->ibcq, rchp->ibcq.cq_context); + if (flushed) + (*rchp->ibcq.comp_handler)(&rchp->ibcq, rchp->ibcq.cq_context); /* locking heirarchy: cq lock first, then qp lock. */ spin_lock_irqsave(&schp->lock, *flag); spin_lock(&qhp->lock); cxio_flush_hw_cq(&schp->cq); cxio_count_scqes(&schp->cq, &qhp->wq, &count); - cxio_flush_sq(&qhp->wq, &schp->cq, count); + flushed = cxio_flush_sq(&qhp->wq, &schp->cq, count); spin_unlock(&qhp->lock); spin_unlock_irqrestore(&schp->lock, *flag); - (*schp->ibcq.comp_handler)(&schp->ibcq, schp->ibcq.cq_context); + if (flushed) + (*schp->ibcq.comp_handler)(&schp->ibcq, schp->ibcq.cq_context); /* deref */ if (atomic_dec_and_test(&qhp->refcnt)) @@ -880,7 +883,6 @@ int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp, ep = qhp->ep; get_ep(&ep->com); } - flush_qp(qhp, &flag); break; case IWCH_QP_STATE_TERMINATE: qhp->attr.state = IWCH_QP_STATE_TERMINATE; @@ -911,6 +913,7 @@ int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp, } switch (attrs->next_state) { case IWCH_QP_STATE_IDLE: + flush_qp(qhp, &flag); qhp->attr.state = IWCH_QP_STATE_IDLE; qhp->attr.llp_stream_handle = NULL; put_ep(&qhp->ep->com); diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c index 2515cbd..bc3b37d 100644 --- a/drivers/infiniband/hw/ehca/ehca_hca.c +++ b/drivers/infiniband/hw/ehca/ehca_hca.c @@ -101,7 +101,6 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) props->max_ee = limit_uint(rblock->max_rd_ee_context); props->max_rdd = limit_uint(rblock->max_rd_domain); props->max_fmr = limit_uint(rblock->max_mr); - props->local_ca_ack_delay = limit_uint(rblock->local_ca_ack_delay); props->max_qp_rd_atom = limit_uint(rblock->max_rr_qp); props->max_ee_rd_atom = limit_uint(rblock->max_rr_ee_context); props->max_res_rd_atom = limit_uint(rblock->max_rr_hca); @@ -115,7 +114,7 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) } props->max_pkeys = 16; - props->local_ca_ack_delay = limit_uint(rblock->local_ca_ack_delay); + props->local_ca_ack_delay = min_t(u8, rblock->local_ca_ack_delay, 255); props->max_raw_ipv6_qp = limit_uint(rblock->max_raw_ipv6_qp); props->max_raw_ethy_qp = limit_uint(rblock->max_raw_ethy_qp); props->max_mcast_grp = limit_uint(rblock->max_mcast_grp); @@ -136,7 +135,7 @@ query_device1: return ret; } -static int map_mtu(struct ehca_shca *shca, u32 fw_mtu) +static enum ib_mtu map_mtu(struct ehca_shca *shca, u32 fw_mtu) { switch (fw_mtu) { case 0x1: @@ -156,7 +155,7 @@ static int map_mtu(struct ehca_shca *shca, u32 fw_mtu) } } -static int 
map_number_of_vls(struct ehca_shca *shca, u32 vl_cap) +static u8 map_number_of_vls(struct ehca_shca *shca, u32 vl_cap) { switch (vl_cap) { case 0x1: diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 2f199c5..4521319 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -246,7 +246,7 @@ err_mtt: if (context) ib_umem_release(cq->umem); else - mlx4_ib_free_cq_buf(dev, &cq->buf, entries); + mlx4_ib_free_cq_buf(dev, &cq->buf, cq->ibcq.cqe); err_db: if (!context) @@ -434,7 +434,7 @@ int mlx4_ib_destroy_cq(struct ib_cq *cq) mlx4_ib_db_unmap_user(to_mucontext(cq->uobject->context), &mcq->db); ib_umem_release(mcq->umem); } else { - mlx4_ib_free_cq_buf(dev, &mcq->buf, cq->cqe + 1); + mlx4_ib_free_cq_buf(dev, &mcq->buf, cq->cqe); mlx4_db_free(dev->dev, &mcq->db); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 9044f88..ca126fc 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -334,6 +334,7 @@ struct ipoib_dev_priv { #endif int hca_caps; struct ipoib_ethtool_st ethtool; + struct timer_list poll_timer; }; struct ipoib_ah { @@ -404,6 +405,7 @@ extern struct workqueue_struct *ipoib_workqueue; int ipoib_poll(struct napi_struct *napi, int budget); void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); +void ipoib_send_comp_handler(struct ib_cq *cq, void *dev_ptr); struct ipoib_ah *ipoib_create_ah(struct net_device *dev, struct ib_pd *pd, struct ib_ah_attr *attr); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 97b815c..f429bce 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -461,6 +461,26 @@ void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) netif_rx_schedule(dev, &priv->napi); } +static void drain_tx_cq(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; + + spin_lock_irqsave(&priv->tx_lock, flags); + while (poll_tx(priv)) + ; /* nothing */ + + if (netif_queue_stopped(dev)) + mod_timer(&priv->poll_timer, jiffies + 1); + + spin_unlock_irqrestore(&priv->tx_lock, flags); +} + +void ipoib_send_comp_handler(struct ib_cq *cq, void *dev_ptr) +{ + drain_tx_cq((struct net_device *)dev_ptr); +} + static inline int post_send(struct ipoib_dev_priv *priv, unsigned int wr_id, struct ib_ah *address, u32 qpn, @@ -555,12 +575,22 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, else priv->tx_wr.send_flags &= ~IB_SEND_IP_CSUM; + if (++priv->tx_outstanding == ipoib_sendq_size) { + ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); + if (ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP)) + ipoib_warn(priv, "request notify on send CQ failed\n"); + netif_stop_queue(dev); + } + if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1), address->ah, qpn, tx_req, phead, hlen))) { ipoib_warn(priv, "post_send failed\n"); ++dev->stats.tx_errors; + --priv->tx_outstanding; ipoib_dma_unmap_tx(priv->ca, tx_req); dev_kfree_skb_any(skb); + if (netif_queue_stopped(dev)) + netif_wake_queue(dev); } else { dev->trans_start = jiffies; @@ -568,14 +598,11 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, ++priv->tx_head; skb_orphan(skb); - if (++priv->tx_outstanding == ipoib_sendq_size) { - ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); - netif_stop_queue(dev); - } } if (unlikely(priv->tx_outstanding > MAX_SEND_CQE)) - poll_tx(priv); + while (poll_tx(priv)) + ; 
/* nothing */ } static void __ipoib_reap_ah(struct net_device *dev) @@ -609,6 +636,11 @@ void ipoib_reap_ah(struct work_struct *work) round_jiffies_relative(HZ)); } +static void ipoib_ib_tx_timer_func(unsigned long ctx) +{ + drain_tx_cq((struct net_device *)ctx); +} + int ipoib_ib_dev_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -645,6 +677,10 @@ int ipoib_ib_dev_open(struct net_device *dev) queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, round_jiffies_relative(HZ)); + init_timer(&priv->poll_timer); + priv->poll_timer.function = ipoib_ib_tx_timer_func; + priv->poll_timer.data = (unsigned long)dev; + set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags); return 0; @@ -810,6 +846,7 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush) ipoib_dbg(priv, "All sends and receives done.\n"); timeout: + del_timer_sync(&priv->poll_timer); qp_attr.qp_state = IB_QPS_RESET; if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE)) ipoib_warn(priv, "Failed to modify QP to RESET state\n"); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index c1e7ece..8766d29 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -187,7 +187,8 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) goto out_free_mr; } - priv->send_cq = ib_create_cq(priv->ca, NULL, NULL, dev, ipoib_sendq_size, 0); + priv->send_cq = ib_create_cq(priv->ca, ipoib_send_comp_handler, NULL, + dev, ipoib_sendq_size, 0); if (IS_ERR(priv->send_cq)) { printk(KERN_WARNING "%s: failed to create send CQ\n", ca->name); goto out_free_recv_cq; diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c index cb46446..03a9abc 100644 --- a/drivers/net/mlx4/mr.c +++ b/drivers/net/mlx4/mr.c @@ -551,7 +551,7 @@ int mlx4_fmr_alloc(struct mlx4_dev *dev, u32 pd, u32 access, int max_pages, u64 mtt_seg; int err = -ENOMEM; - if (page_shift < 12 || page_shift >= 32) + if (page_shift < (ffs(dev->caps.page_size_cap) - 1) || page_shift >= 32) return -EINVAL; /* All MTTs must fit in the same page */ From kliteyn at dev.mellanox.co.il Mon May 5 21:33:32 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 06 May 2008 07:33:32 +0300 Subject: [ofa-general] Re: [PATCH 1/4] opensm/osm_ucast_cache.{c, h}: ucast routing cache implementation In-Reply-To: <1210026044.27137.121.camel@hrosenstock-ws.xsigo.com> References: <481D8905.1010207@dev.mellanox.co.il> <1210021961.27137.90.camel@hrosenstock-ws.xsigo.com> <481F82F5.8030505@dev.mellanox.co.il> <1210026044.27137.121.camel@hrosenstock-ws.xsigo.com> Message-ID: <481FDF9C.9000108@dev.mellanox.co.il> Hi Hal, Hal Rosenstock wrote: > On Tue, 2008-05-06 at 00:58 +0300, Yevgeny Kliteynik wrote: >> Hal Rosenstock wrote: >>> Hi Yevgeny, >>> >>> On Sun, 2008-05-04 at 12:59 +0300, Yevgeny Kliteynik wrote: >>> >>> I haven't yet had a chance to review this in detail but think that >>> router ports need to be accomodated in the subnet (I think this is a >>> firm requirement as router ports on the subnet are already supported) >>> and also think that nothing should be introduced precluding the running >>> of OpenSM on a router port. From the latter standpoint, it looks much >>> like a CA port. >> This is exactly how I implemented it - any non-switch port is >> treated as CA, which is just a target LID. 
>> Well, I mean I intended to implement it that way - I reviewed it again, >> and it appears that the cache is fine with routing to routers and running >> from switches, > > then it's just the variable names which indicate ca if routers are > grouped with cas. > >> but there will be a problem when the SM runs on a router node - >> the cache will complain and fall back to the usual routing. >> That can be easily fixed. > > Right; the one thing I saw was in _ucast_cache_get_starting_osm_sw where > routers were not supported. I think a one-line change is all that's > needed there. Not sure if there are other places. Right, that's the place I was talking about. AFAIK, no other places. -- Yevgeny >> However, I'm not sure how OpenSM will behave in general when running >> from a switch or router - I've never tried it. Has anyone tried it? > > I'm not sure either but would be interested to hear. I think there are > some using it on switch port 0 and also think others have tried it on > routers. In terms of switches, it used to work and there is some support > in the vendor directory for this. > > -- Hal > >> -- Yevgeny >> >>> -- Hal >>> >>> > > From ogerlitz at voltaire.com Tue May 6 00:11:06 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 06 May 2008 10:11:06 +0300 Subject: [ofa-general] Re: Using RDMA CM with MPI In-Reply-To: <829ded920805050529k19dee08o3ad19ce9d07bd2ca@mail.gmail.com> References: <829ded920805050401v16961f10y4cae507d58c74cfe@mail.gmail.com> <829ded920805050529k19dee08o3ad19ce9d07bd2ca@mail.gmail.com> Message-ID: <4820048A.1000706@voltaire.com> Keshetti Mahesh wrote: > I want to use the RDMA CM option of MVAPICH2. The procedure described > in the user guide is not very informative. ... Also, I'll be glad if someone can give > me a document describing how it works in detail. try looking at the rdma_cm(7) man page installed by librdmacm > Actually I have some doubts like, how the IP addresses (???) are > resolved into IB addresses and what happens in the case of nodes with two HCAs > (or 1 HCA with two ports)? Each port you want to use for your job (HCA with two ports, two HCAs, etc) has to have an IP address associated with it. This IP address (or a host name which is translated to it through DNS lookup) is probably what you want the rank to advertise. RDMA address resolution uses route lookup and ARP to learn the local and remote GID. > > In the MVAPICH2 user guide it is mentioned that "RDMA CM device needs > to be setup, configured with an IP address and connected to the network". > Is this the same as configuring an IPoIB device? for IB, yes. Or.
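To make the above concrete, here is a minimal sketch of the two resolution steps with librdmacm (an illustration only, not MVAPICH2 code: resolve() is a made-up helper, error handling is omitted, and the destination IP is assumed to be configured on an IPoIB interface):

	#include <stdio.h>
	#include <string.h>
	#include <netinet/in.h>
	#include <arpa/inet.h>
	#include <rdma/rdma_cma.h>

	int resolve(const char *ip)
	{
		struct rdma_event_channel *ch = rdma_create_event_channel();
		struct rdma_cm_id *id;
		struct rdma_cm_event *ev;
		struct sockaddr_in dst;

		memset(&dst, 0, sizeof dst);
		dst.sin_family = AF_INET;
		inet_pton(AF_INET, ip, &dst.sin_addr);

		rdma_create_id(ch, &id, NULL, RDMA_PS_TCP);

		/* route lookup + ARP bind the IP to local and remote GIDs */
		rdma_resolve_addr(id, NULL, (struct sockaddr *) &dst, 2000);
		rdma_get_cm_event(ch, &ev);	/* expect RDMA_CM_EVENT_ADDR_RESOLVED */
		rdma_ack_cm_event(ev);

		/* for IB, this step queries the SA for a path record */
		rdma_resolve_route(id, 2000);
		rdma_get_cm_event(ch, &ev);	/* expect RDMA_CM_EVENT_ROUTE_RESOLVED */
		rdma_ack_cm_event(ev);

		printf("resolved via device %s\n", id->verbs->device->name);
		rdma_destroy_id(id);
		rdma_destroy_event_channel(ch);
		return 0;
	}

Note that which HCA port ends up being used falls out of the destination (and optional source) IP address, which is how the two-HCA / two-port case is disambiguated.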
From keshetti85-student at yahoo.co.in Tue May 6 02:21:08 2008 From: keshetti85-student at yahoo.co.in (Keshetti Mahesh) Date: Tue, 6 May 2008 14:51:08 +0530 Subject: [ofa-general] Re: [mvapich-discuss] Using RDMA CM with MVAPICH2 In-Reply-To: <829ded920805050529k19dee08o3ad19ce9d07bd2ca@mail.gmail.com> References: <829ded920805050401v16961f10y4cae507d58c74cfe@mail.gmail.com> <829ded920805050529k19dee08o3ad19ce9d07bd2ca@mail.gmail.com> Message-ID: <829ded920805060221lc3b5f77safacf8a8ad299b33@mail.gmail.com> I've a couple more questions to ask you. Below are the steps mentioned in the MVAPICH user guide for running an MPI application with RDMA CM support. • Setup the RDMA CM device: RDMA CM device needs to be setup, configured with an IP address and connected to the network. I have two machines (n0 and n1) connected with one ethernet interface and two IB interfaces in each. And /etc/hosts on both machines is like below. 192.168.3.1 n0 192.168.3.2 n1 172.131.15.1 n0_ib0 172.131.15.2 n0_ib1 172.131.15.3 n1_ib0 172.131.15.4 n1_ib1 Now, if I want to run an MPI job on both of the nodes, what should I mention in the 'hostfile' given to MPI ("n0, n1" or "n0_ib0, n1_ib0 ... ")? • Setup the Local Address File: Create the file (/etc/mv2.conf) with the local IP address to be used by RDMA CM. $ echo 10.1.1.1 >> /etc/mv2.conf Why is this file (/etc/mv2.conf) required? Is it required to be present on all nodes? -Mahesh From vlad at dev.mellanox.co.il Tue May 6 03:49:47 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 06 May 2008 13:49:47 +0300 Subject: [ofa-general] OFED 1.3.1 RC1 release is available Message-ID: <482037CB.7020903@dev.mellanox.co.il> Hi, OFED 1.3.1 RC1 release is available on http://www.openfabrics.org/downloads/OFED/ofed-1.3.1/OFED-1.3.1-rc1.tgz To get the BUILD_ID run ofed_info Please report any issues in Bugzilla https://bugs.openfabrics.org/ The RC2 release is expected on May 20 Release information: -------------------- Linux Operating Systems: - RedHat EL4 up4: 2.6.9-42.ELsmp - RedHat EL4 up5: 2.6.9-55.ELsmp - RedHat EL4 up6: 2.6.9-67.ELsmp - RedHat EL5: 2.6.18-8.el5 - RedHat EL5 up1: 2.6.18-53.el5 - Fedora C6: 2.6.18-8.fc6 - SLES10: 2.6.16.21-0.8-smp - SLES10 SP1: 2.6.16.46-0.12-smp - SLES10 SP1 up1: 2.6.16.53-0.16-smp - OpenSuSE 10.3: 2.6.22-*-* - kernel.org: 2.6.23 and 2.6.24 Systems: * x86_64 * x86 * ia64 * ppc64 Main Changes from OFED-1.3: * MPI packages update: * mvapich-1.0.1-2434 * mvapich2-1.0.3-1 * openmpi-1.2.6-1 * Updated libraries: * dapl-v1 1.2.6 * dapl-v2 2.0.8 * libcxgb3 1.2.0 * librdmacm 1.0.7 * ULPs changes: * IB Bonding: ib-bonding-0.9.0-24 * IPoIB bug fixes * RDS fixes for RDMA API * SRP failover * Updated low level drivers: * nes * mlx4 * cxgb3 * ehca Note: In the attached tgz file you can find the git-log of all changes. Vlad & Tziporet -------------- next part -------------- A non-text attachment was scrubbed... Name: ofed-1.3-1.3.1-rc1.diff.tgz Type: application/octet-stream Size: 20514 bytes Desc: not available URL: From hrosenstock at xsigo.com Tue May 6 06:16:55 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 06 May 2008 06:16:55 -0700 Subject: [ofa-general] [PATCH] ofed/docs: fixing wrong syntax in the example in QoS_management_in_OpenSM.txt In-Reply-To: <47E8C032.2050907@dev.mellanox.co.il> References: <47E8C032.2050907@dev.mellanox.co.il> Message-ID: <1210079815.27137.190.camel@hrosenstock-ws.xsigo.com> Hi Yevgeny, On Tue, 2008-03-25 at 11:04 +0200, Yevgeny Kliteynik wrote: > QoS_management_in_OpenSM.txt Shouldn't this doc also be available in the OpenSM git tree (in management/opensm/doc) and distributed as part of OpenSM? If so, I can issue a patch for this. Thanks.
-- Hal From kliteyn at dev.mellanox.co.il Tue May 6 06:40:56 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 06 May 2008 16:40:56 +0300 Subject: [ofa-general] [PATCH] ofed/docs: fixing wrong syntax in the example in QoS_management_in_OpenSM.txt In-Reply-To: <1210079815.27137.190.camel@hrosenstock-ws.xsigo.com> References: <47E8C032.2050907@dev.mellanox.co.il> <1210079815.27137.190.camel@hrosenstock-ws.xsigo.com> Message-ID: <48205FE8.3070104@dev.mellanox.co.il> Hi Hal, Hal Rosenstock wrote: > Hi Yevgeny, > > On Tue, 2008-03-25 at 11:04 +0200, Yevgeny Kliteynik wrote: >> QoS_management_in_OpenSM.txt > > Shouldn't this doc also be available in the OpenSM git tree (in > management/opensm/doc) and distributed as part of OpenSM? > > If so, I can issue a patch for this. I think that it's a good idea (which reminds me that I still haven't fixed the patch for QoS stuff in the OpenSM man page...) -- Yevgeny > Thanks. > > -- Hal > > From hrosenstock at xsigo.com Tue May 6 06:44:12 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 06 May 2008 06:44:12 -0700 Subject: [ofa-general] [PATCH] ofed/docs: fixing wrong syntax in the example in QoS_management_in_OpenSM.txt In-Reply-To: <48205FE8.3070104@dev.mellanox.co.il> References: <47E8C032.2050907@dev.mellanox.co.il> <1210079815.27137.190.camel@hrosenstock-ws.xsigo.com> <48205FE8.3070104@dev.mellanox.co.il> Message-ID: <1210081452.2026.37.camel@hrosenstock-ws.xsigo.com> Hi Yevgeny, On Tue, 2008-05-06 at 16:40 +0300, Yevgeny Kliteynik wrote: > Hi Hal, > > Hal Rosenstock wrote: > > Hi Yevgeny, > > > > On Tue, 2008-03-25 at 11:04 +0200, Yevgeny Kliteynik wrote: > >> QoS_management_in_OpenSM.txt > > > > Shouldn't this doc also be available in the OpenSM git tree (in > > management/opensm/doc) and distributed as part of OpenSM? > > > > If so, I can issue a patch for this. > > I think that it's a good idea (which reminds me that I still > haven't fixed the patch for QoS stuff in the OpenSM man page...) Yes, that would be nice too :-) That's going to make the OpenSM man page huge. Not sure how that should be dealt with. Maybe a simple way out in the short term might be to just reference that doc in the man page. -- Hal > -- Yevgeny > > > Thanks. > > > > -- Hal From narravul at cse.ohio-state.edu Tue May 6 06:50:48 2008 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Tue, 6 May 2008 09:50:48 -0400 (EDT) Subject: [ofa-general] Re: [mvapich-discuss] Using RDMA CM with MVAPICH2 In-Reply-To: <829ded920805060221lc3b5f77safacf8a8ad299b33@mail.gmail.com> Message-ID: Hi Mahesh, Thanks for trying out our RDMA CM support. My answers are inline. > I've a couple more questions to ask you. > > Below are the steps mentioned in the MVAPICH user guide for > running an MPI application with RDMA CM support. > > • Setup the RDMA CM device: RDMA CM device needs to be > setup, configured with an IP address and connected to the network. > > I have two machines (n0 and n1) connected with one ethernet interface > and two IB interfaces in each. And /etc/hosts on both machines is like > below.
> 192.168.3.1 n0 > 192.168.3.2 n1 > 172.131.15.1 n0_ib0 > 172.131.15.2 n0_ib1 > 172.131.15.3 n1_ib0 > 172.131.15.4 n1_ib1 > > Now, if I want to run an MPI job on both of the nodes, what should I > mention in the 'hostfile' given to MPI ("n0, n1" or "n0_ib0, n1_ib0 ... ")? You can use any one of these pairs in your hostfile. That is, using n0 and n1 should work fine. > • Setup the Local Address File: Create the file (/etc/mv2.conf) with the > local IP address to be used by RDMA CM. > $ echo 10.1.1.1 >> /etc/mv2.conf > > Why is this file (/etc/mv2.conf) required? Is it required to be present on > all nodes? The local rdma-cm device that the mpi library needs to use is specified in the /etc/mv2.conf file. The file needs to be on all the machines. --Sundeep. > > -Mahesh > From suri at baymicrosystems.com Tue May 6 06:56:26 2008 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Tue, 6 May 2008 09:56:26 -0400 Subject: [ofa-general] IBTA Compliance- Mkey Violations trap In-Reply-To: <48205FE8.3070104@dev.mellanox.co.il> References: <47E8C032.2050907@dev.mellanox.co.il> <1210079815.27137.190.camel@hrosenstock-ws.xsigo.com> <48205FE8.3070104@dev.mellanox.co.il> Message-ID: <04d301c8af80$fd1a19c0$3414a8c0@md.baymicrosystems.com> Folks: WRT sending traps on Mkey violations, the spec is a little ambiguous IMO. In section 14.2.4.2, is C-14-18 saying to send a trap the first time the Mkey violation happens and the Lease Expiry timer is started, or to send a trap (possibly multiple of them) every time a Mkey violation happens even though the lease timer may have been started already? What is the consensus in this group? Many thanks, Suri From monis at Voltaire.COM Tue May 6 06:56:30 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 06 May 2008 16:56:30 +0300 Subject: [ofa-general] [PATCH] IB/core: handle race between elements in work queues after event In-Reply-To: References: <48187E5A.7040809@Voltaire.COM> <4819BD29.7080002@Voltaire.COM> Message-ID: <4820638E.4030901@Voltaire.COM> > I guess I can believe things don't get worse but I still don't know how > this makes things better. With the current code the request is lost > because it goes to the wrong SM; with the new code the request is failed > by the SA layer. So in both cases the consumer just has to try again. > > So is there some practical benefit we see by adding this code? > > - R. In general I see the benefit in faster detection of a wrong SM ah (address handle). Before the patch, consumers needed to wait for a timeout before the detection; after the patch, it happens immediately on return from the function. This improves the performance of an SM failover scenario. Some applications may get the benefit above only if they handle the new return code (EAGAIN) specifically, but this patch opens the door for such improvement. thanks MoniS From monis at Voltaire.COM Tue May 6 07:09:20 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 06 May 2008 17:09:20 +0300 Subject: [ofa-general] [PATCH] IB/IPoIB: Separate IB events to groups and handle each according to level of severity Message-ID: <48206690.3090604@Voltaire.COM> The purpose of this patch is to make the events that are related to SM change (namely the CLIENT_REREGISTER event and the SM_CHANGE event) less disruptive. When SM related events are handled, it is not necessary to flush unicast info from the device; only the multicast info needs to be refreshed. This patch divides the events that are handled by IPoIB into three categories: 0, 1 and 2 (where 2 does more than 1 and 1 does more than 0). The main change is in __ipoib_ib_dev_flush(). Instead of passing a pkey_event flag to the function, we now use levels. An event that requires "harder" flushing calls this function with a higher number for level. Besides the concept, the actual change is that SM related events no longer flush unicast info or bring the device down; they only refresh the multicast info in the background. Signed-off-by: Moni Levy Signed-off-by: Moni Shoua
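In sketch form, the event dispatch this patch arrives at (a switch-style paraphrase of the ipoib_event() hunk at the end of the diff, not the literal code):

	switch (record->event) {
	case IB_EVENT_SM_CHANGE:
	case IB_EVENT_CLIENT_REREGISTER:
		/* level 0: only refresh multicast info */
		queue_work(ipoib_workqueue, &priv->flush_task0);
		break;
	case IB_EVENT_PORT_ERR:
	case IB_EVENT_PORT_ACTIVE:
	case IB_EVENT_LID_CHANGE:
		/* level 1: also flush unicast info (device down/up) */
		queue_work(ipoib_workqueue, &priv->flush_task1);
		break;
	case IB_EVENT_PKEY_CHANGE:
		/* level 2: also re-check the pkey index and restart the QP */
		queue_work(ipoib_workqueue, &priv->flush_task2);
		break;
	default:
		break;
	}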
--- drivers/infiniband/ulp/ipoib/ipoib.h | 9 ++++--- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 37 ++++++++++++++++++----------- drivers/infiniband/ulp/ipoib/ipoib_main.c | 5 ++- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 19 +++++++------- 4 files changed, 43 insertions(+), 27 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 054fab8..e1e91d3 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -268,10 +268,11 @@ struct ipoib_dev_priv { struct delayed_work pkey_poll_task; struct delayed_work mcast_task; - struct work_struct flush_task; + struct work_struct flush_task0; + struct work_struct flush_task1; + struct work_struct flush_task2; struct work_struct restart_task; struct delayed_work ah_reap_task; - struct work_struct pkey_event_task; struct ib_device *ca; u8 port; @@ -401,7 +402,9 @@ void ipoib_flush_paths(struct net_device struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); -void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_ib_dev_flush0(struct work_struct *work); +void ipoib_ib_dev_flush1(struct work_struct *work); +void ipoib_ib_dev_flush2(struct work_struct *work); void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 08c4396..54fee47 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -749,12 +749,14 @@ int ipoib_ib_dev_init(struct net_device return 0; } -static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int level) { struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; u16 new_index; + ipoib_dbg(priv, "Try flushing level %d\n", level); + mutex_lock(&priv->vlan_mutex); /* @@ -762,7 +764,7 @@ static void __ipoib_ib_dev_flush(struct * the parent is down.
*/ list_for_each_entry(cpriv, &priv->child_intfs, list) - __ipoib_ib_dev_flush(cpriv, pkey_event); + __ipoib_ib_dev_flush(cpriv, level); mutex_unlock(&priv->vlan_mutex); @@ -776,7 +778,7 @@ static void __ipoib_ib_dev_flush(struct return; } - if (pkey_event) { + if (level == 2) { if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); ipoib_ib_dev_down(dev, 0); @@ -794,11 +796,13 @@ static void __ipoib_ib_dev_flush(struct priv->pkey_index = new_index; } - ipoib_dbg(priv, "flushing\n"); - ipoib_ib_dev_down(dev, 0); + ipoib_mcast_dev_flush(dev); + + if (level >= 1) + ipoib_ib_dev_down(dev, 0); - if (pkey_event) { + if (level >= 2) { ipoib_ib_dev_stop(dev, 0); ipoib_ib_dev_open(dev); } @@ -808,29 +812,36 @@ static void __ipoib_ib_dev_flush(struct * we get here, don't bring it back up if it's not configured up */ if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) { - ipoib_ib_dev_up(dev); + if (level >= 1) + ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } } -void ipoib_ib_dev_flush(struct work_struct *work) +void ipoib_ib_dev_flush0(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + container_of(work, struct ipoib_dev_priv, flush_task0); - ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); __ipoib_ib_dev_flush(priv, 0); } -void ipoib_pkey_event(struct work_struct *work) +void ipoib_ib_dev_flush1(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_event_task); + container_of(work, struct ipoib_dev_priv, flush_task1); - ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name); __ipoib_ib_dev_flush(priv, 1); } +void ipoib_ib_dev_flush2(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, flush_task2); + + __ipoib_ib_dev_flush(priv, 2); +} + void ipoib_ib_dev_cleanup(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 5728204..54f046a 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -992,9 +992,10 @@ static void ipoib_setup(struct net_devic INIT_LIST_HEAD(&priv->multicast_list); INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); - INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); - INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); + INIT_WORK(&priv->flush_task0, ipoib_ib_dev_flush0); + INIT_WORK(&priv->flush_task1, ipoib_ib_dev_flush1); + INIT_WORK(&priv->flush_task2, ipoib_ib_dev_flush2); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index a3aeb91..83d9c6d 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -259,15 +259,16 @@ void ipoib_event(struct ib_event_handler if (record->element.port_num != priv->port) return; - if (record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PORT_ACTIVE || - record->event == IB_EVENT_LID_CHANGE || - record->event == IB_EVENT_SM_CHANGE || - record->event == IB_EVENT_CLIENT_REREGISTER) { - ipoib_dbg(priv, "Port state change event\n"); - queue_work(ipoib_workqueue, &priv->flush_task); + ipoib_dbg(priv, "Event %d on device 
%s port %d\n",record->event, + record->device->name, record->element.port_num); + if ( record->event == IB_EVENT_SM_CHANGE || + record->event == IB_EVENT_CLIENT_REREGISTER) { + queue_work(ipoib_workqueue, &priv->flush_task0); + } else if (record->event == IB_EVENT_PORT_ERR || + record->event == IB_EVENT_PORT_ACTIVE || + record->event == IB_EVENT_LID_CHANGE) { + queue_work(ipoib_workqueue, &priv->flush_task1); } else if (record->event == IB_EVENT_PKEY_CHANGE) { - ipoib_dbg(priv, "P_Key change event on port:%d\n", priv->port); - queue_work(ipoib_workqueue, &priv->pkey_event_task); + queue_work(ipoib_workqueue, &priv->flush_task2); } } From pawel.dziekonski at wcss.pl Tue May 6 07:10:40 2008 From: pawel.dziekonski at wcss.pl (Pawel Dziekonski) Date: Tue, 6 May 2008 16:10:40 +0200 Subject: [ofa-general] getting network statistics In-Reply-To: <1203424196.16145.1.camel@mtls03> References: <1203424196.16145.1.camel@mtls03> Message-ID: <20080506141039.GJ6586@cefeid.wcss.wroc.pl> you mean port_rcv_data and port_xmit_data ? if so, then I have 2 jobs that are definitelly using IB network, but those files almost do not change. :o OFED 1.2.5.5 and kernel 2.6.9-55.0.12.ELsmp root at wn111:/sys/class/infiniband/mthca0/ports/1/counters # ls -al total 0 drwxr-xr-x 2 root root 0 May 6 15:45 ./ drwxr-xr-x 5 root root 0 May 6 15:45 ../ -r--r--r-- 1 root root 4096 May 6 15:45 VL15_dropped -r--r--r-- 1 root root 4096 May 6 15:45 excessive_buffer_overrun_errors -r--r--r-- 1 root root 4096 May 6 15:45 link_downed -r--r--r-- 1 root root 4096 May 6 15:45 link_error_recovery -r--r--r-- 1 root root 4096 May 6 15:45 local_link_integrity_errors -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_constraint_errors -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_data -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_errors -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_packets -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_remote_physical_errors -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_switch_relay_errors -r--r--r-- 1 root root 4096 May 6 15:45 port_xmit_constraint_errors -r--r--r-- 1 root root 4096 May 6 15:45 port_xmit_data -r--r--r-- 1 root root 4096 May 6 15:45 port_xmit_discards -r--r--r-- 1 root root 4096 May 6 15:45 port_xmit_packets -r--r--r-- 1 root root 4096 May 6 15:45 symbol_error On Tue, 19 Feb 2008 at 02:29:56PM +0200, Eli Cohen wrote: > cat /sys/class/infiniband/mlx4_0/ports/1/counters/* > > mlx4_* can be mthca* > > On Tue, 2008-02-19 at 11:03 +0200, David Minor wrote: > > Under Linux with Mellanox ofed, how can I get real-time network > > statistics. e.g. how many bytes are being sent and received over each > > port at any given time? -- Pawel Dziekonski Wroclaw Centre for Networking & Supercomputing, HPC Department Politechnika Wr., pl. Grunwaldzki 9, bud. 
From hrosenstock at xsigo.com Tue May 6 07:33:26 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 06 May 2008 07:33:26 -0700 Subject: [ofa-general] [PATCH] OpenSM: Add QoS_management_in_OpenSM.txt to opensm/doc directory Message-ID: <1210084406.2026.48.camel@hrosenstock-ws.xsigo.com> Add Yevgeny's QoS_management_in_OpenSM.txt to opensm/doc directory Signed-off-by: Hal Rosenstock --- /dev/null 2008-03-17 00:34:45.630902751 -0700 +++ opensm/doc/QoS_management_in_OpenSM.txt 2008-04-01 08:29:04.625737000 -0700 @@ -0,0 +1,492 @@ + + QoS Management in OpenSM +============================================================================== + Table of contents +============================================================================== + +1. Overview +2. Full QoS Policy File +3. Simplified QoS Policy Definition +4. Policy File Syntax Guidelines +5. Examples of Full Policy File +6. Simplified QoS Policy - Details and Examples +7. SL2VL Mapping and VL Arbitration + + +============================================================================== + 1. Overview +============================================================================== + +When QoS in OpenSM is enabled (-Q or --qos), OpenSM looks for the QoS policy file. +The default name of the OpenSM QoS policy file is +/usr/local/etc/opensm/qos-policy.conf. The default may be changed by using the -Y +or --qos_policy_file option with OpenSM. + +During fabric initialization and at every heavy sweep, OpenSM parses the QoS +policy file, applies its settings to the discovered fabric elements, and +enforces the provided policy on client requests. The overall flow for such +requests is: + - The request is matched against the defined matching rules such that the + QoS Level definition is found. + - Given the QoS Level, path(s) search is performed with the given + restrictions imposed by that level. + +There are two ways to define QoS policy: + - Full policy, where the policy file syntax provides an administrator + various ways to match PathRecord/MultiPathRecord (PR/MPR) requests and + enforce various QoS constraints on the requested PR/MPR + - Simplified QoS policy definition, where an administrator would be able to + match PR/MPR requests by various ULPs and applications running on top of + these ULPs. + +While the full policy syntax is very flexible, in many cases the simplified +policy definition would be sufficient. + + +============================================================================== + 2. Full QoS Policy File +============================================================================== + +The QoS policy file has the following sections: + +I) Port Groups (denoted by port-groups). +This section defines zero or more port groups that can be referred to later by +matching rules (see below). A port group lists ports by: + - Port GUID + - Port name, which is a combination of NodeDescription and IB port number + - PKey, which means that all the ports in the subnet that belong to a + partition with a given PKey belong to this port group + - Partition name, which means that all the ports in the subnet that belong + to a partition with a given name belong to this port group + - Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and + SELF (SM's port). + +II) QoS Setup (denoted by qos-setup). +This section describes how to set up SL2VL and VL Arbitration tables on +various nodes in the fabric. +However, this is not supported in OFED 1.3.
+SL2VL and VLArb tables should be configured in the OpenSM options file +(default location - /var/cache/opensm/opensm.opts). + +III) QoS Levels (denoted by qos-levels). +Each QoS Level defines Service Level (SL) and a few optional fields: + - MTU limit + - Rate limit + - PKey + - Packet lifetime +When path(s) search is performed, it is done with regard to the restrictions that +these QoS Level parameters impose. +One QoS level that is mandatory to define is a DEFAULT QoS level. It is +applied to a PR/MPR query that does not match any existing match rule. +Similar to any other QoS Level, it can also be explicitly referred to by any +match rule. + +IV) QoS Matching Rules (denoted by qos-match-rules). +Each PathRecord/MultiPathRecord query that OpenSM receives is matched against +the set of matching rules. Rules are scanned in order of appearance in the QoS +policy file such that the first match takes precedence. +Each rule has the name of the QoS level that will be applied to the matching query. +A default QoS level is applied to a query that did not match any rule. +Queries can be matched by: + - Source port group (whether a source port is a member of a specified group) + - Destination port group (same as above, only for destination port) + - PKey + - QoS class + - Service ID +To match a certain matching rule, a PR/MPR query has to match ALL the rule's +criteria. However, not all the fields of the PR/MPR query have to appear in +the matching rule. +For instance, if the rule has a single criterion - Service ID, it will match +any query that has this Service ID, disregarding the rest of the query fields. +However, if a certain query has only Service ID (which means that this is the +only bit in the PR/MPR component mask that is on), it will not match any rule +that has other matching criteria besides Service ID. + + +============================================================================== + 3. Simplified QoS Policy Definition +============================================================================== + +Simplified QoS policy definition comprises a single section denoted by +qos-ulps. Similar to the full QoS policy, it has a list of match rules and +their QoS Level, but in this case a match rule has only one criterion - its +goal is to match a certain ULP (or a certain application on top of this ULP) +PR/MPR request, and QoS Level has only one constraint - Service Level (SL). +The simplified policy section may appear in the policy file in combination with +the full policy, or as a stand-alone policy definition. +See more details and a list of match rule criteria below. + + +============================================================================== + 4. Policy File Syntax Guidelines +============================================================================== + +- Empty lines are ignored. +- Leading and trailing blanks, as well as empty lines, are ignored, so + the indentation in the example is just for better readability. +- Comments are started with the pound sign (#) and terminated by EOL. +- Any keyword should be the first non-blank in the line, unless it's a + comment. +- Keywords that denote section/subsection start have matching closing + keywords. +- Having a QoS Level named "DEFAULT" is a must - it is applied to PR/MPR + requests that didn't match any of the matching rules. +- Any section/subsection of the policy file is optional. + + +============================================================================== + 5.
Examples of Full Policy File +============================================================================== + +As mentioned earlier, any section of the policy file is optional, and +the only mandatory part of the policy file is a default QoS Level. +Here's an example of the shortest policy file: + + qos-levels + qos-level + name: DEFAULT + sl: 0 + end-qos-level + end-qos-levels + +The port groups section is missing because there are no match rules, which means +that port groups are not referred to anywhere, and there is no need to define +them. And since this policy file doesn't have any matching rules, a PR/MPR query +won't match any rule, and OpenSM will enforce the default QoS level. +Essentially, the above example is equivalent to not having a QoS policy file +at all. + +The following example shows all the possible options and keywords in the +policy file and their syntax: + + # + # See the comments in the following example. + # They explain different keywords and their meaning. + # + port-groups + + port-group # using port GUIDs + name: Storage + # "use" is just a description that is used for logging + # Other than that, it is just a comment + use: SRP Targets + port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA + port-guid: 0x1000000000FFFF + end-port-group + + port-group + name: Virtual Servers + # The syntax of the port name is as follows: + # "node_description/Pnum". + # node_description is compared to the NodeDescription of the node, + # and "Pnum" is a port number on that node. + port-name: vs1 HCA-1/P1, vs2 HCA-1/P1 + end-port-group + + # using partitions defined in the partition policy + port-group + name: Partitions + partition: Part1 + pkey: 0x1234 + end-port-group + + # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM) + # or ALL (for all the nodes in the subnet) + port-group + name: CAs and SM + node-type: CA, SELF + end-port-group + + end-port-groups + + qos-setup + # This section of the policy file describes how to set up SL2VL and VL + # Arbitration tables on various nodes in the fabric. + # However, this is not supported in OFED 1.3 - the section is parsed + # and ignored. SL2VL and VLArb tables should be configured in the + # OpenSM options file (by default - /var/cache/opensm/opensm.opts). + end-qos-setup + + qos-levels + + # Having a QoS Level named "DEFAULT" is a must - it is applied to + # PR/MPR requests that didn't match any of the matching rules. + qos-level + name: DEFAULT + use: default QoS Level + sl: 0 + end-qos-level + + # the whole set: SL, MTU-Limit, Rate-Limit, PKey, Packet Lifetime + qos-level + name: WholeSet + sl: 1 + mtu-limit: 4 + rate-limit: 5 + pkey: 0x1234 + packet-life: 8 + end-qos-level + + end-qos-levels + + # Match rules are scanned in order of their appearance in the policy file. + # First matched rule takes precedence.
+ qos-match-rules + + # matching by a single criterion: QoS class + qos-match-rule + use: by QoS class + qos-class: 7-9,11 + # Name of qos-level to apply to the matching PR/MPR + qos-level-name: WholeSet + end-qos-match-rule + + # show matching by destination group and service id + qos-match-rule + use: Storage targets + destination: Storage + service-id: 0x10000000000001, 0x10000000000008-0x10000000000FFF + qos-level-name: WholeSet + end-qos-match-rule + + qos-match-rule + source: Storage + use: match by source group only + qos-level-name: DEFAULT + end-qos-match-rule + + qos-match-rule + use: match by all parameters + qos-class: 7-9,11 + source: Virtual Servers + destination: Storage + service-id: 0x0000000000010000-0x000000000001FFFF + pkey: 0x0F00-0x0FFF + qos-level-name: WholeSet + end-qos-match-rule + + end-qos-match-rules + + +============================================================================== + 6. Simplified QoS Policy - Details and Examples +============================================================================== + +Simplified QoS policy match rules are tailored for matching ULPs (or some +application on top of a ULP) PR/MPR requests. This section has a list of +per-ULP (or per-application) match rules and the SL that should be enforced +on the matched PR/MPR query. + +Match rules include: + - Default match rule that is applied to a PR/MPR query that didn't match any + of the other match rules + - SDP + - SDP application with a specific target TCP/IP port range + - SRP with a specific target IB port GUID + - RDS + - iSER + - iSER application with a specific target TCP/IP port range + - IPoIB with a default PKey + - IPoIB with a specific PKey + - any ULP/application with a specific Service ID in the PR/MPR query + - any ULP/application with a specific PKey in the PR/MPR query + - any ULP/application with a specific target IB port GUID in the PR/MPR query + +Since any section of the policy file is optional, as long as the basic rules of +the file are kept (such as not referring to a nonexistent port group, having a +default QoS Level, etc.), the simplified policy section (qos-ulps) can serve +as a complete QoS policy file. +The shortest policy file in this case would be as follows: + + qos-ulps + default : 0 #default SL + end-qos-ulps + +It is equivalent to the previous example of the shortest policy file, and it +is also equivalent to not having a policy file at all.
+ +Below is an example of simplified QoS policy with all the possible keywords: + + qos-ulps + default : 0 # default SL + sdp, port-num 30000 : 0 # SL for application running on top + # of SDP when a destination + # TCP/IP port is 30000 + sdp, port-num 10000-20000 : 0 + sdp : 1 # default SL for any other + # application running on top of SDP + rds : 2 # SL for RDS traffic + iser, port-num 900 : 0 # SL for iSER with a specific target + # port + iser : 3 # default SL for iSER + ipoib, pkey 0x0001 : 0 # SL for IPoIB on partition with + # pkey 0x0001 + ipoib : 4 # default IPoIB partition, + # pkey=0x7FFF + any, service-id 0x6234 : 6 # match any PR/MPR query with a + # specific Service ID + any, pkey 0x0ABC : 6 # match any PR/MPR query with a + # specific PKey + srp, target-port-guid 0x1234 : 5 # SRP when SRP Target is located on + # a specified IB port GUID + any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query with + # a specific target port GUID + end-qos-ulps + + +Similar to the full policy definition, matching of PR/MPR queries is done in +order of appearance in the QoS policy file such that the first match takes +precedence, except for the "default" rule, which is applied only if the query +didn't match any other rule. + +All other sections of the QoS policy file take precedence over the qos-ulps +section. That is, if a policy file has both qos-match-rules and qos-ulps +sections, then any query is matched first against the rules in the +qos-match-rules section, and only if there was no match, the query is matched +against the rules in the qos-ulps section. + +Note that some of these match rules may overlap, so in order to use the +simplified QoS definition effectively, it is important to understand how each +of the ULPs is matched: + +6.1 IPoIB +An IPoIB query is matched by PKey. The default PKey for an IPoIB partition is +0x7fff, so the following three match rules are equivalent: + + ipoib : + ipoib, pkey 0x7fff : + any, pkey 0x7fff : + +6.2 SDP +An SDP PR query is matched by Service ID. The Service-ID for SDP is +0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port +Number to connect to. The following two match rules are equivalent: + + sdp : + any, service-id 0x0000000000010000-0x000000000001ffff : + +6.3 RDS +Similar to SDP, an RDS PR query is matched by Service ID. The Service ID for RDS +is 0x000000000106PPPP, where PPPP are 4 hex digits holding the remote TCP/IP +Port Number to connect to. The default port number for RDS is 0x48CA, which makes +the default Service-ID 0x00000000010648CA. The following two match rules are +equivalent: + + rds : + any, service-id 0x00000000010648CA : + +6.4 iSER +Similar to RDS, an iSER query is matched by Service ID, where the Service ID +is also 0x000000000106PPPP. The default port number for iSER is 0x035C, which makes +the default Service-ID 0x000000000106035C. The following two match rules are +equivalent: + + iser : + any, service-id 0x000000000106035C : + +6.5 SRP +The Service ID for SRP varies from storage vendor to vendor, thus an SRP query is +matched by the target IB port GUID. The following two match rules are +equivalent: + + srp, target-port-guid 0x1234 : + any, target-port-guid 0x1234 : + +Note that any of the above ULPs might contain a target port GUID in the PR +query, so in order for these queries not to be recognized by the QoS manager +as SRP, the SRP match rule (or any match rule that refers to the target port +guid only) should be placed at the end of the qos-ulps match rules.
+ +6.6 MPI +SL for MPI is manually configured by the MPI admin. OpenSM is not forcing any SL +on the MPI traffic, which is why it is the only ULP that does not appear in +the qos-ulps section. + + +============================================================================== + 7. SL2VL Mapping and VL Arbitration +============================================================================== + +The OpenSM cached options file has a set of QoS related configuration parameters +that are used to configure SL2VL mapping and VL arbitration on IB ports. +These parameters are: + - Max VLs: the maximum number of VLs that will be on the subnet. + - High limit: the limit of the High Priority component of the VL Arbitration + table (IBA 7.6.9). + - VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template. + - VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template. + - SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs + corresponding to SLs 0-15 (Note that VL15 used here means drop this SL). + +There are separate QoS configuration parameter sets for various target types: +CAs, routers, switch external ports, and switch's enhanced port 0. The names +of such parameters are prefixed by the "qos_<type>_" string. Here is a full list +of the currently supported sets: + + qos_ca_ - QoS configuration parameters set for CAs. + qos_rtr_ - parameters set for routers. + qos_sw0_ - parameters set for switches' port 0. + qos_swe_ - parameters set for switches' external ports. + +Here's an example of typical default values for CAs and switches' external +ports (hard-coded in OpenSM initialization): + + qos_ca_max_vls=15 + qos_ca_high_limit=0 + qos_ca_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0 + qos_ca_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4 + qos_ca_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 + + qos_swe_max_vls=15 + qos_swe_high_limit=0 + qos_swe_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0 + qos_swe_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4 + qos_swe_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 + +VL arbitration tables (both high and low) are lists of VL/Weight pairs. +Each list entry contains a VL number (values from 0-14), and a weighting value +(values 0-255), indicating the number of 64 byte units (credits) which may be +transmitted from that VL when its turn in the arbitration occurs. A weight +of 0 indicates that this entry should be skipped. If a list entry is +programmed for VL15 or for a VL that is not supported or is not currently +configured by the port, the port may either skip that entry or send from any +supported VL for that entry. + +Note that the same VLs may be listed multiple times in the High or Low +priority arbitration tables, and, further, they can be listed in both tables. + +The limit of the high-priority VLArb table (qos_<type>_high_limit) indicates the +number of high-priority packets that can be transmitted without an opportunity +to send a low-priority packet. Specifically, the number of bytes that can be +sent is high_limit times 4K bytes. + +A high_limit value of 255 indicates that the byte limit is unbounded. +Note: if the 255 value is used, the low priority VLs may be starved. +A value of 0 indicates that only a single packet from the high-priority table +may be sent before an opportunity is given to the low-priority table. + +Keep in mind that ports usually transmit packets of size equal to MTU.
+
+Below is an example of SL2VL and VL Arbitration configuration on the subnet:
+
+    qos_ca_max_vls=15
+    qos_ca_high_limit=6
+    qos_ca_vlarb_high=0:4
+    qos_ca_vlarb_low=0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
+    qos_ca_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
+
+    qos_swe_max_vls=15
+    qos_swe_high_limit=6
+    qos_swe_vlarb_high=0:4
+    qos_swe_vlarb_low=0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
+    qos_swe_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
+
+In this example, there are 8 VLs configured on the subnet: VL0 to VL7. VL0 is
+defined as a high-priority VL, and it is limited to 6 x 4KB = 24KB in a single
+transmission burst. Such a configuration would suit a VL that needs low
+latency and uses a small MTU when transmitting packets. The rest of the VLs
+are defined as low-priority VLs with different weights, while VL4 is
+effectively turned off.

From kliteyn at dev.mellanox.co.il  Tue May 6 07:33:56 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 06 May 2008 17:33:56 +0300
Subject: [ofa-general] [PATCH] ofed/docs: fixing wrong syntax in the example
	in QoS_management_in_OpenSM.txt
In-Reply-To: <1210081452.2026.37.camel@hrosenstock-ws.xsigo.com>
References: <47E8C032.2050907@dev.mellanox.co.il>
	<1210079815.27137.190.camel@hrosenstock-ws.xsigo.com>
	<48205FE8.3070104@dev.mellanox.co.il>
	<1210081452.2026.37.camel@hrosenstock-ws.xsigo.com>
Message-ID: <48206C54.8010605@dev.mellanox.co.il>

Hal Rosenstock wrote:
> Hi Yevgeny,
>
> On Tue, 2008-05-06 at 16:40 +0300, Yevgeny Kliteynik wrote:
>> Hi Hal,
>>
>> Hal Rosenstock wrote:
>>> Hi Yevgeny,
>>>
>>> On Tue, 2008-03-25 at 11:04 +0200, Yevgeny Kliteynik wrote:
>>>> QoS_management_in_OpenSM.txt
>>> Shouldn't this doc also be available in the OpenSM git tree (in
>>> management/opensm/doc) and distributed as part of OpenSM ?
>>>
>>> If so, I can issue a patch for this.
>> I think that it's a good idea (which reminds me that I still
>> haven't fixed the patch for QoS stuff in OpenSM man page...)
>
> Yes, that would be nice too :-)
>
> That's going to make the OpenSM man page huge. Not sure how that should
> be dealt with. Maybe a simple way out in the short term might be to just
> reference that doc in the man page.

I thought about it too.

The only problem is that "short term" solutions have an astonishing
ability to stay as "long term", or even "final", solutions... :)

Let's think about the long term solution right away.
Are we OK with having just 10-15 lines about the existence of QoS
annex support in the OpenSM man page (in addition to the SL2VL and VLArb
tables configuration that already exists there), and a reference
to the QoS Management doc? I, for one, have no problems with that.

I tried reducing the full text to include it in the man page - it's
still very long...

I just checked the mailing list - the mail isn't there (it was "delayed
for approval" when there were problems with the mailing list filtering
policy a month ago). I'll forward it to you.

-- Yevgeny

> -- Hal
>
>> -- Yevgeny
>>
>>> Thanks.
>>>
>>> -- Hal
>>>
>>>
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>
>

From kliteyn at dev.mellanox.co.il  Tue May 6 07:34:43 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 06 May 2008 17:34:43 +0300
Subject: [Fwd: [ofa-general] [PATCH] opensm/man: Adding QoS-related info
	to opensm man pages]
Message-ID: <48206C83.8000009@dev.mellanox.co.il>

Hi Hal,

This is the mail that I was talking about (QoS info for the OpenSM man page).
Sasha has reviewed it and posted his answer to the mailing list.

-- Yevgeny

-------- Original Message --------
Subject: [ofa-general] [PATCH] opensm/man: Adding QoS-related info to opensm man pages
Date: Wed, 26 Mar 2008 02:47:08 +0200
From: Yevgeny Kliteynik
To: Sasha Khapyorsky
CC: OpenIB

Hi Sasha,

I've added QoS-related info to the opensm man pages: enhanced the
existing part (which talked about VL arbitration) and added a
description of the QoS manager in accordance with the QoS annex.

Please apply to ofed_1_3 and master.

Signed-off-by: Yevgeny Kliteynik
---
 opensm/man/opensm.8.in | 501 +++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 457 insertions(+), 44 deletions(-)

diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in
index 5322ab7..1d9c5b7 100644
--- a/opensm/man/opensm.8.in
+++ b/opensm/man/opensm.8.in
@@ -35,7 +35,8 @@ to initialize the InfiniBand hardware (at least one per each
 InfiniBand subnet).

 opensm also now contains an experimental version of a performance
-manager as well.
+manager and an experimental version of a QoS manager (in accordance
+with the IBA QoS Annex).

 opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB
 fabric, initialize it, and sweep occasionally for changes.

@@ -433,51 +434,463 @@ partition manager:

 Default=0x7fff,ipoib:ALL=full;

-.SH QOS CONFIGURATION
+.SH QUALITY OF SERVICE
 .PP
-There are a set of QoS related low-level configuration parameters.
-All these parameter names are prefixed by "qos_" string. Here is a full
-list of these parameters:
-
-    qos_max_vls - The maximum number of VLs that will be on the subnet
-    qos_high_limit - The limit of High Priority component of VL
-                     Arbitration table (IBA 7.6.9)
-    qos_vlarb_low - Low priority VL Arbitration table (IBA 7.6.9)
-                    template
-    qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
-                     template
-                     Both VL arbitration templates are pairs of
-                     VL and weight
-    qos_sl2vl - SL2VL Mapping table (IBA 7.6.6) template. It is
-                a list of VLs corresponding to SLs 0-15 (Note
-                that VL15 used here means drop this SL)
-
-Typical default values (hard-coded in OpenSM initialization) are:
-
- qos_max_vls=15
- qos_high_limit=0
- qos_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
- qos_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
- qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
-
-The syntax is compatible with rest of OpenSM configuration options and
-values may be stored in OpenSM config file (cached options file).
-
-In addition to the above, we may define separate QoS configuration
-parameters sets for various target types. As targets, we currently support
-CAs, routers, switch external ports, and switch's enhanced port 0. The
-names of such specialized parameters are prefixed by "qos_<type>_"
-string. Here is a full list of the currently supported sets:
-
- qos_ca_ - QoS configuration parameters set for CAs.
- qos_rtr_ - parameters set for routers.
- qos_sw0_ - parameters set for switches' port 0.
- qos_swe_ - parameters set for switches' external ports.
+OpenSM QoS support comprises two parts:

-Examples:
- qos_sw0_max_vls=2
- qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
- qos_swe_high_limit=0
+ 1. \fBQoS manager in accordance with the IBA QoS Annex\fP (experimental)
+.P
+ 2. \fBSL2VL and VL Arbitration tables configuration\fP
+.P
+.SS QoS Manager (experimental)
+.PP
+When Quality of Service in OpenSM is enabled (-Q or --qos), OpenSM looks
+for the QoS Policy file. The default name of this file is
+\fB\%@CONF_DIR@/@QOS_POLICY_FILE@\fP. The default may be changed by using
+the -Y or --qos_policy_file option with OpenSM.
+
+During fabric initialization and at every heavy sweep OpenSM parses the
+QoS policy file, applies its settings to the discovered fabric elements,
+and enforces the provided policy on client requests. The overall flow for
+such requests is as follows:
+ - The request is matched against the defined matching rules such that
+   the QoS Level definition is found.
+ - Given the QoS Level, a path(s) search is performed with the
+   restrictions imposed by that level.
+
+There are two ways to define QoS policy:
+ - \fBFull\fP: the full policy file syntax provides the administrator various
+   ways to match a PathRecord/MultiPathRecord (PR/MPR) request, and to
+   enforce various QoS constraints on the requested PR/MPR.
+ - \fBSimplified\fP: the simplified policy file syntax enables the
+   administrator to match PR/MPR requests by the various ULPs and
+   applications running on top of these ULPs.
+
+While the full policy syntax is very flexible, in many cases the simplified
+policy definition would be sufficient.
+.PP
+.B Full QoS Policy File
+.PP
+The QoS policy file has the following sections:
+
+.B I)
+Port Groups (denoted by port-groups).
+This section defines zero or more port groups that can be referred to later
+by the matching rules (see below). A port group lists ports by:
+ - Port GUID
+ - Port name, which is a combination of NodeDescription and IB port number
+ - PKey, which means that all the ports in the subnet that belong to the
+   partition with a given PKey belong to this port group
+ - Partition name, which means that all the ports in the subnet that belong
+   to the partition with a given name belong to this port group
+ - Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and
+   SELF (the SM's port).
+
+.B II)
+QoS Setup (denoted by qos-setup).
+This section describes how to set up SL2VL and VL Arbitration tables on
+various nodes in the fabric.
+However, this is not supported in OFED 1.3.
+SL2VL and VLArb tables should be configured in the OpenSM options file.
+
+.B III)
+QoS Levels (denoted by qos-levels).
+Each QoS Level defines a Service Level (SL) and a few optional fields:
+ - MTU limit
+ - Rate limit
+ - PKey
+ - Packet lifetime
+
+When a path search is performed, it is done with regard to the restrictions
+that these QoS Level parameters impose.
+One QoS level that is mandatory to define is a DEFAULT QoS level. It is
+applied to a PR/MPR query that does not match any existing match rule.
+Similar to any other QoS Level, it can also be explicitly referred to by any
+match rule.
+
+.B IV)
+QoS Matching Rules (denoted by qos-match-rules).
+Each PathRecord/MultiPathRecord query that OpenSM receives is matched against
+the set of matching rules. Rules are scanned in order of appearance in the
+QoS policy file such that the first match takes precedence.
+Each rule carries the name of the QoS level that will be applied to the
+matching query.
+The default QoS level is applied to a query that did not match any rule.
+Queries can be matched by:
+ - Source port group (whether the source port is a member of a specified group)
+ - Destination port group (same as above, only for the destination port)
+ - PKey
+ - QoS class
+ - Service ID
+
+To match a certain matching rule, a PR/MPR query has to match ALL the rule's
+criteria. However, not all the fields of the PR/MPR query have to appear in
+the matching rule.
+For instance, if the rule has a single criterion - Service ID - it will match
+any query that has this Service ID, disregarding the rest of the query fields.
+However, if a certain query has only a Service ID (which means that this is
+the only bit in the PR/MPR component mask that is on), it will not match any
+rule that has other matching criteria besides Service ID.
+.PP
+.B Simplified QoS Policy Definition
+.PP
+The simplified QoS policy definition comprises a single section denoted by
+qos-ulps. Similar to the full QoS policy, it has a list of match rules and
+their QoS Level, but in this case a match rule has only one criterion - its
+goal is to match a certain ULP (or a certain application on top of this ULP)
+PR/MPR request - and the QoS Level has only one constraint - the Service
+Level (SL).
+The simplified policy section may appear in the policy file in combination
+with the full policy, or as a stand-alone policy definition.
+See more details and a list of match rule criteria below.
+.PP
+.B Policy File Syntax Guidelines
+.PP
+Leading and trailing blanks, as well as empty lines, are ignored, so the
+indentation in the example is just for better readability.
+Comments start with the pound sign (#) and are terminated by EOL.
+Any keyword should be the first non-blank word on its line, unless it's a
+comment.
+Keywords that denote a section/subsection start have matching closing
+keywords.
+Having a QoS Level named "DEFAULT" is a must - it is applied to PR/MPR
+requests that didn't match any of the matching rules.
+Any section/subsection of the policy file is optional.
+
+.PP
+.B Examples of Full Policy File
+.PP
+As mentioned earlier, any section of the policy file is optional, and
+the only mandatory part of the policy file is a default QoS Level.
+Here's an example of the shortest policy file:
+
+    qos-levels
+        qos-level
+            name: DEFAULT
+            sl: 0
+        end-qos-level
+    end-qos-levels
+
+The port groups section is missing because there are no match rules, which
+means that port groups are not referred to anywhere, and there is no need to
+define them. And since this policy file doesn't have any matching rules, no
+PR/MPR query will match any rule, and OpenSM will enforce the default QoS
+level. Essentially, the above example is equivalent to not having a QoS
+policy file at all.
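+
+Before the full example, here is a minimal stand-alone C sketch of the
+first-match semantics described in section IV above (an illustration only,
+not the actual OpenSM matcher; the rule fields are simplified, and the real
+code also handles port groups, QoS class ranges, and the component mask):
+
+    /* Sketch only: the first matching rule wins; a rule matches when
+     * every criterion it specifies matches the query, and criteria it
+     * leaves unspecified are ignored. */
+    #include <stddef.h>
+    #include <stdint.h>
+
+    struct pr_query { uint64_t service_id; uint16_t pkey; };
+
+    struct match_rule {
+        int has_service_id, has_pkey;   /* which criteria are set */
+        uint64_t service_id;
+        uint16_t pkey;
+        const char *qos_level_name;
+    };
+
+    static const char *match_qos_level(const struct match_rule *rules,
+                                       size_t n, const struct pr_query *q)
+    {
+        size_t i;
+
+        for (i = 0; i < n; i++) {
+            const struct match_rule *r = &rules[i];
+
+            if (r->has_service_id && r->service_id != q->service_id)
+                continue;
+            if (r->has_pkey && r->pkey != q->pkey)
+                continue;
+            return r->qos_level_name;  /* first match takes precedence */
+        }
+        return "DEFAULT";              /* no rule matched */
+    }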
+
+The following example shows all the possible options and keywords in the
+policy file and their syntax:
+
+    #
+    # See the comments in the following example.
+    # They explain different keywords and their meaning.
+    #
+    port-groups
+
+        port-group
+            name: Storage
+            # "use" is just a description that is used for logging
+            # Other than that, it is just a comment
+            use: SRP Targets
+            port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA
+            port-guid: 0x1000000000FFFF
+        end-port-group
+
+        port-group
+            name: Virtual Servers
+            # The syntax of the port name is as follows:
+            #   "node_description/Pnum".
+            # node_description is compared to the NodeDescription of the node,
+            # and "Pnum" is a port number on that node.
+            port-name: vs1 HCA-1/P1, vs2 HCA-1/P1
+        end-port-group
+
+        # using partitions defined in the partition policy
+        port-group
+            name: Partitions
+            partition: Part1
+            pkey: 0x1234
+        end-port-group
+
+        # using node types: CA, ROUTER, SWITCH, SELF (for the node that runs
+        # the SM) or ALL (for all the nodes in the subnet)
+        port-group
+            name: CAs and SM
+            node-type: CA, SELF
+        end-port-group
+
+    end-port-groups
+
+    qos-setup
+        # This section of the policy file describes how to set up SL2VL and VL
+        # Arbitration tables on various nodes in the fabric.
+        # However, this is not supported in OFED 1.3 - the section is parsed
+        # and ignored. SL2VL and VLArb tables should be configured in the
+        # OpenSM options file (by default - /var/cache/opensm/opensm.opts).
+    end-qos-setup
+
+    qos-levels
+
+        # Having a QoS Level named "DEFAULT" is a must - it is applied to
+        # PR/MPR requests that didn't match any of the matching rules.
+        qos-level
+            name: DEFAULT
+            use: default QoS Level
+            sl: 0
+        end-qos-level
+
+        # the whole set: SL, MTU-Limit, Rate-Limit, PKey, Packet Lifetime
+        qos-level
+            name: WholeSet
+            sl: 1
+            mtu-limit: 4
+            rate-limit: 5
+            pkey: 0x1234
+            packet-life: 8
+        end-qos-level
+
+    end-qos-levels
+
+    # Match rules are scanned in order of their appearance in the policy file.
+    # The first matched rule takes precedence.
+    qos-match-rules
+
+        # matching by a single criterion: QoS class
+        qos-match-rule
+            use: by QoS class
+            qos-class: 7-9,11
+            # Name of the qos-level to apply to the matching PR/MPR
+            qos-level-name: WholeSet
+        end-qos-match-rule
+
+        # show matching by destination group and service id
+        qos-match-rule
+            use: Storage targets
+            destination: Storage
+            service-id: 0x10000000000001, 0x10000000000008-0x10000000000FFF
+            qos-level-name: WholeSet
+        end-qos-match-rule
+
+        qos-match-rule
+            source: Storage
+            use: match by source group only
+            qos-level-name: DEFAULT
+        end-qos-match-rule
+
+        qos-match-rule
+            use: match by all parameters
+            qos-class: 7-9,11
+            source: Virtual Servers
+            destination: Storage
+            service-id: 0x0000000000010000-0x000000000001FFFF
+            pkey: 0x0F00-0x0FFF
+            qos-level-name: WholeSet
+        end-qos-match-rule
+
+    end-qos-match-rules
+
+.PP
+.B Simplified QoS Policy - Details and Examples
+.PP
+Simplified QoS policy match rules are tailored for matching the PR/MPR
+requests of ULPs (or of some application on top of a ULP). The qos-ulps
+section holds a list of per-ULP (or per-application) match rules and the SL
+that should be enforced on the matched PR/MPR query.
+
+Match rules include:
+ - A default match rule that is applied to a PR/MPR query that didn't
+   match any of the other match rules
+ - SDP
+ - SDP application with a specific target TCP/IP port range
+ - SRP with a specific target IB port GUID
+ - RDS
+ - iSER
+ - iSER application with a specific target TCP/IP port range
+ - IPoIB with a default PKey
+ - IPoIB with a specific PKey
+ - any ULP/application with a specific Service ID in the PR/MPR query
+ - any ULP/application with a specific PKey in the PR/MPR query
+ - any ULP/application with a specific target IB port GUID in the PR/MPR query
+
+Since any section of the policy file is optional, as long as the basic rules
+of the file are kept (such as not referring to a nonexistent port group,
+having a default QoS Level, etc.), the simplified policy section (qos-ulps)
+can serve as a complete QoS policy file.
+The shortest policy file in this case would be as follows:
+
+    qos-ulps
+        default : 0 # default SL
+    end-qos-ulps
+
+It is equivalent to not having a policy file at all.
+
+Below is an example of a simplified QoS policy with all the possible keywords:
+
+    qos-ulps
+        default                               : 0  # default SL
+        sdp, port-num 30000                   : 0  # SL for application running on top
+                                                   # of SDP when a destination
+                                                   # TCP/IP port is 30000
+        sdp, port-num 10000-20000             : 0
+        sdp                                   : 1  # default SL for any other
+                                                   # application running on top of SDP
+        rds                                   : 2  # SL for RDS traffic
+        iser, port-num 900                    : 0  # SL for iSER with a specific target
+                                                   # port
+        iser                                  : 3  # default SL for iSER
+        ipoib, pkey 0x0001                    : 0  # SL for IPoIB on partition with
+                                                   # pkey 0x0001
+        ipoib                                 : 4  # default IPoIB partition,
+                                                   # pkey=0x7FFF
+        any, service-id 0x6234                : 6  # match any PR/MPR query with a
+                                                   # specific Service ID
+        any, pkey 0x0ABC                      : 6  # match any PR/MPR query with a
+                                                   # specific PKey
+        srp, target-port-guid 0x1234          : 5  # SRP when SRP Target is located on
+                                                   # a specified IB port GUID
+        any, target-port-guid 0x0ABC-0xFFFFF  : 6  # match any PR/MPR query with
+                                                   # a specific target port GUID
+    end-qos-ulps
+
+
+Similar to the full policy definition, matching of PR/MPR queries is done in
+order of appearance in the QoS policy file such that the first match takes
+precedence, except for the "default" rule, which is applied only if the query
+didn't match any other rule.
+
+All other sections of the QoS policy file take precedence over the qos-ulps
+section. That is, if a policy file has both qos-match-rules and qos-ulps
+sections, then any query is matched first against the rules in the
+qos-match-rules section, and only if there was no match, the query is matched
+against the rules in the qos-ulps section.
+
+Note that some of these match rules may overlap, so in order to use the
+simplified QoS definition effectively, it is important to understand how each
+of the ULPs is matched:
+
+.B IPoIB:
+An IPoIB PR query is matched by PKey. The default PKey for an IPoIB partition
+is 0x7fff, so the following three match rules are equivalent:
+
+    ipoib               :
+    ipoib, pkey 0x7fff  :
+    any, pkey 0x7fff    :
+
+.I Note
+: For OFED 1.3, IPoIB partition SL configuration should be done through the
+partition configuration file only.
+
+\fBSDP\fP: An SDP PR query is matched by Service ID. The Service-ID for SDP
+is 0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP
+Port Number to connect to. The following two match rules are equivalent:
+
+    sdp                                                     :
+    any, service-id 0x0000000000010000-0x000000000001ffff  :
+
+\fBRDS\fP: Similar to SDP, an RDS PR query is matched by Service ID. The
+Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hex digits
+holding the remote TCP/IP Port Number to connect to. The default port number
+for RDS is 0x48CA, which makes the default Service-ID 0x00000000010648CA.
+The following two match rules are equivalent:
+
+    rds                                 :
+    any, service-id 0x00000000010648CA  :
+
+\fBiSER\fP: Similar to RDS, an iSER query is matched by Service ID, where the
+Service ID is also 0x000000000106PPPP. The default port number for iSER is
+0x035C, which makes the default Service-ID 0x000000000106035C.
+The following two match rules are equivalent:
+
+    iser                                :
+    any, service-id 0x000000000106035C  :
+
+\fBSRP\fP: The Service ID for SRP varies from storage vendor to vendor, thus
+an SRP query is matched by the target IB port GUID. The following two match
+rules are equivalent:
+
+    srp, target-port-guid 0x1234  :
+    any, target-port-guid 0x1234  :
+
+Note that any of the above ULPs might contain a target port GUID in the PR
+query, so in order for these queries not to be recognized by the QoS manager
+as SRP, the SRP match rule (or any match rule that refers to the target port
+GUID only) should be placed at the end of the qos-ulps match rules.
+
+\fBMPI\fP: The SL for MPI is manually configured by the MPI admin. OpenSM
+does not force any SL on MPI traffic, which is why it is the only ULP that
+does not appear in the qos-ulps section.
+
+
+.SS SL2VL Mapping and VL Arbitration
+.PP
+
+The OpenSM cached options file has a set of QoS-related configuration
+parameters that are used to configure SL2VL mapping and VL arbitration
+on IB ports. These parameters are:
+ - Max VLs: the maximum number of VLs that will be on the subnet.
+ - High limit: the limit of the High Priority component of the VL Arbitration
+   table (IBA 7.6.9).
+ - VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template.
+ - VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template.
+ - SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs
+   corresponding to SLs 0-15 (note that VL15 used here means drop this SL).
+
+There are separate QoS configuration parameter sets for various target
+types: CAs, routers, switch external ports, and the switch's enhanced port 0.
+The names of such parameters are prefixed by the "qos_<type>_" string.
+Here is a full list of the currently supported sets:
+
+    qos_ca_  - QoS configuration parameters set for CAs.
+    qos_rtr_ - parameters set for routers.
+    qos_sw0_ - parameters set for switches' port 0.
+    qos_swe_ - parameters set for switches' external ports.
+
+Here is an example of typical default values for all the ports in the
+subnet (hard-coded in OpenSM initialization):
+
+    qos_max_vls=15
+    qos_high_limit=0
+    qos_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
+    qos_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
+    qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
+
+
+VL arbitration tables (both high and low) are lists of VL/Weight pairs.
+Each list entry contains a VL number (values 0-14) and a weighting value
+(values 0-255), indicating the number of 64-byte units (credits) which may be
+transmitted from that VL when its turn in the arbitration occurs. A weight
+of 0 indicates that this entry should be skipped. If a list entry is
+programmed for VL15 or for a VL that is not supported or is not currently
+configured by the port, the port may either skip that entry or send from any
+supported VL for that entry.
+
+Note that the same VL may be listed multiple times in the High or Low
+priority arbitration tables and, further, may be listed in both tables.
+
+The limit of the high-priority VLArb table (qos_<type>_high_limit) indicates
+the number of high-priority packets that can be transmitted without an
+opportunity to send a low-priority packet. Specifically, the number of bytes
+that can be sent is high_limit times 4K bytes.
+
+A high_limit value of 255 indicates that the byte limit is unbounded.
+Note: if the 255 value is used, the low-priority VLs may be starved.
+A value of 0 indicates that only a single packet from the high-priority table
+may be sent before an opportunity is given to the low-priority table.
+
+Keep in mind that ports usually transmit packets of size equal to MTU.
+For instance, for 4KB MTU a single packet will require 64 credits, so in order
+to achieve effective VL arbitration for packets of 4KB MTU, the weighting
+values for each VL should be multiples of 64.
+
+Below is an example of SL2VL and VL Arbitration configuration on the subnet:
+
+    qos_max_vls=15
+    qos_high_limit=6
+    qos_vlarb_high=0:4
+    qos_vlarb_low=0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
+    qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
+
+In this example, there are 8 VLs configured on the subnet: VL0 to VL7. VL0 is
+defined as a high-priority VL, and it is limited to 6 x 4KB = 24KB in a single
+transmission burst. Such a configuration would suit a VL that needs low
+latency and uses a small MTU when transmitting packets. The rest of the VLs
+are defined as low-priority VLs with different weights, while VL4 is
+effectively turned off.

 .SH PREFIX ROUTES
 .PP
--
1.5.1.4

_______________________________________________
general mailing list
general at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From hrosenstock at xsigo.com  Tue May 6 07:37:04 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Tue, 06 May 2008 07:37:04 -0700
Subject: [ofa-general] [PATCH] ofed/docs: fixing wrong syntax in the example
	in QoS_management_in_OpenSM.txt
In-Reply-To: <48206C54.8010605@dev.mellanox.co.il>
References: <47E8C032.2050907@dev.mellanox.co.il>
	<1210079815.27137.190.camel@hrosenstock-ws.xsigo.com>
	<48205FE8.3070104@dev.mellanox.co.il>
	<1210081452.2026.37.camel@hrosenstock-ws.xsigo.com>
	<48206C54.8010605@dev.mellanox.co.il>
Message-ID: <1210084624.2026.51.camel@hrosenstock-ws.xsigo.com>

On Tue, 2008-05-06 at 17:33 +0300, Yevgeny Kliteynik wrote:
> Hal Rosenstock wrote:
> > Hi Yevgeny,
> >
> > On Tue, 2008-05-06 at 16:40 +0300, Yevgeny Kliteynik wrote:
> >> Hi Hal,
> >>
> >> Hal Rosenstock wrote:
> >>> Hi Yevgeny,
> >>>
> >>> On Tue, 2008-03-25 at 11:04 +0200, Yevgeny Kliteynik wrote:
> >>>> QoS_management_in_OpenSM.txt
> >>> Shouldn't this doc also be available in the OpenSM git tree (in
> >>> management/opensm/doc) and distributed as part of OpenSM ?
> >>>
> >>> If so, I can issue a patch for this.
> >> I think that it's a good idea (which reminds me that I still
> >> haven't fixed the patch for QoS stuff in OpenSM man page...)
> >
> > Yes, that would be nice too :-)
> >
> > That's going to make the OpenSM man page huge. Not sure how that should
> > be dealt with. Maybe a simple way out in the short term might be to just
> > reference that doc in the man page.
>
> I thought about it too.
>
> The only problem is that "short term" solutions have an astonishing
> ability to stay as "long term", or even "final",
> solutions... :)
>
> Let's think about the long term solution right away.
> Are we OK with having just 10-15 lines about the existence of QoS
> annex support in the OpenSM man page (in addition to the SL2VL and VLArb
> tables configuration that already exists there), and a reference
> to the QoS Management doc? I, for one, have no problems with

At a high level, this sounds fine to me but I'd want to see the actual
text to be sure.

> I tried reducing the full text to include it in the man page - it's
> still very long...

I think the OpenSM man page needs to be broken up.

-- Hal

> I just checked the mailing list - the mail isn't there (it was "delayed
> for approval" when there were problems with the mailing list filtering
> policy a month ago). I'll forward it to you.
>
> -- Yevgeny
>
> > -- Hal
> >
> >> -- Yevgeny
> >>
> >>> Thanks.
> >>>
> >>> -- Hal
> >>>
> >>>
> >> _______________________________________________
> >> general mailing list
> >> general at lists.openfabrics.org
> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >>
> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>
>

From hrosenstock at xsigo.com  Tue May 6 07:45:25 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Tue, 06 May 2008 07:45:25 -0700
Subject: [Fwd: [ofa-general] [PATCH] opensm/man: Adding QoS-related info
	to opensm man pages]
In-Reply-To: <48206C83.8000009@dev.mellanox.co.il>
References: <48206C83.8000009@dev.mellanox.co.il>
Message-ID: <1210085125.2026.60.camel@hrosenstock-ws.xsigo.com>

Hi Yevgeny,

On Tue, 2008-05-06 at 17:34 +0300, Yevgeny Kliteynik wrote:
> Hi Hal,
>
> This is the mail that I was talking about (QoS info for the OpenSM man page).
> Sasha has reviewed it and posted his answer to the mailing list.

I must have missed that. What was the date of that post ?

See below for some additional comments.

-- Hal

>
> -- Yevgeny
>
>
> -------- Original Message --------
> Subject: [ofa-general] [PATCH] opensm/man: Adding QoS-related info to opensm man pages
> Date: Wed, 26 Mar 2008 02:47:08 +0200
> From: Yevgeny Kliteynik
> To: Sasha Khapyorsky
> CC: OpenIB
>
> Hi Sasha,
>
> I've added QoS-related info to the opensm man pages: enhanced the
> existing part (which talked about VL arbitration) and added a
> description of the QoS manager in accordance with the QoS annex.
>
> Please apply to ofed_1_3 and master.
>
> Signed-off-by: Yevgeny Kliteynik
> ---
>  opensm/man/opensm.8.in | 501 +++++++++++++++++++++++++++++++++++++++++++-----
>  1 files changed, 457 insertions(+), 44 deletions(-)
>
> diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in
> index 5322ab7..1d9c5b7 100644
> --- a/opensm/man/opensm.8.in
> +++ b/opensm/man/opensm.8.in
> @@ -35,7 +35,8 @@ to initialize the InfiniBand hardware (at least one per each
>  InfiniBand subnet).
>
>  opensm also now contains an experimental version of a performance
> -manager as well.
> +manager and an experimental version of a QoS manager (in accordance
> +with the IBA QoS Annex).

Minor tweak as I think the performance manager is no longer being
indicated as experimental:

opensm also now contains a performance manager as well as an
experimental QoS manager (in accordance with IBTA 1.2.1 QoS Annex).

> opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB
> fabric, initialize it, and sweep occasionally for changes.
> @@ -433,51 +434,463 @@ partition manager: > > Default=0x7fff,ipoib:ALL=full; > > -.SH QOS CONFIGURATION > +.SH QUALITY OF SERVICE > .PP > -There are a set of QoS related low-level configuration parameters. > -All these parameter names are prefixed by "qos_" string. Here is a full > -list of these parameters: > - > - qos_max_vls - The maximum number of VLs that will be on the subnet > - qos_high_limit - The limit of High Priority component of VL > - Arbitration table (IBA 7.6.9) > - qos_vlarb_low - Low priority VL Arbitration table (IBA 7.6.9) > - template > - qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9) > - template > - Both VL arbitration templates are pairs of > - VL and weight > - qos_sl2vl - SL2VL Mapping table (IBA 7.6.6) template. It is > - a list of VLs corresponding to SLs 0-15 (Note > - that VL15 used here means drop this SL) > - > -Typical default values (hard-coded in OpenSM initialization) are: > - > - qos_max_vls=15 > - qos_high_limit=0 > - qos_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4 > - qos_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0 > - qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 > - > -The syntax is compatible with rest of OpenSM configuration options and > -values may be stored in OpenSM config file (cached options file). > - > -In addition to the above, we may define separate QoS configuration > -parameters sets for various target types. As targets, we currently support > -CAs, routers, switch external ports, and switch's enhanced port 0. The > -names of such specialized parameters are prefixed by "qos__" > -string. Here is a full list of the currently supported sets: > - > - qos_ca_ - QoS configuration parameters set for CAs. > - qos_rtr_ - parameters set for routers. > - qos_sw0_ - parameters set for switches' port 0. > - qos_swe_ - parameters set for switches' external ports. > +OpenSM QoS support comprises of two parts: > > -Examples: > - qos_sw0_max_vls=2 > - qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0, > - qos_swe_high_limit=0 > + 1. \fBQoS manager in accordance with IBA QoS Annex\fP (experimental) > +.P > + 2. \fBSL2VL and VL Arbitration tables configuration\fP > +.P > +.SS QoS Manager (experimental) > +.PP > +When Quality of Service in OpenSM is enabled (-Q or --qos), OpenSM looks > +for QoS Policy file. The default name of this file is > +\fB\%@CONF_DIR@/@QOS_POLICY_FILE@\fP. The default may be changed by using > +-Y or --qos_policy_file option with OpenSM. This is essentially the QoS management doc cast into the man page with the older primitive QoS tacked on. Should the annex support just refer to the doc and leave the older description present ? Or something else ? Maybe that was addressed in Sasha's response. > + > +During fabric initialization and at every heavy sweep OpenSM parses the > +QoS policy file, applies its settings to the discovered fabric elements, > +and enforces the provided policy on client requests. The overall flow for > +such requests is as follows: > + - The request is matched against the defined matching rules such that > + the QoS Level definition is found. > + - Given the QoS Level, path(s) search is performed with the given > + restrictions imposed by that level. > + > +There are two ways to define QoS policy: > + - \fBFull\fP: the full policy file syntax provides the administrator various > + ways to match a PathRecord/MultiPathRecord (PR/MPR) request, and to > + enforce various QoS constraints on the requested PR/MPR. 
> + - \fBSimplified\fP: the simplified policy file syntax enables the administrator > + match PR/MPR requests by various ULPs and applications running on top of > + these ULPs. > + > +While the full policy syntax is very flexible, in many cases the simplified > +policy definition would be sufficient. > +.PP > +.B Full QoS Policy File > +.PP > +QoS policy file has the following sections: > + > +.B I) > +Port Groups (denoted by port-groups). > +This section defines zero or more port groups that can be referred later by > +matching rules (see below). Port group lists ports by: > + - Port GUID > + - Port name, which is a combination of NodeDescription and IB port number > + - PKey, which means that all the ports in the subnet that belong to > + partition with a given PKey belong to this port group > + - Partition name, which means that all the ports in the subnet that belong > + to partition with a given name belong to this port group > + - Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and > + SELF (SM's port). > + > +.B II) > +QoS Setup (denoted by qos-setup). > +This section describes how to set up SL2VL and VL Arbitration tables on > +various nodes in the fabric. > +However, this is not supported in OFED 1.3. > +SL2VL and VLArb tables should be configured in the OpenSM options file. > + > +.B III) > +QoS Levels (denoted by qos-levels). > +Each QoS Level defines Service Level (SL) and a few optional fields: > + - MTU limit > + - Rate limit > + - PKey > + - Packet lifetime > + > +When path(s) search is performed, it is done with regards to restriction that > +these QoS Level parameters impose. > +One QoS level that is mandatory to define is a DEFAULT QoS level. It is > +applied to a PR/MPR query that does not match any existing match rule. > +Similar to any other QoS Level, it can also be explicitly referred by any > +match rule. > + > +.B IV) > +QoS Matching Rules (denoted by qos-match-rules). > +Each PathRecord/MultiPathRecord query that OpenSM receives is matched against > +the set of matching rules. Rules are scanned in order of appearance in the QoS > +policy file such as the first match takes precedence. > +Each rule has a name of QoS level that will be applied to the matching query. > +A default QoS level is applied to a query that did not match any rule. > +Queries can be matched by: > + - Source port group (whether a source port is a member of a specified group) > + - Destination port group (same as above, only for destination port) > + - PKey > + - QoS class > + - Service ID > + > +To match a certain matching rule, PR/MPR query has to match ALL the rule's > +criteria. However, not all the fields of the PR/MPR query have to appear in > +the matching rule. > +For instance, if the rule has a single criterion - Service ID, it will match > +any query that has this Service ID, disregarding rest of the query fields. > +However, if a certain query has only Service ID (which means that this is the > +only bit in the PR/MPR component mask that is on), it will not match any rule > +that has other matching criteria besides Service ID. > +.PP > +.B Simplified QoS Policy Definition > +.PP > +Simplified QoS policy definition comprises of a single section denoted by > +qos-ulps. Similar to the full QoS policy, it has a list of match rules and > +their QoS Level, but in this case a match rule has only one criterion - its > +goal is to match a certain ULP (or a certain application on top of this ULP) > +PR/MPR request, and QoS Level has only one constraint - Service Level (SL). 
> +The simplified policy section may appear in the policy file in combine with > +the full policy, or as a stand-alone policy definition. > +See more details and list of match rule criteria below. > +.PP > +.B Policy File Syntax Guidelines > +.PP > +Empty lines are ignored. > +Leading and trailing blanks, as well as empty lines, are ignored, so the > +indentation in the example is just for better readability. > +Comments are started with the pound sign (#) and terminated by EOL. > +Any keyword should be the first non-blank in the line, unless it's a comment. > +Keywords that denote section/subsection start have matching closing keywords. > +Having a QoS Level named "DEFAULT" is a must - it is applied to PR/MPR > +requests that didn't match any of the matching rules. > +Any section/subsection of the policy file is optional. > + > +.PP > +.B Examples of Full Policy File > +.PP > +As mentioned earlier, any section of the policy file is optional, and > +the only mandatory part of the policy file is a default QoS Level. > +Here's an example of the shortest policy file: > + > + qos-levels > + qos-level > + name: DEFAULT > + sl: 0 > + end-qos-level > + end-qos-levels > + > +Port groups section is missing because there are no match rules, which means > +that port groups are not referred anywhere, and there is no need defining > +them. And since this policy file doesn't have any matching rules, PR/MPR query > +won't match any rule, and OpenSM will enforce default QoS level. > +Essentially, the above example is equivalent to not having QoS policy file > +at all. > + > +The following example shows all the possible options and keywords in the > +policy file and their syntax: > + > + # > + # See the comments in the following example. > + # They explain different keywords and their meaning. > + # > + port-groups > + port-group > + name: Storage > + # "use" is just a description that is used for logging > + # Other than that, it is just a comment > + use: SRP Targets > + port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA > + port-guid: 0x1000000000FFFF > + end-port-group > + > + port-group > + name: Virtual Servers > + # The syntax of the port name is as follows: > + # "node_description/Pnum". > + # node_description is compared to the NodeDescription of the node, > + # and "Pnum" is a port number on that node. > + port-name: vs1 HCA-1/P1, vs2 HCA-1/P1 > + end-port-group > + > + # using partitions defined in the partition policy > + port-group > + name: Partitions > + partition: Part1 > + pkey: 0x1234 > + end-port-group > + > + # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM) > + # or ALL (for all the nodes in the subnet) > + port-group > + name: CAs and SM > + node-type: CA, SELF > + end-port-group > + > + end-port-groups > + > + qos-setup > + # This section of the policy file describes how to set up SL2VL and VL > + # Arbitration tables on various nodes in the fabric. > + # However, this is not supported in OFED 1.3 - the section is parsed > + # and ignored. SL2VL and VLArb tables should be configured in the > + # OpenSM options file (by default - /var/cache/opensm/opensm.opts). > + end-qos-setup > + > + qos-levels > + > + # Having a QoS Level named "DEFAULT" is a must - it is applied to > + # PR/MPR requests that didn't match any of the matching rules. 
> + qos-level > + name: DEFAULT > + use: default QoS Level > + sl: 0 > + end-qos-level > + > + # the whole set: SL, MTU-Limit, Rate-Limit, PKey, Packet Lifetime > + qos-level > + name: WholeSet > + sl: 1 > + mtu-limit: 4 > + rate-limit: 5 > + pkey: 0x1234 > + packet-life: 8 > + end-qos-level > + > + end-qos-levels > + > + # Match rules are scanned in order of their apperance in the policy file. > + # First matched rule takes precedence. > + qos-match-rules > + > + # matching by single criteria: QoS class > + qos-match-rule > + use: by QoS class > + qos-class: 7-9,11 > + # Name of qos-level to apply to the matching PR/MPR > + qos-level-name: WholeSet > + end-qos-match-rule > + > + # show matching by destination group and service id > + qos-match-rule > + use: Storage targets > + destination: Storage > + service-id: 0x10000000000001, 0x10000000000008-0x10000000000FFF > + qos-level-name: WholeSet > + end-qos-match-rule > + > + qos-match-rule > + source: Storage > + use: match by source group only > + qos-level-name: DEFAULT > + end-qos-match-rule > + > + qos-match-rule > + use: match by all parameters > + qos-class: 7-9,11 > + source: Virtual Servers > + destination: Storage > + service-id: 0x0000000000010000-0x000000000001FFFF > + pkey: 0x0F00-0x0FFF > + qos-level-name: WholeSet > + end-qos-match-rule > + > + end-qos-match-rules > + > +.PP > +.B Simplified QoS Policy - Details and Examples > +.PP > +Simplified QoS policy match rules are tailored for matching ULPs (or > +some application on top of a ULP) PR/MPR requests. It has a list of > +per-ULP (or per-application) match rules and the SL that should be > +enforced on the matched PR/MPR query. > + > +Match rules include: > + - Default match rule that is applied to PR/MPR query that didn't > + match any of the other match rules > + - SDP > + - SDP application with a specific target TCP/IP port range > + - SRP with a specific target IB port GUID > + - RDS > + - iSER > + - iSER application with a specific target TCP/IP port range > + - IPoIB with a default PKey > + - IPoIB with a specific PKey > + - any ULP/application with a specific Service ID in the PR/MPR query > + - any ULP/application with a specific PKey in the PR/MPR query > + - any ULP/application with a specific target IB port GUID in the PR/MPR query > + > +Since any section of the policy file is optional, as long as basic rules > +of the file are kept (such as no referring to nonexisting port group, > +having default QoS Level, etc), the simplified policy section (qos-ulps) > +can serve as a complete QoS policy file. > +The shortest policy file in this case would be as follows: > + > + qos-ulps > + default : 0 #default SL > + end-qos-ulps > + > +It is equivalent to not having policy file at all. 
> + > +Below is an example of simplified QoS policy with all the possible keywords: > + > + qos-ulps > + default : 0 # default SL > + sdp, port-num 30000 : 0 # SL for application running on top > + # of SDP when a destination > + # TCP/IPport is 30000 > + sdp, port-num 10000-20000 : 0 > + sdp : 1 # default SL for any other > + # application running on top of SDP > + rds : 2 # SL for RDS traffic > + iser, port-num 900 : 0 # SL for iSER with a specific target > + # port > + iser : 3 # default SL for iSER > + ipoib, pkey 0x0001 : 0 # SL for IPoIB on partition with > + # pkey 0x0001 > + ipoib : 4 # default IPoIB partition, > + # pkey=0x7FFF > + any, service-id 0x6234 : 6 # match any PR/MPR query with a > + # specific Service ID > + any, pkey 0x0ABC : 6 # match any PR/MPR query with a > + # specific PKey > + srp, target-port-guid 0x1234 : 5 # SRP when SRP Target is located on > + # a specified IB port GUID > + any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query with > + # a specific target port GUID > + end-qos-ulps > + > + > +Similar to the full policy definition, matching of PR/MPR queries is done in > +order of appearance in the QoS policy file such as the first match takes > +precedence, except for the "default" rule, which is applied only if the query > +didn't match any other rule. > + > +All other sections of the QoS policy file take precedence over the qos-ulps > +section. That is, if a policy file has both qos-match-rules and qos-ulps > +sections, then any query is matched first against the rules in the > +qos-match-rules section, and only if there was no match, the query is matched > +against the rules in qos-ulps section. > + > +Note that some of these match rules may overlap, so in order to use the > +simplified QoS definition effectively, it is important to understand how each > +of the ULPs is matched: > + > +.B IPoIB: > +PR query is matched by PKey. Default PKey for IPoIB partition is 0x7fff, so > +the following three match rules are equivalent: > + > + ipoib : > + ipoib, pkey 0x7fff : > + any, pkey 0x7fff : > + > +.I Note > +: For OFED 1.3, IPoIB partition SL configuration should be done through > +partition configuration file only. > + > +\fBSDP\fP: PR query is matched by Service ID. The Service-ID for SDP is > +0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP > +Port Number to connect to. The following two match rules are equivalent: > + > + sdp : > + any, service-id 0x0000000000010000-0x000000000001ffff : > + > +\fBRDS\fP: Similar to SDP, RDS PR query is matched by Service ID. The > +Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hex digits > +holding the remote TCP/IP Port Number to connect to. Default port number > +for RDS is 0x48CA, which makes a default Service-ID 0x00000000010648CA. > +The following two match rules are equivalent: > + > + rds : > + any, service-id 0x00000000010648CA : > + > +\fBiSER\fP: Similar to RDS, iSER query is matched by Service ID, where the > +Service ID is also 0x000000000106PPPP. Default port number for iSER is 0x035C, > +which makes a default Service-ID 0x000000000106035C. > +The following two match rules are equivalent: > + > + iser : > + any, service-id 0x000000000106035C : > + > +\fBSRP\fP: Service ID for SRP varies from storage vendor to vendor, thus SRP query is > +matched by the target IB port GUID. 
The following two match rules are > +equivalent: > + > + srp, target-port-guid 0x1234 : > + any, target-port-guid 0x1234 : > + > +Note that any of the above ULPs might contain target port GUID in the PR > +query, so in order for these queries not to be recognized by the QoS manager > +as SRP, the SRP match rule (or any match rule that refers to the target port > +guid only) should be placed at the end of the qos-ulps match rules. > + > +\fBMPI\fP: SL for MPI is manually configured by MPI admin. OpenSM is not > +forcing any SL on the MPI traffic, and that's why it is the only ULP that > +did not appear in the qos-ulps section. > + > + > +.SS SL2VL Mapping and VL Arbitration > +.PP > + > +OpenSM cached options file has a set of QoS related configuration > +parameters, that are used to configure SL2VL mapping and VL arbitration > +on IB ports. These parameters are: > + - Max VLs: the maximum number of VLs that will be on the subnet. > + - High limit: the limit of High Priority component of VL Arbitration > + table (IBA 7.6.9). > + - VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template. > + - VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template. > + - SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs > + corresponding to SLs 0-15 (Note that VL15 used here means drop this SL). > + > +There are separate QoS configuration parameters sets for various target > +types: CAs, routers, switch external ports, and switch's enhanced port 0. > +The names of such parameters are prefixed by "qos__" string. > +Here is a full list of the currently supported sets: > + > + qos_ca_ - QoS configuration parameters set for CAs. > + qos_rtr_ - parameters set for routers. > + qos_sw0_ - parameters set for switches' port 0. > + qos_swe_ - parameters set for switches' external ports. > + > +Here's the example of typical default values for all the ports in the > +subnet (hard-coded in OpenSM initialization): > + > + qos_max_vls=15 > + qos_high_limit=0 > + qos_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0 > + qos_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4 > + qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 > + > + > +VL arbitration tables (both high and low) are lists of VL/Weight pairs. > +Each list entry contains a VL number (values from 0-14), and a weighting value > +(values 0-255), indicating the number of 64 byte units (credits) which may be > +transmitted from that VL when its turn in the arbitration occurs. A weight > +of 0 indicates that this entry should be skipped. If a list entry is > +programmed for VL15 or for a VL that is not supported or is not currently > +configured by the port, the port may either skip that entry or send from any > +supported VL for that entry. > + > +Note, that the same VLs may be listed multiple times in the High or Low > +priority arbitration tables, and, further, it can be listed in both tables. > + > +The limit of high-priority VLArb table (qos__high_limit) indicates the > +number of high-priority packets that can be transmitted without an opportunity > +to send a low-priority packet. Specifically, the number of bytes that can be > +sent is high_limit times 4K bytes. > + > +A high_limit value of 255 indicates that the byte limit is unbounded. > +Note: if the 255 value is used, the low priority VLs may be starved. > +A value of 0 indicates that only a single packet from the high-priority table > +may be sent before an opportunity is given to the low-priority table. 
> + > +Keep in mind that ports usually transmit packets of size equal to MTU. > +For instance, for 4KB MTU a single packet will require 64 credits, so in order > +to achieve effective VL arbitration for packets of 4KB MTU, the weighting > +values for each VL should be multiples of 64. > + > +Below is an example of SL2VL and VL Arbitration configuration on subnet: > + > + qos_max_vls=15 > + qos_high_limit=6 > + qos_vlarb_high=0:4 > + qos_vlarb_low=0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64 > + qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 > + > +In this example, there are 8 VLs configured on subnet: VL0 to VL7. VL0 is > +defined as a high priority VL, and it is limited to 6 x 4KB = 24KB in a single > +transmission burst. Such configuration would suilt VL that needs low latency > +and uses small MTU when transmitting packets. Rest of VLs are defined as low > +priority VLs with different weights, while VL4 is effectively turned off. > > .SH PREFIX ROUTES > .PP From hrosenstock at xsigo.com Tue May 6 07:46:12 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 06 May 2008 07:46:12 -0700 Subject: [ofa-general] IBTA Compliance- Mkey Violations trap In-Reply-To: <04d301c8af80$fd1a19c0$3414a8c0@md.baymicrosystems.com> References: <47E8C032.2050907@dev.mellanox.co.il> <1210079815.27137.190.camel@hrosenstock-ws.xsigo.com> <48205FE8.3070104@dev.mellanox.co.il> <04d301c8af80$fd1a19c0$3414a8c0@md.baymicrosystems.com> Message-ID: <1210085172.2026.61.camel@hrosenstock-ws.xsigo.com> Suri, On Tue, 2008-05-06 at 09:56 -0400, Suresh Shelvapille wrote: > Folks: > > WRT sending traps on MkeyViolations, the spec is a little ambiguous IMO. > In section 14.2.4.2 C-14-18 is it saying to send a trap the first time the Mkey Violation > happens and the Lease Expiry timer is started or to send a trap(possibly multiple of them) > every time Mkey Violation happens even though the lease timer may have been started already? I think that it's every time (independent of whether the lease countdown has already been started or not): See o14-9. Note also that there is a max trap rate requirement though per PortInfo:SubnetTimeOut. If your company is an IBTA member, a better place for this inquiry is on the mgtwg mailing list IMO. -- Hal > What is the consensus in this group? > > Many thanks, > Suri > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From andrea at qumranet.com Tue May 6 07:46:54 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Tue, 6 May 2008 16:46:54 +0200 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080505194625.GA17734@sgi.com> References: <1489529e7b53d3f2dab8.1209740704@duo.random> <20080505162113.GA18761@sgi.com> <20080505171434.GF8470@duo.random> <20080505172506.GA9247@sgi.com> <20080505183405.GI8470@duo.random> <20080505194625.GA17734@sgi.com> Message-ID: <20080506144654.GD8471@duo.random> On Mon, May 05, 2008 at 02:46:25PM -0500, Jack Steiner wrote: > If a task fails to unmap a GRU segment, they still exist at the start of Yes, this will also happen in case the well behaved task receives SIGKILL, so you can test it that way too. > exit. On the ->release callout, I set a flag in the container of my > mmu_notifier that exit has started. As VMA are cleaned up, TLB flushes > are skipped because of the flag is set. 
When the GRU VMA is deleted, I free GRU TLB flushes aren't skipped because your flag is set but because __mmu_notifier_release already executed list_del_init_rcu(&grunotifier->hlist) before proceeding with unmap_vmas. > my structure containing the notifier. As long as nobody can write through the already established gru tlbs and nobody can establish new tlbs after exit_mmap run you don't strictly need ->release. > I _think_ works. Do you see any problems? You can remove the flag and ->release and ->clear_flush_young (if you keep clear_flush_young implemented it should return 0). The synchronize_rcu after mmu_notifier_register can also be dropped thanks to mm_lock(). gru_drop_mmu_notifier should be careful with current->mm if you're using an fd and if the fd can be passed to a different task through unix sockets (you should probably fail any operation if current->mm != gru->mm). The way I use ->release in KVM is to set the root hpa to -1UL (invalid) as a debug trap. That's only for debugging because even if tlb entries and sptes are still established on the secondary mmu they are only relevant when the cpu jumps to guest mode and that can never happen again after exit_mmap is started. > I should also mention that I have an open-coded function that possibly > belongs in mmu_notifier.c. A user is allowed to have multiple GRU segments. > Each GRU has a couple of data structures linked to the VMA. All, however, > need to share the same notifier. I currently open code a function that > scans the notifier list to determine if a GRU notifier already exists. > If it does, I update a refcnt & use it. Otherwise, I register a new > one. All of this is protected by the mmap_sem. > > Just in case I mangled the above description, I'll attach a copy of the GRU mmuops > code. Well that function needs fixing w.r.t. srcu. Are you sure you want to search for mn->ops == gru_mmuops and not for mn == gmn? And if you search for mn why can't you keep track of the mn being registered or unregistered outside of the mmu_notifier layer? Set a bitflag in the container after mmu_notifier_register returns and a clear it after _unregister returns. I doubt saving one bitflag is worth searching the list and your approach make it obvious that you've to protect the bitflag and the register/unregister under write-mmap_sem yourself. Otherwise the find function will return an object that can be freed at any time if somebody calls unregister and kfree. (synchronize_srcu in mmu_notifier_unregister won't wait for anything but some outstanding srcu_read_lock) From kliteyn at dev.mellanox.co.il Tue May 6 07:49:19 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 06 May 2008 17:49:19 +0300 Subject: [ofa-general] [Fwd: Re: [PATCH] opensm/man: Adding QoS-related info to opensm man pages] Message-ID: <48206FEF.6020602@dev.mellanox.co.il> And this is Sasha's response. -- Yevgeny -------- Original Message -------- Subject: Re: [PATCH] opensm/man: Adding QoS-related info to opensm man pages Date: Mon, 31 Mar 2008 10:52:13 +0000 From: Sasha Khapyorsky To: Yevgeny Kliteynik CC: OpenIB References: <47E99D0C.7040403 at dev.mellanox.co.il> Hi Yevgeny, On 02:47 Wed 26 Mar , Yevgeny Kliteynik wrote: > > I've added QoS related info to opensm man pages: enhanced > existing part (that was talking about VL arbitration) I see that this part was fully rewritten. And IMO it is less clear now than originally was (some comments are below). Any reason to not start enhancements from existing text? 
> and
> added description of QoS manager in accordance with QoS annex.
>
> Please apply to ofed_1_3 and master.

Comments are below.

>
> Signed-off-by: Yevgeny Kliteynik
> ---
>  opensm/man/opensm.8.in |  501 +++++++++++++++++++++++++++++++++++++++++++-----
>  1 files changed, 457 insertions(+), 44 deletions(-)
>
> diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in
> index 5322ab7..1d9c5b7 100644
> --- a/opensm/man/opensm.8.in
> +++ b/opensm/man/opensm.8.in
> @@ -35,7 +35,8 @@ to initialize the InfiniBand hardware (at least one per each
>  InfiniBand subnet).
>
>  opensm also now contains an experimental version of a performance
> -manager as well.
> +manager and an experimental version of a QoS manager (in accordance
> +with the IBA QoS Annex).
>
>  opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB
>  fabric, initialize it, and sweep occasionally for changes.
> @@ -433,51 +434,463 @@ partition manager:
>
>  	Default=0x7fff,ipoib:ALL=full;
>
> -.SH QOS CONFIGURATION
> +.SH QUALITY OF SERVICE
>  .PP
> -There are a set of QoS related low-level configuration parameters.
> -All these parameter names are prefixed by "qos_" string. Here is a full
> -list of these parameters:
> -
> - qos_max_vls - The maximum number of VLs that will be on the subnet
> - qos_high_limit - The limit of High Priority component of VL
> -                  Arbitration table (IBA 7.6.9)
> - qos_vlarb_low - Low priority VL Arbitration table (IBA 7.6.9)
> -                 template
> - qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
> -                  template
> -                  Both VL arbitration templates are pairs of
> -                  VL and weight
> - qos_sl2vl - SL2VL Mapping table (IBA 7.6.6) template. It is
> -             a list of VLs corresponding to SLs 0-15 (Note
> -             that VL15 used here means drop this SL)
> -
> -Typical default values (hard-coded in OpenSM initialization) are:
> -
> - qos_max_vls=15
> - qos_high_limit=0
> - qos_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
> - qos_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
> - qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> -
> -The syntax is compatible with rest of OpenSM configuration options and
> -values may be stored in OpenSM config file (cached options file).
> -
> -In addition to the above, we may define separate QoS configuration
> -parameters sets for various target types. As targets, we currently support
> -CAs, routers, switch external ports, and switch's enhanced port 0. The
> -names of such specialized parameters are prefixed by "qos__"
> -string. Here is a full list of the currently supported sets:
> -
> - qos_ca_ - QoS configuration parameters set for CAs.
> - qos_rtr_ - parameters set for routers.
> - qos_sw0_ - parameters set for switches' port 0.
> - qos_swe_ - parameters set for switches' external ports.
> +OpenSM QoS support comprises two parts:
>
> -Examples:
> - qos_sw0_max_vls=2
> - qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
> - qos_swe_high_limit=0
> + 1. \fBQoS manager in accordance with IBA QoS Annex\fP (experimental)
> +.P
> + 2. \fBSL2VL and VL Arbitration tables configuration\fP
> +.P
> +.SS QoS Manager (experimental)
> +.PP
> +When Quality of Service in OpenSM is enabled (-Q or --qos), OpenSM looks
> +for the QoS Policy file. The default name of this file is
> +\fB\%@CONF_DIR@/@QOS_POLICY_FILE@\fP. The default may be changed by using
> +the -Y or --qos_policy_file option with OpenSM.
> +
> +During fabric initialization and at every heavy sweep OpenSM parses the
> +QoS policy file, applies its settings to the discovered fabric elements,
> +and enforces the provided policy on client requests. The overall flow for
> +such requests is as follows:
> + - The request is matched against the defined matching rules such that
> +   the QoS Level definition is found.
> + - Given the QoS Level, path(s) search is performed with the given
> +   restrictions imposed by that level.
> +
> +There are two ways to define QoS policy:
> + - \fBFull\fP: the full policy file syntax provides the administrator various
> +   ways to match a PathRecord/MultiPathRecord (PR/MPR) request, and to
> +   enforce various QoS constraints on the requested PR/MPR.
> + - \fBSimplified\fP: the simplified policy file syntax enables the
> +   administrator to match PR/MPR requests by various ULPs and applications
> +   running on top of these ULPs.
> +
> +While the full policy syntax is very flexible, in many cases the simplified
> +policy definition would be sufficient.
> +.PP
> +.B Full QoS Policy File
> +.PP
> +The QoS policy file has the following sections:
> +
> +.B I)
> +Port Groups (denoted by port-groups).
> +This section defines zero or more port groups that can be referred to later
> +by matching rules (see below). A port group lists ports by:
> + - Port GUID
> + - Port name, which is a combination of NodeDescription and IB port number
> + - PKey, which means that all the ports in the subnet that belong to a
> +   partition with a given PKey belong to this port group
> + - Partition name, which means that all the ports in the subnet that belong
> +   to a partition with a given name belong to this port group
> + - Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and
> +   SELF (SM's port).
> +
> +.B II)
> +QoS Setup (denoted by qos-setup).
> +This section describes how to set up SL2VL and VL Arbitration tables on
> +various nodes in the fabric.
> +However, this is not supported in OFED 1.3.

Here and below. I would prefer not to refer to OFED versions (OpenSM can
be used as part of OFED or independently, or OFED/OpenSM versions can be
mixed); something like "this version of OpenSM" looks more appropriate
to me.

> +SL2VL and VLArb tables should be configured in the OpenSM options file.
> +
> +.B III)
> +QoS Levels (denoted by qos-levels).
> +Each QoS Level defines Service Level (SL) and a few optional fields:
> + - MTU limit
> + - Rate limit
> + - PKey
> + - Packet lifetime
> +
> +When path(s) search is performed, it is done with regard to the restrictions
> +that these QoS Level parameters impose.
> +One QoS level that is mandatory to define is a DEFAULT QoS level. It is
> +applied to a PR/MPR query that does not match any existing match rule.
> +Similar to any other QoS Level, it can also be explicitly referred to by any
> +match rule.

Shouldn't this paragraph be placed after IV)? It refers to matching
rules, which are defined below. Or maybe even merged with IV?

> +
> +.B IV)
> +QoS Matching Rules (denoted by qos-match-rules).
> +Each PathRecord/MultiPathRecord query that OpenSM receives is matched against
> +the set of matching rules. Rules are scanned in order of appearance in the
> +QoS policy file such that the first match takes precedence.
> +Each rule has the name of the QoS level that will be applied to the matching
> +query.
> +A default QoS level is applied to a query that did not match any rule.
> +Queries can be matched by:
> + - Source port group (whether a source port is a member of a specified group)
> + - Destination port group (same as above, only for destination port)
> + - PKey
> + - QoS class
> + - Service ID
> +
> +To match a certain matching rule, a PR/MPR query has to match ALL of the
> +rule's criteria. However, not all the fields of the PR/MPR query have to
> +appear in the matching rule.
> +For instance, if the rule has a single criterion - Service ID, it will match
> +any query that has this Service ID, disregarding the rest of the query
> +fields. However, if a certain query has only Service ID (which means that
> +this is the only bit in the PR/MPR component mask that is on), it will not
> +match any rule that has other matching criteria besides Service ID.
> +.PP
> +.B Simplified QoS Policy Definition
> +.PP
> +Simplified QoS policy definition comprises a single section denoted by
> +qos-ulps. Similar to the full QoS policy, it has a list of match rules and
> +their QoS Level, but in this case a match rule has only one criterion - its
> +goal is to match a certain ULP (or a certain application on top of this ULP)
> +PR/MPR request, and the QoS Level has only one constraint - Service Level
> +(SL).
> +The simplified policy section may appear in the policy file in combination
> +with the full policy, or as a stand-alone policy definition.
> +See more details and the list of match rule criteria below.

What about merging this paragraph with the Simplified QoS policy
description below? Here it looks like duplication.

> +.PP
> +.B Policy File Syntax Guidelines
> +.PP
> +Leading and trailing blanks, as well as empty lines, are ignored, so the
> +indentation in the example is just for better readability.
> +Comments are started with the pound sign (#) and terminated by EOL.
> +Any keyword should be the first non-blank in the line, unless it's a comment.
> +Keywords that denote section/subsection start have matching closing keywords.
> +Having a QoS Level named "DEFAULT" is a must - it is applied to PR/MPR
> +requests that didn't match any of the matching rules.
> +Any section/subsection of the policy file is optional.

And should this paragraph be moved above the 'Full QoS Policy File'
section?

> +
> +.PP
> +.B Examples of Full Policy File
> +.PP
> +As mentioned earlier, any section of the policy file is optional, and
> +the only mandatory part of the policy file is a default QoS Level.
> +Here's an example of the shortest policy file:
> +
> +    qos-levels
> +        qos-level
> +            name: DEFAULT
> +            sl: 0
> +        end-qos-level
> +    end-qos-levels
> +
> +The port groups section is missing because there are no match rules, which
> +means that port groups are not referred to anywhere, and there is no need
> +to define them. And since this policy file doesn't have any matching rules,
> +a PR/MPR query won't match any rule, and OpenSM will enforce the default
> +QoS level.
> +Essentially, the above example is equivalent to not having a QoS policy
> +file at all.
> +
> +The following example shows all the possible options and keywords in the
> +policy file and their syntax:
> +
> +    #
> +    # See the comments in the following example.
> +    # They explain different keywords and their meaning.
> +    #
> +    port-groups
> +        port-group
> +            name: Storage
> +            # "use" is just a description that is used for logging
> +            # Other than that, it is just a comment
> +            use: SRP Targets
> +            port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA
> +            port-guid: 0x1000000000FFFF
> +        end-port-group
> +
> +        port-group
> +            name: Virtual Servers
> +            # The syntax of the port name is as follows:
> +            #   "node_description/Pnum".
> +            # node_description is compared to the NodeDescription of the node,
> +            # and "Pnum" is a port number on that node.
> +            port-name: vs1 HCA-1/P1, vs2 HCA-1/P1
> +        end-port-group
> +
> +        # using partitions defined in the partition policy
> +        port-group
> +            name: Partitions
> +            partition: Part1
> +            pkey: 0x1234
> +        end-port-group
> +
> +        # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM)
> +        # or ALL (for all the nodes in the subnet)
> +        port-group
> +            name: CAs and SM
> +            node-type: CA, SELF
> +        end-port-group
> +
> +    end-port-groups
> +
> +    qos-setup
> +        # This section of the policy file describes how to set up SL2VL and VL
> +        # Arbitration tables on various nodes in the fabric.
> +        # However, this is not supported in OFED 1.3 - the section is parsed
> +        # and ignored. SL2VL and VLArb tables should be configured in the
> +        # OpenSM options file (by default - /var/cache/opensm/opensm.opts).
> +    end-qos-setup
> +
> +    qos-levels
> +
> +        # Having a QoS Level named "DEFAULT" is a must - it is applied to
> +        # PR/MPR requests that didn't match any of the matching rules.
> +        qos-level
> +            name: DEFAULT
> +            use: default QoS Level
> +            sl: 0
> +        end-qos-level
> +
> +        # the whole set: SL, MTU-Limit, Rate-Limit, PKey, Packet Lifetime
> +        qos-level
> +            name: WholeSet
> +            sl: 1
> +            mtu-limit: 4
> +            rate-limit: 5
> +            pkey: 0x1234
> +            packet-life: 8
> +        end-qos-level
> +
> +    end-qos-levels
> +
> +    # Match rules are scanned in order of their appearance in the policy file.
> +    # First matched rule takes precedence.
> +    qos-match-rules
> +
> +        # matching by single criteria: QoS class
> +        qos-match-rule
> +            use: by QoS class
> +            qos-class: 7-9,11
> +            # Name of qos-level to apply to the matching PR/MPR
> +            qos-level-name: WholeSet
> +        end-qos-match-rule
> +
> +        # show matching by destination group and service id
> +        qos-match-rule
> +            use: Storage targets
> +            destination: Storage
> +            service-id: 0x10000000000001, 0x10000000000008-0x10000000000FFF
> +            qos-level-name: WholeSet
> +        end-qos-match-rule
> +
> +        qos-match-rule
> +            source: Storage
> +            use: match by source group only
> +            qos-level-name: DEFAULT
> +        end-qos-match-rule
> +
> +        qos-match-rule
> +            use: match by all parameters
> +            qos-class: 7-9,11
> +            source: Virtual Servers
> +            destination: Storage
> +            service-id: 0x0000000000010000-0x000000000001FFFF
> +            pkey: 0x0F00-0x0FFF
> +            qos-level-name: WholeSet
> +        end-qos-match-rule
> +
> +    end-qos-match-rules
> +
> +.PP
> +.B Simplified QoS Policy - Details and Examples
> +.PP
> +Simplified QoS policy match rules are tailored for matching PR/MPR
> +requests of ULPs (or of a certain application on top of a ULP). It has a
> +list of per-ULP (or per-application) match rules and the SL that should
> +be enforced on the matched PR/MPR query.
> +
> +Match rules include:
> + - Default match rule that is applied to a PR/MPR query that didn't
> +   match any of the other match rules
> + - SDP
> + - SDP application with a specific target TCP/IP port range
> + - SRP with a specific target IB port GUID
> + - RDS
> + - iSER
> + - iSER application with a specific target TCP/IP port range
> + - IPoIB with a default PKey
> + - IPoIB with a specific PKey
> + - any ULP/application with a specific Service ID in the PR/MPR query
> + - any ULP/application with a specific PKey in the PR/MPR query
> + - any ULP/application with a specific target IB port GUID in the PR/MPR query

Are there duplicated entries (SDP, iSER, IPoIB)?

> +
> +Since any section of the policy file is optional, as long as the basic rules
> +of the file are kept (such as not referring to a nonexistent port group,
> +having a default QoS Level, etc.), the simplified policy section (qos-ulps)
> +can serve as a complete QoS policy file.
> +The shortest policy file in this case would be as follows:
> +
> +    qos-ulps
> +        default : 0 #default SL
> +    end-qos-ulps
> +
> +It is equivalent to not having a policy file at all.
> +
> +Below is an example of simplified QoS policy with all the possible keywords:
> +
> +    qos-ulps
> +        default                      : 0 # default SL
> +        sdp, port-num 30000          : 0 # SL for application running on top
> +                                         # of SDP when a destination
> +                                         # TCP/IP port is 30000
> +        sdp, port-num 10000-20000    : 0
> +        sdp                          : 1 # default SL for any other
> +                                         # application running on top of SDP
> +        rds                          : 2 # SL for RDS traffic
> +        iser, port-num 900           : 0 # SL for iSER with a specific target
> +                                         # port
> +        iser                         : 3 # default SL for iSER
> +        ipoib, pkey 0x0001           : 0 # SL for IPoIB on partition with
> +                                         # pkey 0x0001
> +        ipoib                        : 4 # default IPoIB partition,
> +                                         # pkey=0x7FFF
> +        any, service-id 0x6234       : 6 # match any PR/MPR query with a
> +                                         # specific Service ID
> +        any, pkey 0x0ABC             : 6 # match any PR/MPR query with a
> +                                         # specific PKey
> +        srp, target-port-guid 0x1234 : 5 # SRP when SRP Target is located on
> +                                         # a specified IB port GUID
> +        any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query
> +                                         # with a specific target port GUID
> +    end-qos-ulps
> +
> +
> +Similar to the full policy definition, matching of PR/MPR queries is done in
> +order of appearance in the QoS policy file such that the first match takes
> +precedence, except for the "default" rule, which is applied only if the query
> +didn't match any other rule.
> +
> +All other sections of the QoS policy file take precedence over the qos-ulps
> +section. That is, if a policy file has both qos-match-rules and qos-ulps
> +sections, then any query is matched first against the rules in the
> +qos-match-rules section, and only if there was no match, the query is matched
> +against the rules in the qos-ulps section.
> +
> +Note that some of these match rules may overlap, so in order to use the
> +simplified QoS definition effectively, it is important to understand how each
> +of the ULPs is matched:
> +
> +.B IPoIB:
> +A PR query is matched by PKey. The default PKey for an IPoIB partition is
> +0x7fff, so the following three match rules are equivalent:
> +
> +    ipoib              :
> +    ipoib, pkey 0x7fff :
> +    any,   pkey 0x7fff :
> +
> +.I Note
> +: For OFED 1.3, IPoIB partition SL configuration should be done through
> +the partition configuration file only.
> +
> +\fBSDP\fP: A PR query is matched by Service ID. The Service-ID for SDP is
> +0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP
> +Port Number to connect to.
> +The following two match rules are equivalent:
> +
> +    sdp :
> +    any, service-id 0x0000000000010000-0x000000000001ffff :
> +
> +\fBRDS\fP: Similar to SDP, an RDS PR query is matched by Service ID. The
> +Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hex digits
> +holding the remote TCP/IP Port Number to connect to. The default port number
> +for RDS is 0x48CA, which makes the default Service-ID 0x00000000010648CA.
> +The following two match rules are equivalent:
> +
> +    rds :
> +    any, service-id 0x00000000010648CA :
> +
> +\fBiSER\fP: Similar to RDS, an iSER query is matched by Service ID, where the
> +Service ID is also 0x000000000106PPPP. The default port number for iSER is
> +0x035C, which makes the default Service-ID 0x000000000106035C.
> +The following two match rules are equivalent:
> +
> +    iser :
> +    any, service-id 0x000000000106035C :
> +
> +\fBSRP\fP: The Service ID for SRP varies from storage vendor to vendor, thus
> +an SRP query is matched by the target IB port GUID. The following two match
> +rules are equivalent:
> +
> +    srp, target-port-guid 0x1234 :
> +    any, target-port-guid 0x1234 :
> +
> +Note that any of the above ULPs might contain a target port GUID in the PR
> +query, so in order for these queries not to be recognized by the QoS manager
> +as SRP, the SRP match rule (or any match rule that refers to the target port
> +guid only) should be placed at the end of the qos-ulps match rules.
> +
> +\fBMPI\fP: SL for MPI is manually configured by the MPI admin. OpenSM is not
> +forcing any SL on the MPI traffic, and that's why it is the only ULP that
> +does not appear in the qos-ulps section.
> +
> +
> +.SS SL2VL Mapping and VL Arbitration
> +.PP
> +
> +The OpenSM cached options file has a set of QoS-related configuration
> +parameters that are used to configure SL2VL mapping and VL arbitration
> +on IB ports. These parameters are:
> + - Max VLs: the maximum number of VLs that will be on the subnet.
> + - High limit: the limit of the High Priority component of the VL Arbitration
> +   table (IBA 7.6.9).
> + - VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template.
> + - VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template.
> + - SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs
> +   corresponding to SLs 0-15 (Note that VL15 used here means drop this SL).

Why were the configuration keywords removed here? OpenSM doesn't know
what "Max VLs" is, but it knows "max_vls".

> +There are separate QoS configuration parameter sets for various target
> +types:

This is optional...

> CAs, routers, switch external ports, and switch's enhanced port 0.
> +The names of such parameters are prefixed by the "qos__" string.
> +Here is a full list of the currently supported sets:
> +
> + qos_ca_  - QoS configuration parameters set for CAs.
> + qos_rtr_ - parameters set for routers.
> + qos_sw0_ - parameters set for switches' port 0.
> + qos_swe_ - parameters set for switches' external ports.
> +
> +Here's an example of typical default values for all the ports in the
> +subnet (hard-coded in OpenSM initialization):
> +
> +    qos_max_vls=15
> +    qos_high_limit=0
> +    qos_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
> +    qos_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
> +    qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> +
> +
> +VL arbitration tables (both high and low) are lists of VL/Weight pairs.
> +Each list entry contains a VL number (values from 0-14), and a weighting value
> +(values 0-255), indicating the number of 64 byte units (credits) which may be
> +transmitted from that VL when its turn in the arbitration occurs. A weight
> +of 0 indicates that this entry should be skipped. If a list entry is
> +programmed for VL15 or for a VL that is not supported or is not currently
> +configured by the port, the port may either skip that entry or send from any
> +supported VL for that entry.
> +
> +Note that the same VL may be listed multiple times in the High or Low
> +priority arbitration tables, and, further, it can be listed in both tables.

Here and below. Do we need to rewrite the IBA spec in the man pages?
Those parameters are not invented by the OpenSM implementation; they
are port parameters as defined in the IBA spec.

> +
> +The limit of the high-priority VLArb table (qos__high_limit) indicates
> +the number of high-priority packets that can be transmitted without an
> +opportunity to send a low-priority packet. Specifically, the number of bytes
> +that can be sent is high_limit times 4K bytes.
> +
> +A high_limit value of 255 indicates that the byte limit is unbounded.
> +Note: if the 255 value is used, the low priority VLs may be starved.
> +A value of 0 indicates that only a single packet from the high-priority table
> +may be sent before an opportunity is given to the low-priority table.
> +
> +Keep in mind that ports usually transmit packets of size equal to MTU.
> +For instance, for 4KB MTU a single packet will require 64 credits, so in order
> +to achieve effective VL arbitration for packets of 4KB MTU, the weighting
> +values for each VL should be multiples of 64.
> +
> +Below is an example of SL2VL and VL Arbitration configuration on a subnet:
> +
> +    qos_max_vls=15
> +    qos_high_limit=6
> +    qos_vlarb_high=0:4
> +    qos_vlarb_low=0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
> +    qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> +
> +In this example, there are 8 VLs configured on the subnet: VL0 to VL7. VL0 is
> +defined as a high priority VL, and it is limited to 6 x 4KB = 24KB in a single
> +transmission burst. Such a configuration would suit a VL that needs low
> +latency and uses small MTU when transmitting packets. The rest of the VLs are
> +defined as low priority VLs with different weights, while VL4 is effectively
> +turned off.
> 
> .SH PREFIX ROUTES
> .PP

And finally, due to the huge size of the QoS description, wouldn't it
be useful to move it below the other (shorter) sections of the man page?

Sasha

From hrosenstock at xsigo.com  Tue May  6 07:53:08 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Tue, 06 May 2008 07:53:08 -0700
Subject: [ofa-general] Re: [Fwd: Re: [PATCH] opensm/man: Adding QoS-related
	info to opensm man pages]
In-Reply-To: <48206FEF.6020602@dev.mellanox.co.il>
References: <48206FEF.6020602@dev.mellanox.co.il>
Message-ID: <1210085588.2026.63.camel@hrosenstock-ws.xsigo.com>

On Tue, 2008-05-06 at 17:49 +0300, Yevgeny Kliteynik wrote:
> And this is Sasha's response.

Thanks. Looks like there's some work to do here to address the
comments. Also, a number of those comments apply to the doc just
submitted as a patch.

-- Hal

> -- Yevgeny

From tziporet at mellanox.co.il  Tue May  6 08:44:51 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 6 May 2008 18:44:51 +0300
Subject: [ofa-general] OFED May 5 meeting summary
Message-ID: <6C2C79E72C305246B504CBA17B5500C90282E694@mtlexch01.mtl.com>

May 5 OFED meeting summary:
===========================
1. OFED 1.3.1:

   1.1 Status of changes:
       IB-bonding - on work
       SRP failover - done (needs more testing)
       SDP crashes - on work (not clear if we will have something on time)
       RDS fixes for RDMA API - done
       librdmacm 1.0.7 - done
       uDAPL updates - done
       Open MPI 1.2.6 - done
       MVAPICH 1.0.1 - done
       MVAPICH2 1.0.3 - done
       IPoIB - 2 bugs fixed; there are still two issues that should be
               resolved
       Low level drivers - changes already committed: nes, mlx4, cxgb3,
               ehca

   1.2 Schedule:
       rc1 - was released today
       rc2 - May 20
       GA  - May 29

   1.3 Discussion:
       - The ipath driver is going to be updated
       - There is an issue with bonding and Ethernet drivers on RHEL4 -
         under debug
       - We wish to add support for SLES10 SP2; we already got approval
         from Novell. Any volunteer to provide the new backport patches?

2. OFED 1.4: Updated that the new tree will be ready next week - based
   on 2.6.26-rc

3. Update on the OpenSuSE build system - Yiftah updated on the work that
   has been done and on open problems:
   - The system requires clean RPMs only (no use of install scripts) -
     they are working to resolve this
   - We target this system toward releases (and not to replace the daily
     build system)
   - We may try it now with OFED 1.3.1

Tziporet

From swise at opengridcomputing.com  Tue May  6 10:02:30 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 06 May 2008 12:02:30 -0500
Subject: [ofa-general] [PATCH] Request For Comments:
Message-ID: <20080506170230.11409.43625.stgit@dell3.ogc.int>

From: Steve Wise

Here is the top level API change I'm proposing for enabling interoperable
peer2peer mode for iwarp. I want to get agreement on how to expose
this to the application before posting more of the gritty details of
the kernel driver changes needed. The plan is to include this support
in linux-2.6.27 + ofed-1.4.

Does this require an ABI bump?

Note: We could do this several ways. I'm proposing one with this
uncompiled patch. The downside of my proposal is that applications have
to change to turn this on. However, I'm not sure that's too painful.
We would have OMPI turn it on, and maybe even uDAPL so that all uDAPL
ULPs would get it (IMPI, dapltest, HPMPI).

Alternative designs:

- always do peer2peer and don't let the app choose. This forces
the overhead of p2p mode on all apps, but preserves the API.

- use an environment variable that librdmacm will query. This doesn't
force p2p, and has the benefit of not changing the API. But at the
expense of adding environment variables to the rdma-cm model. This is
used extensively in MPIs and even DAPL. I think it's an alternative
we should consider. This approach, however, doesn't help kernel
applications.

Steve.

-----

Peer2peer support in librdmacm.

User applications can set a new u8 boolean named peer2peer_mode in the
rdma_conn_param struct to indicate if they require peer2peer mode
support. This means they don't enforce the "client must send first"
iwarp requirement in their own application logic. If they set
peer2peer_mode to 1, then the iwarp CM and drivers will handle this
requirement. Applications that don't require this should set
peer2peer_mode to 0 to reduce the message exchange done at iwarp
connection setup.
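For illustration, here is a minimal, uncompiled sketch of how an
active-side client might enable the proposed mode. The rdma_cm_id setup
(rdma_resolve_addr()/rdma_resolve_route() and QP creation) is assumed to
have happened already; peer2peer_mode is only the field proposed by this
patch, not an existing librdmacm API:

	/* Sketch: request peer2peer mode on rdma_connect().
	 * "id" is a struct rdma_cm_id * that has already resolved its
	 * address and route and has a QP attached.
	 */
	struct rdma_conn_param conn_param;

	memset(&conn_param, 0, sizeof conn_param);
	conn_param.initiator_depth = 1;
	conn_param.responder_resources = 1;
	conn_param.retry_count = 7;
	conn_param.peer2peer_mode = 1;	/* proposed: CM/driver handle the RTR */

	if (rdma_connect(id, &conn_param))
		perror("rdma_connect");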
Signed-off-by: Steve Wise
---

 include/rdma/rdma_cma.h     |    1 +
 include/rdma/rdma_cma_abi.h |    1 +
 src/cma.c                   |    2 ++
 3 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/rdma/rdma_cma.h b/include/rdma/rdma_cma.h
index 76df90f..943aa45 100644
--- a/include/rdma/rdma_cma.h
+++ b/include/rdma/rdma_cma.h
@@ -118,6 +118,7 @@ struct rdma_conn_param {
 	uint8_t flow_control;
 	uint8_t retry_count;		/* ignored when accepting */
 	uint8_t rnr_retry_count;
+	uint8_t peer2peer_mode;
 	/* Fields below ignored if a QP is created on the rdma_cm_id. */
 	uint8_t srq;
 	uint32_t qp_num;
diff --git a/include/rdma/rdma_cma_abi.h b/include/rdma/rdma_cma_abi.h
index 1a3a9c2..5914aaa 100644
--- a/include/rdma/rdma_cma_abi.h
+++ b/include/rdma/rdma_cma_abi.h
@@ -140,6 +140,7 @@ struct ucma_abi_conn_param {
 	__u8 retry_count;
 	__u8 rnr_retry_count;
 	__u8 valid;
+	__u8 peer2peer_mode;
 };
 
 struct ucma_abi_ud_param {
diff --git a/src/cma.c b/src/cma.c
index fc98c8f..dbbb2e8 100644
--- a/src/cma.c
+++ b/src/cma.c
@@ -844,6 +844,7 @@ static void ucma_copy_conn_param_to_kern(struct ucma_abi_conn_param *dst,
 	dst->retry_count = src->retry_count;
 	dst->rnr_retry_count = src->rnr_retry_count;
 	dst->valid = 1;
+	dst->peer2peer_mode = src->peer2peer_mode;
 
 	if (src->private_data && src->private_data_len) {
 		memcpy(dst->private_data, src->private_data,
@@ -1261,6 +1262,7 @@ static void ucma_copy_conn_event(struct cma_event *event,
 	dst->rnr_retry_count = src->rnr_retry_count;
 	dst->srq = src->srq;
 	dst->qp_num = src->qp_num;
+	dst->peer2peer_mode = src->peer2peer_mode;
 }
 
 static void ucma_copy_ud_event(struct cma_event *event,

From sean.hefty at intel.com  Tue May  6 10:31:37 2008
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 6 May 2008 10:31:37 -0700
Subject: [ofa-general] RE: [PATCH] Request For Comments:
In-Reply-To: <20080506170230.11409.43625.stgit@dell3.ogc.int>
References: <20080506170230.11409.43625.stgit@dell3.ogc.int>
Message-ID: <000b01c8af9f$0a3b79f0$bafc070a@amr.corp.intel.com>

Thanks for looking at this.

>Here is the top level API change I'm proposing for enabling interoperable
>peer2peer mode for iwarp. I want to get agreement on how to expose
>this to the application before posting more of the gritty details of
>the kernel driver changes needed. The plan is to include this support
>in linux-2.6.27 + ofed-1.4.

I don't have a better idea what to call this, but when I think of peer
to peer, I think of that as the connection model, not a channel usage
restriction.

>Does this require an ABI bump?

I'd like to avoid breaking the ABI or userspace API if possible.

>Note: We could do this several ways. I'm proposing one with this
>uncompiled patch. The downside of my proposal is that applications have
>to change to turn this on. However, I'm not sure that's too painful.
>We would have OMPI turn it on, and maybe even uDAPL so that all uDAPL
>ULPs would get it (IMPI, dapltest, HPMPI).

We could use rdma_set_option() for this. If we do go the route of
changing the rdma_conn_param, adding generic flags or options would be
more extensible.

>- always do peer2peer and don't let the app choose. This forces
>the overhead of p2p mode on all apps, but preserves the API.

If we use rdma_set_option, I guess we could always enable it by
default, and let apps disable it. I'm unsure if the better default is
avoiding the overhead or making the API easier to use, but I'm leaning
toward the latter in this case.

>- use an environment variable that librdmacm will query. This doesn't
>force p2p, and has the benefit of not changing the API. But at the
>expense of adding environment variables to the rdma-cm model. This is
>used extensively in MPIs and even DAPL. I think it's an alternative
>we should consider. This approach, however, doesn't help kernel
>applications.

I'm not thrilled with this idea, although I'm fine with the kernel
solution being different from the userspace one.

- Sean

From rdreier at cisco.com  Tue May  6 10:47:47 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 06 May 2008 10:47:47 -0700
Subject: [ofa-general] Re: [PATCH] Request For Comments:
In-Reply-To: <20080506170230.11409.43625.stgit@dell3.ogc.int> (Steve
	Wise's message of "Tue, 06 May 2008 12:02:30 -0500")
References: <20080506170230.11409.43625.stgit@dell3.ogc.int>
Message-ID: 

 > - always do peer2peer and don't let the app choose. This forces
 > the overhead of p2p mode on all apps, but preserves the API.

How bad is the overhead?

 - R.

From andrea at qumranet.com  Tue May  6 10:53:57 2008
From: andrea at qumranet.com (Andrea Arcangeli)
Date: Tue, 6 May 2008 19:53:57 +0200
Subject: [ofa-general] mmu notifier v15 -> v16 diff
In-Reply-To: <20080505194625.GA17734@sgi.com>
References: <1489529e7b53d3f2dab8.1209740704@duo.random>
	<20080505162113.GA18761@sgi.com> <20080505171434.GF8470@duo.random>
	<20080505172506.GA9247@sgi.com> <20080505183405.GI8470@duo.random>
	<20080505194625.GA17734@sgi.com>
Message-ID: <20080506175357.GB12593@duo.random>

Hello everyone,

This is to allow GRU code to call __mmu_notifier_register inside the
mmap_sem (write mode is required as documented in the patch).

It also removes the requirement to implement ->release, as it's not
guaranteed all users will really need it.

I didn't integrate the search function, as we can sort that out after
2.6.26 is out and it wasn't entirely obvious it's really needed, as
the driver should be able to track if a mmu notifier is registered in
the container.

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -29,10 +29,25 @@ struct mmu_notifier_ops {
 	/*
 	 * Called either by mmu_notifier_unregister or when the mm is
 	 * being destroyed by exit_mmap, always before all pages are
-	 * freed. It's mandatory to implement this method. This can
-	 * run concurrently with other mmu notifier methods and it
+	 * freed. This can run concurrently with other mmu notifier
+	 * methods (the ones invoked outside the mm context) and it
 	 * should tear down all secondary mmu mappings and freeze the
-	 * secondary mmu.
+	 * secondary mmu. If this method isn't implemented you've to
+	 * be sure that nothing could possibly write to the pages
+	 * through the secondary mmu by the time the last thread with
+	 * tsk->mm == mm exits.
+	 *
+	 * As side note: the pages freed after ->release returns could
+	 * be immediately reallocated by the gart at an alias physical
+	 * address with a different cache model, so if ->release isn't
+	 * implemented because all _software_ driven memory accesses
+	 * through the secondary mmu are terminated by the time the
+	 * last thread of this mm quits, you've also to be sure that
+	 * speculative _hardware_ operations can't allocate dirty
+	 * cachelines in the cpu that could not be snooped and made
+	 * coherent with the other read and write operations happening
+	 * through the gart alias address, so leading to memory
+	 * corruption.
*/ void (*release)(struct mmu_notifier *mn, struct mm_struct *mm); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2340,13 +2340,20 @@ static inline void __mm_unlock(spinlock_ /* * This operation locks against the VM for all pte/vma/mm related * operations that could ever happen on a certain mm. This includes - * vmtruncate, try_to_unmap, and all page faults. The holder - * must not hold any mm related lock. A single task can't take more - * than one mm_lock in a row or it would deadlock. + * vmtruncate, try_to_unmap, and all page faults. * - * The mmap_sem must be taken in write mode to block all operations - * that could modify pagetables and free pages without altering the - * vma layout (for example populate_range() with nonlinear vmas). + * The caller must take the mmap_sem in read or write mode before + * calling mm_lock(). The caller isn't allowed to release the mmap_sem + * until mm_unlock() returns. + * + * While mm_lock() itself won't strictly require the mmap_sem in write + * mode to be safe, in order to block all operations that could modify + * pagetables and free pages without need of altering the vma layout + * (for example populate_range() with nonlinear vmas) the mmap_sem + * must be taken in write mode by the caller. + * + * A single task can't take more than one mm_lock in a row or it would + * deadlock. * * The sorting is needed to avoid lock inversion deadlocks if two * tasks run mm_lock at the same time on different mm that happen to @@ -2377,17 +2384,13 @@ int mm_lock(struct mm_struct *mm, struct { spinlock_t **anon_vma_locks, **i_mmap_locks; - down_write(&mm->mmap_sem); if (mm->map_count) { anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); - if (unlikely(!anon_vma_locks)) { - up_write(&mm->mmap_sem); + if (unlikely(!anon_vma_locks)) return -ENOMEM; - } i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); if (unlikely(!i_mmap_locks)) { - up_write(&mm->mmap_sem); vfree(anon_vma_locks); return -ENOMEM; } @@ -2426,10 +2429,12 @@ static void mm_unlock_vfree(spinlock_t * /* * mm_unlock doesn't require any memory allocation and it won't fail. * + * The mmap_sem cannot be released until mm_unlock returns. + * * All memory has been previously allocated by mm_lock and it'll be * all freed before returning. Only after mm_unlock returns, the * caller is allowed to free and forget the mm_lock_data structure. - * + * * mm_unlock runs in O(N) where N is the max number of VMAs in the * mm. The max number of vmas is defined in * /proc/sys/vm/max_map_count. @@ -2444,5 +2449,4 @@ void mm_unlock(struct mm_struct *mm, str mm_unlock_vfree(data->i_mmap_locks, data->nr_i_mmap_locks); } - up_write(&mm->mmap_sem); } diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -59,7 +59,8 @@ void __mmu_notifier_release(struct mm_st * from establishing any more sptes before all the * pages in the mm are freed. */ - mn->ops->release(mn, mm); + if (mn->ops->release) + mn->ops->release(mn, mm); srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); spin_lock(&mm->mmu_notifier_mm->lock); } @@ -144,20 +145,9 @@ void __mmu_notifier_invalidate_range_end srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); } -/* - * Must not hold mmap_sem nor any other VM related lock when calling - * this registration function. Must also ensure mm_users can't go down - * to zero while this runs to avoid races with mmu_notifier_release, - * so mm has to be current->mm or the mm should be pinned safely such - * as with get_task_mm(). 
If the mm is not current->mm, the mm_users - * pin should be released by calling mmput after mmu_notifier_register - * returns. mmu_notifier_unregister must be always called to - * unregister the notifier. mm_count is automatically pinned to allow - * mmu_notifier_unregister to safely run at any time later, before or - * after exit_mmap. ->release will always be called before exit_mmap - * frees the pages. - */ -int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +static int do_mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm, + int take_mmap_sem) { struct mm_lock_data data; struct mmu_notifier_mm * mmu_notifier_mm; @@ -174,6 +164,8 @@ int mmu_notifier_register(struct mmu_not if (unlikely(ret)) goto out_kfree; + if (take_mmap_sem) + down_write(&mm->mmap_sem); ret = mm_lock(mm, &data); if (unlikely(ret)) goto out_cleanup; @@ -200,6 +192,8 @@ int mmu_notifier_register(struct mmu_not mm_unlock(mm, &data); out_cleanup: + if (take_mmap_sem) + up_write(&mm->mmap_sem); if (mmu_notifier_mm) cleanup_srcu_struct(&mmu_notifier_mm->srcu); out_kfree: @@ -209,7 +203,35 @@ out: BUG_ON(atomic_read(&mm->mm_users) <= 0); return ret; } + +/* + * Must not hold mmap_sem nor any other VM related lock when calling + * this registration function. Must also ensure mm_users can't go down + * to zero while this runs to avoid races with mmu_notifier_release, + * so mm has to be current->mm or the mm should be pinned safely such + * as with get_task_mm(). If the mm is not current->mm, the mm_users + * pin should be released by calling mmput after mmu_notifier_register + * returns. mmu_notifier_unregister must be always called to + * unregister the notifier. mm_count is automatically pinned to allow + * mmu_notifier_unregister to safely run at any time later, before or + * after exit_mmap. ->release will always be called before exit_mmap + * frees the pages. + */ +int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + return do_mmu_notifier_register(mn, mm, 1); +} EXPORT_SYMBOL_GPL(mmu_notifier_register); + +/* + * Same as mmu_notifier_register but here the caller must hold the + * mmap_sem in write mode. + */ +int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + return do_mmu_notifier_register(mn, mm, 0); +} +EXPORT_SYMBOL_GPL(__mmu_notifier_register); /* this is called after the last mmu_notifier_unregister() returned */ void __mmu_notifier_mm_destroy(struct mm_struct *mm) @@ -251,7 +273,8 @@ void mmu_notifier_unregister(struct mmu_ * guarantee ->release is called before freeing the * pages. 
*/ - mn->ops->release(mn, mm); + if (mn->ops->release) + mn->ops->release(mn, mm); srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); } else spin_unlock(&mm->mmu_notifier_mm->lock); diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -148,6 +148,8 @@ static inline int mm_has_notifiers(struc extern int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm); +extern int __mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); extern void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm); extern void __mmu_notifier_mm_destroy(struct mm_struct *mm); From swise at opengridcomputing.com Tue May 6 11:32:07 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 06 May 2008 13:32:07 -0500 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: References: <20080506170230.11409.43625.stgit@dell3.ogc.int> Message-ID: <4820A427.1070405@opengridcomputing.com> Roland Dreier wrote: > > - always do peer2peer and don't let the app choose. This forces > > the overhead of p2p mode on all apps, but preserves the API. > > How bad is the overhead? > > - R. > The client side must send a "Ready To Receive" message. This will be negotiated via the MPA exchange and the resulting RTR message may be a 0B read + read response, 0B write, or a 0B send. For chelsio, the 0B write couldn't be used, and the 0B read was the least impact on the driver code, so we used that. For nes, they currently use a 0B write. Also, there are some "caveats" if you turn this on: 1) private data is used to negotiate the type of RTR message and if its needed. This is more of a global module option I think, since it will break interoperability with iwarp. Prolly will bump the MPA version number if this option is on too. 2) if the RTR message fails, it can generate a CQE that is unexpected. 3) if using SEND, then a recv completion is always generated. Steve. From ralph.campbell at qlogic.com Tue May 6 11:36:15 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 06 May 2008 11:36:15 -0700 Subject: [ofa-general] [PATCH 0/7] IB/ipath -- fixes for 2.6.26 Message-ID: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> The following patches fix a number of bugs for the QLogic DDR HCA. IB/ipath -- only warn about prototype chip during init IB/ipath - only increment SSN if WQE is put on send queue IB/ipath - fix bug that can leave sends disabled after freeze recovery IB/ipath - Return the correct opcode for RDMA WRITE with immediate IB/ipath -- fix count of packets received by kernel IB/ipath - need to always request and handle PIO avail interrupts IB/ipath - fix SDMA error recovery in absence of link status change These can also be pulled into Roland's infiniband.git for-2.6.26 repo using: git pull git://git.qlogic.com/ipath-linux-2.6 for-roland From ralph.campbell at qlogic.com Tue May 6 11:36:21 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 06 May 2008 11:36:21 -0700 Subject: [ofa-general] [PATCH 1/7] IB/ipath -- only warn about prototype chip during init In-Reply-To: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> References: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> Message-ID: <20080506183620.6521.62416.stgit@eng-46.mv.qlogic.com> From: Michael Albaugh We warn about prototype chips, but the function that checks for support is also called as a result of a get_portinfo request, which can clutter the logs. 
Restrict warning to only appear during initialization. Signed-off-by: Michael Albaugh --- drivers/infiniband/hw/ipath/ipath_iba7220.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_iba7220.c b/drivers/infiniband/hw/ipath/ipath_iba7220.c index e3ec0d1..5f693de 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba7220.c +++ b/drivers/infiniband/hw/ipath/ipath_iba7220.c @@ -870,8 +870,9 @@ static int ipath_7220_boardname(struct ipath_devdata *dd, char *name, "revision %u.%u!\n", dd->ipath_majrev, dd->ipath_minrev); ret = 1; - } else if (dd->ipath_minrev == 1) { - /* Rev1 chips are prototype. Complain, but allow use */ + } else if (dd->ipath_minrev == 1 && + !(dd->ipath_flags & IPATH_INITTED)) { + /* Rev1 chips are prototype. Complain at init, but allow use */ ipath_dev_err(dd, "Unsupported hardware " "revision %u.%u, Contact support at qlogic.com\n", dd->ipath_majrev, dd->ipath_minrev); From ralph.campbell at qlogic.com Tue May 6 11:36:26 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 06 May 2008 11:36:26 -0700 Subject: [ofa-general] [PATCH 2/7] IB/ipath - only increment SSN if WQE is put on send queue In-Reply-To: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> References: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> Message-ID: <20080506183626.6521.97442.stgit@eng-46.mv.qlogic.com> If a send work request has immediate errors and is not put on the send queue, we shouldn't update any of the QP state. The increment of the SSN wasn't obeying this. Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_verbs.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index e63927c..5015cd2 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -396,7 +396,6 @@ static int ipath_post_one_send(struct ipath_qp *qp, struct ib_send_wr *wr) wqe = get_swqe_ptr(qp, qp->s_head); wqe->wr = *wr; - wqe->ssn = qp->s_ssn++; wqe->length = 0; if (wr->num_sge) { acc = wr->opcode >= IB_WR_RDMA_READ ? @@ -422,6 +421,7 @@ static int ipath_post_one_send(struct ipath_qp *qp, struct ib_send_wr *wr) goto bail_inval; } else if (wqe->length > to_idev(qp->ibqp.device)->dd->ipath_ibmtu) goto bail_inval; + wqe->ssn = qp->s_ssn++; qp->s_head = next; ret = 0; From ralph.campbell at qlogic.com Tue May 6 11:36:31 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 06 May 2008 11:36:31 -0700 Subject: [ofa-general] [PATCH 3/7] IB/ipath - fix bug that can leave sends disabled after freeze recovery In-Reply-To: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> References: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> Message-ID: <20080506183631.6521.82997.stgit@eng-46.mv.qlogic.com> From: Dave Olson The semantics of cancel_sends changed, but the code using it was missed. Don't leave sends and pioavail updates disabled, and add a comment as to why the force update is needed. 
Signed-off-by: Dave Olson
---

 drivers/infiniband/hw/ipath/ipath_intr.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c
index 1b58f47..45c4c06 100644
--- a/drivers/infiniband/hw/ipath/ipath_intr.c
+++ b/drivers/infiniband/hw/ipath/ipath_intr.c
@@ -933,11 +933,15 @@ void ipath_clear_freeze(struct ipath_devdata *dd)
 	 * therefore would not be sent, and eventually
 	 * might cause the process to run out of bufs
 	 */
-	ipath_cancel_sends(dd, 0);
+	ipath_cancel_sends(dd, 1);
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_control,
 			 dd->ipath_control);
 
-	/* ensure pio avail updates continue */
+	/*
+	 * ensure pio avail updates continue (because the update
+	 * won't have happened from cancel_sends because we were
+	 * still in freeze)
+	 */
 	ipath_force_pio_avail_update(dd);
 
 	/*

From ralph.campbell at qlogic.com  Tue May  6 11:36:36 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Tue, 06 May 2008 11:36:36 -0700
Subject: [ofa-general] [PATCH 4/7] IB/ipath - Return the correct opcode for
	RDMA WRITE with immediate
In-Reply-To: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com>
References: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com>
Message-ID: <20080506183636.6521.42474.stgit@eng-46.mv.qlogic.com>

This patch fixes a bug in the RC responder which generates a completion
entry with the wrong opcode when an RDMA WRITE with immediate is
received.

Signed-off-by: Ralph Campbell
---

 drivers/infiniband/hw/ipath/ipath_rc.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c
index c405dfb..08b11b5 100644
--- a/drivers/infiniband/hw/ipath/ipath_rc.c
+++ b/drivers/infiniband/hw/ipath/ipath_rc.c
@@ -1746,7 +1746,11 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
 		qp->r_wrid_valid = 0;
 		wc.wr_id = qp->r_wr_id;
 		wc.status = IB_WC_SUCCESS;
-		wc.opcode = IB_WC_RECV;
+		if (opcode == OP(RDMA_WRITE_LAST_WITH_IMMEDIATE) ||
+		    opcode == OP(RDMA_WRITE_ONLY_WITH_IMMEDIATE))
+			wc.opcode = IB_WC_RECV_RDMA_WITH_IMM;
+		else
+			wc.opcode = IB_WC_RECV;
 		wc.vendor_err = 0;
 		wc.qp = &qp->ibqp;
 		wc.src_qp = qp->remote_qpn;

From ralph.campbell at qlogic.com  Tue May  6 11:36:41 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Tue, 06 May 2008 11:36:41 -0700
Subject: [ofa-general] [PATCH 5/7] IB/ipath -- fix count of packets received
	by kernel
In-Reply-To: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com>
References: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com>
Message-ID: <20080506183641.6521.79318.stgit@eng-46.mv.qlogic.com>

From: Michael Albaugh

The loop in ipath_kreceive() that processes packets increments the
loop-index 'i' once too often, because the exit condition does not
depend on it, and is checked after the increment. By adding a check for
!last to the iterator in the for loop, we correct that in a way that is
not so likely to be re-broken by changes in the loop body.
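As a standalone illustration of what the iterator change accomplishes (a
toy model only - the fixed three-packet input and the printf bookkeeping
are made up for the example, not driver code):

	#include <stdio.h>

	/* Model of the ipath_kreceive() loop fix: 'i' counts packets and
	 * 'last' is set while processing the final packet. With
	 * "i += !last", the index is no longer bumped one extra time
	 * after the last packet has been handled.
	 */
	int main(void)
	{
		int last, i;

		for (last = 0, i = 1; !last; i += !last) {
			if (i == 3)	/* pretend packet 3 is the final one */
				last = 1;
			printf("processed packet %d\n", i);
		}
		printf("packets seen: %d\n", i);  /* 3; plain "i++" gave 4 */
		return 0;
	}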
Signed-off-by: Michael Albaugh
---

 drivers/infiniband/hw/ipath/ipath_driver.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
index acf30c0..f81dd4a 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -1197,7 +1197,7 @@ void ipath_kreceive(struct ipath_portdata *pd)
 	}
 
 reloop:
-	for (last = 0, i = 1; !last; i++) {
+	for (last = 0, i = 1; !last; i += !last) {
 		hdr = dd->ipath_f_get_msgheader(dd, rhf_addr);
 		eflags = ipath_hdrget_err_flags(rhf_addr);
 		etype = ipath_hdrget_rcv_type(rhf_addr);

From ralph.campbell at qlogic.com  Tue May  6 11:36:47 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Tue, 06 May 2008 11:36:47 -0700
Subject: [ofa-general] [PATCH 6/7] IB/ipath - need to always request and
	handle PIO avail interrupts
In-Reply-To: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com>
References: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com>
Message-ID: <20080506183646.6521.81784.stgit@eng-46.mv.qlogic.com>

From: Dave Olson

Now that we always use PIO for vl15 on 7220, we could get stuck forever
if we happened to run out of PIO buffers from the verbs code, because
the setup code wouldn't run; the interrupt was also ignored if SDMA was
supported. We also have to reduce the pio update threshold if we have
fewer kernel buffers than the existing threshold.

Cleans up the initialization a bit to get ordering safer and more
sensible, and to use the existing ipath_chg_kernavail call to do init,
rather than doing it separately.

Drops unnecessary clearing of pio buffer on pio parity error.

Drops incorrect updating of pioavailshadow when exiting freeze mode
(software state may not match chip state if buffer has been allocated
and not yet written).

If we couldn't get a kernel buffer for a while, make sure we are in
sync with hardware, mainly to handle the exiting freeze case.

Signed-off-by: Dave Olson
---

 drivers/infiniband/hw/ipath/ipath_driver.c    |  128 +++++++++++++++++++++++--
 drivers/infiniband/hw/ipath/ipath_file_ops.c  |   72 ++++++--------
 drivers/infiniband/hw/ipath/ipath_iba7220.c   |   21 +---
 drivers/infiniband/hw/ipath/ipath_init_chip.c |   95 ++++++++-----------
 drivers/infiniband/hw/ipath/ipath_intr.c      |   82 ++--------------
 drivers/infiniband/hw/ipath/ipath_kernel.h    |    8 +-
 drivers/infiniband/hw/ipath/ipath_ruc.c       |    7 +
 drivers/infiniband/hw/ipath/ipath_sdma.c      |   13 ++-
 8 files changed, 224 insertions(+), 202 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
index f81dd4a..2036d38 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -1428,6 +1428,40 @@ static void ipath_update_pio_bufs(struct ipath_devdata *dd)
 	spin_unlock_irqrestore(&ipath_pioavail_lock, flags);
 }
 
+/*
+ * used to force update of pioavailshadow if we can't get a pio buffer.
+ * Needed primarily due to exiting freeze mode after recovering
+ * from errors. Done lazily, because it's safer (known to not
+ * be writing pio buffers).
+ */
+static void ipath_reset_availshadow(struct ipath_devdata *dd)
+{
+	int i, im;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ipath_pioavail_lock, flags);
+	for (i = 0; i < dd->ipath_pioavregs; i++) {
+		u64 val, oldval;
+		/* deal with 6110 chip bug on high register #s */
+		im = (i > 3 && (dd->ipath_flags & IPATH_SWAP_PIOBUFS)) ?
+ i ^ 1 : i; + val = le64_to_cpu(dd->ipath_pioavailregs_dma[im]); + /* + * busy out the buffers not in the kernel avail list, + * without changing the generation bits. + */ + oldval = dd->ipath_pioavailshadow[i]; + dd->ipath_pioavailshadow[i] = val | + ((~dd->ipath_pioavailkernel[i] << + INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT) & + 0xaaaaaaaaaaaaaaaaULL); /* All BUSY bits in qword */ + if (oldval != dd->ipath_pioavailshadow[i]) + ipath_dbg("shadow[%d] was %Lx, now %lx\n", + i, oldval, dd->ipath_pioavailshadow[i]); + } + spin_unlock_irqrestore(&ipath_pioavail_lock, flags); +} + /** * ipath_setrcvhdrsize - set the receive header size * @dd: the infinipath device @@ -1482,9 +1516,12 @@ static noinline void no_pio_bufs(struct ipath_devdata *dd) */ ipath_stats.sps_nopiobufs++; if (!(++dd->ipath_consec_nopiobuf % 100000)) { - ipath_dbg("%u pio sends with no bufavail; dmacopy: " - "%llx %llx %llx %llx; shadow: %lx %lx %lx %lx\n", + ipath_force_pio_avail_update(dd); /* at start */ + ipath_dbg("%u tries no piobufavail ts%lx; dmacopy: " + "%llx %llx %llx %llx\n" + "ipath shadow: %lx %lx %lx %lx\n", dd->ipath_consec_nopiobuf, + (unsigned long)get_cycles(), (unsigned long long) le64_to_cpu(dma[0]), (unsigned long long) le64_to_cpu(dma[1]), (unsigned long long) le64_to_cpu(dma[2]), @@ -1496,14 +1533,17 @@ static noinline void no_pio_bufs(struct ipath_devdata *dd) */ if ((dd->ipath_piobcnt2k + dd->ipath_piobcnt4k) > (sizeof(shadow[0]) * 4 * 4)) - ipath_dbg("2nd group: dmacopy: %llx %llx " - "%llx %llx; shadow: %lx %lx %lx %lx\n", + ipath_dbg("2nd group: dmacopy: " + "%llx %llx %llx %llx\n" + "ipath shadow: %lx %lx %lx %lx\n", (unsigned long long)le64_to_cpu(dma[4]), (unsigned long long)le64_to_cpu(dma[5]), (unsigned long long)le64_to_cpu(dma[6]), (unsigned long long)le64_to_cpu(dma[7]), - shadow[4], shadow[5], shadow[6], - shadow[7]); + shadow[4], shadow[5], shadow[6], shadow[7]); + + /* at end, so update likely happened */ + ipath_reset_availshadow(dd); } } @@ -1652,19 +1692,46 @@ void ipath_chg_pioavailkernel(struct ipath_devdata *dd, unsigned start, unsigned len, int avail) { unsigned long flags; - unsigned end; + unsigned end, cnt = 0, next; /* There are two bits per send buffer (busy and generation) */ start *= 2; - len *= 2; - end = start + len; + end = start + len * 2; - /* Set or clear the generation bits. */ spin_lock_irqsave(&ipath_pioavail_lock, flags); + /* Set or clear the busy bit in the shadow. */ while (start < end) { if (avail) { - __clear_bit(start + INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT, - dd->ipath_pioavailshadow); + unsigned long dma; + int i, im; + /* + * the BUSY bit will never be set, because we disarm + * the user buffers before we hand them back to the + * kernel. We do have to make sure the generation + * bit is set correctly in shadow, since it could + * have changed many times while allocated to user. + * We can't use the bitmap functions on the full + * dma array because it is always little-endian, so + * we have to flip to host-order first. + * BITS_PER_LONG is slightly wrong, since it's + * always 64 bits per register in chip... + * We only work on 64 bit kernels, so that's OK. + */ + /* deal with 6110 chip bug on high register #s */ + i = start / BITS_PER_LONG; + im = (i > 3 && (dd->ipath_flags & IPATH_SWAP_PIOBUFS)) ? 
+ i ^ 1 : i; + __clear_bit(INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT + + start, dd->ipath_pioavailshadow); + dma = (unsigned long) le64_to_cpu( + dd->ipath_pioavailregs_dma[im]); + if (test_bit((INFINIPATH_SENDPIOAVAIL_CHECK_SHIFT + + start) % BITS_PER_LONG, &dma)) + __set_bit(INFINIPATH_SENDPIOAVAIL_CHECK_SHIFT + + start, dd->ipath_pioavailshadow); + else + __clear_bit(INFINIPATH_SENDPIOAVAIL_CHECK_SHIFT + + start, dd->ipath_pioavailshadow); __set_bit(start, dd->ipath_pioavailkernel); } else { __set_bit(start + INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT, @@ -1673,7 +1740,44 @@ void ipath_chg_pioavailkernel(struct ipath_devdata *dd, unsigned start, } start += 2; } + + if (dd->ipath_pioupd_thresh) { + end = 2 * (dd->ipath_piobcnt2k + dd->ipath_piobcnt4k); + next = find_first_bit(dd->ipath_pioavailkernel, end); + while (next < end) { + cnt++; + next = find_next_bit(dd->ipath_pioavailkernel, end, + next + 1); + } + } spin_unlock_irqrestore(&ipath_pioavail_lock, flags); + + /* + * When moving buffers from kernel to user, if number assigned to + * the user is less than the pio update threshold, and threshold + * is supported (cnt was computed > 0), drop the update threshold + * so we update at least once per allocated number of buffers. + * In any case, if the kernel buffers are less than the threshold, + * drop the threshold. We don't bother increasing it, having once + * decreased it, since it would typically just cycle back and forth. + * If we don't decrease below buffers in use, we can wait a long + * time for an update, until some other context uses PIO buffers. + */ + if (!avail && len < cnt) + cnt = len; + if (cnt < dd->ipath_pioupd_thresh) { + dd->ipath_pioupd_thresh = cnt; + ipath_dbg("Decreased pio update threshold to %u\n", + dd->ipath_pioupd_thresh); + spin_lock_irqsave(&dd->ipath_sendctrl_lock, flags); + dd->ipath_sendctrl &= ~(INFINIPATH_S_UPDTHRESH_MASK + << INFINIPATH_S_UPDTHRESH_SHIFT); + dd->ipath_sendctrl |= dd->ipath_pioupd_thresh + << INFINIPATH_S_UPDTHRESH_SHIFT; + ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, + dd->ipath_sendctrl); + spin_unlock_irqrestore(&dd->ipath_sendctrl_lock, flags); + } } /** diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c index 8b17522..3295177 100644 --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c @@ -173,47 +173,25 @@ static int ipath_get_base_info(struct file *fp, (void *) dd->ipath_statusp - (void *) dd->ipath_pioavailregs_dma; if (!shared) { - kinfo->spi_piocnt = dd->ipath_pbufsport; + kinfo->spi_piocnt = pd->port_piocnt; kinfo->spi_piobufbase = (u64) pd->port_piobufs; kinfo->__spi_uregbase = (u64) dd->ipath_uregbase + dd->ipath_ureg_align * pd->port_port; } else if (master) { - kinfo->spi_piocnt = (dd->ipath_pbufsport / subport_cnt) + - (dd->ipath_pbufsport % subport_cnt); + kinfo->spi_piocnt = (pd->port_piocnt / subport_cnt) + + (pd->port_piocnt % subport_cnt); /* Master's PIO buffers are after all the slave's */ kinfo->spi_piobufbase = (u64) pd->port_piobufs + dd->ipath_palign * - (dd->ipath_pbufsport - kinfo->spi_piocnt); + (pd->port_piocnt - kinfo->spi_piocnt); } else { unsigned slave = subport_fp(fp) - 1; - kinfo->spi_piocnt = dd->ipath_pbufsport / subport_cnt; + kinfo->spi_piocnt = pd->port_piocnt / subport_cnt; kinfo->spi_piobufbase = (u64) pd->port_piobufs + dd->ipath_palign * kinfo->spi_piocnt * slave; } - /* - * Set the PIO avail update threshold to no larger - * than the number of buffers per process. 
Note that - * we decrease it here, but won't ever increase it. - */ - if (dd->ipath_pioupd_thresh && - kinfo->spi_piocnt < dd->ipath_pioupd_thresh) { - unsigned long flags; - - dd->ipath_pioupd_thresh = kinfo->spi_piocnt; - ipath_dbg("Decreased pio update threshold to %u\n", - dd->ipath_pioupd_thresh); - spin_lock_irqsave(&dd->ipath_sendctrl_lock, flags); - dd->ipath_sendctrl &= ~(INFINIPATH_S_UPDTHRESH_MASK - << INFINIPATH_S_UPDTHRESH_SHIFT); - dd->ipath_sendctrl |= dd->ipath_pioupd_thresh - << INFINIPATH_S_UPDTHRESH_SHIFT; - ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, - dd->ipath_sendctrl); - spin_unlock_irqrestore(&dd->ipath_sendctrl_lock, flags); - } - if (shared) { kinfo->spi_port_uregbase = (u64) dd->ipath_uregbase + dd->ipath_ureg_align * pd->port_port; @@ -1309,19 +1287,19 @@ static int ipath_mmap(struct file *fp, struct vm_area_struct *vma) ureg = dd->ipath_uregbase + dd->ipath_ureg_align * pd->port_port; if (!pd->port_subport_cnt) { /* port is not shared */ - piocnt = dd->ipath_pbufsport; + piocnt = pd->port_piocnt; piobufs = pd->port_piobufs; } else if (!subport_fp(fp)) { /* caller is the master */ - piocnt = (dd->ipath_pbufsport / pd->port_subport_cnt) + - (dd->ipath_pbufsport % pd->port_subport_cnt); + piocnt = (pd->port_piocnt / pd->port_subport_cnt) + + (pd->port_piocnt % pd->port_subport_cnt); piobufs = pd->port_piobufs + - dd->ipath_palign * (dd->ipath_pbufsport - piocnt); + dd->ipath_palign * (pd->port_piocnt - piocnt); } else { unsigned slave = subport_fp(fp) - 1; /* caller is a slave */ - piocnt = dd->ipath_pbufsport / pd->port_subport_cnt; + piocnt = pd->port_piocnt / pd->port_subport_cnt; piobufs = pd->port_piobufs + dd->ipath_palign * piocnt * slave; } @@ -1633,9 +1611,6 @@ static int try_alloc_port(struct ipath_devdata *dd, int port, port_fp(fp) = pd; pd->port_pid = current->pid; strncpy(pd->port_comm, current->comm, sizeof(pd->port_comm)); - ipath_chg_pioavailkernel(dd, - dd->ipath_pbufsport * (pd->port_port - 1), - dd->ipath_pbufsport, 0); ipath_stats.sps_ports++; ret = 0; } else @@ -1938,11 +1913,25 @@ static int ipath_do_user_init(struct file *fp, /* for now we do nothing with rcvhdrcnt: uinfo->spu_rcvhdrcnt */ + /* some ports may get extra buffers, calculate that here */ + if (pd->port_port <= dd->ipath_ports_extrabuf) + pd->port_piocnt = dd->ipath_pbufsport + 1; + else + pd->port_piocnt = dd->ipath_pbufsport; + /* for right now, kernel piobufs are at end, so port 1 is at 0 */ + if (pd->port_port <= dd->ipath_ports_extrabuf) + pd->port_pio_base = (dd->ipath_pbufsport + 1) + * (pd->port_port - 1); + else + pd->port_pio_base = dd->ipath_ports_extrabuf + + dd->ipath_pbufsport * (pd->port_port - 1); pd->port_piobufs = dd->ipath_piobufbase + - dd->ipath_pbufsport * (pd->port_port - 1) * dd->ipath_palign; - ipath_cdbg(VERBOSE, "Set base of piobufs for port %u to 0x%x\n", - pd->port_port, pd->port_piobufs); + pd->port_pio_base * dd->ipath_palign; + ipath_cdbg(VERBOSE, "piobuf base for port %u is 0x%x, piocnt %u," + " first pio %u\n", pd->port_port, pd->port_piobufs, + pd->port_piocnt, pd->port_pio_base); + ipath_chg_pioavailkernel(dd, pd->port_pio_base, pd->port_piocnt, 0); /* * Now allocate the rcvhdr Q and eager TIDs; skip the TID @@ -2107,7 +2096,6 @@ static int ipath_close(struct inode *in, struct file *fp) } if (dd->ipath_kregbase) { - int i; /* atomically clear receive enable port and intr avail. 
*/ clear_bit(dd->ipath_r_portenable_shift + port, &dd->ipath_rcvctrl); @@ -2136,9 +2124,9 @@ static int ipath_close(struct inode *in, struct file *fp) ipath_write_kreg_port(dd, dd->ipath_kregs->kr_rcvhdraddr, pd->port_port, dd->ipath_dummy_hdrq_phys); - i = dd->ipath_pbufsport * (port - 1); - ipath_disarm_piobufs(dd, i, dd->ipath_pbufsport); - ipath_chg_pioavailkernel(dd, i, dd->ipath_pbufsport, 1); + ipath_disarm_piobufs(dd, pd->port_pio_base, pd->port_piocnt); + ipath_chg_pioavailkernel(dd, pd->port_pio_base, + pd->port_piocnt, 1); dd->ipath_f_clear_tids(dd, pd->port_port); diff --git a/drivers/infiniband/hw/ipath/ipath_iba7220.c b/drivers/infiniband/hw/ipath/ipath_iba7220.c index 5f693de..8eee783 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba7220.c +++ b/drivers/infiniband/hw/ipath/ipath_iba7220.c @@ -595,7 +595,7 @@ static void ipath_7220_txe_recover(struct ipath_devdata *dd) dev_info(&dd->pcidev->dev, "Recovering from TXE PIO parity error\n"); - ipath_disarm_senderrbufs(dd, 1); + ipath_disarm_senderrbufs(dd); } @@ -675,10 +675,8 @@ static void ipath_7220_handle_hwerrors(struct ipath_devdata *dd, char *msg, ctrl = ipath_read_kreg32(dd, dd->ipath_kregs->kr_control); if ((ctrl & INFINIPATH_C_FREEZEMODE) && !ipath_diag_inuse) { /* - * Parity errors in send memory are recoverable, - * just cancel the send (if indicated in * sendbuffererror), - * count the occurrence, unfreeze (if no other handled - * hardware error bits are set), and continue. + * Parity errors in send memory are recoverable by h/w + * just do housekeeping, exit freeze mode and continue. */ if (hwerrs & ((INFINIPATH_HWE_TXEMEMPARITYERR_PIOBUF | INFINIPATH_HWE_TXEMEMPARITYERR_PIOPBC) @@ -687,13 +685,6 @@ static void ipath_7220_handle_hwerrors(struct ipath_devdata *dd, char *msg, hwerrs &= ~((INFINIPATH_HWE_TXEMEMPARITYERR_PIOBUF | INFINIPATH_HWE_TXEMEMPARITYERR_PIOPBC) << INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT); - if (!hwerrs) { - /* else leave in freeze mode */ - ipath_write_kreg(dd, - dd->ipath_kregs->kr_control, - dd->ipath_control); - goto bail; - } } if (hwerrs) { /* @@ -723,8 +714,8 @@ static void ipath_7220_handle_hwerrors(struct ipath_devdata *dd, char *msg, *dd->ipath_statusp |= IPATH_STATUS_HWERROR; dd->ipath_flags &= ~IPATH_INITTED; } else { - ipath_dbg("Clearing freezemode on ignored hardware " - "error\n"); + ipath_dbg("Clearing freezemode on ignored or " + "recovered hardware error\n"); ipath_clear_freeze(dd); } } @@ -1967,7 +1958,7 @@ static void ipath_7220_config_ports(struct ipath_devdata *dd, ushort cfgports) dd->ipath_rcvctrl); dd->ipath_p0_rcvegrcnt = 2048; /* always */ if (dd->ipath_flags & IPATH_HAS_SEND_DMA) - dd->ipath_pioreserved = 1; /* reserve a buffer */ + dd->ipath_pioreserved = 3; /* kpiobufs used for PIO */ } diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c index 27dd894..3e5baa4 100644 --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c @@ -41,7 +41,7 @@ /* * min buffers we want to have per port, after driver */ -#define IPATH_MIN_USER_PORT_BUFCNT 8 +#define IPATH_MIN_USER_PORT_BUFCNT 7 /* * Number of ports we are configured to use (to allow for more pio @@ -54,13 +54,9 @@ MODULE_PARM_DESC(cfgports, "Set max number of ports to use"); /* * Number of buffers reserved for driver (verbs and layered drivers.) - * Reserved at end of buffer list. Initialized based on - * number of PIO buffers if not set via module interface. 
+ * Initialized based on number of PIO buffers if not set via module interface. * The problem with this is that it's global, but we'll use different - * numbers for different chip types. So the default value is not - * very useful. I've redefined it for the 1.3 release so that it's - * zero unless set by the user to something else, in which case we - * try to respect it. + * numbers for different chip types. */ static ushort ipath_kpiobufs; @@ -546,9 +542,12 @@ static void enable_chip(struct ipath_devdata *dd, int reinit) pioavail = dd->ipath_pioavailregs_dma[i ^ 1]; else pioavail = dd->ipath_pioavailregs_dma[i]; - dd->ipath_pioavailshadow[i] = le64_to_cpu(pioavail) | - (~dd->ipath_pioavailkernel[i] << - INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT); + /* + * don't need to worry about ipath_pioavailkernel here + * because we will call ipath_chg_pioavailkernel() later + * in initialization, to busy out buffers as needed + */ + dd->ipath_pioavailshadow[i] = le64_to_cpu(pioavail); } /* can get counters, stats, etc. */ dd->ipath_flags |= IPATH_PRESENT; @@ -708,12 +707,11 @@ static void verify_interrupt(unsigned long opaque) int ipath_init_chip(struct ipath_devdata *dd, int reinit) { int ret = 0; - u32 val32, kpiobufs; + u32 kpiobufs, defkbufs; u32 piobufs, uports; u64 val; struct ipath_portdata *pd; gfp_t gfp_flags = GFP_USER | __GFP_COMP; - unsigned long flags; ret = init_housekeeping(dd, reinit); if (ret) @@ -753,56 +751,46 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit) dd->ipath_pioavregs = ALIGN(piobufs, sizeof(u64) * BITS_PER_BYTE / 2) / (sizeof(u64) * BITS_PER_BYTE / 2); uports = dd->ipath_cfgports ? dd->ipath_cfgports - 1 : 0; - if (ipath_kpiobufs == 0) { - /* not set by user (this is default) */ - if (piobufs > 144) - kpiobufs = 32; - else - kpiobufs = 16; - } + if (piobufs > 144) + defkbufs = 32 + dd->ipath_pioreserved; else - kpiobufs = ipath_kpiobufs; + defkbufs = 16 + dd->ipath_pioreserved; - if (kpiobufs + (uports * IPATH_MIN_USER_PORT_BUFCNT) > piobufs) { + if (ipath_kpiobufs && (ipath_kpiobufs + + (uports * IPATH_MIN_USER_PORT_BUFCNT)) > piobufs) { int i = (int) piobufs - (int) (uports * IPATH_MIN_USER_PORT_BUFCNT); if (i < 1) i = 1; dev_info(&dd->pcidev->dev, "Allocating %d PIO bufs of " "%d for kernel leaves too few for %d user ports " - "(%d each); using %u\n", kpiobufs, + "(%d each); using %u\n", ipath_kpiobufs, piobufs, uports, IPATH_MIN_USER_PORT_BUFCNT, i); /* * shouldn't change ipath_kpiobufs, because could be * different for different devices... */ kpiobufs = i; - } + } else if (ipath_kpiobufs) + kpiobufs = ipath_kpiobufs; + else + kpiobufs = defkbufs; dd->ipath_lastport_piobuf = piobufs - kpiobufs; dd->ipath_pbufsport = uports ? 
dd->ipath_lastport_piobuf / uports : 0; - val32 = dd->ipath_lastport_piobuf - (dd->ipath_pbufsport * uports); - if (val32 > 0) { - ipath_dbg("allocating %u pbufs/port leaves %u unused, " - "add to kernel\n", dd->ipath_pbufsport, val32); - dd->ipath_lastport_piobuf -= val32; - kpiobufs += val32; - ipath_dbg("%u pbufs/port leaves %u unused, add to kernel\n", - dd->ipath_pbufsport, val32); - } + /* if not an even divisor, some user ports get extra buffers */ + dd->ipath_ports_extrabuf = dd->ipath_lastport_piobuf - + (dd->ipath_pbufsport * uports); + if (dd->ipath_ports_extrabuf) + ipath_dbg("%u pbufs/port leaves some unused, add 1 buffer to " + "ports <= %u\n", dd->ipath_pbufsport, + dd->ipath_ports_extrabuf); dd->ipath_lastpioindex = 0; dd->ipath_lastpioindexl = dd->ipath_piobcnt2k; - ipath_chg_pioavailkernel(dd, 0, piobufs, 1); + /* ipath_pioavailshadow initialized earlier */ ipath_cdbg(VERBOSE, "%d PIO bufs for kernel out of %d total %u " "each for %u user ports\n", kpiobufs, piobufs, dd->ipath_pbufsport, uports); - if (dd->ipath_pioupd_thresh) { - if (dd->ipath_pbufsport < dd->ipath_pioupd_thresh) - dd->ipath_pioupd_thresh = dd->ipath_pbufsport; - if (kpiobufs < dd->ipath_pioupd_thresh) - dd->ipath_pioupd_thresh = kpiobufs; - } - ret = dd->ipath_f_early_init(dd); if (ret) { ipath_dev_err(dd, "Early initialization failure\n"); @@ -810,13 +798,6 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit) } /* - * Cancel any possible active sends from early driver load. - * Follows early_init because some chips have to initialize - * PIO buffers in early_init to avoid false parity errors. - */ - ipath_cancel_sends(dd, 0); - - /* * Early_init sets rcvhdrentsize and rcvhdrsize, so this must be * done after early_init. */ @@ -836,6 +817,7 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit) ipath_write_kreg(dd, dd->ipath_kregs->kr_sendpioavailaddr, dd->ipath_pioavailregs_phys); + /* * this is to detect s/w errors, which the h/w works around by * ignoring the low 6 bits of address, if it wasn't aligned. @@ -862,12 +844,6 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit) ~0ULL&~INFINIPATH_HWE_MEMBISTFAILED); ipath_write_kreg(dd, dd->ipath_kregs->kr_control, 0ULL); - spin_lock_irqsave(&dd->ipath_sendctrl_lock, flags); - dd->ipath_sendctrl = INFINIPATH_S_PIOENABLE; - ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, dd->ipath_sendctrl); - ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); - spin_unlock_irqrestore(&dd->ipath_sendctrl_lock, flags); - /* * before error clears, since we expect serdes pll errors during * this, the first time after reset @@ -940,6 +916,19 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit) else enable_chip(dd, reinit); + /* after enable_chip, so pioavailshadow setup */ + ipath_chg_pioavailkernel(dd, 0, piobufs, 1); + + /* + * Cancel any possible active sends from early driver load. + * Follows early_init because some chips have to initialize + * PIO buffers in early_init to avoid false parity errors. + * After enable and ipath_chg_pioavailkernel so we can safely + * enable pioavail updates and PIOENABLE; packets are now + * ready to go out. 
+ */ + ipath_cancel_sends(dd, 1); + if (!reinit) { /* * Used when we close a port, for DMA already in flight diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index 45c4c06..26900b3 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -38,42 +38,12 @@ #include "ipath_verbs.h" #include "ipath_common.h" -/* - * clear (write) a pio buffer, to clear a parity error. This routine - * should only be called when in freeze mode, and the buffer should be - * canceled afterwards. - */ -static void ipath_clrpiobuf(struct ipath_devdata *dd, u32 pnum) -{ - u32 __iomem *pbuf; - u32 dwcnt; /* dword count to write */ - if (pnum < dd->ipath_piobcnt2k) { - pbuf = (u32 __iomem *) (dd->ipath_pio2kbase + pnum * - dd->ipath_palign); - dwcnt = dd->ipath_piosize2k >> 2; - } - else { - pbuf = (u32 __iomem *) (dd->ipath_pio4kbase + - (pnum - dd->ipath_piobcnt2k) * dd->ipath_4kalign); - dwcnt = dd->ipath_piosize4k >> 2; - } - dev_info(&dd->pcidev->dev, - "Rewrite PIO buffer %u, to recover from parity error\n", - pnum); - - /* no flush required, since already in freeze */ - writel(dwcnt + 1, pbuf); - while (--dwcnt) - writel(0, pbuf++); -} /* * Called when we might have an error that is specific to a particular * PIO buffer, and may need to cancel that buffer, so it can be re-used. - * If rewrite is true, and bits are set in the sendbufferror registers, - * we'll write to the buffer, for error recovery on parity errors. */ -void ipath_disarm_senderrbufs(struct ipath_devdata *dd, int rewrite) +void ipath_disarm_senderrbufs(struct ipath_devdata *dd) { u32 piobcnt; unsigned long sbuf[4]; @@ -109,11 +79,8 @@ void ipath_disarm_senderrbufs(struct ipath_devdata *dd, int rewrite) } for (i = 0; i < piobcnt; i++) - if (test_bit(i, sbuf)) { - if (rewrite) - ipath_clrpiobuf(dd, i); + if (test_bit(i, sbuf)) ipath_disarm_piobufs(dd, i, 1); - } /* ignore armlaunch errs for a bit */ dd->ipath_lastcancel = jiffies+3; } @@ -164,7 +131,7 @@ static u64 handle_e_sum_errs(struct ipath_devdata *dd, ipath_err_t errs) { u64 ignore_this_time = 0; - ipath_disarm_senderrbufs(dd, 0); + ipath_disarm_senderrbufs(dd); if ((errs & E_SUM_LINK_PKTERRS) && !(dd->ipath_flags & IPATH_LINKACTIVE)) { /* @@ -909,8 +876,8 @@ static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs) * processes (causing armlaunch), send errors due to going into freeze mode, * etc., and try to avoid causing extra interrupts while doing so. * Forcibly update the in-memory pioavail register copies after cleanup - * because the chip won't do it for anything changing while in freeze mode - * (we don't want to wait for the next pio buffer state change). + * because the chip won't do it while in freeze mode (the register values + * themselves are kept correct). 
* Make sure that we don't lose any important interrupts by using the chip * feature that says that writing 0 to a bit in *clear that is set in * *status will cause an interrupt to be generated again (if allowed by @@ -918,48 +885,23 @@ static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs) */ void ipath_clear_freeze(struct ipath_devdata *dd) { - int i, im; - u64 val; - /* disable error interrupts, to avoid confusion */ ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, 0ULL); /* also disable interrupts; errormask is sometimes overwriten */ ipath_write_kreg(dd, dd->ipath_kregs->kr_intmask, 0ULL); - /* - * clear all sends, because they have may been - * completed by usercode while in freeze mode, and - * therefore would not be sent, and eventually - * might cause the process to run out of bufs - */ ipath_cancel_sends(dd, 1); + + /* clear the freeze, and be sure chip saw it */ ipath_write_kreg(dd, dd->ipath_kregs->kr_control, dd->ipath_control); + ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); - /* - * ensure pio avail updates continue (because the update - * won't have happened from cancel_sends because we were - * still in freeze - */ + /* force in-memory update now we are out of freeze */ ipath_force_pio_avail_update(dd); /* - * We just enabled pioavailupdate, so dma copy is almost certainly - * not yet right, so read the registers directly. Similar to init - */ - for (i = 0; i < dd->ipath_pioavregs; i++) { - /* deal with 6110 chip bug */ - im = (i > 3 && (dd->ipath_flags & IPATH_SWAP_PIOBUFS)) ? - i ^ 1 : i; - val = ipath_read_kreg64(dd, (0x1000 / sizeof(u64)) + im); - dd->ipath_pioavailregs_dma[i] = cpu_to_le64(val); - dd->ipath_pioavailshadow[i] = val | - (~dd->ipath_pioavailkernel[i] << - INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT); - } - - /* * force new interrupt if any hwerr, error or interrupt bits are * still set, and clear "safe" send packet errors related to freeze * and cancelling sends. Re-enable error interrupts before possible @@ -1316,10 +1258,8 @@ irqreturn_t ipath_intr(int irq, void *data) ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); spin_unlock_irqrestore(&dd->ipath_sendctrl_lock, flags); - if (!(dd->ipath_flags & IPATH_HAS_SEND_DMA)) - handle_layer_pioavail(dd); - else - ipath_dbg("unexpected BUFAVAIL intr\n"); + /* always process; sdma verbs uses PIO for acks and VL15 */ + handle_layer_pioavail(dd); } ret = IRQ_HANDLED; diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index 202337a..02b24a3 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -117,6 +117,10 @@ struct ipath_portdata { u16 port_subport_cnt; /* non-zero if port is being shared. 
*/ u16 port_subport_id; + /* number of pio bufs for this port (all procs, if shared) */ + u32 port_piocnt; + /* first pio buffer for this port */ + u32 port_pio_base; /* chip offset of PIO buffers for this port */ u32 port_piobufs; /* how many alloc_pages() chunks in port_rcvegrbuf_pages */ @@ -384,6 +388,8 @@ struct ipath_devdata { u32 ipath_lastrpkts; /* pio bufs allocated per port */ u32 ipath_pbufsport; + /* if remainder on bufs/port, ports < extrabuf get 1 extra */ + u32 ipath_ports_extrabuf; u32 ipath_pioupd_thresh; /* update threshold, some chips */ /* * number of ports configured as max; zero is set to number chip @@ -1011,7 +1017,7 @@ void ipath_get_eeprom_info(struct ipath_devdata *); int ipath_update_eeprom_log(struct ipath_devdata *dd); void ipath_inc_eeprom_err(struct ipath_devdata *dd, u32 eidx, u32 incr); u64 ipath_snap_cntr(struct ipath_devdata *, ipath_creg); -void ipath_disarm_senderrbufs(struct ipath_devdata *, int); +void ipath_disarm_senderrbufs(struct ipath_devdata *); void ipath_force_pio_avail_update(struct ipath_devdata *); void signal_ib_event(struct ipath_devdata *dd, enum ib_event_type ev); diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c index 8ac5c1d..9e3fe61 100644 --- a/drivers/infiniband/hw/ipath/ipath_ruc.c +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c @@ -481,9 +481,10 @@ done: wake_up(&qp->wait); } -static void want_buffer(struct ipath_devdata *dd) +static void want_buffer(struct ipath_devdata *dd, struct ipath_qp *qp) { - if (!(dd->ipath_flags & IPATH_HAS_SEND_DMA)) { + if (!(dd->ipath_flags & IPATH_HAS_SEND_DMA) || + qp->ibqp.qp_type == IB_QPT_SMI) { unsigned long flags; spin_lock_irqsave(&dd->ipath_sendctrl_lock, flags); @@ -519,7 +520,7 @@ static void ipath_no_bufs_available(struct ipath_qp *qp, spin_lock_irqsave(&dev->pending_lock, flags); list_add_tail(&qp->piowait, &dev->piowait); spin_unlock_irqrestore(&dev->pending_lock, flags); - want_buffer(dev->dd); + want_buffer(dev->dd, qp); dev->n_piowait++; } diff --git a/drivers/infiniband/hw/ipath/ipath_sdma.c b/drivers/infiniband/hw/ipath/ipath_sdma.c index 1974df7..0d07682 100644 --- a/drivers/infiniband/hw/ipath/ipath_sdma.c +++ b/drivers/infiniband/hw/ipath/ipath_sdma.c @@ -449,16 +449,19 @@ int setup_sdma(struct ipath_devdata *dd) ipath_write_kreg(dd, dd->ipath_kregs->kr_senddmaheadaddr, dd->ipath_sdma_head_phys); - /* Reserve all the former "kernel" piobufs */ - n = dd->ipath_piobcnt2k + dd->ipath_piobcnt4k - dd->ipath_pioreserved; - for (i = dd->ipath_lastport_piobuf; i < n; ++i) { + /* + * Reserve all the former "kernel" piobufs, using high number range + * so we get as many 4K buffers as possible + */ + n = dd->ipath_piobcnt2k + dd->ipath_piobcnt4k; + i = dd->ipath_lastport_piobuf + dd->ipath_pioreserved; + ipath_chg_pioavailkernel(dd, i, n - i , 0); + for (; i < n; ++i) { unsigned word = i / 64; unsigned bit = i & 63; BUG_ON(word >= 3); senddmabufmask[word] |= 1ULL << bit; } - ipath_chg_pioavailkernel(dd, dd->ipath_lastport_piobuf, - n - dd->ipath_lastport_piobuf, 0); ipath_write_kreg(dd, dd->ipath_kregs->kr_senddmabufmask0, senddmabufmask[0]); ipath_write_kreg(dd, dd->ipath_kregs->kr_senddmabufmask1, From ralph.campbell at qlogic.com Tue May 6 11:36:52 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 06 May 2008 11:36:52 -0700 Subject: [ofa-general] [PATCH 7/7] IB/ipath - fix SDMA error recovery in absence of link status change In-Reply-To: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> References: 
<20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> Message-ID: <20080506183652.6521.21456.stgit@eng-46.mv.qlogic.com> From: John Gregor What's fixed: in ipath_cancel_sends() We need to unconditionally set ABORTING. So, swapped the tests so the set_bit() isn't shadowed by the &&. If we've disarmed the piobufs, then we need to unconditionally set DISARMED. So, moved it out from the overly protective if at the bottom. in sdma_abort_task() Abort_task was written knowing that the SDMA engine would always be reset (and restarted) on error. A recent change broke that fundamental assumption by taking the restart portion and making it conditional on a link status change. But, SDMA can go boom without a link status change in some conditions. Signed-off-by: John Gregor --- drivers/infiniband/hw/ipath/ipath_driver.c | 8 +++++-- drivers/infiniband/hw/ipath/ipath_sdma.c | 31 ++++++++++++++++++++++------ 2 files changed, 29 insertions(+), 10 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index 2036d38..ce7b7c3 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -1898,8 +1898,8 @@ void ipath_cancel_sends(struct ipath_devdata *dd, int restore_sendctrl) spin_lock_irqsave(&dd->ipath_sdma_lock, flags); skip_cancel = - !test_bit(IPATH_SDMA_DISABLED, statp) && - test_and_set_bit(IPATH_SDMA_ABORTING, statp); + test_and_set_bit(IPATH_SDMA_ABORTING, statp) + && !test_bit(IPATH_SDMA_DISABLED, statp); spin_unlock_irqrestore(&dd->ipath_sdma_lock, flags); if (skip_cancel) goto bail; @@ -1930,6 +1930,9 @@ void ipath_cancel_sends(struct ipath_devdata *dd, int restore_sendctrl) ipath_disarm_piobufs(dd, 0, dd->ipath_piobcnt2k + dd->ipath_piobcnt4k); + if (dd->ipath_flags & IPATH_HAS_SEND_DMA) + set_bit(IPATH_SDMA_DISARMED, &dd->ipath_sdma_status); + if (restore_sendctrl) { /* else done by caller later if needed */ spin_lock_irqsave(&dd->ipath_sendctrl_lock, flags); @@ -1949,7 +1952,6 @@ void ipath_cancel_sends(struct ipath_devdata *dd, int restore_sendctrl) /* only wait so long for intr */ dd->ipath_sdma_abort_intr_timeout = jiffies + HZ; dd->ipath_sdma_reset_wait = 200; - __set_bit(IPATH_SDMA_DISARMED, &dd->ipath_sdma_status); if (!test_bit(IPATH_SDMA_SHUTDOWN, &dd->ipath_sdma_status)) tasklet_hi_schedule(&dd->ipath_sdma_abort_task); spin_unlock_irqrestore(&dd->ipath_sdma_lock, flags); diff --git a/drivers/infiniband/hw/ipath/ipath_sdma.c b/drivers/infiniband/hw/ipath/ipath_sdma.c index 0d07682..3697449 100644 --- a/drivers/infiniband/hw/ipath/ipath_sdma.c +++ b/drivers/infiniband/hw/ipath/ipath_sdma.c @@ -308,13 +308,15 @@ static void sdma_abort_task(unsigned long opaque) spin_unlock_irqrestore(&dd->ipath_sdma_lock, flags); /* - * Don't restart sdma here. Wait until link is up to ACTIVE. - * VL15 MADs used to bring the link up use PIO, and multiple - * link transitions otherwise cause the sdma engine to be + * Don't restart sdma here (with the exception + * below). Wait until link is up to ACTIVE. VL15 MADs + * used to bring the link up use PIO, and multiple link + * transitions otherwise cause the sdma engine to be * stopped and started multiple times. - * The disable is done here, including the shadow, so the - * state is kept consistent. - * See ipath_restart_sdma() for the actual starting of sdma. + * The disable is done here, including the shadow, + * so the state is kept consistent. + * See ipath_restart_sdma() for the actual starting + * of sdma. 
*/ spin_lock_irqsave(&dd->ipath_sendctrl_lock, flags); dd->ipath_sendctrl &= ~INFINIPATH_S_SDMAENABLE; @@ -326,6 +328,13 @@ static void sdma_abort_task(unsigned long opaque) /* make sure I see next message */ dd->ipath_sdma_abort_jiffies = 0; + /* + * Not everything that takes SDMA offline is a link + * status change. If the link was up, restart SDMA. + */ + if (dd->ipath_flags & IPATH_LINKACTIVE) + ipath_restart_sdma(dd); + goto done; } @@ -427,7 +436,12 @@ int setup_sdma(struct ipath_devdata *dd) goto done; } - dd->ipath_sdma_status = 0; + /* + * Set initial status as if we had been up, then gone down. + * This lets initial start on transition to ACTIVE be the + * same as restart after link flap. + */ + dd->ipath_sdma_status = IPATH_SDMA_ABORT_ABORTED; dd->ipath_sdma_abort_jiffies = 0; dd->ipath_sdma_generation = 0; dd->ipath_sdma_descq_tail = 0; @@ -618,6 +632,9 @@ void ipath_restart_sdma(struct ipath_devdata *dd) ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); spin_unlock_irqrestore(&dd->ipath_sendctrl_lock, flags); + /* notify upper layers */ + ipath_ib_piobufavail(dd->verbs_dev); + bail: return; } From caitlin.bestler at neterion.com Tue May 6 12:28:28 2008 From: caitlin.bestler at neterion.com (Caitlin Bestler) Date: Tue, 6 May 2008 12:28:28 -0700 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: <4820A427.1070405@opengridcomputing.com> References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> Message-ID: <469958e00805061228s6a825066je6648d5e27db42a9@mail.gmail.com> On Tue, May 6, 2008 at 11:32 AM, Steve Wise wrote: > Roland Dreier wrote: > > > > - always do peer2peer and don't let the app choose. This forces > > > the overhead of p2p mode on all apps, but preserves the API. > > > > How bad is the overhead? > > > > - R. > > > > > The client side must send a "Ready To Receive" message. This will be > negotiated via the MPA exchange and the resulting RTR message may be a 0B > read + read response, 0B write, or a 0B send. For chelsio, the 0B write > couldn't be used, and the 0B read was the least impact on the driver code, > so we used that. For nes, they currently use a 0B write. > > Also, there are some "caveats" if you turn this on: > > 1) private data is used to negotiate the type of RTR message and if its > needed. This is more of a global module option I think, since it will > break interoperability with iwarp. Prolly will bump the MPA version number > if this option is on too. > > 2) if the RTR message fails, it can generate a CQE that is unexpected. > > 3) if using SEND, then a recv completion is always generated. > > Steve. > > > Keep in mind that even if it is a zero byte RDMA Write, it is still a distinct packet that needs TCP handling, will occupy a buffer in various switch queues, etc. So while it can be about as innocuous as any TCP segment can be, it is still an excess packet if it did not need to be sent. The overwhelming majority of applications use a client/server model rather than peer2peer. For them this is an excess wire packet, so I think that would make it excessive overhead. Secondly, the applications that need this feature will generally know that they need it. Developers of MPI and other peer-2-peer applications tend to know advanced networking a bit more than typical app developers. So keeping the default to match the client/server model makes sense. 
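To make the negotiation described in this thread concrete, here is a hedged sketch of the capability exchange Steve describes: each peer advertises in the MPA private data which zero-byte RTR operations it supports, and the client sends one of the mutually supported types. All names below are hypothetical illustrations; none of them come from the actual cxgb3 or nes drivers.

/* Hypothetical encoding of supported zero-byte RTR types. */
enum rtr_type {
	RTR_NONE   = 0,      /* plain client/server: no RTR sent */
	RTR_READ0  = 1 << 0, /* 0B RDMA Read + Read Response */
	RTR_WRITE0 = 1 << 1, /* 0B RDMA Write */
	RTR_SEND0  = 1 << 2, /* 0B Send; always generates a recv CQE */
};

/* Pick the RTR type the client will send, given the capability bits
 * both sides exchanged in the MPA private data. */
static enum rtr_type negotiate_rtr(unsigned char local, unsigned char remote)
{
	unsigned char common = local & remote;

	if (common & RTR_READ0)
		return RTR_READ0;
	if (common & RTR_WRITE0)
		return RTR_WRITE0;
	if (common & RTR_SEND0)
		return RTR_SEND0;
	return RTR_NONE; /* fall back for interop with plain iWARP peers */
}

An empty intersection would correspond to the interoperability fallback mentioned above: behave like an ordinary iWARP connection and send no RTR.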
From swise at opengridcomputing.com  Tue May  6 13:37:58 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 06 May 2008 15:37:58 -0500
Subject: [ofa-general] Re: [PATCH] Request For Comments:
In-Reply-To: <469958e00805061228s6a825066je6648d5e27db42a9@mail.gmail.com>
References: <20080506170230.11409.43625.stgit@dell3.ogc.int>
	<4820A427.1070405@opengridcomputing.com>
	<469958e00805061228s6a825066je6648d5e27db42a9@mail.gmail.com>
Message-ID: <4820C1A6.8020104@opengridcomputing.com>

Caitlin Bestler wrote:
> On Tue, May 6, 2008 at 11:32 AM, Steve Wise wrote:
>> Roland Dreier wrote:
>>> > - always do peer2peer and don't let the app choose.  This forces
>>> > the overhead of p2p mode on all apps, but preserves the API.
>>>
>>> How bad is the overhead?
>>>
>>>  - R.
>>>
>> The client side must send a "Ready To Receive" message.  This will be
>> negotiated via the MPA exchange and the resulting RTR message may be a 0B
>> read + read response, 0B write, or a 0B send.  For chelsio, the 0B write
>> couldn't be used, and the 0B read was the least impact on the driver code,
>> so we used that.  For nes, they currently use a 0B write.
>>
>> Also, there are some "caveats" if you turn this on:
>>
>> 1) private data is used to negotiate the type of RTR message and if its
>> needed.  This is more of a global module option I think, since it will
>> break interoperability with iwarp.  Prolly will bump the MPA version number
>> if this option is on too.
>>
>> 2) if the RTR message fails, it can generate a CQE that is unexpected.
>>
>> 3) if using SEND, then a recv completion is always generated.
>>
>> Steve.
>
> Keep in mind that even if it is a zero byte RDMA Write, it is still a distinct
> packet that needs TCP handling, will occupy a buffer in various switch
> queues, etc.
>
> So while it can be about as innocuous as any TCP segment can be, it
> is still an excess packet if it did not need to be sent. The overwhelming
> majority of applications use a client/server model rather than peer2peer.
> For them this is an excess wire packet, so I think that would make it
> excessive overhead.
>
> Secondly, the applications that need this feature will generally know
> that they need it. Developers of MPI and other peer-2-peer applications
> tend to know advanced networking a bit more than typical app developers.
> So keeping the default to match the client/server model makes sense.

What are the overwhelming majority of user mode rdma applications that
don't assume a peer2peer model?

Steve.

From eli at dev.mellanox.co.il  Wed May  7 00:34:58 2008
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Wed, 07 May 2008 10:34:58 +0300
Subject: [ofa-general] getting network statistics
In-Reply-To: <20080506141039.GJ6586@cefeid.wcss.wroc.pl>
References: <1203424196.16145.1.camel@mtls03>
	<20080506141039.GJ6586@cefeid.wcss.wroc.pl>
Message-ID: <1210145698.15669.78.camel@mtls03>

These files are on a virtual file system and their size does not
change. You need to read them, e.g. using cat, in order to get the
statistics data. For example, "cat port_rcv_data" will give you a
measure of how many bytes of data were received by the port.
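For scripted monitoring the same counters can also be read from C. Below is a minimal sketch; the device name (mthca0), port number and one-second sampling interval are arbitrary choices, and the scaling by 4 reflects the IB spec's definition of the Port{Xmit,Rcv}Data counters in 4-octet units, which is worth verifying against your HCA's documentation.

#include <stdio.h>
#include <unistd.h>

static unsigned long long read_counter(const char *path)
{
	unsigned long long v = 0;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%llu", &v) != 1)
			v = 0;
		fclose(f);
	}
	return v;
}

int main(void)
{
	const char *path =
		"/sys/class/infiniband/mthca0/ports/1/counters/port_rcv_data";
	unsigned long long before, after;

	/* sample the counter twice to estimate receive throughput */
	before = read_counter(path);
	sleep(1);
	after = read_counter(path);
	printf("approx. %llu bytes/sec received\n", (after - before) * 4);
	return 0;
}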
On Tue, 2008-05-06 at 16:10 +0200, Pawel Dziekonski wrote:
> you mean port_rcv_data and port_xmit_data ?
>
> if so, then I have 2 jobs that are definitely using IB network, but
> those files almost do not change. :o
>
> OFED 1.2.5.5 and kernel 2.6.9-55.0.12.ELsmp
>
> root at wn111:/sys/class/infiniband/mthca0/ports/1/counters # ls -al
> total 0
> drwxr-xr-x 2 root root    0 May  6 15:45 ./
> drwxr-xr-x 5 root root    0 May  6 15:45 ../
> -r--r--r-- 1 root root 4096 May  6 15:45 VL15_dropped
> -r--r--r-- 1 root root 4096 May  6 15:45 excessive_buffer_overrun_errors
> -r--r--r-- 1 root root 4096 May  6 15:45 link_downed
> -r--r--r-- 1 root root 4096 May  6 15:45 link_error_recovery
> -r--r--r-- 1 root root 4096 May  6 15:45 local_link_integrity_errors
> -r--r--r-- 1 root root 4096 May  6 15:45 port_rcv_constraint_errors
> -r--r--r-- 1 root root 4096 May  6 15:45 port_rcv_data
> -r--r--r-- 1 root root 4096 May  6 15:45 port_rcv_errors
> -r--r--r-- 1 root root 4096 May  6 15:45 port_rcv_packets
> -r--r--r-- 1 root root 4096 May  6 15:45 port_rcv_remote_physical_errors
> -r--r--r-- 1 root root 4096 May  6 15:45 port_rcv_switch_relay_errors
> -r--r--r-- 1 root root 4096 May  6 15:45 port_xmit_constraint_errors
> -r--r--r-- 1 root root 4096 May  6 15:45 port_xmit_data
> -r--r--r-- 1 root root 4096 May  6 15:45 port_xmit_discards
> -r--r--r-- 1 root root 4096 May  6 15:45 port_xmit_packets
> -r--r--r-- 1 root root 4096 May  6 15:45 symbol_error
>
> On Tue, 19 Feb 2008 at 02:29:56PM +0200, Eli Cohen wrote:
> > cat /sys/class/infiniband/mlx4_0/ports/1/counters/*
> >
> > mlx4_* can be mthca*
> >
> > On Tue, 2008-02-19 at 11:03 +0200, David Minor wrote:
> > > Under Linux with Mellanox ofed, how can I get real-time network
> > > statistics. e.g. how many bytes are being sent and received over each
> > > port at any given time?

From eli at dev.mellanox.co.il  Wed May  7 01:14:24 2008
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Wed, 07 May 2008 11:14:24 +0300
Subject: [ofa-general] [PATCH] IB/IPoIB: Separate IB events to groups and
	handle each according to level of severity
In-Reply-To: <48206690.3090604@Voltaire.COM>
References: <48206690.3090604@Voltaire.COM>
Message-ID: <1210148064.15669.84.camel@mtls03>

On Tue, 2008-05-06 at 17:09 +0300, Moni Shoua wrote:
> The purpose of this patch is to make the events that are related to SM change
> (namely CLIENT_REREGISTER event and SM_CHANGE event) less disruptive.
> When SM related events are handled, it is not necessary to flush unicast
> info from device but only multicast info. This patch divides the events that are
> handled by IPoIB to three categories; 0, 1 and 2 (when 2 does more than 1 and 1
> does more than 0).
> The main change is in __ipoib_ib_dev_flush(). Instead of flagging to the function
> about pkey_events we now use leveling. An event that requires "harder" flushing
> calls this function with higher number for level. Besides the concept,
> the actual change is that SM related events are not flushing unicast info and
> not bringing the device down but only refresh the multicast info in the background.

As far as I know, when an SM change event occurs, it could mean the SM
changed and the new one "decided" to reprogram all the LIDs, for
example. In that case you will issue only level 0 and then all your
neighbours can become invalid.
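For readers following the thread, here is a sketch of the leveling idea under discussion. The IB event constants and __ipoib_ib_dev_flush() are real names (the latter taken from Moni's description); the function signature and the mapping of events to levels shown here are only one plausible reading of the patch, and Eli's objection above is precisely that level 0 may be too weak for SM_CHANGE.

/* Illustrative dispatch only -- not the actual patch. */
static void ipoib_event_to_flush_level(struct ipoib_dev_priv *priv,
				       enum ib_event_type event)
{
	switch (event) {
	case IB_EVENT_SM_CHANGE:
	case IB_EVENT_CLIENT_REREGISTER:
		/* level 0: only refresh multicast info in the background */
		__ipoib_ib_dev_flush(priv, 0);
		break;
	case IB_EVENT_PKEY_CHANGE:
		/* level 1: also flush unicast info */
		__ipoib_ib_dev_flush(priv, 1);
		break;
	case IB_EVENT_PORT_ERR:
		/* level 2: bring the device down and back up */
		__ipoib_ib_dev_flush(priv, 2);
		break;
	default:
		break;
	}
}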
From olaf.kirch at oracle.com  Wed May  7 03:50:22 2008
From: olaf.kirch at oracle.com (Olaf Kirch)
Date: Wed, 7 May 2008 12:50:22 +0200
Subject: [ofa-general] New rds patches for ofed 1.3.1
Message-ID: <200805071250.22719.olaf.kirch@oracle.com>

Hi,

I have two more RDS kernel patches for OFED 1.3.1, and one additional
rds-tools patch. They're available from my git trees at on branch
code-drop-20080507

If you have any feedback, please let me know.

At this point, I'm not going to submit the dma_sync patches yet. I think
they need more testing, and I'd rather postpone them to OFED 1.3.2.

I'll also post these patches in a follow-up email to this message.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

From olaf.kirch at oracle.com  Wed May  7 03:51:52 2008
From: olaf.kirch at oracle.com (Olaf Kirch)
Date: Wed, 7 May 2008 12:51:52 +0200
Subject: [ofa-general] [PATCH 1/3] Change RDMA completion notifications
In-Reply-To: <200805071250.22719.olaf.kirch@oracle.com>
References: <200805071250.22719.olaf.kirch@oracle.com>
Message-ID: <200805071251.53377.olaf.kirch@oracle.com>

commit 9194a75cf945beee95f8fb8ab08015d05aa797d4
Author: Olaf Kirch
Date:   Wed May 7 10:40:13 2008 +0200

    Change RDMA completion notifications

    If the user asked for a completion notification on RDMA ops, we can
    implement three different semantics:
     1. Notify when we received the ACK on the RDS message that was queued
        with the RDMA. This provides reliable notification of RDMA status
        at the expense of a one-way packet delay.
     2. Notify when the IB stack gives us the completion event for the
        RDMA operation.
     3. Notify when the IB stack gives us the completion event for the
        accompanying RDS messages.

    In OFED 1.3, RDS implemented approach #1. This turns out to be too slow
    for some purposes, so I'm switching to approach #3 with this patch.
    I'm leaving the old code in place however, so that we can support
    different modes later if we want.

    Signed-off-by: Olaf Kirch

diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 4bbab10..724167c 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -53,6 +53,23 @@ void rds_ib_send_unmap_rm(struct rds_ib_connection *ic,
 	/* raise rdma completion hwm */
 	if (rm->m_rdma_op && success) {
+		/* If the user asked for a completion notification on this
+		 * message, we can implement three different semantics:
+		 * 1. Notify when we received the ACK on the RDS message
+		 *    that was queued with the RDMA. This provides reliable
+		 *    notification of RDMA status at the expense of a one-way
+		 *    packet delay.
+		 * 2. Notify when the IB stack gives us the completion event for
+		 *    the RDMA operation.
+		 * 3. Notify when the IB stack gives us the completion event for
+		 *    the accompanying RDS messages.
+		 * Here, we implement approach #3. To implement approach #2,
+		 * call rds_rdma_send_complete from the cq_handler.
To implement #1, + * don't call rds_rdma_send_complete at all, and fall back to the notify + * handling in the ACK processing code. + */ + rds_rdma_send_complete(rm); + if (rm->m_rdma_op->r_write) rds_stats_add(s_send_rdma_bytes, rm->m_rdma_op->r_bytes); else diff --git a/net/rds/rdma.h b/net/rds/rdma.h index 2ff0cea..289f962 100644 --- a/net/rds/rdma.h +++ b/net/rds/rdma.h @@ -71,5 +71,6 @@ int rds_cmsg_rdma_args(struct rds_sock *rs, struct rds_message *rm, int rds_cmsg_rdma_map(struct rds_sock *rs, struct rds_message *rm, struct cmsghdr *cmsg); void rds_rdma_free_op(struct rds_rdma_op *ro); +void rds_rdma_send_complete(struct rds_message *rm); #endif diff --git a/net/rds/send.c b/net/rds/send.c index 26e1e3e..2b7661d 100644 --- a/net/rds/send.c +++ b/net/rds/send.c @@ -356,6 +356,42 @@ int rds_send_acked_before(struct rds_connection *conn, u64 seq) } /* + * This is pretty similar to what happens below in the ACK + * handling code - except that we call here as soon as we get + * the IB send completion on the RDMA op and the accompanying + * message. + */ +void rds_rdma_send_complete(struct rds_message *rm) +{ + struct rds_sock *rs = NULL; + struct rds_rdma_op *ro; + struct rds_notifier *notifier; + + spin_lock(&rm->m_rs_lock); + + ro = rm->m_rdma_op; + if (test_bit(RDS_MSG_ON_SOCK, &rm->m_flags) + && ro && ro->r_notify + && (notifier = ro->r_notifier) != NULL) { + rs = rm->m_rs; + sock_hold(rds_rs_to_sk(rs)); + + spin_lock(&rs->rs_lock); + list_add_tail(¬ifier->n_list, &rs->rs_notify_queue); + spin_unlock(&rs->rs_lock); + + ro->r_notifier = NULL; + } + + spin_unlock(&rm->m_rs_lock); + + if (rs) { + rds_wake_sk_sleep(rs); + sock_put(rds_rs_to_sk(rs)); + } +} + +/* * This removes messages from the socket's list if they're on it. The list * argument must be private to the caller, we must be able to modify it * without locks. The messages must have a reference held for their -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From olaf.kirch at oracle.com Wed May 7 03:53:56 2008 From: olaf.kirch at oracle.com (Olaf Kirch) Date: Wed, 7 May 2008 12:53:56 +0200 Subject: [ofa-general] [PATCH 3/3] rds-stress: fix RDS congestion monitoring In-Reply-To: <200805071252.46590.olaf.kirch@oracle.com> References: <200805071250.22719.olaf.kirch@oracle.com> <200805071251.53377.olaf.kirch@oracle.com> <200805071252.46590.olaf.kirch@oracle.com> Message-ID: <200805071253.57429.olaf.kirch@oracle.com> commit 4dacd1a8270aa226bfff157af4519fa33c820253 Author: Olaf Kirch Date: Wed May 7 10:42:55 2008 +0200 Fix RDS congestion monitoring The RDS congestion monitoring code tries to help applications deal with remote congestion more efficiently. If enabled, an application that tried to send to a congested port will receive a notification as soon as the port becomes uncongested again. 
For efficiency reasons, the application isn't given a complete 8K congestion bitmap, but a 64bit mask that represents the ports having changed, with port N being represented by (1 << (port % 64)) The macro used to translate port numbers to the mask bit shifted integer 1, not 1ULL, resulting in undefined behavior when (port % 64) >= 32 Signed-off-by: Olaf Kirch diff --git a/net/ib_rds.h b/net/ib_rds.h index cea73fc..e098036 100644 --- a/net/ib_rds.h +++ b/net/ib_rds.h @@ -176,7 +176,7 @@ struct rds_info_tcp_socket { */ #define RDS_CONG_MONITOR_SIZE 64 #define RDS_CONG_MONITOR_BIT(port) (((unsigned int) port) % RDS_CONG_MONITOR_SIZE) -#define RDS_CONG_MONITOR_MASK(port) (1 << RDS_CONG_MONITOR_BIT(port)) +#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port)) /* * RDMA related types -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From olaf.kirch at oracle.com Wed May 7 03:54:13 2008 From: olaf.kirch at oracle.com (Olaf Kirch) Date: Wed, 7 May 2008 12:54:13 +0200 Subject: [ofa-general] [PATCH 2/3] RDS: Fix RDS congestion monitoring In-Reply-To: <200805071251.53377.olaf.kirch@oracle.com> References: <200805071250.22719.olaf.kirch@oracle.com> <200805071251.53377.olaf.kirch@oracle.com> Message-ID: <200805071254.14211.olaf.kirch@oracle.com> commit 01b72f27721c2dccf42dd5eea2e5d5e573cd8585 Author: Olaf Kirch Date: Wed May 7 10:40:36 2008 +0200 Fix RDS congestion monitoring The RDS congestion monitoring code tries to help applications deal with remote congestion more efficiently. If enabled, an application that tried to send to a congested port will receive a notification as soon as the port becomes uncongested again. For efficiency reasons, the application isn't given a complete 8K congestion bitmap, but a 64bit mask that represents the ports having changed, with port N being represented by (1 << (port % 64)) This code had several bugs in it: - the macro used to translate port numbers to the mask bit shifted integer 1, not 1ULL, resulting in undefined behavior when (port % 64) >= 32 - rds_ib_cong_recv computes the 64bit mask of all ports that changed from congested to uncongested. It got the bit arithmetics wrong. Also, it used be64_to_cpu to convert the mask, which is wrong, as the congestion map is little endian - in the IB send completion handler, we need to check whether there is a pending congestion map update we need to send. 
- in rds_poll, we should grab the rs_lock spinlock when testing whether
  rs_cong_mask is non-zero

Signed-off-by: Olaf Kirch

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 7c20dd4..2fa2d0e 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -165,8 +165,10 @@ static unsigned int rds_poll(struct file *file, struct socket *sock,
 		if (rds_cong_updated_since(&rs->rs_cong_track))
 			mask |= (POLLIN | POLLRDNORM | POLLWRBAND);
 	} else {
+		spin_lock(&rs->rs_lock);
 		if (rs->rs_cong_notify)
 			mask |= (POLLIN | POLLRDNORM);
+		spin_unlock(&rs->rs_lock);
 	}
 	if (!list_empty(&rs->rs_recv_queue)
 	 || !list_empty(&rs->rs_notify_queue))
diff --git a/net/rds/cong.c b/net/rds/cong.c
index 4ec85ce..beeb539 100644
--- a/net/rds/cong.c
+++ b/net/rds/cong.c
@@ -238,7 +238,7 @@ void rds_cong_map_updated(struct rds_cong_map *map, uint64_t portmask)
 	if (waitqueue_active(&rds_poll_waitq))
 		wake_up_all(&rds_poll_waitq);

-	if (!list_empty(&rds_cong_monitor)) {
+	if (portmask && !list_empty(&rds_cong_monitor)) {
 		unsigned long flags;
 		struct rds_sock *rs;

diff --git a/net/rds/ib_rds.h b/net/rds/ib_rds.h
index cea73fc..e098036 100644
--- a/net/rds/ib_rds.h
+++ b/net/rds/ib_rds.h
@@ -176,7 +176,7 @@ struct rds_info_tcp_socket {
  */
 #define RDS_CONG_MONITOR_SIZE	64
 #define RDS_CONG_MONITOR_BIT(port)  (((unsigned int) port) % RDS_CONG_MONITOR_SIZE)
-#define RDS_CONG_MONITOR_MASK(port) (1 << RDS_CONG_MONITOR_BIT(port))
+#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))

 /*
  * RDMA related types
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 6c9cc9e..8ffbb0c 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -639,8 +639,9 @@ static void rds_ib_cong_recv(struct rds_connection *conn,
 		src = addr + frag_off;
 		dst = (void *)map->m_page_addrs[map_page] + map_off;
 		for (k = 0; k < to_copy; k += 8) {
-			/* Record ports that became uncongested */
-			uncongested |= *src & (*src ^ *dst);
+			/* Record ports that became uncongested, ie
+			 * bits that changed from 0 to 1. */
+			uncongested |= ~(*src) & *dst;
 			*dst++ = *src++;
 		}
 		kunmap_atomic(addr, KM_SOFTIRQ0);
@@ -662,7 +663,7 @@ static void rds_ib_cong_recv(struct rds_connection *conn,
 	}

 	/* the congestion map is in little endian order */
-	uncongested = be64_to_cpu(uncongested);
+	uncongested = le64_to_cpu(uncongested);
 	rds_cong_map_updated(map, uncongested);
 }

diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 724167c..567f62f 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -218,7 +218,8 @@ void rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context)

 	rds_ib_ring_free(&ic->i_send_ring, completed);

-	if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags))
+	if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags)
+	 || test_bit(0, &conn->c_map_queued))
 		queue_delayed_work(rds_wq, &conn->c_send_w, 0);

 	/* We expect errors as the qp is drained during shutdown */
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

From andrea at qumranet.com  Wed May  7 07:35:51 2008
From: andrea at qumranet.com (Andrea Arcangeli)
Date: Wed, 07 May 2008 16:35:51 +0200
Subject: [ofa-general] [PATCH 01 of 11] mmu-notifier-core
In-Reply-To:
Message-ID:

# HG changeset patch
# User Andrea Arcangeli
# Date 1210096013 -7200
# Node ID e20917dcc8284b6a07cfcced13dda4cbca850a9c
# Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8
mmu-notifier-core

With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to
pages. There are secondary MMUs (with secondary sptes and secondary
tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
spte in mmu-notifier context, I mean "secondary pte". In the GRU case
there's no actual secondary pte and there's only a secondary tlb,
because the GRU secondary MMU has no knowledge about sptes and every
secondary tlb miss event in the MMU always generates a page fault that
has to be resolved by the CPU (this is not the case of KVM, where a
secondary tlb miss will walk sptes in hardware and will refill the
secondary tlb transparently to software if the corresponding spte is
present).

The same way zap_page_range has to invalidate the pte before freeing
the page, the spte (and secondary tlb) must also be invalidated before
any page is freed and reused.

Currently we take a page_count pin on every page mapped by sptes, but
that means the pages can't be swapped whenever they're mapped by any
spte, because they're part of the guest working set. Furthermore a spte
unmap event can immediately lead to a page being freed when the pin is
released (requiring the same complex and relatively slow tlb_gather
smp-safe logic we have in zap_page_range, which can be avoided
completely if the spte unmap event doesn't require an unpin of the page
previously mapped in the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary
MMU, so that the secondary MMU code can drop sptes before the pages are
freed, avoiding all page pinning and allowing 100% reliable swapping of
guest physical address space. It also spares the code that tears down
secondary MMU mappings from implementing tlb_gather-like logic as in
zap_page_range, which would require many IPIs to flush other cpu tlbs
for each fixed number of sptes unmapped.

To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect), the secondary MMU mappings
will be invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establish an updated
spte or secondary-tlb-mapping on the copied page. Or it will set up a
readonly spte or readonly tlb mapping if it's a guest-read, if it calls
get_user_pages with write=0. This is just an example.

This allows mapping any page pointed to by any pte (and in turn visible
in the primary CPU MMU) into a secondary MMU (be it a pure tlb like
GRU, or a full MMU with both sptes and secondary tlb like the
shadow-pagetable layer of kvm), or into a software remote DMA like
XPMEM (hence the need to schedule in XPMEM code to send the invalidate
to the remote node, while there's no need to schedule in kvm/gru, as
it's an immediate event like invalidating a primary-mmu pte).

At least for KVM without this patch it's impossible to swap guests
reliably.
And having this feature and removing the page pin allows several other
optimizations that simplify life considerably.

Dependencies:

1) Introduces list_del_init_rcu and documents it (fixes a comment for
   list_del_rcu too)

2) mm_lock() to register the mmu notifier when the whole VM isn't doing
   anything with "mm". This allows mmu notifier users to keep track of
   whether the VM is in the middle of the invalidate_range_begin/end
   critical section with an atomic counter increased in range_begin and
   decreased in range_end. No secondary MMU page fault is allowed to map
   any spte or secondary tlb reference while the VM is in the middle of
   range_begin/end, as any page returned by get_user_pages in that
   critical section could later immediately be freed without any further
   ->invalidate_page notification (invalidate_range_begin/end works on
   ranges and ->invalidate_page isn't called immediately before freeing
   the page). To stop all page freeing and pagetable overwrites, the
   mmap_sem must be taken in write mode and all other anon_vma/i_mmap
   locks must be taken in virtual address order. The order is critical
   to avoid mm_lock(mm1) and mm_lock(mm2) running concurrently and
   triggering lock inversion deadlocks.

3) It'd be a waste to add branches in the VM if nobody could possibly
   run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be
   enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take
   advantage of mmu notifiers, but this already allows compiling a KVM
   external module against a kernel with mmu notifiers enabled, and
   from the next pull from kvm.git we'll start using them. And GRU/XPMEM
   will also be able to continue development by enabling KVM=m in their
   config, until they submit all GRU/XPMEM GPLv2 code to the mainline
   kernel. Then they can also enable MMU_NOTIFIERS in the same way KVM
   does it (even if KVM=n). This guarantees nobody selects
   MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n.

The mmu_notifier_register call can fail because mm_lock may not allocate
the required vmalloc space. See the comment on top of the mm_lock()
implementation for the worst-case memory requirements. Because
mmu_notifier_register is used at driver startup, a failure can be
gracefully handled. Here is an example of the change applied to kvm to
register the mmu notifiers. Usually when a driver starts up, other
allocations are required anyway and -ENOMEM failure paths exist already.

 struct kvm *kvm_arch_create_vm(void)
 {
 	struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+	int err;

 	if (!kvm)
 		return ERR_PTR(-ENOMEM);

 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

+	kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+	err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+	if (err) {
+		kfree(kvm);
+		return ERR_PTR(err);
+	}
+
 	return kvm;
 }

mmu_notifier_unregister returns void and it's reliable.

Signed-off-by: Andrea Arcangeli
Signed-off-by: Nick Piggin
Signed-off-by: Christoph Lameter

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM
 	select PREEMPT_NOTIFIERS
+	select MMU_NOTIFIER
 	select ANON_INODES
 	---help---
 	  Support hosting fully virtualized guest machines using hardware
diff --git a/include/linux/list.h b/include/linux/list.h
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis
  * or hlist_del_rcu(), running on this same list.
+ */ +struct mmu_notifier_mm { + /* all mmu notifiers registered in this mm are queued in this list */ + struct hlist_head list; + /* srcu structure for this mm */ + struct srcu_struct srcu; + /* to serialize the list modifications and hlist_unhashed */ + spinlock_t lock; +}; + +struct mmu_notifier_ops { + /* + * Called either by mmu_notifier_unregister or when the mm is + * being destroyed by exit_mmap, always before all pages are + * freed. This can run concurrently with other mmu notifier + * methods (the ones invoked outside the mm context) and it + * should tear down all secondary mmu mappings and freeze the + * secondary mmu. If this method isn't implemented you have to + * be sure that nothing could possibly write to the pages + * through the secondary mmu by the time the last thread with + * tsk->mm == mm exits. + * + * As a side note: the pages freed after ->release returns could + * be immediately reallocated by the gart at an alias physical + * address with a different cache model, so if ->release isn't + * implemented because all _software_ driven memory accesses + * through the secondary mmu are terminated by the time the + * last thread of this mm quits, you also have to be sure that + * speculative _hardware_ operations can't allocate dirty + * cachelines in the cpu that could not be snooped and made + * coherent with the other read and write operations happening + * through the gart alias address, thus leading to memory + * corruption. + */ + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + + /* + * clear_flush_young is called after the VM is + * test-and-clearing the young/accessed bitflag in the + * pte. This way the VM will provide proper aging to the + * accesses to the page through the secondary MMUs and not + * only to the ones through the Linux pte. + */ + int (*clear_flush_young)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * Before this is invoked any secondary MMU is still ok to + * read/write to the page previously pointed to by the Linux + * pte because the page hasn't been freed yet and it won't be + * freed until this returns. If required set_page_dirty has to + * be called internally to this method. + */ + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * invalidate_range_start() and invalidate_range_end() must be + * paired and are called only when the mmap_sem and/or the + * locks protecting the reverse maps are held. Both functions + * may sleep. The subsystem must guarantee that no additional + * references are taken to the pages in the range established + * between the call to invalidate_range_start() and the + * matching call to invalidate_range_end(). + * + * Invalidation of multiple concurrent ranges may be + * optionally permitted by the driver. Either way the + * establishment of sptes is forbidden in the range passed to + * invalidate_range_begin/end for the whole duration of the + * invalidate_range_begin/end critical section. + * + * invalidate_range_start() is called when all pages in the + * range are still mapped and have at least a refcount of one. + * + * invalidate_range_end() is called when all pages in the + * range have been unmapped and the pages have been freed by + * the VM. + * + * The VM will remove the page table entries and potentially + * the page between invalidate_range_start() and + * invalidate_range_end().
If the page must not be freed + * because of pending I/O or other circumstances then the + * invalidate_range_start() callback (or the initial mapping + * by the driver) must make sure that the refcount is kept + * elevated. + * + * If the driver increases the refcount when the pages are + * initially mapped into an address space then either + * invalidate_range_start() or invalidate_range_end() may + * decrease the refcount. If the refcount is decreased on + * invalidate_range_start() then the VM can free pages as page + * table entries are removed. If the refcount is only + * dropped on invalidate_range_end() then the driver itself + * will drop the last refcount but it must take care to flush + * any secondary tlb before doing the final free on the + * page. Pages will no longer be referenced by the linux + * address space but may still be referenced by sptes until + * the last refcount is dropped. + */ + void (*invalidate_range_start)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_range_end)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); +}; + +/* + * The notifier chains are protected by mmap_sem and/or the reverse map + * semaphores. Notifier chains are only changed when all reverse maps and + * the mmap_sem locks are taken. + * + * Therefore notifier chains can only be traversed when either + * + * 1. mmap_sem is held. + * 2. One of the reverse map locks is held (i_mmap_sem or anon_vma->sem). + * 3. No other concurrent thread can access the list (release) + */ +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_ops *ops; +}; + +static inline int mm_has_notifiers(struct mm_struct *mm) +{ + return unlikely(mm->mmu_notifier_mm); +} + +extern int mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +extern int __mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void __mmu_notifier_mm_destroy(struct mm_struct *mm); +extern void __mmu_notifier_release(struct mm_struct *mm); +extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end); +extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end); + + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_release(mm); +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + return __mmu_notifier_clear_flush_young(mm, address); + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_page(mm, address); +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_start(mm, start, end); +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_end(mm, start, end); +} + +static inline void
mmu_notifier_mm_init(struct mm_struct *mm) +{ + mm->mmu_notifier_mm = NULL; +} + +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_mm_destroy(mm); +} + +/* + * These two macros will sometime replace ptep_clear_flush. + * ptep_clear_flush is impleemnted as macro itself, so this also is + * implemented as a macro until ptep_clear_flush will converted to an + * inline function, to diminish the risk of compilation failure. The + * invalidate_page method over time can be moved outside the PT lock + * and these two macros can be later removed. + */ +#define ptep_clear_flush_notify(__vma, __address, __ptep) \ +({ \ + pte_t __pte; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __pte = ptep_clear_flush(___vma, ___address, __ptep); \ + mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ + __pte; \ +}) + +#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ +({ \ + int __young; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ + __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \ + ___address); \ + __young; \ +}) + +#else /* CONFIG_MMU_NOTIFIER */ + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_mm_init(struct mm_struct *mm) +{ +} + +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) +{ +} + +#define ptep_clear_flush_young_notify ptep_clear_flush_young +#define ptep_clear_flush_notify ptep_clear_flush + +#endif /* CONFIG_MMU_NOTIFIER */ + +#endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/include/linux/srcu.h b/include/linux/srcu.h --- a/include/linux/srcu.h +++ b/include/linux/srcu.h @@ -27,6 +27,8 @@ #ifndef _LINUX_SRCU_H #define _LINUX_SRCU_H +#include + struct srcu_struct_array { int c[2]; }; diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -53,6 +53,7 @@ #include #include #include +#include #include #include @@ -385,6 +386,7 @@ static struct mm_struct * mm_init(struct if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; + mmu_notifier_mm_init(mm); return mm; } @@ -417,6 +419,7 @@ void __mmdrop(struct mm_struct *mm) BUG_ON(mm == &init_mm); mm_free_pgd(mm); destroy_context(mm); + mmu_notifier_mm_destroy(mm); free_mm(mm); } EXPORT_SYMBOL_GPL(__mmdrop); diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -205,3 +205,6 @@ config VIRT_TO_BUS config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MMU_NOTIFIER + bool diff --git a/mm/Makefile b/mm/Makefile --- a/mm/Makefile +++ b/mm/Makefile @@ -33,4 +33,5 @@ obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -188,7 +188,7 @@ __xip_unmap (struct address_space * mapp if 
(pte) { /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); diff --git a/mm/fremap.c b/mm/fremap.c --- a/mm/fremap.c +++ b/mm/fremap.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns spin_unlock(&mapping->i_mmap_lock); } + mmu_notifier_invalidate_range_start(mm, start, start + size); err = populate_range(mm, vma, start, size, pgoff); + mmu_notifier_invalidate_range_end(mm, start, start + size); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include @@ -823,6 +824,7 @@ void __unmap_hugepage_range(struct vm_ar BUG_ON(start & ~HPAGE_MASK); BUG_ON(end & ~HPAGE_MASK); + mmu_notifier_invalidate_range_start(mm, start, end); spin_lock(&mm->page_table_lock); for (address = start; address < end; address += HPAGE_SIZE) { ptep = huge_pte_offset(mm, address); @@ -843,6 +845,7 @@ void __unmap_hugepage_range(struct vm_ar } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mmu_notifier_invalidate_range_end(mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -51,6 +51,7 @@ #include #include #include +#include #include #include @@ -632,6 +633,7 @@ int copy_page_range(struct mm_struct *ds unsigned long next; unsigned long addr = vma->vm_start; unsigned long end = vma->vm_end; + int ret; /* * Don't copy ptes where a page fault will fill them correctly. @@ -647,17 +649,33 @@ int copy_page_range(struct mm_struct *ds if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); + /* + * We need to invalidate the secondary MMU mappings only when + * there could be a permission downgrade on the ptes of the + * parent mm. And a permission downgrade will only happen if + * is_cow_mapping() returns true. + */ + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_start(src_mm, addr, end); + + ret = 0; dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(src_pgd)) continue; - if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, - vma, addr, next)) - return -ENOMEM; + if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, + vma, addr, next))) { + ret = -ENOMEM; + break; + } } while (dst_pgd++, src_pgd++, addr = next, addr != end); - return 0; + + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_end(src_mm, + vma->vm_start, end); + return ret; } static unsigned long zap_pte_range(struct mmu_gather *tlb, @@ -861,7 +879,9 @@ unsigned long unmap_vmas(struct mmu_gath unsigned long start = start_addr; spinlock_t *i_mmap_lock = details? 
details->i_mmap_lock: NULL; int fullmm = (*tlbp)->fullmm; + struct mm_struct *mm = vma->vm_mm; + mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { unsigned long end; @@ -912,6 +932,7 @@ unsigned long unmap_vmas(struct mmu_gath } } out: + mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ } @@ -1541,10 +1562,11 @@ int apply_to_page_range(struct mm_struct { pgd_t *pgd; unsigned long next; - unsigned long end = addr + size; + unsigned long start = addr, end = addr + size; int err; BUG_ON(addr >= end); + mmu_notifier_invalidate_range_start(mm, start, end); pgd = pgd_offset(mm, addr); do { next = pgd_addr_end(addr, end); @@ -1552,6 +1574,7 @@ int apply_to_page_range(struct mm_struct if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier_invalidate_range_end(mm, start, end); return err; } EXPORT_SYMBOL_GPL(apply_to_page_range); @@ -1753,7 +1776,7 @@ gotten: * seen in the presence of one thread doing SMC and another * thread doing COW. */ - ptep_clear_flush(vma, address, page_table); + ptep_clear_flush_notify(vma, address, page_table); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -26,6 +26,9 @@ #include #include #include +#include +#include +#include #include #include @@ -2048,6 +2051,7 @@ void exit_mmap(struct mm_struct *mm) /* mm's last user has gone, and its about to be pulled down */ arch_exit_mmap(mm); + mmu_notifier_release(mm); lru_add_drain(); flush_cache_mm(mm); @@ -2255,3 +2259,194 @@ int install_special_mapping(struct mm_st return 0; } + +static int mm_lock_cmp(const void *a, const void *b) +{ + unsigned long _a = (unsigned long)*(spinlock_t **)a; + unsigned long _b = (unsigned long)*(spinlock_t **)b; + + cond_resched(); + if (_a < _b) + return -1; + if (_a > _b) + return 1; + return 0; +} + +static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks, + int anon) +{ + struct vm_area_struct *vma; + size_t i = 0; + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + if (anon) { + if (vma->anon_vma) + locks[i++] = &vma->anon_vma->lock; + } else { + if (vma->vm_file && vma->vm_file->f_mapping) + locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock; + } + } + + if (!i) + goto out; + + sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL); + +out: + return i; +} + +static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm, + spinlock_t **locks) +{ + return mm_lock_sort(mm, locks, 1); +} + +static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm, + spinlock_t **locks) +{ + return mm_lock_sort(mm, locks, 0); +} + +static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock) +{ + spinlock_t *last = NULL; + size_t i; + + for (i = 0; i < nr; i++) + /* Multiple vmas may use the same lock. */ + if (locks[i] != last) { + BUG_ON((unsigned long) last > (unsigned long) locks[i]); + last = locks[i]; + if (lock) + spin_lock(last); + else + spin_unlock(last); + } +} + +static inline void __mm_lock(spinlock_t **locks, size_t nr) +{ + mm_lock_unlock(locks, nr, 1); +} + +static inline void __mm_unlock(spinlock_t **locks, size_t nr) +{ + mm_lock_unlock(locks, nr, 0); +} + +/* + * This operation locks against the VM for all pte/vma/mm related + * operations that could ever happen on a certain mm. This includes + * vmtruncate, try_to_unmap, and all page faults. 
+ * + * The caller must take the mmap_sem in read or write mode before + * calling mm_lock(). The caller isn't allowed to release the mmap_sem + * until mm_unlock() returns. + * + * While mm_lock() itself won't strictly require the mmap_sem in write + * mode to be safe, in order to block all operations that could modify + * pagetables and free pages without need of altering the vma layout + * (for example populate_range() with nonlinear vmas) the mmap_sem + * must be taken in write mode by the caller. + * + * A single task can't take more than one mm_lock in a row or it would + * deadlock. + * + * The sorting is needed to avoid lock inversion deadlocks if two + * tasks run mm_lock at the same time on different mms that happen to + * share some anon_vmas/inodes mapped in different order. + * + * mm_lock and mm_unlock are expensive operations that may have to + * take thousands of locks. Thanks to sort() the complexity is + * O(N*log(N)) where N is the number of VMAs in the mm. The max number + * of vmas is defined in /proc/sys/vm/max_map_count. + * + * mm_lock() can fail if memory allocation fails. The worst case + * vmalloc allocation required is 2*max_map_count*sizeof(spinlock_t *), + * so around 1Mbyte, but in practice it'll be much less because + * normally there won't be max_map_count vmas allocated in the task + * that runs mm_lock(). + * + * The vmalloc memory allocated by mm_lock is stored in the + * mm_lock_data structure that must be allocated by the caller and it + * must be later passed to mm_unlock that will free it after using it. + * Allocating the mm_lock_data structure on the stack is fine because + * it's only a couple of bytes in size. + * + * If mm_lock() returns -ENOMEM no memory has been allocated and the + * mm_lock_data structure can be freed immediately, and mm_unlock must + * not be called. + */ +int mm_lock(struct mm_struct *mm, struct mm_lock_data *data) +{ + spinlock_t **anon_vma_locks, **i_mmap_locks; + + if (mm->map_count) { + anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); + if (unlikely(!anon_vma_locks)) + return -ENOMEM; + + i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); + if (unlikely(!i_mmap_locks)) { + vfree(anon_vma_locks); + return -ENOMEM; + } + + /* + * When mm_lock_sort_anon_vma/i_mmap returns zero it + * means there's no lock to take and so we can free + * the array here without waiting for mm_unlock. mm_unlock + * will do nothing if nr_i_mmap/anon_vma_locks is + * zero. + */ + data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks); + data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks); + + if (data->nr_anon_vma_locks) { + __mm_lock(anon_vma_locks, data->nr_anon_vma_locks); + data->anon_vma_locks = anon_vma_locks; + } else + vfree(anon_vma_locks); + + if (data->nr_i_mmap_locks) { + __mm_lock(i_mmap_locks, data->nr_i_mmap_locks); + data->i_mmap_locks = i_mmap_locks; + } else + vfree(i_mmap_locks); + } + return 0; +} + +static void mm_unlock_vfree(spinlock_t **locks, size_t nr) +{ + __mm_unlock(locks, nr); + vfree(locks); +} + +/* + * mm_unlock doesn't require any memory allocation and it won't fail. + * + * The mmap_sem cannot be released until mm_unlock returns. + * + * All memory has been previously allocated by mm_lock and it'll be + * all freed before returning. Only after mm_unlock returns is the + * caller allowed to free and forget the mm_lock_data structure. + * + * mm_unlock runs in O(N) where N is the max number of VMAs in the + * mm.
The max number of vmas is defined in + * /proc/sys/vm/max_map_count. + */ +void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data) +{ + if (mm->map_count) { + if (data->nr_anon_vma_locks) + mm_unlock_vfree(data->anon_vma_locks, + data->nr_anon_vma_locks); + if (data->nr_i_mmap_locks) + mm_unlock_vfree(data->i_mmap_locks, + data->nr_i_mmap_locks); + } +} diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c new file mode 100644 --- /dev/null +++ b/mm/mmu_notifier.c @@ -0,0 +1,292 @@ +/* + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. + * Copyright (C) 2008 SGI + * Christoph Lameter + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include +#include +#include +#include +#include +#include +#include + +/* + * This function can't run concurrently against mmu_notifier_register + * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap + * runs with mm_users == 0. Other tasks may still invoke mmu notifiers + * in parallel despite there being no task using this mm any more, + * through the vmas outside of the exit_mmap context, such as with + * vmtruncate. This serializes against mmu_notifier_unregister with + * the mmu_notifier_mm->lock in addition to SRCU and it serializes + * against the other mmu notifiers with SRCU. struct mmu_notifier_mm + * can't go away from under us as exit_mmap holds an mm_count pin + * itself. + */ +void __mmu_notifier_release(struct mm_struct *mm) +{ + struct mmu_notifier *mn; + int srcu; + + spin_lock(&mm->mmu_notifier_mm->lock); + while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) { + mn = hlist_entry(mm->mmu_notifier_mm->list.first, + struct mmu_notifier, + hlist); + /* + * We arrived before mmu_notifier_unregister so + * mmu_notifier_unregister will do nothing other than + * to wait for ->release to finish and + * for mmu_notifier_unregister to return. + */ + hlist_del_init_rcu(&mn->hlist); + /* + * SRCU here will block mmu_notifier_unregister until + * ->release returns. + */ + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + spin_unlock(&mm->mmu_notifier_mm->lock); + /* + * if ->release runs before mmu_notifier_unregister it + * must be handled as it's the only way for the driver + * to flush all existing sptes and stop the driver + * from establishing any more sptes before all the + * pages in the mm are freed. + */ + if (mn->ops->release) + mn->ops->release(mn, mm); + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + spin_lock(&mm->mmu_notifier_mm->lock); + } + spin_unlock(&mm->mmu_notifier_mm->lock); + + /* + * synchronize_srcu here prevents mmu_notifier_release from + * returning to exit_mmap (which would proceed freeing all pages + * in the mm) until the ->release method returns, if it was + * invoked by mmu_notifier_unregister. + * + * The mmu_notifier_mm can't go away from under us because one + * mm_count is held by exit_mmap. + */ + synchronize_srcu(&mm->mmu_notifier_mm->srcu); +} + +/* + * If no young bitflag is supported by the hardware, ->clear_flush_young can + * unmap the address and return 1 or 0 depending on whether the mapping + * previously existed or not.
+ */ +int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int young = 0, srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->clear_flush_young) + young |= mn->ops->clear_flush_young(mn, mm, address); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + + return young; +} + +void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_page) + mn->ops->invalidate_page(mn, mm, address); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_range_start) + mn->ops->invalidate_range_start(mn, mm, start, end); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_range_end) + mn->ops->invalidate_range_end(mn, mm, start, end); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +static int do_mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm, + int take_mmap_sem) +{ + struct mm_lock_data data; + struct mmu_notifier_mm * mmu_notifier_mm; + int ret; + + BUG_ON(atomic_read(&mm->mm_users) <= 0); + + ret = -ENOMEM; + mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL); + if (unlikely(!mmu_notifier_mm)) + goto out; + + ret = init_srcu_struct(&mmu_notifier_mm->srcu); + if (unlikely(ret)) + goto out_kfree; + + if (take_mmap_sem) + down_write(&mm->mmap_sem); + ret = mm_lock(mm, &data); + if (unlikely(ret)) + goto out_cleanup; + + if (!mm_has_notifiers(mm)) { + INIT_HLIST_HEAD(&mmu_notifier_mm->list); + spin_lock_init(&mmu_notifier_mm->lock); + mm->mmu_notifier_mm = mmu_notifier_mm; + mmu_notifier_mm = NULL; + } + atomic_inc(&mm->mm_count); + + /* + * Serialize the update against mmu_notifier_unregister. A + * side note: mmu_notifier_release can't run concurrently with + * us because we hold the mm_users pin (either implicitly as + * current->mm or explicitly with get_task_mm() or similar). + * We can't race against any other mmu notifiers either thanks + * to mm_lock(). + */ + spin_lock(&mm->mmu_notifier_mm->lock); + hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list); + spin_unlock(&mm->mmu_notifier_mm->lock); + + mm_unlock(mm, &data); +out_cleanup: + if (take_mmap_sem) + up_write(&mm->mmap_sem); + if (mmu_notifier_mm) + cleanup_srcu_struct(&mmu_notifier_mm->srcu); +out_kfree: + /* kfree() does nothing if mmu_notifier_mm is NULL */ + kfree(mmu_notifier_mm); +out: + BUG_ON(atomic_read(&mm->mm_users) <= 0); + return ret; +} + +/* + * Must not hold mmap_sem nor any other VM related lock when calling + * this registration function. 
Must also ensure mm_users can't go down + * to zero while this runs to avoid races with mmu_notifier_release, + * so mm has to be current->mm or the mm should be pinned safely such + * as with get_task_mm(). If the mm is not current->mm, the mm_users + * pin should be released by calling mmput after mmu_notifier_register + * returns. mmu_notifier_unregister must always be called to + * unregister the notifier. mm_count is automatically pinned to allow + * mmu_notifier_unregister to safely run at any time later, before or + * after exit_mmap. ->release will always be called before exit_mmap + * frees the pages. + */ +int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + return do_mmu_notifier_register(mn, mm, 1); +} +EXPORT_SYMBOL_GPL(mmu_notifier_register); + +/* + * Same as mmu_notifier_register but here the caller must hold the + * mmap_sem in write mode. + */ +int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + return do_mmu_notifier_register(mn, mm, 0); +} +EXPORT_SYMBOL_GPL(__mmu_notifier_register); + +/* this is called after the last mmu_notifier_unregister() has returned */ +void __mmu_notifier_mm_destroy(struct mm_struct *mm) +{ + BUG_ON(!hlist_empty(&mm->mmu_notifier_mm->list)); + cleanup_srcu_struct(&mm->mmu_notifier_mm->srcu); + kfree(mm->mmu_notifier_mm); + mm->mmu_notifier_mm = LIST_POISON1; /* debug */ +} + +/* + * This releases the mm_count pin automatically and frees the mm + * structure if it was the last user of it. It serializes against + * running mmu notifiers with SRCU and against mmu_notifier_unregister + * with the unregister lock + SRCU. All sptes must be dropped before + * calling mmu_notifier_unregister. ->release or any other notifier + * method may be invoked concurrently with mmu_notifier_unregister, + * and only after mmu_notifier_unregister has returned are we guaranteed + * that ->release or any other method can't run anymore. + */ +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) +{ + BUG_ON(atomic_read(&mm->mm_count) <= 0); + + spin_lock(&mm->mmu_notifier_mm->lock); + if (!hlist_unhashed(&mn->hlist)) { + int srcu; + + hlist_del_rcu(&mn->hlist); + + /* + * SRCU here will force exit_mmap to wait for ->release to finish + * before freeing the pages. + */ + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + spin_unlock(&mm->mmu_notifier_mm->lock); + /* + * exit_mmap will block in mmu_notifier_release to + * guarantee ->release is called before freeing the + * pages. + */ + if (mn->ops->release) + mn->ops->release(mn, mm); + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + } else + spin_unlock(&mm->mmu_notifier_mm->lock); + + /* + * Wait for any running method to finish, of course including + * ->release if it was run by mmu_notifier_release instead of us.
+ */ + synchronize_srcu(&mm->mmu_notifier_mm->srcu); + + BUG_ON(atomic_read(&mm->mm_count) <= 0); + + mmdrop(mm); +} +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); diff --git a/mm/mprotect.c b/mm/mprotect.c --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -198,10 +199,12 @@ success: dirty_accountable = 1; } + mmu_notifier_invalidate_range_start(mm, start, end); if (is_vm_hugetlb_page(vma)) hugetlb_change_protection(vma, start, end, vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); + mmu_notifier_invalidate_range_end(mm, start, end); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); return 0; diff --git a/mm/mremap.c b/mm/mremap.c --- a/mm/mremap.c +++ b/mm/mremap.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -74,7 +75,11 @@ static void move_ptes(struct vm_area_str struct mm_struct *mm = vma->vm_mm; pte_t *old_pte, *new_pte, pte; spinlock_t *old_ptl, *new_ptl; + unsigned long old_start; + old_start = old_addr; + mmu_notifier_invalidate_range_start(vma->vm_mm, + old_start, old_end); if (vma->vm_file) { /* * Subtle point from Rajesh Venkatasubramanian: before @@ -116,6 +121,7 @@ static void move_ptes(struct vm_area_str pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) spin_unlock(&mapping->i_mmap_lock); + mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); } #define LATENCY_LIMIT (64 * PAGE_SIZE) diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -49,6 +49,7 @@ #include #include #include +#include #include @@ -287,7 +288,7 @@ static int page_referenced_one(struct pa if (vma->vm_flags & VM_LOCKED) { referenced++; *mapcount = 1; /* break early from loop */ - } else if (ptep_clear_flush_young(vma, address, pte)) + } else if (ptep_clear_flush_young_notify(vma, address, pte)) referenced++; /* Pretend the page is referenced if the task has the @@ -457,7 +458,7 @@ static int page_mkclean_one(struct page pte_t entry; flush_cache_page(vma, address, pte_pfn(*pte)); - entry = ptep_clear_flush(vma, address, pte); + entry = ptep_clear_flush_notify(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -717,14 +718,14 @@ static int try_to_unmap_one(struct page * skipped over this mm) then we should reactivate it. */ if (!migration && ((vma->vm_flags & VM_LOCKED) || - (ptep_clear_flush_young(vma, address, pte)))) { + (ptep_clear_flush_young_notify(vma, address, pte)))) { ret = SWAP_FAIL; goto out_unmap; } /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -849,12 +850,12 @@ static void try_to_unmap_cluster(unsigne page = vm_normal_page(vma, address, *pte); BUG_ON(!page || PageAnon(page)); - if (ptep_clear_flush_young(vma, address, pte)) + if (ptep_clear_flush_young_notify(vma, address, pte)) continue; /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* If nonlinear, store the file page offset in the pte. 
*/ if (page->index != linear_page_index(vma, address)) From andrea at qumranet.com Wed May 7 07:35:52 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:35:52 +0200 Subject: [ofa-general] [PATCH 02 of 11] get_task_mm In-Reply-To: Message-ID: # HG changeset patch # User Andrea Arcangeli # Date 1210115127 -7200 # Node ID c5badbefeee07518d9d1acca13e94c981420317c # Parent e20917dcc8284b6a07cfcced13dda4cbca850a9c get_task_mm get_task_mm should not succeed if mmput() is running and has reduced the mm_users count to zero. This can occur if a processor follows a task's pointer to an mm struct because that pointer is only cleared after the mmput(). If get_task_mm() succeeds after mmput() has reduced the mm_users to zero then we have the lovely situation that one portion of the kernel is doing all the teardown work for an mm while another portion is happily using it. Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -465,7 +465,8 @@ struct mm_struct *get_task_mm(struct tas if (task->flags & PF_BORROWED_MM) mm = NULL; else - atomic_inc(&mm->mm_users); + if (!atomic_inc_not_zero(&mm->mm_users)) + mm = NULL; } task_unlock(task); return mm; From andrea at qumranet.com Wed May 7 07:35:54 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:35:54 +0200 Subject: [ofa-general] [PATCH 04 of 11] free-pgtables In-Reply-To: Message-ID: <34f6a4bf67ce66714ba2.1210170954@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1210115130 -7200 # Node ID 34f6a4bf67ce66714ba2d5c13a5fed241d34fb09 # Parent d60d200565abde6a8ed45271e53cde9c5c75b426 free-pgtables Move the tlb flushing into free_pgtables. The conversion of the locks taken for reverse map scanning would require taking sleeping locks in free_pgtables() and we cannot sleep while gathering pages for a tlb flush. Move the tlb_gather/tlb_finish call to free_pgtables() to be done for each vma. This may add a number of tlb flushes depending on the number of vmas that cannot be coalesced into one. The first pointer argument to free_pgtables() can then be dropped.
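As a purely illustrative sketch (not part of the patch; "floor" and "ceiling" stand in for the prev/next expressions used at the real call sites), a caller like unmap_region() goes from sharing one mmu_gather across the unmap and the pagetable freeing:

	tlb = tlb_gather_mmu(mm, 0);
	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
	free_pgtables(&tlb, vma, floor, ceiling);
	tlb_finish_mmu(tlb, start, end);

to finishing its own gather first, with free_pgtables() now running a private tlb_gather_mmu()/tlb_finish_mmu() cycle for each vma, which is what makes it legal to sleep between vmas:

	tlb = tlb_gather_mmu(mm, 0);
	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
	tlb_finish_mmu(tlb, start, end);
	free_pgtables(vma, floor, ceiling);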
Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -772,8 +772,8 @@ int walk_page_range(const struct mm_stru void *private); void free_pgd_range(struct mmu_gather **tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); -void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma, - unsigned long floor, unsigned long ceiling); +void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor, + unsigned long ceiling); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma); void unmap_mapping_range(struct address_space *mapping, diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -272,9 +272,11 @@ void free_pgd_range(struct mmu_gather ** } while (pgd++, addr = next, addr != end); } -void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma, - unsigned long floor, unsigned long ceiling) +void free_pgtables(struct vm_area_struct *vma, unsigned long floor, + unsigned long ceiling) { + struct mmu_gather *tlb; + while (vma) { struct vm_area_struct *next = vma->vm_next; unsigned long addr = vma->vm_start; @@ -286,7 +288,8 @@ void free_pgtables(struct mmu_gather **t unlink_file_vma(vma); if (is_vm_hugetlb_page(vma)) { - hugetlb_free_pgd_range(tlb, addr, vma->vm_end, + tlb = tlb_gather_mmu(vma->vm_mm, 0); + hugetlb_free_pgd_range(&tlb, addr, vma->vm_end, floor, next? next->vm_start: ceiling); } else { /* @@ -299,9 +302,11 @@ void free_pgtables(struct mmu_gather **t anon_vma_unlink(vma); unlink_file_vma(vma); } - free_pgd_range(tlb, addr, vma->vm_end, + tlb = tlb_gather_mmu(vma->vm_mm, 0); + free_pgd_range(&tlb, addr, vma->vm_end, floor, next? next->vm_start: ceiling); } + tlb_finish_mmu(tlb, addr, vma->vm_end); vma = next; } } diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1759,9 +1759,9 @@ static void unmap_region(struct mm_struc update_hiwater_rss(mm); unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS, + tlb_finish_mmu(tlb, start, end); + free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? next->vm_start: 0); - tlb_finish_mmu(tlb, start, end); } /* @@ -2060,8 +2060,8 @@ void exit_mmap(struct mm_struct *mm) /* Use -1 here to ensure all VMAs in the mm are unmapped */ end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); tlb_finish_mmu(tlb, 0, end); + free_pgtables(vma, FIRST_USER_ADDRESS, 0); /* * Walk the list again, actually closing and freeing it, From andrea at qumranet.com Wed May 7 07:35:53 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:35:53 +0200 Subject: [ofa-general] [PATCH 03 of 11] invalidate_page outside PT lock In-Reply-To: Message-ID: # HG changeset patch # User Andrea Arcangeli # Date 1210115129 -7200 # Node ID d60d200565abde6a8ed45271e53cde9c5c75b426 # Parent c5badbefeee07518d9d1acca13e94c981420317c invalidate_page outside PT lock Moves all mmu notifier methods outside the PT lock (first and not last step to make them sleep capable). 
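The shape of the conversion, sketched here from the __xip_unmap() hunk below (surrounding code elided, names as in the kernel; only the ordering matters):

	/* before: the notifier fired inside the PT lock */
	pteval = ptep_clear_flush_notify(vma, address, pte);
	...
	pte_unmap_unlock(pte, ptl);
	page_cache_release(page);

	/* after: clear the pte under the lock, drop the lock, then
	   notify; the invalidate still happens before the page is freed */
	pteval = ptep_clear_flush(vma, address, pte);
	...
	pte_unmap_unlock(pte, ptl);
	mmu_notifier_invalidate_page(mm, address);
	page_cache_release(page);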
Signed-off-by: Andrea Arcangeli diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -210,35 +210,6 @@ static inline void mmu_notifier_mm_destr __mmu_notifier_mm_destroy(mm); } -/* - * These two macros will sometime replace ptep_clear_flush. - * ptep_clear_flush is impleemnted as macro itself, so this also is - * implemented as a macro until ptep_clear_flush will converted to an - * inline function, to diminish the risk of compilation failure. The - * invalidate_page method over time can be moved outside the PT lock - * and these two macros can be later removed. - */ -#define ptep_clear_flush_notify(__vma, __address, __ptep) \ -({ \ - pte_t __pte; \ - struct vm_area_struct *___vma = __vma; \ - unsigned long ___address = __address; \ - __pte = ptep_clear_flush(___vma, ___address, __ptep); \ - mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ - __pte; \ -}) - -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ -({ \ - int __young; \ - struct vm_area_struct *___vma = __vma; \ - unsigned long ___address = __address; \ - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ - __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \ - ___address); \ - __young; \ -}) - #else /* CONFIG_MMU_NOTIFIER */ static inline void mmu_notifier_release(struct mm_struct *mm) @@ -274,9 +245,6 @@ static inline void mmu_notifier_mm_destr { } -#define ptep_clear_flush_young_notify ptep_clear_flush_young -#define ptep_clear_flush_notify ptep_clear_flush - #endif /* CONFIG_MMU_NOTIFIER */ #endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -188,11 +188,13 @@ __xip_unmap (struct address_space * mapp if (pte) { /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush_notify(vma, address, pte); + pteval = ptep_clear_flush(vma, address, pte); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); pte_unmap_unlock(pte, ptl); + /* must invalidate_page _before_ freeing the page */ + mmu_notifier_invalidate_page(mm, address); page_cache_release(page); } } diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -1714,9 +1714,10 @@ static int do_wp_page(struct mm_struct * */ page_table = pte_offset_map_lock(mm, pmd, address, &ptl); - page_cache_release(old_page); + new_page = NULL; if (!pte_same(*page_table, orig_pte)) goto unlock; + page_cache_release(old_page); page_mkwrite = 1; } @@ -1732,6 +1733,7 @@ static int do_wp_page(struct mm_struct * if (ptep_set_access_flags(vma, address, page_table, entry,1)) update_mmu_cache(vma, address, entry); ret |= VM_FAULT_WRITE; + old_page = new_page = NULL; goto unlock; } @@ -1776,7 +1778,7 @@ gotten: * seen in the presence of one thread doing SMC and another * thread doing COW. 
*/ - ptep_clear_flush_notify(vma, address, page_table); + ptep_clear_flush(vma, address, page_table); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); @@ -1788,12 +1790,18 @@ gotten: } else mem_cgroup_uncharge_page(new_page); - if (new_page) +unlock: + pte_unmap_unlock(page_table, ptl); + + if (new_page) { + if (new_page == old_page) + /* cow happened, notify before releasing old_page */ + mmu_notifier_invalidate_page(mm, address); page_cache_release(new_page); + } if (old_page) page_cache_release(old_page); -unlock: - pte_unmap_unlock(page_table, ptl); + if (dirty_page) { if (vma->vm_file) file_update_time(vma->vm_file); diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -275,7 +275,7 @@ static int page_referenced_one(struct pa unsigned long address; pte_t *pte; spinlock_t *ptl; - int referenced = 0; + int referenced = 0, clear_flush_young = 0; address = vma_address(page, vma); if (address == -EFAULT) @@ -288,8 +288,11 @@ static int page_referenced_one(struct pa if (vma->vm_flags & VM_LOCKED) { referenced++; *mapcount = 1; /* break early from loop */ - } else if (ptep_clear_flush_young_notify(vma, address, pte)) - referenced++; + } else { + clear_flush_young = 1; + if (ptep_clear_flush_young(vma, address, pte)) + referenced++; + } /* Pretend the page is referenced if the task has the swap token and is in the middle of a page fault. */ @@ -299,6 +302,10 @@ static int page_referenced_one(struct pa (*mapcount)--; pte_unmap_unlock(pte, ptl); + + if (clear_flush_young) + referenced += mmu_notifier_clear_flush_young(mm, address); + out: return referenced; } @@ -458,7 +465,7 @@ static int page_mkclean_one(struct page pte_t entry; flush_cache_page(vma, address, pte_pfn(*pte)); - entry = ptep_clear_flush_notify(vma, address, pte); + entry = ptep_clear_flush(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -466,6 +473,10 @@ static int page_mkclean_one(struct page } pte_unmap_unlock(pte, ptl); + + if (ret) + mmu_notifier_invalidate_page(mm, address); + out: return ret; } @@ -717,15 +728,14 @@ static int try_to_unmap_one(struct page * If it's recently referenced (perhaps page_referenced * skipped over this mm) then we should reactivate it. */ - if (!migration && ((vma->vm_flags & VM_LOCKED) || - (ptep_clear_flush_young_notify(vma, address, pte)))) { + if (!migration && (vma->vm_flags & VM_LOCKED)) { ret = SWAP_FAIL; goto out_unmap; } /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); - pteval = ptep_clear_flush_notify(vma, address, pte); + pteval = ptep_clear_flush(vma, address, pte); /* Move the dirty bit to the physical page now the pte is gone. 
*/ if (pte_dirty(pteval)) @@ -780,6 +790,8 @@ static int try_to_unmap_one(struct page out_unmap: pte_unmap_unlock(pte, ptl); + if (ret != SWAP_FAIL) + mmu_notifier_invalidate_page(mm, address); out: return ret; } @@ -818,7 +830,7 @@ static void try_to_unmap_cluster(unsigne spinlock_t *ptl; struct page *page; unsigned long address; - unsigned long end; + unsigned long start, end; address = (vma->vm_start + cursor) & CLUSTER_MASK; end = address + CLUSTER_SIZE; @@ -839,6 +851,8 @@ static void try_to_unmap_cluster(unsigne if (!pmd_present(*pmd)) return; + start = address; + mmu_notifier_invalidate_range_start(mm, start, end); pte = pte_offset_map_lock(mm, pmd, address, &ptl); /* Update high watermark before we lower rss */ @@ -850,12 +864,12 @@ static void try_to_unmap_cluster(unsigne page = vm_normal_page(vma, address, *pte); BUG_ON(!page || PageAnon(page)); - if (ptep_clear_flush_young_notify(vma, address, pte)) + if (ptep_clear_flush_young(vma, address, pte)) continue; /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush_notify(vma, address, pte); + pteval = ptep_clear_flush(vma, address, pte); /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) @@ -871,6 +885,7 @@ static void try_to_unmap_cluster(unsigne (*mapcount)--; } pte_unmap_unlock(pte - 1, ptl); + mmu_notifier_invalidate_range_end(mm, start, end); } static int try_to_unmap_anon(struct page *page, int migration) From andrea at qumranet.com Wed May 7 07:35:56 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:35:56 +0200 Subject: [ofa-general] [PATCH 06 of 11] rwsem contended In-Reply-To: Message-ID: <0621238970155f8ff2d6.1210170956@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1210115132 -7200 # Node ID 0621238970155f8ff2d60ca4996dcdd470f9c6ce # Parent 20bc6a66a86ef6bd60919cc77ff51d4af741b057 rwsem contended Add a function to rw_semaphores to check if there are any processes waiting for the semaphore. Add rwsem_needbreak to sched.h that works in the same way as spinlock_needbreak(). Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h --- a/include/linux/rwsem.h +++ b/include/linux/rwsem.h @@ -57,6 +57,8 @@ extern void up_write(struct rw_semaphore */ extern void downgrade_write(struct rw_semaphore *sem); +extern int rwsem_is_contended(struct rw_semaphore *sem); + #ifdef CONFIG_DEBUG_LOCK_ALLOC /* * nested locking. NOTE: rwsems are not allowed to recurse diff --git a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2030,6 +2030,15 @@ static inline int spin_needbreak(spinloc #endif } +static inline int rwsem_needbreak(struct rw_semaphore *sem) +{ +#ifdef CONFIG_PREEMPT + return rwsem_is_contended(sem); +#else + return 0; +#endif +} + /* * Reevaluate whether the task has signals pending delivery. * Wake the task if so. diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c --- a/lib/rwsem-spinlock.c +++ b/lib/rwsem-spinlock.c @@ -305,6 +305,18 @@ void __downgrade_write(struct rw_semapho spin_unlock_irqrestore(&sem->wait_lock, flags); } +int rwsem_is_contended(struct rw_semaphore *sem) +{ + /* + * Racy check for an empty list. False positives or negatives + * would be okay. False positive may cause a useless dropping of + * locks. False negatives may cause locks to be held a bit + * longer until the next check. 
+ */ + return !list_empty(&sem->wait_list); +} + +EXPORT_SYMBOL(rwsem_is_contended); EXPORT_SYMBOL(__init_rwsem); EXPORT_SYMBOL(__down_read); EXPORT_SYMBOL(__down_read_trylock); diff --git a/lib/rwsem.c b/lib/rwsem.c --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -251,6 +251,18 @@ asmregparm struct rw_semaphore *rwsem_do return sem; } +int rwsem_is_contended(struct rw_semaphore *sem) +{ + /* + * Racy check for an empty list. False positives or negatives + * would be okay. False positive may cause a useless dropping of + * locks. False negatives may cause locks to be held a bit + * longer until the next check. + */ + return !list_empty(&sem->wait_list); +} + +EXPORT_SYMBOL(rwsem_is_contended); EXPORT_SYMBOL(rwsem_down_read_failed); EXPORT_SYMBOL(rwsem_down_write_failed); EXPORT_SYMBOL(rwsem_wake); From andrea at qumranet.com Wed May 7 07:35:55 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:35:55 +0200 Subject: [ofa-general] [PATCH 05 of 11] unmap vmas tlb flushing In-Reply-To: Message-ID: <20bc6a66a86ef6bd6091.1210170955@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1210115131 -7200 # Node ID 20bc6a66a86ef6bd60919cc77ff51d4af741b057 # Parent 34f6a4bf67ce66714ba2d5c13a5fed241d34fb09 unmap vmas tlb flushing Move the tlb flushing inside of unmap vmas. This saves us from passing a pointer to the TLB structure around and simplifies the callers. Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -744,8 +744,7 @@ struct page *vm_normal_page(struct vm_ar unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *); -unsigned long unmap_vmas(struct mmu_gather **tlb, - struct vm_area_struct *start_vma, unsigned long start_addr, +unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr, unsigned long end_addr, unsigned long *nr_accounted, struct zap_details *); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -849,7 +849,6 @@ static unsigned long unmap_page_range(st /** * unmap_vmas - unmap a range of memory covered by a list of vma's - * @tlbp: address of the caller's struct mmu_gather * @vma: the starting vma * @start_addr: virtual address at which to start unmapping * @end_addr: virtual address at which to end unmapping @@ -861,20 +860,13 @@ static unsigned long unmap_page_range(st * Unmap all pages in the vma list. * * We aim to not hold locks for too long (for scheduling latency reasons). - * So zap pages in ZAP_BLOCK_SIZE bytecounts. This means we need to - * return the ending mmu_gather to the caller. + * So zap pages in ZAP_BLOCK_SIZE bytecounts. * * Only addresses between `start' and `end' will be unmapped. * * The VMA list must be sorted in ascending virtual address order. - * - * unmap_vmas() assumes that the caller will flush the whole unmapped address - * range after unmap_vmas() returns. So the only responsibility here is to - * ensure that any thus-far unmapped pages are flushed before unmap_vmas() - * drops the lock and schedules. 
*/ -unsigned long unmap_vmas(struct mmu_gather **tlbp, - struct vm_area_struct *vma, unsigned long start_addr, +unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr, unsigned long end_addr, unsigned long *nr_accounted, struct zap_details *details) { @@ -883,9 +875,14 @@ unsigned long unmap_vmas(struct mmu_gath int tlb_start_valid = 0; unsigned long start = start_addr; spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL; - int fullmm = (*tlbp)->fullmm; + int fullmm; + struct mmu_gather *tlb; struct mm_struct *mm = vma->vm_mm; + lru_add_drain(); + tlb = tlb_gather_mmu(mm, 0); + update_hiwater_rss(mm); + fullmm = tlb->fullmm; mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { unsigned long end; @@ -912,7 +909,7 @@ unsigned long unmap_vmas(struct mmu_gath (HPAGE_SIZE / PAGE_SIZE); start = end; } else - start = unmap_page_range(*tlbp, vma, + start = unmap_page_range(tlb, vma, start, end, &zap_work, details); if (zap_work > 0) { @@ -920,22 +917,23 @@ unsigned long unmap_vmas(struct mmu_gath break; } - tlb_finish_mmu(*tlbp, tlb_start, start); + tlb_finish_mmu(tlb, tlb_start, start); if (need_resched() || (i_mmap_lock && spin_needbreak(i_mmap_lock))) { if (i_mmap_lock) { - *tlbp = NULL; + tlb = NULL; goto out; } cond_resched(); } - *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm); + tlb = tlb_gather_mmu(vma->vm_mm, fullmm); tlb_start_valid = 0; zap_work = ZAP_BLOCK_SIZE; } } + tlb_finish_mmu(tlb, start_addr, end_addr); out: mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ @@ -951,18 +949,10 @@ unsigned long zap_page_range(struct vm_a unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *details) { - struct mm_struct *mm = vma->vm_mm; - struct mmu_gather *tlb; unsigned long end = address + size; unsigned long nr_accounted = 0; - lru_add_drain(); - tlb = tlb_gather_mmu(mm, 0); - update_hiwater_rss(mm); - end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); - if (tlb) - tlb_finish_mmu(tlb, address, end); - return end; + return unmap_vmas(vma, address, end, &nr_accounted, details); } /* diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1751,15 +1751,10 @@ static void unmap_region(struct mm_struc unsigned long start, unsigned long end) { struct vm_area_struct *next = prev? prev->vm_next: mm->mmap; - struct mmu_gather *tlb; unsigned long nr_accounted = 0; - lru_add_drain(); - tlb = tlb_gather_mmu(mm, 0); - update_hiwater_rss(mm); - unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL); + unmap_vmas(vma, start, end, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - tlb_finish_mmu(tlb, start, end); free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? next->vm_start: 0); } @@ -2044,7 +2039,6 @@ EXPORT_SYMBOL(do_brk); /* Release all mmaps. 
*/ void exit_mmap(struct mm_struct *mm) { - struct mmu_gather *tlb; struct vm_area_struct *vma = mm->mmap; unsigned long nr_accounted = 0; unsigned long end; @@ -2055,12 +2049,11 @@ void exit_mmap(struct mm_struct *mm) lru_add_drain(); flush_cache_mm(mm); - tlb = tlb_gather_mmu(mm, 1); + /* Don't update_hiwater_rss(mm) here, do_exit already did */ /* Use -1 here to ensure all VMAs in the mm are unmapped */ - end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL); + end = unmap_vmas(vma, 0, -1, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - tlb_finish_mmu(tlb, 0, end); free_pgtables(vma, FIRST_USER_ADDRESS, 0); /* From andrea at qumranet.com Wed May 7 07:35:58 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:35:58 +0200 Subject: [ofa-general] [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: Message-ID: <6b384bb988786aa78ef0.1210170958@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1210115136 -7200 # Node ID 6b384bb988786aa78ef07440180e4b2948c4c6a2 # Parent 58f716ad4d067afb6bdd1b5f7042e19d854aae0d anon-vma-rwsem Convert the anon_vma spinlock to a rw semaphore. This allows concurrent traversal of reverse maps for try_to_unmap() and page_mkclean(). It also allows the calling of sleeping functions from reverse map traversal, as needed for the notifier callbacks; such traversals may now run concurrently. RCU is used in some contexts to guarantee the presence of the anon_vma (try_to_unmap) while we acquire the anon_vma lock. We cannot take a semaphore within an RCU critical section. Add a refcount to the anon_vma structure which allows us to give an existence guarantee for the anon_vma structure independent of the spinlock or the list contents. The refcount can then be taken within the RCU section. If it has been taken successfully then the refcount guarantees the existence of the anon_vma. The refcount in anon_vma also allows us to fix a nasty issue in page migration where we fudged by using RCU for a long code path to guarantee the existence of the anon_vma. I think this is a bug because the anon_vma may become empty and get scheduled to be freed but then we increase the refcount again when the migration entries are removed. The refcount in general allows a shortening of RCU critical sections since we can do an rcu_read_unlock() after taking the refcount. This is particularly relevant if the anon_vma chains contain hundreds of entries. However: - Atomic overhead increases in situations where a new reference to the anon_vma has to be established or removed. Overhead also increases when a speculative reference is used (try_to_unmap, page_mkclean, page migration). - There is the potential for more frequent processor changes due to up_xxx letting waiting tasks run first. This results in, e.g., the Aim9 brk performance test going down by 10-15%. Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/include/linux/rmap.h b/include/linux/rmap.h --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -25,7 +25,8 @@ * pointing to this anon_vma once its vma list is empty.
diff --git a/include/linux/rmap.h b/include/linux/rmap.h --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -25,7 +25,8 @@ * pointing to this anon_vma once its vma list is empty.
*/ struct anon_vma { - spinlock_t lock; /* Serialize access to vma list */ + atomic_t refcount; /* vmas on the list */ + struct rw_semaphore sem;/* Serialize access to vma list */ struct list_head head; /* List of private "related" vmas */ }; @@ -43,18 +44,31 @@ static inline void anon_vma_free(struct kmem_cache_free(anon_vma_cachep, anon_vma); } +struct anon_vma *grab_anon_vma(struct page *page); + +static inline void get_anon_vma(struct anon_vma *anon_vma) +{ + atomic_inc(&anon_vma->refcount); +} + +static inline void put_anon_vma(struct anon_vma *anon_vma) +{ + if (atomic_dec_and_test(&anon_vma->refcount)) + anon_vma_free(anon_vma); +} + static inline void anon_vma_lock(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - spin_lock(&anon_vma->lock); + down_write(&anon_vma->sem); } static inline void anon_vma_unlock(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - spin_unlock(&anon_vma->lock); + up_write(&anon_vma->sem); } /* diff --git a/mm/migrate.c b/mm/migrate.c --- a/mm/migrate.c +++ b/mm/migrate.c @@ -235,15 +235,16 @@ static void remove_anon_migration_ptes(s return; /* - * We hold the mmap_sem lock. So no need to call page_lock_anon_vma. + * We hold either the mmap_sem lock or a reference on the + * anon_vma. So no need to call page_lock_anon_vma. */ anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON); - spin_lock(&anon_vma->lock); + down_read(&anon_vma->sem); list_for_each_entry(vma, &anon_vma->head, anon_vma_node) remove_migration_pte(vma, old, new); - spin_unlock(&anon_vma->lock); + up_read(&anon_vma->sem); } /* @@ -630,7 +631,7 @@ static int unmap_and_move(new_page_t get int rc = 0; int *result = NULL; struct page *newpage = get_new_page(page, private, &result); - int rcu_locked = 0; + struct anon_vma *anon_vma = NULL; int charge = 0; if (!newpage) @@ -654,16 +655,14 @@ static int unmap_and_move(new_page_t get } /* * By try_to_unmap(), page->mapcount goes down to 0 here. In this case, - * we cannot notice that anon_vma is freed while we migrates a page. + * we cannot notice that anon_vma is freed while we migrate a page. * This rcu_read_lock() delays freeing anon_vma pointer until the end * of migration. File cache pages are no problem because of page_lock() * File Caches may use write_page() or lock_page() in migration, then, * just care Anon page here. */ - if (PageAnon(page)) { - rcu_read_lock(); - rcu_locked = 1; - } + if (PageAnon(page)) + anon_vma = grab_anon_vma(page); /* * Corner case handling: @@ -681,10 +680,7 @@ static int unmap_and_move(new_page_t get if (!PageAnon(page) && PagePrivate(page)) { /* * Go direct to try_to_free_buffers() here because - * a) that's what try_to_release_page() would do anyway - * b) we may be under rcu_read_lock() here, so we can't - * use GFP_KERNEL which is what try_to_release_page() - * needs to be effective.
+ * that's what try_to_release_page() would do anyway */ try_to_free_buffers(page); } @@ -705,8 +701,8 @@ static int unmap_and_move(new_page_t get } else if (charge) mem_cgroup_end_migration(newpage); rcu_unlock: - if (rcu_locked) - rcu_read_unlock(); + if (anon_vma) + put_anon_vma(anon_vma); unlock: diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -570,7 +570,7 @@ again: remove_next = 1 + (end > next-> if (vma->anon_vma) anon_vma = vma->anon_vma; if (anon_vma) { - spin_lock(&anon_vma->lock); + down_write(&anon_vma->sem); /* * Easily overlooked: when mprotect shifts the boundary, * make sure the expanding vma has anon_vma set if the @@ -624,7 +624,7 @@ again: remove_next = 1 + (end > next-> } if (anon_vma) - spin_unlock(&anon_vma->lock); + up_write(&anon_vma->sem); if (mapping) up_write(&mapping->i_mmap_sem); diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -69,7 +69,7 @@ int anon_vma_prepare(struct vm_area_stru if (anon_vma) { allocated = NULL; locked = anon_vma; - spin_lock(&locked->lock); + down_write(&locked->sem); } else { anon_vma = anon_vma_alloc(); if (unlikely(!anon_vma)) @@ -81,6 +81,7 @@ int anon_vma_prepare(struct vm_area_stru /* page_table_lock to protect against threads */ spin_lock(&mm->page_table_lock); if (likely(!vma->anon_vma)) { + get_anon_vma(anon_vma); vma->anon_vma = anon_vma; list_add_tail(&vma->anon_vma_node, &anon_vma->head); allocated = NULL; @@ -88,7 +89,7 @@ int anon_vma_prepare(struct vm_area_stru spin_unlock(&mm->page_table_lock); if (locked) - spin_unlock(&locked->lock); + up_write(&locked->sem); if (unlikely(allocated)) anon_vma_free(allocated); } @@ -99,14 +100,17 @@ void __anon_vma_merge(struct vm_area_str { BUG_ON(vma->anon_vma != next->anon_vma); list_del(&next->anon_vma_node); + put_anon_vma(vma->anon_vma); } void __anon_vma_link(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; - if (anon_vma) + if (anon_vma) { + get_anon_vma(anon_vma); list_add_tail(&vma->anon_vma_node, &anon_vma->head); + } } void anon_vma_link(struct vm_area_struct *vma) @@ -114,36 +118,32 @@ void anon_vma_link(struct vm_area_struct struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) { - spin_lock(&anon_vma->lock); + get_anon_vma(anon_vma); + down_write(&anon_vma->sem); list_add_tail(&vma->anon_vma_node, &anon_vma->head); - spin_unlock(&anon_vma->lock); + up_write(&anon_vma->sem); } } void anon_vma_unlink(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; - int empty; if (!anon_vma) return; - spin_lock(&anon_vma->lock); + down_write(&anon_vma->sem); list_del(&vma->anon_vma_node); - - /* We must garbage collect the anon_vma if it's empty */ - empty = list_empty(&anon_vma->head); - spin_unlock(&anon_vma->lock); - - if (empty) - anon_vma_free(anon_vma); + up_write(&anon_vma->sem); + put_anon_vma(anon_vma); } static void anon_vma_ctor(struct kmem_cache *cachep, void *data) { struct anon_vma *anon_vma = data; - spin_lock_init(&anon_vma->lock); + init_rwsem(&anon_vma->sem); + atomic_set(&anon_vma->refcount, 0); INIT_LIST_HEAD(&anon_vma->head); } @@ -157,9 +157,9 @@ void __init anon_vma_init(void) * Getting a lock on a stable anon_vma from a page off the LRU is * tricky: page_lock_anon_vma rely on RCU to guard against the races. 
*/ -static struct anon_vma *page_lock_anon_vma(struct page *page) +struct anon_vma *grab_anon_vma(struct page *page) { - struct anon_vma *anon_vma; + struct anon_vma *anon_vma = NULL; unsigned long anon_mapping; rcu_read_lock(); @@ -170,17 +170,26 @@ static struct anon_vma *page_lock_anon_v goto out; anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); - spin_lock(&anon_vma->lock); - return anon_vma; + if (!atomic_inc_not_zero(&anon_vma->refcount)) + anon_vma = NULL; out: rcu_read_unlock(); - return NULL; + return anon_vma; +} + +static struct anon_vma *page_lock_anon_vma(struct page *page) +{ + struct anon_vma *anon_vma = grab_anon_vma(page); + + if (anon_vma) + down_read(&anon_vma->sem); + return anon_vma; } static void page_unlock_anon_vma(struct anon_vma *anon_vma) { - spin_unlock(&anon_vma->lock); - rcu_read_unlock(); + up_read(&anon_vma->sem); + put_anon_vma(anon_vma); } /* From andrea at qumranet.com Wed May 7 07:36:00 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:36:00 +0200 Subject: [ofa-general] [PATCH 10 of 11] export zap_page_range for XPMEM In-Reply-To: Message-ID: <5b2eb7d28a4517daf91b.1210170960@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1210115797 -7200 # Node ID 5b2eb7d28a4517daf91b08b4dcfbb58fd2b42d0b # Parent 94eaa1515369e8ef183e2457f6f25a7f36473d70 export zap_page_range for XPMEM XPMEM would have used sys_madvise() except that madvise_dontneed() returns an -EINVAL if VM_PFNMAP is set, which is always true for the pages XPMEM imports from other partitions and is also true for uncached pages allocated locally via the mspec allocator. XPMEM needs zap_page_range() functionality for these types of pages as well as 'normal' pages. Signed-off-by: Dean Nelson Signed-off-by: Andrea Arcangeli diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -954,6 +954,7 @@ unsigned long zap_page_range(struct vm_a return unmap_vmas(vma, address, end, &nr_accounted, details); } +EXPORT_SYMBOL_GPL(zap_page_range); /* * Do a quick page-table lookup for a single page. From andrea at qumranet.com Wed May 7 07:35:57 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:35:57 +0200 Subject: [ofa-general] [PATCH 07 of 11] i_mmap_rwsem In-Reply-To: Message-ID: <58f716ad4d067afb6bdd.1210170957@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1210115135 -7200 # Node ID 58f716ad4d067afb6bdd1b5f7042e19d854aae0d # Parent 0621238970155f8ff2d60ca4996dcdd470f9c6ce i_mmap_rwsem The conversion to a rwsem allows notifier callbacks during rmap traversal for files. A rw style lock also allows concurrent walking of the reverse map so that multiple processors can expire pages in the same memory area of the same process. So it increases the potential concurrency. Signed-off-by: Andrea Arcangeli Signed-off-by: Christoph Lameter diff --git a/Documentation/vm/locking b/Documentation/vm/locking --- a/Documentation/vm/locking +++ b/Documentation/vm/locking @@ -66,7 +66,7 @@ expand_stack(), it is hard to come up wi expand_stack(), it is hard to come up with a destructive scenario without having the vmlist protection in this case. -The page_table_lock nests with the inode i_mmap_lock and the kmem cache +The page_table_lock nests with the inode i_mmap_sem and the kmem cache c_spinlock spinlocks. This is okay, since the kmem code asks for pages after dropping c_spinlock. 
The page_table_lock also nests with pagecache_lock and pagemap_lru_lock spinlocks, and no code asks for memory with these locks diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c --- a/arch/x86/mm/hugetlbpage.c +++ b/arch/x86/mm/hugetlbpage.c @@ -69,7 +69,7 @@ static void huge_pmd_share(struct mm_str if (!vma_shareable(vma, addr)) return; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) { if (svma == vma) continue; @@ -94,7 +94,7 @@ static void huge_pmd_share(struct mm_str put_page(virt_to_page(spte)); spin_unlock(&mm->page_table_lock); out: - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); } /* diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -454,10 +454,10 @@ static int hugetlb_vmtruncate(struct ino pgoff = offset >> PAGE_SHIFT; i_size_write(inode, offset); - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); if (!prio_tree_empty(&mapping->i_mmap)) hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff); - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); truncate_hugepages(inode, offset); return 0; } diff --git a/fs/inode.c b/fs/inode.c --- a/fs/inode.c +++ b/fs/inode.c @@ -210,7 +210,7 @@ void inode_init_once(struct inode *inode INIT_LIST_HEAD(&inode->i_devices); INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC); rwlock_init(&inode->i_data.tree_lock); - spin_lock_init(&inode->i_data.i_mmap_lock); + init_rwsem(&inode->i_data.i_mmap_sem); INIT_LIST_HEAD(&inode->i_data.private_list); spin_lock_init(&inode->i_data.private_lock); INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap); diff --git a/include/linux/fs.h b/include/linux/fs.h --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -502,7 +502,7 @@ struct address_space { unsigned int i_mmap_writable;/* count VM_SHARED mappings */ struct prio_tree_root i_mmap; /* tree of private and shared mappings */ struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ - spinlock_t i_mmap_lock; /* protect tree, count, list */ + struct rw_semaphore i_mmap_sem; /* protect tree, count, list */ unsigned int truncate_count; /* Cover race condition with truncate */ unsigned long nrpages; /* number of total pages */ pgoff_t writeback_index;/* writeback starts here */ diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -735,7 +735,7 @@ struct zap_details { struct address_space *check_mapping; /* Check page->mapping if set */ pgoff_t first_index; /* Lowest page->index to unmap */ pgoff_t last_index; /* Highest page->index to unmap */ - spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */ + struct rw_semaphore *i_mmap_sem; /* For unmap_mapping_range: */ unsigned long truncate_count; /* Compare vm_truncate_count */ }; diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -297,12 +297,12 @@ static int dup_mmap(struct mm_struct *mm atomic_dec(&inode->i_writecount); /* insert tmp into the share list, just after mpnt */ - spin_lock(&file->f_mapping->i_mmap_lock); + down_write(&file->f_mapping->i_mmap_sem); tmp->vm_truncate_count = mpnt->vm_truncate_count; flush_dcache_mmap_lock(file->f_mapping); vma_prio_tree_add(tmp, mpnt); flush_dcache_mmap_unlock(file->f_mapping); - spin_unlock(&file->f_mapping->i_mmap_lock); + up_write(&file->f_mapping->i_mmap_sem); } /* diff --git a/mm/filemap.c b/mm/filemap.c --- a/mm/filemap.c +++ b/mm/filemap.c @@ -61,16 +61,16 @@ 
generic_file_direct_IO(int rw, struct ki /* * Lock ordering: * - * ->i_mmap_lock (vmtruncate) + * ->i_mmap_sem (vmtruncate) * ->private_lock (__free_pte->__set_page_dirty_buffers) * ->swap_lock (exclusive_swap_page, others) * ->mapping->tree_lock * * ->i_mutex - * ->i_mmap_lock (truncate->unmap_mapping_range) + * ->i_mmap_sem (truncate->unmap_mapping_range) * * ->mmap_sem - * ->i_mmap_lock + * ->i_mmap_sem * ->page_table_lock or pte_lock (various, mainly in memory.c) * ->mapping->tree_lock (arch-dependent flush_dcache_mmap_lock) * @@ -87,7 +87,7 @@ generic_file_direct_IO(int rw, struct ki * ->sb_lock (fs/fs-writeback.c) * ->mapping->tree_lock (__sync_single_inode) * - * ->i_mmap_lock + * ->i_mmap_sem * ->anon_vma.lock (vma_adjust) * * ->anon_vma.lock diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -178,7 +178,7 @@ __xip_unmap (struct address_space * mapp if (!page) return; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { mm = vma->vm_mm; address = vma->vm_start + @@ -198,7 +198,7 @@ __xip_unmap (struct address_space * mapp page_cache_release(page); } } - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); } /* diff --git a/mm/fremap.c b/mm/fremap.c --- a/mm/fremap.c +++ b/mm/fremap.c @@ -206,13 +206,13 @@ asmlinkage long sys_remap_file_pages(uns } goto out; } - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); flush_dcache_mmap_lock(mapping); vma->vm_flags |= VM_NONLINEAR; vma_prio_tree_remove(vma, &mapping->i_mmap); vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear); flush_dcache_mmap_unlock(mapping); - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); } mmu_notifier_invalidate_range_start(mm, start, start + size); diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -814,7 +814,7 @@ void __unmap_hugepage_range(struct vm_ar struct page *page; struct page *tmp; /* - * A page gathering list, protected by per file i_mmap_lock. The + * A page gathering list, protected by per file i_mmap_sem. The * lock is used to avoid list corruption from multiple unmapping * of the same page since we are using page->lru. */ @@ -864,9 +864,9 @@ void unmap_hugepage_range(struct vm_area * do nothing in this case. */ if (vma->vm_file) { - spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); + down_write(&vma->vm_file->f_mapping->i_mmap_sem); __unmap_hugepage_range(vma, start, end); - spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); + up_write(&vma->vm_file->f_mapping->i_mmap_sem); } } @@ -1111,7 +1111,7 @@ void hugetlb_change_protection(struct vm BUG_ON(address >= end); flush_cache_range(vma, address, end); - spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); + down_write(&vma->vm_file->f_mapping->i_mmap_sem); spin_lock(&mm->page_table_lock); for (; address < end; address += HPAGE_SIZE) { ptep = huge_pte_offset(mm, address); @@ -1126,7 +1126,7 @@ void hugetlb_change_protection(struct vm } } spin_unlock(&mm->page_table_lock); - spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); + up_write(&vma->vm_file->f_mapping->i_mmap_sem); flush_tlb_range(vma, start, end); } diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -874,7 +874,7 @@ unsigned long unmap_vmas(struct vm_area_ unsigned long tlb_start = 0; /* For tlb_finish_mmu */ int tlb_start_valid = 0; unsigned long start = start_addr; - spinlock_t *i_mmap_lock = details? 
details->i_mmap_lock: NULL; + struct rw_semaphore *i_mmap_sem = details? details->i_mmap_sem: NULL; int fullmm; struct mmu_gather *tlb; struct mm_struct *mm = vma->vm_mm; @@ -920,8 +920,8 @@ unsigned long unmap_vmas(struct vm_area_ tlb_finish_mmu(tlb, tlb_start, start); if (need_resched() || - (i_mmap_lock && spin_needbreak(i_mmap_lock))) { - if (i_mmap_lock) { + (i_mmap_sem && rwsem_needbreak(i_mmap_sem))) { + if (i_mmap_sem) { tlb = NULL; goto out; } @@ -1829,7 +1829,7 @@ unwritable_page: /* * Helper functions for unmap_mapping_range(). * - * __ Notes on dropping i_mmap_lock to reduce latency while unmapping __ + * __ Notes on dropping i_mmap_sem to reduce latency while unmapping __ * * We have to restart searching the prio_tree whenever we drop the lock, * since the iterator is only valid while the lock is held, and anyway @@ -1848,7 +1848,7 @@ unwritable_page: * can't efficiently keep all vmas in step with mapping->truncate_count: * so instead reset them all whenever it wraps back to 0 (then go to 1). * mapping->truncate_count and vma->vm_truncate_count are protected by - * i_mmap_lock. + * i_mmap_sem. * * In order to make forward progress despite repeatedly restarting some * large vma, note the restart_addr from unmap_vmas when it breaks out: @@ -1898,7 +1898,7 @@ again: restart_addr = zap_page_range(vma, start_addr, end_addr - start_addr, details); - need_break = need_resched() || spin_needbreak(details->i_mmap_lock); + need_break = need_resched() || rwsem_needbreak(details->i_mmap_sem); if (restart_addr >= end_addr) { /* We have now completed this vma: mark it so */ @@ -1912,9 +1912,9 @@ again: goto again; } - spin_unlock(details->i_mmap_lock); + up_write(details->i_mmap_sem); cond_resched(); - spin_lock(details->i_mmap_lock); + down_write(details->i_mmap_sem); return -EINTR; } @@ -2008,9 +2008,9 @@ void unmap_mapping_range(struct address_ details.last_index = hba + hlen - 1; if (details.last_index < details.first_index) details.last_index = ULONG_MAX; - details.i_mmap_lock = &mapping->i_mmap_lock; + details.i_mmap_sem = &mapping->i_mmap_sem; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); /* Protect against endless unmapping loops */ mapping->truncate_count++; @@ -2025,7 +2025,7 @@ void unmap_mapping_range(struct address_ unmap_mapping_range_tree(&mapping->i_mmap, &details); if (unlikely(!list_empty(&mapping->i_mmap_nonlinear))) unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details); - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); } EXPORT_SYMBOL(unmap_mapping_range); diff --git a/mm/migrate.c b/mm/migrate.c --- a/mm/migrate.c +++ b/mm/migrate.c @@ -211,12 +211,12 @@ static void remove_file_migration_ptes(s if (!mapping) return; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) remove_migration_pte(vma, old, new); - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); } /* diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -189,7 +189,7 @@ error: } /* - * Requires inode->i_mapping->i_mmap_lock + * Requires inode->i_mapping->i_mmap_sem */ static void __remove_shared_vm_struct(struct vm_area_struct *vma, struct file *file, struct address_space *mapping) @@ -217,9 +217,9 @@ void unlink_file_vma(struct vm_area_stru if (file) { struct address_space *mapping = file->f_mapping; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); __remove_shared_vm_struct(vma, file, mapping); - 
spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); } } @@ -445,7 +445,7 @@ static void vma_link(struct mm_struct *m mapping = vma->vm_file->f_mapping; if (mapping) { - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); vma->vm_truncate_count = mapping->truncate_count; } anon_vma_lock(vma); @@ -455,7 +455,7 @@ static void vma_link(struct mm_struct *m anon_vma_unlock(vma); if (mapping) - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); mm->map_count++; validate_mm(mm); @@ -542,7 +542,7 @@ again: remove_next = 1 + (end > next-> mapping = file->f_mapping; if (!(vma->vm_flags & VM_NONLINEAR)) root = &mapping->i_mmap; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); if (importer && vma->vm_truncate_count != next->vm_truncate_count) { /* @@ -626,7 +626,7 @@ again: remove_next = 1 + (end > next-> if (anon_vma) spin_unlock(&anon_vma->lock); if (mapping) - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); if (remove_next) { if (file) { @@ -2068,7 +2068,7 @@ void exit_mmap(struct mm_struct *mm) /* Insert vm structure into process list sorted by address * and into the inode's i_mmap tree. If vm_file is non-NULL - * then i_mmap_lock is taken here. + * then i_mmap_sem is taken here. */ int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma) { diff --git a/mm/mremap.c b/mm/mremap.c --- a/mm/mremap.c +++ b/mm/mremap.c @@ -88,7 +88,7 @@ static void move_ptes(struct vm_area_str * and we propagate stale pages into the dst afterward. */ mapping = vma->vm_file->f_mapping; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); if (new_vma->vm_truncate_count && new_vma->vm_truncate_count != vma->vm_truncate_count) new_vma->vm_truncate_count = 0; @@ -120,7 +120,7 @@ static void move_ptes(struct vm_area_str pte_unmap_nested(new_pte - 1); pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); } diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -24,7 +24,7 @@ * inode->i_alloc_sem (vmtruncate_range) * mm->mmap_sem * page->flags PG_locked (lock_page) - * mapping->i_mmap_lock + * mapping->i_mmap_sem * anon_vma->lock * mm->page_table_lock or pte_lock * zone->lru_lock (in mark_page_accessed, isolate_lru_page) @@ -373,14 +373,14 @@ static int page_referenced_file(struct p * The page lock not only makes sure that page->mapping cannot * suddenly be NULLified by truncation, it makes sure that the * structure at mapping cannot be freed and reused yet, - * so we can safely take mapping->i_mmap_lock. + * so we can safely take mapping->i_mmap_sem. */ BUG_ON(!PageLocked(page)); - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); /* - * i_mmap_lock does not stabilize mapcount at all, but mapcount + * i_mmap_sem does not stabilize mapcount at all, but mapcount * is more likely to be accurate if we note it after spinning. 
*/ mapcount = page_mapcount(page); @@ -403,7 +403,7 @@ static int page_referenced_file(struct p break; } - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); return referenced; } @@ -490,12 +490,12 @@ static int page_mkclean_file(struct addr BUG_ON(PageAnon(page)); - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { if (vma->vm_flags & VM_SHARED) ret += page_mkclean_one(page, vma); } - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); return ret; } @@ -930,7 +930,7 @@ static int try_to_unmap_file(struct page unsigned long max_nl_size = 0; unsigned int mapcount; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { ret = try_to_unmap_one(page, vma, migration); if (ret == SWAP_FAIL || !page_mapped(page)) @@ -967,7 +967,6 @@ static int try_to_unmap_file(struct page mapcount = page_mapcount(page); if (!mapcount) goto out; - cond_resched_lock(&mapping->i_mmap_lock); max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK; if (max_nl_cursor == 0) @@ -989,7 +988,6 @@ static int try_to_unmap_file(struct page } vma->vm_private_data = (void *) max_nl_cursor; } - cond_resched_lock(&mapping->i_mmap_lock); max_nl_cursor += CLUSTER_SIZE; } while (max_nl_cursor <= max_nl_size); @@ -1001,7 +999,7 @@ static int try_to_unmap_file(struct page list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) vma->vm_private_data = NULL; out: - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); return ret; } From andrea at qumranet.com Wed May 7 07:36:01 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:36:01 +0200 Subject: [ofa-general] [PATCH 11 of 11] mmap sems In-Reply-To: Message-ID: # HG changeset patch # User Andrea Arcangeli # Date 1210115798 -7200 # Node ID eb924315351f6b056428e35c983ad28040420fea # Parent 5b2eb7d28a4517daf91b08b4dcfbb58fd2b42d0b mmap sems This patch adds a lock ordering rule to avoid a potential deadlock when multiple mmap_sems need to be locked. Signed-off-by: Dean Nelson Signed-off-by: Andrea Arcangeli diff --git a/mm/filemap.c b/mm/filemap.c --- a/mm/filemap.c +++ b/mm/filemap.c @@ -79,6 +79,9 @@ generic_file_direct_IO(int rw, struct ki * * ->i_mutex (generic_file_buffered_write) * ->mmap_sem (fault_in_pages_readable->do_page_fault) + * + * When taking multiple mmap_sems, one should lock the lowest-addressed + * one first proceeding on up to the highest-addressed one. * * ->i_mutex * ->i_alloc_sem (various) From andrea at qumranet.com Wed May 7 07:35:59 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:35:59 +0200 Subject: [ofa-general] [PATCH 09 of 11] mm_lock-rwsem In-Reply-To: Message-ID: <94eaa1515369e8ef183e.1210170959@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1210115508 -7200 # Node ID 94eaa1515369e8ef183e2457f6f25a7f36473d70 # Parent 6b384bb988786aa78ef07440180e4b2948c4c6a2 mm_lock-rwsem Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock conversion. 
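The ordering rule added above, and the mm_lock() scheme this patch converts, rest on the same idea: a task that must hold several locks at once takes them in ascending address order, so any two tasks contending for an overlapping set acquire the shared locks in the same order and cannot each end up waiting on the other. A minimal stand-alone sketch of the idea (illustrative only; invented names, with userspace pthread rwlocks standing in for the kernel's rw_semaphores):

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_lock_ptr(const void *a, const void *b)
{
	uintptr_t pa = (uintptr_t)*(pthread_rwlock_t *const *)a;
	uintptr_t pb = (uintptr_t)*(pthread_rwlock_t *const *)b;

	return (pa > pb) - (pa < pb);
}

/* Take every lock in the set, lowest address first; a lock that
 * appears more than once in the array is taken only once. */
static void lock_all(pthread_rwlock_t **locks, size_t nr)
{
	pthread_rwlock_t *last = NULL;
	size_t i;

	qsort(locks, nr, sizeof(*locks), cmp_lock_ptr);
	for (i = 0; i < nr; i++) {
		if (locks[i] == last)
			continue;
		last = locks[i];
		pthread_rwlock_wrlock(last);
	}
}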
Signed-off-by: Andrea Arcangeli diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1084,10 +1084,10 @@ extern int install_special_mapping(struc unsigned long flags, struct page **pages); struct mm_lock_data { - spinlock_t **i_mmap_locks; - spinlock_t **anon_vma_locks; - size_t nr_i_mmap_locks; - size_t nr_anon_vma_locks; + struct rw_semaphore **i_mmap_sems; + struct rw_semaphore **anon_vma_sems; + size_t nr_i_mmap_sems; + size_t nr_anon_vma_sems; }; extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data); extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2255,8 +2255,8 @@ int install_special_mapping(struct mm_st static int mm_lock_cmp(const void *a, const void *b) { - unsigned long _a = (unsigned long)*(spinlock_t **)a; - unsigned long _b = (unsigned long)*(spinlock_t **)b; + unsigned long _a = (unsigned long)*(struct rw_semaphore **)a; + unsigned long _b = (unsigned long)*(struct rw_semaphore **)b; cond_resched(); if (_a < _b) @@ -2266,7 +2266,7 @@ static int mm_lock_cmp(const void *a, co return 0; } -static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks, +static unsigned long mm_lock_sort(struct mm_struct *mm, struct rw_semaphore **sems, int anon) { struct vm_area_struct *vma; @@ -2275,59 +2275,59 @@ static unsigned long mm_lock_sort(struct for (vma = mm->mmap; vma; vma = vma->vm_next) { if (anon) { if (vma->anon_vma) - locks[i++] = &vma->anon_vma->lock; + sems[i++] = &vma->anon_vma->sem; } else { if (vma->vm_file && vma->vm_file->f_mapping) - locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock; + sems[i++] = &vma->vm_file->f_mapping->i_mmap_sem; } } if (!i) goto out; - sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL); + sort(sems, i, sizeof(struct rw_semaphore *), mm_lock_cmp, NULL); out: return i; } static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm, - spinlock_t **locks) + struct rw_semaphore **sems) { - return mm_lock_sort(mm, locks, 1); + return mm_lock_sort(mm, sems, 1); } static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm, - spinlock_t **locks) + struct rw_semaphore **sems) { - return mm_lock_sort(mm, locks, 0); + return mm_lock_sort(mm, sems, 0); } -static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock) +static void mm_lock_unlock(struct rw_semaphore **sems, size_t nr, int lock) { - spinlock_t *last = NULL; + struct rw_semaphore *last = NULL; size_t i; for (i = 0; i < nr; i++) /* Multiple vmas may use the same lock. */ - if (locks[i] != last) { - BUG_ON((unsigned long) last > (unsigned long) locks[i]); - last = locks[i]; + if (sems[i] != last) { + BUG_ON((unsigned long) last > (unsigned long) sems[i]); + last = sems[i]; if (lock) - spin_lock(last); + down_write(last); else - spin_unlock(last); + up_write(last); } } -static inline void __mm_lock(spinlock_t **locks, size_t nr) +static inline void __mm_lock(struct rw_semaphore **sems, size_t nr) { - mm_lock_unlock(locks, nr, 1); + mm_lock_unlock(sems, nr, 1); } -static inline void __mm_unlock(spinlock_t **locks, size_t nr) +static inline void __mm_unlock(struct rw_semaphore **sems, size_t nr) { - mm_lock_unlock(locks, nr, 0); + mm_lock_unlock(sems, nr, 0); } /* @@ -2358,10 +2358,10 @@ static inline void __mm_unlock(spinlock_ * of vmas is defined in /proc/sys/vm/max_map_count. * * mm_lock() can fail if memory allocation fails. 
The worst case - * vmalloc allocation required is 2*max_map_count*sizeof(spinlock_t *), - * so around 1Mbyte, but in practice it'll be much less because - * normally there won't be max_map_count vmas allocated in the task - * that runs mm_lock(). + * vmalloc allocation required is 2*max_map_count*sizeof(struct + * rw_semaphore *), so around 1Mbyte, but in practice it'll be much + * less because normally there won't be max_map_count vmas allocated + * in the task that runs mm_lock(). * * The vmalloc memory allocated by mm_lock is stored in the * mm_lock_data structure that must be allocated by the caller and it @@ -2375,16 +2375,16 @@ static inline void __mm_unlock(spinlock_ */ int mm_lock(struct mm_struct *mm, struct mm_lock_data *data) { - spinlock_t **anon_vma_locks, **i_mmap_locks; + struct rw_semaphore **anon_vma_sems, **i_mmap_sems; if (mm->map_count) { - anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); - if (unlikely(!anon_vma_locks)) + anon_vma_sems = vmalloc(sizeof(struct rw_semaphore *) * mm->map_count); + if (unlikely(!anon_vma_sems)) return -ENOMEM; - i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); - if (unlikely(!i_mmap_locks)) { - vfree(anon_vma_locks); + i_mmap_sems = vmalloc(sizeof(struct rw_semaphore *) * mm->map_count); + if (unlikely(!i_mmap_sems)) { + vfree(anon_vma_sems); return -ENOMEM; } @@ -2392,31 +2392,31 @@ int mm_lock(struct mm_struct *mm, struct * When mm_lock_sort_anon_vma/i_mmap returns zero it * means there's no lock to take and so we can free * the array here without waiting mm_unlock. mm_unlock - * will do nothing if nr_i_mmap/anon_vma_locks is + * will do nothing if nr_i_mmap/anon_vma_sems is * zero. */ - data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks); - data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks); + data->nr_anon_vma_sems = mm_lock_sort_anon_vma(mm, anon_vma_sems); + data->nr_i_mmap_sems = mm_lock_sort_i_mmap(mm, i_mmap_sems); - if (data->nr_anon_vma_locks) { - __mm_lock(anon_vma_locks, data->nr_anon_vma_locks); - data->anon_vma_locks = anon_vma_locks; + if (data->nr_anon_vma_sems) { + __mm_lock(anon_vma_sems, data->nr_anon_vma_sems); + data->anon_vma_sems = anon_vma_sems; } else - vfree(anon_vma_locks); + vfree(anon_vma_sems); - if (data->nr_i_mmap_locks) { - __mm_lock(i_mmap_locks, data->nr_i_mmap_locks); - data->i_mmap_locks = i_mmap_locks; + if (data->nr_i_mmap_sems) { + __mm_lock(i_mmap_sems, data->nr_i_mmap_sems); + data->i_mmap_sems = i_mmap_sems; } else - vfree(i_mmap_locks); + vfree(i_mmap_sems); } return 0; } -static void mm_unlock_vfree(spinlock_t **locks, size_t nr) +static void mm_unlock_vfree(struct rw_semaphore **sems, size_t nr) { - __mm_unlock(locks, nr); - vfree(locks); + __mm_unlock(sems, nr); + vfree(sems); } /* @@ -2435,11 +2435,11 @@ void mm_unlock(struct mm_struct *mm, str void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data) { if (mm->map_count) { - if (data->nr_anon_vma_locks) - mm_unlock_vfree(data->anon_vma_locks, - data->nr_anon_vma_locks); - if (data->nr_i_mmap_locks) - mm_unlock_vfree(data->i_mmap_locks, - data->nr_i_mmap_locks); + if (data->nr_anon_vma_sems) + mm_unlock_vfree(data->anon_vma_sems, + data->nr_anon_vma_sems); + if (data->nr_i_mmap_sems) + mm_unlock_vfree(data->i_mmap_sems, + data->nr_i_mmap_sems); } } From andrea at qumranet.com Wed May 7 07:35:50 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:35:50 +0200 Subject: [ofa-general] [PATCH 00 of 11] mmu notifier #v16 Message-ID: Hello, this is the last 
update of the mmu notifier patch. Jack asked for a __mmu_notifier_register that can be called with mmap_sem held in write mode. Here is an update with that change, plus allowing ->release not to be implemented (a two-line change to mmu_notifier.c). The entire diff between v15 and v16 mmu-notifier-core was posted in a separate email. From andrea at qumranet.com Wed May 7 08:00:15 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 7 May 2008 17:00:15 +0200 Subject: [ofa-general] Re: [PATCH 01 of 12] Core of mmu notifiers In-Reply-To: <20080429160340.GG8315@duo.random> References: <20080424153943.GJ24536@duo.random> <20080424174145.GM24536@duo.random> <20080426131734.GB19717@sgi.com> <20080427122727.GO9514@duo.random> <20080429001052.GA8315@duo.random> <20080429153052.GE8315@duo.random> <20080429155030.GB28944@sgi.com> <20080429160340.GG8315@duo.random> Message-ID: <20080507150014.GI8362@duo.random> On Tue, Apr 29, 2008 at 06:03:40PM +0200, Andrea Arcangeli wrote: > Christoph if you've interest in evolving anon-vma-sem and i_mmap_sem > yourself in this direction, you're very welcome to go ahead while I In case you didn't notice this already, for a further explanation of why semaphores run slower for small critical sections and why the conversion from spinlock to rwsem should happen under a config option, see the "AIM7 40% regression with 2.6.26-rc1" thread. From rdreier at cisco.com Wed May 7 08:29:27 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 May 2008 08:29:27 -0700 Subject: [ofa-general] [RFC/PATCH 1/2] RDMA/cxgb3: Don't add PBL memory to gen_pool in chunks Message-ID: Current iw_cxgb3 code adds PBL memory to the driver's gen_pool in 2 MB chunks. This limits the largest single allocation that can be done to the same size, which means that with 4 KB pages, each of which takes 8 bytes of PBL memory, the largest memory region that can be allocated is 1 GB (256K PBL entries * 4 KB/entry). Remove this limit by adding all the PBL memory in a single gen_pool chunk, if possible. Add code that falls back to smaller chunks if gen_pool_add() fails, which can happen if there is not sufficient contiguous lowmem for the internal gen_pool bitmap.
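The fallback described here amounts to the following loop (a stand-alone sketch with invented names, not the driver code): try to cover the remaining range with the current chunk size, and every time the allocator refuses a chunk, halve the chunk size and retry from the same start address, down to a floor.

/* Sketch: add [base, top] to a pool in the largest chunks that succeed;
 * add_chunk() returns 0 on success, nonzero on failure. */
static int add_range_in_chunks(unsigned long base, unsigned long top,
			       unsigned long min_chunk,
			       int (*add_chunk)(unsigned long start,
						unsigned long len))
{
	unsigned long start = base;
	unsigned long chunk = top - base + 1;	/* first try: one big chunk */

	while (start < top) {
		if (chunk > top - start + 1)
			chunk = top - start + 1;
		if (add_chunk(start, chunk)) {
			if (chunk <= min_chunk)
				return -1;	/* even the smallest chunk failed */
			chunk >>= 1;		/* halve and retry this region */
		} else {
			start += chunk;		/* chunk added; move past it */
		}
	}
	return 0;
}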
Signed-off-by: Roland Dreier --- drivers/infiniband/hw/cxgb3/cxio_resource.c | 36 +++++++++++++++++++++------ 1 files changed, 28 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_resource.c b/drivers/infiniband/hw/cxgb3/cxio_resource.c index 45ed4f2..bd233c0 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_resource.c +++ b/drivers/infiniband/hw/cxgb3/cxio_resource.c @@ -250,7 +250,6 @@ void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp) */ #define MIN_PBL_SHIFT 8 /* 256B == min PBL size (32 entries) */ -#define PBL_CHUNK 2*1024*1024 u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size) { @@ -267,14 +266,35 @@ void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size) int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p) { - unsigned long i; + unsigned pbl_start, pbl_chunk; + rdev_p->pbl_pool = gen_pool_create(MIN_PBL_SHIFT, -1); - if (rdev_p->pbl_pool) - for (i = rdev_p->rnic_info.pbl_base; - i <= rdev_p->rnic_info.pbl_top - PBL_CHUNK + 1; - i += PBL_CHUNK) - gen_pool_add(rdev_p->pbl_pool, i, PBL_CHUNK, -1); - return rdev_p->pbl_pool ?
0 : -ENOMEM; + if (!rdev_p->pbl_pool) + return -ENOMEM; + + pbl_start = rdev_p->rnic_info.pbl_base; + pbl_chunk = rdev_p->rnic_info.pbl_top - pbl_start + 1; + + while (pbl_start < rdev_p->rnic_info.pbl_top) { + pbl_chunk = min(rdev_p->rnic_info.pbl_top - pbl_start + 1, + pbl_chunk); + if (gen_pool_add(rdev_p->pbl_pool, pbl_start, pbl_chunk, -1)) { + PDBG("%s failed to add PBL chunk (%x/%x)\n", + __func__, pbl_start, pbl_chunk); + if (pbl_chunk <= 1024 << MIN_PBL_SHIFT) { + printk(KERN_WARNING MOD "%s: Failed to add all PBL chunks (%x/%x)\n", + __func__, pbl_start, rdev_p->rnic_info.pbl_top - pbl_start); + return 0; + } + pbl_chunk >>= 1; + } else { + PDBG("%s added PBL chunk (%x/%x)\n", + __func__, pbl_start, pbl_chunk); + pbl_start += pbl_chunk; + } + } + + return 0; } void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p) -- 1.5.5.1 From rdreier at cisco.com Wed May 7 08:29:59 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 May 2008 08:29:59 -0700 Subject: [ofa-general] [RFC/PATCH 2/2] RDMA/cxgb3: Fix severe limit on userspace memory registration size Message-ID: Currently, iw_cxgb3 is severely limited on the amount of userspace memory that can be registered in a single memory region, which causes big problems for applications that expect to be able to register 100s of MB. The problem is that the driver uses a single kmalloc()ed buffer to hold the physical buffer list (PBL) for the entire memory region during registration, which means that 8 bytes of contiguous memory are required for each page of memory being registered. For example, a 64 MB registration will require 128 KB of contiguous memory with 4 KB pages, and it is unlikely that such an allocation will succeed on a busy system. This is purely a driver problem: the temporary page list buffer is not needed by the hardware, so we can fix this by writing the PBL to the hardware in page-sized chunks rather than all at once. We do this by splitting the memory registration operation up into several steps: - Allocate PBL space in adapter memory for the full registration - Copy PBL to adapter memory in chunks - Allocate STag and enable memory region This also allows several other cleanups to the __cxio_tpt_op() interface and related parts of the driver. This change leaves the reregister memory region and memory window operations broken, but they already didn't work due to other longstanding bugs, so fixing them will be left to a later patch.
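The second step (copying the PBL in chunks) boils down to the following loop (a sketch with invented helper names; the real entry points introduced below are iwch_alloc_pbl(), iwch_write_pbl() and iwch_free_pbl()): page addresses are staged in a single page-sized buffer that is flushed to adapter memory each time it fills, so no allocation ever has to exceed one page.

/* Sketch: stream 'total' page addresses to adapter memory through one
 * page-sized staging buffer; write_chunk() pushes npages entries at
 * the given offset into the region's PBL space. */
static int write_pbl_in_chunks(__be64 *staging,
			       u64 (*next_dma_addr)(void *cursor), void *cursor,
			       int total,
			       int (*write_chunk)(__be64 *buf, int npages,
						  int offset))
{
	int entries_per_page = PAGE_SIZE / sizeof(__be64);
	int i = 0, done = 0, err;

	while (done + i < total) {
		staging[i++] = cpu_to_be64(next_dma_addr(cursor));
		if (i == entries_per_page) {
			err = write_chunk(staging, i, done);
			if (err)
				return err;
			done += i;
			i = 0;
		}
	}
	return i ? write_chunk(staging, i, done) : 0;	/* flush remainder */
}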
Signed-off-by: Roland Dreier --- drivers/infiniband/hw/cxgb3/cxio_hal.c | 90 +++++++++++++------------- drivers/infiniband/hw/cxgb3/cxio_hal.h | 8 +- drivers/infiniband/hw/cxgb3/iwch_mem.c | 75 +++++++++++++++-------- drivers/infiniband/hw/cxgb3/iwch_provider.c | 68 ++++++++++++++++----- drivers/infiniband/hw/cxgb3/iwch_provider.h | 8 +- 5 files changed, 155 insertions(+), 94 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index 5fd8506..ebf9d30 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -588,7 +588,7 @@ static int cxio_hal_destroy_ctrl_qp(struct cxio_rdev *rdev_p) * caller aquires the ctrl_qp lock before the call */ static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr, - u32 len, void *data, int completion) + u32 len, void *data) { u32 i, nr_wqe, copy_len; u8 *copy_data; @@ -624,7 +624,7 @@ static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr, flag = 0; if (i == (nr_wqe - 1)) { /* last WQE */ - flag = completion ?
T3_COMPLETION_FLAG : 0; + flag = T3_COMPLETION_FLAG; if (len % 32) utx_len = len / 32 + 1; else @@ -683,21 +683,20 @@ static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr, return 0; } -/* IN: stag key, pdid, perm, zbva, to, len, page_size, pbl, and pbl_size - * OUT: stag index, actual pbl_size, pbl_addr allocated. +/* IN: stag key, pdid, perm, zbva, to, len, page_size, pbl_size and pbl_addr + * OUT: stag index * TBD: shared memory region support */ static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry, u32 *stag, u8 stag_state, u32 pdid, enum tpt_mem_type type, enum tpt_mem_perm perm, - u32 zbva, u64 to, u32 len, u8 page_size, __be64 *pbl, - u32 *pbl_size, u32 *pbl_addr) + u32 zbva, u64 to, u32 len, u8 page_size, + u32 pbl_size, u32 pbl_addr) { int err; struct tpt_entry tpt; u32 stag_idx; u32 wptr; - int rereg = (*stag != T3_STAG_UNSET); stag_state = stag_state > 0; stag_idx = (*stag) >> 8; @@ -711,30 +710,8 @@ static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry, PDBG("%s stag_state 0x%0x type 0x%0x pdid 0x%0x, stag_idx 0x%x\n", __func__, stag_state, type, pdid, stag_idx); - if (reset_tpt_entry) - cxio_hal_pblpool_free(rdev_p, *pbl_addr, *pbl_size << 3); - else if (!rereg) { - *pbl_addr = cxio_hal_pblpool_alloc(rdev_p, *pbl_size << 3); - if (!*pbl_addr) { - return -ENOMEM; - } - } - mutex_lock(&rdev_p->ctrl_qp.lock); - /* write PBL first if any - update pbl only if pbl list exist */ - if (pbl) { - - PDBG("%s *pdb_addr 0x%x, pbl_base 0x%x, pbl_size %d\n", - __func__, *pbl_addr, rdev_p->rnic_info.pbl_base, - *pbl_size); - err = cxio_hal_ctrl_qp_write_mem(rdev_p, - (*pbl_addr >> 5), - (*pbl_size << 3), pbl, 0); - if (err) - goto ret; - } - /* write TPT entry */ if (reset_tpt_entry) memset(&tpt, 0, sizeof(tpt)); @@ -749,23 +726,23 @@ static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry, V_TPT_ADDR_TYPE((zbva ? TPT_ZBTO : TPT_VATO)) | V_TPT_PAGE_SIZE(page_size)); tpt.rsvd_pbl_addr = reset_tpt_entry ? 0 : - cpu_to_be32(V_TPT_PBL_ADDR(PBL_OFF(rdev_p, *pbl_addr)>>3)); + cpu_to_be32(V_TPT_PBL_ADDR(PBL_OFF(rdev_p, pbl_addr)>>3)); tpt.len = cpu_to_be32(len); tpt.va_hi = cpu_to_be32((u32) (to >> 32)); tpt.va_low_or_fbo = cpu_to_be32((u32) (to & 0xFFFFFFFFULL)); tpt.rsvd_bind_cnt_or_pstag = 0; tpt.rsvd_pbl_size = reset_tpt_entry ?
0 : - cpu_to_be32(V_TPT_PBL_SIZE((*pbl_size) >> 2)); + cpu_to_be32(V_TPT_PBL_SIZE(pbl_size >> 2)); } err = cxio_hal_ctrl_qp_write_mem(rdev_p, stag_idx + (rdev_p->rnic_info.tpt_base >> 5), - sizeof(tpt), &tpt, 1); + sizeof(tpt), &tpt); /* release the stag index to free pool */ if (reset_tpt_entry) cxio_hal_put_stag(rdev_p->rscp, stag_idx); -ret: + wptr = rdev_p->ctrl_qp.wptr; mutex_unlock(&rdev_p->ctrl_qp.lock); if (!err) @@ -776,44 +753,67 @@ ret: return err; } +int cxio_write_pbl(struct cxio_rdev *rdev_p, __be64 *pbl, + u32 pbl_addr, u32 pbl_size) +{ + u32 wptr; + int err; + + PDBG("%s *pdb_addr 0x%x, pbl_base 0x%x, pbl_size %d\n", + __func__, pbl_addr, rdev_p->rnic_info.pbl_base, + pbl_size); + + mutex_lock(&rdev_p->ctrl_qp.lock); + err = cxio_hal_ctrl_qp_write_mem(rdev_p, pbl_addr >> 5, pbl_size << 3, + pbl); + wptr = rdev_p->ctrl_qp.wptr; + mutex_unlock(&rdev_p->ctrl_qp.lock); + if (err) + return err; + + if (wait_event_interruptible(rdev_p->ctrl_qp.waitq, + SEQ32_GE(rdev_p->ctrl_qp.rptr, + wptr))) + return -ERESTARTSYS; + + return 0; +} + int cxio_register_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid, enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, - u8 page_size, __be64 *pbl, u32 *pbl_size, - u32 *pbl_addr) + u8 page_size, u32 pbl_size, u32 pbl_addr) { *stag = T3_STAG_UNSET; return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm, - zbva, to, len, page_size, pbl, pbl_size, pbl_addr); + zbva, to, len, page_size, pbl_size, pbl_addr); } int cxio_reregister_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid, enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, - u8 page_size, __be64 *pbl, u32 *pbl_size, - u32 *pbl_addr) + u8 page_size, u32 pbl_size, u32 pbl_addr) { return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm, - zbva, to, len, page_size, pbl, pbl_size, pbl_addr); + zbva, to, len, page_size, pbl_size, pbl_addr); } int cxio_dereg_mem(struct cxio_rdev *rdev_p, u32 stag, u32 pbl_size, u32 pbl_addr) { - return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL, - &pbl_size, &pbl_addr); + return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, + pbl_size, pbl_addr); } int cxio_allocate_window(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid) { - u32 pbl_size = 0; *stag = T3_STAG_UNSET; return __cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_MW, 0, 0, 0ULL, 0, 0, - NULL, &pbl_size, NULL); + 0, 0); } int cxio_deallocate_window(struct cxio_rdev *rdev_p, u32 stag) { - return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL, - NULL, NULL); + return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, + 0, 0); } int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.h b/drivers/infiniband/hw/cxgb3/cxio_hal.h index 69ab08e..6e128f6 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.h +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.h @@ -154,14 +154,14 @@ int cxio_create_qp(struct cxio_rdev *rdev, u32 kernel_domain, struct t3_wq *wq, int cxio_destroy_qp(struct cxio_rdev *rdev, struct t3_wq *wq, struct cxio_ucontext *uctx); int cxio_peek_cq(struct t3_wq *wr, struct t3_cq *cq, int opcode); +int cxio_write_pbl(struct cxio_rdev *rdev_p, __be64 *pbl, + u32 pbl_addr, u32 pbl_size); int cxio_register_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid, enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, - u8 page_size, __be64 *pbl, u32 *pbl_size, - u32 *pbl_addr); + u8 page_size, u32 pbl_size, u32 pbl_addr); int cxio_reregister_phys_mem(struct 
cxio_rdev *rdev, u32 * stag, u32 pdid, enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, - u8 page_size, __be64 *pbl, u32 *pbl_size, - u32 *pbl_addr); + u8 page_size, u32 pbl_size, u32 pbl_addr); int cxio_dereg_mem(struct cxio_rdev *rdev, u32 stag, u32 pbl_size, u32 pbl_addr); int cxio_allocate_window(struct cxio_rdev *rdev, u32 * stag, u32 pdid); diff --git a/drivers/infiniband/hw/cxgb3/iwch_mem.c b/drivers/infiniband/hw/cxgb3/iwch_mem.c index 58c3d61..ec49a5c 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_mem.c +++ b/drivers/infiniband/hw/cxgb3/iwch_mem.c @@ -35,17 +35,26 @@ #include #include "cxio_hal.h" +#include "cxio_resource.h" #include "iwch.h" #include "iwch_provider.h" -int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, - struct iwch_mr *mhp, - int shift, - __be64 *page_list) +static void iwch_finish_mem_reg(struct iwch_mr *mhp, u32 stag) { - u32 stag; u32 mmid; + mhp->attr.state = 1; + mhp->attr.stag = stag; + mmid = stag >> 8; + mhp->ibmr.rkey = mhp->ibmr.lkey = stag; + insert_handle(mhp->rhp, &mhp->rhp->mmidr, mhp, mmid); + PDBG("%s mmid 0x%x mhp %p\n", __func__, mmid, mhp); +} + +int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, int shift) +{ + u32 stag; if (cxio_register_phys_mem(&rhp->rdev, &stag, mhp->attr.pdid, @@ -53,28 +62,21 @@ int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, mhp->attr.zbva, mhp->attr.va_fbo, mhp->attr.len, - shift-12, - page_list, - &mhp->attr.pbl_size, &mhp->attr.pbl_addr)) + shift - 12, + mhp->attr.pbl_size, mhp->attr.pbl_addr)) return -ENOMEM; - mhp->attr.state = 1; - mhp->attr.stag = stag; - mmid = stag >> 8; - mhp->ibmr.rkey = mhp->ibmr.lkey = stag; - insert_handle(rhp, &rhp->mmidr, mhp, mmid); - PDBG("%s mmid 0x%x mhp %p\n", __func__, mmid, mhp); + + iwch_finish_mem_reg(mhp, stag); + return 0; } int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php, struct iwch_mr *mhp, int shift, - __be64 *page_list, int npages) { u32 stag; - u32 mmid; - /* We could support this... 
*/ if (npages > mhp->attr.pbl_size) @@ -87,19 +89,40 @@ int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php, mhp->attr.zbva, mhp->attr.va_fbo, mhp->attr.len, - shift-12, - page_list, - &mhp->attr.pbl_size, &mhp->attr.pbl_addr)) + shift - 12, + mhp->attr.pbl_size, mhp->attr.pbl_addr)) return -ENOMEM; - mhp->attr.state = 1; - mhp->attr.stag = stag; - mmid = stag >> 8; - mhp->ibmr.rkey = mhp->ibmr.lkey = stag; - insert_handle(rhp, &rhp->mmidr, mhp, mmid); - PDBG("%s mmid 0x%x mhp %p\n", __func__, mmid, mhp); + + iwch_finish_mem_reg(mhp, stag); + + return 0; +} + +int iwch_alloc_pbl(struct iwch_mr *mhp, int npages) +{ + mhp->attr.pbl_addr = cxio_hal_pblpool_alloc(&mhp->rhp->rdev, + npages << 3); + + if (!mhp->attr.pbl_addr) + return -ENOMEM; + + mhp->attr.pbl_size = npages; + return 0; } +void iwch_free_pbl(struct iwch_mr *mhp) +{ + cxio_hal_pblpool_free(&mhp->rhp->rdev, mhp->attr.pbl_addr, + mhp->attr.pbl_size << 3); +} + +int iwch_write_pbl(struct iwch_mr *mhp, __be64 *pages, int npages, int offset) +{ + return cxio_write_pbl(&mhp->rhp->rdev, pages, + mhp->attr.pbl_addr + (offset << 3), npages); +} + int build_phys_page_list(struct ib_phys_buf *buffer_list, int num_phys_buf, u64 *iova_start, diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index d07d3a3..8934178 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -442,6 +442,7 @@ static int iwch_dereg_mr(struct ib_mr *ib_mr) mmid = mhp->attr.stag >> 8; cxio_dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size, mhp->attr.pbl_addr); + iwch_free_pbl(mhp); remove_handle(rhp, &rhp->mmidr, mmid); if (mhp->kva) kfree((void *) (unsigned long) mhp->kva); @@ -475,6 +476,8 @@ static struct ib_mr *iwch_register_phys_mem(struct ib_pd *pd, if (!mhp) return ERR_PTR(-ENOMEM); + mhp->rhp = rhp; + /* First check that we have enough alignment */ if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) { ret = -EINVAL; @@ -492,7 +495,17 @@ static struct ib_mr *iwch_register_phys_mem(struct ib_pd *pd, if (ret) goto err; - mhp->rhp = rhp; + ret = iwch_alloc_pbl(mhp, npages); + if (ret) { + kfree(page_list); + goto err_pbl; + } + + ret = iwch_write_pbl(mhp, page_list, npages, 0); + kfree(page_list); + if (ret) + goto err_pbl; + mhp->attr.pdid = php->pdid; mhp->attr.zbva = 0; @@ -502,12 +515,15 @@ static struct ib_mr *iwch_register_phys_mem(struct ib_pd *pd, mhp->attr.len = (u32) total_size; mhp->attr.pbl_size = npages; - ret = iwch_register_mem(rhp, php, mhp, shift, page_list); - kfree(page_list); - if (ret) { - goto err; - } + ret = iwch_register_mem(rhp, php, mhp, shift); + if (ret) + goto err_pbl; + return &mhp->ibmr; + +err_pbl: + iwch_free_pbl(mhp); + err: kfree(mhp); return ERR_PTR(ret); @@ -560,7 +576,7 @@ static int iwch_reregister_phys_mem(struct ib_mr *mr, return ret; } - ret = iwch_reregister_mem(rhp, php, &mh, shift, page_list, npages); + ret = iwch_reregister_mem(rhp, php, &mh, shift, npages); kfree(page_list); if (ret) { return ret; @@ -602,6 +618,8 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, if (!mhp) return ERR_PTR(-ENOMEM); + mhp->rhp = rhp; + mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0); if (IS_ERR(mhp->umem)) { err = PTR_ERR(mhp->umem); @@ -615,10 +633,14 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, list_for_each_entry(chunk, &mhp->umem->chunk_list, list) n += chunk->nents; - pages = kmalloc(n * sizeof(u64), 
GFP_KERNEL); + err = iwch_alloc_pbl(mhp, n); + if (err) + goto err; + + pages = (__be64 *) __get_free_page(GFP_KERNEL); if (!pages) { err = -ENOMEM; - goto err; + goto err_pbl; } i = n = 0; @@ -630,25 +652,38 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, pages[i++] = cpu_to_be64(sg_dma_address( &chunk->page_list[j]) + mhp->umem->page_size * k); + if (i == PAGE_SIZE / sizeof *pages) { + err = iwch_write_pbl(mhp, pages, i, n); + if (err) + goto pbl_done; + n += i; + i = 0; + } } } - mhp->rhp = rhp; + if (i) + err = iwch_write_pbl(mhp, pages, i, n); + +pbl_done: + free_page((unsigned long) pages); + if (err) + goto err_pbl; + mhp->attr.pdid = php->pdid; mhp->attr.zbva = 0; mhp->attr.perms = iwch_ib_to_tpt_access(acc); mhp->attr.va_fbo = virt; mhp->attr.page_size = shift - 12; mhp->attr.len = (u32) length; - mhp->attr.pbl_size = i; - err = iwch_register_mem(rhp, php, mhp, shift, pages); - kfree(pages); + + err = iwch_register_mem(rhp, php, mhp, shift); if (err) - goto err; + goto err_pbl; if (udata && !t3a_device(rhp)) { uresp.pbl_addr = (mhp->attr.pbl_addr - - rhp->rdev.rnic_info.pbl_base) >> 3; + rhp->rdev.rnic_info.pbl_base) >> 3; PDBG("%s user resp pbl_addr 0x%x\n", __func__, uresp.pbl_addr); @@ -661,6 +696,9 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, return &mhp->ibmr; +err_pbl: + iwch_free_pbl(mhp); + err: ib_umem_release(mhp->umem); kfree(mhp); diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.h b/drivers/infiniband/hw/cxgb3/iwch_provider.h index db5100d..836163f 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.h +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.h @@ -340,14 +340,14 @@ int iwch_quiesce_qps(struct iwch_cq *chp); int iwch_resume_qps(struct iwch_cq *chp); void stop_read_rep_timer(struct iwch_qp *qhp); int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, - struct iwch_mr *mhp, - int shift, - __be64 *page_list); + struct iwch_mr *mhp, int shift); int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php, struct iwch_mr *mhp, int shift, - __be64 *page_list, int npages); +int iwch_alloc_pbl(struct iwch_mr *mhp, int npages); +void iwch_free_pbl(struct iwch_mr *mhp); +int iwch_write_pbl(struct iwch_mr *mhp, __be64 *pages, int npages, int offset); int build_phys_page_list(struct ib_phys_buf *buffer_list, int num_phys_buf, u64 *iova_start, -- 1.5.5.1 From holt at sgi.com Wed May 7 08:59:48 2008 From: holt at sgi.com (Robin Holt) Date: Wed, 7 May 2008 10:59:48 -0500 Subject: [ofa-general] Re: [PATCH 02 of 11] get_task_mm In-Reply-To: References: Message-ID: <20080507155948.GO18857@sgi.com> You can drop this patch. This turned out to be a race in xpmem. It "appeared" as if it were a race in get_task_mm, but it really is not. The current->mm field is cleared under the task_lock and the task_lock is grabbed by get_task_mm. I have been testing your v15 version without this patch and have not encountered the problem again (now that I fixed my xpmem race).
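For reference, get_task_mm() looks roughly like this (paraphrased from memory, not an exact quote of kernel/fork.c); since exit_mm() clears current->mm under the same task_lock, the read of ->mm and the mm_users increment below cannot race with it:

struct mm_struct *get_task_mm(struct task_struct *task)
{
	struct mm_struct *mm;

	task_lock(task);
	mm = task->mm;
	if (mm) {
		if (task->flags & PF_BORROWED_MM)
			mm = NULL;	/* kernel thread borrowing an mm */
		else
			atomic_inc(&mm->mm_users);
	}
	task_unlock(task);
	return mm;
}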
Thanks, Robin On Wed, May 07, 2008 at 04:35:52PM +0200, Andrea Arcangeli wrote: > # HG changeset patch > # User Andrea Arcangeli > # Date 1210115127 -7200 > # Node ID c5badbefeee07518d9d1acca13e94c981420317c > # Parent e20917dcc8284b6a07cfcced13dda4cbca850a9c > get_task_mm From andrea at qumranet.com Wed May 7 09:20:07 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 7 May 2008 18:20:07 +0200 Subject: [ofa-general] Re: [PATCH 02 of 11] get_task_mm In-Reply-To: <20080507155948.GO18857@sgi.com> References: <20080507155948.GO18857@sgi.com> Message-ID: <20080507162006.GB18260@duo.random> On Wed, May 07, 2008 at 10:59:48AM -0500, Robin Holt wrote: > You can drop this patch. > > This turned out to be a race in xpmem. It "appeared" as if it were a > race in get_task_mm, but it really is not. The current->mm field is > cleared under the task_lock and the task_lock is grabbed by get_task_mm. 100% agreed, I'll nuke it as it seems really a noop. > I have been testing your v15 version without this patch and have not > encountered the problem again (now that I fixed my xpmem race). Great. About your other deadlock, I'm curious if my deadlock fix for the i_mmap_sem patch helped. That was crashing kvm with a VM 2G in the swap + a swaphog allocating and freeing another 2G of swap in a loop. I couldn't reproduce any other problem with KVM since I fixed that bit, regardless of whether I apply only mmu-notifier-core (2.6.26 version) or the full patchset (post 2.6.26). From swise at opengridcomputing.com Wed May 7 09:29:49 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 May 2008 11:29:49 -0500 Subject: [ofa-general] [RFC/PATCH 1/2] RDMA/cxgb3: Don't add PBL memory to gen_pool in chunks In-Reply-To: References: Message-ID: <4821D8FD.9030705@opengridcomputing.com> Roland Dreier wrote: > Current iw_cxgb3 code adds PBL memory to the driver's gen_pool in 2 MB > chunks. This limits the largest single allocation that can be done to > the same size, which means that with 4 KB pages, each of which takes 8 > bytes of PBL memory, the largest memory region that can be allocated > is 1 GB (256K PBL entries * 4 KB/entry). > > Remove this limit by adding all the PBL memory in a single gen_pool > chunk, if possible. Add code that falls back to smaller chunks if > gen_pool_add() fails, which can happen if there is not sufficient > contiguous lowmem for the internal gen_pool bitmap. > > Signed-off-by: Roland Dreier > Acked-by: Steve Wise From swise at opengridcomputing.com Wed May 7 09:30:12 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 May 2008 11:30:12 -0500 Subject: [ofa-general] [RFC/PATCH 2/2] RDMA/cxgb3: Fix severe limit on userspace memory registration size In-Reply-To: References: Message-ID: <4821D914.7030403@opengridcomputing.com> Roland Dreier wrote: > Currently, iw_cxgb3 is severely limited on the amount of userspace > memory that can be registered in a single memory region, which > causes big problems for applications that expect to be able to > register 100s of MB. > > The problem is that the driver uses a single kmalloc()ed buffer to > hold the physical buffer list (PBL) for the entire memory region > during registration, which means that 8 bytes of contiguous memory are > required for each page of memory being registered. For example, a 64 > MB registration will require 128 KB of contiguous memory with 4 KB > pages, and it is unlikely that such an allocation will succeed on a busy > system.
> > This is purely a driver problem: the temporary page list buffer is not > needed by the hardware, so we can fix this by writing the PBL to the > hardware in page-sized chunks rather than all at once. We do this by > splitting the memory registration operation up into several steps: > > - Allocate PBL space in adapter memory for the full registration > - Copy PBL to adapter memory in chunks > - Allocate STag and enable memory region > > This also allows several other cleanups to the __cxio_tpt_op() > interface and related parts of the driver. > > This change leaves the reregister memory region and memory window > operations broken, but they already didn't work due to other > longstanding bugs, so fixing them will be left to a later patch. > > Signed-off-by: Roland Dreier > Acked-by: Steve Wise From riel at redhat.com Wed May 7 10:35:32 2008 From: riel at redhat.com (Rik van Riel) Date: Wed, 7 May 2008 13:35:32 -0400 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: References: Message-ID: <20080507133532.4a4df89d@bree.surriel.com> On Wed, 07 May 2008 16:35:51 +0200 Andrea Arcangeli wrote: > Signed-off-by: Andrea Arcangeli > Signed-off-by: Nick Piggin > Signed-off-by: Christoph Lameter Acked-by: Rik van Riel -- All rights reversed. From riel at redhat.com Wed May 7 10:39:43 2008 From: riel at redhat.com (Rik van Riel) Date: Wed, 7 May 2008 13:39:43 -0400 Subject: [ofa-general] Re: [PATCH 03 of 11] invalidate_page outside PT lock In-Reply-To: References: Message-ID: <20080507133943.3e76c899@bree.surriel.com> On Wed, 07 May 2008 16:35:53 +0200 Andrea Arcangeli wrote: > # HG changeset patch > # User Andrea Arcangeli > # Date 1210115129 -7200 > # Node ID d60d200565abde6a8ed45271e53cde9c5c75b426 > # Parent c5badbefeee07518d9d1acca13e94c981420317c > invalidate_page outside PT lock > > Moves all mmu notifier methods outside the PT lock (first and not last > step to make them sleep capable). This patch appears to undo some of the changes made by patch 01/11. Would it be an idea to merge them into one, so the first patch introduces the right conventions directly? -- All rights reversed. From riel at redhat.com Wed May 7 10:41:33 2008 From: riel at redhat.com (Rik van Riel) Date: Wed, 7 May 2008 13:41:33 -0400 Subject: [ofa-general] Re: [PATCH 04 of 11] free-pgtables In-Reply-To: <34f6a4bf67ce66714ba2.1210170954@duo.random> References: <34f6a4bf67ce66714ba2.1210170954@duo.random> Message-ID: <20080507134133.6c3f7d99@bree.surriel.com> On Wed, 07 May 2008 16:35:54 +0200 Andrea Arcangeli wrote: > Signed-off-by: Christoph Lameter > Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel -- All rights reversed. From riel at redhat.com Wed May 7 10:46:29 2008 From: riel at redhat.com (Rik van Riel) Date: Wed, 7 May 2008 13:46:29 -0400 Subject: [ofa-general] Re: [PATCH 05 of 11] unmap vmas tlb flushing In-Reply-To: <20bc6a66a86ef6bd6091.1210170955@duo.random> References: <20bc6a66a86ef6bd6091.1210170955@duo.random> Message-ID: <20080507134629.0dcfd4a1@bree.surriel.com> On Wed, 07 May 2008 16:35:55 +0200 Andrea Arcangeli wrote: > Signed-off-by: Christoph Lameter > Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel -- All rights reversed. 
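Coming back to the cxgb3 registration change acked above: the heart of the fix is the chunked PBL write-out. In simplified form it amounts to the following (a sketch of the patch's logic, with a hypothetical iterator standing in for the nested umem chunk/page walk; not the verbatim driver code):

        /* Stream the PBL into adapter memory one page-sized staging buffer
         * at a time instead of kmalloc()ing the whole list.  'i' counts
         * entries staged in the buffer, 'n' counts entries already written. */
        static int write_pbl_in_chunks(struct iwch_mr *mhp, struct ib_umem *umem)
        {
                __be64 *pages;
                u64 addr;
                int i = 0, n = 0, err = 0;

                pages = (__be64 *) __get_free_page(GFP_KERNEL);
                if (!pages)
                        return -ENOMEM;

                for_each_umem_dma_addr(umem, addr) {    /* hypothetical: yields each DMA address */
                        pages[i++] = cpu_to_be64(addr);
                        if (i == PAGE_SIZE / sizeof *pages) {   /* staging page full: flush it */
                                err = iwch_write_pbl(mhp, pages, i, n);
                                if (err)
                                        break;
                                n += i;
                                i = 0;
                        }
                }
                if (!err && i)                          /* flush the final partial chunk */
                        err = iwch_write_pbl(mhp, pages, i, n);
                free_page((unsigned long) pages);
                return err;
        }

The design point is that only one page of contiguous kernel memory is ever needed, no matter how large the registration is.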
From andrea at qumranet.com Wed May 7 10:57:05 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 7 May 2008 19:57:05 +0200 Subject: [ofa-general] Re: [PATCH 03 of 11] invalidate_page outside PT lock In-Reply-To: <20080507133943.3e76c899@bree.surriel.com> References: <20080507133943.3e76c899@bree.surriel.com> Message-ID: <20080507175705.GE18260@duo.random> On Wed, May 07, 2008 at 01:39:43PM -0400, Rik van Riel wrote: > Would it be an idea to merge them into one, so the first patch > introduces the right conventions directly? The only reason this isn't merged into one is that it requires non-obvious (though not difficult) changes to the core VM code. I wanted to keep an obviously safe approach for 2.6.26. The other conventions are only needed by XPMEM, and XPMEM can't work without all the other patches anyway. From rdreier at cisco.com Wed May 7 11:02:46 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 May 2008 11:02:46 -0700 Subject: [ofa-general] Re: [PATCH 7/7] IB/ipath - fix SDMA error recovery in absence of link status change In-Reply-To: <20080506183652.6521.21456.stgit@eng-46.mv.qlogic.com> (Ralph Campbell's message of "Tue, 06 May 2008 11:36:52 -0700") References: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> <20080506183652.6521.21456.stgit@eng-46.mv.qlogic.com> Message-ID: thanks, applied all 7 From rdreier at cisco.com Wed May 7 12:17:26 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 May 2008 12:17:26 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get fixes for various low-level HW driver issues: - cxgb3 severe limits on memory registration size - ehca QP async event race - ipath miscellaneous issues Dave Olson (2): IB/ipath: Fix bug that can leave sends disabled after freeze recovery IB/ipath: Need to always request and handle PIO avail interrupts John Gregor (1): IB/ipath: Fix SDMA error recovery in absence of link status change Michael Albaugh (2): IB/ipath: Only warn about prototype chip during init IB/ipath: Fix count of packets received by kernel Ralph Campbell (2): IB/ipath: Only increment SSN if WQE is put on send queue IB/ipath: Return the correct opcode for RDMA WRITE with immediate Roland Dreier (2): RDMA/cxgb3: Don't add PBL memory to gen_pool in chunks RDMA/cxgb3: Fix severe limit on userspace memory registration size Stefan Roscher (1): IB/ehca: Wait for async events to finish before destroying QP drivers/infiniband/hw/cxgb3/cxio_hal.c | 90 ++++++-------- drivers/infiniband/hw/cxgb3/cxio_hal.h | 8 +- drivers/infiniband/hw/cxgb3/cxio_resource.c | 36 +++++-- drivers/infiniband/hw/cxgb3/iwch_mem.c | 75 +++++++----- drivers/infiniband/hw/cxgb3/iwch_provider.c | 68 ++++++++++--- drivers/infiniband/hw/cxgb3/iwch_provider.h | 8 +- drivers/infiniband/hw/ehca/ehca_classes.h | 2 + drivers/infiniband/hw/ehca/ehca_irq.c | 4 + drivers/infiniband/hw/ehca/ehca_qp.c | 5 + drivers/infiniband/hw/ipath/ipath_driver.c | 138 ++++++++++++++++++++++--- drivers/infiniband/hw/ipath/ipath_file_ops.c | 72 ++++++-------- drivers/infiniband/hw/ipath/ipath_iba7220.c | 26 ++--- drivers/infiniband/hw/ipath/ipath_init_chip.c | 95 ++++++++---------- drivers/infiniband/hw/ipath/ipath_intr.c | 80 ++------------ drivers/infiniband/hw/ipath/ipath_kernel.h | 8 ++-
drivers/infiniband/hw/ipath/ipath_rc.c | 6 +- drivers/infiniband/hw/ipath/ipath_ruc.c | 7 +- drivers/infiniband/hw/ipath/ipath_sdma.c | 44 ++++++-- drivers/infiniband/hw/ipath/ipath_verbs.c | 2 +- 19 files changed, 458 insertions(+), 316 deletions(-) From akpm at linux-foundation.org Wed May 7 13:02:14 2008 From: akpm at linux-foundation.org (Andrew Morton) Date: Wed, 7 May 2008 13:02:14 -0700 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: References: Message-ID: <20080507130214.5884d94a.akpm@linux-foundation.org> On Wed, 07 May 2008 16:35:51 +0200 Andrea Arcangeli wrote: > # HG changeset patch > # User Andrea Arcangeli > # Date 1210096013 -7200 > # Node ID e20917dcc8284b6a07cfcced13dda4cbca850a9c > # Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8 > mmu-notifier-core > > ... > > --- a/include/linux/list.h > +++ b/include/linux/list.h > @@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis > * or hlist_del_rcu(), running on this same list. > * However, it is perfectly legal to run concurrently with > * the _rcu list-traversal primitives, such as > - * hlist_for_each_entry(). > + * hlist_for_each_entry_rcu(). > */ > static inline void hlist_del_rcu(struct hlist_node *n) > { > @@ -760,6 +760,34 @@ static inline void hlist_del_init(struct > if (!hlist_unhashed(n)) { > __hlist_del(n); > INIT_HLIST_NODE(n); > + } > +} > + > +/** > + * hlist_del_init_rcu - deletes entry from hash list with re-initialization > + * @n: the element to delete from the hash list. > + * > + * Note: list_unhashed() on entry does return true after this. It is Should that be "does" or "does not". "does", I suppose. It should refer to hlist_unhashed() The term "on entry" is a bit ambiguous - we normally use that as shorthand to mean "on entry to the function". So I'll change this to > + * Note: hlist_unhashed() on the node returns true after this. It is OK? > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -10,6 +10,7 @@ > #include > #include > #include > +#include OK, unrelated bugfix ;) > --- a/include/linux/srcu.h > +++ b/include/linux/srcu.h > @@ -27,6 +27,8 @@ > #ifndef _LINUX_SRCU_H > #define _LINUX_SRCU_H > > +#include And another. Fair enough. From akpm at linux-foundation.org Wed May 7 13:05:28 2008 From: akpm at linux-foundation.org (Andrew Morton) Date: Wed, 7 May 2008 13:05:28 -0700 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: References: Message-ID: <20080507130528.adfd154c.akpm@linux-foundation.org> On Wed, 07 May 2008 16:35:51 +0200 Andrea Arcangeli wrote: > # HG changeset patch > # User Andrea Arcangeli > # Date 1210096013 -7200 > # Node ID e20917dcc8284b6a07cfcced13dda4cbca850a9c > # Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8 > mmu-notifier-core The patch looks OK to me. The proposal is that we sneak this into 2.6.26. Are there any sufficiently-serious objections to this? The patch will be a no-op for 2.6.26. This is all rather unusual. For the record, could we please review the reasons for wanting to do this? Thanks. From torvalds at linux-foundation.org Wed May 7 13:30:39 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 13:30:39 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080507130528.adfd154c.akpm@linux-foundation.org> References: <20080507130528.adfd154c.akpm@linux-foundation.org> Message-ID: On Wed, 7 May 2008, Andrew Morton wrote: > > The patch looks OK to me. 
As far as I can tell, authorship has been destroyed by at least two of the patches (ie Christoph seems to be the author, but Andrea seems to have dropped that fact). > The proposal is that we sneak this into 2.6.26. Are there any > sufficiently-serious objections to this? Yeah, too late and no upside. That "locking" code is also too ugly to live, at least without some serious arguments for why it has to be done that way. Sorting the locks? In a vmalloc'ed area? And calling this something innocuous like "mm_lock()"? Hell no. That code needs some serious re-thinking. Linus From torvalds at linux-foundation.org Wed May 7 13:56:23 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 13:56:23 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <6b384bb988786aa78ef0.1210170958@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> Message-ID: On Wed, 7 May 2008, Andrea Arcangeli wrote: > > Convert the anon_vma spinlock to a rw semaphore. This allows concurrent > traversal of reverse maps for try_to_unmap() and page_mkclean(). It also > allows the calling of sleeping functions from reverse map traversal as > needed for the notifier callbacks. It includes possible concurrency. This also looks very debatable indeed. The only performance numbers quoted are: > This results in f.e. the Aim9 brk performance test to got down by 10-15%. which just seems like a total disaster. The whole series looks bad, in fact. Lack of authorship, bad single-line description, and the code itself sucks so badly that it's not even funny. NAK NAK NAK. All of it. It stinks. Linus From andrea at qumranet.com Wed May 7 14:26:50 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 7 May 2008 23:26:50 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> Message-ID: <20080507212650.GA8276@duo.random> On Wed, May 07, 2008 at 01:56:23PM -0700, Linus Torvalds wrote: > This also looks very debatable indeed. The only performance numbers quoted > are: > > > This results in f.e. the Aim9 brk performance test to got down by 10-15%. > > which just seems like a total disaster. > > The whole series looks bad, in fact. Lack of authorship, bad single-line Glad you agree. Note that the fact that the whole series looks bad is _exactly_ why I couldn't let Christoph keep mmu-notifier-core at the very end of his patchset. I had to move it to the top to have a chance to get the KVM and GRU requirements merged in 2.6.26. I think the spinlock->rwsem conversion is ok under a config option; as you can see, I complained about various of those patches myself, and I'll take care that they're in a mergeable state the moment I submit them. What XPMEM requires are different semantics for the methods, and we never had to do any blocking I/O during vmtruncate before; now we have to. And I don't see a problem in making the conversion from spinlock->rwsem only if CONFIG_XPMEM=y, as I doubt XPMEM works on anything but ia64. Please ignore all patches but mmu-notifier-core. I regularly forward _only_ mmu-notifier-core to Andrew; that's the only one that is in merge-ready status. Everything else is there just so XPMEM can test, and so we can keep discussing it to bring it to a mergeable state like mmu-notifier-core already is.
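To illustrate what I mean by doing the conversion under a config option, something along these lines (my sketch here, not code from the series):

        #ifdef CONFIG_XPMEM
        /* XPMEM needs to sleep during reverse map traversal: use an rwsem */
        #define anon_vma_lock(av)       down_write(&(av)->sem)
        #define anon_vma_unlock(av)     up_write(&(av)->sem)
        #else
        /* everybody else keeps the cheap non-sleeping spinlock */
        #define anon_vma_lock(av)       spin_lock(&(av)->lock)
        #define anon_vma_unlock(av)     spin_unlock(&(av)->lock)
        #endif

That way only kernels built for XPMEM-class hardware would pay for the sleepable lock.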
From torvalds at linux-foundation.org Wed May 7 14:36:57 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 14:36:57 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507212650.GA8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> Message-ID: On Wed, 7 May 2008, Andrea Arcangeli wrote: > > I think the spinlock->rwsem conversion is ok under config option, as > you can see I complained myself to various of those patches and I'll > take care they're in a mergeable state the moment I submit them. What > XPMEM requires are different semantics for the methods, and we never > had to do any blocking I/O during vmtruncate before, now we have to. I really suspect we don't really have to, and that it would be better to just fix the code that does that. > Please ignore all patches but mmu-notifier-core. I regularly forward > _only_ mmu-notifier-core to Andrew, that's the only one that is in > merge-ready status, everything else is just so XPMEM can test and we > can keep discussing it to bring it in a mergeable state like > mmu-notifier-core already is. The thing is, I didn't like that one *either*. I thought it was the biggest turd in the series (and by "biggest", I literally mean "most lines of turd-ness" rather than necessarily "ugliest per se"). I literally think that mm_lock() is an unbelievable piece of utter and horrible CRAP. There's simply no excuse for code like that. If you want to avoid the deadlock from taking multiple locks in order, but there is really just a single operation that needs it, there's a really really simple solution. And that solution is *not* to sort the whole damn f*cking list in a vmalloc'ed data structure prior to locking! Damn. No, the simple solution is to just make up a whole new upper-level lock, and get that lock *first*. You can then take all the multiple locks at a lower level in any order you damn well please. And yes, it's one more lock, and yes, it serializes stuff, but: - that code had better not be critical anyway, because if it was, then the whole "vmalloc+sort+lock+vunmap" sh*t was wrong _anyway_ - parallelism is overrated: it doesn't matter one effing _whit_ if something is a hundred times more parallel, if it's also a hundred times *SLOWER*. So dang it, flush the whole damn series down the toilet and either forget the thing entirely, or re-do it sanely. And here's an admission that I lied: it wasn't *all* clearly crap. I did like one part, namely list_del_init_rcu(), but that one should have been in a separate patch. I'll happily apply that one. Linus From andrea at qumranet.com Wed May 7 14:58:40 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 7 May 2008 23:58:40 +0200 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: References: <20080507130528.adfd154c.akpm@linux-foundation.org> Message-ID: <20080507215840.GB8276@duo.random> On Wed, May 07, 2008 at 01:30:39PM -0700, Linus Torvalds wrote: > > > On Wed, 7 May 2008, Andrew Morton wrote: > > > > The patch looks OK to me. > > As far as I can tell, authorship has been destroyed by at least two of the > patches (ie Christoph seems to be the author, but Andrea seems to have > dropped that fact). I can't follow this, please be more specific. About the patches I merged from Christoph, I didn't touch them at all (except for fixing a kernel crashing bug in them plus some reject fix). 
Initially I didn't even add a signed-off-by: andrea, and I only had the signed-off-by: christoph. But then he said I had to add my signed-off-by too, while I thought at most an acked-by was required. So if I got any attribution on Christoph's work it's only because he explicitly requested it as it was passing through my maintenance line. In any case, all patches except mmu-notifier-core are irrelevant in this context, and I'm entirely fine with giving Christoph attribution for the whole patchset, including mmu-notifier-core, where most of the code is mine. We had many discussions with Christoph, Robin and Jack, but I can assure you nobody had a single problem with regard to attribution. About all patches except mmu-notifier-core: Christoph, Robin and everyone (especially myself) agree those patches can't yet be merged in 2.6.26. With regard to the post-2.6.26 material, I think adding a config option to make the change at compile time is ok. And there's no other way to deal with it cleanly, as vmtruncate has to tear down pagetables, and if the i_mmap_lock is a spinlock there's no way to notify secondary mmus about it, if the ->invalidate_range_start method has to allocate an skb, send it through the network and wait for I/O completion with schedule(). > Yeah, too late and no upside. True, there's no upside for all the people setting CONFIG_KVM=n, but there's no downside for them either; that's the important fact! And for all the people setting CONFIG_KVM!=n, I should provide some background here. KVM MM development is halted without this; that includes: paging, ballooning, tlb flushing at large, pci-passthrough removing the page pin as a whole, etc... Everyone on kvm-devel talks about mmu-notifiers; check the last VT-d patch from Intel where Anthony (IBM/qemu/kvm) wonders how to handle things without mmu notifiers (mlock whatever). Rusty agreed we had to get mmu notifiers into 2.6.26 so much that he went as far as writing his own ultrasimple mmu notifier implementation, unfortunately too simple, as invalidate_range_start was missing, and without it we can't remove the page pinning and avoid doing spte=invalid;tlbflush;unpin for every group of sptes released. And without mm_lock, invalidate_range_start can't be implemented in a generic way (to work for GRU/XPMEM too). > That "locking" code is also too ugly to live, at least without some > serious arguments for why it has to be done that way. Sorting the locks? > In a vmalloc'ed area? And calling this something innocuous like > "mm_lock()"? Hell no. That's only invoked in mmu_notifier_register; mm_lock is explicitly documented as a heavyweight function. In the KVM case it's only called when a VM is created, which is irrelevant cpu cost compared to the time it takes the OS to boot in the VM... (especially without real mode emulation, with direct NPT-like secondary-mmu paging). mm_lock solved the fundamental race in the range_start/end invalidation model (which will allow GRU to do a single tlb flush for the whole range that is going to be freed by zap_page_range/unmap_vmas/whatever). Christoph merged mm_lock into his EMM versions of mmu notifiers moments after I released it; I think he wouldn't have done that if there were a better way. > That code needs some serious re-thinking.
Even if you're totally right, with Nick's mmu notifiers, Rusty's mmu notifiers, my original mmu notifiers, Christoph's first version of my mmu notifiers, my new mmu notifiers, Christoph's EMM version of my new mmu notifiers, my latest mmu notifiers, all the people making suggestions and testing the code and needing the code badly, and further patches awaiting inclusion during 2.6.27 in this area, it must be obvious to everyone that there's zero chance this code won't evolve over time to perfection; but we can't wait for it to be perfect before we start using it, or we're screwed. Even if it's entirely broken, this will allow kvm development to continue, and then we'll fix it (but don't worry, it works great at runtime and there are no race conditions; Jack and Robin are also using it with zero problems with GRU and XPMEM, just in case the KVM testing going great isn't enough). Furthermore, the API has been frozen for months, and everyone agrees with all the fundamental blocks in the mmu-notifier-core patch (to be complete, Christoph would like to replace invalidate_page with an invalidate_range_start/end, but that's a minor detail). And most importantly, we need something in now, regardless of which API. We can handle a change of API totally fine later. mm_lock() is not even part of the mmu notifier API, it's just an internal implementation detail, so whatever problem it has, or whatever better name we can find, isn't a high priority right now. If you suggest a better name now I'll fix it up immediately. I hope the mm_lock name and whatever signed-off-by error in the patches after mmu-notifier-core won't really be why this doesn't go in. Thanks a lot for your time reviewing, even if the outcome wasn't as positive as I hoped, Andrea From torvalds at linux-foundation.org Wed May 7 15:11:10 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 15:11:10 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080507215840.GB8276@duo.random> References: <20080507130528.adfd154c.akpm@linux-foundation.org> <20080507215840.GB8276@duo.random> Message-ID: On Wed, 7 May 2008, Andrea Arcangeli wrote: > > As far as I can tell, authorship has been destroyed by at least two of the > > patches (ie Christoph seems to be the author, but Andrea seems to have > > dropped that fact). > > I can't follow this, please be more specific. The patches were sent to lkml without *any* indication that you weren't actually the author. So if Andrew had merged them, they would have been merged as yours. > > That "locking" code is also too ugly to live, at least without some > > serious arguments for why it has to be done that way. Sorting the locks? > > In a vmalloc'ed area? And calling this something innocuous like > > "mm_lock()"? Hell no. > > That's only invoked in mmu_notifier_register; mm_lock is explicitly > documented as a heavyweight function. Is that an excuse for UTTER AND TOTAL CRAP? Linus From rdreier at cisco.com Wed May 7 15:14:42 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 May 2008 15:14:42 -0700 Subject: [ofa-general] [PATCH/RFC] RDMA/nes: Fix up nes_lro_max_aggr module parameter Message-ID: Fix some bugs with the max_aggr module parameter added with LRO support: - The module parameter value was ignored and not actually used to set lro_mgr.max_aggr. - MODULE_PARM_DESC had a typo "_mro_" instead of "_lro_" so it didn't end up describing the actual module parameter.
- The nes_lro_max_aggr variable was declared as unsigned, but the module_param line said "int" instead of "uint" for the type. - The default value for the parameter was stuck in the permissions field of module_param, which led to nonsensical permissions for the file under /sys/module/iw_nes/param. - The parameter was used in only one file but defined in another, which led to the variable being global for no good reason. Move everything related to the parameter to the file nes_hw.c where it is actually used. Signed-off-by: Roland Dreier --- drivers/infiniband/hw/nes/nes.c | 4 ---- drivers/infiniband/hw/nes/nes.h | 1 - drivers/infiniband/hw/nes/nes_hw.c | 6 +++++- 3 files changed, 5 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes.c b/drivers/infiniband/hw/nes/nes.c index 9f7364a..a4e9269 100644 --- a/drivers/infiniband/hw/nes/nes.c +++ b/drivers/infiniband/hw/nes/nes.c @@ -91,10 +91,6 @@ unsigned int nes_debug_level = 0; module_param_named(debug_level, nes_debug_level, uint, 0644); MODULE_PARM_DESC(debug_level, "Enable debug output level"); -unsigned int nes_lro_max_aggr = NES_LRO_MAX_AGGR; -module_param(nes_lro_max_aggr, int, NES_LRO_MAX_AGGR); -MODULE_PARM_DESC(nes_mro_max_aggr, " nic LRO MAX packet aggregation"); - LIST_HEAD(nes_adapter_list); static LIST_HEAD(nes_dev_list); diff --git a/drivers/infiniband/hw/nes/nes.h b/drivers/infiniband/hw/nes/nes.h index 1f9f7bf..61b46e9 100644 --- a/drivers/infiniband/hw/nes/nes.h +++ b/drivers/infiniband/hw/nes/nes.h @@ -173,7 +173,6 @@ extern int disable_mpa_crc; extern unsigned int send_first; extern unsigned int nes_drv_opt; extern unsigned int nes_debug_level; -extern unsigned int nes_lro_max_aggr; extern struct list_head nes_adapter_list; diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c index 8dc70f9..d3278f1 100644 --- a/drivers/infiniband/hw/nes/nes_hw.c +++ b/drivers/infiniband/hw/nes/nes_hw.c @@ -42,6 +42,10 @@ #include "nes.h" +static unsigned int nes_lro_max_aggr = NES_LRO_MAX_AGGR; +module_param(nes_lro_max_aggr, uint, 0444); +MODULE_PARM_DESC(nes_lro_max_aggr, "NIC LRO max packet aggregation"); + static u32 crit_err_count; u32 int_mod_timer_init; u32 int_mod_cq_depth_256; @@ -1738,7 +1742,7 @@ int nes_init_nic_qp(struct nes_device *nesdev, struct net_device *netdev) jumbomode = 1; nes_nic_init_timer_defaults(nesdev, jumbomode); } - nesvnic->lro_mgr.max_aggr = NES_LRO_MAX_AGGR; + nesvnic->lro_mgr.max_aggr = nes_lro_max_aggr; nesvnic->lro_mgr.max_desc = NES_MAX_LRO_DESCRIPTORS; nesvnic->lro_mgr.lro_arr = nesvnic->lro_desc; nesvnic->lro_mgr.get_skb_header = nes_lro_get_skb_hdr; -- 1.5.5.1 From andrea at qumranet.com Wed May 7 15:22:05 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 00:22:05 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> Message-ID: <20080507222205.GC8276@duo.random> On Wed, May 07, 2008 at 02:36:57PM -0700, Linus Torvalds wrote: > > had to do any blocking I/O during vmtruncate before, now we have to. > > I really suspect we don't really have to, and that it would be better to > just fix the code that does that. I'll let you discuss with Christoph and Robin about it. The moment I heard the schedule inside ->invalidate_page() requirement I reacted the same way you did. 
But I don't see any other real solution for XPMEM than spin-looping, halting the scheduler for ages while the ack is received from the network device. But mm_lock is required even without XPMEM. And srcu is also required without XPMEM, to allow ->release to schedule (however, downgrading srcu to rcu would result in a very small patch; srcu and rcu are about the same with a kernel supporting preempt=y like 2.6.26). > I literally think that mm_lock() is an unbelievable piece of utter and > horrible CRAP. > > There's simply no excuse for code like that. I think it's a great smp scalability optimization over the global lock you're proposing below. > No, the simple solution is to just make up a whole new upper-level lock, > and get that lock *first*. You can then take all the multiple locks at a > lower level in any order you damn well please. Unfortunately the lock you're talking about would be: static spinlock_t global_lock = ... There's no way to make it more granular. So every time before taking any ->i_mmap_lock _and_ any anon_vma->lock, we'd need to take that extremely wide spinlock first (and even worse, later it would become a rwsem when XPMEM is selected, making the VM even slower than it already becomes when XPMEM support is selected at compile time). > And yes, it's one more lock, and yes, it serializes stuff, but: > > - that code had better not be critical anyway, because if it was, then > the whole "vmalloc+sort+lock+vunmap" sh*t was wrong _anyway_ mmu_notifier_register can take ages. No problem. > - parallelism is overrated: it doesn't matter one effing _whit_ if > something is a hundred times more parallel, if it's also a hundred > times *SLOWER*. mmu_notifier_register is fine being a hundred times slower (under preempt-rt those locks all become sleeping locks anyway, so no problem). > And here's an admission that I lied: it wasn't *all* clearly crap. I did > like one part, namely list_del_init_rcu(), but that one should have been > in a separate patch. I'll happily apply that one. Sure, I'll split it from the rest if mmu-notifier-core isn't merged. My objective has been: 1) add zero overhead to the VM before anybody starts a VM with kvm, and still zero overhead for all other tasks except the task where the VM runs. The only exception is the unlikely(!mm->mmu_notifier_mm) check, and that is optimized away too when CONFIG_KVM=n. And even for that check, my invalidate_page reduces the number of branches to the absolute minimum possible. 2) avoid any new cacheline collision in the fast paths, so numa systems don't nearly-crash (mm->mmu_notifier_mm will be shared and never written, except during the first mmu_notifier_register) 3) avoid any risk of introducing regressions in 2.6.26 (the patch must be obviously safe). Even if mm_lock were a bad idea like you say, it's orders of magnitude safer, even if entirely broken, than messing with the VM core locking in 2.6.26. mm_lock (or whatever name you like to give it; I admit mm_lock may not sound worrisome enough to keep people from calling it in a fast path) is going to be the real deal for the long term, to allow mmu_notifier_register to serialize against invalidate_range_start/end.
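For concreteness, the whole scheme boils down to something like this (a heavily simplified sketch of the idea; the real code also collects the i_mmap_locks, handles the failure paths, and unlocks in mm_unlock):

        static int cmp_lock_ptr(const void *a, const void *b)
        {
                /* total order on lock addresses, so everyone locks in the same order */
                const spinlock_t *l = *(spinlock_t * const *)a;
                const spinlock_t *r = *(spinlock_t * const *)b;

                if (l < r)
                        return -1;
                if (l > r)
                        return 1;
                return 0;
        }

        static int mm_lock_sketch(struct mm_struct *mm)
        {
                struct vm_area_struct *vma;
                spinlock_t **locks;
                int i, nr = 0;

                locks = vmalloc(mm->map_count * sizeof(*locks));
                if (!locks)
                        return -ENOMEM;
                for (vma = mm->mmap; vma; vma = vma->vm_next)
                        if (vma->anon_vma)
                                locks[nr++] = &vma->anon_vma->lock;
                sort(locks, nr, sizeof(*locks), cmp_lock_ptr, NULL);
                for (i = 0; i < nr; i++)
                        if (!i || locks[i] != locks[i - 1])     /* skip duplicates after sorting */
                                spin_lock(locks[i]);
                /* ... register the notifier, then unlock in reverse order and vfree ... */
                return 0;
        }

That's the "vmalloc+sort+lock+vunmap" being objected to: expensive, but confined to registration time.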
If I fail in 2.6.26 I'll offer maintainership to Christoph as promised, and you'll find him pushing for mm_lock to be merged (as XPMEM/GRU aren't technologies running on cellphones where your global wide spinlock is optimized away at compile time, and he also has to deal with XPMEM, where such a spinlock would need to become a rwsem as the anon_vma->sem has to be taken after it). But let's assume you're entirely right here that mm_lock is going to be dropped and there's a better way: it's still a fine solution for 2.6.26. And if you prefer, I can move the whole mm_lock() from mmap.c/mm.h to mmu_notifier.[ch] so you don't get any pollution in the core VM; mm_lock will then be invisible to everything except callers of mmu_notifier_register(), and it will be trivial to remove later if you really want to add a global spinlock, as there's no way to be more granular than a _global_ numa-wide spinlock taken before any i_mmap_lock/anon_vma->lock without my mm_lock. From rdreier at cisco.com Wed May 7 15:22:30 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 May 2008 15:22:30 -0700 Subject: [ofa-general] [2.6.27 PATCH/RFC] IB/srp: Remove use of cached P_Key/GID queries Message-ID: The SRP initiator is currently using ib_find_cached_pkey() and ib_get_cached_gid() in situations where the uncached ib_find_pkey() and ib_query_gid() functions serve just as well: sleeping is allowed and performance is not an issue. Since we want to eliminate the cached operations in the long term, convert SRP to use the uncached variants. Signed-off-by: Roland Dreier --- Anyone have concerns about queueing this for the next merge window? drivers/infiniband/ulp/srp/ib_srp.c | 13 +++++-------- 1 files changed, 5 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 4351457..81cc59c 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -49,8 +49,6 @@ #include #include -#include - #include "ib_srp.h" #define DRV_NAME "ib_srp" @@ -183,10 +181,10 @@ static int srp_init_qp(struct srp_target_port *target, if (!attr) return -ENOMEM; - ret = ib_find_cached_pkey(target->srp_host->srp_dev->dev, - target->srp_host->port, - be16_to_cpu(target->path.pkey), - &attr->pkey_index); + ret = ib_find_pkey(target->srp_host->srp_dev->dev, + target->srp_host->port, + be16_to_cpu(target->path.pkey), + &attr->pkey_index); if (ret) goto out; @@ -1883,8 +1881,7 @@ static ssize_t srp_create_target(struct device *dev, if (ret) goto err; - ib_get_cached_gid(host->srp_dev->dev, host->port, 0, - &target->path.sgid); + ib_query_gid(host->srp_dev->dev, host->port, 0, &target->path.sgid); shost_printk(KERN_DEBUG, target->scsi_host, PFX "new target: id_ext %016llx ioc_guid %016llx pkey %04x " -- 1.5.5.1 From rdreier at cisco.com Wed May 7 15:26:46 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 May 2008 15:26:46 -0700 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: <4820A427.1070405@opengridcomputing.com> (Steve Wise's message of "Tue, 06 May 2008 13:32:07 -0500") References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> Message-ID: > 3) if using SEND, then a recv completion is always generated. I'm just trying to define the scope of the issue here... so is there any conceivable real-life situation where neither a 0B read nor a 0B write would work, and the connection setup will have to use a 0B send? - R.
From andrea at qumranet.com Wed May 7 15:27:58 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 00:27:58 +0200 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: References: <20080507130528.adfd154c.akpm@linux-foundation.org> <20080507215840.GB8276@duo.random> Message-ID: <20080507222758.GD8276@duo.random> On Wed, May 07, 2008 at 03:11:10PM -0700, Linus Torvalds wrote: > > > On Wed, 7 May 2008, Andrea Arcangeli wrote: > > > > As far as I can tell, authorship has been destroyed by at least two of the > > > patches (ie Christoph seems to be the author, but Andrea seems to have > > > dropped that fact). > > > > I can't follow this, please be more specific. > > The patches were sent to lkml without *any* indication that you weren't > actually the author. > > So if Andrew had merged them, they would have been merged as yours. I rechecked and I guarantee that the patches where Christoph isn't listed are developed by myself and he didn't write a single line on them. In any case I expect Christoph to review (he's CCed) and to point me to any attribution error. The only mistake I made once in that area was to give too _little_ attribution to myself: he asked me to add myself in the signed-off, so I added myself at Christoph's own request, but be sure I didn't remove him! From swise at opengridcomputing.com Wed May 7 15:29:44 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 May 2008 17:29:44 -0500 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> Message-ID: <48222D58.9020205@opengridcomputing.com> Roland Dreier wrote: > > 3) if using SEND, then a recv completion is always generated. > > I'm just trying to define the scope of the issue here... so is there any > conceivable real-life situation where neither a 0B read nor a 0B write > would work, and the connection setup will have to use a 0B send? > i'm not sure what you mean by "real-life". For the rnics we have: nes - requires 0b write cxgb3 - requires 0b read amso1100 - won't work in p2p mode So there are none that I know of that require a send for this. From rdreier at cisco.com Wed May 7 15:31:08 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 May 2008 15:31:08 -0700 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080507222758.GD8276@duo.random> (Andrea Arcangeli's message of "Thu, 8 May 2008 00:27:58 +0200") References: <20080507130528.adfd154c.akpm@linux-foundation.org> <20080507215840.GB8276@duo.random> <20080507222758.GD8276@duo.random> Message-ID: > I rechecked and I guarantee that the patches where Christoph isn't > listed are developed by myself and he didn't write a single line on > them. In any case I expect Christoph to review (he's CCed) and to > point me to any attribution error. The only mistake I made once in that > area was to give too _little_ attribution to myself: he asked me to > add myself in the signed-off, so I added myself at Christoph's own > request, but be sure I didn't remove him! I think the point you're missing is that any patches written by Christoph need a line like From: Christoph Lameter at the top of the body so that Christoph becomes the author when it is committed into git. The Signed-off-by: line needs to be preserved too of course, but it is not sufficient by itself. - R.
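In other words, a forwarded patch whose author is Christoph should start out something like this (illustrative layout, following SubmittingPatches):

        From: Christoph Lameter <clameter at sgi.com>

        [PATCH] subsystem: one-line summary phrase

        <changelog body explaining the change>

        Signed-off-by: Christoph Lameter <clameter at sgi.com>
        Signed-off-by: Andrea Arcangeli <andrea at qumranet.com>

The explicit From: line in the body is what git uses for authorship; the sign-off chain only records who handled the patch.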
From akpm at linux-foundation.org Wed May 7 15:31:03 2008 From: akpm at linux-foundation.org (Andrew Morton) Date: Wed, 7 May 2008 15:31:03 -0700 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507222205.GC8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> Message-ID: <20080507153103.237ea5b6.akpm@linux-foundation.org> On Thu, 8 May 2008 00:22:05 +0200 Andrea Arcangeli wrote: > > No, the simple solution is to just make up a whole new upper-level lock, > > and get that lock *first*. You can then take all the multiple locks at a > > lower level in any order you damn well please. > > Unfortunately the lock you're talking about would be: > > static spinlock_t global_lock = ... > > There's no way to make it more granular. > > So every time before taking any ->i_mmap_lock _and_ any anon_vma->lock > we'd need to take that extremely wide spinlock first (and even worse, > later it would become a rwsem when XPMEM is selected making the VM > even slower than it already becomes when XPMEM support is selected at > compile time). Nope. We only need to take the global lock before taking *two or more* of the per-vma locks. I really wish I'd thought of that. From rdreier at cisco.com Wed May 7 15:33:53 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 May 2008 15:33:53 -0700 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: <48222D58.9020205@opengridcomputing.com> (Steve Wise's message of "Wed, 07 May 2008 17:29:44 -0500") References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> <48222D58.9020205@opengridcomputing.com> Message-ID: > > I'm just trying to define the scope of the issue here... so is there any > > conceivable real-life situation where neither a 0B read nor a 0B write > > would work, and the connection setup will have to use a 0B send? > i'm not sure what you mean by "real-life". For the rnics we have: > > nes - requires 0b write > cxgb3 - requires 0b read > amso1100 - won't work in p2p mode > > So there are none that I know of that require a send for this. I guess my question was whether we expect to ever need to worry about the 0B send case, or whether it's just theoretical. If no current NICs have a problem with read or write, and future NICs will be built to a future MPA spec, then it seems we don't have to worry about what happens if a 0B send is done as part of connection setup. The spurious CQE on connection failure and the private data breakage are serious obviously. The interoperability issues of this stuff seem pretty painful to me. From andrea at qumranet.com Wed May 7 15:37:38 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 00:37:38 +0200 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080507222758.GD8276@duo.random> References: <20080507130528.adfd154c.akpm@linux-foundation.org> <20080507215840.GB8276@duo.random> <20080507222758.GD8276@duo.random> Message-ID: <20080507223738.GF8276@duo.random> On Thu, May 08, 2008 at 12:27:58AM +0200, Andrea Arcangeli wrote: > I rechecked and I guarantee that the patches where Christoph isn't > listed are developed by myself and he didn't write a single line on > them. In any case I expect Christoph to review (he's CCed) and to > point me to any attribution error. 
The only mistake I made once in that > area was to give too _little_ attribution to myself: he asked me to > add myself in the signed-off, so I added myself at Christoph's own > request, but be sure I didn't remove him! By PM (guess he's scared to post to this thread ;) Christoph tells me that what you perhaps mean is that I should add a From: Christoph line in the body of the email when the first signed-off-by is from Christoph, to indicate the first signoff was by him and the patch in turn was started by him. I thought the order of the signoffs was enough; but if that From: was mandatory and missing, any error obviously wasn't intentional, especially given that I only left a signed-off-by: christoph on his patches until he asked me to add my signoff too. Correcting it is trivial, given that I carefully ordered the signoffs so that the author is at the top of the signoff list. At least for mmu-notifier-core, given that I obviously am the original author of that code, I hope the From: of the email itself was enough even if an additional From: andrea was missing in the body. Also you can be sure that Christoph and especially Robin (XPMEM) will be more than happy if all patches with Christoph at the top of the signed-off-by are merged in 2.6.26, even though there wasn't a From: christoph at the top of the body ;). So I don't see a big deal here... From andrea at qumranet.com Wed May 7 15:39:14 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 00:39:14 +0200 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: References: <20080507130528.adfd154c.akpm@linux-foundation.org> <20080507215840.GB8276@duo.random> <20080507222758.GD8276@duo.random> Message-ID: <20080507223914.GG8276@duo.random> On Wed, May 07, 2008 at 03:31:08PM -0700, Roland Dreier wrote: > I think the point you're missing is that any patches written by > Christoph need a line like > > From: Christoph Lameter > > at the top of the body so that Christoph becomes the author when it is > committed into git. The Signed-off-by: line needs to be preserved too > of course, but it is not sufficient by itself. Ok so I see the problem Linus is referring to now (I received the hint by PM too); I thought the order of the signed-off-by was relevant, but it clearly isn't, or we're wasting space ;) From swise at opengridcomputing.com Wed May 7 15:41:01 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 May 2008 17:41:01 -0500 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> <48222D58.9020205@opengridcomputing.com> Message-ID: <48222FFD.40302@opengridcomputing.com> Roland Dreier wrote: > > > I'm just trying to define the scope of the issue here... so is there any > > > conceivable real-life situation where neither a 0B read nor a 0B write > > > would work, and the connection setup will have to use a 0B send? > > > i'm not sure what you mean by "real-life". For the rnics we have: > > > > nes - requires 0b write > > cxgb3 - requires 0b read > > amso1100 - won't work in p2p mode > > > > So there are none that I know of that require a send for this. > > I guess my question was whether we expect to ever need to worry about > the 0B send case, or whether it's just theoretical. If no current NICs > have a problem with read or write, and future NICs will be built to a > future MPA spec, then it seems we don't have to worry about what happens > if a 0B send is done as part of connection setup. > > I agree.
We can dump the 0B send stuff. > The spurious CQE on connection failure and the private data breakage are > serious obviously. The interoperability issues of this stuff seem > pretty painful to me. It is painful. But without anything, you cannot run OMPI, IMPI or HPMPI on an iwarp cluster with mixed vendor rnics... Steve. From steiner at sgi.com Wed May 7 15:42:33 2008 From: steiner at sgi.com (Jack Steiner) Date: Wed, 7 May 2008 17:42:33 -0500 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507212650.GA8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> Message-ID: <20080507224232.GA24600@sgi.com> > And I don't see a problem in making the conversion from > spinlock->rwsem only if CONFIG_XPMEM=y as I doubt XPMEM works on > anything but ia64. That is currently true, but we are also working on XPMEM for x86_64. The new XPMEM code should be posted within a few weeks. --- jack From andrea at qumranet.com Wed May 7 15:44:06 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 00:44:06 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507153103.237ea5b6.akpm@linux-foundation.org> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> Message-ID: <20080507224406.GI8276@duo.random> On Wed, May 07, 2008 at 03:31:03PM -0700, Andrew Morton wrote: > Nope. We only need to take the global lock before taking *two or more* of > the per-vma locks. > > I really wish I'd thought of that. I don't see how you can avoid taking the system-wide-global lock before every single anon_vma->lock/i_mmap_lock out there without mm_lock. Please note, we can't allow a thread to be in the middle of zap_page_range while mmu_notifier_register runs. vmtruncate takes one single lock, the i_mmap_lock of the inode. Not more than one lock, and we'd still have to take the global-system-wide lock _before_ this single i_mmap_lock and no other lock at all. Please elaborate, thanks! From torvalds at linux-foundation.org Wed May 7 15:44:24 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 15:44:24 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507222205.GC8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > > Unfortunately the lock you're talking about would be: > > static spinlock_t global_lock = ... > > There's no way to make it more granular. Right. So what? It's still about a million times faster than what the code does now. Your comment about "great smp scalability optimization" just shows that you're a moron. It is no such thing. The fact is, it's a horrible pessimization, since even SMP will be *SLOWER*. It will just be "less slower" when you have a million CPU's and they all try to do this at the same time (which probably never ever happens). In other words, "scalability" is totally meaningless. The only thing that matters is *performance*. If the "scalable" version performs WORSE, then it is simply worse. Not better. End of story. > mmu_notifier_register can take ages. No problem. So what you're saying is that performance doesn't matter? So why do you do the ugly crazy hundred-line implementation, when a simple two-liner would do equally well? Your arguments are crap. Anyway, discussion over. This code doesn't get merged. It doesn't get merged before 2.6.26, and it doesn't get merged _after_ either. Rewrite the code, or not. I don't care. I'll very happily not merge crap
Your arguments are crap. Anyway, discussion over. This code doesn't get merged. It doesn't get merged before 2.6.26, and it doesn't get merged _after_ either. Rewrite the code, or not. I don't care. I'll very happily not merge crap for the rest of my life. Linus From andrea at qumranet.com Wed May 7 15:58:01 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 00:58:01 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> Message-ID: <20080507225801.GK8276@duo.random> On Wed, May 07, 2008 at 03:44:24PM -0700, Linus Torvalds wrote: > > > On Thu, 8 May 2008, Andrea Arcangeli wrote: > > > > Unfortunately the lock you're talking about would be: > > > > static spinlock_t global_lock = ... > > > > There's no way to make it more granular. > > Right. So what? > > It's still about a million times faster than what the code does now. mmu_notifier_register only runs when windows or linux or macosx boots. Who could ever care of the msec spent in mm_lock compared to the time it takes to linux to boot? What you're proposing is to slowdown AIM and certain benchmarks 20% or more for all users, just so you save at most 1msec to start a VM. > Rewrite the code, or not. I don't care. I'll very happily not merge crap > for the rest of my life. If you want the global lock I'll do it no problem, I just think it's obviously inferior solution for 99% of users out there (including kvm users that will also have to take that lock while kvm userland runs). In my view the most we should do in this area is to reduce further the max number of locks to take if max_map_count already isn't enough. From akpm at linux-foundation.org Wed May 7 15:59:14 2008 From: akpm at linux-foundation.org (Andrew Morton) Date: Wed, 7 May 2008 15:59:14 -0700 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507224406.GI8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> Message-ID: <20080507155914.d7790069.akpm@linux-foundation.org> On Thu, 8 May 2008 00:44:06 +0200 Andrea Arcangeli wrote: > On Wed, May 07, 2008 at 03:31:03PM -0700, Andrew Morton wrote: > > Nope. We only need to take the global lock before taking *two or more* of > > the per-vma locks. > > > > I really wish I'd thought of that. > > I don't see how you can avoid taking the system-wide-global lock > before every single anon_vma->lock/i_mmap_lock out there without > mm_lock. > > Please note, we can't allow a thread to be in the middle of > zap_page_range while mmu_notifier_register runs. > > vmtruncate takes 1 single lock, the i_mmap_lock of the inode. Not more > than one lock and we've to still take the global-system-wide lock > _before_ this single i_mmap_lock and no other lock at all. > > Please elaborate, thanks! umm... CPU0: CPU1: spin_lock(a->lock); spin_lock(b->lock); spin_lock(b->lock); spin_lock(a->lock); bad. CPU0: CPU1: spin_lock(global_lock) spin_lock(global_lock); spin_lock(a->lock); spin_lock(b->lock); spin_lock(b->lock); spin_lock(a->lock); Is OK. CPU0: CPU1: spin_lock(global_lock) spin_lock(a->lock); spin_lock(b->lock); spin_lock(b->lock); spin_unlock(b->lock); spin_lock(a->lock); spin_unlock(a->lock); also OK. 
As long as all code paths which can take two or more locks are covered by the global lock, there is no deadlock scenario. If a thread takes just a single instance of one of these locks without taking the global_lock, then there is also no deadlock. Now, if we need to take both anon_vma->lock AND i_mmap_lock in the newly added mm_lock() thing and we also take both those locks at the same time in regular code, we're probably screwed. From torvalds at linux-foundation.org Wed May 7 16:00:13 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 16:00:13 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080507222758.GD8276@duo.random> References: <20080507130528.adfd154c.akpm@linux-foundation.org> <20080507215840.GB8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > > I rechecked and I guarantee that the patches where Christoph isn't > listed are developed by myself. How long have you been doing kernel development? How about you read SubmittingPatches a few times before you show just how clueless you are? Hint: look for the string that says "From:". Also look at the section that talks about "summary phrase". You got it all wrong, and you don't even seem to realize that you got it wrong, even when I told you. Linus From andrea at qumranet.com Wed May 7 16:02:42 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 01:02:42 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507225801.GK8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507225801.GK8276@duo.random> Message-ID: <20080507230242.GL8276@duo.random> To remove mm_lock without adding a horrible system-wide lock before every i_mmap_lock etc., we have to remove invalidate_range_begin/end. Then we can return to an older approach of doing only invalidate_page and serializing it with the PT lock against get_user_pages. That works fine for KVM, but GRU will have to flush the tlb every time we drop the PT lock, that means once per 512 ptes on x86-64 etc..., instead of a single time for the whole range regardless of how large the range is. From torvalds at linux-foundation.org Wed May 7 16:03:00 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 16:03:00 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080507223914.GG8276@duo.random> References: <20080507130528.adfd154c.akpm@linux-foundation.org> <20080507215840.GB8276@duo.random> <20080507222758.GD8276@duo.random> <20080507223914.GG8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > > Ok so I see the problem Linus is referring to now (I received the hint > by PM too); I thought the order of the signed-off-by was relevant, but it > clearly isn't, or we're wasting space ;) The order of the signed-offs is somewhat relevant, but no, sign-offs don't mean authorship. See the rules for sign-off: you can sign off on another person's patches, even if they didn't sign off on them themselves. That's clause (b) in particular. So yes, quite often you'd _expect_ the first sign-off to match the author, but that's a correlation, not a causal relationship.
Linus From torvalds at linux-foundation.org Wed May 7 16:09:48 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 16:09:48 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507225801.GK8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507225801.GK8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > > mmu_notifier_register only runs when windows or linux or macosx > boots. Who could ever care about the msec spent in mm_lock compared to > the time it takes linux to boot? Andrea, you're *this* close to going to my list of people who it is not worth reading email from, and where it's better for everybody involved if I just teach my spam-filter about it. That code was CRAP. That code was crap whether it's used once, or whether it's used a million times. Stop making excuses for it just because it's not performance-critical. So give it up already. I told you what the non-crap solution was. It's simpler, faster, and is about two lines of code compared to the crappy version (which was what - 200 lines of crap with a big comment on top of it just to explain the idiocy?). So until you can understand the better solution, don't even bother emailing me, ok? Because the next email I get from you that shows the intelligence level of a gnat, I'll just give up and put you in a spam-filter. Because my IQ goes down just from reading your mails. I can't afford to continue. Linus From torvalds at linux-foundation.org Wed May 7 16:19:05 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 16:19:05 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507155914.d7790069.akpm@linux-foundation.org> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> Message-ID: On Wed, 7 May 2008, Andrew Morton wrote: > > Now, if we need to take both anon_vma->lock AND i_mmap_lock in the newly > added mm_lock() thing and we also take both those locks at the same time in > regular code, we're probably screwed. No, just use the normal static ordering for that case: one type of lock goes before the other kind. If those locks nest in regular code, you have to do that *anyway*. The code that can take many locks will have to get the global lock *and* order the types, but that's still trivial. It's something like

	spin_lock(&global_lock);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (vma->anon_vma)
			spin_lock(&vma->anon_vma->lock);
	}
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (!vma->anon_vma && vma->vm_file && vma->vm_file->f_mapping)
			spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
	}
	spin_unlock(&global_lock);

and now everybody follows the rule that "anon_vma->lock" precedes "i_mmap_lock". So there can be no ABBA deadlock between the normal users and the many-locks version, and there can be no ABBA deadlock between many-locks-takers because they use the global_lock to serialize. This really isn't rocket science, guys.
(I really hope and believe that they don't nest anyway, and that you can just use a single for-loop for the many-lock case) Linus From sean.hefty at intel.com Wed May 7 16:25:06 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 7 May 2008 16:25:06 -0700 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: <48222FFD.40302@opengridcomputing.com> References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> <48222D58.9020205@opengridcomputing.com> <48222FFD.40302@opengridcomputing.com> Message-ID: <001001c8b099$960a9560$5fd4180a@amr.corp.intel.com>

>> > nes - requires 0b write
>> > cxgb3 - requires 0b read
>> > amso1100 - won't work in p2p mode

I'm assuming by requires that you, uhm, mean requires, and nes couldn't do 0b reads, or cxgb3 0b writes.

>Its is painful. But without anything, you cannot run OMPI, IMPI or
>HPMPI on a iwarp cluster with mixed vendor rnics...

Is there any requirement at the receiving side, versus the initiating side? That is, just because nes issues a 0b write, does the receiving HW care if a read or write shows up? Or is this restriction on both sides? - Sean From benh at kernel.crashing.org Wed May 7 16:28:38 2008 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 08 May 2008 09:28:38 +1000 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507224406.GI8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> Message-ID: <1210202918.1421.20.camel@pasglop> On Thu, 2008-05-08 at 00:44 +0200, Andrea Arcangeli wrote: > > Please note, we can't allow a thread to be in the middle of > zap_page_range while mmu_notifier_register runs. You said yourself that mmu_notifier_register can be as slow as you want ... what about using stop_machine for it? I'm not even joking here :-) > vmtruncate takes 1 single lock, the i_mmap_lock of the inode. Not more > than one lock and we've to still take the global-system-wide lock _before_ this single i_mmap_lock and no other lock at all. Ben. From clameter at sgi.com Wed May 7 16:39:39 2008 From: clameter at sgi.com (Christoph Lameter) Date: Wed, 7 May 2008 16:39:39 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> Message-ID: On Wed, 7 May 2008, Linus Torvalds wrote:

> The code that can take many locks will have to get the global lock *and*
> order the types, but that's still trivial. It's something like
>
>	spin_lock(&global_lock);
>	for (vma = mm->mmap; vma; vma = vma->vm_next) {
>		if (vma->anon_vma)
>			spin_lock(&vma->anon_vma->lock);
>	}
>	for (vma = mm->mmap; vma; vma = vma->vm_next) {
>		if (!vma->anon_vma && vma->vm_file && vma->vm_file->f_mapping)
>			spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
>	}
>	spin_unlock(&global_lock);

Multiple vmas may share the same mapping or refer to the same anonymous vma. The above code will deadlock since we may take some locks multiple times.
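To make the failure mode and the eventual fix concrete, here is a minimal sketch of a duplicate-safe variant of the loop quoted above. It is illustrative only: the "locked" field is hypothetical (struct anon_vma has no such member at this point in the thread), and mmap_sem is assumed held so the vma list cannot change underneath the walk. The flag itself is only ever written under global_lock, which is what makes the skip test safe against other many-lock takers, while ordinary single-lock takers never look at it:

	spin_lock(&global_lock);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		struct anon_vma *anon = vma->anon_vma;

		/* Take each anon_vma lock at most once, even when many
		 * vmas in this mm share the same anon_vma. */
		if (anon && !anon->locked) {
			anon->locked = 1;
			spin_lock(&anon->lock);
		}
	}
	spin_unlock(&global_lock);

	/* ... critical section ... */

	/* Unlock pass: clear the hypothetical flag before releasing
	 * each lock, so only flagged (i.e. actually taken) locks are
	 * dropped, and each exactly once. */
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		struct anon_vma *anon = vma->anon_vma;

		if (anon && anon->locked) {
			anon->locked = 0;
			spin_unlock(&anon->lock);
		}
	}

This is essentially the per-anon_vma flag idea that comes up later in the thread; the i_mmap_lock pass would need the same treatment with a flag in struct address_space.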
From andrea at qumranet.com Wed May 7 16:39:53 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 01:39:53 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507155914.d7790069.akpm@linux-foundation.org> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> Message-ID: <20080507233953.GM8276@duo.random> Hi Andrew, On Wed, May 07, 2008 at 03:59:14PM -0700, Andrew Morton wrote:

> CPU0:                          CPU1:
>
> spin_lock(global_lock)
> spin_lock(a->lock);            spin_lock(b->lock);
================== mmu_notifier_register()
> spin_lock(b->lock);            spin_unlock(b->lock);
>                                spin_lock(a->lock);
>                                spin_unlock(a->lock);
>
> also OK.

But the problem is that we have to stop the critical section in the place I marked with "========" while mmu_notifier_register runs. Otherwise the driver calling mmu_notifier_register won't know if it's safe to start establishing secondary sptes/tlbs. If the driver establishes sptes/tlbs with get_user_pages/follow_page, the page could be freed immediately afterwards when zap_page_range starts. So if CPU1 doesn't take the global_lock before proceeding in zap_page_range (inside vmtruncate i_mmap_lock, represented as b->lock above) we're in trouble. What we can do is to replace the mm_lock with a spin_lock(&global_lock) only if all places that take i_mmap_lock take the global lock first, and that hurts scalability of the fast paths that are performance critical like vmtruncate and anon_vma->lock. Perhaps they're not so performance critical, but they're surely much more performance critical than mmu_notifier_register ;). The idea of polluting various scalable paths like the truncate() syscall in the VM with a global spinlock frightens me; I'd rather return to invalidate_page() inside the PT lock, removing both invalidate_range_start/end. Then all serialization against the mmu notifiers will be provided by the PT lock that the secondary mmu page fault also has to take in get_user_pages (or follow_page). In any case that is a better solution that won't slow down the VM when MMU_NOTIFIER=y, even if it's a bit slower for GRU; for KVM performance is about the same with or without invalidate_range_start/end. I didn't think anybody could care about how long mmu_notifier_register takes until it returns compared to all the heavyweight operations that happen to start a VM (not only in the kernel but in the guest too). In fact, if it's security that we worry about here, we can put a cap on the _time_ that mmu_notifier_register can take before it fails, and we fail to start a VM if it takes more than 5sec; that's still fine, as the failure could happen for other reasons too, like a vmalloc shortage, and we already handle it just fine. This 5sec delay can't possibly happen in practice anyway in the only interesting scenario, just like the vmalloc shortage.
This is obviously a superior solution to polluting the VM with a useless global spinlock that will destroy truncate/AIM on numa. Anyway Christoph, I uploaded my last version here: http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v16 (applies and runs fine on 26-rc1) You're more than welcome to take over from it; I kind of feel my time now may be better spent emulating the mmu-notifier-core with kprobes. From torvalds at linux-foundation.org Wed May 7 16:38:51 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 16:38:51 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080507223738.GF8276@duo.random> References: <20080507130528.adfd154c.akpm@linux-foundation.org> <20080507215840.GB8276@duo.random> <20080507222758.GD8276@duo.random> <20080507223738.GF8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > > At least for mmu-notifier-core given I obviously am the original > author of that code, I hope the From: of the email was enough even if > an additional From: andrea was missing in the body. Ok, this whole series of patches has just been such a disaster that I'm (a) disgusted that _anybody_ sent an Acked-by: for any of it, and (b) that I'm still looking at it at all, but I am. And quite frankly, the more I look, and the more answers from you I get, the less I like it. And I didn't like it that much to start with, as you may have noticed. You say that "At least for mmu-notifier-core given I obviously am the original author of that code", but that is not at all obvious either. One of the reasons I stated that authorship seems to have been thrown away is very much exactly in that first mmu-notifier-core patch:

+ * linux/mm/mmu_notifier.c
+ *
+ * Copyright (C) 2008 Qumranet, Inc.
+ * Copyright (C) 2008 SGI
+ *             Christoph Lameter

so I would very strongly dispute that it's "obvious" that you are the original author of the code there. So there was a reason why I said that I thought authorship had been lost somewhere along the way. Linus From andrea at qumranet.com Wed May 7 16:45:21 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 01:45:21 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <1210202918.1421.20.camel@pasglop> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <1210202918.1421.20.camel@pasglop> Message-ID: <20080507234521.GN8276@duo.random> On Thu, May 08, 2008 at 09:28:38AM +1000, Benjamin Herrenschmidt wrote: > > On Thu, 2008-05-08 at 00:44 +0200, Andrea Arcangeli wrote: > > > > Please note, we can't allow a thread to be in the middle of > > zap_page_range while mmu_notifier_register runs. > > You said yourself that mmu_notifier_register can be as slow as you > want ... what about using stop_machine for it? I'm not even joking > here :-) We can put a cap on time + a cap on vmas. It's not important if it fails; in the only useful case we know of, it won't be slow at all. The failure can happen because the cap on time or the cap on vmas triggers, or because there's a vmalloc shortage. We handle the failure in userland of course. There are zillions of allocations needed anyway, and any one of them can fail, so this isn't a new fail path; it's the same fail path that always existed before mmu_notifiers existed.
I can't possibly see how adding a new global wide lock that forces all truncates to be serialized against each other, practically eliminating the need for the i_mmap_lock, could be superior to an approach that doesn't cause any overhead to the VM at all and only requires kvm to pay an additional cost at startup. Furthermore, the only reason I had to implement mm_lock was to fix the invalidate_range_start/end model. If we go with only invalidate_page and invalidate_pages called inside the PT lock, and we use the PT lock to serialize, we don't need a mm_lock anymore and no new lock from the VM either. I tried to push for that, but everyone else wanted invalidate_range_start/end. I only did the only possible thing to do: to make invalidate_range_start safe, to make everyone happy without slowing down the VM. From torvalds at linux-foundation.org Wed May 7 17:03:30 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 17:03:30 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> Message-ID: On Wed, 7 May 2008, Christoph Lameter wrote: > > Multiple vmas may share the same mapping or refer to the same anonymous > vma. The above code will deadlock since we may take some locks multiple > times. Ok, so that actually _is_ a problem. It would be easy enough to also add just a flag to the vma (VM_MULTILOCKED), which is still cleaner than doing a vmalloc and a whole sort thing, but if this is really rare, maybe Ben's suggestion of just using stop-machine is actually the right one just because it's _so_ simple. (That said, we're not running out of vm flags yet, and if we were, we could just add another word. We're already wasting that space right now on 64-bit by calling it "unsigned long"). Linus From swise at opengridcomputing.com Wed May 7 17:16:34 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 May 2008 19:16:34 -0500 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: <001001c8b099$960a9560$5fd4180a@amr.corp.intel.com> References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> <48222D58.9020205@opengridcomputing.com> <48222FFD.40302@opengridcomputing.com> <001001c8b099$960a9560$5fd4180a@amr.corp.intel.com> Message-ID: <48224662.60401@opengridcomputing.com> Sean Hefty wrote:

>>> > nes - requires 0b write
>>> > cxgb3 - requires 0b read
>>> > amso1100 - won't work in p2p mode
>
> I'm assuming by requires that you, uhm, mean requires, and nes couldn't do 0b
> reads, or cxgb3 0b writes.

Well, I'm not sure about nes. But cxgb3 cannot deal with receiving a 0B write for the RTR because the FW doesn't see incoming writes, nor does the driver. nes may be able to request a 0b read, but what I meant was that they currently use a 0B write and not a read. So it's possible to reduce the complexity if we just mandate a 0B read for RTR. But it makes sense in my mind to allow the other message types...

>> Its is painful. But without anything, you cannot run OMPI, IMPI or
>> HPMPI on a iwarp cluster with mixed vendor rnics...
>
> Is there any requirement at the receiving side, versus the initiating side?
> That is, just because nes issues a 0b write, does the receiving HW care if a
> read or write shows up?
> Or is this restriction on both sides?

The requirement is mostly driven from the receiving side. For cxgb3 it is anyway... The receiving side, i.e. the side that issues the rdma_accept, will tell the sending side what RTR message to send, if any. So the MPA exchange will look like this:

client sends MPA Start request with private data saying "i can send an RTR if you want it".
server moves connection into RDMA mode
server sends MPA Start response with "lets do RTR and send me X" where X could be a 0B write, 0B read request, or 0B send.
client moves connection into RDMA mode
client sends X and then enables SQ processing (or indicates ESTABLISHED)
Once the server gets X it can enable SQ processing (or indicate ESTABLISHED)
If X was a 0B read request, the server sends a 0B read response.

Steve From holt at sgi.com Wed May 7 17:38:38 2008 From: holt at sgi.com (Robin Holt) Date: Wed, 7 May 2008 19:38:38 -0500 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> Message-ID: <20080508003838.GA9878@sgi.com> On Wed, May 07, 2008 at 02:36:57PM -0700, Linus Torvalds wrote: > On Wed, 7 May 2008, Andrea Arcangeli wrote: > > > > I think the spinlock->rwsem conversion is ok under config option, as > > you can see I complained myself to various of those patches and I'll > > take care they're in a mergeable state the moment I submit them. What > > XPMEM requires are different semantics for the methods, and we never > > had to do any blocking I/O during vmtruncate before, now we have to. > > I really suspect we don't really have to, and that it would be better to > just fix the code that does that. That fix is going to be fairly difficult. I will argue impossible. First, a little background. SGI allows one large numa-link connected machine to be broken into separate single-system images which we call partitions. XPMEM allows, at its most extreme, one process on one partition to grant access to a portion of its virtual address range to processes on another partition. Those processes can then fault pages and directly share the memory. In order to invalidate the remote page table entries, we need to send a message (using XPC) to the remote side. The remote side needs to acquire the importing process's mmap_sem and call zap_page_range(). Between the messaging and the acquiring of a sleeping lock, I would argue this will require sleeping locks in the path prior to the mmu_notifier invalidate_* callouts(). On a side note, we currently have XPMEM working on x86_64 SSI, and ia64 cross-partition. We are in the process of getting XPMEM working on x86_64 cross-partition in support of UV. Thanks, Robin Holt From holt at sgi.com Wed May 7 17:52:56 2008 From: holt at sgi.com (Robin Holt) Date: Wed, 7 May 2008 19:52:56 -0500 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> Message-ID: <20080508005256.GB9878@sgi.com> On Wed, May 07, 2008 at 05:03:30PM -0700, Linus Torvalds wrote: > > > On Wed, 7 May 2008, Christoph Lameter wrote: > > > > Multiple vmas may share the same mapping or refer to the same anonymous > > vma. The above code will deadlock since we may take some locks multiple > > times. > > Ok, so that actually _is_ a problem.
> It would be easy enough to also add
> just a flag to the vma (VM_MULTILOCKED), which is still cleaner than doing
> a vmalloc and a whole sort thing, but if this is really rare, maybe Ben's
> suggestion of just using stop-machine is actually the right one just
> because it's _so_ simple.

Also, stop-machine will not work if we come to the conclusion that i_mmap_lock and anon_vma->lock need to be sleepable locks. Thanks, Robin Holt From clameter at sgi.com Wed May 7 17:56:17 2008 From: clameter at sgi.com (Christoph Lameter) Date: Wed, 7 May 2008 17:56:17 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> Message-ID: On Wed, 7 May 2008, Linus Torvalds wrote: > On Wed, 7 May 2008, Christoph Lameter wrote: > > > > Multiple vmas may share the same mapping or refer to the same anonymous > > vma. The above code will deadlock since we may take some locks multiple > > times. > > Ok, so that actually _is_ a problem. It would be easy enough to also add > just a flag to the vma (VM_MULTILOCKED), which is still cleaner than doing > a vmalloc and a whole sort thing, but if this is really rare, maybe Ben's > suggestion of just using stop-machine is actually the right one just > because it's _so_ simple. Set the vma flag when we lock it and then skip it when we find it locked, right? This would be in addition to the global lock? stop-machine would work for KVM since it's a once-in-a-Guest-OS-lifetime thing. But GRU, KVM and eventually Infiniband need the ability to attach in a reasonable timeframe without causing major hiccups for other processes. > (That said, we're not running out of vm flags yet, and if we were, we > could just add another word. We're already wasting that space right now on > 64-bit by calling it "unsigned long"). We sure have enough flags. From torvalds at linux-foundation.org Wed May 7 17:55:33 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 17:55:33 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080508003838.GA9878@sgi.com> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080508003838.GA9878@sgi.com> Message-ID: On Wed, 7 May 2008, Robin Holt wrote: > > In order to invalidate the remote page table entries, we need to message > (uses XPC) to the remote side. The remote side needs to acquire the > importing process's mmap_sem and call zap_page_range(). Between the > messaging and the acquiring a sleeping lock, I would argue this will > require sleeping locks in the path prior to the mmu_notifier invalidate_* > callouts(). You simply will *have* to do it without locally holding all the MM spinlocks. Because quite frankly, slowing down all the normal VM stuff for some really esoteric hardware simply isn't acceptable. We just don't do it. So what is it that actually triggers one of these events? The most obvious solution is to just queue the affected pages while holding the spinlocks (perhaps locking them locally), and then handle all the stuff that can block after releasing things. That's how we normally do these things, and it works beautifully, without making everything slower.
Sometimes we go to extremes, and actually break the locks and restart (ugh), and it gets ugly, but even that tends to be preferable to using the wrong locking. The thing is, spinlocks really kick ass. Yes, they restrict what you can do within them, but if 99.99% of all work is non-blocking, then the really odd rare blocking case is the one that needs to accommodate, not the rest. Linus From torvalds at linux-foundation.org Wed May 7 18:02:49 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 18:02:49 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507233953.GM8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > Hi Andrew, > > On Wed, May 07, 2008 at 03:59:14PM -0700, Andrew Morton wrote:

> > CPU0:                          CPU1:
> >
> > spin_lock(global_lock)
> > spin_lock(a->lock);            spin_lock(b->lock);
> ================== mmu_notifier_register()

If mmu_notifier_register() takes the global lock, it cannot happen here. It will be blocked (by CPU0), so there's no way it can then cause an ABBA deadlock. It will be released when CPU0 has taken *all* the locks it needed to take. > What we can do is to replace the mm_lock with a > spin_lock(&global_lock) only if all places that take i_mmap_lock NO! You replace mm_lock() with the sequence that Andrew gave you (and I described):

	spin_lock(&global_lock)
	.. get all locks UNORDERED ..
	spin_unlock(&global_lock)

and you're now done. You have your "mm_lock()" (which still needs to be renamed - it should be a "mmu_notifier_lock()" or something like that), but you don't need the insane sorting. At most you apparently need a way to recognize duplicates (so that you don't deadlock on yourself), which looks like a simple bit-per-vma. The global lock doesn't protect any data structures itself - it just protects two of these mm_lock() functions from ABBA'ing on each other! Linus From torvalds at linux-foundation.org Wed May 7 18:07:27 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 18:07:27 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> Message-ID: On Wed, 7 May 2008, Christoph Lameter wrote: > > Set the vma flag when we locked it and then skip when we find it locked > right? This would be in addition to the global lock? Yes. And clear it before unlocking (and again, testing if it's already clear - you mustn't unlock twice, so you must only unlock when the bit was set). You also (obviously) need to have something that guarantees that the lists themselves are stable over the whole sequence, but I assume you already have mmap_sem for reading (since you'd need it anyway just to follow the list). And if you have it for writing, it can obviously *act* as the global lock, since it would already guarantee mutual exclusion on that mm->mmap list.
Linus From clameter at sgi.com Wed May 7 18:12:32 2008 From: clameter at sgi.com (Christoph Lameter) Date: Wed, 7 May 2008 18:12:32 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> Message-ID: On Wed, 7 May 2008, Linus Torvalds wrote: > and you're now done. You have your "mm_lock()" (which still needs to be > renamed - it should be a "mmu_notifier_lock()" or something like that), > but you don't need the insane sorting. At most you apparently need a way > to recognize duplicates (so that you don't deadlock on yourself), which > looks like a simple bit-per-vma. Andrea's mm_lock could have wider impact. It is the first effective way that I have seen of temporarily holding off reclaim from an address space. It sure is a brute force approach. From andrea at qumranet.com Wed May 7 18:26:56 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 03:26:56 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> Message-ID: <20080508012656.GQ8276@duo.random> On Wed, May 07, 2008 at 06:02:49PM -0700, Linus Torvalds wrote:

> You replace mm_lock() with the sequence that Andrew gave you (and I
> described):
>
>	spin_lock(&global_lock)
>	.. get all locks UNORDERED ..
>	spin_unlock(&global_lock)
>
> and you're now done. You have your "mm_lock()" (which still needs to be
> renamed - it should be a "mmu_notifier_lock()" or something like that),
> but you don't need the insane sorting. At most you apparently need a way
> to recognize duplicates (so that you don't deadlock on yourself), which
> looks like a simple bit-per-vma.
>
> The global lock doesn't protect any data structures itself - it just
> protects two of these mm_lock() functions from ABBA'ing on each other!

I thought the thing to remove was the "get all locks". I didn't realize the major problem was only the sorting of the array. I'll add the global lock; it's worth it, as it drops the worst case number of steps by log(65536) times. Furthermore, two concurrent mmu_notifier_lock calls will surely run faster as it'll decrease the cacheline collisions. Since you ask to call it mmu_notifier_lock I'll also move it to mmu_notifier.[ch] as a consequence. From torvalds at linux-foundation.org Wed May 7 18:32:11 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 18:32:11 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> Message-ID: On Wed, 7 May 2008, Christoph Lameter wrote: > On Wed, 7 May 2008, Linus Torvalds wrote: > > and you're now done.
> > You have your "mm_lock()" (which still needs to be
> > renamed - it should be a "mmu_notifier_lock()" or something like that),
> > but you don't need the insane sorting. At most you apparently need a way
> > to recognize duplicates (so that you don't deadlock on yourself), which
> > looks like a simple bit-per-vma.
>
> Andrea's mm_lock could have wider impact. It is the first effective
> way that I have seen of temporarily holding off reclaim from an address
> space. It sure is a brute force approach.

Well, I don't think the naming necessarily has to be about notifiers, but it should be at least a *bit* more scary than "mm_lock()", to make it clear that it's pretty dang expensive. Even without the vmalloc and sorting, if it would be used by "normal" things it would still be very expensive for some cases - running things like ElectricFence, for example, will easily generate thousands and thousands of vma's in a process. Linus From andrea at qumranet.com Wed May 7 18:34:59 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 03:34:59 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507234521.GN8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <1210202918.1421.20.camel@pasglop> <20080507234521.GN8276@duo.random> Message-ID: <20080508013459.GS8276@duo.random> Sorry for not having completely answered this. I initially thought stop_machine could work when you mentioned it, but I don't think it can, even removing xpmem's block-inside-mmu-notifier-method requirements. For stop_machine to solve this (besides being slower and potentially not safer, as running stop_machine in a loop isn't nice), we'd need to prevent preemption in between invalidate_range_start/end. I think there are two ways:

1) add a global lock around mm_lock to remove the sorting

2) remove invalidate_range_start/end, nuke mm_lock as a consequence of it, and replace all three with invalidate_pages issued inside the PT lock, one invalidation for each 512 pte_t modified, so serialization against get_user_pages becomes trivial; but this will not be ok at all for SGI as it increases their invalidation frequency a lot

For KVM both ways are almost the same. I'll implement 1 now, then we'll see... From torvalds at linux-foundation.org Wed May 7 18:39:48 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 18:39:48 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> Message-ID: On Wed, 7 May 2008, Christoph Lameter wrote: > > > (That said, we're not running out of vm flags yet, and if we were, we > > could just add another word. We're already wasting that space right now on > > 64-bit by calling it "unsigned long"). > > We sure have enough flags. Oh, btw, I was wrong - we wouldn't want to mark the vma's (they are unique), we need to mark the address spaces/anonvma's. So the flag would need to be in the "struct anon_vma" (and struct address_space), not in the vma itself. My bad.
So the flag wouldn't be one of the VM_xyzzy flags, and would require adding a new field to "struct anon_vma()". And related to that brain-fart of mine, that obviously also means that yes, the locking has to be stronger than "mm->mmap_sem" held for writing, so yeah, it would have to be a separate global spinlock (or perhaps a blocking lock if you have some reason to protect anything else with this too). Linus From andrea at qumranet.com Wed May 7 18:52:49 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 03:52:49 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> Message-ID: <20080508015249.GT8276@duo.random> On Wed, May 07, 2008 at 06:39:48PM -0700, Linus Torvalds wrote: > > > On Wed, 7 May 2008, Christoph Lameter wrote: > > > > > (That said, we're not running out of vm flags yet, and if we were, we > > > could just add another word. We're already wasting that space right now on > > > 64-bit by calling it "unsigned long"). > > > > We sure have enough flags. > > Oh, btw, I was wrong - we wouldn't want to mark the vma's (they are > unique), we need to mark the address spaces/anonvma's. So the flag would > need to be in the "struct anon_vma" (and struct address_space), not in the > vma itself. My bad. So the flag wouldn't be one of the VM_xyzzy flags, and > would require adding a new field to "struct anon_vma()" > > And related to that brain-fart of mine, that obviously also means that > yes, the locking has to be stronger than "mm->mmap_sem" held for writing, > so yeah, it would have be a separate global spinlock (or perhaps a > blocking lock if you have some reason to protect anything else with this So, because the bitflag can't prevent taking the same lock twice on two different vmas in the same mm, we still can't remove the sorting, and the global lock won't buy much other than reducing the collisions. I can add that though. I think it's more interesting to put a cap on the number of vmas at min(1024,max_map_count). The sort time on an 8k array runs in constant time. kvm runs with 127 vmas allocated... From torvalds at linux-foundation.org Wed May 7 18:57:05 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 18:57:05 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080508015249.GT8276@duo.random> References: <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080508015249.GT8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > > So because the bitflag can't prevent taking the same lock twice on two > different vmas in the same mm, we still can't remove the sorting Andrea. Take five minutes. Take a deep breath. And *think* about actually reading what I wrote. The bitflag *can* prevent taking the same lock twice. It just needs to be in the right place. Let me quote it for you: > So the flag wouldn't be one of the VM_xyzzy flags, and would require > adding a new field to "struct anon_vma()" IOW, just make it be in that anon_vma (and the address_space). No sorting required. > I think it's more interesting to put a cap on the number of vmas to > min(1024,max_map_count). The sort time on an 8k array runs in constant > time. Shut up already.
It's not constant time just because you can cap the overhead. We're not in a university, and we care about performance, not your made-up big-O notation. Linus From andrea at qumranet.com Wed May 7 19:24:24 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 04:24:24 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080508015249.GT8276@duo.random> Message-ID: <20080508022424.GU8276@duo.random> On Wed, May 07, 2008 at 06:57:05PM -0700, Linus Torvalds wrote: > Take five minutes. Take a deep breath. And *think* about actually reading > what I wrote. > > The bitflag *can* prevent taking the same lock twice. It just needs to be > in the right place. It's not that I didn't read it, but to do it I have to grow every anon_vma by 8 bytes. I thought it was implicit that the conclusion of your email is that it couldn't possibly make sense to grow the size of each anon_vma by 33% when nobody has loaded the kvm or gru or xpmem kernel modules. It surely isn't my preferred solution; still, capping the number of vmas to 1024 means sort() will make around 10240 steps; Matt can tell the exact number. The big cost shouldn't be in sort. Even 512 vmas will be more than enough for us, in fact. Note that I have a cond_resched in the sort compare function and I can re-add the signal_pending check. I had the signal_pending check in the original version that didn't use sort() but was doing an inner loop; I thought signal_pending wasn't necessary after speeding it up with sort(). But I can add it again, so then we'll only fail to abort inside sort() and we'll be able to break the loop while taking all the spinlocks, but with such a small array that can't be an issue, and the result will surely run faster than stop_machine, with zero ram and cpu overhead for the VM (besides, stop_machine can't work, or we'd need to disable preemption between invalidate_range_start/end, even removing the xpmem schedule-inside-mmu-notifier requirement). From torvalds at linux-foundation.org Wed May 7 19:32:05 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 19:32:05 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080508022424.GU8276@duo.random> References: <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080508015249.GT8276@duo.random> <20080508022424.GU8276@duo.random> Message-ID: Andrea, I'm not interested. I've stated my standpoint: the code being discussed is crap. We're not doing that. Not in the core VM. I gave solutions that I think aren't crap, but I already also stated that I have no problems not merging it _ever_ if no solution can be found. The whole issue simply isn't even worth the pain, imnsho.
Linus From andrea at qumranet.com Wed May 7 19:56:52 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 04:56:52 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> Message-ID: <20080508025652.GW8276@duo.random> On Wed, May 07, 2008 at 06:12:32PM -0700, Christoph Lameter wrote: > Andrea's mm_lock could have wider impact. It is the first effective > way that I have seen of temporarily holding off reclaim from an address > space. It sure is a brute force approach. The only improvement I can imagine for mm_lock (after changing the name to global_mm_lock()) is to reestablish the signal_pending check in the loop that takes the spinlocks so it can back off, and to put the cap at 512 vmas, so that the ram wasted on anon-vmas wouldn't save more than 10-100usec at most (plus the vfree, which may be a bigger cost, but we're ok to pay it and it surely isn't security related). Then, in the long term, we need to talk to Matt about returning a parameter from the sort function to break the loop. After that we remove the 512 vma cap and mm_lock is free to run as long as it wants, like /dev/urandom; nobody could care less how long it will run before returning as long as it reacts to signals. This is the right way if we want to support XPMEM/GRU efficiently and without introducing unnecessary regressions in the VM fastpaths and VM footprint. From clameter at sgi.com Wed May 7 20:10:33 2008 From: clameter at sgi.com (Christoph Lameter) Date: Wed, 7 May 2008 20:10:33 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080508025652.GW8276@duo.random> References: <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > to the sort function to break the loop. After that we remove the 512 > vma cap and mm_lock is free to run as long as it wants, like > /dev/urandom; nobody could care less how long it will run before > returning as long as it reacts to signals. Look, Linus has told you what to do. Why not simply do it? From andrea at qumranet.com Wed May 7 20:41:33 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 05:41:33 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> Message-ID: <20080508034133.GY8276@duo.random> On Wed, May 07, 2008 at 08:10:33PM -0700, Christoph Lameter wrote: > On Thu, 8 May 2008, Andrea Arcangeli wrote: > > > to the sort function to break the loop. After that we remove the 512 > > vma cap and mm_lock is free to run as long as it wants, like > > /dev/urandom; nobody could care less how long it will run before > > returning as long as it reacts to signals. > > Look, Linus has told you what to do. Why not simply do it?
When it looked like we could use vm_flags to remove sort, that looked like an ok optimization; no problem with optimizations, I'm all for optimizations if they cost nothing to the VM fast paths and VM footprint. But removing sort isn't worth it if it takes away ram from the VM even when global_mm_lock will never be called. sort is like /dev/urandom, so after sort is fixed to handle signals (and I expect Matt will help with this) we'll remove the 512 vma cap. In the meantime we can live with the 512 vma cap that guarantees sort won't take more than a few dozen usec. Removing sort() is the only thing that the anon vma bitflag can achieve, and it's clearly not worth it, and it would go in the wrong direction (fixing sort to handle signals is clearly the right direction, if sort is a concern at all). Adding the global lock around global_mm_lock to keep one global_mm_lock from colliding with another global_mm_lock is sure ok with me; if that's still wanted now that it's clear removing sort isn't worth it, I'm neutral on this. Christoph, please go ahead and add the bitflag to anon-vma yourself if you want. If something isn't technically right I don't do it, no matter who asks. From torvalds at linux-foundation.org Wed May 7 21:14:45 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 21:14:45 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080508034133.GY8276@duo.random> References: <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > > But removing sort isn't worth it if it takes away ram from the VM even > when global_mm_lock will never be called. Andrea, you really are a piece of work. Your arguments have been bogus crap that didn't even understand what was going on from the beginning, and now you continue to do that. What exactly "takes away ram" from the VM? The notion of adding a flag to "struct anon_vma"? The one that already has a 4 byte padding thing on x86-64 just after the spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two bytes of padding if we didn't just make the spinlock type unconditionally 32 bits rather than the 16 bits we actually _use_? IOW, you didn't even look at it, did you? But whatever. I clearly don't want a patch from you anyway, so .. Linus From andrea at qumranet.com Wed May 7 22:20:19 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 07:20:19 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> Message-ID: <20080508052019.GA8276@duo.random> On Wed, May 07, 2008 at 09:14:45PM -0700, Linus Torvalds wrote: > IOW, you didn't even look at it, did you? Actually I looked both at the struct and at the slab alignment, just in case it was changed recently. Now after reading your mail I also compiled it just in case.
2.6.26-rc1

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
anon_vma             260    576     24  144    1 : tunables  120   60    8 : slabdata      4      4      0
                                    ^^  ^^^

2.6.26-rc1 + below patch

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -27,6 +27,7 @@ struct anon_vma {
 struct anon_vma {
 	spinlock_t lock;	/* Serialize access to vma list */
 	struct list_head head;	/* List of private "related" vmas */
+	int flag:1;
 };

 #ifdef CONFIG_MMU

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
anon_vma             250    560     32  112    1 : tunables  120   60    8 : slabdata      5      5      0
                                    ^^  ^^^

Sure, it's not a big deal to grow it 33%, it's so small anyway, but I don't see the point in growing it. sort() can't be interrupted by signals, and until it can, we can cap it to 512 vmas, making the worst case take a few dozen usecs; I fail to see what you have against sort(). Again: if a vma bitflag + global lock could have avoided sort and run in O(N) instead of the current O(N*log(N)), I would have done that immediately; in fact I was in the process of doing it when you posted the followup. Nothing personal here, just staying technical. Hope you do too. From penberg at cs.helsinki.fi Wed May 7 22:27:47 2008 From: penberg at cs.helsinki.fi (Pekka Enberg) Date: Thu, 8 May 2008 08:27:47 +0300 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080508052019.GA8276@duo.random> References: <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> <20080508052019.GA8276@duo.random> Message-ID: <84144f020805072227i3382465eleccded79d9fcf532@mail.gmail.com> On Thu, May 8, 2008 at 8:20 AM, Andrea Arcangeli wrote: > Actually I looked both at the struct and at the slab alignment, just in > case it was changed recently. Now after reading your mail I also > compiled it just in case.
>
> @@ -27,6 +27,7 @@ struct anon_vma {
>  struct anon_vma {
>
>  	spinlock_t lock;	/* Serialize access to vma list */
>
>  	struct list_head head;	/* List of private "related" vmas */
> +	int flag:1;
> };

You might want to read carefully what Linus wrote: > The one that already has a 4 byte padding thing on x86-64 just after the > spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two > bytes of padding if we didn't just make the spinlock type unconditionally > 32 bits rather than the 16 bits we actually _use_? So you need to add the flag _after_ ->lock and _before_ ->head.... Pekka From penberg at cs.helsinki.fi Wed May 7 22:30:20 2008 From: penberg at cs.helsinki.fi (Pekka Enberg) Date: Thu, 8 May 2008 08:30:20 +0300 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <84144f020805072227i3382465eleccded79d9fcf532@mail.gmail.com> References: <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> <20080508052019.GA8276@duo.random> <84144f020805072227i3382465eleccded79d9fcf532@mail.gmail.com> Message-ID: <84144f020805072230g2a619d65x8e3bb1fbf9d130d8@mail.gmail.com> On Thu, May 8, 2008 at 8:27 AM, Pekka Enberg wrote: > You might want to read carefully what Linus wrote: > > > The one that already has a 4 byte padding thing on x86-64 just after the > > spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two > > bytes of padding if we didn't just make the spinlock type unconditionally > > 32 bits rather than the 16 bits we actually _use_?
>
> So you need to add the flag _after_ ->lock and _before_ ->head....

Oh, I should have taken my morning coffee first; before ->lock obviously works as well. From andrea at qumranet.com Wed May 7 22:49:31 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 07:49:31 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <84144f020805072230g2a619d65x8e3bb1fbf9d130d8@mail.gmail.com> References: <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> <20080508052019.GA8276@duo.random> <84144f020805072227i3382465eleccded79d9fcf532@mail.gmail.com> <84144f020805072230g2a619d65x8e3bb1fbf9d130d8@mail.gmail.com> Message-ID: <20080508054931.GB8276@duo.random> On Thu, May 08, 2008 at 08:30:20AM +0300, Pekka Enberg wrote: > On Thu, May 8, 2008 at 8:27 AM, Pekka Enberg wrote: > > You might want to read carefully what Linus wrote: > > > > > The one that already has a 4 byte padding thing on x86-64 just after the > > > spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two > > > bytes of padding if we didn't just make the spinlock type unconditionally > > > 32 bits rather than the 16 bits we actually _use_? > > > > So you need to add the flag _after_ ->lock and _before_ ->head.... > > Oh, I should have taken my morning coffee first; before ->lock obviously > works as well. Sorry, Linus is right: I didn't realize the "after the spinlock" was literally after the spinlock; I didn't see the 4 byte padding when I read the code and put the flag:1 in. If put between ->lock and ->head it doesn't take more memory on x86-64, as described literally. So the next step would be to find another place like that in the address_space. Perhaps after the private_lock using the same trick, or perhaps the slab alignment won't actually alter the number of slabs per page regardless. I leave that to Christoph, he's surely better than me at doing this; I give it up entirely, and I consider my attempt to merge a total failure and I strongly regret it. On a side note, the anon_vma will change to this when XPMEM support is compiled in:

 struct anon_vma {
-	spinlock_t lock;	/* Serialize access to vma list */
+	atomic_t refcount;	/* vmas on the list */
+	struct rw_semaphore sem;/* Serialize access to vma list */
 	struct list_head head;	/* List of private "related" vmas */
 };

not sure if it'll grow in size or not after that, but let's say it's not a big deal. From PHF at zurich.ibm.com Thu May 8 02:38:19 2008 From: PHF at zurich.ibm.com (Philip Frey1) Date: Thu, 8 May 2008 11:38:19 +0200 Subject: [ofa-general] cxgb3 user limitations Message-ID: Hi, I have a Chelsio T3. Whenever I do RDMA as a normal user, I am severely limited in terms of memory. Why is that? Is there a way of using the RNIC with the same privileges as root but without actually being root? Could you give me some insight into what the limits of the Chelsio RNIC are?
(Max MRs, QPs, PDs etc) Many thanks and kind regards, Philip From michael.heinz at qlogic.com Thu May 8 06:54:24 2008 From: michael.heinz at qlogic.com (Mike Heinz) Date: Thu, 8 May 2008 08:54:24 -0500 Subject: [ofa-general] Need help diagnosing a problem.... Message-ID: I was smoke testing a small cluster when one of the nodes posted this:

May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: Internal error detected:
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[00]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[01]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[02]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[03]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[04]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[05]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[06]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[07]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[08]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[09]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0a]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0b]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0c]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0d]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0e]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0f]: ffffffff

At this point, all further IB traffic on that node failed, and it silently hung during shutdown. Any suggestions as to what I should look at? -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania From swise at opengridcomputing.com Thu May 8 07:16:54 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 08 May 2008 09:16:54 -0500 Subject: [ofa-general] Re: cxgb3 user limitations In-Reply-To: References: Message-ID: <48230B56.4060101@opengridcomputing.com> Philip Frey1 wrote: > > Hi, > > I have a Chelsio T3. Whenever I do RDMA as a normal user, I am severely > limited in terms of memory. > Why is that? Is there a way of using the RNIC with the same privileges > as root but without actually > being root? > > Could you give me some insight into what the limits of the Chelsio RNIC > are? (Max MRs, QPs, PDs etc) > > Many thanks and kind regards, > Philip Try running ibv_devinfo -v to see driver/hw limits. However, how are you limited? Are you getting failures registering memory? Did you try setting your ulimit -l to unlimited, or at least as large as the memory region you want to register? Steve. From pawel.dziekonski at wcss.pl Thu May 8 07:57:10 2008 From: pawel.dziekonski at wcss.pl (Pawel Dziekonski) Date: Thu, 8 May 2008 16:57:10 +0200 Subject: [ofa-general] getting network statistics In-Reply-To: <1210145698.15669.78.camel@mtls03> References: <1203424196.16145.1.camel@mtls03> <20080506141039.GJ6586@cefeid.wcss.wroc.pl> <1210145698.15669.78.camel@mtls03> Message-ID: <20080508145710.GA13329@cefeid.wcss.wroc.pl> yes, I meant CONTENT of those files.
while true; do cat port_rcv_data; sleep 1; done
223212307
223212307
223212307
223212342
223227022
223227022
223227022
223227022
223227022
223227057
223227265

On Wed, 07 May 2008 at 10:34:58AM +0300, Eli Cohen wrote: > These files are on a virtual file system and their size does not change. > You need to read them, e.g. using cat, in order to get the statistics > data. For example, "cat port_rcv_data" will give you a measure of how > many bytes of data were received by the port. > > On Tue, 2008-05-06 at 16:10 +0200, Pawel Dziekonski wrote: > > you mean port_rcv_data and port_xmit_data ? > > > > if so, then I have 2 jobs that are definitelly using IB network, but > > those files almost do not change. :o > > > > OFED 1.2.5.5 and kernel 2.6.9-55.0.12.ELsmp

> > root at wn111:/sys/class/infiniband/mthca0/ports/1/counters # ls -al
> > total 0
> > drwxr-xr-x 2 root root 0 May 6 15:45 ./
> > drwxr-xr-x 5 root root 0 May 6 15:45 ../
> > -r--r--r-- 1 root root 4096 May 6 15:45 VL15_dropped
> > -r--r--r-- 1 root root 4096 May 6 15:45 excessive_buffer_overrun_errors
> > -r--r--r-- 1 root root 4096 May 6 15:45 link_downed
> > -r--r--r-- 1 root root 4096 May 6 15:45 link_error_recovery
> > -r--r--r-- 1 root root 4096 May 6 15:45 local_link_integrity_errors
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_constraint_errors
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_data
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_errors
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_packets
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_remote_physical_errors
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_switch_relay_errors
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_xmit_constraint_errors
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_xmit_data
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_xmit_discards
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_xmit_packets
> > -r--r--r-- 1 root root 4096 May 6 15:45 symbol_error

> > > > On Tue, 19 Feb 2008 at 02:29:56PM +0200, Eli Cohen wrote:

> > > cat /sys/class/infiniband/mlx4_0/ports/1/counters/*
> > >
> > > mlx4_* can be mthca*

> > > On Tue, 2008-02-19 at 11:03 +0200, David Minor wrote: > > > > Under Linux with Mellanox ofed, how can I get real-time network > > > > statistics. e.g. how many bytes are being sent and received over each > > > > port at any given time? > > > -- Pawel Dziekonski Wroclaw Centre for Networking & Supercomputing, HPC Department Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl From torvalds at linux-foundation.org Thu May 8 08:03:19 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Thu, 8 May 2008 08:03:19 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080508052019.GA8276@duo.random> References: <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> <20080508052019.GA8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > > Actually I looked both at the struct and at the slab alignment just in > case it was changed recently. Now after reading your mail I also > compiled it just in case. Put the flag after the spinlock, not after the "list_head".
Also, we'd need to make it

	unsigned short flag:1;

_and_ change spinlock_types.h to make the spinlock size actually match the required size (right now we make it an "unsigned int slock" even when we actually only use 16 bits).

See the #if (NR_CPUS < 256) code in <asm-x86/spinlock.h>.

		Linus

From torvalds at linux-foundation.org Thu May 8 09:11:33 2008
From: torvalds at linux-foundation.org (Linus Torvalds)
Date: Thu, 8 May 2008 09:11:33 -0700 (PDT)
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: 
References: <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> <20080508052019.GA8276@duo.random>
Message-ID: 

On Thu, 8 May 2008, Linus Torvalds wrote:
> 
> Also, we'd need to make it
> 
> 	unsigned short flag:1;
> 
> _and_ change spinlock_types.h to make the spinlock size actually match the
> required size (right now we make it an "unsigned int slock" even when we
> actually only use 16 bits).

Btw, this is an issue only on 32-bit x86, because on 64-bit one we already have the padding due to the alignment of the 64-bit pointers in the list_head (so there's already empty space there).

On 32-bit, the alignment of list-head is obviously just 32 bits, so right now the structure is "perfectly packed" and doesn't have any empty space. But that's just because the spinlock is unnecessarily big.

(Of course, if anybody really uses NR_CPUS >= 256 on 32-bit x86, then the structure really will grow. That's a very odd configuration, though, and not one I feel we really need to care about).

		Linus

From akepner at sgi.com Thu May 8 10:09:48 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Thu, 8 May 2008 10:09:48 -0700
Subject: [ofa-general] IPoIB-UD TX timeouts (OFED 1.2)
In-Reply-To: <4e6a6b3c0804301300q57b4b562r854e337ff8706222@mail.gmail.com>
References: <20080430192354.GG26724@sgi.com> <4e6a6b3c0804301300q57b4b562r854e337ff8706222@mail.gmail.com>
Message-ID: <20080508170948.GT24293@sgi.com>

On Wed, Apr 30, 2008 at 11:00:55PM +0300, Eli Cohen wrote:
> ....
> when it happens please:
> 1. Check the link error counters.

Unfortunately there appear to be things running that periodically reset the counters (to avoid hitting the 32 bit limit), so the port counters usually come back as all 0.

> 2. Disconnect and reconnect the cable and see if it recovers.

None of the systems where this has been seen are physically accessible to me (and even if they were, finding the right cable to pull might be tricky :-)

We have some new information, which I'll post now.

-- 
Arthur

From akepner at sgi.com Thu May 8 10:19:36 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Thu, 8 May 2008 10:19:36 -0700
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
Message-ID: <20080508171936.GU24293@sgi.com>

In an earlier email I mentioned that, with certain workloads, we are seeing an endless loop of timeouts on the IPoIB-UD send queue.
Messages like "NETDEV WATCHDOG: ib0: transmit timed out" appear once a second until the driver is unloaded. That was with OFED 1.2.

Using OFED 1.3, we see what I believe is the same problem, but it looks a little different. We don't get "NETDEV WATCHDOG", but we get an endless string of "post_send failed". (I suspect, but haven't verified, that the difference is due to the sharing of ipoib_dev_priv's tx_outstanding member between the UD and CM IPoIB QPs; the value of tx_outstanding is used to determine when to call netif_stop_queue().)

The h/w is MT25204, with f/w version 1.2.0, on an x86_64.

I instrumented the mthca driver to maintain a circular buffer of the state of the IPoIB-UD send queue on each call to the "post_send" (mthca_arbel_post_send) and "poll_cq" (mthca_poll_one) routines, and also to dump the QP and CQ context when the full queue is detected.

At some point, we just stop getting completions on the send queue. Here are the last entries from the "poll_cq" log:

# jiffies      qpn    last   head   tail
#                     comp
.....
0x100032cdc   0x404   0x49   0x24b  0x24a
0x100032cdc   0x404   0x4a   0x24b  0x24b
0x100033eed   0x404   0x4c   0x24e  0x24d
0x100033eed   0x404   0x4d   0x24e  0x24e
0x10003b594   0x404   0x4f   0x251  0x250
0x10003b594   0x404   0x50   0x251  0x251
0x10003c999   0x404   0x52   0x254  0x253
0x10003ca16   0x404   0x53   0x255  0x254
0x10003ca93   0x404   0x54   0x256  0x255
0x10003ca93   0x404   0x55   0x256  0x256

We keep calling the send routine (apparently via the periodic ipoib_ib_tx_timer_func()) and keep getting a "queue full" condition - the send queue length is 128. Here are some entries after the queue has filled (they keep going "forever"):

# jiffies      qpn    last   head   tail
#                     comp
.....
0x1000760dd   0x404   0x55   0x2d6  0x256
0x1000761c6   0x404   0x55   0x2d6  0x256
0x1000761d7   0x404   0x55   0x2d6  0x256
0x1000762c0   0x404   0x55   0x2d6  0x256
0x1000762d1   0x404   0x55   0x2d6  0x256
0x1000763ba   0x404   0x55   0x2d6  0x256

And here's the QP and CQ context immediately after the first post_send failure:

QP context (including the two 32-bit "opt_param_mask" and reserved fields at the beginning):

[00] 0x00000000 0x00000000 0x30031900 0xef3e3f16
[10] 0x8b423b00 0x00000002 0x00000404 0x00000000
[20] 0x00000000 0x00000000 0x01000000 0x60000000
[30] 0x00000000 0x00000000 0x00000000 0x00000000
[40] 0x00000000 0x00000000 0x00000000 0x00000000
[50] 0x00000000 0x00000000 0x00000000 0x00000000
[60] 0x00000000 0x00000000 0x00000000 0x00000006
[70] 0x00000000 0x00002600 0xaf004000 0x00800088
[80] 0x00000256 0x00000082 0x00004000 0x00000005
[90] 0x00ffffff 0x00000257 0x00000008 0x003a277f
[a0] 0x25020200 0x00000081 0x00000000 0x00007ff9
[b0] 0x00000b1b 0x00000000 0x000003f8 0x03f80256
[c0] 0x00000000 0x00000000 0x00000000 0x00000000
[d0] 0x00000000 0x00000000 0x00000000 0x00000000
[e0] 0x00000000 0x00000000 0x00000000 0x00000000
[f0] 0x00000000 0x00000000 0x00000000 0x00000000

CQ context:

[00] 0x00000a00 0x00000000 0x00000000 0x08000002
[10] 0x00000000 0x00000001 0x00000004 0x00002500
[20] 0x000001fd 0x000001fd 0x00000000 0x00000238
[30] 0x00000082 0x00007ffa 0x00000004 0x00000000

I don't see anything obviously wrong here - anyone at Mellanox? Any idea why the card would stop generating TX completions?
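(For reference, the instrumentation is nothing exotic - conceptually just a fixed-size ring of snapshots taken on entry to those two routines. A simplified sketch with made-up names, not the actual hack:)

	struct sq_snap {
		u64 jiffies;		/* get_jiffies_64() at the time of the call */
		u32 qpn;		/* QP number */
		u32 last_comp;		/* last send wqe index seen completed */
		u32 head, tail;		/* send queue producer/consumer indices */
	};

	#define SQ_SNAP_NUM 1024	/* power of 2 so the index wraps cheaply */
	static struct sq_snap sq_snap_log[SQ_SNAP_NUM];
	static atomic_t sq_snap_idx = ATOMIC_INIT(-1);

	static void sq_snap_take(u32 qpn, u32 last_comp, u32 head, u32 tail)
	{
		u32 i = (u32) atomic_inc_return(&sq_snap_idx) &
			(SQ_SNAP_NUM - 1);

		sq_snap_log[i].jiffies = get_jiffies_64();
		sq_snap_log[i].qpn = qpn;
		sq_snap_log[i].last_comp = last_comp;
		sq_snap_log[i].head = head;
		sq_snap_log[i].tail = tail;
	}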
-- 
Arthur

From rdreier at cisco.com Thu May 8 10:42:37 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 08 May 2008 10:42:37 -0700
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
In-Reply-To: <20080508171936.GU24293@sgi.com> (akepner@sgi.com's message of "Thu, 8 May 2008 10:19:36 -0700")
References: <20080508171936.GU24293@sgi.com>
Message-ID: 

> Using OFED 1.3, we see what I believe is the same problem, but it
> looks a little different. We don't get "NETDEV WATCHDOG", but
> we get an endless string of "post_send failed".

That's bad. Did you check if the send is failing due to overrunning the send queue?

> (I suspect, but haven't verified, that the difference is due to
> the sharing of ipoib_dev_priv's tx_outstanding member between
> the UD and CM IPoIB QPs, the value of tx_outstanding is used
> to determine when to call netif_stop_queue().)

A while ago, I was worried about the handling of tx_outstanding and how the driver makes sure that it doesn't post too many sends, but I managed to convince myself that the code was OK. Guess we should check it over one more time.

 - R.

From rdreier at cisco.com Thu May 8 10:45:28 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 08 May 2008 10:45:28 -0700
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
In-Reply-To: <20080508171936.GU24293@sgi.com> (akepner@sgi.com's message of "Thu, 8 May 2008 10:19:36 -0700")
References: <20080508171936.GU24293@sgi.com>
Message-ID: 

> We keep calling the send routine (apparently via the periodic
> ipoib_ib_tx_timer_func()) and keep getting a "queue full" condition -
> the send queue length is 128.

I don't see how ipoib_ib_tx_timer_func() could call post_send since all it does is poll the CQ and handle completions.

Also ipoib_ib_tx_timer_func() was added post-OFED 1.3 (it is only in 2.6.26-rc1 AFAIK), so what kernel are you using?

 - R.

From akepner at sgi.com Thu May 8 10:43:58 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Thu, 8 May 2008 10:43:58 -0700
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
In-Reply-To: 
References: <20080508171936.GU24293@sgi.com>
Message-ID: <20080508174358.GW24293@sgi.com>

On Thu, May 08, 2008 at 10:42:37AM -0700, Roland Dreier wrote:
> ..
> That's bad. Did you check if the send is failing due to overrunning the
> send queue?

Yes, mthca_wq_overflow() is detecting a full queue.

(Is that what you mean?)

-- 
Arthur

From rdreier at cisco.com Thu May 8 10:50:11 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 08 May 2008 10:50:11 -0700
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
In-Reply-To: <20080508174358.GW24293@sgi.com> (akepner@sgi.com's message of "Thu, 8 May 2008 10:43:58 -0700")
References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com>
Message-ID: 

> Yes, mthca_wq_overflow() is detecting a full queue.
> 
> (Is that what you mean?)

Yep, that's what I was wondering.

It might be useful to track the value of tx_outstanding... from a quick look at the code I can't see how the transmit queue could be awake when the UD send queue is full.

Are you using connected mode when you reproduce this, or does it happen with datagram mode?

 - R.
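P.S. Untested, but something as dumb as this would probably be enough to see it (a sketch against the OFED 1.3 ipoib; priv->tx_outstanding and ipoib_sendq_size are the names in that tree - call it from ipoib_send() and from the send completion handling):

	static void ipoib_dump_tx_outstanding(struct ipoib_dev_priv *priv,
					      const char *where)
	{
		/* only log the edges so we don't flood the kernel log */
		if (priv->tx_outstanding == 0 ||
		    priv->tx_outstanding >= ipoib_sendq_size)
			printk(KERN_DEBUG "%s: tx_outstanding %u, queue %s\n",
			       where, priv->tx_outstanding,
			       netif_queue_stopped(priv->dev) ?
			       "stopped" : "awake");
	}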
From akepner at sgi.com Thu May 8 10:52:26 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Thu, 8 May 2008 10:52:26 -0700
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
In-Reply-To: 
References: <20080508171936.GU24293@sgi.com>
Message-ID: <20080508175226.GX24293@sgi.com>

On Thu, May 08, 2008 at 10:45:28AM -0700, Roland Dreier wrote:
> 
> I don't see how ipoib_ib_tx_timer_func() could call post_send since all
> it does is poll the CQ and handle completions.

I'll check for sure about what's calling post_send here.

> 
> Also ipoib_ib_tx_timer_func() was added post-OFED 1.3 (it is only in
> 2.6.26-rc1 AFAIK), so what kernel are you using?
> 

The kernel is SLES10 SP1 (2.6.16.46-0.12-smp), and OFED 1.3-ga is installed on that.

-- 
Arthur

From akepner at sgi.com Thu May 8 10:55:47 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Thu, 8 May 2008 10:55:47 -0700
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
In-Reply-To: 
References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com>
Message-ID: <20080508175547.GY24293@sgi.com>

On Thu, May 08, 2008 at 10:50:11AM -0700, Roland Dreier wrote:
> ...
> It might be useful to track the value of tx_outstanding... from a quick
> look at the code I can't see how the transmit queue could be awake when
> the UD send queue is full.
> 

OK, I'll check that.

> Are you using connected mode when you reproduce this, or does it happen
> with datagram mode?
> 

We're using connected mode. (I think we have had some similar problems when using datagram mode, but I don't have details.)

-- 
Arthur

From ralph.campbell at qlogic.com Thu May 8 11:55:12 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 08 May 2008 11:55:12 -0700
Subject: [ofa-general] [PATCH 0/3] IB/ipath -- fixes for 2.6.26
Message-ID: <20080508185512.8547.29637.stgit@eng-46.mv.qlogic.com>

The following patches fix a number of bugs for the QLogic DDR HCA.

IB/ipath - fix RC and UC error handling
IB/ipath - fix many locking issues when switching to error state
IB/ipath - fix RDMA read response sequence checking

These can also be pulled into Roland's infiniband.git for-2.6.26 repo using:

git pull git://git.qlogic.com/ipath-linux-2.6 for-roland

Just FYI, these changes bring 2.6.26 into sync with what I submitted for OFED-1.3.1. I also don't expect further changes for a while.

From ralph.campbell at qlogic.com Thu May 8 11:55:17 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 08 May 2008 11:55:17 -0700
Subject: [ofa-general] [PATCH 1/3] IB/ipath - fix RC and UC error handling
In-Reply-To: <20080508185512.8547.29637.stgit@eng-46.mv.qlogic.com>
References: <20080508185512.8547.29637.stgit@eng-46.mv.qlogic.com>
Message-ID: <20080508185517.8547.42852.stgit@eng-46.mv.qlogic.com>

When errors are detected in RC, the QP should transition to the IB_QPS_ERR state, not the IB_QPS_SQE state. Also, when the error is on the responder side, the recv work completion error was incorrect (rem vs. local).
Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_qp.c | 54 +-------- drivers/infiniband/hw/ipath/ipath_rc.c | 127 +++++++--------------- drivers/infiniband/hw/ipath/ipath_ruc.c | 165 ++++++++++++++--------------- drivers/infiniband/hw/ipath/ipath_verbs.c | 4 - drivers/infiniband/hw/ipath/ipath_verbs.h | 6 + 5 files changed, 132 insertions(+), 224 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c index dd5b6e9..6f98632 100644 --- a/drivers/infiniband/hw/ipath/ipath_qp.c +++ b/drivers/infiniband/hw/ipath/ipath_qp.c @@ -374,13 +374,14 @@ static void ipath_reset_qp(struct ipath_qp *qp, enum ib_qp_type type) } /** - * ipath_error_qp - put a QP into an error state - * @qp: the QP to put into an error state + * ipath_error_qp - put a QP into the error state + * @qp: the QP to put into the error state * @err: the receive completion error to signal if a RWQE is active * * Flushes both send and receive work queues. * Returns true if last WQE event should be generated. * The QP s_lock should be held and interrupts disabled. + * If we are already in error state, just return. */ int ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) @@ -389,8 +390,10 @@ int ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) struct ib_wc wc; int ret = 0; - ipath_dbg("QP%d/%d in error state (%d)\n", - qp->ibqp.qp_num, qp->remote_qpn, err); + if (qp->state == IB_QPS_ERR) + goto bail; + + qp->state = IB_QPS_ERR; spin_lock(&dev->pending_lock); if (!list_empty(&qp->timerwait)) @@ -460,6 +463,7 @@ int ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) } else if (qp->ibqp.event_handler) ret = 1; +bail: return ret; } @@ -1026,48 +1030,6 @@ bail: } /** - * ipath_sqerror_qp - put a QP's send queue into an error state - * @qp: QP who's send queue will be put into an error state - * @wc: the WC responsible for putting the QP in this state - * - * Flushes the send work queue. - * The QP s_lock should be held and interrupts disabled. - */ - -void ipath_sqerror_qp(struct ipath_qp *qp, struct ib_wc *wc) -{ - struct ipath_ibdev *dev = to_idev(qp->ibqp.device); - struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); - - ipath_dbg("Send queue error on QP%d/%d: err: %d\n", - qp->ibqp.qp_num, qp->remote_qpn, wc->status); - - spin_lock(&dev->pending_lock); - if (!list_empty(&qp->timerwait)) - list_del_init(&qp->timerwait); - if (!list_empty(&qp->piowait)) - list_del_init(&qp->piowait); - spin_unlock(&dev->pending_lock); - - ipath_cq_enter(to_icq(qp->ibqp.send_cq), wc, 1); - if (++qp->s_last >= qp->s_size) - qp->s_last = 0; - - wc->status = IB_WC_WR_FLUSH_ERR; - - while (qp->s_last != qp->s_head) { - wqe = get_swqe_ptr(qp, qp->s_last); - wc->wr_id = wqe->wr.wr_id; - wc->opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - ipath_cq_enter(to_icq(qp->ibqp.send_cq), wc, 1); - if (++qp->s_last >= qp->s_size) - qp->s_last = 0; - } - qp->s_cur = qp->s_tail = qp->s_head; - qp->state = IB_QPS_SQE; -} - -/** * ipath_get_credit - flush the send work queue of a QP * @qp: the qp who's send work queue to flush * @aeth: the Acknowledge Extended Transport Header diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c index 08b11b5..b4b26c3 100644 --- a/drivers/infiniband/hw/ipath/ipath_rc.c +++ b/drivers/infiniband/hw/ipath/ipath_rc.c @@ -771,27 +771,14 @@ done: * * The QP s_lock should be held and interrupts disabled. 
*/ -void ipath_restart_rc(struct ipath_qp *qp, u32 psn, struct ib_wc *wc) +void ipath_restart_rc(struct ipath_qp *qp, u32 psn) { struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); struct ipath_ibdev *dev; if (qp->s_retry == 0) { - wc->wr_id = wqe->wr.wr_id; - wc->status = IB_WC_RETRY_EXC_ERR; - wc->opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - wc->vendor_err = 0; - wc->byte_len = 0; - wc->qp = &qp->ibqp; - wc->imm_data = 0; - wc->src_qp = qp->remote_qpn; - wc->wc_flags = 0; - wc->pkey_index = 0; - wc->slid = qp->remote_ah_attr.dlid; - wc->sl = qp->remote_ah_attr.sl; - wc->dlid_path_bits = 0; - wc->port_num = 0; - ipath_sqerror_qp(qp, wc); + ipath_send_complete(qp, wqe, IB_WC_RETRY_EXC_ERR); + ipath_error_qp(qp, IB_WC_WR_FLUSH_ERR); goto bail; } qp->s_retry--; @@ -804,6 +791,8 @@ void ipath_restart_rc(struct ipath_qp *qp, u32 psn, struct ib_wc *wc) spin_lock(&dev->pending_lock); if (!list_empty(&qp->timerwait)) list_del_init(&qp->timerwait); + if (!list_empty(&qp->piowait)) + list_del_init(&qp->piowait); spin_unlock(&dev->pending_lock); if (wqe->wr.opcode == IB_WR_RDMA_READ) @@ -845,6 +834,7 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, { struct ipath_ibdev *dev = to_idev(qp->ibqp.device); struct ib_wc wc; + enum ib_wc_status status; struct ipath_swqe *wqe; int ret = 0; u32 ack_psn; @@ -909,7 +899,7 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, */ update_last_psn(qp, wqe->psn - 1); /* Retry this request. */ - ipath_restart_rc(qp, wqe->psn, &wc); + ipath_restart_rc(qp, wqe->psn); /* * No need to process the ACK/NAK since we are * restarting an earlier request. @@ -937,20 +927,15 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, /* Post a send completion queue entry if requested. */ if (!(qp->s_flags & IPATH_S_SIGNAL_REQ_WR) || (wqe->wr.send_flags & IB_SEND_SIGNALED)) { + memset(&wc, 0, sizeof wc); wc.wr_id = wqe->wr.wr_id; wc.status = IB_WC_SUCCESS; wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - wc.vendor_err = 0; wc.byte_len = wqe->length; - wc.imm_data = 0; wc.qp = &qp->ibqp; wc.src_qp = qp->remote_qpn; - wc.wc_flags = 0; - wc.pkey_index = 0; wc.slid = qp->remote_ah_attr.dlid; wc.sl = qp->remote_ah_attr.sl; - wc.dlid_path_bits = 0; - wc.port_num = 0; ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 0); } qp->s_retry = qp->s_retry_cnt; @@ -1012,7 +997,7 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, if (qp->s_last == qp->s_tail) goto bail; if (qp->s_rnr_retry == 0) { - wc.status = IB_WC_RNR_RETRY_EXC_ERR; + status = IB_WC_RNR_RETRY_EXC_ERR; goto class_b; } if (qp->s_rnr_retry_cnt < 7) @@ -1050,37 +1035,25 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, * RDMA READ response which terminates the RDMA * READ. 
*/ - ipath_restart_rc(qp, psn, &wc); + ipath_restart_rc(qp, psn); break; case 1: /* Invalid Request */ - wc.status = IB_WC_REM_INV_REQ_ERR; + status = IB_WC_REM_INV_REQ_ERR; dev->n_other_naks++; goto class_b; case 2: /* Remote Access Error */ - wc.status = IB_WC_REM_ACCESS_ERR; + status = IB_WC_REM_ACCESS_ERR; dev->n_other_naks++; goto class_b; case 3: /* Remote Operation Error */ - wc.status = IB_WC_REM_OP_ERR; + status = IB_WC_REM_OP_ERR; dev->n_other_naks++; class_b: - wc.wr_id = wqe->wr.wr_id; - wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - wc.vendor_err = 0; - wc.byte_len = 0; - wc.qp = &qp->ibqp; - wc.imm_data = 0; - wc.src_qp = qp->remote_qpn; - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = qp->remote_ah_attr.dlid; - wc.sl = qp->remote_ah_attr.sl; - wc.dlid_path_bits = 0; - wc.port_num = 0; - ipath_sqerror_qp(qp, &wc); + ipath_send_complete(qp, wqe, status); + ipath_error_qp(qp, IB_WC_WR_FLUSH_ERR); break; default: @@ -1126,8 +1099,8 @@ static inline void ipath_rc_rcv_resp(struct ipath_ibdev *dev, int header_in_data) { struct ipath_swqe *wqe; + enum ib_wc_status status; unsigned long flags; - struct ib_wc wc; int diff; u32 pad; u32 aeth; @@ -1159,6 +1132,7 @@ static inline void ipath_rc_rcv_resp(struct ipath_ibdev *dev, if (unlikely(qp->s_last == qp->s_tail)) goto ack_done; wqe = get_swqe_ptr(qp, qp->s_last); + status = IB_WC_SUCCESS; switch (opcode) { case OP(ACKNOWLEDGE): @@ -1200,7 +1174,7 @@ static inline void ipath_rc_rcv_resp(struct ipath_ibdev *dev, /* no AETH, no ACK */ if (unlikely(ipath_cmp24(psn, qp->s_last_psn + 1))) { dev->n_rdma_seq++; - ipath_restart_rc(qp, qp->s_last_psn + 1, &wc); + ipath_restart_rc(qp, qp->s_last_psn + 1); goto ack_done; } if (unlikely(wqe->wr.opcode != IB_WR_RDMA_READ)) @@ -1261,7 +1235,7 @@ static inline void ipath_rc_rcv_resp(struct ipath_ibdev *dev, /* ACKs READ req. 
*/ if (unlikely(ipath_cmp24(psn, qp->s_last_psn + 1))) { dev->n_rdma_seq++; - ipath_restart_rc(qp, qp->s_last_psn + 1, &wc); + ipath_restart_rc(qp, qp->s_last_psn + 1); goto ack_done; } if (unlikely(wqe->wr.opcode != IB_WR_RDMA_READ)) @@ -1291,31 +1265,16 @@ static inline void ipath_rc_rcv_resp(struct ipath_ibdev *dev, goto ack_done; } -ack_done: - spin_unlock_irqrestore(&qp->s_lock, flags); - goto bail; - ack_op_err: - wc.status = IB_WC_LOC_QP_OP_ERR; + status = IB_WC_LOC_QP_OP_ERR; goto ack_err; ack_len_err: - wc.status = IB_WC_LOC_LEN_ERR; + status = IB_WC_LOC_LEN_ERR; ack_err: - wc.wr_id = wqe->wr.wr_id; - wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - wc.vendor_err = 0; - wc.byte_len = 0; - wc.imm_data = 0; - wc.qp = &qp->ibqp; - wc.src_qp = qp->remote_qpn; - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = qp->remote_ah_attr.dlid; - wc.sl = qp->remote_ah_attr.sl; - wc.dlid_path_bits = 0; - wc.port_num = 0; - ipath_sqerror_qp(qp, &wc); + ipath_send_complete(qp, wqe, status); + ipath_error_qp(qp, IB_WC_WR_FLUSH_ERR); +ack_done: spin_unlock_irqrestore(&qp->s_lock, flags); bail: return; @@ -1523,13 +1482,12 @@ send_ack: return 0; } -static void ipath_rc_error(struct ipath_qp *qp, enum ib_wc_status err) +void ipath_rc_error(struct ipath_qp *qp, enum ib_wc_status err) { unsigned long flags; int lastwqe; spin_lock_irqsave(&qp->s_lock, flags); - qp->state = IB_QPS_ERR; lastwqe = ipath_error_qp(qp, err); spin_unlock_irqrestore(&qp->s_lock, flags); @@ -1643,11 +1601,7 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, opcode == OP(SEND_LAST) || opcode == OP(SEND_LAST_WITH_IMMEDIATE)) break; - nack_inv: - ipath_rc_error(qp, IB_WC_REM_INV_REQ_ERR); - qp->r_nak_state = IB_NAK_INVALID_REQUEST; - qp->r_ack_psn = qp->r_psn; - goto send_ack; + goto nack_inv; case OP(RDMA_WRITE_FIRST): case OP(RDMA_WRITE_MIDDLE): @@ -1673,18 +1627,13 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, break; } - wc.imm_data = 0; - wc.wc_flags = 0; + memset(&wc, 0, sizeof wc); /* OK, process the packet. */ switch (opcode) { case OP(SEND_FIRST): - if (!ipath_get_rwqe(qp, 0)) { - rnr_nak: - qp->r_nak_state = IB_RNR_NAK | qp->r_min_rnr_timer; - qp->r_ack_psn = qp->r_psn; - goto send_ack; - } + if (!ipath_get_rwqe(qp, 0)) + goto rnr_nak; qp->r_rcv_len = 0; /* FALLTHROUGH */ case OP(SEND_MIDDLE): @@ -1751,14 +1700,10 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, wc.opcode = IB_WC_RECV_RDMA_WITH_IMM; else wc.opcode = IB_WC_RECV; - wc.vendor_err = 0; wc.qp = &qp->ibqp; wc.src_qp = qp->remote_qpn; - wc.pkey_index = 0; wc.slid = qp->remote_ah_attr.dlid; wc.sl = qp->remote_ah_attr.sl; - wc.dlid_path_bits = 0; - wc.port_num = 0; /* Signal completion event if the solicited bit is set. 
*/ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, (ohdr->bth[0] & @@ -1951,11 +1896,21 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, goto send_ack; goto done; +rnr_nak: + qp->r_nak_state = IB_RNR_NAK | qp->r_min_rnr_timer; + qp->r_ack_psn = qp->r_psn; + goto send_ack; + +nack_inv: + ipath_rc_error(qp, IB_WC_LOC_QP_OP_ERR); + qp->r_nak_state = IB_NAK_INVALID_REQUEST; + qp->r_ack_psn = qp->r_psn; + goto send_ack; + nack_acc: - ipath_rc_error(qp, IB_WC_REM_ACCESS_ERR); + ipath_rc_error(qp, IB_WC_LOC_PROT_ERR); qp->r_nak_state = IB_NAK_REMOTE_ACCESS_ERROR; qp->r_ack_psn = qp->r_psn; - send_ack: send_rc_ack(qp); diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c index 9e3fe61..c716a03 100644 --- a/drivers/infiniband/hw/ipath/ipath_ruc.c +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2006, 2007 QLogic Corporation. All rights reserved. + * Copyright (c) 2006, 2007, 2008 QLogic Corporation. All rights reserved. * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -140,20 +140,11 @@ int ipath_init_sge(struct ipath_qp *qp, struct ipath_rwqe *wqe, goto bail; bad_lkey: + memset(&wc, 0, sizeof(wc)); wc.wr_id = wqe->wr_id; wc.status = IB_WC_LOC_PROT_ERR; wc.opcode = IB_WC_RECV; - wc.vendor_err = 0; - wc.byte_len = 0; - wc.imm_data = 0; wc.qp = &qp->ibqp; - wc.src_qp = 0; - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = 0; - wc.sl = 0; - wc.dlid_path_bits = 0; - wc.port_num = 0; /* Signal solicited completion event. */ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); ret = 0; @@ -270,6 +261,7 @@ static void ipath_ruc_loopback(struct ipath_qp *sqp) struct ib_wc wc; u64 sdata; atomic64_t *maddr; + enum ib_wc_status send_status; qp = ipath_lookup_qpn(&dev->qp_table, sqp->remote_qpn); if (!qp) { @@ -300,8 +292,8 @@ again: wqe = get_swqe_ptr(sqp, sqp->s_last); spin_unlock_irqrestore(&sqp->s_lock, flags); - wc.wc_flags = 0; - wc.imm_data = 0; + memset(&wc, 0, sizeof wc); + send_status = IB_WC_SUCCESS; sqp->s_sge.sge = wqe->sg_list[0]; sqp->s_sge.sg_list = wqe->sg_list + 1; @@ -313,75 +305,33 @@ again: wc.imm_data = wqe->wr.ex.imm_data; /* FALLTHROUGH */ case IB_WR_SEND: - if (!ipath_get_rwqe(qp, 0)) { - rnr_nak: - /* Handle RNR NAK */ - if (qp->ibqp.qp_type == IB_QPT_UC) - goto send_comp; - if (sqp->s_rnr_retry == 0) { - wc.status = IB_WC_RNR_RETRY_EXC_ERR; - goto err; - } - if (sqp->s_rnr_retry_cnt < 7) - sqp->s_rnr_retry--; - dev->n_rnr_naks++; - sqp->s_rnr_timeout = - ib_ipath_rnr_table[qp->r_min_rnr_timer]; - ipath_insert_rnr_queue(sqp); - goto done; - } + if (!ipath_get_rwqe(qp, 0)) + goto rnr_nak; break; case IB_WR_RDMA_WRITE_WITH_IMM: - if (unlikely(!(qp->qp_access_flags & - IB_ACCESS_REMOTE_WRITE))) { - wc.status = IB_WC_REM_INV_REQ_ERR; - goto err; - } + if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_WRITE))) + goto inv_err; wc.wc_flags = IB_WC_WITH_IMM; wc.imm_data = wqe->wr.ex.imm_data; if (!ipath_get_rwqe(qp, 1)) goto rnr_nak; /* FALLTHROUGH */ case IB_WR_RDMA_WRITE: - if (unlikely(!(qp->qp_access_flags & - IB_ACCESS_REMOTE_WRITE))) { - wc.status = IB_WC_REM_INV_REQ_ERR; - goto err; - } + if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_WRITE))) + goto inv_err; if (wqe->length == 0) break; if (unlikely(!ipath_rkey_ok(qp, &qp->r_sge, wqe->length, wqe->wr.wr.rdma.remote_addr, wqe->wr.wr.rdma.rkey, - IB_ACCESS_REMOTE_WRITE))) { - acc_err: - wc.status = IB_WC_REM_ACCESS_ERR; - err: - wc.wr_id = 
wqe->wr.wr_id; - wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - wc.vendor_err = 0; - wc.byte_len = 0; - wc.qp = &sqp->ibqp; - wc.src_qp = sqp->remote_qpn; - wc.pkey_index = 0; - wc.slid = sqp->remote_ah_attr.dlid; - wc.sl = sqp->remote_ah_attr.sl; - wc.dlid_path_bits = 0; - wc.port_num = 0; - spin_lock_irqsave(&sqp->s_lock, flags); - ipath_sqerror_qp(sqp, &wc); - spin_unlock_irqrestore(&sqp->s_lock, flags); - goto done; - } + IB_ACCESS_REMOTE_WRITE))) + goto acc_err; break; case IB_WR_RDMA_READ: - if (unlikely(!(qp->qp_access_flags & - IB_ACCESS_REMOTE_READ))) { - wc.status = IB_WC_REM_INV_REQ_ERR; - goto err; - } + if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_READ))) + goto inv_err; if (unlikely(!ipath_rkey_ok(qp, &sqp->s_sge, wqe->length, wqe->wr.wr.rdma.remote_addr, wqe->wr.wr.rdma.rkey, @@ -394,11 +344,8 @@ again: case IB_WR_ATOMIC_CMP_AND_SWP: case IB_WR_ATOMIC_FETCH_AND_ADD: - if (unlikely(!(qp->qp_access_flags & - IB_ACCESS_REMOTE_ATOMIC))) { - wc.status = IB_WC_REM_INV_REQ_ERR; - goto err; - } + if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_ATOMIC))) + goto inv_err; if (unlikely(!ipath_rkey_ok(qp, &qp->r_sge, sizeof(u64), wqe->wr.wr.atomic.remote_addr, wqe->wr.wr.atomic.rkey, @@ -415,7 +362,8 @@ again: goto send_comp; default: - goto done; + send_status = IB_WC_LOC_QP_OP_ERR; + goto serr; } sge = &sqp->s_sge.sge; @@ -458,14 +406,11 @@ again: wc.opcode = IB_WC_RECV; wc.wr_id = qp->r_wr_id; wc.status = IB_WC_SUCCESS; - wc.vendor_err = 0; wc.byte_len = wqe->length; wc.qp = &qp->ibqp; wc.src_qp = qp->remote_qpn; - wc.pkey_index = 0; wc.slid = qp->remote_ah_attr.dlid; wc.sl = qp->remote_ah_attr.sl; - wc.dlid_path_bits = 0; wc.port_num = 1; /* Signal completion event if the solicited bit is set. */ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, @@ -473,9 +418,63 @@ again: send_comp: sqp->s_rnr_retry = sqp->s_rnr_retry_cnt; - ipath_send_complete(sqp, wqe, IB_WC_SUCCESS); + ipath_send_complete(sqp, wqe, send_status); goto again; +rnr_nak: + /* Handle RNR NAK */ + if (qp->ibqp.qp_type == IB_QPT_UC) + goto send_comp; + /* + * Note: we don't need the s_lock held since the BUSY flag + * makes this single threaded. 
+ */ + if (sqp->s_rnr_retry == 0) { + send_status = IB_WC_RNR_RETRY_EXC_ERR; + goto serr; + } + if (sqp->s_rnr_retry_cnt < 7) + sqp->s_rnr_retry--; + spin_lock_irqsave(&sqp->s_lock, flags); + if (!(ib_ipath_state_ops[sqp->state] & IPATH_PROCESS_RECV_OK)) + goto unlock; + dev->n_rnr_naks++; + sqp->s_rnr_timeout = ib_ipath_rnr_table[qp->r_min_rnr_timer]; + ipath_insert_rnr_queue(sqp); + goto unlock; + +inv_err: + send_status = IB_WC_REM_INV_REQ_ERR; + wc.status = IB_WC_LOC_QP_OP_ERR; + goto err; + +acc_err: + send_status = IB_WC_REM_ACCESS_ERR; + wc.status = IB_WC_LOC_PROT_ERR; +err: + /* responder goes to error state */ + ipath_rc_error(qp, wc.status); + +serr: + spin_lock_irqsave(&sqp->s_lock, flags); + ipath_send_complete(sqp, wqe, send_status); + if (sqp->ibqp.qp_type == IB_QPT_RC) { + int lastwqe = ipath_error_qp(sqp, IB_WC_WR_FLUSH_ERR); + + sqp->s_flags &= ~IPATH_S_BUSY; + spin_unlock_irqrestore(&sqp->s_lock, flags); + if (lastwqe) { + struct ib_event ev; + + ev.device = sqp->ibqp.device; + ev.element.qp = &sqp->ibqp; + ev.event = IB_EVENT_QP_LAST_WQE_REACHED; + sqp->ibqp.event_handler(&ev, sqp->ibqp.qp_context); + } + goto done; + } +unlock: + spin_unlock_irqrestore(&sqp->s_lock, flags); done: if (atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); @@ -651,21 +650,15 @@ void ipath_send_complete(struct ipath_qp *qp, struct ipath_swqe *wqe, status != IB_WC_SUCCESS) { struct ib_wc wc; + memset(&wc, 0, sizeof wc); wc.wr_id = wqe->wr.wr_id; wc.status = status; wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - wc.vendor_err = 0; - wc.byte_len = wqe->length; - wc.imm_data = 0; wc.qp = &qp->ibqp; - wc.src_qp = 0; - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = 0; - wc.sl = 0; - wc.dlid_path_bits = 0; - wc.port_num = 0; - ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 0); + if (status == IB_WC_SUCCESS) + wc.byte_len = wqe->length; + ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, + status != IB_WC_SUCCESS); } spin_lock_irqsave(&qp->s_lock, flags); diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 5015cd2..22bb42d 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -744,12 +744,10 @@ static void ipath_ib_timer(struct ipath_ibdev *dev) /* XXX What if timer fires again while this is running? 
*/ for (qp = resend; qp != NULL; qp = qp->timer_next) { - struct ib_wc wc; - spin_lock_irqsave(&qp->s_lock, flags); if (qp->s_last != qp->s_tail && qp->state == IB_QPS_RTS) { dev->n_timeouts++; - ipath_restart_rc(qp, qp->s_last_psn + 1, &wc); + ipath_restart_rc(qp, qp->s_last_psn + 1); } spin_unlock_irqrestore(&qp->s_lock, flags); diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index 6514aa8..4c7c2aa 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -710,8 +710,6 @@ void ipath_free_all_qps(struct ipath_qp_table *qpt); int ipath_init_qp_table(struct ipath_ibdev *idev, int size); -void ipath_sqerror_qp(struct ipath_qp *qp, struct ib_wc *wc); - void ipath_get_credit(struct ipath_qp *qp, u32 aeth); unsigned ipath_ib_rate_to_mult(enum ib_rate rate); @@ -729,7 +727,9 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, int has_grh, void *data, u32 tlen, struct ipath_qp *qp); -void ipath_restart_rc(struct ipath_qp *qp, u32 psn, struct ib_wc *wc); +void ipath_restart_rc(struct ipath_qp *qp, u32 psn); + +void ipath_rc_error(struct ipath_qp *qp, enum ib_wc_status err); int ipath_post_ud_send(struct ipath_qp *qp, struct ib_send_wr *wr); From ralph.campbell at qlogic.com Thu May 8 11:55:23 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 08 May 2008 11:55:23 -0700 Subject: [ofa-general] [PATCH 2/3] IB/ipath - fix many locking issues when switching to error state In-Reply-To: <20080508185512.8547.29637.stgit@eng-46.mv.qlogic.com> References: <20080508185512.8547.29637.stgit@eng-46.mv.qlogic.com> Message-ID: <20080508185522.8547.46713.stgit@eng-46.mv.qlogic.com> The send DMA hardware queue voided a number of prior assumptions about when a send is complete which led to completions being generated out of order. There were also a number of locking issues when switching the QP to the error or reset states, and implements the IB_QPS_SQD state. 
Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_qp.c | 183 ++++++++++++------------- drivers/infiniband/hw/ipath/ipath_rc.c | 151 +++++++++++++-------- drivers/infiniband/hw/ipath/ipath_ruc.c | 168 ++++++++++++++++------- drivers/infiniband/hw/ipath/ipath_uc.c | 57 +++++--- drivers/infiniband/hw/ipath/ipath_ud.c | 66 +++++++-- drivers/infiniband/hw/ipath/ipath_user_sdma.h | 2 drivers/infiniband/hw/ipath/ipath_verbs.c | 174 ++++++++++++++++-------- drivers/infiniband/hw/ipath/ipath_verbs.h | 57 +++++++- 8 files changed, 554 insertions(+), 304 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c index 6f98632..4715911 100644 --- a/drivers/infiniband/hw/ipath/ipath_qp.c +++ b/drivers/infiniband/hw/ipath/ipath_qp.c @@ -242,7 +242,6 @@ static void ipath_free_qp(struct ipath_qp_table *qpt, struct ipath_qp *qp) { struct ipath_qp *q, **qpp; unsigned long flags; - int fnd = 0; spin_lock_irqsave(&qpt->lock, flags); @@ -253,51 +252,40 @@ static void ipath_free_qp(struct ipath_qp_table *qpt, struct ipath_qp *qp) *qpp = qp->next; qp->next = NULL; atomic_dec(&qp->refcount); - fnd = 1; break; } } spin_unlock_irqrestore(&qpt->lock, flags); - - if (!fnd) - return; - - free_qpn(qpt, qp->ibqp.qp_num); - - wait_event(qp->wait, !atomic_read(&qp->refcount)); } /** - * ipath_free_all_qps - remove all QPs from the table + * ipath_free_all_qps - check for QPs still in use * @qpt: the QP table to empty + * + * There should not be any QPs still in use. + * Free memory for table. */ -void ipath_free_all_qps(struct ipath_qp_table *qpt) +unsigned ipath_free_all_qps(struct ipath_qp_table *qpt) { unsigned long flags; - struct ipath_qp *qp, *nqp; - u32 n; + struct ipath_qp *qp; + u32 n, qp_inuse = 0; + spin_lock_irqsave(&qpt->lock, flags); for (n = 0; n < qpt->max; n++) { - spin_lock_irqsave(&qpt->lock, flags); qp = qpt->table[n]; qpt->table[n] = NULL; - spin_unlock_irqrestore(&qpt->lock, flags); - - while (qp) { - nqp = qp->next; - free_qpn(qpt, qp->ibqp.qp_num); - if (!atomic_dec_and_test(&qp->refcount) || - !ipath_destroy_qp(&qp->ibqp)) - ipath_dbg("QP memory leak!\n"); - qp = nqp; - } + + for (; qp; qp = qp->next) + qp_inuse++; } + spin_unlock_irqrestore(&qpt->lock, flags); - for (n = 0; n < ARRAY_SIZE(qpt->map); n++) { + for (n = 0; n < ARRAY_SIZE(qpt->map); n++) if (qpt->map[n].page) - free_page((unsigned long)qpt->map[n].page); - } + free_page((unsigned long) qpt->map[n].page); + return qp_inuse; } /** @@ -336,11 +324,12 @@ static void ipath_reset_qp(struct ipath_qp *qp, enum ib_qp_type type) qp->remote_qpn = 0; qp->qkey = 0; qp->qp_access_flags = 0; - qp->s_busy = 0; + atomic_set(&qp->s_dma_busy, 0); qp->s_flags &= IPATH_S_SIGNAL_REQ_WR; qp->s_hdrwords = 0; qp->s_wqe = NULL; qp->s_pkt_delay = 0; + qp->s_draining = 0; qp->s_psn = 0; qp->r_psn = 0; qp->r_msn = 0; @@ -353,7 +342,8 @@ static void ipath_reset_qp(struct ipath_qp *qp, enum ib_qp_type type) } qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; qp->r_nak_state = 0; - qp->r_wrid_valid = 0; + qp->r_aflags = 0; + qp->r_flags = 0; qp->s_rnr_timeout = 0; qp->s_head = 0; qp->s_tail = 0; @@ -361,7 +351,6 @@ static void ipath_reset_qp(struct ipath_qp *qp, enum ib_qp_type type) qp->s_last = 0; qp->s_ssn = 1; qp->s_lsn = 0; - qp->s_wait_credit = 0; memset(qp->s_ack_queue, 0, sizeof(qp->s_ack_queue)); qp->r_head_ack_queue = 0; qp->s_tail_ack_queue = 0; @@ -370,7 +359,6 @@ static void ipath_reset_qp(struct ipath_qp *qp, enum ib_qp_type type) qp->r_rq.wq->head = 0; qp->r_rq.wq->tail = 0; } - 
qp->r_reuse_sge = 0; } /** @@ -402,39 +390,21 @@ int ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) list_del_init(&qp->piowait); spin_unlock(&dev->pending_lock); - wc.vendor_err = 0; - wc.byte_len = 0; - wc.imm_data = 0; + /* Schedule the sending tasklet to drain the send work queue. */ + if (qp->s_last != qp->s_head) + ipath_schedule_send(qp); + + memset(&wc, 0, sizeof(wc)); wc.qp = &qp->ibqp; - wc.src_qp = 0; - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = 0; - wc.sl = 0; - wc.dlid_path_bits = 0; - wc.port_num = 0; - if (qp->r_wrid_valid) { - qp->r_wrid_valid = 0; + wc.opcode = IB_WC_RECV; + + if (test_and_clear_bit(IPATH_R_WRID_VALID, &qp->r_aflags)) { wc.wr_id = qp->r_wr_id; - wc.opcode = IB_WC_RECV; wc.status = err; ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); } wc.status = IB_WC_WR_FLUSH_ERR; - while (qp->s_last != qp->s_head) { - struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); - - wc.wr_id = wqe->wr.wr_id; - wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - if (++qp->s_last >= qp->s_size) - qp->s_last = 0; - ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 1); - } - qp->s_cur = qp->s_tail = qp->s_head; - qp->s_hdrwords = 0; - qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; - if (qp->r_rq.wq) { struct ipath_rwq *wq; u32 head; @@ -450,7 +420,6 @@ int ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) tail = wq->tail; if (tail >= qp->r_rq.size) tail = 0; - wc.opcode = IB_WC_RECV; while (tail != head) { wc.wr_id = get_rwqe_ptr(&qp->r_rq, tail)->wr_id; if (++tail >= qp->r_rq.size) @@ -482,11 +451,10 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, struct ipath_ibdev *dev = to_idev(ibqp->device); struct ipath_qp *qp = to_iqp(ibqp); enum ib_qp_state cur_state, new_state; - unsigned long flags; int lastwqe = 0; int ret; - spin_lock_irqsave(&qp->s_lock, flags); + spin_lock_irq(&qp->s_lock); cur_state = attr_mask & IB_QP_CUR_STATE ? 
attr->cur_qp_state : qp->state; @@ -539,16 +507,42 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, switch (new_state) { case IB_QPS_RESET: + if (qp->state != IB_QPS_RESET) { + qp->state = IB_QPS_RESET; + spin_lock(&dev->pending_lock); + if (!list_empty(&qp->timerwait)) + list_del_init(&qp->timerwait); + if (!list_empty(&qp->piowait)) + list_del_init(&qp->piowait); + spin_unlock(&dev->pending_lock); + qp->s_flags &= ~IPATH_S_ANY_WAIT; + spin_unlock_irq(&qp->s_lock); + /* Stop the sending tasklet */ + tasklet_kill(&qp->s_task); + wait_event(qp->wait_dma, !atomic_read(&qp->s_dma_busy)); + spin_lock_irq(&qp->s_lock); + } ipath_reset_qp(qp, ibqp->qp_type); break; + case IB_QPS_SQD: + qp->s_draining = qp->s_last != qp->s_cur; + qp->state = new_state; + break; + + case IB_QPS_SQE: + if (qp->ibqp.qp_type == IB_QPT_RC) + goto inval; + qp->state = new_state; + break; + case IB_QPS_ERR: lastwqe = ipath_error_qp(qp, IB_WC_WR_FLUSH_ERR); break; default: + qp->state = new_state; break; - } if (attr_mask & IB_QP_PKEY_INDEX) @@ -601,8 +595,7 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC) qp->s_max_rd_atomic = attr->max_rd_atomic; - qp->state = new_state; - spin_unlock_irqrestore(&qp->s_lock, flags); + spin_unlock_irq(&qp->s_lock); if (lastwqe) { struct ib_event ev; @@ -616,7 +609,7 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, goto bail; inval: - spin_unlock_irqrestore(&qp->s_lock, flags); + spin_unlock_irq(&qp->s_lock); ret = -EINVAL; bail: @@ -647,7 +640,7 @@ int ipath_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, attr->pkey_index = qp->s_pkey_index; attr->alt_pkey_index = 0; attr->en_sqd_async_notify = 0; - attr->sq_draining = 0; + attr->sq_draining = qp->s_draining; attr->max_rd_atomic = qp->s_max_rd_atomic; attr->max_dest_rd_atomic = qp->r_max_rd_atomic; attr->min_rnr_timer = qp->r_min_rnr_timer; @@ -837,6 +830,7 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd, spin_lock_init(&qp->r_rq.lock); atomic_set(&qp->refcount, 0); init_waitqueue_head(&qp->wait); + init_waitqueue_head(&qp->wait_dma); tasklet_init(&qp->s_task, ipath_do_send, (unsigned long)qp); INIT_LIST_HEAD(&qp->piowait); INIT_LIST_HEAD(&qp->timerwait); @@ -930,6 +924,7 @@ bail_ip: else vfree(qp->r_rq.wq); ipath_free_qp(&dev->qp_table, qp); + free_qpn(&dev->qp_table, qp->ibqp.qp_num); bail_qp: kfree(qp); bail_swq: @@ -951,41 +946,44 @@ int ipath_destroy_qp(struct ib_qp *ibqp) { struct ipath_qp *qp = to_iqp(ibqp); struct ipath_ibdev *dev = to_idev(ibqp->device); - unsigned long flags; - spin_lock_irqsave(&qp->s_lock, flags); - qp->state = IB_QPS_ERR; - spin_unlock_irqrestore(&qp->s_lock, flags); - spin_lock(&dev->n_qps_lock); - dev->n_qps_allocated--; - spin_unlock(&dev->n_qps_lock); + /* Make sure HW and driver activity is stopped. */ + spin_lock_irq(&qp->s_lock); + if (qp->state != IB_QPS_RESET) { + qp->state = IB_QPS_RESET; + spin_lock(&dev->pending_lock); + if (!list_empty(&qp->timerwait)) + list_del_init(&qp->timerwait); + if (!list_empty(&qp->piowait)) + list_del_init(&qp->piowait); + spin_unlock(&dev->pending_lock); + qp->s_flags &= ~IPATH_S_ANY_WAIT; + spin_unlock_irq(&qp->s_lock); + /* Stop the sending tasklet */ + tasklet_kill(&qp->s_task); + wait_event(qp->wait_dma, !atomic_read(&qp->s_dma_busy)); + } else + spin_unlock_irq(&qp->s_lock); - /* Stop the sending tasklet. 
*/ - tasklet_kill(&qp->s_task); + ipath_free_qp(&dev->qp_table, qp); if (qp->s_tx) { atomic_dec(&qp->refcount); if (qp->s_tx->txreq.flags & IPATH_SDMA_TXREQ_F_FREEBUF) kfree(qp->s_tx->txreq.map_addr); + spin_lock_irq(&dev->pending_lock); + list_add(&qp->s_tx->txreq.list, &dev->txreq_free); + spin_unlock_irq(&dev->pending_lock); + qp->s_tx = NULL; } - /* Make sure the QP isn't on the timeout list. */ - spin_lock_irqsave(&dev->pending_lock, flags); - if (!list_empty(&qp->timerwait)) - list_del_init(&qp->timerwait); - if (!list_empty(&qp->piowait)) - list_del_init(&qp->piowait); - if (qp->s_tx) - list_add(&qp->s_tx->txreq.list, &dev->txreq_free); - spin_unlock_irqrestore(&dev->pending_lock, flags); + wait_event(qp->wait, !atomic_read(&qp->refcount)); - /* - * Make sure that the QP is not in the QPN table so receive - * interrupts will discard packets for this QP. XXX Also remove QP - * from multicast table. - */ - if (atomic_read(&qp->refcount) != 0) - ipath_free_qp(&dev->qp_table, qp); + /* all user's cleaned up, mark it available */ + free_qpn(&dev->qp_table, qp->ibqp.qp_num); + spin_lock(&dev->n_qps_lock); + dev->n_qps_allocated--; + spin_unlock(&dev->n_qps_lock); if (qp->ip) kref_put(&qp->ip->ref, ipath_release_mmap_info); @@ -1055,9 +1053,10 @@ void ipath_get_credit(struct ipath_qp *qp, u32 aeth) } /* Restart sending if it was blocked due to lack of credits. */ - if (qp->s_cur != qp->s_head && + if ((qp->s_flags & IPATH_S_WAIT_SSN_CREDIT) && + qp->s_cur != qp->s_head && (qp->s_lsn == (u32) -1 || ipath_cmp24(get_swqe_ptr(qp, qp->s_cur)->ssn, qp->s_lsn + 1) <= 0)) - tasklet_hi_schedule(&qp->s_task); + ipath_schedule_send(qp); } diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c index b4b26c3..5b5276a 100644 --- a/drivers/infiniband/hw/ipath/ipath_rc.c +++ b/drivers/infiniband/hw/ipath/ipath_rc.c @@ -92,6 +92,10 @@ static int ipath_make_rc_ack(struct ipath_ibdev *dev, struct ipath_qp *qp, u32 bth0; u32 bth2; + /* Don't send an ACK if we aren't supposed to. */ + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) + goto bail; + /* header size in 32-bit words LRH+BTH = (8+12)/4. */ hwords = 5; @@ -238,14 +242,25 @@ int ipath_make_rc_req(struct ipath_qp *qp) ipath_make_rc_ack(dev, qp, ohdr, pmtu)) goto done; - if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK) || - qp->s_rnr_timeout || qp->s_wait_credit) - goto bail; + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK)) { + if (!(ib_ipath_state_ops[qp->state] & IPATH_FLUSH_SEND)) + goto bail; + /* We are in the error state, flush the work request. */ + if (qp->s_last == qp->s_head) + goto bail; + /* If DMAs are in progress, we can't flush immediately. */ + if (atomic_read(&qp->s_dma_busy)) { + qp->s_flags |= IPATH_S_WAIT_DMA; + goto bail; + } + wqe = get_swqe_ptr(qp, qp->s_last); + ipath_send_complete(qp, wqe, IB_WC_WR_FLUSH_ERR); + goto done; + } - /* Limit the number of packets sent without an ACK. */ - if (ipath_cmp24(qp->s_psn, qp->s_last_psn + IPATH_PSN_CREDIT) > 0) { - qp->s_wait_credit = 1; - dev->n_rc_stalls++; + /* Leave BUSY set until RNR timeout. */ + if (qp->s_rnr_timeout) { + qp->s_flags |= IPATH_S_WAITING; goto bail; } @@ -257,6 +272,9 @@ int ipath_make_rc_req(struct ipath_qp *qp) wqe = get_swqe_ptr(qp, qp->s_cur); switch (qp->s_state) { default: + if (!(ib_ipath_state_ops[qp->state] & + IPATH_PROCESS_NEXT_SEND_OK)) + goto bail; /* * Resend an old request or start a new one. 
* @@ -294,8 +312,10 @@ int ipath_make_rc_req(struct ipath_qp *qp) case IB_WR_SEND_WITH_IMM: /* If no credit, return. */ if (qp->s_lsn != (u32) -1 && - ipath_cmp24(wqe->ssn, qp->s_lsn + 1) > 0) + ipath_cmp24(wqe->ssn, qp->s_lsn + 1) > 0) { + qp->s_flags |= IPATH_S_WAIT_SSN_CREDIT; goto bail; + } wqe->lpsn = wqe->psn; if (len > pmtu) { wqe->lpsn += (len - 1) / pmtu; @@ -325,8 +345,10 @@ int ipath_make_rc_req(struct ipath_qp *qp) case IB_WR_RDMA_WRITE_WITH_IMM: /* If no credit, return. */ if (qp->s_lsn != (u32) -1 && - ipath_cmp24(wqe->ssn, qp->s_lsn + 1) > 0) + ipath_cmp24(wqe->ssn, qp->s_lsn + 1) > 0) { + qp->s_flags |= IPATH_S_WAIT_SSN_CREDIT; goto bail; + } ohdr->u.rc.reth.vaddr = cpu_to_be64(wqe->wr.wr.rdma.remote_addr); ohdr->u.rc.reth.rkey = @@ -570,7 +592,11 @@ int ipath_make_rc_req(struct ipath_qp *qp) ipath_make_ruc_header(dev, qp, ohdr, bth0 | (qp->s_state << 24), bth2); done: ret = 1; + goto unlock; + bail: + qp->s_flags &= ~IPATH_S_BUSY; +unlock: spin_unlock_irqrestore(&qp->s_lock, flags); return ret; } @@ -606,7 +632,11 @@ static void send_rc_ack(struct ipath_qp *qp) spin_unlock_irqrestore(&qp->s_lock, flags); + /* Don't try to send ACKs if the link isn't ACTIVE */ dd = dev->dd; + if (!(dd->ipath_flags & IPATH_LINKACTIVE)) + goto done; + piobuf = ipath_getpiobuf(dd, 0, NULL); if (!piobuf) { /* @@ -668,15 +698,16 @@ static void send_rc_ack(struct ipath_qp *qp) goto done; queue_ack: - dev->n_rc_qacks++; - qp->s_flags |= IPATH_S_ACK_PENDING; - qp->s_nak_state = qp->r_nak_state; - qp->s_ack_psn = qp->r_ack_psn; + if (ib_ipath_state_ops[qp->state] & IPATH_PROCESS_RECV_OK) { + dev->n_rc_qacks++; + qp->s_flags |= IPATH_S_ACK_PENDING; + qp->s_nak_state = qp->r_nak_state; + qp->s_ack_psn = qp->r_ack_psn; + + /* Schedule the send tasklet. */ + ipath_schedule_send(qp); + } spin_unlock_irqrestore(&qp->s_lock, flags); - - /* Call ipath_do_rc_send() in another thread. */ - tasklet_hi_schedule(&qp->s_task); - done: return; } @@ -735,7 +766,7 @@ static void reset_psn(struct ipath_qp *qp, u32 psn) /* * Set the state to restart in the middle of a request. * Don't change the s_sge, s_cur_sge, or s_cur_size. - * See ipath_do_rc_send(). + * See ipath_make_rc_req(). */ switch (opcode) { case IB_WR_SEND: @@ -801,7 +832,7 @@ void ipath_restart_rc(struct ipath_qp *qp, u32 psn) dev->n_rc_resends += (qp->s_psn - psn) & IPATH_PSN_MASK; reset_psn(qp, psn); - tasklet_hi_schedule(&qp->s_task); + ipath_schedule_send(qp); bail: return; @@ -809,13 +840,7 @@ bail: static inline void update_last_psn(struct ipath_qp *qp, u32 psn) { - if (qp->s_last_psn != psn) { - qp->s_last_psn = psn; - if (qp->s_wait_credit) { - qp->s_wait_credit = 0; - tasklet_hi_schedule(&qp->s_task); - } - } + qp->s_last_psn = psn; } /** @@ -915,14 +940,10 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, wqe->wr.opcode == IB_WR_ATOMIC_FETCH_AND_ADD)) { qp->s_num_rd_atomic--; /* Restart sending task if fence is complete */ - if ((qp->s_flags & IPATH_S_FENCE_PENDING) && - !qp->s_num_rd_atomic) { - qp->s_flags &= ~IPATH_S_FENCE_PENDING; - tasklet_hi_schedule(&qp->s_task); - } else if (qp->s_flags & IPATH_S_RDMAR_PENDING) { - qp->s_flags &= ~IPATH_S_RDMAR_PENDING; - tasklet_hi_schedule(&qp->s_task); - } + if (((qp->s_flags & IPATH_S_FENCE_PENDING) && + !qp->s_num_rd_atomic) || + qp->s_flags & IPATH_S_RDMAR_PENDING) + ipath_schedule_send(qp); } /* Post a send completion queue entry if requested. 
*/ if (!(qp->s_flags & IPATH_S_SIGNAL_REQ_WR) || @@ -956,6 +977,8 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, } else { if (++qp->s_last >= qp->s_size) qp->s_last = 0; + if (qp->state == IB_QPS_SQD && qp->s_last == qp->s_cur) + qp->s_draining = 0; if (qp->s_last == qp->s_tail) break; wqe = get_swqe_ptr(qp, qp->s_last); @@ -979,7 +1002,7 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, */ if (ipath_cmp24(qp->s_psn, psn) <= 0) { reset_psn(qp, psn + 1); - tasklet_hi_schedule(&qp->s_task); + ipath_schedule_send(qp); } } else if (ipath_cmp24(qp->s_psn, psn) <= 0) { qp->s_state = OP(SEND_LAST); @@ -1018,6 +1041,7 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, ib_ipath_rnr_table[(aeth >> IPATH_AETH_CREDIT_SHIFT) & IPATH_AETH_CREDIT_MASK]; ipath_insert_rnr_queue(qp); + ipath_schedule_send(qp); goto bail; case 3: /* NAK */ @@ -1108,6 +1132,10 @@ static inline void ipath_rc_rcv_resp(struct ipath_ibdev *dev, spin_lock_irqsave(&qp->s_lock, flags); + /* Double check we can process this now that we hold the s_lock. */ + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) + goto ack_done; + /* Ignore invalid responses. */ if (ipath_cmp24(psn, qp->s_next_psn) >= 0) goto ack_done; @@ -1343,7 +1371,12 @@ static inline int ipath_rc_rcv_error(struct ipath_ibdev *dev, psn &= IPATH_PSN_MASK; e = NULL; old_req = 1; + spin_lock_irqsave(&qp->s_lock, flags); + /* Double check we can process this now that we hold the s_lock. */ + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) + goto unlock_done; + for (i = qp->r_head_ack_queue; ; i = prev) { if (i == qp->s_tail_ack_queue) old_req = 0; @@ -1471,7 +1504,7 @@ static inline int ipath_rc_rcv_error(struct ipath_ibdev *dev, break; } qp->r_nak_state = 0; - tasklet_hi_schedule(&qp->s_task); + ipath_schedule_send(qp); unlock_done: spin_unlock_irqrestore(&qp->s_lock, flags); @@ -1503,18 +1536,15 @@ void ipath_rc_error(struct ipath_qp *qp, enum ib_wc_status err) static inline void ipath_update_ack_queue(struct ipath_qp *qp, unsigned n) { - unsigned long flags; unsigned next; next = n + 1; if (next > IPATH_MAX_RDMA_ATOMIC) next = 0; - spin_lock_irqsave(&qp->s_lock, flags); if (n == qp->s_tail_ack_queue) { qp->s_tail_ack_queue = next; qp->s_ack_state = OP(ACKNOWLEDGE); } - spin_unlock_irqrestore(&qp->s_lock, flags); } /** @@ -1543,6 +1573,7 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, int diff; struct ib_reth *reth; int header_in_data; + unsigned long flags; /* Validate the SLID. See Ch. 9.6.1.5 */ if (unlikely(be16_to_cpu(hdr->lrh[3]) != qp->remote_ah_attr.dlid)) @@ -1690,9 +1721,8 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, goto nack_inv; ipath_copy_sge(&qp->r_sge, data, tlen); qp->r_msn++; - if (!qp->r_wrid_valid) + if (!test_and_clear_bit(IPATH_R_WRID_VALID, &qp->r_aflags)) break; - qp->r_wrid_valid = 0; wc.wr_id = qp->r_wr_id; wc.status = IB_WC_SUCCESS; if (opcode == OP(RDMA_WRITE_LAST_WITH_IMMEDIATE) || @@ -1764,9 +1794,13 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, next = qp->r_head_ack_queue + 1; if (next > IPATH_MAX_RDMA_ATOMIC) next = 0; + spin_lock_irqsave(&qp->s_lock, flags); + /* Double check we can process this while holding the s_lock. 
*/ + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) + goto unlock; if (unlikely(next == qp->s_tail_ack_queue)) { if (!qp->s_ack_queue[next].sent) - goto nack_inv; + goto nack_inv_unlck; ipath_update_ack_queue(qp, next); } e = &qp->s_ack_queue[qp->r_head_ack_queue]; @@ -1787,7 +1821,7 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, ok = ipath_rkey_ok(qp, &e->rdma_sge, len, vaddr, rkey, IB_ACCESS_REMOTE_READ); if (unlikely(!ok)) - goto nack_acc; + goto nack_acc_unlck; /* * Update the next expected PSN. We add 1 later * below, so only add the remainder here. @@ -1814,13 +1848,12 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, qp->r_psn++; qp->r_state = opcode; qp->r_nak_state = 0; - barrier(); qp->r_head_ack_queue = next; - /* Call ipath_do_rc_send() in another thread. */ - tasklet_hi_schedule(&qp->s_task); + /* Schedule the send tasklet. */ + ipath_schedule_send(qp); - goto done; + goto unlock; } case OP(COMPARE_SWAP): @@ -1839,9 +1872,13 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, next = qp->r_head_ack_queue + 1; if (next > IPATH_MAX_RDMA_ATOMIC) next = 0; + spin_lock_irqsave(&qp->s_lock, flags); + /* Double check we can process this while holding the s_lock. */ + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) + goto unlock; if (unlikely(next == qp->s_tail_ack_queue)) { if (!qp->s_ack_queue[next].sent) - goto nack_inv; + goto nack_inv_unlck; ipath_update_ack_queue(qp, next); } if (!header_in_data) @@ -1851,13 +1888,13 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, vaddr = ((u64) be32_to_cpu(ateth->vaddr[0]) << 32) | be32_to_cpu(ateth->vaddr[1]); if (unlikely(vaddr & (sizeof(u64) - 1))) - goto nack_inv; + goto nack_inv_unlck; rkey = be32_to_cpu(ateth->rkey); /* Check rkey & NAK */ if (unlikely(!ipath_rkey_ok(qp, &qp->r_sge, sizeof(u64), vaddr, rkey, IB_ACCESS_REMOTE_ATOMIC))) - goto nack_acc; + goto nack_acc_unlck; /* Perform atomic OP and save result. */ maddr = (atomic64_t *) qp->r_sge.sge.vaddr; sdata = be64_to_cpu(ateth->swap_data); @@ -1874,13 +1911,12 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, qp->r_psn++; qp->r_state = opcode; qp->r_nak_state = 0; - barrier(); qp->r_head_ack_queue = next; - /* Call ipath_do_rc_send() in another thread. */ - tasklet_hi_schedule(&qp->s_task); + /* Schedule the send tasklet. */ + ipath_schedule_send(qp); - goto done; + goto unlock; } default: @@ -1901,19 +1937,26 @@ rnr_nak: qp->r_ack_psn = qp->r_psn; goto send_ack; +nack_inv_unlck: + spin_unlock_irqrestore(&qp->s_lock, flags); nack_inv: ipath_rc_error(qp, IB_WC_LOC_QP_OP_ERR); qp->r_nak_state = IB_NAK_INVALID_REQUEST; qp->r_ack_psn = qp->r_psn; goto send_ack; +nack_acc_unlck: + spin_unlock_irqrestore(&qp->s_lock, flags); nack_acc: ipath_rc_error(qp, IB_WC_LOC_PROT_ERR); qp->r_nak_state = IB_NAK_REMOTE_ACCESS_ERROR; qp->r_ack_psn = qp->r_psn; send_ack: send_rc_ack(qp); + goto done; +unlock: + spin_unlock_irqrestore(&qp->s_lock, flags); done: return; } diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c index c716a03..a4b5521 100644 --- a/drivers/infiniband/hw/ipath/ipath_ruc.c +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c @@ -78,6 +78,7 @@ const u32 ib_ipath_rnr_table[32] = { * ipath_insert_rnr_queue - put QP on the RNR timeout list for the device * @qp: the QP * + * Called with the QP s_lock held and interrupts disabled. * XXX Use a simple list for now. 
We might need a priority * queue if we have lots of QPs waiting for RNR timeouts * but that should be rare. @@ -85,9 +86,9 @@ const u32 ib_ipath_rnr_table[32] = { void ipath_insert_rnr_queue(struct ipath_qp *qp) { struct ipath_ibdev *dev = to_idev(qp->ibqp.device); - unsigned long flags; - spin_lock_irqsave(&dev->pending_lock, flags); + /* We already did a spin_lock_irqsave(), so just use spin_lock */ + spin_lock(&dev->pending_lock); if (list_empty(&dev->rnrwait)) list_add(&qp->timerwait, &dev->rnrwait); else { @@ -109,7 +110,7 @@ void ipath_insert_rnr_queue(struct ipath_qp *qp) nqp->s_rnr_timeout -= qp->s_rnr_timeout; list_add(&qp->timerwait, l); } - spin_unlock_irqrestore(&dev->pending_lock, flags); + spin_unlock(&dev->pending_lock); } /** @@ -185,6 +186,11 @@ int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only) } spin_lock_irqsave(&rq->lock, flags); + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) { + ret = 0; + goto unlock; + } + wq = rq->wq; tail = wq->tail; /* Validate tail before using it since it is user writable. */ @@ -192,9 +198,8 @@ int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only) tail = 0; do { if (unlikely(tail == wq->head)) { - spin_unlock_irqrestore(&rq->lock, flags); ret = 0; - goto bail; + goto unlock; } /* Make sure entry is read after head index is read. */ smp_rmb(); @@ -207,7 +212,7 @@ int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only) wq->tail = tail; ret = 1; - qp->r_wrid_valid = 1; + set_bit(IPATH_R_WRID_VALID, &qp->r_aflags); if (handler) { u32 n; @@ -234,8 +239,8 @@ int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only) goto bail; } } +unlock: spin_unlock_irqrestore(&rq->lock, flags); - bail: return ret; } @@ -263,35 +268,59 @@ static void ipath_ruc_loopback(struct ipath_qp *sqp) atomic64_t *maddr; enum ib_wc_status send_status; + /* + * Note that we check the responder QP state after + * checking the requester's state. + */ qp = ipath_lookup_qpn(&dev->qp_table, sqp->remote_qpn); - if (!qp) { - dev->n_pkt_drops++; - return; - } -again: spin_lock_irqsave(&sqp->s_lock, flags); - if (!(ib_ipath_state_ops[sqp->state] & IPATH_PROCESS_SEND_OK) || - sqp->s_rnr_timeout) { - spin_unlock_irqrestore(&sqp->s_lock, flags); - goto done; - } + /* Return if we are already busy processing a work request. */ + if ((sqp->s_flags & (IPATH_S_BUSY | IPATH_S_ANY_WAIT)) || + !(ib_ipath_state_ops[sqp->state] & IPATH_PROCESS_OR_FLUSH_SEND)) + goto unlock; - /* Get the next send request. */ - if (sqp->s_last == sqp->s_head) { - /* Send work queue is empty. */ - spin_unlock_irqrestore(&sqp->s_lock, flags); - goto done; + sqp->s_flags |= IPATH_S_BUSY; + +again: + if (sqp->s_last == sqp->s_head) + goto clr_busy; + wqe = get_swqe_ptr(sqp, sqp->s_last); + + /* Return if it is not OK to start a new work reqeust. */ + if (!(ib_ipath_state_ops[sqp->state] & IPATH_PROCESS_NEXT_SEND_OK)) { + if (!(ib_ipath_state_ops[sqp->state] & IPATH_FLUSH_SEND)) + goto clr_busy; + /* We are in the error state, flush the work request. */ + send_status = IB_WC_WR_FLUSH_ERR; + goto flush_send; } /* * We can rely on the entry not changing without the s_lock * being held until we update s_last. + * We increment s_cur to indicate s_last is in progress. 
*/ - wqe = get_swqe_ptr(sqp, sqp->s_last); + if (sqp->s_last == sqp->s_cur) { + if (++sqp->s_cur >= sqp->s_size) + sqp->s_cur = 0; + } spin_unlock_irqrestore(&sqp->s_lock, flags); + if (!qp || !(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) { + dev->n_pkt_drops++; + /* + * For RC, the requester would timeout and retry so + * shortcut the timeouts and just signal too many retries. + */ + if (sqp->ibqp.qp_type == IB_QPT_RC) + send_status = IB_WC_RETRY_EXC_ERR; + else + send_status = IB_WC_SUCCESS; + goto serr; + } + memset(&wc, 0, sizeof wc); send_status = IB_WC_SUCCESS; @@ -396,8 +425,7 @@ again: sqp->s_len -= len; } - if (wqe->wr.opcode == IB_WR_RDMA_WRITE || - wqe->wr.opcode == IB_WR_RDMA_READ) + if (!test_and_clear_bit(IPATH_R_WRID_VALID, &qp->r_aflags)) goto send_comp; if (wqe->wr.opcode == IB_WR_RDMA_WRITE_WITH_IMM) @@ -417,6 +445,8 @@ again: wqe->wr.send_flags & IB_SEND_SOLICITED); send_comp: + spin_lock_irqsave(&sqp->s_lock, flags); +flush_send: sqp->s_rnr_retry = sqp->s_rnr_retry_cnt; ipath_send_complete(sqp, wqe, send_status); goto again; @@ -437,11 +467,12 @@ rnr_nak: sqp->s_rnr_retry--; spin_lock_irqsave(&sqp->s_lock, flags); if (!(ib_ipath_state_ops[sqp->state] & IPATH_PROCESS_RECV_OK)) - goto unlock; + goto clr_busy; + sqp->s_flags |= IPATH_S_WAITING; dev->n_rnr_naks++; sqp->s_rnr_timeout = ib_ipath_rnr_table[qp->r_min_rnr_timer]; ipath_insert_rnr_queue(sqp); - goto unlock; + goto clr_busy; inv_err: send_status = IB_WC_REM_INV_REQ_ERR; @@ -473,17 +504,19 @@ serr: } goto done; } +clr_busy: + sqp->s_flags &= ~IPATH_S_BUSY; unlock: spin_unlock_irqrestore(&sqp->s_lock, flags); done: - if (atomic_dec_and_test(&qp->refcount)) + if (qp && atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); } static void want_buffer(struct ipath_devdata *dd, struct ipath_qp *qp) { if (!(dd->ipath_flags & IPATH_HAS_SEND_DMA) || - qp->ibqp.qp_type == IB_QPT_SMI) { + qp->ibqp.qp_type == IB_QPT_SMI) { unsigned long flags; spin_lock_irqsave(&dd->ipath_sendctrl_lock, flags); @@ -501,26 +534,36 @@ static void want_buffer(struct ipath_devdata *dd, struct ipath_qp *qp) * @dev: the device we ran out of buffers on * * Called when we run out of PIO buffers. + * If we are now in the error state, return zero to flush the + * send work request. */ -static void ipath_no_bufs_available(struct ipath_qp *qp, +static int ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev) { unsigned long flags; + int ret = 1; /* * Note that as soon as want_buffer() is called and * possibly before it returns, ipath_ib_piobufavail() - * could be called. If we are still in the tasklet function, - * tasklet_hi_schedule() will not call us until the next time - * tasklet_hi_schedule() is called. - * We leave the busy flag set so that another post send doesn't - * try to put the same QP on the piowait list again. + * could be called. Therefore, put QP on the piowait list before + * enabling the PIO avail interrupt. 
*/ - spin_lock_irqsave(&dev->pending_lock, flags); - list_add_tail(&qp->piowait, &dev->piowait); - spin_unlock_irqrestore(&dev->pending_lock, flags); - want_buffer(dev->dd, qp); - dev->n_piowait++; + spin_lock_irqsave(&qp->s_lock, flags); + if (ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK) { + dev->n_piowait++; + qp->s_flags |= IPATH_S_WAITING; + qp->s_flags &= ~IPATH_S_BUSY; + spin_lock(&dev->pending_lock); + if (list_empty(&qp->piowait)) + list_add_tail(&qp->piowait, &dev->piowait); + spin_unlock(&dev->pending_lock); + } else + ret = 0; + spin_unlock_irqrestore(&qp->s_lock, flags); + if (ret) + want_buffer(dev->dd, qp); + return ret; } /** @@ -596,15 +639,13 @@ void ipath_do_send(unsigned long data) struct ipath_qp *qp = (struct ipath_qp *)data; struct ipath_ibdev *dev = to_idev(qp->ibqp.device); int (*make_req)(struct ipath_qp *qp); - - if (test_and_set_bit(IPATH_S_BUSY, &qp->s_busy)) - goto bail; + unsigned long flags; if ((qp->ibqp.qp_type == IB_QPT_RC || qp->ibqp.qp_type == IB_QPT_UC) && qp->remote_ah_attr.dlid == dev->dd->ipath_lid) { ipath_ruc_loopback(qp); - goto clear; + goto bail; } if (qp->ibqp.qp_type == IB_QPT_RC) @@ -614,6 +655,19 @@ void ipath_do_send(unsigned long data) else make_req = ipath_make_ud_req; + spin_lock_irqsave(&qp->s_lock, flags); + + /* Return if we are already busy processing a work request. */ + if ((qp->s_flags & (IPATH_S_BUSY | IPATH_S_ANY_WAIT)) || + !(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_OR_FLUSH_SEND)) { + spin_unlock_irqrestore(&qp->s_lock, flags); + goto bail; + } + + qp->s_flags |= IPATH_S_BUSY; + + spin_unlock_irqrestore(&qp->s_lock, flags); + again: /* Check for a constructed packet to be sent. */ if (qp->s_hdrwords != 0) { @@ -623,8 +677,8 @@ again: */ if (ipath_verbs_send(qp, &qp->s_hdr, qp->s_hdrwords, qp->s_cur_sge, qp->s_cur_size)) { - ipath_no_bufs_available(qp, dev); - goto bail; + if (ipath_no_bufs_available(qp, dev)) + goto bail; } dev->n_unicast_xmit++; /* Record that we sent the packet and s_hdr is empty. */ @@ -633,16 +687,20 @@ again: if (make_req(qp)) goto again; -clear: - clear_bit(IPATH_S_BUSY, &qp->s_busy); + bail:; } +/* + * This should be called with s_lock held. + */ void ipath_send_complete(struct ipath_qp *qp, struct ipath_swqe *wqe, enum ib_wc_status status) { - unsigned long flags; - u32 last; + u32 old_last, last; + + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_OR_FLUSH_SEND)) + return; /* See ch. 11.2.4.1 and 10.7.3.1 */ if (!(qp->s_flags & IPATH_S_SIGNAL_REQ_WR) || @@ -661,10 +719,14 @@ void ipath_send_complete(struct ipath_qp *qp, struct ipath_swqe *wqe, status != IB_WC_SUCCESS); } - spin_lock_irqsave(&qp->s_lock, flags); - last = qp->s_last; + old_last = last = qp->s_last; if (++last >= qp->s_size) last = 0; qp->s_last = last; - spin_unlock_irqrestore(&qp->s_lock, flags); + if (qp->s_cur == old_last) + qp->s_cur = last; + if (qp->s_tail == old_last) + qp->s_tail = last; + if (qp->state == IB_QPS_SQD && last == qp->s_cur) + qp->s_draining = 0; } diff --git a/drivers/infiniband/hw/ipath/ipath_uc.c b/drivers/infiniband/hw/ipath/ipath_uc.c index bfe8926..7fd18e8 100644 --- a/drivers/infiniband/hw/ipath/ipath_uc.c +++ b/drivers/infiniband/hw/ipath/ipath_uc.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2006, 2007 QLogic Corporation. All rights reserved. + * Copyright (c) 2006, 2007, 2008 QLogic Corporation. All rights reserved. * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved. 
* * This software is available to you under a choice of one of two @@ -47,14 +47,30 @@ int ipath_make_uc_req(struct ipath_qp *qp) { struct ipath_other_headers *ohdr; struct ipath_swqe *wqe; + unsigned long flags; u32 hwords; u32 bth0; u32 len; u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); int ret = 0; - if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK)) + spin_lock_irqsave(&qp->s_lock, flags); + + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK)) { + if (!(ib_ipath_state_ops[qp->state] & IPATH_FLUSH_SEND)) + goto bail; + /* We are in the error state, flush the work request. */ + if (qp->s_last == qp->s_head) + goto bail; + /* If DMAs are in progress, we can't flush immediately. */ + if (atomic_read(&qp->s_dma_busy)) { + qp->s_flags |= IPATH_S_WAIT_DMA; + goto bail; + } + wqe = get_swqe_ptr(qp, qp->s_last); + ipath_send_complete(qp, wqe, IB_WC_WR_FLUSH_ERR); goto done; + } ohdr = &qp->s_hdr.u.oth; if (qp->remote_ah_attr.ah_flags & IB_AH_GRH) @@ -69,9 +85,12 @@ int ipath_make_uc_req(struct ipath_qp *qp) qp->s_wqe = NULL; switch (qp->s_state) { default: + if (!(ib_ipath_state_ops[qp->state] & + IPATH_PROCESS_NEXT_SEND_OK)) + goto bail; /* Check if send work queue is empty. */ if (qp->s_cur == qp->s_head) - goto done; + goto bail; /* * Start a new request. */ @@ -134,7 +153,7 @@ int ipath_make_uc_req(struct ipath_qp *qp) break; default: - goto done; + goto bail; } break; @@ -194,9 +213,14 @@ int ipath_make_uc_req(struct ipath_qp *qp) ipath_make_ruc_header(to_idev(qp->ibqp.device), qp, ohdr, bth0 | (qp->s_state << 24), qp->s_next_psn++ & IPATH_PSN_MASK); +done: ret = 1; + goto unlock; -done: +bail: + qp->s_flags &= ~IPATH_S_BUSY; +unlock: + spin_unlock_irqrestore(&qp->s_lock, flags); return ret; } @@ -258,8 +282,7 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, */ opcode = be32_to_cpu(ohdr->bth[0]) >> 24; - wc.imm_data = 0; - wc.wc_flags = 0; + memset(&wc, 0, sizeof wc); /* Compare the PSN verses the expected PSN. */ if (unlikely(ipath_cmp24(psn, qp->r_psn) != 0)) { @@ -322,8 +345,8 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, case OP(SEND_ONLY): case OP(SEND_ONLY_WITH_IMMEDIATE): send_first: - if (qp->r_reuse_sge) { - qp->r_reuse_sge = 0; + if (qp->r_flags & IPATH_R_REUSE_SGE) { + qp->r_flags &= ~IPATH_R_REUSE_SGE; qp->r_sge = qp->s_rdma_read_sge; } else if (!ipath_get_rwqe(qp, 0)) { dev->n_pkt_drops++; @@ -340,13 +363,13 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, case OP(SEND_MIDDLE): /* Check for invalid length PMTU or posted rwqe len. */ if (unlikely(tlen != (hdrsize + pmtu + 4))) { - qp->r_reuse_sge = 1; + qp->r_flags |= IPATH_R_REUSE_SGE; dev->n_pkt_drops++; goto done; } qp->r_rcv_len += pmtu; if (unlikely(qp->r_rcv_len > qp->r_len)) { - qp->r_reuse_sge = 1; + qp->r_flags |= IPATH_R_REUSE_SGE; dev->n_pkt_drops++; goto done; } @@ -372,7 +395,7 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, /* Check for invalid length. 
*/ /* XXX LAST len should be >= 1 */ if (unlikely(tlen < (hdrsize + pad + 4))) { - qp->r_reuse_sge = 1; + qp->r_flags |= IPATH_R_REUSE_SGE; dev->n_pkt_drops++; goto done; } @@ -380,7 +403,7 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, tlen -= (hdrsize + pad + 4); wc.byte_len = tlen + qp->r_rcv_len; if (unlikely(wc.byte_len > qp->r_len)) { - qp->r_reuse_sge = 1; + qp->r_flags |= IPATH_R_REUSE_SGE; dev->n_pkt_drops++; goto done; } @@ -390,14 +413,10 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, wc.wr_id = qp->r_wr_id; wc.status = IB_WC_SUCCESS; wc.opcode = IB_WC_RECV; - wc.vendor_err = 0; wc.qp = &qp->ibqp; wc.src_qp = qp->remote_qpn; - wc.pkey_index = 0; wc.slid = qp->remote_ah_attr.dlid; wc.sl = qp->remote_ah_attr.sl; - wc.dlid_path_bits = 0; - wc.port_num = 0; /* Signal completion event if the solicited bit is set. */ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, (ohdr->bth[0] & @@ -488,8 +507,8 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, dev->n_pkt_drops++; goto done; } - if (qp->r_reuse_sge) - qp->r_reuse_sge = 0; + if (qp->r_flags & IPATH_R_REUSE_SGE) + qp->r_flags &= ~IPATH_R_REUSE_SGE; else if (!ipath_get_rwqe(qp, 1)) { dev->n_pkt_drops++; goto done; diff --git a/drivers/infiniband/hw/ipath/ipath_ud.c b/drivers/infiniband/hw/ipath/ipath_ud.c index 8b6a261..77ca8ca 100644 --- a/drivers/infiniband/hw/ipath/ipath_ud.c +++ b/drivers/infiniband/hw/ipath/ipath_ud.c @@ -65,9 +65,9 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_swqe *swqe) u32 length; qp = ipath_lookup_qpn(&dev->qp_table, swqe->wr.wr.ud.remote_qpn); - if (!qp) { + if (!qp || !(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) { dev->n_pkt_drops++; - goto send_comp; + goto done; } rsge.sg_list = NULL; @@ -91,14 +91,12 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_swqe *swqe) * present on the wire. */ length = swqe->length; + memset(&wc, 0, sizeof wc); wc.byte_len = length + sizeof(struct ib_grh); if (swqe->wr.opcode == IB_WR_SEND_WITH_IMM) { wc.wc_flags = IB_WC_WITH_IMM; wc.imm_data = swqe->wr.ex.imm_data; - } else { - wc.wc_flags = 0; - wc.imm_data = 0; } /* @@ -229,7 +227,6 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_swqe *swqe) } wc.status = IB_WC_SUCCESS; wc.opcode = IB_WC_RECV; - wc.vendor_err = 0; wc.qp = &qp->ibqp; wc.src_qp = sqp->ibqp.qp_num; /* XXX do we know which pkey matched? Only needed for GSI. */ @@ -248,8 +245,7 @@ drop: kfree(rsge.sg_list); if (atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); -send_comp: - ipath_send_complete(sqp, swqe, IB_WC_SUCCESS); +done:; } /** @@ -264,6 +260,7 @@ int ipath_make_ud_req(struct ipath_qp *qp) struct ipath_other_headers *ohdr; struct ib_ah_attr *ah_attr; struct ipath_swqe *wqe; + unsigned long flags; u32 nwords; u32 extra_bytes; u32 bth0; @@ -271,13 +268,30 @@ int ipath_make_ud_req(struct ipath_qp *qp) u16 lid; int ret = 0; - if (unlikely(!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK))) - goto bail; + spin_lock_irqsave(&qp->s_lock, flags); + + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_NEXT_SEND_OK)) { + if (!(ib_ipath_state_ops[qp->state] & IPATH_FLUSH_SEND)) + goto bail; + /* We are in the error state, flush the work request. */ + if (qp->s_last == qp->s_head) + goto bail; + /* If DMAs are in progress, we can't flush immediately. 
*/ + if (atomic_read(&qp->s_dma_busy)) { + qp->s_flags |= IPATH_S_WAIT_DMA; + goto bail; + } + wqe = get_swqe_ptr(qp, qp->s_last); + ipath_send_complete(qp, wqe, IB_WC_WR_FLUSH_ERR); + goto done; + } if (qp->s_cur == qp->s_head) goto bail; wqe = get_swqe_ptr(qp, qp->s_cur); + if (++qp->s_cur >= qp->s_size) + qp->s_cur = 0; /* Construct the header. */ ah_attr = &to_iah(wqe->wr.wr.ud.ah)->attr; @@ -288,10 +302,23 @@ int ipath_make_ud_req(struct ipath_qp *qp) dev->n_unicast_xmit++; } else { dev->n_unicast_xmit++; - lid = ah_attr->dlid & - ~((1 << dev->dd->ipath_lmc) - 1); + lid = ah_attr->dlid & ~((1 << dev->dd->ipath_lmc) - 1); if (unlikely(lid == dev->dd->ipath_lid)) { + /* + * If DMAs are in progress, we can't generate + * a completion for the loopback packet since + * it would be out of order. + * XXX Instead of waiting, we could queue a + * zero length descriptor so we get a callback. + */ + if (atomic_read(&qp->s_dma_busy)) { + qp->s_flags |= IPATH_S_WAIT_DMA; + goto bail; + } + spin_unlock_irqrestore(&qp->s_lock, flags); ipath_ud_loopback(qp, wqe); + spin_lock_irqsave(&qp->s_lock, flags); + ipath_send_complete(qp, wqe, IB_WC_SUCCESS); goto done; } } @@ -368,11 +395,13 @@ int ipath_make_ud_req(struct ipath_qp *qp) ohdr->u.ud.deth[1] = cpu_to_be32(qp->ibqp.qp_num); done: - if (++qp->s_cur >= qp->s_size) - qp->s_cur = 0; ret = 1; + goto unlock; bail: + qp->s_flags &= ~IPATH_S_BUSY; +unlock: + spin_unlock_irqrestore(&qp->s_lock, flags); return ret; } @@ -506,8 +535,8 @@ void ipath_ud_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, /* * Get the next work request entry to find where to put the data. */ - if (qp->r_reuse_sge) - qp->r_reuse_sge = 0; + if (qp->r_flags & IPATH_R_REUSE_SGE) + qp->r_flags &= ~IPATH_R_REUSE_SGE; else if (!ipath_get_rwqe(qp, 0)) { /* * Count VL15 packets dropped due to no receive buffer. @@ -523,7 +552,7 @@ void ipath_ud_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, } /* Silently drop packets which are too big. 
*/ if (wc.byte_len > qp->r_len) { - qp->r_reuse_sge = 1; + qp->r_flags |= IPATH_R_REUSE_SGE; dev->n_pkt_drops++; goto bail; } @@ -535,7 +564,8 @@ void ipath_ud_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, ipath_skip_sge(&qp->r_sge, sizeof(struct ib_grh)); ipath_copy_sge(&qp->r_sge, data, wc.byte_len - sizeof(struct ib_grh)); - qp->r_wrid_valid = 0; + if (!test_and_clear_bit(IPATH_R_WRID_VALID, &qp->r_aflags)) + goto bail; wc.wr_id = qp->r_wr_id; wc.status = IB_WC_SUCCESS; wc.opcode = IB_WC_RECV; diff --git a/drivers/infiniband/hw/ipath/ipath_user_sdma.h b/drivers/infiniband/hw/ipath/ipath_user_sdma.h index e70946c..fc76316 100644 --- a/drivers/infiniband/hw/ipath/ipath_user_sdma.h +++ b/drivers/infiniband/hw/ipath/ipath_user_sdma.h @@ -45,8 +45,6 @@ int ipath_user_sdma_writev(struct ipath_devdata *dd, int ipath_user_sdma_make_progress(struct ipath_devdata *dd, struct ipath_user_sdma_queue *pq); -int ipath_user_sdma_pkt_sent(const struct ipath_user_sdma_queue *pq, - u32 counter); void ipath_user_sdma_queue_drain(struct ipath_devdata *dd, struct ipath_user_sdma_queue *pq); diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 22bb42d..e0ec540 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -111,16 +111,24 @@ static unsigned int ib_ipath_disable_sma; module_param_named(disable_sma, ib_ipath_disable_sma, uint, S_IWUSR | S_IRUGO); MODULE_PARM_DESC(disable_sma, "Disable the SMA"); +/* + * Note that it is OK to post send work requests in the SQE and ERR + * states; ipath_do_send() will process them and generate error + * completions as per IB 1.2 C10-96. + */ const int ib_ipath_state_ops[IB_QPS_ERR + 1] = { [IB_QPS_RESET] = 0, [IB_QPS_INIT] = IPATH_POST_RECV_OK, [IB_QPS_RTR] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK, [IB_QPS_RTS] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK | - IPATH_POST_SEND_OK | IPATH_PROCESS_SEND_OK, + IPATH_POST_SEND_OK | IPATH_PROCESS_SEND_OK | + IPATH_PROCESS_NEXT_SEND_OK, [IB_QPS_SQD] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK | - IPATH_POST_SEND_OK, - [IB_QPS_SQE] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK, - [IB_QPS_ERR] = 0, + IPATH_POST_SEND_OK | IPATH_PROCESS_SEND_OK, + [IB_QPS_SQE] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK | + IPATH_POST_SEND_OK | IPATH_FLUSH_SEND, + [IB_QPS_ERR] = IPATH_POST_RECV_OK | IPATH_FLUSH_RECV | + IPATH_POST_SEND_OK | IPATH_FLUSH_SEND, }; struct ipath_ucontext { @@ -230,18 +238,6 @@ void ipath_skip_sge(struct ipath_sge_state *ss, u32 length) } } -static void ipath_flush_wqe(struct ipath_qp *qp, struct ib_send_wr *wr) -{ - struct ib_wc wc; - - memset(&wc, 0, sizeof(wc)); - wc.wr_id = wr->wr_id; - wc.status = IB_WC_WR_FLUSH_ERR; - wc.opcode = ib_ipath_wc_opcode[wr->opcode]; - wc.qp = &qp->ibqp; - ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 1); -} - /* * Count the number of DMA descriptors needed to send length bytes of data. * Don't modify the ipath_sge_state to get the count. @@ -347,14 +343,8 @@ static int ipath_post_one_send(struct ipath_qp *qp, struct ib_send_wr *wr) spin_lock_irqsave(&qp->s_lock, flags); /* Check that state is OK to post send. */ - if (unlikely(!(ib_ipath_state_ops[qp->state] & IPATH_POST_SEND_OK))) { - if (qp->state != IB_QPS_SQE && qp->state != IB_QPS_ERR) - goto bail_inval; - /* C10-96 says generate a flushed completion entry. 
*/ - ipath_flush_wqe(qp, wr); - ret = 0; - goto bail; - } + if (unlikely(!(ib_ipath_state_ops[qp->state] & IPATH_POST_SEND_OK))) + goto bail_inval; /* IB spec says that num_sge == 0 is OK. */ if (wr->num_sge > qp->s_max_sge) @@ -677,6 +667,7 @@ bail:; static void ipath_ib_timer(struct ipath_ibdev *dev) { struct ipath_qp *resend = NULL; + struct ipath_qp *rnr = NULL; struct list_head *last; struct ipath_qp *qp; unsigned long flags; @@ -703,7 +694,9 @@ static void ipath_ib_timer(struct ipath_ibdev *dev) if (--qp->s_rnr_timeout == 0) { do { list_del_init(&qp->timerwait); - tasklet_hi_schedule(&qp->s_task); + qp->timer_next = rnr; + rnr = qp; + atomic_inc(&qp->refcount); if (list_empty(last)) break; qp = list_entry(last->next, struct ipath_qp, @@ -743,9 +736,13 @@ static void ipath_ib_timer(struct ipath_ibdev *dev) spin_unlock_irqrestore(&dev->pending_lock, flags); /* XXX What if timer fires again while this is running? */ - for (qp = resend; qp != NULL; qp = qp->timer_next) { + while (resend != NULL) { + qp = resend; + resend = qp->timer_next; + spin_lock_irqsave(&qp->s_lock, flags); - if (qp->s_last != qp->s_tail && qp->state == IB_QPS_RTS) { + if (qp->s_last != qp->s_tail && + ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK) { dev->n_timeouts++; ipath_restart_rc(qp, qp->s_last_psn + 1); } @@ -755,6 +752,19 @@ static void ipath_ib_timer(struct ipath_ibdev *dev) if (atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); } + while (rnr != NULL) { + qp = rnr; + rnr = qp->timer_next; + + spin_lock_irqsave(&qp->s_lock, flags); + if (ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK) + ipath_schedule_send(qp); + spin_unlock_irqrestore(&qp->s_lock, flags); + + /* Notify ipath_destroy_qp() if it is waiting. */ + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } } static void update_sge(struct ipath_sge_state *ss, u32 length) @@ -1010,13 +1020,24 @@ static void sdma_complete(void *cookie, int status) struct ipath_verbs_txreq *tx = cookie; struct ipath_qp *qp = tx->qp; struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + unsigned int flags; + enum ib_wc_status ibs = status == IPATH_SDMA_TXREQ_S_OK ? + IB_WC_SUCCESS : IB_WC_WR_FLUSH_ERR; - /* Generate a completion queue entry if needed */ - if (qp->ibqp.qp_type != IB_QPT_RC && tx->wqe) { - enum ib_wc_status ibs = status == IPATH_SDMA_TXREQ_S_OK ? - IB_WC_SUCCESS : IB_WC_WR_FLUSH_ERR; - + if (atomic_dec_and_test(&qp->s_dma_busy)) { + spin_lock_irqsave(&qp->s_lock, flags); + if (tx->wqe) + ipath_send_complete(qp, tx->wqe, ibs); + if ((ib_ipath_state_ops[qp->state] & IPATH_FLUSH_SEND && + qp->s_last != qp->s_head) || + (qp->s_flags & IPATH_S_WAIT_DMA)) + ipath_schedule_send(qp); + spin_unlock_irqrestore(&qp->s_lock, flags); + wake_up(&qp->wait_dma); + } else if (tx->wqe) { + spin_lock_irqsave(&qp->s_lock, flags); ipath_send_complete(qp, tx->wqe, ibs); + spin_unlock_irqrestore(&qp->s_lock, flags); } if (tx->txreq.flags & IPATH_SDMA_TXREQ_F_FREEBUF) @@ -1027,6 +1048,21 @@ static void sdma_complete(void *cookie, int status) wake_up(&qp->wait); } +static void decrement_dma_busy(struct ipath_qp *qp) +{ + unsigned int flags; + + if (atomic_dec_and_test(&qp->s_dma_busy)) { + spin_lock_irqsave(&qp->s_lock, flags); + if ((ib_ipath_state_ops[qp->state] & IPATH_FLUSH_SEND && + qp->s_last != qp->s_head) || + (qp->s_flags & IPATH_S_WAIT_DMA)) + ipath_schedule_send(qp); + spin_unlock_irqrestore(&qp->s_lock, flags); + wake_up(&qp->wait_dma); + } +} + /* * Compute the number of clock cycles of delay before sending the next packet. 
* The multipliers reflect the number of clocks for the fastest rate so @@ -1065,9 +1101,12 @@ static int ipath_verbs_send_dma(struct ipath_qp *qp, if (tx) { qp->s_tx = NULL; /* resend previously constructed packet */ + atomic_inc(&qp->s_dma_busy); ret = ipath_sdma_verbs_send(dd, tx->ss, tx->len, tx); - if (ret) + if (ret) { qp->s_tx = tx; + decrement_dma_busy(qp); + } goto bail; } @@ -1118,12 +1157,14 @@ static int ipath_verbs_send_dma(struct ipath_qp *qp, tx->txreq.sg_count = ndesc; tx->map_len = (hdrwords + 2) << 2; tx->txreq.map_addr = &tx->hdr; + atomic_inc(&qp->s_dma_busy); ret = ipath_sdma_verbs_send(dd, ss, dwords, tx); if (ret) { /* save ss and length in dwords */ tx->ss = ss; tx->len = dwords; qp->s_tx = tx; + decrement_dma_busy(qp); } goto bail; } @@ -1144,6 +1185,7 @@ static int ipath_verbs_send_dma(struct ipath_qp *qp, memcpy(piobuf, hdr, hdrwords << 2); ipath_copy_from_sge(piobuf + hdrwords, ss, len); + atomic_inc(&qp->s_dma_busy); ret = ipath_sdma_verbs_send(dd, NULL, 0, tx); /* * If we couldn't queue the DMA request, save the info @@ -1154,6 +1196,7 @@ static int ipath_verbs_send_dma(struct ipath_qp *qp, tx->ss = NULL; tx->len = 0; qp->s_tx = tx; + decrement_dma_busy(qp); } dev->n_unaligned++; goto bail; @@ -1177,6 +1220,7 @@ static int ipath_verbs_send_pio(struct ipath_qp *qp, unsigned flush_wc; u32 control; int ret; + unsigned int flags; piobuf = ipath_getpiobuf(dd, plen, NULL); if (unlikely(piobuf == NULL)) { @@ -1247,8 +1291,11 @@ static int ipath_verbs_send_pio(struct ipath_qp *qp, } copy_io(piobuf, ss, len, flush_wc); done: - if (qp->s_wqe) + if (qp->s_wqe) { + spin_lock_irqsave(&qp->s_lock, flags); ipath_send_complete(qp, qp->s_wqe, IB_WC_SUCCESS); + spin_unlock_irqrestore(&qp->s_lock, flags); + } ret = 0; bail: return ret; @@ -1281,19 +1328,12 @@ int ipath_verbs_send(struct ipath_qp *qp, struct ipath_ib_header *hdr, * can defer SDMA restart until link goes ACTIVE without * worrying about just how we got there. */ - if (qp->ibqp.qp_type == IB_QPT_SMI) + if (qp->ibqp.qp_type == IB_QPT_SMI || + !(dd->ipath_flags & IPATH_HAS_SEND_DMA)) ret = ipath_verbs_send_pio(qp, hdr, hdrwords, ss, len, plen, dwords); - /* All non-VL15 packets are dropped if link is not ACTIVE */ - else if (!(dd->ipath_flags & IPATH_LINKACTIVE)) { - if (qp->s_wqe) - ipath_send_complete(qp, qp->s_wqe, IB_WC_SUCCESS); - ret = 0; - } else if (dd->ipath_flags & IPATH_HAS_SEND_DMA) - ret = ipath_verbs_send_dma(qp, hdr, hdrwords, ss, len, - plen, dwords); else - ret = ipath_verbs_send_pio(qp, hdr, hdrwords, ss, len, + ret = ipath_verbs_send_dma(qp, hdr, hdrwords, ss, len, plen, dwords); return ret; @@ -1401,27 +1441,46 @@ bail: * This is called from ipath_intr() at interrupt level when a PIO buffer is * available after ipath_verbs_send() returned an error that no buffers were * available. Return 1 if we consumed all the PIO buffers and we still have - * QPs waiting for buffers (for now, just do a tasklet_hi_schedule and + * QPs waiting for buffers (for now, just restart the send tasklet and * return zero). 
*/ int ipath_ib_piobufavail(struct ipath_ibdev *dev) { + struct list_head *list; + struct ipath_qp *qplist; struct ipath_qp *qp; unsigned long flags; if (dev == NULL) goto bail; + list = &dev->piowait; + qplist = NULL; + spin_lock_irqsave(&dev->pending_lock, flags); - while (!list_empty(&dev->piowait)) { - qp = list_entry(dev->piowait.next, struct ipath_qp, - piowait); + while (!list_empty(list)) { + qp = list_entry(list->next, struct ipath_qp, piowait); list_del_init(&qp->piowait); - clear_bit(IPATH_S_BUSY, &qp->s_busy); - tasklet_hi_schedule(&qp->s_task); + qp->pio_next = qplist; + qplist = qp; + atomic_inc(&qp->refcount); } spin_unlock_irqrestore(&dev->pending_lock, flags); + while (qplist != NULL) { + qp = qplist; + qplist = qp->pio_next; + + spin_lock_irqsave(&qp->s_lock, flags); + if (ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK) + ipath_schedule_send(qp); + spin_unlock_irqrestore(&qp->s_lock, flags); + + /* Notify ipath_destroy_qp() if it is waiting. */ + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } + bail: return 0; } @@ -2143,11 +2202,12 @@ bail: void ipath_unregister_ib_device(struct ipath_ibdev *dev) { struct ib_device *ibdev = &dev->ibdev; - - disable_timer(dev->dd); + u32 qps_inuse; ib_unregister_device(ibdev); + disable_timer(dev->dd); + if (!list_empty(&dev->pending[0]) || !list_empty(&dev->pending[1]) || !list_empty(&dev->pending[2])) @@ -2162,7 +2222,10 @@ void ipath_unregister_ib_device(struct ipath_ibdev *dev) * Note that ipath_unregister_ib_device() can be called before all * the QPs are destroyed! */ - ipath_free_all_qps(&dev->qp_table); + qps_inuse = ipath_free_all_qps(&dev->qp_table); + if (qps_inuse) + ipath_dev_err(dev->dd, "QP memory leak! %u still in use\n", + qps_inuse); kfree(dev->qp_table.table); kfree(dev->lk_table.table); kfree(dev->txreq_bufs); @@ -2213,17 +2276,14 @@ static ssize_t show_stats(struct device *device, struct device_attribute *attr, "RC OTH NAKs %d\n" "RC timeouts %d\n" "RC RDMA dup %d\n" - "RC stalls %d\n" "piobuf wait %d\n" - "no piobuf %d\n" "unaligned %d\n" "PKT drops %d\n" "WQE errs %d\n", dev->n_rc_resends, dev->n_rc_qacks, dev->n_rc_acks, dev->n_seq_naks, dev->n_rdma_seq, dev->n_rnr_naks, dev->n_other_naks, dev->n_timeouts, - dev->n_rdma_dup_busy, dev->n_rc_stalls, dev->n_piowait, - dev->n_no_piobuf, dev->n_unaligned, + dev->n_rdma_dup_busy, dev->n_piowait, dev->n_unaligned, dev->n_pkt_drops, dev->n_wqe_errs); for (i = 0; i < ARRAY_SIZE(dev->opstats); i++) { const struct ipath_opcode_stats *si = &dev->opstats[i]; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index 4c7c2aa..eed1fdc 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -74,6 +74,11 @@ #define IPATH_POST_RECV_OK 0x02 #define IPATH_PROCESS_RECV_OK 0x04 #define IPATH_PROCESS_SEND_OK 0x08 +#define IPATH_PROCESS_NEXT_SEND_OK 0x10 +#define IPATH_FLUSH_SEND 0x20 +#define IPATH_FLUSH_RECV 0x40 +#define IPATH_PROCESS_OR_FLUSH_SEND \ + (IPATH_PROCESS_SEND_OK | IPATH_FLUSH_SEND) /* IB Performance Manager status values */ #define IB_PMA_SAMPLE_STATUS_DONE 0x00 @@ -353,12 +358,14 @@ struct ipath_qp { struct ib_qp ibqp; struct ipath_qp *next; /* link list for QPN hash table */ struct ipath_qp *timer_next; /* link list for ipath_ib_timer() */ + struct ipath_qp *pio_next; /* link for ipath_ib_piobufavail() */ struct list_head piowait; /* link for wait PIO buf */ struct list_head timerwait; /* link for waiting for timeouts */ struct ib_ah_attr 
remote_ah_attr; struct ipath_ib_header s_hdr; /* next packet header to send */ atomic_t refcount; wait_queue_head_t wait; + wait_queue_head_t wait_dma; struct tasklet_struct s_task; struct ipath_mmap_info *ip; struct ipath_sge_state *s_cur_sge; @@ -369,7 +376,7 @@ struct ipath_qp { struct ipath_sge_state s_rdma_read_sge; struct ipath_sge_state r_sge; /* current receive data */ spinlock_t s_lock; - unsigned long s_busy; + atomic_t s_dma_busy; u16 s_pkt_delay; u16 s_hdrwords; /* size of s_hdr in 32 bit words */ u32 s_cur_size; /* size of send packet in bytes */ @@ -383,6 +390,7 @@ struct ipath_qp { u32 s_rnr_timeout; /* number of milliseconds for RNR timeout */ u32 r_ack_psn; /* PSN for next ACK or atomic ACK */ u64 r_wr_id; /* ID for current receive WQE */ + unsigned long r_aflags; u32 r_len; /* total length of r_sge */ u32 r_rcv_len; /* receive data len processed */ u32 r_psn; /* expected rcv packet sequence number */ @@ -394,8 +402,7 @@ struct ipath_qp { u8 r_state; /* opcode of last packet received */ u8 r_nak_state; /* non-zero if NAK is pending */ u8 r_min_rnr_timer; /* retry timeout value for RNR NAKs */ - u8 r_reuse_sge; /* for UC receive errors */ - u8 r_wrid_valid; /* r_wrid set but CQ entry not yet made */ + u8 r_flags; u8 r_max_rd_atomic; /* max number of RDMA read/atomic to receive */ u8 r_head_ack_queue; /* index into s_ack_queue[] */ u8 qp_access_flags; @@ -404,13 +411,13 @@ struct ipath_qp { u8 s_rnr_retry_cnt; u8 s_retry; /* requester retry counter */ u8 s_rnr_retry; /* requester RNR retry counter */ - u8 s_wait_credit; /* limit number of unacked packets sent */ u8 s_pkey_index; /* PKEY index to use */ u8 s_max_rd_atomic; /* max number of RDMA read/atomic to send */ u8 s_num_rd_atomic; /* number of RDMA read/atomic pending */ u8 s_tail_ack_queue; /* index into s_ack_queue[] */ u8 s_flags; u8 s_dmult; + u8 s_draining; u8 timeout; /* Timeout for this QP */ enum ib_mtu path_mtu; u32 remote_qpn; @@ -428,16 +435,39 @@ struct ipath_qp { struct ipath_sge r_sg_list[0]; /* verified SGEs */ }; -/* Bit definition for s_busy. */ -#define IPATH_S_BUSY 0 +/* + * Atomic bit definitions for r_aflags. + */ +#define IPATH_R_WRID_VALID 0 + +/* + * Bit definitions for r_flags. + */ +#define IPATH_R_REUSE_SGE 0x01 /* * Bit definitions for s_flags. + * + * IPATH_S_FENCE_PENDING - waiting for all prior RDMA read or atomic SWQEs + * before processing the next SWQE + * IPATH_S_RDMAR_PENDING - waiting for any RDMA read or atomic SWQEs + * before processing the next SWQE + * IPATH_S_WAITING - waiting for RNR timeout or send buffer available. + * IPATH_S_WAIT_SSN_CREDIT - waiting for RC credits to process next SWQE + * IPATH_S_WAIT_DMA - waiting for send DMA queue to drain before generating + next send completion entry not via send DMA. 
*/ #define IPATH_S_SIGNAL_REQ_WR 0x01 #define IPATH_S_FENCE_PENDING 0x02 #define IPATH_S_RDMAR_PENDING 0x04 #define IPATH_S_ACK_PENDING 0x08 +#define IPATH_S_BUSY 0x10 +#define IPATH_S_WAITING 0x20 +#define IPATH_S_WAIT_SSN_CREDIT 0x40 +#define IPATH_S_WAIT_DMA 0x80 + +#define IPATH_S_ANY_WAIT (IPATH_S_FENCE_PENDING | IPATH_S_RDMAR_PENDING | \ + IPATH_S_WAITING | IPATH_S_WAIT_SSN_CREDIT | IPATH_S_WAIT_DMA) #define IPATH_PSN_CREDIT 512 @@ -573,13 +603,11 @@ struct ipath_ibdev { u32 n_rnr_naks; u32 n_other_naks; u32 n_timeouts; - u32 n_rc_stalls; u32 n_pkt_drops; u32 n_vl15_dropped; u32 n_wqe_errs; u32 n_rdma_dup_busy; u32 n_piowait; - u32 n_no_piobuf; u32 n_unaligned; u32 port_cap_flags; u32 pma_sample_start; @@ -657,6 +685,17 @@ static inline struct ipath_ibdev *to_idev(struct ib_device *ibdev) return container_of(ibdev, struct ipath_ibdev, ibdev); } +/* + * This must be called with s_lock held. + */ +static inline void ipath_schedule_send(struct ipath_qp *qp) +{ + if (qp->s_flags & IPATH_S_ANY_WAIT) + qp->s_flags &= ~IPATH_S_ANY_WAIT; + if (!(qp->s_flags & IPATH_S_BUSY)) + tasklet_hi_schedule(&qp->s_task); +} + int ipath_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num, @@ -706,7 +745,7 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int ipath_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, struct ib_qp_init_attr *init_attr); -void ipath_free_all_qps(struct ipath_qp_table *qpt); +unsigned ipath_free_all_qps(struct ipath_qp_table *qpt); int ipath_init_qp_table(struct ipath_ibdev *idev, int size); From ralph.campbell at qlogic.com Thu May 8 11:55:28 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 08 May 2008 11:55:28 -0700 Subject: [ofa-general] [PATCH 3/3] IB/ipath - fix RDMA read response sequence checking In-Reply-To: <20080508185512.8547.29637.stgit@eng-46.mv.qlogic.com> References: <20080508185512.8547.29637.stgit@eng-46.mv.qlogic.com> Message-ID: <20080508185528.8547.31626.stgit@eng-46.mv.qlogic.com> If an out of sequence RDMA read response middle or last packet is received, we should only resend the RDMA read request on the first out of sequence packet and drop subsequent out of sequence packets otherwise, we get "too many retries". Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_rc.c | 7 +++++++ drivers/infiniband/hw/ipath/ipath_verbs.h | 1 + 2 files changed, 8 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c index 5b5276a..108df66 100644 --- a/drivers/infiniband/hw/ipath/ipath_rc.c +++ b/drivers/infiniband/hw/ipath/ipath_rc.c @@ -1189,6 +1189,7 @@ static inline void ipath_rc_rcv_resp(struct ipath_ibdev *dev, wqe = get_swqe_ptr(qp, qp->s_last); if (unlikely(wqe->wr.opcode != IB_WR_RDMA_READ)) goto ack_op_err; + qp->r_flags &= ~IPATH_R_RDMAR_SEQ; /* * If this is a response to a resent RDMA read, we * have to be careful to copy the data to the right @@ -1202,6 +1203,9 @@ static inline void ipath_rc_rcv_resp(struct ipath_ibdev *dev, /* no AETH, no ACK */ if (unlikely(ipath_cmp24(psn, qp->s_last_psn + 1))) { dev->n_rdma_seq++; + if (qp->r_flags & IPATH_R_RDMAR_SEQ) + goto ack_done; + qp->r_flags |= IPATH_R_RDMAR_SEQ; ipath_restart_rc(qp, qp->s_last_psn + 1); goto ack_done; } @@ -1263,6 +1267,9 @@ static inline void ipath_rc_rcv_resp(struct ipath_ibdev *dev, /* ACKs READ req. 
*/ if (unlikely(ipath_cmp24(psn, qp->s_last_psn + 1))) { dev->n_rdma_seq++; + if (qp->r_flags & IPATH_R_RDMAR_SEQ) + goto ack_done; + qp->r_flags |= IPATH_R_RDMAR_SEQ; ipath_restart_rc(qp, qp->s_last_psn + 1); goto ack_done; } diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index eed1fdc..d64ca0f 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -444,6 +444,7 @@ struct ipath_qp { * Bit definitions for r_flags. */ #define IPATH_R_REUSE_SGE 0x01 +#define IPATH_R_RDMAR_SEQ 0x02 /* * Bit definitions for s_flags. From sean.hefty at intel.com Thu May 8 11:58:03 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 8 May 2008 11:58:03 -0700 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: <48224662.60401@opengridcomputing.com> References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> <48222D58.9020205@opengridcomputing.com> <48222FFD.40302@opengridcomputing.com> <001001c8b099$960a9560$5fd4180a@amr.corp.intel.com> <48224662.60401@opengridcomputing.com> Message-ID: <000a01c8b13d$716d2980$465a180a@amr.corp.intel.com> >The requirement is mostly driven from the receiving side. For cxgb3 it >is anyway... Maybe you can help me understand the spec here. If we ignore this feature for a minute, then the side that calls rdma_connect() must instead issue the first 'send' request to the server. Can the first 'send' be a 0B rdma write or read? Why wouldn't the target of that request not have to transition to connected? Is the issue that there's no way for the receiving FW/driver to know that this has occurred so that it can signal that the connection has been established? I.e. a client that does this must signal the server that things are ready through some out of band means. >server sends MPA Start response with "lets do RTR and send me X" where >X could be 0B write, 0B read request or 0B send. Are there any restrictions where a client may not be able to issue what the server requests? E.g. the hardware doesn't issue 0B writes. - Sean From swise at opengridcomputing.com Thu May 8 12:17:02 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 08 May 2008 14:17:02 -0500 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: <000a01c8b13d$716d2980$465a180a@amr.corp.intel.com> References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> <48222D58.9020205@opengridcomputing.com> <48222FFD.40302@opengridcomputing.com> <001001c8b099$960a9560$5fd4180a@amr.corp.intel.com> <48224662.60401@opengridcomputing.com> <000a01c8b13d$716d2980$465a180a@amr.corp.intel.com> Message-ID: <482351AE.2050800@opengridcomputing.com> An HTML attachment was scrubbed... 
URL:

From swise at opengridcomputing.com  Thu May  8 12:25:13 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 08 May 2008 14:25:13 -0500
Subject: [ofa-general] Re: [PATCH] Request For Comments:
In-Reply-To: <482351AE.2050800@opengridcomputing.com>
References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> <48222D58.9020205@opengridcomputing.com> <48222FFD.40302@opengridcomputing.com> <001001c8b099$960a9560$5fd4180a@amr.corp.intel.com> <48224662.60401@opengridcomputing.com> <000a01c8b13d$716d2980$465a180a@amr.corp.intel.com> <482351AE.2050800@opengridcomputing.com>
Message-ID: <48235399.4030609@opengridcomputing.com>

From RFC 5044, section 7.1.2 "Connection Startup Rules", Page 29:

   4.  MPA Responder mode implementations MUST receive and validate at
       least one FPDU before sending any FPDUs or Markers.

       Note: This requirement is present to allow the Initiator time to
       get its receiver into Full Operation before an FPDU arrives,
       avoiding potential race conditions at the Initiator. This
       was also subject to some debate in the work group before
       rough consensus was reached. Eliminating this requirement
       would allow faster startup in some types of applications.
       However, that would also make certain implementations
       (particularly "dual stack") much harder.

Steve Wise wrote:
> Sean Hefty wrote:
>>> The requirement is mostly driven from the receiving side. For cxgb3 it
>>> is anyway...
>>>
>>
>> Maybe you can help me understand the spec here. If we ignore this feature for a
>> minute, then the side that calls rdma_connect() must instead issue the first
>> 'send' request to the server. Can the first 'send' be a 0B rdma write or read?
>>
> According to the MPA IETF RFC, the initiator must send the first
> FPDU. That could be anything. The spec leaves it up to the ULP.
>
>> Why wouldn't the target of that request not have to transition to connected?
>>
>>
> I don't understand this question? What does 'transition to connected'
> mean?
>
> The requirement is that the responder (the side that issues the
> rdma_accept in rdma-cma terms) _cannot_ send an FPDU until it first
> receives one from the initiator. How that is enforced is an
> implementation detail. The responder driver could hold off on the
> ESTABLISHED event until it receives the first FPDU. Or it could stall
> SQ processing until the first FPDU is received yet still indicate that
> the connection is ESTABLISHED.
>
>> Is the issue that there's no way for the receiving FW/driver to know that this
>> has occurred so that it can signal that the connection has been established?
>> I.e. a client that does this must signal the server that things are ready
>> through some out of band means.
>>
>>
> I don't understand what you're getting at exactly.
>
> The issue is that the server doesn't know when the client receives the
> MPA Start Response and has successfully transitioned the connection
> into RDMA mode. If the server sends an FPDU immediately following the
> MPA Start Response (which is in streaming mode), then it's possible for
> that first FPDU to get passed up to the driver/ULP as streaming mode
> data. Which breaks everything. Soooo, the spec says the server
> cannot send an FPDU until it first receives one and thus _knows_ the
> client is in RDMA mode (by virtue of the fact that the client sent an
> FPDU).
>
>
>>> server sends MPA Start response with "lets do RTR and send me X" where
>>> X could be 0B write, 0B read request or 0B send.
>>>
>>
>> Are there any restrictions where a client may not be able to issue what the
>> server requests? E.g. the hardware doesn't issue 0B writes.
>>
>>
>
> Well I guess there could be. The consensus within the iWARP vendors
> at Reno was that 0B read would be ok. During the previous discussion on
> this list shortly after Reno, issues were raised that we should allow
> other types.
>
> We could make the MPA start request have more info than "I can do
> RTR". It could have "Here are the RTR msgs I can send". Does that
> help?
>
>
>
> Steve.
>
>

From swise at opengridcomputing.com  Thu May  8 12:42:55 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 08 May 2008 14:42:55 -0500
Subject: [ofa-general] Re: [PATCH] Request For Comments:
In-Reply-To: <48235399.4030609@opengridcomputing.com>
References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> <48222D58.9020205@opengridcomputing.com> <48222FFD.40302@opengridcomputing.com> <001001c8b099$960a9560$5fd4180a@amr.corp.intel.com> <48224662.60401@opengridcomputing.com> <000a01c8b13d$716d2980$465a180a@amr.corp.intel.com> <482351AE.2050800@opengridcomputing.com> <48235399.4030609@opengridcomputing.com>
Message-ID: <482357BF.8050702@opengridcomputing.com>

Here is the thread where we discussed how to implement peer-to-peer
for iWARP in Nov/2007:

http://lists.openfabrics.org/pipermail/general/2007-November/043252.html

Steve Wise wrote:
>
>
> From RFC 5044, section 7.1.2 "Connection Startup Rules", Page 29:
>
>    4.  MPA Responder mode implementations MUST receive and validate at
>        least one FPDU before sending any FPDUs or Markers.
>
>        Note: This requirement is present to allow the Initiator time to
>        get its receiver into Full Operation before an FPDU arrives,
>        avoiding potential race conditions at the Initiator. This
>        was also subject to some debate in the work group before
>        rough consensus was reached. Eliminating this requirement
>        would allow faster startup in some types of applications.
>        However, that would also make certain implementations
>        (particularly "dual stack") much harder.
>
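[Editorial aside: to make the "0B read" RTR option in the exchange above concrete, here is a minimal sketch of how an initiator could post a zero-byte RDMA read as the connection's first FPDU using the stock libibverbs API. This is not code from the thread; the wr_id is an arbitrary cookie, and whether a zero-length read may carry a zero rkey is device-specific, so treat that as an assumption to verify against your RNIC.]

#include <string.h>
#include <infiniband/verbs.h>

/* Post a zero-byte RDMA read as the first FPDU on a freshly
 * connected iWARP QP (assumed already in RTS via the RDMA CM). */
static int post_zero_byte_read(struct ibv_qp *qp)
{
	struct ibv_send_wr wr;
	struct ibv_send_wr *bad_wr;

	memset(&wr, 0, sizeof wr);
	wr.wr_id = 0xcafe;                 /* arbitrary completion cookie */
	wr.opcode = IBV_WR_RDMA_READ;
	wr.send_flags = IBV_SEND_SIGNALED;
	wr.num_sge = 0;                    /* zero-byte: no local SGEs */
	wr.wr.rdma.remote_addr = 0;        /* nothing is actually read */
	wr.wr.rdma.rkey = 0;               /* assumption: ignored at 0B */

	return ibv_post_send(qp, &wr, &bad_wr);
}

The responder would simply wait for the corresponding receive/read activity before posting any work of its own, which is exactly the ordering RFC 5044 rule 4 above requires.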
From jsquyres at cisco.com  Thu May  8 14:28:46 2008
From: jsquyres at cisco.com (Jeff Squyres)
Date: Thu, 8 May 2008 17:28:46 -0400
Subject: [ofa-general] Verbs: IB vs. iWARP
Message-ID: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>

Over the past 24 hours, we assembled a list of differences between IB
and iWARP usage of verbs. I got a few comments on the text we
assembled, and figured it was time to turn this text over to
OpenFabrics to make it fully correct/complete/whatever, and then
publish it however you see fit.

I hope this starter text is helpful to you; enjoy.

-----
* struct ib_device.transport_type will be IBV_TRANSPORT_IWARP for
iWARP devices and IBV_TRANSPORT_IB for IB devices.

* ibv_query_gid():
   * When invoked on an IB HCA, will return the IB subnet prefix in
subnet_prefix and GUID of the port in the interface_id.
   * When invoked on an iWARP NIC, will return the NIC's MAC address
in subnet_prefix and 0 in the interface_id.

* iWARP QPs ''must'' be made with the RDMA CM; IB QPs can be made
using the IB CM, RDMA CM, or some other (assumedly out-of-band)
mechanism.

* When making QPs, some versions of iWARP drivers require the
initiator of the connection to send the first message (having the
non-initiator send the first message will terminate the connection).
Newer versions of iWARP firmware/drivers hide this functionality down
in the driver, so the ULP doesn't have to ensure that the initiator
sends the first message.

* When terminating connections via the RDMA CM (via the
rdma_disconnect() call or by simply destroying the QP without
disconnecting first), iWARP transports will automatically create a
CQE for any pending send or receive WRs with the status set to
IBV_WC_WR_FLUSH_ERR. Note that IB HCAs do the same thing, but the
iWARP RDMA CM disconnection progresses independently of the ULP,
meaning that when one side issues the disconnect, the other side will
automatically be disconnected (even if the ULP doesn't realize it).
IB HCAs may not process the disconnect until later (via RDMA CM or
otherwise), perhaps not until the ULP realizes that the disconnect
has occurred. In short: device-independent verbs-based applications
need to be able to handle FLUSH WRs during disconnection and not
treat them as an error.

* LIDs are always 0 in iWARP.

* LMC is always 0 for iWARP.

* Memory regions used to receive RDMA read responses must have
"remote write" permission (since in the iWARP protocol, RDMA read
responses are basically the same as incoming RDMA write requests).

* Atomics and immediate data are not available in iWARP.

* The sink scatter-gather list for an RDMA read can only have one
element for iWARP (which is reported accurately in struct
ibv_device.max_sge).

* Send completions provide a slightly different guarantee:
   * iWARP: indicates that the resources in the corresponding WR can
be reused; it does ''not'' indicate that the data is in the peer's
memory, or even that they have been transmitted yet.
   * IB: indicates that the data has been transmitted and has arrived
at the remote HCA (but is not necessarily in the remote target buffer
yet)

* All currently-available RNICs (May 2008) do not support RNR retry.
Specifically: current RNICs will terminate a QP connection if a SEND
arrives with no corresponding pre-posted receive.

--
Jeff Squyres
Cisco Systems
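[Editorial aside: several of the items in the list above hinge on detecting the transport at runtime. Here is a small sketch, using only stock libibverbs calls, of how a portable ULP might branch on it; the printing is purely illustrative.]

#include <stdio.h>
#include <infiniband/verbs.h>

/* List local RDMA devices and report IB vs. iWARP, so a ULP can
 * decide up front whether, e.g., atomics or RNR retry are usable. */
int main(void)
{
	int i, num;
	struct ibv_device **devs = ibv_get_device_list(&num);

	if (!devs)
		return 1;
	for (i = 0; i < num; ++i)
		printf("%s: %s\n", ibv_get_device_name(devs[i]),
		       devs[i]->transport_type == IBV_TRANSPORT_IWARP ?
		       "iWARP" : "IB");
	ibv_free_device_list(devs);
	return 0;
}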
From andrea at qumranet.com  Thu May  8 15:01:06 2008
From: andrea at qumranet.com (Andrea Arcangeli)
Date: Fri, 9 May 2008 00:01:06 +0200
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: 
References: <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> <20080508052019.GA8276@duo.random>
Message-ID: <20080508220106.GF2964@duo.random>

On Thu, May 08, 2008 at 09:11:33AM -0700, Linus Torvalds wrote:
> Btw, this is an issue only on 32-bit x86, because on 64-bit one we already
> have the padding due to the alignment of the 64-bit pointers in the
> list_head (so there's already empty space there).
>
> On 32-bit, the alignment of list-head is obviously just 32 bits, so right
> now the structure is "perfectly packed" and doesn't have any empty space.
> But that's just because the spinlock is unnecessarily big.
>
> (Of course, if anybody really uses NR_CPUS >= 256 on 32-bit x86, then the
> structure really will grow. That's a very odd configuration, though, and
> not one I feel we really need to care about).

I see two ways to implement it:

1) use #ifdef and make it zero overhead for 64bit only without
playing any non obvious trick.
	struct anon_vma {
		spinlock_t lock;
	#ifdef CONFIG_MMU_NOTIFIER
		int global_mm_lock:1;
	#endif

	struct address_space {
		spinlock_t private_lock;
	#ifdef CONFIG_MMU_NOTIFIER
		int global_mm_lock:1;
	#endif

2) add a:

	#define AS_GLOBAL_MM_LOCK	(__GFP_BITS_SHIFT + 2)	/* global_mm_locked */

and use address_space->flags with bitops

And as Andrew pointed out to me by PM, for the anon_vma we can use the
LSB of the list.next/prev because the list can't be browsed when the
lock is taken, so taking the lock and then setting the bit and
clearing the bit before unlocking is safe. The LSB will always read 0
even if it's under list_add modification when the global spinlock
isn't taken. And after taking the anon_vma lock we can switch the LSB
from 0 to 1 without races and the 1 will be protected by the global
spinlock.

The above solution is zero cost for 32bit too, so I prefer it.

So I now agree with you this is a great idea on how to remove sort()
and vmalloc and especially vfree without increasing the VM footprint.

I'll send an update with this for review very shortly and I hope this
goes in so KVM will be able to swap and do many other things very well
starting in 2.6.26.

Thanks a lot,
Andrea
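[Editorial aside: a tiny sketch of the pointer-LSB trick Andrea describes, under the assumption that list_head pointers are at least 2-byte aligned so bit 0 is free; the helper names are invented for illustration, and the bit may only be set or cleared while the anon_vma lock is held.]

#include <stdint.h>

struct list_head { struct list_head *next, *prev; };

#define GLOBAL_LOCK_BIT 0x1UL

/* Caller holds the anon_vma lock: steal bit 0 of head->next. */
static void mark_global_locked(struct list_head *head)
{
	head->next = (struct list_head *)
		((uintptr_t)head->next | GLOBAL_LOCK_BIT);
}

/* Caller holds the anon_vma lock: restore the clean pointer. */
static void clear_global_locked(struct list_head *head)
{
	head->next = (struct list_head *)
		((uintptr_t)head->next & ~GLOBAL_LOCK_BIT);
}

/* Readers that already hold the lock can test the flag; lockless
 * readers always see bit 0 as 0, as described above. */
static int is_global_locked(const struct list_head *head)
{
	return (uintptr_t)head->next & GLOBAL_LOCK_BIT;
}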
From changquing.tang at hp.com  Thu May  8 15:02:13 2008
From: changquing.tang at hp.com (Tang, Changqing)
Date: Thu, 8 May 2008 22:02:13 +0000
Subject: [ofa-general] Verbs: IB vs. iWARP
In-Reply-To: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>
References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>
Message-ID: 

Great, Thanks, Jeff.

--CQ

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of
> Jeff Squyres
> Sent: Thursday, May 08, 2008 4:29 PM
> To: OpenFabrics General
> Subject: [ofa-general] Verbs: IB vs. iWARP
>
> Over the past 24 hours, we assembled a list of differences
> between IB and iWARP usage of verbs. I got a few comments on
> the text we assembled, and figured it was time to turn this
> text over to OpenFabrics to make it fully
> correct/complete/whatever, and then publish it however you see fit.
>
> I hope this starter text is helpful to you; enjoy.
>
> -----
> * struct ib_device.transport_type will be
> IBV_TRANSPORT_IWARP for iWARP devices and IBV_TRANSPORT_IB
> for IB devices.
>
> * ibv_query_gid():
> * When invoked on an IB HCA, will return the IB subnet
> prefix in subnet_prefix and GUID of the port in the interface_id.
> * When invoked on an iWARP NIC, will return the NIC's MAC
> address in subnet_prefix and 0 in the interface_id.
>
> * iWARP QPs ''must'' be made with the RDMA CM; IB QPs can
> be made using the IB CM, RDMA CM, or some other (assumedly
> out-of-band) mechanism.
>
> * When making QPs, some versions of iWARP drivers require
> the initiator of the connection to send the first message
> (having the non-initiator send the first message will
> terminate the connection).
> Newer versions of iWARP firmware/drivers hide this
> functionality down in the driver, so the ULP doesn't have to
> ensure that the initiator sends the first message.
>
> * When terminating connections via the RDMA CM (via the
> rdma_disconnect() call or by simply destroying the QP without
> disconnecting first), iWARP transports will automatically
> create a CQE for any pending send or receive WRs with the
> status set to IBV_WC_WR_FLUSH_ERR. Note that IB HCAs do the
> same thing, but the iWARP RDMA CM disconnection progresses
> independently of the ULP, meaning that when one side issues
> the disconnect, the other side will
> automatically be disconnected (even if the ULP doesn't realize it).
> IB HCAs may not process the disconnect until later (via RDMA
> CM or otherwise), perhaps not until the ULP realizes that the
> disconnect has occurred. In short: device-independent
> verbs-based applications need to be able to handle FLUSH WRs
> during disconnection and not treat them as an error.
>
> * LIDs are always 0 in iWARP.
>
> * LMC is always 0 for iWARP.
>
> * Memory regions used to receive RDMA read responses must
> have "remote write" permission (since in the iWARP protocol,
> RDMA read responses are basically the same as incoming RDMA
> write requests).
>
> * Atomics and immediate data are not available in iWARP.
>
> * The sink scatter-gather list for an RDMA read can only
> have one element for iWARP (which is reported accurately in
> struct ibv_device.max_sge).
>
> * Send completions provide a slightly different guarantee:
> * iWARP: indicates that the resources in the
> corresponding WR can be reused; it does ''not'' indicate that
> the data is in the peer's memory, or even that they have been
> transmitted yet.
> * IB: indicates that the data has been transmitted and
> has arrived at the remote HCA (but is not necessarily in the
> remote target buffer
> yet)
>
> * All currently-available RNICs (May 2008) do not support
> RNR retry. Specifically: current RNICs will terminate a QP
> connection if a SEND arrives with no corresponding pre-posted receive.
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From Arkady.Kanevsky at netapp.com  Thu May  8 15:14:54 2008
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Thu, 8 May 2008 18:14:54 -0400
Subject: [ofa-general] Verbs: IB vs. iWARP
In-Reply-To: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>
References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>
Message-ID: 

There are also some differences in memory registration, for example FMR.

Peer-to-peer iWARP CM support has been submitted by Steve Wise.
We will test its interop in Sept, assuming that it will be in the OFED
version which will be used for OFA interop.
The changes are not just in the FW and driver but also in iWARP CM.

Also one can call iWARP CM directly, bypassing RDMA CM.
But there is no reason for it.
All iWARP apps had been developed after RDMA CM was in place, so there
was no reason to go under the covers.

Cheers,

Arkady Kanevsky                       email: arkady at netapp.com
Network Appliance Inc.                phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.          Fax: 781-895-1195
Waltham, MA 02451                     central phone: 781-768-5300

> -----Original Message-----
> From: Jeff Squyres [mailto:jsquyres at cisco.com]
> Sent: Thursday, May 08, 2008 5:29 PM
> To: OpenFabrics General
> Subject: [ofa-general] Verbs: IB vs. iWARP
>
> Over the past 24 hours, we assembled a list of differences
> between IB and iWARP usage of verbs. I got a few comments on
> the text we assembled, and figured it was time to turn this
> text over to OpenFabrics to make it fully
> correct/complete/whatever, and then publish it however you see fit.
>
> I hope this starter text is helpful to you; enjoy.
>
> -----
> * struct ib_device.transport_type will be
> IBV_TRANSPORT_IWARP for iWARP devices and IBV_TRANSPORT_IB
> for IB devices.
>
> * ibv_query_gid():
> * When invoked on an IB HCA, will return the IB subnet
> prefix in subnet_prefix and GUID of the port in the interface_id.
> * When invoked on an iWARP NIC, will return the NIC's MAC
> address in subnet_prefix and 0 in the interface_id.
>
> * iWARP QPs ''must'' be made with the RDMA CM; IB QPs can
> be made using the IB CM, RDMA CM, or some other (assumedly
> out-of-band) mechanism.
>
> * When making QPs, some versions of iWARP drivers require
> the initiator of the connection to send the first message
> (having the non-initiator send the first message will terminate the connection).
> Newer versions of iWARP firmware/drivers hide this
> functionality down in the driver, so the ULP doesn't have to
> ensure that the initiator sends the first message.
>
> * When terminating connections via the RDMA CM (via the
> rdma_disconnect() call or by simply destroying the QP without
> disconnecting first), iWARP transports will automatically
> create a CQE for any pending send or receive WRs with the
> status set to IBV_WC_WR_FLUSH_ERR. Note that IB HCAs do the
> same thing, but the iWARP RDMA CM disconnection progresses
> independently of the ULP, meaning that when one side issues
> the disconnect, the other side will
> automatically be disconnected (even if the ULP doesn't realize it).
> IB HCAs may not process the disconnect until later (via RDMA
> CM or otherwise), perhaps not until the ULP realizes that the
> disconnect has occurred. In short: device-independent
> verbs-based applications need to be able to handle FLUSH WRs
> during disconnection and not treat them as an error.
>
> * LIDs are always 0 in iWARP.
>
> * LMC is always 0 for iWARP.
>
> * Memory regions used to receive RDMA read responses must
> have "remote write" permission (since in the iWARP protocol,
> RDMA read responses are basically the same as incoming RDMA
> write requests).
>
> * Atomics and immediate data are not available in iWARP.
>
> * The sink scatter-gather list for an RDMA read can only
> have one element for iWARP (which is reported accurately in
> struct ibv_device.max_sge).
>
> * Send completions provide a slightly different guarantee:
> * iWARP: indicates that the resources in the
> corresponding WR can be reused; it does ''not'' indicate that
> the data is in the peer's memory, or even that they have been
> transmitted yet.
> * IB: indicates that the data has been transmitted and
> has arrived at the remote HCA (but is not necessarily in the
> remote target buffer
> yet)
>
> * All currently-available RNICs (May 2008) do not support
> RNR retry. Specifically: current RNICs will terminate a QP
> connection if a SEND arrives with no corresponding pre-posted receive.
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From sean.hefty at intel.com  Thu May  8 15:16:12 2008
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 8 May 2008 15:16:12 -0700
Subject: [ofa-general] Verbs: IB vs. iWARP
From sean.hefty at intel.com Thu May 8 15:16:12 2008
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 8 May 2008 15:16:12 -0700
Subject: [ofa-general] Verbs: IB vs. iWARP
In-Reply-To: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>
References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>
Message-ID: <000c01c8b159$1fb6f9b0$465a180a@amr.corp.intel.com>

It'd be great to find a place for this on the wiki, so it's easier to
find in the future.

- Sean

From rdreier at cisco.com Thu May 8 15:22:47 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 08 May 2008 15:22:47 -0700
Subject: [ofa-general] Verbs: IB vs. iWARP
In-Reply-To: (Arkady Kanevsky's message of "Thu, 8 May 2008 18:14:54 -0400")
References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>
Message-ID:

 > There are also some differences in memory registration, for example FMR.

What are the differences?  I don't know of any significant ones (given
the IB verbs extensions).

 - R.

From Arkady.Kanevsky at netapp.com Thu May 8 15:31:02 2008
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Thu, 8 May 2008 18:31:02 -0400
Subject: [ofa-general] Verbs: IB vs. iWARP
In-Reply-To: References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>
Message-ID:

I had not checked for a while, but my recollection is that the FMR
implementation is vendor-specific...

Arkady Kanevsky                   email: arkady at netapp.com
Network Appliance Inc.            phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.      Fax: 781-895-1195
Waltham, MA 02451                 central phone: 781-768-5300

> -----Original Message-----
> From: Roland Dreier [mailto:rdreier at cisco.com]
> Sent: Thursday, May 08, 2008 6:23 PM
> To: Kanevsky, Arkady
> Cc: OpenFabrics General
> Subject: Re: [ofa-general] Verbs: IB vs. iWARP
>
> > There are also some differences in memory registration, for
> example FMR.
>
> What are the differences?  I don't know of any significant
> ones (given the IB verbs extensions).
>
> - R.
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From akpm at linux-foundation.org Thu May 8 22:29:16 2008
From: akpm at linux-foundation.org (Andrew Morton)
Date: Thu, 8 May 2008 22:29:16 -0700
Subject: [ofa-general] bitops take an unsigned long *
Message-ID: <20080508222916.277649ca.akpm@linux-foundation.org>

Most architectures could (and should) take an unsigned long * arg for
their bitops.  x86 doesn't do this and it needs fixing.  I fixed it.
Infiniband is being a problem.  It would be nice to get it fixed up,
please.
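The warnings that follow all come from handing a bitop a pointer to a
narrower or differently-typed word. A minimal illustration of the
pattern (the struct and flag names here are hypothetical, not from the
ipath driver):

#include <linux/bitops.h>
#include <linux/types.h>

struct foo_dev {
        u64 status;             /* &status trips the x86 bitop
                                 * prototypes, which take unsigned long * */
        unsigned long flags;    /* the type the bitops API expects */
};

static void foo_mark_busy(struct foo_dev *dd)
{
        set_bit(0, &dd->flags);         /* fine: unsigned long * */
        /* set_bit(0, &dd->status);        would warn once bitops
                                           take unsigned long * */
}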
drivers/infiniband/hw/ipath/ipath_driver.c: In function 'decode_sdma_errs': drivers/infiniband/hw/ipath/ipath_driver.c:926: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:926: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c: In function 'ipath_cancel_sends': drivers/infiniband/hw/ipath/ipath_driver.c:1901: warning: passing argument 2 of 'test_and_set_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:1902: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:1902: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:1934: warning: passing argument 2 of 'set_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:1949: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:1949: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:1950: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:1950: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:1955: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:1955: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c: In function 'handle_sdma_errors': drivers/infiniband/hw/ipath/ipath_intr.c:553: warning: passing argument 2 of '__set_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c:554: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c:554: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c: In function 'handle_sdma_intr': drivers/infiniband/hw/ipath/ipath_intr.c:566: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c:566: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c:570: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c:570: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c:575: warning: passing argument 2 of '__set_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c:579: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c:579: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/mlx4/qp.c: In function 'mlx4_ib_post_send': drivers/infiniband/hw/mlx4/qp.c:1460: warning: 'seglen' may be used uninitialized in this function drivers/char/epca.c:2542: warning: 'epca_setup' defined but not used drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_notify_task': drivers/infiniband/hw/ipath/ipath_sdma.c:200: warning: passing argument 2 of 'constant_test_bit' from incompatible 
pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:200: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_abort_task': drivers/infiniband/hw/ipath/ipath_sdma.c:236: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:236: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:253: warning: passing argument 2 of '__set_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:269: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:269: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:271: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:271: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:273: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:273: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:354: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:354: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'setup_sdma': drivers/infiniband/hw/ipath/ipath_sdma.c:504: warning: passing argument 2 of '__set_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'teardown_sdma': drivers/infiniband/hw/ipath/ipath_sdma.c:521: warning: passing argument 2 of '__clear_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:522: warning: passing argument 2 of '__set_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:523: warning: passing argument 2 of '__set_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'ipath_restart_sdma': drivers/infiniband/hw/ipath/ipath_sdma.c:608: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:608: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:609: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:609: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:612: warning: passing argument 2 of '__clear_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:613: warning: passing argument 2 of '__clear_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:614: warning: passing argument 2 of '__clear_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'ipath_sdma_verbs_send': 
drivers/infiniband/hw/ipath/ipath_sdma.c:691: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type
drivers/infiniband/hw/ipath/ipath_sdma.c:691: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type

From Robert at saq.co.uk Fri May 9 01:26:47 2008
From: Robert at saq.co.uk (Robert Dunkley)
Date: Fri, 9 May 2008 09:26:47 +0100
Subject: [ofa-general] Some general Infinband questions
Message-ID:

First of all please excuse me if this is not the right place to ask (If
anyone knows a good place to start for information on Infiniband
clustering then please let me know). I'm considering an Infiniband Xen
Virtual Server cluster under Centos 5.1. I already have a pair of under
utilized NAS servers running Windows Server 2003. I've played with
MYSQL NDB clustering and have a reasonable amount of Linux and Windows
experience. Would using the Windows Servers as storage and the Centos
based servers as cluster nodes even be possible with the current state
of software? Any general advice or tips?

The SAQ Group
Registered Office: 18 Chapel Street, Petersfield, Hampshire. GU32 3DZ
SEmtec Limited trading as SAQ is Registered in England & Wales
Company Number: 06481952
http://www.saqnet.co.uk AS29219
SAQ Group Delivers high quality, honestly priced communication and I.T. services to UK Business.
DSL : Domains : Email : Hosting : CoLo : Servers : Racks : Transit : Backups : Managed Networks : Remote Support.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From michael.heinz at qlogic.com Fri May 9 06:17:45 2008
From: michael.heinz at qlogic.com (Mike Heinz)
Date: Fri, 9 May 2008 08:17:45 -0500
Subject: [ofa-general] Still looking for help debugging a problem.
Message-ID:

May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: Internal error detected:
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[00]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[01]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[02]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[03]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[04]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[05]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[06]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[07]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[08]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[09]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0a]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0b]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0c]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0d]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0e]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0f]: ffffffff

The HCA in question is a Connect-X and the problem only seems to happen
with this node.

--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From forum.san at gmail.com Fri May 9 08:47:06 2008
From: forum.san at gmail.com (Sangamesh B)
Date: Fri, 9 May 2008 21:17:06 +0530
Subject: [ofa-general] RPM build errors:user vlad does not exist - using root
Message-ID:

Hi all,

I've worked with MPICH2, but I am a beginner to Infiniband and OFED.
The installation of the OFED-1.3.rc1 package on a cluster with CentOS 5
gave the following error:

....
.......
gcc -Wp,-MD,/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/.addr.o.d -nostdinc -iwithprefix include -D__KERNEL__ -include include/linux/autoconf.h -include /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include/linux/autoconf.h -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/kernel_addons/backport/2.6.9_U3/include/ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/debug -I/usr/local/include/scst -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/ulp/srpt -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/net/cxgb3 -Iinclude -Wall -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Os -fomit-frame-pointer -g -Wdeclaration-after-statement -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -funit-at-a-time -DMODULE -DKBUILD_BASENAME=addr -DKBUILD_MODNAME=ib_addr -c -o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/.tmp_addr.o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:31:25: linux/mutex.h: No such file or directory In file included from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:32: include/linux/inetdevice.h:50: error: field `mr_gq_timer' has incomplete type include/linux/inetdevice.h:51: error: field `mr_ifc_timer' has incomplete type include/linux/inetdevice.h:95: error: `IFNAMSIZ' undeclared here (not in a function) include/linux/inetdevice.h: In function `in_dev_get': include/linux/inetdevice.h:146: error: dereferencing pointer to incomplete type include/linux/inetdevice.h: In function `__in_dev_get': include/linux/inetdevice.h:156: error: dereferencing pointer to incomplete type /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:38:26: net/netevent.h: No such file or directory In file included from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include/rdma/ib_addr.h:37, from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:39: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include/rdma/ib_verbs.h:51:31: linux/scatterlist.h: No such file or directory In file included from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include/rdma/ib_addr.h:37, from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:39: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include/rdma/ib_verbs.h: At top level: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include/rdma/ib_verbs.h:1090: error: field `xrcd_table_mutex' has incomplete type /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:60: warning: type defaults to `int' in declaration of `DEFINE_MUTEX' /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:60: warning: parameter names (without types) in function declaration /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:62: warning: type defaults to `int' in declaration of `DECLARE_DELAYED_WORK' /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:62: warning: parameter names (without types) in function declaration /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function `set_timeout': /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:127: error: `work' undeclared (first use in this function) /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:127: error: (Each undeclared identifier is reported only once 
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:127: error: for each function it appears in.)
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function `queue_req':
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:140: warning: implicit declaration of function `mutex_lock'
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:140: error: `lock' undeclared (first use in this function)
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:150: warning: implicit declaration of function `mutex_unlock'
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function `process_req':
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:225: error: `lock' undeclared (first use in this function)
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function `rdma_addr_cancel':
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:341: error: `lock' undeclared (first use in this function)
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function `netevent_callback':
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:358: error: `NETEVENT_NEIGH_UPDATE' undeclared (first use in this function)
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function `addr_init':
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:378: warning: implicit declaration of function `register_netevent_notifier'
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function `addr_cleanup':
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:384: warning: implicit declaration of function `unregister_netevent_notifier'
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: At top level:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:218: warning: 'process_req' defined but not used
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:60: warning: 'DEFINE_MUTEX' declared `static' but never defined
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:62: warning: 'DECLARE_DELAYED_WORK' declared `static' but never defined
make[4]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.o] Error 1
make[3]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core] Error 2
make[2]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband] Error 2
make[1]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3] Error 2
make[1]: Leaving directory `/usr/src/kernels/2.6.9-34.0.2.EL-smp-x86_64'
make: *** [kernel] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.66489 (%build)

RPM build errors:
    user vlad does not exist - using root
    group vlad does not exist - using root
    user vlad does not exist - using root
    group vlad does not exist - using root
    Bad exit status from /var/tmp/rpm-tmp.66489 (%build)

Why these errors?

As a beginner, I want to know some points. In the
docs/OFED_Installation_Guide.txt guide, it is given that:

1. OS Distribution          Required Packages
   ---------------          ----------------------------------
   General:
   o Common to all          gcc, glib, glib-devel, glibc, glibc-devel,
                            glibc-devel-32bit (to build 32-bit libraries
                            on x86_64 and ppc64), zlib-devel, automake,
                            autoconf, libtool.
   o RedHat, Fedora         kernel-devel, rpm-build

And:
2. Specific Component Requirements:
   o Mvapich       a Fortran compiler (such as gcc-g77)
   o Mvapich2      libstdc++-devel, sysfsutils (SuSE), libsysfs-devel
                   (RedHat 5.0, Fedora C6)

Since the OS is CentOS, I tried to install the libstdc++-devel CentOS
rpms, but that failed because of glibc dependencies. Are both of these
really required?

In mvapich2, three makefiles are given: make.mvapich2.ofad,
make.mvapich2.vapi and make.mvapich2.udapl. Do all of these support
OFED, or does only make.mvapich2.ofad support it? If so, what should
the other two be used for?

Thanks in advance for Howto: Infiniband + OFED + mvapich2 concepts.

-Sangamesh

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rdreier at cisco.com Fri May 9 09:20:13 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 09 May 2008 09:20:13 -0700
Subject: [ofa-general] Still looking for help debugging a problem.
In-Reply-To: (Mike Heinz's message of "Fri, 9 May 2008 08:17:45 -0500")
References: Message-ID:

 > May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0:
 > Internal error detected:
 > May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0:
 > buf[00]: ffffffff

 > The HCA in question is a Connect-X and the problem only seems to happen
 > with this node.

Sounds like a hardware problem.  Try reseating everything etc.

From swise at opengridcomputing.com Fri May 9 09:26:00 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 09 May 2008 11:26:00 -0500
Subject: [ofa-general] Verbs: IB vs. iWARP
In-Reply-To: References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>
Message-ID: <48247B18.8000100@opengridcomputing.com>

The ib_fmr stuff is pretty HW-specific yes?

Kanevsky, Arkady wrote:
> I had not checked for a while, but my recollection is that the FMR
> implementation is
> vendor-specific...
>
> Arkady Kanevsky                   email: arkady at netapp.com
> Network Appliance Inc.            phone: 781-768-5395
> 1601 Trapelo Rd. - Suite 16.      Fax: 781-895-1195
> Waltham, MA 02451                 central phone: 781-768-5300
>
>> -----Original Message-----
>> From: Roland Dreier [mailto:rdreier at cisco.com]
>> Sent: Thursday, May 08, 2008 6:23 PM
>> To: Kanevsky, Arkady
>> Cc: OpenFabrics General
>> Subject: Re: [ofa-general] Verbs: IB vs. iWARP
>>
>> > There are also some differences in memory registration, for
>> example FMR.
>>
>> What are the differences?  I don't know of any significant
>> ones (given the IB verbs extensions).
>>
>> - R.
>>
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit
>> http://openib.org/mailman/listinfo/openib-general
>>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From rdreier at cisco.com Fri May 9 09:34:24 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 09 May 2008 09:34:24 -0700
Subject: [ofa-general] Verbs: IB vs. iWARP
In-Reply-To: <48247B18.8000100@opengridcomputing.com> (Steve Wise's message of "Fri, 09 May 2008 11:26:00 -0500")
References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com> <48247B18.8000100@opengridcomputing.com>
Message-ID:

 > The ib_fmr stuff is pretty HW-specific yes?

But not really an IB vs. iWARP thing...
it's just some wacky extension that was created a long time ago, which an iWARP RNIC could implement just as easily as an IB HCA. - R. From rdreier at cisco.com Fri May 9 09:36:35 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 09 May 2008 09:36:35 -0700 Subject: [ofa-general] bitops take an unsigned long * In-Reply-To: <20080508222916.277649ca.akpm@linux-foundation.org> (Andrew Morton's message of "Thu, 8 May 2008 22:29:16 -0700") References: <20080508222916.277649ca.akpm@linux-foundation.org> Message-ID: > Most architectures could (and should) take an unsigned long * arg for their > bitops. x86 doesn't do this and it needs fixing. I fixed it. Infiniband > is being a problem. Is your fix available somewhere? Would like to check any patches I make. > It would be nice to get it fixed up, please. Will take a look. A few non-ipath warnings in the spew: > drivers/infiniband/hw/mlx4/qp.c: In function 'mlx4_ib_post_send': > drivers/infiniband/hw/mlx4/qp.c:1460: warning: 'seglen' may be used uninitialized in this function which gcc version is giving this? > drivers/char/epca.c:2542: warning: 'epca_setup' defined but not used ...this got lost in the noise. - R. From swise at opengridcomputing.com Fri May 9 09:37:33 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 09 May 2008 11:37:33 -0500 Subject: [ofa-general] Verbs: IB vs. iWARP In-Reply-To: <000c01c8b159$1fb6f9b0$465a180a@amr.corp.intel.com> References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com> <000c01c8b159$1fb6f9b0$465a180a@amr.corp.intel.com> Message-ID: <48247DCD.9010205@opengridcomputing.com> Sean Hefty wrote: > It'd be great to find a place for this on the wiki, so it's easier to find in > the future. > > https://wiki.openfabrics.org/tiki-index.php?page=Verbs%3A+Infiniband+vs+iWARP From Arkady.Kanevsky at netapp.com Fri May 9 09:54:34 2008 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Fri, 9 May 2008 12:54:34 -0400 Subject: [ofa-general] Verbs: IB vs. iWARP In-Reply-To: <48247B18.8000100@opengridcomputing.com> References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com> <48247B18.8000100@opengridcomputing.com> Message-ID: The one in the core should not be. The one in a vendor driver invoked by core is vendor and HW specific. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Steve Wise [mailto:swise at opengridcomputing.com] > Sent: Friday, May 09, 2008 12:26 PM > To: Kanevsky, Arkady > Cc: Roland Dreier; OpenFabrics General > Subject: Re: [ofa-general] Verbs: IB vs. iWARP > > The ib_fmr stuff is pretty HW-specific yes? > > > Kanevsky, Arkady wrote: > > I had not check for a while but my recollection is the the fmr > > implementation is vendor specific... > > > > Arkady Kanevsky email: arkady at netapp.com > > Network Appliance Inc. phone: 781-768-5395 > > 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 > > Waltham, MA 02451 central phone: 781-768-5300 > > > > > > > >> -----Original Message----- > >> From: Roland Dreier [mailto:rdreier at cisco.com] > >> Sent: Thursday, May 08, 2008 6:23 PM > >> To: Kanevsky, Arkady > >> Cc: OpenFabrics General > >> Subject: Re: [ofa-general] Verbs: IB vs. iWARP > >> > >> > There are also some difference in memory registration, > for example > >> FMR. > >> > >> What are the differences? I don't know of any significant ones > >> (given the IB verbs extensions). > >> > >> - R. 
> >> > >> > >> _______________________________________________ > >> general mailing list > >> general at lists.openfabrics.org > >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > >> To unsubscribe, please visit > >> http://openib.org/mailman/listinfo/openib-general > >> > >> > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From swise at opengridcomputing.com Fri May 9 10:48:08 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 09 May 2008 12:48:08 -0500 Subject: [ofa-general] Verbs: IB vs. iWARP In-Reply-To: References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com> <48247B18.8000100@opengridcomputing.com> Message-ID: <48248E58.8060301@opengridcomputing.com> One more item added to the wiki page: * iWARP RNICs support a special privileged lkey == 0 which can be used in local SGLs when the address is a bus/dma address. From michael.heinz at qlogic.com Fri May 9 10:58:03 2008 From: michael.heinz at qlogic.com (Mike Heinz) Date: Fri, 9 May 2008 12:58:03 -0500 Subject: [ofa-general] Still looking for help debugging a problem. In-Reply-To: References: Message-ID: Thanks, Roland. We're trying switching HCAs now. -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Friday, May 09, 2008 12:20 PM To: Mike Heinz Cc: general at lists.openfabrics.org Subject: Re: [ofa-general] Still looking for help debugging a problem. > May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: > Internal error detected: > May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: > buf[00]: ffffffff > The HCA in question is a Connect-X and the problem only seems to happen > with this node. Sounds like a hardware problem. Try reseating everything etc. From a.p.zijlstra at chello.nl Fri May 9 11:37:29 2008 From: a.p.zijlstra at chello.nl (Peter Zijlstra) Date: Fri, 09 May 2008 20:37:29 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> <20080508052019.GA8276@duo.random> Message-ID: <1210358249.13978.275.camel@twins> On Thu, 2008-05-08 at 09:11 -0700, Linus Torvalds wrote: > > On Thu, 8 May 2008, Linus Torvalds wrote: > > > > Also, we'd need to make it > > > > unsigned short flag:1; > > > > _and_ change spinlock_types.h to make the spinlock size actually match the > > required size (right now we make it an "unsigned int slock" even when we > > actually only use 16 bits). > > Btw, this is an issue only on 32-bit x86, because on 64-bit one we already > have the padding due to the alignment of the 64-bit pointers in the > list_head (so there's already empty space there). 
> > On 32-bit, the alignment of list-head is obviously just 32 bits, so right
> > now the structure is "perfectly packed" and doesn't have any empty space.
> > But that's just because the spinlock is unnecessarily big.
> >
> > (Of course, if anybody really uses NR_CPUS >= 256 on 32-bit x86, then the
> > structure really will grow. That's a very odd configuration, though, and
> > not one I feel we really need to care about).

Another possibility, would something like this work?

/*
 * null out the begin function, no new begin calls can be made
 */
rcu_assign_pointer(my_notifier.invalidate_start_begin, NULL);

/*
 * lock/unlock all rmap locks in any order - this ensures that any
 * pending start() will have its end() function called.
 */
mm_barrier(mm);

/*
 * now that no new start() call can be made and all start()/end() pairs
 * are complete we can remove the notifier.
 */
mmu_notifier_remove(mm, my_notifier);

This requires a mmu_notifier instance per attached mm and that
__mmu_notifier_invalidate_range_start() uses rcu_dereference() to
obtain the function.

But I think it's enough to ensure that:

  for each start an end will be called

It can however happen that end is called without start - but we could
handle that I think.

From andrea at qumranet.com Fri May 9 11:55:53 2008
From: andrea at qumranet.com (Andrea Arcangeli)
Date: Fri, 9 May 2008 20:55:53 +0200
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <1210358249.13978.275.camel@twins>
References: <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> <20080508052019.GA8276@duo.random> <1210358249.13978.275.camel@twins>
Message-ID: <20080509185553.GF7710@duo.random>

On Fri, May 09, 2008 at 08:37:29PM +0200, Peter Zijlstra wrote:
> Another possibility, would something like this work?
>
> /*
>  * null out the begin function, no new begin calls can be made
>  */
> rcu_assign_pointer(my_notifier.invalidate_start_begin, NULL);
>
> /*
>  * lock/unlock all rmap locks in any order - this ensures that any
>  * pending start() will have its end() function called.
>  */
> mm_barrier(mm);
>
> /*
>  * now that no new start() call can be made and all start()/end() pairs
>  * are complete we can remove the notifier.
>  */
> mmu_notifier_remove(mm, my_notifier);
>
> This requires a mmu_notifier instance per attached mm and that
> __mmu_notifier_invalidate_range_start() uses rcu_dereference() to
> obtain the function.
>
> But I think it's enough to ensure that:
>
>   for each start an end will be called

We don't need that, it's perfectly ok if start is called but end is
not, it's ok to unregister in the middle as I guarantee ->release is
called before mmu_notifier_unregister returns (if ->release is needed
at all, not the case for KVM/GRU).

Unregister is already solved with srcu/rcu without any additional
complication as we don't need the guarantee that for each start an end
will be called.

> It can however happen that end is called without start - but we could
> handle that I think.

The only reason mm_lock() was introduced is to solve "register", to
guarantee that for each end there was a start. We can't handle end
called without start in the driver.

The reason the driver must be prevented from registering in the middle
of start/end is that, if that ever happens, the driver has no way to
know it must stop the secondary mmu page faults from calling
get_user_pages and instantiating sptes/secondarytlbs on pages that will
be freed as soon as zap_page_range starts.
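To make that invariant concrete: a secondary-MMU driver typically
refuses to instantiate sptes while any invalidate_range_begin/end
critical section is in flight, which only works if every end() it sees
had a matching start(). A minimal sketch (all drv_* names are
hypothetical, not from any posted driver):

static atomic_t drv_range_count = ATOMIC_INIT(0);

static void drv_range_start(struct mmu_notifier *mn, struct mm_struct *mm,
                            unsigned long start, unsigned long end)
{
        atomic_inc(&drv_range_count);   /* secondary page faults hold off */
        drv_zap_sptes(mm, start, end);  /* hypothetical teardown helper */
}

static void drv_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
                          unsigned long start, unsigned long end)
{
        /*
         * An end() without a matching start() would drive the count
         * negative and let the fault path map pages the VM is still
         * freeing - exactly the case the driver cannot handle.
         */
        atomic_dec(&drv_range_count);
}

static int drv_fault_allowed(void)
{
        return atomic_read(&drv_range_count) == 0;
}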
From a.p.zijlstra at chello.nl Fri May 9 12:04:47 2008 From: a.p.zijlstra at chello.nl (Peter Zijlstra) Date: Fri, 09 May 2008 21:04:47 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080509185553.GF7710@duo.random> References: <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> <20080508052019.GA8276@duo.random> <1210358249.13978.275.camel@twins> <20080509185553.GF7710@duo.random> Message-ID: <1210359887.6524.0.camel@lappy.programming.kicks-ass.net> On Fri, 2008-05-09 at 20:55 +0200, Andrea Arcangeli wrote: > On Fri, May 09, 2008 at 08:37:29PM +0200, Peter Zijlstra wrote: > > Another possibility, would something like this work? > > > > > > /* > > * null out the begin function, no new begin calls can be made > > */ > > rcu_assing_pointer(my_notifier.invalidate_start_begin, NULL); > > > > /* > > * lock/unlock all rmap locks in any order - this ensures that any > > * pending start() will have its end() function called. > > */ > > mm_barrier(mm); > > > > /* > > * now that no new start() call can be made and all start()/end() pairs > > * are complete we can remove the notifier. > > */ > > mmu_notifier_remove(mm, my_notifier); > > > > > > This requires a mmu_notifier instance per attached mm and that > > __mmu_notifier_invalidate_range_start() uses rcu_dereference() to obtain > > the function. > > > > But I think its enough to ensure that: > > > > for each start an end will be called > > We don't need that, it's perfectly ok if start is called but end is > not, it's ok to unregister in the middle as I guarantee ->release is > called before mmu_notifier_unregister returns (if ->release is needed > at all, not the case for KVM/GRU). > > Unregister is already solved with srcu/rcu without any additional > complication as we don't need the guarantee that for each start an end > will be called. > > > It can however happen that end is called without start - but we could > > handle that I think. > > The only reason mm_lock() was introduced is to solve "register", to > guarantee that for each end there was a start. We can't handle end > called without start in the driver. > > The reason the driver must be prevented to register in the middle of > start/end, if that if it ever happens the driver has no way to know it > must stop the secondary mmu page faults to call get_user_pages and > instantiate sptes/secondarytlbs on pages that will be freed as soon as > zap_page_range starts. Right - then I got it backwards. Never mind me then.. From akpm at linux-foundation.org Fri May 9 12:05:29 2008 From: akpm at linux-foundation.org (Andrew Morton) Date: Fri, 9 May 2008 12:05:29 -0700 Subject: [ofa-general] bitops take an unsigned long * In-Reply-To: References: <20080508222916.277649ca.akpm@linux-foundation.org> Message-ID: <20080509120529.a9e616e3.akpm@linux-foundation.org> On Fri, 09 May 2008 09:36:35 -0700 Roland Dreier wrote: > > Most architectures could (and should) take an unsigned long * arg for their > > bitops. x86 doesn't do this and it needs fixing. I fixed it. Infiniband > > is being a problem. > > Is your fix available somewhere? Would like to check any patches I make. It needs some preparatory patches, otherwise you'll be looking through thousands of warnings. 
At http://userweb.kernel.org/~akpm/mmotm/broken-out/ we have

arch-x86-mm-patc-use-boot_cpu_has.patch
x86-setup_force_cpu_cap-dont-do-clear_bitnon-unsigned-long.patch
lguest-use-cpu-capability-accessors.patch
x86-set_restore_sigmask-avoid-bitop-on-a-u32.patch

and then the conversion patch:

x86-bitops-take-an-unsigned-long.patch

> > It would be nice to get it fixed up, please.
>
> Will take a look.

Thanks.

> A few non-ipath warnings in the spew:
>
> > drivers/infiniband/hw/mlx4/qp.c: In function 'mlx4_ib_post_send':
> > drivers/infiniband/hw/mlx4/qp.c:1460: warning: 'seglen' may be used uninitialized in this function

That's a falsie: gcc assumes that foo(&var) doesn't write to `var' :(

> which gcc version is giving this?

4.0.2 I think.

> > drivers/char/epca.c:2542: warning: 'epca_setup' defined but not used
>
> ...this got lost in the noise.

Interesting, thanks.  I'll bug Alan about that.

From andrea at qumranet.com Fri May 9 12:32:30 2008
From: andrea at qumranet.com (Andrea Arcangeli)
Date: Fri, 9 May 2008 21:32:30 +0200
Subject: [ofa-general] [PATCH 001/001] mmu-notifier-core v17
Message-ID: <20080509193230.GH7710@duo.random>

From: Andrea Arcangeli

With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to
pages. There are secondary MMUs (with secondary sptes and secondary
tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
spte in mmu-notifier context, I mean "secondary pte". In the GRU case
there's no actual secondary pte and there's only a secondary tlb,
because the GRU secondary MMU has no knowledge about sptes and every
secondary tlb miss event in the MMU always generates a page fault that
has to be resolved by the CPU (this is not the case for KVM, where a
secondary tlb miss will walk sptes in hardware and refill the secondary
tlb transparently to software if the corresponding spte is present).

The same way zap_page_range has to invalidate the pte before freeing
the page, the spte (and secondary tlb) must also be invalidated before
any page is freed and reused.

Currently we take a page_count pin on every page mapped by sptes, but
that means the pages can't be swapped whenever they're mapped by any
spte because they're part of the guest working set. Furthermore a spte
unmap event can immediately lead to a page being freed when the pin is
released (so requiring the same complex and relatively slow tlb_gather
smp-safe logic we have in zap_page_range, which can be avoided
completely if the spte unmap event doesn't require an unpin of the page
previously mapped in the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary MMU
so that the secondary MMU code can drop sptes before the pages are
freed, avoiding all page pinning and allowing 100% reliable swapping of
guest physical address space. Furthermore it avoids requiring the code
that tears down the secondary MMU mappings to implement logic like
tlb_gather in zap_page_range, which would require many IPIs to flush
other cpu tlbs for each fixed number of sptes unmapped.

To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings will
be invalidated, and the next secondary-mmu page fault will call
get_user_pages and trigger a do_wp_page if it calls get_user_pages with
write=1, and it'll re-establish an updated spte or
secondary-tlb-mapping on the copied page.
Or it will set up a readonly spte or readonly tlb mapping, if it's a
guest read and it calls get_user_pages with write=0. This is just an
example.

This allows mapping any page pointed to by any pte (and in turn visible
in the primary CPU MMU) into a secondary MMU (be it a pure tlb like
GRU, or a full MMU with both sptes and a secondary tlb like the
shadow-pagetable layer with kvm), or into a remote DMA in software like
XPMEM (hence the need to schedule in XPMEM code to send the invalidate
to the remote node, while there is no need to schedule in kvm/gru, as
it's an immediate event like invalidating a primary-mmu pte).

At least for KVM, without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.

Dependencies:

1) Introduces list_del_init_rcu and documents it (fixes a comment for
list_del_rcu too)

2) mm_take_all_locks() to register the mmu notifier when the whole VM
isn't doing anything with "mm". This allows mmu notifier users to keep
track of whether the VM is in the middle of the
invalidate_range_begin/end critical section with an atomic counter
increase in range_begin and decrease in range_end. No secondary MMU
page fault is allowed to map any spte or secondary tlb reference while
the VM is in the middle of range_begin/end, as any page returned by
get_user_pages in that critical section could later be freed
immediately without any further ->invalidate_page notification
(invalidate_range_begin/end works on ranges and ->invalidate_page isn't
called immediately before freeing the page). To stop all page freeing
and pagetable overwrites the mmap_sem must be taken in write mode and
all other anon_vma/i_mmap locks must be taken too.

3) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled
if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage
of mmu notifiers, but this already allows compiling a KVM external
module against a kernel with mmu notifiers enabled, and from the next
pull from kvm.git we'll start using them. And GRU/XPMEM will also be
able to continue development by enabling KVM=m in their config, until
they submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they
can also enable MMU_NOTIFIERS in the same way KVM does it (even if
KVM=n). This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU
and XPMEM are all =n.

The mmu_notifier_register call can fail because mm_take_all_locks may
be interrupted by a signal and return -EINTR. Because
mmu_notifier_register is called at driver startup, a failure can be
handled gracefully. Here is an example of the change applied to kvm to
register the mmu notifiers. Usually when a driver starts up, other
allocations are required anyway, and -ENOMEM failure paths exist
already.

 struct kvm *kvm_arch_create_vm(void)
 {
 	struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+	int err;

 	if (!kvm)
 		return ERR_PTR(-ENOMEM);

 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

+	kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+	err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+	if (err) {
+		kfree(kvm);
+		return ERR_PTR(err);
+	}
+
 	return kvm;
 }

mmu_notifier_unregister returns void and it's reliable.

Signed-off-by: Andrea Arcangeli
Signed-off-by: Nick Piggin
Signed-off-by: Christoph Lameter

---

Full patchset is here:

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc1/mmu-notifier-v17

Thanks!
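For drivers other than kvm the shape is the same: fill in an ops table
and register against the mm, handling the error return. A minimal
sketch (the drv_* names are hypothetical and reuse the
drv_range_start/drv_range_end handlers sketched earlier in the thread):

static const struct mmu_notifier_ops drv_mn_ops = {
	.invalidate_range_start	= drv_range_start,
	.invalidate_range_end	= drv_range_end,
};

static struct mmu_notifier drv_mn;

static int drv_attach_mm(struct mm_struct *mm)
{
	drv_mn.ops = &drv_mn_ops;
	/* may fail with -EINTR (signal during mm_take_all_locks) or
	 * -ENOMEM; the caller unwinds like any other startup failure */
	return mmu_notifier_register(&drv_mn, mm);
}

Teardown is the reverse: mmu_notifier_unregister(&drv_mn, mm), which
per the changelog above returns void and is reliable.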
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -21,6 +21,7 @@ config KVM tristate "Kernel-based Virtual Machine (KVM) support" depends on HAVE_KVM select PREEMPT_NOTIFIERS + select MMU_NOTIFIER select ANON_INODES ---help--- Support hosting fully virtualized guest machines using hardware diff --git a/include/linux/list.h b/include/linux/list.h --- a/include/linux/list.h +++ b/include/linux/list.h @@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis * or hlist_del_rcu(), running on this same list. * However, it is perfectly legal to run concurrently with * the _rcu list-traversal primitives, such as - * hlist_for_each_entry(). + * hlist_for_each_entry_rcu(). */ static inline void hlist_del_rcu(struct hlist_node *n) { @@ -760,6 +760,34 @@ static inline void hlist_del_init(struct if (!hlist_unhashed(n)) { __hlist_del(n); INIT_HLIST_NODE(n); + } +} + +/** + * hlist_del_init_rcu - deletes entry from hash list with re-initialization + * @n: the element to delete from the hash list. + * + * Note: list_unhashed() on the node return true after this. It is + * useful for RCU based read lockfree traversal if the writer side + * must know if the list entry is still hashed or already unhashed. + * + * In particular, it means that we can not poison the forward pointers + * that may still be used for walking the hash list and we can only + * zero the pprev pointer so list_unhashed() will return true after + * this. + * + * The caller must take whatever precautions are necessary (such as + * holding appropriate locks) to avoid racing with another + * list-mutation primitive, such as hlist_add_head_rcu() or + * hlist_del_rcu(), running on this same list. However, it is + * perfectly legal to run concurrently with the _rcu list-traversal + * primitives, such as hlist_for_each_entry_rcu(). + */ +static inline void hlist_del_init_rcu(struct hlist_node *n) +{ + if (!hlist_unhashed(n)) { + __hlist_del(n); + n->pprev = NULL; } } diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1067,6 +1067,9 @@ extern struct vm_area_struct *copy_vma(s unsigned long addr, unsigned long len, pgoff_t pgoff); extern void exit_mmap(struct mm_struct *); +extern int mm_take_all_locks(struct mm_struct *mm); +extern void mm_drop_all_locks(struct mm_struct *mm); + #ifdef CONFIG_PROC_FS /* From fs/proc/base.c. 
callers must _not_ hold the mm's exe_file_lock */ extern void added_exe_file_vma(struct mm_struct *mm); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -10,6 +10,7 @@ #include #include #include +#include #include #include @@ -19,6 +20,7 @@ #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1)) struct address_space; +struct mmu_notifier_mm; #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS typedef atomic_long_t mm_counter_t; @@ -235,6 +237,9 @@ struct mm_struct { struct file *exe_file; unsigned long num_exe_file_vmas; #endif +#ifdef CONFIG_MMU_NOTIFIER + struct mmu_notifier_mm *mmu_notifier_mm; +#endif }; #endif /* _LINUX_MM_TYPES_H */ diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h new file mode 100644 --- /dev/null +++ b/include/linux/mmu_notifier.h @@ -0,0 +1,279 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +#include +#include +#include + +struct mmu_notifier; +struct mmu_notifier_ops; + +#ifdef CONFIG_MMU_NOTIFIER + +/* + * The mmu notifier_mm structure is allocated and installed in + * mm->mmu_notifier_mm inside the mm_take_all_locks() protected + * critical section and it's released only when mm_count reaches zero + * in mmdrop(). + */ +struct mmu_notifier_mm { + /* all mmu notifiers registerd in this mm are queued in this list */ + struct hlist_head list; + /* to serialize the list modifications and hlist_unhashed */ + spinlock_t lock; +}; + +struct mmu_notifier_ops { + /* + * Called either by mmu_notifier_unregister or when the mm is + * being destroyed by exit_mmap, always before all pages are + * freed. This can run concurrently with other mmu notifier + * methods (the ones invoked outside the mm context) and it + * should tear down all secondary mmu mappings and freeze the + * secondary mmu. If this method isn't implemented you've to + * be sure that nothing could possibly write to the pages + * through the secondary mmu by the time the last thread with + * tsk->mm == mm exits. + * + * As side note: the pages freed after ->release returns could + * be immediately reallocated by the gart at an alias physical + * address with a different cache model, so if ->release isn't + * implemented because all _software_ driven memory accesses + * through the secondary mmu are terminated by the time the + * last thread of this mm quits, you've also to be sure that + * speculative _hardware_ operations can't allocate dirty + * cachelines in the cpu that could not be snooped and made + * coherent with the other read and write operations happening + * through the gart alias address, so leading to memory + * corruption. + */ + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + + /* + * clear_flush_young is called after the VM is + * test-and-clearing the young/accessed bitflag in the + * pte. This way the VM will provide proper aging to the + * accesses to the page through the secondary MMUs and not + * only to the ones through the Linux pte. + */ + int (*clear_flush_young)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * Before this is invoked any secondary MMU is still ok to + * read/write to the page previously pointed to by the Linux + * pte because the page hasn't been freed yet and it won't be + * freed until this returns. If required set_page_dirty has to + * be called internally to this method. 
+ */ + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * invalidate_range_start() and invalidate_range_end() must be + * paired and are called only when the mmap_sem and/or the + * locks protecting the reverse maps are held. The subsystem + * must guarantee that no additional references are taken to + * the pages in the range established between the call to + * invalidate_range_start() and the matching call to + * invalidate_range_end(). + * + * Invalidation of multiple concurrent ranges may be + * optionally permitted by the driver. Either way the + * establishment of sptes is forbidden in the range passed to + * invalidate_range_begin/end for the whole duration of the + * invalidate_range_begin/end critical section. + * + * invalidate_range_start() is called when all pages in the + * range are still mapped and have at least a refcount of one. + * + * invalidate_range_end() is called when all pages in the + * range have been unmapped and the pages have been freed by + * the VM. + * + * The VM will remove the page table entries and potentially + * the page between invalidate_range_start() and + * invalidate_range_end(). If the page must not be freed + * because of pending I/O or other circumstances then the + * invalidate_range_start() callback (or the initial mapping + * by the driver) must make sure that the refcount is kept + * elevated. + * + * If the driver increases the refcount when the pages are + * initially mapped into an address space then either + * invalidate_range_start() or invalidate_range_end() may + * decrease the refcount. If the refcount is decreased on + * invalidate_range_start() then the VM can free pages as page + * table entries are removed. If the refcount is only + * droppped on invalidate_range_end() then the driver itself + * will drop the last refcount but it must take care to flush + * any secondary tlb before doing the final free on the + * page. Pages will no longer be referenced by the linux + * address space but may still be referenced by sptes until + * the last refcount is dropped. + */ + void (*invalidate_range_start)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_range_end)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); +}; + +/* + * The notifier chains are protected by mmap_sem and/or the reverse map + * semaphores. Notifier chains are only changed when all reverse maps and + * the mmap_sem locks are taken. + * + * Therefore notifier chains can only be traversed when either + * + * 1. mmap_sem is held. + * 2. One of the reverse map locks is held (i_mmap_lock or anon_vma->lock). + * 3. 
No other concurrent thread can access the list (release) + */ +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_ops *ops; +}; + +static inline int mm_has_notifiers(struct mm_struct *mm) +{ + return unlikely(mm->mmu_notifier_mm); +} + +extern int mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +extern int __mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void __mmu_notifier_mm_destroy(struct mm_struct *mm); +extern void __mmu_notifier_release(struct mm_struct *mm); +extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end); +extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end); + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_release(mm); +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + return __mmu_notifier_clear_flush_young(mm, address); + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_page(mm, address); +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_start(mm, start, end); +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_end(mm, start, end); +} + +static inline void mmu_notifier_mm_init(struct mm_struct *mm) +{ + mm->mmu_notifier_mm = NULL; +} + +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_mm_destroy(mm); +} + +/* + * These two macros will sometime replace ptep_clear_flush. + * ptep_clear_flush is impleemnted as macro itself, so this also is + * implemented as a macro until ptep_clear_flush will converted to an + * inline function, to diminish the risk of compilation failure. The + * invalidate_page method over time can be moved outside the PT lock + * and these two macros can be later removed. 
+ */ +#define ptep_clear_flush_notify(__vma, __address, __ptep) \ +({ \ + pte_t __pte; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __pte = ptep_clear_flush(___vma, ___address, __ptep); \ + mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ + __pte; \ +}) + +#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ +({ \ + int __young; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ + __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \ + ___address); \ + __young; \ +}) + +#else /* CONFIG_MMU_NOTIFIER */ + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_mm_init(struct mm_struct *mm) +{ +} + +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) +{ +} + +#define ptep_clear_flush_young_notify ptep_clear_flush_young +#define ptep_clear_flush_notify ptep_clear_flush + +#endif /* CONFIG_MMU_NOTIFIER */ + +#endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -19,6 +19,7 @@ */ #define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */ #define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */ +#define AS_MM_ALL_LOCKS (__GFP_BITS_SHIFT + 2) /* under mm_take_all_locks() */ static inline void mapping_set_error(struct address_space *mapping, int error) { diff --git a/include/linux/rmap.h b/include/linux/rmap.h --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -26,6 +26,14 @@ */ struct anon_vma { spinlock_t lock; /* Serialize access to vma list */ + /* + * NOTE: the LSB of the head.next is set by + * mm_take_all_locks() _after_ taking the above lock. So the + * head must only be read/written after taking the above lock + * to be sure to see a valid next pointer. The LSB bit itself + * is serialized by a system wide lock only visible to + * mm_take_all_locks() (mm_all_locks_mutex). 
+ */ struct list_head head; /* List of private "related" vmas */ }; diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -54,6 +54,7 @@ #include #include #include +#include #include #include @@ -386,6 +387,7 @@ static struct mm_struct * mm_init(struct if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; + mmu_notifier_mm_init(mm); return mm; } @@ -418,6 +420,7 @@ void __mmdrop(struct mm_struct *mm) BUG_ON(mm == &init_mm); mm_free_pgd(mm); destroy_context(mm); + mmu_notifier_mm_destroy(mm); free_mm(mm); } EXPORT_SYMBOL_GPL(__mmdrop); diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -205,3 +205,6 @@ config VIRT_TO_BUS config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MMU_NOTIFIER + bool diff --git a/mm/Makefile b/mm/Makefile --- a/mm/Makefile +++ b/mm/Makefile @@ -33,4 +33,5 @@ obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -188,7 +188,7 @@ __xip_unmap (struct address_space * mapp if (pte) { /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); diff --git a/mm/fremap.c b/mm/fremap.c --- a/mm/fremap.c +++ b/mm/fremap.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns spin_unlock(&mapping->i_mmap_lock); } + mmu_notifier_invalidate_range_start(mm, start, start + size); err = populate_range(mm, vma, start, size, pgoff); + mmu_notifier_invalidate_range_end(mm, start, start + size); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include @@ -823,6 +824,7 @@ void __unmap_hugepage_range(struct vm_ar BUG_ON(start & ~HPAGE_MASK); BUG_ON(end & ~HPAGE_MASK); + mmu_notifier_invalidate_range_start(mm, start, end); spin_lock(&mm->page_table_lock); for (address = start; address < end; address += HPAGE_SIZE) { ptep = huge_pte_offset(mm, address); @@ -843,6 +845,7 @@ void __unmap_hugepage_range(struct vm_ar } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mmu_notifier_invalidate_range_end(mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -51,6 +51,7 @@ #include #include #include +#include #include #include @@ -632,6 +633,7 @@ int copy_page_range(struct mm_struct *ds unsigned long next; unsigned long addr = vma->vm_start; unsigned long end = vma->vm_end; + int ret; /* * Don't copy ptes where a page fault will fill them correctly. @@ -647,17 +649,33 @@ int copy_page_range(struct mm_struct *ds if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); + /* + * We need to invalidate the secondary MMU mappings only when + * there could be a permission downgrade on the ptes of the + * parent mm. And a permission downgrade will only happen if + * is_cow_mapping() returns true. 
+ */ + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_start(src_mm, addr, end); + + ret = 0; dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(src_pgd)) continue; - if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, - vma, addr, next)) - return -ENOMEM; + if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, + vma, addr, next))) { + ret = -ENOMEM; + break; + } } while (dst_pgd++, src_pgd++, addr = next, addr != end); - return 0; + + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_end(src_mm, + vma->vm_start, end); + return ret; } static unsigned long zap_pte_range(struct mmu_gather *tlb, @@ -861,7 +879,9 @@ unsigned long unmap_vmas(struct mmu_gath unsigned long start = start_addr; spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL; int fullmm = (*tlbp)->fullmm; + struct mm_struct *mm = vma->vm_mm; + mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { unsigned long end; @@ -912,6 +932,7 @@ unsigned long unmap_vmas(struct mmu_gath } } out: + mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ } @@ -1544,10 +1565,11 @@ int apply_to_page_range(struct mm_struct { pgd_t *pgd; unsigned long next; - unsigned long end = addr + size; + unsigned long start = addr, end = addr + size; int err; BUG_ON(addr >= end); + mmu_notifier_invalidate_range_start(mm, start, end); pgd = pgd_offset(mm, addr); do { next = pgd_addr_end(addr, end); @@ -1555,6 +1577,7 @@ int apply_to_page_range(struct mm_struct if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier_invalidate_range_end(mm, start, end); return err; } EXPORT_SYMBOL_GPL(apply_to_page_range); @@ -1756,7 +1779,7 @@ gotten: * seen in the presence of one thread doing SMC and another * thread doing COW. */ - ptep_clear_flush(vma, address, page_table); + ptep_clear_flush_notify(vma, address, page_table); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -26,6 +26,7 @@ #include #include #include +#include #include #include @@ -2048,6 +2049,7 @@ void exit_mmap(struct mm_struct *mm) /* mm's last user has gone, and its about to be pulled down */ arch_exit_mmap(mm); + mmu_notifier_release(mm); lru_add_drain(); flush_cache_mm(mm); @@ -2255,3 +2257,152 @@ int install_special_mapping(struct mm_st return 0; } + +static DEFINE_MUTEX(mm_all_locks_mutex); + +/* + * This operation locks against the VM for all pte/vma/mm related + * operations that could ever happen on a certain mm. This includes + * vmtruncate, try_to_unmap, and all page faults. + * + * The caller must take the mmap_sem in write mode before calling + * mm_take_all_locks(). The caller isn't allowed to release the + * mmap_sem until mm_drop_all_locks() returns. + * + * mmap_sem in write mode is required in order to block all operations + * that could modify pagetables and free pages without need of + * altering the vma layout (for example populate_range() with + * nonlinear vmas). It's also needed in write mode to avoid new + * anon_vmas to be associated with existing vmas. + * + * A single task can't take more than one mm_take_all_locks() in a row + * or it would deadlock. 
+ * + * The LSB in anon_vma->head.next and the AS_MM_ALL_LOCKS bitflag in + * mapping->flags avoid taking the same lock twice, if more than one + * vma in this mm is backed by the same anon_vma or address_space. + * + * We can take all the locks in random order because the VM code + * taking i_mmap_lock or anon_vma->lock outside the mmap_sem never + * takes more than one of them in a row. Secondly we're protected + * against a concurrent mm_take_all_locks() by the mm_all_locks_mutex. + * + * mm_take_all_locks() and mm_drop_all_locks are expensive operations + * that may have to take thousands of locks. + * + * mm_take_all_locks() can fail if it's interrupted by signals. + */ +int mm_take_all_locks(struct mm_struct *mm) +{ + struct vm_area_struct *vma; + int ret = -EINTR; + + BUG_ON(down_read_trylock(&mm->mmap_sem)); + + mutex_lock(&mm_all_locks_mutex); + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + struct file *filp; + if (signal_pending(current)) + goto out_unlock; + if (vma->anon_vma && !test_bit(0, (unsigned long *) + &vma->anon_vma->head.next)) { + /* + * The LSB of head.next can't change from + * under us because we hold the + * global_mm_spinlock. + */ + spin_lock(&vma->anon_vma->lock); + /* + * We can safely modify head.next after taking + * the anon_vma->lock. If some other vma in + * this mm shares the same anon_vma we won't + * take it again. + * + * No need for atomic instructions here, + * head.next can't change from under us thanks + * to the anon_vma->lock. + */ + if (__test_and_set_bit(0, (unsigned long *) + &vma->anon_vma->head.next)) + BUG(); + } + + filp = vma->vm_file; + if (filp && filp->f_mapping && + !test_bit(AS_MM_ALL_LOCKS, &filp->f_mapping->flags)) { + /* + * AS_MM_ALL_LOCKS can't change from under us + * because we hold the global_mm_spinlock. + * + * Operations on ->flags have to be atomic + * because even if AS_MM_ALL_LOCKS is stable + * thanks to the global_mm_spinlock, there may + * be other cpus changing other bitflags in + * parallel to us. + */ + if (test_and_set_bit(AS_MM_ALL_LOCKS, + &filp->f_mapping->flags)) + BUG(); + spin_lock(&filp->f_mapping->i_mmap_lock); + } + } + ret = 0; + +out_unlock: + if (ret) + mm_drop_all_locks(mm); + + return ret; +} + +/* + * The mmap_sem cannot be released by the caller until + * mm_drop_all_locks() returns. + */ +void mm_drop_all_locks(struct mm_struct *mm) +{ + struct vm_area_struct *vma; + + BUG_ON(down_read_trylock(&mm->mmap_sem)); + BUG_ON(!mutex_is_locked(&mm_all_locks_mutex)); + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + struct file *filp; + if (vma->anon_vma && + test_bit(0, (unsigned long *) + &vma->anon_vma->head.next)) { + /* + * The LSB of head.next can't change to 0 from + * under us because we hold the + * global_mm_spinlock. + * + * We must however clear the bitflag before + * unlocking the vma so the users using the + * anon_vma->head will never see our bitflag. + * + * No need for atomic instructions here, + * head.next can't change from under us until + * we release the anon_vma->lock. + */ + if (!__test_and_clear_bit(0, (unsigned long *) + &vma->anon_vma->head.next)) + BUG(); + spin_unlock(&vma->anon_vma->lock); + } + filp = vma->vm_file; + if (filp && filp->f_mapping && + test_bit(AS_MM_ALL_LOCKS, &filp->f_mapping->flags)) { + /* + * AS_MM_ALL_LOCKS can't change to 0 from under us + * because we hold the global_mm_spinlock. 
+ */ + spin_unlock(&filp->f_mapping->i_mmap_lock); + if (!test_and_clear_bit(AS_MM_ALL_LOCKS, + &filp->f_mapping->flags)) + BUG(); + } + } + + mutex_unlock(&mm_all_locks_mutex); +} diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c new file mode 100644 --- /dev/null +++ b/mm/mmu_notifier.c @@ -0,0 +1,276 @@ +/* + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. + * Copyright (C) 2008 SGI + * Christoph Lameter + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include +#include +#include +#include +#include +#include + +/* + * This function can't run concurrently against mmu_notifier_register + * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap + * runs with mm_users == 0. Other tasks may still invoke mmu notifiers + * in parallel despite there being no task using this mm any more, + * through the vmas outside of the exit_mmap context, such as with + * vmtruncate. This serializes against mmu_notifier_unregister with + * the mmu_notifier_mm->lock in addition to RCU and it serializes + * against the other mmu notifiers with RCU. struct mmu_notifier_mm + * can't go away from under us as exit_mmap holds an mm_count pin + * itself. + */ +void __mmu_notifier_release(struct mm_struct *mm) +{ + struct mmu_notifier *mn; + + spin_lock(&mm->mmu_notifier_mm->lock); + while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) { + mn = hlist_entry(mm->mmu_notifier_mm->list.first, + struct mmu_notifier, + hlist); + /* + * We arrived before mmu_notifier_unregister so + * mmu_notifier_unregister will do nothing other than + * wait for ->release to finish and for + * mmu_notifier_unregister to return. + */ + hlist_del_init_rcu(&mn->hlist); + /* + * RCU here will block mmu_notifier_unregister until + * ->release returns. + */ + rcu_read_lock(); + spin_unlock(&mm->mmu_notifier_mm->lock); + /* + * if ->release runs before mmu_notifier_unregister it + * must be handled as it's the only way for the driver + * to flush all existing sptes and stop the driver + * from establishing any more sptes before all the + * pages in the mm are freed. + */ + if (mn->ops->release) + mn->ops->release(mn, mm); + rcu_read_unlock(); + spin_lock(&mm->mmu_notifier_mm->lock); + } + spin_unlock(&mm->mmu_notifier_mm->lock); + + /* + * synchronize_rcu here prevents mmu_notifier_release from + * returning to exit_mmap (which would proceed to free all pages + * in the mm) until the ->release method returns, if it was + * invoked by mmu_notifier_unregister. + * + * The mmu_notifier_mm can't go away from under us because one + * mm_count is held by exit_mmap. + */ + synchronize_rcu(); +} + +/* + * If no young bitflag is supported by the hardware, ->clear_flush_young can + * unmap the address and return 1 or 0 depending on whether the mapping + * previously existed or not. 
+ */ +int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int young = 0; + + rcu_read_lock(); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->clear_flush_young) + young |= mn->ops->clear_flush_young(mn, mm, address); + } + rcu_read_unlock(); + + return young; +} + +void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + + rcu_read_lock(); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_page) + mn->ops->invalidate_page(mn, mm, address); + } + rcu_read_unlock(); +} + +void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + + rcu_read_lock(); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_range_start) + mn->ops->invalidate_range_start(mn, mm, start, end); + } + rcu_read_unlock(); +} + +void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + + rcu_read_lock(); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_range_end) + mn->ops->invalidate_range_end(mn, mm, start, end); + } + rcu_read_unlock(); +} + +static int do_mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm, + int take_mmap_sem) +{ + struct mmu_notifier_mm * mmu_notifier_mm; + int ret; + + BUG_ON(atomic_read(&mm->mm_users) <= 0); + + ret = -ENOMEM; + mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL); + if (unlikely(!mmu_notifier_mm)) + goto out; + + if (take_mmap_sem) + down_write(&mm->mmap_sem); + ret = mm_take_all_locks(mm); + if (unlikely(ret)) + goto out_cleanup; + + if (!mm_has_notifiers(mm)) { + INIT_HLIST_HEAD(&mmu_notifier_mm->list); + spin_lock_init(&mmu_notifier_mm->lock); + mm->mmu_notifier_mm = mmu_notifier_mm; + mmu_notifier_mm = NULL; + } + atomic_inc(&mm->mm_count); + + /* + * Serialize the update against mmu_notifier_unregister. A + * side note: mmu_notifier_release can't run concurrently with + * us because we hold the mm_users pin (either implicitly as + * current->mm or explicitly with get_task_mm() or similar). + * We can't race against any other mmu notifier method either + * thanks to mm_take_all_locks(). + */ + spin_lock(&mm->mmu_notifier_mm->lock); + hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list); + spin_unlock(&mm->mmu_notifier_mm->lock); + + mm_drop_all_locks(mm); +out_cleanup: + if (take_mmap_sem) + up_write(&mm->mmap_sem); + /* kfree() does nothing if mmu_notifier_mm is NULL */ + kfree(mmu_notifier_mm); +out: + BUG_ON(atomic_read(&mm->mm_users) <= 0); + return ret; +} + +/* + * Must not hold mmap_sem nor any other VM related lock when calling + * this registration function. Must also ensure mm_users can't go down + * to zero while this runs to avoid races with mmu_notifier_release, + * so mm has to be current->mm or the mm should be pinned safely such + * as with get_task_mm(). If the mm is not current->mm, the mm_users + * pin should be released by calling mmput after mmu_notifier_register + * returns. mmu_notifier_unregister must always be called to + * unregister the notifier. 
mm_count is automatically pinned to allow + * mmu_notifier_unregister to safely run at any time later, before or + * after exit_mmap. ->release will always be called before exit_mmap + * frees the pages. + */ +int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + return do_mmu_notifier_register(mn, mm, 1); +} +EXPORT_SYMBOL_GPL(mmu_notifier_register); + +/* + * Same as mmu_notifier_register but here the caller must hold the + * mmap_sem in write mode. + */ +int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + return do_mmu_notifier_register(mn, mm, 0); +} +EXPORT_SYMBOL_GPL(__mmu_notifier_register); + +/* this is called after the last mmu_notifier_unregister() has returned */ +void __mmu_notifier_mm_destroy(struct mm_struct *mm) +{ + BUG_ON(!hlist_empty(&mm->mmu_notifier_mm->list)); + kfree(mm->mmu_notifier_mm); + mm->mmu_notifier_mm = LIST_POISON1; /* debug */ +} + +/* + * This releases the mm_count pin automatically and frees the mm + * structure if it was the last user of it. It serializes against + * running mmu notifiers with RCU and against mmu_notifier_unregister + * with the unregister lock + RCU. All sptes must be dropped before + * calling mmu_notifier_unregister. ->release or any other notifier + * method may be invoked concurrently with mmu_notifier_unregister, + * and only after mmu_notifier_unregister returns are we guaranteed + * that ->release or any other method can't run anymore. + */ +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) +{ + BUG_ON(atomic_read(&mm->mm_count) <= 0); + + spin_lock(&mm->mmu_notifier_mm->lock); + if (!hlist_unhashed(&mn->hlist)) { + hlist_del_rcu(&mn->hlist); + + /* + * RCU here will force exit_mmap to wait for ->release to finish + * before freeing the pages. + */ + rcu_read_lock(); + spin_unlock(&mm->mmu_notifier_mm->lock); + /* + * exit_mmap will block in mmu_notifier_release to + * guarantee ->release is called before freeing the + * pages. + */ + if (mn->ops->release) + mn->ops->release(mn, mm); + rcu_read_unlock(); + } else + spin_unlock(&mm->mmu_notifier_mm->lock); + + /* + * Wait for any running method to finish, of course including + * ->release if it was run by mmu_notifier_release instead of us. 
+ */ + synchronize_rcu(); + + BUG_ON(atomic_read(&mm->mm_count) <= 0); + + mmdrop(mm); +} +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); diff --git a/mm/mprotect.c b/mm/mprotect.c --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -198,10 +199,12 @@ success: dirty_accountable = 1; } + mmu_notifier_invalidate_range_start(mm, start, end); if (is_vm_hugetlb_page(vma)) hugetlb_change_protection(vma, start, end, vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); + mmu_notifier_invalidate_range_end(mm, start, end); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); return 0; diff --git a/mm/mremap.c b/mm/mremap.c --- a/mm/mremap.c +++ b/mm/mremap.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -74,7 +75,11 @@ static void move_ptes(struct vm_area_str struct mm_struct *mm = vma->vm_mm; pte_t *old_pte, *new_pte, pte; spinlock_t *old_ptl, *new_ptl; + unsigned long old_start; + old_start = old_addr; + mmu_notifier_invalidate_range_start(vma->vm_mm, + old_start, old_end); if (vma->vm_file) { /* * Subtle point from Rajesh Venkatasubramanian: before @@ -116,6 +121,7 @@ static void move_ptes(struct vm_area_str pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) spin_unlock(&mapping->i_mmap_lock); + mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); } #define LATENCY_LIMIT (64 * PAGE_SIZE) diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -49,6 +49,7 @@ #include #include #include +#include #include @@ -287,7 +288,7 @@ static int page_referenced_one(struct pa if (vma->vm_flags & VM_LOCKED) { referenced++; *mapcount = 1; /* break early from loop */ - } else if (ptep_clear_flush_young(vma, address, pte)) + } else if (ptep_clear_flush_young_notify(vma, address, pte)) referenced++; /* Pretend the page is referenced if the task has the @@ -457,7 +458,7 @@ static int page_mkclean_one(struct page pte_t entry; flush_cache_page(vma, address, pte_pfn(*pte)); - entry = ptep_clear_flush(vma, address, pte); + entry = ptep_clear_flush_notify(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -717,14 +718,14 @@ static int try_to_unmap_one(struct page * skipped over this mm) then we should reactivate it. */ if (!migration && ((vma->vm_flags & VM_LOCKED) || - (ptep_clear_flush_young(vma, address, pte)))) { + (ptep_clear_flush_young_notify(vma, address, pte)))) { ret = SWAP_FAIL; goto out_unmap; } /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -849,12 +850,12 @@ static void try_to_unmap_cluster(unsigne page = vm_normal_page(vma, address, *pte); BUG_ON(!page || PageAnon(page)); - if (ptep_clear_flush_young(vma, address, pte)) + if (ptep_clear_flush_young_notify(vma, address, pte)) continue; /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* If nonlinear, store the file page offset in the pte. 
*/ if (page->index != linear_page_index(vma, address)) From swise at opengridcomputing.com Fri May 9 13:19:02 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 09 May 2008 15:19:02 -0500 Subject: [ofa-general] [PATCH 2.6.26] RDMA/cxgb3: Wrap the software sq ptr as needed on flush. Message-ID: <20080509201902.13077.53047.stgit@dell3.ogc.int> cxio_flush_sq() was failing to wrap around the sw-sq causing garbage completion entries on a flush operation. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/cxio_hal.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index 5fd8506..20a6326 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -405,11 +405,11 @@ int cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count) struct t3_swsq *sqp = wq->sq + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2); ptr = wq->sq_rptr + count; - sqp += count; + sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2); while (ptr != wq->sq_wptr) { insert_sq_cqe(wq, cq, sqp); - sqp++; ptr++; + sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2); flushed++; } return flushed; From rdreier at cisco.com Fri May 9 22:37:36 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 09 May 2008 22:37:36 -0700 Subject: [ofa-general] bitops take an unsigned long * In-Reply-To: <20080508222916.277649ca.akpm@linux-foundation.org> (Andrew Morton's message of "Thu, 8 May 2008 22:29:16 -0700") References: <20080508222916.277649ca.akpm@linux-foundation.org> Message-ID: > Most architectures could (and should) take an unsigned long * arg for their > bitops. x86 doesn't do this and it needs fixing. I fixed it. Infiniband > is being a problem. > drivers/infiniband/hw/ipath/ipath_driver.c: In function 'decode_sdma_errs': > drivers/infiniband/hw/ipath/ipath_driver.c:926: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:926: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c: In function 'ipath_cancel_sends': > drivers/infiniband/hw/ipath/ipath_driver.c:1901: warning: passing argument 2 of 'test_and_set_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:1902: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:1902: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:1934: warning: passing argument 2 of 'set_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:1949: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:1949: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:1950: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:1950: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:1955: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:1955: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > 
drivers/infiniband/hw/ipath/ipath_intr.c: In function 'handle_sdma_errors': > drivers/infiniband/hw/ipath/ipath_intr.c:553: warning: passing argument 2 of '__set_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_intr.c:554: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_intr.c:554: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_intr.c: In function 'handle_sdma_intr': > drivers/infiniband/hw/ipath/ipath_intr.c:566: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_intr.c:566: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_intr.c:570: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_intr.c:570: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_intr.c:575: warning: passing argument 2 of '__set_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_intr.c:579: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_intr.c:579: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_notify_task': > drivers/infiniband/hw/ipath/ipath_sdma.c:200: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:200: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_abort_task': > drivers/infiniband/hw/ipath/ipath_sdma.c:236: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:236: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:253: warning: passing argument 2 of '__set_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:269: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:269: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:271: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:271: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:273: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:273: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:354: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:354: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > 
drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'setup_sdma': > drivers/infiniband/hw/ipath/ipath_sdma.c:504: warning: passing argument 2 of '__set_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'teardown_sdma': > drivers/infiniband/hw/ipath/ipath_sdma.c:521: warning: passing argument 2 of '__clear_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:522: warning: passing argument 2 of '__set_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:523: warning: passing argument 2 of '__set_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'ipath_restart_sdma': > drivers/infiniband/hw/ipath/ipath_sdma.c:608: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:608: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:609: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:609: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:612: warning: passing argument 2 of '__clear_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:613: warning: passing argument 2 of '__clear_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:614: warning: passing argument 2 of '__clear_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'ipath_sdma_verbs_send': > drivers/infiniband/hw/ipath/ipath_sdma.c:691: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:691: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type So all of these are ipath warnings, seemingly all because ipath_devdata.ipath_sdma_status is a u64. The stupid fix is to change this declaration to unsigned long as below, but this sets a trap if the driver is ever fixed so that it doesn't depend on 64BIT, because of /* bit positions for sdma_status */ #define IPATH_SDMA_ABORTING 0 #define IPATH_SDMA_DISARMED 1 #define IPATH_SDMA_DISABLED 2 #define IPATH_SDMA_LAYERBUF 3 #define IPATH_SDMA_RUNNING 62 #define IPATH_SDMA_SHUTDOWN 63 I don't see that this status is shared with hardware, and I don't see why the RUNNING and SHUTDOWN bits need to be 62 and 63... converting to unsigned long and moving those to bits 4 and 5 seems like it might be a clean fix. The other option is to convert to a bitmap and using the bitmap operations, which ends up being a bigger patch. But since I don't really understand this part of the driver, some guidance would be helpful... - R. 
diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index ce7b7c3..7635ace 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -1894,7 +1894,7 @@ void ipath_cancel_sends(struct ipath_devdata *dd, int restore_sendctrl) */ if (dd->ipath_flags & IPATH_HAS_SEND_DMA) { int skip_cancel; - u64 *statp = &dd->ipath_sdma_status; + unsigned long *statp = &dd->ipath_sdma_status; spin_lock_irqsave(&dd->ipath_sdma_lock, flags); skip_cancel = diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index 02b24a3..a46f8ad 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -483,7 +483,7 @@ struct ipath_devdata { /* SendDMA related entries */ spinlock_t ipath_sdma_lock; - u64 ipath_sdma_status; + unsigned long ipath_sdma_status; unsigned long ipath_sdma_abort_jiffies; unsigned long ipath_sdma_abort_intr_timeout; unsigned long ipath_sdma_buf_jiffies; From akpm at linux-foundation.org Sat May 10 00:08:38 2008 From: akpm at linux-foundation.org (Andrew Morton) Date: Sat, 10 May 2008 00:08:38 -0700 Subject: [ofa-general] bitops take an unsigned long * In-Reply-To: References: <20080508222916.277649ca.akpm@linux-foundation.org> Message-ID: <20080510000838.e85f5d89.akpm@linux-foundation.org> On Fri, 09 May 2008 22:37:36 -0700 Roland Dreier wrote: > > Most architectures could (and should) take an unsigned long * arg for their > > bitops. x86 doesn't do this and it needs fixing. I fixed it. Infiniband > > is being a problem. > > ... > > > drivers/infiniband/hw/ipath/ipath_sdma.c:691: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:691: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > So all of these are ipath warnings, seemingly all because > ipath_devdata.ipath_sdma_status is a u64. The stupid fix is to change > this declaration to unsigned long as below, but this sets a trap if the > driver is ever fixed so that it doesn't depend on 64BIT, because of > > /* bit positions for sdma_status */ > #define IPATH_SDMA_ABORTING 0 > #define IPATH_SDMA_DISARMED 1 > #define IPATH_SDMA_DISABLED 2 > #define IPATH_SDMA_LAYERBUF 3 > #define IPATH_SDMA_RUNNING 62 > #define IPATH_SDMA_SHUTDOWN 63 > > I don't see that this status is shared with hardware, and I don't see > why the RUNNING and SHUTDOWN bits need to be 62 and 63... converting to > unsigned long and moving those to bits 4 and 5 seems like it might be a > clean fix. > > The other option is to convert to a bitmap and using the bitmap > operations, which ends up being a bigger patch. > > But since I don't really understand this part of the driver, some > guidance would be helpful... > Another option might be - u64 ipath_sdma_status; + unsigned long ipath_sdma_status[64/BITS_PER_LONG]; Because the bitops are OK for use against an _array_ of unsigned longs, not just a single unsigned long. Or, if you want to preserve that u64: union { u64 ipath_sdma_status; unsigned long ipath_sdma_status_bits[64/BITS_PER_LONG]; }; and do the bitops on ipath_sdma_status_bits. Or just remove all the set_bit/clear_bit/etc and use plain old |, &, etc. It all needs a bit of thought if you're supporting big-endian machines, however. 
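[A standalone illustration of the endianness point above. This is hypothetical demo code, not taken from the ipath driver: it applies set_bit()-style word-and-bit indexing to a u64 viewed as an array of unsigned long, then reads the same bit back with a plain u64 shift. The two views agree on little-endian systems and wherever long is 64 bits wide, but disagree for bit positions >= 32 on big-endian 32-bit machines, which is exactly why status bits 62 and 63 are the awkward ones.]

#include <stdio.h>
#include <stdint.h>
#include <limits.h>

int main(void)
{
	uint64_t status = 0;
	unsigned long *words = (unsigned long *) &status;	/* the bitops view */
	const int bit = 62;					/* e.g. IPATH_SDMA_RUNNING */

	/* what set_bit(bit, words) effectively does */
	words[bit / (sizeof(long) * CHAR_BIT)] |=
		1UL << (bit % (sizeof(long) * CHAR_BIT));

	/* prints 1 on little-endian (any word size) and on 64-bit
	 * big-endian, but 0 on big-endian 32-bit */
	printf("u64 view sees bit %d as %d\n", bit,
	       (int) ((status >> bit) & 1));
	return 0;
}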
From hrosenstock at xsigo.com Sat May 10 05:29:13 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Sat, 10 May 2008 05:29:13 -0700 Subject: [ofa-general] OpenSM and fat tree Message-ID: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> Hi Yevgeny, Is it possible that OpenSM's fat tree routing somehow depends on the LIDs previously assigned ? It seems that for a legitimate fat tree topology, the topology sometimes won't come up as a fat tree if reassigning LIDs (-r) is not used. In addition to -r making fat tree work, certain routing algorithms seem to also clear this out (without using -r). For example, if lash were run and then ftree, it seems to work without doing the -r. (Haven't yet tried updn). Any ideas on this ? Should a bug be filed on this ? Thanks. -- Hal From rdreier at cisco.com Sat May 10 09:02:09 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 10 May 2008 09:02:09 -0700 Subject: [ofa-general] Verbs: IB vs. iWARP In-Reply-To: <48248E58.8060301@opengridcomputing.com> (Steve Wise's message of "Fri, 09 May 2008 12:48:08 -0500") References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com> <48247B18.8000100@opengridcomputing.com> <48248E58.8060301@opengridcomputing.com> Message-ID: > One more item added to the wiki page: > > > * iWARP RNICs support a special privileged lkey == 0 which can be > used in local SGLs when the address is a bus/dma address. I think under Linux this is pretty much irrelevant, given that we have ib_get_dma_mr(). And the IB BMME define the same concept anyway. - R. From kliteyn at dev.mellanox.co.il Sat May 10 11:51:53 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sat, 10 May 2008 21:51:53 +0300 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> Message-ID: <4825EEC9.4070208@dev.mellanox.co.il> Hal Rosenstock wrote: > Hi Yevgeny, > > Is it possible that OpenSM's fat tree routing somehow depends on the > LIDs previously assigned ? It depends only on the existence of the LIDs. > It seems that for a legitimate fat tree topology, the topology sometimes > won't come up as a fat tree if reassigning LIDs (-r) is not used. That's odd... > In addition to -r making fat tree work, certain routing algorithms seem > to also clear this out (without using -r). For example, if lash were run > and then ftree, it seems to work without doing the -r. (Haven't yet > tried updn). > > Any ideas on this ? Should a bug be filed on this ? Thanks. No ideas whatsoever. Please file a bug on this. It would be nice if I could reproduce it in simulation. 
-- Yevgeny > -- Hal > > > From akepner at sgi.com Sat May 10 12:07:21 2008 From: akepner at sgi.com (akepner at sgi.com) Date: Sat, 10 May 2008 12:07:21 -0700 Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3) In-Reply-To: References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com> Message-ID: <20080510190721.GI5298@sgi.com> On Thu, May 08, 2008 at 10:50:11AM -0700, Roland Dreier wrote: > ... > It might be useful to track the value of tx_outstanding... from a quick > look at the code I can't see how the transmit queue could be awake when > the UD send queue is full. > I haven't been able to get any new debug data (the only way we know to reproduce this one is to use a pretty large system - a scarce resource), but it does look like there's a hole here, since ipoib_cm.c:ipoib_cm_send() and ipoib_ib.c:ipoib_send() check on different conditions (off by one) to detect a full queue. ipoib_cm.c:ipoib_cm_send() does: if (++priv->tx_outstanding == ipoib_sendq_size) netif_stop_queue(dev); but ipoib_ib.c:ipoib_send() does: if (++priv->tx_outstanding == (ipoib_sendq_size - 1)) { netif_stop_queue(dev); So a call to ipoib_cm_send() with tx_outstanding = (ipoib_sendq_size - 2), followed by a call to ipoib_send() would get to a situation where the queue was full, but not stopped. I'm not saying this is what's happening for us (just dunno yet) but it looks possible. -- Arthur From swise at opengridcomputing.com Sat May 10 16:18:45 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 10 May 2008 18:18:45 -0500 Subject: [ofa-general] Verbs: IB vs. iWARP In-Reply-To: References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com> <48247B18.8000100@opengridcomputing.com> <48248E58.8060301@opengridcomputing.com> Message-ID: <48262D55.80806@opengridcomputing.com> Roland Dreier wrote: > > One more item added to the wiki page: > > > > > > * iWARP RNICs support a special privileged lkey == 0 which can be > > used in local SGLs when the address is a bus/dma address. > > I think under Linux this is pretty much irrelevant, given that we have > ib_get_dma_mr(). And the IB BMME define the same concept anyway. > > - R. Its not irrelevant if someone tries to port an iwarp app to IB and said iwarp app uses lkey 0 everywhere... From jackm at dev.mellanox.co.il Sat May 10 22:49:24 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 11 May 2008 08:49:24 +0300 Subject: [ofa-general] [2.6.27 PATCH/RFC] IB/srp: Remove use of cached P_Key/GID queries In-Reply-To: References: Message-ID: <200805110849.25034.jackm@dev.mellanox.co.il> On Thursday 08 May 2008 01:22, Roland Dreier wrote: >  Since we want to eliminate the > cached operations in the long term, convert SRP to use the uncached > variants. Eliminating the caches will pose a performance problem when sending raw packets. The ib_post_send API provides the pkey_index -- and this needs to be translated to the actual p_key when building the Base Transport Header. 
- Jack From eli at dev.mellanox.co.il Sun May 11 01:18:19 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sun, 11 May 2008 11:18:19 +0300 Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3) In-Reply-To: <20080510190721.GI5298@sgi.com> References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com> <20080510190721.GI5298@sgi.com> Message-ID: <1210493899.15669.116.camel@mtls03> On Sat, 2008-05-10 at 12:07 -0700, akepner at sgi.com wrote: > I haven't been able to get any new debug data (the only way we > know to reproduce this one is to use a pretty large system - a > scarce resource), but it does look like there's a hole here, > since ipoib_cm.c:ipoib_cm_send() and ipoib_ib.c:ipoib_send() > check on different conditions (off by one) to detect a full > queue. > > ipoib_cm.c:ipoib_cm_send() does: > if (++priv->tx_outstanding == ipoib_sendq_size) > netif_stop_queue(dev); > > but ipoib_ib.c:ipoib_send() does: > if (++priv->tx_outstanding == (ipoib_sendq_size - 1)) { > netif_stop_queue(dev); > > So a call to ipoib_cm_send() with tx_outstanding = (ipoib_sendq_size - 2), > followed by a call to ipoib_send() would get to a situation where > the queue was full, but not stopped. The reason why the queue is stopped when there is one entry still left is to allow ipoib_ib_tx_timer_func() to post a special send request that will ensure a completion is reported for this operation thus freeing entries at the tx ring. I don't think the scenario you describe here can lead to a deadlock since if that happens, it will be released because of either one of the following two reasons: 1. If the tx queue contains not yet polled, more than one completion of send WRs posted by ipoib_cm_send(), they will soon be polled since they are posted to a signaled QP and sooner or later will generate completions and interrupts. In this case, subsequent postings to ipoib_send() will work as expected. 2. If there is only one outstanding ipoib_cm_send() at the tx queue, it means that there are 126 outstanding ipoib_send() requests at the tx queue and this means that a few of them are signaled and are expected to be completed soon. If you just want to make sure there is no bug in my theory you can just use this patch: Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2008-05-07 12:30:10.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2008-05-11 09:59:42.000000000 +0300 @@ -535,7 +535,9 @@ static inline int post_send(struct ipoib } else priv->tx_wr.opcode = IB_WR_SEND; - if (unlikely((priv->tx_head & (MAX_SEND_CQE - 1)) == MAX_SEND_CQE - 1)) + /* start forcing signaled if we get near queue full */ + if (unlikely((priv->tx_head & (MAX_SEND_CQE - 1)) == MAX_SEND_CQE - 1) || + priv->tx_outstanding > (ipoib_sendq_size - 5)) priv->tx_wr.send_flags |= IB_SEND_SIGNALED; else priv->tx_wr.send_flags &= ~IB_SEND_SIGNALED; And last, could you arrange a remote access to a machine in this condition so we could check the state of the device/FW? 
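[The off-by-one interaction Arthur described, and that Eli argues is harmless in practice, can be modeled in isolation. This is hypothetical demo code with made-up helpers that mimic the two stop-queue checks quoted above; it is not driver code, and it only shows the accounting, not whether real traffic can reach this interleaving.]

#include <stdio.h>
#include <stdbool.h>

enum { SENDQ_SIZE = 128 };	/* stands in for ipoib_sendq_size */

static int tx_outstanding;
static bool stopped;

/* mimics the check in ipoib_cm_send(): stop only when completely full */
static void cm_send(void)
{
	if (++tx_outstanding == SENDQ_SIZE)
		stopped = true;
}

/* mimics the check in ipoib_send(): stop one entry early, leaving room
 * for the completion-forcing WR posted by ipoib_ib_tx_timer_func() */
static void ud_send(void)
{
	if (++tx_outstanding == SENDQ_SIZE - 1)
		stopped = true;
}

int main(void)
{
	while (tx_outstanding < SENDQ_SIZE - 1)
		cm_send();	/* reaches size-1 without tripping either test */
	ud_send();		/* increments to size, skipping the size-1 test */

	/* prints outstanding=128 stopped=0: ring full, queue never stopped */
	printf("outstanding=%d stopped=%d\n", tx_outstanding, stopped);
	return 0;
}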
From olga.shern at gmail.com Sun May 11 02:38:50 2008 From: olga.shern at gmail.com (Olga Shern (Voltaire)) Date: Sun, 11 May 2008 12:38:50 +0300 Subject: [ofa-general] Re: [ewg] OFED May 5 meeting summary In-Reply-To: <6C2C79E72C305246B504CBA17B5500C90282E694@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C90282E694@mtlexch01.mtl.com> Message-ID: On 5/6/08, Tziporet Koren wrote: > > > May 5 OFED meeting summary: > =========================== > > 1. OFED 1.3.1: > 1.1 Status of changes: > IB-bonding - on work > SRP failover - done (need more testing) > SDP crashes - on work (not clear if we will have > something on time) > RDS fixes for RDMA API - done > librdmacm 1.0.7 - done > uDAPL updates - done > Open MPI 1.2.6 - done > MVAPICH 1.0.1 - done > MVAPICH2 1.0.3 - done > IPoIB - 2 bugs fixed. There are still two issue that > should be resolved. > Low level drivers: Changes that already committed: > nes > mlx4 > cxgb3 > ehca > > 1.2 Schedule: > rc1 - was released today > rc2 - May 20 > GA - May 29 > > 1.3 Discussion: > - ipath driver is going to be updated > - There is an issue of bonding and Ethernet drivers on RHEL4 - > under debug > - We wish to add support for SLES10 SP2. Already got an approval > from Novell > Any volunteer to provide the new backport patches? Tziporet, we will do it. Already started with it, seems like everything is compiled, need only backport bonding Olga 2. OFED 1.4: > Updated that the new tree will be ready next week - based on > 2.6.26-rc > > 3. Update on OpenSuSE build system - Yiftah updated on the work that is > done and problems: > - The system requires clean RPMs only (no use of install script) - > they work to resolve > - We target this system toward releases (and not to replace the daily > build system). > - we may try now with OFED 1.3.1 > > > Tziporet > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > -------------- next part -------------- An HTML attachment was scrubbed... URL: From moshek at voltaire.com Sun May 11 03:03:17 2008 From: moshek at voltaire.com (Moshe Kazir) Date: Sun, 11 May 2008 13:03:17 +0300 Subject: [ofa-general] OFED May 5 meeting summary In-Reply-To: <6C2C79E72C305246B504CBA17B5500C90282E694@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C90282E694@mtlexch01.mtl.com> Message-ID: <39C75744D164D948A170E9792AF8E7CAC5AF56@exil.voltaire.com> > - We wish to add support for SLES10 SP2. Already got an approval from Novell > Any volunteer to provide the new backport patches? I have checked OFED-1.3.1-rc1 on SLES10 SP 2 Beta3. ib-bonding compile failed. Everything else is compiled o.k. Attached : ib-bonding error log. I'll take the backport of ib-bonding to sles10 sp 2 on me (if needed, I'll get Moni's help). Moshe ____________________________________________________________ Moshe Katzir | +972-9971-8639 (o) | +972-52-860-6042 (m) Voltaire - The Grid Backbone www.voltaire.com -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Tziporet Koren Sent: Tuesday, May 06, 2008 6:45 PM To: Tziporet Koren; ewg at lists.openfabrics.org Cc: general at lists.openfabrics.org Subject: [ofa-general] OFED May 5 meeting summary May 5 OFED meeting summary: =========================== 1. 
OFED 1.3.1: 1.1 Status of changes: IB-bonding - on work SRP failover - done (need more testing) SDP crashes - on work (not clear if we will have something on time) RDS fixes for RDMA API - done librdmacm 1.0.7 - done uDAPL updates - done Open MPI 1.2.6 - done MVAPICH 1.0.1 - done MVAPICH2 1.0.3 - done IPoIB - 2 bugs fixed. There are still two issue that should be resolved. Low level drivers: Changes that already committed: nes mlx4 cxgb3 ehca 1.2 Schedule: rc1 - was released today rc2 - May 20 GA - May 29 1.3 Discussion: - ipath driver is going to be updated - There is an issue of bonding and Ethernet drivers on RHEL4 - under debug - We wish to add support for SLES10 SP2. Already got an approval from Novell Any volunteer to provide the new backport patches? 2. OFED 1.4: Updated that the new tree will be ready next week - based on 2.6.26-rc 3. Update on OpenSuSE build system - Yiftah updated on the work that is done and problems: - The system requires clean RPMs only (no use of install script) - they work to resolve - We target this system toward releases (and not to replace the daily build system). - we may try now with OFED 1.3.1 Tziporet _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- A non-text attachment was scrubbed... Name: ib-bonding.rpmbuild.log Type: application/octet-stream Size: 31538 bytes Desc: ib-bonding.rpmbuild.log URL: From akepner at sgi.com Sun May 11 03:23:45 2008 From: akepner at sgi.com (akepner at sgi.com) Date: Sun, 11 May 2008 03:23:45 -0700 Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3) In-Reply-To: <1210493899.15669.116.camel@mtls03> References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com> <20080510190721.GI5298@sgi.com> <1210493899.15669.116.camel@mtls03> Message-ID: <20080511102345.GJ5298@sgi.com> On Sun, May 11, 2008 at 11:18:19AM +0300, Eli Cohen wrote: > .... > The reason why the queue is stopped when there is one entry still left > is to allow ipoib_ib_tx_timer_func() to post a special send request that > will ensure a completion is reported for this operation thus freeing > entries at the tx ring. I don't think the scenario you describe here can > lead to a deadlock since if that happens, it will be released because of > either one of the following two reasons: > 1. If the tx queue contains not yet polled, more than one completion of > send WRs posted by ipoib_cm_send(), they will soon be polled since they > are posted to a signaled QP and sooner or later will generate > completions and interrupts. In this case, subsequent postings to > ipoib_send() will work as expected. > > 2. If there is only one outstanding ipoib_cm_send() at the tx queue, it > means that there are 126 outstanding ipoib_send() requests at the tx > queue and this means that a few of them are signaled and are expected to > be completed soon. Thanks for the explanation. The main problem that we're seeing is that we just stop getting completions for the send queue. (And we see this with OFED-1.2 and 1.3, which makes me think that it's unlikely to be due to the IPoIB driver since that's changed so much.) > ..... > And last, could you arrange a remote access to a machine in this > condition so we could check the state of the device/FW? > Yes, I think so. Let me see if I can arrange that. 
-- Arthur From erezz at voltaire.com Sun May 11 04:00:16 2008 From: erezz at voltaire.com (Erez Zilber) Date: Sun, 11 May 2008 14:00:16 +0300 Subject: [ofa-general] Moving responsibility for iSER & iSCSI related issues Message-ID: <4826D1C0.4040301@voltaire.com> Hi, After ~4 years of working on iSER & iSCSI, I'm moving on and will be involved from a different perspective. Therefore, I will be unable to continue my current maintainership responsibility for iSER related issues. I want to thank everyone for the great work that I had the chance to be part of. Eli Dorfman (elid at voltaire.com) will be taking over my maintainership of iSER code for kernel.org. Eli has already started doing that work. Doron Shoham (dorons at voltaire.com) will be responsible for iSER and iSCSI related issues in OFED (i.e. open-iscsi, iSER & stgt). All relevant git trees will move from my trees to his. These changes will be effective as of 19/5/08. After that, if you need anything, I will be available on erezzi.list at gmail.com Erez From vlad at dev.mellanox.co.il Sun May 11 04:46:00 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 11 May 2008 14:46:00 +0300 Subject: [ofa-general] Re: [rds-devel] New rds patches for ofed 1.3.1 In-Reply-To: <200805071250.22719.olaf.kirch@oracle.com> References: <200805071250.22719.olaf.kirch@oracle.com> Message-ID: <4826DC78.50700@dev.mellanox.co.il> Olaf Kirch wrote: > Hi, > > I have two more RDS kernel patches for OFED 1.3.1, and one additional > rds-tools patch. They're available from my git trees at > on branch code-drop-20080507 > > If you have any feedback, please let me know. > > At this point, I'm not going to submit the dma_sync patches yet. > I think they need more testing, and I'd rather postpone them to > OFED 1.3.2. > > I'll also post these patches in a follow-up email to this message. > > Olaf Pulled into OFED-1.3.1. Regards, Vladimir From olga.shern at gmail.com Sun May 11 04:49:42 2008 From: olga.shern at gmail.com (Olga Shern (Voltaire)) Date: Sun, 11 May 2008 14:49:42 +0300 Subject: [ofa-general] [PATCH] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: <1210148064.15669.84.camel@mtls03> References: <48206690.3090604@Voltaire.COM> <1210148064.15669.84.camel@mtls03> Message-ID: On 5/7/08, Eli Cohen wrote: > > > On Tue, 2008-05-06 at 17:09 +0300, Moni Shoua wrote: > > The purpose of this patch is to make the events that are related to SM > change > > (namely CLIENT_REREGISTER event and SM_CHANGE event) less disruptive. > > When SM related events are handled, it is not necessary to flush unicast > > info from device but only multicast info. This patch divides the events > that are > > handled by IPoIB to three categories; 0, 1 and 2 (when 2 does more than 1 > and 1 > > does more than 0). > > The main change is in __ipoib_ib_dev_flush(). Instead of flagging to the > function > > about pkey_events we now use leveling. An event that requires "harder" > flushing > > calls this function with higher number for level. Besides the concept, > > the actual change is that SM related events are not flushing unicast > info and > > not bringing the device down but only refresh the multicast info in the > background. > > > As far as I know, when an SM change event occurs, it could mean the SM > changed and the new one "decided" to reprogram all the LIDs for example. > In that case you will issue only level 0 and the all your neighbours can > become invalid. 
> > When an SM change event occurs it means that there was an SM failover; OpenSM, and also vendor SMs, will in 99% of the cases keep the LIDs (LID persistency). If there is a LID change then there will be a LID change event, and that is a level 1 event, not a level 0 event. ______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eli at dev.mellanox.co.il Sun May 11 05:17:53 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sun, 11 May 2008 15:17:53 +0300 Subject: [ofa-general] [PATCH] IB/core: Add completion flag for send with invalidate Message-ID: <1210508273.15669.131.camel@mtls03> >From da2391afba573aa5cbfd488e2c2498e3586ae1b9 Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Sun, 11 May 2008 14:59:08 +0300 Subject: [PATCH] IB/core: Add completion flag for send with invalidate Add IB_WC_WITH_INVALIDATE to enum ib_wc_flags to mark completions of "send with invalidate" operations. Signed-off-by: Eli Cohen --- include/rdma/ib_verbs.h | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..57a11f8 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -424,7 +424,8 @@ enum ib_wc_opcode { enum ib_wc_flags { IB_WC_GRH = 1, - IB_WC_WITH_IMM = (1<<1) + IB_WC_WITH_IMM = (1<<1), + IB_WC_WITH_INVALIDATE = (1<<2), }; struct ib_wc { -- 1.5.5.1 From eli at dev.mellanox.co.il Sun May 11 05:18:41 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sun, 11 May 2008 15:18:41 +0300 Subject: [ofa-general] [PATCH] IB/mlx4: Add send with invalidate support Message-ID: <1210508321.15669.133.camel@mtls03> >From 1c9492f357efa456074ab7e4552e8d8eccfe3cfe Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Sun, 11 May 2008 15:02:04 +0300 Subject: [PATCH] IB/mlx4: Add send with invalidate support Add send with invalidate support to mlx4. 
Signed-off-by: Eli Cohen --- drivers/infiniband/hw/mlx4/cq.c | 8 ++++++++ drivers/infiniband/hw/mlx4/qp.c | 22 +++++++++++++++++----- drivers/net/mlx4/mr.c | 6 ++++-- 3 files changed, 29 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 4521319..291e856 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -637,6 +637,7 @@ repoll: case MLX4_OPCODE_SEND_IMM: wc->wc_flags |= IB_WC_WITH_IMM; case MLX4_OPCODE_SEND: + case MLX4_OPCODE_SEND_INVAL: wc->opcode = IB_WC_SEND; break; case MLX4_OPCODE_RDMA_READ: @@ -676,6 +677,13 @@ repoll: wc->wc_flags = IB_WC_WITH_IMM; wc->imm_data = cqe->immed_rss_invalid; break; + case MLX4_RECV_OPCODE_SEND_INVAL: + wc->opcode = IB_WC_RECV; + wc->wc_flags = IB_WC_WITH_INVALIDATE; + /* + * TBD: maybe we should just call this ieth_val + */ + wc->imm_data = be32_to_cpu(cqe->immed_rss_invalid); } wc->slid = be16_to_cpu(cqe->rlid); diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 8e02ecf..d0d5f77 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -78,6 +78,7 @@ static const __be32 mlx4_ib_opcode[] = { [IB_WR_RDMA_READ] = __constant_cpu_to_be32(MLX4_OPCODE_RDMA_READ), [IB_WR_ATOMIC_CMP_AND_SWP] = __constant_cpu_to_be32(MLX4_OPCODE_ATOMIC_CS), [IB_WR_ATOMIC_FETCH_AND_ADD] = __constant_cpu_to_be32(MLX4_OPCODE_ATOMIC_FA), + [IB_WR_SEND_WITH_INV] = __constant_cpu_to_be32(MLX4_OPCODE_SEND_INVAL), }; static struct mlx4_ib_sqp *to_msqp(struct mlx4_ib_qp *mqp) @@ -1444,6 +1445,21 @@ static int build_lso_seg(struct mlx4_lso_seg *wqe, struct ib_send_wr *wr, return 0; } +static __be32 get_ieth(struct ib_send_wr *wr) +{ + switch (wr->opcode) { + case IB_WR_SEND_WITH_IMM: + case IB_WR_RDMA_WRITE_WITH_IMM: + return wr->ex.imm_data; + + case IB_WR_SEND_WITH_INV: + return cpu_to_be32(wr->ex.invalidate_rkey); + + default: + return 0; + } +} + int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr) { @@ -1490,11 +1506,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, MLX4_WQE_CTRL_TCP_UDP_CSUM) : 0) | qp->sq_signal_bits; - if (wr->opcode == IB_WR_SEND_WITH_IMM || - wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) - ctrl->imm = wr->ex.imm_data; - else - ctrl->imm = 0; + ctrl->imm = get_ieth(wr); wqe += sizeof *ctrl; size = sizeof *ctrl / 16; diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c index 03a9abc..e78f53d 100644 --- a/drivers/net/mlx4/mr.c +++ b/drivers/net/mlx4/mr.c @@ -47,7 +47,7 @@ struct mlx4_mpt_entry { __be32 flags; __be32 qpn; __be32 key; - __be32 pd; + __be32 pd_flags; __be64 start; __be64 length; __be32 lkey; @@ -71,6 +71,8 @@ struct mlx4_mpt_entry { #define MLX4_MPT_STATUS_SW 0xF0 #define MLX4_MPT_STATUS_HW 0x00 +#define MLX4_MPT_FLAG_EN_INV 0x3000000 + static u32 mlx4_buddy_alloc(struct mlx4_buddy *buddy, int order) { int o; @@ -320,7 +322,7 @@ int mlx4_mr_enable(struct mlx4_dev *dev, struct mlx4_mr *mr) mr->access); mpt_entry->key = cpu_to_be32(key_to_hw_index(mr->key)); - mpt_entry->pd = cpu_to_be32(mr->pd); + mpt_entry->pd_flags = cpu_to_be32(mr->pd | MLX4_MPT_FLAG_EN_INV); mpt_entry->start = cpu_to_be64(mr->iova); mpt_entry->length = cpu_to_be64(mr->size); mpt_entry->entity_size = cpu_to_be32(mr->mtt.page_shift); -- 1.5.5.1 From kliteyn at dev.mellanox.co.il Sun May 11 05:37:11 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 11 May 2008 15:37:11 +0300 Subject: [ofa-general] [PATCH] infiniband-diags/Makefile.am: fix 
location of ibdiag_version.h Message-ID: <4826E877.7090706@dev.mellanox.co.il> Hi Sasha, When compiling infiniband-diags not from the source code location, compilation fails to find the ibdiag_version.h file - fixing it. Signed-off-by: Yevgeny Kliteynik --- infiniband-diags/Makefile.am | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am index e502a06..b6228b5 100644 --- a/infiniband-diags/Makefile.am +++ b/infiniband-diags/Makefile.am @@ -1,5 +1,5 @@ -INCLUDES = -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband +INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband if DEBUG DBGFLAGS = -ggdb -D_DEBUG_ @@ -103,7 +103,7 @@ man_MANS = man/ibaddr.8 man/ibcheckerrors.8 man/ibcheckerrs.8 \ BUILT_SOURCES = ibdiag_version ibdiag_version: if [ -x $(top_srcdir)/../gen_ver.sh ] ; then \ - ver_file=$(srcdir)/include/ibdiag_version.h ; \ + ver_file=$(top_builddir)/include/ibdiag_version.h ; \ ibdiag_ver=`cat $$ver_file | sed -ne '/#define IBDIAG_VERSION /s/^.*\"\(.*\)\"$$/\1/p'` ; \ ver=`$(top_srcdir)/../gen_ver.sh $(PACKAGE)` ; \ if [ $$ver != $$ibdiag_ver ] ; then \ -- 1.5.1.4 From eli at dev.mellanox.co.il Sun May 11 07:21:11 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sun, 11 May 2008 17:21:11 +0300 Subject: [ofa-general] [PATCH v2] IB/mlx4: Add send with invalidate support Message-ID: <1210515671.15669.138.camel@mtls03> >From 0fdabd83e54369b51ac41003f7fe282604b63ad5 Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Sun, 11 May 2008 15:02:04 +0300 Subject: Add send with invalidate support to mlx4. Signed-off-by: Eli Cohen --- changes since last commit: set cap flag IB_DEVICE_SEND_W_INV drivers/infiniband/hw/mlx4/cq.c | 8 ++++++++ drivers/infiniband/hw/mlx4/main.c | 3 ++- drivers/infiniband/hw/mlx4/qp.c | 22 +++++++++++++++++----- drivers/net/mlx4/mr.c | 6 ++++-- 4 files changed, 31 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 4521319..291e856 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -637,6 +637,7 @@ repoll: case MLX4_OPCODE_SEND_IMM: wc->wc_flags |= IB_WC_WITH_IMM; case MLX4_OPCODE_SEND: + case MLX4_OPCODE_SEND_INVAL: wc->opcode = IB_WC_SEND; break; case MLX4_OPCODE_RDMA_READ: @@ -676,6 +677,13 @@ repoll: wc->wc_flags = IB_WC_WITH_IMM; wc->imm_data = cqe->immed_rss_invalid; break; + case MLX4_RECV_OPCODE_SEND_INVAL: + wc->opcode = IB_WC_RECV; + wc->wc_flags = IB_WC_WITH_INVALIDATE; + /* + * TBD: maybe we should just call this ieth_val + */ + wc->imm_data = be32_to_cpu(cqe->immed_rss_invalid); } wc->slid = be16_to_cpu(cqe->rlid); diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index 4d61e32..a88fa15 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -90,7 +90,8 @@ static int mlx4_ib_query_device(struct ib_device *ibdev, props->device_cap_flags = IB_DEVICE_CHANGE_PHY_PORT | IB_DEVICE_PORT_ACTIVE_EVENT | IB_DEVICE_SYS_IMAGE_GUID | - IB_DEVICE_RC_RNR_NAK_GEN; + IB_DEVICE_RC_RNR_NAK_GEN | + IB_DEVICE_SEND_W_INV; if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_BAD_PKEY_CNTR) props->device_cap_flags |= IB_DEVICE_BAD_PKEY_CNTR; if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_BAD_QKEY_CNTR) diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 8e02ecf..d0d5f77 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -78,6 
+78,7 @@ static const __be32 mlx4_ib_opcode[] = { [IB_WR_RDMA_READ] = __constant_cpu_to_be32(MLX4_OPCODE_RDMA_READ), [IB_WR_ATOMIC_CMP_AND_SWP] = __constant_cpu_to_be32(MLX4_OPCODE_ATOMIC_CS), [IB_WR_ATOMIC_FETCH_AND_ADD] = __constant_cpu_to_be32(MLX4_OPCODE_ATOMIC_FA), + [IB_WR_SEND_WITH_INV] = __constant_cpu_to_be32(MLX4_OPCODE_SEND_INVAL), }; static struct mlx4_ib_sqp *to_msqp(struct mlx4_ib_qp *mqp) @@ -1444,6 +1445,21 @@ static int build_lso_seg(struct mlx4_lso_seg *wqe, struct ib_send_wr *wr, return 0; } +static __be32 get_ieth(struct ib_send_wr *wr) +{ + switch (wr->opcode) { + case IB_WR_SEND_WITH_IMM: + case IB_WR_RDMA_WRITE_WITH_IMM: + return wr->ex.imm_data; + + case IB_WR_SEND_WITH_INV: + return cpu_to_be32(wr->ex.invalidate_rkey); + + default: + return 0; + } +} + int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr) { @@ -1490,11 +1506,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, MLX4_WQE_CTRL_TCP_UDP_CSUM) : 0) | qp->sq_signal_bits; - if (wr->opcode == IB_WR_SEND_WITH_IMM || - wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) - ctrl->imm = wr->ex.imm_data; - else - ctrl->imm = 0; + ctrl->imm = get_ieth(wr); wqe += sizeof *ctrl; size = sizeof *ctrl / 16; diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c index 03a9abc..e78f53d 100644 --- a/drivers/net/mlx4/mr.c +++ b/drivers/net/mlx4/mr.c @@ -47,7 +47,7 @@ struct mlx4_mpt_entry { __be32 flags; __be32 qpn; __be32 key; - __be32 pd; + __be32 pd_flags; __be64 start; __be64 length; __be32 lkey; @@ -71,6 +71,8 @@ struct mlx4_mpt_entry { #define MLX4_MPT_STATUS_SW 0xF0 #define MLX4_MPT_STATUS_HW 0x00 +#define MLX4_MPT_FLAG_EN_INV 0x3000000 + static u32 mlx4_buddy_alloc(struct mlx4_buddy *buddy, int order) { int o; @@ -320,7 +322,7 @@ int mlx4_mr_enable(struct mlx4_dev *dev, struct mlx4_mr *mr) mr->access); mpt_entry->key = cpu_to_be32(key_to_hw_index(mr->key)); - mpt_entry->pd = cpu_to_be32(mr->pd); + mpt_entry->pd_flags = cpu_to_be32(mr->pd | MLX4_MPT_FLAG_EN_INV); mpt_entry->start = cpu_to_be64(mr->iova); mpt_entry->length = cpu_to_be64(mr->size); mpt_entry->entity_size = cpu_to_be32(mr->mtt.page_shift); -- 1.5.5.1 From rdreier at cisco.com Sun May 11 08:34:12 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 11 May 2008 08:34:12 -0700 Subject: [ofa-general] [2.6.27 PATCH/RFC] IB/srp: Remove use of cached P_Key/GID queries In-Reply-To: <200805110849.25034.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Sun, 11 May 2008 08:49:24 +0300") References: <200805110849.25034.jackm@dev.mellanox.co.il> Message-ID: > >  Since we want to eliminate the > > cached operations in the long term, convert SRP to use the uncached > > variants. > Eliminating the caches will pose a performance problem when sending > raw packets. The ib_post_send API provides the pkey_index -- and this > needs to be translated to the actual p_key when building the Base > Transport Header. Not sure what you mean about a performance problem... if you're talking about the mthca/mlx4 internals, then we just need a private P_Key/GID cache inside the driver. This can be updated synchronously when processing MADs, rather than having all the complicated logic in the current caching module. But I don't think sending MADs is really a fast path anyway. - R. 
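As a minimal sketch of the two query styles under discussion (an illustration, assuming the 2.6.x ib_core interfaces; the wrapper function below is hypothetical, while ib_query_pkey() and ib_get_cached_pkey() are the actual verbs/cache calls):

#include <rdma/ib_verbs.h>
#include <rdma/ib_cache.h>

/* ib_get_cached_pkey() reads the P_Key table cached by ib_core;
 * ib_query_pkey() asks the device driver directly, so it keeps
 * working if the core-level cache is eventually removed. */
static int example_lookup_pkey(struct ib_device *dev, u8 port_num,
                               u16 index, u16 *pkey)
{
        /* cached variant (what SRP used before the patch): */
        /* return ib_get_cached_pkey(dev, port_num, index, pkey); */

        /* uncached variant (what the patch converts SRP to): */
        return ib_query_pkey(dev, port_num, index, pkey);
}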
From rdreier at cisco.com Sun May 11 08:37:57 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 11 May 2008 08:37:57 -0700 Subject: [ofa-general] Re: [PATCH] IB/core: Add completion flag for send with invalidate In-Reply-To: <1210508273.15669.131.camel@mtls03> (Eli Cohen's message of "Sun, 11 May 2008 15:17:53 +0300") References: <1210508273.15669.131.camel@mtls03> Message-ID:

> Add IB_WC_WITH_INVALIDATE to enum ib_wc_flags to mark completions > of "send with invalidate" operations.

Actually, given that we took the approach of adding a new send opcode for send w/invalidate, I think it probably makes more sense to add a new IB_WC_ opcode for such completions. This doesn't match the immediate data handling, but on the other hand, if we just add a wc flag for invalidate, what opcode do we report for local invalidate work requests?

- R.

From rdreier at cisco.com Sun May 11 08:44:18 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 11 May 2008 08:44:18 -0700 Subject: [ofa-general] Re: [PATCH] IB/core: Add completion flag for send with invalidate In-Reply-To: <1210508273.15669.131.camel@mtls03> (Eli Cohen's message of "Sun, 11 May 2008 15:17:53 +0300") References: <1210508273.15669.131.camel@mtls03> Message-ID:

Never mind... I see that this is for the corresponding receive on the other side. But:

> + /* > + * TBD: maybe we should just call this ieth_val > + */ > + wc->imm_data = be32_to_cpu(cqe->immed_rss_invalid);

I think we should make the wc->imm_data handling a union like we did for the work request structure. If we call it ieth_val I think it's too hard for anyone to understand, and it is IB-specific (iWARP has no IETH of course).

- R.

From ogerlitz at voltaire.com Sun May 11 22:13:50 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 12 May 2008 08:13:50 +0300 Subject: [ofa-general] Re: [2.6.27 PATCH/RFC] IB/srp: Remove use of cached P_Key/GID queries In-Reply-To: References: <200805110849.25034.jackm@dev.mellanox.co.il> Message-ID: <4827D20E.4060805@voltaire.com>

Roland Dreier wrote: > But I don't think sending MADs is really a fast path anyway.

It is a fast path to some extent when this node runs the SM.

Or.

From ogerlitz at voltaire.com Sun May 11 22:44:31 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 12 May 2008 08:44:31 +0300 Subject: [ofa-general] [PATCH v2] IB/mlx4: Add send with invalidate support In-Reply-To: <1210515671.15669.138.camel@mtls03> References: <1210515671.15669.138.camel@mtls03> Message-ID: <4827D93F.1020208@voltaire.com>

Eli Cohen wrote: > Add send with invalidate support to mlx4. > Signed-off-by: Eli Cohen

Hi Eli, Thinking about it a little, I don't see how this send-with-invalidate support is applicable for ULPs such as SRP, iSER and RDS who use the Mellanox FMRs. This is because to invalidate an rkey they call fmr_unmap and what the mlx4 (similarly for mthca) driver does at mlx4_ib_unmap_fmr() is call mlx4_fmr_unmap() for each fmr and then issue SYNC_TPT command. Even if doing send-with-inv would save the ULP the indirect call to mlx4_fmr_unmap() (which does almost nothing by itself), if it doesn't cause the HW/FW to issue SYNC_TPT, it can not replace the call to ib_unmap_fmr in the side that generated this rkey. And if it does cause SYNC_TPT, the effect of amortizing the cost of this heavy command through un-mapping on many fmrs at once is lost, correct?
Or

> mlx4_ib_unmap_fmr calls mlx4_fmr_unmap for each fmr and then issues SYNC_TPT command > > void mlx4_fmr_unmap(struct mlx4_dev *dev, struct mlx4_fmr *fmr, > u32 *lkey, u32 *rkey) > { > if (!fmr->maps) > return; > > fmr->maps = 0; > > *(u8 *) fmr->mpt = MLX4_MPT_STATUS_SW; > } >

From keshetti85-student at yahoo.co.in Sun May 11 23:30:14 2008 From: keshetti85-student at yahoo.co.in (Keshetti Mahesh) Date: Mon, 12 May 2008 12:00:14 +0530 Subject: [ofa-general] OpenSM SA dump ? Message-ID: <829ded920805112330w2378e795qe5342b50b1c4aff0@mail.gmail.com>

When I ran opensm with '-D 0x43' option, it generated opensm-sa.dump file in /var/log directory. But to my surprise 'opensm-sa.dump' file is very small in size and it contained information of MC groups only. Do I need to give more options to get detailed information of SA dump ?

Also, is there any way to dump the local SA cache to a file in the OFED-1.3 implementation ?

-Mahesh

From monis at Voltaire.COM Mon May 12 01:08:25 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Mon, 12 May 2008 11:08:25 +0300 Subject: [ofa-general] [PATCH] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: <48206690.3090604@Voltaire.COM> References: <48206690.3090604@Voltaire.COM> Message-ID: <4827FAF9.2030506@Voltaire.COM>

Hi Roland, Do you have comments for this patch? We'd like to have it in, please. thanks MoniS

From monis at Voltaire.COM Mon May 12 01:12:15 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Mon, 12 May 2008 11:12:15 +0300 Subject: [ofa-general] [PATCH] IB/core: handle race between elements in work queues after event In-Reply-To: <4820638E.4030901@Voltaire.COM> References: <48187E5A.7040809@Voltaire.COM> <4819BD29.7080002@Voltaire.COM> <4820638E.4030901@Voltaire.COM> Message-ID: <4827FBDF.9040308@Voltaire.COM>

Moni Shoua wrote: >> I guess I can believe things don't get worse but I still don't know how >> this makes things better. With the current code the request is lost >> because it goes to the wrong SM; with the new code the request is failed >> by the SA layer. So in both cases the consumer just has to try again. >> >> So is there some practical benefit we see by adding this code? >> >> - R. > > In general I see the benefit in faster detection of a wrong SM AH. Before the patch, consumers > need to wait for a timeout before the detection; after the patch it happens immediately > on return from the function. This improves the performance of an SM failover scenario. > > Some applications may get the benefit above only if they handle the new return code (EAGAIN) specifically, > but this patch opens the door for such improvement. > > thanks > > MoniS

Hi Roland, Can we please go on with this patch? We would like to see it in the next kernel.
thanks MoniS From eli at dev.mellanox.co.il Mon May 12 01:32:42 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Mon, 12 May 2008 11:32:42 +0300 Subject: [ofa-general] [PATCH v2] IB/mlx4: Add send with invalidate support In-Reply-To: <4827D93F.1020208@voltaire.com> References: <1210515671.15669.138.camel@mtls03> <4827D93F.1020208@voltaire.com> Message-ID: <1210581162.15669.158.camel@mtls03> On Mon, 2008-05-12 at 08:44 +0300, Or Gerlitz wrote: > Thinking about it a little, I don't see how this send-with-invalidate > support is applicable for ULPs such as SRP, iSER and RDS who use the > Mellanox FMRs. > > This is because to invalidate an rkey they call fmr_unmap and what the > mlx4 (similarly for mthca) driver does at mlx4_ib_unmap_fmr() is call > mlx4_fmr_unmap() for each fmr and then issue SYNC_TPT command. > > Even if doing send-with-inv would save the ULP the indirect call to > mlx4_fmr_unmap() (which does almost nothing by itself), if it doesn't > cause the HW/FW to issue SYNC_TPT, it can not replace the call to > ib_unmap_fmr in the side that generated this rkey. And if it does cause > SYNC_TPT, the effect of amortizing the cost of this heavy command > through un-mapping on many fmrs at once is lost, correct? The outcome of send with invalidate involves an implicit "sync_tpt" like operation although it syncs the caches to invalidate only the specific memory key (as opposed to sync tpt command which has a more global nature). But I think that the idea is not to save the overhead of sync tpt commands but to provide security. Perhaps someone from RDS can add more on that. From pawel.dziekonski at wcss.pl Mon May 12 01:54:15 2008 From: pawel.dziekonski at wcss.pl (Pawel Dziekonski) Date: Mon, 12 May 2008 10:54:15 +0200 Subject: [ofa-general] Obtaining RDMA statistics for an InfiniBand interface In-Reply-To: <1205844786.11393.121.camel@hrosenstock-ws.xsigo.com> References: <1E3DCD1C63492545881FACB6063A57C102221AD0@mtiexch01.mti.com> <1205844786.11393.121.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512085415.GF5226@cefeid.wcss.wroc.pl> On Tue, 18 Mar 2008 at 05:53:06AM -0700, Hal Rosenstock wrote: > On Tue, 2008-03-18 at 09:17 +0100, Bart Van Assche wrote: > > On Mon, Mar 17, 2008 at 6:07 PM, Boris Shpolyansky wrote: > > > Check : > > > > > > /sys/class/infiniband/[mlx4_*|mthca*]/ports/*/counters/* > > > > Thanks, these are interesting counters. Unfortunately these counters > > are 32-bit counters and already overflowed during the test I ran (less > > than one day of SRP communication): > > > > $ uname -r > > 2.6.24 > > $ uname -m > > x86_64 > > $ head /sys/class/infiniband/mthca0/ports/1/counters/*_{data,packets} > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_data <== > > 4294967295 > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_data <== > > 4294967295 > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_packets <== > > 4294967295 > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_packets <== > > 4294967295 > > Depending on which fabric management (usually bundled with the SM) is > being used, you may be able to obtain this information via the > Performance Manager (and not be limited to 32 bit counters). Hi, I have exactly the same problem on OFED 1.2.5.5, redhat kernel 2.6.9-55.0.12.ELsmp on X86_64 machines. {data,packets} files contain 4294967295 value, or barely change. :( How can I get Performance Manager running and printing some reasonable numbers? 
regards, P -- Pawel Dziekonski Wroclaw Centre for Networking & Supercomputing, HPC Department Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl From ogerlitz at voltaire.com Mon May 12 03:42:31 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 12 May 2008 13:42:31 +0300 Subject: [ofa-general] [PATCH v2] IB/mlx4: Add send with invalidate support In-Reply-To: <1210581162.15669.158.camel@mtls03> References: <1210515671.15669.138.camel@mtls03> <4827D93F.1020208@voltaire.com> <1210581162.15669.158.camel@mtls03> Message-ID: <48281F17.400@voltaire.com> Eli Cohen wrote: > The outcome of send with invalidate involves an implicit "sync_tpt" like > operation although it syncs the caches to invalidate only the specific > memory key (as opposed to sync tpt command which has a more global > nature). > But I think that the idea is not to save the overhead of sync tpt > commands but to provide security. Yes, if send-with-invalidate causes a sync-tpt which applies only to the specific rkey (I assume its documented in the PRM) this can be used to make the mellanox fmrs --much-- more secure. Are you thinking on enhancement to support that for consumers that use FMRs through the pool at the core? Or. From tziporet at dev.mellanox.co.il Mon May 12 04:09:52 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 12 May 2008 14:09:52 +0300 Subject: [ewg] RE: [ofa-general] OFED May 5 meeting summary In-Reply-To: <39C75744D164D948A170E9792AF8E7CAC5AF56@exil.voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C90282E694@mtlexch01.mtl.com> <39C75744D164D948A170E9792AF8E7CAC5AF56@exil.voltaire.com> Message-ID: <48282580.8040208@mellanox.co.il> Moshe Kazir wrote: > > I have checked OFED-1.3.1-rc1 on SLES10 SP 2 Beta3. > > ib-bonding compile failed. Everything else is compiled o.k. > > Attached : ib-bonding error log. > > > I'll take the backport of ib-bonding to sles10 sp 2 on me (if needed, > I'll get Moni's help). > > Thanks Please update when done. Any need for a change in the install script? Tziporet From hrosenstock at xsigo.com Mon May 12 04:19:05 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 04:19:05 -0700 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <4825EEC9.4070208@dev.mellanox.co.il> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> Message-ID: <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> On Sat, 2008-05-10 at 21:51 +0300, Yevgeny Kliteynik wrote: > Hal Rosenstock wrote: > > Hi Yevgeny, > > > > Is it possible that OpenSM's fat tree routing somehow depends on the > > LIDs previously assigned ? > > It depends only on the existence of the LIDs. > > > It seems that for a legitimate fat tree topology, the topology sometimes > > won't come up as a fat tree if reassigning LIDs (-r) is not used. > > That's odd... > > > In addition to -r making fat tree work, certain routing algorithms seem > > to also clear this out (without using -r). For example, if lash were run > > and then ftree, it seems to work without doing the -r. (Haven't yet > > tried updn). > > > > Any ideas on this ? Should a bug be filed on this ? Thanks. > > No ideas whatsoever. Please file a bug on this. I filed this as bug 1031: https://bugs.openfabrics.org/show_bug.cgi?id=1031 > It would be nice if I could reproduce it in simulation. Yes, that would be nice; but I don't have a sim case. 
-- Hal > -- Yevgeny > > > -- Hal > > > > > > > From kliteyn at dev.mellanox.co.il Mon May 12 05:30:18 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 12 May 2008 15:30:18 +0300 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> Message-ID: <4828385A.6080804@dev.mellanox.co.il> Hi Hal, Hal Rosenstock wrote: > On Sat, 2008-05-10 at 21:51 +0300, Yevgeny Kliteynik wrote: >> Hal Rosenstock wrote: >>> Hi Yevgeny, >>> >>> Is it possible that OpenSM's fat tree routing somehow depends on the >>> LIDs previously assigned ? >> It depends only on the existence of the LIDs. >> >>> It seems that for a legitimate fat tree topology, the topology sometimes >>> won't come up as a fat tree if reassigning LIDs (-r) is not used. >> That's odd... >> >>> In addition to -r making fat tree work, certain routing algorithms seem >>> to also clear this out (without using -r). For example, if lash were run >>> and then ftree, it seems to work without doing the -r. (Haven't yet >>> tried updn). >>> >>> Any ideas on this ? Should a bug be filed on this ? Thanks. >> No ideas whatsoever. Please file a bug on this. > > I filed this as bug 1031: > https://bugs.openfabrics.org/show_bug.cgi?id=1031 Thanks >> It would be nice if I could reproduce it in simulation. > > Yes, that would be nice; but I don't have a sim case. The problem is, I don't even know where to start. I tested it in simulations on different topologies, and it is used on real cluster(s) too. I need more details, and some hint on how to reproduce it. Can you describe the setup you used when you saw this problem? -- Yevgeny > -- Hal > >> -- Yevgeny >> >>> -- Hal >>> >>> >>> > > From olga.shern at gmail.com Mon May 12 06:49:34 2008 From: olga.shern at gmail.com (Olga Shern (Voltaire)) Date: Mon, 12 May 2008 16:49:34 +0300 Subject: [ewg] RE: [ofa-general] OFED May 5 meeting summary In-Reply-To: <48282580.8040208@mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C90282E694@mtlexch01.mtl.com> <39C75744D164D948A170E9792AF8E7CAC5AF56@exil.voltaire.com> <48282580.8040208@mellanox.co.il> Message-ID: On 5/12/08, Tziporet Koren wrote: > > Moshe Kazir wrote: > > > > > I have checked OFED-1.3.1-rc1 on SLES10 SP 2 Beta3. > > > > ib-bonding compile failed. Everything else is compiled o.k. > > Attached : ib-bonding error log. > > > > > > I'll take the backport of ib-bonding to sles10 sp 2 on me (if needed, > > I'll get Moni's help). > > > > > > > Thanks > Please update when done. > Any need for a change in the install script? It seems that there is no need for changes in the install script, I will update you Tziporet _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olgas at voltaire.com Mon May 12 07:18:33 2008 From: olgas at voltaire.com (Olga Shern) Date: Mon, 12 May 2008 17:18:33 +0300 Subject: [ofa-general] Compiling OFED 1.3 on Gentoo Message-ID: <39C75744D164D948A170E9792AF8E7CA012CD4B0@exil.voltaire.com> Hi, We are trying to compile OFED 1.3 on Gentoo and see the following error, Build falls on libibcommon library with the error bellow. 
Running rpmbuild --rebuild --define '_topdir /var/tmp/OFED_topdir' --define 'dist ' --target i386 --define '_prefix /usr' --define '_exec_prefix /usr' --define '_sysconfdir /etc' --define '_usr /usr' /tmp/OFED-1.3/SRPMS/libibcommon-1.0.8-1.ofed1.3.src.rpm error: Macro %dist has empty body error: Macro %dist has empty body sh: line 0: fg: no job control error: Failed build dependencies: is needed by libibcommon-1.0.8-1.ofed1.3.src Installing /tmp/OFED-1.3/SRPMS/libibcommon-1.0.8-1.ofed1.3.src.rpm Building target platforms: i386 Building for target i386 There is a strange space under 'error:' line, before 'is needed by libibcommon-1.0.8-1.ofed1.3.src' But if I install source RPM file and then running 'rpmbuild -ba libibcommon.spec' then I can build RPM, so only rpmbuild --rebuild command causing to problems. Have someone seen this error before? Have someone succeeded to build OFED 1.3 on Gentoo? Thanks Olga -------------- next part -------------- An HTML attachment was scrubbed... URL: From xemul at openvz.org Mon May 12 07:43:58 2008 From: xemul at openvz.org (Pavel Emelyanov) Date: Mon, 12 May 2008 18:43:58 +0400 Subject: [ofa-general] [PATCH][INFINIBAND]: Make ipath_portdata work with struct pid * not pid_t. Message-ID: <482857AE.2030904@openvz.org> The official reason is "with the presence of pid namespaces in the kernel using pid_t-s inside one is no longer safe". But the reason I fix exactly the infiniband right now is the following. About a month ago (when the 2.6.25 was not yet released) there still was a one last caller of a to-be-deprecated-soon function find_pid() - the kill_proc() function, which in turn was only used by nfs callback code. During the last merge window, this last caller was finally eliminated by some NFS patch(es) and I was about to finally kill this kill_proc() and find_pid(), but found, that I was late and the kill_proc is now called from the infiniband driver (commit 58411d1c). So here's the patch, that turns this code to use struct pid * and (!) the kill_pid routine. If it is possible to have this one in 2.6.26, I would appreciate this A LOT and be able to close one more hole in pid namespaces. 
Signed-off-by: Pavel Emelyanov --- diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index ce7b7c3..258e66c 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -2616,7 +2616,7 @@ int ipath_reset_device(int unit) ipath_dbg("unit %u port %d is in use " "(PID %u cmd %s), can't reset\n", unit, i, - dd->ipath_pd[i]->port_pid, + pid_nr(dd->ipath_pd[i]->port_pid), dd->ipath_pd[i]->port_comm); ret = -EBUSY; goto bail; @@ -2654,19 +2654,21 @@ bail: static int ipath_signal_procs(struct ipath_devdata *dd, int sig) { int i, sub, any = 0; - pid_t pid; + struct pid *pid; if (!dd->ipath_pd) return 0; for (i = 1; i < dd->ipath_cfgports; i++) { - if (!dd->ipath_pd[i] || !dd->ipath_pd[i]->port_cnt || - !dd->ipath_pd[i]->port_pid) + if (!dd->ipath_pd[i] || !dd->ipath_pd[i]->port_cnt) continue; pid = dd->ipath_pd[i]->port_pid; + if (!pid) + continue; + dev_info(&dd->pcidev->dev, "context %d in use " "(PID %u), sending signal %d\n", - i, pid, sig); - kill_proc(pid, sig, 1); + i, pid_nr(pid), sig); + kill_pid(pid, sig, 1); any++; for (sub = 0; sub < INFINIPATH_MAX_SUBPORT; sub++) { pid = dd->ipath_pd[i]->port_subpid[sub]; @@ -2674,8 +2676,8 @@ static int ipath_signal_procs(struct ipath_devdata *dd, int sig) continue; dev_info(&dd->pcidev->dev, "sub-context " "%d:%d in use (PID %u), sending " - "signal %d\n", i, sub, pid, sig); - kill_proc(pid, sig, 1); + "signal %d\n", i, sub, pid_nr(pid), sig); + kill_pid(pid, sig, 1); any++; } } diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c index 3295177..b472b15 100644 --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c @@ -555,7 +555,7 @@ static int ipath_tid_free(struct ipath_portdata *pd, unsigned subport, p = dd->ipath_pageshadow[porttid + tid]; dd->ipath_pageshadow[porttid + tid] = NULL; ipath_cdbg(VERBOSE, "PID %u freeing TID %u\n", - pd->port_pid, tid); + pid_nr(pd->port_pid), tid); dd->ipath_f_put_tid(dd, &tidbase[tid], RCVHQ_RCV_TYPE_EXPECTED, dd->ipath_tidinvalid); @@ -1609,7 +1609,7 @@ static int try_alloc_port(struct ipath_devdata *dd, int port, port); pd->port_cnt = 1; port_fp(fp) = pd; - pd->port_pid = current->pid; + pd->port_pid = get_pid(task_pid(current)); strncpy(pd->port_comm, current->comm, sizeof(pd->port_comm)); ipath_stats.sps_ports++; ret = 0; @@ -1793,14 +1793,15 @@ static int find_shared_port(struct file *fp, } port_fp(fp) = pd; subport_fp(fp) = pd->port_cnt++; - pd->port_subpid[subport_fp(fp)] = current->pid; + pd->port_subpid[subport_fp(fp)] = + get_pid(task_pid(current)); tidcursor_fp(fp) = 0; pd->active_slaves |= 1 << subport_fp(fp); ipath_cdbg(PROC, "%s[%u] %u sharing %s[%u] unit:port %u:%u\n", current->comm, current->pid, subport_fp(fp), - pd->port_comm, pd->port_pid, + pd->port_comm, pid_nr(pd->port_pid), dd->ipath_unit, pd->port_port); ret = 1; goto done; @@ -2066,7 +2067,8 @@ static int ipath_close(struct inode *in, struct file *fp) * the slave(s) don't wait for receive data forever. 
*/ pd->active_slaves &= ~(1 << fd->subport); - pd->port_subpid[fd->subport] = 0; + put_pid(pd->port_subpid[fd->subport]); + pd->port_subpid[fd->subport] = NULL; mutex_unlock(&ipath_mutex); goto bail; } @@ -2074,7 +2076,7 @@ static int ipath_close(struct inode *in, struct file *fp) if (pd->port_hdrqfull) { ipath_cdbg(PROC, "%s[%u] had %u rcvhdrqfull errors " - "during run\n", pd->port_comm, pd->port_pid, + "during run\n", pd->port_comm, pid_nr(pd->port_pid), pd->port_hdrqfull); pd->port_hdrqfull = 0; } @@ -2134,11 +2136,12 @@ static int ipath_close(struct inode *in, struct file *fp) unlock_expected_tids(pd); ipath_stats.sps_ports--; ipath_cdbg(PROC, "%s[%u] closed port %u:%u\n", - pd->port_comm, pd->port_pid, + pd->port_comm, pid_nr(pd->port_pid), dd->ipath_unit, port); } - pd->port_pid = 0; + put_pid(pd->port_pid); + pd->port_pid = NULL; dd->ipath_pd[pd->port_port] = NULL; /* before releasing mutex */ mutex_unlock(&ipath_mutex); ipath_free_pddata(dd, pd); /* after releasing the mutex */ diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index 02b24a3..2097587 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -159,8 +159,8 @@ struct ipath_portdata { /* saved total number of polled urgent packets for poll edge trigger */ u32 port_urgent_poll; /* pid of process using this port */ - pid_t port_pid; - pid_t port_subpid[INFINIPATH_MAX_SUBPORT]; + struct pid *port_pid; + struct pid *port_subpid[INFINIPATH_MAX_SUBPORT]; /* same size as task_struct .comm[] */ char port_comm[16]; /* pkeys set by this use of this port */

From sashak at voltaire.com Mon May 12 11:37:36 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 18:37:36 +0000 Subject: [ofa-general] Re: [PATCH][TRIVIAL] ibsim/sim_net.c: Fix some typos In-Reply-To: <1209934852.20493.182.camel@hrosenstock-ws.xsigo.com> References: <1209934852.20493.182.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512183736.GB17046@sashak.voltaire.com>

On 14:00 Sun 04 May , Hal Rosenstock wrote: > ibsim/sim_net.c: Fix some typos > > Signed-off-by: Hal Rosenstock

Applied. Thanks.

Sasha

From sashak at voltaire.com Mon May 12 11:38:07 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 18:38:07 +0000 Subject: [ofa-general] Re: [PATCH][TRIVIAL] ibsim/ibsim.c: Fix usage display In-Reply-To: <1209934895.20493.184.camel@hrosenstock-ws.xsigo.com> References: <1209934895.20493.184.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512183807.GC17046@sashak.voltaire.com>

On 14:01 Sun 04 May , Hal Rosenstock wrote: > ibsim/ibsim.c: Fix usage display > > Signed-off-by: Hal Rosenstock

Applied. Thanks.

Sasha

From rdreier at cisco.com Mon May 12 08:41:18 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 May 2008 08:41:18 -0700 Subject: [ofa-general] Re: [PATCH][INFINIBAND]: Make ipath_portdata work with struct pid * not pid_t. In-Reply-To: <482857AE.2030904@openvz.org> (Pavel Emelyanov's message of "Mon, 12 May 2008 18:43:58 +0400") References: <482857AE.2030904@openvz.org> Message-ID:

Seems fine to me... ipath guys, any comment? I think it would be reasonable to include this with the other ipath fixes when I ask Linus to pull in a day or two.
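For reference, a minimal sketch of the struct pid lifetime pattern the ipath patch above relies on (the example_* names are hypothetical; get_pid(), put_pid(), pid_nr() and kill_pid() are the kernel primitives it uses):

#include <linux/kernel.h>
#include <linux/pid.h>
#include <linux/sched.h>

static struct pid *example_owner;	/* counted reference, not a raw pid_t */

static void example_open(void)
{
	/* take a reference to the opening task's struct pid */
	example_owner = get_pid(task_pid(current));
}

static void example_signal(int sig)
{
	if (example_owner) {
		/* pid_nr() is only for printing; kill_pid() works across pid namespaces */
		printk(KERN_INFO "signalling PID %d\n", pid_nr(example_owner));
		kill_pid(example_owner, sig, 1);
	}
}

static void example_close(void)
{
	put_pid(example_owner);	/* put_pid(NULL) is a safe no-op */
	example_owner = NULL;
}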
From sashak at voltaire.com Mon May 12 11:41:38 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 18:41:38 +0000 Subject: [ofa-general] Re: [PATCH] ibsim/sim.h: Fix NodeDescription size In-Reply-To: <1209934867.20493.183.camel@hrosenstock-ws.xsigo.com> References: <1209934867.20493.183.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512184138.GD17046@sashak.voltaire.com> On 14:01 Sun 04 May , Hal Rosenstock wrote: > ibsim/sim.h: Fix NodeDescription size so can have maximum size > NodeDescription per IBA spec rather than truncating them > > Signed-off-by: Hal Rosenstock > > diff --git a/ibsim/sim.h b/ibsim/sim.h > index bea136a..dbf1220 100644 > --- a/ibsim/sim.h > +++ b/ibsim/sim.h > @@ -67,7 +67,7 @@ > > #define NODEIDBASE 20 > #define NODEPREFIX 20 > -#define NODEIDLEN (NODEIDBASE+NODEPREFIX+1) > +#define NODEIDLEN 65 > #define ALIASLEN 40 nodeid filed in struct Node still have length 64, so it looks that using NODEIDLEN value larger than this introduces overflow. I think bigger change is needed there. Sasha From sashak at voltaire.com Mon May 12 11:49:03 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 18:49:03 +0000 Subject: [ofa-general] Re: [PATCH] ibsim/README: Clarify point of attachment/SIM_HOST use In-Reply-To: <1209934908.20493.186.camel@hrosenstock-ws.xsigo.com> References: <1209934908.20493.186.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512184903.GE17046@sashak.voltaire.com> On 14:01 Sun 04 May , Hal Rosenstock wrote: > ibsim/README: Clarify point of attachment/SIM_HOST use > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From hrosenstock at xsigo.com Mon May 12 08:59:17 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 08:59:17 -0700 Subject: [ofa-general] Re: [PATCH] ibsim/README: Clarify point of attachment/SIM_HOST use In-Reply-To: <20080512184903.GE17046@sashak.voltaire.com> References: <1209934908.20493.186.camel@hrosenstock-ws.xsigo.com> <20080512184903.GE17046@sashak.voltaire.com> Message-ID: <1210607957.2026.501.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-12 at 18:49 +0000, Sasha Khapyorsky wrote: > On 14:01 Sun 04 May , Hal Rosenstock wrote: > > ibsim/README: Clarify point of attachment/SIM_HOST use > > > > Signed-off-by: Hal Rosenstock > > Applied. Thanks. There was a v2 of this patch with a minor change for some omitted words. -- Hal > Sasha From dave.olson at qlogic.com Mon May 12 08:48:44 2008 From: dave.olson at qlogic.com (Dave Olson) Date: Mon, 12 May 2008 08:48:44 -0700 (PDT) Subject: [ofa-general] Re: [PATCH][INFINIBAND]: Make ipath_portdata work with struct pid * not pid_t. In-Reply-To: References: <482857AE.2030904@openvz.org> Message-ID: On Mon, 12 May 2008, Roland Dreier wrote: | Seems fine to me... ipath guys, any comment? I think it would be | reasonale to include this with the other ipath fixes when I ask Linus to | pull in a day or two. I looked at the original patch, and it looks fine to me. Should be fairly easy to cover in ofed 1.4 backport patches for the older kernels. 
Dave Olson dave.olson at qlogic.com From sashak at voltaire.com Mon May 12 12:10:32 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 19:10:32 +0000 Subject: [ofa-general] Re: [PATCH] ibsim/README: Clarify point of attachment/SIM_HOST use In-Reply-To: <1210607957.2026.501.camel@hrosenstock-ws.xsigo.com> References: <1209934908.20493.186.camel@hrosenstock-ws.xsigo.com> <20080512184903.GE17046@sashak.voltaire.com> <1210607957.2026.501.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512191032.GG17046@sashak.voltaire.com> On 08:59 Mon 12 May , Hal Rosenstock wrote: > On Mon, 2008-05-12 at 18:49 +0000, Sasha Khapyorsky wrote: > > On 14:01 Sun 04 May , Hal Rosenstock wrote: > > > ibsim/README: Clarify point of attachment/SIM_HOST use > > > > > > Signed-off-by: Hal Rosenstock > > > > Applied. Thanks. > > There was a v2 of this patch with a minor change for some omitted words. Applied this too. Thanks. Sasha From hrosenstock at xsigo.com Mon May 12 09:28:16 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 09:28:16 -0700 Subject: [ofa-general] Re: [PATCH] ibsim/sim.h: Fix NodeDescription size In-Reply-To: <20080512184138.GD17046@sashak.voltaire.com> References: <1209934867.20493.183.camel@hrosenstock-ws.xsigo.com> <20080512184138.GD17046@sashak.voltaire.com> Message-ID: <1210609696.2026.508.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-12 at 18:41 +0000, Sasha Khapyorsky wrote: > On 14:01 Sun 04 May , Hal Rosenstock wrote: > > ibsim/sim.h: Fix NodeDescription size so can have maximum size > > NodeDescription per IBA spec rather than truncating them > > > > Signed-off-by: Hal Rosenstock > > > > diff --git a/ibsim/sim.h b/ibsim/sim.h > > index bea136a..dbf1220 100644 > > --- a/ibsim/sim.h > > +++ b/ibsim/sim.h > > @@ -67,7 +67,7 @@ > > > > #define NODEIDBASE 20 > > #define NODEPREFIX 20 > > -#define NODEIDLEN (NODEIDBASE+NODEPREFIX+1) > > +#define NODEIDLEN 65 > > #define ALIASLEN 40 > > nodeid filed in struct Node still have length 64, so it looks that using > NODEIDLEN value larger than this introduces overflow. I think bigger > change is needed there. I made NODEIDLEN 65 rather than 64 due to the +1 in the original define. How about defining it as 64 as in the below ? Does that get around the overflow issue ? -- Hal ibsim/sim.h: Fix NodeDescription size so can have maximum size NodeDescription per IBA spec rather than truncating them Signed-off-by: Hal Rosenstock diff --git a/ibsim/sim.h b/ibsim/sim.h index bea136a..0bf14fd 100644 --- a/ibsim/sim.h +++ b/ibsim/sim.h @@ -65,9 +65,8 @@ #define DEFAULT_LINKWIDTH LINKWIDTH_4x #define DEFAULT_LINKSPEED LINKSPEED_SDR -#define NODEIDBASE 20 #define NODEPREFIX 20 -#define NODEIDLEN (NODEIDBASE+NODEPREFIX+1) +#define NODEIDLEN 64 #define ALIASLEN 40 #define MAXHOPS 16 > Sasha From jon at opengridcomputing.com Mon May 12 09:57:38 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Mon, 12 May 2008 11:57:38 -0500 Subject: [ofa-general] RDS flow control Message-ID: <200805121157.38135.jon@opengridcomputing.com> As part of my effort to get RDS working for iWARP, I will be working on the RDS flow control. Flow control is needed for iWARP due to the fact that iWARP connections terminate if there is no posted recv for an incoming packet. IB connections do not have this limitation if setup in a certain way. In its current implementation, RDS sets the connection attribute rnr_retry to 7. This causes IB to retransmit until there is a posted recv buffer. 
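As a rough illustration of that setting, here is a sketch against the kernel rdma_cm connection parameters; the function name and the values are illustrative, not actual RDS code:

#include <linux/string.h>
#include <rdma/rdma_cm.h>

static void example_conn_params(struct rdma_conn_param *param)
{
	memset(param, 0, sizeof(*param));
	param->responder_resources = 4;
	param->initiator_depth = 4;
	param->retry_count = 7;		/* transport-level retries */
	/* 7 means "retry indefinitely" on RNR NAK; this is the IB-only
	 * behaviour that credit-based flow control must replace for iWARP */
	param->rnr_retry_count = 7;
}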
Using a credit based flow control mechanism, we can ensure there will be a posted recv for every incoming packet (thus laying part of the foundation of allowing iWARP to work). Also, it will reduce unnecessary IB transport traffic (at the expense of maintaining the credit schema). I am still in the very early stages of implementing this. So any pointers to RDS documentation (or a RDS git tree) would be very helpful. I have a small IB setup to test this on, so anyone willing to test it when I am done would be helpful as well. Thanks, Jon From richard.frank at oracle.com Mon May 12 10:08:06 2008 From: richard.frank at oracle.com (Richard Frank) Date: Mon, 12 May 2008 13:08:06 -0400 Subject: [ofa-general] RDS flow control In-Reply-To: <200805121157.38135.jon@opengridcomputing.com> References: <200805121157.38135.jon@opengridcomputing.com> Message-ID: <48287976.10403@oracle.com> We should define a set of performance criteria / tests to ensure we do not impact our current performance with IB... An alternative would be to push this into an IWARP specific module....and if works well there - we might then want to move it to generic RDS layer ? As an example, the TCP transport for RDS - handles flow control internally.. Jon Mason wrote: > As part of my effort to get RDS working for iWARP, I will be working on the > RDS flow control. Flow control is needed for iWARP due to the fact that > iWARP connections terminate if there is no posted recv for an incoming > packet. IB connections do not have this limitation if setup in a certain > way. In its current implementation, RDS sets the connection attribute > rnr_retry to 7. This causes IB to retransmit until there is a posted recv > buffer. > > Using a credit based flow control mechanism, we can ensure there will be a > posted recv for every incoming packet (thus laying part of the foundation of > allowing iWARP to work). Also, it will reduce unnecessary IB transport > traffic (at the expense of maintaining the credit schema). > > I am still in the very early stages of implementing this. So any pointers to > RDS documentation (or a RDS git tree) would be very helpful. I have a small > IB setup to test this on, so anyone willing to test it when I am done would > be helpful as well. > > Thanks, > Jon > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Mon May 12 14:25:36 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 21:25:36 +0000 Subject: [ofa-general] Re: [PATCH] ibsim/sim.h: Fix NodeDescription size In-Reply-To: <1210609696.2026.508.camel@hrosenstock-ws.xsigo.com> References: <1209934867.20493.183.camel@hrosenstock-ws.xsigo.com> <20080512184138.GD17046@sashak.voltaire.com> <1210609696.2026.508.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512212536.GJ17046@sashak.voltaire.com> On 09:28 Mon 12 May , Hal Rosenstock wrote: > > I made NODEIDLEN 65 rather than 64 due to the +1 in the original define. > How about defining it as 64 as in the below ? Does that get around the > overflow issue ? 
> > -- Hal > > ibsim/sim.h: Fix NodeDescription size so can have maximum size > NodeDescription per IBA spec rather than truncating them > > Signed-off-by: Hal Rosenstock > > diff --git a/ibsim/sim.h b/ibsim/sim.h > index bea136a..0bf14fd 100644 > --- a/ibsim/sim.h > +++ b/ibsim/sim.h > @@ -65,9 +65,8 @@ > #define DEFAULT_LINKWIDTH LINKWIDTH_4x > #define DEFAULT_LINKSPEED LINKSPEED_SDR > > -#define NODEIDBASE 20 > #define NODEPREFIX 20 > -#define NODEIDLEN (NODEIDBASE+NODEPREFIX+1) > +#define NODEIDLEN 64 > #define ALIASLEN 40 It is likely will prevent overflow, but will potentially truncate last NodeDesc character due to string NULL terminator. What about something like below? Sasha diff --git a/ibsim/sim.h b/ibsim/sim.h index 81bb47c..d3294f4 100644 --- a/ibsim/sim.h +++ b/ibsim/sim.h @@ -67,7 +67,7 @@ #define NODEIDBASE 20 #define NODEPREFIX 20 -#define NODEIDLEN (NODEIDBASE+NODEPREFIX+1) +#define NODEIDLEN 65 #define ALIASLEN 40 #define MAXHOPS 16 @@ -237,7 +237,7 @@ struct Node { uint64_t sysguid; uint64_t nodeguid; // also portguid int portsbase; // in port table - char nodeid[64]; // contain nodeid[NODEIDLEN] + char nodeid[NODEIDLEN]; // contain nodeid[NODEIDLEN] uint8_t nodeinfo[64]; char nodedesc[64]; Switch *sw; diff --git a/ibsim/sim_cmd.c b/ibsim/sim_cmd.c index 5f64229..fe3e9be 100644 --- a/ibsim/sim_cmd.c +++ b/ibsim/sim_cmd.c @@ -56,7 +56,7 @@ extern Port *ports; extern Port **lids; extern int netnodes, netports, netswitches; -#define NAMELEN 64 +#define NAMELEN NODEIDLEN char *portstates[] = { "-", "Down", "Init", "Armed", "Active", diff --git a/ibsim/sim_net.c b/ibsim/sim_net.c index bf7a06a..2a9c19b 100644 --- a/ibsim/sim_net.c +++ b/ibsim/sim_net.c @@ -267,17 +267,16 @@ static int new_hca(Node * nd) return 0; } -static int build_nodeid(char *nodeid, char *base) +static int build_nodeid(char *nodeid, size_t len, char *base) { if (strchr(base, '#') || strchr(base, '@')) { IBWARN("bad nodeid \"%s\": '#' & '@' characters are reserved", base); return -1; } - if (netprefix[0] == 0) - strncpy(nodeid, base, NODEIDLEN); - else - snprintf(nodeid, NODEIDLEN, "%s#%s", netprefix, base); + + snprintf(nodeid, len, "%s%s%s", netprefix, *netprefix ? "#" : "", base); + return 0; } @@ -287,7 +286,7 @@ static Node *new_node(int type, char *nodename, char *nodedesc, int nodeports) char nodeid[NODEIDLEN]; Node *nd; - if (build_nodeid(nodeid, nodename) < 0) + if (build_nodeid(nodeid, sizeof(nodeid), nodename) < 0) return 0; if (find_node(nodeid)) { @@ -310,11 +309,9 @@ static Node *new_node(int type, char *nodename, char *nodedesc, int nodeports) nd->type = type; nd->numports = nodeports; - strncpy(nd->nodeid, nodeid, NODEIDLEN - 1); - if (nodedesc && nodedesc[0]) - strncpy(nd->nodedesc, nodedesc, NODEIDLEN - 1); - else - strncpy(nd->nodedesc, nodeid, NODEIDLEN - 1); + strncpy(nd->nodeid, nodeid, sizeof(nd->nodeid) - 1); + strncpy(nd->nodedesc, nodedesc && *nodedesc ? nodedesc : nodeid, + sizeof(nd->nodedesc) - 1); nd->sysguid = nd->nodeguid = guids[type]; if (type == SWITCH_NODE) { nodeports++; // port 0 is SMA @@ -551,22 +548,20 @@ char *expand_name(char *base, char *name, char **portstr) if (netprefix[0] != 0 && !strchr(base, '#')) snprintf(name, NODEIDLEN, "%s#%s", netprefix, base); else - strcpy(name, base); + strncpy(name, base, NODEIDLEN - 1); if (portstr) - *portstr = 0; + *portstr = NULL; PDEBUG("name %s port %s", name, portstr ? 
*portstr : 0); return name; } - if (base[0] == '@') - snprintf(name, ALIASLEN, "%s%s", netprefix, base); - else - strcpy(name, base); + + snprintf(name, NODEIDLEN, "%s%s", base[0] == '@' ? netprefix : "", base); PDEBUG("alias %s", name); if (!(s = map_alias(name))) return 0; - strcpy(name, s); + strncpy(name, s, NODEIDLEN - 1); if (portstr) { *portstr = name; @@ -1075,12 +1070,12 @@ int link_ports(Port * lport, Port * rport) lport->remotenode = rnode; lport->remoteport = rport->portnum; set_portinfo(lport, lnode->type == SWITCH_NODE ? swport : hcaport); - memcpy(lport->remotenodeid, rnode->nodeid, NODEIDLEN); + memcpy(lport->remotenodeid, rnode->nodeid, sizeof(lport->remotenodeid)); rport->remotenode = lnode; rport->remoteport = lport->portnum; set_portinfo(rport, rnode->type == SWITCH_NODE ? swport : hcaport); - memcpy(rport->remotenodeid, lnode->nodeid, NODEIDLEN); + memcpy(rport->remotenodeid, lnode->nodeid, sizeof(rport->remotenodeid)); lport->state = rport->state = 2; // Initialilze lport->physstate = rport->physstate = 5; // LinkUP if (lnode->sw) @@ -1166,7 +1161,7 @@ int connect_ports(void) } } else if (remoteport->remoteport != port->portnum || strncmp(remoteport->remotenodeid, port->node->nodeid, - NODEIDLEN)) { + sizeof(remoteport->remotenodeid))) { IBWARN ("remote port %d in node \"%s\" is not connected to " "node \"%s\" port %d (\"%s\" %d)", From arlin.r.davis at intel.com Mon May 12 11:29:35 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 12 May 2008 11:29:35 -0700 Subject: [ofa-general] [PATCH 1/2][dat1.2] dapl: change cma provider to use max_rdma_read_in, out from ep_attr instead of HCA max values when connecting. Message-ID: <000001c8b45e$217943e0$9f97070a@amr.corp.intel.com> Signed-off by: Arlin Davis ardavis at ichips.intel.com --- dapl/openib_cma/dapl_ib_cm.c | 9 ++++----- 1 files changed, 4 insertions(+), 5 deletions(-) diff --git a/dapl/openib_cma/dapl_ib_cm.c b/dapl/openib_cma/dapl_ib_cm.c index f08ee4b..f2eb8cb 100755 --- a/dapl/openib_cma/dapl_ib_cm.c +++ b/dapl/openib_cma/dapl_ib_cm.c @@ -404,9 +404,6 @@ static void dapli_cm_passive_cb(struct dapl_cm_id *conn, struct rdma_cm_event *event) { struct dapl_cm_id *new_conn; -#ifdef DAPL_DBG - struct rdma_addr *ipaddr = &conn->cm_id->route.addr; -#endif dapl_dbg_log(DAPL_DBG_TYPE_CM, " passive_cb: conn %p id %d event %d\n", @@ -539,8 +536,10 @@ DAT_RETURN dapls_ib_connect(IN DAT_EP_HANDLE ep_handle, /* Setup QP/CM parameters and private data in cm_id */ (void)dapl_os_memzero(&conn->params, sizeof(conn->params)); - conn->params.responder_resources = conn->hca->ib_trans.max_rdma_rd_in; - conn->params.initiator_depth = conn->hca->ib_trans.max_rdma_rd_out; + conn->params.responder_resources = + ep_ptr->param.ep_attr.max_rdma_read_in; + conn->params.initiator_depth = + ep_ptr->param.ep_attr.max_rdma_read_out; conn->params.flow_control = 1; conn->params.rnr_retry_count = IB_RNR_RETRY_COUNT; conn->params.retry_count = IB_RC_RETRY_COUNT; -- 1.5.2.5 From arlin.r.davis at intel.com Mon May 12 11:29:40 2008 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Mon, 12 May 2008 11:29:40 -0700 Subject: [ofa-general] [PATCH 2/2][dat1.2] dapl: Fix long delays with the cma provider open call when DNS is not configure on server. Message-ID: Open call should default to netdev names when resolving local IP address for cma binding to match dat.conf settings. 
The open code attempts to resolve with IP or Hostname first and if there is no DNS services setup the failover to netdev name resolution is delayed for as much as 20 seconds. Signed-off by: Arlin Davis ardavis at ichips.intel.com --- dapl/openib_cma/dapl_ib_util.c | 36 ++++++++++++++++-------------------- 1 files changed, 16 insertions(+), 20 deletions(-) diff --git a/dapl/openib_cma/dapl_ib_util.c b/dapl/openib_cma/dapl_ib_util.c index 4de5a2c..e76e319 100755 --- a/dapl/openib_cma/dapl_ib_util.c +++ b/dapl/openib_cma/dapl_ib_util.c @@ -117,28 +117,24 @@ bail: static int getipaddr(char *name, char *addr, int len) { struct addrinfo *res; - int ret; - /* Assume network name and address type for first attempt */ - if (getaddrinfo(name, NULL, NULL, &res)) { - /* retry using network device name */ - ret = getipaddr_netdev(name,addr,len); - if (ret) { + /* assume netdev for first attempt, then network and address type */ + if (getipaddr_netdev(name,addr,len)) { + if (getaddrinfo(name, NULL, NULL, &res)) { dapl_log(DAPL_DBG_TYPE_ERR, - " open_hca: getaddr_netdev ERROR:" - " %s. Is %s configured?\n", - strerror(errno), name); - return ret; - } - } else { - if (len >= res->ai_addrlen) - memcpy(addr, res->ai_addr, res->ai_addrlen); - else { + " open_hca: getaddr_netdev ERROR:" + " %s. Is %s configured?\n", + strerror(errno), name); + return 1; + } else { + if (len >= res->ai_addrlen) + memcpy(addr, res->ai_addr, res->ai_addrlen); + else { + freeaddrinfo(res); + return 1; + } freeaddrinfo(res); - return EINVAL; } - - freeaddrinfo(res); } dapl_dbg_log(DAPL_DBG_TYPE_UTIL, @@ -642,7 +638,7 @@ DAT_RETURN dapli_ib_thread_init(void) while (g_ib_thread_state != IB_THREAD_RUN) { struct timespec sleep, remain; sleep.tv_sec = 0; - sleep.tv_nsec = 20000000; /* 20 ms */ + sleep.tv_nsec = 2000000; /* 2 ms */ dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread_init: waiting for ib_thread\n"); dapl_os_unlock(&g_hca_lock); @@ -679,7 +675,7 @@ void dapli_ib_thread_destroy(void) while ((g_ib_thread_state != IB_THREAD_EXIT) && (retries--)) { struct timespec sleep, remain; sleep.tv_sec = 0; - sleep.tv_nsec = 20000000; /* 20 ms */ + sleep.tv_nsec = 2000000; /* 2 ms */ dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread_destroy: waiting for ib_thread\n"); write(g_ib_pipe[1], "w", sizeof "w"); -- 1.5.2.5 From chu11 at llnl.gov Mon May 12 11:33:45 2008 From: chu11 at llnl.gov (Al Chu) Date: Mon, 12 May 2008 11:33:45 -0700 Subject: [ofa-general] Re: [RFC][PATCH 0/4] opensm: using conventional config file In-Reply-To: <1207703425-19039-1-git-send-email-sashak@voltaire.com> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> Message-ID: <1210617225.11133.461.camel@cardanus.llnl.gov> Hey Sasha, Ira and I were chatting. A few other comments: 1) Many configuration values are not output by default in opensm right now, mainly b/c it behaves like a cache rather than an configuration file. i.e. if (p_opts->connect_roots) fprintf(opts_file, "# Connect roots (use FALSE if unsure)\n" "connect_roots %s\n\n", p_opts->connect_roots ? "TRUE" : "FALSE"); Going forward w/ a config file, I think these should be output by default all the time so users know they exist. 2) Will there be an option to specify an alternate configuration file, i.e. not /etc/opensm/opensm.conf? Al On Wed, 2008-04-09 at 01:10 +0000, Sasha Khapyorsky wrote: > Hi, > > This is attempt to make some order with OpenSM configuration. 
Now it > will use conventional (similar to another programs which may have > configuration) config ($sysconfig/etc/opensm/opensm.conf) file instead > of option cache file. Config file for some startup scripts should go > away. Option '-c' is preserved - it can be useful for config file > template generation, but OpenSM will not try to read option cache file. > > This is RFC yet. In addition to this we will need to update scripts and > man pages. > > Any feedback? Thoughts? > > Sasha -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From chu11 at llnl.gov Mon May 12 11:33:45 2008 From: chu11 at llnl.gov (Al Chu) Date: Mon, 12 May 2008 11:33:45 -0700 Subject: [ofa-general] Re: [RFC][PATCH 0/4] opensm: using conventional config file In-Reply-To: <1207703425-19039-1-git-send-email-sashak@voltaire.com> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> Message-ID: <1210617225.11133.461.camel@cardanus.llnl.gov> Hey Sasha, Ira and I were chatting. A few other comments: 1) Many configuration values are not output by default in opensm right now, mainly b/c it behaves like a cache rather than an configuration file. i.e. if (p_opts->connect_roots) fprintf(opts_file, "# Connect roots (use FALSE if unsure)\n" "connect_roots %s\n\n", p_opts->connect_roots ? "TRUE" : "FALSE"); Going forward w/ a config file, I think these should be output by default all the time so users know they exist. 2) Will there be an option to specify an alternate configuration file, i.e. not /etc/opensm/opensm.conf? Al On Wed, 2008-04-09 at 01:10 +0000, Sasha Khapyorsky wrote: > Hi, > > This is attempt to make some order with OpenSM configuration. Now it > will use conventional (similar to another programs which may have > configuration) config ($sysconfig/etc/opensm/opensm.conf) file instead > of option cache file. Config file for some startup scripts should go > away. Option '-c' is preserved - it can be useful for config file > template generation, but OpenSM will not try to read option cache file. > > This is RFC yet. In addition to this we will need to update scripts and > man pages. > > Any feedback? Thoughts? > > Sasha -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From hrosenstock at xsigo.com Mon May 12 12:00:13 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 12:00:13 -0700 Subject: [ofa-general] Obtaining RDMA statistics for an InfiniBand interface In-Reply-To: <20080512085415.GF5226@cefeid.wcss.wroc.pl> References: <1E3DCD1C63492545881FACB6063A57C102221AD0@mtiexch01.mti.com> <1205844786.11393.121.camel@hrosenstock-ws.xsigo.com> <20080512085415.GF5226@cefeid.wcss.wroc.pl> Message-ID: <1210618813.2026.551.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-12 at 10:54 +0200, Pawel Dziekonski wrote: > On Tue, 18 Mar 2008 at 05:53:06AM -0700, Hal Rosenstock wrote: > > On Tue, 2008-03-18 at 09:17 +0100, Bart Van Assche wrote: > > > On Mon, Mar 17, 2008 at 6:07 PM, Boris Shpolyansky wrote: > > > > Check : > > > > > > > > /sys/class/infiniband/[mlx4_*|mthca*]/ports/*/counters/* > > > > > > Thanks, these are interesting counters. 
Unfortunately these counters > > > are 32-bit counters and already overflowed during the test I ran (less > > > than one day of SRP communication): > > > > > > $ uname -r > > > 2.6.24 > > > $ uname -m > > > x86_64 > > > $ head /sys/class/infiniband/mthca0/ports/1/counters/*_{data,packets} > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_data <== > > > 4294967295 > > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_data <== > > > 4294967295 > > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_packets <== > > > 4294967295 > > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_packets <== > > > 4294967295 > > > > Depending on which fabric management (usually bundled with the SM) is > > being used, you may be able to obtain this information via the > > Performance Manager (and not be limited to 32 bit counters). > > Hi, > > I have exactly the same problem on OFED 1.2.5.5, redhat kernel > 2.6.9-55.0.12.ELsmp on X86_64 machines. {data,packets} files contain > 4294967295 value, or barely change. :( > > How can I get Performance Manager running and printing some reasonable > numbers? Are you using OpenSM ? If so, which version ? -- Hal > regards, P From hrosenstock at xsigo.com Mon May 12 12:05:17 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 12:05:17 -0700 Subject: [ofa-general] OpenSM SA dump ? In-Reply-To: <829ded920805112330w2378e795qe5342b50b1c4aff0@mail.gmail.com> References: <829ded920805112330w2378e795qe5342b50b1c4aff0@mail.gmail.com> Message-ID: <1210619117.2026.554.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-12 at 12:00 +0530, Keshetti Mahesh wrote: > When I ran opensm with '-D 0x43' option, it generated opensm-sa.dump > file in /var/log directory. But to my surprise 'opensm-sa.dump' file is very > small in size and it contained information of MC groups only. Only multicast, services, and informs are dumped. These are the so called client registrations. > Do I need > to give more options to get detailed information of SA dump ? What SA information are you looking for ? -- Hal > Also, is there any way to dump the local SA cache to a file in the > OFED-1.3 implementation ? > > -Mahesh > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From pawel.dziekonski at wcss.pl Mon May 12 12:05:35 2008 From: pawel.dziekonski at wcss.pl (Pawel Dziekonski) Date: Mon, 12 May 2008 21:05:35 +0200 Subject: [ofa-general] Obtaining RDMA statistics for an InfiniBand interface In-Reply-To: <1210618813.2026.551.camel@hrosenstock-ws.xsigo.com> References: <1E3DCD1C63492545881FACB6063A57C102221AD0@mtiexch01.mti.com> <1205844786.11393.121.camel@hrosenstock-ws.xsigo.com> <20080512085415.GF5226@cefeid.wcss.wroc.pl> <1210618813.2026.551.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512190535.GA24024@cefeid.wcss.wroc.pl> On Mon, 12 May 2008 at 12:00:13PM -0700, Hal Rosenstock wrote: > On Mon, 2008-05-12 at 10:54 +0200, Pawel Dziekonski wrote: > > On Tue, 18 Mar 2008 at 05:53:06AM -0700, Hal Rosenstock wrote: > > > On Tue, 2008-03-18 at 09:17 +0100, Bart Van Assche wrote: > > > > On Mon, Mar 17, 2008 at 6:07 PM, Boris Shpolyansky wrote: > > > > > Check : > > > > > > > > > > /sys/class/infiniband/[mlx4_*|mthca*]/ports/*/counters/* > > > > > > > > Thanks, these are interesting counters. 
Unfortunately these counters > > > > are 32-bit counters and already overflowed during the test I ran (less > > > > than one day of SRP communication): > > > > > > > > $ uname -r > > > > 2.6.24 > > > > $ uname -m > > > > x86_64 > > > > $ head /sys/class/infiniband/mthca0/ports/1/counters/*_{data,packets} > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_data <== > > > > 4294967295 > > > > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_data <== > > > > 4294967295 > > > > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_packets <== > > > > 4294967295 > > > > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_packets <== > > > > 4294967295 > > > > > > Depending on which fabric management (usually bundled with the SM) is > > > being used, you may be able to obtain this information via the > > > Performance Manager (and not be limited to 32 bit counters). > > > > Hi, > > > > I have exactly the same problem on OFED 1.2.5.5, redhat kernel > > 2.6.9-55.0.12.ELsmp on X86_64 machines. {data,packets} files contain > > 4294967295 value, or barely change. :( > > > > How can I get Performance Manager running and printing some reasonable > > numbers? > > Are you using OpenSM ? If so, which version ? yes, opensm-3.0.3. P -- Pawel Dziekonski Wroclaw Centre for Networking & Supercomputing, HPC Department Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl From hrosenstock at xsigo.com Mon May 12 12:13:14 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 12:13:14 -0700 Subject: [ofa-general] Obtaining RDMA statistics for an InfiniBand interface In-Reply-To: <20080512190535.GA24024@cefeid.wcss.wroc.pl> References: <1E3DCD1C63492545881FACB6063A57C102221AD0@mtiexch01.mti.com> <1205844786.11393.121.camel@hrosenstock-ws.xsigo.com> <20080512085415.GF5226@cefeid.wcss.wroc.pl> <1210618813.2026.551.camel@hrosenstock-ws.xsigo.com> <20080512190535.GA24024@cefeid.wcss.wroc.pl> Message-ID: <1210619594.2026.563.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-12 at 21:05 +0200, Pawel Dziekonski wrote: > On Mon, 12 May 2008 at 12:00:13PM -0700, Hal Rosenstock wrote: > > On Mon, 2008-05-12 at 10:54 +0200, Pawel Dziekonski wrote: > > > On Tue, 18 Mar 2008 at 05:53:06AM -0700, Hal Rosenstock wrote: > > > > On Tue, 2008-03-18 at 09:17 +0100, Bart Van Assche wrote: > > > > > On Mon, Mar 17, 2008 at 6:07 PM, Boris Shpolyansky wrote: > > > > > > Check : > > > > > > > > > > > > /sys/class/infiniband/[mlx4_*|mthca*]/ports/*/counters/* > > > > > > > > > > Thanks, these are interesting counters. 
Unfortunately these counters > > > > > are 32-bit counters and already overflowed during the test I ran (less > > > > > than one day of SRP communication): > > > > > > > > > > $ uname -r > > > > > 2.6.24 > > > > > $ uname -m > > > > > x86_64 > > > > > $ head /sys/class/infiniband/mthca0/ports/1/counters/*_{data,packets} > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_data <== > > > > > 4294967295 > > > > > > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_data <== > > > > > 4294967295 > > > > > > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_packets <== > > > > > 4294967295 > > > > > > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_packets <== > > > > > 4294967295 > > > > > > > > Depending on which fabric management (usually bundled with the SM) is > > > > being used, you may be able to obtain this information via the > > > > Performance Manager (and not be limited to 32 bit counters). > > > > > > Hi, > > > > > > I have exactly the same problem on OFED 1.2.5.5, redhat kernel > > > 2.6.9-55.0.12.ELsmp on X86_64 machines. {data,packets} files contain > > > 4294967295 value, or barely change. :( > > > > > > How can I get Performance Manager running and printing some reasonable > > > numbers? > > > > Are you using OpenSM ? If so, which version ? > > yes, opensm-3.0.3. 3.0.3 or 3.0.13 ? Anyway, PerfMgr was not part of those 3.0.x versions. I think it was added in the 3.1 series (and is available in OFED 1.3) or 3.2 series (trunk). Is upgrading a possibility ? If so, you might want to check out Ira's response on how to do this (and also what the output looks like to make sure it can meet your needs). -- Hal > P From hrosenstock at xsigo.com Mon May 12 12:20:25 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 12:20:25 -0700 Subject: [ofa-general] Re: [PATCH] ibsim/sim.h: Fix NodeDescription size In-Reply-To: <20080512212536.GJ17046@sashak.voltaire.com> References: <1209934867.20493.183.camel@hrosenstock-ws.xsigo.com> <20080512184138.GD17046@sashak.voltaire.com> <1210609696.2026.508.camel@hrosenstock-ws.xsigo.com> <20080512212536.GJ17046@sashak.voltaire.com> Message-ID: <1210620025.2026.567.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-12 at 21:25 +0000, Sasha Khapyorsky wrote: > On 09:28 Mon 12 May , Hal Rosenstock wrote: > > > > I made NODEIDLEN 65 rather than 64 due to the +1 in the original define. > > How about defining it as 64 as in the below ? Does that get around the > > overflow issue ? > > > > -- Hal > > > > ibsim/sim.h: Fix NodeDescription size so can have maximum size > > NodeDescription per IBA spec rather than truncating them > > > > Signed-off-by: Hal Rosenstock > > > > diff --git a/ibsim/sim.h b/ibsim/sim.h > > index bea136a..0bf14fd 100644 > > --- a/ibsim/sim.h > > +++ b/ibsim/sim.h > > @@ -65,9 +65,8 @@ > > #define DEFAULT_LINKWIDTH LINKWIDTH_4x > > #define DEFAULT_LINKSPEED LINKSPEED_SDR > > > > -#define NODEIDBASE 20 > > #define NODEPREFIX 20 I think this can now be eliminated as the only use was in NODEIDLEN. > > -#define NODEIDLEN (NODEIDBASE+NODEPREFIX+1) > > +#define NODEIDLEN 64 > > #define ALIASLEN 40 > > It will likely prevent overflow, but will potentially truncate last > NodeDesc character due to string NULL terminator. What about something > like below? Thanks. This seems to work for my usage but not sure about some of the other concatenated names and whether it accommodates those other usages which I don't fully understand.
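(A small, self-contained illustration of the terminator issue under discussion, not ibsim code: why a 64-byte NodeDescription needs a 65-byte buffer, and strncpy's non-termination pitfall. The copy_nodeid name is assumed for illustration.)

#include <string.h>

#define NODEDESC_LEN 64                 /* bytes in the wire field */
#define NODEIDLEN    (NODEDESC_LEN + 1) /* + 1 for the NUL terminator */

/* strncpy() does not NUL-terminate when strlen(src) >= n, so either
 * terminate explicitly, as here, or use the "n - 1" idiom from the
 * patch above. */
static void copy_nodeid(char dst[NODEIDLEN], const char *src)
{
        strncpy(dst, src, NODEIDLEN - 1);
        dst[NODEIDLEN - 1] = '\0';
}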
-- Hal > Sasha > > > diff --git a/ibsim/sim.h b/ibsim/sim.h > index 81bb47c..d3294f4 100644 > --- a/ibsim/sim.h > +++ b/ibsim/sim.h > @@ -67,7 +67,7 @@ > > #define NODEIDBASE 20 > #define NODEPREFIX 20 > -#define NODEIDLEN (NODEIDBASE+NODEPREFIX+1) > +#define NODEIDLEN 65 > #define ALIASLEN 40 > > #define MAXHOPS 16 > @@ -237,7 +237,7 @@ struct Node { > uint64_t sysguid; > uint64_t nodeguid; // also portguid > int portsbase; // in port table > - char nodeid[64]; // contain nodeid[NODEIDLEN] > + char nodeid[NODEIDLEN]; // contain nodeid[NODEIDLEN] > uint8_t nodeinfo[64]; > char nodedesc[64]; > Switch *sw; > diff --git a/ibsim/sim_cmd.c b/ibsim/sim_cmd.c > index 5f64229..fe3e9be 100644 > --- a/ibsim/sim_cmd.c > +++ b/ibsim/sim_cmd.c > @@ -56,7 +56,7 @@ extern Port *ports; > extern Port **lids; > extern int netnodes, netports, netswitches; > > -#define NAMELEN 64 > +#define NAMELEN NODEIDLEN > > char *portstates[] = { > "-", "Down", "Init", "Armed", "Active", > diff --git a/ibsim/sim_net.c b/ibsim/sim_net.c > index bf7a06a..2a9c19b 100644 > --- a/ibsim/sim_net.c > +++ b/ibsim/sim_net.c > @@ -267,17 +267,16 @@ static int new_hca(Node * nd) > return 0; > } > > -static int build_nodeid(char *nodeid, char *base) > +static int build_nodeid(char *nodeid, size_t len, char *base) > { > if (strchr(base, '#') || strchr(base, '@')) { > IBWARN("bad nodeid \"%s\": '#' & '@' characters are reserved", > base); > return -1; > } > - if (netprefix[0] == 0) > - strncpy(nodeid, base, NODEIDLEN); > - else > - snprintf(nodeid, NODEIDLEN, "%s#%s", netprefix, base); > + > + snprintf(nodeid, len, "%s%s%s", netprefix, *netprefix ? "#" : "", base); > + > return 0; > } > > @@ -287,7 +286,7 @@ static Node *new_node(int type, char *nodename, char *nodedesc, int nodeports) > char nodeid[NODEIDLEN]; > Node *nd; > > - if (build_nodeid(nodeid, nodename) < 0) > + if (build_nodeid(nodeid, sizeof(nodeid), nodename) < 0) > return 0; > > if (find_node(nodeid)) { > @@ -310,11 +309,9 @@ static Node *new_node(int type, char *nodename, char *nodedesc, int nodeports) > > nd->type = type; > nd->numports = nodeports; > - strncpy(nd->nodeid, nodeid, NODEIDLEN - 1); > - if (nodedesc && nodedesc[0]) > - strncpy(nd->nodedesc, nodedesc, NODEIDLEN - 1); > - else > - strncpy(nd->nodedesc, nodeid, NODEIDLEN - 1); > + strncpy(nd->nodeid, nodeid, sizeof(nd->nodeid) - 1); > + strncpy(nd->nodedesc, nodedesc && *nodedesc ? nodedesc : nodeid, > + sizeof(nd->nodedesc) - 1); > nd->sysguid = nd->nodeguid = guids[type]; > if (type == SWITCH_NODE) { > nodeports++; // port 0 is SMA > @@ -551,22 +548,20 @@ char *expand_name(char *base, char *name, char **portstr) > if (netprefix[0] != 0 && !strchr(base, '#')) > snprintf(name, NODEIDLEN, "%s#%s", netprefix, base); > else > - strcpy(name, base); > + strncpy(name, base, NODEIDLEN - 1); > if (portstr) > - *portstr = 0; > + *portstr = NULL; > PDEBUG("name %s port %s", name, portstr ? *portstr : 0); > return name; > } > - if (base[0] == '@') > - snprintf(name, ALIASLEN, "%s%s", netprefix, base); > - else > - strcpy(name, base); > + > + snprintf(name, NODEIDLEN, "%s%s", base[0] == '@' ? netprefix : "", base); > PDEBUG("alias %s", name); > > if (!(s = map_alias(name))) > return 0; > > - strcpy(name, s); > + strncpy(name, s, NODEIDLEN - 1); > > if (portstr) { > *portstr = name; > @@ -1075,12 +1070,12 @@ int link_ports(Port * lport, Port * rport) > lport->remotenode = rnode; > lport->remoteport = rport->portnum; > set_portinfo(lport, lnode->type == SWITCH_NODE ? 
swport : hcaport); > - memcpy(lport->remotenodeid, rnode->nodeid, NODEIDLEN); > + memcpy(lport->remotenodeid, rnode->nodeid, sizeof(lport->remotenodeid)); > > rport->remotenode = lnode; > rport->remoteport = lport->portnum; > set_portinfo(rport, rnode->type == SWITCH_NODE ? swport : hcaport); > - memcpy(rport->remotenodeid, lnode->nodeid, NODEIDLEN); > + memcpy(rport->remotenodeid, lnode->nodeid, sizeof(rport->remotenodeid)); > lport->state = rport->state = 2; // Initialilze > lport->physstate = rport->physstate = 5; // LinkUP > if (lnode->sw) > @@ -1166,7 +1161,7 @@ int connect_ports(void) > } > } else if (remoteport->remoteport != port->portnum || > strncmp(remoteport->remotenodeid, port->node->nodeid, > - NODEIDLEN)) { > + sizeof(remoteport->remotenodeid))) { > IBWARN > ("remote port %d in node \"%s\" is not connected to " > "node \"%s\" port %d (\"%s\" %d)", > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Mon May 12 15:23:16 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 22:23:16 +0000 Subject: [ofa-general] Re: ibsim parsing question In-Reply-To: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512222316.GK17046@sashak.voltaire.com> Hi Hal, On 14:02 Sun 04 May , Hal Rosenstock wrote: > > I have a question on ibsim parsing: > > In sim_net.c:parse_port, there is the following code: > parse_opt: > line = s; > while (s && (s = strchr(s + 1, '='))) { > char *opt = s; > while (opt && !isalpha(*opt)) > opt--; > if (!opt || parse_port_opt(port, opt, s + 1) < 0) { > IBWARN("bad port option"); > return -1; > } > line = s + 1; > } > > port options appear include w for link width and s for link speed. > > An issue is that this parsing starts inside the NodeDescription. = is a > valid character there and causes an invalid port option. I can see the issue, but not sure I know best solution yet (never used 's' and 'w' options and didn't see topology files examples where it was used). > There seem to > me to be two choices here: > 1. Either ignore unknown options in parse_port_option and the rule > becomes w= and s= are invalid in the NodeDescription (which is > artificial and not really per the spec). > or Not sure that "per the spec" restriction is applicable here, it is only about simulator topology file format and this file is editable (node description value is ignored by ibsim parser in those lines anyway). > 2. Find some way to start this port option parsing past the end of the > NodeDescription. As I'm not sure about all the formats supported, I > don't know how to determine a "solid" way to get past the end of the > NodeDescription in the topology format. Do you ? Do you have examples of 'w' or 's' usage? If it is something like: [1] "S-0008f104003f15e4"[19][ext 1] w=4 s=2 # lid 460 lmc 1 "ISR9288/ISR9096 Voltaire sLB-24D" , then it should be easy separable by '#' character. 
Sasha From hrosenstock at xsigo.com Mon May 12 12:35:27 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 12:35:27 -0700 Subject: [ofa-general] Re: ibsim parsing question In-Reply-To: <20080512222316.GK17046@sashak.voltaire.com> References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> Message-ID: <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> Hi Sasha, On Mon, 2008-05-12 at 22:23 +0000, Sasha Khapyorsky wrote: > Hi Hal, > > On 14:02 Sun 04 May , Hal Rosenstock wrote: > > > > I have a question on ibsim parsing: > > > > In sim_net.c:parse_port, there is the following code: > > parse_opt: > > line = s; > > while (s && (s = strchr(s + 1, '='))) { > > char *opt = s; > > while (opt && !isalpha(*opt)) > > opt--; > > if (!opt || parse_port_opt(port, opt, s + 1) < 0) { > > IBWARN("bad port option"); > > return -1; > > } > > line = s + 1; > > } > > > > port options appear include w for link width and s for link speed. > > > > An issue is that this parsing starts inside the NodeDescription. = is a > > valid character there and causes an invalid port option. > > I can see the issue, but not sure I know best solution yet (never used > 's' and 'w' options and didn't see topology files examples where it was > used). > > > There seem to > > me to be two choices here: > > 1. Either ignore unknown options in parse_port_option and the rule > > becomes w= and s= are invalid in the NodeDescription (which is > > artificial and not really per the spec). > > or > > Not sure that "per the spec" restriction is applicable here, it is > only about simulator topology file format and this file is editable > (node description value is ignored by ibsim parser in those lines > anyway). > > > 2. Find some way to start this port option parsing past the end of the > > NodeDescription. As I'm not sure about all the formats supported, I > > don't know how to determine a "solid" way to get past the end of the > > NodeDescription in the topology format. Do you ? > > Do you have examples of 'w' or 's' usage? No. > If it is something like: > > [1] "S-0008f104003f15e4"[19][ext 1] w=4 s=2 # lid 460 lmc 1 "ISR9288/ISR9096 Voltaire sLB-24D" > > , then it should be easy separable by '#' character. The = character is part of the NodeDescription and doesn't get skipped even though it should. -- Hal > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Mon May 12 15:57:25 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 22:57:25 +0000 Subject: [ofa-general] Re: ibsim parsing question In-Reply-To: <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512225725.GN17046@sashak.voltaire.com> On 12:35 Mon 12 May , Hal Rosenstock wrote: > > The = character is part of the NodeDescription and doesn't get skipped > even though it should. I'm not following. If '=' character is used in NodeDescription why it should be skipped? 
Sasha From steiner at sgi.com Mon May 12 13:01:13 2008 From: steiner at sgi.com (Jack Steiner) Date: Mon, 12 May 2008 15:01:13 -0500 Subject: [ofa-general] Re: [PATCH 001/001] mmu-notifier-core v17 In-Reply-To: <20080509193230.GH7710@duo.random> References: <20080509193230.GH7710@duo.random> Message-ID: <20080512200113.GA31862@sgi.com> On Fri, May 09, 2008 at 09:32:30PM +0200, Andrea Arcangeli wrote: > From: Andrea Arcangeli > > With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to > pages. There are secondary MMUs (with secondary sptes and secondary > tlbs) too. sptes in the kvm case are shadow pagetables, but when I say > spte in mmu-notifier context, I mean "secondary pte". In GRU case > there's no actual secondary pte and there's only a secondary tlb > because the GRU secondary MMU has no knowledge about sptes and every > secondary tlb miss event in the MMU always generates a page fault that > has to be resolved by the CPU (this is not the case of KVM where the a > secondary tlb miss will walk sptes in hardware and it will refill the >... FYI, I applied the patch to a tree that has the GRU driver. All regression tests passed. --- jack From sashak at voltaire.com Mon May 12 16:00:09 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 23:00:09 +0000 Subject: [ofa-general] Re: [PATCH] ibsim/sim.h: Fix NodeDescription size In-Reply-To: <1210620025.2026.567.camel@hrosenstock-ws.xsigo.com> References: <1209934867.20493.183.camel@hrosenstock-ws.xsigo.com> <20080512184138.GD17046@sashak.voltaire.com> <1210609696.2026.508.camel@hrosenstock-ws.xsigo.com> <20080512212536.GJ17046@sashak.voltaire.com> <1210620025.2026.567.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512230009.GO17046@sashak.voltaire.com> On 12:20 Mon 12 May , Hal Rosenstock wrote: > > > diff --git a/ibsim/sim.h b/ibsim/sim.h > > > index bea136a..0bf14fd 100644 > > > --- a/ibsim/sim.h > > > +++ b/ibsim/sim.h > > > @@ -65,9 +65,8 @@ > > > #define DEFAULT_LINKWIDTH LINKWIDTH_4x > > > #define DEFAULT_LINKSPEED LINKSPEED_SDR > > > > > > -#define NODEIDBASE 20 > > > #define NODEPREFIX 20 > > I think this can now be eliminated as the only use was in NODEIDLEN. Agree (missed this). > > > -#define NODEIDLEN (NODEIDBASE+NODEPREFIX+1) > > > +#define NODEIDLEN 64 > > > #define ALIASLEN 40 > > > > It will likely prevent overflow, but will potentially truncate last > > NodeDesc character due to string NULL terminator. What about something > > like below? > > Thanks. This seems to work for my usage but not sure about some of the > other concatenated names and whether it accommodates those other usages > which I don't fully understand. There are no changes in the parser, just field sizes handling. Sasha From hrosenstock at xsigo.com Mon May 12 13:02:10 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 13:02:10 -0700 Subject: [ofa-general] Re: ibsim parsing question In-Reply-To: <20080512225725.GN17046@sashak.voltaire.com> References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> <20080512225725.GN17046@sashak.voltaire.com> Message-ID: <1210622530.2026.580.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-12 at 22:57 +0000, Sasha Khapyorsky wrote: > On 12:35 Mon 12 May , Hal Rosenstock wrote: > > > > The = character is part of the NodeDescription and doesn't get skipped > > even though it should. > > I'm not following.
If '=' character is used in NodeDescription why it > should be skipped? In your previous post, you wrote: "node description value is ignored by ibsim parser in those lines anyway" so that seems like it should be "skipped" rather than treating it like some keyword precedes it. -- Hal > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Mon May 12 16:10:03 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 23:10:03 +0000 Subject: [ofa-general] Re: [PATCH] ibsim/sim.h: Fix NodeDescription size In-Reply-To: <20080512230009.GO17046@sashak.voltaire.com> References: <1209934867.20493.183.camel@hrosenstock-ws.xsigo.com> <20080512184138.GD17046@sashak.voltaire.com> <1210609696.2026.508.camel@hrosenstock-ws.xsigo.com> <20080512212536.GJ17046@sashak.voltaire.com> <1210620025.2026.567.camel@hrosenstock-ws.xsigo.com> <20080512230009.GO17046@sashak.voltaire.com> Message-ID: <20080512231003.GP17046@sashak.voltaire.com> On 23:00 Mon 12 May , Sasha Khapyorsky wrote: > > > > I think this can now be eliminated as the only use was in NODEIDLEN. > > Agree (missed this). I applied both patches. Sasha From sashak at voltaire.com Mon May 12 16:12:48 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 23:12:48 +0000 Subject: [ofa-general] Re: ibsim parsing question In-Reply-To: <1210622530.2026.580.camel@hrosenstock-ws.xsigo.com> References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> <20080512225725.GN17046@sashak.voltaire.com> <1210622530.2026.580.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512231248.GQ17046@sashak.voltaire.com> On 13:02 Mon 12 May , Hal Rosenstock wrote: > > In your previous post, you wrote: > "node description value is ignored by ibsim parser in those lines > anyway" > so that seems like it should be "skipped" rather than treating it like > some keyword precedes it. Yes, that is correct. Sasha From sashak at voltaire.com Mon May 12 16:18:31 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 23:18:31 +0000 Subject: [ofa-general] Re: ibsim parsing question In-Reply-To: <20080512231248.GQ17046@sashak.voltaire.com> References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> <20080512225725.GN17046@sashak.voltaire.com> <1210622530.2026.580.camel@hrosenstock-ws.xsigo.com> <20080512231248.GQ17046@sashak.voltaire.com> Message-ID: <20080512231831.GR17046@sashak.voltaire.com> On 23:12 Mon 12 May , Sasha Khapyorsky wrote: > On 13:02 Mon 12 May , Hal Rosenstock wrote: > > > > In your previous post, you wrote: > > "node description value is ignored by ibsim parser in those lines > > anyway" > > so that seems like it should be "skipped" rather than treating it like > > some keyword precedes it. > > Yes, that is correct. Something like this should help (eg ignore "unknown" options).
Sasha diff --git a/ibsim/sim_net.c index 2a9c19b..6e3c0e9 100644 --- a/ibsim/sim_net.c +++ b/ibsim/sim_net.c @@ -432,32 +432,31 @@ static int parse_port_lid_and_lmc(Port * port, char *line) static int parse_port_opt(Port * port, char *opt, char *val) { - int width; - int speed; + int v; - if (*opt == 'w') { - width = strtoul(val, 0, 0); - if (!is_linkwidth_valid(width)) + switch (*opt) { + case 'w': + v = strtoul(val, 0, 0); + if (!is_linkwidth_valid(v)) return -1; - port->linkwidthena = width; + port->linkwidthena = v; DEBUG("port %p linkwidth enabled set to %d", port, port->linkwidthena); - return 0; - } else if (*opt == 's') { - speed = strtoul(val, 0, 0); - - if (!is_linkspeed_valid(speed)) + break; + case 's': + v = strtoul(val, 0, 0); + if (!is_linkspeed_valid(v)) return -1; - port->linkspeedena = speed; + port->linkspeedena = v; DEBUG("port %p linkspeed enabled set to %d", port, port->linkspeedena); - return 0; - } else { - IBWARN("unknown opt %c", *opt); - return -1; + break; + default: + break; } + return 0; } static void init_ports(Node * node, int type, int maxports) From hrosenstock at xsigo.com Mon May 12 13:40:04 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 13:40:04 -0700 Subject: [ofa-general] Re: ibsim parsing question In-Reply-To: <20080512231831.GR17046@sashak.voltaire.com> References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> <20080512225725.GN17046@sashak.voltaire.com> <1210622530.2026.580.camel@hrosenstock-ws.xsigo.com> <20080512231248.GQ17046@sashak.voltaire.com> Message-ID: <1210624804.2026.586.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-12 at 23:18 +0000, Sasha Khapyorsky wrote: > On 23:12 Mon 12 May , Sasha Khapyorsky wrote: > > On 13:02 Mon 12 May , Hal Rosenstock wrote: > > > > > > In your previous post, you wrote: > > > "node description value is ignored by ibsim parser in those lines > > > anyway" > > > so that seems like it should be "skipped" rather than treating it like > > > some keyword precedes it. > > > > Yes, that is correct. > > Something like this should help (eg ignore "unknown" options). Right; that's what I meant by option 1. Also, it's not really unknown options since it's part of the NodeDescription. This works as long as the "known" options (currently s= and w=) are not part of NodeDescription. It works for the real life use case that started this.
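(A small sketch of the other approach floated earlier in this thread: cutting the line at the '#' comment before scanning for '=' options, so a NodeDescription in the trailing comment can never be mistaken for an option. Illustrative only; the strip_comment name is an assumption, and this is not the code ibsim applied.)

#include <string.h>

/* Truncate a topology-file line at its '#' comment, in place.
 * Run this before the '=' scan so "w=4 s=2" is still seen but a
 * NodeDescription such as "ISR9288 x=y" in the comment is not. */
static void strip_comment(char *line)
{
        char *hash = strchr(line, '#');
        if (hash)
                *hash = '\0';
}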
-- Hal > Sasha > > > diff --git a/ibsim/sim_net.c b/ibsim/sim_net.c > index 2a9c19b..6e3c0e9 100644 > --- a/ibsim/sim_net.c > +++ b/ibsim/sim_net.c > @@ -432,32 +432,31 @@ static int parse_port_lid_and_lmc(Port * port, char *line) > > static int parse_port_opt(Port * port, char *opt, char *val) > { > - int width; > - int speed; > + int v; > > - if (*opt == 'w') { > - width = strtoul(val, 0, 0); > - if (!is_linkwidth_valid(width)) > + switch (*opt) { > + case 'w': > + v = strtoul(val, 0, 0); > + if (!is_linkwidth_valid(v)) > return -1; > > - port->linkwidthena = width; > + port->linkwidthena = v; > DEBUG("port %p linkwidth enabled set to %d", port, > port->linkwidthena); > - return 0; > - } else if (*opt == 's') { > - speed = strtoul(val, 0, 0); > - > - if (!is_linkspeed_valid(speed)) > + break; > + case 's': > + v = strtoul(val, 0, 0); > + if (!is_linkspeed_valid(v)) > return -1; > > - port->linkspeedena = speed; > + port->linkspeedena = v; > DEBUG("port %p linkspeed enabled set to %d", port, > port->linkspeedena); > - return 0; > - } else { > - IBWARN("unknown opt %c", *opt); > - return -1; > + break; > + default: > + break; > } > + return 0; > } > > static void init_ports(Node * node, int type, int maxports) > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From christophe.jaillet at wanadoo.fr Mon May 12 14:35:59 2008 From: christophe.jaillet at wanadoo.fr (Christophe Jaillet) Date: Mon, 12 May 2008 23:35:59 +0200 Subject: [ofa-general] ***SPAM*** [PATCH 1/1] infiniband/hw/nes/: avoid unnecessary memset Message-ID: <20080512213601.626C91C0008F@mwinf2103.orange.fr> From: Christophe Jaillet Hi, here is a patch against linux/drivers/infiniband/hw/nes/nes_cm.c which : 1) Remove an explicit memset(.., 0, ...) to a variable allocated with kzalloc (i.e. 'listener'). Note: this patch is based on 'linux-2.6.25.tar.bz2' Signed-off-by: Christophe Jaillet --- --- linux/drivers/infiniband/hw/nes/nes_cm.c 2008-04-17 04:49:44.000000000 +0200 +++ linux/drivers/infiniband/hw/nes/nes_cm.c.cj 2008-05-12 23:31:24.000000000 +0200 @@ -1587,7 +1587,6 @@ static struct nes_cm_listener *mini_cm_l return NULL; } - memset(listener, 0, sizeof(struct nes_cm_listener)); listener->loc_addr = htonl(cm_info->loc_addr); listener->loc_port = htons(cm_info->loc_port); listener->reused_node = 0; From weiny2 at llnl.gov Mon May 12 14:45:41 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 12 May 2008 14:45:41 -0700 Subject: [ofa-general] Re: [RFC][PATCH 0/4] opensm: using conventional config file In-Reply-To: <1210617225.11133.461.camel@cardanus.llnl.gov> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> Message-ID: <20080512144541.3879de40.weiny2@llnl.gov> Sasha, Also, I wonder if anyone would object to you applying your patches to the tree as is and we work out the details from there? I don't see anything wrong with your patches except that more work will be needed, as you said, in the man pages and scripts. After you apply your patches I think we can start in changing the man pages and scripts. Al, Tim, and I started talking about this after I ran into problems with the current config files trying to write a PerfMgr HOWTO today. :-( Ira On Mon, 12 May 2008 11:33:45 -0700 Al Chu wrote: > Hey Sasha, > > Ira and I were chatting. 
A few other comments: > > 1) Many configuration values are not output by default in opensm right > now, mainly b/c it behaves like a cache rather than a configuration > file. i.e. > > if (p_opts->connect_roots) > fprintf(opts_file, > "# Connect roots (use FALSE if unsure)\n" > "connect_roots %s\n\n", > p_opts->connect_roots ? "TRUE" : "FALSE"); > > Going forward w/ a config file, I think these should be output by > default all the time so users know they exist. > > 2) Will there be an option to specify an alternate configuration file, > i.e. not /etc/opensm/opensm.conf? > > Al > > On Wed, 2008-04-09 at 01:10 +0000, Sasha Khapyorsky wrote: > > Hi, > > > > This is an attempt to make some order with OpenSM configuration. Now it > > will use conventional (similar to other programs which may have > > configuration) config ($sysconfig/etc/opensm/opensm.conf) file instead > > of option cache file. Config file for some startup scripts should go > > away. Option '-c' is preserved - it can be useful for config file > > template generation, but OpenSM will not try to read option cache file. > > > > This is RFC yet. In addition to this we will need to update scripts and > > man pages. > > > > Any feedback? Thoughts? > > > > Sasha > -- > Albert Chu > chu11 at llnl.gov > 925-422-5311 > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > From ralph.campbell at qlogic.com Mon May 12 16:13:25 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Mon, 12 May 2008 16:13:25 -0700 Subject: [ofa-general] bitops take an unsigned long * In-Reply-To: References: <20080508222916.277649ca.akpm@linux-foundation.org> Message-ID: <1210634005.3949.26.camel@brick.pathscale.com> This change looks fine to me. ipath_sdma_status doesn't depend on hardware so changing #define IPATH_SDMA_RUNNING 62 #define IPATH_SDMA_SHUTDOWN 63 to different values is fine. Roland, do you want me to send a patch for this? On Fri, 2008-05-09 at 22:37 -0700, Roland Dreier wrote: > > Most architectures could (and should) take an unsigned long * arg for their > > bitops. x86 doesn't do this and it needs fixing. I fixed it. Infiniband > > is being a problem.
> > > drivers/infiniband/hw/ipath/ipath_driver.c: In function 'decode_sdma_errs': > > drivers/infiniband/hw/ipath/ipath_driver.c:926: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:926: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c: In function 'ipath_cancel_sends': > > drivers/infiniband/hw/ipath/ipath_driver.c:1901: warning: passing argument 2 of 'test_and_set_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:1902: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:1902: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:1934: warning: passing argument 2 of 'set_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:1949: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:1949: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:1950: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:1950: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:1955: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:1955: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c: In function 'handle_sdma_errors': > > drivers/infiniband/hw/ipath/ipath_intr.c:553: warning: passing argument 2 of '__set_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c:554: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c:554: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c: In function 'handle_sdma_intr': > > drivers/infiniband/hw/ipath/ipath_intr.c:566: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c:566: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c:570: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c:570: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c:575: warning: passing argument 2 of '__set_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c:579: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c:579: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_notify_task': > > drivers/infiniband/hw/ipath/ipath_sdma.c:200: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:200: warning: passing argument 2 of 'variable_test_bit' from 
incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_abort_task': > > drivers/infiniband/hw/ipath/ipath_sdma.c:236: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:236: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:253: warning: passing argument 2 of '__set_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:269: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:269: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:271: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:271: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:273: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:273: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:354: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:354: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'setup_sdma': > > drivers/infiniband/hw/ipath/ipath_sdma.c:504: warning: passing argument 2 of '__set_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'teardown_sdma': > > drivers/infiniband/hw/ipath/ipath_sdma.c:521: warning: passing argument 2 of '__clear_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:522: warning: passing argument 2 of '__set_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:523: warning: passing argument 2 of '__set_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'ipath_restart_sdma': > > drivers/infiniband/hw/ipath/ipath_sdma.c:608: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:608: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:609: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:609: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:612: warning: passing argument 2 of '__clear_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:613: warning: passing argument 2 of '__clear_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:614: warning: passing argument 2 of '__clear_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'ipath_sdma_verbs_send': > > 
drivers/infiniband/hw/ipath/ipath_sdma.c:691: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:691: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > So all of these are ipath warnings, seemingly all because > ipath_devdata.ipath_sdma_status is a u64. The stupid fix is to change > this declaration to unsigned long as below, but this sets a trap if the > driver is ever fixed so that it doesn't depend on 64BIT, because of > > /* bit positions for sdma_status */ > #define IPATH_SDMA_ABORTING 0 > #define IPATH_SDMA_DISARMED 1 > #define IPATH_SDMA_DISABLED 2 > #define IPATH_SDMA_LAYERBUF 3 > #define IPATH_SDMA_RUNNING 62 > #define IPATH_SDMA_SHUTDOWN 63 > > I don't see that this status is shared with hardware, and I don't see > why the RUNNING and SHUTDOWN bits need to be 62 and 63... converting to > unsigned long and moving those to bits 4 and 5 seems like it might be a > clean fix. > > The other option is to convert to a bitmap and using the bitmap > operations, which ends up being a bigger patch. > > But since I don't really understand this part of the driver, some > guidance would be helpful... > > - R. > > > diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c > index ce7b7c3..7635ace 100644 > --- a/drivers/infiniband/hw/ipath/ipath_driver.c > +++ b/drivers/infiniband/hw/ipath/ipath_driver.c > @@ -1894,7 +1894,7 @@ void ipath_cancel_sends(struct ipath_devdata *dd, int restore_sendctrl) > */ > if (dd->ipath_flags & IPATH_HAS_SEND_DMA) { > int skip_cancel; > - u64 *statp = &dd->ipath_sdma_status; > + unsigned long *statp = &dd->ipath_sdma_status; > > spin_lock_irqsave(&dd->ipath_sdma_lock, flags); > skip_cancel = > diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h > index 02b24a3..a46f8ad 100644 > --- a/drivers/infiniband/hw/ipath/ipath_kernel.h > +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h > @@ -483,7 +483,7 @@ struct ipath_devdata { > > /* SendDMA related entries */ > spinlock_t ipath_sdma_lock; > - u64 ipath_sdma_status; > + unsigned long ipath_sdma_status; > unsigned long ipath_sdma_abort_jiffies; > unsigned long ipath_sdma_abort_intr_timeout; > unsigned long ipath_sdma_buf_jiffies; > > From gsadasiv7 at gmail.com Mon May 12 16:32:26 2008 From: gsadasiv7 at gmail.com (Ganesh Sadasivan) Date: Mon, 12 May 2008 16:32:26 -0700 Subject: [ofa-general] ibv_get_cq_event blocking forever after successful ibv_post_send... In-Reply-To: References: <20070525212214.20500.qmail@station183.com> Message-ID: <532b813a0805121632p2ebccd11h80496f9b9c09e0c8@mail.gmail.com> Hi, Was there any resolution to this issue? I am seeing the exact behavior where no event is generated after doing a send. There were a few successful sends that got completion events. But it just stops without any error indication. 
I am pasting the part of the code which does this operation: create_qp () { struct ibv_qp_init_attr init_attr; init_attr.cap.max_send_wr = 20; init_attr.cap.max_recv_wr = 20; init_attr.cap.max_recv_sge = 1; init_attr.cap.max_send_sge = 1; init_attr.qp_type = IBV_QPT_RC; init_attr.send_cq = send_cq; init_attr.recv_cq = recv_cq; init_attr.sq_sig_all = 0; qp = ibv_create_qp(pd, &init_attr); if (!qp) { return 1; } attr.qp_state = IBV_QPS_INIT; attr.pkey_index = 0; attr.port_num = src_port; attr.qp_access_flags = 0; if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS)) { return 1; } attr.qp_state = IBV_QPS_RTR; attr.path_mtu = IBV_MTU_2048; attr.rq_psn = 1; attr.dest_qp_num = dst_qp_num; attr.max_dest_rd_atomic = 1; attr.ah_attr.dlid = dst_lid; attr.ah_attr.sl = serv_level; attr.ah_attr.port_num = src_port; attr.min_rnr_timer = 12; attr.ah_attr.is_global = 0; attr.ah_attr.src_path_bits = 0; if (ibv_modify_qp(qp, &attr, IBV_QP_STATE| IBV_QP_PATH_MTU| IBV_QP_RQ_PSN| IBV_QP_DEST_QPN| IBV_QP_MAX_DEST_RD_ATOMIC| IBV_QP_AV| IBV_QP_MIN_RNR_TIMER)) { return 1; } attr.qp_state = IBV_QPS_RTS; attr.timeout = 10; attr.retry_cnt = 7; attr.rnr_retry = 7; attr.sq_psn = 1; attr.max_rd_atomic = 1; if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC)) { return 1; } } send_data(char *buf, int datasz, void *arg) { int ret; /* * Save the WR-id so that we can compare against this * once tx is done. */ sq_wr_id[tail] = global_cnt++; send_sgl[tail].addr = (u64) (unsigned long) buf; send_sgl[tail].length = datasz; send_sgl[tail].lkey = send_mr->lkey; sq_wr[tail].opcode = IBV_WR_SEND; sq_wr[tail].send_flags = IBV_SEND_SIGNALED; sq_wr[tail].sg_list = &send_sgl[tail]; sq_wr[tail].num_sge = 1; send_data[tail] = (u64)buf; send_arg[tail] = arg; ret = ibv_post_send(qp, &sq_wr[tail], &bad_wr); if (tail == 19) { //max_send_wr -1 tail = 0; } else { tail += 1; } return ret; } recv_thread (void *arg) { struct ibv_cq *ev_cq; void *ev_ctx; int ret; ret = ibv_get_cq_event(comp_channel, &ev_cq, &ev_ctx); if (ret) { return 1; } ibv_ack_cq_events(ev_cq, 1); ret = ibv_req_notify_cq(ev_cq, 0); if (ret) { return 1; } while ((rv = ibv_poll_cq(cq, 1, &wc)) == 1) { switch (wc.opcode) { case IBV_WC_SEND: { if (wc.status == IBV_WC_SUCCESS) { if (sq_wr_id[head] != wc.wr_id) { datasz = 0; return 1; } } else { retuen 1; } buf = (char *)send_data[head]; arg = (u64)send_arg[head]; sq_wr_id[head] = 0; if (head == 19) {//max_send_wr -1 head = 0; } else { head += 1; } break; } } } Thanks Ganesh On Mon, May 28, 2007 at 9:28 PM, Roland Dreier wrote: > > Any ideas on why the ibv_get_cq_event() would never see an event > > after a "successful" send requesting a completion event? > > It's either a bug in your code or a bug in the stack below your code. > The best way to debug this would be for you to post your actual code > (in a form that someone else can run), so that we can either point out > what's wrong with your code, or have a test case for the real bug. > > - R. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... 
From sean.hefty at intel.com Mon May 12 16:44:34 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 12 May 2008 16:44:34 -0700 Subject: [ofa-general] ibv_get_cq_event blocking forever after successful ibv_post_send... In-Reply-To: <532b813a0805121632p2ebccd11h80496f9b9c09e0c8@mail.gmail.com> References: <20070525212214.20500.qmail@station183.com> <532b813a0805121632p2ebccd11h80496f9b9c09e0c8@mail.gmail.com> Message-ID: <000301c8b48a$2200e670$465a180a@amr.corp.intel.com> > attr.rnr_retry = 7; Can you drop this to 6 and see if the behavior changes? > recv_thread (void *arg) > { > struct ibv_cq *ev_cq; > void *ev_ctx; > int ret; > > > ret = ibv_get_cq_event(comp_channel, &ev_cq, &ev_ctx); > if (ret) { > return 1; > } > > ibv_ack_cq_events(ev_cq, 1); > > ret = ibv_req_notify_cq(ev_cq, 0); > if (ret) { > return 1; > } > > while ((rv = ibv_poll_cq(cq, 1, &wc)) == 1) { > switch (wc.opcode) { > case IBV_WC_SEND: { > if (wc.status == IBV_WC_SUCCESS) { > if (sq_wr_id[head] != wc.wr_id) { > datasz = 0; > return 1; > } > } else { > retuen 1; ^^^^^^ Are you sure this is the code that's running? > } > buf = (char *)send_data[head]; > arg = (u64)send_arg[head]; > sq_wr_id[head] = 0; > if (head == 19) {//max_send_wr -1 > head = 0; > } else { > head += 1; > } > break; > } > } > > } Where do you re-post receives? From gsadasiv7 at gmail.com Mon May 12 17:03:00 2008 From: gsadasiv7 at gmail.com (Ganesh Sadasivan) Date: Mon, 12 May 2008 17:03:00 -0700 Subject: [ofa-general] ibv_get_cq_event blocking forever after successful ibv_post_send... In-Reply-To: <000301c8b48a$2200e670$465a180a@amr.corp.intel.com> References: <20070525212214.20500.qmail@station183.com> <532b813a0805121632p2ebccd11h80496f9b9c09e0c8@mail.gmail.com> <000301c8b48a$2200e670$465a180a@amr.corp.intel.com> Message-ID: <532b813a0805121703p76df78a8g51256d5bcdb7c330@mail.gmail.com> On Mon, May 12, 2008 at 4:44 PM, Sean Hefty wrote: > > attr.rnr_retry = 7; > > Can you drop this to 6 and see if the behavior changes? That does not change the behavior. > > > recv_thread (void *arg) > > { > > struct ibv_cq *ev_cq; > > void *ev_ctx; > > int ret; > > > > > > ret = ibv_get_cq_event(comp_channel, &ev_cq, &ev_ctx); > > if (ret) { > > return 1; > > } > > > > ibv_ack_cq_events(ev_cq, 1); > > > > ret = ibv_req_notify_cq(ev_cq, 0); > > if (ret) { > > return 1; > > } > > > > while ((rv = ibv_poll_cq(cq, 1, &wc)) == 1) { > > switch (wc.opcode) { > > case IBV_WC_SEND: { > > if (wc.status == IBV_WC_SUCCESS) { > > if (sq_wr_id[head] != wc.wr_id) { > > datasz = 0; > > return 1; > > } > > } else { > > retuen 1; > > ^^^^^^ > Are you sure this is the code that's running? This is a cut-paste error. I just extracted the relevant code from the actual piece. > > > } > > buf = (char *)send_data[head]; > > arg = (u64)send_arg[head]; > > sq_wr_id[head] = 0; > > if (head == 19) {//max_send_wr -1 > > head = 0; > > } else { > > head += 1; > > } > > break; > > } > > } > > > > } > > Where do you re-post receives? I just pasted the send part of the code. Should I send the receive code too? Thanks Ganesh -------------- next part -------------- An HTML attachment was scrubbed... URL:
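For reference, and as a generic sketch rather than a diagnosis of the problem above: the event-driven completion loop described in the libibverbs ibv_get_cq_event() documentation re-arms notification and then drains the CQ on every wakeup. If the CQ is not polled to empty after re-arming, a completion that slips in between the last poll and the next wait is never signalled and the reader blocks forever. The cq_event_loop name and the setup of cq/channel are assumptions here:

#include <infiniband/verbs.h>

/* Generic completion-event loop; cq and channel are assumed created
 * elsewhere (ibv_create_cq() bound to the completion channel), and
 * receives are assumed to be re-posted by the wc handler. */
static int cq_event_loop(struct ibv_comp_channel *channel, struct ibv_cq *cq)
{
        struct ibv_cq *ev_cq;
        void *ev_ctx;
        struct ibv_wc wc;

        if (ibv_req_notify_cq(cq, 0))   /* arm before the first wait */
                return 1;

        for (;;) {
                if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))
                        return 1;
                ibv_ack_cq_events(ev_cq, 1);
                if (ibv_req_notify_cq(ev_cq, 0))  /* re-arm first... */
                        return 1;
                while (ibv_poll_cq(ev_cq, 1, &wc) > 0) {
                        /* ...then drain fully: handle wc here */
                }
        }
}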
From rdreier at cisco.com Mon May 12 18:03:35 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 May 2008 18:03:35 -0700 Subject: [ofa-general] ibv_get_cq_event blocking forever after successful ibv_post_send... In-Reply-To: <532b813a0805121703p76df78a8g51256d5bcdb7c330@mail.gmail.com> (Ganesh Sadasivan's message of "Mon, 12 May 2008 17:03:00 -0700") References: <20070525212214.20500.qmail@station183.com> <532b813a0805121632p2ebccd11h80496f9b9c09e0c8@mail.gmail.com> <000301c8b48a$2200e670$465a180a@amr.corp.intel.com> <532b813a0805121703p76df78a8g51256d5bcdb7c330@mail.gmail.com> Message-ID: > This is a cut-paste error. I just extracted the relevant code from the > actual piece. Unless you send an actual test app that someone could really compile and run, it's very hard to help debug it. Basically your only chance is if you have a really obvious bug that someone could see by reading your code. From keshetti85-student at yahoo.co.in Mon May 12 21:33:53 2008 From: keshetti85-student at yahoo.co.in (Keshetti Mahesh) Date: Tue, 13 May 2008 10:03:53 +0530 Subject: [ofa-general] OpenSM SA dump ? In-Reply-To: <829ded920805122128q7fd0956fkcf2ca86635b4673c@mail.gmail.com> References: <829ded920805122128q7fd0956fkcf2ca86635b4673c@mail.gmail.com> Message-ID: <829ded920805122133j76f483et8280197f216721c6@mail.gmail.com> Thanks Hal for the reply. > Only multicast, services, and informs are dumped. These are the so > called client registrations. > > What SA information are you looking for ? I expected 'opensm-sa.dump' file to contain all the configured paths between the hosts. Is there any way to dump the local SA cache to a file with current OFED ? -Mahesh From okir at lst.de Mon May 12 23:08:09 2008 From: okir at lst.de (Olaf Kirch) Date: Tue, 13 May 2008 08:08:09 +0200 Subject: [ofa-general] RDS flow control In-Reply-To: <200805121157.38135.jon@opengridcomputing.com> References: <200805121157.38135.jon@opengridcomputing.com> Message-ID: <200805130808.10510.okir@lst.de> On Monday 12 May 2008 18:57:38 Jon Mason wrote: > As part of my effort to get RDS working for iWARP, I will be working on the > RDS flow control. Flow control is needed for iWARP due to the fact that > iWARP connections terminate if there is no posted recv for an incoming > packet. IB connections do not have this limitation if setup in a certain > way. In its current implementation, RDS sets the connection attribute > rnr_retry to 7.
This causes IB to retransmit until there is a posted recv > buffer. I think for the initial implementation, it is fine for iWARP to just fail the connect when that happens, and re-establish the connection. If you use reasonable defaults for the send and recv queues, receiver overruns should be relatively rare. Once everything else works, let's revisit the flow control part. > I am still in the very early stages of implementing this. So any pointers to > RDS documentation (or a RDS git tree) would be very helpful. I have a small > IB setup to test this on, so anyone willing to test it when I am done would > be helpful as well. The main RDS repo is the OFED tree. If you want to integrate with my work tree, let me know and I'll feed your patches into my tree at http://www.openfabrics.org/git/?p=~okir/ofed_1_3/linux-2.6.git Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From kliteyn at dev.mellanox.co.il Tue May 13 04:15:22 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 13 May 2008 14:15:22 +0300 Subject: [ofa-general] [PATCH] opensm/osm_state_mgr.c: fix segmentation fault Message-ID: <4829784A.6030708@dev.mellanox.co.il> Hi Sasha, Fixing trivial segmentation fault in state manager. Please apply to ofed_1_3 branch and to master. -- Yevgeny Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_state_mgr.c | 5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index 4b7235f..6f06a8d 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -736,9 +736,8 @@ static boolean_t __osm_state_mgr_is_sm_port_down(IN osm_state_mgr_t * if (!p_port) { osm_log(p_mgr->p_log, OSM_LOG_ERROR, "__osm_state_mgr_is_sm_port_down: ERR 3309: " - "SM port with GUID:%016" PRIx64 " (%s) is unknown\n", - cl_ntoh64(port_guid), - p_port->p_node ? p_port->p_node->print_desc : "UNKNOWN"); + "SM port with GUID:%016" PRIx64 " is unknown\n", + cl_ntoh64(port_guid)); state = IB_LINK_DOWN; CL_PLOCK_RELEASE(p_mgr->p_lock); goto Exit; -- 1.5.1.4 From moshek at voltaire.com Tue May 13 04:53:08 2008 From: moshek at voltaire.com (Moshe Kazir) Date: Tue, 13 May 2008 14:53:08 +0300 Subject: [ewg] RE: [ofa-general] OFED May 5 meeting summary In-Reply-To: <48282580.8040208@mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C90282E694@mtlexch01.mtl.com> <39C75744D164D948A170E9792AF8E7CAC5AF56@exil.voltaire.com> <48282580.8040208@mellanox.co.il> Message-ID: <39C75744D164D948A170E9792AF8E7CAC5AF5F@exil.voltaire.com> Backport of ib-bonding to sles10 sp2 Beta3 is finished. The diff file was delivered to Moni Shoua . BUT !!! When I tried OFED-1.3.1 on sles10 sp2 rc3 I found that the kernel has changed and we have backport issues with ofa_kernel compilation..... I continue digging .... How many rc's are planned for sles 10 sp 2 ? Do we want to backport every RC ? or do we want to wait till the last RC before GA ?
Moshe

____________________________________________________________
Moshe Katzir | +972-9971-8639 (o) | +972-52-860-6042 (m)
Voltaire - The Grid Backbone
www.voltaire.com

-----Original Message-----
From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il]
Sent: Monday, May 12, 2008 2:10 PM
To: Moshe Kazir
Cc: ewg at lists.openfabrics.org; Moni Shoua; Olga Shern; general at lists.openfabrics.org
Subject: Re: [ewg] RE: [ofa-general] OFED May 5 meeting summary

Moshe Kazir wrote:
>
> I have checked OFED-1.3.1-rc1 on SLES10 SP2 Beta3.
>
> ib-bonding compile failed. Everything else compiled OK.
>
> Attached: ib-bonding error log.
>
> I'll take the backport of ib-bonding to SLES10 SP2 on me (if needed,
> I'll get Moni's help).
>
> Thanks

Please update when done.
Any need for a change in the install script?

Tziporet

From tziporet at dev.mellanox.co.il Tue May 13 05:06:38 2008
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 13 May 2008 15:06:38 +0300
Subject: [ewg] RE: [ofa-general] OFED May 5 meeting summary
In-Reply-To: <39C75744D164D948A170E9792AF8E7CAC5AF5F@exil.voltaire.com>
References: <6C2C79E72C305246B504CBA17B5500C90282E694@mtlexch01.mtl.com> <39C75744D164D948A170E9792AF8E7CAC5AF56@exil.voltaire.com> <48282580.8040208@mellanox.co.il> <39C75744D164D948A170E9792AF8E7CAC5AF5F@exil.voltaire.com>
Message-ID: <4829844E.3090307@mellanox.co.il>

Moshe Kazir wrote:
> The backport of ib-bonding to SLES10 SP2 Beta3 is finished.
> The diff file was delivered to Moni Shoua.
>
> BUT !!!
>
> When I tried OFED-1.3.1 on SLES10 SP2 RC3 I found that the kernel has
> changed and we have backport issues with the ofa_kernel compilation.....
>
> I continue digging ....
>
> How many RCs are planned for SLES10 SP2?
>
> Do we want to backport every RC,
> or do we want to wait till the last RC before GA?
>
I think we should add support for the latest SLES10 SP2 only.
Meanwhile we can add backport patches for the latest available RC and replace them when a new RC is out.

Tziporet

From nickpiggin at yahoo.com.au Tue May 13 05:06:44 2008
From: nickpiggin at yahoo.com.au (Nick Piggin)
Date: Tue, 13 May 2008 22:06:44 +1000
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <20080508003838.GA9878@sgi.com>
References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com>
Message-ID: <200805132206.47655.nickpiggin@yahoo.com.au>

On Thursday 08 May 2008 10:38, Robin Holt wrote:
> On Wed, May 07, 2008 at 02:36:57PM -0700, Linus Torvalds wrote:
> > On Wed, 7 May 2008, Andrea Arcangeli wrote:
> > > I think the spinlock->rwsem conversion is ok under config option, as
> > > you can see I complained myself to various of those patches and I'll
> > > take care they're in a mergeable state the moment I submit them. What
> > > XPMEM requires are different semantics for the methods, and we never
> > > had to do any blocking I/O during vmtruncate before, now we have to.
> >
> > I really suspect we don't really have to, and that it would be better to
> > just fix the code that does that.
>
> That fix is going to be fairly difficult. I will argue impossible.
>
> First, a little background. SGI allows one large NUMA-link connected
> machine to be broken into separate single-system images which we call
> partitions.
>
> XPMEM allows, at its most extreme, one process on one partition to
> grant access to a portion of its virtual address range to processes on
> another partition. Those processes can then fault pages and directly
> share the memory.
> > In order to invalidate the remote page table entries, we need to message > (uses XPC) to the remote side. The remote side needs to acquire the > importing process's mmap_sem and call zap_page_range(). Between the > messaging and the acquiring a sleeping lock, I would argue this will > require sleeping locks in the path prior to the mmu_notifier invalidate_* > callouts(). Why do you need to take mmap_sem in order to shoot down pagetables of the process? It would be nice if this can just be done without sleeping. From nickpiggin at yahoo.com.au Tue May 13 05:14:24 2008 From: nickpiggin at yahoo.com.au (Nick Piggin) Date: Tue, 13 May 2008 22:14:24 +1000 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080508013459.GS8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507234521.GN8276@duo.random> <20080508013459.GS8276@duo.random> Message-ID: <200805132214.27510.nickpiggin@yahoo.com.au> On Thursday 08 May 2008 11:34, Andrea Arcangeli wrote: > Sorry for not having completely answered to this. I initially thought > stop_machine could work when you mentioned it, but I don't think it > can even removing xpmem block-inside-mmu-notifier-method requirements. > > For stop_machine to solve this (besides being slower and potentially > not more safe as running stop_machine in a loop isn't nice), we'd need > to prevent preemption in between invalidate_range_start/end. > > I think there are two ways: > > 1) add global lock around mm_lock to remove the sorting > > 2) remove invalidate_range_start/end, nuke mm_lock as consequence of > it, and replace all three with invalidate_pages issued inside the > PT lock, one invalidation for each 512 pte_t modified, so > serialization against get_user_pages becomes trivial but this will > be not ok at all for SGI as it increases a lot their invalidation > frequency This is what I suggested to begin with before this crazy locking was developed to handle these corner cases... because I wanted the locking to match with the tried and tested Linux core mm/ locking rather than introducing this new idea. I don't see why you're bending over so far backwards to accommodate this GRU thing that we don't even have numbers for and could actually potentially be batched up in other ways (eg. using mmu_gather or mmu_gather-like idea). The bare essential, matches-with-Linux-mm mmu notifiers that I first saw of yours was pretty elegant and nice. The idea that "only one solution must go in and handle everything perfectly" is stupid because it is quite obvious that the sleeping invalidate idea is just an order of magnitude or two more complex than the simple atomic invalidates needed by you. We should and could easily have had that code upstream long ago :( I'm not saying we ignore the sleeping or batching cases, but we should introduce the ideas slowly and carefully and assess the pros and cons of each step along the way. > > For KVM both ways are almost the same. > > I'll implement 1 now then we'll see... > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo at kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: email at kvack.org

From ogerlitz at voltaire.com Tue May 13 07:11:15 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 13 May 2008 17:11:15 +0300 (IDT)
Subject: [ofa-general] [RFC PATCH 1/4] net/bonding: announce fail-over for the active-backup mode
Message-ID:

Enhance bonding to announce fail-over for the active-backup mode through the netdev events notifier chain mechanism. Such an event can be of use for the RDMA CM (communication manager) to let native RDMA ULPs (eg NFS-RDMA, iSER) always use the same links as the IP stack does.

Signed-off-by: Or Gerlitz

I am sending the patch along with the series before its review in netdev since it's needed later in patch #4, and I see some issues while testing it with 2.6.26-rc2 which I am pretty sure are not directly related to my work, so I'd like to do some more testing before handing it to the bonding maintainer.

Index: linux-2.6.26-rc2/drivers/net/bonding/bond_main.c
===================================================================
--- linux-2.6.26-rc2.orig/drivers/net/bonding/bond_main.c	2008-05-13 10:02:22.000000000 +0300
+++ linux-2.6.26-rc2/drivers/net/bonding/bond_main.c	2008-05-13 16:34:28.000000000 +0300
@@ -1117,6 +1117,7 @@ void bond_change_active_slave(struct bon
 			bond->send_grat_arp = 1;
 		} else
 			bond_send_gratuitous_arp(bond);
+		netdev_bonding_change(bond->dev);
 	}
 }

Index: linux-2.6.26-rc2/include/linux/notifier.h
===================================================================
--- linux-2.6.26-rc2.orig/include/linux/notifier.h	2008-05-13 10:02:30.000000000 +0300
+++ linux-2.6.26-rc2/include/linux/notifier.h	2008-05-13 11:50:44.000000000 +0300
@@ -197,6 +197,7 @@ static inline int notifier_to_errno(int
 #define NETDEV_GOING_DOWN	0x0009
 #define NETDEV_CHANGENAME	0x000A
 #define NETDEV_FEAT_CHANGE	0x000B
+#define NETDEV_BONDING_FAILOVER	0x000C

 #define SYS_DOWN	0x0001	/* Notify of system down */
 #define SYS_RESTART	SYS_DOWN

Index: linux-2.6.26-rc2/include/linux/netdevice.h
===================================================================
--- linux-2.6.26-rc2.orig/include/linux/netdevice.h	2008-05-13 10:02:30.000000000 +0300
+++ linux-2.6.26-rc2/include/linux/netdevice.h	2008-05-13 11:50:20.000000000 +0300
@@ -1459,6 +1459,7 @@ extern void __dev_addr_unsync(struct de
 extern void dev_set_promiscuity(struct net_device *dev, int inc);
 extern void dev_set_allmulti(struct net_device *dev, int inc);
 extern void netdev_state_change(struct net_device *dev);
+extern void netdev_bonding_change(struct net_device *dev);
 extern void netdev_features_change(struct net_device *dev);
 /* Load a device via the kmod */
 extern void dev_load(struct net *net, const char *name);

Index: linux-2.6.26-rc2/net/core/dev.c
===================================================================
--- linux-2.6.26-rc2.orig/net/core/dev.c	2008-05-13 10:02:31.000000000 +0300
+++ linux-2.6.26-rc2/net/core/dev.c	2008-05-13 11:50:49.000000000 +0300
@@ -956,6 +956,12 @@ void netdev_state_change(struct net_devi
 	}
 }

+void netdev_bonding_change(struct net_device *dev)
+{
+	call_netdevice_notifiers(NETDEV_BONDING_FAILOVER, dev);
+}
+EXPORT_SYMBOL(netdev_bonding_change);
+
 /**
  *	dev_load - load a network module
  *	@net: the applicable net namespace

From ogerlitz at voltaire.com Tue May 13 07:12:16 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 13 May 2008 17:12:16 +0300 (IDT)
Subject: [ofa-general] [RFC PATCH 2/4] rdma/addr: keep the name of the netdevice in struct rdma_dev_addr
In-Reply-To:
References:
Message-ID:

Also keep the local (src) device
name in struct rdma_dev_addr. Under a bonding HA scheme this can be used by the rdma-cm to align RDMA sessions to use the same links as the IP stack does after a bonding fail-over has happened.

Signed-off-by: Or Gerlitz

Index: linux-2.6.26-rc2/drivers/infiniband/core/addr.c
===================================================================
--- linux-2.6.26-rc2.orig/drivers/infiniband/core/addr.c	2008-05-13 16:31:07.000000000 +0300
+++ linux-2.6.26-rc2/drivers/infiniband/core/addr.c	2008-05-13 16:45:01.000000000 +0300
@@ -100,6 +100,7 @@ int rdma_copy_addr(struct rdma_dev_addr
 	memcpy(dev_addr->broadcast, dev->broadcast, MAX_ADDR_LEN);
 	if (dst_dev_addr)
 		memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN);
+	memcpy(dev_addr->src_netdev_name, dev->name, IFNAMSIZ);
 	return 0;
 }
 EXPORT_SYMBOL(rdma_copy_addr);

Index: linux-2.6.26-rc2/drivers/infiniband/core/cma.c
===================================================================
--- linux-2.6.26-rc2.orig/drivers/infiniband/core/cma.c	2008-05-13 16:31:07.000000000 +0300
+++ linux-2.6.26-rc2/drivers/infiniband/core/cma.c	2008-05-13 16:45:01.000000000 +0300
@@ -998,6 +998,7 @@ static struct rdma_id_private *cma_new_c
 	union cma_ip_addr *src, *dst;
 	__be16 port;
 	u8 ip_ver;
+	int ret;

 	if (cma_get_net_info(ib_event->private_data, listen_id->ps,
 			     &ip_ver, &port, &src, &dst))
@@ -1022,10 +1023,11 @@ static struct rdma_id_private *cma_new_c
 	if (rt->num_paths == 2)
 		rt->path_rec[1] = *ib_event->param.req_rcvd.alternate_path;

-	ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid);
 	ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid);
-	ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey));
-	rt->addr.dev_addr.dev_type = RDMA_NODE_IB_CA;
+	ret = rdma_translate_ip(&id->route.addr.src_addr,
+				&id->route.addr.dev_addr);
+	if (ret)
+		goto destroy_id;

 	id_priv = container_of(id, struct rdma_id_private, id);
 	id_priv->state = CMA_CONNECT;

Index: linux-2.6.26-rc2/include/rdma/ib_addr.h
===================================================================
--- linux-2.6.26-rc2.orig/include/rdma/ib_addr.h	2008-05-13 16:31:07.000000000 +0300
+++ linux-2.6.26-rc2/include/rdma/ib_addr.h	2008-05-13 16:45:01.000000000 +0300
@@ -57,6 +57,7 @@ struct rdma_dev_addr {
 	unsigned char dst_dev_addr[MAX_ADDR_LEN];
 	unsigned char broadcast[MAX_ADDR_LEN];
 	enum rdma_node_type dev_type;
+	char src_netdev_name[IFNAMSIZ];
 };

 /**

From ogerlitz at voltaire.com Tue May 13 07:13:14 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 13 May 2008 17:13:14 +0300 (IDT)
Subject: [ofa-general] [RFC PATCH 3/4] rdma/cma: add high availability mode attribute to IDs
In-Reply-To:
References:
Message-ID:

The RDMA_ALIGN_WITH_NETDEVICE high availability (ha) mode means that the consumer of the rdma-cm wants RDMA sessions to always use the same links as the IP stack does. In the current code, this does not happen when bonding has failed over but the IB link used by an already existing session is operating fine. For now this mode is supported only for the connected services of the rdma-cm. More ha modes can be added in the future.

Signed-off-by: Or Gerlitz
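To illustrate the intended usage, here is a minimal hypothetical sketch of a consumer -- the ULP structure, handler, and timeout constant are made up for the example and are not part of this patch:

/* hypothetical ULP sketch: keep this connection aligned with the
 * netdevice that the IP stack is using */
static int ulp_create_aligned_id(struct ulp_conn *conn)
{
	int ret;

	conn->cm_id = rdma_create_id(ulp_cm_handler, conn, RDMA_PS_TCP);
	if (IS_ERR(conn->cm_id))
		return PTR_ERR(conn->cm_id);

	/* UD port spaces are rejected with -ENOTSUPP */
	ret = rdma_set_high_availability_mode(conn->cm_id,
					      RDMA_ALIGN_WITH_NETDEVICE);
	if (ret) {
		rdma_destroy_id(conn->cm_id);
		return ret;
	}

	return rdma_resolve_addr(conn->cm_id, NULL,
				 (struct sockaddr *)&conn->dst_addr,
				 ULP_RESOLVE_TIMEOUT_MS);
}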
Index: linux-2.6.26-rc2/drivers/infiniband/core/addr.c
===================================================================
Index: linux-2.6.26-rc2/include/rdma/rdma_cm.h
===================================================================
--- linux-2.6.26-rc2.orig/include/rdma/rdma_cm.h	2008-04-17 05:49:44.000000000 +0300
+++ linux-2.6.26-rc2/include/rdma/rdma_cm.h	2008-05-13 13:52:53.000000000 +0300
@@ -328,4 +328,10 @@ void rdma_leave_multicast(struct rdma_cm
  */
 void rdma_set_service_type(struct rdma_cm_id *id, int tos);

+enum rdma_ha_mode {
+	RDMA_ALIGN_WITH_NETDEVICE = 1
+};
+
+int rdma_set_high_availability_mode(struct rdma_cm_id *id, enum rdma_ha_mode mode);
+
 #endif /* RDMA_CM_H */

Index: linux-2.6.26-rc2/drivers/infiniband/core/cma.c
===================================================================
--- linux-2.6.26-rc2.orig/drivers/infiniband/core/cma.c	2008-05-13 11:57:02.000000000 +0300
+++ linux-2.6.26-rc2/drivers/infiniband/core/cma.c	2008-05-13 14:57:12.000000000 +0300
@@ -143,6 +143,7 @@ struct rdma_id_private {
 	u32 qp_num;
 	u8 srq;
 	u8 tos;
+	enum rdma_ha_mode ha_mode;
 };

 struct cma_multicast {
@@ -1523,6 +1524,19 @@ void rdma_set_service_type(struct rdma_c
 }
 EXPORT_SYMBOL(rdma_set_service_type);

+int rdma_set_high_availability_mode(struct rdma_cm_id *id, enum rdma_ha_mode mode)
+{
+	struct rdma_id_private *id_priv;
+
+	if ((mode == RDMA_ALIGN_WITH_NETDEVICE) && cma_is_ud_ps(id->ps))
+		return -ENOTSUPP;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	id_priv->ha_mode = mode;
+	return 0;
+}
+EXPORT_SYMBOL(rdma_set_high_availability_mode);
+
 static void cma_query_handler(int status, struct ib_sa_path_rec *path_rec,
 			      void *context)
 {

From ogerlitz at voltaire.com Tue May 13 07:13:58 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 13 May 2008 17:13:58 +0300 (IDT)
Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode
In-Reply-To:
References:
Message-ID:

The RDMA_ALIGN_WITH_NETDEVICE high availability (ha) mode means that the consumer of the rdma-cm wants RDMA sessions to always use the same links as the IP stack does. In the current code, this does not happen when bonding has failed over but the IB link used by an already existing session is operating fine.

Use a netdev event notification for sensing that a change has happened in the IP stack, then scan the rdma-cm IDs list to see if there is an ID that is "misaligned" in that respect with the IP stack, and disconnect it, in case this is what the user asked for when setting an ha mode for the ID.

Signed-off-by: Or Gerlitz
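On the ULP side the effect of this patch is an ordinary disconnect; a hypothetical event-handler fragment (the reconnect helper is assumed, not part of this series) would react roughly like:

static int ulp_cm_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
{
	struct ulp_conn *conn = id->context;

	switch (event->event) {
	case RDMA_CM_EVENT_DISCONNECTED:
		/*
		 * With RDMA_ALIGN_WITH_NETDEVICE set, this disconnect may
		 * be the rdma-cm reacting to a bonding fail-over.  Tear
		 * down and resolve the address again; the new connection
		 * then follows the currently active slave.
		 */
		ulp_schedule_reconnect(conn);	/* assumed helper */
		break;
	default:
		break;
	}
	return 0;
}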
Index: linux-2.6.26-rc2/drivers/infiniband/core/cma.c
===================================================================
--- linux-2.6.26-rc2.orig/drivers/infiniband/core/cma.c	2008-05-13 16:57:47.000000000 +0300
+++ linux-2.6.26-rc2/drivers/infiniband/core/cma.c	2008-05-13 16:58:55.000000000 +0300
@@ -144,6 +144,7 @@ struct rdma_id_private {
 	u8 srq;
 	u8 tos;
 	enum rdma_ha_mode ha_mode;
+	struct work_struct ha_work;
 };

 struct cma_multicast {
@@ -268,6 +269,14 @@ static inline int cma_is_ud_ps(enum rdma
 	return (ps == RDMA_PS_UDP || ps == RDMA_PS_IPOIB);
 }

+static void cma_ha_work_handler(struct work_struct *work)
+{
+	struct rdma_id_private *id_priv;
+
+	id_priv = container_of(work, struct rdma_id_private, ha_work);
+	rdma_disconnect(&id_priv->id);
+}
+
 static void cma_attach_to_dev(struct rdma_id_private *id_priv,
 			      struct cma_device *cma_dev)
 {
@@ -401,7 +410,8 @@ struct rdma_cm_id *rdma_create_id(rdma_c
 	INIT_LIST_HEAD(&id_priv->listen_list);
 	INIT_LIST_HEAD(&id_priv->mc_list);
 	get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num);
-
+	INIT_WORK(&id_priv->ha_work, cma_ha_work_handler);
+
 	return &id_priv->id;
 }
 EXPORT_SYMBOL(rdma_create_id);
@@ -2743,6 +2753,38 @@ void rdma_leave_multicast(struct rdma_cm
 }
 EXPORT_SYMBOL(rdma_leave_multicast);

+static int cma_netdev_callback(struct notifier_block *self, unsigned long event,
+			       void *ctx)
+{
+	struct net_device *ndev = (struct net_device *)ctx;
+	struct cma_device *cma_dev;
+	struct rdma_id_private *id_priv;
+	struct rdma_dev_addr *dev_addr;
+
+	if (dev_net(ndev) != &init_net)
+		return NOTIFY_DONE;
+
+	if (event != NETDEV_BONDING_FAILOVER)
+		return NOTIFY_DONE;
+
+	if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING))
+		return NOTIFY_DONE;
+
+	list_for_each_entry(cma_dev, &dev_list, list)
+		list_for_each_entry(id_priv, &cma_dev->id_list, list) {
+			dev_addr = &id_priv->id.route.addr.dev_addr;
+			if (!memcmp(dev_addr->src_netdev_name, ndev->name, IFNAMSIZ) &&
+			    memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len))
+				if (id_priv->ha_mode == RDMA_ALIGN_WITH_NETDEVICE)
+					schedule_work(&id_priv->ha_work);
+		}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block cma_nb = {
+	.notifier_call = cma_netdev_callback
+};
+
 static void cma_add_one(struct ib_device *device)
 {
 	struct cma_device *cma_dev;
@@ -2847,6 +2889,7 @@ static int cma_init(void)

 	ib_sa_register_client(&sa_client);
 	rdma_addr_register_client(&addr_client);
+	register_netdevice_notifier(&cma_nb);

 	ret = ib_register_client(&cma_client);
 	if (ret)
@@ -2854,6 +2897,7 @@ static int cma_init(void)
 	return 0;
 err:
+	unregister_netdevice_notifier(&cma_nb);
 	rdma_addr_unregister_client(&addr_client);
 	ib_sa_unregister_client(&sa_client);
 	destroy_workqueue(cma_wq);
@@ -2863,6 +2907,7 @@ err:
 static void cma_cleanup(void)
 {
 	ib_unregister_client(&cma_client);
+	unregister_netdevice_notifier(&cma_nb);
 	rdma_addr_unregister_client(&addr_client);
 	ib_sa_unregister_client(&sa_client);
 	destroy_workqueue(cma_wq);

From rdreier at cisco.com Tue May 13 07:27:46 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 13 May 2008 07:27:46 -0700
Subject: [ofa-general] [RFC PATCH 3/4] rdma/cma: add high availability mode attribute to IDs
In-Reply-To: (Or Gerlitz's message of "Tue, 13 May 2008 17:13:14 +0300 (IDT)")
References:
Message-ID:

 > +enum rdma_ha_mode {
 > +	RDMA_ALIGN_WITH_NETDEVICE = 1
 > +};

 > +int rdma_set_high_availability_mode(struct rdma_cm_id *id, enum rdma_ha_mode mode)

this seems like overengineering to me...
given there are no other modes, you are adding an elaborate NOP. (Nothing looks at ha_mode) Do you have plans for other modes? > u8 srq; > u8 tos; > + enum rdma_ha_mode ha_mode; Side note -- you're wasting two bytes here because of alignment. From rdreier at cisco.com Tue May 13 07:32:10 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 13 May 2008 07:32:10 -0700 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: (Or Gerlitz's message of "Tue, 13 May 2008 17:13:58 +0300 (IDT)") References: Message-ID: > Use netevent notification for sensing that a change has happened in the IP stack, > then scan the rdma-cm IDs list to see if there is an ID that is "misaligned" > in that respect with the IP stack, and disconnect it, in case this is what the > user asked to when setting an ha mode for the ID. this seems like a strange "HA" feature -- to disconnect connections that otherwise would continue operating. What is the use case/use scenario? > + list_for_each_entry(cma_dev, &dev_list, list) > + list_for_each_entry(id_priv, &cma_dev->id_list, list) { > + dev_addr = &id_priv->id.route.addr.dev_addr; > + if (!memcmp(dev_addr->src_netdev_name, ndev->name, IFNAMSIZ) && > + memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) > + if (id_priv->ha_mode == RDMA_ALIGN_WITH_NETDEVICE) > + schedule_work(&id_priv->ha_work); > + } This looks horribly racy/incorrect against RDMA device removal, CMA ID destruction and netdev renaming. - R. From weiny2 at llnl.gov Tue May 13 08:05:36 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 13 May 2008 08:05:36 -0700 Subject: [ofa-general] OpenSM SA dump ? In-Reply-To: <829ded920805122133j76f483et8280197f216721c6@mail.gmail.com> References: <829ded920805122128q7fd0956fkcf2ca86635b4673c@mail.gmail.com> <829ded920805122133j76f483et8280197f216721c6@mail.gmail.com> Message-ID: <20080513080536.079b56a3.weiny2@llnl.gov> On Tue, 13 May 2008 10:03:53 +0530 "Keshetti Mahesh" wrote: > Thanks Hal for the reply. > > > Only multicast, services, and informs are dumped. These are the so > > called client registrations. > > > > What SA information are you looking for ? > > I expected 'opensm-sa.dump' file to contain all the configured paths > between the hosts. saquery can provide the PathRecords anytime. Ira > > Is there any way to dump the local SA cache to a file with current > OFED ? > > -Mahesh > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From holt at sgi.com Tue May 13 08:32:38 2008 From: holt at sgi.com (Robin Holt) Date: Tue, 13 May 2008 10:32:38 -0500 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <200805132206.47655.nickpiggin@yahoo.com.au> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> Message-ID: <20080513153238.GL19717@sgi.com> On Tue, May 13, 2008 at 10:06:44PM +1000, Nick Piggin wrote: > On Thursday 08 May 2008 10:38, Robin Holt wrote: > > In order to invalidate the remote page table entries, we need to message > > (uses XPC) to the remote side. The remote side needs to acquire the > > importing process's mmap_sem and call zap_page_range(). 
Between the
> > messaging and the acquiring a sleeping lock, I would argue this will
> > require sleeping locks in the path prior to the mmu_notifier invalidate_*
> > callouts().
>
> Why do you need to take mmap_sem in order to shoot down pagetables of
> the process? It would be nice if this can just be done without
> sleeping.

We are trying to shoot down page tables of a different process running on a different instance of Linux running on NUMA-link connected portions of the same machine. The messaging is clearly going to require sleeping. Are you suggesting we need to rework XPC communications to not require sleeping? I think that is going to be impossible since the transfer engine requires a sleeping context.

Additionally, the call to zap_page_range expects to have the mmap_sem held. I suppose we could use something other than zap_page_range and atomically clear the process page tables. Doing that will not alleviate the need to sleep for the messaging to the other partitions.

Thanks,
Robin

From rdreier at cisco.com Tue May 13 09:17:45 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 13 May 2008 09:17:45 -0700
Subject: [ofa-general] bitops take an unsigned long *
In-Reply-To: <1210634005.3949.26.camel@brick.pathscale.com> (Ralph Campbell's message of "Mon, 12 May 2008 16:13:25 -0700")
References: <20080508222916.277649ca.akpm@linux-foundation.org> <1210634005.3949.26.camel@brick.pathscale.com>
Message-ID:

> This change looks fine to me.
>
> ipath_sdma_status doesn't depend on hardware so changing
> #define IPATH_SDMA_RUNNING 62
> #define IPATH_SDMA_SHUTDOWN 63
> to different values is fine.

Great, I guess I will change them to 30 and 31 so the values always work even if unsigned long is 32 bits. Out of curiosity, was there any reason for choosing 0, 1, 2, 3 and then skipping to 62?

> Roland, do you want me to send a patch for this?

I can handle it I think... I'll merge a patch that changes the two declarations to unsigned long (as I sent out before) and also changes RUNNING and SHUTDOWN to 30 and 31.

- R.

From swise at opengridcomputing.com Tue May 13 09:47:05 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 13 May 2008 11:47:05 -0500
Subject: [ofa-general] RDS flow control
In-Reply-To: <200805130808.10510.okir@lst.de>
References: <200805121157.38135.jon@opengridcomputing.com> <200805130808.10510.okir@lst.de>
Message-ID: <4829C609.5000205@opengridcomputing.com>

Olaf Kirch wrote:
> On Monday 12 May 2008 18:57:38 Jon Mason wrote:
>
>> As part of my effort to get RDS working for iWARP, I will be working on the
>> RDS flow control. Flow control is needed for iWARP due to the fact that
>> iWARP connections terminate if there is no posted recv for an incoming
>> packet. IB connections do not have this limitation if set up in a certain
>> way. In its current implementation, RDS sets the connection attribute
>> rnr_retry to 7. This causes IB to retransmit until there is a posted recv
>> buffer.
>
> I think for the initial implementation, it is fine for iWARP to just
> fail the connect when that happens, and re-establish the connection.
>
> If you use reasonable defaults for the send and recv queues, receiver
> overruns should be relatively rare.
>
> Once everything else works, let's revisit the flow control part.

I _think_ you'll hit this quickly with one-way flows. Send completions for iWARP only mean the user's buffer can be reused, not that it's placed at the remote peer or in the remote user's buffer.

But perhaps I'm wrong.
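For reference, the knob in question is set at connect time; a minimal sketch, assuming the usual rdma_cm conn_param convention (the wrapper function itself is hypothetical):

static int rds_style_connect(struct rdma_cm_id *cm_id)
{
	struct rdma_conn_param conn_param;

	memset(&conn_param, 0, sizeof conn_param);
	conn_param.responder_resources = 1;
	conn_param.initiator_depth = 1;
	conn_param.retry_count = 7;
	/* 7 == retry indefinitely on RNR NAK; 0 would make a send fail
	 * as soon as the peer has no recv buffer posted */
	conn_param.rnr_retry_count = 7;

	return rdma_connect(cm_id, &conn_param);
}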
Jon, maybe you should try to hit this with IB and rnr_retry == 0 using the rds perf tools?

Also, "the everything else" part depends on removing the fmr usage. I'm working on the new RDMA memory verbs allowing fast registration of physical memory via a send WR. To support iWARP we need to remove the fmr usage from RDS. The idea was to replace fmrs with the new fastreg verbs. Thoughts?

Stay tuned for the new verbs API RFC...

Steve.

From richard.frank at oracle.com Tue May 13 10:03:21 2008
From: richard.frank at oracle.com (Richard Frank)
Date: Tue, 13 May 2008 13:03:21 -0400
Subject: [ofa-general] RDS flow control
In-Reply-To: <4829C609.5000205@opengridcomputing.com>
References: <200805121157.38135.jon@opengridcomputing.com> <200805130808.10510.okir@lst.de> <4829C609.5000205@opengridcomputing.com>
Message-ID: <4829C9D9.6050409@oracle.com>

Steve Wise wrote:
> Olaf Kirch wrote:
>> On Monday 12 May 2008 18:57:38 Jon Mason wrote:
>>
>>> As part of my effort to get RDS working for iWARP, I will be working
>>> on the RDS flow control. Flow control is needed for iWARP due to
>>> the fact that iWARP connections terminate if there is no posted recv
>>> for an incoming packet. IB connections do not have this limitation
>>> if set up in a certain way. In its current implementation, RDS sets
>>> the connection attribute rnr_retry to 7. This causes IB to
>>> retransmit until there is a posted recv buffer.
>>
>> I think for the initial implementation, it is fine for iWARP to just
>> fail the connect when that happens, and re-establish the connection.
>>
>> If you use reasonable defaults for the send and recv queues, receiver
>> overruns should be relatively rare.
>>
>> Once everything else works, let's revisit the flow control part.
>>
> I _think_ you'll hit this quickly with one-way flows. Send
> completions for iWARP only mean the user's buffer can be reused, not
> that it's placed at the remote peer or in the remote user's buffer.
>
Let's see what happens - anyway, this could be solved in an iWARP extension to RDS, right?

> But perhaps I'm wrong. Jon, maybe you should try to hit this with IB
> and rnr_retry == 0 using the rds perf tools?

> Also, "the everything else" part depends on removing the fmr usage. I'm
> working on the new RDMA memory verbs allowing fast registration of
> physical memory via a send WR. To support iWARP we need to remove the
> fmr usage from RDS. The idea was to replace fmrs with the new
> fastreg verbs. Thoughts?
>
What does "fast" imply here - how does this compare to the performance of FMRs?

Why not push memory window creation into the RDS transport specific implementations?

Changing the API may be OK - if we retain the performance we have with IB.

> Stay tuned for the new verbs API RFC...
>
> Steve.
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From sean.hefty at intel.com Tue May 13 10:05:12 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 13 May 2008 10:05:12 -0700 Subject: [ofa-general] RE: [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: References: Message-ID: <000001c8b51b$81cd77d0$865b180a@amr.corp.intel.com> >+static void cma_ha_work_handler(struct work_struct *work) >+{ >+ struct rdma_id_private *id_priv; >+ >+ id_priv = container_of(work, struct rdma_id_private, ha_work); >+ rdma_disconnect(&id_priv->id); >+} This will race with other user calls. I've found it fairly difficult for the rdma_cm to call back into its own API and avoid racing with the user trying to destroy the cm_id. None of the APIs are coded to allow calling them simultaneously with destroy. A better solution for this may be for the rdma_cm to simply notify the user that the IP mapping for their RDMA device has changed. The user can then disconnect, with the appropriate synchronization, if they want their RDMA connection to follow the IP address. (If I understood correctly, the reason for this is to allow failing back to a repaired port.) >+static int cma_netdev_callback(struct notifier_block *self, unsigned long >event, >+ void *ctx) >+{ >+ struct net_device *ndev = (struct net_device *)ctx; >+ struct cma_device *cma_dev; >+ struct rdma_id_private *id_priv; >+ struct rdma_dev_addr *dev_addr; >+ >+ if (dev_net(ndev) != &init_net) >+ return NOTIFY_DONE; >+ >+ if (event != NETDEV_BONDING_FAILOVER) >+ return NOTIFY_DONE; >+ >+ if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING)) >+ return NOTIFY_DONE; >+ >+ list_for_each_entry(cma_dev, &dev_list, list) >+ list_for_each_entry(id_priv, &cma_dev->id_list, list) { >+ dev_addr = &id_priv->id.route.addr.dev_addr; >+ if (!memcmp(dev_addr->src_netdev_name, ndev->name, IFNAMSIZ) >&& >+ memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev- >>addr_len)) >+ if (id_priv->ha_mode == >RDMA_ALIGN_WITH_NETDEVICE) >+ schedule_work(&id_priv->ha_work); >+ } >+ return NOTIFY_DONE; >+} As Roland mentioned, this is racy in the areas he pointed out. This will take some thought to handle correctly. - Sean From rdreier at cisco.com Tue May 13 10:41:39 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 13 May 2008 10:41:39 -0700 Subject: [ofa-general] [PATCH] IB/core: handle race between elements in qork queues after event In-Reply-To: <4827FBDF.9040308@Voltaire.COM> (Moni Shoua's message of "Mon, 12 May 2008 11:12:15 +0300") References: <48187E5A.7040809@Voltaire.COM> <4819BD29.7080002@Voltaire.COM> <4820638E.4030901@Voltaire.COM> <4827FBDF.9040308@Voltaire.COM> Message-ID: > Can we please go on with this patch? We would like to see it in the next kernel. I still don't get why this is important to you. Is there a concrete example of a situation where this actually makes a measurable difference? We need some justification for adding this locking complexity beyond "it doesn't hurt." (And also of course we need it fixed so there aren't races) - R. 
From swise at opengridcomputing.com Tue May 13 10:58:11 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 13 May 2008 12:58:11 -0500 Subject: [ofa-general] RDS flow control In-Reply-To: <4829C9D9.6050409@oracle.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805130808.10510.okir@lst.de> <4829C609.5000205@opengridcomputing.com> <4829C9D9.6050409@oracle.com> Message-ID: <4829D6B3.5080900@opengridcomputing.com> Richard Frank wrote: > Steve Wise wrote: >> Olaf Kirch wrote: >>> On Monday 12 May 2008 18:57:38 Jon Mason wrote: >>> >>>> As part of my effort to get RDS working for iWARP, I will be >>>> working on the RDS flow control. Flow control is needed for iWARP >>>> due to the fact that iWARP connections terminate if there is no >>>> posted recv for an incoming packet. IB connections do not have >>>> this limitation if setup in a certain way. In its current >>>> implementation, RDS sets the connection attribute rnr_retry to 7. >>>> This causes IB to retransmit until there is a posted recv buffer. >>> >>> I think for the initial implementation, it is fine for iWARP to just >>> fail the connect when that happens, and re-establish the connection. >>> >>> If you use reasonable defaults for the send and recv queues, receiver >>> overruns should be relatively rare. >>> >>> Once everything else works, let's revisit the flow control part. >>> >>> >> I _think_ you'll hit this quickly with one-way flows. Send >> completions for iWARP only mean the user's buffer can be reused. Not >> that its placed at the remote peer or in the remote user's buffer. >> > Let's see what happens - anyway - this could be solved in an IWARP > extension to RDS - right ? Yes, by adding flow control. And it could be iwarp-specific if you want. I would not suggest relying on connection termination and re-establishment as the way to handle this :). >> But perhaps I'm wrong. Jon, maybe you should try to hit this with IB >> and rnr_retry == 0 using the rds perf tools? >> Also "the everything else" part depends on remove fmr usage. I'm >> working on the new RDMA memory verbs allowing fast registration of >> physical memory via a send WR. To support iWARP we need to remove >> the fmr usage from RDS. The idea was to replace fmrs with the new >> fastreg verbs. Thoughts? >> > What does "fast" imply here - how does this compare to the performance > of FMRs ? Don't know yet, but probably as fast. > > Why would not push memory window creation into the RDS transport > specific implementations ? Isn't it already transport-specific? IE you don't need FMRs for TCP. (I'm ignorant on the specifics of the implementation at this point, so please excuse any dumb statements :) > > Changing the API may be OK - if we retain the performance we have with > IB. I assume nothing would fly that regresses IB performance. Worst case, you have an iwarp-specific RDS transport like you do for TCP, I guess. Hopefully though, IB + iWARP will be a common transport. > >> Stay tuned for the new verbs API RFC... >> >> Steve. 
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit
>> http://openib.org/mailman/listinfo/openib-general

From okir at lst.de Tue May 13 11:04:00 2008
From: okir at lst.de (Olaf Kirch)
Date: Tue, 13 May 2008 20:04:00 +0200
Subject: [ofa-general] RDS flow control
In-Reply-To: <4829D6B3.5080900@opengridcomputing.com>
References: <200805121157.38135.jon@opengridcomputing.com> <4829C9D9.6050409@oracle.com> <4829D6B3.5080900@opengridcomputing.com>
Message-ID: <200805132004.01371.okir@lst.de>

On Tuesday 13 May 2008 19:58:11 Steve Wise wrote:
> Yes, by adding flow control. And it could be iwarp-specific if you
> want. I would not suggest relying on connection termination and
> re-establishment as the way to handle this :).

No, not in the long term. But let's hold off on the flow control stuff for a little - I would first like to finish my patch set and hand it out for you folks to bang on it, rather than the other way round. Okay with you guys?

> I assume nothing would fly that regresses IB performance. Worst case,
> you have an iwarp-specific RDS transport like you do for TCP, I guess.
> Hopefully though, IB + iWARP will be a common transport.

If it turns out that way, fine. If iWARP ends up sharing 80% of the code with IB except the RDMA specific functions, I think that's very much acceptable, too.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

From swise at opengridcomputing.com Tue May 13 11:08:46 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 13 May 2008 13:08:46 -0500
Subject: [ofa-general] RDS flow control
In-Reply-To: <200805132004.01371.okir@lst.de>
References: <200805121157.38135.jon@opengridcomputing.com> <4829C9D9.6050409@oracle.com> <4829D6B3.5080900@opengridcomputing.com> <200805132004.01371.okir@lst.de>
Message-ID: <4829D92E.2070504@opengridcomputing.com>

Olaf Kirch wrote:
> On Tuesday 13 May 2008 19:58:11 Steve Wise wrote:
>
>> Yes, by adding flow control. And it could be iwarp-specific if you
>> want. I would not suggest relying on connection termination and
>> re-establishment as the way to handle this :).
>>
>
> No, not in the long term. But let's hold off on the flow control stuff
> for a little - I would first like to finish my patch set and hand it
> out for you folks to bang on it, rather than the other way round.
> Okay with you guys?
>

What patch set? We can't run on chelsio's rnic with fmrs...

>
>> I assume nothing would fly that regresses IB performance. Worst case,
>> you have an iwarp-specific RDS transport like you do for TCP, I guess.
>> Hopefully though, IB + iWARP will be a common transport.
>>
>
> If it turns out that way, fine. If iWARP ends up sharing 80% of the
> code with IB except the RDMA specific functions, I think that's
> very much acceptable, too.
>
> Olaf
>

From okir at lst.de Tue May 13 11:24:11 2008
From: okir at lst.de (Olaf Kirch)
Date: Tue, 13 May 2008 20:24:11 +0200
Subject: [ofa-general] RDS flow control
In-Reply-To: <4829D92E.2070504@opengridcomputing.com>
References: <200805121157.38135.jon@opengridcomputing.com> <200805132004.01371.okir@lst.de> <4829D92E.2070504@opengridcomputing.com>
Message-ID: <200805132024.12741.okir@lst.de>

On Tuesday 13 May 2008 20:08:46 Steve Wise wrote:
> > No, not in the long term.
> > But let's hold off on the flow control stuff
> > for a little - I would first like to finish my patch set and hand it
> > out for you folks to bang on it, rather than the other way round.
> > Okay with you guys?
>
> What patch set?

I mentioned in a previous mail to Jon that I have some partial patches that implement flow control. I want to get that code out to you ASAP; I think that's easier than having two different approaches that need to be reconciled afterwards.

> We can't run on chelsio's rnic with fmrs...

Yes, that is understood.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

From rdreier at cisco.com Tue May 13 11:46:06 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 13 May 2008 11:46:06 -0700
Subject: [ofa-general] [PATCH 3/3] IB/ipath - fix RDMA read response sequence checking
In-Reply-To: <20080508185528.8547.31626.stgit@eng-46.mv.qlogic.com> (Ralph Campbell's message of "Thu, 08 May 2008 11:55:28 -0700")
References: <20080508185512.8547.29637.stgit@eng-46.mv.qlogic.com> <20080508185528.8547.31626.stgit@eng-46.mv.qlogic.com>
Message-ID:

OK, applied all 3.

From rdreier at cisco.com Tue May 13 11:46:13 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 13 May 2008 11:46:13 -0700
Subject: [ofa-general] Re: [PATCH][INFINIBAND]: Make ipath_portdata work with struct pid * not pid_t.
In-Reply-To: <482857AE.2030904@openvz.org> (Pavel Emelyanov's message of "Mon, 12 May 2008 18:43:58 +0400")
References: <482857AE.2030904@openvz.org>
Message-ID:

thanks, applied

From rdreier at cisco.com Tue May 13 11:51:58 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 13 May 2008 11:51:58 -0700
Subject: [ofa-general] bitops take an unsigned long *
In-Reply-To: <1210634005.3949.26.camel@brick.pathscale.com> (Ralph Campbell's message of "Mon, 12 May 2008 16:13:25 -0700")
References: <20080508222916.277649ca.akpm@linux-foundation.org> <1210634005.3949.26.camel@brick.pathscale.com>
Message-ID:

OK, I added the below to my tree for my next pull request:

commit f018c7e177a50390f6fcb137f1a28a6027d8ba50
Author: Roland Dreier
Date:   Tue May 13 11:51:23 2008 -0700

    IB/ipath: Change ipath_devdata.ipath_sdma_status to be unsigned long

    Andrew Morton pointed out that bitops should take an unsigned long *
    arg. However, the ipath driver was doing bitops on struct
    ipath_devdata.ipath_sdma_status, which is u64. Change this member to
    unsigned long to avoid tons of warnings when x86 fixes the bitops to
    take unsigned long * instead of void *.

    Also, change the IPATH_SDMA_RUNNING and IPATH_SDMA_SHUTDOWN bit
    numbers to 30 and 31 (instead of 62 and 63) so that we're not setting
    another booby trap for someone who tries to make ipath work on a
    32-bit architecture.

    Signed-off-by: Roland Dreier
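To make the 32-bit trap concrete, a small illustrative sketch (not part of the commit; this is just what the generic bitops arithmetic does):

	u64 status = 0;		/* the old ipath_sdma_status type */

	set_bit(62, (unsigned long *)&status);
	/*
	 * On a 64-bit kernel this sets bit 62 of status.  On a 32-bit
	 * kernel, unsigned long is 32 bits, so bit 62 means bit 30 of
	 * the word at index 1: the high half of the u64 on a little-
	 * endian machine, but the LOW half -- i.e. bit 30 of status --
	 * on a big-endian one.  With unsigned long status and bit
	 * numbers <= 31, the same code is well-defined everywhere.
	 */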
diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
index 258e66c..daad09a 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -1894,7 +1894,7 @@ void ipath_cancel_sends(struct ipath_devdata *dd, int restore_sendctrl)
 	 */
 	if (dd->ipath_flags & IPATH_HAS_SEND_DMA) {
 		int skip_cancel;
-		u64 *statp = &dd->ipath_sdma_status;
+		unsigned long *statp = &dd->ipath_sdma_status;

 		spin_lock_irqsave(&dd->ipath_sdma_lock, flags);
 		skip_cancel =
diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h
index 2097587..59a8b25 100644
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h
@@ -483,7 +483,7 @@ struct ipath_devdata {
 	/* SendDMA related entries */
 	spinlock_t ipath_sdma_lock;
-	u64 ipath_sdma_status;
+	unsigned long ipath_sdma_status;
 	unsigned long ipath_sdma_abort_jiffies;
 	unsigned long ipath_sdma_abort_intr_timeout;
 	unsigned long ipath_sdma_buf_jiffies;
@@ -822,8 +822,8 @@ struct ipath_devdata {
 #define IPATH_SDMA_DISARMED 1
 #define IPATH_SDMA_DISABLED 2
 #define IPATH_SDMA_LAYERBUF 3
-#define IPATH_SDMA_RUNNING 62
-#define IPATH_SDMA_SHUTDOWN 63
+#define IPATH_SDMA_RUNNING 30
+#define IPATH_SDMA_SHUTDOWN 31

 /* bit combinations that correspond to abort states */
 #define IPATH_SDMA_ABORT_NONE 0

From rdreier at cisco.com Tue May 13 11:53:20 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 13 May 2008 11:53:20 -0700
Subject: [ofa-general] [PATCH 2.6.26] RDMA/cxgb3: Wrap the software sq ptr as needed on flush.
In-Reply-To: <20080509201902.13077.53047.stgit@dell3.ogc.int> (Steve Wise's message of "Fri, 09 May 2008 15:19:02 -0500")
References: <20080509201902.13077.53047.stgit@dell3.ogc.int>
Message-ID:

thanks, applied

From rdreier at cisco.com Tue May 13 11:56:21 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 13 May 2008 11:56:21 -0700
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
In-Reply-To: <20080510190721.GI5298@sgi.com> (akepner@sgi.com's message of "Sat, 10 May 2008 12:07:21 -0700")
References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com> <20080510190721.GI5298@sgi.com>
Message-ID:

> ipoib_cm.c:ipoib_cm_send() does:
>	if (++priv->tx_outstanding == ipoib_sendq_size)
>		netif_stop_queue(dev);
>
> but ipoib_ib.c:ipoib_send() does:
>	if (++priv->tx_outstanding == (ipoib_sendq_size - 1)) {
>		netif_stop_queue(dev);

So this is not in the upstream kernel... I wonder if this is a bug introduced in an OFED 1.3 patch?

From sashak at voltaire.com Tue May 13 15:18:34 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 13 May 2008 22:18:34 +0000
Subject: [ofa-general] Re: [PATCH] opensm/osm_state_mgr.c: fix segmentation fault
In-Reply-To: <4829784A.6030708@dev.mellanox.co.il>
References: <4829784A.6030708@dev.mellanox.co.il>
Message-ID: <20080513221834.GE21414@sashak.voltaire.com>

Hi Yevgeny,

On 14:15 Tue 13 May , Yevgeny Kliteynik wrote:
> Hi Sasha,
>
> Fixing a trivial segmentation fault in the state manager.
>
> Please apply to the ofed_1_3 branch and to master.
>
> -- Yevgeny
>
> Signed-off-by: Yevgeny Kliteynik

This patch is not against the master branch; I applied it by hand-editing. Thanks for the fix. But please rebase your working branch!
Sasha

From or.gerlitz at gmail.com Tue May 13 12:48:27 2008
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Tue, 13 May 2008 22:48:27 +0300
Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode
In-Reply-To:
References:
Message-ID: <15ddcffd0805131248l7da1274fy8f467ae3a98e176e@mail.gmail.com>

On 5/13/08, Roland Dreier wrote:
>
> > Use a netdev event notification for sensing that a change has happened
> > in the IP stack, then scan the rdma-cm IDs list to see if there is an ID
> > that is "misaligned" in that respect with the IP stack, and disconnect it,
> > in case this is what the user asked for when setting an ha mode for the ID.
>
> this seems like a strange "HA" feature -- to disconnect connections that
> otherwise would continue operating. What is the use case/use scenario?

OK, I may have gone too fast here. The idea is to align the RDMA traffic with the links used by the IP stack. In the case where the app takes advantage of bonding ipoib devices to achieve HA AND it wants this alignment, when bonding does a fail-over for any reason (eg the problem is fixed and the primary option is used), a "fail-back" of the connection is needed.

(*) HW error --> RC connection break && bonding failover (change of active slave device, send gratuitous ARP), then this app reconnects and it's back in business.

> > +	list_for_each_entry(cma_dev, &dev_list, list)
> > +		list_for_each_entry(id_priv, &cma_dev->id_list, list) {
> > +			dev_addr = &id_priv->id.route.addr.dev_addr;
> > +			if (!memcmp(dev_addr->src_netdev_name, ndev->name, IFNAMSIZ) &&
> > +			    memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len))
> > +				if (id_priv->ha_mode == RDMA_ALIGN_WITH_NETDEVICE)
> > +					schedule_work(&id_priv->ha_work);
> > +		}
>
> This looks horribly racy/incorrect against RDMA device removal, CMA ID
> destruction and netdev renaming.

mmm, bad. I see your point re the first two, that is, some locking is needed to protect against device removal, the ID should be referenced, etc. As for the netdev renaming, I don't see how making a decision based on memcmp(dev_addr->src_netdev_name, ndev->name, IFNAMSIZ) is racy?! Even in the crazy case where ndev->name gets changed in the middle of this memcmp, the only issue would be some confusion made by the code, no damage.

Or.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sashak at voltaire.com Tue May 13 15:59:14 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 13 May 2008 22:59:14 +0000
Subject: [ofa-general] Re: [PATCH] infiniband-diags/Makefile.am: fix location of ibdiag_version.h
In-Reply-To: <4826E877.7090706@dev.mellanox.co.il>
References: <4826E877.7090706@dev.mellanox.co.il>
Message-ID: <20080513225914.GF21414@sashak.voltaire.com>

On 15:37 Sun 11 May , Yevgeny Kliteynik wrote:
> Hi Sasha,
>
> When compiling infiniband-diags not from the source code location,
> compilation fails to find the ibdiag_version.h file - fixing it.
>
> Signed-off-by: Yevgeny Kliteynik

Applied. Thanks.
Sasha

From kliteyn at dev.mellanox.co.il Tue May 13 13:01:53 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 13 May 2008 23:01:53 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_state_mgr.c: fix segmentation fault
In-Reply-To: <20080513221834.GE21414@sashak.voltaire.com>
References: <4829784A.6030708@dev.mellanox.co.il> <20080513221834.GE21414@sashak.voltaire.com>
Message-ID: <4829F3B1.3040009@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> Hi Yevgeny,
>
> On 14:15 Tue 13 May , Yevgeny Kliteynik wrote:
>> Hi Sasha,
>>
>> Fixing a trivial segmentation fault in the state manager.
>>
>> Please apply to the ofed_1_3 branch and to master.
>>
>> -- Yevgeny
>>
>> Signed-off-by: Yevgeny Kliteynik
>
> This patch is not against the master branch; I applied it by hand-editing.
> Thanks for the fix. But please rebase your working branch!

My bad, this patch was against ofed_1_3 only, and I forgot to send a separate patch for master after seeing that... Sorry

-- Yevgeny

> Sasha
>

From or.gerlitz at gmail.com Tue May 13 13:10:30 2008
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Tue, 13 May 2008 23:10:30 +0300
Subject: [ofa-general] [RFC PATCH 3/4] rdma/cma: add high availability mode attribute to IDs
In-Reply-To:
References:
Message-ID: <15ddcffd0805131310x6710fbb6v890d71297f5588ed@mail.gmail.com>

On 5/13/08, Roland Dreier wrote:
>
> > +enum rdma_ha_mode {
> > +	RDMA_ALIGN_WITH_NETDEVICE = 1
> > +};
>
> > +int rdma_set_high_availability_mode(struct rdma_cm_id *id, enum rdma_ha_mode mode)
>
> this seems like overengineering to me... given there are no other modes,
> you are adding an elaborate NOP. (Nothing looks at ha_mode)

First, this patch would later be extended for the rdma_ucm part (exposing the ha_mode to user space). Second, indeed nothing looks at ha_mode in this patch, but the next one (4/4) uses it. I was thinking it's better to decompose the changes this way, such that the patches are neither too small nor too big, both in size and in the change they carry.

> Do you have plans for other modes?

Down the road someone might want to add APM support for the rdma-cm, or more modes that I can't think of now.

> >	u8 srq;
> >	u8 tos;
> > +	enum rdma_ha_mode ha_mode;
>
> Side note -- you're wasting two bytes here because of alignment.

What would be the easy way to avoid it?

Or.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From or.gerlitz at gmail.com Tue May 13 13:26:06 2008
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Tue, 13 May 2008 23:26:06 +0300
Subject: [ofa-general] RE: [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode
In-Reply-To: <000001c8b51b$81cd77d0$865b180a@amr.corp.intel.com>
References: <000001c8b51b$81cd77d0$865b180a@amr.corp.intel.com>
Message-ID: <15ddcffd0805131326x7873df30ua015a1719c90fb89@mail.gmail.com>

On 5/13/08, Sean Hefty wrote:

> This will race with other user calls. I've found it fairly difficult for
> the rdma_cm to call back into its own API and avoid racing with the user
> trying to destroy the cm_id. None of the APIs are coded to allow calling
> them simultaneously with destroy.

I see.

> A better solution for this may be for the rdma_cm to simply notify the
> user that the IP mapping for their RDMA device has changed. The user can
> then disconnect, with the appropriate synchronization, if they want their
> RDMA connection to follow the IP address.
Yes, this is possible. I tried to implement it in the rdma-cm to avoid having each ULP do it in its own code; if you think it's practically impossible for the rdma-cm to call its own API, I can change this into delivering a disconnected event.

> (If I understood correctly, the reason for this is to allow failing back
> to a repaired port.)

Indeed, this is a possible use case.

> >+	list_for_each_entry(cma_dev, &dev_list, list)
> >+		list_for_each_entry(id_priv, &cma_dev->id_list, list) {
> >+			dev_addr = &id_priv->id.route.addr.dev_addr;
> >+			if (!memcmp(dev_addr->src_netdev_name, ndev->name, IFNAMSIZ) &&
> >+			    memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len))
> >+				if (id_priv->ha_mode == RDMA_ALIGN_WITH_NETDEVICE)
> >+					schedule_work(&id_priv->ha_work);
> >+		}
>
> As Roland mentioned, this is racy in the areas he pointed out. This will
> take some thought to handle correctly.

OK, I will try to improve things here; any hints/directions would be very much appreciated...

Or.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rdreier at cisco.com Tue May 13 13:40:52 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 13 May 2008 13:40:52 -0700
Subject: [ofa-general] Re: [PATCH 06/13] QLogic VNIC: IB core stack interaction
In-Reply-To: <20080430171855.31725.89658.stgit@localhost.localdomain> (Ramachandra K.'s message of "Wed, 30 Apr 2008 22:48:55 +0530")
References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430171855.31725.89658.stgit@localhost.localdomain>
Message-ID:

> +#include

> +	ret = ib_find_cached_pkey(viport_config->ibdev,
> +				  viport_config->port,
> +				  be16_to_cpu(viport_config->path_info.path.pkey),
> +				  &attr->pkey_index);

I think this can just be replaced with ib_find_pkey()... there is a call to kmalloc(... GFP_KERNEL) just a couple of lines above, so you are in a context where sleeping is allowed.

As I said before we want to get rid of the caching infrastructure so please don't add new users.

From rdreier at cisco.com Tue May 13 13:41:37 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 13 May 2008 13:41:37 -0700
Subject: [ofa-general] Re: [PATCH 07/13] QLogic VNIC: Handling configurable parameters of the driver
In-Reply-To: <20080430171925.31725.22023.stgit@localhost.localdomain> (Ramachandra K.'s message of "Wed, 30 Apr 2008 22:49:25 +0530")
References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430171925.31725.22023.stgit@localhost.localdomain>
Message-ID:

> +	ib_get_cached_gid(config->ibdev, config->port, 0,
> +			  &config->path_info.path.sgid);

Again, looks like a sleepable context so please use ib_query_gid() instead.

From sashak at voltaire.com Tue May 13 16:43:45 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 13 May 2008 23:43:45 +0000
Subject: [ofa-general] Re: ibsim parsing question
In-Reply-To: <1210624804.2026.586.camel@hrosenstock-ws.xsigo.com>
References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> <20080512225725.GN17046@sashak.voltaire.com> <1210622530.2026.580.camel@hrosenstock-ws.xsigo.com> <20080512231248.GQ17046@sashak.voltaire.com> <20080512231831.GR17046@sashak.voltaire.com> <1210624804.2026.586.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080513234345.GI21414@sashak.voltaire.com>

On 13:40 Mon 12 May , Hal Rosenstock wrote:
>
> Right; that's what I meant by option 1.

True.
> Also, it's not really unknown options since it's part of the > NodeDescription. > > This works as long as the "known" options (currently s= and w=) are not > part of NodeDescription. It works for the real life use case that > started this. That is correct, but we know that ibsim parser doesn't parse NodeDescription in those (port related) lines, so in such "worst" cases when 's=' and/or 'w=' strings are used in NodeDescription this could be just filtered out from ibnetdiscovery file. Sasha From sashak at voltaire.com Tue May 13 16:51:00 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 13 May 2008 23:51:00 +0000 Subject: [ofa-general] Re: [PATCH] infiniband-diags/scripts/iblinkinfo.pl: fix printing of switch name when port 1 is down. In-Reply-To: <20080501155045.4aa3ef2c.weiny2@llnl.gov> References: <20080501155045.4aa3ef2c.weiny2@llnl.gov> Message-ID: <20080513235100.GK21414@sashak.voltaire.com> On 15:50 Thu 01 May , Ira Weiny wrote: > I found a bug in the printing of the names of switches on iblinkinfo.pl. The > name of the switch was being pulled from the first ports "link" structure. The > problem is, if the first port is down there was no structure available. This > gets the switch name from the first link structure available and prints the > name correctly. > > Ira > > From 9b69c0ff4c7785be78157ab78e4a4892d64e2fb2 Mon Sep 17 00:00:00 2001 > From: Ira K. Weiny > Date: Thu, 1 May 2008 15:46:25 -0700 > Subject: [PATCH] infiniband-diags/scripts/iblinkinfo.pl: fix printing of switch name when port 1 > is down. > > Signed-off-by: Ira K. Weiny Applied. Thanks. Sasha From hrosenstock at xsigo.com Tue May 13 13:56:58 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 13 May 2008 13:56:58 -0700 Subject: [ofa-general] Re: ibsim parsing question In-Reply-To: <20080513234345.GI21414@sashak.voltaire.com> References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> <20080512225725.GN17046@sashak.voltaire.com> <1210622530.2026.580.camel@hrosenstock-ws.xsigo.com> <20080512231248.GQ17046@sashak.voltaire.com> <20080512231831.GR17046@sashak.voltaire.com> <1210624804.2026.586.camel@hrosenstock-ws.xsigo.com> <20080513234345.GI21414@sashak.voltaire.com> Message-ID: <1210712218.2026.719.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-13 at 23:43 +0000, Sasha Khapyorsky wrote: > On 13:40 Mon 12 May , Hal Rosenstock wrote: > > > > Right; that's what I meant by option 1. > > True. > > > Also, it's not really unknown options since it's part of the > > NodeDescription. > > > > This works as long as the "known" options (currently s= and w=) are not > > part of NodeDescription. It works for the real life use case that > > started this. > > That is correct, but we know that ibsim parser doesn't parse > NodeDescription in those (port related) lines, so in such "worst" cases > when 's=' and/or 'w=' strings are used in NodeDescription this could be > just filtered out from ibnetdiscovery file. That's why I termed this approach a workaround and it does limit the NodeDescription in ways not limited by the IBA spec. Is this worth mentioning in the README or some other doc for ibsim ? 
-- Hal

> Sasha

From sashak at voltaire.com Tue May 13 17:02:47 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 14 May 2008 00:02:47 +0000
Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit (Was: Re: [ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.)
In-Reply-To: <20080424181657.28d58a29.weiny2@llnl.gov>
References: <20080423133816.6c1b6315.weiny2@llnl.gov> <48109087.6030606@voltaire.com> <20080424143125.2aad1db8.weiny2@llnl.gov> <15ddcffd0804241523p19559580vc3a1293c1fe097b1@mail.gmail.com> <20080424181657.28d58a29.weiny2@llnl.gov>
Message-ID: <20080514000247.GL21414@sashak.voltaire.com>

On 18:16 Thu 24 Apr , Ira Weiny wrote:
>
> From 2e5511d6daf9c586c39698416e4bd36e24b13e62 Mon Sep 17 00:00:00 2001
> From: Ira K. Weiny
> Date: Thu, 24 Apr 2008 18:05:01 -0700
> Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit
>
>
> Signed-off-by: Ira K. Weiny

Applied. Thanks.

Sasha

From sashak at voltaire.com Tue May 13 17:06:17 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 14 May 2008 00:06:17 +0000
Subject: [ofa-general] Nodes dropping out of IPoIB mcast group
In-Reply-To: <4816C6F6.6000602@voltaire.com>
References: <20080423133816.6c1b6315.weiny2@llnl.gov> <48109087.6030606@voltaire.com> <20080424143125.2aad1db8.weiny2@llnl.gov> <15ddcffd0804241523p19559580vc3a1293c1fe097b1@mail.gmail.com> <20080424181657.28d58a29.weiny2@llnl.gov> <48143DBA.3080701@voltaire.com> <20080428091923.0abf9fb5.weiny2@llnl.gov> <4816C6F6.6000602@voltaire.com>
Message-ID: <20080514000617.GM21414@sashak.voltaire.com>

On 09:57 Tue 29 Apr , Or Gerlitz wrote:
>
> And when openSM does the heavy sweep, what nodes would have their client
> rereg bit set, only the ones beyond the recovered link?

Yes.

> also will openSM
> cycle the logical link state of those nodes (which is active!) through
> armed-active again or the only SET would be for the rereg bit?

No, ideally (unless other PortInfo fields were changed) only a client rereg bit SET will be issued.

Sasha.

From caitlin.bestler at neterion.com Tue May 13 14:15:04 2008
From: caitlin.bestler at neterion.com (Caitlin Bestler)
Date: Tue, 13 May 2008 14:15:04 -0700
Subject: [ofa-general] [RFC PATCH 3/4] rdma/cma: add high availability mode attribute to IDs
In-Reply-To:
References:
Message-ID: <469958e00805131415l39c54201v4b5f39ed81fbf9cf@mail.gmail.com>

On Tue, May 13, 2008 at 7:13 AM, Or Gerlitz wrote:
> RDMA_ALIGN_WITH_NETDEVICE high availability (ha) mode means that the consumer
> of the rdma-cm wants that RDMA sessions would always use the same links (eg )
> as the IP stack does. In the current code, this does not happen when bonding did
> fail-over but the IB link used by an already existing session is operating fine.
> For now this mode is supported only for the connected services of the rdma-cm.
>

I'm not sure I've even seen an "RDMA Session". There are lots of RDMA *connections*, and there are RDMA applications that have an application-layer session that use several RDMA connections. But I'm fairly certain that there is no such thing as an "RDMA Session". Which raises some serious doubts about an automatic connection teardown based upon decisions at the RDMA layer.

This will also create problems with iWARP/IB compatibility. The iWARP standards (IETF and RDMAC) both solve the problem of RDMA endpoint / IP Address affinity by simply mandating it.
While no real solution is given in the standards, it has generally been interpreted to mean:

- You cannot create an RDMA connection on a device (or assign an existing TCP connection to an RDMA endpoint) if the device is not a valid route given the source/destination IP addresses.
- You can determine the set of possible RDMA devices by first consulting the local routing tables using the desired source and destination IP addresses.
- If an RDMA device is no longer a valid route for a connection, then the underlying TCP connection will fail (and it would be real nice if this happened promptly when the reason is a network reconfiguration, rather than just waiting for things to fail).

An important corner case here is that there may not be a need to migrate an existing RDMA connection to a new device just because the *preferred* route has changed. The non-preferred route may still be fully operable, and it may be preferable to continue using it for *this* connection given the cost of teardown and startup. Keep in mind that if the old route does not work, then it will fail fairly quickly. If doing it quickly is important, then the device should have mechanisms to ensure that it does not keep stale ARP or Neighbor Discovery entries lingering around. If the ARP/ND information is erased, the connection will be torn down very quickly (destination unreachable).

Now, for both IB and iWARP there is a substantial possibility that a connection can be migrated to a different port within the same or co-operating devices. In that case the High Availability is achieved without the application having to be involved at all.

If the connection is going to have to be re-established on a *different* device, there is a substantial risk that this will involve re-registering memory, re-connecting, and re-advertising buffers. I don't see how you can wisely decide that the benefits of a preferred route outweigh these costs on an application-independent basis. What if the application was nearly done with the connection? Or knew that it would be ending a current burst of activity in a few seconds and could pay for the connection shift-back then?

And if the application is going to make the decision, then can't it just subscribe to the local routing tables on its own without any help from OFA? Even if it is in response to a failure on the old connection, any application that has a "session" concept will have procedures for re-establishing the session on a new connection. Where is the need for a one-size-fits-none standardized solution?
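To make the routing-table consultation in the list above concrete, here is a hypothetical librdmacm sketch (the helper name and the timeout are assumptions, and error handling is trimmed). This is roughly how an rdma-cm consumer lets the kernel's route selection pick the device for a new connection:

	#include <stdlib.h>
	#include <sys/socket.h>
	#include <rdma/rdma_cma.h>

	/* Hypothetical helper: bind a new cm_id to whatever RDMA device the
	 * local routing table selects for this src/dst pair. Both resolution
	 * steps are asynchronous; completion is reported on the event channel
	 * as RDMA_CM_EVENT_ADDR_RESOLVED / RDMA_CM_EVENT_ROUTE_RESOLVED. */
	static struct rdma_cm_id *id_along_ip_route(struct rdma_event_channel *ch,
						    struct sockaddr *src,
						    struct sockaddr *dst)
	{
		struct rdma_cm_id *id;

		if (rdma_create_id(ch, &id, NULL, RDMA_PS_TCP))
			return NULL;
		/* The routing lookup happens here; id ends up bound to the
		 * device behind the egress interface the stack picked. */
		if (rdma_resolve_addr(id, src, dst, 2000 /* ms */)) {
			rdma_destroy_id(id);
			return NULL;
		}
		return id;
	}

Note that nothing in this sequence is re-run for an established id when the preferred route later changes; that gap is what the RDMA_ALIGN_WITH_NETDEVICE proposal in this thread is trying to close.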
From sashak at voltaire.com Tue May 13 17:15:45 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 14 May 2008 00:15:45 +0000
Subject: [ofa-general] Re: ibsim parsing question
In-Reply-To: <1210712218.2026.719.camel@hrosenstock-ws.xsigo.com>
References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> <20080512225725.GN17046@sashak.voltaire.com> <1210622530.2026.580.camel@hrosenstock-ws.xsigo.com> <20080512231248.GQ17046@sashak.voltaire.com> <20080512231831.GR17046@sashak.voltaire.com> <1210624804.2026.586.camel@hrosenstock-ws.xsigo.com> <20080513234345.GI21414@sashak.voltaire.com> <1210712218.2026.719.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080514001545.GO21414@sashak.voltaire.com>

On 13:56 Tue 13 May , Hal Rosenstock wrote:
>
> > That is correct, but we know that the ibsim parser doesn't parse
> > NodeDescription in those (port related) lines, so in such "worst" cases
> > when 's=' and/or 'w=' strings are used in NodeDescription these could be
> > simply filtered out of the ibnetdiscover file.
>
> That's why I termed this approach a workaround and it does limit the
> NodeDescription in ways not limited by the IBA spec.

No, it does not limit NodeDescription at all - it is *only* a file format limitation (remove NodeDescription from port related lines in the file and we are done).

> Is this worth mentioning in the README or some other doc for ibsim ?

Looks like overkill to me.

Sasha

From flatif at NetEffect.com Tue May 13 14:46:47 2008
From: flatif at NetEffect.com (Faisal Latif)
Date: Tue, 13 May 2008 16:46:47 -0500
Subject: [ofa-general] RE: [PATCH 1/1] infiniband/hw/nes/: avoid unnecessary memset
In-Reply-To: <20080512213601.626C91C0008F@mwinf2103.orange.fr>
References: <20080512213601.626C91C0008F@mwinf2103.orange.fr>
Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC080BD82B@venom2>

Acked-by: Faisal Latif

Thanks
Faisal

>
> From: Christophe Jaillet
>
> Hi, here is a patch against linux/drivers/infiniband/hw/nes/nes_cm.c which:
>
> 1) Remove an explicit memset(.., 0, ...) of a variable allocated with
> kzalloc (i.e. 'listener').
>
> Note: this patch is based on 'linux-2.6.25.tar.bz2'
>
> Signed-off-by: Christophe Jaillet
>
> ---
>
> --- linux/drivers/infiniband/hw/nes/nes_cm.c	2008-04-17 04:49:44.000000000 +0200
> +++ linux/drivers/infiniband/hw/nes/nes_cm.c.cj	2008-05-12 23:31:24.000000000 +0200
> @@ -1587,7 +1587,6 @@ static struct nes_cm_listener *mini_cm_l
> 		return NULL;
> 	}
>
> -	memset(listener, 0, sizeof(struct nes_cm_listener));
> 	listener->loc_addr = htonl(cm_info->loc_addr);
> 	listener->loc_port = htons(cm_info->loc_port);
> 	listener->reused_node = 0;
>

From swise at opengridcomputing.com Tue May 13 14:59:14 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 13 May 2008 16:59:14 -0500
Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode
In-Reply-To:
References:
Message-ID: <482A0F32.2010001@opengridcomputing.com>

Roland Dreier wrote:
> > Use netevent notification for sensing that a change has happened in the IP stack,
> > then scan the rdma-cm IDs list to see if there is an ID that is "misaligned"
> > in that respect with the IP stack, and disconnect it, in case this is what the
> > user asked for when setting an ha mode for the ID.
>
> this seems like a strange "HA" feature -- to disconnect connections that
> otherwise would continue operating. What is the use case/use scenario?
Maybe this should really be implemented in the ULP that wants this behavior, i.e., the ULP could register for routing/neighbour changes and tear down connections and re-establish them on the correct device.

Steve.

From swise at opengridcomputing.com Tue May 13 15:00:18 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 13 May 2008 17:00:18 -0500
Subject: [ofa-general] RE: [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode
In-Reply-To: <000001c8b51b$81cd77d0$865b180a@amr.corp.intel.com>
References: <000001c8b51b$81cd77d0$865b180a@amr.corp.intel.com>
Message-ID: <482A0F72.3090206@opengridcomputing.com>

Sean Hefty wrote:
>> +static void cma_ha_work_handler(struct work_struct *work)
>> +{
>> +	struct rdma_id_private *id_priv;
>> +
>> +	id_priv = container_of(work, struct rdma_id_private, ha_work);
>> +	rdma_disconnect(&id_priv->id);
>> +}
>
> This will race with other user calls. I've found it fairly difficult for the
> rdma_cm to call back into its own API and avoid racing with the user trying to
> destroy the cm_id. None of the APIs are coded to allow calling them
> simultaneously with destroy.
>
> A better solution for this may be for the rdma_cm to simply notify the user that
> the IP mapping for their RDMA device has changed. The user can then disconnect,
> with the appropriate synchronization, if they want their RDMA connection to
> follow the IP address. (If I understood correctly, the reason for this is to
> allow failing back to a repaired port.)

Yes. Move this logic to the ULP, not in the rdma-cm...

From bryan.d.green at nasa.gov Tue May 13 15:16:28 2008
From: bryan.d.green at nasa.gov (Bryan Green)
Date: Tue, 13 May 2008 15:16:28 -0700
Subject: [ofa-general] libibvpp - A libibverbs C++ wrapper library.
Message-ID: <20080513221628.3BC9E20415F@ece06.nas.nasa.gov>

I'd like to make an announcement about a recently released library. I've released a C++ wrapper library for libibverbs, called libibvpp. It is currently released under the NOSA (NASA Open Source Agreement) license. For more information, please see the README (link below).

I hope this library is of interest to the C++ programmers out there in the OpenFabrics community. I'm also curious whether there would be any interest in hosting this project on the openfabrics website.

Here is the library's current home: http://opensource.arc.nasa.gov/project/libivpp/
README: http://opensource.arc.nasa.gov/software/24/notes/
Download: http://opensource.arc.nasa.gov/static/downloads/libibvpp-0.1.tar.gz

Thanks,
-Bryan

---------------------------------------
Bryan Green
Visualization Group
NASA Advanced Supercomputing Division
NASA Ames Research Center
email: bryan.d.green at nasa.gov
---------------------------------------

From richard.frank at oracle.com Tue May 13 15:36:44 2008
From: richard.frank at oracle.com (Richard Frank)
Date: Tue, 13 May 2008 18:36:44 -0400
Subject: [ofa-general] RDS flow control
In-Reply-To: <200805132024.12741.okir@lst.de>
References: <200805121157.38135.jon@opengridcomputing.com> <200805132004.01371.okir@lst.de> <4829D92E.2070504@opengridcomputing.com> <200805132024.12741.okir@lst.de>
Message-ID: <482A17FC.7070804@oracle.com>

Olaf, if/when you have this running for IB, let me know. I think we can give it to some folks at Oracle who will be able to tell us if there is any performance regression using TPCH, especially if we have it in the next week or so, as I think we have a config to test with.
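As background for the flow-control work discussed in this subthread, a toy model of credit-based send throttling (the names and the accounting here are assumptions for illustration, not Olaf's patch set): the sender consumes a credit per posted send and stalls at zero; the receiver returns credits as it reposts receive buffers.

	#include <stdio.h>

	/* Each posted send consumes one credit; a sender at zero stalls
	 * instead of overrunning the peer's receive queue. */
	struct send_credits {
		int avail;
	};

	static int try_send(struct send_credits *c, const char *msg)
	{
		if (c->avail == 0)
			return 0;	/* caller must queue and retry later */
		c->avail--;
		printf("sent: %s (credits left: %d)\n", msg, c->avail);
		return 1;
	}

	/* Piggybacked on incoming traffic: the peer advertises how many
	 * new receive buffers it has posted. */
	static void return_credits(struct send_credits *c, int n)
	{
		c->avail += n;
	}

	int main(void)
	{
		struct send_credits c = { 2 };

		try_send(&c, "a");
		try_send(&c, "b");
		if (!try_send(&c, "c"))
			printf("stalled: no credits\n");
		return_credits(&c, 1);
		try_send(&c, "c");
		return 0;
	}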
Olaf Kirch wrote:
> On Tuesday 13 May 2008 20:08:46 Steve Wise wrote:
>>> No, not in the long term. But let's hold off on the flow control stuff
>>> for a little - I would first like to finish my patch set and hand it
>>> out for you folks to bang on it, rather than the other way round.
>>> Okay with you guys?
>>
>> What patch set?
>
> I mentioned in a previous mail to Jon that I have some partial patches
> that implement flow control. I want to get that code out to you ASAP;
> I think that's easier than having two different approaches that need
> to be reconciled afterwards.
>
>> We can't run on chelsio's rnic with fmrs...
>
> Yes, that is understood.
>
> Olaf

From john.gregor at qlogic.com Tue May 13 16:00:49 2008
From: john.gregor at qlogic.com (John Gregor)
Date: Tue, 13 May 2008 16:00:49 -0700 (PDT)
Subject: [ofa-general] bitops take an unsigned long *
Message-ID: <20080513230049.C9DC121A047C@diamond.mv.qlogic.com>

From: Roland Dreier
> Out of curiousity, was there any reason for choosing 0, 1, 2, 3 and
> then skipping to 62?

Not really. Just that RUNNING and SHUTDOWN are conceptually different from ABORTING, DISARMED, and DISABLED, and so it seemed to make sense at the time to cluster the bits at opposite ends of the qword. It made the printk() output easier to scan quickly during debugging.

-John Gregor

From worleys at gmail.com Tue May 13 17:47:15 2008
From: worleys at gmail.com (Chris Worley)
Date: Tue, 13 May 2008 18:47:15 -0600
Subject: [ofa-general] SET_IPOIB_CM enabled by default in some builds and not in others
In-Reply-To:
References:
Message-ID:

In two 1.3 builds I get different SET_IPOIB_CM settings in /etc/infiniband/openib.conf.

A generic build sets it to "yes". A kitchen-sink build doesn't set it.

Is there a reason (as I need it to be enabled on a system that needs the kitchen-sink build)?
Thanks,

Chris

From akepner at sgi.com Tue May 13 18:21:46 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Tue, 13 May 2008 18:21:46 -0700
Subject: [ofa-general] [PATCH] IPoIB: keep ib_wc[] on stack
Message-ID: <20080514012146.GG29302@sgi.com>

We're getting panics like this one on big clusters:

skb_over_panic: text:ffffffff8821f32e len:160 put:100
head:ffff810372b0f000 data:ffff810372b0f01c tail:ffff810372b0f0bc
end:ffff810372b0f080 dev:ib0
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at net/core/skbuff.c:94
invalid opcode: 0000 [1] SMP
last sysfs file: /class/infiniband/mlx4_0/node_type
CPU 0
Modules linked in: worm sg sd_mod crc32c libcrc32c rdma_ucm rdma_cm iw_cm
ib_addr ib_uverbs ib_umad iw_cxgb3 cxgb3 firmware_class mlx4_ib ib_mthca
iscsi_tcp libiscsi scsi_transport_iscsi ib_ipoib ib_cm ib_sa ib_mad ib_core
ipv6 loop numatools xpmem shpchp pci_hotplug i2c_i801 i2c_core mlx4_core
libata scsi_mod nfs lockd nfs_acl af_packet sunrpc e1000
Pid: 0, comm: swapper Tainted: G U 2.6.16.46-0.12-smp #1
RIP: 0010:[] {skb_over_panic+77}
RSP: 0018:ffffffff80417e28 EFLAGS: 00010292
RAX: 0000000000000098 RBX: ffff81041b4bee08 RCX: 0000000000000292
RDX: ffffffff80347868 RSI: 0000000000000292 RDI: ffffffff80347860
RBP: ffff8103725817c0 R08: ffffffff80347868 R09: ffff81041d94e3c0
R10: 0000000000000000 R11: 0000000000000000 R12: ffff81041b4be500
R13: 0000000000000060 R14: 0000000000000900 R15: ffffc20000078908
FS: 0000000000000000(0000) GS:ffffffff803be000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b44089dc000 CR3: 000000041f35d000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff803d8000, task ffffffff80341340)
Stack: ffff810372b0f0bc ffff810372b0f080 ffff81041b4be000 ffff81041b4be500
 0000000000000060 ffffffff8821f336 ffffffff80417ec8 ffff81041b4be000
 0000000417227014 0000000000000292
Call Trace: {:ib_ipoib:ipoib_ib_handle_rx_wc+909}
 {:ib_ipoib:ipoib_poll+159} {net_rx_action+165}
 {__do_softirq+85} {call_softirq+30}
 {do_softirq+44} {do_IRQ+64} {mwait_idle+0}
 {ret_from_intr+0} {mwait_idle+0} {mwait_idle+54}
 {cpu_idle+151} {start_kernel+601} {_sinittext+650}

Started looking into what might cause this and I found that IPoIB always does something like this:

	int ipoib_poll(struct net_device *dev, int *budget)
	{
		struct ipoib_dev_priv *priv = netdev_priv(dev);
		....
		ib_poll_cq(priv->rcq, t, priv->ibwc);

		for (i = 0; i < n; i++) {
			struct ib_wc *wc = priv->ibwc + i;
			....
			ipoib_ib_handle_rx_wc(dev, wc);

What happens if we call ib_poll_cq() then, before processing the rx completions in ipoib_ib_handle_rx_wc(), ipoib_poll() gets called again (on a different CPU)? That could corrupt the priv->ibwc array, and lead to a panic like the one above.

How about keeping the array of struct ib_wc on the stack? This has been tested only on a small system, not yet on one large enough to verify that it prevents the panic. But this "obviously" needs to be fixed, no?
Signed-off-by: Arthur Kepner

---
 ipoib.h    |  3 ---
 ipoib_ib.c | 31 +++++++++++++++++--------------
 2 files changed, 17 insertions(+), 17 deletions(-)

diff -rup a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
--- a/drivers/infiniband/ulp/ipoib/ipoib.h	2008-05-12 16:39:22.024109931 -0700
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h	2008-05-13 16:21:52.433988977 -0700
@@ -326,7 +326,6 @@ struct ipoib_cm_dev_priv {
 	struct sk_buff_head skb_queue;
 	struct list_head start_list;
 	struct list_head reap_list;
-	struct ib_wc ibwc[IPOIB_NUM_WC];
 	struct ib_sge rx_sge[IPOIB_CM_RX_SG];
 	struct ib_recv_wr rx_wr;
 	int nonsrq_conn_qp;
@@ -406,8 +405,6 @@ struct ipoib_dev_priv {
 	struct ib_send_wr tx_wr;
 	unsigned tx_outstanding;
-	struct ib_wc ibwc[IPOIB_NUM_WC];
-	struct ib_wc send_wc[MAX_SEND_CQE];
 	unsigned int tx_poll;
 	struct list_head dead_ahs;
diff -rup a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2008-05-12 16:39:22.020109690 -0700
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2008-05-13 17:19:28.809819954 -0700
@@ -366,12 +366,13 @@ static void ipoib_ib_handle_tx_wc(struct
 void poll_tx(struct ipoib_dev_priv *priv)
 {
+	struct ib_wc send_wc[MAX_SEND_CQE];
 	int n, i;

 	while (1) {
-		n = ib_poll_cq(priv->scq, MAX_SEND_CQE, priv->send_wc);
+		n = ib_poll_cq(priv->scq, MAX_SEND_CQE, send_wc);
 		for (i = 0; i < n; ++i)
-			ipoib_ib_handle_tx_wc(priv->dev, priv->send_wc + i);
+			ipoib_ib_handle_tx_wc(priv->dev, send_wc + i);

 		if (n < MAX_SEND_CQE)
 			break;
@@ -380,6 +381,7 @@ void poll_tx(struct ipoib_dev_priv *priv
 int ipoib_poll(struct net_device *dev, int *budget)
 {
+	struct ib_wc ibwc[IPOIB_NUM_WC];
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int max = min(*budget, dev->quota);
 	int done;
@@ -393,10 +395,10 @@ poll_more:
 	while (max) {
 		t = min(IPOIB_NUM_WC, max);
-		n = ib_poll_cq(priv->rcq, t, priv->ibwc);
+		n = ib_poll_cq(priv->rcq, t, ibwc);

 		for (i = 0; i < n; i++) {
-			struct ib_wc *wc = priv->ibwc + i;
+			struct ib_wc *wc = ibwc + i;

 			if (wc->wr_id & IPOIB_OP_RECV) {
 				++done;
@@ -783,29 +785,30 @@ static int recvs_pending(struct net_devi
 void ipoib_drain_cq(struct net_device *dev)
 {
+	struct ib_wc ibwc[IPOIB_NUM_WC];
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int i, n;

 	do {
-		n = ib_poll_cq(priv->rcq, IPOIB_NUM_WC, priv->ibwc);
+		n = ib_poll_cq(priv->rcq, IPOIB_NUM_WC, ibwc);
 		for (i = 0; i < n; ++i) {
 			/*
 			 * Convert any successful completions to flush
 			 * errors to avoid passing packets up the
 			 * stack after bringing the device down.
 			 */
-			if (priv->ibwc[i].status == IB_WC_SUCCESS)
-				priv->ibwc[i].status = IB_WC_WR_FLUSH_ERR;
+			if (ibwc[i].status == IB_WC_SUCCESS)
+				ibwc[i].status = IB_WC_WR_FLUSH_ERR;

-			if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) {
-				if (priv->ibwc[i].wr_id & IPOIB_OP_CM)
-					ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
+			if (ibwc[i].wr_id & IPOIB_OP_RECV) {
+				if (ibwc[i].wr_id & IPOIB_OP_CM)
+					ipoib_cm_handle_rx_wc(dev, ibwc + i);
 				else
-					ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
+					ipoib_ib_handle_rx_wc(dev, ibwc + i);
 			} else {
-				if (priv->ibwc[i].wr_id & IPOIB_OP_CM)
-					ipoib_cm_handle_tx_wc(dev, priv->ibwc + i);
+				if (ibwc[i].wr_id & IPOIB_OP_CM)
+					ipoib_cm_handle_tx_wc(dev, ibwc + i);
 				else
-					ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
+					ipoib_ib_handle_tx_wc(dev, ibwc + i);
 			}
 		}
 	} while (n == IPOIB_NUM_WC);

From npiggin at suse.de Tue May 13 21:11:22 2008
From: npiggin at suse.de (Nick Piggin)
Date: Wed, 14 May 2008 06:11:22 +0200
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <20080513153238.GL19717@sgi.com>
References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com>
Message-ID: <20080514041122.GE24516@wotan.suse.de>

On Tue, May 13, 2008 at 10:32:38AM -0500, Robin Holt wrote:
> On Tue, May 13, 2008 at 10:06:44PM +1000, Nick Piggin wrote:
> > On Thursday 08 May 2008 10:38, Robin Holt wrote:
> > > In order to invalidate the remote page table entries, we need to message
> > > (uses XPC) to the remote side. The remote side needs to acquire the
> > > importing process's mmap_sem and call zap_page_range(). Between the
> > > messaging and the acquiring a sleeping lock, I would argue this will
> > > require sleeping locks in the path prior to the mmu_notifier invalidate_*
> > > callouts().
> >
> > Why do you need to take mmap_sem in order to shoot down pagetables of
> > the process? It would be nice if this can just be done without
> > sleeping.
>
> We are trying to shoot down page tables of a different process running
> on a different instance of Linux running on Numa-link connected portions
> of the same machine.

Right. You can zap page tables without sleeping, if you're careful. I don't know that we quite do that for anonymous pages at the moment, but it should be possible with a bit of thought, I believe.

> The messaging is clearly going to require sleeping. Are you suggesting
> we need to rework XPC communications to not require sleeping? I think
> that is going to be impossible since the transfer engine requires a
> sleeping context.

I guess that you have found a way to perform TLB flushing within coherent domains over the numalink interconnect without sleeping. I'm sure it would be possible to send similar messages between non coherent domains.
So yes, I'd much rather rework such a highly specialized system to fit in closer with Linux than rework Linux to fit with these machines (and apparently slow everyone else down).

> Additionally, the call to zap_page_range expects to have the mmap_sem
> held. I suppose we could use something other than zap_page_range and
> atomically clear the process page tables.

zap_page_range does not expect to have mmap_sem held. I think for anon pages it is always called with mmap_sem, however try_to_unmap_anon is not (although it expects page lock to be held, I think we should be able to avoid that).

> Doing that will not alleviate
> the need to sleep for the messaging to the other partitions.

No, but I'd venture to guess that is not impossible to implement even on your current hardware (maybe a firmware update is needed)?

From benh at kernel.crashing.org Tue May 13 22:43:59 2008
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 13 May 2008 22:43:59 -0700
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <200805132214.27510.nickpiggin@yahoo.com.au>
References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507234521.GN8276@duo.random> <20080508013459.GS8276@duo.random> <200805132214.27510.nickpiggin@yahoo.com.au>
Message-ID: <1210743839.8297.55.camel@pasglop>

On Tue, 2008-05-13 at 22:14 +1000, Nick Piggin wrote:
> ea.
>
> I don't see why you're bending over so far backwards to accommodate
> this GRU thing that we don't even have numbers for and could actually
> potentially be batched up in other ways (eg. using mmu_gather or
> mmu_gather-like idea).

I agree, we're better off generalizing the mmu_gather batching instead... I had some never-finished patches to use the mmu_gather for pretty much everything except single page faults, tho various subtle differences between archs and lack of time caused me to let them gather dust and not finish them...

I can try to dig some of that out when I'm back from my current travel, though it's probably worth re-doing from scratch now.

Ben.

From npiggin at suse.de Tue May 13 23:06:11 2008
From: npiggin at suse.de (Nick Piggin)
Date: Wed, 14 May 2008 08:06:11 +0200
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <1210743839.8297.55.camel@pasglop>
References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507234521.GN8276@duo.random> <20080508013459.GS8276@duo.random> <200805132214.27510.nickpiggin@yahoo.com.au> <1210743839.8297.55.camel@pasglop>
Message-ID: <20080514060610.GB30448@wotan.suse.de>

On Tue, May 13, 2008 at 10:43:59PM -0700, Benjamin Herrenschmidt wrote:
>
> On Tue, 2008-05-13 at 22:14 +1000, Nick Piggin wrote:
> > ea.
> >
> > I don't see why you're bending over so far backwards to accommodate
> > this GRU thing that we don't even have numbers for and could actually
> > potentially be batched up in other ways (eg. using mmu_gather or
> > mmu_gather-like idea).
>
> I agree, we're better off generalizing the mmu_gather batching
> instead...

Well, the first thing would be just to get rid of the whole start/end idea, which completely departs from the standard Linux system of clearing ptes, then flushing TLBs, then freeing memory. The onus would then be on GRU to come up with some numbers to justify batching, and a patch which works nicely with the rest of the Linux mm. And yes, mmu-gather is *the* obvious first choice of places to look if one wanted batching hooks.
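A toy illustration of the mmu_gather-style batching idea just mentioned (this is not the kernel's mmu_gather, only the shape of it): accumulate the range covered by cleared ptes, then make a single invalidate callout at flush time instead of one per page.

	#include <stdio.h>

	struct gather_range {
		unsigned long start;	/* lowest address invalidated so far */
		unsigned long end;	/* one past the highest */
	};

	static void gather_init(struct gather_range *g)
	{
		g->start = ~0UL;
		g->end = 0;
	}

	/* Called for every pte cleared: cheap bookkeeping, no callout. */
	static void gather_add(struct gather_range *g, unsigned long addr,
			       unsigned long size)
	{
		if (addr < g->start)
			g->start = addr;
		if (addr + size > g->end)
			g->end = addr + size;
	}

	/* Called once per batch: the single point where an external-TLB
	 * notifier (e.g. for the GRU) would be invoked. */
	static void gather_flush(struct gather_range *g)
	{
		if (g->end > g->start)
			printf("invalidate [%#lx, %#lx)\n", g->start, g->end);
		gather_init(g);
	}

	int main(void)
	{
		struct gather_range g;
		unsigned long a;

		gather_init(&g);
		for (a = 0x1000; a < 0x5000; a += 0x1000)
			gather_add(&g, a, 0x1000);	/* four ptes */
		gather_flush(&g);			/* one callout */
		return 0;
	}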
> I had some never-finished patches to use the mmu_gather for pretty much
> everything except single page faults, tho various subtle differences
> between archs and lack of time caused me to let them gather dust and
> not finish them...
>
> I can try to dig some of that out when I'm back from my current travel,
> though it's probably worth re-doing from scratch now.

I always liked the idea as you know. But I don't think that should be mixed in with the first iteration of the mmu notifiers patch anyway. GRU actually can work without batching, but there is simply some (unquantified to me) penalty for not batching it. I think it is far better to put in clean, simple, working functionality first. The idea that we have to unload some monster be-all-and-end-all solution onto mainline in a single go seems counterproductive to me.

From ogerlitz at voltaire.com Wed May 13 23:23:48 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 14 May 2008 09:23:48 +0300
Subject: [ofa-general] Re: the so many IPoIB-UD failures introduced by OFED 1.3
In-Reply-To:
References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com> <20080510190721.GI5298@sgi.com>
Message-ID: <482A8574.8070201@voltaire.com>

Roland Dreier wrote:
> > ipoib_cm.c:ipoib_cm_send() does:
> > 	if (++priv->tx_outstanding == ipoib_sendq_size)
> > 		netif_stop_queue(dev);
> >
> > but ipoib_ib.c:ipoib_send() does:
> > 	if (++priv->tx_outstanding == (ipoib_sendq_size - 1)) {
> > 		netif_stop_queue(dev);
>
> So this is not in the upstream kernel... I wonder if this is a bug
> introduced in an OFED 1.3 patch?

Over the last period we have had so much debugging done on unreviewed ipoib patches which were merged into OFED 1.3, bypassing any sane procedure. This includes people sending Roland bug reports on code he does not have in his tree, and people reporting bugs introduced by code pushed to OFED after rc3! It seems like we chose a very inefficient way to work: first, merge code; second, test and see it crash; third, ask the maintainer to review and get him to fix it; fourth, push it to the kernel. OFED 1.3 is out there, merged into commercial "enterprise" distros, and ipoib is the first thing people test, so these people will hit all these crashes. Maybe it's about time for the Linux IB maintainers to get a little angry?!

Or.

From eli at dev.mellanox.co.il Wed May 14 00:25:48 2008
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Wed, 14 May 2008 10:25:48 +0300
Subject: [ofa-general] [PATCH] IPoIB: keep ib_wc[] on stack
In-Reply-To: <20080514012146.GG29302@sgi.com>
References: <20080514012146.GG29302@sgi.com>
Message-ID: <1210749948.15669.268.camel@mtls03>

On Tue, 2008-05-13 at 18:21 -0700, akepner at sgi.com wrote:
> We're getting panics like this one on big clusters:
>
> skb_over_panic: text:ffffffff8821f32e len:160 put:100 head:ffff810372b0f000 data:ffff810372b0f01c tail:ffff810372b0f0bc end:ffff810372b0f080 dev:ib0

RX SKBs are large enough to contain 100 bytes... this looks like corruption. Can you give more information on OS, kernel version, and OFED version?

> Started looking into what might cause this and I found that IPoIB
> always does something like this:
>
> 	int ipoib_poll(struct net_device *dev, int *budget)
> 	{
> 		struct ipoib_dev_priv *priv = netdev_priv(dev);
> 		....
> 		ib_poll_cq(priv->rcq, t, priv->ibwc);
>
> 		for (i = 0; i < n; i++) {
> 			struct ib_wc *wc = priv->ibwc + i;
> 			....
> 			ipoib_ib_handle_rx_wc(dev, wc);

From NAPI_HOWTO.txt (although the file has been removed, I think the statement is still valid):

-Guarantee: Only one CPU at any time can call dev->poll(); this is because only one CPU can pick the initial interrupt and hence the initial netif_rx_schedule(dev);

> How about keeping the array of struct ib_wc on the stack?

The stack is limited for kernel code, and putting this on the stack is limiting. I think this could hurt performance too, due to more cache misses.

From monis at Voltaire.COM Wed May 14 00:41:19 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Wed, 14 May 2008 10:41:19 +0300
Subject: [ofa-general] [PATCH] IB/core: handle race between elements in work queues after event
In-Reply-To:
References: <48187E5A.7040809@Voltaire.COM> <4819BD29.7080002@Voltaire.COM> <4820638E.4030901@Voltaire.COM> <4827FBDF.9040308@Voltaire.COM>
Message-ID: <482A979F.6040305@Voltaire.COM>

Roland Dreier wrote:
> > Can we please go on with this patch? We would like to see it in the next kernel.
>
> I still don't get why this is important to you. Is there a concrete
> example of a situation where this actually makes a measurable difference?
>
> We need some justification for adding this locking complexity beyond "it
> doesn't hurt." (And also of course we need it fixed so there aren't races)
>
> - R.

Hi,

OK. Here is an example that was seen in our tests. One IPoIB host (client) sends a stream of multicast packets to another IPoIB host (server). An SM takeover event takes place during traffic; as a result, multicast info is flushed and the hosts need to rejoin. Without the patch there is a chance (which according to our experience is a very big chance) that the request to rejoin will go to the old SM, and the join completes successfully only after a retry. This takes too long, and the patch solves it. I hope that this is convincing enough for you, because for us it is important that recovery from a failure be as quick as possible.

thanks

MoniS

From vlad at dev.mellanox.co.il Wed May 14 03:55:12 2008
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Wed, 14 May 2008 13:55:12 +0300
Subject: [ofa-general] SET_IPOIB_CM enabled by default in some builds and not in others
In-Reply-To:
References:
Message-ID: <482AC510.3090602@dev.mellanox.co.il>

Chris Worley wrote:
> In two 1.3 builds I get different SET_IPOIB_CM settings in
> /etc/infiniband/openib.conf.
>
> A generic build sets it to "yes". A kitchen-sink build doesn't set it.
>
> Is there a reason (as I need it to be enabled on a system that needs
> the kitchen-sink build)?
> Thanks,
>
> Chris

The default mode for IPoIB CM in OFED-1.3 (/etc/infiniband/openib.conf) is: SET_IPOIB_CM=yes

It was different (SET_IPOIB_CM=no) in OFED-1.2 between Thu Mar 29 16:57:22 2007 and Wed Apr 4 10:41:09 2007 (before OFED-1.2-rc1). Can you point me to the OFED-1.3 build where SET_IPOIB_CM is set to "no"?

Regards,
Vladimir

From hrosenstock at xsigo.com Wed May 14 04:24:04 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Wed, 14 May 2008 04:24:04 -0700
Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit (Was: Re: [ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.)
In-Reply-To: <20080514000247.GL21414@sashak.voltaire.com>
References: <20080423133816.6c1b6315.weiny2@llnl.gov> <48109087.6030606@voltaire.com> <20080424143125.2aad1db8.weiny2@llnl.gov> <15ddcffd0804241523p19559580vc3a1293c1fe097b1@mail.gmail.com> <20080424181657.28d58a29.weiny2@llnl.gov> <20080514000247.GL21414@sashak.voltaire.com>
Message-ID: <1210764244.2026.728.camel@hrosenstock-ws.xsigo.com>

On Wed, 2008-05-14 at 00:02 +0000, Sasha Khapyorsky wrote:
> On 18:16 Thu 24 Apr , Ira Weiny wrote:
> >
> > From 2e5511d6daf9c586c39698416e4bd36e24b13e62 Mon Sep 17 00:00:00 2001
> > From: Ira K. Weiny
> > Date: Thu, 24 Apr 2008 18:05:01 -0700
> > Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit
> >
> >
> > Signed-off-by: Ira K. Weiny
>
> Applied. Thanks.

Would this change also be applied to the ofed_1_3 branch?

-- Hal

> Sasha

From holt at sgi.com Wed May 14 04:26:25 2008
From: holt at sgi.com (Robin Holt)
Date: Wed, 14 May 2008 06:26:25 -0500
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <20080514041122.GE24516@wotan.suse.de>
References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de>
Message-ID: <20080514112625.GY9878@sgi.com>

On Wed, May 14, 2008 at 06:11:22AM +0200, Nick Piggin wrote:
> On Tue, May 13, 2008 at 10:32:38AM -0500, Robin Holt wrote:
> > On Tue, May 13, 2008 at 10:06:44PM +1000, Nick Piggin wrote:
> > > On Thursday 08 May 2008 10:38, Robin Holt wrote:
> > > > In order to invalidate the remote page table entries, we need to message
> > > > (uses XPC) to the remote side. The remote side needs to acquire the
> > > > importing process's mmap_sem and call zap_page_range(). Between the
> > > > messaging and the acquiring a sleeping lock, I would argue this will
> > > > require sleeping locks in the path prior to the mmu_notifier invalidate_*
> > > > callouts().
> > >
> > > Why do you need to take mmap_sem in order to shoot down pagetables of
> > > the process? It would be nice if this can just be done without
> > > sleeping.
> >
> > We are trying to shoot down page tables of a different process running
> > on a different instance of Linux running on Numa-link connected portions
> > of the same machine.
>
> Right.
> You can zap page tables without sleeping, if you're careful. I
> don't know that we quite do that for anonymous pages at the moment, but it
> should be possible with a bit of thought, I believe.
>
> > The messaging is clearly going to require sleeping. Are you suggesting
> > we need to rework XPC communications to not require sleeping? I think
> > that is going to be impossible since the transfer engine requires a
> > sleeping context.
>
> I guess that you have found a way to perform TLB flushing within coherent
> domains over the numalink interconnect without sleeping. I'm sure it would
> be possible to send similar messages between non coherent domains.

I assume by coherent domains you are actually talking about system images. Our memory coherence domain on the 3700 family is 512 processors on 128 nodes. On the 4700 family, it is 16,384 processors on 4096 nodes. We extend a "Read-Exclusive" mode beyond the coherence domain so any processor is able to read any cacheline on the system. We also provide uncached access for certain types of memory beyond the coherence domain.

For the other partitions, the exporting partition does not know at what virtual address the imported pages are mapped. The pages are frequently mapped in a different order by the MPI library to help with MPI collective operations. For the exporting side to do those TLB flushes, we would need to replicate all that importing information back to the exporting side.

Additionally, the hardware that does the TLB flushing is protected by a spinlock on each system image. We would need to change that simple spinlock into a type of hardware lock that would work (on 3700) outside the processor's coherence domain. The only way to do that is to use uncached addresses with our Atomic Memory Operations, which do the cmpxchg at the memory controller. The uncached accesses are an order of magnitude or more slower.

> So yes, I'd much rather rework such a highly specialized system to fit in
> closer with Linux than rework Linux to fit with these machines (and
> apparently slow everyone else down).

But it isn't that we are having a problem adapting to just the hardware. One of the limiting factors is Linux on the other partition.

> > Additionally, the call to zap_page_range expects to have the mmap_sem
> > held. I suppose we could use something other than zap_page_range and
> > atomically clear the process page tables.
>
> zap_page_range does not expect to have mmap_sem held. I think for anon
> pages it is always called with mmap_sem, however try_to_unmap_anon is
> not (although it expects page lock to be held, I think we should be able
> to avoid that).

zap_page_range calls unmap_vmas, which walks to vma->next. Are you saying that can be walked without grabbing the mmap_sem at least for read? I feel my understanding of list management and locking completely shifting.

> > Doing that will not alleviate
> > the need to sleep for the messaging to the other partitions.
>
> No, but I'd venture to guess that is not impossible to implement even
> on your current hardware (maybe a firmware update is needed)?

Are you suggesting the sending side would not need to sleep, or the receiving side? Assuming you meant the sender, it spins waiting for the remote side to acknowledge the invalidate request? We place the data into a previously agreed-upon buffer and send an interrupt. At this point, we would need to start spinning and waiting for completion. Let's assume we never run out of buffer space.

The receiving side receives an interrupt.
The interrupt currently wakes an XPC thread to do the work of transferring and delivering the message to XPMEM. The transfer of the data, which XPC does using the BTE engine, takes up to 28 seconds to time out (a hardware timeout before raising an error), and the BTE code automatically does a retry for certain types of failure. We currently need to grab semaphores, which _MAY_ be able to be reworked into other types of locks.

Thanks,
Robin

From hrosenstock at xsigo.com Wed May 14 04:29:39 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Wed, 14 May 2008 04:29:39 -0700
Subject: [ofa-general] Re: ibsim parsing question
In-Reply-To: <20080514001545.GO21414@sashak.voltaire.com>
References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> <20080512225725.GN17046@sashak.voltaire.com> <1210622530.2026.580.camel@hrosenstock-ws.xsigo.com> <20080512231248.GQ17046@sashak.voltaire.com> <20080512231831.GR17046@sashak.voltaire.com> <1210624804.2026.586.camel@hrosenstock-ws.xsigo.com> <20080513234345.GI21414@sashak.voltaire.com> <1210712218.2026.719.camel@hrosenstock-ws.xsigo.com> <20080514001545.GO21414@sashak.voltaire.com>
Message-ID: <1210764579.2026.735.camel@hrosenstock-ws.xsigo.com>

On Wed, 2008-05-14 at 00:15 +0000, Sasha Khapyorsky wrote:
> On 13:56 Tue 13 May , Hal Rosenstock wrote:
> >
> > > That is correct, but we know that the ibsim parser doesn't parse
> > > NodeDescription in those (port related) lines, so in such "worst" cases
> > > when 's=' and/or 'w=' strings are used in NodeDescription these could be
> > > simply filtered out of the ibnetdiscover file.
> >
> > That's why I termed this approach a workaround and it does limit the
> > NodeDescription in ways not limited by the IBA spec.
>
> No, it does not limit NodeDescription at all - it is *only* a file format
> limitation (remove NodeDescription from port related lines in the file
> and we are done).

That's my point; NodeDescription is part of the ibnd file format, and hence this limits what can be there, so in that sense this limits its contents, and removing such occurrences from the file is a workaround IMO. While a distinction can be made between the actual NodeDescription and what is in the file, they're supposed to be one and the same and not require some extra post-processing.

> > Is this worth mentioning in the README or some other doc for ibsim ?
>
> Looks like overkill to me.

OK.

-- Hal

> Sasha

From ogerlitz at voltaire.com Wed May 14 05:33:47 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 14 May 2008 15:33:47 +0300
Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode
In-Reply-To: <482A0F32.2010001@opengridcomputing.com>
References: <482A0F32.2010001@opengridcomputing.com>
Message-ID: <482ADC2B.5080008@voltaire.com>

Steve Wise wrote:
> Maybe this should really be implemented in the ULP that wants this
> behavior, i.e., the ULP could register for routing/neighbour changes and
> tear down connections and re-establish them on the correct device.

Hi Steve,

First, registration for neighbour changes can't serve the purpose of aligning RDMA traffic with the IP stack, for a bunch of reasons, among them:

- for IB, no neighbour is created at the passive side of the unicast session
- for unicast sessions, address resolution involves ARP, but the neighbour may be deleted by the kernel since the rdma traffic does not go through the stack
- for multicast sessions, no neighbour is created during address resolution

Second, the rdma-cm does well in saving the ULP from interacting with the network stack; that is, the ULP is not aware of the routing lookup / neighbour / net device used for address resolution. (A skeleton of such low-level event registration is sketched below.)
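A skeleton of what such low-level registration for net events could look like inside the rdma-cm (a sketch under assumptions: the handler name and the id-scanning step are illustrative, not the posted patch; the notifier calls are the stock 2.6 kernel API):

	#include <linux/module.h>
	#include <linux/notifier.h>
	#include <linux/netdevice.h>

	static int cma_netdev_callback(struct notifier_block *self,
				       unsigned long event, void *ctx)
	{
		/* on these kernels the callback argument is the net_device */
		struct net_device *ndev = ctx;

		if (event != NETDEV_UP && event != NETDEV_CHANGEADDR)
			return NOTIFY_DONE;

		/* Here the rdma-cm would walk its list of ids and schedule
		 * work for any id whose bound device no longer matches what
		 * the IP stack would now use for ndev (omitted in this sketch). */
		(void)ndev;
		return NOTIFY_DONE;
	}

	static struct notifier_block cma_netdev_nb = {
		.notifier_call = cma_netdev_callback,
	};

	static int __init cma_ha_sketch_init(void)
	{
		return register_netdevice_notifier(&cma_netdev_nb);
	}

	static void __exit cma_ha_sketch_exit(void)
	{
		unregister_netdevice_notifier(&cma_netdev_nb);
	}

	module_init(cma_ha_sketch_init);
	module_exit(cma_ha_sketch_exit);
	MODULE_LICENSE("GPL");

Registering once at this level spares every ULP from duplicating the hook, which is the design choice argued for here.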
In that spirit I prefer to add the registration for net events at the low level (rdma-cm).

Third, thanks for bringing up the point of route changes :)

Or.

From ogerlitz at voltaire.com Wed May 14 05:44:01 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 14 May 2008 15:44:01 +0300
Subject: [ofa-general] [RFC PATCH 3/4] rdma/cma: add high availability mode attribute to IDs
In-Reply-To: <469958e00805131415l39c54201v4b5f39ed81fbf9cf@mail.gmail.com>
References: <469958e00805131415l39c54201v4b5f39ed81fbf9cf@mail.gmail.com>
Message-ID: <482ADE91.1050704@voltaire.com>

Caitlin Bestler wrote:
> I'm not sure I've even seen an "RDMA Session".

OK, there was some misunderstanding here: by "session" I refer to both connected and unconnected (unicast & multicast) services of the rdma-cm. I wanted to emphasize that, similarly to the network stack, where bonding works for TCP, UDP unicast, UDP multicast, etc. traffic, I want this "rdma/ip traffic alignment" feature to work not only for RC connections.

> And if the application is going to make the decision, then
> can't it just subscribe to the local routing tables on its
> own without any help from OFA?

It can, but I don't want it to. Please see my other response to Steve on this thread from today.

Or.

From ogerlitz at voltaire.com Wed May 14 05:52:11 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 14 May 2008 15:52:11 +0300
Subject: [ofa-general] Re: IB/ipoib: Fix neigh destructor oops for kernels older than 2.6.21
In-Reply-To: <1210765031.15669.285.camel@mtls03>
References: <1210765031.15669.285.camel@mtls03>
Message-ID: <482AE07B.8040501@voltaire.com>

Eli Cohen wrote:
> IB/ipoib: Fix neigh destructor oops
>
> For kernels 2.6.20 and older, it may happen that ipoib_neigh_cleanup()
> is called through a stale pointer after IPoIB has been unloaded,
> causing a kernel oops. This problem has been fixed for 2.6.21 with
> the following commit: ecbb416939da77c0d107409976499724baddce7b

Hi Eli,

Before looking into the solution, I'd like to slow down a little and understand the problem (how can ipoib_neigh_cleanup() be called after IPoIB has been unloaded?) and why the commit below solves it; from its change-log I don't see any reference to this problem:

> commit ecbb416939da77c0d107409976499724baddce7b
> Author: Alexey Kuznetsov
> Date: Sat Mar 24 12:52:16 2007 -0700
>
> [NET]: Fix neighbour destructor handling.
>
> ->neigh_destructor() is killed (not used), replaced with
> ->neigh_cleanup(), which is called when neighbor entry goes to dead
> state. At this point everything is still valid: neigh->dev,
> neigh->parms etc.
>
> The device should guarantee that dead neighbor entries (neigh->dead !=
> 0) do not get private part initialized, otherwise nobody will cleanup
> it.
>
> I think this is enough for ipoib which is the only user of this thing.
> Initialization private part of neighbor entries happens in ipib
> start_xmit routine, which is not reached when device is down. But it
> would be better to add explicit test for neigh->dead in any case.
>
> Signed-off-by: David S. Miller
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index 0741c6d..f2a40ae 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -814,7 +814,7 @@ static void ipoib_set_mcast_list(struct net_device *dev)
>  		queue_work(ipoib_workqueue, &priv->restart_task);
>  }
>
> -static void ipoib_neigh_destructor(struct neighbour *n)
> +static void ipoib_neigh_cleanup(struct neighbour *n)
>  {
>  	struct ipoib_neigh *neigh;
>  	struct ipoib_dev_priv *priv = netdev_priv(n->dev);
> @@ -822,7 +822,7 @@ static void ipoib_neigh_destructor(struct neighbour *n)
>  	struct ipoib_ah *ah = NULL;
>
>  	ipoib_dbg(priv,
> -		  "neigh_destructor for %06x " IPOIB_GID_FMT "\n",
> +		  "neigh_cleanup for %06x " IPOIB_GID_FMT "\n",
>  		  IPOIB_QPN(n->ha),
>  		  IPOIB_GID_RAW_ARG(n->ha + 4));
>
> @@ -874,7 +874,7 @@ void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh)
>
>  static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms)
>  {
> -	parms->neigh_destructor = ipoib_neigh_destructor;
> +	parms->neigh_cleanup = ipoib_neigh_cleanup;
>
>  	return 0;
>  }
> diff --git a/include/net/neighbour.h b/include/net/neighbour.h
> index 3725b93..ad7fe11 100644
> --- a/include/net/neighbour.h
> +++ b/include/net/neighbour.h
> @@ -36,7 +36,7 @@ struct neigh_parms
>  	struct net_device *dev;
>  	struct neigh_parms *next;
>  	int (*neigh_setup)(struct neighbour *);
> -	void (*neigh_destructor)(struct neighbour *);
> +	void (*neigh_cleanup)(struct neighbour *);
>  	struct neigh_table *tbl;
>
>  	void *sysctl_table;
> diff --git a/net/atm/clip.c b/net/atm/clip.c
> index ebb5d0c..8c38258 100644
> --- a/net/atm/clip.c
> +++ b/net/atm/clip.c
> @@ -261,14 +261,6 @@ static void clip_pop(struct atm_vcc *vcc, struct sk_buff *skb)
>  	spin_unlock_irqrestore(&PRIV(dev)->xoff_lock, flags);
>  }
>
> -static void clip_neigh_destroy(struct neighbour *neigh)
> -{
> -	DPRINTK("clip_neigh_destroy (neigh %p)\n", neigh);
> -	if (NEIGH2ENTRY(neigh)->vccs)
> -		printk(KERN_CRIT "clip_neigh_destroy: vccs != NULL !!!\n");
> -	NEIGH2ENTRY(neigh)->vccs = (void *) NEIGHBOR_DEAD;
> -}
> -
>  static void clip_neigh_solicit(struct neighbour *neigh, struct sk_buff *skb)
>  {
>  	DPRINTK("clip_neigh_solicit (neigh %p, skb %p)\n", neigh, skb);
> @@ -342,7 +334,6 @@ static struct neigh_table clip_tbl = {
>  	/* parameters are copied from ARP ... */
>  	.parms = {
>  		.tbl = &clip_tbl,
> -		.neigh_destructor = clip_neigh_destroy,
>  		.base_reachable_time = 30 * HZ,
>  		.retrans_time = 1 * HZ,
>  		.gc_staletime = 60 * HZ,
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 3183142..cfc6001 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -140,6 +140,8 @@ static int neigh_forced_gc(struct neigh_table *tbl)
>  			n->dead = 1;
>  			shrunk = 1;
>  			write_unlock(&n->lock);
> +			if (n->parms->neigh_cleanup)
> +				n->parms->neigh_cleanup(n);
>  			neigh_release(n);
>  			continue;
>  		}
> @@ -211,6 +213,8 @@ static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev)
>  				NEIGH_PRINTK2("neigh %p is stray.\n", n);
>  			}
>  			write_unlock(&n->lock);
> +			if (n->parms->neigh_cleanup)
> +				n->parms->neigh_cleanup(n);
>  			neigh_release(n);
>  		}
>  	}
> @@ -582,9 +586,6 @@ void neigh_destroy(struct neighbour *neigh)
>  		kfree(hh);
>  	}
>
> -	if (neigh->parms->neigh_destructor)
> -		(neigh->parms->neigh_destructor)(neigh);
> -
>  	skb_queue_purge(&neigh->arp_queue);
>
>  	dev_put(neigh->dev);
> @@ -675,6 +676,8 @@ static void neigh_periodic_timer(unsigned long arg)
>  			*np = n->next;
>  			n->dead = 1;
>  			write_unlock(&n->lock);
> +			if (n->parms->neigh_cleanup)
> +				n->parms->neigh_cleanup(n);
>  			neigh_release(n);
>  			continue;
>  		}
> @@ -2088,8 +2091,11 @@ void __neigh_for_each_release(struct neigh_table *tbl,
>  			} else
>  				np = &n->next;
>  			write_unlock(&n->lock);
> -			if (release)
> +			if (release) {
> +				if (n->parms->neigh_cleanup)
> +					n->parms->neigh_cleanup(n);
>  				neigh_release(n);
> +			}
>  		}
>  	}
>  }

From steiner at sgi.com Wed May 14 06:15:32 2008
From: steiner at sgi.com (Jack Steiner)
Date: Wed, 14 May 2008 08:15:32 -0500
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <1210743839.8297.55.camel@pasglop>
References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507234521.GN8276@duo.random> <20080508013459.GS8276@duo.random> <200805132214.27510.nickpiggin@yahoo.com.au> <1210743839.8297.55.camel@pasglop>
Message-ID: <20080514131531.GA10393@sgi.com>

On Tue, May 13, 2008 at 10:43:59PM -0700, Benjamin Herrenschmidt wrote:
> On Tue, 2008-05-13 at 22:14 +1000, Nick Piggin wrote:
> > ea.
> >
> > I don't see why you're bending over so far backwards to accommodate
> > this GRU thing that we don't even have numbers for and could actually
> > potentially be batched up in other ways (eg. using mmu_gather or
> > mmu_gather-like idea).
>
> I agree, we're better off generalizing the mmu_gather batching
> instead...

Unfortunately, we are at least several months away from being able to provide numbers to justify batching - assuming it is really needed. We need large systems running real user workloads. I wish we had that available right now, but we don't.

It also depends on what you mean by "no batching". If you mean that the notifier gets called for each pte that is removed from the page table, then the overhead is clearly very high for some operations. Consider the unmap of a very large object. A TLB flush per page will be too costly. However, something based on the mmu_gather seems like it should provide exactly what is needed to do efficient flushing of the TLB.

The GRU does not require that it be called in a sleepable context. As long as the notifier callout provides the mmu_gather and vaddr range being flushed, the GRU can efficiently do the rest.
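One possible shape for the callout just described (hypothetical: these names do not exist in any kernel; the point is only that a single call per flushed range suffices for an external TLB):

	#include <linux/kernel.h>
	#include <linux/mm_types.h>

	/* Hypothetical hook: invoked from the mmu_gather flush path, after
	 * the ptes are cleared and alongside the CPU TLB flush, with the
	 * virtual address range the batch covered. No sleeping required. */
	struct mmu_range_ops {
		void (*invalidate_range)(struct mm_struct *mm,
					 unsigned long start,
					 unsigned long end);
	};

	/* What a GRU-style driver might plug in: one external-TLB purge
	 * per batch, rather than one per pte. */
	static void gru_invalidate_range(struct mm_struct *mm,
					 unsigned long start, unsigned long end)
	{
		pr_debug("purge external TLB for [%#lx, %#lx)\n", start, end);
		/* ... write the range to the flush hardware here ... */
	}

	static struct mmu_range_ops gru_range_ops = {
		.invalidate_range = gru_invalidate_range,
	};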
> > I had some never-finished patches to use the mmu_gather for pretty much > everything except single page faults, tho various subtle differences > between archs and lack of time caused me to let them gather dust and > not finish them... > > I can try to dig some of that out when I'm back from my current travel, > though it's probably worth re-doing from scratch now. > > Ben. > -- jack From okir at lst.de Wed May 14 06:16:00 2008 From: okir at lst.de (Olaf Kirch) Date: Wed, 14 May 2008 15:16:00 +0200 Subject: [ofa-general] RDS flow control In-Reply-To: <482A17FC.7070804@oracle.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805132024.12741.okir@lst.de> <482A17FC.7070804@oracle.com> Message-ID: <200805141516.01908.okir@lst.de> On Wednesday 14 May 2008 00:36:44 Richard Frank wrote: > Olaf, if / when you have this running for IB - let me know - I think we > can give it to some folks at Oracle who will be able to tell us if there is > any performance regression using TPCH.. especially if we have it in the > next week or so.. as I think we have a config to test with. I'll give it a try. It's not complete, and still somewhat brittle. As the code was never designed with transport level flow control in mind, there are a few things I need to move around that tend to sparkle and emit bits of smoke when you touch them the wrong way :) I'll let you know as soon as I have something for you to test. Olaf > > Olaf Kirch wrote: > > On Tuesday 13 May 2008 20:08:46 Steve Wise wrote: > > > >>> No, not in the long term. But let's hold off on the flow control stuff > >>> for a little - I would first like to finish my patch set and hand it > >>> out for you folks to bang on it, rather than the other way round. > >>> Okay with you guys? > >>> > >>> > >> What patch set? > >> > > > > I mentioned in a previous mail to Jon that I have some partial patches > > that implement flow control. I want to get that code out to you ASAP; > > I think that's easier than having two different approaches that need > > to be reconciled afterwards. > > > > > >> We can't run on chelsio's rnic with fmrs... > >> > > > > Yes, that is understood. > > > > Olaf > > > -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From eli at dev.mellanox.co.il Wed May 14 06:38:10 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 14 May 2008 16:38:10 +0300 Subject: [ofa-general] Re: IB/ipoib: Fix neigh destructor oops for kernels older than 2.6.21 In-Reply-To: <482AE07B.8040501@voltaire.com> References: <1210765031.15669.285.camel@mtls03> <482AE07B.8040501@voltaire.com> Message-ID: <1210772290.20499.5.camel@mtls03> On Wed, 2008-05-14 at 15:52 +0300, Or Gerlitz wrote: > Eli Cohen wrote: > > IB/ipoib: Fix neigh destructor oops > > > > For kernels 2.6.20 and older, it may happen that ipoib_neigh_cleanup() is > > called through a stale pointer after IPoIB has been unloaded, > > causing a kernel oops. This problem has been fixed for 2.6.21 with > > the following commit: ecbb416939da77c0d107409976499724baddce7b > Hi Eli, > > Before looking into the solution, I'd like to slow down a little and > understand the problem (how can ipoib_neigh_cleanup() be called after IPoIB > has been unloaded) and why the commit below solves it; from its change-log I don't see any reference to this problem: > > > commit ecbb416939da77c0d107409976499724baddce7b > > Author: Alexey Kuznetsov > > Date: Sat Mar 24 12:52:16 2007 -0700 > > I add to the thread the author of the commit.
I don't know this code well enough to give an explanation, but for kernels following this commit I don't get these failures. Perhaps someone else can comment on this. From ogerlitz at voltaire.com Wed May 14 06:40:12 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 14 May 2008 16:40:12 +0300 Subject: [ofa-general] Re: IB/ipoib: Fix neigh destructor oops for kernels older than 2.6.21 In-Reply-To: <1210772290.20499.5.camel@mtls03> References: <1210765031.15669.285.camel@mtls03> <482AE07B.8040501@voltaire.com> <1210772290.20499.5.camel@mtls03> Message-ID: <482AEBBC.10803@voltaire.com> Eli Cohen wrote: > I add to the thread the author of the commit. I don't know this code > well enough to give an explanation, but for kernels following this commit > I don't get these failures. Perhaps someone else can comment on this. > Even before understanding what Alexey's patch is doing, can you explain why the ipoib neighbour destructor callback is called after the ipoib module has been unloaded? Is it because the stack did this call on a neighbour created by another device such as loopback etc? Or. From eli at dev.mellanox.co.il Wed May 14 06:52:10 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 14 May 2008 16:52:10 +0300 Subject: [ofa-general] Re: IB/ipoib: Fix neigh destructor oops for kernels older than 2.6.21 In-Reply-To: <482AEBBC.10803@voltaire.com> References: <1210765031.15669.285.camel@mtls03> <482AE07B.8040501@voltaire.com> <1210772290.20499.5.camel@mtls03> <482AEBBC.10803@voltaire.com> Message-ID: <1210773130.20499.11.camel@mtls03> On Wed, 2008-05-14 at 16:40 +0300, Or Gerlitz wrote: > Eli Cohen wrote: > > I add to the thread the author of the commit. I don't know this code > > well enough to give an explanation, but for kernels following this commit > > I don't get these failures. Perhaps someone else can comment on this. > > > Even before understanding what Alexey's patch is doing, can you explain > why the ipoib neighbour destructor callback is called after the ipoib > module has been unloaded? Is it because the stack did this call on a > neighbour created by another device such as loopback etc? > That could be one reason for this. And it could be that the kernel does not guarantee that all the neighbour destructors of the interface get called before the interface is stopped (and the module is unloaded).
From akepner at sgi.com Wed May 14 07:05:46 2008 From: akepner at sgi.com (akepner at sgi.com) Date: Wed, 14 May 2008 07:05:46 -0700 Subject: [ofa-general] [PATCH] IPoIB: keep ib_wc[] on stack In-Reply-To: <1210749948.15669.268.camel@mtls03> References: <20080514012146.GG29302@sgi.com> <1210749948.15669.268.camel@mtls03> Message-ID: <20080514140546.GK29302@sgi.com> On Wed, May 14, 2008 at 10:25:48AM +0300, Eli Cohen wrote: > On Tue, 2008-05-13 at 18:21 -0700, akepner at sgi.com wrote: > > We're getting panics like this one on big clusters: > > > > skb_over_panic: text:ffffffff8821f32e len:160 put:100 head:ffff810372b0f000 data:ffff810372b0f01c tail:ffff810372b0f0bc end:ffff810372b0f080 dev:ib0 > > RX SKBs are large enough to contain 100 bytes... this looks like > corruption. Exactly. > Can you give more information on OS, kernel version, OFED > version. SUSE Linux Enterprise Server 10 SP1 (x86_64) - Kernel 2.6.16.46-0.12-smp OFED 1.3 GA > ..... > From NAPI_HOWTO.txt, although the file has been removed, I think the > statement is still valid: > > -Guarantee: Only one CPU at any time can call dev->poll(); this is > because only one CPU can pick the initial interrupt and hence the > initial netif_rx_schedule(dev); > Yes, you're correct. I missed the use of the __LINK_STATE_RX_SCHED bit in __netif_rx_schedule_prep()/netif_rx_complete() that serializes this. (Roland also pointed this out to me.) -- Arthur From eli at dev.mellanox.co.il Wed May 14 07:22:46 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 14 May 2008 17:22:46 +0300 Subject: [ofa-general] [PATCH] IPoIB: keep ib_wc[] on stack In-Reply-To: <20080514140546.GK29302@sgi.com> References: <20080514012146.GG29302@sgi.com> <1210749948.15669.268.camel@mtls03> <20080514140546.GK29302@sgi.com> Message-ID: <1210774966.23636.3.camel@mtls03> On Wed, 2008-05-14 at 07:05 -0700, akepner at sgi.com wrote: > On Wed, May 14, 2008 at 10:25:48AM +0300, Eli Cohen wrote: > > > On Tue, 2008-05-13 at 18:21 -0700, akepner at sgi.com wrote: > > > We're getting panics like this one on big clusters: > > > > > > skb_over_panic: text:ffffffff8821f32e len:160 put:100 head:ffff810372b0f000 data:ffff810372b0f01c tail:ffff810372b0f0bc end:ffff810372b0f080 dev:ib0 > > > > RX SKBs are large enough to contain 100 bytes... this looks like > > corruption. > > Exactly. One thing that can help discover memory corruptions and other bugs is to use a debug kernel. Is it possible that you will configure a few nodes with a debug kernel? From akepner at sgi.com Wed May 14 07:23:43 2008 From: akepner at sgi.com (akepner at sgi.com) Date: Wed, 14 May 2008 07:23:43 -0700 Subject: [ofa-general] [PATCH] IPoIB: keep ib_wc[] on stack In-Reply-To: <1210774966.23636.3.camel@mtls03> References: <20080514012146.GG29302@sgi.com> <1210749948.15669.268.camel@mtls03> <20080514140546.GK29302@sgi.com> <1210774966.23636.3.camel@mtls03> Message-ID: <20080514142343.GM29302@sgi.com> On Wed, May 14, 2008 at 05:22:46PM +0300, Eli Cohen wrote: > .... > One thing that can help discover memory corruptions and other bugs is to > use a debug kernel. Is it possible that you will configure a few nodes > with a debug kernel? > Yes, we can certainly do that. It may take some time, because this bug (like lots of others) is only seen on very large systems, and scheduling test/debug time on these systems isn't easy.
-- Arthur From weiny2 at llnl.gov Wed May 14 07:59:47 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 14 May 2008 07:59:47 -0700 Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit (Was: Re: [ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.) In-Reply-To: <1210764244.2026.728.camel@hrosenstock-ws.xsigo.com> References: <20080423133816.6c1b6315.weiny2@llnl.gov> <48109087.6030606@voltaire.com> <20080424143125.2aad1db8.weiny2@llnl.gov> <15ddcffd0804241523p19559580vc3a1293c1fe097b1@mail.gmail.com> <20080424181657.28d58a29.weiny2@llnl.gov> <20080514000247.GL21414@sashak.voltaire.com> <1210764244.2026.728.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080514075947.1e0c3b53.weiny2@llnl.gov> Yea I guess it should be. Ira On Wed, 14 May 2008 04:24:04 -0700 Hal Rosenstock wrote: > On Wed, 2008-05-14 at 00:02 +0000, Sasha Khapyorsky wrote: > > On 18:16 Thu 24 Apr , Ira Weiny wrote: > > > > > > From 2e5511d6daf9c586c39698416e4bd36e24b13e62 Mon Sep 17 00:00:00 2001 > > > From: Ira K. Weiny > > > Date: Thu, 24 Apr 2008 18:05:01 -0700 > > > Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit > > > > > > > > > Signed-off-by: Ira K. Weiny > > > > Applied. Thanks. > > Would this change also be applied to ofed_1_3 branch ? > > -- Hal > > > Sasha > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From torvalds at linux-foundation.org Wed May 14 08:18:21 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 14 May 2008 08:18:21 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080514112625.GY9878@sgi.com> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> Message-ID: On Wed, 14 May 2008, Robin Holt wrote: > > Are you suggesting the sending side would not need to sleep or the > receiving side? One thing to realize is that most of the time (read: pretty much *always*) when we have the problem of wanting to sleep inside a spinlock, the solution is actually to just move the sleeping to outside the lock, and then have something else that serializes things. That way, the core code (protected by the spinlock, and in all the hot paths) doesn't sleep, but the special case code (that wants to sleep) can have some other model of serialization that allows sleeping, and that includes as a small part the spinlocked region. I do not know how XPMEM actually works, or how you use it, but it seriously sounds like that is how things *should* work. And yes, that probably means that the mmu-notifiers as they are now are simply not workable: they'd need to be moved up so that they are inside the mmap semaphore but not the spinlocks. Can it be done? I don't know. But I do know that I'm unlikely to accept a noticeable slowdown in some very core code for a case that affects about 0.00001% of the population. In other words, I think you *have* to do it. 
Linus From kuznet at ms2.inr.ac.ru Wed May 14 08:30:00 2008 From: kuznet at ms2.inr.ac.ru (Alexey Kuznetsov) Date: Wed, 14 May 2008 19:30:00 +0400 Subject: [ofa-general] Re: IB/ipoib: Fix neigh destructor oops for kernels older than 2.6.21 In-Reply-To: <482AEBBC.10803@voltaire.com> References: <1210765031.15669.285.camel@mtls03> <482AE07B.8040501@voltaire.com> <1210772290.20499.5.camel@mtls03> <482AEBBC.10803@voltaire.com> Message-ID: <20080514153000.GA23220@ms2.inr.ac.ru> Hello! On Wed, May 14, 2008 at 04:40:12PM +0300, Or Gerlitz wrote: > Eli Cohen wrote: > >I add to the thread the author of the commit. I don't know this code > >well enough to give an explanation, but for kernels following this commit > >I don't get these failures. Perhaps someone else can comment on this. > > > Even before understanding what Alexey's patch is doing, can you explain > why the ipoib neighbour destructor callback is called after the ipoib > module has been unloaded? Is it because the stack did this call on a > neighbour created by another device such as loopback etc? Look at the thread "Subject: dst_ifdown breaks infiniband?" in netdev or lkml. In short, the problem is the following: * To unload a netdevice we must release all the references. * In particular, we force release of the neighbour references, redirecting stale neighbour entries to the loopback device. * However, the neighbour destructor still points to the ipoib device, and it is called after the device is unloaded (since we dropped the reference, unloading is possible). The observation was that the destructor is the only harmful thing, and that it is actually not used by anyone but ipoib. The patch gets rid of the destructor and introduces a cleanup callback, which is made once for each neighbour entry before invalidation (in particular, right before the device is unregistered) and is supposed to move the neighbour entry to a state where no calls into the device code are needed. Alexey From sashak at voltaire.com Wed May 14 08:47:46 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 May 2008 18:47:46 +0300 Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit (Was: Re: [ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.) In-Reply-To: <20080514075947.1e0c3b53.weiny2@llnl.gov> References: <20080423133816.6c1b6315.weiny2@llnl.gov> <48109087.6030606@voltaire.com> <20080424143125.2aad1db8.weiny2@llnl.gov> <15ddcffd0804241523p19559580vc3a1293c1fe097b1@mail.gmail.com> <20080424181657.28d58a29.weiny2@llnl.gov> <20080514000247.GL21414@sashak.voltaire.com> <1210764244.2026.728.camel@hrosenstock-ws.xsigo.com> <20080514075947.1e0c3b53.weiny2@llnl.gov> Message-ID: <20080514154746.GF4616@sashak.voltaire.com> On 07:59 Wed 14 May , Ira Weiny wrote: > Yea I guess it should be. Ok. I applied this to the 1.3 branch too. Sasha From holt at sgi.com Wed May 14 09:22:24 2008 From: holt at sgi.com (Robin Holt) Date: Wed, 14 May 2008 11:22:24 -0500 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> Message-ID: <20080514162223.GZ9878@sgi.com> On Wed, May 14, 2008 at 08:18:21AM -0700, Linus Torvalds wrote: > > > On Wed, 14 May 2008, Robin Holt wrote: > > > > Are you suggesting the sending side would not need to sleep or the > > receiving side?
> > One thing to realize is that most of the time (read: pretty much *always*) > when we have the problem of wanting to sleep inside a spinlock, the > solution is actually to just move the sleeping to outside the lock, and > then have something else that serializes things. > > That way, the core code (protected by the spinlock, and in all the hot > paths) doesn't sleep, but the special case code (that wants to sleep) can > have some other model of serialization that allows sleeping, and that > includes as a small part the spinlocked region. > > I do not know how XPMEM actually works, or how you use it, but it > seriously sounds like that is how things *should* work. And yes, that > probably means that the mmu-notifiers as they are now are simply not > workable: they'd need to be moved up so that they are inside the mmap > semaphore but not the spinlocks. We are in the process of attempting this now. Unfortunately for SGI, Christoph is on vacation right now so we have been trying to work it internally. We are looking at two possible methods: in one, we add a callout to the tlb flush paths for both the mmu_gather and flush_tlb_page locations; in the other, we place a specific callout separate from the gather callouts in the paths we are concerned with. We will look at both more carefully before posting. In either implementation, not all call paths would require the stall to ensure data integrity. Would it be acceptable to always put a sleepable stall in even if the code path did not require the pages be unwritable prior to continuing? If we did that, I would be freed from having a pool of invalidate threads ready for XPMEM to use for that work. Maybe there is a better way, but the sleeping requirement we would have on the threads makes most options seem unworkable. Thanks, Robin From caitlin.bestler at neterion.com Wed May 14 09:24:16 2008 From: caitlin.bestler at neterion.com (Caitlin Bestler) Date: Wed, 14 May 2008 09:24:16 -0700 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <482ADC2B.5080008@voltaire.com> References: <482A0F32.2010001@opengridcomputing.com> <482ADC2B.5080008@voltaire.com> Message-ID: <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> On Wed, May 14, 2008 at 5:33 AM, Or Gerlitz wrote: > Steve Wise wrote: >> >> Maybe this should really be implemented in the ULP that wants this >> behavior. IE the ULP could register for routing/neighbour changes and tear >> down connections and re-establish them on the correct device. >> > Hi Steve, > > First, registration for neighbour changes can't serve the purpose of > aligning RDMA traffic with the IP stack, for a bunch of reasons, among them: > > - for IB, no neighbour is created at the passive side of the unicast session > > - for unicast sessions, address resolution involves ARP but the neighbour > may be deleted by the kernel since the rdma traffic does not go through the > stack > > - for multicast sessions, no neighbour is created during address resolution > > Second, the rdma-cm does well in saving the ULP from interacting with the > network stack, that is, the ULP is not aware of the routing lookup / neighbour > / net device used for address resolution. In that spirit I prefer to add the > registration for net events at the low level (rdma-cm). > > Third, thanks for bringing up the point of route changes :) > > Or.
> > Perhaps one of the most fundamental differences for RDMA services versus the traditional socket interface is that RDMA services need to be bound to a specific device. When establishing a connection (or flow) the application needs to select which device to use. Traditional socket applications do not need to do this, but rdma-cm seems to be an acceptable solution. The trickier problem is the one you raise on migrating a connection or flow when IP routing is reconfigured. To a classic socket application, each IP datagram generated is sent according to the current routing tables. A connection or flow is not sticky. An RDMA connection (or IB UD flow) is sticky. The question is how sticky should it be. If it is too sticky the application may have to wait for a time-out, or be stuck using an inferior path after the primary path is restored. These are obviously undesirable. But what you have not addressed is how this compares with the cost of forcing the application session to shift connections even when the inferior path would have been acceptable. Is it not true that the lower performance of an inferior path may be preferable to the cost of tearing down and recreating a connection (and its associated protection domains and memory regions)? Because of those costs I can only see two options: 1) Merely enable the application to know when there has been a significant change in IP routing. If the current services are inadequate for this purpose then extend those rather than do an automatic connection teardown/rebuild. 2) Reduce the cost of connection teardown/rebuild by offering an option to "pre-bind" two RDMA devices so that memory registrations will be valid on both. This probably requires device level co-operation on L-Key/STag allocation, but it would be reasonable feature to consider for the High Availability market. But making automatic connection teardown a standard feature is not the best solution. From worleys at gmail.com Wed May 14 09:26:08 2008 From: worleys at gmail.com (Chris Worley) Date: Wed, 14 May 2008 10:26:08 -0600 Subject: [ofa-general] SET_IPOIB_CM enabled by default in some builds and not in others In-Reply-To: <482AC510.3090602@dev.mellanox.co.il> References: <482AC510.3090602@dev.mellanox.co.il> Message-ID: On Wed, May 14, 2008 at 4:55 AM, Vladimir Sokolovsky wrote: > Chris Worley wrote: >> >> In two 1.3 builds I get different SET_IPOIB_CM settings in >> /etc/infiniband/openib.conf. >> >> A generic build sets it to "yes". A kitchen-sink build doesn't set it. >> >> Is there a reason (as I need it to be enabled on a system that needs >> the kitchen-sink build)? >> >> Thanks, >> >> Chris > > The default mode for IPoIB CM in OFED-1.3 (/etc/infiniband/openib.conf) is: > SET_IPOIB_CM=yes > > It was different (SET_IPOIB_CM=no) in OFED-1.2 between Thu Mar 29 16:57:22 > 2007 and Wed Apr 4 10:41:09 2007 (before OFED-1.2-rc1). > > Can you point me the OFED-1.3 build where SET_IPOIB_CM is set to "no"? Ahhh... it was probably because I added the RPMs w/o deleting 1.2.5.5 in the "kitchen sink" build. Is there any reason to NOT use connected mode? 
Thanks, Chris > > Regards, > Vladimir > From rdreier at cisco.com Wed May 14 09:33:44 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 May 2008 09:33:44 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get fixes for various low-level HW driver issues: - nes bugs with LRO module parameter - cxgb3 bug in handling flushing on connection teardown - ipath miscellaneous issues (kind of big but look like needed fixes) Pavel Emelyanov (1): IB/ipath: Make ipath_portdata work with struct pid * not pid_t Ralph Campbell (3): IB/ipath: Fix RC and UC error handling IB/ipath: Fix many locking issues when switching to error state IB/ipath: Fix RDMA read response sequence checking Roland Dreier (2): RDMA/nes: Fix up nes_lro_max_aggr module parameter IB/ipath: Change ipath_devdata.ipath_sdma_status to be unsigned long Steve Wise (1): RDMA/cxgb3: Wrap the software send queue pointer as needed on flush drivers/infiniband/hw/cxgb3/cxio_hal.c | 4 +- drivers/infiniband/hw/ipath/ipath_driver.c | 20 +- drivers/infiniband/hw/ipath/ipath_file_ops.c | 19 +- drivers/infiniband/hw/ipath/ipath_kernel.h | 10 +- drivers/infiniband/hw/ipath/ipath_qp.c | 237 ++++++++----------- drivers/infiniband/hw/ipath/ipath_rc.c | 285 +++++++++++----------- drivers/infiniband/hw/ipath/ipath_ruc.c | 329 ++++++++++++++---------- drivers/infiniband/hw/ipath/ipath_uc.c | 57 +++-- drivers/infiniband/hw/ipath/ipath_ud.c | 66 ++++-- drivers/infiniband/hw/ipath/ipath_user_sdma.h | 2 - drivers/infiniband/hw/ipath/ipath_verbs.c | 176 +++++++++----- drivers/infiniband/hw/ipath/ipath_verbs.h | 64 ++++- drivers/infiniband/hw/nes/nes.c | 4 - drivers/infiniband/hw/nes/nes.h | 1 - drivers/infiniband/hw/nes/nes_hw.c | 6 +- 15 files changed, 725 insertions(+), 555 deletions(-) From rdreier at cisco.com Wed May 14 09:36:30 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 May 2008 09:36:30 -0700 Subject: [ofa-general] RE: [PATCH 1/1] infiniband/hw/nes/: avoid unnecessary memset In-Reply-To: <5E701717F2B2ED4EA60F87C8AA57B7CC080BD82B@venom2> (Faisal Latif's message of "Tue, 13 May 2008 16:46:47 -0500") References: <20080512213601.626C91C0008F@mwinf2103.orange.fr> <5E701717F2B2ED4EA60F87C8AA57B7CC080BD82B@venom2> Message-ID: Thanks guys, applied for 2.6.27 From sean.hefty at intel.com Wed May 14 09:40:00 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 14 May 2008 09:40:00 -0700 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> References: <482A0F32.2010001@opengridcomputing.com> <482ADC2B.5080008@voltaire.com> <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> Message-ID: <000801c8b5e1$27129f20$14c9180a@amr.corp.intel.com> > 1) Merely enable the application to know when there has been > a significant change in IP routing. If the current services are > inadequate for this purpose then extend those rather than > do an automatic connection teardown/rebuild. This is my current preferred solution. I don't have an issue with the rdma_cm issuing some sort of notification event when an IP address mapping changes. I would use an event name that indicated this, rather than 'disconnect'.
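To illustrate what such a notification could look like from the ULP side, here is a minimal sketch of an rdma_cm event handler consuming it. RDMA_CM_EVENT_ADDR_CHANGE is a placeholder name for the not-yet-defined event, and the ulp_* helpers stand in for whatever policy the ULP implements; neither exists in the rdma_cm as of this thread:

#include <rdma/rdma_cm.h>

/* ULP-defined policy hooks; assumed for this sketch. */
void ulp_schedule_reconnect(void *ulp_ctx);
void ulp_handle_disconnect(void *ulp_ctx);

static int ulp_cm_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
{
	switch (event->event) {
	case RDMA_CM_EVENT_ADDR_CHANGE:	/* hypothetical new event */
		/*
		 * The IP-to-device mapping moved; the ULP alone decides
		 * whether to tear down and reconnect, or to keep using
		 * the now-inferior path.
		 */
		ulp_schedule_reconnect(id->context);
		break;
	case RDMA_CM_EVENT_DISCONNECTED:
		ulp_handle_disconnect(id->context);
		break;
	default:
		break;
	}
	return 0;
}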
If this is implemented, I'd like to minimize the overhead per rdma_cm_id required to report this event. - Sean From torvalds at linux-foundation.org Wed May 14 09:56:18 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 14 May 2008 09:56:18 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080514162223.GZ9878@sgi.com> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080514162223.GZ9878@sgi.com> Message-ID: On Wed, 14 May 2008, Robin Holt wrote: > > Would it be acceptable to always put a sleepable stall in even if the > code path did not require the pages be unwritable prior to continuing? > If we did that, I would be freed from having a pool of invalidate > threads ready for XPMEM to use for that work. Maybe there is a better > way, but the sleeping requirement we would have on the threads makes most > options seem unworkable. I'm not understanding the question. If you can do your management outside of the spinlocks, then you can obviously do whatever you want, including sleeping. It's changing the existing spinlocks to be sleepable that is not acceptable, because it's such a performance problem. Linus From ruimario at gmail.com Wed May 14 09:59:41 2008 From: ruimario at gmail.com (Rui Machado) Date: Wed, 14 May 2008 18:59:41 +0200 Subject: [ofa-general] timeout question Message-ID: <6978b4af0805140959h53a319f5s713e4084698fe077@mail.gmail.com> Hi all, when setting the timeout in a struct ibv_qp_attr, this value corresponds to the Local ACK Timeout, which according to the InfiniBand spec defines the transport timer timeout via the formula 4.096 us * 2^(Local ACK Timeout). Is this right? And is there a value for this timeout that is considered "good practice"? Also, in a client-server setup, if this timeout is set to a "big value" (like 30) when the server dies, the client will take that amount of time to realize the failure. Is this correct? Thank you for the help. Rui From clameter at sgi.com Wed May 14 10:57:08 2008 From: clameter at sgi.com (Christoph Lameter) Date: Wed, 14 May 2008 10:57:08 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> Message-ID: On Wed, 14 May 2008, Linus Torvalds wrote: > One thing to realize is that most of the time (read: pretty much *always*) > when we have the problem of wanting to sleep inside a spinlock, the > solution is actually to just move the sleeping to outside the lock, and > then have something else that serializes things. The problem is that the code in rmap.c try_to_unmap() and friends loops over reverse maps after taking a spinlock. The mm_struct is only known after the rmap has been accessed. This means *inside* the spinlock. That is why I tried to convert the locks used to scan the reverse maps into semaphores. If that is done then one can indeed do the callouts outside of atomic contexts. > Can it be done? I don't know. But I do know that I'm unlikely to accept a > noticeable slowdown in some very core code for a case that affects about > 0.00001% of the population. In other words, I think you *have* to do it.
With larger numbers of processors, semaphores make a lot of sense since the holdoff times on spinlocks will increase. If we go to sleep then the processor can do something useful instead of hogging a cacheline. A rw lock there can also increase concurrency during reclaim, especially if the anon_vma chains are long and the number of address spaces mapping a page is high. From torvalds at linux-foundation.org Wed May 14 11:27:14 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 14 May 2008 11:27:14 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> Message-ID: On Wed, 14 May 2008, Christoph Lameter wrote: > > The problem is that the code in rmap.c try_to_unmap() and friends loops > over reverse maps after taking a spinlock. The mm_struct is only known > after the rmap has been accessed. This means *inside* the spinlock. So you queue them. That's what we do with things like the dirty bit. We need to hold various spinlocks to look up pages, but then we can't actually call the filesystem with the spinlock held. Converting a spinlock to a waiting lock for things like that is simply not acceptable. You have to work with the system. Yeah, there's only a single bit worth of information on whether a page is dirty or not, so "queueing" that information is trivial (it's just the return value from "page_mkclean_file()"). Some things are harder than others, and I suspect you need some kind of "gather" structure to queue up all the vma's that can be affected. But it sounds like for the case of rmap, the approach is: - the page lock is the higher-level "sleeping lock" (which makes sense, since this is very close to an IO event, and that is what the page lock is generally used for). But hey, it could be anything else - maybe you have some other even bigger lock to allow you to handle lots of pages in one go. - with that lock held, you do the whole rmap dance (which requires spinlocks) and gather up the vma's and the struct mm's involved. - outside the spinlocks you then do whatever it is you need to do. This doesn't sound all that different from TLB shoot-down in SMP, and the "mmu_gather" structure. Now, admittedly we can do the TLB shoot-down while holding the spinlocks, but if we couldn't that's how we'd still do it: it would get more involved (because we'd need to guarantee that the gather can hold *all* the pages - right now we can just flush in the middle if we need to), but it wouldn't be all that fundamentally different. And no, I really haven't even wanted to look at what XPMEM really needs to do, so maybe the above thing doesn't work for you, and you have other issues. I'm just pointing you in a general direction, not trying to say "this is exactly how to get there". Linus From swise at opengridcomputing.com Wed May 14 12:05:32 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 14 May 2008 14:05:32 -0500 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. Message-ID: <20080514190532.28544.41595.stgit@dell3.ogc.int> The following patch proposes the API and core changes needed to implement the IB BMME and iWARP equivalent memory extensions. Please review these vs the verbs specs and see what I've missed. This patch is a request for comments and hasn't even been compiled... Steve.
----- RDMA: New Memory Extensions. Support for the IB BMME and iWARP equivalent memory extensions to non-shared memory regions. This includes: - allocation of an ib_mr for use in fast register work requests - device-specific alloc/free of physical buffer lists for use in fast register work requests. This allows devices to allocate this memory as needed (like via dma_alloc_coherent). - fast register memory region work request - invalidate local memory region work request - read with invalidate local memory region work request (iWARP only) Design details: - New device capability flag added: IB_DEVICE_MEMORY_EXTENSIONS indicates device support for this feature. - New send WR opcode IB_WR_FAST_REG_MR used to issue a fast_reg request. - New send WR opcode IB_WR_INVALIDATE_MR used to invalidate a fast_reg mr. - New API function, ib_alloc_mr() used to allocate fast_reg memory regions. - New API function, ib_alloc_fast_reg_page_list to allocate device-specific page lists. - New API function, ib_free_fast_reg_page_list to free said page lists. Usage Model: - MR allocated with ib_alloc_mr() - Page lists allocated via ib_alloc_fast_reg_page_list(). - MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) - MR deallocated with ib_dereg_mr() - page lists deallocated via ib_free_fast_reg_page_list(). Applications can allocate a fast_reg mr once, and then can repeatedly bind the mr to different physical memory SGLs via posting work requests to the send queue. For each outstanding mr-to-pbl binding in the SQ pipe, a fast_reg_page_list needs to be allocated. Thus pipelining can be achieved while still allowing device-specific page_list processing. --- drivers/infiniband/core/verbs.c | 46 +++++++++++++++++++++++++++++++++ include/rdma/ib_verbs.h | 55 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 101 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 0504208..869be7d 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -755,6 +755,52 @@ int ib_dereg_mr(struct ib_mr *mr) } EXPORT_SYMBOL(ib_dereg_mr); +struct ib_mr *ib_alloc_mr(struct ib_pd *pd, int pbl_depth, int remote_access) +{ + struct ib_mr *mr; + + if (!pd->device->alloc_mr) + return ERR_PTR(-ENOSYS); + + mr = pd->device->alloc_mr(pd, pbl_depth, remote_access); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + mr->uobject = NULL; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_alloc_mr); + +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( + struct ib_device *device, int page_list_len) +{ + struct ib_fast_reg_page_list *page_list; + + if (!device->alloc_fast_reg_page_list) + return ERR_PTR(-ENOSYS); + + page_list = device->alloc_fast_reg_page_list(device, page_list_len); + + if (!IS_ERR(page_list)) { + page_list->device = device; + page_list->page_list_len = page_list_len; + } + + return page_list; +} +EXPORT_SYMBOL(ib_alloc_fast_reg_page_list); + +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list) +{ + page_list->device->free_fast_reg_page_list(page_list); +} +EXPORT_SYMBOL(ib_free_fast_reg_page_list); + /* Memory windows */ struct ib_mw *ib_alloc_mw(struct ib_pd *pd) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..d6d9514 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -106,6 +106,7 @@ enum ib_device_cap_flags {
IB_DEVICE_UD_IP_CSUM = (1<<18), IB_DEVICE_UD_TSO = (1<<19), IB_DEVICE_SEND_W_INV = (1<<21), + IB_DEVICE_MM_EXTENSIONS = (1<<22), }; enum ib_atomic_cap { @@ -414,6 +415,8 @@ enum ib_wc_opcode { IB_WC_FETCH_ADD, IB_WC_BIND_MW, IB_WC_LSO, + IB_WC_FAST_REG_MR, + IB_WC_INVALIDATE_MR, /* * Set value of IB_WC_RECV so consumers can test if a completion is a * receive by testing (opcode & IB_WC_RECV). @@ -628,6 +631,9 @@ enum ib_wr_opcode { IB_WR_ATOMIC_FETCH_AND_ADD, IB_WR_LSO, IB_WR_SEND_WITH_INV, + IB_WR_FAST_REG_MR, + IB_WR_INVALIDATE_MR, + IB_WR_READ_WITH_INV, }; enum ib_send_flags { @@ -676,6 +682,17 @@ struct ib_send_wr { u16 pkey_index; /* valid for GSI only */ u8 port_num; /* valid for DR SMPs on switch only */ } ud; + struct { + u64 iova_start; + struct ib_fast_reg_page_list *page_list; + int fbo; + u32 length; + int access_flags; + struct ib_mr *mr; + } fast_reg; + struct { + struct ib_mr *mr; + } local_inv; } wr; }; @@ -1014,6 +1031,11 @@ struct ib_device { int (*query_mr)(struct ib_mr *mr, struct ib_mr_attr *mr_attr); int (*dereg_mr)(struct ib_mr *mr); + struct ib_mr * (*alloc_mr)(struct ib_pd *pd, + int pbl_depth, + int remote_access); + struct ib_fast_reg_page_list * (*alloc_fast_reg_page_list)(struct ib_device *device, int page_list_len); + void (*free_fast_reg_page_list)(struct ib_fast_reg_page_list *page_list); int (*rereg_phys_mr)(struct ib_mr *mr, int mr_rereg_mask, struct ib_pd *pd, @@ -1808,6 +1830,39 @@ int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); int ib_dereg_mr(struct ib_mr *mr); /** + * ib_alloc_mr - Allocates memory region usable with the + * IB_WR_FAST_REG_MR send work request. + * @pd: The protection domain associated with the region. + * @pbl_depth: requested max physical buffer list size to be allocated. + * @remote_access: set to 1 if remote rdma operations are allowed. + */ +struct ib_mr *ib_alloc_mr(struct ib_pd *pd, int pbl_depth, + int remote_access); + +struct ib_fast_reg_page_list { + struct ib_device *device; + u64 *page_list; + int page_list_len; +}; + +/** + * ib_alloc_fast_reg_page_list - Allocates a page list array to be used + * in a IB_WR_FAST_REG_MR work request. The resources allocated by this method + * allows for dev-specific optimization of the FAST_REG operation. + * @device - ib device pointer. + * @page_list_len - depth of the page list array to be allocated. + */ +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( + struct ib_device *device, int page_list_len); + +/** + * ib_free_fast_reg_page_list - Deallocates a previously allocated + * page list array. + * @page_list - struct ib_fast_reg_page_list pointer to be deallocated. + */ +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list); + +/** * ib_alloc_mw - Allocates a memory window. * @pd: The protection domain associated with the memory window. */ From dotanb at dev.mellanox.co.il Wed May 14 13:05:24 2008 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 14 May 2008 22:05:24 +0200 Subject: [ofa-general] ibv_get_cq_event blocking forever after successfulibv_post_send... In-Reply-To: References: <20070525212214.20500.qmail@station183.com> <532b813a0805121632p2ebccd11h80496f9b9c09e0c8@mail.gmail.com> <000301c8b48a$2200e670$465a180a@amr.corp.intel.com> <532b813a0805121703p76df78a8g51256d5bcdb7c330@mail.gmail.com> Message-ID: <482B4604.4010909@dev.mellanox.co.il> Roland Dreier wrote: > > This is a cut-paste error. I just extracted the relevant code from the > > actual piece. 
> > Unless you send an actual test app that someone could really compile and run, it's very hard to help debug it. Basically your only chance is if you have a really obvious bug that someone could see by reading your code. > There is a minor bug in the code: wc.opcode is valid ONLY if wc.status == IBV_WC_SUCCESS (otherwise, its value is undefined). But anyway, this shouldn't prevent you from getting the completion event. Do you call ibv_req_notify_cq BEFORE any completion is created (for example, right after the CQ is created) to request to get a completion event on the first completion? Dotan From swise at opengridcomputing.com Wed May 14 12:11:27 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 14 May 2008 14:11:27 -0500 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <000801c8b5e1$27129f20$14c9180a@amr.corp.intel.com> References: <482A0F32.2010001@opengridcomputing.com> <482ADC2B.5080008@voltaire.com> <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> <000801c8b5e1$27129f20$14c9180a@amr.corp.intel.com> Message-ID: <482B395F.8020201@opengridcomputing.com> An HTML attachment was scrubbed... URL: From dotanb at dev.mellanox.co.il Wed May 14 13:11:21 2008 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 14 May 2008 22:11:21 +0200 Subject: [ofa-general] timeout question In-Reply-To: <6978b4af0805140959h53a319f5s713e4084698fe077@mail.gmail.com> References: <6978b4af0805140959h53a319f5s713e4084698fe077@mail.gmail.com> Message-ID: <482B4769.1010307@dev.mellanox.co.il> Hi. Rui Machado wrote: > Hi all, > > when setting the timeout in a struct ibv_qp_attr, this value > corresponds to the Local ACK Timeout, which according to the InfiniBand > spec defines the transport timer timeout via the formula > 4.096 us * 2^(Local ACK Timeout). Is this right? > And is there a value for this timeout that is considered "good practice"? > This value depends on your fabric size, on the HCA you have, and some other factors. For example, a timeout value of 14 gives 4.096 us * 2^14, roughly 67 ms per retry. > Also, in a client-server setup, if this timeout is set to a "big > value" (like 30) when the server dies, the client will take that > amount of time to realize the failure. Is this correct? > Yes, after (at least) the calculated timeout multiplied by retry_count, the sender QP will get a retry-exceeded error (if a send request was posted without any response from the receiver). > Thank you for the help. > Dotan From rdreier at cisco.com Wed May 14 12:31:05 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 May 2008 12:31:05 -0700 Subject: [ofa-general] [PATCH] IB/core: handle race between elements in work queues after event In-Reply-To: <482A979F.6040305@Voltaire.COM> (Moni Shoua's message of "Wed, 14 May 2008 10:41:19 +0300") References: <48187E5A.7040809@Voltaire.COM> <4819BD29.7080002@Voltaire.COM> <4820638E.4030901@Voltaire.COM> <4827FBDF.9040308@Voltaire.COM> <482A979F.6040305@Voltaire.COM> Message-ID: > OK. Here is an example that was viewed in our tests. > One IPoIB host (client) sends a stream of multicast packets to another IPoIB host (server). > SM takeover event takes place during traffic and as a result multicast info is flushed > and the hosts need to rejoin. Without the patch there is a chance (which according to our experience > is a very big chance) that the request to rejoin will go to the old SM, and the join completes successfully > only after a retry. This takes too long and the patch solves it.
OK, that is fairly convincing (and it would have been nice to include when sending the original patch). Please resend a version that fixes the races in the patch and we can probably add this for 2.6.27. - R. From swise at opengridcomputing.com Wed May 14 12:39:36 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 14 May 2008 14:39:36 -0500 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <482B395F.8020201@opengridcomputing.com> References: <482A0F32.2010001@opengridcomputing.com> <482ADC2B.5080008@voltaire.com> <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> <000801c8b5e1$27129f20$14c9180a@amr.corp.intel.com> <482B395F.8020201@opengridcomputing.com> Message-ID: <482B3FF8.4040102@opengridcomputing.com> Steve Wise wrote: > Sean Hefty wrote: >>> 1) Merely enable the application to know when there has been >>> a significant change in IP routing. If the current services are >>> inadequate for this purpose then extend those rather than >>> do an automatic connection teardown/rebuild. >>> >> >> This is my current preferred solution. >> >> > I agree. >> I don't have an issue with the rdma_cm issuing some sort of notification event >> when an IP address mapping changes. I would use an event name that indicated >> this, rather than 'disconnect'. >> >> If this is implemented, I'd like to minimize the overhead per rdma_cm_id >> required to report this event. >> >> > > Maybe instead of making this a cm_id event, we should add a concept of > rdma async events that aren't necessarily affiliated with any > particular cm_id? IE a new channel for these types. Then you can > post it once when a route changes that affects rdma devices, for > example... > As opposed to posting the event to every cm_id affected... From dotanba at gmail.com Wed May 14 13:45:20 2008 From: dotanba at gmail.com (Dotan Barak) Date: Wed, 14 May 2008 22:45:20 +0200 Subject: [ofa-general] Moving on Message-ID: <482B4F60.5060300@gmail.com> Hi, After more than seven years of having the fun of being part of the development and productizing of InfiniBand products, and especially their SW, I have decided to move on professionally. I'm proud that I had a part in the OpenFabrics project and I had the opportunity to learn from the best !!! Even though I won't be an employee of Mellanox anymore, I will still try to be involved in the SW development in OpenFabrics and give remarks/patches/support like I did before. I will still be available for questions/replies if needed at: dotanba at gmail.com thanks Dotan From swise at opengridcomputing.com Wed May 14 13:34:30 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 14 May 2008 15:34:30 -0500 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: <000b01c8af9f$0a3b79f0$bafc070a@amr.corp.intel.com> References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <000b01c8af9f$0a3b79f0$bafc070a@amr.corp.intel.com> Message-ID: <482B4CD6.2090709@opengridcomputing.com> Sean Hefty wrote: > Thanks for looking at this. > >> Here is the top level API change I'm proposing for enabling interoperable >> peer2peer mode for iwarp. I want to get agreement on how to expose >> this to the application before posting more of the gritty details of >> the kernel driver changes needed. The plan is to include this support >> in linux-2.6.27 + ofed-1.4.
> > I don't have a better idea what to call this, but when I think of peer to peer, > I think of that as the connection model, not a channel usage restriction. > > I think I'll call it rtr_mode. That better describes it. The client side is sending a "ready to receive" message. And the server side holds off SQ processing until the RTR is received... From Arkady.Kanevsky at netapp.com Wed May 14 13:36:28 2008 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Wed, 14 May 2008 16:36:28 -0400 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: <482B4CD6.2090709@opengridcomputing.com> References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <000b01c8af9f$0a3b79f0$bafc070a@amr.corp.intel.com> <482B4CD6.2090709@opengridcomputing.com> Message-ID: But the difference is who generates the RTR message. It is not the user's job to deal with it. Thanks, Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Steve Wise [mailto:swise at opengridcomputing.com] > Sent: Wednesday, May 14, 2008 4:35 PM > To: Sean Hefty > Cc: rdreier at cisco.com; ewg at lists.openfabrics.org; > general at lists.openfabrics.org > Subject: [ofa-general] Re: [PATCH] Request For Comments: > > Sean Hefty wrote: > > Thanks for looking at this. > > > > > >> Here is the top level API change I'm proposing for enabling > >> interoperable peer2peer mode for iwarp. I want to get agreement on > >> how to expose this to the application before posting more of the > >> gritty details of the kernel driver changes needed. The plan is to > >> include this support in linux-2.6.27 + ofed-1.4. > >> > > > > I don't have a better idea what to call this, but when I think of peer > > to peer, I think of that as the connection model, not a channel usage restriction. > > > > > > I think I'll call it rtr_mode. That better describes it. > The client side is sending a "ready to receive" message. And > the server side holds off SQ processing until the RTR is received... > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From sean.hefty at intel.com Wed May 14 13:46:38 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 14 May 2008 13:46:38 -0700 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <482B3FF8.4040102@opengridcomputing.com> References: <482A0F32.2010001@opengridcomputing.com> <482ADC2B.5080008@voltaire.com> <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> <000801c8b5e1$27129f20$14c9180a@amr.corp.intel.com> <482B395F.8020201@opengridcomputing.com> <482B3FF8.4040102@opengridcomputing.com> Message-ID: <000001c8b603$9bb70880$8e248686@amr.corp.intel.com> >> Maybe instead of making this a cm_id event, we should add a concept of >> rdma async events that aren't necessarily affiliated with any >> particular cm_id? IE a new channel for these types. Then you can >> post it once when a route changes that affects rdma devices, for >> example... >> >As opposed to posting the event to every cm_id affected... I thought about this, and I agree that it's worth exploring. The locking to support device removal ended up being fairly complex.
(I'm not sure it would have been any easier for ULPs to do this though.) The main counter I see to using a separate channel is that device removal is invoked per rdma_cm_id, so there's precedent for invoking the callback per id. My expectation is that this is a rare event. - Sean From ralph.campbell at qlogic.com Wed May 14 15:56:06 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 14 May 2008 15:56:06 -0700 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: <20080514190532.28544.41595.stgit@dell3.ogc.int> References: <20080514190532.28544.41595.stgit@dell3.ogc.int> Message-ID: <1210805766.3949.114.camel@brick.pathscale.com> Do we have any expected consumers for this interface? I would guess ib_srp, ib_iser as likely candidates. Detailed comments inline below. On Wed, 2008-05-14 at 14:05 -0500, Steve Wise wrote: > The following patch proposes the API and core changes needed to > implement the IB BMME and iWARP equivalent memory extensions. > > Please review these vs the verbs specs and see what I've missed. > > This patch is a request for comments and hasn't even been compiled... > > Steve. > > ----- > > RDMA: New Memory Extensions. > > Support for the IB BMME and iWARP equivalent memory extensions to > non-shared memory regions. This includes: > > - allocation of an ib_mr for use in fast register work requests > > - device-specific alloc/free of physical buffer lists for use in fast > register work requests. This allows devices to allocate this memory as > needed (like via dma_alloc_coherent). > > - fast register memory region work request > > - invalidate local memory region work request > > - read with invalidate local memory region work request (iWARP only) > > > Design details: > > - New device capability flag added: IB_DEVICE_MEMORY_EXTENSIONS indicates > device support for this feature. > > - New send WR opcode IB_WR_FAST_REG_MR used to issue a fast_reg request. > > - New send WR opcode IB_WR_INVALIDATE_MR used to invalidate a fast_reg mr. > > - New API function, ib_alloc_mr() used to allocate fast_reg memory > regions. > > - New API function, ib_alloc_fast_reg_page_list to allocate > device-specific page lists. > > - New API function, ib_free_fast_reg_page_list to free said page lists. > > > Usage Model: > > - MR allocated with ib_alloc_mr() > > - Page lists allocated via ib_alloc_fast_reg_page_list(). > > - MR made VALID and bound to a specific page list via > ib_post_send(IB_WR_FAST_REG_MR) Can the same ib_alloc_fast_reg_page_list() page list be bound to more than one MR? What happens if a user tries to issue an ib_post_send(IB_WR_FAST_REG_MR) to a VALID MR? How can the memory be read/written? If the MR allows remote operations, then RDMA writes could be used. An RDMA READ could be used. What about local access by the host CPU? > - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) > - MR deallocated with ib_dereg_mr() > > - page lists deallocated via ib_free_fast_reg_page_list(). > > Applications can allocate a fast_reg mr once, and then can repeatedly > bind the mr to different physical memory SGLs via posting work requests > to the send queue. For each outstanding mr-to-pbl binding in the SQ > pipe, a fast_reg_page_list needs to be allocated. Thus pipelining can > be achieved while still allowing device-specific page_list processing.
> --- > > drivers/infiniband/core/verbs.c | 46 +++++++++++++++++++++++++++++++++ > include/rdma/ib_verbs.h | 55 +++++++++++++++++++++++++++++++++++++++ > 2 files changed, 101 insertions(+), 0 deletions(-) > > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c > index 0504208..869be7d 100644 > --- a/drivers/infiniband/core/verbs.c > +++ b/drivers/infiniband/core/verbs.c > @@ -755,6 +755,52 @@ int ib_dereg_mr(struct ib_mr *mr) > } > EXPORT_SYMBOL(ib_dereg_mr); What does pbl_depth actually control? Is it the maximum size page list that can be used in a ib_post_send(IB_WR_FAST_REG_MR) work request? pbl_depth should be unsigned since I don't think negative values make sense. > +struct ib_mr *ib_alloc_mr(struct ib_pd *pd, int pbl_depth, int remote_access) > +{ > + struct ib_mr *mr; > + > + if (!pd->device->alloc_mr) > + return ERR_PTR(-ENOSYS); > + > + mr = pd->device->alloc_mr(pd, pbl_depth, remote_access); > + > + if (!IS_ERR(mr)) { > + mr->device = pd->device; > + mr->pd = pd; > + mr->uobject = NULL; > + atomic_inc(&pd->usecnt); > + atomic_set(&mr->usecnt, 0); > + } > + > + return mr; > +} > +EXPORT_SYMBOL(ib_alloc_mr); > + > +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( > + struct ib_device *device, int page_list_len) > +{ > + struct ib_fast_reg_page_list *page_list > + > + if (!device->alloc_fast_reg_page_list) > + return ERR_PTR(-ENOSYS); > + > + page_list = device->alloc_fast_reg_page_list(device, page_list_len); > + > + if (!IS_ERR(page_list)) { > + page_list->device = device; > + page_list->page_list_len = page_list_len; > + } > + > + return page_list; > +} > +EXPORT_SYMBOL(ib_alloc_fast_reg_page_list); > + > +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list) > +{ > + page_list->device->dealloc_fast_reg_page_list(page_list); > +} > +EXPORT_SYMBOL(ib_free_fast_reg_page_list); > + > /* Memory windows */ > > struct ib_mw *ib_alloc_mw(struct ib_pd *pd) > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index 911a661..d6d9514 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -106,6 +106,7 @@ enum ib_device_cap_flags { > IB_DEVICE_UD_IP_CSUM = (1<<18), > IB_DEVICE_UD_TSO = (1<<19), > IB_DEVICE_SEND_W_INV = (1<<21), > + IB_DEVICE_MM_EXTENSIONS = (1<<22), > }; > > enum ib_atomic_cap { > @@ -414,6 +415,8 @@ enum ib_wc_opcode { > IB_WC_FETCH_ADD, > IB_WC_BIND_MW, > IB_WC_LSO, > + IB_WC_FAST_REG_MR, > + IB_WC_INVALIDATE_MR, > /* > * Set value of IB_WC_RECV so consumers can test if a completion is a > * receive by testing (opcode & IB_WC_RECV). > @@ -628,6 +631,9 @@ enum ib_wr_opcode { > IB_WR_ATOMIC_FETCH_AND_ADD, > IB_WR_LSO, > IB_WR_SEND_WITH_INV, > + IB_WR_FAST_REG_MR, > + IB_WR_INVALIDATE_MR, > + IB_WR_READ_WITH_INV, > }; > > enum ib_send_flags { > @@ -676,6 +682,17 @@ struct ib_send_wr { > u16 pkey_index; /* valid for GSI only */ > u8 port_num; /* valid for DR SMPs on switch only */ > } ud; > + struct { > + u64 iova_start; > + struct ib_fast_reg_page_list *page_list; > + int fbo; What is fbo? First byte offset? I assume fbo can't be negative so it should be "unsigned" > + u32 length; So I'm guessing the fbo and length select a subset from page_list for initializing the mr. Otherwise, the ib_fast_reg_page_list has the info. 
> + int access_flags; > + struct ib_mr *mr; > + } fast_reg; > + struct { > + struct ib_mr *mr; > + } local_inv; > } wr; > }; > > @@ -1014,6 +1031,11 @@ struct ib_device { > int (*query_mr)(struct ib_mr *mr, > struct ib_mr_attr *mr_attr); > int (*dereg_mr)(struct ib_mr *mr); > + struct ib_mr * (*alloc_mr)(struct ib_pd *pd, > + int pbl_depth, > + int remote_access); > + struct ib_fast_reg_page_list * (*alloc_fast_reg_page_list)(struct ib_device *device, int page_list_len); > + void (*free_fast_reg_page_list)(struct ib_fast_reg_page_list *page_list); > int (*rereg_phys_mr)(struct ib_mr *mr, > int mr_rereg_mask, > struct ib_pd *pd, > @@ -1808,6 +1830,39 @@ int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); > int ib_dereg_mr(struct ib_mr *mr); We should define what error return values are possible and what they mean. Obviously ENOSYS is being used as the call is not supported by the device. ENOMEM is obvious. But what about EPERM, EINVAL, etc. > /** > + * ib_alloc_mr - Allocates memory region usable with the > + * IB_WR_FAST_REG_MR send work request. > + * @pd: The protection domain associated with the region. > + * @pbl_depth: requested max physical buffer list size to be allocated. > + * @remote_access: set to 1 if remote rdma operations are allowed. > + */ > +struct ib_mr *ib_alloc_mr(struct ib_pd *pd, int pbl_depth, > + int remote_access); > + > +struct ib_fast_reg_page_list { > + struct ib_device *device; > + u64 *page_list; > + int page_list_len; > +}; Is the page size always assumed to be PAGE_SIZE? What about large pages? The interface definition should say whether the page_list values are meaningful to the verbs caller. Can this list be used only for ib_post_send(IB_WR_FAST_REG_MR) or also by ib_map_phys_fmr() for example. > +/** > + * ib_alloc_fast_reg_page_list - Allocates a page list array to be used > + * in a IB_WR_FAST_REG_MR work request. The resources allocated by this method > + * allows for dev-specific optimization of the FAST_REG operation. > + * @device - ib device pointer. > + * @page_list_len - depth of the page list array to be allocated. > + */ > +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( > + struct ib_device *device, int page_list_len); > + > +/** > + * ib_free_fast_reg_page_list - Deallocates a previously allocated > + * page list array. > + * @page_list - struct ib_fast_reg_page_list pointer to be deallocated. > + */ > +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list); > + > +/** > * ib_alloc_mw - Allocates a memory window. > * @pd: The protection domain associated with the memory window. > */ > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Wed May 14 16:49:57 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 May 2008 16:49:57 -0700 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: <1210805766.3949.114.camel@brick.pathscale.com> (Ralph Campbell's message of "Wed, 14 May 2008 15:56:06 -0700") References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> Message-ID: > Can the same ib_alloc_fast_reg_page_list() page list be > bound to more than one MR? Yes, but as the IB spec describes, the page list belongs to the low-level driver until the fast-reg operation has completed. 
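A short sketch of that ownership rule, again with the RFC's names (the
polling loop is illustrative only; a real consumer would reap the
completion from its normal CQ handling): the page list must not be
refilled until the fast-reg work request completes.

	struct ib_wc wc;

	frwr.send_flags = IB_SEND_SIGNALED;
	ib_post_send(qp, &frwr, &bad);

	while (ib_poll_cq(cq, 1, &wc) == 0)
		cpu_relax();		/* page list still owned by the driver */

	if (wc.status == IB_WC_SUCCESS && wc.opcode == IB_WC_FAST_REG_MR)
		pl->page_list[0] = next_dma_addr;  /* consumer owns it again */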
> What happens if a user tries to issue an
> ib_post_send(IB_WR_FAST_REG_MR) to a VALID MR?

The operation completes with an error status.

> How can the memory be read/written?

what memory?

> > +struct ib_mr *ib_alloc_mr(struct ib_pd *pd, int pbl_depth, int remote_access)

> What does pbl_depth actually control?

pbl_depth is actually a terrible name.  I would suggest calling the
parameter something like max_page_list_len.

I wonder if we really need the remote access flag.  I know the iWARP and
IB verbs both call this out, but is there really a case where specifying
the exact permissions when doing the fast register is insufficient?

also I wonder if it's clearer if we call this verb
ib_alloc_fast_reg_mr().

> What is fbo? First byte offset?

yes... too many abbreviations in this API, better to make things
self-documenting at the cost of a bit more typing.

> So I'm guessing the fbo and length select a subset from page_list for
> initializing the mr.  Otherwise, the ib_fast_reg_page_list has the
> info.

If you pass in one page, you might want the MR to start after the
beginning of the page, and end before the end of the page.

> We should define what error return values are possible
> and what they mean.  Obviously ENOSYS is being used as
> the call is not supported by the device.  ENOMEM is
> obvious.  But what about EPERM, EINVAL, etc.

This is a big project, given we haven't done this for any other functions.

> Is the page size always assumed to be PAGE_SIZE?

I think we want a page_size member here for sure.

> The interface definition should say whether the page_list
> values are meaningful to the verbs caller.

not sure what you mean... the values are initialized by the verbs
consumer so they better mean something.

> Can this
> list be used only for ib_post_send(IB_WR_FAST_REG_MR)
> or also by ib_map_phys_fmr() for example.

It's just for posting sends, because it gives us a way to let low-level
drivers enforce requirements they have for the page_list passed into the
fast register via send queue operation -- e.g. it may need to be DMA-able
memory (since the adapter fetches it as part of executing the WQE),
there may be alignment restrictions, etc.

I think we should consider the fmr interface as legacy and try to phase
out using it over the long term.

- R.

From rdreier at cisco.com  Wed May 14 16:54:59 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 14 May 2008 16:54:59 -0700
Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions.
In-Reply-To: <20080514190532.28544.41595.stgit@dell3.ogc.int> (Steve Wise's
	message of "Wed, 14 May 2008 14:05:32 -0500")
References: <20080514190532.28544.41595.stgit@dell3.ogc.int>
Message-ID: 

A few quick comments (more later):

> - New device capability flag added: IB_DEVICE_MEMORY_EXTENSIONS indicates
> device support for this feature.

We still have time before 2.6.26 comes out.  Rather than moving
IB_DEVICE_SEND_W_INV to a new bit number, I think it might be better to
just remove IB_DEVICE_SEND_W_INV and make IB_DEVICE_MEMORY_EXTENSIONS
(maybe "MM_EXTENSIONS" or "MEMORY_MANAGEMENT_EXTENSIONS" is better?)
imply real send-with-invalidate support... so 2.6.26 won't have
send-with-invalidate and 2.6.27 will have all of the IB base MM exts
(and iWARP equivs) in one capability bit.

Any thoughts either way?  I'll post a trial balloon patch tomorrow.

Second question -- IB BMME and iWARP talk about a key portion (least
significant byte) of STag/L_Key/R_Key as being under consumer control.
Do we want to expose that as part of this API?
Basically it means we need to add a way for the consumer to pass in a
new L_Key/STag as part of a lot of calls.

- R.

From rdreier at cisco.com  Wed May 14 17:02:10 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 14 May 2008 17:02:10 -0700
Subject: [ofa-general] Re: the so many IPoIB-UD failures introduced by OFED 1.3
In-Reply-To: <482A8574.8070201@voltaire.com> (Or Gerlitz's message of "Wed,
	14 May 2008 09:23:48 +0300")
References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com>
	<20080510190721.GI5298@sgi.com> <482A8574.8070201@voltaire.com>
Message-ID: 

> Maybe its about time for the Linux IB maintainers to get a little angry?!

I'm not angry about it, although I have pretty much given up on trying
to debug IPoIB issues seen running anything other than an upstream
kernel.

It seems like the OFED maintainers, the enterprise distros and their
customers should be more concerned about the failure of the OFED
process -- clearly producing something much buggier and less reliable
than the stock kernel is not what anyone wants.

- R.

From charr at fusionio.com  Wed May 14 17:12:20 2008
From: charr at fusionio.com (Cameron Harr)
Date: Wed, 14 May 2008 18:12:20 -0600
Subject: [ofa-general] iSer and Direct IO
Message-ID: <482B7FE4.9070502@fusionio.com>

Hi, I've been trying to compare performance between iSer and srpt and
am getting mixed results where iSer wins for IOPs and srpt wins for some
streaming b/w tests.  I've tested with iozone, spew and FIO, and IOP
numbers are always higher on iSer.  My problem though is that I'm a
little suspicious of some of the iSer numbers and whether they are
really using Direct IO.  For example, you'll see below in some of my FIO
results that I'm getting a write B/W of 799.1 MB/s at one point.  That's
way above what I can get natively on the device (~650 MB/s DIO) and is
more along the lines of buffered IO.  If the IOP numbers are also using
some kind of caching, that could possibly taint them also.  Does anyone
know if specifying DIO will really bypass all buffers or if something is
getting cached in the agents (iscsi, tgtd)?

FIO
              iSer 1     iSer 2     SRPT 1     SRPT 2
RBW (MB/s)     565.3      836.5      622.0      581.7
Read IOPs    63488.1    68053.8     5335.6     5446.1
WBW (MB/s)     799.1      737.7      589.5      594.4
Write IOPs   79086.6    80005.7    33884.6    34058.6

Thanks much,
Cameron

From swise at opengridcomputing.com  Wed May 14 17:56:13 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 14 May 2008 19:56:13 -0500
Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions.
In-Reply-To: 
References: <20080514190532.28544.41595.stgit@dell3.ogc.int>
Message-ID: <482B8A2D.7030505@opengridcomputing.com>

Roland Dreier wrote:
> A few quick comments (more later):
>
>  > - New device capability flag added: IB_DEVICE_MEMORY_EXTENSIONS indicates
>  > device support for this feature.
>
> We still have time before 2.6.26 comes out.  Rather than moving
> IB_DEVICE_SEND_W_INV to a new bit number, I think it might be better to
> just remove IB_DEVICE_SEND_W_INV and make IB_DEVICE_MEMORY_EXTENSIONS
> (maybe "MM_EXTENSIONS" or "MEMORY_MANAGEMENT_EXTENSIONS" is better?)
> imply real send-with-invalidate support... so 2.6.26 won't have
> send-with-invalidate and 2.6.27 will have all of the IB base MM exts
> (and iWARP equivs) in one capability bit.
>
> Any thoughts either way?  I'll post a trial balloon patch tomorrow.
>

Sounds fine to me.  As you've seen, I don't like to type, so
MM_EXTENSIONS seems better. :)
> Second question -- IB BMME and iWARP talk about a key portion (least
> significant byte) of STag/L_Key/R_Key as being under consumer control.
> Do we want to expose that as part of this API?  Basically it means we
> need to add a way for the consumer to pass in a new L_Key/STag as part
> of a lot of calls.

I left it out from this first pass because we don't expose any of that
in the existing RDMA API.  Currently the iWARP providers make up their
own keys.  E.g., ib_reg_phys_mr() should also allow passing in the key,
at least according to the iWARP verbs.  But I don't really see the need...

Steve.

From swise at opengridcomputing.com  Wed May 14 18:05:30 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 14 May 2008 20:05:30 -0500
Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions.
In-Reply-To: 
References: <20080514190532.28544.41595.stgit@dell3.ogc.int>
	<1210805766.3949.114.camel@brick.pathscale.com>
Message-ID: <482B8C5A.5020904@opengridcomputing.com>

Roland Dreier wrote:
>  > Can the same ib_alloc_fast_reg_page_list() page list be
>  > bound to more than one MR?
>
> Yes, but as the IB spec describes, the page list belongs to the
> low-level driver until the fast-reg operation has completed.
>
>  > What happens if a user tries to issue an
>  > ib_post_send(IB_WR_FAST_REG_MR) to a VALID MR?
>
> The operation completes with an error status.
>
>  > How can the memory be read/written?
>
> what memory?
>
>  > > +struct ib_mr *ib_alloc_mr(struct ib_pd *pd, int pbl_depth, int remote_access)
>
>  > What does pbl_depth actually control?
>
> pbl_depth is actually a terrible name.  I would suggest calling the
> parameter something like max_page_list_len.
>

Terrible? :(

max_page_list_len is ok.

> I wonder if we really need the remote access flag.  I know the iWARP and
> IB verbs both call this out, but is there really a case where specifying
> the exact permissions when doing the fast register is insufficient?
>

I agree.  I don't know why they specify this.  Let's remove it.

> also I wonder if it's clearer if we call this verb
> ib_alloc_fast_reg_mr().

Ok.

>  > What is fbo? First byte offset?
>
> yes... too many abbreviations in this API, better to make things
> self-documenting at the cost of a bit more typing.
>

ooh_kay :)

>  > So I'm guessing the fbo and length select a subset from page_list for
>  > initializing the mr.  Otherwise, the ib_fast_reg_page_list has the
>  > info.
>
> If you pass in one page, you might want the MR to start after the
> beginning of the page, and end before the end of the page.
>
>  > We should define what error return values are possible
>  > and what they mean.  Obviously ENOSYS is being used as
>  > the call is not supported by the device.  ENOMEM is
>  > obvious.  But what about EPERM, EINVAL, etc.
>
> This is a big project, given we haven't done this for any other functions.
>
>  > Is the page size always assumed to be PAGE_SIZE?
>
> I think we want a page_size member here for sure.
>

So you want the page size specified in the fast_reg_page_list as
opposed to when the page list is bound to the fast_reg mr (via
post_send)?

>  > The interface definition should say whether the page_list
>  > values are meaningful to the verbs caller.
>
> not sure what you mean... the values are initialized by the verbs
> consumer so they better mean something.
>

The idea is that the (kernel) application will allocate the page_list
memory via ib_alloc_fast_reg_page_list(), then map the desired physical
IO memory page-by-page, filling in the page_list with the resulting
DMA addresses.
This page_list is then bound to a MR via the post_send(IB_WR_FAST_REG_MR). The rkey can then be advertised to peers for remote IO, or the lkey used for local IO. > > Can this > > list be used only for ib_post_send(IB_WR_FAST_REG_MR) > > or also by ib_map_phys_fmr() for example. > > It's just for posting sends, because it gives us a way to let low-level > drivers enforce requirements they have for the page_list passed into the > fast register via send queue operation-- eg it may need to be DMA-able > memory (since the adapter fetches it as part of executing the WQE), > there may be alignment restrictions, etc. > > I think we should consider the fmr interface as legacy and try to phase > out using it over the long term. Agreed. Steve. From swise at opengridcomputing.com Wed May 14 18:20:53 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 14 May 2008 20:20:53 -0500 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: <1210805766.3949.114.camel@brick.pathscale.com> References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> Message-ID: <482B8FF5.80309@opengridcomputing.com> Ralph Campbell wrote: > Do we have any expected consumers for this interface? > I would guess ib_srp, ib_iser as likely candidates. > NFSRDMA RDS > detailed comments inline below. > I followed up on Roland's answers to your questions, and added a few replies inline below: >> Usage Model: >> >> - MR allocated with ib_alloc_mr() >> >> - Page lists allocated via ib_alloc_fast_reg_page_list(). >> >> - MR made VALID and bound to a specific page list via >> ib_post_send(IB_WR_FAST_REG_MR) > > Can the same ib_alloc_fast_reg_page_list() page list be > bound to more than one MR? > What happens if a user tries to issue a > ib_post_send(IB_WR_FAST_REG_MR) to a VALID MR? > > How can the memory be read/written? > If the MR allows remote operations, then RDMA writes could be > used. An RDMA READ could be used. What about local access > by the host CPU? > LOCAL_WRITE can be supplied allowing the device to do local IO. > > What does pbl_depth actually control? It allows the device to pre-allocate the page_list resources in HW. > Is it the maximum size page list that can be used in a > ib_post_send(IB_WR_FAST_REG_MR) work request? Yes. > > pbl_depth should be unsigned since I don't think negative values > make sense. > Ok. >> @@ -676,6 +682,17 @@ struct ib_send_wr { >> u16 pkey_index; /* valid for GSI only */ >> u8 port_num; /* valid for DR SMPs on switch only */ >> } ud; >> + struct { >> + u64 iova_start; >> + struct ib_fast_reg_page_list *page_list; >> + int fbo; > > What is fbo? First byte offset? > I assume fbo can't be negative so it should be "unsigned" > Ok. From rdreier at cisco.com Wed May 14 19:49:07 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 May 2008 19:49:07 -0700 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: <482B8C5A.5020904@opengridcomputing.com> (Steve Wise's message of "Wed, 14 May 2008 20:05:30 -0500") References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> <482B8C5A.5020904@opengridcomputing.com> Message-ID: > So you want the page size specified in the fast_reg_page_list as > opposed to when the page list is bound to the fast_reg mr (via > post_send)? It's kind of the same thing, since the fast_reg_page_list is part of the send work request... 
the structures you have at the moment are: > + struct { > + u64 iova_start; > + struct ib_fast_reg_page_list *page_list; > + int fbo; > + u32 length; > + int access_flags; > + struct ib_mr *mr; (side note... move this pointer up with the other pointers, so you don't end up with a hole in the structure due to alignment... or stick an int page_size in to fill the hole) > + } fast_reg; > +struct ib_fast_reg_page_list { > + struct ib_device *device; > + u64 *page_list; > + int page_list_len; > +}; is page_list_len the maximum length of the page_list, or is it filled in by the consumer? The driver could figure out the length of the page_list for any given work request by looking at the MR length and the page_size I suppose. - R. From rdreier at cisco.com Wed May 14 19:50:58 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 May 2008 19:50:58 -0700 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: <482B8A2D.7030505@opengridcomputing.com> (Steve Wise's message of "Wed, 14 May 2008 19:56:13 -0500") References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <482B8A2D.7030505@opengridcomputing.com> Message-ID: > > Second question -- IB BMME and iWARP talk about a key portion (least > > significant byte) of STag/L_Key/R_Key as being under consumer control. > > Do we want to expose that as part of this API? Basically it means we > > need to add a way for the consumer to pass in a new L_Key/STag as part > > of a lot of calls. > > I left it out from this first pass because we don't expose any of that > in the existing RDMA API. Currently the iwarp providers make up their > own keys. EG: ib_reg_phys_mr() should also allow passing in the key, > at least according to iWARP verbs. But I don't really see the need... Makes sense. Maybe RDS would like to control the key to avoid reuse but they've lived without it so far, and the RDS use case is kind of wacky anyway. From rdreier at cisco.com Wed May 14 19:50:05 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 May 2008 19:50:05 -0700 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: <482B8A2D.7030505@opengridcomputing.com> (Steve Wise's message of "Wed, 14 May 2008 19:56:13 -0500") References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <482B8A2D.7030505@opengridcomputing.com> Message-ID: > Sounds fine to me. As you've seen, I don't like to type, so > MM_EXTENSIONS seems better. :) How about we compromise on MEM_MGT_EXTENSIONS ;) MM is not 100% clear at first glance I think. - R. From rdreier at cisco.com Wed May 14 20:07:02 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 May 2008 20:07:02 -0700 Subject: [ofa-general] mthca max_sge value... ugh. Message-ID: it was recently pointed out to me that mem-free mthca devices cannot create a QP with max_send_sge and max_recv_sge set to the max_sge value returned by querying the device. The strange thing is that I thought I had tested this a while ago and it worked, but I can't see anything that would have changed things. Anyway, the patch below fixes things for me (tested on both mem-free and mem-full HCAs). But does anyone see something better to do? 
(short term or long term) diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 9ebadd6..200cf13 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -45,6 +45,7 @@ #include "mthca_cmd.h" #include "mthca_profile.h" #include "mthca_memfree.h" +#include "mthca_wqe.h" MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); @@ -200,7 +201,18 @@ static int mthca_dev_lim(struct mthca_dev *mdev, struct mthca_dev_lim *dev_lim) mdev->limits.gid_table_len = dev_lim->max_gids; mdev->limits.pkey_table_len = dev_lim->max_pkeys; mdev->limits.local_ca_ack_delay = dev_lim->local_ca_ack_delay; - mdev->limits.max_sg = dev_lim->max_sg; + /* + * Need to allow for worst case send WQE overhead and check + * whether max_desc_sz imposes a lower limit than max_sg; UD + * send has the biggest overhead. + */ + mdev->limits.max_sg = min_t(int, dev_lim->max_sg, + (dev_lim->max_desc_sz - + sizeof (struct mthca_next_seg) - + (mthca_is_memfree(mdev) ? + sizeof (struct mthca_arbel_ud_seg) : + sizeof (struct mthca_tavor_ud_seg))) / + sizeof (struct mthca_data_seg)); mdev->limits.max_wqes = dev_lim->max_qp_sz; mdev->limits.max_qp_init_rdma = dev_lim->max_requester_per_qp; mdev->limits.reserved_qps = dev_lim->reserved_qps; From Thomas.Talpey at netapp.com Wed May 14 20:32:21 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 14 May 2008 23:32:21 -0400 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: References: Message-ID: At 11:07 PM 5/14/2008, Roland Dreier wrote: >it was recently pointed out to me that mem-free mthca devices cannot >create a QP with max_send_sge and max_recv_sge set to the max_sge value >returned by querying the device. We've been hit by this twice this week on two NFS/RDMA servers, so I'm glad to see this! But, for us it happens with memless ConnectX - our mthca devices are ok (but OTOH they're memfull not memfree) I'll be happy to test it with our misbehaving cards, but I can't do it until next week since they just went into a box for shipping. In the meantime, dare I ask - what's different about memfree cards that limits the sge attributes like this? And, what values result from the new code? The ConnectX ones I have report 32, and fail when trying to set that. Tom. From swise at opengridcomputing.com Wed May 14 20:40:56 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 14 May 2008 22:40:56 -0500 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <482B8A2D.7030505@opengridcomputing.com> Message-ID: <482BB0C8.3040307@opengridcomputing.com> Roland Dreier wrote: > > Sounds fine to me. As you've seen, I don't like to type, so > > MM_EXTENSIONS seems better. :) > > How about we compromise on MEM_MGT_EXTENSIONS ;) > MM is not 100% clear at first glance I think. > > - R. works for me. From swise at opengridcomputing.com Wed May 14 20:46:57 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 14 May 2008 22:46:57 -0500 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. 
In-Reply-To: References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> <482B8C5A.5020904@opengridcomputing.com> Message-ID: <482BB231.2000909@opengridcomputing.com> Roland Dreier wrote: > > So you want the page size specified in the fast_reg_page_list as > > opposed to when the page list is bound to the fast_reg mr (via > > post_send)? > > It's kind of the same thing, since the fast_reg_page_list is part of the > send work request... the structures you have at the moment are: > Yes, but I guess it makes more sense to specify it when you allocate the page_list. > > + struct { > > + u64 iova_start; > > + struct ib_fast_reg_page_list *page_list; > > + int fbo; > > + u32 length; > > + int access_flags; > > + struct ib_mr *mr; > > (side note... move this pointer up with the other pointers, so you don't > end up with a hole in the structure due to alignment... or stick an int > page_size in to fill the hole) k > > > + } fast_reg; > > > +struct ib_fast_reg_page_list { > > + struct ib_device *device; > > + u64 *page_list; > > + int page_list_len; > > +}; > > is page_list_len the maximum length of the page_list, or is it filled in > by the consumer? The driver could figure out the length of the > page_list for any given work request by looking at the MR length and the > page_size I suppose. The idea was that it was the current page_list length. But perhaps the struct needs both current and max? Or maybe the struct contains the max, and the actual length is passed in with the bind. Apps, however, might need both anyway and providing a place to keep them in this struct will help apps... Steve. From Thomas.Talpey at netapp.com Wed May 14 22:59:13 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 15 May 2008 01:59:13 -0400 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> Message-ID: At 07:49 PM 5/14/2008, Roland Dreier wrote: >also I wonder if it's clearer if we call this verb >ib_alloc_fast_reg_mr(). I have to disagree. Calling anything "fast" simply invites a "faster" thing to come along later. It's like calling something "new". I say call it what it is - a work-request-based, alloc-phys-buffer-list, bind-pages-to-list, to-be-widely-supported memory registration. Obviously, the individual verbs need to be a bit more precise. :-) Ralph - to answer your question who wants it, NFS/RDMA does, both client and server. I talked about requirements that it matches closely at Sonoma last month. But Steve - aren't these capable of protecting memory at byte granularity? The word "page" in some of the names implies otherwise. Tom. From Thomas.Talpey at netapp.com Wed May 14 23:04:52 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 15 May 2008 02:04:52 -0400 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: References: <20080514190532.28544.41595.stgit@dell3.ogc.int> Message-ID: At 07:54 PM 5/14/2008, Roland Dreier wrote: >Second question -- IB BMME and iWARP talk about a key portion (least >significant byte) of STag/L_Key/R_Key as being under consumer control. >Do we want to expose that as part of this API? Basically it means we >need to add a way for the consumer to pass in a new L_Key/STag as part >of a lot of calls. 
I think the Key portion is a quite useful way for the upper layer to salt the actual R_Keys as a protection mechanism, and having it would simplify a bunch of defensive code in the NFS/RDMA client. Currently, because the keys are provider-chosen and potentially recycled, there is a latent risk. But, I only want it if ALL future providers support it in some way. If a subset does not, it's not worth coding around the differences. Tom. From sashak at voltaire.com Wed May 14 23:09:16 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 15 May 2008 09:09:16 +0300 Subject: [ofa-general] Re: [PATCH] OpenSM: Add QoS_management_in_OpenSM.txt to opensm/doc directory In-Reply-To: <1210084406.2026.48.camel@hrosenstock-ws.xsigo.com> References: <1210084406.2026.48.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080515060916.GA24654@sashak.voltaire.com> On 07:33 Tue 06 May , Hal Rosenstock wrote: > Add Yevgeny's QoS_management_in_OpenSM.txt to opensm/doc directory > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From Thomas.Talpey at netapp.com Wed May 14 23:11:47 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 15 May 2008 02:11:47 -0400 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: References: <20080514190532.28544.41595.stgit@dell3.ogc.int> Message-ID: At 02:04 AM 5/15/2008, Talpey, Thomas wrote: >At 07:54 PM 5/14/2008, Roland Dreier wrote: >>Second question -- IB BMME and iWARP talk about a key portion (least >>significant byte) of STag/L_Key/R_Key as being under consumer control. >>Do we want to expose that as part of this API? Basically it means we >>need to add a way for the consumer to pass in a new L_Key/STag as part >>of a lot of calls. > >I think the Key portion is a quite useful way for the upper layer to >salt the actual R_Keys as a protection mechanism, and having it would >simplify a bunch of defensive code in the NFS/RDMA client. Currently, >because the keys are provider-chosen and potentially recycled, there >is a latent risk. > >But, I only want it if ALL future providers support it in some way. If a >subset does not, it's not worth coding around the differences. I forgot to mention that the provider portion of the R_Key is reduced to 24 bits as a result of exposing/requiring the key. This may cause an issue at large scale, if the R_Keys have global scope. If they are limited to use on specific connections as in iWARP, then this is less of an issue. Tom. From kliteyn at dev.mellanox.co.il Thu May 15 00:09:51 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 15 May 2008 10:09:51 +0300 Subject: [ofa-general] [Fwd: Your message to general awaits moderator approval] Message-ID: <482BE1BF.8030505@dev.mellanox.co.il> Guys, I'm having some troubles with the mailing list filter again. Do we have some filtering changes? Any other ideas? -- Yevgeny -------- Original Message -------- Subject: Your message to general awaits moderator approval Date: Thu, 15 May 2008 00:06:04 -0700 From: general-bounces at lists.openfabrics.org To: kliteyn at dev.mellanox.co.il Your mail to 'general' with the subject ***SPAM*** [PATCH] opensm/ib_types.h: cosmetics - ?sec to usec Is being held until the list moderator can review it for approval. The reason it is being held: The message headers matched a filter rule Either the message will get posted to the list, or you will receive notification of the moderator's decision. 
If you would like to cancel this posting, please visit the following URL: http://lists.openfabrics.org/cgi-bin/mailman/confirm/general/b4e509be88382dc9cdc3eb30aa789ce58306a08c From eli at dev.mellanox.co.il Thu May 15 00:20:27 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Thu, 15 May 2008 10:20:27 +0300 Subject: [ofa-general] [PATCH v2] IB/mlx4: Add send with invalidate support Message-ID: <1210836027.18385.2.camel@mtls03> >From 2fa86ee977039784f50de982e2f6bf197f00fbeb Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Sun, 11 May 2008 15:02:04 +0300 Subject: [PATCH] IB/mlx4: Add send with invalidate support Add send with invalidate support to mlx4. Signed-off-by: Eli Cohen --- drivers/infiniband/hw/mlx4/cq.c | 8 ++++++++ drivers/infiniband/hw/mlx4/main.c | 4 ++++ drivers/infiniband/hw/mlx4/qp.c | 22 +++++++++++++++++----- drivers/net/mlx4/mr.c | 6 ++++-- 4 files changed, 33 insertions(+), 7 deletions(-) Changes since last post: set IB_DEVICE_SEND_W_INV only if FW supports it. diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 4521319..291e856 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -637,6 +637,7 @@ repoll: case MLX4_OPCODE_SEND_IMM: wc->wc_flags |= IB_WC_WITH_IMM; case MLX4_OPCODE_SEND: + case MLX4_OPCODE_SEND_INVAL: wc->opcode = IB_WC_SEND; break; case MLX4_OPCODE_RDMA_READ: @@ -676,6 +677,13 @@ repoll: wc->wc_flags = IB_WC_WITH_IMM; wc->imm_data = cqe->immed_rss_invalid; break; + case MLX4_RECV_OPCODE_SEND_INVAL: + wc->opcode = IB_WC_RECV; + wc->wc_flags = IB_WC_WITH_INVALIDATE; + /* + * TBD: maybe we should just call this ieth_val + */ + wc->imm_data = be32_to_cpu(cqe->immed_rss_invalid); } wc->slid = be16_to_cpu(cqe->rlid); diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index 4d61e32..b1e9505 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -56,6 +56,8 @@ static const char mlx4_ib_version[] = DRV_NAME ": Mellanox ConnectX InfiniBand driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; +#define MLX4_FW_VER_LOCAL_SEND_INVL mlx4_fw_ver(2, 5, 0) + static void init_query_mad(struct ib_smp *mad) { mad->base_version = 1; @@ -103,6 +105,8 @@ static int mlx4_ib_query_device(struct ib_device *ibdev, props->device_cap_flags |= IB_DEVICE_UD_IP_CSUM; if (dev->dev->caps.max_gso_sz) props->device_cap_flags |= IB_DEVICE_UD_TSO; + if (dev->dev->caps.fw_ver >= MLX4_FW_VER_LOCAL_SEND_INVL) + props->device_cap_flags |= IB_DEVICE_SEND_W_INV; props->vendor_id = be32_to_cpup((__be32 *) (out_mad->data + 36)) & 0xffffff; diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 8e02ecf..d0d5f77 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -78,6 +78,7 @@ static const __be32 mlx4_ib_opcode[] = { [IB_WR_RDMA_READ] = __constant_cpu_to_be32(MLX4_OPCODE_RDMA_READ), [IB_WR_ATOMIC_CMP_AND_SWP] = __constant_cpu_to_be32(MLX4_OPCODE_ATOMIC_CS), [IB_WR_ATOMIC_FETCH_AND_ADD] = __constant_cpu_to_be32(MLX4_OPCODE_ATOMIC_FA), + [IB_WR_SEND_WITH_INV] = __constant_cpu_to_be32(MLX4_OPCODE_SEND_INVAL), }; static struct mlx4_ib_sqp *to_msqp(struct mlx4_ib_qp *mqp) @@ -1444,6 +1445,21 @@ static int build_lso_seg(struct mlx4_lso_seg *wqe, struct ib_send_wr *wr, return 0; } +static __be32 get_ieth(struct ib_send_wr *wr) +{ + switch (wr->opcode) { + case IB_WR_SEND_WITH_IMM: + case IB_WR_RDMA_WRITE_WITH_IMM: + return wr->ex.imm_data; + + case IB_WR_SEND_WITH_INV: + return cpu_to_be32(wr->ex.invalidate_rkey); + + 
default:
+		return 0;
+	}
+}
+
 int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 		      struct ib_send_wr **bad_wr)
 {
@@ -1490,11 +1506,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 						    MLX4_WQE_CTRL_TCP_UDP_CSUM) : 0) |
 			qp->sq_signal_bits;

-		if (wr->opcode == IB_WR_SEND_WITH_IMM ||
-		    wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM)
-			ctrl->imm = wr->ex.imm_data;
-		else
-			ctrl->imm = 0;
+		ctrl->imm = get_ieth(wr);

 		wqe += sizeof *ctrl;
 		size = sizeof *ctrl / 16;
diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c
index 03a9abc..e78f53d 100644
--- a/drivers/net/mlx4/mr.c
+++ b/drivers/net/mlx4/mr.c
@@ -47,7 +47,7 @@ struct mlx4_mpt_entry {
 	__be32 flags;
 	__be32 qpn;
 	__be32 key;
-	__be32 pd;
+	__be32 pd_flags;
 	__be64 start;
 	__be64 length;
 	__be32 lkey;
@@ -71,6 +71,8 @@ struct mlx4_mpt_entry {
 #define MLX4_MPT_STATUS_SW 0xF0
 #define MLX4_MPT_STATUS_HW 0x00

+#define MLX4_MPT_FLAG_EN_INV 0x3000000
+
 static u32 mlx4_buddy_alloc(struct mlx4_buddy *buddy, int order)
 {
 	int o;
@@ -320,7 +322,7 @@ int mlx4_mr_enable(struct mlx4_dev *dev, struct mlx4_mr *mr)
 			   mr->access);

 	mpt_entry->key	       = cpu_to_be32(key_to_hw_index(mr->key));
-	mpt_entry->pd	       = cpu_to_be32(mr->pd);
+	mpt_entry->pd_flags    = cpu_to_be32(mr->pd | MLX4_MPT_FLAG_EN_INV);
 	mpt_entry->start       = cpu_to_be64(mr->iova);
 	mpt_entry->length      = cpu_to_be64(mr->size);
 	mpt_entry->entity_size = cpu_to_be32(mr->mtt.page_shift);
--
1.5.5.1

From npiggin at suse.de  Thu May 15 00:57:47 2008
From: npiggin at suse.de (Nick Piggin)
Date: Thu, 15 May 2008 09:57:47 +0200
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <20080514112625.GY9878@sgi.com>
References: <6b384bb988786aa78ef0.1210170958@duo.random>
	<20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au>
	<20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de>
	<20080514112625.GY9878@sgi.com>
Message-ID: <20080515075747.GA7177@wotan.suse.de>

On Wed, May 14, 2008 at 06:26:25AM -0500, Robin Holt wrote:
> On Wed, May 14, 2008 at 06:11:22AM +0200, Nick Piggin wrote:
> >
> > I guess that you have found a way to perform TLB flushing within coherent
> > domains over the numalink interconnect without sleeping.  I'm sure it would
> > be possible to send similar messages between non coherent domains.
>
> I assume by coherent domains, you are actually talking about system
> images.

Yes

> Our memory coherence domain on the 3700 family is 512 processors
> on 128 nodes.  On the 4700 family, it is 16,384 processors on 4096 nodes.
> We extend a "Read-Exclusive" mode beyond the coherence domain so any
> processor is able to read any cacheline on the system.  We also provide
> uncached access for certain types of memory beyond the coherence domain.

Yes, I understand the basics.

> For the other partitions, the exporting partition does not know at what
> virtual address the imported pages are mapped.  The pages are frequently
> mapped in a different order by the MPI library to help with MPI collective
> operations.
>
> For the exporting side to do those TLB flushes, we would need to replicate
> all that importing information back to the exporting side.

Right.  Or the exporting side could be passed tokens that it tracks
itself, rather than virtual addresses.

> Additionally, the hardware that does the TLB flushing is protected
> by a spinlock on each system image.  We would need to change that
> simple spinlock into a type of hardware lock that would work (on 3700)
> outside the processors coherence domain.
The only way to do that is to
> use uncached addresses with our Atomic Memory Operations which do the
> cmpxchg at the memory controller.  The uncached accesses are an order
> of magnitude or more slower.

I'm not sure if you're thinking about what I'm thinking of.  With the
scheme I'm imagining, all you will need is some way to raise an IPI-like
interrupt on the target domain.  The IPI target will have a driver to
handle the interrupt, which will determine the mm and virtual addresses
which are to be invalidated, and will then tear down those page tables
and issue hardware TLB flushes within its domain.  On the Linux side,
I don't see why this can't be done.

> > So yes, I'd much rather rework such highly specialized system to fit in
> > closer with Linux than rework Linux to fit with these machines (and
> > apparently slow everyone else down).
>
> But it isn't that we are having a problem adapting to just the hardware.
> One of the limiting factors is Linux on the other partition.

In what way is the Linux limiting?

> > > Additionally, the call to zap_page_range expects to have the mmap_sem
> > > held.  I suppose we could use something other than zap_page_range and
> > > atomically clear the process page tables.
> >
> > zap_page_range does not expect to have mmap_sem held.  I think for anon
> > pages it is always called with mmap_sem, however try_to_unmap_anon is
> > not (although it expects page lock to be held, I think we should be able
> > to avoid that).
>
> zap_page_range calls unmap_vmas which walks to vma->next.  Are you saying
> that can be walked without grabbing the mmap_sem at least readably?

Oh, I get that confused because of the mixed up naming conventions
there: unmap_page_range should actually be called zap_page_range.  But
at any rate, yes we can easily zap pagetables without holding mmap_sem.

> I feel my understanding of list management and locking completely
> shifting.

FWIW, mmap_sem isn't held to protect vma->next there anyway, because at
that point the vmas are detached from the mm's rbtree and linked list.
But sure, in that particular path it is held for other reasons.

> > > Doing that will not alleviate
> > > the need to sleep for the messaging to the other partitions.
> >
> > No, but I'd venture to guess that is not impossible to implement even
> > on your current hardware (maybe a firmware update is needed)?
>
> Are you suggesting the sending side would not need to sleep or the
> receiving side?  Assuming you meant the sender, it spins waiting for the
> remote side to acknowledge the invalidate request?  We place the data
> into a previously agreed upon buffer and send an interrupt.  At this
> point, we would need to start spinning and waiting for completion.
> Let's assume we never run out of buffer space.

How would you run out of buffer space if it is synchronous?

> The receiving side receives an interrupt.  The interrupt currently wakes
> an XPC thread to do the work of transferring and delivering the message
> to XPMEM.  The transfer of the data which XPC does uses the BTE engine
> which takes up to 28 seconds to timeout (hardware timeout before raising
> an error) and the BTE code automatically does a retry for certain
> types of failure.  We currently need to grab semaphores which _MAY_
> be able to be reworked into other types of locks.

Sure, you obviously would need to rework your code because it's been
written with the assumption that it can sleep.

What is XPMEM exactly anyway?  I'd assumed it is a Linux driver.
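For what it's worth, a rough and purely illustrative sketch of the receive
side being argued about here -- every xinval_* name below is hypothetical --
would be an interrupt handler that drains (token, range) requests from a
previously agreed-upon buffer, tears down its own page tables and TLB
entries without sleeping, and then acks a word the sender spins on:

struct xinval_req {		/* deposited by the sending partition   */
	u64 token;		/* identifies the target mm, not a VA   */
	unsigned long start;
	unsigned long len;
};

static irqreturn_t xinval_interrupt(int irq, void *arg)
{
	struct xinval_queue *q = arg;	/* agreed-upon request buffer   */
	struct xinval_req r;

	while (xinval_dequeue(q, &r)) {	/* non-sleeping ring buffer     */
		struct mm_struct *mm = xinval_token_to_mm(r.token);

		if (mm)
			/* stand-in for page table teardown plus a
			 * hardware TLB flush within this domain     */
			xinval_zap_range(mm, r.start, r.len);
	}
	xinval_ack(q);			/* the sender spins on this word */
	return IRQ_HANDLED;
}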
From sashak at voltaire.com  Thu May 15 01:21:12 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 15 May 2008 11:21:12 +0300
Subject: [ofa-general] Re: [PATCH] opensm/ib_types.h: cosmetics - ?sec to usec
In-Reply-To: <482BE026.7030104@dev.mellanox.co.il>
References: <482BE026.7030104@dev.mellanox.co.il>
Message-ID: <20080515082112.GC24654@sashak.voltaire.com>

On 10:03 Thu 15 May     , Yevgeny Kliteynik wrote:
> Hi Sasha,
>
> Although "?sec" in the comments looks really cool and sophisticated :-),
> I'd prefer to lose it and replace with a simple "usec".
>
> Having '?' in the code confuses some editors and tools, such as "kompare".
>
> Signed-off-by: Yevgeny Kliteynik

Applied. Thanks.

Sasha

From kliteyn at dev.mellanox.co.il  Thu May 15 02:08:30 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 15 May 2008 12:08:30 +0300
Subject: [ofa-general] [PATCH] opensm/ib_types.h: fixing some wrong comments
Message-ID: <482BFD8E.10101@dev.mellanox.co.il>

Hi Sasha,

Fixing a couple of wrong attribute descriptions in ib_types.h

Signed-off-by: Yevgeny Kliteynik
---
 opensm/include/iba/ib_types.h |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h
index 6f3c400..e6bd9ee 100644
--- a/opensm/include/iba/ib_types.h
+++ b/opensm/include/iba/ib_types.h
@@ -974,7 +974,7 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code)
 * IB_MAD_ATTR_PORT_SMPL_CTRL
 *
 * DESCRIPTION
-*	NodeDescription attribute (16.1.2)
+*	PortSamplesControl attribute (16.1.3)
 *
 * SOURCE
 */
@@ -998,7 +998,7 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code)
 * IB_MAD_ATTR_PORT_SMPL_RSLT
 *
 * DESCRIPTION
-*	NodeInfo attribute (16.1.2)
+*	PortSamplesResult attribute (16.1.3)
 *
 * SOURCE
 */
@@ -1022,7 +1022,7 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code)
 * IB_MAD_ATTR_PORT_CNTRS
 *
 * DESCRIPTION
-*	SwitchInfo attribute (16.1.2)
+*	PortCounters attribute (16.1.3)
 *
 * SOURCE
 */
--
1.5.1.4

From sashak at voltaire.com  Thu May 15 02:09:56 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 15 May 2008 12:09:56 +0300
Subject: [ofa-general] Re: [PATCH] opensm/ib_types.h: fixing some wrong comments
In-Reply-To: <482BFD8E.10101@dev.mellanox.co.il>
References: <482BFD8E.10101@dev.mellanox.co.il>
Message-ID: <20080515090956.GF24654@sashak.voltaire.com>

On 12:08 Thu 15 May     , Yevgeny Kliteynik wrote:
> Hi Sasha,
>
> Fixing a couple of wrong attribute descriptions in ib_types.h
>
> Signed-off-by: Yevgeny Kliteynik

Applied. Thanks.
Sasha > --- > opensm/include/iba/ib_types.h | 6 +++--- > 1 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h > index 6f3c400..e6bd9ee 100644 > --- a/opensm/include/iba/ib_types.h > +++ b/opensm/include/iba/ib_types.h > @@ -974,7 +974,7 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code) > * IB_MAD_ATTR_PORT_SMPL_CTRL > * > * DESCRIPTION > -* NodeDescription attribute (16.1.2) > +* PortSamplesControl attribute (16.1.3) > * > * SOURCE > */ > @@ -998,7 +998,7 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code) > * IB_MAD_ATTR_PORT_SMPL_RSLT > * > * DESCRIPTION > -* NodeInfo attribute (16.1.2) > +* PortSamplesResult attribute (16.1.3) > * > * SOURCE > */ > @@ -1022,7 +1022,7 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code) > * IB_MAD_ATTR_PORT_CNTRS > * > * DESCRIPTION > -* SwitchInfo attribute (16.1.2) > +* PortCounters attribute (16.1.3) > * > * SOURCE > */ > -- > 1.5.1.4 > From kliteyn at dev.mellanox.co.il Thu May 15 02:18:00 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 15 May 2008 12:18:00 +0300 Subject: [ofa-general] Re: [PATCH] opensm/ib_types.h: cosmetics - ?sec to usec In-Reply-To: <20080515082112.GC24654@sashak.voltaire.com> References: <482BE026.7030104@dev.mellanox.co.il> <20080515082112.GC24654@sashak.voltaire.com> Message-ID: <482BFFC8.1060404@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 10:03 Thu 15 May , Yevgeny Kliteynik wrote: >> Hi Sasha, >> >> Although "?sec" in the comments looks really cool and sophisticated :-), >> I'd prefer to lose it and replace with a simple "usec". By the way, the problematic character confused Thunderburd too :) I see it was replaced by '?'. How did you apply the patch? -- Yevgeny >> Heaving '?' in the code confuses some editors and tools, such as "kompare". >> >> Signed-off-by: Yevgeny Kliteynik > > Applied. Thanks. > > Sasha > From atheatre at bellnet.ca Thu May 15 02:22:15 2008 From: atheatre at bellnet.ca (Jocey Wall.) Date: Thu, 15 May 2008 5:22:15 -0400 Subject: [ofa-general] Ref : L/400-26932 Message-ID: <6t0pti$17dfkr@toip35-bus.srvr.bell.ca> You won £2,000,000.00 GBP.Get back to us via return email with your Name,and Address\Country and your Phone Number for more information on how you won and the delivery of your won prize to you.Email: processing.unit at btinternet.com From atheatre at bellnet.ca Thu May 15 02:24:03 2008 From: atheatre at bellnet.ca (Jocey Wall.) Date: Thu, 15 May 2008 5:24:03 -0400 Subject: [ofa-general] Ref : L/400-26932 Message-ID: <6t0pti$17dfth@toip35-bus.srvr.bell.ca> You won £2,000,000.00 GBP.Get back to us via return email with your Name,and Address\Country and your Phone Number for more information on how you won and the delivery of your won prize to you.Email: processing.unit at btinternet.com From kliteyn at dev.mellanox.co.il Thu May 15 02:25:50 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 15 May 2008 12:25:50 +0300 Subject: [ofa-general] [PATCH v2] opensm/osm_qos_policy.c: log matched QoS criteria Message-ID: <482C019E.40606@dev.mellanox.co.il> Hi Sasha, I think this patch was somehow lost in the pile of patches that you recently got. Anyhow, reposting it: Adding log messages for matched criteria of the QoS policy rule. 
Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_qos_policy.c | 18 +++++++++++++++--- 1 files changed, 15 insertions(+), 3 deletions(-) diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c index 6c81872..ebe3a7f 100644 --- a/opensm/opensm/osm_qos_policy.c +++ b/opensm/opensm/osm_qos_policy.c @@ -598,10 +598,13 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( { osm_qos_match_rule_t *p_qos_match_rule = NULL; cl_list_iterator_t list_iterator; + osm_log_t * p_log = &p_qos_policy->p_subn->p_osm->log; if (!cl_list_count(&p_qos_policy->qos_match_rules)) return NULL; + OSM_LOG_ENTER(p_log); + /* Go over all QoS match rules and find the one that matches the request */ list_iterator = cl_list_head(&p_qos_policy->qos_match_rules); @@ -624,6 +627,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( list_iterator = cl_list_next(list_iterator); continue; } + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, + "Source port matched.\n"); } /* If a match rule has Destination groups, PR request dest. has to be in this list */ @@ -637,6 +642,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( list_iterator = cl_list_next(list_iterator); continue; } + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, + "Destination port matched.\n"); } /* If a match rule has QoS classes, PR request HAS @@ -655,7 +662,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( list_iterator = cl_list_next(list_iterator); continue; } - + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, + "QoS Class matched.\n"); } /* If a match rule has Service IDs, PR request HAS @@ -675,7 +683,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( list_iterator = cl_list_next(list_iterator); continue; } - + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, + "Service ID matched.\n"); } /* If a match rule has PKeys, PR request HAS @@ -694,13 +703,16 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( list_iterator = cl_list_next(list_iterator); continue; } - + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, + "PKey matched.\n"); } /* if we got here, then this match-rule matched this PR request */ break; } + OSM_LOG_EXIT(p_log); + if (list_iterator == cl_list_end(&p_qos_policy->qos_match_rules)) return NULL; -- 1.5.1.4 From sashak at voltaire.com Thu May 15 02:37:09 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 15 May 2008 12:37:09 +0300 Subject: [ofa-general] Re: [ewg] Compiling OFED 1.3 on Gentoo In-Reply-To: <39C75744D164D948A170E9792AF8E7CA012CD4B0@exil.voltaire.com> References: <39C75744D164D948A170E9792AF8E7CA012CD4B0@exil.voltaire.com> Message-ID: <20080515093709.GH24654@sashak.voltaire.com> Hi Olga, On 17:18 Mon 12 May , Olga Shern wrote: > > We are trying to compile OFED 1.3 on Gentoo and see the following error, > But if I install source RPM file and then running 'rpmbuild -ba > libibcommon.spec' then I can build RPM, so only rpmbuild --rebuild > command causing to problems. Basically Gentoo doesn't use RPM as package manager, but builds packages from sources using portage/emerge stuff. So it is nice that it builds somehow. As another workaround rpm2targz probably can be used too. Of course better would be to have native *.ebuild files. > Have someone succeeded to build OFED 1.3 on Gentoo? I'm using Gentoo as my workstation, don't do RPMs there however. 
Sasha From sashak at voltaire.com Thu May 15 02:40:40 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 15 May 2008 12:40:40 +0300 Subject: [ofa-general] Re: [PATCH] opensm/ib_types.h: cosmetics - ?sec to usec In-Reply-To: <482BFFC8.1060404@dev.mellanox.co.il> References: <482BE026.7030104@dev.mellanox.co.il> <20080515082112.GC24654@sashak.voltaire.com> <482BFFC8.1060404@dev.mellanox.co.il> Message-ID: <20080515094040.GK24654@sashak.voltaire.com> On 12:18 Thu 15 May , Yevgeny Kliteynik wrote: > > By the way, the problematic character confused Thunderburd too :) > I see it was replaced by '?'. > How did you apply the patch? With 'git-am' and without any problem :) Sasha From sashak at voltaire.com Thu May 15 02:52:25 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 15 May 2008 12:52:25 +0300 Subject: [ofa-general] Re: [PATCH v2] opensm/osm_qos_policy.c: log matched QoS criteria In-Reply-To: <482C019E.40606@dev.mellanox.co.il> References: <482C019E.40606@dev.mellanox.co.il> Message-ID: <20080515095225.GL24654@sashak.voltaire.com> Hi Yevgeny, On 12:25 Thu 15 May , Yevgeny Kliteynik wrote: > > I think this patch was somehow lost in the pile of patches > that you recently got. Anyhow, reposting it: It wasn't lost, I just didn't process it yet (and there still be more unreviewed patched on the list I need to care about). My very first thought was to not do it because such debug prints hurt performance a lot even when log level has lower value (it was measured very well during Up/Down routing optimizations), so I pend it in order to get some numbers first. Another thing I don't like is that with higher debug levels OpenSM generates ~1GB log file just during initial sweep. But this is more general concern. Sasha > Adding log messages for matched criteria of the QoS policy rule. > > Signed-off-by: Yevgeny Kliteynik > --- > opensm/opensm/osm_qos_policy.c | 18 +++++++++++++++--- > 1 files changed, 15 insertions(+), 3 deletions(-) > > diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c > index 6c81872..ebe3a7f 100644 > --- a/opensm/opensm/osm_qos_policy.c > +++ b/opensm/opensm/osm_qos_policy.c > @@ -598,10 +598,13 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( > { > osm_qos_match_rule_t *p_qos_match_rule = NULL; > cl_list_iterator_t list_iterator; > + osm_log_t * p_log = &p_qos_policy->p_subn->p_osm->log; > > if (!cl_list_count(&p_qos_policy->qos_match_rules)) > return NULL; > > + OSM_LOG_ENTER(p_log); > + > /* Go over all QoS match rules and find the one that matches the request */ > > list_iterator = cl_list_head(&p_qos_policy->qos_match_rules); > @@ -624,6 +627,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( > list_iterator = cl_list_next(list_iterator); > continue; > } > + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, > + "Source port matched.\n"); > } > > /* If a match rule has Destination groups, PR request dest. 
has to be in this list */ > @@ -637,6 +642,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( > list_iterator = cl_list_next(list_iterator); > continue; > } > + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, > + "Destination port matched.\n"); > } > > /* If a match rule has QoS classes, PR request HAS > @@ -655,7 +662,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( > list_iterator = cl_list_next(list_iterator); > continue; > } > - > + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, > + "QoS Class matched.\n"); > } > > /* If a match rule has Service IDs, PR request HAS > @@ -675,7 +683,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( > list_iterator = cl_list_next(list_iterator); > continue; > } > - > + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, > + "Service ID matched.\n"); > } > > /* If a match rule has PKeys, PR request HAS > @@ -694,13 +703,16 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( > list_iterator = cl_list_next(list_iterator); > continue; > } > - > + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, > + "PKey matched.\n"); > } > > /* if we got here, then this match-rule matched this PR request */ > break; > } > > + OSM_LOG_EXIT(p_log); > + > if (list_iterator == cl_list_end(&p_qos_policy->qos_match_rules)) > return NULL; > > -- > 1.5.1.4 > From holt at sgi.com Thu May 15 04:01:48 2008 From: holt at sgi.com (Robin Holt) Date: Thu, 15 May 2008 06:01:48 -0500 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080515075747.GA7177@wotan.suse.de> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> Message-ID: <20080515110147.GD10126@sgi.com> We are pursuing Linus' suggestion currently. This discussion is completely unrelated to that work. On Thu, May 15, 2008 at 09:57:47AM +0200, Nick Piggin wrote: > I'm not sure if you're thinking about what I'm thinking of. With the > scheme I'm imagining, all you will need is some way to raise an IPI-like > interrupt on the target domain. The IPI target will have a driver to > handle the interrupt, which will determine the mm and virtual addresses > which are to be invalidated, and will then tear down those page tables > and issue hardware TLB flushes within its domain. On the Linux side, > I don't see why this can't be done. We would need to deposit the payload into a central location to do the invalidate, correct? That central location would either need to be indexed by physical cpuid (65536 possible currently, UV will push that up much higher) or some sort of global id which is difficult because remote partitions can reboot giving you a different view of the machine and running partitions would need to be updated. Alternatively, that central location would need to be protected by a global lock or atomic type operation, but a majority of the machine does not have coherent access to other partitions so they would need to use uncached operations. Essentially, take away from this paragraph that it is going to be really slow or really large. Then we need to deposit the information needed to do the invalidate. Lastly, we would need to interrupt. Unfortunately, here we have a thundering herd. There could be up to 16256 processors interrupting the same processor. 
That will be a lot of work. It will need to look up the mm (without grabbing any sleeping locks in either xpmem or the kernel) and do the tlb invalidates. Unfortunately, the sending side is not free to continue (in most cases) until it knows that the invalidate is completed. So it will need to spin waiting for a completion signal, which could be as simple as an uncached word. But how will it handle the possible failure of the other partition? How will it detect that failure and recover? A timeout value could be difficult to gauge because the other side may be off doing a considerable amount of work and may just be backed up. > Sure, you obviously would need to rework your code because it's been > written with the assumption that it can sleep. It is an assumption based upon some of the kernel functions we call doing things like grabbing mutexes or rw_sems. That pushes back to us. I think the kernel's locking is perfectly reasonable. The problem we run into is we are trying to get from one context in one kernel to a different context in another, and the in-between piece needs to be sleepable. > What is XPMEM exactly anyway? I'd assumed it is a Linux driver. XPMEM allows one process to make a portion of its virtual address range directly addressable by another process with the appropriate access. The other process can be on other partitions. As long as Numa-link allows access to the memory, we can make it available. Userland has an advantage in that the kernel entrance/exit code contains memory errors, so we can confine hardware failures (in most cases) to terminating a user program rather than losing the partition. The kernel enjoys no such fault containment, so it cannot safely reference that memory directly. Thanks, Robin From avi at qumranet.com Thu May 15 04:12:34 2008 From: avi at qumranet.com (Avi Kivity) Date: Thu, 15 May 2008 14:12:34 +0300 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080515110147.GD10126@sgi.com> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> <20080515110147.GD10126@sgi.com> Message-ID: <482C1AA2.20307@qumranet.com> Robin Holt wrote: > Then we need to deposit the information needed to do the invalidate. > > Lastly, we would need to interrupt. Unfortunately, here we have a > thundering herd. There could be up to 16256 processors interrupting the > same processor. That will be a lot of work. It will need to look up the > mm (without grabbing any sleeping locks in either xpmem or the kernel) > and do the tlb invalidates. > > You don't need to interrupt every time. Place your data in a queue (you do support rmw operations, right?) and interrupt. Invalidates from other processors will see that the queue hasn't been processed yet and skip the interrupt.
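To make the queue-plus-flag suggestion above concrete, here is a minimal sender-side sketch in which only the enqueuer that finds no interrupt outstanding raises one, so concurrent senders piggyback on a single interrupt. Every name in it (inv_request, inv_queue, send_cross_partition_ipi) is hypothetical rather than XPMEM or kernel code, and it glosses over the cross-partition coherence problem Robin describes: a real queue would have to live in memory both partitions can update with uncached atomic operations.

/* Hypothetical sketch of the "queue + single interrupt" idea; none of
 * these names come from XPMEM or the kernel tree. */
#include <linux/list.h>
#include <linux/spinlock.h>
#include <asm/atomic.h>
#include <asm/processor.h>

struct inv_request {
	struct list_head node;
	unsigned long start, len;	/* virtual range to invalidate */
	atomic_t done;			/* sender spins on this word */
};

struct inv_queue {
	spinlock_t lock;
	struct list_head reqs;
	int ipi_pending;		/* an interrupt is already on its way */
};

void send_cross_partition_ipi(struct inv_queue *q);	/* hypothetical transport hook */

static void queue_invalidate(struct inv_queue *q, struct inv_request *req)
{
	int need_ipi;

	spin_lock(&q->lock);
	list_add_tail(&req->node, &q->reqs);
	need_ipi = !q->ipi_pending;	/* only the first enqueuer interrupts */
	q->ipi_pending = 1;
	spin_unlock(&q->lock);

	if (need_ipi)
		send_cross_partition_ipi(q);

	while (!atomic_read(&req->done))	/* completion can be an uncached word */
		cpu_relax();
}

The handler on the target would drain the whole list, perform the TLB invalidates, clear ipi_pending under the lock, and set each request's done flag; that drain is what lets later enqueuers skip the interrupt.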
-- error compiling committee.c: too many arguments to function From sashak at voltaire.com Thu May 15 04:19:14 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 15 May 2008 14:19:14 +0300 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080515111914.GO24654@sashak.voltaire.com> Hi Hal, On 04:19 Mon 12 May , Hal Rosenstock wrote: > > I filed this as bug 1031: > https://bugs.openfabrics.org/show_bug.cgi?id=1031 > > > It would be nice if I could reproduce it in simulation. > > Yes, that would be nice; but I don't have a sim case. Do you have ibnetdiscover file for this case? If not from where report is coming? Sasha From dorfman.eli at gmail.com Thu May 15 04:23:31 2008 From: dorfman.eli at gmail.com (Eli Dorfman) Date: Thu, 15 May 2008 14:23:31 +0300 Subject: [ofa-general] iSer and Direct IO In-Reply-To: <482B7FE4.9070502@fusionio.com> References: <482B7FE4.9070502@fusionio.com> Message-ID: <694d48600805150423n1a8b0efwf6d6596f8e7891ef@mail.gmail.com> On Thu, May 15, 2008 at 3:12 AM, Cameron Harr wrote: > Hi, I've been trying to compare performances between iSer and srpt and > am getting mixed results where iSer wins for IOPs and srpt wins for some > streaming b/w tests. I've tested with iozone, spew and FIO, and IOP > numbers are always higher on iSer. My problem though is that I'm a > little suspicious of some of the iSer numbers and whether they are > really using Direct IO. For example, you'll see below in some of my FIO > results that I'm getting a write B/W of 799.1 MB/s at one point. That's > way above what I can get natively on the device (~650 MB/s DIO) and is > more along the lines of buffered IO. If the IOP numbers are also using > some kind of caching, that could possibly taint them also. Does anyone > know if specifying DIO will really bypass all buffers or if something is > getting cached in the agents (iscsi, tgtd)? > > > FIO > --------------- iSer 1----iSer 2----SRPT 1----SRPT 2- > RBW (MB/s) 565.3 836.5 622.0 581.7 > Read IOPs 63488.1 68053.8 5335.6 5446.1 > WBW (MB/s) 799.1 737.7 589.5 594.4 > Write IOPs 79086.6 80005.7 33884.6 34058.6 > > > Thanks much, > Cameron > Your question should be posted on linux-scsi. See the following link that explains about DIO http://tldp.org/HOWTO/SCSI-Generic-HOWTO/dio.html Please check with sgp_dd to avoid any caching. Thanks, Eli From hrosenstock at xsigo.com Thu May 15 04:53:03 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 15 May 2008 04:53:03 -0700 Subject: [ofa-general] Re: [PATCH] OpenSM: Add QoS_management_in_OpenSM.txt to opensm/doc directory In-Reply-To: <20080515060916.GA24654@sashak.voltaire.com> References: <1210084406.2026.48.camel@hrosenstock-ws.xsigo.com> <20080515060916.GA24654@sashak.voltaire.com> Message-ID: <1210852383.2026.833.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-15 at 09:09 +0300, Sasha Khapyorsky wrote: > On 07:33 Tue 06 May , Hal Rosenstock wrote: > > Add Yevgeny's QoS_management_in_OpenSM.txt to opensm/doc directory > > > > Signed-off-by: Hal Rosenstock > > Applied. Thanks. This should also be applied to ofed_1_3 branch IMO. 
-- Hal > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From kliteyn at dev.mellanox.co.il Thu May 15 00:03:02 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 15 May 2008 10:03:02 +0300 Subject: [ofa-general] ***SPAM*** [PATCH] opensm/ib_types.h: cosmetics - =?iso-8859-1?q?=B5sec_to_usec?= Message-ID: <482BE026.7030104@dev.mellanox.co.il> Hi Sasha, Although "µsec" in the comments looks really cool and sophisticated :-), I'd prefer to lose it and replace it with a simple "usec". Having 'µ' in the code confuses some editors and tools, such as "kompare". Signed-off-by: Yevgeny Kliteynik --- opensm/include/iba/ib_types.h | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index 51695b5..6f3c400 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -3099,7 +3099,7 @@ ib_path_rec_pkt_life(IN const ib_path_rec_t * const p_rec) * [in] Pointer to the path record object. * * RETURN VALUES -* Encoded path pkt_life = 4.096 µsec * 2 ** PacketLifeTime. +* Encoded path pkt_life = 4.096 usec * 2 ** PacketLifeTime. * * NOTES * @@ -6391,7 +6391,7 @@ ib_multipath_rec_pkt_life(IN const ib_multipath_rec_t * const p_rec) * [in] Pointer to the multipath record object. * * RETURN VALUES -* Encoded multipath pkt_life = 4.096 µsec * 2 ** PacketLifeTime. +* Encoded multipath pkt_life = 4.096 usec * 2 ** PacketLifeTime. * * NOTES * -- 1.5.1.4 From swise at opengridcomputing.com Thu May 15 07:16:41 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 May 2008 09:16:41 -0500 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> Message-ID: <482C45C9.2010600@opengridcomputing.com> Talpey, Thomas wrote: > At 07:49 PM 5/14/2008, Roland Dreier wrote: >> also I wonder if it's clearer if we call this verb >> ib_alloc_fast_reg_mr(). > > I have to disagree. Calling anything "fast" simply invites a "faster" > thing to come along later. It's like calling something "new". > > I say call it what it is - a work-request-based, alloc-phys-buffer-list, > bind-pages-to-list, to-be-widely-supported memory registration. > Obviously, the individual verbs need to be a bit more precise. :-) > > Ralph - to answer your question who wants it, NFS/RDMA does, both > client and server. I talked about requirements that it matches closely > at Sonoma last month. > > But Steve - aren't these capable of protecting memory at byte > granularity? The word "page" in some of the names implies otherwise. > The MR, once bound, defines the memory region at byte granularity. The page list is just that: an array of DMA addresses of physical pages in memory. The page list + the region length + the first byte offset define the region. Steve.
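To ground the discussion, here is roughly how a consumer would drive the registration model Steve describes: allocate the MR and physical buffer list once, then post a work request whose page list, region length, and first-byte offset (iova_start) together define a byte-granular region. The function and field names follow the API proposed in this RFC thread (ib_alloc_fast_reg_mr and friends) and could change before merging; this is a usage sketch, not kernel code.

/* Usage sketch for the proposed work-request-based registration; names
 * follow this RFC and may differ in whatever API is finally merged. */
#include <rdma/ib_verbs.h>

static int fast_reg_region(struct ib_pd *pd, struct ib_qp *qp,
			   u64 *dma_pages, int npages,
			   u64 iova_start, u32 byte_len)
{
	struct ib_mr *mr;
	struct ib_fast_reg_page_list *pl;
	struct ib_send_wr wr, *bad_wr;
	int i, ret;

	mr = ib_alloc_fast_reg_mr(pd, npages);
	if (IS_ERR(mr))
		return PTR_ERR(mr);

	pl = ib_alloc_fast_reg_page_list(pd->device, npages);
	if (IS_ERR(pl)) {
		ib_dereg_mr(mr);
		return PTR_ERR(pl);
	}

	for (i = 0; i < npages; i++)
		pl->page_list[i] = dma_pages[i];	/* DMA addresses of the pages */

	memset(&wr, 0, sizeof wr);
	wr.opcode = IB_WR_FAST_REG_MR;
	wr.wr.fast_reg.page_list = pl;
	wr.wr.fast_reg.page_list_len = npages;
	wr.wr.fast_reg.page_shift = PAGE_SHIFT;
	wr.wr.fast_reg.length = byte_len;	/* region length ... */
	wr.wr.fast_reg.iova_start = iova_start;	/* ... and first byte offset */
	wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE |
		IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE;
	wr.wr.fast_reg.rkey = mr->rkey;

	ret = ib_post_send(qp, &wr, &bad_wr);
	/* completion handling, invalidate, and dereg omitted from this sketch */
	return ret;
}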
From ogerlitz at voltaire.com Thu May 15 07:21:25 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 May 2008 17:21:25 +0300 (IDT) Subject: [ofa-general] [RFC v2 PATCH 0/5] rdma/cma: RDMA_ALIGN_WITH_NETDEVICE ha mode Message-ID: main changes from v1: - added new event RDMA_CM_EVENT_NETDEV_CHANGE - took the approach of notifying the user vs disconnecting the ID - this change bought us support also for the datagram (unconnected) services! I prefer to go with the affiliated event approach, for the following reasons: 1) the rdma-cm consumer ULP is not actually exposed to neighbours/routes and netdevices, i.e. it knows the destination IP address and the rdma-cm does all the interaction with the network stack needed for the local (device/gid|mac/port/pkey) and remote (gid|mac) address resolutions, so in that respect this change follows this scheme. 2) it's much harder for user space ULPs to get network events, but they can easily get rdma-cm events. Or From ogerlitz at voltaire.com Thu May 15 07:22:03 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 May 2008 17:22:03 +0300 (IDT) Subject: [ofa-general] [RFC v2 PATCH 1/5] net/bonding: announce fail-over for the active-backup mode In-Reply-To: References: Message-ID: Enhance bonding to announce fail-over for the active-backup mode through the netdev events notifier chain mechanism. Such an event can be of use for the RDMA CM (communication manager) to let native RDMA ULPs (eg NFS-RDMA, iSER) always use the same links as the IP stack does. Signed-off-by: Or Gerlitz Index: linux-2.6.26-rc2/drivers/net/bonding/bond_main.c =================================================================== --- linux-2.6.26-rc2.orig/drivers/net/bonding/bond_main.c 2008-05-13 10:02:22.000000000 +0300 +++ linux-2.6.26-rc2/drivers/net/bonding/bond_main.c 2008-05-15 12:29:44.000000000 +0300 @@ -1117,6 +1117,7 @@ void bond_change_active_slave(struct bon bond->send_grat_arp = 1; } else bond_send_gratuitous_arp(bond); + netdev_bonding_change(bond->dev); } } Index: linux-2.6.26-rc2/include/linux/notifier.h =================================================================== --- linux-2.6.26-rc2.orig/include/linux/notifier.h 2008-05-13 10:02:30.000000000 +0300 +++ linux-2.6.26-rc2/include/linux/notifier.h 2008-05-13 11:50:44.000000000 +0300 @@ -197,6 +197,7 @@ static inline int notifier_to_errno(int #define NETDEV_GOING_DOWN 0x0009 #define NETDEV_CHANGENAME 0x000A #define NETDEV_FEAT_CHANGE 0x000B +#define NETDEV_BONDING_FAILOVER 0x000C #define SYS_DOWN 0x0001 /* Notify of system down */ #define SYS_RESTART SYS_DOWN Index: linux-2.6.26-rc2/include/linux/netdevice.h =================================================================== --- linux-2.6.26-rc2.orig/include/linux/netdevice.h 2008-05-13 10:02:30.000000000 +0300 +++ linux-2.6.26-rc2/include/linux/netdevice.h 2008-05-13 11:50:20.000000000 +0300 @@ -1459,6 +1459,7 @@ extern void __dev_addr_unsync(struct de extern void dev_set_promiscuity(struct net_device *dev, int inc); extern void dev_set_allmulti(struct net_device *dev, int inc); extern void netdev_state_change(struct net_device *dev); +extern void netdev_bonding_change(struct net_device *dev); extern void netdev_features_change(struct net_device *dev); /* Load a device via the kmod */ extern void dev_load(struct net *net, const char *name); Index: linux-2.6.26-rc2/net/core/dev.c =================================================================== --- linux-2.6.26-rc2.orig/net/core/dev.c 2008-05-13 10:02:31.000000000 +0300 +++
linux-2.6.26-rc2/net/core/dev.c 2008-05-13 11:50:49.000000000 +0300 @@ -956,6 +956,12 @@ void netdev_state_change(struct net_devi } } +void netdev_bonding_change(struct net_device *dev) +{ + call_netdevice_notifiers(NETDEV_BONDING_FAILOVER, dev); +} +EXPORT_SYMBOL(netdev_bonding_change); + /** * dev_load - load a network module * @net: the applicable net namespace From ogerlitz at voltaire.com Thu May 15 07:22:35 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 May 2008 17:22:35 +0300 (IDT) Subject: [ofa-general] [RFC v2 PATCH 2/5] rdma/addr: keep the name of the netdevice in struct rdma_dev_addr In-Reply-To: References: Message-ID: Keep also the local (src) device name in struct rdma_dev_addr. Under bonding HA scheme this can be used by the rdma-cm to align RDMA sessions to use the same links as the IP stack does under fail-over and route change cases. Signed-off-by: Or Gerlitz Index: linux-2.6.26-rc2/drivers/infiniband/core/addr.c =================================================================== --- linux-2.6.26-rc2.orig/drivers/infiniband/core/addr.c 2008-05-15 12:19:42.000000000 +0300 +++ linux-2.6.26-rc2/drivers/infiniband/core/addr.c 2008-05-15 14:49:31.000000000 +0300 @@ -100,6 +100,7 @@ int rdma_copy_addr(struct rdma_dev_addr memcpy(dev_addr->broadcast, dev->broadcast, MAX_ADDR_LEN); if (dst_dev_addr) memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN); + memcpy(dev_addr->src_dev_name, dev->name, IFNAMSIZ); return 0; } EXPORT_SYMBOL(rdma_copy_addr); Index: linux-2.6.26-rc2/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.26-rc2.orig/drivers/infiniband/core/cma.c 2008-05-15 12:19:42.000000000 +0300 +++ linux-2.6.26-rc2/drivers/infiniband/core/cma.c 2008-05-15 14:48:44.000000000 +0300 @@ -998,6 +998,7 @@ static struct rdma_id_private *cma_new_c union cma_ip_addr *src, *dst; __be16 port; u8 ip_ver; + int ret; if (cma_get_net_info(ib_event->private_data, listen_id->ps, &ip_ver, &port, &src, &dst)) @@ -1022,10 +1023,11 @@ static struct rdma_id_private *cma_new_c if (rt->num_paths == 2) rt->path_rec[1] = *ib_event->param.req_rcvd.alternate_path; - ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid); ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid); - ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey)); - rt->addr.dev_addr.dev_type = RDMA_NODE_IB_CA; + ret = rdma_translate_ip(&id->route.addr.src_addr, + &id->route.addr.dev_addr); + if (ret) + goto destroy_id; id_priv = container_of(id, struct rdma_id_private, id); id_priv->state = CMA_CONNECT; Index: linux-2.6.26-rc2/include/rdma/ib_addr.h =================================================================== --- linux-2.6.26-rc2.orig/include/rdma/ib_addr.h 2008-05-15 12:19:42.000000000 +0300 +++ linux-2.6.26-rc2/include/rdma/ib_addr.h 2008-05-15 14:49:08.000000000 +0300 @@ -57,6 +57,7 @@ struct rdma_dev_addr { unsigned char dst_dev_addr[MAX_ADDR_LEN]; unsigned char broadcast[MAX_ADDR_LEN]; enum rdma_node_type dev_type; + char src_dev_name[IFNAMSIZ]; }; /** From ogerlitz at voltaire.com Thu May 15 07:23:31 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 May 2008 17:23:31 +0300 (IDT) Subject: [ofa-general] [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs In-Reply-To: References: Message-ID: RDMA_ALIGN_WITH_NETDEVICE high availability (ha) mode means that the consumer of the rdma-cm wants that RDMA sessions would always use the same links (eg ) as the IP stack does. 
In the current code, this does not happen when bonding did fail-over but the link used by an already existing session is operating fine. Consumers seeking this ha mode would get the new RDMA_CM_EVENT_NETDEV_CHANGE event when such misalignment happens. More ha modes can be added in the future. Signed-off-by: Or Gerlitz changes from v1 - - added new event RDMA_CM_EVENT_NETDEV_CHANGE - took the approach of notifying the user vs disconnecting the ID Index: linux-2.6.26-rc2/include/rdma/rdma_cm.h =================================================================== --- linux-2.6.26-rc2.orig/include/rdma/rdma_cm.h 2008-05-15 14:48:44.000000000 +0300 +++ linux-2.6.26-rc2/include/rdma/rdma_cm.h 2008-05-15 14:49:48.000000000 +0300 @@ -53,7 +53,8 @@ enum rdma_cm_event_type { RDMA_CM_EVENT_DISCONNECTED, RDMA_CM_EVENT_DEVICE_REMOVAL, RDMA_CM_EVENT_MULTICAST_JOIN, - RDMA_CM_EVENT_MULTICAST_ERROR + RDMA_CM_EVENT_MULTICAST_ERROR, + RDMA_CM_EVENT_NETDEV_CHANGE }; enum rdma_port_space { @@ -328,4 +329,10 @@ void rdma_leave_multicast(struct rdma_cm */ void rdma_set_service_type(struct rdma_cm_id *id, int tos); +enum rdma_ha_mode { + RDMA_ALIGN_WITH_NETDEVICE = 1 +}; + +int rdma_set_high_availability_mode(struct rdma_cm_id *id, enum rdma_ha_mode mode); + #endif /* RDMA_CM_H */ Index: linux-2.6.26-rc2/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.26-rc2.orig/drivers/infiniband/core/cma.c 2008-05-15 14:48:44.000000000 +0300 +++ linux-2.6.26-rc2/drivers/infiniband/core/cma.c 2008-05-15 16:30:42.000000000 +0300 @@ -143,6 +143,7 @@ struct rdma_id_private { u32 qp_num; u8 srq; u8 tos; + enum rdma_ha_mode ha_mode; }; struct cma_multicast { @@ -1523,6 +1524,19 @@ void rdma_set_service_type(struct rdma_c } EXPORT_SYMBOL(rdma_set_service_type); +int rdma_set_high_availability_mode(struct rdma_cm_id *id, enum rdma_ha_mode mode) +{ + struct rdma_id_private *id_priv; + + if (mode != RDMA_ALIGN_WITH_NETDEVICE) + return -EINVAL; + + id_priv = container_of(id, struct rdma_id_private, id); + id_priv->ha_mode = mode; + return 0; +} +EXPORT_SYMBOL(rdma_set_high_availability_mode); + static void cma_query_handler(int status, struct ib_sa_path_rec *path_rec, void *context) { From ogerlitz at voltaire.com Thu May 15 07:25:34 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 May 2008 17:25:34 +0300 (IDT) Subject: [ofa-general] [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: References: Message-ID: RDMA_ALIGN_WITH_NETDEVICE high availability (ha) mode means that the consumer of the rdma-cm wants that RDMA sessions would always use the same links (eg ) as the IP stack does. In the current code, this does not happen when bonding did fail-over but the IB link used by an already existing session is operating fine. Use netevent notification for sensing that a change has happened in the IP stack, then scan the rdma-cm IDs list to see if there is an ID that is misaligned in that respect with the IP stack, and deliver RDMA_CM_EVENT_NETDEV_CHANGE for this ID, in case this is what the user asked by setting this mode for the ID. Signed-off-by: Or Gerlitz changes from v1 - - took the approach of notifying the user vs disconnecting the ID - this change bought us support also for the datagram (unconnected) services! - I used the cma_work_handler existing mechanism and decided to leave the ID state unchanged. 
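Before the locking notes below, it may help to spell out the consumer side of this proposal: a ULP opts in once per ID and then reacts to the new event in its ordinary event handler, exactly as the iSER patch later in this series does. The handler name below is a placeholder; only the rdma_cm symbols come from this patch set.

/* Placeholder ULP handler; only the rdma_cm symbols are from this series. */
static int my_cma_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
{
	switch (event->event) {
	case RDMA_CM_EVENT_NETDEV_CHANGE:
		/* the IP stack moved to another link: drop the connection
		 * and let the upper layer reconnect over the new path */
		rdma_disconnect(id);
		break;
	default:
		break;
	}
	return 0;
}

/* after rdma_create_id(), opt in to the notification:
 *	rdma_set_high_availability_mode(id, RDMA_ALIGN_WITH_NETDEVICE);
 */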
As for the locking/protection issues, I assume the netdev notifiers protect against net device removal etc. while processing the event, so dev_get/put calls are not needed. Other than that there's a need to protect against (rdma) device removal and ID destruction. Spending some time on the code, I couldn't see how to do it in finer grain than the global mutex being locked/unlocked over the execution of the double (dev list / id list) loops. Taking into account that this event is --rare-- and I changed the logic to first see if this ID wanted ha notification and only then do the more expensive memcmp calls, maybe this global locking is acceptable, and if not, I'd be happy to get some directions, e.g. if/how cma_disable_remove() and cma_enable_remove() can help to hold the lock for a shorter time, etc. Index: linux-2.6.26-rc2/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.26-rc2.orig/drivers/infiniband/core/cma.c 2008-05-15 16:30:42.000000000 +0300 +++ linux-2.6.26-rc2/drivers/infiniband/core/cma.c 2008-05-15 16:36:34.000000000 +0300 @@ -2743,6 +2743,64 @@ void rdma_leave_multicast(struct rdma_cm } EXPORT_SYMBOL(rdma_leave_multicast); +static int cma_netdev_align_id(struct net_device *ndev, struct rdma_id_private *id_priv) +{ + struct rdma_dev_addr *dev_addr; + struct cma_work *work; + + dev_addr = &id_priv->id.route.addr.dev_addr; + + if (!memcmp(dev_addr->src_dev_name, ndev->name, IFNAMSIZ) && + memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) { + printk(KERN_ERR "addr change for device %s used by id %p, notifying\n", + ndev->name, &id_priv->id); + work = kzalloc(sizeof *work, GFP_KERNEL); + if (!work) + return -ENOMEM; + work->id = id_priv; + INIT_WORK(&work->work, cma_work_handler); + work->old_state = id_priv->state; + work->new_state = id_priv->state; + work->event.event = RDMA_CM_EVENT_NETDEV_CHANGE; + atomic_inc(&id_priv->refcount); + queue_work(cma_wq, &work->work); + } +} + +static int cma_netdev_callback(struct notifier_block *self, unsigned long event, + void *ctx) +{ + struct net_device *ndev = (struct net_device *)ctx; + struct cma_device *cma_dev; + struct rdma_id_private *id_priv; + int ret = NOTIFY_DONE; + + if (dev_net(ndev) != &init_net) + return NOTIFY_DONE; + + if (event != NETDEV_BONDING_FAILOVER) + return NOTIFY_DONE; + + if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING)) + return NOTIFY_DONE; + + mutex_lock(&lock); + list_for_each_entry(cma_dev, &dev_list, list) + list_for_each_entry(id_priv, &cma_dev->id_list, list) { + if (id_priv->ha_mode == RDMA_ALIGN_WITH_NETDEVICE) { + ret = cma_netdev_align_id(ndev, id_priv); + if (ret) + break; + } + } + mutex_unlock(&lock); + return ret; +} + +static struct notifier_block cma_nb = { + .notifier_call = cma_netdev_callback +}; + static void cma_add_one(struct ib_device *device) { struct cma_device *cma_dev; @@ -2847,6 +2905,7 @@ static int cma_init(void) ib_sa_register_client(&sa_client); rdma_addr_register_client(&addr_client); + register_netdevice_notifier(&cma_nb); ret = ib_register_client(&cma_client); if (ret) @@ -2854,6 +2913,7 @@ static int cma_init(void) return 0; err: + unregister_netdevice_notifier(&cma_nb); rdma_addr_unregister_client(&addr_client); ib_sa_unregister_client(&sa_client); destroy_workqueue(cma_wq); @@ -2863,6 +2923,7 @@ err: static void cma_cleanup(void) { ib_unregister_client(&cma_client); + unregister_netdevice_notifier(&cma_nb); rdma_addr_unregister_client(&addr_client);
ib_sa_unregister_client(&sa_client); destroy_workqueue(cma_wq); From ogerlitz at voltaire.com Thu May 15 07:26:05 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 May 2008 17:26:05 +0300 (IDT) Subject: [ofa-general] [RFC PATCH 5/5] ib/iser: use the rdma-cm new RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: References: Message-ID: enhance iser to request for notification on network stack changes which makes its rdma connection unaligned with the link used by the stack for the IPs used to establish the connection. When RDMA_CM_EVENT_NETDEV_CHANGE arrives, just disconnect the connection, following that the iscsid daemon would reconnect, and the new connection would be well aligned. Signed-off-by: Or Gerlitz Index: linux-2.6.26-rc2/drivers/infiniband/ulp/iser/iser_verbs.c =================================================================== --- linux-2.6.26-rc2.orig/drivers/infiniband/ulp/iser/iser_verbs.c 2008-05-15 15:10:21.000000000 +0300 +++ linux-2.6.26-rc2/drivers/infiniband/ulp/iser/iser_verbs.c 2008-05-15 15:31:49.000000000 +0300 @@ -476,6 +476,9 @@ static int iser_cma_handler(struct rdma_ case RDMA_CM_EVENT_DEVICE_REMOVAL: iser_disconnected_handler(cma_id); break; + case RDMA_CM_EVENT_NETDEV_CHANGE: + rdma_disconnect(cma_id); + break; default: iser_err("Unexpected RDMA CM event (%d)\n", event->event); break; @@ -534,7 +537,9 @@ int iser_connect(struct iser_conn *ib_ iser_err("rdma_create_id failed: %d\n", err); goto id_failure; } - + + rdma_set_high_availability_mode(ib_conn->cma_id, RDMA_ALIGN_WITH_NETDEVICE); + src = (struct sockaddr *)src_addr; dst = (struct sockaddr *)dst_addr; err = rdma_resolve_addr(ib_conn->cma_id, src, dst, 1000); From nico.mittenzwey at s2001.tu-chemnitz.de Thu May 15 07:31:40 2008 From: nico.mittenzwey at s2001.tu-chemnitz.de (Nico Mittenzwey) Date: Thu, 15 May 2008 16:31:40 +0200 Subject: [ofa-general] Retry count error with ipath on OFED-1.3 Message-ID: <482C494C.10204@s2001.tu-chemnitz.de> Hi, We have a problem with our QLogic InfiniPath PE-800 (rev 02), OFED 1.3 and MPI. Running simple MPI jobs like the OSU MPI bandwidth test between two nodes results in a retry count error (see end of the mail). We tried different MPI implementations like the supplied openmpi or self compiled openmpi/mvapich but always get this error. Using OFED 1.2 or the QLogic InfiniPath driver (which includes OFED 1.2) we don't get any errors. The system is a Scientific Linux SL release 5.1 with kernel 2.6.18-8.1.3.el5 (for OFED 1.2) or 2.6.18-53.1.14.el5 (OFED 1.3). There is also a Mellanox MT25204 HCA in the system which works perfectly (removing it doesn't help with the ipath problem). Since we like to stay updated we want to use OFED 1.3. Did anyone get the same error and found a solution? Thanks & regards Nico OFED 1.3 Infinipath Error: ># OSU MPI Bandwidth Test v3.1 ># Size Bandwidth (MB/s) >1 0.17 >2 0.39 >4 0.66 >8 1.80 >16 2.53 >32 5.11 >64 8.80 >128 23.09 >256 43.65 >512 84.42 >1024 151.63 >[0,1,0][btl_openib_component.c:1338:btl_openib_component_progress] from >compute-6-7 to: compute-6-8 error polling HP CQ with status RETRY >EXCEEDED ERROR status number 12 for wr_id 185705200 opcode 1 >-------------------------------------------------------------------------- >The InfiniBand retry count between two MPI processes has been >exceeded. "Retry count" is defined in the InfiniBand spec 1.2 >(section 12.7.38): > > The total number of times that the sender wishes the receiver to > retry timeout, packet sequence, etc. errors before posting a > completion error. 
> >This error typically means that there is something awry within the >InfiniBand fabric itself. You should note the hosts on which this >error has occurred; it has been observed that rebooting or removing a >particular host from the job can sometimes resolve this issue. > >Two MCA parameters can be used to control Open MPI's behavior with >respect to the retry count: > >* btl_openib_ib_retry_count - The number of times the sender will > attempt to retry (defaulted to 7, the maximum value). > >* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted > to 10). The actual timeout value used is calculated as: > > 4.096 microseconds * (2^btl_openib_ib_timeout) > > See the InfiniBand spec 1.2 (section 12.7.34) for more details. >-------------------------------------------------------------------------- >mpirun noticed that job rank 1 with PID 16883 on node compute-6-8 >exited on signal 15 (Terminated). From swise at opengridcomputing.com Thu May 15 07:36:02 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 May 2008 09:36:02 -0500 Subject: [ofa-general] [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: References: Message-ID: <482C4A52.4000501@opengridcomputing.com> Or Gerlitz wrote: > RDMA_ALIGN_WITH_NETDEVICE high availability (ha) mode means that the consumer > of the rdma-cm wants that RDMA sessions would always use the same links (eg ) > as the IP stack does. In the current code, this does not happen when bonding did > fail-over but the IB link used by an already existing session is operating fine. > > Use netevent notification for sensing that a change has happened in the IP stack, > then scan the rdma-cm IDs list to see if there is an ID that is misaligned > in that respect with the IP stack, and deliver RDMA_CM_EVENT_NETDEV_CHANGE for this > ID, in case this is what the user asked by setting this mode for the ID. > > Signed-off-by: Or Gerlitz > At this point, I wonder if the naming should be different. IE the consumer really just wants notification of device change events. So instead of adding a new function rdma_set_high_availability_mode, you could just set an option saying WANT_NETDEV_CHANGE_EVENTS. Maybe we need to add rdma_set_option() to the kernel RDMA-CM API? IE make it more generic. Just a thought... Steve. From ogerlitz at voltaire.com Thu May 15 07:38:59 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 May 2008 17:38:59 +0300 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implementRDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <000001c8b603$9bb70880$8e248686@amr.corp.intel.com> References: <482A0F32.2010001@opengridcomputing.com><482ADC2B.5080008@voltaire.com> <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> <000801c8b5e1$27129f20$14c9180a@amr.corp.intel.com> <482B395F.8020201@opengridcomputing.com> <482B3FF8.4040102@opengridcomputing.com> <000001c8b603$9bb70880$8e248686@amr.corp.intel.com> Message-ID: <482C4B03.9050507@voltaire.com> Sean Hefty wrote: > I thought about this, and I agree that it's worth exploring. The locking to > support device removal ended up being fairly complex. (I'm not sure it would > have been any easier for ULPs to do this though.) The main counter I see to > using a separate channel is that device removal is invoked per rdma_cm_id, so > there's precedence for invoking the callback per id. > > My expectation is that this is a rare event. Sean, Steve, Yes, this is rare event. 
I have stated at the [v2 0/5] email posting why I prefer this to be ID affiliated event, will be glad to hear your feedback on my arguments. Or. From ogerlitz at voltaire.com Thu May 15 07:41:11 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 May 2008 17:41:11 +0300 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implementRDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <000801c8b5e1$27129f20$14c9180a@amr.corp.intel.com> References: <482A0F32.2010001@opengridcomputing.com><482ADC2B.5080008@voltaire.com> <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> <000801c8b5e1$27129f20$14c9180a@amr.corp.intel.com> Message-ID: <482C4B87.30807@voltaire.com> Sean Hefty wrote: > This is my current preferred solution. > > I don't have an issue with the rdma_cm issuing some sort of notification event > when an IP address mapping changes. I would use an event name that indicated > this, rather than 'disconnect'. > > If this is implemented, I'd like to minimize the overhead per rdma_cm_id > required to report this event. OK, Sean, I took the notification (vs disconnection) approach which seemed to be suggested by all the reviewers. As for the overhead, I tried to minimize it. Or. From swise at opengridcomputing.com Thu May 15 07:43:49 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 May 2008 09:43:49 -0500 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implementRDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <482C4B03.9050507@voltaire.com> References: <482A0F32.2010001@opengridcomputing.com><482ADC2B.5080008@voltaire.com> <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> <000801c8b5e1$27129f20$14c9180a@amr.corp.intel.com> <482B395F.8020201@opengridcomputing.com> <482B3FF8.4040102@opengridcomputing.com> <000001c8b603$9bb70880$8e248686@amr.corp.intel.com> <482C4B03.9050507@voltaire.com> Message-ID: <482C4C25.3040607@opengridcomputing.com> Or Gerlitz wrote: > Sean Hefty wrote: >> I thought about this, and I agree that it's worth exploring. The >> locking to >> support device removal ended up being fairly complex. (I'm not sure >> it would >> have been any easier for ULPs to do this though.) The main counter I >> see to >> using a separate channel is that device removal is invoked per >> rdma_cm_id, so >> there's precedence for invoking the callback per id. >> >> My expectation is that this is a rare event. > Sean, Steve, > > Yes, this is rare event. I have stated at the [v2 0/5] email posting why > I prefer this to be ID affiliated event, will be glad to hear your > feedback on my arguments. > ID affiliated event seems reasonable. Especially since its rare anyway. Steve. From ogerlitz at voltaire.com Thu May 15 07:44:41 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 May 2008 17:44:41 +0300 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> References: <482A0F32.2010001@opengridcomputing.com> <482ADC2B.5080008@voltaire.com> <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> Message-ID: <482C4C59.20303@voltaire.com> Caitlin Bestler wrote: > 2) Reduce the cost of connection teardown/rebuild by offering > an option to "pre-bind" two RDMA devices so that memory > registrations will be valid on both. This probably requires > device level co-operation on L-Key/STag allocation, but > it would be reasonable feature to consider for the High > Availability market. 
> I am not going to explore this direction, feel free to explore it and let me know your findings. Or. From charr at fusionio.com Thu May 15 08:11:15 2008 From: charr at fusionio.com (Cameron Harr) Date: Thu, 15 May 2008 09:11:15 -0600 Subject: [ofa-general] iSer and Direct IO In-Reply-To: <694d48600805150423n1a8b0efwf6d6596f8e7891ef@mail.gmail.com> References: <482B7FE4.9070502@fusionio.com> <694d48600805150423n1a8b0efwf6d6596f8e7891ef@mail.gmail.com> Message-ID: <482C5293.5090005@fusionio.com> An HTML attachment was scrubbed... URL: From landman at scalableinformatics.com Thu May 15 08:25:22 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 15 May 2008 11:25:22 -0400 Subject: [ofa-general] iSer and Direct IO In-Reply-To: <482C5293.5090005@fusionio.com> References: <482B7FE4.9070502@fusionio.com> <694d48600805150423n1a8b0efwf6d6596f8e7891ef@mail.gmail.com> <482C5293.5090005@fusionio.com> Message-ID: <482C55E2.8060905@scalableinformatics.com> Cameron Harr wrote: > ---- > [root at test05 ~]# sgp_dd dio=1 if=/dev/zero of=/dev/fioa bs=512 bpt=2048 > count=16777216 time=1 This is only 8 GB of IO. It is possible that (despite dio) you are caching. Make the IO much larger than RAM. Use a count of 128m or so. > time to transfer data was 5.556115 secs, 1546.03 MB/sec > [root at test05 ~]# sg_dd dio=1 if=/dev/zero of=/dev/fioa bs=512 bpt=2048 > count=16777216 time=1 > time to transfer data: 5.565360 secs at 1543.46 MB/sec > [root at test05 ~]# dd oflag=direct if=/dev/zero of=/dev/fioa bs=1M count=8192 > 8589934592 bytes (8.6 GB) copied, 12.7761 seconds, 672 MB/s We have found dd to be quite trustworthy with [oi]flag=direct. > ---- > Using iSer, with the small transfer chunks, sgp_dd has numbers that are in line > with what I'd expect for DIO while sg_dd doesn't: > --------- > sgp_dd: 200.64 MB/s > sg_dd: 735.42 MB/s > dd: 62.3 MB/s > -------- > But for larger transfers (with 1M block transfers), both sgp_dd and sg_dd show > well above what I think I can be getting: > ------- > sgp_dd: 882.43 > sg_dd: 819.89 > dd: 731 MB/s #Which is still high, and which makes me suspect iSer We had iSER bouncing from low 200s through 1000 MB/s during testing. Very hard to pin down good stable benchmark times. This was a few months ago. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From charr at fusionio.com Thu May 15 08:50:28 2008 From: charr at fusionio.com (Cameron Harr) Date: Thu, 15 May 2008 09:50:28 -0600 Subject: [ofa-general] iSer and Direct IO In-Reply-To: <482C55E2.8060905@scalableinformatics.com> References: <482B7FE4.9070502@fusionio.com> <694d48600805150423n1a8b0efwf6d6596f8e7891ef@mail.gmail.com> <482C5293.5090005@fusionio.com> <482C55E2.8060905@scalableinformatics.com> Message-ID: <482C5BC4.6090301@fusionio.com> Joe Landman wrote: > This is only 8 GB of IO. It is possible that (despite dio) you are > caching. Make the IO much larger than RAM. Use a count of 128m or so. This is going to sound dumb, but I thought I had 4 GB of RAM and thus intentionally used a file size 2x my physical RAM. As it turns out, I have 32GB of RAM on the box (4G usually shows up as 38.... and I just saw the 3). Anyway, with a 64GB file the numbers are looking more accurate (and even low): 393.3 MB/s > We have found dd to be quite trustworthy with [oi]flag=direct. I like it too. 
At any rate, I'm going to need to do some new testing to avoid the ram size (might just set a mem limit on the boot line). There's still a bit of a discrepancy between IOP performance with iSer and srpt. Has anyone else done comparisons with the two? I think Erez was hoping to get some numbers before too long. Cameron From worleys at gmail.com Thu May 15 08:52:52 2008 From: worleys at gmail.com (Chris Worley) Date: Thu, 15 May 2008 09:52:52 -0600 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <20080515111914.GO24654@sashak.voltaire.com> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> Message-ID: Is there any command line utility to tell nodes that don't see the route change to "go ask the SM again for your routes"... or "clear the route table"? On Thu, May 15, 2008 at 5:19 AM, Sasha Khapyorsky wrote: > Hi Hal, > > On 04:19 Mon 12 May , Hal Rosenstock wrote: >> >> I filed this as bug 1031: >> https://bugs.openfabrics.org/show_bug.cgi?id=1031 >> >> > It would be nice if I could reproduce it in simulation. >> >> Yes, that would be nice; but I don't have a sim case. > > Do you have ibnetdiscover file for this case? If not from where report > is coming? > > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From landman at scalableinformatics.com Thu May 15 08:58:58 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 15 May 2008 11:58:58 -0400 Subject: [ofa-general] iSer and Direct IO In-Reply-To: <482C5BC4.6090301@fusionio.com> References: <482B7FE4.9070502@fusionio.com> <694d48600805150423n1a8b0efwf6d6596f8e7891ef@mail.gmail.com> <482C5293.5090005@fusionio.com> <482C55E2.8060905@scalableinformatics.com> <482C5BC4.6090301@fusionio.com> Message-ID: <482C5DC2.7000100@scalableinformatics.com> Cameron Harr wrote: > Joe Landman wrote: >> This is only 8 GB of IO. It is possible that (despite dio) you are >> caching. Make the IO much larger than RAM. Use a count of 128m or so. > > This is going to sound dumb, but I thought I had 4 GB of RAM and thus > intentionally used a file size 2x my physical RAM. As it turns out, I > have 32GB of RAM on the box (4G usually shows up as 38.... and I just > saw the 3). Anyway, with a 64GB file the numbers are looking more > accurate (and even low): > 393.3 MB/s This is about right. We were seeing ~650MB/s iSER for a 1.3 TB file dd on our units, but it bounced all over the place in terms of rates. Very hard to pin down a single performance number. Locally the drives were >750 MB/s, so 650 isn't terrible. >> We have found dd to be quite trustworthy with [oi]flag=direct. > I like it too. At any rate, I'm going to need to do some new testing to > avoid the ram size (might just set a mem limit on the boot line). > > There's still a bit of a discrepancy between IOP performance with iSer > and srpt. Has anyone else done comparisons with the two? I think Erez > was hoping to get some numbers before too long. > Cameron I think it might be coalescing the IOPs somehow (what do your elevators look like, how deep are your queues). Each drive can do 100-300 IOPs best case. 30000 IOPs is 100-300 drives. Or caching/coalescing/elevators in action. 
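As a cross-check on the caching question running through this thread, a small self-contained reader makes the O_DIRECT constraints explicit: the buffer must be sector-aligned (hence posix_memalign) and the run must be sized well past installed RAM so page-cache hits cannot inflate the rate. This is a generic sketch, not a tool anyone in the thread used, and the default device path is only an example taken from the earlier commands.

/* Generic O_DIRECT read-throughput check (illustrative only; not a tool
 * from this thread). Size the run well past RAM to defeat caching. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const size_t bs = 1 << 20;			/* 1 MiB per read */
	long long total = 0, limit = 64LL << 30;	/* 64 GiB, i.e. >> RAM */
	struct timespec t0, t1;
	void *buf;
	int fd = open(argc > 1 ? argv[1] : "/dev/fioa", O_RDONLY | O_DIRECT);

	if (fd < 0 || posix_memalign(&buf, 4096, bs))
		return 1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	while (total < limit) {
		ssize_t n = read(fd, buf, bs);	/* aligned buffer, aligned size */
		if (n <= 0)
			break;
		total += n;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%lld bytes in %.1f s = %.1f MB/s\n", total, secs, total / secs / 1e6);
	return 0;
}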
Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From hrosenstock at xsigo.com Thu May 15 09:10:27 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 15 May 2008 09:10:27 -0700 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> Message-ID: <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> Chris, On Thu, 2008-05-15 at 09:52 -0600, Chris Worley wrote: > Is there any command line utility to tell nodes that don't see the > route change to "go ask the SM again for your routes"... or "clear the > route table"? I'm not sure what you're asking. There is no route table at end nodes; only switch nodes and the SM maintains these. The end node only has path records which it has retrieved and perhaps cached. Path records should be refreshed when SM or local LID changes which are local events to the end node. -- Hal > On Thu, May 15, 2008 at 5:19 AM, Sasha Khapyorsky wrote: > > Hi Hal, > > > > On 04:19 Mon 12 May , Hal Rosenstock wrote: > >> > >> I filed this as bug 1031: > >> https://bugs.openfabrics.org/show_bug.cgi?id=1031 > >> > >> > It would be nice if I could reproduce it in simulation. > >> > >> Yes, that would be nice; but I don't have a sim case. > > > > Do you have ibnetdiscover file for this case? If not from where report > > is coming? > > > > Sasha > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From charr at fusionio.com Thu May 15 09:11:25 2008 From: charr at fusionio.com (Cameron Harr) Date: Thu, 15 May 2008 10:11:25 -0600 Subject: [ofa-general] iSer and Direct IO In-Reply-To: <482C5DC2.7000100@scalableinformatics.com> References: <482B7FE4.9070502@fusionio.com> <694d48600805150423n1a8b0efwf6d6596f8e7891ef@mail.gmail.com> <482C5293.5090005@fusionio.com> <482C55E2.8060905@scalableinformatics.com> <482C5BC4.6090301@fusionio.com> <482C5DC2.7000100@scalableinformatics.com> Message-ID: <482C60AD.9070109@fusionio.com> Joe Landman wrote: > > I think it might be coalescing the IOPs somehow (what do your > elevators look like, how deep are your queues). Each drive can do > 100-300 IOPs best case. 30000 IOPs is 100-300 drives. Or > caching/coalescing/elevators in action. > I'm actually using a single nand-flash device (Fusion IO's ioDrive) which can do up to 100K IOPs depending on the pattern and am using the default cfq elevator with all default values (queue depth 128). Not sure if you're looking for something more than that, but it's a pretty simple setup. 
Cameron From worleys at gmail.com Thu May 15 09:26:37 2008 From: worleys at gmail.com (Chris Worley) Date: Thu, 15 May 2008 10:26:37 -0600 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> Message-ID: On Thu, May 15, 2008 at 10:10 AM, Hal Rosenstock wrote: > Chris, > > On Thu, 2008-05-15 at 09:52 -0600, Chris Worley wrote: >> Is there any command line utility to tell nodes that don't see the >> route change to "go ask the SM again for your routes"... or "clear the >> route table"? > > I'm not sure what you're asking. There is no route table at end nodes; > only switch nodes and the SM maintains these. The end node only has path > records which it has retrieved and perhaps cached. Path records should > be refreshed when SM or local LID changes which are local events to the > end node. After an sm change (i.e. using the "-r" switch), nodes can't ping each other over IPoIB (other protocols also can't communicate). Restarting the OFED stack works, but modules won't unload if there was something active (i.e. Lustre), so the only recource to getting the OFED stack working again is a hard reboot. That's what I'd like to avoid if possible. Chris > > -- Hal > >> On Thu, May 15, 2008 at 5:19 AM, Sasha Khapyorsky wrote: >> > Hi Hal, >> > >> > On 04:19 Mon 12 May , Hal Rosenstock wrote: >> >> >> >> I filed this as bug 1031: >> >> https://bugs.openfabrics.org/show_bug.cgi?id=1031 >> >> >> >> > It would be nice if I could reproduce it in simulation. >> >> >> >> Yes, that would be nice; but I don't have a sim case. >> > >> > Do you have ibnetdiscover file for this case? If not from where report >> > is coming? >> > >> > Sasha >> > _______________________________________________ >> > general mailing list >> > general at lists.openfabrics.org >> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> > >> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> > >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From sean.hefty at intel.com Thu May 15 09:42:20 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 15 May 2008 09:42:20 -0700 Subject: [ofa-general] RE: [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs In-Reply-To: References: Message-ID: <000001c8b6aa$a5418810$bd59180a@amr.corp.intel.com> >+enum rdma_ha_mode { >+ RDMA_ALIGN_WITH_NETDEVICE = 1 >+}; >+ >+int rdma_set_high_availability_mode(struct rdma_cm_id *id, enum rdma_ha_mode >mode); I think we should just always report this event, and let users ignore it if they want. We don't seem to gain much by filtering the event at a lower level. 
- Sean From swise at opengridcomputing.com Thu May 15 09:45:24 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 May 2008 11:45:24 -0500 Subject: [ofa-general] RE: [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs In-Reply-To: <000001c8b6aa$a5418810$bd59180a@amr.corp.intel.com> References: <000001c8b6aa$a5418810$bd59180a@amr.corp.intel.com> Message-ID: <482C68A4.9020305@opengridcomputing.com> Sean Hefty wrote: >> +enum rdma_ha_mode { >> + RDMA_ALIGN_WITH_NETDEVICE = 1 >> +}; >> + >> +int rdma_set_high_availability_mode(struct rdma_cm_id *id, enum rdma_ha_mode >> mode); >> > > I think we should just always report this event, and let users ignore it if they > want. We don't seem to gain much by filtering the event at a lower level. > > - Sean > > Um, doesn't that then change the ABI? Some apps might hurl on a new (unexpected) event. Steve. From sean.hefty at intel.com Thu May 15 09:47:07 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 15 May 2008 09:47:07 -0700 Subject: [ofa-general] RE: [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: References: Message-ID: <000101c8b6ab$4fc9d7b0$bd59180a@amr.corp.intel.com> >+static int cma_netdev_align_id(struct net_device *ndev, struct rdma_id_private >*id_priv) >+{ >+ struct rdma_dev_addr *dev_addr; >+ struct cma_work *work; >+ >+ dev_addr = &id_priv->id.route.addr.dev_addr; >+ >+ if (!memcmp(dev_addr->src_dev_name, ndev->name, IFNAMSIZ) && >+ memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) { >+ printk(KERN_ERR "addr change for device %s used by id %p, >notifying\n", >+ ndev->name, &id_priv->id); >+ work = kzalloc(sizeof *work, GFP_KERNEL); >+ if (!work) >+ return -ENOMEM; >+ work->id = id_priv; >+ INIT_WORK(&work->work, cma_work_handler); >+ work->old_state = id_priv->state; >+ work->new_state = id_priv->state; >+ work->event.event = RDMA_CM_EVENT_NETDEV_CHANGE; >+ atomic_inc(&id_priv->refcount); >+ queue_work(cma_wq, &work->work); >+ } >+} My initial thought on this is to see if we can just queue a single work item that can be used to invoke the user callbacks. I'd have to see how the locking worked out though to know if that approach is 'cleaner'. Currently, the rdma_cm ensures that only a single callback to the user is invoked at a time. This is needed to support the user trying to destroy their rdma_cm_id from the callback. I didn't look to see if this still maintains that. - Sean From sean.hefty at intel.com Thu May 15 09:50:16 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 15 May 2008 09:50:16 -0700 Subject: [ofa-general] [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <482C4A52.4000501@opengridcomputing.com> References: <482C4A52.4000501@opengridcomputing.com> Message-ID: <000201c8b6ab$c0c66730$bd59180a@amr.corp.intel.com> >So instead of adding a new function rdma_set_high_availability_mode, you >could just set an option saying WANT_NETDEV_CHANGE_EVENTS. Maybe we >need to add rdma_set_option() to the kernel RDMA-CM API? > >IE make it more generic. I agree with this. Having a generic mechanism to report rare events would be useful. Maybe the device removal notification can be adapted for this purpose? 
- Sean From swise at opengridcomputing.com Thu May 15 09:55:06 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 May 2008 11:55:06 -0500 Subject: [ofa-general] [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <000201c8b6ab$c0c66730$bd59180a@amr.corp.intel.com> References: <482C4A52.4000501@opengridcomputing.com> <000201c8b6ab$c0c66730$bd59180a@amr.corp.intel.com> Message-ID: <482C6AEA.50109@opengridcomputing.com> Sean Hefty wrote: >> So instead of adding a new function rdma_set_high_availability_mode, you >> could just set an option saying WANT_NETDEV_CHANGE_EVENTS. Maybe we >> need to add rdma_set_option() to the kernel RDMA-CM API? >> >> IE make it more generic. >> > > I agree with this. Having a generic mechanism to report rare events would be > useful. Maybe the device removal notification can be adapted for this purpose? > > - Sean > Both of these events are device related... We could have a cm_id option that can be set that sez "i want device related events"... From sean.hefty at intel.com Thu May 15 09:59:12 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 15 May 2008 09:59:12 -0700 Subject: [ofa-general] [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <482C6AEA.50109@opengridcomputing.com> References: <482C4A52.4000501@opengridcomputing.com> <000201c8b6ab$c0c66730$bd59180a@amr.corp.intel.com> <482C6AEA.50109@opengridcomputing.com> Message-ID: <000301c8b6ac$ffd1aa60$bd59180a@amr.corp.intel.com> >Both of these events are device related... We could have a cm_id option >that can be set that sez "i want device related events"... I'm not sure you want to hide device removal events, since the user must destroy their rdma_cm_id in that case. From hrosenstock at xsigo.com Thu May 15 10:12:48 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 15 May 2008 10:12:48 -0700 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> Message-ID: <1210871569.12616.41.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-15 at 10:26 -0600, Chris Worley wrote: > On Thu, May 15, 2008 at 10:10 AM, Hal Rosenstock wrote: > > Chris, > > > > On Thu, 2008-05-15 at 09:52 -0600, Chris Worley wrote: > >> Is there any command line utility to tell nodes that don't see the > >> route change to "go ask the SM again for your routes"... or "clear the > >> route table"? > > > > I'm not sure what you're asking. There is no route table at end nodes; > > only switch nodes and the SM maintains these. The end node only has path > > records which it has retrieved and perhaps cached. Path records should > > be refreshed when SM or local LID changes which are local events to the > > end node. > > After an sm change (i.e. using the "-r" switch), That should be a local LID change. > nodes can't ping each > other over IPoIB (other protocols also can't communicate). Sounds like ULP issue(s) in handling this. What kernel and/or OFED version are you running ? -- Hal > Restarting the OFED stack works, but modules won't unload if there was > something active (i.e. Lustre), so the only recource to getting the > OFED stack working again is a hard reboot. > > That's what I'd like to avoid if possible. 
> > Chris > > > > -- Hal > > > >> On Thu, May 15, 2008 at 5:19 AM, Sasha Khapyorsky wrote: > >> > Hi Hal, > >> > > >> > On 04:19 Mon 12 May , Hal Rosenstock wrote: > >> >> > >> >> I filed this as bug 1031: > >> >> https://bugs.openfabrics.org/show_bug.cgi?id=1031 > >> >> > >> >> > It would be nice if I could reproduce it in simulation. > >> >> > >> >> Yes, that would be nice; but I don't have a sim case. > >> > > >> > Do you have ibnetdiscover file for this case? If not from where report > >> > is coming? > >> > > >> > Sasha > >> > _______________________________________________ > >> > general mailing list > >> > general at lists.openfabrics.org > >> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > > >> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > >> > > >> _______________________________________________ > >> general mailing list > >> general at lists.openfabrics.org > >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From weiny2 at llnl.gov Thu May 15 10:14:18 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 15 May 2008 10:14:18 -0700 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080515101418.4ccb53f3.weiny2@llnl.gov> On Thu, 15 May 2008 10:26:37 -0600 "Chris Worley" wrote: > On Thu, May 15, 2008 at 10:10 AM, Hal Rosenstock wrote: > > Chris, > > > > On Thu, 2008-05-15 at 09:52 -0600, Chris Worley wrote: > >> Is there any command line utility to tell nodes that don't see the > >> route change to "go ask the SM again for your routes"... or "clear the > >> route table"? > > > > I'm not sure what you're asking. There is no route table at end nodes; > > only switch nodes and the SM maintains these. The end node only has path > > records which it has retrieved and perhaps cached. Path records should > > be refreshed when SM or local LID changes which are local events to the > > end node. > > After an sm change (i.e. using the "-r" switch), nodes can't ping each > other over IPoIB (other protocols also can't communicate). Is it absolutely necessary to run with the "-r" switch? Here we have not problems letting the SM attempt to use the same LID's for nodes. Ira > > Restarting the OFED stack works, but modules won't unload if there was > something active (i.e. Lustre), so the only recource to getting the > OFED stack working again is a hard reboot. > > That's what I'd like to avoid if possible. > > Chris > > > > -- Hal > > > >> On Thu, May 15, 2008 at 5:19 AM, Sasha Khapyorsky wrote: > >> > Hi Hal, > >> > > >> > On 04:19 Mon 12 May , Hal Rosenstock wrote: > >> >> > >> >> I filed this as bug 1031: > >> >> https://bugs.openfabrics.org/show_bug.cgi?id=1031 > >> >> > >> >> > It would be nice if I could reproduce it in simulation. > >> >> > >> >> Yes, that would be nice; but I don't have a sim case. > >> > > >> > Do you have ibnetdiscover file for this case? If not from where report > >> > is coming? 
> >> > > >> > Sasha > >> > _______________________________________________ > >> > general mailing list > >> > general at lists.openfabrics.org > >> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > > >> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > >> > > >> _______________________________________________ > >> general mailing list > >> general at lists.openfabrics.org > >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From clameter at sgi.com Thu May 15 10:33:57 2008 From: clameter at sgi.com (Christoph Lameter) Date: Thu, 15 May 2008 10:33:57 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080515075747.GA7177@wotan.suse.de> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> Message-ID: On Thu, 15 May 2008, Nick Piggin wrote: > Oh, I get that confused because of the mixed up naming conventions > there: unmap_page_range should actually be called zap_page_range. But > at any rate, yes we can easily zap pagetables without holding mmap_sem. How is that synchronized with code that walks the same pagetable. These walks may not hold mmap_sem either. I would expect that one could only remove a portion of the pagetable where we have some sort of guarantee that no accesses occur. So the removal of the vma prior ensures that? From worleys at gmail.com Thu May 15 10:35:17 2008 From: worleys at gmail.com (Chris Worley) Date: Thu, 15 May 2008 11:35:17 -0600 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <1210871569.12616.41.camel@hrosenstock-ws.xsigo.com> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> <1210871569.12616.41.camel@hrosenstock-ws.xsigo.com> Message-ID: On Thu, May 15, 2008 at 11:12 AM, Hal Rosenstock wrote: > On Thu, 2008-05-15 at 10:26 -0600, Chris Worley wrote: >> On Thu, May 15, 2008 at 10:10 AM, Hal Rosenstock wrote: >> > Chris, >> > >> > On Thu, 2008-05-15 at 09:52 -0600, Chris Worley wrote: >> >> Is there any command line utility to tell nodes that don't see the >> >> route change to "go ask the SM again for your routes"... or "clear the >> >> route table"? >> > >> > I'm not sure what you're asking. There is no route table at end nodes; >> > only switch nodes and the SM maintains these. The end node only has path >> > records which it has retrieved and perhaps cached. Path records should >> > be refreshed when SM or local LID changes which are local events to the >> > end node. >> >> After an sm change (i.e. using the "-r" switch), > > That should be a local LID change. > >> nodes can't ping each >> other over IPoIB (other protocols also can't communicate). > > Sounds like ULP issue(s) in handling this. What kernel and/or OFED > version are you running ? 
Currently, the SM is running OFED 1.3 on an RHEL4 2.6.9-67.0.4 kernel with Lustre 1.6.4.2 changes. The compute nodes are running the same kernel w/ OFED 1.2.5.5... which will be upgraded to 1.3 by the end of the day. Chris > > -- Hal > >> Restarting the OFED stack works, but modules won't unload if there was >> something active (i.e. Lustre), so the only recource to getting the >> OFED stack working again is a hard reboot. >> >> That's what I'd like to avoid if possible. >> >> Chris >> > >> > -- Hal >> > >> >> On Thu, May 15, 2008 at 5:19 AM, Sasha Khapyorsky wrote: >> >> > Hi Hal, >> >> > >> >> > On 04:19 Mon 12 May , Hal Rosenstock wrote: >> >> >> >> >> >> I filed this as bug 1031: >> >> >> https://bugs.openfabrics.org/show_bug.cgi?id=1031 >> >> >> >> >> >> > It would be nice if I could reproduce it in simulation. >> >> >> >> >> >> Yes, that would be nice; but I don't have a sim case. >> >> > >> >> > Do you have ibnetdiscover file for this case? If not from where report >> >> > is coming? >> >> > >> >> > Sasha >> >> > _______________________________________________ >> >> > general mailing list >> >> > general at lists.openfabrics.org >> >> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> > >> >> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> >> > >> >> _______________________________________________ >> >> general mailing list >> >> general at lists.openfabrics.org >> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> > >> > > > From worleys at gmail.com Thu May 15 10:37:18 2008 From: worleys at gmail.com (Chris Worley) Date: Thu, 15 May 2008 11:37:18 -0600 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <20080515101418.4ccb53f3.weiny2@llnl.gov> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> <20080515101418.4ccb53f3.weiny2@llnl.gov> Message-ID: On Thu, May 15, 2008 at 11:14 AM, Ira Weiny wrote: > On Thu, 15 May 2008 10:26:37 -0600 > "Chris Worley" wrote: >> After an sm change (i.e. using the "-r" switch), nodes can't ping each >> other over IPoIB (other protocols also can't communicate). > > Is it absolutely necessary to run with the "-r" switch? Here we have not > problems letting the SM attempt to use the same LID's for nodes. yes, especially when chaging routing algorithms between the default and fat-tree. Chris From ralph.campbell at qlogic.com Thu May 15 10:40:43 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 15 May 2008 10:40:43 -0700 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: References: <20080514190532.28544.41595.stgit@dell3.ogc.int> Message-ID: <1210873243.3949.130.camel@brick.pathscale.com> On Thu, 2008-05-15 at 02:11 -0400, Talpey, Thomas wrote: > At 02:04 AM 5/15/2008, Talpey, Thomas wrote: > >At 07:54 PM 5/14/2008, Roland Dreier wrote: > >>Second question -- IB BMME and iWARP talk about a key portion (least > >>significant byte) of STag/L_Key/R_Key as being under consumer control. > >>Do we want to expose that as part of this API? Basically it means we > >>need to add a way for the consumer to pass in a new L_Key/STag as part > >>of a lot of calls. 
> > > >I think the Key portion is a quite useful way for the upper layer to > >salt the actual R_Keys as a protection mechanism, and having it would > >simplify a bunch of defensive code in the NFS/RDMA client. Currently, > >because the keys are provider-chosen and potentially recycled, there > >is a latent risk. > > > >But, I only want it if ALL future providers support it in some way. If a > >subset does not, it's not worth coding around the differences. > > I forgot to mention that the provider portion of the R_Key is reduced > to 24 bits as a result of exposing/requiring the key. This may cause an > issue at large scale, if the R_Keys have global scope. If they are limited > to use on specific connections as in iWARP, then this is less of an issue. > > Tom. For IB, the R_Keys are global and the spec. says that the user portion always has to be the lower 8 bits (ch 10.6.3.4) so it should be the same for all HCAs. From hrosenstock at xsigo.com Thu May 15 10:50:40 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 15 May 2008 10:50:40 -0700 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> <1210871569.12616.41.camel@hrosenstock-ws.xsigo.com> Message-ID: <1210873840.12616.45.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-15 at 11:35 -0600, Chris Worley wrote: > On Thu, May 15, 2008 at 11:12 AM, Hal Rosenstock wrote: > > On Thu, 2008-05-15 at 10:26 -0600, Chris Worley wrote: > >> On Thu, May 15, 2008 at 10:10 AM, Hal Rosenstock wrote: > >> > Chris, > >> > > >> > On Thu, 2008-05-15 at 09:52 -0600, Chris Worley wrote: > >> >> Is there any command line utility to tell nodes that don't see the > >> >> route change to "go ask the SM again for your routes"... or "clear the > >> >> route table"? > >> > > >> > I'm not sure what you're asking. There is no route table at end nodes; > >> > only switch nodes and the SM maintains these. The end node only has path > >> > records which it has retrieved and perhaps cached. Path records should > >> > be refreshed when SM or local LID changes which are local events to the > >> > end node. > >> > >> After an sm change (i.e. using the "-r" switch), > > > > That should be a local LID change. > > > >> nodes can't ping each > >> other over IPoIB (other protocols also can't communicate). > > > > Sounds like ULP issue(s) in handling this. What kernel and/or OFED > > version are you running ? > > Currently, the SM is running OFED 1.3 on an RHEL4 2.6.9-67.0.4 kernel > with Lustre 1.6.4.2 changes. > > The compute nodes are running the same kernel w/ OFED 1.2.5.5... which > will be upgraded to 1.3 by the end of the day. Maybe that will be better for LID change; Let us know. -- Hal > > Chris > > > > -- Hal > > > >> Restarting the OFED stack works, but modules won't unload if there was > >> something active (i.e. Lustre), so the only recource to getting the > >> OFED stack working again is a hard reboot. > >> > >> That's what I'd like to avoid if possible. 
> >> >> Chris
> >> >
> >> > -- Hal
> >> >
> >> >> On Thu, May 15, 2008 at 5:19 AM, Sasha Khapyorsky wrote:
> >> >> > Hi Hal,
> >> >> >
> >> >> > On 04:19 Mon 12 May , Hal Rosenstock wrote:
> >> >> >>
> >> >> >> I filed this as bug 1031:
> >> >> >> https://bugs.openfabrics.org/show_bug.cgi?id=1031
> >> >> >>
> >> >> >> > It would be nice if I could reproduce it in simulation.
> >> >> >>
> >> >> >> Yes, that would be nice; but I don't have a sim case.
> >> >> >
> >> >> > Do you have ibnetdiscover file for this case? If not from where report
> >> >> > is coming?
> >> >> >
> >> >> > Sasha
> >> >> > _______________________________________________
> >> >> > general mailing list
> >> >> > general at lists.openfabrics.org
> >> >> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >> >> >
> >> >> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> >> >> >
> >> >> _______________________________________________
> >> >> general mailing list
> >> >> general at lists.openfabrics.org
> >> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >> >>
> >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> >> >
> >> >
> >
>

From swise at opengridcomputing.com Thu May 15 11:17:34 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 15 May 2008 13:17:34 -0500
Subject: [ofa-general] [PATCH RFC v2] RDMA: New Memory Extensions.
Message-ID: <20080515181734.21020.47137.stgit@dell3.ogc.int>

The following patch proposes the API and core changes needed to implement
the IB BMME and iWARP equivalent memory extensions. Please review these vs
the verbs specs and see what I've missed. This patch is a request for
comments.

Steve.

Changes since Version 1:

- ib_alloc_mr() -> ib_alloc_fast_reg_mr()
- pbl_depth -> max_page_list_len
- page_list_len -> max_page_list_len where it makes sense
- int -> unsigned int where needed
- fbo -> first_byte_offset
- added page size and page_list_len to fast_reg union in ib_send_wr
- rearranged work request fast_reg union of ib_send_wr to pack it
- dropped remove_access parameter from ib_alloc_fast_reg_mr()
- IB_DEVICE_MM_EXTENSIONS -> IB_DEVICE_MEM_MGT_EXTENSIONS
- compiled

-----

RDMA: New Memory Extensions.

Support for the IB BMME and iWARP equivalent memory extensions to
non-shared memory regions. This includes:

- allocation of an ib_mr for use in fast register work requests
- device-specific alloc/free of physical buffer lists for use in fast
  register work requests. This allows devices to allocate this memory as
  needed (like via dma_alloc_coherent).
- fast register memory region work request
- invalidate local memory region work request
- read with invalidate local memory region work request (iWARP only)

Design details:

- New device capability flag added: IB_DEVICE_MEM_MGT_EXTENSIONS indicates
  device support for this feature.
- New send WR opcode IB_WR_FAST_REG_MR used to issue a fast_reg request.
- New send WR opcode IB_WR_INVALIDATE_MR used to invalidate a fast_reg mr.
- New API function, ib_alloc_fast_reg_mr() used to allocate fast_reg
  memory regions.
- New API function, ib_alloc_fast_reg_page_list to allocate device-specific
  page lists.
- New API function, ib_free_fast_reg_page_list to free said page lists.

Usage Model:

- MR allocated with ib_alloc_fast_reg_mr()
- Page lists allocated via ib_alloc_fast_reg_page_list().
- MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) - MR deallocated with ib_dereg_mr() - page lists dealloced via ib_free_fast_reg_page_list(). Applications can allocate a fast_reg mr once, and then can repeatedly bind the mr to different physical memory SGLs via posting work requests to the send queue. For each outstanding mr-to-pbl binding in the SQ pipe, a fast_reg_page_list needs to be allocated. Thus pipelining can be achieved while still allowing device-specific page_list processing. --- drivers/infiniband/core/verbs.c | 46 +++++++++++++++++++++++++++++++++ include/rdma/ib_verbs.h | 55 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 101 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 0504208..0a334b4 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -755,6 +755,52 @@ int ib_dereg_mr(struct ib_mr *mr) } EXPORT_SYMBOL(ib_dereg_mr); +struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len) +{ + struct ib_mr *mr; + + if (!pd->device->alloc_fast_reg_mr) + return ERR_PTR(-ENOSYS); + + mr = pd->device->alloc_fast_reg_mr(pd, max_page_list_len); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + mr->uobject = NULL; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_alloc_fast_reg_mr); + +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( + struct ib_device *device, int max_page_list_len) +{ + struct ib_fast_reg_page_list *page_list; + + if (!device->alloc_fast_reg_page_list) + return ERR_PTR(-ENOSYS); + + page_list = device->alloc_fast_reg_page_list(device, max_page_list_len); + + if (!IS_ERR(page_list)) { + page_list->device = device; + page_list->max_page_list_len = max_page_list_len; + } + + return page_list; +} +EXPORT_SYMBOL(ib_alloc_fast_reg_page_list); + +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list) +{ + page_list->device->free_fast_reg_page_list(page_list); +} +EXPORT_SYMBOL(ib_free_fast_reg_page_list); + /* Memory windows */ struct ib_mw *ib_alloc_mw(struct ib_pd *pd) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..cbef5a6 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -106,6 +106,7 @@ enum ib_device_cap_flags { IB_DEVICE_UD_IP_CSUM = (1<<18), IB_DEVICE_UD_TSO = (1<<19), IB_DEVICE_SEND_W_INV = (1<<21), + IB_DEVICE_MEM_MGT_EXTENSIONS = (1<<22), }; enum ib_atomic_cap { @@ -414,6 +415,8 @@ enum ib_wc_opcode { IB_WC_FETCH_ADD, IB_WC_BIND_MW, IB_WC_LSO, + IB_WC_FAST_REG_MR, + IB_WC_INVALIDATE_MR, /* * Set value of IB_WC_RECV so consumers can test if a completion is a * receive by testing (opcode & IB_WC_RECV). 
@@ -628,6 +631,9 @@ enum ib_wr_opcode { IB_WR_ATOMIC_FETCH_AND_ADD, IB_WR_LSO, IB_WR_SEND_WITH_INV, + IB_WR_FAST_REG_MR, + IB_WR_INVALIDATE_MR, + IB_WR_READ_WITH_INV, }; enum ib_send_flags { @@ -676,6 +682,20 @@ struct ib_send_wr { u16 pkey_index; /* valid for GSI only */ u8 port_num; /* valid for DR SMPs on switch only */ } ud; + struct { + u64 iova_start; + struct ib_mr *mr; + struct ib_fast_reg_page_list *page_list; + unsigned int page_size; + unsigned int page_list_len; + unsigned int first_byte_offset; + u32 length; + int access_flags; + + } fast_reg; + struct { + struct ib_mr *mr; + } local_inv; } wr; }; @@ -1014,6 +1034,10 @@ struct ib_device { int (*query_mr)(struct ib_mr *mr, struct ib_mr_attr *mr_attr); int (*dereg_mr)(struct ib_mr *mr); + struct ib_mr * (*alloc_fast_reg_mr)(struct ib_pd *pd, + int max_page_list_len); + struct ib_fast_reg_page_list * (*alloc_fast_reg_page_list)(struct ib_device *device, int page_list_len); + void (*free_fast_reg_page_list)(struct ib_fast_reg_page_list *page_list); int (*rereg_phys_mr)(struct ib_mr *mr, int mr_rereg_mask, struct ib_pd *pd, @@ -1808,6 +1832,37 @@ int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); int ib_dereg_mr(struct ib_mr *mr); /** + * ib_alloc_fast_reg_mr - Allocates memory region usable with the + * IB_WR_FAST_REG_MR send work request. + * @pd: The protection domain associated with the region. + * @max_page_list_len: requested max physical buffer list size to be allocated. + */ +struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len); + +struct ib_fast_reg_page_list { + struct ib_device *device; + u64 *page_list; + unsigned int max_page_list_len; +}; + +/** + * ib_alloc_fast_reg_page_list - Allocates a page list array to be used + * in a IB_WR_FAST_REG_MR work request. The resources allocated by this method + * allows for dev-specific optimization of the FAST_REG operation. + * @device - ib device pointer. + * @page_list_len - depth of the page list array to be allocated. + */ +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( + struct ib_device *device, int page_list_len); + +/** + * ib_free_fast_reg_page_list - Deallocates a previously allocated + * page list array. + * @page_list - struct ib_fast_reg_page_list pointer to be deallocated. + */ +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list); + +/** * ib_alloc_mw - Allocates a memory window. * @pd: The protection domain associated with the memory window. */ From ralph.campbell at qlogic.com Thu May 15 11:18:59 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 15 May 2008 11:18:59 -0700 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> <482B8C5A.5020904@opengridcomputing.com> Message-ID: <1210875539.3949.153.camel@brick.pathscale.com> On Wed, 2008-05-14 at 19:49 -0700, Roland Dreier wrote: > > So you want the page size specified in the fast_reg_page_list as > > opposed to when the page list is bound to the fast_reg mr (via > > post_send)? > > It's kind of the same thing, since the fast_reg_page_list is part of the > send work request... the structures you have at the moment are: > > > + struct { > > + u64 iova_start; > > + struct ib_fast_reg_page_list *page_list; > > + int fbo; > > + u32 length; > > + int access_flags; > > + struct ib_mr *mr; > > (side note... 
move this pointer up with the other pointers, so you don't > end up with a hole in the structure due to alignment... or stick an int > page_size in to fill the hole) > > > + } fast_reg; > > > +struct ib_fast_reg_page_list { > > + struct ib_device *device; > > + u64 *page_list; > > + int page_list_len; > > +}; > > is page_list_len the maximum length of the page_list, or is it filled in > by the consumer? The driver could figure out the length of the > page_list for any given work request by looking at the MR length and the > page_size I suppose. > > - R. I think Roland and Steve misunderstood what I was asking about the struct ib_fast_reg_page_list * returned from ib_alloc_fast_reg_page_list(). The question is "what can the caller do with the pointer?" Clearly, the caller can pass the pointer to ib_post_send(IB_WR_FAST_REG_MR) and use the [LR]_Key in the normal ways. Can the caller dereference the pointer and look at the values in page_list[]? Are these values understood to be a physical addresses that can be passed to phys_to_virt() for example? Are they byte addresses always aligned to a page boundary? The reason I ask is that the address used with the [LR]_Key from ib_get_dma_mr() has to be translated with ib_dma_map_single(), etc. because the ipath driver doesn't necessarily use physical addresses for the address in the send WQEs. Normally, the address in the send WQE is a kernel virtual address so the ib_ipath driver can memcpy() the data to the chip. Lets say that ib_ipath uses vmalloc() to allocate the pages instead of dma_alloc_coherent(). As long as the ULP only uses the page_list values as an uninterpreted number that is passed back to the driver via subsequent verbs calls, it wouldn't matter to the ULP what the number represents. But if the ULP expects to be able to call some other kernel function to map or translate that value, then the ULP has to know what kind of number it represents, its size and alignment, etc. From swise at opengridcomputing.com Thu May 15 11:39:25 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 May 2008 13:39:25 -0500 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: <1210875539.3949.153.camel@brick.pathscale.com> References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> <482B8C5A.5020904@opengridcomputing.com> <1210875539.3949.153.camel@brick.pathscale.com> Message-ID: <482C835D.50401@opengridcomputing.com> Ralph Campbell wrote: > On Wed, 2008-05-14 at 19:49 -0700, Roland Dreier wrote: > >> > So you want the page size specified in the fast_reg_page_list as >> > opposed to when the page list is bound to the fast_reg mr (via >> > post_send)? >> >> It's kind of the same thing, since the fast_reg_page_list is part of the >> send work request... the structures you have at the moment are: >> >> > + struct { >> > + u64 iova_start; >> > + struct ib_fast_reg_page_list *page_list; >> > + int fbo; >> > + u32 length; >> > + int access_flags; >> > + struct ib_mr *mr; >> >> (side note... move this pointer up with the other pointers, so you don't >> end up with a hole in the structure due to alignment... or stick an int >> page_size in to fill the hole) >> >> > + } fast_reg; >> >> > +struct ib_fast_reg_page_list { >> > + struct ib_device *device; >> > + u64 *page_list; >> > + int page_list_len; >> > +}; >> >> is page_list_len the maximum length of the page_list, or is it filled in >> by the consumer? 
The driver could figure out the length of the >> page_list for any given work request by looking at the MR length and the >> page_size I suppose. >> >> - R. >> > > I think Roland and Steve misunderstood what I was asking about > the struct ib_fast_reg_page_list * returned from > ib_alloc_fast_reg_page_list(). > > The question is "what can the caller do with the pointer?" > Clearly, the caller can pass the pointer to > ib_post_send(IB_WR_FAST_REG_MR) and use the [LR]_Key in the > normal ways. > > Can the caller dereference the pointer and look at the > values in page_list[]? Are these values understood to be > a physical addresses that can be passed to phys_to_virt() for example? > Are they byte addresses always aligned to a page boundary? > > The caller must _fill in_ the values in the page list. That's the whole point. IE all this func is doing is allocating the _memory_ to store the page list that the caller is building. The special function is needed because some devices might need to DMA the page list array from this memory as part of processing the FAST_REG_MR work request, and thus needs to allocate it dma coherently. The pointer returned is a kernel virtual address and can be read from/written to by the caller. > The reason I ask is that the address used with the [LR]_Key from > ib_get_dma_mr() has to be translated with ib_dma_map_single(), etc. > because the ipath driver doesn't necessarily use physical addresses > for the address in the send WQEs. Normally, the address in the > send WQE is a kernel virtual address so the ib_ipath driver can > memcpy() the data to the chip. > > Lets say that ib_ipath uses vmalloc() to allocate the pages > instead of dma_alloc_coherent(). As long as the ULP only uses > the page_list values as an uninterpreted number that is passed > back to the driver via subsequent verbs calls, it wouldn't > matter to the ULP what the number represents. But if the ULP > expects to be able to call some other kernel function to > map or translate that value, then the ULP has to know what > kind of number it represents, its size and alignment, etc. > We're not talking about allocating the pages themselves. 
Here's an example (ignoring errors): page_list = ib_alloc_fast_reg_page_list(device, 1); v = get_free_page(GFP_KERNEL); page_list->page_list[0] = ib_dma_map_single(device, v, PAGE_SIZE, DMA_TO_DEVICE|DMA_FROM_DEVICE); wr.opcode = IB_WR_FAST_REG_MR; wr.next = NULL; wr.send_flags = 0; wr.wr_id = 0xdeadbeef; wr.wr.fast_reg.mr = mr; wr.wr.fast_reg.page_list = page_list; wr.wr.fast_reg.page_size = PAGE_SIZE; wr.wr.fast_reg.page_list_len = 1; wr.wr.fast_reg.first_byte_offset = 0; wr.wr.fast_reg.iova_start = (u64)v; wr.wr.fast_reg.length = PAGE_SIZE; wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE; ib_post_send(qp, &wr, &bad_wr); From worleys at gmail.com Thu May 15 11:45:08 2008 From: worleys at gmail.com (Chris Worley) Date: Thu, 15 May 2008 12:45:08 -0600 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <1210873840.12616.45.camel@hrosenstock-ws.xsigo.com> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> <1210871569.12616.41.camel@hrosenstock-ws.xsigo.com> <1210873840.12616.45.camel@hrosenstock-ws.xsigo.com> Message-ID: On Thu, May 15, 2008 at 11:50 AM, Hal Rosenstock wrote: > On Thu, 2008-05-15 at 11:35 -0600, Chris Worley wrote: >> On Thu, May 15, 2008 at 11:12 AM, Hal Rosenstock wrote: >> > On Thu, 2008-05-15 at 10:26 -0600, Chris Worley wrote: >> >> On Thu, May 15, 2008 at 10:10 AM, Hal Rosenstock wrote: >> >> > Chris, >> >> > >> >> > On Thu, 2008-05-15 at 09:52 -0600, Chris Worley wrote: >> >> >> Is there any command line utility to tell nodes that don't see the >> >> >> route change to "go ask the SM again for your routes"... or "clear the >> >> >> route table"? >> >> > >> >> > I'm not sure what you're asking. There is no route table at end nodes; >> >> > only switch nodes and the SM maintains these. The end node only has path >> >> > records which it has retrieved and perhaps cached. Path records should >> >> > be refreshed when SM or local LID changes which are local events to the >> >> > end node. >> >> >> >> After an sm change (i.e. using the "-r" switch), >> > >> > That should be a local LID change. >> > >> >> nodes can't ping each >> >> other over IPoIB (other protocols also can't communicate). >> > >> > Sounds like ULP issue(s) in handling this. What kernel and/or OFED >> > version are you running ? >> >> Currently, the SM is running OFED 1.3 on an RHEL4 2.6.9-67.0.4 kernel >> with Lustre 1.6.4.2 changes. >> >> The compute nodes are running the same kernel w/ OFED 1.2.5.5... which >> will be upgraded to 1.3 by the end of the day. > > Maybe that will be better for LID change; Let us know. Unfortunately, it isn't a good day to test; a critical job is running. After upgrading all but the nodes the critical job was running on, I found the opensmd hung, the rebooted nodes were not getting initialized, I had to "kill -9" it. Upon opensmd restart, I couldn't risk using the "-r" switch, but, w/o it, the fat-tree came up w/o error. Chris From ralph.campbell at qlogic.com Thu May 15 11:53:17 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 15 May 2008 11:53:17 -0700 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. 
In-Reply-To: <482C835D.50401@opengridcomputing.com> References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> <482B8C5A.5020904@opengridcomputing.com> <1210875539.3949.153.camel@brick.pathscale.com> <482C835D.50401@opengridcomputing.com> Message-ID: <1210877597.3949.158.camel@brick.pathscale.com> On Thu, 2008-05-15 at 13:39 -0500, Steve Wise wrote: > Ralph Campbell wrote: > > On Wed, 2008-05-14 at 19:49 -0700, Roland Dreier wrote: > > > >> > So you want the page size specified in the fast_reg_page_list as > >> > opposed to when the page list is bound to the fast_reg mr (via > >> > post_send)? > >> > >> It's kind of the same thing, since the fast_reg_page_list is part of the > >> send work request... the structures you have at the moment are: > >> > >> > + struct { > >> > + u64 iova_start; > >> > + struct ib_fast_reg_page_list *page_list; > >> > + int fbo; > >> > + u32 length; > >> > + int access_flags; > >> > + struct ib_mr *mr; > >> > >> (side note... move this pointer up with the other pointers, so you don't > >> end up with a hole in the structure due to alignment... or stick an int > >> page_size in to fill the hole) > >> > >> > + } fast_reg; > >> > >> > +struct ib_fast_reg_page_list { > >> > + struct ib_device *device; > >> > + u64 *page_list; > >> > + int page_list_len; > >> > +}; > >> > >> is page_list_len the maximum length of the page_list, or is it filled in > >> by the consumer? The driver could figure out the length of the > >> page_list for any given work request by looking at the MR length and the > >> page_size I suppose. > >> > >> - R. > >> > > > > I think Roland and Steve misunderstood what I was asking about > > the struct ib_fast_reg_page_list * returned from > > ib_alloc_fast_reg_page_list(). > > > > The question is "what can the caller do with the pointer?" > > Clearly, the caller can pass the pointer to > > ib_post_send(IB_WR_FAST_REG_MR) and use the [LR]_Key in the > > normal ways. > > > > Can the caller dereference the pointer and look at the > > values in page_list[]? Are these values understood to be > > a physical addresses that can be passed to phys_to_virt() for example? > > Are they byte addresses always aligned to a page boundary? > > > > > > The caller must _fill in_ the values in the page list. That's the whole > point. IE all this func is doing is allocating the _memory_ to store > the page list that the caller is building. The special function is > needed because some devices might need to DMA the page list array from > this memory as part of processing the FAST_REG_MR work request, and thus > needs to allocate it dma coherently. The pointer returned is a kernel > virtual address and can be read from/written to by the caller. > > > The reason I ask is that the address used with the [LR]_Key from > > ib_get_dma_mr() has to be translated with ib_dma_map_single(), etc. > > because the ipath driver doesn't necessarily use physical addresses > > for the address in the send WQEs. Normally, the address in the > > send WQE is a kernel virtual address so the ib_ipath driver can > > memcpy() the data to the chip. > > > > > Lets say that ib_ipath uses vmalloc() to allocate the pages > > instead of dma_alloc_coherent(). As long as the ULP only uses > > the page_list values as an uninterpreted number that is passed > > back to the driver via subsequent verbs calls, it wouldn't > > matter to the ULP what the number represents. 
But if the ULP > > expects to be able to call some other kernel function to > > map or translate that value, then the ULP has to know what > > kind of number it represents, its size and alignment, etc. > > > > > We're not talking about allocating the pages themselves. > > Here's an example (ignoring errors): > > page_list = ib_alloc_fast_reg_page_list(device, 1); > > v = get_free_page(GFP_KERNEL); > > page_list->page_list[0] = ib_dma_map_single(device, v, PAGE_SIZE, > > DMA_TO_DEVICE|DMA_FROM_DEVICE); > > wr.opcode = IB_WR_FAST_REG_MR; > wr.next = NULL; > wr.send_flags = 0; > wr.wr_id = 0xdeadbeef; > wr.wr.fast_reg.mr = mr; > wr.wr.fast_reg.page_list = page_list; > wr.wr.fast_reg.page_size = PAGE_SIZE; > wr.wr.fast_reg.page_list_len = 1; > wr.wr.fast_reg.first_byte_offset = 0; > wr.wr.fast_reg.iova_start = (u64)v; > wr.wr.fast_reg.length = PAGE_SIZE; > wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE | > > IB_ACCESS_REMOTE_READ | > > IB_ACCESS_REMOTE_WRITE; > > ib_post_send(qp, &wr, &bad_wr); OK. Thanks for clarifying. This wasn't clear to me from the original description but I understand now. From swise at opengridcomputing.com Thu May 15 12:05:45 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 May 2008 14:05:45 -0500 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: <1210877597.3949.158.camel@brick.pathscale.com> References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> <482B8C5A.5020904@opengridcomputing.com> <1210875539.3949.153.camel@brick.pathscale.com> <482C835D.50401@opengridcomputing.com> <1210877597.3949.158.camel@brick.pathscale.com> Message-ID: <482C8989.10305@opengridcomputing.com> >> We're not talking about allocating the pages themselves. >> >> Here's an example (ignoring errors): >> >> page_list = ib_alloc_fast_reg_page_list(device, 1); >> >> v = get_free_page(GFP_KERNEL); >> >> page_list->page_list[0] = ib_dma_map_single(device, v, PAGE_SIZE, >> >> DMA_TO_DEVICE|DMA_FROM_DEVICE); >> >> wr.opcode = IB_WR_FAST_REG_MR; >> wr.next = NULL; >> wr.send_flags = 0; >> wr.wr_id = 0xdeadbeef; >> wr.wr.fast_reg.mr = mr; >> wr.wr.fast_reg.page_list = page_list; >> wr.wr.fast_reg.page_size = PAGE_SIZE; >> wr.wr.fast_reg.page_list_len = 1; >> wr.wr.fast_reg.first_byte_offset = 0; >> wr.wr.fast_reg.iova_start = (u64)v; >> wr.wr.fast_reg.length = PAGE_SIZE; >> wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE | >> >> IB_ACCESS_REMOTE_READ | >> >> IB_ACCESS_REMOTE_WRITE; >> >> ib_post_send(qp, &wr, &bad_wr); >> > > OK. Thanks for clarifying. This wasn't clear to me from the > original description but I understand now. > Perhaps ib_alloc_fast_reg_page_list() isn't clear. Maybe ib_alloc_fast_reg_page_list_mem() is better? That's getting too long for my taste, but if others thing it helps... I'll change it. Steve. From ralph.campbell at qlogic.com Thu May 15 12:37:25 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 15 May 2008 12:37:25 -0700 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. 
In-Reply-To: <482C8989.10305@opengridcomputing.com> References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> <482B8C5A.5020904@opengridcomputing.com> <1210875539.3949.153.camel@brick.pathscale.com> <482C835D.50401@opengridcomputing.com> <1210877597.3949.158.camel@brick.pathscale.com> <482C8989.10305@opengridcomputing.com> Message-ID: <1210880245.3949.180.camel@brick.pathscale.com> On Thu, 2008-05-15 at 14:05 -0500, Steve Wise wrote: > >> We're not talking about allocating the pages themselves. > >> > >> Here's an example (ignoring errors): > >> > >> page_list = ib_alloc_fast_reg_page_list(device, 1); > >> > >> v = get_free_page(GFP_KERNEL); > >> > >> page_list->page_list[0] = ib_dma_map_single(device, v, PAGE_SIZE, > >> > >> DMA_TO_DEVICE|DMA_FROM_DEVICE); > >> > >> wr.opcode = IB_WR_FAST_REG_MR; > >> wr.next = NULL; > >> wr.send_flags = 0; > >> wr.wr_id = 0xdeadbeef; > >> wr.wr.fast_reg.mr = mr; > >> wr.wr.fast_reg.page_list = page_list; > >> wr.wr.fast_reg.page_size = PAGE_SIZE; > >> wr.wr.fast_reg.page_list_len = 1; > >> wr.wr.fast_reg.first_byte_offset = 0; > >> wr.wr.fast_reg.iova_start = (u64)v; > >> wr.wr.fast_reg.length = PAGE_SIZE; > >> wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE | > >> > >> IB_ACCESS_REMOTE_READ | > >> > >> IB_ACCESS_REMOTE_WRITE; > >> > >> ib_post_send(qp, &wr, &bad_wr); > >> > > > > OK. Thanks for clarifying. This wasn't clear to me from the > > original description but I understand now. > > > > Perhaps ib_alloc_fast_reg_page_list() isn't clear. Maybe > ib_alloc_fast_reg_page_list_mem() is better? That's getting too long > for my taste, but if others thing it helps... I'll change it. At a minimum, I would change the comments for the function in ib_verbs.h: +/** + * ib_alloc_fast_reg_page_list - Allocates a page list array + * @device - ib device pointer. + * @page_list_len - size of the page list array to be allocated. + * + * This allocates and returns a struct ib_fast_reg_page_list * + * and a page_list array that is at least page_list_len in size. + * The actual size is returned in max_page_list_len. + * The caller is responsible for initializing the contents of the + * page_list array before posting a send work request with the + * IB_WC_FAST_REG_MR opcode. The page_list array entries must be + * translated using one of the ib_dma_*() functions similar to the + * addresses passed to ib_map_phys_fmr(). Once the ib_post_send() + * is issued, the struct ib_fast_reg_page_list should not be modified + * by the caller until a completion notice is returned by the device. + */ From swise at opengridcomputing.com Thu May 15 12:41:43 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 May 2008 14:41:43 -0500 Subject: [ofa-general] [PATCH RFC v2] RDMA: New Memory Extensions. In-Reply-To: <20080515181734.21020.47137.stgit@dell3.ogc.int> References: <20080515181734.21020.47137.stgit@dell3.ogc.int> Message-ID: <482C91F7.60708@opengridcomputing.com> BTW: I think we need a way for users to query the device to know the max page_list_length that can be handled in a FAST_REG_MR work request. In other words, a device attribute. Steve. From weiny2 at llnl.gov Thu May 15 13:27:21 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 15 May 2008 13:27:21 -0700 Subject: [ofa-general] [PATCH] OpenSM: Fix rpm build, /opensm/opensm.conf failed to install Message-ID: <20080515132721.37644ade.weiny2@llnl.gov> Sasha, I found this while trying to add the Performance Manager HOWTO to the rpm. 
Therefore, I think this will conflict slightly with that patch. If you like I can resubmit that patch after you apply this. Thanks, Ira >From 8453b86e94175ff3054a57c5c50e337a96d536bd Mon Sep 17 00:00:00 2001 From: Ira K. Weiny Date: Thu, 15 May 2008 13:13:16 -0700 Subject: [PATCH] Fix rpm build, /opensm/opensm.conf failed to install Signed-off-by: Ira K. Weiny --- opensm/configure.in | 9 +++++++-- opensm/opensm.spec.in | 8 ++++---- 2 files changed, 11 insertions(+), 6 deletions(-) diff --git a/opensm/configure.in b/opensm/configure.in index d36d7be..2ae8bd0 100644 --- a/opensm/configure.in +++ b/opensm/configure.in @@ -87,7 +87,7 @@ conf_dir_tmp1="`eval echo ${sysconfdir} | sed 's/^NONE/$ac_default_prefix/'`" SYS_CONFIG_DIR="`eval echo $conf_dir_tmp1`" dnl Check for a different subdir for the config files. -OPENSM_CONFIG_DIR=$SYS_CONFIG_DIR/opensm +OPENSM_CONFIG_SUB_DIR=opensm AC_MSG_CHECKING(for --with-opensm-conf-sub-dir) AC_ARG_WITH(opensm-conf-sub-dir, AC_HELP_STRING([--with-opensm-conf-sub-dir=dir], @@ -96,10 +96,15 @@ AC_ARG_WITH(opensm-conf-sub-dir, no) ;; *) - OPENSM_CONFIG_DIR=$SYS_CONFIG_DIR/$withval + OPENSM_CONFIG_SUB_DIR=$withval ;; esac ] ) +dnl this needs to be configured for rpmbuilds separate from the full path +dnl "OPENSM_CONFIG_DIR" +AC_SUBST(OPENSM_CONFIG_SUB_DIR) + +OPENSM_CONFIG_DIR=$SYS_CONFIG_DIR/$OPENSM_CONFIG_SUB_DIR AC_MSG_RESULT($OPENSM_CONFIG_DIR) AC_DEFINE_UNQUOTED(OPENSM_CONFIG_DIR, ["$OPENSM_CONFIG_DIR"], diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in index feabfef..b439323 100644 --- a/opensm/opensm.spec.in +++ b/opensm/opensm.spec.in @@ -94,9 +94,9 @@ if [ -f /etc/redhat-release -o -s /etc/redhat-release ]; then else REDHAT="" fi -mkdir -p $etc/{init.d,logrotate.d} @OPENSM_CONFIG_DIR@ +mkdir -p $etc/{init.d,logrotate.d} $etc/@OPENSM_CONFIG_SUB_DIR@ install -m 755 scripts/${REDHAT}opensm.init $etc/init.d/opensmd -install -m 644 scripts/opensm.conf @OPENSM_CONFIG_DIR@/opensm.conf +install -m 644 scripts/opensm.conf $etc/@OPENSM_CONFIG_SUB_DIR@/opensm.conf install -m 644 scripts/opensm.logrotate $etc/logrotate.d/opensm install -m 755 scripts/sldd.sh $RPM_BUILD_ROOT%{_sbindir}/sldd.sh @@ -128,10 +128,10 @@ fi %doc AUTHORS COPYING README %{_sysconfdir}/init.d/opensmd %{_sbindir}/sldd.sh -%config(noreplace) @OPENSM_CONFIG_DIR@/opensm.conf +%config(noreplace) %{_sysconfdir}/@OPENSM_CONFIG_SUB_DIR@/opensm.conf %config(noreplace) %{_sysconfdir}/logrotate.d/opensm %dir /var/cache/opensm -%dir @OPENSM_CONFIG_DIR@ +%dir %{_sysconfdir}/@OPENSM_CONFIG_SUB_DIR@ %files libs %defattr(-,root,root,-) -- 1.5.1 From weiny2 at llnl.gov Thu May 15 13:27:23 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 15 May 2008 13:27:23 -0700 Subject: [ofa-general] [PATCH] OpenSM: Add a Performance Manager HOWTO to the docs and the dist Message-ID: <20080515132723.3add7c6a.weiny2@llnl.gov> There seems to be a lot of questions on the list about how to gather port counters. The Performance Manager included in OpenSM, v3.1.X (OFED 1.3) can be used to collect these counters in one place. I decided to write a little HOWTO to help people to set it up. Patch is attached, Ira >From bfc303f76a40fb5e3a9cf2c01c16c25c517c8ddd Mon Sep 17 00:00:00 2001 From: Ira K. Weiny Date: Thu, 15 May 2008 08:19:17 -0700 Subject: [PATCH] Add a Performance Manager HOWTO to the docs and the dist Signed-off-by: Ira K. 
Weiny
---
 opensm/Makefile.am                       |    3 +-
 opensm/doc/performance-manager-HOWTO.txt |  153 ++++++++++++++++++++++++++++++
 opensm/opensm.spec.in                    |    2 +-
 3 files changed, 156 insertions(+), 2 deletions(-)
 create mode 100644 opensm/doc/performance-manager-HOWTO.txt

diff --git a/opensm/Makefile.am b/opensm/Makefile.am
index 3811963..4c79f49 100644
--- a/opensm/Makefile.am
+++ b/opensm/Makefile.am
@@ -24,8 +24,9 @@ endif
 man_MANS = man/opensm.8 man/osmtest.8

 various_scripts = $(wildcard scripts/*)
+docs = doc/performance-manager-HOWTO.txt

-EXTRA_DIST = autogen.sh opensm.spec $(various_scripts) $(man_MANS)
+EXTRA_DIST = autogen.sh opensm.spec $(various_scripts) $(man_MANS) $(docs)

 dist-hook: $(EXTRA_DIST)
 	if [ -x $(top_srcdir)/../gen_chlog.sh ] ; then \
diff --git a/opensm/doc/performance-manager-HOWTO.txt b/opensm/doc/performance-manager-HOWTO.txt
new file mode 100644
index 0000000..c655f6c
--- /dev/null
+++ b/opensm/doc/performance-manager-HOWTO.txt
@@ -0,0 +1,153 @@
+OpenSM Performance manager HOWTO
+================================
+
+Introduction
+============
+
+OpenSM now includes a performance manager which collects Port counters from
+the subnet and stores them internally in OpenSM.
+
+Some of the features of the performance manager are:
+
+   1) Collect port data and error counters per v1.2 spec and store in
+      64-bit internal counts.
+   2) Automatic reset of counters when they reach approximately 3/4 full.
+      (While not guaranteeing that counts will not be missed this does
+      keep counts incrementing as best as possible given the current
+      hardware limitations.)
+   3) Basic warnings in the OpenSM log on "critical" errors like symbol
+      errors.
+   4) Automatically detects "outside" resets of counters and adjusts to
+      continue collecting data.
+   5) Can be run in a standby SM.
+
+Known issues are:
+
+   1) Data counters will be lost on high data rate links. Sweeping the
+      fabric fast enough for a DDR link is not practical.
+   2) Default partition support only.
+
+
+Setup and Usage
+===============
+
+Using the Performance Manager consists of 3 steps:
+
+   1) compiling in support for the perfmgr (Optionally: the console
+      socket as well)
+   2) enabling the perfmgr and console in opensm.opts
+   3) retrieving data which has been collected.
+      3a) using console to "dump data"
+      3b) using a plugin module to store the data to your own
+          "database"
+
+Step 1: Compile in support for the Performance Manager
+------------------------------------------------------
+
+Because of the performance manager's experimental status, it is not enabled at
+compile time by default. (This will hopefully soon change as more people use
+it and confirm that it does not break things... ;-) The configure option is
+"--enable-perf-mgr".
+
+At this time it is really best to enable the console socket option as well.
+OpenSM can be run in an "interactive" mode. But with the console socket option
+turned on one can also make a connection to a running OpenSM. The console
+option is "--enable-console-socket". This option requires the use of
+tcp_wrappers to ensure security. Please be aware of your configuration for
+tcp_wrappers as the commands presented in the console can affect the operation
+of your subnet.
+
+The following configure line includes turning on the performance manager
+as well as the console:
+
+   ./configure --enable-perf-mgr --enable-console-socket
+
+
+Step 2: Enable the perfmgr and console in opensm.opts
+-----------------------------------------------------
+
+Turning the Performance Manager on is pretty easy: set the following options
+in the opensm.opts config file. (Default location is
+/var/cache/opensm/opensm.opts)
+
+   # Turn it all on.
+   perfmgr TRUE
+
+   # sweep time in seconds
+   perfmgr_sweep_time_s 180
+
+   # Dump file to dump the events to
+   event_db_dump_file /var/log/opensm_port_counters.log
+
+Also enable the console socket and configure the port for it to listen to if
+desired.
+
+   # console [off|local|socket]
+   console socket
+
+   # Telnet port for console (default 10000)
+   console_port 10000
+
+As noted above you also need to set up tcp_wrappers to prevent unauthorized
+users from connecting to the console.[*]
+
+   [*] As an alternative you can use the loopback mode but I noticed when
+   writing this (OpenSM v3.1.10; OFED 1.3) that there are some bugs in
+   specifying the loopback mode in the opensm.opts file. Look for this to
+   be fixed in newer versions.
+
+   [**] Also you could use "local" but this is only useful if you run
+   OpenSM in the foreground of a terminal. As OpenSM is usually started
+   as a daemon I left this out as an option.
+
+Step 3: retrieve data which has been collected
+----------------------------------------------
+
+Step 3a: Using console dump function
+------------------------------------
+
+The console command "perfmgr dump_counters" will dump counters to the file
+specified in the opensm.opts file, in the example above
+"/var/log/opensm_port_counters.log".
+
+Example output is below:
+
+
+"SW1 wopr ISR9024D (MLX4 FW)" 0x8f10400411f56 port 1 (Since Mon May 12 13:27:14 2008)
+     symbol_err_cnt       : 0
+     link_err_recover     : 0
+     link_downed          : 0
+     rcv_err              : 0
+     rcv_rem_phys_err     : 0
+     rcv_switch_relay_err : 2
+     xmit_discards        : 0
+     xmit_constraint_err  : 0
+     rcv_constraint_err   : 0
+     link_integrity_err   : 0
+     buf_overrun_err      : 0
+     vl15_dropped         : 0
+     xmit_data            : 470435
+     rcv_data             : 405956
+     xmit_pkts            : 8954
+     rcv_pkts             : 6900
+     unicast_xmit_pkts    : 0
+     unicast_rcv_pkts     : 0
+     multicast_xmit_pkts  : 0
+     multicast_rcv_pkts   : 0
+
+
+
+Step 3b: Using a plugin module
+------------------------------
+
+If you want a more automated method of retrieving the data OpenSM provides a
+plugin interface to extend OpenSM. The header file is osm_event_plugin.h.
+The functions you register with this interface will be called when data is
+collected. You can then use that data as appropriate.
+
+An example plugin can be configured at compile time using the
+"--enable-default-event-plugin" option on the configure line. This plugin is
+very simple. It logs "events" received from the performance manager to a log
+file. I don't recommend using this directly but rather use it as a template
+to create your own plugin.
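(A note for readers who want to attempt step 3b: a rough skeleton of such a
plugin is sketched below. The structure and event names are approximations
of the osm_event_plugin.h interface and may differ between OpenSM versions;
treat every identifier here as an assumption and verify it against the
header in your tree before building.)

#include <stdio.h>
#include <opensm/osm_event_plugin.h>

/* Sketch only: identifiers approximated from osm_event_plugin.h. */

static void *my_create(osm_opensm_t *osm)
{
	/* Whatever we return here is handed back as plugin_data below. */
	return fopen("/var/log/my_perfmgr_data.log", "a");
}

static void my_delete(void *plugin_data)
{
	if (plugin_data)
		fclose((FILE *)plugin_data);
}

/* Called by the performance manager each time it has fresh counters. */
static void my_report(void *plugin_data, osm_epi_event_id_t event_id,
		      void *event_data)
{
	FILE *f = plugin_data;

	switch (event_id) {
	case OSM_EVENT_ID_PORT_DATA_COUNTERS:
		/* event_data points at the port's data counters; a real
		 * plugin would cast it and store the individual fields. */
		fprintf(f, "data counters event\n");
		break;
	case OSM_EVENT_ID_PORT_ERRORS:
		fprintf(f, "error counters event\n");
		break;
	default:
		break;
	}
	fflush(f);
}

/* OpenSM looks up this symbol when it loads the shared object. */
osm_event_plugin_t osm_event_plugin = {
	.osm_version = OSM_VERSION,
	.create = my_create,
	.delete = my_delete,
	.report = my_report,
};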
+ diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in index feabfef..c36d6f2 100644 --- a/opensm/opensm.spec.in +++ b/opensm/opensm.spec.in @@ -125,7 +125,7 @@ fi %{_sbindir}/opensm %{_sbindir}/osmtest %{_mandir}/man8/* -%doc AUTHORS COPYING README +%doc AUTHORS COPYING README doc/performance-manager-HOWTO.txt %{_sysconfdir}/init.d/opensmd %{_sbindir}/sldd.sh %config(noreplace) @OPENSM_CONFIG_DIR@/opensm.conf -- 1.5.1 From weiny2 at llnl.gov Thu May 15 13:34:56 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 15 May 2008 13:34:56 -0700 Subject: [ofa-general] [Announce] libopensmskummeeplugin; OpenSM/PerfMgr to MySQL plugin Message-ID: <20080515133456.7b1e5c04.weiny2@llnl.gov> Announcing, libopensmskummeeplugin. https://computing.llnl.gov/linux/skummeeplugin.html This plugin takes the data from the PerfMgr and logs it to a MySQL DB. In addition it comes with scripts to set up the connection between the PerfMgr and SKUMMEE (https://sourceforge.net/projects/skummee) an open source cluster monitoring tool. Although this has been developed primarily to get data into SKUMMEE it can be used to simply store the data in a MySQL DB which can then be querried. I hope someone finds this useful, Ira Weiny From rdreier at cisco.com Thu May 15 14:50:38 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 May 2008 14:50:38 -0700 Subject: [ofa-general] [PATCH RFC v2] RDMA: New Memory Extensions. In-Reply-To: <482C91F7.60708@opengridcomputing.com> (Steve Wise's message of "Thu, 15 May 2008 14:41:43 -0500") References: <20080515181734.21020.47137.stgit@dell3.ogc.int> <482C91F7.60708@opengridcomputing.com> Message-ID: > I think we need a way for users to query the device to know the max > page_list_length that can be handled in a FAST_REG_MR work request. > In other words, a device attribute. Yeah, stick it in there... From rdreier at cisco.com Thu May 15 15:22:12 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 May 2008 15:22:12 -0700 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: (Thomas Talpey's message of "Wed, 14 May 2008 23:32:21 -0400") References: Message-ID: > We've been hit by this twice this week on two NFS/RDMA servers, so I'm > glad to see this! But, for us it happens with memless ConnectX - our mthca > devices are ok (but OTOH they're memfull not memfree) Strange... as I said before though something seems to have changed to affect this, though I have no idea what. I'm including the test program I use to check if QP creation succeeds, you can run this on any suspect systems and see what it prints. > I'll be happy to test it with our misbehaving cards, but I can't do it until > next week since they just went into a box for shipping. In the meantime, > dare I ask - what's different about memfree cards that limits the sge > attributes like this? And, what values result from the new code? The > ConnectX ones I have report 32, and fail when trying to set that. The patch doesn't change ConnectX -- creating a QP with max send/recv sge 32 works fine for me here with mlx4 from 2.6.26-rc2. For mem-free the new max_sge reported is 27 sge entries, and for memful it is 59 (and creating such QPs succeeds of course). The difference between memfree and memful that matters is just that the max_sge on memfree runs into the max WQE size, and the code didn't handle that correctly without the patch. 
Here's the test program to check QP creation vs reported max_sge:

#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>

int main(int argc, char *argv[])
{
	struct ibv_device **dev_list;
	struct ibv_device_attr dev_attr;
	struct ibv_context *context;
	struct ibv_pd *pd;
	struct ibv_cq *cq;
	struct ibv_qp_init_attr qp_attr;
	int t;

	static const struct {
		enum ibv_qp_type type;
		char *name;
	} type_tab[] = {
		{ IBV_QPT_RC, "RC" },
		{ IBV_QPT_UC, "UC" },
		{ IBV_QPT_UD, "UD" },
	};

	dev_list = ibv_get_device_list(NULL);
	if (!dev_list) {
		printf("No RDMA devices found\n");
		return 1;
	}

	for (; *dev_list; ++dev_list) {
		printf("%s:\n", ibv_get_device_name(*dev_list));

		context = ibv_open_device(*dev_list);
		if (!context) {
			printf("  ibv_open_device failed\n");
			continue;
		}

		if (ibv_query_device(context, &dev_attr)) {
			printf("  ibv_query_device failed\n");
			continue;
		}

		cq = ibv_create_cq(context, 1, NULL, NULL, 0);
		if (!cq) {
			printf("  ibv_create_cq failed\n");
			continue;
		}

		pd = ibv_alloc_pd(context);
		if (!pd) {
			printf("  ibv_alloc_pd failed\n");
			continue;
		}

		/* Try to create a QP of each type with the device's own
		 * reported max_sge for both send and receive queues. */
		for (t = 0; t < sizeof type_tab / sizeof type_tab[0]; ++t) {
			memset(&qp_attr, 0, sizeof qp_attr);
			qp_attr.send_cq = cq;
			qp_attr.recv_cq = cq;
			qp_attr.cap.max_send_wr = 1;
			qp_attr.cap.max_recv_wr = 1;
			qp_attr.cap.max_send_sge = dev_attr.max_sge;
			qp_attr.cap.max_recv_sge = dev_attr.max_sge;
			qp_attr.qp_type = type_tab[t].type;

			printf("  %s: SGE %d ", type_tab[t].name, dev_attr.max_sge);
			if (ibv_create_qp(pd, &qp_attr))
				printf("ok (got %d/%d)\n",
				       qp_attr.cap.max_send_sge,
				       qp_attr.cap.max_recv_sge);
			else
				printf("FAILED\n");
		}
	}

	return 0;
}

From rdreier at cisco.com Thu May 15 15:31:55 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 15 May 2008 15:31:55 -0700
Subject: [ofa-general] Re: [PATCH 12/13] QLogic VNIC: Driver Kconfig and Makefile.
In-Reply-To: <20080430172156.31725.94843.stgit@localhost.localdomain> (Ramachandra K.'s message of "Wed, 30 Apr 2008 22:51:56 +0530")
References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430172156.31725.94843.stgit@localhost.localdomain>
Message-ID:

 > +config INFINIBAND_QLGC_VNIC_DEBUG
 > + bool "QLogic VNIC Verbose debugging"
 > + depends on INFINIBAND_QLGC_VNIC
 > + default n
 > + ---help---
 > + This option causes verbose debugging code to be compiled
 > + into the QLogic VNIC driver. The output can be turned on via the
 > + vnic_debug module parameter.

If you have runtime control of this, I suggest making it default to on,
like mthca does with:

config INFINIBAND_MTHCA_DEBUG
	bool "Verbose debugging output" if EMBEDDED
	depends on INFINIBAND_MTHCA
	default y

otherwise distros will leave the option off and it becomes a pain to
debug problems because you force users to do compiles.

From rdreier at cisco.com Thu May 15 15:33:56 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 15 May 2008 15:33:56 -0700
Subject: [ofa-general] Re: [PATCH 10/13] QLogic VNIC: Driver Statistics collection
In-Reply-To: <20080430172055.31725.70663.stgit@localhost.localdomain> (Ramachandra K.'s message of "Wed, 30 Apr 2008 22:50:55 +0530")
References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430172055.31725.70663.stgit@localhost.localdomain>
Message-ID:

 > +#else /*CONFIG_INIFINIBAND_VNIC_STATS*/
 > +
 > +static inline void vnic_connected_stats(struct vnic *vnic)
 > +{
 > +	;
 > +}

there are an awful lot of stubs here. Do you really expect anyone to
set CONFIG_INFINIBAND_QLGC_VNIC_STATS=n?

 - R.
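The idiom being reviewed here is the usual kernel one for compiling
optional statistics out: real functions when the config option is set,
empty static inlines otherwise, so call sites never need #ifdefs. A
generic sketch of that pattern (the names below are invented for
illustration and are not the QLogic VNIC code):

#ifdef CONFIG_MYDRV_STATS
void mydrv_stats_conn_up(struct mydrv *drv);
void mydrv_stats_tx_pkt(struct mydrv *drv, unsigned int len);
#else
/* No-op stubs: with the option off the compiler discards the calls
 * entirely, and the callers compile unchanged. */
static inline void mydrv_stats_conn_up(struct mydrv *drv)
{
}
static inline void mydrv_stats_tx_pkt(struct mydrv *drv, unsigned int len)
{
}
#endif

Roland's point is that when nobody is expected to build with the option
off, the stubs (and the config knob itself) are just dead weight.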
From akepner at sgi.com Thu May 15 15:34:18 2008 From: akepner at sgi.com (akepner at sgi.com) Date: Thu, 15 May 2008 15:34:18 -0700 Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3) In-Reply-To: References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com> <20080510190721.GI5298@sgi.com> Message-ID: <20080515223418.GY29302@sgi.com> Last night we were able to reproduce the bug that I reported at the beginning of this thread. In brief, the bug is that we stop getting completions on the IPoIB-UD send queue. The queue fills, and we get an endless stream of "post_send failed". This is with 2.6.16.46-0.12 (SLES 10 SP1), and OFED 1.3, running on a moderately large (512 CPU) cluster. IB HCA is MT25204, f/w 1.2.0. The workload is an MPI job of some sort, and the failure seems to happen very soon (within minutes) after the job starts. We're using CM. I've added some debug output to the driver. The debug driver prints the "tx_outstanding" value when we get post_send failures in ipoib_send(): ib0: tx_outstanding 0x82 (ipoib_sendq_size 0x80) ib0: tx_outstanding 0x83 (ipoib_sendq_size 0x80) ib0: tx_outstanding 0x83 (ipoib_sendq_size 0x80) ib0: tx_outstanding 0x84 (ipoib_sendq_size 0x80) .... (We never call netif_stop_queue().) I also instrumented calls to mthca_arbel_post_send() and mthca_poll_one() - we keep a circular buffer of the last 0x80 sends, and completions. One curious thing is that the mthca_poll_one() and mthca_arbel_post_send() routines are usually called at a very regular rate, e.g., here are the last few calls to mthca_poll_one(): # delta_t head tail # [jiffies] .... 125 0x5c 0x5b 125 0x5d 0x5c 125 0x5e 0x5d 125 0x5f 0x5e 125 0x60 0x5f 125 0x61 0x60 125 0x62 0x61 125 0x63 0x62 125 0x64 0x63 125 0x65 0x64 125 0x66 0x65 125 0x67 0x66 125 0x68 0x67 125 0x69 0x68 125 0x69 0x69 After a short time, we just stop logging any more calls to mthca_poll_one(). Then it took a few minutes to fill the queue, and start making the "post_send failed" messages. The last few succeeding calls to mthca_arbel_post_send() were: # delta_t head tail # [jiffies] .... 125 0xdc 0x69 125 0xdd 0x69 125 0xde 0x69 125 0xdf 0x69 125 0xe0 0x69 125 0xe1 0x69 125 0xe2 0x69 125 0xe3 0x69 125 0xe4 0x69 125 0xe5 0x69 125 0xe6 0x69 125 0xe7 0x69 2860 0xe8 0x69 250 0xe9 0x69 HZ = 250, so we're calling these routines twice per second. Looks like we must be doing: static void ipoib_ib_tx_timer_func(unsigned long dev_ptr) { if (post_zlen_send_wr(priv, wrid)) { ... .... poll_tx(priv); ... mod_timer(&priv->poll_timer, jiffies + HZ / 2); } Apparently when we do: static inline int mthca_poll_one( ... cqe = next_cqe_sw(cq); if (!cqe) return -EAGAIN; we find that the next CQE is owned by h/w (if we hadn't returned early, the debug code would've logged the poll.) 
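For reference, the instrumentation amounts to something like the following
(a simplified sketch with invented names, not the actual debug patch; no
locking shown):

#define LOG_DEPTH 0x80	/* power of two, so the index mask below works */

struct wq_event {
	unsigned long when;	/* jiffies at post/poll time */
	unsigned int head;	/* queue head index at that time */
	unsigned int tail;	/* queue tail index at that time */
};

static struct wq_event event_log[LOG_DEPTH];
static unsigned int event_idx;

/* Called from the post_send and poll paths being instrumented. */
static void log_wq_event(unsigned int head, unsigned int tail)
{
	struct wq_event *e = &event_log[event_idx++ & (LOG_DEPTH - 1)];

	e->when = jiffies;
	e->head = head;
	e->tail = tail;
}

Dumping the buffer after the hang then gives the last 0x80
posts/completions and the inter-event times shown above.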
--
Arthur

From rdreier at cisco.com  Thu May 15 15:38:00 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 15 May 2008 15:38:00 -0700
Subject: [ofa-general] [PATCH 09/13] QLogic VNIC: IB Multicast for Ethernet broadcast/multicast
In-Reply-To: <20080430172025.31725.97795.stgit@localhost.localdomain> (Ramachandra K.'s message of "Wed, 30 Apr 2008 22:50:25 +0530")
References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430172025.31725.97795.stgit@localhost.localdomain>
Message-ID:

> +#define SET_MCAST_STATE_INVALID \
> +do { \
> +	viport->mc_info.state = MCAST_STATE_INVALID; \
> +	viport->mc_info.mc = NULL; \
> +	memset(&viport->mc_info.mgid, 0, sizeof(union ib_gid)); \
> +} while (0);

Seems like this could be profitably implemented in C instead of CPP.

> +	spin_lock_irqsave(&viport->mc_info.lock, flags);
> +	viport->mc_info.state = MCAST_STATE_INVALID;
> +	spin_unlock_irqrestore(&viport->mc_info.lock, flags);

This pattern makes me uneasy about the locking... setting the state member will already be atomic, so what do you think you're protecting against here by taking the lock?

 - R.

From ralph.campbell at qlogic.com  Thu May 15 15:48:26 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 15 May 2008 15:48:26 -0700
Subject: [ofa-general] [PATCH 0/2] IB/ipath -- fixes for 2.6.26
Message-ID: <20080515224826.23487.53076.stgit@eng-46.mv.qlogic.com>

The following patches fix two minor bugs for the QLogic DDR HCA.

  IB/ipath - fix printk compiler warning for ipath_sdma_status
  IB/ipath - fix UC receive completion opcode

These can also be pulled into Roland's infiniband.git for-2.6.26 repo using:

  git pull git://git.qlogic.com/ipath-linux-2.6 for-roland

From ralph.campbell at qlogic.com  Thu May 15 15:48:31 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 15 May 2008 15:48:31 -0700
Subject: [ofa-general] [PATCH 1/2] IB/ipath - fix printk compiler warning for ipath_sdma_status
In-Reply-To: <20080515224826.23487.53076.stgit@eng-46.mv.qlogic.com>
References: <20080515224826.23487.53076.stgit@eng-46.mv.qlogic.com>
Message-ID: <20080515224831.23487.47599.stgit@eng-46.mv.qlogic.com>

This patch fixes a printk format string compiler warning to match the change of ipath_sdma_status from u64 to unsigned long.
Signed-off-by: Ralph Campbell
---

 drivers/infiniband/hw/ipath/ipath_sdma.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_sdma.c b/drivers/infiniband/hw/ipath/ipath_sdma.c
index 3697449..0e860fd 100644
--- a/drivers/infiniband/hw/ipath/ipath_sdma.c
+++ b/drivers/infiniband/hw/ipath/ipath_sdma.c
@@ -345,7 +345,7 @@ resched:
 	 * state change
 	 */
 	if (jiffies > dd->ipath_sdma_abort_jiffies) {
-		ipath_dbg("looping with status 0x%016llx\n",
+		ipath_dbg("looping with status 0x%016lx\n",
 			  dd->ipath_sdma_status);
 		dd->ipath_sdma_abort_jiffies = jiffies + 5 * HZ;
 	}
@@ -615,7 +615,7 @@ void ipath_restart_sdma(struct ipath_devdata *dd)
 	}
 	spin_unlock_irqrestore(&dd->ipath_sdma_lock, flags);
 	if (!needed) {
-		ipath_dbg("invalid attempt to restart SDMA, status 0x%016llx\n",
+		ipath_dbg("invalid attempt to restart SDMA, status 0x%016lx\n",
 			  dd->ipath_sdma_status);
 		goto bail;
 	}

From ralph.campbell at qlogic.com  Thu May 15 15:48:36 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 15 May 2008 15:48:36 -0700
Subject: [ofa-general] [PATCH 2/2] IB/ipath - fix UC receive completion opcode
In-Reply-To: <20080515224826.23487.53076.stgit@eng-46.mv.qlogic.com>
References: <20080515224826.23487.53076.stgit@eng-46.mv.qlogic.com>
Message-ID: <20080515224836.23487.696.stgit@eng-46.mv.qlogic.com>

When I fixed the RC receive completion opcode, I forgot to fix UC, which had the same problem for RDMA write with immediate returning the wrong opcode.

Signed-off-by: Ralph Campbell
---

 drivers/infiniband/hw/ipath/ipath_uc.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_uc.c b/drivers/infiniband/hw/ipath/ipath_uc.c
index 7fd18e8..0596ec1 100644
--- a/drivers/infiniband/hw/ipath/ipath_uc.c
+++ b/drivers/infiniband/hw/ipath/ipath_uc.c
@@ -407,12 +407,11 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
 			dev->n_pkt_drops++;
 			goto done;
 		}
-		/* XXX Need to free SGEs */
+		wc.opcode = IB_WC_RECV;
 	last_imm:
 		ipath_copy_sge(&qp->r_sge, data, tlen);
 		wc.wr_id = qp->r_wr_id;
 		wc.status = IB_WC_SUCCESS;
-		wc.opcode = IB_WC_RECV;
 		wc.qp = &qp->ibqp;
 		wc.src_qp = qp->remote_qpn;
 		wc.slid = qp->remote_ah_attr.dlid;
@@ -514,6 +513,7 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
 			goto done;
 		}
 		wc.byte_len = qp->r_len;
+		wc.opcode = IB_WC_RECV_RDMA_WITH_IMM;
 		goto last_imm;

 	case OP(RDMA_WRITE_LAST):

From rdreier at cisco.com  Thu May 15 16:35:39 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 15 May 2008 16:35:39 -0700
Subject: [ofa-general] Re: [PATCH 1/2] IB/ipath - fix printk compiler warning for ipath_sdma_status
In-Reply-To: <20080515224831.23487.47599.stgit@eng-46.mv.qlogic.com> (Ralph Campbell's message of "Thu, 15 May 2008 15:48:31 -0700")
References: <20080515224826.23487.53076.stgit@eng-46.mv.qlogic.com> <20080515224831.23487.47599.stgit@eng-46.mv.qlogic.com>
Message-ID:

> This patch fixes a printk format string compiler warning to match
> the change of ipath_sdma_status from u64 to unsigned long.

Thanks... already had fixed this locally.
(Don't know how I missed the warning.)

From rdreier at cisco.com  Thu May 15 16:36:07 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 15 May 2008 16:36:07 -0700
Subject: [ofa-general] Re: [PATCH 2/2] IB/ipath - fix UC receive completion opcode
In-Reply-To: <20080515224836.23487.696.stgit@eng-46.mv.qlogic.com> (Ralph Campbell's message of "Thu, 15 May 2008 15:48:36 -0700")
References: <20080515224826.23487.53076.stgit@eng-46.mv.qlogic.com> <20080515224836.23487.696.stgit@eng-46.mv.qlogic.com>
Message-ID:

thanks, applied.

From npiggin at suse.de  Thu May 15 16:52:03 2008
From: npiggin at suse.de (Nick Piggin)
Date: Fri, 16 May 2008 01:52:03 +0200
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To:
References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de>
Message-ID: <20080515235203.GB25305@wotan.suse.de>

On Thu, May 15, 2008 at 10:33:57AM -0700, Christoph Lameter wrote:
> On Thu, 15 May 2008, Nick Piggin wrote:
>
> > Oh, I get that confused because of the mixed up naming conventions
> > there: unmap_page_range should actually be called zap_page_range. But
> > at any rate, yes we can easily zap pagetables without holding mmap_sem.
>
> How is that synchronized with code that walks the same pagetable. These
> walks may not hold mmap_sem either. I would expect that one could only
> remove a portion of the pagetable where we have some sort of guarantee
> that no accesses occur. So the removal of the vma prior ensures that?

I don't really understand the question. If you remove the pte and invalidate the TLBs on the remote image's process (importing the page), then it can of course try to refault the page in because its vma is still there. But you catch that refault in your driver, which can prevent the page from being faulted back in.

From okir at lst.de  Fri May 16 00:19:44 2008
From: okir at lst.de (Olaf Kirch)
Date: Fri, 16 May 2008 09:19:44 +0200
Subject: [ofa-general] mthca max_sge value... ugh.
In-Reply-To:
References:
Message-ID: <200805160919.45676.okir@lst.de>

On Friday 16 May 2008 00:22:12 Roland Dreier wrote:
> Strange... as I said before though something seems to have changed to
> affect this, though I have no idea what. I'm including the test program
> I use to check if QP creation succeeds, you can run this on any suspect
> systems and see what it prints.

I ran into this a few weeks back as well, when I tried to up the SG limit in RDS to 32 (on an Arbel memfree card). I grepped around the code a bit, got a little confused by all the different max_sge, max_sg and max_gs variables :-) and eventually convinced myself that the max_sge reported simply doesn't include the transport-specific overhead that mthca_alloc_wqe_buf factors in.

Given that you have quite different WQE overheads depending on the transport, a conservative max_sge value that works for all transports wastes one or two entries on some of them. Maybe once the QP is created, it could report the actual max_sge value (which may actually be greater than the conservative, transport-independent max_sge estimate of the device).
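As it happens, the verbs interface already has a slot for exactly this reporting: on success, ibv_create_qp() writes the capabilities actually granted back into the caller's ibv_qp_init_attr.cap, which is what the test program earlier in the thread reads. A minimal sketch (assuming pd, cq and dev_attr have been set up with the same headers and calls as that program; the function name is illustrative):

static int report_granted_sge(struct ibv_pd *pd, struct ibv_cq *cq,
			      struct ibv_device_attr *dev_attr)
{
	struct ibv_qp_init_attr attr;
	struct ibv_qp *qp;

	memset(&attr, 0, sizeof attr);
	attr.send_cq          = cq;
	attr.recv_cq          = cq;
	attr.qp_type          = IBV_QPT_RC;
	attr.cap.max_send_wr  = 1;
	attr.cap.max_recv_wr  = 1;
	attr.cap.max_send_sge = dev_attr->max_sge;	/* what we ask for... */
	attr.cap.max_recv_sge = dev_attr->max_sge;

	qp = ibv_create_qp(pd, &attr);
	if (!qp)
		return -1;	/* the failure mode this thread is about */

	/* ...and what the provider actually granted */
	printf("RC: granted %d send / %d recv SGEs\n",
	       attr.cap.max_send_sge, attr.cap.max_recv_sge);
	return ibv_destroy_qp(qp);
}

Whether mthca fills these fields in with the true per-transport limits, rather than echoing the request, is the cleanup Roland describes later in the thread as needing real work.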
Olaf
--
Olaf Kirch  |  --- o ---  Nous sommes du soleil we love when we play
okir at lst.de |    / | \    sol.dhoop.naytheet.ah kin.ir.samse.qurax

From keshetti85-student at yahoo.co.in  Fri May 16 02:42:58 2008
From: keshetti85-student at yahoo.co.in (Keshetti Mahesh)
Date: Fri, 16 May 2008 15:12:58 +0530
Subject: [ofa-general] Retry count error with ipath on OFED-1.3
Message-ID: <829ded920805160242i57481603t3c65c44ceafd640@mail.gmail.com>

OFED 1.3 Infinipath Error:

># OSU MPI Bandwidth Test v3.1
># Size        Bandwidth (MB/s)
>1             0.17
>2             0.39
>4             0.66
>8             1.80
>16            2.53
>32            5.11
>64            8.80
>128           23.09
>256           43.65
>512           84.42
>1024          151.63
>[0,1,0][btl_openib_component.c:1338:btl_openib_component_progress] from
>compute-6-7 to: compute-6-8 error polling HP CQ with status RETRY
>EXCEEDED ERROR status number 12 for wr_id 185705200 opcode 1
>--------------------------------------------------------------------------
>The InfiniBand retry count between two MPI processes has been
>exceeded. "Retry count" is defined in the InfiniBand spec 1.2
>(section 12.7.38):
>
>    The total number of times that the sender wishes the receiver to
>    retry timeout, packet sequence, etc. errors before posting a
>    completion error.
>
>This error typically means that there is something awry within the
>InfiniBand fabric itself. You should note the hosts on which this
>error has occurred; it has been observed that rebooting or removing a
>particular host from the job can sometimes resolve this issue.
>
>Two MCA parameters can be used to control Open MPI's behavior with
>respect to the retry count:
>
>* btl_openib_ib_retry_count - The number of times the sender will
>  attempt to retry (defaulted to 7, the maximum value).
>
>* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>  to 10). The actual timeout value used is calculated as:
>
>    4.096 microseconds * (2^btl_openib_ib_timeout)
>
>  See the InfiniBand spec 1.2 (section 12.7.34) for more details.
>--------------------------------------------------------------------------
>mpirun noticed that job rank 1 with PID 16883 on node compute-6-8
>exited on signal 15 (Terminated).

Hi Nico,

The above error arises from a deadlock scenario in the network, but that should not happen in your case since you are using only 2 nodes. Try increasing the IB parameters (btl_openib_ib_retry_count and btl_openib_ib_timeout) mentioned in the error message.

-Mahesh

From holt at sgi.com  Fri May 16 04:23:06 2008
From: holt at sgi.com (Robin Holt)
Date: Fri, 16 May 2008 06:23:06 -0500
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <20080515235203.GB25305@wotan.suse.de>
References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> <20080515235203.GB25305@wotan.suse.de>
Message-ID: <20080516112306.GA4287@sgi.com>

On Fri, May 16, 2008 at 01:52:03AM +0200, Nick Piggin wrote:
> On Thu, May 15, 2008 at 10:33:57AM -0700, Christoph Lameter wrote:
> > On Thu, 15 May 2008, Nick Piggin wrote:
> >
> > > Oh, I get that confused because of the mixed up naming conventions
> > > there: unmap_page_range should actually be called zap_page_range. But
> > > at any rate, yes we can easily zap pagetables without holding mmap_sem.
> >
> > How is that synchronized with code that walks the same pagetable. These
> > walks may not hold mmap_sem either.
> > I would expect that one could only
> > remove a portion of the pagetable where we have some sort of guarantee
> > that no accesses occur. So the removal of the vma prior ensures that?
>
> I don't really understand the question. If you remove the pte and invalidate
> the TLBs on the remote image's process (importing the page), then it can
> of course try to refault the page in because its vma is still there. But
> you catch that refault in your driver, which can prevent the page from
> being faulted back in.

I think Christoph's question has more to do with faults that are in flight. A recently requested fault could have just released the last lock that was holding up the invalidate callout. It would then begin messaging back the response PFN, which could still be in flight. The invalidate callout would then fire and do the interrupt shoot-down while that response was still active (essentially beating the in-flight response). The invalidate would clear up nothing, and then the response would insert the PFN after it is no longer the correct PFN.

Thanks,
Robin

From holt at sgi.com  Fri May 16 04:50:05 2008
From: holt at sgi.com (Robin Holt)
Date: Fri, 16 May 2008 06:50:05 -0500
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <20080516112306.GA4287@sgi.com>
References: <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> <20080515235203.GB25305@wotan.suse.de> <20080516112306.GA4287@sgi.com>
Message-ID: <20080516115005.GC4287@sgi.com>

On Fri, May 16, 2008 at 06:23:06AM -0500, Robin Holt wrote:
> On Fri, May 16, 2008 at 01:52:03AM +0200, Nick Piggin wrote:
> > On Thu, May 15, 2008 at 10:33:57AM -0700, Christoph Lameter wrote:
> > > On Thu, 15 May 2008, Nick Piggin wrote:
> > >
> > > > Oh, I get that confused because of the mixed up naming conventions
> > > > there: unmap_page_range should actually be called zap_page_range. But
> > > > at any rate, yes we can easily zap pagetables without holding mmap_sem.
> > >
> > > How is that synchronized with code that walks the same pagetable. These
> > > walks may not hold mmap_sem either. I would expect that one could only
> > > remove a portion of the pagetable where we have some sort of guarantee
> > > that no accesses occur. So the removal of the vma prior ensures that?
> >
> > I don't really understand the question. If you remove the pte and invalidate
> > the TLBs on the remote image's process (importing the page), then it can
> > of course try to refault the page in because its vma is still there. But
> > you catch that refault in your driver, which can prevent the page from
> > being faulted back in.
>
> I think Christoph's question has more to do with faults that are
> in flight. A recently requested fault could have just released the
> last lock that was holding up the invalidate callout. It would then
> begin messaging back the response PFN which could still be in flight.
> The invalidate callout would then fire and do the interrupt shoot-down
> while that response was still active (essentially beating the in-flight
> response). The invalidate would clear up nothing and then the response
> would insert the PFN after it is no longer the correct PFN.

I just looked over XPMEM. I think we could make this work. We already have a list of active faults, which is protected by a simple spinlock.
I would need to nest this lock within another lock protecting our PFN table (currently it is a mutex), and then the invalidate interrupt handler would need to mark the fault as invalid (which is also currently there).

I think my sticking points with the interrupt method remain fault containment and timeout. The inability of the ia64 processor to provide predictive failures for reads/writes of memory on other partitions prevents us from being able to contain the failure. I don't think we can get the information we would need to do the invalidate without introducing fault containment issues, which has been a continuous area of concern for our customers.

Thanks,
Robin

From eli at dev.mellanox.co.il  Fri May 16 06:06:56 2008
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Fri, 16 May 2008 16:06:56 +0300
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
In-Reply-To: <20080515223418.GY29302@sgi.com>
References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com> <20080510190721.GI5298@sgi.com> <20080515223418.GY29302@sgi.com>
Message-ID: <1210943216.9524.15.camel@eli-laptop>

On Thu, 2008-05-15 at 15:34 -0700, akepner at sgi.com wrote:

> ib0: tx_outstanding 0x82 (ipoib_sendq_size 0x80)
> ib0: tx_outstanding 0x83 (ipoib_sendq_size 0x80)
> ib0: tx_outstanding 0x83 (ipoib_sendq_size 0x80)
> ib0: tx_outstanding 0x84 (ipoib_sendq_size 0x80)
> ....

This should not happen. Can you send the source files for ipoib which you're using (with the debug patches)?

> (We never call netif_stop_queue().)

You mean you don't see it get called; you did not change the code so it won't be called, correct?

From olaf.kirch at oracle.com  Fri May 16 07:38:16 2008
From: olaf.kirch at oracle.com (Olaf Kirch)
Date: Fri, 16 May 2008 16:38:16 +0200
Subject: [ofa-general] RDS flow control
In-Reply-To: <200805141516.01908.okir@lst.de>
References: <200805121157.38135.jon@opengridcomputing.com> <482A17FC.7070804@oracle.com> <200805141516.01908.okir@lst.de>
Message-ID: <200805161638.18067.olaf.kirch@oracle.com>

On Wednesday 14 May 2008 15:16:00 Olaf Kirch wrote:
> I'll let you know as soon as I have something for you to test.

Okay, here we go. I have a whole stack of patches sitting in

  http://www.openfabrics.org/git/?p=~okir/ofed_1_3/linux-2.6.git

on branch future-20080516. This patch stack contains everything I'm working on right now, but which isn't ready for OFED 1.3.1 yet (and I haven't started on 1.4 yet). So you get a lot more than you bargained for... but a fair bit of that is actually needed because it prepares the ground for the flow control stuff, so I didn't bother with ripping out the unneeded pieces and rediffing everything.

I did some light testing with the code - it hasn't oopsed in a few hours, but I'm getting occasional errors with rds-stress right after reloading the module. They go away on the next attempt - so there's still something fishy about the code (or about rds-stress).

I'm not completely happy with the performance yet. Early versions might have been more adequately dubbed "trickle control" - but I eventually managed to get something not-too-bad. However, I'm still seeing performance degradation of ~5% with some packet sizes.
And that is *just* the overhead from exchanging the credit information and checking it - at some point we need to take a spinlock, and that seems to delay things just enough to make a dent in my throughput graph. In fact, I haven't yet found a test case where the sender had to slow down because it ran out of credits, which confirms my suspicion that the current setup isn't so bad, at least for IB...

If you're interested in the flow control code, the last commit on that branch is the one to look at - commit header appended below. I'm pretty sure that this is not exactly what iWARP needs, so please send comments/patches on how to beat it into shape for iWARP.

Enjoyable weekend to everyone,
Olaf

commit e8a64b4f83df9df6617f75dff9e591b86174fa7c
Author: Olaf Kirch
Date:   Fri May 16 06:16:40 2008 -0700

    RDS: Implement IB flow control

    Here it is - flow control for RDS/IB. This patch is still very much
    experimental. Here are the essentials:

    - The approach chosen here uses a credit-based flow control
      mechanism. Every SEND WR (including ACKs) consumes one credit,
      and if the sender runs out of credits, it stalls.

    - As new receive buffers are posted, credits are transferred to the
      remote node (using yet another RDS header byte for this).

    - Flow control is negotiated during connection setup. Initial
      credits are exchanged in the rds_ib_connect_private struct -
      sending a value of zero (which is also the default for older
      protocol versions) means no flow control.

    - We avoid deadlock (both nodes depleting their credits, and being
      unable to inform the peer of newly posted buffers) by requiring
      that the last credit can only be used if we're posting new
      credits to the peer.

    Flow control is configurable via sysctl. It only affects newly
    created connections, however - so your best bet is to set this
    right after loading the RDS module.

    Signed-off-by: Olaf Kirch

--
Olaf Kirch  |  --- o ---  Nous sommes du soleil we love when we play
okir at lst.de |    / | \    sol.dhoop.naytheet.ah kin.ir.samse.qurax

From akepner at sgi.com  Fri May 16 07:39:30 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Fri, 16 May 2008 07:39:30 -0700
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
In-Reply-To: <1210943216.9524.15.camel@eli-laptop>
References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com> <20080510190721.GI5298@sgi.com> <20080515223418.GY29302@sgi.com> <1210943216.9524.15.camel@eli-laptop>
Message-ID: <20080516143930.GE29302@sgi.com>

On Fri, May 16, 2008 at 04:06:56PM +0300, Eli Cohen wrote:
> On Thu, 2008-05-15 at 15:34 -0700, akepner at sgi.com wrote:
>
> > ib0: tx_outstanding 0x82 (ipoib_sendq_size 0x80)
> > ib0: tx_outstanding 0x83 (ipoib_sendq_size 0x80)
> > ib0: tx_outstanding 0x83 (ipoib_sendq_size 0x80)
> > ib0: tx_outstanding 0x84 (ipoib_sendq_size 0x80)
> > ....
>
> This should not happen. Can you send the source files for ipoib which
> you're using (with the debug patches)?

Sure. I'll send them privately, and not spam the mail list with this.

But I'll restate what I said earlier in this email thread - I don't think the root cause here is IPoIB. I think IPoIB is a victim when the card stops generating completions. We've seen what looks to be the *same* bug (send queue gets forever stuffed up) on both OFED 1.2 and OFED 1.3. The drivers in these two releases (I know you're well aware) are very different. The common element is MT25204.

> > (We never call netif_stop_queue().)
>
> You mean you don't see it get called; you did not change the code so it
> won't be called, correct?
I didn't change things so that netif_stop_queue() wouldn't be called.

--
Arthur

From matthewtsmall at gmail.com  Fri May 16 09:40:17 2008
From: matthewtsmall at gmail.com (Matthew Small)
Date: Fri, 16 May 2008 12:40:17 -0400
Subject: [ofa-general] The significance of port numbers when creating QPs?
Message-ID:

Can anyone explain a little on the significance of choosing a port number when creating a QP. In particular, my implementation has multiple QPs associated with a single PD and the only attr.port_num I can use to initialize my queue pair seems to be 1. Can someone answer why this is and perhaps explain a general method for choosing an available port_num.

-Matt
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ralph.campbell at qlogic.com  Fri May 16 09:50:15 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Fri, 16 May 2008 09:50:15 -0700
Subject: [ofa-general] The significance of port numbers when creating QPs?
In-Reply-To:
References:
Message-ID: <1210956615.3949.199.camel@brick.pathscale.com>

It depends on the hardware you have in your system. Most HCAs have one or two ports (a CX4 connector for the IB cable). The port_num is a property of the address handle (for UD QPs) or QP attributes (for UC, RC QPs) which specifies which physical IB port to use.

On Fri, 2008-05-16 at 12:40 -0400, Matthew Small wrote:
> Can anyone explain a little on the significance of choosing a port
> number when creating a QP. In particular, my implementation has
> multiple QPs associated with a single PD and the only attr.port_num I
> can use to initialize my queue pair seems to be 1. Can someone
> answer why this is and perhaps explain a general method for choosing
> an available port_num.
>
> -Matt

From matthewtsmall at gmail.com  Fri May 16 10:27:24 2008
From: matthewtsmall at gmail.com (Matthew Small)
Date: Fri, 16 May 2008 13:27:24 -0400
Subject: [ofa-general] The significance of port numbers when creating QPs?
In-Reply-To: <1210956615.3949.199.camel@brick.pathscale.com>
References: <1210956615.3949.199.camel@brick.pathscale.com>
Message-ID:

So, when you are using an RC QP and attempting to write code for general hardware, how would you query the device to find which physical IB ports are available?

On Fri, May 16, 2008 at 12:50 PM, Ralph Campbell wrote:
> It depends on the hardware you have in your system.
> Most HCAs have one or two ports (a CX4 connector
> for the IB cable). The port_num is a property of
> the address handle (for UD QPs) or QP attributes
> (for UC, RC QPs) which specifies which physical IB
> port to use.
>
> On Fri, 2008-05-16 at 12:40 -0400, Matthew Small wrote:
> > Can anyone explain a little on the significance of choosing
> > a port number when creating a QP. In particular, my
> > implementation has multiple QPs associated with a single PD
> > and the only attr.port_num I can use to initialize my queue
> > pair seems to be 1. Can someone answer why this is and
> > perhaps explain a general method for choosing an available
> > port_num.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ralph.campbell at qlogic.com  Fri May 16 10:45:04 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Fri, 16 May 2008 10:45:04 -0700
Subject: [ofa-general] The significance of port numbers when creating QPs?
In-Reply-To:
References: <1210956615.3949.199.camel@brick.pathscale.com>
Message-ID: <1210959904.3949.213.camel@brick.pathscale.com>

ibv_query_device() will return the number of physical ports, but what you are probably asking is how to establish a connection to a particular host. That is like mapping a hostname to an IP address, which is accomplished via the connection manager. See the documentation for librdmacm and libibverbs.

On Fri, 2008-05-16 at 13:27 -0400, Matthew Small wrote:
> So, when you are using an RC QP and attempting to write code for
> general hardware, how would you query the device to find which
> physical IB ports are available?
>
> > On Fri, 2008-05-16 at 12:40 -0400, Matthew Small wrote:
> > > Can anyone explain a little on the significance of choosing
> > > a port number when creating a QP. In particular, my implementation
> > > has multiple QPs associated with a single PD and the only
> > > attr.port_num I can use to initialize my queue pair seems to be 1.
> > > Can someone answer why this is and perhaps explain a general method
> > > for choosing an available port_num.
> > > > -Matt > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > From ruimario at gmail.com Fri May 16 11:10:56 2008 From: ruimario at gmail.com (Rui Machado) Date: Fri, 16 May 2008 20:10:56 +0200 Subject: [ofa-general] timeout question Message-ID: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> Hi, >> >> when setting the timeout in a struct ibv_qp_attr, this value >> corresponds to the Local ACK timeout which according to the Infiniband >> spec will define the transport timer timeout defined by the formula: >> 4.096uS * 2 ^Local Ack timeout". Is this right? >> And is there a value for this timeout to be considered "good practice"? >> > This value is depend on your fabric size, on the HCA you have (and some more factors).. >> Also, in a client-server setup, if this timeout is set to a "big >> value" (like 30) when the server dies, the client will take that >> amount of time to realize the failure. Is this correct? >> > Yes, after (at least) the calculated time * number of retry_count usec, the sender QP will get a retry exceeded > (if there was a SR which was posted without any response from the receiver). > hmm..... and is there no workaround for this, for this situation? I mean, if the server dies isn't there any possibility that the sender/client realizes this. If the timeout it's too large this can be cumbersome. I tried reducing the timeout and indeed the client realizes faster when the server exits but another problem arises: Without exiting the server, on the client side I get the error (retry exceed) when polling for a recently posted send - this after some hours. Thank you for the help. Rui From rdreier at cisco.com Fri May 16 11:13:38 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 May 2008 11:13:38 -0700 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: <200805160919.45676.okir@lst.de> (Olaf Kirch's message of "Fri, 16 May 2008 09:19:44 +0200") References: <200805160919.45676.okir@lst.de> Message-ID: > Given that you have quite different WQE overheads depending on the transport, > a conservative max_sge value that works for all transports wastes one or two > entries on some others. Maybe once the QP is created, it could report > the actual max_sge value (which may actually be greater than the conservative, > transport-independent max_sge estimate of the device). Is using 27 S/G entries vs, say 29 really a big problem? The interface exists for the driver to return the actual capabilities returned, but the mthca code is a bit of a mess and I think it would require a decent amount of cleanup work to do this sanely. - R. From rdreier at cisco.com Fri May 16 11:16:32 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 May 2008 11:16:32 -0700 Subject: [ofa-general] timeout question In-Reply-To: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> (Rui Machado's message of "Fri, 16 May 2008 20:10:56 +0200") References: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> Message-ID: > hmm..... and is there no workaround for this, for this situation? I > mean, if the server dies isn't there any possibility that > the sender/client realizes this. If the timeout it's too large this > can be cumbersome. 
> > I tried reducing the timeout and indeed the client realizes faster > when the server exits but another problem arises: Without exiting the > server, > on the client side I get the error (retry exceed) when polling for a > recently posted send - this after some hours. There's a tradeoff between detecting real failures faster, and reducing false errors detected because a response came too slowly. Clearly if a response may take an amount of time 'X' to be received under normal conditions, there's no way to conclude that the remote side has failed without waiting at least 'X'. - R. From rdreier at cisco.com Fri May 16 11:21:26 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 May 2008 11:21:26 -0700 Subject: [ofa-general] Re: [PATCH v2] IB/mlx4: Add send with invalidate support In-Reply-To: <1210836027.18385.2.camel@mtls03> (Eli Cohen's message of "Thu, 15 May 2008 10:20:27 +0300") References: <1210836027.18385.2.camel@mtls03> Message-ID: > +#define MLX4_FW_VER_LOCAL_SEND_INVL mlx4_fw_ver(2, 5, 0) > + if (dev->dev->caps.fw_ver >= MLX4_FW_VER_LOCAL_SEND_INVL) > + props->device_cap_flags |= IB_DEVICE_SEND_W_INV; Are we forced to to look at the firmware version, or can we use the bmme flag that the DEV_CAP firmware command gives us? - R. From richard.frank at oracle.com Fri May 16 11:28:03 2008 From: richard.frank at oracle.com (Richard Frank) Date: Fri, 16 May 2008 14:28:03 -0400 Subject: [ofa-general] Folks is this a known problem / already fixed ? Message-ID: <482DD233.5010808@oracle.com> We see the following failure for our ConnetX HCAs.. with 1.3.1 Daily 20080512 done on vanilla OEL5U1. They are failing to initialize with the following: mlx4_core: Mellanox ConnectX core driver v1.0 (February 28, 2008) mlx4_core: Initializing 0000:05:00.0 mlx4_core 0000:05:00.0: Failed to initialize queue pair table, aborting. mlx4_core 0000:05:00.0: Failed to initialize queue pair table, aborting. mlx4_core: probe of 0000:05:00.0 failed with error -16 And lspci shows: 05:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR] (rev a0) Subsystem: Mellanox Technologies MT25418 [ConnectX IB DDR] Flags: fast devsel, IRQ 169 Memory at fcc00000 (64-bit, non-prefetchable) [disabled] [size=1M] Memory at fff000000 (64-bit, prefetchable) [disabled] [size=8M] Memory at fcbfe000 (64-bit, non-prefetchable) [disabled] [size=8K] Capabilities: [40] Power Management version 3 Capabilities: [48] Vital Product Data Capabilities: [9c] MSI-X: Enable- Mask- TabSize=256 Capabilities: [60] Express Endpoint IRQ 0 From ruimario at gmail.com Fri May 16 11:40:09 2008 From: ruimario at gmail.com (Rui Machado) Date: Fri, 16 May 2008 20:40:09 +0200 Subject: [ofa-general] timeout question In-Reply-To: References: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> Message-ID: <6978b4af0805161140t7b12532brc3709a2aec6154df@mail.gmail.com> 2008/5/16 Roland Dreier : > > hmm..... and is there no workaround for this, for this situation? I > > mean, if the server dies isn't there any possibility that > > the sender/client realizes this. If the timeout it's too large this > > can be cumbersome. > > > > I tried reducing the timeout and indeed the client realizes faster > > when the server exits but another problem arises: Without exiting the > > server, > > on the client side I get the error (retry exceed) when polling for a > > recently posted send - this after some hours. > > There's a tradeoff between detecting real failures faster, and reducing > false errors detected because a response came too slowly. 
> > Clearly if a response may take an amount of time 'X' to be received > under normal conditions, there's no way to conclude that the remote side > has failed without waiting at least 'X'. > I understand. So there's no really difference between the two situations, real server failure or just a load problem that takes more time? Something like a different error or a SIGPIPE :) ? I will describe my situation, maybe it helps (bare with me as I'm starting with Infiniband and so on) I have a client and a server.The clients posts RDMA calls one at a time (post, poll, post...). So server is just there. If I try to start something like 16 clients on 1 machine, after a few hours I will get an error on some client programs (retry excess) with a timeout of 14. If I increase the timeout for 32, I don't see that error but if I stop the server, the clients take a lot of time to acknowledge that, which is also not wanted. That's why I asked if there a 'good value'. If I have such a load between 2 nodes, I always have to risk that if the server dies the client will take much time to see it. That's not nice! Thanks for the help and quick answers, Rui From dotanba at gmail.com Fri May 16 18:54:54 2008 From: dotanba at gmail.com (Dotan Barak) Date: Sat, 17 May 2008 03:54:54 +0200 Subject: [ofa-general] timeout question In-Reply-To: <6978b4af0805161140t7b12532brc3709a2aec6154df@mail.gmail.com> References: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> <6978b4af0805161140t7b12532brc3709a2aec6154df@mail.gmail.com> Message-ID: <482E3AEE.4070603@gmail.com> Rui Machado wrote: > 2008/5/16 Roland Dreier : > >> > hmm..... and is there no workaround for this, for this situation? I >> > mean, if the server dies isn't there any possibility that >> > the sender/client realizes this. If the timeout it's too large this >> > can be cumbersome. >> > >> > I tried reducing the timeout and indeed the client realizes faster >> > when the server exits but another problem arises: Without exiting the >> > server, >> > on the client side I get the error (retry exceed) when polling for a >> > recently posted send - this after some hours. >> >> There's a tradeoff between detecting real failures faster, and reducing >> false errors detected because a response came too slowly. >> >> Clearly if a response may take an amount of time 'X' to be received >> under normal conditions, there's no way to conclude that the remote side >> has failed without waiting at least 'X'. >> >> > > I understand. So there's no really difference between the two > situations, real server failure or just a load problem that takes more > time? > From the sender QP point of view, they are the same (ack/nack wasn't send during a specific period of time) > Something like a different error or a SIGPIPE :) ? > > I will describe my situation, maybe it helps (bare with me as I'm > starting with Infiniband and so on) > I have a client and a server.The clients posts RDMA calls one at a > time (post, poll, post...). So server is just there. > If I try to start something like 16 clients on 1 machine, after a few > hours I will get an error on some client programs (retry excess) with > a timeout of 14. If I increase the timeout for 32, I don't see that > error but if I stop the server, the clients take a lot of time to > acknowledge that, which is also not wanted. > That's why I asked if there a 'good value'. If I have such a load > between 2 nodes, I always have to risk that if the server dies the > client will take much time to see it. That's not nice! 
> Did you try to increase the retry_count too? (and not only the timeout). By the way, Which RDMA operation do you execute READ or WRITE? > Thanks for the help and quick answers, > You are always welcome .. Dotan From ruimario at gmail.com Fri May 16 12:01:07 2008 From: ruimario at gmail.com (Rui Machado) Date: Fri, 16 May 2008 21:01:07 +0200 Subject: [ofa-general] timeout question In-Reply-To: <482E3AEE.4070603@gmail.com> References: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> <6978b4af0805161140t7b12532brc3709a2aec6154df@mail.gmail.com> <482E3AEE.4070603@gmail.com> Message-ID: <6978b4af0805161201r4a113a1cpd98cbb1e81cf175a@mail.gmail.com> 2008/5/17 Dotan Barak : > Rui Machado wrote: >> >> 2008/5/16 Roland Dreier : >> >>> >>> > hmm..... and is there no workaround for this, for this situation? I >>> > mean, if the server dies isn't there any possibility that >>> > the sender/client realizes this. If the timeout it's too large this >>> > can be cumbersome. >>> > >>> > I tried reducing the timeout and indeed the client realizes faster >>> > when the server exits but another problem arises: Without exiting the >>> > server, >>> > on the client side I get the error (retry exceed) when polling for a >>> > recently posted send - this after some hours. >>> >>> There's a tradeoff between detecting real failures faster, and reducing >>> false errors detected because a response came too slowly. >>> >>> Clearly if a response may take an amount of time 'X' to be received >>> under normal conditions, there's no way to conclude that the remote side >>> has failed without waiting at least 'X'. >>> >>> >> >> I understand. So there's no really difference between the two >> situations, real server failure or just a load problem that takes more >> time? >> > > From the sender QP point of view, they are the same (ack/nack wasn't send > during a specific > period of time) >> >> Something like a different error or a SIGPIPE :) ? >> >> I will describe my situation, maybe it helps (bare with me as I'm >> starting with Infiniband and so on) >> I have a client and a server.The clients posts RDMA calls one at a >> time (post, poll, post...). So server is just there. >> If I try to start something like 16 clients on 1 machine, after a few >> hours I will get an error on some client programs (retry excess) with >> a timeout of 14. If I increase the timeout for 32, I don't see that >> error but if I stop the server, the clients take a lot of time to >> acknowledge that, which is also not wanted. >> That's why I asked if there a 'good value'. If I have such a load >> between 2 nodes, I always have to risk that if the server dies the >> client will take much time to see it. That's not nice! >> > > Did you try to increase the retry_count too? > (and not only the timeout). But that wouldn't change my scenario since the overall time is given by the timeout * retry count right? > By the way, Which RDMA operation do you execute READ or WRITE? >> READ. >> Thanks for the help and quick answers, >> > > You are always welcome .. 
Great :) Cheers, Rui From dotanba at gmail.com Fri May 16 13:04:51 2008 From: dotanba at gmail.com (Dotan Barak) Date: Fri, 16 May 2008 22:04:51 +0200 Subject: [ofa-general] timeout question In-Reply-To: <6978b4af0805161201r4a113a1cpd98cbb1e81cf175a@mail.gmail.com> References: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> <6978b4af0805161140t7b12532brc3709a2aec6154df@mail.gmail.com> <482E3AEE.4070603@gmail.com> <6978b4af0805161201r4a113a1cpd98cbb1e81cf175a@mail.gmail.com> Message-ID: <482DE8E3.4090200@gmail.com> Rui Machado wrote: > 2008/5/17 Dotan Barak : > >> Rui Machado wrote: >> >>> 2008/5/16 Roland Dreier : >>> >>> >>>> > hmm..... and is there no workaround for this, for this situation? I >>>> > mean, if the server dies isn't there any possibility that >>>> > the sender/client realizes this. If the timeout it's too large this >>>> > can be cumbersome. >>>> > >>>> > I tried reducing the timeout and indeed the client realizes faster >>>> > when the server exits but another problem arises: Without exiting the >>>> > server, >>>> > on the client side I get the error (retry exceed) when polling for a >>>> > recently posted send - this after some hours. >>>> >>>> There's a tradeoff between detecting real failures faster, and reducing >>>> false errors detected because a response came too slowly. >>>> >>>> Clearly if a response may take an amount of time 'X' to be received >>>> under normal conditions, there's no way to conclude that the remote side >>>> has failed without waiting at least 'X'. >>>> >>>> >>>> >>> I understand. So there's no really difference between the two >>> situations, real server failure or just a load problem that takes more >>> time? >>> >>> >> From the sender QP point of view, they are the same (ack/nack wasn't send >> during a specific >> period of time) >> >>> Something like a different error or a SIGPIPE :) ? >>> >>> I will describe my situation, maybe it helps (bare with me as I'm >>> starting with Infiniband and so on) >>> I have a client and a server.The clients posts RDMA calls one at a >>> time (post, poll, post...). So server is just there. >>> If I try to start something like 16 clients on 1 machine, after a few >>> hours I will get an error on some client programs (retry excess) with >>> a timeout of 14. If I increase the timeout for 32, I don't see that >>> error but if I stop the server, the clients take a lot of time to >>> acknowledge that, which is also not wanted. >>> That's why I asked if there a 'good value'. If I have such a load >>> between 2 nodes, I always have to risk that if the server dies the >>> client will take much time to see it. That's not nice! >>> >>> >> Did you try to increase the retry_count too? >> (and not only the timeout). >> Yes. > > But that wouldn't change my scenario since the overall time is given > by the timeout * retry count right? > > >> By the way, Which RDMA operation do you execute READ or WRITE? >> > READ. > Can you replace it with a write (from the other side)? READ has "higher price" than a WRITE. Anyway, you should get the mentioned behavior anyway.. When the sender get the error, what is the status of the receiver QP? (did you try to execute ibv_query_qp and get its status?) Dotan From paulmck at linux.vnet.ibm.com Fri May 16 12:07:52 2008 From: paulmck at linux.vnet.ibm.com (Paul E. 
McKenney) Date: Fri, 16 May 2008 12:07:52 -0700 Subject: [ofa-general] Re: [PATCH 001/001] mmu-notifier-core v17 In-Reply-To: <20080509193230.GH7710@duo.random> References: <20080509193230.GH7710@duo.random> Message-ID: <20080516190752.GK11333@linux.vnet.ibm.com> On Fri, May 09, 2008 at 09:32:30PM +0200, Andrea Arcangeli wrote: > From: Andrea Arcangeli The hlist_del_init_rcu() primitive looks good. The rest of the RCU code looks fine assuming that "mn->ops->release()" either does call_rcu() to defer actual removal, or that the actual removal is deferred until after mmu_notifier_release() returns. Acked-by: Paul E. McKenney > With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to > pages. There are secondary MMUs (with secondary sptes and secondary > tlbs) too. sptes in the kvm case are shadow pagetables, but when I say > spte in mmu-notifier context, I mean "secondary pte". In GRU case > there's no actual secondary pte and there's only a secondary tlb > because the GRU secondary MMU has no knowledge about sptes and every > secondary tlb miss event in the MMU always generates a page fault that > has to be resolved by the CPU (this is not the case of KVM where the a > secondary tlb miss will walk sptes in hardware and it will refill the > secondary tlb transparently to software if the corresponding spte is > present). The same way zap_page_range has to invalidate the pte before > freeing the page, the spte (and secondary tlb) must also be > invalidated before any page is freed and reused. > > Currently we take a page_count pin on every page mapped by sptes, but > that means the pages can't be swapped whenever they're mapped by any > spte because they're part of the guest working set. Furthermore a spte > unmap event can immediately lead to a page to be freed when the pin is > released (so requiring the same complex and relatively slow tlb_gather > smp safe logic we have in zap_page_range and that can be avoided > completely if the spte unmap event doesn't require an unpin of the > page previously mapped in the secondary MMU). > > The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and > know when the VM is swapping or freeing or doing anything on the > primary MMU so that the secondary MMU code can drop sptes before the > pages are freed, avoiding all page pinning and allowing 100% reliable > swapping of guest physical address space. Furthermore it avoids the > code that teardown the mappings of the secondary MMU, to implement a > logic like tlb_gather in zap_page_range that would require many IPI to > flush other cpu tlbs, for each fixed number of spte unmapped. > > To make an example: if what happens on the primary MMU is a protection > downgrade (from writeable to wrprotect) the secondary MMU mappings > will be invalidated, and the next secondary-mmu-page-fault will call > get_user_pages and trigger a do_wp_page through get_user_pages if it > called get_user_pages with write=1, and it'll re-establishing an > updated spte or secondary-tlb-mapping on the copied page. Or it will > setup a readonly spte or readonly tlb mapping if it's a guest-read, if > it calls get_user_pages with write=0. This is just an example. 
> > This allows to map any page pointed by any pte (and in turn visible in > the primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, > or an full MMU with both sptes and secondary-tlb like the > shadow-pagetable layer with kvm), or a remote DMA in software like > XPMEM (hence needing of schedule in XPMEM code to send the invalidate > to the remote node, while no need to schedule in kvm/gru as it's an > immediate event like invalidating primary-mmu pte). > > At least for KVM without this patch it's impossible to swap guests > reliably. And having this feature and removing the page pin allows > several other optimizations that simplify life considerably. > > Dependencies: > > 1) Introduces list_del_init_rcu and documents it (fixes a comment for > list_del_rcu too) > > 2) mm_take_all_locks() to register the mmu notifier when the whole VM > isn't doing anything with "mm". This allows mmu notifier users to > keep track if the VM is in the middle of the > invalidate_range_begin/end critical section with an atomic counter > incraese in range_begin and decreased in range_end. No secondary > MMU page fault is allowed to map any spte or secondary tlb > reference, while the VM is in the middle of range_begin/end as any > page returned by get_user_pages in that critical section could > later immediately be freed without any further ->invalidate_page > notification (invalidate_range_begin/end works on ranges and > ->invalidate_page isn't called immediately before freeing the > page). To stop all page freeing and pagetable overwrites the > mmap_sem must be taken in write mode and all other anon_vma/i_mmap > locks must be taken too. > > 3) It'd be a waste to add branches in the VM if nobody could possibly > run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled > if CONFIG_KVM=m/y. In the current kernel kvm won't yet take > advantage of mmu notifiers, but this already allows to compile a > KVM external module against a kernel with mmu notifiers enabled and > from the next pull from kvm.git we'll start using them. And > GRU/XPMEM will also be able to continue the development by enabling > KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code > to the mainline kernel. Then they can also enable MMU_NOTIFIERS in > the same way KVM does it (even if KVM=n). This guarantees nobody > selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n. > > The mmu_notifier_register call can fail because mm_take_all_locks may > be interrupted by a signal and return -EINTR. Because > mmu_notifier_reigster is used when a driver startup, a failure can be > gracefully handled. Here an example of the change applied to kvm to > register the mmu notifiers. Usually when a driver startups other > allocations are required anyway and -ENOMEM failure paths exists > already. > > struct kvm *kvm_arch_create_vm(void) > { > struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); > + int err; > > if (!kvm) > return ERR_PTR(-ENOMEM); > > INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); > > + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops; > + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm); > + if (err) { > + kfree(kvm); > + return ERR_PTR(err); > + } > + > return kvm; > } > > mmu_notifier_unregister returns void and it's reliable. > > Signed-off-by: Andrea Arcangeli > Signed-off-by: Nick Piggin > Signed-off-by: Christoph Lameter > --- > > Full patchset is here: > > http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc1/mmu-notifier-v17 > > Thanks! 
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > --- a/arch/x86/kvm/Kconfig > +++ b/arch/x86/kvm/Kconfig > @@ -21,6 +21,7 @@ config KVM > tristate "Kernel-based Virtual Machine (KVM) support" > depends on HAVE_KVM > select PREEMPT_NOTIFIERS > + select MMU_NOTIFIER > select ANON_INODES > ---help--- > Support hosting fully virtualized guest machines using hardware > diff --git a/include/linux/list.h b/include/linux/list.h > --- a/include/linux/list.h > +++ b/include/linux/list.h > @@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis > * or hlist_del_rcu(), running on this same list. > * However, it is perfectly legal to run concurrently with > * the _rcu list-traversal primitives, such as > - * hlist_for_each_entry(). > + * hlist_for_each_entry_rcu(). > */ > static inline void hlist_del_rcu(struct hlist_node *n) > { > @@ -760,6 +760,34 @@ static inline void hlist_del_init(struct > if (!hlist_unhashed(n)) { > __hlist_del(n); > INIT_HLIST_NODE(n); > + } > +} > + > +/** > + * hlist_del_init_rcu - deletes entry from hash list with re-initialization > + * @n: the element to delete from the hash list. > + * > + * Note: list_unhashed() on the node return true after this. It is > + * useful for RCU based read lockfree traversal if the writer side > + * must know if the list entry is still hashed or already unhashed. > + * > + * In particular, it means that we can not poison the forward pointers > + * that may still be used for walking the hash list and we can only > + * zero the pprev pointer so list_unhashed() will return true after > + * this. > + * > + * The caller must take whatever precautions are necessary (such as > + * holding appropriate locks) to avoid racing with another > + * list-mutation primitive, such as hlist_add_head_rcu() or > + * hlist_del_rcu(), running on this same list. However, it is > + * perfectly legal to run concurrently with the _rcu list-traversal > + * primitives, such as hlist_for_each_entry_rcu(). > + */ > +static inline void hlist_del_init_rcu(struct hlist_node *n) > +{ > + if (!hlist_unhashed(n)) { > + __hlist_del(n); > + n->pprev = NULL; > } > } > > diff --git a/include/linux/mm.h b/include/linux/mm.h > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1067,6 +1067,9 @@ extern struct vm_area_struct *copy_vma(s > unsigned long addr, unsigned long len, pgoff_t pgoff); > extern void exit_mmap(struct mm_struct *); > > +extern int mm_take_all_locks(struct mm_struct *mm); > +extern void mm_drop_all_locks(struct mm_struct *mm); > + > #ifdef CONFIG_PROC_FS > /* From fs/proc/base.c. 
callers must _not_ hold the mm's exe_file_lock */ > extern void added_exe_file_vma(struct mm_struct *mm); > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -10,6 +10,7 @@ > #include > #include > #include > +#include > #include > #include > > @@ -19,6 +20,7 @@ > #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1)) > > struct address_space; > +struct mmu_notifier_mm; > > #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS > typedef atomic_long_t mm_counter_t; > @@ -235,6 +237,9 @@ struct mm_struct { > struct file *exe_file; > unsigned long num_exe_file_vmas; > #endif > +#ifdef CONFIG_MMU_NOTIFIER > + struct mmu_notifier_mm *mmu_notifier_mm; > +#endif > }; > > #endif /* _LINUX_MM_TYPES_H */ > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > new file mode 100644 > --- /dev/null > +++ b/include/linux/mmu_notifier.h > @@ -0,0 +1,279 @@ > +#ifndef _LINUX_MMU_NOTIFIER_H > +#define _LINUX_MMU_NOTIFIER_H > + > +#include > +#include > +#include > + > +struct mmu_notifier; > +struct mmu_notifier_ops; > + > +#ifdef CONFIG_MMU_NOTIFIER > + > +/* > + * The mmu notifier_mm structure is allocated and installed in > + * mm->mmu_notifier_mm inside the mm_take_all_locks() protected > + * critical section and it's released only when mm_count reaches zero > + * in mmdrop(). > + */ > +struct mmu_notifier_mm { > + /* all mmu notifiers registerd in this mm are queued in this list */ > + struct hlist_head list; > + /* to serialize the list modifications and hlist_unhashed */ > + spinlock_t lock; > +}; > + > +struct mmu_notifier_ops { > + /* > + * Called either by mmu_notifier_unregister or when the mm is > + * being destroyed by exit_mmap, always before all pages are > + * freed. This can run concurrently with other mmu notifier > + * methods (the ones invoked outside the mm context) and it > + * should tear down all secondary mmu mappings and freeze the > + * secondary mmu. If this method isn't implemented you've to > + * be sure that nothing could possibly write to the pages > + * through the secondary mmu by the time the last thread with > + * tsk->mm == mm exits. > + * > + * As side note: the pages freed after ->release returns could > + * be immediately reallocated by the gart at an alias physical > + * address with a different cache model, so if ->release isn't > + * implemented because all _software_ driven memory accesses > + * through the secondary mmu are terminated by the time the > + * last thread of this mm quits, you've also to be sure that > + * speculative _hardware_ operations can't allocate dirty > + * cachelines in the cpu that could not be snooped and made > + * coherent with the other read and write operations happening > + * through the gart alias address, so leading to memory > + * corruption. > + */ > + void (*release)(struct mmu_notifier *mn, > + struct mm_struct *mm); > + > + /* > + * clear_flush_young is called after the VM is > + * test-and-clearing the young/accessed bitflag in the > + * pte. This way the VM will provide proper aging to the > + * accesses to the page through the secondary MMUs and not > + * only to the ones through the Linux pte. 
> + */ > + int (*clear_flush_young)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long address); > + > + /* > + * Before this is invoked, any secondary MMU is still ok to > + * read/write to the page previously pointed to by the Linux > + * pte because the page hasn't been freed yet and it won't be > + * freed until this returns. If required, set_page_dirty has to > + * be called from within this method. > + */ > + void (*invalidate_page)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long address); > + > + /* > + * invalidate_range_start() and invalidate_range_end() must be > + * paired and are called only when the mmap_sem and/or the > + * locks protecting the reverse maps are held. The subsystem > + * must guarantee that no additional references are taken to > + * the pages in the range established between the call to > + * invalidate_range_start() and the matching call to > + * invalidate_range_end(). > + * > + * Invalidation of multiple concurrent ranges may be > + * optionally permitted by the driver. Either way the > + * establishment of sptes is forbidden in the range passed to > + * invalidate_range_start/end for the whole duration of the > + * invalidate_range_start/end critical section. > + * > + * invalidate_range_start() is called when all pages in the > + * range are still mapped and have at least a refcount of one. > + * > + * invalidate_range_end() is called when all pages in the > + * range have been unmapped and the pages have been freed by > + * the VM. > + * > + * The VM will remove the page table entries and potentially > + * the page between invalidate_range_start() and > + * invalidate_range_end(). If the page must not be freed > + * because of pending I/O or other circumstances then the > + * invalidate_range_start() callback (or the initial mapping > + * by the driver) must make sure that the refcount is kept > + * elevated. > + * > + * If the driver increases the refcount when the pages are > + * initially mapped into an address space then either > + * invalidate_range_start() or invalidate_range_end() may > + * decrease the refcount. If the refcount is decreased on > + * invalidate_range_start() then the VM can free pages as page > + * table entries are removed. If the refcount is only > + * dropped on invalidate_range_end() then the driver itself > + * will drop the last refcount but it must take care to flush > + * any secondary tlb before doing the final free on the > + * page. Pages will no longer be referenced by the linux > + * address space but may still be referenced by sptes until > + * the last refcount is dropped. > + */ > + void (*invalidate_range_start)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long start, unsigned long end); > + void (*invalidate_range_end)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long start, unsigned long end); > +}; > + > +/* > + * The notifier chains are protected by mmap_sem and/or the reverse map > + * semaphores. Notifier chains are only changed when all reverse maps and > + * the mmap_sem locks are taken. > + * > + * Therefore notifier chains can only be traversed when either > + * > + * 1. mmap_sem is held. > + * 2. One of the reverse map locks is held (i_mmap_lock or anon_vma->lock). > + * 3.
No other concurrent thread can access the list (release) > + */ > +struct mmu_notifier { > + struct hlist_node hlist; > + const struct mmu_notifier_ops *ops; > +}; > + > +static inline int mm_has_notifiers(struct mm_struct *mm) > +{ > + return unlikely(mm->mmu_notifier_mm); > +} > + > +extern int mmu_notifier_register(struct mmu_notifier *mn, > + struct mm_struct *mm); > +extern int __mmu_notifier_register(struct mmu_notifier *mn, > + struct mm_struct *mm); > +extern void mmu_notifier_unregister(struct mmu_notifier *mn, > + struct mm_struct *mm); > +extern void __mmu_notifier_mm_destroy(struct mm_struct *mm); > +extern void __mmu_notifier_release(struct mm_struct *mm); > +extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm, > + unsigned long address); > +extern void __mmu_notifier_invalidate_page(struct mm_struct *mm, > + unsigned long address); > +extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, > + unsigned long start, unsigned long end); > +extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, > + unsigned long start, unsigned long end); > + > +static inline void mmu_notifier_release(struct mm_struct *mm) > +{ > + if (mm_has_notifiers(mm)) > + __mmu_notifier_release(mm); > +} > + > +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, > + unsigned long address) > +{ > + if (mm_has_notifiers(mm)) > + return __mmu_notifier_clear_flush_young(mm, address); > + return 0; > +} > + > +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, > + unsigned long address) > +{ > + if (mm_has_notifiers(mm)) > + __mmu_notifier_invalidate_page(mm, address); > +} > + > +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, > + unsigned long start, unsigned long end) > +{ > + if (mm_has_notifiers(mm)) > + __mmu_notifier_invalidate_range_start(mm, start, end); > +} > + > +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, > + unsigned long start, unsigned long end) > +{ > + if (mm_has_notifiers(mm)) > + __mmu_notifier_invalidate_range_end(mm, start, end); > +} > + > +static inline void mmu_notifier_mm_init(struct mm_struct *mm) > +{ > + mm->mmu_notifier_mm = NULL; > +} > + > +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) > +{ > + if (mm_has_notifiers(mm)) > + __mmu_notifier_mm_destroy(mm); > +} > + > +/* > + * These two macros will eventually replace ptep_clear_flush. > + * ptep_clear_flush is implemented as a macro itself, so this also is > + * implemented as a macro until ptep_clear_flush is converted to an > + * inline function, to diminish the risk of compilation failure. The > + * invalidate_page method over time can be moved outside the PT lock > + * and these two macros can later be removed.
> + */ > +#define ptep_clear_flush_notify(__vma, __address, __ptep) \ > +({ \ > + pte_t __pte; \ > + struct vm_area_struct *___vma = __vma; \ > + unsigned long ___address = __address; \ > + __pte = ptep_clear_flush(___vma, ___address, __ptep); \ > + mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ > + __pte; \ > +}) > + > +#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ > +({ \ > + int __young; \ > + struct vm_area_struct *___vma = __vma; \ > + unsigned long ___address = __address; \ > + __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ > + __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \ > + ___address); \ > + __young; \ > +}) > + > +#else /* CONFIG_MMU_NOTIFIER */ > + > +static inline void mmu_notifier_release(struct mm_struct *mm) > +{ > +} > + > +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, > + unsigned long address) > +{ > + return 0; > +} > + > +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, > + unsigned long address) > +{ > +} > + > +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, > + unsigned long start, unsigned long end) > +{ > +} > + > +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, > + unsigned long start, unsigned long end) > +{ > +} > + > +static inline void mmu_notifier_mm_init(struct mm_struct *mm) > +{ > +} > + > +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) > +{ > +} > + > +#define ptep_clear_flush_young_notify ptep_clear_flush_young > +#define ptep_clear_flush_notify ptep_clear_flush > + > +#endif /* CONFIG_MMU_NOTIFIER */ > + > +#endif /* _LINUX_MMU_NOTIFIER_H */ > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -19,6 +19,7 @@ > */ > #define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */ > #define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */ > +#define AS_MM_ALL_LOCKS (__GFP_BITS_SHIFT + 2) /* under mm_take_all_locks() */ > > static inline void mapping_set_error(struct address_space *mapping, int error) > { > diff --git a/include/linux/rmap.h b/include/linux/rmap.h > --- a/include/linux/rmap.h > +++ b/include/linux/rmap.h > @@ -26,6 +26,14 @@ > */ > struct anon_vma { > spinlock_t lock; /* Serialize access to vma list */ > + /* > + * NOTE: the LSB of the head.next is set by > + * mm_take_all_locks() _after_ taking the above lock. So the > + * head must only be read/written after taking the above lock > + * to be sure to see a valid next pointer. The LSB bit itself > + * is serialized by a system wide lock only visible to > + * mm_take_all_locks() (mm_all_locks_mutex). 
> + */ > struct list_head head; /* List of private "related" vmas */ > }; > > diff --git a/kernel/fork.c b/kernel/fork.c > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -54,6 +54,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -386,6 +387,7 @@ static struct mm_struct * mm_init(struct > > if (likely(!mm_alloc_pgd(mm))) { > mm->def_flags = 0; > + mmu_notifier_mm_init(mm); > return mm; > } > > @@ -418,6 +420,7 @@ void __mmdrop(struct mm_struct *mm) > BUG_ON(mm == &init_mm); > mm_free_pgd(mm); > destroy_context(mm); > + mmu_notifier_mm_destroy(mm); > free_mm(mm); > } > EXPORT_SYMBOL_GPL(__mmdrop); > diff --git a/mm/Kconfig b/mm/Kconfig > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -205,3 +205,6 @@ config VIRT_TO_BUS > config VIRT_TO_BUS > def_bool y > depends on !ARCH_NO_VIRT_TO_BUS > + > +config MMU_NOTIFIER > + bool > diff --git a/mm/Makefile b/mm/Makefile > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -33,4 +33,5 @@ obj-$(CONFIG_SMP) += allocpercpu.o > obj-$(CONFIG_SMP) += allocpercpu.o > obj-$(CONFIG_QUICKLIST) += quicklist.o > obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o > +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o > > diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c > --- a/mm/filemap_xip.c > +++ b/mm/filemap_xip.c > @@ -188,7 +188,7 @@ __xip_unmap (struct address_space * mapp > if (pte) { > /* Nuke the page table entry. */ > flush_cache_page(vma, address, pte_pfn(*pte)); > - pteval = ptep_clear_flush(vma, address, pte); > + pteval = ptep_clear_flush_notify(vma, address, pte); > page_remove_rmap(page, vma); > dec_mm_counter(mm, file_rss); > BUG_ON(pte_dirty(pteval)); > diff --git a/mm/fremap.c b/mm/fremap.c > --- a/mm/fremap.c > +++ b/mm/fremap.c > @@ -15,6 +15,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns > spin_unlock(&mapping->i_mmap_lock); > } > > + mmu_notifier_invalidate_range_start(mm, start, start + size); > err = populate_range(mm, vma, start, size, pgoff); > + mmu_notifier_invalidate_range_end(mm, start, start + size); > if (!err && !(flags & MAP_NONBLOCK)) { > if (unlikely(has_write_lock)) { > downgrade_write(&mm->mmap_sem); > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -14,6 +14,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -823,6 +824,7 @@ void __unmap_hugepage_range(struct vm_ar > BUG_ON(start & ~HPAGE_MASK); > BUG_ON(end & ~HPAGE_MASK); > > + mmu_notifier_invalidate_range_start(mm, start, end); > spin_lock(&mm->page_table_lock); > for (address = start; address < end; address += HPAGE_SIZE) { > ptep = huge_pte_offset(mm, address); > @@ -843,6 +845,7 @@ void __unmap_hugepage_range(struct vm_ar > } > spin_unlock(&mm->page_table_lock); > flush_tlb_range(vma, start, end); > + mmu_notifier_invalidate_range_end(mm, start, end); > list_for_each_entry_safe(page, tmp, &page_list, lru) { > list_del(&page->lru); > put_page(page); > diff --git a/mm/memory.c b/mm/memory.c > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -51,6 +51,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -632,6 +633,7 @@ int copy_page_range(struct mm_struct *ds > unsigned long next; > unsigned long addr = vma->vm_start; > unsigned long end = vma->vm_end; > + int ret; > > /* > * Don't copy ptes where a page fault will fill them correctly. 
> @@ -647,17 +649,33 @@ int copy_page_range(struct mm_struct *ds > if (is_vm_hugetlb_page(vma)) > return copy_hugetlb_page_range(dst_mm, src_mm, vma); > > + /* > + * We need to invalidate the secondary MMU mappings only when > + * there could be a permission downgrade on the ptes of the > + * parent mm. And a permission downgrade will only happen if > + * is_cow_mapping() returns true. > + */ > + if (is_cow_mapping(vma->vm_flags)) > + mmu_notifier_invalidate_range_start(src_mm, addr, end); > + > + ret = 0; > dst_pgd = pgd_offset(dst_mm, addr); > src_pgd = pgd_offset(src_mm, addr); > do { > next = pgd_addr_end(addr, end); > if (pgd_none_or_clear_bad(src_pgd)) > continue; > - if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, > - vma, addr, next)) > - return -ENOMEM; > + if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, > + vma, addr, next))) { > + ret = -ENOMEM; > + break; > + } > } while (dst_pgd++, src_pgd++, addr = next, addr != end); > - return 0; > + > + if (is_cow_mapping(vma->vm_flags)) > + mmu_notifier_invalidate_range_end(src_mm, > + vma->vm_start, end); > + return ret; > } > > static unsigned long zap_pte_range(struct mmu_gather *tlb, > @@ -861,7 +879,9 @@ unsigned long unmap_vmas(struct mmu_gath > unsigned long start = start_addr; > spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL; > int fullmm = (*tlbp)->fullmm; > + struct mm_struct *mm = vma->vm_mm; > > + mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); > for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { > unsigned long end; > > @@ -912,6 +932,7 @@ unsigned long unmap_vmas(struct mmu_gath > } > } > out: > + mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); > return start; /* which is now the end (or restart) address */ > } > > @@ -1544,10 +1565,11 @@ int apply_to_page_range(struct mm_struct > { > pgd_t *pgd; > unsigned long next; > - unsigned long end = addr + size; > + unsigned long start = addr, end = addr + size; > int err; > > BUG_ON(addr >= end); > + mmu_notifier_invalidate_range_start(mm, start, end); > pgd = pgd_offset(mm, addr); > do { > next = pgd_addr_end(addr, end); > @@ -1555,6 +1577,7 @@ int apply_to_page_range(struct mm_struct > if (err) > break; > } while (pgd++, addr = next, addr != end); > + mmu_notifier_invalidate_range_end(mm, start, end); > return err; > } > EXPORT_SYMBOL_GPL(apply_to_page_range); > @@ -1756,7 +1779,7 @@ gotten: > * seen in the presence of one thread doing SMC and another > * thread doing COW. > */ > - ptep_clear_flush(vma, address, page_table); > + ptep_clear_flush_notify(vma, address, page_table); > set_pte_at(mm, address, page_table, entry); > update_mmu_cache(vma, address, entry); > lru_cache_add_active(new_page); > diff --git a/mm/mmap.c b/mm/mmap.c > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -26,6 +26,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -2048,6 +2049,7 @@ void exit_mmap(struct mm_struct *mm) > > /* mm's last user has gone, and its about to be pulled down */ > arch_exit_mmap(mm); > + mmu_notifier_release(mm); > > lru_add_drain(); > flush_cache_mm(mm); > @@ -2255,3 +2257,152 @@ int install_special_mapping(struct mm_st > > return 0; > } > + > +static DEFINE_MUTEX(mm_all_locks_mutex); > + > +/* > + * This operation locks against the VM for all pte/vma/mm related > + * operations that could ever happen on a certain mm. This includes > + * vmtruncate, try_to_unmap, and all page faults. > + * > + * The caller must take the mmap_sem in write mode before calling > + * mm_take_all_locks(). 
The caller isn't allowed to release the > + * mmap_sem until mm_drop_all_locks() returns. > + * > + * mmap_sem in write mode is required in order to block all operations > + * that could modify pagetables and free pages without needing to > + * alter the vma layout (for example populate_range() with > + * nonlinear vmas). It's also needed in write mode to prevent new > + * anon_vmas from being associated with existing vmas. > + * > + * A single task can't take more than one mm_take_all_locks() in a row > + * or it would deadlock. > + * > + * The LSB in anon_vma->head.next and the AS_MM_ALL_LOCKS bitflag in > + * mapping->flags avoid taking the same lock twice, if more than one > + * vma in this mm is backed by the same anon_vma or address_space. > + * > + * We can take all the locks in random order because the VM code > + * taking i_mmap_lock or anon_vma->lock outside the mmap_sem never > + * takes more than one of them in a row. Secondly we're protected > + * against a concurrent mm_take_all_locks() by the mm_all_locks_mutex. > + * > + * mm_take_all_locks() and mm_drop_all_locks() are expensive operations > + * that may have to take thousands of locks. > + * > + * mm_take_all_locks() can fail if it's interrupted by signals. > + */ > +int mm_take_all_locks(struct mm_struct *mm) > +{ > + struct vm_area_struct *vma; > + int ret = -EINTR; > + > + BUG_ON(down_read_trylock(&mm->mmap_sem)); > + > + mutex_lock(&mm_all_locks_mutex); > + > + for (vma = mm->mmap; vma; vma = vma->vm_next) { > + struct file *filp; > + if (signal_pending(current)) > + goto out_unlock; > + if (vma->anon_vma && !test_bit(0, (unsigned long *) > + &vma->anon_vma->head.next)) { > + /* > + * The LSB of head.next can't change from > + * under us because we hold the > + * global_mm_spinlock. > + */ > + spin_lock(&vma->anon_vma->lock); > + /* > + * We can safely modify head.next after taking > + * the anon_vma->lock. If some other vma in > + * this mm shares the same anon_vma we won't > + * take it again. > + * > + * No need of atomic instructions here, > + * head.next can't change from under us thanks > + * to the anon_vma->lock. > + */ > + if (__test_and_set_bit(0, (unsigned long *) > + &vma->anon_vma->head.next)) > + BUG(); > + } > + > + filp = vma->vm_file; > + if (filp && filp->f_mapping && > + !test_bit(AS_MM_ALL_LOCKS, &filp->f_mapping->flags)) { > + /* > + * AS_MM_ALL_LOCKS can't change from under us > + * because we hold the global_mm_spinlock. > + * > + * Operations on ->flags have to be atomic > + * because even if AS_MM_ALL_LOCKS is stable > + * thanks to the global_mm_spinlock, there may > + * be other cpus changing other bitflags in > + * parallel to us. > + */ > + if (test_and_set_bit(AS_MM_ALL_LOCKS, > + &filp->f_mapping->flags)) > + BUG(); > + spin_lock(&filp->f_mapping->i_mmap_lock); > + } > + } > + ret = 0; > + > +out_unlock: > + if (ret) > + mm_drop_all_locks(mm); > + > + return ret; > +} > + > +/* > + * The mmap_sem cannot be released by the caller until > + * mm_drop_all_locks() returns. > + */ > +void mm_drop_all_locks(struct mm_struct *mm) > +{ > + struct vm_area_struct *vma; > + > + BUG_ON(down_read_trylock(&mm->mmap_sem)); > + BUG_ON(!mutex_is_locked(&mm_all_locks_mutex)); > + > + for (vma = mm->mmap; vma; vma = vma->vm_next) { > + struct file *filp; > + if (vma->anon_vma && > + test_bit(0, (unsigned long *) > + &vma->anon_vma->head.next)) { > + /* > + * The LSB of head.next can't change to 0 from > + * under us because we hold the > + * global_mm_spinlock.
> + * > + * We must however clear the bitflag before > + * unlocking the vma so the users using the > + * anon_vma->head will never see our bitflag. > + * > + * No need of atomic instructions here, > + * head.next can't change from under us until > + * we release the anon_vma->lock. > + */ > + if (!__test_and_clear_bit(0, (unsigned long *) > + &vma->anon_vma->head.next)) > + BUG(); > + spin_unlock(&vma->anon_vma->lock); > + } > + filp = vma->vm_file; > + if (filp && filp->f_mapping && > + test_bit(AS_MM_ALL_LOCKS, &filp->f_mapping->flags)) { > + /* > + * AS_MM_ALL_LOCKS can't change to 0 from under us > + * because we hold the global_mm_spinlock. > + */ > + spin_unlock(&filp->f_mapping->i_mmap_lock); > + if (!test_and_clear_bit(AS_MM_ALL_LOCKS, > + &filp->f_mapping->flags)) > + BUG(); > + } > + } > + > + mutex_unlock(&mm_all_locks_mutex); > +} > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c > new file mode 100644 > --- /dev/null > +++ b/mm/mmu_notifier.c > @@ -0,0 +1,276 @@ > +/* > + * linux/mm/mmu_notifier.c > + * > + * Copyright (C) 2008 Qumranet, Inc. > + * Copyright (C) 2008 SGI > + * Christoph Lameter > + * > + * This work is licensed under the terms of the GNU GPL, version 2. See > + * the COPYING file in the top-level directory. > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > + > +/* > + * This function can't run concurrently against mmu_notifier_register > + * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap > + * runs with mm_users == 0. Other tasks may still invoke mmu notifiers > + * in parallel despite there being no task using this mm any more, > + * through the vmas outside of the exit_mmap context, such as with > + * vmtruncate. This serializes against mmu_notifier_unregister with > + * the mmu_notifier_mm->lock in addition to RCU and it serializes > + * against the other mmu notifiers with RCU. struct mmu_notifier_mm > + * can't go away from under us as exit_mmap holds an mm_count pin > + * itself. > + */ > +void __mmu_notifier_release(struct mm_struct *mm) > +{ > + struct mmu_notifier *mn; > + > + spin_lock(&mm->mmu_notifier_mm->lock); > + while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) { > + mn = hlist_entry(mm->mmu_notifier_mm->list.first, > + struct mmu_notifier, > + hlist); > + /* > + * We arrived before mmu_notifier_unregister so > + * mmu_notifier_unregister will do nothing other than > + * wait for ->release to finish and for > + * mmu_notifier_unregister to return. > + */ > + hlist_del_init_rcu(&mn->hlist); > + /* > + * RCU here will block mmu_notifier_unregister until > + * ->release returns. > + */ > + rcu_read_lock(); > + spin_unlock(&mm->mmu_notifier_mm->lock); > + /* > + * if ->release runs before mmu_notifier_unregister it > + * must be handled as it's the only way for the driver > + * to flush all existing sptes and stop the driver > + * from establishing any more sptes before all the > + * pages in the mm are freed. > + */ > + if (mn->ops->release) > + mn->ops->release(mn, mm); > + rcu_read_unlock(); > + spin_lock(&mm->mmu_notifier_mm->lock); > + } > + spin_unlock(&mm->mmu_notifier_mm->lock); > + > + /* > + * synchronize_rcu here prevents mmu_notifier_release from > + * returning to exit_mmap (which would proceed to free all pages > + * in the mm) until the ->release method returns, if it was > + * invoked by mmu_notifier_unregister. > + * > + * The mmu_notifier_mm can't go away from under us because one > + * mm_count is held by exit_mmap.
> + */ > + synchronize_rcu(); > +} > + > +/* > + * If no young bitflag is supported by the hardware, ->clear_flush_young can > + * unmap the address and return 1 or 0 depending if the mapping previously > + * existed or not. > + */ > +int __mmu_notifier_clear_flush_young(struct mm_struct *mm, > + unsigned long address) > +{ > + struct mmu_notifier *mn; > + struct hlist_node *n; > + int young = 0; > + > + rcu_read_lock(); > + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { > + if (mn->ops->clear_flush_young) > + young |= mn->ops->clear_flush_young(mn, mm, address); > + } > + rcu_read_unlock(); > + > + return young; > +} > + > +void __mmu_notifier_invalidate_page(struct mm_struct *mm, > + unsigned long address) > +{ > + struct mmu_notifier *mn; > + struct hlist_node *n; > + > + rcu_read_lock(); > + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { > + if (mn->ops->invalidate_page) > + mn->ops->invalidate_page(mn, mm, address); > + } > + rcu_read_unlock(); > +} > + > +void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, > + unsigned long start, unsigned long end) > +{ > + struct mmu_notifier *mn; > + struct hlist_node *n; > + > + rcu_read_lock(); > + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { > + if (mn->ops->invalidate_range_start) > + mn->ops->invalidate_range_start(mn, mm, start, end); > + } > + rcu_read_unlock(); > +} > + > +void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, > + unsigned long start, unsigned long end) > +{ > + struct mmu_notifier *mn; > + struct hlist_node *n; > + > + rcu_read_lock(); > + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { > + if (mn->ops->invalidate_range_end) > + mn->ops->invalidate_range_end(mn, mm, start, end); > + } > + rcu_read_unlock(); > +} > + > +static int do_mmu_notifier_register(struct mmu_notifier *mn, > + struct mm_struct *mm, > + int take_mmap_sem) > +{ > + struct mmu_notifier_mm * mmu_notifier_mm; > + int ret; > + > + BUG_ON(atomic_read(&mm->mm_users) <= 0); > + > + ret = -ENOMEM; > + mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL); > + if (unlikely(!mmu_notifier_mm)) > + goto out; > + > + if (take_mmap_sem) > + down_write(&mm->mmap_sem); > + ret = mm_take_all_locks(mm); > + if (unlikely(ret)) > + goto out_cleanup; > + > + if (!mm_has_notifiers(mm)) { > + INIT_HLIST_HEAD(&mmu_notifier_mm->list); > + spin_lock_init(&mmu_notifier_mm->lock); > + mm->mmu_notifier_mm = mmu_notifier_mm; > + mmu_notifier_mm = NULL; > + } > + atomic_inc(&mm->mm_count); > + > + /* > + * Serialize the update against mmu_notifier_unregister. A > + * side note: mmu_notifier_release can't run concurrently with > + * us because we hold the mm_users pin (either implicitly as > + * current->mm or explicitly with get_task_mm() or similar). > + * We can't race against any other mmu notifier method either > + * thanks to mm_take_all_locks(). > + */ > + spin_lock(&mm->mmu_notifier_mm->lock); > + hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list); > + spin_unlock(&mm->mmu_notifier_mm->lock); > + > + mm_drop_all_locks(mm); > +out_cleanup: > + if (take_mmap_sem) > + up_write(&mm->mmap_sem); > + /* kfree() does nothing if mmu_notifier_mm is NULL */ > + kfree(mmu_notifier_mm); > +out: > + BUG_ON(atomic_read(&mm->mm_users) <= 0); > + return ret; > +} > + > +/* > + * Must not hold mmap_sem nor any other VM related lock when calling > + * this registration function. 
Must also ensure mm_users can't go down > + * to zero while this runs to avoid races with mmu_notifier_release, > + * so mm has to be current->mm or the mm should be pinned safely such > + * as with get_task_mm(). If the mm is not current->mm, the mm_users > + * pin should be released by calling mmput after mmu_notifier_register > + * returns. mmu_notifier_unregister must be always called to > + * unregister the notifier. mm_count is automatically pinned to allow > + * mmu_notifier_unregister to safely run at any time later, before or > + * after exit_mmap. ->release will always be called before exit_mmap > + * frees the pages. > + */ > +int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) > +{ > + return do_mmu_notifier_register(mn, mm, 1); > +} > +EXPORT_SYMBOL_GPL(mmu_notifier_register); > + > +/* > + * Same as mmu_notifier_register but here the caller must hold the > + * mmap_sem in write mode. > + */ > +int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) > +{ > + return do_mmu_notifier_register(mn, mm, 0); > +} > +EXPORT_SYMBOL_GPL(__mmu_notifier_register); > + > +/* this is called after the last mmu_notifier_unregister() returned */ > +void __mmu_notifier_mm_destroy(struct mm_struct *mm) > +{ > + BUG_ON(!hlist_empty(&mm->mmu_notifier_mm->list)); > + kfree(mm->mmu_notifier_mm); > + mm->mmu_notifier_mm = LIST_POISON1; /* debug */ > +} > + > +/* > + * This releases the mm_count pin automatically and frees the mm > + * structure if it was the last user of it. It serializes against > + * running mmu notifiers with RCU and against mmu_notifier_unregister > + * with the unregister lock + RCU. All sptes must be dropped before > + * calling mmu_notifier_unregister. ->release or any other notifier > + * method may be invoked concurrently with mmu_notifier_unregister, > + * and only after mmu_notifier_unregister returned we're guaranteed > + * that ->release or any other method can't run anymore. > + */ > +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) > +{ > + BUG_ON(atomic_read(&mm->mm_count) <= 0); > + > + spin_lock(&mm->mmu_notifier_mm->lock); > + if (!hlist_unhashed(&mn->hlist)) { > + hlist_del_rcu(&mn->hlist); > + > + /* > + * RCU here will force exit_mmap to wait ->release to finish > + * before freeing the pages. > + */ > + rcu_read_lock(); > + spin_unlock(&mm->mmu_notifier_mm->lock); > + /* > + * exit_mmap will block in mmu_notifier_release to > + * guarantee ->release is called before freeing the > + * pages. > + */ > + if (mn->ops->release) > + mn->ops->release(mn, mm); > + rcu_read_unlock(); > + } else > + spin_unlock(&mm->mmu_notifier_mm->lock); > + > + /* > + * Wait any running method to finish, of course including > + * ->release if it was run by mmu_notifier_relase instead of us. 
> + */ > + synchronize_rcu(); > + > + BUG_ON(atomic_read(&mm->mm_count) <= 0); > + > + mmdrop(mm); > +} > +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); > diff --git a/mm/mprotect.c b/mm/mprotect.c > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -21,6 +21,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -198,10 +199,12 @@ success: > dirty_accountable = 1; > } > > + mmu_notifier_invalidate_range_start(mm, start, end); > if (is_vm_hugetlb_page(vma)) > hugetlb_change_protection(vma, start, end, vma->vm_page_prot); > else > change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); > + mmu_notifier_invalidate_range_end(mm, start, end); > vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); > vm_stat_account(mm, newflags, vma->vm_file, nrpages); > return 0; > diff --git a/mm/mremap.c b/mm/mremap.c > --- a/mm/mremap.c > +++ b/mm/mremap.c > @@ -18,6 +18,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -74,7 +75,11 @@ static void move_ptes(struct vm_area_str > struct mm_struct *mm = vma->vm_mm; > pte_t *old_pte, *new_pte, pte; > spinlock_t *old_ptl, *new_ptl; > + unsigned long old_start; > > + old_start = old_addr; > + mmu_notifier_invalidate_range_start(vma->vm_mm, > + old_start, old_end); > if (vma->vm_file) { > /* > * Subtle point from Rajesh Venkatasubramanian: before > @@ -116,6 +121,7 @@ static void move_ptes(struct vm_area_str > pte_unmap_unlock(old_pte - 1, old_ptl); > if (mapping) > spin_unlock(&mapping->i_mmap_lock); > + mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); > } > > #define LATENCY_LIMIT (64 * PAGE_SIZE) > diff --git a/mm/rmap.c b/mm/rmap.c > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -49,6 +49,7 @@ > #include > #include > #include > +#include > > #include > > @@ -287,7 +288,7 @@ static int page_referenced_one(struct pa > if (vma->vm_flags & VM_LOCKED) { > referenced++; > *mapcount = 1; /* break early from loop */ > - } else if (ptep_clear_flush_young(vma, address, pte)) > + } else if (ptep_clear_flush_young_notify(vma, address, pte)) > referenced++; > > /* Pretend the page is referenced if the task has the > @@ -457,7 +458,7 @@ static int page_mkclean_one(struct page > pte_t entry; > > flush_cache_page(vma, address, pte_pfn(*pte)); > - entry = ptep_clear_flush(vma, address, pte); > + entry = ptep_clear_flush_notify(vma, address, pte); > entry = pte_wrprotect(entry); > entry = pte_mkclean(entry); > set_pte_at(mm, address, pte, entry); > @@ -717,14 +718,14 @@ static int try_to_unmap_one(struct page > * skipped over this mm) then we should reactivate it. > */ > if (!migration && ((vma->vm_flags & VM_LOCKED) || > - (ptep_clear_flush_young(vma, address, pte)))) { > + (ptep_clear_flush_young_notify(vma, address, pte)))) { > ret = SWAP_FAIL; > goto out_unmap; > } > > /* Nuke the page table entry. */ > flush_cache_page(vma, address, page_to_pfn(page)); > - pteval = ptep_clear_flush(vma, address, pte); > + pteval = ptep_clear_flush_notify(vma, address, pte); > > /* Move the dirty bit to the physical page now the pte is gone. */ > if (pte_dirty(pteval)) > @@ -849,12 +850,12 @@ static void try_to_unmap_cluster(unsigne > page = vm_normal_page(vma, address, *pte); > BUG_ON(!page || PageAnon(page)); > > - if (ptep_clear_flush_young(vma, address, pte)) > + if (ptep_clear_flush_young_notify(vma, address, pte)) > continue; > > /* Nuke the page table entry. 
*/ > flush_cache_page(vma, address, pte_pfn(*pte)); > - pteval = ptep_clear_flush(vma, address, pte); > + pteval = ptep_clear_flush_notify(vma, address, pte); > > /* If nonlinear, store the file page offset in the pte. */ > if (page->index != linear_page_index(vma, address)) From dotanba at gmail.com Fri May 16 18:15:03 2008 From: dotanba at gmail.com (Dotan Barak) Date: Sat, 17 May 2008 03:15:03 +0200 Subject: Re: [ofa-general] timeout question In-Reply-To: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> References: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> Message-ID: <482E3197.5040604@gmail.com> Rui Machado wrote: > Hi, > > >>> when setting the timeout in a struct ibv_qp_attr, this value >>> corresponds to the Local ACK timeout which according to the InfiniBand >>> spec will define the transport timer timeout given by the formula >>> 4.096uS * 2^(Local ACK timeout). Is this right? >>> And is there a value for this timeout to be considered "good practice"? >>> >>> >> This value depends on your fabric size, on the HCA you have (and some more factors).. >> >>> Also, in a client-server setup, if this timeout is set to a "big >>> value" (like 30) when the server dies, the client will take that >>> amount of time to realize the failure. Is this correct? >>> >>> >> Yes, after (at least) the calculated timeout multiplied by retry_count usec, the sender QP will get a retry exceeded >> (if there was a SR which was posted without any response from the receiver). >> >> > hmm..... and is there no workaround for this situation? I > mean, if the server dies, isn't there any possibility that > the sender/client realizes this? If the timeout is too large this > can be cumbersome. > > I tried reducing the timeout and indeed the client realizes faster > when the server exits but another problem arises: Without exiting the > server, > on the client side I get the error (retry exceeded) when polling for a > recently posted send - this after some hours. > You don't really need to set a timeout of hours, I believe that a few seconds should be enough for almost any (today's) cluster... For example, a timeout value of 18 corresponds to 4.096 usec * 2^18, roughly one second per retry. > Thank you for the help. > You are welcome :) Dotan From rdreier at cisco.com Fri May 16 12:41:42 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 May 2008 12:41:42 -0700 Subject: [ofa-general] [PATCH/RFC] Remove IB_DEVICE_SEND_W_INV from 2.6.26 Message-ID: Given that we should have full support for memory management extensions pending for 2.6.27, and the support we have for send w/ invalidate in 2.6.26 is incomplete (no provision for returning the STag/L_Key in a receive completion, and no implementation of that in amso1100 for one thing), I think it makes sense to simply remove the IB_DEVICE_SEND_W_INV capability flag rather than moving it to a new bit position. Then when we add all the memory management extension support in 2.6.27, we can just use bit 21 for IB_DEVICE_MEM_MGT_EXTENSIONS and avoid having such fine-grained distinctions, and avoid having all sorts of strange code to monkey around with the SEND_W_INV bit in libibverbs and userspace driver libraries. Thoughts pro or con? - R.
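For reference, IB_DEVICE_SEND_W_INV is the bit a kernel consumer would have to test before deciding to post IB_WR_SEND_WITH_INV. A minimal sketch of such a check follows; this is an illustration only, not code from any in-tree consumer, and the helper name is made up:

#include <rdma/ib_verbs.h>

/* Hypothetical helper: decide once at setup time whether this device
 * accepts send-with-invalidate work requests.
 */
static int can_use_send_with_inv(struct ib_device *device)
{
	struct ib_device_attr attr;

	if (ib_query_device(device, &attr))
		return 0;

	return !!(attr.device_cap_flags & IB_DEVICE_SEND_W_INV);
}

With the flag gone, a 2.6.27 consumer would instead test the single IB_DEVICE_MEM_MGT_EXTENSIONS bit covering the whole feature set.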
diff --git a/drivers/infiniband/hw/amso1100/c2_rnic.c b/drivers/infiniband/hw/amso1100/c2_rnic.c index 9a054c6..b1441ae 100644 --- a/drivers/infiniband/hw/amso1100/c2_rnic.c +++ b/drivers/infiniband/hw/amso1100/c2_rnic.c @@ -455,8 +455,7 @@ int __devinit c2_rnic_init(struct c2_dev *c2dev) IB_DEVICE_CURR_QP_STATE_MOD | IB_DEVICE_SYS_IMAGE_GUID | IB_DEVICE_ZERO_STAG | - IB_DEVICE_MEM_WINDOW | - IB_DEVICE_SEND_W_INV); + IB_DEVICE_MEM_WINDOW); /* Allocate the qptr_array */ c2dev->qptr_array = vmalloc(C2_MAX_CQS * sizeof(void *)); diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..31d30b1 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -105,7 +105,6 @@ enum ib_device_cap_flags { */ IB_DEVICE_UD_IP_CSUM = (1<<18), IB_DEVICE_UD_TSO = (1<<19), - IB_DEVICE_SEND_W_INV = (1<<21), }; enum ib_atomic_cap { From swise at opengridcomputing.com Fri May 16 12:46:42 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 May 2008 14:46:42 -0500 Subject: [ofa-general] Re: [PATCH/RFC] Remove IB_DEVICE_SEND_W_INV from 2.6.26 In-Reply-To: References: Message-ID: <482DE4A2.9070305@opengridcomputing.com> Sounds ok to me. Roland Dreier wrote: > Given that we should have full support for memory management extensions > pending for 2.6.27, and the support we have for send w/ invalidate in > 2.6.26 is incomplete (no provision for returning STag/L_Key in receive > completion and no implementation of that in amso1100 for one thing), I > think it makes sense to simply remove the IB_DEVICE_SEND_W_INV > capability flag rather than moving it to a new bit position. > > Then when we add all the memory management extension support in 2.6.27, > we can just use bit 21 for IB_DEVICE_MEM_MGT_EXTENSIONS and avoid having > such fine grained distinctions, and avoid having all sorts of strange > code to monkey around with the SEND_W_INV bit in libibverbs and > userspace driver libraries. > > Thoughts pro or con? > > - R. > > diff --git a/drivers/infiniband/hw/amso1100/c2_rnic.c b/drivers/infiniband/hw/amso1100/c2_rnic.c > index 9a054c6..b1441ae 100644 > --- a/drivers/infiniband/hw/amso1100/c2_rnic.c > +++ b/drivers/infiniband/hw/amso1100/c2_rnic.c > @@ -455,8 +455,7 @@ int __devinit c2_rnic_init(struct c2_dev *c2dev) > IB_DEVICE_CURR_QP_STATE_MOD | > IB_DEVICE_SYS_IMAGE_GUID | > IB_DEVICE_ZERO_STAG | > - IB_DEVICE_MEM_WINDOW | > - IB_DEVICE_SEND_W_INV); > + IB_DEVICE_MEM_WINDOW); > > /* Allocate the qptr_array */ > c2dev->qptr_array = vmalloc(C2_MAX_CQS * sizeof(void *)); > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index 911a661..31d30b1 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -105,7 +105,6 @@ enum ib_device_cap_flags { > */ > IB_DEVICE_UD_IP_CSUM = (1<<18), > IB_DEVICE_UD_TSO = (1<<19), > - IB_DEVICE_SEND_W_INV = (1<<21), > }; > > enum ib_atomic_cap { > From hrosenstock at xsigo.com Fri May 16 12:52:09 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 16 May 2008 12:52:09 -0700 Subject: [ofa-general] Re: [PATCH] OpenSM: Add a Performance Manager HOWTO to the docs and the dist In-Reply-To: <20080515132723.3add7c6a.weiny2@llnl.gov> References: <20080515132723.3add7c6a.weiny2@llnl.gov> Message-ID: <1210967529.12616.287.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-15 at 13:27 -0700, Ira Weiny wrote: > I decided to write a little HOWTO to help people to set it up. Nice writeup :-) > 5) Can be run in a standby SM I thought it was changed so that it can run in a standalone mode without SM. 
Am I confusing this with something else ? -- Hal From hrosenstock at xsigo.com Fri May 16 13:13:52 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 16 May 2008 13:13:52 -0700 Subject: [ofa-general] [PATCH] [TRIVIAL] OpenSM/doc/modular_routing.txt: Fix typo Message-ID: <1210968832.12616.294.camel@hrosenstock-ws.xsigo.com> OpenSM/doc/modular_routing.txt: Fix typo Signed-off-by: Hal Rosenstock diff --git a/opensm/doc/modular-routing.txt b/opensm/doc/modular-routing.txt index f2f70f0..a531c5a 100644 --- a/opensm/doc/modular-routing.txt +++ b/opensm/doc/modular-routing.txt @@ -64,7 +64,7 @@ standard opensm dump directory (/var/log by default) when OSM_LOG_ROUTING logging flag is set. When routing engine 'file' is activated, but dump file is not specified -or not cannot be open default lid matrix algorithm will be used. +or cannot be opened, the default lid matrix algorithm will be used. There is also a switch forwarding tables dumper which generates a file compatible with dump_lfts.sh output. This file can be used From hrosenstock at xsigo.com Fri May 16 13:16:16 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 16 May 2008 13:16:16 -0700 Subject: [ofa-general] [PATCH] OpenSM/doc/QoS_management_in_OpenSM.txt: Remove mention of OFED Message-ID: <1210968976.12616.298.camel@hrosenstock-ws.xsigo.com> OpenSM/doc/QoS_management_in_OpenSM.txt: Remove mention of OFED I'll leave the other heavy lifting to Yevgeny :-) Signed-off-by: Hal Rosenstock diff --git a/opensm/doc/QoS_management_in_OpenSM.txt b/opensm/doc/QoS_management_in_OpenSM.txt index 307c80f..17a4fd5 100644 --- a/opensm/doc/QoS_management_in_OpenSM.txt +++ b/opensm/doc/QoS_management_in_OpenSM.txt @@ -65,7 +65,7 @@ matching rules (see below). Port group lists ports by: II) QoS Setup (denoted by qos-setup). This section describes how to set up SL2VL and VL Arbitration tables on various nodes in the fabric. -However, this is not supported in OFED 1.3. +However, this is not supported in OpenSM currently. SL2VL and VLArb tables should be configured in the OpenSM options file (default location - /var/cache/opensm/opensm.opts). @@ -203,8 +203,8 @@ policy file and their syntax: qos-setup # This section of the policy file describes how to set up SL2VL and VL # Arbitration tables on various nodes in the fabric. - # However, this is not supported in OFED 1.3 - the section is parsed - # and ignored. SL2VL and VLArb tables should be configured in the + # However, this is not supported in OpenSM currently - the section is + # parsed and ignored. SL2VL and VLArb tables should be configured in the # OpenSM options file (by default - /var/cache/opensm/opensm.opts). end-qos-setup From weiny2 at llnl.gov Fri May 16 13:35:02 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 16 May 2008 13:35:02 -0700 Subject: [ofa-general] Re: [PATCH V2] OpenSM: Add a Performance Manager HOWTO to the docs and the dist In-Reply-To: <1210967529.12616.287.camel@hrosenstock-ws.xsigo.com> References: <20080515132723.3add7c6a.weiny2@llnl.gov> <1210967529.12616.287.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080516133502.27a1e9b6.weiny2@llnl.gov> On Fri, 16 May 2008 12:52:09 -0700 Hal Rosenstock wrote: > On Thu, 2008-05-15 at 13:27 -0700, Ira Weiny wrote: > > I decided to write a little HOWTO to help people to set it up. > > Nice writeup :-) > > > 5) Can be run in a standby SM > > I thought it was changed so that it can run in a standalone mode without > SM. Am I confusing this with something else ? 
> I think you are right, I should have said standalone. However, can't it also work in a standby SM? Yeah, from the patch which Sasha applied: opensm/perfmgr: PerfMgr for SM standby and inactive states Here is an updated patch with the correction. Ira >From 9be13c3da4d34ad0a736ced4c9e3bb5e13a24bb6 Mon Sep 17 00:00:00 2001 From: Ira K. Weiny Date: Thu, 15 May 2008 08:19:17 -0700 Subject: [PATCH] Add a Performance Manager HOWTO to the docs and the dist Signed-off-by: Ira K. Weiny --- opensm/Makefile.am | 3 +- opensm/doc/performance-manager-HOWTO.txt | 153 ++++++++++++++++++++++++++++++ opensm/opensm.spec.in | 2 +- 3 files changed, 156 insertions(+), 2 deletions(-) create mode 100644 opensm/doc/performance-manager-HOWTO.txt diff --git a/opensm/Makefile.am b/opensm/Makefile.am index 3811963..4c79f49 100644 --- a/opensm/Makefile.am +++ b/opensm/Makefile.am @@ -24,8 +24,9 @@ endif man_MANS = man/opensm.8 man/osmtest.8 various_scripts = $(wildcard scripts/*) +docs = doc/performance-manager-HOWTO.txt -EXTRA_DIST = autogen.sh opensm.spec $(various_scripts) $(man_MANS) +EXTRA_DIST = autogen.sh opensm.spec $(various_scripts) $(man_MANS) $(docs) dist-hook: $(EXTRA_DIST) if [ -x $(top_srcdir)/../gen_chlog.sh ] ; then \ diff --git a/opensm/doc/performance-manager-HOWTO.txt b/opensm/doc/performance-manager-HOWTO.txt new file mode 100644 index 0000000..f0380f3 --- /dev/null +++ b/opensm/doc/performance-manager-HOWTO.txt @@ -0,0 +1,153 @@ +OpenSM Performance Manager HOWTO +================================ + +Introduction +============ + +OpenSM now includes a performance manager which collects port counters from +the subnet and stores them internally in OpenSM. + +Some of the features of the performance manager are: + + 1) Collect port data and error counters per the v1.2 spec and store them + in 64-bit internal counts. + 2) Automatic reset of counters when they reach approximately 3/4 full. + (While not guaranteeing that counts will not be missed, this does + keep counts incrementing as best as possible given the current + hardware limitations.) + 3) Basic warnings in the OpenSM log on "critical" errors like symbol + errors. + 4) Automatically detects "outside" resets of counters and adjusts to + continue collecting data. + 5) Can be run when OpenSM is in standby or inactive states. + +Known issues are: + + 1) Data counters will be lost on high data rate links. Sweeping the + fabric fast enough for a DDR link is not practical. + 2) Default partition support only. + + +Setup and Usage +=============== + +Using the Performance Manager consists of 3 steps: + + 1) compiling in support for the perfmgr (optionally: the console + socket as well) + 2) enabling the perfmgr and console in opensm.opts + 3) retrieving data which has been collected. + 3a) using the console to "dump data" + 3b) using a plugin module to store the data to your own + "database" + +Step 1: Compile in support for the Performance Manager +------------------------------------------------------ + +Because of the performance manager's experimental status, it is not enabled at +compile time by default. This will hopefully soon change as more people use +it and confirm that it does not break things... ;-) The configure option is +"--enable-perf-mgr". + +At this time it is really best to enable the console socket option as well. +OpenSM can be run in an "interactive" mode. But with the console socket option +turned on one can also make a connection to a running OpenSM. The console +option is "--enable-console-socket".
This option requires the use of +tcp_wrappers to ensure security. Please be aware of your configuration for +tcp_wrappers as the commands presented in the console can affect the operation +of your subnet. + +The following configure line includes turning on the performance manager as +well as the console: + + ./configure --enable-perf-mgr --enable-console-socket + + +Step 2: Enable the perfmgr and console in opensm.opts +----------------------------------------------------- + +Turning the Performance Manager on is pretty easy: set the following options in +the opensm.opts config file. (Default location is +/var/cache/opensm/opensm.opts.) + + # Turn it all on. + perfmgr TRUE + + # sweep time in seconds + perfmgr_sweep_time_s 180 + + # Dump file to dump the events to + event_db_dump_file /var/log/opensm_port_counters.log + +Also enable the console socket and configure the port for it to listen on, if +desired. + + # console [off|local|socket] + console socket + + # Telnet port for console (default 10000) + console_port 10000 + +As noted above, you also need to set up tcp_wrappers to prevent unauthorized +users from connecting to the console.[*] + + [*] As an alternative you can use the loopback mode, but I noticed when + writing this (OpenSM v3.1.10; OFED 1.3) that there are some bugs in + specifying the loopback mode in the opensm.opts file. Look for this to + be fixed in newer versions. + + [**] Also you could use "local", but this is only useful if you run + OpenSM in the foreground of a terminal. As OpenSM is usually started + as a daemon I left this out as an option. + +Step 3: Retrieve data which has been collected +---------------------------------------------- + +Step 3a: Using the console dump function +------------------------------------ + +The console command "perfmgr dump_counters" will dump counters to the file +specified in the opensm.opts file. In the example above, this is +"/var/log/opensm_port_counters.log". + +Example output is below: + + +"SW1 wopr ISR9024D (MLX4 FW)" 0x8f10400411f56 port 1 (Since Mon May 12 13:27:14 2008) + symbol_err_cnt : 0 + link_err_recover : 0 + link_downed : 0 + rcv_err : 0 + rcv_rem_phys_err : 0 + rcv_switch_relay_err : 2 + xmit_discards : 0 + xmit_constraint_err : 0 + rcv_constraint_err : 0 + link_integrity_err : 0 + buf_overrun_err : 0 + vl15_dropped : 0 + xmit_data : 470435 + rcv_data : 405956 + xmit_pkts : 8954 + rcv_pkts : 6900 + unicast_xmit_pkts : 0 + unicast_rcv_pkts : 0 + multicast_xmit_pkts : 0 + multicast_rcv_pkts : 0 + + + +Step 3b: Using a plugin module +------------------------------ + +If you want a more automated method of retrieving the data, OpenSM provides a +plugin interface to extend OpenSM. The header file is osm_event_plugin.h. +The functions you register with this interface will be called when data is +collected. You can then use that data as appropriate. + +An example plugin can be configured at compile time using the +"--enable-default-event-plugin" option on the configure line. This plugin is +very simple. It logs "events" received from the performance manager to a log +file. I don't recommend using this directly, but rather using it as a template to +create your own plugin.
+ diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in index feabfef..c36d6f2 100644 --- a/opensm/opensm.spec.in +++ b/opensm/opensm.spec.in @@ -125,7 +125,7 @@ fi %{_sbindir}/opensm %{_sbindir}/osmtest %{_mandir}/man8/* -%doc AUTHORS COPYING README +%doc AUTHORS COPYING README doc/performance-manager-HOWTO.txt %{_sysconfdir}/init.d/opensmd %{_sbindir}/sldd.sh %config(noreplace) @OPENSM_CONFIG_DIR@/opensm.conf -- 1.5.1 From hrosenstock at xsigo.com Fri May 16 13:37:00 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 16 May 2008 13:37:00 -0700 Subject: [ofa-general] Re: [PATCH V2] OpenSM: Add a Performance Manager HOWTO to the docs and the dist In-Reply-To: <20080516133502.27a1e9b6.weiny2@llnl.gov> References: <20080515132723.3add7c6a.weiny2@llnl.gov> <1210967529.12616.287.camel@hrosenstock-ws.xsigo.com> <20080516133502.27a1e9b6.weiny2@llnl.gov> Message-ID: <1210970220.12616.304.camel@hrosenstock-ws.xsigo.com> On Fri, 2008-05-16 at 13:35 -0700, Ira Weiny wrote: > However, can't it also work in a standby SM? Yes, it works with SM in any state and standalone without SM AFAIK. -- Hal From hrosenstock at xsigo.com Fri May 16 14:18:38 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 16 May 2008 14:18:38 -0700 Subject: [ofa-general] [PATCH] OpenSM/doc/QoS_management_in_OpenSM.txt: Remove mention of OFED In-Reply-To: <1210968976.12616.298.camel@hrosenstock-ws.xsigo.com> References: <1210968976.12616.298.camel@hrosenstock-ws.xsigo.com> Message-ID: <1210972718.12616.309.camel@hrosenstock-ws.xsigo.com> On Fri, 2008-05-16 at 13:16 -0700, Hal Rosenstock wrote: > OpenSM/doc/QoS_management_in_OpenSM.txt: Remove mention of OFED > > I'll leave the other heavy lifting to Yevgeny :-) I forgot: Please apply to master and ofed_1_3 > Signed-off-by: Hal Rosenstock From rdreier at cisco.com Fri May 16 14:28:18 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 May 2008 14:28:18 -0700 Subject: [ofa-general] [PATCH/RFC] RDMA/cxgb3: Fix uninitialized variable warning in iwch_post_send() Message-ID: drivers/infiniband/hw/cxgb3/iwch_qp.c: In function 'iwch_post_send': drivers/infiniband/hw/cxgb3/iwch_qp.c:232: warning: 't3_wr_flit_cnt' may be used uninitialized in this function This is what akpm describes as "the dopey gcc-doesn't-know-that-foo(&var)-writes-to-var problem." 
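Reduced to a minimal sketch, the pattern gcc trips over looks like this (both function names here are invented for illustration):

/* gcc cannot prove that init_cnt() always writes *cnt before demo()
 * reads it, so it emits a spurious "may be used uninitialized" warning.
 */
static void init_cnt(u8 *cnt)
{
	*cnt = 3;
}

static u8 demo(void)
{
	u8 cnt;

	init_cnt(&cnt);
	return cnt;
}

uninitialized_var() silences exactly this class of false positive without generating any extra initialization code.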
Signed-off-by: Roland Dreier --- drivers/infiniband/hw/cxgb3/iwch_qp.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c index 79dbe5b..9926137 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -229,7 +229,7 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr) { int err = 0; - u8 t3_wr_flit_cnt; + u8 uninitialized_var(t3_wr_flit_cnt); enum t3_wr_opcode t3_wr_opcode = 0; enum t3_wr_flags t3_wr_flags; struct iwch_qp *qhp; -- 1.5.5.1 From swise at opengridcomputing.com Fri May 16 14:52:55 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 May 2008 16:52:55 -0500 Subject: [ofa-general] Re: [PATCH/RFC] RDMA/cxgb3: Fix uninitialized variable warning in iwch_post_send() In-Reply-To: References: Message-ID: <482E0237.3000603@opengridcomputing.com> Acked-by: Steve Wise From swise at opengridcomputing.com Fri May 16 15:30:37 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 May 2008 17:30:37 -0500 Subject: [ofa-general] [PATCH RFC v3] RDMA: New Memory Extensions Message-ID: <20080516223037.27127.26712.stgit@dell3.ogc.int> The following patch series proposes: - The API and core changes needed to implement the IB BMME and iWARP equivalent memory extensions. - cxgb3 support. Changes since version 2: - added device attribute max_fast_reg_page_list_len - added cxgb3 patch Changes since version 1: - ib_alloc_mr() -> ib_alloc_fast_reg_mr() - pbl_depth -> max_page_list_len - page_list_len -> max_page_list_len where it makes sense - int -> unsigned int where needed - fbo -> first_byte_offset - added page size and page_list_len to fast_reg union in ib_send_wr - rearranged work request fast_reg union of ib_send_wr to pack it - dropped remove_access parameter from ib_alloc_fast_reg_mr() - IB_DEVICE_MM_EXTENSIONS -> IB_DEVICE_MEM_MGT_EXTENSIONS - compiled Steve. From swise at opengridcomputing.com Fri May 16 15:32:43 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 May 2008 17:32:43 -0500 Subject: [ofa-general] [PATCH RFC v3] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <20080516223037.27127.26712.stgit@dell3.ogc.int> References: <20080516223037.27127.26712.stgit@dell3.ogc.int> Message-ID: <20080516223243.27127.10687.stgit@dell3.ogc.int> Support for the IB BMME and iWARP equivalent memory extensions to non-shared memory regions. This includes: - allocation of an ib_mr for use in fast register work requests - device-specific alloc/free of physical buffer lists for use in fast register work requests. This allows devices to allocate this memory as needed (such as via dma_alloc_coherent). - fast register memory region work request - invalidate local memory region work request - read with invalidate local memory region work request (iWARP only) Design details: - New device capability flag added: IB_DEVICE_MEM_MGT_EXTENSIONS indicates device support for this feature. - New send WR opcode IB_WR_FAST_REG_MR used to issue a fast_reg request. - New send WR opcode IB_WR_INVALIDATE_MR used to invalidate a fast_reg mr. - New API function, ib_alloc_fast_reg_mr(), used to allocate fast_reg memory regions. - New API function, ib_alloc_fast_reg_page_list(), to allocate device-specific page lists. - New API function, ib_free_fast_reg_page_list(), to free said page lists. Usage Model: - MR allocated with ib_alloc_fast_reg_mr() - Page lists allocated via ib_alloc_fast_reg_page_list().
- MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) - MR deallocated with ib_dereg_mr() - page lists dealloced via ib_free_fast_reg_page_list(). Applications can allocate a fast_reg mr once, and then can repeatedly bind the mr to different physical memory SGLs via posting work requests to the send queue. For each outstanding mr-to-pbl binding in the SQ pipe, a fast_reg_page_list needs to be allocated. Thus pipelining can be achieved while still allowing device-specific page_list processing. Signed-off-by: Steve Wise --- drivers/infiniband/core/verbs.c | 46 ++++++++++++++++++++++++++++++++ include/rdma/ib_verbs.h | 56 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 102 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 0504208..0a334b4 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -755,6 +755,52 @@ int ib_dereg_mr(struct ib_mr *mr) } EXPORT_SYMBOL(ib_dereg_mr); +struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len) +{ + struct ib_mr *mr; + + if (!pd->device->alloc_fast_reg_mr) + return ERR_PTR(-ENOSYS); + + mr = pd->device->alloc_fast_reg_mr(pd, max_page_list_len); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + mr->uobject = NULL; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_alloc_fast_reg_mr); + +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( + struct ib_device *device, int max_page_list_len) +{ + struct ib_fast_reg_page_list *page_list; + + if (!device->alloc_fast_reg_page_list) + return ERR_PTR(-ENOSYS); + + page_list = device->alloc_fast_reg_page_list(device, max_page_list_len); + + if (!IS_ERR(page_list)) { + page_list->device = device; + page_list->max_page_list_len = max_page_list_len; + } + + return page_list; +} +EXPORT_SYMBOL(ib_alloc_fast_reg_page_list); + +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list) +{ + page_list->device->free_fast_reg_page_list(page_list); +} +EXPORT_SYMBOL(ib_free_fast_reg_page_list); + /* Memory windows */ struct ib_mw *ib_alloc_mw(struct ib_pd *pd) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..c4ace0f 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -106,6 +106,7 @@ enum ib_device_cap_flags { IB_DEVICE_UD_IP_CSUM = (1<<18), IB_DEVICE_UD_TSO = (1<<19), IB_DEVICE_SEND_W_INV = (1<<21), + IB_DEVICE_MEM_MGT_EXTENSIONS = (1<<22), }; enum ib_atomic_cap { @@ -151,6 +152,7 @@ struct ib_device_attr { int max_srq; int max_srq_wr; int max_srq_sge; + unsigned int max_fast_reg_page_list_len; u16 max_pkeys; u8 local_ca_ack_delay; }; @@ -414,6 +416,8 @@ enum ib_wc_opcode { IB_WC_FETCH_ADD, IB_WC_BIND_MW, IB_WC_LSO, + IB_WC_FAST_REG_MR, + IB_WC_INVALIDATE_MR, /* * Set value of IB_WC_RECV so consumers can test if a completion is a * receive by testing (opcode & IB_WC_RECV). 
From swise at opengridcomputing.com Fri May 16 15:32:56 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 May 2008 17:32:56 -0500 Subject: [ofa-general] [RESEND PATCH RFC v3 0/2] RDMA: New Memory Extensions Message-ID: <20080516223256.27221.34568.stgit@dell3.ogc.int> The following patch series proposes: - The API and core changes needed to implement the IB BMME and iWARP equivalent memory extensions. - cxgb3 support. Changes since version 2: - added device attribute max_fast_reg_page_list_len - added cxgb3 patch Changes since version 1: - ib_alloc_mr() -> ib_alloc_fast_reg_mr() - pbl_depth -> max_page_list_len - page_list_len -> max_page_list_len where it makes sense - int -> unsigned int where needed - fbo -> first_byte_offset - added page size and page_list_len to fast_reg union in ib_send_wr - rearranged work request fast_reg union of ib_send_wr to pack it - dropped remove_access parameter from ib_alloc_fast_reg_mr() - IB_DEVICE_MM_EXTENSIONS -> IB_DEVICE_MEM_MGT_EXTENSIONS - compiled Steve.
From swise at opengridcomputing.com Fri May 16 15:34:20 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 May 2008 17:34:20 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <20080516223256.27221.34568.stgit@dell3.ogc.int> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> Message-ID: <20080516223419.27221.49014.stgit@dell3.ogc.int> Support for the IB BMME and iWARP equivalent memory extensions to non-shared memory regions. This includes: - allocation of an ib_mr for use in fast register work requests - device-specific alloc/free of physical buffer lists for use in fast register work requests. This allows devices to allocate this memory as needed (e.g. via dma_alloc_coherent()). - fast register memory region work request - invalidate local memory region work request - read with invalidate local memory region work request (iWARP only) Design details: - New device capability flag added: IB_DEVICE_MEM_MGT_EXTENSIONS indicates device support for this feature. - New send WR opcode IB_WR_FAST_REG_MR used to issue a fast_reg request. - New send WR opcode IB_WR_INVALIDATE_MR used to invalidate a fast_reg mr. - New API function, ib_alloc_fast_reg_mr(), used to allocate fast_reg memory regions. - New API function, ib_alloc_fast_reg_page_list(), used to allocate device-specific page lists. - New API function, ib_free_fast_reg_page_list(), used to free said page lists. Usage Model: - MR allocated with ib_alloc_fast_reg_mr() - Page lists allocated via ib_alloc_fast_reg_page_list(). - MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) - MR deallocated with ib_dereg_mr() - page lists deallocated via ib_free_fast_reg_page_list(). Applications can allocate a fast_reg mr once, and then can repeatedly bind the mr to different physical memory SGLs by posting work requests to the send queue. For each outstanding mr-to-pbl binding in the SQ pipe, a fast_reg_page_list needs to be allocated. Thus pipelining can be achieved while still allowing device-specific page_list processing.
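As an illustration of that usage model (this sketch is not part of the patch; qp, pd, device, npages, io_vaddr, io_len and all sizes are placeholder assumptions, and error handling is omitted):

	/* One-time setup: a fast_reg MR and a page list good for up to 16 pages. */
	struct ib_mr *mr = ib_alloc_fast_reg_mr(pd, 16);
	struct ib_fast_reg_page_list *pl = ib_alloc_fast_reg_page_list(device, 16);
	struct ib_send_wr wr, *bad_wr;

	/* Per I/O: fill pl->page_list[] with DMA addresses, then bind. */
	memset(&wr, 0, sizeof wr);
	wr.opcode = IB_WR_FAST_REG_MR;
	wr.wr.fast_reg.mr = mr;
	wr.wr.fast_reg.page_list = pl;
	wr.wr.fast_reg.page_list_len = npages;
	wr.wr.fast_reg.page_size = PAGE_SIZE;
	wr.wr.fast_reg.iova_start = io_vaddr;
	wr.wr.fast_reg.length = io_len;
	wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ;
	ib_post_send(qp, &wr, &bad_wr);

	/* ... peer does RDMA against mr->rkey ... then invalidate before reuse: */
	memset(&wr, 0, sizeof wr);
	wr.opcode = IB_WR_INVALIDATE_MR;
	wr.wr.local_inv.mr = mr;
	ib_post_send(qp, &wr, &bad_wr);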
Signed-off-by: Steve Wise --- drivers/infiniband/core/verbs.c | 46 ++++++++++++++++++++++++++++++++ include/rdma/ib_verbs.h | 56 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 102 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 0504208..0a334b4 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -755,6 +755,52 @@ int ib_dereg_mr(struct ib_mr *mr) } EXPORT_SYMBOL(ib_dereg_mr); +struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len) +{ + struct ib_mr *mr; + + if (!pd->device->alloc_fast_reg_mr) + return ERR_PTR(-ENOSYS); + + mr = pd->device->alloc_fast_reg_mr(pd, max_page_list_len); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + mr->uobject = NULL; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_alloc_fast_reg_mr); + +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( + struct ib_device *device, int max_page_list_len) +{ + struct ib_fast_reg_page_list *page_list; + + if (!device->alloc_fast_reg_page_list) + return ERR_PTR(-ENOSYS); + + page_list = device->alloc_fast_reg_page_list(device, max_page_list_len); + + if (!IS_ERR(page_list)) { + page_list->device = device; + page_list->max_page_list_len = max_page_list_len; + } + + return page_list; +} +EXPORT_SYMBOL(ib_alloc_fast_reg_page_list); + +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list) +{ + page_list->device->free_fast_reg_page_list(page_list); +} +EXPORT_SYMBOL(ib_free_fast_reg_page_list); + /* Memory windows */ struct ib_mw *ib_alloc_mw(struct ib_pd *pd) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..c4ace0f 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -106,6 +106,7 @@ enum ib_device_cap_flags { IB_DEVICE_UD_IP_CSUM = (1<<18), IB_DEVICE_UD_TSO = (1<<19), IB_DEVICE_SEND_W_INV = (1<<21), + IB_DEVICE_MEM_MGT_EXTENSIONS = (1<<22), }; enum ib_atomic_cap { @@ -151,6 +152,7 @@ struct ib_device_attr { int max_srq; int max_srq_wr; int max_srq_sge; + unsigned int max_fast_reg_page_list_len; u16 max_pkeys; u8 local_ca_ack_delay; }; @@ -414,6 +416,8 @@ enum ib_wc_opcode { IB_WC_FETCH_ADD, IB_WC_BIND_MW, IB_WC_LSO, + IB_WC_FAST_REG_MR, + IB_WC_INVALIDATE_MR, /* * Set value of IB_WC_RECV so consumers can test if a completion is a * receive by testing (opcode & IB_WC_RECV). 
@@ -628,6 +632,9 @@ enum ib_wr_opcode { IB_WR_ATOMIC_FETCH_AND_ADD, IB_WR_LSO, IB_WR_SEND_WITH_INV, + IB_WR_FAST_REG_MR, + IB_WR_INVALIDATE_MR, + IB_WR_READ_WITH_INV, }; enum ib_send_flags { @@ -676,6 +683,20 @@ struct ib_send_wr { u16 pkey_index; /* valid for GSI only */ u8 port_num; /* valid for DR SMPs on switch only */ } ud; + struct { + u64 iova_start; + struct ib_mr *mr; + struct ib_fast_reg_page_list *page_list; + unsigned int page_size; + unsigned int page_list_len; + unsigned int first_byte_offset; + u32 length; + int access_flags; + + } fast_reg; + struct { + struct ib_mr *mr; + } local_inv; } wr; }; @@ -1014,6 +1035,10 @@ struct ib_device { int (*query_mr)(struct ib_mr *mr, struct ib_mr_attr *mr_attr); int (*dereg_mr)(struct ib_mr *mr); + struct ib_mr * (*alloc_fast_reg_mr)(struct ib_pd *pd, + int max_page_list_len); + struct ib_fast_reg_page_list * (*alloc_fast_reg_page_list)(struct ib_device *device, int page_list_len); + void (*free_fast_reg_page_list)(struct ib_fast_reg_page_list *page_list); int (*rereg_phys_mr)(struct ib_mr *mr, int mr_rereg_mask, struct ib_pd *pd, @@ -1808,6 +1833,37 @@ int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); int ib_dereg_mr(struct ib_mr *mr); /** + * ib_alloc_fast_reg_mr - Allocates memory region usable with the + * IB_WR_FAST_REG_MR send work request. + * @pd: The protection domain associated with the region. + * @max_page_list_len: requested max physical buffer list size to be allocated. + */ +struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len); + +struct ib_fast_reg_page_list { + struct ib_device *device; + u64 *page_list; + unsigned int max_page_list_len; +}; + +/** + * ib_alloc_fast_reg_page_list - Allocates a page list array to be used + * in an IB_WR_FAST_REG_MR work request. The resources allocated by this method + * allow for dev-specific optimization of the FAST_REG operation. + * @device - ib device pointer. + * @page_list_len - depth of the page list array to be allocated. + */ +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( + struct ib_device *device, int page_list_len); + +/** + * ib_free_fast_reg_page_list - Deallocates a previously allocated + * page list array. + * @page_list - struct ib_fast_reg_page_list pointer to be deallocated. + */ +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list); + +/** * ib_alloc_mw - Allocates a memory window. + * @pd: The protection domain associated with the memory window. */ From swise at opengridcomputing.com Fri May 16 15:34:22 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 May 2008 17:34:22 -0500 Subject: [ofa-general] [PATCH RFC v3 2/2] RDMA/cxgb3: MEM_MGT_EXTENSIONS support In-Reply-To: <20080516223256.27221.34568.stgit@dell3.ogc.int> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> Message-ID: <20080516223422.27221.23807.stgit@dell3.ogc.int> - set IB_DEVICE_MEM_MGT_EXTENSIONS capability bit. - set max_fast_reg_page_list_len device attribute. - add iwch_alloc_fast_reg_mr function. - add iwch_alloc_fastreg_pbl - add iwch_free_fastreg_pbl - adjust the WQ depth for kernel mode work queues to account for fastreg possibly taking 2 WR slots. - add fastreg_mr work request support. - add invalidate_mr work request support. - add send_with_inv and send_with_se_inv work request support.
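A brief aside on the "2 WR slots" point above (illustrative arithmetic, not from the patch): the driver caps a WQE at 15 flits (8-byte words), and a fastreg WR needs 5 flits of header plus one flit per page-list entry, so only the first 10 entries fit in one slot; with T3_MAX_FASTREG_DEPTH of 24, entries 11-24 spill into a second slot. Hence the kernel-QP sizing in the patch below roughly doubles the SQ contribution, e.g. (simplified, values assumed):

	sqsize = roundup_pow_of_two(128);		/* max_send_wr = 128 */
	rqsize = roundup_pow_of_two(100);		/* -> 128 */
	wqsize = roundup_pow_of_two(rqsize + sqsize);	/* 256 */
	if (wqsize < rqsize + 2 * sqsize)		/* 256 < 384 */
		wqsize = roundup_pow_of_two(rqsize + 2 * sqsize);	/* -> 512 */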
Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/cxio_hal.c | 13 ++- drivers/infiniband/hw/cxgb3/cxio_hal.h | 1 drivers/infiniband/hw/cxgb3/cxio_wr.h | 50 ++++++++++- drivers/infiniband/hw/cxgb3/iwch_provider.c | 77 ++++++++++++++++- drivers/infiniband/hw/cxgb3/iwch_qp.c | 123 +++++++++++++++++++-------- 5 files changed, 214 insertions(+), 50 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index 3f441fc..6315c77 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -145,7 +145,7 @@ static int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid) } wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); memset(wqe, 0, sizeof(*wqe)); - build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 3, 0, qpid, 7); + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 3, 0, qpid, 7, 3); wqe->flags = cpu_to_be32(MODQP_WRITE_EC); sge_cmd = qpid << 8 | 3; wqe->sge_cmd = cpu_to_be64(sge_cmd); @@ -558,7 +558,7 @@ static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p) wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); memset(wqe, 0, sizeof(*wqe)); build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 0, 0, - T3_CTL_QP_TID, 7); + T3_CTL_QP_TID, 7, 3); wqe->flags = cpu_to_be32(MODQP_WRITE_EC); sge_cmd = (3ULL << 56) | FW_RI_SGEEC_START << 8 | 3; wqe->sge_cmd = cpu_to_be64(sge_cmd); @@ -674,7 +674,7 @@ static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr, build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_BP, flag, Q_GENBIT(rdev_p->ctrl_qp.wptr, T3_CTRL_QP_SIZE_LOG2), T3_CTRL_QP_ID, - wr_len); + wr_len, 3); if (flag == T3_COMPLETION_FLAG) ring_doorbell(rdev_p->ctrl_qp.doorbell, T3_CTRL_QP_ID); len -= 96; @@ -816,6 +816,13 @@ int cxio_deallocate_window(struct cxio_rdev *rdev_p, u32 stag) 0, 0); } +int cxio_allocate_stag(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid) +{ + *stag = T3_STAG_UNSET; + return __cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_NON_SHARED_MR, + 0, 0, 0ULL, 0, 0, 0, 0); +} + int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr) { struct t3_rdma_init_wr *wqe; diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.h b/drivers/infiniband/hw/cxgb3/cxio_hal.h index 6e128f6..e7659f6 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.h +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.h @@ -165,6 +165,7 @@ int cxio_reregister_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid, int cxio_dereg_mem(struct cxio_rdev *rdev, u32 stag, u32 pbl_size, u32 pbl_addr); int cxio_allocate_window(struct cxio_rdev *rdev, u32 * stag, u32 pdid); +int cxio_allocate_stag(struct cxio_rdev *rdev, u32 * stag, u32 pdid); int cxio_deallocate_window(struct cxio_rdev *rdev, u32 stag); int cxio_rdma_init(struct cxio_rdev *rdev, struct t3_rdma_init_attr *attr); void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb); diff --git a/drivers/infiniband/hw/cxgb3/cxio_wr.h b/drivers/infiniband/hw/cxgb3/cxio_wr.h index f1a25a8..bc8f49b 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_wr.h +++ b/drivers/infiniband/hw/cxgb3/cxio_wr.h @@ -72,7 +72,8 @@ enum t3_wr_opcode { T3_WR_BIND = FW_WROPCODE_RI_BIND_MW, T3_WR_RCV = FW_WROPCODE_RI_RECEIVE, T3_WR_INIT = FW_WROPCODE_RI_RDMA_INIT, - T3_WR_QP_MOD = FW_WROPCODE_RI_MODIFY_QP + T3_WR_QP_MOD = FW_WROPCODE_RI_MODIFY_QP, + T3_WR_FASTREG = FW_WROPCODE_RI_FASTREGISTER_MR } __attribute__ ((packed)); enum t3_rdma_opcode { @@ -89,7 +90,8 @@ enum t3_rdma_opcode { T3_FAST_REGISTER, T3_LOCAL_INV, T3_QP_MOD, - T3_BYPASS + T3_BYPASS, + 
T3_RDMA_READ_REQ_WITH_INV, } __attribute__ ((packed)); static inline enum t3_rdma_opcode wr2opcode(enum t3_wr_opcode wrop) @@ -170,11 +172,45 @@ struct t3_send_wr { struct t3_sge sgl[T3_MAX_SGE]; /* 4+ */ }; +#define T3_MAX_FASTREG_DEPTH 24 + +struct t3_fastreg_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + __be32 stag; /* 2 */ + __be32 len; + __be32 va_base_hi; /* 3 */ + __be32 va_base_lo_fbo; + __be32 page_type_perms; /* 4 */ + __be32 reserved; + __be64 pbl_addrs[0]; /* 5+ */ +}; + +#define S_FR_PAGE_COUNT 31 +#define M_FR_PAGE_COUNT 0xff +#define V_FR_PAGE_COUNT(x) ((x) << S_FR_PAGE_COUNT) +#define G_FR_PAGE_COUNT(x) ((((x) >> S_FR_PAGE_COUNT)) & M_FR_PAGE_COUNT) + +#define S_FR_PAGE_SIZE 23 +#define M_FR_PAGE_SIZE 0x7 +#define V_FR_PAGE_SIZE(x) ((x) << S_FR_PAGE_SIZE) +#define G_FR_PAGE_SIZE(x) ((((x) >> S_FR_PAGE_SIZE)) & M_FR_PAGE_SIZE) + +#define S_FR_TYPE 20 +#define M_FR_TYPE 0x1 +#define V_FR_TYPE(x) ((x) << S_FR_TYPE) +#define G_FR_TYPE(x) ((((x) >> S_FR_TYPE)) & M_FR_TYPE) + +#define S_FR_PERMS 20 +#define M_FR_PERMS 0x1f +#define V_FR_PERMS(x) ((x) << S_FR_PERMS) +#define G_FR_PERMS(x) ((((x) >> S_FR_PERMS)) & M_FR_PERMS) + struct t3_local_inv_wr { struct fw_riwrh wrh; /* 0 */ union t3_wrid wrid; /* 1 */ __be32 stag; /* 2 */ - __be32 reserved3; + __be32 reserved; }; struct t3_rdma_write_wr { @@ -210,7 +246,8 @@ enum t3_mem_perms { T3_MEM_ACCESS_LOCAL_READ = 0x1, T3_MEM_ACCESS_LOCAL_WRITE = 0x2, T3_MEM_ACCESS_REM_READ = 0x4, - T3_MEM_ACCESS_REM_WRITE = 0x8 + T3_MEM_ACCESS_REM_WRITE = 0x8, + T3_MEM_ACCESS_MW_BIND = 0x10 } __attribute__ ((packed)); struct t3_bind_mw_wr { @@ -346,6 +383,7 @@ union t3_wr { struct t3_rdma_write_wr write; struct t3_rdma_read_wr read; struct t3_receive_wr recv; + struct t3_fastreg_wr fastreg; struct t3_local_inv_wr local_inv; struct t3_bind_mw_wr bind; struct t3_bypass_wr bypass; @@ -368,10 +406,10 @@ static inline enum t3_wr_opcode fw_riwrh_opcode(struct fw_riwrh *wqe) static inline void build_fw_riwrh(struct fw_riwrh *wqe, enum t3_wr_opcode op, enum t3_wr_flags flags, u8 genbit, u32 tid, - u8 len) + u8 len, u8 sopeop) { wqe->op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(op) | - V_FW_RIWR_SOPEOP(M_FW_RIWR_SOPEOP) | + V_FW_RIWR_SOPEOP(sopeop) | V_FW_RIWR_FLAGS(flags)); wmb(); wqe->gen_tid_len = cpu_to_be32(V_FW_RIWR_GEN(genbit) | diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index 8934178..cf51800 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -768,6 +768,64 @@ static int iwch_dealloc_mw(struct ib_mw *mw) return 0; } +static struct ib_mr *iwch_alloc_fast_reg_mr(struct ib_pd *pd, int pbl_depth) +{ + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mr *mhp; + u32 mmid; + u32 stag = 0; + int ret; + + php = to_iwch_pd(pd); + rhp = php->rhp; + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + + ret = iwch_alloc_pbl(mhp, pbl_depth); + if (ret) { + kfree(mhp); + return ERR_PTR(ret); + } + mhp->attr.pbl_size = pbl_depth; + ret = cxio_allocate_stag(&rhp->rdev, &stag, php->pdid); + if (ret) { + iwch_free_pbl(mhp); + kfree(mhp); + return ERR_PTR(ret); + } + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.type = TPT_NON_SHARED_MR; + mhp->attr.stag = stag; + mmid = (stag) >> 8; + insert_handle(rhp, &rhp->mmidr, mhp, mmid); + PDBG("%s mmid 0x%x mhp %p stag 0x%x\n", __func__, mmid, mhp, stag); + return &(mhp->ibmr); +} + +static struct ib_fast_reg_page_list *iwch_alloc_fastreg_pbl( 
+ struct ib_device *device, + int page_list_len) +{ + struct ib_fast_reg_page_list *page_list; + + page_list = kmalloc(sizeof *page_list + page_list_len * sizeof(u64), + GFP_KERNEL); + if (!page_list) + return ERR_PTR(-ENOMEM); + + page_list->page_list = (u64 *)(page_list + 1); + + return page_list; +} + +static void iwch_free_fastreg_pbl(struct ib_fast_reg_page_list *page_list) +{ + kfree(page_list); +} + static int iwch_destroy_qp(struct ib_qp *ib_qp) { struct iwch_dev *rhp; @@ -843,6 +901,15 @@ static struct ib_qp *iwch_create_qp(struct ib_pd *pd, */ sqsize = roundup_pow_of_two(attrs->cap.max_send_wr); wqsize = roundup_pow_of_two(rqsize + sqsize); + + /* + * Kernel users need more wq space for fastreg WRs which can take + * 2 WR fragments. + */ + ucontext = pd->uobject ? to_iwch_ucontext(pd->uobject->context) : NULL; + if (!ucontext && wqsize < (rqsize + (2 * sqsize))) + wqsize = roundup_pow_of_two(rqsize + + roundup_pow_of_two(attrs->cap.max_send_wr * 2)); PDBG("%s wqsize %d sqsize %d rqsize %d\n", __func__, wqsize, sqsize, rqsize); qhp = kzalloc(sizeof(*qhp), GFP_KERNEL); @@ -851,7 +918,6 @@ static struct ib_qp *iwch_create_qp(struct ib_pd *pd, qhp->wq.size_log2 = ilog2(wqsize); qhp->wq.rq_size_log2 = ilog2(rqsize); qhp->wq.sq_size_log2 = ilog2(sqsize); - ucontext = pd->uobject ? to_iwch_ucontext(pd->uobject->context) : NULL; if (cxio_create_qp(&rhp->rdev, !udata, &qhp->wq, ucontext ? &ucontext->uctx : &rhp->rdev.uctx)) { kfree(qhp); @@ -1048,6 +1114,7 @@ static int iwch_query_device(struct ib_device *ibdev, props->max_mr = dev->attr.max_mem_regs; props->max_pd = dev->attr.max_pds; props->local_ca_ack_delay = 0; + props->max_fast_reg_page_list_len = T3_MAX_FASTREG_DEPTH; return 0; } @@ -1145,8 +1212,9 @@ int iwch_register_device(struct iwch_dev *dev) memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid)); memcpy(&dev->ibdev.node_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6); dev->ibdev.owner = THIS_MODULE; - dev->device_cap_flags = - (IB_DEVICE_ZERO_STAG | IB_DEVICE_MEM_WINDOW); + dev->device_cap_flags = IB_DEVICE_ZERO_STAG | + IB_DEVICE_MEM_WINDOW | + IB_DEVICE_MEM_MGT_EXTENSIONS; dev->ibdev.uverbs_cmd_mask = (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | @@ -1198,6 +1266,9 @@ int iwch_register_device(struct iwch_dev *dev) dev->ibdev.alloc_mw = iwch_alloc_mw; dev->ibdev.bind_mw = iwch_bind_mw; dev->ibdev.dealloc_mw = iwch_dealloc_mw; + dev->ibdev.alloc_fast_reg_mr = iwch_alloc_fast_reg_mr; + dev->ibdev.alloc_fast_reg_page_list = iwch_alloc_fastreg_pbl; + dev->ibdev.free_fast_reg_page_list = iwch_free_fastreg_pbl; dev->ibdev.attach_mcast = iwch_multicast_attach; dev->ibdev.detach_mcast = iwch_multicast_detach; diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c index 79dbe5b..9c0cc7e 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -44,54 +44,39 @@ static int iwch_build_rdma_send(union t3_wr *wqe, struct ib_send_wr *wr, switch (wr->opcode) { case IB_WR_SEND: - case IB_WR_SEND_WITH_IMM: if (wr->send_flags & IB_SEND_SOLICITED) wqe->send.rdmaop = T3_SEND_WITH_SE; else wqe->send.rdmaop = T3_SEND; wqe->send.rem_stag = 0; break; -#if 0 /* Not currently supported */ - case TYPE_SEND_INVALIDATE: - case TYPE_SEND_INVALIDATE_IMMEDIATE: - wqe->send.rdmaop = T3_SEND_WITH_INV; - wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); - break; - case TYPE_SEND_SE_INVALIDATE: - wqe->send.rdmaop = T3_SEND_WITH_SE_INV; + case IB_WR_SEND_WITH_INV: + if (wr->send_flags & IB_SEND_SOLICITED) + wqe->send.rdmaop = 
T3_SEND_WITH_SE_INV; + else + wqe->send.rdmaop = T3_SEND_WITH_INV; wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); break; -#endif default: - break; + return -EINVAL; } if (wr->num_sge > T3_MAX_SGE) return -EINVAL; wqe->send.reserved[0] = 0; wqe->send.reserved[1] = 0; wqe->send.reserved[2] = 0; - if (wr->opcode == IB_WR_SEND_WITH_IMM) { - plen = 4; - wqe->send.sgl[0].stag = wr->ex.imm_data; - wqe->send.sgl[0].len = __constant_cpu_to_be32(0); - wqe->send.num_sgle = __constant_cpu_to_be32(0); - *flit_cnt = 5; - } else { - plen = 0; - for (i = 0; i < wr->num_sge; i++) { - if ((plen + wr->sg_list[i].length) < plen) { - return -EMSGSIZE; - } - plen += wr->sg_list[i].length; - wqe->send.sgl[i].stag = - cpu_to_be32(wr->sg_list[i].lkey); - wqe->send.sgl[i].len = - cpu_to_be32(wr->sg_list[i].length); - wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr); + plen = 0; + for (i = 0; i < wr->num_sge; i++) { + if ((plen + wr->sg_list[i].length) < plen) { + return -EMSGSIZE; } - wqe->send.num_sgle = cpu_to_be32(wr->num_sge); - *flit_cnt = 4 + ((wr->num_sge) << 1); + plen += wr->sg_list[i].length; + wqe->send.sgl[i].stag = cpu_to_be32(wr->sg_list[i].lkey); + wqe->send.sgl[i].len = cpu_to_be32(wr->sg_list[i].length); + wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr); } + wqe->send.num_sgle = cpu_to_be32(wr->num_sge); + *flit_cnt = 4 + ((wr->num_sge) << 1); wqe->send.plen = cpu_to_be32(plen); return 0; } @@ -155,6 +140,56 @@ static int iwch_build_rdma_read(union t3_wr *wqe, struct ib_send_wr *wr, return 0; } +static int iwch_build_fastreg(union t3_wr *wqe, struct ib_send_wr *wr, + u8 *flit_cnt, int *wr_cnt, struct t3_wq *wq) +{ + int i; + u64 *p; + + if (wr->wr.fast_reg.page_list_len > T3_MAX_FASTREG_DEPTH) + return -EINVAL; + *wr_cnt = 1; + wqe->fastreg.stag = cpu_to_be32(wr->wr.fast_reg.mr->rkey); + wqe->fastreg.len = cpu_to_be32(wr->wr.fast_reg.length); + wqe->fastreg.va_base_hi = cpu_to_be32(wr->wr.fast_reg.iova_start>>32); + wqe->fastreg.va_base_lo_fbo = + cpu_to_be32(wr->wr.fast_reg.iova_start&0xffffffff); + wqe->fastreg.page_type_perms = cpu_to_be32( + V_FR_PAGE_COUNT(wr->wr.fast_reg.page_list_len) | + V_FR_PAGE_SIZE(ilog2(wr->wr.fast_reg.page_size)-12) | + V_FR_TYPE(T3_VA_BASED_TO) | + V_FR_PERMS(iwch_ib_to_mwbind_access(wr->wr.fast_reg.access_flags))); + p = &wqe->fastreg.pbl_addrs[0]; + for (i = 0; i < wr->wr.fast_reg.page_list_len; i++, p++) { + + /* If we need a 2nd WR, then set it up */ + if (i == 10) { + *wr_cnt = 2; + wqe = (union t3_wr *)(wq->queue + + Q_PTR2IDX((wq->wptr+1), wq->size_log2)); + build_fw_riwrh((void *)wqe, T3_WR_FASTREG, 0, + Q_GENBIT(wq->wptr, wq->size_log2), + 0, 1 + wr->wr.fast_reg.page_list_len - 10, 1); + + p = &wqe->flit[1]; + } + *p = cpu_to_be64((u64)wr->wr.fast_reg.page_list->page_list[i]); + } + *flit_cnt = 5 + wr->wr.fast_reg.page_list_len; + if (*flit_cnt > 15) + *flit_cnt = 15; + return 0; +} + +static int iwch_build_inv_stag(union t3_wr *wqe, struct ib_send_wr *wr, + u8 *flit_cnt) +{ + wqe->local_inv.stag = cpu_to_be32(wr->wr.local_inv.mr->rkey); + wqe->local_inv.reserved = 0; + *flit_cnt = sizeof(struct t3_local_inv_wr) >> 3; + return 0; +} + /* * TBD: this is going to be moved to firmware. Missing pdid/qpid check for now. 
*/ @@ -238,6 +273,7 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, u32 num_wrs; unsigned long flag; struct t3_swsq *sqp; + int wr_cnt = 1; qhp = to_iwch_qp(ibqp); spin_lock_irqsave(&qhp->lock, flag); @@ -262,15 +298,15 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, t3_wr_flags = 0; if (wr->send_flags & IB_SEND_SOLICITED) t3_wr_flags |= T3_SOLICITED_EVENT_FLAG; - if (wr->send_flags & IB_SEND_FENCE) - t3_wr_flags |= T3_READ_FENCE_FLAG; if (wr->send_flags & IB_SEND_SIGNALED) t3_wr_flags |= T3_COMPLETION_FLAG; sqp = qhp->wq.sq + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2); switch (wr->opcode) { case IB_WR_SEND: - case IB_WR_SEND_WITH_IMM: + case IB_WR_SEND_WITH_INV: + if (wr->send_flags & IB_SEND_FENCE) + t3_wr_flags |= T3_READ_FENCE_FLAG; t3_wr_opcode = T3_WR_SEND; err = iwch_build_rdma_send(wqe, wr, &t3_wr_flit_cnt); break; @@ -289,6 +325,17 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, if (!qhp->wq.oldest_read) qhp->wq.oldest_read = sqp; break; + case IB_WR_FAST_REG_MR: + t3_wr_opcode = T3_WR_FASTREG; + err = iwch_build_fastreg(wqe, wr, &t3_wr_flit_cnt, + &wr_cnt, &qhp->wq); + break; + case IB_WR_INVALIDATE_MR: + if (wr->send_flags & IB_SEND_FENCE) + t3_wr_flags |= T3_LOCAL_FENCE_FLAG; + t3_wr_opcode = T3_WR_INV_STAG; + err = iwch_build_inv_stag(wqe, wr, &t3_wr_flit_cnt); + break; default: PDBG("%s post of type=%d TBD!\n", __func__, wr->opcode); @@ -307,14 +354,14 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, build_fw_riwrh((void *) wqe, t3_wr_opcode, t3_wr_flags, Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), - 0, t3_wr_flit_cnt); + 0, t3_wr_flit_cnt, (wr_cnt == 1) ? 3 : 2); PDBG("%s cookie 0x%llx wq idx 0x%x swsq idx %ld opcode %d\n", __func__, (unsigned long long) wr->wr_id, idx, Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2), sqp->opcode); wr = wr->next; num_wrs--; - ++(qhp->wq.wptr); + qhp->wq.wptr += wr_cnt; ++(qhp->wq.sq_wptr); } spin_unlock_irqrestore(&qhp->lock, flag); @@ -359,7 +406,7 @@ int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, wr->wr_id; build_fw_riwrh((void *) wqe, T3_WR_RCV, T3_COMPLETION_FLAG, Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), - 0, sizeof(struct t3_receive_wr) >> 3); + 0, sizeof(struct t3_receive_wr) >> 3, 3); PDBG("%s cookie 0x%llx idx 0x%x rq_wptr 0x%x rw_rptr 0x%x " "wqe %p \n", __func__, (unsigned long long) wr->wr_id, idx, qhp->wq.rq_wptr, qhp->wq.rq_rptr, wqe); @@ -444,7 +491,7 @@ int iwch_bind_mw(struct ib_qp *qp, wqe->flit[T3_SQ_COOKIE_FLIT] = mw_bind->wr_id; build_fw_riwrh((void *)wqe, T3_WR_BIND, t3_wr_flags, Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), 0, - sizeof(struct t3_bind_mw_wr) >> 3); + sizeof(struct t3_bind_mw_wr) >> 3, 3); ++(qhp->wq.wptr); ++(qhp->wq.sq_wptr); spin_unlock_irqrestore(&qhp->lock, flag); From rdreier at cisco.com Fri May 16 15:48:36 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 May 2008 15:48:36 -0700 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: (Thomas Talpey's message of "Wed, 14 May 2008 23:32:21 -0400") References: Message-ID: > We've been hit by this twice this week on two NFS/RDMA servers, so I'm > glad to see this! But, for us it happens with memless ConnectX - our mthca > devices are ok (but OTOH they're memfull not memfree) OK, I see a problem with mlx4 -- it may spuriously return failure when you try to create a QP with max_send_sge == 32, but only for kernel QPs. Which is why my userspace test didn't catch it. - R. 
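To put rough numbers on this (the overhead figure is an assumption; see send_wqe_overhead() in mlx4/qp.c for the exact value, and Roland's follow-up below for the failing code path):

	/* Kernel RC QP asking for max_send_sge = 32, with WQE shrinking
	 * unavailable because of selective signaling (the NFS/RDMA case): */
	s = 32 * sizeof(struct mlx4_wqe_data_seg)	/* 32 * 16 = 512 */
	    + 48;					/* ~send_wqe_overhead() */
	wqe_shift = ilog2(roundup_pow_of_two(s));	/* ilog2(1024) = 10 */
	/* 1 << 10 = 1024 > max_sq_desc_sz (1008), so the old check returns
	 * -EINVAL even though a 1008-byte descriptor would actually fit. */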
From rdreier at cisco.com Fri May 16 16:12:51 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 May 2008 16:12:51 -0700 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: (Roland Dreier's message of "Fri, 16 May 2008 15:48:36 -0700") References: Message-ID: > OK, I see a problem with mlx4 -- it may spuriously return failure when > you try to create a QP with max_send_sge == 32, but only for kernel > QPs. Which is why my userspace test didn't catch it. The problem is this code in set_kernel_sq_size: if (dev->dev->caps.fw_ver >= MLX4_FW_VER_WQE_CTRL_NEC && qp->sq_signal_bits && BITS_PER_LONG == 64 && type != IB_QPT_SMI && type != IB_QPT_GSI) qp->sq.wqe_shift = ilog2(64); else qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); for (;;) { if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) return -EINVAL; if we can't use the "WQE shrinking" feature (because of selective signaling in the NFS/RDMA case), and we want to use 32 sge entries, then the WQE size 's' will end up a little more than 512 bytes, and the wqe_shift will end up as 10. But since the max_sq_desc_sz is 1008, we return -EINVAL, when it is really fine to have a wqe_shift of 10 as long as we don't use more than 1008 bytes per descriptor (I think). So something like this is probably the fix (it suffices to make NFS/RDMA mount work with ConnectX on both sides): diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index cec030e..b6612a0 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -372,7 +372,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); for (;;) { - if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) + if (qp->sq.wqe_shift > + ilog2(roundup_pow_of_two(dev->dev->caps.max_sq_desc_sz))) return -EINVAL; qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1U << qp->sq.wqe_shift); @@ -395,7 +396,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, ++qp->sq.wqe_shift; } - qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - + qp->sq.max_gs = (min(dev->dev->caps.max_sq_desc_sz, + (qp->sq_max_wqes_per_wr << qp->sq.wqe_shift)) - send_wqe_overhead(type, qp->flags)) / sizeof (struct mlx4_wqe_data_seg); From rdreier at cisco.com Fri May 16 16:24:37 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 May 2008 16:24:37 -0700 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: (Roland Dreier's message of "Fri, 16 May 2008 16:12:51 -0700") References: Message-ID: Or maybe something like this is better: diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index cec030e..907eb34 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -333,6 +333,9 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg)) + send_wqe_overhead(type, qp->flags); + if (s > dev->dev->caps.max_sq_desc_sz) + return -EINVAL; + /* * Hermon supports shrinking WQEs, such that a single work * request can include multiple units of 1 << wqe_shift. 
This @@ -372,9 +375,6 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); for (;;) { - if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) - return -EINVAL; - qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1U << qp->sq.wqe_shift); /* @@ -395,7 +395,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, ++qp->sq.wqe_shift; } - qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - + qp->sq.max_gs = (min(dev->dev->caps.max_sq_desc_sz, + (qp->sq_max_wqes_per_wr << qp->sq.wqe_shift)) - send_wqe_overhead(type, qp->flags)) / sizeof (struct mlx4_wqe_data_seg); From clameter at sgi.com Fri May 16 18:38:28 2008 From: clameter at sgi.com (Christoph Lameter) Date: Fri, 16 May 2008 18:38:28 -0700 (PDT) Subject: [ofa-general] mm notifier: Notifications when pages are unmapped. In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> Message-ID: Implementation of what Linus suggested: Defer the XPMEM processing until after the locks are dropped. Allow immediate action by GRU/KVM. This patch implements callbacks for device drivers that establish external references to pages aside from the Linux rmaps. Those either: 1. Do not take a refcount on pages that are mapped from devices. They have TLB-cache-like handling and must be able to flush external references from atomic contexts. These devices do not need to provide the _sync methods. 2. Do take a refcount on pages mapped externally. These are handled by marking pages to be invalidated in atomic contexts. Invalidation may be started by the driver. A _sync variant for the individual or range unmap is called when we are back in a nonatomic context. At that point the device must complete the removal of external references and drop its refcount. With the mm notifier it is possible for the device driver to release external references after the page references are removed from a process that made them available. With the notifier it becomes possible to get pages unpinned on request and thus avoid issues that come with having a large number of pinned pages. A device driver must subscribe to a process using mm_notifier_register(struct mm_notifier *, struct mm_struct *) The VM will then perform callbacks for operations that unmap or change permissions of pages in that address space. When the process terminates, the ->release method is first called to remove all pages still mapped to the process. Before the mm_struct is freed the ->destroy() method is called which should dispose of the mm_notifier structure. The following callbacks exist: invalidate_range(notifier, mm_struct *, from, to) Invalidate a range of addresses. The invalidation is not required to complete immediately. invalidate_range_sync(notifier, mm_struct *, from, to) This is called after some invalidate_range callouts. The driver may only return when the invalidation of the references is completed. The callback is only called from non-atomic contexts. There is no need to provide this callback if the driver can remove references in an atomic context. invalidate_page(notifier, mm_struct *, struct page *page, unsigned long address) Invalidate references to a particular page. The driver may defer the invalidation. 
invalidate_page_sync(notifier, mm_struct *, struct page *) Called after one or more invalidate_page() callbacks. The callback must only return when the external references have been removed. The callback does not need to be provided if the driver can remove references in atomic contexts. [NOTE] The invalidate_page_sync() callback is weird because it is called for every notifier that supports the invalidate_page_sync() callback if a page has PageNotifier() set. The driver must determine in an efficient way that the page is not of interest. This is because we do not have the mm context after we have dropped the rmap list lock. Drivers incrementing the refcount must set and clear PageNotifier appropriately when establishing and/or dropping a refcount! [These conditions are similar to the rmap notifier that was introduced in my V7 of the mmu_notifier]. There is no support for an aging callback. A device driver may simply set the reference bit on the Linux pte when the external mapping is referenced if such support is desired. The patch is provisional. All functions are inlined for now. They should be wrapped like in Andrea's series. It's probably good to have Andrea review this if we actually decide to go this route since he is pretty good at detecting issues with complex lock interactions in the vm. mmu notifiers V7 was rejected by Andrew because of the strange asymmetry in invalidate_page_sync() (at that time called rmap notifier) and we are reintroducing that now in a lightweight form in order to be able to defer freeing until after the rmap spinlocks have been dropped. Jack tested this with the GRU. Signed-off-by: Christoph Lameter --- fs/hugetlbfs/inode.c | 2 include/linux/mm_types.h | 3 include/linux/page-flags.h | 3 include/linux/rmap.h | 161 +++++++++++++++++++++++++++++++++++++++++++++ kernel/fork.c | 4 + mm/Kconfig | 4 + mm/filemap_xip.c | 2 mm/fremap.c | 2 mm/hugetlb.c | 3 mm/memory.c | 38 ++++++++-- mm/mmap.c | 3 mm/mprotect.c | 3 mm/mremap.c | 5 + mm/rmap.c | 11 ++- 14 files changed, 234 insertions(+), 10 deletions(-) Index: linux-2.6/kernel/fork.c =================================================================== --- linux-2.6.orig/kernel/fork.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/kernel/fork.c 2008-05-16 16:06:26.000000000 -0700 @@ -386,6 +386,9 @@ static struct mm_struct * mm_init(struct if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; +#ifdef CONFIG_MM_NOTIFIER + mm->mm_notifier = NULL; +#endif return mm; } @@ -418,6 +421,7 @@ void __mmdrop(struct mm_struct *mm) BUG_ON(mm == &init_mm); mm_free_pgd(mm); destroy_context(mm); + mm_notifier_destroy(mm); free_mm(mm); } EXPORT_SYMBOL_GPL(__mmdrop); Index: linux-2.6/mm/filemap_xip.c =================================================================== --- linux-2.6.orig/mm/filemap_xip.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/mm/filemap_xip.c 2008-05-16 16:06:26.000000000 -0700 @@ -189,6 +189,7 @@ __xip_unmap (struct address_space * mapp /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, pte_pfn(*pte)); pteval = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); @@ -197,6 +198,7 @@ __xip_unmap (struct address_space * mapp } } spin_unlock(&mapping->i_mmap_lock); + mm_notifier_invalidate_page_sync(page); } /* Index: linux-2.6/mm/fremap.c =================================================================== --- linux-2.6.orig/mm/fremap.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/mm/fremap.c 2008-05-16 16:06:26.000000000 -0700 @@ -214,7 +214,9 @@ asmlinkage long sys_remap_file_pages(uns spin_unlock(&mapping->i_mmap_lock); } + mm_notifier_invalidate_range(mm, start, start + size); err = populate_range(mm, vma, start, size, pgoff); + mm_notifier_invalidate_range_sync(mm, start, start + size); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); Index: linux-2.6/mm/hugetlb.c =================================================================== --- linux-2.6.orig/mm/hugetlb.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/hugetlb.c 2008-05-16 17:50:31.000000000 -0700 @@ -14,6 +14,7 @@ #include #include #include +#include <linux/rmap.h> #include #include @@ -843,6 +844,7 @@ void __unmap_hugepage_range(struct vm_ar } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mm_notifier_invalidate_range(mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); @@ -864,6 +866,7 @@ void unmap_hugepage_range(struct vm_area spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); __unmap_hugepage_range(vma, start, end); spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); + mm_notifier_invalidate_range_sync(vma->vm_mm, start, end); } }
+ */ + if (is_cow_mapping(vma->vm_flags)) + mm_notifier_invalidate_range_sync(src_mm, vma->vm_start, end); + + return ret; } static unsigned long zap_pte_range(struct mmu_gather *tlb, @@ -913,6 +928,7 @@ unsigned long unmap_vmas(struct mmu_gath } tlb_finish_mmu(*tlbp, tlb_start, start); + mm_notifier_invalidate_range(vma->vm_mm, tlb_start, start); if (need_resched() || (i_mmap_lock && spin_needbreak(i_mmap_lock))) { @@ -951,8 +967,10 @@ unsigned long zap_page_range(struct vm_a tlb = tlb_gather_mmu(mm, 0); update_hiwater_rss(mm); end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); - if (tlb) + if (tlb) { tlb_finish_mmu(tlb, address, end); + mm_notifier_invalidate_range(mm, address, end); + } return end; } @@ -1711,7 +1729,6 @@ static int do_wp_page(struct mm_struct * */ page_table = pte_offset_map_lock(mm, pmd, address, &ptl); - page_cache_release(old_page); if (!pte_same(*page_table, orig_pte)) goto unlock; @@ -1729,6 +1746,7 @@ static int do_wp_page(struct mm_struct * if (ptep_set_access_flags(vma, address, page_table, entry,1)) update_mmu_cache(vma, address, entry); ret |= VM_FAULT_WRITE; + old_page = NULL; goto unlock; } @@ -1774,6 +1792,7 @@ gotten: * thread doing COW. */ ptep_clear_flush(vma, address, page_table); + mm_notifier_invalidate_page(mm, old_page, address); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); @@ -1787,10 +1806,13 @@ gotten: if (new_page) page_cache_release(new_page); - if (old_page) - page_cache_release(old_page); unlock: pte_unmap_unlock(page_table, ptl); + if (old_page) { + mm_notifier_invalidate_page_sync(old_page); + page_cache_release(old_page); + } + if (dirty_page) { if (vma->vm_file) file_update_time(vma->vm_file); Index: linux-2.6/mm/mmap.c =================================================================== --- linux-2.6.orig/mm/mmap.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/mmap.c 2008-05-16 16:06:26.000000000 -0700 @@ -1759,6 +1759,8 @@ static void unmap_region(struct mm_struc free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? 
next->vm_start: 0); tlb_finish_mmu(tlb, start, end); + mm_notifier_invalidate_range(mm, start, end); + mm_notifier_invalidate_range_sync(mm, start, end); } /* @@ -2048,6 +2050,7 @@ void exit_mmap(struct mm_struct *mm) /* mm's last user has gone, and its about to be pulled down */ arch_exit_mmap(mm); + mm_notifier_release(mm); lru_add_drain(); flush_cache_mm(mm); Index: linux-2.6/mm/mprotect.c =================================================================== --- linux-2.6.orig/mm/mprotect.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/mprotect.c 2008-05-16 16:06:26.000000000 -0700 @@ -21,6 +21,7 @@ #include #include #include +#include <linux/rmap.h> #include #include #include @@ -132,6 +133,7 @@ static void change_protection(struct vm_ change_pud_range(mm, pgd, addr, next, newprot, dirty_accountable); } while (pgd++, addr = next, addr != end); flush_tlb_range(vma, start, end); + mm_notifier_invalidate_range(vma->vm_mm, start, end); } int @@ -211,6 +213,7 @@ success: hugetlb_change_protection(vma, start, end, vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); + mm_notifier_invalidate_range_sync(mm, start, end); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); return 0; Index: linux-2.6/mm/mremap.c =================================================================== --- linux-2.6.orig/mm/mremap.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/mm/mremap.c 2008-05-16 16:06:26.000000000 -0700 @@ -18,6 +18,7 @@ #include #include #include +#include <linux/rmap.h> #include #include @@ -74,6 +75,7 @@ static void move_ptes(struct vm_area_str struct mm_struct *mm = vma->vm_mm; pte_t *old_pte, *new_pte, pte; spinlock_t *old_ptl, *new_ptl; + unsigned long old_start = old_addr; if (vma->vm_file) { /* @@ -100,6 +102,7 @@ static void move_ptes(struct vm_area_str spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); arch_enter_lazy_mmu_mode(); + mm_notifier_invalidate_range(mm, old_addr, old_end); for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE, new_pte++, new_addr += PAGE_SIZE) { if (pte_none(*old_pte)) @@ -116,6 +119,8 @@ static void move_ptes(struct vm_area_str pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) spin_unlock(&mapping->i_mmap_lock); + + mm_notifier_invalidate_range_sync(vma->vm_mm, old_start, old_end); } #define LATENCY_LIMIT (64 * PAGE_SIZE) Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/rmap.c 2008-05-16 16:06:26.000000000 -0700 @@ -52,6 +52,9 @@ #include +struct mm_notifier *mm_notifier_page_sync; +DECLARE_RWSEM(mm_notifier_page_sync_sem); + struct kmem_cache *anon_vma_cachep; /* This must be called under the mmap_sem. */ @@ -458,6 +461,7 @@ static int page_mkclean_one(struct page flush_cache_page(vma, address, pte_pfn(*pte)); entry = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -502,8 +506,8 @@ int page_mkclean(struct page *page) ret = 1; } } + mm_notifier_invalidate_page_sync(page); } - return ret; } EXPORT_SYMBOL_GPL(page_mkclean); @@ -725,6 +729,7 @@ static int try_to_unmap_one(struct page /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, page_to_pfn(page)); pteval = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -855,6 +860,7 @@ static void try_to_unmap_cluster(unsigne /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); pteval = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) @@ -1013,8 +1019,9 @@ int try_to_unmap(struct page *page, int else ret = try_to_unmap_file(page, migration); + mm_notifier_invalidate_page_sync(page); + if (!page_mapped(page)) ret = SWAP_SUCCESS; return ret; } - Index: linux-2.6/include/linux/rmap.h =================================================================== --- linux-2.6.orig/include/linux/rmap.h 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/include/linux/rmap.h 2008-05-16 18:32:52.000000000 -0700 @@ -133,4 +133,165 @@ static inline int page_mkclean(struct pa #define SWAP_AGAIN 1 #define SWAP_FAIL 2 +#ifdef CONFIG_MM_NOTIFIER + +struct mm_notifier_ops { + void (*invalidate_range)(struct mm_notifier *mn, struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_range_sync)(struct mm_notifier *mn, struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_page)(struct mm_notifier *mn, struct mm_struct *mm, + struct page *page, unsigned long addr); + void (*invalidate_page_sync)(struct mm_notifier *mn, struct mm_struct *mm, + struct page *page); + void (*release)(struct mm_notifier *mn, struct mm_struct *mm); + void (*destroy)(struct mm_notifier *mn, struct mm_struct *mm); +}; + +struct mm_notifier { + struct mm_notifier_ops *ops; + struct mm_struct *mm; + struct mm_notifier *next; + struct mm_notifier *next_page_sync; +}; + +extern struct mm_notifier *mm_notifier_page_sync; +extern struct rw_semaphore mm_notifier_page_sync_sem; + +/* + * Must hold mmap_sem when calling mm_notifier_register. + */ +static inline void mm_notifier_register(struct mm_notifier *mn, + struct mm_struct *mm) +{ + mn->mm = mm; + mn->next = mm->mm_notifier; + rcu_assign_pointer(mm->mm_notifier, mn); + if (mn->ops->invalidate_page_sync) { + down_write(&mm_notifier_page_sync_sem); + mn->next_page_sync = mm_notifier_page_sync; + mm_notifier_page_sync = mn; + up_write(&mm_notifier_page_sync_sem); + } +} + +/* + * Invalidate remote references in a particular address range + */ +static inline void mm_notifier_invalidate_range(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + mn->ops->invalidate_range(mn, mm, start, end); +} + +/* + * Invalidate remote references in a particular address range. + * Can sleep. Only return if all remote references have been removed. 
+ */ +static inline void mm_notifier_invalidate_range_sync(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + if (mn->ops->invalidate_range_sync) + mn->ops->invalidate_range_sync(mn, mm, start, end); +} + +/* + * Invalidate remote references to a page + */ +static inline void mm_notifier_invalidate_page(struct mm_struct *mm, + struct page *page, unsigned long addr) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + mn->ops->invalidate_page(mn, mm, page, addr); +} + +/* + * Invalidate remote references to a particular page. Only return + * if all references have been removed. + * + * Note: This is an expensive function since it is not clear at the time + * of call to which mm_struct the page belongs. It walks through the + * mmlist and calls the mmu notifier ops for each address space in the + * system. At some point this needs to be optimized. + */ +static inline void mm_notifier_invalidate_page_sync(struct page *page) +{ + struct mm_notifier *mn; + + if (!PageNotifier(page)) + return; + + down_read(&mm_notifier_page_sync_sem); + + for (mn = mm_notifier_page_sync; mn; mn = mn->next_page_sync) + if (mn->ops->invalidate_page_sync) + mn->ops->invalidate_page_sync(mn, mn->mm, page); + + up_read(&mm_notifier_page_sync_sem); +} + +/* + * Invalidate all remote references before shutdown + */ +static inline void mm_notifier_release(struct mm_struct *mm) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + mn->ops->release(mn, mm); +} + +/* + * Release resources before freeing mm_struct. + */ +static inline void mm_notifier_destroy(struct mm_struct *mm) +{ + struct mm_notifier *mn; + + while (mm->mm_notifier) { + mn = mm->mm_notifier; + mm->mm_notifier = mn->next; + if (mn->ops->invalidate_page_sync) { + struct mm_notifier *m; + + down_write(&mm_notifier_page_sync_sem); + + if (mm_notifier_page_sync != mn) { + for (m = mm_notifier_page_sync; m; m = m->next_page_sync) + if (m->next_page_sync == mn) + break; + + m->next_page_sync = mn->next_page_sync; + } else + mm_notifier_page_sync = mn->next_page_sync; + + up_write(&mm_notifier_page_sync_sem); + } + mn->ops->destroy(mn, mm); + } +} +#else +static inline void mm_notifier_invalidate_range(struct mm_struct *mm, + unsigned long start, unsigned long end) {} +static inline void mm_notifier_invalidate_range_sync(struct mm_struct *mm, + unsigned long start, unsigned long end) {} +static inline void mm_notifier_invalidate_page(struct mm_struct *mm, + struct page *page, unsigned long address) {} +static inline void mm_notifier_invalidate_page_sync(struct page *page) {} +static inline void mm_notifier_release(struct mm_struct *mm) {} +static inline void mm_notifier_destroy(struct mm_struct *mm) {} +#endif + #endif /* _LINUX_RMAP_H */ Index: linux-2.6/mm/Kconfig =================================================================== --- linux-2.6.orig/mm/Kconfig 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/Kconfig 2008-05-16 16:06:26.000000000 -0700 @@ -205,3 +205,7 @@ config NR_QUICK config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MM_NOTIFIER + def_bool y + Index: linux-2.6/include/linux/mm_types.h =================================================================== --- linux-2.6.orig/include/linux/mm_types.h 2008-05-16 11:28:49.000000000 -0700 +++ 
linux-2.6/include/linux/mm_types.h 2008-05-16 16:06:26.000000000 -0700 @@ -244,6 +244,9 @@ struct mm_struct { struct file *exe_file; unsigned long num_exe_file_vmas; #endif +#ifdef CONFIG_MM_NOTIFIER + struct mm_notifier *mm_notifier; +#endif }; #endif /* _LINUX_MM_TYPES_H */ Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/include/linux/page-flags.h 2008-05-16 16:06:26.000000000 -0700 @@ -93,6 +93,7 @@ enum pageflags { PG_mappedtodisk, /* Has blocks allocated on-disk */ PG_reclaim, /* To be reclaimed asap */ PG_buddy, /* Page is free, on buddy lists */ + PG_notifier, /* Call notifier when page is changed/unmapped */ #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR PG_uncached, /* Page has been mapped as uncached */ #endif @@ -173,6 +174,8 @@ PAGEFLAG(MappedToDisk, mappedtodisk) PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim) PAGEFLAG(Readahead, reclaim) /* Reminder to do async read-ahead */ +PAGEFLAG(Notifier, notifier); + #ifdef CONFIG_HIGHMEM /* * Must use a macro here due to header dependency issues. page_zone() is not Index: linux-2.6/fs/hugetlbfs/inode.c =================================================================== --- linux-2.6.orig/fs/hugetlbfs/inode.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/fs/hugetlbfs/inode.c 2008-05-16 16:06:55.000000000 -0700 @@ -442,6 +442,8 @@ hugetlb_vmtruncate_list(struct prio_tree __unmap_hugepage_range(vma, vma->vm_start + v_offset, vma->vm_end); + mm_notifier_invalidate_range_sync(vma->vm_mm, + vma->vm_start + v_offset, vma->vm_end); } } From sashak at voltaire.com Sat May 17 08:17:50 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 17 May 2008 18:17:50 +0300 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> <20080515101418.4ccb53f3.weiny2@llnl.gov> Message-ID: <20080517151750.GA30185@sashak.voltaire.com> On 11:37 Thu 15 May , Chris Worley wrote: > On Thu, May 15, 2008 at 11:14 AM, Ira Weiny wrote: > > On Thu, 15 May 2008 10:26:37 -0600 > > "Chris Worley" wrote: > > >> After an sm change (i.e. using the "-r" switch), nodes can't ping each > >> other over IPoIB (other protocols also can't communicate). > > > > Is it absolutely necessary to run with the "-r" switch? Here we have no > > problems letting the SM attempt to use the same LID's for nodes. > > yes, especially when changing routing algorithms between the default > and fat-tree. As Yevgeny said it looks like an error (or at least unexpected behavior) in fat-tree code. Could you send ibnetdiscover output and "old" guid2lid file for us? 
Sasha From kliteyn at mellanox.co.il Sat May 17 08:52:45 2008 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Sat, 17 May 2008 18:52:45 +0300 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <20080517151750.GA30185@sashak.voltaire.com> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> <20080515101418.4ccb53f3.weiny2@llnl.gov> <20080517151750.GA30185@sashak.voltaire.com> Message-ID: <482EFF4D.8060501@mellanox.co.il> Sasha Khapyorsky wrote: > On 11:37 Thu 15 May , Chris Worley wrote: > >> On Thu, May 15, 2008 at 11:14 AM, Ira Weiny wrote: >> >>> On Thu, 15 May 2008 10:26:37 -0600 >>> "Chris Worley" wrote: >>> >> >> >>>> After an sm change (i.e. using the "-r" switch), nodes can't ping each >>>> other over IPoIB (other protocols also can't communicate). >>>> >>> Is it absolutely necessary to run with the "-r" switch? Here we have no >>> problems letting the SM attempt to use the same LID's for nodes. >>> >> yes, especially when changing routing algorithms between the default >> and fat-tree. >> > > As Yevgeny said it looks like an error (or at least unexpected behavior) > in fat-tree code. Could you send ibnetdiscover output and "old" guid2lid > file for us? > There's also an open bug on bugzilla for this: https://bugs.openfabrics.org/show_bug.cgi?id=1031 (which also lacks the details that would help me to reproduce it). -- Yevgeny > Sasha From eli at dev.mellanox.co.il Sat May 17 11:27:44 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sat, 17 May 2008 21:27:44 +0300 Subject: [ofa-general] Folks is this a known problem / already fixed ? In-Reply-To: <482DD233.5010808@oracle.com> References: <482DD233.5010808@oracle.com> Message-ID: <1211048864.6696.7.camel@eli-laptop> On Fri, 2008-05-16 at 14:28 -0400, Richard Frank wrote: > We see the following failure for our ConnectX HCAs.. with 1.3.1 Daily > 20080512 done on vanilla OEL5U1. > > They are failing to initialize with the following: > > mlx4_core: Mellanox ConnectX core driver v1.0 (February 28, 2008) > mlx4_core: Initializing 0000:05:00.0 > mlx4_core 0000:05:00.0: Failed to initialize queue pair table, aborting. > mlx4_core 0000:05:00.0: Failed to initialize queue pair table, aborting. > mlx4_core: probe of 0000:05:00.0 failed with error -16 > > And lspci shows: > > 05:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR] (rev > a0) > Subsystem: Mellanox Technologies MT25418 [ConnectX IB DDR] > Flags: fast devsel, IRQ 169 > Memory at fcc00000 (64-bit, non-prefetchable) [disabled] [size=1M] > Memory at fff000000 (64-bit, prefetchable) [disabled] [size=8M] > Memory at fcbfe000 (64-bit, non-prefetchable) [disabled] [size=8K] > Capabilities: [40] Power Management version 3 > Capabilities: [48] Vital Product Data > Capabilities: [9c] MSI-X: Enable- Mask- TabSize=256 > Capabilities: [60] Express Endpoint IRQ 0 > Can you send the output of lspci for the bridge connecting the ConnectX with the upstream PCI bus? 
I guess the problem would be that the bridge blocks memory writes to
ConnectX's UAR area, thus causing a failure to arm the EQ and eventually
a failure to load the driver. Now it could be a failure of the kernel to
configure the bridge properly. Could you try with the latest kernel?

From sashak at voltaire.com  Sat May 17 16:13:55 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 18 May 2008 02:13:55 +0300
Subject: [ofa-general] Re: [PATCH] OpenSM: Fix rpm build, /opensm/opensm.conf failed to install
In-Reply-To: <20080515132721.37644ade.weiny2@llnl.gov>
References: <20080515132721.37644ade.weiny2@llnl.gov>
Message-ID: <20080517231355.GB30185@sashak.voltaire.com>

Hi Ira,

On 13:27 Thu 15 May , Ira Weiny wrote:
>
> I found this while trying to add the Performance Manager HOWTO to the rpm.

Right, when *.spec was generated with a different sysconfdir value than
the one used with rpmbuild there is an issue. Thanks for finding this.

> Therefore, I think this will conflict slightly with that patch.  If you like I
> can resubmit that patch after you apply this.

I don't think it is needed.

>
> Thanks,
> Ira
>
> From 8453b86e94175ff3054a57c5c50e337a96d536bd Mon Sep 17 00:00:00 2001
> From: Ira K. Weiny
> Date: Thu, 15 May 2008 13:13:16 -0700
> Subject: [PATCH] Fix rpm build, /opensm/opensm.conf failed to install
>
>
> Signed-off-by: Ira K. Weiny

Applied. Thanks.

Sasha

From sashak at voltaire.com  Sat May 17 16:23:45 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 18 May 2008 02:23:45 +0300
Subject: [ofa-general] Re: [PATCH V2] OpenSM: Add a Performance Manager HOWTO to the docs and the dist
In-Reply-To: <20080516133502.27a1e9b6.weiny2@llnl.gov>
References: <20080515132723.3add7c6a.weiny2@llnl.gov>
	<1210967529.12616.287.camel@hrosenstock-ws.xsigo.com>
	<20080516133502.27a1e9b6.weiny2@llnl.gov>
Message-ID: <20080517232345.GC30185@sashak.voltaire.com>

On 13:35 Fri 16 May , Ira Weiny wrote:
> On Fri, 16 May 2008 12:52:09 -0700
> Hal Rosenstock wrote:
>
> > On Thu, 2008-05-15 at 13:27 -0700, Ira Weiny wrote:
> > > I decided to write a little HOWTO to help people to set it up.
> >
> > Nice writeup :-)

Really good doc.

> >
> > > 5) Can be run in a standby SM
> >
> > I thought it was changed so that it can run in a standalone mode without
> > SM.  Am I confusing this with something else ?
> >
>
> I think you are right I should have said standalone.  However, can't it also
> work in a standby SM?
>
> yea, from the patch which Sasha applied:
>
>     opensm/perfmgr: PerfMgr for SM standby and inactive states
>
> Here is an updated patch with the correction.
>
> Ira
>
>
> From 9be13c3da4d34ad0a736ced4c9e3bb5e13a24bb6 Mon Sep 17 00:00:00 2001
> From: Ira K. Weiny
> Date: Thu, 15 May 2008 08:19:17 -0700
> Subject: [PATCH] Add a Performance Manager HOWTO to the docs and the dist
>
>
> Signed-off-by: Ira K. Weiny

Applied. Thanks.

Sasha

From sashak at voltaire.com  Sat May 17 16:32:02 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 18 May 2008 02:32:02 +0300
Subject: [ofa-general] Re: [PATCH] [TRIVIAL] OpenSM/doc/modular_routing.txt: Fix typo
In-Reply-To: <1210968832.12616.294.camel@hrosenstock-ws.xsigo.com>
References: <1210968832.12616.294.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080517233202.GE30185@sashak.voltaire.com>

On 13:13 Fri 16 May , Hal Rosenstock wrote:
> OpenSM/doc/modular_routing.txt: Fix typo
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.
Sasha

From sashak at voltaire.com  Sat May 17 16:33:37 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 18 May 2008 02:33:37 +0300
Subject: [ofa-general] Re: [PATCH] OpenSM/doc/QoS_management_in_OpenSM.txt: Remove mention of OFED
In-Reply-To: <1210968976.12616.298.camel@hrosenstock-ws.xsigo.com>
References: <1210968976.12616.298.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080517233337.GF30185@sashak.voltaire.com>

On 13:16 Fri 16 May , Hal Rosenstock wrote:
> OpenSM/doc/QoS_management_in_OpenSM.txt: Remove mention of OFED
>
> I'll leave the other heavy lifting to Yevgeny :-)
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.

Sasha

From sashak at voltaire.com  Sat May 17 17:10:38 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 18 May 2008 03:10:38 +0300
Subject: [ofa-general] [PATCH] OpenSM/doc/QoS_management_in_OpenSM.txt: Remove mention of OFED
In-Reply-To: <1210972718.12616.309.camel@hrosenstock-ws.xsigo.com>
References: <1210968976.12616.298.camel@hrosenstock-ws.xsigo.com>
	<1210972718.12616.309.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080518001038.GI30185@sashak.voltaire.com>

On 14:18 Fri 16 May , Hal Rosenstock wrote:
> On Fri, 2008-05-16 at 13:16 -0700, Hal Rosenstock wrote:
> > OpenSM/doc/QoS_management_in_OpenSM.txt: Remove mention of OFED
> >
> > I'll leave the other heavy lifting to Yevgeny :-)
>
> I forgot: Please apply to master and ofed_1_3

Applied this in my branch.

Sasha

From ogerlitz at voltaire.com  Sat May 17 22:59:53 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 18 May 2008 08:59:53 +0300
Subject: [ofa-general] [PATCH v2] IB/mlx4: Add send with invalidate support
In-Reply-To: <1210836027.18385.2.camel@mtls03>
References: <1210836027.18385.2.camel@mtls03>
Message-ID: <482FC5D9.7060009@voltaire.com>

Eli Cohen wrote:
> --- a/drivers/infiniband/hw/mlx4/cq.c
> +++ b/drivers/infiniband/hw/mlx4/cq.c
> @@ -637,6 +637,7 @@ repoll:
> 	case MLX4_OPCODE_SEND_IMM:
> 		wc->wc_flags |= IB_WC_WITH_IMM;
> 	case MLX4_OPCODE_SEND:
> +	case MLX4_OPCODE_SEND_INVAL:
> 		wc->opcode = IB_WC_SEND;
> 		break;
> 	case MLX4_OPCODE_RDMA_READ:
> @@ -676,6 +677,13 @@ repoll:
> 		wc->wc_flags = IB_WC_WITH_IMM;
> 		wc->imm_data = cqe->immed_rss_invalid;
> 		break;
> +	case MLX4_RECV_OPCODE_SEND_INVAL:
> +		wc->opcode = IB_WC_RECV;
> +		wc->wc_flags = IB_WC_WITH_INVALIDATE;
> +		/*
> +		 * TBD: maybe we should just call this ieth_val
> +		 */
> +		wc->imm_data = be32_to_cpu(cqe->immed_rss_invalid);

Eli,

Is it correct that "cqe->immed_rss_invalid" equals
"wr->ex.invalidate_rkey" that was provided at the sender? if yes, any
reason not to have the same/similar union (imm_data/invalidate_rkey)
also for the work completion structure?
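(For illustration only -- a sketch of the shape being suggested here,
not the committed API: one shared union that both the send WR and the
work completion could use, instead of two locally defined copies. The
union name is made up.)

    #include <linux/types.h>

    /*
     * Sketch only: a single union definition that ib_send_wr and
     * ib_wc could share.  Field names follow the wr->ex usage
     * quoted above.
     */
    union ib_ex_data {
        __be32 imm_data;        /* receive completion: IB_WC_WITH_IMM     */
        u32    invalidate_rkey; /* IB_WC_WITH_INVALIDATE / send WR rkey   */
    };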
Or

From eli at dev.mellanox.co.il  Sun May 18 02:14:58 2008
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Sun, 18 May 2008 12:14:58 +0300
Subject: [ofa-general] [PATCH v2] IB/mlx4: Add send with invalidate support
In-Reply-To: <482FC5D9.7060009@voltaire.com>
References: <1210836027.18385.2.camel@mtls03>
	<482FC5D9.7060009@voltaire.com>
Message-ID: <1211102098.6963.14.camel@eli-laptop>

On Sun, 2008-05-18 at 08:59 +0300, Or Gerlitz wrote:
> Is it correct that "cqe->immed_rss_invalid" equals
> "wr->ex.invalidate_rkey" that was provided at the sender? if yes, any
> reason not to have the same/similar union (imm_data/invalidate_rkey)
> also for the work completion structure?

No reason for them to be different. Roland already suggested using a
union here, although he defines the union locally inside the containing
struct and thus has two definitions for the same union. Roland, do you
intend to commit that?

From monis at Voltaire.COM  Sun May 18 05:19:50 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Sun, 18 May 2008 15:19:50 +0300
Subject: [ofa-general] [PATCH] IB/core: handle race between elements in work queues after event
In-Reply-To: 
References: <48187E5A.7040809@Voltaire.COM> <4819BD29.7080002@Voltaire.COM>
	<4820638E.4030901@Voltaire.COM> <4827FBDF.9040308@Voltaire.COM>
	<482A979F.6040305@Voltaire.COM>
Message-ID: <48301EE6.5070902@Voltaire.COM>

Roland Dreier wrote:
> > OK. Here is an example that was viewed in our tests.
> > One IPoIB host (client) sends a stream of multicast packets to another IPoIB host (server).
> > SM takeover event takes place during traffic and as a result multicast info is flushed
> > and there is a need to rejoin by hosts. Without the patch there is a chance (which according to our experience
> > is a very big chance) that the request to rejoin will be to the old SM and only after a retry join completes successfully.
> > This takes too long and the patch solves it.
>
> OK, that is fairly convincing (and it would be nice to include when
> sending the original patch).
>
> Please resend a version that fixes the races in the patch and we can
> probably add this for 2.6.27.
>
>  - R.

Thanks. I will resend this patch in a series of 2 (in a different
thread). The other patch in the series is related to the one above but
was sent by me earlier in a different thread without justification.

From ogerlitz at voltaire.com  Sun May 18 05:22:12 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 18 May 2008 15:22:12 +0300
Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
In-Reply-To: <20080516223419.27221.49014.stgit@dell3.ogc.int>
References: <20080516223256.27221.34568.stgit@dell3.ogc.int>
	<20080516223419.27221.49014.stgit@dell3.ogc.int>
Message-ID: <48301F74.4020905@voltaire.com>

Steve Wise wrote:
> - device-specific alloc/free of physical buffer lists for use in fast
> register work requests.  This allows devices to allocate this memory as
> needed (like via dma_alloc_coherent).

Steve,

Reading through the suggested API / patches and the previous threads, I
was not sure whether the HW driver must not assume that it has ownership
of the page --list-- structure until the registration work request is
completed - or not.
Now, if ownership cannot be assumed (e.g. as for the SG list elements
pointed to by send/recv WRs), the driver has to clone it anyway, and
thus I don't see the need for the ib_alloc/free_fast_reg_page_list
verbs.

If ownership can be assumed, I suggest having the core fall back to an
implementation of these two verbs like the one you did for the Chelsio
driver whenever the HW driver does not implement them (i.e. instead of
returning ENOSYS). In that case, the alloc_list verb should do DMA
mapping FROM device (I think...) since the device is going to do DMA to
read the page list, and the free_list verb should do DMA unmapping, etc.

Or.

From monis at Voltaire.COM  Sun May 18 05:25:24 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Sun, 18 May 2008 15:25:24 +0300
Subject: [ofa-general] [PATCH 0/2] IB: Improve recovery from SM change events after takeover
Message-ID: <48302034.8040709@Voltaire.COM>

The patches below improve the recovery of the IPoIB driver from a
failure of the SM and a takeover by another SM. The purpose is to
minimize the time that two IPoIB hosts remain disconnected after an SM
takeover event.

Here is an example that we observed in our tests. One IPoIB host
(client) sends a stream of multicast packets to another IPoIB host
(server). An SM takeover event takes place during traffic; as a result,
multicast info is flushed and the hosts need to rejoin. Without the
patches there is a chance (in our experience, a very big one) that the
rejoin request will go to the old SM, and the join completes
successfully only after a retry. This takes too long, and the patches
solve it.

Our tests for IP multicast and unicast traffic between two hosts show
that without the patches there is a period of up to 5 seconds during
which communication is lost; with the patches the time decreases to
less than a second.

From monis at Voltaire.COM  Sun May 18 05:34:31 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Sun, 18 May 2008 15:34:31 +0300
Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in work queues after event
In-Reply-To: <48302034.8040709@Voltaire.COM>
References: <48302034.8040709@Voltaire.COM>
Message-ID: <48302257.2050308@Voltaire.COM>

This patch solves a race between work elements that are carried out
after an event occurs. When the SM address handle becomes invalid and
needs an update, this is handled by a work item in the global workqueue.
On the other hand, the same event is also handled in ib_ipoib by queuing
a work item in the ipoib_workqueue that does the mcast join. Although
the queuing is in the right order, it is done to two different
workqueues, so there is no guarantee that the first to be queued is the
first to be executed.

The patch sets the SM address handle to NULL, and until update_sm_ah()
is called, any request that needs sm_ah fails with an -EAGAIN return
status. For consumers, the patch doesn't make things worse. Before the
patch, MADs were sent to the wrong SM, so the request got lost.
Consumers can be improved to examine the return code and respond to
-EAGAIN properly, but even without that improvement the situation does
not get worse, and in some cases it gets better.
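(For illustration only, not part of the patch: one way a consumer could
respond to the new -EAGAIN return. The retry count, the delay, and the
path_rec_done callback below are made-up placeholders; only
ib_sa_path_rec_get() itself is the existing API.)

    #include <linux/delay.h>
    #include <rdma/ib_sa.h>

    /* Hypothetical completion callback, declared for the sketch. */
    static void path_rec_done(int status, struct ib_sa_path_rec *resp,
                              void *context);

    /*
     * Sketch: retry a path record query while the SA has no SM
     * address handle yet (-EAGAIN means update_sm_ah() has not run
     * yet).  Retry/delay values are arbitrary.
     */
    static int path_rec_get_retry(struct ib_sa_client *client,
                                  struct ib_device *device, u8 port_num,
                                  struct ib_sa_path_rec *rec,
                                  ib_sa_comp_mask comp_mask,
                                  struct ib_sa_query **query)
    {
        int tries = 5;
        int ret;

        do {
            ret = ib_sa_path_rec_get(client, device, port_num, rec,
                                     comp_mask, 1000, GFP_KERNEL,
                                     path_rec_done, NULL, query);
            if (ret != -EAGAIN)
                break;
            msleep(100);    /* give update_sm_ah() a chance to run */
        } while (--tries);

        return ret;
    }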
Signed-off-by: Moni Levy
Signed-off-by: Moni Shoua

---

 drivers/infiniband/core/sa_query.c |   26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index cf474ec..a2e61d7 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -413,9 +413,20 @@ static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event
 	    event->event == IB_EVENT_PKEY_CHANGE ||
 	    event->event == IB_EVENT_SM_CHANGE ||
 	    event->event == IB_EVENT_CLIENT_REREGISTER) {
-		struct ib_sa_device *sa_dev;
-		sa_dev = container_of(handler, typeof(*sa_dev), event_handler);
-
+		unsigned long flags;
+		struct ib_sa_device *sa_dev =
+			container_of(handler, typeof(*sa_dev), event_handler);
+		struct ib_sa_port *port =
+			&sa_dev->port[event->element.port_num - sa_dev->start_port];
+		struct ib_sa_sm_ah *sm_ah;
+
+		spin_lock_irqsave(&port->ah_lock, flags);
+		sm_ah = port->sm_ah;
+		port->sm_ah = NULL;
+		spin_unlock_irqrestore(&port->ah_lock, flags);
+
+		if (sm_ah)
+			kref_put(&sm_ah->ref, free_sm_ah);
 		schedule_work(&sa_dev->port[event->element.port_num -
 					    sa_dev->start_port].update_task);
 	}
@@ -663,6 +674,8 @@ int ib_sa_path_rec_get(struct ib_sa_client *client,
 		return -ENODEV;
 
 	port  = &sa_dev->port[port_num - sa_dev->start_port];
+	if (!port->sm_ah)
+		return -EAGAIN;
 	agent = port->agent;
 
 	query = kmalloc(sizeof *query, gfp_mask);
@@ -780,6 +793,9 @@ int ib_sa_service_rec_query(struct ib_sa_client *client,
 		return -ENODEV;
 
 	port  = &sa_dev->port[port_num - sa_dev->start_port];
+	if (!port->sm_ah)
+		return -EAGAIN;
+
 	agent = port->agent;
 
 	if (method != IB_MGMT_METHOD_GET &&
@@ -877,8 +893,10 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client,
 		return -ENODEV;
 
 	port  = &sa_dev->port[port_num - sa_dev->start_port];
-	agent = port->agent;
 
+	if (!port->sm_ah)
+		return -EAGAIN;
+	agent = port->agent;
 	query = kmalloc(sizeof *query, gfp_mask);
 	if (!query)
 		return -ENOMEM;

From monis at Voltaire.COM  Sun May 18 05:36:11 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Sun, 18 May 2008 15:36:11 +0300
Subject: [ofa-general] [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity
In-Reply-To: <48302034.8040709@Voltaire.COM>
References: <48302034.8040709@Voltaire.COM>
Message-ID: <483022BB.9060004@Voltaire.COM>

The purpose of this patch is to make the events that are related to an
SM change (namely the CLIENT_REREGISTER and SM_CHANGE events) less
disruptive. When SM-related events are handled, it is not necessary to
flush unicast info from the device, only multicast info. This patch
divides the events handled by IPoIB into three categories: 0, 1 and 2
(where level 2 does more than level 1, and level 1 more than level 0).
The main change is in __ipoib_ib_dev_flush(). Instead of passing the
function a pkey_event flag, we now use levels: an event that requires
"harder" flushing calls this function with a higher level. Beyond the
concept, the actual change is that SM-related events no longer flush
unicast info or bring the device down; they only refresh the multicast
info in the background.
Signed-off-by: Moni Levy Signed-off-by: Moni Shoua --- drivers/infiniband/ulp/ipoib/ipoib.h | 9 ++++--- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 37 ++++++++++++++++++----------- drivers/infiniband/ulp/ipoib/ipoib_main.c | 5 ++- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 19 +++++++------- 4 files changed, 43 insertions(+), 27 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index ca126fc..8ed4dc0 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -276,10 +276,11 @@ struct ipoib_dev_priv { struct delayed_work pkey_poll_task; struct delayed_work mcast_task; - struct work_struct flush_task; + struct work_struct flush_task0; + struct work_struct flush_task1; + struct work_struct flush_task2; struct work_struct restart_task; struct delayed_work ah_reap_task; - struct work_struct pkey_event_task; struct ib_device *ca; u8 port; @@ -427,7 +428,9 @@ void ipoib_flush_paths(struct net_device *dev); struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); -void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_ib_dev_flush0(struct work_struct *work); +void ipoib_ib_dev_flush1(struct work_struct *work); +void ipoib_ib_dev_flush2(struct work_struct *work); void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index f429bce..2a9c058 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -898,12 +898,14 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) return 0; } -static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int level) { struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; u16 new_index; + ipoib_dbg(priv, "Try flushing level %d\n", level); + mutex_lock(&priv->vlan_mutex); /* @@ -911,7 +913,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) * the parent is down. 
*/ list_for_each_entry(cpriv, &priv->child_intfs, list) - __ipoib_ib_dev_flush(cpriv, pkey_event); + __ipoib_ib_dev_flush(cpriv, level); mutex_unlock(&priv->vlan_mutex); @@ -925,7 +927,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) return; } - if (pkey_event) { + if (level == 2) { if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); ipoib_ib_dev_down(dev, 0); @@ -943,11 +945,13 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) priv->pkey_index = new_index; } - ipoib_dbg(priv, "flushing\n"); - ipoib_ib_dev_down(dev, 0); + ipoib_mcast_dev_flush(dev); + + if (level >= 1) + ipoib_ib_dev_down(dev, 0); - if (pkey_event) { + if (level >= 2) { ipoib_ib_dev_stop(dev, 0); ipoib_ib_dev_open(dev); } @@ -957,29 +961,36 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) * we get here, don't bring it back up if it's not configured up */ if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) { - ipoib_ib_dev_up(dev); + if (level >= 1) + ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } } -void ipoib_ib_dev_flush(struct work_struct *work) +void ipoib_ib_dev_flush0(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + container_of(work, struct ipoib_dev_priv, flush_task0); - ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); __ipoib_ib_dev_flush(priv, 0); } -void ipoib_pkey_event(struct work_struct *work) +void ipoib_ib_dev_flush1(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_event_task); + container_of(work, struct ipoib_dev_priv, flush_task1); - ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name); __ipoib_ib_dev_flush(priv, 1); } +void ipoib_ib_dev_flush2(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, flush_task2); + + __ipoib_ib_dev_flush(priv, 2); +} + void ipoib_ib_dev_cleanup(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 2442090..2808023 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -989,9 +989,10 @@ static void ipoib_setup(struct net_device *dev) INIT_LIST_HEAD(&priv->multicast_list); INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); - INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); - INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); + INIT_WORK(&priv->flush_task0, ipoib_ib_dev_flush0); + INIT_WORK(&priv->flush_task1, ipoib_ib_dev_flush1); + INIT_WORK(&priv->flush_task2, ipoib_ib_dev_flush2); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index 8766d29..80c0409 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -289,15 +289,16 @@ void ipoib_event(struct ib_event_handler *handler, if (record->element.port_num != priv->port) return; - if (record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PORT_ACTIVE || - record->event == IB_EVENT_LID_CHANGE || - record->event == IB_EVENT_SM_CHANGE || - record->event == IB_EVENT_CLIENT_REREGISTER) { - 
ipoib_dbg(priv, "Port state change event\n");
-		queue_work(ipoib_workqueue, &priv->flush_task);
+	ipoib_dbg(priv, "Event %d on device %s port %d\n",record->event,
+		record->device->name, record->element.port_num);
+	if ( record->event == IB_EVENT_SM_CHANGE ||
+	     record->event == IB_EVENT_CLIENT_REREGISTER) {
+		queue_work(ipoib_workqueue, &priv->flush_task0);
+	} else if (record->event == IB_EVENT_PORT_ERR ||
+		   record->event == IB_EVENT_PORT_ACTIVE ||
+		   record->event == IB_EVENT_LID_CHANGE) {
+		queue_work(ipoib_workqueue, &priv->flush_task1);
 	} else if (record->event == IB_EVENT_PKEY_CHANGE) {
-		ipoib_dbg(priv, "P_Key change event on port:%d\n", priv->port);
-		queue_work(ipoib_workqueue, &priv->pkey_event_task);
+		queue_work(ipoib_workqueue, &priv->flush_task2);
 	}
 }

From jackm at dev.mellanox.co.il  Sun May 18 07:34:55 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 18 May 2008 17:34:55 +0300
Subject: [ofa-general] mthca max_sge value... ugh.
In-Reply-To: <200805160919.45676.okir@lst.de>
References: <200805160919.45676.okir@lst.de>
Message-ID: <200805181734.55378.jackm@dev.mellanox.co.il>

This is actually a known issue, which we never got around to fixing.
From the OFED 1.2.5 release notes (document docs/mthca_release_notes.txt),
in section "3. Known Issues":

3. In mem-free devices, RC QPs can be created with a maximum of
   (max_sge - 3) entries only.

- Jack

On Friday 16 May 2008 10:19, Olaf Kirch wrote:
> On Friday 16 May 2008 00:22:12 Roland Dreier wrote:
> > I ran into this a few weeks back as well, when I tried to up the SG limit
> in RDS to 32 (on a arbel memfree card).

memfree returns 30 as the max, I believe. ConnectX returns 32.

> I grepped around the code a bit, got a little confused because of all the
> different max_sge, max_sg and max_gs variables :-) and eventually
> convinced myself that the max_sge reported simply doesn't include the
> transport specific overhead that mthca_alloc_wqe_buf factors in.
>
> Given that you have quite different WQE overheads depending on the transport,
> a conservative max_sge value that works for all transports wastes one or two
> entries on some others. Maybe once the QP is created, it could report
> the actual max_sge value (which may actually be greater than the conservative,
> transport-independent max_sge estimate of the device).

This is a problem, because then you are returning a value which is
greater than the declared device max. This causes IB Spec
non-compliance.

>
> Olaf

From hrosenstock at xsigo.com  Sun May 18 07:44:50 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Sun, 18 May 2008 07:44:50 -0700
Subject: [ofa-general] [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity
In-Reply-To: <483022BB.9060004@Voltaire.COM>
References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM>
Message-ID: <1211121890.12616.381.camel@hrosenstock-ws.xsigo.com>

On Sun, 2008-05-18 at 15:36 +0300, Moni Shoua wrote:
> The purpose of this patch is to make the events that are related to an SM change
> (namely the CLIENT_REREGISTER and SM_CHANGE events) less disruptive.
> When SM-related events are handled, it is not necessary to flush unicast
> info from the device, only multicast info.

How is unicast invalidation handled on these changes? On a local LID
change event, how does an end port know/determine what else (e.g. other
LIDs, paths) the SM might have changed (that specifically might affect
IPoIB since this is limited to IPoIB)?
Also, wouldn't there be similar issues with other ULPs ? -- Hal > This patch divides the events that are > handled by IPoIB to three categories; 0, 1 and 2 (when 2 does more than 1 and 1 > does more than 0). > The main change is in __ipoib_ib_dev_flush(). Instead of flagging to the function > about pkey_events we now use leveling. An event that requires "harder" flushing > calls this function with higher number for level. Besides the concept, > the actual change is that SM related events are not flushing unicast info and > not bringing the device down but only refresh the multicast info in the background. > > Signed-off-by: Moni Levy > Signed-off-by: Moni Shoua > > --- > > drivers/infiniband/ulp/ipoib/ipoib.h | 9 ++++--- > drivers/infiniband/ulp/ipoib/ipoib_ib.c | 37 ++++++++++++++++++----------- > drivers/infiniband/ulp/ipoib/ipoib_main.c | 5 ++- > drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 19 +++++++------- > 4 files changed, 43 insertions(+), 27 deletions(-) > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h > index ca126fc..8ed4dc0 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib.h > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h > @@ -276,10 +276,11 @@ struct ipoib_dev_priv { > > struct delayed_work pkey_poll_task; > struct delayed_work mcast_task; > - struct work_struct flush_task; > + struct work_struct flush_task0; > + struct work_struct flush_task1; > + struct work_struct flush_task2; > struct work_struct restart_task; > struct delayed_work ah_reap_task; > - struct work_struct pkey_event_task; > > struct ib_device *ca; > u8 port; > @@ -427,7 +428,9 @@ void ipoib_flush_paths(struct net_device *dev); > struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); > > int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); > -void ipoib_ib_dev_flush(struct work_struct *work); > +void ipoib_ib_dev_flush0(struct work_struct *work); > +void ipoib_ib_dev_flush1(struct work_struct *work); > +void ipoib_ib_dev_flush2(struct work_struct *work); > void ipoib_pkey_event(struct work_struct *work); > void ipoib_ib_dev_cleanup(struct net_device *dev); > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > index f429bce..2a9c058 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > @@ -898,12 +898,14 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) > return 0; > } > > -static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int level) > { > struct ipoib_dev_priv *cpriv; > struct net_device *dev = priv->dev; > u16 new_index; > > + ipoib_dbg(priv, "Try flushing level %d\n", level); > + > mutex_lock(&priv->vlan_mutex); > > /* > @@ -911,7 +913,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > * the parent is down. 
> */ > list_for_each_entry(cpriv, &priv->child_intfs, list) > - __ipoib_ib_dev_flush(cpriv, pkey_event); > + __ipoib_ib_dev_flush(cpriv, level); > > mutex_unlock(&priv->vlan_mutex); > > @@ -925,7 +927,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > return; > } > > - if (pkey_event) { > + if (level == 2) { > if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { > clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > ipoib_ib_dev_down(dev, 0); > @@ -943,11 +945,13 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > priv->pkey_index = new_index; > } > > - ipoib_dbg(priv, "flushing\n"); > > - ipoib_ib_dev_down(dev, 0); > + ipoib_mcast_dev_flush(dev); > + > + if (level >= 1) > + ipoib_ib_dev_down(dev, 0); > > - if (pkey_event) { > + if (level >= 2) { > ipoib_ib_dev_stop(dev, 0); > ipoib_ib_dev_open(dev); > } > @@ -957,29 +961,36 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > * we get here, don't bring it back up if it's not configured up > */ > if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) { > - ipoib_ib_dev_up(dev); > + if (level >= 1) > + ipoib_ib_dev_up(dev); > ipoib_mcast_restart_task(&priv->restart_task); > } > } > > -void ipoib_ib_dev_flush(struct work_struct *work) > +void ipoib_ib_dev_flush0(struct work_struct *work) > { > struct ipoib_dev_priv *priv = > - container_of(work, struct ipoib_dev_priv, flush_task); > + container_of(work, struct ipoib_dev_priv, flush_task0); > > - ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); > __ipoib_ib_dev_flush(priv, 0); > } > > -void ipoib_pkey_event(struct work_struct *work) > +void ipoib_ib_dev_flush1(struct work_struct *work) > { > struct ipoib_dev_priv *priv = > - container_of(work, struct ipoib_dev_priv, pkey_event_task); > + container_of(work, struct ipoib_dev_priv, flush_task1); > > - ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name); > __ipoib_ib_dev_flush(priv, 1); > } > > +void ipoib_ib_dev_flush2(struct work_struct *work) > +{ > + struct ipoib_dev_priv *priv = > + container_of(work, struct ipoib_dev_priv, flush_task2); > + > + __ipoib_ib_dev_flush(priv, 2); > +} > + > void ipoib_ib_dev_cleanup(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 2442090..2808023 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -989,9 +989,10 @@ static void ipoib_setup(struct net_device *dev) > INIT_LIST_HEAD(&priv->multicast_list); > > INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); > - INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); > INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); > - INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); > + INIT_WORK(&priv->flush_task0, ipoib_ib_dev_flush0); > + INIT_WORK(&priv->flush_task1, ipoib_ib_dev_flush1); > + INIT_WORK(&priv->flush_task2, ipoib_ib_dev_flush2); > INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); > INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah); > } > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > index 8766d29..80c0409 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > @@ -289,15 +289,16 @@ void ipoib_event(struct ib_event_handler *handler, > if (record->element.port_num != priv->port) > return; > > - if 
(record->event == IB_EVENT_PORT_ERR || > - record->event == IB_EVENT_PORT_ACTIVE || > - record->event == IB_EVENT_LID_CHANGE || > - record->event == IB_EVENT_SM_CHANGE || > - record->event == IB_EVENT_CLIENT_REREGISTER) { > - ipoib_dbg(priv, "Port state change event\n"); > - queue_work(ipoib_workqueue, &priv->flush_task); > + ipoib_dbg(priv, "Event %d on device %s port %d\n",record->event, > + record->device->name, record->element.port_num); > + if ( record->event == IB_EVENT_SM_CHANGE || > + record->event == IB_EVENT_CLIENT_REREGISTER) { > + queue_work(ipoib_workqueue, &priv->flush_task0); > + } else if (record->event == IB_EVENT_PORT_ERR || > + record->event == IB_EVENT_PORT_ACTIVE || > + record->event == IB_EVENT_LID_CHANGE) { > + queue_work(ipoib_workqueue, &priv->flush_task1); > } else if (record->event == IB_EVENT_PKEY_CHANGE) { > - ipoib_dbg(priv, "P_Key change event on port:%d\n", priv->port); > - queue_work(ipoib_workqueue, &priv->pkey_event_task); > + queue_work(ipoib_workqueue, &priv->flush_task2); > } > } > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jackm at dev.mellanox.co.il Sun May 18 07:49:41 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 18 May 2008 17:49:41 +0300 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: References: Message-ID: <200805181749.42187.jackm@dev.mellanox.co.il> On Saturday 17 May 2008 02:12, Roland Dreier wrote: > > if we can't use the "WQE shrinking" feature (because of selective > signaling in the NFS/RDMA case), and we want to use 32 sge entries, then > the WQE size 's' will end up a little more than 512 bytes, and the > wqe_shift will end up as 10. > But since the max_sq_desc_sz is 1008, we > return -EINVAL, when it is really fine to have a wqe_shift of 10 as long > as we don't use more than 1008 bytes per descriptor (I think). Correct. ... > @@ -395,7 +396,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, > ++qp->sq.wqe_shift; > } > > - qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - > + qp->sq.max_gs = (min(dev->dev->caps.max_sq_desc_sz, > + (qp->sq_max_wqes_per_wr << qp->sq.wqe_shift)) - > send_wqe_overhead(type, qp->flags)) / > sizeof (struct mlx4_wqe_data_seg); In this case, sq.max_gs ( = (1008 - wqe overhead) / 16) will be larger than the "max sge" value returned by ib_query_device, (max_sge returned by ib_query_device is 32). I'm not crazy about this inconsistency. Please note also that the IB Spec does not differentiate between Send max_sge, and Receive max_sge, so we're reduced to enforcing the minimum of the two values. The general approach taken in the driver is to enforce the smallest of the sge values, to avoid dealing with the individual qp type maxima. - Jack From tziporet at dev.mellanox.co.il Sun May 18 08:05:49 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 18 May 2008 18:05:49 +0300 Subject: [ofa-general] SET_IPOIB_CM enabled by default in some builds and not in others In-Reply-To: References: <482AC510.3090602@dev.mellanox.co.il> Message-ID: <483045CD.8060301@mellanox.co.il> Chris Worley wrote: > Ahhh... it was probably because I added the RPMs w/o deleting 1.2.5.5 > in the "kitchen sink" build. > > Is there any reason to NOT use connected mode? 
In general the CM gives better performance for medium and large
messages. We found that UD mode is better for small UDP messages.

Tziporet

From monis at Voltaire.COM  Sun May 18 08:10:02 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Sun, 18 May 2008 18:10:02 +0300
Subject: [ofa-general] [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity
In-Reply-To: <1211121890.12616.381.camel@hrosenstock-ws.xsigo.com>
References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM>
	<1211121890.12616.381.camel@hrosenstock-ws.xsigo.com>
Message-ID: <483046CA.3010403@Voltaire.COM>

Hal Rosenstock wrote:
> On Sun, 2008-05-18 at 15:36 +0300, Moni Shoua wrote:
>> The purpose of this patch is to make the events that are related to an SM change
>> (namely the CLIENT_REREGISTER and SM_CHANGE events) less disruptive.
>> When SM-related events are handled, it is not necessary to flush unicast
>> info from the device, only multicast info.
>
> How is unicast invalidation handled on these changes? On a local LID
> change event, how does an end port know/determine what else (e.g. other
> LIDs, paths) the SM might have changed (that specifically might affect
> IPoIB since this is limited to IPoIB)?

I'm not sure I understand the question, but a local LID change would be
handled as before, with a LID_CHANGE event. For this type of event there
is no change in what IPoIB does to cope.

>
> Also, wouldn't there be similar issues with other ULPs ?

There might be, but the purpose of this one is to make things better for
IPoIB.

>
> -- Hal
>

From vlad at dev.mellanox.co.il  Sun May 18 08:22:39 2008
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Sun, 18 May 2008 18:22:39 +0300
Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size
Message-ID: <483049BF.4050603@dev.mellanox.co.il>

From 3bb8b713da6a0b2087201d6fb6c1a8d9274cf16e Mon Sep 17 00:00:00 2001
From: Vladimir Sokolovsky
Date: Sun, 18 May 2008 11:25:55 +0300
Subject: [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size

There is a bug in the OFED 1.3 mlx4 driver: mlx4_alloc_fmr hardcodes
the minimum acceptable page_shift to be 12. However, new mlx4 firmware
has a minimum page_shift of 9 (log_pg_sz of 9 returned by
QUERY_DEV_LIM) -- so ib_fmr_alloc fails for ULPs using the device
minimum when creating FMRs.

To preserve firmware compatibility with released OFED drivers, the
firmware will continue to return 12 as before for log_page_sz in
QUERY_DEV_CAP for these drivers. However, to enable new drivers to take
advantage of the available smaller page size, the mlx4 driver now first
sets the log_pg_sz to the device minimum via the MOD_STAT_CFG() command,
and only then calls QUERY_DEV_CAP(). The QUERY_DEV_CAP() command then
returns the new (lower) log_pg_sz value.
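(For illustration only, not part of the patch: what the lower log_pg_sz
buys a ULP. ib_alloc_fmr() and struct ib_fmr_attr are the existing
kernel verbs API; the function name and the max_pages/max_maps numbers
below are made up for the example.)

    #include <rdma/ib_verbs.h>

    /*
     * Sketch: create an FMR over 512-byte pages (page_shift = 9).
     * Against a driver that still reports a 4KB minimum (12), the
     * same call would fail.
     */
    static struct ib_fmr *alloc_small_page_fmr(struct ib_pd *pd)
    {
        struct ib_fmr_attr attr = {
            .max_pages  = 64,   /* arbitrary for the example */
            .max_maps   = 32,   /* arbitrary for the example */
            .page_shift = 9,    /* 512-byte pages */
        };

        return ib_alloc_fmr(pd, IB_ACCESS_LOCAL_WRITE |
                                IB_ACCESS_REMOTE_READ |
                                IB_ACCESS_REMOTE_WRITE, &attr);
    }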
Signed-off-by: Jack Morgenstein Signed-off-by: Vladimir Sokolovsky --- drivers/net/mlx4/fw.c | 28 ++++++++++++++++++++++++++++ drivers/net/mlx4/fw.h | 6 ++++++ drivers/net/mlx4/main.c | 13 +++++++++++++ include/linux/mlx4/cmd.h | 2 +- 4 files changed, 48 insertions(+), 1 deletions(-) diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c index d82f275..2b5006b 100644 --- a/drivers/net/mlx4/fw.c +++ b/drivers/net/mlx4/fw.c @@ -101,6 +101,34 @@ static void dump_dev_cap_flags(struct mlx4_dev *dev, u32 flags) mlx4_dbg(dev, " %s\n", fname[i]); } +int mlx4_MOD_STAT_CFG(struct mlx4_dev *dev, struct mlx4_mod_stat_cfg *cfg) +{ + struct mlx4_cmd_mailbox *mailbox; + u32 *inbox; + int err = 0; + +#define MOD_STAT_CFG_IN_SIZE 0x100 + +#define MOD_STAT_CFG_PG_SZ_M_OFFSET 0x002 +#define MOD_STAT_CFG_PG_SZ_OFFSET 0x003 + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + inbox = mailbox->buf; + + memset(inbox, 0, MOD_STAT_CFG_IN_SIZE); + + MLX4_PUT(inbox, cfg->log_pg_sz, MOD_STAT_CFG_PG_SZ_OFFSET); + MLX4_PUT(inbox, cfg->log_pg_sz_m, MOD_STAT_CFG_PG_SZ_M_OFFSET); + + err = mlx4_cmd(dev, mailbox->dma, 0, 0, MLX4_CMD_MOD_STAT_CFG, + MLX4_CMD_TIME_CLASS_A); + + mlx4_free_cmd_mailbox(dev, mailbox); + return err; +} + int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap) { struct mlx4_cmd_mailbox *mailbox; diff --git a/drivers/net/mlx4/fw.h b/drivers/net/mlx4/fw.h index 306cb9b..a0e046c 100644 --- a/drivers/net/mlx4/fw.h +++ b/drivers/net/mlx4/fw.h @@ -38,6 +38,11 @@ #include "mlx4.h" #include "icm.h" +struct mlx4_mod_stat_cfg { + u8 log_pg_sz; + u8 log_pg_sz_m; +}; + struct mlx4_dev_cap { int max_srq_sz; int max_qp_sz; @@ -162,5 +167,6 @@ int mlx4_SET_ICM_SIZE(struct mlx4_dev *dev, u64 icm_size, u64 *aux_pages); int mlx4_MAP_ICM_AUX(struct mlx4_dev *dev, struct mlx4_icm *icm); int mlx4_UNMAP_ICM_AUX(struct mlx4_dev *dev); int mlx4_NOP(struct mlx4_dev *dev); +int mlx4_MOD_STAT_CFG(struct mlx4_dev *dev, struct mlx4_mod_stat_cfg *cfg); #endif /* MLX4_FW_H */ diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index a6aa49f..2a155ee 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -75,6 +75,11 @@ static char mlx4_version[] __devinitdata = DRV_NAME ": Mellanox ConnectX core driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; +static int mlx4_log_pg_sz = 0; +module_param(mlx4_log_pg_sz, int, 0444); +MODULE_PARM_DESC(mlx4_log_pg_sz, + "set FW log system min page size (0 gets native FW min. 
default=0)"); + static struct mlx4_profile default_profile = { .num_qp = 1 << 17, .num_srq = 1 << 16, @@ -485,6 +490,7 @@ static int mlx4_init_hca(struct mlx4_dev *dev) struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_adapter adapter; struct mlx4_dev_cap dev_cap; + struct mlx4_mod_stat_cfg mlx4_cfg; struct mlx4_profile profile; struct mlx4_init_hca_param init_hca; u64 icm_size; @@ -502,6 +508,13 @@ static int mlx4_init_hca(struct mlx4_dev *dev) return err; } + mlx4_cfg.log_pg_sz_m = 1; + mlx4_cfg.log_pg_sz = (u8) mlx4_log_pg_sz; + err = mlx4_MOD_STAT_CFG(dev, &mlx4_cfg); + if (err) + mlx4_warn(dev, "Failed to override log_pg_sz parameter to %d\n", + mlx4_log_pg_sz); + err = mlx4_dev_cap(dev, &dev_cap); if (err) { mlx4_err(dev, "QUERY_DEV_CAP command failed, aborting.\n"); diff --git a/include/linux/mlx4/cmd.h b/include/linux/mlx4/cmd.h index 77323a7..3b563ed 100644 --- a/include/linux/mlx4/cmd.h +++ b/include/linux/mlx4/cmd.h @@ -169,7 +169,7 @@ static inline int mlx4_cmd_imm(struct mlx4_dev *dev, u64 in_param, u64 *out_para u32 in_modifier, u8 op_modifier, u16 op, unsigned long timeout) { - return __mlx4_cmd(dev, in_param, out_param, 1, in_modifier, + return __mlx4_cmd(dev, in_param, out_param, out_param ? 1 : 0, in_modifier, op_modifier, op, timeout); } -- 1.5.5.1 From roland.list at gmail.com Sun May 18 09:04:27 2008 From: roland.list at gmail.com (Roland Dreier) Date: Sun, 18 May 2008 09:04:27 -0700 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: <200805181749.42187.jackm@dev.mellanox.co.il> References: <200805181749.42187.jackm@dev.mellanox.co.il> Message-ID: >> - qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - >> + qp->sq.max_gs = (min(dev->dev->caps.max_sq_desc_sz, >> + (qp->sq_max_wqes_per_wr << qp->sq.wqe_shift)) - >> send_wqe_overhead(type, qp->flags)) / >> sizeof (struct mlx4_wqe_data_seg); > > In this case, sq.max_gs ( = (1008 - wqe overhead) / 16) will be larger than the > "max sge" value returned by ib_query_device, (max_sge returned by ib_query_device is 32). > I'm not crazy about this inconsistency. Please note also that the IB Spec does not > differentiate between Send max_sge, and Receive max_sge, so we're reduced to enforcing > the minimum of the two values. OK, we can clamp the value lower here to the max_sge reported by the driver (but the change I'm making here already only lowers the returned sq.max_gs value, since the value qp->sq_max_wqes_per_wr << qp->sq.wqe_shift will be 1024 in the case in question). But can you point me to the place in the IB spec where it requires all sge limits to be no bigger than the returned max_sge value? - R. From swise at opengridcomputing.com Sun May 18 09:46:21 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 18 May 2008 11:46:21 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <48301F74.4020905@voltaire.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <48301F74.4020905@voltaire.com> Message-ID: <48305D5D.4040401@opengridcomputing.com> Or Gerlitz wrote: > Steve Wise wrote: >> - device-specific alloc/free of physical buffer lists for use in fast >> register work requests. This allows devices to allocate this memory as >> needed (like via dma_alloc_coherent). 
>> > Steve, > > Reading through the suggested API / patches and the previous threads I > was not sure to understand if the HW driver must not assume that it has > the ownership on the page --list-- structure until the registration work > request is completed - or not. > Yes, the driver owns the page list structure until the WR completes (ie is reaped by the consumer via poll_cq()). > Now, if ownership can not be assumed (eg as for the SG list elements > pointed by send/recv WR), the driver has to clone it anyway, and thus I > don't see the need in the ib_alloc/free_fast_reg_page_list verbs. > > If ownership can be assumed, I suggest to have the core use the > implementation of these two verbs as you did that for the Chelsio driver > in case the HW driver did not implement it (i.e instead of returning > ENOSYS). In that case, the alloc_list verb should do DMA mapping FROM > device (I think...) since the device is going to do DMA to read the page > list, and the free_list verb should do DMA unmapping, etc. > Some devices don't need DMA mappings at all (chelsio for instance). The idea of a device-specific method was so the device could allocate a bigger structure to hold its own context info. So a core service that sets up DMA, in my opinion, isn't really useful. Steve. > Or. > From rdreier at cisco.com Sun May 18 14:39:55 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 18 May 2008 14:39:55 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <48301F74.4020905@voltaire.com> (Or Gerlitz's message of "Sun, 18 May 2008 15:22:12 +0300") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <48301F74.4020905@voltaire.com> Message-ID: > If ownership can be assumed, I suggest to have the core use the > implementation of these two verbs as you did that for the Chelsio > driver in case the HW driver did not implement it (i.e instead of > returning ENOSYS). In that case, the alloc_list verb should do DMA > mapping FROM device (I think...) since the device is going to do DMA > to read the page list, and the free_list verb should do DMA unmapping, > etc. Yes, the point of this verb is that the low-level driver owns the page list from when the fast register work request is posted until it completes. This should be explicitly documented somewhere. However the reason for having the low-level driver implement it is so that all strange device-specific issues can be taken care of in the driver. For instance mlx4 is going to require that the page list be aligned to 64 bytes, and will DMA from the memory, so we need to use dma_alloc_consistent(). On the other hand cxgb3 is just going to copy in software, so kmalloc is sufficient. - R. From rdreier at cisco.com Sun May 18 14:42:23 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 18 May 2008 14:42:23 -0700 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: <48302257.2050308@Voltaire.COM> (Moni Shoua's message of "Sun, 18 May 2008 15:34:31 +0300") References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> Message-ID: I asked you to resend with the race fixed. However I guess I never spelled out what race I meant ... I thought I pointed this out, but looking at the archives I don't see it. 
So think about this: What happens if someone calls ib_sa_path_rec_get() and

@@ -663,6 +674,8 @@ int ib_sa_path_rec_get(struct ib_sa_client *client,
 		return -ENODEV;
 
 	port  = &sa_dev->port[port_num - sa_dev->start_port];
+	if (!port->sm_ah)
+		return -EAGAIN;

right about here (*after* the test), ib_sa_event() does
"port->sm_ah = NULL;" on another CPU.

 	agent = port->agent;
 
 	query = kmalloc(sizeof *query, gfp_mask);

From roland.list at gmail.com  Sun May 18 14:45:40 2008
From: roland.list at gmail.com (Roland Dreier)
Date: Sun, 18 May 2008 14:45:40 -0700
Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size
In-Reply-To: <483049BF.4050603@dev.mellanox.co.il>
References: <483049BF.4050603@dev.mellanox.co.il>
Message-ID: 

[Adding list cc]

> +static int mlx4_log_pg_sz = 0;
> +module_param(mlx4_log_pg_sz, int, 0444);
> +MODULE_PARM_DESC(mlx4_log_pg_sz,
> +	"set FW log system min page size (0 gets native FW min. default=0)");

Why do we need this module parameter? When would someone set it to
anything other than 0?

> -	return __mlx4_cmd(dev, in_param, out_param, 1, in_modifier,
> +	return __mlx4_cmd(dev, in_param, out_param, out_param ? 1 : 0, in_modifier,
> 		op_modifier, op, timeout);

I don't see any call to mlx4_cmd_imm in this patch -- why is this change
needed?

 - R.

From jackm at dev.mellanox.co.il  Sun May 18 22:12:38 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Mon, 19 May 2008 08:12:38 +0300
Subject: [ofa-general] mthca max_sge value... ugh.
In-Reply-To: 
References: <200805181749.42187.jackm@dev.mellanox.co.il>
Message-ID: <200805190812.39408.jackm@dev.mellanox.co.il>

On Sunday 18 May 2008 19:04, Roland Dreier wrote:
> But can you point me to the place in the IB spec where it requires all
> sge limits to be no bigger
> than the returned max_sge value?
>
It's not mentioned specifically, but it certainly is strongly implied.

1. Section 11.2.1.2 -- Query HCA:
   • The maximum number of scatter/gather entries per Work Request
     supported by this HCA, for all Work Requests other than Reliable
     Datagram Receive Queue Work Requests.

This certainly implies that the create verb should not return a number
of scatter-gather entries greater than the max_sge value (which is the
max value supported by this HCA). Otherwise, why have a max_sge value
returned by Query HCA? What use would it serve?

Furthermore, how can the HCA return an sge value in Create QP which
exceeds the max_sge value returned by Query HCA? If it does, the sge
value in Create QP should be the one returned in Query HCA!

- Jack

From rdreier at cisco.com  Sun May 18 22:54:07 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 18 May 2008 22:54:07 -0700
Subject: [ofa-general] mthca max_sge value... ugh.
In-Reply-To: <200805190812.39408.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 19 May 2008 08:12:38 +0300")
References: <200805181749.42187.jackm@dev.mellanox.co.il>
	<200805190812.39408.jackm@dev.mellanox.co.il>
Message-ID: 
In other words, if some work requests support 4 s/g entries and others support 8, then query HCA should return 4 as the max s/g entries, since this is the largest number that *all* work requests support (although some support 8). > Otherwise, why have a max_sge value returned by Query HCA? What use > would it serve? It gives an upper bound on what consumers can request in a simple way, without having to have the complexity of per-transport limits for send and receive queues separately. > Furthermore, how can the HCA return an sge value > in Create QP which exceeds the max_sge value returned by Query HCA? If > it does, the sge value in Create QP should be the one returned in > Query HCA! The mlx4 case is a simple example: send work requests support more s/g entries than receive work requests do. So query HCA must return the lower receive work request limit, but I see no reason why create QP can't return the actual limit for send work requests. - R. From jackm at dev.mellanox.co.il Mon May 19 00:07:24 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 19 May 2008 10:07:24 +0300 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: References: <200805190812.39408.jackm@dev.mellanox.co.il> Message-ID: <200805191007.24888.jackm@dev.mellanox.co.il> On Monday 19 May 2008 08:54, Roland Dreier wrote: > The mlx4 case is a simple example: send work requests support more s/g > entries than receive work requests do.  So query HCA must return the > lower receive work request limit, but I see no reason why create QP > can't return the actual limit for send work requests. > Then, we get into the complexity of sanity checking in create_qp (since we should be able to use the value returned by create-qp when calling create-qp, and get the same result). Essentially, we will need to check the requested sge numbers per QP type, whether it is for send or receive, etc. IMHO, this gets nasty very quickly -- creates a problem with support -- users will need a "roadmap" for create-qp. I much prefer to treat the query_hca returned values as absolute maxima, and enforce these limits (although this is at the expense of additional s/g entries for some qp types and send/receive). - Jack From eli at dev.mellanox.co.il Mon May 19 00:40:10 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Mon, 19 May 2008 10:40:10 +0300 Subject: [ofa-general] Re: [PATCH v2] IB/mlx4: Add send with invalidate support In-Reply-To: References: <1210836027.18385.2.camel@mtls03> Message-ID: <1211182810.6515.6.camel@eli-laptop> On Fri, 2008-05-16 at 11:21 -0700, Roland Dreier wrote: > Are we forced to to look at the firmware version, or can we use the bmme > flag that the DEV_CAP firmware command gives us? We are going to have a few capability bits defined for each bmme feature. Once they're defined I'll regenerate the patch and resend. From ogerlitz at voltaire.com Mon May 19 01:04:10 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 19 May 2008 11:04:10 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int><20080516223419.27221.49014.stgit@dell3.ogc.int><48301F74.4020905@voltaire.com> Message-ID: <4831347A.1010506@voltaire.com> Roland Dreier wrote: > > Yes, the point of this verb is that the low-level driver owns the page > list from when the fast register work request is posted until it > completes. This should be explicitly documented somewhere. 
> OK, got it, so this is different case compared to the SG elements which are not owned by the driver once the posting call returns. > > However the reason for having the low-level driver implement it is so > that all strange device-specific issues can be taken care of in the > driver. For instance mlx4 is going to require that the page list be > aligned to 64 bytes, and will DMA from the memory, so we need to use > dma_alloc_consistent(). On the other hand cxgb3 is just going to copy > in software, so kmalloc is sufficient. > I see. Just wondering, in the mlx4 case, is it a must to use dma consistent memory allocation or dma mapping would work too? Or. Or. From olaf.kirch at oracle.com Mon May 19 01:05:59 2008 From: olaf.kirch at oracle.com (Olaf Kirch) Date: Mon, 19 May 2008 10:05:59 +0200 Subject: [ofa-general] RDS flow control In-Reply-To: <200805161638.18067.olaf.kirch@oracle.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805141516.01908.okir@lst.de> <200805161638.18067.olaf.kirch@oracle.com> Message-ID: <200805191006.00114.olaf.kirch@oracle.com> > However, I'm still seeing performance degradation of ~5% with some packet > sizes. And that is *just* the overhead from exchanging the credit information > and checking it - at some point we need to take a spinlock, and that seems > to delay things just enough to make a dent in my throughput graph. Here's an updated version of the flow control patch - which is now completely lockless, and uses a single atomic_t to hold both credit counters. This has given me back close to full performance in my testing (throughput seems to be down less than 1%, which is almost within the noise range). I'll push it to my git tree a little later today, so folks can test it if they like. Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ---- From: Olaf Kirch Subject: RDS: Implement IB flow control Here it is - flow control for RDS/IB. This patch is still very much experimental. Here's the essentials - The approach chosen here uses a credit-based flow control mechanism. Every SEND WR (including ACKs) consumes one credit, and if the sender runs out of credits, it stalls. - As new receive buffers are posted, credits are transferred to the remote node (using yet another RDS header byte for this). - Flow control is negotiated during connection setup. Initial credits are exchanged in the rds_ib_connect_private sruct - sending a value of zero (which is also the default for older protocol versions) means no flow control. - We avoid deadlock (both nodes depleting their credits, and being unable to inform the peer of newly posted buffers) by requiring that the last credit can only be used if we're posting new credits to the peer. The approach implemented here is lock-free; preliminary tests show the impact on throughput to be less than 1%, and the impact on RTT, CPU, TX delay and other metrics to be below the noise threshold. Flow control is configurable via sysctl. It only affects newly created connections however - so your best bet is to set this right after loading the RDS module. 
Signed-off-by: Olaf Kirch --- net/rds/ib.c | 1 net/rds/ib.h | 30 ++++++++ net/rds/ib_cm.c | 49 ++++++++++++- net/rds/ib_recv.c | 48 +++++++++--- net/rds/ib_send.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++++ net/rds/ib_stats.c | 3 net/rds/ib_sysctl.c | 10 ++ net/rds/rds.h | 4 - 8 files changed, 325 insertions(+), 14 deletions(-) Index: ofa_kernel-1.3/net/rds/ib.h =================================================================== --- ofa_kernel-1.3.orig/net/rds/ib.h +++ ofa_kernel-1.3/net/rds/ib.h @@ -46,6 +46,7 @@ struct rds_ib_connect_private { __be16 dp_protocol_minor_mask; /* bitmask */ __be32 dp_reserved1; __be64 dp_ack_seq; + __be32 dp_credit; /* non-zero enables flow ctl */ }; struct rds_ib_send_work { @@ -110,15 +111,32 @@ struct rds_ib_connection { struct ib_sge i_ack_sge; u64 i_ack_dma; unsigned long i_ack_queued; + + /* Flow control related information + * + * Our algorithm uses a pair of variables that we need to access + * atomically - one for the send credits, and one for the posted + * recv credits we need to transfer to the remote. + * Rather than protect them using a slow spinlock, we put both into + * a single atomic_t and update it using cmpxchg + */ + atomic_t i_credits; /* Protocol version specific information */ unsigned int i_hdr_idx; /* 1 (old) or 0 (3.1 or later) */ + unsigned int i_flowctl : 1; /* enable/disable flow ctl */ /* Batched completions */ unsigned int i_unsignaled_wrs; long i_unsignaled_bytes; }; +/* This assumes that atomic_t is at least 32 bits */ +#define IB_GET_SEND_CREDITS(v) ((v) & 0xffff) +#define IB_GET_POST_CREDITS(v) ((v) >> 16) +#define IB_SET_SEND_CREDITS(v) ((v) & 0xffff) +#define IB_SET_POST_CREDITS(v) ((v) << 16) + struct rds_ib_ipaddr { struct list_head list; __be32 ipaddr; @@ -153,14 +171,17 @@ struct rds_ib_statistics { unsigned long s_ib_tx_cq_call; unsigned long s_ib_tx_cq_event; unsigned long s_ib_tx_ring_full; + unsigned long s_ib_tx_throttle; unsigned long s_ib_tx_sg_mapping_failure; unsigned long s_ib_tx_stalled; + unsigned long s_ib_tx_credit_updates; unsigned long s_ib_rx_cq_call; unsigned long s_ib_rx_cq_event; unsigned long s_ib_rx_ring_empty; unsigned long s_ib_rx_refill_from_cq; unsigned long s_ib_rx_refill_from_thread; unsigned long s_ib_rx_alloc_limit; + unsigned long s_ib_rx_credit_updates; unsigned long s_ib_ack_sent; unsigned long s_ib_ack_send_failure; unsigned long s_ib_ack_send_delayed; @@ -244,6 +265,8 @@ void rds_ib_flush_mrs(void); int __init rds_ib_recv_init(void); void rds_ib_recv_exit(void); int rds_ib_recv(struct rds_connection *conn); +int rds_ib_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp, + gfp_t page_gfp, int prefill); void rds_ib_inc_purge(struct rds_incoming *inc); void rds_ib_inc_free(struct rds_incoming *inc); int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iovec *iov, @@ -252,6 +275,7 @@ void rds_ib_recv_cq_comp_handler(struct void rds_ib_recv_init_ring(struct rds_ib_connection *ic); void rds_ib_recv_clear_ring(struct rds_ib_connection *ic); void rds_ib_recv_init_ack(struct rds_ib_connection *ic); +void rds_ib_attempt_ack(struct rds_ib_connection *ic); void rds_ib_ack_send_complete(struct rds_ib_connection *ic); u64 rds_ib_piggyb_ack(struct rds_ib_connection *ic); @@ -266,12 +290,17 @@ u32 rds_ib_ring_completed(struct rds_ib_ extern wait_queue_head_t rds_ib_ring_empty_wait; /* ib_send.c */ +void rds_ib_xmit_complete(struct rds_connection *conn); int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm, unsigned int hdr_off, unsigned int sg, unsigned int off); void
rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context); void rds_ib_send_init_ring(struct rds_ib_connection *ic); void rds_ib_send_clear_ring(struct rds_ib_connection *ic); int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op); +void rds_ib_send_add_credits(struct rds_connection *conn, unsigned int credits); +void rds_ib_advertise_credits(struct rds_connection *conn, unsigned int posted); +int rds_ib_send_grab_credits(struct rds_ib_connection *ic, u32 wanted, + u32 *adv_credits); /* ib_stats.c */ RDS_DECLARE_PER_CPU(struct rds_ib_statistics, rds_ib_stats); @@ -287,6 +316,7 @@ extern unsigned long rds_ib_sysctl_max_r extern unsigned long rds_ib_sysctl_max_unsig_wrs; extern unsigned long rds_ib_sysctl_max_unsig_bytes; extern unsigned long rds_ib_sysctl_max_recv_allocation; +extern unsigned int rds_ib_sysctl_flow_control; extern ctl_table rds_ib_sysctl_table[]; /* Index: ofa_kernel-1.3/net/rds/ib_cm.c =================================================================== --- ofa_kernel-1.3.orig/net/rds/ib_cm.c +++ ofa_kernel-1.3/net/rds/ib_cm.c @@ -55,6 +55,22 @@ static void rds_ib_set_protocol(struct r } /* + * Set up flow control + */ +static void rds_ib_set_flow_control(struct rds_connection *conn, u32 credits) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + + if (rds_ib_sysctl_flow_control && credits != 0) { + /* We're doing flow control */ + ic->i_flowctl = 1; + rds_ib_send_add_credits(conn, credits); + } else { + ic->i_flowctl = 0; + } +} + +/* * Connection established. * We get here for both outgoing and incoming connection. */ @@ -72,12 +88,16 @@ static void rds_ib_connect_complete(stru rds_ib_set_protocol(conn, RDS_PROTOCOL(dp->dp_protocol_major, dp->dp_protocol_minor)); + rds_ib_set_flow_control(conn, be32_to_cpu(dp->dp_credit)); } - rdsdebug("RDS/IB: ib conn complete on %u.%u.%u.%u version %u.%u\n", + printk(KERN_NOTICE "RDS/IB: connected to %u.%u.%u.%u version %u.%u%s\n", NIPQUAD(conn->c_laddr), RDS_PROTOCOL_MAJOR(conn->c_version), - RDS_PROTOCOL_MINOR(conn->c_version)); + RDS_PROTOCOL_MINOR(conn->c_version), + ic->i_flowctl? ", flow control" : ""); + + rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 1); /* Tune the RNR timeout. We use a rather low timeout, but * not the absolute minimum - this should be tunable. @@ -129,6 +149,24 @@ static void rds_ib_cm_fill_conn_param(st dp->dp_protocol_minor_mask = cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS); dp->dp_ack_seq = rds_ib_piggyb_ack(ic); + /* Advertise flow control. + * + * Major chicken and egg alert! + * We would like to post receive buffers before we get here (eg. + * in rds_ib_setup_qp), so that we can give the peer an accurate + * credit value. + * Unfortunately we can't post receive buffers until we've finished + * protocol negotiation, and know in which order data and payload + * are arranged. + * + * What we do here is we give the peer a small initial credit, and + * initialize the number of posted buffers to a negative value. + */ + if (ic->i_flowctl) { + atomic_set(&ic->i_credits, IB_SET_POST_CREDITS(-4)); + dp->dp_credit = cpu_to_be32(4); + } + conn_param->private_data = dp; conn_param->private_data_len = sizeof(*dp); } @@ -363,6 +401,7 @@ static int rds_ib_cm_handle_connect(stru ic = conn->c_transport_data; rds_ib_set_protocol(conn, version); + rds_ib_set_flow_control(conn, be32_to_cpu(dp->dp_credit)); /* If the peer gave us the last packet it saw, process this as if * we had received a regular ACK. 
*/ @@ -428,6 +467,7 @@ out: static int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id) { struct rds_connection *conn = cm_id->context; + struct rds_ib_connection *ic = conn->c_transport_data; struct rdma_conn_param conn_param; struct rds_ib_connect_private dp; int ret; @@ -435,6 +475,7 @@ static int rds_ib_cm_initiate_connect(st /* If the peer doesn't do protocol negotiation, we must * default to RDSv3.0 */ rds_ib_set_protocol(conn, RDS_PROTOCOL_3_0); + ic->i_flowctl = rds_ib_sysctl_flow_control; /* advertise flow control */ ret = rds_ib_setup_qp(conn); if (ret) { @@ -688,6 +729,10 @@ void rds_ib_conn_shutdown(struct rds_con #endif ic->i_ack_recv = 0; + /* Clear flow control state */ + ic->i_flowctl = 0; + atomic_set(&ic->i_credits, 0); + if (ic->i_ibinc) { rds_inc_put(&ic->i_ibinc->ii_inc); ic->i_ibinc = NULL; Index: ofa_kernel-1.3/net/rds/ib_recv.c =================================================================== --- ofa_kernel-1.3.orig/net/rds/ib_recv.c +++ ofa_kernel-1.3/net/rds/ib_recv.c @@ -220,16 +220,17 @@ out: * -1 is returned if posting fails due to temporary resource exhaustion. */ int rds_ib_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp, - gfp_t page_gfp) + gfp_t page_gfp, int prefill) { struct rds_ib_connection *ic = conn->c_transport_data; struct rds_ib_recv_work *recv; struct ib_recv_wr *failed_wr; + unsigned int posted = 0; int ret = 0; u32 pos; - while (rds_conn_up(conn) && rds_ib_ring_alloc(&ic->i_recv_ring, 1, &pos)) { - + while ((prefill || rds_conn_up(conn)) + && rds_ib_ring_alloc(&ic->i_recv_ring, 1, &pos)) { if (pos >= ic->i_recv_ring.w_nr) { printk(KERN_NOTICE "Argh - ring alloc returned pos=%u\n", pos); @@ -257,8 +258,14 @@ int rds_ib_recv_refill(struct rds_connec ret = -1; break; } + + posted++; } + /* We're doing flow control - update the window. */ + if (ic->i_flowctl && posted) + rds_ib_advertise_credits(conn, posted); + if (ret) rds_ib_ring_unalloc(&ic->i_recv_ring, 1); return ret; @@ -436,7 +443,7 @@ static u64 rds_ib_get_ack(struct rds_ib_ #endif -static void rds_ib_send_ack(struct rds_ib_connection *ic) +static void rds_ib_send_ack(struct rds_ib_connection *ic, unsigned int adv_credits) { struct rds_header *hdr = ic->i_ack; struct ib_send_wr *failed_wr; @@ -448,6 +455,7 @@ static void rds_ib_send_ack(struct rds_i rdsdebug("send_ack: ic %p ack %llu\n", ic, (unsigned long long) seq); rds_message_populate_header(hdr, 0, 0, 0); hdr->h_ack = cpu_to_be64(seq); + hdr->h_credit = adv_credits; rds_message_make_checksum(hdr); ic->i_ack_queued = jiffies; @@ -460,6 +468,8 @@ static void rds_ib_send_ack(struct rds_i set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); rds_ib_stats_inc(s_ib_ack_send_failure); + /* Need to finesse this later. */ + BUG(); } else rds_ib_stats_inc(s_ib_ack_sent); } @@ -502,15 +512,27 @@ static void rds_ib_send_ack(struct rds_i * When we get here, we're called from the recv queue handler. * Check whether we ought to transmit an ACK. */ -static void rds_ib_attempt_ack(struct rds_ib_connection *ic) +void rds_ib_attempt_ack(struct rds_ib_connection *ic) { + unsigned int adv_credits; + if (!test_bit(IB_ACK_REQUESTED, &ic->i_ack_flags)) return; - if (!test_and_set_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags)) { - clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); - rds_ib_send_ack(ic); - } else + + if (test_and_set_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags)) { rds_ib_stats_inc(s_ib_ack_send_delayed); + return; + } + + /* Can we get a send credit? 
*/ + if (!rds_ib_send_grab_credits(ic, 1, &adv_credits)) { + rds_ib_stats_inc(s_ib_tx_throttle); + clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags); + return; + } + + clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); + rds_ib_send_ack(ic, adv_credits); } /* @@ -706,6 +728,10 @@ void rds_ib_process_recv(struct rds_conn state->ack_recv = be64_to_cpu(ihdr->h_ack); state->ack_recv_valid = 1; + /* Process the credits update if there was one */ + if (ihdr->h_credit) + rds_ib_send_add_credits(conn, ihdr->h_credit); + if (ihdr->h_sport == 0 && ihdr->h_dport == 0 && byte_len == 0) { /* This is an ACK-only packet. The fact that it gets * special treatment here is that historically, ACKs @@ -877,7 +903,7 @@ void rds_ib_recv_cq_comp_handler(struct if (mutex_trylock(&ic->i_recv_mutex)) { if (rds_ib_recv_refill(conn, GFP_ATOMIC, - GFP_ATOMIC | __GFP_HIGHMEM)) + GFP_ATOMIC | __GFP_HIGHMEM, 0)) ret = -EAGAIN; else rds_ib_stats_inc(s_ib_rx_refill_from_cq); @@ -901,7 +927,7 @@ int rds_ib_recv(struct rds_connection *c * we're really low and we want the caller to back off for a bit. */ mutex_lock(&ic->i_recv_mutex); - if (rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER)) + if (rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 0)) ret = -ENOMEM; else rds_ib_stats_inc(s_ib_rx_refill_from_thread); Index: ofa_kernel-1.3/net/rds/ib.c =================================================================== --- ofa_kernel-1.3.orig/net/rds/ib.c +++ ofa_kernel-1.3/net/rds/ib.c @@ -187,6 +187,7 @@ static void rds_ib_exit(void) struct rds_transport rds_ib_transport = { .laddr_check = rds_ib_laddr_check, + .xmit_complete = rds_ib_xmit_complete, .xmit = rds_ib_xmit, .xmit_cong_map = NULL, .xmit_rdma = rds_ib_xmit_rdma, Index: ofa_kernel-1.3/net/rds/ib_send.c =================================================================== --- ofa_kernel-1.3.orig/net/rds/ib_send.c +++ ofa_kernel-1.3/net/rds/ib_send.c @@ -245,6 +245,144 @@ void rds_ib_send_cq_comp_handler(struct } } +/* + * This is the main function for allocating credits when sending + * messages. + * + * Conceptually, we have two counters: + * - send credits: this tells us how many WRs we're allowed + * to submit without overrunning the receiver's queue. For + * each SEND WR we post, we decrement this by one. + * + * - posted credits: this tells us how many WRs we recently + * posted to the receive queue. This value is transferred + * to the peer as a "credit update" in an RDS header field. + * Every time we transmit credits to the peer, we subtract + * the amount of transferred credits from this counter. + * + * It is essential that we avoid situations where both sides have + * exhausted their send credits, and are unable to send new credits + * to the peer. We achieve this by requiring that we send at least + * one credit update to the peer before exhausting our credits. + * When new credits arrive, we subtract one credit that is withheld + * until we've posted new buffers and are ready to transmit these + * credits (see rds_ib_send_add_credits below). + * + * The RDS send code is essentially single-threaded; rds_send_xmit + * grabs c_send_sem to ensure exclusive access to the send ring. + * However, the ACK sending code is independent and can race with + * message SENDs. + * + * In the send path, we need to update the counters for send credits + * and the counter of posted buffers atomically - when we use the + * last available credit, we cannot allow another thread to race us + * and grab the posted credits counter.
Hence, we have to use a + * spinlock to protect the credit counter, or use atomics. + * + * Spinlocks shared between the send and the receive path are bad, + * because they create unnecessary delays. An early implementation + * using a spinlock showed a 5% degradation in throughput at some + * loads. + * + * This implementation avoids spinlocks completely, putting both + * counters into a single atomic, and updating that atomic using + * atomic_add (in the receive path, when receiving fresh credits), + * and using atomic_cmpxchg when updating the two counters. + */ +int rds_ib_send_grab_credits(struct rds_ib_connection *ic, + u32 wanted, u32 *adv_credits) +{ + unsigned int avail, posted, got = 0, advertise; + long oldval, newval; + + *adv_credits = 0; + if (!ic->i_flowctl) + return wanted; + +try_again: + advertise = 0; + oldval = newval = atomic_read(&ic->i_credits); + posted = IB_GET_POST_CREDITS(oldval); + avail = IB_GET_SEND_CREDITS(oldval); + + rdsdebug("rds_ib_send_grab_credits(%u): credits=%u posted=%u\n", + wanted, avail, posted); + + /* The last credit must be used to send a credit update. */ + if (avail && !posted) + avail--; + + if (avail < wanted) { + struct rds_connection *conn = ic->i_cm_id->context; + + /* Oops, there aren't that many credits left! */ + set_bit(RDS_LL_SEND_FULL, &conn->c_flags); + got = avail; + } else { + /* Sometimes you get what you want, lalala. */ + got = wanted; + } + newval -= IB_SET_SEND_CREDITS(got); + + if (got && posted) { + advertise = min_t(unsigned int, posted, RDS_MAX_ADV_CREDIT); + newval -= IB_SET_POST_CREDITS(advertise); + } + + /* Finally bill everything */ + if (atomic_cmpxchg(&ic->i_credits, oldval, newval) != oldval) + goto try_again; + + *adv_credits = advertise; + return got; +} + +void rds_ib_send_add_credits(struct rds_connection *conn, unsigned int credits) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + + if (credits == 0) + return; + + rdsdebug("rds_ib_send_add_credits(%u): current=%u%s\n", + credits, + IB_GET_SEND_CREDITS(atomic_read(&ic->i_credits)), + test_bit(RDS_LL_SEND_FULL, &conn->c_flags)? ", ll_send_full" : ""); + + atomic_add(IB_SET_SEND_CREDITS(credits), &ic->i_credits); + if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags)) + queue_delayed_work(rds_wq, &conn->c_send_w, 0); + + WARN_ON(IB_GET_SEND_CREDITS(credits) >= 16384); + + rds_ib_stats_inc(s_ib_rx_credit_updates); +} + +void rds_ib_advertise_credits(struct rds_connection *conn, unsigned int posted) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + + if (posted == 0) + return; + + atomic_add(IB_SET_POST_CREDITS(posted), &ic->i_credits); + + /* Decide whether to send an update to the peer now. + * If we would send a credit update for every single buffer we + * post, we would end up with an ACK storm (ACK arrives, + * consumes buffer, we refill the ring, send ACK to remote + * advertising the newly posted buffer... ad inf) + * + * Performance pretty much depends on how often we send + * credit updates - too frequent updates mean lots of ACKs. + * Too infrequent updates, and the peer will run out of + * credits and have to throttle. + * For the time being, 16 seems to be a good compromise.
+ */ + if (IB_GET_POST_CREDITS(atomic_read(&ic->i_credits)) >= 16) + set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); +} + static inline void rds_ib_xmit_populate_wr(struct rds_ib_connection *ic, struct rds_ib_send_work *send, unsigned int pos, @@ -307,6 +445,8 @@ int rds_ib_xmit(struct rds_connection *c u32 pos; u32 i; u32 work_alloc; + u32 credit_alloc; + u32 adv_credits = 0; int send_flags = 0; int sent; int ret; @@ -314,6 +454,7 @@ int rds_ib_xmit(struct rds_connection *c BUG_ON(off % RDS_FRAG_SIZE); BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header)); + /* FIXME we may overallocate here */ if (be32_to_cpu(rm->m_inc.i_hdr.h_len) == 0) i = 1; else @@ -327,8 +468,29 @@ int rds_ib_xmit(struct rds_connection *c goto out; } + credit_alloc = work_alloc; + if (ic->i_flowctl) { + credit_alloc = rds_ib_send_grab_credits(ic, work_alloc, &adv_credits); + if (credit_alloc < work_alloc) { + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - credit_alloc); + work_alloc = credit_alloc; + } + if (work_alloc == 0) { + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc); + rds_ib_stats_inc(s_ib_tx_throttle); + ret = -ENOMEM; + goto out; + } + } + /* map the message the first time we see it */ if (ic->i_rm == NULL) { + /* + printk(KERN_NOTICE "rds_ib_xmit prep msg dport=%u flags=0x%x len=%d\n", + be16_to_cpu(rm->m_inc.i_hdr.h_dport), + rm->m_inc.i_hdr.h_flags, + be32_to_cpu(rm->m_inc.i_hdr.h_len)); + */ if (rm->m_nents) { rm->m_count = ib_dma_map_sg(dev, rm->m_sg, rm->m_nents, DMA_TO_DEVICE); @@ -449,6 +611,24 @@ add_header: * have been set up to point to the right header buffer. */ memcpy(&ic->i_send_hdrs[pos], &rm->m_inc.i_hdr, sizeof(struct rds_header)); + if (0) { + struct rds_header *hdr = &ic->i_send_hdrs[pos]; + + printk(KERN_NOTICE "send WR dport=%u flags=0x%x len=%d\n", + be16_to_cpu(hdr->h_dport), + hdr->h_flags, + be32_to_cpu(hdr->h_len)); + } + if (adv_credits) { + struct rds_header *hdr = &ic->i_send_hdrs[pos]; + + /* add credit and redo the header checksum */ + hdr->h_credit = adv_credits; + rds_message_make_checksum(hdr); + adv_credits = 0; + rds_ib_stats_inc(s_ib_tx_credit_updates); + } + if (prev) prev->s_wr.next = &send->s_wr; prev = send; @@ -472,6 +652,8 @@ add_header: rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - i); work_alloc = i; } + if (ic->i_flowctl && i < credit_alloc) + rds_ib_send_add_credits(conn, credit_alloc - i); /* XXX need to worry about failed_wr and partial sends. */ failed_wr = &first->s_wr; @@ -487,11 +669,14 @@ add_header: ic->i_rm = prev->s_rm; prev->s_rm = NULL; } + /* Finesse this later */ + BUG(); goto out; } ret = sent; out: + BUG_ON(adv_credits); return ret; } @@ -630,3 +815,12 @@ int rds_ib_xmit_rdma(struct rds_connecti out: return ret; } + +void rds_ib_xmit_complete(struct rds_connection *conn) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + + /* We may have a pending ACK or window update we were unable + * to send previously (due to flow control). Try again. 
+ */ + rds_ib_attempt_ack(ic); +} Index: ofa_kernel-1.3/net/rds/ib_stats.c =================================================================== --- ofa_kernel-1.3.orig/net/rds/ib_stats.c +++ ofa_kernel-1.3/net/rds/ib_stats.c @@ -46,14 +46,17 @@ static char *rds_ib_stat_names[] = { "ib_tx_cq_call", "ib_tx_cq_event", "ib_tx_ring_full", + "ib_tx_throttle", "ib_tx_sg_mapping_failure", "ib_tx_stalled", + "ib_tx_credit_updates", "ib_rx_cq_call", "ib_rx_cq_event", "ib_rx_ring_empty", "ib_rx_refill_from_cq", "ib_rx_refill_from_thread", "ib_rx_alloc_limit", + "ib_rx_credit_updates", "ib_ack_sent", "ib_ack_send_failure", "ib_ack_send_delayed", Index: ofa_kernel-1.3/net/rds/rds.h =================================================================== --- ofa_kernel-1.3.orig/net/rds/rds.h +++ ofa_kernel-1.3/net/rds/rds.h @@ -170,6 +170,7 @@ struct rds_connection { #define RDS_FLAG_CONG_BITMAP 0x01 #define RDS_FLAG_ACK_REQUIRED 0x02 #define RDS_FLAG_RETRANSMITTED 0x04 +#define RDS_MAX_ADV_CREDIT 255 /* * Maximum space available for extension headers. @@ -183,7 +184,8 @@ struct rds_header { __be16 h_sport; __be16 h_dport; u8 h_flags; - u8 h_padding[5]; + u8 h_credit; + u8 h_padding[4]; __sum16 h_csum; u8 h_exthdr[RDS_HEADER_EXT_SPACE]; Index: ofa_kernel-1.3/net/rds/ib_sysctl.c =================================================================== --- ofa_kernel-1.3.orig/net/rds/ib_sysctl.c +++ ofa_kernel-1.3/net/rds/ib_sysctl.c @@ -53,6 +53,8 @@ unsigned long rds_ib_sysctl_max_unsig_by static unsigned long rds_ib_sysctl_max_unsig_bytes_min = 1; static unsigned long rds_ib_sysctl_max_unsig_bytes_max = ~0UL; +unsigned int rds_ib_sysctl_flow_control = 1; + ctl_table rds_ib_sysctl_table[] = { { .ctl_name = 1, @@ -102,6 +104,14 @@ ctl_table rds_ib_sysctl_table[] = { .mode = 0644, .proc_handler = &proc_doulongvec_minmax, }, + { + .ctl_name = 6, + .procname = "flow_control", + .data = &rds_ib_sysctl_flow_control, + .maxlen = sizeof(rds_ib_sysctl_flow_control), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, { .ctl_name = 0} }; From ogerlitz at voltaire.com Mon May 19 01:26:29 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 19 May 2008 11:26:29 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <20080516223419.27221.49014.stgit@dell3.ogc.int> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> Message-ID: <483139B5.8040908@voltaire.com> Steve Wise wrote: > Support for the IB BMME and iWARP equivalent memory extensions to > non shared memory regions. Usage Model: > > - MR allocated with ib_alloc_mr() > - Page lists allocated via ib_alloc_fast_reg_page_list(). > - MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) > - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) > - MR deallocated with ib_dereg_mr() > - page lists dealloced via ib_free_fast_reg_page_list(). Steve, Does this design go hand-in-hand with remote invalidation? Such that if the remote side invalidated the mapping, there is no need to issue the IB_WR_INVALIDATE_MR work request. Also, does the proposed design support fmr pages of granularity different from the OS page size? For example, the OS pages are 4K and the ULP wants to use fmrs of 512-byte "pages" (the "block lists" feature), etc. In that case, doesn't the size of each page have to be specified as a param to the alloc_fast_reg_mr() verb?
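To make the usage model quoted above concrete, here is a rough sketch of posting the two work requests (the wr.fast_reg/wr.local_inv field names follow the ib_verbs.h hunk quoted below; the surrounding variables, and treating page_size as a byte count, are assumptions about the RFC rather than code from it):

struct ib_send_wr wr, *bad_wr;
int ret;

/* Bind the MR to a physical page list - this is what makes it VALID */
memset(&wr, 0, sizeof wr);
wr.opcode = IB_WR_FAST_REG_MR;
wr.send_flags = IB_SEND_SIGNALED;
wr.wr.fast_reg.mr = mr;			/* from ib_alloc_mr() */
wr.wr.fast_reg.page_list = page_list;	/* from ib_alloc_fast_reg_page_list() */
wr.wr.fast_reg.page_list_len = npages;
wr.wr.fast_reg.page_size = PAGE_SIZE;
wr.wr.fast_reg.iova_start = io_addr;
wr.wr.fast_reg.first_byte_offset = 0;
wr.wr.fast_reg.length = len;
wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE |
			      IB_ACCESS_REMOTE_READ |
			      IB_ACCESS_REMOTE_WRITE;
ret = ib_post_send(qp, &wr, &bad_wr);

/* ... later, once the remote I/O is done, make the MR INVALID again */
memset(&wr, 0, sizeof wr);
wr.opcode = IB_WR_INVALIDATE_MR;
wr.wr.local_inv.mr = mr;
ret = ib_post_send(qp, &wr, &bad_wr);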
> > Applications can allocate a fast_reg mr once, and then can repeatedly > bind the mr to different physical memory SGLs via posting work requests > to the send queue. For each outstanding mr-to-pbl binding in the SQ > pipe, a fast_reg_page_list needs to be allocated. Thus pipelining can > be achieved while still allowing device-specific page_list processing. mmm, is it a must for the ULP to issue a page list alloc/free per IB_WR_FAST_REG_MR call? > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -676,6 +683,20 @@ struct ib_send_wr { > u16 pkey_index; /* valid for GSI only */ > u8 port_num; /* valid for DR SMPs on switch only */ > } ud; > + struct { > + u64 iova_start; > + struct ib_mr *mr; > + struct ib_fast_reg_page_list *page_list; > + unsigned int page_size; > + unsigned int page_list_len; > + unsigned int first_byte_offset; > + u32 length; > + int access_flags; > + > + } fast_reg; > + struct { > + struct ib_mr *mr; > + } local_inv; > } wr; > }; I suggest using a "page_shift" notation and not "page_size", to comply with the kernel semantics of other APIs. Or. From Sumit.Gaur at Sun.COM Mon May 19 02:55:00 2008 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Mon, 19 May 2008 15:25:00 +0530 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <20080513185404.3D00BE60C16@openfabrics.org> References: <20080513185404.3D00BE60C16@openfabrics.org> Message-ID: <48314E74.9010107@Sun.COM> Hi I have an issue while my program is interacting with the OFED umad library. I have two separate threads, one for sending SMP and GMP packets and another to receive responses. Things are working fine, but during the whole process I keep receiving packets with an unknown tid apart from the correct responses. Is this correct behavior? If yes, how can I avoid them? Thanks and Regards sumit general-request at lists.openfabrics.org wrote: > Send general mailing list submissions to > general at lists.openfabrics.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > or, via email, send a message with subject or body 'help' to > general-request at lists.openfabrics.org > > You can reach the person managing the list at > general-owner at lists.openfabrics.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of general digest..." > > > Today's Topics: > > 1. Re: [PATCH] IB/core: handle race between elements in qork > queues after event (Roland Dreier) > 2. Re: RDS flow control (Steve Wise) > 3. Re: RDS flow control (Olaf Kirch) > 4. Re: RDS flow control (Steve Wise) > 5. Re: RDS flow control (Olaf Kirch) > 6. Re: [PATCH 3/3] IB/ipath - fix RDMA read response sequence > checking (Roland Dreier) > 7. Re: [PATCH][INFINIBAND]: Make ipath_portdata work with > struct pid * not pid_t. (Roland Dreier) > 8. Re: bitops take an unsigned long * (Roland Dreier) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Tue, 13 May 2008 10:41:39 -0700 > From: Roland Dreier > Subject: Re: [ofa-general] [PATCH] IB/core: handle race between > elements in qork queues after event > To: Moni Shoua > Cc: Olga Stern , OpenFabrics General > > Message-ID: > Content-Type: text/plain; charset=us-ascii > > > Can we please go on with this patch? We would like to see it in the next kernel. > > I still don't get why this is important to you. Is there a concrete > example of a situation where this actually makes a measurable difference?
> > We need some justification for adding this locking complexity beyond "it > doesn't hurt." (And also of course we need it fixed so there aren't races) > > - R. > > > ------------------------------ > > Message: 2 > Date: Tue, 13 May 2008 12:58:11 -0500 > From: Steve Wise > Subject: Re: [ofa-general] RDS flow control > To: Richard Frank > Cc: rds-devel at oss.oracle.com, general at lists.openfabrics.org > Message-ID: <4829D6B3.5080900 at opengridcomputing.com> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > Richard Frank wrote: > >>Steve Wise wrote: >> >>>Olaf Kirch wrote: >>> >>>>On Monday 12 May 2008 18:57:38 Jon Mason wrote: >>>> >>>> >>>>>As part of my effort to get RDS working for iWARP, I will be >>>>>working on the RDS flow control. Flow control is needed for iWARP >>>>>due to the fact that iWARP connections terminate if there is no >>>>>posted recv for an incoming packet. IB connections do not have >>>>>this limitation if setup in a certain way. In its current >>>>>implementation, RDS sets the connection attribute rnr_retry to 7. >>>>>This causes IB to retransmit until there is a posted recv buffer. >>>> >>>>I think for the initial implementation, it is fine for iWARP to just >>>>fail the connect when that happens, and re-establish the connection. >>>> >>>>If you use reasonable defaults for the send and recv queues, receiver >>>>overruns should be relatively rare. >>>> >>>>Once everything else works, let's revisit the flow control part. >>>> >>>> >>> >>>I _think_ you'll hit this quickly with one-way flows. Send >>>completions for iWARP only mean the user's buffer can be reused. Not >>>that its placed at the remote peer or in the remote user's buffer. >>> >> >>Let's see what happens - anyway - this could be solved in an IWARP >>extension to RDS - right ? > > > > Yes, by adding flow control. And it could be iwarp-specific if you > want. I would not suggest relying on connection termination and > re-establishment as the way to handle this :). > > > > >>>But perhaps I'm wrong. Jon, maybe you should try to hit this with IB >>>and rnr_retry == 0 using the rds perf tools? >>>Also "the everything else" part depends on remove fmr usage. I'm >>>working on the new RDMA memory verbs allowing fast registration of >>>physical memory via a send WR. To support iWARP we need to remove >>>the fmr usage from RDS. The idea was to replace fmrs with the new >>>fastreg verbs. Thoughts? >>> >> >>What does "fast" imply here - how does this compare to the performance >>of FMRs ? > > > > Don't know yet, but probably as fast. > > >>Why would not push memory window creation into the RDS transport >>specific implementations ? > > > Isn't it already transport-specific? IE you don't need FMRs for TCP. > (I'm ignorant on the specifics of the implementation at this point, so > please excuse any dumb statements :) > > > >>Changing the API may be OK - if we retain the performance we have with >>IB. > > > > I assume nothing would fly that regresses IB performance. Worst case, > you have an iwarp-specific RDS transport like you do for TCP, I guess. > Hopefully though, IB + iWARP will be a common transport. > > > >>>Stay tuned for the new verbs API RFC... >>> >>>Steve. 
>>>_______________________________________________ >>>general mailing list >>>general at lists.openfabrics.org >>>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>>To unsubscribe, please visit >>>http://openib.org/mailman/listinfo/openib-general > > > > > ------------------------------ > > Message: 3 > Date: Tue, 13 May 2008 20:04:00 +0200 > From: Olaf Kirch > Subject: Re: [ofa-general] RDS flow control > To: Steve Wise > Cc: rds-devel at oss.oracle.com, general at lists.openfabrics.org > Message-ID: <200805132004.01371.okir at lst.de> > Content-Type: text/plain; charset="iso-8859-1" > > On Tuesday 13 May 2008 19:58:11 Steve Wise wrote: > >>Yes, by adding flow control. And it could be iwarp-specific if you >>want. I would not suggest relying on connection termination and >>re-establishment as the way to handle this :). > > No, not in the long term. But let's hold off on the flow control stuff > for a little - I would first like to finish my patch set and hand it > out for you folks to bang on it, rather than the other way round. > Okay with you guys? > >>I assume nothing would fly that regresses IB performance. Worst case, >>you have an iwarp-specific RDS transport like you do for TCP, I guess. >>Hopefully though, IB + iWARP will be a common transport. > > If it turns out that way, fine. If iWARP ends up sharing 80% of the > code with IB except the RDMA specific functions, I think that's > very much acceptable, too. > > Olaf From ruimario at gmail.com Mon May 19 03:49:43 2008 From: ruimario at gmail.com (Rui Machado) Date: Mon, 19 May 2008 12:49:43 +0200 Subject: [ofa-general] timeout question In-Reply-To: <482DE8E3.4090200@gmail.com> References: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> <6978b4af0805161140t7b12532brc3709a2aec6154df@mail.gmail.com> <482E3AEE.4070603@gmail.com> <6978b4af0805161201r4a113a1cpd98cbb1e81cf175a@mail.gmail.com> <482DE8E3.4090200@gmail.com> Message-ID: <6978b4af0805190349o11c7eab4t74607708a369489@mail.gmail.com> 2008/5/16 Dotan Barak : > Rui Machado wrote: >> >> 2008/5/17 Dotan Barak : >>> >>> Rui Machado wrote: >>>> >>>> 2008/5/16 Roland Dreier : >>>>> >>>>> > hmm..... and is there no workaround for this, for this situation? I >>>>> > mean, if the server dies isn't there any possibility that >>>>> > the sender/client realizes this. If the timeout is too large this >>>>> > can be cumbersome. >>>>> > >>>>> > I tried reducing the timeout and indeed the client realizes faster >>>>> > when the server exits but another problem arises: without exiting the >>>>> > server, >>>>> > on the client side I get the error (retry exceeded) when polling for a >>>>> > recently posted send - this after some hours. >>>>> >>>>> There's a tradeoff between detecting real failures faster, and reducing >>>>> false errors detected because a response came too slowly. >>>>> >>>>> Clearly if a response may take an amount of time 'X' to be received >>>>> under normal conditions, there's no way to conclude that the remote >>>>> side >>>>> has failed without waiting at least 'X'. >>>>> >>>>> >>>> >>>> I understand. So there's no real difference between the two >>>> situations, real server failure or just a load problem that takes more >>>> time? >>>> >>>> >>> >>> From the sender QP point of view, they are the same (an ack/nack wasn't sent >>> during a specific >>> period of time) >>>>
>>>> >>>> I will describe my situation, maybe it helps (bare with me as I'm >>>> starting with Infiniband and so on) >>>> I have a client and a server.The clients posts RDMA calls one at a >>>> time (post, poll, post...). So server is just there. >>>> If I try to start something like 16 clients on 1 machine, after a few >>>> hours I will get an error on some client programs (retry excess) with >>>> a timeout of 14. If I increase the timeout for 32, I don't see that >>>> error but if I stop the server, the clients take a lot of time to >>>> acknowledge that, which is also not wanted. >>>> That's why I asked if there a 'good value'. If I have such a load >>>> between 2 nodes, I always have to risk that if the server dies the >>>> client will take much time to see it. That's not nice! >>>> >>>> >>> >>> Did you try to increase the retry_count too? >>> (and not only the timeout). >>> > > Yes. >> >> But that wouldn't change my scenario since the overall time is given >> by the timeout * retry count right? >> >> >>> >>> By the way, Which RDMA operation do you execute READ or WRITE? >>> >> >> READ. >> > > Can you replace it with a write (from the other side)? > READ has "higher price" than a WRITE. > Can you please, shortly explain why this higher price? > Anyway, you should get the mentioned behavior anyway.. > > When the sender get the error, what is the status of the receiver QP? > (did you try to execute ibv_query_qp and get its status?) > I tried to get the qp state right after the error and it is 6 (which I believe is IBV_QPS_ERR). Why do you ask? Thanks Rui From ogerlitz at voltaire.com Mon May 19 04:29:30 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 19 May 2008 14:29:30 +0300 Subject: [ofa-general] RE: [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs In-Reply-To: <482C68A4.9020305@opengridcomputing.com> References: <000001c8b6aa$a5418810$bd59180a@amr.corp.intel.com> <482C68A4.9020305@opengridcomputing.com> Message-ID: <4831649A.2020206@voltaire.com> Steve Wise wrote: > Sean Hefty wrote: >> I think we should just always report this event, and let users ignore >> it if they >> want. We don't seem to gain much by filtering the event at a lower >> level. > Um, doesn't that then change the ABI? Some apps might hurl on a new > (unexpected) event. Steve, I think ULPs should be designed/coded to live well with new events delivered by the rdma-cm as (A) the model is event based and (B) such events can be introduced while developing new features... So my suggestion is that a ULP which is limited in that sense would have to state in its package dependency requirements that they are dependent on librdmacm 1.0.7 or earlier. Ofcourse, the kernel is one package so I will make sure that the current intree rdma-cm consumers (iser, rnfs) live well with this event. Sean, please let me know your preference (as it was somehow unclear from the thread) if you want the delivery of this event to be dependent on the ulp asking for it or no. Or. 
From ogerlitz at voltaire.com Mon May 19 04:42:21 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 19 May 2008 14:42:21 +0300 Subject: [ofa-general] [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <000201c8b6ab$c0c66730$bd59180a@amr.corp.intel.com> References: <482C4A52.4000501@opengridcomputing.com> <000201c8b6ab$c0c66730$bd59180a@amr.corp.intel.com> Message-ID: <4831679D.9040804@voltaire.com> Sean Hefty wrote: >> So instead of adding a new function rdma_set_high_availability_mode, you >> could just set an option saying WANT_NETDEV_CHANGE_EVENTS. Maybe we >> need to add rdma_set_option() to the kernel RDMA-CM API? > I agree with this. Having a generic mechanism to report rare events would be > useful. Maybe the device removal notification can be adapted for this purpose? Sean, as suggested in the past (eg over the QoS discussion), rdma_set_option can serve more purposes, similarly to setsockopt, and I guess that down the road, as the RDMA stack gets enhanced with more features, adding these rdma_set/get_opt calls would make sense. So (Steve) in that respect, I don't see rdma_set_opt as a mechanism to report rare events. As I said, please let me know your preference so I can work on a patch. Or. From hrosenstock at xsigo.com Mon May 19 04:47:23 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 19 May 2008 04:47:23 -0700 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <48314E74.9010107@Sun.COM> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> Message-ID: <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> Sumit, On Mon, 2008-05-19 at 15:25 +0530, Sumit Gaur - Sun Microsystem wrote: > Hi > I have an issue while my program is interacting with the OFED umad library. Are you referring to libibumad ? > I have two > separate threads, one for sending SMP and GMP packets and another to receive > responses. Things are working fine, but during the whole process I keep receiving > packets with an unknown tid apart from the correct responses. What's the exact message ? > Is this correct behavior? It could be; there's not enough info as to what is going on. It could be some unsolicited message (e.g. from SM) comes in during your transactions. Can you see what MADs are incoming ? One way to do that would be to run madeye. > If yes, how can I avoid them? Not sure what you are seeing yet. -- Hal
From Sumit.Gaur at Sun.COM Mon May 19 04:50:05 2008 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Mon, 19 May 2008 17:20:05 +0530 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> Message-ID: <4831696D.6060409@Sun.COM> Hi Hal, Hal Rosenstock wrote: > Sumit, > > On Mon, 2008-05-19 at 15:25 +0530, Sumit Gaur - Sun Microsystem wrote: > >>Hi >>I have an issue while my program is interacting with the OFED umad library. > > > Are you referring to libibumad ? yes, I am using the mad_receive(0, -1) function to get my response back. > > >>I have two >>separate threads, one for sending SMP and GMP packets and another to receive >>responses. Things are working fine, but during the whole process I keep receiving >>packets with an unknown tid apart from the correct responses. > > > What's the exact message ?
Response comes as proper MAD packets, but with a "tid" that I have never sent, and my logic to keep track of send/response pkts failed. > > >> Is this correct behavior? > > > It could be; there's not enough info as to what is going on. It could be > some unsolicited message (e.g. from SM) comes in during your > transactions. Can you see what MADs are incoming ? One way to do that > would be to run madeye. Yes, I can see the complete MAD; the madhdr has the following fields: Response TID2 = 0x000000006701869b , BaseVersion = 1, MgmtClass=129, ClassVersion=1, R_Method=129, ClassSpecific=1, Status=128, AttributeID=435 If these are unsolicited packets, is there any way to filter them? Any reference for madeye? > > >>If yes, how can I avoid them? > > > Not sure what you are seeing yet. > > -- Hal
From ogerlitz at voltaire.com Mon May 19 05:05:08 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 19 May 2008 15:05:08 +0300 Subject: [ofa-general] Re: [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <000101c8b6ab$4fc9d7b0$bd59180a@amr.corp.intel.com> References: <000101c8b6ab$4fc9d7b0$bd59180a@amr.corp.intel.com> Message-ID: <48316CF4.2010507@voltaire.com> Sean Hefty wrote: >> +static int cma_netdev_align_id(struct net_device *ndev, struct rdma_id_private >> *id_priv) >> +{ >> + struct rdma_dev_addr *dev_addr; >> + struct cma_work *work; >> + >> + dev_addr = &id_priv->id.route.addr.dev_addr; >> + >> + if (!memcmp(dev_addr->src_dev_name, ndev->name, IFNAMSIZ) && >> + memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) { >> + printk(KERN_ERR "addr change for device %s used by id %p, >> notifying\n", >> + ndev->name, &id_priv->id); >> + work = kzalloc(sizeof *work, GFP_KERNEL); >> + if (!work) >> + return -ENOMEM; >> + work->id = id_priv; >> + INIT_WORK(&work->work, cma_work_handler); >> + work->old_state = id_priv->state; >> + work->new_state = id_priv->state; >> + work->event.event = RDMA_CM_EVENT_NETDEV_CHANGE; >> + atomic_inc(&id_priv->refcount); >> + queue_work(cma_wq, &work->work); >> + } >> +} > > My initial thought on this is to see if we can just queue a single work item > that can be used to invoke the user callbacks. I'd have to see how the locking > worked out though to know if that approach is 'cleaner'. Sean, Yes, it is possible to queue a single work item that can be used to invoke the user callbacks, eg cma_netdev_change_handler() would be queued to be executed by a thread (eg the cma_wq mentioned below) and do what this code does. What makes you think that it's 'cleaner' to do it this way? > Currently, the rdma_cm ensures that only a single callback to the user is > invoked at a time. This is needed to support the user trying to destroy their > rdma_cm_id from the callback. I didn't look to see if this still maintains > that. OK, so I understand from the code that the callback to the user may be delivered not only through cma_work_handler (that is, in the context of the work queue thread that is created by the rdma-cm). So does the design keep up this serialization by tracking the ID state/changes before invoking the callback, or is a different method used? Or.
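For reference, the "single work item" flavor Sean suggests could look roughly like the following (a sketch only, modeled on the cma_work/cma_work_handler pattern in the hunk quoted above; the struct and handler names here are hypothetical, not part of the RFC):

struct cma_ndev_work {
	struct work_struct work;
	struct rdma_id_private *id;
	struct rdma_cm_event event;
};

static void cma_ndev_work_handler(struct work_struct *_work)
{
	struct cma_ndev_work *work =
		container_of(_work, struct cma_ndev_work, work);
	struct rdma_id_private *id_priv = work->id;
	int destroy = 0;

	/* Serialize with other callbacks on this id, so the ULP may
	 * destroy the id from within its handler. */
	if (id_priv->id.event_handler(&id_priv->id, &work->event))
		destroy = 1;

	cma_deref_id(id_priv);
	if (destroy)
		rdma_destroy_id(&id_priv->id);
	kfree(work);
}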
From hrosenstock at xsigo.com Mon May 19 06:01:21 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 19 May 2008 06:01:21 -0700 Subject: [ofa-general] [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: <483046CA.3010403@Voltaire.COM> References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> <1211121890.12616.381.camel@hrosenstock-ws.xsigo.com> <483046CA.3010403@Voltaire.COM> Message-ID: <1211202081.12616.417.camel@hrosenstock-ws.xsigo.com> On Sun, 2008-05-18 at 18:10 +0300, Moni Shoua wrote: > Hal Rosenstock wrote: > > On Sun, 2008-05-18 at 15:36 +0300, Moni Shoua wrote: > >> The purpose of this patch is to make the events that are related to SM change > >> (namely CLIENT_REREGISTER event and SM_CHANGE event) less disruptive. > >> When SM related events are handled, it is not necessary to flush unicast > >> info from device but only multicast info. > > > > How is unicast invalidation handled on these changes ? On a local LID > > change event, how does an end port know/determine what else (e.g. other > > LIDs, paths) the SM might have changed (that specifically might affect > > IPoIB since this is limited to IPoIB) ? > I'm not sure I understand the question but local LID change would be handled as before > with a LID_CHANGE event. For this type of event, there is not change in what IPoIB does to cope. It's SM change which I'm not sure about. I'm unaware of an IBA spec guarantee on preservation of paths on SM failover. Can you point me at this ? Also, as many routing protocols are dependent on where they are run in the subnet (location of SM node in the topology), I don't think all path parameters can be maintained when in a heterogeneous subnet and hence would need refreshing (or flushing to cause this) on an SM change event. So while it may work in a homogeneous subnet, I don't think this is the general case. > > Also, wouldn't there be similar issues with other ULPs ? > There might be but the purpose of this one is to make things better for IPoIB Understood; just trying to widen the scope. IMO other ULPs should at least be inspected for the same issues. The multicast issue is IPoIB specific but local LID, client reregister (maybe only events for other ULPs as multicast and service records may not apply (perhaps except DAPL but this may be old implementation)) and SM changes apply to all. -- Hal > > -- Hal > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jackm at dev.mellanox.co.il Mon May 19 06:03:50 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 19 May 2008 16:03:50 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: References: <483049BF.4050603@dev.mellanox.co.il> Message-ID: <200805191603.50735.jackm@dev.mellanox.co.il> On Monday 19 May 2008 00:45, Roland Dreier wrote: > [Adding list cc] > > > +static int mlx4_log_pg_sz = 0; > > +module_param(mlx4_log_pg_sz, int, 0444); > > +MODULE_PARM_DESC(mlx4_log_pg_sz, > > + "set FW log system min page size (0 gets native FW min. default=0)"); > > Why do we need this module parameter? When would someone set it to anything > other than 0? 
This is in case at some installation, the administrator wishes to use the legacy device page size of 12, for example. Having a module parameter enables such tweaking to be done painlessly. > > > - return __mlx4_cmd(dev, in_param, out_param, 1, in_modifier, > > + return __mlx4_cmd(dev, in_param, out_param, out_param ? 1 : 0, in_modifier, > > op_modifier, op, timeout); > > I don't see any call to mlx4_cmd_imm in this patch -- why is this change needed? > You're right, this was just a hold-over from the first version of the patch (which used immediate data instead of the mailbox). - Jack From Thomas.Talpey at netapp.com Mon May 19 06:14:32 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 19 May 2008 09:14:32 -0400 Subject: Selective signalling vs WQE size - was Re: [ofa-general] mthca max_sge value... ugh. In-Reply-To: References: Message-ID: At 07:12 PM 5/16/2008, Roland Dreier wrote: >if we can't use the "WQE shrinking" feature (because of selective >signaling in the NFS/RDMA case), and we want to use 32 sge entries, then >the WQE size 's' will end up a little more than 512 bytes, and the >wqe_shift will end up as 10. Can you elaborate on this? The NFS/RDMA client does selective signalling on its send queue in order to save on interrupts and CQE generation/handling. Which I always thought was a (very) good approach. Because the RPC request/response paradigm guarantees an eventual receive completion, we simply defer (or even completely avoid) this work. Would that be a bad trade if it takes a WQE management opportunity away from the provider? It's quite easy to change this in the NFS/RDMA code, or make it a selectable parameter. Tom. From hrosenstock at xsigo.com Mon May 19 06:25:14 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 19 May 2008 06:25:14 -0700 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <4831696D.6060409@Sun.COM> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> Message-ID: <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> Hi Sumit, On Mon, 2008-05-19 at 17:20 +0530, Sumit Gaur - Sun Microsystem wrote: > Hi Hal, > > > Hal Rosenstock wrote: > > Sumit, > > > > On Mon, 2008-05-19 at 15:25 +0530, Sumit Gaur - Sun Microsystem wrote: > > > >>Hi > >>I have an issue while my program interacting with OFED umad library. > > > > > > Are you referring to libibumad ? > yes, I am using mad_receive(0, -1) function to get my response back. OK. > >>I have two > >>separate threads one for sending SMP,GMP packets and another to receive > >>response. Things are working fine but during the whole process I keep receiving > >>packets with unknown tid apart from correct response. > > > > > > What's the exact message ? > Response comes as proper mad packets but with "tid" that I have never send and > my logic to keep track of send/response pkts failed. > > > > > >> Is it a correct behavior. > > > > > > It could be; there's not enough info as to what is going on. It could be > > some unsolicited message (e.g. from SM) comes in during your > > transactions. Can you see what MADs are incoming ? One way to do that > > would be to run madeye. > Yes I could see complete mad with madhdr as following fields > > Response TID2 = 0x000000006701869b , BaseVersion = 1, MgmtClass=129, > ClassVersion=1, R_Method=129, ClassSpecific=1, Status=128, AttributeID=435 Class 129 is a Subn directed route packet. 
Some of the other info (like attribute ID) doesn't look right to me but maybe that's something "special" to your environment. > If these are unsolicited packets. Is there anyway to filter them. Yes. How do you register ? > Any reference to madeye ? There's only the code for this (kernel module) which is added by OFED (not upstream) in drivers/infiniband/util but it's pretty straightforward to use. -- Hal > >>If yes how I could avoid them ? > > > > > > Not sure what you are seeing yet. > > > > -- Hal > > > > > >>Thanks and Regards > >>sumit > >> > >>general-request at lists.openfabrics.org wrote: > >> > >>>Send general mailing list submissions to > >>> general at lists.openfabrics.org > >>> > >>>To subscribe or unsubscribe via the World Wide Web, visit > >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >>>or, via email, send a message with subject or body 'help' to > >>> general-request at lists.openfabrics.org > >>> > >>>You can reach the person managing the list at > >>> general-owner at lists.openfabrics.org > >>> > >>>When replying, please edit your Subject line so it is more specific > >>>than "Re: Contents of general digest..." > >>> > >>> > >>>Today's Topics: > >>> > >>> 1. Re: [PATCH] IB/core: handle race between elements in qork > >>> queues after event (Roland Dreier) > >>> 2. Re: RDS flow control (Steve Wise) > >>> 3. Re: RDS flow control (Olaf Kirch) > >>> 4. Re: RDS flow control (Steve Wise) > >>> 5. Re: RDS flow control (Olaf Kirch) > >>> 6. Re: [PATCH 3/3] IB/ipath - fix RDMA read response sequence > >>> checking (Roland Dreier) > >>> 7. Re: [PATCH][INFINIBAND]: Make ipath_portdata work with > >>> struct pid * not pid_t. (Roland Dreier) > >>> 8. Re: bitops take an unsigned long * (Roland Dreier) > >>> > >>> > >>>---------------------------------------------------------------------- > >>> > >>>Message: 1 > >>>Date: Tue, 13 May 2008 10:41:39 -0700 > >>>From: Roland Dreier > >>>Subject: Re: [ofa-general] [PATCH] IB/core: handle race between > >>> elements in qork queues after event > >>>To: Moni Shoua > >>>Cc: Olga Stern , OpenFabrics General > >>> > >>>Message-ID: > >>>Content-Type: text/plain; charset=us-ascii > >>> > >>> > Can we please go on with this patch? We would like to see it in the next kernel. > >>> > >>>I still don't get why this is important to you. Is there a concrete > >>>example of a situation where this actually makes a measurable difference? > >>> > >>>We need some justification for adding this locking complexity beyond "it > >>>doesn't hurt." (And also of course we need it fixed so there aren't races) > >>> > >>> - R. > >>> > >>> > >>>------------------------------ > >>> > >>>Message: 2 > >>>Date: Tue, 13 May 2008 12:58:11 -0500 > >>>From: Steve Wise > >>>Subject: Re: [ofa-general] RDS flow control > >>>To: Richard Frank > >>>Cc: rds-devel at oss.oracle.com, general at lists.openfabrics.org > >>>Message-ID: <4829D6B3.5080900 at opengridcomputing.com> > >>>Content-Type: text/plain; charset=ISO-8859-1; format=flowed > >>> > >>>Richard Frank wrote: > >>> > >>> > >>>>Steve Wise wrote: > >>>> > >>>> > >>>>>Olaf Kirch wrote: > >>>>> > >>>>> > >>>>>>On Monday 12 May 2008 18:57:38 Jon Mason wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>>>As part of my effort to get RDS working for iWARP, I will be > >>>>>>>working on the RDS flow control. Flow control is needed for iWARP > >>>>>>>due to the fact that iWARP connections terminate if there is no > >>>>>>>posted recv for an incoming packet. 
> >>>>> [...]

From swise at opengridcomputing.com Mon May 19 06:40:29 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 19 May 2008 08:40:29 -0500
Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
In-Reply-To: <483139B5.8040908@voltaire.com>
References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483139B5.8040908@voltaire.com>
Message-ID: <4831834D.3050705@opengridcomputing.com>

Or Gerlitz wrote:
> Steve Wise wrote:
>> Support for the IB BMME and iWARP equivalent memory extensions to non
>> shared memory regions. Usage Model:
>>
>> - MR allocated with ib_alloc_mr()
>> - Page lists allocated via ib_alloc_fast_reg_page_list().
>> - MR made VALID and bound to a specific page list via
>>   ib_post_send(IB_WR_FAST_REG_MR)
>> - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR)
>> - MR deallocated with ib_dereg_mr()
>> - page lists dealloced via ib_free_fast_reg_page_list().
>
> Steve,
>
> Does this design go hand-in-hand with remote invalidation? Such that
> if the remote side invalidated the mapping there is no need to issue
> the IB_WR_INVALIDATE_MR work request.

Yes.

> Also, does the proposed design support fmr pages of a granularity
> different than the OS ones? For example, the OS pages are 4K and the
> ULP wants to use fmrs with 512-byte "pages" (the "block lists"
> feature), etc. In that case, doesn't the size of each page have to be
> specified as a param to the alloc_fast_reg_mr() verb?

Page size is passed in at registration time. At allocation time, the HW
only needs to know what the max page list length (or PBL depth) will ever
be, so it can pre-allocate that at alloc time. The actual page list
length, the page size of each entry in the page list, and the page list
itself are passed in via the post_send(IB_WR_FAST_REG_MR) work request.
See the fast_reg union in struct ib_send_wr.

>> Applications can allocate a fast_reg mr once, and then can repeatedly
>> bind the mr to different physical memory SGLs via posting work requests
>> to the send queue. For each outstanding mr-to-pbl binding in the SQ
>> pipe, a fast_reg_page_list needs to be allocated. Thus pipelining can
>> be achieved while still allowing device-specific page_list processing.
>
> mmm, is it a must for the ULP to issue page list alloc/free per
> IB_WR_FAST_REG_MR call?

No, they can be reused as needed.
They typically will only get allocated once, used many times, then freed when the application is done. My point in the text above was that an application could allocate N page lists and use them in a pipeline for the same fast reg mr by fencing things appropriately in the SQ. >> --- a/include/rdma/ib_verbs.h >> +++ b/include/rdma/ib_verbs.h >> @@ -676,6 +683,20 @@ struct ib_send_wr { >> u16 pkey_index; /* valid for GSI only */ >> u8 port_num; /* valid for DR SMPs on switch only */ >> } ud; >> + struct { >> + u64 iova_start; >> + struct ib_mr *mr; >> + struct ib_fast_reg_page_list *page_list; >> + unsigned int page_size; >> + unsigned int page_list_len; >> + unsigned int first_byte_offset; >> + u32 length; >> + int access_flags; >> + >> + } fast_reg; >> + struct { >> + struct ib_mr *mr; >> + } local_inv; >> } wr; >> }; > I suggest to use a "page_shift" notation and not "page_size" to comply > with the kernel semantics of other APIs. > Ok, I wondered about that. It will also ensure a power of two. Steve. From Thomas.Talpey at netapp.com Mon May 19 06:53:23 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 19 May 2008 09:53:23 -0400 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4831834D.3050705@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483139B5.8040908@voltaire.com> <4831834D.3050705@opengridcomputing.com> Message-ID: At 09:40 AM 5/19/2008, Steve Wise wrote: >> I suggest to use a "page_shift" notation and not "page_size" to comply >> with the kernel semantics of other APIs. >> >Ok, I wondered about that. It will also ensure a power of two. Does it have to be ^2? In the iWARP spec development, we envisioned the possibility of arbitrary page sizes. I don't recall any such dependency in the protocol architecture. Storage has been known to adopt non ^2 blocks, for instance including block checksums in sectors, etc. If transferred, these will become quite inefficient on ^2 hardware. Tom. From swise at opengridcomputing.com Mon May 19 06:56:03 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 19 May 2008 08:56:03 -0500 Subject: [ofa-general] [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <4831679D.9040804@voltaire.com> References: <482C4A52.4000501@opengridcomputing.com> <000201c8b6ab$c0c66730$bd59180a@amr.corp.intel.com> <4831679D.9040804@voltaire.com> Message-ID: <483186F3.6090703@opengridcomputing.com> Or Gerlitz wrote: > Sean Hefty wrote: >>> So instead of adding a new function rdma_set_high_availability_mode, >>> you >>> could just set an option saying WANT_NETDEV_CHANGE_EVENTS. Maybe we >>> need to add rdma_set_option() to the kernel RDMA-CM API? >> I agree with this. Having a generic mechanism to report rare events >> would be >> useful. Maybe the device removal notification can be adapted for >> this purpose? > Sean, as suggested in the past (eg over the QoS discussion) > rdma_set_option can serve from more purposes similarly to setsockopt, > and I guess that down the road as the RDMA stack would get enhanced by > more features adding these rdma_set/get_opt calls would make sense. So > (Steve) in that respect, I don't see rdma_set_opt as a mechanism to > report rare events. > I don't understand your rationale above on why adding a new function is better than using an extensive "set this option" function. Can you clarify? 
Function rdma_set_option() wouldn't be the mechanism to report the events. It would be the mechanism to set an option indicating the cm_id wants NETDEV_CHANGE events. Seems like the exact fit rather than a new API call to set some very specific mode or option. I missed the QoS thread you're referencing, so please excuse me if I'm rehashing something that has been agreed-to in the past... Steve. From Sumit.Gaur at Sun.COM Mon May 19 06:49:41 2008 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Mon, 19 May 2008 19:19:41 +0530 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> Message-ID: <48318575.7060701@Sun.COM> Hal Rosenstock wrote: > Hi Sumit, > > On Mon, 2008-05-19 at 17:20 +0530, Sumit Gaur - Sun Microsystem wrote: > >>Hi Hal, >> >> >>Hal Rosenstock wrote: >> >>>Sumit, >>> >>>On Mon, 2008-05-19 at 15:25 +0530, Sumit Gaur - Sun Microsystem wrote: >>> >>> >>>>Hi >>>>I have an issue while my program interacting with OFED umad library. >>> >>> >>>Are you referring to libibumad ? >> >>yes, I am using mad_receive(0, -1) function to get my response back. > > > OK. > > >>>>I have two >>>>separate threads one for sending SMP,GMP packets and another to receive >>>>response. Things are working fine but during the whole process I keep receiving >>>>packets with unknown tid apart from correct response. >>> >>> >>>What's the exact message ? >> >>Response comes as proper mad packets but with "tid" that I have never send and >>my logic to keep track of send/response pkts failed. >> >>> >>>>Is it a correct behavior. >>> >>> >>>It could be; there's not enough info as to what is going on. It could be >>>some unsolicited message (e.g. from SM) comes in during your >>>transactions. Can you see what MADs are incoming ? One way to do that >>>would be to run madeye. >> >>Yes I could see complete mad with madhdr as following fields >> >>Response TID2 = 0x000000006701869b , BaseVersion = 1, MgmtClass=129, >>ClassVersion=1, R_Method=129, ClassSpecific=1, Status=128, AttributeID=435 > > > Class 129 is a Subn directed route packet. Some of the other info (like > attribute ID) doesn't look right to me but maybe that's something > "special" to your environment. Sorry missed last number AttributeID=4352 > > >> If these are unsolicited packets. Is there anyway to filter them. > > > Yes. How do you register ? For registration I am calling madrpc_init(ca, ca_port, mgmt_classes, 4) function once before starting polling thread for SMI and GSI packet receive. Once I received packet I am filtering them on the basis of madhdr->MgmtClass. int mgmt_classes[4] = {IB_SMI_CLASS,IB_SMI_DIRECT_CLASS, IB_SA_CLASS, IB_PERFORMANCE_CLASS}; for given ca and ca port of local node. > > >>Any reference to madeye ? > > > There's only the code for this (kernel module) which is added by OFED > (not upstream) in drivers/infiniband/util but it's pretty > straightforward to use. > > -- Hal > > >>>>If yes how I could avoid them ? >>> >>> >>>Not sure what you are seeing yet. 
>>> [...]
Worst case, >>>>>>you have an iwarp-specific RDS transport like you do for TCP, I guess. >>>>>>Hopefully though, IB + iWARP will be a common transport. >>>>> >>>>> >>>>>If it turns out that way, fine. If iWARP ands up sharing 80% of the >>>>>code with IB except the RDMA specific functions, I think that's >>>>>very much acceptable, too. >>>>> >>>>>Olaf >>>> >>>>_______________________________________________ >>>>general mailing list >>>>general at lists.openfabrics.org >>>>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>> >>>>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>> >>> > From swise at opengridcomputing.com Mon May 19 06:58:20 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 19 May 2008 08:58:20 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483139B5.8040908@voltaire.com> <4831834D.3050705@opengridcomputing.com> Message-ID: <4831877C.6010208@opengridcomputing.com> An HTML attachment was scrubbed... URL: From Thomas.Talpey at netapp.com Mon May 19 06:59:34 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 19 May 2008 09:59:34 -0400 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4831877C.6010208@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483139B5.8040908@voltaire.com> <4831834D.3050705@opengridcomputing.com> <4831877C.6010208@opengridcomputing.com> Message-ID: At 09:58 AM 5/19/2008, Steve Wise wrote: >>Storage has been known to adopt non ^2 blocks, for instance including >>block checksums in sectors, etc. If transferred, these will become quite >>inefficient on ^2 hardware. >> >> >Is this true today for any of the existing RDMA ULPs that will utilize fastreg? Ask the iSER and SRP folks. NFS won't. Tom. From jackm at dev.mellanox.co.il Mon May 19 07:03:05 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 19 May 2008 17:03:05 +0300 Subject: [ofa-general] [PATCH] IPoIB: Test for NULL broadcast object in opiob_mcast_join_finish. Message-ID: <200805191703.05887.jackm@dev.mellanox.co.il> IPoIB: "join finish" occurring just after device was flushed caused Oops. ipoib_mcast_join_finish() processing could conceivably occur just after ipoib_mcast_dev_flush() was invoked (in which case the broadcast pointer is NULL). This patch tests for and fixes this case. Signed-off-by: Jack Morgenstein --- Roland, We encountered this problem in our regression testing (kernel Oops). (bugzilla bug 1040). The test randomly causes the HCA physical port to go down then up. We then have a situation where a "flush" could occur while IPoIB mcast initialization was still in progress. 
Index: ofed_kernel/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- ofed_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2008-05-19 15:48:17.000000000 +0300
+++ ofed_kernel/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2008-05-19 16:07:52.723294000 +0300
@@ -194,7 +194,13 @@ static int ipoib_mcast_join_finish(struc
 	/* Set the cached Q_Key before we attach if it's the broadcast group */
 	if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		    sizeof (union ib_gid))) {
+		spin_lock_irq(&priv->lock);
+		if (!priv->broadcast) {
+			spin_unlock_irq(&priv->lock);
+			return -EAGAIN;
+		}
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
+		spin_unlock_irq(&priv->lock);
 		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
 	}

-------------------------------------------------------

From tziporet at mellanox.co.il Mon May 19 07:20:57 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Mon, 19 May 2008 17:20:57 +0300
Subject: [ofa-general] Agenda for the OFED meeting today (May 5)
Message-ID: <6C2C79E72C305246B504CBA17B5500C9040C0CA6@mtlexch01.mtl.com>

Hi,
This is the agenda for the OFED meeting today:

1. OFED 1.3.1:
   1.1 Schedule:
       rc1 - done on May 6
       rc2 - May 22 <== I propose to delay to Thursday since there are a few IPoIB bugs being worked on
       GA - May 29
   1.2 OS support:
       SLES10 SP2 backports were done (thanks to Moshe from Voltaire)
       There is a request for RHEL 5.2 - who has this OS and can help with the backports?
   1.3 Bugs status:
       Please set release version 1.3.1 for all bugs that should be resolved in 1.3.1.
       In the way the bugs are assigned today it is very hard to extract the relevant bugs for the release.
       This is the list of bugs that should be resolved to my best knowledge (please add more):

       1024 normal   monis at voltaire.com      Bonding-Ping not recovery after reconnect the non active interface
       1027 normal   sashak at voltaire.com     kernel panic in mad.c handle_outgoing_dr_smp with RESULT_CONSUMED
       1031 normal   kliteyn at mellanox.co.il  OpenSM fat tree routing thinks fat tree isn't
       1032 critical vu at mellanox.com         RHEL 5.1 and OFED 1.3 cannot write IO blocks greater than 1024
       1038 normal   eli at mellanox.co.il      Kernel panic while running tcp/ip ltp tests
       1040 normal   jackm at mellanox.co.il    Kernel Oops during "port up/down test"
       1041 normal   vlad at mellanox.co.il     Install Failed with memtrack flag in the conf file
       1042 normal   vlad at mellanox.co.il     ofed-1.3.1 install fails

2. OFED 1.4:
   - Kernel rebase status: we have prepared the new tree, make-dist passes but compilation still fails.
     Any help to resolve compilation issues is welcome.
     URL: git://git.openfabrics.org/ofed_1_4/linux-2.6.git ofed_kernel
   - Update from the participants (mainly on new components/features):
     - NFSoRDMA - Jeff
     - Management - Sasha
     - Multiple EQs to best fit multi-core systems - we try to define it with Roland
     - RDMA CM to support IPv6 - Woody, any news on this?
     - IB BMME and iWARP equivalent memory extensions - under progress on the general list

3. Open discussion
   - Upgrade memory in the OFA server:
     This request was raised a long time ago and we had a promise to do it after the 1.3 release. What is the status?
   - Other topics ...
Tziporet From swise at opengridcomputing.com Mon May 19 07:23:13 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 19 May 2008 09:23:13 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <48301F74.4020905@voltaire.com> Message-ID: <48318D51.5070904@opengridcomputing.com> Roland Dreier wrote: > > If ownership can be assumed, I suggest to have the core use the > > implementation of these two verbs as you did that for the Chelsio > > driver in case the HW driver did not implement it (i.e instead of > > returning ENOSYS). In that case, the alloc_list verb should do DMA > > mapping FROM device (I think...) since the device is going to do DMA > > to read the page list, and the free_list verb should do DMA unmapping, > > etc. > > Yes, the point of this verb is that the low-level driver owns the page > list from when the fast register work request is posted until it > completes. This should be explicitly documented somewhere. > > I've added it to the comments for ib_alloc_fast_reg_page_list() as per Ralph Campbell's suggestion. > However the reason for having the low-level driver implement it is so > that all strange device-specific issues can be taken care of in the > driver. For instance mlx4 is going to require that the page list be > aligned to 64 bytes, and will DMA from the memory, so we need to use > dma_alloc_consistent(). On the other hand cxgb3 is just going to copy > in software, so kmalloc is sufficient. > > - R. > From swise at opengridcomputing.com Fri May 16 15:30:37 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 May 2008 17:30:37 -0500 Subject: [ofa-general] [PATCH RFC v3] RDMA: New Memory Extensions Message-ID: <20080516223037.27127.26712.stgit@dell3.ogc.int> The following patch series proposes: - The API and core changes needed to implement the IB BMMR and iWARP equivalient memory extensions. - cxgb3 support. Changes since version 3: - better comments to ib_alloc_fast_reg_page_list() function to explicitly state the page list is owned by the device until the fast_reg WR completes. Changes since version 2: - added device attribute max_fast_reg_page_list_len - added cxgb3 patch Changes since version 1: - ib_alloc_mr() -> ib_alloc_fast_reg_mr() - pbl_depth -> max_page_list_len - page_list_len -> max_page_list_len where it makes sense - int -> unsigned int where needed - fbo -> first_byte_offset - added page size and page_list_len to fast_reg union in ib_send_wr - rearranged work request fast_reg union of ib_send_wr to pack it - dropped remove_access parameter from ib_alloc_fast_reg_mr() - IB_DEVICE_MM_EXTENSIONS -> IB_DEVICE_MEM_MGT_EXTENSIONS - compiled Steve. From swise at opengridcomputing.com Fri May 16 15:30:37 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 May 2008 17:30:37 -0500 Subject: [ofa-general] [PATCH RFC v3] RDMA: New Memory Extensions Message-ID: <20080516223037.27127.26712.stgit@dell3.ogc.int> The following patch series proposes: - The API and core changes needed to implement the IB BMMR and iWARP equivalient memory extensions. - cxgb3 support. Changes since version 3: - better comments to ib_alloc_fast_reg_page_list() function to explicitly state the page list is owned by the device until the fast_reg WR completes. 
- cxgb3 - when allocating a page list, set max_page_list_len Changes since version 2: - added device attribute max_fast_reg_page_list_len - added cxgb3 patch Changes since version 1: - ib_alloc_mr() -> ib_alloc_fast_reg_mr() - pbl_depth -> max_page_list_len - page_list_len -> max_page_list_len where it makes sense - int -> unsigned int where needed - fbo -> first_byte_offset - added page size and page_list_len to fast_reg union in ib_send_wr - rearranged work request fast_reg union of ib_send_wr to pack it - dropped remove_access parameter from ib_alloc_fast_reg_mr() - IB_DEVICE_MM_EXTENSIONS -> IB_DEVICE_MEM_MGT_EXTENSIONS - compiled Steve. From or.gerlitz at gmail.com Mon May 19 07:44:03 2008 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Mon, 19 May 2008 17:44:03 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: <483049BF.4050603@dev.mellanox.co.il> References: <483049BF.4050603@dev.mellanox.co.il> Message-ID: <15ddcffd0805190744uaf4710ah9a00a5653fe9f45a@mail.gmail.com> On 5/18/08, Vladimir Sokolovsky wrote: > > There is a bug in the OFED 1.3 mlx4 driver in mlx4_alloc_fmr which > hardcoded > the minimum acceptable page_shift to be 12. However, new mlx4 firmware has > a > minimum page_shift of 9 (log_pg_sz of 9 returned by QUERY_DEV_LIM) -- so > that > ib_fmr_alloc fails for ULPs using the device minimum when creating FMRs. > > To preserve firmware compatibility with released OFED drivers, the firmware > will continue to return 12 as before for log_page_sz in QUERY_DEV_CAP for > these > drivers. Hi Vlad, Roland, To start with, the bug is in the Linux kernel mlx4 driver, there's nothing like "OFED 1.3 mlx4 driver" (who's the maintainer? why there's need to be another instace of a driver merged into the mainline kerel, etc). This bug was fixed last week or so by a patch sent by someone from Mellanox. To continue with, maybe we just state that kernels < X = 2.6.26 are not compatible with FW version > Y = 2.3? or have the patch that fixes the problem be sent to -stable versions of older kernels? If those solutions are not enough, I think that the default behaviour of FW AND the mainline driver would be to get the actual minimal driver supported, namely nine. Or. -------------- next part -------------- An HTML attachment was scrubbed... URL: From or.gerlitz at gmail.com Mon May 19 07:48:12 2008 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Mon, 19 May 2008 17:48:12 +0300 Subject: [ofa-general] [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <483186F3.6090703@opengridcomputing.com> References: <482C4A52.4000501@opengridcomputing.com> <000201c8b6ab$c0c66730$bd59180a@amr.corp.intel.com> <4831679D.9040804@voltaire.com> <483186F3.6090703@opengridcomputing.com> Message-ID: <15ddcffd0805190748k7ea3ca96o815cf64e71758d82@mail.gmail.com> On 5/19/08, Steve Wise wrote: > I don't understand your rationale above on why adding a new function is > better than using an extensive "set this option" function. Can you clarify? > > Function rdma_set_option() wouldn't be the mechanism to report the events. > It would be the mechanism to set an option indicating the cm_id wants > NETDEV_CHANGE events. Seems like the exact fit rather than a new API call > to set some very specific mode or option. I missed the QoS thread you're > referencing, so please excuse me if I'm rehashing something that has been > agreed-to in the past... 
> Steve, I think there was some misunderstanding here, I am not against rdma_set_option, I just said that I would follow what would be decided over the review / maintainer decision. I am fine both with using a set_opt call or a dedicated call. Or. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hrosenstock at xsigo.com Mon May 19 07:49:10 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 19 May 2008 07:49:10 -0700 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <48318575.7060701@Sun.COM> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> <48318575.7060701@Sun.COM> Message-ID: <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-19 at 19:19 +0530, Sumit Gaur - Sun Microsystem wrote: > > Hal Rosenstock wrote: > > Hi Sumit, > > > > On Mon, 2008-05-19 at 17:20 +0530, Sumit Gaur - Sun Microsystem wrote: > > > >>Hi Hal, > >> > >> > >>Hal Rosenstock wrote: > >> > >>>Sumit, > >>> > >>>On Mon, 2008-05-19 at 15:25 +0530, Sumit Gaur - Sun Microsystem wrote: > >>> > >>> > >>>>Hi > >>>>I have an issue while my program interacting with OFED umad library. > >>> > >>> > >>>Are you referring to libibumad ? > >> > >>yes, I am using mad_receive(0, -1) function to get my response back. > > > > > > OK. > > > > > >>>>I have two > >>>>separate threads one for sending SMP,GMP packets and another to receive > >>>>response. Things are working fine but during the whole process I keep receiving > >>>>packets with unknown tid apart from correct response. > >>> > >>> > >>>What's the exact message ? > >> > >>Response comes as proper mad packets but with "tid" that I have never send and > >>my logic to keep track of send/response pkts failed. > >> > >>> > >>>>Is it a correct behavior. > >>> > >>> > >>>It could be; there's not enough info as to what is going on. It could be > >>>some unsolicited message (e.g. from SM) comes in during your > >>>transactions. Can you see what MADs are incoming ? One way to do that > >>>would be to run madeye. > >> > >>Yes I could see complete mad with madhdr as following fields > >> > >>Response TID2 = 0x000000006701869b , BaseVersion = 1, MgmtClass=129, > >>ClassVersion=1, R_Method=129, ClassSpecific=1, Status=128, AttributeID=435 > > > > > > Class 129 is a Subn directed route packet. Some of the other info (like > > attribute ID) doesn't look right to me but maybe that's something > > "special" to your environment. > Sorry missed last number AttributeID=4352 I don't know what that attribute ID is so there's something different about that. Out of curiousity, what SM are you using ? > >> If these are unsolicited packets. Is there anyway to filter them. > > > > > > Yes. How do you register ? > For registration I am calling madrpc_init(ca, ca_port, mgmt_classes, 4) > function once before starting polling thread for SMI and GSI packet receive. > Once I received packet I am filtering them on the basis of madhdr->MgmtClass. > > int mgmt_classes[4] = {IB_SMI_CLASS,IB_SMI_DIRECT_CLASS, IB_SA_CLASS, > IB_PERFORMANCE_CLASS}; > > for given ca and ca port of local node. That looks like it would register with a NULL method mask which should filter unsolicited packets. I think I see the issue: the incoming packet appears to have a method of 129 (GetResp) which has the response bit on so it's not considered unsolicited. 
You need to see what exactly that packet is and where it's coming from and why. -- Hal > > > > > >>Any reference to madeye ? > > > > > > There's only the code for this (kernel module) which is added by OFED > > (not upstream) in drivers/infiniband/util but it's pretty > > straightforward to use. > > > > -- Hal > > > > > >>>>If yes how I could avoid them ? > >>> > >>> > >>>Not sure what you are seeing yet. > >>> > >>>-- Hal > >>> > >>> > >>> > >>>>Thanks and Regards > >>>>sumit > >>>> > >>>>general-request at lists.openfabrics.org wrote: > >>>> > >>>> > >>>>>Send general mailing list submissions to > >>>>> general at lists.openfabrics.org > >>>>> > >>>>>To subscribe or unsubscribe via the World Wide Web, visit > >>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >>>>>or, via email, send a message with subject or body 'help' to > >>>>> general-request at lists.openfabrics.org > >>>>> > >>>>>You can reach the person managing the list at > >>>>> general-owner at lists.openfabrics.org > >>>>> > >>>>>When replying, please edit your Subject line so it is more specific > >>>>>than "Re: Contents of general digest..." > >>>>> > >>>>> > >>>>>Today's Topics: > >>>>> > >>>>> 1. Re: [PATCH] IB/core: handle race between elements in qork > >>>>> queues after event (Roland Dreier) > >>>>> 2. Re: RDS flow control (Steve Wise) > >>>>> 3. Re: RDS flow control (Olaf Kirch) > >>>>> 4. Re: RDS flow control (Steve Wise) > >>>>> 5. Re: RDS flow control (Olaf Kirch) > >>>>> 6. Re: [PATCH 3/3] IB/ipath - fix RDMA read response sequence > >>>>> checking (Roland Dreier) > >>>>> 7. Re: [PATCH][INFINIBAND]: Make ipath_portdata work with > >>>>> struct pid * not pid_t. (Roland Dreier) > >>>>> 8. Re: bitops take an unsigned long * (Roland Dreier) > >>>>> > >>>>> > >>>>>---------------------------------------------------------------------- > >>>>> > >>>>>Message: 1 > >>>>>Date: Tue, 13 May 2008 10:41:39 -0700 > >>>>>From: Roland Dreier > >>>>>Subject: Re: [ofa-general] [PATCH] IB/core: handle race between > >>>>> elements in qork queues after event > >>>>>To: Moni Shoua > >>>>>Cc: Olga Stern , OpenFabrics General > >>>>> > >>>>>Message-ID: > >>>>>Content-Type: text/plain; charset=us-ascii > >>>>> > >>>>> > >>>>>>Can we please go on with this patch? We would like to see it in the next kernel. > >>>>> > >>>>>I still don't get why this is important to you. Is there a concrete > >>>>>example of a situation where this actually makes a measurable difference? > >>>>> > >>>>>We need some justification for adding this locking complexity beyond "it > >>>>>doesn't hurt." (And also of course we need it fixed so there aren't races) > >>>>> > >>>>>- R. > >>>>> > >>>>> > >>>>>------------------------------ > >>>>> > >>>>>Message: 2 > >>>>>Date: Tue, 13 May 2008 12:58:11 -0500 > >>>>>From: Steve Wise > >>>>>Subject: Re: [ofa-general] RDS flow control > >>>>>To: Richard Frank > >>>>>Cc: rds-devel at oss.oracle.com, general at lists.openfabrics.org > >>>>>Message-ID: <4829D6B3.5080900 at opengridcomputing.com> > >>>>>Content-Type: text/plain; charset=ISO-8859-1; format=flowed > >>>>> > >>>>>Richard Frank wrote: > >>>>> > >>>>> > >>>>> > >>>>>>Steve Wise wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>>>Olaf Kirch wrote: > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>>>On Monday 12 May 2008 18:57:38 Jon Mason wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>>As part of my effort to get RDS working for iWARP, I will be > >>>>>>>>>working on the RDS flow control. 
> >>>>> [...]

From rdreier at cisco.com Mon May 19 07:49:40 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 19 May 2008 07:49:40 -0700
Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size
In-Reply-To: <200805191603.50735.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 19 May 2008 16:03:50 +0300")
References: <483049BF.4050603@dev.mellanox.co.il> <200805191603.50735.jackm@dev.mellanox.co.il>
Message-ID:

 > This is in case at some installation, the administrator wishes to use
 > the legacy device page size of 12, for example. Having a module
 > parameter enables such tweaking to be done painlessly.

And why would the administrator want that?

 - R.

From rdreier at cisco.com Mon May 19 07:51:01 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 19 May 2008 07:51:01 -0700
Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size
In-Reply-To: <15ddcffd0805190744uaf4710ah9a00a5653fe9f45a@mail.gmail.com> (Or Gerlitz's message of "Mon, 19 May 2008 17:44:03 +0300")
References: <483049BF.4050603@dev.mellanox.co.il> <15ddcffd0805190744uaf4710ah9a00a5653fe9f45a@mail.gmail.com>
Message-ID:

 > To continue with, maybe we just state that kernels < X = 2.6.26 are not
 > compatible with FW version > Y = 2.3? or have the patch that fixes the
 > problem be sent to -stable versions of older kernels?

Why?
This patch provides a pretty simple way for older kernels and newer firmware to continue to work together, while providing the new functionality with new firmware and new kernels. Why introduce gratuitous breakage? - R.
From or.gerlitz at gmail.com Mon May 19 07:55:42 2008 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Mon, 19 May 2008 17:55:42 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: References: <483049BF.4050603@dev.mellanox.co.il> <15ddcffd0805190744uaf4710ah9a00a5653fe9f45a@mail.gmail.com> Message-ID: <15ddcffd0805190755ka46eb48p523272b1e10f00c9@mail.gmail.com> On 5/19/08, Roland Dreier wrote: > > Why? This patch provides a pretty simple way for older kernels and > newer firmware to continue to work together, while providing the new > functionality with new firmware and new kernels. Why introduce > gratuitous breakage? > > I understand that for the new functionality to take effect with new kernels, the admin has to set the module param to a non-default value, correct? So you are fine with that? Or. -------------- next part -------------- An HTML attachment was scrubbed... URL:
From jackm at dev.mellanox.co.il Mon May 19 08:01:43 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 19 May 2008 18:01:43 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: References: <483049BF.4050603@dev.mellanox.co.il> <200805191603.50735.jackm@dev.mellanox.co.il> Message-ID: <200805191801.43596.jackm@dev.mellanox.co.il> On Monday 19 May 2008 17:49, Roland Dreier wrote: > > This is in case at some installation, the administrator wishes to use > > the legacy device page size of 12, for example. Having a module > > parameter enables such tweaking to be done painlessly. > > And why would the administrator want that? > Ok, we'll get rid of the module parameter. - Jack
From olga.shern at gmail.com Mon May 19 08:11:37 2008 From: olga.shern at gmail.com (Olga Shern (Voltaire)) Date: Mon, 19 May 2008 18:11:37 +0300 Subject: [ofa-general] Re: [ewg] Agenda for the OFED meeting today (May 5) In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9040C0CA6@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C9040C0CA6@mtlexch01.mtl.com> Message-ID: On 5/19/08, Tziporet Koren wrote: > > Hi, > > This is the agenda for the OFED meeting today: > 1. OFED 1.3.1: > 1.1 Schedule: > rc1 - done on May 6 > rc2 - May 22 <== I propose to delay to Thursday since there are > a few IPoIB bugs in work > GA - May 29 > 1.2 OS support: > SLES10 SP2 backports were done (thanks to Moshe from Voltaire) > There is a request for RHEL 5.2 - who has this OS and can help > with the backports? > 1.3 Bugs status > Please set release version 1.3.1 for all bugs that should be > resolved in 1.3.1 > The way the bugs are assigned today, it is very hard to > extract the relevant bugs for the release.
> This is the list of bugs that should be resolved to my best > knowledge (please add more): There is also bug number 1004 1004 maj P2 RHEL eli at mellanox.co.il IPoIB failed on stress testing 1024 normal monis at voltaire.com Bonding-Ping not recovery after > reconnect the non active interface > 1027 normal sashak at voltaire.com kernel panic in mad.c > handle_outgoing_dr_smp with RESULT_CONSUMED > 1031 normal kliteyn at mellanox.co.il OpenSM fat tree routing thinks > fat tree isn't > 1032 critical vu at mellanox.com RHEL 5.1 and OFED 1.3 > cannot write IO blocks greater than 1024. > 1038 normal eli at mellanox.co.il Kernel panic while running > tcp/ip ltp tests > 1040 normal jackm at mellanox.co.il Kernel Oops during "port up/down > test" > 1041 normal vlad at mellanox.co.il Install Failed with memtrack > flag in the conf file > 1042 normal vlad at mellanox.co.il ofed-1.3.1 install fails > > 2. OFED 1.4: > - Kernel rebase status: we have prepared the new tree, make-dist > pass but compilation still fails. > Any help to resolve compilation issues is welcome. > URL: git://git.openfabrics.org/ofed_1_4/linux-2.6.git > ofed_kernel > - Update from the participants (mainly on new > components/features): > - NFSoRDMA - Jeff > - Management - Sasha > - Multiple EQs to best fit multi-core systems - we try to > define it with Roland > - RDMA CM to support IPv6 - Woody any news on this? > - IB BMME and iWARP equivalent memory extensions - under > progress on the general list > > 3. Open discussion > - Upgrade memory in the OFA server: > This request raised long time ago and we had a promise to do it > after 1.3 release. What is the status? > - Other topics ... > > Tziporet > > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > -------------- next part -------------- An HTML attachment was scrubbed... URL: From monis at Voltaire.COM Mon May 19 08:17:00 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Mon, 19 May 2008 18:17:00 +0300 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> Message-ID: <483199EC.7070900@Voltaire.COM> Thanks for the comment and example. Please take a look below. The last paragraph in the patch documentation refers to the race you pointed. The test is made after sm_ah is copied to the mad so there is no risk of it being NULL in the mad structure. --------------------------------------- This patch solves a race between work elements that are carried out after an event occurs. When SM address handle becomes invalid and needs an update it is handled by a work in the global workqueue. On the other hand this event is also handled in ib_ipoib by queuing a work in the ipoib_workqueue that does mcast join. Although queuing is in the right order, it is done to 2 different workqueues and so there is no guarantee that the first to be queued is the first to be executed. The patch sets the SM address handle to NULL and until update_sm_ah() is called, any request that needs sm_ah is replied with -EAGAIN return status. For consumers, the patch doesn't make things worse. Before the patch, MADS are sent to the wrong SM so the request gets lost. Consumers can be improved if they examine the return code and respond to EAGAIN properly but even without an improvement the situation is not getting worse and in some cases it gets better. 
If ib_sa_event() is called during a consumer's work (e.g. ib_sa_path_rec_get()), after the check for a NULL SM address handle, the result would be the same as before the patch, without a risk of dereferencing a NULL pointer. Signed-off-by: Moni Levy Signed-off-by: Moni Shoua --- drivers/infiniband/core/sa_query.c | 49 +++++++++++++++++++++++++++---------- 1 file changed, 37 insertions(+), 12 deletions(-) diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index cf474ec..8170381 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -413,9 +413,20 @@ static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event event->event == IB_EVENT_PKEY_CHANGE || event->event == IB_EVENT_SM_CHANGE || event->event == IB_EVENT_CLIENT_REREGISTER) { - struct ib_sa_device *sa_dev; - sa_dev = container_of(handler, typeof(*sa_dev), event_handler); - + unsigned long flags; + struct ib_sa_device *sa_dev = + container_of(handler, typeof(*sa_dev), event_handler); + struct ib_sa_port *port = + &sa_dev->port[event->element.port_num - sa_dev->start_port]; + struct ib_sa_sm_ah *sm_ah; + + spin_lock_irqsave(&port->ah_lock, flags); + sm_ah = port->sm_ah; + port->sm_ah = NULL; + spin_unlock_irqrestore(&port->ah_lock, flags); + + if (sm_ah) + kref_put(&sm_ah->ref, free_sm_ah); schedule_work(&sa_dev->port[event->element.port_num - sa_dev->start_port].update_task); } @@ -673,6 +684,10 @@ int ib_sa_path_rec_get(struct ib_sa_client *client, ret = alloc_mad(&query->sa_query, gfp_mask); if (ret) goto err1; + if (!port->sm_ah) { + ret = -EAGAIN; + goto err2; + } ib_sa_client_get(client); query->sa_query.client = client; @@ -694,13 +709,14 @@ int ib_sa_path_rec_get(struct ib_sa_client *client, ret = send_mad(&query->sa_query, timeout_ms, gfp_mask); if (ret < 0) - goto err2; + goto err3; return ret; -err2: +err3: *sa_query = NULL; ib_sa_client_put(query->sa_query.client); +err2: free_mad(&query->sa_query); err1: kfree(query); return ret; @@ -780,6 +796,7 @@ int ib_sa_service_rec_query(struct ib_sa_client *client, return -ENODEV; port = &sa_dev->port[port_num - sa_dev->start_port]; + agent = port->agent; if (method != IB_MGMT_METHOD_GET && @@ -795,6 +812,10 @@ int ib_sa_service_rec_query(struct ib_sa_client *client, ret = alloc_mad(&query->sa_query, gfp_mask); if (ret) goto err1; + if (!port->sm_ah) { + ret = -EAGAIN; + goto err2; + } ib_sa_client_get(client); query->sa_query.client = client; @@ -817,15 +838,15 @@ int ib_sa_service_rec_query(struct ib_sa_client *client, ret = send_mad(&query->sa_query, timeout_ms, gfp_mask); if (ret < 0) - goto err2; + goto err3; return ret; -err2: +err3: *sa_query = NULL; ib_sa_client_put(query->sa_query.client); +err2: free_mad(&query->sa_query); - err1: kfree(query); return ret; @@ -877,8 +898,8 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client, return -ENODEV; port = &sa_dev->port[port_num - sa_dev->start_port]; - agent = port->agent; + agent = port->agent; query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; @@ -887,6 +908,10 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client, ret = alloc_mad(&query->sa_query, gfp_mask); if (ret) goto err1; + if (!port->sm_ah) { + ret = -EAGAIN; + goto err2; + } ib_sa_client_get(client); query->sa_query.client = client; @@ -909,15 +934,15 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client, ret = send_mad(&query->sa_query, timeout_ms, gfp_mask); if (ret < 0) - goto err2; + goto err3; return ret; -err2: +err3: *sa_query = NULL;
ib_sa_client_put(query->sa_query.client); +err2: free_mad(&query->sa_query); - err1: kfree(query); return ret;
From jackm at dev.mellanox.co.il Mon May 19 08:25:37 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 19 May 2008 18:25:37 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: References: <483049BF.4050603@dev.mellanox.co.il> <200805191603.50735.jackm@dev.mellanox.co.il> Message-ID: <200805191825.38176.jackm@dev.mellanox.co.il> On Monday 19 May 2008 17:49, Roland Dreier wrote: > > This is in case at some installation, the administrator wishes to use > > the legacy device page size of 12, for example. Having a module > > parameter enables such tweaking to be done painlessly. > > And why would the administrator want that? > I just remembered. If we create FMRs using 512 as the device page size, we will use 8 times the MTT entries as we would if the page size was 4K. The ULP (rds and iser) can run out of MTT entries much faster. This can give an administrator a quick workaround if needed (until we fix the resource allocator to allow bitmaps larger than 2^20 -- which is the current default max number of MTTs). (The allocator's problem is that kzalloc cannot allocate a block larger than 128 KB (= 1M bits).) - Jack
From yevgenyp at mellanox.co.il Mon May 19 08:22:52 2008 From: yevgenyp at mellanox.co.il (Yevgeny Petrilin) Date: Mon, 19 May 2008 18:22:52 +0300 Subject: [ofa-general][PATCH] Re: mlx4: Completion EQ per cpu (MP support, Patch 10) In-Reply-To: References: <40FA0A8088E8A441973D37502F00933E3A24@mtlexch01.mtl.com> Message-ID: <48319B4C.7040309@mellanox.co.il> Roland Dreier wrote: > > > I would just like to see an approach that is fully thought through and > > > gives a way for applications/kernel drivers to choose a CQ vector based > > > on some information about what CPU it will go to. > > > Isn't the decision of which CPU an MSI-X is routed to (and hence, to > > which CPU an EQ is bound) determined by userspace? (either by the irq > > balancer process or by manually setting /proc/irq//smp_affinity)? > > Yes, but how can anything tell which IRQ number corresponds to a given > "CQ vector" number? (And don't be too stuck on MSI-X, since ehca uses > some completely different GX-bus related thing to get multiple interrupts) > > > What are we risking in making the default action to spread interrupts? > > There are fairly plausible scenarios like a multi-threaded app where > each thread creates a send CQ and a receive CQ, which should both be > bound to the same CPU as the thread. If we spread all CQs then it's > impossible to get thread-locality. > > I'm not saying that round-robin is necessarily a bad default policy, but > I do think there needs to be a complete picture of how that policy can > be overridden before we go for multiple interrupt vectors. > > - R. Hello Roland, We can add the multiple interrupt vectors support in two stages: 1. The low-level driver can create multiple interrupt vectors. Their name would include a serial number from 0 to #CPUs-1. The number of completion vectors can be populated through ib_device.num_comp_vectors. Then each ULP can ask for a specific completion vector when creating a CQ, which means that passing vector=0 while creating a CQ will assign it to completion vector #0. 2. As the second stage, we can create a "don't care" value which would mean that the driver can attach the CQ to any completion vector.
In this case the policy shouldn't necessarily be round-robin. We can manage the number of "clients" for each completion vector and then assign the CQ to the least busy one. What is your opinion on this solution? Thanks, Yevgeny
From rdreier at cisco.com Mon May 19 08:43:48 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 08:43:48 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4831347A.1010506@voltaire.com> (Or Gerlitz's message of "Mon, 19 May 2008 11:04:10 +0300") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <48301F74.4020905@voltaire.com> <4831347A.1010506@voltaire.com> Message-ID: > I see. Just wondering, in the mlx4 case, is it a must to use dma > consistent memory allocation or would dma mapping work too? dma mapping would work too but then handling the map/unmap becomes an issue. I think it is way too complicated to add new verbs for map/unmap fastreg page list (in addition to the alloc/free fastreg page list that we are already adding) and force the consumer to do it. And if we expect the low-level driver to do it, then the map is easy (can be done while posting the send) but the unmap is a pain -- it would have to be done inside poll_cq when reaping the completion, and the low-level driver would have to keep some complicated extra data structure to go back from the completion to the original fast reg page list structure.
From rdreier at cisco.com Mon May 19 08:44:57 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 08:44:57 -0700 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: <200805191825.38176.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 19 May 2008 18:25:37 +0300") References: <483049BF.4050603@dev.mellanox.co.il> <200805191603.50735.jackm@dev.mellanox.co.il> <200805191825.38176.jackm@dev.mellanox.co.il> Message-ID: > I just remembered. If we create FMRs using 512 as the device page size, we will > use 8 times the MTT entries as we would if the page size was 4K. The ULP (rds and iser) > can run out of MTT entries much faster. Seems like a ULP issue -- if they don't want 512 byte pages for FMRs, why are they asking for them? - R.
From rdreier at cisco.com Mon May 19 08:45:49 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 08:45:49 -0700 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: <15ddcffd0805190755ka46eb48p523272b1e10f00c9@mail.gmail.com> (Or Gerlitz's message of "Mon, 19 May 2008 17:55:42 +0300") References: <483049BF.4050603@dev.mellanox.co.il> <15ddcffd0805190744uaf4710ah9a00a5653fe9f45a@mail.gmail.com> <15ddcffd0805190755ka46eb48p523272b1e10f00c9@mail.gmail.com> Message-ID: > I understand that for the new functionality to take effect with new > kernels, the admin has to set the module param to a non-default value, > correct? So you are fine with that? You misunderstood the patch I think (unless I did). By default new kernel + new firmware gets the smaller page size. The kernel module parameter does seem kind of useless. - R.
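To make the MTT arithmetic in the exchange above concrete: with a 512-byte device page (page_shift = 9) a 4K buffer costs eight MTT entries, while with page_shift = 12 it costs one. A ULP that maps buffers in 4K units can say so explicitly when allocating its FMR pool. The sketch below is illustrative only (the helper name and the max_pages/max_maps limits are assumptions, not code from any patch in this thread):

#include <rdma/ib_verbs.h>

/* Illustrative: an FMR that maps in 4K units, so one mapping of
 * max_pages * 4K = 256K consumes at most 64 MTT entries. */
static struct ib_fmr *example_alloc_4k_fmr(struct ib_pd *pd)
{
	struct ib_fmr_attr attr = {
		.max_pages  = 64,
		.max_maps   = 32,
		.page_shift = 12,	/* 4K pages, not the 512-byte minimum */
	};

	return ib_alloc_fmr(pd, IB_ACCESS_LOCAL_WRITE |
				IB_ACCESS_REMOTE_READ |
				IB_ACCESS_REMOTE_WRITE,
			    &attr);
}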
From swise at opengridcomputing.com Mon May 19 08:46:39 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 19 May 2008 10:46:39 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <48301F74.4020905@voltaire.com> <4831347A.1010506@voltaire.com> Message-ID: <4831A0DF.2070603@opengridcomputing.com> Roland Dreier wrote: > > I see. Just wondering, in the mlx4 case, is it a must to use dma > > consistent memory allocation or dma mapping would work too? > > dma mapping would work too but then handling the map/unmap becomes an > issue. I think it is way too complicated too add new verbs for > map/unmap fastreg page list (in addition to the alloc/free fastreg page > list that we are already adding) and force the consumer to do it. And > if we expect the low-level driver to do it, then the map is easy (can be > done while posting the send) but the unmap is a pain -- it would have to > be done inside poll_cq when reapind the completion, and the low-level > driver would have to keep some complicated extra data structure to go > back from the completion to the original fast reg page list structure. > And certain platforms can fail map requests (like PPC64) because they have limited resources for dma mapping. So then you'd fail a SQ work request when you might not want to... Steve. From olga.shern at gmail.com Mon May 19 08:47:57 2008 From: olga.shern at gmail.com (Olga Shern (Voltaire)) Date: Mon, 19 May 2008 18:47:57 +0300 Subject: [ofa-general] [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: <1211202081.12616.417.camel@hrosenstock-ws.xsigo.com> References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> <1211121890.12616.381.camel@hrosenstock-ws.xsigo.com> <483046CA.3010403@Voltaire.COM> <1211202081.12616.417.camel@hrosenstock-ws.xsigo.com> Message-ID: On 5/19/08, Hal Rosenstock wrote: > > On Sun, 2008-05-18 at 18:10 +0300, Moni Shoua wrote: > > Hal Rosenstock wrote: > > > On Sun, 2008-05-18 at 15:36 +0300, Moni Shoua wrote: > > >> The purpose of this patch is to make the events that are related to SM > change > > >> (namely CLIENT_REREGISTER event and SM_CHANGE event) less disruptive. > > >> When SM related events are handled, it is not necessary to flush > unicast > > >> info from device but only multicast info. > > > > > > How is unicast invalidation handled on these changes ? On a local LID > > > change event, how does an end port know/determine what else (e.g. other > > > LIDs, paths) the SM might have changed (that specifically might affect > > > IPoIB since this is limited to IPoIB) ? > > I'm not sure I understand the question but local LID change would be > handled as before > > with a LID_CHANGE event. For this type of event, there is not change in > what IPoIB does to cope. > > It's SM change which I'm not sure about. I'm unaware of an IBA spec > guarantee on preservation of paths on SM failover. Can you point me at > this ? > > Also, as many routing protocols are dependent on where they are run in > the subnet (location of SM node in the topology), I don't think all path > parameters can be maintained when in a heterogeneous subnet and hence > would need refreshing (or flushing to cause this) on an SM change event. > > So while it may work in a homogeneous subnet, I don't think this is the > general case. 
You are rigth there is no IBA spec request to preserve LIDs but all SMs that we are familiar with, are doing so. You are refering to the case where there is remote LID change but not local LID change, but also without this patch this case is not taken care of. We should think about solution for this case in the future. > > Also, wouldn't there be similar issues with other ULPs ? > > There might be but the purpose of this one is to make things better for > IPoIB > > Understood; just trying to widen the scope. IMO other ULPs should at > least be inspected for the same issues. The multicast issue is IPoIB > specific but local LID, client reregister (maybe only events for other > ULPs as multicast and service records may not apply (perhaps except DAPL > but this may be old implementation)) and SM changes apply to all. > > -- Hal > > > > -- Hal > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Mon May 19 08:49:04 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 08:49:04 -0700 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: <483199EC.7070900@Voltaire.COM> (Moni Shoua's message of "Mon, 19 May 2008 18:17:00 +0300") References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> <483199EC.7070900@Voltaire.COM> Message-ID: and what happens if alloc_mad() is called while port->sm_ah is NULL? From Sumit.Gaur at Sun.COM Mon May 19 09:08:56 2008 From: Sumit.Gaur at Sun.COM (Sumit Gaur) Date: Mon, 19 May 2008 21:38:56 +0530 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> <48318575.7060701@Sun.COM> <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> Message-ID: <4831A618.9090806@Sun.COM> Hal Rosenstock wrote: > On Mon, 2008-05-19 at 19:19 +0530, Sumit Gaur - Sun Microsystem wrote: > >> Hal Rosenstock wrote: >> >>> Hi Sumit, >>> >>> On Mon, 2008-05-19 at 17:20 +0530, Sumit Gaur - Sun Microsystem wrote: >>> >>> >>>> Hi Hal, >>>> >>>> >>>> Hal Rosenstock wrote: >>>> >>>> >>>>> Sumit, >>>>> >>>>> On Mon, 2008-05-19 at 15:25 +0530, Sumit Gaur - Sun Microsystem wrote: >>>>> >>>>> >>>>> >>>>>> Hi >>>>>> I have an issue while my program interacting with OFED umad library. >>>>>> >>>>> Are you referring to libibumad ? >>>>> >>>> yes, I am using mad_receive(0, -1) function to get my response back. >>>> >>> OK. >>> >>> >>> >>>>>> I have two >>>>>> separate threads one for sending SMP,GMP packets and another to receive >>>>>> response. Things are working fine but during the whole process I keep receiving >>>>>> packets with unknown tid apart from correct response. >>>>>> >>>>> What's the exact message ? 
>>>>> >>>> Response comes as proper mad packets but with "tid" that I have never send and >>>> my logic to keep track of send/response pkts failed. >>>> >>>> >>>>>> Is it a correct behavior. >>>>>> >>>>> It could be; there's not enough info as to what is going on. It could be >>>>> some unsolicited message (e.g. from SM) comes in during your >>>>> transactions. Can you see what MADs are incoming ? One way to do that >>>>> would be to run madeye. >>>>> >>>> Yes I could see complete mad with madhdr as following fields >>>> >>>> Response TID2 = 0x000000006701869b , BaseVersion = 1, MgmtClass=129, >>>> ClassVersion=1, R_Method=129, ClassSpecific=1, Status=128, AttributeID=435 >>>> >>> Class 129 is a Subn directed route packet. Some of the other info (like >>> attribute ID) doesn't look right to me but maybe that's something >>> "special" to your environment. >>> >> Sorry missed last number AttributeID=4352 >> > > I don't know what that attribute ID is so there's something different > about that. > > Out of curiousity, what SM are you using ? > > >>>> If these are unsolicited packets. Is there anyway to filter them. >>>> >>> Yes. How do you register ? >>> >> For registration I am calling madrpc_init(ca, ca_port, mgmt_classes, 4) >> function once before starting polling thread for SMI and GSI packet receive. >> Once I received packet I am filtering them on the basis of madhdr->MgmtClass. >> >> int mgmt_classes[4] = {IB_SMI_CLASS,IB_SMI_DIRECT_CLASS, IB_SA_CLASS, >> IB_PERFORMANCE_CLASS}; >> >> for given ca and ca port of local node. >> > > That looks like it would register with a NULL method mask which should > filter unsolicited packets. > > I think I see the issue: the incoming packet appears to have a method of > 129 (GetResp) which has the response bit on so it's not considered > unsolicited. You need to see what exactly that packet is and where it's > coming from and why. > > -- Hal > > Hi Hal, It is true that packets received are looks like proper response but as I mentioned before they content TID that I have never send to OFED and this cause the problem. Why OFED is sending these extra packets Is the matter to investigate. sumit sumit >>> >>>> Any reference to madeye ? >>>> >>> There's only the code for this (kernel module) which is added by OFED >>> (not upstream) in drivers/infiniband/util but it's pretty >>> straightforward to use. >>> >>> -- Hal >>> >>> >>> >>>>>> If yes how I could avoid them ? >>>>>> >>>>> Not sure what you are seeing yet. >>>>> >>>>> -- Hal >>>>> >>>>> >>>>> >>>>> >>>>>> Thanks and Regards >>>>>> sumit >>>>>> >>>>>> general-request at lists.openfabrics.org wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Send general mailing list submissions to >>>>>>> general at lists.openfabrics.org >>>>>>> >>>>>>> To subscribe or unsubscribe via the World Wide Web, visit >>>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>>>> or, via email, send a message with subject or body 'help' to >>>>>>> general-request at lists.openfabrics.org >>>>>>> >>>>>>> You can reach the person managing the list at >>>>>>> general-owner at lists.openfabrics.org >>>>>>> >>>>>>> When replying, please edit your Subject line so it is more specific >>>>>>> than "Re: Contents of general digest..." >>>>>>> >>>>>>> >>>>>>> Today's Topics: >>>>>>> >>>>>>> 1. Re: [PATCH] IB/core: handle race between elements in qork >>>>>>> queues after event (Roland Dreier) >>>>>>> 2. Re: RDS flow control (Steve Wise) >>>>>>> 3. Re: RDS flow control (Olaf Kirch) >>>>>>> 4. 
Re: RDS flow control (Steve Wise) >>>>>>> 5. Re: RDS flow control (Olaf Kirch) >>>>>>> 6. Re: [PATCH 3/3] IB/ipath - fix RDMA read response sequence >>>>>>> checking (Roland Dreier) >>>>>>> 7. Re: [PATCH][INFINIBAND]: Make ipath_portdata work with >>>>>>> struct pid * not pid_t. (Roland Dreier) >>>>>>> 8. Re: bitops take an unsigned long * (Roland Dreier) >>>>>>> >>>>>>> >>>>>>> ---------------------------------------------------------------------- >>>>>>> >>>>>>> Message: 1 >>>>>>> Date: Tue, 13 May 2008 10:41:39 -0700 >>>>>>> From: Roland Dreier >>>>>>> Subject: Re: [ofa-general] [PATCH] IB/core: handle race between >>>>>>> elements in qork >>>>>>> queues after event >>>>>>> To: Moni Shoua >>>>>>> Cc: Olga Stern , OpenFabrics General >>>>>>> >>>>>>> Message-ID: >>>>>>> Content-Type: text/plain; charset=us-ascii >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Can we please go on with this patch? We would like to see it in the next kernel. >>>>>>>> >>>>>>> I still don't get why this is important to you. Is there a concrete >>>>>>> example of a situation where this actually makes a measurable difference? >>>>>>> >>>>>>> We need some justification for adding this locking complexity beyond "it >>>>>>> doesn't hurt." (And also of course we need it fixed so there aren't races) >>>>>>> >>>>>>> - R. >>>>>>> >>>>>>> >>>>>>> ------------------------------ >>>>>>> >>>>>>> [...]
>>>>>>> >>>>>>> Olaf >>>>>>> >>>>>> _______________________________________________ >>>>>> general mailing list >>>>>> general at lists.openfabrics.org >>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>>> >>>>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>>>>> >>>>> > > From hrosenstock at xsigo.com Mon May 19 09:17:50 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 19 May 2008 09:17:50 -0700 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <4831A618.9090806@Sun.COM> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> <48318575.7060701@Sun.COM> <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> <4831A618.9090806@Sun.COM> Message-ID: <1211213870.12616.475.camel@hrosenstock-ws.xsigo.com> Sumit, On Mon, 2008-05-19 at 21:38 +0530, Sumit Gaur wrote: > Hi Hal, > It is true that packets received are looks like proper response but as I > mentioned before they content TID that I have never send to OFED and > this cause the problem. Why OFED is sending these extra packets Is the > matter to investigate. The received packet is SM class attribute ID 4352 which is non IBA standard and AFAIK OFED does not send so it likely comes from some non OFED software. As far as why it is being received, it is a response to a class your application is subscribed to so it passes it through. As to what is going on, some sort of packet trace would be needed. -- Hal > sumit > sumit From or.gerlitz at gmail.com Mon May 19 09:27:56 2008 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Mon, 19 May 2008 19:27:56 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: References: <483049BF.4050603@dev.mellanox.co.il> <200805191603.50735.jackm@dev.mellanox.co.il> <200805191825.38176.jackm@dev.mellanox.co.il> Message-ID: <15ddcffd0805190927n4dc16c12m22aa9b5e219ed65e@mail.gmail.com> On 5/19/08, Roland Dreier wrote: > > Seems like a ULP issue -- if they don't want 512 byte pages for FMRs, > why are they asking for them? > Indeed, the ULP code has to do well with the reported sizes, etc Or. -------------- next part -------------- An HTML attachment was scrubbed... URL: From balaji at mcs.anl.gov Mon May 19 08:46:58 2008 From: balaji at mcs.anl.gov (Pavan Balaji) Date: Mon, 19 May 2008 10:46:58 -0500 Subject: [ofa-general] [p2s2-announce] Reminder: P2S2 Workshop Deadline Coming Up Message-ID: <4831A0F2.70106@mcs.anl.gov> We would like to remind you all that the P2S2 Workshop deadline is coming up in a few days (May 21st). We look forward to receiving paper submissions from you. Please note in the CFP below that the actual workshop is moved from the first day of the ICPP conference (Sep. 8th) to the last day (Sep. 12th), so that it does not conflict with other conferences in the same area. This announcement list is for people who are interested in the P2S2 workshop. If you are not interested in these announcements, information on how to unsubscribe from this list is available at the bottom of this email. ======================================================================== CALL FOR PAPERS =============== First International Workshop on Parallel Programming Models and Systems Software for High-end Computing (P2S2) Sep. 
12th, 2008 Web link: http://www.mcs.anl.gov/events/workshops/p2s2 To be held in conjunction with ICPP-08: The 27th International Conference on Parallel Processing Sep. 8-12, 2008 Portland, Oregon, USA SCOPE ----- The goal of this workshop is to bring together researchers and practitioners in parallel programming models and systems software for high-end computing systems. Please join us in a discussion of new ideas, experiences, and the latest trends in these areas at the workshop. TOPICS OF INTEREST ------------------ The focus areas for this workshop include, but are not limited to: * Programming models and their high-performance implementations o MPI, Sockets, OpenMP, Global Arrays, X10, UPC, Chapel o Other Hybrid Programming Models * Systems software for scientific and enterprise computing o Communication sub-subsystems for high-end computing o High-performance File and storage systems o Fault-tolerance techniques and implementations o Efficient and high-performance virtualization and other management mechanisms * Tools for Management, Maintenance, Coordination and Synchronization o Software for Enterprise Data-centers using Modern Architectures o Job scheduling libraries o Management libraries for large-scale system o Toolkits for process and task coordination on modern platforms * Performance evaluation, analysis and modeling of emerging computing platforms PROCEEDINGS ----------- Proceedings of this workshop will be published by the IEEE Computer Society (together with the ICPP conference proceedings) in CD format only and will be available at the conference. SUBMISSION INSTRUCTIONS ----------------------- Submissions should be in PDF format in U.S. Letter size paper. They should not exceed 8 pages (all inclusive). Submissions will be judged based on relevance, significance, originality, correctness and clarity. DATES AND DEADLINES ------------------- Paper Submission: Extended to May 21st, 2008 Author Notification: June 4th, 2008 Camera Ready: June 18th, 2008 PROGRAM CHAIRS -------------- * Pavan Balaji (Argonne National Laboratory) * Sayantan Sur (IBM Research) STEERING COMMITTEE ------------------ * William D. Gropp (University of Illinois Urbana-Champaign) * Dhabaleswar K. Panda (Ohio State University) * Vijay Saraswat (IBM Research) PROGRAM COMMITTEE ----------------- * David Bernholdt (Oak Ridge National Laboratory) * Ron Brightwell (Sandia National Laboratory) * Wu-chun Feng (Virginia Tech) * Richard Graham (Oak Ridge National Laboratory) * Hyun-wook Jin (Konkuk University, South Korea) * Sameer Kumar (IBM Research) * Doug Lea (State University of New York at Oswego) * Jarek Nieplocha (Pacific Northwest National Laboratory) * Scott Pakin (Los Alamos National Laboratory) * Vivek Sarkar (Rice University) * Rajeev Thakur (Argonne National Laboratory) * Pete Wyckoff (Ohio Supercomputing Center) If you have any questions, please contact us at p2s2-chairs at mcs.anl.gov ======================================================================== If you do not want to receive any more announcements regarding the P2S2 workshop, please send an email to majordomo at mcs.anl.gov with the email body (not email subject) as "unsubscribe p2s2-announce". 
======================================================================== -- Pavan Balaji http://www.mcs.anl.gov/~balaji
From tziporet at dev.mellanox.co.il Mon May 19 09:41:28 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 19 May 2008 19:41:28 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: References: <483049BF.4050603@dev.mellanox.co.il> <200805191603.50735.jackm@dev.mellanox.co.il> <200805191825.38176.jackm@dev.mellanox.co.il> Message-ID: <4831ADB8.6040508@mellanox.co.il> Roland Dreier wrote: > > I just remembered. If we create FMRs using 512 as the device page size, we will > > use 8 times the MTT entries as we would if the page size was 4K. The ULP (rds and iser) > > can run out of MTT entries much faster. > > Seems like a ULP issue -- if they don't want 512 byte pages for FMRs, > why are they asking for them? > I agree - each ULP should ask for the size it needs and not rely on the HCA to decide this. This feature was added for backward compatibility of older kernels with the upcoming FW, and there is no need to add more options to the driver. Tziporet
From tziporet at dev.mellanox.co.il Mon May 19 09:43:02 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 19 May 2008 19:43:02 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: References: <483049BF.4050603@dev.mellanox.co.il> <15ddcffd0805190744uaf4710ah9a00a5653fe9f45a@mail.gmail.com> <15ddcffd0805190755ka46eb48p523272b1e10f00c9@mail.gmail.com> Message-ID: <4831AE16.9080807@mellanox.co.il> Roland Dreier wrote: > You misunderstood the patch I think (unless I did). By default new > kernel + new firmware gets the smaller page size. > This is correct. Tziporet
From sashak at voltaire.com Mon May 19 10:01:30 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 May 2008 20:01:30 +0300 Subject: [ofa-general] Re: [RFC][PATCH 0/4] opensm: using conventional config file In-Reply-To: <1207851427.7695.123.camel@cardanus.llnl.gov> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1207851427.7695.123.camel@cardanus.llnl.gov> Message-ID: <20080519170130.GH4616@sashak.voltaire.com> Hi Al, On 11:17 Thu 10 Apr , Al Chu wrote: > > I suddenly thought about this. If the /var/cache/opensm/opensm.opts > file is no longer readable (and presumably people will not know about it > b/c it is not documented anywhere), At the moment I changed "usage" ('--help' option) message accordingly. > how will users know how to write the > opensm.conf? /var/cache/opensm/opensm.opts will still be writable with the '-c' option. And this can be used as a template. > Will opensm distribute a "template" .conf file with all > values initially commented out?? (I think this is the best idea). I have nothing against it. Sasha
From sashak at voltaire.com Mon May 19 10:03:03 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 May 2008 20:03:03 +0300 Subject: [ofa-general] Re: [RFC][PATCH 0/4] opensm: using conventional config file In-Reply-To: <1210617225.11133.461.camel@cardanus.llnl.gov> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> Message-ID: <20080519170303.GI4616@sashak.voltaire.com> On 11:33 Mon 12 May , Al Chu wrote: > > Ira and I were chatting.
A few other comments: > > 1) Many configuration values are not output by default in opensm right > now, mainly b/c it behaves like a cache rather than an configuration > file. i.e. > > if (p_opts->connect_roots) > fprintf(opts_file, > "# Connect roots (use FALSE if unsure)\n" > "connect_roots %s\n\n", > p_opts->connect_roots ? "TRUE" : "FALSE"); > > Going forward w/ a config file, I think these should be output by > default all the time so users know they exist. Good point! Will submit patches shortly. > 2) Will there be an option to specify an alternate configuration file, > i.e. not /etc/opensm/opensm.conf? Yes, '-F' or '--config' option. Sasha From sashak at voltaire.com Mon May 19 10:06:24 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 May 2008 20:06:24 +0300 Subject: [ofa-general] Re: [RFC][PATCH 0/4] opensm: using conventional config file In-Reply-To: <20080512144541.3879de40.weiny2@llnl.gov> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080512144541.3879de40.weiny2@llnl.gov> Message-ID: <20080519170624.GJ4616@sashak.voltaire.com> Hi Ira, On 14:45 Mon 12 May , Ira Weiny wrote: > > Also, I wonder if anyone would object to you applying your patches to the tree > as is and we work out the details from there? I don't see anything wrong with > your patches except that more work will be needed, as you said, in the man > pages and scripts. > > After you apply your patches I think we can start in changing the man pages and > scripts. Basically I'm fine with such approach. But I think applying "AS IS" will yet break startup scripts, I will look at this to setup at least temporary solution there. Sasha From sashak at voltaire.com Mon May 19 10:08:39 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 May 2008 20:08:39 +0300 Subject: [ofa-general] [PATCH] opensm: merge disable_multicast and no_multicast_option options In-Reply-To: <20080519170303.GI4616@sashak.voltaire.com> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080519170303.GI4616@sashak.voltaire.com> Message-ID: <20080519170839.GK4616@sashak.voltaire.com> I cannot find how those options should be different. Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_subnet.h | 3 +-- opensm/opensm/osm_sa_class_port_info.c | 2 +- opensm/opensm/osm_subnet.c | 10 ++-------- 3 files changed, 4 insertions(+), 11 deletions(-) diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index b1dd659..daab453 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -221,7 +221,6 @@ typedef struct _osm_subn_opt { boolean_t reassign_lids; boolean_t ignore_other_sm; boolean_t single_thread; - boolean_t no_multicast_option; boolean_t disable_multicast; boolean_t force_log_flush; uint8_t subnet_timeout; @@ -338,7 +337,7 @@ typedef struct _osm_subn_opt { * ignore_other_sm_option * This flag is TRUE if other SMs on the subnet should be ignored. * -* no_multicast_option +* disable_multicast * This flag is TRUE if OpenSM should disable multicast support. 
* * max_msg_fifo_timeout diff --git a/opensm/opensm/osm_sa_class_port_info.c b/opensm/opensm/osm_sa_class_port_info.c index f0afb32..0839c1b 100644 --- a/opensm/opensm/osm_sa_class_port_info.c +++ b/opensm/opensm/osm_sa_class_port_info.c @@ -167,7 +167,7 @@ __osm_cpi_rcv_respond(IN osm_sa_t * sa, if (sa->p_subn->opt.qos) ib_class_set_cap_mask2(p_resp_cpi, OSM_CAP2_IS_QOS_SUPPORTED); - if (sa->p_subn->opt.no_multicast_option != TRUE) + if (!sa->p_subn->opt.disable_multicast) p_resp_cpi->cap_mask |= OSM_CAP_IS_UD_MCAST_SUP; p_resp_cpi->cap_mask = cl_hton16(p_resp_cpi->cap_mask); diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 47d735f..a916270 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -409,7 +409,6 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->reassign_lids = FALSE; p_opt->ignore_other_sm = FALSE; p_opt->single_thread = FALSE; - p_opt->no_multicast_option = FALSE; p_opt->disable_multicast = FALSE; p_opt->force_log_flush = FALSE; p_opt->subnet_timeout = OSM_DEFAULT_SUBNET_TIMEOUT; @@ -1230,9 +1229,6 @@ ib_api_status_t osm_subn_parse_conf_file(IN osm_subn_opt_t * const p_opts) opts_unpack_boolean("single_thread", p_key, p_val, &p_opts->single_thread); - opts_unpack_boolean("no_multicast_option", - p_key, p_val, &p_opts->no_multicast_option); - opts_unpack_boolean("disable_multicast", p_key, p_val, &p_opts->disable_multicast); @@ -1673,9 +1669,8 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) "enable_quirks %s\n\n" "# If TRUE disables client reregistration\n" "no_clients_rereg %s\n\n" - "# If TRUE OpenSM should disable multicast support\n" - "no_multicast_option %s\n\n" - "# No multicast routing is performed if TRUE\n" + "# If TRUE OpenSM should disable multicast support and\n" + "# no multicast routing is performed if TRUE\n" "disable_multicast %s\n\n" "# If TRUE opensm will exit on fatal initialization issues\n" "exit_on_fatal %s\n\n" "# console [off|local" @@ -1695,7 +1690,6 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) p_opts->dump_files_dir, p_opts->enable_quirks ? "TRUE" : "FALSE", p_opts->no_clients_rereg ? "TRUE" : "FALSE", - p_opts->no_multicast_option ? "TRUE" : "FALSE", p_opts->disable_multicast ? "TRUE" : "FALSE", p_opts->exit_on_fatal ? "TRUE" : "FALSE", p_opts->console, -- 1.5.4.rc2.60.gb2e62 From sashak at voltaire.com Mon May 19 10:09:16 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 May 2008 20:09:16 +0300 Subject: [ofa-general] [PATCH] opensm: remove unused pfn_ui_* callback options In-Reply-To: <20080519170839.GK4616@sashak.voltaire.com> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080519170303.GI4616@sashak.voltaire.com> <20080519170839.GK4616@sashak.voltaire.com> Message-ID: <20080519170916.GL4616@sashak.voltaire.com> Remove unused pfn_ui_pre_lid_assign and pfn_ui_mcast_fdb_assign callbacks from OpenSM subnet options. 
Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_subnet.h | 20 ------------------ opensm/opensm/osm_lid_mgr.c | 7 ------ opensm/opensm/osm_mcast_mgr.c | 40 ++++++----------------------------- opensm/opensm/osm_subnet.c | 4 --- 4 files changed, 7 insertions(+), 64 deletions(-) diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index daab453..56b0165 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -248,10 +248,6 @@ typedef struct _osm_subn_opt { uint16_t console_port; cl_map_t port_prof_ignore_guids; boolean_t port_profile_switch_nodes; - osm_pfn_ui_extension_t pfn_ui_pre_lid_assign; - void *ui_pre_lid_assign_ctx; - osm_pfn_ui_mcast_extension_t pfn_ui_mcast_fdb_assign; - void *ui_mcast_fdb_assign_ctx; boolean_t sweep_on_trap; char *routing_engine_name; boolean_t connect_roots; @@ -412,22 +408,6 @@ typedef struct _osm_subn_opt { * If TRUE will count the number of switch nodes routed through * the link. If FALSE - only CA/RT nodes are counted. * -* pfn_ui_pre_lid_assign -* A UI function to be invoked prior to lid assigment. It should -* return 1 if any change was made to any lid or 0 otherwise. -* -* ui_pre_lid_assign_ctx -* A UI context (void *) to be provided to the pfn_ui_pre_lid_assign -* -* pfn_ui_mcast_fdb_assign -* A UI function to be called inside the mcast manager instead of -* the call for the build spanning tree. This will be called on -* every multicast call for create, join and leave, and is -* responsible for the mcast FDB configuration. -* -* ui_mcast_fdb_assign_ctx -* A UI context (void *) to be provided to the pfn_ui_mcast_fdb_assign -* * sweep_on_trap * Received traps will initiate a new sweep. * diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c index af0d020..7f25750 100644 --- a/opensm/opensm/osm_lid_mgr.c +++ b/opensm/opensm/osm_lid_mgr.c @@ -1212,13 +1212,6 @@ osm_signal_t osm_lid_mgr_process_sm(IN osm_lid_mgr_t * const p_mgr) persistent db */ __osm_lid_mgr_init_sweep(p_mgr); - if (p_mgr->p_subn->opt.pfn_ui_pre_lid_assign) { - OSM_LOG(p_mgr->p_log, OSM_LOG_VERBOSE, - "Invoking UI function pfn_ui_pre_lid_assign\n"); - p_mgr->p_subn->opt.pfn_ui_pre_lid_assign(p_mgr->p_subn->opt. - ui_pre_lid_assign_ctx); - } - /* Set the send_set_reqs of the p_mgr to FALSE, and we'll see if any set requests were sent. If not - can signal OSM_SIGNAL_DONE */ diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c index 683a16d..a6185fe 100644 --- a/opensm/opensm/osm_mcast_mgr.c +++ b/opensm/opensm/osm_mcast_mgr.c @@ -1085,7 +1085,6 @@ osm_mcast_mgr_process_tree(osm_sm_t * sm, { ib_api_status_t status = IB_SUCCESS; ib_net16_t mlid; - boolean_t ui_mcast_fdb_assign_func_defined; OSM_LOG_ENTER(sm->p_log); @@ -1107,44 +1106,19 @@ osm_mcast_mgr_process_tree(osm_sm_t * sm, goto Exit; } - if (sm->p_subn->opt.pfn_ui_mcast_fdb_assign) - ui_mcast_fdb_assign_func_defined = TRUE; - else - ui_mcast_fdb_assign_func_defined = FALSE; - /* Clear the multicast tables to start clean, then build the spanning tree which sets the mcast table bits for each port in the group. - We will clean the multicast tables if a ui_mcast function isn't - defined, or if such function is defined, but we got here - through a MC_CREATE request - this means we are creating a new - multicast group - clean all old data. 
*/ - if (ui_mcast_fdb_assign_func_defined == FALSE || - req_type == OSM_MCAST_REQ_TYPE_CREATE) - __osm_mcast_mgr_clear(sm, p_mgrp); - - /* If a UI function is defined, then we will call it here. - If not - the use the regular build spanning tree function */ - if (ui_mcast_fdb_assign_func_defined == FALSE) { - status = __osm_mcast_mgr_build_spanning_tree(sm, p_mgrp); - if (status != IB_SUCCESS) { - OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A17: " - "Unable to create spanning tree (%s)\n", - ib_get_err_str(status)); - goto Exit; - } - } else { - if (osm_log_is_active(sm->p_log, OSM_LOG_DEBUG)) { - OSM_LOG(sm->p_log, OSM_LOG_DEBUG, - "Invoking UI function pfn_ui_mcast_fdb_assign\n"); - } + __osm_mcast_mgr_clear(sm, p_mgrp); - sm->p_subn->opt.pfn_ui_mcast_fdb_assign(sm->p_subn->opt. - ui_mcast_fdb_assign_ctx, - mlid, req_type, - port_guid); + status = __osm_mcast_mgr_build_spanning_tree(sm, p_mgrp); + if (status != IB_SUCCESS) { + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A17: " + "Unable to create spanning tree (%s)\n", + ib_get_err_str(status)); + goto Exit; } Exit: diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index a916270..2191f2d 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -453,10 +453,6 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->qos_policy_file = OSM_DEFAULT_QOS_POLICY_FILE; p_opt->accum_log_file = TRUE; p_opt->port_profile_switch_nodes = FALSE; - p_opt->pfn_ui_pre_lid_assign = NULL; - p_opt->ui_pre_lid_assign_ctx = NULL; - p_opt->pfn_ui_mcast_fdb_assign = NULL; - p_opt->ui_mcast_fdb_assign_ctx = NULL; p_opt->sweep_on_trap = TRUE; p_opt->routing_engine_name = NULL; p_opt->connect_roots = FALSE; -- 1.5.4.rc2.60.gb2e62 From sashak at voltaire.com Mon May 19 10:10:06 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 May 2008 20:10:06 +0300 Subject: [ofa-general] [PATCH] opensm: port_prof_ignore_file option In-Reply-To: <20080519170839.GK4616@sashak.voltaire.com> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080519170303.GI4616@sashak.voltaire.com> <20080519170839.GK4616@sashak.voltaire.com> Message-ID: <20080519171006.GM4616@sashak.voltaire.com> Move run-time port_prof_ignore_guids map to osm_subnet_t struct and instead in options define port_prof_ignore_file - a name of the file with port guids to be ignored by port profiling. Command line option '-i' ('--ignore-guids') will work as before. 
Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_port_profile.h | 7 +++---- opensm/include/opensm/osm_subnet.h | 10 +++++++--- opensm/opensm/main.c | 9 ++++----- opensm/opensm/osm_subnet.c | 17 ++++++++++++++--- 4 files changed, 28 insertions(+), 15 deletions(-) diff --git a/opensm/include/opensm/osm_port_profile.h b/opensm/include/opensm/osm_port_profile.h index 2442850..bbb59ef 100644 --- a/opensm/include/opensm/osm_port_profile.h +++ b/opensm/include/opensm/osm_port_profile.h @@ -205,7 +205,7 @@ static inline boolean_t osm_port_prof_is_ignored_port(IN const osm_subn_t * p_subn, IN ib_net64_t port_guid, IN uint8_t port_num) { - const cl_map_t *p_map = &(p_subn->opt.port_prof_ignore_guids); + const cl_map_t *p_map = &p_subn->port_prof_ignore_guids; const void *p_obj = cl_map_get(p_map, port_guid); size_t res; @@ -246,7 +246,7 @@ static inline void osm_port_prof_set_ignored_port(IN osm_subn_t * p_subn, IN ib_net64_t port_guid, IN uint8_t port_num) { - cl_map_t *p_map = &(p_subn->opt.port_prof_ignore_guids); + cl_map_t *p_map = &p_subn->port_prof_ignore_guids; const void *p_obj = cl_map_get(p_map, port_guid); size_t value = 0; @@ -259,8 +259,7 @@ osm_port_prof_set_ignored_port(IN osm_subn_t * p_subn, } value = value | (1 << port_num); - cl_map_insert(&(p_subn->opt.port_prof_ignore_guids), - port_guid, (void *)value); + cl_map_insert(p_map, port_guid, (void *)value); } /* diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index 56b0165..349ba79 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -246,7 +246,7 @@ typedef struct _osm_subn_opt { boolean_t accum_log_file; char *console; uint16_t console_port; - cl_map_t port_prof_ignore_guids; + char *port_prof_ignore_file; boolean_t port_profile_switch_nodes; boolean_t sweep_on_trap; char *routing_engine_name; @@ -401,8 +401,8 @@ typedef struct _osm_subn_opt { * If TRUE (default) - the log file will be accumulated. * If FALSE - the log file will be erased before starting current opensm run. * -* port_prof_ignore_guids -* A map of guids to be ignored by port profiling. +* port_prof_ignore_file +* Name of file with port guids to be ignored by port profiling. * * port_profile_switch_nodes * If TRUE will count the number of switch nodes routed through @@ -531,6 +531,7 @@ typedef struct _osm_subn { cl_qlist_t sa_sr_list; cl_qlist_t sa_infr_list; cl_ptr_vector_t port_lid_tbl; + cl_map_t port_prof_ignore_guids; ib_net16_t master_sm_base_lid; ib_net16_t sm_base_lid; ib_net64_t sm_port_guid; @@ -587,6 +588,9 @@ typedef struct _osm_subn { * Container of pointers to all Port objects in the subent. * Indexed by port LID. * +* port_prof_ignore_guids +* A map of guids to be ignored by port profiling. +* * master_sm_base_lid * The base LID owned by the subnet's master SM. * diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index fb41d50..89a42b4 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -596,7 +596,6 @@ int main(int argc, char *argv[]) int32_t vendor_debug = 0; uint32_t next_option; boolean_t cache_options = FALSE; - char *ignore_guids_file_name = NULL; uint32_t val; const char *const short_option = "i:f:ed:g:l:L:s:t:a:u:m:R:zM:U:S:P:Y:NBIQvVhorcyxp:n:q:k:C:"; @@ -702,9 +701,9 @@ int main(int argc, char *argv[]) /* Specifies ignore guids file. 
*/ - ignore_guids_file_name = optarg; + opt.port_prof_ignore_file = optarg; printf(" Ignore Guids File = %s\n", - ignore_guids_file_name); + opt.port_prof_ignore_file); break; case 'g': @@ -1027,8 +1026,8 @@ int main(int argc, char *argv[]) /* * Define some port guids to ignore during path equalization */ - if (ignore_guids_file_name != NULL) { - status = parse_ignore_guids_file(ignore_guids_file_name, &osm); + if (opt.port_prof_ignore_file != NULL) { + status = parse_ignore_guids_file(opt.port_prof_ignore_file, &osm); if (status != IB_SUCCESS) { printf("\nError from parse_ignore_guids_file (0x%X)\n", status); diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 2191f2d..20add92 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -166,8 +166,9 @@ void osm_subn_destroy(IN osm_subn_t * const p_subn) } cl_ptr_vector_destroy(&p_subn->port_lid_tbl); - cl_map_remove_all(&(p_subn->opt.port_prof_ignore_guids)); - cl_map_destroy(&(p_subn->opt.port_prof_ignore_guids)); + + cl_map_remove_all(&p_subn->port_prof_ignore_guids); + cl_map_destroy(&p_subn->port_prof_ignore_guids); osm_qos_policy_destroy(p_subn->p_qos_policy); @@ -212,7 +213,7 @@ osm_subn_init(IN osm_subn_t * const p_subn, p_subn->min_ca_rate = IB_MAX_RATE; /* note that insert and remove are part of the port_profile thing */ - cl_map_init(&(p_subn->opt.port_prof_ignore_guids), 10); + cl_map_init(&p_subn->port_prof_ignore_guids, 10); p_subn->ignore_existing_lfts = TRUE; @@ -452,6 +453,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->qos = FALSE; p_opt->qos_policy_file = OSM_DEFAULT_QOS_POLICY_FILE; p_opt->accum_log_file = TRUE; + p_opt->port_prof_ignore_file = NULL; p_opt->port_profile_switch_nodes = FALSE; p_opt->sweep_on_trap = TRUE; p_opt->routing_engine_name = NULL; @@ -1270,6 +1272,9 @@ ib_api_status_t osm_subn_parse_conf_file(IN osm_subn_opt_t * const p_opts) opts_unpack_uint8("log_flags", p_key, p_val, &p_opts->log_flags); + opts_unpack_charp("port_prof_ignore_file", p_key, p_val, + &p_opts->port_prof_ignore_file); + opts_unpack_boolean("port_profile_switch_nodes", p_key, p_val, &p_opts->port_profile_switch_nodes); @@ -1525,6 +1530,12 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) "port_profile_switch_nodes %s\n\n", p_opts->port_profile_switch_nodes ? 
"TRUE" : "FALSE"); + if (p_opts->port_prof_ignore_file) + fprintf(opts_file, + "# Name of file with port guids to be ignored by port profiling\n" + "port_prof_ignore_file %s\n\n", + p_opts->port_prof_ignore_file); + if (p_opts->routing_engine_name) fprintf(opts_file, "# Routing engine\n" -- 1.5.4.rc2.60.gb2e62 From kliteyn at dev.mellanox.co.il Mon May 19 10:12:57 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 19 May 2008 20:12:57 +0300 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <1211213870.12616.475.camel@hrosenstock-ws.xsigo.com> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> <48318575.7060701@Sun.COM> <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> <4831A618.9090806@Sun.COM> <1211213870.12616.475.camel@hrosenstock-ws.xsigo.com> Message-ID: <4831B519.2060002@dev.mellanox.co.il> Hal Rosenstock wrote: > Sumit, > On Mon, 2008-05-19 at 21:38 +0530, Sumit Gaur wrote: > >> Hi Hal, >> It is true that packets received are looks like proper response but as I >> mentioned before they content TID that I have never send to OFED and >> this cause the problem. Why OFED is sending these extra packets Is the >> matter to investigate. > > The received packet is SM class attribute ID 4352 which is non IBA > standard and AFAIK OFED does not send so it likely comes from some non > OFED software. Just a thought: Decimal 4352 is 0x1100. With reverted endian we get 0x0011, which is NodeInfo, that SM sends while sweeping the subnet, which comes at regular interval. As I said, just a thought... -- Yevgeny > As far as why it is being received, it is a response to a class your > application is subscribed to so it passes it through. > > As to what is going on, some sort of packet trace would be needed. > > -- Hal > >> sumit >> sumit > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Mon May 19 10:10:46 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 May 2008 20:10:46 +0300 Subject: [ofa-general] [PATCH] opensm: write all OpenSM options to cache file In-Reply-To: <20080519171006.GM4616@sashak.voltaire.com> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080519170303.GI4616@sashak.voltaire.com> <20080519170839.GK4616@sashak.voltaire.com> <20080519171006.GM4616@sashak.voltaire.com> Message-ID: <20080519171046.GN4616@sashak.voltaire.com> We want to have all OpenSM options in cache file, so it will be useful as configuration template. 
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_subnet.c | 93 ++++++++++++++++++++++---------------------- 1 files changed, 47 insertions(+), 46 deletions(-) diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 20add92..2dc0ca8 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -79,6 +79,8 @@ #define OSM_PATH_MAX 256 #endif +static const char null_str[] = "(null)"; + /********************************************************************** **********************************************************************/ void osm_subn_construct(IN osm_subn_t * const p_subn) @@ -621,7 +623,7 @@ opts_unpack_charp(IN char *p_req_key, cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0); /* special case the "(null)" string */ - if (strcmp("(null)", p_val_str) == 0) { + if (strcmp(null_str, p_val_str) == 0) { *p_val = NULL; } else { /* @@ -1530,51 +1532,50 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) "port_profile_switch_nodes %s\n\n", p_opts->port_profile_switch_nodes ? "TRUE" : "FALSE"); - if (p_opts->port_prof_ignore_file) - fprintf(opts_file, - "# Name of file with port guids to be ignored by port profiling\n" - "port_prof_ignore_file %s\n\n", - p_opts->port_prof_ignore_file); - - if (p_opts->routing_engine_name) - fprintf(opts_file, - "# Routing engine\n" - "# Supported engines: minhop, updn, file, ftree, lash, dor\n" - "routing_engine %s\n\n", p_opts->routing_engine_name); - if (p_opts->connect_roots) - fprintf(opts_file, - "# Connect roots (use FALSE if unsure)\n" - "connect_roots %s\n\n", - p_opts->connect_roots ? "TRUE" : "FALSE"); - if (p_opts->lid_matrix_dump_file) - fprintf(opts_file, - "# Lid matrix dump file name\n" - "lid_matrix_dump_file %s\n\n", - p_opts->lid_matrix_dump_file); - if (p_opts->ucast_dump_file) - fprintf(opts_file, - "# Ucast dump file name\n" - "ucast_dump_file %s\n\n", p_opts->ucast_dump_file); - if (p_opts->root_guid_file) - fprintf(opts_file, - "# The file holding the root node guids (for fat-tree or Up/Down)\n" - "# One guid in each line\n" - "root_guid_file %s\n\n", p_opts->root_guid_file); - if (p_opts->cn_guid_file) - fprintf(opts_file, - "# The file holding the fat-tree compute node guids\n" - "# One guid in each line\n" - "cn_guid_file %s\n\n", p_opts->cn_guid_file); - if (p_opts->ids_guid_file) - fprintf(opts_file, - "# The file holding the node ids which will be used by" - " Up/Down algorithm instead\n# of GUIDs (one guid and" - " id in each line)\n" - "ids_guid_file %s\n\n", p_opts->ids_guid_file); - if (p_opts->sa_db_file) - fprintf(opts_file, - "# SA database file name\n" - "sa_db_file %s\n\n", p_opts->sa_db_file); + fprintf(opts_file, + "# Name of file with port guids to be ignored by port profiling\n" + "port_prof_ignore_file %s\n\n", p_opts->port_prof_ignore_file ? + p_opts->port_prof_ignore_file : null_str); + + fprintf(opts_file, + "# Routing engine\n" + "# Supported engines: minhop, updn, file, ftree, lash, dor\n" + "routing_engine %s\n\n", p_opts->routing_engine_name ? + p_opts->routing_engine_name : null_str); + + fprintf(opts_file, + "# Connect roots (use FALSE if unsure)\n" + "connect_roots %s\n\n", + p_opts->connect_roots ? "TRUE" : "FALSE"); + + fprintf(opts_file, + "# Lid matrix dump file name\n" + "lid_matrix_dump_file %s\n\n", p_opts->lid_matrix_dump_file ? + p_opts->lid_matrix_dump_file : null_str); + + fprintf(opts_file, + "# Ucast dump file name\nucast_dump_file %s\n\n", + p_opts->ucast_dump_file ? 
p_opts->ucast_dump_file : null_str); + + fprintf(opts_file, + "# The file holding the root node guids (for fat-tree or Up/Down)\n" + "# One guid in each line\nroot_guid_file %s\n\n", + p_opts->root_guid_file ? p_opts->root_guid_file : null_str); + + fprintf(opts_file, + "# The file holding the fat-tree compute node guids\n" + "# One guid in each line\ncn_guid_file %s\n\n", + p_opts->cn_guid_file ? p_opts->cn_guid_file : null_str); + + fprintf(opts_file, + "# The file holding the node ids which will be used by" + " Up/Down algorithm instead\n# of GUIDs (one guid and" + " id in each line)\nids_guid_file %s\n\n", + p_opts->ids_guid_file ? p_opts->ids_guid_file : null_str); + + fprintf(opts_file, + "# SA database file name\nsa_db_file %s\n\n", + p_opts->sa_db_file ? p_opts->sa_db_file : null_str); fprintf(opts_file, "#\n# HANDOVER - MULTIPLE SMs OPTIONS\n#\n" -- 1.5.4.rc2.60.gb2e62 From hrosenstock at xsigo.com Mon May 19 10:18:28 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 19 May 2008 10:18:28 -0700 Subject: [ofa-general] [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> <1211121890.12616.381.camel@hrosenstock-ws.xsigo.com> <483046CA.3010403@Voltaire.COM> <1211202081.12616.417.camel@hrosenstock-ws.xsigo.com> Message-ID: <1211217509.12616.487.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-19 at 18:47 +0300, Olga Shern (Voltaire) wrote: > > > On 5/19/08, Hal Rosenstock wrote: On Sun, 2008-05-18 at 18:10 +0300, Moni Shoua wrote: > > Hal Rosenstock wrote: > > > On Sun, 2008-05-18 at 15:36 +0300, Moni Shoua wrote: > > >> The purpose of this patch is to make the events that are related to SM change > > >> (namely the CLIENT_REREGISTER event and the SM_CHANGE event) less disruptive. > > >> When SM related events are handled, it is not necessary to flush unicast > > >> info from the device, but only multicast info. > > > > > > How is unicast invalidation handled on these changes ? On a local LID > > > change event, how does an end port know/determine what else (e.g. other > > > LIDs, paths) the SM might have changed (that specifically might affect > > > IPoIB since this is limited to IPoIB) ? > > I'm not sure I understand the question, but a local LID change would be handled as before > > with a LID_CHANGE event. For this type of event, there is no change in what IPoIB does to cope. > > It's SM change which I'm not sure about. I'm unaware of an IBA spec > guarantee on preservation of paths on SM failover. Can you point me at > this ? > > Also, as many routing protocols are dependent on where they are run in > the subnet (location of the SM node in the topology), I don't think all path > parameters can be maintained in a heterogeneous subnet and hence > would need refreshing (or flushing to cause this) on an SM change event. > > So while it may work in a homogeneous subnet, I don't think this is the > general case. > > You are right, there is no IBA spec requirement to preserve LIDs, but all > SMs that we are familiar with > are doing so. It's more than LID preservation though; it's also routing preservation (and rate, etc.) if a path is rerouted in a heterogeneous subnet on SM failover. In terms of SM LID preservation, in the case of OpenSM there are two additional scenarios where LID preservation on SM failover is not a valid assumption: 1. If the guid2lid files are not sync'd with OpenSM instances. 2. 
If reassign LIDs is used. Also, since it's not a spec requirement, I don't see how this can be relied upon. Maybe there could be some option (mod param) for this being the case in some configurations, with the default being that LID preservation is not assumed ? Also, what happens when that assumption is not valid ? I'm referring to the case where it's treated as a less disruptive event but it really needed to be treated as a more disruptive one. > You are referring to the case where there is a remote LID change but not > a local LID change, Yes, in addition to the SM change event handling issues above. > but also without this patch this case is not taken care of. True. > We should think about a solution for this case in the future. Indeed. -- Hal > > > Also, wouldn't there be similar issues with other ULPs ? > > There might be, but the purpose of this one is to make things better for IPoIB > > Understood; just trying to widen the scope. IMO other ULPs should at > least be inspected for the same issues. The multicast issue is IPoIB > specific but local LID, client reregister (maybe only events for other > ULPs as multicast and service records may not apply (perhaps except DAPL > but this may be old implementation)) and SM changes apply to all. > > -- Hal > > > > -- Hal > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi- > bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From hrosenstock at xsigo.com Mon May 19 10:19:55 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 19 May 2008 10:19:55 -0700 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <4831B519.2060002@dev.mellanox.co.il> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> <48318575.7060701@Sun.COM> <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> <4831A618.9090806@Sun.COM> <1211213870.12616.475.camel@hrosenstock-ws.xsigo.com> <4831B519.2060002@dev.mellanox.co.il> Message-ID: <1211217595.12616.490.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-19 at 20:12 +0300, Yevgeny Kliteynik wrote: > Hal Rosenstock wrote: > > Sumit, > > On Mon, 2008-05-19 at 21:38 +0530, Sumit Gaur wrote: > > > >> Hi Hal, > >> It is true that the packets received look like proper responses, but as I > >> mentioned before they contain a TID that I never sent to OFED, and > >> this causes the problem. Why OFED is sending these extra packets is the > >> matter to investigate. > > > > The received packet is SM class attribute ID 4352, which is not IBA > > standard and which AFAIK OFED does not send, so it likely comes from some non > > OFED software. > > Just a thought: > Decimal 4352 is 0x1100. With the byte order reversed we get 0x0011, > which is NodeInfo, which the SM sends while sweeping the subnet - > and sweeps come at regular intervals. > > As I said, just a thought... Yes, that makes sense to me. As this is an incoming response, maybe this node is running the SM as well as this application. 
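To make the byte-order arithmetic concrete: MAD attribute IDs are 16-bit fields carried in network byte order on the wire, so reading NodeInfo (0x0011) without the byte swap yields 0x1100, i.e. 4352 decimal. A standalone sketch of the swap (illustration only, not OFED code):

	#include <stdio.h>

	int main(void)
	{
		unsigned short attr_id = 0x0011; /* NodeInfo */
		/* swap the two bytes, as a missing ntohs() would leave them */
		unsigned short raw = (attr_id >> 8) | ((attr_id & 0xff) << 8);
		printf("%u\n", raw); /* prints 4352 */
		return 0;
	}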
-- Hal > -- Yevgeny > > > As far as why it is being received, it is a response to a class your > > application is subscribed to, so it passes it through. > > > > As to what is going on, some sort of packet trace would be needed. > > > > -- Hal > > > >> sumit > >> sumit > > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:31:23 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:01:23 +0530 Subject: [ofa-general] [PATCH v2 00/13] QLogic VNIC Driver Message-ID: <20080519102843.12355.832.stgit@localhost.localdomain> Roland, This is the second round of the QLogic Virtual NIC driver patch series for submission to the 2.6.27 kernel. The series has been tested against your for-2.6.27 branch. Based on comments received on the first series of patches, the following fixes are introduced in this series: - Removal of the IB cache implementation for the QLogic VNIC ULP. - netdev->priv structure allocation through alloc_netdev and use of netdev_priv() to access the same. - Implementation of a spinlock to protect potential vnic->current_path race conditions. - Removed the use of the "vnic->xmit_started" variable. - vnic_multicast.c coding style and lock fixes. - Use of the "time_after" macro for jiffies comparison. - vnic_npevent_str has been moved to vnic_main.c to avoid its inclusion every time along with vnic_main.h - Use of the kernel "is_power_of_2" function in place of the driver's own. - The global "recv_ref" variable has been renamed to "vnic_recv_ref". I have signed off all patches in the series. Sparse endianness checking for the driver did not give any warnings, and checkpatch.pl gave a few warnings about lines slightly longer than 80 columns. Background: As mentioned in the first version of the patch series, this series adds the QLogic Virtual NIC (VNIC) driver, which works in conjunction with the QLogic Ethernet Virtual I/O Controller (EVIC) hardware. The VNIC driver, along with the QLogic EVIC's two 10 Gigabit Ethernet ports, enables Infiniband clusters to connect to Ethernet networks. This driver also works with the earlier version of the I/O Controller, the VEx. The QLogic VNIC driver creates virtual Ethernet interfaces and tunnels the Ethernet data to/from the EVIC over Infiniband using an Infiniband reliable connection. 
[PATCH v2 01/13] QLogic VNIC: Driver - netdev implementation [PATCH v2 02/13] QLogic VNIC: Netpath - abstraction of connection to EVIC/VEx [PATCH v2 03/13] QLogic VNIC: Implementation of communication protocol with EVIC/VEx [PATCH v2 04/13] QLogic VNIC: Implementation of Control path of communication protocol [PATCH v2 05/13] QLogic VNIC: Implementation of Data path of communication protocol [PATCH v2 06/13] QLogic VNIC: IB core stack interaction [PATCH v2 07/13] QLogic VNIC: Handling configurable parameters of the driver [PATCH v2 08/13] QLogic VNIC: sysfs interface implementation for the driver [PATCH v2 09/13] QLogic VNIC: IB Multicast for Ethernet broadcast/multicast [PATCH v2 10/13] QLogic VNIC: Driver Statistics collection [PATCH v2 11/13] QLogic VNIC: Driver utility file - implements various utility macros [PATCH v2 12/13] QLogic VNIC: Driver Kconfig and Makefile. [PATCH v2 13/13] QLogic VNIC: Modifications to IB Kconfig and Makefile drivers/infiniband/Kconfig | 2 drivers/infiniband/Makefile | 1 drivers/infiniband/ulp/qlgc_vnic/Kconfig | 28 drivers/infiniband/ulp/qlgc_vnic/Makefile | 13 drivers/infiniband/ulp/qlgc_vnic/vnic_config.c | 379 +++ drivers/infiniband/ulp/qlgc_vnic/vnic_config.h | 242 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_control.c | 2286 ++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_control.h | 179 ++ .../infiniband/ulp/qlgc_vnic/vnic_control_pkt.h | 368 +++ drivers/infiniband/ulp/qlgc_vnic/vnic_data.c | 1492 +++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_data.h | 206 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c | 1043 +++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h | 206 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_main.c | 1098 ++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_main.h | 154 + drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c | 319 +++ drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h | 77 + drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c | 112 + drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h | 79 + drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c | 234 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h | 497 ++++ drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c | 1131 ++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h | 62 + drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h | 103 + drivers/infiniband/ulp/qlgc_vnic/vnic_util.h | 250 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c | 1214 +++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h | 176 ++ 27 files changed, 11951 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/Kconfig create mode 100644 drivers/infiniband/ulp/qlgc_vnic/Makefile create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_config.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_config.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_data.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_data.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c create 
mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_util.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h -- Regards, Ram From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:31:58 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:01:58 +0530 Subject: [ofa-general] [PATCH v2 01/13] QLogic VNIC: Driver - netdev implementation In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103158.12355.61926.stgit@localhost.localdomain> From: Ramachandra K QLogic Virtual NIC Driver. This patch implements netdev registration, netdev functions and state maintenance of the QLogic Virtual NIC corresponding to the various events associated with the QLogic Ethernet Virtual I/O Controller (EVIC/VEx) connection. Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_main.c | 1098 ++++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_main.h | 154 ++++ 2 files changed, 1252 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c new file mode 100644 index 0000000..570c069 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c @@ -0,0 +1,1098 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_netpath.h" +#include "vnic_viport.h" +#include "vnic_ib.h" +#include "vnic_stats.h" + +#define MODULEVERSION "1.3.0.0.4" +#define MODULEDETAILS \ + "QLogic Corp. Virtual NIC (VNIC) driver version " MODULEVERSION + +MODULE_AUTHOR("QLogic Corp."); +MODULE_DESCRIPTION(MODULEDETAILS); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_SUPPORTED_DEVICE("QLogic Ethernet Virtual I/O Controller"); + +u32 vnic_debug; + +module_param(vnic_debug, uint, 0444); +MODULE_PARM_DESC(vnic_debug, "Enable debug tracing if > 0"); + +LIST_HEAD(vnic_list); + +static DECLARE_WAIT_QUEUE_HEAD(vnic_npevent_queue); +static LIST_HEAD(vnic_npevent_list); +static DECLARE_COMPLETION(vnic_npevent_thread_exit); +static spinlock_t vnic_npevent_list_lock; +static struct task_struct *vnic_npevent_thread; +static int vnic_npevent_thread_end; + +static const char *const vnic_npevent_str[] = { + "PRIMARY CONNECTED", + "PRIMARY DISCONNECTED", + "PRIMARY CARRIER", + "PRIMARY NO CARRIER", + "PRIMARY TIMER EXPIRED", + "PRIMARY SETLINK", + "SECONDARY CONNECTED", + "SECONDARY DISCONNECTED", + "SECONDARY CARRIER", + "SECONDARY NO CARRIER", + "SECONDARY TIMER EXPIRED", + "SECONDARY SETLINK", + "FREE VNIC", +}; + +void vnic_connected(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_connected()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_CONNECTED); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_CONNECTED); + + vnic_connected_stats(vnic); +} + +void vnic_disconnected(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_disconnected()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_DISCONNECTED); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_DISCONNECTED); +} + +void vnic_link_up(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_link_up()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_LINKUP); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_LINKUP); +} + +void vnic_link_down(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_link_down()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_LINKDOWN); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_LINKDOWN); +} + +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath) +{ + unsigned long flags; + + VNIC_FUNCTION("vnic_stop_xmit()\n"); + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (netpath == vnic->current_path) { + if (!netif_queue_stopped(vnic->netdevice)) { + netif_stop_queue(vnic->netdevice); + vnic->failed_over = 0; + } + + vnic_stop_xmit_stats(vnic); + } + spin_unlock_irqrestore(&vnic->current_path_lock, flags); +} + +void vnic_restart_xmit(struct vnic *vnic, struct netpath *netpath) +{ + unsigned long flags; + + VNIC_FUNCTION("vnic_restart_xmit()\n"); + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (netpath == vnic->current_path) { + if (netif_queue_stopped(vnic->netdevice)) + netif_wake_queue(vnic->netdevice); + + vnic_restart_xmit_stats(vnic); + } + spin_unlock_irqrestore(&vnic->current_path_lock, flags); +} + +void vnic_recv_packet(struct vnic *vnic, struct netpath *netpath, + struct sk_buff *skb) +{ + VNIC_FUNCTION("vnic_recv_packet()\n"); + if ((netpath != vnic->current_path) || !vnic->open) { + VNIC_INFO("tossing packet\n"); + dev_kfree_skb(skb); + return; + } + + vnic->netdevice->last_rx 
= jiffies; + skb->dev = vnic->netdevice; + skb->protocol = eth_type_trans(skb, skb->dev); + if (!vnic->config->use_rx_csum) + skb->ip_summed = CHECKSUM_NONE; + netif_rx(skb); + vnic_recv_pkt_stats(vnic); +} + +static struct net_device_stats *vnic_get_stats(struct net_device *device) +{ + struct vnic *vnic; + struct netpath *np; + unsigned long flags; + + VNIC_FUNCTION("vnic_get_stats()\n"); + vnic = netdev_priv(device); + + spin_lock_irqsave(&vnic->current_path_lock, flags); + np = vnic->current_path; + if (np && np->viport) { + atomic_inc(&np->viport->reference_count); + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + viport_get_stats(np->viport, &vnic->stats); + atomic_dec(&np->viport->reference_count); + wake_up(&np->viport->reference_queue); + } else + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + + return &vnic->stats; +} + +static int vnic_open(struct net_device *device) +{ + struct vnic *vnic; + + VNIC_FUNCTION("vnic_open()\n"); + vnic = netdev_priv(device); + + vnic->open++; + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_PRINP_SETLINK); + netif_start_queue(vnic->netdevice); + + return 0; +} + +static int vnic_stop(struct net_device *device) +{ + struct vnic *vnic; + int ret = 0; + + VNIC_FUNCTION("vnic_stop()\n"); + vnic = netdev_priv(device); + netif_stop_queue(device); + vnic->open--; + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_PRINP_SETLINK); + + return ret; +} + +static int vnic_hard_start_xmit(struct sk_buff *skb, + struct net_device *device) +{ + struct vnic *vnic; + struct netpath *np; + cycles_t xmit_time; + int ret = -1; + + VNIC_FUNCTION("vnic_hard_start_xmit()\n"); + vnic = netdev_priv(device); + np = vnic->current_path; + + vnic_pre_pkt_xmit_stats(&xmit_time); + + if (np && np->viport) + ret = viport_xmit_packet(np->viport, skb); + + if (ret) { + vnic_xmit_fail_stats(vnic); + dev_kfree_skb_any(skb); + vnic->stats.tx_dropped++; + goto out; + } + + device->trans_start = jiffies; + vnic_post_pkt_xmit_stats(vnic, xmit_time); +out: + return 0; +} + +static void vnic_tx_timeout(struct net_device *device) +{ + struct vnic *vnic; + struct viport *viport = NULL; + unsigned long flags; + + VNIC_FUNCTION("vnic_tx_timeout()\n"); + vnic = netdev_priv(device); + device->trans_start = jiffies; + + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (vnic->current_path && vnic->current_path->viport) { + if (vnic->failed_over) { + if (vnic->current_path == &vnic->primary_path) + viport = vnic->secondary_path.viport; + else if (vnic->current_path == &vnic->secondary_path) + viport = vnic->primary_path.viport; + } else + viport = vnic->current_path->viport; + + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + if (viport) + viport_failure(viport); + } else + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + + VNIC_ERROR("vnic_tx_timeout\n"); +} + +static void vnic_set_multicast_list(struct net_device *device) +{ + struct vnic *vnic; + unsigned long flags; + + VNIC_FUNCTION("vnic_set_multicast_list()\n"); + vnic = netdev_priv(device); + + spin_lock_irqsave(&vnic->lock, flags); + if (device->mc_count == 0) { + if (vnic->mc_list_len) { + vnic->mc_list_len = vnic->mc_count = 0; + kfree(vnic->mc_list); + } + } else { + struct dev_mc_list *mc_list = device->mc_list; + int i; + + if (device->mc_count > vnic->mc_list_len) { + if (vnic->mc_list_len) + kfree(vnic->mc_list); + vnic->mc_list_len = device->mc_count + 10; + vnic->mc_list = kmalloc(vnic->mc_list_len * + sizeof *mc_list, GFP_ATOMIC); + if (!vnic->mc_list) { + vnic->mc_list_len = 
vnic->mc_count = 0; + VNIC_ERROR("failed allocating mc_list\n"); + goto failure; + } + } + vnic->mc_count = device->mc_count; + for (i = 0; i < device->mc_count; i++) { + vnic->mc_list[i] = *mc_list; + vnic->mc_list[i].next = &vnic->mc_list[i + 1]; + mc_list = mc_list->next; + } + } + spin_unlock_irqrestore(&vnic->lock, flags); + + if (vnic->primary_path.viport) + viport_set_multicast(vnic->primary_path.viport, + vnic->mc_list, vnic->mc_count); + + if (vnic->secondary_path.viport) + viport_set_multicast(vnic->secondary_path.viport, + vnic->mc_list, vnic->mc_count); + + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_PRINP_SETLINK); + return; +failure: + spin_unlock_irqrestore(&vnic->lock, flags); +} + +/** + * Following set of functions queues up the events for EVIC and the + * kernel thread queuing up the event might return. + */ +static int vnic_set_mac_address(struct net_device *device, void *addr) +{ + struct vnic *vnic; + struct sockaddr *sockaddr = addr; + u8 *address; + int ret = -1; + + VNIC_FUNCTION("vnic_set_mac_address()\n"); + vnic = netdev_priv(device); + + if (!is_valid_ether_addr(sockaddr->sa_data)) + return -EADDRNOTAVAIL; + + if (netif_running(device)) + return -EBUSY; + + memcpy(device->dev_addr, sockaddr->sa_data, ETH_ALEN); + address = sockaddr->sa_data; + + if (vnic->primary_path.viport) + ret = viport_set_unicast(vnic->primary_path.viport, + address); + + if (ret) + return ret; + + if (vnic->secondary_path.viport) + viport_set_unicast(vnic->secondary_path.viport, address); + + vnic->mac_set = 1; + return 0; +} + +static int vnic_change_mtu(struct net_device *device, int mtu) +{ + struct vnic *vnic; + int ret = 0; + int pri_max_mtu; + int sec_max_mtu; + + VNIC_FUNCTION("vnic_change_mtu()\n"); + vnic = netdev_priv(device); + + if (vnic->primary_path.viport) + pri_max_mtu = viport_max_mtu(vnic->primary_path.viport); + else + pri_max_mtu = MAX_PARAM_VALUE; + + if (vnic->secondary_path.viport) + sec_max_mtu = viport_max_mtu(vnic->secondary_path.viport); + else + sec_max_mtu = MAX_PARAM_VALUE; + + if ((mtu < pri_max_mtu) && (mtu < sec_max_mtu)) { + device->mtu = mtu; + vnic_npevent_queue_evt(&vnic->primary_path, + VNIC_PRINP_SETLINK); + vnic_npevent_queue_evt(&vnic->secondary_path, + VNIC_SECNP_SETLINK); + } else if (pri_max_mtu < sec_max_mtu) + printk(KERN_WARNING PFX "%s: Maximum " + "supported MTU size is %d. " + "Cannot set MTU to %d\n", + vnic->config->name, pri_max_mtu, mtu); + else + printk(KERN_WARNING PFX "%s: Maximum " + "supported MTU size is %d. " + "Cannot set MTU to %d\n", + vnic->config->name, sec_max_mtu, mtu); + + return ret; +} + +static int vnic_npevent_register(struct vnic *vnic, struct netpath *netpath) +{ + u8 *address; + int ret; + + if (!vnic->mac_set) { + /* if netpath == secondary_path, then the primary path isn't + * connected. MAC address will be set when the primary + * connects. 
+ */ + netpath_get_hw_addr(netpath, vnic->netdevice->dev_addr); + address = vnic->netdevice->dev_addr; + + if (vnic->secondary_path.viport) + viport_set_unicast(vnic->secondary_path.viport, + address); + + vnic->mac_set = 1; + } + ret = register_netdev(vnic->netdevice); + if (ret) { + printk(KERN_ERR PFX "%s failed registering netdev " + "error %d - calling viport_failure\n", + config_viport_name(vnic->primary_path.viport->config), + ret); + vnic_free(vnic); + printk(KERN_ERR PFX "%s DELETED : register_netdev failure\n", + config_viport_name(vnic->primary_path.viport->config)); + return ret; + } + + vnic->state = VNIC_REGISTERED; + vnic->carrier = 2; /*special value to force netif_carrier_(on|off)*/ + return 0; +} + +static void vnic_npevent_dequeue_all(struct vnic *vnic) +{ + unsigned long flags; + struct vnic_npevent *npevt, *tmp; + + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + if (list_empty(&vnic_npevent_list)) + goto out; + list_for_each_entry_safe(npevt, tmp, &vnic_npevent_list, + list_ptrs) { + if ((npevt->vnic == vnic)) { + list_del(&npevt->list_ptrs); + kfree(npevt); + } + } +out: + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); +} + +static void update_path_and_reconnect(struct netpath *netpath, + struct vnic *vnic) +{ + struct viport_config *config = netpath->viport->config; + int delay = 1; + + if (vnic_ib_get_path(netpath, vnic)) + return; + /* + * tell viport_connect to wait for default_no_path_timeout + * before connecting if we are retrying the same path index + * within default_no_path_timeout. + * This prevents flooding connect requests to a path (or set + * of paths) that aren't successfully connecting for some reason. + */ + if (time_after(jiffies, + (netpath->connect_time + vnic->config->no_path_timeout))) { + netpath->path_idx = config->path_idx; + netpath->connect_time = jiffies; + netpath->delay_reconnect = 0; + delay = 0; + } else if (config->path_idx != netpath->path_idx) { + delay = netpath->delay_reconnect; + netpath->path_idx = config->path_idx; + netpath->delay_reconnect = 1; + } else + delay = 1; + viport_connect(netpath->viport, delay); +} + +static inline void vnic_set_checksum_flag(struct vnic *vnic, + struct netpath *target_path) +{ + unsigned long flags; + + spin_lock_irqsave(&vnic->current_path_lock, flags); + vnic->current_path = target_path; + vnic->failed_over = 1; + if (vnic->config->use_tx_csum && + netpath_can_tx_csum(vnic->current_path)) + vnic->netdevice->features |= NETIF_F_IP_CSUM; + + spin_unlock_irqrestore(&vnic->current_path_lock, flags); +} + +static void vnic_set_uni_multicast(struct vnic *vnic, + struct netpath *netpath) +{ + unsigned long flags; + u8 *address; + + if (vnic->mac_set) { + address = vnic->netdevice->dev_addr; + + if (netpath->viport) + viport_set_unicast(netpath->viport, address); + } + spin_lock_irqsave(&vnic->lock, flags); + + if (vnic->mc_list && netpath->viport) + viport_set_multicast(netpath->viport, vnic->mc_list, + vnic->mc_count); + + spin_unlock_irqrestore(&vnic->lock, flags); + if (vnic->state == VNIC_REGISTERED) { + if (!netpath->viport) + return; + viport_set_link(netpath->viport, + vnic->netdevice->flags & ~IFF_UP, + vnic->netdevice->mtu); + } +} + +static void vnic_set_netpath_timers(struct vnic *vnic, + struct netpath *netpath) +{ + switch (netpath->timer_state) { + case NETPATH_TS_IDLE: + netpath->timer_state = NETPATH_TS_ACTIVE; + if (vnic->state == VNIC_UNINITIALIZED) + netpath_timer(netpath, + vnic->config-> + primary_connect_timeout); + else + netpath_timer(netpath, + vnic->config-> + 
primary_reconnect_timeout); + break; + case NETPATH_TS_ACTIVE: + /*nothing to do*/ + break; + case NETPATH_TS_EXPIRED: + if (vnic->state == VNIC_UNINITIALIZED) + vnic_npevent_register(vnic, netpath); + + break; + } +} + +static void vnic_check_primary_path_timer(struct vnic *vnic) +{ + switch (vnic->primary_path.timer_state) { + case NETPATH_TS_ACTIVE: + /* nothing to do. just wait */ + break; + case NETPATH_TS_IDLE: + netpath_timer(&vnic->primary_path, + vnic->config-> + primary_switch_timeout); + break; + case NETPATH_TS_EXPIRED: + printk(KERN_INFO PFX + "%s: switching to primary path\n", + vnic->config->name); + + vnic_set_checksum_flag(vnic, &vnic->primary_path); + break; + } +} + +static void vnic_carrier_loss(struct vnic *vnic, + struct netpath *last_path) +{ + if (vnic->primary_path.carrier) { + vnic->carrier = 1; + vnic_set_checksum_flag(vnic, &vnic->primary_path); + + if (last_path && last_path != vnic->current_path) + printk(KERN_INFO PFX + "%s: failing over to primary path\n", + vnic->config->name); + else if (!last_path) + printk(KERN_INFO PFX "%s: using primary path\n", + vnic->config->name); + + } else if ((vnic->secondary_path.carrier) && + (vnic->secondary_path.timer_state != NETPATH_TS_ACTIVE)) { + vnic->carrier = 1; + vnic_set_checksum_flag(vnic, &vnic->secondary_path); + + if (last_path && last_path != vnic->current_path) + printk(KERN_INFO PFX + "%s: failing over to secondary path\n", + vnic->config->name); + else if (!last_path) + printk(KERN_INFO PFX "%s: using secondary path\n", + vnic->config->name); + + } + +} + +static void vnic_handle_path_change(struct vnic *vnic, + struct netpath **path) +{ + struct netpath *last_path = *path; + + if (!last_path) { + if (vnic->current_path == &vnic->primary_path) + last_path = &vnic->secondary_path; + else + last_path = &vnic->primary_path; + + } + + if (vnic->current_path && vnic->current_path->viport) + viport_set_link(vnic->current_path->viport, + vnic->netdevice->flags, + vnic->netdevice->mtu); + + if (last_path->viport) + viport_set_link(last_path->viport, + vnic->netdevice->flags & + ~IFF_UP, vnic->netdevice->mtu); + + vnic_restart_xmit(vnic, vnic->current_path); +} + +static void vnic_report_path_change(struct vnic *vnic, + struct netpath *last_path, + int other_path_ok) +{ + if (!vnic->current_path) { + if (last_path == &vnic->primary_path) + printk(KERN_INFO PFX "%s: primary path lost, " + "no failover path available\n", + vnic->config->name); + else + printk(KERN_INFO PFX "%s: secondary path lost, " + "no failover path available\n", + vnic->config->name); + return; + } + + if (last_path != vnic->current_path) + return; + + if (vnic->current_path == &vnic->secondary_path) { + if (other_path_ok != vnic->primary_path.carrier) { + if (other_path_ok) + printk(KERN_INFO PFX "%s: primary path no" + " longer available for failover\n", + vnic->config->name); + else + printk(KERN_INFO PFX "%s: primary path now" + " available for failover\n", + vnic->config->name); + } + } else { + if (other_path_ok != vnic->secondary_path.carrier) { + if (other_path_ok) + printk(KERN_INFO PFX "%s: secondary path no" + " longer available for failover\n", + vnic->config->name); + else + printk(KERN_INFO PFX "%s: secondary path now" + " available for failover\n", + vnic->config->name); + } + } +} + +static void vnic_handle_free_vnic_evt(struct vnic *vnic) +{ + unsigned long flags; + + if (!netif_queue_stopped(vnic->netdevice)) + netif_stop_queue(vnic->netdevice); + + netpath_timer_stop(&vnic->primary_path); + 
netpath_timer_stop(&vnic->secondary_path); + spin_lock_irqsave(&vnic->current_path_lock, flags); + vnic->current_path = NULL; + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + netpath_free(&vnic->primary_path); + netpath_free(&vnic->secondary_path); + if (vnic->state == VNIC_REGISTERED) + unregister_netdev(vnic->netdevice); + + vnic_npevent_dequeue_all(vnic); + kfree(vnic->config); + if (vnic->mc_list_len) { + vnic->mc_list_len = vnic->mc_count = 0; + kfree(vnic->mc_list); + } + + sysfs_remove_group(&vnic->dev_info.dev.kobj, + &vnic_dev_attr_group); + vnic_cleanup_stats_files(vnic); + device_unregister(&vnic->dev_info.dev); + wait_for_completion(&vnic->dev_info.released); + free_netdev(vnic->netdevice); +} + +static struct vnic *vnic_handle_npevent(struct vnic *vnic, + enum vnic_npevent_type npevt_type) +{ + struct netpath *netpath; + const char *netpath_str; + + if (npevt_type <= VNIC_PRINP_LASTTYPE) + netpath_str = netpath_to_string(vnic, &vnic->primary_path); + else if (npevt_type <= VNIC_SECNP_LASTTYPE) + netpath_str = netpath_to_string(vnic, &vnic->secondary_path); + else + netpath_str = netpath_to_string(vnic, vnic->current_path); + + VNIC_INFO("%s: processing %s, netpath=%s, carrier=%d\n", + vnic->config->name, vnic_npevent_str[npevt_type], + netpath_str, vnic->carrier); + + switch (npevt_type) { + case VNIC_PRINP_CONNECTED: + netpath = &vnic->primary_path; + if (vnic->state == VNIC_UNINITIALIZED) { + if (vnic_npevent_register(vnic, netpath)) + break; + } + vnic_set_uni_multicast(vnic, netpath); + break; + case VNIC_SECNP_CONNECTED: + vnic_set_uni_multicast(vnic, &vnic->secondary_path); + break; + case VNIC_PRINP_TIMEREXPIRED: + netpath = &vnic->primary_path; + netpath->timer_state = NETPATH_TS_EXPIRED; + if (!netpath->carrier) + update_path_and_reconnect(netpath, vnic); + break; + case VNIC_SECNP_TIMEREXPIRED: + netpath = &vnic->secondary_path; + netpath->timer_state = NETPATH_TS_EXPIRED; + if (!netpath->carrier) + update_path_and_reconnect(netpath, vnic); + else { + if (vnic->state == VNIC_UNINITIALIZED) + vnic_npevent_register(vnic, netpath); + } + break; + case VNIC_PRINP_LINKUP: + vnic->primary_path.carrier = 1; + break; + case VNIC_SECNP_LINKUP: + netpath = &vnic->secondary_path; + netpath->carrier = 1; + if (!vnic->carrier) + vnic_set_netpath_timers(vnic, netpath); + break; + case VNIC_PRINP_LINKDOWN: + vnic->primary_path.carrier = 0; + break; + case VNIC_SECNP_LINKDOWN: + if (vnic->state == VNIC_UNINITIALIZED) + netpath_timer_stop(&vnic->secondary_path); + vnic->secondary_path.carrier = 0; + break; + case VNIC_PRINP_DISCONNECTED: + netpath = &vnic->primary_path; + netpath_timer_stop(netpath); + netpath->carrier = 0; + update_path_and_reconnect(netpath, vnic); + break; + case VNIC_SECNP_DISCONNECTED: + netpath = &vnic->secondary_path; + netpath_timer_stop(netpath); + netpath->carrier = 0; + update_path_and_reconnect(netpath, vnic); + break; + case VNIC_PRINP_SETLINK: + netpath = vnic->current_path; + if (!netpath || !netpath->viport) + break; + viport_set_link(netpath->viport, + vnic->netdevice->flags, + vnic->netdevice->mtu); + break; + case VNIC_SECNP_SETLINK: + netpath = &vnic->secondary_path; + if (!netpath || !netpath->viport) + break; + viport_set_link(netpath->viport, + vnic->netdevice->flags, + vnic->netdevice->mtu); + break; + case VNIC_NP_FREEVNIC: + vnic_handle_free_vnic_evt(vnic); + vnic = NULL; + break; + } + return vnic; +} + +static int vnic_npevent_statemachine(void *context) +{ + struct vnic_npevent *vnic_link_evt; + enum vnic_npevent_type 
npevt_type; + struct vnic *vnic; + int last_carrier; + int other_path_ok = 0; + struct netpath *last_path; + + while (!vnic_npevent_thread_end || + !list_empty(&vnic_npevent_list)) { + unsigned long flags; + + wait_event_interruptible(vnic_npevent_queue, + !list_empty(&vnic_npevent_list) + || vnic_npevent_thread_end); + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + if (list_empty(&vnic_npevent_list)) { + spin_unlock_irqrestore(&vnic_npevent_list_lock, + flags); + VNIC_INFO("netpath statemachine wake" + " on empty list\n"); + continue; + } + + vnic_link_evt = list_entry(vnic_npevent_list.next, + struct vnic_npevent, + list_ptrs); + list_del(&vnic_link_evt->list_ptrs); + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); + vnic = vnic_link_evt->vnic; + npevt_type = vnic_link_evt->event_type; + kfree(vnic_link_evt); + + if (vnic->current_path == &vnic->secondary_path) + other_path_ok = vnic->primary_path.carrier; + else if (vnic->current_path == &vnic->primary_path) + other_path_ok = vnic->secondary_path.carrier; + + vnic = vnic_handle_npevent(vnic, npevt_type); + + if (!vnic) + continue; + + last_carrier = vnic->carrier; + last_path = vnic->current_path; + + if (!vnic->current_path || + !vnic->current_path->carrier) { + vnic->carrier = 0; + vnic->current_path = NULL; + vnic->netdevice->features &= ~NETIF_F_IP_CSUM; + } + + if (!vnic->carrier) + vnic_carrier_loss(vnic, last_path); + else if ((vnic->current_path != &vnic->primary_path) && + (vnic->config->prefer_primary) && + (vnic->primary_path.carrier)) + vnic_check_primary_path_timer(vnic); + + if (last_path) + vnic_report_path_change(vnic, last_path, + other_path_ok); + + VNIC_INFO("new netpath=%s, carrier=%d\n", + netpath_to_string(vnic, vnic->current_path), + vnic->carrier); + + if (vnic->current_path != last_path) + vnic_handle_path_change(vnic, &last_path); + + if (vnic->carrier != last_carrier) { + if (vnic->carrier) { + VNIC_INFO("netif_carrier_on\n"); + netif_carrier_on(vnic->netdevice); + vnic_carrier_loss_stats(vnic); + } else { + VNIC_INFO("netif_carrier_off\n"); + netif_carrier_off(vnic->netdevice); + vnic_disconn_stats(vnic); + } + + } + } + complete_and_exit(&vnic_npevent_thread_exit, 0); + return 0; +} + +void vnic_npevent_queue_evt(struct netpath *netpath, + enum vnic_npevent_type evt) +{ + struct vnic_npevent *npevent; + unsigned long flags; + + npevent = kmalloc(sizeof *npevent, GFP_ATOMIC); + if (!npevent) { + VNIC_ERROR("Could not allocate memory for vnic event\n"); + return; + } + npevent->vnic = netpath->parent; + npevent->event_type = evt; + INIT_LIST_HEAD(&npevent->list_ptrs); + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + list_add_tail(&npevent->list_ptrs, &vnic_npevent_list); + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); + wake_up(&vnic_npevent_queue); +} + +void vnic_npevent_dequeue_evt(struct netpath *netpath, + enum vnic_npevent_type evt) +{ + unsigned long flags; + struct vnic_npevent *npevt, *tmp; + struct vnic *vnic = netpath->parent; + + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + if (list_empty(&vnic_npevent_list)) + goto out; + list_for_each_entry_safe(npevt, tmp, &vnic_npevent_list, + list_ptrs) { + if ((npevt->vnic == vnic) && + (npevt->event_type == evt)) { + list_del(&npevt->list_ptrs); + kfree(npevt); + break; + } + } +out: + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); +} + +static int vnic_npevent_start(void) +{ + VNIC_FUNCTION("vnic_npevent_start()\n"); + + spin_lock_init(&vnic_npevent_list_lock); + vnic_npevent_thread = 
kthread_run(vnic_npevent_statemachine, NULL, + "qlgc_vnic_npevent_s_m"); + if (IS_ERR(vnic_npevent_thread)) { + printk(KERN_WARNING PFX "failed to create vnic npevent" + " thread; error %d\n", + (int) PTR_ERR(vnic_npevent_thread)); + vnic_npevent_thread = NULL; + return 1; + } + + return 0; +} + +void vnic_npevent_cleanup(void) +{ + if (vnic_npevent_thread) { + vnic_npevent_thread_end = 1; + wake_up(&vnic_npevent_queue); + wait_for_completion(&vnic_npevent_thread_exit); + vnic_npevent_thread = NULL; + } +} + +static void vnic_setup(struct net_device *device) +{ + ether_setup(device); + + /* ether_setup is used to fill + * device parameters for ethernet devices. + * We override some of the parameters + * which are specific to VNIC. + */ + device->get_stats = vnic_get_stats; + device->open = vnic_open; + device->stop = vnic_stop; + device->hard_start_xmit = vnic_hard_start_xmit; + device->tx_timeout = vnic_tx_timeout; + device->set_multicast_list = vnic_set_multicast_list; + device->set_mac_address = vnic_set_mac_address; + device->change_mtu = vnic_change_mtu; + device->watchdog_timeo = 10 * HZ; + device->features = 0; +} + +struct vnic *vnic_allocate(struct vnic_config *config) +{ + struct vnic *vnic = NULL; + struct net_device *netdev; + + VNIC_FUNCTION("vnic_allocate()\n"); + netdev = alloc_netdev((int) sizeof(*vnic), config->name, vnic_setup); + if (!netdev) { + VNIC_ERROR("failed allocating vnic structure\n"); + return NULL; + } + + vnic = netdev_priv(netdev); + vnic->netdevice = netdev; + spin_lock_init(&vnic->lock); + spin_lock_init(&vnic->current_path_lock); + vnic_alloc_stats(vnic); + vnic->state = VNIC_UNINITIALIZED; + vnic->config = config; + + netpath_init(&vnic->primary_path, vnic, 0); + netpath_init(&vnic->secondary_path, vnic, 1); + + vnic->current_path = NULL; + vnic->failed_over = 0; + + list_add_tail(&vnic->list_ptrs, &vnic_list); + + return vnic; +} + +void vnic_free(struct vnic *vnic) +{ + VNIC_FUNCTION("vnic_free()\n"); + list_del(&vnic->list_ptrs); + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_NP_FREEVNIC); +} + +static void __exit vnic_cleanup(void) +{ + VNIC_FUNCTION("vnic_cleanup()\n"); + + VNIC_INIT("unloading %s\n", MODULEDETAILS); + + while (!list_empty(&vnic_list)) { + struct vnic *vnic = + list_entry(vnic_list.next, struct vnic, list_ptrs); + vnic_free(vnic); + } + + vnic_npevent_cleanup(); + viport_cleanup(); + vnic_ib_cleanup(); +} + +static int __init vnic_init(void) +{ + int ret; + VNIC_FUNCTION("vnic_init()\n"); + VNIC_INIT("Initializing %s\n", MODULEDETAILS); + + ret = config_start(); + if (ret) { + VNIC_ERROR("config_start failed\n"); + goto failure; + } + + ret = vnic_ib_init(); + if (ret) { + VNIC_ERROR("ib_start failed\n"); + goto failure; + } + + ret = viport_start(); + if (ret) { + VNIC_ERROR("viport_start failed\n"); + goto failure; + } + + ret = vnic_npevent_start(); + if (ret) { + VNIC_ERROR("vnic_npevent_start failed\n"); + goto failure; + } + + return 0; +failure: + vnic_cleanup(); + return ret; +} + +module_init(vnic_init); +module_exit(vnic_cleanup); diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_main.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.h new file mode 100644 index 0000000..7535124 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.h @@ -0,0 +1,154 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_MAIN_H_INCLUDED +#define VNIC_MAIN_H_INCLUDED + +#include +#include +#include +#include + +#include "vnic_config.h" +#include "vnic_netpath.h" + +extern u16 vnic_max_mtu; +extern struct list_head vnic_list; +extern struct attribute_group vnic_stats_attr_group; +extern cycles_t vnic_recv_ref; + +enum vnic_npevent_type { + VNIC_PRINP_CONNECTED = 0, + VNIC_PRINP_DISCONNECTED = 1, + VNIC_PRINP_LINKUP = 2, + VNIC_PRINP_LINKDOWN = 3, + VNIC_PRINP_TIMEREXPIRED = 4, + VNIC_PRINP_SETLINK = 5, + + /* used to figure out PRI vs SEC types for dbg msg*/ + VNIC_PRINP_LASTTYPE = VNIC_PRINP_SETLINK, + + VNIC_SECNP_CONNECTED = 6, + VNIC_SECNP_DISCONNECTED = 7, + VNIC_SECNP_LINKUP = 8, + VNIC_SECNP_LINKDOWN = 9, + VNIC_SECNP_TIMEREXPIRED = 10, + VNIC_SECNP_SETLINK = 11, + + /* used to figure out PRI vs SEC types for dbg msg*/ + VNIC_SECNP_LASTTYPE = VNIC_SECNP_SETLINK, + + VNIC_NP_FREEVNIC = 12, + + /* + * NOTE : If any new netpath event is being added, don't forget to + * add corresponding netpath event string into vnic_main.c. 
+ */ +}; + +struct vnic_npevent { + struct list_head list_ptrs; + struct vnic *vnic; + enum vnic_npevent_type event_type; +}; + +void vnic_npevent_queue_evt(struct netpath *netpath, + enum vnic_npevent_type evt); +void vnic_npevent_dequeue_evt(struct netpath *netpath, + enum vnic_npevent_type evt); + +enum vnic_state { + VNIC_UNINITIALIZED = 0, + VNIC_REGISTERED = 1 +}; + +struct vnic { + struct list_head list_ptrs; + enum vnic_state state; + struct vnic_config *config; + struct netpath *current_path; + struct netpath primary_path; + struct netpath secondary_path; + int open; + int carrier; + int failed_over; + int mac_set; + struct net_device_stats stats; + struct net_device *netdevice; + struct dev_info dev_info; + struct dev_mc_list *mc_list; + int mc_list_len; + int mc_count; + spinlock_t lock; + spinlock_t current_path_lock; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + struct { + cycles_t start_time; + cycles_t conn_time; + cycles_t disconn_ref; /* intermediate time */ + cycles_t disconn_time; + u32 disconn_num; + cycles_t xmit_time; + u32 xmit_num; + u32 xmit_fail; + cycles_t recv_time; + u32 recv_num; + u32 multicast_recv_num; + cycles_t xmit_ref; /* intermediate time */ + cycles_t xmit_off_time; + u32 xmit_off_num; + cycles_t carrier_ref; /* intermediate time */ + cycles_t carrier_off_time; + u32 carrier_off_num; + } statistics; + struct dev_info stat_info; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ +}; + +struct vnic *vnic_allocate(struct vnic_config *config); + +void vnic_free(struct vnic *vnic); + +void vnic_connected(struct vnic *vnic, struct netpath *netpath); +void vnic_disconnected(struct vnic *vnic, struct netpath *netpath); + +void vnic_link_up(struct vnic *vnic, struct netpath *netpath); +void vnic_link_down(struct vnic *vnic, struct netpath *netpath); + +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath); +void vnic_restart_xmit(struct vnic *vnic, struct netpath *netpath); + +void vnic_recv_packet(struct vnic *vnic, struct netpath *netpath, + struct sk_buff *skb); +void vnic_npevent_cleanup(void); +void completion_callback_cleanup(struct vnic_ib_conn *ib_conn); +#endif /* VNIC_MAIN_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:32:28 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:02:28 +0530 Subject: [ofa-general] [PATCH v2 02/13] QLogic VNIC: Netpath - abstraction of connection to EVIC/VEx In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103228.12355.9952.stgit@localhost.localdomain> From: Ramachandra K This patch implements the netpath layer of QLogic VNIC. Netpath is an abstraction of a connection to EVIC. It primarily includes the implementation which maintains the timers to monitor the status of the connection to EVIC/VEx. 
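One point about the timer handling in this patch: the expiry handler does not touch netpath state itself - it only queues a TIMEREXPIRED event for the single state-machine thread, so all state transitions stay serialized there. Reduced to its core (the if/else of the actual vnic_npevent_timeout() below collapsed into one expression):

	static void timeout_sketch(unsigned long data)
	{
		struct netpath *netpath = (struct netpath *)data;
		/* hand off to the state machine; no locking needed here */
		vnic_npevent_queue_evt(netpath, netpath->second_bias ?
				       VNIC_SECNP_TIMEREXPIRED :
				       VNIC_PRINP_TIMEREXPIRED);
	}

The flip side is in netpath_timer_stop(): after del_timer_sync() it also calls vnic_npevent_dequeue_evt() to flush an expiry event that may already be sitting in the event queue.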
Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c | 112 +++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h | 79 ++++++++++++++++ 2 files changed, 191 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c new file mode 100644 index 0000000..820b996 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c @@ -0,0 +1,112 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_viport.h" +#include "vnic_netpath.h" + +static void vnic_npevent_timeout(unsigned long data) +{ + struct netpath *netpath = (struct netpath *)data; + + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_TIMEREXPIRED); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_TIMEREXPIRED); +} + +void netpath_timer(struct netpath *netpath, int timeout) +{ + if (netpath->timer_state == NETPATH_TS_ACTIVE) + del_timer_sync(&netpath->timer); + if (timeout) { + init_timer(&netpath->timer); + netpath->timer_state = NETPATH_TS_ACTIVE; + netpath->timer.expires = jiffies + timeout; + netpath->timer.data = (unsigned long)netpath; + netpath->timer.function = vnic_npevent_timeout; + add_timer(&netpath->timer); + } else + vnic_npevent_timeout((unsigned long)netpath); +} + +void netpath_timer_stop(struct netpath *netpath) +{ + if (netpath->timer_state != NETPATH_TS_ACTIVE) + return; + del_timer_sync(&netpath->timer); + if (netpath->second_bias) + vnic_npevent_dequeue_evt(netpath, VNIC_SECNP_TIMEREXPIRED); + else + vnic_npevent_dequeue_evt(netpath, VNIC_PRINP_TIMEREXPIRED); + + netpath->timer_state = NETPATH_TS_IDLE; +} + +void netpath_free(struct netpath *netpath) +{ + if (!netpath->viport) + return; + viport_free(netpath->viport); + netpath->viport = NULL; + sysfs_remove_group(&netpath->dev_info.dev.kobj, + &vnic_path_attr_group); + device_unregister(&netpath->dev_info.dev); + wait_for_completion(&netpath->dev_info.released); +} + +void netpath_init(struct netpath *netpath, struct vnic *vnic, + int second_bias) +{ + netpath->parent = vnic; + netpath->carrier = 0; + netpath->viport = NULL; + netpath->second_bias = second_bias; + netpath->timer_state = NETPATH_TS_IDLE; + init_timer(&netpath->timer); +} + +const char *netpath_to_string(struct vnic *vnic, struct netpath *netpath) +{ + if (!netpath) + return "NULL"; + else if (netpath == &vnic->primary_path) + return "PRIMARY"; + else if (netpath == &vnic->secondary_path) + return "SECONDARY"; + else + return "UNKNOWN"; +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h new file mode 100644 index 0000000..f4e142e --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h @@ -0,0 +1,79 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_NETPATH_H_INCLUDED +#define VNIC_NETPATH_H_INCLUDED + +#include + +#include "vnic_sys.h" + +struct viport; +struct vnic; + +enum netpath_ts { + NETPATH_TS_IDLE = 0, + NETPATH_TS_ACTIVE = 1, + NETPATH_TS_EXPIRED = 2 +}; + +struct netpath { + int carrier; + struct vnic *parent; + struct viport *viport; + size_t path_idx; + unsigned long connect_time; + int second_bias; + u8 is_primary_path; + u8 delay_reconnect; + struct timer_list timer; + enum netpath_ts timer_state; + struct dev_info dev_info; +}; + +void netpath_init(struct netpath *netpath, struct vnic *vnic, + int second_bias); +void netpath_free(struct netpath *netpath); + +void netpath_timer(struct netpath *netpath, int timeout); +void netpath_timer_stop(struct netpath *netpath); + +const char *netpath_to_string(struct vnic *vnic, struct netpath *netpath); + +#define netpath_get_hw_addr(netpath, address) \ + viport_get_hw_addr((netpath)->viport, address) +#define netpath_is_connected(netpath) \ + (netpath->state == NETPATH_CONNECTED) +#define netpath_can_tx_csum(netpath) \ + viport_can_tx_csum(netpath->viport) + +#endif /* VNIC_NETPATH_H_INCLUDED */ From joel at finetec.com Mon May 19 10:29:30 2008 From: joel at finetec.com (Joe Li) Date: Mon, 19 May 2008 10:29:30 -0700 Subject: [ofa-general] Need help--Failed to build ofa-kernel.rpm while installing OFED1.3 on RHEL5.1 with Kernel 2.6.25-rc3/rc8 Message-ID: Hello everyone, I am a newbie to openfabric and I have an issue here which needs your help. When trying to install OFED1.3 on kernel 2.6.25-rc3 or kernel 2.6.25-rc8, I get an ofa-kernel rpm build error: "Running rpm -e --allmatches libibverbs libibcommon libibumad librdmacm opensm-libs ibutils openib opensm-libs dapl libibcommon libibumad libibverbs librdmacm ibutils ibutils-libs Build ofa_kernel RPM Running rpmbuild --rebuild --define '_topdir /var/tmp/OFED_topdir' --define 'configure_options --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod --with-mthca-mod --with-mlx4-mod --with-cxgb3-mod --with-nes-mod --with-ipath_inf-mod --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-srp-target-mod --with-rds-mod' --define 'build_kernel_ib 1' --define 'build_kernel_ib_devel 1' --define 'KVERSION 2.6.25-rc3' --define 'K_SRC /lib/modules/2.6.25-rc3/build' --define 'network_dir /etc/sysconfig/network-scripts' --define '_prefix /usr' /ofed1.3/OFED-1.3/SRPMS/ofa_kernel-1.3-ofed1.3.src.rpm Failed to build ofa_kernel RPM See /tmp/OFED.6361.logs/ofa_kernel.rpmbuild.log" In the ofa_kernel.rpmbuild.log file, it says: gcc -Wp,-MD,/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/.addr.o.d -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/4.1.2/include -D__KERNEL__ \ -include include/linux/autoconf.h \ -include /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include/linux/autoconf.h \ \ \ \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/debug \ -I/usr/local/include/scst \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/ulp/srpt \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/net/cxgb3 \ -Iinclude \ \ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Os 
-fno-stack-protector -m64 -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -fomit-frame-pointer -g -Wdeclaration-after-statement -Wno-pointer-sign -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(addr)" -D"KBUILD_MODNAME=KBUILD_STR(ib_addr)" -c -o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/.tmp_addr.o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function 'rdma_translate_ip': /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:113: warning: passing argument 1 of 'ip_dev_find' makes pointer from integer without a cast /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:113: error: too few arguments to function 'ip_dev_find' /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function 'addr_send_arp': /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:161: warning: passing argument 1 of 'ip_route_output_key' from incompatible pointer type /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:161: warning: passing argument 2 of 'ip_route_output_key' from incompatible pointer type /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:161: error: too few arguments to function 'ip_route_output_key' /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function 'addr_resolve_remote': /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:182: warning: passing argument 1 of 'ip_route_output_key' from incompatible pointer type /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:182: warning: passing argument 2 of 'ip_route_output_key' from incompatible pointer type /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:182: error: too few arguments to function 'ip_route_output_key' /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function 'addr_resolve_local': /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:264: warning: passing argument 1 of 'ip_dev_find' makes pointer from integer without a cast /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:264: error: too few arguments to function 'ip_dev_find' /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:268: error: implicit declaration of function 'ZERONET' /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:272: error: implicit declaration of function 'LOOPBACK' make[4]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.o] Error 1 make[3]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core] Error 2 make[2]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband] Error 2 make[1]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3] Error 2 make[1]: Leaving directory `/usr/src/linux-2.6.25-rc3' make: *** [kernel] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.67082 (%build) RPM build errors: user vlad does not exist - using root group vlad does not exist - using root user vlad does not exist - using root group vlad does not exist - using root Bad exit status from /var/tmp/rpm-tmp.67082 (%build) Can anyone please point out what might be wrong? 
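The errors themselves all trace to networking API changes merged for 2.6.25, which the OFED 1.3 backports (written for released kernels up to roughly 2.6.24, as far as I know) predate: ip_dev_find() and ip_route_output_key() gained a struct net * namespace argument, and the ZERONET()/LOOPBACK() macros were replaced by inline helpers. A sketch of the 2.6.25-style calls (illustrative only, not a tested OFED fix):

	__be32 addr;		/* IPv4 address being resolved */
	struct net_device *dev;
	struct rtable *rt;
	struct flowi fl;
	int ret;

	/* 2.6.25: both routines now take the network namespace first */
	dev = ip_dev_find(&init_net, addr);		/* was ip_dev_find(addr) */
	ret = ip_route_output_key(&init_net, &rt, &fl);	/* was ip_route_output_key(&rt, &fl) */

	/* 2.6.25: the old address-class macros became inline helpers */
	if (ipv4_is_zeronet(addr))			/* was ZERONET(addr) */
		return -EADDRNOTAVAIL;
	if (ipv4_is_loopback(addr))			/* was LOOPBACK(addr) */
		return -EADDRNOTAVAIL;

So either the backport glue in ofa_kernel needs updating for 2.6.25, or building against a released kernel that OFED 1.3 supports should avoid these errors.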
Thanks in advance. Regards, Joe -------------- next part -------------- An HTML attachment was scrubbed... URL: From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:32:58 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:02:58 +0530 Subject: [ofa-general] [PATCH v2 03/13] QLogic VNIC: Implementation of communication protocol with EVIC/VEx In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103258.12355.6146.stgit@localhost.localdomain> From: Poornima Kamath This patch implements the state machine for the protocol used to communicate with the EVIC. It also implements the viport abstraction, which represents the virtual Ethernet port on the EVIC. Signed-off-by: Poornima Kamath Signed-off-by: Ramachandra K Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c | 1214 ++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h | 176 +++ 2 files changed, 1390 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c new file mode 100644 index 0000000..0a94cd3 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c @@ -0,0 +1,1214 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE.
+ */ + +#include +#include +#include +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_viport.h" +#include "vnic_netpath.h" +#include "vnic_control.h" +#include "vnic_data.h" +#include "vnic_config.h" +#include "vnic_control_pkt.h" + +#define VIPORT_DISCONN_TIMER 10000 /* 10 seconds */ + +#define MAX_RETRY_INTERVAL 20000 /* 20 seconds */ +#define RETRY_INCREMENT 5000 /* 5 seconds */ +#define MAX_CONNECT_RETRY_TIMEOUT 600000 /* 10 minutes */ + +static DECLARE_WAIT_QUEUE_HEAD(viport_queue); +static LIST_HEAD(viport_list); +static DECLARE_COMPLETION(viport_thread_exit); +static spinlock_t viport_list_lock; + +static struct task_struct *viport_thread; +static int viport_thread_end; + +static void viport_timer(struct viport *viport, int timeout); + +struct viport *viport_allocate(struct viport_config *config) +{ + struct viport *viport; + + VIPORT_FUNCTION("viport_allocate()\n"); + viport = kzalloc(sizeof *viport, GFP_KERNEL); + if (!viport) { + VIPORT_ERROR("failed allocating viport structure\n"); + return NULL; + } + + viport->state = VIPORT_DISCONNECTED; + viport->link_state = LINK_FIRSTCONNECT; + viport->connect = WAIT; + viport->new_mtu = 1500; + viport->new_flags = 0; + viport->config = config; + viport->connect = DELAY; + viport->data.max_mtu = vnic_max_mtu; + spin_lock_init(&viport->lock); + init_waitqueue_head(&viport->stats_queue); + init_waitqueue_head(&viport->disconnect_queue); + init_waitqueue_head(&viport->reference_queue); + INIT_LIST_HEAD(&viport->list_ptrs); + + vnic_mc_init(viport); + + return viport; +} + +void viport_connect(struct viport *viport, int delay) +{ + VIPORT_FUNCTION("viport_connect()\n"); + + if (viport->connect != DELAY) + viport->connect = (delay) ? DELAY : NOW; + if (viport->link_state == LINK_FIRSTCONNECT) { + u32 duration; + duration = (net_random() & 0x1ff); + if (!viport->parent->is_primary_path) + duration += 0x1ff; + viport->link_state = LINK_RETRYWAIT; + viport_timer(viport, duration); + } else + viport_kick(viport); +} + +void viport_disconnect(struct viport *viport) +{ + VIPORT_FUNCTION("viport_disconnect()\n"); + viport->disconnect = 1; + viport_failure(viport); + wait_event(viport->disconnect_queue, viport->disconnect == 0); +} + +void viport_free(struct viport *viport) +{ + VIPORT_FUNCTION("viport_free()\n"); + viport_disconnect(viport); /* NOTE: this can sleep */ + vnic_mc_uninit(viport); + kfree(viport->config); + kfree(viport); +} + +void viport_set_link(struct viport *viport, u16 flags, u16 mtu) +{ + unsigned long localflags; + int i; + + VIPORT_FUNCTION("viport_set_link()\n"); + if (mtu > data_max_mtu(&viport->data)) { + VIPORT_ERROR("configuration error." + " mtu of %d unsupported by %s\n", mtu, + config_viport_name(viport->config)); + goto failure; + } + + spin_lock_irqsave(&viport->lock, localflags); + flags &= IFF_UP | IFF_ALLMULTI | IFF_PROMISC; + if ((viport->new_flags != flags) + || (viport->new_mtu != mtu)) { + viport->new_flags = flags; + viport->new_mtu = mtu; + viport->updates |= NEED_LINK_CONFIG; + if (viport->features_supported & VNIC_FEAT_INBOUND_IB_MC) { + if (((viport->mtu <= MCAST_MSG_SIZE) && (mtu > MCAST_MSG_SIZE)) || + ((viport->mtu > MCAST_MSG_SIZE) && (mtu <= MCAST_MSG_SIZE))) { + /* + * MTU value will enable/disable the multicast. In + * either case, need to send the CMD_CONFIG_ADDRESS2 to + * EVIC. Hence, setting the NEED_ADDRESS_CONFIG flag. 
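+ * For example, assuming MCAST_MSG_SIZE lies between the two values,
+ * an MTU change from 1500 to 9000 disables inbound IB multicast and
+ * a change back from 9000 to 1500 re-enables it; in both cases the
+ * address table is resent so the EVIC can update the MGIDs.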
+ */ + viport->updates |= NEED_ADDRESS_CONFIG; + if (mtu <= MCAST_MSG_SIZE) { + VIPORT_PRINT("%s: MTU changed; " + "old:%d new:%d (threshold:%d);" + " MULTICAST will be enabled.\n", + config_viport_name(viport->config), + viport->mtu, mtu, + (int)MCAST_MSG_SIZE); + } else { + VIPORT_PRINT("%s: MTU changed; " + "old:%d new:%d (threshold:%d); " + "MULTICAST will be disabled.\n", + config_viport_name(viport->config), + viport->mtu, mtu, + (int)MCAST_MSG_SIZE); + } + /* When we resend these addresses, EVIC will + * send mgid=0 back in response. So no need to + * shutoff ib_multicast. + */ + for (i = MCAST_ADDR_START; i < viport->num_mac_addresses; i++) { + if (viport->mac_addresses[i].valid) + viport->mac_addresses[i].operation = VNIC_OP_SET_ENTRY; + } + } + } + viport_kick(viport); + } + + spin_unlock_irqrestore(&viport->lock, localflags); + return; +failure: + viport_failure(viport); +} + +int viport_set_unicast(struct viport *viport, u8 *address) +{ + unsigned long flags; + int ret = -1; + VIPORT_FUNCTION("viport_set_unicast()\n"); + spin_lock_irqsave(&viport->lock, flags); + + if (!viport->mac_addresses) + goto out; + + if (memcmp(viport->mac_addresses[UNICAST_ADDR].address, + address, ETH_ALEN)) { + memcpy(viport->mac_addresses[UNICAST_ADDR].address, + address, ETH_ALEN); + viport->mac_addresses[UNICAST_ADDR].operation + = VNIC_OP_SET_ENTRY; + viport->updates |= NEED_ADDRESS_CONFIG; + viport_kick(viport); + } + ret = 0; +out: + spin_unlock_irqrestore(&viport->lock, flags); + return ret; +} + +int viport_set_multicast(struct viport *viport, + struct dev_mc_list *mc_list, int mc_count) +{ + u32 old_update_list; + int i; + int ret = -1; + unsigned long flags; + + VIPORT_FUNCTION("viport_set_multicast()\n"); + spin_lock_irqsave(&viport->lock, flags); + + if (!viport->mac_addresses) + goto out; + + old_update_list = viport->updates; + if (mc_count > viport->num_mac_addresses - MCAST_ADDR_START) + viport->updates |= NEED_LINK_CONFIG | MCAST_OVERFLOW; + else { + if (mc_count == 0) { + ret = 0; + goto out; + } + if (viport->updates & MCAST_OVERFLOW) { + viport->updates &= ~MCAST_OVERFLOW; + viport->updates |= NEED_LINK_CONFIG; + } + for (i = MCAST_ADDR_START; i < mc_count + MCAST_ADDR_START; + i++, mc_list = mc_list->next) { + if (viport->mac_addresses[i].valid && + !memcmp(viport->mac_addresses[i].address, + mc_list->dmi_addr, ETH_ALEN)) + continue; + memcpy(viport->mac_addresses[i].address, + mc_list->dmi_addr, ETH_ALEN); + viport->mac_addresses[i].valid = 1; + viport->mac_addresses[i].operation = VNIC_OP_SET_ENTRY; + } + for (; i < viport->num_mac_addresses; i++) { + if (!viport->mac_addresses[i].valid) + continue; + viport->mac_addresses[i].valid = 0; + viport->mac_addresses[i].operation = VNIC_OP_SET_ENTRY; + } + if (mc_count) + viport->updates |= NEED_ADDRESS_CONFIG; + } + + if (viport->updates != old_update_list) + viport_kick(viport); + ret = 0; +out: + spin_unlock_irqrestore(&viport->lock, flags); + return ret; +} + +static inline void viport_disable_multicast(struct viport *viport) +{ + VIPORT_INFO("turned off IB_MULTICAST\n"); + viport->config->control_config.ib_multicast = 0; + viport->config->control_config.ib_config.conn_data.features_supported &= + __constant_cpu_to_be32((u32)~VNIC_FEAT_INBOUND_IB_MC); + viport->link_state = LINK_RESET; +} + +void viport_get_stats(struct viport *viport, + struct net_device_stats *stats) +{ + unsigned long flags; + + VIPORT_FUNCTION("viport_get_stats()\n"); + /* Reference count has been already incremented indicating + * that viport structure is 
being used, which prevents its + * freeing when this task sleeps + */ + if (time_after(jiffies, + (viport->last_stats_time + viport->config->stats_interval))) { + + spin_lock_irqsave(&viport->lock, flags); + viport->updates |= NEED_STATS; + spin_unlock_irqrestore(&viport->lock, flags); + viport_kick(viport); + wait_event(viport->stats_queue, + !(viport->updates & NEED_STATS) + || (viport->disconnect == 1)); + + if (viport->stats.ethernet_status) + vnic_link_up(viport->vnic, viport->parent); + else + vnic_link_down(viport->vnic, viport->parent); + } + + stats->rx_packets = be64_to_cpu(viport->stats.if_in_ok); + stats->tx_packets = be64_to_cpu(viport->stats.if_out_ok); + stats->rx_bytes = be64_to_cpu(viport->stats.if_in_octets); + stats->tx_bytes = be64_to_cpu(viport->stats.if_out_octets); + stats->rx_errors = be64_to_cpu(viport->stats.if_in_errors); + stats->tx_errors = be64_to_cpu(viport->stats.if_out_errors); + stats->rx_dropped = 0; /* EIOC doesn't track */ + stats->tx_dropped = 0; /* EIOC doesn't track */ + stats->multicast = be64_to_cpu(viport->stats.if_in_nucast_pkts); + stats->collisions = 0; /* EIOC doesn't track */ +} + +int viport_xmit_packet(struct viport *viport, struct sk_buff *skb) +{ + int status = -1; + unsigned long flags; + + VIPORT_FUNCTION("viport_xmit_packet()\n"); + spin_lock_irqsave(&viport->lock, flags); + if (viport->state == VIPORT_CONNECTED) + status = data_xmit_packet(&viport->data, skb); + spin_unlock_irqrestore(&viport->lock, flags); + + return status; +} + +void viport_kick(struct viport *viport) +{ + unsigned long flags; + + VIPORT_FUNCTION("viport_kick()\n"); + spin_lock_irqsave(&viport_list_lock, flags); + if (list_empty(&viport->list_ptrs)) { + list_add_tail(&viport->list_ptrs, &viport_list); + wake_up(&viport_queue); + } + spin_unlock_irqrestore(&viport_list_lock, flags); +} + +void viport_failure(struct viport *viport) +{ + unsigned long flags; + + VIPORT_FUNCTION("viport_failure()\n"); + vnic_stop_xmit(viport->vnic, viport->parent); + spin_lock_irqsave(&viport_list_lock, flags); + viport->errored = 1; + if (list_empty(&viport->list_ptrs)) { + list_add_tail(&viport->list_ptrs, &viport_list); + wake_up(&viport_queue); + } + spin_unlock_irqrestore(&viport_list_lock, flags); +} + +static void viport_timeout(unsigned long data) +{ + struct viport *viport; + + VIPORT_FUNCTION("viport_timeout()\n"); + viport = (struct viport *)data; + viport->timer_active = 0; + viport_kick(viport); +} + +static void viport_timer(struct viport *viport, int timeout) +{ + VIPORT_FUNCTION("viport_timer()\n"); + if (viport->timer_active) + del_timer(&viport->timer); + init_timer(&viport->timer); + viport->timer.expires = jiffies + timeout; + viport->timer.data = (unsigned long)viport; + viport->timer.function = viport_timeout; + viport->timer_active = 1; + add_timer(&viport->timer); +} + +static void viport_timer_stop(struct viport *viport) +{ + VIPORT_FUNCTION("viport_timer_stop()\n"); + if (viport->timer_active) + del_timer(&viport->timer); + viport->timer_active = 0; +} + +static int viport_init_mac_addresses(struct viport *viport) +{ + struct vnic_address_op2 *temp; + unsigned long flags; + int i; + + VIPORT_FUNCTION("viport_init_mac_addresses()\n"); + i = viport->num_mac_addresses * sizeof *temp; + temp = kzalloc(viport->num_mac_addresses * sizeof *temp, + GFP_KERNEL); + if (!temp) { + VIPORT_ERROR("failed allocating MAC address table\n"); + return -ENOMEM; + } + + spin_lock_irqsave(&viport->lock, flags); + viport->mac_addresses = temp; + for (i = 0; i < 
viport->num_mac_addresses; i++) { + viport->mac_addresses[i].index = cpu_to_be16(i); + viport->mac_addresses[i].vlan = + cpu_to_be16(viport->default_vlan); + } + memset(viport->mac_addresses[BROADCAST_ADDR].address, + 0xFF, ETH_ALEN); + viport->mac_addresses[BROADCAST_ADDR].valid = 1; + memcpy(viport->mac_addresses[UNICAST_ADDR].address, + viport->hw_mac_address, ETH_ALEN); + viport->mac_addresses[UNICAST_ADDR].valid = 1; + + spin_unlock_irqrestore(&viport->lock, flags); + + return 0; +} + +static inline void viport_match_mac_address(struct vnic *vnic, + struct viport *viport) +{ + if (vnic && vnic->current_path && + viport == vnic->current_path->viport && + vnic->mac_set && + memcmp(vnic->netdevice->dev_addr, viport->hw_mac_address, ETH_ALEN)) { + VIPORT_ERROR("*** ERROR MAC address mismatch; " + "current = %02x:%02x:%02x:%02x:%02x:%02x " + "From EVIC = %02x:%02x:%02x:%02x:%02x:%02x\n", + vnic->netdevice->dev_addr[0], + vnic->netdevice->dev_addr[1], + vnic->netdevice->dev_addr[2], + vnic->netdevice->dev_addr[3], + vnic->netdevice->dev_addr[4], + vnic->netdevice->dev_addr[5], + viport->hw_mac_address[0], + viport->hw_mac_address[1], + viport->hw_mac_address[2], + viport->hw_mac_address[3], + viport->hw_mac_address[4], + viport->hw_mac_address[5]); + } +} + +static int viport_handle_init_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_UNINITIALIZED: + LINK_STATE("state LINK_UNINITIALIZED\n"); + viport->updates = 0; + spin_lock_irq(&viport_list_lock); + list_del_init(&viport->list_ptrs); + spin_unlock_irq(&viport_list_lock); + if (atomic_read(&viport->reference_count)) { + wake_up(&viport->stats_queue); + wait_event(viport->reference_queue, + atomic_read(&viport->reference_count) == 0); + } + /* No more references to viport structure + * so it is safe to delete it by waking disconnect + * queue + */ + + viport->disconnect = 0; + wake_up(&viport->disconnect_queue); + break; + case LINK_INITIALIZE: + LINK_STATE("state LINK_INITIALIZE\n"); + viport->errored = 0; + viport->connect = WAIT; + viport->last_stats_time = 0; + if (viport->disconnect) + viport->link_state = LINK_UNINITIALIZED; + else + viport->link_state = LINK_INITIALIZECONTROL; + break; + case LINK_INITIALIZECONTROL: + LINK_STATE("state LINK_INITIALIZECONTROL\n"); + viport->pd = ib_alloc_pd(viport->config->ibdev); + if (IS_ERR(viport->pd)) + viport->link_state = LINK_DISCONNECTED; + else if (control_init(&viport->control, viport, + &viport->config->control_config, + viport->pd)) { + ib_dealloc_pd(viport->pd); + viport->link_state = LINK_DISCONNECTED; + + } else + viport->link_state = LINK_INITIALIZEDATA; + break; + case LINK_INITIALIZEDATA: + LINK_STATE("state LINK_INITIALIZEDATA\n"); + if (data_init(&viport->data, viport, + &viport->config->data_config, + viport->pd)) + viport->link_state = LINK_CLEANUPCONTROL; + else + viport->link_state = LINK_CONTROLCONNECT; + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_control_states(struct viport *viport) +{ + enum link_state old_state; + struct vnic *vnic; + + do { + switch (old_state = viport->link_state) { + case LINK_CONTROLCONNECT: + if (vnic_ib_cm_connect(&viport->control.ib_conn)) + viport->link_state = LINK_CLEANUPDATA; + else + viport->link_state = LINK_CONTROLCONNECTWAIT; + break; + case LINK_CONTROLCONNECTWAIT: + LINK_STATE("state LINK_CONTROLCONNECTWAIT\n"); + if (control_is_connected(&viport->control)) + viport->link_state = 
LINK_INITVNICREQ; + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_CONTROLDISCONNECT; + } + break; + case LINK_INITVNICREQ: + LINK_STATE("state LINK_INITVNICREQ\n"); + if (control_init_vnic_req(&viport->control)) + viport->link_state = LINK_RESETCONTROL; + else + viport->link_state = LINK_INITVNICRSP; + break; + case LINK_INITVNICRSP: + LINK_STATE("state LINK_INITVNICRSP\n"); + control_process_async(&viport->control); + + if (!control_init_vnic_rsp(&viport->control, + &viport->features_supported, + viport->hw_mac_address, + &viport->num_mac_addresses, + &viport->default_vlan)) { + if (viport_init_mac_addresses(viport)) + viport->link_state = + LINK_RESETCONTROL; + else { + viport->link_state = + LINK_BEGINDATAPATH; + /* + * Ensure that the current path's MAC + * address matches the one returned by + * EVIC - we've had cases of mismatch + * which then caused havoc. + */ + vnic = viport->parent->parent; + viport_match_mac_address(vnic, viport); + } + } + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESETCONTROL; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_data_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_BEGINDATAPATH: + LINK_STATE("state LINK_BEGINDATAPATH\n"); + viport->link_state = LINK_CONFIGDATAPATHREQ; + break; + case LINK_CONFIGDATAPATHREQ: + LINK_STATE("state LINK_CONFIGDATAPATHREQ\n"); + if (control_config_data_path_req(&viport->control, + data_path_id(&viport-> + data), + data_host_pool_max + (&viport->data), + data_eioc_pool_max + (&viport->data))) + viport->link_state = LINK_RESETCONTROL; + else + viport->link_state = LINK_CONFIGDATAPATHRSP; + break; + case LINK_CONFIGDATAPATHRSP: + LINK_STATE("state LINK_CONFIGDATAPATHRSP\n"); + control_process_async(&viport->control); + + if (!control_config_data_path_rsp(&viport->control, + data_host_pool + (&viport->data), + data_eioc_pool + (&viport->data), + data_host_pool_max + (&viport->data), + data_eioc_pool_max + (&viport->data), + data_host_pool_min + (&viport->data), + data_eioc_pool_min + (&viport->data))) + viport->link_state = LINK_DATACONNECT; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESETCONTROL; + } + break; + case LINK_DATACONNECT: + LINK_STATE("state LINK_DATACONNECT\n"); + if (data_connect(&viport->data)) + viport->link_state = LINK_RESETCONTROL; + else + viport->link_state = LINK_DATACONNECTWAIT; + break; + case LINK_DATACONNECTWAIT: + LINK_STATE("state LINK_DATACONNECTWAIT\n"); + control_process_async(&viport->control); + if (data_is_connected(&viport->data)) + viport->link_state = LINK_XCHGPOOLREQ; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_xchgpool_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_XCHGPOOLREQ: + LINK_STATE("state LINK_XCHGPOOLREQ\n"); + if (control_exchange_pools_req(&viport->control, + data_local_pool_addr + (&viport->data), + data_local_pool_rkey + (&viport->data))) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_XCHGPOOLRSP; + break; + case LINK_XCHGPOOLRSP: + LINK_STATE("state LINK_XCHGPOOLRSP\n"); + control_process_async(&viport->control); + + if 
(!control_exchange_pools_rsp(&viport->control, + data_remote_pool_addr + (&viport->data), + data_remote_pool_rkey + (&viport->data))) + viport->link_state = LINK_INITIALIZED; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + case LINK_INITIALIZED: + LINK_STATE("state LINK_INITIALIZED\n"); + viport->state = VIPORT_CONNECTED; + printk(KERN_INFO PFX + "%s: connection established\n", + config_viport_name(viport->config)); + data_connected(&viport->data); + vnic_connected(viport->parent->parent, + viport->parent); + if (viport->features_supported & VNIC_FEAT_INBOUND_IB_MC) { + printk(KERN_INFO PFX "%s: Supports Inbound IB " + "Multicast\n", + config_viport_name(viport->config)); + if (mc_data_init(&viport->mc_data, viport, + &viport->config->data_config, + viport->pd)) { + viport_disable_multicast(viport); + break; + } + } + spin_lock_irq(&viport->lock); + viport->mtu = 1500; + viport->flags = 0; + if ((viport->mtu != viport->new_mtu) || + (viport->flags != viport->new_flags)) + viport->updates |= NEED_LINK_CONFIG; + spin_unlock_irq(&viport->lock); + viport->link_state = LINK_IDLE; + viport->retry_duration = 0; + viport->total_retry_duration = 0; + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_idle_states(struct viport *viport) +{ + enum link_state old_state; + int handle_mc_join_compl, handle_mc_join; + + do { + switch (old_state = viport->link_state) { + case LINK_IDLE: + LINK_STATE("state LINK_IDLE\n"); + if (viport->config->hb_interval) + viport_timer(viport, + viport->config->hb_interval); + viport->link_state = LINK_IDLING; + break; + case LINK_IDLING: + LINK_STATE("state LINK_IDLING\n"); + control_process_async(&viport->control); + if (viport->errored) { + viport_timer_stop(viport); + viport->errored = 0; + viport->link_state = LINK_RESET; + break; + } + + spin_lock_irq(&viport->lock); + handle_mc_join = (viport->updates & NEED_MCAST_JOIN); + handle_mc_join_compl = + (viport->updates & NEED_MCAST_COMPLETION); + /* + * Turn off both flags, the handler functions will + * rearm them if necessary. 
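+ * (vnic_mc_join() and vnic_mc_join_handle_completion(), called just
+ * below, are the handlers that set these flags again when the join
+ * still has work pending.)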
+ */ + viport->updates &= ~(NEED_MCAST_JOIN | NEED_MCAST_COMPLETION); + + if (viport->updates & NEED_LINK_CONFIG) { + viport_timer_stop(viport); + viport->link_state = LINK_CONFIGLINKREQ; + } else if (viport->updates & NEED_ADDRESS_CONFIG) { + viport_timer_stop(viport); + viport->link_state = LINK_CONFIGADDRSREQ; + } else if (viport->updates & NEED_STATS) { + viport_timer_stop(viport); + viport->link_state = LINK_REPORTSTATREQ; + } else if (viport->config->hb_interval) { + if (!viport->timer_active) + viport->link_state = + LINK_HEARTBEATREQ; + } + spin_unlock_irq(&viport->lock); + if (handle_mc_join) { + if (vnic_mc_join(viport)) + viport_disable_multicast(viport); + } + if (handle_mc_join_compl) + vnic_mc_join_handle_completion(viport); + + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_config_states(struct viport *viport) +{ + enum link_state old_state; + int res; + + do { + switch (old_state = viport->link_state) { + case LINK_CONFIGLINKREQ: + LINK_STATE("state LINK_CONFIGLINKREQ\n"); + spin_lock_irq(&viport->lock); + viport->updates &= ~NEED_LINK_CONFIG; + viport->flags = viport->new_flags; + if (viport->updates & MCAST_OVERFLOW) + viport->flags |= IFF_ALLMULTI; + viport->mtu = viport->new_mtu; + spin_unlock_irq(&viport->lock); + if (control_config_link_req(&viport->control, + viport->flags, + viport->mtu)) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_CONFIGLINKRSP; + break; + case LINK_CONFIGLINKRSP: + LINK_STATE("state LINK_CONFIGLINKRSP\n"); + control_process_async(&viport->control); + + if (!control_config_link_rsp(&viport->control, + &viport->flags, + &viport->mtu)) + viport->link_state = LINK_IDLE; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + case LINK_CONFIGADDRSREQ: + LINK_STATE("state LINK_CONFIGADDRSREQ\n"); + + spin_lock_irq(&viport->lock); + res = control_config_addrs_req(&viport->control, + viport->mac_addresses, + viport-> + num_mac_addresses); + + if (res > 0) { + viport->updates &= ~NEED_ADDRESS_CONFIG; + viport->link_state = LINK_CONFIGADDRSRSP; + } else if (res == 0) + viport->link_state = LINK_CONFIGADDRSRSP; + else + viport->link_state = LINK_RESET; + spin_unlock_irq(&viport->lock); + break; + case LINK_CONFIGADDRSRSP: + LINK_STATE("state LINK_CONFIGADDRSRSP\n"); + control_process_async(&viport->control); + + if (!control_config_addrs_rsp(&viport->control)) + viport->link_state = LINK_IDLE; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_stat_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_REPORTSTATREQ: + LINK_STATE("state LINK_REPORTSTATREQ\n"); + if (control_report_statistics_req(&viport->control)) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_REPORTSTATRSP; + break; + case LINK_REPORTSTATRSP: + LINK_STATE("state LINK_REPORTSTATRSP\n"); + control_process_async(&viport->control); + + spin_lock_irq(&viport->lock); + if (control_report_statistics_rsp(&viport->control, + &viport->stats) == 0) { + viport->updates &= ~NEED_STATS; + viport->last_stats_time = jiffies; + wake_up(&viport->stats_queue); + viport->link_state = LINK_IDLE; + } + + spin_unlock_irq(&viport->lock); + + if (viport->errored) { + viport->errored = 0; + viport->link_state = 
LINK_RESET; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_heartbeat_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_HEARTBEATREQ: + LINK_STATE("state LINK_HEARTBEATREQ\n"); + if (control_heartbeat_req(&viport->control, + viport->config->hb_timeout)) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_HEARTBEATRSP; + break; + case LINK_HEARTBEATRSP: + LINK_STATE("state LINK_HEARTBEATRSP\n"); + control_process_async(&viport->control); + + if (!control_heartbeat_rsp(&viport->control)) + viport->link_state = LINK_IDLE; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_reset_states(struct viport *viport) +{ + enum link_state old_state; + int handle_mc_join_compl = 0, handle_mc_join = 0; + + do { + switch (old_state = viport->link_state) { + case LINK_RESET: + LINK_STATE("state LINK_RESET\n"); + viport->errored = 0; + spin_lock_irq(&viport->lock); + viport->state = VIPORT_DISCONNECTED; + /* + * Turn off both flags, the handler functions will + * rearm them if necessary + */ + viport->updates &= ~(NEED_MCAST_JOIN | NEED_MCAST_COMPLETION); + + spin_unlock_irq(&viport->lock); + vnic_link_down(viport->vnic, viport->parent); + printk(KERN_INFO PFX + "%s: connection lost\n", + config_viport_name(viport->config)); + if (handle_mc_join) { + if (vnic_mc_join(viport)) + viport_disable_multicast(viport); + } + if (handle_mc_join_compl) + vnic_mc_join_handle_completion(viport); + if (viport->features_supported & VNIC_FEAT_INBOUND_IB_MC) { + vnic_mc_leave(viport); + vnic_mc_data_cleanup(&viport->mc_data); + } + + if (control_reset_req(&viport->control)) + viport->link_state = LINK_DATADISCONNECT; + else + viport->link_state = LINK_RESETRSP; + break; + case LINK_RESETRSP: + LINK_STATE("state LINK_RESETRSP\n"); + control_process_async(&viport->control); + + if (!control_reset_rsp(&viport->control)) + viport->link_state = LINK_DATADISCONNECT; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_DATADISCONNECT; + } + break; + case LINK_RESETCONTROL: + LINK_STATE("state LINK_RESETCONTROL\n"); + if (control_reset_req(&viport->control)) + viport->link_state = LINK_CONTROLDISCONNECT; + else + viport->link_state = LINK_RESETCONTROLRSP; + break; + case LINK_RESETCONTROLRSP: + LINK_STATE("state LINK_RESETCONTROLRSP\n"); + control_process_async(&viport->control); + + if (!control_reset_rsp(&viport->control)) + viport->link_state = LINK_CONTROLDISCONNECT; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_CONTROLDISCONNECT; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_disconn_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_DATADISCONNECT: + LINK_STATE("state LINK_DATADISCONNECT\n"); + data_disconnect(&viport->data); + viport->link_state = LINK_CONTROLDISCONNECT; + break; + case LINK_CONTROLDISCONNECT: + LINK_STATE("state LINK_CONTROLDISCONNECT\n"); + viport->link_state = LINK_CLEANUPDATA; + break; + case LINK_CLEANUPDATA: + LINK_STATE("state LINK_CLEANUPDATA\n"); + data_cleanup(&viport->data); + viport->link_state = LINK_CLEANUPCONTROL; + break; + case 
LINK_CLEANUPCONTROL: + LINK_STATE("state LINK_CLEANUPCONTROL\n"); + spin_lock_irq(&viport->lock); + kfree(viport->mac_addresses); + viport->mac_addresses = NULL; + spin_unlock_irq(&viport->lock); + control_cleanup(&viport->control); + ib_dealloc_pd(viport->pd); + viport->link_state = LINK_DISCONNECTED; + break; + case LINK_DISCONNECTED: + LINK_STATE("state LINK_DISCONNECTED\n"); + vnic_disconnected(viport->parent->parent, + viport->parent); + if (viport->disconnect != 0) + viport->link_state = LINK_UNINITIALIZED; + else if (viport->retry == 1) { + viport->retry = 0; + /* + * Check if the initial retry interval has crossed + * 20 seconds. + * The retry interval is initially 5 seconds which + * is incremented by 5. Once it is 20 the interval + * is fixed to 20 seconds till 10 minutes, + * after which retrying is stopped + */ + if (viport->retry_duration < MAX_RETRY_INTERVAL) + viport->retry_duration += + RETRY_INCREMENT; + + viport->total_retry_duration += + viport->retry_duration; + + if (viport->total_retry_duration >= + MAX_CONNECT_RETRY_TIMEOUT) { + viport->link_state = LINK_UNINITIALIZED; + printk("Timed out after retrying" + " for retry_duration %d msecs\n" + , viport->total_retry_duration); + } else { + viport->connect = DELAY; + viport->link_state = LINK_RETRYWAIT; + } + viport_timer(viport, + msecs_to_jiffies(viport->retry_duration)); + } else { + u32 duration = 5000 + ((net_random()) & 0x1FF); + if (!viport->parent->is_primary_path) + duration += 0x1ff; + viport_timer(viport, + msecs_to_jiffies(duration)); + viport->connect = DELAY; + viport->link_state = LINK_RETRYWAIT; + } + break; + case LINK_RETRYWAIT: + LINK_STATE("state LINK_RETRYWAIT\n"); + viport->stats.ethernet_status = 0; + viport->updates = 0; + wake_up(&viport->stats_queue); + if (viport->disconnect != 0) { + viport_timer_stop(viport); + viport->link_state = LINK_UNINITIALIZED; + } else if (viport->connect == DELAY) { + if (!viport->timer_active) + viport->link_state = LINK_INITIALIZE; + } else if (viport->connect == NOW) { + viport_timer_stop(viport); + viport->link_state = LINK_INITIALIZE; + } + break; + case LINK_FIRSTCONNECT: + viport->stats.ethernet_status = 0; + viport->updates = 0; + wake_up(&viport->stats_queue); + if (viport->disconnect != 0) { + viport_timer_stop(viport); + viport->link_state = LINK_UNINITIALIZED; + } + + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_statemachine(void *context) +{ + struct viport *viport; + enum link_state old_link_state; + + VIPORT_FUNCTION("viport_statemachine()\n"); + while (!viport_thread_end || !list_empty(&viport_list)) { + wait_event_interruptible(viport_queue, + !list_empty(&viport_list) + || viport_thread_end); + spin_lock_irq(&viport_list_lock); + if (list_empty(&viport_list)) { + spin_unlock_irq(&viport_list_lock); + continue; + } + viport = list_entry(viport_list.next, struct viport, + list_ptrs); + list_del_init(&viport->list_ptrs); + spin_unlock_irq(&viport_list_lock); + + do { + old_link_state = viport->link_state; + + /* + * Optimize for the state machine steady state + * by checking for the most common states first. 
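+ * Once a viport is connected it spends nearly all of its time in
+ * the idle, heartbeat, statistics and link-config states, so those
+ * handlers are tried before the setup and teardown handlers.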
+ * + */ + if (viport_handle_idle_states(viport) == 0) + break; + if (viport_handle_heartbeat_states(viport) == 0) + break; + if (viport_handle_stat_states(viport) == 0) + break; + if (viport_handle_config_states(viport) == 0) + break; + + if (viport_handle_init_states(viport) == 0) + break; + if (viport_handle_control_states(viport) == 0) + break; + if (viport_handle_data_states(viport) == 0) + break; + if (viport_handle_xchgpool_states(viport) == 0) + break; + if (viport_handle_reset_states(viport) == 0) + break; + if (viport_handle_disconn_states(viport) == 0) + break; + } while (viport->link_state != old_link_state); + } + + complete_and_exit(&viport_thread_exit, 0); +} + +int viport_start(void) +{ + VIPORT_FUNCTION("viport_start()\n"); + + spin_lock_init(&viport_list_lock); + viport_thread = kthread_run(viport_statemachine, NULL, + "qlgc_vnic_viport_s_m"); + if (IS_ERR(viport_thread)) { + printk(KERN_WARNING PFX "Could not create viport_thread;" + " error %d\n", (int) PTR_ERR(viport_thread)); + viport_thread = NULL; + return 1; + } + + return 0; +} + +void viport_cleanup(void) +{ + VIPORT_FUNCTION("viport_cleanup()\n"); + if (viport_thread) { + viport_thread_end = 1; + wake_up(&viport_queue); + wait_for_completion(&viport_thread_exit); + viport_thread = NULL; + } +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h new file mode 100644 index 0000000..6d36181 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h @@ -0,0 +1,176 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_VIPORT_H_INCLUDED +#define VNIC_VIPORT_H_INCLUDED + +#include "vnic_control.h" +#include "vnic_data.h" +#include "vnic_multicast.h" + +enum viport_state { + VIPORT_DISCONNECTED = 0, + VIPORT_CONNECTED = 1 +}; + +enum link_state { + LINK_UNINITIALIZED = 0, + LINK_INITIALIZE = 1, + LINK_INITIALIZECONTROL = 2, + LINK_INITIALIZEDATA = 3, + LINK_CONTROLCONNECT = 4, + LINK_CONTROLCONNECTWAIT = 5, + LINK_INITVNICREQ = 6, + LINK_INITVNICRSP = 7, + LINK_BEGINDATAPATH = 8, + LINK_CONFIGDATAPATHREQ = 9, + LINK_CONFIGDATAPATHRSP = 10, + LINK_DATACONNECT = 11, + LINK_DATACONNECTWAIT = 12, + LINK_XCHGPOOLREQ = 13, + LINK_XCHGPOOLRSP = 14, + LINK_INITIALIZED = 15, + LINK_IDLE = 16, + LINK_IDLING = 17, + LINK_CONFIGLINKREQ = 18, + LINK_CONFIGLINKRSP = 19, + LINK_CONFIGADDRSREQ = 20, + LINK_CONFIGADDRSRSP = 21, + LINK_REPORTSTATREQ = 22, + LINK_REPORTSTATRSP = 23, + LINK_HEARTBEATREQ = 24, + LINK_HEARTBEATRSP = 25, + LINK_RESET = 26, + LINK_RESETRSP = 27, + LINK_RESETCONTROL = 28, + LINK_RESETCONTROLRSP = 29, + LINK_DATADISCONNECT = 30, + LINK_CONTROLDISCONNECT = 31, + LINK_CLEANUPDATA = 32, + LINK_CLEANUPCONTROL = 33, + LINK_DISCONNECTED = 34, + LINK_RETRYWAIT = 35, + LINK_FIRSTCONNECT = 36 +}; + +enum { + BROADCAST_ADDR = 0, + UNICAST_ADDR = 1, + MCAST_ADDR_START = 2 +}; + +#define current_mac_address mac_addresses[UNICAST_ADDR].address + +enum { + NEED_STATS = 0x00000001, + NEED_ADDRESS_CONFIG = 0x00000002, + NEED_LINK_CONFIG = 0x00000004, + MCAST_OVERFLOW = 0x00000008, + NEED_MCAST_COMPLETION = 0x00000010, + NEED_MCAST_JOIN = 0x00000020 +}; + +struct viport { + struct list_head list_ptrs; + struct netpath *parent; + struct vnic *vnic; + struct viport_config *config; + struct control control; + struct data data; + spinlock_t lock; + struct ib_pd *pd; + enum viport_state state; + enum link_state link_state; + struct vnic_cmd_report_stats_rsp stats; + wait_queue_head_t stats_queue; + unsigned long last_stats_time; + u32 features_supported; + u8 hw_mac_address[ETH_ALEN]; + u16 default_vlan; + u16 num_mac_addresses; + struct vnic_address_op2 *mac_addresses; + u32 updates; + u16 flags; + u16 new_flags; + u16 mtu; + u16 new_mtu; + u32 errored; + enum { WAIT, DELAY, NOW } connect; + u32 disconnect; + u32 retry; + wait_queue_head_t disconnect_queue; + int timer_active; + struct timer_list timer; + u32 retry_duration; + u32 total_retry_duration; + atomic_t reference_count; + wait_queue_head_t reference_queue; + struct mc_info mc_info; + struct mc_data mc_data; +}; + +int viport_start(void); +void viport_cleanup(void); + +struct viport *viport_allocate(struct viport_config *config); +void viport_free(struct viport *viport); + +void viport_connect(struct viport *viport, int delay); +void viport_disconnect(struct viport *viport); + +void viport_set_link(struct viport *viport, u16 flags, u16 mtu); +void viport_get_stats(struct viport *viport, + struct net_device_stats *stats); +int viport_xmit_packet(struct viport *viport, struct sk_buff *skb); +void viport_kick(struct viport *viport); + +void viport_failure(struct viport *viport); + +int viport_set_unicast(struct viport *viport, u8 *address); +int viport_set_multicast(struct viport *viport, + struct dev_mc_list *mc_list, + int mc_count); + +#define viport_max_mtu(viport) data_max_mtu(&(viport)->data) + +#define viport_get_hw_addr(viport, address) \ + memcpy(address, (viport)->hw_mac_address, ETH_ALEN) + +#define viport_features(viport) ((viport)->features_supported) + +#define viport_can_tx_csum(viport) \ + (((viport)->features_supported & \ + 
(VNIC_FEAT_IPV4_CSUM_TX | VNIC_FEAT_TCP_CSUM_TX | \ + VNIC_FEAT_UDP_CSUM_TX)) == (VNIC_FEAT_IPV4_CSUM_TX | \ + VNIC_FEAT_TCP_CSUM_TX | VNIC_FEAT_UDP_CSUM_TX)) + +#endif /* VNIC_VIPORT_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:33:28 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:03:28 +0530 Subject: [ofa-general] [PATCH v2 04/13] QLogic VNIC: Implementation of Control path of communication protocol In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103328.12355.6429.stgit@localhost.localdomain> From: Poornima Kamath This patch adds the files that define the control packet formats and implements various control messages that are exchanged as part of the communication protocol with the EVIC/VEx. Signed-off-by: Poornima Kamath Signed-off-by: Ramachandra K Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_control.c | 2286 ++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_control.h | 179 ++ .../infiniband/ulp/qlgc_vnic/vnic_control_pkt.h | 368 +++ 3 files changed, 2833 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_control.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_control.c new file mode 100644 index 0000000..774a071 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_control.c @@ -0,0 +1,2286 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_viport.h" +#include "vnic_stats.h" + +#define vnic_multicast_address(rsp2_address, index) \ + ((rsp2_address)->list_address_ops[index].address[0] & 0x01) + +static void control_log_control_packet(struct vnic_control_packet *pkt); + +char *control_ifcfg_name(struct control *control) +{ + if (!control) + return "nctl"; + if (!control->parent) + return "np"; + if (!control->parent->parent) + return "npp"; + if (!control->parent->parent->parent) + return "nppp"; + if (!control->parent->parent->parent->config) + return "npppc"; + return (control->parent->parent->parent->config->name); +} + +static void control_recv(struct control *control, struct recv_io *recv_io) +{ + if (vnic_ib_post_recv(&control->ib_conn, &recv_io->io)) + viport_failure(control->parent); +} + +static void control_recv_complete(struct io *io) +{ + struct recv_io *recv_io = (struct recv_io *)io; + struct recv_io *last_recv_io; + struct control *control = &io->viport->control; + struct vnic_control_packet *pkt = control_packet(recv_io); + struct vnic_control_header *c_hdr = &pkt->hdr; + unsigned long flags; + cycles_t response_time; + + CONTROL_FUNCTION("%s: control_recv_complete() State=%d\n", + control_ifcfg_name(control), control->req_state); + + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + control_note_rsptime_stats(&response_time); + CONTROL_PACKET(pkt); + spin_lock_irqsave(&control->io_lock, flags); + if (c_hdr->pkt_type == TYPE_INFO) { + last_recv_io = control->info; + control->info = recv_io; + spin_unlock_irqrestore(&control->io_lock, flags); + viport_kick(control->parent); + if (last_recv_io) + control_recv(control, last_recv_io); + } else if (c_hdr->pkt_type == TYPE_RSP) { + u8 repost = 0; + u8 fail = 0; + u8 kick = 0; + + switch (control->req_state) { + case REQ_INACTIVE: + case RSP_RECEIVED: + case REQ_COMPLETED: + CONTROL_ERROR("%s: Unexpected control" + "response received: CMD = %d\n", + control_ifcfg_name(control), + c_hdr->pkt_cmd); + control_log_control_packet(pkt); + control->req_state = REQ_FAILED; + fail = 1; + break; + case REQ_POSTED: + case REQ_SENT: + if (c_hdr->pkt_cmd != control->last_cmd + || c_hdr->pkt_seq_num != control->seq_num) { + CONTROL_ERROR("%s: Incorrect Control Response " + "received\n", + control_ifcfg_name(control)); + CONTROL_ERROR("%s: Sent control request:\n", + control_ifcfg_name(control)); + control_log_control_packet(control_last_req(control)); + CONTROL_ERROR("%s: Received control response:\n", + control_ifcfg_name(control)); + control_log_control_packet(pkt); + control->req_state = REQ_FAILED; + fail = 1; + } else { + control->response = recv_io; + control_update_rsptime_stats(control, + response_time); + if (control->req_state == REQ_POSTED) { + CONTROL_INFO("%s: Recv CMD RSP %d" + "before Send Completion\n", + control_ifcfg_name(control), + c_hdr->pkt_cmd); + control->req_state = RSP_RECEIVED; + } else { + control->req_state = REQ_COMPLETED; + kick = 1; + } + } + break; + case REQ_FAILED: + /* stay in REQ_FAILED state */ + repost = 1; + break; + } + spin_unlock_irqrestore(&control->io_lock, flags); + /* we must do this outside the lock*/ + if (kick) + viport_kick(control->parent); + if (repost || fail) { + control_recv(control, recv_io); + if (fail) + viport_failure(control->parent); + } + + } else { + list_add_tail(&recv_io->io.list_ptrs, + &control->failure_list); + 
+		spin_unlock_irqrestore(&control->io_lock, flags);
+		viport_kick(control->parent);
+	}
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->recv_dma, control->recv_len,
+				      DMA_FROM_DEVICE);
+}
+
+static void control_timeout(unsigned long data)
+{
+	struct control *control;
+	unsigned long flags;
+	u8 fail = 0;
+	u8 kick = 0;
+
+	control = (struct control *)data;
+	CONTROL_FUNCTION("%s: control_timeout(), State=%d\n",
+			 control_ifcfg_name(control), control->req_state);
+	control->timer_state = TIMER_EXPIRED;
+
+	spin_lock_irqsave(&control->io_lock, flags);
+	switch (control->req_state) {
+	case REQ_INACTIVE:
+		kick = 1;
+		/* stay in REQ_INACTIVE state */
+		break;
+	case REQ_POSTED:
+	case REQ_SENT:
+		control->req_state = REQ_FAILED;
+		CONTROL_ERROR("%s: No send Completion for Cmd=%d\n",
+			      control_ifcfg_name(control), control->last_cmd);
+		control_timeout_stats(control);
+		fail = 1;
+		break;
+	case RSP_RECEIVED:
+		control->req_state = REQ_FAILED;
+		CONTROL_ERROR("%s: No response received from EIOC for Cmd=%d\n",
+			      control_ifcfg_name(control), control->last_cmd);
+		control_timeout_stats(control);
+		fail = 1;
+		break;
+	case REQ_COMPLETED:
+		/* stay in REQ_COMPLETED state */
+		kick = 1;
+		break;
+	case REQ_FAILED:
+		/* stay in REQ_FAILED state */
+		break;
+	}
+	spin_unlock_irqrestore(&control->io_lock, flags);
+	/* we must do this outside the lock */
+	if (fail)
+		viport_failure(control->parent);
+	if (kick)
+		viport_kick(control->parent);
+
+	return;
+}
+
+static void control_timer(struct control *control, int timeout)
+{
+	CONTROL_FUNCTION("%s: control_timer()\n",
+			 control_ifcfg_name(control));
+	if (control->timer_state == TIMER_ACTIVE)
+		mod_timer(&control->timer, jiffies + timeout);
+	else {
+		init_timer(&control->timer);
+		control->timer.expires = jiffies + timeout;
+		control->timer.data = (unsigned long)control;
+		control->timer.function = control_timeout;
+		control->timer_state = TIMER_ACTIVE;
+		add_timer(&control->timer);
+	}
+}
+
+static void control_timer_stop(struct control *control)
+{
+	CONTROL_FUNCTION("%s: control_timer_stop()\n",
+			 control_ifcfg_name(control));
+	if (control->timer_state == TIMER_ACTIVE)
+		del_timer_sync(&control->timer);
+
+	control->timer_state = TIMER_IDLE;
+}
+
+static int control_send(struct control *control, struct send_io *send_io)
+{
+	unsigned long flags;
+	int ret = -1;
+	u8 fail = 0;
+	struct vnic_control_packet *pkt = control_packet(send_io);
+
+	CONTROL_FUNCTION("%s: control_send(), State=%d\n",
+			 control_ifcfg_name(control), control->req_state);
+	spin_lock_irqsave(&control->io_lock, flags);
+	switch (control->req_state) {
+	case REQ_INACTIVE:
+		CONTROL_PACKET(pkt);
+		control_timer(control, control->config->rsp_timeout);
+		control_note_reqtime_stats(control);
+		if (vnic_ib_post_send(&control->ib_conn, &control->send_io.io)) {
+			CONTROL_ERROR("%s: Failed to post send\n",
+				      control_ifcfg_name(control));
+			/* stay in REQ_INACTIVE state */
+			fail = 1;
+		} else {
+			control->last_cmd = pkt->hdr.pkt_cmd;
+			control->req_state = REQ_POSTED;
+			ret = 0;
+		}
+		break;
+	case REQ_POSTED:
+	case REQ_SENT:
+	case RSP_RECEIVED:
+	case REQ_COMPLETED:
+		CONTROL_ERROR("%s: Previous Command is not completed. "
+			      "New CMD: %d Last CMD: %d Seq: %d\n",
+			      control_ifcfg_name(control), pkt->hdr.pkt_cmd,
+			      control->last_cmd, control->seq_num);
+
+		control->req_state = REQ_FAILED;
+		fail = 1;
+		break;
+	case REQ_FAILED:
+		/* this can occur after an error when the ViPort state machine
+		 * attempts to reset the link.
+		 */
+		CONTROL_INFO("%s: Attempt to send in failed state. "
+			     "New CMD: %d Last CMD: %d\n",
+			     control_ifcfg_name(control), pkt->hdr.pkt_cmd,
+			     control->last_cmd);
+		/* stay in REQ_FAILED state */
+		break;
+	}
+	spin_unlock_irqrestore(&control->io_lock, flags);
+
+	/* we must do this outside the lock */
+	if (fail)
+		viport_failure(control->parent);
+	return ret;
+
+}
+
+static void control_send_complete(struct io *io)
+{
+	struct control *control = &io->viport->control;
+	unsigned long flags;
+	u8 fail = 0;
+	u8 kick = 0;
+
+	CONTROL_FUNCTION("%s: control_send_complete(), State=%d\n",
+			 control_ifcfg_name(control), control->req_state);
+	spin_lock_irqsave(&control->io_lock, flags);
+	switch (control->req_state) {
+	case REQ_INACTIVE:
+	case REQ_SENT:
+	case REQ_COMPLETED:
+		CONTROL_ERROR("%s: Unexpected control send completion\n",
+			      control_ifcfg_name(control));
+		fail = 1;
+		control->req_state = REQ_FAILED;
+		break;
+	case REQ_POSTED:
+		control->req_state = REQ_SENT;
+		break;
+	case RSP_RECEIVED:
+		control->req_state = REQ_COMPLETED;
+		kick = 1;
+		break;
+	case REQ_FAILED:
+		/* stay in REQ_FAILED state */
+		break;
+	}
+	spin_unlock_irqrestore(&control->io_lock, flags);
+	/* we must do this outside the lock */
+	if (fail)
+		viport_failure(control->parent);
+	if (kick)
+		viport_kick(control->parent);
+
+	return;
+}
+
+void control_process_async(struct control *control)
+{
+	struct recv_io *recv_io;
+	struct vnic_control_packet *pkt;
+	unsigned long flags;
+
+	CONTROL_FUNCTION("%s: control_process_async()\n",
+			 control_ifcfg_name(control));
+	ib_dma_sync_single_for_cpu(control->parent->config->ibdev,
+				   control->recv_dma, control->recv_len,
+				   DMA_FROM_DEVICE);
+
+	spin_lock_irqsave(&control->io_lock, flags);
+	recv_io = control->info;
+	if (recv_io) {
+		CONTROL_INFO("%s: processing info packet\n",
+			     control_ifcfg_name(control));
+		control->info = NULL;
+		spin_unlock_irqrestore(&control->io_lock, flags);
+		pkt = control_packet(recv_io);
+		if (pkt->hdr.pkt_cmd == CMD_REPORT_STATUS) {
+			u32 status;
+			status =
+			    be32_to_cpu(pkt->cmd.report_status.status_number);
+			switch (status) {
+			case VNIC_STATUS_LINK_UP:
+				CONTROL_INFO("%s: link up\n",
+					     control_ifcfg_name(control));
+				vnic_link_up(control->parent->vnic,
+					     control->parent->parent);
+				break;
+			case VNIC_STATUS_LINK_DOWN:
+				CONTROL_INFO("%s: link down\n",
+					     control_ifcfg_name(control));
+				vnic_link_down(control->parent->vnic,
+					       control->parent->parent);
+				break;
+			default:
+				CONTROL_ERROR("%s: asynchronous status"
+					      " received from EIOC\n",
+					      control_ifcfg_name(control));
+				control_log_control_packet(pkt);
+				break;
+			}
+		}
+		if ((pkt->hdr.pkt_cmd != CMD_REPORT_STATUS) ||
+		    pkt->cmd.report_status.is_fatal)
+			viport_failure(control->parent);
+
+		control_recv(control, recv_io);
+		spin_lock_irqsave(&control->io_lock, flags);
+	}
+
+	while (!list_empty(&control->failure_list)) {
+		CONTROL_INFO("%s: processing error packet\n",
+			     control_ifcfg_name(control));
+		recv_io = (struct recv_io *)
+		    list_entry(control->failure_list.next, struct io,
+			       list_ptrs);
+		list_del(&recv_io->io.list_ptrs);
+		spin_unlock_irqrestore(&control->io_lock, flags);
+		pkt = control_packet(recv_io);
+		CONTROL_ERROR("%s: asynchronous error received from EIOC\n",
+			      control_ifcfg_name(control));
+		control_log_control_packet(pkt);
+		if ((pkt->hdr.pkt_type != TYPE_ERR)
+		    || (pkt->hdr.pkt_cmd != CMD_REPORT_STATUS)
+		    || pkt->cmd.report_status.is_fatal)
+			viport_failure(control->parent);
+
+		control_recv(control, recv_io);
+		spin_lock_irqsave(&control->io_lock, flags);
+	}
+
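+	/* failure list drained; drop the lock and hand the
+	 * receive buffer back to the device below
+	 */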
+	spin_unlock_irqrestore(&control->io_lock, flags);
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->recv_dma, control->recv_len,
+				      DMA_FROM_DEVICE);
+
+	CONTROL_FUNCTION("%s: done control_process_async\n",
+			 control_ifcfg_name(control));
+}
+
+static struct send_io *control_init_hdr(struct control *control, u8 cmd)
+{
+	struct control_config *config;
+	struct vnic_control_packet *pkt;
+	struct vnic_control_header *hdr;
+
+	CONTROL_FUNCTION("control_init_hdr()\n");
+	config = control->config;
+
+	pkt = control_packet(&control->send_io);
+	hdr = &pkt->hdr;
+
+	hdr->pkt_type = TYPE_REQ;
+	hdr->pkt_cmd = cmd;
+	control->seq_num++;
+	hdr->pkt_seq_num = control->seq_num;
+	hdr->pkt_retry_count = 0;
+
+	return &control->send_io;
+}
+
+static struct recv_io *control_get_rsp(struct control *control)
+{
+	struct recv_io *recv_io = NULL;
+	unsigned long flags;
+	u8 fail = 0;
+
+	CONTROL_FUNCTION("%s: control_get_rsp(), State=%d\n",
+			 control_ifcfg_name(control), control->req_state);
+	spin_lock_irqsave(&control->io_lock, flags);
+	switch (control->req_state) {
+	case REQ_INACTIVE:
+		CONTROL_ERROR("%s: Checked for Response with no "
+			      "command pending\n",
+			      control_ifcfg_name(control));
+		control->req_state = REQ_FAILED;
+		fail = 1;
+		break;
+	case REQ_POSTED:
+	case REQ_SENT:
+	case RSP_RECEIVED:
+		/* no response available yet,
+		 * stay in present state */
+		break;
+	case REQ_COMPLETED:
+		recv_io = control->response;
+		if (!recv_io) {
+			control->req_state = REQ_FAILED;
+			fail = 1;
+			break;
+		}
+		control->response = NULL;
+		control->last_cmd = CMD_INVALID;
+		control_timer_stop(control);
+		control->req_state = REQ_INACTIVE;
+		break;
+	case REQ_FAILED:
+		control_timer_stop(control);
+		/* stay in REQ_FAILED state */
+		break;
+	}
+	spin_unlock_irqrestore(&control->io_lock, flags);
+	if (fail)
+		viport_failure(control->parent);
+	return recv_io;
+}
+
+int control_init_vnic_req(struct control *control)
+{
+	struct send_io *send_io;
+	struct control_config *config = control->config;
+	struct vnic_control_packet *pkt;
+	struct vnic_cmd_init_vnic_req *init_vnic_req;
+
+	ib_dma_sync_single_for_cpu(control->parent->config->ibdev,
+				   control->send_dma, control->send_len,
+				   DMA_TO_DEVICE);
+
+	send_io = control_init_hdr(control, CMD_INIT_VNIC);
+	if (!send_io)
+		goto failure;
+
+	pkt = control_packet(send_io);
+	init_vnic_req = &pkt->cmd.init_vnic_req;
+	init_vnic_req->vnic_major_version =
+	    __constant_cpu_to_be16(VNIC_MAJORVERSION);
+	init_vnic_req->vnic_minor_version =
+	    __constant_cpu_to_be16(VNIC_MINORVERSION);
+	init_vnic_req->vnic_instance = config->vnic_instance;
+	init_vnic_req->num_data_paths = 1;
+	init_vnic_req->num_address_entries =
+	    cpu_to_be16(config->max_address_entries);
+
+	control->last_cmd = pkt->hdr.pkt_cmd;
+	CONTROL_PACKET(pkt);
+
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->send_dma, control->send_len,
+				      DMA_TO_DEVICE);
+
+	return control_send(control, send_io);
+failure:
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->send_dma, control->send_len,
+				      DMA_TO_DEVICE);
+	return -1;
+}
+
+static int control_chk_vnic_rsp_values(struct control *control,
+				       u16 *num_addrs,
+				       u8 num_data_paths,
+				       u8 num_lan_switches,
+				       u32 *features)
+{
+
+	struct control_config *config = control->config;
+
+	if ((control->maj_ver > VNIC_MAJORVERSION)
+	    || ((control->maj_ver == VNIC_MAJORVERSION)
+		&& (control->min_ver > VNIC_MINORVERSION))) {
+		CONTROL_ERROR("%s: unsupported version\n",
+			      control_ifcfg_name(control));
+		goto failure;
+	}
+	if
(num_data_paths != 1) { + CONTROL_ERROR("%s: EIOC returned too many datapaths\n", + control_ifcfg_name(control)); + goto failure; + } + if (*num_addrs > config->max_address_entries) { + CONTROL_ERROR("%s: EIOC returned more address" + " entries than requested\n", + control_ifcfg_name(control)); + goto failure; + } + if (*num_addrs < config->min_address_entries) { + CONTROL_ERROR("%s: not enough address entries\n", + control_ifcfg_name(control)); + goto failure; + } + if (num_lan_switches < 1) { + CONTROL_ERROR("%s: EIOC returned no lan switches\n", + control_ifcfg_name(control)); + goto failure; + } + if (num_lan_switches > 1) { + CONTROL_ERROR("%s: EIOC returned multiple lan switches\n", + control_ifcfg_name(control)); + goto failure; + } + CONTROL_ERROR("%s checking features %x ib_multicast:%d\n", + control_ifcfg_name(control), + *features, config->ib_multicast); + if ((*features & VNIC_FEAT_INBOUND_IB_MC) && !config->ib_multicast) { + /* disable multicast if it is not on in the cfg file, or + if we turned it off because join failed */ + *features &= ~VNIC_FEAT_INBOUND_IB_MC; + } + + return 0; +failure: + return -1; +} + +int control_init_vnic_rsp(struct control *control, u32 *features, + u8 *mac_address, u16 *num_addrs, u16 *vlan) +{ + u8 num_data_paths; + u8 num_lan_switches; + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_init_vnic_rsp *init_vnic_rsp; + + + CONTROL_FUNCTION("%s: control_init_vnic_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_INIT_VNIC) + goto failure; + + init_vnic_rsp = &pkt->cmd.init_vnic_rsp; + control->maj_ver = be16_to_cpu(init_vnic_rsp->vnic_major_version); + control->min_ver = be16_to_cpu(init_vnic_rsp->vnic_minor_version); + num_data_paths = init_vnic_rsp->num_data_paths; + num_lan_switches = init_vnic_rsp->num_lan_switches; + *features = be32_to_cpu(init_vnic_rsp->features_supported); + *num_addrs = be16_to_cpu(init_vnic_rsp->num_address_entries); + + if (control_chk_vnic_rsp_values(control, num_addrs, + num_data_paths, + num_lan_switches, + features)) + goto failure; + + control->lan_switch.lan_switch_num = + init_vnic_rsp->lan_switch[0].lan_switch_num; + control->lan_switch.num_enet_ports = + init_vnic_rsp->lan_switch[0].num_enet_ports; + control->lan_switch.default_vlan = + init_vnic_rsp->lan_switch[0].default_vlan; + *vlan = be16_to_cpu(control->lan_switch.default_vlan); + memcpy(control->lan_switch.hw_mac_address, + init_vnic_rsp->lan_switch[0].hw_mac_address, ETH_ALEN); + memcpy(mac_address, init_vnic_rsp->lan_switch[0].hw_mac_address, + ETH_ALEN); + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +static void copy_recv_pool_config(struct vnic_recv_pool_config *src, + struct vnic_recv_pool_config *dst) +{ + dst->size_recv_pool_entry = src->size_recv_pool_entry; + dst->num_recv_pool_entries = src->num_recv_pool_entries; + dst->timeout_before_kick = src->timeout_before_kick; + dst->num_recv_pool_entries_before_kick = + src->num_recv_pool_entries_before_kick; + 
dst->num_recv_pool_bytes_before_kick =
+	    src->num_recv_pool_bytes_before_kick;
+	dst->free_recv_pool_entries_per_update =
+	    src->free_recv_pool_entries_per_update;
+}
+
+static int check_recv_pool_config_value(__be32 *src, __be32 *dst,
+					__be32 *max, __be32 *min,
+					char *name)
+{
+	u32 value;
+
+	value = be32_to_cpu(*src);
+	if (value > be32_to_cpu(*max)) {
+		CONTROL_ERROR("value %s too large\n", name);
+		return -1;
+	} else if (value < be32_to_cpu(*min)) {
+		CONTROL_ERROR("value %s too small\n", name);
+		return -1;
+	}
+
+	*dst = cpu_to_be32(value);
+	return 0;
+}
+
+static int check_recv_pool_config(struct vnic_recv_pool_config *src,
+				  struct vnic_recv_pool_config *dst,
+				  struct vnic_recv_pool_config *max,
+				  struct vnic_recv_pool_config *min)
+{
+	if (check_recv_pool_config_value(&src->size_recv_pool_entry,
+					 &dst->size_recv_pool_entry,
+					 &max->size_recv_pool_entry,
+					 &min->size_recv_pool_entry,
+					 "size_recv_pool_entry")
+	    || check_recv_pool_config_value(&src->num_recv_pool_entries,
+					    &dst->num_recv_pool_entries,
+					    &max->num_recv_pool_entries,
+					    &min->num_recv_pool_entries,
+					    "num_recv_pool_entries")
+	    || check_recv_pool_config_value(&src->timeout_before_kick,
+					    &dst->timeout_before_kick,
+					    &max->timeout_before_kick,
+					    &min->timeout_before_kick,
+					    "timeout_before_kick")
+	    || check_recv_pool_config_value(&src->num_recv_pool_entries_before_kick,
+					    &dst->num_recv_pool_entries_before_kick,
+					    &max->num_recv_pool_entries_before_kick,
+					    &min->num_recv_pool_entries_before_kick,
+					    "num_recv_pool_entries_before_kick")
+	    || check_recv_pool_config_value(&src->num_recv_pool_bytes_before_kick,
+					    &dst->num_recv_pool_bytes_before_kick,
+					    &max->num_recv_pool_bytes_before_kick,
+					    &min->num_recv_pool_bytes_before_kick,
+					    "num_recv_pool_bytes_before_kick")
+	    || check_recv_pool_config_value(&src->free_recv_pool_entries_per_update,
+					    &dst->free_recv_pool_entries_per_update,
+					    &max->free_recv_pool_entries_per_update,
+					    &min->free_recv_pool_entries_per_update,
+					    "free_recv_pool_entries_per_update"))
+		goto failure;
+
+	if (!is_power_of_2(be32_to_cpu(dst->num_recv_pool_entries))) {
+		CONTROL_ERROR("num_recv_pool_entries (%d)"
+			      " must be power of 2\n",
+			      be32_to_cpu(dst->num_recv_pool_entries));
+		goto failure;
+	}
+
+	if (!is_power_of_2(be32_to_cpu(dst->free_recv_pool_entries_per_update))) {
+		CONTROL_ERROR("free_recv_pool_entries_per_update (%d)"
+			      " must be power of 2\n",
+			      be32_to_cpu(dst->free_recv_pool_entries_per_update));
+		goto failure;
+	}
+
+	if (be32_to_cpu(dst->free_recv_pool_entries_per_update) >=
+	    be32_to_cpu(dst->num_recv_pool_entries)) {
+		CONTROL_ERROR("free_recv_pool_entries_per_update (%d) must"
+			      " be less than num_recv_pool_entries (%d)\n",
+			      be32_to_cpu(dst->free_recv_pool_entries_per_update),
+			      be32_to_cpu(dst->num_recv_pool_entries));
+		goto failure;
+	}
+
+	if (be32_to_cpu(dst->num_recv_pool_entries_before_kick) >=
+	    be32_to_cpu(dst->num_recv_pool_entries)) {
+		CONTROL_ERROR("num_recv_pool_entries_before_kick (%d) must"
+			      " be less than num_recv_pool_entries (%d)\n",
+			      be32_to_cpu(dst->num_recv_pool_entries_before_kick),
+			      be32_to_cpu(dst->num_recv_pool_entries));
+		goto failure;
+	}
+
+	return 0;
+failure:
+	return -1;
+}
+
+int control_config_data_path_req(struct control *control, u64 path_id,
+				 struct vnic_recv_pool_config *host,
+				 struct vnic_recv_pool_config *eioc)
+{
+	struct send_io *send_io;
+	struct vnic_control_packet *pkt;
+	struct vnic_cmd_config_data_path *config_data_path;
+
+	CONTROL_FUNCTION("%s: control_config_data_path_req()\n",
+			 control_ifcfg_name(control));
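+	/* make the send buffer CPU-owned while the request is built */ +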
ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_CONFIG_DATA_PATH); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_data_path = &pkt->cmd.config_data_path_req; + config_data_path->data_path = 0; + config_data_path->path_identifier = path_id; + copy_recv_pool_config(host, + &config_data_path->host_recv_pool_config); + copy_recv_pool_config(eioc, + &config_data_path->eioc_recv_pool_config); + CONTROL_PACKET(pkt); + + control->last_cmd = pkt->hdr.pkt_cmd; + + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + return control_send(control, send_io); +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_config_data_path_rsp(struct control *control, + struct vnic_recv_pool_config *host, + struct vnic_recv_pool_config *eioc, + struct vnic_recv_pool_config *max_host, + struct vnic_recv_pool_config *max_eioc, + struct vnic_recv_pool_config *min_host, + struct vnic_recv_pool_config *min_eioc) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_config_data_path *config_data_path; + + CONTROL_FUNCTION("%s: control_config_data_path_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_CONFIG_DATA_PATH) + goto failure; + + config_data_path = &pkt->cmd.config_data_path_rsp; + if (config_data_path->data_path != 0) { + CONTROL_ERROR("%s: received CMD_CONFIG_DATA_PATH response" + " for wrong data path: %u\n", + control_ifcfg_name(control), + config_data_path->data_path); + goto failure; + } + + if (check_recv_pool_config(&config_data_path-> + host_recv_pool_config, + host, max_host, min_host) + || check_recv_pool_config(&config_data_path-> + eioc_recv_pool_config, + eioc, max_eioc, min_eioc)) { + goto failure; + } + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_exchange_pools_req(struct control *control, u64 addr, u32 rkey) +{ + struct send_io *send_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_exchange_pools *exchange_pools; + + CONTROL_FUNCTION("%s: control_exchange_pools_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_EXCHANGE_POOLS); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + exchange_pools = &pkt->cmd.exchange_pools_req; + exchange_pools->data_path = 0; + exchange_pools->pool_rkey = cpu_to_be32(rkey); + exchange_pools->pool_addr = cpu_to_be64(addr); + + control->last_cmd = pkt->hdr.pkt_cmd; + + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + 
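/* even on failure, hand buffer ownership back to the device */ +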
ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_exchange_pools_rsp(struct control *control, u64 *addr, + u32 *rkey) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_exchange_pools *exchange_pools; + + CONTROL_FUNCTION("%s: control_exchange_pools_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_EXCHANGE_POOLS) + goto failure; + + exchange_pools = &pkt->cmd.exchange_pools_rsp; + *rkey = be32_to_cpu(exchange_pools->pool_rkey); + *addr = be64_to_cpu(exchange_pools->pool_addr); + + if (exchange_pools->data_path != 0) { + CONTROL_ERROR("%s: received CMD_EXCHANGE_POOLS response" + " for wrong data path: %u\n", + control_ifcfg_name(control), + exchange_pools->data_path); + goto failure; + } + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_config_link_req(struct control *control, u16 flags, u16 mtu) +{ + struct send_io *send_io; + struct vnic_cmd_config_link *config_link_req; + struct vnic_control_packet *pkt; + + CONTROL_FUNCTION("%s: control_config_link_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_CONFIG_LINK); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_link_req = &pkt->cmd.config_link_req; + config_link_req->lan_switch_num = + control->lan_switch.lan_switch_num; + config_link_req->cmd_flags = VNIC_FLAG_SET_MTU; + if (flags & IFF_UP) + config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_NIC; + else + config_link_req->cmd_flags |= VNIC_FLAG_DISABLE_NIC; + if (flags & IFF_ALLMULTI) + config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_MCAST_ALL; + else + config_link_req->cmd_flags |= VNIC_FLAG_DISABLE_MCAST_ALL; + if (flags & IFF_PROMISC) { + config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_PROMISC; + /* the EIOU doesn't really do PROMISC mode. + * if PROMISC is set, it only receives unicast packets + * I also have to set MCAST_ALL if I want real + * PROMISC mode. 
+		 */
+		config_link_req->cmd_flags &= ~VNIC_FLAG_DISABLE_MCAST_ALL;
+		config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_MCAST_ALL;
+	} else
+		config_link_req->cmd_flags |= VNIC_FLAG_DISABLE_PROMISC;
+
+	config_link_req->mtu_size = cpu_to_be16(mtu);
+
+	control->last_cmd = pkt->hdr.pkt_cmd;
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->send_dma, control->send_len,
+				      DMA_TO_DEVICE);
+	return control_send(control, send_io);
+failure:
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->send_dma, control->send_len,
+				      DMA_TO_DEVICE);
+	return -1;
+}
+
+int control_config_link_rsp(struct control *control, u16 *flags, u16 *mtu)
+{
+	struct recv_io *recv_io;
+	struct vnic_control_packet *pkt;
+	struct vnic_cmd_config_link *config_link_rsp;
+
+	CONTROL_FUNCTION("%s: control_config_link_rsp()\n",
+			 control_ifcfg_name(control));
+	ib_dma_sync_single_for_cpu(control->parent->config->ibdev,
+				   control->recv_dma, control->recv_len,
+				   DMA_FROM_DEVICE);
+
+	recv_io = control_get_rsp(control);
+	if (!recv_io)
+		goto out;
+
+	pkt = control_packet(recv_io);
+	if (pkt->hdr.pkt_cmd != CMD_CONFIG_LINK)
+		goto failure;
+	config_link_rsp = &pkt->cmd.config_link_rsp;
+	if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_NIC)
+		*flags |= IFF_UP;
+	if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_MCAST_ALL)
+		*flags |= IFF_ALLMULTI;
+	if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_PROMISC)
+		*flags |= IFF_PROMISC;
+
+	*mtu = be16_to_cpu(config_link_rsp->mtu_size);
+
+	if (control->parent->features_supported & VNIC_FEAT_INBOUND_IB_MC) {
+		/* features_supported might include INBOUND_IB_MC, but the
+		 * MTU might cause it to be auto-disabled at the embedded
+		 * side */
+		if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_MCAST_ALL) {
+			union ib_gid mgid = config_link_rsp->allmulti_mgid;
+			if (mgid.raw[0] != 0xff) {
+				CONTROL_ERROR("%s: invalid format prefix "
+					      VNIC_GID_FMT "\n",
+					      control_ifcfg_name(control),
+					      VNIC_GID_RAW_ARG(mgid.raw));
+			} else {
+				/* rather than issuing join here, which might
+				 * arrive at SM before EVIC creates the MC
+				 * group, postpone it.
+				 */
+				vnic_mc_join_setup(control->parent, &mgid);
+				CONTROL_ERROR("join setup for ALL_MULTI\n");
+			}
+		}
+		/* we don't want to leave mcast group if MCAST_ALL is disabled
+		 * because there are no doubt multicast addresses set and we
+		 * want to stay joined so we can get that traffic via the
+		 * mcast group.
+ */ + } + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +/* control_config_addrs_req: + * return values: + * -1: failure + * 0: incomplete (successful operation, but more address + * table entries to be updated) + * 1: complete + */ +int control_config_addrs_req(struct control *control, + struct vnic_address_op2 *addrs, u16 num) +{ + u16 i; + u8 j; + int ret = 1; + struct send_io *send_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_config_addresses *config_addrs_req; + struct vnic_cmd_config_addresses2 *config_addrs_req2; + + CONTROL_FUNCTION("%s: control_config_addrs_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + if (control->parent->features_supported & VNIC_FEAT_INBOUND_IB_MC) { + CONTROL_INFO("Sending CMD_CONFIG_ADDRESSES2 %lx MAX:%d " + "sizes:%d %d(off:%d) sizes2:%d %d %d" + "(off:%d - %d %d %d %d %d %d %d)\n", jiffies, + (int)MAX_CONFIG_ADDR_ENTRIES2, + (int)sizeof(struct vnic_cmd_config_addresses), + (int)sizeof(struct vnic_address_op), + (int)offsetof(struct vnic_cmd_config_addresses, + list_address_ops), + (int)sizeof(struct vnic_cmd_config_addresses2), + (int)sizeof(struct vnic_address_op2), + (int)sizeof(union ib_gid), + (int)offsetof(struct vnic_cmd_config_addresses2, + list_address_ops), + (int)offsetof(struct vnic_address_op2, index), + (int)offsetof(struct vnic_address_op2, operation), + (int)offsetof(struct vnic_address_op2, valid), + (int)offsetof(struct vnic_address_op2, address), + (int)offsetof(struct vnic_address_op2, vlan), + (int)offsetof(struct vnic_address_op2, reserved), + (int)offsetof(struct vnic_address_op2, mgid) + ); + send_io = control_init_hdr(control, CMD_CONFIG_ADDRESSES2); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_addrs_req2 = &pkt->cmd.config_addresses_req2; + memset(pkt->cmd.cmd_data, 0, VNIC_MAX_CONTROLDATASZ); + config_addrs_req2->lan_switch_num = + control->lan_switch.lan_switch_num; + for (i = 0, j = 0; (i < num) && (j < MAX_CONFIG_ADDR_ENTRIES2); i++) { + if (!addrs[i].operation) + continue; + config_addrs_req2->list_address_ops[j].index = + cpu_to_be16(i); + config_addrs_req2->list_address_ops[j].operation = + VNIC_OP_SET_ENTRY; + config_addrs_req2->list_address_ops[j].valid = + addrs[i].valid; + memcpy(config_addrs_req2->list_address_ops[j].address, + addrs[i].address, ETH_ALEN); + config_addrs_req2->list_address_ops[j].vlan = + addrs[i].vlan; + addrs[i].operation = 0; + CONTROL_INFO("%s i=%d " + "addr[%d]=%02x:%02x:%02x:%02x:%02x:%02x " + "valid:%d\n", control_ifcfg_name(control), i, j, + addrs[i].address[0], addrs[i].address[1], + addrs[i].address[2], addrs[i].address[3], + addrs[i].address[4], addrs[i].address[5], + addrs[i].valid); + j++; + } + config_addrs_req2->num_address_ops = j; + } else { + send_io = control_init_hdr(control, CMD_CONFIG_ADDRESSES); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_addrs_req = &pkt->cmd.config_addresses_req; + config_addrs_req->lan_switch_num = + control->lan_switch.lan_switch_num; + for (i = 0, j = 0; (i < num) && (j < 16); i++) { + if (!addrs[i].operation) + continue; + 
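/* copy the pending entry and clear its operation flag so it is not sent twice */ +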
config_addrs_req->list_address_ops[j].index =
+			    cpu_to_be16(i);
+			config_addrs_req->list_address_ops[j].operation =
+			    VNIC_OP_SET_ENTRY;
+			config_addrs_req->list_address_ops[j].valid =
+			    addrs[i].valid;
+			memcpy(config_addrs_req->list_address_ops[j].address,
+			       addrs[i].address, ETH_ALEN);
+			config_addrs_req->list_address_ops[j].vlan =
+			    addrs[i].vlan;
+			addrs[i].operation = 0;
+			j++;
+		}
+		config_addrs_req->num_address_ops = j;
+	}
+	for (; i < num; i++) {
+		if (addrs[i].operation) {
+			ret = 0;
+			break;
+		}
+	}
+
+	control->last_cmd = pkt->hdr.pkt_cmd;
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->send_dma, control->send_len,
+				      DMA_TO_DEVICE);
+
+	if (control_send(control, send_io))
+		return -1;
+	return ret;
+failure:
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->send_dma, control->send_len,
+				      DMA_TO_DEVICE);
+	return -1;
+}
+
+static int process_cmd_config_address2_rsp(struct control *control,
+					   struct vnic_control_packet *pkt,
+					   struct recv_io *recv_io)
+{
+	struct vnic_cmd_config_addresses2 *config_addrs_rsp2;
+	int idx, mcaddrs, nomgid;
+	union ib_gid mgid, rsp_mgid;
+
+	config_addrs_rsp2 = &pkt->cmd.config_addresses_rsp2;
+	CONTROL_INFO("%s rsp to CONFIG_ADDRESSES2\n",
+		     control_ifcfg_name(control));
+
+	for (idx = 0, mcaddrs = 0, nomgid = 1;
+	     idx < config_addrs_rsp2->num_address_ops;
+	     idx++) {
+		if (!config_addrs_rsp2->list_address_ops[idx].valid)
+			continue;
+
+		/* check if address is multicast */
+		if (!vnic_multicast_address(config_addrs_rsp2, idx))
+			continue;
+
+		mcaddrs++;
+		mgid = config_addrs_rsp2->list_address_ops[idx].mgid;
+		CONTROL_INFO("%s: got mgid " VNIC_GID_FMT
+			     " MCAST_MSG_SIZE:%d mtu:%d\n",
+			     control_ifcfg_name(control),
+			     VNIC_GID_RAW_ARG(mgid.raw),
+			     (int)MCAST_MSG_SIZE,
+			     control->parent->mtu);
+
+		/* Embedded should have turned off multicast
+		 * due to large MTU size; mgid had better be 0.
+		 */
+		if (control->parent->mtu > MCAST_MSG_SIZE) {
+			if ((mgid.global.subnet_prefix != 0) ||
+			    (mgid.global.interface_id != 0)) {
+				CONTROL_ERROR("%s: invalid mgid; "
+					      "expected 0 "
+					      VNIC_GID_FMT "\n",
+					      control_ifcfg_name(control),
+					      VNIC_GID_RAW_ARG(mgid.raw));
+			}
+			continue;
+		}
+		if (mgid.raw[0] != 0xff) {
+			CONTROL_ERROR("%s: invalid format prefix "
+				      VNIC_GID_FMT "\n",
+				      control_ifcfg_name(control),
+				      VNIC_GID_RAW_ARG(mgid.raw));
+			continue;
+		}
+		nomgid = 0;	/* got a valid mgid */
+
+		/* let's verify that all the mgids match this one */
+		for (; idx < config_addrs_rsp2->num_address_ops; idx++) {
+			if (!config_addrs_rsp2->list_address_ops[idx].valid)
+				continue;
+
+			/* check if address is multicast */
+			if (!vnic_multicast_address(config_addrs_rsp2, idx))
+				continue;
+
+			rsp_mgid = config_addrs_rsp2->list_address_ops[idx].mgid;
+			if (memcmp(&mgid, &rsp_mgid, sizeof(union ib_gid)) == 0)
+				continue;
+
+			CONTROL_ERROR("%s: Multicast Group MGIDs not "
+				      "unique; mgids: " VNIC_GID_FMT
+				      " " VNIC_GID_FMT "\n",
+				      control_ifcfg_name(control),
+				      VNIC_GID_RAW_ARG(mgid.raw),
+				      VNIC_GID_RAW_ARG(rsp_mgid.raw));
+			return 1;
+		}
+
+		/* rather than issuing join here, which might arrive
+		 * at SM before EVIC creates the MC group, postpone it.
+		 */
+		vnic_mc_join_setup(control->parent, &mgid);
+
+		/* there is only one multicast group to join, so we're done. */
+		break;
+	}
+
+	/* we sent at least one multicast address but got no MGID
+	 * back so, if it is not the allmulti case, leave the group
+	 * we joined before.
+	 * (for the allmulti case we have to stay joined)
+	 */
+	if ((config_addrs_rsp2->num_address_ops > 0) && (mcaddrs > 0) &&
+	    nomgid && !(control->parent->flags & IFF_ALLMULTI)) {
+		CONTROL_INFO("numaddrops:%d mcaddrs:%d nomgid:%d\n",
+			     config_addrs_rsp2->num_address_ops,
+			     mcaddrs > 0, nomgid);
+
+		vnic_mc_leave(control->parent);
+	}
+
+	return 0;
+}
+
+int control_config_addrs_rsp(struct control *control)
+{
+	struct recv_io *recv_io;
+	struct vnic_control_packet *pkt;
+
+	CONTROL_FUNCTION("%s: control_config_addrs_rsp()\n",
+			 control_ifcfg_name(control));
+	ib_dma_sync_single_for_cpu(control->parent->config->ibdev,
+				   control->recv_dma, control->recv_len,
+				   DMA_FROM_DEVICE);
+
+	recv_io = control_get_rsp(control);
+	if (!recv_io)
+		goto out;
+
+	pkt = control_packet(recv_io);
+	if ((pkt->hdr.pkt_cmd != CMD_CONFIG_ADDRESSES) &&
+	    (pkt->hdr.pkt_cmd != CMD_CONFIG_ADDRESSES2))
+		goto failure;
+
+	if (((pkt->hdr.pkt_cmd == CMD_CONFIG_ADDRESSES2) &&
+	     !(control->parent->features_supported & VNIC_FEAT_INBOUND_IB_MC)) ||
+	    ((pkt->hdr.pkt_cmd == CMD_CONFIG_ADDRESSES) &&
+	     (control->parent->features_supported & VNIC_FEAT_INBOUND_IB_MC))) {
+		CONTROL_ERROR("%s unexpected response pktCmd:%d flag:%x\n",
+			      control_ifcfg_name(control), pkt->hdr.pkt_cmd,
+			      control->parent->features_supported &
+			      VNIC_FEAT_INBOUND_IB_MC);
+		goto failure;
+	}
+
+	if (pkt->hdr.pkt_cmd == CMD_CONFIG_ADDRESSES2) {
+		if (process_cmd_config_address2_rsp(control, pkt, recv_io))
+			goto failure;
+	} else {
+		struct vnic_cmd_config_addresses *config_addrs_rsp;
+		config_addrs_rsp = &pkt->cmd.config_addresses_rsp;
+	}
+
+	control_recv(control, recv_io);
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->recv_dma, control->recv_len,
+				      DMA_FROM_DEVICE);
+	return 0;
+failure:
+	viport_failure(control->parent);
+out:
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->recv_dma, control->recv_len,
+				      DMA_FROM_DEVICE);
+	return -1;
+}
+
+int control_report_statistics_req(struct control *control)
+{
+	struct send_io *send_io;
+	struct vnic_control_packet *pkt;
+	struct vnic_cmd_report_stats_req *report_statistics_req;
+
+	CONTROL_FUNCTION("%s: control_report_statistics_req()\n",
+			 control_ifcfg_name(control));
+	ib_dma_sync_single_for_cpu(control->parent->config->ibdev,
+				   control->send_dma, control->send_len,
+				   DMA_TO_DEVICE);
+
+	send_io = control_init_hdr(control, CMD_REPORT_STATISTICS);
+	if (!send_io)
+		goto failure;
+
+	pkt = control_packet(send_io);
+	report_statistics_req = &pkt->cmd.report_statistics_req;
+	report_statistics_req->lan_switch_num =
+	    control->lan_switch.lan_switch_num;
+
+	control->last_cmd = pkt->hdr.pkt_cmd;
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->send_dma, control->send_len,
+				      DMA_TO_DEVICE);
+	return control_send(control, send_io);
+failure:
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->send_dma, control->send_len,
+				      DMA_TO_DEVICE);
+	return -1;
+}
+
+int control_report_statistics_rsp(struct control *control,
+				  struct vnic_cmd_report_stats_rsp *stats)
+{
+	struct recv_io *recv_io;
+	struct vnic_control_packet *pkt;
+	struct vnic_cmd_report_stats_rsp *rep_stat_rsp;
+
+	CONTROL_FUNCTION("%s: control_report_statistics_rsp()\n",
+			 control_ifcfg_name(control));
+	ib_dma_sync_single_for_cpu(control->parent->config->ibdev,
+				   control->recv_dma, control->recv_len,
+				   DMA_FROM_DEVICE);
+
+	recv_io = control_get_rsp(control);
+	if (!recv_io)
+		goto out;
+
+	pkt = control_packet(recv_io);
+	if (pkt->hdr.pkt_cmd !=
CMD_REPORT_STATISTICS) + goto failure; + + rep_stat_rsp = &pkt->cmd.report_statistics_rsp; + + stats->if_in_broadcast_pkts = rep_stat_rsp->if_in_broadcast_pkts; + stats->if_in_multicast_pkts = rep_stat_rsp->if_in_multicast_pkts; + stats->if_in_octets = rep_stat_rsp->if_in_octets; + stats->if_in_ucast_pkts = rep_stat_rsp->if_in_ucast_pkts; + stats->if_in_nucast_pkts = rep_stat_rsp->if_in_nucast_pkts; + stats->if_in_underrun = rep_stat_rsp->if_in_underrun; + stats->if_in_errors = rep_stat_rsp->if_in_errors; + stats->if_out_errors = rep_stat_rsp->if_out_errors; + stats->if_out_octets = rep_stat_rsp->if_out_octets; + stats->if_out_ucast_pkts = rep_stat_rsp->if_out_ucast_pkts; + stats->if_out_multicast_pkts = rep_stat_rsp->if_out_multicast_pkts; + stats->if_out_broadcast_pkts = rep_stat_rsp->if_out_broadcast_pkts; + stats->if_out_nucast_pkts = rep_stat_rsp->if_out_nucast_pkts; + stats->if_out_ok = rep_stat_rsp->if_out_ok; + stats->if_in_ok = rep_stat_rsp->if_in_ok; + stats->if_out_ucast_bytes = rep_stat_rsp->if_out_ucast_bytes; + stats->if_out_multicast_bytes = rep_stat_rsp->if_out_multicast_bytes; + stats->if_out_broadcast_bytes = rep_stat_rsp->if_out_broadcast_bytes; + stats->if_in_ucast_bytes = rep_stat_rsp->if_in_ucast_bytes; + stats->if_in_multicast_bytes = rep_stat_rsp->if_in_multicast_bytes; + stats->if_in_broadcast_bytes = rep_stat_rsp->if_in_broadcast_bytes; + stats->ethernet_status = rep_stat_rsp->ethernet_status; + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_reset_req(struct control *control) +{ + struct send_io *send_io; + struct vnic_control_packet *pkt; + + CONTROL_FUNCTION("%s: control_reset_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_RESET); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + + control->last_cmd = pkt->hdr.pkt_cmd; + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_reset_rsp(struct control *control) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + + CONTROL_FUNCTION("%s: control_reset_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_RESET) + goto failure; + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_heartbeat_req(struct control *control, u32 hb_interval) +{ + struct send_io *send_io; + struct 
vnic_control_packet *pkt; + struct vnic_cmd_heartbeat *heartbeat_req; + + CONTROL_FUNCTION("%s: control_heartbeat_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_HEARTBEAT); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + heartbeat_req = &pkt->cmd.heartbeat_req; + heartbeat_req->hb_interval = cpu_to_be32(hb_interval); + + control->last_cmd = pkt->hdr.pkt_cmd; + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_heartbeat_rsp(struct control *control) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_heartbeat *heartbeat_rsp; + + CONTROL_FUNCTION("%s: control_heartbeat_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_HEARTBEAT) + goto failure; + + heartbeat_rsp = &pkt->cmd.heartbeat_rsp; + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +static int control_init_recv_ios(struct control *control, + struct viport *viport, + struct vnic_control_packet *pkt) +{ + struct io *io; + struct ib_device *ibdev = viport->config->ibdev; + struct control_config *config = control->config; + dma_addr_t recv_dma; + unsigned int i; + + + control->recv_len = sizeof *pkt * config->num_recvs; + control->recv_dma = ib_dma_map_single(ibdev, + pkt, control->recv_len, + DMA_FROM_DEVICE); + + if (ib_dma_mapping_error(ibdev, control->recv_dma)) { + CONTROL_ERROR("control recv dma map error\n"); + goto failure; + } + + recv_dma = control->recv_dma; + for (i = 0; i < config->num_recvs; i++) { + io = &control->recv_ios[i].io; + io->viport = viport; + io->routine = control_recv_complete; + io->type = RECV; + + control->recv_ios[i].virtual_addr = (u8 *)pkt; + control->recv_ios[i].list.addr = recv_dma; + control->recv_ios[i].list.length = sizeof *pkt; + control->recv_ios[i].list.lkey = control->mr->lkey; + + recv_dma = recv_dma + sizeof *pkt; + pkt++; + + io->rwr.wr_id = (u64)io; + io->rwr.sg_list = &control->recv_ios[i].list; + io->rwr.num_sge = 1; + if (vnic_ib_post_recv(&control->ib_conn, io)) + goto unmap_recv; + } + + return 0; +unmap_recv: + ib_dma_unmap_single(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); +failure: + return -1; +} + +static int control_init_send_ios(struct control *control, + struct viport *viport, + struct vnic_control_packet *pkt) +{ + struct io *io; + struct ib_device *ibdev = viport->config->ibdev; + + control->send_io.virtual_addr = (u8 *)pkt; + control->send_len = sizeof *pkt; + control->send_dma = ib_dma_map_single(ibdev, pkt, + control->send_len, + DMA_TO_DEVICE); + if (ib_dma_mapping_error(ibdev, control->send_dma)) { + 
CONTROL_ERROR("control send dma map error\n"); + goto failure; + } + + io = &control->send_io.io; + io->viport = viport; + io->routine = control_send_complete; + + control->send_io.list.addr = control->send_dma; + control->send_io.list.length = sizeof *pkt; + control->send_io.list.lkey = control->mr->lkey; + + io->swr.wr_id = (u64)io; + io->swr.sg_list = &control->send_io.list; + io->swr.num_sge = 1; + io->swr.opcode = IB_WR_SEND; + io->swr.send_flags = IB_SEND_SIGNALED; + io->type = SEND; + + return 0; +failure: + return -1; +} + +int control_init(struct control *control, struct viport *viport, + struct control_config *config, struct ib_pd *pd) +{ + struct vnic_control_packet *pkt; + unsigned int sz; + + CONTROL_FUNCTION("%s: control_init()\n", + control_ifcfg_name(control)); + control->parent = viport; + control->config = config; + control->ib_conn.viport = viport; + control->ib_conn.ib_config = &config->ib_config; + control->ib_conn.state = IB_CONN_UNINITTED; + control->ib_conn.callback_thread = NULL; + control->ib_conn.callback_thread_end = 0; + control->req_state = REQ_INACTIVE; + control->last_cmd = CMD_INVALID; + control->seq_num = 0; + control->response = NULL; + control->info = NULL; + INIT_LIST_HEAD(&control->failure_list); + spin_lock_init(&control->io_lock); + + if (vnic_ib_conn_init(&control->ib_conn, viport, pd, + &config->ib_config)) { + CONTROL_ERROR("Control IB connection" + " initialization failed\n"); + goto failure; + } + + control->mr = ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(control->mr)) { + CONTROL_ERROR("%s: failed to register memory" + " for control connection\n", + control_ifcfg_name(control)); + goto destroy_conn; + } + + control->ib_conn.cm_id = ib_create_cm_id(viport->config->ibdev, + vnic_ib_cm_handler, + &control->ib_conn); + if (IS_ERR(control->ib_conn.cm_id)) { + CONTROL_ERROR("creating control CM ID failed\n"); + goto destroy_mr; + } + + sz = sizeof(struct recv_io) * config->num_recvs; + control->recv_ios = vmalloc(sz); + + if (!control->recv_ios) { + CONTROL_ERROR("%s: failed allocating space for recv ios\n", + control_ifcfg_name(control)); + goto destroy_cm_id; + } + + memset(control->recv_ios, 0, sz); + /*One send buffer and num_recvs recv buffers */ + control->local_storage = kzalloc(sizeof *pkt * + (config->num_recvs + 1), + GFP_KERNEL); + + if (!control->local_storage) { + CONTROL_ERROR("%s: failed allocating space" + " for local storage\n", + control_ifcfg_name(control)); + goto free_recv_ios; + } + + pkt = control->local_storage; + if (control_init_send_ios(control, viport, pkt)) + goto free_storage; + + pkt++; + if (control_init_recv_ios(control, viport, pkt)) + goto unmap_send; + + return 0; + +unmap_send: + ib_dma_unmap_single(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); +free_storage: + kfree(control->local_storage); +free_recv_ios: + vfree(control->recv_ios); +destroy_cm_id: + ib_destroy_cm_id(control->ib_conn.cm_id); +destroy_mr: + ib_dereg_mr(control->mr); +destroy_conn: + ib_destroy_qp(control->ib_conn.qp); + ib_destroy_cq(control->ib_conn.cq); +failure: + return -1; +} + +void control_cleanup(struct control *control) +{ + CONTROL_FUNCTION("%s: control_disconnect()\n", + control_ifcfg_name(control)); + + if (ib_send_cm_dreq(control->ib_conn.cm_id, NULL, 0)) + CONTROL_ERROR("control CM DREQ sending failed\n"); + + control->ib_conn.state = IB_CONN_DISCONNECTED; + control_timer_stop(control); + control->req_state = REQ_INACTIVE; + control->response = NULL; + control->last_cmd = 
CMD_INVALID; + completion_callback_cleanup(&control->ib_conn); + ib_destroy_cm_id(control->ib_conn.cm_id); + ib_destroy_qp(control->ib_conn.qp); + ib_destroy_cq(control->ib_conn.cq); + ib_dereg_mr(control->mr); + ib_dma_unmap_single(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + ib_dma_unmap_single(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + vfree(control->recv_ios); + kfree(control->local_storage); + +} + +static void control_log_report_status_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_REPORT_STATUS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO + " lan_switch_num = %u, is_fatal = %u\n", + pkt->cmd.report_status.lan_switch_num, + pkt->cmd.report_status.is_fatal); + printk(KERN_INFO + " status_number = %u, status_info = %u\n", + be32_to_cpu(pkt->cmd.report_status.status_number), + be32_to_cpu(pkt->cmd.report_status.status_info)); + pkt->cmd.report_status.file_name[31] = '\0'; + pkt->cmd.report_status.routine[31] = '\0'; + printk(KERN_INFO " filename = %s, routine = %s\n", + pkt->cmd.report_status.file_name, + pkt->cmd.report_status.routine); + printk(KERN_INFO + " line_num = %u, error_parameter = %u\n", + be32_to_cpu(pkt->cmd.report_status.line_num), + be32_to_cpu(pkt->cmd.report_status.error_parameter)); + pkt->cmd.report_status.desc_text[127] = '\0'; + printk(KERN_INFO " desc_text = %s\n", + pkt->cmd.report_status.desc_text); +} + +static void control_log_report_stats_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_REPORT_STATISTICS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " lan_switch_num = %u\n", + pkt->cmd.report_statistics_req.lan_switch_num); + if (pkt->hdr.pkt_type == TYPE_REQ) + return; + printk(KERN_INFO " if_in_broadcast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_broadcast_pkts)); + printk(" if_in_multicast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_multicast_pkts)); + printk(KERN_INFO " if_in_octets = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_octets)); + printk(" if_in_ucast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_ucast_pkts)); + printk(KERN_INFO " if_in_nucast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_nucast_pkts)); + printk(" if_in_underrun = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_underrun)); + printk(KERN_INFO " if_in_errors = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_errors)); + printk(" if_out_errors = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_errors)); + printk(KERN_INFO " if_out_octets = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_octets)); + printk(" if_out_ucast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_ucast_pkts)); + printk(KERN_INFO " if_out_multicast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_multicast_pkts)); + printk(" if_out_broadcast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_broadcast_pkts)); + printk(KERN_INFO " if_out_nucast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. 
+ if_out_nucast_pkts)); + printk(" if_out_ok = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp.if_out_ok)); + printk(KERN_INFO " if_in_ok = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp.if_in_ok)); + printk(" if_out_ucast_bytes = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_ucast_bytes)); + printk(KERN_INFO " if_out_multicast_bytes = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_multicast_bytes)); + printk(" if_out_broadcast_bytes = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_broadcast_bytes)); + printk(KERN_INFO " if_in_ucast_bytes = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_ucast_bytes)); + printk(" if_in_multicast_bytes = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_multicast_bytes)); + printk(KERN_INFO " if_in_broadcast_bytes = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_broadcast_bytes)); + printk(" ethernet_status = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + ethernet_status)); +} + +static void control_log_config_link_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_LINK\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " cmd_flags = %x\n", + pkt->cmd.config_link_req.cmd_flags); + if (pkt->cmd.config_link_req.cmd_flags & VNIC_FLAG_ENABLE_NIC) + printk(KERN_INFO + " VNIC_FLAG_ENABLE_NIC\n"); + if (pkt->cmd.config_link_req.cmd_flags & VNIC_FLAG_DISABLE_NIC) + printk(KERN_INFO + " VNIC_FLAG_DISABLE_NIC\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_ENABLE_MCAST_ALL) + printk(KERN_INFO + " VNIC_FLAG_ENABLE_" + "MCAST_ALL\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_DISABLE_MCAST_ALL) + printk(KERN_INFO + " VNIC_FLAG_DISABLE_" + "MCAST_ALL\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_ENABLE_PROMISC) + printk(KERN_INFO + " VNIC_FLAG_ENABLE_" + "PROMISC\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_DISABLE_PROMISC) + printk(KERN_INFO + " VNIC_FLAG_DISABLE_" + "PROMISC\n"); + if (pkt->cmd.config_link_req.cmd_flags & VNIC_FLAG_SET_MTU) + printk(KERN_INFO + " VNIC_FLAG_SET_MTU\n"); + printk(KERN_INFO + " lan_switch_num = %x, mtu_size = %d\n", + pkt->cmd.config_link_req.lan_switch_num, + be16_to_cpu(pkt->cmd.config_link_req.mtu_size)); + if (pkt->hdr.pkt_type == TYPE_RSP) { + printk(KERN_INFO + " default_vlan = %u," + " hw_mac_address =" + " %02x:%02x:%02x:%02x:%02x:%02x\n", + be16_to_cpu(pkt->cmd.config_link_req. + default_vlan), + pkt->cmd.config_link_req.hw_mac_address[0], + pkt->cmd.config_link_req.hw_mac_address[1], + pkt->cmd.config_link_req.hw_mac_address[2], + pkt->cmd.config_link_req.hw_mac_address[3], + pkt->cmd.config_link_req.hw_mac_address[4], + pkt->cmd.config_link_req.hw_mac_address[5]); + } +} + +static void print_config_addr(struct vnic_address_op *list, + int num_address_ops, size_t mgidoff) +{ + int i = 0; + + while (i < num_address_ops && i < 16) { + printk(KERN_INFO " list_address_ops[%u].index" + " = %u\n", i, be16_to_cpu(list->index)); + switch (list->operation) { + case VNIC_OP_GET_ENTRY: + printk(KERN_INFO " list_address_ops[%u]." + "operation = VNIC_OP_GET_ENTRY\n", i); + break; + case VNIC_OP_SET_ENTRY: + printk(KERN_INFO " list_address_ops[%u]." + "operation = VNIC_OP_SET_ENTRY\n", i); + break; + default: + printk(KERN_INFO " list_address_ops[%u]." 
+ "operation = UNKNOWN(%d)\n", i, + list->operation); + break; + } + printk(KERN_INFO " list_address_ops[%u].valid" + " = %u\n", i, list->valid); + printk(KERN_INFO " list_address_ops[%u].address" + " = %02x:%02x:%02x:%02x:%02x:%02x\n", i, + list->address[0], list->address[1], + list->address[2], list->address[3], + list->address[4], list->address[5]); + printk(KERN_INFO " list_address_ops[%u].vlan" + " = %u\n", i, be16_to_cpu(list->vlan)); + if (mgidoff) { + printk(KERN_INFO + " list_address_ops[%u].mgid" + " = " VNIC_GID_FMT "\n", i, + VNIC_GID_RAW_ARG((char *)list + mgidoff)); + list = (struct vnic_address_op *) + ((char *)list + sizeof(struct vnic_address_op2)); + } else + list = (struct vnic_address_op *) + ((char *)list + sizeof(struct vnic_address_op)); + i++; + } +} + +static void control_log_config_addrs_pkt(struct vnic_control_packet *pkt, + u8 addresses2) +{ + struct vnic_address_op *list; + int no_address_ops; + + if (addresses2) + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_ADDRESSES2\n"); + else + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_ADDRESSES\n"); + printk(KERN_INFO " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, pkt->hdr.pkt_retry_count); + if (addresses2) { + printk(KERN_INFO " num_address_ops = %x," + " lan_switch_num = %d\n", + pkt->cmd.config_addresses_req2.num_address_ops, + pkt->cmd.config_addresses_req2.lan_switch_num); + list = (struct vnic_address_op *) + pkt->cmd.config_addresses_req2.list_address_ops; + no_address_ops = pkt->cmd.config_addresses_req2.num_address_ops; + print_config_addr(list, no_address_ops, + offsetof(struct vnic_address_op2, mgid)); + } else { + printk(KERN_INFO " num_address_ops = %x," + " lan_switch_num = %d\n", + pkt->cmd.config_addresses_req.num_address_ops, + pkt->cmd.config_addresses_req.lan_switch_num); + list = pkt->cmd.config_addresses_req.list_address_ops; + no_address_ops = pkt->cmd.config_addresses_req.num_address_ops; + print_config_addr(list, no_address_ops, 0); + } +} + +static void control_log_exch_pools_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_EXCHANGE_POOLS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " datapath = %u\n", + pkt->cmd.exchange_pools_req.data_path); + printk(KERN_INFO " pool_rkey = %08x" + " pool_addr = %llx\n", + be32_to_cpu(pkt->cmd.exchange_pools_req.pool_rkey), + be64_to_cpu(pkt->cmd.exchange_pools_req.pool_addr)); +} + +static void control_log_data_path_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_DATA_PATH\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " path_identifier = %llx," + " data_path = %u\n", + pkt->cmd.config_data_path_req.path_identifier, + pkt->cmd.config_data_path_req.data_path); + printk(KERN_INFO + "host config size_recv_pool_entry = %u," + " num_recv_pool_entries = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config.size_recv_pool_entry), + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config.num_recv_pool_entries)); + printk(KERN_INFO + " timeout_before_kick = %u," + " num_recv_pool_entries_before_kick = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config.timeout_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config. 
+ num_recv_pool_entries_before_kick)); + printk(KERN_INFO + " num_recv_pool_bytes_before_kick = %u," + " free_recv_pool_entries_per_update = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config. + num_recv_pool_bytes_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config. + free_recv_pool_entries_per_update)); + printk(KERN_INFO + "eioc config size_recv_pool_entry = %u," + " num_recv_pool_entries = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config.size_recv_pool_entry), + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config.num_recv_pool_entries)); + printk(KERN_INFO + " timeout_before_kick = %u," + " num_recv_pool_entries_before_kick = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config.timeout_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config. + num_recv_pool_entries_before_kick)); + printk(KERN_INFO + " num_recv_pool_bytes_before_kick = %u," + " free_recv_pool_entries_per_update = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config. + num_recv_pool_bytes_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config. + free_recv_pool_entries_per_update)); +} + +static void control_log_init_vnic_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_INIT_VNIC\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO + " vnic_major_version = %u," + " vnic_minor_version = %u\n", + be16_to_cpu(pkt->cmd.init_vnic_req.vnic_major_version), + be16_to_cpu(pkt->cmd.init_vnic_req.vnic_minor_version)); + if (pkt->hdr.pkt_type == TYPE_REQ) { + printk(KERN_INFO + " vnic_instance = %u," + " num_data_paths = %u\n", + pkt->cmd.init_vnic_req.vnic_instance, + pkt->cmd.init_vnic_req.num_data_paths); + printk(KERN_INFO + " num_address_entries = %u\n", + be16_to_cpu(pkt->cmd.init_vnic_req. + num_address_entries)); + } else { + printk(KERN_INFO + " num_lan_switches = %u," + " num_data_paths = %u\n", + pkt->cmd.init_vnic_rsp.num_lan_switches, + pkt->cmd.init_vnic_rsp.num_data_paths); + printk(KERN_INFO + " num_address_entries = %u," + " features_supported = %08x\n", + be16_to_cpu(pkt->cmd.init_vnic_rsp. + num_address_entries), + be32_to_cpu(pkt->cmd.init_vnic_rsp. + features_supported)); + if (pkt->cmd.init_vnic_rsp.num_lan_switches != 0) { + printk(KERN_INFO + "lan_switch[0] lan_switch_num = %u," + " num_enet_ports = %08x\n", + pkt->cmd.init_vnic_rsp. + lan_switch[0].lan_switch_num, + pkt->cmd.init_vnic_rsp. + lan_switch[0].num_enet_ports); + printk(KERN_INFO + " default_vlan = %u," + " hw_mac_address =" + " %02x:%02x:%02x:%02x:%02x:%02x\n", + be16_to_cpu(pkt->cmd.init_vnic_rsp. + lan_switch[0].default_vlan), + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[0], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[1], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[2], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[3], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[4], + pkt->cmd.init_vnic_rsp.lan_switch[0]. 
+ hw_mac_address[5]); + } + } +} + +static void control_log_control_packet(struct vnic_control_packet *pkt) +{ + switch (pkt->hdr.pkt_type) { + case TYPE_INFO: + printk(KERN_INFO "control_packet: pkt_type = TYPE_INFO\n"); + break; + case TYPE_REQ: + printk(KERN_INFO "control_packet: pkt_type = TYPE_REQ\n"); + break; + case TYPE_RSP: + printk(KERN_INFO "control_packet: pkt_type = TYPE_RSP\n"); + break; + case TYPE_ERR: + printk(KERN_INFO "control_packet: pkt_type = TYPE_ERR\n"); + break; + default: + printk(KERN_INFO "control_packet: pkt_type = UNKNOWN\n"); + } + + switch (pkt->hdr.pkt_cmd) { + case CMD_INIT_VNIC: + control_log_init_vnic_pkt(pkt); + break; + case CMD_CONFIG_DATA_PATH: + control_log_data_path_pkt(pkt); + break; + case CMD_EXCHANGE_POOLS: + control_log_exch_pools_pkt(pkt); + break; + case CMD_CONFIG_ADDRESSES: + control_log_config_addrs_pkt(pkt, 0); + break; + case CMD_CONFIG_ADDRESSES2: + control_log_config_addrs_pkt(pkt, 1); + break; + case CMD_CONFIG_LINK: + control_log_config_link_pkt(pkt); + break; + case CMD_REPORT_STATISTICS: + control_log_report_stats_pkt(pkt); + break; + case CMD_CLEAR_STATISTICS: + printk(KERN_INFO + " pkt_cmd = CMD_CLEAR_STATISTICS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + break; + case CMD_REPORT_STATUS: + control_log_report_status_pkt(pkt); + + break; + case CMD_RESET: + printk(KERN_INFO + " pkt_cmd = CMD_RESET\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + break; + case CMD_HEARTBEAT: + printk(KERN_INFO + " pkt_cmd = CMD_HEARTBEAT\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " hb_interval = %d\n", + be32_to_cpu(pkt->cmd.heartbeat_req.hb_interval)); + break; + default: + printk(KERN_INFO + " pkt_cmd = UNKNOWN (%u)\n", + pkt->hdr.pkt_cmd); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + break; + } +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_control.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_control.h new file mode 100644 index 0000000..57fab67 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_control.h @@ -0,0 +1,179 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_CONTROL_H_INCLUDED +#define VNIC_CONTROL_H_INCLUDED + +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS +#include +#include +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ + +#include "vnic_ib.h" +#include "vnic_control_pkt.h" + +enum control_timer_state { + TIMER_IDLE = 0, + TIMER_ACTIVE = 1, + TIMER_EXPIRED = 2 +}; + +enum control_request_state { + REQ_INACTIVE, /* quiet state, all previous operations done + * response is NULL + * last_cmd = CMD_INVALID + * timer_state = IDLE + */ + REQ_POSTED, /* REQ put on send Q + * response is NULL + * last_cmd = command issued + * timer_state = ACTIVE + */ + REQ_SENT, /* Send completed for REQ + * response is NULL + * last_cmd = command issued + * timer_state = ACTIVE + */ + RSP_RECEIVED, /* Received Resp, but no Send completion yet + * response is response buffer received + * last_cmd = command issued + * timer_state = ACTIVE + */ + REQ_COMPLETED, /* all processing for REQ completed, ready to be gotten + * response is response buffer received + * last_cmd = command issued + * timer_state = ACTIVE + */ + REQ_FAILED, /* processing of REQ/RSP failed. + * response is NULL + * last_cmd = CMD_INVALID + * timer_state = IDLE or EXPIRED + * viport has been moved to error state to force + * recovery + */ +}; + +struct control { + struct viport *parent; + struct control_config *config; + struct ib_mr *mr; + struct vnic_ib_conn ib_conn; + struct vnic_control_packet *local_storage; + int send_len; + int recv_len; + u16 maj_ver; + u16 min_ver; + struct vnic_lan_switch_attribs lan_switch; + struct send_io send_io; + struct recv_io *recv_ios; + dma_addr_t send_dma; + dma_addr_t recv_dma; + enum control_timer_state timer_state; + enum control_request_state req_state; + struct timer_list timer; + u8 seq_num; + u8 last_cmd; + struct recv_io *response; + struct recv_io *info; + struct list_head failure_list; + spinlock_t io_lock; + struct completion done; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + struct { + cycles_t request_time; /* intermediate value */ + cycles_t response_time; + u32 response_num; + cycles_t response_max; + cycles_t response_min; + u32 timeout_num; + } statistics; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ +}; + +int control_init(struct control *control, struct viport *viport, + struct control_config *config, struct ib_pd *pd); + +void control_cleanup(struct control *control); + +void control_process_async(struct control *control); + +int control_init_vnic_req(struct control *control); +int control_init_vnic_rsp(struct control *control, u32 *features, + u8 *mac_address, u16 *num_addrs, u16 *vlan); + +int control_config_data_path_req(struct control *control, u64 path_id, + struct vnic_recv_pool_config *host, + struct vnic_recv_pool_config *eioc); +int control_config_data_path_rsp(struct control *control, + struct vnic_recv_pool_config *host, + struct vnic_recv_pool_config *eioc, + struct vnic_recv_pool_config *max_host, + struct vnic_recv_pool_config *max_eioc, + struct vnic_recv_pool_config *min_host, + struct vnic_recv_pool_config *min_eioc); + +int control_exchange_pools_req(struct control *control, + u64 addr, u32 rkey); +int control_exchange_pools_rsp(struct control *control, + u64 *addr, u32 *rkey); + +int control_config_link_req(struct control *control, 
+ u16 flags, u16 mtu); +int control_config_link_rsp(struct control *control, + u16 *flags, u16 *mtu); + +int control_config_addrs_req(struct control *control, + struct vnic_address_op2 *addrs, u16 num); +int control_config_addrs_rsp(struct control *control); + +int control_report_statistics_req(struct control *control); +int control_report_statistics_rsp(struct control *control, + struct vnic_cmd_report_stats_rsp *stats); + +int control_heartbeat_req(struct control *control, u32 hb_interval); +int control_heartbeat_rsp(struct control *control); + +int control_reset_req(struct control *control); +int control_reset_rsp(struct control *control); + +#define control_packet(io) \ + (struct vnic_control_packet *)(io)->virtual_addr +#define control_is_connected(control) \ + (vnic_ib_conn_connected(&((control)->ib_conn))) + +#define control_last_req(control) control_packet(&(control)->send_io) +#define control_features(control) (control)->features_supported + +#define control_get_mac_address(control,addr) \ + memcpy(addr, (control)->lan_switch.hw_mac_address, ETH_ALEN) + +#endif /* VNIC_CONTROL_H_INCLUDED */ diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h new file mode 100644 index 0000000..1fc62fb --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h @@ -0,0 +1,368 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */
+
+#ifndef VNIC_CONTROL_PKT_H_INCLUDED
+#define VNIC_CONTROL_PKT_H_INCLUDED
+
+#include
+#include
+
+#define VNIC_MAX_NODENAME_LEN 64
+
+struct vnic_connection_data {
+	u64 path_id;
+	u8 vnic_instance;
+	u8 path_num;
+	u8 nodename[VNIC_MAX_NODENAME_LEN + 1];
+	u8 reserved; /* for alignment */
+	__be32 features_supported;
+};
+
+struct vnic_control_header {
+	u8 pkt_type;
+	u8 pkt_cmd;
+	u8 pkt_seq_num;
+	u8 pkt_retry_count;
+	u32 reserved; /* for 64-bit alignment */
+};
+
+/* pkt_type values */
+enum {
+	TYPE_INFO = 0,
+	TYPE_REQ = 1,
+	TYPE_RSP = 2,
+	TYPE_ERR = 3
+};
+
+/* pkt_cmd values */
+enum {
+	CMD_INVALID = 0,
+	CMD_INIT_VNIC = 1,
+	CMD_CONFIG_DATA_PATH = 2,
+	CMD_EXCHANGE_POOLS = 3,
+	CMD_CONFIG_ADDRESSES = 4,
+	CMD_CONFIG_LINK = 5,
+	CMD_REPORT_STATISTICS = 6,
+	CMD_CLEAR_STATISTICS = 7,
+	CMD_REPORT_STATUS = 8,
+	CMD_RESET = 9,
+	CMD_HEARTBEAT = 10,
+	CMD_CONFIG_ADDRESSES2 = 11,
+};
+
+/* pkt_cmd CMD_INIT_VNIC, pkt_type TYPE_REQ data format */
+struct vnic_cmd_init_vnic_req {
+	__be16 vnic_major_version;
+	__be16 vnic_minor_version;
+	u8 vnic_instance;
+	u8 num_data_paths;
+	__be16 num_address_entries;
+};
+
+/* pkt_cmd CMD_INIT_VNIC, pkt_type TYPE_RSP subdata format */
+struct vnic_lan_switch_attribs {
+	u8 lan_switch_num;
+	u8 num_enet_ports;
+	__be16 default_vlan;
+	u8 hw_mac_address[ETH_ALEN];
+};
+
+/* pkt_cmd CMD_INIT_VNIC, pkt_type TYPE_RSP data format */
+struct vnic_cmd_init_vnic_rsp {
+	__be16 vnic_major_version;
+	__be16 vnic_minor_version;
+	u8 num_lan_switches;
+	u8 num_data_paths;
+	__be16 num_address_entries;
+	__be32 features_supported;
+	struct vnic_lan_switch_attribs lan_switch[1];
+};
+
+/* features_supported values */
+enum {
+	VNIC_FEAT_IPV4_HEADERS = 0x0001,
+	VNIC_FEAT_IPV6_HEADERS = 0x0002,
+	VNIC_FEAT_IPV4_CSUM_RX = 0x0004,
+	VNIC_FEAT_IPV4_CSUM_TX = 0x0008,
+	VNIC_FEAT_TCP_CSUM_RX = 0x0010,
+	VNIC_FEAT_TCP_CSUM_TX = 0x0020,
+	VNIC_FEAT_UDP_CSUM_RX = 0x0040,
+	VNIC_FEAT_UDP_CSUM_TX = 0x0080,
+	VNIC_FEAT_TCP_SEGMENT = 0x0100,
+	VNIC_FEAT_IPV4_IPSEC_OFFLOAD = 0x0200,
+	VNIC_FEAT_IPV6_IPSEC_OFFLOAD = 0x0400,
+	VNIC_FEAT_FCS_PROPAGATE = 0x0800,
+	VNIC_FEAT_PF_KICK = 0x1000,
+	VNIC_FEAT_PF_FORCE_ROUTE = 0x2000,
+	VNIC_FEAT_CHASH_OFFLOAD = 0x4000,
+	/* host send with immediate data */
+	VNIC_FEAT_RDMA_IMMED = 0x8000,
+	/* host ignore inbound PF_VLAN_INSERT flag */
+	VNIC_FEAT_IGNORE_VLAN = 0x10000,
+	/* host supports IB multicast for inbound Ethernet mcast traffic */
+	VNIC_FEAT_INBOUND_IB_MC = 0x20000,
+};
+
+/* pkt_cmd CMD_CONFIG_DATA_PATH subdata format */
+struct vnic_recv_pool_config {
+	__be32 size_recv_pool_entry;
+	__be32 num_recv_pool_entries;
+	__be32 timeout_before_kick;
+	__be32 num_recv_pool_entries_before_kick;
+	__be32 num_recv_pool_bytes_before_kick;
+	__be32 free_recv_pool_entries_per_update;
+};
+
+/* pkt_cmd CMD_CONFIG_DATA_PATH data format */
+struct vnic_cmd_config_data_path {
+	u64 path_identifier;
+	u8 data_path;
+	u8 reserved[3];
+	struct vnic_recv_pool_config host_recv_pool_config;
+	struct vnic_recv_pool_config eioc_recv_pool_config;
+};
+
+/* pkt_cmd CMD_EXCHANGE_POOLS data format */
+struct vnic_cmd_exchange_pools {
+	u8 data_path;
+	u8 reserved[3];
+	__be32 pool_rkey;
+	__be64 pool_addr;
+};
+
+/* pkt_cmd CMD_CONFIG_ADDRESSES subdata format */
+struct vnic_address_op {
+	__be16 index;
+	u8 operation;
+	u8 valid;
+	u8 address[6];
+	__be16 vlan;
+};
+
+/* pkt_cmd CMD_CONFIG_ADDRESSES2 subdata format */
+struct vnic_address_op2 {
+	__be16 index;
+	u8 operation;
+	u8 valid;
+	u8 address[6];
+	__be16 vlan;
+	u32 reserved; /* for
alignment */ + union ib_gid mgid; /* valid in rsp only if both ends support mcast */ +}; + +/* operation values */ +enum { + VNIC_OP_SET_ENTRY = 0x01, + VNIC_OP_GET_ENTRY = 0x02 +}; + +/* pkt_cmd CMD_CONFIG_ADDRESSES data format */ +struct vnic_cmd_config_addresses { + u8 num_address_ops; + u8 lan_switch_num; + struct vnic_address_op list_address_ops[1]; +}; + +/* pkt_cmd CMD_CONFIG_ADDRESSES2 data format */ +struct vnic_cmd_config_addresses2 { + u8 num_address_ops; + u8 lan_switch_num; + u8 reserved1; + u8 reserved2; + u8 reserved3; + struct vnic_address_op2 list_address_ops[1]; +}; + +/* CMD_CONFIG_LINK data format */ +struct vnic_cmd_config_link { + u8 cmd_flags; + u8 lan_switch_num; + __be16 mtu_size; + __be16 default_vlan; + u8 hw_mac_address[6]; + u32 reserved; /* for alignment */ + /* valid in rsp only if both ends support mcast */ + union ib_gid allmulti_mgid; +}; + +/* cmd_flags values */ +enum { + VNIC_FLAG_ENABLE_NIC = 0x01, + VNIC_FLAG_DISABLE_NIC = 0x02, + VNIC_FLAG_ENABLE_MCAST_ALL = 0x04, + VNIC_FLAG_DISABLE_MCAST_ALL = 0x08, + VNIC_FLAG_ENABLE_PROMISC = 0x10, + VNIC_FLAG_DISABLE_PROMISC = 0x20, + VNIC_FLAG_SET_MTU = 0x40 +}; + +/* pkt_cmd CMD_REPORT_STATISTICS, pkt_type TYPE_REQ data format */ +struct vnic_cmd_report_stats_req { + u8 lan_switch_num; +}; + +/* pkt_cmd CMD_REPORT_STATISTICS, pkt_type TYPE_RSP data format */ +struct vnic_cmd_report_stats_rsp { + u8 lan_switch_num; + u8 reserved[7]; /* for 64-bit alignment */ + __be64 if_in_broadcast_pkts; + __be64 if_in_multicast_pkts; + __be64 if_in_octets; + __be64 if_in_ucast_pkts; + __be64 if_in_nucast_pkts; /* if_in_broadcast_pkts + + if_in_multicast_pkts */ + __be64 if_in_underrun; /* (OID_GEN_RCV_NO_BUFFER) */ + __be64 if_in_errors; /* (OID_GEN_RCV_ERROR) */ + __be64 if_out_errors; /* (OID_GEN_XMIT_ERROR) */ + __be64 if_out_octets; + __be64 if_out_ucast_pkts; + __be64 if_out_multicast_pkts; + __be64 if_out_broadcast_pkts; + __be64 if_out_nucast_pkts; /* if_out_broadcast_pkts + + if_out_multicast_pkts */ + __be64 if_out_ok; /* if_out_nucast_pkts + + if_out_ucast_pkts(OID_GEN_XMIT_OK) */ + __be64 if_in_ok; /* if_in_nucast_pkts + + if_in_ucast_pkts(OID_GEN_RCV_OK) */ + __be64 if_out_ucast_bytes; /* (OID_GEN_DIRECTED_BYTES_XMT) */ + __be64 if_out_multicast_bytes; /* (OID_GEN_MULTICAST_BYTES_XMT) */ + __be64 if_out_broadcast_bytes; /* (OID_GEN_BROADCAST_BYTES_XMT) */ + __be64 if_in_ucast_bytes; /* (OID_GEN_DIRECTED_BYTES_RCV) */ + __be64 if_in_multicast_bytes; /* (OID_GEN_MULTICAST_BYTES_RCV) */ + __be64 if_in_broadcast_bytes; /* (OID_GEN_BROADCAST_BYTES_RCV) */ + __be64 ethernet_status; /* OID_GEN_MEDIA_CONNECT_STATUS) */ +}; + +/* pkt_cmd CMD_CLEAR_STATISTICS data format */ +struct vnic_cmd_clear_statistics { + u8 lan_switch_num; +}; + +/* pkt_cmd CMD_REPORT_STATUS data format */ +struct vnic_cmd_report_status { + u8 lan_switch_num; + u8 is_fatal; + u8 reserved[2]; /* for 32-bit alignment */ + __be32 status_number; + __be32 status_info; + u8 file_name[32]; + u8 routine[32]; + __be32 line_num; + __be32 error_parameter; + u8 desc_text[128]; +}; + +/* pkt_cmd CMD_HEARTBEAT data format */ +struct vnic_cmd_heartbeat { + __be32 hb_interval; +}; + +enum { + VNIC_STATUS_LINK_UP = 1, + VNIC_STATUS_LINK_DOWN = 2, + VNIC_STATUS_ENET_AGGREGATION_CHANGE = 3, + VNIC_STATUS_EIOC_SHUTDOWN = 4, + VNIC_STATUS_CONTROL_ERROR = 5, + VNIC_STATUS_EIOC_ERROR = 6 +}; + +#define VNIC_MAX_CONTROLPKTSZ 256 +#define VNIC_MAX_CONTROLDATASZ \ + (VNIC_MAX_CONTROLPKTSZ - sizeof(struct vnic_control_header)) + +struct vnic_control_packet { + struct 
vnic_control_header hdr;
+	union {
+		struct vnic_cmd_init_vnic_req init_vnic_req;
+		struct vnic_cmd_init_vnic_rsp init_vnic_rsp;
+		struct vnic_cmd_config_data_path config_data_path_req;
+		struct vnic_cmd_config_data_path config_data_path_rsp;
+		struct vnic_cmd_exchange_pools exchange_pools_req;
+		struct vnic_cmd_exchange_pools exchange_pools_rsp;
+		struct vnic_cmd_config_addresses config_addresses_req;
+		struct vnic_cmd_config_addresses2 config_addresses_req2;
+		struct vnic_cmd_config_addresses config_addresses_rsp;
+		struct vnic_cmd_config_addresses2 config_addresses_rsp2;
+		struct vnic_cmd_config_link config_link_req;
+		struct vnic_cmd_config_link config_link_rsp;
+		struct vnic_cmd_report_stats_req report_statistics_req;
+		struct vnic_cmd_report_stats_rsp report_statistics_rsp;
+		struct vnic_cmd_clear_statistics clear_statistics_req;
+		struct vnic_cmd_clear_statistics clear_statistics_rsp;
+		struct vnic_cmd_report_status report_status;
+		struct vnic_cmd_heartbeat heartbeat_req;
+		struct vnic_cmd_heartbeat heartbeat_rsp;
+
+		char cmd_data[VNIC_MAX_CONTROLDATASZ];
+	} cmd;
+};
+
+union ib_gid_cpu {
+	u8 raw[16];
+	struct {
+		u64 subnet_prefix;
+		u64 interface_id;
+	} global;
+};
+
+static inline void bswap_ib_gid(union ib_gid *mgid1, union ib_gid_cpu *mgid2)
+{
+	/* swap hi & low */
+	__be64 low = mgid1->global.subnet_prefix;
+	mgid2->global.subnet_prefix = be64_to_cpu(mgid1->global.interface_id);
+	mgid2->global.interface_id = be64_to_cpu(low);
+}
+
+#define VNIC_GID_FMT "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x"
+
+#define VNIC_GID_RAW_ARG(gid) be16_to_cpu(*(__be16 *)&(gid)[0]), \
+			be16_to_cpu(*(__be16 *)&(gid)[2]), \
+			be16_to_cpu(*(__be16 *)&(gid)[4]), \
+			be16_to_cpu(*(__be16 *)&(gid)[6]), \
+			be16_to_cpu(*(__be16 *)&(gid)[8]), \
+			be16_to_cpu(*(__be16 *)&(gid)[10]), \
+			be16_to_cpu(*(__be16 *)&(gid)[12]), \
+			be16_to_cpu(*(__be16 *)&(gid)[14])
+
+
+/* These defines are used to figure out how many address entries can be passed
+ * in config_addresses request.
+ */
+#define MAX_CONFIG_ADDR_ENTRIES \
+	((VNIC_MAX_CONTROLDATASZ - (sizeof(struct vnic_cmd_config_addresses) \
+	- sizeof(struct vnic_address_op)))/sizeof(struct vnic_address_op))
+#define MAX_CONFIG_ADDR_ENTRIES2 \
+	((VNIC_MAX_CONTROLDATASZ - (sizeof(struct vnic_cmd_config_addresses2) \
+	- sizeof(struct vnic_address_op2)))/sizeof(struct vnic_address_op2))
+
+
+#endif /* VNIC_CONTROL_PKT_H_INCLUDED */

From ramachandra.kuchimanchi at qlogic.com  Mon May 19 03:33:58 2008
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Mon, 19 May 2008 16:03:58 +0530
Subject: [ofa-general] [PATCH v2 05/13] QLogic VNIC: Implementation of Data path of communication protocol
In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain>
References: <20080519102843.12355.832.stgit@localhost.localdomain>
Message-ID: <20080519103358.12355.76791.stgit@localhost.localdomain>

From: Ramachandra K

This patch implements the actual data transfer part of the communication
protocol with the EVIC/VEx. RDMA of Ethernet packets is implemented here.
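For reviewers new to the protocol: each transmit is a single RDMA write that
deposits the Ethernet frame at the tail of a remote EIOC buffer, immediately
followed by a struct viport_trailer. The snippet below is a minimal,
self-contained sketch of just that placement arithmetic, as done by
data_rdma_packet() in this patch; TRAILER_ALIGNMENT and TRAILER_SIZE are
illustrative stand-ins for VIPORT_TRAILER_ALIGNMENT and
sizeof(struct viport_trailer), not values taken from the patch:

	#include <stdint.h>
	#include <stdio.h>

	#define TRAILER_ALIGNMENT 8U	/* stand-in for VIPORT_TRAILER_ALIGNMENT */
	#define TRAILER_SIZE	  32U	/* stand-in for sizeof(struct viport_trailer) */

	int main(void)
	{
		uint64_t remote_addr = 0x100000; /* bpe->remote_addr (example value) */
		uint64_t buffer_sz = 2048;	 /* xmit_pool->buffer_sz (example value) */
		uint32_t pkt_len = 1514;	 /* bytes in the outgoing frame */

		/* pad the payload so the trailer that follows it stays aligned */
		uint32_t len = (pkt_len + TRAILER_ALIGNMENT - 1) &
			       ~(TRAILER_ALIGNMENT - 1);

		/* place payload + trailer flush against the end of the buffer */
		uint64_t target = remote_addr + buffer_sz - (TRAILER_SIZE + len);

		printf("RDMA write: %u payload bytes (+%u pad, +%u trailer) at"
		       " buffer offset 0x%llx\n", pkt_len, len - pkt_len,
		       TRAILER_SIZE, (unsigned long long)(target - remote_addr));
		return 0;
	}

The receive side mirrors this layout: data_alloc_buffers() points
rdma_dest->trailer at the end of each receive buffer, so the trailer is always
found at a fixed offset from the buffer's tail regardless of frame size.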
Signed-off-by: Ramachandra K
Signed-off-by: Poornima Kamath
Signed-off-by: Amar Mudrankit
---

 drivers/infiniband/ulp/qlgc_vnic/vnic_data.c    | 1492 +++++++++++++++++++++++
 drivers/infiniband/ulp/qlgc_vnic/vnic_data.h    |  206 +++
 drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h |  103 ++
 3 files changed, 1801 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_data.c
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_data.h
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h

diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_data.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_data.c
new file mode 100644
index 0000000..b81fcde
--- /dev/null
+++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_data.c
@@ -0,0 +1,1492 @@
+/*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *	- Redistributions of source code must retain the above
+ *	  copyright notice, this list of conditions and the following
+ *	  disclaimer.
+ *
+ *	- Redistributions in binary form must reproduce the above
+ *	  copyright notice, this list of conditions and the following
+ *	  disclaimer in the documentation and/or other materials
+ *	  provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include
+#include
+#include
+#include
+
+#include "vnic_util.h"
+#include "vnic_viport.h"
+#include "vnic_main.h"
+#include "vnic_data.h"
+#include "vnic_trailer.h"
+#include "vnic_stats.h"
+
+static void data_received_kick(struct io *io);
+static void data_xmit_complete(struct io *io);
+
+static void mc_data_recv_routine(struct io *io);
+static void mc_data_post_recvs(struct mc_data *mc_data);
+static void mc_data_recv_to_skbuff(struct viport *viport, struct sk_buff *skb,
+				   struct viport_trailer *trailer);
+
+static u32 min_rcv_skb = 60;
+module_param(min_rcv_skb, int, 0444);
+MODULE_PARM_DESC(min_rcv_skb, "Packets of size (in bytes) less than"
+		 " or equal to this value will be copied during receive."
+		 " Default 60");
+
+static u32 min_xmt_skb = 60;
+module_param(min_xmt_skb, int, 0444);
+MODULE_PARM_DESC(min_xmt_skb, "Packets of size (in bytes) less than"
+		 " or equal to this value will be copied during transmit. "
+ "Default 60"); + +int data_init(struct data *data, struct viport *viport, + struct data_config *config, struct ib_pd *pd) +{ + DATA_FUNCTION("data_init()\n"); + + data->parent = viport; + data->config = config; + data->ib_conn.viport = viport; + data->ib_conn.ib_config = &config->ib_config; + data->ib_conn.state = IB_CONN_UNINITTED; + data->ib_conn.callback_thread = NULL; + data->ib_conn.callback_thread_end = 0; + + if ((min_xmt_skb < 60) || (min_xmt_skb > 9000)) { + DATA_ERROR("min_xmt_skb (%d) must be between 60 and 9000\n", + min_xmt_skb); + goto failure; + } + if (vnic_ib_conn_init(&data->ib_conn, viport, pd, + &config->ib_config)) { + DATA_ERROR("Data IB connection initialization failed\n"); + goto failure; + } + data->mr = ib_get_dma_mr(pd, + IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_READ | + IB_ACCESS_REMOTE_WRITE); + if (IS_ERR(data->mr)) { + DATA_ERROR("failed to register memory for" + " data connection\n"); + goto destroy_conn; + } + + data->ib_conn.cm_id = ib_create_cm_id(viport->config->ibdev, + vnic_ib_cm_handler, + &data->ib_conn); + + if (IS_ERR(data->ib_conn.cm_id)) { + DATA_ERROR("creating data CM ID failed\n"); + goto dereg_mr; + } + + return 0; + +dereg_mr: + ib_dereg_mr(data->mr); +destroy_conn: + completion_callback_cleanup(&data->ib_conn); + ib_destroy_qp(data->ib_conn.qp); + ib_destroy_cq(data->ib_conn.cq); +failure: + return -1; +} + +static void data_post_recvs(struct data *data) +{ + unsigned long flags; + int i = 0; + + DATA_FUNCTION("data_post_recvs()\n"); + spin_lock_irqsave(&data->recv_ios_lock, flags); + while (!list_empty(&data->recv_ios)) { + struct io *io = list_entry(data->recv_ios.next, + struct io, list_ptrs); + struct recv_io *recv_io = (struct recv_io *)io; + + list_del(&recv_io->io.list_ptrs); + spin_unlock_irqrestore(&data->recv_ios_lock, flags); + if (vnic_ib_post_recv(&data->ib_conn, &recv_io->io)) { + viport_failure(data->parent); + return; + } + i++; + spin_lock_irqsave(&data->recv_ios_lock, flags); + } + spin_unlock_irqrestore(&data->recv_ios_lock, flags); + DATA_INFO("data posted %d %p\n", i, &data->recv_ios); +} + +static void data_init_pool_work_reqs(struct data *data, + struct recv_io *recv_io) +{ + struct recv_pool *recv_pool = &data->recv_pool; + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct rdma_io *rdma_io; + struct rdma_dest *rdma_dest; + dma_addr_t xmit_dma; + u8 *xmit_data; + unsigned int i; + + INIT_LIST_HEAD(&data->recv_ios); + spin_lock_init(&data->recv_ios_lock); + spin_lock_init(&data->xmit_buf_lock); + for (i = 0; i < data->config->num_recvs; i++) { + recv_io[i].io.viport = data->parent; + recv_io[i].io.routine = data_received_kick; + recv_io[i].list.addr = data->region_data_dma; + recv_io[i].list.length = 4; + recv_io[i].list.lkey = data->mr->lkey; + + recv_io[i].io.rwr.wr_id = (u64)&recv_io[i].io; + recv_io[i].io.rwr.sg_list = &recv_io[i].list; + recv_io[i].io.rwr.num_sge = 1; + + list_add(&recv_io[i].io.list_ptrs, &data->recv_ios); + } + + INIT_LIST_HEAD(&recv_pool->avail_recv_bufs); + for (i = 0; i < recv_pool->pool_sz; i++) { + rdma_dest = &recv_pool->recv_bufs[i]; + list_add(&rdma_dest->list_ptrs, + &recv_pool->avail_recv_bufs); + } + + xmit_dma = xmit_pool->xmitdata_dma; + xmit_data = xmit_pool->xmit_data; + + for (i = 0; i < xmit_pool->num_xmit_bufs; i++) { + rdma_io = &xmit_pool->xmit_bufs[i]; + rdma_io->index = i; + rdma_io->io.viport = data->parent; + rdma_io->io.routine = data_xmit_complete; + + rdma_io->list[0].lkey = data->mr->lkey; + rdma_io->list[1].lkey = data->mr->lkey; + rdma_io->io.swr.wr_id = 
(u64)rdma_io;
+		rdma_io->io.swr.sg_list = rdma_io->list;
+		rdma_io->io.swr.num_sge = 2;
+		rdma_io->io.swr.opcode = IB_WR_RDMA_WRITE;
+		rdma_io->io.swr.send_flags = IB_SEND_SIGNALED;
+		rdma_io->io.type = RDMA;
+
+		rdma_io->data = xmit_data;
+		rdma_io->data_dma = xmit_dma;
+
+		xmit_data += ALIGN(min_xmt_skb, VIPORT_TRAILER_ALIGNMENT);
+		xmit_dma += ALIGN(min_xmt_skb, VIPORT_TRAILER_ALIGNMENT);
+		rdma_io->trailer = (struct viport_trailer *)xmit_data;
+		rdma_io->trailer_dma = xmit_dma;
+		xmit_data += sizeof(struct viport_trailer);
+		xmit_dma += sizeof(struct viport_trailer);
+	}
+
+	xmit_pool->rdma_rkey = data->mr->rkey;
+	xmit_pool->rdma_addr = xmit_pool->buf_pool_dma;
+}
+
+static void data_init_free_bufs_swrs(struct data *data)
+{
+	struct rdma_io *rdma_io;
+	struct send_io *send_io;
+
+	rdma_io = &data->free_bufs_io;
+	rdma_io->io.viport = data->parent;
+	rdma_io->io.routine = NULL;
+
+	rdma_io->list[0].lkey = data->mr->lkey;
+
+	rdma_io->io.swr.wr_id = (u64)rdma_io;
+	rdma_io->io.swr.sg_list = rdma_io->list;
+	rdma_io->io.swr.num_sge = 1;
+	rdma_io->io.swr.opcode = IB_WR_RDMA_WRITE;
+	rdma_io->io.swr.send_flags = IB_SEND_SIGNALED;
+	rdma_io->io.type = RDMA;
+
+	send_io = &data->kick_io;
+	send_io->io.viport = data->parent;
+	send_io->io.routine = NULL;
+
+	send_io->list.addr = data->region_data_dma;
+	send_io->list.length = 0;
+	send_io->list.lkey = data->mr->lkey;
+
+	send_io->io.swr.wr_id = (u64)send_io;
+	send_io->io.swr.sg_list = &send_io->list;
+	send_io->io.swr.num_sge = 1;
+	send_io->io.swr.opcode = IB_WR_SEND;
+	send_io->io.swr.send_flags = IB_SEND_SIGNALED;
+	send_io->io.type = SEND;
+}
+
+static int data_init_buf_pools(struct data *data)
+{
+	struct recv_pool *recv_pool = &data->recv_pool;
+	struct xmit_pool *xmit_pool = &data->xmit_pool;
+	struct viport *viport = data->parent;
+
+	recv_pool->buf_pool_len =
+	    sizeof(struct buff_pool_entry) * recv_pool->eioc_pool_sz;
+
+	recv_pool->buf_pool = kzalloc(recv_pool->buf_pool_len, GFP_KERNEL);
+
+	if (!recv_pool->buf_pool) {
+		DATA_ERROR("failed allocating %d bytes"
+			   " for recv pool bufpool\n",
+			   recv_pool->buf_pool_len);
+		goto failure;
+	}
+
+	recv_pool->buf_pool_dma =
+	    ib_dma_map_single(viport->config->ibdev,
+			      recv_pool->buf_pool, recv_pool->buf_pool_len,
+			      DMA_TO_DEVICE);
+
+	if (ib_dma_mapping_error(viport->config->ibdev, recv_pool->buf_pool_dma)) {
+		DATA_ERROR("recv buf_pool dma map error\n");
+		goto free_recv_pool;
+	}
+
+	xmit_pool->buf_pool_len =
+	    sizeof(struct buff_pool_entry) * xmit_pool->pool_sz;
+	xmit_pool->buf_pool = kzalloc(xmit_pool->buf_pool_len, GFP_KERNEL);
+
+	if (!xmit_pool->buf_pool) {
+		DATA_ERROR("failed allocating %d bytes"
+			   " for xmit pool bufpool\n",
+			   xmit_pool->buf_pool_len);
+		goto unmap_recv_pool;
+	}
+
+	xmit_pool->buf_pool_dma =
+	    ib_dma_map_single(viport->config->ibdev,
+			      xmit_pool->buf_pool, xmit_pool->buf_pool_len,
+			      DMA_FROM_DEVICE);
+
+	if (ib_dma_mapping_error(viport->config->ibdev, xmit_pool->buf_pool_dma)) {
+		DATA_ERROR("xmit buf_pool dma map error\n");
+		goto free_xmit_pool;
+	}
+
+	xmit_pool->xmit_data = kzalloc(xmit_pool->xmitdata_len, GFP_KERNEL);
+
+	if (!xmit_pool->xmit_data) {
+		DATA_ERROR("failed allocating %d bytes for xmit data\n",
+			   xmit_pool->xmitdata_len);
+		goto unmap_xmit_pool;
+	}
+
+	xmit_pool->xmitdata_dma =
+	    ib_dma_map_single(viport->config->ibdev,
+			      xmit_pool->xmit_data, xmit_pool->xmitdata_len,
+			      DMA_TO_DEVICE);
+
+	if (ib_dma_mapping_error(viport->config->ibdev, xmit_pool->xmitdata_dma)) {
+		DATA_ERROR("xmit data dma map error\n");
+		goto free_xmit_data;
+	}
+
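+	/* Both pool entry tables and the xmit data region (small-packet
+	 * copy space plus the per-buffer trailers) are now allocated and
+	 * DMA-mapped.
+	 */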
return 0; + +free_xmit_data: + kfree(xmit_pool->xmit_data); +unmap_xmit_pool: + ib_dma_unmap_single(data->parent->config->ibdev, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_FROM_DEVICE); +free_xmit_pool: + kfree(xmit_pool->buf_pool); +unmap_recv_pool: + ib_dma_unmap_single(data->parent->config->ibdev, + recv_pool->buf_pool_dma, + recv_pool->buf_pool_len, DMA_TO_DEVICE); +free_recv_pool: + kfree(recv_pool->buf_pool); +failure: + return -1; +} + +static void data_init_xmit_pool(struct data *data) +{ + struct xmit_pool *xmit_pool = &data->xmit_pool; + + xmit_pool->pool_sz = + be32_to_cpu(data->eioc_pool_parms.num_recv_pool_entries); + xmit_pool->buffer_sz = + be32_to_cpu(data->eioc_pool_parms.size_recv_pool_entry); + + xmit_pool->notify_count = 0; + xmit_pool->notify_bundle = data->config->notify_bundle; + xmit_pool->next_xmit_pool = 0; + xmit_pool->num_xmit_bufs = xmit_pool->notify_bundle * 2; + xmit_pool->next_xmit_buf = 0; + xmit_pool->last_comp_buf = xmit_pool->num_xmit_bufs - 1; + /* This assumes that data_init_recv_pool has been called + * before. + */ + data->max_mtu = MAX_PAYLOAD(min((data)->recv_pool.buffer_sz, + (data)->xmit_pool.buffer_sz)) - VLAN_ETH_HLEN; + + xmit_pool->kick_count = 0; + xmit_pool->kick_byte_count = 0; + + xmit_pool->send_kicks = + be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_entries_before_kick) + || be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_bytes_before_kick); + xmit_pool->kick_bundle = + be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_entries_before_kick); + xmit_pool->kick_byte_bundle = + be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_bytes_before_kick); + + xmit_pool->need_buffers = 1; + + xmit_pool->xmitdata_len = + BUFFER_SIZE(min_xmt_skb) * xmit_pool->num_xmit_bufs; +} + +static void data_init_recv_pool(struct data *data) +{ + struct recv_pool *recv_pool = &data->recv_pool; + + recv_pool->pool_sz = data->config->host_recv_pool_entries; + recv_pool->eioc_pool_sz = + be32_to_cpu(data->host_pool_parms.num_recv_pool_entries); + if (recv_pool->pool_sz > recv_pool->eioc_pool_sz) + recv_pool->pool_sz = + be32_to_cpu(data->host_pool_parms.num_recv_pool_entries); + + recv_pool->buffer_sz = + be32_to_cpu(data->host_pool_parms.size_recv_pool_entry); + + recv_pool->sz_free_bundle = + be32_to_cpu(data-> + host_pool_parms.free_recv_pool_entries_per_update); + recv_pool->num_free_bufs = 0; + recv_pool->num_posted_bufs = 0; + + recv_pool->next_full_buf = 0; + recv_pool->next_free_buf = 0; + recv_pool->kick_on_free = 0; +} + +int data_connect(struct data *data) +{ + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct recv_pool *recv_pool = &data->recv_pool; + struct recv_io *recv_io; + unsigned int sz; + struct viport *viport = data->parent; + + DATA_FUNCTION("data_connect()\n"); + + /* Do not interchange the order of the functions + * called below as this will affect the MAX MTU + * calculation + */ + + data_init_recv_pool(data); + data_init_xmit_pool(data); + + sz = sizeof(struct rdma_dest) * recv_pool->pool_sz + + sizeof(struct recv_io) * data->config->num_recvs + + sizeof(struct rdma_io) * xmit_pool->num_xmit_bufs; + + data->local_storage = vmalloc(sz); + + if (!data->local_storage) { + DATA_ERROR("failed allocating %d bytes" + " local storage\n", sz); + goto out; + } + + memset(data->local_storage, 0, sz); + + recv_pool->recv_bufs = (struct rdma_dest *)data->local_storage; + sz = sizeof(struct rdma_dest) * recv_pool->pool_sz; + + recv_io = (struct recv_io *)(data->local_storage + sz); + sz += sizeof(struct recv_io) * 
data->config->num_recvs; + + xmit_pool->xmit_bufs = (struct rdma_io *)(data->local_storage + sz); + data->region_data = kzalloc(4, GFP_KERNEL); + + if (!data->region_data) { + DATA_ERROR("failed to alloc memory for region data\n"); + goto free_local_storage; + } + + data->region_data_dma = + ib_dma_map_single(viport->config->ibdev, + data->region_data, 4, DMA_BIDIRECTIONAL); + + if (ib_dma_mapping_error(viport->config->ibdev, data->region_data_dma)) { + DATA_ERROR("region data dma map error\n"); + goto free_region_data; + } + + if (data_init_buf_pools(data)) + goto unmap_region_data; + + data_init_free_bufs_swrs(data); + data_init_pool_work_reqs(data, recv_io); + + data_post_recvs(data); + + if (vnic_ib_cm_connect(&data->ib_conn)) + goto unmap_region_data; + + return 0; + +unmap_region_data: + ib_dma_unmap_single(data->parent->config->ibdev, + data->region_data_dma, 4, DMA_BIDIRECTIONAL); +free_region_data: + kfree(data->region_data); +free_local_storage: + vfree(data->local_storage); +out: + return -1; +} + +static void data_add_free_buffer(struct data *data, int index, + struct rdma_dest *rdma_dest) +{ + struct recv_pool *pool = &data->recv_pool; + struct buff_pool_entry *bpe; + dma_addr_t vaddr_dma; + + DATA_FUNCTION("data_add_free_buffer()\n"); + rdma_dest->trailer->connection_hash_and_valid = 0; + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + bpe = &pool->buf_pool[index]; + bpe->rkey = cpu_to_be32(data->mr->rkey); + vaddr_dma = ib_dma_map_single(data->parent->config->ibdev, + rdma_dest->data, pool->buffer_sz, + DMA_FROM_DEVICE); + if (ib_dma_mapping_error(data->parent->config->ibdev, vaddr_dma)) { + DATA_ERROR("rdma_dest->data dma map error\n"); + goto failure; + } + bpe->remote_addr = cpu_to_be64(vaddr_dma); + bpe->valid = (u32) (rdma_dest - &pool->recv_bufs[0]) + 1; + ++pool->num_free_bufs; +failure: + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); +} + +/* NOTE: this routine is not reentrant */ +static void data_alloc_buffers(struct data *data, int initial_allocation) +{ + struct recv_pool *pool = &data->recv_pool; + struct rdma_dest *rdma_dest; + struct sk_buff *skb; + int index; + + DATA_FUNCTION("data_alloc_buffers()\n"); + index = ADD(pool->next_free_buf, pool->num_free_bufs, + pool->eioc_pool_sz); + + while (!list_empty(&pool->avail_recv_bufs)) { + rdma_dest = + list_entry(pool->avail_recv_bufs.next, + struct rdma_dest, list_ptrs); + if (!rdma_dest->skb) { + if (initial_allocation) + skb = alloc_skb(pool->buffer_sz + 2, + GFP_KERNEL); + else + skb = dev_alloc_skb(pool->buffer_sz + 2); + if (!skb) + break; + skb_reserve(skb, 2); + skb_put(skb, pool->buffer_sz); + rdma_dest->skb = skb; + rdma_dest->data = skb->data; + rdma_dest->trailer = + (struct viport_trailer *)(rdma_dest->data + + pool->buffer_sz - + sizeof(struct + viport_trailer)); + } + rdma_dest->trailer->connection_hash_and_valid = 0; + + list_del_init(&rdma_dest->list_ptrs); + + data_add_free_buffer(data, index, rdma_dest); + index = NEXT(index, pool->eioc_pool_sz); + } +} + +static void data_send_kick_message(struct data *data) +{ + struct xmit_pool *pool = &data->xmit_pool; + DATA_FUNCTION("data_send_kick_message()\n"); + /* stop timer for bundle_timeout */ + if (data->kick_timer_on) { + del_timer(&data->kick_timer); + data->kick_timer_on = 0; + } + pool->kick_count = 0; + pool->kick_byte_count = 0; + + /* TODO: keep track of when kick is outstanding, and + * don't reuse 
until complete + */ + if (vnic_ib_post_send(&data->ib_conn, &data->free_bufs_io.io)) { + DATA_ERROR("failed to post send\n"); + viport_failure(data->parent); + } +} + +static void data_send_free_recv_buffers(struct data *data) +{ + struct recv_pool *pool = &data->recv_pool; + struct ib_send_wr *swr = &data->free_bufs_io.io.swr; + + int bufs_sent = 0; + u64 rdma_addr; + u32 offset; + u32 sz; + unsigned int num_to_send, next_increment; + + DATA_FUNCTION("data_send_free_recv_buffers()\n"); + + for (num_to_send = pool->sz_free_bundle; + num_to_send <= pool->num_free_bufs; + num_to_send += pool->sz_free_bundle) { + /* handle multiple bundles as one when possible. */ + next_increment = num_to_send + pool->sz_free_bundle; + if ((next_increment <= pool->num_free_bufs) + && (pool->next_free_buf + next_increment <= + pool->eioc_pool_sz)) + continue; + + offset = pool->next_free_buf * + sizeof(struct buff_pool_entry); + sz = num_to_send * sizeof(struct buff_pool_entry); + rdma_addr = pool->eioc_rdma_addr + offset; + swr->sg_list->length = sz; + swr->sg_list->addr = pool->buf_pool_dma + offset; + swr->wr.rdma.remote_addr = rdma_addr; + + if (vnic_ib_post_send(&data->ib_conn, + &data->free_bufs_io.io)) { + DATA_ERROR("failed to post send\n"); + viport_failure(data->parent); + return; + } + INC(pool->next_free_buf, num_to_send, pool->eioc_pool_sz); + pool->num_free_bufs -= num_to_send; + pool->num_posted_bufs += num_to_send; + bufs_sent = 1; + } + + if (bufs_sent) { + if (pool->kick_on_free) + data_send_kick_message(data); + } + if (pool->num_posted_bufs == 0) { + struct vnic *vnic = data->parent->vnic; + unsigned long flags; + + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (vnic->current_path == &vnic->primary_path) { + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + DATA_ERROR("%s: primary path: " + "unable to allocate receive buffers\n", + vnic->config->name); + } else { + if (vnic->current_path == &vnic->secondary_path) { + spin_unlock_irqrestore(&vnic->current_path_lock, + flags); + DATA_ERROR("%s: secondary path: " + "unable to allocate receive buffers\n", + vnic->config->name); + } else + spin_unlock_irqrestore(&vnic->current_path_lock, + flags); + } + data->ib_conn.state = IB_CONN_ERRORED; + viport_failure(data->parent); + } +} + +void data_connected(struct data *data) +{ + DATA_FUNCTION("data_connected()\n"); + data->free_bufs_io.io.swr.wr.rdma.rkey = + data->recv_pool.eioc_rdma_rkey; + data_alloc_buffers(data, 1); + data_send_free_recv_buffers(data); + data->connected = 1; +} + +void data_disconnect(struct data *data) +{ + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct recv_pool *recv_pool = &data->recv_pool; + unsigned int i; + + DATA_FUNCTION("data_disconnect()\n"); + + data->connected = 0; + if (data->kick_timer_on) { + del_timer_sync(&data->kick_timer); + data->kick_timer_on = 0; + } + + if (ib_send_cm_dreq(data->ib_conn.cm_id, NULL, 0)) + DATA_ERROR("data CM DREQ sending failed\n"); + data->ib_conn.state = IB_CONN_DISCONNECTED; + + completion_callback_cleanup(&data->ib_conn); + + for (i = 0; i < xmit_pool->num_xmit_bufs; i++) { + if (xmit_pool->xmit_bufs[i].skb) + dev_kfree_skb(xmit_pool->xmit_bufs[i].skb); + xmit_pool->xmit_bufs[i].skb = NULL; + + } + for (i = 0; i < recv_pool->pool_sz; i++) { + if (data->recv_pool.recv_bufs[i].skb) + dev_kfree_skb(recv_pool->recv_bufs[i].skb); + recv_pool->recv_bufs[i].skb = NULL; + } + vfree(data->local_storage); + if (data->region_data) { + ib_dma_unmap_single(data->parent->config->ibdev, + data->region_data_dma, 4, + 
DMA_BIDIRECTIONAL); + kfree(data->region_data); + } + + if (recv_pool->buf_pool) { + ib_dma_unmap_single(data->parent->config->ibdev, + recv_pool->buf_pool_dma, + recv_pool->buf_pool_len, DMA_TO_DEVICE); + kfree(recv_pool->buf_pool); + } + + if (xmit_pool->buf_pool) { + ib_dma_unmap_single(data->parent->config->ibdev, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_FROM_DEVICE); + kfree(xmit_pool->buf_pool); + } + + if (xmit_pool->xmit_data) { + ib_dma_unmap_single(data->parent->config->ibdev, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); + kfree(xmit_pool->xmit_data); + } +} + +void data_cleanup(struct data *data) +{ + ib_destroy_cm_id(data->ib_conn.cm_id); + + /* Completion callback cleanup called again. + * This is to cleanup the threads in case there is an + * error before state LINK_DATACONNECT due to which + * data_disconnect is not called. + */ + completion_callback_cleanup(&data->ib_conn); + ib_destroy_qp(data->ib_conn.qp); + ib_destroy_cq(data->ib_conn.cq); + ib_dereg_mr(data->mr); + +} + +static int data_alloc_xmit_buffer(struct data *data, struct sk_buff *skb, + struct buff_pool_entry **pp_bpe, + struct rdma_io **pp_rdma_io, + int *last) +{ + struct xmit_pool *pool = &data->xmit_pool; + unsigned long flags; + int ret; + + DATA_FUNCTION("data_alloc_xmit_buffer()\n"); + + spin_lock_irqsave(&data->xmit_buf_lock, flags); + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + *last = 0; + *pp_rdma_io = &pool->xmit_bufs[pool->next_xmit_buf]; + *pp_bpe = &pool->buf_pool[pool->next_xmit_pool]; + + if ((*pp_bpe)->valid && pool->next_xmit_buf != + pool->last_comp_buf) { + INC(pool->next_xmit_buf, 1, pool->num_xmit_bufs); + INC(pool->next_xmit_pool, 1, pool->pool_sz); + if (!pool->buf_pool[pool->next_xmit_pool].valid) { + DATA_INFO("just used the last EIOU" + " receive buffer\n"); + *last = 1; + pool->need_buffers = 1; + vnic_stop_xmit(data->parent->vnic, + data->parent->parent); + data_kickreq_stats(data); + } else if (pool->next_xmit_buf == pool->last_comp_buf) { + DATA_INFO("just used our last xmit buffer\n"); + pool->need_buffers = 1; + vnic_stop_xmit(data->parent->vnic, + data->parent->parent); + } + (*pp_rdma_io)->skb = skb; + (*pp_bpe)->valid = 0; + ret = 0; + } else { + data_no_xmitbuf_stats(data); + DATA_ERROR("Out of xmit buffers\n"); + vnic_stop_xmit(data->parent->vnic, + data->parent->parent); + ret = -1; + } + + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->buf_pool_dma, + pool->buf_pool_len, DMA_TO_DEVICE); + spin_unlock_irqrestore(&data->xmit_buf_lock, flags); + return ret; +} + +static void data_rdma_packet(struct data *data, struct buff_pool_entry *bpe, + struct rdma_io *rdma_io) +{ + struct ib_send_wr *swr; + struct sk_buff *skb; + dma_addr_t trailer_data_dma; + dma_addr_t skb_data_dma; + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct viport *viport = data->parent; + u8 *d; + int len; + int fill_len; + + DATA_FUNCTION("data_rdma_packet()\n"); + swr = &rdma_io->io.swr; + skb = rdma_io->skb; + len = ALIGN(rdma_io->len, VIPORT_TRAILER_ALIGNMENT); + fill_len = len - skb->len; + + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); + + d = (u8 *) rdma_io->trailer - fill_len; + trailer_data_dma = rdma_io->trailer_dma - fill_len; + memset(d, 0, fill_len); + + swr->sg_list[0].length = skb->len; + if (skb->len <= min_xmt_skb) { + memcpy(rdma_io->data, skb->data, skb->len); + 
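+		/* Small frame: the payload was just copied into the
+		 * pre-mapped per-buffer copy area, so the gather list can
+		 * point there and the skb can be freed immediately.
+		 */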
swr->sg_list[0].lkey = data->mr->lkey; + swr->sg_list[0].addr = rdma_io->data_dma; + dev_kfree_skb_any(skb); + rdma_io->skb = NULL; + } else { + swr->sg_list[0].lkey = data->mr->lkey; + + skb_data_dma = ib_dma_map_single(viport->config->ibdev, + skb->data, skb->len, + DMA_TO_DEVICE); + + if (ib_dma_mapping_error(viport->config->ibdev, skb_data_dma)) { + DATA_ERROR("skb data dma map error\n"); + goto failure; + } + + rdma_io->skb_data_dma = skb_data_dma; + + swr->sg_list[0].addr = skb_data_dma; + skb_orphan(skb); + } + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_TO_DEVICE); + + swr->sg_list[1].addr = trailer_data_dma; + swr->sg_list[1].length = fill_len + sizeof(struct viport_trailer); + swr->sg_list[0].lkey = data->mr->lkey; + swr->wr.rdma.remote_addr = be64_to_cpu(bpe->remote_addr); + swr->wr.rdma.remote_addr += data->xmit_pool.buffer_sz; + swr->wr.rdma.remote_addr -= (sizeof(struct viport_trailer) + len); + swr->wr.rdma.rkey = be32_to_cpu(bpe->rkey); + + ib_dma_sync_single_for_device(data->parent->config->ibdev, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_TO_DEVICE); + + /* If VNIC_FEAT_RDMA_IMMED is supported then change the work request + * opcode to IB_WR_RDMA_WRITE_WITH_IMM + */ + + if (data->parent->features_supported & VNIC_FEAT_RDMA_IMMED) { + swr->ex.imm_data = 0; + swr->opcode = IB_WR_RDMA_WRITE_WITH_IMM; + } + + data->xmit_pool.notify_count++; + if (data->xmit_pool.notify_count >= data->xmit_pool.notify_bundle) { + data->xmit_pool.notify_count = 0; + swr->send_flags = IB_SEND_SIGNALED; + } else { + swr->send_flags = 0; + } + ib_dma_sync_single_for_device(data->parent->config->ibdev, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); + if (vnic_ib_post_send(&data->ib_conn, &rdma_io->io)) { + DATA_ERROR("failed to post send for data RDMA write\n"); + viport_failure(data->parent); + goto failure; + } + + data_xmits_stats(data); +failure: + ib_dma_sync_single_for_device(data->parent->config->ibdev, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); +} + +static void data_kick_timeout_handler(unsigned long arg) +{ + struct data *data = (struct data *)arg; + + DATA_FUNCTION("data_kick_timeout_handler()\n"); + data->kick_timer_on = 0; + data_send_kick_message(data); +} + +int data_xmit_packet(struct data *data, struct sk_buff *skb) +{ + struct xmit_pool *pool = &data->xmit_pool; + struct rdma_io *rdma_io; + struct buff_pool_entry *bpe; + struct viport_trailer *trailer; + unsigned int sz = skb->len; + int last; + + DATA_FUNCTION("data_xmit_packet()\n"); + if (sz > pool->buffer_sz) { + DATA_ERROR("outbound packet too large, size = %d\n", sz); + return -1; + } + + if (data_alloc_xmit_buffer(data, skb, &bpe, &rdma_io, &last)) { + DATA_ERROR("error in allocating data xmit buffer\n"); + return -1; + } + + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->xmitdata_dma, pool->xmitdata_len, + DMA_TO_DEVICE); + trailer = rdma_io->trailer; + + memset(trailer, 0, sizeof *trailer); + memcpy(trailer->dest_mac_addr, skb->data, ETH_ALEN); + + if (skb->sk) + trailer->connection_hash_and_valid = 0x40 | + ((be16_to_cpu(inet_sk(skb->sk)->sport) + + be16_to_cpu(inet_sk(skb->sk)->dport)) & 0x3f); + + trailer->connection_hash_and_valid |= CHV_VALID; + + if ((sz > 16) && (*(__be16 *) (skb->data + 12) == + __constant_cpu_to_be16(ETH_P_8021Q))) { + trailer->vlan = *(__be16 *) (skb->data + 14); + memmove(skb->data + 4, skb->data, 12); + skb_pull(skb, 4); + sz -= 4; + 
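+		/* The 802.1Q tag has been stripped from the frame data and
+		 * saved in the trailer; PF_VLAN_INSERT lets the remote end
+		 * re-insert it on the wire.
+		 */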
trailer->pkt_flags |= PF_VLAN_INSERT; + } + if (last) + trailer->pkt_flags |= PF_KICK; + if (sz < ETH_ZLEN) { + /* EIOU requires all packets to be + * of ethernet minimum packet size. + */ + trailer->data_length = __constant_cpu_to_be16(ETH_ZLEN); + rdma_io->len = ETH_ZLEN; + } else { + trailer->data_length = cpu_to_be16(sz); + rdma_io->len = sz; + } + + if (skb->ip_summed == CHECKSUM_PARTIAL) { + trailer->tx_chksum_flags = TX_CHKSUM_FLAGS_CHECKSUM_V4 + | TX_CHKSUM_FLAGS_IP_CHECKSUM + | TX_CHKSUM_FLAGS_TCP_CHECKSUM + | TX_CHKSUM_FLAGS_UDP_CHECKSUM; + } + + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->xmitdata_dma, pool->xmitdata_len, + DMA_TO_DEVICE); + data_rdma_packet(data, bpe, rdma_io); + + if (pool->send_kicks) { + /* EIOC needs kicks to inform it of sent packets */ + pool->kick_count++; + pool->kick_byte_count += sz; + if ((pool->kick_count >= pool->kick_bundle) + || (pool->kick_byte_count >= pool->kick_byte_bundle)) { + data_send_kick_message(data); + } else if (pool->kick_count == 1) { + init_timer(&data->kick_timer); + /* timeout_before_kick is in usec */ + data->kick_timer.expires = + msecs_to_jiffies(be32_to_cpu(data-> + eioc_pool_parms.timeout_before_kick) * 1000) + + jiffies; + data->kick_timer.data = (unsigned long)data; + data->kick_timer.function = data_kick_timeout_handler; + add_timer(&data->kick_timer); + data->kick_timer_on = 1; + } + } + return 0; +} + +static void data_check_xmit_buffers(struct data *data) +{ + struct xmit_pool *pool = &data->xmit_pool; + unsigned long flags; + + DATA_FUNCTION("data_check_xmit_buffers()\n"); + spin_lock_irqsave(&data->xmit_buf_lock, flags); + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + if (data->xmit_pool.need_buffers + && pool->buf_pool[pool->next_xmit_pool].valid + && pool->next_xmit_buf != pool->last_comp_buf) { + data->xmit_pool.need_buffers = 0; + vnic_restart_xmit(data->parent->vnic, + data->parent->parent); + DATA_INFO("there are free xmit buffers\n"); + } + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + spin_unlock_irqrestore(&data->xmit_buf_lock, flags); +} + +static struct sk_buff *data_recv_to_skbuff(struct data *data, + struct rdma_dest *rdma_dest) +{ + struct viport_trailer *trailer; + struct sk_buff *skb = NULL; + int start; + unsigned int len; + u8 rx_chksum_flags; + + DATA_FUNCTION("data_recv_to_skbuff()\n"); + trailer = rdma_dest->trailer; + start = data_offset(data, trailer); + len = data_len(data, trailer); + + if (len <= min_rcv_skb) + skb = dev_alloc_skb(len + VLAN_HLEN + 2); + /* leave room for VLAN header and alignment */ + if (skb) { + skb_reserve(skb, VLAN_HLEN + 2); + memcpy(skb->data, rdma_dest->data + start, len); + skb_put(skb, len); + } else { + skb = rdma_dest->skb; + rdma_dest->skb = NULL; + rdma_dest->trailer = NULL; + rdma_dest->data = NULL; + skb_pull(skb, start); + skb_trim(skb, len); + } + + rx_chksum_flags = trailer->rx_chksum_flags; + DATA_INFO("rx_chksum_flags = %d, LOOP = %c, IP = %c," + " TCP = %c, UDP = %c\n", + rx_chksum_flags, + (rx_chksum_flags & RX_CHKSUM_FLAGS_LOOPBACK) ? 'Y' : 'N', + (rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED) ? 'Y' + : (rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_FAILED) ? 'N' : + '-', + (rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED) ? 'Y' + : (rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_FAILED) ? 
'N' : + '-', + (rx_chksum_flags & RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED) ? 'Y' + : (rx_chksum_flags & RX_CHKSUM_FLAGS_UDP_CHECKSUM_FAILED) ? 'N' : + '-'); + + if ((rx_chksum_flags & RX_CHKSUM_FLAGS_LOOPBACK) + || ((rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED) + && ((rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED) + || (rx_chksum_flags & + RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED)))) + skb->ip_summed = CHECKSUM_UNNECESSARY; + else + skb->ip_summed = CHECKSUM_NONE; + + if ((trailer->pkt_flags & PF_VLAN_INSERT) && + !(data->parent->features_supported & VNIC_FEAT_IGNORE_VLAN)) { + u8 *rv; + + rv = skb_push(skb, 4); + memmove(rv, rv + 4, 12); + *(__be16 *) (rv + 12) = __constant_cpu_to_be16(ETH_P_8021Q); + if (trailer->pkt_flags & PF_PVID_OVERRIDDEN) + *(__be16 *) (rv + 14) = trailer->vlan & + __constant_cpu_to_be16(0xF000); + else + *(__be16 *) (rv + 14) = trailer->vlan; + } + + return skb; +} + +static int data_incoming_recv(struct data *data) +{ + struct recv_pool *pool = &data->recv_pool; + struct rdma_dest *rdma_dest; + struct viport_trailer *trailer; + struct buff_pool_entry *bpe; + struct sk_buff *skb; + dma_addr_t vaddr_dma; + + DATA_FUNCTION("data_incoming_recv()\n"); + if (pool->next_full_buf == pool->next_free_buf) + return -1; + bpe = &pool->buf_pool[pool->next_full_buf]; + vaddr_dma = be64_to_cpu(bpe->remote_addr); + rdma_dest = &pool->recv_bufs[bpe->valid - 1]; + trailer = rdma_dest->trailer; + + if (!trailer + || !(trailer->connection_hash_and_valid & CHV_VALID)) + return -1; + + /* received a packet */ + if (trailer->pkt_flags & PF_KICK) + pool->kick_on_free = 1; + + skb = data_recv_to_skbuff(data, rdma_dest); + + if (skb) { + vnic_recv_packet(data->parent->vnic, + data->parent->parent, skb); + list_add(&rdma_dest->list_ptrs, &pool->avail_recv_bufs); + } + + ib_dma_unmap_single(data->parent->config->ibdev, + vaddr_dma, pool->buffer_sz, + DMA_FROM_DEVICE); + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + bpe->valid = 0; + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + INC(pool->next_full_buf, 1, pool->eioc_pool_sz); + pool->num_posted_bufs--; + data_recvs_stats(data); + return 0; +} + +static void data_received_kick(struct io *io) +{ + struct data *data = &io->viport->data; + unsigned long flags; + + DATA_FUNCTION("data_received_kick()\n"); + data_note_kickrcv_time(); + spin_lock_irqsave(&data->recv_ios_lock, flags); + list_add(&io->list_ptrs, &data->recv_ios); + spin_unlock_irqrestore(&data->recv_ios_lock, flags); + data_post_recvs(data); + data_rcvkicks_stats(data); + data_check_xmit_buffers(data); + + while (!data_incoming_recv(data)); + + if (data->connected) { + data_alloc_buffers(data, 0); + data_send_free_recv_buffers(data); + } +} + +static void data_xmit_complete(struct io *io) +{ + struct rdma_io *rdma_io = (struct rdma_io *)io; + struct data *data = &io->viport->data; + struct xmit_pool *pool = &data->xmit_pool; + struct sk_buff *skb; + + DATA_FUNCTION("data_xmit_complete()\n"); + + if (rdma_io->skb) + ib_dma_unmap_single(data->parent->config->ibdev, + rdma_io->skb_data_dma, rdma_io->skb->len, + DMA_TO_DEVICE); + + while (pool->last_comp_buf != rdma_io->index) { + INC(pool->last_comp_buf, 1, pool->num_xmit_bufs); + skb = pool->xmit_bufs[pool->last_comp_buf].skb; + if (skb) + dev_kfree_skb_any(skb); + pool->xmit_bufs[pool->last_comp_buf].skb = NULL; + } + + data_check_xmit_buffers(data); +} + +static 
int mc_data_alloc_skb(struct ud_recv_io *recv_io, u32 len, + int initial_allocation) +{ + struct sk_buff *skb; + struct mc_data *mc_data = &recv_io->io.viport->mc_data; + + DATA_FUNCTION("mc_data_alloc_skb\n"); + if (initial_allocation) + skb = alloc_skb(len, GFP_KERNEL); + else + skb = alloc_skb(len, GFP_ATOMIC); + if (!skb) { + DATA_ERROR("failed to alloc MULTICAST skb\n"); + return -1; + } + skb_put(skb, len); + recv_io->skb = skb; + + recv_io->skb_data_dma = ib_dma_map_single( + recv_io->io.viport->config->ibdev, + skb->data, skb->len, + DMA_FROM_DEVICE); + + if (ib_dma_mapping_error(recv_io->io.viport->config->ibdev, + recv_io->skb_data_dma)) { + DATA_ERROR("skb data dma map error\n"); + dev_kfree_skb(skb); + return -1; + } + + recv_io->list[0].addr = recv_io->skb_data_dma; + recv_io->list[0].length = sizeof(struct ib_grh); + recv_io->list[0].lkey = mc_data->mr->lkey; + + recv_io->list[1].addr = recv_io->skb_data_dma + sizeof(struct ib_grh); + recv_io->list[1].length = len - sizeof(struct ib_grh); + recv_io->list[1].lkey = mc_data->mr->lkey; + + recv_io->io.rwr.wr_id = (u64)&recv_io->io; + recv_io->io.rwr.sg_list = recv_io->list; + recv_io->io.rwr.num_sge = 2; + recv_io->io.rwr.next = NULL; + + return 0; +} + +static int mc_data_alloc_buffers(struct mc_data *mc_data) +{ + unsigned int i, num; + struct ud_recv_io *bufs = NULL, *recv_io; + + DATA_FUNCTION("mc_data_alloc_buffers\n"); + if (!mc_data->skb_len) { + unsigned int len; + /* align multicast msg buffer on viport_trailer boundary */ + len = (MCAST_MSG_SIZE + VIPORT_TRAILER_ALIGNMENT - 1) & + (~((unsigned int)VIPORT_TRAILER_ALIGNMENT - 1)); + /* + * Add size of grh and trailer - + * note, we don't need a + 4 for vlan because we have room in + * netbuf for grh & trailer and we'll strip them both, so there + * will be room enough to handle the 4 byte insertion for vlan. 
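+ * (Worked example, assuming the usual 40-byte GRH alongside the
+ * 32-byte viport_trailer: MCAST_MSG_SIZE is 2048 - 40 - 32 = 1976,
+ * which rounds up to 1984, so the final skb_len comes out at 2056.)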
+ */ + len += sizeof(struct ib_grh) + + sizeof(struct viport_trailer); + mc_data->skb_len = len; + DATA_INFO("mc_data->skb_len %d (sizes:%d %d)\n", + len, (int)sizeof(struct ib_grh), + (int)sizeof(struct viport_trailer)); + } + mc_data->recv_len = sizeof(struct ud_recv_io) * mc_data->num_recvs; + bufs = kmalloc(mc_data->recv_len, GFP_KERNEL); + if (!bufs) { + DATA_ERROR("failed to allocate MULTICAST buffers size:%d\n", + mc_data->recv_len); + return -1; + } + DATA_INFO("allocated num_recvs:%d recv_len:%d \n", + mc_data->num_recvs, mc_data->recv_len); + for (num = 0; num < mc_data->num_recvs; num++) { + recv_io = &bufs[num]; + recv_io->len = mc_data->skb_len; + recv_io->io.type = RECV_UD; + recv_io->io.viport = mc_data->parent; + recv_io->io.routine = mc_data_recv_routine; + + if (mc_data_alloc_skb(recv_io, mc_data->skb_len, 1)) { + for (i = 0; i < num; i++) { + recv_io = &bufs[i]; + ib_dma_unmap_single(recv_io->io.viport->config->ibdev, + recv_io->skb_data_dma, + recv_io->skb->len, + DMA_FROM_DEVICE); + dev_kfree_skb(recv_io->skb); + } + kfree(bufs); + return -1; + } + list_add_tail(&recv_io->io.list_ptrs, + &mc_data->avail_recv_ios_list); + } + mc_data->recv_ios = bufs; + return 0; +} + +void vnic_mc_data_cleanup(struct mc_data *mc_data) +{ + unsigned int num; + + DATA_FUNCTION("vnic_mc_data_cleanup()\n"); + completion_callback_cleanup(&mc_data->ib_conn); + if (!IS_ERR(mc_data->ib_conn.qp)) { + ib_destroy_qp(mc_data->ib_conn.qp); + mc_data->ib_conn.qp = (struct ib_qp *)ERR_PTR(-EINVAL); + } + if (!IS_ERR(mc_data->ib_conn.cq)) { + ib_destroy_cq(mc_data->ib_conn.cq); + mc_data->ib_conn.cq = (struct ib_cq *)ERR_PTR(-EINVAL); + } + if (mc_data->recv_ios) { + for (num = 0; num < mc_data->num_recvs; num++) { + if (mc_data->recv_ios[num].skb) + dev_kfree_skb(mc_data->recv_ios[num].skb); + mc_data->recv_ios[num].skb = NULL; + } + kfree(mc_data->recv_ios); + mc_data->recv_ios = (struct ud_recv_io *)NULL; + } + if (mc_data->mr) { + ib_dereg_mr(mc_data->mr); + mc_data->mr = (struct ib_mr *)NULL; + } + DATA_FUNCTION("vnic_mc_data_cleanup done\n"); + +} + +int mc_data_init(struct mc_data *mc_data, struct viport *viport, + struct data_config *config, struct ib_pd *pd) +{ + DATA_FUNCTION("mc_data_init()\n"); + + mc_data->num_recvs = viport->data.config->num_recvs; + + INIT_LIST_HEAD(&mc_data->avail_recv_ios_list); + spin_lock_init(&mc_data->recv_lock); + + mc_data->parent = viport; + mc_data->config = config; + + mc_data->ib_conn.cm_id = NULL; + mc_data->ib_conn.viport = viport; + mc_data->ib_conn.ib_config = &config->ib_config; + mc_data->ib_conn.state = IB_CONN_UNINITTED; + mc_data->ib_conn.callback_thread = NULL; + mc_data->ib_conn.callback_thread_end = 0; + + if (vnic_ib_mc_init(mc_data, viport, pd, + &config->ib_config)) { + DATA_ERROR("vnic_ib_mc_init failed\n"); + goto failure; + } + mc_data->mr = ib_get_dma_mr(pd, + IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE); + if (IS_ERR(mc_data->mr)) { + DATA_ERROR("failed to register memory for" + " mc_data connection\n"); + goto destroy_conn; + } + + if (mc_data_alloc_buffers(mc_data)) + goto dereg_mr; + + mc_data_post_recvs(mc_data); + if (vnic_ib_mc_mod_qp_to_rts(mc_data->ib_conn.qp)) + goto dereg_mr; + + return 0; + +dereg_mr: + ib_dereg_mr(mc_data->mr); + mc_data->mr = (struct ib_mr *)NULL; +destroy_conn: + completion_callback_cleanup(&mc_data->ib_conn); + ib_destroy_qp(mc_data->ib_conn.qp); + mc_data->ib_conn.qp = (struct ib_qp *)ERR_PTR(-EINVAL); + ib_destroy_cq(mc_data->ib_conn.cq); + mc_data->ib_conn.cq = (struct ib_cq *)ERR_PTR(-EINVAL); 
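+ /* The qp/cq handles are parked on ERR_PTR(-EINVAL) here, as in
+ * vnic_mc_data_cleanup(), so that later cleanup can use IS_ERR()
+ * to tell already-destroyed handles from live ones.
+ */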
+failure: + return -1; +} + +static void mc_data_post_recvs(struct mc_data *mc_data) +{ + unsigned long flags; + int i = 0; + DATA_FUNCTION("mc_data_post_recvs\n"); + spin_lock_irqsave(&mc_data->recv_lock, flags); + while (!list_empty(&mc_data->avail_recv_ios_list)) { + struct io *io = list_entry(mc_data->avail_recv_ios_list.next, + struct io, list_ptrs); + struct ud_recv_io *recv_io = + container_of(io, struct ud_recv_io, io); + list_del(&recv_io->io.list_ptrs); + spin_unlock_irqrestore(&mc_data->recv_lock, flags); + if (vnic_ib_mc_post_recv(mc_data, &recv_io->io)) { + viport_failure(mc_data->parent); + return; + } + spin_lock_irqsave(&mc_data->recv_lock, flags); + i++; + } + DATA_INFO("mcdata posted %d %p\n", i, &mc_data->avail_recv_ios_list); + spin_unlock_irqrestore(&mc_data->recv_lock, flags); +} + +static void mc_data_recv_routine(struct io *io) +{ + struct sk_buff *skb; + struct ib_grh *grh; + struct viport_trailer *trailer; + struct mc_data *mc_data; + unsigned long flags; + struct ud_recv_io *recv_io = container_of(io, struct ud_recv_io, io); + union ib_gid_cpu sgid; + + DATA_FUNCTION("mc_data_recv_routine\n"); + skb = recv_io->skb; + grh = (struct ib_grh *)skb->data; + mc_data = &recv_io->io.viport->mc_data; + + ib_dma_unmap_single(recv_io->io.viport->config->ibdev, + recv_io->skb_data_dma, recv_io->skb->len, + DMA_FROM_DEVICE); + + /* first - check if we've got our own mc packet */ + /* convert sgid from host to cpu form before comparing */ + bswap_ib_gid(&grh->sgid, &sgid); + if (cpu_to_be64(sgid.global.interface_id) == + io->viport->config->path_info.path.sgid.global.interface_id) { + DATA_ERROR("dropping - our mc packet\n"); + dev_kfree_skb(skb); + } else { + /* GRH is at head and trailer at end. Remove GRH from head. */ + trailer = (struct viport_trailer *) + (skb->data + recv_io->len - + sizeof(struct viport_trailer)); + skb_pull(skb, sizeof(struct ib_grh)); + if (trailer->connection_hash_and_valid & CHV_VALID) { + mc_data_recv_to_skbuff(io->viport, skb, trailer); + vnic_recv_packet(io->viport->vnic, io->viport->parent, + skb); + vnic_multicast_recv_pkt_stats(io->viport->vnic); + } else { + DATA_ERROR("dropping - no CHV_VALID in HashAndValid\n"); + dev_kfree_skb(skb); + } + } + recv_io->skb = NULL; + if (mc_data_alloc_skb(recv_io, mc_data->skb_len, 0)) + return; + + spin_lock_irqsave(&mc_data->recv_lock, flags); + list_add_tail(&recv_io->io.list_ptrs, &mc_data->avail_recv_ios_list); + spin_unlock_irqrestore(&mc_data->recv_lock, flags); + mc_data_post_recvs(mc_data); + return; +} + +static void mc_data_recv_to_skbuff(struct viport *viport, struct sk_buff *skb, + struct viport_trailer *trailer) +{ + u8 rx_chksum_flags = trailer->rx_chksum_flags; + + /* drop alignment bytes at start */ + skb_pull(skb, trailer->data_alignment_offset); + /* drop excess from end */ + skb_trim(skb, __be16_to_cpu(trailer->data_length)); + + if ((rx_chksum_flags & RX_CHKSUM_FLAGS_LOOPBACK) + || ((rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED) + && ((rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED) + || (rx_chksum_flags & + RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED)))) + skb->ip_summed = CHECKSUM_UNNECESSARY; + else + skb->ip_summed = CHECKSUM_NONE; + + if ((trailer->pkt_flags & PF_VLAN_INSERT) && + !(viport->features_supported & VNIC_FEAT_IGNORE_VLAN)) { + u8 *rv; + + /* insert VLAN id between source & length */ + DATA_INFO("VLAN adjustment\n"); + rv = skb_push(skb, 4); + memmove(rv, rv + 4, 12); + *(__be16 *) (rv + 12) = __constant_cpu_to_be16(ETH_P_8021Q); + if (trailer->pkt_flags 
& PF_PVID_OVERRIDDEN) + /* + * Indicates VLAN is 0 but we keep the protocol id. + */ + *(__be16 *) (rv + 14) = trailer->vlan & + __constant_cpu_to_be16(0xF000); + else + *(__be16 *) (rv + 14) = trailer->vlan; + DATA_INFO("vlan:%x\n", *(int *)(rv+14)); + } + + return; +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_data.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_data.h new file mode 100644 index 0000000..866b9ee --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_data.h @@ -0,0 +1,206 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_DATA_H_INCLUDED +#define VNIC_DATA_H_INCLUDED + +#include + +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS +#include +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ + +#include "vnic_ib.h" +#include "vnic_control_pkt.h" +#include "vnic_trailer.h" + +struct rdma_dest { + struct list_head list_ptrs; + struct sk_buff *skb; + u8 *data; + struct viport_trailer *trailer __attribute__((aligned(32))); +}; + +struct buff_pool_entry { + __be64 remote_addr; + __be32 rkey; + u32 valid; +}; + +struct recv_pool { + u32 buffer_sz; + u32 pool_sz; + u32 eioc_pool_sz; + u32 eioc_rdma_rkey; + u64 eioc_rdma_addr; + u32 next_full_buf; + u32 next_free_buf; + u32 num_free_bufs; + u32 num_posted_bufs; + u32 sz_free_bundle; + int kick_on_free; + struct buff_pool_entry *buf_pool; + dma_addr_t buf_pool_dma; + int buf_pool_len; + struct rdma_dest *recv_bufs; + struct list_head avail_recv_bufs; +}; + +struct xmit_pool { + u32 buffer_sz; + u32 pool_sz; + u32 notify_count; + u32 notify_bundle; + u32 next_xmit_buf; + u32 last_comp_buf; + u32 num_xmit_bufs; + u32 next_xmit_pool; + u32 kick_count; + u32 kick_byte_count; + u32 kick_bundle; + u32 kick_byte_bundle; + int need_buffers; + int send_kicks; + uint32_t rdma_rkey; + u64 rdma_addr; + struct buff_pool_entry *buf_pool; + dma_addr_t buf_pool_dma; + int buf_pool_len; + struct rdma_io *xmit_bufs; + u8 *xmit_data; + dma_addr_t xmitdata_dma; + int xmitdata_len; +}; + +struct data { + struct viport *parent; + struct data_config *config; + struct ib_mr *mr; + struct vnic_ib_conn ib_conn; + u8 *local_storage; + struct vnic_recv_pool_config host_pool_parms; + struct vnic_recv_pool_config eioc_pool_parms; + struct recv_pool recv_pool; + struct xmit_pool xmit_pool; + u8 *region_data; + dma_addr_t region_data_dma; + struct rdma_io free_bufs_io; + struct send_io kick_io; + struct list_head recv_ios; + spinlock_t recv_ios_lock; + spinlock_t xmit_buf_lock; + int kick_timer_on; + int connected; + u16 max_mtu; + struct timer_list kick_timer; + struct completion done; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + struct { + u32 xmit_num; + u32 recv_num; + u32 free_buf_sends; + u32 free_buf_num; + u32 free_buf_min; + u32 kick_recvs; + u32 kick_reqs; + u32 no_xmit_bufs; + cycles_t no_xmit_buf_time; + } statistics; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ +}; + +struct mc_data { + struct viport *parent; + struct data_config *config; + struct ib_mr *mr; + struct vnic_ib_conn ib_conn; + + u32 num_recvs; + u32 skb_len; + spinlock_t recv_lock; + int recv_len; + struct ud_recv_io *recv_ios; + struct list_head avail_recv_ios_list; +}; + +int data_init(struct data *data, struct viport *viport, + struct data_config *config, struct ib_pd *pd); + +int data_connect(struct data *data); +void data_connected(struct data *data); +void data_disconnect(struct data *data); + +int data_xmit_packet(struct data *data, struct sk_buff *skb); + +void data_cleanup(struct data *data); + +#define data_is_connected(data) \ + (vnic_ib_conn_connected(&((data)->ib_conn))) +#define data_path_id(data) (data)->config->path_id +#define data_eioc_pool(data) &(data)->eioc_pool_parms +#define data_host_pool(data) &(data)->host_pool_parms +#define data_eioc_pool_min(data) &(data)->config->eioc_min +#define data_host_pool_min(data) &(data)->config->host_min +#define data_eioc_pool_max(data) &(data)->config->eioc_max +#define data_host_pool_max(data) &(data)->config->host_max +#define data_local_pool_addr(data) (data)->xmit_pool.rdma_addr +#define data_local_pool_rkey(data) (data)->xmit_pool.rdma_rkey 
+#define data_remote_pool_addr(data) &(data)->recv_pool.eioc_rdma_addr +#define data_remote_pool_rkey(data) &(data)->recv_pool.eioc_rdma_rkey + +#define data_max_mtu(data) (data)->max_mtu + + +#define data_len(data, trailer) be16_to_cpu(trailer->data_length) +#define data_offset(data, trailer) \ + ((data)->recv_pool.buffer_sz - sizeof(struct viport_trailer) \ + - ALIGN(data_len((data), (trailer)), VIPORT_TRAILER_ALIGNMENT) \ + + (trailer->data_alignment_offset)) + +/* the following macros manipulate ring buffer indexes. + * the ring buffer size must be a power of 2. + */ +#define ADD(index, increment, size) (((index) + (increment))&((size) - 1)) +#define NEXT(index, size) ADD(index, 1, size) +#define INC(index, increment, size) (index) = ADD(index, increment, size) + +/* this is max multicast msg embedded will send */ +#define MCAST_MSG_SIZE \ + (2048 - sizeof(struct ib_grh) - sizeof(struct viport_trailer)) + +int mc_data_init(struct mc_data *mc_data, struct viport *viport, + struct data_config *config, + struct ib_pd *pd); + +void vnic_mc_data_cleanup(struct mc_data *mc_data); + +#endif /* VNIC_DATA_H_INCLUDED */ diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h new file mode 100644 index 0000000..dd8a073 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h @@ -0,0 +1,103 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_TRAILER_H_INCLUDED +#define VNIC_TRAILER_H_INCLUDED + +/* pkt_flags values */ +enum { + PF_CHASH_VALID = 0x01, + PF_IPSEC_VALID = 0x02, + PF_TCP_SEGMENT = 0x04, + PF_KICK = 0x08, + PF_VLAN_INSERT = 0x10, + PF_PVID_OVERRIDDEN = 0x20, + PF_FCS_INCLUDED = 0x40, + PF_FORCE_ROUTE = 0x80 +}; + +/* tx_chksum_flags values */ +enum { + TX_CHKSUM_FLAGS_CHECKSUM_V4 = 0x01, + TX_CHKSUM_FLAGS_CHECKSUM_V6 = 0x02, + TX_CHKSUM_FLAGS_TCP_CHECKSUM = 0x04, + TX_CHKSUM_FLAGS_UDP_CHECKSUM = 0x08, + TX_CHKSUM_FLAGS_IP_CHECKSUM = 0x10 +}; + +/* rx_chksum_flags values */ +enum { + RX_CHKSUM_FLAGS_TCP_CHECKSUM_FAILED = 0x01, + RX_CHKSUM_FLAGS_UDP_CHECKSUM_FAILED = 0x02, + RX_CHKSUM_FLAGS_IP_CHECKSUM_FAILED = 0x04, + RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED = 0x08, + RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED = 0x10, + RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED = 0x20, + RX_CHKSUM_FLAGS_LOOPBACK = 0x40, + RX_CHKSUM_FLAGS_RESERVED = 0x80 +}; + +/* connection_hash_and_valid values */ +enum { + CHV_VALID = 0x80, + CHV_HASH_MASH = 0x7f +}; + +struct viport_trailer { + s8 data_alignment_offset; + u8 rndis_header_length; /* reserved for use by edp */ + __be16 data_length; + u8 pkt_flags; + u8 tx_chksum_flags; + u8 rx_chksum_flags; + u8 ip_sec_flags; + u32 tcp_seq_no; + u32 ip_sec_offload_handle; + u32 ip_sec_next_offload_handle; + u8 dest_mac_addr[6]; + __be16 vlan; + u16 time_stamp; + u8 origin; + u8 connection_hash_and_valid; +}; + +#define VIPORT_TRAILER_ALIGNMENT 32 + +#define BUFFER_SIZE(len) \ + (sizeof(struct viport_trailer) + \ + ALIGN((len), VIPORT_TRAILER_ALIGNMENT)) + +#define MAX_PAYLOAD(len) \ + ALIGN_DOWN((len) - sizeof(struct viport_trailer), \ + VIPORT_TRAILER_ALIGNMENT) + +#endif /* VNIC_TRAILER_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:34:29 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:04:29 +0530 Subject: [ofa-general] [PATCH v2 06/13] QLogic VNIC: IB core stack interaction In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103428.12355.53123.stgit@localhost.localdomain> From: Ramachandra K The patch implements the interaction of the QLogic VNIC driver with the underlying core infiniband stack. Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c | 1043 ++++++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h | 206 ++++++ 2 files changed, 1249 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c new file mode 100644 index 0000000..c43e69e --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c @@ -0,0 +1,1043 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_data.h" +#include "vnic_config.h" +#include "vnic_ib.h" +#include "vnic_viport.h" +#include "vnic_sys.h" +#include "vnic_main.h" +#include "vnic_stats.h" + +static int vnic_ib_inited; +static void vnic_add_one(struct ib_device *device); +static void vnic_remove_one(struct ib_device *device); +static int vnic_defer_completion(void *ptr); + +static int vnic_ib_mc_init_qp(struct mc_data *mc_data, + struct vnic_ib_config *config, + struct ib_pd *pd, + struct viport_config *viport_config); + +static struct ib_client vnic_client = { + .name = "vnic", + .add = vnic_add_one, + .remove = vnic_remove_one +}; + +struct ib_sa_client vnic_sa_client; + +int vnic_ib_init(void) +{ + int ret = -1; + + IB_FUNCTION("vnic_ib_init()\n"); + + /* class has to be registered before + * calling ib_register_client() because, that call + * will trigger vnic_add_port() which will register + * class_device for the port with the parent class + * as vnic_class + */ + ret = class_register(&vnic_class); + if (ret) { + printk(KERN_ERR PFX "couldn't register class" + " infiniband_qlgc_vnic; error %d", ret); + goto out; + } + + ib_sa_register_client(&vnic_sa_client); + ret = ib_register_client(&vnic_client); + if (ret) { + printk(KERN_ERR PFX "couldn't register IB client;" + " error %d", ret); + goto err_ib_reg; + } + + interface_dev.dev.class = &vnic_class; + interface_dev.dev.release = vnic_release_dev; + snprintf(interface_dev.dev.bus_id, + BUS_ID_SIZE, "interfaces"); + init_completion(&interface_dev.released); + ret = device_register(&interface_dev.dev); + if (ret) { + printk(KERN_ERR PFX "couldn't register class interfaces;" + " error %d", ret); + goto err_class_dev; + } + ret = device_create_file(&interface_dev.dev, + &dev_attr_delete_vnic); + if (ret) { + printk(KERN_ERR PFX "couldn't create class file" + " 'delete_vnic'; error %d", ret); + goto err_class_file; + } + + vnic_ib_inited = 1; + + return ret; +err_class_file: + device_unregister(&interface_dev.dev); +err_class_dev: + ib_unregister_client(&vnic_client); +err_ib_reg: + ib_sa_unregister_client(&vnic_sa_client); + class_unregister(&vnic_class); +out: + return ret; +} + +static struct vnic_ib_port *vnic_add_port(struct vnic_ib_device *device, + u8 
port_num) +{ + struct vnic_ib_port *port; + + port = kzalloc(sizeof *port, GFP_KERNEL); + if (!port) + return NULL; + + init_completion(&port->pdev_info.released); + port->dev = device; + port->port_num = port_num; + + port->pdev_info.dev.class = &vnic_class; + port->pdev_info.dev.parent = NULL; + port->pdev_info.dev.release = vnic_release_dev; + snprintf(port->pdev_info.dev.bus_id, BUS_ID_SIZE, + "vnic-%s-%d", device->dev->name, port_num); + + if (device_register(&port->pdev_info.dev)) + goto free_port; + + if (device_create_file(&port->pdev_info.dev, + &dev_attr_create_primary)) + goto err_class; + if (device_create_file(&port->pdev_info.dev, + &dev_attr_create_secondary)) + goto err_class; + + return port; +err_class: + device_unregister(&port->pdev_info.dev); +free_port: + kfree(port); + + return NULL; +} + +static void vnic_add_one(struct ib_device *device) +{ + struct vnic_ib_device *vnic_dev; + struct vnic_ib_port *port; + int s, e, p; + + vnic_dev = kmalloc(sizeof *vnic_dev, GFP_KERNEL); + if (!vnic_dev) + return; + + vnic_dev->dev = device; + INIT_LIST_HEAD(&vnic_dev->port_list); + + if (device->node_type == RDMA_NODE_IB_SWITCH) { + s = 0; + e = 0; + + } else { + s = 1; + e = device->phys_port_cnt; + + } + + for (p = s; p <= e; p++) { + port = vnic_add_port(vnic_dev, p); + if (port) + list_add_tail(&port->list, &vnic_dev->port_list); + } + + ib_set_client_data(device, &vnic_client, vnic_dev); + +} + +static void vnic_remove_one(struct ib_device *device) +{ + struct vnic_ib_device *vnic_dev; + struct vnic_ib_port *port, *tmp_port; + + vnic_dev = ib_get_client_data(device, &vnic_client); + list_for_each_entry_safe(port, tmp_port, + &vnic_dev->port_list, list) { + device_unregister(&port->pdev_info.dev); + /* + * wait for sysfs entries to go away, so that no new vnics + * are created + */ + wait_for_completion(&port->pdev_info.released); + kfree(port); + + } + kfree(vnic_dev); + + /* TODO Only those vnic interfaces associated with + * the HCA whose remove event is called should be freed + * Currently all the vnic interfaces are freed + */ + + while (!list_empty(&vnic_list)) { + struct vnic *vnic = + list_entry(vnic_list.next, struct vnic, list_ptrs); + vnic_free(vnic); + } + + vnic_npevent_cleanup(); + viport_cleanup(); + +} + +void vnic_ib_cleanup(void) +{ + IB_FUNCTION("vnic_ib_cleanup()\n"); + + if (!vnic_ib_inited) + return; + + device_unregister(&interface_dev.dev); + wait_for_completion(&interface_dev.released); + + ib_unregister_client(&vnic_client); + ib_sa_unregister_client(&vnic_sa_client); + class_unregister(&vnic_class); +} + +static void vnic_path_rec_completion(int status, + struct ib_sa_path_rec *pathrec, + void *context) +{ + struct vnic_ib_path_info *p = context; + p->status = status; + if (!status) + p->path = *pathrec; + + complete(&p->done); +} + +int vnic_ib_get_path(struct netpath *netpath, struct vnic *vnic) +{ + struct viport_config *config = netpath->viport->config; + int ret = 0; + + init_completion(&config->path_info.done); + IB_INFO("Using SA path rec get time out value of %d\n", + config->sa_path_rec_get_timeout); + config->path_info.path_query_id = + ib_sa_path_rec_get(&vnic_sa_client, + config->ibdev, + config->port, + &config->path_info.path, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + config->sa_path_rec_get_timeout, + GFP_KERNEL, + vnic_path_rec_completion, + &config->path_info, + &config->path_info.path_query); + + if (config->path_info.path_query_id < 0) { + IB_ERROR("SA path record query 
failed; error %d\n", + config->path_info.path_query_id); + ret = config->path_info.path_query_id; + goto out; + } + + wait_for_completion(&config->path_info.done); + + if (config->path_info.status < 0) { + printk(KERN_WARNING PFX "connection not available to dgid " + "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x", + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[0]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[2]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[4]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[6]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[8]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[10]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[12]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[14])); + + if (config->path_info.status == -ETIMEDOUT) + printk(KERN_INFO " path query timed out\n"); + else if (config->path_info.status == -EIO) + printk(KERN_INFO " path query sending error\n"); + else + printk(KERN_INFO " error %d\n", + config->path_info.status); + + ret = config->path_info.status; + } +out: + if (ret) + netpath_timer(netpath, vnic->config->no_path_timeout); + + return ret; +} + +static inline void vnic_ib_handle_completions(struct ib_wc *wc, + struct vnic_ib_conn *ib_conn, + u32 *comp_num, + cycles_t *comp_time) +{ + struct io *io; + + io = (struct io *)(wc->wr_id); + vnic_ib_comp_stats(ib_conn, comp_num); + if (wc->status) { + IB_INFO("completion error wc.status %d" + " wc.opcode %d vendor err 0x%x\n", + wc->status, wc->opcode, wc->vendor_err); + } else if (io) { + vnic_ib_io_stats(io, ib_conn, *comp_time); + if (io->type == RECV_UD) { + struct ud_recv_io *recv_io = + container_of(io, struct ud_recv_io, io); + recv_io->len = wc->byte_len; + } + if (io->routine) + (*io->routine) (io); + } +} + +static void ib_qp_event(struct ib_event *event, void *context) +{ + IB_ERROR("QP event %d\n", event->event); +} + +static void vnic_ib_completion(struct ib_cq *cq, void *ptr) +{ + struct vnic_ib_conn *ib_conn = ptr; + unsigned long flags; + int compl_received; + struct ib_wc wc; + cycles_t comp_time; + u32 comp_num = 0; + + /* for multicast, cm_id is NULL, so skip that test */ + if (ib_conn->cm_id && + (ib_conn->state != IB_CONN_CONNECTED)) + return; + + /* Check if completion processing is taking place in thread + * If not then process completions in this handler, + * else set compl_received if not set, to indicate that + * there are more completions to process in thread. 
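+ * Concretely: once completion_limit completions have been handled
+ * in this callback, in_thread is set and the CQ is handed off to the
+ * callback thread; until that thread clears in_thread again, further
+ * CQ events only set compl_received and wake the thread.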
+ */ + + spin_lock_irqsave(&ib_conn->compl_received_lock, flags); + compl_received = ib_conn->compl_received; + spin_unlock_irqrestore(&ib_conn->compl_received_lock, flags); + + if (ib_conn->in_thread || compl_received) { + if (!compl_received) { + spin_lock_irqsave(&ib_conn->compl_received_lock, flags); + ib_conn->compl_received = 1; + spin_unlock_irqrestore(&ib_conn->compl_received_lock, + flags); + } + wake_up(&(ib_conn->callback_wait_queue)); + } else { + vnic_ib_note_comptime_stats(&comp_time); + vnic_ib_callback_stats(ib_conn); + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + while (ib_poll_cq(cq, 1, &wc) > 0) { + vnic_ib_handle_completions(&wc, ib_conn, &comp_num, + &comp_time); + if (ib_conn->cm_id && + ib_conn->state != IB_CONN_CONNECTED) + break; + + /* If we get more completions than the completion limit + * defer completion to the thread + */ + if ((!ib_conn->in_thread) && + (comp_num >= ib_conn->ib_config->completion_limit)) { + ib_conn->in_thread = 1; + spin_lock_irqsave( + &ib_conn->compl_received_lock, flags); + ib_conn->compl_received = 1; + spin_unlock_irqrestore( + &ib_conn->compl_received_lock, flags); + wake_up(&(ib_conn->callback_wait_queue)); + break; + } + + } + vnic_ib_maxio_stats(ib_conn, comp_num); + } +} + +static int vnic_ib_mod_qp_to_rts(struct ib_cm_id *cm_id, + struct vnic_ib_conn *ib_conn) +{ + int attr_mask = 0; + int ret; + struct ib_qp_attr *qp_attr = NULL; + + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + return -ENOMEM; + + qp_attr->qp_state = IB_QPS_RTR; + + ret = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask); + if (ret) + goto out; + + ret = ib_modify_qp(ib_conn->qp, qp_attr, attr_mask); + if (ret) + goto out; + + IB_INFO("QP RTR\n"); + + qp_attr->qp_state = IB_QPS_RTS; + + ret = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask); + if (ret) + goto out; + + ret = ib_modify_qp(ib_conn->qp, qp_attr, attr_mask); + if (ret) + goto out; + + IB_INFO("QP RTS\n"); + + ret = ib_send_cm_rtu(cm_id, NULL, 0); + if (ret) + goto out; +out: + kfree(qp_attr); + return ret; +} + +int vnic_ib_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct vnic_ib_conn *ib_conn = cm_id->context; + struct viport *viport = ib_conn->viport; + int err = 0; + + switch (event->event) { + case IB_CM_REQ_ERROR: + IB_ERROR("sending CM REQ failed\n"); + err = 1; + viport->retry = 1; + break; + case IB_CM_REP_RECEIVED: + IB_INFO("CM REP recvd\n"); + if (vnic_ib_mod_qp_to_rts(cm_id, ib_conn)) + err = 1; + else { + ib_conn->state = IB_CONN_CONNECTED; + vnic_ib_connected_time_stats(ib_conn); + IB_INFO("RTU SENT\n"); + } + break; + case IB_CM_REJ_RECEIVED: + printk(KERN_ERR PFX " CM rejected control connection\n"); + if (event->param.rej_rcvd.reason == + IB_CM_REJ_INVALID_SERVICE_ID) + printk(KERN_ERR "reason: invalid service ID. 
" + "IOCGUID value specified may be incorrect\n"); + else + printk(KERN_ERR "reason code : 0x%x\n", + event->param.rej_rcvd.reason); + + err = 1; + viport->retry = 1; + break; + case IB_CM_MRA_RECEIVED: + IB_INFO("CM MRA received\n"); + break; + + case IB_CM_DREP_RECEIVED: + IB_INFO("CM DREP recvd\n"); + ib_conn->state = IB_CONN_DISCONNECTED; + break; + + case IB_CM_TIMEWAIT_EXIT: + IB_ERROR("CM timewait exit\n"); + err = 1; + break; + + default: + IB_INFO("unhandled CM event %d\n", event->event); + break; + + } + + if (err) { + ib_conn->state = IB_CONN_DISCONNECTED; + viport_failure(viport); + } + + viport_kick(viport); + return 0; +} + + +int vnic_ib_cm_connect(struct vnic_ib_conn *ib_conn) +{ + struct ib_cm_req_param *req = NULL; + struct viport *viport; + int ret = -1; + + if (!vnic_ib_conn_initted(ib_conn)) { + IB_ERROR("IB Connection out of state for CM connect (%d)\n", + ib_conn->state); + return -EINVAL; + } + + vnic_ib_conntime_stats(ib_conn); + req = kzalloc(sizeof *req, GFP_KERNEL); + if (!req) + return -ENOMEM; + + viport = ib_conn->viport; + + req->primary_path = &viport->config->path_info.path; + req->alternate_path = NULL; + req->qp_num = ib_conn->qp->qp_num; + req->qp_type = ib_conn->qp->qp_type; + req->service_id = ib_conn->ib_config->service_id; + req->private_data = &ib_conn->ib_config->conn_data; + req->private_data_len = sizeof(struct vnic_connection_data); + req->flow_control = 1; + + get_random_bytes(&req->starting_psn, 4); + req->starting_psn &= 0xffffff; + + /* + * Both responder_resources and initiator_depth are set to zero + * as we do not need RDMA read. + * + * They also must be set to zero, otherwise data connections + * are rejected by VEx. + */ + req->responder_resources = 0; + req->initiator_depth = 0; + req->remote_cm_response_timeout = 20; + req->local_cm_response_timeout = 20; + req->retry_count = ib_conn->ib_config->retry_count; + req->rnr_retry_count = ib_conn->ib_config->rnr_retry_count; + req->max_cm_retries = 15; + + ib_conn->state = IB_CONN_CONNECTING; + + ret = ib_send_cm_req(ib_conn->cm_id, req); + + kfree(req); + + if (ret) { + IB_ERROR("CM REQ sending failed; error %d \n", ret); + ib_conn->state = IB_CONN_DISCONNECTED; + } + + return ret; +} + +static int vnic_ib_init_qp(struct vnic_ib_conn *ib_conn, + struct vnic_ib_config *config, + struct ib_pd *pd, + struct viport_config *viport_config) +{ + struct ib_qp_init_attr *init_attr; + struct ib_qp_attr *attr; + int ret; + + init_attr = kzalloc(sizeof *init_attr, GFP_KERNEL); + if (!init_attr) + return -ENOMEM; + + init_attr->event_handler = ib_qp_event; + init_attr->cap.max_send_wr = config->num_sends; + init_attr->cap.max_recv_wr = config->num_recvs; + init_attr->cap.max_recv_sge = config->recv_scatter; + init_attr->cap.max_send_sge = config->send_gather; + init_attr->sq_sig_type = IB_SIGNAL_ALL_WR; + init_attr->qp_type = IB_QPT_RC; + init_attr->send_cq = ib_conn->cq; + init_attr->recv_cq = ib_conn->cq; + + ib_conn->qp = ib_create_qp(pd, init_attr); + + if (IS_ERR(ib_conn->qp)) { + ret = -1; + IB_ERROR("could not create QP\n"); + goto free_init_attr; + } + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) { + ret = -ENOMEM; + goto destroy_qp; + } + + ret = ib_find_pkey(viport_config->ibdev, viport_config->port, + be16_to_cpu(viport_config->path_info.path.pkey), + &attr->pkey_index); + if (ret) { + printk(KERN_WARNING PFX "ib_find_pkey() failed; " + "error %d\n", ret); + goto freeattr; + } + + attr->qp_state = IB_QPS_INIT; + attr->qp_access_flags = IB_ACCESS_REMOTE_WRITE; + attr->port_num = 
viport_config->port; + + ret = ib_modify_qp(ib_conn->qp, attr, + IB_QP_STATE | + IB_QP_PKEY_INDEX | + IB_QP_ACCESS_FLAGS | IB_QP_PORT); + if (ret) { + printk(KERN_WARNING PFX "could not modify QP; error %d \n", + ret); + goto freeattr; + } + + kfree(attr); + kfree(init_attr); + return ret; + +freeattr: + kfree(attr); +destroy_qp: + ib_destroy_qp(ib_conn->qp); +free_init_attr: + kfree(init_attr); + return ret; +} + +int vnic_ib_conn_init(struct vnic_ib_conn *ib_conn, struct viport *viport, + struct ib_pd *pd, struct vnic_ib_config *config) +{ + struct viport_config *viport_config = viport->config; + int ret = -1; + unsigned int cq_size = config->num_sends + config->num_recvs; + + + if (!vnic_ib_conn_uninitted(ib_conn)) { + IB_ERROR("IB Connection out of state for init (%d)\n", + ib_conn->state); + return -EINVAL; + } + + ib_conn->cq = ib_create_cq(viport_config->ibdev, vnic_ib_completion, +#ifdef BUILD_FOR_OFED_1_2 + NULL, ib_conn, cq_size); +#else + NULL, ib_conn, cq_size, 0); +#endif + if (IS_ERR(ib_conn->cq)) { + IB_ERROR("could not create CQ\n"); + goto out; + } + + IB_INFO("cq created %p %d\n", ib_conn->cq, cq_size); + ib_req_notify_cq(ib_conn->cq, IB_CQ_NEXT_COMP); + init_waitqueue_head(&(ib_conn->callback_wait_queue)); + init_completion(&(ib_conn->callback_thread_exit)); + + spin_lock_init(&ib_conn->compl_received_lock); + + ib_conn->callback_thread = kthread_run(vnic_defer_completion, ib_conn, + "qlgc_vnic_def_compl"); + if (IS_ERR(ib_conn->callback_thread)) { + IB_ERROR("Could not create vnic_callback_thread;" + " error %d\n", (int) PTR_ERR(ib_conn->callback_thread)); + ib_conn->callback_thread = NULL; + goto destroy_cq; + } + + ret = vnic_ib_init_qp(ib_conn, config, pd, viport_config); + + if (ret) + goto destroy_thread; + + spin_lock_init(&ib_conn->conn_lock); + ib_conn->state = IB_CONN_INITTED; + + return ret; + +destroy_thread: + completion_callback_cleanup(ib_conn); +destroy_cq: + ib_destroy_cq(ib_conn->cq); +out: + return ret; +} + +int vnic_ib_post_recv(struct vnic_ib_conn *ib_conn, struct io *io) +{ + cycles_t post_time; + struct ib_recv_wr *bad_wr; + int ret = -1; + unsigned long flags; + + IB_FUNCTION("vnic_ib_post_recv()\n"); + + spin_lock_irqsave(&ib_conn->conn_lock, flags); + + if (!vnic_ib_conn_initted(ib_conn) && + !vnic_ib_conn_connected(ib_conn)) { + ret = -EINVAL; + goto out; + } + + vnic_ib_pre_rcvpost_stats(ib_conn, io, &post_time); + io->type = RECV; + ret = ib_post_recv(ib_conn->qp, &io->rwr, &bad_wr); + if (ret) { + IB_ERROR("error in posting rcv wr; error %d\n", ret); + ib_conn->state = IB_CONN_ERRORED; + goto out; + } + + vnic_ib_post_rcvpost_stats(ib_conn, post_time); +out: + spin_unlock_irqrestore(&ib_conn->conn_lock, flags); + return ret; + +} + +int vnic_ib_post_send(struct vnic_ib_conn *ib_conn, struct io *io) +{ + cycles_t post_time; + unsigned long flags; + struct ib_send_wr *bad_wr; + int ret = -1; + + IB_FUNCTION("vnic_ib_post_send()\n"); + + spin_lock_irqsave(&ib_conn->conn_lock, flags); + if (!vnic_ib_conn_connected(ib_conn)) { + IB_ERROR("IB Connection out of state for" + " posting sends (%d)\n", ib_conn->state); + goto out; + } + + vnic_ib_pre_sendpost_stats(io, &post_time); + if (io->swr.opcode == IB_WR_RDMA_WRITE) + io->type = RDMA; + else + io->type = SEND; + + ret = ib_post_send(ib_conn->qp, &io->swr, &bad_wr); + if (ret) { + IB_ERROR("error in posting send wr; error %d\n", ret); + ib_conn->state = IB_CONN_ERRORED; + goto out; + } + + vnic_ib_post_sendpost_stats(ib_conn, io, post_time); +out: + spin_unlock_irqrestore(&ib_conn->conn_lock, 
flags); + return ret; +} + +static int vnic_defer_completion(void *ptr) +{ + struct vnic_ib_conn *ib_conn = ptr; + struct ib_wc wc; + struct ib_cq *cq = ib_conn->cq; + cycles_t comp_time; + u32 comp_num = 0; + unsigned long flags; + + while (!ib_conn->callback_thread_end) { + wait_event_interruptible(ib_conn->callback_wait_queue, + ib_conn->compl_received || + ib_conn->callback_thread_end); + ib_conn->in_thread = 1; + spin_lock_irqsave(&ib_conn->compl_received_lock, flags); + ib_conn->compl_received = 0; + spin_unlock_irqrestore(&ib_conn->compl_received_lock, flags); + if (ib_conn->cm_id && + ib_conn->state != IB_CONN_CONNECTED) + goto out_thread; + + vnic_ib_note_comptime_stats(&comp_time); + vnic_ib_callback_stats(ib_conn); + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + while (ib_poll_cq(cq, 1, &wc) > 0) { + vnic_ib_handle_completions(&wc, ib_conn, &comp_num, + &comp_time); + if (ib_conn->cm_id && + ib_conn->state != IB_CONN_CONNECTED) + break; + } + vnic_ib_maxio_stats(ib_conn, comp_num); +out_thread: + ib_conn->in_thread = 0; + } + complete_and_exit(&(ib_conn->callback_thread_exit), 0); + return 0; +} + +void completion_callback_cleanup(struct vnic_ib_conn *ib_conn) +{ + if (ib_conn->callback_thread) { + ib_conn->callback_thread_end = 1; + wake_up(&(ib_conn->callback_wait_queue)); + wait_for_completion(&(ib_conn->callback_thread_exit)); + ib_conn->callback_thread = NULL; + } +} + +int vnic_ib_mc_init(struct mc_data *mc_data, struct viport *viport, + struct ib_pd *pd, struct vnic_ib_config *config) +{ + struct viport_config *viport_config = viport->config; + int ret = -1; + unsigned int cq_size = config->num_recvs; /* recvs only */ + + IB_FUNCTION("vnic_ib_mc_init\n"); + + mc_data->ib_conn.cq = ib_create_cq(viport_config->ibdev, vnic_ib_completion, +#ifdef BUILD_FOR_OFED_1_2 + NULL, &mc_data->ib_conn, cq_size); +#else + NULL, &mc_data->ib_conn, cq_size, 0); +#endif + if (IS_ERR(mc_data->ib_conn.cq)) { + IB_ERROR("ib_create_cq failed\n"); + goto out; + } + IB_INFO("mc cq created %p %d\n", mc_data->ib_conn.cq, cq_size); + + ret = ib_req_notify_cq(mc_data->ib_conn.cq, IB_CQ_NEXT_COMP); + if (ret) { + IB_ERROR("ib_req_notify_cq failed %x \n", ret); + goto destroy_cq; + } + + init_waitqueue_head(&(mc_data->ib_conn.callback_wait_queue)); + init_completion(&(mc_data->ib_conn.callback_thread_exit)); + + spin_lock_init(&mc_data->ib_conn.compl_received_lock); + mc_data->ib_conn.callback_thread = kthread_run(vnic_defer_completion, + &mc_data->ib_conn, + "qlgc_vnic_mc_def_compl"); + if (IS_ERR(mc_data->ib_conn.callback_thread)) { + IB_ERROR("Could not create vnic_callback_thread for MULTICAST;" + " error %d\n", + (int) PTR_ERR(mc_data->ib_conn.callback_thread)); + mc_data->ib_conn.callback_thread = NULL; + goto destroy_cq; + } + IB_INFO("callback_thread created\n"); + + ret = vnic_ib_mc_init_qp(mc_data, config, pd, viport_config); + if (ret) + goto destroy_thread; + + spin_lock_init(&mc_data->ib_conn.conn_lock); + mc_data->ib_conn.state = IB_CONN_INITTED; /* stays in this state */ + + return ret; + +destroy_thread: + completion_callback_cleanup(&mc_data->ib_conn); +destroy_cq: + ib_destroy_cq(mc_data->ib_conn.cq); + mc_data->ib_conn.cq = (struct ib_cq *)ERR_PTR(-EINVAL); +out: + return ret; +} + +static int vnic_ib_mc_init_qp(struct mc_data *mc_data, + struct vnic_ib_config *config, + struct ib_pd *pd, + struct viport_config *viport_config) +{ + struct ib_qp_init_attr *init_attr; + struct ib_qp_attr *qp_attr; + int ret; + + IB_FUNCTION("vnic_ib_mc_init_qp\n"); + + if (!mc_data->ib_conn.cq) { + 
IB_ERROR("cq is null\n"); + return -ENOMEM; + } + + init_attr = kzalloc(sizeof *init_attr, GFP_KERNEL); + if (!init_attr) { + IB_ERROR("failed to alloc init_attr\n"); + return -ENOMEM; + } + + init_attr->cap.max_recv_wr = config->num_recvs; + init_attr->cap.max_send_wr = 1; + init_attr->cap.max_recv_sge = 2; + init_attr->cap.max_send_sge = 1; + + /* Completion for all work requests. */ + init_attr->sq_sig_type = IB_SIGNAL_ALL_WR; + + init_attr->qp_type = IB_QPT_UD; + + init_attr->send_cq = mc_data->ib_conn.cq; + init_attr->recv_cq = mc_data->ib_conn.cq; + + IB_INFO("creating qp %d \n", config->num_recvs); + + mc_data->ib_conn.qp = ib_create_qp(pd, init_attr); + + if (IS_ERR(mc_data->ib_conn.qp)) { + ret = -1; + IB_ERROR("could not create QP\n"); + goto free_init_attr; + } + + qp_attr = kzalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) { + ret = -ENOMEM; + goto destroy_qp; + } + + qp_attr->qp_state = IB_QPS_INIT; + qp_attr->port_num = viport_config->port; + qp_attr->qkey = IOC_NUMBER(be64_to_cpu(viport_config->ioc_guid)); + qp_attr->pkey_index = 0; + /* cannot set access flags for UD qp + qp_attr->qp_access_flags = IB_ACCESS_REMOTE_WRITE; */ + + IB_INFO("port_num:%d qkey:%d pkey:%d\n", qp_attr->port_num, + qp_attr->qkey, qp_attr->pkey_index); + ret = ib_modify_qp(mc_data->ib_conn.qp, qp_attr, + IB_QP_STATE | + IB_QP_PKEY_INDEX | + IB_QP_QKEY | + + /* cannot set this for UD + IB_QP_ACCESS_FLAGS | */ + + IB_QP_PORT); + if (ret) { + IB_ERROR("ib_modify_qp to INIT failed %d \n", ret); + goto free_qp_attr; + } + + kfree(qp_attr); + kfree(init_attr); + return ret; + +free_qp_attr: + kfree(qp_attr); +destroy_qp: + ib_destroy_qp(mc_data->ib_conn.qp); + mc_data->ib_conn.qp = ERR_PTR(-EINVAL); +free_init_attr: + kfree(init_attr); + return ret; +} + +int vnic_ib_mc_mod_qp_to_rts(struct ib_qp *qp) +{ + int ret; + struct ib_qp_attr *qp_attr = NULL; + + IB_FUNCTION("vnic_ib_mc_mod_qp_to_rts\n"); + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + return -ENOMEM; + + memset(qp_attr, 0, sizeof *qp_attr); + qp_attr->qp_state = IB_QPS_RTR; + + ret = ib_modify_qp(qp, qp_attr, IB_QP_STATE); + if (ret) { + IB_ERROR("ib_modify_qp to RTR failed %d\n", ret); + goto out; + } + IB_INFO("MC QP RTR\n"); + + memset(qp_attr, 0, sizeof *qp_attr); + qp_attr->qp_state = IB_QPS_RTS; + qp_attr->sq_psn = 0; + + ret = ib_modify_qp(qp, qp_attr, IB_QP_STATE | IB_QP_SQ_PSN); + if (ret) { + IB_ERROR("ib_modify_qp to RTS failed %d\n", ret); + goto out; + } + IB_INFO("MC QP RTS\n"); + + return 0; + +out: + kfree(qp_attr); + return -1; +} + +int vnic_ib_mc_post_recv(struct mc_data *mc_data, struct io *io) +{ + cycles_t post_time; + struct ib_recv_wr *bad_wr; + int ret = -1; + + IB_FUNCTION("vnic_ib_mc_post_recv()\n"); + + vnic_ib_pre_rcvpost_stats(&mc_data->ib_conn, io, &post_time); + io->type = RECV_UD; + ret = ib_post_recv(mc_data->ib_conn.qp, &io->rwr, &bad_wr); + if (ret) { + IB_ERROR("error in posting rcv wr; error %d\n", ret); + goto out; + } + vnic_ib_post_rcvpost_stats(&mc_data->ib_conn, post_time); + +out: + return ret; +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h new file mode 100644 index 0000000..ebf9ef5 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h @@ -0,0 +1,206 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_IB_H_INCLUDED +#define VNIC_IB_H_INCLUDED + +#include +#include +#include +#include +#include +#include + +#include "vnic_sys.h" +#include "vnic_netpath.h" +#define PFX "qlgc_vnic: " + +struct io; +typedef void (comp_routine_t) (struct io *io); + +enum vnic_ib_conn_state { + IB_CONN_UNINITTED = 0, + IB_CONN_INITTED = 1, + IB_CONN_CONNECTING = 2, + IB_CONN_CONNECTED = 3, + IB_CONN_DISCONNECTED = 4, + IB_CONN_ERRORED = 5 +}; + +struct vnic_ib_conn { + struct viport *viport; + struct vnic_ib_config *ib_config; + spinlock_t conn_lock; + enum vnic_ib_conn_state state; + struct ib_qp *qp; + struct ib_cq *cq; + struct ib_cm_id *cm_id; + int callback_thread_end; + struct task_struct *callback_thread; + wait_queue_head_t callback_wait_queue; + u32 in_thread; + u32 compl_received; + struct completion callback_thread_exit; + spinlock_t compl_received_lock; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + struct { + cycles_t connection_time; + cycles_t rdma_post_time; + u32 rdma_post_ios; + cycles_t rdma_comp_time; + u32 rdma_comp_ios; + cycles_t send_post_time; + u32 send_post_ios; + cycles_t send_comp_time; + u32 send_comp_ios; + cycles_t recv_post_time; + u32 recv_post_ios; + cycles_t recv_comp_time; + u32 recv_comp_ios; + u32 num_ios; + u32 num_callbacks; + u32 max_ios; + } statistics; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ +}; + +struct vnic_ib_path_info { + struct ib_sa_path_rec path; + struct ib_sa_query *path_query; + int path_query_id; + int status; + struct completion done; +}; + +struct vnic_ib_device { + struct ib_device *dev; + struct list_head port_list; +}; + +struct vnic_ib_port { + struct vnic_ib_device *dev; + u8 port_num; + struct dev_info pdev_info; + struct list_head list; +}; + +struct io { + struct list_head list_ptrs; + struct viport *viport; + comp_routine_t *routine; + struct ib_recv_wr rwr; + struct ib_send_wr swr; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + cycles_t time; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ + enum {RECV, RDMA, SEND, RECV_UD} type; +}; + +struct rdma_io { + struct io io; + struct ib_sge list[2]; + u16 index; + u16 len; + u8 *data; + dma_addr_t data_dma; + struct sk_buff *skb; + dma_addr_t skb_data_dma; + struct viport_trailer *trailer; + dma_addr_t trailer_dma; +}; + +struct send_io { + struct io io; 
+ struct ib_sge list; + u8 *virtual_addr; +}; + +struct recv_io { + struct io io; + struct ib_sge list; + u8 *virtual_addr; +}; + +struct ud_recv_io { + struct io io; + u16 len; + dma_addr_t skb_data_dma; + struct ib_sge list[2]; /* one for grh and other for rest of pkt. */ + struct sk_buff *skb; +}; + +int vnic_ib_init(void); +void vnic_ib_cleanup(void); + +struct vnic; +int vnic_ib_get_path(struct netpath *netpath, struct vnic *vnic); +int vnic_ib_conn_init(struct vnic_ib_conn *ib_conn, struct viport *viport, + struct ib_pd *pd, struct vnic_ib_config *config); + +int vnic_ib_post_recv(struct vnic_ib_conn *ib_conn, struct io *io); +int vnic_ib_post_send(struct vnic_ib_conn *ib_conn, struct io *io); +int vnic_ib_cm_connect(struct vnic_ib_conn *ib_conn); +int vnic_ib_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event); + +#define vnic_ib_conn_uninitted(ib_conn) \ + ((ib_conn)->state == IB_CONN_UNINITTED) +#define vnic_ib_conn_initted(ib_conn) \ + ((ib_conn)->state == IB_CONN_INITTED) +#define vnic_ib_conn_connecting(ib_conn) \ + ((ib_conn)->state == IB_CONN_CONNECTING) +#define vnic_ib_conn_connected(ib_conn) \ + ((ib_conn)->state == IB_CONN_CONNECTED) +#define vnic_ib_conn_disconnected(ib_conn) \ + ((ib_conn)->state == IB_CONN_DISCONNECTED) + +#define MCAST_GROUP_INVALID 0x00 /* viport failed to join or left mc group */ +#define MCAST_GROUP_JOINING 0x01 /* wait for completion */ +#define MCAST_GROUP_JOINED 0x02 /* join process completed successfully */ + +/* vnic_sa_client is used to register with sa once. It is needed to join and + * leave multicast groups. + */ +extern struct ib_sa_client vnic_sa_client; + +/* The following functions are using initialize and handle multicast + * components. + */ +struct mc_data; /* forward declaration */ +/* Initialize all necessary mc components */ +int vnic_ib_mc_init(struct mc_data *mc_data, struct viport *viport, + struct ib_pd *pd, struct vnic_ib_config *config); +/* Put multicast qp in RTS */ +int vnic_ib_mc_mod_qp_to_rts(struct ib_qp *qp); +/* Post multicast receive buffers */ +int vnic_ib_mc_post_recv(struct mc_data *mc_data, struct io *io); + +#endif /* VNIC_IB_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:34:59 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:04:59 +0530 Subject: [ofa-general] [PATCH v2 07/13] QLogic VNIC: Handling configurable parameters of the driver In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103459.12355.51105.stgit@localhost.localdomain> From: Poornima Kamath This patch adds the files that handle various configurable parameters of the VNIC driver ---- configuration of virtual NIC, control, data connections to the EVIC and general IB connection parameters. 
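As a rough illustration of the kind of validation these knobs call for (this sketch is not part of the patch, and the helper name is made up): vnic_max_mtu is documented as 1500-9500, and completion_limit promises a minimum of 10, but module_param() itself enforces no bounds. A loader-time clamp against the MIN_MTU/MAX_MTU macros the driver already uses could look like:

static void vnic_check_module_params(void)
{
	/* hypothetical helper: clamp vnic_max_mtu into its documented
	 * range so later BUFFER_SIZE() computations stay within bounds
	 */
	if (vnic_max_mtu < MIN_MTU) {
		printk(KERN_WARNING PFX "vnic_max_mtu %d too small, using %d\n",
		       vnic_max_mtu, MIN_MTU);
		vnic_max_mtu = MIN_MTU;
	} else if (vnic_max_mtu > MAX_MTU) {
		printk(KERN_WARNING PFX "vnic_max_mtu %d too large, using %d\n",
		       vnic_max_mtu, MAX_MTU);
		vnic_max_mtu = MAX_MTU;
	}
}

Running such a check first thing in module init keeps an out-of-range modprobe option from silently propagating into the pool-size negotiation below.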
Signed-off-by: Poornima Kamath Signed-off-by: Ramachandra K Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_config.c | 379 ++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_config.h | 242 +++++++++++++++ 2 files changed, 621 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_config.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_config.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_config.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_config.c new file mode 100644 index 0000000..8bde3d8 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_config.c @@ -0,0 +1,379 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_config.h" +#include "vnic_trailer.h" +#include "vnic_main.h" + +u16 vnic_max_mtu = MAX_MTU; + +static u32 default_no_path_timeout = DEFAULT_NO_PATH_TIMEOUT; +static u32 sa_path_rec_get_timeout = SA_PATH_REC_GET_TIMEOUT; +static u32 default_primary_reconnect_timeout = + DEFAULT_PRIMARY_RECONNECT_TIMEOUT; +static u32 default_primary_switch_timeout = DEFAULT_PRIMARY_SWITCH_TIMEOUT; +static int default_prefer_primary = DEFAULT_PREFER_PRIMARY; + +static int use_rx_csum = VNIC_USE_RX_CSUM; +static int use_tx_csum = VNIC_USE_TX_CSUM; + +static u32 control_response_timeout = CONTROL_RSP_TIMEOUT; +static u32 completion_limit = DEFAULT_COMPLETION_LIMIT; + +module_param(vnic_max_mtu, ushort, 0444); +MODULE_PARM_DESC(vnic_max_mtu, "Maximum MTU size (1500-9500). Default is 9500"); + +module_param(default_prefer_primary, bool, 0444); +MODULE_PARM_DESC(default_prefer_primary, "Determines if primary path is" + " preferred (1) or not (0). Defaults to 0"); +module_param(use_rx_csum, bool, 0444); +MODULE_PARM_DESC(use_rx_csum, "Determines if RX checksum is done on VEx (1)" + " or not (0). Defaults to 1"); +module_param(use_tx_csum, bool, 0444); +MODULE_PARM_DESC(use_tx_csum, "Determines if TX checksum is done on VEx (1)" + " or not (0). 
Defaults to 1"); +module_param(default_no_path_timeout, uint, 0444); +MODULE_PARM_DESC(default_no_path_timeout, "Time to wait in milliseconds" + " before reconnecting to VEx after connection loss"); +module_param(default_primary_reconnect_timeout, uint, 0444); +MODULE_PARM_DESC(default_primary_reconnect_timeout, "Time to wait in" + " milliseconds before reconnecting the" + " primary path to VEx"); +module_param(default_primary_switch_timeout, uint, 0444); +MODULE_PARM_DESC(default_primary_switch_timeout, "Time to wait before" + " switching back to primary path if" + " primary path is preferred"); +module_param(sa_path_rec_get_timeout, uint, 0444); +MODULE_PARM_DESC(sa_path_rec_get_timeout, "Time out value in milliseconds" + " for SA path record get queries"); + +module_param(control_response_timeout, uint, 0444); +MODULE_PARM_DESC(control_response_timeout, "Time out value in milliseconds" + " to wait for response to control requests"); + +module_param(completion_limit, uint, 0444); +MODULE_PARM_DESC(completion_limit, "Maximum completions to process" + " in a single completion callback invocation. Default is 100" + " Minimum value is 10"); + +static void config_control_defaults(struct control_config *control_config, + struct path_param *params) +{ + int len; + char *dot; + u64 sid; + + sid = (SST_AGN << 56) | (SST_OUI << 32) | (CONTROL_PATH_ID << 8) + | IOC_NUMBER(be64_to_cpu(params->ioc_guid)); + + control_config->ib_config.service_id = cpu_to_be64(sid); + control_config->ib_config.conn_data.path_id = 0; + control_config->ib_config.conn_data.vnic_instance = params->instance; + control_config->ib_config.conn_data.path_num = 0; + control_config->ib_config.conn_data.features_supported = + __constant_cpu_to_be32((u32) (VNIC_FEAT_IGNORE_VLAN | + VNIC_FEAT_RDMA_IMMED)); + dot = strchr(init_utsname()->nodename, '.'); + + if (dot) + len = dot - init_utsname()->nodename; + else + len = strlen(init_utsname()->nodename); + + if (len > VNIC_MAX_NODENAME_LEN) + len = VNIC_MAX_NODENAME_LEN; + + memcpy(control_config->ib_config.conn_data.nodename, + init_utsname()->nodename, len); + + if (params->ib_multicast == 1) + control_config->ib_multicast = 1; + else if (params->ib_multicast == 0) + control_config->ib_multicast = 0; + else { + /* parameter is not set - enable it by default */ + control_config->ib_multicast = 1; + CONFIG_ERROR("IOCGUID=%llx INSTANCE=%d IB_MULTICAST defaulted" + " to TRUE\n", + be64_to_cpu(params->ioc_guid), + (char)params->instance); + } + + if (control_config->ib_multicast) + control_config->ib_config.conn_data.features_supported |= + __constant_cpu_to_be32(VNIC_FEAT_INBOUND_IB_MC); + + control_config->ib_config.retry_count = RETRY_COUNT; + control_config->ib_config.rnr_retry_count = RETRY_COUNT; + control_config->ib_config.min_rnr_timer = MIN_RNR_TIMER; + + /* These values are not configurable*/ + control_config->ib_config.num_recvs = 5; + control_config->ib_config.num_sends = 1; + control_config->ib_config.recv_scatter = 1; + control_config->ib_config.send_gather = 1; + control_config->ib_config.completion_limit = completion_limit; + + control_config->num_recvs = control_config->ib_config.num_recvs; + + control_config->vnic_instance = params->instance; + control_config->max_address_entries = MAX_ADDRESS_ENTRIES; + control_config->min_address_entries = MIN_ADDRESS_ENTRIES; + control_config->rsp_timeout = msecs_to_jiffies(control_response_timeout); +} + +static void config_data_defaults(struct data_config *data_config, + struct path_param *params) +{ + u64 sid; + + sid = (SST_AGN 
<< 56) | (SST_OUI << 32) | (DATA_PATH_ID << 8) + | IOC_NUMBER(be64_to_cpu(params->ioc_guid)); + + data_config->ib_config.service_id = cpu_to_be64(sid); + data_config->ib_config.conn_data.path_id = jiffies; /* random */ + data_config->ib_config.conn_data.vnic_instance = params->instance; + data_config->ib_config.conn_data.path_num = 0; + + data_config->ib_config.retry_count = RETRY_COUNT; + data_config->ib_config.rnr_retry_count = RETRY_COUNT; + data_config->ib_config.min_rnr_timer = MIN_RNR_TIMER; + + /* + * NOTE: the num_recvs size assumes that the EIOC could + * RDMA enough packets to fill all of the host recv + * pool entries, plus send a kick message after each + * packet, plus RDMA new buffers for the size of + * the EIOC recv buffer pool, plus send kick messages + * after each min_host_update_sz of new buffers all + * before the host can even pull off the first completed + * receive off the completion queue, and repost the + * receive. NOT LIKELY! + */ + data_config->ib_config.num_recvs = HOST_RECV_POOL_ENTRIES + + (MAX_EIOC_POOL_SZ / MIN_HOST_UPDATE_SZ); + + data_config->ib_config.num_sends = (2 * NOTIFY_BUNDLE_SZ) + + (HOST_RECV_POOL_ENTRIES / MIN_EIOC_UPDATE_SZ) + 1; + + data_config->ib_config.recv_scatter = 1; /* not configurable */ + data_config->ib_config.send_gather = 2; /* not configurable */ + data_config->ib_config.completion_limit = completion_limit; + + data_config->num_recvs = data_config->ib_config.num_recvs; + data_config->path_id = data_config->ib_config.conn_data.path_id; + + + data_config->host_recv_pool_entries = HOST_RECV_POOL_ENTRIES; + + data_config->host_min.size_recv_pool_entry = + cpu_to_be32(BUFFER_SIZE(VLAN_ETH_HLEN + MIN_MTU)); + data_config->host_max.size_recv_pool_entry = + cpu_to_be32(BUFFER_SIZE(VLAN_ETH_HLEN + vnic_max_mtu)); + data_config->eioc_min.size_recv_pool_entry = + cpu_to_be32(BUFFER_SIZE(VLAN_ETH_HLEN + MIN_MTU)); + data_config->eioc_max.size_recv_pool_entry = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.num_recv_pool_entries = + __constant_cpu_to_be32(MIN_HOST_POOL_SZ); + data_config->host_max.num_recv_pool_entries = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + data_config->eioc_min.num_recv_pool_entries = + __constant_cpu_to_be32(MIN_EIOC_POOL_SZ); + data_config->eioc_max.num_recv_pool_entries = + __constant_cpu_to_be32(MAX_EIOC_POOL_SZ); + + data_config->host_min.timeout_before_kick = + __constant_cpu_to_be32(MIN_HOST_KICK_TIMEOUT); + data_config->host_max.timeout_before_kick = + __constant_cpu_to_be32(MAX_HOST_KICK_TIMEOUT); + data_config->eioc_min.timeout_before_kick = 0; + data_config->eioc_max.timeout_before_kick = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.num_recv_pool_entries_before_kick = + __constant_cpu_to_be32(MIN_HOST_KICK_ENTRIES); + data_config->host_max.num_recv_pool_entries_before_kick = + __constant_cpu_to_be32(MAX_HOST_KICK_ENTRIES); + data_config->eioc_min.num_recv_pool_entries_before_kick = 0; + data_config->eioc_max.num_recv_pool_entries_before_kick = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.num_recv_pool_bytes_before_kick = + __constant_cpu_to_be32(MIN_HOST_KICK_BYTES); + data_config->host_max.num_recv_pool_bytes_before_kick = + __constant_cpu_to_be32(MAX_HOST_KICK_BYTES); + data_config->eioc_min.num_recv_pool_bytes_before_kick = 0; + data_config->eioc_max.num_recv_pool_bytes_before_kick = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MIN_HOST_UPDATE_SZ); + 
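+	/*
+	 * With the constants defined in vnic_config.h left at their
+	 * defaults (assuming none of them are changed), the num_recvs and
+	 * num_sends sizing at the top of this function works out to
+	 * num_recvs = 512 + 256/8 = 544 and
+	 * num_sends = 2 * 32 + 512/8 + 1 = 129 work requests per data QP.
+	 */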
data_config->host_max.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MAX_HOST_UPDATE_SZ); + data_config->eioc_min.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MIN_EIOC_UPDATE_SZ); + data_config->eioc_max.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MAX_EIOC_UPDATE_SZ); + + data_config->notify_bundle = NOTIFY_BUNDLE_SZ; +} + +static void config_path_info_defaults(struct viport_config *config, + struct path_param *params) +{ + int i; + ib_query_gid(config->ibdev, config->port, 0, + &config->path_info.path.sgid); + for (i = 0; i < 16; i++) + config->path_info.path.dgid.raw[i] = params->dgid[i]; + + config->path_info.path.pkey = params->pkey; + config->path_info.path.numb_path = 1; + config->sa_path_rec_get_timeout = sa_path_rec_get_timeout; + +} + +static void config_viport_defaults(struct viport_config *config, + struct path_param *params) +{ + config->ibdev = params->ibdev; + config->port = params->port; + config->ioc_guid = params->ioc_guid; + config->stats_interval = msecs_to_jiffies(VIPORT_STATS_INTERVAL); + config->hb_interval = msecs_to_jiffies(VIPORT_HEARTBEAT_INTERVAL); + config->hb_timeout = VIPORT_HEARTBEAT_TIMEOUT * 1000; + /*hb_timeout needs to be in usec*/ + strcpy(config->ioc_string, params->ioc_string); + config_path_info_defaults(config, params); + + config_control_defaults(&config->control_config, params); + config_data_defaults(&config->data_config, params); +} + +static void config_vnic_defaults(struct vnic_config *config) +{ + config->no_path_timeout = msecs_to_jiffies(default_no_path_timeout); + config->primary_connect_timeout = + msecs_to_jiffies(DEFAULT_PRIMARY_CONNECT_TIMEOUT); + config->primary_reconnect_timeout = + msecs_to_jiffies(default_primary_reconnect_timeout); + config->primary_switch_timeout = + msecs_to_jiffies(default_primary_switch_timeout); + config->prefer_primary = default_prefer_primary; + config->use_rx_csum = use_rx_csum; + config->use_tx_csum = use_tx_csum; +} + +struct viport_config *config_alloc_viport(struct path_param *params) +{ + struct viport_config *config; + + config = kzalloc(sizeof *config, GFP_KERNEL); + if (!config) { + CONFIG_ERROR("could not allocate memory for" + " struct viport_config\n"); + return NULL; + } + + config_viport_defaults(config, params); + + return config; +} + +struct vnic_config *config_alloc_vnic(void) +{ + struct vnic_config *config; + + config = kzalloc(sizeof *config, GFP_KERNEL); + if (!config) { + CONFIG_ERROR("couldn't allocate memory for" + " struct vnic_config\n"); + + return NULL; + } + + config_vnic_defaults(config); + return config; +} + +char *config_viport_name(struct viport_config *config) +{ + /* function only called by one thread, can return a static string */ + static char str[64]; + + sprintf(str, "GUID %llx instance %d", + be64_to_cpu(config->ioc_guid), + config->control_config.vnic_instance); + return str; +} + +int config_start(void) +{ + vnic_max_mtu = min_t(u16, vnic_max_mtu, MAX_MTU); + vnic_max_mtu = max_t(u16, vnic_max_mtu, MIN_MTU); + + sa_path_rec_get_timeout = min_t(u32, sa_path_rec_get_timeout, + MAX_SA_TIMEOUT); + sa_path_rec_get_timeout = max_t(u32, sa_path_rec_get_timeout, + MIN_SA_TIMEOUT); + + control_response_timeout = min_t(u32, control_response_timeout, + MAX_CONTROL_RSP_TIMEOUT); + + control_response_timeout = max_t(u32, control_response_timeout, + MIN_CONTROL_RSP_TIMEOUT); + + completion_limit = max_t(u32, completion_limit, + MIN_COMPLETION_LIMIT); + + if (!default_no_path_timeout) + default_no_path_timeout = 
DEFAULT_NO_PATH_TIMEOUT; + + if (!default_primary_reconnect_timeout) + default_primary_reconnect_timeout = + DEFAULT_PRIMARY_RECONNECT_TIMEOUT; + + if (!default_primary_switch_timeout) + default_primary_switch_timeout = + DEFAULT_PRIMARY_SWITCH_TIMEOUT; + + return 0; + +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_config.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_config.h new file mode 100644 index 0000000..dca5f98 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_config.h @@ -0,0 +1,242 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_CONFIG_H_INCLUDED +#define VNIC_CONFIG_H_INCLUDED + +#include +#include +#include + +#include "vnic_control.h" +#include "vnic_ib.h" + +#define SST_AGN 0x10ULL +#define SST_OUI 0x00066AULL + +enum { + CONTROL_PATH_ID = 0x0, + DATA_PATH_ID = 0x1 +}; + +#define IOC_NUMBER(GUID) (((GUID) >> 32) & 0xFF) + +enum { + VNIC_CLASS_SUBCLASS = 0x2000066A, + VNIC_PROTOCOL = 0, + VNIC_PROT_VERSION = 1 +}; + +enum { + MIN_MTU = 1500, /* minimum negotiated MTU size */ + MAX_MTU = 9500 /* jumbo frame */ +}; + +/* + * TODO: tune the pool parameter values + */ +enum { + MIN_ADDRESS_ENTRIES = 16, + MAX_ADDRESS_ENTRIES = 64 +}; + +enum { + HOST_RECV_POOL_ENTRIES = 512, + MIN_HOST_POOL_SZ = 64, + MIN_EIOC_POOL_SZ = 64, + MAX_EIOC_POOL_SZ = 256, + MIN_HOST_UPDATE_SZ = 8, + MAX_HOST_UPDATE_SZ = 32, + MIN_EIOC_UPDATE_SZ = 8, + MAX_EIOC_UPDATE_SZ = 32, + NOTIFY_BUNDLE_SZ = 32 +}; + +enum { + MIN_HOST_KICK_TIMEOUT = 10, /* in usec */ + MAX_HOST_KICK_TIMEOUT = 100 /* in usec */ +}; + +enum { + MIN_HOST_KICK_ENTRIES = 1, + MAX_HOST_KICK_ENTRIES = 128 +}; + +enum { + MIN_HOST_KICK_BYTES = 0, + MAX_HOST_KICK_BYTES = 5000 +}; + +enum { + DEFAULT_NO_PATH_TIMEOUT = 10000, + DEFAULT_PRIMARY_CONNECT_TIMEOUT = 10000, + DEFAULT_PRIMARY_RECONNECT_TIMEOUT = 10000, + DEFAULT_PRIMARY_SWITCH_TIMEOUT = 10000 +}; + +enum { + VIPORT_STATS_INTERVAL = 500, /* .5 sec */ + VIPORT_HEARTBEAT_INTERVAL = 1000, /* 1 second */ + VIPORT_HEARTBEAT_TIMEOUT = 64000 /* 64 sec */ +}; + +enum { + /* 5 sec increased for EVIC support for large number of + * host connections + */ + CONTROL_RSP_TIMEOUT = 5000, + MIN_CONTROL_RSP_TIMEOUT = 1000, /* 1 sec */ + MAX_CONTROL_RSP_TIMEOUT = 60000 /* 60 sec */ +}; + +/* Maximum number of completions to be processed + * during a single completion callback invocation + */ +enum { + DEFAULT_COMPLETION_LIMIT = 100, + MIN_COMPLETION_LIMIT = 10 +}; + +/* infiniband connection parameters */ +enum { + RETRY_COUNT = 3, + MIN_RNR_TIMER = 22, /* 20 ms */ + DEFAULT_PKEY = 0 /* pkey table index */ +}; + +enum { + SA_PATH_REC_GET_TIMEOUT = 1000, /* 1000 ms */ + MIN_SA_TIMEOUT = 100, /* 100 ms */ + MAX_SA_TIMEOUT = 20000 /* 20s */ +}; + +#define MAX_PARAM_VALUE 0x40000000 +#define VNIC_USE_RX_CSUM 1 +#define VNIC_USE_TX_CSUM 1 +#define DEFAULT_PREFER_PRIMARY 0 + +/* As per IBTA specification, IOCString Maximum length can be 512 bits. 
*/ +#define MAX_IOC_STRING_LEN (512/8) + +struct path_param { + __be64 ioc_guid; + u8 ioc_string[MAX_IOC_STRING_LEN+1]; + u8 port; + u8 instance; + struct ib_device *ibdev; + struct vnic_ib_port *ibport; + char name[IFNAMSIZ]; + u8 dgid[16]; + __be16 pkey; + int rx_csum; + int tx_csum; + int heartbeat; + int ib_multicast; +}; + +struct vnic_ib_config { + __be64 service_id; + struct vnic_connection_data conn_data; + u32 retry_count; + u32 rnr_retry_count; + u8 min_rnr_timer; + u32 num_sends; + u32 num_recvs; + u32 recv_scatter; /* 1 */ + u32 send_gather; /* 1 or 2 */ + u32 completion_limit; +}; + +struct control_config { + struct vnic_ib_config ib_config; + u32 num_recvs; + u8 vnic_instance; + u16 max_address_entries; + u16 min_address_entries; + u32 rsp_timeout; + u32 ib_multicast; +}; + +struct data_config { + struct vnic_ib_config ib_config; + u64 path_id; + u32 num_recvs; + u32 host_recv_pool_entries; + struct vnic_recv_pool_config host_min; + struct vnic_recv_pool_config host_max; + struct vnic_recv_pool_config eioc_min; + struct vnic_recv_pool_config eioc_max; + u32 notify_bundle; +}; + +struct viport_config { + struct viport *viport; + struct control_config control_config; + struct data_config data_config; + struct vnic_ib_path_info path_info; + u32 sa_path_rec_get_timeout; + struct ib_device *ibdev; + u32 port; + unsigned long stats_interval; + u32 hb_interval; + u32 hb_timeout; + __be64 ioc_guid; + u8 ioc_string[MAX_IOC_STRING_LEN+1]; + size_t path_idx; +}; + +/* + * primary_connect_timeout - if the secondary connects first, + * how long do we give the primary? + * primary_reconnect_timeout - same as above, but used when recovering + * from the case where both paths fail + * primary_switch_timeout - how long do we wait before switching to the + * primary when it comes back? + */ +struct vnic_config { + struct vnic *vnic; + char name[IFNAMSIZ]; + unsigned long no_path_timeout; + u32 primary_connect_timeout; + u32 primary_reconnect_timeout; + u32 primary_switch_timeout; + int prefer_primary; + int use_rx_csum; + int use_tx_csum; +}; + +int config_start(void); +struct viport_config *config_alloc_viport(struct path_param *params); +struct vnic_config *config_alloc_vnic(void); +char *config_viport_name(struct viport_config *config); + +#endif /* VNIC_CONFIG_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:35:29 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:05:29 +0530 Subject: [ofa-general] [PATCH v2 08/13] QLogic VNIC: sysfs interface implementation for the driver In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103529.12355.82570.stgit@localhost.localdomain> From: Amar Mudrankit The sysfs interface for the QLogic VNIC driver is implemented through this patch. 
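For illustration, driving these attributes from a shell looks roughly like
the following (the GUID, GID and P_Key values are made-up placeholders, and
the exact sysfs location of the per-port create_* files and of delete_vnic
depends on how the class and port devices get registered on a given system;
echo -n avoids a trailing newline ending up in the parsed name):

  echo -n "ioc_guid=66a0130000001234,dgid=fe80000000000000000566a000001234,pkey=ffff,name=eioc1" > /sys/class/infiniband_qlgc_vnic/<port-dev>/create_primary
  echo -n "eioc1" > /sys/class/infiniband_qlgc_vnic/<interface-dev>/delete_vnic

ioc_guid, dgid (exactly 32 hex digits), pkey and name are mandatory;
instance, rx_csum, tx_csum, heartbeat, ioc_string and ib_multicast are
optional.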
Signed-off-by: Amar Mudrankit Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath --- drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c | 1131 +++++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h | 62 + 2 files changed, 1193 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c new file mode 100644 index 0000000..312f37c --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c @@ -0,0 +1,1131 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_config.h" +#include "vnic_ib.h" +#include "vnic_viport.h" +#include "vnic_main.h" +#include "vnic_stats.h" + +/* + * target eiocs are added by writing + * + * ioc_guid=,dgid=,pkey=,name= + * to the create_primary sysfs attribute. 
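+ *
+ * For example (placeholder values only):
+ *
+ *   ioc_guid=66a0130000001234,dgid=fe80000000000000000566a000001234,pkey=ffff,name=eioc1
+ *
+ * with optional instance=, rx_csum=, tx_csum=, heartbeat=, ioc_string=
+ * and ib_multicast= keys appended as needed.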
+ */ +enum { + VNIC_OPT_ERR = 0, + VNIC_OPT_IOC_GUID = 1 << 0, + VNIC_OPT_DGID = 1 << 1, + VNIC_OPT_PKEY = 1 << 2, + VNIC_OPT_NAME = 1 << 3, + VNIC_OPT_INSTANCE = 1 << 4, + VNIC_OPT_RXCSUM = 1 << 5, + VNIC_OPT_TXCSUM = 1 << 6, + VNIC_OPT_HEARTBEAT = 1 << 7, + VNIC_OPT_IOC_STRING = 1 << 8, + VNIC_OPT_IB_MULTICAST = 1 << 9, + VNIC_OPT_ALL = (VNIC_OPT_IOC_GUID | + VNIC_OPT_DGID | VNIC_OPT_NAME | VNIC_OPT_PKEY), +}; + +static match_table_t vnic_opt_tokens = { + {VNIC_OPT_IOC_GUID, "ioc_guid=%s"}, + {VNIC_OPT_DGID, "dgid=%s"}, + {VNIC_OPT_PKEY, "pkey=%x"}, + {VNIC_OPT_NAME, "name=%s"}, + {VNIC_OPT_INSTANCE, "instance=%d"}, + {VNIC_OPT_RXCSUM, "rx_csum=%s"}, + {VNIC_OPT_TXCSUM, "tx_csum=%s"}, + {VNIC_OPT_HEARTBEAT, "heartbeat=%d"}, + {VNIC_OPT_IOC_STRING, "ioc_string=\"%s"}, + {VNIC_OPT_IB_MULTICAST, "ib_multicast=%s"}, + {VNIC_OPT_ERR, NULL} +}; + +void vnic_release_dev(struct device *dev) +{ + struct dev_info *dev_info = + container_of(dev, struct dev_info, dev); + + complete(&dev_info->released); + +} + +struct class vnic_class = { + .name = "infiniband_qlgc_vnic", + .dev_release = vnic_release_dev +}; + +struct dev_info interface_dev; + +DEVICE_ATTR(create_primary, S_IWUSR, NULL, vnic_create_primary); +DEVICE_ATTR(create_secondary, S_IWUSR, NULL, vnic_create_secondary); +DEVICE_ATTR(delete_vnic, S_IWUSR, NULL, vnic_delete); + +static int vnic_parse_options(const char *buf, struct path_param *param) +{ + char *options, *sep_opt; + char *p; + char dgid[3]; + substring_t args[MAX_OPT_ARGS]; + int opt_mask = 0; + int token; + int ret = -EINVAL; + int i, len; + + options = kstrdup(buf, GFP_KERNEL); + if (!options) + return -ENOMEM; + + sep_opt = options; + while ((p = strsep(&sep_opt, ",")) != NULL) { + if (!*p) + continue; + + token = match_token(p, vnic_opt_tokens, args); + opt_mask |= token; + + switch (token) { + case VNIC_OPT_IOC_GUID: + p = match_strdup(args); + param->ioc_guid = cpu_to_be64(simple_strtoull(p, NULL, + 16)); + kfree(p); + break; + + case VNIC_OPT_DGID: + p = match_strdup(args); + if (strlen(p) != 32) { + printk(KERN_WARNING PFX + "bad dest GID parameter '%s'\n", p); + kfree(p); + goto out; + } + + for (i = 0; i < 16; ++i) { + strlcpy(dgid, p + i * 2, 3); + param->dgid[i] = simple_strtoul(dgid, NULL, + 16); + + } + kfree(p); + break; + + case VNIC_OPT_PKEY: + if (match_hex(args, &token)) { + printk(KERN_WARNING PFX + "bad P_key parameter '%s'\n", p); + goto out; + } + param->pkey = cpu_to_be16(token); + break; + + case VNIC_OPT_NAME: + p = match_strdup(args); + if (strlen(p) >= IFNAMSIZ) { + printk(KERN_WARNING PFX + "interface name parameter too long\n"); + kfree(p); + goto out; + } + strcpy(param->name, p); + kfree(p); + break; + case VNIC_OPT_INSTANCE: + if (match_int(args, &token)) { + printk(KERN_WARNING PFX + "bad instance parameter '%s'\n", p); + goto out; + } + + if (token > 255 || token < 0) { + printk(KERN_WARNING PFX + "instance parameter must be" + " >= 0 and <= 255\n"); + goto out; + } + + param->instance = token; + break; + case VNIC_OPT_RXCSUM: + p = match_strdup(args); + if (!strncmp(p, "true", 4)) + param->rx_csum = 1; + else if (!strncmp(p, "false", 5)) + param->rx_csum = 0; + else { + printk(KERN_WARNING PFX + "bad rx_csum parameter." + " must be 'true' or 'false'\n"); + kfree(p); + goto out; + } + kfree(p); + break; + case VNIC_OPT_TXCSUM: + p = match_strdup(args); + if (!strncmp(p, "true", 4)) + param->tx_csum = 1; + else if (!strncmp(p, "false", 5)) + param->tx_csum = 0; + else { + printk(KERN_WARNING PFX + "bad tx_csum parameter." 
+ " must be 'true' or 'false'\n"); + kfree(p); + goto out; + } + kfree(p); + break; + case VNIC_OPT_HEARTBEAT: + if (match_int(args, &token)) { + printk(KERN_WARNING PFX + "bad instance parameter '%s'\n", p); + goto out; + } + + if (token > 6000 || token <= 0) { + printk(KERN_WARNING PFX + "heartbeat parameter must be" + " > 0 and <= 6000\n"); + goto out; + } + param->heartbeat = token; + break; + case VNIC_OPT_IOC_STRING: + p = match_strdup(args); + len = strlen(p); + if (len > MAX_IOC_STRING_LEN) { + printk(KERN_WARNING PFX + "ioc string parameter too long\n"); + kfree(p); + goto out; + } + strcpy(param->ioc_string, p); + if (*(p + len - 1) != '\"') { + strcat(param->ioc_string, ","); + kfree(p); + p = strsep(&sep_opt, "\""); + strcat(param->ioc_string, p); + sep_opt++; + } else { + *(param->ioc_string + len - 1) = '\0'; + kfree(p); + } + break; + case VNIC_OPT_IB_MULTICAST: + p = match_strdup(args); + if (!strncmp(p, "true", 4)) + param->ib_multicast = 1; + else if (!strncmp(p, "false", 5)) + param->ib_multicast = 0; + else { + printk(KERN_WARNING PFX + "bad ib_multicast parameter." + " must be 'true' or 'false'\n"); + kfree(p); + goto out; + } + kfree(p); + break; + default: + printk(KERN_WARNING PFX + "unknown parameter or missing value " + "'%s' in target creation request\n", p); + goto out; + } + + } + + if ((opt_mask & VNIC_OPT_ALL) == VNIC_OPT_ALL) + ret = 0; + else + for (i = 0; i < ARRAY_SIZE(vnic_opt_tokens); ++i) + if ((vnic_opt_tokens[i].token & VNIC_OPT_ALL) && + !(vnic_opt_tokens[i].token & opt_mask)) + printk(KERN_WARNING PFX + "target creation request is " + "missing parameter '%s'\n", + vnic_opt_tokens[i].pattern); + +out: + kfree(options); + return ret; + +} + +static ssize_t show_vnic_state(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, dev_info); + switch (vnic->state) { + case VNIC_UNINITIALIZED: + return sprintf(buf, "VNIC_UNINITIALIZED\n"); + case VNIC_REGISTERED: + return sprintf(buf, "VNIC_REGISTERED\n"); + default: + return sprintf(buf, "INVALID STATE\n"); + } + +} + +static DEVICE_ATTR(vnic_state, S_IRUGO, show_vnic_state, NULL); + +static ssize_t show_rx_csum(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, dev_info); + + if (vnic->config->use_rx_csum) + return sprintf(buf, "true\n"); + else + return sprintf(buf, "false\n"); +} + +static DEVICE_ATTR(rx_csum, S_IRUGO, show_rx_csum, NULL); + +static ssize_t show_tx_csum(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, dev_info); + + if (vnic->config->use_tx_csum) + return sprintf(buf, "true\n"); + else + return sprintf(buf, "false\n"); +} + +static DEVICE_ATTR(tx_csum, S_IRUGO, show_tx_csum, NULL); + +static ssize_t show_current_path(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, dev_info); + unsigned long flags; + size_t length; + + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (vnic->current_path == &vnic->primary_path) + length = sprintf(buf, "primary_path\n"); + else if (vnic->current_path == &vnic->secondary_path) + length = 
sprintf(buf, "secondary path\n"); + else + length = sprintf(buf, "none\n"); + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + return length; +} + +static DEVICE_ATTR(current_path, S_IRUGO, show_current_path, NULL); + +static struct attribute *vnic_dev_attrs[] = { + &dev_attr_vnic_state.attr, + &dev_attr_rx_csum.attr, + &dev_attr_tx_csum.attr, + &dev_attr_current_path.attr, + NULL +}; + +struct attribute_group vnic_dev_attr_group = { + .attrs = vnic_dev_attrs, +}; + +static inline void print_dgid(u8 *dgid) +{ + int i; + + for (i = 0; i < 16; i += 2) + printk("%04x", be16_to_cpu(*(__be16 *)&dgid[i])); +} + +static inline int is_dgid_zero(u8 *dgid) +{ + int i; + + for (i = 0; i < 16; i++) { + if (dgid[i] != 0) + return 1; + } + return 0; +} + +static int create_netpath(struct netpath *npdest, + struct path_param *p_params) +{ + struct viport_config *viport_config; + struct viport *viport; + struct vnic *vnic; + struct list_head *ptr; + int ret = 0; + + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (vnic->primary_path.viport) { + viport_config = vnic->primary_path.viport->config; + if ((viport_config->ioc_guid == p_params->ioc_guid) + && (viport_config->control_config.vnic_instance + == p_params->instance) + && (be64_to_cpu(p_params->ioc_guid))) { + SYS_ERROR("GUID %llx," + " INSTANCE %d already in use\n", + be64_to_cpu(p_params->ioc_guid), + p_params->instance); + ret = -EINVAL; + goto out; + } + } + + if (vnic->secondary_path.viport) { + viport_config = vnic->secondary_path.viport->config; + if ((viport_config->ioc_guid == p_params->ioc_guid) + && (viport_config->control_config.vnic_instance + == p_params->instance) + && (be64_to_cpu(p_params->ioc_guid))) { + SYS_ERROR("GUID %llx," + " INSTANCE %d already in use\n", + be64_to_cpu(p_params->ioc_guid), + p_params->instance); + ret = -EINVAL; + goto out; + } + } + } + + if (npdest->viport) { + SYS_ERROR("create_netpath: path already exists\n"); + ret = -EINVAL; + goto out; + } + + viport_config = config_alloc_viport(p_params); + if (!viport_config) { + SYS_ERROR("create_netpath: failed creating viport config\n"); + ret = -1; + goto out; + } + + /*User specified heartbeat value is in 1/100s of a sec*/ + if (p_params->heartbeat != -1) { + viport_config->hb_interval = + msecs_to_jiffies(p_params->heartbeat * 10); + viport_config->hb_timeout = + (p_params->heartbeat << 6) * 10000; /* usec */ + } + + viport_config->path_idx = 0; + + viport = viport_allocate(viport_config); + if (!viport) { + SYS_ERROR("create_netpath: failed creating viport\n"); + kfree(viport_config); + ret = -1; + goto out; + } + + npdest->viport = viport; + viport->parent = npdest; + viport->vnic = npdest->parent; + + if (is_dgid_zero(p_params->dgid) && p_params->ioc_guid != 0 + && p_params->pkey != 0) { + viport_kick(viport); + vnic_disconnected(npdest->parent, npdest); + } else { + printk(KERN_WARNING "Specified parameters IOCGUID=%llx, " + "P_Key=%x, DGID=", be64_to_cpu(p_params->ioc_guid), + p_params->pkey); + print_dgid(p_params->dgid); + printk(" insufficient for establishing %s path for interface " + "%s. Hence, path will not be established.\n", + (npdest->second_bias ? 
"secondary" : "primary"), + p_params->name); + } +out: + return ret; +} + +static struct vnic *create_vnic(struct path_param *param) +{ + struct vnic_config *vnic_config; + struct vnic *vnic; + struct list_head *ptr; + + SYS_INFO("create_vnic: name = %s\n", param->name); + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (!strcmp(vnic->config->name, param->name)) { + SYS_ERROR("vnic %s already exists\n", + param->name); + return NULL; + } + } + + vnic_config = config_alloc_vnic(); + if (!vnic_config) { + SYS_ERROR("create_vnic: failed creating vnic config\n"); + return NULL; + } + + if (param->rx_csum != -1) + vnic_config->use_rx_csum = param->rx_csum; + + if (param->tx_csum != -1) + vnic_config->use_tx_csum = param->tx_csum; + + strcpy(vnic_config->name, param->name); + vnic = vnic_allocate(vnic_config); + if (!vnic) { + SYS_ERROR("create_vnic: failed allocating vnic\n"); + goto free_vnic_config; + } + + init_completion(&vnic->dev_info.released); + + vnic->dev_info.dev.class = NULL; + vnic->dev_info.dev.parent = &interface_dev.dev; + vnic->dev_info.dev.release = vnic_release_dev; + snprintf(vnic->dev_info.dev.bus_id, BUS_ID_SIZE, + vnic_config->name); + + if (device_register(&vnic->dev_info.dev)) { + SYS_ERROR("create_vnic: error in registering" + " vnic class dev\n"); + goto free_vnic; + } + + if (sysfs_create_group(&vnic->dev_info.dev.kobj, + &vnic_dev_attr_group)) { + SYS_ERROR("create_vnic: error in creating" + "vnic attr group\n"); + goto err_attr; + + } + + if (vnic_setup_stats_files(vnic)) + goto err_stats; + + return vnic; +err_stats: + sysfs_remove_group(&vnic->dev_info.dev.kobj, + &vnic_dev_attr_group); +err_attr: + device_unregister(&vnic->dev_info.dev); + wait_for_completion(&vnic->dev_info.released); +free_vnic: + list_del(&vnic->list_ptrs); + kfree(vnic); +free_vnic_config: + kfree(vnic_config); + return NULL; +} + +ssize_t vnic_delete(struct device *dev, struct device_attribute *dev_attr, + const char *buf, size_t count) +{ + struct vnic *vnic; + struct list_head *ptr; + int ret = -EINVAL; + + if (count > IFNAMSIZ) { + printk(KERN_WARNING PFX "invalid vnic interface name\n"); + return ret; + } + + SYS_INFO("vnic_delete: name = %s\n", buf); + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (!strcmp(vnic->config->name, buf)) { + vnic_free(vnic); + return count; + } + } + + printk(KERN_WARNING PFX "vnic interface '%s' does not exist\n", buf); + return ret; +} + +static ssize_t show_viport_state(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct netpath *path = container_of(info, struct netpath, dev_info); + switch (path->viport->state) { + case VIPORT_DISCONNECTED: + return sprintf(buf, "VIPORT_DISCONNECTED\n"); + case VIPORT_CONNECTED: + return sprintf(buf, "VIPORT_CONNECTED\n"); + default: + return sprintf(buf, "INVALID STATE\n"); + } + +} + +static DEVICE_ATTR(viport_state, S_IRUGO, show_viport_state, NULL); + +static ssize_t show_link_state(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct netpath *path = container_of(info, struct netpath, dev_info); + + switch (path->viport->link_state) { + case LINK_UNINITIALIZED: + return sprintf(buf, "LINK_UNINITIALIZED\n"); + case LINK_INITIALIZE: + return sprintf(buf, "LINK_INITIALIZE\n"); + case LINK_INITIALIZECONTROL: + return sprintf(buf, 
"LINK_INITIALIZECONTROL\n"); + case LINK_INITIALIZEDATA: + return sprintf(buf, "LINK_INITIALIZEDATA\n"); + case LINK_CONTROLCONNECT: + return sprintf(buf, "LINK_CONTROLCONNECT\n"); + case LINK_CONTROLCONNECTWAIT: + return sprintf(buf, "LINK_CONTROLCONNECTWAIT\n"); + case LINK_INITVNICREQ: + return sprintf(buf, "LINK_INITVNICREQ\n"); + case LINK_INITVNICRSP: + return sprintf(buf, "LINK_INITVNICRSP\n"); + case LINK_BEGINDATAPATH: + return sprintf(buf, "LINK_BEGINDATAPATH\n"); + case LINK_CONFIGDATAPATHREQ: + return sprintf(buf, "LINK_CONFIGDATAPATHREQ\n"); + case LINK_CONFIGDATAPATHRSP: + return sprintf(buf, "LINK_CONFIGDATAPATHRSP\n"); + case LINK_DATACONNECT: + return sprintf(buf, "LINK_DATACONNECT\n"); + case LINK_DATACONNECTWAIT: + return sprintf(buf, "LINK_DATACONNECTWAIT\n"); + case LINK_XCHGPOOLREQ: + return sprintf(buf, "LINK_XCHGPOOLREQ\n"); + case LINK_XCHGPOOLRSP: + return sprintf(buf, "LINK_XCHGPOOLRSP\n"); + case LINK_INITIALIZED: + return sprintf(buf, "LINK_INITIALIZED\n"); + case LINK_IDLE: + return sprintf(buf, "LINK_IDLE\n"); + case LINK_IDLING: + return sprintf(buf, "LINK_IDLING\n"); + case LINK_CONFIGLINKREQ: + return sprintf(buf, "LINK_CONFIGLINKREQ\n"); + case LINK_CONFIGLINKRSP: + return sprintf(buf, "LINK_CONFIGLINKRSP\n"); + case LINK_CONFIGADDRSREQ: + return sprintf(buf, "LINK_CONFIGADDRSREQ\n"); + case LINK_CONFIGADDRSRSP: + return sprintf(buf, "LINK_CONFIGADDRSRSP\n"); + case LINK_REPORTSTATREQ: + return sprintf(buf, "LINK_REPORTSTATREQ\n"); + case LINK_REPORTSTATRSP: + return sprintf(buf, "LINK_REPORTSTATRSP\n"); + case LINK_HEARTBEATREQ: + return sprintf(buf, "LINK_HEARTBEATREQ\n"); + case LINK_HEARTBEATRSP: + return sprintf(buf, "LINK_HEARTBEATRSP\n"); + case LINK_RESET: + return sprintf(buf, "LINK_RESET\n"); + case LINK_RESETRSP: + return sprintf(buf, "LINK_RESETRSP\n"); + case LINK_RESETCONTROL: + return sprintf(buf, "LINK_RESETCONTROL\n"); + case LINK_RESETCONTROLRSP: + return sprintf(buf, "LINK_RESETCONTROLRSP\n"); + case LINK_DATADISCONNECT: + return sprintf(buf, "LINK_DATADISCONNECT\n"); + case LINK_CONTROLDISCONNECT: + return sprintf(buf, "LINK_CONTROLDISCONNECT\n"); + case LINK_CLEANUPDATA: + return sprintf(buf, "LINK_CLEANUPDATA\n"); + case LINK_CLEANUPCONTROL: + return sprintf(buf, "LINK_CLEANUPCONTROL\n"); + case LINK_DISCONNECTED: + return sprintf(buf, "LINK_DISCONNECTED\n"); + case LINK_RETRYWAIT: + return sprintf(buf, "LINK_RETRYWAIT\n"); + default: + return sprintf(buf, "INVALID STATE\n"); + + } + +} +static DEVICE_ATTR(link_state, S_IRUGO, show_link_state, NULL); + +static ssize_t show_heartbeat(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + /* hb_inteval is in jiffies, convert it back to + * 1/100ths of a second + */ + return sprintf(buf, "%d\n", + (jiffies_to_msecs(path->viport->config->hb_interval)/10)); +} + +static DEVICE_ATTR(heartbeat, S_IRUGO, show_heartbeat, NULL); + +static ssize_t show_ioc_guid(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + return sprintf(buf, "%llx\n", + __be64_to_cpu(path->viport->config->ioc_guid)); +} + +static DEVICE_ATTR(ioc_guid, S_IRUGO, show_ioc_guid, NULL); + +static inline void get_dgid_string(u8 *dgid, char *buf) +{ + int i; + char holder[5]; + + for (i = 0; i < 16; i += 2) { + 
sprintf(holder, "%04x", be16_to_cpu(*(__be16 *)&dgid[i])); + strcat(buf, holder); + } + + strcat(buf, "\n"); +} + +static ssize_t show_dgid(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + get_dgid_string(path->viport->config->path_info.path.dgid.raw, buf); + + return strlen(buf); +} + +static DEVICE_ATTR(dgid, S_IRUGO, show_dgid, NULL); + +static ssize_t show_pkey(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + return sprintf(buf, "%x\n", path->viport->config->path_info.path.pkey); +} + +static DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); + +static ssize_t show_hca_info(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + return sprintf(buf, "vnic-%s-%d\n", path->viport->config->ibdev->name, + path->viport->config->port); +} + +static DEVICE_ATTR(hca_info, S_IRUGO, show_hca_info, NULL); + +static ssize_t show_ioc_string(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + return sprintf(buf, "%s\n", path->viport->config->ioc_string); +} + +static DEVICE_ATTR(ioc_string, S_IRUGO, show_ioc_string, NULL); + +static ssize_t show_multicast_state(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + if (!(path->viport->features_supported & VNIC_FEAT_INBOUND_IB_MC)) + return sprintf(buf, "feature not enabled\n"); + + switch (path->viport->mc_info.state) { + case MCAST_STATE_INVALID: + return sprintf(buf, "state=Invalid\n"); + case MCAST_STATE_JOINING: + return sprintf(buf, "state=Joining MGID:" VNIC_GID_FMT "\n", + VNIC_GID_RAW_ARG(path->viport->mc_info.mgid.raw)); + case MCAST_STATE_ATTACHING: + return sprintf(buf, "state=Attaching MGID:" VNIC_GID_FMT + " MLID:%X\n", + VNIC_GID_RAW_ARG(path->viport->mc_info.mgid.raw), + path->viport->mc_info.mlid); + case MCAST_STATE_JOINED_ATTACHED: + return sprintf(buf, + "state=Joined & Attached MGID:" VNIC_GID_FMT + " MLID:%X\n", + VNIC_GID_RAW_ARG(path->viport->mc_info.mgid.raw), + path->viport->mc_info.mlid); + case MCAST_STATE_DETACHING: + return sprintf(buf, "state=Detaching MGID: " VNIC_GID_FMT "\n", + VNIC_GID_RAW_ARG(path->viport->mc_info.mgid.raw)); + case MCAST_STATE_RETRIED: + return sprintf(buf, "state=Retries Exceeded\n"); + } + return sprintf(buf, "invalid state\n"); +} + +static DEVICE_ATTR(multicast_state, S_IRUGO, show_multicast_state, NULL); + +static struct attribute *vnic_path_attrs[] = { + &dev_attr_viport_state.attr, + &dev_attr_link_state.attr, + &dev_attr_heartbeat.attr, + &dev_attr_ioc_guid.attr, + &dev_attr_dgid.attr, + &dev_attr_pkey.attr, + &dev_attr_hca_info.attr, + &dev_attr_ioc_string.attr, + &dev_attr_multicast_state.attr, + NULL +}; + +struct attribute_group vnic_path_attr_group = { + .attrs = vnic_path_attrs, +}; + + +static int setup_path_class_files(struct netpath *path, char *name) +{ + 
init_completion(&path->dev_info.released);
+
+	path->dev_info.dev.class = NULL;
+	path->dev_info.dev.parent = &path->parent->dev_info.dev;
+	path->dev_info.dev.release = vnic_release_dev;
+	snprintf(path->dev_info.dev.bus_id, BUS_ID_SIZE, "%s", name);
+
+	if (device_register(&path->dev_info.dev)) {
+		SYS_ERROR("error in registering path class dev\n");
+		goto out;
+	}
+
+	if (sysfs_create_group(&path->dev_info.dev.kobj,
+			       &vnic_path_attr_group)) {
+		SYS_ERROR("error in creating vnic path group attrs");
+		goto err_path;
+	}
+
+	return 0;
+
+err_path:
+	device_unregister(&path->dev_info.dev);
+	wait_for_completion(&path->dev_info.released);
+out:
+	return -1;
+
+}
+
+static inline void update_dgids(u8 *old, u8 *new, char *vnic_name,
+				char *path_name)
+{
+	int i;
+
+	if (!memcmp(old, new, 16))
+		return;
+
+	printk(KERN_INFO PFX "Changing dgid from 0x");
+	print_dgid(old);
+	printk(" to 0x");
+	print_dgid(new);
+	printk(" for %s path of %s\n", path_name, vnic_name);
+	for (i = 0; i < 16; i++)
+		old[i] = new[i];
+}
+
+static inline void update_ioc_guids(struct path_param *params,
+				    struct netpath *path,
+				    char *vnic_name, char *path_name)
+{
+	u64 sid;
+
+	if (path->viport->config->ioc_guid == params->ioc_guid)
+		return;
+
+	printk(KERN_INFO PFX "Changing IOC GUID from 0x%llx to 0x%llx "
+	       "for %s path of %s\n",
+	       __be64_to_cpu(path->viport->config->ioc_guid),
+	       __be64_to_cpu(params->ioc_guid), path_name, vnic_name);
+
+	path->viport->config->ioc_guid = params->ioc_guid;
+
+	sid = (SST_AGN << 56) | (SST_OUI << 32) | (CONTROL_PATH_ID << 8)
+		| IOC_NUMBER(be64_to_cpu(params->ioc_guid));
+
+	path->viport->config->control_config.ib_config.service_id =
+		cpu_to_be64(sid);
+
+	sid = (SST_AGN << 56) | (SST_OUI << 32) | (DATA_PATH_ID << 8)
+		| IOC_NUMBER(be64_to_cpu(params->ioc_guid));
+
+	path->viport->config->data_config.ib_config.service_id =
+		cpu_to_be64(sid);
+}
+
+static inline void update_pkeys(__be16 *old, __be16 *new, char *vnic_name,
+				char *path_name)
+{
+	if (*old == *new)
+		return;
+
+	printk(KERN_INFO PFX "Changing P_Key from 0x%x to 0x%x "
+	       "for %s path of %s\n", be16_to_cpu(*old), be16_to_cpu(*new),
+	       path_name, vnic_name);
+	*old = *new;
+}
+
+static void update_ioc_strings(struct path_param *params, struct netpath *path,
+			       char *path_name)
+{
+	if (!strcmp(params->ioc_string, path->viport->config->ioc_string))
+		return;
+
+	printk(KERN_INFO PFX "Changing ioc_string to %s for %s path of %s\n",
+	       params->ioc_string, path_name, params->name);
+
+	strcpy(path->viport->config->ioc_string, params->ioc_string);
+}
+
+static void update_path_parameters(struct path_param *params,
+				   struct netpath *path)
+{
+	update_dgids(path->viport->config->path_info.path.dgid.raw,
+		     params->dgid, params->name,
+		     (path->second_bias ? "secondary" : "primary"));
+
+	update_ioc_guids(params, path, params->name,
+			 (path->second_bias ? "secondary" : "primary"));
+
+	update_pkeys(&path->viport->config->path_info.path.pkey,
+		     &params->pkey, params->name,
+		     (path->second_bias ? "secondary" : "primary"));
+
+	update_ioc_strings(params, path,
+			   (path->second_bias ? "secondary" : "primary"));
+}
+
+static ssize_t update_params_and_connect(struct path_param *params,
+					 struct netpath *path, size_t count)
+{
+	if (is_dgid_zero(params->dgid) && params->ioc_guid != 0 &&
+	    params->pkey != 0) {
+
+		if (!memcmp(path->viport->config->path_info.path.dgid.raw,
+			    params->dgid, 16) &&
+		    params->ioc_guid == path->viport->config->ioc_guid &&
+		    params->pkey == path->viport->config->path_info.path.pkey) {
+
+			printk(KERN_WARNING PFX "dgid, ioc_guid and pkey are "
+			       "the same as the existing ones."
+			       " Not updating values.\n");
+			return -EINVAL;
+		} else {
+			if (path->viport->state == VIPORT_CONNECTED) {
+				printk(KERN_WARNING PFX "%s path of %s "
+				       "interface is already in connected "
+				       "state. Not updating values.\n",
+				       (path->second_bias ? "Secondary" : "Primary"),
+				       path->parent->config->name);
+				return -EINVAL;
+			} else {
+				update_path_parameters(params, path);
+				viport_kick(path->viport);
+				vnic_disconnected(path->parent, path);
+				return count;
+			}
+		}
+	} else {
+		printk(KERN_WARNING PFX "One of dgid, ioc_guid or pkey is "
+		       "zero. Not updating values.\n");
+		return -EINVAL;
+	}
+}
+
+ssize_t vnic_create_primary(struct device *dev,
+			    struct device_attribute *dev_attr, const char *buf,
+			    size_t count)
+{
+	struct dev_info *info = container_of(dev, struct dev_info, dev);
+	struct vnic_ib_port *target =
+		container_of(info, struct vnic_ib_port, pdev_info);
+
+	struct path_param param;
+	int ret = -EINVAL;
+	struct vnic *vnic;
+	struct list_head *ptr;
+
+	param.instance = 0;
+	param.rx_csum = -1;
+	param.tx_csum = -1;
+	param.heartbeat = -1;
+	param.ib_multicast = -1;
+	*param.ioc_string = '\0';
+
+	ret = vnic_parse_options(buf, &param);
+
+	if (ret)
+		goto out;
+
+	list_for_each(ptr, &vnic_list) {
+		vnic = list_entry(ptr, struct vnic, list_ptrs);
+		if (!strcmp(vnic->config->name, param.name)) {
+			ret = update_params_and_connect(&param,
+							&vnic->primary_path,
+							count);
+			goto out;
+		}
+	}
+
+	param.ibdev = target->dev->dev;
+	param.ibport = target;
+	param.port = target->port_num;
+
+	vnic = create_vnic(&param);
+	if (!vnic) {
+		printk(KERN_ERR PFX "creating vnic failed\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (create_netpath(&vnic->primary_path, &param)) {
+		printk(KERN_ERR PFX "creating primary netpath failed\n");
+		goto free_vnic;
+	}
+
+	if (setup_path_class_files(&vnic->primary_path, "primary_path"))
+		goto free_vnic;
+
+	if (vnic && !vnic->primary_path.viport) {
+		printk(KERN_ERR PFX "no valid netpaths\n");
+		goto free_vnic;
+	}
+
+	return count;
+
+free_vnic:
+	vnic_free(vnic);
+	ret = -EINVAL;
+out:
+	return ret;
+}
+
+ssize_t vnic_create_secondary(struct device *dev,
+			      struct device_attribute *dev_attr,
+			      const char *buf, size_t count)
+{
+	struct dev_info *info = container_of(dev, struct dev_info, dev);
+	struct vnic_ib_port *target =
+		container_of(info, struct vnic_ib_port, pdev_info);
+
+	struct path_param param;
+	struct vnic *vnic = NULL;
+	int ret = -EINVAL;
+	struct list_head *ptr;
+	int found = 0;
+
+	param.instance = 0;
+	param.rx_csum = -1;
+	param.tx_csum = -1;
+	param.heartbeat = -1;
+	param.ib_multicast = -1;
+	*param.ioc_string = '\0';
+
+	ret = vnic_parse_options(buf, &param);
+
+	if (ret)
+		goto out;
+
+	list_for_each(ptr, &vnic_list) {
+		vnic = list_entry(ptr, struct vnic, list_ptrs);
+		if (!strncmp(vnic->config->name, param.name, IFNAMSIZ)) {
+			if (vnic->secondary_path.viport) {
+				ret = update_params_and_connect(&param,
+								&vnic->secondary_path,
+								count);
+				goto out;
+			}
+			found = 1;
+			break;
+		}
+	}
+
+	if (!found) {
+		printk(KERN_ERR PFX
+		       "primary connection with name '%s' does not exist\n",
+		       param.name);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	param.ibdev = target->dev->dev;
+	param.ibport = target;
+	param.port = target->port_num;
+
+	if (create_netpath(&vnic->secondary_path, &param)) {
+		printk(KERN_ERR PFX "creating secondary netpath failed\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (setup_path_class_files(&vnic->secondary_path, "secondary_path"))
+		goto free_vnic;
+
+	return count;
+
+free_vnic:
+	vnic_free(vnic);
+	ret = -EINVAL;
+out:
+	return ret;
+}
diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h
new file mode 100644
index 0000000..b41e770
--- /dev/null
+++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h
@@ -0,0 +1,62 @@
+/*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *	- Redistributions of source code must retain the above
+ *	  copyright notice, this list of conditions and the following
+ *	  disclaimer.
+ *
+ *	- Redistributions in binary form must reproduce the above
+ *	  copyright notice, this list of conditions and the following
+ *	  disclaimer in the documentation and/or other materials
+ *	  provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */ + +#ifndef VNIC_SYS_H_INCLUDED +#define VNIC_SYS_H_INCLUDED + +struct dev_info { + struct device dev; + struct completion released; +}; + +extern struct class vnic_class; +extern struct dev_info interface_dev; +extern struct attribute_group vnic_dev_attr_group; +extern struct attribute_group vnic_path_attr_group; +extern struct device_attribute dev_attr_create_primary; +extern struct device_attribute dev_attr_create_secondary; +extern struct device_attribute dev_attr_delete_vnic; + +extern void vnic_release_dev(struct device *dev); + +extern ssize_t vnic_create_primary(struct device *dev, + struct device_attribute *dev_attr, + const char *buf, size_t count); + +extern ssize_t vnic_create_secondary(struct device *dev, + struct device_attribute *dev_attr, + const char *buf, size_t count); + +extern ssize_t vnic_delete(struct device *dev, + struct device_attribute *dev_attr, + const char *buf, size_t count); +#endif /*VNIC_SYS_H_INCLUDED*/ From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:36:29 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:06:29 +0530 Subject: [ofa-general] [PATCH v2 10/13] QLogic VNIC: Driver Statistics collection In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103629.12355.46869.stgit@localhost.localdomain> From: Amar Mudrankit Collection of statistics about QLogic VNIC interfaces is implemented in this patch. Signed-off-by: Amar Mudrankit Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath --- drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c | 234 ++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h | 497 +++++++++++++++++++++++++ 2 files changed, 731 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c new file mode 100644 index 0000000..d11a8df --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c @@ -0,0 +1,234 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include + +#include "vnic_main.h" + +cycles_t vnic_recv_ref; + +/* + * TODO: Statistics reporting for control path, data path, + * RDMA times, IOs etc + * + */ +static ssize_t show_lifetime(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + cycles_t time = get_cycles() - vnic->statistics.start_time; + + return sprintf(buf, "%llu\n", (unsigned long long)time); +} + +static DEVICE_ATTR(lifetime, S_IRUGO, show_lifetime, NULL); + +static ssize_t show_conntime(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + if (vnic->statistics.conn_time) + return sprintf(buf, "%llu\n", + (unsigned long long)vnic->statistics.conn_time); + return 0; +} + +static DEVICE_ATTR(connection_time, S_IRUGO, show_conntime, NULL); + +static ssize_t show_disconnects(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + u32 num; + + if (vnic->statistics.disconn_ref) + num = vnic->statistics.disconn_num + 1; + else + num = vnic->statistics.disconn_num; + + return sprintf(buf, "%d\n", num); +} + +static DEVICE_ATTR(disconnects, S_IRUGO, show_disconnects, NULL); + +static ssize_t show_total_disconn_time(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + cycles_t time; + + if (vnic->statistics.disconn_ref) + time = vnic->statistics.disconn_time + + get_cycles() - vnic->statistics.disconn_ref; + else + time = vnic->statistics.disconn_time; + + return sprintf(buf, "%llu\n", (unsigned long long)time); +} + +static DEVICE_ATTR(total_disconn_time, S_IRUGO, show_total_disconn_time, NULL); + +static ssize_t show_carrier_losses(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + u32 num; + + if (vnic->statistics.carrier_ref) + num = vnic->statistics.carrier_off_num + 1; + else + num = vnic->statistics.carrier_off_num; + + return sprintf(buf, "%d\n", num); +} + +static DEVICE_ATTR(carrier_losses, S_IRUGO, show_carrier_losses, NULL); + +static ssize_t show_total_carr_loss_time(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + cycles_t time; + + if (vnic->statistics.carrier_ref) + time = vnic->statistics.carrier_off_time + + get_cycles() - vnic->statistics.carrier_ref; + else + time = vnic->statistics.carrier_off_time; + + return sprintf(buf, "%llu\n", (unsigned long long)time); +} + +static DEVICE_ATTR(total_carrier_loss_time, S_IRUGO, + show_total_carr_loss_time, NULL); + +static ssize_t show_total_recv_time(struct 
device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%llu\n", + (unsigned long long)vnic->statistics.recv_time); +} + +static DEVICE_ATTR(total_recv_time, S_IRUGO, show_total_recv_time, NULL); + +static ssize_t show_recvs(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.recv_num); +} + +static DEVICE_ATTR(recvs, S_IRUGO, show_recvs, NULL); + +static ssize_t show_multicast_recvs(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.multicast_recv_num); +} + +static DEVICE_ATTR(multicast_recvs, S_IRUGO, show_multicast_recvs, NULL); + +static ssize_t show_total_xmit_time(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%llu\n", + (unsigned long long)vnic->statistics.xmit_time); +} + +static DEVICE_ATTR(total_xmit_time, S_IRUGO, show_total_xmit_time, NULL); + +static ssize_t show_xmits(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.xmit_num); +} + +static DEVICE_ATTR(xmits, S_IRUGO, show_xmits, NULL); + +static ssize_t show_failed_xmits(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.xmit_fail); +} + +static DEVICE_ATTR(failed_xmits, S_IRUGO, show_failed_xmits, NULL); + +static struct attribute *vnic_stats_attrs[] = { + &dev_attr_lifetime.attr, + &dev_attr_xmits.attr, + &dev_attr_total_xmit_time.attr, + &dev_attr_failed_xmits.attr, + &dev_attr_recvs.attr, + &dev_attr_multicast_recvs.attr, + &dev_attr_total_recv_time.attr, + &dev_attr_connection_time.attr, + &dev_attr_disconnects.attr, + &dev_attr_total_disconn_time.attr, + &dev_attr_carrier_losses.attr, + &dev_attr_total_carrier_loss_time.attr, + NULL +}; + +struct attribute_group vnic_stats_attr_group = { + .attrs = vnic_stats_attrs, +}; diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h new file mode 100644 index 0000000..a241b71 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h @@ -0,0 +1,497 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_STATS_H_INCLUDED +#define VNIC_STATS_H_INCLUDED + +#include "vnic_main.h" +#include "vnic_ib.h" +#include "vnic_sys.h" + +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + +static inline void vnic_connected_stats(struct vnic *vnic) +{ + if (vnic->statistics.conn_time == 0) { + vnic->statistics.conn_time = + get_cycles() - vnic->statistics.start_time; + } + + if (vnic->statistics.disconn_ref != 0) { + vnic->statistics.disconn_time += + get_cycles() - vnic->statistics.disconn_ref; + vnic->statistics.disconn_num++; + vnic->statistics.disconn_ref = 0; + } + +} + +static inline void vnic_stop_xmit_stats(struct vnic *vnic) +{ + if (vnic->statistics.xmit_ref == 0) + vnic->statistics.xmit_ref = get_cycles(); +} + +static inline void vnic_restart_xmit_stats(struct vnic *vnic) +{ + if (vnic->statistics.xmit_ref != 0) { + vnic->statistics.xmit_off_time += + get_cycles() - vnic->statistics.xmit_ref; + vnic->statistics.xmit_off_num++; + vnic->statistics.xmit_ref = 0; + } +} + +static inline void vnic_recv_pkt_stats(struct vnic *vnic) +{ + vnic->statistics.recv_time += get_cycles() - vnic_recv_ref; + vnic->statistics.recv_num++; +} + +static inline void vnic_multicast_recv_pkt_stats(struct vnic *vnic) +{ + vnic->statistics.multicast_recv_num++; +} + +static inline void vnic_pre_pkt_xmit_stats(cycles_t *time) +{ + *time = get_cycles(); +} + +static inline void vnic_post_pkt_xmit_stats(struct vnic *vnic, + cycles_t time) +{ + vnic->statistics.xmit_time += get_cycles() - time; + vnic->statistics.xmit_num++; + +} + +static inline void vnic_xmit_fail_stats(struct vnic *vnic) +{ + vnic->statistics.xmit_fail++; +} + +static inline void vnic_carrier_loss_stats(struct vnic *vnic) +{ + if (vnic->statistics.carrier_ref != 0) { + vnic->statistics.carrier_off_time += + get_cycles() - vnic->statistics.carrier_ref; + vnic->statistics.carrier_off_num++; + vnic->statistics.carrier_ref = 0; + } +} + +static inline int vnic_setup_stats_files(struct vnic *vnic) +{ + init_completion(&vnic->stat_info.released); + vnic->stat_info.dev.class = NULL; + vnic->stat_info.dev.parent = &vnic->dev_info.dev; + vnic->stat_info.dev.release = vnic_release_dev; + snprintf(vnic->stat_info.dev.bus_id, BUS_ID_SIZE, + "stats"); + + if (device_register(&vnic->stat_info.dev)) { + SYS_ERROR("create_vnic: error 
in registering" + " stat class dev\n"); + goto stats_out; + } + + if (sysfs_create_group(&vnic->stat_info.dev.kobj, + &vnic_stats_attr_group)) + goto err_stats_file; + + return 0; +err_stats_file: + device_unregister(&vnic->stat_info.dev); + wait_for_completion(&vnic->stat_info.released); +stats_out: + return -1; +} + +static inline void vnic_cleanup_stats_files(struct vnic *vnic) +{ + sysfs_remove_group(&vnic->dev_info.dev.kobj, + &vnic_stats_attr_group); + device_unregister(&vnic->stat_info.dev); + wait_for_completion(&vnic->stat_info.released); +} + +static inline void vnic_disconn_stats(struct vnic *vnic) +{ + if (!vnic->statistics.disconn_ref) + vnic->statistics.disconn_ref = get_cycles(); + + if (vnic->statistics.carrier_ref == 0) + vnic->statistics.carrier_ref = get_cycles(); +} + +static inline void vnic_alloc_stats(struct vnic *vnic) +{ + vnic->statistics.start_time = get_cycles(); +} + +static inline void control_note_rsptime_stats(cycles_t *time) +{ + *time = get_cycles(); +} + +static inline void control_update_rsptime_stats(struct control *control, + cycles_t response_time) +{ + response_time -= control->statistics.request_time; + control->statistics.response_time += response_time; + control->statistics.response_num++; + if (control->statistics.response_max < response_time) + control->statistics.response_max = response_time; + if ((control->statistics.response_min == 0) || + (control->statistics.response_min > response_time)) + control->statistics.response_min = response_time; + +} + +static inline void control_note_reqtime_stats(struct control *control) +{ + control->statistics.request_time = get_cycles(); +} + +static inline void control_timeout_stats(struct control *control) +{ + control->statistics.timeout_num++; +} + +static inline void data_kickreq_stats(struct data *data) +{ + data->statistics.kick_reqs++; +} + +static inline void data_no_xmitbuf_stats(struct data *data) +{ + data->statistics.no_xmit_bufs++; +} + +static inline void data_xmits_stats(struct data *data) +{ + data->statistics.xmit_num++; +} + +static inline void data_recvs_stats(struct data *data) +{ + data->statistics.recv_num++; +} + +static inline void data_note_kickrcv_time(void) +{ + vnic_recv_ref = get_cycles(); +} + +static inline void data_rcvkicks_stats(struct data *data) +{ + data->statistics.kick_recvs++; +} + + +static inline void vnic_ib_conntime_stats(struct vnic_ib_conn *ib_conn) +{ + ib_conn->statistics.connection_time = get_cycles(); +} + +static inline void vnic_ib_note_comptime_stats(cycles_t *time) +{ + *time = get_cycles(); +} + +static inline void vnic_ib_callback_stats(struct vnic_ib_conn *ib_conn) +{ + ib_conn->statistics.num_callbacks++; +} + +static inline void vnic_ib_comp_stats(struct vnic_ib_conn *ib_conn, + u32 *comp_num) +{ + ib_conn->statistics.num_ios++; + *comp_num = *comp_num + 1; + +} + +static inline void vnic_ib_io_stats(struct io *io, + struct vnic_ib_conn *ib_conn, + cycles_t comp_time) +{ + if ((io->type == RECV) || (io->type == RECV_UD)) + io->time = comp_time; + else if (io->type == RDMA) { + ib_conn->statistics.rdma_comp_time += comp_time - io->time; + ib_conn->statistics.rdma_comp_ios++; + } else if (io->type == SEND) { + ib_conn->statistics.send_comp_time += comp_time - io->time; + ib_conn->statistics.send_comp_ios++; + } +} + +static inline void vnic_ib_maxio_stats(struct vnic_ib_conn *ib_conn, + u32 comp_num) +{ + if (comp_num > ib_conn->statistics.max_ios) + ib_conn->statistics.max_ios = comp_num; +} + +static inline void vnic_ib_connected_time_stats(struct 
vnic_ib_conn *ib_conn) +{ + ib_conn->statistics.connection_time = + get_cycles() - ib_conn->statistics.connection_time; + +} + +static inline void vnic_ib_pre_rcvpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t *time) +{ + *time = get_cycles(); + if (io->time != 0) { + ib_conn->statistics.recv_comp_time += *time - io->time; + ib_conn->statistics.recv_comp_ios++; + } + +} + +static inline void vnic_ib_post_rcvpost_stats(struct vnic_ib_conn *ib_conn, + cycles_t time) +{ + ib_conn->statistics.recv_post_time += get_cycles() - time; + ib_conn->statistics.recv_post_ios++; +} + +static inline void vnic_ib_pre_sendpost_stats(struct io *io, + cycles_t *time) +{ + io->time = *time = get_cycles(); +} + +static inline void vnic_ib_post_sendpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t time) +{ + time = get_cycles() - time; + if (io->swr.opcode == IB_WR_RDMA_WRITE) { + ib_conn->statistics.rdma_post_time += time; + ib_conn->statistics.rdma_post_ios++; + } else { + ib_conn->statistics.send_post_time += time; + ib_conn->statistics.send_post_ios++; + } +} +#else /*CONFIG_INIFINIBAND_VNIC_STATS*/ + +static inline void vnic_connected_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_stop_xmit_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_restart_xmit_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_recv_pkt_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_multicast_recv_pkt_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_pre_pkt_xmit_stats(cycles_t *time) +{ + ; +} + +static inline void vnic_post_pkt_xmit_stats(struct vnic *vnic, + cycles_t time) +{ + ; +} + +static inline void vnic_xmit_fail_stats(struct vnic *vnic) +{ + ; +} + +static inline int vnic_setup_stats_files(struct vnic *vnic) +{ + return 0; +} + +static inline void vnic_cleanup_stats_files(struct vnic *vnic) +{ + ; +} + +static inline void vnic_carrier_loss_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_disconn_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_alloc_stats(struct vnic *vnic) +{ + ; +} + +static inline void control_note_rsptime_stats(cycles_t *time) +{ + ; +} + +static inline void control_update_rsptime_stats(struct control *control, + cycles_t response_time) +{ + ; +} + +static inline void control_note_reqtime_stats(struct control *control) +{ + ; +} + +static inline void control_timeout_stats(struct control *control) +{ + ; +} + +static inline void data_kickreq_stats(struct data *data) +{ + ; +} + +static inline void data_no_xmitbuf_stats(struct data *data) +{ + ; +} + +static inline void data_xmits_stats(struct data *data) +{ + ; +} + +static inline void data_recvs_stats(struct data *data) +{ + ; +} + +static inline void data_note_kickrcv_time(void) +{ + ; +} + +static inline void data_rcvkicks_stats(struct data *data) +{ + ; +} + +static inline void vnic_ib_conntime_stats(struct vnic_ib_conn *ib_conn) +{ + ; +} + +static inline void vnic_ib_note_comptime_stats(cycles_t *time) +{ + ; +} + +static inline void vnic_ib_callback_stats(struct vnic_ib_conn *ib_conn) + +{ + ; +} +static inline void vnic_ib_comp_stats(struct vnic_ib_conn *ib_conn, + u32 *comp_num) +{ + ; +} + +static inline void vnic_ib_io_stats(struct io *io, + struct vnic_ib_conn *ib_conn, + cycles_t comp_time) +{ + ; +} + +static inline void vnic_ib_maxio_stats(struct vnic_ib_conn *ib_conn, + u32 comp_num) +{ + ; +} + +static inline void vnic_ib_connected_time_stats(struct vnic_ib_conn *ib_conn) +{ + ; +} + +static inline void 
vnic_ib_pre_rcvpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t *time) +{ + ; +} + +static inline void vnic_ib_post_rcvpost_stats(struct vnic_ib_conn *ib_conn, + cycles_t time) +{ + ; +} + +static inline void vnic_ib_pre_sendpost_stats(struct io *io, + cycles_t *time) +{ + ; +} + +static inline void vnic_ib_post_sendpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t time) +{ + ; +} +#endif /*CONFIG_INIFINIBAND_VNIC_STATS*/ + +#endif /*VNIC_STATS_H_INCLUDED*/ From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:37:00 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:07:00 +0530 Subject: [ofa-general] [PATCH v2 11/13] QLogic VNIC: Driver utility file - implements various utility macros In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103700.12355.20370.stgit@localhost.localdomain> From: Poornima Kamath This patch adds the driver utility file which mainly contains utility macros for debugging of QLogic VNIC driver. Signed-off-by: Poornima Kamath Signed-off-by: Ramachandra K Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_util.h | 250 ++++++++++++++++++++++++++ 1 files changed, 250 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_util.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_util.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_util.h new file mode 100644 index 0000000..572e338 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_util.h @@ -0,0 +1,250 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_UTIL_H_INCLUDED +#define VNIC_UTIL_H_INCLUDED + +#define MODULE_NAME "QLGC_VNIC" + +#define VNIC_MAJORVERSION 1 +#define VNIC_MINORVERSION 1 + +#define ALIGN_DOWN(x, a) ((x)&(~((a)-1))) + +extern u32 vnic_debug; + +enum { + DEBUG_IB_INFO = 0x00000001, + DEBUG_IB_FUNCTION = 0x00000002, + DEBUG_IB_FSTATUS = 0x00000004, + DEBUG_IB_ASSERTS = 0x00000008, + DEBUG_CONTROL_INFO = 0x00000010, + DEBUG_CONTROL_FUNCTION = 0x00000020, + DEBUG_CONTROL_PACKET = 0x00000040, + DEBUG_CONFIG_INFO = 0x00000100, + DEBUG_DATA_INFO = 0x00001000, + DEBUG_DATA_FUNCTION = 0x00002000, + DEBUG_NETPATH_INFO = 0x00010000, + DEBUG_VIPORT_INFO = 0x00100000, + DEBUG_VIPORT_FUNCTION = 0x00200000, + DEBUG_LINK_STATE = 0x00400000, + DEBUG_VNIC_INFO = 0x01000000, + DEBUG_VNIC_FUNCTION = 0x02000000, + DEBUG_MCAST_INFO = 0x04000000, + DEBUG_MCAST_FUNCTION = 0x08000000, + DEBUG_SYS_INFO = 0x10000000, + DEBUG_SYS_VERBOSE = 0x40000000 +}; + +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_DEBUG +#define PRINT(level, x, fmt, arg...) \ + printk(level "%s: %s: %s, line %d: " fmt, \ + MODULE_NAME, x, __FILE__, __LINE__, ##arg) + +#define PRINT_CONDITIONAL(level, x, condition, fmt, arg...) \ + do { \ + if (condition) \ + printk(level "%s: %s: %s, line %d: " fmt, \ + MODULE_NAME, x, __FILE__, __LINE__, \ + ##arg); \ + } while (0) +#else +#define PRINT(level, x, fmt, arg...) \ + printk(level "%s: " fmt, MODULE_NAME, ##arg) + +#define PRINT_CONDITIONAL(level, x, condition, fmt, arg...) \ + do { \ + if (condition) \ + printk(level "%s: %s: " fmt, \ + MODULE_NAME, x, ##arg); \ + } while (0) +#endif /*CONFIG_INFINIBAND_QLGC_VNIC_DEBUG*/ + +#define IB_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "IB", fmt, ##arg) +#define IB_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "IB", fmt, ##arg) + +#define IB_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "IB", \ + (vnic_debug & DEBUG_IB_FUNCTION), \ + fmt, ##arg) + +#define IB_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "IB", \ + (vnic_debug & DEBUG_IB_INFO), \ + fmt, ##arg) + +#define IB_ASSERT(x) \ + do { \ + if ((vnic_debug & DEBUG_IB_ASSERTS) && !(x)) \ + panic("%s assertion failed, file: %s," \ + " line %d: ", \ + MODULE_NAME, __FILE__, __LINE__) \ + } while (0) + +#define CONTROL_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "CONTROL", fmt, ##arg) +#define CONTROL_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "CONTROL", fmt, ##arg) + +#define CONTROL_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "CONTROL", \ + (vnic_debug & DEBUG_CONTROL_INFO), \ + fmt, ##arg) + +#define CONTROL_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "CONTROL", \ + (vnic_debug & DEBUG_CONTROL_FUNCTION), \ + fmt, ##arg) + +#define CONTROL_PACKET(pkt) \ + do { \ + if (vnic_debug & DEBUG_CONTROL_PACKET) \ + control_log_control_packet(pkt); \ + } while (0) + +#define CONFIG_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "CONFIG", fmt, ##arg) +#define CONFIG_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "CONFIG", fmt, ##arg) + +#define CONFIG_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "CONFIG", \ + (vnic_debug & DEBUG_CONFIG_INFO), \ + fmt, ##arg) + +#define DATA_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "DATA", fmt, ##arg) +#define DATA_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "DATA", fmt, ##arg) + +#define DATA_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "DATA", \ + (vnic_debug & DEBUG_DATA_INFO), \ + fmt, ##arg) + +#define DATA_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "DATA", \ + (vnic_debug & DEBUG_DATA_FUNCTION), \ + fmt, ##arg) + + +#define MCAST_PRINT(fmt, arg...) 
\ + PRINT(KERN_INFO, "MCAST", fmt, ##arg) +#define MCAST_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "MCAST", fmt, ##arg) + +#define MCAST_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "MCAST", \ + (vnic_debug & DEBUG_MCAST_INFO), \ + fmt, ##arg) + +#define MCAST_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "MCAST", \ + (vnic_debug & DEBUG_MCAST_FUNCTION), \ + fmt, ##arg) + +#define NETPATH_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "NETPATH", fmt, ##arg) +#define NETPATH_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "NETPATH", fmt, ##arg) + +#define NETPATH_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "NETPATH", \ + (vnic_debug & DEBUG_NETPATH_INFO), \ + fmt, ##arg) + +#define VIPORT_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "VIPORT", fmt, ##arg) +#define VIPORT_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "VIPORT", fmt, ##arg) + +#define VIPORT_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "VIPORT", \ + (vnic_debug & DEBUG_VIPORT_INFO), \ + fmt, ##arg) + +#define VIPORT_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "VIPORT", \ + (vnic_debug & DEBUG_VIPORT_FUNCTION), \ + fmt, ##arg) + +#define LINK_STATE(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "LINK", \ + (vnic_debug & DEBUG_LINK_STATE), \ + fmt, ##arg) + +#define VNIC_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "NIC", fmt, ##arg) +#define VNIC_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "NIC", fmt, ##arg) +#define VNIC_INIT(fmt, arg...) \ + PRINT(KERN_INFO, "NIC", fmt, ##arg) + +#define VNIC_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "NIC", \ + (vnic_debug & DEBUG_VNIC_INFO), \ + fmt, ##arg) + +#define VNIC_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "NIC", \ + (vnic_debug & DEBUG_VNIC_FUNCTION), \ + fmt, ##arg) + +#define SYS_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "SYS", fmt, ##arg) +#define SYS_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "SYS", fmt, ##arg) + +#define SYS_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "SYS", \ + (vnic_debug & DEBUG_SYS_INFO), \ + fmt, ##arg) + +#endif /* VNIC_UTIL_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:37:30 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:07:30 +0530 Subject: [ofa-general] [PATCH v2 12/13] QLogic VNIC: Driver Kconfig and Makefile. In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103730.12355.14730.stgit@localhost.localdomain> From: Ramachandra K Kconfig and Makefile for the QLogic VNIC driver. Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/Kconfig | 28 ++++++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/Makefile | 13 +++++++++++++ 2 files changed, 41 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/Kconfig create mode 100644 drivers/infiniband/ulp/qlgc_vnic/Makefile diff --git a/drivers/infiniband/ulp/qlgc_vnic/Kconfig b/drivers/infiniband/ulp/qlgc_vnic/Kconfig new file mode 100644 index 0000000..6a08770 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/Kconfig @@ -0,0 +1,28 @@ +config INFINIBAND_QLGC_VNIC + tristate "QLogic VNIC - Support for QLogic Ethernet Virtual I/O Controller" + depends on INFINIBAND && NETDEVICES && INET + ---help--- + Support for the QLogic Ethernet Virtual I/O Controller + (EVIC). 
In conjunction with the EVIC, this provides virtual + ethernet interfaces and transports ethernet packets over + InfiniBand so that you can communicate with Ethernet networks + using your IB device. + +config INFINIBAND_QLGC_VNIC_DEBUG + bool "QLogic VNIC Verbose debugging" + depends on INFINIBAND_QLGC_VNIC + default n + ---help--- + This option causes verbose debugging code to be compiled + into the QLogic VNIC driver. The output can be turned on via the + vnic_debug module parameter. + +config INFINIBAND_QLGC_VNIC_STATS + bool "QLogic VNIC Statistics" + depends on INFINIBAND_QLGC_VNIC + default n + ---help--- + This option compiles statistics collecting code into the + data path of the QLogic VNIC driver to help in profiling and fine + tuning. This adds some overhead in the interest of gathering + data. diff --git a/drivers/infiniband/ulp/qlgc_vnic/Makefile b/drivers/infiniband/ulp/qlgc_vnic/Makefile new file mode 100644 index 0000000..509dd67 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/Makefile @@ -0,0 +1,13 @@ +obj-$(CONFIG_INFINIBAND_QLGC_VNIC) += qlgc_vnic.o + +qlgc_vnic-y := vnic_main.o \ + vnic_ib.o \ + vnic_viport.o \ + vnic_control.o \ + vnic_data.o \ + vnic_netpath.o \ + vnic_config.o \ + vnic_sys.o \ + vnic_multicast.o + +qlgc_vnic-$(CONFIG_INFINIBAND_QLGC_VNIC_STATS) += vnic_stats.o From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:35:59 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:05:59 +0530 Subject: [ofa-general] [PATCH v2 09/13] QLogic VNIC: IB Multicast for Ethernet broadcast/multicast In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103559.12355.85037.stgit@localhost.localdomain> From: Usha Srinivasan Implementation of ethernet broadcasting and multicasting for QLogic VNIC interface by making use of underlying IB multicasting. Signed-off-by: Usha Srinivasan Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c | 319 +++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h | 77 +++++ 2 files changed, 396 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c new file mode 100644 index 0000000..f40ea20 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c @@ -0,0 +1,319 @@ +/* + * Copyright (c) 2008 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include "vnic_viport.h" +#include "vnic_main.h" +#include "vnic_util.h" + +static inline void vnic_set_multicast_state_invalid(struct viport *viport) +{ + viport->mc_info.state = MCAST_STATE_INVALID; + viport->mc_info.mc = NULL; + memset(&viport->mc_info.mgid, 0, sizeof(union ib_gid)); +} + +int vnic_mc_init(struct viport *viport) +{ + MCAST_FUNCTION("vnic_mc_init %p\n", viport); + vnic_set_multicast_state_invalid(viport); + viport->mc_info.retries = 0; + spin_lock_init(&viport->mc_info.lock); + + return 0; +} + +void vnic_mc_uninit(struct viport *viport) +{ + unsigned long flags; + MCAST_FUNCTION("vnic_mc_uninit %p\n", viport); + + spin_lock_irqsave(&viport->mc_info.lock, flags); + if ((viport->mc_info.state != MCAST_STATE_INVALID) && + (viport->mc_info.state != MCAST_STATE_RETRIED)) { + MCAST_ERROR("%s mcast state is not INVALID or RETRIED %d\n", + control_ifcfg_name(&viport->control), + viport->mc_info.state); + } + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + MCAST_FUNCTION("vnic_mc_uninit done\n"); +} + + +/* This function is called when NEED_MCAST_COMPLETION is set. + * It finishes off the join multicast work. + */ +int vnic_mc_join_handle_completion(struct viport *viport) +{ + unsigned int ret = 0; + + MCAST_FUNCTION("vnic_mc_join_handle_completion()\n"); + if (viport->mc_info.state != MCAST_STATE_JOINING) { + MCAST_ERROR("%s unexpected mcast state in handle_completion: " + " %d\n", control_ifcfg_name(&viport->control), + viport->mc_info.state); + ret = -1; + goto out; + } + viport->mc_info.state = MCAST_STATE_ATTACHING; + MCAST_INFO("%s Attaching QP %lx mgid:" + VNIC_GID_FMT " mlid:%x\n", + control_ifcfg_name(&viport->control), jiffies, + VNIC_GID_RAW_ARG(viport->mc_info.mgid.raw), + viport->mc_info.mlid); + ret = ib_attach_mcast(viport->mc_data.ib_conn.qp, &viport->mc_info.mgid, + viport->mc_info.mlid); + if (ret) { + MCAST_ERROR("%s Attach mcast qp failed %d\n", + control_ifcfg_name(&viport->control), ret); + ret = -1; + goto out; + } + viport->mc_info.state = MCAST_STATE_JOINED_ATTACHED; + MCAST_INFO("%s UD QP successfully attached to mcast group\n", + control_ifcfg_name(&viport->control)); + +out: + return ret; +} + +/* NOTE: ib_sa.h says "returning a non-zero value from this callback will + * result in destroying the multicast tracking structure. 
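+ * The error paths in vnic_mc_join_complete() below rely on this: on + * failure they simply clear mc_info.mc and return the non-zero status, + * letting the SA layer free the ib_sa_multicast rather than calling + * ib_sa_free_multicast() themselves.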
+ */ +static int vnic_mc_join_complete(int status, + struct ib_sa_multicast *multicast) +{ + struct viport *viport = (struct viport *)multicast->context; + unsigned long flags; + + MCAST_FUNCTION("vnic_mc_join_complete() status:%x\n", status); + if (status) { + spin_lock_irqsave(&viport->mc_info.lock, flags); + if (status == -ENETRESET) { + vnic_set_multicast_state_invalid(viport); + viport->mc_info.retries = 0; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + MCAST_ERROR("%s got ENETRESET\n", + control_ifcfg_name(&viport->control)); + goto out; + } + /* perhaps the mcgroup hasn't yet been created - retry */ + viport->mc_info.retries++; + viport->mc_info.mc = NULL; + if (viport->mc_info.retries > MAX_MCAST_JOIN_RETRIES) { + viport->mc_info.state = MCAST_STATE_RETRIED; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + MCAST_ERROR("%s join failed 0x%x - max retries:%d " + "exceeded\n", + control_ifcfg_name(&viport->control), + status, viport->mc_info.retries); + } else { + viport->mc_info.state = MCAST_STATE_INVALID; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + spin_lock_irqsave(&viport->lock, flags); + viport->updates |= NEED_MCAST_JOIN; + spin_unlock_irqrestore(&viport->lock, flags); + viport_kick(viport); + MCAST_ERROR("%s join failed 0x%x - retrying; " + "retries:%d\n", + control_ifcfg_name(&viport->control), + status, viport->mc_info.retries); + } + goto out; + } + + /* finish join work from main state loop for viport - in case + * the work itself cannot be done in a callback environment */ + spin_lock_irqsave(&viport->lock, flags); + viport->mc_info.mlid = be16_to_cpu(multicast->rec.mlid); + viport->updates |= NEED_MCAST_COMPLETION; + spin_unlock_irqrestore(&viport->lock, flags); + viport_kick(viport); + MCAST_INFO("%s setting NEED_MCAST_COMPLETION %x %x\n", + control_ifcfg_name(&viport->control), + multicast->rec.mlid, viport->mc_info.mlid); +out: + return status; +} + +void vnic_mc_join_setup(struct viport *viport, union ib_gid *mgid) +{ + unsigned long flags; + + MCAST_FUNCTION("in vnic_mc_join_setup\n"); + spin_lock_irqsave(&viport->mc_info.lock, flags); + if (viport->mc_info.state != MCAST_STATE_INVALID) { + if (viport->mc_info.state == MCAST_STATE_DETACHING) + MCAST_ERROR("%s detach in progress\n", + control_ifcfg_name(&viport->control)); + else if (viport->mc_info.state == MCAST_STATE_RETRIED) + MCAST_ERROR("%s max join retries exceeded\n", + control_ifcfg_name(&viport->control)); + else { + /* join/attach in progress or done */ + /* verify that the current mgid is same as prev mgid */ + if (memcmp(mgid, &viport->mc_info.mgid, sizeof(union ib_gid)) != 0) { + /* Separate MGID for each IOC */ + MCAST_ERROR("%s Multicast Group MGIDs not " + "unique; mgids: " VNIC_GID_FMT + " " VNIC_GID_FMT "\n", + control_ifcfg_name(&viport->control), + VNIC_GID_RAW_ARG(mgid->raw), + VNIC_GID_RAW_ARG(viport->mc_info.mgid.raw)); + } else + MCAST_INFO("%s join already issued: %d\n", + control_ifcfg_name(&viport->control), + viport->mc_info.state); + + } + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + return; + } + viport->mc_info.mgid = *mgid; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + spin_lock_irqsave(&viport->lock, flags); + viport->updates |= NEED_MCAST_JOIN; + spin_unlock_irqrestore(&viport->lock, flags); + viport_kick(viport); + MCAST_INFO("%s setting NEED_MCAST_JOIN \n", + control_ifcfg_name(&viport->control)); +} + +int vnic_mc_join(struct viport *viport) +{ + struct ib_sa_mcmember_rec rec; + ib_sa_comp_mask comp_mask; + unsigned long 
flags; + int ret = 0; + + MCAST_FUNCTION("vnic_mc_join()\n"); + if (!viport->mc_data.ib_conn.qp) { + MCAST_ERROR("%s qp is NULL\n", + control_ifcfg_name(&viport->control)); + ret = -1; + goto out; + } + spin_lock_irqsave(&viport->mc_info.lock, flags); + if (viport->mc_info.state != MCAST_STATE_INVALID) { + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + MCAST_INFO("%s Multicast join already issued\n", + control_ifcfg_name(&viport->control)); + goto out; + } + viport->mc_info.state = MCAST_STATE_JOINING; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + + memset(&rec, 0, sizeof(rec)); + rec.join_state = 2; /* bit 1 is Nonmember */ + rec.mgid = viport->mc_info.mgid; + rec.port_gid = viport->config->path_info.path.sgid; + + comp_mask = IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_JOIN_STATE; + + MCAST_INFO("%s Joining Multicast group%lx mgid:" + VNIC_GID_FMT " port_gid: " VNIC_GID_FMT "\n", + control_ifcfg_name(&viport->control), jiffies, + VNIC_GID_RAW_ARG(rec.mgid.raw), + VNIC_GID_RAW_ARG(rec.port_gid.raw)); + + viport->mc_info.mc = ib_sa_join_multicast(&vnic_sa_client, + viport->config->ibdev, viport->config->port, + &rec, comp_mask, GFP_KERNEL, + vnic_mc_join_complete, viport); + + if (IS_ERR(viport->mc_info.mc)) { + MCAST_ERROR("%s Multicast joining failed " VNIC_GID_FMT + ".\n", + control_ifcfg_name(&viport->control), + VNIC_GID_RAW_ARG(rec.mgid.raw)); + viport->mc_info.state = MCAST_STATE_INVALID; + ret = -1; + goto out; + } + MCAST_INFO("%s Multicast group join issued mgid:" + VNIC_GID_FMT " port_gid: " VNIC_GID_FMT "\n", + control_ifcfg_name(&viport->control), + VNIC_GID_RAW_ARG(rec.mgid.raw), + VNIC_GID_RAW_ARG(rec.port_gid.raw)); +out: + return ret; +} + +void vnic_mc_leave(struct viport *viport) +{ + unsigned long flags; + unsigned int ret; + struct ib_sa_multicast *mc; + + MCAST_FUNCTION("vnic_mc_leave()\n"); + + spin_lock_irqsave(&viport->mc_info.lock, flags); + if ((viport->mc_info.state == MCAST_STATE_INVALID) || + (viport->mc_info.state == MCAST_STATE_RETRIED)) { + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + return; + } + + if (viport->mc_info.state == MCAST_STATE_JOINED_ATTACHED) { + + viport->mc_info.state = MCAST_STATE_DETACHING; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + ret = ib_detach_mcast(viport->mc_data.ib_conn.qp, + &viport->mc_info.mgid, + viport->mc_info.mlid); + if (ret) { + MCAST_ERROR("%s UD QP Detach failed %d\n", + control_ifcfg_name(&viport->control), ret); + return; + } + MCAST_INFO("%s UD QP detached succesfully\n", + control_ifcfg_name(&viport->control)); + spin_lock_irqsave(&viport->mc_info.lock, flags); + } + mc = viport->mc_info.mc; + vnic_set_multicast_state_invalid(viport); + viport->mc_info.retries = 0; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + + if (mc) { + MCAST_INFO("%s Freeing up multicast structure.\n", + control_ifcfg_name(&viport->control)); + ib_sa_free_multicast(mc); + } + MCAST_FUNCTION("vnic_mc_leave done\n"); + return; +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h new file mode 100644 index 0000000..e049180 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h @@ -0,0 +1,77 @@ +/* + * Copyright (c) 2008 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef __VNIC_MULTICAST_H__ +#define __VNIC_MULTICAST_H__ + +enum { + MCAST_STATE_INVALID = 0x00, /* join not attempted or failed */ + MCAST_STATE_JOINING = 0x01, /* join mcgroup in progress */ + MCAST_STATE_ATTACHING = 0x02, /* join completed with success, + * attach qp to mcgroup in progress + */ + MCAST_STATE_JOINED_ATTACHED = 0x03, /* join completed with success */ + MCAST_STATE_DETACHING = 0x04, /* detach qp in progress */ + MCAST_STATE_RETRIED = 0x05, /* retried join and failed */ +}; + +#define MAX_MCAST_JOIN_RETRIES 5 /* used to retry join */ + +struct mc_info { + u8 state; + spinlock_t lock; + union ib_gid mgid; + u16 mlid; + struct ib_sa_multicast *mc; + u8 retries; +}; + + +int vnic_mc_init(struct viport *viport); +void vnic_mc_uninit(struct viport *viport); +extern char *control_ifcfg_name(struct control *control); + +/* This function is called when a viport gets a multicast mgid from EVIC + and must join the multicast group. It sets the NEED_MCAST_JOIN flag, which + results in vnic_mc_join being called later. */ +void vnic_mc_join_setup(struct viport *viport, union ib_gid *mgid); + +/* This function is called when the NEED_MCAST_JOIN flag is set. */ +int vnic_mc_join(struct viport *viport); + +/* This function is called when NEED_MCAST_COMPLETION is set. + It finishes off the multicast join work. */ +int vnic_mc_join_handle_completion(struct viport *viport); + +void vnic_mc_leave(struct viport *viport); + +#endif /* __VNIC_MULTICAST_H__ */ From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:38:00 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:08:00 +0530 Subject: [ofa-general] [PATCH v2 13/13] QLogic VNIC: Modifications to IB Kconfig and Makefile In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103800.12355.70429.stgit@localhost.localdomain> From: Ramachandra K This patch modifies the top-level InfiniBand Kconfig and Makefile to include QLogic VNIC as a new ULP.
Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/Kconfig | 2 ++ drivers/infiniband/Makefile | 1 + 2 files changed, 3 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index a5dc78a..0775df5 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -53,4 +53,6 @@ source "drivers/infiniband/ulp/srp/Kconfig" source "drivers/infiniband/ulp/iser/Kconfig" +source "drivers/infiniband/ulp/qlgc_vnic/Kconfig" + endif # INFINIBAND diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile index ed35e44..845271e 100644 --- a/drivers/infiniband/Makefile +++ b/drivers/infiniband/Makefile @@ -9,3 +9,4 @@ obj-$(CONFIG_INFINIBAND_NES) += hw/nes/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ obj-$(CONFIG_INFINIBAND_ISER) += ulp/iser/ +obj-$(CONFIG_INFINIBAND_QLGC_VNIC) += ulp/qlgc_vnic/ From weiny2 at llnl.gov Mon May 19 11:21:25 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 19 May 2008 11:21:25 -0700 Subject: [ofa-general] Re: [PATCH] opensm: remove unused pfn_ui_* callback options In-Reply-To: <20080519170916.GL4616@sashak.voltaire.com> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080519170303.GI4616@sashak.voltaire.com> <20080519170839.GK4616@sashak.voltaire.com> <20080519170916.GL4616@sashak.voltaire.com> Message-ID: <20080519112125.3ec8ae61.weiny2@llnl.gov> Could these be used by some Windows add on? Not that it matters... Ira On Mon, 19 May 2008 20:09:16 +0300 Sasha Khapyorsky wrote: > > Remove unused pfn_ui_pre_lid_assign and pfn_ui_mcast_fdb_assign callbacks > from OpenSM subnet options. > > Signed-off-by: Sasha Khapyorsky > --- > opensm/include/opensm/osm_subnet.h | 20 ------------------ > opensm/opensm/osm_lid_mgr.c | 7 ------ > opensm/opensm/osm_mcast_mgr.c | 40 ++++++----------------------------- > opensm/opensm/osm_subnet.c | 4 --- > 4 files changed, 7 insertions(+), 64 deletions(-) > > diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h > index daab453..56b0165 100644 > --- a/opensm/include/opensm/osm_subnet.h > +++ b/opensm/include/opensm/osm_subnet.h > @@ -248,10 +248,6 @@ typedef struct _osm_subn_opt { > uint16_t console_port; > cl_map_t port_prof_ignore_guids; > boolean_t port_profile_switch_nodes; > - osm_pfn_ui_extension_t pfn_ui_pre_lid_assign; > - void *ui_pre_lid_assign_ctx; > - osm_pfn_ui_mcast_extension_t pfn_ui_mcast_fdb_assign; > - void *ui_mcast_fdb_assign_ctx; > boolean_t sweep_on_trap; > char *routing_engine_name; > boolean_t connect_roots; > @@ -412,22 +408,6 @@ typedef struct _osm_subn_opt { > * If TRUE will count the number of switch nodes routed through > * the link. If FALSE - only CA/RT nodes are counted. > * > -* pfn_ui_pre_lid_assign > -* A UI function to be invoked prior to lid assigment. It should > -* return 1 if any change was made to any lid or 0 otherwise. > -* > -* ui_pre_lid_assign_ctx > -* A UI context (void *) to be provided to the pfn_ui_pre_lid_assign > -* > -* pfn_ui_mcast_fdb_assign > -* A UI function to be called inside the mcast manager instead of > -* the call for the build spanning tree. This will be called on > -* every multicast call for create, join and leave, and is > -* responsible for the mcast FDB configuration. 
> -* > -* ui_mcast_fdb_assign_ctx > -* A UI context (void *) to be provided to the pfn_ui_mcast_fdb_assign > -* > * sweep_on_trap > * Received traps will initiate a new sweep. > * > diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c > index af0d020..7f25750 100644 > --- a/opensm/opensm/osm_lid_mgr.c > +++ b/opensm/opensm/osm_lid_mgr.c > @@ -1212,13 +1212,6 @@ osm_signal_t osm_lid_mgr_process_sm(IN osm_lid_mgr_t * const p_mgr) > persistent db */ > __osm_lid_mgr_init_sweep(p_mgr); > > - if (p_mgr->p_subn->opt.pfn_ui_pre_lid_assign) { > - OSM_LOG(p_mgr->p_log, OSM_LOG_VERBOSE, > - "Invoking UI function pfn_ui_pre_lid_assign\n"); > - p_mgr->p_subn->opt.pfn_ui_pre_lid_assign(p_mgr->p_subn->opt. > - ui_pre_lid_assign_ctx); > - } > - > /* Set the send_set_reqs of the p_mgr to FALSE, and > we'll see if any set requests were sent. If not - > can signal OSM_SIGNAL_DONE */ > diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c > index 683a16d..a6185fe 100644 > --- a/opensm/opensm/osm_mcast_mgr.c > +++ b/opensm/opensm/osm_mcast_mgr.c > @@ -1085,7 +1085,6 @@ osm_mcast_mgr_process_tree(osm_sm_t * sm, > { > ib_api_status_t status = IB_SUCCESS; > ib_net16_t mlid; > - boolean_t ui_mcast_fdb_assign_func_defined; > > OSM_LOG_ENTER(sm->p_log); > > @@ -1107,44 +1106,19 @@ osm_mcast_mgr_process_tree(osm_sm_t * sm, > goto Exit; > } > > - if (sm->p_subn->opt.pfn_ui_mcast_fdb_assign) > - ui_mcast_fdb_assign_func_defined = TRUE; > - else > - ui_mcast_fdb_assign_func_defined = FALSE; > - > /* > Clear the multicast tables to start clean, then build > the spanning tree which sets the mcast table bits for each > port in the group. > - We will clean the multicast tables if a ui_mcast function isn't > - defined, or if such function is defined, but we got here > - through a MC_CREATE request - this means we are creating a new > - multicast group - clean all old data. > */ > - if (ui_mcast_fdb_assign_func_defined == FALSE || > - req_type == OSM_MCAST_REQ_TYPE_CREATE) > - __osm_mcast_mgr_clear(sm, p_mgrp); > - > - /* If a UI function is defined, then we will call it here. > - If not - the use the regular build spanning tree function */ > - if (ui_mcast_fdb_assign_func_defined == FALSE) { > - status = __osm_mcast_mgr_build_spanning_tree(sm, p_mgrp); > - if (status != IB_SUCCESS) { > - OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A17: " > - "Unable to create spanning tree (%s)\n", > - ib_get_err_str(status)); > - goto Exit; > - } > - } else { > - if (osm_log_is_active(sm->p_log, OSM_LOG_DEBUG)) { > - OSM_LOG(sm->p_log, OSM_LOG_DEBUG, > - "Invoking UI function pfn_ui_mcast_fdb_assign\n"); > - } > + __osm_mcast_mgr_clear(sm, p_mgrp); > > - sm->p_subn->opt.pfn_ui_mcast_fdb_assign(sm->p_subn->opt. 
> - ui_mcast_fdb_assign_ctx, > - mlid, req_type, > - port_guid); > + status = __osm_mcast_mgr_build_spanning_tree(sm, p_mgrp); > + if (status != IB_SUCCESS) { > + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A17: " > + "Unable to create spanning tree (%s)\n", > + ib_get_err_str(status)); > + goto Exit; > } > > Exit: > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > index a916270..2191f2d 100644 > --- a/opensm/opensm/osm_subnet.c > +++ b/opensm/opensm/osm_subnet.c > @@ -453,10 +453,6 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) > p_opt->qos_policy_file = OSM_DEFAULT_QOS_POLICY_FILE; > p_opt->accum_log_file = TRUE; > p_opt->port_profile_switch_nodes = FALSE; > - p_opt->pfn_ui_pre_lid_assign = NULL; > - p_opt->ui_pre_lid_assign_ctx = NULL; > - p_opt->pfn_ui_mcast_fdb_assign = NULL; > - p_opt->ui_mcast_fdb_assign_ctx = NULL; > p_opt->sweep_on_trap = TRUE; > p_opt->routing_engine_name = NULL; > p_opt->connect_roots = FALSE; > -- > 1.5.4.rc2.60.gb2e62 > From ramachandra.kuchimanchi at qlogic.com Mon May 19 11:59:16 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Tue, 20 May 2008 00:29:16 +0530 Subject: [ofa-general] Need help--Failed to build ofa-kernel.rpm while installing OFED1.3 on RHEL5.1 with Kernel 2.6.25-rc3/rc8 In-Reply-To: References: Message-ID: <71d336490805191159k451cfc4dn6f6ad4f3ab88c43b@mail.gmail.com> On Mon, May 19, 2008 at 10:59 PM, Joe Li wrote: > > Hello everyone, > > I am a newbie to openfabric and I have an issue here which needs your help. > When trying to install OFED1.3 on kernel 2.6.25-rc3 or kernel 2.6.25-rc8, I > get an ofa-kernel rpm build error: OFED-1.3 does not support kernel 2.6.25. Among the plain vanilla kernel versions it supports 2.6.24. Please refer to docs/OFED_release_notes.txt for the supported kernels. Regards, Ram From joel at finetec.com Mon May 19 12:00:55 2008 From: joel at finetec.com (Joe Li) Date: Mon, 19 May 2008 12:00:55 -0700 Subject: [ofa-general] Need help--Failed to build ofa-kernel.rpm while installing OFED1.3 on RHEL5.1 with Kernel 2.6.25-rc3/rc8 References: <71d336490805191159k451cfc4dn6f6ad4f3ab88c43b@mail.gmail.com> Message-ID: Thank you very much for the information, I will check the release notes. Regards Joe ________________________________ From: ariston at gmail.com on behalf of Ramachandra K Sent: Mon 5/19/2008 11:59 AM To: Joe Li Cc: general at lists.openfabrics.org Subject: Re: [ofa-general] Need help--Failed to build ofa-kernel.rpm while installing OFED1.3 on RHEL5.1 with Kernel 2.6.25-rc3/rc8 On Mon, May 19, 2008 at 10:59 PM, Joe Li wrote: > > Hello everyone, > > I am a newbie to openfabric and I have an issue here which needs your help. > When trying to install OFED1.3 on kernel 2.6.25-rc3 or kernel 2.6.25-rc8, I > get an ofa-kernel rpm build error: OFED-1.3 does not support kernel 2.6.25. Among the plain vanilla kernel versions it supports 2.6.24. Please refer to docs/OFED_release_notes.txt for the supported kernels. Regards, Ram -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dotanba at gmail.com Mon May 19 13:06:15 2008 From: dotanba at gmail.com (Dotan Barak) Date: Mon, 19 May 2008 22:06:15 +0200 Subject: [ofa-general] timeout question In-Reply-To: <6978b4af0805190349o11c7eab4t74607708a369489@mail.gmail.com> References: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> <6978b4af0805161140t7b12532brc3709a2aec6154df@mail.gmail.com> <482E3AEE.4070603@gmail.com> <6978b4af0805161201r4a113a1cpd98cbb1e81cf175a@mail.gmail.com> <482DE8E3.4090200@gmail.com> <6978b4af0805190349o11c7eab4t74607708a369489@mail.gmail.com> Message-ID: <4831DDB7.4060804@gmail.com> Rui Machado wrote: >> >> Can you replace it with a write (from the other side)? >> READ has "higher price" than a WRITE. >> >> > > Can you please, shortly explain why this higher price? > Basically, when a Read request is sent, the HCA needs to reserve resources in order to accept the response. > >> Anyway, you should get the mentioned behavior anyway.. >> >> When the sender get the error, what is the status of the receiver QP? >> (did you try to execute ibv_query_qp and get its status?) >> >> > > I tried to get the qp state right after the error and it is 6 (which I > believe is IBV_QPS_ERR). > Why do you ask? > If the QP in the receiver side is in the ERROR state, it means that there was an error, and that is the root cause of the retry-exceeded failure in the first place ... maybe there was a remote address/key violation that caused the receiver QP to transition to the ERROR state. Dotan From worleys at gmail.com Mon May 19 12:11:43 2008 From: worleys at gmail.com (Chris Worley) Date: Mon, 19 May 2008 13:11:43 -0600 Subject: [ofa-general] SET_IPOIB_CM enabled by default in some builds and not in others In-Reply-To: <483045CD.8060301@mellanox.co.il> References: <482AC510.3090602@dev.mellanox.co.il> <483045CD.8060301@mellanox.co.il> Message-ID: In using netcat in UDP mode over IPoIB, I lose 25%-40% of the packets. Is that expected? Thanks, Chris On Sun, May 18, 2008 at 9:05 AM, Tziporet Koren wrote: > Chris Worley wrote: >> >> Ahhh... it was probably because I added the RPMs w/o deleting 1.2.5.5 >> in the "kitchen sink" build. >> >> Is there any reason to NOT use connected mode? >> > > In general the CM is better in performance of medium & large messages. > We found the UD mode is better in small UDP messages > > Tziporet > > > From sashak at voltaire.com Mon May 19 12:53:02 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 May 2008 22:53:02 +0300 Subject: [ofa-general] Re: [PATCH] opensm: remove unused pfn_ui_* callback options In-Reply-To: <20080519112125.3ec8ae61.weiny2@llnl.gov> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080519170303.GI4616@sashak.voltaire.com> <20080519170839.GK4616@sashak.voltaire.com> <20080519170916.GL4616@sashak.voltaire.com> <20080519112125.3ec8ae61.weiny2@llnl.gov> Message-ID: <20080519195302.GA1183@sashak.voltaire.com> On 11:21 Mon 19 May , Ira Weiny wrote: > Could these be used by some Windows add on? I don't think so; likely there were some ideas about utilizing osmsh or so (now we have the more powerful "routing engine" stuff). But I don't know for sure. I guess we will find out if it was. > Not that it matters... You said... 
:) Sasha From xma at us.ibm.com Mon May 19 13:08:07 2008 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 19 May 2008 13:08:07 -0700 Subject: [ofa-general][PATCH] Re: mlx4: Completion EQ per cpu (MP support, Patch 10) In-Reply-To: <48319B4C.7040309@mellanox.co.il> Message-ID: We should support smp affinity from the userland as well through sysfs. Thanks Shirley Yevgeny Petrilin Sent by: general-bounces at lists.openfabrics.org 05/19/2008 08:22 AM To Roland Dreier cc Christoph Raisch , Hoang-Nam Nguyen , general at lists.openfabrics.org Subject Re: [ofa-general][PATCH] Re: mlx4: Completion EQ per cpu (MP support, Patch 10) Roland Dreier wrote: > > > I would just like to see an approach that is fully thought through and > > > gives a way for applications/kernel drivers to choose a CQ vector based > > > on some information about what CPU it will go to. > > > Isn't the decision of which CPU an MSI-X is routed to (and hence, to > > which CPI an EQ is bound to) determined by userspace? (either by the irq > > balancer process or by manually setting /proc/irq//smp_affinity)? > > Yes, but how can anything tell which IRQ number corresponds to a given > "CQ vector" number? (And don't be too stuck on MSI-X, since ehca uses > some completely different GX-bus related thing to get multiple interrupts) > > > What are we risking in making the default action to spread interrupts? > > There are fairly plausible scenarios like a multi-threaded app where > each thread creates a send CQ and a receive CQ, which should both be > bound to the same CPU as the thread. If we spread all CQs then it's > impossible to get thread-locality. > > I'm not saying that round-robin is necessarily a bad default policy, but > I do think there needs to be a complete picture of how that policy can > be overridden before we go for multiple interrupt vectors. > > - R. Hello Roland, We can add the multiple interrupt vectors support in two stages: 1. The low level driver can create multiple interrupt vectors. Their name would include a serial number from 0 to #CPU's-1. The number of completion vectors can be populated through ib_device.num_comp_vectors. Then each ulp can ask for a specific completion vector when creating CQ, which means that passing vector=0 while creating CQ will assign it to completion vector #0. 2. As the second stage, we can create a "don't care" value which would mean that the driver can can attach the CQ to any completion vector. In this case the policy shouldn't necessary be round-robin. We can manage the number of "clients" for each completion vector and then assign the CQ to the least busy one. What is your opinion on this solution? 
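To make the two stages concrete, the "don't care" selection could look something like the sketch below (a minimal illustration only -- MLX4_ANY_COMP_VECTOR, comp_vector_load and mlx4_pick_comp_vector are made-up names, not part of any posted patch):

#include <linux/threads.h>	/* NR_CPUS */

#define MLX4_ANY_COMP_VECTOR	(~0U)	/* hypothetical "don't care" value */

/* Number of CQs currently attached to each completion vector.  A real
 * driver would protect this with a lock and decrement the counter when
 * a CQ is destroyed. */
static unsigned int comp_vector_load[NR_CPUS];

static unsigned int mlx4_pick_comp_vector(unsigned int requested,
					  unsigned int num_comp_vectors)
{
	unsigned int i, best = 0;

	/* Stage 1: the ULP asked for a specific completion vector. */
	if (requested != MLX4_ANY_COMP_VECTOR)
		return requested % num_comp_vectors;

	/* Stage 2: "don't care" -- attach the CQ to the least busy
	 * vector rather than doing blind round-robin. */
	for (i = 1; i < num_comp_vectors; ++i)
		if (comp_vector_load[i] < comp_vector_load[best])
			best = i;

	comp_vector_load[best]++;
	return best;
}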
Thanks, Yevgeny _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sean.hefty at intel.com Mon May 19 14:24:49 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 19 May 2008 14:24:49 -0700 Subject: [ofa-general] RE: [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs In-Reply-To: <4831649A.2020206@voltaire.com> References: <000001c8b6aa$a5418810$bd59180a@amr.corp.intel.com> <482C68A4.9020305@opengridcomputing.com> <4831649A.2020206@voltaire.com> Message-ID: <000101c8b9f6$c4dd4cf0$bd59180a@amr.corp.intel.com> >Sean, please let me know your preference (as it was somehow unclear from >the thread) if you want the delivery of this event to be dependent on >the ulp asking for it or no. I spent most of the morning looking at this, and until I know what the trade-offs really are in the implementation, I can't say that I have a strong preference for how to deal with any of this. My main concerns are: * All callbacks from the rdma_cm are serialized * We minimize the overhead of reporting events * We don't lose events * If the user returns a non-zero value from a callback, the rdma_cm_id is destroyed, an no further callbacks are invoked. and in concept I prefer to: * Always report the event and let ULPs ignore it * Let someone come up with a fantastically simple way of reporting new events The existing rdma_cm callbacks are naturally serialized with each other. (Callback for connect after resolve route after resolve address...) This allows using the stack for event structures, but the cost is complex synchronization with device removal. Supporting additional events while meeting the concerns listed above will be equally challenging. So if we can simplify device removal handling, then supporting similar types of events should be easier as well. If we can guarantee that this works, one option is to acquire a mutex before invoking a callback on an rdma_cm_id. I hesitate to hold any locks while in a callback, since it restricts what the user can do, but if the mutex is only used to synchronize calling the user back, it may work, since the rdma_cm never invokes a callback from a downcall. This should simplify the device removal handling, eliminating wait_remove and dev_remove from the rdma_cm_id. Alternatively, the ib_cm serializes callbacks using different logic (see cm_process_work() and use of work_count/work_list). I've been looking at what it would take to use the ib_cm event logic in the rdma_cm. The trick is to minimize the event reporting overhead without losing any events, (and minimizing the overhead may require registering for events...) What I've been exploring is adding an event_list to the rdma_cm_id. Whenever the user performs an asynchronous operation, event structure(s) is allocated and placed on the event_list. When an asynchronous operation completes, the event structure is removed from this list, placed on a work_list, and a call like cma_process_work() is invoked. Note that some operations (e.g. connect) result in multiple callbacks to the rdma_cm (connect and disconnect). And the more I consider this option, the more appealing just holding a mutex around the callbacks becomes. 
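As a sketch of that last option (illustrative only -- handler_mutex is an assumed new field in struct rdma_id_private, not existing rdma_cm code):

/* Funnel every event callback on an rdma_cm_id through one helper so
 * that callbacks are naturally serialized by a per-id mutex.  Because
 * the rdma_cm never invokes a callback from a downcall, holding the
 * mutex here cannot deadlock against the ULP calling back into the
 * rdma_cm from its handler. */
static int cma_notify_user(struct rdma_id_private *id_priv,
			   struct rdma_cm_event *event)
{
	int ret;

	mutex_lock(&id_priv->handler_mutex);
	ret = id_priv->id.event_handler(&id_priv->id, event);
	mutex_unlock(&id_priv->handler_mutex);

	/* Non-zero means the caller must destroy the rdma_cm_id and
	 * guarantee that no further callbacks are invoked. */
	return ret;
}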
- Sean From robert.j.woodruff at intel.com Mon May 19 15:01:31 2008 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 19 May 2008 15:01:31 -0700 Subject: [ofa-general] Current list of Linux maintainers and their email info Message-ID: Hi guys, I am starting to put together an updated list of maintainers that can be displayed on the OFA website, as what is out there now is horribly out of date. I am only collecting the info for the Linux side, Sean and Stan will be collecting the info for the Windows stack. Please respond with the components that you (or the people that you work with) maintain and the current maintainer of that component and email info. I have been working with the website maintainers to put in place a way that we can have the link to the maintainers list be a pointer to a simple text file, so it can be easily updated on the server when things change in the future, but I'd like to put together the initial file and then later it can be easily updated when things change. woody From swise at opengridcomputing.com Mon May 19 15:04:21 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 19 May 2008 17:04:21 -0500 Subject: [ofa-general] Re: Current list of Linux maintainers and their email info In-Reply-To: References: Message-ID: <4831F965.6060607@opengridcomputing.com> Woodruff, Robert J wrote: > Hi guys, > > I am starting to put together an updated list of maintainers > that can be displayed on the OFA website, > as what is out there now is horribly out of > date. I am only collecting the info for the Linux side, > Sean and Stan will be collecting the info for the Windows stack. > > Please respond with the components that you (or the people that > you work with) maintain and the > current maintainer of that component and email info. > > Can you send the list of components? > I have been working with the website maintainers to put > in place a way that we can have the link to the maintainers > list be a pointer to a simple text file, so it can be > easily updated on the server when things change in the future, > but I'd like to put together the initial file and then later > it can be easily updated when things change. > > > woody > From rdreier at cisco.com Mon May 19 15:05:59 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 15:05:59 -0700 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: (Roland Dreier's message of "Mon, 19 May 2008 08:49:04 -0700") References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> <483199EC.7070900@Voltaire.COM> Message-ID: > and what happens if alloc_mad() is called while port->sm_ah is NULL? Trivial fix seems to be to move the test for whether port->sm_ah is NULL into alloc_mad(), and have it return -EAGAIN if so. - R. 
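A sketch of that trivial fix (field and function names follow drivers/infiniband/core/sa_query.c, but the exact body here is illustrative, not a tested patch):

static int alloc_mad(struct ib_sa_query *query, gfp_t gfp_mask)
{
	unsigned long flags;

	spin_lock_irqsave(&query->port->ah_lock, flags);
	if (!query->port->sm_ah) {
		/* No address handle for the SM right now (e.g. it was
		 * just flushed by a port event): fail the query with
		 * -EAGAIN instead of dereferencing a NULL pointer. */
		spin_unlock_irqrestore(&query->port->ah_lock, flags);
		return -EAGAIN;
	}
	kref_get(&query->port->sm_ah->ref);
	query->sm_ah = query->port->sm_ah;
	spin_unlock_irqrestore(&query->port->ah_lock, flags);

	query->mad_buf = ib_create_send_mad(query->port->agent, 1,
					    query->sm_ah->pkey_index, 0,
					    IB_MGMT_SA_HDR, IB_MGMT_SA_DATA,
					    gfp_mask);
	if (IS_ERR(query->mad_buf)) {
		kref_put(&query->sm_ah->ref, free_sm_ah);
		return -ENOMEM;
	}

	query->mad_buf->ah = query->sm_ah->ah;
	return 0;
}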
From sean.hefty at intel.com Mon May 19 15:11:09 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 19 May 2008 15:11:09 -0700 Subject: [ofa-general] Current list of Linux maintainers and their email info In-Reply-To: References: Message-ID: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> This is what I could find in the MAINTAINERS file for 2.6.25: EHCA (IBM GX bus InfiniBand adapter) DRIVER: P: Hoang-Nam Nguyen M: hnguyen at de.ibm.com P: Christoph Raisch M: raisch at de.ibm.com L: general at lists.openfabrics.org S: Supported INFINIBAND SUBSYSTEM P: Roland Dreier M: rolandd at cisco.com P: Sean Hefty M: sean.hefty at intel.com P: Hal Rosenstock M: hal.rosenstock at gmail.com L: general at lists.openfabrics.org W: http://www.openib.org/ T: git kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git S: Supported IPATH DRIVER: P: Ralph Campbell M: infinipath at qlogic.com L: general at lists.openfabrics.org T: git git://git.qlogic.com/ipath-linux-2.6 S: Supported NETEFFECT IWARP RNIC DRIVER (IW_NES) P: Faisal Latif M: flatif at neteffect.com P: Nishi Gupta M: ngupta at neteffect.com P: Glenn Streiff M: gstreiff at neteffect.com L: general at lists.openfabrics.org W: http://www.neteffect.com S: Supported F: drivers/infiniband/hw/nes/ AMSO1100 RNIC DRIVER P: Tom Tucker M: tom at opengridcomputing.com P: Steve Wise M: swise at opengridcomputing.com L: general at lists.openfabrics.org S: Maintained From rdreier at cisco.com Mon May 19 15:12:15 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 15:12:15 -0700 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: <483199EC.7070900@Voltaire.COM> (Moni Shoua's message of "Mon, 19 May 2008 18:17:00 +0300") References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> <483199EC.7070900@Voltaire.COM> Message-ID: By the way: > + struct ib_sa_sm_ah *sm_ah; > + > + spin_lock_irqsave(&port->ah_lock, flags); > + sm_ah = port->sm_ah; > + port->sm_ah = NULL; > + spin_unlock_irqrestore(&port->ah_lock, flags); > + > + if (sm_ah) > + kref_put(&sm_ah->ref, free_sm_ah); Is there some reason why this can't be simpler like: spin_lock_irqsave(&port->ah_lock, flags); if (port->sm_ah) kref_put(&port->sm_ah->ref, free_sm_ah); port->sm_ah = NULL; spin_unlock_irqrestore(&port->ah_lock, flags); I guess the same cleanup applies to update_sm_ah(), except after your patch I don't see any way that update_sm_ah() could be called with sm_ah anything but NULL, so we could drop the old_ah stuff completely there. - R. From rdreier at cisco.com Mon May 19 15:15:08 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 15:15:08 -0700 Subject: [ofa-general] Current list of Linux maintainers and their email info In-Reply-To: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> (Sean Hefty's message of "Mon, 19 May 2008 15:11:09 -0700") References: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> Message-ID: thanks Sean. I notice at least CXGB3 is missing from MAINTAINERS -- Steve, if you want to send an entry I would add it. Also if anyone wants to add an ISER entry that would be good too. I could add IP-OVER-INFINIBAND and SRP entries but not sure it's worth it since I'm already in the main INFINIBAND entry. Maybe it's worth changing "INFINIBAND SUBSYSTEM" to "INFINIBAND/IWARP/RDMA SUBSYSTEM"? - R. 
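For example, a CXGB3 entry in the same format as the excerpt above might read as follows (contacts taken from this thread; this is a suggested entry, not one that has been merged):

CXGB3 IWARP RNIC DRIVER (IW_CXGB3)
P: Steve Wise
M: swise at opengridcomputing.com
L: general at lists.openfabrics.org
S: Supported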
From rdreier at cisco.com Mon May 19 15:19:28 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 15:19:28 -0700 Subject: [ofa-general][PATCH] Re: mlx4: Completion EQ per cpu (MP support, Patch 10) In-Reply-To: <48319B4C.7040309@mellanox.co.il> (Yevgeny Petrilin's message of "Mon, 19 May 2008 18:22:52 +0300") References: <40FA0A8088E8A441973D37502F00933E3A24@mtlexch01.mtl.com> <48319B4C.7040309@mellanox.co.il> Message-ID: > We can add the multiple interrupt vectors support in two stages: > 1. The low level driver can create multiple interrupt vectors. Their name would include a > serial number from 0 to #CPU's-1. The number of completion vectors can > be populated through ib_device.num_comp_vectors. Then each ulp can ask for a specific > completion vector when creating CQ, which means that passing vector=0 while creating CQ > will assign it to completion vector #0. > > 2. As the second stage, we can create a "don't care" value which would mean that the driver > can attach the CQ to any completion vector. In this case the policy shouldn't necessarily be > round-robin. We can manage the number of "clients" for each completion vector and then assign the CQ > to the least busy one. this makes sense. However I think we need to come up with some mechanism where a ULP or application can assign some semantic value to the CQ event vector it chooses. Maybe a new verb is required. - R. From rdreier at cisco.com Mon May 19 15:23:03 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 15:23:03 -0700 Subject: [ofa-general] Re: [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: <483022BB.9060004@Voltaire.COM> (Moni Shoua's message of "Sun, 18 May 2008 15:36:11 +0300") References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> Message-ID: > The purpose of this patch is to make the events that are related to SM change > (namely CLIENT_REREGISTER event and SM_CHANGE event) less disruptive. > When SM related events are handled, it is not necessary to flush unicast > info from device but only multicast info. This patch divides the events that are > handled by IPoIB to three categories; 0, 1 and 2 (when 2 does more than 1 and 1 > does more than 0). I see two issues with this patch: - Is it architecturally guaranteed by the IB spec that flushing unicast info is not required on an SM change or client reregister event? - The implementation looks to make maintainability somewhat harder, since it's not very clear what level 0, 1, and 2 events really mean. I suggest using some symbolic names (maybe bitmasks that are |ed together?) to make it explicit what is being flushed. - R. From rdreier at cisco.com Mon May 19 15:24:04 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 15:24:04 -0700 Subject: [ofa-general] [PATCH v2] IB/mlx4: Add send with invalidate support In-Reply-To: <1211102098.6963.14.camel@eli-laptop> (Eli Cohen's message of "Sun, 18 May 2008 12:14:58 +0300") References: <1210836027.18385.2.camel@mtls03> <482FC5D9.7060009@voltaire.com> <1211102098.6963.14.camel@eli-laptop> Message-ID: > No reason for them to be different. Roland already suggested to use a > union here although he defines the union locally inside the containing > struct thus he has two definitions for the same union. Roland do you > intend to commit that? I can if everyone agrees with it.
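For reference, the union in question is roughly the following -- a hedged sketch of the ib_verbs.h change being discussed, declared once in the send work request and once in the work completion (the field names are assumptions based on this thread, not necessarily the final ones):

	/* inside struct ib_send_wr */
	union {
		__be32	imm_data;		/* IB_WR_SEND_WITH_IMM */
		u32	invalidate_rkey;	/* send with invalidate */
	} ex;

	/* and the same union repeated inside struct ib_wc */
	union {
		__be32	imm_data;
		u32	invalidate_rkey;
	} ex;

A named top-level union would work mechanically, but as the follow-up below notes, there is no obviously good way to describe it independently, so the definition ends up duplicated in both containing structures.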
I can't think of a good way to describe the union independently, so I think I'll keep it as being duplicated between the WR and completion structures. - R. From rdreier at cisco.com Mon May 19 15:27:21 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 15:27:21 -0700 Subject: Selective signalling vs WQE size - was Re: [ofa-general] mthca max_sge value... ugh. In-Reply-To: (Thomas Talpey's message of "Mon, 19 May 2008 09:14:32 -0400") References: Message-ID: > >if we can't use the "WQE shrinking" feature (because of selective > >signaling in the NFS/RDMA case), and we want to use 32 sge entries, then > >the WQE size 's' will end up a little more than 512 bytes, and the > >wqe_shift will end up as 10. > > Can you elaborate on this? The NFS/RDMA client does selective signalling > on its send queue in order to save on interrupts and CQE generation/handling. > Which I always thought was a (very) good approach. Because the RPC > request/response paradigm guarantees an eventual receive completion, > we simply defer (or even completely avoid) this work. > > Would that be a bad trade if it takes a WQE management opportunity away > from the provider? It's quite easy to change this in the NFS/RDMA code, > or make it a selectable parameter. mlx4 has a feature that lets the driver post smaller WQEs to the send queue if not all s/g entries are used. But the current implementation at least can only use this feature if selective signaling is off. So it's a tradeoff -- more work completions or bigger data structures for the HCA to fetch. In the NFS/RDMA case I would expect the selective signaling to be a win, but I guess the thing to do is try ConnectX without selective signaling and see which wins. - R. From robert.j.woodruff at intel.com Mon May 19 15:28:15 2008 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 19 May 2008 15:28:15 -0700 Subject: [ofa-general] RE: Current list of Linux maintainers and their email info In-Reply-To: <4831F965.6060607@opengridcomputing.com> References: <4831F965.6060607@opengridcomputing.com> Message-ID: Steve Wise wrote, >Can you send the list of components? Here is what I have so far as the list of kernel and userspace components. Kernel Components Core kernel drivers, infiniband/core Sean Hefty, sean.hefty at intel.com Roland Dreier, rdreier at cisco.com Hardware Drivers: Mellanox HCA drivers, infiniband/hw/mthca, infiniband/hw/mlx4 Qlogic HCA driver, infiniband/hw/ipath NetEffect RNIC driver, infiniband/hw/nes IBM HCA, infiniband/hw/ehca Chelsio RNIC, infiniband/hw/cxgb3 Upper Level Protocols IPoIB SRP iSer SDP SRPT qlgc_vnic RDS User Space Components libibverbs uDAPL IB-Bonding IB-Sim IB-Utils IB-Diags libibcm librdmacm libibcommon libibmad libibumad libipathverbs libmlx4 libmthca libnes libsdp mpi-selector mpitests mstflint mvapich mvapich2 openmpi open-iscsi opensm perftest qlvnictools qperf rds-tools sdpnetstat srptools From robert.j.woodruff at intel.com Mon May 19 15:49:43 2008 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 19 May 2008 15:49:43 -0700 Subject: [ofa-general] Current list of Linux maintainers and their email info In-Reply-To: References: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> Message-ID: Roland wrote, >Maybe it's worth changing "INFINIBAND SUBSYSTEM" to >"INFINIBAND/IWARP/RDMA SUBSYSTEM"? > - R.
Good point, probably should change to something like INFINIBAND/IWARP/RDMA SUBSYSTEM, and perhaps for the kernel components, we should base our list off of the MAINTAINERS file in the kernel tree, plus I suppose we'd have to add a couple extra entries in our list for the things (like SDP) that are in OFED, but not upstream. From swise at opengridcomputing.com Mon May 19 16:04:19 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 19 May 2008 18:04:19 -0500 Subject: [ofa-general] [PATCH] Add cxgb3 and iw_cxgb3 maintainers. Message-ID: <20080519230419.5000.56974.stgit@dell3.ogc.int> Signed-off-by: Steve Wise --- MAINTAINERS | 14 ++++++++++++++ 1 files changed, 14 insertions(+), 0 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index c68a118..11453eb 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1239,6 +1239,20 @@ L: video4linux-list at redhat.com W: http://linuxtv.org S: Maintained +CXGB3 ETHERNET DRIVER (CXGB3) +P: Divy Le Ray +M: divy at chelsio.com +L: netdev at vger.kernel.org +W: http://www.chelsio.com +S: Supported + +CXGB3 IWARP RNIC DRIVER (IW_CXGB3) +P: Steve Wise +M: swise at chelsio.com +L: general at lists.openfabrics.org +W: http://www.openfabrics.org +S: Supported + CYBERPRO FB DRIVER P: Russell King M: rmk at arm.linux.org.uk From rdreier at cisco.com Mon May 19 16:10:11 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 16:10:11 -0700 Subject: [ofa-general] Re: [PATCH] Add cxgb3 and iw_cxgb3 maintainers. In-Reply-To: <20080519230419.5000.56974.stgit@dell3.ogc.int> (Steve Wise's message of "Mon, 19 May 2008 18:04:19 -0500") References: <20080519230419.5000.56974.stgit@dell3.ogc.int> Message-ID: cool with you, Divy? If so I will merge. From divy at chelsio.com Mon May 19 16:11:33 2008 From: divy at chelsio.com (Divy Le Ray) Date: Mon, 19 May 2008 16:11:33 -0700 Subject: [ofa-general] Re: [PATCH] Add cxgb3 and iw_cxgb3 maintainers. In-Reply-To: References: <20080519230419.5000.56974.stgit@dell3.ogc.int> Message-ID: <48320925.7070405@chelsio.com> Roland Dreier wrote: > cool with you, Divy? > > If so I will merge. > Yes, it is perfect. Thanks a lot, Divy From sashak at voltaire.com Mon May 19 18:56:52 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 20 May 2008 04:56:52 +0300 Subject: [ofa-general] Re: Current list of Linux maintainers and their email info In-Reply-To: References: <4831F965.6060607@opengridcomputing.com> Message-ID: <20080520015652.GE1183@sashak.voltaire.com> On 15:28 Mon 19 May , Woodruff, Robert J wrote: > > Here is what I have so far as the list of kernel and userspace > components. > IB-Sim > IB-Diags > libibcommon > libibmad > libibumad > opensm are "mine" - Sasha Khapyorsky Probably it would be good to unify component and package names? if so: IB-Sim -> ibsim, IB-Diags -> infiniband-diags. Sasha From eli at dev.mellanox.co.il Mon May 19 22:10:29 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Tue, 20 May 2008 08:10:29 +0300 Subject: Selective signalling vs WQE size - was Re: [ofa-general] mthca max_sge value... ugh. In-Reply-To: References: Message-ID: <1211260229.6556.18.camel@eli-laptop> On Mon, 2008-05-19 at 15:27 -0700, Roland Dreier wrote: > mlx4 has a feature that lets the driver post smaller WQEs to the send > queue if not all s/g entries are used. But the current implementation > at least can only use this feature if selective signaling is off. > > So it's a tradeoff -- more work completions or bigger data structures > for the HCA to fetch. 
In the NFS/RDMA case I would expect the selective > signaling to be a win, but I guess the thing to do is try ConnectX > without selective signaling and see which wins. > Roland, I posted a few months ago a patch that optimizes post send for selective signaling QPs. It must have slipped somehow because I did not get any reply on it and since I did not know of anyone using selective signaling I forgot about this too. The idea is that for selective signaling QPs, before you stamp the WQE, you read the value of the DS field which denotes the effective size of the descriptor as used in the previous post, and stamp only that area, relying on the fact that the rest of the descriptor is already stamped. Here is a link to the patch. I don't know if it applies cleanly now but if we agree on the idea I will generate it again against the current tree. http://lists.openfabrics.org/pipermail/general/2008-January/045071.html From npiggin at suse.de Mon May 19 22:31:46 2008 From: npiggin at suse.de (Nick Piggin) Date: Tue, 20 May 2008 07:31:46 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080516115005.GC4287@sgi.com> References: <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> <20080515235203.GB25305@wotan.suse.de> <20080516112306.GA4287@sgi.com> <20080516115005.GC4287@sgi.com> Message-ID: <20080520053145.GA19502@wotan.suse.de> On Fri, May 16, 2008 at 06:50:05AM -0500, Robin Holt wrote: > On Fri, May 16, 2008 at 06:23:06AM -0500, Robin Holt wrote: > > On Fri, May 16, 2008 at 01:52:03AM +0200, Nick Piggin wrote: > > > On Thu, May 15, 2008 at 10:33:57AM -0700, Christoph Lameter wrote: > > > > On Thu, 15 May 2008, Nick Piggin wrote: > > > > > > > > > Oh, I get that confused because of the mixed up naming conventions > > > > > there: unmap_page_range should actually be called zap_page_range. But > > > > > at any rate, yes we can easily zap pagetables without holding mmap_sem. > > > > > > > > How is that synchronized with code that walks the same pagetable. These > > > > walks may not hold mmap_sem either. I would expect that one could only > > > > remove a portion of the pagetable where we have some sort of guarantee > > > > that no accesses occur. So the removal of the vma prior ensures that? > > > > > > I don't really understand the question. If you remove the pte and invalidate > > > the TLBS on the remote image's process (importing the page), then it can > > > of course try to refault the page in because it's vma is still there. But > > > you catch that refault in your driver , which can prevent the page from > > > being faulted back in. > > > > I think Christoph's question has more to do with faults that are > > in flight. A recently requested fault could have just released the > > last lock that was holding up the invalidate callout. It would then > > begin messaging back the response PFN which could still be in flight. > > The invalidate callout would then fire and do the interrupt shoot-down > > while that response was still active (essentially beating the inflight > > response). The invalidate would clear up nothing and then the response > > would insert the PFN after it is no longer the correct PFN. > > I just looked over XPMEM. I think we could make this work. We already > have a list of active faults which is protected by a simple spinlock. 
> I would need to nest this lock within another lock protected our PFN > table (currently it is a mutex) and then the invalidate interrupt handler > would need to mark the fault as invalid (which is also currently there). > > I think my sticking points with the interrupt method remain at fault > containment and timeout. The inability of the ia64 processor to handle > provide predictive failures for the read/write of memory on other > partitions prevents us from being able to contain the failure. I don't > think we can get the information we would need to do the invalidate > without introducing fault containment issues which has been a continous > area of concern for our customers. Really? You can get the information through via a sleeping messaging API, but not a non-sleeping one? What is the difference from the hardware POV? From orenmeron at dev.mellanox.co.il Mon May 19 22:36:59 2008 From: orenmeron at dev.mellanox.co.il (Oren Meron) Date: Tue, 20 May 2008 08:36:59 +0300 Subject: [ofa-general] Current list of Linux maintainers and their email info In-Reply-To: References: Message-ID: <4832637B.6040800@dev.mellanox.co.il> Woodruff, Robert J wrote: > Hi guys, > > I am starting to put together an updated list of maintainers > that can be displayed on the OFA website, > as what is out there now is horribly out of > date. I am only collecting the info for the Linux side, > Sean and Stan will be collecting the info for the Windows stack. > > Please respond with the components that you (or the people that > you work with) maintain and the > current maintainer of that component and email info. > > I have been working with the website maintainers to put > in place a way that we can have the link to the maintainers > list be a pointer to a simple text file, so it can be > easily updated on the server when things change in the future, > but I'd like to put together the initial file and then later > it can be easily updated when things change. > > > woody > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > Hi Woody, I maintain the perftest low-level benchmarks. thanks - oren. From ogerlitz at voltaire.com Mon May 19 23:01:38 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 09:01:38 +0300 Subject: [ofa-general] Current list of Linux maintainers and their email info In-Reply-To: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> References: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> Message-ID: <48326942.7080800@voltaire.com> Sean Hefty wrote: > This is what I could find in the MAINTAINERS file for 2.6.25: I am not sure I follow why there's a need to duplicate the Linux kernel IB (RDMA) stack maintainers file at the ofa website, but if for some reason people feel this is needed I suggest having a smart link that somehow goes to Linus' tree and fetches the up-to-date info. Or. From ogerlitz at voltaire.com Mon May 19 23:05:56 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 09:05:56 +0300 Subject: [ofa-general] Re: Current list of Linux maintainers and their email info In-Reply-To: References: <4831F965.6060607@opengridcomputing.com> Message-ID: <48326A44.2080606@voltaire.com> Woodruff, Robert J wrote: > SRPT > Isn't what you call "SRPT" the target known as SCST, which also supports iSCSI?
if yes, I don't see why the string "SRP" has to be in the name. > IB-Bonding > the ib-bonding package provides the kernel bonding module Or. From Sumit.Gaur at Sun.COM Mon May 19 23:14:10 2008 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Tue, 20 May 2008 11:44:10 +0530 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <1211217595.12616.490.camel@hrosenstock-ws.xsigo.com> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> <48318575.7060701@Sun.COM> <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> <4831A618.9090806@Sun.COM> <1211213870.12616.475.camel@hrosenstock-ws.xsigo.com> <4831B519.2060002@dev.mellanox.co.il> <1211217595.12616.490.camel@hrosenstock-ws.xsigo.com> Message-ID: <48326C32.7000303@Sun.COM> Hal Rosenstock wrote: > On Mon, 2008-05-19 at 20:12 +0300, Yevgeny Kliteynik wrote: > >>Hal Rosenstock wrote: >> >>>Sumit, >>>On Mon, 2008-05-19 at 21:38 +0530, Sumit Gaur wrote: >>> >>> >>>>Hi Hal, >>>>It is true that the received packets look like proper responses, but as I >>>>mentioned before they contain TIDs that I never sent to OFED, and >>>>this causes the problem. Why OFED is sending these extra packets is the >>>>matter to investigate. >>> >>>The received packet is SM class attribute ID 4352 which is non IBA >>>standard and AFAIK OFED does not send so it likely comes from some non >>>OFED software. >> >>Just a thought: >>Decimal 4352 is 0x1100. With the endianness reversed we get 0x0011, >>which is NodeInfo, which the SM sends while sweeping the subnet, >>and which comes at regular intervals. >> >>As I said, just a thought... > > Yes, that makes sense to me. As this is an incoming response, maybe this node is running the SM as well as this application. Yes, the node is running the SM too. sminfo: sm lid 1 sm guid 0x3ba00534f000d, activity count 1213884 priority 7 state 3 SMINFO_MASTER Now it looks like we are going in the right direction, so the extra packets are incoming SM packets. The same question arises again: how can we identify and filter these incoming SM packets from the regular responses in the application? > > -- Hal > >>-- Yevgeny >> >>>As far as why it is being received, it is a response to a class your >>>application is subscribed to so it passes it through. >>> From ogerlitz at voltaire.com Mon May 19 23:37:12 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 09:37:12 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4831A0DF.2070603@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <48301F74.4020905@voltaire.com> <4831347A.1010506@voltaire.com> <4831A0DF.2070603@opengridcomputing.com> Message-ID: <48327198.7080305@voltaire.com> Steve Wise wrote: >> dma mapping would work too but then handling the map/unmap becomes an >> issue. I think it is way too complicated to add new verbs for >> map/unmap fastreg page list (in addition to the alloc/free fastreg page >> list that we are already adding) and force the consumer to do it.
And >> if we expect the low-level driver to do it, then the map is easy (can be >> done while posting the send) but the unmap is a pain -- it would have to >> be done inside poll_cq when reaping the completion, and the low-level >> driver would have to keep some complicated extra data structure to go >> back from the completion to the original fast reg page list structure. >> > And certain platforms can fail map requests (like PPC64) because they > have limited resources for dma mapping. So then you'd fail a SQ work > request when you might not want to... I see the point in allocating the page lists in dma consistent memory to make the mechanics of letting the HCA DMA the list easier and simpler, as I think Roland is suggesting in his post. However, I am not sure I understand how this helps in the PPC64 case; if the HCA does DMA to fetch the list, then IOMMU slots have to be consumed one way or another, correct? Or. From ogerlitz at voltaire.com Mon May 19 23:41:29 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 09:41:29 +0300 Subject: [ofa-general] Need help--Failed to build ofa-kernel.rpm while installing OFED1.3 on RHEL5.1 with Kernel 2.6.25-rc3/rc8 In-Reply-To: References: Message-ID: <48327299.6060303@voltaire.com> Joe Li wrote: > > I am a newbie to openfabric and I have an issue here which needs your > help. When trying to install OFED1.3 on kernel 2.6.25-rc3 or kernel > 2.6.25-rc8, I get an ofa-kernel rpm build error: > May I ask for what reason on earth you need the ofa-kernel rpm on top of a kernel whose IB stack is newer than the contents of the package? Or. From ogerlitz at voltaire.com Mon May 19 23:45:31 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 09:45:31 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: References: <483049BF.4050603@dev.mellanox.co.il> <15ddcffd0805190744uaf4710ah9a00a5653fe9f45a@mail.gmail.com> <15ddcffd0805190755ka46eb48p523272b1e10f00c9@mail.gmail.com> Message-ID: <4832738B.7070105@voltaire.com> Roland Dreier wrote: > You misunderstood the patch I think (unless I did). By default new > kernel + new firmware gets the smaller page size. > OK, thanks, I guess the change-log can be changed to make this point clearer. Actually, thinking on this a little further, if the (say 2.6.27/28 or later) mlx4 driver is going to support the memory extensions, maybe we could remove from it the support for the proprietary FMRs anyway at that point? Or. From tziporet at dev.mellanox.co.il Tue May 20 00:58:01 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 20 May 2008 10:58:01 +0300 Subject: [ofa-general] Re: Current list of Linux maintainers and their email info In-Reply-To: <48326A44.2080606@voltaire.com> References: <4831F965.6060607@opengridcomputing.com> <48326A44.2080606@voltaire.com> Message-ID: <48328489.2030305@mellanox.co.il> Or Gerlitz wrote: > Woodruff, Robert J wrote: >> SRPT >> > Isn't what you call "SRPT" the target known as SCST, which also supports > iSCSI? if yes, I don't see why the string "SRP" has to be in the > name.
>> IB-Bonding >> > the ib-bonding package provides the kernel bonding module > But I think it will be good to know who is the maintainer from the IB side (at least for OFED users) Tziporet From ogerlitz at voltaire.com Tue May 20 01:03:40 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 11:03:40 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <20080516223419.27221.49014.stgit@dell3.ogc.int> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> Message-ID: <483285DC.20003@voltaire.com> Steve Wise wrote: > Support for the IB BMME and iWARP equivalent memory extensions to > non shared memory regions. Usage Model: > - MR allocated with ib_alloc_mr() > - Page lists allocated via ib_alloc_fast_reg_page_list(). > - MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) > - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) Steve, I am trying to further understand what would be a real life ULP design here, and I think there are some more issues to clarify/define for the case of ULP which has to create a mapping for a list of pages and send this mapping (eg IB/rkey iWARP/stag) to a remote party that uses it for RDMA. AFAIK, the idea was to let the ulp post --two-- work requests, where the first creates the mapping and the second sends this mapping to the remote side, such that the second does not start before the first completes (i.e a fence). Now, the above scheme means that the ulp knows the value of the rkey/stag at the time of posting these two work requests (since it has to encode it in the second one), so something has to be clarified re the rkey/stag here, do they change each time this MR is used? how many bits can be changed, etc. I guess my questions are to some extent RTFM ones, but, first, with some quick looking in the IB spec I did not manage to get enough answers (pointers appreciated...) and second, you are proposing an implementation here, so I think it makes sense to review the actual usage model to see all aspects needed for ULPs are covered... Talking on usage, do you plan to patch the mainline nfs-rdma code to use these verbs? Or. > - MR deallocated with ib_dereg_mr() > - page lists dealloced via ib_free_fast_reg_page_list() From kliteyn at mellanox.co.il Tue May 20 01:05:26 2008 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 20 May 2008 11:05:26 +0300 Subject: [ofa-general] Re: [PATCH] opensm: remove unused pfn_ui_* callback options In-Reply-To: <20080519112125.3ec8ae61.weiny2@llnl.gov> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080519170303.GI4616@sashak.voltaire.com> <20080519170839.GK4616@sashak.voltaire.com> <20080519170916.GL4616@sashak.voltaire.com> <20080519112125.3ec8ae61.weiny2@llnl.gov> Message-ID: <48328646.1000403@mellanox.co.il> Ira Weiny wrote: > Could these be used by some Windows add on? > I'm not aware of any usage of these callbacks. -- Yevgeny > Not that it matters... > > Ira > > > On Mon, 19 May 2008 20:09:16 +0300 > Sasha Khapyorsky wrote: > > >> Remove unused pfn_ui_pre_lid_assign and pfn_ui_mcast_fdb_assign callbacks >> from OpenSM subnet options. 
>> >> Signed-off-by: Sasha Khapyorsky >> --- >> opensm/include/opensm/osm_subnet.h | 20 ------------------ >> opensm/opensm/osm_lid_mgr.c | 7 ------ >> opensm/opensm/osm_mcast_mgr.c | 40 ++++++----------------------------- >> opensm/opensm/osm_subnet.c | 4 --- >> 4 files changed, 7 insertions(+), 64 deletions(-) >> >> diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h >> index daab453..56b0165 100644 >> --- a/opensm/include/opensm/osm_subnet.h >> +++ b/opensm/include/opensm/osm_subnet.h >> @@ -248,10 +248,6 @@ typedef struct _osm_subn_opt { >> uint16_t console_port; >> cl_map_t port_prof_ignore_guids; >> boolean_t port_profile_switch_nodes; >> - osm_pfn_ui_extension_t pfn_ui_pre_lid_assign; >> - void *ui_pre_lid_assign_ctx; >> - osm_pfn_ui_mcast_extension_t pfn_ui_mcast_fdb_assign; >> - void *ui_mcast_fdb_assign_ctx; >> boolean_t sweep_on_trap; >> char *routing_engine_name; >> boolean_t connect_roots; >> @@ -412,22 +408,6 @@ typedef struct _osm_subn_opt { >> * If TRUE will count the number of switch nodes routed through >> * the link. If FALSE - only CA/RT nodes are counted. >> * >> -* pfn_ui_pre_lid_assign >> -* A UI function to be invoked prior to lid assigment. It should >> -* return 1 if any change was made to any lid or 0 otherwise. >> -* >> -* ui_pre_lid_assign_ctx >> -* A UI context (void *) to be provided to the pfn_ui_pre_lid_assign >> -* >> -* pfn_ui_mcast_fdb_assign >> -* A UI function to be called inside the mcast manager instead of >> -* the call for the build spanning tree. This will be called on >> -* every multicast call for create, join and leave, and is >> -* responsible for the mcast FDB configuration. >> -* >> -* ui_mcast_fdb_assign_ctx >> -* A UI context (void *) to be provided to the pfn_ui_mcast_fdb_assign >> -* >> * sweep_on_trap >> * Received traps will initiate a new sweep. >> * >> diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c >> index af0d020..7f25750 100644 >> --- a/opensm/opensm/osm_lid_mgr.c >> +++ b/opensm/opensm/osm_lid_mgr.c >> @@ -1212,13 +1212,6 @@ osm_signal_t osm_lid_mgr_process_sm(IN osm_lid_mgr_t * const p_mgr) >> persistent db */ >> __osm_lid_mgr_init_sweep(p_mgr); >> >> - if (p_mgr->p_subn->opt.pfn_ui_pre_lid_assign) { >> - OSM_LOG(p_mgr->p_log, OSM_LOG_VERBOSE, >> - "Invoking UI function pfn_ui_pre_lid_assign\n"); >> - p_mgr->p_subn->opt.pfn_ui_pre_lid_assign(p_mgr->p_subn->opt. >> - ui_pre_lid_assign_ctx); >> - } >> - >> /* Set the send_set_reqs of the p_mgr to FALSE, and >> we'll see if any set requests were sent. If not - >> can signal OSM_SIGNAL_DONE */ >> diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c >> index 683a16d..a6185fe 100644 >> --- a/opensm/opensm/osm_mcast_mgr.c >> +++ b/opensm/opensm/osm_mcast_mgr.c >> @@ -1085,7 +1085,6 @@ osm_mcast_mgr_process_tree(osm_sm_t * sm, >> { >> ib_api_status_t status = IB_SUCCESS; >> ib_net16_t mlid; >> - boolean_t ui_mcast_fdb_assign_func_defined; >> >> OSM_LOG_ENTER(sm->p_log); >> >> @@ -1107,44 +1106,19 @@ osm_mcast_mgr_process_tree(osm_sm_t * sm, >> goto Exit; >> } >> >> - if (sm->p_subn->opt.pfn_ui_mcast_fdb_assign) >> - ui_mcast_fdb_assign_func_defined = TRUE; >> - else >> - ui_mcast_fdb_assign_func_defined = FALSE; >> - >> /* >> Clear the multicast tables to start clean, then build >> the spanning tree which sets the mcast table bits for each >> port in the group. 
>> - We will clean the multicast tables if a ui_mcast function isn't >> - defined, or if such function is defined, but we got here >> - through a MC_CREATE request - this means we are creating a new >> - multicast group - clean all old data. >> */ >> - if (ui_mcast_fdb_assign_func_defined == FALSE || >> - req_type == OSM_MCAST_REQ_TYPE_CREATE) >> - __osm_mcast_mgr_clear(sm, p_mgrp); >> - >> - /* If a UI function is defined, then we will call it here. >> - If not - the use the regular build spanning tree function */ >> - if (ui_mcast_fdb_assign_func_defined == FALSE) { >> - status = __osm_mcast_mgr_build_spanning_tree(sm, p_mgrp); >> - if (status != IB_SUCCESS) { >> - OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A17: " >> - "Unable to create spanning tree (%s)\n", >> - ib_get_err_str(status)); >> - goto Exit; >> - } >> - } else { >> - if (osm_log_is_active(sm->p_log, OSM_LOG_DEBUG)) { >> - OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >> - "Invoking UI function pfn_ui_mcast_fdb_assign\n"); >> - } >> + __osm_mcast_mgr_clear(sm, p_mgrp); >> >> - sm->p_subn->opt.pfn_ui_mcast_fdb_assign(sm->p_subn->opt. >> - ui_mcast_fdb_assign_ctx, >> - mlid, req_type, >> - port_guid); >> + status = __osm_mcast_mgr_build_spanning_tree(sm, p_mgrp); >> + if (status != IB_SUCCESS) { >> + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A17: " >> + "Unable to create spanning tree (%s)\n", >> + ib_get_err_str(status)); >> + goto Exit; >> } >> >> Exit: >> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c >> index a916270..2191f2d 100644 >> --- a/opensm/opensm/osm_subnet.c >> +++ b/opensm/opensm/osm_subnet.c >> @@ -453,10 +453,6 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) >> p_opt->qos_policy_file = OSM_DEFAULT_QOS_POLICY_FILE; >> p_opt->accum_log_file = TRUE; >> p_opt->port_profile_switch_nodes = FALSE; >> - p_opt->pfn_ui_pre_lid_assign = NULL; >> - p_opt->ui_pre_lid_assign_ctx = NULL; >> - p_opt->pfn_ui_mcast_fdb_assign = NULL; >> - p_opt->ui_mcast_fdb_assign_ctx = NULL; >> p_opt->sweep_on_trap = TRUE; >> p_opt->routing_engine_name = NULL; >> p_opt->connect_roots = FALSE; >> -- >> 1.5.4.rc2.60.gb2e62 >> >> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From vlad at dev.mellanox.co.il Tue May 20 01:08:17 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 20 May 2008 11:08:17 +0300 Subject: [ofa-general] [PATCH v1] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size Message-ID: <483286F1.2080406@dev.mellanox.co.il> From a2df38ebba98611e24336c9e4ac4f709224aeadc Mon Sep 17 00:00:00 2001 From: Vladimir Sokolovsky Date: Sun, 18 May 2008 11:25:55 +0300 Subject: [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size There is a bug in the OFED 1.3 mlx4 driver in mlx4_alloc_fmr which hardcoded the minimum acceptable page_shift to be 12. However, new mlx4 firmware has a minimum page_shift of 9 (log_pg_sz of 9 returned by QUERY_DEV_LIM) -- so that ib_fmr_alloc fails for ULPs using the device minimum when creating FMRs. To preserve firmware compatibility with released OFED drivers, the firmware will continue to return 12 as before for log_page_sz in QUERY_DEV_CAP for these drivers. 
However, to enable new drivers to take advantage of the available smaller page size, the mlx4 driver now first sets the log_pg_sz to the device minimum via the MOD_STAT_CFG() command, and only then calls QUERY_DEV_CAP(). The QUERY_DEV_CAP() command then returns the new (lower) log_pg_sz value. Signed-off-by: Jack Morgenstein Signed-off-by: Vladimir Sokolovsky --- drivers/net/mlx4/fw.c | 28 ++++++++++++++++++++++++++++ drivers/net/mlx4/fw.h | 6 ++++++ drivers/net/mlx4/main.c | 7 +++++++ 3 files changed, 41 insertions(+), 0 deletions(-) diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c index d82f275..2b5006b 100644 --- a/drivers/net/mlx4/fw.c +++ b/drivers/net/mlx4/fw.c @@ -101,6 +101,34 @@ static void dump_dev_cap_flags(struct mlx4_dev *dev, u32 flags) mlx4_dbg(dev, " %s\n", fname[i]); } +int mlx4_MOD_STAT_CFG(struct mlx4_dev *dev, struct mlx4_mod_stat_cfg *cfg) +{ + struct mlx4_cmd_mailbox *mailbox; + u32 *inbox; + int err = 0; + +#define MOD_STAT_CFG_IN_SIZE 0x100 + +#define MOD_STAT_CFG_PG_SZ_M_OFFSET 0x002 +#define MOD_STAT_CFG_PG_SZ_OFFSET 0x003 + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + inbox = mailbox->buf; + + memset(inbox, 0, MOD_STAT_CFG_IN_SIZE); + + MLX4_PUT(inbox, cfg->log_pg_sz, MOD_STAT_CFG_PG_SZ_OFFSET); + MLX4_PUT(inbox, cfg->log_pg_sz_m, MOD_STAT_CFG_PG_SZ_M_OFFSET); + + err = mlx4_cmd(dev, mailbox->dma, 0, 0, MLX4_CMD_MOD_STAT_CFG, + MLX4_CMD_TIME_CLASS_A); + + mlx4_free_cmd_mailbox(dev, mailbox); + return err; +} + int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap) { struct mlx4_cmd_mailbox *mailbox; diff --git a/drivers/net/mlx4/fw.h b/drivers/net/mlx4/fw.h index 306cb9b..a0e046c 100644 --- a/drivers/net/mlx4/fw.h +++ b/drivers/net/mlx4/fw.h @@ -38,6 +38,11 @@ #include "mlx4.h" #include "icm.h" +struct mlx4_mod_stat_cfg { + u8 log_pg_sz; + u8 log_pg_sz_m; +}; + struct mlx4_dev_cap { int max_srq_sz; int max_qp_sz; @@ -162,5 +167,6 @@ int mlx4_SET_ICM_SIZE(struct mlx4_dev *dev, u64 icm_size, u64 *aux_pages); int mlx4_MAP_ICM_AUX(struct mlx4_dev *dev, struct mlx4_icm *icm); int mlx4_UNMAP_ICM_AUX(struct mlx4_dev *dev); int mlx4_NOP(struct mlx4_dev *dev); +int mlx4_MOD_STAT_CFG(struct mlx4_dev *dev, struct mlx4_mod_stat_cfg *cfg); #endif /* MLX4_FW_H */ diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index a6aa49f..d373601 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -485,6 +485,7 @@ static int mlx4_init_hca(struct mlx4_dev *dev) struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_adapter adapter; struct mlx4_dev_cap dev_cap; + struct mlx4_mod_stat_cfg mlx4_cfg; struct mlx4_profile profile; struct mlx4_init_hca_param init_hca; u64 icm_size; @@ -502,6 +503,12 @@ static int mlx4_init_hca(struct mlx4_dev *dev) return err; } + mlx4_cfg.log_pg_sz_m = 1; + mlx4_cfg.log_pg_sz = 0; + err = mlx4_MOD_STAT_CFG(dev, &mlx4_cfg); + if (err) + mlx4_warn(dev, "Failed to override log_pg_sz parameter\n"); + err = mlx4_dev_cap(dev, &dev_cap); if (err) { mlx4_err(dev, "QUERY_DEV_CAP command failed, aborting.\n"); -- 1.5.5.1 From ogerlitz at voltaire.com Tue May 20 02:09:32 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 12:09:32 +0300 Subject: [ofa-general] Re: Current list of Linux maintainers and their email info In-Reply-To: <48328489.2030305@mellanox.co.il> References: <4831F965.6060607@opengridcomputing.com> <48326A44.2080606@voltaire.com> <48328489.2030305@mellanox.co.il> Message-ID: <4832954C.2080209@voltaire.com> Tziporet Koren wrote: > 
But I think it will be good to know who is the maintainer from the IB > side (at least for OFED users) The mainline maintainer info of bonding is: BONDING DRIVER P: Jay Vosburgh M: fubar at us.ibm.com L: bonding-devel at lists.sourceforge.net W: http://sourceforge.net/projects/bonding/ S: Supported You need to ask him in case you intend to copy this record from the maintainers file, or ask Moni Shoua if you can list him as a contact for issues not related directly to the mainline driver. Or. From ogerlitz at voltaire.com Tue May 20 02:12:37 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 12:12:37 +0300 Subject: [ofa-general] [PATCH v1] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: <483286F1.2080406@dev.mellanox.co.il> References: <483286F1.2080406@dev.mellanox.co.il> Message-ID: <48329605.80704@voltaire.com> Vladimir Sokolovsky wrote: > There is a bug in the OFED 1.3 mlx4 driver in mlx4_alloc_fmr which > hardcoded > the minimum acceptable page_shift to be 12. However, new mlx4 firmware > has a > minimum page_shift of 9 (log_pg_sz of 9 returned by QUERY_DEV_LIM) -- > so that > ib_fmr_alloc fails for ULPs using the device minimum when creating FMRs. Please remove the word OFED from the change-log; the bug is in the mlx4 driver, period, it was not added by patches merged into ofed. Also replace "mlx4 firmware" with "ConnectX firmware", as mlx4 is not the name of any HW product and this text can be confusing. Or. From tziporet at dev.mellanox.co.il Tue May 20 02:56:13 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 20 May 2008 12:56:13 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: <4832738B.7070105@voltaire.com> References: <483049BF.4050603@dev.mellanox.co.il> <15ddcffd0805190744uaf4710ah9a00a5653fe9f45a@mail.gmail.com> <15ddcffd0805190755ka46eb48p523272b1e10f00c9@mail.gmail.com> <4832738B.7070105@voltaire.com> Message-ID: <4832A03D.6000103@mellanox.co.il> Or Gerlitz wrote: > Roland Dreier wrote: >> You misunderstood the patch I think (unless I did). By default new >> kernel + new firmware gets the smaller page size. >> > OK, thanks, I guess the change-log can be changed to make this point > clearer. > > Actually, thinking on this a little further, if the (say 2.6.27/28 or > later) mlx4 driver is going to support the memory extensions, maybe we > could remove from it the support for the proprietary FMRs anyway at > that point? > We plan to add the memory extension for 2.6.27 or 28, but this is with ConnectX only.
So if someone is using InfiniHost III they will still need the FMRs Tziporet From holt at sgi.com Tue May 20 03:01:11 2008 From: holt at sgi.com (Robin Holt) Date: Tue, 20 May 2008 05:01:11 -0500 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080520053145.GA19502@wotan.suse.de> References: <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> <20080515235203.GB25305@wotan.suse.de> <20080516112306.GA4287@sgi.com> <20080516115005.GC4287@sgi.com> <20080520053145.GA19502@wotan.suse.de> Message-ID: <20080520100111.GC30341@sgi.com> On Tue, May 20, 2008 at 07:31:46AM +0200, Nick Piggin wrote: > On Fri, May 16, 2008 at 06:50:05AM -0500, Robin Holt wrote: > > On Fri, May 16, 2008 at 06:23:06AM -0500, Robin Holt wrote: > > > On Fri, May 16, 2008 at 01:52:03AM +0200, Nick Piggin wrote: > > > > On Thu, May 15, 2008 at 10:33:57AM -0700, Christoph Lameter wrote: > > > > > On Thu, 15 May 2008, Nick Piggin wrote: > > > > > > > > > > > Oh, I get that confused because of the mixed up naming conventions > > > > > > there: unmap_page_range should actually be called zap_page_range. But > > > > > > at any rate, yes we can easily zap pagetables without holding mmap_sem. > > > > > > > > > > How is that synchronized with code that walks the same pagetable. These > > > > > walks may not hold mmap_sem either. I would expect that one could only > > > > > remove a portion of the pagetable where we have some sort of guarantee > > > > > that no accesses occur. So the removal of the vma prior ensures that? > > > > > > > > I don't really understand the question. If you remove the pte and invalidate > > > > the TLBS on the remote image's process (importing the page), then it can > > > > of course try to refault the page in because it's vma is still there. But > > > > you catch that refault in your driver , which can prevent the page from > > > > being faulted back in. > > > > > > I think Christoph's question has more to do with faults that are > > > in flight. A recently requested fault could have just released the > > > last lock that was holding up the invalidate callout. It would then > > > begin messaging back the response PFN which could still be in flight. > > > The invalidate callout would then fire and do the interrupt shoot-down > > > while that response was still active (essentially beating the inflight > > > response). The invalidate would clear up nothing and then the response > > > would insert the PFN after it is no longer the correct PFN. > > > > I just looked over XPMEM. I think we could make this work. We already > > have a list of active faults which is protected by a simple spinlock. > > I would need to nest this lock within another lock protected our PFN > > table (currently it is a mutex) and then the invalidate interrupt handler > > would need to mark the fault as invalid (which is also currently there). > > > > I think my sticking points with the interrupt method remain at fault > > containment and timeout. The inability of the ia64 processor to handle > > provide predictive failures for the read/write of memory on other > > partitions prevents us from being able to contain the failure. I don't > > think we can get the information we would need to do the invalidate > > without introducing fault containment issues which has been a continous > > area of concern for our customers. > > Really? 
You can get the information through via a sleeping messaging API, > but not a non-sleeping one? What is the difference from the hardware POV? That was covered in the early very long discussion about 28 seconds. The read timeout for the BTE is 28 seconds and it automatically retried for certain failures. In interrupt context, that is 56 seconds without any subsequent interrupts of that or lower priority. Thanks, Robin From npiggin at suse.de Tue May 20 03:50:25 2008 From: npiggin at suse.de (Nick Piggin) Date: Tue, 20 May 2008 12:50:25 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080520100111.GC30341@sgi.com> References: <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> <20080515235203.GB25305@wotan.suse.de> <20080516112306.GA4287@sgi.com> <20080516115005.GC4287@sgi.com> <20080520053145.GA19502@wotan.suse.de> <20080520100111.GC30341@sgi.com> Message-ID: <20080520105025.GA25791@wotan.suse.de> On Tue, May 20, 2008 at 05:01:11AM -0500, Robin Holt wrote: > On Tue, May 20, 2008 at 07:31:46AM +0200, Nick Piggin wrote: > > > > Really? You can get the information through via a sleeping messaging API, > > but not a non-sleeping one? What is the difference from the hardware POV? > > That was covered in the early very long discussion about 28 seconds. > The read timeout for the BTE is 28 seconds and it automatically retried > for certain failures. In interrupt context, that is 56 seconds without > any subsequent interrupts of that or lower priority. I thought you said it would be possible to get the required invalidate information without using the BTE. Couldn't you use XPMEM pages in the kernel to read the data out of, if nothing else? From holt at sgi.com Tue May 20 04:05:29 2008 From: holt at sgi.com (Robin Holt) Date: Tue, 20 May 2008 06:05:29 -0500 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080520105025.GA25791@wotan.suse.de> References: <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> <20080515235203.GB25305@wotan.suse.de> <20080516112306.GA4287@sgi.com> <20080516115005.GC4287@sgi.com> <20080520053145.GA19502@wotan.suse.de> <20080520100111.GC30341@sgi.com> <20080520105025.GA25791@wotan.suse.de> Message-ID: <20080520110528.GD30341@sgi.com> On Tue, May 20, 2008 at 12:50:25PM +0200, Nick Piggin wrote: > On Tue, May 20, 2008 at 05:01:11AM -0500, Robin Holt wrote: > > On Tue, May 20, 2008 at 07:31:46AM +0200, Nick Piggin wrote: > > > > > > Really? You can get the information through via a sleeping messaging API, > > > but not a non-sleeping one? What is the difference from the hardware POV? > > > > That was covered in the early very long discussion about 28 seconds. > > The read timeout for the BTE is 28 seconds and it automatically retried > > for certain failures. In interrupt context, that is 56 seconds without > > any subsequent interrupts of that or lower priority. > > I thought you said it would be possible to get the required invalidate > information without using the BTE. Couldn't you use XPMEM pages in > the kernel to read the data out of, if nothing else? I was wrong about that. I thought it was safe to do an uncached write, but it turns out any processor write is uncontained and the MCA that surfaces would be fatal. Likewise for the uncached read. 
From npiggin at suse.de Tue May 20 04:14:24 2008 From: npiggin at suse.de (Nick Piggin) Date: Tue, 20 May 2008 13:14:24 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080520110528.GD30341@sgi.com> References: <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> <20080515235203.GB25305@wotan.suse.de> <20080516112306.GA4287@sgi.com> <20080516115005.GC4287@sgi.com> <20080520053145.GA19502@wotan.suse.de> <20080520100111.GC30341@sgi.com> <20080520105025.GA25791@wotan.suse.de> <20080520110528.GD30341@sgi.com> Message-ID: <20080520111424.GB25791@wotan.suse.de> On Tue, May 20, 2008 at 06:05:29AM -0500, Robin Holt wrote: > On Tue, May 20, 2008 at 12:50:25PM +0200, Nick Piggin wrote: > > On Tue, May 20, 2008 at 05:01:11AM -0500, Robin Holt wrote: > > > On Tue, May 20, 2008 at 07:31:46AM +0200, Nick Piggin wrote: > > > > > > > > Really? You can get the information through via a sleeping messaging API, > > > > but not a non-sleeping one? What is the difference from the hardware POV? > > > > > > That was covered in the early very long discussion about 28 seconds. > > > The read timeout for the BTE is 28 seconds and it automatically retried > > > for certain failures. In interrupt context, that is 56 seconds without > > > any subsequent interrupts of that or lower priority. > > > > I thought you said it would be possible to get the required invalidate > > information without using the BTE. Couldn't you use XPMEM pages in > > the kernel to read the data out of, if nothing else? > > I was wrong about that. I thought it was safe to do an uncached write, > but it turns out any processor write is uncontained and the MCA that > surfaces would be fatal. Likewise for the uncached read. Oh, so the BTE transfer is purely for fault isolation. I was thinking you guys might have sufficient control of the hardware to be able to do it at the level of CPU memory operations, but if it is some limitation of ia64, then I guess that's a problem. How do you do fault isolation of userspace XPMEM accesses? From ogerlitz at voltaire.com Tue May 20 04:19:35 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 14:19:35 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: <4832A03D.6000103@mellanox.co.il> References: <483049BF.4050603@dev.mellanox.co.il> <15ddcffd0805190744uaf4710ah9a00a5653fe9f45a@mail.gmail.com> <15ddcffd0805190755ka46eb48p523272b1e10f00c9@mail.gmail.com> <4832738B.7070105@voltaire.com> <4832A03D.6000103@mellanox.co.il> Message-ID: <4832B3C7.1060602@voltaire.com> Tziporet Koren wrote: > Or Gerlitz wrote: >> Actually, thinking on this a little further, if the (say 2.6.27/28 or >> later) mlx4 driver is going to support the memory extensions, maybe >> we could remove from it the support for the proprietary FMRs anyway >> at that point? > We plan to add the memory extension for 2.6.27 or 28, but this is with > ConnectX only. > So if someone is using InfiniHost III they will still need the FMRs > Sure, if it was not clear, I said remove it from the --mlx4-- driver and not from the core/mthca Or.
From holt at sgi.com Tue May 20 04:26:35 2008 From: holt at sgi.com (Robin Holt) Date: Tue, 20 May 2008 06:26:35 -0500 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080520111424.GB25791@wotan.suse.de> References: <20080515075747.GA7177@wotan.suse.de> <20080515235203.GB25305@wotan.suse.de> <20080516112306.GA4287@sgi.com> <20080516115005.GC4287@sgi.com> <20080520053145.GA19502@wotan.suse.de> <20080520100111.GC30341@sgi.com> <20080520105025.GA25791@wotan.suse.de> <20080520110528.GD30341@sgi.com> <20080520111424.GB25791@wotan.suse.de> Message-ID: <20080520112635.GE30341@sgi.com> On Tue, May 20, 2008 at 01:14:24PM +0200, Nick Piggin wrote: > On Tue, May 20, 2008 at 06:05:29AM -0500, Robin Holt wrote: > > On Tue, May 20, 2008 at 12:50:25PM +0200, Nick Piggin wrote: > > > On Tue, May 20, 2008 at 05:01:11AM -0500, Robin Holt wrote: > > > > On Tue, May 20, 2008 at 07:31:46AM +0200, Nick Piggin wrote: > > > > > > > > > > Really? You can get the information through via a sleeping messaging API, > > > > > but not a non-sleeping one? What is the difference from the hardware POV? > > > > > > > > That was covered in the early very long discussion about 28 seconds. > > > > The read timeout for the BTE is 28 seconds and it automatically retried > > > > for certain failures. In interrupt context, that is 56 seconds without > > > > any subsequent interrupts of that or lower priority. > > > > > > I thought you said it would be possible to get the required invalidate > > > information without using the BTE. Couldn't you use XPMEM pages in > > > the kernel to read the data out of, if nothing else? > > > > I was wrong about that. I thought it was safe to do an uncached write, > > but it turns out any processor write is uncontained and the MCA that > > surfaces would be fatal. Likewise for the uncached read. > > Oh, so the BTE transfer is purely for fault isolation. I was thinking > you guys might have sufficient control of the hardware to be able to > do it at the level of CPU memory operations, but if it is some > limitation of ia64, then I guess that's a problem. > > How do you do fault isolation of userspace XPMEM accesses? The MCA handler can see the fault was either in userspace (processor privilege level I believe) or in the early kernel entry where it is saving registers. When it sees that condition, it kills the user's process. While in kernel space, there is no equivalent of the saving user state that forces the processor stall. From ogerlitz at voltaire.com Tue May 20 05:10:20 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 15:10:20 +0300 Subject: [ofa-general] RE: [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs In-Reply-To: <000101c8b9f6$c4dd4cf0$bd59180a@amr.corp.intel.com> References: <000001c8b6aa$a5418810$bd59180a@amr.corp.intel.com> <482C68A4.9020305@opengridcomputing.com> <4831649A.2020206@voltaire.com> <000101c8b9f6$c4dd4cf0$bd59180a@amr.corp.intel.com> Message-ID: <4832BFAC.2050506@voltaire.com> Sean Hefty wrote: > and in concept I prefer to: > * Always report the event and let ULPs ignore it > * Let someone come up with a fantastically simple way of reporting new events I am fine with the approach of always reporting the event and letting ULPs ignore it. Looking at how the ABI versions are exchanged between the rdma_ucm module and librdmacm, I don't see many alternatives other than bumping the ABI version to five.
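To illustrate the "report it and let ULPs ignore it" approach -- a ULP's rdma_cm event handler can simply fall through on anything it does not recognize; returning zero keeps the rdma_cm_id alive, while the new (still unnamed in this thread, so purely hypothetical) high-availability event would just hit the default arm:

static int ulp_cma_handler(struct rdma_cm_id *id,
			   struct rdma_cm_event *event)
{
	switch (event->event) {
	case RDMA_CM_EVENT_ADDR_RESOLVED:
		/* ... kick off route resolution ... */
		break;
	case RDMA_CM_EVENT_DISCONNECTED:
		/* ... tear down the connection ... */
		break;
	default:
		/*
		 * Events this ULP does not care about -- including any
		 * future HA event -- are silently ignored.  Returning
		 * non-zero here would destroy the rdma_cm_id.
		 */
		break;
	}
	return 0;
}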
If librdmacm can somehow note against what ABI version the app was built, we could bump the ABI version to five and require the user to upgrade his librdmacm to be able to run, but have --librdmacm-- hide this event from the user in case "his version" of the ABI is smaller. > I spent most of the morning looking at this, and until I know what the > trade-offs really are in the implementation, I can't say that I have a strong > preference for how to deal with any of this. My main concerns are: > > * All callbacks from the rdma_cm are serialized > * We minimize the overhead of reporting events > * We don't lose events > * If the user returns a non-zero value from a callback, the rdma_cm_id is > destroyed, and no further callbacks are invoked. Thanks for looking into that. Yes, I think it's correct and fair to require that all these characteristics remain after merging the new event. > The existing rdma_cm callbacks are naturally serialized with each other. > (Callback for connect after resolve route after resolve address...) This allows > using the stack for event structures, but the cost is complex synchronization > with device removal. Supporting additional events while meeting the concerns > listed above will be equally challenging. So if we can simplify device removal > handling, then supporting similar types of events should be easier as well. > > If we can guarantee that this works, one option is to acquire a mutex before > invoking a callback on an rdma_cm_id. I hesitate to hold any locks while in a > callback, since it restricts what the user can do, but if the mutex is only used > to synchronize calling the user back, it may work, since the rdma_cm never > invokes a callback from a downcall. This should simplify the device removal > handling, eliminating wait_remove and dev_remove from the rdma_cm_id. I would like to look into this possibility, which as you stated later in your post is simpler compared to the alternatives and would also make the current code supporting device removal less complex. So can/should that mutex be the existing one defined in cma.c or a new one? Or From tziporet at dev.mellanox.co.il Tue May 20 05:12:39 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 20 May 2008 15:12:39 +0300 Subject: [ofa-general] SET_IPOIB_CM enabled by default in some builds and not in others In-Reply-To: References: <482AC510.3090602@dev.mellanox.co.il> <483045CD.8060301@mellanox.co.il> Message-ID: <4832C037.1060206@mellanox.co.il> Chris Worley wrote: > In using netcat in UDP mode over IPoIB, I lose 25%-40% of the > packets. Is that expected? > What is the netcat test? Tziporet From tziporet at dev.mellanox.co.il Tue May 20 05:20:47 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 20 May 2008 15:20:47 +0300 Subject: [ofa-general][PATCH] Re: mlx4: Completion EQ per cpu (MP support, Patch 10) In-Reply-To: References: <40FA0A8088E8A441973D37502F00933E3A24@mtlexch01.mtl.com> <48319B4C.7040309@mellanox.co.il> Message-ID: <4832C21F.9060005@mellanox.co.il> Roland Dreier wrote: > > We can add the multiple interrupt vectors support in two stages: > > 1. The low level driver can create multiple interrupt vectors. Their name would include a > > serial number from 0 to #CPU's-1. The number of completion vectors can > > be populated through ib_device.num_comp_vectors. Then each ulp can ask for a specific > > completion vector when creating CQ, which means that passing vector=0 while creating CQ > > will assign it to completion vector #0.
> > > > 2. As the second stage, we can create a "don't care" value which would mean that the driver can > > can attach the CQ to any completion vector. In this case the policy shouldn't necessary be > > round-robin. We can manage the number of "clients" for each completion vector and then assign the CQ > > to the least busy one. > > this makes sense. However I think we need to come up with some > mechanism where a ULP or application can assign some semantic value to > the CQ event vector it chooses. Maybe a new verb is required. > > Add another verb is also a good idea. Do you have anything in mind? For now all ULPs use vector 0 and we stay with the same behavior as today. So is it OK to merge the change of the mlx4_core driver now? Thanks Tziporet From vlad at dev.mellanox.co.il Tue May 20 05:25:52 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 20 May 2008 15:25:52 +0300 Subject: [ofa-general] [PATCH v2] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size Message-ID: <4832C350.50004@dev.mellanox.co.il> From a2df38ebba98611e24336c9e4ac4f709224aeadc Mon Sep 17 00:00:00 2001 From: Vladimir Sokolovsky Date: Sun, 18 May 2008 11:25:55 +0300 Subject: [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size There was a bug in the mlx4 driver in mlx4_alloc_fmr which hardcoded the minimum acceptable page_shift to be 12. However, new ConnectX firmware has a minimum page_shift of 9 (log_pg_sz of 9 returned by QUERY_DEV_LIM) -- so that ib_fmr_alloc fails for ULPs using the device minimum when creating FMRs. To preserve firmware compatibility with released mlx4 drivers, the firmware will continue to return 12 as before for log_page_sz in QUERY_DEV_CAP for these drivers. However, to enable new drivers to take advantage of the available smaller page size, the mlx4 driver now first sets the log_pg_sz to the device minimum via the MOD_STAT_CFG() command, and only then calls QUERY_DEV_CAP(). The QUERY_DEV_CAP() command then returns the new (lower) log_pg_sz value. 
Signed-off-by: Jack Morgenstein Signed-off-by: Vladimir Sokolovsky --- drivers/net/mlx4/fw.c | 28 ++++++++++++++++++++++++++++ drivers/net/mlx4/fw.h | 6 ++++++ drivers/net/mlx4/main.c | 7 +++++++ 3 files changed, 41 insertions(+), 0 deletions(-) diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c index d82f275..2b5006b 100644 --- a/drivers/net/mlx4/fw.c +++ b/drivers/net/mlx4/fw.c @@ -101,6 +101,34 @@ static void dump_dev_cap_flags(struct mlx4_dev *dev, u32 flags) mlx4_dbg(dev, " %s\n", fname[i]); } +int mlx4_MOD_STAT_CFG(struct mlx4_dev *dev, struct mlx4_mod_stat_cfg *cfg) +{ + struct mlx4_cmd_mailbox *mailbox; + u32 *inbox; + int err = 0; + +#define MOD_STAT_CFG_IN_SIZE 0x100 + +#define MOD_STAT_CFG_PG_SZ_M_OFFSET 0x002 +#define MOD_STAT_CFG_PG_SZ_OFFSET 0x003 + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + inbox = mailbox->buf; + + memset(inbox, 0, MOD_STAT_CFG_IN_SIZE); + + MLX4_PUT(inbox, cfg->log_pg_sz, MOD_STAT_CFG_PG_SZ_OFFSET); + MLX4_PUT(inbox, cfg->log_pg_sz_m, MOD_STAT_CFG_PG_SZ_M_OFFSET); + + err = mlx4_cmd(dev, mailbox->dma, 0, 0, MLX4_CMD_MOD_STAT_CFG, + MLX4_CMD_TIME_CLASS_A); + + mlx4_free_cmd_mailbox(dev, mailbox); + return err; +} + int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap) { struct mlx4_cmd_mailbox *mailbox; diff --git a/drivers/net/mlx4/fw.h b/drivers/net/mlx4/fw.h index 306cb9b..a0e046c 100644 --- a/drivers/net/mlx4/fw.h +++ b/drivers/net/mlx4/fw.h @@ -38,6 +38,11 @@ #include "mlx4.h" #include "icm.h" +struct mlx4_mod_stat_cfg { + u8 log_pg_sz; + u8 log_pg_sz_m; +}; + struct mlx4_dev_cap { int max_srq_sz; int max_qp_sz; @@ -162,5 +167,6 @@ int mlx4_SET_ICM_SIZE(struct mlx4_dev *dev, u64 icm_size, u64 *aux_pages); int mlx4_MAP_ICM_AUX(struct mlx4_dev *dev, struct mlx4_icm *icm); int mlx4_UNMAP_ICM_AUX(struct mlx4_dev *dev); int mlx4_NOP(struct mlx4_dev *dev); +int mlx4_MOD_STAT_CFG(struct mlx4_dev *dev, struct mlx4_mod_stat_cfg *cfg); #endif /* MLX4_FW_H */ diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index a6aa49f..d373601 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -485,6 +485,7 @@ static int mlx4_init_hca(struct mlx4_dev *dev) struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_adapter adapter; struct mlx4_dev_cap dev_cap; + struct mlx4_mod_stat_cfg mlx4_cfg; struct mlx4_profile profile; struct mlx4_init_hca_param init_hca; u64 icm_size; @@ -502,6 +503,12 @@ static int mlx4_init_hca(struct mlx4_dev *dev) return err; } + mlx4_cfg.log_pg_sz_m = 1; + mlx4_cfg.log_pg_sz = 0; + err = mlx4_MOD_STAT_CFG(dev, &mlx4_cfg); + if (err) + mlx4_warn(dev, "Failed to override log_pg_sz parameter\n"); + err = mlx4_dev_cap(dev, &dev_cap); if (err) { mlx4_err(dev, "QUERY_DEV_CAP command failed, aborting.\n"); -- 1.5.5.1 From hrosenstock at xsigo.com Tue May 20 05:46:25 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 20 May 2008 05:46:25 -0700 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <48326C32.7000303@Sun.COM> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> <48318575.7060701@Sun.COM> <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> <4831A618.9090806@Sun.COM> <1211213870.12616.475.camel@hrosenstock-ws.xsigo.com> <4831B519.2060002@dev.mellanox.co.il> <1211217595.12616.490.camel@hrosenstock-ws.xsigo.com> <48326C32.7000303@Sun.COM> 
Message-ID: <1211287585.12616.568.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-20 at 11:44 +0530, Sumit Gaur - Sun Microsystem wrote: > How we can identify and filter these incoming SM packets in application from > the regular responses. I'm surprised that it's working this way, i.e. that SM responses are getting into your application, as they _should_ have a different transaction ID per the following. From the kernel Documentation/infiniband/user_mad.txt: Transaction IDs Users of the umad devices can use the lower 32 bits of the transaction ID field (that is, the least significant half of the field in network byte order) in MADs being sent to match request/response pairs. The upper 32 bits are reserved for use by the kernel and will be overwritten before a MAD is sent. Is the same fd being used by OpenSM and your application somehow, or are you not using OpenSM and your SM overlaps with this? -- Hal From hrosenstock at xsigo.com Tue May 20 05:46:47 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 20 May 2008 05:46:47 -0700 Subject: [ofa-general] SET_IPOIB_CM enabled by default in some builds and not in others In-Reply-To: <4832C037.1060206@mellanox.co.il> References: <482AC510.3090602@dev.mellanox.co.il> <483045CD.8060301@mellanox.co.il> <4832C037.1060206@mellanox.co.il> Message-ID: <1211287607.12616.569.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-20 at 15:12 +0300, Tziporet Koren wrote: > Chris Worley wrote: > > In using netcat in UDP mode over IPoIB, I loose 25%-40% of the > > packets. Is that expected? > > > > What is the netcap test? See http://netcat.sourceforge.net/ > Tziporet > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Thomas.Talpey at netapp.com Tue May 20 05:51:13 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 20 May 2008 08:51:13 -0400 Subject: [ofa-general] RE: Current list of Linux maintainers and their email info In-Reply-To: References: <4831F965.6060607@opengridcomputing.com> Message-ID: At 06:28 PM 5/19/2008, Woodruff, Robert J wrote: >Here is what I have so far as the list of kernel and userspace >components. > > Kernel Components > >Core kernel drivers, infiniband/core >... Your list doesn't include NFS/RDMA, but those components are effectively part of existing MAINTAINER sections. So, it might be worth adding a note to the effect that the NFS/RDMA client is maintained by Tom Talpey, as part of Trond Myklebust's NFS client codebase, and the NFS/RDMA server is Tom Tucker's, as part of Bruce Fields' and Neil Brown's NFS server base. These are the "NFS CLIENT" and "KERNEL NFSD" sections of the existing MAINTAINERS file, respectively. I don't personally think these NFS/RDMA sub-components currently rise to the level of needing mention in MAINTAINERS, but something in the openib docs would be very good. Tom.
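As a side note to Hal's transaction ID explanation above, here is a minimal userspace sketch of the matching rule from user_mad.txt. The struct below is a simplified stand-in rather than the real ib_user_mad layout, and the helper names are hypothetical; the key point is that only the lower 32 bits of the TID belong to the application, since the kernel rewrites the upper half on send.

#include <endian.h>
#include <stdint.h>

/* Simplified stand-in for a MAD header; only the TID matters here. */
struct mad_hdr {
	uint8_t  fields[8];	/* base version, mgmt class, method, status... */
	uint64_t tid;		/* transaction ID, network byte order */
};

static uint32_t next_app_tid;

/* Stamp only the lower 32 bits of the TID before sending; the
 * kernel owns the upper 32 bits and overwrites them anyway. */
static void stamp_tid(struct mad_hdr *hdr)
{
	uint64_t tid = be64toh(hdr->tid) & 0xffffffff00000000ull;

	hdr->tid = htobe64(tid | ++next_app_tid);
}

/* Match a response to a request by comparing only the lower 32 bits,
 * since the upper half was rewritten by the kernel on the way out. */
static int tid_matches(const struct mad_hdr *req, const struct mad_hdr *rsp)
{
	return (uint32_t)be64toh(req->tid) == (uint32_t)be64toh(rsp->tid);
}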
From monis at Voltaire.COM Tue May 20 06:31:08 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 20 May 2008 16:31:08 +0300 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> <483199EC.7070900@Voltaire.COM> Message-ID: <4832D29C.2060704@Voltaire.COM> Roland Dreier wrote: > By the way:
> > > + struct ib_sa_sm_ah *sm_ah;
> > > +
> > > + spin_lock_irqsave(&port->ah_lock, flags);
> > > + sm_ah = port->sm_ah;
> > > + port->sm_ah = NULL;
> > > + spin_unlock_irqrestore(&port->ah_lock, flags);
> > > +
> > > + if (sm_ah)
> > > + kref_put(&sm_ah->ref, free_sm_ah);
> Is there some reason why this can't be simpler like:
> spin_lock_irqsave(&port->ah_lock, flags);
> if (port->sm_ah)
> kref_put(&port->sm_ah->ref, free_sm_ah);
> port->sm_ah = NULL;
> spin_unlock_irqrestore(&port->ah_lock, flags);
What happens if this happens:

# | CPU-0                                               | CPU-1
  |                                                     |
1 | if (port->sm_ah)                                    |
  |     kref_put(&port->sm_ah->ref, free_sm_ah);        |
--+-----------------------------------------------------+-----------------------
2 |                                                     | alloc_mad()
--+-----------------------------------------------------+-----------------------
3 | port->sm_ah = NULL;                                 |

As I see it, the process on CPU-1 gets a garbage sm_ah. Do you agree? > I guess the same cleanup applies to update_sm_ah(), except after your > patch I don't see any way that update_sm_ah() could be called with sm_ah > anything but NULL, so we could drop the old_ah stuff completely there. I agree. The cleanup code can be completely removed. > > - R. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From monis at Voltaire.COM Tue May 20 06:32:22 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 20 May 2008 16:32:22 +0300 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> <483199EC.7070900@Voltaire.COM> Message-ID: <4832D2E6.9050907@Voltaire.COM> Roland Dreier wrote: > > and what happens if alloc_mad() is called while port->sm_ah is NULL? > > Trivial fix seems to be to move the test for whether port->sm_ah is NULL > into alloc_mad(), and have it return -EAGAIN if so. > > - R. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > I agree From monis at Voltaire.COM Tue May 20 06:34:33 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 20 May 2008 16:34:33 +0300 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> <483199EC.7070900@Voltaire.COM> Message-ID: <4832D369.7070300@Voltaire.COM> Roland Dreier wrote: > and what happens if alloc_mad() is called while port->sm_ah is NULL?
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > In this case it is protected by the check after alloc_mad() is called, but I already moved the check inside alloc_mad() as you suggested, so the answer is a bit irrelevant now. From swise at opengridcomputing.com Tue May 20 06:40:44 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 May 2008 08:40:44 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <48327198.7080305@voltaire.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <48301F74.4020905@voltaire.com> <4831347A.1010506@voltaire.com> <4831A0DF.2070603@opengridcomputing.com> <48327198.7080305@voltaire.com> Message-ID: <4832D4DC.2040006@opengridcomputing.com> Or Gerlitz wrote: > Steve Wise wrote: >>> dma mapping would work too but then handling the map/unmap becomes an >>> issue. I think it is way too complicated to add new verbs for >>> map/unmap fastreg page list (in addition to the alloc/free fastreg page >>> list that we are already adding) and force the consumer to do it. And >>> if we expect the low-level driver to do it, then the map is easy >>> (can be >>> done while posting the send) but the unmap is a pain -- it would >>> have to >>> be done inside poll_cq when reaping the completion, and the low-level >>> driver would have to keep some complicated extra data structure to go >>> back from the completion to the original fast reg page list structure. >>> >> And certain platforms can fail map requests (like PPC64) because they >> have limited resources for dma mapping. So then you'd fail a SQ work >> request when you might not want to... > I see the point in allocating the page lists in dma consistent memory > to make the mechanics of letting the HCA to DMA the list easier and > simpler, as I think Roland is suggesting in his post. However, I am > not sure to understand how this helps in the PPC64 case, if the HCA > does DMA to fetch the list, then IOMMU slots have to be consumed this > way or another, correct? > My point is that if you do the mapping at allocation time, then the failure will happen when you allocate the page list vs when you post the send WR. Maybe it doesn't matter, but the idea, I think, is to not fail post_send for lack of resources. Everything should be pre-allocated pretty much by the time you post work requests... Steve. > Or.
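Pulling the pieces of this thread together, here is a rough sketch of the fast-register usage model being proposed, with everything allocated up front so the data path never allocates or maps. The verbs were still under definition at this point, so the work request field names (fr_wr.wr.fast_reg.*) and the dma_addr_of_page() helper below are assumptions for illustration, not a settled API:

/* One-time setup, e.g. at mount time for NFS-RDMA. The rkey/stag is
 * fixed from here on; only the page list bound to it changes. */
mr = ib_alloc_fast_reg_mr(pd, max_pages);
page_list = ib_alloc_fast_reg_page_list(device, max_pages);

/* Per I/O: fill in the page list and post the fast register WR,
 * fenced ahead of the send that advertises the rkey to the peer. */
for (i = 0; i < npages; i++)
	page_list->page_list[i] = dma_addr_of_page(i);	/* hypothetical helper */

memset(&fr_wr, 0, sizeof fr_wr);
fr_wr.opcode = IB_WR_FAST_REG_MR;
fr_wr.wr.fast_reg.page_list = page_list;	/* assumed field names */
fr_wr.wr.fast_reg.rkey = mr->rkey;
ret = ib_post_send(qp, &fr_wr, &bad_wr);

/* ... peer does its RDMA against mr->rkey ... */

/* Tear down just the mapping; the MR and page list are reused for
 * the next I/O and only freed when the ULP shuts down. */
memset(&inv_wr, 0, sizeof inv_wr);
inv_wr.opcode = IB_WR_INVALIDATE_MR;
inv_wr.wr.fast_reg.rkey = mr->rkey;
ret = ib_post_send(qp, &inv_wr, &bad_wr);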
> From rdreier at cisco.com Tue May 20 06:53:45 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 06:53:45 -0700 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: <4832D29C.2060704@Voltaire.COM> (Moni Shoua's message of "Tue, 20 May 2008 16:31:08 +0300") References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> <483199EC.7070900@Voltaire.COM> <4832D29C.2060704@Voltaire.COM> Message-ID: > > spin_lock_irqsave(&port->ah_lock, flags); > > if (port->sm_ah) > > kref_put(&port->sm_ah->ref, free_sm_ah); > > port->sm_ah = NULL; > > spin_unlock_irqrestore(&port->ah_lock, flags); > > > What happens if this happens > > # | CPU-0 | CPU-1 > | | > 1 | if (port->sm_ah) | > | kref_put(&port->sm_ah->ref, free_sm_ah); | > --+-----------------------------------------------------+----------------------- > 2 | | alloc_mad() > --+-----------------------------------------------------+----------------------- > 3 | port->sm_ah = NULL; | > > As I see it, process on CPU-1 gets a garbage sm_ah > Do you agree? alloc_mad() must obviously take the lock when looking at port->sm_ah, and take a reference with kref_get() before dropping the lock. - R. From swise at opengridcomputing.com Tue May 20 06:55:28 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 May 2008 08:55:28 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <483285DC.20003@voltaire.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> Message-ID: <4832D850.2010102@opengridcomputing.com> Or Gerlitz wrote: > Steve Wise wrote: >> Support for the IB BMME and iWARP equivalent memory extensions to non >> shared memory regions. Usage Model: >> - MR allocated with ib_alloc_mr() >> - Page lists allocated via ib_alloc_fast_reg_page_list(). >> - MR made VALID and bound to a specific page list via >> ib_post_send(IB_WR_FAST_REG_MR) >> - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) > Steve, > > I am trying to further understand what would be a real life ULP design > here, and I think there are some more issues to clarify/define for the > case of ULP which has to create a mapping for a list of pages and send > this mapping (eg IB/rkey iWARP/stag) to a remote party that uses it > for RDMA. > > AFAIK, the idea was to let the ulp post --two-- work requests, where > the first creates the mapping and the second sends this mapping to the > remote side, such that the second does not start before the first > completes (i.e a fence). > > Now, the above scheme means that the ulp knows the value of the > rkey/stag at the time of posting these two work requests (since it has > to encode it in the second one), so something has to be clarified re > the rkey/stag here, do they change each time this MR is used? how many > bits can be changed, etc. The ULP knows the rkey/stag because its returned up front in the ib_alloc_fast_reg_mr(). And it doesn't change (ignoring the key issue which we haven't exposed yet to the ULP). The same rkey/stag can be used for multiple mappings. It can be made invalid at any point in time via the IB_WR_INVALIDATE_MR so the fact that you're leaving the same rkey/stag advertised is not a risk. 
So you allocate the rkey/stag up front, allocate page_lists up front, then as needed you populate your page list and bind it to the rkey/stag via IB_WR_FAST_REG_MR, and invalidate that mapping via IB_WR_INVALIDATE_MR. You can do this any number of times, and with proper fencing, you can pipeline these mappings. Eventually when you're done doing IO (like for NFSRDMA when the mount is unmounted) you free up the page list(s) and mr/rkey/stag. So NFSRDMA will keep these fast_reg_mrs and page_list structs pre-allocated and hung off some context so that per RPC, they can be bound/registered, the IO executed, and then the MR invalidated as part of processing the RPC. > > I guess my questions are to some extent RTFM ones, but, first, with > some quick looking in the IB spec I did not manage to get enough > answers (pointers appreciated...) and second, you are proposing an > implementation here, so I think it makes sense to review the actual > usage model to see all aspects needed for ULPs are covered... > > Talking on usage, do you plan to patch the mainline nfs-rdma code to > use these verbs? Yes. Tom Tucker will be doing this. Jon Mason is implementing RDS changes to utilize this too. The hope is all this makes 2.6.27/ofed-1.4. I can also post test code (krping module) if anyone is interested. I'm developing that now. Steve. From monis at Voltaire.COM Tue May 20 07:01:02 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 20 May 2008 17:01:02 +0300 Subject: [ofa-general] [PATCH V2 1/2] IB/core: handle race between elements in work queues after event In-Reply-To: <48302034.8040709@Voltaire.COM> References: <48302034.8040709@Voltaire.COM> Message-ID: <4832D99E.3010205@Voltaire.COM> This is the second version after some of Roland's comments. One comment, however, is still pending. ----------------------------------------------------- This patch solves a race between work elements that are carried out after an event occurs. When the SM address handle becomes invalid and needs an update, it is handled by a work in the global workqueue. On the other hand, this event is also handled in ib_ipoib by queuing a work in the ipoib_workqueue that does the mcast join. Although queuing is in the right order, it is done to 2 different workqueues and so there is no guarantee that the first to be queued is the first to be executed. The patch sets the SM address handle to NULL, and until update_sm_ah() is called, any request that needs sm_ah is replied with -EAGAIN return status. For consumers, the patch doesn't make things worse. Before the patch, MADs are sent to the wrong SM so the request gets lost. Consumers can be improved if they examine the return code and respond to EAGAIN properly, but even without an improvement the situation is not getting worse, and in some cases it gets better. If ib_sa_event() is called during a consumer's work (e.g. ib_sa_path_rec_get()) after the check for a NULL SM address handle, the result would be as before the patch, without a risk of dereferencing a NULL pointer.
Signed-off-by: Moni Levy Signed-off-by: Moni Shoua --- drivers/infiniband/core/sa_query.c | 26 +++++++++++++++++++------- 1 file changed, 19 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index cf474ec..7224bd1 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -361,7 +361,7 @@ static void update_sm_ah(struct work_struct *work) { struct ib_sa_port *port = container_of(work, struct ib_sa_port, update_task); - struct ib_sa_sm_ah *new_ah, *old_ah; + struct ib_sa_sm_ah *new_ah; struct ib_port_attr port_attr; struct ib_ah_attr ah_attr; @@ -397,12 +397,9 @@ static void update_sm_ah(struct work_struct *work) } spin_lock_irq(&port->ah_lock); - old_ah = port->sm_ah; port->sm_ah = new_ah; spin_unlock_irq(&port->ah_lock); - if (old_ah) - kref_put(&old_ah->ref, free_sm_ah); } static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event) @@ -413,9 +410,20 @@ static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event event->event == IB_EVENT_PKEY_CHANGE || event->event == IB_EVENT_SM_CHANGE || event->event == IB_EVENT_CLIENT_REREGISTER) { - struct ib_sa_device *sa_dev; - sa_dev = container_of(handler, typeof(*sa_dev), event_handler); - + unsigned long flags; + struct ib_sa_device *sa_dev = + container_of(handler, typeof(*sa_dev), event_handler); + struct ib_sa_port *port = + &sa_dev->port[event->element.port_num - sa_dev->start_port]; + struct ib_sa_sm_ah *sm_ah; + + spin_lock_irqsave(&port->ah_lock, flags); + sm_ah = port->sm_ah; + port->sm_ah = NULL; + spin_unlock_irqrestore(&port->ah_lock, flags); + + if (sm_ah) + kref_put(&sm_ah->ref, free_sm_ah); schedule_work(&sa_dev->port[event->element.port_num - sa_dev->start_port].update_task); } @@ -519,6 +527,10 @@ static int alloc_mad(struct ib_sa_query *query, gfp_t gfp_mask) unsigned long flags; spin_lock_irqsave(&query->port->ah_lock, flags); + if (!query->port->sm_ah) { + spin_unlock_irqrestore(&query->port->ah_lock, flags); + return -EAGAIN; + } kref_get(&query->port->sm_ah->ref); query->sm_ah = query->port->sm_ah; spin_unlock_irqrestore(&query->port->ah_lock, flags); From monis at Voltaire.COM Tue May 20 07:05:26 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 20 May 2008 17:05:26 +0300 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> <483199EC.7070900@Voltaire.COM> <4832D29C.2060704@Voltaire.COM> Message-ID: <4832DAA6.7070500@Voltaire.COM> Roland Dreier wrote:
> > > spin_lock_irqsave(&port->ah_lock, flags);
> > > if (port->sm_ah)
> > > kref_put(&port->sm_ah->ref, free_sm_ah);
> > > port->sm_ah = NULL;
> > > spin_unlock_irqrestore(&port->ah_lock, flags);
> >
> > What happens if this happens:
> >
> > # | CPU-0                                               | CPU-1
> >   |                                                     |
> > 1 | if (port->sm_ah)                                    |
> >   |     kref_put(&port->sm_ah->ref, free_sm_ah);        |
> > --+-----------------------------------------------------+-----------------------
> > 2 |                                                     | alloc_mad()
> > --+-----------------------------------------------------+-----------------------
> > 3 | port->sm_ah = NULL;                                 |
> >
> > As I see it, the process on CPU-1 gets a garbage sm_ah
> > Do you agree?
> alloc_mad() must obviously take the lock when looking at port->sm_ah, > and take a reference with kref_get() before dropping the lock. > > - R.
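Spelled out, the rule Roland states here is the pattern the resent patch further down adopts in alloc_mad(): re-check sm_ah under ah_lock and take the kref before the lock is dropped, so the event handler can never free the AH between the check and the use. A sketch of just that window:

spin_lock_irqsave(&query->port->ah_lock, flags);
if (!query->port->sm_ah) {
	/* event handler already dropped the AH; tell the caller to retry */
	spin_unlock_irqrestore(&query->port->ah_lock, flags);
	return -EAGAIN;
}
kref_get(&query->port->sm_ah->ref);	/* reference taken while still locked */
query->sm_ah = query->port->sm_ah;
spin_unlock_irqrestore(&query->port->ah_lock, flags);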
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > You're right. I just sent a V2 but it needs to be modified according to the above. I'll resend soon. Thanks From rdreier at cisco.com Tue May 20 07:07:29 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 07:07:29 -0700 Subject: [ofa-general] Re: [PATCH V2 1/2] IB/core: handle race between elements in work queues after event In-Reply-To: <4832D99E.3010205@Voltaire.COM> (Moni Shoua's message of "Tue, 20 May 2008 17:01:02 +0300") References: <48302034.8040709@Voltaire.COM> <4832D99E.3010205@Voltaire.COM> Message-ID: > This is the second version after some of Roland's comments. > One comment, however, is still pending. What comment is still pending? From monis at voltaire.com Tue May 20 07:09:56 2008 From: monis at voltaire.com (Moni Shoua) Date: Tue, 20 May 2008 17:09:56 +0300 Subject: [ofa-general] RE: [PATCH V2 1/2] IB/core: handle race between elements in work queues after event In-Reply-To: References: <48302034.8040709@Voltaire.COM> <4832D99E.3010205@Voltaire.COM> Message-ID: <39C75744D164D948A170E9792AF8E7CA01269D56@exil.voltaire.com> I meant the one about the simpler cleanup in ib_sa_event(), but it is no longer pending. -----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Tuesday, May 20, 2008 17:07 To: Moni Shoua Cc: Olga Shern; OpenFabrics General; Moni Levy Subject: Re: [PATCH V2 1/2] IB/core: handle race between elements in work queues after event > This is the second version after some of Roland's comments. > One comment, however, is still pending. What comment is still pending? From monis at Voltaire.COM Tue May 20 07:13:20 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 20 May 2008 17:13:20 +0300 Subject: [ofa-general] [PATCH V2 1/2] IB/core: handle race between elements in work queues after event In-Reply-To: <4832D99E.3010205@Voltaire.COM> References: <48302034.8040709@Voltaire.COM> <4832D99E.3010205@Voltaire.COM> Message-ID: <4832DC80.2000408@Voltaire.COM> This patch solves a race between work elements that are carried out after an event occurs. When the SM address handle becomes invalid and needs an update, it is handled by a work in the global workqueue. On the other hand, this event is also handled in ib_ipoib by queuing a work in the ipoib_workqueue that does the mcast join. Although queuing is in the right order, it is done to 2 different workqueues and so there is no guarantee that the first to be queued is the first to be executed. The patch sets the SM address handle to NULL, and until update_sm_ah() is called, any request that needs sm_ah is replied with -EAGAIN return status. For consumers, the patch doesn't make things worse. Before the patch, MADs are sent to the wrong SM so the request gets lost. Consumers can be improved if they examine the return code and respond to EAGAIN properly, but even without an improvement the situation is not getting worse, and in some cases it gets better. If ib_sa_event() is called during a consumer's work (e.g. ib_sa_path_rec_get()) after the check for a NULL SM address handle, the result would be as before the patch, without a risk of dereferencing a NULL pointer.
Signed-off-by: Moni Levy Signed-off-by: Moni Shoua --- drivers/infiniband/core/sa_query.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index cf474ec..78ea815 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -361,7 +361,7 @@ static void update_sm_ah(struct work_struct *work) { struct ib_sa_port *port = container_of(work, struct ib_sa_port, update_task); - struct ib_sa_sm_ah *new_ah, *old_ah; + struct ib_sa_sm_ah *new_ah; struct ib_port_attr port_attr; struct ib_ah_attr ah_attr; @@ -397,12 +397,9 @@ static void update_sm_ah(struct work_struct *work) } spin_lock_irq(&port->ah_lock); - old_ah = port->sm_ah; port->sm_ah = new_ah; spin_unlock_irq(&port->ah_lock); - if (old_ah) - kref_put(&old_ah->ref, free_sm_ah); } static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event) @@ -413,8 +410,17 @@ static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event event->event == IB_EVENT_PKEY_CHANGE || event->event == IB_EVENT_SM_CHANGE || event->event == IB_EVENT_CLIENT_REREGISTER) { - struct ib_sa_device *sa_dev; - sa_dev = container_of(handler, typeof(*sa_dev), event_handler); + unsigned long flags; + struct ib_sa_device *sa_dev = + container_of(handler, typeof(*sa_dev), event_handler); + struct ib_sa_port *port = + &sa_dev->port[event->element.port_num - sa_dev->start_port]; + + spin_lock_irqsave(&port->ah_lock, flags); + if (port->sm_ah) + kref_put(&port->sm_ah->ref, free_sm_ah); + port->sm_ah = NULL; + spin_unlock_irqrestore(&port->ah_lock, flags); schedule_work(&sa_dev->port[event->element.port_num - sa_dev->start_port].update_task); @@ -519,6 +525,10 @@ static int alloc_mad(struct ib_sa_query *query, gfp_t gfp_mask) unsigned long flags; spin_lock_irqsave(&query->port->ah_lock, flags); + if (!query->port->sm_ah) { + spin_unlock_irqrestore(&query->port->ah_lock, flags); + return -EAGAIN; + } kref_get(&query->port->sm_ah->ref); query->sm_ah = query->port->sm_ah; spin_unlock_irqrestore(&query->port->ah_lock, flags); From worleys at gmail.com Tue May 20 07:34:31 2008 From: worleys at gmail.com (Chris Worley) Date: Tue, 20 May 2008 08:34:31 -0600 Subject: [ofa-general] SET_IPOIB_CM enabled by default in some builds and not in others In-Reply-To: <4832C037.1060206@mellanox.co.il> References: <482AC510.3090602@dev.mellanox.co.il> <483045CD.8060301@mellanox.co.il> <4832C037.1060206@mellanox.co.il> Message-ID: On Tue, May 20, 2008 at 6:12 AM, Tziporet Koren wrote: > Chris Worley wrote: >> >> In using netcat in UDP mode over IPoIB, I loose 25%-40% of the >> packets. Is that expected? >> > > What is the netcap test? 
Start a listener on one node (both nodes running RHEL5.1/OFED1.3):

[root at poib01 ~]# nc -v -v -u -l 61984 | dd of=/dev/null bs=1024k

Start a sender on another, which completes:

[root at poib04 mnt]# dd if=/dev/zero bs=1024k count=10000 | nc -v -v -u 36.102.28.91 61984
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 48.0578 seconds, 218 MB/s

But the listener just hangs until ^C:

Connection from 36.102.28.94 port 61984 [udp/*] accepted
0+6078248 records in
0+6078248 records out
6239293440 bytes (6.2 GB) copied, 179.339 seconds, 34.8 MB/s

Chris > > Tziporet >
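nc piped into dd only shows an aggregate byte count, so it cannot say which datagrams went missing or when. If it helps, here is a minimal receive-side probe that counts gaps directly, assuming the sender stamps the first four bytes of every datagram with an incrementing counter starting at 1 (sender not shown; plain sockets, nothing IPoIB-specific, and the port simply mirrors the nc test above):

/* udp_loss_rx.c - receive-side loss counter for a sequence-numbered
 * UDP stream. Build: cc -O2 -o udp_loss_rx udp_loss_rx.c */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

int main(void)
{
	struct sockaddr_in addr;
	uint32_t seq, max_seq = 0;
	unsigned long long received = 0;
	char buf[65536];
	int s = socket(AF_INET, SOCK_DGRAM, 0);

	if (s < 0) {
		perror("socket");
		return 1;
	}
	memset(&addr, 0, sizeof addr);
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(61984);	/* same port as the nc test above */
	if (bind(s, (struct sockaddr *)&addr, sizeof addr) < 0) {
		perror("bind");
		return 1;
	}
	for (;;) {
		ssize_t n = recv(s, buf, sizeof buf, 0);

		if (n < (ssize_t)sizeof seq)
			continue;
		memcpy(&seq, buf, sizeof seq);	/* sender's counter, network order */
		seq = ntohl(seq);
		if (seq > max_seq)
			max_seq = seq;
		received++;
		if (received % 100000 == 0)
			printf("received %llu of %u sent (%.1f%% lost)\n",
			       received, max_seq,
			       100.0 * (max_seq - received) / max_seq);
	}
}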
Tziporet From vlad at dev.mellanox.co.il Tue May 20 08:40:53 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 20 May 2008 18:40:53 +0300 Subject: [ofa-general] Re: [ewg] OFED May 29, 08 meeting summary In-Reply-To: <6C2C79E72C305246B504CBA17B5500C904111C9F@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C904111C9F@mtlexch01.mtl.com> Message-ID: <4832F105.8000402@dev.mellanox.co.il> Tziporet Koren wrote: > 2. OFED 1.4: > - Kernel rebase status: we have prepared the new tree, make-dist > pass but compilation still fails. > Any help to resolve compilation issues is welcome. > URL: git://git.openfabrics.org/ofed_1_4/linux-2.6.git > ofed_kernel > Vlad will send status soon Compilation passed on 2.6.26-rc2 kernel (except SDP). Working on backport patches. - Vladimir From David.Shue.ctr at rl.af.mil Tue May 20 08:54:21 2008 From: David.Shue.ctr at rl.af.mil (Shue, David CTR USAF AFMC AFRL/RITB) Date: Tue, 20 May 2008 11:54:21 -0400 Subject: [ofa-general] Infiniband Card Trouble In-Reply-To: References: Message-ID: UPDATE I was given a reflash image from a MELLANOX rep and it worked for me! Thanks for everyone's effort to help me. -Dave -----Original Message----- From: Mike Heinz [mailto:michael.heinz at qlogic.com] Sent: Thursday, May 01, 2008 11:21 AM To: Shue, David CTR USAF AFMC AFRL/RITB; general at lists.openfabrics.org Subject: RE: [ofa-general] Infiniband Card Trouble #6 makes it sound like it's an ofed installation issue rather than the HCA itself. Could you post the relevant /var/log/messages? Messages from ib_mthca would be especially important. In addition, the output from mstflint -d q could also be useful. -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Shue, David CTR USAF AFMC AFRL/RITB Sent: Thursday, May 01, 2008 9:09 AM To: general at lists.openfabrics.org Subject: [ofa-general] Infiniband Card Trouble Hello, I have used the OFED-1.3 software to communicate with the current cards I have. These cards come up as "MT23108" in the logs, and I am not sure whom the manufacturer is. I was able to program the cards, and even install MPICH2 and run tests. I have recently obtained new IB cards from HP "HP PCI-X 2-port 4X Fabric (HPC) Adapter" http://h20000.www2.hp.com/bizsupport/TechSupport/Home.jsp?lang=en&cc=id& prodTypeId=12883&prodSeriesId=460713&lang=en&cc=id and these cards do not work the same. The machine boots up fine with the card in, and shows the card as Mellanox "MT23108" also? The two cards are visibly different in every way. Is the MT23108 a certain platform for IB? I am new to the entire IB technology. This is the history of what I did. 1) Staged the machine RH EL v5 2) Install the IB card 3) Boot machine up 4) Can see the card looking at "lspci" and "dmesg" but nothing in the network area or under "ifconfig" (Just like with the first cards) 5) I then install the OFED-1.3 software to communicate and configure the card 6) When I go to start the card (instead of reboot but have tried both ways) /etc/init.d/openib start, it all fails. I then look in the log file and see a bunch of "unknown symbol..." and "disagrees..." for all items of ib_uverbs, ib_umad,iw_cxgb3,ib_path, mlx_ib, and so on. 7) When I reboot, the machine reaches "UDEV" of the reboot stage, hangs for a little bit, and then many errors show and the machine won't boot, unless I take the card out. 
If I uninstall the OFED software, it will reboot fine with the card still in. The card from HP giving me problems, does not appear to have any drivers for it. It looks like HP supports it to work on Windows, and HPUX. I'm look for any help you can provide. Thanks in advance, Dave >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> David Shue Systems Specialist Computer Sciences Corporation <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< From Jeffrey.C.Becker at nasa.gov Tue May 20 09:26:01 2008 From: Jeffrey.C.Becker at nasa.gov (Jeff Becker) Date: Tue, 20 May 2008 09:26:01 -0700 Subject: [ofa-general] RE: Current list of Linux maintainers and their email info In-Reply-To: References: <4831F965.6060607@opengridcomputing.com> Message-ID: <4832FB99.1080409@nasa.gov> Talpey, Thomas wrote: > At 06:28 PM 5/19/2008, Woodruff, Robert J wrote: > >> Here is what I have so far as the list of kernel and userspace >> components. >> >> Kernel Components >> >> Core kernel drivers, infiniband/core >> ... >> > > Your list doesn't include NFS/RDMA, but those components are effectively part > of existing MAINTAINER sections. > > So, it might be worth adding a note to the effect that the NFS/RDMA client is > maintained by Tom Talpey, as part of Trond Myklebust's NFS client codebase, > and the NFS/RDMA server is Tom Tucker's, as part of Bruce Fields' and Neil Brown's > NFS server base. These are the "NFS CLIENT" and "KERNEL NFSD" sections of the > existing MAINTAINERS file, respectively. > And I am responsible for integrating this work with OFED, particularly providing and maintaining kernel backports for the various distros. Thanks. -jeff > I don't personally think these NFS/RDMA sub-components currently rise to the > level of needing mention in MAINTAINERS, but something in the openib docs would > be very good. > > Tom. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From arlin.r.davis at intel.com Tue May 20 09:36:11 2008 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Tue, 20 May 2008 09:36:11 -0700 Subject: [ofa-general] RE: [ewg] OFED May 29, 08 meeting summary In-Reply-To: <6C2C79E72C305246B504CBA17B5500C904111C9F@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C904111C9F@mtlexch01.mtl.com> Message-ID: > >OFED May 29, 08 meeting summary > >1. OFED 1.3.1: > 1.3 Bugs status > Please set release version 1.3.1 for all bugs that should be >resolved in 1.3.1 > We decided these are the bugs that should be fixed for 1.3.1: > >uDAPL bug should be provided too - Arlin 1044 normal arlin.r.davis at intel.com uDAPL 1.2 - dat_ia_open delay if DNS not configured To be fixed. From ramachandra.kuchimanchi at qlogic.com Tue May 20 09:52:50 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Tue, 20 May 2008 22:22:50 +0530 Subject: [ofa-general] Re: [PATCH 10/13] QLogic VNIC: Driver Statistics collection In-Reply-To: References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430172055.31725.70663.stgit@localhost.localdomain> Message-ID: <71d336490805200952k2c31b6abkfcf70ed7481a95d@mail.gmail.com> Roland, On Fri, May 16, 2008 at 4:03 AM, Roland Dreier wrote: > > +#else /*CONFIG_INIFINIBAND_VNIC_STATS*/ > > + > > +static inline void vnic_connected_stats(struct vnic *vnic) > > +{ > > + ; > > +} > > there are an awful lot of stubs here. 
Do you really expect anyone to > set CONFIG_INFINIBAND_QLGC_VNIC_STATS=n? Sorry, missed this comment from the first round of review. Yes, the default behavior we want is to disable the statistics collection as some of the statistics collection is in the data transfer path and hence this code can add overheads. These statistics are more for performance debugging/tuning. Regards, Ram From sean.hefty at intel.com Tue May 20 10:02:34 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 20 May 2008 10:02:34 -0700 Subject: [ofa-general] RE: [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs In-Reply-To: <4832BFAC.2050506@voltaire.com> References: <000001c8b6aa$a5418810$bd59180a@amr.corp.intel.com> <482C68A4.9020305@opengridcomputing.com> <4831649A.2020206@voltaire.com> <000101c8b9f6$c4dd4cf0$bd59180a@amr.corp.intel.com> <4832BFAC.2050506@voltaire.com> Message-ID: <000101c8ba9b$4ca68da0$a3d8180a@amr.corp.intel.com> >I am fine with the approach of always report the event and let ULPs >ignore it. Looking on how the ABI versions are exchanged between the >rdma_ucm module to librdmacm, I don't see much alternatives other to >bumping the ABI version to five. If librdmacm can somehow note against >what ABI version the app was built, we could bump the ABI version to >five and require the user to upgrade his librdmacm to be able to run, >but have --librdmacm-- hide this event from the user in case "his >version" of the ABI is smaller. I was only thinking of the kernel interfaces, but I don't see that this really changes the ABI. An existing library continues to work unmodified. (Is this that different than adding a new return value from a call?) If there really is an issue, then the rdma_ucm can toss the event. >I would like to look into this possibility which as you stated later in >your post is simpler compared to the alternatives and would also make >the current code of supporting device removal less complex. So >can/should that mutex be the existing one defined in cma.c or a new one? After more thought, this approach is what I would try first. I think you will need a new mutex per rdma_cm_id that does nothing but serializes callbacks. You might be able to acquire/release it in disable/enable remove, but I didn't look into the implementation in that much detail. - Sean From clameter at sgi.com Tue May 20 11:21:09 2008 From: clameter at sgi.com (Christoph Lameter) Date: Tue, 20 May 2008 11:21:09 -0700 (PDT) Subject: [ofa-general] Yet another mm notifier: Notify when pages are unmapped. Message-ID: Robin suggested that the last post as a reply in the anon_vma thread made this patch vanish. So here it is again (guess we are all tired of notifiers...) This patch implements a callbacks for device drivers that establish external references to pages aside from the Linux rmaps. Those either: 1. Do not take a refcount on pages that are mapped from devices. They have a TLB cache like handling and must be able to flush external references from atomic contexts. These devices do not need to provide the _sync methods. 2. Do take a refcount on pages mapped externally. These are handled by marking pages as to be invalidated in atomic contexts. Invalidation may be started by the driver. A _sync variant for the individual or range unmap is called when we are back in a nonatomic context. At that point the device must complete the removal of external references and drop its refcount. 
With the mm notifier it is possible for the device driver to release external references after the page references are removed from a process that made them available. With the notifier it becomes possible to get pages unpinned on request and thus avoid issues that come with having a large amount of pinned pages. A device driver must subscribe to a process using mm_register_notifier(struct mm_struct *, struct mm_notifier *) The VM will then perform callbacks for operations that unmap or change permissions of pages in that address space. When the process terminates then the ->release method is called first to remove all pages still mapped to the proces. Before the mm_struct is freed the ->destroy() method is called which should dispose of the mm_notifier structure. The following callbacks exist: invalidate_range(notifier, mm_struct *, from , to) Invalidate a range of addresses. The invalidation is not required to complete immediately. invalidate_range_sync(notifier, mm_struct *, from, to) This is called after some invalidate_range callouts. The driver may only return when the invalidation of the references is completed. Callback is only called from non atomic contexts. There is no need to provide this callback if the driver can remove references in an atomic context. invalidate_page(notifier, mm_struct *, struct page *page, unsigned long address) Invalidate references to a particular page. The driver may defer the invalidation. invalidate_page_sync(notifier, mm_struct *,struct *) Called after one or more invalidate_page() callbacks. The callback must only return when the external references have been removed. The callback does not need to be provided if the driver can remove references in atomic contexts. [NOTE] The invalidate_page_sync() callback is weird because it is called for every notifier that supports the invalidate_page_sync() callback if a page has PageNotifier() set. The driver must determine in an efficient way that the page is not of interest. This is because we do not have the mm context after we have dropped the rmap list lock. Drivers incrementing the refcount must set and clear PageNotifier appropriately when establishing and/or dropping a refcount! [These conditions are similar to the rmap notifier that was introduced in my V7 of the mmu_notifier]. There is no support for an aging callback. A device driver may simply set the reference bit on the linux pte when the external mapping is referenced if such support is desired. The patch is provisional. All functions are inlined for now. They should be wrapped like in Andrea's series. Its probably good to have Andrea review this if we actually decide to go this route since he is pretty good as detecting issues with complex lock interactions in the vm. mmu notifiers V7 was rejected by Andrew because of the strange asymmetry in invalidate_page_sync() (at that time called rmap notifier) and we are reintroducing that now in a light weight order to be able to defer freeing until after the rmap spinlocks have been dropped. Jack tested this with the GRU. 
Signed-off-by: Christoph Lameter --- fs/hugetlbfs/inode.c | 2 include/linux/mm_types.h | 3 include/linux/page-flags.h | 3 include/linux/rmap.h | 161 +++++++++++++++++++++++++++++++++++++++++++++ kernel/fork.c | 4 + mm/Kconfig | 4 + mm/filemap_xip.c | 2 mm/fremap.c | 2 mm/hugetlb.c | 3 mm/memory.c | 38 ++++++++-- mm/mmap.c | 3 mm/mprotect.c | 3 mm/mremap.c | 5 + mm/rmap.c | 11 ++- 14 files changed, 234 insertions(+), 10 deletions(-) Index: linux-2.6/kernel/fork.c =================================================================== --- linux-2.6.orig/kernel/fork.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/kernel/fork.c 2008-05-16 16:06:26.000000000 -0700 @@ -386,6 +386,9 @@ static struct mm_struct * mm_init(struct if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; +#ifdef CONFIG_MM_NOTIFIER + mm->mm_notifier = NULL; +#endif return mm; } @@ -418,6 +421,7 @@ void __mmdrop(struct mm_struct *mm) BUG_ON(mm == &init_mm); mm_free_pgd(mm); destroy_context(mm); + mm_notifier_destroy(mm); free_mm(mm); } EXPORT_SYMBOL_GPL(__mmdrop); Index: linux-2.6/mm/filemap_xip.c =================================================================== --- linux-2.6.orig/mm/filemap_xip.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/mm/filemap_xip.c 2008-05-16 16:06:26.000000000 -0700 @@ -189,6 +189,7 @@ __xip_unmap (struct address_space * mapp /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); pteval = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); @@ -197,6 +198,7 @@ __xip_unmap (struct address_space * mapp } } spin_unlock(&mapping->i_mmap_lock); + mm_notifier_invalidate_page_sync(page); } /* Index: linux-2.6/mm/fremap.c =================================================================== --- linux-2.6.orig/mm/fremap.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/mm/fremap.c 2008-05-16 16:06:26.000000000 -0700 @@ -214,7 +214,9 @@ asmlinkage long sys_remap_file_pages(uns spin_unlock(&mapping->i_mmap_lock); } + mm_notifier_invalidate_range(mm, start, start + size); err = populate_range(mm, vma, start, size, pgoff); + mm_notifier_invalidate_range_sync(mm, start, start + size); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); Index: linux-2.6/mm/hugetlb.c =================================================================== --- linux-2.6.orig/mm/hugetlb.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/hugetlb.c 2008-05-16 17:50:31.000000000 -0700 @@ -14,6 +14,7 @@ #include #include #include +#include #include #include @@ -843,6 +844,7 @@ void __unmap_hugepage_range(struct vm_ar } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mm_notifier_invalidate_range(mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); @@ -864,6 +866,7 @@ void unmap_hugepage_range(struct vm_area spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); __unmap_hugepage_range(vma, start, end); spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); + mm_notifier_invalidate_range_sync(vma->vm_mm, start, end); } } Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/mm/memory.c 2008-05-16 16:06:26.000000000 -0700 @@ -527,6 +527,7 @@ copy_one_pte(struct mm_struct *dst_mm, s */ if (is_cow_mapping(vm_flags)) { ptep_set_wrprotect(src_mm, addr, 
src_pte); + mm_notifier_invalidate_range(src_mm, addr, addr + PAGE_SIZE); pte = pte_wrprotect(pte); } @@ -649,6 +650,7 @@ int copy_page_range(struct mm_struct *ds unsigned long next; unsigned long addr = vma->vm_start; unsigned long end = vma->vm_end; + int ret; /* * Don't copy ptes where a page fault will fill them correctly. @@ -664,17 +666,30 @@ int copy_page_range(struct mm_struct *ds if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); + ret = 0; dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(src_pgd)) continue; - if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, - vma, addr, next)) - return -ENOMEM; + if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, + vma, addr, next))) { + ret = -ENOMEM; + break; + } } while (dst_pgd++, src_pgd++, addr = next, addr != end); - return 0; + + /* + * We need to invalidate the secondary MMU mappings only when + * there could be a permission downgrade on the ptes of the + * parent mm. And a permission downgrade will only happen if + * is_cow_mapping() returns true. + */ + if (is_cow_mapping(vma->vm_flags)) + mm_notifier_invalidate_range_sync(src_mm, vma->vm_start, end); + + return ret; } static unsigned long zap_pte_range(struct mmu_gather *tlb, @@ -913,6 +928,7 @@ unsigned long unmap_vmas(struct mmu_gath } tlb_finish_mmu(*tlbp, tlb_start, start); + mm_notifier_invalidate_range(vma->vm_mm, tlb_start, start); if (need_resched() || (i_mmap_lock && spin_needbreak(i_mmap_lock))) { @@ -951,8 +967,10 @@ unsigned long zap_page_range(struct vm_a tlb = tlb_gather_mmu(mm, 0); update_hiwater_rss(mm); end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); - if (tlb) + if (tlb) { tlb_finish_mmu(tlb, address, end); + mm_notifier_invalidate_range(mm, address, end); + } return end; } @@ -1711,7 +1729,6 @@ static int do_wp_page(struct mm_struct * */ page_table = pte_offset_map_lock(mm, pmd, address, &ptl); - page_cache_release(old_page); if (!pte_same(*page_table, orig_pte)) goto unlock; @@ -1729,6 +1746,7 @@ static int do_wp_page(struct mm_struct * if (ptep_set_access_flags(vma, address, page_table, entry,1)) update_mmu_cache(vma, address, entry); ret |= VM_FAULT_WRITE; + old_page = NULL; goto unlock; } @@ -1774,6 +1792,7 @@ gotten: * thread doing COW. */ ptep_clear_flush(vma, address, page_table); + mm_notifier_invalidate_page(mm, old_page, address); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); @@ -1787,10 +1806,13 @@ gotten: if (new_page) page_cache_release(new_page); - if (old_page) - page_cache_release(old_page); unlock: pte_unmap_unlock(page_table, ptl); + if (old_page) { + mm_notifier_invalidate_page_sync(old_page); + page_cache_release(old_page); + } + if (dirty_page) { if (vma->vm_file) file_update_time(vma->vm_file); Index: linux-2.6/mm/mmap.c =================================================================== --- linux-2.6.orig/mm/mmap.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/mmap.c 2008-05-16 16:06:26.000000000 -0700 @@ -1759,6 +1759,8 @@ static void unmap_region(struct mm_struc free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? 
next->vm_start: 0); tlb_finish_mmu(tlb, start, end); + mm_notifier_invalidate_range(mm, start, end); + mm_notifier_invalidate_range_sync(mm, start, end); } /* @@ -2048,6 +2050,7 @@ void exit_mmap(struct mm_struct *mm) /* mm's last user has gone, and its about to be pulled down */ arch_exit_mmap(mm); + mm_notifier_release(mm); lru_add_drain(); flush_cache_mm(mm); Index: linux-2.6/mm/mprotect.c =================================================================== --- linux-2.6.orig/mm/mprotect.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/mprotect.c 2008-05-16 16:06:26.000000000 -0700 @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -132,6 +133,7 @@ static void change_protection(struct vm_ change_pud_range(mm, pgd, addr, next, newprot, dirty_accountable); } while (pgd++, addr = next, addr != end); flush_tlb_range(vma, start, end); + mm_notifier_invalidate_range(vma->vm_mm, start, end); } int @@ -211,6 +213,7 @@ success: hugetlb_change_protection(vma, start, end, vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); + mm_notifier_invalidate_range_sync(mm, start, end); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); return 0; Index: linux-2.6/mm/mremap.c =================================================================== --- linux-2.6.orig/mm/mremap.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/mm/mremap.c 2008-05-16 16:06:26.000000000 -0700 @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -74,6 +75,7 @@ static void move_ptes(struct vm_area_str struct mm_struct *mm = vma->vm_mm; pte_t *old_pte, *new_pte, pte; spinlock_t *old_ptl, *new_ptl; + unsigned long old_start = old_addr; if (vma->vm_file) { /* @@ -100,6 +102,7 @@ static void move_ptes(struct vm_area_str spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); arch_enter_lazy_mmu_mode(); + mm_notifier_invalidate_range(mm, old_addr, old_end); for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE, new_pte++, new_addr += PAGE_SIZE) { if (pte_none(*old_pte)) @@ -116,6 +119,8 @@ static void move_ptes(struct vm_area_str pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) spin_unlock(&mapping->i_mmap_lock); + + mm_notifier_invalidate_range_sync(vma->vm_mm, old_start, old_end); } #define LATENCY_LIMIT (64 * PAGE_SIZE) Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/rmap.c 2008-05-16 16:06:26.000000000 -0700 @@ -52,6 +52,9 @@ #include +struct mm_notifier *mm_notifier_page_sync; +DECLARE_RWSEM(mm_notifier_page_sync_sem); + struct kmem_cache *anon_vma_cachep; /* This must be called under the mmap_sem. */ @@ -458,6 +461,7 @@ static int page_mkclean_one(struct page flush_cache_page(vma, address, pte_pfn(*pte)); entry = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -502,8 +506,8 @@ int page_mkclean(struct page *page) ret = 1; } } + mm_notifier_invalidate_page_sync(page); } - return ret; } EXPORT_SYMBOL_GPL(page_mkclean); @@ -725,6 +729,7 @@ static int try_to_unmap_one(struct page /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, page_to_pfn(page)); pteval = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -855,6 +860,7 @@ static void try_to_unmap_cluster(unsigne /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); pteval = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) @@ -1013,8 +1019,9 @@ int try_to_unmap(struct page *page, int else ret = try_to_unmap_file(page, migration); + mm_notifier_invalidate_page_sync(page); + if (!page_mapped(page)) ret = SWAP_SUCCESS; return ret; } - Index: linux-2.6/include/linux/rmap.h =================================================================== --- linux-2.6.orig/include/linux/rmap.h 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/include/linux/rmap.h 2008-05-16 18:32:52.000000000 -0700 @@ -133,4 +133,165 @@ static inline int page_mkclean(struct pa #define SWAP_AGAIN 1 #define SWAP_FAIL 2 +#ifdef CONFIG_MM_NOTIFIER + +struct mm_notifier_ops { + void (*invalidate_range)(struct mm_notifier *mn, struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_range_sync)(struct mm_notifier *mn, struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_page)(struct mm_notifier *mn, struct mm_struct *mm, + struct page *page, unsigned long addr); + void (*invalidate_page_sync)(struct mm_notifier *mn, struct mm_struct *mm, + struct page *page); + void (*release)(struct mm_notifier *mn, struct mm_struct *mm); + void (*destroy)(struct mm_notifier *mn, struct mm_struct *mm); +}; + +struct mm_notifier { + struct mm_notifier_ops *ops; + struct mm_struct *mm; + struct mm_notifier *next; + struct mm_notifier *next_page_sync; +}; + +extern struct mm_notifier *mm_notifier_page_sync; +extern struct rw_semaphore mm_notifier_page_sync_sem; + +/* + * Must hold mmap_sem when calling mm_notifier_register. + */ +static inline void mm_notifier_register(struct mm_notifier *mn, + struct mm_struct *mm) +{ + mn->mm = mm; + mn->next = mm->mm_notifier; + rcu_assign_pointer(mm->mm_notifier, mn); + if (mn->ops->invalidate_page_sync) { + down_write(&mm_notifier_page_sync_sem); + mn->next_page_sync = mm_notifier_page_sync; + mm_notifier_page_sync = mn; + up_write(&mm_notifier_page_sync_sem); + } +} + +/* + * Invalidate remote references in a particular address range + */ +static inline void mm_notifier_invalidate_range(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + mn->ops->invalidate_range(mn, mm, start, end); +} + +/* + * Invalidate remote references in a particular address range. + * Can sleep. Only return if all remote references have been removed. 
+ */ +static inline void mm_notifier_invalidate_range_sync(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + if (mn->ops->invalidate_range_sync) + mn->ops->invalidate_range_sync(mn, mm, start, end); +} + +/* + * Invalidate remote references to a page + */ +static inline void mm_notifier_invalidate_page(struct mm_struct *mm, + struct page *page, unsigned long addr) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + mn->ops->invalidate_page(mn, mm, page, addr); +} + +/* + * Invalidate remote references to a particular page. Only return + * if all references have been removed. + * + * Note: This is an expensive function since it is not clear at the time + * of call to which mm_struct() the page belongs. It walks through the + * mmlist and calls the mmu notifier ops for each address space in the + * system. At some point this needs to be optimized. + */ +static inline void mm_notifier_invalidate_page_sync(struct page *page) +{ + struct mm_notifier *mn; + + if (!PageNotifier(page)) + return; + + down_read(&mm_notifier_page_sync_sem); + + for (mn = mm_notifier_page_sync; mn; mn = mn->next_page_sync) + if (mn->ops->invalidate_page_sync) + mn->ops->invalidate_page_sync(mn, mn->mm, page); + + up_read(&mm_notifier_page_sync_sem); +} + +/* + * Invalidate all remote references before shutdown + */ +static inline void mm_notifier_release(struct mm_struct *mm) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + mn->ops->release(mn, mm); +} + +/* + * Release resources before freeing mm_struct. + */ +static inline void mm_notifier_destroy(struct mm_struct *mm) +{ + struct mm_notifier *mn; + + while (mm->mm_notifier) { + mn = mm->mm_notifier; + mm->mm_notifier = mn->next; + if (mn->ops->invalidate_page_sync) { + struct mm_notifier *m; + + down_write(&mm_notifier_page_sync_sem); + + if (mm_notifier_page_sync != mn) { + for (m = mm_notifier_page_sync; m; m = m->next_page_sync) + if (m->next_page_sync == mn) + break; + + m->next_page_sync = mn->next_page_sync; + } else + mm_notifier_page_sync = mn->next_page_sync; + + up_write(&mm_notifier_page_sync_sem); + } + mn->ops->destroy(mn, mm); + } +} +#else +static inline void mm_notifier_invalidate_range(struct mm_struct *mm, + unsigned long start, unsigned long end) {} +static inline void mm_notifier_invalidate_range_sync(struct mm_struct *mm, + unsigned long start, unsigned long end) {} +static inline void mm_notifier_invalidate_page(struct mm_struct *mm, + struct page *page, unsigned long address) {} +static inline void mm_notifier_invalidate_page_sync(struct page *page) {} +static inline void mm_notifier_release(struct mm_struct *mm) {} +static inline void mm_notifier_destroy(struct mm_struct *mm) {} +#endif + #endif /* _LINUX_RMAP_H */ Index: linux-2.6/mm/Kconfig =================================================================== --- linux-2.6.orig/mm/Kconfig 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/Kconfig 2008-05-16 16:06:26.000000000 -0700 @@ -205,3 +205,7 @@ config NR_QUICK config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MM_NOTIFIER + def_bool y + Index: linux-2.6/include/linux/mm_types.h =================================================================== --- linux-2.6.orig/include/linux/mm_types.h 2008-05-16 11:28:49.000000000 -0700 +++
linux-2.6/include/linux/mm_types.h 2008-05-16 16:06:26.000000000 -0700 @@ -244,6 +244,9 @@ struct mm_struct { struct file *exe_file; unsigned long num_exe_file_vmas; #endif +#ifdef CONFIG_MM_NOTIFIER + struct mm_notifier *mm_notifier; +#endif }; #endif /* _LINUX_MM_TYPES_H */ Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/include/linux/page-flags.h 2008-05-16 16:06:26.000000000 -0700 @@ -93,6 +93,7 @@ enum pageflags { PG_mappedtodisk, /* Has blocks allocated on-disk */ PG_reclaim, /* To be reclaimed asap */ PG_buddy, /* Page is free, on buddy lists */ + PG_notifier, /* Call notifier when page is changed/unmapped */ #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR PG_uncached, /* Page has been mapped as uncached */ #endif @@ -173,6 +174,8 @@ PAGEFLAG(MappedToDisk, mappedtodisk) PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim) PAGEFLAG(Readahead, reclaim) /* Reminder to do async read-ahead */ +PAGEFLAG(Notifier, notifier); + #ifdef CONFIG_HIGHMEM /* * Must use a macro here due to header dependency issues. page_zone() is not Index: linux-2.6/fs/hugetlbfs/inode.c =================================================================== --- linux-2.6.orig/fs/hugetlbfs/inode.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/fs/hugetlbfs/inode.c 2008-05-16 16:06:55.000000000 -0700 @@ -442,6 +442,8 @@ hugetlb_vmtruncate_list(struct prio_tree __unmap_hugepage_range(vma, vma->vm_start + v_offset, vma->vm_end); + mm_notifier_invalidate_range_sync(vma->vm_mm, + vma->vm_start + v_offset, vma->vm_end); } } From robert.j.woodruff at intel.com Tue May 20 11:35:24 2008 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 20 May 2008 11:35:24 -0700 Subject: [ofa-general] Current list of Linux maintainers and their email info In-Reply-To: <48326942.7080800@voltaire.com> References: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> <48326942.7080800@voltaire.com> Message-ID: Or wrote, > This is what I could find in the MAINTAINERS file for 2.6.25: >I am not sure to follow why there's a need to duplicate the Linux kernel >IB (RDMA) stack maintainers file at the ofa website, but if for some >reason people feel this is needed I suggest to have a smart link that >somehow goes to Linus tree and fetches the up2date info. >Or. We should have the list for a couple of reasons, first, not all OFA components are upstream, e.g., SDP. And also, the MAINTAINERS list in the kernel tree is only for the kernel components, so it would be good to have a list for the OFA user-space components/maintainers as well, and if we are posting a maintainers list on our website, might as well include the complete list. woody From sean.hefty at intel.com Tue May 20 12:06:14 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 20 May 2008 12:06:14 -0700 Subject: [ofa-general] Current list of Linux maintainers and their email info In-Reply-To: References: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> <48326942.7080800@voltaire.com> Message-ID: <000001c8baac$93e26c50$0d59180a@amr.corp.intel.com> OFED needs a separate list of maintainers. 
From tziporet at dev.mellanox.co.il Tue May 20 12:10:40 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 20 May 2008 22:10:40 +0300 Subject: [ofa-general] RE: Current list of Linux maintainers and their email info In-Reply-To: References: <4831F965.6060607@opengridcomputing.com> Message-ID: <48332230.2000609@mellanox.co.il> Woodruff, Robert J wrote: > Here is what I have so far as the list of kernel and userspace > components. > > > Regarding some of the ULPs - Roland is the maintainer in the kernel but we have other people in OFED that work on them I suggest we add their names too under OFED column. I also filled the user space components with my best knowledge on the owners Tziporet > > Kernel Components > > Core kernel drivers, infiniband/core > Sean Hefty, sean.hefty at intel.com > Roland Dreier, rdreir at csco.com > > Hardware Drivers: > Mellanox HCA drivers, infiniband/hw/mthca, infiniband/hw/mlx4 > In addition for Roland please add Jack Morgenstein for OFED > Qlogic HCA driver, infiniband/hw/ipath > > NetEffects RNIC driver, infiniband/hw/nes > > IBM HCA, infinband/hw/ehca > > Chelsio RNIC, infiniband/hw/cxgb3 > > Upper Level Protocols > > IPoIB > Can we also add here Eli Cohen since he is working on all the OFED related issues > SRP > Please add here Vu Pham for OFED > iSer > > SDP > Amir Vadai amirv at mellanox.co.il > SRPT > Vu Pham vuhuong at mellanox.com > qlgc_vnic > > RDS > Olaf Kirch olaf.kirch at oracle.com > User Space Components > > libibverbs > Roland Dreier > uDAPL > "Davis, Arlin R" > IB-Bonding > This is a kernel module and not user space. > IB-Sim > Sasha Khapyorsky > IB-Utils > Oren Kladnitsky > IB-Diags > Sasha Khapyorsky > libibcm > "Hefty, Sean" > librdmacm > "Hefty, Sean" > libibcommon > Roland Dreier > libibmad > Sasha Khapyorsky > libibumad > Sasha Khapyorsky > libipathverbs > > libmlx4 > Roland Dreier In addition for Roland please add Jack Morgenstein for OFED > libmthca > > Roland Dreier In addition for Roland please add Jack Morgenstein for OFED > libnes > > librdmacm > "Hefty, Sean" > libsdp > Amir Vadai amirv at mellanox.co.il > mpi-selector > Jeff Squyres (jsquyres) > mpitests > Pavel Shamis (Pasha) > mstflint Oren Kladnitsky > mvapich > "Pavel Shamis (Pasha)" > mvapich2 > Jonathan L. 
Perkins > openmpi > > Jeff Squyres (jsquyres) > open-iscsi > > opensm > Sasha Khapyorsky > > perftest > Oren Meron > qlvnictools > > qperf > Johann George > rds-tools > Olaf Kirch olaf.kirch at oracle.com > sdpnetstat > > Amir Vadai amirv at mellanox.co.il > srptools > Vu Pham vuhuong at mellanox.com From hrosenstock at xsigo.com Tue May 20 12:13:39 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 20 May 2008 12:13:39 -0700 Subject: [ofa-general] RE: Current list of Linux maintainers and their email info In-Reply-To: <48332230.2000609@mellanox.co.il> References: <4831F965.6060607@opengridcomputing.com> <48332230.2000609@mellanox.co.il> Message-ID: <1211310819.12616.634.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-20 at 22:10 +0300, Tziporet Koren wrote: > > libibcommon > > > Roland Dreier Sasha Khapyorsky From robert.j.woodruff at intel.com Tue May 20 13:12:48 2008 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 20 May 2008 13:12:48 -0700 Subject: [ofa-general] Re: Current list of Linux maintainers and their email info In-Reply-To: <4832954C.2080209@voltaire.com> References: <4831F965.6060607@opengridcomputing.com> <48326A44.2080606@voltaire.com> <48328489.2030305@mellanox.co.il> <4832954C.2080209@voltaire.com> Message-ID: Or wrote, Tziporet Koren wrote: >> But I think it will be good to know who is the maintainer from the IB >> side (at least for OFED users) >The mainline maintainer info of bonding is: >BONDING DRIVER >P: Jay Vosburgh >M: fubar at us.ibm.com >L: bonding-devel at lists.sourceforge.net >W: http://sourceforge.net/projects/bonding/ >S: Supported >You need to ask him in case you intend to copy this record from the >maintainers file, or ask Moni Shoua if you can list him as a contact for >issues not related directly to the mainline driver. >Or. Since this is a separate open source project in sourceforge, and not developed in OFA/OFED, perhaps we do not need this in our list of maintainers. From matthias at sgi.com Tue May 20 13:19:50 2008 From: matthias at sgi.com (Matthias Blankenhaus) Date: Tue, 20 May 2008 13:19:50 -0700 (PDT) Subject: [ofa-general] saquery port problems Message-ID: Howdy ! While using saquery to run some queries on a two-port HCA, I noticed some odd behavior. Here are my observations running on a SLES10SP2 (x86_64) Intel Xeon with a Mellanox Technologies MT25208 InfiniHost III Ex HCA: (01) saquery -C mthca0 -m This yields the output for port number two. This does not conform to the usual IB tools' behavior of reporting on port one by default. (02) saquery -C mthca0 -m -P 1 Fails with "Failed to find active port, check port status with 'ibstat'". This is incorrect, since # ibstat mthca0 1 CA: 'mthca0' Port 1: State: Active Physical state: LinkUp Rate: 20 Base lid: 5 LMC: 0 SM lid: 1 Capability mask: 0x02510a68 Port GUID: 0x0008f10403987dc5 This might be the reason why (01) reports on port two. (03) saquery -C mthca0 -m -P 2 Works, and its output is identical to that from (01). However, the following command options work: (04) saquery -P 1 -m Correctly yields the output for port one. In other words, port one seems to be fine, contrary to what (02) reports. (05) saquery -P 2 -m Correctly yields the output for port two. Is it incorrect to use -C and -P in combination? Why does saquery think that port one is not active?
Thanx, Matthias From jon at opengridcomputing.com Tue May 20 13:45:22 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Tue, 20 May 2008 15:45:22 -0500 Subject: [ofa-general] RDS flow control In-Reply-To: <200805191006.00114.olaf.kirch@oracle.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805141516.01908.okir@lst.de> <200805161638.18067.olaf.kirch@oracle.com> <200805191006.00114.olaf.kirch@oracle.com> Message-ID: <20080520204522.GD31790@opengridcomputing.com> On Mon, May 19, 2008 at 10:05:59AM +0200, Olaf Kirch wrote: > > > However, I'm still seeing performance degradation of ~5% with some packet > > sizes. And that is *just* the overhead from exchanging the credit information > > and checking it - at some point we need to take a spinlock, and that seems > > to delay things just enough to make a dent in my throughput graph. > > Here's an updated version of the flow control patch - which is now completely > lockless, and uses a single atomic_t to hold both credit counters. This has > given me back close to full performance in my testing (throughput seems to be > down less than 1%, which is almost within the noise range). > > I'll push it to my git tree a little later today, so folks can test it if > they like. Works well on my setup. With proper flow control, there should no longer be a need for rnr_retry (as there should always be a posted recv buffer waiting for the incoming data). I did a quick test and removed it and everything seemed to be happy on my rds-stress run. Thanks for pulling this in. Jon > > Olaf > -- > Olaf Kirch | --- o --- Nous sommes du soleil we love when we play > okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax > ---- > From: Olaf Kirch > Subject: RDS: Implement IB flow control > > Here it is - flow control for RDS/IB. > > This patch is still very much experimental. Here are the essentials: > > - The approach chosen here uses a credit-based flow control > mechanism. Every SEND WR (including ACKs) consumes one credit, > and if the sender runs out of credits, it stalls. > > - As new receive buffers are posted, credits are transferred to the > remote node (using yet another RDS header byte for this). > > - Flow control is negotiated during connection setup. Initial credits > are exchanged in the rds_ib_connect_private struct - sending a value > of zero (which is also the default for older protocol versions) > means no flow control. > > - We avoid deadlock (both nodes depleting their credits, and being > unable to inform the peer of newly posted buffers) by requiring > that the last credit can only be used if we're posting new credits > to the peer. > > The approach implemented here is lock-free; preliminary tests show > the impact on throughput to be less than 1%, and the impact on RTT, > CPU, TX delay and other metrics to be below the noise threshold. > > Flow control is configurable via sysctl. It only affects newly created > connections, however - so your best bet is to set this right after loading > the RDS module.
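The single atomic_t mentioned above is the heart of the lockless design: send credits live in the low 16 bits of the word, freshly posted receive credits in the high 16 bits, and both are updated together with a compare-and-swap loop (see the IB_GET_SEND_CREDITS / IB_GET_POST_CREDITS macros in the patch below). A minimal standalone C sketch of the packing idea follows - this is deliberately not the RDS code, and every name in it is invented for illustration:

	#include <stdatomic.h>
	#include <stdio.h>

	/* Two 16-bit counters packed into one 32-bit atomic word. */
	#define GET_SEND_CREDITS(v)	((v) & 0xffff)
	#define GET_POST_CREDITS(v)	((v) >> 16)
	#define SET_SEND_CREDITS(v)	((v) & 0xffff)
	#define SET_POST_CREDITS(v)	((v) << 16)

	static atomic_uint credits;

	/* Take up to 'wanted' send credits; returns how many were granted.
	 * The CAS loop retries if another thread updated the word meanwhile. */
	static unsigned int grab_send_credits(unsigned int wanted)
	{
		unsigned int oldval, newval, avail, got;

		do {
			oldval = atomic_load(&credits);
			avail = GET_SEND_CREDITS(oldval);
			got = avail < wanted ? avail : wanted;
			newval = oldval - SET_SEND_CREDITS(got);
		} while (!atomic_compare_exchange_weak(&credits, &oldval, newval));

		return got;
	}

	int main(void)
	{
		/* Peer granted 8 send credits; we also posted 4 fresh recv buffers. */
		atomic_fetch_add(&credits, SET_SEND_CREDITS(8) | SET_POST_CREDITS(4));
		printf("granted %u of 5 wanted credits\n", grab_send_credits(5));
		return 0;
	}

The real patch layers one more rule on top of this: the last send credit may only be spent if there are posted credits to advertise to the peer, which is what prevents the mutual-starvation deadlock described above.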
> > Signed-off-by: Olaf Kirch > --- > net/rds/ib.c | 1 > net/rds/ib.h | 30 ++++++++ > net/rds/ib_cm.c | 49 ++++++++++++- > net/rds/ib_recv.c | 48 +++++++++--- > net/rds/ib_send.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++++ > net/rds/ib_stats.c | 3 > net/rds/ib_sysctl.c | 10 ++ > net/rds/rds.h | 4 - > 8 files changed, 325 insertions(+), 14 deletions(-) > > Index: ofa_kernel-1.3/net/rds/ib.h > =================================================================== > --- ofa_kernel-1.3.orig/net/rds/ib.h > +++ ofa_kernel-1.3/net/rds/ib.h > @@ -46,6 +46,7 @@ struct rds_ib_connect_private { > __be16 dp_protocol_minor_mask; /* bitmask */ > __be32 dp_reserved1; > __be64 dp_ack_seq; > + __be32 dp_credit; /* non-zero enables flow ctl */ > }; > > struct rds_ib_send_work { > @@ -110,15 +111,32 @@ struct rds_ib_connection { > struct ib_sge i_ack_sge; > u64 i_ack_dma; > unsigned long i_ack_queued; > + > + /* Flow control related information > + * > + * Our algorithm uses a pair variables that we need to access > + * atomically - one for the send credits, and one posted > + * recv credits we need to transfer to remote. > + * Rather than protect them using a slow spinlock, we put both into > + * a single atomic_t and update it using cmpxchg > + */ > + atomic_t i_credits; > > /* Protocol version specific information */ > unsigned int i_hdr_idx; /* 1 (old) or 0 (3.1 or later) */ > + unsigned int i_flowctl : 1; /* enable/disable flow ctl */ > > /* Batched completions */ > unsigned int i_unsignaled_wrs; > long i_unsignaled_bytes; > }; > > +/* This assumes that atomic_t is at least 32 bits */ > +#define IB_GET_SEND_CREDITS(v) ((v) & 0xffff) > +#define IB_GET_POST_CREDITS(v) ((v) >> 16) > +#define IB_SET_SEND_CREDITS(v) ((v) & 0xffff) > +#define IB_SET_POST_CREDITS(v) ((v) << 16) > + > struct rds_ib_ipaddr { > struct list_head list; > __be32 ipaddr; > @@ -153,14 +171,17 @@ struct rds_ib_statistics { > unsigned long s_ib_tx_cq_call; > unsigned long s_ib_tx_cq_event; > unsigned long s_ib_tx_ring_full; > + unsigned long s_ib_tx_throttle; > unsigned long s_ib_tx_sg_mapping_failure; > unsigned long s_ib_tx_stalled; > + unsigned long s_ib_tx_credit_updates; > unsigned long s_ib_rx_cq_call; > unsigned long s_ib_rx_cq_event; > unsigned long s_ib_rx_ring_empty; > unsigned long s_ib_rx_refill_from_cq; > unsigned long s_ib_rx_refill_from_thread; > unsigned long s_ib_rx_alloc_limit; > + unsigned long s_ib_rx_credit_updates; > unsigned long s_ib_ack_sent; > unsigned long s_ib_ack_send_failure; > unsigned long s_ib_ack_send_delayed; > @@ -244,6 +265,8 @@ void rds_ib_flush_mrs(void); > int __init rds_ib_recv_init(void); > void rds_ib_recv_exit(void); > int rds_ib_recv(struct rds_connection *conn); > +int rds_ib_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp, > + gfp_t page_gfp, int prefill); > void rds_ib_inc_purge(struct rds_incoming *inc); > void rds_ib_inc_free(struct rds_incoming *inc); > int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iovec *iov, > @@ -252,6 +275,7 @@ void rds_ib_recv_cq_comp_handler(struct > void rds_ib_recv_init_ring(struct rds_ib_connection *ic); > void rds_ib_recv_clear_ring(struct rds_ib_connection *ic); > void rds_ib_recv_init_ack(struct rds_ib_connection *ic); > +void rds_ib_attempt_ack(struct rds_ib_connection *ic); > void rds_ib_ack_send_complete(struct rds_ib_connection *ic); > u64 rds_ib_piggyb_ack(struct rds_ib_connection *ic); > > @@ -266,12 +290,17 @@ u32 rds_ib_ring_completed(struct rds_ib_ > extern wait_queue_head_t rds_ib_ring_empty_wait; > > /* ib_send.c 
*/ > +void rds_ib_xmit_complete(struct rds_connection *conn); > int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm, > unsigned int hdr_off, unsigned int sg, unsigned int off); > void rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context); > void rds_ib_send_init_ring(struct rds_ib_connection *ic); > void rds_ib_send_clear_ring(struct rds_ib_connection *ic); > int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op); > +void rds_ib_send_add_credits(struct rds_connection *conn, unsigned int credits); > +void rds_ib_advertise_credits(struct rds_connection *conn, unsigned int posted); > +int rds_ib_send_grab_credits(struct rds_ib_connection *ic, u32 wanted, > + u32 *adv_credits); > > /* ib_stats.c */ > RDS_DECLARE_PER_CPU(struct rds_ib_statistics, rds_ib_stats); > @@ -287,6 +316,7 @@ extern unsigned long rds_ib_sysctl_max_r > extern unsigned long rds_ib_sysctl_max_unsig_wrs; > extern unsigned long rds_ib_sysctl_max_unsig_bytes; > extern unsigned long rds_ib_sysctl_max_recv_allocation; > +extern unsigned int rds_ib_sysctl_flow_control; > extern ctl_table rds_ib_sysctl_table[]; > > /* > Index: ofa_kernel-1.3/net/rds/ib_cm.c > =================================================================== > --- ofa_kernel-1.3.orig/net/rds/ib_cm.c > +++ ofa_kernel-1.3/net/rds/ib_cm.c > @@ -55,6 +55,22 @@ static void rds_ib_set_protocol(struct r > } > > /* > + * Set up flow control > + */ > +static void rds_ib_set_flow_control(struct rds_connection *conn, u32 credits) > +{ > + struct rds_ib_connection *ic = conn->c_transport_data; > + > + if (rds_ib_sysctl_flow_control && credits != 0) { > + /* We're doing flow control */ > + ic->i_flowctl = 1; > + rds_ib_send_add_credits(conn, credits); > + } else { > + ic->i_flowctl = 0; > + } > +} > + > +/* > * Connection established. > * We get here for both outgoing and incoming connection. > */ > @@ -72,12 +88,16 @@ static void rds_ib_connect_complete(stru > rds_ib_set_protocol(conn, > RDS_PROTOCOL(dp->dp_protocol_major, > dp->dp_protocol_minor)); > + rds_ib_set_flow_control(conn, be32_to_cpu(dp->dp_credit)); > } > > - rdsdebug("RDS/IB: ib conn complete on %u.%u.%u.%u version %u.%u\n", > + printk(KERN_NOTICE "RDS/IB: connected to %u.%u.%u.%u version %u.%u%s\n", > NIPQUAD(conn->c_laddr), > RDS_PROTOCOL_MAJOR(conn->c_version), > - RDS_PROTOCOL_MINOR(conn->c_version)); > + RDS_PROTOCOL_MINOR(conn->c_version), > + ic->i_flowctl? ", flow control" : ""); > + > + rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 1); > > /* Tune the RNR timeout. We use a rather low timeout, but > * not the absolute minimum - this should be tunable. > @@ -129,6 +149,24 @@ static void rds_ib_cm_fill_conn_param(st > dp->dp_protocol_minor_mask = cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS); > dp->dp_ack_seq = rds_ib_piggyb_ack(ic); > > + /* Advertise flow control. > + * > + * Major chicken and egg alert! > + * We would like to post receive buffers before we get here (eg. > + * in rds_ib_setup_qp), so that we can give the peer an accurate > + * credit value. > + * Unfortunately we can't post receive buffers until we've finished > + * protocol negotiation, and know in which order data and payload > + * are arranged. > + * > + * What we do here is we give the peer a small initial credit, and > + * initialize the number of posted buffers to a negative value. 
> + */ > + if (ic->i_flowctl) { > + atomic_set(&ic->i_credits, IB_SET_POST_CREDITS(-4)); > + dp->dp_credit = cpu_to_be32(4); > + } > + > conn_param->private_data = dp; > conn_param->private_data_len = sizeof(*dp); > } > @@ -363,6 +401,7 @@ static int rds_ib_cm_handle_connect(stru > ic = conn->c_transport_data; > > rds_ib_set_protocol(conn, version); > + rds_ib_set_flow_control(conn, be32_to_cpu(dp->dp_credit)); > > /* If the peer gave us the last packet it saw, process this as if > * we had received a regular ACK. */ > @@ -428,6 +467,7 @@ out: > static int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id) > { > struct rds_connection *conn = cm_id->context; > + struct rds_ib_connection *ic = conn->c_transport_data; > struct rdma_conn_param conn_param; > struct rds_ib_connect_private dp; > int ret; > @@ -435,6 +475,7 @@ static int rds_ib_cm_initiate_connect(st > /* If the peer doesn't do protocol negotiation, we must > * default to RDSv3.0 */ > rds_ib_set_protocol(conn, RDS_PROTOCOL_3_0); > + ic->i_flowctl = rds_ib_sysctl_flow_control; /* advertise flow control */ > > ret = rds_ib_setup_qp(conn); > if (ret) { > @@ -688,6 +729,10 @@ void rds_ib_conn_shutdown(struct rds_con > #endif > ic->i_ack_recv = 0; > > + /* Clear flow control state */ > + ic->i_flowctl = 0; > + atomic_set(&ic->i_credits, 0); > + > if (ic->i_ibinc) { > rds_inc_put(&ic->i_ibinc->ii_inc); > ic->i_ibinc = NULL; > Index: ofa_kernel-1.3/net/rds/ib_recv.c > =================================================================== > --- ofa_kernel-1.3.orig/net/rds/ib_recv.c > +++ ofa_kernel-1.3/net/rds/ib_recv.c > @@ -220,16 +220,17 @@ out: > * -1 is returned if posting fails due to temporary resource exhaustion. > */ > int rds_ib_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp, > - gfp_t page_gfp) > + gfp_t page_gfp, int prefill) > { > struct rds_ib_connection *ic = conn->c_transport_data; > struct rds_ib_recv_work *recv; > struct ib_recv_wr *failed_wr; > + unsigned int posted = 0; > int ret = 0; > u32 pos; > > - while (rds_conn_up(conn) && rds_ib_ring_alloc(&ic->i_recv_ring, 1, &pos)) { > - > + while ((prefill || rds_conn_up(conn)) > + && rds_ib_ring_alloc(&ic->i_recv_ring, 1, &pos)) { > if (pos >= ic->i_recv_ring.w_nr) { > printk(KERN_NOTICE "Argh - ring alloc returned pos=%u\n", > pos); > @@ -257,8 +258,14 @@ int rds_ib_recv_refill(struct rds_connec > ret = -1; > break; > } > + > + posted++; > } > > + /* We're doing flow control - update the window. */ > + if (ic->i_flowctl && posted) > + rds_ib_advertise_credits(conn, posted); > + > if (ret) > rds_ib_ring_unalloc(&ic->i_recv_ring, 1); > return ret; > @@ -436,7 +443,7 @@ static u64 rds_ib_get_ack(struct rds_ib_ > #endif > > > -static void rds_ib_send_ack(struct rds_ib_connection *ic) > +static void rds_ib_send_ack(struct rds_ib_connection *ic, unsigned int adv_credits) > { > struct rds_header *hdr = ic->i_ack; > struct ib_send_wr *failed_wr; > @@ -448,6 +455,7 @@ static void rds_ib_send_ack(struct rds_i > rdsdebug("send_ack: ic %p ack %llu\n", ic, (unsigned long long) seq); > rds_message_populate_header(hdr, 0, 0, 0); > hdr->h_ack = cpu_to_be64(seq); > + hdr->h_credit = adv_credits; > rds_message_make_checksum(hdr); > ic->i_ack_queued = jiffies; > > @@ -460,6 +468,8 @@ static void rds_ib_send_ack(struct rds_i > set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); > > rds_ib_stats_inc(s_ib_ack_send_failure); > + /* Need to finesse this later. 
*/ > + BUG(); > } else > rds_ib_stats_inc(s_ib_ack_sent); > } > @@ -502,15 +512,27 @@ static void rds_ib_send_ack(struct rds_i > * When we get here, we're called from the recv queue handler. > * Check whether we ought to transmit an ACK. > */ > -static void rds_ib_attempt_ack(struct rds_ib_connection *ic) > +void rds_ib_attempt_ack(struct rds_ib_connection *ic) > { > + unsigned int adv_credits; > + > if (!test_bit(IB_ACK_REQUESTED, &ic->i_ack_flags)) > return; > - if (!test_and_set_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags)) { > - clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); > - rds_ib_send_ack(ic); > - } else > + > + if (test_and_set_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags)) { > rds_ib_stats_inc(s_ib_ack_send_delayed); > + return; > + } > + > + /* Can we get a send credit? */ > + if (!rds_ib_send_grab_credits(ic, 1, &adv_credits)) { > + rds_ib_stats_inc(s_ib_tx_throttle); > + clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags); > + return; > + } > + > + clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); > + rds_ib_send_ack(ic, adv_credits); > } > > /* > @@ -706,6 +728,10 @@ void rds_ib_process_recv(struct rds_conn > state->ack_recv = be64_to_cpu(ihdr->h_ack); > state->ack_recv_valid = 1; > > + /* Process the credits update if there was one */ > + if (ihdr->h_credit) > + rds_ib_send_add_credits(conn, ihdr->h_credit); > + > if (ihdr->h_sport == 0 && ihdr->h_dport == 0 && byte_len == 0) { > /* This is an ACK-only packet. The fact that it gets > * special treatment here is that historically, ACKs > @@ -877,7 +903,7 @@ void rds_ib_recv_cq_comp_handler(struct > > if (mutex_trylock(&ic->i_recv_mutex)) { > if (rds_ib_recv_refill(conn, GFP_ATOMIC, > - GFP_ATOMIC | __GFP_HIGHMEM)) > + GFP_ATOMIC | __GFP_HIGHMEM, 0)) > ret = -EAGAIN; > else > rds_ib_stats_inc(s_ib_rx_refill_from_cq); > @@ -901,7 +927,7 @@ int rds_ib_recv(struct rds_connection *c > * we're really low and we want the caller to back off for a bit. > */ > mutex_lock(&ic->i_recv_mutex); > - if (rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER)) > + if (rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 0)) > ret = -ENOMEM; > else > rds_ib_stats_inc(s_ib_rx_refill_from_thread); > Index: ofa_kernel-1.3/net/rds/ib.c > =================================================================== > --- ofa_kernel-1.3.orig/net/rds/ib.c > +++ ofa_kernel-1.3/net/rds/ib.c > @@ -187,6 +187,7 @@ static void rds_ib_exit(void) > > struct rds_transport rds_ib_transport = { > .laddr_check = rds_ib_laddr_check, > + .xmit_complete = rds_ib_xmit_complete, > .xmit = rds_ib_xmit, > .xmit_cong_map = NULL, > .xmit_rdma = rds_ib_xmit_rdma, > Index: ofa_kernel-1.3/net/rds/ib_send.c > =================================================================== > --- ofa_kernel-1.3.orig/net/rds/ib_send.c > +++ ofa_kernel-1.3/net/rds/ib_send.c > @@ -245,6 +245,144 @@ void rds_ib_send_cq_comp_handler(struct > } > } > > +/* > + * This is the main function for allocating credits when sending > + * messages. > + * > + * Conceptually, we have two counters: > + * - send credits: this tells us how many WRs we're allowed > + * to submit without overruning the reciever's queue. For > + * each SEND WR we post, we decrement this by one. > + * > + * - posted credits: this tells us how many WRs we recently > + * posted to the receive queue. This value is transferred > + * to the peer as a "credit update" in a RDS header field. > + * Every time we transmit credits to the peer, we subtract > + * the amount of transferred credits from this counter. 
> + * > + * It is essential that we avoid situations where both sides have > + * exhausted their send credits, and are unable to send new credits > + * to the peer. We achieve this by requiring that we send at least > + * one credit update to the peer before exhausting our credits. > + * When new credits arrive, we subtract one credit that is withheld > + * until we've posted new buffers and are ready to transmit these > + * credits (see rds_ib_send_add_credits below). > + * > + * The RDS send code is essentially single-threaded; rds_send_xmit > + * grabs c_send_sem to ensure exclusive access to the send ring. > + * However, the ACK sending code is independent and can race with > + * message SENDs. > + * > + * In the send path, we need to update the counters for send credits > + * and the counter of posted buffers atomically - when we use the > + * last available credit, we cannot allow another thread to race us > + * and grab the posted credits counter. Hence, we have to use a > + * spinlock to protect the credit counter, or use atomics. > + * > + * Spinlocks shared between the send and the receive path are bad, > + * because they create unnecessary delays. An early implementation > + * using a spinlock showed a 5% degradation in throughput at some > + * loads. > + * > + * This implementation avoids spinlocks completely, putting both > + * counters into a single atomic, and updating that atomic using > + * atomic_add (in the receive path, when receiving fresh credits), > + * and using atomic_cmpxchg when updating the two counters. > + */ > +int rds_ib_send_grab_credits(struct rds_ib_connection *ic, > + u32 wanted, u32 *adv_credits) > +{ > + unsigned int avail, posted, got = 0, advertise; > + long oldval, newval; > + > + *adv_credits = 0; > + if (!ic->i_flowctl) > + return wanted; > + > +try_again: > + advertise = 0; > + oldval = newval = atomic_read(&ic->i_credits); > + posted = IB_GET_POST_CREDITS(oldval); > + avail = IB_GET_SEND_CREDITS(oldval); > + > + rdsdebug("rds_ib_send_grab_credits(%u): credits=%u posted=%u\n", > + wanted, avail, posted); > + > + /* The last credit must be used to send a credit updated. */ > + if (avail && !posted) > + avail--; > + > + if (avail < wanted) { > + struct rds_connection *conn = ic->i_cm_id->context; > + > + /* Oops, there aren't that many credits left! */ > + set_bit(RDS_LL_SEND_FULL, &conn->c_flags); > + got = avail; > + } else { > + /* Sometimes you get what you want, lalala. */ > + got = wanted; > + } > + newval -= IB_SET_SEND_CREDITS(got); > + > + if (got && posted) { > + advertise = min_t(unsigned int, posted, RDS_MAX_ADV_CREDIT); > + newval -= IB_SET_POST_CREDITS(advertise); > + } > + > + /* Finally bill everything */ > + if (atomic_cmpxchg(&ic->i_credits, oldval, newval) != oldval) > + goto try_again; > + > + *adv_credits = advertise; > + return got; > +} > + > +void rds_ib_send_add_credits(struct rds_connection *conn, unsigned int credits) > +{ > + struct rds_ib_connection *ic = conn->c_transport_data; > + > + if (credits == 0) > + return; > + > + rdsdebug("rds_ib_send_add_credits(%u): current=%u%s\n", > + credits, > + IB_GET_SEND_CREDITS(atomic_read(&ic->i_credits)), > + test_bit(RDS_LL_SEND_FULL, &conn->c_flags)? 
", ll_send_full" : ""); > + > + atomic_add(IB_SET_SEND_CREDITS(credits), &ic->i_credits); > + if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags)) > + queue_delayed_work(rds_wq, &conn->c_send_w, 0); > + > + WARN_ON(IB_GET_SEND_CREDITS(credits) >= 16384); > + > + rds_ib_stats_inc(s_ib_rx_credit_updates); > +} > + > +void rds_ib_advertise_credits(struct rds_connection *conn, unsigned int posted) > +{ > + struct rds_ib_connection *ic = conn->c_transport_data; > + > + if (posted == 0) > + return; > + > + atomic_add(IB_SET_POST_CREDITS(posted), &ic->i_credits); > + > + /* Decide whether to send an update to the peer now. > + * If we would send a credit update for every single buffer we > + * post, we would end up with an ACK storm (ACK arrives, > + * consumes buffer, we refill the ring, send ACK to remote > + * advertising the newly posted buffer... ad inf) > + * > + * Performance pretty much depends on how often we send > + * credit updates - too frequent updates mean lots of ACKs. > + * Too infrequent updates, and the peer will run out of > + * credits and has to throttle. > + * For the time being, 16 seems to be a good compromise. > + */ > + if (IB_GET_POST_CREDITS(atomic_read(&ic->i_credits)) >= 16) > + set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); > +} > + > static inline void > rds_ib_xmit_populate_wr(struct rds_ib_connection *ic, > struct rds_ib_send_work *send, unsigned int pos, > @@ -307,6 +445,8 @@ int rds_ib_xmit(struct rds_connection *c > u32 pos; > u32 i; > u32 work_alloc; > + u32 credit_alloc; > + u32 adv_credits = 0; > int send_flags = 0; > int sent; > int ret; > @@ -314,6 +454,7 @@ int rds_ib_xmit(struct rds_connection *c > BUG_ON(off % RDS_FRAG_SIZE); > BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header)); > > + /* FIXME we may overallocate here */ > if (be32_to_cpu(rm->m_inc.i_hdr.h_len) == 0) > i = 1; > else > @@ -327,8 +468,29 @@ int rds_ib_xmit(struct rds_connection *c > goto out; > } > > + credit_alloc = work_alloc; > + if (ic->i_flowctl) { > + credit_alloc = rds_ib_send_grab_credits(ic, work_alloc, &adv_credits); > + if (credit_alloc < work_alloc) { > + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - credit_alloc); > + work_alloc = credit_alloc; > + } > + if (work_alloc == 0) { > + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc); > + rds_ib_stats_inc(s_ib_tx_throttle); > + ret = -ENOMEM; > + goto out; > + } > + } > + > /* map the message the first time we see it */ > if (ic->i_rm == NULL) { > + /* > + printk(KERN_NOTICE "rds_ib_xmit prep msg dport=%u flags=0x%x len=%d\n", > + be16_to_cpu(rm->m_inc.i_hdr.h_dport), > + rm->m_inc.i_hdr.h_flags, > + be32_to_cpu(rm->m_inc.i_hdr.h_len)); > + */ > if (rm->m_nents) { > rm->m_count = ib_dma_map_sg(dev, > rm->m_sg, rm->m_nents, DMA_TO_DEVICE); > @@ -449,6 +611,24 @@ add_header: > * have been set up to point to the right header buffer. 
*/ > memcpy(&ic->i_send_hdrs[pos], &rm->m_inc.i_hdr, sizeof(struct rds_header)); > > + if (0) { > + struct rds_header *hdr = &ic->i_send_hdrs[pos]; > + > + printk(KERN_NOTICE "send WR dport=%u flags=0x%x len=%d\n", > + be16_to_cpu(hdr->h_dport), > + hdr->h_flags, > + be32_to_cpu(hdr->h_len)); > + } > + if (adv_credits) { > + struct rds_header *hdr = &ic->i_send_hdrs[pos]; > + > + /* add credit and redo the header checksum */ > + hdr->h_credit = adv_credits; > + rds_message_make_checksum(hdr); > + adv_credits = 0; > + rds_ib_stats_inc(s_ib_tx_credit_updates); > + } > + > if (prev) > prev->s_wr.next = &send->s_wr; > prev = send; > @@ -472,6 +652,8 @@ add_header: > rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - i); > work_alloc = i; > } > + if (ic->i_flowctl && i < credit_alloc) > + rds_ib_send_add_credits(conn, credit_alloc - i); > > /* XXX need to worry about failed_wr and partial sends. */ > failed_wr = &first->s_wr; > @@ -487,11 +669,14 @@ add_header: > ic->i_rm = prev->s_rm; > prev->s_rm = NULL; > } > + /* Finesse this later */ > + BUG(); > goto out; > } > > ret = sent; > out: > + BUG_ON(adv_credits); > return ret; > } > > @@ -630,3 +815,12 @@ int rds_ib_xmit_rdma(struct rds_connecti > out: > return ret; > } > + > +void rds_ib_xmit_complete(struct rds_connection *conn) > +{ > + struct rds_ib_connection *ic = conn->c_transport_data; > + > + /* We may have a pending ACK or window update we were unable > + * to send previously (due to flow control). Try again. */ > + rds_ib_attempt_ack(ic); > +} > Index: ofa_kernel-1.3/net/rds/ib_stats.c > =================================================================== > --- ofa_kernel-1.3.orig/net/rds/ib_stats.c > +++ ofa_kernel-1.3/net/rds/ib_stats.c > @@ -46,14 +46,17 @@ static char *rds_ib_stat_names[] = { > "ib_tx_cq_call", > "ib_tx_cq_event", > "ib_tx_ring_full", > + "ib_tx_throttle", > "ib_tx_sg_mapping_failure", > "ib_tx_stalled", > + "ib_tx_credit_updates", > "ib_rx_cq_call", > "ib_rx_cq_event", > "ib_rx_ring_empty", > "ib_rx_refill_from_cq", > "ib_rx_refill_from_thread", > "ib_rx_alloc_limit", > + "ib_rx_credit_updates", > "ib_ack_sent", > "ib_ack_send_failure", > "ib_ack_send_delayed", > Index: ofa_kernel-1.3/net/rds/rds.h > =================================================================== > --- ofa_kernel-1.3.orig/net/rds/rds.h > +++ ofa_kernel-1.3/net/rds/rds.h > @@ -170,6 +170,7 @@ struct rds_connection { > #define RDS_FLAG_CONG_BITMAP 0x01 > #define RDS_FLAG_ACK_REQUIRED 0x02 > #define RDS_FLAG_RETRANSMITTED 0x04 > +#define RDS_MAX_ADV_CREDIT 255 > > /* > * Maximum space available for extension headers. 
> @@ -183,7 +184,8 @@ struct rds_header { > __be16 h_sport; > __be16 h_dport; > u8 h_flags; > - u8 h_padding[5]; > + u8 h_credit; > + u8 h_padding[4]; > __sum16 h_csum; > > u8 h_exthdr[RDS_HEADER_EXT_SPACE]; > Index: ofa_kernel-1.3/net/rds/ib_sysctl.c > =================================================================== > --- ofa_kernel-1.3.orig/net/rds/ib_sysctl.c > +++ ofa_kernel-1.3/net/rds/ib_sysctl.c > @@ -53,6 +53,8 @@ unsigned long rds_ib_sysctl_max_unsig_by > static unsigned long rds_ib_sysctl_max_unsig_bytes_min = 1; > static unsigned long rds_ib_sysctl_max_unsig_bytes_max = ~0UL; > > +unsigned int rds_ib_sysctl_flow_control = 1; > + > ctl_table rds_ib_sysctl_table[] = { > { > .ctl_name = 1, > @@ -102,6 +104,14 @@ ctl_table rds_ib_sysctl_table[] = { > .mode = 0644, > .proc_handler = &proc_doulongvec_minmax, > }, > + { > + .ctl_name = 6, > + .procname = "flow_control", > + .data = &rds_ib_sysctl_flow_control, > + .maxlen = sizeof(rds_ib_sysctl_flow_control), > + .mode = 0644, > + .proc_handler = &proc_dointvec, > + }, > { .ctl_name = 0} > }; > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From matthias at sgi.com Tue May 20 13:44:32 2008 From: matthias at sgi.com (Matthias Blankenhaus) Date: Tue, 20 May 2008 13:44:32 -0700 (PDT) Subject: [ofa-general] saquery port problems In-Reply-To: References: Message-ID: Forgot some important info: saquery BUILD VERSION: 1.3.6 OFED-1.3 On Tue, 20 May 2008, Matthias Blankenhaus wrote: > Howdy ! > > While using this tool to run some queries on a two port HCA, I noticed > some odd behavior. Here are my observations running on a SLES10SP2 > (x86_64) Intel Xeon with a Mellanox Technologies MT25208 InfiniHost III > Ex HCA: > > (01) saquery -C mthca0 -m > This yields the output for port number two. This is not conform with the > usual ib tools behavior to report on port one per default. > > (02) saquery -C mthca0 -m -P 1 > Fails with "Failed to find active port, check port status with "ibstat". > This is incorrect, since > > # ibstat mthca0 1 > CA: 'mthca0' > Port 1: > State: Active > Physical state: LinkUp > Rate: 20 > Base lid: 5 > LMC: 0 > SM lid: 1 > Capability mask: 0x02510a68 > Port GUID: 0x0008f10403987dc5 > > This might be the reason why (01) report on port two. > > (03) saquery -C mthca0 -m -P 2 > Works and is identical with the out out from (01). > > However, the following command options work: > > (04) saquery -P 1 -m > Correctly yields the output for port one. In other words > port one seems to be fine unlike reported in (02). > > (05) saquery -P 2 -m > Correctly yields the output for port two. > > > Is it incorrect to use -C and -P in combination ? Why does does > saquery think that port one is not active ? 
> > > Thanx, > Matthias > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From Thomas.Talpey at netapp.com Tue May 20 13:53:39 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 20 May 2008 16:53:39 -0400 Subject: [ofa-general] RDS flow control In-Reply-To: <20080520204522.GD31790@opengridcomputing.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805141516.01908.okir@lst.de> <200805161638.18067.olaf.kirch@oracle.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> Message-ID: At 04:45 PM 5/20/2008, Jon Mason wrote: >With proper flow control, there should no longer be a need for >rnr_retry (as there >should always be a posted recv buffer waiting for the incoming data). I did a >quick test and removed it and everything seemed to be happy on my >rds-stress run. I'd be interested in any extended load-testing of operation with rnr_retry==0 that you might be able to do. The NFS/RDMA client sets it to zero, for the same reason (the rpcrdma protocol exchanges credits). But at the NFS Connectathon last week we were seeing spontaneous connection loss, that went away when we set rnr_retry to 7 (infinity). However, it also did not appear when it was set to 1, and later we were able to pass again at zero. Very strange, I'm still trying to figure if it's an upper layer issue or some lower layer timing quirk. The switch we were using there was a bit flaky. Tom. From rdreier at cisco.com Tue May 20 14:02:18 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 14:02:18 -0700 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: <200805191007.24888.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 19 May 2008 10:07:24 +0300") References: <200805190812.39408.jackm@dev.mellanox.co.il> <200805191007.24888.jackm@dev.mellanox.co.il> Message-ID: > Then, we get into the complexity of sanity checking in create_qp (since we should > be able to use the value returned by create-qp when calling create-qp, and get > the same result). Essentially, we will need to check the requested sge numbers > per QP type, whether it is for send or receive, etc. IMHO, this gets nasty very > quickly -- creates a problem with support -- users will need a "roadmap" for create-qp. Actually it seems pretty easy to understand -- returned max_sge is the largest value that is guaranteed to work. If it happens that the requested QP gives more capabilities "for free" then the driver will tell you in the returned structure. But whatever. > I much prefer to treat the query_hca returned values as absolute maxima, and enforce > these limits (although this is at the expense of additional s/g entries for some > qp types and send/receive). OK, I added the patch below to fix this mlx4 bug without returning any s/g entries beyond what the device returns. commit cd155c1c7c9e64df6afb5504d292fef7cb783a4f Author: Roland Dreier Date: Tue May 20 14:00:02 2008 -0700 IB/mlx4: Fix creation of kernel QP with max number of send s/g entries When creating a kernel QP where the consumer asked for a send queue with lots of scatter/gater entries, set_kernel_sq_size() incorrectly returned an error if the send queue stride is larger than the hardware's maximum send work request descriptor size. 
This is not a problem; the only issue is to make sure that the actual descriptors used do not overflow the maximum descriptor size, so check this instead. Clamp the returned max_send_sge value to be no bigger than what query_device returns for the max_sge to avoid confusing hapless users, even if the hardware is capable of handling a few more s/g entries. This bug caused NFS/RDMA mounts to fail when the server adapter used the mlx4 driver. Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index cec030e..a80df22 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -333,6 +333,9 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg)) + send_wqe_overhead(type, qp->flags); + if (s > dev->dev->caps.max_sq_desc_sz) + return -EINVAL; + /* * Hermon supports shrinking WQEs, such that a single work * request can include multiple units of 1 << wqe_shift. This @@ -372,9 +375,6 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); for (;;) { - if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) - return -EINVAL; - qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1U << qp->sq.wqe_shift); /* @@ -395,7 +395,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, ++qp->sq.wqe_shift; } - qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - + qp->sq.max_gs = (min(dev->dev->caps.max_sq_desc_sz, + (qp->sq_max_wqes_per_wr << qp->sq.wqe_shift)) - send_wqe_overhead(type, qp->flags)) / sizeof (struct mlx4_wqe_data_seg); @@ -411,7 +412,9 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_send_wr = qp->sq.max_post = (qp->sq.wqe_cnt - qp->sq_spare_wqes) / qp->sq_max_wqes_per_wr; - cap->max_send_sge = qp->sq.max_gs; + cap->max_send_sge = min(qp->sq.max_gs, + min(dev->dev->caps.max_sq_sg, + dev->dev->caps.max_rq_sg)); /* We don't support inline sends for kernel QPs (yet) */ cap->max_inline_data = 0; From rdreier at cisco.com Tue May 20 14:02:59 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 14:02:59 -0700 Subject: Selective signalling vs WQE size - was Re: [ofa-general] mthca max_sge value... ugh. In-Reply-To: <1211260229.6556.18.camel@eli-laptop> (Eli Cohen's message of "Tue, 20 May 2008 08:10:29 +0300") References: <1211260229.6556.18.camel@eli-laptop> Message-ID: > Roland, I posted a few months ago a patch that optimizes post send for > selective signaling QPs. It must have slipped somehow because I did not > get any reply on it and since I did not know of anyone using selective > signaling I forgot about this too. The idea is that for selective > signaling QPs, before you stamp the WQE, you read the value of the DS > field which denotes the effective size of the descriptor as used in the > previous post, and stamp only that area, relying on the fact that the > rest of the descriptor is already stamped. Here is a link to the patch. > I don't know if it applies cleanly now but if we agree on the idea I > will generate it again against the current tree. Does it make a measurable difference? If so then it seems like a good idea. 
From olaf.kirch at oracle.com Tue May 20 14:13:39 2008 From: olaf.kirch at oracle.com (Olaf Kirch) Date: Tue, 20 May 2008 23:13:39 +0200 Subject: [ofa-general] RDS flow control In-Reply-To: <20080520204522.GD31790@opengridcomputing.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> Message-ID: <200805202313.40213.olaf.kirch@oracle.com> On Tuesday 20 May 2008 22:45:22 Jon Mason wrote: > Works well on my setup. Good to hear! > With proper flow control, there should no longer be a need for rnr_retry (as there > should always be a posted recv buffer waiting for the incoming data). I did a > quick test and removed it and everything seemed to be happy on my rds-stress run. I would like to make the setting of the RNR retry/timeout conditional on whether both ends of the connection support flow control or not - we need to think of rolling upgrades of a cluster, so mixed environments just have to work. Unfortunately, the RNR retry count is set prior to establishing the connection, before we even know whether the remote is capable of doing flow control. Is there a way of changing the RNR retry count back to 0 after establishing the connection? Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From rdreier at cisco.com Tue May 20 14:21:40 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 14:21:40 -0700 Subject: [ofa-general] RDS flow control In-Reply-To: <200805202313.40213.olaf.kirch@oracle.com> (Olaf Kirch's message of "Tue, 20 May 2008 23:13:39 +0200") References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> Message-ID: > Is there a way of changing the RNR retry count back to 0 after establishing > the connection? Yes... quite complicated but possible. Basically you have to transition to the QP to the "send queue drained" (SQD) state, change the rnr retry value in an SQD->SQD transition and then transition back to RTS. Not sure if anyone has ever tested that whole operation, so it may or may not actually work without driver/fw fixes required. - R. From hrosenstock at xsigo.com Tue May 20 14:24:15 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 20 May 2008 14:24:15 -0700 Subject: [ofa-general] RDS flow control In-Reply-To: References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> Message-ID: <1211318655.18236.31.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-20 at 14:21 -0700, Roland Dreier wrote: > > Is there a way of changing the RNR retry count back to 0 after establishing > > the connection? > > Yes... quite complicated but possible. Basically you have to transition > to the QP to the "send queue drained" (SQD) state, change the rnr retry > value in an SQD->SQD transition and then transition back to RTS. Not > sure if anyone has ever tested that whole operation, so it may or may > not actually work without driver/fw fixes required. That's the local end; is there some needed CM aspect of this too ? -- Hal > - R. 
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Tue May 20 14:27:45 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 14:27:45 -0700 Subject: [ofa-general] [PATCH] IPoIB: Test for NULL broadcast object in opiob_mcast_join_finish. In-Reply-To: <200805191703.05887.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 19 May 2008 17:03:05 +0300") References: <200805191703.05887.jackm@dev.mellanox.co.il> Message-ID: thanks, applied. From ralph.campbell at qlogic.com Tue May 20 14:28:38 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 20 May 2008 14:28:38 -0700 Subject: [ofa-general] RDS flow control In-Reply-To: <200805202313.40213.olaf.kirch@oracle.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> Message-ID: <1211318918.3949.268.camel@brick.pathscale.com> On Tue, 2008-05-20 at 23:13 +0200, Olaf Kirch wrote: > On Tuesday 20 May 2008 22:45:22 Jon Mason wrote: > > Works well on my setup. > > Good to hear! > > > With proper flow control, there should no longer be a need for rnr_retry (as there > > should always be a posted recv buffer waiting for the incoming data). I did a > > quick test and removed it and everything seemed to be happy on my rds-stress run. > > I would like to make the setting of the RNR retry/timeout conditional on > whether both ends of the connection support flow control or not - we need > to think of rolling upgrades of a cluster, so mixed environments just have > to work. Unfortunately, the RNR retry count is set prior to establishing > the connection, before we even know whether the remote is capable of doing > flow control. > > Is there a way of changing the RNR retry count back to 0 after establishing > the connection? You can use ib_modify_qp() to set the QP state to IB_QPS_SQD (drain), modify the IB_QP_RNR_RETRY parameter, and modify the QP back to IB_QPS_RTS. It seems to me that modify QP could allow a RTS to RTS transition and set the IB_QP_RNR_RETRY count but the qp_state_table[] doesn't seem to indicate that is valid. From rdreier at cisco.com Tue May 20 14:31:50 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 14:31:50 -0700 Subject: [ofa-general] RDS flow control In-Reply-To: <1211318655.18236.31.camel@hrosenstock-ws.xsigo.com> (Hal Rosenstock's message of "Tue, 20 May 2008 14:24:15 -0700") References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> <1211318655.18236.31.camel@hrosenstock-ws.xsigo.com> Message-ID: > That's the local end; is there some needed CM aspect of this too ? I don't think so. RNR retry behavior is purely local so I don't see any need to coordinate when changing it. - R. 
From rdreier at cisco.com Tue May 20 14:33:40 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 14:33:40 -0700 Subject: [ofa-general] RDS flow control In-Reply-To: <1211318918.3949.268.camel@brick.pathscale.com> (Ralph Campbell's message of "Tue, 20 May 2008 14:28:38 -0700") References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> <1211318918.3949.268.camel@brick.pathscale.com> Message-ID: > You can use ib_modify_qp() to set the QP state to IB_QPS_SQD (drain), > modify the IB_QP_RNR_RETRY parameter, and modify the QP back to > IB_QPS_RTS. It seems to me that modify QP could allow a RTS to RTS > transition and set the IB_QP_RNR_RETRY count but the qp_state_table[] > doesn't seem to indicate that is valid. The IB spec doesn't allow changing RNR retry on RTS to RTS transitions. Probably because synchronizing the change with in-flight send requests (that might be doing RNR handling at that moment) is too much of a mess. - R. From hrosenstock at xsigo.com Tue May 20 14:41:11 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 20 May 2008 14:41:11 -0700 Subject: [ofa-general] RDS flow control In-Reply-To: References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> <1211318655.18236.31.camel@hrosenstock-ws.xsigo.com> Message-ID: <1211319671.18236.38.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-20 at 14:31 -0700, Roland Dreier wrote: > > That's the local end; is there some needed CM aspect of this too ? > > I don't think so. RNR retry behavior is purely local so I don't see any > need to coordinate when changing it. Yes, but it is exchanged in both CM REQ and REP: The total number of times that the REQ or REP sender wishes the receiver to retry RNR NAK errors before posting a completion error -- Hal > - R. From dave.olson at qlogic.com Tue May 20 15:08:51 2008 From: dave.olson at qlogic.com (Dave Olson) Date: Tue, 20 May 2008 15:08:51 -0700 (PDT) Subject: [ofa-general] PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned Message-ID: Ralph Campbell will submit this patch for ofed 1.3.1, also. IB/MAD - fix crash when HCA returns IB_MAD_RESULT_SUCCESS|IB_MAD_RESULT_CONSUMED This was observed with the hw/ipath driver, but could happen with any driver. It's OFED bug 1027. The fix is to kfree the local data and break, rather than falling through. 
Signed-off-by: Dave Olson --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -747,7 +747,8 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, break; case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: kmem_cache_free(ib_mad_cache, mad_priv); - break; + kfree(local); + goto out; case IB_MAD_RESULT_SUCCESS: /* Treat like an incoming receive MAD */ port_priv = ib_get_mad_port(mad_agent_priv->agent.device, Dave Olson dave.olson at qlogic.com From arlin.r.davis at intel.com Tue May 20 15:10:00 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Tue, 20 May 2008 15:10:00 -0700 Subject: [ofa-general] [PATCH 1/1][v1.2] dtest: fix build issue with Redhat EL5.1 Message-ID: <002401c8bac6$4047fe10$2fbf020a@amr.corp.intel.com> need include files/definitions for sleep, getpid, gettimeofday Signed-off by: Arlin Davis ardavis at ichips.intel.com --- test/dtest/dtest.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c index 2db141f..039b6bf 100755 --- a/test/dtest/dtest.c +++ b/test/dtest/dtest.c @@ -35,13 +35,16 @@ #include #include #include +#include #include +#include #include #include #include #include #include #include +#include #ifndef DAPL_PROVIDER #define DAPL_PROVIDER "OpenIB-cma" -- 1.5.2.5 From rdreier at cisco.com Tue May 20 15:10:08 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 15:10:08 -0700 Subject: [ofa-general] RDS flow control In-Reply-To: <1211319671.18236.38.camel@hrosenstock-ws.xsigo.com> (Hal Rosenstock's message of "Tue, 20 May 2008 14:41:11 -0700") References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> <1211318655.18236.31.camel@hrosenstock-ws.xsigo.com> <1211319671.18236.38.camel@hrosenstock-ws.xsigo.com> Message-ID: > Yes, but it is exchanged in both CM REQ and REP: > The total number of times that the REQ or REP sender wishes the receiver > to retry RNR NAK errors before posting a completion error I know -- but there isn't any requirement that I know of to do any further CM stuff if the values change after the connection is established. From rdreier at cisco.com Tue May 20 15:17:59 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 15:17:59 -0700 Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: (Dave Olson's message of "Tue, 20 May 2008 15:08:51 -0700 (PDT)") References: Message-ID: > case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: > kmem_cache_free(ib_mad_cache, mad_priv); > - break; > + kfree(local); > + goto out; Seems you need to set ret = 1 here? Otherwise I think ib_post_send_mad will continue handling the send even though the packet was supposedly consumed. Also as a side note, I think handle_outgoing_dr_smp() would be clearer if rather than having out: return ret; and then doing stuff like ret = -EINVAL; goto out; the code just did "return -EINVAL;" Maybe I'll do that cleanup for 2.6.27. - R. 
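For reference, the case arm with that fix folded in would read roughly as follows (a sketch only, not a tested patch; ret == 1 is the value ib_post_send_mad() treats as "locally consumed", as the caller quoted later in this thread shows):

    case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED:
            kmem_cache_free(ib_mad_cache, mad_priv);
            kfree(local);
            /* tell ib_post_send_mad() the MAD was locally consumed */
            ret = 1;
            goto out;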
From dave.olson at qlogic.com Tue May 20 15:23:26 2008 From: dave.olson at qlogic.com (Dave Olson) Date: Tue, 20 May 2008 15:23:26 -0700 (PDT) Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: References: Message-ID: On Tue, 20 May 2008, Roland Dreier wrote: | > case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: | > kmem_cache_free(ib_mad_cache, mad_priv); | > - break; | > + kfree(local); | > + goto out; | | Seems you need to set ret = 1 here? Otherwise I think ib_post_send_mad | will continue handling the send even though the packet was supposedly | consumed. Yes, you are right about the fact that it should be set, but apparently all callers are simply checking for a return value > 0, because the packet is only sent once (return values > 1 have no defined meaning so I'm not surprised the callers just check > 0). Do you want me to resubmit it that way, or do you want to make the change? | Also as a side note, I think handle_outgoing_dr_smp() would be clearer | if rather than having | | out: | return ret; | | and then doing stuff like | | ret = -EINVAL; | goto out; | | the code just did "return -EINVAL;" | | Maybe I'll do that cleanup for 2.6.27. Seems reasonable enough to me. Dave Olson dave.olson at qlogic.com From ralph.campbell at qlogic.com Tue May 20 15:28:07 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 20 May 2008 15:28:07 -0700 Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: References: Message-ID: <1211322487.3949.275.camel@brick.pathscale.com> On Tue, 2008-05-20 at 15:17 -0700, Roland Dreier wrote: > > case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: > > kmem_cache_free(ib_mad_cache, mad_priv); > > - break; > > + kfree(local); > > + goto out; > > Seems you need to set ret = 1 here? Otherwise I think ib_post_send_mad > will continue handling the send even though the packet was supposedly > consumed. I agree. > Also as a side note, I think handle_outgoing_dr_smp() would be clearer > if rather than having > > out: > return ret; > > and then doing stuff like > > ret = -EINVAL; > goto out; > > the code just did "return -EINVAL;" > > Maybe I'll do that cleanup for 2.6.27. I also agree but I remember at one point we got pushback from one of the mainline kernel developers who really wanted to see only one return point in the code even if it meant more gotos. I don't remember who though. From sean.hefty at intel.com Tue May 20 15:31:46 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 20 May 2008 15:31:46 -0700 Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c whenIB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: <1211322487.3949.275.camel@brick.pathscale.com> References: <1211322487.3949.275.camel@brick.pathscale.com> Message-ID: <000101c8bac9$49c529b0$0d59180a@amr.corp.intel.com> >I also agree but I remember at one point we got pushback from >one of the mainline kernel developers who really wanted to see >only one return point in the code even if it meant more gotos. >I don't remember who though. I think the coding style document calls out using a single return point, but I don't think that's always the cleanest approach either. 
From rdreier at cisco.com Tue May 20 15:33:45 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 15:33:45 -0700 Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: (Dave Olson's message of "Tue, 20 May 2008 15:23:26 -0700 (PDT)") References: Message-ID: > Yes, you are right about the fact that it should be set, but apparently > all callers are simply checking for a return value > 0, because the > packet is only sent once (return values > 1 have no defined meaning so > I'm not surprised the callers just check > 0). In my tree (ie the upstream kernel) I see only one place handle_outgoing_dr_smp() is called, and it looks like: ret = handle_outgoing_dr_smp(mad_agent_priv, mad_send_wr); if (ret < 0) /* error */ goto error; else if (ret == 1) /* locally consumed */ continue; so I'm not sure I understand what you mean. Clearly ret == 1 is special (any other positive return value is treated like 0). > Do you want me to resubmit it that way, or do you want to make the > change? I can fix it up locally but you are in charge of making sure that OFED 1.3.1 gets what you want it to. From hrosenstock at xsigo.com Tue May 20 15:34:03 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 20 May 2008 15:34:03 -0700 Subject: [ofa-general] PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: References: Message-ID: <1211322844.18236.44.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-20 at 15:08 -0700, Dave Olson wrote: > Ralph Campbell will submit this patch for ofed 1.3.1, also. > > IB/MAD - fix crash when HCA returns IB_MAD_RESULT_SUCCESS|IB_MAD_RESULT_CONSUMED > > This was observed with the hw/ipath driver, but could happen with any > driver. Does this also occur with mthca/mlx4 ? Was the same thing that caused this on ipath tried with either of these HCAs ? > It's OFED bug 1027. What's the port disable command which causes this crash ? -- Hal > The fix is to kfree the local data and break, rather than falling through.
> > Signed-off-by: Dave Olson > > --- a/drivers/infiniband/core/mad.c > +++ b/drivers/infiniband/core/mad.c > @@ -747,7 +747,8 @@ static int handle_outgoing_dr_smp(struct > ib_mad_agent_private *mad_agent_priv, > break; > case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: > kmem_cache_free(ib_mad_cache, mad_priv); > - break; > + kfree(local); > + goto out; > case IB_MAD_RESULT_SUCCESS: > /* Treat like an incoming receive MAD */ > port_priv = ib_get_mad_port(mad_agent_priv->agent.device, > > Dave Olson > dave.olson at qlogic.com From rdreier at cisco.com Tue May 20 15:37:37 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 15:37:37 -0700 Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: <000101c8bac9$49c529b0$0d59180a@amr.corp.intel.com> (Sean Hefty's message of "Tue, 20 May 2008 15:31:46 -0700") References: <1211322487.3949.275.camel@brick.pathscale.com> <000101c8bac9$49c529b0$0d59180a@amr.corp.intel.com> Message-ID: > >I also agree but I remember at one point we got pushback from > >one of the mainline kernel developers who really wanted to see > >only one return point in the code even if it meant more gotos. > >I don't remember who though. > > I think the coding style document calls out using a single return point, but I > don't think that's always the cleanest approach either. CodingStyle suggests using goto to avoid duplicating cleanup code at every return. I don't think anyone would argue in favor of the style we're talking about here of using goto to jump to a plain return statement. It doesn't help avoid bugs caused by missing cleanup, and it actually *causes* bugs like the one here where it becomes easy to forget what the function is returning. - R. From dave.olson at qlogic.com Tue May 20 15:41:43 2008 From: dave.olson at qlogic.com (Dave Olson) Date: Tue, 20 May 2008 15:41:43 -0700 (PDT) Subject: [ofa-general] PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: <1211322844.18236.44.camel@hrosenstock-ws.xsigo.com> References: <1211322844.18236.44.camel@hrosenstock-ws.xsigo.com> Message-ID: On Tue, 20 May 2008, Hal Rosenstock wrote: | On Tue, 2008-05-20 at 15:08 -0700, Dave Olson wrote: | > Ralph Campbell will submit this patch for ofed 1.3.1, also. | > | > IB/MAD - fix crash when HCA returns IB_MAD_RESULT_SUCCESS|IB_MAD_RESULT_CONSUMED | > | > This was observed with the hw/ipath driver, but could happen with any | > driver. | | Does this also occur with mthca/mlx4 ? Was the same thing that caused | this on ipath tried with either of these HCAs ? No, it doesn't happen on mthca/mlx4, but presumably those drivers never return this value. Yes, we tried it with them. No problem is introduced on those by the change, either. | > It's OFED bug 1027. | | What's the port disable command which causes this crash ? It was with the QLogic QuickSilver iba_portdisable and iba_portenable commands (they've been ported to work with OFED 1.3). ibportstate refuses to work on non-switch nodes, although in the past we've done local modifications to allow it to work on HCAs as well (we never submitted that change, because we assumed somebody explicitly didn't want HCAs targeted).
Dave Olson dave.olson at qlogic.com From dave.olson at qlogic.com Tue May 20 15:44:47 2008 From: dave.olson at qlogic.com (Dave Olson) Date: Tue, 20 May 2008 15:44:47 -0700 (PDT) Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: References: Message-ID: On Tue, 20 May 2008, Roland Dreier wrote: | > Yes, you are right about the fact that it should be set, but apparently | > all callers are simply checking for a return value > 0, because the | > packet is only sent once (return values > 1 have no defined meaning so | > I'm not surprised the callers just check > 0). | | In my tree (ie the upstream kernel) I see only one place | handle_outgoing_dr_smp() is called, and it looks like: | | ret = handle_outgoing_dr_smp(mad_agent_priv, | mad_send_wr); | if (ret < 0) /* error */ | goto error; | else if (ret == 1) /* locally consumed */ | continue; | | so I'm not sure I understand what you mean. Clearly ret == 1 is special | (any other positive return value is treated like 0). Indeed, I see that also, now that I look again more carefully. We definitely didn't see an infinite attempt to send the packet, so something else must have cleaned that up for us. Anyway, returning 1 is clearly the right answer. | > Do you want me to resubmit it that way, or do you want to make the | > change? | | I can fix it up locally but you are in charge of making sure that OFED | 1.3.1 gets what you want it to. Thanks, and yes, we'll do that for OFED 1.3.1 Dave Olson dave.olson at qlogic.com From nab at linux-iscsi.org Tue May 20 15:48:17 2008 From: nab at linux-iscsi.org (Nicholas A. Bellinger) Date: Tue, 20 May 2008 15:48:17 -0700 Subject: [ofa-general] LIO-Target Core v3.0.0 imported in k.o git Message-ID: <1211323697.14731.68.camel@haakon2.linux-iscsi.org> Greetings all, The LIO-Target Core v3.0.0 tree has been imported from v2.9-STABLE from the Linux-iSCSI.org source tree repositories into kernel.org git, and is building w/ v2.6.26-rc3. It can be found at: http://git.kernel.org/?p=linux/kernel/git/nab/lio-core-2.6.git;a=summary I will be continuing the cleanup activities for upstream, which so far have included the removal of legacy unused engine level mirroring/replication bits (as we are using LIO-DRBD or LIO-NR1 w/ MD for this now), and a few LINUX_VERSION_CODE removals. [nab at hera lio-core]$ wc -l *.c *.h | tail -n 1 57879 total The goal now will be to separate out the LIO-Target / LIO-Core pieces for moving the latter upstream initially (eg a working passthrough), and then v3.0 LIO-Target using traditional iSCSI, then iWARP and iSER. This will all be going into the roadmap on Linux-iSCSI.org for reference. I invite interested parties to have a look, and please contact me on the LIO-Target devel list, or privately, if you would like to get involved looking at some code that is in line with your knowledge/interests/projects.
Many thanks for your most valuable of time, --nab From hrosenstock at xsigo.com Tue May 20 15:49:21 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 20 May 2008 15:49:21 -0700 Subject: [ofa-general] PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: References: <1211322844.18236.44.camel@hrosenstock-ws.xsigo.com> Message-ID: <1211323761.18236.54.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-20 at 15:41 -0700, Dave Olson wrote: > On Tue, 20 May 2008, Hal Rosenstock wrote: > > | On Tue, 2008-05-20 at 15:08 -0700, Dave Olson wrote: > | > Ralph Campbell will submit this patch for ofed 1.3.1, also. > | > > | > IB/MAD - fix crash when HCA returns IB_MAD_RESULT_SUCCESS|IB_MAD_RESULT_CONSUMED > | > > | > This was observed with the hw/ipath driver, but could happen with any > | > driver. > | > | Does this also occur with mthca/mlx4 ? Was the same thing that caused > | this on ipath tried with either of these HCAs ? > > No, it doesn't happen on mthca/mlx4, but presumably those drivers never > return this value. Yes, we tried it with them. No problem is > introduced on those by the change, either. > > | > It's OFED bug 1027. > | > | What's the port disable command which causes this crash ? > > It was with the QLogic QuickSilver iba_portdisable > and iba_portenable commands (they've been ported to work with OFED 1.3). > > ibportstate refuses to work on non-switch nodes, Right. > although in the past > we've done local modifications to allow it to work on HCAs as well > (we never submitted that change, because we assumed somebody explicitly > didn't want HCAs targeted). Yes, it was a conscious choice discussed on the list. It's so people don't shoot themselves in the foot, as once a port is disabled it might not be so easy to enable it, depending on the configuration (out-of-band access might be needed to re-enable it). The only configurations where that would be limiting would be CA<->CA and CA<->router without intervening switches. -- Hal > Dave Olson > dave.olson at qlogic.com From Thomas.Talpey at netapp.com Tue May 20 19:18:20 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 20 May 2008 22:18:20 -0400 Subject: Selective signalling vs WQE size - was Re: [ofa-general] mthca max_sge value... ugh. In-Reply-To: References: <1211260229.6556.18.camel@eli-laptop> Message-ID: At 05:02 PM 5/20/2008, Roland Dreier wrote: > > > The idea is that for selective > > signaling QPs, before you stamp the WQE, you read the value of the DS > > field which denotes the effective size of the descriptor as used in the > > previous post, and stamp only that area, relying on the fact that the > > rest of the descriptor is already stamped. > >Does it make a measurable difference? If so then it seems like a good idea. I'll be happy to try it, but I bet it'll be hard to measure the difference with a storage workload. It sounds like a bit of a micro-optimization at the HCA interface, avoiding a few DMA cycles? I didn't see a URL so let me know if so. Tom.
From ogerlitz at voltaire.com Tue May 20 23:13:14 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 21 May 2008 09:13:14 +0300 Subject: [ofa-general] Re: Current list of Linux maintainers and their email info In-Reply-To: References: <4831F965.6060607@opengridcomputing.com> <48326A44.2080606@voltaire.com> <48328489.2030305@mellanox.co.il> <4832954C.2080209@voltaire.com> Message-ID: <4833BD7A.5030303@voltaire.com> Woodruff, Robert J wrote: > Since this is a separate open source project in sourceforge, > and not developed in OFA/OFED, perhaps we do not need this > in our list of maintainers. > Woody, The bonding driver is a kernel module maintained by Jay V. The ib-bonding package provided both this module and some enhancements to network configuration tools to support bonding of ipoib devices. Or. From ogerlitz at voltaire.com Tue May 20 23:16:19 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 21 May 2008 09:16:19 +0300 Subject: [ofa-general] Current list of Linux maintainers and their email info In-Reply-To: <000001c8baac$93e26c50$0d59180a@amr.corp.intel.com> References: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> <48326942.7080800@voltaire.com> <000001c8baac$93e26c50$0d59180a@amr.corp.intel.com> Message-ID: <4833BE33.7000504@voltaire.com> Sean Hefty wrote: > OFED needs a separate list of maintainers Does this refer to the issue that the kernel IB maintainers can't be accountable for the OFED IB kernel code since it includes patches which were never reviewed nor merged upstream, as Roland noted in his comment (below)?! Or > > -----Original Message----- > From: Roland Dreier [mailto:rdreier at cisco.com] > Sent: Thursday, May 15, 2008 03:02 > To: Or Gerlitz > Cc: Sean Hefty; general at lists.openfabrics.org > Subject: Re: [ofa-general] Re: the so many IPoIB-UD failures > introduced by OFED 1.3 > > > Maybe its about time for the Linux IB maintainers to get a little > angry?! > > I'm not angry about it, although I have pretty much given up on trying > to debug IPoIB issues seen running anything other than an upstream > kernel. It seems like the OFED maintainers, the enterprise distros and > their customers should be more concerned about the failure of the OFED > process -- clearly producing something much buggier and less reliable > than the stock kernel is not what anyone wants. > > - R. From olaf.kirch at oracle.com Tue May 20 23:37:54 2008 From: olaf.kirch at oracle.com (Olaf Kirch) Date: Wed, 21 May 2008 08:37:54 +0200 Subject: [ofa-general] RDS flow control In-Reply-To: References: <200805121157.38135.jon@opengridcomputing.com> <1211318918.3949.268.camel@brick.pathscale.com> Message-ID: <200805210837.55051.olaf.kirch@oracle.com> On Tuesday 20 May 2008 23:33:40 Roland Dreier wrote: > The IB spec doesn't allow changing RNR retry on RTS to RTS transitions. > Probably because synchronizing the change with in-flight send requests > (that might be doing RNR handling at that moment) is too much of a mess. I tried modifying the RNR retry count before transitioning to RTS (while the QP is still in RTR state), but that failed with EINVAL. Shouldn't it be possible to do that? Anyway, when I take the brief detour through SQD state, resetting the RNR retry count seems to work. 
Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From okir at lst.de Tue May 20 23:49:49 2008 From: okir at lst.de (Olaf Kirch) Date: Wed, 21 May 2008 08:49:49 +0200 Subject: [ofa-general] RDS flow control In-Reply-To: <200805210837.55051.olaf.kirch@oracle.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805210837.55051.olaf.kirch@oracle.com> Message-ID: <200805210849.50582.okir@lst.de> On Wednesday 21 May 2008 08:37:54 Olaf Kirch wrote: > Anyway, when I take the brief detour through SQD state, resetting the RNR > retry count seems to work. For those willing to test this, I committed the following patch to branch future-20080519 on my git tree. Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ----- commit 735bdc95be33db4f21051c0d50090bc128719d98 Author: Olaf Kirch Date: Tue May 20 22:41:04 2008 -0700 RDS: disable RNR retries when flow control is on When flow control is enabled on a connection, we don't need RNR retries. Turning them off allows us to detect potential bugs in the credit accounting more quickly. Signed-off-by: Olaf Kirch diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index 20c888d..a49e394 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -99,21 +99,40 @@ static void rds_ib_connect_complete(struct rds_connection *conn, struct rdma_cm_ rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 1); - /* Tune the RNR timeout. We use a rather low timeout, but - * not the absolute minimum - this should be tunable. + /* Tune RNR behavior. Without flow control, we use a rather + * low timeout, but not the absolute minimum - this should + * be tunable. * * We already set the RNR retry count to 7 (which is the - * smallest infinite number :-) above + * smallest infinite number :-) above. + * If flow control is off, we want to change this back to 0 + * so that we learn quickly when our credit accounting is + * buggy. */ - qp_attr.qp_state = IB_QPS_RTS; - qp_attr.min_rnr_timer = IB_RNR_TIMER_000_32; - ret = ib_modify_qp(ic->i_cm_id->qp, &qp_attr, - IB_QP_STATE | IB_QP_MIN_RNR_TIMER); - if (ret) { - printk(KERN_NOTICE "ib_modify_qp(IB_QP_MIN_RNR_TIMER, %u): err=%d\n", - qp_attr.min_rnr_timer, -ret); + if (ic->i_flowctl) { + /* It seems we have to take a brief detour through SQD state + * in order to change the RNR retry count. 
*/ + qp_attr.qp_state = IB_QPS_SQD; + ret = ib_modify_qp(ic->i_cm_id->qp, &qp_attr, IB_QP_STATE); + if (ret) + printk(KERN_NOTICE "ib_modify_qp(IB_QP_STATE, SQD): err=%d\n", -ret); + + qp_attr.rnr_retry = 0; + ret = ib_modify_qp(ic->i_cm_id->qp, &qp_attr, IB_QP_RNR_RETRY); + if (ret) + printk(KERN_NOTICE "ib_modify_qp(IB_QP_RNR_RETRY, 0): err=%d\n", -ret); + } else { + qp_attr.min_rnr_timer = IB_RNR_TIMER_000_32; + ret = ib_modify_qp(ic->i_cm_id->qp, &qp_attr, IB_QP_MIN_RNR_TIMER); + if (ret) + printk(KERN_NOTICE "ib_modify_qp(IB_QP_MIN_RNR_TIMER): err=%d\n", -ret); } + qp_attr.qp_state = IB_QPS_RTS; + ret = ib_modify_qp(ic->i_cm_id->qp, &qp_attr, IB_QP_STATE); + if (ret) + printk(KERN_NOTICE "ib_modify_qp(IB_QP_STATE, RTS): err=%d\n", -ret); + /* update ib_device with this local ipaddr */ rds_ibdev = ib_get_client_data(ic->i_cm_id->device, &rds_ib_client); ib_update_ipaddr_for_device(rds_ibdev, conn->c_laddr); From eli at dev.mellanox.co.il Tue May 20 23:58:56 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 21 May 2008 09:58:56 +0300 Subject: Selective signalling vs WQE size - was Re: [ofa-general] mthca max_sge value... ugh. In-Reply-To: References: <1211260229.6556.18.camel@eli-laptop> Message-ID: <1211353136.31377.5.camel@mtls03> On Tue, 2008-05-20 at 22:18 -0400, Talpey, Thomas wrote: > > > >Does it make a measurable difference? If so then it seems like a good idea. It makes a noticeable difference when the required message rate is high. In those cases you spare the CPU from the need to write to memory, possibly saving cache misses. I saw differences for IPoIB in OFED where we use selective signalling for the UD QP. > > I'll be happy to try it, but I bet it'll be hard to measure the difference > with a storage workload. It sounds like a bit of a micro-optimization > at the HCA interface, avoiding a few DMA cycles? > > I didn't see a URL so let me know if so. > http://lists.openfabrics.org/pipermail/general/2008-January/045071.html From sean.hefty at intel.com Wed May 21 00:39:43 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 21 May 2008 00:39:43 -0700 Subject: [ofa-general] Current list of Linux maintainers and their email info In-Reply-To: <4833BE33.7000504@voltaire.com> References: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> <48326942.7080800@voltaire.com> <000001c8baac$93e26c50$0d59180a@amr.corp.intel.com> <4833BE33.7000504@voltaire.com> Message-ID: <000101c8bb15$d5fe0180$6f248686@amr.corp.intel.com> >> OFED needs a separate list of maintainers >Does this refer to the issue that the kernel IB maintainers can't be >accountable for the OFED IB kernel code since it includes patches which >were never reviewed nor merged upstream, as Roland noted in his comment >(below)?! I thought the userspace libraries differ as well.
From Sumit.Gaur at Sun.COM Wed May 21 00:33:00 2008 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Wed, 21 May 2008 13:03:00 +0530 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <1211287585.12616.568.camel@hrosenstock-ws.xsigo.com> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> <48318575.7060701@Sun.COM> <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> <4831A618.9090806@Sun.COM> <1211213870.12616.475.camel@hrosenstock-ws.xsigo.com> <4831B519.2060002@dev.mellanox.co.il> <1211217595.12616.490.camel@hrosenstock-ws.xsigo.com> <48326C32.7000303@Sun.COM> <1211287585.12616.568.camel@hrosenstock-ws.xsigo.com> Message-ID: <4833D02C.9040205@Sun.COM> Hi Hal/Yevgeny, It looks like I am confusing you rather than asking a clear-cut question, so I am giving you my implementation details in simple steps: 1) I am calling madrpc_init(ca, ca_port, mgmt_classes, 4); for the given ca and ca_port to register the following four classes with the OFED library {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS, IB_PERFORMANCE_CLASS}; 2) After registration I am opening two separate independent threads, one for sending MADs and another for receiving them. 3) The sending thread sends MADs using umad_send(port_id, class_agents[mgtclass], &sndbuf, length, timeout, 0); 4) The receiver thread receives MADs using the mad_receive(0, -1); function. 5) I am sending SMP and GMP packets at a regular time interval and keep receiving responses on the receiver thread properly. But sometimes I receive some extra packets with *unknown tids* (TIDs I have never sent), e.g. Response TID2 = 0x000000006701869b, BaseVersion = 1, MgmtClass=129, ClassVersion=1, R_Method=129, ClassSpecific=1, Status=128, AttributeID=4352 (all decimal representation) Now the question is how I can filter these extra packets. These incoming packets could be responses the SM sends while sweeping the subnet (as pointed out by Yevgeny). Is there any unique MAD field that could be checked for an SM response? If they cannot be filtered, I will change the logic in the application. Thanks and Regards sumit Hal Rosenstock wrote: > On Tue, 2008-05-20 at 11:44 +0530, Sumit Gaur - Sun Microsystem wrote: > >>How we can identify and filter these incoming SM packets in application from >>the regular responses. > > > I'm surprised that it's working this way; that SM responses are getting > into your application as they _should_ have a different transaction ID > per the following. Yes, they have different TIDs. > >>From the kernel Documentation/infiniband/user_mad.txt: > > Transaction IDs > > Users of the umad devices can use the lower 32 bits of the > transaction ID field (that is, the least significant half of the > field in network byte order) in MADs being sent to match > request/response pairs. The upper 32 bits are reserved for use by > the kernel and will be overwritten before a MAD is sent. > > Is the same fd being used by OpenSM and your application somehow or you > are not using OpenSM and your SM overlaps with this ? I am not using OpenSM; I am directly calling the umad libraries.
> > -- Hal > From ogerlitz at voltaire.com Wed May 21 00:41:23 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 21 May 2008 10:41:23 +0300 Subject: [ofa-general] RDS flow control In-Reply-To: References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> Message-ID: <4833D223.5090007@voltaire.com> Roland Dreier wrote: > > Is there a way of changing the RNR retry count back to 0 after establishing > > the connection? > > Yes... quite complicated but possible. Basically you have to transition > the QP to the "send queue drained" (SQD) state, change the rnr retry > value in an SQD->SQD transition and then transition back to RTS. In case the RTS->SQD->SQD->RTS transition is not applicable, or just for the sake of being aware of more solutions, I gave it some thought and it seems possible for you to build a protocol which exchanges (through the private data carried by the CM messages) whether each side supports credit management, and based on that && HW support of the IB_DEVICE_RC_RNR_NAK_GEN device capability decide what value to place into the QP RNR retries. On the passive side of the connection it's trivial, since the rdma-cm uses the values you place into the conn_param parameters of rdma_accept. On the active side, things are a bit more complex, but with some changes, I think you would be able to do it also in a different way than the SQD one: the RNR retries are set into the QP once it is moved to RTS (Ready-To-Send). So, if you managed to get the QP into your hands --before-- the RTU is sent (since this point in time is the last synchronization step provided to you by the IB CM), you could set the RNR retries value according to info carried in the REP message sent by the passive side (which it posted in the private data to rdma_accept, etc). This would be possible if you enhance the rdma-cm to deliver the RDMA_CM_EVENT_CONNECT_RESPONSE event also to IDs created with the PS_TCP port space (eg conditioned on some new field in conn_param), where today it is supported only for PS_SDP ones. Once this change is in place, you will get the RDMA_CM_EVENT_CONNECT_RESPONSE event, decide what RNR retry value you want to use, and call rdma_accept providing this value (one more little change is needed here in cma.c); the rdma cm would override the value set by cm_init_qp_rts_attr, see cma_modify_qp_rts -> rdma_init_qp_attr -> ib_cm_init_qp_attr -> cm_init_qp_rts_attr, and you are done... Or. From ogerlitz at voltaire.com Wed May 21 00:48:42 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 21 May 2008 10:48:42 +0300 Subject: [ofa-general] RDS flow control In-Reply-To: <4833D223.5090007@voltaire.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> <4833D223.5090007@voltaire.com> Message-ID: <4833D3DA.4040106@voltaire.com> Or Gerlitz wrote: > HW support of the IB_DEVICE_RC_RNR_NAK_GEN device capability I see now that only the mlx4, mthca and ehca drivers advertise this capability, but ipath doesn't. Ralph, was it just forgotten, or do you guys really not support this? Or.
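On the passive side the hook Or mentions already exists: struct rdma_conn_param carries an rnr_retry_count that rdma_accept() applies when moving the QP to RTS. A rough sketch, assuming the peer advertised its flow control capability in the REQ private data (struct my_priv_data and MY_FLAG_FLOWCTL are made up for illustration):

    static int my_accept(struct rdma_cm_id *id, struct my_priv_data *req_priv)
    {
            struct rdma_conn_param conn_param;

            memset(&conn_param, 0, sizeof conn_param);
            conn_param.responder_resources = 1;
            conn_param.initiator_depth = 1;
            /* no RNR retries if both sides do credit-based flow control,
             * otherwise keep the "smallest infinite number" */
            conn_param.rnr_retry_count =
                    (req_priv->flags & MY_FLAG_FLOWCTL) ? 0 : 7;

            return rdma_accept(id, &conn_param);
    }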
From ogerlitz at voltaire.com Wed May 21 02:08:05 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 21 May 2008 12:08:05 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4832D4DC.2040006@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <48301F74.4020905@voltaire.com> <4831347A.1010506@voltaire.com> <4831A0DF.2070603@opengridcomputing.com> <48327198.7080305@voltaire.com> <4832D4DC.2040006@opengridcomputing.com> Message-ID: <4833E675.5050500@voltaire.com> Steve Wise wrote: > My point is that if you do the mapping at allocation time, then the > failure will happen when you allocate the page list vs when you post > the send WR. Maybe it doesn't matter, but the idea, I think, is to > not fail post_send for lack of resources. Everything should be > pre-allocated pretty much by the time you post work requests... Fair enough. I understand we are requiring that a page list can be reused without being freed; just make sure it's documented. Or. From ogerlitz at voltaire.com Wed May 21 02:24:59 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 21 May 2008 12:24:59 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4832D850.2010102@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> Message-ID: <4833EA6B.9000705@voltaire.com> Steve Wise wrote: >>> Support for the IB BMME and iWARP equivalent memory extensions ... >>> Usage Model: >>> - MR allocated with ib_alloc_mr() >>> - Page lists allocated via ib_alloc_fast_reg_page_list(). >>> - MR made VALID & bound to a specific page list via >>> ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via >>> ib_post_send(IB_WR_INVALIDATE_MR) >> AFAIK, the idea was to let the ulp post --two-- work requests, where >> the first creates the mapping and the second sends this mapping to >> the remote side, such that the second does not start before the first >> completes (i.e a fence). >> >> Now, the above scheme means that the ulp knows the value of the >> rkey/stag at the time of posting these two work requests (since it >> has to encode it in the second one), so something has to be clarified >> re the rkey/stag here, do they change each time this MR is used? how >> many bits can be changed, etc. > > The ULP knows the rkey/stag because it's returned up front in the > ib_alloc_fast_reg_mr(). And it doesn't change (ignoring the key issue > which we haven't exposed yet to the ULP). The same rkey/stag can be > used for multiple mappings. It can be made invalid at any point in > time via the IB_WR_INVALIDATE_MR so the fact that you're leaving the > same rkey/stag advertised is not a risk. I understand that this (the same rkey/stag used for all mappings produced for a specific mr) is what you are proposing, but I still think there's a chance that by the spec and (not less important!) by existing HW support, it's possible to have a different rkey/stag per mapping done on an mr. For example, the IB spec uses a "consumer owned key portion of the L_Key" notation which makes me think there should be a way to have a different rkey per mapping, Roland? Dror?
> 10.7.2.6 FAST REGISTER PHYSICAL MR > The Fast Register Physical MR Operation is allowed on Non-Shared > Physical Memory Regions that were created with a Consumer owned key > portion of the L_Key, and any associated R_Key Or From ogerlitz at voltaire.com Wed May 21 02:33:15 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 21 May 2008 12:33:15 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4832D850.2010102@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> Message-ID: <4833EC5B.8070504@voltaire.com> Steve Wise wrote: > So you allocate the rkey/stag up front, allocate page_lists up front, > then as needed you populate your page list and bind it to the > rkey/stag via IB_WR_FAST_REG_MR, and invalidate that mapping via > IB_WR_INVALIDATE_MR. You can do this any number of times, and with > proper fencing, you can pipeline these mappings. Eventually when > you're done doing IO (like for NFSRDMA when the mount is unmounted) > you free up the page list(s) and mr/rkey/stag. Yes, that was my thought as well. Just to make sure: by "proper fencing", is your understanding that for both IB and iWARP the ULP need not wait for the fast-reg work request to complete, but can instead post the send work request carrying the rkey/stag with the IB_SEND_FENCE flag? Looking in the IB spec, it seems that the fence indicator only applies to previous RDMA Read / Atomic operations; eg in section 11.4.1.1 POST SEND REQUEST it says: > Fence indicator. If the fence indicator is set, then all prior RDMA > Read and Atomic Work Requests on the queue must be completed before > starting to process this Work Request. >> Talking on usage, do you plan to patch the mainline nfs-rdma code to >> use these verbs? > Yes. Tom Tucker will be doing this. Jon Mason is implementing RDS > changes to utilize this too. The hope is all this makes 2.6.27/ofed-1.4. > > I can also post test code (krping module) if anyone is interested. > I'm developing that now. > Posting this code would be very much helpful (also to the discussion, I think), thanks. Or.
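To make the usage model concrete, a rough sketch of the pipeline Steve describes. The verb names (ib_alloc_mr, ib_alloc_fast_reg_page_list, IB_WR_FAST_REG_MR, IB_WR_INVALIDATE_MR) are those of this RFC and the wr.fast_reg field layout is assumed from the patch set; pd, qp, npages, io_addr and io_len are placeholders, so details may well differ from the final version:

    /* Setup, once per mount/connection: fixed rkey plus a reusable page list. */
    struct ib_mr *mr = ib_alloc_mr(pd, MAX_FR_PAGES);
    struct ib_fast_reg_page_list *pl =
            ib_alloc_fast_reg_page_list(pd->device, MAX_FR_PAGES);

    /* Per I/O: bind the current buffer to the (unchanged) rkey/stag. */
    struct ib_send_wr frwr, *bad_wr;

    memset(&frwr, 0, sizeof frwr);
    frwr.opcode = IB_WR_FAST_REG_MR;
    frwr.wr.fast_reg.page_list = pl;        /* pl->page_list[] filled with DMA addresses */
    frwr.wr.fast_reg.page_list_len = npages;
    frwr.wr.fast_reg.iova_start = io_addr;
    frwr.wr.fast_reg.length = io_len;
    frwr.wr.fast_reg.access_flags = IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE;
    if (ib_post_send(qp, &frwr, &bad_wr))
            goto fail;

    /* ... advertise mr->rkey to the peer and perform the I/O; whether the
     * send advertising the rkey needs IB_SEND_FENCE is exactly the open
     * question raised above ... */

    /* When the I/O completes, invalidate the mapping so the rkey can be reused. */
    struct ib_send_wr inv;
    memset(&inv, 0, sizeof inv);
    inv.opcode = IB_WR_INVALIDATE_MR;
    /* how the MR to invalidate is identified is per the RFC, omitted here */
    if (ib_post_send(qp, &inv, &bad_wr))
            goto fail;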
From vlad at lists.openfabrics.org Wed May 21 03:10:51 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 21 May 2008 03:10:51 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20080521-0200 daily build status Message-ID: <20080521101051.8307EE60E41@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.22 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Passed on ppc64 with linux-2.6.19 Failed: From tziporet at dev.mellanox.co.il Wed May 21 03:31:55 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 21 May 2008 13:31:55 +0300 Subject: [ofa-general] RDS flow control In-Reply-To: References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> Message-ID: <4833FA1B.4010707@mellanox.co.il> Roland Dreier wrote: > > Is there a way of changing the RNR retry count back to 0 after establishing > > the connection? > > Yes... quite complicated but possible. 
Basically you have to transition > the QP to the "send queue drained" (SQD) state, change the rnr retry > value in an SQD->SQD transition and then transition back to RTS. Not > sure if anyone has ever tested that whole operation, so it may or may > not actually work without driver/fw fixes required. > > > SQD is not implemented in ConnectX for now. Tziporet From hrosenstock at xsigo.com Wed May 21 04:29:22 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 21 May 2008 04:29:22 -0700 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <4833D02C.9040205@Sun.COM> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> <48318575.7060701@Sun.COM> <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> <4831A618.9090806@Sun.COM> <1211213870.12616.475.camel@hrosenstock-ws.xsigo.com> <4831B519.2060002@dev.mellanox.co.il> <1211217595.12616.490.camel@hrosenstock-ws.xsigo.com> <48326C32.7000303@Sun.COM> <1211287585.12616.568.camel@hrosenstock-ws.xsigo.com> <4833D02C.9040205@Sun.COM> Message-ID: <1211369362.18236.78.camel@hrosenstock-ws.xsigo.com> Sumit, On Wed, 2008-05-21 at 13:03 +0530, Sumit Gaur - Sun Microsystem wrote: > Hi Hal/Yevgeny, > It looks like I am confusing you rather than asking a clear-cut question, so I am > giving you my implementation details in simple steps: > > 1) I am calling madrpc_init(ca, ca_port, mgmt_classes, 4); for the given ca and > ca_port to register the following four classes with the OFED library > > {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS, IB_PERFORMANCE_CLASS}; > > 2) After registration I am opening two separate independent threads, one for > sending MADs and another for receiving them. > > 3) The sending thread sends MADs using > > umad_send(port_id, class_agents[mgtclass], &sndbuf, length, timeout, 0); > > 4) The receiver thread receives MADs using the mad_receive(0, -1); function. > > 5) I am sending SMP and GMP packets at a regular time interval and keep receiving > responses on the receiver thread properly. But sometimes I receive some extra > packets with *unknown tids* (TIDs I have never sent), e.g. > > Response TID2 = 0x000000006701869b, BaseVersion = 1, MgmtClass=129, > ClassVersion=1, R_Method=129, ClassSpecific=1, Status=128, AttributeID=4352 > (all decimal representation) > > Now the question is how I can filter these extra packets. These incoming > packets could be responses the SM sends while sweeping the subnet (as pointed out by > Yevgeny). Is there any unique MAD field that could be checked for an SM response? > If they cannot be filtered, I will change the logic in the application. Not knowing what SMPs you are sending in your application, it's hard to be more specific. You can filter on class and attribute ID (assuming the attribute IDs per class are distinct). Another approach would be to filter on transaction ID, as the upper 32 bits should be different (on a per underlying fd basis). This approach is simpler and does not rely on non-overlapping subsets of attributes. I was also trying to say that I'm not sure you should be seeing these packets (and I don't think your application should need to do this). The current filtering appears to not be working for some unknown reason in your environment. Hopefully this makes more sense. Sorry for all the confusion.
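A sketch of the transaction ID filter suggested above, assuming the application records the 32-bit cookies it placed in the TIDs of its outstanding requests in a (hypothetical) sent_tids table; per the user_mad.txt text quoted earlier, the sender owns only the lower 32 bits of the TID, i.e. bytes 12-15 of the MAD header in network byte order:

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>

    #define MAX_OUTSTANDING 64

    static uint32_t sent_tids[MAX_OUTSTANDING]; /* cookies of requests in flight */

    /* Return 1 if this received MAD answers one of our requests. */
    static int is_our_response(const uint8_t *mad)
    {
            uint32_t lo;
            int i;

            /* The TID occupies bytes 8..15 of the MAD header; the kernel
             * overwrites the upper 32 bits, the application owns the rest. */
            memcpy(&lo, mad + 12, sizeof lo);
            lo = ntohl(lo);

            for (i = 0; i < MAX_OUTSTANDING; i++)
                    if (sent_tids[i] == lo)
                            return 1;
            return 0; /* e.g. a response provoked by an SM sweep - drop it */
    }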
-- Hal > Thanks and Regards > sumit > > > > Hal Rosenstock wrote: > > On Tue, 2008-05-20 at 11:44 +0530, Sumit Gaur - Sun Microsystem wrote: > > > >>How we can identify and filter these incoming SM packets in application from > >>the regular responses. > > > > > > I'm surprised that it's working this way; that SM responses are getting > > into your application as they _should_ have a different transaction ID > > per the following. > yes they have different TID. > > > >>From the kernel Documentation/infiniband/user_mad.txt: > > > > Transaction IDs > > > > Users of the umad devices can use the lower 32 bits of the > > transaction ID field (that is, the least significant half of the > > field in network byte order) in MADs being sent to match > > request/response pairs. The upper 32 bits are reserved for use by > > the kernel and will be overwritten before a MAD is sent. > > > > Is the same fd being used by OpenSM and your application somehow or you > > are not using OpenSM and your SM overlaps with this ? > I am not using OpenSM, I am directing calling umad libraries. > > > > -- Hal > > From tziporet at mellanox.co.il Wed May 21 04:30:28 2008 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Wed, 21 May 2008 14:30:28 +0300 Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c whenIB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: <1211322487.3949.275.camel@brick.pathscale.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C90411217E@mtlexch01.mtl.com> Ralph, Can you provide the patch to OFED 1.3.1 today so we will be able to include it in RC1? Thanks Tziporet -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Ralph Campbell Sent: Wednesday, May 21, 2008 1:28 AM To: Roland Dreier Cc: Dave Olson; general at lists.openfabrics.org Subject: Re: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c whenIB_MAD_RESULT_(SUCCESS|CONSUMED) returned On Tue, 2008-05-20 at 15:17 -0700, Roland Dreier wrote: > > case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: > > kmem_cache_free(ib_mad_cache, mad_priv); > > - break; > > + kfree(local); > > + goto out; > > Seems you need to set ret = 1 here? Otherwise I think ib_post_send_mad > will continue handling the send even though the packet was supposedly > consumed. I agree. > Also as a side note, I think handle_outgoing_dr_smp() would be clearer > if rather than having > > out: > return ret; > > and then doing stuff like > > ret = -EINVAL; > goto out; > > the code just did "return -EINVAL;" > > Maybe I'll do that cleanup for 2.6.27. I also agree but I remember at one point we got pushback from one of the mainline kernel developers who really wanted to see only one return point in the code even if it meant more gotos. I don't remember who though. 
From ogerlitz at voltaire.com Wed May 21 05:43:11 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 21 May 2008 15:43:11 +0300 Subject: [ofa-general] RE: [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs In-Reply-To: <000101c8ba9b$4ca68da0$a3d8180a@amr.corp.intel.com> References: <000001c8b6aa$a5418810$bd59180a@amr.corp.intel.com> <482C68A4.9020305@opengridcomputing.com> <4831649A.2020206@voltaire.com> <000101c8b9f6$c4dd4cf0$bd59180a@amr.corp.intel.com> <4832BFAC.2050506@voltaire.com> <000101c8ba9b$4ca68da0$a3d8180a@amr.corp.intel.com> Message-ID: <483418DF.1080707@voltaire.com> Sean Hefty wrote: > I was only thinking of the kernel interfaces, but I don't see that this really > changes the ABI. An existing library continues to work unmodified. (Is this > that different than adding a new return value from a call?) If there really is > an issue, then the rdma_ucm can toss the event. Yes, I agree that the ABI shouldn't be changed on every new return code or event added, so we can deliver the new event and existing apps should ignore it (and if a real issue is found, we can block it at the rdma_ucm). Or From ossrosch at linux.vnet.ibm.com Wed May 21 05:58:55 2008 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Wed, 21 May 2008 14:58:55 +0200 Subject: [ofa-general] [PATCH ofed-1.3.1 0/2] IB/ehca: Misc fixes Message-ID: <200805211458.56384.ossrosch@linux.vnet.ibm.com> Hi Vlad! I'm sending you a patch set for ehca to be included in ofed-1.3.1. These patches are based on OFED-1.3.1-rc1 and are already included in the kernel mainline. 1/2 IB/ehca: Fix function return types 2/2 IB/ehca: Wait for async events to finish before destroying QP They should apply cleanly against Vlad's git tree. Please accept them if they are ok.
Thanks Stefan From ossrosch at linux.vnet.ibm.com Wed May 21 05:59:18 2008 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Wed, 21 May 2008 14:59:18 +0200 Subject: [ofa-general] [PATCH ofed-1.3.1 1/2] IB/ehca: Fix function return types Message-ID: <200805211459.19594.ossrosch@linux.vnet.ibm.com> Signed-off-by: Stefan Roscher --- ehca_0041_Fix_wrong_return_types.patch | 38 +++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff -Nurp ofa_kernel-1.3_old/kernel_patches/fixes/ehca_0041_Fix_wrong_return_types.patch ofa_kernel-1.3_new/kernel_patches/fixes/ehca_0041_Fix_wrong_return_types.patch --- ofa_kernel-1.3_old/kernel_patches/fixes/ehca_0041_Fix_wrong_return_types.patch 1970-01-01 01:00:00.000000000 +0100 +++ ofa_kernel-1.3_new/kernel_patches/fixes/ehca_0041_Fix_wrong_return_types.patch 2008-05-21 14:39:20.000000000 +0200 @@ -0,0 +1,38 @@ +diff -Nurp a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c +--- a/drivers/infiniband/hw/ehca/ehca_hca.c 2008-05-21 13:54:31.000000000 +0200 ++++ b/drivers/infiniband/hw/ehca/ehca_hca.c 2008-05-21 14:35:25.000000000 +0200 +@@ -101,7 +101,6 @@ int ehca_query_device(struct ib_device * + props->max_ee = limit_uint(rblock->max_rd_ee_context); + props->max_rdd = limit_uint(rblock->max_rd_domain); + props->max_fmr = limit_uint(rblock->max_mr); +- props->local_ca_ack_delay = limit_uint(rblock->local_ca_ack_delay); + props->max_qp_rd_atom = limit_uint(rblock->max_rr_qp); + props->max_ee_rd_atom = limit_uint(rblock->max_rr_ee_context); + props->max_res_rd_atom = limit_uint(rblock->max_rr_hca); +@@ -115,7 +114,7 @@ int ehca_query_device(struct ib_device * + } + + props->max_pkeys = 16; +- props->local_ca_ack_delay = limit_uint(rblock->local_ca_ack_delay); ++ props->local_ca_ack_delay = min_t(u8, rblock->local_ca_ack_delay, 255); + props->max_raw_ipv6_qp = limit_uint(rblock->max_raw_ipv6_qp); + props->max_raw_ethy_qp = limit_uint(rblock->max_raw_ethy_qp); + props->max_mcast_grp = limit_uint(rblock->max_mcast_grp); +@@ -136,7 +135,7 @@ query_device1: + return ret; + } + +-static int map_mtu(struct ehca_shca *shca, u32 fw_mtu) ++static enum ib_mtu map_mtu(struct ehca_shca *shca, u32 fw_mtu) + { + switch (fw_mtu) { + case 0x1: +@@ -156,7 +155,7 @@ static int map_mtu(struct ehca_shca *shc + } + } + +-static int map_number_of_vls(struct ehca_shca *shca, u32 vl_cap) ++static u8 map_number_of_vls(struct ehca_shca *shca, u32 vl_cap) + { + switch (vl_cap) { + case 0x1: From eli at mellanox.co.il Wed May 21 05:59:29 2008 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 21 May 2008 15:59:29 +0300 Subject: [ofa-general] [PATCH] IB/mlx4: Optimize stamping for selective signalling QPs Message-ID: <1211374769.6577.21.camel@eli-laptop> >From e6b956c2233669fc21ba1565fbcf78ce2cd186b7 Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Wed, 21 May 2008 15:55:51 +0300 Subject: [PATCH] IB/mlx4: Optimize stamping for selective signalling QPs The idea is that for selective signaling QPs, before stamping the WQE, you read the value of the DS field which denotes the effective size of the descriptor as used in the previous post, and stamp only that area, relying on the fact that the rest of the descriptor is already stamped. Signed-off-by: Eli Cohen --- This version cleanly applies on the head of the "for-2.6.26" branch. 
drivers/infiniband/hw/mlx4/qp.c | 6 ++++-- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index cec030e..b4d25c2 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -129,9 +129,10 @@ static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size) int ind; void *buf; __be32 stamp; + struct mlx4_wqe_ctrl_seg *ctrl; - s = roundup(size, 1U << qp->sq.wqe_shift); if (qp->sq_max_wqes_per_wr > 1) { + s = roundup(size, 1U << qp->sq.wqe_shift); for (i = 0; i < s; i += 64) { ind = (i >> qp->sq.wqe_shift) + n; stamp = ind & qp->sq.wqe_cnt ? cpu_to_be32(0x7fffffff) : @@ -141,7 +142,8 @@ static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size) *wqe = stamp; } } else { - buf = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1)); + ctrl = buf = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1)); + s = (ctrl->fence_size & 0x3f) << 4; for (i = 64; i < s; i += 64) { wqe = buf + i; *wqe = cpu_to_be32(0xffffffff); -- 1.5.5.1 From ossrosch at linux.vnet.ibm.com Wed May 21 05:59:53 2008 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Wed, 21 May 2008 14:59:53 +0200 Subject: [ofa-general] [PATCH ofed-1.3.1 2/2] IB/ehca: Wait for async events to finish before destroying QP Message-ID: <200805211459.55844.ossrosch@linux.vnet.ibm.com> Signed-off-by: Stefan Roscher --- ehca_0042_Count_async_events_for_EQs.patch | 55 +++++++++++++++++++++++++++++ 1 file changed, 55 insertions(+) diff -Nurp ofa_kernel-1.3_old/kernel_patches/fixes/ehca_0042_Count_async_events_for_EQs.patch ofa_kernel-1.3_new/kernel_patches/fixes/ehca_0042_Count_async_events_for_EQs.patch --- ofa_kernel-1.3_old/kernel_patches/fixes/ehca_0042_Count_async_events_for_EQs.patch 1970-01-01 01:00:00.000000000 +0100 +++ ofa_kernel-1.3_new/kernel_patches/fixes/ehca_0042_Count_async_events_for_EQs.patch 2008-05-21 14:41:20.000000000 +0200 @@ -0,0 +1,55 @@ +diff -Nurp a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h +--- a/drivers/infiniband/hw/ehca/ehca_classes.h 2008-05-21 13:54:31.000000000 +0200 ++++ b/drivers/infiniband/hw/ehca/ehca_classes.h 2008-05-21 14:35:25.000000000 +0200 +@@ -192,6 +192,8 @@ struct ehca_qp { + int mtu_shift; + u32 message_count; + u32 packet_count; ++ atomic_t nr_events; /* events seen */ ++ wait_queue_head_t wait_completion; + }; + + #define IS_SRQ(qp) (qp->ext_type == EQPT_SRQ) +diff -Nurp a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c +--- a/drivers/infiniband/hw/ehca/ehca_irq.c 2008-05-21 13:54:31.000000000 +0200 ++++ b/drivers/infiniband/hw/ehca/ehca_irq.c 2008-05-21 14:35:25.000000000 +0200 +@@ -204,6 +204,8 @@ static void qp_event_callback(struct ehc + + read_lock(&ehca_qp_idr_lock); + qp = idr_find(&ehca_qp_idr, token); ++ if (qp) ++ atomic_inc(&qp->nr_events); + read_unlock(&ehca_qp_idr_lock); + + if (!qp) +@@ -223,6 +225,8 @@ static void qp_event_callback(struct ehc + if (fatal && qp->ext_type == EQPT_SRQBASE) + dispatch_qp_event(shca, qp, IB_EVENT_QP_LAST_WQE_REACHED); + ++ if (atomic_dec_and_test(&qp->nr_events)) ++ wake_up(&qp->wait_completion); + return; + } + +diff -Nurp a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c +--- a/drivers/infiniband/hw/ehca/ehca_qp.c 2008-05-21 13:54:31.000000000 +0200 ++++ b/drivers/infiniband/hw/ehca/ehca_qp.c 2008-05-21 14:35:25.000000000 +0200 +@@ -561,6 +566,8 @@ static struct ehca_qp *internal_create_q + return ERR_PTR(-ENOMEM); + } + ++ atomic_set(&my_qp->nr_events, 
0); ++ init_waitqueue_head(&my_qp->wait_completion); + spin_lock_init(&my_qp->spinlock_s); + spin_lock_init(&my_qp->spinlock_r); + my_qp->qp_type = qp_type; +@@ -1929,6 +1936,9 @@ static int internal_destroy_qp(struct ib + idr_remove(&ehca_qp_idr, my_qp->token); + write_unlock_irqrestore(&ehca_qp_idr_lock, flags); + ++ /* now wait until all pending events have completed */ ++ wait_event(my_qp->wait_completion, !atomic_read(&my_qp->nr_events)); ++ + h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); + if (h_ret != H_SUCCESS) { + ehca_err(dev, "hipz_h_destroy_qp() failed h_ret=%li " From richard.frank at oracle.com Wed May 21 06:21:05 2008 From: richard.frank at oracle.com (Richard Frank) Date: Wed, 21 May 2008 09:21:05 -0400 Subject: [ofa-general] RDS flow control In-Reply-To: <200805202313.40213.olaf.kirch@oracle.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> Message-ID: <483421C1.2090000@oracle.com> From Oracle's perspective I think you can punt on the cluster rolling upgrade aspect of this. It's likely that Oracle will be running a newer version of RDS with flow control - or all nodes will be on an older version. As long as different versions of drivers interact without crashing the node - and report a mismatch in protocol - we're probably OK. Assuming the flow control turns out not to impact performance, we should just remove the old RNR code. If we're ready, I'll give this version of the driver to our performance folks - hopefully they can give it a bash in the next week or so. Olaf Kirch wrote: > On Tuesday 20 May 2008 22:45:22 Jon Mason wrote: > >> Works well on my setup. >> > > Good to hear! > > >> With proper flow control, there should no longer be a need for rnr_retry (as there >> should always be a posted recv buffer waiting for the incoming data). I did a >> quick test and removed it and everything seemed to be happy on my rds-stress run. >> > > I would like to make the setting of the RNR retry/timeout conditional on > whether both ends of the connection support flow control or not - we need > to think of rolling upgrades of a cluster, so mixed environments just have > to work. Unfortunately, the RNR retry count is set prior to establishing > the connection, before we even know whether the remote is capable of doing > flow control. > > Is there a way of changing the RNR retry count back to 0 after establishing > the connection?
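For reference, the spec-defined way to change the RNR retry count after the fact is a round trip through the SQD state, since IB_QP_RNR_RETRY is only accepted on the RTR->RTS transition or around SQD - which is also what the ConnectX/SQD exchange later in this thread is about. A rough sketch, with error handling and the wait for the SQ-drained event omitted (exact attribute masks follow the spec's SQD transition tables):

	struct ib_qp_attr attr;

	/* 1. Drain the send queue; wait for IB_EVENT_SQ_DRAINED. */
	attr.qp_state = IB_QPS_SQD;
	ib_modify_qp(qp, &attr, IB_QP_STATE);

	/* 2. Rewrite the RNR retry count while sitting in SQD. */
	attr.qp_state  = IB_QPS_SQD;
	attr.rnr_retry = 0;			/* stop retrying on RNR NAK */
	ib_modify_qp(qp, &attr, IB_QP_STATE | IB_QP_RNR_RETRY);

	/* 3. Resume sending. */
	attr.qp_state = IB_QPS_RTS;
	ib_modify_qp(qp, &attr, IB_QP_STATE);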
> > Olaf > From vlad at dev.mellanox.co.il Wed May 21 06:28:18 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 21 May 2008 16:28:18 +0300 Subject: [ofa-general] Re: [ewg] [PATCH ofed-1.3.1 1/2] IB/ehca: Fix function return types In-Reply-To: <200805211459.19594.ossrosch@linux.vnet.ibm.com> References: <200805211459.19594.ossrosch@linux.vnet.ibm.com> Message-ID: <48342372.4000109@dev.mellanox.co.il> Stefan Roscher wrote: > Signed-off-by: Stefan Roscher > --- > ehca_0041_Fix_wrong_return_types.patch | 38 +++++++++++++++++++++++++++++++++ > 1 file changed, 38 insertions(+) > > diff -Nurp ofa_kernel-1.3_old/kernel_patches/fixes/ehca_0041_Fix_wrong_return_types.patch ofa_kernel-1.3_new/kernel_patches/fixes/ehca_0041_Fix_wrong_return_types.patch > --- ofa_kernel-1.3_old/kernel_patches/fixes/ehca_0041_Fix_wrong_return_types.patch 1970-01-01 01:00:00.000000000 +0100 > +++ ofa_kernel-1.3_new/kernel_patches/fixes/ehca_0041_Fix_wrong_return_types.patch 2008-05-21 14:39:20.000000000 +0200 > @@ -0,0 +1,38 @@ > +diff -Nurp a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c > +--- a/drivers/infiniband/hw/ehca/ehca_hca.c 2008-05-21 13:54:31.000000000 +0200 > ++++ b/drivers/infiniband/hw/ehca/ehca_hca.c 2008-05-21 14:35:25.000000000 +0200 > +@@ -101,7 +101,6 @@ int ehca_query_device(struct ib_device * > + props->max_ee = limit_uint(rblock->max_rd_ee_context); > + props->max_rdd = limit_uint(rblock->max_rd_domain); > + props->max_fmr = limit_uint(rblock->max_mr); > +- props->local_ca_ack_delay = limit_uint(rblock->local_ca_ack_delay); > + props->max_qp_rd_atom = limit_uint(rblock->max_rr_qp); > + props->max_ee_rd_atom = limit_uint(rblock->max_rr_ee_context); > + props->max_res_rd_atom = limit_uint(rblock->max_rr_hca); > +@@ -115,7 +114,7 @@ int ehca_query_device(struct ib_device * > + } > + > + props->max_pkeys = 16; > +- props->local_ca_ack_delay = limit_uint(rblock->local_ca_ack_delay); > ++ props->local_ca_ack_delay = min_t(u8, rblock->local_ca_ack_delay, 255); > + props->max_raw_ipv6_qp = limit_uint(rblock->max_raw_ipv6_qp); > + props->max_raw_ethy_qp = limit_uint(rblock->max_raw_ethy_qp); > + props->max_mcast_grp = limit_uint(rblock->max_mcast_grp); > +@@ -136,7 +135,7 @@ query_device1: > + return ret; > + } > + > +-static int map_mtu(struct ehca_shca *shca, u32 fw_mtu) > ++static enum ib_mtu map_mtu(struct ehca_shca *shca, u32 fw_mtu) > + { > + switch (fw_mtu) { > + case 0x1: > +@@ -156,7 +155,7 @@ static int map_mtu(struct ehca_shca *shc > + } > + } > + > +-static int map_number_of_vls(struct ehca_shca *shca, u32 vl_cap) > ++static u8 map_number_of_vls(struct ehca_shca *shca, u32 vl_cap) > + { > + switch (vl_cap) { > + case 0x1: Applied, Regards, Vladimir From vlad at dev.mellanox.co.il Wed May 21 06:28:36 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 21 May 2008 16:28:36 +0300 Subject: [ofa-general] [PATCH ofed-1.3.1 2/2] IB/ehca: Wait for async events to finish before destroying QP In-Reply-To: <200805211459.55844.ossrosch@linux.vnet.ibm.com> References: <200805211459.55844.ossrosch@linux.vnet.ibm.com> Message-ID: <48342384.4080509@dev.mellanox.co.il> Stefan Roscher wrote: > Signed-off-by: Stefan Roscher > --- > ehca_0042_Count_async_events_for_EQs.patch | 55 +++++++++++++++++++++++++++++ > 1 file changed, 55 insertions(+) > > diff -Nurp ofa_kernel-1.3_old/kernel_patches/fixes/ehca_0042_Count_async_events_for_EQs.patch ofa_kernel-1.3_new/kernel_patches/fixes/ehca_0042_Count_async_events_for_EQs.patch > --- 
ofa_kernel-1.3_old/kernel_patches/fixes/ehca_0042_Count_async_events_for_EQs.patch 1970-01-01 01:00:00.000000000 +0100 > +++ ofa_kernel-1.3_new/kernel_patches/fixes/ehca_0042_Count_async_events_for_EQs.patch 2008-05-21 14:41:20.000000000 +0200 > @@ -0,0 +1,55 @@ > +diff -Nurp a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h > +--- a/drivers/infiniband/hw/ehca/ehca_classes.h 2008-05-21 13:54:31.000000000 +0200 > ++++ b/drivers/infiniband/hw/ehca/ehca_classes.h 2008-05-21 14:35:25.000000000 +0200 > +@@ -192,6 +192,8 @@ struct ehca_qp { > + int mtu_shift; > + u32 message_count; > + u32 packet_count; > ++ atomic_t nr_events; /* events seen */ > ++ wait_queue_head_t wait_completion; > + }; > + > + #define IS_SRQ(qp) (qp->ext_type == EQPT_SRQ) > +diff -Nurp a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c > +--- a/drivers/infiniband/hw/ehca/ehca_irq.c 2008-05-21 13:54:31.000000000 +0200 > ++++ b/drivers/infiniband/hw/ehca/ehca_irq.c 2008-05-21 14:35:25.000000000 +0200 > +@@ -204,6 +204,8 @@ static void qp_event_callback(struct ehc > + > + read_lock(&ehca_qp_idr_lock); > + qp = idr_find(&ehca_qp_idr, token); > ++ if (qp) > ++ atomic_inc(&qp->nr_events); > + read_unlock(&ehca_qp_idr_lock); > + > + if (!qp) > +@@ -223,6 +225,8 @@ static void qp_event_callback(struct ehc > + if (fatal && qp->ext_type == EQPT_SRQBASE) > + dispatch_qp_event(shca, qp, IB_EVENT_QP_LAST_WQE_REACHED); > + > ++ if (atomic_dec_and_test(&qp->nr_events)) > ++ wake_up(&qp->wait_completion); > + return; > + } > + > +diff -Nurp a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c > +--- a/drivers/infiniband/hw/ehca/ehca_qp.c 2008-05-21 13:54:31.000000000 +0200 > ++++ b/drivers/infiniband/hw/ehca/ehca_qp.c 2008-05-21 14:35:25.000000000 +0200 > +@@ -561,6 +566,8 @@ static struct ehca_qp *internal_create_q > + return ERR_PTR(-ENOMEM); > + } > + > ++ atomic_set(&my_qp->nr_events, 0); > ++ init_waitqueue_head(&my_qp->wait_completion); > + spin_lock_init(&my_qp->spinlock_s); > + spin_lock_init(&my_qp->spinlock_r); > + my_qp->qp_type = qp_type; > +@@ -1929,6 +1936,9 @@ static int internal_destroy_qp(struct ib > + idr_remove(&ehca_qp_idr, my_qp->token); > + write_unlock_irqrestore(&ehca_qp_idr_lock, flags); > + > ++ /* now wait until all pending events have completed */ > ++ wait_event(my_qp->wait_completion, !atomic_read(&my_qp->nr_events)); > ++ > + h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); > + if (h_ret != H_SUCCESS) { > + ehca_err(dev, "hipz_h_destroy_qp() failed h_ret=%li " Applied, Regards, Vladimir From ogerlitz at voltaire.com Wed May 21 06:31:54 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 21 May 2008 16:31:54 +0300 Subject: [ofa-general] RE: [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs In-Reply-To: <000101c8ba9b$4ca68da0$a3d8180a@amr.corp.intel.com> References: <000001c8b6aa$a5418810$bd59180a@amr.corp.intel.com> <482C68A4.9020305@opengridcomputing.com> <4831649A.2020206@voltaire.com> <000101c8b9f6$c4dd4cf0$bd59180a@amr.corp.intel.com> <4832BFAC.2050506@voltaire.com> <000101c8ba9b$4ca68da0$a3d8180a@amr.corp.intel.com> Message-ID: <4834244A.4010509@voltaire.com> Sean Hefty wrote: > After more thought, this approach is what I would try first. I think you will > need a new mutex per rdma_cm_id that does nothing but serializes callbacks. You > might be able to acquire/release it in disable/enable remove, but I didn't look > into the implementation in that much detail. 
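The shape Sean is suggesting, very roughly - every name below is invented for illustration, not taken from any eventual implementation - is one mutex per rdma_cm_id taken around each event callback, so that disabling events can synchronously wait out a callback already in flight:

	struct id_private {
		struct rdma_cm_id	*id;
		struct mutex		handler_mutex;	/* serializes event callbacks */
	};

	static int serialized_event_handler(struct rdma_cm_id *id,
					    struct rdma_cm_event *event)
	{
		struct id_private *priv = id->context;
		int ret;

		mutex_lock(&priv->handler_mutex);
		ret = handle_one_event(priv, event);	/* the ULP's real handler */
		mutex_unlock(&priv->handler_mutex);
		return ret;
	}

	static void disable_callbacks(struct id_private *priv)
	{
		mutex_lock(&priv->handler_mutex);	/* waits out a running callback */
	}

	static void enable_callbacks(struct id_private *priv)
	{
		mutex_unlock(&priv->handler_mutex);
	}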
OK, thanks a lot for the guidance, I will try that and let you know. Or. From nix.or.die at googlemail.com Wed May 21 07:06:36 2008 From: nix.or.die at googlemail.com (Gabriel C) Date: Wed, 21 May 2008 16:06:36 +0200 Subject: [ofa-general] linux-next: [PATCH] infiniband/hw/ipath/ipath_sdma.c , fix compiler warnings Message-ID: <48342C6C.2010502@googlemail.com> On linux-next from today , allmodconfig, I see the following warnings on 64bit: ... CC [M] drivers/infiniband/hw/ipath/ipath_sdma.o drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_abort_task': drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:269: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:269: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:271: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:271: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:273: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:273: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:348: warning: format '%016llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'ipath_restart_sdma': drivers/infiniband/hw/ipath/ipath_sdma.c:618: warning: format '%016llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' ... Signed-off-by: Gabriel C --- I see the 'format' warnings in mainline also. 
drivers/infiniband/hw/ipath/ipath_sdma.c | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_sdma.c b/drivers/infiniband/hw/ipath/ipath_sdma.c index 3697449..5f80151 100644 --- a/drivers/infiniband/hw/ipath/ipath_sdma.c +++ b/drivers/infiniband/hw/ipath/ipath_sdma.c @@ -257,7 +257,7 @@ static void sdma_abort_task(unsigned long opaque) /* everything is stopped, time to clean up and restart */ if (status == IPATH_SDMA_ABORT_ABORTED) { struct ipath_sdma_txreq *txp, *txpnext; - u64 hwstatus; + unsigned long hwstatus; int notify = 0; hwstatus = ipath_read_kreg64(dd, @@ -346,7 +346,7 @@ resched: */ if (jiffies > dd->ipath_sdma_abort_jiffies) { ipath_dbg("looping with status 0x%016llx\n", - dd->ipath_sdma_status); + (unsigned long long)dd->ipath_sdma_status); dd->ipath_sdma_abort_jiffies = jiffies + 5 * HZ; } resched_noprint: @@ -616,7 +616,7 @@ void ipath_restart_sdma(struct ipath_devdata *dd) spin_unlock_irqrestore(&dd->ipath_sdma_lock, flags); if (!needed) { ipath_dbg("invalid attempt to restart SDMA, status 0x%016llx\n", - dd->ipath_sdma_status); + (unsigned long long)dd->ipath_sdma_status); goto bail; } spin_lock_irqsave(&dd->ipath_sendctrl_lock, flags); From yevgenyp at mellanox.co.il Wed May 21 07:35:33 2008 From: yevgenyp at mellanox.co.il (Yevgeny Petrilin) Date: Wed, 21 May 2008 17:35:33 +0300 Subject: [ofa-general][PATCH 0/2] mlx4: Multiple completion vectors Message-ID: <48343335.6050608@mellanox.co.il> Hello Roland, This is the implementation of mlx4 support for multiple completion vectors. Main idea: Create a completion EQ for every core to allow a better utilization of multi-core machines. Number of created completion vectors is advertised through ib_device.num_comp_vectors. Each ULP can decide to which completion vector it wants to assign a created CQ. It can also let mlx4_core to decide on the completion vector number, and it will attach the CQ to the EQ that has the smallest number of CQs attached to it. Thanks, Yevgeny From yevgenyp at mellanox.co.il Wed May 21 07:35:49 2008 From: yevgenyp at mellanox.co.il (Yevgeny Petrilin) Date: Wed, 21 May 2008 17:35:49 +0300 Subject: [ofa-general][PATCH v1 1/2]mlx4: Multiple completion vectors support Message-ID: <48343345.6010205@mellanox.co.il> >From cde7a50546a0172849955909de41bcb1f8395f4e Mon Sep 17 00:00:00 2001 From: Yevgeny Petrilin Date: Tue, 20 May 2008 11:29:51 +0300 Subject: [PATCH] mlx4: Multiple completion vectors support The driver now creates a completion EQ for every core. While allocating CQ a ULP asks a completion vector number it wants the CQ to be attached to. 
The number of completion vectors is advertised through ib_device.num_comp_vectors Signed-off-by: Yevgeny Petrilin --- drivers/infiniband/hw/mlx4/cq.c | 2 +- drivers/infiniband/hw/mlx4/main.c | 2 +- drivers/net/mlx4/cq.c | 14 ++++++++-- drivers/net/mlx4/eq.c | 47 ++++++++++++++++++++++++------------ drivers/net/mlx4/main.c | 14 ++++++---- drivers/net/mlx4/mlx4.h | 4 +- include/linux/mlx4/device.h | 4 ++- 7 files changed, 57 insertions(+), 30 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 4521319..3519f92 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -221,7 +221,7 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector } err = mlx4_cq_alloc(dev->dev, entries, &cq->buf.mtt, uar, - cq->db.dma, &cq->mcq, 0); + cq->db.dma, &cq->mcq, vector, 0); if (err) goto err_dbmap; diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index 098dcd2..7ffcb00 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -567,7 +567,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) mlx4_foreach_port(i, ibdev->ports_map) ibdev->num_ports++; ibdev->ib_dev.phys_port_cnt = ibdev->num_ports; - ibdev->ib_dev.num_comp_vectors = 1; + ibdev->ib_dev.num_comp_vectors = dev->caps.num_comp_vectors; ibdev->ib_dev.dma_device = &dev->pdev->dev; ibdev->ib_dev.uverbs_abi_ver = MLX4_IB_UVERBS_ABI_VERSION; diff --git a/drivers/net/mlx4/cq.c b/drivers/net/mlx4/cq.c index 95e87a2..9be895f 100644 --- a/drivers/net/mlx4/cq.c +++ b/drivers/net/mlx4/cq.c @@ -189,7 +189,7 @@ EXPORT_SYMBOL_GPL(mlx4_cq_resize); int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq, - int collapsed) + unsigned vector, int collapsed) { struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_cq_table *cq_table = &priv->cq_table; @@ -227,7 +227,15 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, cq_context->flags = cpu_to_be32(!!collapsed << 18); cq_context->logsize_usrpage = cpu_to_be32((ilog2(nent) << 24) | uar->index); - cq_context->comp_eqn = priv->eq_table.eq[MLX4_EQ_COMP].eqn; + + if (vector >= dev->caps.num_comp_vectors) { + err = -EINVAL; + goto err_radix; + } + + cq->comp_eq_idx = MLX4_EQ_COMP_CPU0 + vector; + cq_context->comp_eqn = priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + + vector].eqn; cq_context->log_page_size = mtt->page_shift - MLX4_ICM_PAGE_SHIFT; mtt_addr = mlx4_mtt_addr(dev, mtt); @@ -276,7 +284,7 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq) if (err) mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n", err, cq->cqn); - synchronize_irq(priv->eq_table.eq[MLX4_EQ_COMP].irq); + synchronize_irq(priv->eq_table.eq[cq->comp_eq_idx].irq); spin_lock_irq(&cq_table->lock); radix_tree_delete(&cq_table->tree, cq->cqn); diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c index e141a15..825e90c 100644 --- a/drivers/net/mlx4/eq.c +++ b/drivers/net/mlx4/eq.c @@ -265,7 +265,7 @@ static irqreturn_t mlx4_interrupt(int irq, void *dev_ptr) writel(priv->eq_table.clr_mask, priv->eq_table.clr_int); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors; ++i) work |= mlx4_eq_int(dev, &priv->eq_table.eq[i]); return IRQ_RETVAL(work); @@ -482,7 +482,7 @@ static void mlx4_free_irqs(struct mlx4_dev *dev) if (eq_table->have_irq) free_irq(dev->pdev->irq, dev); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + 
dev->caps.num_comp_vectors; ++i) if (eq_table->eq[i].have_irq) free_irq(eq_table->eq[i].irq, eq_table->eq + i); } @@ -553,6 +553,7 @@ void mlx4_unmap_eq_icm(struct mlx4_dev *dev) int mlx4_init_eq_table(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); + int req_eqs; int err; int i; @@ -573,11 +574,21 @@ int mlx4_init_eq_table(struct mlx4_dev *dev) priv->eq_table.clr_int = priv->clr_base + (priv->eq_table.inta_pin < 32 ? 4 : 0); - err = mlx4_create_eq(dev, dev->caps.num_cqs + MLX4_NUM_SPARE_EQE, - (dev->flags & MLX4_FLAG_MSI_X) ? MLX4_EQ_COMP : 0, - &priv->eq_table.eq[MLX4_EQ_COMP]); - if (err) - goto err_out_unmap; + dev->caps.num_comp_vectors = 0; + req_eqs = (dev->flags & MLX4_FLAG_MSI_X) ? num_online_cpus() : 1; + while (req_eqs) { + err = mlx4_create_eq( + dev, dev->caps.num_cqs + MLX4_NUM_SPARE_EQE, + (dev->flags & MLX4_FLAG_MSI_X) ? + (MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors) : 0, + &priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + + dev->caps.num_comp_vectors]); + if (err) + goto err_out_comp; + + dev->caps.num_comp_vectors++; + req_eqs--; + } err = mlx4_create_eq(dev, MLX4_NUM_ASYNC_EQE + MLX4_NUM_SPARE_EQE, (dev->flags & MLX4_FLAG_MSI_X) ? MLX4_EQ_ASYNC : 0, @@ -586,12 +597,16 @@ int mlx4_init_eq_table(struct mlx4_dev *dev) goto err_out_comp; if (dev->flags & MLX4_FLAG_MSI_X) { - static const char *eq_name[] = { - [MLX4_EQ_COMP] = DRV_NAME " (comp)", - [MLX4_EQ_ASYNC] = DRV_NAME " (async)" - }; + static char eq_name[MLX4_NUM_EQ][20]; + + for (i = 0; i < MLX4_EQ_COMP_CPU0 + + dev->caps.num_comp_vectors; ++i) { + if (i == 0) + snprintf(eq_name[0], 20, DRV_NAME "(async)"); + else + snprintf(eq_name[i], 20, "comp_" DRV_NAME "%d", + i - 1); - for (i = 0; i < MLX4_NUM_EQ; ++i) { err = request_irq(priv->eq_table.eq[i].irq, mlx4_msi_x_interrupt, 0, eq_name[i], priv->eq_table.eq + i); @@ -616,7 +631,7 @@ int mlx4_init_eq_table(struct mlx4_dev *dev) mlx4_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", priv->eq_table.eq[MLX4_EQ_ASYNC].eqn, err); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors; ++i) eq_set_ci(&priv->eq_table.eq[i], 1); return 0; @@ -625,9 +640,9 @@ err_out_async: mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_ASYNC]); err_out_comp: - mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_COMP]); + for (i = 0; i < dev->caps.num_comp_vectors; ++i) + mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + i]); -err_out_unmap: mlx4_unmap_clr_int(dev); mlx4_free_irqs(dev); @@ -646,7 +661,7 @@ void mlx4_cleanup_eq_table(struct mlx4_dev *dev) mlx4_free_irqs(dev); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors; ++i) mlx4_free_eq(dev, &priv->eq_table.eq[i]); mlx4_unmap_clr_int(dev); diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index f86d472..4a909cb 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -911,22 +911,24 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); struct msix_entry entries[MLX4_NUM_EQ]; + int needed_vectors = MLX4_EQ_COMP_CPU0 + num_online_cpus(); int err; int i; if (msi_x) { - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < needed_vectors; ++i) entries[i].entry = i; - err = pci_enable_msix(dev->pdev, entries, ARRAY_SIZE(entries)); + err = pci_enable_msix(dev->pdev, entries, needed_vectors); if (err) { if (err > 0) - mlx4_info(dev, "Only %d MSI-X vectors available, " - "not using MSI-X\n", err); + mlx4_info(dev, "Only %d MSI-X vectors " + "available, need %d. 
Not using MSI-X\n", + err, needed_vectors); goto no_msi; } - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < needed_vectors; ++i) priv->eq_table.eq[i].irq = entries[i].vector; dev->flags |= MLX4_FLAG_MSI_X; @@ -934,7 +936,7 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev) } no_msi: - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < needed_vectors; ++i) priv->eq_table.eq[i].irq = dev->pdev->irq; } diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h index 7bc4cbf..4435272 100644 --- a/drivers/net/mlx4/mlx4.h +++ b/drivers/net/mlx4/mlx4.h @@ -64,8 +64,8 @@ enum { enum { MLX4_EQ_ASYNC, - MLX4_EQ_COMP, - MLX4_NUM_EQ + MLX4_EQ_COMP_CPU0, + MLX4_NUM_EQ = MLX4_EQ_COMP_CPU0 + NR_CPUS }; enum { diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index d97314d..7cbe078 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -187,6 +187,7 @@ struct mlx4_caps { int reserved_cqs; int num_eqs; int reserved_eqs; + int num_comp_vectors; int num_mpts; int num_mtt_segs; int fmr_reserved_mtts; @@ -305,6 +306,7 @@ struct mlx4_cq { int arm_sn; int cqn; + int comp_eq_idx; atomic_t refcount; struct completion free; @@ -434,7 +436,7 @@ void mlx4_free_hwq_res(struct mlx4_dev *mdev, struct mlx4_hwq_resources *wqres, int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq, - int collapsed); + unsigned vector, int collapsed); void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq); int mlx4_qp_reserve_range(struct mlx4_dev *dev, int cnt, int align, int *base); -- 1.5.4 From yevgenyp at mellanox.co.il Wed May 21 07:36:13 2008 From: yevgenyp at mellanox.co.il (Yevgeny Petrilin) Date: Wed, 21 May 2008 17:36:13 +0300 Subject: [ofa-general][PATCH v1 2/2] mlx4: Default value for automatic completion vector selection Message-ID: <4834335D.8030903@mellanox.co.il> >From b652a738eb5acbb01f0d0a143a12a7bdcc86d002 Mon Sep 17 00:00:00 2001 From: Yevgeny Petrilin Date: Tue, 20 May 2008 13:51:00 +0300 Subject: [PATCH] mlx4: Default value for automatic completion vector selection When the vector number passed to mlx4_cq_alloc is MLX4_ANY_VECTOR (0xff), the driver selects the completion vector that has the least CQs attached to it and attaches the CQ to the chosen vector. 
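For context, the consumer-visible knob is the comp_vector argument that ULPs already pass to ib_create_cq(); a sketch of the usage this patch set enables (illustrative only, not part of the patches):

	/* Spread CQs explicitly, e.g. one per CPU across the advertised vectors. */
	static struct ib_cq *create_percpu_cq(struct ib_device *device, int cpu,
					      ib_comp_handler comp, void *ctx, int cqe)
	{
		return ib_create_cq(device, comp, NULL, ctx, cqe,
				    cpu % device->num_comp_vectors);
	}

Alternatively, a consumer calling mlx4_cq_alloc() directly can pass MLX4_ANY_VECTOR and let mlx4_core attach the CQ to the least-loaded completion EQ, as described above.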
Signed-off-by: Yevgeny Petrilin --- drivers/net/mlx4/cq.c | 22 +++++++++++++++++++++- drivers/net/mlx4/mlx4.h | 1 + include/linux/mlx4/device.h | 4 ++++ 3 files changed, 26 insertions(+), 1 deletions(-) diff --git a/drivers/net/mlx4/cq.c b/drivers/net/mlx4/cq.c index 9be895f..7f0bdf6 100644 --- a/drivers/net/mlx4/cq.c +++ b/drivers/net/mlx4/cq.c @@ -187,6 +187,22 @@ int mlx4_cq_resize(struct mlx4_dev *dev, struct mlx4_cq *cq, } EXPORT_SYMBOL_GPL(mlx4_cq_resize); +static int mlx4_find_least_loaded_vector(struct mlx4_priv *priv) +{ + int i; + int index = 0; + int min = priv->eq_table.eq[MLX4_EQ_COMP_CPU0].load; + + for (i = 1; i < priv->dev.caps.num_comp_vectors; i++) { + if (priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + i].load < min) { + index = i; + min = priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + i].load; + } + } + + return index; +} + int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq, unsigned vector, int collapsed) @@ -228,7 +244,9 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, cq_context->flags = cpu_to_be32(!!collapsed << 18); cq_context->logsize_usrpage = cpu_to_be32((ilog2(nent) << 24) | uar->index); - if (vector >= dev->caps.num_comp_vectors) { + if (vector == MLX4_ANY_VECTOR) + vector = mlx4_find_least_loaded_vector(priv); + else if (vector >= dev->caps.num_comp_vectors) { err = -EINVAL; goto err_radix; } @@ -248,6 +266,7 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, if (err) goto err_radix; + priv->eq_table.eq[cq->comp_eq_idx].load++; cq->cons_index = 0; cq->arm_sn = 1; cq->uar = uar; @@ -285,6 +304,7 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq) mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n", err, cq->cqn); synchronize_irq(priv->eq_table.eq[cq->comp_eq_idx].irq); + priv->eq_table.eq[cq->comp_eq_idx].load--; spin_lock_irq(&cq_table->lock); radix_tree_delete(&cq_table->tree, cq->cqn); diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h index 4435272..b2d103a 100644 --- a/drivers/net/mlx4/mlx4.h +++ b/drivers/net/mlx4/mlx4.h @@ -144,6 +144,7 @@ struct mlx4_eq { u16 irq; u16 have_irq; int nent; + int load; struct mlx4_buf_list *page_list; struct mlx4_mtt mtt; }; diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index 7cbe078..3cfe5c1 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -151,6 +151,10 @@ enum { MLX4_NUM_FEXCH = 64 * 1024, }; +enum { + MLX4_ANY_VECTOR = 0xff +}; + static inline u64 mlx4_fw_ver(u64 major, u64 minor, u64 subminor) { return (major << 32) | (minor << 16) | subminor; -- 1.5.4 From jackm at dev.mellanox.co.il Wed May 21 07:36:00 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Wed, 21 May 2008 17:36:00 +0300 Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3) In-Reply-To: <20080511102345.GJ5298@sgi.com> References: <20080508171936.GU24293@sgi.com> <1210493899.15669.116.camel@mtls03> <20080511102345.GJ5298@sgi.com> Message-ID: <200805211736.01231.jackm@dev.mellanox.co.il> Arthur, I just checked in a fix for bugzilla 1004, which seems to be the same problem you are seeing. (I just noticed your explanation in this thread in an earlier post: "So a call to ipoib_cm_send() with tx_outstanding = (ipoib_sendq_size - 2), followed by a call to ipoib_send() would get to a situation where the queue was full, but not stopped." ). 
This is correct, and this was the bug (in addition to a missing invocation of netif_stop_queue in ipoib_ib_tx_timer_func() ). The patch uses the same value for tx_outstanding in all cases in the test for invoking netif_stop_queue(), so that there is no way the kernel will continue to send TX packets to IPoIB if the queue becomes too full. (using the same value in all tests creates a "barrier" with no holes). This patch will be part of OFED 1.3.1-rc2 -- and you should see no more mthca "queue full" messages. - Jack P.S., this fix is not needed in the upstream kernel, since the unsignalled UD send mechanism was not added upstream. On Sunday 11 May 2008 13:23, akepner at sgi.com wrote: > On Sun, May 11, 2008 at 11:18:19AM +0300, Eli Cohen wrote: > > .... > > The reason why the queue is stopped when there is one entry still left > > is to allow ipoib_ib_tx_timer_func() to post a special send request that > > will ensure a completion is reported for this operation thus freeing > > entries at the tx ring. I don't think the scenario you describe here can > > lead to a deadlock since if that happens, it will be released because of > > either one of the following two reasons: > > 1. If the tx queue contains not yet polled, more than one completion of > > send WRs posted by ipoib_cm_send(), they will soon be polled since they > > are posted to a signaled QP and sooner or later will generate > > completions and interrupts. In this case, subsequent postings to > > ipoib_send() will work as expected. > > > > 2. If there is only one outstanding ipoib_cm_send() at the tx queue, it > > means that there are 126 outstanding ipoib_send() requests at the tx > > queue and this means that a few of them are signaled and are expected to > > be completed soon. > > Thanks for the explanation. > > The main problem that we're seeing is that we just stop getting > completions for the send queue. (And we see this with OFED-1.2 > and 1.3, which makes me think that it's unlikely to be due to the > IPoIB driver since that's changed so much.) > > > ..... > > And last, could you arrange a remote access to a machine in this > > condition so we could check the state of the device/FW? > > > > Yes, I think so. Let me see if I can arrange that. > From olaf.kirch at oracle.com Wed May 21 07:52:49 2008 From: olaf.kirch at oracle.com (Olaf Kirch) Date: Wed, 21 May 2008 16:52:49 +0200 Subject: [ofa-general] RDS flow control In-Reply-To: <4833FA1B.4010707@mellanox.co.il> References: <200805121157.38135.jon@opengridcomputing.com> <4833FA1B.4010707@mellanox.co.il> Message-ID: <200805211652.50470.olaf.kirch@oracle.com> On Wednesday 21 May 2008 12:31:55 Tziporet Koren wrote: > SQD is not implemented in ConnectX for now So what do I do on ConnectX? Will the state transition RTR->SQD just appear to work (despite not doing anything), or will it fail? Will the subsequent change of the RNR retry count succeed? Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From ctung at neteffect.com Wed May 21 09:49:58 2008 From: ctung at neteffect.com (Chien Tung) Date: Wed, 21 May 2008 11:49:58 -0500 Subject: [ofa-general] [ PATCH ] RDMA/nes Update MAINTAINERS list Message-ID: <200805211649.m4LGnwPP026935@velma.neteffect.com> Adding Chien to maintainers list for NetEffect. 
Signed-off-by: Chien Tung --- MAINTAINERS | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index bc1c008..39feafc 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -2837,8 +2837,8 @@ S: Maintained NETEFFECT IWARP RNIC DRIVER (IW_NES) P: Faisal Latif M: flatif at neteffect.com -P: Nishi Gupta -M: ngupta at neteffect.com +P: Chien Tung +M: ctung at neteffect.com P: Glenn Streiff M: gstreiff at neteffect.com L: general at lists.openfabrics.org From ralph.campbell at qlogic.com Wed May 21 10:29:24 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 21 May 2008 10:29:24 -0700 Subject: [ofa-general] RDS flow control In-Reply-To: <4833D3DA.4040106@voltaire.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> <4833D223.5090007@voltaire.com> <4833D3DA.4040106@voltaire.com> Message-ID: <1211390964.3949.283.camel@brick.pathscale.com> On Wed, 2008-05-21 at 10:48 +0300, Or Gerlitz wrote: > Or Gerlitz wrote: > > HW support of the IB_DEVICE_RC_RNR_NAK_GEN device capability > I see now that only the mlx4, mthca and ehca drivers advertise this > capability, but the ipath doesn't, Ralph, was it just forgotten or you > guys really don't support this? > > Or. It is supported. Its just a bug that the capability bit isn't set. I'll make a patch for this. From ralph.campbell at qlogic.com Wed May 21 10:34:20 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 21 May 2008 10:34:20 -0700 Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c whenIB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: <6C2C79E72C305246B504CBA17B5500C90411217E@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C90411217E@mtlexch01.mtl.com> Message-ID: <1211391260.3949.286.camel@brick.pathscale.com> On Wed, 2008-05-21 at 14:30 +0300, Tziporet Koren wrote: > Ralph, > > Can you provide the patch to OFED 1.3.1 today so we will be able to > include it in RC1? > > Thanks > Tziporet The patch is available for pulling from: git://git.openfabrics.org/~ralphc/linux-2.6/.git ofed_kernel commit 2c62d7930703acea41568a98cc74e712475ebe38 Author: Ralph Campbell (QLogic) Date: Tue May 20 16:58:41 2008 -0700 IB/MAD - fix crash when HCA returns IB_MAD_RESULT_SUCCESS|IB_MAD_RESULT_CONS This was observed with the hw/ipath driver, but could happen with any driver. It's OFED bug 1027. The fix is to kfree the local data and break, rather than falling through.
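In outline, the fix is of this shape (simplified, not the literal diff; ret and mad_priv as in drivers/infiniband/core/mad.c):

	switch (ret) {
	case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED:
		/* the device consumed the MAD: free the local copy and stop */
		kmem_cache_free(ib_mad_cache, mad_priv);
		break;				/* previously fell through */
	default:
		/* the REPLY, plain SUCCESS and FAILURE arms are unchanged */
		break;
	}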
Signed-off-by: Dave Olson From gdror at mellanox.co.il Wed May 21 12:02:02 2008 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Wed, 21 May 2008 22:02:02 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4833EA6B.9000705@voltaire.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> Message-ID: <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> -----Original Message----- From: Or Gerlitz [mailto:ogerlitz at voltaire.com] Sent: Wednesday, May 21, 2008 12:25 PM To: Steve Wise Cc: rdreier at cisco.com; general at lists.openfabrics.org; Dror Goldenberg Subject: Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support Steve Wise wrote: >>> Support for the IB BMME and iWARP equivalent memory extensions ... >>> Usage Model: >>> - MR allocated with ib_alloc_mr() >>> - Page lists allocated via ib_alloc_fast_reg_page_list(). >>> - MR made VALID & bound to a specific page list via >>> ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via >>> ib_post_send(IB_WR_INVALIDATE_MR) >> AFAIK, the idea was to let the ulp post --two-- work requests, where >> the first creates the mapping and the second sends this mapping to >> the remote side, such that the second does not start before the first >> completes (i.e a fence). >> >> Now, the above scheme means that the ulp knows the value of the >> rkey/stag at the time of posting these two work requests (since it >> has to encode it in the second one), so something has to be clarified >> re the rkey/stag here, do they change each time this MR is used? how >> many bits can be changed, etc. > > The ULP knows the rkey/stag because its returned up front in the > ib_alloc_fast_reg_mr(). And it doesn't change (ignoring the key issue > which we haven't exposed yet to the ULP). The same rkey/stag can be > used for multiple mappings. It can be made invalid at any point in > time via the IB_WR_INVALIDATE_MR so the fact that you're leaving the > same rkey/stag advertised is not a risk. I understand that this (same rkey/stag used for all mapping produced for a specific mr) is what you are proposing, I still think there's a chance that by the spec and (not less important!) by existing HW support, its possible to have a different rkey/stag per mapping done on an mr, for example the IB spec uses a "consumer owned key portion of the L_Key" notation which makes me think there should be a way to have different rkey per mapping, Roland? Dror? [dg] When you post a fast register WQE, you specify the new 8 LSBits to be assigned to the MR. The rest 24 MSBits are the ones that you obtained while allocating the MR, and they persist throughout the lifetime of this MR. From worleys at gmail.com Wed May 21 14:44:14 2008 From: worleys at gmail.com (Chris Worley) Date: Wed, 21 May 2008 15:44:14 -0600 Subject: OFED 1.3 w/ a Lustre kernel (was Re: [ofa-general] kernel ib build (OFED 1.3) fails on SLES 10) Message-ID: I'm building OFED 1.3 for an RHEL5.1 kernel for Lustre 1.6.4.3: 2.6.18-53.1.13.el5_lustre.1.6.4.3smp. The install.pl script errors-out at the end of building the kernel modules RPM saying some built modules don't exist, but those modules do get built; the only other warning is some undefined symbols coming out of the modpost command. 
The differences between the .configs of the two kernels are minimal, so I think it's the same problem as the attached (not getting the proper patch files). What's the work-around? Chris P.S. Here's a sample of the output of the modpost command: WARNING: "scst_register_target_template" [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/ulp/srpt/ib_srpt.ko] undefined! WARNING: "scst_unregister" [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/ulp/srpt/ib_srpt.ko] undefined! WARNING: "scst_register" [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/ulp/srpt/ib_srpt.ko] undefined! WARNING: "scst_unregister_target_template" [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/ulp/srpt/ib_srpt.ko] undefined! WARNING: "scst_register_session" [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/ulp/srpt/ib_srpt.ko] undefined! WARNING: "scst_rx_mgmt_fn" [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/ulp/srpt/ib_srpt.ko] undefined! WARNING: "scst_cmd_init_done" [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/ulp/srpt/ib_srpt.ko] undefined! Here are the errors at the end: RPM build errors: ... File not found: /var/tmp/OFED/lib/modules/2.6.18-53.1.13.el5_lustre.1.6.4.3smp/updates/kernel/drivers/scsi/iscsi_tcp.ko File not found: /var/tmp/OFED/lib/modules/2.6.18-53.1.13.el5_lustre.1.6.4.3smp/updates/kernel/drivers/scsi/libiscsi.ko File not found: /var/tmp/OFED/lib/modules/2.6.18-53.1.13.el5_lustre.1.6.4.3smp/updates/kernel/drivers/scsi/scsi_transport_iscsi.ko File not found: /var/tmp/OFED/lib/modules/2.6.18-53.1.13.el5_lustre.1.6.4.3smp/updates/kernel/net/rds File not found: /var/tmp/OFED/lib/modules/2.6.18-53.1.13.el5_lustre.1.6.4.3smp/updates/kernel/drivers/net/cxgb3 File not found: /var/tmp/OFED/lib/modules/2.6.18-53.1.13.el5_lustre.1.6.4.3smp/updates/kernel/drivers/net/mlx4 On Tue, Apr 8, 2008 at 8:43 AM, Brian J. Murrell wrote: > On Tue, 2008-04-08 at 10:13 +0200, Thomas Großmann wrote: >> Hi, > > Hi > >> kernel ib build (OFED 1.3) fails on SLES 10. > > To be fair, it fails on Sun's version of the SLES 10 kernel for Lustre, > and here is why: > >> Executing(%build): /bin/sh -e /var/tmp/rpm-tmp.52212 >> + umask 022 >> + cd /var/tmp/OFED_topdir/BUILD >> + /bin/rm -rf /var/tmp/OFED >> ++ dirname /var/tmp/OFED >> + /bin/mkdir -p /var/tmp >> + /bin/mkdir /var/tmp/OFED >> + cd ofa_kernel-1.3 >> + rm -rf /var/tmp/OFED >> + cd /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3 >> + mkdir -p /var/tmp/OFED//usr/local/ofed-1.3/src >> + cp -a /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3 /var/tmp/OFED//usr/local/ofed-1.3/src >> + ./configure --prefix=/usr/local/ofed-1.3 --kernel-version 2.6.16-54-0.2.5_lustre.1.6.4.3smp --kernel-sources /lib/modules/2.6.16-54-0.2.5_lustre.1.6.4.3smp/build --modules-dir /lib/modules/2.6.16-54-0.2.5_lustre.1.6.4.3smp/updates --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod --with-mthca-mod --with-mlx4-mod --with-cxgb3-mod --with-nes-mod --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-srp-target-mod --with-rds-mod --with-qlgc_vnic-mod >> ofed_patch.mk does not exist. running ofed_patch.sh >> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/ofed_scripts/ofed_patch.sh --kernel-version 2.6.16-54-0.2.5_lustre.1.6.4.3smp > ----------------------------------------------------------------------------------------------------------------------^ > This kernel version does not match what ofed_patch.sh thinks is a SLES 10 > kernel because it is not of the form "2.6.16.*-*-*".
Here's the code in > ofed_patch.sh which detects SLES 10 kernels and assigns the right patch > series for it: > > 2.6.16.*-*-*) > minor=$(echo $KVERSION | cut -d"." -f4 | cut -d"-" -f1) > if [ $minor -lt 37 ]; then > echo 2.6.16_sles10 > elif [ $minor -lt 60 ]; then > echo 2.6.16_sles10_sp1 > else > echo 2.6.16_sles10_sp2 > fi > ;; > > The lustre kernel version for SLES 10 is > "2.6.16-54-0.2.5_lustre.1.6.4.3smp". In order for it to match the above > code it needs to have a "-" put before the "smp" at the end. I am > working on the Lustre build process to do exactly this right at this > moment as well as build our released RPMs with OFED 1.3 support right in > them. My work is being done in Lustre bugzilla ticket 15316. When I > have something working, I will post an attachment there with a patch for > our current b1_6 that should apply to 1.6.4.3. > > In theory you should be able to use the "--with-backport*" configure > options to override this detection when building the RPMs; however, see my > message to this list (inconsistent use of --with-backport[-patches]) > last Saturday about how this seems to be broken currently. > > Cheers, > b. From arlin.r.davis at intel.com Wed May 21 15:07:24 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 21 May 2008 15:07:24 -0700 Subject: [ofa-general] [PATCH 1/3][2.0] dapl: change cma provider to use max_rdma_read_in, out from ep_attr instead of HCA max values when connecting. Message-ID: <000d01c8bb8f$0d353f50$8bc3020a@amr.corp.intel.com> Patch set for v2.0.
Same fixes already applied to v1.2 Signed-off by: Arlin Davis ardavis at ichips.intel.com --- dapl/openib_cma/dapl_ib_cm.c | 6 ++++-- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/dapl/openib_cma/dapl_ib_cm.c b/dapl/openib_cma/dapl_ib_cm.c index d3835b3..de35eba 100755 --- a/dapl/openib_cma/dapl_ib_cm.c +++ b/dapl/openib_cma/dapl_ib_cm.c @@ -540,8 +540,10 @@ DAT_RETURN dapls_ib_connect(IN DAT_EP_HANDLE ep_handle, /* Setup QP/CM parameters and private data in cm_id */ (void)dapl_os_memzero(&conn->params, sizeof(conn->params)); - conn->params.responder_resources = conn->hca->ib_trans.max_rdma_rd_in; - conn->params.initiator_depth = conn->hca->ib_trans.max_rdma_rd_out; + conn->params.responder_resources = + ep_ptr->param.ep_attr.max_rdma_read_in; + conn->params.initiator_depth = + ep_ptr->param.ep_attr.max_rdma_read_out; conn->params.flow_control = 1; conn->params.rnr_retry_count = IB_RNR_RETRY_COUNT; conn->params.retry_count = IB_RC_RETRY_COUNT; -- 1.5.2.5 From arlin.r.davis at intel.com Wed May 21 15:07:28 2008 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Wed, 21 May 2008 15:07:28 -0700 Subject: [ofa-general] [PATCH 3/3][2.0] dtest, dtestx, dapltest: fix build issues with Redhat EL5.1 Message-ID: need include files/definitions for sleep, getpid, gettimeofday Signed-off by: Arlin Davis ardavis at ichips.intel.com --- test/dapltest/mdep/linux/dapl_mdep_user.h | 1 + test/dtest/dtest.c | 3 +++ test/dtest/dtestx.c | 3 +++ 3 files changed, 7 insertions(+), 0 deletions(-) diff --git a/test/dapltest/mdep/linux/dapl_mdep_user.h b/test/dapltest/mdep/linux/dapl_mdep_user.h index 52199d1..c39d3d6 100755 --- a/test/dapltest/mdep/linux/dapl_mdep_user.h +++ b/test/dapltest/mdep/linux/dapl_mdep_user.h @@ -42,6 +42,7 @@ #include #include #include +#include /* inet_ntoa */ #include diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c index 9c8ec71..095ff40 100755 --- a/test/dtest/dtest.c +++ b/test/dtest/dtest.c @@ -63,13 +63,16 @@ #include #include #include +#include #include +#include #include #include #include #include #include #include +#include #define DAPL_PROVIDER "ofa-v2-ib0" diff --git a/test/dtest/dtestx.c b/test/dtest/dtestx.c index 19ef788..1db60eb 100755 --- a/test/dtest/dtestx.c +++ b/test/dtest/dtestx.c @@ -48,12 +48,15 @@ #define DAPL_PROVIDER "ibnic0v2" #else #include +#include #include +#include #include #include #include #include #include +#include #define DAPL_PROVIDER "ofa-v2-ib0" #define F64x "%"PRIx64"" -- 1.5.2.5 From arlin.r.davis at intel.com Wed May 21 15:07:26 2008 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Wed, 21 May 2008 15:07:26 -0700 Subject: [ofa-general] [PATCH 2/3][2.0] dapl: Fix long delays with the cma provider open call when DNS is not configure on server. Message-ID: Open call should default to netdev names when resolving local IP address for cma binding to match dat.conf settings. The open code attempts to resolve with IP or Hostname first and if there is no DNS services setup the failover to netdev name resolution is delayed for as much as 20 seconds. 
Signed-off by: Arlin Davis ardavis at ichips.intel.com --- dapl/openib_cma/dapl_ib_util.c | 68 +++++++++++++++++++--------------------- 1 files changed, 32 insertions(+), 36 deletions(-) diff --git a/dapl/openib_cma/dapl_ib_util.c b/dapl/openib_cma/dapl_ib_util.c index 41986a3..e3a3b29 100755 --- a/dapl/openib_cma/dapl_ib_util.c +++ b/dapl/openib_cma/dapl_ib_util.c @@ -105,41 +105,37 @@ bail: /* Get IP address using network name, address, or device name */ static int getipaddr(char *name, char *addr, int len) { - struct addrinfo *res; - int ret; - - /* Assume network name and address type for first attempt */ - if (getaddrinfo(name, NULL, NULL, &res)) { - /* retry using network device name */ - ret = getipaddr_netdev(name,addr,len); - if (ret) { - dapl_log(DAPL_DBG_TYPE_ERR, - " open_hca: getaddr_netdev ERROR:" - " %s. Is %s configured?\n", - strerror(errno), name); - return ret; - } - } else { - if (len >= res->ai_addrlen) - memcpy(addr, res->ai_addr, res->ai_addrlen); - else { - freeaddrinfo(res); - return EINVAL; - } - - freeaddrinfo(res); - } + struct addrinfo *res; + + /* assume netdev for first attempt, then network and address type */ + if (getipaddr_netdev(name,addr,len)) { + if (getaddrinfo(name, NULL, NULL, &res)) { + dapl_log(DAPL_DBG_TYPE_ERR, + " open_hca: getaddr_netdev ERROR:" + " %s. Is %s configured?\n", + strerror(errno), name); + return 1; + } else { + if (len >= res->ai_addrlen) + memcpy(addr, res->ai_addr, res->ai_addrlen); + else { + freeaddrinfo(res); + return 1; + } + freeaddrinfo(res); + } + } - dapl_dbg_log(DAPL_DBG_TYPE_UTIL, - " getipaddr: family %d port %d addr %d.%d.%d.%d\n", - ((struct sockaddr_in *)addr)->sin_family, - ((struct sockaddr_in *)addr)->sin_port, - ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 0 & 0xff, - ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 8 & 0xff, - ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 16 & 0xff, - ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 24 & 0xff); - - return 0; + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " getipaddr: family %d port %d addr %d.%d.%d.%d\n", + ((struct sockaddr_in *)addr)->sin_family, + ((struct sockaddr_in *)addr)->sin_port, + ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 0 & 0xff, + ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 8 & 0xff, + ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 16 & 0xff, + ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 24 & 0xff); + + return 0; } /* @@ -640,7 +636,7 @@ DAT_RETURN dapli_ib_thread_init(void) while (g_ib_thread_state != IB_THREAD_RUN) { struct timespec sleep, remain; sleep.tv_sec = 0; - sleep.tv_nsec = 20000000; /* 20 ms */ + sleep.tv_nsec = 2000000; /* 2 ms */ dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread_init: waiting for ib_thread\n"); dapl_os_unlock(&g_hca_lock); @@ -677,7 +673,7 @@ void dapli_ib_thread_destroy(void) while ((g_ib_thread_state != IB_THREAD_EXIT) && (retries--)) { struct timespec sleep, remain; sleep.tv_sec = 0; - sleep.tv_nsec = 20000000; /* 20 ms */ + sleep.tv_nsec = 2000000; /* 2 ms */ dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread_destroy: waiting for ib_thread\n"); write(g_ib_pipe[1], "w", sizeof "w"); -- 1.5.2.5 From arlin.r.davis at intel.com Wed May 21 15:48:14 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 21 May 2008 15:48:14 -0700 Subject: [ofa-general] [ANNOUNCE] dapl-1.2.7 and dapl-2.0.9 release Message-ID: <000e01c8bb94$c12f5e00$8bc3020a@amr.corp.intel.com> New release for dapl 1.2 and 2.0 available on the OFA download page and in my git tree. 
md5sum: 21e2ff9e64ba5ef697e413bb2373db18 dapl-1.2.7.tar.gz md5sum: 89e07acd0ff4dc73078ee46ee663c786 dapl-2.0.9.tar.gz Vlad, please pull both packages into OFED 1.3.1 and install the following: dapl-1.2.7-1 dapl-devel-1.2.7-1 dapl-2.0.9-1 dapl-utils-2.0.9-1 dapl-devel-2.0.9-1 dapl-debuginfo-2.0.9-1 Summary of fixes since last package: v1,v2 - dtest,dtestx,dapltest build issues on RHEL5.1 v1,v2 - long delay during dat_ia_open when DNS not configured v1,v2 - use rdma_read_in/out from ep_attr per consumer instead of HCA max See http://www.openfabrics.org/downloads/dapl/ for more details. -arlin From matthias at sgi.com Wed May 21 16:48:28 2008 From: matthias at sgi.com (Matthias Blankenhaus) Date: Wed, 21 May 2008 16:48:28 -0700 (PDT) Subject: [ofa-general] saquery port problems In-Reply-To: References: Message-ID: I have a patch that fixes the problem: diff -Narpu infiniband-diags-1.3.6.vanilla/src/saquery.c my/src/saquery.c --- infiniband-diags-1.3.6.vanilla/src/saquery.c 2008-02-28 00:58:36.000000000 -0800 +++ my/src/saquery.c 2008-05-21 16:08:19.583221794 -0700 @@ -1304,13 +1304,13 @@ get_bind_handle(void) ca_name_index++; if (sa_port_num && sa_port_num != attr_array[i].port_num) continue; - if (sa_hca_name && i == 0) - continue; if (sa_hca_name && strcmp(sa_hca_name, vendor->ca_names[ca_name_index]) != 0) continue; - if (attr_array[i].link_state == IB_LINK_ACTIVE) + if (attr_array[i].link_state == IB_LINK_ACTIVE) { port_guid = attr_array[i].port_guid; + break; + } } I have tested it and it solves the problem. Does this look ok? Matthias On Tue, 20 May 2008, Matthias Blankenhaus wrote: > Forgot some important info: > > saquery BUILD VERSION: 1.3.6 > OFED-1.3 > > On Tue, 20 May 2008, Matthias Blankenhaus wrote: > > Howdy! > > > > While using this tool to run some queries on a two-port HCA, I noticed > > some odd behavior. Here are my observations running on a SLES10SP2 > > (x86_64) Intel Xeon with a Mellanox Technologies MT25208 InfiniHost III > > Ex HCA: > > > > (01) saquery -C mthca0 -m > > This yields the output for port number two. This does not conform to the > > usual ib tools behavior of reporting on port one by default. > > > > (02) saquery -C mthca0 -m -P 1 > > Fails with "Failed to find active port, check port status with "ibstat". > > This is incorrect, since > > > > # ibstat mthca0 1 > > CA: 'mthca0' > > Port 1: > > State: Active > > Physical state: LinkUp > > Rate: 20 > > Base lid: 5 > > LMC: 0 > > SM lid: 1 > > Capability mask: 0x02510a68 > > Port GUID: 0x0008f10403987dc5 > > > > This might be the reason why (01) reports on port two. > > > > (03) saquery -C mthca0 -m -P 2 > > Works and is identical with the output from (01). > > > > However, the following command options work: > > > > (04) saquery -P 1 -m > > Correctly yields the output for port one. In other words, > > port one seems to be fine, unlike reported in (02). > > > > (05) saquery -P 2 -m > > Correctly yields the output for port two. > > > > Is it incorrect to use -C and -P in combination? Why does > > saquery think that port one is not active?
From nix.or.die at googlemail.com Wed May 21 18:45:32 2008 From: nix.or.die at googlemail.com (Gabriel C) Date: Thu, 22 May 2008 03:45:32 +0200 Subject: [ofa-general] Re: linux-next: [PATCH] infiniband/hw/ipath/ipath_sdma.c , fix compiler warnings In-Reply-To: <20080522101714.3f2c5a82.sfr@canb.auug.org.au> References: <48342C6C.2010502@googlemail.com> <20080522101714.3f2c5a82.sfr@canb.auug.org.au> Message-ID: <1820d69d0805211845i54908fc5s486d8b9c1d818481@mail.gmail.com> 2008/5/22 Stephen Rothwell : > Hi Gabriel, > > We appreciate the reports, thanks. > > On Wed, 21 May 2008 16:06:36 +0200 Gabriel C > wrote: > > > > On linux-next from today, allmodconfig, I see the following warnings on > 64bit: > > What architecture are you building on? It is x86_64 Gabriel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nix.or.die at googlemail.com Wed May 21 19:39:16 2008 From: nix.or.die at googlemail.com (Gabriel C) Date: Thu, 22 May 2008 04:39:16 +0200 Subject: [ofa-general] Re: linux-next: [PATCH] infiniband/hw/ipath/ipath_sdma.c , fix compiler warnings In-Reply-To: <20080522002335.GG20457@bakeyournoodle.com> References: <48342C6C.2010502@googlemail.com> <20080522002335.GG20457@bakeyournoodle.com> Message-ID: <1820d69d0805211939u7476ead9pe17946f5d4ee7248@mail.gmail.com> 2008/5/22 Tony Breeds : > > On Wed, May 21, 2008 at 04:06:36PM +0200, Gabriel C wrote: > > On linux-next from today , allmodconfig, I see the following warnings on 64bit: > > x86_64 right? > > > > > diff --git a/drivers/infiniband/hw/ipath/ipath_sdma.c b/drivers/infiniband/hw/ipath/ipath_sdma.c > > index 3697449..5f80151 100644 > > --- a/drivers/infiniband/hw/ipath/ipath_sdma.c > > +++ b/drivers/infiniband/hw/ipath/ipath_sdma.c > > @@ -257,7 +257,7 @@ static void sdma_abort_task(unsigned long opaque) > > /* everything is stopped, time to clean up and restart */ > > if (status == IPATH_SDMA_ABORT_ABORTED) { > > struct ipath_sdma_txreq *txp, *txpnext; > > - u64 hwstatus; > > + unsigned long hwstatus; > > int notify = 0; > > > > hwstatus = ipath_read_kreg64(dd, > > This can't be right. hwstatus needs to be u64, as that's what ipath_read_kreg64() retuns. > and a little bit further down we see: > > --- > if (/* ScoreBoardDrainInProg */ > test_bit(63, &hwstatus) || > /* AbortInProg */ > test_bit(62, &hwstatus) || > /* InternalSDmaEnable */ > test_bit(61, &hwstatus) || > --- > > so hwstatus, clearly needs to be 64-bits. Hmm , right it need be 64-bits. I should drink my coffee first and read code more carefully before sending out wrong patches , sorry. > This brings up an interesting point. test_bit() and co are > essntally expecting to be passed the address > of an unsigned long[], so is it correct to pass &u64? I'm not sure about this one. > > Yours Tony > Gabriel From nix.or.die at googlemail.com Wed May 21 19:42:39 2008 From: nix.or.die at googlemail.com (Gabriel C) Date: Thu, 22 May 2008 04:42:39 +0200 Subject: [ofa-general] Re: linux-next: [PATCH] infiniband/hw/ipath/ipath_sdma.c , fix compiler warnings In-Reply-To: <20080522002335.GG20457@bakeyournoodle.com> References: <48342C6C.2010502@googlemail.com> <20080522002335.GG20457@bakeyournoodle.com> Message-ID: <1820d69d0805211942v36fa54edh216a14efae9ebdde@mail.gmail.com> 2008/5/22 Tony Breeds : > On Wed, May 21, 2008 at 04:06:36PM +0200, Gabriel C wrote: >> On linux-next from today , allmodconfig, I see the following warnings on 64bit: > > x86_64 right? 
> Yes it is x86_64 From vlad at dev.mellanox.co.il Wed May 21 22:58:44 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 22 May 2008 08:58:44 +0300 Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c whenIB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: <1211391260.3949.286.camel@brick.pathscale.com> References: <6C2C79E72C305246B504CBA17B5500C90411217E@mtlexch01.mtl.com> <1211391260.3949.286.camel@brick.pathscale.com> Message-ID: <48350B94.7030906@dev.mellanox.co.il> Ralph Campbell wrote: > On Wed, 2008-05-21 at 14:30 +0300, Tziporet Koren wrote: >> Ralph, >> >> Can you provide the patch to OFED 1.3.1 today so we will be able to >> include it in RC1? >> >> Thanks >> Tziporet > > The patch is available for pulling from: > git://git.openfabrics.org/~ralphc/linux-2.6/.git ofed_kernel > > commit 2c62d7930703acea41568a98cc74e712475ebe38 > Author: Ralph Campbell (QLogic) > Date: Tue May 20 16:58:41 2008 -0700 > > IB/MAD - fix crash when HCA returns IB_MAD_RESULT_SUCCESS|IB_MAD_RESULT_CONS > > This was observed with the hw/ipath driver, but could happen with any > driver. It's OFED bug 1027. The fix is to kfree the local data and > break, rather than falling through. > > Signed-off-by: Dave Olson > > > Done, Regards, Vladimir From vlad at dev.mellanox.co.il Wed May 21 23:17:39 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 22 May 2008 09:17:39 +0300 Subject: [ofa-general] Re: [ewg] [ANNOUNCE] dapl-1.2.7 and dapl-2.0.9 release In-Reply-To: <000e01c8bb94$c12f5e00$8bc3020a@amr.corp.intel.com> References: <000e01c8bb94$c12f5e00$8bc3020a@amr.corp.intel.com> Message-ID: <48351003.1040906@dev.mellanox.co.il> Arlin Davis wrote: > New release for dapl 1.2 and 2.0 available on the OFA download page and in my git tree. > > md5sum: 21e2ff9e64ba5ef697e413bb2373db18 dapl-1.2.7.tar.gz > md5sum: 89e07acd0ff4dc73078ee46ee663c786 dapl-2.0.9.tar.gz > > Vlad, please pull both packages into OFED 1.3.1 and install the following: > > dapl-1.2.7-1 > dapl-devel-1.2.7-1 > dapl-2.0.9-1 > dapl-utils-2.0.9-1 > dapl-devel-2.0.9-1 > dapl-debuginfo-2.0.9-1 > > Summary of fixes since last package: > v1,v2 - dtest,dtestx,dapltest build issues on RHEL5.1 > v1,v2 - long delay during dat_ia_open when DNS not configured > v1,v2 - use rdma_read_in/out from ep_attr per consumer instead of HCA max > > See http://www.openfabrics.org/downloads/dapl/ more details. > > -arlin > Done, Regards, Vladimir From ogerlitz at voltaire.com Wed May 21 23:30:40 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 22 May 2008 09:30:40 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> Message-ID: <48351310.5090108@voltaire.com> Dror Goldenberg wrote: > When you post a fast register WQE, you specify the new 8 LSBits to > be assigned to the MR. The rest 24 MSBits are the ones that you obtained > while allocating the MR, and they persist throughout the lifetime of > this MR. OK, thanks Dror. Steve, do we agree on this point? if yes, the next version of the patches should include the new rkey value (or just the new 8 LSbits) in the work request. Or. 
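(An aside, not part of the posted patches: a sketch of the rkey arithmetic being discussed, assuming the 24/8 split Dror describes above. The helper name is invented, and the work-request field in the comment is a placeholder for whatever layout the final patches expose.)

/* The 24 MSBits of the rkey are fixed when the MR is allocated; a
 * fast-register work request supplies only the 8 LSBits ("key").
 * A ULP that wants a fresh rkey for every mapping could derive it
 * like this: */
#include <linux/types.h>

static inline u32 mr_next_rkey(u32 cur_rkey, u8 key)
{
	return (cur_rkey & 0xffffff00) | key;
}

/* e.g. bump the key on each re-registration so that a stale remote
 * access using the previous rkey is guaranteed to fail:
 *	wr.wr.fast_reg.rkey = mr_next_rkey(mr->rkey, ++map_seq);
 */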
From ogerlitz at voltaire.com Wed May 21 23:49:07 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 22 May 2008 09:49:07 +0300 Subject: [ofa-general] Re: the so many IPoIB-UD failures introduced by OFED 1.3 In-Reply-To: References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com> <20080510190721.GI5298@sgi.com> <482A8574.8070201@voltaire.com> Message-ID: <48351763.4070705@voltaire.com> Roland Dreier wrote: > > Maybe it's about time for the Linux IB maintainers to get a little angry?! > > I'm not angry about it, although I have pretty much given up on trying > to debug IPoIB issues seen running anything other than an upstream > kernel. It seems like the OFED maintainers, the enterprise distros and > their customers should be more concerned about the failure of the OFED > process -- clearly producing something much buggier and less reliable > than the stock kernel is not what anyone wants. Still, it wastes your time... For example, this thread ended up being the ninth(!) bug in a patch which was not reviewed on the mailing list (see https://bugs.openfabrics.org/show_bug.cgi?id=1004#c12) and which is now on hold, waiting to be sent for review, since it does not provide any benefit (see http://lists.openfabrics.org/pipermail/general/2008-March/048322.html). Can we get any crazier than that? Or.
From sashak at voltaire.com Thu May 22 00:37:03 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 22 May 2008 10:37:03 +0300 Subject: [ofa-general] saquery port problems In-Reply-To: References: Message-ID: <20080522073703.GA31474@sashak.voltaire.com> Hi Matthias, On 16:48 Wed 21 May , Matthias Blankenhaus wrote: > I have a patch that fixes the problem: > > diff -Narpu infiniband-diags-1.3.6.vanilla/src/saquery.c my/src/saquery.c > --- infiniband-diags-1.3.6.vanilla/src/saquery.c 2008-02-28 > 00:58:36.000000000 -0800 > +++ my/src/saquery.c 2008-05-21 16:08:19.583221794 -0700 > @@ -1304,13 +1304,13 @@ get_bind_handle(void) > ca_name_index++; > if (sa_port_num && sa_port_num != attr_array[i].port_num) > continue; > - if (sa_hca_name && i == 0) > - continue; > if (sa_hca_name > && strcmp(sa_hca_name, vendor->ca_names[ca_name_index]) != 0) > continue; > - if (attr_array[i].link_state == IB_LINK_ACTIVE) > + if (attr_array[i].link_state == IB_LINK_ACTIVE) { > port_guid = attr_array[i].port_guid; > + break; > + } > } > > > I have tested it and it solves the problem. > > Does this look ok? Yes, this looks correct. Thanks for fixing this. I will just need your 'Signed-off-by:' line in order to apply the patch. Sasha
From eli at mellanox.co.il Thu May 22 01:51:04 2008 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 22 May 2008 11:51:04 +0300 Subject: [ofa-general] [PATCH] IB/ipoib: set max CM MTU when moving to CM mode Message-ID: <1211446264.7310.49.camel@eli-laptop> >From c878b9d3a4cfd031e8baaba46a224b46b1ced441 Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Thu, 22 May 2008 11:45:04 +0300 Subject: [PATCH] IB/ipoib: set max CM MTU when moving to CM mode This will relieve the user of the need to restore the CM mode MTU every time we switch from UD to CM mode. Signed-off-by: Eli Cohen --- I would like to push this patch to ofed 1.3.1 too.
drivers/infiniband/ulp/ipoib/ipoib_cm.c | 5 +++++ 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 97e67d3..e6f57dd 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -1387,6 +1387,11 @@ static ssize_t set_mode(struct device *d, struct device_attribute *attr, dev->features &= ~(NETIF_F_IP_CSUM | NETIF_F_SG | NETIF_F_TSO); priv->tx_wr.send_flags &= ~IB_SEND_IP_CSUM; + if (ipoib_cm_max_mtu(dev) > priv->mcast_mtu) + ipoib_warn(priv, "mtu > %d will cause multicast packet drops.\n", + priv->mcast_mtu); + dev->mtu = ipoib_cm_max_mtu(dev); + ipoib_flush_paths(dev); return count; } -- 1.5.5.1 From vlad at lists.openfabrics.org Thu May 22 03:10:44 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 22 May 2008 03:10:44 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20080522-0200 daily build status Message-ID: <20080522101044.DDC07E60E8B@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with 
linux-2.6.18-53.el5 Log: from include/linux/mutex.h:13, from /home/vlad/tmp/ofa_1_3_kernel-20080522-0200_linux-2.6.18-53.el5_x86_64_check/drivers/infiniband/core/addr.c:31: /home/vlad/tmp/ofa_1_3_kernel-20080522-0200_linux-2.6.18-53.el5_x86_64_check/kernel_addons/backport/2.6.18-EL5.2/include/linux/log2.h:53: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'is_power_of_2' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20080522-0200_linux-2.6.18-53.el5_x86_64_check/drivers/infiniband/core/addr.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20080522-0200_linux-2.6.18-53.el5_x86_64_check/drivers/infiniband/core] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20080522-0200_linux-2.6.18-53.el5_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20080522-0200_linux-2.6.18-53.el5_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.18-53.el5' make: *** [kernel] Error 2 ----------------------------------------------------------------------------------
From cousin_vinnie at hotmail.fr Thu May 22 04:29:21 2008 From: cousin_vinnie at hotmail.fr (Renaud Durand) Date: Thu, 22 May 2008 13:29:21 +0200 Subject: [ofa-general] iSER problem Message-ID: Hello, I tried to run iSCSI on my SLES10 SP1 computers. To do this, I followed the tutorials on OFED's wiki (https://wiki.openfabrics.org/tiki-index.php?page=iSER and https://wiki.openfabrics.org/tiki-index.php?page=ISER-target). My problem is this: the target is working well (I can "discover" myself with 127.0.0.1, with my Ethernet address, and with my IB address): linux-target:~ # iscsiadm -m discovery -t sendtargets -p 127.0.0.1 161.74.X.X:3260,1 iqn.2001-04.com.example:storage.disk2.amiens.sys1.xyz 127.0.0.1:3260,1 iqn.2001-04.com.example:storage.disk2.amiens.sys1.xyz linux-target:~ # but the remote computer can't discover the target: linux-cx5e:~ # iscsiadm -m discovery -t sendtargets -p 161.74.X.X iscsiadm: connection to discovery address 161.74.X.X failed iscsiadm: connection to discovery address 161.74.X.X failed iscsiadm: connection to discovery address 161.74.X.X failed iscsiadm: connection to discovery address 161.74.X.X failed iscsiadm: connection to discovery address 161.74.X.X failed iscsiadm: connection login retries (reopen_max) 5 exceeded The ping works, though: linux-cx5e:~ # ping 161.74.X.X PING 161.74.83.128 (161.74.X.X) 56(84) bytes of data. 64 bytes from 161.74.X.X: icmp_seq=1 ttl=64 time=2.27 ms 64 bytes from 161.74.X.X: icmp_seq=2 ttl=64 time=0.068 ms Here is lsmod on my computer: linux-cx5e:~ # lsmod | grep iscsi iscsi_tcp 27520 0 libiscsi 30208 1 iscsi_tcp scsi_transport_iscsi 34320 3 iscsi_tcp,libiscsi scsi_mod 156600 6 iscsi_tcp,libiscsi,scsi_transport_iscsi,sg,libata,sd_mod I really don't understand what the problem is. If you have a suggestion or solution, please tell me, because I am desperate. _________________________________________________________________ Retouch, organize and share your photos for free with the Photo Gallery software! http://www.windowslive.fr/galerie/ -------------- next part -------------- An HTML attachment was scrubbed...
URL: From Thomas.Talpey at netapp.com Thu May 22 04:52:30 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 22 May 2008 07:52:30 -0400 Subject: [ofa-general] iSER problem In-Reply-To: References: Message-ID: At 07:29 AM 5/22/2008, Renaud Durand wrote: >but the remote computer can't discover the target > >linux-cx5e:~ # iscsiadm -m discovery -t sendtargets -p 161.74.X.X >iscsiadm: connection to discovery address 161.74.X.X failed Do you have a firewall protecting the iSCSI ports perhaps? Tom. From kliteyn at dev.mellanox.co.il Thu May 22 05:17:18 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 22 May 2008 15:17:18 +0300 Subject: [ofa-general] [PATCH] ibutils: fix some log messages Message-ID: <4835644E.2050508@dev.mellanox.co.il> Hi Oren, Fixing some log messages from 'info' to 'error' Signed-off-by: Yevgeny Kliteynik --- ibmgtsim/src/dispatcher.cpp | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/ibmgtsim/src/dispatcher.cpp b/ibmgtsim/src/dispatcher.cpp index 02b63fe..1fd9512 100644 --- a/ibmgtsim/src/dispatcher.cpp +++ b/ibmgtsim/src/dispatcher.cpp @@ -309,12 +309,12 @@ IBMSDispatcher::routeMadToDestByLid( int hops = 0; MSGREG(inf0, 'I', "Routing MAD mgmt_class:$ method:$ tid:$ to lid:$ from:$ port:$", "dispatcher"); - MSGREG(inf1, 'I', "Got to dead-end routing to lid:$ at node:$", + MSGREG(inf1, 'E', "Got to dead-end routing to lid:$ at node:$ (fdb)", "dispatcher"); MSGREG(inf2, 'I', "Arrived at lid $ = node $ after $ hops", "dispatcher"); - MSGREG(inf3, 'I', "Got to dead-end routing to lid:$ at node:$ port:$", + MSGREG(inf3, 'E', "Got to dead-end routing to lid:$ at node:$ port:$", "dispatcher"); - MSGREG(inf4, 'I', "Got to dead-end routing to lid:$ at HCA node:$ port:$ lid:$", + MSGREG(inf4, 'E', "Got to dead-end routing to lid:$ at HCA node:$ port:$ lid:$", "dispatcher"); MSGREG(inf5, 'V', "Got node:$ through port:$", "dispatcher"); -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Thu May 22 05:20:53 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 22 May 2008 15:20:53 +0300 Subject: [ofa-general] [PATCH] ibutils: fixing seg. fault in ibis_gsi_mad_ctrl.c Message-ID: <48356525.9010700@dev.mellanox.co.il> Hi Oren, Fixing seg fault in allocation of gsi management class vector. 
Signed-off-by: Yevgeny Kliteynik --- ibis/src/ibis_gsi_mad_ctrl.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/ibis/src/ibis_gsi_mad_ctrl.c b/ibis/src/ibis_gsi_mad_ctrl.c index 356d33d..bfb5fe6 100644 --- a/ibis/src/ibis_gsi_mad_ctrl.c +++ b/ibis/src/ibis_gsi_mad_ctrl.c @@ -731,7 +731,7 @@ ibis_gsi_mad_ctrl_set_class_attr_cb( size = cl_vector_get_size(&p_ctrl->class_vector); if (size <= mad_class) { - cl_status = cl_vector_set_size(&p_ctrl->class_vector,mad_class); + cl_status = cl_vector_set_size(&p_ctrl->class_vector,mad_class+1); if( cl_status != CL_SUCCESS) { -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Thu May 22 05:33:13 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 22 May 2008 15:33:13 +0300 Subject: [ofa-general] [PATCH] ibutils: remove trailing blanks Message-ID: <48356809.3070405@dev.mellanox.co.il> Hi Oren, Removing trailing blanks in ibis.i Signed-off-by: Yevgeny Kliteynik --- ibis/src/ibis.i | 216 +++++++++++++++++++++++++++--------------------------- 1 files changed, 108 insertions(+), 108 deletions(-) diff --git a/ibis/src/ibis.i b/ibis/src/ibis.i index 21766f2..a3c9b45 100644 --- a/ibis/src/ibis.i +++ b/ibis/src/ibis.i @@ -96,11 +96,11 @@ ibisp_is_debug(void) // // TYPE MAPS: -// +// %include ibis_typemaps.i // -// exception handling wrapper based on the MsgMgr interfaces +// exception handling wrapper based on the MsgMgr interfaces // %{ @@ -110,7 +110,7 @@ ibisp_is_debug(void) void ibis_set_tcl_error(char *err) { if (strlen(err) < 1024) strcpy(ibis_tcl_error_msg, err); - else + else strncpy(ibis_tcl_error_msg, err, 1024); ibis_tcl_error = 1; } @@ -122,24 +122,24 @@ ibisp_is_debug(void) if (!IbisObj.initialized) { Tcl_SetStringObj( - Tcl_GetObjResult(interp), + Tcl_GetObjResult(interp), "ibis was not yet initialized. please use ibis_init and then ibis_set_port before.", -1); return TCL_ERROR; } - + if (! IbisObj.port_guid) { Tcl_SetStringObj( - Tcl_GetObjResult(interp), + Tcl_GetObjResult(interp), " ibis was not yet initialized. please use ibis_set_port before.", -1); return TCL_ERROR; } ibis_tcl_error = 0; - $function; - if (ibis_tcl_error) { + $function; + if (ibis_tcl_error) { Tcl_SetStringObj(Tcl_GetObjResult(interp), ibis_tcl_error_msg, -1); - return TCL_ERROR; + return TCL_ERROR; } } @@ -178,10 +178,10 @@ ibisp_is_debug(void) ibis_t IbisObj; static ibis_opt_t *ibis_opt_p; ibis_opt_t IbisOpts; - - /* initialize the ibis object - is not done during init so we + + /* initialize the ibis object - is not done during init so we can play with the options ... 
*/ - int ibis_ui_init(void) + int ibis_ui_init(void) { ib_api_status_t status; #ifdef OSM_BUILD_OPENIB @@ -202,8 +202,8 @@ ibisp_is_debug(void) printf("-E- fail to init ibcr_init.\n"); ibcr_destroy( p_ibcr_global ); exit(1); - } - + } + status = ibpm_init(p_ibpm_global); if( status != IB_SUCCESS ) { @@ -218,7 +218,7 @@ ibisp_is_debug(void) printf("-E- Fail to init ibvs_init.\n"); ibvs_destroy( p_ibvs_global ); exit(1); - } + } status = ibbbm_init(p_ibbbm_global); if( status != IB_SUCCESS ) @@ -226,7 +226,7 @@ ibisp_is_debug(void) printf("-E- Fail to init ibbbm_init.\n"); ibbbm_destroy( p_ibbbm_global ); exit(1); - } + } status = ibsm_init(gp_ibsm); if( status != IB_SUCCESS ) @@ -234,7 +234,7 @@ ibisp_is_debug(void) printf("-E- Fail to init ibbbm_init.\n"); ibsm_destroy( gp_ibsm ); exit(1); - } + } return 0; } @@ -265,12 +265,12 @@ ibisp_is_debug(void) /* simply return the active port guid ibis is binded to */ uint64_t ibis_get_port(void) { - return (IbisObj.port_guid); + return (IbisObj.port_guid); } /* set the port we bind to and initialize sub packages */ int ibis_set_port(uint64_t port_guid) - { + { ib_api_status_t status; if (! IbisObj.initialized) { @@ -280,7 +280,7 @@ ibisp_is_debug(void) } IbisObj.port_guid = port_guid; - + status = ibcr_bind(p_ibcr_global); if( status != IB_SUCCESS ) { @@ -303,15 +303,15 @@ ibisp_is_debug(void) printf("-E- Fail to ibvs_bind.\n"); ibvs_destroy( p_ibvs_global ); exit(1); - } - + } + status = ibbbm_bind(p_ibbbm_global); if( status != IB_SUCCESS ) { printf("-E- Fail to ibbbm_bind.\n"); ibbbm_destroy( p_ibbbm_global ); exit(1); - } + } status = ibsm_bind(gp_ibsm); if( status != IB_SUCCESS ) @@ -319,9 +319,9 @@ ibisp_is_debug(void) printf("-E- Fail to ibsm_bind.\n"); ibsm_destroy( gp_ibsm ); exit(1); - } + } - if (ibsac_bind(&IbisObj)) + if (ibsac_bind(&IbisObj)) { printf("-E- Fail to ibsac_bind.\n"); exit(1); @@ -345,7 +345,7 @@ ibisp_is_debug(void) } int ibis_set_transaction_timeout( uint32_t timeout_ms ) { - osm_log(&(IbisObj.log), + osm_log(&(IbisObj.log), OSM_LOG_VERBOSE, " Setting timeout to:%u[msec]\n", timeout_ms); IbisOpts.transaction_timeout = timeout_ms; @@ -364,16 +364,16 @@ ibisp_is_debug(void) Tcl_Obj *p_obj; if (!IbisObj.initialized) - { + { Tcl_SetStringObj( - Tcl_GetObjResult(interp), + Tcl_GetObjResult(interp), "ibis was not yet initialized. please use ibis_init before.", -1); return TCL_ERROR; } - + /* command options */ tcl_result = Tcl_GetObjResult(interp); - + if ((objc < 1) || (objc > 1)) { Tcl_SetStringObj(tcl_result,"Wrong # args. ibis_get_local_ports_info ",-1); return TCL_ERROR; @@ -394,12 +394,12 @@ ibisp_is_debug(void) return( TCL_ERROR ); } - /* - Go over all ports and build the return value + /* + Go over all ports and build the return value */ for( i = 0; i < num_ports; i++ ) { - + // start with 1 on host channel adapters. sprintf(res, "0x%016" PRIx64 " 0x%04X %s %u", cl_ntoh64( attr_array[i].port_guid ), @@ -407,11 +407,11 @@ ibisp_is_debug(void) ib_get_port_state_str( attr_array[i].link_state ), attr_array[i].port_num ); - + p_obj = Tcl_NewStringObj(res, strlen(res)); Tcl_ListObjAppendElement(interp, tcl_result, p_obj); } - + return TCL_OK; } @@ -421,7 +421,7 @@ ibisp_is_debug(void) // // INTERFACE DEFINITION (~copy of h file) -// +// %section "IBIS Constants" /* These constants are provided by IBIS: */ @@ -471,7 +471,7 @@ typedef struct _ibis_opt { /* If TRUE - forces flash after each log message (TRUE). 
*/ uint8_t log_flags; /* The log levels to be used */ - char log_file[1024]; + char log_file[1024]; /* The name of the log file used (read only) */ uint64_t sm_key; /* The SM_Key to be used when sending SubnetMgt and SubnetAdmin MADs */ @@ -497,7 +497,7 @@ int ibis_set_transaction_timeout(uint32_t timeout_ms); %text %{ ibis_get_local_ports_info [return list] - Return the list of available IB ports with GUID, LID and State. + Return the list of available IB ports with GUID, LID and State. %} extern char * ibisSourceVersion; @@ -509,12 +509,12 @@ extern char * ibisSourceVersion; /* Make sure that the osmv, complib and ibisp use same modes (debug/free) */ - if ( osm_is_debug() != cl_is_debug() || - osm_is_debug() != ibisp_is_debug() || + if ( osm_is_debug() != cl_is_debug() || + osm_is_debug() != ibisp_is_debug() || ibisp_is_debug() != cl_is_debug() ) { fprintf(stderr, "-E- OSMV, Complib and Ibis were compiled using different modes\n"); - fprintf(stderr, "-E- OSMV debug:%d Complib debug:%d IBIS debug:%d \n", + fprintf(stderr, "-E- OSMV debug:%d Complib debug:%d IBIS debug:%d \n", osm_is_debug(), cl_is_debug(), ibisp_is_debug() ); exit(1); } @@ -526,7 +526,7 @@ extern char * ibisSourceVersion; /* we initialize the structs etc only once. */ if (0 == notFirstTime++) { Tcl_StaticPackage(interp, "ibis", Ibis_Init, NULL); - Tcl_PkgProvide(interp, "ibis", IBIS_VERSION); + Tcl_PkgProvide(interp, "ibis", IBIS_VERSION); /* Default Options */ memset(&IbisOpts, 0,sizeof(ibis_opt_t)); IbisOpts.transaction_timeout = 4*OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC; @@ -538,45 +538,45 @@ extern char * ibisSourceVersion; IbisOpts.log_flags = OSM_LOG_ERROR; strcpy(IbisOpts.log_file,"/tmp/ibis.log"); - + /* we want all exists to cleanup */ Tcl_CreateExitHandler(ibis_exit, NULL); - + /* ------------------ IBCR ---------------------- */ p_ibcr_global = ibcr_construct(); - + if (p_ibcr_global == NULL) { printf("-E- Error from ibcr_construct.\n"); exit(1); - } - + } + /* ------------------ IBPM ---------------------- */ p_ibpm_global = ibpm_construct(); - + if (p_ibpm_global == NULL) { printf("-E- Error from ibpm_construct.\n"); exit(1); - } - + } + /* ------------------ IBVS ---------------------- */ p_ibvs_global = ibvs_construct(); - + if (p_ibvs_global == NULL) { printf("-E- Error from ibvs_construct.\n"); exit(1); - } + } /* ------------------ IBBBM ---------------------- */ p_ibbbm_global = ibbbm_construct(); - + if (p_ibbbm_global == NULL) { printf("-E- Error from ibbbm_construct.\n"); exit(1); - } + } /* ------------------ IBSM ---------------------- */ gp_ibsm = ibsm_construct(); - + if (gp_ibsm == NULL) { printf("-E- Error from ibsm_construct.\n"); exit(1); @@ -593,7 +593,7 @@ extern char * ibisSourceVersion; memset(&ibsm_sm_info_obj, 0, sizeof(ib_sm_info_t)); /* ------------------ IBSAC ---------------------- */ - + /* Initialize global records */ memset(&ibsac_node_rec, 0,sizeof(ibsac_node_rec)); memset(&ibsac_portinfo_rec, 0,sizeof(ibsac_portinfo_rec)); @@ -603,10 +603,10 @@ extern char * ibisSourceVersion; memset(&ibsac_path_rec, 0, sizeof(ib_path_rec_t)); memset(&ibsac_lft_rec, 0, sizeof(ib_lft_record_t)); memset(&ibsac_mcm_rec, 0, sizeof(ib_member_rec_t)); - - /* + + /* * A1 Supported features: - * + * * Query: Rec/Info Types Done * * NodeRecord (nr, ni) Y @@ -627,7 +627,7 @@ extern char * ibisSourceVersion; * VLArbTableRecord (vlarb) Y * PKeyTableRecord (pkr, pkt) Y */ - + /* We use alternate SWIG Objects mangling */ SWIG_AltMnglInit(); SWIG_AltMnglRegTypeToPrefix("_sacNodeInfo_p", "ni"); @@ -650,24 +650,24 
@@ extern char * ibisSourceVersion; SWIG_AltMnglRegTypeToPrefix("_sacVlArbRec_p", "vlarb"); SWIG_AltMnglRegTypeToPrefix("_sacPKeyTbl_p", "pkt"); SWIG_AltMnglRegTypeToPrefix("_sacPKeyRec_p", "pkr"); - + // register the pre-allocated objects SWIG_AltMnglRegObj("ni",&(ibsac_node_rec.node_info)); SWIG_AltMnglRegObj("nr",&(ibsac_node_rec)); - + SWIG_AltMnglRegObj("pi", &(ibsac_portinfo_rec.port_info)); SWIG_AltMnglRegObj("pir",&(ibsac_portinfo_rec)); - + SWIG_AltMnglRegObj("smi", &(ibsac_sminfo_rec.sm_info)); SWIG_AltMnglRegObj("smir",&(ibsac_sminfo_rec)); - + SWIG_AltMnglRegObj("swi", &(ibsac_swinfo_rec.switch_info)); SWIG_AltMnglRegObj("swir",&(ibsac_swinfo_rec)); SWIG_AltMnglRegObj("path",&(ibsac_path_rec)); - + SWIG_AltMnglRegObj("link",&(ibsac_link_rec)); - + SWIG_AltMnglRegObj("lft",&(ibsac_lft_rec)); SWIG_AltMnglRegObj("mcm",&(ibsac_mcm_rec)); @@ -675,18 +675,18 @@ extern char * ibisSourceVersion; SWIG_AltMnglRegObj("cpi",&(ibsac_class_port_info)); SWIG_AltMnglRegObj("info",&(ibsac_inform_info)); SWIG_AltMnglRegObj("svc",&(ibsac_svc_rec)); - + SWIG_AltMnglRegObj("slvt", &(ibsac_slvl_rec.slvl_tbl)); SWIG_AltMnglRegObj("slvr", &(ibsac_slvl_rec)); - + SWIG_AltMnglRegObj("vlarb", &(ibsac_vlarb_rec)); - + SWIG_AltMnglRegObj("pkt", &(ibsac_pkey_rec.pkey_tbl)); SWIG_AltMnglRegObj("pkr", &(ibsac_pkey_rec)); - + usleep(1000); } - + /* we defined this as a native command so declare it in here */ Tcl_CreateObjCommand(interp, "ibis_get_local_ports_info", ibis_get_local_ports_info, NULL, NULL); @@ -697,113 +697,113 @@ extern char * ibisSourceVersion; (ClientData)ibis_opt_p, 0); /* add commands for accessing the global query records */ - Tcl_CreateObjCommand(interp,"smNodeInfoMad", + Tcl_CreateObjCommand(interp,"smNodeInfoMad", TclsmNodeInfoMethodCmd, (ClientData)&ibsm_node_info_obj, 0); - - Tcl_CreateObjCommand(interp,"smPortInfoMad", + + Tcl_CreateObjCommand(interp,"smPortInfoMad", TclsmPortInfoMethodCmd, (ClientData)&ibsm_port_info_obj, 0); - - Tcl_CreateObjCommand(interp,"smSwitchInfoMad", + + Tcl_CreateObjCommand(interp,"smSwitchInfoMad", TclsmSwInfoMethodCmd, (ClientData)&ibsm_switch_info_obj, 0); - - Tcl_CreateObjCommand(interp,"smLftBlockMad", + + Tcl_CreateObjCommand(interp,"smLftBlockMad", TclsmLftBlockMethodCmd, (ClientData)&ibsm_lft_block_obj, 0); - - Tcl_CreateObjCommand(interp,"smMftBlockMad", + + Tcl_CreateObjCommand(interp,"smMftBlockMad", TclsmMftBlockMethodCmd, (ClientData)&ibsm_mft_block_obj, 0); - Tcl_CreateObjCommand(interp,"smGuidInfoMad", + Tcl_CreateObjCommand(interp,"smGuidInfoMad", TclsmGuidInfoMethodCmd, (ClientData)&ibsm_guid_info_obj, 0); - Tcl_CreateObjCommand(interp,"smPkeyTableMad", + Tcl_CreateObjCommand(interp,"smPkeyTableMad", TclsmPkeyTableMethodCmd, (ClientData)&ibsm_pkey_table_obj, 0); - Tcl_CreateObjCommand(interp,"smSlVlTableMad", + Tcl_CreateObjCommand(interp,"smSlVlTableMad", TclsmSlVlTableMethodCmd, (ClientData)&ibsm_slvl_table_obj, 0); - Tcl_CreateObjCommand(interp,"smVlArbTableMad", + Tcl_CreateObjCommand(interp,"smVlArbTableMad", TclsmVlArbTableMethodCmd, (ClientData)&ibsm_vl_arb_table_obj, 0); - Tcl_CreateObjCommand(interp,"smSMInfoMad", + Tcl_CreateObjCommand(interp,"smSMInfoMad", TclsmSMInfoMethodCmd, (ClientData)&ibsm_sm_info_obj, 0); - Tcl_CreateObjCommand(interp,"smNodeDescMad", + Tcl_CreateObjCommand(interp,"smNodeDescMad", TclsmNodeDescMethodCmd, (ClientData)&ibsm_node_desc_obj, 0); - Tcl_CreateObjCommand(interp,"smNoticeMad", + Tcl_CreateObjCommand(interp,"smNoticeMad", TclsmNoticeMethodCmd, (ClientData)&ibsm_notice_obj, 0); - 
Tcl_CreateObjCommand(interp,"sacNodeQuery", + Tcl_CreateObjCommand(interp,"sacNodeQuery", TclsacNodeRecMethodCmd, (ClientData)&ibsac_node_rec, 0); - - Tcl_CreateObjCommand(interp,"sacPortQuery", + + Tcl_CreateObjCommand(interp,"sacPortQuery", TclsacPortRecMethodCmd, (ClientData)&ibsac_portinfo_rec, 0); - - Tcl_CreateObjCommand(interp,"sacSmQuery", + + Tcl_CreateObjCommand(interp,"sacSmQuery", TclsacSmRecMethodCmd, (ClientData)&ibsac_sminfo_rec, 0); - - Tcl_CreateObjCommand(interp,"sacSwQuery", + + Tcl_CreateObjCommand(interp,"sacSwQuery", TclsacSwRecMethodCmd, (ClientData)&ibsac_swinfo_rec, 0); - - Tcl_CreateObjCommand(interp,"sacLinkQuery", + + Tcl_CreateObjCommand(interp,"sacLinkQuery", TclsacLinkRecMethodCmd, (ClientData)&ibsac_link_rec, 0); - Tcl_CreateObjCommand(interp,"sacPathQuery", + Tcl_CreateObjCommand(interp,"sacPathQuery", TclsacPathRecMethodCmd, (ClientData)&ibsac_path_rec, 0); - Tcl_CreateObjCommand(interp,"sacLFTQuery", + Tcl_CreateObjCommand(interp,"sacLFTQuery", TclsacLFTRecMethodCmd, (ClientData)&ibsac_lft_rec, 0); - Tcl_CreateObjCommand(interp,"sacMCMQuery", + Tcl_CreateObjCommand(interp,"sacMCMQuery", TclsacMCMRecMethodCmd, (ClientData)&ibsac_mcm_rec, 0); - Tcl_CreateObjCommand(interp,"sacClassPortInfoQuery", + Tcl_CreateObjCommand(interp,"sacClassPortInfoQuery", TclsacClassPortInfoMethodCmd, (ClientData)&ibsac_class_port_info, 0); - Tcl_CreateObjCommand(interp,"sacInformInfoQuery", + Tcl_CreateObjCommand(interp,"sacInformInfoQuery", TclsacInformInfoMethodCmd, (ClientData)&ibsac_inform_info, 0); - Tcl_CreateObjCommand(interp,"sacServiceQuery", + Tcl_CreateObjCommand(interp,"sacServiceQuery", TclsacServiceRecMethodCmd, (ClientData)&ibsac_svc_rec, 0); - Tcl_CreateObjCommand(interp,"sacSLVlQuery", + Tcl_CreateObjCommand(interp,"sacSLVlQuery", TclsacSlVlRecMethodCmd, (ClientData)&ibsac_slvl_rec, 0); - Tcl_CreateObjCommand(interp,"sacVlArbQuery", + Tcl_CreateObjCommand(interp,"sacVlArbQuery", TclsacVlArbRecMethodCmd, (ClientData)&ibsac_vlarb_rec, 0); - Tcl_CreateObjCommand(interp,"sacPKeyQuery", + Tcl_CreateObjCommand(interp,"sacPKeyQuery", TclsacPKeyRecMethodCmd, (ClientData)&ibsac_pkey_rec, 0); /* - use an embedded Tcl code for doing init if given command line - parameters: -port_num + use an embedded Tcl code for doing init if given command line + parameters: -port_num */ Tcl_GlobalEval( interp, @@ -835,4 +835,4 @@ extern char * ibisSourceVersion; "}\n"); } %} - + -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Thu May 22 05:56:32 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 22 May 2008 15:56:32 +0300 Subject: [ofa-general] [PATCH] ibutils: remove trailing blanks in Makefile.am Message-ID: <48356D80.60805@dev.mellanox.co.il> Hi Oren, Removing trailing blanks in ibis/src/Makefile.am Signed-off-by: Yevgeny Kliteynik --- ibis/src/Makefile.am | 12 ++++++------ 1 files changed, 6 insertions(+), 6 deletions(-) diff --git a/ibis/src/Makefile.am b/ibis/src/Makefile.am index 6b1fe09..7166906 100644 --- a/ibis/src/Makefile.am +++ b/ibis/src/Makefile.am @@ -48,13 +48,13 @@ AM_CXXFLAGS = $(TCL_CPPFLAGS) $(OSM_CFLAGS) $(DBG) -fno-strict-aliasing -fPIC - # ibis shared library version triplet is: # API_ID:API_VER:NUM_PREV_API_SUP = x:y:z # * change of API_ID means new API -# * change of AGE means how many API backward compt +# * change of AGE means how many API backward compt # * change of API_VER is required every version # Results with SO version: x-z:z:y LIB_VER_TRIPLET="1:0:0" LIB_FILE_TRIPLET=1.0.0 -lib_LTLIBRARIES = libibiscom.la libibis.la 
+lib_LTLIBRARIES = libibiscom.la libibis.la libibiscom_la_SOURCES = ibbbm.c ibcr.c ibis.c ibis_gsi_mad_ctrl.c \ ibpm.c ibsac.c ibsm.c ibvs.c @@ -74,9 +74,9 @@ LDADD = $(OSM_LDFLAGS) ibis_SOURCES = ibissh_wrap.cpp -ibis_LDFLAGS = -static +ibis_LDFLAGS = -static # note the order of the libraries does matter as we static link -ibis_LDADD = -libiscom $(OSM_LDFLAGS) $(TCL_LIBS) +ibis_LDADD = -libiscom $(OSM_LDFLAGS) $(TCL_LIBS) # SWIG FILES: @@ -122,7 +122,7 @@ ibis_wrap.c: @MAINTAINER_MODE_TRUE@ $(SWIG_IFC_FILES) swig -I$(srcdir) -dhtml -tcl8 -o swig_wrap.c $(srcdir)/ibis.i $(srcdir)/fixSwigWrapper -g -s -p -o ibis_wrap.c if test ibis_wrap.c -ef $(srcdir)/ibis_wrap.c; then cp ibis_wrap.c $(srcdir)/ibis_wrap.c; fi - rm -f swig_wrap.c + rm -f swig_wrap.c ibissh_wrap.cpp: @MAINTAINER_MODE_TRUE@ $(SWIG_IFC_FILES) swig -I$(srcdir) -dhtml -tcl8 -ltclsh.i -o swig_wrap.c $(srcdir)/ibis.i @@ -156,7 +156,7 @@ EXTRA_DIST = swig_extended_obj.c fixSwigWrapper pkgIndex.tcl \ install-libLTLIBRARIES: # this actually over write the lib install -install-exec-am: install-binPROGRAMS +install-exec-am: install-binPROGRAMS mkdir -p $(DESTDIR)/$(libdir)/ibis$(VERSION) cp .libs/libibis.so.$(LIB_FILE_TRIPLET) $(DESTDIR)/$(libdir)/ibis$(VERSION)/libibis.so.$(VERSION) sed 's/%VERSION%/'$(VERSION)'/g' $(srcdir)/pkgIndex.tcl > $(DESTDIR)/$(libdir)/ibis$(VERSION)/pkgIndex.tcl -- 1.5.1.4
From swise at opengridcomputing.com Thu May 22 06:45:14 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 22 May 2008 08:45:14 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <48351310.5090108@voltaire.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> Message-ID: <483578EA.4070503@opengridcomputing.com> Or Gerlitz wrote: > Dror Goldenberg wrote: >> When you post a fast register WQE, you specify the new 8 LSBits to >> be assigned to the MR. The rest 24 MSBits are the ones that you obtained >> while allocating the MR, and they persist throughout the lifetime of >> this MR. > OK, thanks Dror. > > Steve, do we agree on this point? if yes, the next version of the > patches should include the new rkey value (or just the new 8 LSbits) in > the work request. > Are we sure we need to expose this to the user? > Or. >
From sashak at voltaire.com Thu May 22 06:53:29 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 22 May 2008 16:53:29 +0300 Subject: [ofa-general] [PATCH] saquery: --smkey command line option Message-ID: <20080522135329.GB32128@sashak.voltaire.com> This adds the possibility to specify an SM_Key value with saquery. It should work with queries where OSM_DEFAULT_SM_KEY was used. Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/saquery.c | 11 ++++++++--- 1 files changed, 8 insertions(+), 3 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index ed61721..8edac5d 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -69,6 +69,7 @@ char *argv0 = "saquery";
@@ -730,7 +731,7 @@ get_all_records(osm_bind_handle_t bind_handle, int trusted) { return get_any_records(bind_handle, query_id, 0, 0, NULL, attr_offset, - trusted ? OSM_DEFAULT_SM_KEY : 0); + trusted ? smkey : 0); } /** @@ -1254,8 +1255,7 @@ print_pkey_tbl_records(const struct query_cmd *q, osm_bind_handle_t bind_handle, status = get_any_records(bind_handle, IB_MAD_ATTR_PKEY_TBL_RECORD, 0, comp_mask, &pktr, - ib_get_attr_offset(sizeof(pktr)), - OSM_DEFAULT_SM_KEY); + ib_get_attr_offset(sizeof(pktr)), smkey); if (status != IB_SUCCESS) return status; @@ -1411,6 +1411,7 @@ usage(void) "IPv6 format\n"); fprintf(stderr, " -C specify the SA query HCA\n"); fprintf(stderr, " -P specify the SA query port\n"); + fprintf(stderr, " --smkey specify SM_Key value for the query\n"); fprintf(stderr, " -t | --timeout specify the SA query " "response timeout (default %u msec)\n", DEFAULT_SA_TIMEOUT_MS); @@ -1466,6 +1467,7 @@ main(int argc, char **argv) {"sgid-to-dgid", 1, 0, 2}, {"timeout", 1, 0, 't'}, {"node-name-map", 1, 0, 3}, + {"smkey", 1, 0, 4}, { } }; @@ -1512,6 +1514,9 @@ main(int argc, char **argv) case 3: node_name_map_file = strdup(optarg); break; + case 4: + smkey = cl_hton64(strtoull(optarg, NULL, 0)); + break; case 'p': query_type = IB_MAD_ATTR_PATH_RECORD; break; -- 1.5.5.1.178.g1f811 From monis at Voltaire.COM Thu May 22 06:58:33 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Thu, 22 May 2008 16:58:33 +0300 Subject: [ofa-general] Re: [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> Message-ID: <48357C09.1040302@Voltaire.COM> Hal, Roland Thanks for the comments. The patch below tries to address the issues that were raised in its previous form. Please note that I'm only asking for opinion for now. If the idea is acceptable then I will recreate more elegant patch with the required fixes if any and with respect to previous comments (such as replacing 0,1 and 2 with textual names). The idea in few words is to flush only paths but keeping address handles in ipoib_neigh. This will trigger a new path lookup when an ARP probe arrives and eventually an addess handle renewal. In the meantime, the old address handle is kept and can be used. In most cases this address handle is a valid address handle and when it is not than the situatio is not worse than before. My tests show that this patch completes the improvement that was archived with patch #1 to zero packet loss (tested with ping flood) when SM change event occurs. 
thanks MoniS diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index ca126fc..8ef6573 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -276,10 +276,11 @@ struct ipoib_dev_priv { struct delayed_work pkey_poll_task; struct delayed_work mcast_task; - struct work_struct flush_task; + struct work_struct flush_task0; + struct work_struct flush_task1; + struct work_struct flush_task2; struct work_struct restart_task; struct delayed_work ah_reap_task; - struct work_struct pkey_event_task; struct ib_device *ca; u8 port; @@ -423,11 +424,14 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_ah *address, u32 qpn); void ipoib_reap_ah(struct work_struct *work); +void ipoib_flush_paths_only(struct net_device *dev); void ipoib_flush_paths(struct net_device *dev); struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); -void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_ib_dev_flush0(struct work_struct *work); +void ipoib_ib_dev_flush1(struct work_struct *work); +void ipoib_ib_dev_flush2(struct work_struct *work); void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index f429bce..5a6bbe8 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -898,7 +898,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) return 0; } -static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int level) { struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; @@ -911,7 +911,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) * the parent is down. 
*/ list_for_each_entry(cpriv, &priv->child_intfs, list) - __ipoib_ib_dev_flush(cpriv, pkey_event); + __ipoib_ib_dev_flush(cpriv, level); mutex_unlock(&priv->vlan_mutex); @@ -925,7 +925,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) return; } - if (pkey_event) { + if (level == 2) { if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); ipoib_ib_dev_down(dev, 0); @@ -943,11 +943,13 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) priv->pkey_index = new_index; } - ipoib_dbg(priv, "flushing\n"); - - ipoib_ib_dev_down(dev, 0); + ipoib_flush_paths_only(dev); + ipoib_mcast_dev_flush(dev); + + if (level >= 1) + ipoib_ib_dev_down(dev, 0); - if (pkey_event) { + if (level >= 2) { ipoib_ib_dev_stop(dev, 0); ipoib_ib_dev_open(dev); } @@ -957,29 +959,36 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) * we get here, don't bring it back up if it's not configured up */ if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) { - ipoib_ib_dev_up(dev); + if (level >= 1) + ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } } -void ipoib_ib_dev_flush(struct work_struct *work) +void ipoib_ib_dev_flush0(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + container_of(work, struct ipoib_dev_priv, flush_task0); - ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); __ipoib_ib_dev_flush(priv, 0); } -void ipoib_pkey_event(struct work_struct *work) +void ipoib_ib_dev_flush1(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_event_task); + container_of(work, struct ipoib_dev_priv, flush_task1); - ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name); __ipoib_ib_dev_flush(priv, 1); } +void ipoib_ib_dev_flush2(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, flush_task2); + + __ipoib_ib_dev_flush(priv, 2); +} + void ipoib_ib_dev_cleanup(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 2442090..c41798d 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -259,6 +259,21 @@ static int __path_add(struct net_device *dev, struct ipoib_path *path) return 0; } +static void path_free_only(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tn; + struct sk_buff *skb; + unsigned long flags; + + while ((skb = __skb_dequeue(&path->queue))) + dev_kfree_skb_irq(skb); + + if (path->ah) + ipoib_put_ah(path->ah); + + kfree(path); +} static void path_free(struct net_device *dev, struct ipoib_path *path) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -350,6 +365,34 @@ void ipoib_path_iter_read(struct ipoib_path_iter *iter, #endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ +void ipoib_flush_paths_only(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path, *tp; + LIST_HEAD(remove_list); + + spin_lock_irq(&priv->tx_lock); + spin_lock(&priv->lock); + + list_splice_init(&priv->path_list, &remove_list); + + list_for_each_entry(path, &remove_list, list) + rb_erase(&path->rb_node, &priv->path_tree); + + list_for_each_entry_safe(path, tp, &remove_list, list) { + if (path->query) + 
ib_sa_cancel_query(path->query_id, path->query); + spin_unlock(&priv->lock); + spin_unlock_irq(&priv->tx_lock); + wait_for_completion(&path->done); + path_free_only(dev, path); + spin_lock_irq(&priv->tx_lock); + spin_lock(&priv->lock); + } + spin_unlock(&priv->lock); + spin_unlock_irq(&priv->tx_lock); +} + void ipoib_flush_paths(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -421,6 +464,8 @@ static void path_rec_completion(int status, __skb_queue_tail(&skqueue, skb); list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); kref_get(&path->ah->ref); neigh->ah = path->ah; memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw, @@ -989,9 +1034,10 @@ static void ipoib_setup(struct net_device *dev) INIT_LIST_HEAD(&priv->multicast_list); INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); - INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); - INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); + INIT_WORK(&priv->flush_task0, ipoib_ib_dev_flush0); + INIT_WORK(&priv->flush_task1, ipoib_ib_dev_flush1); + INIT_WORK(&priv->flush_task2, ipoib_ib_dev_flush2); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index 8766d29..80c0409 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -289,15 +289,16 @@ void ipoib_event(struct ib_event_handler *handler, if (record->element.port_num != priv->port) return; - if (record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PORT_ACTIVE || - record->event == IB_EVENT_LID_CHANGE || - record->event == IB_EVENT_SM_CHANGE || - record->event == IB_EVENT_CLIENT_REREGISTER) { - ipoib_dbg(priv, "Port state change event\n"); - queue_work(ipoib_workqueue, &priv->flush_task); + ipoib_dbg(priv, "Event %d on device %s port %d\n",record->event, + record->device->name, record->element.port_num); + if ( record->event == IB_EVENT_SM_CHANGE || + record->event == IB_EVENT_CLIENT_REREGISTER) { + queue_work(ipoib_workqueue, &priv->flush_task0); + } else if (record->event == IB_EVENT_PORT_ERR || + record->event == IB_EVENT_PORT_ACTIVE || + record->event == IB_EVENT_LID_CHANGE) { + queue_work(ipoib_workqueue, &priv->flush_task1); } else if (record->event == IB_EVENT_PKEY_CHANGE) { - ipoib_dbg(priv, "P_Key change event on port:%d\n", priv->port); - queue_work(ipoib_workqueue, &priv->pkey_event_task); + queue_work(ipoib_workqueue, &priv->flush_task2); } }
From sashak at voltaire.com Thu May 22 07:09:16 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 22 May 2008 17:09:16 +0300 Subject: [ofa-general] OSM_DEFAULT_SM_KEY byte order Message-ID: <20080522140916.GC32128@sashak.voltaire.com> Hi, I noticed that the OSM_DEFAULT_SM_KEY macro is defined and used in host byte order, which means it has different values on LE and BE machines (as a result we could see some osmtest failures between x86 and G5).
The fix could be trivial: diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h index 62d472e..7cc2757 100644 --- a/opensm/include/opensm/osm_base.h +++ b/opensm/include/opensm/osm_base.h @@ -117,7 +117,7 @@ BEGIN_C_DECLS * * SYNOPSIS */ -#define OSM_DEFAULT_SM_KEY 1 +#define OSM_DEFAULT_SM_KEY CL_HTON64(1) /********/ /****s* OpenSM: Base/OSM_DEFAULT_LMC * NAME , but some backward compatibility (currently I know that OSM_DEFAULT_SM_KEY is used with 'osmtest' and 'saquery') could be lost. Is this so important? Ideas? Sasha
From swise at opengridcomputing.com Thu May 22 07:22:42 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 22 May 2008 09:22:42 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4833EC5B.8070504@voltaire.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EC5B.8070504@voltaire.com> Message-ID: <483581B2.7000109@opengridcomputing.com> Or Gerlitz wrote: > Steve Wise wrote: >> So you allocate the rkey/stag up front, allocate page_lists up front, >> then as needed you populate your page list and bind it to the >> rkey/stag via IB_WR_FAST_REG_MR, and invalidate that mapping via >> IB_WR_INVALIDATE_MR. You can do this any number of times, and with >> proper fencing, you can pipeline these mappings. Eventually when >> you're done doing IO (like for NFSRDMA when the mount is unmounted) >> you free up the page list(s) and mr/rkey/stag. > Yes, that was my thought as well. > > Just to make sure, by "proper fencing" your understanding is that for > both IB and iWARP the ULP should not wait for the fmr work request to > complete and post the send work-request carrying the rkey/stag with the > IB_SEND_FENCE flag? > > Looking in the IB spec, it seems that the fence indicator only applies > to previous rdma-read / atomic operations, e.g. in section 11.4.1.1 POST > SEND REQUEST it says: >> Fence indicator. If the fence indicator is set, then all prior RDMA >> Read and Atomic Work Requests on the queue must be completed before >> starting to process this Work Request. > The fast register and invalidate work requests require that they be completed by the device _before_ processing any subsequent work requests. So you can post subsequent SEND WRs that utilize the rkey without problems. In addition, invalidate allows a local fence, which means the device will not begin processing the invalidate until all _prior_ work requests complete (similar to a read fence, but for all prior WRs). Steve.
From ogerlitz at voltaire.com Thu May 22 07:33:12 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 22 May 2008 17:33:12 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <483578EA.4070503@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> Message-ID: <48358428.2000902@voltaire.com> Steve Wise wrote: > Are we sure we need to expose this to the user? I believe this is the way to go if we want to let smart ULPs generate a new rkey/stag per mapping.
Simpler ULPs could then just put the same value for each map associated with the same mr. Or.
From hrosenstock at xsigo.com Thu May 22 07:46:49 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 22 May 2008 07:46:49 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080522135329.GB32128@sashak.voltaire.com> References: <20080522135329.GB32128@sashak.voltaire.com> Message-ID: <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> Sasha, On Thu, 2008-05-22 at 16:53 +0300, Sasha Khapyorsky wrote: > This adds the possibility to specify an SM_Key value with saquery. It should > work with queries where OSM_DEFAULT_SM_KEY was used. I think this starts down a slippery slope and perhaps sets a bad precedent for MKey as well. I know this is useful as a debug tool, but it compromises what purports to be "security" IMO, as it means the keys need to be too widely known. -- Hal > Signed-off-by: Sasha Khapyorsky > --- > infiniband-diags/src/saquery.c | 11 ++++++++--- > 1 files changed, 8 insertions(+), 3 deletions(-) > > diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c > index ed61721..8edac5d 100644 > --- a/infiniband-diags/src/saquery.c > +++ b/infiniband-diags/src/saquery.c > @@ -69,6 +69,7 @@ char *argv0 = "saquery"; > > static char *node_name_map_file = NULL; > static nn_map_t *node_name_map = NULL; > +static ib_net64_t smkey = OSM_DEFAULT_SM_KEY; > > /** > * Declare some globals because I don't want this to be too complex. > @@ -730,7 +731,7 @@ get_all_records(osm_bind_handle_t bind_handle, > int trusted) > { > return get_any_records(bind_handle, query_id, 0, 0, NULL, attr_offset, > - trusted ? OSM_DEFAULT_SM_KEY : 0); > + trusted ? smkey : 0); > } > > /** > @@ -1254,8 +1255,7 @@ print_pkey_tbl_records(const struct query_cmd *q, osm_bind_handle_t bind_handle, > > status = get_any_records(bind_handle, IB_MAD_ATTR_PKEY_TBL_RECORD, 0, > comp_mask, &pktr, > - ib_get_attr_offset(sizeof(pktr)), > - OSM_DEFAULT_SM_KEY); > + ib_get_attr_offset(sizeof(pktr)), smkey); > if (status != IB_SUCCESS) > return status; > > @@ -1411,6 +1411,7 @@ usage(void) > "IPv6 format\n"); > fprintf(stderr, " -C specify the SA query HCA\n"); > fprintf(stderr, " -P specify the SA query port\n"); > + fprintf(stderr, " --smkey specify SM_Key value for the query\n"); > fprintf(stderr, " -t | --timeout specify the SA query " > "response timeout (default %u msec)\n", > DEFAULT_SA_TIMEOUT_MS); > @@ -1466,6 +1467,7 @@ main(int argc, char **argv) > {"sgid-to-dgid", 1, 0, 2}, > {"timeout", 1, 0, 't'}, > {"node-name-map", 1, 0, 3}, > + {"smkey", 1, 0, 4}, > { } > }; > > @@ -1512,6 +1514,9 @@ main(int argc, char **argv) > case 3: > node_name_map_file = strdup(optarg); > break; > + case 4: > + smkey = cl_hton64(strtoull(optarg, NULL, 0)); > + break; > case 'p': > query_type = IB_MAD_ATTR_PATH_RECORD; > break;
From sashak at voltaire.com Thu May 22 07:48:10 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 22 May 2008 17:48:10 +0300 Subject: [ofa-general] [PATCH] opensm/scripts/opensm.init.in: fix status command Message-ID: <20080522144810.GD32128@sashak.voltaire.com> This script is installed on SuSE systems where the 'status' command/shell function doesn't exist (bug#982 https://bugs.openfabrics.org/show_bug.cgi?id=982).
Signed-off-by: Sasha Khapyorsky --- opensm/scripts/opensm.init.in | 8 +++++++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/opensm/scripts/opensm.init.in b/opensm/scripts/opensm.init.in index da23b36..a573662 100644 --- a/opensm/scripts/opensm.init.in +++ b/opensm/scripts/opensm.init.in @@ -81,7 +81,13 @@ stop () { } Xstatus () { - status opensm + pid="`pidof opensm`" + ret=$? + if [ $ret -eq 0 ] ; then + echo "OpenSM is running... pid=$pid" + else + echo "OpenSM is not running." + fi } restart() { -- 1.5.4.rc2.60.gb2e62 From hrosenstock at xsigo.com Thu May 22 07:52:41 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 22 May 2008 07:52:41 -0700 Subject: [ofa-general] Re: OSM_DEFAULT_SM_KEY byte order In-Reply-To: <20080522140916.GC32128@sashak.voltaire.com> References: <20080522140916.GC32128@sashak.voltaire.com> Message-ID: <1211467961.18236.178.camel@hrosenstock-ws.xsigo.com> Sasha, On Thu, 2008-05-22 at 17:09 +0300, Sasha Khapyorsky wrote: > Hi, > > I noticed that OSM_DEFAULT_SM_KEY macro is defined and used in host byte > order, this means it has different values on LE and BE machines (as > result we could see some osmtest failures between x86 and G5). The fix > could be trivial: > diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h > index 62d472e..7cc2757 100644 > --- a/opensm/include/opensm/osm_base.h > +++ b/opensm/include/opensm/osm_base.h > @@ -117,7 +117,7 @@ BEGIN_C_DECLS > * > * SYNOPSIS > */ > -#define OSM_DEFAULT_SM_KEY 1 > +#define OSM_DEFAULT_SM_KEY CL_HTON64(1) > /********/ > /****s* OpenSM: Base/OSM_DEFAULT_LMC > * NAME > > > , but sort of backward compatibility (currently I know that > OSM_DEFAULT_SM_KEY is used with 'osmtest' and 'saquery') could be lost. > Is this so important? Ideas? IMO yes, I think this breaks both backward compatibility and what was actually observed from some other SMs during interop testing. I agree it needs fixing but I think the proper thing is probably more like: #define OSM_DEFAULT_SM_KEY CL_HTON64(0x0100000000000000); -- Hal > Sasha From kliteyn at mellanox.co.il Thu May 22 07:56:28 2008 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 22 May 2008 17:56:28 +0300 Subject: [ofa-general] [PATCH] opensm/scripts/opensm.init.in: fix status command In-Reply-To: <20080522144810.GD32128@sashak.voltaire.com> References: <20080522144810.GD32128@sashak.voltaire.com> Message-ID: <4835899C.80400@mellanox.co.il> Great, thanks. -- Yevgeny Sasha Khapyorsky wrote: > This script is installed in SuSE systems where 'status' command/shell > function doesn't exist (bug#982 > https://bugs.openfabrics.org/show_bug.cgi?id=982). > > Signed-off-by: Sasha Khapyorsky > --- > opensm/scripts/opensm.init.in | 8 +++++++- > 1 files changed, 7 insertions(+), 1 deletions(-) > > diff --git a/opensm/scripts/opensm.init.in b/opensm/scripts/opensm.init.in > index da23b36..a573662 100644 > --- a/opensm/scripts/opensm.init.in > +++ b/opensm/scripts/opensm.init.in > @@ -81,7 +81,13 @@ stop () { > } > > Xstatus () { > - status opensm > + pid="`pidof opensm`" > + ret=$? > + if [ $ret -eq 0 ] ; then > + echo "OpenSM is running... pid=$pid" > + else > + echo "OpenSM is not running." 
> + fi
> }
>
> restart() {
>

From sashak at voltaire.com  Thu May 22 07:56:07 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 22 May 2008 17:56:07 +0300
Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option
In-Reply-To: <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com>
References: <20080522135329.GB32128@sashak.voltaire.com>
	<1211467609.18236.171.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080522145607.GE32128@sashak.voltaire.com>

On 07:46 Thu 22 May     , Hal Rosenstock wrote:
> Sasha,
>
> On Thu, 2008-05-22 at 16:53 +0300, Sasha Khapyorsky wrote:
> > This adds possibility to specify SM_Key value with saquery. It should
> > work with queries where OSM_DEFAULT_SM_KEY was used.
>
> I think this starts down a slippery slope and perhaps bad precedent for
> MKey as well. I know this is useful as a debug tool but compromises what
> purports as "security" IMO as this means the keys need to be too widely
> known.

When a value different from OSM_DEFAULT_SM_KEY is configured on the OpenSM
side, a user may or may not know it; in the latter case saquery will not
work (just like now). I don't see a hole.

Sasha

From hrosenstock at xsigo.com  Thu May 22 08:07:06 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Thu, 22 May 2008 08:07:06 -0700
Subject: [ofa-general] Re: [PATCH 2/2] IB/IPoIB: Separate IB events to
	groups and handle each according to level of severity
In-Reply-To: <48357C09.1040302@Voltaire.COM>
References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM>
	<48357C09.1040302@Voltaire.COM>
Message-ID: <1211468826.18236.184.camel@hrosenstock-ws.xsigo.com>

Moni,

On Thu, 2008-05-22 at 16:58 +0300, Moni Shoua wrote:
> Hal, Roland
> Thanks for the comments. The patch below tries to address the issues that were
> raised in its previous form. Please note that I'm only asking for opinion for now.
> If the idea is acceptable then I will recreate more elegant patch with the required
> fixes if any and with respect to previous comments (such as replacing 0,1 and 2 with
> textual names).
>
> The idea in few words is to flush only paths but keeping address handles in ipoib_neigh.
> This will trigger a new path lookup when an ARP probe arrives and eventually an address
> handle renewal. In the meantime, the old address handle is kept and can be used. In most
> cases this address handle is a valid address handle, and when it is not, the situation
> is not worse than before.

This part seems OK to me.

> My tests show that this patch completes the improvement that was achieved with patch #1
> to zero packet loss (tested with ping flood) when SM change event occurs.

Looks to me like SM change is still "level 0". I may have missed it but
I don't see how this addresses the general architectural concerns
previously raised. This patch may work in your test environment but I
don't think that covers all the cases.
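For reference, the mechanism that does seem OK -- keeping the old address
handle alive until the new path record resolves -- comes down to the
path_rec_completion() hunk in the patch below. A simplified sketch of that
hunk follows; the comments are added here for explanation and are not part
of the patch:

	/* When the deferred SA path query completes, each neigh drops its
	 * reference on the stale AH and takes one on the freshly resolved
	 * AH. Senders still holding the old AH keep a valid handle until
	 * its refcount falls to zero, so traffic can continue while the
	 * query is outstanding. */
	list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) {
		if (neigh->ah)
			ipoib_put_ah(neigh->ah);	/* release the stale AH */
		kref_get(&path->ah->ref);		/* hold the new AH */
		neigh->ah = path->ah;			/* switch senders over */
	}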
-- Hal > thanks > > MoniS > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h > index ca126fc..8ef6573 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib.h > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h > @@ -276,10 +276,11 @@ struct ipoib_dev_priv { > > struct delayed_work pkey_poll_task; > struct delayed_work mcast_task; > - struct work_struct flush_task; > + struct work_struct flush_task0; > + struct work_struct flush_task1; > + struct work_struct flush_task2; > struct work_struct restart_task; > struct delayed_work ah_reap_task; > - struct work_struct pkey_event_task; > > struct ib_device *ca; > u8 port; > @@ -423,11 +424,14 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, > struct ipoib_ah *address, u32 qpn); > void ipoib_reap_ah(struct work_struct *work); > > +void ipoib_flush_paths_only(struct net_device *dev); > void ipoib_flush_paths(struct net_device *dev); > struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); > > int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); > -void ipoib_ib_dev_flush(struct work_struct *work); > +void ipoib_ib_dev_flush0(struct work_struct *work); > +void ipoib_ib_dev_flush1(struct work_struct *work); > +void ipoib_ib_dev_flush2(struct work_struct *work); > void ipoib_pkey_event(struct work_struct *work); > void ipoib_ib_dev_cleanup(struct net_device *dev); > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > index f429bce..5a6bbe8 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > @@ -898,7 +898,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) > return 0; > } > > -static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int level) > { > struct ipoib_dev_priv *cpriv; > struct net_device *dev = priv->dev; > @@ -911,7 +911,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > * the parent is down. 
> */ > list_for_each_entry(cpriv, &priv->child_intfs, list) > - __ipoib_ib_dev_flush(cpriv, pkey_event); > + __ipoib_ib_dev_flush(cpriv, level); > > mutex_unlock(&priv->vlan_mutex); > > @@ -925,7 +925,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > return; > } > > - if (pkey_event) { > + if (level == 2) { > if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { > clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > ipoib_ib_dev_down(dev, 0); > @@ -943,11 +943,13 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > priv->pkey_index = new_index; > } > > - ipoib_dbg(priv, "flushing\n"); > - > - ipoib_ib_dev_down(dev, 0); > + ipoib_flush_paths_only(dev); > + ipoib_mcast_dev_flush(dev); > + > + if (level >= 1) > + ipoib_ib_dev_down(dev, 0); > > - if (pkey_event) { > + if (level >= 2) { > ipoib_ib_dev_stop(dev, 0); > ipoib_ib_dev_open(dev); > } > @@ -957,29 +959,36 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > * we get here, don't bring it back up if it's not configured up > */ > if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) { > - ipoib_ib_dev_up(dev); > + if (level >= 1) > + ipoib_ib_dev_up(dev); > ipoib_mcast_restart_task(&priv->restart_task); > } > } > > -void ipoib_ib_dev_flush(struct work_struct *work) > +void ipoib_ib_dev_flush0(struct work_struct *work) > { > struct ipoib_dev_priv *priv = > - container_of(work, struct ipoib_dev_priv, flush_task); > + container_of(work, struct ipoib_dev_priv, flush_task0); > > - ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); > __ipoib_ib_dev_flush(priv, 0); > } > > -void ipoib_pkey_event(struct work_struct *work) > +void ipoib_ib_dev_flush1(struct work_struct *work) > { > struct ipoib_dev_priv *priv = > - container_of(work, struct ipoib_dev_priv, pkey_event_task); > + container_of(work, struct ipoib_dev_priv, flush_task1); > > - ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name); > __ipoib_ib_dev_flush(priv, 1); > } > > +void ipoib_ib_dev_flush2(struct work_struct *work) > +{ > + struct ipoib_dev_priv *priv = > + container_of(work, struct ipoib_dev_priv, flush_task2); > + > + __ipoib_ib_dev_flush(priv, 2); > +} > + > void ipoib_ib_dev_cleanup(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 2442090..c41798d 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -259,6 +259,21 @@ static int __path_add(struct net_device *dev, struct ipoib_path *path) > return 0; > } > > +static void path_free_only(struct net_device *dev, struct ipoib_path *path) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + struct ipoib_neigh *neigh, *tn; > + struct sk_buff *skb; > + unsigned long flags; > + > + while ((skb = __skb_dequeue(&path->queue))) > + dev_kfree_skb_irq(skb); > + > + if (path->ah) > + ipoib_put_ah(path->ah); > + > + kfree(path); > +} > static void path_free(struct net_device *dev, struct ipoib_path *path) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > @@ -350,6 +365,34 @@ void ipoib_path_iter_read(struct ipoib_path_iter *iter, > > #endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ > > +void ipoib_flush_paths_only(struct net_device *dev) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + struct ipoib_path *path, *tp; > + LIST_HEAD(remove_list); > + > + spin_lock_irq(&priv->tx_lock); > + spin_lock(&priv->lock); > + > 
+ list_splice_init(&priv->path_list, &remove_list); > + > + list_for_each_entry(path, &remove_list, list) > + rb_erase(&path->rb_node, &priv->path_tree); > + > + list_for_each_entry_safe(path, tp, &remove_list, list) { > + if (path->query) > + ib_sa_cancel_query(path->query_id, path->query); > + spin_unlock(&priv->lock); > + spin_unlock_irq(&priv->tx_lock); > + wait_for_completion(&path->done); > + path_free_only(dev, path); > + spin_lock_irq(&priv->tx_lock); > + spin_lock(&priv->lock); > + } > + spin_unlock(&priv->lock); > + spin_unlock_irq(&priv->tx_lock); > +} > + > void ipoib_flush_paths(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > @@ -421,6 +464,8 @@ static void path_rec_completion(int status, > __skb_queue_tail(&skqueue, skb); > > list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { > + if (neigh->ah) > + ipoib_put_ah(neigh->ah); > kref_get(&path->ah->ref); > neigh->ah = path->ah; > memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw, > @@ -989,9 +1034,10 @@ static void ipoib_setup(struct net_device *dev) > INIT_LIST_HEAD(&priv->multicast_list); > > INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); > - INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); > INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); > - INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); > + INIT_WORK(&priv->flush_task0, ipoib_ib_dev_flush0); > + INIT_WORK(&priv->flush_task1, ipoib_ib_dev_flush1); > + INIT_WORK(&priv->flush_task2, ipoib_ib_dev_flush2); > INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); > INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah); > } > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > index 8766d29..80c0409 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > @@ -289,15 +289,16 @@ void ipoib_event(struct ib_event_handler *handler, > if (record->element.port_num != priv->port) > return; > > - if (record->event == IB_EVENT_PORT_ERR || > - record->event == IB_EVENT_PORT_ACTIVE || > - record->event == IB_EVENT_LID_CHANGE || > - record->event == IB_EVENT_SM_CHANGE || > - record->event == IB_EVENT_CLIENT_REREGISTER) { > - ipoib_dbg(priv, "Port state change event\n"); > - queue_work(ipoib_workqueue, &priv->flush_task); > + ipoib_dbg(priv, "Event %d on device %s port %d\n",record->event, > + record->device->name, record->element.port_num); > + if ( record->event == IB_EVENT_SM_CHANGE || > + record->event == IB_EVENT_CLIENT_REREGISTER) { > + queue_work(ipoib_workqueue, &priv->flush_task0); > + } else if (record->event == IB_EVENT_PORT_ERR || > + record->event == IB_EVENT_PORT_ACTIVE || > + record->event == IB_EVENT_LID_CHANGE) { > + queue_work(ipoib_workqueue, &priv->flush_task1); > } else if (record->event == IB_EVENT_PKEY_CHANGE) { > - ipoib_dbg(priv, "P_Key change event on port:%d\n", priv->port); > - queue_work(ipoib_workqueue, &priv->pkey_event_task); > + queue_work(ipoib_workqueue, &priv->flush_task2); > } > } From hrosenstock at xsigo.com Thu May 22 08:10:29 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 22 May 2008 08:10:29 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080522145607.GE32128@sashak.voltaire.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> Message-ID: <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> On Thu, 
2008-05-22 at 17:56 +0300, Sasha Khapyorsky wrote:
> On 07:46 Thu 22 May     , Hal Rosenstock wrote:
> > Sasha,
> >
> > On Thu, 2008-05-22 at 16:53 +0300, Sasha Khapyorsky wrote:
> > > This adds possibility to specify SM_Key value with saquery. It should
> > > work with queries where OSM_DEFAULT_SM_KEY was used.
> >
> > I think this starts down a slippery slope and perhaps bad precedent for
> > MKey as well. I know this is useful as a debug tool but compromises what
> > purports as "security" IMO as this means the keys need to be too widely
> > known.
>
> When a value different from OSM_DEFAULT_SM_KEY is configured on the OpenSM
> side, a user may or may not know it; in the latter case saquery will not
> work (just like now). I don't see a hole.

I think it will tend towards a proliferation of keys, which will defeat any
security/trust. The idea of SMKey was to keep it private between SMs. This
is now spreading it wider IMO. I'm sure other patches will follow in the
same vein once an MKey manager exists.

-- Hal

> Sasha

From meier3 at llnl.gov  Thu May 22 08:15:04 2008
From: meier3 at llnl.gov (Timothy A. Meier)
Date: Thu, 22 May 2008 08:15:04 -0700
Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts
	with error if not root
Message-ID: <48358DF8.2060603@llnl.gov>

Sasha,

Trivial patch to enforce root for these perl scripts. More importantly,
the scripts no longer fail silently when not run as root; they exit with
an error code.

--
Timothy A. Meier
Computer Scientist
ICCD/High Performance Computing
925.422.3341 meier3 at llnl.gov

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 0001-infiniband-diags-terminate-perl-scripts-with-error.patch
URL: 

From hrosenstock at xsigo.com  Thu May 22 08:17:47 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Thu, 22 May 2008 08:17:47 -0700
Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts
	with error if not root
In-Reply-To: <48358DF8.2060603@llnl.gov>
References: <48358DF8.2060603@llnl.gov>
Message-ID: <1211469467.18236.196.camel@hrosenstock-ws.xsigo.com>

Tim,

On Thu, 2008-05-22 at 08:15 -0700, Timothy A. Meier wrote:
> Sasha,
>
> Trivial patch to enforce root for these perl scripts. More importantly,
> the scripts no longer fail silently when not run as root; they exit with
> an error code.

Should these enforce root, or be based on the udev permissions for umad,
which default to root?

-- Hal

> plain text document attachment (0001-infiniband-diags-terminate-perl-
> scripts-with-error.patch)
> >From f4058a22d31dc31f0e8ecdffcc42bff065eefcce Mon Sep 17 00:00:00 2001
> From: Tim Meier
> Date: Wed, 21 May 2008 16:40:18 -0700
> Subject: [PATCH] infiniband-diags: terminate perl scripts with error if not root
>
> Adds the "auth_check" routine at the beginning of each main, which
> terminates with an error if not invoked as root.
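If the udev route were preferred, the check could probe the umad device
node the diags actually drive instead of testing for euid 0. A purely
hypothetical C sketch -- the access() test and the device path are
assumptions for illustration, not anything in this patch; pick the umad
node for the port in use:

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* Hypothetical: the conventional node for the first umad port */
		const char *umad_dev = "/dev/infiniband/umad0";

		/* Succeeds for root by default (udev makes umad nodes
		 * root-only), but also honors any relaxed permissions an
		 * admin has configured on the device node. */
		if (access(umad_dev, R_OK | W_OK) != 0) {
			perror(umad_dev);
			return 1;
		}
		printf("%s is accessible\n", umad_dev);
		return 0;
	}

Out of the box this behaves like the euid check, but it would not lock out
a non-root user who has deliberately been granted access to the device.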
> > Signed-off-by: Tim Meier > --- > infiniband-diags/scripts/IBswcountlimits.pm | 10 ++++++++++ > infiniband-diags/scripts/ibfindnodesusing.pl | 1 + > infiniband-diags/scripts/ibidsverify.pl | 1 + > infiniband-diags/scripts/iblinkinfo.pl | 1 + > infiniband-diags/scripts/ibprintca.pl | 1 + > infiniband-diags/scripts/ibprintrt.pl | 1 + > infiniband-diags/scripts/ibprintswitch.pl | 1 + > infiniband-diags/scripts/ibqueryerrors.pl | 1 + > infiniband-diags/scripts/ibswportwatch.pl | 1 + > 9 files changed, 18 insertions(+), 0 deletions(-) > > diff --git a/infiniband-diags/scripts/IBswcountlimits.pm b/infiniband-diags/scripts/IBswcountlimits.pm > index 9bc356f..0b7563e 100755 > --- a/infiniband-diags/scripts/IBswcountlimits.pm > +++ b/infiniband-diags/scripts/IBswcountlimits.pm > @@ -123,6 +123,16 @@ sub check_counters > "Total number of packets, excluding link packets, received on all VLs to the port" > ); > > +# ========================================================================= > +# only root is authorized, terminate with msg and err code > +# > +sub auth_check > +{ > + if ( $> != 0 ) { > + die "Permission denied, must be root\n"; > + } > +} > + > sub check_data_counters > { > my $print_action = $_[0]; > diff --git a/infiniband-diags/scripts/ibfindnodesusing.pl b/infiniband-diags/scripts/ibfindnodesusing.pl > index 1bf0987..49003af 100755 > --- a/infiniband-diags/scripts/ibfindnodesusing.pl > +++ b/infiniband-diags/scripts/ibfindnodesusing.pl > @@ -168,6 +168,7 @@ sub compress_hostlist > # > sub main > { > + auth_check; > my $found_switch = undef; > my $cache_file = get_cache_file($ca_name, $ca_port); > open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; > diff --git a/infiniband-diags/scripts/ibidsverify.pl b/infiniband-diags/scripts/ibidsverify.pl > index de78e6b..b857166 100755 > --- a/infiniband-diags/scripts/ibidsverify.pl > +++ b/infiniband-diags/scripts/ibidsverify.pl > @@ -163,6 +163,7 @@ sub insert_portguid > > sub main > { > + auth_check; > if ($regenerate_map > || !(-f "$IBswcountlimits::cache_dir/ibnetdiscover.topology")) > { > diff --git a/infiniband-diags/scripts/iblinkinfo.pl b/infiniband-diags/scripts/iblinkinfo.pl > index a195474..4bb9598 100755 > --- a/infiniband-diags/scripts/iblinkinfo.pl > +++ b/infiniband-diags/scripts/iblinkinfo.pl > @@ -98,6 +98,7 @@ my $extra_smpquery_params = get_ca_name_port_param_string($ca_name, $ca_port); > > sub main > { > + auth_check; > get_link_ends($regenerate_map, $ca_name, $ca_port); > if (defined($direct_route)) { > # convert DR to guid, then use original single_switch option > diff --git a/infiniband-diags/scripts/ibprintca.pl b/infiniband-diags/scripts/ibprintca.pl > index 38b4330..d5c5fba 100755 > --- a/infiniband-diags/scripts/ibprintca.pl > +++ b/infiniband-diags/scripts/ibprintca.pl > @@ -88,6 +88,7 @@ if ($target_hca eq "") { > # > sub main > { > + auth_check; > my $found_hca = undef; > open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; > my $in_hca = "no"; > diff --git a/infiniband-diags/scripts/ibprintrt.pl b/infiniband-diags/scripts/ibprintrt.pl > index 86dcb64..c6070ff 100755 > --- a/infiniband-diags/scripts/ibprintrt.pl > +++ b/infiniband-diags/scripts/ibprintrt.pl > @@ -88,6 +88,7 @@ if ($target_rt eq "") { > # > sub main > { > + auth_check; > my $found_rt = undef; > open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; > my $in_rt = "no"; > diff --git a/infiniband-diags/scripts/ibprintswitch.pl b/infiniband-diags/scripts/ibprintswitch.pl > index 
6712201..41a5131 100755 > --- a/infiniband-diags/scripts/ibprintswitch.pl > +++ b/infiniband-diags/scripts/ibprintswitch.pl > @@ -87,6 +87,7 @@ if ($target_switch eq "") { > # > sub main > { > + auth_check; > my $found_switch = undef; > open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; > my $in_switch = "no"; > diff --git a/infiniband-diags/scripts/ibqueryerrors.pl b/infiniband-diags/scripts/ibqueryerrors.pl > index c807c02..3330687 100755 > --- a/infiniband-diags/scripts/ibqueryerrors.pl > +++ b/infiniband-diags/scripts/ibqueryerrors.pl > @@ -185,6 +185,7 @@ $cache_file = get_cache_file($ca_name, $ca_port); > > sub main > { > + auth_check; > if (@IBswcountlimits::suppress_errors) { > my $msg = join(",", @IBswcountlimits::suppress_errors); > print "Suppressing: $msg\n"; > diff --git a/infiniband-diags/scripts/ibswportwatch.pl b/infiniband-diags/scripts/ibswportwatch.pl > index 6d6ba1c..76398fa 100755 > --- a/infiniband-diags/scripts/ibswportwatch.pl > +++ b/infiniband-diags/scripts/ibswportwatch.pl > @@ -157,6 +157,7 @@ my $sw_port = $ARGV[1]; > > sub main > { > + auth_check; > clear_counters; > get_new_counts($sw_addr, $sw_port); > while ($cycle != 0) { > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Thu May 22 08:17:57 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 22 May 2008 08:17:57 -0700 Subject: [ofa-general] Re: [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> Message-ID: <1211469478.18236.198.camel@hrosenstock-ws.xsigo.com> Roland, On Mon, 2008-05-19 at 15:23 -0700, Roland Dreier wrote: > I see two issues with this patch: > > - Is is architecturally guaranteed by the IB spec that flushing unicast > info is not required on an SM change or client reregister event? FWIW, here's my take on what the IBA spec says relative to this: I don't think there's an issue with client reregister AFAIK. Client registrations refer to subscriptions only. SM change is another matter IMO and is not guaranteed as has been pointed out in several earlier posts in this thread. -- Hal From olga.shern at gmail.com Thu May 22 08:28:19 2008 From: olga.shern at gmail.com (Olga Shern (Voltaire)) Date: Thu, 22 May 2008 18:28:19 +0300 Subject: [ofa-general] Re: [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: <1211468826.18236.184.camel@hrosenstock-ws.xsigo.com> References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> <48357C09.1040302@Voltaire.COM> <1211468826.18236.184.camel@hrosenstock-ws.xsigo.com> Message-ID: On 5/22/08, Hal Rosenstock wrote: > > Moni, > > On Thu, 2008-05-22 at 16:58 +0300, Moni Shoua wrote: > > Hal, Roland > > Thanks for the comments. The patch below tries to address the issues that > were > > raised in its previous form. Please note that I'm only asking for opinion > for now. > > If the idea is acceptable then I will recreate more elegant patch with > the required > > fixes if any and with respect to previous comments (such as replacing 0,1 > and 2 with > > textual names). > > > > The idea in few words is to flush only paths but keeping address handles > in ipoib_neigh. 
> > This will trigger a new path lookup when an ARP probe arrives and > eventually an addess > > handle renewal. In the meantime, the old address handle is kept and can > be used. In most > > cases this address handle is a valid address handle and when it is not > than the situatio > > is not worse than before. > > This part seems OK to me. > > > My tests show that this patch completes the improvement that was archived > with patch #1 > > to zero packet loss (tested with ping flood) when SM change event occurs. > > Looks to me like SM change is still "level 0". I may have missed it but > I don't see how this addresses the general architectural concerns > previously raised. This patch may work in your test environment but I > don't think that covers all the cases. Hal, You pointed out that we cannot rely on the assumption that on SM failover there is not path change. In the previous patch we only flush multicast. What Moni changed in this patch is that on SM failover (SM change event), we will flush not only multicast but also all paths but without destroying ah. Olga -- Hal > > > thanks > > > > MoniS > > > > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h > b/drivers/infiniband/ulp/ipoib/ipoib.h > > index ca126fc..8ef6573 100644 > > --- a/drivers/infiniband/ulp/ipoib/ipoib.h > > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h > > @@ -276,10 +276,11 @@ struct ipoib_dev_priv { > > > > struct delayed_work pkey_poll_task; > > struct delayed_work mcast_task; > > - struct work_struct flush_task; > > + struct work_struct flush_task0; > > + struct work_struct flush_task1; > > + struct work_struct flush_task2; > > struct work_struct restart_task; > > struct delayed_work ah_reap_task; > > - struct work_struct pkey_event_task; > > > > struct ib_device *ca; > > u8 port; > > @@ -423,11 +424,14 @@ void ipoib_send(struct net_device *dev, struct > sk_buff *skb, > > struct ipoib_ah *address, u32 qpn); > > void ipoib_reap_ah(struct work_struct *work); > > > > +void ipoib_flush_paths_only(struct net_device *dev); > > void ipoib_flush_paths(struct net_device *dev); > > struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); > > > > int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int > port); > > -void ipoib_ib_dev_flush(struct work_struct *work); > > +void ipoib_ib_dev_flush0(struct work_struct *work); > > +void ipoib_ib_dev_flush1(struct work_struct *work); > > +void ipoib_ib_dev_flush2(struct work_struct *work); > > void ipoib_pkey_event(struct work_struct *work); > > void ipoib_ib_dev_cleanup(struct net_device *dev); > > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > > index f429bce..5a6bbe8 100644 > > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > > @@ -898,7 +898,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct > ib_device *ca, int port) > > return 0; > > } > > > > -static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int > pkey_event) > > +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int level) > > { > > struct ipoib_dev_priv *cpriv; > > struct net_device *dev = priv->dev; > > @@ -911,7 +911,7 @@ static void __ipoib_ib_dev_flush(struct > ipoib_dev_priv *priv, int pkey_event) > > * the parent is down. 
> > */ > > list_for_each_entry(cpriv, &priv->child_intfs, list) > > - __ipoib_ib_dev_flush(cpriv, pkey_event); > > + __ipoib_ib_dev_flush(cpriv, level); > > > > mutex_unlock(&priv->vlan_mutex); > > > > @@ -925,7 +925,7 @@ static void __ipoib_ib_dev_flush(struct > ipoib_dev_priv *priv, int pkey_event) > > return; > > } > > > > - if (pkey_event) { > > + if (level == 2) { > > if (ib_find_pkey(priv->ca, priv->port, priv->pkey, > &new_index)) { > > clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > > ipoib_ib_dev_down(dev, 0); > > @@ -943,11 +943,13 @@ static void __ipoib_ib_dev_flush(struct > ipoib_dev_priv *priv, int pkey_event) > > priv->pkey_index = new_index; > > } > > > > - ipoib_dbg(priv, "flushing\n"); > > - > > - ipoib_ib_dev_down(dev, 0); > > + ipoib_flush_paths_only(dev); > > + ipoib_mcast_dev_flush(dev); > > + > > + if (level >= 1) > > + ipoib_ib_dev_down(dev, 0); > > > > - if (pkey_event) { > > + if (level >= 2) { > > ipoib_ib_dev_stop(dev, 0); > > ipoib_ib_dev_open(dev); > > } > > @@ -957,29 +959,36 @@ static void __ipoib_ib_dev_flush(struct > ipoib_dev_priv *priv, int pkey_event) > > * we get here, don't bring it back up if it's not configured up > > */ > > if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) { > > - ipoib_ib_dev_up(dev); > > + if (level >= 1) > > + ipoib_ib_dev_up(dev); > > ipoib_mcast_restart_task(&priv->restart_task); > > } > > } > > > > -void ipoib_ib_dev_flush(struct work_struct *work) > > +void ipoib_ib_dev_flush0(struct work_struct *work) > > { > > struct ipoib_dev_priv *priv = > > - container_of(work, struct ipoib_dev_priv, flush_task); > > + container_of(work, struct ipoib_dev_priv, flush_task0); > > > > - ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); > > __ipoib_ib_dev_flush(priv, 0); > > } > > > > -void ipoib_pkey_event(struct work_struct *work) > > +void ipoib_ib_dev_flush1(struct work_struct *work) > > { > > struct ipoib_dev_priv *priv = > > - container_of(work, struct ipoib_dev_priv, pkey_event_task); > > + container_of(work, struct ipoib_dev_priv, flush_task1); > > > > - ipoib_dbg(priv, "Flushing %s and restarting its QP\n", > priv->dev->name); > > __ipoib_ib_dev_flush(priv, 1); > > } > > > > +void ipoib_ib_dev_flush2(struct work_struct *work) > > +{ > > + struct ipoib_dev_priv *priv = > > + container_of(work, struct ipoib_dev_priv, flush_task2); > > + > > + __ipoib_ib_dev_flush(priv, 2); > > +} > > + > > void ipoib_ib_dev_cleanup(struct net_device *dev) > > { > > struct ipoib_dev_priv *priv = netdev_priv(dev); > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c > b/drivers/infiniband/ulp/ipoib/ipoib_main.c > > index 2442090..c41798d 100644 > > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > > @@ -259,6 +259,21 @@ static int __path_add(struct net_device *dev, struct > ipoib_path *path) > > return 0; > > } > > > > +static void path_free_only(struct net_device *dev, struct ipoib_path > *path) > > +{ > > + struct ipoib_dev_priv *priv = netdev_priv(dev); > > + struct ipoib_neigh *neigh, *tn; > > + struct sk_buff *skb; > > + unsigned long flags; > > + > > + while ((skb = __skb_dequeue(&path->queue))) > > + dev_kfree_skb_irq(skb); > > + > > + if (path->ah) > > + ipoib_put_ah(path->ah); > > + > > + kfree(path); > > +} > > static void path_free(struct net_device *dev, struct ipoib_path *path) > > { > > struct ipoib_dev_priv *priv = netdev_priv(dev); > > @@ -350,6 +365,34 @@ void ipoib_path_iter_read(struct ipoib_path_iter > *iter, > > > > #endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ > > > > +void 
ipoib_flush_paths_only(struct net_device *dev) > > +{ > > + struct ipoib_dev_priv *priv = netdev_priv(dev); > > + struct ipoib_path *path, *tp; > > + LIST_HEAD(remove_list); > > + > > + spin_lock_irq(&priv->tx_lock); > > + spin_lock(&priv->lock); > > + > > + list_splice_init(&priv->path_list, &remove_list); > > + > > + list_for_each_entry(path, &remove_list, list) > > + rb_erase(&path->rb_node, &priv->path_tree); > > + > > + list_for_each_entry_safe(path, tp, &remove_list, list) { > > + if (path->query) > > + ib_sa_cancel_query(path->query_id, path->query); > > + spin_unlock(&priv->lock); > > + spin_unlock_irq(&priv->tx_lock); > > + wait_for_completion(&path->done); > > + path_free_only(dev, path); > > + spin_lock_irq(&priv->tx_lock); > > + spin_lock(&priv->lock); > > + } > > + spin_unlock(&priv->lock); > > + spin_unlock_irq(&priv->tx_lock); > > +} > > + > > void ipoib_flush_paths(struct net_device *dev) > > { > > struct ipoib_dev_priv *priv = netdev_priv(dev); > > @@ -421,6 +464,8 @@ static void path_rec_completion(int status, > > __skb_queue_tail(&skqueue, skb); > > > > list_for_each_entry_safe(neigh, tn, &path->neigh_list, > list) { > > + if (neigh->ah) > > + ipoib_put_ah(neigh->ah); > > kref_get(&path->ah->ref); > > neigh->ah = path->ah; > > memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw, > > @@ -989,9 +1034,10 @@ static void ipoib_setup(struct net_device *dev) > > INIT_LIST_HEAD(&priv->multicast_list); > > > > INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); > > - INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); > > INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); > > - INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); > > + INIT_WORK(&priv->flush_task0, ipoib_ib_dev_flush0); > > + INIT_WORK(&priv->flush_task1, ipoib_ib_dev_flush1); > > + INIT_WORK(&priv->flush_task2, ipoib_ib_dev_flush2); > > INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); > > INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah); > > } > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > > index 8766d29..80c0409 100644 > > --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > > @@ -289,15 +289,16 @@ void ipoib_event(struct ib_event_handler *handler, > > if (record->element.port_num != priv->port) > > return; > > > > - if (record->event == IB_EVENT_PORT_ERR || > > - record->event == IB_EVENT_PORT_ACTIVE || > > - record->event == IB_EVENT_LID_CHANGE || > > - record->event == IB_EVENT_SM_CHANGE || > > - record->event == IB_EVENT_CLIENT_REREGISTER) { > > - ipoib_dbg(priv, "Port state change event\n"); > > - queue_work(ipoib_workqueue, &priv->flush_task); > > + ipoib_dbg(priv, "Event %d on device %s port %d\n",record->event, > > + record->device->name, record->element.port_num); > > + if ( record->event == IB_EVENT_SM_CHANGE || > > + record->event == IB_EVENT_CLIENT_REREGISTER) { > > + queue_work(ipoib_workqueue, &priv->flush_task0); > > + } else if (record->event == IB_EVENT_PORT_ERR || > > + record->event == IB_EVENT_PORT_ACTIVE || > > + record->event == IB_EVENT_LID_CHANGE) { > > + queue_work(ipoib_workqueue, &priv->flush_task1); > > } else if (record->event == IB_EVENT_PKEY_CHANGE) { > > - ipoib_dbg(priv, "P_Key change event on port:%d\n", > priv->port); > > - queue_work(ipoib_workqueue, &priv->pkey_event_task); > > + queue_work(ipoib_workqueue, &priv->flush_task2); > > } > > } > > _______________________________________________ > general mailing list > general at 
lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eli at mellanox.co.il Thu May 22 08:40:15 2008 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 22 May 2008 18:40:15 +0300 Subject: [ofa-general] [PATCH] IB/ipoib: copy small SKBs in CM mode Message-ID: <1211470815.7310.61.camel@eli-laptop> >From a8ea680caf189ad984aedaa81463ed66e45c4e65 Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Thu, 22 May 2008 16:28:59 +0300 Subject: [PATCH] IB/ipoib: copy small SKBs in CM mode CM mode of ipoib has a large overhead in the receive flow for managing SKBs. It usually allocates an SKB with data as much as was used in the currently received SKB and moves unused fragments from the old SKB to the new one. This involves a loop on all the remaining fragments and incurs overhead on the CPU. This patch, for small SKBs, allocates an SKB just large enough to contain the received data and copies to it the data from the received SKB. The newly allocated SKB is passed to the stack and the old SKB is reposted. Signed-off-by: Eli Cohen --- When running netperf I see significant improvement when using this patch (BW Mbps): with patch: sender receiver 313 313 without the patch: 509 134 drivers/infiniband/ulp/ipoib/ipoib.h | 1 + drivers/infiniband/ulp/ipoib/ipoib_cm.c | 15 +++++++++++++++ 2 files changed, 16 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index ca126fc..e39bf36 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -97,6 +97,7 @@ enum { IPOIB_MCAST_FLAG_ATTACHED = 3, MAX_SEND_CQE = 16, + SKB_TSHOLD = 256, }; #define IPOIB_OP_RECV (1ul << 31) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index e6f57dd..791bef7 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -525,6 +525,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) u64 mapping[IPOIB_CM_RX_SG]; int frags; int has_srq; + struct sk_buff *small_skb; ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", wr_id, wc->status); @@ -579,6 +580,19 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) } } + if (wc->byte_len < SKB_TSHOLD) { + int dlen = wc->byte_len; + + small_skb = dev_alloc_skb(dlen + 12); + if (small_skb) { + skb_reserve(small_skb, 12); + skb_copy_from_linear_data(skb, small_skb->data, dlen); + skb_put(small_skb, dlen); + skb = small_skb; + goto copied; + } + } + frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; @@ -601,6 +615,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); +copied: skb->protocol = ((struct ipoib_header *) skb->data)->proto; skb_reset_mac_header(skb); skb_pull(skb, IPOIB_ENCAP_LEN); -- 1.5.5.1 From hrosenstock at xsigo.com Thu May 22 08:43:28 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 22 May 2008 08:43:28 -0700 Subject: [ofa-general] Re: [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> <48357C09.1040302@Voltaire.COM> <1211468826.18236.184.camel@hrosenstock-ws.xsigo.com> 
Message-ID: <1211471008.18236.209.camel@hrosenstock-ws.xsigo.com> Olga, On Thu, 2008-05-22 at 18:28 +0300, Olga Shern (Voltaire) wrote: > Hal, > You pointed out that we cannot rely on the assumption that on SM > failover there is not path change. > In the previous patch we only flush multicast. > What Moni changed in this patch is that on SM failover (SM change > event), we will flush not only multicast but also all paths but > without destroying ah. I missed that in the patch :-( It addresses the first level of concern in terms of the unicast paths but leaves open the path parameter changes (rate, etc.) as the address handles are preserved as Moni stated in other words. I agree it's in the right direction. I would like to see the whole problem solved. Is the cost of recreating the AHs too much or is something else leading towards preserving the AHs ? That's what's needed to be resolved for a complete solution. -- Hal > Olga From swise at opengridcomputing.com Thu May 22 09:00:34 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 22 May 2008 11:00:34 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <48358428.2000902@voltaire.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> Message-ID: <483598A2.1020503@opengridcomputing.com> Or Gerlitz wrote: > Steve Wise wrote: >> Are we sure we need to expose this to the user? > I believe this is the way to go if we want to let smart ULPs generate > new rkey/stag per mapping. Simpler ULPs could then just put the same > value for each map associated with the same mr. > > Or. > Roland, what do you think? I'm ok with adding this. From olga.shern at gmail.com Thu May 22 09:06:29 2008 From: olga.shern at gmail.com (Olga Shern (Voltaire)) Date: Thu, 22 May 2008 19:06:29 +0300 Subject: [ofa-general] Re: [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: <1211471008.18236.209.camel@hrosenstock-ws.xsigo.com> References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> <48357C09.1040302@Voltaire.COM> <1211468826.18236.184.camel@hrosenstock-ws.xsigo.com> <1211471008.18236.209.camel@hrosenstock-ws.xsigo.com> Message-ID: On 5/22/08, Hal Rosenstock wrote: > > Olga, > > On Thu, 2008-05-22 at 18:28 +0300, Olga Shern (Voltaire) wrote: > > Hal, > > You pointed out that we cannot rely on the assumption that on SM > > failover there is not path change. > > In the previous patch we only flush multicast. > > What Moni changed in this patch is that on SM failover (SM change > > event), we will flush not only multicast but also all paths but > > without destroying ah. > > I missed that in the patch :-( It addresses the first level of concern > in terms of the unicast paths but leaves open the path parameter changes > (rate, etc.) as the address handles are preserved as Moni stated in > other words. I agree it's in the right direction. I would like to see > the whole problem solved. Is the cost of recreating the AHs too much or > is something else leading towards preserving the AHs ? That's what's > needed to be resolved for a complete solution. 
I didn't explain it well, I will try again :) On SM change event we will not destroy ah but will flush paths, therefore unicast traffic will continue without packets lost. When there will be arp probe (issued by the kernel) it will look for a path and because we have deleted it will issue path query to SM and after reply from sm it will create a new ah that will replace the old ah. Before this patch all packets were dropped till there is a new ah, this patch creating new ah at the background. I hope it is clear now. Olga -- Hal > > > Olga > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sean.hefty at intel.com Thu May 22 09:21:20 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 22 May 2008 09:21:20 -0700 Subject: [ofa-general] [PATCH] [for-2.6.27] rdma: fix license text Message-ID: <000101c8bc27$df180960$ec248686@amr.corp.intel.com> The license text for several files references a third software license that was inadvertently copied in. Update the license to match that used by openfabrics. This update was based on a request from HP. Signed-off-by: Sean Hefty --- drivers/infiniband/core/addr.c | 41 ++++++++++++++++++--------------- drivers/infiniband/core/cma.c | 42 ++++++++++++++++++---------------- include/rdma/ib_addr.h | 42 ++++++++++++++++++---------------- include/rdma/rdma_cm.h | 42 ++++++++++++++++++---------------- include/rdma/rdma_cm_ib.h | 50 ++++++++++++++++++++++------------------ 5 files changed, 119 insertions(+), 98 deletions(-) diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index 781ea59..e4eb8be 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -4,28 +4,33 @@ * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * - * This Software is licensed under one of the following licenses: + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: * - * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/cpl.php. + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/bsd-license.php. + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see - * http://www.opensource.org/licenses/gpl-license.php. + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. * - * Licensee has the right to choose one of the above licenses. - * - * Redistributions of source code must retain the above copyright - * notice and one of the license notices. 
- * - * Redistributions in binary form must reproduce both the above copyright - * notice, one of the license notices in the documentation - * and/or other materials provided with the distribution. + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. */ #include diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 671f137..e5bd617 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -4,29 +4,33 @@ * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. * Copyright (c) 2005-2006 Intel Corporation. All rights reserved. * - * This Software is licensed under one of the following licenses: + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: * - * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/cpl.php. + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/bsd-license.php. + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see - * http://www.opensource.org/licenses/gpl-license.php. - * - * Licensee has the right to choose one of the above licenses. - * - * Redistributions of source code must retain the above copyright - * notice and one of the license notices. - * - * Redistributions in binary form must reproduce both the above copyright - * notice, one of the license notices in the documentation - * and/or other materials provided with the distribution. + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
*/ #include diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h index c36750f..b42bdd0 100644 --- a/include/rdma/ib_addr.h +++ b/include/rdma/ib_addr.h @@ -2,29 +2,33 @@ * Copyright (c) 2005 Voltaire Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * - * This Software is licensed under one of the following licenses: + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: * - * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/cpl.php. + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/bsd-license.php. + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see - * http://www.opensource.org/licenses/gpl-license.php. - * - * Licensee has the right to choose one of the above licenses. - * - * Redistributions of source code must retain the above copyright - * notice and one of the license notices. - * - * Redistributions in binary form must reproduce both the above copyright - * notice, one of the license notices in the documentation - * and/or other materials provided with the distribution. + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. */ #if !defined(IB_ADDR_H) diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h index 010f876..d8f9a95 100644 --- a/include/rdma/rdma_cm.h +++ b/include/rdma/rdma_cm.h @@ -2,29 +2,33 @@ * Copyright (c) 2005 Voltaire Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * - * This Software is licensed under one of the following licenses: + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: * - * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/cpl.php. 
+ * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/bsd-license.php. + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see - * http://www.opensource.org/licenses/gpl-license.php. - * - * Licensee has the right to choose one of the above licenses. - * - * Redistributions of source code must retain the above copyright - * notice and one of the license notices. - * - * Redistributions in binary form must reproduce both the above copyright - * notice, one of the license notices in the documentation - * and/or other materials provided with the distribution. + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. */ #if !defined(RDMA_CM_H) diff --git a/include/rdma/rdma_cm_ib.h b/include/rdma/rdma_cm_ib.h index 950424b..2389c3b 100644 --- a/include/rdma/rdma_cm_ib.h +++ b/include/rdma/rdma_cm_ib.h @@ -1,29 +1,33 @@ /* * Copyright (c) 2006 Intel Corporation. All rights reserved. * - * This Software is licensed under one of the following licenses: - * - * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/cpl.php. - * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/bsd-license.php. - * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see - * http://www.opensource.org/licenses/gpl-license.php. - * - * Licensee has the right to choose one of the above licenses. - * - * Redistributions of source code must retain the above copyright - * notice and one of the license notices. - * - * Redistributions in binary form must reproduce both the above copyright - * notice, one of the license notices in the documentation - * and/or other materials provided with the distribution. - * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. */ #if !defined(RDMA_CM_IB_H) From hrosenstock at xsigo.com Thu May 22 09:39:36 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 22 May 2008 09:39:36 -0700 Subject: [ofa-general] Re: [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> <48357C09.1040302@Voltaire.COM> <1211468826.18236.184.camel@hrosenstock-ws.xsigo.com> <1211471008.18236.209.camel@hrosenstock-ws.xsigo.com> Message-ID: <1211474376.18236.220.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-22 at 19:06 +0300, Olga Shern (Voltaire) wrote: > I didn't explain it well, I will try again :) > > On SM change event we will not destroy ah but will flush paths, > therefore unicast traffic will continue without packets lost. > When there will be arp probe (issued by the kernel) it will look for a > path and because we have deleted it will issue path query to SM and > after reply from sm it will create a new ah that will replace the old > ah. So the new path parameters do become available after a while (once the SA PR query is responded to). In the interim, IPoIB continues to send using the potentially old AH in the hope that nothing changes. If so, this sounds OK to me. > Before this patch all packets were dropped till there is a new ah, > this patch creating new ah at the background > I hope it is clear now. If what I wrote above is a restatement, then it's clearer to me now. Thanks. -- Hal > Olga From ralph.campbell at qlogic.com Thu May 22 10:29:35 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 22 May 2008 10:29:35 -0700 Subject: [ofa-general] ib_device_cap_flags Message-ID: <1211477375.3949.313.camel@brick.pathscale.com> Or Gerlitz pointed out that the ipath driver wasn't setting IB_DEVICE_RC_RNR_NAK_GEN. When I looked at the other flags to see if any others were missing, I wasn't sure what one of the flags means. Can someone enlighten me? IB_DEVICE_CURR_QP_STATE_MOD ib_modify_qp(IB_QP_CUR_STATE) supported? 
From sean.hefty at intel.com Thu May 22 10:34:58 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 22 May 2008 10:34:58 -0700 Subject: [ofa-general] ib_device_cap_flags In-Reply-To: <1211477375.3949.313.camel@brick.pathscale.com> References: <1211477375.3949.313.camel@brick.pathscale.com> Message-ID: <000301c8bc32$2850a6a0$ec248686@amr.corp.intel.com> >IB_DEVICE_CURR_QP_STATE_MOD > ib_modify_qp(IB_QP_CUR_STATE) supported? "Ability of this HCA to support the Current QP State modifier for Modify Queue Pair." It allows the user to specify the current state of the QP when transitioning to RTS (from RTR or SQD). - Sean From hrosenstock at xsigo.com Thu May 22 11:08:00 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 22 May 2008 11:08:00 -0700 Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts with error if not root In-Reply-To: <1211469467.18236.196.camel@hrosenstock-ws.xsigo.com> References: <48358DF8.2060603@llnl.gov> <1211469467.18236.196.camel@hrosenstock-ws.xsigo.com> Message-ID: <1211479680.13185.37.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-22 at 08:17 -0700, Hal Rosenstock wrote: > Tim, > > On Thu, 2008-05-22 at 08:15 -0700, Timothy A. Meier wrote: > > Sasha, > > > > Trivial patch to enforce root for these perl scripts. More importantly, > > doesn't silently fail if not root, and returns an error code. > > Should these enforce root or be based on udev permissions for umad which > default to root ? > > -- Hal > > > plain text document attachment (0001-infiniband-diags-terminate-perl- > > scripts-with-error.patch) > > >From f4058a22d31dc31f0e8ecdffcc42bff065eefcce Mon Sep 17 00:00:00 2001 > > From: Tim Meier > > Date: Wed, 21 May 2008 16:40:18 -0700 > > Subject: [PATCH] infiniband-diags: terminate perl scripts with error if not root > > > > Adds the "auth_check" routine at the beginning of each main, which > > terminates with an error if not invoked as root. > > > > Signed-off-by: Tim Meier > > --- > > infiniband-diags/scripts/IBswcountlimits.pm | 10 ++++++++++ > > infiniband-diags/scripts/ibfindnodesusing.pl | 1 + > > infiniband-diags/scripts/ibidsverify.pl | 1 + > > infiniband-diags/scripts/iblinkinfo.pl | 1 + > > infiniband-diags/scripts/ibprintca.pl | 1 + > > infiniband-diags/scripts/ibprintrt.pl | 1 + > > infiniband-diags/scripts/ibprintswitch.pl | 1 + > > infiniband-diags/scripts/ibqueryerrors.pl | 1 + > > infiniband-diags/scripts/ibswportwatch.pl | 1 + > > 9 files changed, 18 insertions(+), 0 deletions(-) > > > > diff --git a/infiniband-diags/scripts/IBswcountlimits.pm b/infiniband-diags/scripts/IBswcountlimits.pm > > index 9bc356f..0b7563e 100755 > > --- a/infiniband-diags/scripts/IBswcountlimits.pm > > +++ b/infiniband-diags/scripts/IBswcountlimits.pm > > @@ -123,6 +123,16 @@ sub check_counters > > "Total number of packets, excluding link packets, received on all VLs to the port" > > ); > > > > +# ========================================================================= > > +# only root is authorized, terminate with msg and err code > > +# > > +sub auth_check > > +{ > > + if ( $> != 0 ) { > > + die "Permission denied, must be root\n"; > > + } I think all that's needed is a slightly more sophisticated auth_check than this :-) It could easily be a follow on patch to this. 
-- Hal > > +} > > + > > sub check_data_counters > > { > > my $print_action = $_[0]; > > diff --git a/infiniband-diags/scripts/ibfindnodesusing.pl b/infiniband-diags/scripts/ibfindnodesusing.pl > > index 1bf0987..49003af 100755 > > --- a/infiniband-diags/scripts/ibfindnodesusing.pl > > +++ b/infiniband-diags/scripts/ibfindnodesusing.pl > > @@ -168,6 +168,7 @@ sub compress_hostlist > > # > > sub main > > { > > + auth_check; > > my $found_switch = undef; > > my $cache_file = get_cache_file($ca_name, $ca_port); > > open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; > > diff --git a/infiniband-diags/scripts/ibidsverify.pl b/infiniband-diags/scripts/ibidsverify.pl > > index de78e6b..b857166 100755 > > --- a/infiniband-diags/scripts/ibidsverify.pl > > +++ b/infiniband-diags/scripts/ibidsverify.pl > > @@ -163,6 +163,7 @@ sub insert_portguid > > > > sub main > > { > > + auth_check; > > if ($regenerate_map > > || !(-f "$IBswcountlimits::cache_dir/ibnetdiscover.topology")) > > { > > diff --git a/infiniband-diags/scripts/iblinkinfo.pl b/infiniband-diags/scripts/iblinkinfo.pl > > index a195474..4bb9598 100755 > > --- a/infiniband-diags/scripts/iblinkinfo.pl > > +++ b/infiniband-diags/scripts/iblinkinfo.pl > > @@ -98,6 +98,7 @@ my $extra_smpquery_params = get_ca_name_port_param_string($ca_name, $ca_port); > > > > sub main > > { > > + auth_check; > > get_link_ends($regenerate_map, $ca_name, $ca_port); > > if (defined($direct_route)) { > > # convert DR to guid, then use original single_switch option > > diff --git a/infiniband-diags/scripts/ibprintca.pl b/infiniband-diags/scripts/ibprintca.pl > > index 38b4330..d5c5fba 100755 > > --- a/infiniband-diags/scripts/ibprintca.pl > > +++ b/infiniband-diags/scripts/ibprintca.pl > > @@ -88,6 +88,7 @@ if ($target_hca eq "") { > > # > > sub main > > { > > + auth_check; > > my $found_hca = undef; > > open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; > > my $in_hca = "no"; > > diff --git a/infiniband-diags/scripts/ibprintrt.pl b/infiniband-diags/scripts/ibprintrt.pl > > index 86dcb64..c6070ff 100755 > > --- a/infiniband-diags/scripts/ibprintrt.pl > > +++ b/infiniband-diags/scripts/ibprintrt.pl > > @@ -88,6 +88,7 @@ if ($target_rt eq "") { > > # > > sub main > > { > > + auth_check; > > my $found_rt = undef; > > open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; > > my $in_rt = "no"; > > diff --git a/infiniband-diags/scripts/ibprintswitch.pl b/infiniband-diags/scripts/ibprintswitch.pl > > index 6712201..41a5131 100755 > > --- a/infiniband-diags/scripts/ibprintswitch.pl > > +++ b/infiniband-diags/scripts/ibprintswitch.pl > > @@ -87,6 +87,7 @@ if ($target_switch eq "") { > > # > > sub main > > { > > + auth_check; > > my $found_switch = undef; > > open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; > > my $in_switch = "no"; > > diff --git a/infiniband-diags/scripts/ibqueryerrors.pl b/infiniband-diags/scripts/ibqueryerrors.pl > > index c807c02..3330687 100755 > > --- a/infiniband-diags/scripts/ibqueryerrors.pl > > +++ b/infiniband-diags/scripts/ibqueryerrors.pl > > @@ -185,6 +185,7 @@ $cache_file = get_cache_file($ca_name, $ca_port); > > > > sub main > > { > > + auth_check; > > if (@IBswcountlimits::suppress_errors) { > > my $msg = join(",", @IBswcountlimits::suppress_errors); > > print "Suppressing: $msg\n"; > > diff --git a/infiniband-diags/scripts/ibswportwatch.pl b/infiniband-diags/scripts/ibswportwatch.pl > > index 6d6ba1c..76398fa 100755 > > --- 
a/infiniband-diags/scripts/ibswportwatch.pl > > +++ b/infiniband-diags/scripts/ibswportwatch.pl > > @@ -157,6 +157,7 @@ my $sw_port = $ARGV[1]; > > > > sub main > > { > > + auth_check; > > clear_counters; > > get_new_counts($sw_addr, $sw_port); > > while ($cycle != 0) { > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From tziporet at mellanox.co.il Thu May 22 12:46:58 2008 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 22 May 2008 22:46:58 +0300 Subject: [ofa-general] OFED 1.3.1 RC2 release is available Message-ID: <6C2C79E72C305246B504CBA17B5500C90282E6C3@mtlexch01.mtl.com> Hi, OFED 1.3.1 RC2 release is available on http://www.openfabrics.org/downloads/OFED/ofed-1.3.1/OFED-1.3.1-rc2.tgz To get BUILD_ID run ofed_info Please report any issues in Bugzilla https://bugs.openfabrics.org/ The GA version is expected on May 29 Release information: -------------------- Linux Operating Systems: - RedHat EL4 up4: 2.6.9-42.ELsmp - RedHat EL4 up5: 2.6.9-55.ELsmp - RedHat EL4 up6: 2.6.9-67.ELsmp - RedHat EL5: 2.6.18-8.el5 - RedHat EL5 up1: 2.6.18-53.el5 - RedHat EL5 up2 beta: 2.6.18-84.el5 * - Fedora C6: 2.6.18-8.fc6 * - SLES10: 2.6.16.21-0.8-smp - SLES10 SP1: 2.6.16.46-0.12-smp - SLES10 SP1 up1: 2.6.16.53-0.16-smp - SLES10 SP2: 2.6.16.60-0.21-smp * - OpenSuSE 10.3: 2.6.22-*-* * - kernel.org: 2.6.23 and 2.6.24 * OSes that are partially tested Systems: * x86_64 * x86 * ia64 * ppc64 Main changes from OFED 1.3.1-rc1 ================================ * Added backports for the OSes (with very limited testing): * SLES10 SP2 with kernel 2.6.16.60-0.21-smp * RedHat EL5 up2 beta with kernel 2.6.18-84.el5 * MPI packages update: * mvapich-1.0.1-2481 * Updated libraries: * dapl-v1 1.2.7-1 * dapl-v2 2.0.9-1 * libcxgb3 1.2.1 * ULPs changes: * OpenSM: Fix segmentation fault * iSER: Bug fixes since 2.6.24 * RDS: fixes for RDMA API * IPoIB: Fix several kernel crashes (see attached list) * Updated low level drivers: * nes * mlx4 * cxgb3 * ehca * ipath Main Changes from OFED-1.3: =========================== * MPI packages update: * mvapich-1.0.1-2434 * mvapich2-1.0.3-1 * openmpi-1.2.6-1 * Updated libraries: * dapl-v1 1.2.6 * dapl-v2 2.0.8 * libcxgb3 1.2.0 * librdmacm 1.0.7 * ULPs changes: * IB Bonding: ib-bonding-0.9.0-24 * IPoIB bug fixes * RDS fixes for RDMA API * SRP failover * Updated low level drivers: * nes * mlx4 * cxgb3 * ehca Vlad & Tziporet Note: In the attached tgz file you can find the git-log of all changes. In the CSV file there is a list of fixed bugs that were reported in bugzilla -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: rc2-fixed-bugs.csv Type: application/octet-stream Size: 464 bytes Desc: rc2-fixed-bugs.csv URL: From YJia at tmriusa.com Thu May 22 13:13:05 2008 From: YJia at tmriusa.com (Yicheng Jia) Date: Thu, 22 May 2008 15:13:05 -0500 Subject: [ofa-general] MLX HCA: CQ request notification for multiple completions not implemented? Message-ID: Hi Folks, I'm trying to use CQ event notification for multiple completions (ARM_N) according to the Mellanox InfiniHost III Lx user manual, for scatter/gather RDMA. However I couldn't find it in the current MLX driver. It seems to me that only ARM_NEXT and ARM_SOLICIT are implemented. So if there are multiple work requests, I have to use "poll_cq" to synchronously wait until all the requests are done. Is that correct? Is there a way to do asynchronous multiple sends by subscribing for an ARM_N event? Thanks! Yicheng _____________________________________________________________________________ Scanned by IBM Email Security Management Services powered by MessageLabs. For more information please visit http://www.ers.ibm.com _____________________________________________________________________________ -------------- next part -------------- An HTML attachment was scrubbed... URL: From greg at kroah.com Thu May 22 13:34:03 2008 From: greg at kroah.com (Greg KH) Date: Thu, 22 May 2008 13:34:03 -0700 Subject: [ofa-general] question about drivers/infiniband/core/cm.c's kobject usage Message-ID: <20080522203403.GA27263@kroah.com> Hi, I was working on some changes to the driver core that clean up the struct class fields, when I ran across the usage of cm.c and the infiniband_cm class. It looks like you are registering "raw" kobjects in this class, chaining things off of it, as if they were devices. If so, why not just use struct device in the first place? You are creating a tree which, on modern distros, userspace will never see, as it expects everything to show up in /sys/devices/ Entries in /sys/class/*/* now are symlinks into the /sys/devices tree, showing the representation of everything in one tree, not lots of little trees all over the place. So I was curious, was this done on purpose? If so, why? If not, any objection to me switching it over to using struct device properly? thanks, greg k-h From sean.hefty at intel.com Thu May 22 14:47:51 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 22 May 2008 14:47:51 -0700 Subject: [ofa-general] RE: question about drivers/infiniband/core/cm.c's kobject usage In-Reply-To: <20080522203403.GA27263@kroah.com> References: <20080522203403.GA27263@kroah.com> Message-ID: <000501c8bc55$7c2dd680$ec248686@amr.corp.intel.com> >So I was curious, was this done on purpose? If so, why? If not, any >objection to me switching it over to using struct device properly? It's entirely possible I have this wrong, but the intent is to export some infiniband communication management message counters and relate them to the corresponding ib_device/port. For example: /sys/class/infiniband_cm/<device>/<port>/<counter group>/<counter> (E.g. /sys/class/infiniband_cm/mthca0/1/cm_tx_msgs/req) If there's a better way to handle this, I have no objection to changing it. - Sean
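For comparison, a minimal sketch of the struct-device flavor Greg suggests, for one such counter. All names here are hypothetical (this is not the actual cm.c code), and it assumes the counter object is reachable through the device's drvdata:

	#include <linux/device.h>

	struct cm_port_counters {
		atomic_long_t tx_req;		/* REQs sent on this port */
	};

	static ssize_t cm_tx_req_show(struct device *dev,
				      struct device_attribute *attr, char *buf)
	{
		struct cm_port_counters *c = dev_get_drvdata(dev);

		return sprintf(buf, "%ld\n", atomic_long_read(&c->tx_req));
	}
	static DEVICE_ATTR(cm_tx_req, S_IRUGO, cm_tx_req_show, NULL);

	/* at port initialization, once the port's struct device is registered: */
	ret = device_create_file(port_dev, &dev_attr_cm_tx_req);

Attributes registered this way show up under the device's node in /sys/devices, with the /sys/class entries reduced to symlinks, i.e. the single-tree layout described above.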
From arlin.r.davis at intel.com Thu May 22 14:55:17 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Thu, 22 May 2008 14:55:17 -0700 Subject: [ofa-general] Multi-port testing with ibv_rdma_bw shows strange results Message-ID: <000f01c8bc56$85d1aee0$8bc3020a@amr.corp.intel.com> I have 2 servers (cst-53, cst-54) connected via one switch using a mlx4 dual port adapter. I did a quick test, using ibv_rdma_bw, across port 1 and then port 2. When running across port 1 the data seems to be split across port 1 and port 2; across port 2 the traffic is all on port 2 as expected. Any ideas? Can I trust perfquery results? Thanks, -arlin my configuration (OFED 1.3, RHEL5.1): cst-53: [root at cst-54 sbin]# ibstat CA 'mlx4_0' CA type: MT26418 Number of ports: 2 Firmware version: 2.4.938 Hardware version: a0 Node GUID: 0x0002c9030000a5b4 System image GUID: 0x0002c9030000a5b7 Port 1: State: Active Physical state: LinkUp Rate: 10 Base lid: 24 LMC: 0 SM lid: 3 Capability mask: 0x02510868 Port GUID: 0x0002c9030000a5b5 Port 2: State: Active Physical state: LinkUp Rate: 10 Base lid: 25 LMC: 0 SM lid: 3 Capability mask: 0x02510868 Port GUID: 0x0002c9030000a5b6 cst-54: [root at cst-53 fw-25408-rel-2_4_938]# ibstat CA 'mlx4_0' CA type: MT26418 Number of ports: 2 Firmware version: 2.4.938 Hardware version: a0 Node GUID: 0x0002c9030000a620 System image GUID: 0x0002c9030000a623 Port 1: State: Active Physical state: LinkUp Rate: 2 Base lid: 22 LMC: 0 SM lid: 3 Capability mask: 0x02510868 Port GUID: 0x0002c9030000a621 Port 2: State: Active Physical state: LinkUp Rate: 10 Base lid: 23 LMC: 0 SM lid: 3 Capability mask: 0x02510868 Port GUID: 0x0002c9030000a622 TEST results: ibv_rdma_bw on port 2 is working fine, all traffic on port 2: server: perfquery -R;/usr/bin/ib_rdma_bw -i 2;perfquery 24 1;perfquery 25 2 11621: | port=18515 | ib_port=2 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 | 11621: Local address: LID 0x19, QPN 0x16004a, PSN 0x42d3c6 RKey 0xc8002401 VAddr 0x002aaaaaf03000 11621: Remote address: LID 0x17, QPN 0x16004a, PSN 0x65136b, RKey 0x82002401 VAddr 0x002aaaab313000 # Port counters: Lid 24 port 1 PortSelect:......................1 XmtData:.........................0 RcvData:.........................0 XmtPkts:.........................0 RcvPkts:.........................0 # Port counters: Lid 25 port 2 PortSelect:......................2 XmtData:.........................62946 RcvData:.........................116072668 XmtPkts:.........................7279 RcvPkts:.........................224274 client: perfquery -R;/usr/bin/ib_rdma_bw -i 2 cst-54;perfquery 22 1;perfquery 23 2 9190: | port=18515 | ib_port=2 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 | 9190: Local address: LID 0x17, QPN 0x16004a, PSN 0x65136b RKey 0x82002401 VAddr 0x002aaaab313000 9190: Remote address: LID 0x19, QPN 0x16004a, PSN 0x42d3c6, RKey 0xc8002401 VAddr 0x002aaaaaf03000 9190: Bandwidth peak (#0 to #983): 937.621 MB/sec 9190: Bandwidth average: 937.614 MB/sec 9190: Service Demand peak (#0 to #983): 2077 cycles/KB 9190: Service Demand Avg : 2077 cycles/KB # Port counters: Lid 22 port 1 XmtData:.........................0 RcvData:.........................0 XmtPkts:.........................0 RcvPkts:.........................0 # Port counters: Lid 23 port 2 XmtData:.........................116075478 RcvData:.........................66442 XmtPkts:.........................224298 RcvPkts:.........................7318 port 1 with strange results - traffic split between port 1 and 2: client: [root at cst-53 fw-25408-rel-2_4_938]# perfquery -R;/usr/bin/ib_rdma_bw -i 1 cst-54;perfquery 22 1;perfquery 23 2 9144: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 | 9144: Local address: LID 0x16, QPN 0x10004a, PSN 0xd125be RKey 0x34002401 VAddr 0x002aaaab313000 9144: Remote address: LID 0x18, QPN 0x10004a, PSN 0x5cb52c, RKey 0x7a002401 VAddr 0x002aaaaaf03000 9144: Bandwidth peak (#0 to #978): 234.634 MB/sec 9144: Bandwidth average: 234.634 MB/sec 9144: Service Demand peak (#0 to #978): 8303 cycles/KB 9144: Service Demand Avg : 8303 cycles/KB # Port counters: Lid 22 port 1 XmtData:.........................16580072 RcvData:.........................7072 XmtPkts:.........................32001 RcvPkts:.........................1001 # Port counters: Lid 23 port 2 XmtData:.........................82915046 RcvData:.........................51692 XmtPkts:.........................160292 RcvPkts:.........................5308 server: [root at cst-54 sbin]# perfquery -R;/usr/bin/ib_rdma_bw -i 1 ;perfquery 24 1;perfquery 25 2 11586: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 | 11586: Local address: LID 0x18, QPN 0x14004a, PSN 0x5e8160 RKey 0xae002401 VAddr 0x002aaaaaf03000 11586: Remote address: LID 0x16, QPN 0x14004a, PSN 0xf0dd70, RKey 0x68002401 VAddr 0x002aaaab313000 # Port counters: Lid 24 port 1 XmtData:.........................7000 RcvData:.........................16580000 XmtPkts:.........................1000 RcvPkts:.........................32000 # Port counters: Lid 25 port 2 XmtData:.........................55802 RcvData:.........................99492206 XmtPkts:.........................6277 RcvPkts:.........................192268 ibtracert: cst-53 port 1 to cst-54 port 1 [root at cst-54 sbin]# ibtracert 22 24 From ca {0x0002c9030000a620} portnum 1 lid 22-22 "cst-53 HCA-1" [1] -> switch port {0x000b8cffff004046}[2] lid 2-2 "MT47396 Infiniscale-III Mellanox Technologies" [4] -> ca port {0x0002c9030000a5b5}[1] lid 24-24 "cst-54 HCA-1" To ca {0x0002c9030000a5b4} portnum 1 lid 24-24 "cst-54 HCA-1" cst-53 port 2 to cst-54 port 2 [root at cst-54 sbin]# ibtracert 23 25 From ca {0x0002c9030000a620} portnum 2 lid 23-23 "cst-53 HCA-1" [2] -> switch port {0x000b8cffff004046}[3] lid 2-2 "MT47396 Infiniscale-III Mellanox Technologies" [1] -> ca port {0x0002c9030000a5b6}[2] lid 25-25 "cst-54 HCA-1" To ca {0x0002c9030000a5b4} portnum 2 lid 25-25 "cst-54 HCA-1" -------------- next part -------------- An HTML attachment was scrubbed... URL: From greg at kroah.com Thu May 22 14:58:12 2008 From: greg at kroah.com (Greg KH) Date: Thu, 22 May 2008 14:58:12 -0700 Subject: [ofa-general] Re: question about drivers/infiniband/core/cm.c's kobject usage In-Reply-To: <000501c8bc55$7c2dd680$ec248686@amr.corp.intel.com> References: <20080522203403.GA27263@kroah.com> <000501c8bc55$7c2dd680$ec248686@amr.corp.intel.com> Message-ID: <20080522215812.GA3366@kroah.com> On Thu, May 22, 2008 at 02:47:51PM -0700, Sean Hefty wrote: > >So I was curious, was this done on purpose? If so, why? If not, any > >objection to me switching it over to using struct device properly? > > It's entirely possible I have this wrong, but the intent is to export some > infiniband communication management message counters and relate them to the > corresponding ib_device/port. For example: > > /sys/class/infiniband_cm/<device>/<port>/<counter group>/<counter> > > (E.g. /sys/class/infiniband_cm/mthca0/1/cm_tx_msgs/req) > > If there's a better way to handle this, I have no objection to changing it. Yes, just hang all of the stuff off of the original struct device. That seems like it would be much simpler.
thanks, greg k-h From arlin.r.davis at intel.com Thu May 22 15:42:36 2008 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Thu, 22 May 2008 15:42:36 -0700 Subject: [ofa-general] Multi-port testing with ibv_rdma_bw shows strange results In-Reply-To: <000f01c8bc56$85d1aee0$8bc3020a@amr.corp.intel.com> References: <000f01c8bc56$85d1aee0$8bc3020a@amr.corp.intel.com> Message-ID: never mind. my use of perfquery to reset counters was not correct. ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Davis, Arlin R Sent: Thursday, May 22, 2008 2:55 PM To: [ofa_general] Subject: [ofa-general] Multi-port testing with ibv_rdma_bw shows strange results -------------- next part -------------- An HTML attachment was scrubbed... URL: From weiny2 at llnl.gov Thu May 22 15:47:02 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 22 May 2008 15:47:02 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080522154702.430cdef7.weiny2@llnl.gov> I guess my question is "does saquery need this to talk to the SA?" I am assuming the answer is "yes". I noticed this in the spec section 14.4.7 page 890: "The SM Key used for SM authentication is independent of the SM Key in the SA header used for SA authentication." Does this mean there could be 2 SM_Key values in use? Ira On Thu, 22 May 2008 08:10:29 -0700 Hal Rosenstock wrote: > On Thu, 2008-05-22 at 17:56 +0300, Sasha Khapyorsky wrote: > > On 07:46 Thu 22 May , Hal Rosenstock wrote: > > > Sasha, > > > > > > On Thu, 2008-05-22 at 16:53 +0300, Sasha Khapyorsky wrote: > > > > This adds possibility to specify SM_Key value with saquery. It should > > > > work with queries where OSM_DEFAULT_SM_KEY was used. > > > > > > I think this starts down a slippery slope and perhaps bad precedent for > > > MKey as well. I know this is useful as a debug tool but compromises what > > > purports as "security" IMO as this means the keys need to be too widely > > > known. > > > > When different than OSM_DEFAULT_SM_KEY value is configured on OpenSM > > > side an user may know this or not, in later case saquery will not work > > > (just like now). I don't see a hole. > > > > I think it will tend towards proliferation of keys which will defeat any > > security/trust. The idea of SMKey was to keep it private between SMs. > > This is now spreading it wider IMO. I'm sure other patches will follow > > in the same vein once an MKey manager exists. > > > > -- Hal > > > > > Sasha >
From freyp at student.ethz.ch Fri May 23 00:43:16 2008 From: freyp at student.ethz.ch (Philip Frey) Date: Fri, 23 May 2008 09:43:16 +0200 Subject: [ofa-general] Multithreaded iWARP application Message-ID: <48367594.9010401@student.ethz.ch> Hello, I have a peer-to-peer like application where on each peer there is a thread listening for connection requests. The peers can at the same time also actively connect to other peers. How can the concept of the "rdma_event_channel" be applied to this scenario? Until now I only had one "rdma_event_channel", but with the thread this leads to a race condition where the thread waits for a "RDMA_CM_EVENT_CONNECT_REQUEST" while the peer tries to actively open a connection and is awaiting "RDMA_CM_EVENT_ADDR_RESOLVED" etc. One solution would be to create a new "rdma_event_channel" for each active connection. But what happens at the accepting side? On the accepting "rdma_event_channel" (which is now exclusively used for that purpose), I get a new "rdma_cm_id" for the connection request from the respective event. Is it now possible to create a new "rdma_event_channel" for that new "rdma_cm_id"? If not, where does the "RDMA_CM_EVENT_ESTABLISHED" event go? (The questionable line is marked with "<-- HERE???" in the pseudo code below.)
In pseudo code:

/** connecting part **/
struct rdma_cm_id *id;
struct rdma_event_channel *channel;
struct rdma_cm_event *event;

channel = rdma_create_event_channel();
rdma_create_id(channel, &id, context, RDMA_PS_TCP);
rdma_resolve_addr(id, src_addr, dst_addr, timeout);
rdma_get_cm_event(channel, &event);     //expecting ADDR_RESOLVED
rdma_ack_cm_event(event);
//same for rdma_resolve_route()         //expecting ROUTE_RESOLVED
rdma_connect(id, conn_param);
rdma_get_cm_event(channel, &event);     //expecting ESTABLISHED
rdma_ack_cm_event(event);
... do RDMA here ...
//disconnect

/** accepting thread **/
struct rdma_cm_id *listen_id, *id;
struct rdma_event_channel *listen_channel;
struct rdma_cm_event *listen_event, *event;

listen_channel = rdma_create_event_channel();
rdma_create_id(listen_channel, &listen_id, context, RDMA_PS_TCP);
rdma_bind_addr(listen_id, addr);
rdma_listen(listen_id, backlog);
while (1) {
    rdma_get_cm_event(listen_channel, &listen_event);  //expecting CONNECT_REQUEST
    id = listen_event->id;              //save the new id before acking
    rdma_ack_cm_event(listen_event);
    id->channel = rdma_create_event_channel();         <-- HERE???
    rdma_accept(id, conn_param);
    rdma_get_cm_event(id->channel, &event);            //expecting ESTABLISHED
    rdma_ack_cm_event(event);
    ... do RDMA here ...
    //await disconnect
}

Many thanks for your advice and kind regards!

Philip
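One way to handle the line marked HERE, assuming a librdmacm recent enough to provide rdma_migrate_id() (the rdma_* calls below are the real librdmacm API; the variable names are made up, and error handling is omitted): rather than assigning id->channel directly, migrate the passively created id onto its own event channel before accepting, so the ESTABLISHED (and later DISCONNECTED) events for that connection arrive on a per-connection channel:

	struct rdma_cm_id *conn_id = listen_event->id;	/* saved before the ack */
	struct rdma_event_channel *conn_ch;

	conn_ch = rdma_create_event_channel();
	/* detach the new id from the shared listen channel */
	if (rdma_migrate_id(conn_id, conn_ch) == 0) {
		rdma_accept(conn_id, conn_param);
		rdma_get_cm_event(conn_ch, &event);	/* expecting ESTABLISHED */
		rdma_ack_cm_event(event);
	}

The listening thread's channel then only ever sees CONNECT_REQUEST events, which also avoids the race with the actively connecting side described above.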
From sashak at voltaire.com Fri May 23 01:49:41 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 23 May 2008 11:49:41 +0300 Subject: [ofa-general] [PATCH] opensm/scripts: remove not used opensmd template In-Reply-To: <20080519170624.GJ4616@sashak.voltaire.com> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080512144541.3879de40.weiny2@llnl.gov> <20080519170624.GJ4616@sashak.voltaire.com> Message-ID: <20080523084941.GA4164@sashak.voltaire.com> Remove not used opensmd startup script template. Signed-off-by: Sasha Khapyorsky --- opensm/configure.in | 2 +- opensm/scripts/opensmd.in | 469 --------------------------------------------- 2 files changed, 1 insertions(+), 470 deletions(-) delete mode 100755 opensm/scripts/opensmd.in diff --git a/opensm/configure.in b/opensm/configure.in index 2ae8bd0..e079065 100644 --- a/opensm/configure.in +++ b/opensm/configure.in @@ -207,7 +207,7 @@ OPENIB_APP_OSMV_CHECK_LIB # overrides. CFLAGS=$ac_env_CFLAGS_value -AC_CONFIG_FILES([man/opensm.8 scripts/opensm.init scripts/redhat-opensm.init scripts/opensmd scripts/sldd.sh]) +AC_CONFIG_FILES([man/opensm.8 scripts/opensm.init scripts/redhat-opensm.init scripts/sldd.sh]) dnl Create the following Makefiles AC_OUTPUT([include/opensm/osm_version.h Makefile include/Makefile complib/Makefile libvendor/Makefile opensm/Makefile osmeventplugin/Makefile osmtest/Makefile opensm.spec]) diff --git a/opensm/scripts/opensmd.in b/opensm/scripts/opensmd.in deleted file mode 100755 index 7e5d868..0000000 --- a/opensm/scripts/opensmd.in +++ /dev/null @@ -1,469 +0,0 @@ -#!/bin/bash - -# -# Copyright (c) 2006 Mellanox Technologies. All rights reserved. -# -# This Software is licensed under one of the following licenses: -# -# 1) under the terms of the "Common Public License 1.0" a copy of which is -# available from the Open Source Initiative, see -# http://www.opensource.org/licenses/cpl.php. -# -# 2) under the terms of the "The BSD License" a copy of which is -# available from the Open Source Initiative, see -# http://www.opensource.org/licenses/bsd-license.php. -# -# 3) under the terms of the "GNU General Public License (GPL) Version 2" a -# copy of which is available from the Open Source Initiative, see -# http://www.opensource.org/licenses/gpl-license.php. -# -# Licensee has the right to choose one of the above licenses. -# -# Redistributions of source code must retain the above copyright -# notice and one of the license notices. -# -# Redistributions in binary form must reproduce both the above copyright -# notice, one of the license notices in the documentation -# and/or other materials provided with the distribution. -# -# -# processname: @sbindir@/opensm -# config: @OPENSM_CONFIG_DIR@/opensm.conf -# pidfile: /var/run/opensm.pid - -prefix=@prefix@ -exec_prefix=@exec_prefix@ - -CONFIG=@OPENSM_CONFIG_DIR@/opensm.conf - -if [ ! -f $CONFIG ]; then - exit 0 -fi - -. $COFNIG - -prog=@sbindir@/opensm -bin=${prog##*/} - -# Handover daemon for updating guid2lid cache file -sldd_prog=@sbindir@/sldd.sh -sldd_bin=${sldd_prog##*/} -sldd_pid_file=/var/run/sldd.pid - -# Only use ONBOOT option if called by a runlevel directory. -# Therefore determine the base, follow a runlevel link name ... -base=${0##*/} -link=${base#*[SK][0-9][0-9]} -# ... and compare them -if [ $link == $base ] ; then - ONBOOT=yes -fi - -ACTION=$1 -shift - -if [ ! -x $prog ]; then - echo "OpenSM not installed" - exit 1 -fi - -# Check if OpenSM configured to start automatically -if [[ -z $ONBOOT || "$ONBOOT" != "yes" ]]; then - exit 0 -fi - -if ( grep -i 'SuSE Linux' /etc/issue >/dev/null 2>&1 ); then - if [ -n "$INIT_VERSION" ] ; then - # MODE=onboot - if LANG=C egrep -L "^ONBOOT=['\"]?[Nn][Oo]['\"]?" ${CONFIG} > /dev/null ; then - exit 0 - fi - fi -fi - -if [ -f /etc/init.d/functions ]; then - . /etc/init.d/functions -fi - -# Setting OpenSM start parameters -PID_FILE=/var/run/${bin}.pid -touch $PID_FILE - -if [[ -z $DEBUG || "$DEBUG" == "none" ]]; then - DEBUG_FLAG="" -else - DEBUG_FLAG="-d ${DEBUG}" -fi - -if [[ -z $LMC || "$LMC" == "0" ]]; then - LMC_FLAG="" -else - LMC_FLAG="-l ${LMC}" -fi - -if [[ -z $MAXSMPS || "$MAXSMPS" == "4" ]]; then - MAXSMPS_FLAG="" -else - MAXSMPS_FLAG="-maxsmps ${MAXSMPS}" -fi - -if [[ -z $REASSIGN_LIDS || "$REASSIGN_LIDS" == "no" ]]; then - REASSIGN_LIDS_FLAG="" -else - REASSIGN_LIDS_FLAG="-r" -fi - -if [[ -z $SWEEP || "$SWEEP" == "10" ]]; then - SWEEP_FLAG="" -else - SWEEP_FLAG="-s ${SWEEP}" -fi - -if [[ -z $TIMEOUT || "$TIMEOUT" == "100" ]]; then - TIMEOUT_FLAG="" -else - TIMEOUT_FLAG="-t ${TIMEOUT}" -fi - -if [[ -z $OSM_LOG || "$OSM_LOG" == "/var/log/opensm.log" ]]; then - OSM_LOG_FLAG="" -else - OSM_LOG_FLAG="-f ${OSM_LOG}" -fi - -if [[ -z $VERBOSE || "$VERBOSE" == "none" ]]; then - VERBOSE_FLAG="" -else - VERBOSE_FLAG="${VERBOSE}" -fi - -if [[ -z $UPDN || "$UPDN" == "off" ]]; then - UPDN_FLAG="" -else - UPDN_FLAG="-u" -fi - -if [[ -z $GUID_FILE || "$GUID_FILE" == "none" ]]; then - GUID_FILE_FLAG="" -else - GUID_FILE_FLAG="-a ${GUID_FILE}" -fi - -if [[ -z $GUID || "$GUID" == "none" ]]; then - GUID_FLAG="" -else - GUID_FLAG="-g ${GUID}" -fi - -if [[ -z $HONORE_GUID2LID || "$HONORE_GUID2LID" == "none" ]]; then - HONORE_GUID2LID_FLAG="" -else - HONORE_GUID2LID_FLAG="--honor_guid2lid" -fi - -if [[ -n "${OSM_HOSTS}" && $(echo -n ${OSM_HOSTS} | wc -w | tr -d '[:space:]') -gt 1 ]]; then - HONORE_GUID2LID_FLAG="--honor_guid2lid" -fi - - -if [[ -z $CACHE_OPTIONS || "$CACHE_OPTIONS" == "none" ]]; then - CACHE_OPTIONS_FLAG="" -else - CACHE_OPTIONS_FLAG="--cache-options" -fi - - -if [ -z $PORT_NUM ]; then - PORT_FLAG=1 -else - PORT_FLAG="${PORT_NUM}" -fi - -
-######################################################################### -# Get a sane screen width -[ -z "${COLUMNS:-}" ] && COLUMNS=80 - -[ -z "${CONSOLETYPE:-}" ] && [ -x /sbin/consoletype ] && CONSOLETYPE="`/sbin/consoletype`" - -if [ -f /etc/sysconfig/i18n -a -z "${NOLOCALE:-}" ] ; then - . /etc/sysconfig/i18n - if [ "$CONSOLETYPE" != "pty" ]; then - case "${LANG:-}" in - ja_JP*|ko_KR*|zh_CN*|zh_TW*) - export LC_MESSAGES=en_US - ;; - *) - export LANG - ;; - esac - else - export LANG - fi -fi - -# Read in our configuration -if [ -z "${BOOTUP:-}" ]; then - if [ -f /etc/sysconfig/init ]; then - . /etc/sysconfig/init - else - # This all seem confusing? Look in /etc/sysconfig/init, - # or in /usr/doc/initscripts-*/sysconfig.txt - BOOTUP=color - RES_COL=60 - MOVE_TO_COL="echo -en \\033[${RES_COL}G" - SETCOLOR_SUCCESS="echo -en \\033[1;32m" - SETCOLOR_FAILURE="echo -en \\033[1;31m" - SETCOLOR_WARNING="echo -en \\033[1;33m" - SETCOLOR_NORMAL="echo -en \\033[0;39m" - LOGLEVEL=1 - fi - if [ "$CONSOLETYPE" = "serial" ]; then - BOOTUP=serial - MOVE_TO_COL= - SETCOLOR_SUCCESS= - SETCOLOR_FAILURE= - SETCOLOR_WARNING= - SETCOLOR_NORMAL= - fi -fi - -if [ "${BOOTUP:-}" != "verbose" ]; then - INITLOG_ARGS="-q" -else - INITLOG_ARGS= -fi - -echo_success() { - echo -n $@ - [ "$BOOTUP" = "color" ] && $MOVE_TO_COL - echo -n "[ " - [ "$BOOTUP" = "color" ] && $SETCOLOR_SUCCESS - echo -n $"OK" - [ "$BOOTUP" = "color" ] && $SETCOLOR_NORMAL - echo -n " ]" - echo -e "\r" - return 0 -} - -echo_failure() { - echo -n $@ - [ "$BOOTUP" = "color" ] && $MOVE_TO_COL - echo -n "[" - [ "$BOOTUP" = "color" ] && $SETCOLOR_FAILURE - echo -n $"FAILED" - [ "$BOOTUP" = "color" ] && $SETCOLOR_NORMAL - echo -n "]" - echo -e "\r" - return 1 -} - - -######################################################################### - -# Check if $pid (could be plural) are running -checkpid() { - local i - - for i in $* ; do - [ -d "/proc/$i" ] || return 1 - done - return 0 -} - -start_sldd() -{ - if [ -f $sldd_pid_file ]; then - local line p - read line < $sldd_pid_file - for p in $line ; do - [ -z "${p//[0-9]/}" -a -d "/proc/$p" ] && sldd_pid="$sldd_pid $p" - done - fi - - if [ -z "$sldd_pid" ]; then - sldd_pid=`pidof -x $sldd_bin` - fi - - if [ -n "${sldd_pid:-}" ] ; then - kill -9 ${sldd_pid} > /dev/null 2>&1 - fi - - $sldd_prog > /dev/null 2>&1 & - sldd_pid=$! - - echo ${sldd_pid} > $sldd_pid_file - # Sleep is needed in order to update local gid2lid cache file before running opensm - sleep 3 -} - -stop_sldd() -{ - if [ -f $sldd_pid_file ]; then - local line p - read line < $sldd_pid_file - for p in $line ; do - [ -z "${p//[0-9]/}" -a -d "/proc/$p" ] && sldd_pid="$sldd_pid $p" - done - fi - - if [ -z "$sldd_pid" ]; then - sldd_pid=`pidof -x $sldd_bin` - fi - - if [ -n "${sldd_pid:-}" ] ; then - kill -15 ${sldd_pid} > /dev/null 2>&1 - fi - -} - -start() -{ - if [ ! -d /sys/class/infiniband ]; then - echo - echo "Please load Infiniband driver first" - echo - return 2 - fi - - local OSM_PID= - - if [ -f $PID_FILE ]; then - local line p - read line < $PID_FILE - for p in $line ; do - [ -z "${p//[0-9]/}" -a -d "/proc/$p" ] && pid="$pid $p" - done - fi - - if [ -z "$pid" ]; then - pid=`pidof -o $$ -o $PPID -o %PPID -x $bin` - fi - - if [ -n "${pid:-}" ] ; then - echo $"${bin} (pid $pid) is already running..." 
- else - - if [ -n "${HONORE_GUID2LID_FLAG}" ]; then - # Run sldd daemod - start_sldd - fi - - # Start opensm - local START_FLAGS="" - for flag in "$DEBUG_FLAG" "$LMC_FLAG" "$MAXSMPS_FLAG" "$REASSIGN_LIDS_FLAG" "$SWEEP_FLAG" "$TIMEOUT_FLAG" "$OSM_LOG_FLAG" "$VERBOSE_FLAG" "$UPDN_FLAG" "$GUID_FILE_FLAG" "$GUID_FLAG" "$HONORE_GUID2LID_FLAG" "$CACHE_OPTIONS_FLAG" - do - [ ! -z "$flag" ] && START_FLAGS="$START_FLAGS $flag" - done - - echo $PORT_FLAG | $prog $START_FLAGS > /dev/null 2>&1 & - OSM_PID=$! - echo $OSM_PID > $PID_FILE - sleep 1 - checkpid $OSM_PID - RC=$? - [ $RC -eq 0 ] && echo_success "$bin start" || echo_failure "$bin start" - - fi -return $RC -} - -stop() -{ - local pid= - local pid1= - local pid2= - - # Stop sldd daemon - stop_sldd - - if [ -f $PID_FILE ]; then - local line p - read line < $PID_FILE - for p in $line ; do - [ -z "${p//[0-9]/}" -a -d "/proc/$p" ] && pid1="$pid1 $p" - done - fi - - pid2=`pidof -o $$ -o $PPID -o %PPID -x $bin` - - pid=`echo "$pid1 $pid2" | sed -e 's/\ /\n/g' | sort -n | uniq | sed -e 's/\n/\ /g'` - - if [ -n "${pid:-}" ] ; then - # Kill opensm - kill -15 $pid > /dev/null 2>&1 - cnt=0 - while [ $cnt -lt 6 ]; do echo -n "."; sleep 1; let cnt++;done - - for p in $pid - do - while checkpid $p ; do - kill -KILL $p > /dev/null 2>&1 - echo -n "." - sleep 1 - done - done - echo - checkpid $pid - RC=$? - [ $RC -eq 0 ] && echo_failure "$bin shutdown" || echo_success "$bin shutdown" - RC=$((! $RC)) - else - echo_failure "$bin shutdown" - RC=1 - fi - - # Remove pid file if any. - rm -f $PID_FILE -return $RC -} - -status() -{ - local pid - - # First try "pidof" - pid=`pidof -o $$ -o $PPID -o %PPID -x ${bin}` - if [ -n "$pid" ]; then - echo $"${bin} (pid $pid) is running..." - return 0 - fi - - # Next try "/var/run/opensm.pid" files - if [ -f $PID_FILE ] ; then - read pid < $PID_FILE - if [ -n "$pid" ]; then - echo $"${bin} dead but pid file $PID_FILE exists" - return 1 - fi - fi - echo $"${bin} is stopped" - return 3 -} - - - -case $ACTION in - start) - start - ;; - stop) - stop - ;; - restart) - stop - start - ;; - status) - status - ;; - *) - echo - echo "Usage: `basename $0` {start|stop|restart|status}" - echo - exit 1 - ;; -esac - -RC=$? -exit $RC -- 1.5.5.1.178.g1f811 From sashak at voltaire.com Fri May 23 01:50:10 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 23 May 2008 11:50:10 +0300 Subject: [ofa-general] [PATCH] opensm/scripts: remove opensm.conf usage In-Reply-To: <20080523084941.GA4164@sashak.voltaire.com> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080512144541.3879de40.weiny2@llnl.gov> <20080519170624.GJ4616@sashak.voltaire.com> <20080523084941.GA4164@sashak.voltaire.com> Message-ID: <20080523085010.GB4164@sashak.voltaire.com> Remove opensm.conf usage - startup script configuration will be replaced soon by OpenSM's opensm.conf. Signed-off-by: Sasha Khapyorsky --- opensm/scripts/opensm.sysconfig | 4 +- opensm/scripts/redhat-opensm.init.in | 103 ++-------------------------------- opensm/scripts/sldd.sh.in | 3 +- 3 files changed, 8 insertions(+), 102 deletions(-) diff --git a/opensm/scripts/opensm.sysconfig b/opensm/scripts/opensm.sysconfig index d3fba93..2cc02e6 100644 --- a/opensm/scripts/opensm.sysconfig +++ b/opensm/scripts/opensm.sysconfig @@ -1,2 +1,2 @@ -# If you want to pass any options to OpenSM, set them here. 
-OPTIONS= +# It will be used for sldd.sh +OSM_HOSTS="" diff --git a/opensm/scripts/redhat-opensm.init.in b/opensm/scripts/redhat-opensm.init.in index 5cc9079..5526e44 100755 --- a/opensm/scripts/redhat-opensm.init.in +++ b/opensm/scripts/redhat-opensm.init.in @@ -38,7 +38,7 @@ # $Id: openib-1.0-opensm.init,v 1.5 2006/08/02 18:18:23 dledford Exp $ # # processname: @sbindir@/opensm -# config: @OPENSM_CONFIG_DIR@/opensm.conf +# config: @sysconfdir@/sysconfig/opensm.conf # pidfile: /var/run/opensm.pid prefix=@prefix@ @@ -46,7 +46,7 @@ exec_prefix=@exec_prefix@ . /etc/rc.d/init.d/functions -CONFIG=@OPENSM_CONFIG_DIR@/opensm.conf +CONFIG=@sysconfdir@/sysconfig/opensm.conf if [ ! -f $CONFIG ]; then exit 0 fi @@ -67,97 +67,10 @@ ACTION=$1 PID_FILE=/var/run/${bin}.pid touch $PID_FILE -if [[ -z $DEBUG || "$DEBUG" == "none" ]]; then - DEBUG_FLAG="" -else - DEBUG_FLAG="-d ${DEBUG}" -fi - -if [[ -z $LMC || "$LMC" == "0" ]]; then - LMC_FLAG="" -else - LMC_FLAG="-l ${LMC}" -fi - -if [[ -z $MAXSMPS || "$MAXSMPS" == "4" ]]; then - MAXSMPS_FLAG="" -else - MAXSMPS_FLAG="-maxsmps ${MAXSMPS}" -fi - -if [[ -z $REASSIGN_LIDS || "$REASSIGN_LIDS" == "no" ]]; then - REASSIGN_LIDS_FLAG="" -else - REASSIGN_LIDS_FLAG="-r" -fi - -if [[ -z $SWEEP || "$SWEEP" == "10" ]]; then - SWEEP_FLAG="" -else - SWEEP_FLAG="-s ${SWEEP}" -fi - -if [[ -z $TIMEOUT || "$TIMEOUT" == "100" ]]; then - TIMEOUT_FLAG="" -else - TIMEOUT_FLAG="-t ${TIMEOUT}" -fi - -if [[ -z $OSM_LOG || "$OSM_LOG" == "/tmp/osm.log" ]]; then - OSM_LOG_FLAG="" -else - OSM_LOG_FLAG="-f ${OSM_LOG}" -fi - -if [[ -z $VERBOSE || "$VERBOSE" == "none" ]]; then - VERBOSE_FLAG="" -else - VERBOSE_FLAG="${VERBOSE}" -fi - -if [[ -z $UPDN || "$UPDN" == "off" ]]; then - UPDN_FLAG="" -else - UPDN_FLAG="-u" -fi - -if [[ -z $GUID_FILE || "$GUID_FILE" == "none" ]]; then - GUID_FILE_FLAG="" -else - GUID_FILE_FLAG="-a ${GUID_FILE}" -fi - -if [[ -z $GUID || "$GUID" == "none" ]]; then - GUID_FLAG="" -else - GUID_FLAG="-g ${GUID}" -fi - -if [[ -z $HONORE_GUID2LID || "$HONORE_GUID2LID" == "none" ]]; then - HONORE_GUID2LID_FLAG="" -else - HONORE_GUID2LID_FLAG="--honor_guid2lid" -fi - if [[ -n "${OSM_HOSTS}" && $(echo -n ${OSM_HOSTS} | wc -w | tr -d '[:space:]') -gt 1 ]]; then - HONORE_GUID2LID_FLAG="--honor_guid2lid" + HONORE_GUID2LID="--honor_guid2lid" fi - -if [[ -z $CACHE_OPTIONS || "$CACHE_OPTIONS" == "none" ]]; then - CACHE_OPTIONS_FLAG="" -else - CACHE_OPTIONS_FLAG="--cache-options" -fi - - -if [ -z $PORT_NUM ]; then - PORT_FLAG=1 -else - PORT_FLAG="${PORT_NUM}" -fi - - ######################################################################### start_sldd() @@ -228,20 +141,14 @@ start() echo $"${bin} (pid $pid) is already running..." else - if [ -n "${HONORE_GUID2LID_FLAG}" ]; then + if [ -n "${HONORE_GUID2LID}" ]; then # Run sldd daemod start_sldd fi # Start opensm - local START_FLAGS="" - for flag in "$DEBUG_FLAG" "$LMC_FLAG" "$MAXSMPS_FLAG" "$REASSIGN_LIDS_FLAG" "$SWEEP_FLAG" "$TIMEOUT_FLAG" "$OSM_LOG_FLAG" "$VERBOSE_FLAG" "$UPDN_FLAG" "$GUID_FILE_FLAG" "$GUID_FLAG" "$HONORE_GUID2LID_FLAG" "$CACHE_OPTIONS_FLAG" - do - [ ! 
-z "$flag" ] && START_FLAGS="$START_FLAGS $flag" - done - echo -n "Starting IB Subnet Manager" - echo $PORT_FLAG | $prog $START_FLAGS > /dev/null 2>&1 & + $prog --daemon ${HONORE_GUID2LID} > /dev/null cnt=0; alive=0 while [ $cnt -lt 6 -a $alive -ne 1 ]; do echo -n "."; diff --git a/opensm/scripts/sldd.sh.in b/opensm/scripts/sldd.sh.in index a6f660f..8162c5c 100755 --- a/opensm/scripts/sldd.sh.in +++ b/opensm/scripts/sldd.sh.in @@ -41,10 +41,9 @@ prefix=@prefix@ exec_prefix=@exec_prefix@ -# config: @sysconfdir@/ofa/opensm.conf +# config: @sysconfdir@/sysconfig/opensm.conf [ -f @sysconfdir@/sysconfig/opensm.conf ] && CONFIG=@sysconfdir@/sysconfig/opensm.conf -[ -f @sysconfdir@/ofa/opensm.conf ] && CONFIG=@sysconfdir@/ofa/opensm.conf SLDD_DEBUG=${SLDD_DEBUG:-0} -- 1.5.5.1.178.g1f811 From vlad at lists.openfabrics.org Fri May 23 03:08:56 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 23 May 2008 03:08:56 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20080523-0200 daily build status Message-ID: <20080523100856.DD4F7E60D03@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.22 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed
on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Passed on ppc64 with linux-2.6.19 Failed: From sashak at voltaire.com Fri May 23 03:06:34 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 23 May 2008 13:06:34 +0300 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080523100634.GD4164@sashak.voltaire.com> On 08:10 Thu 22 May , Hal Rosenstock wrote: > > I think it will tend towards proliferation of keys which will defeat any > security/trust. The idea of SMKey was to keep it private between SMs. > This is now spreading it wider IMO. Probably the original idea was different, but now in the IBA spec knowing a valid SM_Key is mandatory for privileged SA clients (which need to get the whole list of MCMemberRecord, ServiceInfo, etc.). Sasha From sashak at voltaire.com Fri May 23 03:25:57 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 23 May 2008 13:25:57 +0300 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080522154702.430cdef7.weiny2@llnl.gov> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080522154702.430cdef7.weiny2@llnl.gov> Message-ID: <20080523102557.GE4164@sashak.voltaire.com> On 15:47 Thu 22 May , Ira Weiny wrote: > I guess my question is "does saquery need this to talk to the SA?" > > I am assuming the answer is "yes". > > I noticed this in the spec section 14.4.7 page 890: > > "The SM Key used for SM authentication is independent of the SM Key in the > SA header used for SA authentication." > > Does this mean there could be 2 SM_Key values in use? At least I see nothing in the spec against this. Also it is stated explicitly that the validity of non-zero values is vendor-defined. Sasha From sashak at voltaire.com Fri May 23 03:35:32 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 23 May 2008 13:35:32 +0300 Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts with error if not root In-Reply-To: <1211469467.18236.196.camel@hrosenstock-ws.xsigo.com> References: <48358DF8.2060603@llnl.gov> <1211469467.18236.196.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080523103532.GA4640@sashak.voltaire.com> On 08:17 Thu 22 May , Hal Rosenstock wrote: > On Thu, 2008-05-22 at 08:15 -0700, Timothy A. Meier wrote: > > Sasha, > > > > Trivial patch to enforce root for these perl scripts. More importantly, > > doesn't silently fail if not root, and returns an error code. > > Should these enforce root or be based on udev permissions for umad which > default to root? I would ask the same question as Hal did. What is wrong with how it works now? On some systems access to the files could be arranged for group members, or ibnetdiscover, which is used as the engine for many of these scripts, could be made suid/sgid. This check will break there.
Sasha From hrosenstock at xsigo.com Fri May 23 04:07:35 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 23 May 2008 04:07:35 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080522154702.430cdef7.weiny2@llnl.gov> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080522154702.430cdef7.weiny2@llnl.gov> Message-ID: <1211540855.13185.71.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-22 at 15:47 -0700, Ira Weiny wrote: > I guess my question is "does saquery need this to talk to the SA?" > > I am assuming the answer is "yes". It depends on whether trusted operations need to be supported or not. A normal node has no need for trusted operations. There was a reason why the additional information was hidden with a key. It allows a malicious user to affect not just his node but the subnet. As I mentioned, this starts to be a slippery slope with the management keys. I think a better approach when a non-default key is in place is to support this via the OpenSM console, as OpenSM knows all the keys it's supposed to. > I noticed this in the spec section 14.4.7 page 890: > > "The SM Key used for SM authentication is independent of the SM Key in the > SA header used for SA authentication." > > Does this mean there could be 2 SM_Key values in use? This was a clarification added at IBA 1.2.1. The SA SMKey is really an SA Key. This lack of separation is a limitation in the current OpenSM implementation. -- Hal > Ira > > > On Thu, 22 May 2008 08:10:29 -0700 > Hal Rosenstock wrote: > > > On Thu, 2008-05-22 at 17:56 +0300, Sasha Khapyorsky wrote: > > > On 07:46 Thu 22 May , Hal Rosenstock wrote: > > > > Sasha, > > > > > > > > On Thu, 2008-05-22 at 16:53 +0300, Sasha Khapyorsky wrote: > > > > > This adds possibility to specify SM_Key value with saquery. It should > > > > > work with queries where OSM_DEFAULT_SM_KEY was used. > > > > > > > > I think this starts down a slippery slope and perhaps bad precedent for > > > > MKey as well. I know this is useful as a debug tool but compromises what > > > > purports as "security" IMO as this means the keys need to be too widely > > > > known. > > > > When different than OSM_DEFAULT_SM_KEY value is configured on OpenSM > > > side an user may know this or not, in later case saquery will not work > > > (just like now). I don't see a hole. > > I think it will tend towards proliferation of keys which will defeat any > > security/trust. The idea of SMKey was to keep it private between SMs. > > This is now spreading it wider IMO. I'm sure other patches will follow > > in the same vein once an MKey manager exists.
> > > > -- Hal > > > > Sasha > > From hrosenstock at xsigo.com Fri May 23 04:15:13 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 23 May 2008 04:15:13 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080523100634.GD4164@sashak.voltaire.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> Message-ID: <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> On Fri, 2008-05-23 at 13:06 +0300, Sasha Khapyorsky wrote: > On 08:10 Thu 22 May , Hal Rosenstock wrote: > > > > I think it will tend towards proliferation of keys which will defeat any > > security/trust. The idea of SMKey was to keep it private between SMs. > > This is now spreading it wider IMO. > > Probably original idea was different, No; the spec clarification was just that; a clarification of what the original intent was rather than a change in the original idea. > but now in IBA spec knowing a valid > SM_Key is mandatory for privileged SA clients (which need to get whole > list of MCMemberRecord, ServiceInfo, etc.). It's a grey area. The issue is what the privileged SA clients should be used for. I think this use case allows much more common knowledge of the management keys (in this case the SA key) as it will not just be the network administrator using it and even if it were, the user would be looking over his shoulder. That more common knowledge allows for a malicious user to more easily compromise the subnet. A better approach to all these trust issues IMO is to use the OpenSM console to support these types of operations. -- Hal > Sasha From hrosenstock at xsigo.com Fri May 23 04:17:09 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 23 May 2008 04:17:09 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080523102557.GE4164@sashak.voltaire.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080522154702.430cdef7.weiny2@llnl.gov> <20080523102557.GE4164@sashak.voltaire.com> Message-ID: <1211541429.13185.84.camel@hrosenstock-ws.xsigo.com> On Fri, 2008-05-23 at 13:25 +0300, Sasha Khapyorsky wrote: > On 15:47 Thu 22 May , Ira Weiny wrote: > > I guess my question is "does saquery need this to talk to the SA?" > > > > I am assuming the answer is "yes". > > > > I noticed this in the spec section 14.4.7 page 890: > > > > "The SM Key used for SM authentication is independent of the SM Key in the > > SA header used for SA authentication." > > > > Does this mean there could be 2 SM_Key values in use? > > At least I see nothing in the spec against this. Right; it is more a use case/compromise of trust issue and the implications of that. -- Hal > Also there is stated > explicitly that validity for non-zero values is vendor-defined. > > Sasha
From sashak at voltaire.com Fri May 23 05:34:14 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 23 May 2008 15:34:14 +0300 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080523123414.GB4640@sashak.voltaire.com> On 04:15 Fri 23 May , Hal Rosenstock wrote: > > > but now in IBA spec knowing a valid > > SM_Key is mandatory for privileged SA clients (which need to get whole > > list of MCMemberRecord, ServiceInfo, etc.). > > It's a grey area. I don't see this as "grey" - spec is very clear about this sort of SA restrictions. > The issue is what the privileged SA clients should be > used for. It can be used for monitoring, SA DB sync/dump, debugging, etc.. > I think this use case allows much more common knowledge of the > management keys (in this case the SA key) as it will not just be the > network administrator using it and even if it were, the user would be > looking over his shoulder. A network administrator is not a little kid :) and this option is optional. Following your logic we will need to disable root passwords typing too. > That more common knowledge allows for a > malicious user to more easily compromise the subnet. There is nothing which could prevent from a malicious user to put things in the code. > A better approach to all these trust issues IMO is to use the OpenSM > console to support these types of operations. OpenSM console is not protected even by SM_Key. And what about diagnostics when other SMs are used?
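For the record, the code side of this option is tiny. The trusted queries in saquery already pass a key to the SA; the patch essentially just makes the compiled-in constant overridable from the command line. A rough sketch (names follow the saquery sources quoted elsewhere in this digest; the option parsing itself is omitted):

/* Sketch only: default as before; overwritten when --smkey is given
 * on the command line. */
static ib_net64_t smkey = OSM_DEFAULT_SM_KEY;

/* ... the trusted queries then use the variable instead of the constant: */
return get_any_records(bind_handle, query_id, 0, 0, NULL, attr_offset,
		       trusted ? smkey : 0);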
Sasha From hrosenstock at xsigo.com Fri May 23 05:52:41 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 23 May 2008 05:52:41 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080523123414.GB4640@sashak.voltaire.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> Message-ID: <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> On Fri, 2008-05-23 at 15:34 +0300, Sasha Khapyorsky wrote: > On 04:15 Fri 23 May , Hal Rosenstock wrote: > > > > > but now in IBA spec knowing a valid > > > SM_Key is mandatory for privileged SA clients (which need to get whole > > > list of MCMemberRecord, ServiceInfo, etc.). > > > > It's a grey area. > > I don't see this as "grey" - spec is very clear about this sort of SA > restrictions. It is not clear about the issues around the key proliferation. That is what is grey at least to me but maybe I'm the only one (at least speaking on this topic on this list). > > The issue is what the privileged SA clients should be > > used for. > > It can be used for monitoring, SA DB sync/dump, debugging, etc.. All those uses are easily imagined but that's not what I meant by that statement which was related to the key issue. > > I think this use case allows much more common knowledge of the > > management keys (in this case the SA key) as it will not just be the > > network administrator using it and even if it were, the user would be > > looking over his shoulder. > > A network administrator is not a little kid :) and this option is > optional. Following your logic we will need to disable root passwords > typing too. That's taking it too far. Root passwords are at least hidden when typing. > > That more common knowledge allows for a > > malicious user to more easily compromise the subnet. > > There is nothing which could prevent from a malicious user to put things > in the code. Of course not but it's one less hurdle to knock down. > > A better approach to all these trust issues IMO is to use the OpenSM > > console to support these types of operations. > > OpenSM console is not protected even by SM_Key. But can be protected by other weak access control currently and perhaps more in the future. New commands which require trust can utilize SMKey without it being specified (at least for OpenSM), no ? > And what about diagnostics when other SMs are used? I think there's a problem here in a trusted environments given the approach taken as I've stated in the past but seems to have been forgotten. The more trust the less the current diag strategy fits. Are you also going to be proposing exposing MKeys too once MKey management is supported by OpenSM/other SMs ? -- Hal > Sasha From gstreiff at NetEffect.com Fri May 23 06:17:58 2008 From: gstreiff at NetEffect.com (Glenn Streiff) Date: Fri, 23 May 2008 08:17:58 -0500 Subject: [ofa-general] Re: Current list of Linux maintainers and their emailinfo In-Reply-To: <20080520015652.GE1183@sashak.voltaire.com> Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC079501ED@venom2> > On 15:28 Mon 19 May , Woodruff, Robert J wrote: > > > > Here is what I have so far as the list of kernel and userspace > > components. > > > Hi, everyone. 
For NetEffect, iw_nes (iwarp rnic driver) libnes will be maintained by Chien Tung I'm in the process of passing the torch to Chien. He is a very capable developer and I know he will do a good job. I've enjoyed working with everyone. I may still post from time to time as necessary. :-) Glenn From hrosenstock at xsigo.com Fri May 23 06:19:36 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 23 May 2008 06:19:36 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> Message-ID: <1211548776.13185.116.camel@hrosenstock-ws.xsigo.com> On Fri, 2008-05-23 at 05:52 -0700, Hal Rosenstock wrote: > It is not clear about the issues around the key proliferation. In fact, if you notice, the IBA spec (at least the management chapters) was very careful to ignore all the key management issues to avoid discussions like we've been having ;-) -- Hal From hrosenstock at xsigo.com Fri May 23 06:47:12 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 23 May 2008 06:47:12 -0700 Subject: [ofa-general] [PATCH] management: Support separate SA and SM keys Message-ID: <1211550432.13185.121.camel@hrosenstock-ws.xsigo.com> management: Support separate SA and SM keys as clarified in IBA 1.2.1 Signed-off-by: Hal Rosenstock diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index ed61721..ccf7bdd 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -730,7 +730,7 @@ get_all_records(osm_bind_handle_t bind_handle, int trusted) { return get_any_records(bind_handle, query_id, 0, 0, NULL, attr_offset, - trusted ? OSM_DEFAULT_SM_KEY : 0); + trusted ? OSM_DEFAULT_SA_KEY : 0); } /** @@ -1255,7 +1255,7 @@ print_pkey_tbl_records(const struct query_cmd *q, osm_bind_handle_t bind_handle, status = get_any_records(bind_handle, IB_MAD_ATTR_PKEY_TBL_RECORD, 0, comp_mask, &pktr, ib_get_attr_offset(sizeof(pktr)), - OSM_DEFAULT_SM_KEY); + OSM_DEFAULT_SA_KEY); if (status != IB_SUCCESS) return status; diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h index 62d472e..39f9057 100644 --- a/opensm/include/opensm/osm_base.h +++ b/opensm/include/opensm/osm_base.h @@ -119,6 +119,17 @@ BEGIN_C_DECLS */ #define OSM_DEFAULT_SM_KEY 1 /********/ +/****s* OpenSM: Base/OSM_DEFAULT_SA_KEY +* NAME +* OSM_DEFAULT_SA_KEY +* +* DESCRIPTION +* Subnet Adminstration key value. +* +* SYNOPSIS +*/ +#define OSM_DEFAULT_SA_KEY 1 +/********/ /****s* OpenSM: Base/OSM_DEFAULT_LMC * NAME * OSM_DEFAULT_LMC diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index 349ba79..171b5db 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -208,6 +208,7 @@ typedef struct _osm_subn_opt { ib_net64_t guid; ib_net64_t m_key; ib_net64_t sm_key; + ib_net64_t sa_key; ib_net64_t subnet_prefix; ib_net16_t m_key_lease_period; uint32_t sweep_interval; @@ -291,7 +292,10 @@ typedef struct _osm_subn_opt { * M_Key value sent to all ports qualifing all Set(PortInfo). 
* * sm_key -* SM_Key value of the SM to qualify rcv SA queries as "trusted". +* SM_Key value of the SM used for SM authentication. +* +* sa_key +* SM_Key value to qualify rcv SA queries as "trusted". * * subnet_prefix * Subnet prefix used on this subnet. diff --git a/opensm/opensm/osm_sa_mad_ctrl.c b/opensm/opensm/osm_sa_mad_ctrl.c index 78fdec7..abd8d02 100644 --- a/opensm/opensm/osm_sa_mad_ctrl.c +++ b/opensm/opensm/osm_sa_mad_ctrl.c @@ -340,11 +340,11 @@ __osm_sa_mad_ctrl_rcv_callback(IN osm_madw_t * p_madw, * otherwise discard the MAD. */ if ((p_sa_mad->sm_key != 0) && - (p_sa_mad->sm_key != p_ctrl->p_subn->opt.sm_key)) { + (p_sa_mad->sm_key != p_ctrl->p_subn->opt.sa_key)) { OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 1A04: " "Non-Zero SA MAD SM_Key: 0x%" PRIx64 " != SM_Key: 0x%" PRIx64 "; MAD ignored\n", cl_ntoh64(p_sa_mad->sm_key), - cl_ntoh64(p_ctrl->p_subn->opt.sm_key) + cl_ntoh64(p_ctrl->p_subn->opt.sa_key) ); osm_mad_pool_put(p_ctrl->p_mad_pool, p_madw); goto Exit; diff --git a/opensm/opensm/osm_sa_pkey_record.c b/opensm/opensm/osm_sa_pkey_record.c index 5cea525..4d19ed4 100644 --- a/opensm/opensm/osm_sa_pkey_record.c +++ b/opensm/opensm/osm_sa_pkey_record.c @@ -269,7 +269,7 @@ void osm_pkey_rec_rcv_process(IN void *ctx, IN void *data) to trusted requests. Check that the requester is a trusted one. */ - if (p_rcvd_mad->sm_key != sa->p_subn->opt.sm_key) { + if (p_rcvd_mad->sm_key != sa->p_subn->opt.sa_key) { /* This is not a trusted requester! */ OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 4608: " "Request from non-trusted requester: " diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 2dc0ca8..a5c9b02 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -395,6 +395,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->guid = 0; p_opt->m_key = OSM_DEFAULT_M_KEY; p_opt->sm_key = OSM_DEFAULT_SM_KEY; + p_opt->sa_key = OSM_DEFAULT_SA_KEY; p_opt->subnet_prefix = IB_DEFAULT_SUBNET_PREFIX; p_opt->m_key_lease_period = 0; p_opt->sweep_interval = OSM_DEFAULT_SWEEP_INTERVAL_SECS; @@ -1183,6 +1184,8 @@ ib_api_status_t osm_subn_parse_conf_file(IN osm_subn_opt_t * const p_opts) opts_unpack_net64("sm_key", p_key, p_val, &p_opts->sm_key); + opts_unpack_net64("sa_key", p_key, p_val, &p_opts->sa_key); + opts_unpack_net64("subnet_prefix", p_key, p_val, &p_opts->subnet_prefix); @@ -1432,8 +1435,10 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) "m_key 0x%016" PRIx64 "\n\n" "# The lease period used for the M_Key on this subnet in [sec]\n" "m_key_lease_period %u\n\n" - "# SM_Key value of the SM to qualify rcv SA queries as 'trusted'\n" + "# SM_Key value of the SM used for SM authentication\n" "sm_key 0x%016" PRIx64 "\n\n" + "# SM_Key value to qualify rcv SA queries as 'trusted'\n" + "sa_key 0x%016" PRIx64 "\n\n" "# Subnet prefix used on this subnet\n" "subnet_prefix 0x%016" PRIx64 "\n\n" "# The LMC value used on this subnet\n" @@ -1487,6 +1492,7 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) cl_ntoh64(p_opts->m_key), cl_ntoh16(p_opts->m_key_lease_period), cl_ntoh64(p_opts->sm_key), + cl_ntoh64(p_opts->sa_key), cl_ntoh64(p_opts->subnet_prefix), p_opts->lmc, p_opts->lmc_esp0 ? 
"TRUE" : "FALSE", From sashak at voltaire.com Fri May 23 06:46:08 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 23 May 2008 16:46:08 +0300 Subject: [ofa-general] [PATCH] liberals4: fix linker dependencies Message-ID: <20080523134608.GD4640@sashak.voltaire.com> As stated in bug 1002 (https://bugs.openfabrics.org/show_bug.cgi?id=1002) when LDFLAGS like "-Wl,-z,defs" (disallows undefined symbols) is used it fails to resolve libpthread symbols. This simple patch fixes it. Signed-off-by: Sasha Khapyorsky --- configure.in | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/configure.in b/configure.in index 9304539..4f9ba8f 100644 --- a/configure.in +++ b/configure.in @@ -27,6 +27,8 @@ AC_PROG_CC dnl Checks for libraries AC_CHECK_LIB(ibverbs, ibv_get_device_list, [], AC_MSG_ERROR([ibv_get_device_list() not found. libmlx4 requires libibverbs.])) +AC_CHECK_LIB(pthread, pthread_mutex_init, [], + AC_MSG_ERROR([pthread_mutex_init() not found. libmlx4 requires libpthread.])) dnl Checks for header files. AC_CHECK_HEADER(infiniband/driver.h, [], -- 1.5.5.1.178.g1f811 From sashak at voltaire.com Fri May 23 06:52:19 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 23 May 2008 16:52:19 +0300 Subject: [ofa-general] [PATCH] libipathverbs: fix linker dependencies Message-ID: <20080523135219.GE4640@sashak.voltaire.com> As stated in bug 1002 (https://bugs.openfabrics.org/show_bug.cgi?id=1002) when LDFLAGS like "-Wl,-z,defs" (disallows undefined symbols) is used it fails to resolve libpthread symbols. This simple patch fixes it. Signed-off-by: Sasha Khapyorsky --- configure.in | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/configure.in b/configure.in index dcc207d..faaa4c3 100644 --- a/configure.in +++ b/configure.in @@ -62,6 +62,9 @@ AC_PROG_CC dnl Checks for libraries AC_CHECK_LIB(ibverbs, ibv_get_device_list, [], AC_MSG_ERROR([ibv_get_device_list() not found. libipathverbs requires libibverbs.])) +AC_CHECK_LIB(pthread, pthread_mutex_init, [], + AC_MSG_ERROR([pthread_mutex_init() not found. libipathverbs requires libpthread.])) + dnl Checks for header files. AC_CHECK_HEADER(infiniband/driver.h, [], -- 1.5.5.1.178.g1f811 From sashak at voltaire.com Fri May 23 06:53:44 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 23 May 2008 16:53:44 +0300 Subject: [ofa-general] [PATCH] liberals4: fix linker dependencies In-Reply-To: <20080523134608.GD4640@sashak.voltaire.com> References: <20080523134608.GD4640@sashak.voltaire.com> Message-ID: <20080523135344.GF4640@sashak.voltaire.com> Oops sorry, subject should be: libmlx4: fix linker dependencies Sasha From marcel.heinz at informatik.tu-chemnitz.de Fri May 23 08:26:41 2008 From: marcel.heinz at informatik.tu-chemnitz.de (Marcel Heinz) Date: Fri, 23 May 2008 17:26:41 +0200 Subject: [ofa-general] Multicast Performance Message-ID: <4836E231.4000601@informatik.tu-chemnitz.de> Hi, I have ported an application to use InfiniBand multicast directly via libibverbs. I have discovered very low multicast throughput, only ~250MByte/s although we are using 4x DDR components. To count out any effects of the application, I've created a small benchmark (well, it's only a hack). It just tries to keep the send/recv queue filled with work request and polls the CQ in an endless loop. In server mode, it joins to/creates the multicast group as FullMember, attaches the QP to the group and receives any packets. The client joins as SendOnlyNonMember and sends Datagrams of full MTU size to the group. 
The test setup is as follows:

Host A <---> Switch <---> Host B

We use Mellanox InfiniHost III Lx HCAs (MT25204) and a Flextronics F-X430046 24-Port Switch, OFED 1.3 and a "vanilla" 2.6.23.9 Linux kernel. The results are:

Host A         Host B      Throughput (MByte/sec)
client         server       262
client         2xserver     146
client+server  server       944
client+server  ---          946

as reference: unicast ib_send_bw (in UD mode): 1146

I don't see any reason why it should become _faster_ when I additionally start a server on the same host as the client. OTOH, the 944MByte/s sound relatively sane when compared to the unicast performance with the additional overhead of having to copy the data locally. These 260MB/s seem relatively near to the 2GBit/s effective throughput of a 1x SDR connection. However, the created group is rate 6 (20GBit/s) and the /sys/class/infiniband/mthca0/ports/1/rate file showed 20 Gb/sec during the whole test. The error counters of all ports are showing nothing abnormal. Only the RcvSwRelayErrors counter of the switch's port (to the host running the client) is increasing very fast, but this seems to be normal for multicast packets, as the switch is not relaying these packets back to the source. We could test on another cluster with 6 nodes (also with MT25204 HCAs, I don't know the OFED version and switch type) and got the following results:

Host1  Host2  Host3  Host4  Host5  Host6   Throughput (MByte/s)
1s     1s     1c                            255,15
1s     1s     1s     1c                     255,22
1s     1s     1s     1s     1c              255,22
1s     1s     1s     1s     1s     1c       255,22

1s1c   1s     1s                            738,64
1s1c   1s     1s     1s                     695,08
1s1c   1s     1s     1s     1s              565,14
1s1c   1s     1s     1s     1s     1s       451,90

As long as there is no server and client on the same host, it at least behaves like multicast. When having both client and server on the same host, performance decreases as the number of servers increases, which is totally surprising to me. Another test I did was doing an ib_send_bw (UD) benchmark while the multicast benchmark was running between A and B. I got ~260MByte/s for the multicast and also 260MB/s for ib_send_bw. Has anyone an idea of what is going on there or a hint what I should check? Regards, Marcel From meier3 at llnl.gov Fri May 23 08:58:48 2008 From: meier3 at llnl.gov (Timothy A. Meier) Date: Fri, 23 May 2008 08:58:48 -0700 Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts with error if not root In-Reply-To: <20080523103532.GA4640@sashak.voltaire.com> References: <48358DF8.2060603@llnl.gov> <1211469467.18236.196.camel@hrosenstock-ws.xsigo.com> <20080523103532.GA4640@sashak.voltaire.com> Message-ID: <4836E9B8.2080406@llnl.gov> Sasha Khapyorsky wrote: > On 08:17 Thu 22 May , Hal Rosenstock wrote: >> On Thu, 2008-05-22 at 08:15 -0700, Timothy A. Meier wrote: >>> Sasha, >>> >>> Trivial patch to enforce root for these perl scripts. More importantly, >>> doesn't silently fail if not root, and returns an error code. >> Should these enforce root or be based on udev permissions for umad which >> default to root ? > > I would ask the same question as Hal did. > Ok, I understand. I have created another patch with just the auth_check routine in it. Following Hal's advice, authorization is based on the umad permissions. > What is wrong with how it works now? On some systems access to the files could > be arranged for group members, or ibnetdiscover, which is used as the engine > for many scripts, could be made setuid/setgid. This change would break such setups. > > Sasha > The new patch shouldn't break code. I didn't realize/think about non-root with the original patch. The intent is simply to provide a consistent and non-silent fail mechanism.
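In C terms, the logic of the described check is roughly the following (a sketch only; the actual auth_check is a Perl routine, and the umad device path and ownership semantics used here are assumptions):

#include <sys/stat.h>
#include <unistd.h>

/* Pass if running as root, or if the caller owns the umad device. */
static int auth_check(void)
{
	struct stat st;

	if (geteuid() == 0)
		return 1;
	if (stat("/dev/infiniband/umad0", &st) < 0)
		return 0;
	return st.st_uid == geteuid();
}

A script (or tool) would call this up front and exit with a non-zero status when it fails, instead of silently doing nothing.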
Currently, you can get partial functionality from these scripts (-? for example). So in that sense, this can change the behavior if the check is used early in the script (as I did in the original patch). I view most of these scripts as "all or nothing". -- Timothy A. Meier Computer Scientist ICCD/High Performance Computing 925.422.3341 meier3 at llnl.gov From meier3 at llnl.gov Fri May 23 09:04:55 2008 From: meier3 at llnl.gov (Timothy A. Meier) Date: Fri, 23 May 2008 09:04:55 -0700 Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts with error if not authorized Message-ID: <4836EB27.7060707@llnl.gov> Sasha, Hal, Here is a revised version of the patch - just the auth_check() routine. Basically, it passes the test if root, or same ownership as umad0. The motivation for this patch is to provide a quick (and early) check for the perl scripts that were only intended for privileged users. Stop partial functionality, and provide a non-zero exit code. I will patch the relevant perl scripts to use this check, if accepted. -- Timothy A. Meier Computer Scientist ICCD/High Performance Computing 925.422.3341 meier3 at llnl.gov -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: 0001-infiniband-diags-terminate-perl-scripts-with-error.patch URL: From sean.hefty at intel.com Fri May 23 09:35:35 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 23 May 2008 09:35:35 -0700 Subject: [ofa-general] Multithreaded iWARP application In-Reply-To: <48367594.9010401@student.ethz.ch> References: <48367594.9010401@student.ethz.ch> Message-ID: <000001c8bcf3$070e4670$6b58180a@amr.corp.intel.com>

>In pseudo code:
>
>/** connecting part **/
>struct rdma_cm_id *id;
>struct rdma_event_channel *channel;
>struct rdma_cm_event *event;
>
>channel = rdma_create_event_channel();
>rdma_create_id(channel, &id, context, RDMA_PS_TCP);
>
>rdma_resolve_addr(id, src_addr, dst_addr, timeout);
>rdma_get_cm_event(channel, &event); //expecting ADDR_RESOLVED
>rdma_ack_cm_event(event);
>//same for rdma_resolve_route() //expecting ROUTE_RESOLVED
>
>rdma_connect(id, conn_param);
>rdma_get_cm_event(channel, &event); //expecting ESTABLISHED
>rdma_ack_cm_event(event);
>
>... do RDMA here ...
>//disconnect
>
>
>/** accepting thread **/
>struct rdma_cm_id *listen_id, *id;
>struct rdma_event_channel *listen_channel;
>struct rdma_cm_event *listen_event, *event;
>
>listen_channel = rdma_create_event_channel();
>rdma_create_id(listen_channel, &listen_id, context, RDMA_PS_TCP);
>
>rdma_bind_addr(listen_id, addr);
>rdma_listen(listen_id, backlog);
>
>while(1) {
>    rdma_get_cm_event(listen_channel, &listen_event);
>    //expecting CONNECT_REQUEST
>    id = listen_event->id;
>    rdma_ack_cm_event(listen_event);
>    id->channel = rdma_create_event_channel(); <-- HERE ???

This is disallowed. The kernel is still maintaining the association between the new id and its current channel, so will still deliver events to the old event channel for that id. I think the solution that you're looking for is the call rdma_migrate_id(). The listen request will have its own event channel, and you can migrate new connection(s) to separate channel(s). Depending on your app, you may be able to get away with a total of 2 channels - one for the listen, and another one for the connected ids. As a note, rdma_migrate_id() is a relatively new call. So I don't know if your installation has it. Without it, you're stuck using a single event channel on the listening side.
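A sketch of the accepting loop done that way (two channels total; error handling, QP setup and conn_param initialization are omitted):

struct rdma_event_channel *listen_channel, *conn_channel;
struct rdma_cm_id *listen_id, *id;
struct rdma_cm_event *event;

listen_channel = rdma_create_event_channel();
conn_channel = rdma_create_event_channel();
rdma_create_id(listen_channel, &listen_id, NULL, RDMA_PS_TCP);
rdma_bind_addr(listen_id, addr);
rdma_listen(listen_id, backlog);

while (1) {
	rdma_get_cm_event(listen_channel, &event);	/* CONNECT_REQUEST */
	id = event->id;
	rdma_ack_cm_event(event);
	rdma_migrate_id(id, conn_channel);	/* id's events now go to conn_channel */
	rdma_accept(id, &conn_param);		/* ESTABLISHED arrives on conn_channel */
}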
- Sean From rdreier at cisco.com Fri May 23 10:42:24 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 23 May 2008 10:42:24 -0700 Subject: [ofa-general] Re: linux-next: [PATCH] infiniband/hw/ipath/ipath_sdma.c , fix compiler warnings In-Reply-To: <48342C6C.2010502@googlemail.com> (Gabriel C.'s message of "Wed, 21 May 2008 16:06:36 +0200") References: <48342C6C.2010502@googlemail.com> Message-ID: > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_abort_task': > drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type Perhaps the best way to fix these is to change code like if (/* ScoreBoardDrainInProg */ test_bit(63, &hwstatus) || /* AbortInProg */ test_bit(62, &hwstatus) || /* InternalSDmaEnable */ test_bit(61, &hwstatus) || /* ScbEmpty */ !test_bit(30, &hwstatus)) { to something like if ((hwstatus & (IPATH_SDMA_STATUS_SCORE_BOARD_DRAIN_IN_PROG | IPATH_SDMA_STATUS_ABORT_IN_PROG | IPATH_SDMA_STATUS_INTERNAL_SDMA_ENABLE)) || !(hwstatus & IPATH_SDMA_STATUS_SCB_EMPTY)) { with appropriate defines for the constants 1ull << 63 etc. (I think I got the logic correct but someone should check) > drivers/infiniband/hw/ipath/ipath_sdma.c:348: warning: format '%016llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'ipath_restart_sdma': > drivers/infiniband/hw/ipath/ipath_sdma.c:618: warning: format '%016llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' I have a fix for this pending; will ask Linus to pull today. From rdreier at cisco.com Fri May 23 10:45:11 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 23 May 2008 10:45:11 -0700 Subject: [ofa-general] Re: [ PATCH ] RDMA/nes Update MAINTAINERS list In-Reply-To: <200805211649.m4LGnwPP026935@velma.neteffect.com> (Chien Tung's message of "Wed, 21 May 2008 11:49:58 -0500") References: <200805211649.m4LGnwPP026935@velma.neteffect.com> Message-ID: > Adding Chien to maintainers list for NetEffect. No problem with this, but is it intentional to remove Nishi Gupta in the same patch? > -P: Nishi Gupta > -M: ngupta at neteffect.com From rdreier at cisco.com Fri May 23 10:44:30 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 23 May 2008 10:44:30 -0700 Subject: [ofa-general][PATCH v1 2/2] mlx4: Default value for automatic completion vector selection In-Reply-To: <4834335D.8030903@mellanox.co.il> (Yevgeny Petrilin's message of "Wed, 21 May 2008 17:36:13 +0300") References: <4834335D.8030903@mellanox.co.il> Message-ID: > When the vector number passed to mlx4_cq_alloc is MLX4_ANY_VECTOR (0xff), > the driver selects the completion vector that has the least CQs attached > to it and attaches the CQ to the chosen vector. Ummm... how could an app/ULP use this sanely? Have a huge switch statement to choose MLX4_ANY_VECTOR / EHCA_ANY_VECTOR / FOOHCA_ANY_VECTOR? We need something generic like IB_CQ_VECTOR_LEAST_ATTACHED that specifies the policy in a driver-independent way. - R. 
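To make the generic suggestion concrete, something along these lines -- illustrative only, neither the name nor the value is an agreed-upon API:

/* hypothetical driver-independent policy value for the comp_vector argument */
enum {
	IB_CQ_VECTOR_LEAST_ATTACHED = -1,
};

/* a ULP could then ask for the policy without any driver-specific knowledge: */
cq = ib_create_cq(device, comp_handler, event_handler, context, cqe,
		  IB_CQ_VECTOR_LEAST_ATTACHED);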
From rdreier at cisco.com Fri May 23 10:57:41 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 23 May 2008 10:57:41 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get fixes for various issues: - Various trivial fixes that get rid of warnings - A couple of oopsable bugs fixed - Fixes for mthca/mlx4 driver bugs that stop NFS/RDMA from working - MAINTAINERS entry for Chelsio drivers Andrew Morton (1): IB/mlx4: Fix uninitialized-var warning in mlx4_ib_post_send() Dave Olson (1): IB/mad: Fix kernel crash when .process_mad() returns SUCCESS|CONSUMED Jack Morgenstein (1): IPoIB: Test for NULL broadcast object in ipiob_mcast_join_finish() Ralph Campbell (1): IB/ipath: Fix UC receive completion opcode for RDMA WRITE with immediate Roland Dreier (4): IB/ipath: Fix printk format for ipath_sdma_status RDMA/cxgb3: Fix uninitialized variable warning in iwch_post_send() IB/mthca: Fix max_sge value returned by query_device IB/mlx4: Fix creation of kernel QP with max number of send s/g entries Steve Wise (1): MAINTAINERS: Add cxgb3 and iw_cxgb3 NIC and iWARP driver entries MAINTAINERS | 14 ++++++++++++++ drivers/infiniband/core/mad.c | 4 +++- drivers/infiniband/hw/cxgb3/iwch_qp.c | 2 +- drivers/infiniband/hw/ipath/ipath_sdma.c | 4 ++-- drivers/infiniband/hw/ipath/ipath_uc.c | 4 ++-- drivers/infiniband/hw/mlx4/qp.c | 15 +++++++++------ drivers/infiniband/hw/mthca/mthca_main.c | 14 +++++++++++++- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 6 ++++++ 8 files changed, 50 insertions(+), 13 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index bc1c008..907d8c4 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1239,6 +1239,20 @@ L: video4linux-list at redhat.com W: http://linuxtv.org S: Maintained +CXGB3 ETHERNET DRIVER (CXGB3) +P: Divy Le Ray +M: divy at chelsio.com +L: netdev at vger.kernel.org +W: http://www.chelsio.com +S: Supported + +CXGB3 IWARP RNIC DRIVER (IW_CXGB3) +P: Steve Wise +M: swise at chelsio.com +L: general at lists.openfabrics.org +W: http://www.openfabrics.org +S: Supported + CYBERPRO FB DRIVER P: Russell King M: rmk at arm.linux.org.uk diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index fbe16d5..1adf2ef 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -747,7 +747,9 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, break; case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: kmem_cache_free(ib_mad_cache, mad_priv); - break; + kfree(local); + ret = 1; + goto out; case IB_MAD_RESULT_SUCCESS: /* Treat like an incoming receive MAD */ port_priv = ib_get_mad_port(mad_agent_priv->agent.device, diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c index 79dbe5b..9926137 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -229,7 +229,7 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr) { int err = 0; - u8 t3_wr_flit_cnt; + u8 uninitialized_var(t3_wr_flit_cnt); enum t3_wr_opcode t3_wr_opcode = 0; enum t3_wr_flags t3_wr_flags; struct iwch_qp *qhp; diff --git a/drivers/infiniband/hw/ipath/ipath_sdma.c b/drivers/infiniband/hw/ipath/ipath_sdma.c index 3697449..0a8c1b8 100644 ---
a/drivers/infiniband/hw/ipath/ipath_sdma.c +++ b/drivers/infiniband/hw/ipath/ipath_sdma.c @@ -345,7 +345,7 @@ resched: * state change */ if (jiffies > dd->ipath_sdma_abort_jiffies) { - ipath_dbg("looping with status 0x%016llx\n", + ipath_dbg("looping with status 0x%08lx\n", dd->ipath_sdma_status); dd->ipath_sdma_abort_jiffies = jiffies + 5 * HZ; } @@ -615,7 +615,7 @@ void ipath_restart_sdma(struct ipath_devdata *dd) } spin_unlock_irqrestore(&dd->ipath_sdma_lock, flags); if (!needed) { - ipath_dbg("invalid attempt to restart SDMA, status 0x%016llx\n", + ipath_dbg("invalid attempt to restart SDMA, status 0x%08lx\n", dd->ipath_sdma_status); goto bail; } diff --git a/drivers/infiniband/hw/ipath/ipath_uc.c b/drivers/infiniband/hw/ipath/ipath_uc.c index 7fd18e8..0596ec1 100644 --- a/drivers/infiniband/hw/ipath/ipath_uc.c +++ b/drivers/infiniband/hw/ipath/ipath_uc.c @@ -407,12 +407,11 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, dev->n_pkt_drops++; goto done; } - /* XXX Need to free SGEs */ + wc.opcode = IB_WC_RECV; last_imm: ipath_copy_sge(&qp->r_sge, data, tlen); wc.wr_id = qp->r_wr_id; wc.status = IB_WC_SUCCESS; - wc.opcode = IB_WC_RECV; wc.qp = &qp->ibqp; wc.src_qp = qp->remote_qpn; wc.slid = qp->remote_ah_attr.dlid; @@ -514,6 +513,7 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, goto done; } wc.byte_len = qp->r_len; + wc.opcode = IB_WC_RECV_RDMA_WITH_IMM; goto last_imm; case OP(RDMA_WRITE_LAST): diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 8e02ecf..a80df22 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -333,6 +333,9 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg)) + send_wqe_overhead(type, qp->flags); + if (s > dev->dev->caps.max_sq_desc_sz) + return -EINVAL; + /* * Hermon supports shrinking WQEs, such that a single work * request can include multiple units of 1 << wqe_shift. 
This @@ -372,9 +375,6 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); for (;;) { - if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) - return -EINVAL; - qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1U << qp->sq.wqe_shift); /* @@ -395,7 +395,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, ++qp->sq.wqe_shift; } - qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - + qp->sq.max_gs = (min(dev->dev->caps.max_sq_desc_sz, + (qp->sq_max_wqes_per_wr << qp->sq.wqe_shift)) - send_wqe_overhead(type, qp->flags)) / sizeof (struct mlx4_wqe_data_seg); @@ -411,7 +412,9 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_send_wr = qp->sq.max_post = (qp->sq.wqe_cnt - qp->sq_spare_wqes) / qp->sq_max_wqes_per_wr; - cap->max_send_sge = qp->sq.max_gs; + cap->max_send_sge = min(qp->sq.max_gs, + min(dev->dev->caps.max_sq_sg, + dev->dev->caps.max_rq_sg)); /* We don't support inline sends for kernel QPs (yet) */ cap->max_inline_data = 0; @@ -1457,7 +1460,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, unsigned ind; int uninitialized_var(stamp); int uninitialized_var(size); - unsigned seglen; + unsigned uninitialized_var(seglen); int i; spin_lock_irqsave(&qp->sq.lock, flags); diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 9ebadd6..200cf13 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -45,6 +45,7 @@ #include "mthca_cmd.h" #include "mthca_profile.h" #include "mthca_memfree.h" +#include "mthca_wqe.h" MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); @@ -200,7 +201,18 @@ static int mthca_dev_lim(struct mthca_dev *mdev, struct mthca_dev_lim *dev_lim) mdev->limits.gid_table_len = dev_lim->max_gids; mdev->limits.pkey_table_len = dev_lim->max_pkeys; mdev->limits.local_ca_ack_delay = dev_lim->local_ca_ack_delay; - mdev->limits.max_sg = dev_lim->max_sg; + /* + * Need to allow for worst case send WQE overhead and check + * whether max_desc_sz imposes a lower limit than max_sg; UD + * send has the biggest overhead. + */ + mdev->limits.max_sg = min_t(int, dev_lim->max_sg, + (dev_lim->max_desc_sz - + sizeof (struct mthca_next_seg) - + (mthca_is_memfree(mdev) ? 
+ sizeof (struct mthca_arbel_ud_seg) : + sizeof (struct mthca_tavor_ud_seg))) / + sizeof (struct mthca_data_seg)); mdev->limits.max_wqes = dev_lim->max_qp_sz; mdev->limits.max_qp_init_rdma = dev_lim->max_requester_per_qp; mdev->limits.reserved_qps = dev_lim->reserved_qps; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index d00a2c1..3f663fb 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -194,7 +194,13 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast, /* Set the cached Q_Key before we attach if it's the broadcast group */ if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4, sizeof (union ib_gid))) { + spin_lock_irq(&priv->lock); + if (!priv->broadcast) { + spin_unlock_irq(&priv->lock); + return -EAGAIN; + } priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey); + spin_unlock_irq(&priv->lock); priv->tx_wr.wr.ud.remote_qkey = priv->qkey; } From rdreier at cisco.com Fri May 23 11:04:23 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 23 May 2008 11:04:23 -0700 Subject: [ofa-general] [PATCH] [for-2.6.27] rdma: fix license text In-Reply-To: <000101c8bc27$df180960$ec248686@amr.corp.intel.com> (Sean Hefty's message of "Thu, 22 May 2008 09:21:20 -0700") References: <000101c8bc27$df180960$ec248686@amr.corp.intel.com> Message-ID: thanks, applied From rdreier at cisco.com Fri May 23 11:05:13 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 23 May 2008 11:05:13 -0700 Subject: [ofa-general] Re: [ewg] [PATCH] IB/ipoib: copy small SKBs in CM mode In-Reply-To: <1211470815.7310.61.camel@eli-laptop> (Eli Cohen's message of "Thu, 22 May 2008 18:40:15 +0300") References: <1211470815.7310.61.camel@eli-laptop> Message-ID: > --- > > When running netperf I see significant improvement when using this patch > (BW Mbps): > > with patch: > sender receiver > 313 313 > > without the patch: > 509 134 Any reason why we wouldn't want this info in the patch changelog? Can you explain why the sender gets dramatically slower with the patch? - R. From ctung at NetEffect.com Fri May 23 11:17:15 2008 From: ctung at NetEffect.com (Chien Tung) Date: Fri, 23 May 2008 13:17:15 -0500 Subject: [ofa-general] RE: [ PATCH ] RDMA/nes Update MAINTAINERS list In-Reply-To: References: <200805211649.m4LGnwPP026935@velma.neteffect.com> Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC0811C8FB@venom2> > > Adding Chien to maintainers list for NetEffect. > > No problem with this, but is it intentional to remove Nishi > Gupta in the same patch? Yes. I should have mentioned it in the abstract. Chien From swise at opengridcomputing.com Fri May 23 11:20:30 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 May 2008 13:20:30 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <48358428.2000902@voltaire.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> Message-ID: <48370AEE.7080507@opengridcomputing.com> Or Gerlitz wrote: > Steve Wise wrote: >> Are we sure we need to expose this to the user? > I believe this is the way to go if we want to let smart ULPs generate > new rkey/stag per mapping.
Simpler ULPs could then just put the same > value for each map associated with the same mr. > > Or. > How should I add this to the API? Perhaps we just document the format of an rkey in the struct ib_mr. Thus the app would do this to change the key before posting the fast_reg_mr wr (coded to be explicit, not efficient):

u8 newkey;
u32 newrkey;

newkey = 0xaa;
newrkey = (mr->rkey & 0xffffff00) | newkey;
mr->rkey = newrkey;
wr.wr.fast_reg.mr = mr;
...

Note, this assumes mr->rkey is in host byte order (I think the linux rdma code assumes this in other places too). Steve. From dotanba at gmail.com Fri May 23 12:28:14 2008 From: dotanba at gmail.com (Dotan Barak) Date: Fri, 23 May 2008 21:28:14 +0200 Subject: [ofa-general] MLX HCA: CQ request notification for multiple completions not implemented? In-Reply-To: References: Message-ID: <48371ACE.908@gmail.com> Hi. Yicheng Jia wrote: > > Hi Folks, > > I'm trying to use CQ Event notification for multiple completions > (ARM_N) according to Mellanox Lx III user manual for scatter/gathering > RDMA. However I couldn't find it in current MLX driver. It seems to me > that only ARM_NEXT and ARM_SOLICIT are implemented. So if there are > multiple work requests, I have to use "poll_cq" to synchronously wait > until all the requests are done, is it correct? Is there a way to do > asynchronous multiple send by subscribing for a ARM_N event? You are right: the low level drivers of Mellanox devices don't support ARM-N (this feature is supported by the devices, but it wasn't implemented in the low level drivers). You are also right that in order to read all of the completions you need to use poll_cq. By the way: do you have to create a completion for every WR? (If you are using one QP, this may solve your problem.) Dotan From dotanba at gmail.com Fri May 23 12:29:49 2008 From: dotanba at gmail.com (Dotan Barak) Date: Fri, 23 May 2008 21:29:49 +0200 Subject: [ofa-general] Multicast Performance In-Reply-To: <4836E231.4000601@informatik.tu-chemnitz.de> References: <4836E231.4000601@informatik.tu-chemnitz.de> Message-ID: <48371B2D.3040908@gmail.com> Hi. Do you use the latest released FW for this device? thanks Dotan Marcel Heinz wrote: > Hi, > > I have ported an application to use InfiniBand multicast directly via > libibverbs. I have discovered very low multicast throughput, only > ~250MByte/s although we are using 4x DDR components. To count out any > effects of the application, I've created a small benchmark (well, it's > only a hack). It just tries to keep the send/recv queue filled with work > request and polls the CQ in an endless loop. In server mode, it joins > to/creates the multicast group as FullMember, attaches the QP to the > group and receives any packets. The client joins as SendOnlyNonMember > and sends Datagrams of full MTU size to the group. > > The test setup is as follows: > > Host A <---> Switch <---> Host B > > We use Mellanox InfiniHost III Lx HCAs (MT25204) and a Flextronics > F-X430046 24-Port Switch, OFED 1.3 and a "vanilla" 2.6.23.9 Linux kernel. > > The results are:
>
> Host A         Host B      Throughput (MByte/sec)
> client         server       262
> client         2xserver     146
> client+server  server       944
> client+server  ---          946
>
> as reference: unicast ib_send_bw (in UD mode): 1146 > > I don't see any reason why it should become _faster_ when I additionally > start a server on the same host as the client. OTOH, the 944MByte/s > sound relatively sane when compared to the unicast performance with the > additional overhead of having to copy the data locally.
> > These 260MB/s seem relatively near to the 2GBit/s effective throughput > of a 1x SDR connection. However, the created group is rate 6 (20GBit/s) > and the /sys/class/infiniband/mthca0/ports/1/rate file showed 20 Gb/sec > during the whole test. > > The error counters of all ports are showing nothing abnormal. Only the > RcvSwRelayErrors counter of the switch's port (to the host running the > client) is increasing very fast, but this seems to be normal for > multicast packets, as the switch is not relaying these packets back to > the source. > > We could test on another cluster with 6 nodes (also with MT25204 HCAs, I > don't know the OFED version and switch type) and got the following results:
>
> Host1  Host2  Host3  Host4  Host5  Host6   Throughput (MByte/s)
> 1s     1s     1c                            255,15
> 1s     1s     1s     1c                     255,22
> 1s     1s     1s     1s     1c              255,22
> 1s     1s     1s     1s     1s     1c       255,22
>
> 1s1c   1s     1s                            738,64
> 1s1c   1s     1s     1s                     695,08
> 1s1c   1s     1s     1s     1s              565,14
> 1s1c   1s     1s     1s     1s     1s       451,90
>
> As long as there is no server and client on the same host, it at least > behaves like multicast. When having both client and server on the same > host, performance decreases as the number of servers increases, which is > totally surprising to me. > > Another test I did was doing an ib_send_bw (UD) benchmark while the > multicast benchmark was running between A and B. I got ~260MByte/s for > the multicast and also 260MB/s for ib_send_bw. > > Has anyone an idea of what is going on there or a hint what I should check? > > Regards, > Marcel > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From weiny2 at llnl.gov Fri May 23 11:54:38 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 23 May 2008 11:54:38 -0700 Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts with error if not root In-Reply-To: <20080523103532.GA4640@sashak.voltaire.com> References: <48358DF8.2060603@llnl.gov> <1211469467.18236.196.camel@hrosenstock-ws.xsigo.com> <20080523103532.GA4640@sashak.voltaire.com> Message-ID: <20080523115438.72900365.weiny2@llnl.gov> On Fri, 23 May 2008 13:35:32 +0300 Sasha Khapyorsky wrote: > On 08:17 Thu 22 May , Hal Rosenstock wrote: > > On Thu, 2008-05-22 at 08:15 -0700, Timothy A. Meier wrote: > > > Sasha, > > > > > > Trivial patch to enforce root for these perl scripts. More importantly, > > > doesn't silently fail if not root, and returns an error code. > > > > Should these enforce root or be based on udev permissions for umad which > > default to root ? > > I would ask the same question as Hal did. > > What is wrong with how it works now? On some systems access to the files could > be arranged for group members, or ibnetdiscover, which is used as the engine > for many scripts, could be made setuid/setgid. This change would break such setups. The problem is, if you don't know what a particular script or option does and it simply returns a prompt with a "0" return code, the user will THINK it did whatever it was supposed to do, when in fact it did nothing!!! This is especially bad with these scripts as most of them simply query the fabric. This could lead one to believe that it did not find any information to return when in fact it did not query the fabric at all. I realize that running things which you don't know what they do is bad, but for sure it should not return "0" when it clearly did not perform the requested operation because of an error in permissions.
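Concretely, all that is being asked for is that the tools fail loudly. In C terms (a fragment only, placed at the top of main(); auth_check as in Tim's patch earlier in this thread):

	if (!auth_check()) {
		fprintf(stderr, "%s: insufficient permissions to query the fabric\n",
			argv[0]);
		exit(1);	/* anything but a silent "0" */
	}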
Ira From rdreier at cisco.com Fri May 23 11:56:15 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 23 May 2008 11:56:15 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <48370AEE.7080507@opengridcomputing.com> (Steve Wise's message of "Fri, 23 May 2008 13:20:30 -0500") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> Message-ID:

> How should I add this to the API?
>
> Perhaps we just document the format of an rkey in the struct ib_mr.
> Thus the app would do this to change the key before posting the
> fast_reg_mr wr (coded to be explicit, not efficient):
>
> u8 newkey;
> u32 newrkey;
>
> newkey = 0xaa;
> newrkey = (mr->rkey & 0xffffff00) | newkey;
> mr->rkey = newrkey;
> wr.wr.fast_reg.mr = mr;

Don't like it -- too easy for the consumer to screw up the data structures. Seems simpler to just add a u8 "key" field (or maybe there's a better name) to the work request. - R. From swise at opengridcomputing.com Fri May 23 11:58:19 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 May 2008 13:58:19 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> Message-ID: <483713CB.3010408@opengridcomputing.com> Roland Dreier wrote:

> > How should I add this to the API?
> >
> > Perhaps we just document the format of an rkey in the struct ib_mr.
> > Thus the app would do this to change the key before posting the
> > fast_reg_mr wr (coded to be explicit, not efficient):
> >
> > u8 newkey;
> > u32 newrkey;
> >
> > newkey = 0xaa;
> > newrkey = (mr->rkey & 0xffffff00) | newkey;
> > mr->rkey = newrkey;
> > wr.wr.fast_reg.mr = mr;
>
> Don't like it -- too easy for the consumer to screw up the data
> structures.
>
> Seems simpler to just add a u8 "key" field (or maybe there's a better
> name) to the work request.

And then the provider updates the mr->rkey field as part of WR processing? Steve. From dotanba at gmail.com Fri May 23 13:14:36 2008 From: dotanba at gmail.com (Dotan Barak) Date: Fri, 23 May 2008 23:14:36 +0300 Subject: [ofa-general] [PATCH] core/include: fix coding style typos according to checkpatch.pl Message-ID: <200805232314.37003.dotanba@gmail.com> Fixed header files coding style typos according to checkpatch.pl (without harming code readability). Signed-off-by: Dotan Barak --- diff --git a/include/rdma/ib_cache.h b/include/rdma/ib_cache.h index f179d23..a5501e3 100644 --- a/include/rdma/ib_cache.h +++ b/include/rdma/ib_cache.h @@ -31,7 +31,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE.
* - * $Id: ib_cache.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef _IB_CACHE_H diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h index a627c86..48a30bd 100644 --- a/include/rdma/ib_cm.h +++ b/include/rdma/ib_cm.h @@ -32,7 +32,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_cm.h 4311 2005-12-05 18:42:01Z sean.hefty $ */ #if !defined(IB_CM_H) #define IB_CM_H diff --git a/include/rdma/ib_fmr_pool.h b/include/rdma/ib_fmr_pool.h index 00dadbf..15195e6 100644 --- a/include/rdma/ib_fmr_pool.h +++ b/include/rdma/ib_fmr_pool.h @@ -30,7 +30,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_fmr_pool.h 2730 2005-06-28 16:43:03Z sean.hefty $ */ #if !defined(IB_FMR_POOL_H) @@ -61,7 +60,7 @@ struct ib_fmr_pool_param { int pool_size; int dirty_watermark; void (*flush_function)(struct ib_fmr_pool *pool, - void * arg); + void *arg); void *flush_arg; unsigned cache:1; }; diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h index 7228c05..edfc0a9 100644 --- a/include/rdma/ib_mad.h +++ b/include/rdma/ib_mad.h @@ -33,10 +33,9 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_mad.h 5596 2006-03-03 01:00:07Z sean.hefty $ */ -#if !defined( IB_MAD_H ) +#if !defined(IB_MAD_H) #define IB_MAD_H #include @@ -194,8 +193,7 @@ struct ib_vendor_mad { u8 data[IB_MGMT_VENDOR_DATA]; }; -struct ib_class_port_info -{ +struct ib_class_port_info { u8 base_version; u8 class_version; __be16 capability_mask; @@ -614,11 +612,11 @@ int ib_process_mad_wc(struct ib_mad_agent *mad_agent, * any class specific header, and MAD data area. * If @rmpp_active is set, the RMPP header will be initialized for sending. */ -struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent, - u32 remote_qpn, u16 pkey_index, - int rmpp_active, - int hdr_len, int data_len, - gfp_t gfp_mask); +struct ib_mad_send_buf *ib_create_send_mad(struct ib_mad_agent *mad_agent, + u32 remote_qpn, u16 pkey_index, + int rmpp_active, + int hdr_len, int data_len, + gfp_t gfp_mask); /** * ib_is_mad_class_rmpp - returns whether given management class diff --git a/include/rdma/ib_pack.h b/include/rdma/ib_pack.h index f926020..f1703d5 100644 --- a/include/rdma/ib_pack.h +++ b/include/rdma/ib_pack.h @@ -29,7 +29,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_pack.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef IB_PACK_H diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index 942692b..39c9780 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -31,7 +31,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_sa.h 2811 2005-07-06 18:11:43Z halr $ */ #ifndef IB_SA_H diff --git a/include/rdma/ib_smi.h b/include/rdma/ib_smi.h index f29af13..da9428c 100644 --- a/include/rdma/ib_smi.h +++ b/include/rdma/ib_smi.h @@ -33,10 +33,9 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_smi.h 1389 2004-12-27 22:56:47Z roland $ */ -#if !defined( IB_SMI_H ) +#if !defined(IB_SMI_H) #define IB_SMI_H #include diff --git a/include/rdma/ib_user_cm.h b/include/rdma/ib_user_cm.h index 37650af..7e7571c 100644 --- a/include/rdma/ib_user_cm.h +++ b/include/rdma/ib_user_cm.h @@ -30,7 +30,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: ib_user_cm.h 4019 2005-11-11 00:33:09Z sean.hefty $ */ #ifndef IB_USER_CM_H diff --git a/include/rdma/ib_user_mad.h b/include/rdma/ib_user_mad.h index 29d2c72..11a8dde 100644 --- a/include/rdma/ib_user_mad.h +++ b/include/rdma/ib_user_mad.h @@ -30,7 +30,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_user_mad.h 2814 2005-07-06 19:14:09Z halr $ */ #ifndef IB_USER_MAD_H diff --git a/include/rdma/ib_user_verbs.h b/include/rdma/ib_user_verbs.h index 8d65bf0..e226c45 100644 --- a/include/rdma/ib_user_verbs.h +++ b/include/rdma/ib_user_verbs.h @@ -32,7 +32,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_user_verbs.h 4019 2005-11-11 00:33:09Z sean.hefty $ */ #ifndef IB_USER_VERBS_H diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..2a3bf8f 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -35,7 +35,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_verbs.h 1349 2004-12-16 21:09:43Z roland $ */ #if !defined(IB_VERBS_H) @@ -778,7 +777,7 @@ struct ib_cq { struct ib_uobject *uobject; ib_comp_handler comp_handler; void (*event_handler)(struct ib_event *, void *); - void * cq_context; + void *cq_context; int cqe; atomic_t usecnt; /* count number of work queues */ }; @@ -884,7 +883,7 @@ struct ib_dma_mapping_ops { void (*sync_single_for_cpu)(struct ib_device *dev, u64 dma_handle, size_t size, - enum dma_data_direction dir); + enum dma_data_direction dir); void (*sync_single_for_device)(struct ib_device *dev, u64 dma_handle, size_t size, diff --git a/include/rdma/iw_cm.h b/include/rdma/iw_cm.h index aeefa9b..cbb822e 100644 --- a/include/rdma/iw_cm.h +++ b/include/rdma/iw_cm.h @@ -62,7 +62,7 @@ struct iw_cm_event { struct sockaddr_in remote_addr; void *private_data; u8 private_data_len; - void* provider_data; + void *provider_data; }; /** diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h index 010f876..37eebb3 100644 --- a/include/rdma/rdma_cm.h +++ b/include/rdma/rdma_cm.h @@ -57,11 +57,11 @@ enum rdma_cm_event_type { }; enum rdma_port_space { - RDMA_PS_SDP = 0x0001, - RDMA_PS_IPOIB= 0x0002, - RDMA_PS_TCP = 0x0106, - RDMA_PS_UDP = 0x0111, - RDMA_PS_SCTP = 0x0183 + RDMA_PS_SDP = 0x0001, + RDMA_PS_IPOIB = 0x0002, + RDMA_PS_TCP = 0x0106, + RDMA_PS_UDP = 0x0111, + RDMA_PS_SCTP = 0x0183 }; struct rdma_addr { From dotanba at gmail.com Fri May 23 12:52:03 2008 From: dotanba at gmail.com (Dotan Barak) Date: Fri, 23 May 2008 22:52:03 +0300 Subject: [ofa-general] [PATCH] core/include: fix coding style typos according to checkpatch.pl Message-ID: <200805232252.04047.dotanba@gmail.com> Fixed header files coding style typos according to checkpatch.pl (without harming code readability). Signed-off-by: Dotan Barak --- diff --git a/include/rdma/ib_cache.h b/include/rdma/ib_cache.h index f179d23..a5501e3 100644 --- a/include/rdma/ib_cache.h +++ b/include/rdma/ib_cache.h @@ -31,7 +31,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_cache.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef _IB_CACHE_H diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h index a627c86..48a30bd 100644 --- a/include/rdma/ib_cm.h +++ b/include/rdma/ib_cm.h @@ -32,7 +32,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: ib_cm.h 4311 2005-12-05 18:42:01Z sean.hefty $ */ #if !defined(IB_CM_H) #define IB_CM_H diff --git a/include/rdma/ib_fmr_pool.h b/include/rdma/ib_fmr_pool.h index 00dadbf..15195e6 100644 --- a/include/rdma/ib_fmr_pool.h +++ b/include/rdma/ib_fmr_pool.h @@ -30,7 +30,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_fmr_pool.h 2730 2005-06-28 16:43:03Z sean.hefty $ */ #if !defined(IB_FMR_POOL_H) @@ -61,7 +60,7 @@ struct ib_fmr_pool_param { int pool_size; int dirty_watermark; void (*flush_function)(struct ib_fmr_pool *pool, - void * arg); + void *arg); void *flush_arg; unsigned cache:1; }; diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h index 7228c05..edfc0a9 100644 --- a/include/rdma/ib_mad.h +++ b/include/rdma/ib_mad.h @@ -33,10 +33,9 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_mad.h 5596 2006-03-03 01:00:07Z sean.hefty $ */ -#if !defined( IB_MAD_H ) +#if !defined(IB_MAD_H) #define IB_MAD_H #include @@ -194,8 +193,7 @@ struct ib_vendor_mad { u8 data[IB_MGMT_VENDOR_DATA]; }; -struct ib_class_port_info -{ +struct ib_class_port_info { u8 base_version; u8 class_version; __be16 capability_mask; @@ -614,11 +612,11 @@ int ib_process_mad_wc(struct ib_mad_agent *mad_agent, * any class specific header, and MAD data area. * If @rmpp_active is set, the RMPP header will be initialized for sending. */ -struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent, - u32 remote_qpn, u16 pkey_index, - int rmpp_active, - int hdr_len, int data_len, - gfp_t gfp_mask); +struct ib_mad_send_buf *ib_create_send_mad(struct ib_mad_agent *mad_agent, + u32 remote_qpn, u16 pkey_index, + int rmpp_active, + int hdr_len, int data_len, + gfp_t gfp_mask); /** * ib_is_mad_class_rmpp - returns whether given management class diff --git a/include/rdma/ib_pack.h b/include/rdma/ib_pack.h index f926020..f1703d5 100644 --- a/include/rdma/ib_pack.h +++ b/include/rdma/ib_pack.h @@ -29,7 +29,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_pack.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef IB_PACK_H diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index 942692b..39c9780 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -31,7 +31,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_sa.h 2811 2005-07-06 18:11:43Z halr $ */ #ifndef IB_SA_H diff --git a/include/rdma/ib_smi.h b/include/rdma/ib_smi.h index f29af13..da9428c 100644 --- a/include/rdma/ib_smi.h +++ b/include/rdma/ib_smi.h @@ -33,10 +33,9 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_smi.h 1389 2004-12-27 22:56:47Z roland $ */ -#if !defined( IB_SMI_H ) +#if !defined(IB_SMI_H) #define IB_SMI_H #include diff --git a/include/rdma/ib_user_cm.h b/include/rdma/ib_user_cm.h index 37650af..7e7571c 100644 --- a/include/rdma/ib_user_cm.h +++ b/include/rdma/ib_user_cm.h @@ -30,7 +30,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_user_cm.h 4019 2005-11-11 00:33:09Z sean.hefty $ */ #ifndef IB_USER_CM_H diff --git a/include/rdma/ib_user_mad.h b/include/rdma/ib_user_mad.h index 29d2c72..11a8dde 100644 --- a/include/rdma/ib_user_mad.h +++ b/include/rdma/ib_user_mad.h @@ -30,7 +30,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: ib_user_mad.h 2814 2005-07-06 19:14:09Z halr $ */ #ifndef IB_USER_MAD_H diff --git a/include/rdma/ib_user_verbs.h b/include/rdma/ib_user_verbs.h index 8d65bf0..e226c45 100644 --- a/include/rdma/ib_user_verbs.h +++ b/include/rdma/ib_user_verbs.h @@ -32,7 +32,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_user_verbs.h 4019 2005-11-11 00:33:09Z sean.hefty $ */ #ifndef IB_USER_VERBS_H diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..2a3bf8f 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -35,7 +35,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_verbs.h 1349 2004-12-16 21:09:43Z roland $ */ #if !defined(IB_VERBS_H) @@ -778,7 +777,7 @@ struct ib_cq { struct ib_uobject *uobject; ib_comp_handler comp_handler; void (*event_handler)(struct ib_event *, void *); - void * cq_context; + void *cq_context; int cqe; atomic_t usecnt; /* count number of work queues */ }; @@ -884,7 +883,7 @@ struct ib_dma_mapping_ops { void (*sync_single_for_cpu)(struct ib_device *dev, u64 dma_handle, size_t size, - enum dma_data_direction dir); + enum dma_data_direction dir); void (*sync_single_for_device)(struct ib_device *dev, u64 dma_handle, size_t size, diff --git a/include/rdma/iw_cm.h b/include/rdma/iw_cm.h index aeefa9b..cbb822e 100644 --- a/include/rdma/iw_cm.h +++ b/include/rdma/iw_cm.h @@ -62,7 +62,7 @@ struct iw_cm_event { struct sockaddr_in remote_addr; void *private_data; u8 private_data_len; - void* provider_data; + void *provider_data; }; /** diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h index 010f876..37eebb3 100644 --- a/include/rdma/rdma_cm.h +++ b/include/rdma/rdma_cm.h @@ -57,11 +57,11 @@ enum rdma_cm_event_type { }; enum rdma_port_space { - RDMA_PS_SDP = 0x0001, - RDMA_PS_IPOIB= 0x0002, - RDMA_PS_TCP = 0x0106, - RDMA_PS_UDP = 0x0111, - RDMA_PS_SCTP = 0x0183 + RDMA_PS_SDP = 0x0001, + RDMA_PS_IPOIB = 0x0002, + RDMA_PS_TCP = 0x0106, + RDMA_PS_UDP = 0x0111, + RDMA_PS_SCTP = 0x0183 }; struct rdma_addr { From weiny2 at llnl.gov Fri May 23 12:18:15 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 23 May 2008 12:18:15 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211540855.13185.71.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080522154702.430cdef7.weiny2@llnl.gov> <1211540855.13185.71.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080523121815.2c39e65a.weiny2@llnl.gov> On Fri, 23 May 2008 04:07:35 -0700 Hal Rosenstock wrote: > On Thu, 2008-05-22 at 15:47 -0700, Ira Weiny wrote: > > I guess my question is "does saquery need this to talk to the SA?" > > > > I am assuming the answer is "yes". > > It depends on whether trusted operations are needed to be supported or > not. A normal node has no need for trusted operations. There was a > reason why the additional information was hidden with a key. It allows a > malicious user to effect not just his node but the subnet. Ok... I guess from your other emails the point is that ULP's must get these keys by some "out of spec" method? saquery only queries information, much of which I think ULP's require to establish connections etc. How are others solving this problem? > > As I mentioned, this starts to be a slippery slope with the management > keys. 
I think a better approach when non default key is in place is to > support this via the OpenSM console as OpenSM knows all the keys it's > supposed to. When you mention this I start to think about the secure API which Tim submitted a few months ago and was not accepted. I know we are still discussing how to do "secure console" but perhaps this is a very valid use case for the SM to answer SSL socket connections to get keys? (yea, no longer thinking... ;-) Ira > > > I noticed this in the spec section 14.4.7 page 890: > > > > "The SM Key used for SM authentication is independent of the SM Key in the > > SA header used for SA authentication." > > > > Does this mean there could be 2 SM_Key values in use? > > This was a clarification added at IBA 1.2.1. The SA SMKey is really an > SA Key. This lack of separation is a limitation in the current OpenSM > implementation. > > -- Hal > > > Ira > > > > > > On Thu, 22 May 2008 08:10:29 -0700 > > Hal Rosenstock wrote: > > > > > On Thu, 2008-05-22 at 17:56 +0300, Sasha Khapyorsky wrote: > > > > On 07:46 Thu 22 May , Hal Rosenstock wrote: > > > > > Sasha, > > > > > > > > > > On Thu, 2008-05-22 at 16:53 +0300, Sasha Khapyorsky wrote: > > > > > > This adds possibility to specify SM_Key value with saquery. It should > > > > > > work with queries where OSM_DEFAULT_SM_KEY was used. > > > > > > > > > > I think this starts down a slippery slope and perhaps bad precedent for > > > > > MKey as well. I know this is useful as a debug tool but compromises what > > > > > purports as "security" IMO as this means the keys need to be too widely > > > > > known. > > > > When different than OSM_DEFAULT_SM_KEY value is configured on OpenSM > > > > side an user may know this or not, in later case saquery will not work > > > > (just like now). I don't see a hole. > > > I think it will tend towards proliferation of keys which will defeat any > > > security/trust. The idea of SMKey was to keep it private between SMs. > > > This is now spreading it wider IMO. I'm sure other patches will follow > > > in the same vein once an MKey manager exists. > > > -- Hal > > > > Sasha
From dotanba at gmail.com Fri May 23 13:32:07 2008 From: dotanba at gmail.com (Dotan Barak) Date: Fri, 23 May 2008 23:32:07 +0300 Subject: [ofa-general] [PATCH] libibverbs: fix coding style typos according to checkpatch.pl Message-ID: <200805232332.07576.dotanba@gmail.com> Fixed coding style typos according to checkpatch.pl (without harming code readability). Signed-off-by: Dotan Barak --- diff --git a/examples/devinfo.c b/examples/devinfo.c index 1fadc80..86ad7da 100644 --- a/examples/devinfo.c +++ b/examples/devinfo.c @@ -48,7 +48,7 @@ #include #include -static int verbose = 0; +static int verbose; static int null_gid(union ibv_gid *gid) { @@ -231,9 +231,8 @@ static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port) device_attr.max_total_mcast_qp_attach); printf("\tmax_ah:\t\t\t\t%d\n", device_attr.max_ah); printf("\tmax_fmr:\t\t\t%d\n", device_attr.max_fmr); - if (device_attr.max_fmr) { + if (device_attr.max_fmr) printf("\tmax_map_per_fmr:\t\t%d\n", device_attr.max_map_per_fmr); - } printf("\tmax_srq:\t\t\t%d\n", device_attr.max_srq); if (device_attr.max_srq) { printf("\tmax_srq_wr:\t\t\t%d\n", device_attr.max_srq_wr); diff --git a/examples/srq_pingpong.c b/examples/srq_pingpong.c index 95bebf4..e47bae6 100644 --- a/examples/srq_pingpong.c +++ b/examples/srq_pingpong.c @@ -143,7 +143,7 @@ static struct pingpong_dest *pp_client_exch_dest(const char *servername, int por .ai_socktype = SOCK_STREAM }; char *service; - char msg[ sizeof "0000:000000:000000"]; + char msg[sizeof "0000:000000:000000"]; int n; int r; int i; @@ -227,7 +227,7 @@ static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, .ai_socktype = SOCK_STREAM }; char *service; - char msg[ sizeof "0000:000000:000000"]; + char msg[sizeof "0000:000000:000000"]; int n; int r; int i; @@ -275,7 +275,7 @@ static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, return NULL; } - rem_dest = malloc(MAX_QP *sizeof *rem_dest); + rem_dest = malloc(MAX_QP * sizeof *rem_dest); if (!rem_dest) goto out; diff --git a/src/cmd.c b/src/cmd.c index 9db8aa6..66d7134 100644 --- a/src/cmd.c +++ b/src/cmd.c @@ -851,7 +851,7 @@ int ibv_cmd_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, tmp->wr.ud.remote_qpn = i->wr.ud.remote_qpn; tmp->wr.ud.remote_qkey = i->wr.ud.remote_qkey; } else { - switch(i->opcode) { + switch (i->opcode) { case IBV_WR_RDMA_WRITE: case IBV_WR_RDMA_WRITE_WITH_IMM: case IBV_WR_RDMA_READ: diff --git a/src/ibverbs.h b/src/ibverbs.h index b1d2c2b..6a6e3c8 100644 --- a/src/ibverbs.h +++ b/src/ibverbs.h @@ -49,7 +49,7 @@ #endif /* HAVE_VALGRIND_MEMCHECK_H */ #ifndef VALGRIND_MAKE_MEM_DEFINED -# define VALGRIND_MAKE_MEM_DEFINED(addr,len) +# define VALGRIND_MAKE_MEM_DEFINED(addr, len) #endif #define HIDDEN __attribute__((visibility ("hidden"))) diff --git a/src/init.c b/src/init.c index 07ab855..82dfae4 100644 --- a/src/init.c +++ b/src/init.c @@ -321,7 +321,7 @@ static void read_config(void) goto next; read_config_file(path); - next: +next: free(path); } From rdreier at cisco.com Fri May 23 13:05:35 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 23 May 2008 13:05:35 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <483713CB.3010408@opengridcomputing.com> (Steve Wise's message of "Fri, 23 May 2008 13:58:19 -0500") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com>
<4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <483713CB.3010408@opengridcomputing.com> Message-ID: > And then the provider updates the mr->rkey field as part of WR processing? Yeah, I guess so. Actually thinking about it, another possibility would be to wrap up the > newrkey = (mr->rkey & 0xffffff00) | newkey; operation in a little inline helper function so people don't screw it up. Maybe that's the cleanest way to do it. (We would probably want the helper for low-level driver use anyway) - R. From rdreier at cisco.com Fri May 23 13:12:38 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 23 May 2008 13:12:38 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: (Roland Dreier's message of "Fri, 23 May 2008 13:05:35 -0700") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <483713CB.3010408@opengridcomputing.com> Message-ID: > Actually thinking about it, another possibility would be to wrap up the > newrkey = (mr->rkey & 0xffffff00) | newkey; > operation in a little inline helper function so people don't screw it > up. Maybe that's the cleanest way to do it. If we add a "key" field to the work request, then it seems too easy for a consumer to forget to set it and end up passing uninitialized garbage. If the consumer has to explicitly update the key when posting the work request then that failure is avoided. HOWEVER -- if we have the consumer update the key when posting the operation, then there is the problem of what happens when the consumer posts multiple fastreg work requests at once (ie fastreg, local inval, new fastreg, etc. in a pipelined way). Does the low-level driver just take the key value given when the WR is posted, even if there's a new value there by the time the WR is executed? - R. From ralph.campbell at qlogic.com Fri May 23 14:43:29 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 23 May 2008 14:43:29 -0700 Subject: [ofa-general] [PATCH 0/1] IB/ipath -- fix for 2.6.26 Message-ID: <20080523214329.20736.85555.stgit@eng-46.mv.qlogic.com> The following patch fixes a minor bug that Or Gerlitz found. IB/ipath - fix device capability flags This can be pulled into Roland's infiniband.git for-2.6.26 repo using: git pull git://git.qlogic.com/ipath-linux-2.6 for-roland From ralph.campbell at qlogic.com Fri May 23 14:43:34 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 23 May 2008 14:43:34 -0700 Subject: [ofa-general] [PATCH] IB/ipath - fix device capability flags In-Reply-To: <20080523214329.20736.85555.stgit@eng-46.mv.qlogic.com> References: <20080523214329.20736.85555.stgit@eng-46.mv.qlogic.com> Message-ID: <20080523214334.20736.86499.stgit@eng-46.mv.qlogic.com> The driver supports a few features that were not reported in the device capability flags. This patch fixes that.
Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_verbs.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index e0ec540..7779165 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -1494,7 +1494,8 @@ static int ipath_query_device(struct ib_device *ibdev, props->device_cap_flags = IB_DEVICE_BAD_PKEY_CNTR | IB_DEVICE_BAD_QKEY_CNTR | IB_DEVICE_SHUTDOWN_PORT | - IB_DEVICE_SYS_IMAGE_GUID; + IB_DEVICE_SYS_IMAGE_GUID | IB_DEVICE_RC_RNR_NAK_GEN | + IB_DEVICE_PORT_ACTIVE_EVENT | IB_DEVICE_SRQ_RESIZE; props->page_size_cap = PAGE_SIZE; props->vendor_id = dev->dd->ipath_vendorid; props->vendor_part_id = dev->dd->ipath_deviceid; From ralph.campbell at qlogic.com Fri May 23 14:45:01 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 23 May 2008 14:45:01 -0700 Subject: [ofa-general] Re: linux-next: [PATCH] infiniband/hw/ipath/ipath_sdma.c, fix compiler warnings In-Reply-To: References: <48342C6C.2010502@googlemail.com> Message-ID: <1211579101.3949.326.camel@brick.pathscale.com> This looks good to me. On Fri, 2008-05-23 at 10:42 -0700, Roland Dreier wrote: > > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_abort_task': > > drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > Perhaps the best way to fix these is to change code like > > if (/* ScoreBoardDrainInProg */ > test_bit(63, &hwstatus) || > /* AbortInProg */ > test_bit(62, &hwstatus) || > /* InternalSDmaEnable */ > test_bit(61, &hwstatus) || > /* ScbEmpty */ > !test_bit(30, &hwstatus)) { > > to something like > > if ((hwstatus & (IPATH_SDMA_STATUS_SCORE_BOARD_DRAIN_IN_PROG | > IPATH_SDMA_STATUS_ABORT_IN_PROG | > IPATH_SDMA_STATUS_INTERNAL_SDMA_ENABLE)) || > !(hwstatus & IPATH_SDMA_STATUS_SCB_EMPTY)) { > > with appropriate defines for the constants 1ull << 63 etc. > > (I think I got the logic correct but someone should check) > > > drivers/infiniband/hw/ipath/ipath_sdma.c:348: warning: format '%016llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' > > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'ipath_restart_sdma': > > drivers/infiniband/hw/ipath/ipath_sdma.c:618: warning: format '%016llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' > > I have a fix for this pending; will ask Linus to pull today. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
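A minimal sketch of the defines Roland's suggestion implies. The bit positions come straight from the quoted test_bit() calls and the names from his rewritten condition; their exact home (ipath_sdma.c itself or a header it includes) is an assumption:

/* assumed placement: drivers/infiniband/hw/ipath/ipath_sdma.c or a shared header */
#define IPATH_SDMA_STATUS_SCORE_BOARD_DRAIN_IN_PROG	(1ull << 63)	/* ScoreBoardDrainInProg */
#define IPATH_SDMA_STATUS_ABORT_IN_PROG			(1ull << 62)	/* AbortInProg */
#define IPATH_SDMA_STATUS_INTERNAL_SDMA_ENABLE		(1ull << 61)	/* InternalSDmaEnable */
#define IPATH_SDMA_STATUS_SCB_EMPTY			(1ull << 30)	/* ScbEmpty */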
From swise at opengridcomputing.com Fri May 23 20:13:24 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 May 2008 22:13:24 -0500 Subject: [ofa-general] [ANNOUNCE] chelsio rnic 6.0 firmware Message-ID: <483787D4.3030805@opengridcomputing.com> Chelsio iWARP fans, The new ofed-1.3.1 cxgb3 drivers require a firmware upgrade for the chelsio rnic. You can pull the firmware from: http://service.chelsio.com/drivers/firmware/t3/t3fw-6.0.0.bin.gz Unzip it and place it in /lib/firmware on your systems. Then the next time you reload and configure cxgb3 it will install the new firmware. Thanks, Steve. From swise at opengridcomputing.com Fri May 23 20:32:24 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 May 2008 22:32:24 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <483713CB.3010408@opengridcomputing.com> Message-ID: <48378C48.5060904@opengridcomputing.com> Roland Dreier wrote: > > Actually thinking about it, another possibility would be to wrap up the > > > newrkey = (mr->rkey & 0xffffff00) | newkey; > > > operation in a little inline helper function so people don't screw it > > up. Maybe that's the cleanest way to do it. > > If we add a "key" field to the work request, then it seems too easy for > a consumer to forget to set it and end up passing uninitialized garbage. > If the consumer has to explicitly update the key when posting the work > request then that failure is avoided. > > HOWEVER -- if we have the consumer update the key when posting the > operation, then there is the problem of what happens when the consumer > posts multiple fastreg work requests at once (ie fastreg, local inval, > new fastreg, etc. in a pipelined way). Does the low-level driver just > take the key value given when the WR is posted, even if there's a > new value there by the time the WR is executed? > I would have to say yes. And it makes sense, I think. Say rkey is 0x010203XX. Then a pipeline could look like: fastreg (mr->rkey is 0x01020301) rdma read (mr->rkey is 0x01020301) invalidate local with fence (mr->rkey is 0x01020301) fastreg (mr->rkey is 0x01020302) rdma read (sink mr->rkey is 0x01020302) invalidate local with fence (mr->rkey is 0x01020302) So the consumer is using the correct mr->rkey at all times even though the rnic is possibly processing the previous generation (that was copied into a fastreg WR at an earlier point in time) at the same time as the app is registering the next generation of the rkey. Steve.
From swise at opengridcomputing.com Fri May 23 20:42:42 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 May 2008 22:42:42 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <48378C48.5060904@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <483713CB.3010408@opengridcomputing.com> <48378C48.5060904@opengridcomputing.com> Message-ID: <48378EB2.2060005@opengridcomputing.com> Steve Wise wrote: > > > Roland Dreier wrote: >> > Actually thinking about it, another possibility would be to wrap up >> the >> >> > newrkey = (mr->rkey & 0xffffff00) | newkey; >> >> > operation in a little inline helper function so people don't screw it >> > up. Maybe that's the cleanest way to do it. >> >> If we add a "key" field to the work request, then it seems too easy for >> a consumer to forget to set it and end up passing uninitialized garbage. >> If the consumer has to explicitly update the key when posting the work >> request then that failure is avoided. >> >> HOWEVER -- if we have the consumer update the key when posting the >> operation, then there is the problem of what happens when the consumer >> posts multiple fastreg work requests at once (ie fastreg, local inval, >> new fastreg, etc. in a pipelined way). Does the low-level driver just >> take the key value given when the WR is posted, even if there's a >> new value there by the time the WR is executed? >> > > I would have to say yes. And it makes sense, I think. > > Say rkey is 0x010203XX. Then a pipeline could look like: > > fastreg (mr->rkey is 0x01020301) > rdma read (mr->rkey is 0x01020301) > invalidate local with fence (mr->rkey is 0x01020301) > fastreg (mr->rkey is 0x01020302) > rdma read (sink mr->rkey is 0x01020302) > invalidate local with fence (mr->rkey is 0x01020302) > > So the consumer is using the correct mr->rkey at all times even though > the rnic is possibly processing the previous generation (that was copied > into a fastreg WR at an earlier point in time) at the same time as the > app is registering the next generation of the rkey. > So something like this? static void ib_update_fast_reg_key(struct ib_mr *mr, u8 newkey) { /* iWARP: rkey == lkey */ if (mr->rkey == mr->lkey) mr->lkey = mr->lkey & 0xffffff00 | newkey; mr->rkey = mr->rkey & 0xffffff00 | newkey; }
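A hypothetical caller for the helper above, following the fastreg/read/invalidate generation sequence Steve sketched in his previous message. Illustration only: post_fastreg(), post_rdma_read() and post_local_inval_fenced() are made-up stand-ins for ib_post_send() calls whose work request layout is exactly what this thread is still settling:

/* sketch, not the proposed API: the post_*() helpers are hypothetical
 * wrappers around ib_post_send() */
static void do_one_fastreg_io(struct ib_qp *qp, struct ib_mr *mr, u8 *gen)
{
	ib_update_fast_reg_key(mr, ++(*gen));	/* bump low-byte generation, e.g. ...01 -> ...02 */
	post_fastreg(qp, mr);			/* MR -> VALID, bound to its page list */
	post_rdma_read(qp, mr->rkey);		/* transfer using this generation's rkey */
	post_local_inval_fenced(qp, mr->rkey);	/* MR -> INVALID, ready for the next cycle */
}

Each posted WR carries the key value captured at post time, so the RNIC can still be executing generation N while the consumer is already registering generation N+1.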
From vlad at lists.openfabrics.org Sat May 24 03:09:34 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 24 May 2008 03:09:34 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20080524-0200 daily build status Message-ID: <20080524100934.E47A1E60B71@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.22 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Passed on ppc64 with linux-2.6.19 Failed:
From ogerlitz at voltaire.com Sat May 24 22:19:07 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 25 May 2008 08:19:07 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <20080516223419.27221.49014.stgit@dell3.ogc.int> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> Message-ID: <4838F6CB.2040203@voltaire.com> Steve Wise wrote: > Usage Model: > - MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) > - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) Hi Steve, Roland, After discussing the rkey renew and fencing with send/rdma ops, I am quite clear with how all this plugs well into ULPs such as SCSI or FS low-level (interconnect) initiator/target drivers, specifically those who use a transactional protocol. A few more points to clarify (sorry if this became somewhat long): * Do we want it to be a must for a consumer to invalidate a fast-reg mr before reusing it? If yes, how? * If remote invalidation is supported, when the peer is done with the mr, it sends the "response" in send-with-invalidate fashion and saves the mapper side from doing a local invalidate. For the case of the mapping produced by SCSI initiator or FS client, when remote invalidation is not supported, I don't see how a local invalidate design can be made in a pipe-lined manner - since from the network perspective the I/O is done and the target response is at hand, but until the mr is invalidated the pages are typically not returned to the upper layer and the ULP has to "stall" till the invalidation WR is completed. I don't say it's a bug or a big issue, just wondering what your thoughts are regarding this point. * Talking about remote invalidation, I understand that it requires support of both sides (and hence has to be negotiated), so the IB_DEVICE_SEND_W_INV device capability says that a device can send-with-invalidate, do we need an IB_DEVICE_RECV_W_INV cap as well? * What about ZBVA, is it orthogonal to these calls, no enhancement of the suggested API is needed even if zbva is used, or the other way, it would work also when zbva is not used?
Or From ogerlitz at voltaire.com Sat May 24 22:32:29 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 25 May 2008 08:32:29 +0300 Subject: [ofa-general] [PATCH] management: Support separate SA and SM keys In-Reply-To: <1211550432.13185.121.camel@hrosenstock-ws.xsigo.com> References: <1211550432.13185.121.camel@hrosenstock-ws.xsigo.com> Message-ID: <4838F9ED.9090304@voltaire.com> Hal Rosenstock wrote: > management: Support separate SA and SM keys as clarified in IBA 1.2.1 Is some host side patch needed to inter-operate with this change? Or. From vlad at lists.openfabrics.org Sun May 25 03:09:17 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 25 May 2008 03:09:17 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20080525-0200 daily build status Message-ID: <20080525100917.9E205E60C2A@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Passed on ppc64 with linux-2.6.19 Failed: From ogerlitz at voltaire.com Sun May 25 05:27:34 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 25 May 2008 15:27:34 +0300 (IDT) Subject: [ofa-general] Re: got
scheduling while atomic in ipoib (was net/bonding: announce fail-over for the active-backup mode) In-Reply-To: References: Message-ID: > Enhance bonding to announce fail-over for the active-backup mode through > the netdev events notifier chain mechanism. Such an event can be of use > for the RDMA CM (communication manager) to let native RDMA ULPs (eg > NFS-RDMA, iSER) always use the same links as the IP stack does. > --- linux-2.6.26-rc2.orig/drivers/net/bonding/bond_main.c 2008-05-13 10:02:22.000000000 +0300 > +++ linux-2.6.26-rc2/drivers/net/bonding/bond_main.c 2008-05-15 12:29:44.000000000 +0300 > @@ -1117,6 +1117,7 @@ void bond_change_active_slave(struct bon > bond->send_grat_arp = 1; > } else > bond_send_gratuitous_arp(bond); > + netdev_bonding_change(bond->dev); > } > } > --- linux-2.6.26-rc2.orig/net/core/dev.c 2008-05-13 10:02:31.000000000 +0300 > +++ linux-2.6.26-rc2/net/core/dev.c 2008-05-13 11:50:49.000000000 +0300 > @@ -956,6 +956,12 @@ void netdev_state_change(struct net_devi > } > } > > +void netdev_bonding_change(struct net_device *dev) > +{ > + call_netdevice_notifiers(NETDEV_BONDING_FAILOVER, dev); > +} > +EXPORT_SYMBOL(netdev_bonding_change); Hi Roland, I have enhanced the bonding driver to deliver an event through the netdev notifier chain, and I am getting this "scheduling while atomic" warning. The function __bond_mii_monitor does spin_lock_bh before calling bond_select_active_slave() which calls bond_change_active_slave(), so maybe it's not a good idea to deliver events under these atomic conditions, but I still want to make sure I didn't step on some problem in ipoib (per the :ib_ipoib:ipoib_start_xmit+0x445/0x459 line in the trace), any idea? bonding: bond0: link status definitely down for interface ib0, disabling it bonding: bond0: making interface ib1 the new active one. BUG: scheduling while atomic: bond0/14237/0x10000100 Pid: 14237, comm: bond0 Not tainted 2.6.26-rc3 #4 Call Trace: [] schedule+0x98/0x57b [] dbg_redzone1+0x16/0x1f [] :ib_ipoib:ipoib_start_xmit+0x445/0x459 [] kmem_cache_alloc_node+0x147/0x177 [] __alloc_skb+0x35/0x12b [] __cond_resched+0x1c/0x43 [] _cond_resched+0x2d/0x38 [] kmem_cache_alloc_node+0x25/0x177 [] __alloc_skb+0x35/0x12b [] rtmsg_ifinfo+0x3a/0xd4 [] rtnetlink_event+0x3d/0x41 [] notifier_call_chain+0x30/0x54 [] :bonding:bond_select_active_slave+0xb9/0xe8 [] :bonding:__bond_mii_monitor+0x43a/0x464 [] :bonding:bond_mii_monitor+0x5e/0xaa [] :bonding:bond_mii_monitor+0x0/0xaa [] run_workqueue+0x7f/0x107 [] worker_thread+0x0/0xef [] worker_thread+0xe5/0xef [] autoremove_wake_function+0x0/0x2e [] autoremove_wake_function+0x0/0x2e [] kthread+0x3d/0x63 [] child_rip+0xa/0x12 [] kthread+0x0/0x63 [] child_rip+0x0/0x12 eth2: no IPv6 routers present bond0: no IPv6 routers present end_request: I/O error, dev fd0, sector 0 Or. From eli at mellanox.co.il Sun May 25 05:38:29 2008 From: eli at mellanox.co.il (Eli Cohen) Date: Sun, 25 May 2008 15:38:29 +0300 Subject: [ofa-general] Re: [ewg] [PATCH] IB/ipoib: copy small SKBs in CM mode In-Reply-To: References: <1211470815.7310.61.camel@eli-laptop> Message-ID: <1211719109.13769.32.camel@mtls03> On Fri, 2008-05-23 at 11:05 -0700, Roland Dreier wrote: > > --- > > > > When running netperf I see significant improvement when using this patch > > (BW Mbps): > > > > with patch: > > sender receiver > > 313 313 > > > > without the patch: > > 509 134 > > Any reason why we wouldn't want this info in the patch changelog? Not really.
If you think it should be there, I'll add it to the changelog along with an explanation of the question below. > > Can you explain why the sender gets dramatically slower with the patch? > When using this patch, the overhead of the CPU for handling RX packets is dramatically reduced. As a result, we do not experience RNR NACK messages from the receiver which cause the connection to be closed and reopened again; when the patch is not used, the receiver cannot handle the packets fast enough so there is less time to post new buffers and hence the mentioned RNR NACKs. So what happens is that the application, e.g. netperf, *thinks* it posted a certain number of packets for transmission but these packets are flushed and do not really get transmitted. Since the connection gets opened and closed many times, each time netperf gets the CPU time that otherwise would have been used to actually transmit the packets. This can be verified when looking at the port counters, the output of ifconfig and the output of netperf (this is for the case without the patch): tx packets ========== port counter: 1,543,996 ifconfig: 1,581,426 netperf: 5,142,034 rx packets ========== netperf: 1,130,489 UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 14.4.3.178 (14.4.3.178) port 0 AF_INET Socket Message Elapsed Messages Size Size Time Okay Errors Throughput bytes bytes secs # # 10^6bits/sec 114688 128 10.00 5142034 0 526.31 114688 10.00 1130489 115.71 From bradley.kite at gmail.com Sun May 25 05:48:57 2008 From: bradley.kite at gmail.com (Bradley Kite) Date: Sun, 25 May 2008 13:48:57 +0100 Subject: [ofa-general] SDP and epoll vs select() Message-ID: Hi all, Currently my application uses the Linux kernel's epoll interface for socket event notifications. From what I've read it looks like the SDP library only works with select()/poll() - is this actually the case or will epoll work too? Many thanks -- Brad. From eli at dev.mellanox.co.il Sun May 25 08:59:41 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sun, 25 May 2008 18:59:41 +0300 Subject: [ofa-general] [PATCH] IB/ipoib: reduce CM tx object size Message-ID: <1211731181.13769.46.camel@mtls03> >From 74218f2b8fff790a0fa35c2bf3aa6ab48c08ba81 Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Sun, 25 May 2008 18:58:13 +0300 Subject: [PATCH] IB/ipoib: reduce CM tx object size Since IPOIB CM does not publish NETIF_F_SG, we don't need a mapping array so define a new struct with one u64 field and use it.
Signed-off-by: Eli Cohen --- drivers/infiniband/ulp/ipoib/ipoib.h | 7 ++++++- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 12 ++++++------ 2 files changed, 12 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index e39bf36..2b6f60b 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -109,6 +109,11 @@ enum { /* structs */ +struct ipoib_cm_tx_buf { + struct sk_buff *skb; + u64 mapping; +}; + struct ipoib_header { __be16 proto; u16 reserved; @@ -208,7 +213,7 @@ struct ipoib_cm_tx { struct net_device *dev; struct ipoib_neigh *neigh; struct ipoib_path *path; - struct ipoib_tx_buf *tx_ring; + struct ipoib_cm_tx_buf *tx_ring; unsigned tx_head; unsigned tx_tail; unsigned long flags; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 9e0facc..064971d 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -662,7 +662,7 @@ static inline int post_send(struct ipoib_dev_priv *priv, void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_tx_buf *tx_req; + struct ipoib_cm_tx_buf *tx_req; u64 addr; if (unlikely(skb->len > tx->mtu)) { @@ -693,7 +693,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_ return; } - tx_req->mapping[0] = addr; + tx_req->mapping = addr; if (unlikely(post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1), addr, skb->len))) { @@ -718,7 +718,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_cm_tx *tx = wc->qp->qp_context; unsigned int wr_id = wc->wr_id & ~IPOIB_OP_CM; - struct ipoib_tx_buf *tx_req; + struct ipoib_cm_tx_buf *tx_req; unsigned long flags; ipoib_dbg_data(priv, "cm send completion: id %d, status: %d\n", @@ -732,7 +732,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) tx_req = &tx->tx_ring[wr_id]; - ib_dma_unmap_single(priv->ca, tx_req->mapping[0], tx_req->skb->len, DMA_TO_DEVICE); + ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len, DMA_TO_DEVICE); /* FIXME: is this right? Shouldn't we only increment on success? 
*/ ++dev->stats.tx_packets; @@ -1102,7 +1102,7 @@ err_tx: static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p) { struct ipoib_dev_priv *priv = netdev_priv(p->dev); - struct ipoib_tx_buf *tx_req; + struct ipoib_cm_tx_buf *tx_req; unsigned long flags; unsigned long begin; @@ -1130,7 +1130,7 @@ timeout: while ((int) p->tx_tail - (int) p->tx_head < 0) { tx_req = &p->tx_ring[p->tx_tail & (ipoib_sendq_size - 1)]; - ib_dma_unmap_single(priv->ca, tx_req->mapping[0], tx_req->skb->len, + ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len, DMA_TO_DEVICE); dev_kfree_skb_any(tx_req->skb); ++p->tx_tail; -- 1.5.5.1 From swise at opengridcomputing.com Sun May 25 09:56:36 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 25 May 2008 11:56:36 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4838F6CB.2040203@voltaire.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> Message-ID: <48399A44.7060009@opengridcomputing.com> Or Gerlitz wrote: > After discussing the rkey renew and fencing with send/rdma ops, I am > quite clear with how all this plugs well into ULPs such as SCSI or FS > low-level (interconnect) initiator/target drivers, specifically those > who use a transactional protocol. A few more points to clarify (sorry > if this became somewhat long): > > * Do we want it to be a must for a consumer to invalidate a fast-reg mr > before reusing it? If yes, how? The verbs specs mandate that the mr be in the invalid state when the fast-reg work request is processed. So I think that means yes. And the consumer invalidates it via the INVALIDATE_MR work request. > > * If remote invalidation is supported, when the peer is done with the > mr, it sends the "response" in > send-with-invalidate fashion and saves the mapper side from doing a > local invalidate. For the case of the mapping produced by SCSI initiator > or FS client, when remote invalidation is not supported, I don't see how > a local invalidate design can be made in a pipe-lined manner - since > from the network perspective the I/O is done and the target response is > at hand, but until the mr is invalidated the pages are typically not > returned to the upper layer and the ULP has to "stall" till the > invalidation WR is completed. I don't say it's a bug or a big issue, just > wondering what your thoughts are regarding this point. > I guess that's why they invented send-with-inv, and read-with-inv-local. > * Talking about remote invalidation, I understand that it requires > support of both sides (and hence has to be negotiated), so the > IB_DEVICE_SEND_W_INV device capability says that a device can > send-with-invalidate, do we need an IB_DEVICE_RECV_W_INV cap as well? > > * What about ZBVA, is it orthogonal to these calls, no enhancement of > the suggested API is needed even if zbva is used, or the other way, it > would work also when zbva is not used?
> Or From sashak at voltaire.com Sun May 25 12:10:47 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 25 May 2008 22:10:47 +0300 Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts with error if not root In-Reply-To: <4836E9B8.2080406@llnl.gov> References: <48358DF8.2060603@llnl.gov> <1211469467.18236.196.camel@hrosenstock-ws.xsigo.com> <20080523103532.GA4640@sashak.voltaire.com> <4836E9B8.2080406@llnl.gov> Message-ID: <20080525191047.GS4616@sashak.voltaire.com> Hi Tim, Ira, On 08:58 Fri 23 May , Timothy A. Meier wrote: > > Following Hals advice, authorization is based on the umad permissions. I will send some more comments about this method later today. But basically I still think that some things could be broken and that it is not really trivial to separate in this way wrong usage from desired behavior reliably (with some approximation it is possible of course). > The intent is simply to provide a consistent and > non-silent fail mechanism. OTOH I fully agree with yours and Ira's arguments about this - 'Silent' fails are bad. I thought about how to solve this and started to run diag perl scripts from an unprivileged account in various conditions (cache file exists or not, cache dir is readable or not, etc.). First thing I saw was that even on bad usage most scripts return 0. Then I found that on many failures the return status is not checked or is ignored and the program returns 0. I did these two patches (below) and up to now they work fine for me (but likely I didn't cover everything). What do you say? Sasha >From cbbc155996c9f6efe91b78f055a643809b997468 Mon Sep 17 00:00:00 2001 From: root Date: Sat, 24 May 2008 11:04:08 +0300 Subject: [PATCH] infiniband-diags/scripts/*.pl: exit 2 on usage errors Add non-zero exit status (2) on usage errors for perl scripts. Signed-off-by: root --- infiniband-diags/scripts/check_lft_balance.pl | 2 +- infiniband-diags/scripts/ibfindnodesusing.pl | 2 +- infiniband-diags/scripts/ibidsverify.pl | 2 +- infiniband-diags/scripts/iblinkinfo.pl | 2 +- infiniband-diags/scripts/ibprintca.pl | 2 +- infiniband-diags/scripts/ibprintrt.pl | 2 +- infiniband-diags/scripts/ibprintswitch.pl | 2 +- infiniband-diags/scripts/ibqueryerrors.pl | 2 +- infiniband-diags/scripts/ibswportwatch.pl | 2 +- 9 files changed, 9 insertions(+), 9 deletions(-) diff --git a/infiniband-diags/scripts/check_lft_balance.pl b/infiniband-diags/scripts/check_lft_balance.pl index 66f5f0f..b0f0fef 100755 --- a/infiniband-diags/scripts/check_lft_balance.pl +++ b/infiniband-diags/scripts/check_lft_balance.pl @@ -70,7 +70,7 @@ sub usage print "Usage: $prog [-R -v]\n"; print " -R recalculate all cached information\n"; print " -v verbose output\n"; - exit 0; + exit 2; } sub is_port_up diff --git a/infiniband-diags/scripts/ibfindnodesusing.pl b/infiniband-diags/scripts/ibfindnodesusing.pl index 1bf0987..71656b3 100755 --- a/infiniband-diags/scripts/ibfindnodesusing.pl +++ b/infiniband-diags/scripts/ibfindnodesusing.pl @@ -80,7 +80,7 @@ sub usage_and_exit print " -R Recalculate ibnetdiscover information\n"; print " -C use selected Channel Adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibidsverify.pl b/infiniband-diags/scripts/ibidsverify.pl index de78e6b..1a236c8 100755 --- a/infiniband-diags/scripts/ibidsverify.pl +++ b/infiniband-diags/scripts/ibidsverify.pl @@ -46,7 +46,7 @@ sub usage_and_exit print " -h This help message\n"; print " -R Recalculate ibnetdiscover information (Default is to reuse ibnetdiscover output)\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/iblinkinfo.pl b/infiniband-diags/scripts/iblinkinfo.pl index a195474..a7a3df5 100755 --- a/infiniband-diags/scripts/iblinkinfo.pl +++ b/infiniband-diags/scripts/iblinkinfo.pl @@ -62,7 +62,7 @@ sub usage_and_exit print " -C use selected Channel Adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; print " -g print port guids instead of node guids\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibprintca.pl b/infiniband-diags/scripts/ibprintca.pl index 38b4330..0baea0b 100755 --- a/infiniband-diags/scripts/ibprintca.pl +++ b/infiniband-diags/scripts/ibprintca.pl @@ -51,7 +51,7 @@ sub usage_and_exit print " -l list cas\n"; print " -C use selected channel adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibprintrt.pl b/infiniband-diags/scripts/ibprintrt.pl index 86dcb64..0b3db19 100755 --- a/infiniband-diags/scripts/ibprintrt.pl +++ b/infiniband-diags/scripts/ibprintrt.pl @@ -51,7 +51,7 @@ sub usage_and_exit print " -l list rts\n"; print " -C use selected channel adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibprintswitch.pl b/infiniband-diags/scripts/ibprintswitch.pl index 6712201..c7377a9 100755 --- a/infiniband-diags/scripts/ibprintswitch.pl +++ b/infiniband-diags/scripts/ibprintswitch.pl @@ -50,7 +50,7 @@ sub usage_and_exit print " -l list switches\n"; print " -C use selected channel adaptor name for
Signed-off-by: root --- infiniband-diags/scripts/check_lft_balance.pl | 2 +- infiniband-diags/scripts/ibfindnodesusing.pl | 2 +- infiniband-diags/scripts/ibidsverify.pl | 2 +- infiniband-diags/scripts/iblinkinfo.pl | 2 +- infiniband-diags/scripts/ibprintca.pl | 2 +- infiniband-diags/scripts/ibprintrt.pl | 2 +- infiniband-diags/scripts/ibprintswitch.pl | 2 +- infiniband-diags/scripts/ibqueryerrors.pl | 2 +- infiniband-diags/scripts/ibswportwatch.pl | 2 +- 9 files changed, 9 insertions(+), 9 deletions(-) diff --git a/infiniband-diags/scripts/check_lft_balance.pl b/infiniband-diags/scripts/check_lft_balance.pl index 66f5f0f..b0f0fef 100755 --- a/infiniband-diags/scripts/check_lft_balance.pl +++ b/infiniband-diags/scripts/check_lft_balance.pl @@ -70,7 +70,7 @@ sub usage print "Usage: $prog [-R -v]\n"; print " -R recalculate all cached information\n"; print " -v verbose output\n"; - exit 0; + exit 2; } sub is_port_up diff --git a/infiniband-diags/scripts/ibfindnodesusing.pl b/infiniband-diags/scripts/ibfindnodesusing.pl index 1bf0987..71656b3 100755 --- a/infiniband-diags/scripts/ibfindnodesusing.pl +++ b/infiniband-diags/scripts/ibfindnodesusing.pl @@ -80,7 +80,7 @@ sub usage_and_exit print " -R Recalculate ibnetdiscover information\n"; print " -C use selected Channel Adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibidsverify.pl b/infiniband-diags/scripts/ibidsverify.pl index de78e6b..1a236c8 100755 --- a/infiniband-diags/scripts/ibidsverify.pl +++ b/infiniband-diags/scripts/ibidsverify.pl @@ -46,7 +46,7 @@ sub usage_and_exit print " -h This help message\n"; print " -R Recalculate ibnetdiscover information (Default is to reuse ibnetdiscover output)\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/iblinkinfo.pl b/infiniband-diags/scripts/iblinkinfo.pl index a195474..a7a3df5 100755 --- a/infiniband-diags/scripts/iblinkinfo.pl +++ b/infiniband-diags/scripts/iblinkinfo.pl @@ -62,7 +62,7 @@ sub usage_and_exit print " -C use selected Channel Adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; print " -g print port guids instead of node guids\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibprintca.pl b/infiniband-diags/scripts/ibprintca.pl index 38b4330..0baea0b 100755 --- a/infiniband-diags/scripts/ibprintca.pl +++ b/infiniband-diags/scripts/ibprintca.pl @@ -51,7 +51,7 @@ sub usage_and_exit print " -l list cas\n"; print " -C use selected channel adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibprintrt.pl b/infiniband-diags/scripts/ibprintrt.pl index 86dcb64..0b3db19 100755 --- a/infiniband-diags/scripts/ibprintrt.pl +++ b/infiniband-diags/scripts/ibprintrt.pl @@ -51,7 +51,7 @@ sub usage_and_exit print " -l list rts\n"; print " -C use selected channel adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibprintswitch.pl b/infiniband-diags/scripts/ibprintswitch.pl index 6712201..c7377a9 100755 --- a/infiniband-diags/scripts/ibprintswitch.pl +++ b/infiniband-diags/scripts/ibprintswitch.pl @@ -50,7 +50,7 @@ sub usage_and_exit print " -l list switches\n"; print " -C use selected channel adaptor name for 
queries\n"; print " -P use selected channel adaptor port for queries\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibqueryerrors.pl b/infiniband-diags/scripts/ibqueryerrors.pl index c807c02..5f2e167 100755 --- a/infiniband-diags/scripts/ibqueryerrors.pl +++ b/infiniband-diags/scripts/ibqueryerrors.pl @@ -149,7 +149,7 @@ sub usage_and_exit print " -d include the data counters in the output\n"; print " -C use selected Channel Adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibswportwatch.pl b/infiniband-diags/scripts/ibswportwatch.pl index 6d6ba1c..d888f51 100755 --- a/infiniband-diags/scripts/ibswportwatch.pl +++ b/infiniband-diags/scripts/ibswportwatch.pl @@ -81,7 +81,7 @@ sub usage_and_exit print " -n run n cycles then exit (default -1 == forever)\n"; print " -G Address provided is a GUID\n"; print " -b report bytes/second packets/second\n"; - exit 0; + exit 2; } # ========================================================================= -- 1.5.4.rc2.60.gb2e62 >From a3d4a44d668912526466f591931a099a0978f943 Mon Sep 17 00:00:00 2001 From: root Date: Sun, 25 May 2008 15:54:00 +0300 Subject: [PATCH] infiniband-diags/scripts/*.pl: prevent some zero exists on errors Upon failures break execution and drop error status. Signed-off-by: root --- infiniband-diags/scripts/IBswcountlimits.pm | 19 +++++++++++-------- infiniband-diags/scripts/ibprintca.pl | 6 +++--- infiniband-diags/scripts/ibprintrt.pl | 6 +++--- infiniband-diags/scripts/ibprintswitch.pl | 4 ++-- infiniband-diags/scripts/ibqueryerrors.pl | 6 ++++-- infiniband-diags/scripts/ibswportwatch.pl | 6 ++---- 6 files changed, 25 insertions(+), 22 deletions(-) diff --git a/infiniband-diags/scripts/IBswcountlimits.pm b/infiniband-diags/scripts/IBswcountlimits.pm index 9bc356f..9794ff1 100755 --- a/infiniband-diags/scripts/IBswcountlimits.pm +++ b/infiniband-diags/scripts/IBswcountlimits.pm @@ -219,8 +219,9 @@ sub any_counts # sub ensure_cache_dir { - if (!(-d "$IBswcountlimits::cache_dir")) { - mkdir $IBswcountlimits::cache_dir, 0700; + if (!(-d "$IBswcountlimits::cache_dir") && + !mkdir($IBswcountlimits::cache_dir, 0700)) { + die "cannot create $IBswcountlimits::cache_dir: $!\n"; } } @@ -260,9 +261,8 @@ sub generate_ibnetdiscover_topology my $cache_file = get_cache_file($ca_name, $ca_port); my $extra_params = get_ca_name_port_param_string($ca_name, $ca_port); - `ibnetdiscover -g $extra_params > $cache_file`; - if ($? != 0) { - die "Execution of ibnetdiscover failed with errors\n"; + if (`ibnetdiscover -g $extra_params > $cache_file`) { + die "Execution of ibnetdiscover failed: $!\n"; } } @@ -421,7 +421,8 @@ sub get_num_ports my $num_ports = 0; my $extra_params = get_ca_name_port_param_string($ca_name, $ca_port); - my $data = `smpquery $extra_params -G nodeinfo $guid`; + my $data = `smpquery $extra_params -G nodeinfo $guid` || + die "'smpquery $extra_params -G nodeinfo $guid' faild\n"; my @lines = split("\n", $data); my $pkt_lifetime = ""; foreach my $line (@lines) { @@ -457,7 +458,8 @@ sub convert_dr_to_guid { my $guid = undef; - my $data = `smpquery nodeinfo -D $_[0]`; + my $data = `smpquery nodeinfo -D $_[0]` || + die "'mpquery nodeinfo -D $_[0]' failed\n"; my @lines = split("\n", $data); foreach my $line (@lines) { if ($line =~ /^PortGuid:\.+(.*)/) { $guid = $1; } @@ -480,7 +482,8 @@ sub get_node_type $query_arg .= "-D " . 
$_[0]; } - my $data = `$query_arg`; + my $data = `$query_arg` || + die "'$query_arg' failed\n"; my @lines = split("\n", $data); foreach my $line (@lines) { if ($line =~ /^NodeType:\.+(.*)/) { $type = $1; } diff --git a/infiniband-diags/scripts/ibprintca.pl b/infiniband-diags/scripts/ibprintca.pl index 0baea0b..7de0801 100755 --- a/infiniband-diags/scripts/ibprintca.pl +++ b/infiniband-diags/scripts/ibprintca.pl @@ -118,9 +118,9 @@ sub main print $ports{$port}; } if (!$found_hca) { - print "\"$target_hca\" not found\n"; - print " Try running with the \"-R\" option.\n"; - print " If still not found the node is probably down.\n"; + die "\"$target_hca\" not found\n" . + " Try running with the \"-R\" option.\n" . + " If still not found the node is probably down.\n"; } close IBNET_TOPO; } diff --git a/infiniband-diags/scripts/ibprintrt.pl b/infiniband-diags/scripts/ibprintrt.pl index 0b3db19..43323ca 100755 --- a/infiniband-diags/scripts/ibprintrt.pl +++ b/infiniband-diags/scripts/ibprintrt.pl @@ -118,9 +118,9 @@ sub main print $ports{$port}; } if (!$found_rt) { - print "\"$target_rt\" not found\n"; - print " Try running with the \"-R\" option.\n"; - print " If still not found the node is probably down.\n"; + die "\"$target_rt\" not found\n" . + " Try running with the \"-R\" option.\n" . + " If still not found the node is probably down.\n"; } close IBNET_TOPO; } diff --git a/infiniband-diags/scripts/ibprintswitch.pl b/infiniband-diags/scripts/ibprintswitch.pl index c7377a9..8af3f48 100755 --- a/infiniband-diags/scripts/ibprintswitch.pl +++ b/infiniband-diags/scripts/ibprintswitch.pl @@ -117,8 +117,8 @@ sub main print $ports{$port}; } if (!$found_switch) { - print "Switch \"$target_switch\" not found\n"; - print " Try running with the \"-R\" option.\n"; + die "Switch \"$target_switch\" not found\n" . 
+ " Try running with the \"-R\" option.\n"; } close IBNET_TOPO; } diff --git a/infiniband-diags/scripts/ibqueryerrors.pl b/infiniband-diags/scripts/ibqueryerrors.pl index 5f2e167..a6128b5 100755 --- a/infiniband-diags/scripts/ibqueryerrors.pl +++ b/infiniband-diags/scripts/ibqueryerrors.pl @@ -104,7 +104,8 @@ sub get_counts my $ca_port = $_[3]; my $extra_params = get_ca_name_port_param_string($ca_name, $ca_port); - my $data = `perfquery $extra_params -G $addr $port`; + my $data = `perfquery $extra_params -G $addr $port` || + die "'perfquery $extra_params -G $addr $port' FAILED.\n"; my @lines = split("\n", $data); foreach my $line (@lines) { foreach my $count (@IBswcountlimits::counters) { @@ -121,7 +122,8 @@ my %switches = (); sub get_switches { - my $data = `ibswitches $cache_file`; + my $data = `ibswitches $cache_file` || + die "'ibswitches $cache_file' failed.\n"; my @lines = split("\n", $data); foreach my $line (@lines) { if ($line =~ /^Switch\s+:\s+(\w+)\s+ports\s+(\d+)\s+.*/) { diff --git a/infiniband-diags/scripts/ibswportwatch.pl b/infiniband-diags/scripts/ibswportwatch.pl index d888f51..92066d1 100755 --- a/infiniband-diags/scripts/ibswportwatch.pl +++ b/infiniband-diags/scripts/ibswportwatch.pl @@ -121,12 +121,10 @@ sub get_new_counts ) ) { - print "perfquery failed : \"perfquery $GUID $addr $port\"\n"; - system("cat $IBswcountlimits::cache_dir/perfquery.out"); - exit 1; + die "perfquery failed : \"perfquery $GUID $addr $port\"\n"; } open PERF_QUERY, "<$IBswcountlimits::cache_dir/perfquery.out" - or die "perfquery failed"; + or die "cannot read '$IBswcountlimits::cache_dir/perfquery.out': $!\n"; while (my $line = ) { foreach my $count (@IBswcountlimits::counters) { if ($line =~ /^$count:\.+(\d+)/) { -- 1.5.4.rc2.60.gb2e62 From sashak at voltaire.com Sun May 25 12:14:30 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 25 May 2008 22:14:30 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags: terminate perl scripts with error if not authorized In-Reply-To: <4836EB27.7060707@llnl.gov> References: <4836EB27.7060707@llnl.gov> Message-ID: <20080525191430.GT4616@sashak.voltaire.com> Hi Tim, On 09:04 Fri 23 May , Timothy A. Meier wrote: > > +# ========================================================================= > +# only authorized if uid is root, or matches umad ownership > +# > +sub auth_check > +{ > + my $file = "/dev/infiniband/umad0"; How would we know that it is "/dev/infiniband/umad0" and not another device (when first port in not connected, or if -C and/or -P options are used, or if udev is configured to put the entries in another place)? Really I don't see an easy (without reimplementing most of libibumad device resolution functionality via sysfs in perl scripts) way to detect device reliably. > + my $uid = (stat $file)[4]; > + my $gid = (stat $file)[5]; > + if (($> != $uid) && ($> != $gid) && ($> != 0)){ The requirement here is not really ownership, but rather that the file is readable and writable by user which runs script. Right? 
Sasha From rdreier at cisco.com Sun May 25 13:43:08 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 25 May 2008 13:43:08 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4838F6CB.2040203@voltaire.com> (Or Gerlitz's message of "Sun, 25 May 2008 08:19:07 +0300") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> Message-ID: > * Do we want it to be a must for a consumer to invalidate an fast-reg > mr before reusing it? if yes, how? The verbs specs go into exhaustive detail about the state diagram for validity of MRs. > * talking about remote invalidation, I understand that it requires > support of both sides (and hence has to be negotiated), so the > IB_DEVICE_SEND_W_INV device capability says that a device can > send-with-invalidate, do we need a IB_DEVICE_RECV_W_INV cap as well? I think we decided that all of these related features will be indicated by IB_DEVICE_MEM_MGT_EXTENSIONS to avoid an explosion of capability bits. > * what about ZBVA, is it orthogonal to these calls, no enhancement of > the suggested API is needed even if zbva is used, or the other way, it > would work also when zbva is not used? ZBVA would require adding some flag to request ZBVA when registering. - R. From rdreier at cisco.com Sun May 25 15:14:28 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 25 May 2008 15:14:28 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <48378EB2.2060005@opengridcomputing.com> (Steve Wise's message of "Fri, 23 May 2008 22:42:42 -0500") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <483713CB.3010408@opengridcomputing.com> <48378C48.5060904@opengridcomputing.com> <48378EB2.2060005@opengridcomputing.com> Message-ID: > So something like this? yeah, looks reasonable... > static void ib_update_fast_reg_key(struct ib_mr *mr, u8 newkey) > { > /* iWARP: rkey == lkey */ actually I need to reread the IB spec and understand how the consumer key part of L_Key and R_Key is supposed to work... for Mellanox adapters at least the L_Key and R_Key are the same too. 
> if (mr->rkey == mr->lkey) > mr->lkey = mr->lkey & 0xffffff00 | newkey; > mr->rkey = mr->rkey & 0xffffff00 | newkey; > } From rdreier at cisco.com Sun May 25 15:21:03 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 25 May 2008 15:21:03 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: (Roland Dreier's message of "Sun, 25 May 2008 15:14:28 -0700") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <483713CB.3010408@opengridcomputing.com> <48378C48.5060904@opengridcomputing.com> <48378EB2.2060005@opengridcomputing.com> Message-ID: > > static void ib_update_fast_reg_key(struct ib_mr *mr, u8 newkey) > > { > > /* iWARP: rkey == lkey */ > > actually I need to reread the IB spec and understand how the consumer > key part of L_Key and R_Key is supposed to work... for Mellanox adapters > at least the L_Key and R_Key are the same too. > > > if (mr->rkey == mr->lkey) > > mr->lkey = mr->lkey & 0xffffff00 | newkey; > > mr->rkey = mr->rkey & 0xffffff00 | newkey; > > } I just looked in the IB spec (1.2.1) and it talks about passing the "Key to use on the new L_Key and R_Key" into a fastreg work request. So I think we can just can the test for rkey==lkey and just do mr->lkey = mr->lkey & 0xffffff00 | newkey; mr->rkey = mr->rkey & 0xffffff00 | newkey; - R. From rdreier at cisco.com Sun May 25 15:43:56 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 25 May 2008 15:43:56 -0700 Subject: [ofa-general] Re: [PATCH v2 12/13] QLogic VNIC: Driver Kconfig and Makefile. In-Reply-To: <20080519103730.12355.14730.stgit@localhost.localdomain> (Ramachandra K.'s message of "Mon, 19 May 2008 16:07:30 +0530") References: <20080519102843.12355.832.stgit@localhost.localdomain> <20080519103730.12355.14730.stgit@localhost.localdomain> Message-ID: > +config INFINIBAND_QLGC_VNIC_DEBUG > + bool "QLogic VNIC Verbose debugging" > + depends on INFINIBAND_QLGC_VNIC > + default n > + ---help--- > + This option causes verbose debugging code to be compiled > + into the QLogic VNIC driver. The output can be turned on via the > + vnic_debug module parameter. I think I mentioned this before, but... if you default this option to 'n', then all distributions will build your module with the option off. And if someone is having problems, they will be forced to rebuild their kernel to get debug output, which is a heavy burden for most users. 
Much better to do something like what I ended up doing for mthca, which is to have the option on unless someone specifically enables CONFIG_EMBEDDED and goes out of their way to disable it: config INFINIBAND_MTHCA_DEBUG bool "Verbose debugging output" if EMBEDDED depends on INFINIBAND_MTHCA default y ---help--- This option causes debugging code to be compiled into the
From rdreier at cisco.com Sun May 25 15:47:29 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 25 May 2008 15:47:29 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <20080516223419.27221.49014.stgit@dell3.ogc.int> (Steve Wise's message of "Fri, 16 May 2008 17:34:20 -0500") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> Message-ID: > - device-specific alloc/free of physical buffer lists for use in fast > register work requests. This allows devices to allocate this memory as > needed (like via dma_alloc_coherent). I'm looking at how one would implement the MM extensions for mlx4, and it turns out that in addition to needing to allocate these fastreg page lists in coherent memory, mlx4 is even going to need to write to the memory (basically set the lsb of each address for internal device reasons). So I think we just need to update the documentation of the interface so that not only does the page list belong to the device driver between posting the fastreg work request and completing the request, but also the device driver is allowed to change the page list as part of the work request processing. I don't see any real reason why this would cause problems for consumers; does this seem OK to other people?
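To make the proposed ownership rule concrete, here is a minimal sketch of the kind of in-place rewrite the driver would then be allowed to do. It assumes the RFC's plain u64 page-list array; the function name and the use of bit 0 as a device marker are hypothetical, purely for illustration:

#include <linux/types.h>

/*
 * Hedged sketch, not part of the RFC: between posting a fastreg work
 * request and reaping its completion the page list belongs to the
 * device driver, so it may rewrite the entries in place -- e.g. an
 * mlx4-style setting of the lsb of each address for internal device
 * reasons.
 */
static void driver_mark_fastreg_pages(u64 *page_list, unsigned int npages)
{
	unsigned int i;

	for (i = 0; i < npages; ++i)
		page_list[i] |= 1ULL;	/* hypothetical device-owned marker bit */
}

A consumer that treats the list as write-only after posting never notices the mutation, which is exactly the documentation change being proposed.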
From ihara at sun.com Sun May 25 19:25:21 2008 From: ihara at sun.com (Shuichi Ihara) Date: Mon, 26 May 2008 11:25:21 +0900 Subject: [ofa-general] question of mvapich version Message-ID: <483A1F91.6080205@sun.com> Hi, I have a question about the versions of mvapich and mvapich2 which are included in ofed-1.3.1-rc2. It looks like the mvapich version is 1.0.1-2481 in ofed-1.3.1, but I can't see the same version on mvapich's download site or in the SVN repositories. Is this version the same as Revision 2481:/mvapich/branches/1.0? Also, mvapich2's src.rpm filename is mvapich2-1.0.3-1.src.rpm. Is it also from the svn branches? I would like to know where the source trees of both packages are. Thanks, -Ihara
From panda at cse.ohio-state.edu Sun May 25 20:03:26 2008 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Sun, 25 May 2008 23:03:26 -0400 (EDT) Subject: [ofa-general] question of mvapich version In-Reply-To: <483A1F91.6080205@sun.com> Message-ID: > Hi, > > I have a question about the versions of mvapich and mvapich2 which are included > in ofed-1.3.1-rc2. > > It looks like the mvapich version is 1.0.1-2481 in ofed-1.3.1, but I can't see the same > version on mvapich's download site or in the SVN repositories. Is this version the same as > Revision 2481:/mvapich/branches/1.0? Yes. This version will soon (most likely during the coming week) be available as MVAPICH 1.0.1. > Also, mvapich2's src.rpm filename is mvapich2-1.0.3-1.src.rpm. Is it > also from the svn branches? I would like to know where the source trees > of both packages are. Yes. This version will also soon be available as MVAPICH2 1.0.3. Both MVAPICH and MVAPICH2 source packages in OFED are from the original MVAPICH and MVAPICH2 SVN. Since OFED has its own release schedule, the versions in OFED are identified with a version number and a `-x' suffix to keep track of the exact versions which are going into OFED. Hope this helps. Thanks, DK > Thanks, > > -Ihara > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >
From eli at dev.mellanox.co.il Mon May 26 00:20:40 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Mon, 26 May 2008 10:20:40 +0300 Subject: [ofa-general] [PATCH] IB/ipoib: increase ring sizes Message-ID: <1211786440.13769.54.camel@mtls03>
>From b1ec82e65173556919f0e0f728af520e41bdbd5b Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Mon, 26 May 2008 10:19:16 +0300 Subject: [PATCH] IB/ipoib: increase ring sizes Increase the IPoIB ring sizes to twice their original size to act as a shock absorber for high traffic peaks.
Signed-off-by: Eli Cohen --- drivers/infiniband/ulp/ipoib/ipoib.h | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 2b6f60b..c49fc09 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -65,8 +65,8 @@ enum { IPOIB_CM_BUF_SIZE = IPOIB_CM_MTU + IPOIB_ENCAP_LEN, IPOIB_CM_HEAD_SIZE = IPOIB_CM_BUF_SIZE % PAGE_SIZE, IPOIB_CM_RX_SG = ALIGN(IPOIB_CM_BUF_SIZE, PAGE_SIZE) / PAGE_SIZE, - IPOIB_RX_RING_SIZE = 128, - IPOIB_TX_RING_SIZE = 64, + IPOIB_RX_RING_SIZE = 256, + IPOIB_TX_RING_SIZE = 128, IPOIB_MAX_QUEUE_SIZE = 8192, IPOIB_MIN_QUEUE_SIZE = 2, IPOIB_CM_MAX_CONN_QP = 4096, -- 1.5.5.1
From ogerlitz at voltaire.com Mon May 26 00:29:53 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 26 May 2008 10:29:53 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> Message-ID: <483A66F1.9040305@voltaire.com> Roland Dreier wrote: > I don't see any real reason why this would cause problems for consumers; > does this seem OK to other people? this seems fine to me. Or.
From ramachandra.kuchimanchi at qlogic.com Mon May 26 00:37:28 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 26 May 2008 13:07:28 +0530 Subject: [ofa-general] Re: [PATCH v2 12/13] QLogic VNIC: Driver Kconfig and Makefile. In-Reply-To: References: <20080519102843.12355.832.stgit@localhost.localdomain> <20080519103730.12355.14730.stgit@localhost.localdomain> Message-ID: <71d336490805260037y4c45abd5lccd5dbadebd84eb8@mail.gmail.com> Roland, On Mon, May 26, 2008 at 4:13 AM, Roland Dreier wrote: > > +config INFINIBAND_QLGC_VNIC_DEBUG > > + bool "QLogic VNIC Verbose debugging" > > + depends on INFINIBAND_QLGC_VNIC > > + default n > > + ---help--- > > + This option causes verbose debugging code to be compiled > > + into the QLogic VNIC driver. The output can be turned on via the > > + vnic_debug module parameter. > > I think I mentioned this before, but... if you default this option to > 'n', then all distributions will build your module with the option off. > And if someone is having problems, they will be forced to rebuild their > kernel to get debug output, which is a heavy burden for most users. The debugging code is always compiled in and is controlled at run time through vnic_debug module parameter. INFINIBAND_QLGC_VNIC_DEBUG config option only controls verbose debugging which adds some extra information in the debug statements (file name, line number) which we typically use for debug builds of the driver. Even if this option is set to 'n', users can still get all debug messages from the driver by using the vnic_debug module parameter. Regards, Ram
From marcel.heinz at informatik.tu-chemnitz.de Mon May 26 02:09:20 2008 From: marcel.heinz at informatik.tu-chemnitz.de (Marcel Heinz) Date: Mon, 26 May 2008 11:09:20 +0200 Subject: [ofa-general] Multicast Performance In-Reply-To: <48371B2D.3040908@gmail.com> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> Message-ID: <483A7E40.5040407@informatik.tu-chemnitz.de> Hello, Dotan Barak wrote: > Hi. > > Do you use the latest released FW for this device? The HCAs all use Mellanox's latest released FW, version 1.2.0. I'll have a look at the switch later.
Regards, Marcel From vlad at lists.openfabrics.org Mon May 26 03:10:10 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 26 May 2008 03:10:10 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20080526-0200 daily build status Message-ID: <20080526101010.8FA37E60CF8@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Passed on ppc64 with linux-2.6.19 Failed: From ogerlitz at voltaire.com Mon May 26 04:10:49 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 26 May 2008 14:10:49 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> Message-ID: <483A9AB9.10809@voltaire.com> Roland Dreier wrote: > > * talking about remote invalidation, I understand that it requires > > support of both sides (and hence has to be negotiated), so the > > IB_DEVICE_SEND_W_INV device capability says that a device can > > send-with-invalidate, do we need a IB_DEVICE_RECV_W_INV cap as well? 
> > I think we decided that all of these related features will be indicated > by IB_DEVICE_MEM_MGT_EXTENSIONS to avoid an explosion of capability bits. send-with-invalidate is a little different in the sense that we would probably want to expose remote invalidation through libibverbs such that user space block/file targets (eg the iSER layer of STGT) would be able to use it, but (at least in this point of time) not expose the other memory management extensions to user space. BTW - what's the status of the send-with-invalidate patches to the core and mlx4? > ZBVA would require adding some flag to request ZBVA when registering. So this flag would be added as a field in the WR? for the current proposal, can the ULP dictate the VA as done with the current FMR API exposed by the core? Or. From swise at opengridcomputing.com Mon May 26 06:05:48 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 26 May 2008 08:05:48 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> Message-ID: <483AB5AC.3030406@opengridcomputing.com> Roland Dreier wrote: > > * Do we want it to be a must for a consumer to invalidate an fast-reg > > mr before reusing it? if yes, how? > > The verbs specs go into exhaustive detail about the state diagram for > validity of MRs. > > > * talking about remote invalidation, I understand that it requires > > support of both sides (and hence has to be negotiated), so the > > IB_DEVICE_SEND_W_INV device capability says that a device can > > send-with-invalidate, do we need a IB_DEVICE_RECV_W_INV cap as well? > > I think we decided that all of these related features will be indicated > by IB_DEVICE_MEM_MGT_EXTENSIONS to avoid an explosion of capability bits. > BTW: a single capability bit doesn't allow apps to decide at run time whether to use read-with-inv, which is iwarp-only. Perhaps we need that as its own capbility bit? Or perhaps we can load detailed support/no support into the query device logic? What it some devices can only support part of the suite of MEM_MGT_EXTENSIONS? > > * what about ZBVA, is it orthogonal to these calls, no enhancement of > > the suggested API is needed even if zbva is used, or the other way, it > > would work also when zbva is not used? > > ZBVA would require adding some flag to request ZBVA when registering. > > - R. From swise at opengridcomputing.com Mon May 26 06:07:50 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 26 May 2008 08:07:50 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> Message-ID: <483AB626.8060404@opengridcomputing.com> Roland Dreier wrote: > > - device-specific alloc/free of physical buffer lists for use in fast > > register work requests. This allows devices to allocate this memory as > > needed (like via dma_alloc_coherent). > > I'm looking at how one would implement the MM extensions for mlx4, and > it turns out that in addition to needing to allocate these fastreg page > lists in coherent memory, mlx4 is even going to need to write to the > memory (basically set the lsb of each address for internal device > reasons). 
So I think we just need to update the documentation of the interface so that not only does the page list belong to the device driver between posting the fastreg work request and completing the request, but also the device driver is allowed to change the page list as part of the work request processing. > > I don't see any real reason why this would cause problems for consumers; > does this seem OK to other people? Tom, Does this affect how you plan to implement NFSRDMA MEM_MGT_EXTENSIONS support?
From rdreier at cisco.com Mon May 26 14:47:29 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 14:47:29 -0700 Subject: [ofa-general] Re: [PATCH v2 12/13] QLogic VNIC: Driver Kconfig and Makefile. In-Reply-To: <71d336490805260037y4c45abd5lccd5dbadebd84eb8@mail.gmail.com> (Ramachandra K.'s message of "Mon, 26 May 2008 13:07:28 +0530") References: <20080519102843.12355.832.stgit@localhost.localdomain> <20080519103730.12355.14730.stgit@localhost.localdomain> <71d336490805260037y4c45abd5lccd5dbadebd84eb8@mail.gmail.com> Message-ID: > The debugging code is always compiled in and is controlled > at run time through vnic_debug module parameter. > INFINIBAND_QLGC_VNIC_DEBUG config option only controls verbose debugging > which adds some extra information in the debug statements (file name, > line number) > which we typically use for debug builds of the driver. Even if this option is > set to 'n', users can still get all debug messages from the driver by using the > vnic_debug module parameter. OK, I looked at the code. Is there any point to having CONFIG_INFINIBAND_QLGC_VNIC_DEBUG at all?? Is anyone going to care about having __FILE__ and __LINE__ included in the output and want to set this option to 'n'? - R.
From rdreier at cisco.com Mon May 26 14:53:04 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 14:53:04 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <483AB5AC.3030406@opengridcomputing.com> (Steve Wise's message of "Mon, 26 May 2008 08:05:48 -0500") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> Message-ID: > BTW: a single capability bit doesn't allow apps to decide at run time > whether to use read-with-inv, which is iwarp-only. Perhaps we need > that as its own capability bit? Or perhaps we can load detailed > support/no support into the query device logic? What if some devices > can only support part of the suite of MEM_MGT_EXTENSIONS? I think RDMA read with invalidate can be tested for as iWARP vs. IB. The reason IB doesn't have it is kind of inherent in the IB protocol, since remote access is not required for the RDMA target. I think making the capability flags really fine-grained isn't worth it -- we went too far in that direction historically, and no one checks any capability flags at all. It's just complexity. So any device that supports only part of the IB base memory mgt extensions (or doesn't support the full IWARP spec) just shouldn't advertise MEM_MGT_EXTENSIONS, I think. Implementing such a device would be kind of dumb anyway at this point. - R.
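To illustrate the single-bit scheme, a consumer-side gate might look roughly like the sketch below. ib_query_device(), IB_DEVICE_MEM_MGT_EXTENSIONS and rdma_node_get_transport() are the names used in this series and in mainline; the helper itself and its calling convention are hypothetical:

#include <linux/errno.h>
#include <rdma/ib_verbs.h>

/*
 * Hedged sketch only: gate the fastreg path on the single capability
 * bit, and treat read-with-invalidate as an iWARP-vs-IB transport
 * question rather than as a separate capability flag.
 */
static int can_use_fastreg(struct ib_device *dev, int *use_read_with_inv)
{
	struct ib_device_attr attr;
	int ret;

	ret = ib_query_device(dev, &attr);
	if (ret)
		return ret;

	if (!(attr.device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS))
		return -ENOSYS;

	*use_read_with_inv =
		rdma_node_get_transport(dev->node_type) ==
		RDMA_TRANSPORT_IWARP;

	return 0;
}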
From rdreier at cisco.com Mon May 26 14:57:41 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 14:57:41 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <483A9AB9.10809@voltaire.com> (Or Gerlitz's message of "Mon, 26 May 2008 14:10:49 +0300") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483A9AB9.10809@voltaire.com> Message-ID: > send-with-invalidate is a little different in the sense that we would > probably want to expose remote invalidation through libibverbs such > that user space block/file targets (eg the iSER layer of STGT) would > be able to use it, but (at least at this point in time) not expose the > other memory management extensions to user space. Why? Local invalidate and RDMA read with invalidate make perfect sense for userspace too. Of course fast register through a send queue can't be used in userspace because it operates on physical memory, but I think MEM_MGT_EXTENSIONS makes sense as something userspace can test for and use. > BTW - what's the status of the send-with-invalidate patches to the > core and mlx4? I'll add the completion struct changes for 2.6.27, and roll the mlx4 patches into the full MEM_MGT_EXTENSIONS patch. > > ZBVA would require adding some flag to request ZBVA when registering. > So this flag would be added as a field in the WR? For the current > proposal, can the ULP dictate the VA as done with the current FMR API > exposed by the core? The IB spec has a table that shows exactly which operations would need flags to handle the ZBVA extension. As for the VA, I think the latest patch is pretty clear: > @@ -676,6 +683,20 @@ struct ib_send_wr { > u16 pkey_index; /* valid for GSI only */ > u8 port_num; /* valid for DR SMPs on switch only */ > } ud; > + struct { > + u64 iova_start; - R.
From rdreier at cisco.com Mon May 26 15:21:15 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 15:21:15 -0700 Subject: [ofa-general] Re: linux-next: [PATCH] infiniband/hw/ipath/ipath_sdma.c , fix compiler warnings In-Reply-To: <1211579101.3949.326.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 23 May 2008 14:45:01 -0700") References: <48342C6C.2010502@googlemail.com> <1211579101.3949.326.camel@brick.pathscale.com> Message-ID: OK, I added the following to my tree: commit e8ffef73c8dd2c2d00287829db87cdaf229d3859 Author: Roland Dreier Date: Mon May 26 15:20:34 2008 -0700 IB/ipath: Avoid test_bit() on u64 SDMA status value Gabriel C pointed out that when the x86 bitops are updated to operate on unsigned long, the code in sdma_abort_task() will produce warnings: drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_abort_task': drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type and so on, because it uses test_bit() to operate on a u64 value (returned by ipath_read_kreg64() for a hardware register). Fix up these warnings by converting the test_bit() operations to &ing with appropriate symbolic defines of the bits within the hardware register. This has the benign side-effect of making the code more self-documenting as well.
Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index 59a8b25..0bd8bcb 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -232,6 +232,11 @@ struct ipath_sdma_desc { #define IPATH_SDMA_TXREQ_S_ABORTED 2 #define IPATH_SDMA_TXREQ_S_SHUTDOWN 3 +#define IPATH_SDMA_STATUS_SCORE_BOARD_DRAIN_IN_PROG (1ull << 63) +#define IPATH_SDMA_STATUS_ABORT_IN_PROG (1ull << 62) +#define IPATH_SDMA_STATUS_INTERNAL_SDMA_ENABLE (1ull << 61) +#define IPATH_SDMA_STATUS_SCB_EMPTY (1ull << 30) + /* max dwords in small buffer packet */ #define IPATH_SMALLBUF_DWORDS (dd->ipath_piosize2k >> 2) diff --git a/drivers/infiniband/hw/ipath/ipath_sdma.c b/drivers/infiniband/hw/ipath/ipath_sdma.c index 0a8c1b8..eaba032 100644 --- a/drivers/infiniband/hw/ipath/ipath_sdma.c +++ b/drivers/infiniband/hw/ipath/ipath_sdma.c @@ -263,14 +263,10 @@ static void sdma_abort_task(unsigned long opaque) hwstatus = ipath_read_kreg64(dd, dd->ipath_kregs->kr_senddmastatus); - if (/* ScoreBoardDrainInProg */ - test_bit(63, &hwstatus) || - /* AbortInProg */ - test_bit(62, &hwstatus) || - /* InternalSDmaEnable */ - test_bit(61, &hwstatus) || - /* ScbEmpty */ - !test_bit(30, &hwstatus)) { + if ((hwstatus & (IPATH_SDMA_STATUS_SCORE_BOARD_DRAIN_IN_PROG | + IPATH_SDMA_STATUS_ABORT_IN_PROG | + IPATH_SDMA_STATUS_INTERNAL_SDMA_ENABLE)) || + !(hwstatus & IPATH_SDMA_STATUS_SCB_EMPTY)) { if (dd->ipath_sdma_reset_wait > 0) { /* not done shutting down sdma */ --dd->ipath_sdma_reset_wait; From rdreier at cisco.com Mon May 26 15:22:27 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 15:22:27 -0700 Subject: [ofa-general] [PATCH] IB/ipath - fix device capability flags In-Reply-To: <20080523214334.20736.86499.stgit@eng-46.mv.qlogic.com> (Ralph Campbell's message of "Fri, 23 May 2008 14:43:34 -0700") References: <20080523214329.20736.85555.stgit@eng-46.mv.qlogic.com> <20080523214334.20736.86499.stgit@eng-46.mv.qlogic.com> Message-ID: thanks, applied From rdreier at cisco.com Mon May 26 15:23:48 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 15:23:48 -0700 Subject: [ofa-general] Re: [ PATCH ] RDMA/nes Update MAINTAINERS list In-Reply-To: <200805211649.m4LGnwPP026935@velma.neteffect.com> (Chien Tung's message of "Wed, 21 May 2008 11:49:58 -0500") References: <200805211649.m4LGnwPP026935@velma.neteffect.com> Message-ID: thanks, applied. From swise at opengridcomputing.com Mon May 26 15:33:59 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 26 May 2008 17:33:59 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> Message-ID: <483B3AD7.4050208@opengridcomputing.com> Roland Dreier wrote: > > BTW: a single capability bit doesn't allow apps to decide at run time > > whether to use read-with-inv, which is iwarp-only. Perhaps we need > > that as its own capbility bit? Or perhaps we can load detailed > > support/no support into the query device logic? What it some devices > > can only support part of the suite of MEM_MGT_EXTENSIONS? > > I think RDMA read with invalidate can be tested for as iWARP vs. IB. > The reason IB doesn't have it is kind of inherent in the IB protocol, > since remote access is not required for the RDMA target. 
> The "invalidate local stag" part of a read is just a local sink side operation (ie no wire protocol change from a read). It's not like processing an ingress send-with-inv. It is really functionally like a read followed immediately by a fenced invalidate-local, but it doesn't stall the pipe. So the device has to remember the read is a "with inv local stag" and invalidate the stag after the read response is placed and before the WCE is reaped by the application. > I think making the capability flags really fine-grained isn't worth > it -- we went too far in that direction historically, and no one checks > any capability flags at all. It's just complexity. > Ok. Steve. From rdreier at cisco.com Mon May 26 16:02:30 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 16:02:30 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <483B3AD7.4050208@opengridcomputing.com> (Steve Wise's message of "Mon, 26 May 2008 17:33:59 -0500") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> Message-ID: > The "invalidate local stag" part of a read is just a local sink side > operation (ie no wire protocol change from a read). It's not like > processing an ingress send-with-inv. It is really functionally like a > read followed immediately by a fenced invalidate-local, but it doesn't > stall the pipe. So the device has to remember the read is a "with inv > local stag" and invalidate the stag after the read response is placed > and before the WCE is reaped by the application. Yes, understood. My point was just that in IB, at least in theory, one could just use an L_Key that doesn't have any remote permissions in the scatter list of an RDMA read, while in iWARP, the STag used to place an RDMA read response has to have remote write permission. So RDMA read with invalidate makes sense for iWARP, because it gives a race-free way to allow an STag to be invalidated immediately after an RDMA read response is placed, while in IB it's simpler just to never give remote access at all. - R. From uspropertyfax at gmail.com Mon May 26 19:06:14 2008 From: uspropertyfax at gmail.com (US Property Report) Date: Mon, 26 May 2008 19:06:14 -0700 Subject: [ofa-general] Property Fax Report Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
From rdreier at cisco.com Mon May 26 20:44:14 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 20:44:14 -0700 Subject: [ofa-general] [PATCH V2 1/2] IB/core: handle race between elements in work queues after event In-Reply-To: <4832DC80.2000408@Voltaire.COM> (Moni Shoua's message of "Tue, 20 May 2008 17:13:20 +0300") References: <48302034.8040709@Voltaire.COM> <4832D99E.3010205@Voltaire.COM> <4832DC80.2000408@Voltaire.COM> Message-ID: thanks, applied for 2.6.27
From rdreier at cisco.com Mon May 26 20:48:17 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 20:48:17 -0700 Subject: [ofa-general] [PATCH] IB/mlx4: Optimize stamping for selective signalling QPs In-Reply-To: <1211374769.6577.21.camel@eli-laptop> (Eli Cohen's message of "Wed, 21 May 2008 15:59:29 +0300") References: <1211374769.6577.21.camel@eli-laptop> Message-ID: thanks, applied for 2.6.27
From rdreier at cisco.com Mon May 26 20:58:29 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 20:58:29 -0700 Subject: [ofa-general] Re: [PATCH] core/include: fix coding style typos according to checkpatch.pl In-Reply-To: <200805232252.04047.dotanba@gmail.com> (Dotan Barak's message of "Fri, 23 May 2008 22:52:03 +0300") References: <200805232252.04047.dotanba@gmail.com> Message-ID: Thanks, all changes do look like improvements. applied for 2.6.27. I'll get rid of the rest of the $Id lines in drivers/infiniband.
From rdreier at cisco.com Mon May 26 21:10:08 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 21:10:08 -0700 Subject: [ofa-general] Re: [PATCH] core/include: fix coding style typos according to checkpatch.pl In-Reply-To: (Roland Dreier's message of "Mon, 26 May 2008 20:58:29 -0700") References: <200805232252.04047.dotanba@gmail.com> Message-ID: > I'll get rid of the rest of the $Id lines in drivers/infiniband. like this... commit 2be5019394ab8a6fa924bc955682db62950ddcc6 Author: Roland Dreier Date: Mon May 26 21:09:23 2008 -0700 RDMA: Remove subversion $Id tags They don't get updated by git and so they're worse than useless.
- * - * $Id: cm.c 4311 2005-12-05 18:42:01Z sean.hefty $ */ #include diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h index 7ad47a4..05ac36e 100644 --- a/drivers/infiniband/core/core_priv.h +++ b/drivers/infiniband/core/core_priv.h @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: core_priv.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef _CORE_PRIV_H diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index 5ac5ffe..7913b80 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: device.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/core/fmr_pool.c b/drivers/infiniband/core/fmr_pool.c index 1286dc1..4507043 100644 --- a/drivers/infiniband/core/fmr_pool.c +++ b/drivers/infiniband/core/fmr_pool.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: fmr_pool.c 2730 2005-06-28 16:43:03Z sean.hefty $ */ #include diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h index 8b75010..05ce331 100644 --- a/drivers/infiniband/core/mad_priv.h +++ b/drivers/infiniband/core/mad_priv.h @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mad_priv.h 5596 2006-03-03 01:00:07Z sean.hefty $ */ #ifndef __IB_MAD_PRIV_H__ diff --git a/drivers/infiniband/core/mad_rmpp.c b/drivers/infiniband/core/mad_rmpp.c index a5e2a31..d0ef7d6 100644 --- a/drivers/infiniband/core/mad_rmpp.c +++ b/drivers/infiniband/core/mad_rmpp.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mad_rmpp.c 1921 2005-03-02 22:58:44Z sean.hefty $ */ #include "mad_priv.h" diff --git a/drivers/infiniband/core/mad_rmpp.h b/drivers/infiniband/core/mad_rmpp.h index f0616fd..3d336bf 100644 --- a/drivers/infiniband/core/mad_rmpp.h +++ b/drivers/infiniband/core/mad_rmpp.h @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mad_rmpp.h 1921 2005-02-25 22:58:44Z sean.hefty $ */ #ifndef __MAD_RMPP_H__ diff --git a/drivers/infiniband/core/packer.c b/drivers/infiniband/core/packer.c index c972d72..019bd4b 100644 --- a/drivers/infiniband/core/packer.c +++ b/drivers/infiniband/core/packer.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: packer.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index 78ea815..1341de7 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
- * - * $Id: sa_query.c 2811 2005-07-06 18:11:43Z halr $ */ #include diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c index 9575655..36a0ef9 100644 --- a/drivers/infiniband/core/sysfs.c +++ b/drivers/infiniband/core/sysfs.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: sysfs.c 1349 2004-12-16 21:09:43Z roland $ */ #include "core_priv.h" diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c index d7a6881..54fc1de 100644 --- a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: ucm.c 4311 2005-12-05 18:42:01Z sean.hefty $ */ #include diff --git a/drivers/infiniband/core/ud_header.c b/drivers/infiniband/core/ud_header.c index 997c07d..8ec7876 100644 --- a/drivers/infiniband/core/ud_header.c +++ b/drivers/infiniband/core/ud_header.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: ud_header.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index fe78f7d..5c145b2 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: uverbs_mem.c 2743 2005-06-28 22:27:59Z roland $ */ #include diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index 840ede9..eb58fcf 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -31,8 +31,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: user_mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ */ #include diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index 376a57c..b3ea958 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -32,8 +32,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: uverbs.h 2559 2005-06-06 19:43:16Z roland $ */ #ifndef UVERBS_H diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 2c3bff5..112b37c 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -31,8 +31,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: uverbs_cmd.c 2708 2005-06-24 17:27:21Z roland $ */ #include diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index f806da1..9bc07f4 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -32,8 +32,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
- * - * $Id: uverbs_main.c 2733 2005-06-28 19:14:34Z roland $ */ #include diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 0504208..9f399d3 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -34,8 +34,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: verbs.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_allocator.c b/drivers/infiniband/hw/mthca/mthca_allocator.c index a763067..c5ccc2d 100644 --- a/drivers/infiniband/hw/mthca/mthca_allocator.c +++ b/drivers/infiniband/hw/mthca/mthca_allocator.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_allocator.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_av.c b/drivers/infiniband/hw/mthca/mthca_av.c index 4b111a8..32f6c63 100644 --- a/drivers/infiniband/hw/mthca/mthca_av.c +++ b/drivers/infiniband/hw/mthca/mthca_av.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_av.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_catas.c b/drivers/infiniband/hw/mthca/mthca_catas.c index e948158..40573e4 100644 --- a/drivers/infiniband/hw/mthca/mthca_catas.c +++ b/drivers/infiniband/hw/mthca/mthca_catas.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id$ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c index 54d230e..c33e1c5 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.c +++ b/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_cmd.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.h b/drivers/infiniband/hw/mthca/mthca_cmd.h index 8928ca4..6efd326 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.h +++ b/drivers/infiniband/hw/mthca/mthca_cmd.h @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_cmd.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef MTHCA_CMD_H diff --git a/drivers/infiniband/hw/mthca/mthca_config_reg.h b/drivers/infiniband/hw/mthca/mthca_config_reg.h index afa56bf..75671f7 100644 --- a/drivers/infiniband/hw/mthca/mthca_config_reg.h +++ b/drivers/infiniband/hw/mthca/mthca_config_reg.h @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
- * - * $Id: mthca_config_reg.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef MTHCA_CONFIG_REG_H diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index 20401d2..f788fce 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -32,8 +32,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_cq.c 1369 2004-12-20 16:17:07Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index 7bc32f8..2997d8d 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -32,8 +32,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_dev.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef MTHCA_DEV_H diff --git a/drivers/infiniband/hw/mthca/mthca_doorbell.h b/drivers/infiniband/hw/mthca/mthca_doorbell.h index b374dc3..14f51ef 100644 --- a/drivers/infiniband/hw/mthca/mthca_doorbell.h +++ b/drivers/infiniband/hw/mthca/mthca_doorbell.h @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_doorbell.h 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_eq.c b/drivers/infiniband/hw/mthca/mthca_eq.c index 8bde7f9..4e36aa7 100644 --- a/drivers/infiniband/hw/mthca/mthca_eq.c +++ b/drivers/infiniband/hw/mthca/mthca_eq.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_eq.c 1382 2004-12-24 02:21:02Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_mad.c b/drivers/infiniband/hw/mthca/mthca_mad.c index 8b7e83e..6404495 100644 --- a/drivers/infiniband/hw/mthca/mthca_mad.c +++ b/drivers/infiniband/hw/mthca/mthca_mad.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_mad.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 200cf13..fb9f91b 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_main.c 1396 2004-12-28 04:10:27Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_mcg.c b/drivers/infiniband/hw/mthca/mthca_mcg.c index a8ad072..3f5f948 100644 --- a/drivers/infiniband/hw/mthca/mthca_mcg.c +++ b/drivers/infiniband/hw/mthca/mthca_mcg.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
- * - * $Id: mthca_mcg.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c index b224079..9e77ba9 100644 --- a/drivers/infiniband/hw/mthca/mthca_memfree.c +++ b/drivers/infiniband/hw/mthca/mthca_memfree.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id$ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.h b/drivers/infiniband/hw/mthca/mthca_memfree.h index a1ab068..da9b8f9 100644 --- a/drivers/infiniband/hw/mthca/mthca_memfree.h +++ b/drivers/infiniband/hw/mthca/mthca_memfree.h @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id$ */ #ifndef MTHCA_MEMFREE_H diff --git a/drivers/infiniband/hw/mthca/mthca_mr.c b/drivers/infiniband/hw/mthca/mthca_mr.c index 820205d..8489b1e 100644 --- a/drivers/infiniband/hw/mthca/mthca_mr.c +++ b/drivers/infiniband/hw/mthca/mthca_mr.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_mr.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_pd.c b/drivers/infiniband/hw/mthca/mthca_pd.c index c1e9507..266f14e 100644 --- a/drivers/infiniband/hw/mthca/mthca_pd.c +++ b/drivers/infiniband/hw/mthca/mthca_pd.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_pd.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_profile.c b/drivers/infiniband/hw/mthca/mthca_profile.c index 605a8d5..d168c25 100644 --- a/drivers/infiniband/hw/mthca/mthca_profile.c +++ b/drivers/infiniband/hw/mthca/mthca_profile.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_profile.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_profile.h b/drivers/infiniband/hw/mthca/mthca_profile.h index e76cb62..62b009c 100644 --- a/drivers/infiniband/hw/mthca/mthca_profile.h +++ b/drivers/infiniband/hw/mthca/mthca_profile.h @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_profile.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef MTHCA_PROFILE_H diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index be34f99..87ad889 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -32,8 +32,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
- * - * $Id: mthca_provider.c 4859 2006-01-09 21:55:10Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_provider.h b/drivers/infiniband/hw/mthca/mthca_provider.h index 934bf95..c621f87 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.h +++ b/drivers/infiniband/hw/mthca/mthca_provider.h @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_provider.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef MTHCA_PROVIDER_H diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 09dc361..3b1c5ba 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -31,8 +31,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_qp.c 1355 2004-12-17 15:23:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_reset.c b/drivers/infiniband/hw/mthca/mthca_reset.c index 91934f2..acb6817 100644 --- a/drivers/infiniband/hw/mthca/mthca_reset.c +++ b/drivers/infiniband/hw/mthca/mthca_reset.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_reset.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_srq.c b/drivers/infiniband/hw/mthca/mthca_srq.c index a5ffff6..4fabe62 100644 --- a/drivers/infiniband/hw/mthca/mthca_srq.c +++ b/drivers/infiniband/hw/mthca/mthca_srq.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_srq.c 3047 2005-08-10 03:59:35Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_uar.c b/drivers/infiniband/hw/mthca/mthca_uar.c index 8b72848..ca5900c 100644 --- a/drivers/infiniband/hw/mthca/mthca_uar.c +++ b/drivers/infiniband/hw/mthca/mthca_uar.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id$ */ #include /* PAGE_SHIFT */ diff --git a/drivers/infiniband/hw/mthca/mthca_user.h b/drivers/infiniband/hw/mthca/mthca_user.h index e1262c9..5fe56e8 100644 --- a/drivers/infiniband/hw/mthca/mthca_user.h +++ b/drivers/infiniband/hw/mthca/mthca_user.h @@ -29,7 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * */ #ifndef MTHCA_USER_H diff --git a/drivers/infiniband/hw/mthca/mthca_wqe.h b/drivers/infiniband/hw/mthca/mthca_wqe.h index b3551a8..341a5ae 100644 --- a/drivers/infiniband/hw/mthca/mthca_wqe.h +++ b/drivers/infiniband/hw/mthca/mthca_wqe.h @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
- * - * $Id: mthca_wqe.h 3047 2005-08-10 03:59:35Z roland $ */ #ifndef MTHCA_WQE_H diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index ca126fc..0dcbab3 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: ipoib.h 1358 2004-12-17 22:00:11Z roland $ */ #ifndef _IPOIB_H diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 97e67d3..91c9592 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id$ */ #include diff --git a/drivers/infiniband/ulp/ipoib/ipoib_fs.c b/drivers/infiniband/ulp/ipoib/ipoib_fs.c index 8b882bb..961c585 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_fs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_fs.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: ipoib_fs.c 1389 2004-12-27 22:56:47Z roland $ */ #include diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index f429bce..eca8518 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -31,8 +31,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: ipoib_ib.c 1386 2004-12-27 16:23:17Z roland $ */ #include diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 2442090..f217b1e 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: ipoib_main.c 1377 2004-12-23 19:57:12Z roland $ */ #include "ipoib.h" diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 3f663fb..4a6538b 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: ipoib_multicast.c 1362 2004-12-18 15:56:29Z roland $ */ #include diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index 8766d29..810790a 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
- * - * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ */ #include "ipoib.h" diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c index 1cdb5cf..b08eb56 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: ipoib_vlan.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c index aeb58ca..356fac6 100644 --- a/drivers/infiniband/ulp/iser/iscsi_iser.c +++ b/drivers/infiniband/ulp/iser/iscsi_iser.c @@ -42,9 +42,6 @@ * Zhenyu Wang * Modified by: * Erez Zilber - * - * - * $Id: iscsi_iser.c 6965 2006-05-07 11:36:20Z ogerlitz $ */ #include diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h b/drivers/infiniband/ulp/iser/iscsi_iser.h index a8c1b30..0e10703 100644 --- a/drivers/infiniband/ulp/iser/iscsi_iser.h +++ b/drivers/infiniband/ulp/iser/iscsi_iser.h @@ -36,8 +36,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: iscsi_iser.h 7051 2006-05-10 12:29:11Z ogerlitz $ */ #ifndef __ISCSI_ISER_H__ #define __ISCSI_ISER_H__ diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c index 08dc81c..31ad498 100644 --- a/drivers/infiniband/ulp/iser/iser_initiator.c +++ b/drivers/infiniband/ulp/iser/iser_initiator.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: iser_initiator.c 6964 2006-05-07 11:11:43Z ogerlitz $ */ #include #include diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c index cac50c4..81e49cb 100644 --- a/drivers/infiniband/ulp/iser/iser_memory.c +++ b/drivers/infiniband/ulp/iser/iser_memory.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: iser_memory.c 6964 2006-05-07 11:11:43Z ogerlitz $ */ #include #include diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c index d19cfe6..77cabee 100644 --- a/drivers/infiniband/ulp/iser/iser_verbs.c +++ b/drivers/infiniband/ulp/iser/iser_verbs.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: iser_verbs.c 7051 2006-05-10 12:29:11Z ogerlitz $ */ #include #include diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 81cc59c..ed7c5f7 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
- * - * $Id: ib_srp.c 3932 2005-11-01 17:19:29Z roland $ */ #include diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h index 63d2ae7..e185b90 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.h +++ b/drivers/infiniband/ulp/srp/ib_srp.h @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: ib_srp.h 3932 2005-11-01 17:19:29Z roland $ */ #ifndef IB_SRP_H From rdreier at cisco.com Mon May 26 21:18:24 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 21:18:24 -0700 Subject: [ofa-general] Re: [ewg] [PATCH] IB/ipoib: copy small SKBs in CM mode In-Reply-To: <1211719109.13769.32.camel@mtls03> (Eli Cohen's message of "Sun, 25 May 2008 15:38:29 +0300") References: <1211470815.7310.61.camel@eli-laptop> <1211719109.13769.32.camel@mtls03> Message-ID: > + if (wc->byte_len < SKB_TSHOLD) { > + int dlen = wc->byte_len; > + > + small_skb = dev_alloc_skb(dlen + 12); > + if (small_skb) { > + skb_reserve(small_skb, 12); > + skb_copy_from_linear_data(skb, small_skb->data, dlen); > + skb_put(small_skb, dlen); Just noticed in the original patch: you need calls to ib_dma_sync_single_for_cpu and ib_dma_sync_single_for_device around this skb_copy_from_linear_data. > > Any reason why we wouldn't want this info in the patch changelog? > Not really. If you think it should be there, I'll add it to the > changelog along with an explanation to the question bellow. Yes, definitely we want the performance info. Imagine if you were reading the patch in git history -- clearly this justification and measurement would be very helpful in understanding why the patch was added, and there's no reason to leave out the useful information you've already written. - R. From ramachandra.kuchimanchi at qlogic.com Mon May 26 23:23:42 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Tue, 27 May 2008 11:53:42 +0530 Subject: [ofa-general] Re: [PATCH v2 12/13] QLogic VNIC: Driver Kconfig and Makefile. In-Reply-To: References: <20080519102843.12355.832.stgit@localhost.localdomain> <20080519103730.12355.14730.stgit@localhost.localdomain> <71d336490805260037y4c45abd5lccd5dbadebd84eb8@mail.gmail.com> Message-ID: <71d336490805262323qd876161ue86f98dbc8499ad6@mail.gmail.com> Roland, On Tue, May 27, 2008 at 3:17 AM, Roland Dreier wrote: > OK, I looked at the code. Is there any point to having > CONFIG_INFINIBAND_QLGC_VNIC_DEBUG at all?? Is anyone going to care > about having __FILE__ and __LINE__ included in the output and want to > set this option to 'n'? Makes sense. We will get rid of this CONFIG option. Apart from this are there any other changes you would like to see in the patch series ? 
Regards, Ram From tziporet at mellanox.co.il Tue May 27 00:00:11 2008 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 27 May 2008 10:00:11 +0300 Subject: [ofa-general] Update on features that are delayed from ofed 1.4 Message-ID: <6C2C79E72C305246B504CBA17B5500C9041CE703@mtlexch01.mtl.com> Hi, The following features were planned for OFED 1.4 but in the end will not be implemented: * RDMA CM to support IPv6 * Xsigo's host drivers: Virtual NIC and HBA * New IB verb for reliable multicast * SDP: RDMA zero copy Please review and comment Tziporet From marcel.heinz at informatik.tu-chemnitz.de Tue May 27 00:44:27 2008 From: marcel.heinz at informatik.tu-chemnitz.de (Marcel Heinz) Date: Tue, 27 May 2008 09:44:27 +0200 Subject: [ofa-general] Multicast Performance In-Reply-To: <483A7E40.5040407@informatik.tu-chemnitz.de> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> Message-ID: <483BBBDB.6000605@informatik.tu-chemnitz.de> Marcel Heinz wrote: > Dotan Barak wrote: >>Do you use the latest released FW for this device? > > The HCAs all use Mellanox' latest released FW version 1.2.0. I'll have a > look at the switch later. The switch is Mellanox MT47396 based and uses FW version 1.0.0. This isn't the latest one, but I don't see anything in the release notes of the 1.0.5 firmware which is related to our problem. Regards, Marcel From Sumit.Gaur at Sun.COM Tue May 27 00:50:27 2008 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Tue, 27 May 2008 13:20:27 +0530 Subject: [Fwd: Re: [ofa-general] question regarding umad_recv] Message-ID: <483BBD43.5080807@Sun.COM> An embedded message was scrubbed... From: Sumit Gaur - Sun Microsystem Subject: Re: [ofa-general] question regarding umad_recv Date: Tue, 27 May 2008 13:14:28 +0530 Size: 2626 URL: From okir at lst.de Tue May 27 01:14:29 2008 From: okir at lst.de (Olaf Kirch) Date: Tue, 27 May 2008 10:14:29 +0200 Subject: [ofa-general] Fwd: [rds-devel] RDS: Fix a bug in RDMA signalling Message-ID: <200805271014.30175.okir@lst.de> Oops, this should have gone to ofa-general as well. Olaf ---------- Forwarded Message ---------- Subject: [rds-devel] RDS: Fix a bug in RDMA signalling Date: Tuesday 27 May 2008 From: Olaf Kirch To: rds-devel at oss.oracle.com I found an issue with RDMA completions vs. IB connection teardown yesterday - the patch seems correct to me (and passed my testing), but I would appreciate a second set of eyeballs, and some additional testing (Rick, can you spin the crload wheel with this one, please?) Vlad, is there any chance to get this into 1.3.1? I think I remember that rc2 (which is out already) was supposed to be the final release candidate for 1.3.1 - correct? The patch is also available from my tree at git://git.openfabrics.org/~okir/ofed_1_3/linux-2.6.git code-drop-20080527 Olaf ------------------- From: Olaf Kirch Subject: RDS: Fix a bug in RDMA signalling Code inspection revealed a problem in the way we signal RDMA completions to user space when a connection goes down. The send CQ handler calls rds_ib_send_unmap_rm, indicating success no matter what the WC status code says. This means we may signal success for RDMAs that get flushed out with WR_FLUSH_ERR. This patch fixes the problem by passing the wc.status value to rds_ib_send_unmap_rm for inspection. While I was at it, I moved the code that translates WC status codes to RDMA notifications from the send CQ handler to rds_ib_send_unmap_rm where it belongs.
Signed-off-by: Olaf Kirch --- net/rds/ib_send.c | 39 ++++++++++++++++++++------------------- net/rds/rdma.h | 2 +- net/rds/send.c | 3 ++- 3 files changed, 23 insertions(+), 21 deletions(-) Index: ofa_kernel-1.3.1/net/rds/ib_send.c =================================================================== --- ofa_kernel-1.3.1.orig/net/rds/ib_send.c +++ ofa_kernel-1.3.1/net/rds/ib_send.c @@ -41,7 +41,7 @@ void rds_ib_send_unmap_rm(struct rds_ib_connection *ic, struct rds_ib_send_work *send, - int success) + int wc_status) { struct rds_message *rm = send->s_rm; @@ -52,7 +52,9 @@ void rds_ib_send_unmap_rm(struct rds_ib_ DMA_TO_DEVICE); /* raise rdma completion hwm */ - if (rm->m_rdma_op && success) { + if (rm->m_rdma_op && wc_status != IB_WC_WR_FLUSH_ERR) { + int notify_status; + /* If the user asked for a completion notification on this * message, we can implement three different semantics: * 1. Notify when we received the ACK on the RDS message @@ -68,7 +70,20 @@ void rds_ib_send_unmap_rm(struct rds_ib_ * don't call rds_rdma_send_complete at all, and fall back to the notify * handling in the ACK processing code. */ - rds_rdma_send_complete(rm); + switch (wc_status) { + case IB_WC_SUCCESS: + notify_status = RDS_RDMA_SUCCESS; + break; + + case IB_WC_REM_ACCESS_ERR: + notify_status = RDS_RDMA_REMOTE_ERROR; + break; + + default: + notify_status = RDS_RDMA_OTHER_ERROR; + break; + } + rds_rdma_send_complete(rm, notify_status); if (rm->m_rdma_op->r_write) rds_stats_add(s_send_rdma_bytes, rm->m_rdma_op->r_bytes); @@ -118,7 +133,7 @@ void rds_ib_send_clear_ring(struct rds_i if (send->s_wr.opcode == 0xdead) continue; if (send->s_rm) - rds_ib_send_unmap_rm(ic, send, 0); + rds_ib_send_unmap_rm(ic, send, IB_WC_WR_FLUSH_ERR); if (send->s_op) ib_dma_unmap_sg(ic->i_cm_id->device, send->s_op->r_sg, send->s_op->r_nents, @@ -174,7 +189,7 @@ void rds_ib_send_cq_comp_handler(struct switch (send->s_wr.opcode) { case IB_WR_SEND: if (send->s_rm) - rds_ib_send_unmap_rm(ic, send, 1); + rds_ib_send_unmap_rm(ic, send, wc.status); break; case IB_WR_RDMA_WRITE: if (send->s_op) @@ -204,20 +219,6 @@ void rds_ib_send_cq_comp_handler(struct oldest = (oldest + 1) % ic->i_send_ring.w_nr; } - if (unlikely(wc.status != IB_WC_SUCCESS && send->s_op && send->s_op->r_notifier)) { - switch (wc.status) { - default: - send->s_op->r_notifier->n_status = RDS_RDMA_OTHER_ERROR; - break; - case IB_WC_REM_ACCESS_ERR: - send->s_op->r_notifier->n_status = RDS_RDMA_REMOTE_ERROR; - break; - case IB_WC_WR_FLUSH_ERR: - /* flushed out; not an error */ - break; - } - } - rds_ib_ring_free(&ic->i_send_ring, completed); if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags) Index: ofa_kernel-1.3.1/net/rds/rdma.h =================================================================== --- ofa_kernel-1.3.1.orig/net/rds/rdma.h +++ ofa_kernel-1.3.1/net/rds/rdma.h @@ -71,6 +71,6 @@ int rds_cmsg_rdma_args(struct rds_sock * int rds_cmsg_rdma_map(struct rds_sock *rs, struct rds_message *rm, struct cmsghdr *cmsg); void rds_rdma_free_op(struct rds_rdma_op *ro); -void rds_rdma_send_complete(struct rds_message *rm); +void rds_rdma_send_complete(struct rds_message *rm, int); #endif Index: ofa_kernel-1.3.1/net/rds/send.c =================================================================== --- ofa_kernel-1.3.1.orig/net/rds/send.c +++ ofa_kernel-1.3.1/net/rds/send.c @@ -361,7 +361,7 @@ int rds_send_acked_before(struct rds_con * the IB send completion on the RDMA op and the accompanying * message. 
*/ -void rds_rdma_send_complete(struct rds_message *rm) +void rds_rdma_send_complete(struct rds_message *rm, int status) { struct rds_sock *rs = NULL; struct rds_rdma_op *ro; @@ -376,6 +376,7 @@ void rds_rdma_send_complete(struct rds_m rs = rm->m_rs; sock_hold(rds_rs_to_sk(rs)); + notifier->n_status = status; spin_lock(&rs->rs_lock); list_add_tail(&notifier->n_list, &rs->rs_notify_queue); spin_unlock(&rs->rs_lock); -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax _______________________________________________ rds-devel mailing list rds-devel at oss.oracle.com http://oss.oracle.com/mailman/listinfo/rds-devel ------------------------------------------------------- -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From ogerlitz at voltaire.com Tue May 27 02:01:45 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 27 May 2008 12:01:45 +0300 Subject: [ofa-general] Update on features that are delayed In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9041CE703@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C9041CE703@mtlexch01.mtl.com> Message-ID: <483BCDF9.2080207@voltaire.com> Tziporet Koren wrote: > The following features were planned for OFED 1.4 but in the end will not be implemented: > * RDMA CM to support IPv6 Hi Sean, Can you please elaborate on what is missing in the design/implementation of the rdma-cm regarding IPv6? Or. From eli at dev.mellanox.co.il Tue May 27 02:05:48 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Tue, 27 May 2008 12:05:48 +0300 Subject: [ofa-general] [PATCH v2] IB/ipoib: copy small SKBs in CM mode Message-ID: <1211879148.13769.94.camel@mtls03> >From db74e3fc04ef41da02d65c056b78275365891b3d Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Thu, 22 May 2008 16:28:59 +0300 Subject: [PATCH] IB/ipoib: copy small SKBs in CM mode CM mode of ipoib has a large overhead in the receive flow for managing SKBs. It usually allocates an SKB with as much data as was used in the currently received SKB and moves unused fragments from the old SKB to the new one. This involves a loop over all the remaining fragments and incurs overhead on the CPU.
This patch, for small SKBs, allocates an SKB just large enough to contain the received data and copies into it the data from the received SKB. The newly allocated SKB is passed to the stack and the old SKB is reposted. When running netperf, UDP small messages, without this patch I get: UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 14.4.3.178 (14.4.3.178) port 0 AF_INET Socket Message Elapsed Messages Size Size Time Okay Errors Throughput bytes bytes secs # # 10^6bits/sec 114688 128 10.00 5142034 0 526.31 114688 10.00 1130489 115.71 With this patch I get both send and receive at ~315 Mbps. The reason for that is as follows: When using this patch, the overhead of the CPU for handling RX packets is dramatically reduced. As a result, we do not experience RNR NACK messages from the receiver which cause the connection to be closed and reopened again; when the patch is not used, the receiver cannot handle the packets fast enough so there is less time to post new buffers and hence the mentioned RNR NACKs. So what happens is that the application *thinks* it posted a certain number of packets for transmission but these packets are flushed and do not really get transmitted. Since the connection gets opened and closed many times, each time netperf gets the CPU time that otherwise would have been given to IPoIB to actually transmit the packets. This can be verified by looking at the port counters, the output of ifconfig and the output of netperf (this is for the case without the patch): tx packets ========== port counter: 1,543,996 ifconfig: 1,581,426 netperf: 5,142,034 rx packets ========== netperf 1,130,489 Signed-off-by: Eli Cohen --- Changes since V1: 1. wrapped call to skb_copy_from_linear_data() with calls to ib_dma_sync_single_for_cpu() and ib_dma_sync_single_for_device() 2. Ensure SKB_TSHOLD is not defined too large.
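A note on the two changes listed above. Change 1 is needed because the receive buffer is DMA-mapped: on non-cache-coherent platforms the CPU is not guaranteed to see the HCA's writes until the buffer is synced for CPU access, so the copy has to be bracketed by sync calls. A minimal sketch of the pattern, with "mapping" abbreviating the patch's rx_ring[wr_id].mapping[0] (this is an illustration, not a drop-in snippet):

	/* Hand the DMA'd receive buffer to the CPU, copy the payload
	 * out, then hand it back to the device so it can be reposted. */
	ib_dma_sync_single_for_cpu(priv->ca, mapping, dlen, DMA_FROM_DEVICE);
	skb_copy_from_linear_data(skb, small_skb->data, dlen);
	ib_dma_sync_single_for_device(priv->ca, mapping, dlen, DMA_FROM_DEVICE);

For change 2, since SKB_TSHOLD and IPOIB_CM_HEAD_SIZE are both compile-time constants, the same guard could arguably be expressed at build time with BUILD_BUG_ON(SKB_TSHOLD > IPOIB_CM_HEAD_SIZE) instead of the module-load check; the patch below keeps the runtime variant as posted.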
drivers/infiniband/ulp/ipoib/ipoib.h | 1 + drivers/infiniband/ulp/ipoib/ipoib_cm.c | 19 +++++++++++++++++++ drivers/infiniband/ulp/ipoib/ipoib_main.c | 10 ++++++++++ 3 files changed, 30 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index ca126fc..e39bf36 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -97,6 +97,7 @@ enum { IPOIB_MCAST_FLAG_ATTACHED = 3, MAX_SEND_CQE = 16, + SKB_TSHOLD = 256, }; #define IPOIB_OP_RECV (1ul << 31) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 97e67d3..7be0a43 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -525,6 +525,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) u64 mapping[IPOIB_CM_RX_SG]; int frags; int has_srq; + struct sk_buff *small_skb; ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", wr_id, wc->status); @@ -579,6 +580,23 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) } } + if (wc->byte_len < SKB_TSHOLD) { + int dlen = wc->byte_len; + + small_skb = dev_alloc_skb(dlen + 12); + if (small_skb) { + skb_reserve(small_skb, 12); + ib_dma_sync_single_for_cpu(priv->ca, rx_ring[wr_id].mapping[0], + dlen, DMA_FROM_DEVICE); + skb_copy_from_linear_data(skb, small_skb->data, dlen); + ib_dma_sync_single_for_device(priv->ca, rx_ring[wr_id].mapping[0], + dlen, DMA_FROM_DEVICE); + skb_put(small_skb, dlen); + skb = small_skb; + goto copied; + } + } + frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; @@ -601,6 +619,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); +copied: skb->protocol = ((struct ipoib_header *) skb->data)->proto; skb_reset_mac_header(skb); skb_pull(skb, IPOIB_ENCAP_LEN); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 2442090..ec6e7c5 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1304,6 +1304,16 @@ static int __init ipoib_init_module(void) ipoib_max_conn_qp = min(ipoib_max_conn_qp, IPOIB_CM_MAX_CONN_QP); #endif + /* + * we rely on this condition when copying small skbs and we + * pass ownership of the first fragment only. + */ + if (SKB_TSHOLD > IPOIB_CM_HEAD_SIZE) { + printk("%s: SKB_TSHOLD(%d) must not be larger than %d\n", + THIS_MODULE->name, SKB_TSHOLD, IPOIB_CM_HEAD_SIZE); + return -EINVAL; + } + ret = ipoib_register_debugfs(); if (ret) return ret; -- 1.5.5.1 From Sumit.Gaur at Sun.COM Tue May 27 02:04:43 2008 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Tue, 27 May 2008 14:34:43 +0530 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <20071016204021.GC12364@sashak.voltaire.com> References: <20071015195140.GA12364@sashak.voltaire.com> <000001c80f65$87bea170$3c98070a@amr.corp.intel.com> <20071016204021.GC12364@sashak.voltaire.com> Message-ID: <483BCEAB.3020902@Sun.COM> Hi, Just want to confirm into which OFED build this bug fix (*support multiple umad_open_port*) has been integrated. I am working on OFED-1.2.5.5. Also, if it is not available in my current build, could I apply any available patch to get the fix?
Thanks and Regards sumit Sasha Khapyorsky wrote: > On 12:56 Mon 15 Oct , Sean Hefty wrote: > >>>Seems you don't think it is very critical, cannot say I disagree so much. >>>Hmm, let's change portid -> fd and deprecate umad_get_fd() after OFED? >> >>My vote is to retain some sort of abstraction. Once we get rid of it, it will >>be very hard to add it back in. > > > That is true, but I cannot find a scenario when using fd as umad device > handle could be insufficient. Even if we will need to create some > internally tracked per device data again (unlikely) fd can serve as an > index just as well. The whole issue is all about naming and seems minor for > me - without actual API change we can rename it once and rename again > later if it will be needed or keep things as it is - both options are > fine. And since there is concern let's do nothing and stay with "as is". > > >>My concern is that multi-thread receive handling isn't easily supported when >>RMPP is involved, and having umad_recv take an abstract 'id' gives us some >>flexibility that could come in useful someday. >> >>E.g. something like: >>umad_recv() -> returns too small, gives necessary size + id specific to a mad >>umad_recv(mad id, new size ...) -> returns reassembled rmpp mad > > > With this second umad_recv() we also will need to specify which umad > device to use, I think API change will be required, right? > (the option to encode both fd and mad id as first umad_recv() parameter > looks messy for me.) > > Sasha From sashak at voltaire.com Tue May 27 03:03:17 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 27 May 2008 13:03:17 +0300 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080523121815.2c39e65a.weiny2@llnl.gov> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080522154702.430cdef7.weiny2@llnl.gov> <1211540855.13185.71.camel@hrosenstock-ws.xsigo.com> <20080523121815.2c39e65a.weiny2@llnl.gov> Message-ID: <20080527100317.GE12014@sashak.voltaire.com> Hi Ira, On 12:18 Fri 23 May , Ira Weiny wrote: > > > When you mention this I start to think about the secure API which Tim > submitted a few months ago and was not accepted. Don't think I missed that. I remember that there were some other changes, not the secure API yet.
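To make the umad_recv() discussion quoted above concrete, here is a minimal sketch of the current buffer-sizing pattern, written against libibumad as documented in umad_recv(3); it is an illustration only, the 256-byte initial payload size is arbitrary, and error handling is elided:

	/* Open the first port on the first CA and receive one MAD,
	 * growing the buffer if the library reports it was too small
	 * (per umad_recv(3) the required size is left in "length"). */
	int length = 256;
	int portid, agent;
	void *buf;

	umad_init();
	portid = umad_open_port(NULL, 0);
	buf = malloc(umad_size() + length);
	agent = umad_recv(portid, buf, &length, -1 /* block forever */);
	if (agent < 0) {
		/* reallocate with the updated length and call again */
	}

The thread's proposal amounts to replacing the portid handle in this call with a plain fd, or with a per-MAD id for RMPP reassembly.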
Sasha From vlad at lists.openfabrics.org Tue May 27 03:11:55 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 27 May 2008 03:11:55 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20080527-0200 daily build status Message-ID: <20080527101155.D4044E60D57@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Passed on ppc64 with linux-2.6.19 Failed: From sashak at voltaire.com Tue May 27 03:33:41 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 27 May 2008 13:33:41 +0300 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> Message-ID: 
<20080527103341.GF12014@sashak.voltaire.com> On 05:52 Fri 23 May , Hal Rosenstock wrote: > > But can be protected by other weak access control currently and perhaps > more in the future. OpenSM console is not a great example IMO - OpenSM doesn't need to issue SA queries against itself. > New commands which require trust can utilize SMKey > without it being specified (at least for OpenSM), no ? Maybe yes, but could you be more specific? Store SMKey in read-only file on a client side? > > And what about diagnostics when other SMs are used? > > I think there's a problem here in a trusted environments given the > approach taken as I've stated in the past but seems to have been > forgotten. The more trust the less the current diag strategy fits. > > Are you also going to be proposing exposing MKeys too once MKey > management is supported by OpenSM/other SMs ? I don't have any M_Key manager implementation details, but hope it will not needed. I'm not proposing to expose SM_Key, just added such option where this key could be specified. So: 1) this is *optional*, 2) there is no suggestions about how the right value should be determined. Sasha From hrosenstock at xsigo.com Tue May 27 04:14:55 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 27 May 2008 04:14:55 -0700 Subject: [ofa-general] [PATCH] management: Support separate SA and SM keys In-Reply-To: <4838F9ED.9090304@voltaire.com> References: <1211550432.13185.121.camel@hrosenstock-ws.xsigo.com> <4838F9ED.9090304@voltaire.com> Message-ID: <1211886895.13185.207.camel@hrosenstock-ws.xsigo.com> On Sun, 2008-05-25 at 08:32 +0300, Or Gerlitz wrote: > Hal Rosenstock wrote: > > management: Support separate SA and SM keys as clarified in IBA 1.2.1 > Does some host side patch is needed to inter-operate with this change? Nope. -- Hal
> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Tue May 27 04:29:12 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 27 May 2008 04:29:12 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> Message-ID: <1211887752.13185.212.camel@hrosenstock-ws.xsigo.com> On Fri, 2008-05-23 at 05:52 -0700, Hal Rosenstock wrote: > > Following your logic we will need to disable root passwords > > typing too. > > That's taking it too far. Root passwords are at least hidden when > typing. At least hide the key typing from plain sight when typing like su does. -- Hal From hrosenstock at xsigo.com Tue May 27 04:33:56 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 27 May 2008 04:33:56 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080527103341.GF12014@sashak.voltaire.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> <20080527103341.GF12014@sashak.voltaire.com> Message-ID: <1211888036.13185.219.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-27 at 13:33 +0300, Sasha Khapyorsky wrote: > On 05:52 Fri 23 May , Hal Rosenstock wrote: > > > > But can be protected by other weak access control currently and perhaps > > more in the future. > > OpenSM console is not a great example IMO - OpenSM doesn't need to issue > SA queries against itself. There's no reason it couldn't though rather than go after internal data structures. > > New commands which require trust can utilize SMKey > > without it being specified (at least for OpenSM), no ? > > Maybe yes, but could you be more specific? Store SMKey in read-only > file on a client side? Treat smkey as su treats password rather than a command line parameter is another alternative. > > > And what about diagnostics when other SMs are used? > > > > I think there's a problem here in a trusted environments given the > > approach taken as I've stated in the past but seems to have been > > forgotten. The more trust the less the current diag strategy fits. > > > > Are you also going to be proposing exposing MKeys too once MKey > > management is supported by OpenSM/other SMs ? > > I don't have any M_Key manager implementation details, There have been no details as yet but one can readily extrapolate the same issue from the spec. (The issue actually goes further for MKey IMO). > but hope it will not needed. I believe it is on the OFED 1.4 list. 
> I'm not proposing to expose SM_Key, just added such option where this > key could be specified. How is that not exposing it ? -- Hal > So: 1) this is *optional*, 2) there is no > suggestions about how the right value should be determined. > > Sasha From nico.mittenzwey at s2001.tu-chemnitz.de Tue May 27 05:00:34 2008 From: nico.mittenzwey at s2001.tu-chemnitz.de (Nico Mittenzwey) Date: Tue, 27 May 2008 14:00:34 +0200 Subject: [ofa-general] Retry count error with ipath on OFED-1.3 In-Reply-To: <829ded920805160242i57481603t3c65c44ceafd640@mail.gmail.com> References: <829ded920805160242i57481603t3c65c44ceafd640@mail.gmail.com> Message-ID: <483BF7E2.2000402@s2001.tu-chemnitz.de> Hello Keshetti, thanks for your response. After more tests it turned out to be a hardware error of the Infinipath HCA. regards Nico From hrosenstock at xsigo.com Tue May 27 06:02:34 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 27 May 2008 06:02:34 -0700 Subject: [ofa-general] [PATCH] mlx4_core: enable changing default max HCA resource limits at run time -- reposting In-Reply-To: <200804281438.28417.jackm@dev.mellanox.co.il> References: <200804281438.28417.jackm@dev.mellanox.co.il> Message-ID: <1211893354.13185.229.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-04-28 at 14:38 +0300, Jack Morgenstein wrote: > mlx4-core: enable changing default max HCA resource limits. > > Enable module-initialization time modification of default HCA > maximum resource limits via module parameters, as is done in mthca. > > Specify the log of the parameter value, rather than the value itself > to avoid the hidden side-effect of rounding up values to next power-of-2. > > Signed-off-by: Jack Morgenstein Sorry if I'm rehashing this but this thread appears to have died out and I'm not sure about it's status: Where do we stand in terms of getting the additional mlx4 module parameters incorporated ? Thanks. -- Hal From tziporet at dev.mellanox.co.il Tue May 27 06:17:57 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 27 May 2008 16:17:57 +0300 Subject: [ofa-general] Fwd: [rds-devel] RDS: Fix a bug in RDMA signalling In-Reply-To: <200805271014.30175.okir@lst.de> References: <200805271014.30175.okir@lst.de> Message-ID: <483C0A05.5030602@mellanox.co.il> Olaf Kirch wrote: > Oops, this should have gone to ofa-general as well. > > Olaf > ---------- Forwarded Message ---------- > > Subject: [rds-devel] RDS: Fix a bug in RDMA signalling > Date: Tuesday 27 May 2008 > From: Olaf Kirch > To: rds-devel at oss.oracle.com > > > > Vlad, is there any chance to get this into 1.3.1? I think I remember > that rc2 (which is out already) was supposed to be the final release > candidate for 1.3.1 - correct? > > > I just asked to delay 1.3.1 to Monday so we have time to take this patch too Tziporet From vlad at dev.mellanox.co.il Tue May 27 06:31:44 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 27 May 2008 16:31:44 +0300 Subject: [ofa-general] Fwd: [rds-devel] RDS: Fix a bug in RDMA signalling In-Reply-To: <200805271014.30175.okir@lst.de> References: <200805271014.30175.okir@lst.de> Message-ID: <483C0D40.5060902@dev.mellanox.co.il> Olaf Kirch wrote: > Oops, this should have gone to ofa-general as well. > > Olaf > ---------- Forwarded Message ---------- > > Subject: [rds-devel] RDS: Fix a bug in RDMA signalling > Date: Tuesday 27 May 2008 > From: Olaf Kirch > To: rds-devel at oss.oracle.com > > I found an issue with RDMA completions vs. 
IB connection teardown > yesterday - the patch seems correct to me (and passed my testing), > but I would appreciate a second set of eyeballs, and some additional > testing (Rick, can you spin the crload wheel with this one, please?) > > Vlad, is there any chance to get this into 1.3.1? I think I remember > that rc2 (which is out already) was supposed to be the final release > candidate for 1.3.1 - correct? > > The patch is also available from my tree at > git://git.openfabrics.org/~okir/ofed_1_3/linux-2.6.git code-drop-20080527 > > Olaf > ------------------- > From: Olaf Kirch > Subject: RDS: Fix a bug in RDMA signalling > > Code inspection revealed a problem in the way we signal RDMA > completions to user space when a connection goes down. > > The send CQ handler calls rds_ib_send_unmap_rm, indicating success > no matter what the WC status code says. This means we may > signal success for RDMAs that get flushed out with WR_FLUSH_ERR. > > This patch fixes the problem by passing the wc.status value to > rds_ib_send_unmap_rm for inspection. > > While I was at it, I moved the code that translates WC status codes > to RDMA notifications from the send CQ handler to rds_ib_send_unmap_rm > where it belongs. > > Signed-off-by: Olaf Kirch > --- > net/rds/ib_send.c | 39 ++++++++++++++++++++------------------- > net/rds/rdma.h | 2 +- > net/rds/send.c | 3 ++- > 3 files changed, 23 insertions(+), 21 deletions(-) > > Index: ofa_kernel-1.3.1/net/rds/ib_send.c > =================================================================== > --- ofa_kernel-1.3.1.orig/net/rds/ib_send.c > +++ ofa_kernel-1.3.1/net/rds/ib_send.c > @@ -41,7 +41,7 @@ > > void rds_ib_send_unmap_rm(struct rds_ib_connection *ic, > struct rds_ib_send_work *send, > - int success) > + int wc_status) > { > struct rds_message *rm = send->s_rm; > > @@ -52,7 +52,9 @@ void rds_ib_send_unmap_rm(struct rds_ib_ > DMA_TO_DEVICE); > > /* raise rdma completion hwm */ > - if (rm->m_rdma_op && success) { > + if (rm->m_rdma_op && wc_status != IB_WC_WR_FLUSH_ERR) { > + int notify_status; > + > /* If the user asked for a completion notification on this > * message, we can implement three different semantics: > * 1. Notify when we received the ACK on the RDS message > @@ -68,7 +70,20 @@ void rds_ib_send_unmap_rm(struct rds_ib_ > * don't call rds_rdma_send_complete at all, and fall back to the notify > * handling in the ACK processing code. 
> */ > - rds_rdma_send_complete(rm); > + switch (wc_status) { > + case IB_WC_SUCCESS: > + notify_status = RDS_RDMA_SUCCESS; > + break; > + > + case IB_WC_REM_ACCESS_ERR: > + notify_status = RDS_RDMA_REMOTE_ERROR; > + break; > + > + default: > + notify_status = RDS_RDMA_OTHER_ERROR; > + break; > + } > + rds_rdma_send_complete(rm, notify_status); > > if (rm->m_rdma_op->r_write) > rds_stats_add(s_send_rdma_bytes, rm->m_rdma_op->r_bytes); > @@ -118,7 +133,7 @@ void rds_ib_send_clear_ring(struct rds_i > if (send->s_wr.opcode == 0xdead) > continue; > if (send->s_rm) > - rds_ib_send_unmap_rm(ic, send, 0); > + rds_ib_send_unmap_rm(ic, send, IB_WC_WR_FLUSH_ERR); > if (send->s_op) > ib_dma_unmap_sg(ic->i_cm_id->device, > send->s_op->r_sg, send->s_op->r_nents, > @@ -174,7 +189,7 @@ void rds_ib_send_cq_comp_handler(struct > switch (send->s_wr.opcode) { > case IB_WR_SEND: > if (send->s_rm) > - rds_ib_send_unmap_rm(ic, send, 1); > + rds_ib_send_unmap_rm(ic, send, wc.status); > break; > case IB_WR_RDMA_WRITE: > if (send->s_op) > @@ -204,20 +219,6 @@ void rds_ib_send_cq_comp_handler(struct > oldest = (oldest + 1) % ic->i_send_ring.w_nr; > } > > - if (unlikely(wc.status != IB_WC_SUCCESS && send->s_op && send->s_op->r_notifier)) { > - switch (wc.status) { > - default: > - send->s_op->r_notifier->n_status = RDS_RDMA_OTHER_ERROR; > - break; > - case IB_WC_REM_ACCESS_ERR: > - send->s_op->r_notifier->n_status = RDS_RDMA_REMOTE_ERROR; > - break; > - case IB_WC_WR_FLUSH_ERR: > - /* flushed out; not an error */ > - break; > - } > - } > - > rds_ib_ring_free(&ic->i_send_ring, completed); > > if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags) > Index: ofa_kernel-1.3.1/net/rds/rdma.h > =================================================================== > --- ofa_kernel-1.3.1.orig/net/rds/rdma.h > +++ ofa_kernel-1.3.1/net/rds/rdma.h > @@ -71,6 +71,6 @@ int rds_cmsg_rdma_args(struct rds_sock * > int rds_cmsg_rdma_map(struct rds_sock *rs, struct rds_message *rm, > struct cmsghdr *cmsg); > void rds_rdma_free_op(struct rds_rdma_op *ro); > -void rds_rdma_send_complete(struct rds_message *rm); > +void rds_rdma_send_complete(struct rds_message *rm, int); > > #endif > Index: ofa_kernel-1.3.1/net/rds/send.c > =================================================================== > --- ofa_kernel-1.3.1.orig/net/rds/send.c > +++ ofa_kernel-1.3.1/net/rds/send.c > @@ -361,7 +361,7 @@ int rds_send_acked_before(struct rds_con > * the IB send completion on the RDMA op and the accompanying > * message. > */ > -void rds_rdma_send_complete(struct rds_message *rm) > +void rds_rdma_send_complete(struct rds_message *rm, int status) > { > struct rds_sock *rs = NULL; > struct rds_rdma_op *ro; > @@ -376,6 +376,7 @@ void rds_rdma_send_complete(struct rds_m > rs = rm->m_rs; > sock_hold(rds_rs_to_sk(rs)); > > + notifier->n_status = status; > spin_lock(&rs->rs_lock); > list_add_tail(&notifier->n_list, &rs->rs_notify_queue); > spin_unlock(&rs->rs_lock); > Applied to OFED-1.3.1 kernel git tree.
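For context on what the n_status set in rds_rdma_send_complete() eventually feeds: RDS reports RDMA completion status back to user space as a control message on the socket. A minimal receive-side sketch, assuming the RDS userspace definitions (SOL_RDS, RDS_CMSG_RDMA_STATUS and struct rds_rdma_notify from the RDS header; these names are worth checking against the installed rds.h):

	/* Read one RDMA completion notification off an RDS socket.
	 * Assumes <sys/socket.h>, <string.h> and the RDS header. */
	static void drain_rdma_notify(int rds_sock)
	{
		struct rds_rdma_notify notify;
		char cbuf[CMSG_SPACE(sizeof(notify))];
		struct msghdr msg = { .msg_control = cbuf,
				      .msg_controllen = sizeof(cbuf) };
		struct cmsghdr *cmsg;

		if (recvmsg(rds_sock, &msg, 0) < 0)
			return;
		for (cmsg = CMSG_FIRSTHDR(&msg); cmsg;
		     cmsg = CMSG_NXTHDR(&msg, cmsg)) {
			if (cmsg->cmsg_level == SOL_RDS &&
			    cmsg->cmsg_type == RDS_CMSG_RDMA_STATUS) {
				memcpy(&notify, CMSG_DATA(cmsg), sizeof(notify));
				/* notify.user_token names the RDMA op;
				 * notify.status is RDS_RDMA_SUCCESS,
				 * RDS_RDMA_REMOTE_ERROR or RDS_RDMA_OTHER_ERROR */
			}
		}
	}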
Regards, Vladimir From monis at Voltaire.COM Tue May 27 07:38:13 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 27 May 2008 17:38:13 +0300 Subject: [ofa-general] Re: [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: <48357C09.1040302@Voltaire.COM> References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> <48357C09.1040302@Voltaire.COM> Message-ID: <483C1CD5.9040904@Voltaire.COM> Roland, I sent a proposal that tries to answer yours and Hal's comments regarding the second patch. I would appreciate it if you take a look at it and let me know if you think I can go on to produce a "decent" patch. I just want to get more reviews before I make another step. thanks MoniS Moni Shoua wrote: > Hal, Roland > Thanks for the comments. The patch below tries to address the issues that were > raised in its previous form. Please note that I'm only asking for opinion for now. > If the idea is acceptable then I will recreate a more elegant patch with the required > fixes if any and with respect to previous comments (such as replacing 0,1 and 2 with > textual names). > > The idea in a few words is to flush only paths but keeping address handles in ipoib_neigh. > This will trigger a new path lookup when an ARP probe arrives and eventually an address > handle renewal. In the meantime, the old address handle is kept and can be used. In most > cases this address handle is a valid address handle, and when it is not, the situation > is not worse than before. > My tests show that this patch completes the improvement that was achieved with patch #1 > to zero packet loss (tested with ping flood) when an SM change event occurs. > > > thanks > > MoniS > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h > index ca126fc..8ef6573 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib.h > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h > @@ -276,10 +276,11 @@ struct ipoib_dev_priv { > > struct delayed_work pkey_poll_task; > struct delayed_work mcast_task; > - struct work_struct flush_task; > + struct work_struct flush_task0; > + struct work_struct flush_task1; > + struct work_struct flush_task2; > struct work_struct restart_task; > struct delayed_work ah_reap_task; > - struct work_struct pkey_event_task; > > struct ib_device *ca; > u8 port; > @@ -423,11 +424,14 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, > struct ipoib_ah *address, u32 qpn); > void ipoib_reap_ah(struct work_struct *work); > > +void ipoib_flush_paths_only(struct net_device *dev); > void ipoib_flush_paths(struct net_device *dev); > struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); > > int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); > -void ipoib_ib_dev_flush(struct work_struct *work); > +void ipoib_ib_dev_flush0(struct work_struct *work); > +void ipoib_ib_dev_flush1(struct work_struct *work); > +void ipoib_ib_dev_flush2(struct work_struct *work); > void ipoib_pkey_event(struct work_struct *work); > void ipoib_ib_dev_cleanup(struct net_device *dev); > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > index f429bce..5a6bbe8 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > @@ -898,7 +898,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) > return 0; > } > > -static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > +static void
__ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int level) > { > struct ipoib_dev_priv *cpriv; > struct net_device *dev = priv->dev; > @@ -911,7 +911,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > * the parent is down. > */ > list_for_each_entry(cpriv, &priv->child_intfs, list) > - __ipoib_ib_dev_flush(cpriv, pkey_event); > + __ipoib_ib_dev_flush(cpriv, level); > > mutex_unlock(&priv->vlan_mutex); > > @@ -925,7 +925,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > return; > } > > - if (pkey_event) { > + if (level == 2) { > if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { > clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > ipoib_ib_dev_down(dev, 0); > @@ -943,11 +943,13 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > priv->pkey_index = new_index; > } > > - ipoib_dbg(priv, "flushing\n"); > - > - ipoib_ib_dev_down(dev, 0); > + ipoib_flush_paths_only(dev); > + ipoib_mcast_dev_flush(dev); > + > + if (level >= 1) > + ipoib_ib_dev_down(dev, 0); > > - if (pkey_event) { > + if (level >= 2) { > ipoib_ib_dev_stop(dev, 0); > ipoib_ib_dev_open(dev); > } > @@ -957,29 +959,36 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > * we get here, don't bring it back up if it's not configured up > */ > if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) { > - ipoib_ib_dev_up(dev); > + if (level >= 1) > + ipoib_ib_dev_up(dev); > ipoib_mcast_restart_task(&priv->restart_task); > } > } > > -void ipoib_ib_dev_flush(struct work_struct *work) > +void ipoib_ib_dev_flush0(struct work_struct *work) > { > struct ipoib_dev_priv *priv = > - container_of(work, struct ipoib_dev_priv, flush_task); > + container_of(work, struct ipoib_dev_priv, flush_task0); > > - ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); > __ipoib_ib_dev_flush(priv, 0); > } > > -void ipoib_pkey_event(struct work_struct *work) > +void ipoib_ib_dev_flush1(struct work_struct *work) > { > struct ipoib_dev_priv *priv = > - container_of(work, struct ipoib_dev_priv, pkey_event_task); > + container_of(work, struct ipoib_dev_priv, flush_task1); > > - ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name); > __ipoib_ib_dev_flush(priv, 1); > } > > +void ipoib_ib_dev_flush2(struct work_struct *work) > +{ > + struct ipoib_dev_priv *priv = > + container_of(work, struct ipoib_dev_priv, flush_task2); > + > + __ipoib_ib_dev_flush(priv, 2); > +} > + > void ipoib_ib_dev_cleanup(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 2442090..c41798d 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -259,6 +259,21 @@ static int __path_add(struct net_device *dev, struct ipoib_path *path) > return 0; > } > > +static void path_free_only(struct net_device *dev, struct ipoib_path *path) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + struct ipoib_neigh *neigh, *tn; > + struct sk_buff *skb; > + unsigned long flags; > + > + while ((skb = __skb_dequeue(&path->queue))) > + dev_kfree_skb_irq(skb); > + > + if (path->ah) > + ipoib_put_ah(path->ah); > + > + kfree(path); > +} > static void path_free(struct net_device *dev, struct ipoib_path *path) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > @@ -350,6 +365,34 @@ void ipoib_path_iter_read(struct ipoib_path_iter *iter, > > #endif /* 
CONFIG_INFINIBAND_IPOIB_DEBUG */ > > +void ipoib_flush_paths_only(struct net_device *dev) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + struct ipoib_path *path, *tp; > + LIST_HEAD(remove_list); > + > + spin_lock_irq(&priv->tx_lock); > + spin_lock(&priv->lock); > + > + list_splice_init(&priv->path_list, &remove_list); > + > + list_for_each_entry(path, &remove_list, list) > + rb_erase(&path->rb_node, &priv->path_tree); > + > + list_for_each_entry_safe(path, tp, &remove_list, list) { > + if (path->query) > + ib_sa_cancel_query(path->query_id, path->query); > + spin_unlock(&priv->lock); > + spin_unlock_irq(&priv->tx_lock); > + wait_for_completion(&path->done); > + path_free_only(dev, path); > + spin_lock_irq(&priv->tx_lock); > + spin_lock(&priv->lock); > + } > + spin_unlock(&priv->lock); > + spin_unlock_irq(&priv->tx_lock); > +} > + > void ipoib_flush_paths(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > @@ -421,6 +464,8 @@ static void path_rec_completion(int status, > __skb_queue_tail(&skqueue, skb); > > list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { > + if (neigh->ah) > + ipoib_put_ah(neigh->ah); > kref_get(&path->ah->ref); > neigh->ah = path->ah; > memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw, > @@ -989,9 +1034,10 @@ static void ipoib_setup(struct net_device *dev) > INIT_LIST_HEAD(&priv->multicast_list); > > INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); > - INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); > INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); > - INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); > + INIT_WORK(&priv->flush_task0, ipoib_ib_dev_flush0); > + INIT_WORK(&priv->flush_task1, ipoib_ib_dev_flush1); > + INIT_WORK(&priv->flush_task2, ipoib_ib_dev_flush2); > INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); > INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah); > } > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > index 8766d29..80c0409 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > @@ -289,15 +289,16 @@ void ipoib_event(struct ib_event_handler *handler, > if (record->element.port_num != priv->port) > return; > > - if (record->event == IB_EVENT_PORT_ERR || > - record->event == IB_EVENT_PORT_ACTIVE || > - record->event == IB_EVENT_LID_CHANGE || > - record->event == IB_EVENT_SM_CHANGE || > - record->event == IB_EVENT_CLIENT_REREGISTER) { > - ipoib_dbg(priv, "Port state change event\n"); > - queue_work(ipoib_workqueue, &priv->flush_task); > + ipoib_dbg(priv, "Event %d on device %s port %d\n",record->event, > + record->device->name, record->element.port_num); > + if ( record->event == IB_EVENT_SM_CHANGE || > + record->event == IB_EVENT_CLIENT_REREGISTER) { > + queue_work(ipoib_workqueue, &priv->flush_task0); > + } else if (record->event == IB_EVENT_PORT_ERR || > + record->event == IB_EVENT_PORT_ACTIVE || > + record->event == IB_EVENT_LID_CHANGE) { > + queue_work(ipoib_workqueue, &priv->flush_task1); > } else if (record->event == IB_EVENT_PKEY_CHANGE) { > - ipoib_dbg(priv, "P_Key change event on port:%d\n", priv->port); > - queue_work(ipoib_workqueue, &priv->pkey_event_task); > + queue_work(ipoib_workqueue, &priv->flush_task2); > } > } > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit 
http://openib.org/mailman/listinfo/openib-general > From sean.hefty at intel.com Tue May 27 08:07:00 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 27 May 2008 08:07:00 -0700 Subject: [ofa-general] Update on features that are delayed In-Reply-To: <483BCDF9.2080207@voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C9041CE703@mtlexch01.mtl.com> <483BCDF9.2080207@voltaire.com> Message-ID: <000001c8c00b$50df8be0$ebc8180a@amr.corp.intel.com> >Can you please elaborate what is missing in the design/implementation of >the rdma-cm regarding IPv6? The main item that's missing is code in ib_addr to convert the IPv6 address to an IB GID. Once that's available, there may be a couple of code paths where IPv6 checks are needed, but I can't think of anything that would be that hard. - Sean From taylor at hpc.ufl.edu Tue May 27 08:15:14 2008 From: taylor at hpc.ufl.edu (Charles Taylor) Date: Tue, 27 May 2008 11:15:14 -0400 Subject: [ofa-general] OpenSM? In-Reply-To: <48370AEE.7080507@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> Message-ID: <5E22F547-0F44-4E56-A72C-DE9FC83EE697@hpc.ufl.edu> We have a 400 node IB cluster. We are running an embedded SM in failover mode on our TS270/Cisco7008 core switches. Lately we have been seeing problems with LID assignment when rebooting nodes (see log messages below). LID assignment is also taking far too long: ports take on the order of minutes to transition to "ACTIVE". This seems like a bug to us, and we are considering switching to OpenSM on a host. I'm wondering about others' experience running OpenSM for medium to large (Fat Tree) clusters, and what resources (memory/cpu) we should plan on for the host node.
Thanks, Charlie Taylor UF HPC Center May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ******************** May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: Rediscover the subnet May 27 14:14:13 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM OUT_OF_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59 May 27 14:14:13 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:256]: An existing IB node GUID 00:02:c9:02:00:21:4b:59 LID 194 was removed May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM DELETE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59 May 27 14:14:14 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]: Topology changed May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by discovering removed ports May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ******************** May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: Rediscover the subnet May 27 14:16:28 topspin-270sc ib_sm.x[812]: [ib_sm_discovery.c:1009]: no routing required for port guid 00:02:c9:02:00:21:4b:59, lid 194 May 27 14:16:30 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]: Topology changed May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by discovering new ports May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by multicast membership change May 27 14:16:30 topspin-270sc ib_sm.x[812]: [ib_sm_assign.c:588]: Force port to go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1 May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:562]: Program port state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor node=00:02:c9:02:00:21:4b:58, port= 1, current state 2 May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:733]: Failed to negotiate MTU, op_vl for node=00:02:c9:02:00:21:4b:58, port= 1, mad status 0x1c May 27 14:18:42 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM IN_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59 May 27 14:18:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:144]: A new IB node 00:02:c9:02:00:21:4b:59 was discovered and assigned LID 0 May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ******************** May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: Rediscover the subnet May 27 14:18:46 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change May 27 14:18:46 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by previous GET/SET operation failures May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:545]: Reassigning LID, node - GUID=00:02:c9:02:00:21:4b:58, port=1, new LID=411, curr LID=0 May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:588]: Force port to go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1 May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:635]: Clean up SA resources for port forced down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1 May 27 14:18:47 topspin-270sc ib_sm.x[803]: [ib_sm_assign.c:667]: cleaning DB for guid 00:02:c9:02:00:21:4b:59, lid 194 May 27 14:18:47 topspin-270sc ib_sm.x[803]: [ib_sm_routing.c:2936]: _ib_smAllocSubnet: initRate= 4 
May 27 14:18:47 topspin-270sc last message repeated 23 times May 27 14:18:47 topspin-270sc ib_sm.x[803]: [INFO]: Different capacity links detected in the network May 27 14:21:01 topspin-270sc ib_sm.x[820]: [ib_sm_bringup.c:516]: Active port(s) now in INIT state node=00:02:c9:02:00:21:4b:58, port=16, state=2, neighbor node=00:02:c9:02:00:21:4b:58, port=1, state=2 May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ******************** May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: Rediscover the subnet May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:525]: IB node 00:06:6a:00:d9:00:04:5d port 16 is INIT state May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by some ports in INIT state May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by previous GET/SET operation failures May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_routing.c:2936]: _ib_smAllocSubnet: initRate= 4 May 27 14:21:05 topspin-270sc last message repeated 23 times May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Different capacity links detected in the network May 27 14:23:19 topspin-270sc ib_sm.x[817]: [ib_sm_bringup.c:562]: Program port state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor node=00:02:c9:02:00:21:4b:58, port= 1, current state 2 May 27 14:23:24 topspin-270sc ib_sm.x[823]: [INFO]: Generate SM CREATE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59 May 27 14:23:24 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep May 27 14:23:24 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ******************** May 27 14:23:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change May 27 14:23:26 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by multicast membership change May 27 14:23:33 topspin-270sc ib_sm.x[826]: [INFO]: Standby SM guid 00:05:ad:00:00:02:3c:60, is no longer synchronized with Master SM May 27 14:25:39 topspin-270sc ib_sm.x[826]: [INFO]: Initialize a backup session with Standby SM guid 00:05:ad:00:00:02:3c:60 May 27 14:25:39 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep May 27 14:25:39 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ******************** May 27 14:25:39 topspin-270sc ib_sm.x[826]: [INFO]: Standby SM guid 00:05:ad:00:00:02:3c:60, started synchronizing with Master SM May 27 14:25:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change May 27 14:25:42 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by multicast membership change May 27 14:25:43 topspin-270sc ib_sm.x[826]: [INFO]: Master SM DB synchronized with Standby SM guid 00:05:ad:00:00:02:3c:60 May 27 14:25:43 topspin-270sc ib_sm.x[826]: [INFO]: Master SM DB synchronized with all designated backup SMs May 27 14:28:04 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ******************** May 27 14:28:06 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change On May 23, 2008, at 2:20 PM, Steve Wise wrote: > Or Gerlitz wrote: >> Steve Wise wrote: >>> Are we sure we need to expose this to the user? 
>> I believe this is the way to go if we want to let smart ULPs >> generate new rkey/stag per mapping. Simpler ULPs could then just >> put the same value for each map associated with the same mr. >> >> Or. >> > > How should I add this to the API? > > Perhaps we just document the format of an rkey in the struct ib_mr. > Thus the app would do this to change the key before posting the > fast_reg_mr wr (coded to be explicit, not efficient): > > u8 newkey; > u32 newrkey; > > newkey = 0xaa; > newrkey = (mr->rkey & 0xffffff00) | newkey; > mr->rkey = newrkey; > wr.wr.fast_reg.mr = mr; > ... > > > Note, this assumes mr->rkey is in host byte order (I think the linux > rdma code assumes this in other places too). > > > Steve. -------------- next part -------------- An HTML attachment was scrubbed... URL: From xma at us.ibm.com Tue May 27 08:16:09 2008 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 27 May 2008 08:16:09 -0700 Subject: [ofa-general] [PATCH v2] IB/ipoib: copy small SKBs in CM mode In-Reply-To: <1211879148.13769.94.camel@mtls03> Message-ID: Hello Eli, >So what happens is that the application *thinks* it posted a certain number of > packets for transmission but these packets are flushed and do not really get > transmitted. In this case, how many TX dropped packets does ifconfig report? Should the ifconfig TX dropped count plus the TX successfully transmitted count come close to the number of packets netperf sent? Any TCP STREAM test results to share here? thanks Shirley -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom at opengridcomputing.com Tue May 27 08:33:37 2008 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 27 May 2008 10:33:37 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> Message-ID: <1211902417.4114.73.camel@trinity.ogc.int> On Mon, 2008-05-26 at 16:02 -0700, Roland Dreier wrote: > > The "invalidate local stag" part of a read is just a local sink side > > operation (ie no wire protocol change from a read). It's not like > > processing an ingress send-with-inv. It is really functionally like a > > read followed immediately by a fenced invalidate-local, but it doesn't > > stall the pipe. So the device has to remember the read is a "with inv > > local stag" and invalidate the stag after the read response is placed > > and before the WCE is reaped by the application. > > Yes, understood. My point was just that in IB, at least in theory, one > could just use an L_Key that doesn't have any remote permissions in the > scatter list of an RDMA read, while in iWARP, the STag used to place an > RDMA read response has to have remote write permission. So RDMA read > with invalidate makes sense for iWARP, because it gives a race-free way > to allow an STag to be invalidated immediately after an RDMA read > response is placed, while in IB it's simpler just to never give remote > access at all. > So I think from an NFSRDMA coding perspective it's a wash... When creating the local data sink, we need to check the transport type: if it's IB --> only local access; if it's iWARP --> local + remote access. When posting the WR, we check the fastreg capabilities bit + transport type bit:

If fastreg is true -->
    Post FastReg
    If iWARP (or with a cap bit read-with-inv-flag)
        post rdma read w/ invalidate
    else /* IB */
        post rdma read
        post invalidate
    fi
else
    ... today's logic
fi
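As a rough C sketch of the dispatch above (IB_DEVICE_MEM_MGT_EXTENSIONS is the capability bit from this series and rdma_node_get_transport() is the existing verbs helper, but post_fastreg(), post_read_inv(), post_read() and post_local_inv() are hypothetical stand-ins for whatever work-request constructors the final API provides):

    /* Sketch: per-transport read/invalidate sequence for the data sink.
     * dev_attr comes from ib_query_device(); the post_*() helpers are
     * hypothetical stand-ins, not a real API. */
    if (dev_attr.device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS) {
            post_fastreg(qp, mr);                   /* map the sink buffer */
            if (rdma_node_get_transport(qp->device->node_type) ==
                RDMA_TRANSPORT_IWARP)               /* or a read-w-inv cap bit */
                    post_read_inv(qp, mr);          /* read + local invalidate */
            else {                                  /* IB */
                    post_read(qp, mr);
                    post_local_inv(qp, mr);         /* separate invalidate WR */
            }
    } else {
            /* today's logic: conventional registration path */
    }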
I make the observation, however, that the transport type is now overloaded with a set of required verbs. For iWARP's case, this means rdma-read-w-inv, plus rdma-send-w-inv, etc... This also means that new transport types will inherit one or the other set of verbs (IB or iWARP). Tom > - R. From yevgenyp at mellanox.co.il Tue May 27 08:31:50 2008 From: yevgenyp at mellanox.co.il (Yevgeny Petrilin) Date: Tue, 27 May 2008 18:31:50 +0300 Subject: [ofa-general][PATCH v2 1/2]mlx4: Multiple completion vectors support Message-ID: <483C2966.1080606@mellanox.co.il> >From 90b97243dc9429eec9d0ed2188c3e16ab04e5432 Mon Sep 17 00:00:00 2001 From: Yevgeny Petrilin Date: Tue, 27 May 2008 17:37:01 +0300 Subject: [PATCH] mlx4: Multiple completion vectors support The driver now creates a completion EQ for every CPU. When allocating a CQ, a ULP passes the number of the completion vector it wants the CQ attached to. The number of completion vectors is reported via ib_device.num_comp_vectors. Signed-off-by: Yevgeny Petrilin --- Changes since V1: Created the patch against the latest tree drivers/infiniband/hw/mlx4/cq.c | 2 +- drivers/infiniband/hw/mlx4/main.c | 2 +- drivers/net/mlx4/cq.c | 14 ++++++++-- drivers/net/mlx4/eq.c | 47 ++++++++++++++++++++++++------------ drivers/net/mlx4/main.c | 14 ++++++---- drivers/net/mlx4/mlx4.h | 4 +- include/linux/mlx4/device.h | 4 ++- 7 files changed, 57 insertions(+), 30 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 4521319..3519f92 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -221,7 +221,7 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector } err = mlx4_cq_alloc(dev->dev, entries, &cq->buf.mtt, uar, - cq->db.dma, &cq->mcq, 0); + cq->db.dma, &cq->mcq, vector, 0); if (err) goto err_dbmap; diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index 4d61e32..60dc700 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -563,7 +563,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) ibdev->ib_dev.owner = THIS_MODULE; ibdev->ib_dev.node_type = RDMA_NODE_IB_CA; ibdev->ib_dev.phys_port_cnt = dev->caps.num_ports; - ibdev->ib_dev.num_comp_vectors = 1; + ibdev->ib_dev.num_comp_vectors = dev->caps.num_comp_vectors; ibdev->ib_dev.dma_device = &dev->pdev->dev; ibdev->ib_dev.uverbs_abi_ver = MLX4_IB_UVERBS_ABI_VERSION; diff --git a/drivers/net/mlx4/cq.c b/drivers/net/mlx4/cq.c index 95e87a2..9be895f 100644 --- a/drivers/net/mlx4/cq.c +++ b/drivers/net/mlx4/cq.c @@ -189,7 +189,7 @@ EXPORT_SYMBOL_GPL(mlx4_cq_resize); int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq, - int collapsed) + unsigned vector, int collapsed) { struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_cq_table *cq_table = &priv->cq_table; @@ -227,7 +227,15 @@ int mlx4_cq_alloc(struct
mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, cq_context->flags = cpu_to_be32(!!collapsed << 18); cq_context->logsize_usrpage = cpu_to_be32((ilog2(nent) << 24) | uar->index); - cq_context->comp_eqn = priv->eq_table.eq[MLX4_EQ_COMP].eqn; + + if (vector >= dev->caps.num_comp_vectors) { + err = -EINVAL; + goto err_radix; + } + + cq->comp_eq_idx = MLX4_EQ_COMP_CPU0 + vector; + cq_context->comp_eqn = priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + + vector].eqn; cq_context->log_page_size = mtt->page_shift - MLX4_ICM_PAGE_SHIFT; mtt_addr = mlx4_mtt_addr(dev, mtt); @@ -276,7 +284,7 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq) if (err) mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n", err, cq->cqn); - synchronize_irq(priv->eq_table.eq[MLX4_EQ_COMP].irq); + synchronize_irq(priv->eq_table.eq[cq->comp_eq_idx].irq); spin_lock_irq(&cq_table->lock); radix_tree_delete(&cq_table->tree, cq->cqn); diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c index e141a15..825e90c 100644 --- a/drivers/net/mlx4/eq.c +++ b/drivers/net/mlx4/eq.c @@ -265,7 +265,7 @@ static irqreturn_t mlx4_interrupt(int irq, void *dev_ptr) writel(priv->eq_table.clr_mask, priv->eq_table.clr_int); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors; ++i) work |= mlx4_eq_int(dev, &priv->eq_table.eq[i]); return IRQ_RETVAL(work); @@ -482,7 +482,7 @@ static void mlx4_free_irqs(struct mlx4_dev *dev) if (eq_table->have_irq) free_irq(dev->pdev->irq, dev); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors; ++i) if (eq_table->eq[i].have_irq) free_irq(eq_table->eq[i].irq, eq_table->eq + i); } @@ -553,6 +553,7 @@ void mlx4_unmap_eq_icm(struct mlx4_dev *dev) int mlx4_init_eq_table(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); + int req_eqs; int err; int i; @@ -573,11 +574,21 @@ int mlx4_init_eq_table(struct mlx4_dev *dev) priv->eq_table.clr_int = priv->clr_base + (priv->eq_table.inta_pin < 32 ? 4 : 0); - err = mlx4_create_eq(dev, dev->caps.num_cqs + MLX4_NUM_SPARE_EQE, - (dev->flags & MLX4_FLAG_MSI_X) ? MLX4_EQ_COMP : 0, - &priv->eq_table.eq[MLX4_EQ_COMP]); - if (err) - goto err_out_unmap; + dev->caps.num_comp_vectors = 0; + req_eqs = (dev->flags & MLX4_FLAG_MSI_X) ? num_online_cpus() : 1; + while (req_eqs) { + err = mlx4_create_eq( + dev, dev->caps.num_cqs + MLX4_NUM_SPARE_EQE, + (dev->flags & MLX4_FLAG_MSI_X) ? + (MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors) : 0, + &priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + + dev->caps.num_comp_vectors]); + if (err) + goto err_out_comp; + + dev->caps.num_comp_vectors++; + req_eqs--; + } err = mlx4_create_eq(dev, MLX4_NUM_ASYNC_EQE + MLX4_NUM_SPARE_EQE, (dev->flags & MLX4_FLAG_MSI_X) ? 
MLX4_EQ_ASYNC : 0, @@ -586,12 +597,16 @@ int mlx4_init_eq_table(struct mlx4_dev *dev) goto err_out_comp; if (dev->flags & MLX4_FLAG_MSI_X) { - static const char *eq_name[] = { - [MLX4_EQ_COMP] = DRV_NAME " (comp)", - [MLX4_EQ_ASYNC] = DRV_NAME " (async)" - }; + static char eq_name[MLX4_NUM_EQ][20]; + + for (i = 0; i < MLX4_EQ_COMP_CPU0 + + dev->caps.num_comp_vectors; ++i) { + if (i == 0) + snprintf(eq_name[0], 20, DRV_NAME "(async)"); + else + snprintf(eq_name[i], 20, "comp_" DRV_NAME "%d", + i - 1); - for (i = 0; i < MLX4_NUM_EQ; ++i) { err = request_irq(priv->eq_table.eq[i].irq, mlx4_msi_x_interrupt, 0, eq_name[i], priv->eq_table.eq + i); @@ -616,7 +631,7 @@ int mlx4_init_eq_table(struct mlx4_dev *dev) mlx4_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", priv->eq_table.eq[MLX4_EQ_ASYNC].eqn, err); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors; ++i) eq_set_ci(&priv->eq_table.eq[i], 1); return 0; @@ -625,9 +640,9 @@ err_out_async: mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_ASYNC]); err_out_comp: - mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_COMP]); + for (i = 0; i < dev->caps.num_comp_vectors; ++i) + mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + i]); -err_out_unmap: mlx4_unmap_clr_int(dev); mlx4_free_irqs(dev); @@ -646,7 +661,7 @@ void mlx4_cleanup_eq_table(struct mlx4_dev *dev) mlx4_free_irqs(dev); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors; ++i) mlx4_free_eq(dev, &priv->eq_table.eq[i]); mlx4_unmap_clr_int(dev); diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index a6aa49f..8634b52 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -692,22 +692,24 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); struct msix_entry entries[MLX4_NUM_EQ]; + int needed_vectors = MLX4_EQ_COMP_CPU0 + num_online_cpus(); int err; int i; if (msi_x) { - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < needed_vectors; ++i) entries[i].entry = i; - err = pci_enable_msix(dev->pdev, entries, ARRAY_SIZE(entries)); + err = pci_enable_msix(dev->pdev, entries, needed_vectors); if (err) { if (err > 0) - mlx4_info(dev, "Only %d MSI-X vectors available, " - "not using MSI-X\n", err); + mlx4_info(dev, "Only %d MSI-X vectors " + "available, need %d. 
Not using MSI-X\n", + err, needed_vectors); goto no_msi; } - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < needed_vectors; ++i) priv->eq_table.eq[i].irq = entries[i].vector; dev->flags |= MLX4_FLAG_MSI_X; @@ -715,7 +717,7 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev) } no_msi: - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < needed_vectors; ++i) priv->eq_table.eq[i].irq = dev->pdev->irq; } diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h index a4023c2..8e5fbe0 100644 --- a/drivers/net/mlx4/mlx4.h +++ b/drivers/net/mlx4/mlx4.h @@ -64,8 +64,8 @@ enum { enum { MLX4_EQ_ASYNC, - MLX4_EQ_COMP, - MLX4_NUM_EQ + MLX4_EQ_COMP_CPU0, + MLX4_NUM_EQ = MLX4_EQ_COMP_CPU0 + NR_CPUS }; enum { diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index a744383..accc1ee 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -168,6 +168,7 @@ struct mlx4_caps { int reserved_cqs; int num_eqs; int reserved_eqs; + int num_comp_vectors; int num_mpts; int num_mtt_segs; int fmr_reserved_mtts; @@ -279,6 +280,7 @@ struct mlx4_cq { int arm_sn; int cqn; + int comp_eq_idx; atomic_t refcount; struct completion free; @@ -383,7 +385,7 @@ void mlx4_free_hwq_res(struct mlx4_dev *mdev, struct mlx4_hwq_resources *wqres, int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq, - int collapsed); + unsigned vector, int collapsed); void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq); int mlx4_qp_alloc(struct mlx4_dev *dev, int sqpn, struct mlx4_qp *qp); -- 1.5.3.7
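On the consumer side, the visible effect of the patch above is that the comp_vector argument to ib_create_cq() now selects a per-CPU EQ. A ULP could spread its CQs across the reported vectors roughly like this (an illustrative sketch, not code from the series; my_comp_handler, my_event_handler and my_context are the ULP's own):

    /* Sketch: put the i-th CQ on the i-th completion vector, wrapping
     * around at ib_device.num_comp_vectors. */
    int vector = i % device->num_comp_vectors;
    struct ib_cq *cq = ib_create_cq(device, my_comp_handler, my_event_handler,
                                    my_context, nent, vector);
    if (IS_ERR(cq))
            return PTR_ERR(cq);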
From yevgenyp at mellanox.co.il Tue May 27 08:35:00 2008 From: yevgenyp at mellanox.co.il (Yevgeny Petrilin) Date: Tue, 27 May 2008 18:35:00 +0300 Subject: [ofa-general][PATCH v2 2/2] mlx4: Default value for automatic completion vector selection Message-ID: <483C2A24.90903@mellanox.co.il> >From 71b7fbb81b46f992986b2b278eea7c61d7e0372a Mon Sep 17 00:00:00 2001 From: Yevgeny Petrilin Date: Tue, 27 May 2008 18:15:36 +0300 Subject: [PATCH] mlx4: Default value for automatic completion vector selection When the vector number passed to mlx4_cq_alloc is IB_CQ_VECTOR_LEAST_ATTACHED (0xff), the driver selects the completion vector that has the fewest CQs attached to it and attaches the CQ to the chosen vector. IB_CQ_VECTOR_LEAST_ATTACHED is redefined in device.h as MLX4_ANY_VECTOR because we don't want all mlx4_core clients (Ethernet and FCoE) to include rdma/ib_verbs.h. Signed-off-by: Yevgeny Petrilin --- Changes since V1: 1. Added IB_CQ_VECTOR_LEAST_ATTACHED to rdma/ib_verbs.h 2. Set MLX4_ANY_VECTOR to IB_CQ_VECTOR_LEAST_ATTACHED drivers/net/mlx4/cq.c | 22 +++++++++++++++++++++- drivers/net/mlx4/mlx4.h | 1 + include/linux/mlx4/device.h | 4 ++++ include/rdma/ib_verbs.h | 10 +++++++++- 4 files changed, 35 insertions(+), 2 deletions(-) diff --git a/drivers/net/mlx4/cq.c b/drivers/net/mlx4/cq.c index 9be895f..7f0bdf6 100644 --- a/drivers/net/mlx4/cq.c +++ b/drivers/net/mlx4/cq.c @@ -187,6 +187,22 @@ int mlx4_cq_resize(struct mlx4_dev *dev, struct mlx4_cq *cq, } EXPORT_SYMBOL_GPL(mlx4_cq_resize); +static int mlx4_find_least_loaded_vector(struct mlx4_priv *priv) +{ + int i; + int index = 0; + int min = priv->eq_table.eq[MLX4_EQ_COMP_CPU0].load; + + for (i = 1; i < priv->dev.caps.num_comp_vectors; i++) { + if (priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + i].load < min) { + index = i; + min = priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + i].load; + } + } + + return index; +} + int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq, unsigned vector, int collapsed) @@ -228,7 +244,9 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, cq_context->flags = cpu_to_be32(!!collapsed << 18); cq_context->logsize_usrpage = cpu_to_be32((ilog2(nent) << 24) | uar->index); - if (vector >= dev->caps.num_comp_vectors) { + if (vector == MLX4_ANY_VECTOR) + vector = mlx4_find_least_loaded_vector(priv); + else if (vector >= dev->caps.num_comp_vectors) { err = -EINVAL; goto err_radix; } @@ -248,6 +266,7 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, if (err) goto err_radix; + priv->eq_table.eq[cq->comp_eq_idx].load++; cq->cons_index = 0; cq->arm_sn = 1; cq->uar = uar; @@ -285,6 +304,7 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq) mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n", err, cq->cqn); synchronize_irq(priv->eq_table.eq[cq->comp_eq_idx].irq); + priv->eq_table.eq[cq->comp_eq_idx].load--; spin_lock_irq(&cq_table->lock); radix_tree_delete(&cq_table->tree, cq->cqn); diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h index 8e5fbe0..df16f05 100644 --- a/drivers/net/mlx4/mlx4.h +++ b/drivers/net/mlx4/mlx4.h @@ -143,6 +143,7 @@ struct mlx4_eq { u16 irq; u16 have_irq; int nent; + int load; struct mlx4_buf_list *page_list; struct mlx4_mtt mtt; }; diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index accc1ee..fd93546 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -37,6 +37,8 @@ #include #include +#include + +#include + #include enum { @@ -133,6 +135,8 @@ enum { MLX4_STAT_RATE_OFFSET = 5 }; +#define MLX4_ANY_VECTOR IB_CQ_VECTOR_LEAST_ATTACHED + static inline u64 mlx4_fw_ver(u64 major, u64 minor, u64 subminor) { return (major << 32) | (minor << 16) | subminor; } diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..2462d83 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1364,6 +1364,13 @@ static inline int ib_post_recv(struct ib_qp *qp, return qp->device->post_recv(qp, recv_wr, bad_recv_wr); } +/* + * IB_CQ_VECTOR_LEAST_ATTACHED: The constant specifies that + * the cq will be attached to the least attached + * completion vector + */ +#define IB_CQ_VECTOR_LEAST_ATTACHED 0xff + /** * ib_create_cq - Creates a CQ on the specified device. * @device: The device on which to create the CQ. * @comp_handler: A user-specified callback that is invoked when a * completion event occurs on the CQ. * @event_handler: A user-specified callback that is invoked when an * asynchronous event not associated with a completion occurs on the CQ. * @cq_context: Context associated with the CQ returned to the user via * the associated completion and event handlers. * @cqe: The minimum size of the CQ.
* @comp_vector - Completion vector used to signal completion events. - * Must be >= 0 and < context->num_comp_vectors. + * Must be >= 0 and < context->num_comp_vectors + * or IB_CQ_VECTOR_LEAST_ATTACHED. * * Users can examine the cq structure to determine the actual CQ size. */ -- 1.5.3.7 From ogerlitz at voltaire.com Tue May 27 08:50:21 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 27 May 2008 18:50:21 +0300 (IDT) Subject: [ofa-general] [RFC PATCH 3/5] rdma/cma: simplify locking needed for serialization of callbacks Message-ID: The rdma-cm has some logic in place to make sure that callbacks on an ID are delivered to the consumer in serialized manner, specifically it has code to protect against the device removal racing with a callback now being delivered to the user. This patch simplifies this logic by using a mutex per ID instead of the wait queue and atomic variable. I have left the disable/enable_remove notation such that the patch would be easier to read, but if this approach is accepted, I think we want to change it to disable/enable_callback Signed-off-by: Or Gerlitz Index: linux-2.6.26-rc3/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.26-rc3.orig/drivers/infiniband/core/cma.c 2008-05-26 15:11:17.000000000 +0300 +++ linux-2.6.26-rc3/drivers/infiniband/core/cma.c 2008-05-26 15:22:11.000000000 +0300 @@ -126,8 +126,7 @@ struct rdma_id_private { struct completion comp; atomic_t refcount; - wait_queue_head_t wait_remove; - atomic_t dev_remove; + struct mutex handler_mutex; int backlog; int timeout_ms; @@ -359,7 +358,7 @@ static int cma_disable_remove(struct rdm spin_lock_irqsave(&id_priv->lock, flags); if (id_priv->state == state) { - atomic_inc(&id_priv->dev_remove); + mutex_lock(&id_priv->handler_mutex); ret = 0; } else ret = -EINVAL; @@ -369,8 +368,7 @@ static int cma_disable_remove(struct rdm static void cma_enable_remove(struct rdma_id_private *id_priv) { - if (atomic_dec_and_test(&id_priv->dev_remove)) - wake_up(&id_priv->wait_remove); + mutex_unlock(&id_priv->handler_mutex); } static int cma_has_cm_dev(struct rdma_id_private *id_priv) @@ -395,8 +393,7 @@ struct rdma_cm_id *rdma_create_id(rdma_c mutex_init(&id_priv->qp_mutex); init_completion(&id_priv->comp); atomic_set(&id_priv->refcount, 1); - init_waitqueue_head(&id_priv->wait_remove); - atomic_set(&id_priv->dev_remove, 0); + mutex_init(&id_priv->handler_mutex); INIT_LIST_HEAD(&id_priv->listen_list); INIT_LIST_HEAD(&id_priv->mc_list); get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num); @@ -1118,7 +1115,7 @@ static int cma_req_handler(struct ib_cm_ goto out; } - atomic_inc(&conn_id->dev_remove); + mutex_lock(&conn_id->handler_mutex); mutex_lock(&lock); ret = cma_acquire_dev(conn_id); mutex_unlock(&lock); @@ -1296,7 +1293,7 @@ static int iw_conn_req_handler(struct iw goto out; } conn_id = container_of(new_cm_id, struct rdma_id_private, id); - atomic_inc(&conn_id->dev_remove); + mutex_lock(&conn_id->handler_mutex); conn_id->state = CMA_CONNECT; dev = ip_dev_find(&init_net, iw_event->local_addr.sin_addr.s_addr); @@ -1588,7 +1585,7 @@ static void cma_work_handler(struct work struct rdma_id_private *id_priv = work->id; int destroy = 0; - atomic_inc(&id_priv->dev_remove); + mutex_lock(&id_priv->handler_mutex); if (!cma_comp_exch(id_priv, work->old_state, work->new_state)) goto out; @@ -1760,7 +1757,7 @@ static void addr_handler(int status, str struct rdma_cm_event event; memset(&event, 0, sizeof event); - atomic_inc(&id_priv->dev_remove); + 
mutex_lock(&id_priv->handler_mutex); /* * Grab mutex to block rdma_destroy_id() from removing the device while * @@ -2756,22 +2753,26 @@ static int cma_remove_id_dev(struct rdma { struct rdma_cm_event event; enum cma_state state; - + int ret = 0; + /* Record that we want to remove the device */ state = cma_exch(id_priv, CMA_DEVICE_REMOVAL); if (state == CMA_DESTROYING) return 0; cma_cancel_operation(id_priv, state); - wait_event(id_priv->wait_remove, !atomic_read(&id_priv->dev_remove)); + mutex_lock(&id_priv->handler_mutex); /* Check for destruction from another callback. */ if (!cma_comp(id_priv, CMA_DEVICE_REMOVAL)) - return 0; + goto out; memset(&event, 0, sizeof event); event.event = RDMA_CM_EVENT_DEVICE_REMOVAL; - return id_priv->id.event_handler(&id_priv->id, &event); + ret = id_priv->id.event_handler(&id_priv->id, &event); +out: + mutex_unlock(&id_priv->handler_mutex); + return ret; } static void cma_process_remove(struct cma_device *cma_dev) From ogerlitz at voltaire.com Tue May 27 08:51:34 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 27 May 2008 18:51:34 +0300 (IDT) Subject: [ofa-general] [RFC V3 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_NETDEV_CHANGE notification In-Reply-To: References: Message-ID: The RDMA_CM_EVENT_NETDEV_CHANGE event can be used by rdma-cm consumers that wish to have their RDMA sessions always use the same links as the IP stack does. In the current code, this does not happen when bonding is used and fail-over has happened, but the IB link used by an already existing session is operating fine. Use netevent notification for sensing that a change has happened in the IP stack, then scan the rdma-cm IDs list to see if there is an ID that is "misaligned" in that respect with the IP stack, and deliver RDMA_CM_EVENT_NETDEV_CHANGE for this ID. The user can act on the event or just ignore it. Signed-off-by: Or Gerlitz This patch should be applied on top of the previous patch ("simplify locking needed for serialization of callbacks") and the first two patches of the series I have posted, which remained unchanged at this point: [RFC v2 PATCH 1/5] net/bonding: announce fail-over for the active-backup mode http://lists.openfabrics.org/pipermail/general/2008-May/050285.html [RFC v2 PATCH 2/5] rdma/addr: keep the name of the netdevice in struct rdma_dev_addr http://lists.openfabrics.org/pipermail/general/2008-May/050286.html Main changes from v2: - took the approach of unconditionally notifying the user - use the handler_mutex of the ID to serialize with other callbacks As for the locking issues, I still have the double loop in cma_netdev_callback() being wrapped with the rdma-cm global mutex taken. The loop on devices has to be under this lock because the device removal code in cma_remove_one() removes the device from the global linked list of devices this code loops on. The loop on IDs has to be under this lock because the device removal code in cma_process_remove() removes IDs from the device ID list this code loops on.
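A sketch of what acting on the new event could look like in a consumer's handler (illustrative only, not part of the patch; ulp_schedule_reconnect() is a hypothetical ULP-specific helper). Returning non-zero would, as with other events, let the rdma-cm destroy the ID:

    static int ulp_cma_handler(struct rdma_cm_id *id, struct rdma_cm_event *ev)
    {
            switch (ev->event) {
            case RDMA_CM_EVENT_NETDEV_CHANGE:
                    /* the IP stack failed over to another link: tear down
                     * and reconnect so address resolution is redone */
                    ulp_schedule_reconnect(id->context);
                    return 0;
            default:
                    return 0;
            }
    }

The patch itself follows.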
Index: linux-2.6.26-rc3/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.26-rc3.orig/drivers/infiniband/core/cma.c 2008-05-27 13:46:48.000000000 +0300 +++ linux-2.6.26-rc3/drivers/infiniband/core/cma.c 2008-05-27 13:46:58.000000000 +0300 @@ -164,6 +164,12 @@ struct cma_work { struct rdma_cm_event event; }; +struct cma_ndev_work { + struct work_struct work; + struct rdma_id_private *id; + struct rdma_cm_event event; +}; + union cma_ip_addr { struct in6_addr ip6; struct { @@ -1601,6 +1607,26 @@ out: kfree(work); } +static void cma_ndev_work_handler(struct work_struct *_work) +{ + struct cma_ndev_work *work = container_of(_work, struct cma_ndev_work, work); + struct rdma_id_private *id_priv = work->id; + int destroy = 0; + + mutex_lock(&id_priv->handler_mutex); + + if (id_priv->id.event_handler(&id_priv->id, &work->event)) { + cma_exch(id_priv, CMA_DESTROYING); + destroy = 1; + } + + cma_enable_remove(id_priv); + cma_deref_id(id_priv); + if (destroy) + rdma_destroy_id(&id_priv->id); + kfree(work); +} + static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int timeout_ms) { struct rdma_route *route = &id_priv->id.route; @@ -2726,6 +2752,61 @@ void rdma_leave_multicast(struct rdma_cm } EXPORT_SYMBOL(rdma_leave_multicast); +static int cma_netdev_align_id(struct net_device *ndev, struct rdma_id_private *id_priv) +{ + struct rdma_dev_addr *dev_addr; + struct cma_ndev_work *work; + + dev_addr = &id_priv->id.route.addr.dev_addr; + + if (!memcmp(dev_addr->src_dev_name, ndev->name, IFNAMSIZ) && + memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) { + printk(KERN_ERR "addr change for device %s used by id %p, notifying\n", + ndev->name, &id_priv->id); + work = kzalloc(sizeof *work, GFP_ATOMIC); + if (!work) + return -ENOMEM; + INIT_WORK(&work->work, cma_ndev_work_handler); + work->id = id_priv; + work->event.event = RDMA_CM_EVENT_NETDEV_CHANGE; + atomic_inc(&id_priv->refcount); + queue_work(cma_wq, &work->work); + } +} + +static int cma_netdev_callback(struct notifier_block *self, unsigned long event, + void *ctx) +{ + struct net_device *ndev = (struct net_device *)ctx; + struct cma_device *cma_dev; + struct rdma_id_private *id_priv; + int ret = NOTIFY_DONE; + + if (dev_net(ndev) != &init_net) + return NOTIFY_DONE; + + if (event != NETDEV_BONDING_FAILOVER) + return NOTIFY_DONE; + + if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING)) + return NOTIFY_DONE; + + mutex_lock(&lock); + list_for_each_entry(cma_dev, &dev_list, list) + list_for_each_entry(id_priv, &cma_dev->id_list, list) { + ret = cma_netdev_align_id(ndev, id_priv); + if (ret) + break; + } + mutex_unlock(&lock); + + return ret; +} + +static struct notifier_block cma_nb = { + .notifier_call = cma_netdev_callback +}; + static void cma_add_one(struct ib_device *device) { struct cma_device *cma_dev; @@ -2834,6 +2915,7 @@ static int cma_init(void) ib_sa_register_client(&sa_client); rdma_addr_register_client(&addr_client); + register_netdevice_notifier(&cma_nb); ret = ib_register_client(&cma_client); if (ret) @@ -2841,6 +2923,7 @@ static int cma_init(void) return 0; err: + unregister_netdevice_notifier(&cma_nb); rdma_addr_unregister_client(&addr_client); ib_sa_unregister_client(&sa_client); destroy_workqueue(cma_wq); @@ -2850,6 +2933,7 @@ err: static void cma_cleanup(void) { ib_unregister_client(&cma_client); + unregister_netdevice_notifier(&cma_nb); rdma_addr_unregister_client(&addr_client); ib_sa_unregister_client(&sa_client); 
destroy_workqueue(cma_wq); Index: linux-2.6.26-rc3/include/rdma/rdma_cm.h =================================================================== --- linux-2.6.26-rc3.orig/include/rdma/rdma_cm.h 2008-05-27 13:44:53.000000000 +0300 +++ linux-2.6.26-rc3/include/rdma/rdma_cm.h 2008-05-27 13:46:58.000000000 +0300 @@ -53,7 +53,8 @@ enum rdma_cm_event_type { RDMA_CM_EVENT_DISCONNECTED, RDMA_CM_EVENT_DEVICE_REMOVAL, RDMA_CM_EVENT_MULTICAST_JOIN, - RDMA_CM_EVENT_MULTICAST_ERROR + RDMA_CM_EVENT_MULTICAST_ERROR, + RDMA_CM_EVENT_NETDEV_CHANGE }; enum rdma_port_space { From sean.hefty at intel.com Tue May 27 08:54:23 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 27 May 2008 08:54:23 -0700 Subject: [ofa-general] RE: [RFC PATCH 3/5] rdma/cma: simplify locking needed for serialization of callbacks In-Reply-To: References: Message-ID: <000301c8c011$ef202f20$ebc8180a@amr.corp.intel.com> >@@ -359,7 +358,7 @@ static int cma_disable_remove(struct rdm > > spin_lock_irqsave(&id_priv->lock, flags); > if (id_priv->state == state) { >- atomic_inc(&id_priv->dev_remove); >+ mutex_lock(&id_priv->handler_mutex); This just tried to acquire a mutex while holding a spinlock. - Sean From sean.hefty at intel.com Tue May 27 09:19:37 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 27 May 2008 09:19:37 -0700 Subject: [ofa-general] RE: [RFC V3 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_NETDEV_CHANGE notification In-Reply-To: References: Message-ID: <000401c8c015$75c37890$ebc8180a@amr.corp.intel.com> >+static void cma_ndev_work_handler(struct work_struct *_work) >+{ >+ struct cma_ndev_work *work = container_of(_work, struct cma_ndev_work, >work); >+ struct rdma_id_private *id_priv = work->id; >+ int destroy = 0; >+ >+ mutex_lock(&id_priv->handler_mutex); >+ >+ if (id_priv->id.event_handler(&id_priv->id, &work->event)) { How do we know that the user hasn't tried to destroy the id from another callback? We need some sort of state check here. >+ cma_exch(id_priv, CMA_DESTROYING); >+ destroy = 1; >+ } >+ >+ cma_enable_remove(id_priv); I didn't see the matching cma_disable_remove() call. >+ cma_deref_id(id_priv); >+ if (destroy) >+ rdma_destroy_id(&id_priv->id); >+ kfree(work); >+} >+ > static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int >timeout_ms) > { > struct rdma_route *route = &id_priv->id.route; >@@ -2726,6 +2752,61 @@ void rdma_leave_multicast(struct rdma_cm > } > EXPORT_SYMBOL(rdma_leave_multicast); > >+static int cma_netdev_align_id(struct net_device *ndev, struct rdma_id_private >*id_priv) >+{ nit - function name isn't clear to me. Maybe something like cma_netdev_change_handler()? Although I'm not sure that netdev change is what the user is really interested in. What they really want to know is if IP address mapping/resolution changed. netdev is hidden from the user. >+ struct rdma_dev_addr *dev_addr; >+ struct cma_ndev_work *work; >+ >+ dev_addr = &id_priv->id.route.addr.dev_addr; >+ >+ if (!memcmp(dev_addr->src_dev_name, ndev->name, IFNAMSIZ) && >+ memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) { >+ printk(KERN_ERR "addr change for device %s used by id %p, >notifying\n", >+ ndev->name, &id_priv->id); >+ work = kzalloc(sizeof *work, GFP_ATOMIC); >+ if (!work) >+ return -ENOMEM; >+ INIT_WORK(&work->work, cma_ndev_work_handler); >+ work->id = id_priv; >+ work->event.event = RDMA_CM_EVENT_NETDEV_CHANGE; Maybe call this RDMA_CM_EVENT_ADDR_CHANGE? 
>+ atomic_inc(&id_priv->refcount); >+ queue_work(cma_wq, &work->work); >+ } >+} >+ >+static int cma_netdev_callback(struct notifier_block *self, unsigned long >event, >+ void *ctx) >+{ >+ struct net_device *ndev = (struct net_device *)ctx; >+ struct cma_device *cma_dev; >+ struct rdma_id_private *id_priv; >+ int ret = NOTIFY_DONE; >+ >+ if (dev_net(ndev) != &init_net) >+ return NOTIFY_DONE; >+ >+ if (event != NETDEV_BONDING_FAILOVER) >+ return NOTIFY_DONE; >+ >+ if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING)) >+ return NOTIFY_DONE; >+ >+ mutex_lock(&lock); >+ list_for_each_entry(cma_dev, &dev_list, list) It seems like we just need to find the cma_dev that has the current mapping - Sean From ogerlitz at voltaire.com Tue May 27 09:37:09 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 27 May 2008 19:37:09 +0300 Subject: [ofa-general] Re: [RFC V3 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_NETDEV_CHANGE notification In-Reply-To: <000401c8c015$75c37890$ebc8180a@amr.corp.intel.com> References: <000401c8c015$75c37890$ebc8180a@amr.corp.intel.com> Message-ID: <483C38B5.6080706@voltaire.com> Sean Hefty wrote: >> +static void cma_ndev_work_handler(struct work_struct *_work) >> +{ >> + struct cma_ndev_work *work = container_of(_work, struct cma_ndev_work, >> work); >> + struct rdma_id_private *id_priv = work->id; >> + int destroy = 0; >> + >> + mutex_lock(&id_priv->handler_mutex); >> + >> + if (id_priv->id.event_handler(&id_priv->id, &work->event)) { > How do we know that the user hasn't tried to destroy the id from another > callback? We need some sort of state check here. Correct, will be fixed. >> + cma_exch(id_priv, CMA_DESTROYING); >> + destroy = 1; >> + } >> + >> + cma_enable_remove(id_priv); > > I didn't see the matching cma_disable_remove() call. As you can see also in the patch 3/5, places in the code which originally did --not-- call cma_disable_remove() but rather just did atomic_inc(&conn_id->dev_remove) were just replaced with mutex_lock(&id_priv->handler_mutex). This is b/c cma_disable_remove does two things: 1) it does the state validation 2) it locks the handler_mutex, so places in the code which don't need the state validation don't call it... a bit dirty. >> >> +static int cma_netdev_align_id(struct net_device *ndev, struct rdma_id_private >> *id_priv) >> +{ > > nit - function name isn't clear to me. Maybe something like > cma_netdev_change_handler()? Although I'm not sure that netdev change is what > the user is really interested in. What they really want to know is if IP > address mapping/resolution changed. netdev is hidden from the user. OK, I will see how to improve the name. > Maybe call this RDMA_CM_EVENT_ADDR_CHANGE? Let me think about it. >> +static int cma_netdev_callback(struct notifier_block *self, unsigned long >> event, >> + void *ctx) >> +{ >> + struct net_device *ndev = (struct net_device *)ctx; >> + struct cma_device *cma_dev; >> + struct rdma_id_private *id_priv; >> + int ret = NOTIFY_DONE; >> + >> + if (dev_net(ndev) != &init_net) >> + return NOTIFY_DONE; >> + >> + if (event != NETDEV_BONDING_FAILOVER) >> + return NOTIFY_DONE; >> + >> + if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING)) >> + return NOTIFY_DONE; >> + >> + mutex_lock(&lock); >> + list_for_each_entry(cma_dev, &dev_list, list) > > It seems like we just need to find the cma_dev that has the current mapping Correct. So I can take the lock, find the device in the list, increment its refcount and release the lock.
To that end I would have to save in the cma device structure the names of the network devices which are associated with it... I can't use a comparison of the pdev etc. pointers of the dma device, since some network devices (eg bonding / vlan interfaces in ethernet) have NULL pdev (they are virtual devices). Later I can scan this device ID list, but I must do it under the lock in order not to race with the device removal code, which removes IDs from this list in cma_process_remove(), correct? Or. From ogerlitz at voltaire.com Tue May 27 09:37:44 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 27 May 2008 19:37:44 +0300 Subject: [ofa-general] Re: [RFC PATCH 3/5] rdma/cma: simplify locking needed for serialization of callbacks In-Reply-To: <000301c8c011$ef202f20$ebc8180a@amr.corp.intel.com> References: <000301c8c011$ef202f20$ebc8180a@amr.corp.intel.com> Message-ID: <483C38D8.5020600@voltaire.com> Sean Hefty wrote: >> @@ -359,7 +358,7 @@ static int cma_disable_remove(struct rdm >> >> spin_lock_irqsave(&id_priv->lock, flags); >> if (id_priv->state == state) { >> - atomic_inc(&id_priv->dev_remove); >> + mutex_lock(&id_priv->handler_mutex); > This just tried to acquire a mutex while holding a spinlock. I see. So can taking this spin lock be avoided here? I understand that the spin lock came to protect the state check, correct? Or.
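One way around the mutex-under-spinlock problem Sean flagged is to invert the order: take the handler mutex first and check the state under it, dropping the mutex on a mismatch. A sketch of that shape, borrowing the disable/enable_callback naming Or suggested earlier in the series (illustrative only, not necessarily what was merged):

    static int cma_disable_callback(struct rdma_id_private *id_priv,
                                    enum cma_state state)
    {
            mutex_lock(&id_priv->handler_mutex);
            if (id_priv->state != state) {
                    mutex_unlock(&id_priv->handler_mutex);
                    return -EINVAL;
            }
            return 0;       /* caller runs its handler, then unlocks */
    }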
From Thomas.Talpey at netapp.com Tue May 27 09:39:29 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 27 May 2008 12:39:29 -0400 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <1211902417.4114.73.camel@trinity.ogc.int> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> Message-ID: At 11:33 AM 5/27/2008, Tom Tucker wrote: >So I think from an NFSRDMA coding perspective it's a wash... Just to be clear, you're talking about the NFS/RDMA server. However, it's pretty much a wash on the client, for different reasons. >When posting the WR, we check the fastreg capabilities bit + transport >type bit: >If fastreg is true --> > Post FastReg > If iWARP (or with a cap bit read-with-inv-flag) > post rdma read w/ invalidate >... For iWARP's case, this means rdma-read-w-inv, >plus rdma-send-w-inv, etc... Maybe I'm confused, but I don't understand this. iWARP RDMA Read requests don't support remote invalidate. At least, the table in RFC5040 (p.22) doesn't:

-------+-----------+-------+------+-------+-----------+--------------
RDMA   | Message   | Tagged| STag | Queue | Invalidate| Message
Message| Type      | Flag  | and  | Number| STag      | Length
OpCode |           |       | TO   |       |           | Communicated
       |           |       |      |       |           | between DDP
       |           |       |      |       |           | and RDMAP
-------+-----------+-------+------+-------+-----------+--------------
0000b  | RDMA Write| 1     | Valid| N/A   | N/A       | Yes
       |           |       |      |       |           |
-------+-----------+-------+------+-------+-----------+--------------
0001b  | RDMA Read | 0     | N/A  | 1     | N/A       | Yes
       | Request   |       |      |       |           |
-------+-----------+-------+------+-------+-----------+--------------
0010b  | RDMA Read | 1     | Valid| N/A   | N/A       | Yes
       | Response  |       |      |       |           |
-------+-----------+-------+------+-------+-----------+--------------
0011b  | Send      | 0     | N/A  | 0     | N/A       | Yes
       |           |       |      |       |           |
-------+-----------+-------+------+-------+-----------+--------------
0100b  | Send with | 0     | N/A  | 0     | Valid     | Yes
       | Invalidate|       |      |       |           |
-------+-----------+-------+------+-------+-----------+--------------
0101b  | Send with | 0     | N/A  | 0     | N/A       | Yes
       | SE        |       |      |       |           |
-------+-----------+-------+------+-------+-----------+--------------
0110b  | Send with | 0     | N/A  | 0     | Valid     | Yes
       | SE and    |       |      |       |           |
       | Invalidate|       |      |       |           |
-------+-----------+-------+------+-------+-----------+--------------
0111b  | Terminate | 0     | N/A  | 2     | N/A       | Yes
       |           |       |      |       |           |
-------+-----------+-------+------+-------+-----------+--------------
1000b  |           |
to     | Reserved  | Not Specified
1111b  |           |
-------+-----------+-------------------------------------------------

I want to take this opportunity to also mention that the RPC/RDMA client-server exchange does not support remote-invalidate currently. Because of the multiple stags supported by the rpcrdma chunking header, and because the client needs to verify that the stags were in fact invalidated, there is significant overhead, and the jury is out on that benefit. In fact, I suspect it's a lose at the client. Tom (Talpey). From felix at chelsio.com Tue May 27 09:58:32 2008 From: felix at chelsio.com (Felix Marti) Date: Tue, 27 May 2008 09:58:32 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core:MEM_MGT_EXTENSIONS support References: <20080516223256.27221.34568.stgit@dell3.ogc.int><20080516223419.27221.49014.stgit@dell3.ogc.int><4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com><483B3AD7.4050208@opengridcomputing.com><1211902417.4114.73.camel@trinity.ogc.int> Message-ID: <8A71B368A89016469F72CD08050AD33402DE30DE@maui.asicdesigners.com> > -----Original Message----- > From: general-bounces at lists.openfabrics.org [mailto:general- > bounces at lists.openfabrics.org] On Behalf Of Talpey, Thomas > Sent: Tuesday, May 27, 2008 9:39 AM > To: Tom Tucker > Cc: Roland Dreier; general at lists.openfabrics.org > Subject: Re: [ofa-general] [PATCH RFC v3 1/2] > RDMA/Core:MEM_MGT_EXTENSIONS support > > Maybe I'm confused, but I don't understand this. iWARP RDMA Read > requests don't support remote invalidate. At least, the table in > RFC5040 (p.22) doesn't:
iWARP RDMA Read > requests > don't support remote invalidate. At least, the table in RFC5040 (p.22) > doesn't: > > > > -------+-----------+-------+------+-------+-----------+------------- > - > RDMA | Message | Tagged| STag | Queue | Invalidate| Message > Message| Type | Flag | and | Number| STag | Length > OpCode | | | TO | | | Communicated > | | | | | | between DDP > | | | | | | and RDMAP > -------+-----------+-------+------+-------+-----------+------------- > - > 0000b | RDMA Write| 1 | Valid| N/A | N/A | Yes > | | | | | | > -------+-----------+-------+------+-------+-----------+------------- > - > 0001b | RDMA Read | 0 | N/A | 1 | N/A | Yes > | Request | | | | | > -------+-----------+-------+------+-------+-----------+------------- > - > 0010b | RDMA Read | 1 | Valid| N/A | N/A | Yes > | Response | | | | | > -------+-----------+-------+------+-------+-----------+------------- > - > 0011b | Send | 0 | N/A | 0 | N/A | Yes > | | | | | | > -------+-----------+-------+------+-------+-----------+------------- > - > 0100b | Send with | 0 | N/A | 0 | Valid | Yes > | Invalidate| | | | | > -------+-----------+-------+------+-------+-----------+------------- > - > 0101b | Send with | 0 | N/A | 0 | N/A | Yes > | SE | | | | | > -------+-----------+-------+------+-------+-----------+------------- > - > 0110b | Send with | 0 | N/A | 0 | Valid | Yes > | SE and | | | | | > | Invalidate| | | | | > -------+-----------+-------+------+-------+-----------+------------- > - > 0111b | Terminate | 0 | N/A | 2 | N/A | Yes > | | | | | | > -------+-----------+-------+------+-------+-----------+------------- > - > 1000b | | > to | Reserved | Not Specified > 1111b | | > -------+-----------+------------------------------------------------ > - RDMA Read with Local Invalidate does not affect the wire. The 'must invalidate' state is kept in the RNIC that issues the RDMA Read Request... > > > > I want to take this opportunity to also mention that the RPC/RDMA > client-server > exchange does not support remote-invalidate currently. Because of the > multiple > stags supported by the rpcrdma chunking header, and because the client > needs > to verify that the stags were in fact invalidated, there is significant > overhead, > and the jury is out on that benefit. In fact, I suspect it's a lose at > the client. > > Tom (Talpey). > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general From weiny2 at llnl.gov Tue May 27 10:08:59 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 27 May 2008 10:08:59 -0700 Subject: [ofa-general] OpenSM? In-Reply-To: <5E22F547-0F44-4E56-A72C-DE9FC83EE697@hpc.ufl.edu> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <5E22F547-0F44-4E56-A72C-DE9FC83EE697@hpc.ufl.edu> Message-ID: <20080527100859.6d48cd45.weiny2@llnl.gov> Charles, Here at LLNL we have been running OpenSM for some time. Thus far we are very happy with it's performance. 
Our largest cluster is 1152 nodes, and OpenSM can bring it up (not counting boot time) in less than a minute. Here are some details. We are running v3.1.10 of OpenSM with some minor modifications (mostly patches which have been submitted upstream and accepted by Sasha but are not yet in a release). Our clusters are all Fat-tree topologies. We have a node which is more or less dedicated to running OpenSM. We have some other monitoring software running on it, but OpenSM can utilize the CPU/memory if it needs to. A) On our large clusters this node is a 4 socket, dual core (8 cores total) Opteron running at 2.4GHz with 16GB of memory. I don't believe OpenSM needs this much, but the nodes were all built the same, so this is what it got. B) On one of our smaller clusters (128 nodes) OpenSM is running on a dual socket, single core (2 cores total) 2.4GHz Opteron node with 2GB of memory. We have not seen any issues with this cluster and OpenSM. We run with the up/down algorithm; ftree has not panned out for us yet. I can't say how that would compare to the Cisco algorithms. In short, OpenSM should work just fine on your cluster. Hope this helps, Ira On Tue, 27 May 2008 11:15:14 -0400 Charles Taylor wrote: > We have a 400 node IB cluster. We are running an embedded SM in failover mode on our TS270/Cisco7008 core switches. > > [rest of the quoted message and SM log excerpt snipped; quoted in full earlier in this digest]
> > Thanks, > > Charlie Taylor > UF HPC Center > > May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: > ********************** NEW SWEEP ******************** > May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: > Rediscover > the subnet > May 27 14:14:13 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM > OUT_OF_SERVICE > trap for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59 > May 27 14:14:13 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:256]: An > existing IB > node GUID 00:02:c9:02:00:21:4b:59 LID 194 was removed > May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM > DELETE_MC_GROUP > trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59 > May 27 14:14:14 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]: > Topology > changed > May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Configuration > caused by > discovering removed ports > May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: > async events > require sweep > May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: > ********************** NEW SWEEP ******************** > May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: > Rediscover > the subnet > May 27 14:16:28 topspin-270sc ib_sm.x[812]: [ib_sm_discovery.c:1009]: no > routing required for port guid 00:02:c9:02:00:21:4b:59, lid 194 > May 27 14:16:30 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]: > Topology > changed > May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration > caused by > discovering new ports > May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration > caused by > multicast membership change > May 27 14:16:30 topspin-270sc ib_sm.x[812]: [ib_sm_assign.c:588]: > Force port to > go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1 > May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:562]: > Program port > state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor > node=00:02:c9:02:00:21:4b:58, port= 1, current state 2 > May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:733]: > Failed to > negotiate MTU, op_vl for node=00:02:c9:02:00:21:4b:58, port= 1, mad > status 0x1c > May 27 14:18:42 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM > IN_SERVICE trap > for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59 > May 27 14:18:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:144]: A new > IB node > 00:02:c9:02:00:21:4b:59 was discovered and assigned LID 0 > May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: > async events > require sweep > May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: > ********************** NEW SWEEP ******************** > May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: > Rediscover > the subnet > May 27 14:18:46 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No > topology > change > May 27 14:18:46 topspin-270sc ib_sm.x[803]: [INFO]: Configuration > caused by > previous GET/SET operation failures > May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:545]: > Reassigning > LID, node - GUID=00:02:c9:02:00:21:4b:58, port=1, new LID=411, curr > LID=0 > May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:588]: > Force port to > go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1 > May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:635]: > Clean up SA > resources for port forced down due to LID conflict, node - > GUID=00:02:c9:02:00:21:4b:58, port=1 > May 27 14:18:47 topspin-270sc 
ib_sm.x[803]: [ib_sm_assign.c:667]: > cleaning DB > for guid 00:02:c9:02:00:21:4b:59, lid 194 > May 27 14:18:47 topspin-270sc ib_sm.x[803]: [ib_sm_routing.c:2936]: > _ib_smAllocSubnet: initRate= 4 > May 27 14:18:47 topspin-270sc last message repeated 23 times > May 27 14:18:47 topspin-270sc ib_sm.x[803]: [INFO]: Different capacity > links > detected in the network > May 27 14:21:01 topspin-270sc ib_sm.x[820]: [ib_sm_bringup.c:516]: > Active > port(s) now in INIT state node=00:02:c9:02:00:21:4b:58, port=16, > state=2, > neighbor node=00:02:c9:02:00:21:4b:58, port=1, state=2 > May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: > async events > require sweep > May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: > ********************** NEW SWEEP ******************** > May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: > Rediscover > the subnet > May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No > topology > change > May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:525]: IB node > 00:06:6a:00:d9:00:04:5d port 16 is INIT state > May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Configuration > caused by > some ports in INIT state > May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Configuration > caused by > previous GET/SET operation failures > May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_routing.c:2936]: > _ib_smAllocSubnet: initRate= 4 > May 27 14:21:05 topspin-270sc last message repeated 23 times > May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Different capacity > links > detected in the network > May 27 14:23:19 topspin-270sc ib_sm.x[817]: [ib_sm_bringup.c:562]: > Program port > state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor > node=00:02:c9:02:00:21:4b:58, port= 1, current state 2 > May 27 14:23:24 topspin-270sc ib_sm.x[823]: [INFO]: Generate SM > CREATE_MC_GROUP > trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59 > May 27 14:23:24 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: > async events > require sweep > May 27 14:23:24 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: > ********************** NEW SWEEP ******************** > May 27 14:23:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No > topology > change > May 27 14:23:26 topspin-270sc ib_sm.x[803]: [INFO]: Configuration > caused by > multicast membership change > May 27 14:23:33 topspin-270sc ib_sm.x[826]: [INFO]: Standby SM guid > 00:05:ad:00:00:02:3c:60, is no longer synchronized with Master SM > May 27 14:25:39 topspin-270sc ib_sm.x[826]: [INFO]: Initialize a > backup session > with Standby SM guid 00:05:ad:00:00:02:3c:60 > May 27 14:25:39 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: > async events > require sweep > May 27 14:25:39 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: > ********************** NEW SWEEP ******************** > May 27 14:25:39 topspin-270sc ib_sm.x[826]: [INFO]: Standby SM guid > 00:05:ad:00:00:02:3c:60, started synchronizing with Master SM > May 27 14:25:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No > topology > change > May 27 14:25:42 topspin-270sc ib_sm.x[803]: [INFO]: Configuration > caused by > multicast membership change > May 27 14:25:43 topspin-270sc ib_sm.x[826]: [INFO]: Master SM DB > synchronized > with Standby SM guid 00:05:ad:00:00:02:3c:60 > May 27 14:25:43 topspin-270sc ib_sm.x[826]: [INFO]: Master SM DB > synchronized > with all designated backup SMs > May 27 14:28:04 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: > 
********************** NEW SWEEP ******************** > May 27 14:28:06 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No > topology > change > > On May 23, 2008, at 2:20 PM, Steve Wise wrote: > > > Or Gerlitz wrote: > >> Steve Wise wrote: > >>> Are we sure we need to expose this to the user? > >> I believe this is the way to go if we want to let smart ULPs > >> generate new rkey/stag per mapping. Simpler ULPs could then just > >> put the same value for each map associated with the same mr. > >> > >> Or. > >> > > > > How should I add this to the API? > > > > Perhaps we just document the format of an rkey in the struct ib_mr. > > Thus the app would do this to change the key before posting the > > fast_reg_mr wr (coded to be explicit, not efficient): > > > > u8 newkey; > > u32 newrkey; > > > > newkey = 0xaa; > > newrkey = (mr->rkey & 0xffffff00) | newkey; > > mr->rkey = newrkey > > wr.wr.fast_reg.mr = mr; > > ... > > > > > > Note, this assumes mr->rkey is in host byte order (I think the linux > > rdma code assumes this in other places too). > > > > > > Steve. > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From sashak at voltaire.com Tue May 27 10:53:43 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 27 May 2008 20:53:43 +0300 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211888036.13185.219.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> <20080527103341.GF12014@sashak.voltaire.com> <1211888036.13185.219.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080527175343.GA14205@sashak.voltaire.com> On 04:33 Tue 27 May , Hal Rosenstock wrote: > > > > Maybe yes, but could you be more specific? Store SMKey in read-only > > file on a client side? > > Treat smkey as su treats password rather than a command line parameter > is another alternative. Ok, let's do it as '--smkey X' and then saquery will ask for a value, just like su does. Good? > > I'm not proposing to expose SM_Key, just added such option where this > > key could be specified. > > How is that not exposing it ? Because (1) and (2) below. Sasha > > -- Hal > > > So: 1) this is *optional*, 2) there is no > > suggestions about how the right value should be determined. 
> > > > Sasha > From sashak at voltaire.com Tue May 27 10:56:37 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 27 May 2008 20:56:37 +0300 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211887752.13185.212.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> <1211887752.13185.212.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080527175637.GB14205@sashak.voltaire.com> On 04:29 Tue 27 May, Hal Rosenstock wrote: > On Fri, 2008-05-23 at 05:52 -0700, Hal Rosenstock wrote: > > > Following your logic we will need to disable root password > > > typing too. > > > > That's taking it too far. Root passwords are at least hidden when > > typing. > > At least hide the key from plain sight when typing, like su does. There are a lot of tools where a password can be specified as clear text on the command line (wget, smbclient, etc.) - it is the user's responsibility to keep such sensitive data safe. Sasha From pw at osc.edu Tue May 27 11:00:04 2008 From: pw at osc.edu (Pete Wyckoff) Date: Tue, 27 May 2008 14:00:04 -0400 Subject: [ofa-general] mthca MR attrs userspace change Message-ID: <20080527180004.GA15444@osc.edu> My kernel started complaining to me recently: ib_mthca 0000:02:00.0: Process 'tgtd' did not pass in MR attrs. ib_mthca 0000:02:00.0: Update libmthca to fix this. It comes from commit cb9fbc5c37b69ac584e61d449cfd590f5ae1f90d ("IB: expand ib_umem_get() prototype") and the fix in baaad380c0aa955f7d62e846467316c94067f1a5 ("IB/mthca: Avoid changing userspace ABI to handle DMA write barrier attribute"). Nice that everything still works with old userspace, but where is the latest libmthca these days? The one at kernel.org still has ABI_VERSION 1: http://git.kernel.org/?p=libs/infiniband/libmthca.git;a=blob;f=src/mthca-abi.h;h=2557274e4cbd9f36df2be42379644d31b4ff5da3;hb=HEAD By the way, Roland, your efforts at Fedora packaging are certainly appreciated here. If the new libmthca just showed up in updates to F-9, that would be most convenient. -- Pete From rdreier at cisco.com Tue May 27 11:03:30 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 11:03:30 -0700 Subject: [ofa-general] Re: mthca MR attrs userspace change In-Reply-To: <20080527180004.GA15444@osc.edu> (Pete Wyckoff's message of "Tue, 27 May 2008 14:00:04 -0400") References: <20080527180004.GA15444@osc.edu> Message-ID: > ib_mthca 0000:02:00.0: Process 'tgtd' did not pass in MR attrs. > ib_mthca 0000:02:00.0: Update libmthca to fix this. Heh, thanks for the reminder. Will fix libmthca to handle this properly today. > By the way, Roland, your efforts at Fedora packaging are certainly > appreciated here. If the new libmthca just showed up in updates to > F-9, that would be most convenient. It will, although the Fedora process takes a while to grind through ;) - R. 
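On the saquery --smkey thread above, a minimal sketch of the agreed su-style prompting; prompt_smkey() is a hypothetical helper, not actual saquery code, and getpass(3) is obsolescent but widely available. The point is that the key is typed with echo disabled, so it never shows up on the command line, in shell history, or in ps output:

	#include <stdlib.h>
	#include <stdint.h>
	#include <unistd.h>

	/* Prompt for the 64-bit SM_Key with terminal echo disabled. */
	static uint64_t prompt_smkey(void)
	{
		const char *s = getpass("SM_Key: ");

		/* base 0 accepts 0x-prefixed hex as well as decimal */
		return s ? strtoull(s, NULL, 0) : 0;
	}

A termios-based echo-off loop would be the portable replacement if getpass() is unavailable.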
From sean.hefty at intel.com Tue May 27 11:09:24 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 27 May 2008 11:09:24 -0700 Subject: [ofa-general] RE: [RFC PATCH 3/5] rdma/cma: simplify locking needed for serialization of callbacks In-Reply-To: <483C38D8.5020600@voltaire.com> References: <000301c8c011$ef202f20$ebc8180a@amr.corp.intel.com> <483C38D8.5020600@voltaire.com> Message-ID: <000e01c8c024$cb797730$ebc8180a@amr.corp.intel.com> >>> spin_lock_irqsave(&id_priv->lock, flags); >>> if (id_priv->state == state) { >>> - atomic_inc(&id_priv->dev_remove); >>> + mutex_lock(&id_priv->handler_mutex); >> This just tried to acquire a mutex while holding a spinlock. > >I see. So can taking this spin lock be avoided here? I understand that >spin lock came to protect the state check, correct? I think we should just remove cma_disable_remove() and cma_enable_remove(), and instead call mutex_lock/unlock directly in their places. Where cma_disable_remove() is called, add in appropriate state checks after acquiring the mutex. - Sean From tom at opengridcomputing.com Tue May 27 11:28:14 2008 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 27 May 2008 13:28:14 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <1211902417.4114.73.camel@trinity.ogc.int> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> Message-ID: <1211912894.4114.78.camel@trinity.ogc.int> On Tue, 2008-05-27 at 10:33 -0500, Tom Tucker wrote: > On Mon, 2008-05-26 at 16:02 -0700, Roland Dreier wrote: > > > The "invalidate local stag" part of a read is just a local sink side > > > operation (ie no wire protocol change from a read). It's not like > > > processing an ingress send-with-inv. It is really functionally like a > > > read followed immediately by a fenced invalidate-local, but it doesn't > > > stall the pipe. So the device has to remember the read is a "with inv > > > local stag" and invalidate the stag after the read response is placed > > > and before the WCE is reaped by the application. > > > > Yes, understood. My point was just that in IB, at least in theory, one > > could just use an L_Key that doesn't have any remote permissions in the > > scatter list of an RDMA read, while in iWARP, the STag used to place an > > RDMA read response has to have remote write permission. So RDMA read > > with invalidate makes sense for iWARP, because it gives a race-free way > > to allow an STag to be invalidated immediately after an RDMA read > > response is placed, while in IB it's simpler just to never give remote > > access at all. > > > > So I think from an NFSRDMA coding perspective it's a wash... > > When creating the local data sink, We need to check the transport type. > > If it's IB --> only local access, > if it's iWARP --> local + remote access. > > When posting the WR, We check the fastreg capabilities bit + transport type bit: > If fastreg is true --> > Post FastReg > If iWARP (or with a cap bit read-with-inv-flag) > post rdma read w/ invalidate > else /* IB */ > post rdma read Steve pointed out a good optimization here. Instead of fencing the RDMA READ here in advance of the INVALIDATE, we should post the INVALIDATE when the READ WR completes. This will avoid stalling the SQ. 
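Something like the following sketch, using the IB_WR_INVALIDATE_MR opcode and wr.local_inv field from Steve's v4 patch below; the helper name and the CQ plumbing that would call it are assumptions for illustration:

	/*
	 * Called when the RDMA READ's work completion is reaped. Only now
	 * is the local invalidate posted, so no read fence is needed and
	 * the SQ never stalls behind the outstanding READ.
	 */
	static int invalidate_on_read_done(struct ib_qp *qp, struct ib_mr *mr)
	{
		struct ib_send_wr inv_wr, *bad_wr;

		memset(&inv_wr, 0, sizeof inv_wr);
		inv_wr.opcode = IB_WR_INVALIDATE_MR;	/* proposed opcode */
		inv_wr.wr.local_inv.mr = mr;		/* proposed wr union member */
		inv_wr.send_flags = IB_SEND_SIGNALED;
		return ib_post_send(qp, &inv_wr, &bad_wr);
	}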
Since IB doesn't put the LKEY on the wire, there's no security issue to close. We need to keep a bunch of fastreg MR around anyway for concurrent RPC. Thoughts? Tom > post invalidate > fi > else > ... today's logic > fi > > I make the observation, however, that the transport type is now overloaded > with a set of required verbs. For iWARP's case, this means rdma-read-w-inv, > plus rdma-send-w-inv, etc... This also means that new transport types will > inherit one or the other set of verbs (IB or iWARP). > > Tom > > > > - R. > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Tue May 27 11:24:55 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 11:24:55 -0700 Subject: [ofa-general] Re: mthca MR attrs userspace change In-Reply-To: (Roland Dreier's message of "Tue, 27 May 2008 11:03:30 -0700") References: <20080527180004.GA15444@osc.edu> Message-ID: Here's the patch I plan to add to match the kernel, in case someone wants to check it over: diff --git a/src/mthca-abi.h b/src/mthca-abi.h index 2557274..7e47d70 100644 --- a/src/mthca-abi.h +++ b/src/mthca-abi.h @@ -36,7 +36,8 @@ #include -#define MTHCA_UVERBS_ABI_VERSION 1 +#define MTHCA_UVERBS_MIN_ABI_VERSION 1 +#define MTHCA_UVERBS_MAX_ABI_VERSION 2 struct mthca_alloc_ucontext_resp { struct ibv_get_context_resp ibv_resp; @@ -50,6 +51,17 @@ struct mthca_alloc_pd_resp { __u32 reserved; }; +struct mthca_reg_mr { + struct ibv_reg_mr ibv_cmd; +/* + * Mark the memory region with a DMA attribute that causes + * in-flight DMA to be flushed when the region is written to: + */ +#define MTHCA_MR_DMASYNC 0x1 + __u32 mr_attrs; + __u32 reserved; +}; + struct mthca_create_cq { struct ibv_create_cq ibv_cmd; __u32 lkey; diff --git a/src/mthca.c b/src/mthca.c index e00c4ee..dd95636 100644 --- a/src/mthca.c +++ b/src/mthca.c @@ -282,9 +282,11 @@ static struct ibv_device *mthca_driver_init(const char *uverbs_sys_path, return NULL; found: - if (abi_version > MTHCA_UVERBS_ABI_VERSION) { - fprintf(stderr, PFX "Fatal: ABI version %d of %s is too new (expected %d)\n", - abi_version, uverbs_sys_path, MTHCA_UVERBS_ABI_VERSION); + if (abi_version > MTHCA_UVERBS_MAX_ABI_VERSION || + abi_version < MTHCA_UVERBS_MIN_ABI_VERSION) { + fprintf(stderr, PFX "Fatal: ABI version %d of %s is not in supported range %d-%d\n", + abi_version, uverbs_sys_path, MTHCA_UVERBS_MIN_ABI_VERSION, + MTHCA_UVERBS_MAX_ABI_VERSION); return NULL; } diff --git a/src/verbs.c b/src/verbs.c index 6c9b53a..3d273d4 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -117,12 +117,21 @@ int mthca_free_pd(struct ibv_pd *pd) static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr, size_t length, uint64_t hca_va, - enum ibv_access_flags access) + enum ibv_access_flags access, + int dma_sync) { struct ibv_mr *mr; - struct ibv_reg_mr cmd; + struct mthca_reg_mr cmd; int ret; + /* + * Old kernels just ignore the extra data we pass in with the + * reg_mr command structure, so there's no need to add an ABI + * version check here. + */ + cmd.mr_attrs = dma_sync ? 
MTHCA_MR_DMASYNC : 0; + cmd.reserved = 0; + mr = malloc(sizeof *mr); if (!mr) return NULL; @@ -132,11 +141,11 @@ static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr, struct ibv_reg_mr_resp resp; ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr, - &cmd, sizeof cmd, &resp, sizeof resp); + &cmd.ibv_cmd, sizeof cmd, &resp, sizeof resp); } #else ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr, - &cmd, sizeof cmd); + &cmd.ibv_cmd, sizeof cmd); #endif if (ret) { free(mr); @@ -149,7 +158,7 @@ static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr, struct ibv_mr *mthca_reg_mr(struct ibv_pd *pd, void *addr, size_t length, enum ibv_access_flags access) { - return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access); + return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access, 0); } int mthca_dereg_mr(struct ibv_mr *mr) @@ -202,7 +211,7 @@ struct ibv_cq *mthca_create_cq(struct ibv_context *context, int cqe, cq->mr = __mthca_reg_mr(to_mctx(context)->pd, cq->buf.buf, cqe * MTHCA_CQ_ENTRY_SIZE, - 0, IBV_ACCESS_LOCAL_WRITE); + 0, IBV_ACCESS_LOCAL_WRITE, 1); if (!cq->mr) goto err_buf; @@ -297,7 +306,7 @@ int mthca_resize_cq(struct ibv_cq *ibcq, int cqe) mr = __mthca_reg_mr(to_mctx(ibcq->context)->pd, buf.buf, cqe * MTHCA_CQ_ENTRY_SIZE, - 0, IBV_ACCESS_LOCAL_WRITE); + 0, IBV_ACCESS_LOCAL_WRITE, 1); if (!mr) { mthca_free_buf(&buf); ret = ENOMEM; @@ -405,7 +414,7 @@ struct ibv_srq *mthca_create_srq(struct ibv_pd *pd, if (mthca_alloc_srq_buf(pd, &attr->attr, srq)) goto err; - srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0); + srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0, 0); if (!srq->mr) goto err_free; @@ -525,7 +534,7 @@ struct ibv_qp *mthca_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE)) goto err_free; - qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0); + qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0, 0); if (!qp->mr) goto err_free; From rdreier at cisco.com Tue May 27 11:26:32 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 11:26:32 -0700 Subject: [ofa-general] Re: [PATCH v2 12/13] QLogic VNIC: Driver Kconfig and Makefile. In-Reply-To: <71d336490805262323qd876161ue86f98dbc8499ad6@mail.gmail.com> (Ramachandra K.'s message of "Tue, 27 May 2008 11:53:42 +0530") References: <20080519102843.12355.832.stgit@localhost.localdomain> <20080519103730.12355.14730.stgit@localhost.localdomain> <71d336490805260037y4c45abd5lccd5dbadebd84eb8@mail.gmail.com> <71d336490805262323qd876161ue86f98dbc8499ad6@mail.gmail.com> Message-ID: > Makes sense. We will get rid of this CONFIG option. Apart from this > are there any other changes you > would like to see in the patch series ? Have not reviewed the latest in detail but I think we are at least pretty close to something ready to merge. - R. From rdreier at cisco.com Tue May 27 11:33:55 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 11:33:55 -0700 Subject: [ofa-general] Re: mthca MR attrs userspace change In-Reply-To: <20080527180004.GA15444@osc.edu> (Pete Wyckoff's message of "Tue, 27 May 2008 14:00:04 -0400") References: <20080527180004.GA15444@osc.edu> Message-ID: > Nice that everything still works with old userspace, but where is > the latest libmthca these days? The one at kernel.org still has > ABI_VERSION 1: Actually this tricked me... the kernel ABI didn't get bumped so there was no reason to bump the libmthca ABI. 
I'll actually use this slightly simpler patch: diff --git a/src/mthca-abi.h b/src/mthca-abi.h index 2557274..4fbd98b 100644 --- a/src/mthca-abi.h +++ b/src/mthca-abi.h @@ -50,6 +50,17 @@ struct mthca_alloc_pd_resp { __u32 reserved; }; +struct mthca_reg_mr { + struct ibv_reg_mr ibv_cmd; +/* + * Mark the memory region with a DMA attribute that causes + * in-flight DMA to be flushed when the region is written to: + */ +#define MTHCA_MR_DMASYNC 0x1 + __u32 mr_attrs; + __u32 reserved; +}; + struct mthca_create_cq { struct ibv_create_cq ibv_cmd; __u32 lkey; diff --git a/src/verbs.c b/src/verbs.c index 6c9b53a..def0f30 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -117,12 +117,22 @@ int mthca_free_pd(struct ibv_pd *pd) static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr, size_t length, uint64_t hca_va, - enum ibv_access_flags access) + enum ibv_access_flags access, + int dma_sync) { struct ibv_mr *mr; - struct ibv_reg_mr cmd; + struct mthca_reg_mr cmd; int ret; + /* + * Old kernels just ignore the extra data we pass in with the + * reg_mr command structure, so there's no need to add an ABI + * version check here (and indeed the kernel ABI was not + * incremented due to this change). + */ + cmd.mr_attrs = dma_sync ? MTHCA_MR_DMASYNC : 0; + cmd.reserved = 0; + mr = malloc(sizeof *mr); if (!mr) return NULL; @@ -132,11 +142,11 @@ static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr, struct ibv_reg_mr_resp resp; ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr, - &cmd, sizeof cmd, &resp, sizeof resp); + &cmd.ibv_cmd, sizeof cmd, &resp, sizeof resp); } #else ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr, - &cmd, sizeof cmd); + &cmd.ibv_cmd, sizeof cmd); #endif if (ret) { free(mr); @@ -149,7 +159,7 @@ static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr, struct ibv_mr *mthca_reg_mr(struct ibv_pd *pd, void *addr, size_t length, enum ibv_access_flags access) { - return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access); + return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access, 0); } int mthca_dereg_mr(struct ibv_mr *mr) @@ -202,7 +212,7 @@ struct ibv_cq *mthca_create_cq(struct ibv_context *context, int cqe, cq->mr = __mthca_reg_mr(to_mctx(context)->pd, cq->buf.buf, cqe * MTHCA_CQ_ENTRY_SIZE, - 0, IBV_ACCESS_LOCAL_WRITE); + 0, IBV_ACCESS_LOCAL_WRITE, 1); if (!cq->mr) goto err_buf; @@ -297,7 +307,7 @@ int mthca_resize_cq(struct ibv_cq *ibcq, int cqe) mr = __mthca_reg_mr(to_mctx(ibcq->context)->pd, buf.buf, cqe * MTHCA_CQ_ENTRY_SIZE, - 0, IBV_ACCESS_LOCAL_WRITE); + 0, IBV_ACCESS_LOCAL_WRITE, 1); if (!mr) { mthca_free_buf(&buf); ret = ENOMEM; @@ -405,7 +415,7 @@ struct ibv_srq *mthca_create_srq(struct ibv_pd *pd, if (mthca_alloc_srq_buf(pd, &attr->attr, srq)) goto err; - srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0); + srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0, 0); if (!srq->mr) goto err_free; @@ -525,7 +535,7 @@ struct ibv_qp *mthca_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE)) goto err_free; - qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0); + qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0, 0); if (!qp->mr) goto err_free; From swise at opengridcomputing.com Tue May 27 11:34:29 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 May 2008 13:34:29 -0500 Subject: [ofa-general] [PATCH RFC v4 0/2] RDMA/Core: MEM_MGT_EXTENSIONS support Message-ID: 
<20080527183429.32168.14351.stgit@dell3.ogc.int> The following patch series proposes: - The API and core changes needed to implement the IB BMME and iWARP equivalent memory extensions. - cxgb3 support. Changes since version 3: - better comments to the ib_alloc_fast_reg_page_list() function to explicitly state the page list is owned by the device until the fast_reg WR completes _and_ that the page_list can be modified by the device. - cxgb3 - when allocating a page list, set max_page_list_len. - page_size -> page_shift in fast_reg union of ib_send_wr struct. - key support via ib_update_fast_reg_key() Changes since version 2: - added device attribute max_fast_reg_page_list_len - added cxgb3 patch Changes since version 1: - ib_alloc_mr() -> ib_alloc_fast_reg_mr() - pbl_depth -> max_page_list_len - page_list_len -> max_page_list_len where it makes sense - int -> unsigned int where needed - fbo -> first_byte_offset - added page size and page_list_len to fast_reg union in ib_send_wr - rearranged work request fast_reg union of ib_send_wr to pack it - dropped remove_access parameter from ib_alloc_fast_reg_mr() - IB_DEVICE_MM_EXTENSIONS -> IB_DEVICE_MEM_MGT_EXTENSIONS - compiled Steve. From swise at opengridcomputing.com Tue May 27 11:35:49 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 May 2008 13:35:49 -0500 Subject: [ofa-general] [PATCH RFC v4 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <20080527183429.32168.14351.stgit@dell3.ogc.int> References: <20080527183429.32168.14351.stgit@dell3.ogc.int> Message-ID: <20080527183549.32168.22959.stgit@dell3.ogc.int> Support for the IB BMME and iWARP equivalent memory extensions to non-shared memory regions. This includes: - allocation of an ib_mr for use in fast register work requests - device-specific alloc/free of physical buffer lists for use in fast register work requests. This allows devices to allocate this memory as needed (like via dma_alloc_coherent). - fast register memory region work request - invalidate local memory region work request - read with invalidate local memory region work request (iWARP only) Design details: - New device capability flag added: IB_DEVICE_MEM_MGT_EXTENSIONS indicates device support for this feature. - New send WR opcode IB_WR_FAST_REG_MR used to issue a fast_reg request. - New send WR opcode IB_WR_INVALIDATE_MR used to invalidate a fast_reg mr. - New API function, ib_alloc_fast_reg_mr(), used to allocate fast_reg memory regions. - New API function, ib_alloc_fast_reg_page_list, to allocate device-specific page lists. - New API function, ib_free_fast_reg_page_list, to free said page lists. - New API function, ib_update_fast_reg_key, to allow the key portion of the R_Key and L_Key of a fast_reg MR to be updated. Applications call this if desired before posting the IB_WR_FAST_REG_MR. Usage Model: - MR allocated with ib_alloc_fast_reg_mr() - Page lists allocated via ib_alloc_fast_reg_page_list(). - MR R_Key/L_Key "key" field updated with ib_update_fast_reg_key(). - MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) - MR deallocated with ib_dereg_mr() - page lists deallocated via ib_free_fast_reg_page_list(). Applications can allocate a fast_reg mr once, and then can repeatedly bind the mr to different physical memory SGLs via posting work requests to the SQ. For each outstanding mr-to-pbl binding in the SQ pipe, a fast_reg_page_list needs to be allocated. 
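In code, the model above reduces to something like the following sketch, using only the functions and wr fields introduced by this patch; pd, qp, MAX_PAGES, npages, len, and iova are assumed from context, and error handling and the page-list fill-in are omitted:

	struct ib_mr *mr;
	struct ib_fast_reg_page_list *pl;
	struct ib_send_wr wr, *bad_wr;
	u8 key = 0;
	int ret;

	/* Setup time: allocate the fast_reg MR and its page list once. */
	mr = ib_alloc_fast_reg_mr(pd, MAX_PAGES);
	pl = ib_alloc_fast_reg_page_list(qp->device, MAX_PAGES);

	/* Per I/O: bump the key, describe the new SGL, and post. */
	ib_update_fast_reg_key(mr, ++key);
	/* ... fill pl->page_list[0..npages-1] with ib_dma_* mapped addresses ... */
	memset(&wr, 0, sizeof wr);
	wr.opcode = IB_WR_FAST_REG_MR;
	wr.send_flags = IB_SEND_SIGNALED;
	wr.wr.fast_reg.mr = mr;
	wr.wr.fast_reg.page_list = pl;
	wr.wr.fast_reg.page_list_len = npages;
	wr.wr.fast_reg.page_shift = PAGE_SHIFT;
	wr.wr.fast_reg.length = len;
	wr.wr.fast_reg.iova_start = iova;
	wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE |
				      IB_ACCESS_REMOTE_READ;
	ret = ib_post_send(qp, &wr, &bad_wr);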
Thus pipelining can be achieved while still allowing device-specific page_list processing. The 4B fast_reg rkey or stag is composed of a 3B index, and a 1B key. The application can change the key each time it fast-registers thus allowing more control over the peer's use of the rkey (ie it can effectively be changed each time the rkey is rebound to a page list). Signed-off-by: Steve Wise --- drivers/infiniband/core/verbs.c | 46 ++++++++++++++++++++++++ include/rdma/ib_verbs.h | 76 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 122 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 0504208..0a334b4 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -755,6 +755,52 @@ int ib_dereg_mr(struct ib_mr *mr) } EXPORT_SYMBOL(ib_dereg_mr); +struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len) +{ + struct ib_mr *mr; + + if (!pd->device->alloc_fast_reg_mr) + return ERR_PTR(-ENOSYS); + + mr = pd->device->alloc_fast_reg_mr(pd, max_page_list_len); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + mr->uobject = NULL; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_alloc_fast_reg_mr); + +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( + struct ib_device *device, int max_page_list_len) +{ + struct ib_fast_reg_page_list *page_list; + + if (!device->alloc_fast_reg_page_list) + return ERR_PTR(-ENOSYS); + + page_list = device->alloc_fast_reg_page_list(device, max_page_list_len); + + if (!IS_ERR(page_list)) { + page_list->device = device; + page_list->max_page_list_len = max_page_list_len; + } + + return page_list; +} +EXPORT_SYMBOL(ib_alloc_fast_reg_page_list); + +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list) +{ + page_list->device->free_fast_reg_page_list(page_list); +} +EXPORT_SYMBOL(ib_free_fast_reg_page_list); + /* Memory windows */ struct ib_mw *ib_alloc_mw(struct ib_pd *pd) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..ede0c80 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -106,6 +106,7 @@ enum ib_device_cap_flags { IB_DEVICE_UD_IP_CSUM = (1<<18), IB_DEVICE_UD_TSO = (1<<19), IB_DEVICE_SEND_W_INV = (1<<21), + IB_DEVICE_MEM_MGT_EXTENSIONS = (1<<22), }; enum ib_atomic_cap { @@ -151,6 +152,7 @@ struct ib_device_attr { int max_srq; int max_srq_wr; int max_srq_sge; + unsigned int max_fast_reg_page_list_len; u16 max_pkeys; u8 local_ca_ack_delay; }; @@ -414,6 +416,8 @@ enum ib_wc_opcode { IB_WC_FETCH_ADD, IB_WC_BIND_MW, IB_WC_LSO, + IB_WC_FAST_REG_MR, + IB_WC_INVALIDATE_MR, /* * Set value of IB_WC_RECV so consumers can test if a completion is a * receive by testing (opcode & IB_WC_RECV). 
@@ -628,6 +632,9 @@ enum ib_wr_opcode { IB_WR_ATOMIC_FETCH_AND_ADD, IB_WR_LSO, IB_WR_SEND_WITH_INV, + IB_WR_FAST_REG_MR, + IB_WR_INVALIDATE_MR, + IB_WR_READ_WITH_INV, }; enum ib_send_flags { @@ -676,6 +683,19 @@ struct ib_send_wr { u16 pkey_index; /* valid for GSI only */ u8 port_num; /* valid for DR SMPs on switch only */ } ud; + struct { + u64 iova_start; + struct ib_mr *mr; + struct ib_fast_reg_page_list *page_list; + unsigned int page_shift; + unsigned int page_list_len; + unsigned int first_byte_offset; + u32 length; + int access_flags; + } fast_reg; + struct { + struct ib_mr *mr; + } local_inv; } wr; }; @@ -1014,6 +1034,10 @@ struct ib_device { int (*query_mr)(struct ib_mr *mr, struct ib_mr_attr *mr_attr); int (*dereg_mr)(struct ib_mr *mr); + struct ib_mr * (*alloc_fast_reg_mr)(struct ib_pd *pd, + int max_page_list_len); + struct ib_fast_reg_page_list * (*alloc_fast_reg_page_list)(struct ib_device *device, int page_list_len); + void (*free_fast_reg_page_list)(struct ib_fast_reg_page_list *page_list); int (*rereg_phys_mr)(struct ib_mr *mr, int mr_rereg_mask, struct ib_pd *pd, @@ -1808,6 +1832,58 @@ int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); int ib_dereg_mr(struct ib_mr *mr); /** + * ib_alloc_fast_reg_mr - Allocates memory region usable with the + * IB_WR_FAST_REG_MR send work request. + * @pd: The protection domain associated with the region. + * @max_page_list_len: requested max physical buffer list size to be allocated. + */ +struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len); + +struct ib_fast_reg_page_list { + struct ib_device *device; + u64 *page_list; + unsigned int max_page_list_len; +}; + +/** + * ib_alloc_fast_reg_page_list - Allocates a page list array + * @device - ib device pointer. + * @page_list_len - size of the page list array to be allocated. + * + * This allocates and returns a struct ib_fast_reg_page_list * + * and a page_list array that is at least page_list_len in size. + * The actual size is returned in max_page_list_len. + * The caller is responsible for initializing the contents of the + * page_list array before posting a send work request with the + * IB_WC_FAST_REG_MR opcode. The page_list array entries must be + * translated using one of the ib_dma_*() functions similar to the + * addresses passed to ib_map_phys_fmr(). Once the ib_post_send() + * is issued, the struct ib_fast_reg_page_list must not be modified + * by the caller until a completion notice is returned by the device. + */ +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( + struct ib_device *device, int page_list_len); + +/** + * ib_free_fast_reg_page_list - Deallocates a previously allocated + * page list array. + * @page_list - struct ib_fast_reg_page_list pointer to be deallocated. + */ +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list); + +/** + * ib_update_fast_reg_key - updates the key portion of the fast_reg + * R_Key and L_Key. + * @mr - struct ib_mr pointer to be updated. + * @newkey - new key to be used. + */ +static inline void ib_update_fast_reg_key(struct ib_mr *mr, u8 newkey) +{ + mr->lkey = (mr->lkey & 0xffffff00) | newkey; + mr->rkey = (mr->rkey & 0xffffff00) | newkey; +} + +/** * ib_alloc_mw - Allocates a memory window. * @pd: The protection domain associated with the memory window. 
*/ From swise at opengridcomputing.com Tue May 27 11:35:51 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 May 2008 13:35:51 -0500 Subject: [ofa-general] [PATCH RFC v4 2/2] RDMA/cxgb3: MEM_MGT_EXTENSIONS support In-Reply-To: <20080527183429.32168.14351.stgit@dell3.ogc.int> References: <20080527183429.32168.14351.stgit@dell3.ogc.int> Message-ID: <20080527183551.32168.19227.stgit@dell3.ogc.int> - set IB_DEVICE_MEM_MGT_EXTENSIONS capability bit. - set max_fast_reg_page_list_len device attribute. - add iwch_alloc_fast_reg_mr function. - add iwch_alloc_fastreg_pbl - add iwch_free_fastreg_pbl - adjust the WQ depth for kernel mode work queues to account for fastreg possibly taking 2 WR slots. - add fastreg_mr work request support. - add invalidate_mr work request support. - add send_with_inv and send_with_se_inv work request support. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/cxio_hal.c | 13 ++- drivers/infiniband/hw/cxgb3/cxio_hal.h | 1 drivers/infiniband/hw/cxgb3/cxio_wr.h | 51 ++++++++++- drivers/infiniband/hw/cxgb3/iwch_provider.c | 78 ++++++++++++++++- drivers/infiniband/hw/cxgb3/iwch_qp.c | 123 +++++++++++++++++++-------- 5 files changed, 216 insertions(+), 50 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index 3f441fc..6315c77 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -145,7 +145,7 @@ static int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid) } wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); memset(wqe, 0, sizeof(*wqe)); - build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 3, 0, qpid, 7); + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 3, 0, qpid, 7, 3); wqe->flags = cpu_to_be32(MODQP_WRITE_EC); sge_cmd = qpid << 8 | 3; wqe->sge_cmd = cpu_to_be64(sge_cmd); @@ -558,7 +558,7 @@ static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p) wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); memset(wqe, 0, sizeof(*wqe)); build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 0, 0, - T3_CTL_QP_TID, 7); + T3_CTL_QP_TID, 7, 3); wqe->flags = cpu_to_be32(MODQP_WRITE_EC); sge_cmd = (3ULL << 56) | FW_RI_SGEEC_START << 8 | 3; wqe->sge_cmd = cpu_to_be64(sge_cmd); @@ -674,7 +674,7 @@ static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr, build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_BP, flag, Q_GENBIT(rdev_p->ctrl_qp.wptr, T3_CTRL_QP_SIZE_LOG2), T3_CTRL_QP_ID, - wr_len); + wr_len, 3); if (flag == T3_COMPLETION_FLAG) ring_doorbell(rdev_p->ctrl_qp.doorbell, T3_CTRL_QP_ID); len -= 96; @@ -816,6 +816,13 @@ int cxio_deallocate_window(struct cxio_rdev *rdev_p, u32 stag) 0, 0); } +int cxio_allocate_stag(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid) +{ + *stag = T3_STAG_UNSET; + return __cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_NON_SHARED_MR, + 0, 0, 0ULL, 0, 0, 0, 0); +} + int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr) { struct t3_rdma_init_wr *wqe; diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.h b/drivers/infiniband/hw/cxgb3/cxio_hal.h index 6e128f6..e7659f6 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.h +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.h @@ -165,6 +165,7 @@ int cxio_reregister_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid, int cxio_dereg_mem(struct cxio_rdev *rdev, u32 stag, u32 pbl_size, u32 pbl_addr); int cxio_allocate_window(struct cxio_rdev *rdev, u32 * stag, u32 pdid); +int cxio_allocate_stag(struct cxio_rdev *rdev, u32 * stag, u32 
pdid); int cxio_deallocate_window(struct cxio_rdev *rdev, u32 stag); int cxio_rdma_init(struct cxio_rdev *rdev, struct t3_rdma_init_attr *attr); void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb); diff --git a/drivers/infiniband/hw/cxgb3/cxio_wr.h b/drivers/infiniband/hw/cxgb3/cxio_wr.h index f1a25a8..2a24962 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_wr.h +++ b/drivers/infiniband/hw/cxgb3/cxio_wr.h @@ -72,7 +72,8 @@ enum t3_wr_opcode { T3_WR_BIND = FW_WROPCODE_RI_BIND_MW, T3_WR_RCV = FW_WROPCODE_RI_RECEIVE, T3_WR_INIT = FW_WROPCODE_RI_RDMA_INIT, - T3_WR_QP_MOD = FW_WROPCODE_RI_MODIFY_QP + T3_WR_QP_MOD = FW_WROPCODE_RI_MODIFY_QP, + T3_WR_FASTREG = FW_WROPCODE_RI_FASTREGISTER_MR } __attribute__ ((packed)); enum t3_rdma_opcode { @@ -89,7 +90,8 @@ enum t3_rdma_opcode { T3_FAST_REGISTER, T3_LOCAL_INV, T3_QP_MOD, - T3_BYPASS + T3_BYPASS, + T3_RDMA_READ_REQ_WITH_INV, } __attribute__ ((packed)); static inline enum t3_rdma_opcode wr2opcode(enum t3_wr_opcode wrop) @@ -170,11 +172,46 @@ struct t3_send_wr { struct t3_sge sgl[T3_MAX_SGE]; /* 4+ */ }; +#define T3_MAX_FASTREG_DEPTH 18 + +struct t3_fastreg_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + __be32 page_type_perms; /* 2 */ + __be32 reserved1; + __be32 stag; /* 3 */ + __be32 len; + __be32 va_base_hi; /* 4 */ + __be32 va_base_lo_fbo; + __be64 reserved2[6]; /* 5-10 */ + __be64 pbl_addrs[0]; /* 11+ */ +}; + +#define S_FR_PAGE_COUNT 24 +#define M_FR_PAGE_COUNT 0xff +#define V_FR_PAGE_COUNT(x) ((x) << S_FR_PAGE_COUNT) +#define G_FR_PAGE_COUNT(x) ((((x) >> S_FR_PAGE_COUNT)) & M_FR_PAGE_COUNT) + +#define S_FR_PAGE_SIZE 16 +#define M_FR_PAGE_SIZE 0x1f +#define V_FR_PAGE_SIZE(x) ((x) << S_FR_PAGE_SIZE) +#define G_FR_PAGE_SIZE(x) ((((x) >> S_FR_PAGE_SIZE)) & M_FR_PAGE_SIZE) + +#define S_FR_TYPE 8 +#define M_FR_TYPE 0x1 +#define V_FR_TYPE(x) ((x) << S_FR_TYPE) +#define G_FR_TYPE(x) ((((x) >> S_FR_TYPE)) & M_FR_TYPE) + +#define S_FR_PERMS 0 +#define M_FR_PERMS 0xff +#define V_FR_PERMS(x) ((x) << S_FR_PERMS) +#define G_FR_PERMS(x) ((((x) >> S_FR_PERMS)) & M_FR_PERMS) + struct t3_local_inv_wr { struct fw_riwrh wrh; /* 0 */ union t3_wrid wrid; /* 1 */ __be32 stag; /* 2 */ - __be32 reserved3; + __be32 reserved; }; struct t3_rdma_write_wr { @@ -210,7 +247,8 @@ enum t3_mem_perms { T3_MEM_ACCESS_LOCAL_READ = 0x1, T3_MEM_ACCESS_LOCAL_WRITE = 0x2, T3_MEM_ACCESS_REM_READ = 0x4, - T3_MEM_ACCESS_REM_WRITE = 0x8 + T3_MEM_ACCESS_REM_WRITE = 0x8, + T3_MEM_ACCESS_MW_BIND = 0x10 } __attribute__ ((packed)); struct t3_bind_mw_wr { @@ -346,6 +384,7 @@ union t3_wr { struct t3_rdma_write_wr write; struct t3_rdma_read_wr read; struct t3_receive_wr recv; + struct t3_fastreg_wr fastreg; struct t3_local_inv_wr local_inv; struct t3_bind_mw_wr bind; struct t3_bypass_wr bypass; @@ -368,10 +407,10 @@ static inline enum t3_wr_opcode fw_riwrh_opcode(struct fw_riwrh *wqe) static inline void build_fw_riwrh(struct fw_riwrh *wqe, enum t3_wr_opcode op, enum t3_wr_flags flags, u8 genbit, u32 tid, - u8 len) + u8 len, u8 sopeop) { wqe->op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(op) | - V_FW_RIWR_SOPEOP(M_FW_RIWR_SOPEOP) | + V_FW_RIWR_SOPEOP(sopeop) | V_FW_RIWR_FLAGS(flags)); wmb(); wqe->gen_tid_len = cpu_to_be32(V_FW_RIWR_GEN(genbit) | diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index 8934178..e53d25b 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -768,6 +768,65 @@ static int iwch_dealloc_mw(struct ib_mw *mw) return 0; } +static struct ib_mr 
*iwch_alloc_fast_reg_mr(struct ib_pd *pd, int pbl_depth) +{ + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mr *mhp; + u32 mmid; + u32 stag = 0; + int ret; + + php = to_iwch_pd(pd); + rhp = php->rhp; + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + + ret = iwch_alloc_pbl(mhp, pbl_depth); + if (ret) { + kfree(mhp); + return ERR_PTR(ret); + } + mhp->attr.pbl_size = pbl_depth; + ret = cxio_allocate_stag(&rhp->rdev, &stag, php->pdid); + if (ret) { + iwch_free_pbl(mhp); + kfree(mhp); + return ERR_PTR(ret); + } + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.type = TPT_NON_SHARED_MR; + mhp->attr.stag = stag; + mmid = (stag) >> 8; + insert_handle(rhp, &rhp->mmidr, mhp, mmid); + PDBG("%s mmid 0x%x mhp %p stag 0x%x\n", __func__, mmid, mhp, stag); + return &(mhp->ibmr); +} + +static struct ib_fast_reg_page_list *iwch_alloc_fastreg_pbl( + struct ib_device *device, + int page_list_len) +{ + struct ib_fast_reg_page_list *page_list; + + page_list = kmalloc(sizeof *page_list + page_list_len * sizeof(u64), + GFP_KERNEL); + if (!page_list) + return ERR_PTR(-ENOMEM); + + page_list->page_list = (u64 *)(page_list + 1); + page_list->max_page_list_len = page_list_len; + + return page_list; +} + +static void iwch_free_fastreg_pbl(struct ib_fast_reg_page_list *page_list) +{ + kfree(page_list); +} + static int iwch_destroy_qp(struct ib_qp *ib_qp) { struct iwch_dev *rhp; @@ -843,6 +902,15 @@ static struct ib_qp *iwch_create_qp(struct ib_pd *pd, */ sqsize = roundup_pow_of_two(attrs->cap.max_send_wr); wqsize = roundup_pow_of_two(rqsize + sqsize); + + /* + * Kernel users need more wq space for fastreg WRs which can take + * 2 WR fragments. + */ + ucontext = pd->uobject ? to_iwch_ucontext(pd->uobject->context) : NULL; + if (!ucontext && wqsize < (rqsize + (2 * sqsize))) + wqsize = roundup_pow_of_two(rqsize + + roundup_pow_of_two(attrs->cap.max_send_wr * 2)); PDBG("%s wqsize %d sqsize %d rqsize %d\n", __func__, wqsize, sqsize, rqsize); qhp = kzalloc(sizeof(*qhp), GFP_KERNEL); @@ -851,7 +919,6 @@ static struct ib_qp *iwch_create_qp(struct ib_pd *pd, qhp->wq.size_log2 = ilog2(wqsize); qhp->wq.rq_size_log2 = ilog2(rqsize); qhp->wq.sq_size_log2 = ilog2(sqsize); - ucontext = pd->uobject ? to_iwch_ucontext(pd->uobject->context) : NULL; if (cxio_create_qp(&rhp->rdev, !udata, &qhp->wq, ucontext ? 
&ucontext->uctx : &rhp->rdev.uctx)) { kfree(qhp); @@ -1048,6 +1115,7 @@ static int iwch_query_device(struct ib_device *ibdev, props->max_mr = dev->attr.max_mem_regs; props->max_pd = dev->attr.max_pds; props->local_ca_ack_delay = 0; + props->max_fast_reg_page_list_len = T3_MAX_FASTREG_DEPTH; return 0; } @@ -1145,8 +1213,9 @@ int iwch_register_device(struct iwch_dev *dev) memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid)); memcpy(&dev->ibdev.node_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6); dev->ibdev.owner = THIS_MODULE; - dev->device_cap_flags = - (IB_DEVICE_ZERO_STAG | IB_DEVICE_MEM_WINDOW); + dev->device_cap_flags = IB_DEVICE_ZERO_STAG | + IB_DEVICE_MEM_WINDOW | + IB_DEVICE_MEM_MGT_EXTENSIONS; dev->ibdev.uverbs_cmd_mask = (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | @@ -1198,6 +1267,9 @@ int iwch_register_device(struct iwch_dev *dev) dev->ibdev.alloc_mw = iwch_alloc_mw; dev->ibdev.bind_mw = iwch_bind_mw; dev->ibdev.dealloc_mw = iwch_dealloc_mw; + dev->ibdev.alloc_fast_reg_mr = iwch_alloc_fast_reg_mr; + dev->ibdev.alloc_fast_reg_page_list = iwch_alloc_fastreg_pbl; + dev->ibdev.free_fast_reg_page_list = iwch_free_fastreg_pbl; dev->ibdev.attach_mcast = iwch_multicast_attach; dev->ibdev.detach_mcast = iwch_multicast_detach; diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c index 79dbe5b..c702c71 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -44,54 +44,39 @@ static int iwch_build_rdma_send(union t3_wr *wqe, struct ib_send_wr *wr, switch (wr->opcode) { case IB_WR_SEND: - case IB_WR_SEND_WITH_IMM: if (wr->send_flags & IB_SEND_SOLICITED) wqe->send.rdmaop = T3_SEND_WITH_SE; else wqe->send.rdmaop = T3_SEND; wqe->send.rem_stag = 0; break; -#if 0 /* Not currently supported */ - case TYPE_SEND_INVALIDATE: - case TYPE_SEND_INVALIDATE_IMMEDIATE: - wqe->send.rdmaop = T3_SEND_WITH_INV; - wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); - break; - case TYPE_SEND_SE_INVALIDATE: - wqe->send.rdmaop = T3_SEND_WITH_SE_INV; + case IB_WR_SEND_WITH_INV: + if (wr->send_flags & IB_SEND_SOLICITED) + wqe->send.rdmaop = T3_SEND_WITH_SE_INV; + else + wqe->send.rdmaop = T3_SEND_WITH_INV; wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); break; -#endif default: - break; + return -EINVAL; } if (wr->num_sge > T3_MAX_SGE) return -EINVAL; wqe->send.reserved[0] = 0; wqe->send.reserved[1] = 0; wqe->send.reserved[2] = 0; - if (wr->opcode == IB_WR_SEND_WITH_IMM) { - plen = 4; - wqe->send.sgl[0].stag = wr->ex.imm_data; - wqe->send.sgl[0].len = __constant_cpu_to_be32(0); - wqe->send.num_sgle = __constant_cpu_to_be32(0); - *flit_cnt = 5; - } else { - plen = 0; - for (i = 0; i < wr->num_sge; i++) { - if ((plen + wr->sg_list[i].length) < plen) { - return -EMSGSIZE; - } - plen += wr->sg_list[i].length; - wqe->send.sgl[i].stag = - cpu_to_be32(wr->sg_list[i].lkey); - wqe->send.sgl[i].len = - cpu_to_be32(wr->sg_list[i].length); - wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr); + plen = 0; + for (i = 0; i < wr->num_sge; i++) { + if ((plen + wr->sg_list[i].length) < plen) { + return -EMSGSIZE; } - wqe->send.num_sgle = cpu_to_be32(wr->num_sge); - *flit_cnt = 4 + ((wr->num_sge) << 1); + plen += wr->sg_list[i].length; + wqe->send.sgl[i].stag = cpu_to_be32(wr->sg_list[i].lkey); + wqe->send.sgl[i].len = cpu_to_be32(wr->sg_list[i].length); + wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr); } + wqe->send.num_sgle = cpu_to_be32(wr->num_sge); + *flit_cnt = 4 + ((wr->num_sge) << 1); wqe->send.plen = cpu_to_be32(plen); return 0; 
} @@ -155,6 +140,56 @@ static int iwch_build_rdma_read(union t3_wr *wqe, struct ib_send_wr *wr, return 0; } +static int iwch_build_fastreg(union t3_wr *wqe, struct ib_send_wr *wr, + u8 *flit_cnt, int *wr_cnt, struct t3_wq *wq) +{ + int i; + u64 *p; + + if (wr->wr.fast_reg.page_list_len > T3_MAX_FASTREG_DEPTH) + return -EINVAL; + *wr_cnt = 1; + wqe->fastreg.stag = cpu_to_be32(wr->wr.fast_reg.mr->rkey); + wqe->fastreg.len = cpu_to_be32(wr->wr.fast_reg.length); + wqe->fastreg.va_base_hi = cpu_to_be32(wr->wr.fast_reg.iova_start>>32); + wqe->fastreg.va_base_lo_fbo = + cpu_to_be32(wr->wr.fast_reg.iova_start&0xffffffff); + wqe->fastreg.page_type_perms = cpu_to_be32( + V_FR_PAGE_COUNT(wr->wr.fast_reg.page_list_len) | + V_FR_PAGE_SIZE(wr->wr.fast_reg.page_shift-12) | + V_FR_TYPE(T3_VA_BASED_TO) | + V_FR_PERMS(iwch_ib_to_mwbind_access(wr->wr.fast_reg.access_flags))); + p = &wqe->fastreg.pbl_addrs[0]; + for (i = 0; i < wr->wr.fast_reg.page_list_len; i++, p++) { + + /* If we need a 2nd WR, then set it up */ + if (i == 10) { + *wr_cnt = 2; + wqe = (union t3_wr *)(wq->queue + + Q_PTR2IDX((wq->wptr+1), wq->size_log2)); + build_fw_riwrh((void *)wqe, T3_WR_FASTREG, 0, + Q_GENBIT(wq->wptr, wq->size_log2), + 0, 1 + wr->wr.fast_reg.page_list_len - 10, 1); + + p = &wqe->flit[1]; + } + *p = cpu_to_be64((u64)wr->wr.fast_reg.page_list->page_list[i]); + } + *flit_cnt = 5 + wr->wr.fast_reg.page_list_len; + if (*flit_cnt > 15) + *flit_cnt = 15; + return 0; +} + +static int iwch_build_inv_stag(union t3_wr *wqe, struct ib_send_wr *wr, + u8 *flit_cnt) +{ + wqe->local_inv.stag = cpu_to_be32(wr->wr.local_inv.mr->rkey); + wqe->local_inv.reserved = 0; + *flit_cnt = sizeof(struct t3_local_inv_wr) >> 3; + return 0; +} + /* * TBD: this is going to be moved to firmware. Missing pdid/qpid check for now. */ @@ -238,6 +273,7 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, u32 num_wrs; unsigned long flag; struct t3_swsq *sqp; + int wr_cnt = 1; qhp = to_iwch_qp(ibqp); spin_lock_irqsave(&qhp->lock, flag); @@ -262,15 +298,15 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, t3_wr_flags = 0; if (wr->send_flags & IB_SEND_SOLICITED) t3_wr_flags |= T3_SOLICITED_EVENT_FLAG; - if (wr->send_flags & IB_SEND_FENCE) - t3_wr_flags |= T3_READ_FENCE_FLAG; if (wr->send_flags & IB_SEND_SIGNALED) t3_wr_flags |= T3_COMPLETION_FLAG; sqp = qhp->wq.sq + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2); switch (wr->opcode) { case IB_WR_SEND: - case IB_WR_SEND_WITH_IMM: + case IB_WR_SEND_WITH_INV: + if (wr->send_flags & IB_SEND_FENCE) + t3_wr_flags |= T3_READ_FENCE_FLAG; t3_wr_opcode = T3_WR_SEND; err = iwch_build_rdma_send(wqe, wr, &t3_wr_flit_cnt); break; @@ -289,6 +325,17 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, if (!qhp->wq.oldest_read) qhp->wq.oldest_read = sqp; break; + case IB_WR_FAST_REG_MR: + t3_wr_opcode = T3_WR_FASTREG; + err = iwch_build_fastreg(wqe, wr, &t3_wr_flit_cnt, + &wr_cnt, &qhp->wq); + break; + case IB_WR_INVALIDATE_MR: + if (wr->send_flags & IB_SEND_FENCE) + t3_wr_flags |= T3_LOCAL_FENCE_FLAG; + t3_wr_opcode = T3_WR_INV_STAG; + err = iwch_build_inv_stag(wqe, wr, &t3_wr_flit_cnt); + break; default: PDBG("%s post of type=%d TBD!\n", __func__, wr->opcode); @@ -307,14 +354,14 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, build_fw_riwrh((void *) wqe, t3_wr_opcode, t3_wr_flags, Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), - 0, t3_wr_flit_cnt); + 0, t3_wr_flit_cnt, (wr_cnt == 1) ? 
3 : 2); PDBG("%s cookie 0x%llx wq idx 0x%x swsq idx %ld opcode %d\n", __func__, (unsigned long long) wr->wr_id, idx, Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2), sqp->opcode); wr = wr->next; num_wrs--; - ++(qhp->wq.wptr); + qhp->wq.wptr += wr_cnt; ++(qhp->wq.sq_wptr); } spin_unlock_irqrestore(&qhp->lock, flag); @@ -359,7 +406,7 @@ int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, wr->wr_id; build_fw_riwrh((void *) wqe, T3_WR_RCV, T3_COMPLETION_FLAG, Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), - 0, sizeof(struct t3_receive_wr) >> 3); + 0, sizeof(struct t3_receive_wr) >> 3, 3); PDBG("%s cookie 0x%llx idx 0x%x rq_wptr 0x%x rw_rptr 0x%x " "wqe %p \n", __func__, (unsigned long long) wr->wr_id, idx, qhp->wq.rq_wptr, qhp->wq.rq_rptr, wqe); @@ -444,7 +491,7 @@ int iwch_bind_mw(struct ib_qp *qp, wqe->flit[T3_SQ_COOKIE_FLIT] = mw_bind->wr_id; build_fw_riwrh((void *)wqe, T3_WR_BIND, t3_wr_flags, Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), 0, - sizeof(struct t3_bind_mw_wr) >> 3); + sizeof(struct t3_bind_mw_wr) >> 3, 3); ++(qhp->wq.wptr); ++(qhp->wq.sq_wptr); spin_unlock_irqrestore(&qhp->lock, flag); From Thomas.Talpey at netapp.com Tue May 27 11:35:19 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 27 May 2008 14:35:19 -0400 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core:MEM_MGT_EXTENSIONS support In-Reply-To: <8A71B368A89016469F72CD08050AD33402DE30DE@maui.asicdesigner s.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> <8A71B368A89016469F72CD08050AD33402DE30DE@maui.asicdesigners.com> Message-ID: At 12:58 PM 5/27/2008, Felix Marti wrote: >RDMA Read with Local Invalidate does not affect the wire. The 'must >invalidate' state is kept in the RNIC that issues the RDMA Read >Request... Aha, okay that was not clear to me. What information does the RNIC use to line up the arrival of the RDMA Read response with the "must invalidate" state? Also, how does the RNIC signal whether the invalidation actually occurred, so the upper layer can defend itself from attack? Tom. From swise at opengridcomputing.com Tue May 27 11:40:28 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 May 2008 13:40:28 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core:MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> <8A71B368A89016469F72CD08050AD33402DE30DE@maui.asicdesigners.com> Message-ID: <483C559C.90203@opengridcomputing.com> Talpey, Thomas wrote: > At 12:58 PM 5/27/2008, Felix Marti wrote: > >> RDMA Read with Local Invalidate does not affect the wire. The 'must >> invalidate' state is kept in the RNIC that issues the RDMA Read >> Request... >> > > Aha, okay that was not clear to me. What information does the RNIC use > to line up the arrival of the RDMA Read response with the "must invalidate" > state? The rnic already tracks outstanding read requests. It now also will track the local stag to invalidate when the read completes. > Also, how does the RNIC signal whether the invalidation actually > occurred, so the upper layer can defend itself from attack? 
> > The stag is guaranteed to be in the invalid state by the time the app reaps the read-inv-local work completion... Steve. From tom at opengridcomputing.com Tue May 27 11:58:43 2008 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 27 May 2008 13:58:43 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> Message-ID: <1211914723.4114.86.camel@trinity.ogc.int> On Tue, 2008-05-27 at 12:39 -0400, Talpey, Thomas wrote: > At 11:33 AM 5/27/2008, Tom Tucker wrote: > >So I think from an NFSRDMA coding perspective it's a wash... > > Just to be clear, you're talking about the NFS/RDMA server. However, it's > pretty much a wash on the client, for different reasons. > Tom: What client side memory registration strategy do you recommend if the default on the server side is fastreg? On the performance side we are limited by the min size of the read/write-chunk element. If the client still gives the server a 4k chunk, the performance benefit (fewer PDU on the wire) goes away. Tom > >When posting the WR, We check the fastreg capabilities bit + transport > >type bit: > >If fastreg is true --> > > Post FastReg > > If iWARP (or with a cap bit read-with-inv-flag) > > post rdma read w/ invalidate > > >... For iWARP's case, this means rdma-read-w-inv, > >plus rdma-send-w-inv, etc... > > > Maybe I'm confused, but I don't understand this. iWARP RDMA Read requests > don't support remote invalidate. At least, the table in RFC5040 (p.22) doesn't: > > > > -------+-----------+-------+------+-------+-----------+-------------- > RDMA | Message | Tagged| STag | Queue | Invalidate| Message > Message| Type | Flag | and | Number| STag | Length > OpCode | | | TO | | | Communicated > | | | | | | between DDP > | | | | | | and RDMAP > -------+-----------+-------+------+-------+-----------+-------------- > 0000b | RDMA Write| 1 | Valid| N/A | N/A | Yes > | | | | | | > -------+-----------+-------+------+-------+-----------+-------------- > 0001b | RDMA Read | 0 | N/A | 1 | N/A | Yes > | Request | | | | | > -------+-----------+-------+------+-------+-----------+-------------- > 0010b | RDMA Read | 1 | Valid| N/A | N/A | Yes > | Response | | | | | > -------+-----------+-------+------+-------+-----------+-------------- > 0011b | Send | 0 | N/A | 0 | N/A | Yes > | | | | | | > -------+-----------+-------+------+-------+-----------+-------------- > 0100b | Send with | 0 | N/A | 0 | Valid | Yes > | Invalidate| | | | | > -------+-----------+-------+------+-------+-----------+-------------- > 0101b | Send with | 0 | N/A | 0 | N/A | Yes > | SE | | | | | > -------+-----------+-------+------+-------+-----------+-------------- > 0110b | Send with | 0 | N/A | 0 | Valid | Yes > | SE and | | | | | > | Invalidate| | | | | > -------+-----------+-------+------+-------+-----------+-------------- > 0111b | Terminate | 0 | N/A | 2 | N/A | Yes > | | | | | | > -------+-----------+-------+------+-------+-----------+-------------- > 1000b | | > to | Reserved | Not Specified > 1111b | | > -------+-----------+------------------------------------------------- > > > > I want to take this opportunity to also mention that the RPC/RDMA client-server > exchange does not support remote-invalidate currently. 
Because of the multiple > stags supported by the rpcrdma chunking header, and because the client needs > to verify that the stags were in fact invalidated, there is significant overhead, > and the jury is out on that benefit. In fact, I suspect it's a lose at the client. > > Tom (Talpey). > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From swise at opengridcomputing.com Tue May 27 11:58:19 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 May 2008 13:58:19 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <1211914723.4114.86.camel@trinity.ogc.int> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> <1211914723.4114.86.camel@trinity.ogc.int> Message-ID: <483C59CB.2060308@opengridcomputing.com> Tom Tucker wrote: > On Tue, 2008-05-27 at 12:39 -0400, Talpey, Thomas wrote: > >> At 11:33 AM 5/27/2008, Tom Tucker wrote: >> >>> So I think from an NFSRDMA coding perspective it's a wash... >>> >> Just to be clear, you're talking about the NFS/RDMA server. However, it's >> pretty much a wash on the client, for different reasons. >> >> > Tom: > > What client side memory registration strategy do you recommend if the > default on the server side is fastreg? > > On the performance side we are limited by the min size of the > read/write-chunk element. If the client still gives the server a 4k > chunk, the performance benefit (fewer PDU on the wire) goes away. > > Tom > > I would hope that dma_mr usage will be replaced with fast_reg on both the client and the server. >>> When posting the WR, We check the fastreg capabilities bit + transport >>> type bit: >>> If fastreg is true --> >>> Post FastReg >>> If iWARP (or with a cap bit read-with-inv-flag) >>> post rdma read w/ invalidate >>> >>> ... For iWARP's case, this means rdma-read-w-inv, >>> plus rdma-send-w-inv, etc... >>> >> Maybe I'm confused, but I don't understand this. iWARP RDMA Read requests >> don't support remote invalidate. 
At least, the table in RFC5040 (p.22) doesn't: >> >> >> >> -------+-----------+-------+------+-------+-----------+-------------- >> RDMA | Message | Tagged| STag | Queue | Invalidate| Message >> Message| Type | Flag | and | Number| STag | Length >> OpCode | | | TO | | | Communicated >> | | | | | | between DDP >> | | | | | | and RDMAP >> -------+-----------+-------+------+-------+-----------+-------------- >> 0000b | RDMA Write| 1 | Valid| N/A | N/A | Yes >> | | | | | | >> -------+-----------+-------+------+-------+-----------+-------------- >> 0001b | RDMA Read | 0 | N/A | 1 | N/A | Yes >> | Request | | | | | >> -------+-----------+-------+------+-------+-----------+-------------- >> 0010b | RDMA Read | 1 | Valid| N/A | N/A | Yes >> | Response | | | | | >> -------+-----------+-------+------+-------+-----------+-------------- >> 0011b | Send | 0 | N/A | 0 | N/A | Yes >> | | | | | | >> -------+-----------+-------+------+-------+-----------+-------------- >> 0100b | Send with | 0 | N/A | 0 | Valid | Yes >> | Invalidate| | | | | >> -------+-----------+-------+------+-------+-----------+-------------- >> 0101b | Send with | 0 | N/A | 0 | N/A | Yes >> | SE | | | | | >> -------+-----------+-------+------+-------+-----------+-------------- >> 0110b | Send with | 0 | N/A | 0 | Valid | Yes >> | SE and | | | | | >> | Invalidate| | | | | >> -------+-----------+-------+------+-------+-----------+-------------- >> 0111b | Terminate | 0 | N/A | 2 | N/A | Yes >> | | | | | | >> -------+-----------+-------+------+-------+-----------+-------------- >> 1000b | | >> to | Reserved | Not Specified >> 1111b | | >> -------+-----------+------------------------------------------------- >> >> >> >> I want to take this opportunity to also mention that the RPC/RDMA client-server >> exchange does not support remote-invalidate currently. Because of the multiple >> stags supported by the rpcrdma chunking header, and because the client needs >> to verify that the stags were in fact invalidated, there is significant overhead, >> and the jury is out on that benefit. In fact, I suspect it's a lose at the client. >> >> Tom (Talpey). >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From pw at osc.edu Tue May 27 12:18:25 2008 From: pw at osc.edu (Pete Wyckoff) Date: Tue, 27 May 2008 15:18:25 -0400 Subject: [ofa-general] Re: mthca MR attrs userspace change In-Reply-To: References: <20080527180004.GA15444@osc.edu> Message-ID: <20080527191825.GA15530@osc.edu> rdreier at cisco.com wrote on Tue, 27 May 2008 11:33 -0700: > > Nice that everything still works with old userspace, but where is > > the latest libmthca these days? The one at kernel.org still has > > ABI_VERSION 1: > > Actually this tricked me... the kernel ABI didn't get bumped so there > was no reason to bump the libmthca ABI. I'll actually use this slightly > simpler patch: Oh, yeah. Your patch put it back to 1. I missed that. This patch looks good, as far as I can tell without testing. Only CQ changes need dmasync, apparently. 
-- Pete From Thomas.Talpey at netapp.com Tue May 27 12:38:30 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 27 May 2008 15:38:30 -0400 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <1211914723.4114.86.camel@trinity.ogc.int> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> <1211914723.4114.86.camel@trinity.ogc.int> Message-ID: At 02:58 PM 5/27/2008, Tom Tucker wrote: > >On Tue, 2008-05-27 at 12:39 -0400, Talpey, Thomas wrote: >> At 11:33 AM 5/27/2008, Tom Tucker wrote: >> >So I think from an NFSRDMA coding perspective it's a wash... >> >> Just to be clear, you're talking about the NFS/RDMA server. However, it's >> pretty much a wash on the client, for different reasons. >> >Tom: > >What client side memory registration strategy do you recommend if the >default on the server side is fastreg? "Whatever is fastest and safest". Given that the client and server won't necessarily be using the same hardware, nor the same kernel for that matter, I don't think we can or should legislate it. That said, I am hopeful that "fastreg" does turn out to be "fast" and therefore will become the only logical choice for the NFS/RDMA Linux client. But the future Linux client is only one such system. I cannot speak for others. Tom. > >On the performance side we are limited by the min size of the >read/write-chunk element. If the client still gives the server a 4k >chunk, the performance benefit (fewer PDU on the wire) goes away. From akepner at sgi.com Tue May 27 12:39:11 2008 From: akepner at sgi.com (akepner at sgi.com) Date: Tue, 27 May 2008 12:39:11 -0700 Subject: [ofa-general] Re: mthca MR attrs userspace change In-Reply-To: References: <20080527180004.GA15444@osc.edu> Message-ID: <20080527193911.GF23650@sgi.com> On Tue, May 27, 2008 at 11:33:55AM -0700, Roland Dreier wrote: > ... > Actually this tricked me... the kernel ABI didn't get bumped so there > was no reason to bump the libmthca ABI. I'll actually use this slightly > simpler patch: I was wondering about that... FWIW, the patch (which I deleted below) looked good to me. -- Arthur From Thomas.Talpey at netapp.com Tue May 27 12:42:01 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 27 May 2008 15:42:01 -0400 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core:MEM_MGT_EXTENSIONS support In-Reply-To: <483C559C.90203@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> <8A71B368A89016469F72CD08050AD33402DE30DE@maui.asicdesigners.com> <483C559C.90203@opengridcomputing.com> Message-ID: At 02:40 PM 5/27/2008, Steve Wise wrote: >Talpey, Thomas wrote: >> At 12:58 PM 5/27/2008, Felix Marti wrote: >> >>> RDMA Read with Local Invalidate does not affect the wire. The 'must >>> invalidate' state is kept in the RNIC that issues the RDMA Read >>> Request... >>> >> >> Aha, okay that was not clear to me. What information does the RNIC use >> to line up the arrival of the RDMA Read response with the "must invalidate" >> state? > >The rnic already tracks outstanding read requests. 
It now also will >track the local stag to invalidate when the read completes. Ah - okay, so the stag that actually gets invalidated was provided with the RDMA Read request posting, and is not necessarily the stag that arrived in the peer's RDMA Read response. That helps. What happens if the upper layer gives up and invalidates the stag itself, and the peer's RDMA Read response arrives later? Nothing bad, I assume, and the peer's response is denied? > >> Also, how does the RNIC signal whether the invalidation actually >> occurred, so the upper layer can defend itself from attack? >> >> > >The stag is guaranteed to be in the invalid state by the time the app >reaps the read-inv-local work completion... Ok, given my correct understanding of the source of the stag above. Tom. From akepner at sgi.com Tue May 27 12:40:17 2008 From: akepner at sgi.com (akepner at sgi.com) Date: Tue, 27 May 2008 12:40:17 -0700 Subject: [ofa-general] Re: mthca MR attrs userspace change In-Reply-To: <20080527191825.GA15530@osc.edu> References: <20080527180004.GA15444@osc.edu> <20080527191825.GA15530@osc.edu> Message-ID: <20080527194017.GG23650@sgi.com> On Tue, May 27, 2008 at 03:18:25PM -0400, Pete Wyckoff wrote: > > ... Only CQ changes need dmasync, apparently. Right. -- Arthur From swise at opengridcomputing.com Tue May 27 12:59:16 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 May 2008 14:59:16 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core:MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> <8A71B368A89016469F72CD08050AD33402DE30DE@maui.asicdesigners.com> <483C559C.90203@opengridcomputing.com> Message-ID: <483C6814.4060706@opengridcomputing.com> Talpey, Thomas wrote: > At 02:40 PM 5/27/2008, Steve Wise wrote: > >> Talpey, Thomas wrote: >> >>> At 12:58 PM 5/27/2008, Felix Marti wrote: >>> >>> >>>> RDMA Read with Local Invalidate does not affect the wire. The 'must >>>> invalidate' state is kept in the RNIC that issues the RDMA Read >>>> Request... >>>> >>>> >>> Aha, okay that was not clear to me. What information does the RNIC use >>> to line up the arrival of the RDMA Read response with the "must invalidate" >>> state? >>> >> The rnic already tracks outstanding read requests. It now also will >> track the local stag to invalidate when the read completes. >> > > Ah - okay, so the stag that actually gets invalidated was provided with > the RDMA Read request posting, and is not necessarily the stag that > arrived in the peer's RDMA Read response. That helps. > > What happens if the upper layer gives up and invalidates the stag itself, > and the peer's RDMA Read response arrives later? Nothing bad, I assume, > and the peer's response is denied? > > It behaves just like any other tagged message arriving and the target stag is invalid. The connection is torn down via an RDMAP TERMINATE... >>> Also, how does the RNIC signal whether the invalidation actually >>> occurred, so the upper layer can defend itself from attack? >>> >>> >>> >> The stag is guaranteed to be in the invalid state by the time the app >> reaps the read-inv-local work completion... >> > > Ok, given my correct understanding of the source of the stag above. > > Tom. 
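[To make the invalidation semantics above concrete, here is a minimal sketch of posting such a read from the consumer side, assuming an IB_WR_RDMA_READ_WITH_INV opcode and an invalidate rkey field along the lines being discussed in this thread; the names are hypothetical, not a settled API.]

    /*
     * Post an RDMA Read whose completion also invalidates the local
     * stag supplied at posting time; per the discussion above, that
     * stag is guaranteed to be invalid by the time the work
     * completion is reaped.
     */
    static int post_read_with_inv(struct ib_qp *qp, struct ib_sge *sge,
                                  u64 remote_addr, u32 rkey, u32 local_stag)
    {
        struct ib_send_wr wr, *bad_wr;

        memset(&wr, 0, sizeof wr);
        wr.opcode              = IB_WR_RDMA_READ_WITH_INV;
        wr.send_flags          = IB_SEND_SIGNALED;
        wr.sg_list             = sge;
        wr.num_sge             = 1;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
        wr.ex.invalidate_rkey  = local_stag;   /* hypothetical field: local stag to invalidate */

        return ib_post_send(qp, &wr, &bad_wr);
    }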
> From rdreier at cisco.com Tue May 27 13:00:17 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 13:00:17 -0700 Subject: [ofa-general] [PATCH v2] IB/ipoib: copy small SKBs in CM mode In-Reply-To: <1211879148.13769.94.camel@mtls03> (Eli Cohen's message of "Tue, 27 May 2008 12:05:48 +0300") References: <1211879148.13769.94.camel@mtls03> Message-ID: thanks, applied for 2.6.27... > + /* > + * we rely on this condition when copying small skbs and we > + * pass ownership of the first fragment only. > + */ > + if (SKB_TSHOLD > IPOIB_CM_HEAD_SIZE) { > + printk("%s: SKB_TSHOLD(%d) must not be larger then %d\n", > + THIS_MODULE->name, SKB_TSHOLD, IPOIB_CM_HEAD_SIZE); > + return -EINVAL; > + } I changed this to a build bug, to avoid waiting until runtime to notice this problem: + /* + * When copying small received packets, we only copy from the + * linear data part of the SKB, so we rely on this condition. + */ + BUILD_BUG_ON(IPOIB_CM_COPYBREAK > IPOIB_CM_HEAD_SIZE); From Thomas.Talpey at netapp.com Tue May 27 13:24:32 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 27 May 2008 16:24:32 -0400 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core:MEM_MGT_EXTENSIONS support In-Reply-To: <483C6814.4060706@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> <8A71B368A89016469F72CD08050AD33402DE30DE@maui.asicdesigners.com> <483C559C.90203@opengridcomputing.com> <483C6814.4060706@opengridcomputing.com> Message-ID: At 03:59 PM 5/27/2008, Steve Wise wrote: >Talpey, Thomas wrote: >> What happens if the upper layer gives up and invalidates the stag itself, >> and the peer's RDMA Read response arrives later? Nothing bad, I assume, >> and the peer's response is denied? >> >> > >It behaves just like any other tagged message arriving and the target >stag is invalid. The connection is torn down via an RDMAP TERMINATE... I was wondering more about the dangling stag reference that the original work request carried. Normally, it would reference the still-valid stag, but if that stag was torn down (causing the invalidation to point to nothing), or worse, re-bound (causing it to point at something else!), then it's a possible issue? Sorry to seem paranoid here. Storage is pretty sensitive to silent data corruption avenues. Because they always find a way to happen. Tom. From swise at opengridcomputing.com Tue May 27 13:33:52 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 May 2008 15:33:52 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core:MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> <8A71B368A89016469F72CD08050AD33402DE30DE@maui.asicdesigners.com> <483C559C.90203@opengridcomputing.com> <483C6814.4060706@opengridcomputing.com> Message-ID: <483C7030.7040208@opengridcomputing.com> An HTML attachment was scrubbed... 
URL: From rdreier at cisco.com Tue May 27 13:50:54 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 13:50:54 -0700 Subject: [ofa-general] [ANNOUNCE] libmthca 1.0.5 released Message-ID: libmthca is a userspace driver for Mellanox InfiniBand HCAs. It is a plug-in module for libibverbs that allows programs to use Mellanox hardware directly from userspace. A new stable release, libmthca 1.0.5, is available from http://www.openfabrics.org/downloads/mlx4/libmthca-1.0.5.tar.gz with sha1sum a68b1de47d320546c7bcc92bfa9c482f7d74fac1 /data/home/roland/libmthca-1.0.5.tar.gz I also tagged the 1.0.5 release of libmthca and pushed it out to my git tree on kernel.org: git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git (the name of the tag is libmthca-1.0.5). Builds for the Ubuntu 7.10 and 8.04 releases will be available by adding the lines (replacing hardy by gutsy if needed) deb http://ppa.launchpad.net/roland-digitalvampire/ubuntu hardy main deb-src http://ppa.launchpad.net/roland-digitalvampire/ubuntu hardy main to your /etc/sources.list file, and updated Debian and Fedora packages will work their way into the main archives. This release fixes several bugs and adds support for the new kernel interface for mr attrs (which will be shipped in 2.6.26). The complete list of changes since 1.0.4 is: Eli Cohen (3): Ensure an receive WQEs are in memory before linking to chain Remove checks for srq->first_free < 0 IB/ib_mthca: Pre-link receive WQEs in Tavor mode Jack Morgenstein (1): Clear context struct at allocation time Michael S. Tsirkin (2): Fix posting >255 recv WRs for Tavor Set cleaned CQEs back to HW ownership when cleaning CQ Roland Dreier (21): Fix paths in Debian install files for libibverbs 1.1 Update Debian changelog debian/rules: Remove DEB_DH_STRIP_ARGS Fix handling of send CQE with error for QPs connected to SRQ Add missing wmb() in mthca_tavor_post_send() Remove deprecated ${Source-Version} from debian/control Clean up NVALGRIND comment in config.h.in Fix Valgrind annotations so they can actually be built Remove ibv_driver_init from linker version script Fix spec file License: tag Mark "driver" file in sysconfdir with %config Update Debian policy version to 3.7.3 Fix Valgrind false positives in mthca_create_cq() and mthca_create_srq() Add debian/watch file Update Debian build to avoid setting RPATH Change openib.org URLs to openfabrics.org URLs Fix CQ cleanup when QP is destroyed Update libmthca to handle new kernel ABI Include spec file changes from Fedora CVS Remove %config tag from mthca.driver file Roll libmthca-1.0.5 release From kliteyn at dev.mellanox.co.il Tue May 27 14:31:58 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 28 May 2008 00:31:58 +0300 Subject: [ofa-general] OpenSM? In-Reply-To: <20080527100859.6d48cd45.weiny2@llnl.gov> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <5E22F547-0F44-4E56-A72C-DE9FC83EE697@hpc.ufl.edu> <20080527100859.6d48cd45.weiny2@llnl.gov> Message-ID: <483C7DCE.7080507@dev.mellanox.co.il> Charles, Ira Weiny wrote: > Charles, > > Here at LLNL we have been running OpenSM for some time. Thus far we are very > happy with it's performance. 
Our largest cluster is 1152 nodes and OpenSM can > bring it up (not counting boot time) in less than a minute. OpenSM is successfully running on some large clusters with 4-5K nodes. It takes about 2-3 minutes to bring up such clusters. > Here are some details. > > We are running v3.1.10 of OpenSM with some minor modifications (mostly patches > which have been submitted upstream and been accepted by Sasha but are not yet > in a release.) > > Our clusters are all Fat-tree topologies. > > We have a node which is more or less dedicated to running OpenSM. We have some > other monitoring software running on it, but OpenSM can utilize the CPU/Memory > if it needs to. > > A) On our large clusters this node is a 4 socket, dual core (8 cores > total) Opteron running at 2.4Gig with 16Gig of memory. I don't believe > OpenSM needs this much but the nodes were built all the same so this is > what it got. > > B) On one of our smaller clusters (128 nodes) OpenSM is running on a > dual socket, single core (2 core) 2.4Gig Opteron nodes with 2Gig of > memory. We have not seen any issues with this cluster and OpenSM. > > We run with the up/down algorithm, ftree has not panned out for us yet. I > can't say how that would compare to the Cisco algorithms. If the cluster topology is fat-tree, then there is a ftree and up/down routing. Ftree would be a good choice if you need LMC=0 (plus if the topology complies with certain fat-tree rules). For any other tree, or for LMC>0, up/down should work. -- Yevgeny > In short OpenSM should work just fine on your cluster. > > Hope this helps, > Ira > > > On Tue, 27 May 2008 11:15:14 -0400 > Charles Taylor wrote: > >> We have a 400 node IB cluster. We are running an embedded SM in >> failover mode on our TS270/Cisco7008 core switches. Lately we have >> been seeing problems with LID assignment when rebooting nodes (see log >> messages below). It is also taking far too long for LIDS to be >> assigned as it takes on the order of minutes for the ports to >> transition to "ACTIVE". >> >> This seems like a bug to us and we are considering switching to >> OpenSM on a host. I'm wondering about experience with running >> OpenSM for medium to large clusters (Fat Tree) and what resources >> (memory/cpu) we should plan on for the host node. 
>> >> Thanks, >> >> Charlie Taylor >> UF HPC Center >> >> May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: >> ********************** NEW SWEEP ******************** >> May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: >> Rediscover >> the subnet >> May 27 14:14:13 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM >> OUT_OF_SERVICE >> trap for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59 >> May 27 14:14:13 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:256]: An >> existing IB >> node GUID 00:02:c9:02:00:21:4b:59 LID 194 was removed >> May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM >> DELETE_MC_GROUP >> trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59 >> May 27 14:14:14 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]: >> Topology >> changed >> May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Configuration >> caused by >> discovering removed ports >> May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: >> async events >> require sweep >> May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: >> ********************** NEW SWEEP ******************** >> May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: >> Rediscover >> the subnet >> May 27 14:16:28 topspin-270sc ib_sm.x[812]: [ib_sm_discovery.c:1009]: no >> routing required for port guid 00:02:c9:02:00:21:4b:59, lid 194 >> May 27 14:16:30 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]: >> Topology >> changed >> May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration >> caused by >> discovering new ports >> May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration >> caused by >> multicast membership change >> May 27 14:16:30 topspin-270sc ib_sm.x[812]: [ib_sm_assign.c:588]: >> Force port to >> go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1 >> May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:562]: >> Program port >> state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor >> node=00:02:c9:02:00:21:4b:58, port= 1, current state 2 >> May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:733]: >> Failed to >> negotiate MTU, op_vl for node=00:02:c9:02:00:21:4b:58, port= 1, mad >> status 0x1c >> May 27 14:18:42 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM >> IN_SERVICE trap >> for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59 >> May 27 14:18:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:144]: A new >> IB node >> 00:02:c9:02:00:21:4b:59 was discovered and assigned LID 0 >> May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: >> async events >> require sweep >> May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: >> ********************** NEW SWEEP ******************** >> May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: >> Rediscover >> the subnet >> May 27 14:18:46 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No >> topology >> change >> May 27 14:18:46 topspin-270sc ib_sm.x[803]: [INFO]: Configuration >> caused by >> previous GET/SET operation failures >> May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:545]: >> Reassigning >> LID, node - GUID=00:02:c9:02:00:21:4b:58, port=1, new LID=411, curr >> LID=0 >> May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:588]: >> Force port to >> go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1 >> May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:635]: >> Clean up SA >> resources for port forced down due to LID conflict, node 
- >> GUID=00:02:c9:02:00:21:4b:58, port=1 >> May 27 14:18:47 topspin-270sc ib_sm.x[803]: [ib_sm_assign.c:667]: >> cleaning DB >> for guid 00:02:c9:02:00:21:4b:59, lid 194 >> May 27 14:18:47 topspin-270sc ib_sm.x[803]: [ib_sm_routing.c:2936]: >> _ib_smAllocSubnet: initRate= 4 >> May 27 14:18:47 topspin-270sc last message repeated 23 times >> May 27 14:18:47 topspin-270sc ib_sm.x[803]: [INFO]: Different capacity >> links >> detected in the network >> May 27 14:21:01 topspin-270sc ib_sm.x[820]: [ib_sm_bringup.c:516]: >> Active >> port(s) now in INIT state node=00:02:c9:02:00:21:4b:58, port=16, >> state=2, >> neighbor node=00:02:c9:02:00:21:4b:58, port=1, state=2 >> May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: >> async events >> require sweep >> May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: >> ********************** NEW SWEEP ******************** >> May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: >> Rediscover >> the subnet >> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No >> topology >> change >> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:525]: IB node >> 00:06:6a:00:d9:00:04:5d port 16 is INIT state >> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Configuration >> caused by >> some ports in INIT state >> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Configuration >> caused by >> previous GET/SET operation failures >> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_routing.c:2936]: >> _ib_smAllocSubnet: initRate= 4 >> May 27 14:21:05 topspin-270sc last message repeated 23 times >> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Different capacity >> links >> detected in the network >> May 27 14:23:19 topspin-270sc ib_sm.x[817]: [ib_sm_bringup.c:562]: >> Program port >> state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor >> node=00:02:c9:02:00:21:4b:58, port= 1, current state 2 >> May 27 14:23:24 topspin-270sc ib_sm.x[823]: [INFO]: Generate SM >> CREATE_MC_GROUP >> trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59 >> May 27 14:23:24 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: >> async events >> require sweep >> May 27 14:23:24 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: >> ********************** NEW SWEEP ******************** >> May 27 14:23:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No >> topology >> change >> May 27 14:23:26 topspin-270sc ib_sm.x[803]: [INFO]: Configuration >> caused by >> multicast membership change >> May 27 14:23:33 topspin-270sc ib_sm.x[826]: [INFO]: Standby SM guid >> 00:05:ad:00:00:02:3c:60, is no longer synchronized with Master SM >> May 27 14:25:39 topspin-270sc ib_sm.x[826]: [INFO]: Initialize a >> backup session >> with Standby SM guid 00:05:ad:00:00:02:3c:60 >> May 27 14:25:39 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: >> async events >> require sweep >> May 27 14:25:39 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: >> ********************** NEW SWEEP ******************** >> May 27 14:25:39 topspin-270sc ib_sm.x[826]: [INFO]: Standby SM guid >> 00:05:ad:00:00:02:3c:60, started synchronizing with Master SM >> May 27 14:25:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No >> topology >> change >> May 27 14:25:42 topspin-270sc ib_sm.x[803]: [INFO]: Configuration >> caused by >> multicast membership change >> May 27 14:25:43 topspin-270sc ib_sm.x[826]: [INFO]: Master SM DB >> synchronized >> with Standby SM guid 00:05:ad:00:00:02:3c:60 >> May 27 14:25:43 topspin-270sc ib_sm.x[826]: 
[INFO]: Master SM DB >> synchronized >> with all designated backup SMs >> May 27 14:28:04 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: >> ********************** NEW SWEEP ******************** >> May 27 14:28:06 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No >> topology >> change >> >> On May 23, 2008, at 2:20 PM, Steve Wise wrote: >> >>> Or Gerlitz wrote: >>>> Steve Wise wrote: >>>>> Are we sure we need to expose this to the user? >>>> I believe this is the way to go if we want to let smart ULPs >>>> generate new rkey/stag per mapping. Simpler ULPs could then just >>>> put the same value for each map associated with the same mr. >>>> >>>> Or. >>>> >>> How should I add this to the API? >>> >>> Perhaps we just document the format of an rkey in the struct ib_mr. >>> Thus the app would do this to change the key before posting the >>> fast_reg_mr wr (coded to be explicit, not efficient): >>> >>> u8 newkey; >>> u32 newrkey; >>> >>> newkey = 0xaa; >>> newrkey = (mr->rkey & 0xffffff00) | newkey; >>> mr->rkey = newrkey >>> wr.wr.fast_reg.mr = mr; >>> ... >>> >>> >>> Note, this assumes mr->rkey is in host byte order (I think the linux >>> rdma code assumes this in other places too). >>> >>> >>> Steve. >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Tue May 27 15:53:02 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 15:53:02 -0700 Subject: [ofa-general] Re: mthca MR attrs userspace change In-Reply-To: <20080527180004.GA15444@osc.edu> (Pete Wyckoff's message of "Tue, 27 May 2008 14:00:04 -0400") References: <20080527180004.GA15444@osc.edu> Message-ID: libmthca-1.0.5 is now in Fedora 9 proposed updates -- http://koji.fedoraproject.org/koji/buildinfo?buildID=50682 I'm not sure exactly how it makes it into real F-9. - R. From YJia at tmriusa.com Tue May 27 16:28:27 2008 From: YJia at tmriusa.com (Yicheng Jia) Date: Tue, 27 May 2008 18:28:27 -0500 Subject: [ofa-general] MLX HCA: CQ request notification for multiple completions not implemented? In-Reply-To: <48371ACE.908@gmail.com> Message-ID: Thanks for your reply. I'm using one CQ for all the WRs. Do you know why there's no ARM-N support in MLX drivers? My concern is the performance. The overhead of software poll_cq loop is quite significant if there are multiple pieces of small amount of data to be transferred on both sender/receiver sides. For instance, on the sender, the data I have are 1k, 1k, 2k, 1k..., on the receiver side, the data size and blocks are the same, 1k, 1k, 2k, 1k.... Do you have a good solution for such kind of problem? Best, Yicheng Dotan Barak 05/23/2008 01:27 PM To Yicheng Jia cc general at lists.openfabrics.org Subject Re: [ofa-general] MLX HCA: CQ request notification for multiple completions not implemented? Hi. Yicheng Jia wrote: > > Hi Folks, > > I'm trying to use CQ Event notification for multiple completions > (ARM_N) according to Mellanox Lx III user manual for scatter/gathering > RDMA. However I couldn't find it in current MLX driver. 
It seems to me > that only ARM_NEXT and ARM_SOLICIT are implemented. So if there are > multiple work requests, I have to use "poll_cq" to synchronously wait > until all the requests are done, is that correct? Is there a way to do > asynchronous multiple send by subscribing for an ARM_N event? You are right: the low level drivers of Mellanox devices don't support ARM-N (this feature is supported by the devices, but it wasn't implemented in the low level drivers). You are right, in order to read all of the completions you need to use poll_cq. By the way: do you have to create a completion for every WR? (If you are using one QP, this may solve your problem.) Dotan _____________________________________________________________________________ Scanned by IBM Email Security Management Services powered by MessageLabs. For more information please visit http://www.ers.ibm.com _____________________________________________________________________________ -------------- next part -------------- An HTML attachment was scrubbed... URL: From swise at opengridcomputing.com Tue May 27 19:05:36 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 May 2008 21:05:36 -0500 Subject: [ofa-general] [PATCH RFC v4 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <20080527183549.32168.22959.stgit@dell3.ogc.int> References: <20080527183429.32168.14351.stgit@dell3.ogc.int> <20080527183549.32168.22959.stgit@dell3.ogc.int> Message-ID: <483CBDF0.7030209@opengridcomputing.com> > enum ib_send_flags { > @@ -676,6 +683,19 @@ struct ib_send_wr { > u16 pkey_index; /* valid for GSI only */ > u8 port_num; /* valid for DR SMPs on switch only */ > } ud; > + struct { > + u64 iova_start; > + struct ib_mr *mr; > + struct ib_fast_reg_page_list *page_list; > + unsigned int page_shift; > + unsigned int page_list_len; > + unsigned int first_byte_offset; > + u32 length; > + int access_flags; > + } fast_reg; > + struct { > + struct ib_mr *mr; > + } local_inv; > } wr; > }; Ok, while writing a test case for all this jazz, I see that passing the struct ib_mr pointer to both IB_WR_FAST_REGISTER_MR and IB_WR_INVALIDATE_MR is perhaps bad. Consider a chain of WRs: an INVALIDATE_MR linked to a FAST_REGISTER_MR and passed to the provider via a single ib_post_send() call. You can't do that if you want to bump the key value between the invalidate and the fast_reg with the new key, which is probably what apps want to do. You are forced, under this proposed API, to post the two WRs separately and call ib_update_fast_reg_key() in between the ib_post_send() calls. Perhaps we should just pass in a u32 rkey for both WRs instead of the mr pointer? Then the code could put the old rkey in the invalidate WR, put the newly updated rkey in the fast_reg WR, chain the two together, and do a single post. I think this is the way to go: change the fast_reg and local_inv unions to take a u32 rkey instead of a struct ib_mr *mr. Thoughts? Steve.
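[To make the proposal concrete, a rough sketch of the chained posting this would enable, assuming the fast_reg and local_inv unions carry a u32 rkey as suggested above; the opcode and field names follow this RFC and are illustrative, and mr, qp, page_list, npages, iova and len are assumed from the surrounding code.]

    /*
     * Invalidate the MR under its old rkey and re-register it under a
     * bumped key, chained so that a single ib_post_send() suffices and
     * no ib_update_fast_reg_key() call is needed in between.
     */
    struct ib_send_wr inv_wr, frmr_wr, *bad_wr;
    u32 old_rkey = mr->rkey;
    u32 new_rkey = (old_rkey & 0xffffff00) | new_key;  /* new 8-bit key */
    int ret;

    memset(&inv_wr, 0, sizeof inv_wr);
    inv_wr.opcode            = IB_WR_INVALIDATE_MR;
    inv_wr.wr.local_inv.rkey = old_rkey;    /* invalidate the old key */
    inv_wr.next              = &frmr_wr;

    memset(&frmr_wr, 0, sizeof frmr_wr);
    frmr_wr.opcode                        = IB_WR_FAST_REGISTER_MR;
    frmr_wr.send_flags                    = IB_SEND_SIGNALED;
    frmr_wr.wr.fast_reg.rkey              = new_rkey;  /* register under new key */
    frmr_wr.wr.fast_reg.iova_start        = iova;
    frmr_wr.wr.fast_reg.page_list         = page_list;
    frmr_wr.wr.fast_reg.page_list_len     = npages;
    frmr_wr.wr.fast_reg.page_shift        = PAGE_SHIFT;
    frmr_wr.wr.fast_reg.first_byte_offset = 0;
    frmr_wr.wr.fast_reg.length            = len;
    frmr_wr.wr.fast_reg.access_flags      = IB_ACCESS_LOCAL_WRITE |
                                            IB_ACCESS_REMOTE_READ;

    ret = ib_post_send(qp, &inv_wr, &bad_wr);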
From rdreier at cisco.com Tue May 27 20:59:55 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 20:59:55 -0700 Subject: [ofa-general] [PATCH RFC v4 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <483CBDF0.7030209@opengridcomputing.com> (Steve Wise's message of "Tue, 27 May 2008 21:05:36 -0500") References: <20080527183429.32168.14351.stgit@dell3.ogc.int> <20080527183549.32168.22959.stgit@dell3.ogc.int> <483CBDF0.7030209@opengridcomputing.com> Message-ID: > Perhaps we should just pass in a u32 rkey for both WRs instead of the > mr pointer? Then the code could put the old rkey in the invalidate > WR, and the newly updated rkey in the fast_reg WR and chain the two > together and do a single post. Makes sense to me. The only thing I would worry about would be if some device needs the actual mr struct pointer to post the work request, but mlx4 and I guess cxgb3 don't at least and I don't see a good reason why another device would. Let's go for it. - R. From rdreier at cisco.com Tue May 27 21:23:06 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 21:23:06 -0700 Subject: [ofa-general] Re: [PATCH] libibverbs: fix coding style typos according to checkpatch.pl In-Reply-To: <200805232332.07576.dotanba@gmail.com> (Dotan Barak's message of "Fri, 23 May 2008 23:32:07 +0300") References: <200805232332.07576.dotanba@gmail.com> Message-ID: thanks, applied
From rdreier at cisco.com Tue May 27 22:26:04 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 22:26:04 -0700 Subject: [ofa-general] Re: [PATCH v2 08/13] QLogic VNIC: sysfs interface implementation for the driver In-Reply-To: <20080519103529.12355.82570.stgit@localhost.localdomain> (Ramachandra K.'s message of "Mon, 19 May 2008 16:05:29 +0530") References: <20080519102843.12355.832.stgit@localhost.localdomain> <20080519103529.12355.82570.stgit@localhost.localdomain> Message-ID: > +ssize_t vnic_create_primary(struct device *dev, > + struct device_attribute *dev_attr, const char *buf, > + size_t count) > +ssize_t vnic_create_secondary(struct device *dev, > + struct device_attribute *dev_attr, > + const char *buf, size_t count) > +ssize_t vnic_delete(struct device *dev, struct device_attribute *dev_attr, > + const char *buf, size_t count) These are all only referenced from a sysfs attribute defined in the same file, so they can be made static (and don't need extern declarations in a header file).
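[For anyone following the review, the pattern being asked for looks roughly like this; a sketch with a hypothetical body, not the actual VNIC code.]

    static ssize_t vnic_create_primary(struct device *dev,
                                       struct device_attribute *dev_attr,
                                       const char *buf, size_t count)
    {
        /* ... parse buf and create the primary connection here ... */
        return count;
    }

    /* The only reference is this attribute in the same file, so no
     * extern declaration is needed in a header. */
    static DEVICE_ATTR(create_primary, S_IWUSR, NULL, vnic_create_primary);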
From rdreier at cisco.com Tue May 27 22:28:46 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 22:28:46 -0700 Subject: [ofa-general] Re: [PATCH v2 03/13] QLogic VNIC: Implementation of communication protocol with EVIC/VEx In-Reply-To: <20080519103258.12355.6146.stgit@localhost.localdomain> (Ramachandra K.'s message of "Mon, 19 May 2008 16:02:58 +0530") References: <20080519102843.12355.832.stgit@localhost.localdomain> <20080519103258.12355.6146.stgit@localhost.localdomain> Message-ID: > +void viport_disconnect(struct viport *viport) > +{ > + VIPORT_FUNCTION("viport_disconnect()\n"); > + viport->disconnect = 1; > + viport_failure(viport); > + wait_event(viport->disconnect_queue, viport->disconnect == 0); > +} > + > +void viport_free(struct viport *viport) > +{ > + VIPORT_FUNCTION("viport_free()\n"); > + viport_disconnect(viport); /* NOTE: this can sleep */ There are no other calls to viport_disconnect() that I can see, so it can be made static (and the declaration in vnic_viport.h can be dropped). In fact, given how small the function is and the fact that it has only a single call site, it might be easier just to merge it into viport_free(). But that's a matter of taste. - R. From kliteyn at dev.mellanox.co.il Tue May 27 23:24:49 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 28 May 2008 09:24:49 +0300 Subject: [ofa-general] OpenSM? In-Reply-To: <20080527100859.6d48cd45.weiny2@llnl.gov> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <5E22F547-0F44-4E56-A72C-DE9FC83EE697@hpc.ufl.edu> <20080527100859.6d48cd45.weiny2@llnl.gov> Message-ID: <483CFAB1.5020409@dev.mellanox.co.il> Ira, Ira Weiny wrote: > > We run with the up/down algorithm, ftree has not panned out for us yet. Can you elaborate on that? Did ftree fail to digest the topology? Or did it do a lousy job configuring the subnet? Or perhaps you need LMC>0, which ftree doesn't support? -- Yevgeny From ogerlitz at voltaire.com Tue May 27 23:45:33 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 28 May 2008 09:45:33 +0300 Subject: [ofa-general] Re: [RFC PATCH 3/5] rdma/cma: simplify locking needed for serialization of callbacks In-Reply-To: <000e01c8c024$cb797730$ebc8180a@amr.corp.intel.com> References: <000301c8c011$ef202f20$ebc8180a@amr.corp.intel.com> <483C38D8.5020600@voltaire.com> <000e01c8c024$cb797730$ebc8180a@amr.corp.intel.com> Message-ID: <483CFF8D.2000001@voltaire.com> Sean Hefty wrote: > I think we should just remove cma_disable_remove() and cma_enable_remove(), and instead call mutex_lock/unlock directly in their places. Where cma_disable_remove() is called, add in appropriate state checks after acquiring > the mutex. OK, will do that. Or.
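[In code terms, the shape being agreed on here is roughly the following at each callback site; an illustrative sketch only. The V4 patch posted later in this thread wraps this check in a cma_disable_callback() helper and reads the state under the id spinlock.]

    /* serialize against device removal / id destruction */
    mutex_lock(&id_priv->handler_mutex);
    if (id_priv->state != CMA_CONNECT) {
        /* the id moved on or is being destroyed; drop the event */
        mutex_unlock(&id_priv->handler_mutex);
        return 0;
    }
    ret = id_priv->id.event_handler(&id_priv->id, &event);
    mutex_unlock(&id_priv->handler_mutex);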
From vlad at lists.openfabrics.org Wed May 28 03:09:57 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 28 May 2008 03:09:57 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20080528-0200 daily build status Message-ID: <20080528100957.CC3BBE60CAB@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Passed on ppc64 with linux-2.6.19 Failed: From hrosenstock at xsigo.com Wed May 28 04:06:30 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 28 May 2008 04:06:30 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080527175637.GB14205@sashak.voltaire.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com>
<1211887752.13185.212.camel@hrosenstock-ws.xsigo.com> <20080527175637.GB14205@sashak.voltaire.com> Message-ID: <1211972790.13185.332.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-27 at 20:56 +0300, Sasha Khapyorsky wrote: > On 04:29 Tue 27 May , Hal Rosenstock wrote: > > On Fri, 2008-05-23 at 05:52 -0700, Hal Rosenstock wrote: > > > > Following your logic we will need to disable root passwords > > > > typing too. > > > > > > That's taking it too far. Root passwords are at least hidden when > > > typing. > > > > At least hide the key typing from plain sight when typing like su does. > > There are lot of tools where password can be specified as clear text in > command line (wget, smbclient, etc..) - it is an user responsibility to > keep his sensitive data safe. Do those tools provide a way to obscure passwords or force the user to do this in plain sight ? Seems like a user can't do this without support from the tool. smbclient seems to provide this; I didn't look at wget. smbclient supports an authorization file which supports this and says: Make certain that the permissions on the file restrict access from unwanted users. As you mentioned before, this is another acceptable approach (and this also lends itself better to scripting). -- Hal > Sasha From hrosenstock at xsigo.com Wed May 28 04:06:31 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 28 May 2008 04:06:31 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080527175343.GA14205@sashak.voltaire.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> <20080527103341.GF12014@sashak.voltaire.com> <1211888036.13185.219.camel@hrosenstock-ws.xsigo.com> <20080527175343.GA14205@sashak.voltaire.com> Message-ID: <1211972791.13185.334.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-27 at 20:53 +0300, Sasha Khapyorsky wrote: > On 04:33 Tue 27 May , Hal Rosenstock wrote: > > > > > > Maybe yes, but could you be more specific? Store SMKey in read-only > > > file on a client side? > > > > Treat smkey as su treats password rather than a command line parameter > > is another alternative. > > Ok, let's do it as '--smkey X' and then saquery will ask for a value, > just like su does. Good? Works for me. > > > I'm not proposing to expose SM_Key, just added such option where this > > > key could be specified. > > > > How is that not exposing it ? > > Because (1) and (2) below. The original patch exposes the key when the option is invoked and that's just the time to hide it. -- Hal > Sasha > > > > > -- Hal > > > > > So: 1) this is *optional*, 2) there is no > > > suggestions about how the right value should be determined. 
> > > > > > Sasha > > From eli at dev.mellanox.co.il Wed May 28 04:13:45 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 28 May 2008 14:13:45 +0300 Subject: [ofa-general] [PATCH v2] IB/ipoib: copy small SKBs in CM mode In-Reply-To: References: <1211879148.13769.94.camel@mtls03> Message-ID: <1211973225.13769.147.camel@mtls03> On Tue, 2008-05-27 at 13:00 -0700, Roland Dreier wrote: > I changed this to a build bug, to avoid waiting until runtime to notice > this problem: > > + /* > + * When copying small received packets, we only copy from the > + * linear data part of the SKB, so we rely on this condition. > + */ > + BUILD_BUG_ON(IPOIB_CM_COPYBREAK > IPOIB_CM_HEAD_SIZE); I was looking for this one thing to make this check at compile time... thanks for letting us know. From ogerlitz at voltaire.com Wed May 28 04:14:40 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 28 May 2008 14:14:40 +0300 Subject: [ofa-general] Re: [RFC V3 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_NETDEV_CHANGE notification In-Reply-To: <000401c8c015$75c37890$ebc8180a@amr.corp.intel.com> References: <000401c8c015$75c37890$ebc8180a@amr.corp.intel.com> Message-ID: <483D3EA0.40506@voltaire.com> Sean Hefty wrote >> How do we know that the user hasn't tried to destroy the id from another >> callback? We need some sort of state check here. fixed >> >> +static int cma_netdev_align_id(struct net_device *ndev, struct rdma_id_private *id_priv) > nit - function name isn't clear to me. Maybe something like > cma_netdev_change_handler()? Although I'm not sure that netdev change is what > the user is really interested in. What they really want to know is if IP > address mapping/resolution changed. netdev is hidden from the user. I changed the function name to cma_netdev_change as it checks if there was some netdev change between the time of this ID address resolution to when the netdev event was delivered. The user doesn't get explicit notification from the rdma-cm on netdev change but rather on address change as you suggested next. > Maybe call this RDMA_CM_EVENT_ADDR_CHANGE? done Or. From ogerlitz at voltaire.com Wed May 28 04:31:15 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 28 May 2008 14:31:15 +0300 Subject: [ofa-general] Re: [RFC V3 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_NETDEV_CHANGE notification In-Reply-To: <000401c8c015$75c37890$ebc8180a@amr.corp.intel.com> References: <000401c8c015$75c37890$ebc8180a@amr.corp.intel.com> Message-ID: <483D4283.60307@voltaire.com> Sean Hefty wrote: >> +static int cma_netdev_callback(struct notifier_block *self, unsigned long event, void *ctx) >> + >> + mutex_lock(&lock); >> + list_for_each_entry(cma_dev, &dev_list, list) > > It seems like we just need to find the cma_dev that has the current mapping If your comment comes to say that maybe first find the cma_dev to which this event applies, I don't think its possible, see below. If I didn't get you right, can you please explain it a little more. The rdma-cm maintains a mapping between IDs to the physical devices. The mapping is established during address resolution using the HW address of the --network-- device that was resolved (eg through ARP and then looking on neigh->dev or route lookup) for this ID. In the bonding case, the network device --borrows-- the HW address from the active slave device. During fail-over, the bonding net device changes its HW address and then the netdev event is delivered on which this code acts. 
So the same cma_dev can have IDs with different netdev HW addresses in their dev_addr structs, say bond0 = and pdevA list = { , } depending on when address resolution was done for ID1,ID2 and on the ULP behavior on the ADDR_CHANGE event. I don't see how to get along with a simple check that tells us which cma_dev to search for matches. If we really want to avoid scanning the whole cma_dev list, we can add a mapping from --net devices-- to IDs and then scan only the list of the affiliated netdevice. So I am still left with the general rdma-cm mutex being taken for the duration of the double-loop... Or. From ogerlitz at voltaire.com Wed May 28 04:34:31 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 28 May 2008 14:34:31 +0300 (IDT) Subject: [ofa-general] [RFC V4 PATCH 3/5] rdma/cma: simplify locking needed for serialization of callbacks In-Reply-To: References: Message-ID: The rdma-cm has some logic in place to make sure that callbacks on an ID are delivered to the consumer in a serialized manner; specifically, it has code to protect against device removal racing with a callback that is being delivered to the user. This patch simplifies this logic by using a mutex per ID instead of the wait queue and atomic variable. I have left the disable/enable_remove notation such that the patch would be easier to read, but if this approach is accepted, I think we want to change it to disable/enable_callback. Signed-off-by: Or Gerlitz cma.c | 96 ++++++++++++++++++++++++++++++++---------------------------------- 1 files changed, 47 insertions(+), 49 deletions(-) changes from v2 (I named this v4 to comply with the next patch) - cma_disable_remove --> cma_disable_callback, acquire the mutex before the spinlock - removed cma_enable_remove and just call mutex_unlock(id->handler_mutex) instead Sean, basically you asked that cma_disable_remove be removed from the code, but this would spread taking the spin lock and doing state checks over all the places which call it, so I think it is nice to still have it. As for the spin lock usage, I preferred not to touch it, since the code of cma_comp, cma_exch, cma_comp_exch etc. uses it.
Index: linux-2.6.26-rc3/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.26-rc3.orig/drivers/infiniband/core/cma.c 2008-05-26 15:11:17.000000000 +0300 +++ linux-2.6.26-rc3/drivers/infiniband/core/cma.c 2008-05-28 11:08:24.000000000 +0300 @@ -126,8 +126,7 @@ struct rdma_id_private { struct completion comp; atomic_t refcount; - wait_queue_head_t wait_remove; - atomic_t dev_remove; + struct mutex handler_mutex; int backlog; int timeout_ms; @@ -351,28 +350,24 @@ static void cma_deref_id(struct rdma_id_ complete(&id_priv->comp); } -static int cma_disable_remove(struct rdma_id_private *id_priv, +static int cma_disable_callback(struct rdma_id_private *id_priv, enum cma_state state) { unsigned long flags; int ret; + mutex_lock(&id_priv->handler_mutex); spin_lock_irqsave(&id_priv->lock, flags); - if (id_priv->state == state) { - atomic_inc(&id_priv->dev_remove); + if (id_priv->state == state) ret = 0; - } else + else { + mutex_unlock(&id_priv->handler_mutex); ret = -EINVAL; + } spin_unlock_irqrestore(&id_priv->lock, flags); return ret; } -static void cma_enable_remove(struct rdma_id_private *id_priv) -{ - if (atomic_dec_and_test(&id_priv->dev_remove)) - wake_up(&id_priv->wait_remove); -} - static int cma_has_cm_dev(struct rdma_id_private *id_priv) { return (id_priv->id.device && id_priv->cm_id.ib); @@ -395,8 +390,7 @@ struct rdma_cm_id *rdma_create_id(rdma_c mutex_init(&id_priv->qp_mutex); init_completion(&id_priv->comp); atomic_set(&id_priv->refcount, 1); - init_waitqueue_head(&id_priv->wait_remove); - atomic_set(&id_priv->dev_remove, 0); + mutex_init(&id_priv->handler_mutex); INIT_LIST_HEAD(&id_priv->listen_list); INIT_LIST_HEAD(&id_priv->mc_list); get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num); @@ -923,7 +917,7 @@ static int cma_ib_handler(struct ib_cm_i struct rdma_cm_event event; int ret = 0; - if (cma_disable_remove(id_priv, CMA_CONNECT)) + if (cma_disable_callback(id_priv, CMA_CONNECT)) return 0; memset(&event, 0, sizeof event); @@ -980,12 +974,12 @@ static int cma_ib_handler(struct ib_cm_i /* Destroy the CM ID by returning a non-zero value. 
*/ id_priv->cm_id.ib = NULL; cma_exch(id_priv, CMA_DESTROYING); - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); rdma_destroy_id(&id_priv->id); return ret; } out: - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); return ret; } @@ -1097,7 +1091,7 @@ static int cma_req_handler(struct ib_cm_ int offset, ret; listen_id = cm_id->context; - if (cma_disable_remove(listen_id, CMA_LISTEN)) + if (cma_disable_callback(listen_id, CMA_LISTEN)) return -ECONNABORTED; memset(&event, 0, sizeof event); @@ -1118,7 +1112,7 @@ static int cma_req_handler(struct ib_cm_ goto out; } - atomic_inc(&conn_id->dev_remove); + mutex_lock(&conn_id->handler_mutex); mutex_lock(&lock); ret = cma_acquire_dev(conn_id); mutex_unlock(&lock); @@ -1140,7 +1134,7 @@ static int cma_req_handler(struct ib_cm_ !cma_is_ud_ps(conn_id->id.ps)) ib_send_cm_mra(cm_id, CMA_CM_MRA_SETTING, NULL, 0); mutex_unlock(&lock); - cma_enable_remove(conn_id); + mutex_unlock(&conn_id->handler_mutex); goto out; } @@ -1149,11 +1143,11 @@ static int cma_req_handler(struct ib_cm_ release_conn_id: cma_exch(conn_id, CMA_DESTROYING); - cma_enable_remove(conn_id); + mutex_unlock(&conn_id->handler_mutex); rdma_destroy_id(&conn_id->id); out: - cma_enable_remove(listen_id); + mutex_unlock(&listen_id->handler_mutex); return ret; } @@ -1219,7 +1213,7 @@ static int cma_iw_handler(struct iw_cm_i struct sockaddr_in *sin; int ret = 0; - if (cma_disable_remove(id_priv, CMA_CONNECT)) + if (cma_disable_callback(id_priv, CMA_CONNECT)) return 0; memset(&event, 0, sizeof event); @@ -1263,12 +1257,12 @@ static int cma_iw_handler(struct iw_cm_i /* Destroy the CM ID by returning a non-zero value. */ id_priv->cm_id.iw = NULL; cma_exch(id_priv, CMA_DESTROYING); - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); rdma_destroy_id(&id_priv->id); return ret; } - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); return ret; } @@ -1284,7 +1278,7 @@ static int iw_conn_req_handler(struct iw struct ib_device_attr attr; listen_id = cm_id->context; - if (cma_disable_remove(listen_id, CMA_LISTEN)) + if (cma_disable_callback(listen_id, CMA_LISTEN)) return -ECONNABORTED; /* Create a new RDMA id for the new IW CM ID */ @@ -1296,19 +1290,19 @@ static int iw_conn_req_handler(struct iw goto out; } conn_id = container_of(new_cm_id, struct rdma_id_private, id); - atomic_inc(&conn_id->dev_remove); + mutex_lock(&conn_id->handler_mutex); conn_id->state = CMA_CONNECT; dev = ip_dev_find(&init_net, iw_event->local_addr.sin_addr.s_addr); if (!dev) { ret = -EADDRNOTAVAIL; - cma_enable_remove(conn_id); + mutex_unlock(&conn_id->handler_mutex); rdma_destroy_id(new_cm_id); goto out; } ret = rdma_copy_addr(&conn_id->id.route.addr.dev_addr, dev, NULL); if (ret) { - cma_enable_remove(conn_id); + mutex_unlock(&conn_id->handler_mutex); rdma_destroy_id(new_cm_id); goto out; } @@ -1317,7 +1311,7 @@ static int iw_conn_req_handler(struct iw ret = cma_acquire_dev(conn_id); mutex_unlock(&lock); if (ret) { - cma_enable_remove(conn_id); + mutex_unlock(&conn_id->handler_mutex); rdma_destroy_id(new_cm_id); goto out; } @@ -1333,7 +1327,7 @@ static int iw_conn_req_handler(struct iw ret = ib_query_device(conn_id->id.device, &attr); if (ret) { - cma_enable_remove(conn_id); + mutex_unlock(&conn_id->handler_mutex); rdma_destroy_id(new_cm_id); goto out; } @@ -1349,14 +1343,14 @@ static int iw_conn_req_handler(struct iw /* User wants to destroy the CM ID */ conn_id->cm_id.iw = NULL; cma_exch(conn_id, CMA_DESTROYING); - cma_enable_remove(conn_id); + 
mutex_unlock(&conn_id->handler_mutex); rdma_destroy_id(&conn_id->id); } out: if (dev) dev_put(dev); - cma_enable_remove(listen_id); + mutex_unlock(&listen_id->handler_mutex); return ret; } @@ -1588,7 +1582,7 @@ static void cma_work_handler(struct work struct rdma_id_private *id_priv = work->id; int destroy = 0; - atomic_inc(&id_priv->dev_remove); + mutex_lock(&id_priv->handler_mutex); if (!cma_comp_exch(id_priv, work->old_state, work->new_state)) goto out; @@ -1597,7 +1591,7 @@ static void cma_work_handler(struct work destroy = 1; } out: - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); cma_deref_id(id_priv); if (destroy) rdma_destroy_id(&id_priv->id); @@ -1760,7 +1754,7 @@ static void addr_handler(int status, str struct rdma_cm_event event; memset(&event, 0, sizeof event); - atomic_inc(&id_priv->dev_remove); + mutex_lock(&id_priv->handler_mutex); /* * Grab mutex to block rdma_destroy_id() from removing the device while @@ -1789,13 +1783,13 @@ static void addr_handler(int status, str if (id_priv->id.event_handler(&id_priv->id, &event)) { cma_exch(id_priv, CMA_DESTROYING); - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); cma_deref_id(id_priv); rdma_destroy_id(&id_priv->id); return; } out: - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); cma_deref_id(id_priv); } @@ -2122,7 +2116,7 @@ static int cma_sidr_rep_handler(struct i struct ib_cm_sidr_rep_event_param *rep = &ib_event->param.sidr_rep_rcvd; int ret = 0; - if (cma_disable_remove(id_priv, CMA_CONNECT)) + if (cma_disable_callback(id_priv, CMA_CONNECT)) return 0; memset(&event, 0, sizeof event); @@ -2163,12 +2157,12 @@ static int cma_sidr_rep_handler(struct i /* Destroy the CM ID by returning a non-zero value. */ id_priv->cm_id.ib = NULL; cma_exch(id_priv, CMA_DESTROYING); - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); rdma_destroy_id(&id_priv->id); return ret; } out: - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); return ret; } @@ -2566,8 +2560,8 @@ static int cma_ib_mc_handler(int status, int ret; id_priv = mc->id_priv; - if (cma_disable_remove(id_priv, CMA_ADDR_BOUND) && - cma_disable_remove(id_priv, CMA_ADDR_RESOLVED)) + if (cma_disable_callback(id_priv, CMA_ADDR_BOUND) && + cma_disable_callback(id_priv, CMA_ADDR_RESOLVED)) return 0; mutex_lock(&id_priv->qp_mutex); @@ -2592,12 +2586,12 @@ static int cma_ib_mc_handler(int status, ret = id_priv->id.event_handler(&id_priv->id, &event); if (ret) { cma_exch(id_priv, CMA_DESTROYING); - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); rdma_destroy_id(&id_priv->id); return 0; } - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); return 0; } @@ -2756,22 +2750,26 @@ static int cma_remove_id_dev(struct rdma { struct rdma_cm_event event; enum cma_state state; - + int ret = 0; + /* Record that we want to remove the device */ state = cma_exch(id_priv, CMA_DEVICE_REMOVAL); if (state == CMA_DESTROYING) return 0; cma_cancel_operation(id_priv, state); - wait_event(id_priv->wait_remove, !atomic_read(&id_priv->dev_remove)); + mutex_lock(&id_priv->handler_mutex); /* Check for destruction from another callback. 
*/ if (!cma_comp(id_priv, CMA_DEVICE_REMOVAL)) - return 0; + goto out; memset(&event, 0, sizeof event); event.event = RDMA_CM_EVENT_DEVICE_REMOVAL; - return id_priv->id.event_handler(&id_priv->id, &event); + ret = id_priv->id.event_handler(&id_priv->id, &event); +out: + mutex_unlock(&id_priv->handler_mutex); + return ret; } static void cma_process_remove(struct cma_device *cma_dev)

From ogerlitz at voltaire.com Wed May 28 04:36:30 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 28 May 2008 14:36:30 +0300 (IDT) Subject: [ofa-general] [RFC V4 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_ADDR_CHANGE notification In-Reply-To: References: Message-ID:

The RDMA_CM_EVENT_ADDR_CHANGE event can be used by rdma-cm consumers that wish to have their RDMA sessions always use the same links (eg ) as the IP stack does. In the current code, this does not happen when bonding is used and a fail-over has occurred while the IB link used by an already existing session is still operating fine.

Use netdev event notification for sensing that a change has happened in the IP stack, then scan the rdma-cm ID list to see if there is an ID that is "misaligned" in that respect with the IP stack, and deliver RDMA_CM_EVENT_ADDR_CHANGE for this ID. The user can act on the event or just ignore it.

Signed-off-by: Or Gerlitz

changes from v2:
- took the approach of unconditionally notifying the user
- use the handler_mutex of the ID to serialize with other callbacks

changes from v3:
- check in cma_ndev_work_handler to make sure the ID is not getting destroyed
- change the event name to be RDMA_CM_EVENT_ADDR_CHANGE
- cma_netdev_align_id --> cma_netdev_change

As for the locking issues, I still have the double loop in cma_netdev_callback() being wrapped with the rdma-cm global mutex taken, as I explained over the thread.
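[Editor's note: a minimal sketch of how a ULP might consume the new event. The handler signature is the stock rdma-cm one; my_restart_session is a placeholder invented here for illustration.]

    #include <rdma/rdma_cm.h>

    static void my_restart_session(void *ctx)
    {
            /* ULP-specific: tear down the QP/ID and re-run address
             * resolution so the session follows the IP stack again */
    }

    static int my_cm_handler(struct rdma_cm_id *id,
                             struct rdma_cm_event *event)
    {
            switch (event->event) {
            case RDMA_CM_EVENT_ADDR_CHANGE:
                    /* The IP stack moved (e.g. bonding fail-over) and this
                     * ID no longer matches the netdev.  Re-resolve -- or
                     * just ignore the event and keep using the
                     * still-working IB link. */
                    my_restart_session(id->context);
                    return 0;       /* non-zero would destroy the ID */
            default:
                    return 0;
            }
    }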
drivers/infiniband/core/cma.c | 88 ++++++++++++++++++++++++++++++++++++++++++ include/rdma/rdma_cm.h | 3 - 2 files changed, 90 insertions(+), 1 deletion(-) Index: linux-2.6.26-rc3/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.26-rc3.orig/drivers/infiniband/core/cma.c 2008-05-28 11:08:24.000000000 +0300 +++ linux-2.6.26-rc3/drivers/infiniband/core/cma.c 2008-05-28 13:03:43.000000000 +0300 @@ -164,6 +164,12 @@ struct cma_work { struct rdma_cm_event event; }; +struct cma_ndev_work { + struct work_struct work; + struct rdma_id_private *id; + struct rdma_cm_event event; +}; + union cma_ip_addr { struct in6_addr ip6; struct { @@ -1598,6 +1604,28 @@ out: kfree(work); } +static void cma_ndev_work_handler(struct work_struct *_work) +{ + struct cma_ndev_work *work = container_of(_work, struct cma_ndev_work, work); + struct rdma_id_private *id_priv = work->id; + int destroy = 0; + + mutex_lock(&id_priv->handler_mutex); + if (id_priv->state == CMA_DESTROYING) + goto out; + + if (id_priv->id.event_handler(&id_priv->id, &work->event)) { + cma_exch(id_priv, CMA_DESTROYING); + destroy = 1; + } +out: + mutex_unlock(&id_priv->handler_mutex); + cma_deref_id(id_priv); + if (destroy) + rdma_destroy_id(&id_priv->id); + kfree(work); +} + static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int timeout_ms) { struct rdma_route *route = &id_priv->id.route; @@ -2723,6 +2751,63 @@ void rdma_leave_multicast(struct rdma_cm } EXPORT_SYMBOL(rdma_leave_multicast); +static int cma_netdev_change(struct net_device *ndev, struct rdma_id_private *id_priv) +{ + struct rdma_dev_addr *dev_addr; + struct cma_ndev_work *work; + + dev_addr = &id_priv->id.route.addr.dev_addr; + + if (!memcmp(dev_addr->src_dev_name, ndev->name, IFNAMSIZ) && + memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) { + printk(KERN_ERR "addr change for device %s used by id %p, notifying\n", + ndev->name, &id_priv->id); + work = kzalloc(sizeof *work, GFP_ATOMIC); + if (!work) + return -ENOMEM; + INIT_WORK(&work->work, cma_ndev_work_handler); + work->id = id_priv; + work->event.event = RDMA_CM_EVENT_ADDR_CHANGE; + atomic_inc(&id_priv->refcount); + queue_work(cma_wq, &work->work); + } + + return 0; +} + +static int cma_netdev_callback(struct notifier_block *self, unsigned long event, + void *ctx) +{ + struct net_device *ndev = (struct net_device *)ctx; + struct cma_device *cma_dev; + struct rdma_id_private *id_priv; + int ret = NOTIFY_DONE; + + if (dev_net(ndev) != &init_net) + return NOTIFY_DONE; + + if (event != NETDEV_BONDING_FAILOVER) + return NOTIFY_DONE; + + if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING)) + return NOTIFY_DONE; + + mutex_lock(&lock); + list_for_each_entry(cma_dev, &dev_list, list) + list_for_each_entry(id_priv, &cma_dev->id_list, list) { + ret = cma_netdev_change(ndev, id_priv); + if (ret) + break; + } + mutex_unlock(&lock); + + return ret; +} + +static struct notifier_block cma_nb = { + .notifier_call = cma_netdev_callback +}; + static void cma_add_one(struct ib_device *device) { struct cma_device *cma_dev; @@ -2831,6 +2916,7 @@ static int cma_init(void) ib_sa_register_client(&sa_client); rdma_addr_register_client(&addr_client); + register_netdevice_notifier(&cma_nb); ret = ib_register_client(&cma_client); if (ret) @@ -2838,6 +2924,7 @@ static int cma_init(void) return 0; err: + unregister_netdevice_notifier(&cma_nb); rdma_addr_unregister_client(&addr_client); ib_sa_unregister_client(&sa_client); destroy_workqueue(cma_wq); @@ 
-2847,6 +2934,7 @@ err: static void cma_cleanup(void) { ib_unregister_client(&cma_client); + unregister_netdevice_notifier(&cma_nb); rdma_addr_unregister_client(&addr_client); ib_sa_unregister_client(&sa_client); destroy_workqueue(cma_wq); Index: linux-2.6.26-rc3/include/rdma/rdma_cm.h =================================================================== --- linux-2.6.26-rc3.orig/include/rdma/rdma_cm.h 2008-05-28 10:34:27.000000000 +0300 +++ linux-2.6.26-rc3/include/rdma/rdma_cm.h 2008-05-28 12:55:31.000000000 +0300 @@ -53,7 +53,8 @@ enum rdma_cm_event_type { RDMA_CM_EVENT_DISCONNECTED, RDMA_CM_EVENT_DEVICE_REMOVAL, RDMA_CM_EVENT_MULTICAST_JOIN, - RDMA_CM_EVENT_MULTICAST_ERROR + RDMA_CM_EVENT_MULTICAST_ERROR, + RDMA_CM_EVENT_ADDR_CHANGE }; enum rdma_port_space {

From eli at dev.mellanox.co.il Wed May 28 05:05:03 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 28 May 2008 15:05:03 +0300 Subject: [ofa-general] [PATCH v2] IB/ipoib: copy small SKBs in CM mode In-Reply-To: References: Message-ID: <1211976303.13769.155.camel@mtls03>

> In this case, how many tx drop packets from ifconfig output? Should we > see ifconfig tx drop packets + tx successfully transmit packets close > to netperf packets? That's right. > > Any TCP STREAM test results to share here? TCP won't demonstrate the problem since it uses Nagle's algorithm to aggregate data into full sized packets. > > thanks > Shirley >

From Thomas.Talpey at netapp.com Wed May 28 05:06:45 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 28 May 2008 08:06:45 -0400 Subject: [ofa-general] Infiniband back-to-back without OpenSM? Message-ID:

Is it possible to manually configure two Infiniband ports to operate with one another in back-to-back mode, without running OpenSM on one of them? We have done this on other IB implementations by manually assigning LIDs, but I discover that the "lid" entry below /sys/class/infiniband/ is not writable, at least for mthca. Also, I expect that the ipoib driver will be unable to join the broadcast group, so will be unwilling to come up fully. With ethernet, and maybe iWARP, just a simple ifconfig can do this. So why not IB? If you're wondering, my goal is to give NFS/RDMA users a way to avoid having to install the many userspace modules needed to do this, including libibverbs, opensm, etc. There's a lot to get wrong, and things go missing. Seeking an "easy" way to get started with just the kernel and some shell commands. Tom.

From hrosenstock at xsigo.com Wed May 28 05:39:29 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 28 May 2008 05:39:29 -0700 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: References: Message-ID: <1211978369.13185.351.camel@hrosenstock-ws.xsigo.com>

Tom, On Wed, 2008-05-28 at 08:06 -0400, Talpey, Thomas wrote: > Is it possible to manually configure two Infiniband ports to operate > with one another in back-to-back mode, without running OpenSM > on one of them? This is possible, but something would need to do at least some subset of what the SM does, depending on the precise requirements and the limits placed on the environment supported without a "full blown" SM. > We have done this on other IB implementations by manually assigning > LIDs, but I discover that the "lid" entry below /sys/class/infiniband/ > is not writable, at least for mthca. This can be done via MADs, so the user_mad kernel module would be needed to do this.
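[Editor's note: Hal's "can be done via MADs" deserves a concrete illustration. Below is a rough, untested sketch along the lines of what ibportstate does, written against the libibmad helpers of this era; treat every call and field name here as an assumption to verify against your installed headers.]

    #include <infiniband/mad.h>

    /* Write a LID into the local port's PortInfo with directed-route SMPs
     * (the zeroed portid is a zero-hop DR path, i.e. the local port). */
    static int set_local_lid(uint16_t lid)
    {
            int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS };
            ib_portid_t portid = { 0 };
            uint8_t data[IB_SMP_DATA_SIZE];

            madrpc_init(NULL, 0, mgmt_classes, 2); /* default HCA, port */
            if (!smp_query(data, &portid, IB_ATTR_PORT_INFO, 0, 0))
                    return -1;                     /* read current PortInfo */
            mad_set_field(data, 0, IB_PORT_LID_F, lid);
            if (!smp_set(data, &portid, IB_ATTR_PORT_INFO, 0, 0))
                    return -1;                     /* write it back */
            return 0;
    }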
> Also, I expect that the ipoib driver will > be unable to join the broadcast group, so will be unwilling to come up fully. Is IPoIB a requirement ? > With ethernet, and maybe iWARP, just a simple ifconfig can do this. So why > not IB? The simple answer is that it is the nature of IB management (being different than ethernet). -- Hal > If you're wondering, my goal is give NFS/RDMA users a way to avoid having > to install the many userspace modules needed to do this, including libibverbs, > opensm, etc. There's a lot to get wrong, and things go missing. Seeking an > "easy" way to get started with just the kernel and some shell commands. > > Tom. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Thomas.Talpey at netapp.com Wed May 28 05:56:53 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 28 May 2008 08:56:53 -0400 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: <1211978369.13185.351.camel@hrosenstock-ws.xsigo.com> References: <1211978369.13185.351.camel@hrosenstock-ws.xsigo.com> Message-ID: At 08:39 AM 5/28/2008, Hal Rosenstock wrote: >Tom, > >On Wed, 2008-05-28 at 08:06 -0400, Talpey, Thomas wrote: >> Is it possible to manually configure two Infiniband ports to operate >> with one another in back-to-back mode, without running OpenSM >> on one of them? > >This is possible but something would need to do at least some subset of >what the SM does depending on the precise requirements and the limits >placed on the environment supported without a "full blown" SM. Okay ... but IMO the only thing we need is a LID. Or at least, in my experience all I've needed is a LID. In a previous effort, we simply stole the low octet of an IP address, so we'd "ifconfig ib0 1.2.3.X" and it would jam lid=X into the interface. Worked great. If necessary, we would set a manual arp entry (using iproute) to avoid having to broadcast. > >> We have done this on other IB implementations by manually assigning >> LIDs, but I discover that the "lid" entry below >/sys/class/infiniband/ >> is not writable, at least for mthca. > >This can be done via MADs so user_mad kernel module would be needed to >do this. Okay, all kernel modules can be assumed to be in place. How do we tell it to manage the LID, with a shell command? > >> Also, I expect that the ipoib driver will >> be unable to join the broadcast group, so will be unwilling to come up fully. > >Is IPoIB a requirement ? I think so, for two reasons. One, principle of least surprise - the user will expect to be able to ping, telnet etc if it has connectivity. Two, for NFS/RDMA we require TCP and UDP connections in order to perform the mount and do locking and recovery. We could do those over a parallel ethernet connection, but that's kind of not the point. > >> With ethernet, and maybe iWARP, just a simple ifconfig can do this. So why >> not IB? > >The simple answer is that it is the nature of IB management (being >different than ethernet). Which, IMO, we need to boil down to simplest-possible, for at least some workable configuration. Thanks for the ideas! Tom. > >-- Hal > >> If you're wondering, my goal is give NFS/RDMA users a way to avoid having >> to install the many userspace modules needed to do this, including >libibverbs, >> opensm, etc. There's a lot to get wrong, and things go missing. 
Seeking an >> "easy" way to get started with just the kernel and some shell commands. >> >> Tom. >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Wed May 28 06:03:37 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 28 May 2008 06:03:37 -0700 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: References: <1211978369.13185.351.camel@hrosenstock-ws.xsigo.com> Message-ID: <1211979817.13185.357.camel@hrosenstock-ws.xsigo.com> On Wed, 2008-05-28 at 08:56 -0400, Talpey, Thomas wrote: > At 08:39 AM 5/28/2008, Hal Rosenstock wrote: > >Tom, > > > >On Wed, 2008-05-28 at 08:06 -0400, Talpey, Thomas wrote: > >> Is it possible to manually configure two Infiniband ports to operate > >> with one another in back-to-back mode, without running OpenSM > >> on one of them? > > > >This is possible but something would need to do at least some subset of > >what the SM does depending on the precise requirements and the limits > >placed on the environment supported without a "full blown" SM. > > Okay ... but IMO the only thing we need is a LID. Or at least, in my experience > all I've needed is a LID. The port also needs to be walked from init to active which takes coordination at both ends of the b2b link. > In a previous effort, we simply stole the low octet of an IP address, so we'd > "ifconfig ib0 1.2.3.X" and it would jam lid=X into the interface. Worked great. > If necessary, we would set a manual arp entry (using iproute) to avoid having > to broadcast. That could be done if that is what is desired and can be relied upon (that ib0 is configured and we only care about the first port). Is it just ARP support that is needed ? > >> We have done this on other IB implementations by manually assigning > >> LIDs, but I discover that the "lid" entry below > >/sys/class/infiniband/ > >> is not writable, at least for mthca. > > > >This can be done via MADs so user_mad kernel module would be needed to > >do this. > > Okay, all kernel modules can be assumed to be in place. How do we tell it > to manage the LID, with a shell command? A new "command" would be needed. -- Hal > >> Also, I expect that the ipoib driver will > >> be unable to join the broadcast group, so will be unwilling to come up fully. > > > >Is IPoIB a requirement ? > > I think so, for two reasons. One, principle of least surprise - the user will > expect to be able to ping, telnet etc if it has connectivity. Two, for NFS/RDMA > we require TCP and UDP connections in order to perform the mount and do > locking and recovery. We could do those over a parallel ethernet connection, > but that's kind of not the point. > > > > >> With ethernet, and maybe iWARP, just a simple ifconfig can do this. So why > >> not IB? > > > >The simple answer is that it is the nature of IB management (being > >different than ethernet). > > Which, IMO, we need to boil down to simplest-possible, for at least some > workable configuration. > > Thanks for the ideas! > > Tom. > > > > >-- Hal > > > >> If you're wondering, my goal is give NFS/RDMA users a way to avoid having > >> to install the many userspace modules needed to do this, including > >libibverbs, > >> opensm, etc. There's a lot to get wrong, and things go missing. 
Seeking an > >> "easy" way to get started with just the kernel and some shell commands. > >> > >> Tom. > >> > >> _______________________________________________ > >> general mailing list > >> general at lists.openfabrics.org > >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > >> To unsubscribe, please visit > >http://openib.org/mailman/listinfo/openib-general >

From Thomas.Talpey at netapp.com Wed May 28 06:24:21 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 28 May 2008 09:24:21 -0400 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: <1211979817.13185.357.camel@hrosenstock-ws.xsigo.com> References: <1211978369.13185.351.camel@hrosenstock-ws.xsigo.com> <1211979817.13185.357.camel@hrosenstock-ws.xsigo.com> Message-ID:

At 09:03 AM 5/28/2008, Hal Rosenstock wrote: >On Wed, 2008-05-28 at 08:56 -0400, Talpey, Thomas wrote: >> At 08:39 AM 5/28/2008, Hal Rosenstock wrote: >> >Tom, >> > >> >On Wed, 2008-05-28 at 08:06 -0400, Talpey, Thomas wrote: >> >> Is it possible to manually configure two Infiniband ports to operate >> >> with one another in back-to-back mode, without running OpenSM >> >> on one of them? >> > >> >This is possible but something would need to do at least some subset of >> >what the SM does depending on the precise requirements and the limits >> >placed on the environment supported without a "full blown" SM. >> >> Okay ... but IMO the only thing we need is a LID. Or at least, in my experience >> all I've needed is a LID. > >The port also needs to be walked from init to active which takes >coordination at both ends of the b2b link.

Yep. But, it has all it needs with a LID, right? No messages need to be exchanged, for instance.

> >> In a previous effort, we simply stole the low octet of an IP address, so we'd >> "ifconfig ib0 1.2.3.X" and it would jam lid=X into the interface. >Worked great. >> If necessary, we would set a manual arp entry (using iproute) to avoid having >> to broadcast. > >That could be done if that is what is desired and can be relied upon >(that ib0 is configured and we only care about the first port). > >Is it just ARP support that is needed ?

Well, ARP is the precursor to establishing an IP send and a TCP connection, which we need to do also. But, if the resulting ipaddr-hwaddr mapping is installed, then ARP is unnecessary and the IP layer can send without using it. When we did this before, we'd install a "permanent" ARP entry, in a two-line shell script. Roughly, for peers configuring lids X and Y, it would do

peer X:
    ifconfig ib0 1.2.3.X
    ip neigh add 1.2.3.Y nud permanent lladdr a.b.c.d.e.f....Y (i.e. Y's guid)

peer Y:
    ifconfig ib0 1.2.3.Y
    ip neigh add 1.2.3.X nud permanent lladdr a.b.c.d.e.f....X

And we'd be up and running for both IP and RDMA connections. We fixed a bug in the old iproute2 command to allow the long IB link addresses. I'm thinking that using IPOIB to drive this kind of manual setup is one way to approach it. It certainly would be simple, and worked for us before there was an OFA stack. Maybe I'm getting ahead of myself though, still wondering if there's a way to do it with what we have. Tom.

> >> >> We have done this on other IB implementations by manually assigning >> >> LIDs, but I discover that the "lid" entry below >> >/sys/class/infiniband/ >> >> is not writable, at least for mthca. >> > >> >This can be done via MADs so user_mad kernel module would be needed to >> >do this. >> >> Okay, all kernel modules can be assumed to be in place.
How do we tell it >> to manage the LID, with a shell command? > >A new "command" would be needed. > >-- Hal > >> >> Also, I expect that the ipoib driver will >> >> be unable to join the broadcast group, so will be unwilling to >come up fully. >> > >> >Is IPoIB a requirement ? >> >> I think so, for two reasons. One, principle of least surprise - the user will >> expect to be able to ping, telnet etc if it has connectivity. Two, >for NFS/RDMA >> we require TCP and UDP connections in order to perform the mount and do >> locking and recovery. We could do those over a parallel ethernet connection, >> but that's kind of not the point. >> >> > >> >> With ethernet, and maybe iWARP, just a simple ifconfig can do this. So why >> >> not IB? >> > >> >The simple answer is that it is the nature of IB management (being >> >different than ethernet). >> >> Which, IMO, we need to boil down to simplest-possible, for at least some >> workable configuration. >> >> Thanks for the ideas! >> >> Tom. >> >> > >> >-- Hal >> > >> >> If you're wondering, my goal is give NFS/RDMA users a way to avoid having >> >> to install the many userspace modules needed to do this, including >> >libibverbs, >> >> opensm, etc. There's a lot to get wrong, and things go missing. Seeking an >> >> "easy" way to get started with just the kernel and some shell commands. >> >> >> >> Tom. >> >> >> >> _______________________________________________ >> >> general mailing list >> >> general at lists.openfabrics.org >> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> >> >> To unsubscribe, please visit >> >http://openib.org/mailman/listinfo/openib-general >> > >_______________________________________________ >general mailing list >general at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Wed May 28 06:34:10 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 28 May 2008 06:34:10 -0700 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: References: <1211978369.13185.351.camel@hrosenstock-ws.xsigo.com> <1211979817.13185.357.camel@hrosenstock-ws.xsigo.com> Message-ID: <1211981650.13185.362.camel@hrosenstock-ws.xsigo.com> On Wed, 2008-05-28 at 09:24 -0400, Talpey, Thomas wrote: > At 09:03 AM 5/28/2008, Hal Rosenstock wrote: > >On Wed, 2008-05-28 at 08:56 -0400, Talpey, Thomas wrote: > >> At 08:39 AM 5/28/2008, Hal Rosenstock wrote: > >> >Tom, > >> > > >> >On Wed, 2008-05-28 at 08:06 -0400, Talpey, Thomas wrote: > >> >> Is it possible to manually configure two Infiniband ports to operate > >> >> with one another in back-to-back mode, without running OpenSM > >> >> on one of them? > >> > > >> >This is possible but something would need to do at least some subset of > >> >what the SM does depending on the precise requirements and the limits > >> >placed on the environment supported without a "full blown" SM. > >> > >> Okay ... but IMO the only thing we need is a LID. Or at least, in my > >experience > >> all I've needed is a LID. > > > >The port also needs to be walked from init to active which takes > >coordination at both ends of the b2b link. > > Yep. But, it has all it needs with a LID, right? No messages need to be > exchanged, for instance. It's more than a LID and messages do need to be exchanged (mini SM -> SMA) to walk the port from INIT to ACTIVE. This needs to be coordinated on both sides of the link so they move in rough concert. 
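[Editor's note: a sketch of the INIT-to-ACTIVE walk Hal describes, under the same hedged libibmad assumptions as the earlier LID sketch; both peers would run the same steps in lockstep, waiting for each other between steps.]

    /* PortState encoding in PortInfo: 1=DOWN, 2=INIT, 3=ARMED, 4=ACTIVE.
     * After LIDs are assigned, each side steps its port
     * INIT -> ARMED -> ACTIVE in rough concert with its peer. */
    static int step_port_state(ib_portid_t *portid, unsigned state)
    {
            uint8_t data[IB_SMP_DATA_SIZE];

            if (!smp_query(data, portid, IB_ATTR_PORT_INFO, 0, 0))
                    return -1;
            mad_set_field(data, 0, IB_PORT_STATE_F, state);
            return smp_set(data, portid, IB_ATTR_PORT_INFO, 0, 0) ? 0 : -1;
    }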
> >> In a previous effort, we simply stole the low octet of an IP address, so we'd > >> "ifconfig ib0 1.2.3.X" and it would jam lid=X into the interface. > >Worked great. > >> If necessary, we would set a manual arp entry (using iproute) to avoid having > >> to broadcast. > > > >That could be done if that is what is desired and can be relied upon > >(that ib0 is configured and we only care about the first port). > > > >Is it just ARP support that is needed ? > > Well, ARP is the precursor to establishing an IP send and a TCP connection, > which we need to do also. I was just asking about other broadcast/multicast needs. Sounds like this is not the case. > But, if the resulting ipaddr-hwaddr mapping is > installed, then ARP is unnecessary and the IP layer can send without using it. > > When we did this before, we'd install a "permanent" ARP entry, in a two-line > shell script. Roughly, for peers configuring lids X and Y, it would do > > peer X: > ifconfig ib0 1.2.3.X > ip neigh add 1.2.3.Y nud permanent lladdr a.b.c.d.e.f....Y (i.e. Y's guid) > > peer Y: > ifconfig ib0 1.2.3.Y > ip neigh add 1.2.3.X nud permanent lladdr a.b.c.d.e.f....X > > And we'd be up and running for both IP and RDMA connections. We fixed a > bug in the old iproute2 command to allow the long IB link addresses. > > I'm thinking that using IPOIB to drive this kind of manual setup is one way > to approach it. It certainly would be simple, and worked for us before there > was an OFA stack. This would still work. > Maybe I'm getting ahead of myself though, still wondering if there's a way > to do it with what we have. The closest thing is OpenSM run once mode but I think you've been describing a b2b mini SM command which wouldn't be hard to implement. -- Hal > Tom. > > > > >> >> We have done this on other IB implementations by manually assigning > >> >> LIDs, but I discover that the "lid" entry below > >> >/sys/class/infiniband/ > >> >> is not writable, at least for mthca. > >> > > >> >This can be done via MADs so user_mad kernel module would be needed to > >> >do this. > >> > >> Okay, all kernel modules can be assumed to be in place. How do we tell it > >> to manage the LID, with a shell command? > > > >A new "command" would be needed. > > > >-- Hal > > > >> >> Also, I expect that the ipoib driver will > >> >> be unable to join the broadcast group, so will be unwilling to > >come up fully. > >> > > >> >Is IPoIB a requirement ? > >> > >> I think so, for two reasons. One, principle of least surprise - the user will > >> expect to be able to ping, telnet etc if it has connectivity. Two, > >for NFS/RDMA > >> we require TCP and UDP connections in order to perform the mount and do > >> locking and recovery. We could do those over a parallel ethernet connection, > >> but that's kind of not the point. > >> > >> > > >> >> With ethernet, and maybe iWARP, just a simple ifconfig can do this. So why > >> >> not IB? > >> > > >> >The simple answer is that it is the nature of IB management (being > >> >different than ethernet). > >> > >> Which, IMO, we need to boil down to simplest-possible, for at least some > >> workable configuration. > >> > >> Thanks for the ideas! > >> > >> Tom. > >> > >> > > >> >-- Hal > >> > > >> >> If you're wondering, my goal is give NFS/RDMA users a way to avoid having > >> >> to install the many userspace modules needed to do this, including > >> >libibverbs, > >> >> opensm, etc. There's a lot to get wrong, and things go missing. 
Seeking an > >> >> "easy" way to get started with just the kernel and some shell commands. > >> >> > >> >> Tom. > >> >> > >> >> _______________________________________________ > >> >> general mailing list > >> >> general at lists.openfabrics.org > >> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> >> > >> >> To unsubscribe, please visit > >> >http://openib.org/mailman/listinfo/openib-general >

From tziporet at dev.mellanox.co.il Wed May 28 06:49:26 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 28 May 2008 16:49:26 +0300 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: References: Message-ID: <483D62E6.2050107@mellanox.co.il>

Talpey, Thomas wrote: > > If you're wondering, my goal is give NFS/RDMA users a way to avoid having > to install the many userspace modules needed to do this, including libibverbs, > opensm, etc. There's a lot to get wrong, and things go missing. Seeking an > "easy" way to get started with just the kernel and some shell commands. > > No need for libibverbs for opensm, just the management libraries. Tziporet

From chu11 at llnl.gov Wed May 28 03:55:25 2008 From: chu11 at llnl.gov (Al Chu) Date: Wed, 28 May 2008 06:55:25 -0400 Subject: [ofa-general] OpenSM? In-Reply-To: <483CFAB1.5020409@dev.mellanox.co.il> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <5E22F547-0F44-4E56-A72C-DE9FC83EE697@hpc.ufl.edu> <20080527100859.6d48cd45.weiny2@llnl.gov> <483CFAB1.5020409@dev.mellanox.co.il> Message-ID: <1211972125.5192.5.camel@whatsup>

Hey Yevgeny, > Did ftree fail to digest the topology? For the current OpenSM, on at least one cluster, yes. I've been meaning to look into it. Just haven't gotten to it yet. There is some legacy too. Earlier versions of ftree weren't able to handle a number of corner cases, so updn was used instead (and now is still used). > Or perhaps that you need LMC>0, which ftree doesn't support? This too. Al

On Wed, 2008-05-28 at 09:24 +0300, Yevgeny Kliteynik wrote: > Ira, > > Ira Weiny wrote: > > > > We run with the up/down algorithm, ftree has not panned out for us yet. > Can you elaborate on that? > Did ftree fail to digest the topology? Or did it do a lousy job configuring > the subnet? Or perhaps that you need LMC>0, which ftree doesn't support?
> -- Yevgeny > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory
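[Editor's note: for readers following the ftree/updn discussion, the routing engine and LMC are opensm command-line choices -- a hedged sketch; the option spellings are from opensm usage of this era and worth verifying against your build's opensm --help.]

    # request the fat-tree routing engine
    opensm -R ftree

    # fall back to up/down, with LMC=1 for multipathing
    # (ftree, as discussed above, does not support LMC>0)
    opensm -R updn -l 1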
From ramachandra.kuchimanchi at qlogic.com Wed May 28 07:18:28 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Wed, 28 May 2008 19:48:28 +0530 Subject: [ofa-general] Re: [PATCH v2 03/13] QLogic VNIC: Implementation of communication protocol with EVIC/VEx In-Reply-To: References: <20080519102843.12355.832.stgit@localhost.localdomain> <20080519103258.12355.6146.stgit@localhost.localdomain> Message-ID: <71d336490805280718j79c3ea24j4408b851eb0a23ab@mail.gmail.com>

Roland, Thanks. Will fix both the items you pointed out. Regards, Ram

On Wed, May 28, 2008 at 10:58 AM, Roland Dreier wrote: > > +void viport_disconnect(struct viport *viport) > > +{ > > + VIPORT_FUNCTION("viport_disconnect()\n"); > > + viport->disconnect = 1; > > + viport_failure(viport); > > + wait_event(viport->disconnect_queue, viport->disconnect == 0); > > +} > > + > > +void viport_free(struct viport *viport) > > +{ > > + VIPORT_FUNCTION("viport_free()\n"); > > + viport_disconnect(viport); /* NOTE: this can sleep */ > > There are no other calls to viport_disconnect() that I can see, so it > can be made static (and the declaration in vnic_viport.h can be dropped). > in fact given how small the function is and the fact that it has only a > single call site, it might be easier just to merge it into > viport_free(). But that's a matter of taste. > > - R. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >

From Thomas.Talpey at netapp.com Wed May 28 07:31:38 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 28 May 2008 10:31:38 -0400 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: <483D62E6.2050107@mellanox.co.il> References: <483D62E6.2050107@mellanox.co.il> Message-ID:

At 09:49 AM 5/28/2008, Tziporet Koren wrote: >Talpey, Thomas wrote: >> >> If you're wondering, my goal is give NFS/RDMA users a way to avoid having >> to install the many userspace modules needed to do this, including >libibverbs, >> opensm, etc. There's a lot to get wrong, and things go missing. Seeking an >> "easy" way to get started with just the kernel and some shell commands. >> >> >No need for libibverbs for opensm, just the management libraries.

Hmm, well expanding on my "etc.", at the NFS Connectathon we were trying to install it on a green Fedora 9 system, and the RedHat opensm package (OFED 1.3) wanted the following prerequisites:

    libibcommon
    libibumad
    opensm_lib
    opensm

In addition we needed to install

    libmthca

which in turn wanted

    libibverbs

And, after we successfully loaded it all and started opensm, etc, then ipoib wouldn't come up to RUNNING because it couldn't join the broadcast group. At which point, we threw up our hands because all we wanted to do was make a b2b connection, so we admitted defeat and connected to a managed switch. :-/ I'm just looking to make it easier for the next guy/gal. ;-) Tom.

From kliteyn at dev.mellanox.co.il Wed May 28 07:44:27 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 28 May 2008 17:44:27 +0300 Subject: [ofa-general] OpenSM?
In-Reply-To: <1211972125.5192.5.camel@whatsup> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <5E22F547-0F44-4E56-A72C-DE9FC83EE697@hpc.ufl.edu> <20080527100859.6d48cd45.weiny2@llnl.gov> <483CFAB1.5020409@dev.mellanox.co.il> <1211972125.5192.5.camel@whatsup> Message-ID: <483D6FCB.8020308@dev.mellanox.co.il>

Hi Al, Al Chu wrote: > Hey Yevgeny, > >> Did ftree fail to digest the topology? > > For the current OpenSM, on at least one cluster, yes. I've been meaning > to look into it. Just haven't gotten to it yet. OK, let me know if ftree is still unable to work on your topology when you do get back to this - perhaps I'll need to tune it up a bit. > There is some legacy too. Earlier versions of ftree weren't able to handle a > number of corner cases, so updn was used instead (and now is still > used). > >> Or perhaps that you need LMC>0, which ftree doesn't support? > > This too. Unless, of course, you need LMC>0, in which case ftree is useless. -- Yevgeny > Al > > On Wed, 2008-05-28 at 09:24 +0300, Yevgeny Kliteynik wrote: >> Ira, >> >> Ira Weiny wrote: >>> We run with the up/down algorithm, ftree has not panned out for us yet. >> Can you elaborate on that? >> Did ftree fail to digest the topology? Or did it do a lousy job configuring >> the subnet? Or perhaps that you need LMC>0, which ftree doesn't support? >> >> -- Yevgeny >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From hrosenstock at xsigo.com Wed May 28 08:24:29 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 28 May 2008 08:24:29 -0700 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: References: <483D62E6.2050107@mellanox.co.il> Message-ID: <1211988269.27600.32.camel@hrosenstock-ws.xsigo.com>

On Wed, 2008-05-28 at 10:31 -0400, Talpey, Thomas wrote: > In addition we needed to install > libmthca > which in turn wanted > libibverbs AFAIK, there is no OpenSM requirement for these libraries. > And, after we successfully loaded it all and started opensm, etc, > then ipoib wouldn't come up to RUNNING because it couldn't join > the broadcast group. This was likely due to IPoIB needing to be enabled on the default partition. From the 1.3 opensm man page,

PARTITION CONFIGURATION
    The default name of OpenSM partitions configuration file is
    /etc/opensm/partitions.conf. The default may be changed by using
    --Pconfig (-P) option with OpenSM.
...
The following rule is equivalent to how OpenSM used to run prior to the partition manager: Default=0x7fff,ipoib:ALL=full; -- Hal From weiny2 at llnl.gov Wed May 28 09:06:29 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 28 May 2008 09:06:29 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211972790.13185.332.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> <1211887752.13185.212.camel@hrosenstock-ws.xsigo.com> <20080527175637.GB14205@sashak.voltaire.com> <1211972790.13185.332.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080528090629.3ca96d30.weiny2@llnl.gov> On Wed, 28 May 2008 04:06:30 -0700 Hal Rosenstock wrote: > On Tue, 2008-05-27 at 20:56 +0300, Sasha Khapyorsky wrote: > > On 04:29 Tue 27 May , Hal Rosenstock wrote: > > > On Fri, 2008-05-23 at 05:52 -0700, Hal Rosenstock wrote: > > > > > Following your logic we will need to disable root passwords > > > > > typing too. > > > > > > > > That's taking it too far. Root passwords are at least hidden when > > > > typing. > > > > > > At least hide the key typing from plain sight when typing like su does. > > > > There are lot of tools where password can be specified as clear text in > > command line (wget, smbclient, etc..) - it is an user responsibility to > > keep his sensitive data safe. > > Do those tools provide a way to obscure passwords or force the user to > do this in plain sight ? Seems like a user can't do this without support > from the tool. smbclient seems to provide this; I didn't look at wget. > > smbclient supports an authorization file which supports this and says: > Make certain that the permissions on the file restrict access > from unwanted users. > > As you mentioned before, this is another acceptable approach (and this > also lends itself better to scripting). Another example of this is MySQL. From the man page: shell> mysql --user=user_name --password=your_password db_name With the plugin I just released I install a config file with this password accessible only to root. If someone runs OpenSM as another user or has other programs trying to access the DB (like SKUMMEE) then you will have to set the permissions on this file appropriately. I think I like the addition of a conf file for the scripts... Ira > > -- Hal > > > Sasha > From rdreier at cisco.com Wed May 28 09:11:37 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 28 May 2008 09:11:37 -0700 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: (Thomas Talpey's message of "Wed, 28 May 2008 10:31:38 -0400") References: <483D62E6.2050107@mellanox.co.il> Message-ID: > Hmm, well expanding on my "etc.", at the NFS Connectathon we were > trying to install it on a green Fedora 9 system, and the RedHat opensm > package (OFED 1.3) wanted the following prerequisites: > libibcommon > libibumad > opensm_lib > opensm I think the solution to this is not to do a bunch more work to create an opensm-free solution, but just to package opensm properly so it can get into Fedora. Then installation is just "yum install opensm". is a very detailed step-by-step guide to what is necessary. 
> In addition we needed to install > libmthca > which in turn wanted > libibverbs This isn't needed for opensm or NFS/RDMA, but it is very easy on Fedora 9: just "yum install libmthca". - R.

From dotanba at gmail.com Wed May 28 11:24:28 2008 From: dotanba at gmail.com (Dotan Barak) Date: Wed, 28 May 2008 20:24:28 +0200 Subject: [ofa-general] MLX HCA: CQ request notification for multiple completions not implemented? In-Reply-To: References: Message-ID: <483DA35C.3080502@gmail.com>

Yicheng Jia wrote: > > Thanks for your reply. I'm using one CQ for all the WRs. Do you know > why there's no ARM-N support in MLX drivers? I don't know if I can speak in the name of the Mellanox/MLX driver maintainers, but I think that the reason is lack of demand for this feature (but I can't be sure). > My concern is the performance. The overhead of software poll_cq loop > is quite significant if there are multiple pieces of small amount of > data to be transferred on both sender/receiver sides. For instance, on > the sender, the data I have are 1k, 1k, 2k, 1k..., on the receiver > side, the data size and blocks are the same, 1k, 1k, 2k, 1k.... Do you > have a good solution for such kind of problem? How many QPs do you use? (and how many outstanding WRs from every QP?) Dotan

> Best, > Yicheng > > > > *Dotan Barak * > > 05/23/2008 01:27 PM > > > To > Yicheng Jia > cc > general at lists.openfabrics.org > Subject > Re: [ofa-general] MLX HCA: CQ request notification for multiple > completions not implemented? > > > > Hi. > > Yicheng Jia wrote: > > > > Hi Folks, > > > > I'm trying to use CQ Event notification for multiple completions > > (ARM_N) according to Mellanox Lx III user manual for scatter/gathering > > RDMA. However I couldn't find it in current MLX driver. It seems to me > > that only ARM_NEXT and ARM_SOLICIT are implemented. So if there are > > multiple work requests, I have to use "poll_cq" to synchronously wait > > until all the requests are done, is it correct? Is there a way to do > > asynchronous multiple send by subscribing for a ARM_N event? > You are right: the low level drivers of Mellanox devices don't support > ARM-N > (This feature is supported by the devices, but it wasn't implemented in > the low level drivers). > > You are right, in order to read all of the completions you need to use > poll_cq. > > By the way: Do you have to create a completion for any WR? > (if you are using one QP, this will maybe solve your problem). > > Dotan
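[Editor's note: absent ARM-N, the usual userspace shape of the "arm and then poll_cq" advice above looks roughly like this -- a minimal libibverbs sketch; the function name is invented and error handling is trimmed.]

    #include <infiniband/verbs.h>

    /* Block for one CQ event, re-arm, then drain every completion.
     * Re-arming before the drain ensures no completion can slip in
     * between the final empty poll and the next notification. */
    static int wait_and_drain_cq(struct ibv_comp_channel *ch)
    {
            struct ibv_cq *cq;
            struct ibv_wc wc;
            void *ctx;
            int n;

            if (ibv_get_cq_event(ch, &cq, &ctx))
                    return -1;
            ibv_ack_cq_events(cq, 1);
            if (ibv_req_notify_cq(cq, 0))
                    return -1;
            while ((n = ibv_poll_cq(cq, 1, &wc)) > 0) {
                    /* handle wc.wr_id / wc.status here */
            }
            return n;       /* 0 when drained, negative on error */
    }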
From dotanba at gmail.com Wed May 28 11:31:46 2008 From: dotanba at gmail.com (Dotan Barak) Date: Wed, 28 May 2008 20:31:46 +0200 Subject: [ofa-general] Multicast Performance In-Reply-To: <483BBBDB.6000605@informatik.tu-chemnitz.de> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> <483BBBDB.6000605@informatik.tu-chemnitz.de> Message-ID: <483DA512.2070403@gmail.com>

Marcel Heinz wrote: > Marcel Heinz wrote: > >> Dotan Barak wrote: > >>> Do you use the latest released FW for this device? >>> >> The HCAs all use Mellanox' latest released FW version 1.2.0. I'll have a >> look at the switch later. >> > The Switch is Mellanox MT47396 based and uses FW version 1.0.0. This isn't the latest one, but I don't see anything in the release notes of the 1.0.5 firmware which is related to our problem. > 1) I know that ib_send_bw supports multicast as well; can you please check that you can reproduce your problem on this benchmark too? 2) You should expect multicast messages to be slower than unicast, because the HCA/switch treat them in a different way (message duplication needs to be done when needed). Dotan

From sean.hefty at intel.com Wed May 28 10:59:34 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 28 May 2008 10:59:34 -0700 Subject: [ofa-general] RE: [RFC V3 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_NETDEV_CHANGE notification In-Reply-To: <483D4283.60307@voltaire.com> References: <000401c8c015$75c37890$ebc8180a@amr.corp.intel.com> <483D4283.60307@voltaire.com> Message-ID: <000601c8c0ec$96492510$6758180a@amr.corp.intel.com>

>The rdma-cm maintains a mapping between IDs to the physical devices. The >mapping is established during address resolution using the HW address of >the --network-- device that was resolved (eg through ARP and then >looking on neigh->dev or route lookup) for this ID. This is what I was thinking. >In the bonding case, the network device --borrows-- the HW address from >the active slave device. During fail-over, the bonding net device >changes its HW address and then the netdev event is delivered on which >this code acts. So the same cma_dev can have IDs with different netdev >HW address in their dev_addr struct, say bond0 = and pdevA >list = { , } depending on the time address >resolution was done to ID1,ID2 and the ULP behavior on the ADDR_CHANGE >event. I don't see how to get along with a simple check that tell on >what cma_dev to look for matches. If we really want to avoid scanning >all the cma_dev list, we can add a mapping between --net devices-- to >IDs and then scan only the list of the affiliated netdevice. Ok - looping through everything isn't that bad, since it's not expected to happen often. If there's a way to improve this, I'm fine waiting to see if there's a real problem before complicating things. - Sean

From YJia at tmriusa.com Wed May 28 11:10:42 2008 From: YJia at tmriusa.com (Yicheng Jia) Date: Wed, 28 May 2008 13:10:42 -0500 Subject: [ofa-general] MLX HCA: CQ request notification for multiple completions not implemented? In-Reply-To: <483DA35C.3080502@gmail.com> Message-ID:

>> My concern is the performance. The overhead of software poll_cq loop >> is quite significant if there are multiple pieces of small amount of >> data to be transferred on both sender/receiver sides. For instance, on >> the sender, the data I have are 1k, 1k, 2k, 1k..., on the receiver >> side, the data size and blocks are the same, 1k, 1k, 2k, 1k.... Do you
For instance, on >> the sender, the data I have are 1k, 1k, 2k, 1k..., on the receiver >> side, the data size and blocks are the same, 1k, 1k, 2k, 1k.... Do you >> have a good solution for such kind of problem? >How many QPs do you use? >(and how outstanding WR from every QP?) Only one QP. Is it better to alloc multiple QPs and evenly distribute WRs among those QPs? Best, Yicheng Dotan Barak 05/28/2008 12:24 PM To Yicheng Jia cc general at lists.openfabrics.org Subject Re: [ofa-general] MLX HCA: CQ request notification for multiple completions not implemented? Yicheng Jia wrote: > > Thanks for your reply. I'm using one CQ for all the WRs. Do you know > why there's no ARM-N support in MLX drivers? I don't know if i can speak in the name of Mellanox/MLX driver maintainers, but i think that the reason is lack of demand for this feature (but i can't be sure). > My concern is the performance. The overhead of software poll_cq loop > is quite significant if there are multiple pieces of small amount of > data to be transferred on both sender/receiver sides. For instance, on > the sender, the data I have are 1k, 1k, 2k, 1k..., on the receiver > side, the data size and blocks are the same, 1k, 1k, 2k, 1k.... Do you > have a good solution for such kind of problem? How many QPs do you use? (and how outstanding WR from every QP?) Dotan > Best, > Yicheng > > > > *Dotan Barak * > > 05/23/2008 01:27 PM > > > To > Yicheng Jia > cc > general at lists.openfabrics.org > Subject > Re: [ofa-general] MLX HCA: CQ request notification for multiple > completions not implemented? > > > > > > > > > > Hi. > > Yicheng Jia wrote: > > > > Hi Folks, > > > > I'm trying to use CQ Event notification for multiple completions > > (ARM_N) according to Mellanox Lx III user manual for scatter/gathering > > RDMA. However I couldn't find it in current MLX driver. It seems to me > > that only ARM_NEXT and ARM_SOLICIT are implemented. So if there are > > multiple work requests, I have to use "poll_cq" to synchronously wait > > until all the requests are done, is it correct? Is there a way to do > > asynchronous multiple send by subscribing for a ARM_N event? > You are right: the low level drivers of Mellanox devices doesn't support > ARM-N > (This feature is supported by the devices, but it wasn't implemented in > the low level drivers). > > You are right, in order to read all of the completions you need to use > poll_cq. > > By the way: Do you have you have to create a completion for any WR? > (if you are using one QP, this will maybe solve your problem). > > Dotan > > _____________________________________________________________________________ > Scanned by IBM Email Security Management Services powered by > MessageLabs. For more information please visit http://www.ers.ibm.com > _____________________________________________________________________________ > > > _____________________________________________________________________________ > Scanned by IBM Email Security Management Services powered by > MessageLabs. For more information please visit http://www.ers.ibm.com > _____________________________________________________________________________ _____________________________________________________________________________ Scanned by IBM Email Security Management Services powered by MessageLabs. 
From dotanba at gmail.com Wed May 28 12:20:52 2008 From: dotanba at gmail.com (Dotan Barak) Date: Wed, 28 May 2008 21:20:52 +0200 Subject: [ofa-general] MLX HCA: CQ request notification for multiple completions not implemented? In-Reply-To: References: Message-ID: <483DB094.4000706@gmail.com> Yicheng Jia wrote: > > >> My concern is the performance. The overhead of software poll_cq loop > >> is quite significant if there are multiple pieces of small amount of > >> data to be transferred on both sender/receiver sides. For instance, on > >> the sender, the data I have are 1k, 1k, 2k, 1k..., on the receiver > >> side, the data size and blocks are the same, 1k, 1k, 2k, 1k.... Do you > >> have a good solution for such kind of problem? > > >How many QPs do you use? > >(and how outstanding WR from every QP?) > > Only one QP. Is it better to alloc multiple QPs and evenly distribute > WRs among those QPs? It depends on what you are trying to do ... You don't have to ask for a completion for every SR that you post; this way you can do some optimization. Dotan > > Best, > Yicheng > > > > *Dotan Barak * > > 05/28/2008 12:24 PM > > > To > Yicheng Jia > cc > general at lists.openfabrics.org > Subject > Re: [ofa-general] MLX HCA: CQ request notification for multiple > completions not implemented? > > > > > > > > > > Yicheng Jia wrote: > > > > Thanks for your reply. I'm using one CQ for all the WRs. Do you know > > why there's no ARM-N support in MLX drivers? > I don't know if i can speak in the name of Mellanox/MLX driver > maintainers, but i think that the > reason is lack of demand for this feature (but i can't be sure). > > > My concern is the performance. The overhead of software poll_cq loop > > is quite significant if there are multiple pieces of small amount of > > data to be transferred on both sender/receiver sides. For instance, on > > the sender, the data I have are 1k, 1k, 2k, 1k..., on the receiver > > side, the data size and blocks are the same, 1k, 1k, 2k, 1k.... Do you > > have a good solution for such kind of problem? > How many QPs do you use? > (and how outstanding WR from every QP?) > > Dotan > > Best, > > Yicheng > > > > > > > > *Dotan Barak * > > > > 05/23/2008 01:27 PM > > > > > > To > > Yicheng Jia > > cc > > general at lists.openfabrics.org > > Subject > > Re: [ofa-general] MLX HCA: CQ request notification > for multiple > > completions not implemented? > > > > > > > > > > > > > > > > > > > > Hi. > > > > Yicheng Jia wrote: > > > > > > Hi Folks, > > > > > > I'm trying to use CQ Event notification for multiple completions > > > (ARM_N) according to Mellanox Lx III user manual for scatter/gathering > > > RDMA. However I couldn't find it in current MLX driver. It seems to me > > > that only ARM_NEXT and ARM_SOLICIT are implemented. So if there are > > > multiple work requests, I have to use "poll_cq" to synchronously wait > > > until all the requests are done, is it correct? Is there a way to do > > > asynchronous multiple send by subscribing for a ARM_N event? > > You are right: the low level drivers of Mellanox devices doesn't support > > ARM-N > > (This feature is supported by the devices, but it wasn't implemented in > > the low level drivers). > > > > You are right, in order to read all of the completions you need to use > > poll_cq. > > > > By the way: Do you have to create a completion for any WR? > > (if you are using one QP, this will maybe solve your problem). > > > > Dotan > > > > > _____________________________________________________________________________ > > Scanned by IBM Email Security Management Services powered by > > MessageLabs. For more information please visit http://www.ers.ibm.com > > > _____________________________________________________________________________ > > < http://www.ers.ibm.com/> > > > > > _____________________________________________________________________________ > > Scanned by IBM Email Security Management Services powered by > > MessageLabs. For more information please visit http://www.ers.ibm.com > > > _____________________________________________________________________________ > > > _____________________________________________________________________________ > Scanned by IBM Email Security Management Services powered by > MessageLabs. For more information please visit > http://www.ers.ibm.com > _____________________________________________________________________________ > > > _____________________________________________________________________________ > Scanned by IBM Email Security Management Services powered by > MessageLabs. For more information please visit http://www.ers.ibm.com > _____________________________________________________________________________
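A minimal sketch of the optimization Dotan is hinting at: post most sends unsignaled and request a completion only every Nth WR, so a single poll_cq reaps a whole batch. It assumes a QP created with sq_sig_all = 0; SIG_EVERY and the helper name are illustrative:

    #include <infiniband/verbs.h>

    #define SIG_EVERY 16    /* must stay well below the send queue depth */

    static int post_send_amortized(struct ibv_qp *qp, struct ibv_send_wr *wr,
                                   unsigned int *posted)
    {
            struct ibv_send_wr *bad_wr;

            wr->send_flags &= ~IBV_SEND_SIGNALED;
            if (++(*posted) % SIG_EVERY == 0)
                    wr->send_flags |= IBV_SEND_SIGNALED;  /* one WC per batch */
            return ibv_post_send(qp, wr, &bad_wr);
    }

Since work requests on a QP's send queue complete in order, the completion of the signaled WR implies that all earlier unsignaled WRs on that QP have completed; the signaling period has to stay smaller than the send queue depth or the queue fills with unreclaimable slots.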
From sean.hefty at intel.com Wed May 28 11:53:57 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 28 May 2008 11:53:57 -0700 Subject: [ofa-general] RE: [RFC V4 PATCH 3/5] rdma/cma: simply locking needed for serialization of callbacks In-Reply-To: References: Message-ID: <000801c8c0f4$2f2cfd90$6758180a@amr.corp.intel.com> >-static int cma_disable_remove(struct rdma_id_private *id_priv, >+static int cma_disable_callback(struct rdma_id_private *id_priv, > enum cma_state state) > { > unsigned long flags; > int ret; > >+ mutex_lock(&id_priv->handler_mutex); > spin_lock_irqsave(&id_priv->lock, flags); >- if (id_priv->state == state) { >- atomic_inc(&id_priv->dev_remove); >+ if (id_priv->state == state) > ret = 0; >- } else >+ else { >+ mutex_unlock(&id_priv->handler_mutex); > ret = -EINVAL; >+ } > spin_unlock_irqrestore(&id_priv->lock, flags); > return ret; > } I wasn't clear on this before, but we shouldn't need to take the spinlock here at all now. We needed it before in order to check the state and increment dev_remove in one operation. Once the spinlock was released the state could have changed, but dev_remove would have halted the device removal thread. Under the new method, device removal is halted while we hold the handler_mutex. >@@ -2566,8 +2560,8 @@ static int cma_ib_mc_handler(int status, > int ret; > > id_priv = mc->id_priv; >- if (cma_disable_remove(id_priv, CMA_ADDR_BOUND) && >- cma_disable_remove(id_priv, CMA_ADDR_RESOLVED)) >+ if (cma_disable_callback(id_priv, CMA_ADDR_BOUND) && >+ cma_disable_callback(id_priv, CMA_ADDR_RESOLVED)) This can end up trying to acquire the mutex twice.
We could change this to mutex_lock(); if (id_priv->state == CMA_ADDR_BOUND || id_priv->state == CMA_ADDR_RESOLVED) - Sean From sean.hefty at intel.com Wed May 28 12:06:00 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 28 May 2008 12:06:00 -0700 Subject: [ofa-general] RE: [RFC V4 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_ADDR_CHANGE notification In-Reply-To: References: Message-ID: <000901c8c0f5$ddee4680$6758180a@amr.corp.intel.com> >+static void cma_ndev_work_handler(struct work_struct *_work) >+{ >+ struct cma_ndev_work *work = container_of(_work, struct cma_ndev_work, >work); >+ struct rdma_id_private *id_priv = work->id; >+ int destroy = 0; >+ >+ mutex_lock(&id_priv->handler_mutex); >+ if (id_priv->state == CMA_DESTROYING) We should probably skip id_priv->state == CMA_DEVICE_REMOVAL as well. >@@ -2723,6 +2751,63 @@ void rdma_leave_multicast(struct rdma_cm > } > EXPORT_SYMBOL(rdma_leave_multicast); > >+static int cma_netdev_change(struct net_device *ndev, struct rdma_id_private >*id_priv) >+{ >+ struct rdma_dev_addr *dev_addr; >+ struct cma_ndev_work *work; >+ >+ dev_addr = &id_priv->id.route.addr.dev_addr; >+ >+ if (!memcmp(dev_addr->src_dev_name, ndev->name, IFNAMSIZ) && >+ memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) { >+ printk(KERN_ERR "addr change for device %s used by id %p, >notifying\n", Is KERN_ERR what we want here? >+static int cma_netdev_callback(struct notifier_block *self, unsigned long >event, >+ void *ctx) >+{ >+ struct net_device *ndev = (struct net_device *)ctx; >+ struct cma_device *cma_dev; >+ struct rdma_id_private *id_priv; >+ int ret = NOTIFY_DONE; >+ >+ if (dev_net(ndev) != &init_net) >+ return NOTIFY_DONE; >+ >+ if (event != NETDEV_BONDING_FAILOVER) >+ return NOTIFY_DONE; >+ >+ if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING)) >+ return NOTIFY_DONE; >+ >+ mutex_lock(&lock); >+ list_for_each_entry(cma_dev, &dev_list, list) >+ list_for_each_entry(id_priv, &cma_dev->id_list, list) { >+ ret = cma_netdev_change(ndev, id_priv); >+ if (ret) >+ break; Should this be goto (mutex_unlock) instead? Okay - I think we're pretty close on the rdma_cm side of things. Thanks. - Sean From jlentini at netapp.com Wed May 28 14:24:20 2008 From: jlentini at netapp.com (James Lentini) Date: Wed, 28 May 2008 17:24:20 -0400 (EDT) Subject: [ofa-general] [PATCH RFC v4 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <483CBDF0.7030209@opengridcomputing.com> References: <20080527183429.32168.14351.stgit@dell3.ogc.int> <20080527183549.32168.22959.stgit@dell3.ogc.int> <483CBDF0.7030209@opengridcomputing.com> Message-ID: On Tue, 27 May 2008, Steve Wise wrote: > > enum ib_send_flags { > > @@ -676,6 +683,19 @@ struct ib_send_wr { > > u16 pkey_index; /* valid for GSI only */ > > u8 port_num; /* valid for DR SMPs on switch > > only */ > > } ud; > > + struct { > > + u64 iova_start; > > + struct ib_mr *mr; > > + struct ib_fast_reg_page_list *page_list; > > + unsigned int page_shift; > > + unsigned int page_list_len; > > + unsigned int first_byte_offset; > > + u32 length; > > + int access_flags; > > + } fast_reg; > > + struct { > > + struct ib_mr *mr; > > + } local_inv; > > } wr; > > }; > > Ok, while writing a test case for all this jazz, Could you post the test case when it is ready? An example of how to use this API would be useful. Of course, I realize you are revising the API at the moment... 
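Pending the krping example discussed next, a rough sketch of how a ULP might fill in the fast_reg fields quoted above. The opcode name IB_WR_FAST_REG_MR, the helper's shape, and the page-aligned region are assumptions about the RFC under review, not a statement of the final API:

    #include <linux/string.h>
    #include <rdma/ib_verbs.h>

    /* Sketch only: post a fast-register WR for an already-allocated ib_mr. */
    static int post_fast_reg(struct ib_qp *qp, struct ib_mr *mr,
                             struct ib_fast_reg_page_list *pl, int npages,
                             u64 iova, u32 len, int access)
    {
            struct ib_send_wr wr, *bad_wr;

            memset(&wr, 0, sizeof wr);
            wr.opcode = IB_WR_FAST_REG_MR;             /* assumed opcode name */
            wr.send_flags = IB_SEND_SIGNALED;
            wr.wr.fast_reg.iova_start = iova;
            wr.wr.fast_reg.mr = mr;
            wr.wr.fast_reg.page_list = pl;             /* dma addrs of the pages */
            wr.wr.fast_reg.page_shift = PAGE_SHIFT;
            wr.wr.fast_reg.page_list_len = npages;
            wr.wr.fast_reg.first_byte_offset = 0;      /* region is page aligned */
            wr.wr.fast_reg.length = len;
            wr.wr.fast_reg.access_flags = access;      /* e.g. IB_ACCESS_REMOTE_WRITE */
            return ib_post_send(qp, &wr, &bad_wr);
    }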
From swise at opengridcomputing.com Wed May 28 14:29:12 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 28 May 2008 16:29:12 -0500 Subject: [ofa-general] [PATCH RFC v4 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080527183429.32168.14351.stgit@dell3.ogc.int> <20080527183549.32168.22959.stgit@dell3.ogc.int> <483CBDF0.7030209@opengridcomputing.com> Message-ID: <483DCEA8.20505@opengridcomputing.com> James Lentini wrote: > On Tue, 27 May 2008, Steve Wise wrote: > > >>> enum ib_send_flags { >>> @@ -676,6 +683,19 @@ struct ib_send_wr { >>> u16 pkey_index; /* valid for GSI only */ >>> u8 port_num; /* valid for DR SMPs on switch >>> only */ >>> } ud; >>> + struct { >>> + u64 iova_start; >>> + struct ib_mr *mr; >>> + struct ib_fast_reg_page_list *page_list; >>> + unsigned int page_shift; >>> + unsigned int page_list_len; >>> + unsigned int first_byte_offset; >>> + u32 length; >>> + int access_flags; >>> + } fast_reg; >>> + struct { >>> + struct ib_mr *mr; >>> + } local_inv; >>> } wr; >>> }; >>> >> Ok, while writing a test case for all this jazz, >> > > Could you post the test case when it is ready? An example of how to > use this API would be useful. Of course, I realize you are revising > the API at the moment... > Yes, I have already said I'll post a test case. :) The krping tool will be the culprit. It's the kernel equivalent of rping and has been around for a long time in one form or another. It is available at git://git.openfabrics.org/~swise/krping It currently supports dma mrs and regular mrs only. I'm adding fastreg support now. And I want to add mw too. Steve. From jon at opengridcomputing.com Wed May 28 15:55:49 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Wed, 28 May 2008 17:55:49 -0500 Subject: [ofa-general] Port space sharing in RDS Message-ID: <20080528225549.GC6288@opengridcomputing.com> During RDS init, rds_ib_init and rds_tcp_init will both individually bind to the RDS port for all IP addresses. Unfortunately, that will not work for iWARP for 2 major reasons. Firstly, the binding of a CM ID to IP address via rdma_bind_addr on INADDR_ANY will cause the first caller to bind to all IP addresses/Devices (and the subsequent calls will fail). So whichever module (IB or iWARP) that is called first will break the second (and cause the module loading to abort). Secondly, iWARP and the Linux stack must share the same port space. If bound to the same port, the RNIC will not be able to tell if the incoming RDS packet is for the TCP or the IB/iWARP RDS module. It appears to be preferring the IB/iWARP RDS module and not passing the packet to the TCP RDS module. Thus currently, iWARP adapters will not work if both TCP and IB are enabled in RDS. Regardless of whether iWARP support is separate or rolled into IB, these issues need to be resolved. I am open to suggestions about how to go about correcting this. One idea to correct the first issue: we can have the bind and all other device specific setup of both IB and iWARP handled by a single function which will then, based on node_type, handle the IB or iWARP case. The second issue is more complicated, as there is currently no way for the rdma_bind_addr to know if the port is already in use and vice versa. Obviously, we can make TCP/IWARP inversely dependent on each other during compile time, but I'm not sure that is a good long term strategy. Thoughts?
Thanks, Jon From sean.hefty at intel.com Wed May 28 16:33:06 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 28 May 2008 16:33:06 -0700 Subject: [ofa-general] Port space sharing in RDS In-Reply-To: <20080528225549.GC6288@opengridcomputing.com> References: <20080528225549.GC6288@opengridcomputing.com> Message-ID: <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> >During RDS init, rds_ib_init and rds_tcp_init will both individually bind to >the >RDS port for all IP addresses. Unfortunately, that will not work for iWARP for >2 major reasons. Can RDS use different port numbers for its RDMA and TCP protocols? The wire protocols end up being different when running over TCP versus iWarp. - Sean From jon at opengridcomputing.com Wed May 28 17:03:54 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Wed, 28 May 2008 19:03:54 -0500 Subject: [ofa-general] Port space sharing in RDS In-Reply-To: <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> References: <20080528225549.GC6288@opengridcomputing.com> <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> Message-ID: <20080529000354.GD6288@opengridcomputing.com> On Wed, May 28, 2008 at 04:33:06PM -0700, Sean Hefty wrote: > >During RDS init, rds_ib_init and rds_tcp_init will both individually bind to > >the > >RDS port for all IP addresses. Unfortunately, that will not work for iWARP for > >2 major reasons. > > Can RDS use different port numbers for its RDMA and TCP protocols? The wire I do not know if this is desirable, but a quick test shows that having TCP and IB on different ports works around the problem. > protocols end up being different when running over TCP versus iWarp. > > - Sean > From chu11 at llnl.gov Wed May 28 17:14:59 2008 From: chu11 at llnl.gov (Al Chu) Date: Wed, 28 May 2008 17:14:59 -0700 Subject: [ofa-general] [OpenSM] [PATCH 0/3] New "port-offsetting" option to updn/minhop routing In-Reply-To: <1207861815.7695.160.camel@cardanus.llnl.gov> References: <1207861815.7695.160.camel@cardanus.llnl.gov> Message-ID: <1212020100.31760.154.camel@cardanus.llnl.gov> Hey Sasha, Attached are some numbers from a recent run I did with my port offsetting patches. I ran w/ mvapich 0.9.9 and OpenMPI 1.2.6 on 120 nodes. I ran w/ 1 task per node or 8 tasks per node (nodes have 8 processors each), trying LMC=0, LMC=1, and LMC=2 with the original 'updn', then LMC=1 and LMC=2 with my port-offsetting patch (labeled "PO"). Next to these columns are the percentage worse the numbers are in comparison to LMC=0. My understanding is that mvapich 0.9.9 does not know how to take advantage of multiple lids while openMPI 1.2.6 does know how to take advantage of it. I think the key numbers to notice are that without port-offsetting, performance relative to LMC=0 is pretty bad when the MPI implementation does not know how to take advantage of multiple lids (mvapich 0.9.9). LMC=1 shows ~30% performance degradation and LMC=2 shows ~90% degradation on this cluster. With the port-offsetting turned on, the degradation falls to 0%-6%, a few times even being faster. We consider this within "noise" levels. For MPIs that do know how to take advantage of multiple lids it seems that the port-offsetting patch doesn't affect performance that much. (See OpenMPI 1.2.6 sections). PLMK what you think. Thanks. Al On Thu, 2008-04-10 at 14:10 -0700, Al Chu wrote: > Hey Sasha, > > I was going to submit this after I had a chance to test on one of our > big clusters to see if it worked 100% right. But my final testing has > been delayed (for a month now!). 
Ira said some folks from Sonoma were > interested in this, so I'll go ahead and post it. > > This is a patch for something I call "port_offsetting" (name/description > of the option is open to suggestion). Basically, we want to move to > using lmc > 0 on our clusters b/c some of the newer MPI implementations > take advantage of multiple lids and have shown faster performance when > lmc > 0. > > The problem is that those users that do not use the newer MPI > implementations, or do not run their code in a way that can take > advantage of multiple lids, suffer great performance degradation in > their code. We determined that the primary issue is what we started > calling "base lid alignment". Here's a simple example. > > Assume LMC = 2 and we are trying to route the lids of 4 ports (A,B,C,D). > Those lids are: > > port A - 1,2,3,4 > port B - 5,6,7,8 > port C - 9,10,11,12 > port D - 13,14,15,16 > > Suppose forwarding of these lids goes through 4 switch ports. If we > cycle through the ports like updn/minhop currently do, we would see > something like this. > > switch port 1: 1, 5, 9, 13 > switch port 2: 2, 6, 10, 14 > switch port 3: 3, 7, 11, 15 > switch port 4: 4, 8, 12, 16 > > Note that the base lid of each port (lids 1, 5, 9, 13) goes through only > 1 port of the switch. Thus a user that uses only the base lid is using > only 1 port out of the 4 ports they could be using. Leading to terrible > performance. > > We want to get this instead. > > switch port 1: 1, 8, 11, 14 > switch port 2: 2, 5, 12, 15 > switch port 3: 3, 6, 9, 16 > switch port 4: 4, 7, 10, 13 > > where base lids are distributed in a more even manner. > > In order to do this, we (effectively) iterate through all ports like > before, but we iterate starting at a different index depending on the > number of paths we have routed thus far. > > On one of our clusters, some testing has shown when we run w/ LMC=1 and > 1 task per node, mpibench (AlltoAll tests) range from 10-30% worse than > when LMC=0 is used. With LMC=2, mpibench tends to be 50-70% worse in > performance than with LMC=0. > > With the port offsetting option, the performance degradation ranges 1-5% > worse than LMC=0. I am currently at a loss why I cannot get it to be > even to LMC=0, but 1-5% is small enough to not make users mad :-) > > The part I haven't been able to test yet is whether newer MPIs that do > take advantage of LMC > 0 run equally when my port_offsetting is turned > off and on. That's the part I'm still haven't been able to test. > > Thanks, look forward to your comments, > > Al > > -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: mpi_port_offsetting.xls Type: application/vnd.ms-excel Size: 17408 bytes Desc: not available URL: From weiny2 at llnl.gov Wed May 28 17:57:19 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 28 May 2008 17:57:19 -0700 Subject: [ofa-general] [PATCH] infiniband-diags/scripts/ibprint[ca|switch|rt].pl: fix printing by name Message-ID: <20080528175719.06d6ae15.weiny2@llnl.gov> I guess when I added support to search by GUID I must have broken the printing by name. This changes these scripts to use the common convention of "-G" to specify that a GUID is to be searched for and fixes the printing when a name is specified. 
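(For illustration, assuming a hypothetical switch whose GUID is 0x0008f10400411f56 and whose node description is "ISR9024D Voltaire", the two lookup forms would now be spelled: ibprintswitch.pl -G 0x0008f10400411f56 versus ibprintswitch.pl "ISR9024D Voltaire". Without -G, the argument is matched against node descriptions, delimited by whitespace or quotes, in the cached ibnetdiscover output.)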
Ira >From e9b6766bd6b3661a5bc1e78c9e95784a99a631c0 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Fri, 23 May 2008 16:19:57 -0700 Subject: [PATCH] infiniband-diags/scripts/ibprint[ca|switch|rt].pl: fix printing by name Signed-off-by: Ira K. Weiny --- infiniband-diags/scripts/ibprintca.pl | 15 +++++++++++---- infiniband-diags/scripts/ibprintrt.pl | 15 +++++++++++---- infiniband-diags/scripts/ibprintswitch.pl | 15 +++++++++++---- 3 files changed, 33 insertions(+), 12 deletions(-) diff --git a/infiniband-diags/scripts/ibprintca.pl b/infiniband-diags/scripts/ibprintca.pl index 38b4330..b13a83b 100755 --- a/infiniband-diags/scripts/ibprintca.pl +++ b/infiniband-diags/scripts/ibprintca.pl @@ -45,12 +45,13 @@ use IBswcountlimits; sub usage_and_exit { my $prog = $_[0]; - print "Usage: $prog [-R -l] []\n"; + print "Usage: $prog [-R -l] [-G | ]\n"; print " print only the ca specified from the ibnetdiscover output\n"; print " -R Recalculate ibnetdiscover information\n"; print " -l list cas\n"; print " -C use selected channel adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; + print " -G node is specified with GUID\n"; exit 0; } @@ -59,15 +60,21 @@ my $regenerate_map = undef; my $list_hcas = undef; my $ca_name = ""; my $ca_port = ""; +my $name_is_guid = "no"; chomp $argv0; -if (!getopts("hRlC:P:")) { usage_and_exit $argv0; } +if (!getopts("hRlC:P:G")) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_h) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_R) { $regenerate_map = $Getopt::Std::opt_R; } if (defined $Getopt::Std::opt_l) { $list_hcas = $Getopt::Std::opt_l; } if (defined $Getopt::Std::opt_C) { $ca_name = $Getopt::Std::opt_C; } if (defined $Getopt::Std::opt_P) { $ca_port = $Getopt::Std::opt_P; } +if (defined $Getopt::Std::opt_G) { $name_is_guid = "yes"; } -my $target_hca = format_guid($ARGV[0]); +my $target_hca = $ARGV[0]; + +if ($name_is_guid eq "yes") { + $target_hca = format_guid($target_hca); +} my $cache_file = get_cache_file($ca_name, $ca_port); @@ -100,7 +107,7 @@ sub main $in_hca = "no"; goto DONE; } - if ("0x$guid" eq $target_hca || $desc =~ /.*$target_hca.*/) { + if ("0x$guid" eq $target_hca || $desc =~ /[\s\"]$target_hca[\s\"]/) { print $line; $in_hca = "yes"; $found_hca = "yes"; diff --git a/infiniband-diags/scripts/ibprintrt.pl b/infiniband-diags/scripts/ibprintrt.pl index 86dcb64..e9e6cc4 100755 --- a/infiniband-diags/scripts/ibprintrt.pl +++ b/infiniband-diags/scripts/ibprintrt.pl @@ -45,12 +45,13 @@ use IBswcountlimits; sub usage_and_exit { my $prog = $_[0]; - print "Usage: $prog [-R -l] []\n"; + print "Usage: $prog [-R -l] [-G | ]\n"; print " print only the rt specified from the ibnetdiscover output\n"; print " -R Recalculate ibnetdiscover information\n"; print " -l list rts\n"; print " -C use selected channel adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; + print " -G node is specified with GUID\n"; exit 0; } @@ -59,15 +60,21 @@ my $regenerate_map = undef; my $list_rts = undef; my $ca_name = ""; my $ca_port = ""; +my $name_is_guid = "no"; chomp $argv0; -if (!getopts("hRlC:P:")) { usage_and_exit $argv0; } +if (!getopts("hRlC:P:G")) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_h) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_R) { $regenerate_map = $Getopt::Std::opt_R; } if (defined $Getopt::Std::opt_l) { $list_rts = $Getopt::Std::opt_l; } if (defined $Getopt::Std::opt_C) { $ca_name = $Getopt::Std::opt_C; } if (defined $Getopt::Std::opt_P) { $ca_port = 
$Getopt::Std::opt_P; } +if (defined $Getopt::Std::opt_G) { $name_is_guid = "yes"; } -my $target_rt = format_guid($ARGV[0]); +my $target_rt = $ARGV[0]; + +if ($name_is_guid eq "yes") { + $target_rt = format_guid($target_rt); +} my $cache_file = get_cache_file($ca_name, $ca_port); @@ -100,7 +107,7 @@ sub main $in_rt = "no"; goto DONE; } - if ("0x$guid" eq $target_rt || $desc =~ /.*$target_rt.*/) { + if ("0x$guid" eq $target_rt || $desc =~ /[\s\"]$target_rt[\s\"]/) { print $line; $in_rt = "yes"; $found_rt = "yes"; diff --git a/infiniband-diags/scripts/ibprintswitch.pl b/infiniband-diags/scripts/ibprintswitch.pl index 6712201..148d70e 100755 --- a/infiniband-diags/scripts/ibprintswitch.pl +++ b/infiniband-diags/scripts/ibprintswitch.pl @@ -44,12 +44,13 @@ use IBswcountlimits; sub usage_and_exit { my $prog = $_[0]; - print "Usage: $prog [-R -l] []\n"; + print "Usage: $prog [-R -l] [-G | ]\n"; print " print only the switch specified from the ibnetdiscover output\n"; print " -R Recalculate ibnetdiscover information\n"; print " -l list switches\n"; print " -C use selected channel adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; + print " -G node is specified with GUID\n"; exit 0; } @@ -58,15 +59,21 @@ my $regenerate_map = undef; my $list_switches = undef; my $ca_name = ""; my $ca_port = ""; +my $name_is_guid = "no"; chomp $argv0; -if (!getopts("hRlC:P:")) { usage_and_exit $argv0; } +if (!getopts("hRlC:P:G")) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_h) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_R) { $regenerate_map = $Getopt::Std::opt_R; } if (defined $Getopt::Std::opt_l) { $list_switches = $Getopt::Std::opt_l; } if (defined $Getopt::Std::opt_C) { $ca_name = $Getopt::Std::opt_C; } if (defined $Getopt::Std::opt_P) { $ca_port = $Getopt::Std::opt_P; } +if (defined $Getopt::Std::opt_G) { $name_is_guid = "yes"; } -my $target_switch = format_guid($ARGV[0]); +my $target_switch = $ARGV[0]; + +if ($name_is_guid eq "yes") { + $target_switch = format_guid($target_switch); +} my $cache_file = get_cache_file($ca_name, $ca_port); @@ -99,7 +106,7 @@ sub main $in_switch = "no"; goto DONE; } - if ("0x$guid" eq $target_switch || $desc =~ /.*$target_switch.*/) { + if ("0x$guid" eq $target_switch || $desc =~ /[\s\"]$target_switch[\s\"]/) { print $line; $in_switch = "yes"; $found_switch = "yes"; -- 1.5.4.5 From weiny2 at llnl.gov Wed May 28 17:59:41 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 28 May 2008 17:59:41 -0700 Subject: [ofa-general] [PATCH] infiniband-diags/scripts/ibprint[ca|switch|rt].pl: allow printing of multiple matches but print warning to user that multiple matches were found Message-ID: <20080528175941.41425dac.weiny2@llnl.gov> I think it is useful to print multiple matches found when searching for matches. Specifically when switches have not been named, ie they all have some "Mellanox..." or "Voltaire..." name. This prints all matches but also warns the user at the end that it found X matches. Ira >From 11b85c9b526b9067aa12eac5d445d8ee43a7d024 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Fri, 23 May 2008 16:25:19 -0700 Subject: [PATCH] infiniband-diags/scripts/ibprint[ca|switch|rt].pl: allow printing of multiple matches but print warning to user that multiple matches were found Signed-off-by: Ira K. 
Weiny --- infiniband-diags/scripts/ibprintca.pl | 17 +++++++++-------- infiniband-diags/scripts/ibprintrt.pl | 17 +++++++++-------- infiniband-diags/scripts/ibprintswitch.pl | 17 +++++++++-------- 3 files changed, 27 insertions(+), 24 deletions(-) diff --git a/infiniband-diags/scripts/ibprintca.pl b/infiniband-diags/scripts/ibprintca.pl index b13a83b..ccd5473 100755 --- a/infiniband-diags/scripts/ibprintca.pl +++ b/infiniband-diags/scripts/ibprintca.pl @@ -95,7 +95,7 @@ if ($target_hca eq "") { # sub main { - my $found_hca = undef; + my $found_hca = 0; open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; my $in_hca = "no"; my %ports = undef; @@ -105,12 +105,14 @@ sub main my $desc = $2; if ($in_hca eq "yes") { $in_hca = "no"; - goto DONE; + foreach my $port (sort { $a <=> $b } (keys %ports)) { + print $ports{$port}; + } } if ("0x$guid" eq $target_hca || $desc =~ /[\s\"]$target_hca[\s\"]/) { print $line; $in_hca = "yes"; - $found_hca = "yes"; + $found_hca++; } } if ($line =~ /^Switch.*/ || $line =~ /^Rt.*/) { $in_hca = "no"; } @@ -120,15 +122,14 @@ sub main } } - DONE: - foreach my $port (sort { $a <=> $b } (keys %ports)) { - print $ports{$port}; - } - if (!$found_hca) { + if ($found_hca == 0) { print "\"$target_hca\" not found\n"; print " Try running with the \"-R\" option.\n"; print " If still not found the node is probably down.\n"; } + if ($found_hca > 1) { + print "\nWARNING: Found $found_hca CA's with the name \"$target_hca\"\n"; + } close IBNET_TOPO; } main diff --git a/infiniband-diags/scripts/ibprintrt.pl b/infiniband-diags/scripts/ibprintrt.pl index e9e6cc4..4b83ff0 100755 --- a/infiniband-diags/scripts/ibprintrt.pl +++ b/infiniband-diags/scripts/ibprintrt.pl @@ -95,7 +95,7 @@ if ($target_rt eq "") { # sub main { - my $found_rt = undef; + my $found_rt = 0; open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; my $in_rt = "no"; my %ports = undef; @@ -105,12 +105,14 @@ sub main my $desc = $2; if ($in_rt eq "yes") { $in_rt = "no"; - goto DONE; + foreach my $port (sort { $a <=> $b } (keys %ports)) { + print $ports{$port}; + } } if ("0x$guid" eq $target_rt || $desc =~ /[\s\"]$target_rt[\s\"]/) { print $line; $in_rt = "yes"; - $found_rt = "yes"; + $found_rt++; } } if ($line =~ /^Switch.*/ || $line =~ /^Ca.*/) { $in_rt = "no"; } @@ -120,15 +122,14 @@ sub main } } - DONE: - foreach my $port (sort { $a <=> $b } (keys %ports)) { - print $ports{$port}; - } - if (!$found_rt) { + if ($found_rt == 0) { print "\"$target_rt\" not found\n"; print " Try running with the \"-R\" option.\n"; print " If still not found the node is probably down.\n"; } + if ($found_rt > 1) { + print "\nWARNING: Found $found_rt Router's with the name \"$target_rt\"\n"; + } close IBNET_TOPO; } main diff --git a/infiniband-diags/scripts/ibprintswitch.pl b/infiniband-diags/scripts/ibprintswitch.pl index 148d70e..9426673 100755 --- a/infiniband-diags/scripts/ibprintswitch.pl +++ b/infiniband-diags/scripts/ibprintswitch.pl @@ -94,7 +94,7 @@ if ($target_switch eq "") { # sub main { - my $found_switch = undef; + my $found_switch = 0; open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; my $in_switch = "no"; my %ports = undef; @@ -104,12 +104,14 @@ sub main my $desc = $2; if ($in_switch eq "yes") { $in_switch = "no"; - goto DONE; + foreach my $port (sort { $a <=> $b } (keys %ports)) { + print $ports{$port}; + } } if ("0x$guid" eq $target_switch || $desc =~ /[\s\"]$target_switch[\s\"]/) { print $line; $in_switch = "yes"; - $found_switch = "yes"; + $found_switch++; } } 
if ($line =~ /^Ca.*/) { $in_switch = "no"; } @@ -119,14 +121,13 @@ sub main } } - DONE: - foreach my $port (sort { $a <=> $b } (keys %ports)) { - print $ports{$port}; - } - if (!$found_switch) { + if ($found_switch == 0) { print "Switch \"$target_switch\" not found\n"; print " Try running with the \"-R\" option.\n"; } + if ($found_switch > 1) { + print "\nWARNING: Found $found_switch switches with the name \"$target_switch\"\n"; + } close IBNET_TOPO; } main -- 1.5.4.5 From chu11 at llnl.gov Wed May 28 19:30:52 2008 From: chu11 at llnl.gov (Al Chu) Date: Wed, 28 May 2008 22:30:52 -0400 Subject: [ofa-general] [OpenSM] [PATCH 0/3] New "port-offsetting" option to updn/minhop routing In-Reply-To: <1212020100.31760.154.camel@cardanus.llnl.gov> References: <1207861815.7695.160.camel@cardanus.llnl.gov> <1212020100.31760.154.camel@cardanus.llnl.gov> Message-ID: <1212028252.6913.3.camel@whatsup> Oops, I forgot about one other important measurement we did. The following are the Average Send/Receive MPI bandwidths as measured by mpigraph (http://sourceforge.net/projects/mpigraph). Again, using updn routing. LMC=0 Send 391 MB/s Recv 461 MB/s LMC=1 Send 292 MB/s Recv 358 MB/s LMC=2 Send 197 MB/s Recv 241 MB/s with my port offsetting turned on. I got LMC=1 Send 387 MB/s Recv 457 MB/s LMC=2 Send 383 MB/s Recv 455 MB/s So similar to the AlltoAll MPI tests, the port offsetting gets the numbers back to about what they were at LMC=0. Al On Wed, 2008-05-28 at 17:14 -0700, Al Chu wrote: > Hey Sasha, > > Attached are some numbers from a recent run I did with my port > offsetting patches. I ran w/ mvapich 0.9.9 and OpenMPI 1.2.6 on 120 > nodes. I ran w/ 1 task per node or 8 tasks per node (nodes have 8 > processors each), trying LMC=0, LMC=1, and LMC=2 with the original > 'updn', then LMC=1 and LMC=2 with my port-offsetting patch (labeled > "PO"). Next to these columns are the percentage worse the numbers are > in comparison to LMC=0. My understanding is that mvapich 0.9.9 does not > know how to take advantage of multiple lids while openMPI 1.2.6 does > know how to take advantage of it. > > I think the key numbers to notice are that without port-offsetting, > performance relative to LMC=0 is pretty bad when the MPI implementation > does not know how to take advantage of multiple lids (mvapich 0.9.9). > LMC=1 shows ~30% performance degradation and LMC=2 shows ~90% > degradation on this cluster. With the port-offsetting turned on, the > degradation falls to 0%-6%, a few times even being faster. We consider > this within "noise" levels. > > For MPIs that do know how to take advantage of multiple lids it seems > that the port-offsetting patch doesn't affect performance that much. > (See OpenMPI 1.2.6 sections). > > PLMK what you think. Thanks. > > Al > > On Thu, 2008-04-10 at 14:10 -0700, Al Chu wrote: > > Hey Sasha, > > > > I was going to submit this after I had a chance to test on one of our > > big clusters to see if it worked 100% right. But my final testing has > > been delayed (for a month now!). Ira said some folks from Sonoma were > > interested in this, so I'll go ahead and post it. > > > > This is a patch for something I call "port_offsetting" (name/description > > of the option is open to suggestion). Basically, we want to move to > > using lmc > 0 on our clusters b/c some of the newer MPI implementations > > take advantage of multiple lids and have shown faster performance when > > lmc > 0. 
> > > > The problem is that those users that do not use the newer MPI > > implementations, or do not run their code in a way that can take > > advantage of multiple lids, suffer great performance degradation in > > their code. We determined that the primary issue is what we started > > calling "base lid alignment". Here's a simple example. > > > > Assume LMC = 2 and we are trying to route the lids of 4 ports (A,B,C,D). > > Those lids are: > > > > port A - 1,2,3,4 > > port B - 5,6,7,8 > > port C - 9,10,11,12 > > port D - 13,14,15,16 > > > > Suppose forwarding of these lids goes through 4 switch ports. If we > > cycle through the ports like updn/minhop currently do, we would see > > something like this. > > > > switch port 1: 1, 5, 9, 13 > > switch port 2: 2, 6, 10, 14 > > switch port 3: 3, 7, 11, 15 > > switch port 4: 4, 8, 12, 16 > > > > Note that the base lid of each port (lids 1, 5, 9, 13) goes through only > > 1 port of the switch. Thus a user that uses only the base lid is using > > only 1 port out of the 4 ports they could be using. Leading to terrible > > performance. > > > > We want to get this instead. > > > > switch port 1: 1, 8, 11, 14 > > switch port 2: 2, 5, 12, 15 > > switch port 3: 3, 6, 9, 16 > > switch port 4: 4, 7, 10, 13 > > > > where base lids are distributed in a more even manner. > > > > In order to do this, we (effectively) iterate through all ports like > > before, but we iterate starting at a different index depending on the > > number of paths we have routed thus far. > > > > On one of our clusters, some testing has shown when we run w/ LMC=1 and > > 1 task per node, mpibench (AlltoAll tests) range from 10-30% worse than > > when LMC=0 is used. With LMC=2, mpibench tends to be 50-70% worse in > > performance than with LMC=0. > > > > With the port offsetting option, the performance degradation ranges 1-5% > > worse than LMC=0. I am currently at a loss why I cannot get it to be > > even to LMC=0, but 1-5% is small enough to not make users mad :-) > > > > The part I haven't been able to test yet is whether newer MPIs that do > > take advantage of LMC > 0 run equally when my port_offsetting is turned > > off and on. That's the part I'm still haven't been able to test. > > > > Thanks, look forward to your comments, > > > > Al > > > > > -- > Albert Chu > chu11 at llnl.gov > 925-422-5311 > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From sashak at voltaire.com Wed May 28 22:34:18 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 29 May 2008 08:34:18 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/scripts/ibprint[ca|switch|rt].pl: allow printing of multiple matches but print warning to user that multiple matches were found In-Reply-To: <20080528175941.41425dac.weiny2@llnl.gov> References: <20080528175941.41425dac.weiny2@llnl.gov> Message-ID: <20080529053418.GA16570@sashak.voltaire.com> On 17:59 Wed 28 May , Ira Weiny wrote: > I think it is useful to print multiple matches found when searching for > matches. Specifically when switches have not been named, ie they all have some > "Mellanox..." or "Voltaire..." name. 
> > This prints all matches but also warns the user at the end that it found X > matches. > > Ira > > From 11b85c9b526b9067aa12eac5d445d8ee43a7d024 Mon Sep 17 00:00:00 2001 > From: Ira Weiny > Date: Fri, 23 May 2008 16:25:19 -0700 > Subject: [PATCH] infiniband-diags/scripts/ibprint[ca|switch|rt].pl: allow printing of multiple > matches but print warning to user that multiple matches were found > > > Signed-off-by: Ira K. Weiny Both applied. Thanks. Sasha From jackm at dev.mellanox.co.il Wed May 28 22:42:40 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Thu, 29 May 2008 08:42:40 +0300 Subject: [ofa-general] [PATCH] mlx4_core: enable changing default max HCA resource limits at run time -- reposting In-Reply-To: <1211893354.13185.229.camel@hrosenstock-ws.xsigo.com> References: <200804281438.28417.jackm@dev.mellanox.co.il> <1211893354.13185.229.camel@hrosenstock-ws.xsigo.com> Message-ID: <200805290842.40722.jackm@dev.mellanox.co.il> See http://lists.openfabrics.org/pipermail/general/2008-May/049781.html I'll submit a pair of patches incorporating my suggestions (at the end of the post). Roland? - Jack On Tuesday 27 May 2008 16:02, Hal Rosenstock wrote: > On Mon, 2008-04-28 at 14:38 +0300, Jack Morgenstein wrote: > > mlx4-core: enable changing default max HCA resource limits. > > > > Enable module-initialization time modification of default HCA > > maximum resource limits via module parameters, as is done in mthca. > > > > Specify the log of the parameter value, rather than the value itself > > to avoid the hidden side-effect of rounding up values to next power-of-2. > > > > Signed-off-by: Jack Morgenstein > > Sorry if I'm rehashing this but this thread appears to have died out and > I'm not sure about it's status: > > Where do we stand in terms of getting the additional mlx4 module > parameters incorporated ? > > Thanks. > > -- Hal > > From sashak at voltaire.com Wed May 28 22:56:41 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 29 May 2008 08:56:41 +0300 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: <1211988269.27600.32.camel@hrosenstock-ws.xsigo.com> References: <483D62E6.2050107@mellanox.co.il> <1211988269.27600.32.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080529055641.GB16570@sashak.voltaire.com> On 08:24 Wed 28 May , Hal Rosenstock wrote: > On Wed, 2008-05-28 at 10:31 -0400, Talpey, Thomas wrote: > > In addition we needed to install > > libmthca > > which in turn wanted > > libibverbs > > AFAIK, there is no OpenSM requirement for these libraries. Right, it is not needed. Actually it looks like a bug in OFED's install.pl script. > > And, after we successfully loaded it all and started opensm, etc, > > then ipoib wouldn't come up to RUNNING because it couldn't join > > the broadcast group. > > This was likely due to IPoIB needing to be enabled on the default > partition. The default partition is enabled by default. Sasha From sashak at voltaire.com Wed May 28 23:02:56 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 29 May 2008 09:02:56 +0300 Subject: [ofa-general] [PATCH] ofed_1_3_scripts: fix management libib* dependencies In-Reply-To: <483D62E6.2050107@mellanox.co.il> References: <483D62E6.2050107@mellanox.co.il> Message-ID: <20080529060256.GD16570@sashak.voltaire.com> libibcommon, libibumad and libibmad don't require libibverbs. 
Signed-off-by: Sasha Khapyorsky --- install.pl | 30 +++++++++++++++--------------- 1 files changed, 15 insertions(+), 15 deletions(-) diff --git a/install.pl b/install.pl index a795a3e..e533be7 100755 --- a/install.pl +++ b/install.pl @@ -584,28 +584,28 @@ my %packages_info = ( { name => "libibcommon", parent => "libibcommon", selected => 0, installed => 0, rpm_exist => 0, rpm_exist32 => 0, available => 1, mode => "user", dist_req_build => ["libtool"], - dist_req_inst => [], ofa_req_build => ["libibverbs"], - ofa_req_inst => ["libibverbs"], + dist_req_inst => [], ofa_req_build => [], + ofa_req_inst => [], install32 => 1, exception => 0, configure_options => '' }, 'libibcommon-devel' => { name => "libibcommon-devel", parent => "libibcommon", selected => 0, installed => 0, rpm_exist => 0, rpm_exist32 => 0, available => 1, mode => "user", dist_req_build => [], - dist_req_inst => [], ofa_req_build => ["libibverbs","libibverbs-devel"], - ofa_req_inst => ["libibverbs", "libibcommon"], + dist_req_inst => [], ofa_req_build => [], + ofa_req_inst => ["libibcommon"], install32 => 1, exception => 0 }, 'libibcommon-static' => { name => "libibcommon-static", parent => "libibcommon", selected => 0, installed => 0, rpm_exist => 0, rpm_exist32 => 0, available => 1, mode => "user", dist_req_build => [], - dist_req_inst => [], ofa_req_build => ["libibverbs","libibverbs-devel"], - ofa_req_inst => ["libibverbs", "libibcommon"], + dist_req_inst => [], ofa_req_build => [], + ofa_req_inst => ["libibcommon"], install32 => 1, exception => 0 }, 'libibcommon-debuginfo' => { name => "libibcommon-debuginfo", parent => "libibcommon", selected => 0, installed => 0, rpm_exist => 0, rpm_exist32 => 0, available => 1, mode => "user", dist_req_build => [], - dist_req_inst => [], ofa_req_build => ["libibverbs","libibverbs-devel"], + dist_req_inst => [], ofa_req_build => [], ofa_req_inst => [], install32 => 0, exception => 0 }, @@ -613,28 +613,28 @@ my %packages_info = ( { name => "libibumad", parent => "libibumad", selected => 0, installed => 0, rpm_exist => 0, rpm_exist32 => 0, available => 1, mode => "user", dist_req_build => [], - dist_req_inst => [], ofa_req_build => ["libibverbs", "libibcommon-devel"], - ofa_req_inst => ["libibverbs", "libibcommon"], + dist_req_inst => [], ofa_req_build => ["libibcommon-devel"], + ofa_req_inst => ["libibcommon"], install32 => 1, exception => 0, configure_options => '' }, 'libibumad-devel' => { name => "libibumad-devel", parent => "libibumad", selected => 0, installed => 0, rpm_exist => 0, rpm_exist32 => 0, available => 1, mode => "user", dist_req_build => [], - dist_req_inst => [], ofa_req_build => ["libibverbs","libibcommon-devel"], - ofa_req_inst => ["libibverbs", "libibcommon-devel", "libibumad"], + dist_req_inst => [], ofa_req_build => ["libibcommon-devel"], + ofa_req_inst => ["libibcommon-devel", "libibumad"], install32 => 1, exception => 0 }, 'libibumad-static' => { name => "libibumad-static", parent => "libibumad", selected => 0, installed => 0, rpm_exist => 0, rpm_exist32 => 0, available => 1, mode => "user", dist_req_build => [], - dist_req_inst => [], ofa_req_build => ["libibverbs","libibcommon-devel"], - ofa_req_inst => ["libibverbs", "libibcommon-devel", "libibumad"], + dist_req_inst => [], ofa_req_build => ["libibcommon-devel"], + ofa_req_inst => ["libibcommon-devel", "libibumad"], install32 => 1, exception => 0 }, 'libibumad-debuginfo' => { name => "libibumad-debuginfo", parent => "libibumad", selected => 0, installed => 0, rpm_exist => 0, rpm_exist32 => 0, available => 1, 
mode => "user", dist_req_build => [], - dist_req_inst => [], ofa_req_build => ["libibverbs","libibcommon-devel"], + dist_req_inst => [], ofa_req_build => ["libibcommon-devel"], ofa_req_inst => [], install32 => 0, exception => 0 }, @@ -643,7 +643,7 @@ my %packages_info = ( selected => 0, installed => 0, rpm_exist => 0, rpm_exist32 => 0, available => 1, mode => "user", dist_req_build => [], dist_req_inst => [], ofa_req_build => ["libibumad-devel"], - ofa_req_inst => ["libibverbs", "libibumad"], + ofa_req_inst => ["libibumad"], install32 => 1, exception => 0, configure_options => '' }, 'libibmad-devel' => { name => "libibmad-devel", parent => "libibmad", -- 1.5.5.1.178.g1f811 From ogerlitz at voltaire.com Wed May 28 23:23:59 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 29 May 2008 09:23:59 +0300 Subject: [ofa-general] Re: [RFC V4 PATCH 3/5] rdma/cma: simply locking needed for serialization of callbacks In-Reply-To: <000801c8c0f4$2f2cfd90$6758180a@amr.corp.intel.com> References: <000801c8c0f4$2f2cfd90$6758180a@amr.corp.intel.com> Message-ID: <483E4BFF.1090207@voltaire.com> Sean Hefty wrote: > I wasn't clear on this before, but we shouldn't need to take the spinlock here > at all now. We needed it before in order to check the state and increment > dev_remove in one operation. Once the spinlock was released the state could > have changed, but dev_remove would have halted the device removal thread. Under > the new method, device removal is halted while we hold the handler_mutex. OK, got it. > This can end up trying to acquire the mutex twice. We could change this to > mutex_lock(); > if (id_priv->state == CMA_ADDR_BOUND || id_priv->state == CMA_ADDR_RESOLVED) > OK, will do so. From ogerlitz at voltaire.com Wed May 28 23:29:16 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 29 May 2008 09:29:16 +0300 Subject: [ofa-general] Re: [RFC V4 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_ADDR_CHANGE notification In-Reply-To: <000901c8c0f5$ddee4680$6758180a@amr.corp.intel.com> References: <000901c8c0f5$ddee4680$6758180a@amr.corp.intel.com> Message-ID: <483E4D3C.3020600@voltaire.com> Sean Hefty wrote: >> + mutex_lock(&id_priv->handler_mutex); >> + if (id_priv->state == CMA_DESTROYING) > > We should probably skip id_priv->state == CMA_DEVICE_REMOVAL as well. OK > >> + printk(KERN_ERR "addr change for device %s used by id %p, notifying\n", > > Is KERN_ERR what we want here? no, I think we can do well with warning or info level > >> +static int cma_netdev_callback(struct notifier_block *self, unsigned long >> event, void *ctx) >> + mutex_lock(&lock); >> + list_for_each_entry(cma_dev, &dev_list, list) >> + list_for_each_entry(id_priv, &cma_dev->id_list, list) { >> + ret = cma_netdev_change(ndev, id_priv); >> + if (ret) >> + break; > > Should this be goto (mutex_unlock) instead? yes it would be better to have it this way Or From vlad at dev.mellanox.co.il Wed May 28 23:58:18 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 29 May 2008 09:58:18 +0300 Subject: [ofa-general] [PATCH] ofed_1_3_scripts: fix management libib* dependencies In-Reply-To: <20080529060256.GD16570@sashak.voltaire.com> References: <483D62E6.2050107@mellanox.co.il> <20080529060256.GD16570@sashak.voltaire.com> Message-ID: <483E540A.4080701@dev.mellanox.co.il> Sasha Khapyorsky wrote: > libibcommon, libibumad and libibmad don't require libibverbs. 
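(As an aside on the group-join point: with raw libibverbs, attaching a UD QP to a group only programs the local HCA's filter; the SA join, an MCMemberRecord request, is what makes the switch forward the group, and it has to happen separately, e.g. via librdmacm or, as in the test below, by letting IPoIB create the group. A minimal sketch, where the mgid/mlid values are placeholders that would normally come from the SA join reply:

    #include <stdio.h>
    #include <infiniband/verbs.h>

    /* qp must be a UD QP; mgid/mlid must match the SA's MCMemberRecord. */
    static int attach_to_group(struct ibv_qp *qp, union ibv_gid *mgid,
                               uint16_t mlid)
    {
            if (ibv_attach_mcast(qp, mgid, mlid)) {
                    fprintf(stderr, "ibv_attach_mcast failed\n");
                    return -1;
            }
            return 0;   /* local HCA will now deliver the group's packets */
    }
)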
> > Signed-off-by: Sasha Khapyorsky > --- > install.pl | 30 +++++++++++++++--------------- > 1 files changed, 15 insertions(+), 15 deletions(-) > Applied, Regards, Vladimir From marcel.heinz at informatik.tu-chemnitz.de Thu May 29 02:19:28 2008 From: marcel.heinz at informatik.tu-chemnitz.de (Marcel Heinz) Date: Thu, 29 May 2008 11:19:28 +0200 Subject: [ofa-general] Multicast Performance In-Reply-To: <483DA512.2070403@gmail.com> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> <483BBBDB.6000605@informatik.tu-chemnitz.de> <483DA512.2070403@gmail.com> Message-ID: <483E7520.1000302@informatik.tu-chemnitz.de> Hi, Dotan Barak wrote: > Marcel Heinz wrote: >> [low multicast throughput of ~250MB/s with own benchmark tool] > > 1) I know that ib_send_bw supports multicast as well, can you please > check that you can reproduce your problem > on this benchmark too? Well, the last time I've checked this, ib_send_bw didn't support multicast, but this was some months ago. That multicast support seems a bit odd, since it doesn't create/join the multicast groups and there is still a 1:1 TCP connection used to establish the IB connection, so one cannot benchmark "real" multicast scenarios with more than one receiver. However, here are the results (I just used ipoib to let it create some multicast groups for me): | mh at mhtest0:~$ ib_send_bw -c UD -g mhtest1 | ------------------------------------------------------------------ | Send BW Multicast Test | Connection type : UD | Max msg size in UD is 2048 changing to 2048 | Inline data is used up to 400 bytes message | local address: LID 0x01, QPN 0x4a0405, PSN 0x8667a7 | remote address: LID 0x03, QPN 0x4a0405, PSN 0x5d41b6 | Mtu : 2048 | ------------------------------------------------------------------ | #bytes #iterations BW peak[MB/sec] BW average[MB/sec] | 2048 1000 301.12 247.05 | ------------------------------------------------------------------ This is the same result as my own benchmark showed in that scenario. > 2) You should expect that multicast messages will be slower than > unicast because the HCA/switch treat them in different way > (message duplication need to be done if needed). Yes, but 250MB/s vs. 1100MB/s (UD unicast throughput) seems to be a bit too much of overhead, don't you think? Especially if I take into account that with my own benchmark, I can get ~950MB/s when I start another receiver on the same host as the sender. Note that both of the receivers, the local and the remote one, are seeing all packets at that rate, so the HCAs and the switch must be able to handle multicast packets with this throughput. The other strange thing is that multicast traffics slows down other traffic way more than the bandwith it consumes. Moreover, it seems like it limits any other connections to the same throughput than that of the multicast traffic, which looks suspicious to me. 
The same behavior can be reproduced with ib_send_bw, by starting a unicast and multicast run in parallel: | mh at mhtest0:~$ ib_send_bw -c UD mhtest1 & ib_send_bw -c UD -g mhtest1\ | -p 18516 | ./ib_send_bw -c UD -g mhtest1 -p 18516 | [1] 4927 | ------------------------------------------------------------------ | Send BW Test | Connection type : UD | Max msg size in UD is 2048 changing to 2048 | ------------------------------------------------------------------ | Send BW Multicast Test | Connection type : UD | Max msg size in UD is 2048 changing to 2048 | Inline data is used up to 400 bytes message | Inline data is used up to 400 bytes message | local address: LID 0x01, QPN 0x530405, PSN 0xe98523 | local address: LID 0x01, QPN 0x530406, PSN 0x3b338e | remote address: LID 0x03, QPN 0x540405, PSN 0x5c53e2 | Mtu : 2048 | remote address: LID 0x03, QPN 0x540406, PSN 0xff883f | Mtu : 2048 | ------------------------------------------------------------------ | #bytes #iterations BW peak[MB/sec] BW average[MB/sec] | ------------------------------------------------------------------ | #bytes #iterations BW peak[MB/sec] BW average[MB/sec] | 2048 1000 692.41 270.26 | 2048 1000 246.00 244.68 | ------------------------------------------------------------------ | ------------------------------------------------------------------ Doing 2 unicast UD runs in parallel, I'm getting ~650MB/s average bandwidth for each, which sounds reasonable. Also, when using bidirectional mode, I'm getting ~1900MB/s (almost doubled) throughput for unicast, but still ~250MB/s for multicast. Regards, Marcel From tziporet at dev.mellanox.co.il Thu May 29 02:35:05 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 29 May 2008 12:35:05 +0300 Subject: [ofa-general] [PATCH RFC v4 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <483DCEA8.20505@opengridcomputing.com> References: <20080527183429.32168.14351.stgit@dell3.ogc.int> <20080527183549.32168.22959.stgit@dell3.ogc.int> <483CBDF0.7030209@opengridcomputing.com> <483DCEA8.20505@opengridcomputing.com> Message-ID: <483E78C9.7080209@mellanox.co.il> Steve Wise wrote: > Yes, I have already said I'll post a test case. :) > > The krping tool will be the culprit. It's the kernel equivalent of > rping and has been around for a long time in one form or another. > > It is available at git://git.openfabrics.org/~swise/krping > Do you think we should include it in OFED as we include user space examples? Tziporet From ogerlitz at voltaire.com Thu May 29 02:39:57 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 29 May 2008 12:39:57 +0300 Subject: [ofa-general] Re: [RFC V4 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_ADDR_CHANGE notification In-Reply-To: <000901c8c0f5$ddee4680$6758180a@amr.corp.intel.com> References: <000901c8c0f5$ddee4680$6758180a@amr.corp.intel.com> Message-ID: <483E79ED.4050301@voltaire.com> Sean Hefty wrote: > Okay - I think we're pretty close on the rdma_cm side of things. Thanks. > Sean, I have implemented all your last comments, so I think that rdma_cm wise we are kind of ready. The review of the bonding patch in netdev has just started and I want to get some progress there and testing before sending you the final set of the patches. Or.
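Since the bonding piece is what is now under review in netdev, a self-contained sketch of the notifier mechanism the ADDR_CHANGE patch hangs off. It assumes the NETDEV_BONDING_FAILOVER event introduced by that patch set; the real rdma_cm handler additionally walks its IDs, as in the hunks Sean reviewed above, while the names here are illustrative:

    #include <linux/module.h>
    #include <linux/netdevice.h>
    #include <linux/if.h>

    static int failover_event(struct notifier_block *nb, unsigned long event,
                              void *ctx)
    {
            struct net_device *ndev = ctx;

            /* Only react to failover on a bonding master device. */
            if (event == NETDEV_BONDING_FAILOVER &&
                (ndev->flags & IFF_MASTER) && (ndev->priv_flags & IFF_BONDING))
                    printk(KERN_INFO "failover on bonding master %s\n",
                           ndev->name);
            return NOTIFY_DONE;
    }

    static struct notifier_block failover_nb = {
            .notifier_call = failover_event,
    };

    static int __init failover_init(void)
    {
            return register_netdevice_notifier(&failover_nb);
    }

    static void __exit failover_exit(void)
    {
            unregister_netdevice_notifier(&failover_nb);
    }

    module_init(failover_init);
    module_exit(failover_exit);
    MODULE_LICENSE("GPL");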
From vlad at lists.openfabrics.org Thu May 29 03:09:11 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 29 May 2008 03:09:11 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20080529-0200 daily build status Message-ID: <20080529100911.A200AE60E00@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Passed on ppc64 with linux-2.6.19 Failed: From hrosenstock at xsigo.com Thu May 29 04:34:13 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 04:34:13 -0700 Subject: [ofa-general] Infiniband back-to-back without OpenSM? 
In-Reply-To: <20080529055641.GB16570@sashak.voltaire.com> References: <483D62E6.2050107@mellanox.co.il> <1211988269.27600.32.camel@hrosenstock-ws.xsigo.com> <20080529055641.GB16570@sashak.voltaire.com> Message-ID: <1212060853.27600.83.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-29 at 08:56 +0300, Sasha Khapyorsky wrote: > On 08:24 Wed 28 May , Hal Rosenstock wrote: > > On Wed, 2008-05-28 at 10:31 -0400, Talpey, Thomas wrote: > > > In addition we needed to install > > > libmthca > > > which in turn wanted > > > libibverbs > > > > AFAIK, there is no OpenSM requirement for these libraries. > > Right, it is not needed. Actually it looks like a bug in OFED's > install.pl script. > > > And, after we successfully loaded it all and started opensm, etc, > > > then ipoib wouldn't come up to RUNNING because it couldn't join > > > the broadcast group. > > > > This was likely due to IPoIB needing to be enabled on the default > > partition. > > The default partition is enabled by default. and so is ipoib on that partition (I forgot how this worked for the no config file case) so the problem was something else (perhaps rate or MTU mismatch if defaults were not adequate for the b2b configuration ?). -- Hal > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Thu May 29 04:36:04 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 04:36:04 -0700 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: <20080529055641.GB16570@sashak.voltaire.com> References: <483D62E6.2050107@mellanox.co.il> <1211988269.27600.32.camel@hrosenstock-ws.xsigo.com> <20080529055641.GB16570@sashak.voltaire.com> Message-ID: <1212060964.27600.87.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-29 at 08:56 +0300, Sasha Khapyorsky wrote: > On 08:24 Wed 28 May , Hal Rosenstock wrote: > > On Wed, 2008-05-28 at 10:31 -0400, Talpey, Thomas wrote: > > > In addition we needed to install > > > libmthca > > > which in turn wanted > > > libibverbs > > > > AFAIK, there is no OpenSM requirement for these libraries. > > Right, it is not needed. Actually it looks like a bug in OFED's > install.pl script. What about getting the FC package updated too ? I thought that uses different packaging. -- Hal From richard.frank at oracle.com Thu May 29 04:52:11 2008 From: richard.frank at oracle.com (Richard Frank) Date: Thu, 29 May 2008 07:52:11 -0400 Subject: [rds-devel] [ofa-general] Port space sharing in RDS In-Reply-To: <20080529000354.GD6288@opengridcomputing.com> References: <20080528225549.GC6288@opengridcomputing.com> <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> <20080529000354.GD6288@opengridcomputing.com> Message-ID: <483E98EB.1020308@oracle.com> I see no problem with this - but defer to Olaf. Olaf is currently presenting at LinuxTag in Berlin - his responses may be delayed a day or two. Jon Mason wrote: > On Wed, May 28, 2008 at 04:33:06PM -0700, Sean Hefty wrote: > >>> During RDS init, rds_ib_init and rds_tcp_init will both individually bind to >>> the >>> RDS port for all IP addresses. Unfortunately, that will not work for iWARP for >>> 2 major reasons. >>> >> Can RDS use different port numbers for its RDMA and TCP protocols?
The wire >> > > I do not know if this is desirable, but a quick test shows that having TCP and IB > on different ports works around the problem. > > >> protocols end up being different when running over TCP versus iWarp. >> >> - Sean >> >> > > _______________________________________________ > rds-devel mailing list > rds-devel at oss.oracle.com > http://oss.oracle.com/mailman/listinfo/rds-devel > From eli at dev.mellanox.co.il Thu May 29 05:15:18 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Thu, 29 May 2008 15:15:18 +0300 Subject: [ofa-general] IB/ipoib: Fix CM connection premature destruction Message-ID: <1212063318.13769.186.camel@mtls03> >From 24e88d727dbbb7fd491edb57416f5cb0d4009f1d Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Thu, 29 May 2008 15:13:25 +0300 Subject: [PATCH] IB/ipoib: Fix CM connection premature destruction Destroy the CM connection at ipoib_cm_tx_destroy() after the TX queue is flushed. Failure to do so might cause the cm_id to be allocated again while pending TX completions which have not been reported yet move the connection to the reap list again, causing it to be destroyed before it has been used. The overall effect would be to delay the creation of a new connection. Signed-off-by: Eli Cohen --- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 819c027..a40e649 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -1113,9 +1113,6 @@ static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p) ipoib_dbg(priv, "Destroy active connection 0x%x head 0x%x tail 0x%x\n", p->qp ? p->qp->qp_num : 0, p->tx_head, p->tx_tail); - if (p->id) - ib_destroy_cm_id(p->id); - if (p->tx_ring) { /* Wait for all sends to complete */ begin = jiffies; @@ -1131,6 +1128,8 @@ static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p) } timeout: + if (p->id) + ib_destroy_cm_id(p->id); while ((int) p->tx_tail - (int) p->tx_head < 0) { tx_req = &p->tx_ring[p->tx_tail & (ipoib_sendq_size - 1)]; -- 1.5.5.1 From hrosenstock at xsigo.com Thu May 29 05:46:21 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 05:46:21 -0700 Subject: [ofa-general] Multicast Performance In-Reply-To: <483E7520.1000302@informatik.tu-chemnitz.de> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> <483BBBDB.6000605@informatik.tu-chemnitz.de> <483DA512.2070403@gmail.com> <483E7520.1000302@informatik.tu-chemnitz.de> Message-ID: <1212065181.27600.96.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-29 at 11:19 +0200, Marcel Heinz wrote: > Hi, > > Dotan Barak wrote: > > Marcel Heinz wrote: > >> [low multicast throughput of ~250MB/s with own benchmark tool] > > > > 1) I know that ib_send_bw supports multicast as well, can you please > > check that you can reproduce your problem > > on this benchmark too? > > Well, the last time I've checked this, ib_send_bw didn't support > multicast, but this was some months ago. That multicast support seems > a bit odd, since it doesn't create/join the multicast groups and there > is still a 1:1 TCP connection used to establish the IB connection, so > one cannot benchmark "real" multicast scenarios with more than one > receiver.
> > However, here are the results (I just used ipoib to let it create some > multicast groups for me): > > | mh at mhtest0:~$ ib_send_bw -c UD -g mhtest1 > | ------------------------------------------------------------------ > | Send BW Multicast Test > | Connection type : UD > | Max msg size in UD is 2048 changing to 2048 > | Inline data is used up to 400 bytes message > | local address: LID 0x01, QPN 0x4a0405, PSN 0x8667a7 > | remote address: LID 0x03, QPN 0x4a0405, PSN 0x5d41b6 > | Mtu : 2048 > | ------------------------------------------------------------------ > | #bytes #iterations BW peak[MB/sec] BW average[MB/sec] > | 2048 1000 301.12 247.05 > | ------------------------------------------------------------------ > > This is the same result as my own benchmark showed in that scenario. > > > 2) You should expect that multicast messages will be slower than > > unicast because the HCA/switch treat them in different way > > (message duplication need to be done if needed). > > Yes, but 250MB/s vs. 1100MB/s (UD unicast throughput) seems to be a bit > too much of overhead, don't you think? Agreed. > Especially if I take into account > that with my own benchmark, I can get ~950MB/s when I start another > receiver on the same host as the sender. Note that both of the > receivers, the local and the remote one, are seeing all packets at that > rate, so the HCAs and the switch must be able to handle multicast > packets with this throughput. Perhaps this is a static rate issue. What SM is being used ? -- Hal > The other strange thing is that multicast traffic slows down other > traffic way more than the bandwidth it consumes. Moreover, it seems like > it limits any other connections to the same throughput as that of the > multicast traffic, which looks suspicious to me. > > The same behavior can be reproduced with ib_send_bw, by starting a > unicast and multicast run in parallel: > > | mh at mhtest0:~$ ib_send_bw -c UD mhtest1 & ib_send_bw -c UD -g mhtest1\ > | -p 18516 > | ./ib_send_bw -c UD -g mhtest1 -p 18516 > | [1] 4927 > | ------------------------------------------------------------------ > | Send BW Test > | Connection type : UD > | Max msg size in UD is 2048 changing to 2048 > | ------------------------------------------------------------------ > | Send BW Multicast Test > | Connection type : UD > | Max msg size in UD is 2048 changing to 2048 > | Inline data is used up to 400 bytes message > | Inline data is used up to 400 bytes message > | local address: LID 0x01, QPN 0x530405, PSN 0xe98523 > | local address: LID 0x01, QPN 0x530406, PSN 0x3b338e > | remote address: LID 0x03, QPN 0x540405, PSN 0x5c53e2 > | Mtu : 2048 > | remote address: LID 0x03, QPN 0x540406, PSN 0xff883f > | Mtu : 2048 > | ------------------------------------------------------------------ > | #bytes #iterations BW peak[MB/sec] BW average[MB/sec] > | ------------------------------------------------------------------ > | #bytes #iterations BW peak[MB/sec] BW average[MB/sec] > | 2048 1000 692.41 270.26 > | 2048 1000 246.00 244.68 > | ------------------------------------------------------------------ > | ------------------------------------------------------------------ > > Doing 2 unicast UD runs in parallel, I'm getting ~650MB/s average > bandwidth for each, which sounds reasonable. > > Also, when using bidirectional mode, I'm getting ~1900MB/s (almost > doubled) throughput for unicast, but still ~250MB/s for multicast.
> > Regards, > Marcel > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Thu May 29 06:08:42 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 29 May 2008 16:08:42 +0300 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: <1212060964.27600.87.camel@hrosenstock-ws.xsigo.com> References: <483D62E6.2050107@mellanox.co.il> <1211988269.27600.32.camel@hrosenstock-ws.xsigo.com> <20080529055641.GB16570@sashak.voltaire.com> <1212060964.27600.87.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080529130842.GU4616@sashak.voltaire.com> On 04:36 Thu 29 May , Hal Rosenstock wrote: > > What about getting the FC package updated too ? I thought that uses > different packaging. I don't know the FC story. The spec files which are in the management tree don't have such dependencies. Sasha From sashak at voltaire.com Thu May 29 06:10:31 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 29 May 2008 16:10:31 +0300 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: <1212060853.27600.83.camel@hrosenstock-ws.xsigo.com> References: <483D62E6.2050107@mellanox.co.il> <1211988269.27600.32.camel@hrosenstock-ws.xsigo.com> <20080529055641.GB16570@sashak.voltaire.com> <1212060853.27600.83.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080529131031.GV4616@sashak.voltaire.com> On 04:34 Thu 29 May , Hal Rosenstock wrote: > > > > The default partition is enabled by default. > > and so is ipoib on that partition (I forgot how this worked for the no > config file case) so the problem was something else (perhaps rate or MTU > mismatch if defaults were not adequate for the b2b configuration ?). Yes, this should be something else. Sasha From hrosenstock at xsigo.com Thu May 29 06:22:52 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 06:22:52 -0700 Subject: [ofa-general] [PATCHv2] management: Support separate SA and SM keys as clarified in IBA 1.2.1 Message-ID: <1212067372.17997.43.camel@hrosenstock-ws.xsigo.com> management: Support separate SA and SM keys as clarified in IBA 1.2.1 v2 is just a rebase to the latest tree Signed-off-by: Hal Rosenstock diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index ed61721..ccf7bdd 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -730,7 +730,7 @@ get_all_records(osm_bind_handle_t bind_handle, int trusted) { return get_any_records(bind_handle, query_id, 0, 0, NULL, attr_offset, - trusted ? OSM_DEFAULT_SM_KEY : 0); + trusted ? OSM_DEFAULT_SA_KEY : 0); } /** @@ -1255,7 +1255,7 @@ print_pkey_tbl_records(const struct query_cmd *q, osm_bind_handle_t bind_handle, status = get_any_records(bind_handle, IB_MAD_ATTR_PKEY_TBL_RECORD, 0, comp_mask, &pktr, ib_get_attr_offset(sizeof(pktr)), - OSM_DEFAULT_SM_KEY); + OSM_DEFAULT_SA_KEY); if (status != IB_SUCCESS) return status; diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h index 289e49e..07cc407 100644 --- a/opensm/include/opensm/osm_base.h +++ b/opensm/include/opensm/osm_base.h @@ -119,6 +119,17 @@ BEGIN_C_DECLS */ #define OSM_DEFAULT_SM_KEY 1 /********/ +/****s* OpenSM: Base/OSM_DEFAULT_SA_KEY +* NAME +* OSM_DEFAULT_SA_KEY +* +* DESCRIPTION +* Subnet Administration key value.
+* +* SYNOPSIS +*/ +#define OSM_DEFAULT_SA_KEY 1 +/********/ /****s* OpenSM: Base/OSM_DEFAULT_LMC * NAME * OSM_DEFAULT_LMC diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index d84c5a2..1b862c0 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -209,6 +209,7 @@ typedef struct _osm_subn_opt { ib_net64_t guid; ib_net64_t m_key; ib_net64_t sm_key; + ib_net64_t sa_key; ib_net64_t subnet_prefix; ib_net16_t m_key_lease_period; uint32_t sweep_interval; @@ -295,7 +296,10 @@ typedef struct _osm_subn_opt { * M_Key value sent to all ports qualifing all Set(PortInfo). * * sm_key -* SM_Key value of the SM to qualify rcv SA queries as "trusted". +* SM_Key value of the SM used for SM authentication. +* +* sa_key +* SM_Key value to qualify rcv SA queries as "trusted". * * subnet_prefix * Subnet prefix used on this subnet. diff --git a/opensm/opensm/osm_sa_mad_ctrl.c b/opensm/opensm/osm_sa_mad_ctrl.c index 78fdec7..abd8d02 100644 --- a/opensm/opensm/osm_sa_mad_ctrl.c +++ b/opensm/opensm/osm_sa_mad_ctrl.c @@ -340,11 +340,11 @@ __osm_sa_mad_ctrl_rcv_callback(IN osm_madw_t * p_madw, * otherwise discard the MAD. */ if ((p_sa_mad->sm_key != 0) && - (p_sa_mad->sm_key != p_ctrl->p_subn->opt.sm_key)) { + (p_sa_mad->sm_key != p_ctrl->p_subn->opt.sa_key)) { OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 1A04: " "Non-Zero SA MAD SM_Key: 0x%" PRIx64 " != SM_Key: 0x%" PRIx64 "; MAD ignored\n", cl_ntoh64(p_sa_mad->sm_key), - cl_ntoh64(p_ctrl->p_subn->opt.sm_key) + cl_ntoh64(p_ctrl->p_subn->opt.sa_key) ); osm_mad_pool_put(p_ctrl->p_mad_pool, p_madw); goto Exit; diff --git a/opensm/opensm/osm_sa_pkey_record.c b/opensm/opensm/osm_sa_pkey_record.c index 5cea525..4d19ed4 100644 --- a/opensm/opensm/osm_sa_pkey_record.c +++ b/opensm/opensm/osm_sa_pkey_record.c @@ -269,7 +269,7 @@ void osm_pkey_rec_rcv_process(IN void *ctx, IN void *data) to trusted requests. Check that the requester is a trusted one. */ - if (p_rcvd_mad->sm_key != sa->p_subn->opt.sm_key) { + if (p_rcvd_mad->sm_key != sa->p_subn->opt.sa_key) { /* This is not a trusted requester! 
*/ OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 4608: " "Request from non-trusted requester: " diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 9d1fbeb..d1e25ef 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -387,6 +387,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->guid = 0; p_opt->m_key = OSM_DEFAULT_M_KEY; p_opt->sm_key = OSM_DEFAULT_SM_KEY; + p_opt->sa_key = OSM_DEFAULT_SA_KEY; p_opt->subnet_prefix = IB_DEFAULT_SUBNET_PREFIX; p_opt->m_key_lease_period = 0; p_opt->sweep_interval = OSM_DEFAULT_SWEEP_INTERVAL_SECS; @@ -1161,6 +1162,8 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) opts_unpack_net64("sm_key", p_key, p_val, &p_opts->sm_key); + opts_unpack_net64("sa_key", p_key, p_val, &p_opts->sa_key); + opts_unpack_net64("subnet_prefix", p_key, p_val, &p_opts->subnet_prefix); @@ -1401,8 +1404,10 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts) "m_key 0x%016" PRIx64 "\n\n" "# The lease period used for the M_Key on this subnet in [sec]\n" "m_key_lease_period %u\n\n" - "# SM_Key value of the SM to qualify rcv SA queries as 'trusted'\n" + "# SM_Key value of the SM used for SM authentication\n" "sm_key 0x%016" PRIx64 "\n\n" + "# SM_Key value to qualify rcv SA queries as 'trusted'\n" + "sa_key 0x%016" PRIx64 "\n\n" "# Subnet prefix used on this subnet\n" "subnet_prefix 0x%016" PRIx64 "\n\n" "# The LMC value used on this subnet\n" @@ -1456,6 +1461,7 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts) cl_ntoh64(p_opts->m_key), cl_ntoh16(p_opts->m_key_lease_period), cl_ntoh64(p_opts->sm_key), + cl_ntoh64(p_opts->sa_key), cl_ntoh64(p_opts->subnet_prefix), p_opts->lmc, p_opts->lmc_esp0 ? "TRUE" : "FALSE", From marcel.heinz at informatik.tu-chemnitz.de Thu May 29 06:35:22 2008 From: marcel.heinz at informatik.tu-chemnitz.de (Marcel Heinz) Date: Thu, 29 May 2008 15:35:22 +0200 Subject: [ofa-general] Multicast Performance In-Reply-To: <1212065181.27600.96.camel@hrosenstock-ws.xsigo.com> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> <483BBBDB.6000605@informatik.tu-chemnitz.de> <483DA512.2070403@gmail.com> <483E7520.1000302@informatik.tu-chemnitz.de> <1212065181.27600.96.camel@hrosenstock-ws.xsigo.com> Message-ID: <483EB11A.5000000@informatik.tu-chemnitz.de> Hal Rosenstock wrote: > On Thu, 2008-05-29 at 11:19 +0200, Marcel Heinz wrote: >>Dotan Barak wrote: >>>Marcel Heinz wrote: >>> >>>>[low multicast throughput of ~250MB/s with own benchmark tool] >>> >>>1) I know that ib_send_bw supports multicast as well, can you please >>>check that you can reproduce your problem >>>on this benchmark too? >> >>| #bytes #iterations BW peak[MB/sec] BW average[MB/sec] >>| 2048 1000 301.12 247.05 >> >>This is the same result as my own benchmark showed in that scenario. >> >> >>>2) You should expect that multicast messages will be slower than >>>unicast because the HCA/switch treat them in different way >>>(message duplication need to be done if needed). >> >>Yes, but 250MB/s vs. 1100MB/s (UD unicast throughput) seems to be a bit >>too much of overhead, don't you think? > > > Agreed. > > >>Especially if I take into account >>that with my own benchmark, I can get ~950MB/s when I start another >>receiver on the same host as the sender. 
Note that both of the >>receivers, the local and the remote one, are seeing all packets at that >>rate, so the HCAs and the switch must be able to handle multicast >>packets with this throughput. > > Perhaps this is a static rate issue. > > What SM is being used ? It's OpenSM 3.1.7. I had also made some tests with OpenSM 3.2.1, but this didn't change anything. Regards, Marcel From hrosenstock at xsigo.com Thu May 29 06:37:23 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 06:37:23 -0700 Subject: [ofa-general] Multicast Performance In-Reply-To: <483EB11A.5000000@informatik.tu-chemnitz.de> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> <483BBBDB.6000605@informatik.tu-chemnitz.de> <483DA512.2070403@gmail.com> <483E7520.1000302@informatik.tu-chemnitz.de> <1212065181.27600.96.camel@hrosenstock-ws.xsigo.com> <483EB11A.5000000@informatik.tu-chemnitz.de> Message-ID: <1212068243.17997.48.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-29 at 15:35 +0200, Marcel Heinz wrote: > Hal Rosenstock wrote: > > On Thu, 2008-05-29 at 11:19 +0200, Marcel Heinz wrote: > >>Dotan Barak wrote: > >>>Marcel Heinz wrote: > >>> > >>>>[low multicast throughput of ~250MB/s with own benchmark tool] > >>> > >>>1) I know that ib_send_bw supports multicast as well, can you please > >>>check that you can reproduce your problem > >>>on this benchmark too? > >> > >>| #bytes #iterations BW peak[MB/sec] BW average[MB/sec] > >>| 2048 1000 301.12 247.05 > >> > >>This is the same result as my own benchmark showed in that scenario. > >> > >> > >>>2) You should expect that multicast messages will be slower than > >>>unicast because the HCA/switch treat them in different way > >>>(message duplication need to be done if needed). > >> > >>Yes, but 250MB/s vs. 1100MB/s (UD unicast throughput) seems to be a bit > >>too much of overhead, don't you think? > > > > > > Agreed. > > > > > >>Especially if I take into account > >>that with my own benchmark, I can get ~950MB/s when I start another > >>receiver on the same host as the sender. Note that both of the > >>receivers, the local and the remote one, are seeing all packets at that > >>rate, so the HCAs and the switch must be able to handle multicast > >>packets with this throughput. > > > > > > Perhaps this is a static rate issue. > > > > What SM is being used ? > > It's OpenSM 3.1.7. I had also made some tests with OpenSM 3.2.1, but > this didn't change anything. Can you validate either the PathRecord or MCMemberRecord returned or the static rate applied to the multicast QP in the various scenarios ? If it is the same, this is not the problem but if it's different then we're on to something here.
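(One way to capture what the SA actually hands back for the group is to join through the librdmacm UD multicast path and dump the address-handle attributes delivered with the join event. A sketch for illustration only, under the assumption that the application joins via rdma_join_multicast(); it is not taken from the benchmark under discussion.)

    #include <stdio.h>
    #include <rdma/rdma_cma.h>

    /* After rdma_join_multicast(), the CM reports the join result with
     * RDMA_CM_EVENT_MULTICAST_JOIN; param.ud carries the qkey and the
     * ah_attr (including static_rate) the SA returned for the group. */
    static void dump_mcast_join(struct rdma_cm_event *event)
    {
            if (event->event == RDMA_CM_EVENT_MULTICAST_JOIN) {
                    struct rdma_ud_param *ud = &event->param.ud;

                    printf("mcast join: qkey 0x%x dlid 0x%x static_rate %u sl %u\n",
                           (unsigned) ud->qkey, (unsigned) ud->ah_attr.dlid,
                           (unsigned) ud->ah_attr.static_rate,
                           (unsigned) ud->ah_attr.sl);
            }
    }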
-- Hal > Regards, > Marcel > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From marcel.heinz at informatik.tu-chemnitz.de Thu May 29 07:32:53 2008 From: marcel.heinz at informatik.tu-chemnitz.de (Marcel Heinz) Date: Thu, 29 May 2008 16:32:53 +0200 Subject: [ofa-general] Multicast Performance In-Reply-To: <1212068243.17997.48.camel@hrosenstock-ws.xsigo.com> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> <483BBBDB.6000605@informatik.tu-chemnitz.de> <483DA512.2070403@gmail.com> <483E7520.1000302@informatik.tu-chemnitz.de> <1212065181.27600.96.camel@hrosenstock-ws.xsigo.com> <483EB11A.5000000@informatik.tu-chemnitz.de> <1212068243.17997.48.camel@hrosenstock-ws.xsigo.com> Message-ID: <483EBE95.60901@informatik.tu-chemnitz.de> Hi, Hal Rosenstock wrote: > On Thu, 2008-05-29 at 15:35 +0200, Marcel Heinz wrote: >>Hal Rosenstock wrote: >>>On Thu, 2008-05-29 at 11:19 +0200, Marcel Heinz wrote: >>>>Especially if I take into account >>>>that with my own benchmark, I can get ~950MB/s when I start another >>>>receiver on the same host as the sender. Note that both of the >>>>receivers, the local and the remote one, are seeing all packets at that >>>>rate, so the HCAs and the switch must be able to handle multicast >>>>packets with this throughput. >>> >>> >>>Perhaps this is a static rate issue. >>> >>>What SM is being used ? >> >>It's OpenSM 3.1.7. I had also made some tests with OpensSM 3.2.1, but >>this didn't change anything. > > > Can you validate either the PathRecord or MCMemberRecord returned or the > static rate applied to the multicast QP in the various scenarios ? If it > is the same, this is not the problem but if it's different then we're on > to something here. > This is what happened: 1. The server on host B is started and creates the MC group, OpenSM returns: | May 29 15:54:34 699610 [B6D71B90] 0x08 -> MCMember Record dump: | MGID....................0xff12000000000000 : 0x00010002deadbeef | PortGid.................0xfe80000000000000 : 0x0002c9020025abdd | qkey....................0xABCD | mlid....................0xC000 | mtu.....................0x84 | TClass..................0x0 | pkey....................0x7FFF | rate....................0x86 | pkt_life................0x80 | SLFlowLabelHopLimit.....0x0 | ScopeState..............0x21 | ProxyJoin...............0x0 2. The client on host A is started and joins to the group as SendOnlyNonMember, OpenSM returns: | May 29 15:54:45 381972 [B5D6FB90] 0x08 -> MCMember Record dump: | MGID....................0xff12000000000000 : 0x00010002deadbeef | PortGid.................0xfe80000000000000 : 0x0002c9020025abed | qkey....................0xABCD | mlid....................0xC000 | mtu.....................0x84 | TClass..................0x0 | pkey....................0x7FFF | rate....................0x86 | pkt_life................0x80 | SLFlowLabelHopLimit.....0x0 | ScopeState..............0x4 | ProxyJoin...............0x0 Now I have 255MB/s between host A and B. 3. 
I start another server on host A, it joins to the group and OpenSM returns: | May 29 15:54:56 129971 [B6570B90] 0x08 -> MCMember Record dump: | MGID....................0xff12000000000000 : 0x00010002deadbeef | PortGid.................0xfe80000000000000 : 0x0002c9020025abed | qkey....................0xABCD | mlid....................0xC000 | mtu.....................0x84 | TClass..................0x0 | pkey....................0x7FFF | rate....................0x86 | pkt_life................0x80 | SLFlowLabelHopLimit.....0x0 | ScopeState..............0x25 | ProxyJoin...............0x0 Now, all 3 instances measure 950MB/s throughput. The returned MCMember Records are absolutely identical except for the PortGid and the membership state. How can I find out the static rate applied to the multicast QP? Regards, Marcel From xma at us.ibm.com Thu May 29 07:37:12 2008 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 29 May 2008 07:37:12 -0700 Subject: [ofa-general] [PATCH v2] IB/ipoib: copy small SKBs in CM mode In-Reply-To: <1211976303.13769.155.camel@mtls03> Message-ID: Hello Eli, > > In this case, how many tx drop packets from ifconfig output? Should we > > see ifconfig tx drop packets + tx successfully transmit packets close > > to netperf packets? > That's right. I am looking at ipoib_cm_handle_tx_wc(); the tx dropped-packet count is not increased in this situation, so the tx transmit packet count should be close to the number of netperf send packets. void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) { ... tx_req = &tx->tx_ring[wr_id]; ib_dma_unmap_single(priv->ca, tx_req->mapping[0], tx_req->skb->len, DMA_TO_DEVICE); /* FIXME: is this right? Shouldn't we only increment on success? */ ++dev->stats.tx_packets; dev->stats.tx_bytes += tx_req->skb->len; ... } > > Any TCP STREAM test results to share here? > TCP won't demonstrate the problem since it uses Nagle's algorithm to > aggregate data into full sized packets. So when hitting this RNR retry, the returned error status was flush err, so the packets were silently dropped instead of reporting a "failed cm send event" and clearing the interface up flag? Please correct me if I am wrong. thanks Shirley -------------- next part -------------- An HTML attachment was scrubbed... URL: From hrosenstock at xsigo.com Thu May 29 07:49:13 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 07:49:13 -0700 Subject: [ofa-general] Multicast Performance In-Reply-To: <483EBE95.60901@informatik.tu-chemnitz.de> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> <483BBBDB.6000605@informatik.tu-chemnitz.de> <483DA512.2070403@gmail.com> <483E7520.1000302@informatik.tu-chemnitz.de> <1212065181.27600.96.camel@hrosenstock-ws.xsigo.com> <483EB11A.5000000@informatik.tu-chemnitz.de> <1212068243.17997.48.camel@hrosenstock-ws.xsigo.com> <483EBE95.60901@informatik.tu-chemnitz.de> Message-ID: <1212072553.17997.65.camel@hrosenstock-ws.xsigo.com> Hi Marcel, On Thu, 2008-05-29 at 16:32 +0200, Marcel Heinz wrote: > Hi, > > Hal Rosenstock wrote: > > On Thu, 2008-05-29 at 15:35 +0200, Marcel Heinz wrote: > >>Hal Rosenstock wrote: > >>>On Thu, 2008-05-29 at 11:19 +0200, Marcel Heinz wrote: > >>>>Especially if I take into account > >>>>that with my own benchmark, I can get ~950MB/s when I start another > >>>>receiver on the same host as the sender.
Note that both of the > >>>>receivers, the local and the remote one, are seeing all packets at that > >>>>rate, so the HCAs and the switch must be able to handle multicast > >>>>packets with this throughput. > >>> > >>> > >>>Perhaps this is a static rate issue. > >>> > >>>What SM is being used ? > >> > >>It's OpenSM 3.1.7. I had also made some tests with OpensSM 3.2.1, but > >>this didn't change anything. > > > > > > Can you validate either the PathRecord or MCMemberRecord returned or the > > static rate applied to the multicast QP in the various scenarios ? If it > > is the same, this is not the problem but if it's different then we're on > > to something here. > > > > This is what happened: > > 1. The server on host B is started and creates the MC group, OpenSM > returns: > > | May 29 15:54:34 699610 [B6D71B90] 0x08 -> MCMember Record dump: > | MGID....................0xff12000000000000 : 0x00010002deadbeef > | PortGid.................0xfe80000000000000 : 0x0002c9020025abdd > | qkey....................0xABCD > | mlid....................0xC000 > | mtu.....................0x84 > | TClass..................0x0 > | pkey....................0x7FFF > | rate....................0x86 > | pkt_life................0x80 > | SLFlowLabelHopLimit.....0x0 > | ScopeState..............0x21 > | ProxyJoin...............0x0 > > 2. The client on host A is started and joins to the group as > SendOnlyNonMember, OpenSM returns: > > | May 29 15:54:45 381972 [B5D6FB90] 0x08 -> MCMember Record dump: > | MGID....................0xff12000000000000 : 0x00010002deadbeef > | PortGid.................0xfe80000000000000 : 0x0002c9020025abed > | qkey....................0xABCD > | mlid....................0xC000 > | mtu.....................0x84 > | TClass..................0x0 > | pkey....................0x7FFF > | rate....................0x86 > | pkt_life................0x80 > | SLFlowLabelHopLimit.....0x0 > | ScopeState..............0x4 > | ProxyJoin...............0x0 > > Now I have 255MB/s between host A and B. > > 3. I start another server on host A, it joines to the group and > OpenSM returns: > > | May 29 15:54:56 129971 [B6570B90] 0x08 -> MCMember Record dump: > | MGID....................0xff12000000000000 : 0x00010002deadbeef > | PortGid.................0xfe80000000000000 : 0x0002c9020025abed > | qkey....................0xABCD > | mlid....................0xC000 > | mtu.....................0x84 > | TClass..................0x0 > | pkey....................0x7FFF > | rate....................0x86 > | pkt_life................0x80 > | SLFlowLabelHopLimit.....0x0 > | ScopeState..............0x25 > | ProxyJoin...............0x0 > > Now, all 3 instances measure 950MB/s throughput. > > The returned MCMember Records are absolutely identical except > for the PortGid and the membership state. Rate 0x86 is exactly 20 Gbps. > How can I find out the static rate applied to the multicast QP? Given the above, I don't see this as a likely suspect but you should be able to query the ah used for sending and look in the ah_attr for static_rate. 
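(To make the encodings concrete: in SA records the mtu and rate octets carry a two-bit selector, where 2 means "exactly", above a six-bit value, so 0x84 decodes to exactly MTU 2048 and 0x86 to exactly 20 Gbps, matching the dumps quoted above. A small standalone decoder, using the IBA 1.2 value tables, for illustration only:)

    #include <stdio.h>

    static const char *rate_str(int v)
    {
            switch (v) {
            case 2:  return "2.5 Gbps";
            case 3:  return "10 Gbps";
            case 4:  return "30 Gbps";
            case 5:  return "5 Gbps";
            case 6:  return "20 Gbps";
            case 7:  return "40 Gbps";
            case 8:  return "60 Gbps";
            case 9:  return "80 Gbps";
            case 10: return "120 Gbps";
            default: return "reserved";
            }
    }

    int main(void)
    {
            unsigned char rate = 0x86, mtu = 0x84; /* values from the dumps */

            /* top two bits: selector (2 = exactly); low six bits: value */
            printf("rate 0x%02x: selector %d, %s\n",
                   rate, rate >> 6, rate_str(rate & 0x3f));
            printf("mtu  0x%02x: selector %d, %d bytes\n",
                   mtu, mtu >> 6, 256 << ((mtu & 0x3f) - 1));
            return 0;
    }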
-- Hal > Regards, > Marcel From eli at dev.mellanox.co.il Thu May 29 08:14:51 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Thu, 29 May 2008 18:14:51 +0300 Subject: [ofa-general] Re: IB/ipoib: Fix CM connection premature destruction In-Reply-To: <1212063318.13769.186.camel@mtls03> References: <1212063318.13769.186.camel@mtls03> Message-ID: <4e6a6b3c0805290814m5399fc8frc7f0770ef003962f@mail.gmail.com> On Thu, May 29, 2008 at 3:15 PM, Eli Cohen wrote: > >From 24e88d727dbbb7fd491edb57416f5cb0d4009f1d Mon Sep 17 00:00:00 2001 > From: Eli Cohen > Date: Thu, 29 May 2008 15:13:25 +0300 > Subject: [PATCH] IB/ipoib: Fix CM connection premature destruction > > Destroy the CM connection at ipoib_cm_tx_destroy() after the TX > queue is flushed. Failure to do so might cause the cm_id to be > allocated again while pending TX completions which have not been > reported yet move the connection to the reap list again, causing it > to be destroyed before it has been used. The overall effect would be > to delay the creation of a new connection. Thinking it over there's no bug there that I can identify. Please ignore this patch. From marcel.heinz at informatik.tu-chemnitz.de Thu May 29 08:30:11 2008 From: marcel.heinz at informatik.tu-chemnitz.de (Marcel Heinz) Date: Thu, 29 May 2008 17:30:11 +0200 Subject: [ofa-general] Multicast Performance In-Reply-To: <1212072553.17997.65.camel@hrosenstock-ws.xsigo.com> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> <483BBBDB.6000605@informatik.tu-chemnitz.de> <483DA512.2070403@gmail.com> <483E7520.1000302@informatik.tu-chemnitz.de> <1212065181.27600.96.camel@hrosenstock-ws.xsigo.com> <483EB11A.5000000@informatik.tu-chemnitz.de> <1212068243.17997.48.camel@hrosenstock-ws.xsigo.com> <483EBE95.60901@informatik.tu-chemnitz.de> <1212072553.17997.65.camel@hrosenstock-ws.xsigo.com> Message-ID: <483ECC03.8000600@informatik.tu-chemnitz.de> Hi Hal, Hal Rosenstock wrote: > On Thu, 2008-05-29 at 16:32 +0200, Marcel Heinz wrote: > >>How can I find out the static rate applied to the multicast QP? > > Given the above, I don't see this as a likely suspect but you should be > able to query the ah used for sending and look in the ah_attr for > static_rate. Well, there is no query_ah function. The ah I use for sending was created with static_rate 0. The manpage doesn't give any explanation of what static rate is meant to be, but after playing around with it I guess that it is what the spec calls "Inter Packet Delay", right? So 0 should be the correct choice. There is also the ah_attr field of the ib_qp_attr struct which I could query, but this field is not valid for datagram QPs.
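(In kernel space a query verb does exist, ib_query_ah(); a minimal sketch of reading back an AH's static rate with it, for illustration only. Note that not every driver implements query_ah, in which case an error is returned.)

    #include <rdma/ib_verbs.h>

    /* Read back the attributes of an address handle and report the
     * destination LID and static rate stored in it. */
    static int dump_ah_static_rate(struct ib_ah *ah)
    {
            struct ib_ah_attr attr;
            int ret = ib_query_ah(ah, &attr); /* may fail if unimplemented */

            if (!ret)
                    printk(KERN_INFO "ah: dlid 0x%x static_rate %d\n",
                           attr.dlid, attr.static_rate);
            return ret;
    }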
Regards, Marcel From hrosenstock at xsigo.com Thu May 29 08:34:41 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 08:34:41 -0700 Subject: [ofa-general] Multicast Performance In-Reply-To: <483ECC03.8000600@informatik.tu-chemnitz.de> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> <483BBBDB.6000605@informatik.tu-chemnitz.de> <483DA512.2070403@gmail.com> <483E7520.1000302@informatik.tu-chemnitz.de> <1212065181.27600.96.camel@hrosenstock-ws.xsigo.com> <483EB11A.5000000@informatik.tu-chemnitz.de> <1212068243.17997.48.camel@hrosenstock-ws.xsigo.com> <483EBE95.60901@informatik.tu-chemnitz.de> <1212072553.17997.65.camel@hrosenstock-ws.xsigo.com> <483ECC03.8000600@informatik.tu-chemnitz.de> Message-ID: <1212075281.17997.89.camel@hrosenstock-ws.xsigo.com> Hi Marcel, On Thu, 2008-05-29 at 17:30 +0200, Marcel Heinz wrote: > Hi Hal, > > Hal Rosenstock wrote: > > On Thu, 2008-05-29 at 16:32 +0200, Marcel Heinz wrote: > >>How can I find out the static rate applied to the multicast QP? > > > > Given the above, I don't see this as a likely suspect but you should be > > able to query the ah used for sending and look in the ah_attr for > > static_rate. > > Well, there is no query_ah function. I was looking at kernel space not user space. Not sure about user space but I think it's moot. > The ah I use for sending was > created with static_rate 0. The manpage doesn't give any explanation > what static rate is meant to be, but after playing around with it I > guess that it is what the spec calls "Inter Packet Delay", right? Yes. > So 0 should be the correct choice. Yes. This isn't the issue and I'm not sure what is. Sorry. -- Hal > There is also the ah_attr field of the ib_qp_attr struct which I could > query, but this field is not valid for datagram QPs. > Regards, > Marcel From jlentini at netapp.com Thu May 29 08:37:15 2008 From: jlentini at netapp.com (James Lentini) Date: Thu, 29 May 2008 11:37:15 -0400 (EDT) Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: <1211981650.13185.362.camel@hrosenstock-ws.xsigo.com> References: <1211978369.13185.351.camel@hrosenstock-ws.xsigo.com> <1211979817.13185.357.camel@hrosenstock-ws.xsigo.com> <1211981650.13185.362.camel@hrosenstock-ws.xsigo.com> Message-ID: On Wed, 28 May 2008, Hal Rosenstock wrote: > On Wed, 2008-05-28 at 09:24 -0400, Talpey, Thomas wrote: > > At 09:03 AM 5/28/2008, Hal Rosenstock wrote: > > >On Wed, 2008-05-28 at 08:56 -0400, Talpey, Thomas wrote: > > >> At 08:39 AM 5/28/2008, Hal Rosenstock wrote: > > >> >Tom, > > >> > > > >> >On Wed, 2008-05-28 at 08:06 -0400, Talpey, Thomas wrote: > > >> >> Is it possible to manually configure two Infiniband ports to operate > > >> >> with one another in back-to-back mode, without running OpenSM > > >> >> on one of them? > > >> > > > >> >This is possible but something would need to do at least some subset of > > >> >what the SM does depending on the precise requirements and the limits > > >> >placed on the environment supported without a "full blown" SM. > > >> > > >> Okay ... but IMO the only thing we need is a LID. Or at least, in my > > >experience > > >> all I've needed is a LID. > > > > > >The port also needs to be walked from init to active which takes > > >coordination at both ends of the b2b link. > > > > Yep. But, it has all it needs with a LID, right? No messages need to be > > exchanged, for instance. 
> > >> In a previous effort, we simply stole the low octet of an IP address, so we'd > > >> "ifconfig ib0 1.2.3.X" and it would jam lid=X into the interface. > > >Worked great. > > >> If necessary, we would set a manual arp entry (using iproute) to avoid having > > >> to broadcast. > > > > > >That could be done if that is what is desired and can be relied upon > > >(that ib0 is configured and we only care about the first port). > > > > > >Is it just ARP support that is needed ? > > > > Well, ARP is the precursor to establishing an IP send and a TCP connection, > > which we need to do also. > > I was just asking about other broadcast/multicast needs. Sounds like > this is not the case. > > > But, if the resulting ipaddr-hwaddr mapping is > > installed, then ARP is unnecessary and the IP layer can send without using it. > > > > When we did this before, we'd install a "permanent" ARP entry, in a two-line > > shell script. Roughly, for peers configuring lids X and Y, it would do > > > > peer X: > > ifconfig ib0 1.2.3.X > > ip neigh add 1.2.3.Y nud permanent lladdr a.b.c.d.e.f....Y (i.e. Y's guid) > > > > peer Y: > > ifconfig ib0 1.2.3.Y > > ip neigh add 1.2.3.X nud permanent lladdr a.b.c.d.e.f....X > > > > And we'd be up and running for both IP and RDMA connections. We fixed a > > bug in the old iproute2 command to allow the long IB link addresses. > > > > I'm thinking that using IPOIB to drive this kind of manual setup is one way > > to approach it. It certainly would be simple, and worked for us before there > > was an OFA stack. > > This would still work. > > > Maybe I'm getting ahead of myself though, still wondering if there's a way > > to do it with what we have. > > The closest thing is OpenSM run once mode but I think you've been > describing a b2b mini SM command which wouldn't be hard to implement. Unrelated to NFS/RDMA, I wrote a small kernel module that used MADs to assign a lid, and then transitioned the port to ARMED and ACTIVE. This worked for enabling IB communication, but not IPoIB. In retrospect, I probably could have implemented the same functionality in userspace. > -- Hal > > > Tom. > > > > > >> >> We have done this on other IB implementations by manually assigning > > >> >> LIDs, but I discover that the "lid" entry below > > >> >/sys/class/infiniband/ > > >> >> is not writable, at least for mthca. > > >> > > > >> >This can be done via MADs so user_mad kernel module would be needed to > > >> >do this. > > >> > > >> Okay, all kernel modules can be assumed to be in place. How do we tell it > > >> to manage the LID, with a shell command? > > > > > >A new "command" would be needed. > > > > > >-- Hal > > > > > >> >> Also, I expect that the ipoib driver will > > >> >> be unable to join the broadcast group, so will be unwilling to > > >come up fully. > > >> > > > >> >Is IPoIB a requirement ? > > >> > > >> I think so, for two reasons. One, principle of least surprise - the user will > > >> expect to be able to ping, telnet etc if it has connectivity. Two, > > >for NFS/RDMA > > >> we require TCP and UDP connections in order to perform the mount and do > > >> locking and recovery. We could do those over a parallel ethernet connection, > > >> but that's kind of not the point. > > >> > > >> > > > >> >> With ethernet, and maybe iWARP, just a simple ifconfig can do this. So why
So why > > >> >> not IB? > > >> > > > >> >The simple answer is that it is the nature of IB management (being > > >> >different than ethernet). > > >> > > >> Which, IMO, we need to boil down to simplest-possible, for at least some > > >> workable configuration. > > >> > > >> Thanks for the ideas! > > >> > > >> Tom. > > >> > > >> > > > >> >-- Hal > > >> > > > >> >> If you're wondering, my goal is give NFS/RDMA users a way to avoid having > > >> >> to install the many userspace modules needed to do this, including > > >> >libibverbs, > > >> >> opensm, etc. There's a lot to get wrong, and things go missing. Seeking an > > >> >> "easy" way to get started with just the kernel and some shell commands. > > >> >> > > >> >> Tom. > > >> >> > > >> >> _______________________________________________ > > >> >> general mailing list > > >> >> general at lists.openfabrics.org > > >> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > >> >> > > >> >> To unsubscribe, please visit > > >> >http://openib.org/mailman/listinfo/openib-general > > >> > > > > > >_______________________________________________ > > >general mailing list > > >general at lists.openfabrics.org > > >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hrosenstock at xsigo.com Thu May 29 08:45:28 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 08:45:28 -0700 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: References: <1211978369.13185.351.camel@hrosenstock-ws.xsigo.com> <1211979817.13185.357.camel@hrosenstock-ws.xsigo.com> <1211981650.13185.362.camel@hrosenstock-ws.xsigo.com> Message-ID: <1212075928.17997.93.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-29 at 11:37 -0400, James Lentini wrote: > >The closest thing is OpenSM run once mode but I think you've been > >describing a b2b mini SM command which wouldn't be hard to > >implement. > > Unreleated to NFS/RDMA, I wrote a small kernel module that used MADs > to assign a lid, and then transitioned the port to ARMED and ACTIVE. Is this based on OpenIB/OpenFabrics kernel APIs ? > This worked for enabling IB communication, but not IPoIB. I think that IPoIB for b2b mode would be a relatively simple addition. -- Hal From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:53:46 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 29 May 2008 15:23:46 +0530 Subject: [ofa-general] [PATCH v3 00/13] QLogic VNIC Driver Message-ID: <20080529095126.9943.84692.stgit@localhost.localdomain> Roland, This is the third round of QLogic Virtual NIC driver patch series for submission to 2.6.27 kernel. The series has been tested against your for-2.6.27 branch. Based on comments received on second series of patches, following fixes are introduced in this series: - Removal of CONFIG_INFINIBAND_QLGC_VNIC_DEBUG option. - Making few function definitions static. - Removed un-necessary extern declarations. The sparse endianness checking for the driver did not give any warnings and checkpatch.pl have few warnings indicating lines slightly longer than 80 columns. 
Background: As mentioned in the first version of the patch series, this series adds the QLogic Virtual NIC (VNIC) driver, which works in conjunction with the QLogic Ethernet Virtual I/O Controller (EVIC) hardware. The VNIC driver, along with the QLogic EVIC's two 10 Gigabit Ethernet ports, enables Infiniband clusters to connect to Ethernet networks. This driver also works with the earlier version of the I/O Controller, the VEx. The QLogic VNIC driver creates virtual ethernet interfaces and tunnels the Ethernet data to/from the EVIC over Infiniband using an Infiniband reliable connection. [PATCH v3 01/13] QLogic VNIC: Driver - netdev implementation [PATCH v3 02/13] QLogic VNIC: Netpath - abstraction of connection to EVIC/VEx [PATCH v3 03/13] QLogic VNIC: Implementation of communication protocol with EVIC/VEx [PATCH v3 04/13] QLogic VNIC: Implementation of Control path of communication protocol [PATCH v3 05/13] QLogic VNIC: Implementation of Data path of communication protocol [PATCH v3 06/13] QLogic VNIC: IB core stack interaction [PATCH v3 07/13] QLogic VNIC: Handling configurable parameters of the driver [PATCH v3 08/13] QLogic VNIC: sysfs interface implementation for the driver [PATCH v3 09/13] QLogic VNIC: IB Multicast for Ethernet broadcast/multicast [PATCH v3 10/13] QLogic VNIC: Driver Statistics collection [PATCH v3 11/13] QLogic VNIC: Driver utility file - implements various utility macros [PATCH v3 12/13] QLogic VNIC: Driver Kconfig and Makefile. [PATCH v3 13/13] QLogic VNIC: Modifications to IB Kconfig and Makefile drivers/infiniband/Kconfig | 2 drivers/infiniband/Makefile | 1 drivers/infiniband/ulp/qlgc_vnic/Kconfig | 19 drivers/infiniband/ulp/qlgc_vnic/Makefile | 13 drivers/infiniband/ulp/qlgc_vnic/vnic_config.c | 379 +++ drivers/infiniband/ulp/qlgc_vnic/vnic_config.h | 242 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_control.c | 2286 ++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_control.h | 179 ++ .../infiniband/ulp/qlgc_vnic/vnic_control_pkt.h | 368 +++ drivers/infiniband/ulp/qlgc_vnic/vnic_data.c | 1492 +++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_data.h | 206 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c | 1043 +++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h | 206 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_main.c | 1098 ++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_main.h | 154 + drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c | 319 +++ drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h | 77 + drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c | 112 + drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h | 79 + drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c | 234 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h | 497 ++++ drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c | 1133 ++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h | 51 drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h | 103 + drivers/infiniband/ulp/qlgc_vnic/vnic_util.h | 236 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c | 1214 +++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h | 175 ++ 27 files changed, 11918 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/Kconfig create mode 100644 drivers/infiniband/ulp/qlgc_vnic/Makefile create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_config.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_config.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h create
mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_data.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_data.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_util.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h -- Regards, Ram From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:54:23 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 29 May 2008 15:24:23 +0530 Subject: [ofa-general] [PATCH v3 01/13] QLogic VNIC: Driver - netdev implementation In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> Message-ID: <20080529095423.9943.77528.stgit@localhost.localdomain> From: Ramachandra K QLogic Virtual NIC Driver. This patch implements netdev registration, netdev functions and state maintenance of the QLogic Virtual NIC corresponding to the various events associated with the QLogic Ethernet Virtual I/O Controller (EVIC/VEx) connection. Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_main.c | 1098 ++++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_main.h | 154 ++++ 2 files changed, 1252 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c new file mode 100644 index 0000000..570c069 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c @@ -0,0 +1,1098 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_netpath.h" +#include "vnic_viport.h" +#include "vnic_ib.h" +#include "vnic_stats.h" + +#define MODULEVERSION "1.3.0.0.4" +#define MODULEDETAILS \ + "QLogic Corp. Virtual NIC (VNIC) driver version " MODULEVERSION + +MODULE_AUTHOR("QLogic Corp."); +MODULE_DESCRIPTION(MODULEDETAILS); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_SUPPORTED_DEVICE("QLogic Ethernet Virtual I/O Controller"); + +u32 vnic_debug; + +module_param(vnic_debug, uint, 0444); +MODULE_PARM_DESC(vnic_debug, "Enable debug tracing if > 0"); + +LIST_HEAD(vnic_list); + +static DECLARE_WAIT_QUEUE_HEAD(vnic_npevent_queue); +static LIST_HEAD(vnic_npevent_list); +static DECLARE_COMPLETION(vnic_npevent_thread_exit); +static spinlock_t vnic_npevent_list_lock; +static struct task_struct *vnic_npevent_thread; +static int vnic_npevent_thread_end; + +static const char *const vnic_npevent_str[] = { + "PRIMARY CONNECTED", + "PRIMARY DISCONNECTED", + "PRIMARY CARRIER", + "PRIMARY NO CARRIER", + "PRIMARY TIMER EXPIRED", + "PRIMARY SETLINK", + "SECONDARY CONNECTED", + "SECONDARY DISCONNECTED", + "SECONDARY CARRIER", + "SECONDARY NO CARRIER", + "SECONDARY TIMER EXPIRED", + "SECONDARY SETLINK", + "FREE VNIC", +}; + +void vnic_connected(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_connected()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_CONNECTED); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_CONNECTED); + + vnic_connected_stats(vnic); +} + +void vnic_disconnected(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_disconnected()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_DISCONNECTED); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_DISCONNECTED); +} + +void vnic_link_up(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_link_up()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_LINKUP); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_LINKUP); +} + +void vnic_link_down(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_link_down()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_LINKDOWN); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_LINKDOWN); +} + +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath) +{ + unsigned long flags; + + VNIC_FUNCTION("vnic_stop_xmit()\n"); + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (netpath == vnic->current_path) { + if (!netif_queue_stopped(vnic->netdevice)) { + netif_stop_queue(vnic->netdevice); + vnic->failed_over = 0; + } + + vnic_stop_xmit_stats(vnic); + } + spin_unlock_irqrestore(&vnic->current_path_lock, flags); +} + +void vnic_restart_xmit(struct vnic *vnic, struct netpath *netpath) +{ + unsigned long flags; + + VNIC_FUNCTION("vnic_restart_xmit()\n"); + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (netpath 
== vnic->current_path) { + if (netif_queue_stopped(vnic->netdevice)) + netif_wake_queue(vnic->netdevice); + + vnic_restart_xmit_stats(vnic); + } + spin_unlock_irqrestore(&vnic->current_path_lock, flags); +} + +void vnic_recv_packet(struct vnic *vnic, struct netpath *netpath, + struct sk_buff *skb) +{ + VNIC_FUNCTION("vnic_recv_packet()\n"); + if ((netpath != vnic->current_path) || !vnic->open) { + VNIC_INFO("tossing packet\n"); + dev_kfree_skb(skb); + return; + } + + vnic->netdevice->last_rx = jiffies; + skb->dev = vnic->netdevice; + skb->protocol = eth_type_trans(skb, skb->dev); + if (!vnic->config->use_rx_csum) + skb->ip_summed = CHECKSUM_NONE; + netif_rx(skb); + vnic_recv_pkt_stats(vnic); +} + +static struct net_device_stats *vnic_get_stats(struct net_device *device) +{ + struct vnic *vnic; + struct netpath *np; + unsigned long flags; + + VNIC_FUNCTION("vnic_get_stats()\n"); + vnic = netdev_priv(device); + + spin_lock_irqsave(&vnic->current_path_lock, flags); + np = vnic->current_path; + if (np && np->viport) { + atomic_inc(&np->viport->reference_count); + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + viport_get_stats(np->viport, &vnic->stats); + atomic_dec(&np->viport->reference_count); + wake_up(&np->viport->reference_queue); + } else + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + + return &vnic->stats; +} + +static int vnic_open(struct net_device *device) +{ + struct vnic *vnic; + + VNIC_FUNCTION("vnic_open()\n"); + vnic = netdev_priv(device); + + vnic->open++; + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_PRINP_SETLINK); + netif_start_queue(vnic->netdevice); + + return 0; +} + +static int vnic_stop(struct net_device *device) +{ + struct vnic *vnic; + int ret = 0; + + VNIC_FUNCTION("vnic_stop()\n"); + vnic = netdev_priv(device); + netif_stop_queue(device); + vnic->open--; + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_PRINP_SETLINK); + + return ret; +} + +static int vnic_hard_start_xmit(struct sk_buff *skb, + struct net_device *device) +{ + struct vnic *vnic; + struct netpath *np; + cycles_t xmit_time; + int ret = -1; + + VNIC_FUNCTION("vnic_hard_start_xmit()\n"); + vnic = netdev_priv(device); + np = vnic->current_path; + + vnic_pre_pkt_xmit_stats(&xmit_time); + + if (np && np->viport) + ret = viport_xmit_packet(np->viport, skb); + + if (ret) { + vnic_xmit_fail_stats(vnic); + dev_kfree_skb_any(skb); + vnic->stats.tx_dropped++; + goto out; + } + + device->trans_start = jiffies; + vnic_post_pkt_xmit_stats(vnic, xmit_time); +out: + return 0; +} + +static void vnic_tx_timeout(struct net_device *device) +{ + struct vnic *vnic; + struct viport *viport = NULL; + unsigned long flags; + + VNIC_FUNCTION("vnic_tx_timeout()\n"); + vnic = netdev_priv(device); + device->trans_start = jiffies; + + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (vnic->current_path && vnic->current_path->viport) { + if (vnic->failed_over) { + if (vnic->current_path == &vnic->primary_path) + viport = vnic->secondary_path.viport; + else if (vnic->current_path == &vnic->secondary_path) + viport = vnic->primary_path.viport; + } else + viport = vnic->current_path->viport; + + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + if (viport) + viport_failure(viport); + } else + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + + VNIC_ERROR("vnic_tx_timeout\n"); +} + +static void vnic_set_multicast_list(struct net_device *device) +{ + struct vnic *vnic; + unsigned long flags; + + VNIC_FUNCTION("vnic_set_multicast_list()\n"); + vnic = netdev_priv(device); + + 
spin_lock_irqsave(&vnic->lock, flags);
+	if (device->mc_count == 0) {
+		if (vnic->mc_list_len) {
+			vnic->mc_list_len = vnic->mc_count = 0;
+			kfree(vnic->mc_list);
+		}
+	} else {
+		struct dev_mc_list *mc_list = device->mc_list;
+		int i;
+
+		if (device->mc_count > vnic->mc_list_len) {
+			if (vnic->mc_list_len)
+				kfree(vnic->mc_list);
+			vnic->mc_list_len = device->mc_count + 10;
+			vnic->mc_list = kmalloc(vnic->mc_list_len *
+						sizeof *mc_list, GFP_ATOMIC);
+			if (!vnic->mc_list) {
+				vnic->mc_list_len = vnic->mc_count = 0;
+				VNIC_ERROR("failed allocating mc_list\n");
+				goto failure;
+			}
+		}
+		vnic->mc_count = device->mc_count;
+		for (i = 0; i < device->mc_count; i++) {
+			vnic->mc_list[i] = *mc_list;
+			vnic->mc_list[i].next = &vnic->mc_list[i + 1];
+			mc_list = mc_list->next;
+		}
+		/* terminate the copied chain so the last entry does not
+		 * point at an uninitialized table slot */
+		vnic->mc_list[vnic->mc_count - 1].next = NULL;
+	}
+	spin_unlock_irqrestore(&vnic->lock, flags);
+
+	if (vnic->primary_path.viport)
+		viport_set_multicast(vnic->primary_path.viport,
+				     vnic->mc_list, vnic->mc_count);
+
+	if (vnic->secondary_path.viport)
+		viport_set_multicast(vnic->secondary_path.viport,
+				     vnic->mc_list, vnic->mc_count);
+
+	vnic_npevent_queue_evt(&vnic->primary_path, VNIC_PRINP_SETLINK);
+	return;
+failure:
+	spin_unlock_irqrestore(&vnic->lock, flags);
+}
+
+/*
+ * The functions below queue up events for the netpath event kernel
+ * thread; they may return before the queued event has been processed.
+ */
+static int vnic_set_mac_address(struct net_device *device, void *addr)
+{
+	struct vnic *vnic;
+	struct sockaddr *sockaddr = addr;
+	u8 *address;
+	int ret = -1;
+
+	VNIC_FUNCTION("vnic_set_mac_address()\n");
+	vnic = netdev_priv(device);
+
+	if (!is_valid_ether_addr(sockaddr->sa_data))
+		return -EADDRNOTAVAIL;
+
+	if (netif_running(device))
+		return -EBUSY;
+
+	memcpy(device->dev_addr, sockaddr->sa_data, ETH_ALEN);
+	address = sockaddr->sa_data;
+
+	if (vnic->primary_path.viport)
+		ret = viport_set_unicast(vnic->primary_path.viport,
+					 address);
+
+	if (ret)
+		return ret;
+
+	if (vnic->secondary_path.viport)
+		viport_set_unicast(vnic->secondary_path.viport, address);
+
+	vnic->mac_set = 1;
+	return 0;
+}
+
+static int vnic_change_mtu(struct net_device *device, int mtu)
+{
+	struct vnic *vnic;
+	int ret = 0;
+	int pri_max_mtu;
+	int sec_max_mtu;
+
+	VNIC_FUNCTION("vnic_change_mtu()\n");
+	vnic = netdev_priv(device);
+
+	if (vnic->primary_path.viport)
+		pri_max_mtu = viport_max_mtu(vnic->primary_path.viport);
+	else
+		pri_max_mtu = MAX_PARAM_VALUE;
+
+	if (vnic->secondary_path.viport)
+		sec_max_mtu = viport_max_mtu(vnic->secondary_path.viport);
+	else
+		sec_max_mtu = MAX_PARAM_VALUE;
+
+	if ((mtu <= pri_max_mtu) && (mtu <= sec_max_mtu)) {
+		device->mtu = mtu;
+		vnic_npevent_queue_evt(&vnic->primary_path,
+				       VNIC_PRINP_SETLINK);
+		vnic_npevent_queue_evt(&vnic->secondary_path,
+				       VNIC_SECNP_SETLINK);
+	} else if (pri_max_mtu < sec_max_mtu)
+		printk(KERN_WARNING PFX "%s: Maximum "
+		       "supported MTU size is %d. "
+		       "Cannot set MTU to %d\n",
+		       vnic->config->name, pri_max_mtu, mtu);
+	else
+		printk(KERN_WARNING PFX "%s: Maximum "
+		       "supported MTU size is %d. "
+		       "Cannot set MTU to %d\n",
+		       vnic->config->name, sec_max_mtu, mtu);
+
+	return ret;
+}
+
+static int vnic_npevent_register(struct vnic *vnic, struct netpath *netpath)
+{
+	u8 *address;
+	int ret;
+
+	if (!vnic->mac_set) {
+		/* if netpath == secondary_path, then the primary path isn't
+		 * connected. MAC address will be set when the primary
+		 * connects.
+ */ + netpath_get_hw_addr(netpath, vnic->netdevice->dev_addr); + address = vnic->netdevice->dev_addr; + + if (vnic->secondary_path.viport) + viport_set_unicast(vnic->secondary_path.viport, + address); + + vnic->mac_set = 1; + } + ret = register_netdev(vnic->netdevice); + if (ret) { + printk(KERN_ERR PFX "%s failed registering netdev " + "error %d - calling viport_failure\n", + config_viport_name(vnic->primary_path.viport->config), + ret); + vnic_free(vnic); + printk(KERN_ERR PFX "%s DELETED : register_netdev failure\n", + config_viport_name(vnic->primary_path.viport->config)); + return ret; + } + + vnic->state = VNIC_REGISTERED; + vnic->carrier = 2; /*special value to force netif_carrier_(on|off)*/ + return 0; +} + +static void vnic_npevent_dequeue_all(struct vnic *vnic) +{ + unsigned long flags; + struct vnic_npevent *npevt, *tmp; + + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + if (list_empty(&vnic_npevent_list)) + goto out; + list_for_each_entry_safe(npevt, tmp, &vnic_npevent_list, + list_ptrs) { + if ((npevt->vnic == vnic)) { + list_del(&npevt->list_ptrs); + kfree(npevt); + } + } +out: + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); +} + +static void update_path_and_reconnect(struct netpath *netpath, + struct vnic *vnic) +{ + struct viport_config *config = netpath->viport->config; + int delay = 1; + + if (vnic_ib_get_path(netpath, vnic)) + return; + /* + * tell viport_connect to wait for default_no_path_timeout + * before connecting if we are retrying the same path index + * within default_no_path_timeout. + * This prevents flooding connect requests to a path (or set + * of paths) that aren't successfully connecting for some reason. + */ + if (time_after(jiffies, + (netpath->connect_time + vnic->config->no_path_timeout))) { + netpath->path_idx = config->path_idx; + netpath->connect_time = jiffies; + netpath->delay_reconnect = 0; + delay = 0; + } else if (config->path_idx != netpath->path_idx) { + delay = netpath->delay_reconnect; + netpath->path_idx = config->path_idx; + netpath->delay_reconnect = 1; + } else + delay = 1; + viport_connect(netpath->viport, delay); +} + +static inline void vnic_set_checksum_flag(struct vnic *vnic, + struct netpath *target_path) +{ + unsigned long flags; + + spin_lock_irqsave(&vnic->current_path_lock, flags); + vnic->current_path = target_path; + vnic->failed_over = 1; + if (vnic->config->use_tx_csum && + netpath_can_tx_csum(vnic->current_path)) + vnic->netdevice->features |= NETIF_F_IP_CSUM; + + spin_unlock_irqrestore(&vnic->current_path_lock, flags); +} + +static void vnic_set_uni_multicast(struct vnic *vnic, + struct netpath *netpath) +{ + unsigned long flags; + u8 *address; + + if (vnic->mac_set) { + address = vnic->netdevice->dev_addr; + + if (netpath->viport) + viport_set_unicast(netpath->viport, address); + } + spin_lock_irqsave(&vnic->lock, flags); + + if (vnic->mc_list && netpath->viport) + viport_set_multicast(netpath->viport, vnic->mc_list, + vnic->mc_count); + + spin_unlock_irqrestore(&vnic->lock, flags); + if (vnic->state == VNIC_REGISTERED) { + if (!netpath->viport) + return; + viport_set_link(netpath->viport, + vnic->netdevice->flags & ~IFF_UP, + vnic->netdevice->mtu); + } +} + +static void vnic_set_netpath_timers(struct vnic *vnic, + struct netpath *netpath) +{ + switch (netpath->timer_state) { + case NETPATH_TS_IDLE: + netpath->timer_state = NETPATH_TS_ACTIVE; + if (vnic->state == VNIC_UNINITIALIZED) + netpath_timer(netpath, + vnic->config-> + primary_connect_timeout); + else + netpath_timer(netpath, + vnic->config-> + 
primary_reconnect_timeout); + break; + case NETPATH_TS_ACTIVE: + /*nothing to do*/ + break; + case NETPATH_TS_EXPIRED: + if (vnic->state == VNIC_UNINITIALIZED) + vnic_npevent_register(vnic, netpath); + + break; + } +} + +static void vnic_check_primary_path_timer(struct vnic *vnic) +{ + switch (vnic->primary_path.timer_state) { + case NETPATH_TS_ACTIVE: + /* nothing to do. just wait */ + break; + case NETPATH_TS_IDLE: + netpath_timer(&vnic->primary_path, + vnic->config-> + primary_switch_timeout); + break; + case NETPATH_TS_EXPIRED: + printk(KERN_INFO PFX + "%s: switching to primary path\n", + vnic->config->name); + + vnic_set_checksum_flag(vnic, &vnic->primary_path); + break; + } +} + +static void vnic_carrier_loss(struct vnic *vnic, + struct netpath *last_path) +{ + if (vnic->primary_path.carrier) { + vnic->carrier = 1; + vnic_set_checksum_flag(vnic, &vnic->primary_path); + + if (last_path && last_path != vnic->current_path) + printk(KERN_INFO PFX + "%s: failing over to primary path\n", + vnic->config->name); + else if (!last_path) + printk(KERN_INFO PFX "%s: using primary path\n", + vnic->config->name); + + } else if ((vnic->secondary_path.carrier) && + (vnic->secondary_path.timer_state != NETPATH_TS_ACTIVE)) { + vnic->carrier = 1; + vnic_set_checksum_flag(vnic, &vnic->secondary_path); + + if (last_path && last_path != vnic->current_path) + printk(KERN_INFO PFX + "%s: failing over to secondary path\n", + vnic->config->name); + else if (!last_path) + printk(KERN_INFO PFX "%s: using secondary path\n", + vnic->config->name); + + } + +} + +static void vnic_handle_path_change(struct vnic *vnic, + struct netpath **path) +{ + struct netpath *last_path = *path; + + if (!last_path) { + if (vnic->current_path == &vnic->primary_path) + last_path = &vnic->secondary_path; + else + last_path = &vnic->primary_path; + + } + + if (vnic->current_path && vnic->current_path->viport) + viport_set_link(vnic->current_path->viport, + vnic->netdevice->flags, + vnic->netdevice->mtu); + + if (last_path->viport) + viport_set_link(last_path->viport, + vnic->netdevice->flags & + ~IFF_UP, vnic->netdevice->mtu); + + vnic_restart_xmit(vnic, vnic->current_path); +} + +static void vnic_report_path_change(struct vnic *vnic, + struct netpath *last_path, + int other_path_ok) +{ + if (!vnic->current_path) { + if (last_path == &vnic->primary_path) + printk(KERN_INFO PFX "%s: primary path lost, " + "no failover path available\n", + vnic->config->name); + else + printk(KERN_INFO PFX "%s: secondary path lost, " + "no failover path available\n", + vnic->config->name); + return; + } + + if (last_path != vnic->current_path) + return; + + if (vnic->current_path == &vnic->secondary_path) { + if (other_path_ok != vnic->primary_path.carrier) { + if (other_path_ok) + printk(KERN_INFO PFX "%s: primary path no" + " longer available for failover\n", + vnic->config->name); + else + printk(KERN_INFO PFX "%s: primary path now" + " available for failover\n", + vnic->config->name); + } + } else { + if (other_path_ok != vnic->secondary_path.carrier) { + if (other_path_ok) + printk(KERN_INFO PFX "%s: secondary path no" + " longer available for failover\n", + vnic->config->name); + else + printk(KERN_INFO PFX "%s: secondary path now" + " available for failover\n", + vnic->config->name); + } + } +} + +static void vnic_handle_free_vnic_evt(struct vnic *vnic) +{ + unsigned long flags; + + if (!netif_queue_stopped(vnic->netdevice)) + netif_stop_queue(vnic->netdevice); + + netpath_timer_stop(&vnic->primary_path); + 
netpath_timer_stop(&vnic->secondary_path); + spin_lock_irqsave(&vnic->current_path_lock, flags); + vnic->current_path = NULL; + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + netpath_free(&vnic->primary_path); + netpath_free(&vnic->secondary_path); + if (vnic->state == VNIC_REGISTERED) + unregister_netdev(vnic->netdevice); + + vnic_npevent_dequeue_all(vnic); + kfree(vnic->config); + if (vnic->mc_list_len) { + vnic->mc_list_len = vnic->mc_count = 0; + kfree(vnic->mc_list); + } + + sysfs_remove_group(&vnic->dev_info.dev.kobj, + &vnic_dev_attr_group); + vnic_cleanup_stats_files(vnic); + device_unregister(&vnic->dev_info.dev); + wait_for_completion(&vnic->dev_info.released); + free_netdev(vnic->netdevice); +} + +static struct vnic *vnic_handle_npevent(struct vnic *vnic, + enum vnic_npevent_type npevt_type) +{ + struct netpath *netpath; + const char *netpath_str; + + if (npevt_type <= VNIC_PRINP_LASTTYPE) + netpath_str = netpath_to_string(vnic, &vnic->primary_path); + else if (npevt_type <= VNIC_SECNP_LASTTYPE) + netpath_str = netpath_to_string(vnic, &vnic->secondary_path); + else + netpath_str = netpath_to_string(vnic, vnic->current_path); + + VNIC_INFO("%s: processing %s, netpath=%s, carrier=%d\n", + vnic->config->name, vnic_npevent_str[npevt_type], + netpath_str, vnic->carrier); + + switch (npevt_type) { + case VNIC_PRINP_CONNECTED: + netpath = &vnic->primary_path; + if (vnic->state == VNIC_UNINITIALIZED) { + if (vnic_npevent_register(vnic, netpath)) + break; + } + vnic_set_uni_multicast(vnic, netpath); + break; + case VNIC_SECNP_CONNECTED: + vnic_set_uni_multicast(vnic, &vnic->secondary_path); + break; + case VNIC_PRINP_TIMEREXPIRED: + netpath = &vnic->primary_path; + netpath->timer_state = NETPATH_TS_EXPIRED; + if (!netpath->carrier) + update_path_and_reconnect(netpath, vnic); + break; + case VNIC_SECNP_TIMEREXPIRED: + netpath = &vnic->secondary_path; + netpath->timer_state = NETPATH_TS_EXPIRED; + if (!netpath->carrier) + update_path_and_reconnect(netpath, vnic); + else { + if (vnic->state == VNIC_UNINITIALIZED) + vnic_npevent_register(vnic, netpath); + } + break; + case VNIC_PRINP_LINKUP: + vnic->primary_path.carrier = 1; + break; + case VNIC_SECNP_LINKUP: + netpath = &vnic->secondary_path; + netpath->carrier = 1; + if (!vnic->carrier) + vnic_set_netpath_timers(vnic, netpath); + break; + case VNIC_PRINP_LINKDOWN: + vnic->primary_path.carrier = 0; + break; + case VNIC_SECNP_LINKDOWN: + if (vnic->state == VNIC_UNINITIALIZED) + netpath_timer_stop(&vnic->secondary_path); + vnic->secondary_path.carrier = 0; + break; + case VNIC_PRINP_DISCONNECTED: + netpath = &vnic->primary_path; + netpath_timer_stop(netpath); + netpath->carrier = 0; + update_path_and_reconnect(netpath, vnic); + break; + case VNIC_SECNP_DISCONNECTED: + netpath = &vnic->secondary_path; + netpath_timer_stop(netpath); + netpath->carrier = 0; + update_path_and_reconnect(netpath, vnic); + break; + case VNIC_PRINP_SETLINK: + netpath = vnic->current_path; + if (!netpath || !netpath->viport) + break; + viport_set_link(netpath->viport, + vnic->netdevice->flags, + vnic->netdevice->mtu); + break; + case VNIC_SECNP_SETLINK: + netpath = &vnic->secondary_path; + if (!netpath || !netpath->viport) + break; + viport_set_link(netpath->viport, + vnic->netdevice->flags, + vnic->netdevice->mtu); + break; + case VNIC_NP_FREEVNIC: + vnic_handle_free_vnic_evt(vnic); + vnic = NULL; + break; + } + return vnic; +} + +static int vnic_npevent_statemachine(void *context) +{ + struct vnic_npevent *vnic_link_evt; + enum vnic_npevent_type 
npevt_type; + struct vnic *vnic; + int last_carrier; + int other_path_ok = 0; + struct netpath *last_path; + + while (!vnic_npevent_thread_end || + !list_empty(&vnic_npevent_list)) { + unsigned long flags; + + wait_event_interruptible(vnic_npevent_queue, + !list_empty(&vnic_npevent_list) + || vnic_npevent_thread_end); + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + if (list_empty(&vnic_npevent_list)) { + spin_unlock_irqrestore(&vnic_npevent_list_lock, + flags); + VNIC_INFO("netpath statemachine wake" + " on empty list\n"); + continue; + } + + vnic_link_evt = list_entry(vnic_npevent_list.next, + struct vnic_npevent, + list_ptrs); + list_del(&vnic_link_evt->list_ptrs); + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); + vnic = vnic_link_evt->vnic; + npevt_type = vnic_link_evt->event_type; + kfree(vnic_link_evt); + + if (vnic->current_path == &vnic->secondary_path) + other_path_ok = vnic->primary_path.carrier; + else if (vnic->current_path == &vnic->primary_path) + other_path_ok = vnic->secondary_path.carrier; + + vnic = vnic_handle_npevent(vnic, npevt_type); + + if (!vnic) + continue; + + last_carrier = vnic->carrier; + last_path = vnic->current_path; + + if (!vnic->current_path || + !vnic->current_path->carrier) { + vnic->carrier = 0; + vnic->current_path = NULL; + vnic->netdevice->features &= ~NETIF_F_IP_CSUM; + } + + if (!vnic->carrier) + vnic_carrier_loss(vnic, last_path); + else if ((vnic->current_path != &vnic->primary_path) && + (vnic->config->prefer_primary) && + (vnic->primary_path.carrier)) + vnic_check_primary_path_timer(vnic); + + if (last_path) + vnic_report_path_change(vnic, last_path, + other_path_ok); + + VNIC_INFO("new netpath=%s, carrier=%d\n", + netpath_to_string(vnic, vnic->current_path), + vnic->carrier); + + if (vnic->current_path != last_path) + vnic_handle_path_change(vnic, &last_path); + + if (vnic->carrier != last_carrier) { + if (vnic->carrier) { + VNIC_INFO("netif_carrier_on\n"); + netif_carrier_on(vnic->netdevice); + vnic_carrier_loss_stats(vnic); + } else { + VNIC_INFO("netif_carrier_off\n"); + netif_carrier_off(vnic->netdevice); + vnic_disconn_stats(vnic); + } + + } + } + complete_and_exit(&vnic_npevent_thread_exit, 0); + return 0; +} + +void vnic_npevent_queue_evt(struct netpath *netpath, + enum vnic_npevent_type evt) +{ + struct vnic_npevent *npevent; + unsigned long flags; + + npevent = kmalloc(sizeof *npevent, GFP_ATOMIC); + if (!npevent) { + VNIC_ERROR("Could not allocate memory for vnic event\n"); + return; + } + npevent->vnic = netpath->parent; + npevent->event_type = evt; + INIT_LIST_HEAD(&npevent->list_ptrs); + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + list_add_tail(&npevent->list_ptrs, &vnic_npevent_list); + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); + wake_up(&vnic_npevent_queue); +} + +void vnic_npevent_dequeue_evt(struct netpath *netpath, + enum vnic_npevent_type evt) +{ + unsigned long flags; + struct vnic_npevent *npevt, *tmp; + struct vnic *vnic = netpath->parent; + + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + if (list_empty(&vnic_npevent_list)) + goto out; + list_for_each_entry_safe(npevt, tmp, &vnic_npevent_list, + list_ptrs) { + if ((npevt->vnic == vnic) && + (npevt->event_type == evt)) { + list_del(&npevt->list_ptrs); + kfree(npevt); + break; + } + } +out: + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); +} + +static int vnic_npevent_start(void) +{ + VNIC_FUNCTION("vnic_npevent_start()\n"); + + spin_lock_init(&vnic_npevent_list_lock); + vnic_npevent_thread = 
kthread_run(vnic_npevent_statemachine, NULL, + "qlgc_vnic_npevent_s_m"); + if (IS_ERR(vnic_npevent_thread)) { + printk(KERN_WARNING PFX "failed to create vnic npevent" + " thread; error %d\n", + (int) PTR_ERR(vnic_npevent_thread)); + vnic_npevent_thread = NULL; + return 1; + } + + return 0; +} + +void vnic_npevent_cleanup(void) +{ + if (vnic_npevent_thread) { + vnic_npevent_thread_end = 1; + wake_up(&vnic_npevent_queue); + wait_for_completion(&vnic_npevent_thread_exit); + vnic_npevent_thread = NULL; + } +} + +static void vnic_setup(struct net_device *device) +{ + ether_setup(device); + + /* ether_setup is used to fill + * device parameters for ethernet devices. + * We override some of the parameters + * which are specific to VNIC. + */ + device->get_stats = vnic_get_stats; + device->open = vnic_open; + device->stop = vnic_stop; + device->hard_start_xmit = vnic_hard_start_xmit; + device->tx_timeout = vnic_tx_timeout; + device->set_multicast_list = vnic_set_multicast_list; + device->set_mac_address = vnic_set_mac_address; + device->change_mtu = vnic_change_mtu; + device->watchdog_timeo = 10 * HZ; + device->features = 0; +} + +struct vnic *vnic_allocate(struct vnic_config *config) +{ + struct vnic *vnic = NULL; + struct net_device *netdev; + + VNIC_FUNCTION("vnic_allocate()\n"); + netdev = alloc_netdev((int) sizeof(*vnic), config->name, vnic_setup); + if (!netdev) { + VNIC_ERROR("failed allocating vnic structure\n"); + return NULL; + } + + vnic = netdev_priv(netdev); + vnic->netdevice = netdev; + spin_lock_init(&vnic->lock); + spin_lock_init(&vnic->current_path_lock); + vnic_alloc_stats(vnic); + vnic->state = VNIC_UNINITIALIZED; + vnic->config = config; + + netpath_init(&vnic->primary_path, vnic, 0); + netpath_init(&vnic->secondary_path, vnic, 1); + + vnic->current_path = NULL; + vnic->failed_over = 0; + + list_add_tail(&vnic->list_ptrs, &vnic_list); + + return vnic; +} + +void vnic_free(struct vnic *vnic) +{ + VNIC_FUNCTION("vnic_free()\n"); + list_del(&vnic->list_ptrs); + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_NP_FREEVNIC); +} + +static void __exit vnic_cleanup(void) +{ + VNIC_FUNCTION("vnic_cleanup()\n"); + + VNIC_INIT("unloading %s\n", MODULEDETAILS); + + while (!list_empty(&vnic_list)) { + struct vnic *vnic = + list_entry(vnic_list.next, struct vnic, list_ptrs); + vnic_free(vnic); + } + + vnic_npevent_cleanup(); + viport_cleanup(); + vnic_ib_cleanup(); +} + +static int __init vnic_init(void) +{ + int ret; + VNIC_FUNCTION("vnic_init()\n"); + VNIC_INIT("Initializing %s\n", MODULEDETAILS); + + ret = config_start(); + if (ret) { + VNIC_ERROR("config_start failed\n"); + goto failure; + } + + ret = vnic_ib_init(); + if (ret) { + VNIC_ERROR("ib_start failed\n"); + goto failure; + } + + ret = viport_start(); + if (ret) { + VNIC_ERROR("viport_start failed\n"); + goto failure; + } + + ret = vnic_npevent_start(); + if (ret) { + VNIC_ERROR("vnic_npevent_start failed\n"); + goto failure; + } + + return 0; +failure: + vnic_cleanup(); + return ret; +} + +module_init(vnic_init); +module_exit(vnic_cleanup); diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_main.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.h new file mode 100644 index 0000000..7535124 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.h @@ -0,0 +1,154 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_MAIN_H_INCLUDED +#define VNIC_MAIN_H_INCLUDED + +#include +#include +#include +#include + +#include "vnic_config.h" +#include "vnic_netpath.h" + +extern u16 vnic_max_mtu; +extern struct list_head vnic_list; +extern struct attribute_group vnic_stats_attr_group; +extern cycles_t vnic_recv_ref; + +enum vnic_npevent_type { + VNIC_PRINP_CONNECTED = 0, + VNIC_PRINP_DISCONNECTED = 1, + VNIC_PRINP_LINKUP = 2, + VNIC_PRINP_LINKDOWN = 3, + VNIC_PRINP_TIMEREXPIRED = 4, + VNIC_PRINP_SETLINK = 5, + + /* used to figure out PRI vs SEC types for dbg msg*/ + VNIC_PRINP_LASTTYPE = VNIC_PRINP_SETLINK, + + VNIC_SECNP_CONNECTED = 6, + VNIC_SECNP_DISCONNECTED = 7, + VNIC_SECNP_LINKUP = 8, + VNIC_SECNP_LINKDOWN = 9, + VNIC_SECNP_TIMEREXPIRED = 10, + VNIC_SECNP_SETLINK = 11, + + /* used to figure out PRI vs SEC types for dbg msg*/ + VNIC_SECNP_LASTTYPE = VNIC_SECNP_SETLINK, + + VNIC_NP_FREEVNIC = 12, + + /* + * NOTE : If any new netpath event is being added, don't forget to + * add corresponding netpath event string into vnic_main.c. 
+ */ +}; + +struct vnic_npevent { + struct list_head list_ptrs; + struct vnic *vnic; + enum vnic_npevent_type event_type; +}; + +void vnic_npevent_queue_evt(struct netpath *netpath, + enum vnic_npevent_type evt); +void vnic_npevent_dequeue_evt(struct netpath *netpath, + enum vnic_npevent_type evt); + +enum vnic_state { + VNIC_UNINITIALIZED = 0, + VNIC_REGISTERED = 1 +}; + +struct vnic { + struct list_head list_ptrs; + enum vnic_state state; + struct vnic_config *config; + struct netpath *current_path; + struct netpath primary_path; + struct netpath secondary_path; + int open; + int carrier; + int failed_over; + int mac_set; + struct net_device_stats stats; + struct net_device *netdevice; + struct dev_info dev_info; + struct dev_mc_list *mc_list; + int mc_list_len; + int mc_count; + spinlock_t lock; + spinlock_t current_path_lock; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + struct { + cycles_t start_time; + cycles_t conn_time; + cycles_t disconn_ref; /* intermediate time */ + cycles_t disconn_time; + u32 disconn_num; + cycles_t xmit_time; + u32 xmit_num; + u32 xmit_fail; + cycles_t recv_time; + u32 recv_num; + u32 multicast_recv_num; + cycles_t xmit_ref; /* intermediate time */ + cycles_t xmit_off_time; + u32 xmit_off_num; + cycles_t carrier_ref; /* intermediate time */ + cycles_t carrier_off_time; + u32 carrier_off_num; + } statistics; + struct dev_info stat_info; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ +}; + +struct vnic *vnic_allocate(struct vnic_config *config); + +void vnic_free(struct vnic *vnic); + +void vnic_connected(struct vnic *vnic, struct netpath *netpath); +void vnic_disconnected(struct vnic *vnic, struct netpath *netpath); + +void vnic_link_up(struct vnic *vnic, struct netpath *netpath); +void vnic_link_down(struct vnic *vnic, struct netpath *netpath); + +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath); +void vnic_restart_xmit(struct vnic *vnic, struct netpath *netpath); + +void vnic_recv_packet(struct vnic *vnic, struct netpath *netpath, + struct sk_buff *skb); +void vnic_npevent_cleanup(void); +void completion_callback_cleanup(struct vnic_ib_conn *ib_conn); +#endif /* VNIC_MAIN_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:54:53 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 29 May 2008 15:24:53 +0530 Subject: [ofa-general] [PATCH v3 02/13] QLogic VNIC: Netpath - abstraction of connection to EVIC/VEx In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> Message-ID: <20080529095453.9943.66549.stgit@localhost.localdomain> From: Ramachandra K This patch implements the netpath layer of QLogic VNIC. Netpath is an abstraction of a connection to EVIC. It primarily includes the implementation which maintains the timers to monitor the status of the connection to EVIC/VEx. 
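One point worth calling out for reviewers: the netpath timer handler does
no real work itself. Timer callbacks run in softirq context and must not
sleep, so vnic_npevent_timeout() only queues a TIMEREXPIRED event for the
netpath event thread added in the main patch. A minimal sketch of this
timer-to-thread pattern follows; it is illustrative only, and the names
monitor, monitor_arm, monitor_expired and queue_event are hypothetical
rather than code from this series:

	#include <linux/timer.h>
	#include <linux/jiffies.h>

	#define TIMER_EXPIRED	1

	/* Hypothetical stand-in for vnic_npevent_queue_evt(): hands the
	 * event to a worker thread and returns without sleeping. */
	extern void queue_event(void *context, int event);

	struct monitor {
		struct timer_list timer;
	};

	/* Runs in softirq context: must not sleep or take sleeping
	 * locks, so it only queues an event for the worker thread. */
	static void monitor_expired(unsigned long data)
	{
		struct monitor *m = (struct monitor *)data;

		queue_event(m, TIMER_EXPIRED);
	}

	static void monitor_arm(struct monitor *m, int timeout_jiffies)
	{
		init_timer(&m->timer);
		m->timer.function = monitor_expired;
		m->timer.data = (unsigned long)m;
		m->timer.expires = jiffies + timeout_jiffies;
		add_timer(&m->timer);
	}

netpath_timer() below follows the same shape, additionally calling
del_timer_sync() before re-arming and treating a timeout of 0 as "fire
immediately".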
Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c | 112 +++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h | 79 ++++++++++++++++ 2 files changed, 191 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c new file mode 100644 index 0000000..820b996 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c @@ -0,0 +1,112 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_viport.h" +#include "vnic_netpath.h" + +static void vnic_npevent_timeout(unsigned long data) +{ + struct netpath *netpath = (struct netpath *)data; + + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_TIMEREXPIRED); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_TIMEREXPIRED); +} + +void netpath_timer(struct netpath *netpath, int timeout) +{ + if (netpath->timer_state == NETPATH_TS_ACTIVE) + del_timer_sync(&netpath->timer); + if (timeout) { + init_timer(&netpath->timer); + netpath->timer_state = NETPATH_TS_ACTIVE; + netpath->timer.expires = jiffies + timeout; + netpath->timer.data = (unsigned long)netpath; + netpath->timer.function = vnic_npevent_timeout; + add_timer(&netpath->timer); + } else + vnic_npevent_timeout((unsigned long)netpath); +} + +void netpath_timer_stop(struct netpath *netpath) +{ + if (netpath->timer_state != NETPATH_TS_ACTIVE) + return; + del_timer_sync(&netpath->timer); + if (netpath->second_bias) + vnic_npevent_dequeue_evt(netpath, VNIC_SECNP_TIMEREXPIRED); + else + vnic_npevent_dequeue_evt(netpath, VNIC_PRINP_TIMEREXPIRED); + + netpath->timer_state = NETPATH_TS_IDLE; +} + +void netpath_free(struct netpath *netpath) +{ + if (!netpath->viport) + return; + viport_free(netpath->viport); + netpath->viport = NULL; + sysfs_remove_group(&netpath->dev_info.dev.kobj, + &vnic_path_attr_group); + device_unregister(&netpath->dev_info.dev); + wait_for_completion(&netpath->dev_info.released); +} + +void netpath_init(struct netpath *netpath, struct vnic *vnic, + int second_bias) +{ + netpath->parent = vnic; + netpath->carrier = 0; + netpath->viport = NULL; + netpath->second_bias = second_bias; + netpath->timer_state = NETPATH_TS_IDLE; + init_timer(&netpath->timer); +} + +const char *netpath_to_string(struct vnic *vnic, struct netpath *netpath) +{ + if (!netpath) + return "NULL"; + else if (netpath == &vnic->primary_path) + return "PRIMARY"; + else if (netpath == &vnic->secondary_path) + return "SECONDARY"; + else + return "UNKNOWN"; +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h new file mode 100644 index 0000000..f4e142e --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h @@ -0,0 +1,79 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_NETPATH_H_INCLUDED +#define VNIC_NETPATH_H_INCLUDED + +#include + +#include "vnic_sys.h" + +struct viport; +struct vnic; + +enum netpath_ts { + NETPATH_TS_IDLE = 0, + NETPATH_TS_ACTIVE = 1, + NETPATH_TS_EXPIRED = 2 +}; + +struct netpath { + int carrier; + struct vnic *parent; + struct viport *viport; + size_t path_idx; + unsigned long connect_time; + int second_bias; + u8 is_primary_path; + u8 delay_reconnect; + struct timer_list timer; + enum netpath_ts timer_state; + struct dev_info dev_info; +}; + +void netpath_init(struct netpath *netpath, struct vnic *vnic, + int second_bias); +void netpath_free(struct netpath *netpath); + +void netpath_timer(struct netpath *netpath, int timeout); +void netpath_timer_stop(struct netpath *netpath); + +const char *netpath_to_string(struct vnic *vnic, struct netpath *netpath); + +#define netpath_get_hw_addr(netpath, address) \ + viport_get_hw_addr((netpath)->viport, address) +#define netpath_is_connected(netpath) \ + (netpath->state == NETPATH_CONNECTED) +#define netpath_can_tx_csum(netpath) \ + viport_can_tx_csum(netpath->viport) + +#endif /* VNIC_NETPATH_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:55:23 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 29 May 2008 15:25:23 +0530 Subject: [ofa-general] [PATCH v3 03/13] QLogic VNIC: Implementation of communication protocol with EVIC/VEx In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> Message-ID: <20080529095523.9943.28433.stgit@localhost.localdomain> From: Poornima Kamath Implementation of the statemachine for the protocol used while communicating with the EVIC. The patch also implements the viport abstraction which represents the virtual ethernet port on EVIC. Signed-off-by: Poornima Kamath Signed-off-by: Ramachandra K Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c | 1214 ++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h | 175 +++ 2 files changed, 1389 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c new file mode 100644 index 0000000..7462403 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c @@ -0,0 +1,1214 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_viport.h" +#include "vnic_netpath.h" +#include "vnic_control.h" +#include "vnic_data.h" +#include "vnic_config.h" +#include "vnic_control_pkt.h" + +#define VIPORT_DISCONN_TIMER 10000 /* 10 seconds */ + +#define MAX_RETRY_INTERVAL 20000 /* 20 seconds */ +#define RETRY_INCREMENT 5000 /* 5 seconds */ +#define MAX_CONNECT_RETRY_TIMEOUT 600000 /* 10 minutes */ + +static DECLARE_WAIT_QUEUE_HEAD(viport_queue); +static LIST_HEAD(viport_list); +static DECLARE_COMPLETION(viport_thread_exit); +static spinlock_t viport_list_lock; + +static struct task_struct *viport_thread; +static int viport_thread_end; + +static void viport_timer(struct viport *viport, int timeout); + +struct viport *viport_allocate(struct viport_config *config) +{ + struct viport *viport; + + VIPORT_FUNCTION("viport_allocate()\n"); + viport = kzalloc(sizeof *viport, GFP_KERNEL); + if (!viport) { + VIPORT_ERROR("failed allocating viport structure\n"); + return NULL; + } + + viport->state = VIPORT_DISCONNECTED; + viport->link_state = LINK_FIRSTCONNECT; + viport->connect = WAIT; + viport->new_mtu = 1500; + viport->new_flags = 0; + viport->config = config; + viport->connect = DELAY; + viport->data.max_mtu = vnic_max_mtu; + spin_lock_init(&viport->lock); + init_waitqueue_head(&viport->stats_queue); + init_waitqueue_head(&viport->disconnect_queue); + init_waitqueue_head(&viport->reference_queue); + INIT_LIST_HEAD(&viport->list_ptrs); + + vnic_mc_init(viport); + + return viport; +} + +void viport_connect(struct viport *viport, int delay) +{ + VIPORT_FUNCTION("viport_connect()\n"); + + if (viport->connect != DELAY) + viport->connect = (delay) ? DELAY : NOW; + if (viport->link_state == LINK_FIRSTCONNECT) { + u32 duration; + duration = (net_random() & 0x1ff); + if (!viport->parent->is_primary_path) + duration += 0x1ff; + viport->link_state = LINK_RETRYWAIT; + viport_timer(viport, duration); + } else + viport_kick(viport); +} + +static void viport_disconnect(struct viport *viport) +{ + VIPORT_FUNCTION("viport_disconnect()\n"); + viport->disconnect = 1; + viport_failure(viport); + wait_event(viport->disconnect_queue, viport->disconnect == 0); +} + +void viport_free(struct viport *viport) +{ + VIPORT_FUNCTION("viport_free()\n"); + viport_disconnect(viport); /* NOTE: this can sleep */ + vnic_mc_uninit(viport); + kfree(viport->config); + kfree(viport); +} + +void viport_set_link(struct viport *viport, u16 flags, u16 mtu) +{ + unsigned long localflags; + int i; + + VIPORT_FUNCTION("viport_set_link()\n"); + if (mtu > data_max_mtu(&viport->data)) { + VIPORT_ERROR("configuration error." 
+ " mtu of %d unsupported by %s\n", mtu, + config_viport_name(viport->config)); + goto failure; + } + + spin_lock_irqsave(&viport->lock, localflags); + flags &= IFF_UP | IFF_ALLMULTI | IFF_PROMISC; + if ((viport->new_flags != flags) + || (viport->new_mtu != mtu)) { + viport->new_flags = flags; + viport->new_mtu = mtu; + viport->updates |= NEED_LINK_CONFIG; + if (viport->features_supported & VNIC_FEAT_INBOUND_IB_MC) { + if (((viport->mtu <= MCAST_MSG_SIZE) && (mtu > MCAST_MSG_SIZE)) || + ((viport->mtu > MCAST_MSG_SIZE) && (mtu <= MCAST_MSG_SIZE))) { + /* + * MTU value will enable/disable the multicast. In + * either case, need to send the CMD_CONFIG_ADDRESS2 to + * EVIC. Hence, setting the NEED_ADDRESS_CONFIG flag. + */ + viport->updates |= NEED_ADDRESS_CONFIG; + if (mtu <= MCAST_MSG_SIZE) { + VIPORT_PRINT("%s: MTU changed; " + "old:%d new:%d (threshold:%d);" + " MULTICAST will be enabled.\n", + config_viport_name(viport->config), + viport->mtu, mtu, + (int)MCAST_MSG_SIZE); + } else { + VIPORT_PRINT("%s: MTU changed; " + "old:%d new:%d (threshold:%d); " + "MULTICAST will be disabled.\n", + config_viport_name(viport->config), + viport->mtu, mtu, + (int)MCAST_MSG_SIZE); + } + /* When we resend these addresses, EVIC will + * send mgid=0 back in response. So no need to + * shutoff ib_multicast. + */ + for (i = MCAST_ADDR_START; i < viport->num_mac_addresses; i++) { + if (viport->mac_addresses[i].valid) + viport->mac_addresses[i].operation = VNIC_OP_SET_ENTRY; + } + } + } + viport_kick(viport); + } + + spin_unlock_irqrestore(&viport->lock, localflags); + return; +failure: + viport_failure(viport); +} + +int viport_set_unicast(struct viport *viport, u8 *address) +{ + unsigned long flags; + int ret = -1; + VIPORT_FUNCTION("viport_set_unicast()\n"); + spin_lock_irqsave(&viport->lock, flags); + + if (!viport->mac_addresses) + goto out; + + if (memcmp(viport->mac_addresses[UNICAST_ADDR].address, + address, ETH_ALEN)) { + memcpy(viport->mac_addresses[UNICAST_ADDR].address, + address, ETH_ALEN); + viport->mac_addresses[UNICAST_ADDR].operation + = VNIC_OP_SET_ENTRY; + viport->updates |= NEED_ADDRESS_CONFIG; + viport_kick(viport); + } + ret = 0; +out: + spin_unlock_irqrestore(&viport->lock, flags); + return ret; +} + +int viport_set_multicast(struct viport *viport, + struct dev_mc_list *mc_list, int mc_count) +{ + u32 old_update_list; + int i; + int ret = -1; + unsigned long flags; + + VIPORT_FUNCTION("viport_set_multicast()\n"); + spin_lock_irqsave(&viport->lock, flags); + + if (!viport->mac_addresses) + goto out; + + old_update_list = viport->updates; + if (mc_count > viport->num_mac_addresses - MCAST_ADDR_START) + viport->updates |= NEED_LINK_CONFIG | MCAST_OVERFLOW; + else { + if (mc_count == 0) { + ret = 0; + goto out; + } + if (viport->updates & MCAST_OVERFLOW) { + viport->updates &= ~MCAST_OVERFLOW; + viport->updates |= NEED_LINK_CONFIG; + } + for (i = MCAST_ADDR_START; i < mc_count + MCAST_ADDR_START; + i++, mc_list = mc_list->next) { + if (viport->mac_addresses[i].valid && + !memcmp(viport->mac_addresses[i].address, + mc_list->dmi_addr, ETH_ALEN)) + continue; + memcpy(viport->mac_addresses[i].address, + mc_list->dmi_addr, ETH_ALEN); + viport->mac_addresses[i].valid = 1; + viport->mac_addresses[i].operation = VNIC_OP_SET_ENTRY; + } + for (; i < viport->num_mac_addresses; i++) { + if (!viport->mac_addresses[i].valid) + continue; + viport->mac_addresses[i].valid = 0; + viport->mac_addresses[i].operation = VNIC_OP_SET_ENTRY; + } + if (mc_count) + viport->updates |= NEED_ADDRESS_CONFIG; + } + + if 
(viport->updates != old_update_list) + viport_kick(viport); + ret = 0; +out: + spin_unlock_irqrestore(&viport->lock, flags); + return ret; +} + +static inline void viport_disable_multicast(struct viport *viport) +{ + VIPORT_INFO("turned off IB_MULTICAST\n"); + viport->config->control_config.ib_multicast = 0; + viport->config->control_config.ib_config.conn_data.features_supported &= + __constant_cpu_to_be32((u32)~VNIC_FEAT_INBOUND_IB_MC); + viport->link_state = LINK_RESET; +} + +void viport_get_stats(struct viport *viport, + struct net_device_stats *stats) +{ + unsigned long flags; + + VIPORT_FUNCTION("viport_get_stats()\n"); + /* Reference count has been already incremented indicating + * that viport structure is being used, which prevents its + * freeing when this task sleeps + */ + if (time_after(jiffies, + (viport->last_stats_time + viport->config->stats_interval))) { + + spin_lock_irqsave(&viport->lock, flags); + viport->updates |= NEED_STATS; + spin_unlock_irqrestore(&viport->lock, flags); + viport_kick(viport); + wait_event(viport->stats_queue, + !(viport->updates & NEED_STATS) + || (viport->disconnect == 1)); + + if (viport->stats.ethernet_status) + vnic_link_up(viport->vnic, viport->parent); + else + vnic_link_down(viport->vnic, viport->parent); + } + + stats->rx_packets = be64_to_cpu(viport->stats.if_in_ok); + stats->tx_packets = be64_to_cpu(viport->stats.if_out_ok); + stats->rx_bytes = be64_to_cpu(viport->stats.if_in_octets); + stats->tx_bytes = be64_to_cpu(viport->stats.if_out_octets); + stats->rx_errors = be64_to_cpu(viport->stats.if_in_errors); + stats->tx_errors = be64_to_cpu(viport->stats.if_out_errors); + stats->rx_dropped = 0; /* EIOC doesn't track */ + stats->tx_dropped = 0; /* EIOC doesn't track */ + stats->multicast = be64_to_cpu(viport->stats.if_in_nucast_pkts); + stats->collisions = 0; /* EIOC doesn't track */ +} + +int viport_xmit_packet(struct viport *viport, struct sk_buff *skb) +{ + int status = -1; + unsigned long flags; + + VIPORT_FUNCTION("viport_xmit_packet()\n"); + spin_lock_irqsave(&viport->lock, flags); + if (viport->state == VIPORT_CONNECTED) + status = data_xmit_packet(&viport->data, skb); + spin_unlock_irqrestore(&viport->lock, flags); + + return status; +} + +void viport_kick(struct viport *viport) +{ + unsigned long flags; + + VIPORT_FUNCTION("viport_kick()\n"); + spin_lock_irqsave(&viport_list_lock, flags); + if (list_empty(&viport->list_ptrs)) { + list_add_tail(&viport->list_ptrs, &viport_list); + wake_up(&viport_queue); + } + spin_unlock_irqrestore(&viport_list_lock, flags); +} + +void viport_failure(struct viport *viport) +{ + unsigned long flags; + + VIPORT_FUNCTION("viport_failure()\n"); + vnic_stop_xmit(viport->vnic, viport->parent); + spin_lock_irqsave(&viport_list_lock, flags); + viport->errored = 1; + if (list_empty(&viport->list_ptrs)) { + list_add_tail(&viport->list_ptrs, &viport_list); + wake_up(&viport_queue); + } + spin_unlock_irqrestore(&viport_list_lock, flags); +} + +static void viport_timeout(unsigned long data) +{ + struct viport *viport; + + VIPORT_FUNCTION("viport_timeout()\n"); + viport = (struct viport *)data; + viport->timer_active = 0; + viport_kick(viport); +} + +static void viport_timer(struct viport *viport, int timeout) +{ + VIPORT_FUNCTION("viport_timer()\n"); + if (viport->timer_active) + del_timer(&viport->timer); + init_timer(&viport->timer); + viport->timer.expires = jiffies + timeout; + viport->timer.data = (unsigned long)viport; + viport->timer.function = viport_timeout; + viport->timer_active = 1; + 
add_timer(&viport->timer); +} + +static void viport_timer_stop(struct viport *viport) +{ + VIPORT_FUNCTION("viport_timer_stop()\n"); + if (viport->timer_active) + del_timer(&viport->timer); + viport->timer_active = 0; +} + +static int viport_init_mac_addresses(struct viport *viport) +{ + struct vnic_address_op2 *temp; + unsigned long flags; + int i; + + VIPORT_FUNCTION("viport_init_mac_addresses()\n"); + i = viport->num_mac_addresses * sizeof *temp; + temp = kzalloc(viport->num_mac_addresses * sizeof *temp, + GFP_KERNEL); + if (!temp) { + VIPORT_ERROR("failed allocating MAC address table\n"); + return -ENOMEM; + } + + spin_lock_irqsave(&viport->lock, flags); + viport->mac_addresses = temp; + for (i = 0; i < viport->num_mac_addresses; i++) { + viport->mac_addresses[i].index = cpu_to_be16(i); + viport->mac_addresses[i].vlan = + cpu_to_be16(viport->default_vlan); + } + memset(viport->mac_addresses[BROADCAST_ADDR].address, + 0xFF, ETH_ALEN); + viport->mac_addresses[BROADCAST_ADDR].valid = 1; + memcpy(viport->mac_addresses[UNICAST_ADDR].address, + viport->hw_mac_address, ETH_ALEN); + viport->mac_addresses[UNICAST_ADDR].valid = 1; + + spin_unlock_irqrestore(&viport->lock, flags); + + return 0; +} + +static inline void viport_match_mac_address(struct vnic *vnic, + struct viport *viport) +{ + if (vnic && vnic->current_path && + viport == vnic->current_path->viport && + vnic->mac_set && + memcmp(vnic->netdevice->dev_addr, viport->hw_mac_address, ETH_ALEN)) { + VIPORT_ERROR("*** ERROR MAC address mismatch; " + "current = %02x:%02x:%02x:%02x:%02x:%02x " + "From EVIC = %02x:%02x:%02x:%02x:%02x:%02x\n", + vnic->netdevice->dev_addr[0], + vnic->netdevice->dev_addr[1], + vnic->netdevice->dev_addr[2], + vnic->netdevice->dev_addr[3], + vnic->netdevice->dev_addr[4], + vnic->netdevice->dev_addr[5], + viport->hw_mac_address[0], + viport->hw_mac_address[1], + viport->hw_mac_address[2], + viport->hw_mac_address[3], + viport->hw_mac_address[4], + viport->hw_mac_address[5]); + } +} + +static int viport_handle_init_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_UNINITIALIZED: + LINK_STATE("state LINK_UNINITIALIZED\n"); + viport->updates = 0; + spin_lock_irq(&viport_list_lock); + list_del_init(&viport->list_ptrs); + spin_unlock_irq(&viport_list_lock); + if (atomic_read(&viport->reference_count)) { + wake_up(&viport->stats_queue); + wait_event(viport->reference_queue, + atomic_read(&viport->reference_count) == 0); + } + /* No more references to viport structure + * so it is safe to delete it by waking disconnect + * queue + */ + + viport->disconnect = 0; + wake_up(&viport->disconnect_queue); + break; + case LINK_INITIALIZE: + LINK_STATE("state LINK_INITIALIZE\n"); + viport->errored = 0; + viport->connect = WAIT; + viport->last_stats_time = 0; + if (viport->disconnect) + viport->link_state = LINK_UNINITIALIZED; + else + viport->link_state = LINK_INITIALIZECONTROL; + break; + case LINK_INITIALIZECONTROL: + LINK_STATE("state LINK_INITIALIZECONTROL\n"); + viport->pd = ib_alloc_pd(viport->config->ibdev); + if (IS_ERR(viport->pd)) + viport->link_state = LINK_DISCONNECTED; + else if (control_init(&viport->control, viport, + &viport->config->control_config, + viport->pd)) { + ib_dealloc_pd(viport->pd); + viport->link_state = LINK_DISCONNECTED; + + } else + viport->link_state = LINK_INITIALIZEDATA; + break; + case LINK_INITIALIZEDATA: + LINK_STATE("state LINK_INITIALIZEDATA\n"); + if (data_init(&viport->data, viport, + 
&viport->config->data_config, + viport->pd)) + viport->link_state = LINK_CLEANUPCONTROL; + else + viport->link_state = LINK_CONTROLCONNECT; + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_control_states(struct viport *viport) +{ + enum link_state old_state; + struct vnic *vnic; + + do { + switch (old_state = viport->link_state) { + case LINK_CONTROLCONNECT: + if (vnic_ib_cm_connect(&viport->control.ib_conn)) + viport->link_state = LINK_CLEANUPDATA; + else + viport->link_state = LINK_CONTROLCONNECTWAIT; + break; + case LINK_CONTROLCONNECTWAIT: + LINK_STATE("state LINK_CONTROLCONNECTWAIT\n"); + if (control_is_connected(&viport->control)) + viport->link_state = LINK_INITVNICREQ; + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_CONTROLDISCONNECT; + } + break; + case LINK_INITVNICREQ: + LINK_STATE("state LINK_INITVNICREQ\n"); + if (control_init_vnic_req(&viport->control)) + viport->link_state = LINK_RESETCONTROL; + else + viport->link_state = LINK_INITVNICRSP; + break; + case LINK_INITVNICRSP: + LINK_STATE("state LINK_INITVNICRSP\n"); + control_process_async(&viport->control); + + if (!control_init_vnic_rsp(&viport->control, + &viport->features_supported, + viport->hw_mac_address, + &viport->num_mac_addresses, + &viport->default_vlan)) { + if (viport_init_mac_addresses(viport)) + viport->link_state = + LINK_RESETCONTROL; + else { + viport->link_state = + LINK_BEGINDATAPATH; + /* + * Ensure that the current path's MAC + * address matches the one returned by + * EVIC - we've had cases of mismatch + * which then caused havoc. + */ + vnic = viport->parent->parent; + viport_match_mac_address(vnic, viport); + } + } + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESETCONTROL; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_data_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_BEGINDATAPATH: + LINK_STATE("state LINK_BEGINDATAPATH\n"); + viport->link_state = LINK_CONFIGDATAPATHREQ; + break; + case LINK_CONFIGDATAPATHREQ: + LINK_STATE("state LINK_CONFIGDATAPATHREQ\n"); + if (control_config_data_path_req(&viport->control, + data_path_id(&viport-> + data), + data_host_pool_max + (&viport->data), + data_eioc_pool_max + (&viport->data))) + viport->link_state = LINK_RESETCONTROL; + else + viport->link_state = LINK_CONFIGDATAPATHRSP; + break; + case LINK_CONFIGDATAPATHRSP: + LINK_STATE("state LINK_CONFIGDATAPATHRSP\n"); + control_process_async(&viport->control); + + if (!control_config_data_path_rsp(&viport->control, + data_host_pool + (&viport->data), + data_eioc_pool + (&viport->data), + data_host_pool_max + (&viport->data), + data_eioc_pool_max + (&viport->data), + data_host_pool_min + (&viport->data), + data_eioc_pool_min + (&viport->data))) + viport->link_state = LINK_DATACONNECT; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESETCONTROL; + } + break; + case LINK_DATACONNECT: + LINK_STATE("state LINK_DATACONNECT\n"); + if (data_connect(&viport->data)) + viport->link_state = LINK_RESETCONTROL; + else + viport->link_state = LINK_DATACONNECTWAIT; + break; + case LINK_DATACONNECTWAIT: + LINK_STATE("state LINK_DATACONNECTWAIT\n"); + control_process_async(&viport->control); + if (data_is_connected(&viport->data)) + viport->link_state = LINK_XCHGPOOLREQ; + + if (viport->errored) { + 
viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_xchgpool_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_XCHGPOOLREQ: + LINK_STATE("state LINK_XCHGPOOLREQ\n"); + if (control_exchange_pools_req(&viport->control, + data_local_pool_addr + (&viport->data), + data_local_pool_rkey + (&viport->data))) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_XCHGPOOLRSP; + break; + case LINK_XCHGPOOLRSP: + LINK_STATE("state LINK_XCHGPOOLRSP\n"); + control_process_async(&viport->control); + + if (!control_exchange_pools_rsp(&viport->control, + data_remote_pool_addr + (&viport->data), + data_remote_pool_rkey + (&viport->data))) + viport->link_state = LINK_INITIALIZED; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + case LINK_INITIALIZED: + LINK_STATE("state LINK_INITIALIZED\n"); + viport->state = VIPORT_CONNECTED; + printk(KERN_INFO PFX + "%s: connection established\n", + config_viport_name(viport->config)); + data_connected(&viport->data); + vnic_connected(viport->parent->parent, + viport->parent); + if (viport->features_supported & VNIC_FEAT_INBOUND_IB_MC) { + printk(KERN_INFO PFX "%s: Supports Inbound IB " + "Multicast\n", + config_viport_name(viport->config)); + if (mc_data_init(&viport->mc_data, viport, + &viport->config->data_config, + viport->pd)) { + viport_disable_multicast(viport); + break; + } + } + spin_lock_irq(&viport->lock); + viport->mtu = 1500; + viport->flags = 0; + if ((viport->mtu != viport->new_mtu) || + (viport->flags != viport->new_flags)) + viport->updates |= NEED_LINK_CONFIG; + spin_unlock_irq(&viport->lock); + viport->link_state = LINK_IDLE; + viport->retry_duration = 0; + viport->total_retry_duration = 0; + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_idle_states(struct viport *viport) +{ + enum link_state old_state; + int handle_mc_join_compl, handle_mc_join; + + do { + switch (old_state = viport->link_state) { + case LINK_IDLE: + LINK_STATE("state LINK_IDLE\n"); + if (viport->config->hb_interval) + viport_timer(viport, + viport->config->hb_interval); + viport->link_state = LINK_IDLING; + break; + case LINK_IDLING: + LINK_STATE("state LINK_IDLING\n"); + control_process_async(&viport->control); + if (viport->errored) { + viport_timer_stop(viport); + viport->errored = 0; + viport->link_state = LINK_RESET; + break; + } + + spin_lock_irq(&viport->lock); + handle_mc_join = (viport->updates & NEED_MCAST_JOIN); + handle_mc_join_compl = + (viport->updates & NEED_MCAST_COMPLETION); + /* + * Turn off both flags, the handler functions will + * rearm them if necessary. 
+ */ + viport->updates &= ~(NEED_MCAST_JOIN | NEED_MCAST_COMPLETION); + + if (viport->updates & NEED_LINK_CONFIG) { + viport_timer_stop(viport); + viport->link_state = LINK_CONFIGLINKREQ; + } else if (viport->updates & NEED_ADDRESS_CONFIG) { + viport_timer_stop(viport); + viport->link_state = LINK_CONFIGADDRSREQ; + } else if (viport->updates & NEED_STATS) { + viport_timer_stop(viport); + viport->link_state = LINK_REPORTSTATREQ; + } else if (viport->config->hb_interval) { + if (!viport->timer_active) + viport->link_state = + LINK_HEARTBEATREQ; + } + spin_unlock_irq(&viport->lock); + if (handle_mc_join) { + if (vnic_mc_join(viport)) + viport_disable_multicast(viport); + } + if (handle_mc_join_compl) + vnic_mc_join_handle_completion(viport); + + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_config_states(struct viport *viport) +{ + enum link_state old_state; + int res; + + do { + switch (old_state = viport->link_state) { + case LINK_CONFIGLINKREQ: + LINK_STATE("state LINK_CONFIGLINKREQ\n"); + spin_lock_irq(&viport->lock); + viport->updates &= ~NEED_LINK_CONFIG; + viport->flags = viport->new_flags; + if (viport->updates & MCAST_OVERFLOW) + viport->flags |= IFF_ALLMULTI; + viport->mtu = viport->new_mtu; + spin_unlock_irq(&viport->lock); + if (control_config_link_req(&viport->control, + viport->flags, + viport->mtu)) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_CONFIGLINKRSP; + break; + case LINK_CONFIGLINKRSP: + LINK_STATE("state LINK_CONFIGLINKRSP\n"); + control_process_async(&viport->control); + + if (!control_config_link_rsp(&viport->control, + &viport->flags, + &viport->mtu)) + viport->link_state = LINK_IDLE; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + case LINK_CONFIGADDRSREQ: + LINK_STATE("state LINK_CONFIGADDRSREQ\n"); + + spin_lock_irq(&viport->lock); + res = control_config_addrs_req(&viport->control, + viport->mac_addresses, + viport-> + num_mac_addresses); + + if (res > 0) { + viport->updates &= ~NEED_ADDRESS_CONFIG; + viport->link_state = LINK_CONFIGADDRSRSP; + } else if (res == 0) + viport->link_state = LINK_CONFIGADDRSRSP; + else + viport->link_state = LINK_RESET; + spin_unlock_irq(&viport->lock); + break; + case LINK_CONFIGADDRSRSP: + LINK_STATE("state LINK_CONFIGADDRSRSP\n"); + control_process_async(&viport->control); + + if (!control_config_addrs_rsp(&viport->control)) + viport->link_state = LINK_IDLE; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_stat_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_REPORTSTATREQ: + LINK_STATE("state LINK_REPORTSTATREQ\n"); + if (control_report_statistics_req(&viport->control)) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_REPORTSTATRSP; + break; + case LINK_REPORTSTATRSP: + LINK_STATE("state LINK_REPORTSTATRSP\n"); + control_process_async(&viport->control); + + spin_lock_irq(&viport->lock); + if (control_report_statistics_rsp(&viport->control, + &viport->stats) == 0) { + viport->updates &= ~NEED_STATS; + viport->last_stats_time = jiffies; + wake_up(&viport->stats_queue); + viport->link_state = LINK_IDLE; + } + + spin_unlock_irq(&viport->lock); + + if (viport->errored) { + viport->errored = 0; + viport->link_state = 
LINK_RESET; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_heartbeat_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_HEARTBEATREQ: + LINK_STATE("state LINK_HEARTBEATREQ\n"); + if (control_heartbeat_req(&viport->control, + viport->config->hb_timeout)) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_HEARTBEATRSP; + break; + case LINK_HEARTBEATRSP: + LINK_STATE("state LINK_HEARTBEATRSP\n"); + control_process_async(&viport->control); + + if (!control_heartbeat_rsp(&viport->control)) + viport->link_state = LINK_IDLE; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_reset_states(struct viport *viport) +{ + enum link_state old_state; + int handle_mc_join_compl = 0, handle_mc_join = 0; + + do { + switch (old_state = viport->link_state) { + case LINK_RESET: + LINK_STATE("state LINK_RESET\n"); + viport->errored = 0; + spin_lock_irq(&viport->lock); + viport->state = VIPORT_DISCONNECTED; + /* + * Turn off both flags, the handler functions will + * rearm them if necessary + */ + viport->updates &= ~(NEED_MCAST_JOIN | NEED_MCAST_COMPLETION); + + spin_unlock_irq(&viport->lock); + vnic_link_down(viport->vnic, viport->parent); + printk(KERN_INFO PFX + "%s: connection lost\n", + config_viport_name(viport->config)); + if (handle_mc_join) { + if (vnic_mc_join(viport)) + viport_disable_multicast(viport); + } + if (handle_mc_join_compl) + vnic_mc_join_handle_completion(viport); + if (viport->features_supported & VNIC_FEAT_INBOUND_IB_MC) { + vnic_mc_leave(viport); + vnic_mc_data_cleanup(&viport->mc_data); + } + + if (control_reset_req(&viport->control)) + viport->link_state = LINK_DATADISCONNECT; + else + viport->link_state = LINK_RESETRSP; + break; + case LINK_RESETRSP: + LINK_STATE("state LINK_RESETRSP\n"); + control_process_async(&viport->control); + + if (!control_reset_rsp(&viport->control)) + viport->link_state = LINK_DATADISCONNECT; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_DATADISCONNECT; + } + break; + case LINK_RESETCONTROL: + LINK_STATE("state LINK_RESETCONTROL\n"); + if (control_reset_req(&viport->control)) + viport->link_state = LINK_CONTROLDISCONNECT; + else + viport->link_state = LINK_RESETCONTROLRSP; + break; + case LINK_RESETCONTROLRSP: + LINK_STATE("state LINK_RESETCONTROLRSP\n"); + control_process_async(&viport->control); + + if (!control_reset_rsp(&viport->control)) + viport->link_state = LINK_CONTROLDISCONNECT; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_CONTROLDISCONNECT; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_disconn_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_DATADISCONNECT: + LINK_STATE("state LINK_DATADISCONNECT\n"); + data_disconnect(&viport->data); + viport->link_state = LINK_CONTROLDISCONNECT; + break; + case LINK_CONTROLDISCONNECT: + LINK_STATE("state LINK_CONTROLDISCONNECT\n"); + viport->link_state = LINK_CLEANUPDATA; + break; + case LINK_CLEANUPDATA: + LINK_STATE("state LINK_CLEANUPDATA\n"); + data_cleanup(&viport->data); + viport->link_state = LINK_CLEANUPCONTROL; + break; + case 
LINK_CLEANUPCONTROL: + LINK_STATE("state LINK_CLEANUPCONTROL\n"); + spin_lock_irq(&viport->lock); + kfree(viport->mac_addresses); + viport->mac_addresses = NULL; + spin_unlock_irq(&viport->lock); + control_cleanup(&viport->control); + ib_dealloc_pd(viport->pd); + viport->link_state = LINK_DISCONNECTED; + break; + case LINK_DISCONNECTED: + LINK_STATE("state LINK_DISCONNECTED\n"); + vnic_disconnected(viport->parent->parent, + viport->parent); + if (viport->disconnect != 0) + viport->link_state = LINK_UNINITIALIZED; + else if (viport->retry == 1) { + viport->retry = 0; + /* + * Check if the initial retry interval has crossed + * 20 seconds. + * The retry interval is initially 5 seconds which + * is incremented by 5. Once it is 20 the interval + * is fixed to 20 seconds till 10 minutes, + * after which retrying is stopped + */ + if (viport->retry_duration < MAX_RETRY_INTERVAL) + viport->retry_duration += + RETRY_INCREMENT; + + viport->total_retry_duration += + viport->retry_duration; + + if (viport->total_retry_duration >= + MAX_CONNECT_RETRY_TIMEOUT) { + viport->link_state = LINK_UNINITIALIZED; + printk("Timed out after retrying" + " for retry_duration %d msecs\n" + , viport->total_retry_duration); + } else { + viport->connect = DELAY; + viport->link_state = LINK_RETRYWAIT; + } + viport_timer(viport, + msecs_to_jiffies(viport->retry_duration)); + } else { + u32 duration = 5000 + ((net_random()) & 0x1FF); + if (!viport->parent->is_primary_path) + duration += 0x1ff; + viport_timer(viport, + msecs_to_jiffies(duration)); + viport->connect = DELAY; + viport->link_state = LINK_RETRYWAIT; + } + break; + case LINK_RETRYWAIT: + LINK_STATE("state LINK_RETRYWAIT\n"); + viport->stats.ethernet_status = 0; + viport->updates = 0; + wake_up(&viport->stats_queue); + if (viport->disconnect != 0) { + viport_timer_stop(viport); + viport->link_state = LINK_UNINITIALIZED; + } else if (viport->connect == DELAY) { + if (!viport->timer_active) + viport->link_state = LINK_INITIALIZE; + } else if (viport->connect == NOW) { + viport_timer_stop(viport); + viport->link_state = LINK_INITIALIZE; + } + break; + case LINK_FIRSTCONNECT: + viport->stats.ethernet_status = 0; + viport->updates = 0; + wake_up(&viport->stats_queue); + if (viport->disconnect != 0) { + viport_timer_stop(viport); + viport->link_state = LINK_UNINITIALIZED; + } + + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_statemachine(void *context) +{ + struct viport *viport; + enum link_state old_link_state; + + VIPORT_FUNCTION("viport_statemachine()\n"); + while (!viport_thread_end || !list_empty(&viport_list)) { + wait_event_interruptible(viport_queue, + !list_empty(&viport_list) + || viport_thread_end); + spin_lock_irq(&viport_list_lock); + if (list_empty(&viport_list)) { + spin_unlock_irq(&viport_list_lock); + continue; + } + viport = list_entry(viport_list.next, struct viport, + list_ptrs); + list_del_init(&viport->list_ptrs); + spin_unlock_irq(&viport_list_lock); + + do { + old_link_state = viport->link_state; + + /* + * Optimize for the state machine steady state + * by checking for the most common states first. 
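+			 * (Once the link is up, the viport spends nearly
+			 * all of its time in the idle, heartbeat, stat
+			 * and config states, so those handlers are tried
+			 * before the connect/reset/disconnect handlers.)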
+ * + */ + if (viport_handle_idle_states(viport) == 0) + break; + if (viport_handle_heartbeat_states(viport) == 0) + break; + if (viport_handle_stat_states(viport) == 0) + break; + if (viport_handle_config_states(viport) == 0) + break; + + if (viport_handle_init_states(viport) == 0) + break; + if (viport_handle_control_states(viport) == 0) + break; + if (viport_handle_data_states(viport) == 0) + break; + if (viport_handle_xchgpool_states(viport) == 0) + break; + if (viport_handle_reset_states(viport) == 0) + break; + if (viport_handle_disconn_states(viport) == 0) + break; + } while (viport->link_state != old_link_state); + } + + complete_and_exit(&viport_thread_exit, 0); +} + +int viport_start(void) +{ + VIPORT_FUNCTION("viport_start()\n"); + + spin_lock_init(&viport_list_lock); + viport_thread = kthread_run(viport_statemachine, NULL, + "qlgc_vnic_viport_s_m"); + if (IS_ERR(viport_thread)) { + printk(KERN_WARNING PFX "Could not create viport_thread;" + " error %d\n", (int) PTR_ERR(viport_thread)); + viport_thread = NULL; + return 1; + } + + return 0; +} + +void viport_cleanup(void) +{ + VIPORT_FUNCTION("viport_cleanup()\n"); + if (viport_thread) { + viport_thread_end = 1; + wake_up(&viport_queue); + wait_for_completion(&viport_thread_exit); + viport_thread = NULL; + } +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h new file mode 100644 index 0000000..70cdc9f --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h @@ -0,0 +1,175 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_VIPORT_H_INCLUDED +#define VNIC_VIPORT_H_INCLUDED + +#include "vnic_control.h" +#include "vnic_data.h" +#include "vnic_multicast.h" + +enum viport_state { + VIPORT_DISCONNECTED = 0, + VIPORT_CONNECTED = 1 +}; + +enum link_state { + LINK_UNINITIALIZED = 0, + LINK_INITIALIZE = 1, + LINK_INITIALIZECONTROL = 2, + LINK_INITIALIZEDATA = 3, + LINK_CONTROLCONNECT = 4, + LINK_CONTROLCONNECTWAIT = 5, + LINK_INITVNICREQ = 6, + LINK_INITVNICRSP = 7, + LINK_BEGINDATAPATH = 8, + LINK_CONFIGDATAPATHREQ = 9, + LINK_CONFIGDATAPATHRSP = 10, + LINK_DATACONNECT = 11, + LINK_DATACONNECTWAIT = 12, + LINK_XCHGPOOLREQ = 13, + LINK_XCHGPOOLRSP = 14, + LINK_INITIALIZED = 15, + LINK_IDLE = 16, + LINK_IDLING = 17, + LINK_CONFIGLINKREQ = 18, + LINK_CONFIGLINKRSP = 19, + LINK_CONFIGADDRSREQ = 20, + LINK_CONFIGADDRSRSP = 21, + LINK_REPORTSTATREQ = 22, + LINK_REPORTSTATRSP = 23, + LINK_HEARTBEATREQ = 24, + LINK_HEARTBEATRSP = 25, + LINK_RESET = 26, + LINK_RESETRSP = 27, + LINK_RESETCONTROL = 28, + LINK_RESETCONTROLRSP = 29, + LINK_DATADISCONNECT = 30, + LINK_CONTROLDISCONNECT = 31, + LINK_CLEANUPDATA = 32, + LINK_CLEANUPCONTROL = 33, + LINK_DISCONNECTED = 34, + LINK_RETRYWAIT = 35, + LINK_FIRSTCONNECT = 36 +}; + +enum { + BROADCAST_ADDR = 0, + UNICAST_ADDR = 1, + MCAST_ADDR_START = 2 +}; + +#define current_mac_address mac_addresses[UNICAST_ADDR].address + +enum { + NEED_STATS = 0x00000001, + NEED_ADDRESS_CONFIG = 0x00000002, + NEED_LINK_CONFIG = 0x00000004, + MCAST_OVERFLOW = 0x00000008, + NEED_MCAST_COMPLETION = 0x00000010, + NEED_MCAST_JOIN = 0x00000020 +}; + +struct viport { + struct list_head list_ptrs; + struct netpath *parent; + struct vnic *vnic; + struct viport_config *config; + struct control control; + struct data data; + spinlock_t lock; + struct ib_pd *pd; + enum viport_state state; + enum link_state link_state; + struct vnic_cmd_report_stats_rsp stats; + wait_queue_head_t stats_queue; + unsigned long last_stats_time; + u32 features_supported; + u8 hw_mac_address[ETH_ALEN]; + u16 default_vlan; + u16 num_mac_addresses; + struct vnic_address_op2 *mac_addresses; + u32 updates; + u16 flags; + u16 new_flags; + u16 mtu; + u16 new_mtu; + u32 errored; + enum { WAIT, DELAY, NOW } connect; + u32 disconnect; + u32 retry; + wait_queue_head_t disconnect_queue; + int timer_active; + struct timer_list timer; + u32 retry_duration; + u32 total_retry_duration; + atomic_t reference_count; + wait_queue_head_t reference_queue; + struct mc_info mc_info; + struct mc_data mc_data; +}; + +int viport_start(void); +void viport_cleanup(void); + +struct viport *viport_allocate(struct viport_config *config); +void viport_free(struct viport *viport); + +void viport_connect(struct viport *viport, int delay); + +void viport_set_link(struct viport *viport, u16 flags, u16 mtu); +void viport_get_stats(struct viport *viport, + struct net_device_stats *stats); +int viport_xmit_packet(struct viport *viport, struct sk_buff *skb); +void viport_kick(struct viport *viport); + +void viport_failure(struct viport *viport); + +int viport_set_unicast(struct viport *viport, u8 *address); +int viport_set_multicast(struct viport *viport, + struct dev_mc_list *mc_list, + int mc_count); + +#define viport_max_mtu(viport) data_max_mtu(&(viport)->data) + +#define viport_get_hw_addr(viport, address) \ + memcpy(address, (viport)->hw_mac_address, ETH_ALEN) + +#define viport_features(viport) ((viport)->features_supported) + +#define viport_can_tx_csum(viport) \ + (((viport)->features_supported & \ + (VNIC_FEAT_IPV4_CSUM_TX | VNIC_FEAT_TCP_CSUM_TX 
| \
+	    VNIC_FEAT_UDP_CSUM_TX)) == (VNIC_FEAT_IPV4_CSUM_TX | \
+	    VNIC_FEAT_TCP_CSUM_TX | VNIC_FEAT_UDP_CSUM_TX))
+
+#endif	/* VNIC_VIPORT_H_INCLUDED */

From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:55:54 2008
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Thu, 29 May 2008 15:25:54 +0530
Subject: [ofa-general] [PATCH v3 04/13] QLogic VNIC: Implementation of
	Control path of communication protocol
In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain>
References: <20080529095126.9943.84692.stgit@localhost.localdomain>
Message-ID: <20080529095554.9943.43485.stgit@localhost.localdomain>

From: Poornima Kamath

This patch adds the files that define the control packet formats and
implements the various control messages that are exchanged as part of
the communication protocol with the EVIC/VEx.

Signed-off-by: Poornima Kamath
Signed-off-by: Ramachandra K
Signed-off-by: Amar Mudrankit
---

 drivers/infiniband/ulp/qlgc_vnic/vnic_control.c | 2286 ++++++++++++++++++++
 drivers/infiniband/ulp/qlgc_vnic/vnic_control.h |  179 ++
 .../infiniband/ulp/qlgc_vnic/vnic_control_pkt.h |  368 +++
 3 files changed, 2833 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control.c
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control.h
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h

diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_control.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_control.c
new file mode 100644
index 0000000..774a071
--- /dev/null
+++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_control.c
@@ -0,0 +1,2286 @@
+/*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *	- Redistributions of source code must retain the above
+ *	  copyright notice, this list of conditions and the following
+ *	  disclaimer.
+ *
+ *	- Redistributions in binary form must reproduce the above
+ *	  copyright notice, this list of conditions and the following
+ *	  disclaimer in the documentation and/or other materials
+ *	  provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */ + +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_viport.h" +#include "vnic_stats.h" + +#define vnic_multicast_address(rsp2_address, index) \ + ((rsp2_address)->list_address_ops[index].address[0] & 0x01) + +static void control_log_control_packet(struct vnic_control_packet *pkt); + +char *control_ifcfg_name(struct control *control) +{ + if (!control) + return "nctl"; + if (!control->parent) + return "np"; + if (!control->parent->parent) + return "npp"; + if (!control->parent->parent->parent) + return "nppp"; + if (!control->parent->parent->parent->config) + return "npppc"; + return (control->parent->parent->parent->config->name); +} + +static void control_recv(struct control *control, struct recv_io *recv_io) +{ + if (vnic_ib_post_recv(&control->ib_conn, &recv_io->io)) + viport_failure(control->parent); +} + +static void control_recv_complete(struct io *io) +{ + struct recv_io *recv_io = (struct recv_io *)io; + struct recv_io *last_recv_io; + struct control *control = &io->viport->control; + struct vnic_control_packet *pkt = control_packet(recv_io); + struct vnic_control_header *c_hdr = &pkt->hdr; + unsigned long flags; + cycles_t response_time; + + CONTROL_FUNCTION("%s: control_recv_complete() State=%d\n", + control_ifcfg_name(control), control->req_state); + + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + control_note_rsptime_stats(&response_time); + CONTROL_PACKET(pkt); + spin_lock_irqsave(&control->io_lock, flags); + if (c_hdr->pkt_type == TYPE_INFO) { + last_recv_io = control->info; + control->info = recv_io; + spin_unlock_irqrestore(&control->io_lock, flags); + viport_kick(control->parent); + if (last_recv_io) + control_recv(control, last_recv_io); + } else if (c_hdr->pkt_type == TYPE_RSP) { + u8 repost = 0; + u8 fail = 0; + u8 kick = 0; + + switch (control->req_state) { + case REQ_INACTIVE: + case RSP_RECEIVED: + case REQ_COMPLETED: + CONTROL_ERROR("%s: Unexpected control" + "response received: CMD = %d\n", + control_ifcfg_name(control), + c_hdr->pkt_cmd); + control_log_control_packet(pkt); + control->req_state = REQ_FAILED; + fail = 1; + break; + case REQ_POSTED: + case REQ_SENT: + if (c_hdr->pkt_cmd != control->last_cmd + || c_hdr->pkt_seq_num != control->seq_num) { + CONTROL_ERROR("%s: Incorrect Control Response " + "received\n", + control_ifcfg_name(control)); + CONTROL_ERROR("%s: Sent control request:\n", + control_ifcfg_name(control)); + control_log_control_packet(control_last_req(control)); + CONTROL_ERROR("%s: Received control response:\n", + control_ifcfg_name(control)); + control_log_control_packet(pkt); + control->req_state = REQ_FAILED; + fail = 1; + } else { + control->response = recv_io; + control_update_rsptime_stats(control, + response_time); + if (control->req_state == REQ_POSTED) { + CONTROL_INFO("%s: Recv CMD RSP %d" + "before Send Completion\n", + control_ifcfg_name(control), + c_hdr->pkt_cmd); + control->req_state = RSP_RECEIVED; + } else { + control->req_state = REQ_COMPLETED; + kick = 1; + } + } + break; + case REQ_FAILED: + /* stay in REQ_FAILED state */ + repost = 1; + break; + } + spin_unlock_irqrestore(&control->io_lock, flags); + /* we must do this outside the lock*/ + if (kick) + viport_kick(control->parent); + if (repost || fail) { + control_recv(control, recv_io); + if (fail) + viport_failure(control->parent); + } + + } else { + list_add_tail(&recv_io->io.list_ptrs, + &control->failure_list); + 
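+		/*
+		 * TYPE_ERR and unrecognized packets are queued here and
+		 * drained later by control_process_async().
+		 */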
spin_unlock_irqrestore(&control->io_lock, flags);
+		viport_kick(control->parent);
+	}
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->recv_dma, control->recv_len,
+				      DMA_FROM_DEVICE);
+}
+
+static void control_timeout(unsigned long data)
+{
+	struct control *control;
+	unsigned long flags;
+	u8 fail = 0;
+	u8 kick = 0;
+
+	control = (struct control *)data;
+	CONTROL_FUNCTION("%s: control_timeout(), State=%d\n",
+			 control_ifcfg_name(control), control->req_state);
+	control->timer_state = TIMER_EXPIRED;
+
+	spin_lock_irqsave(&control->io_lock, flags);
+	switch (control->req_state) {
+	case REQ_INACTIVE:
+		kick = 1;
+		/* stay in REQ_INACTIVE state */
+		break;
+	case REQ_POSTED:
+	case REQ_SENT:
+		control->req_state = REQ_FAILED;
+		CONTROL_ERROR("%s: No send completion for Cmd=%d\n",
+			      control_ifcfg_name(control), control->last_cmd);
+		control_timeout_stats(control);
+		fail = 1;
+		break;
+	case RSP_RECEIVED:
+		control->req_state = REQ_FAILED;
+		CONTROL_ERROR("%s: No response received from EIOC for Cmd=%d\n",
+			      control_ifcfg_name(control), control->last_cmd);
+		control_timeout_stats(control);
+		fail = 1;
+		break;
+	case REQ_COMPLETED:
+		/* stay in REQ_COMPLETED state */
+		kick = 1;
+		break;
+	case REQ_FAILED:
+		/* stay in REQ_FAILED state */
+		break;
+	}
+	spin_unlock_irqrestore(&control->io_lock, flags);
+	/* we must do this outside the lock */
+	if (fail)
+		viport_failure(control->parent);
+	if (kick)
+		viport_kick(control->parent);
+
+	return;
+}
+
+static void control_timer(struct control *control, int timeout)
+{
+	CONTROL_FUNCTION("%s: control_timer()\n",
+			 control_ifcfg_name(control));
+	if (control->timer_state == TIMER_ACTIVE)
+		mod_timer(&control->timer, jiffies + timeout);
+	else {
+		init_timer(&control->timer);
+		control->timer.expires = jiffies + timeout;
+		control->timer.data = (unsigned long)control;
+		control->timer.function = control_timeout;
+		control->timer_state = TIMER_ACTIVE;
+		add_timer(&control->timer);
+	}
+}
+
+static void control_timer_stop(struct control *control)
+{
+	CONTROL_FUNCTION("%s: control_timer_stop()\n",
+			 control_ifcfg_name(control));
+	if (control->timer_state == TIMER_ACTIVE)
+		del_timer_sync(&control->timer);
+
+	control->timer_state = TIMER_IDLE;
+}
+
+static int control_send(struct control *control, struct send_io *send_io)
+{
+	unsigned long flags;
+	int ret = -1;
+	u8 fail = 0;
+	struct vnic_control_packet *pkt = control_packet(send_io);
+
+	CONTROL_FUNCTION("%s: control_send(), State=%d\n",
+			 control_ifcfg_name(control), control->req_state);
+	spin_lock_irqsave(&control->io_lock, flags);
+	switch (control->req_state) {
+	case REQ_INACTIVE:
+		CONTROL_PACKET(pkt);
+		control_timer(control, control->config->rsp_timeout);
+		control_note_reqtime_stats(control);
+		if (vnic_ib_post_send(&control->ib_conn, &control->send_io.io)) {
+			CONTROL_ERROR("%s: Failed to post send\n",
+				      control_ifcfg_name(control));
+			/* stay in REQ_INACTIVE state */
+			fail = 1;
+		} else {
+			control->last_cmd = pkt->hdr.pkt_cmd;
+			control->req_state = REQ_POSTED;
+			ret = 0;
+		}
+		break;
+	case REQ_POSTED:
+	case REQ_SENT:
+	case RSP_RECEIVED:
+	case REQ_COMPLETED:
+		CONTROL_ERROR("%s: Previous command is not completed."
+			      " New CMD: %d Last CMD: %d Seq: %d\n",
+			      control_ifcfg_name(control), pkt->hdr.pkt_cmd,
+			      control->last_cmd, control->seq_num);
+
+		control->req_state = REQ_FAILED;
+		fail = 1;
+		break;
+	case REQ_FAILED:
+		/* this can occur after an error, when the viport state
+		 * machine attempts to reset the link.
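+		 * The request is simply dropped: nothing is posted,
+		 * control_send() returns failure, and the reset
+		 * sequence carries on.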
+ */ + CONTROL_INFO("%s:Attempt to send in failed state." + "New CMD: %d Last CMD: %d\n", + control_ifcfg_name(control), pkt->hdr.pkt_cmd, + control->last_cmd); + /* stay in REQ_FAILED state*/ + break; + } + spin_unlock_irqrestore(&control->io_lock, flags); + + /* we must do this outside the lock */ + if (fail) + viport_failure(control->parent); + return ret; + +} + +static void control_send_complete(struct io *io) +{ + struct control *control = &io->viport->control; + unsigned long flags; + u8 fail = 0; + u8 kick = 0; + + CONTROL_FUNCTION("%s: control_sendComplete(), State=%d\n", + control_ifcfg_name(control), control->req_state); + spin_lock_irqsave(&control->io_lock, flags); + switch (control->req_state) { + case REQ_INACTIVE: + case REQ_SENT: + case REQ_COMPLETED: + CONTROL_ERROR("%s: Unexpected control send completion\n", + control_ifcfg_name(control)); + fail = 1; + control->req_state = REQ_FAILED; + break; + case REQ_POSTED: + control->req_state = REQ_SENT; + break; + case RSP_RECEIVED: + control->req_state = REQ_COMPLETED; + kick = 1; + break; + case REQ_FAILED: + /* stay in REQ_FAILED state */ + break; + } + spin_unlock_irqrestore(&control->io_lock, flags); + /* we must do this outside the lock */ + if (fail) + viport_failure(control->parent); + if (kick) + viport_kick(control->parent); + + return; +} + +void control_process_async(struct control *control) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + unsigned long flags; + + CONTROL_FUNCTION("%s: control_process_async()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + spin_lock_irqsave(&control->io_lock, flags); + recv_io = control->info; + if (recv_io) { + CONTROL_INFO("%s: processing info packet\n", + control_ifcfg_name(control)); + control->info = NULL; + spin_unlock_irqrestore(&control->io_lock, flags); + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd == CMD_REPORT_STATUS) { + u32 status; + status = + be32_to_cpu(pkt->cmd.report_status.status_number); + switch (status) { + case VNIC_STATUS_LINK_UP: + CONTROL_INFO("%s: link up\n", + control_ifcfg_name(control)); + vnic_link_up(control->parent->vnic, + control->parent->parent); + break; + case VNIC_STATUS_LINK_DOWN: + CONTROL_INFO("%s: link down\n", + control_ifcfg_name(control)); + vnic_link_down(control->parent->vnic, + control->parent->parent); + break; + default: + CONTROL_ERROR("%s: asynchronous status" + " received from EIOC\n", + control_ifcfg_name(control)); + control_log_control_packet(pkt); + break; + } + } + if ((pkt->hdr.pkt_cmd != CMD_REPORT_STATUS) || + pkt->cmd.report_status.is_fatal) + viport_failure(control->parent); + + control_recv(control, recv_io); + spin_lock_irqsave(&control->io_lock, flags); + } + + while (!list_empty(&control->failure_list)) { + CONTROL_INFO("%s: processing error packet\n", + control_ifcfg_name(control)); + recv_io = (struct recv_io *) + list_entry(control->failure_list.next, struct io, + list_ptrs); + list_del(&recv_io->io.list_ptrs); + spin_unlock_irqrestore(&control->io_lock, flags); + pkt = control_packet(recv_io); + CONTROL_ERROR("%s: asynchronous error received from EIOC\n", + control_ifcfg_name(control)); + control_log_control_packet(pkt); + if ((pkt->hdr.pkt_type != TYPE_ERR) + || (pkt->hdr.pkt_cmd != CMD_REPORT_STATUS) + || pkt->cmd.report_status.is_fatal) + viport_failure(control->parent); + + control_recv(control, recv_io); + spin_lock_irqsave(&control->io_lock, flags); + } + 
spin_unlock_irqrestore(&control->io_lock, flags); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + CONTROL_FUNCTION("%s: done control_process_async\n", + control_ifcfg_name(control)); +} + +static struct send_io *control_init_hdr(struct control *control, u8 cmd) +{ + struct control_config *config; + struct vnic_control_packet *pkt; + struct vnic_control_header *hdr; + + CONTROL_FUNCTION("control_init_hdr()\n"); + config = control->config; + + pkt = control_packet(&control->send_io); + hdr = &pkt->hdr; + + hdr->pkt_type = TYPE_REQ; + hdr->pkt_cmd = cmd; + control->seq_num++; + hdr->pkt_seq_num = control->seq_num; + hdr->pkt_retry_count = 0; + + return &control->send_io; +} + +static struct recv_io *control_get_rsp(struct control *control) +{ + struct recv_io *recv_io = NULL; + unsigned long flags; + u8 fail = 0; + + CONTROL_FUNCTION("%s: control_getRsp(), State=%d\n", + control_ifcfg_name(control), control->req_state); + spin_lock_irqsave(&control->io_lock, flags); + switch (control->req_state) { + case REQ_INACTIVE: + CONTROL_ERROR("%s: Checked for Response with no" + "command pending\n", + control_ifcfg_name(control)); + control->req_state = REQ_FAILED; + fail = 1; + break; + case REQ_POSTED: + case REQ_SENT: + case RSP_RECEIVED: + /* no response available yet + stay in present state*/ + break; + case REQ_COMPLETED: + recv_io = control->response; + if (!recv_io) { + control->req_state = REQ_FAILED; + fail = 1; + break; + } + control->response = NULL; + control->last_cmd = CMD_INVALID; + control_timer_stop(control); + control->req_state = REQ_INACTIVE; + break; + case REQ_FAILED: + control_timer_stop(control); + /* stay in REQ_FAILED state*/ + break; + } + spin_unlock_irqrestore(&control->io_lock, flags); + if (fail) + viport_failure(control->parent); + return recv_io; +} + +int control_init_vnic_req(struct control *control) +{ + struct send_io *send_io; + struct control_config *config = control->config; + struct vnic_control_packet *pkt; + struct vnic_cmd_init_vnic_req *init_vnic_req; + + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_INIT_VNIC); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + init_vnic_req = &pkt->cmd.init_vnic_req; + init_vnic_req->vnic_major_version = + __constant_cpu_to_be16(VNIC_MAJORVERSION); + init_vnic_req->vnic_minor_version = + __constant_cpu_to_be16(VNIC_MINORVERSION); + init_vnic_req->vnic_instance = config->vnic_instance; + init_vnic_req->num_data_paths = 1; + init_vnic_req->num_address_entries = + cpu_to_be16(config->max_address_entries); + + control->last_cmd = pkt->hdr.pkt_cmd; + CONTROL_PACKET(pkt); + + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + return control_send(control, send_io); +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +static int control_chk_vnic_rsp_values(struct control *control, + u16 *num_addrs, + u8 num_data_paths, + u8 num_lan_switches, + u32 *features) +{ + + struct control_config *config = control->config; + + if ((control->maj_ver > VNIC_MAJORVERSION) + || ((control->maj_ver == VNIC_MAJORVERSION) + && (control->min_ver > VNIC_MINORVERSION))) { + CONTROL_ERROR("%s: unsupported version\n", + control_ifcfg_name(control)); + goto failure; + } + if 
(num_data_paths != 1) { + CONTROL_ERROR("%s: EIOC returned too many datapaths\n", + control_ifcfg_name(control)); + goto failure; + } + if (*num_addrs > config->max_address_entries) { + CONTROL_ERROR("%s: EIOC returned more address" + " entries than requested\n", + control_ifcfg_name(control)); + goto failure; + } + if (*num_addrs < config->min_address_entries) { + CONTROL_ERROR("%s: not enough address entries\n", + control_ifcfg_name(control)); + goto failure; + } + if (num_lan_switches < 1) { + CONTROL_ERROR("%s: EIOC returned no lan switches\n", + control_ifcfg_name(control)); + goto failure; + } + if (num_lan_switches > 1) { + CONTROL_ERROR("%s: EIOC returned multiple lan switches\n", + control_ifcfg_name(control)); + goto failure; + } + CONTROL_ERROR("%s checking features %x ib_multicast:%d\n", + control_ifcfg_name(control), + *features, config->ib_multicast); + if ((*features & VNIC_FEAT_INBOUND_IB_MC) && !config->ib_multicast) { + /* disable multicast if it is not on in the cfg file, or + if we turned it off because join failed */ + *features &= ~VNIC_FEAT_INBOUND_IB_MC; + } + + return 0; +failure: + return -1; +} + +int control_init_vnic_rsp(struct control *control, u32 *features, + u8 *mac_address, u16 *num_addrs, u16 *vlan) +{ + u8 num_data_paths; + u8 num_lan_switches; + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_init_vnic_rsp *init_vnic_rsp; + + + CONTROL_FUNCTION("%s: control_init_vnic_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_INIT_VNIC) + goto failure; + + init_vnic_rsp = &pkt->cmd.init_vnic_rsp; + control->maj_ver = be16_to_cpu(init_vnic_rsp->vnic_major_version); + control->min_ver = be16_to_cpu(init_vnic_rsp->vnic_minor_version); + num_data_paths = init_vnic_rsp->num_data_paths; + num_lan_switches = init_vnic_rsp->num_lan_switches; + *features = be32_to_cpu(init_vnic_rsp->features_supported); + *num_addrs = be16_to_cpu(init_vnic_rsp->num_address_entries); + + if (control_chk_vnic_rsp_values(control, num_addrs, + num_data_paths, + num_lan_switches, + features)) + goto failure; + + control->lan_switch.lan_switch_num = + init_vnic_rsp->lan_switch[0].lan_switch_num; + control->lan_switch.num_enet_ports = + init_vnic_rsp->lan_switch[0].num_enet_ports; + control->lan_switch.default_vlan = + init_vnic_rsp->lan_switch[0].default_vlan; + *vlan = be16_to_cpu(control->lan_switch.default_vlan); + memcpy(control->lan_switch.hw_mac_address, + init_vnic_rsp->lan_switch[0].hw_mac_address, ETH_ALEN); + memcpy(mac_address, init_vnic_rsp->lan_switch[0].hw_mac_address, + ETH_ALEN); + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +static void copy_recv_pool_config(struct vnic_recv_pool_config *src, + struct vnic_recv_pool_config *dst) +{ + dst->size_recv_pool_entry = src->size_recv_pool_entry; + dst->num_recv_pool_entries = src->num_recv_pool_entries; + dst->timeout_before_kick = src->timeout_before_kick; + dst->num_recv_pool_entries_before_kick = + src->num_recv_pool_entries_before_kick; + 
dst->num_recv_pool_bytes_before_kick =
+		src->num_recv_pool_bytes_before_kick;
+	dst->free_recv_pool_entries_per_update =
+		src->free_recv_pool_entries_per_update;
+}
+
+static int check_recv_pool_config_value(__be32 *src, __be32 *dst,
+					__be32 *max, __be32 *min,
+					char *name)
+{
+	u32 value;
+
+	value = be32_to_cpu(*src);
+	if (value > be32_to_cpu(*max)) {
+		CONTROL_ERROR("value %s too large\n", name);
+		return -1;
+	} else if (value < be32_to_cpu(*min)) {
+		CONTROL_ERROR("value %s too small\n", name);
+		return -1;
+	}
+
+	*dst = cpu_to_be32(value);
+	return 0;
+}
+
+static int check_recv_pool_config(struct vnic_recv_pool_config *src,
+				  struct vnic_recv_pool_config *dst,
+				  struct vnic_recv_pool_config *max,
+				  struct vnic_recv_pool_config *min)
+{
+	if (check_recv_pool_config_value(&src->size_recv_pool_entry,
+					 &dst->size_recv_pool_entry,
+					 &max->size_recv_pool_entry,
+					 &min->size_recv_pool_entry,
+					 "size_recv_pool_entry")
+	    || check_recv_pool_config_value(&src->num_recv_pool_entries,
+					    &dst->num_recv_pool_entries,
+					    &max->num_recv_pool_entries,
+					    &min->num_recv_pool_entries,
+					    "num_recv_pool_entries")
+	    || check_recv_pool_config_value(&src->timeout_before_kick,
+					    &dst->timeout_before_kick,
+					    &max->timeout_before_kick,
+					    &min->timeout_before_kick,
+					    "timeout_before_kick")
+	    || check_recv_pool_config_value(&src->num_recv_pool_entries_before_kick,
+					    &dst->num_recv_pool_entries_before_kick,
+					    &max->num_recv_pool_entries_before_kick,
+					    &min->num_recv_pool_entries_before_kick,
+					    "num_recv_pool_entries_before_kick")
+	    || check_recv_pool_config_value(&src->num_recv_pool_bytes_before_kick,
+					    &dst->num_recv_pool_bytes_before_kick,
+					    &max->num_recv_pool_bytes_before_kick,
+					    &min->num_recv_pool_bytes_before_kick,
+					    "num_recv_pool_bytes_before_kick")
+	    || check_recv_pool_config_value(&src->free_recv_pool_entries_per_update,
+					    &dst->free_recv_pool_entries_per_update,
+					    &max->free_recv_pool_entries_per_update,
+					    &min->free_recv_pool_entries_per_update,
+					    "free_recv_pool_entries_per_update"))
+		goto failure;
+
+	if (!is_power_of_2(be32_to_cpu(dst->num_recv_pool_entries))) {
+		CONTROL_ERROR("num_recv_pool_entries (%d)"
+			      " must be a power of 2\n",
+			      be32_to_cpu(dst->num_recv_pool_entries));
+		goto failure;
+	}
+
+	if (!is_power_of_2(be32_to_cpu(dst->free_recv_pool_entries_per_update))) {
+		CONTROL_ERROR("free_recv_pool_entries_per_update (%d)"
+			      " must be a power of 2\n",
+			      be32_to_cpu(dst->free_recv_pool_entries_per_update));
+		goto failure;
+	}
+
+	if (be32_to_cpu(dst->free_recv_pool_entries_per_update) >=
+	    be32_to_cpu(dst->num_recv_pool_entries)) {
+		CONTROL_ERROR("free_recv_pool_entries_per_update (%d) must"
+			      " be less than num_recv_pool_entries (%d)\n",
+			      be32_to_cpu(dst->free_recv_pool_entries_per_update),
+			      be32_to_cpu(dst->num_recv_pool_entries));
+		goto failure;
+	}
+
+	if (be32_to_cpu(dst->num_recv_pool_entries_before_kick) >=
+	    be32_to_cpu(dst->num_recv_pool_entries)) {
+		CONTROL_ERROR("num_recv_pool_entries_before_kick (%d) must"
+			      " be less than num_recv_pool_entries (%d)\n",
+			      be32_to_cpu(dst->num_recv_pool_entries_before_kick),
+			      be32_to_cpu(dst->num_recv_pool_entries));
+		goto failure;
+	}
+
+	return 0;
+failure:
+	return -1;
+}
+
+int control_config_data_path_req(struct control *control, u64 path_id,
+				 struct vnic_recv_pool_config *host,
+				 struct vnic_recv_pool_config *eioc)
+{
+	struct send_io *send_io;
+	struct vnic_control_packet *pkt;
+	struct vnic_cmd_config_data_path *config_data_path;
+
+	CONTROL_FUNCTION("%s: control_config_data_path_req()\n",
+			 control_ifcfg_name(control));
+
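+	/*
+	 * Give the CPU ownership of the send buffer while the request
+	 * is built; it is synced back to the device before posting.
+	 */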
ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_CONFIG_DATA_PATH); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_data_path = &pkt->cmd.config_data_path_req; + config_data_path->data_path = 0; + config_data_path->path_identifier = path_id; + copy_recv_pool_config(host, + &config_data_path->host_recv_pool_config); + copy_recv_pool_config(eioc, + &config_data_path->eioc_recv_pool_config); + CONTROL_PACKET(pkt); + + control->last_cmd = pkt->hdr.pkt_cmd; + + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + return control_send(control, send_io); +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_config_data_path_rsp(struct control *control, + struct vnic_recv_pool_config *host, + struct vnic_recv_pool_config *eioc, + struct vnic_recv_pool_config *max_host, + struct vnic_recv_pool_config *max_eioc, + struct vnic_recv_pool_config *min_host, + struct vnic_recv_pool_config *min_eioc) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_config_data_path *config_data_path; + + CONTROL_FUNCTION("%s: control_config_data_path_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_CONFIG_DATA_PATH) + goto failure; + + config_data_path = &pkt->cmd.config_data_path_rsp; + if (config_data_path->data_path != 0) { + CONTROL_ERROR("%s: received CMD_CONFIG_DATA_PATH response" + " for wrong data path: %u\n", + control_ifcfg_name(control), + config_data_path->data_path); + goto failure; + } + + if (check_recv_pool_config(&config_data_path-> + host_recv_pool_config, + host, max_host, min_host) + || check_recv_pool_config(&config_data_path-> + eioc_recv_pool_config, + eioc, max_eioc, min_eioc)) { + goto failure; + } + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_exchange_pools_req(struct control *control, u64 addr, u32 rkey) +{ + struct send_io *send_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_exchange_pools *exchange_pools; + + CONTROL_FUNCTION("%s: control_exchange_pools_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_EXCHANGE_POOLS); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + exchange_pools = &pkt->cmd.exchange_pools_req; + exchange_pools->data_path = 0; + exchange_pools->pool_rkey = cpu_to_be32(rkey); + exchange_pools->pool_addr = cpu_to_be64(addr); + + control->last_cmd = pkt->hdr.pkt_cmd; + + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + 
ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_exchange_pools_rsp(struct control *control, u64 *addr, + u32 *rkey) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_exchange_pools *exchange_pools; + + CONTROL_FUNCTION("%s: control_exchange_pools_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_EXCHANGE_POOLS) + goto failure; + + exchange_pools = &pkt->cmd.exchange_pools_rsp; + *rkey = be32_to_cpu(exchange_pools->pool_rkey); + *addr = be64_to_cpu(exchange_pools->pool_addr); + + if (exchange_pools->data_path != 0) { + CONTROL_ERROR("%s: received CMD_EXCHANGE_POOLS response" + " for wrong data path: %u\n", + control_ifcfg_name(control), + exchange_pools->data_path); + goto failure; + } + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_config_link_req(struct control *control, u16 flags, u16 mtu) +{ + struct send_io *send_io; + struct vnic_cmd_config_link *config_link_req; + struct vnic_control_packet *pkt; + + CONTROL_FUNCTION("%s: control_config_link_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_CONFIG_LINK); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_link_req = &pkt->cmd.config_link_req; + config_link_req->lan_switch_num = + control->lan_switch.lan_switch_num; + config_link_req->cmd_flags = VNIC_FLAG_SET_MTU; + if (flags & IFF_UP) + config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_NIC; + else + config_link_req->cmd_flags |= VNIC_FLAG_DISABLE_NIC; + if (flags & IFF_ALLMULTI) + config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_MCAST_ALL; + else + config_link_req->cmd_flags |= VNIC_FLAG_DISABLE_MCAST_ALL; + if (flags & IFF_PROMISC) { + config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_PROMISC; + /* the EIOU doesn't really do PROMISC mode. + * if PROMISC is set, it only receives unicast packets + * I also have to set MCAST_ALL if I want real + * PROMISC mode. 
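+		 * That is what the two lines below do: clear
+		 * DISABLE_MCAST_ALL and force ENABLE_MCAST_ALL whenever
+		 * promiscuous mode is requested.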
+ */ + config_link_req->cmd_flags &= ~VNIC_FLAG_DISABLE_MCAST_ALL; + config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_MCAST_ALL; + } else + config_link_req->cmd_flags |= VNIC_FLAG_DISABLE_PROMISC; + + config_link_req->mtu_size = cpu_to_be16(mtu); + + control->last_cmd = pkt->hdr.pkt_cmd; + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_config_link_rsp(struct control *control, u16 *flags, u16 *mtu) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_config_link *config_link_rsp; + + CONTROL_FUNCTION("%s: control_config_link_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_CONFIG_LINK) + goto failure; + config_link_rsp = &pkt->cmd.config_link_rsp; + if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_NIC) + *flags |= IFF_UP; + if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_MCAST_ALL) + *flags |= IFF_ALLMULTI; + if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_PROMISC) + *flags |= IFF_PROMISC; + + *mtu = be16_to_cpu(config_link_rsp->mtu_size); + + if (control->parent->features_supported & VNIC_FEAT_INBOUND_IB_MC) { + /* featuresSupported might include INBOUND_IB_MC but + MTU might cause it to be auto-disabled at embedded */ + if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_MCAST_ALL) { + union ib_gid mgid = config_link_rsp->allmulti_mgid; + if (mgid.raw[0] != 0xff) { + CONTROL_ERROR("%s: invalid formatprefix " + VNIC_GID_FMT "\n", + control_ifcfg_name(control), + VNIC_GID_RAW_ARG(mgid.raw)); + } else { + /* rather than issuing join here, which might + * arrive at SM before EVIC creates the MC + * group, postpone it. + */ + vnic_mc_join_setup(control->parent, &mgid); + CONTROL_ERROR("join setup for ALL_MULTI\n"); + } + } + /* we don't want to leave mcast group if MCAST_ALL is disabled + * because there are no doubt multicast addresses set and we + * want to stay joined so we can get that traffic via the + * mcast group. 
+ */ + } + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +/* control_config_addrs_req: + * return values: + * -1: failure + * 0: incomplete (successful operation, but more address + * table entries to be updated) + * 1: complete + */ +int control_config_addrs_req(struct control *control, + struct vnic_address_op2 *addrs, u16 num) +{ + u16 i; + u8 j; + int ret = 1; + struct send_io *send_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_config_addresses *config_addrs_req; + struct vnic_cmd_config_addresses2 *config_addrs_req2; + + CONTROL_FUNCTION("%s: control_config_addrs_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + if (control->parent->features_supported & VNIC_FEAT_INBOUND_IB_MC) { + CONTROL_INFO("Sending CMD_CONFIG_ADDRESSES2 %lx MAX:%d " + "sizes:%d %d(off:%d) sizes2:%d %d %d" + "(off:%d - %d %d %d %d %d %d %d)\n", jiffies, + (int)MAX_CONFIG_ADDR_ENTRIES2, + (int)sizeof(struct vnic_cmd_config_addresses), + (int)sizeof(struct vnic_address_op), + (int)offsetof(struct vnic_cmd_config_addresses, + list_address_ops), + (int)sizeof(struct vnic_cmd_config_addresses2), + (int)sizeof(struct vnic_address_op2), + (int)sizeof(union ib_gid), + (int)offsetof(struct vnic_cmd_config_addresses2, + list_address_ops), + (int)offsetof(struct vnic_address_op2, index), + (int)offsetof(struct vnic_address_op2, operation), + (int)offsetof(struct vnic_address_op2, valid), + (int)offsetof(struct vnic_address_op2, address), + (int)offsetof(struct vnic_address_op2, vlan), + (int)offsetof(struct vnic_address_op2, reserved), + (int)offsetof(struct vnic_address_op2, mgid) + ); + send_io = control_init_hdr(control, CMD_CONFIG_ADDRESSES2); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_addrs_req2 = &pkt->cmd.config_addresses_req2; + memset(pkt->cmd.cmd_data, 0, VNIC_MAX_CONTROLDATASZ); + config_addrs_req2->lan_switch_num = + control->lan_switch.lan_switch_num; + for (i = 0, j = 0; (i < num) && (j < MAX_CONFIG_ADDR_ENTRIES2); i++) { + if (!addrs[i].operation) + continue; + config_addrs_req2->list_address_ops[j].index = + cpu_to_be16(i); + config_addrs_req2->list_address_ops[j].operation = + VNIC_OP_SET_ENTRY; + config_addrs_req2->list_address_ops[j].valid = + addrs[i].valid; + memcpy(config_addrs_req2->list_address_ops[j].address, + addrs[i].address, ETH_ALEN); + config_addrs_req2->list_address_ops[j].vlan = + addrs[i].vlan; + addrs[i].operation = 0; + CONTROL_INFO("%s i=%d " + "addr[%d]=%02x:%02x:%02x:%02x:%02x:%02x " + "valid:%d\n", control_ifcfg_name(control), i, j, + addrs[i].address[0], addrs[i].address[1], + addrs[i].address[2], addrs[i].address[3], + addrs[i].address[4], addrs[i].address[5], + addrs[i].valid); + j++; + } + config_addrs_req2->num_address_ops = j; + } else { + send_io = control_init_hdr(control, CMD_CONFIG_ADDRESSES); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_addrs_req = &pkt->cmd.config_addresses_req; + config_addrs_req->lan_switch_num = + control->lan_switch.lan_switch_num; + for (i = 0, j = 0; (i < num) && (j < 16); i++) { + if (!addrs[i].operation) + continue; + 
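+			/*
+			 * Copy the pending table entry into the request
+			 * and clear its pending-operation flag.
+			 */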
config_addrs_req->list_address_ops[j].index = + cpu_to_be16(i); + config_addrs_req->list_address_ops[j].operation = + VNIC_OP_SET_ENTRY; + config_addrs_req->list_address_ops[j].valid = + addrs[i].valid; + memcpy(config_addrs_req->list_address_ops[j].address, + addrs[i].address, ETH_ALEN); + config_addrs_req->list_address_ops[j].vlan = + addrs[i].vlan; + addrs[i].operation = 0; + j++; + } + config_addrs_req->num_address_ops = j; + } + for (; i < num; i++) { + if (addrs[i].operation) { + ret = 0; + break; + } + } + + control->last_cmd = pkt->hdr.pkt_cmd; + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + if (control_send(control, send_io)) + return -1; + return ret; +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +static int process_cmd_config_address2_rsp(struct control *control, + struct vnic_control_packet *pkt, + struct recv_io *recv_io) +{ + struct vnic_cmd_config_addresses2 *config_addrs_rsp2; + int idx, mcaddrs, nomgid; + union ib_gid mgid, rsp_mgid; + + config_addrs_rsp2 = &pkt->cmd.config_addresses_rsp2; + CONTROL_INFO("%s rsp to CONFIG_ADDRESSES2\n", + control_ifcfg_name(control)); + + for (idx = 0, mcaddrs = 0, nomgid = 1; + idx < config_addrs_rsp2->num_address_ops; + idx++) { + if (!config_addrs_rsp2->list_address_ops[idx].valid) + continue; + + /* check if address is multicasts */ + if (!vnic_multicast_address(config_addrs_rsp2, idx)) + continue; + + mcaddrs++; + mgid = config_addrs_rsp2->list_address_ops[idx].mgid; + CONTROL_INFO("%s: got mgid " VNIC_GID_FMT + " MCAST_MSG_SIZE:%d mtu:%d\n", + control_ifcfg_name(control), + VNIC_GID_RAW_ARG(mgid.raw), + (int)MCAST_MSG_SIZE, + control->parent->mtu); + + /* Embedded should have turned off multicast + * due to large MTU size; mgid had better be 0. + */ + if (control->parent->mtu > MCAST_MSG_SIZE) { + if ((mgid.global.subnet_prefix != 0) || + (mgid.global.interface_id != 0)) { + CONTROL_ERROR("%s: invalid mgid; " + "expected 0 " + VNIC_GID_FMT "\n", + control_ifcfg_name(control), + VNIC_GID_RAW_ARG(mgid.raw)); + } + continue; + } + if (mgid.raw[0] != 0xff) { + CONTROL_ERROR("%s: invalid formatprefix " + VNIC_GID_FMT "\n", + control_ifcfg_name(control), + VNIC_GID_RAW_ARG(mgid.raw)); + continue; + } + nomgid = 0; /* got a valid mgid */ + + /* let's verify that all the mgids match this one */ + for (; idx < config_addrs_rsp2->num_address_ops; idx++) { + if (!config_addrs_rsp2->list_address_ops[idx].valid) + continue; + + /* check if address is multicasts */ + if (!vnic_multicast_address(config_addrs_rsp2, idx)) + continue; + + rsp_mgid = config_addrs_rsp2->list_address_ops[idx].mgid; + if (memcmp(&mgid, &rsp_mgid, sizeof(union ib_gid)) == 0) + continue; + + CONTROL_ERROR("%s: Multicast Group MGIDs not " + "unique; mgids: " VNIC_GID_FMT + " " VNIC_GID_FMT "\n", + control_ifcfg_name(control), + VNIC_GID_RAW_ARG(mgid.raw), + VNIC_GID_RAW_ARG(rsp_mgid.raw)); + return 1; + } + + /* rather than issuing join here, which might arrive + * at SM before EVIC creates the MC group, postpone it. + */ + vnic_mc_join_setup(control->parent, &mgid); + + /* there is only one multicast group to join, so we're done. */ + break; + } + + /* we sent atleast one multicast address but got no MGID + * back so, if it is not allmulti case, leave the group + * we joined before. 
(for allmulti case we have to stay + * joined) + */ + if ((config_addrs_rsp2->num_address_ops > 0) && (mcaddrs > 0) && + nomgid && !(control->parent->flags & IFF_ALLMULTI)) { + CONTROL_INFO("numaddrops:%d mcadrs:%d nomgid:%d\n", + config_addrs_rsp2->num_address_ops, + mcaddrs > 0, nomgid); + + vnic_mc_leave(control->parent); + } + + return 0; +} + +int control_config_addrs_rsp(struct control *control) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + + CONTROL_FUNCTION("%s: control_config_addrs_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if ((pkt->hdr.pkt_cmd != CMD_CONFIG_ADDRESSES) && + (pkt->hdr.pkt_cmd != CMD_CONFIG_ADDRESSES2)) + goto failure; + + if (((pkt->hdr.pkt_cmd == CMD_CONFIG_ADDRESSES2) && + !control->parent->features_supported & VNIC_FEAT_INBOUND_IB_MC) || + ((pkt->hdr.pkt_cmd == CMD_CONFIG_ADDRESSES) && + control->parent->features_supported & VNIC_FEAT_INBOUND_IB_MC)) { + CONTROL_ERROR("%s unexpected response pktCmd:%d flag:%x\n", + control_ifcfg_name(control), pkt->hdr.pkt_cmd, + control->parent->features_supported & + VNIC_FEAT_INBOUND_IB_MC); + goto failure; + } + + if (pkt->hdr.pkt_cmd == CMD_CONFIG_ADDRESSES2) { + if (process_cmd_config_address2_rsp(control, pkt, recv_io)) + goto failure; + } else { + struct vnic_cmd_config_addresses *config_addrs_rsp; + config_addrs_rsp = &pkt->cmd.config_addresses_rsp; + } + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_report_statistics_req(struct control *control) +{ + struct send_io *send_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_report_stats_req *report_statistics_req; + + CONTROL_FUNCTION("%s: control_report_statistics_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_REPORT_STATISTICS); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + report_statistics_req = &pkt->cmd.report_statistics_req; + report_statistics_req->lan_switch_num = + control->lan_switch.lan_switch_num; + + control->last_cmd = pkt->hdr.pkt_cmd; + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_report_statistics_rsp(struct control *control, + struct vnic_cmd_report_stats_rsp *stats) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_report_stats_rsp *rep_stat_rsp; + + CONTROL_FUNCTION("%s: control_report_statistics_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != 
CMD_REPORT_STATISTICS) + goto failure; + + rep_stat_rsp = &pkt->cmd.report_statistics_rsp; + + stats->if_in_broadcast_pkts = rep_stat_rsp->if_in_broadcast_pkts; + stats->if_in_multicast_pkts = rep_stat_rsp->if_in_multicast_pkts; + stats->if_in_octets = rep_stat_rsp->if_in_octets; + stats->if_in_ucast_pkts = rep_stat_rsp->if_in_ucast_pkts; + stats->if_in_nucast_pkts = rep_stat_rsp->if_in_nucast_pkts; + stats->if_in_underrun = rep_stat_rsp->if_in_underrun; + stats->if_in_errors = rep_stat_rsp->if_in_errors; + stats->if_out_errors = rep_stat_rsp->if_out_errors; + stats->if_out_octets = rep_stat_rsp->if_out_octets; + stats->if_out_ucast_pkts = rep_stat_rsp->if_out_ucast_pkts; + stats->if_out_multicast_pkts = rep_stat_rsp->if_out_multicast_pkts; + stats->if_out_broadcast_pkts = rep_stat_rsp->if_out_broadcast_pkts; + stats->if_out_nucast_pkts = rep_stat_rsp->if_out_nucast_pkts; + stats->if_out_ok = rep_stat_rsp->if_out_ok; + stats->if_in_ok = rep_stat_rsp->if_in_ok; + stats->if_out_ucast_bytes = rep_stat_rsp->if_out_ucast_bytes; + stats->if_out_multicast_bytes = rep_stat_rsp->if_out_multicast_bytes; + stats->if_out_broadcast_bytes = rep_stat_rsp->if_out_broadcast_bytes; + stats->if_in_ucast_bytes = rep_stat_rsp->if_in_ucast_bytes; + stats->if_in_multicast_bytes = rep_stat_rsp->if_in_multicast_bytes; + stats->if_in_broadcast_bytes = rep_stat_rsp->if_in_broadcast_bytes; + stats->ethernet_status = rep_stat_rsp->ethernet_status; + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_reset_req(struct control *control) +{ + struct send_io *send_io; + struct vnic_control_packet *pkt; + + CONTROL_FUNCTION("%s: control_reset_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_RESET); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + + control->last_cmd = pkt->hdr.pkt_cmd; + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_reset_rsp(struct control *control) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + + CONTROL_FUNCTION("%s: control_reset_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_RESET) + goto failure; + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_heartbeat_req(struct control *control, u32 hb_interval) +{ + struct send_io *send_io; + struct 
vnic_control_packet *pkt; + struct vnic_cmd_heartbeat *heartbeat_req; + + CONTROL_FUNCTION("%s: control_heartbeat_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_HEARTBEAT); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + heartbeat_req = &pkt->cmd.heartbeat_req; + heartbeat_req->hb_interval = cpu_to_be32(hb_interval); + + control->last_cmd = pkt->hdr.pkt_cmd; + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_heartbeat_rsp(struct control *control) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_heartbeat *heartbeat_rsp; + + CONTROL_FUNCTION("%s: control_heartbeat_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_HEARTBEAT) + goto failure; + + heartbeat_rsp = &pkt->cmd.heartbeat_rsp; + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +static int control_init_recv_ios(struct control *control, + struct viport *viport, + struct vnic_control_packet *pkt) +{ + struct io *io; + struct ib_device *ibdev = viport->config->ibdev; + struct control_config *config = control->config; + dma_addr_t recv_dma; + unsigned int i; + + + control->recv_len = sizeof *pkt * config->num_recvs; + control->recv_dma = ib_dma_map_single(ibdev, + pkt, control->recv_len, + DMA_FROM_DEVICE); + + if (ib_dma_mapping_error(ibdev, control->recv_dma)) { + CONTROL_ERROR("control recv dma map error\n"); + goto failure; + } + + recv_dma = control->recv_dma; + for (i = 0; i < config->num_recvs; i++) { + io = &control->recv_ios[i].io; + io->viport = viport; + io->routine = control_recv_complete; + io->type = RECV; + + control->recv_ios[i].virtual_addr = (u8 *)pkt; + control->recv_ios[i].list.addr = recv_dma; + control->recv_ios[i].list.length = sizeof *pkt; + control->recv_ios[i].list.lkey = control->mr->lkey; + + recv_dma = recv_dma + sizeof *pkt; + pkt++; + + io->rwr.wr_id = (u64)io; + io->rwr.sg_list = &control->recv_ios[i].list; + io->rwr.num_sge = 1; + if (vnic_ib_post_recv(&control->ib_conn, io)) + goto unmap_recv; + } + + return 0; +unmap_recv: + ib_dma_unmap_single(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); +failure: + return -1; +} + +static int control_init_send_ios(struct control *control, + struct viport *viport, + struct vnic_control_packet *pkt) +{ + struct io *io; + struct ib_device *ibdev = viport->config->ibdev; + + control->send_io.virtual_addr = (u8 *)pkt; + control->send_len = sizeof *pkt; + control->send_dma = ib_dma_map_single(ibdev, pkt, + control->send_len, + DMA_TO_DEVICE); + if (ib_dma_mapping_error(ibdev, control->send_dma)) { + 
CONTROL_ERROR("control send dma map error\n"); + goto failure; + } + + io = &control->send_io.io; + io->viport = viport; + io->routine = control_send_complete; + + control->send_io.list.addr = control->send_dma; + control->send_io.list.length = sizeof *pkt; + control->send_io.list.lkey = control->mr->lkey; + + io->swr.wr_id = (u64)io; + io->swr.sg_list = &control->send_io.list; + io->swr.num_sge = 1; + io->swr.opcode = IB_WR_SEND; + io->swr.send_flags = IB_SEND_SIGNALED; + io->type = SEND; + + return 0; +failure: + return -1; +} + +int control_init(struct control *control, struct viport *viport, + struct control_config *config, struct ib_pd *pd) +{ + struct vnic_control_packet *pkt; + unsigned int sz; + + CONTROL_FUNCTION("%s: control_init()\n", + control_ifcfg_name(control)); + control->parent = viport; + control->config = config; + control->ib_conn.viport = viport; + control->ib_conn.ib_config = &config->ib_config; + control->ib_conn.state = IB_CONN_UNINITTED; + control->ib_conn.callback_thread = NULL; + control->ib_conn.callback_thread_end = 0; + control->req_state = REQ_INACTIVE; + control->last_cmd = CMD_INVALID; + control->seq_num = 0; + control->response = NULL; + control->info = NULL; + INIT_LIST_HEAD(&control->failure_list); + spin_lock_init(&control->io_lock); + + if (vnic_ib_conn_init(&control->ib_conn, viport, pd, + &config->ib_config)) { + CONTROL_ERROR("Control IB connection" + " initialization failed\n"); + goto failure; + } + + control->mr = ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(control->mr)) { + CONTROL_ERROR("%s: failed to register memory" + " for control connection\n", + control_ifcfg_name(control)); + goto destroy_conn; + } + + control->ib_conn.cm_id = ib_create_cm_id(viport->config->ibdev, + vnic_ib_cm_handler, + &control->ib_conn); + if (IS_ERR(control->ib_conn.cm_id)) { + CONTROL_ERROR("creating control CM ID failed\n"); + goto destroy_mr; + } + + sz = sizeof(struct recv_io) * config->num_recvs; + control->recv_ios = vmalloc(sz); + + if (!control->recv_ios) { + CONTROL_ERROR("%s: failed allocating space for recv ios\n", + control_ifcfg_name(control)); + goto destroy_cm_id; + } + + memset(control->recv_ios, 0, sz); + /*One send buffer and num_recvs recv buffers */ + control->local_storage = kzalloc(sizeof *pkt * + (config->num_recvs + 1), + GFP_KERNEL); + + if (!control->local_storage) { + CONTROL_ERROR("%s: failed allocating space" + " for local storage\n", + control_ifcfg_name(control)); + goto free_recv_ios; + } + + pkt = control->local_storage; + if (control_init_send_ios(control, viport, pkt)) + goto free_storage; + + pkt++; + if (control_init_recv_ios(control, viport, pkt)) + goto unmap_send; + + return 0; + +unmap_send: + ib_dma_unmap_single(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); +free_storage: + kfree(control->local_storage); +free_recv_ios: + vfree(control->recv_ios); +destroy_cm_id: + ib_destroy_cm_id(control->ib_conn.cm_id); +destroy_mr: + ib_dereg_mr(control->mr); +destroy_conn: + ib_destroy_qp(control->ib_conn.qp); + ib_destroy_cq(control->ib_conn.cq); +failure: + return -1; +} + +void control_cleanup(struct control *control) +{ + CONTROL_FUNCTION("%s: control_disconnect()\n", + control_ifcfg_name(control)); + + if (ib_send_cm_dreq(control->ib_conn.cm_id, NULL, 0)) + CONTROL_ERROR("control CM DREQ sending failed\n"); + + control->ib_conn.state = IB_CONN_DISCONNECTED; + control_timer_stop(control); + control->req_state = REQ_INACTIVE; + control->response = NULL; + control->last_cmd = 
CMD_INVALID; + completion_callback_cleanup(&control->ib_conn); + ib_destroy_cm_id(control->ib_conn.cm_id); + ib_destroy_qp(control->ib_conn.qp); + ib_destroy_cq(control->ib_conn.cq); + ib_dereg_mr(control->mr); + ib_dma_unmap_single(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + ib_dma_unmap_single(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + vfree(control->recv_ios); + kfree(control->local_storage); + +} + +static void control_log_report_status_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_REPORT_STATUS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO + " lan_switch_num = %u, is_fatal = %u\n", + pkt->cmd.report_status.lan_switch_num, + pkt->cmd.report_status.is_fatal); + printk(KERN_INFO + " status_number = %u, status_info = %u\n", + be32_to_cpu(pkt->cmd.report_status.status_number), + be32_to_cpu(pkt->cmd.report_status.status_info)); + pkt->cmd.report_status.file_name[31] = '\0'; + pkt->cmd.report_status.routine[31] = '\0'; + printk(KERN_INFO " filename = %s, routine = %s\n", + pkt->cmd.report_status.file_name, + pkt->cmd.report_status.routine); + printk(KERN_INFO + " line_num = %u, error_parameter = %u\n", + be32_to_cpu(pkt->cmd.report_status.line_num), + be32_to_cpu(pkt->cmd.report_status.error_parameter)); + pkt->cmd.report_status.desc_text[127] = '\0'; + printk(KERN_INFO " desc_text = %s\n", + pkt->cmd.report_status.desc_text); +} + +static void control_log_report_stats_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_REPORT_STATISTICS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " lan_switch_num = %u\n", + pkt->cmd.report_statistics_req.lan_switch_num); + if (pkt->hdr.pkt_type == TYPE_REQ) + return; + printk(KERN_INFO " if_in_broadcast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_broadcast_pkts)); + printk(" if_in_multicast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_multicast_pkts)); + printk(KERN_INFO " if_in_octets = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_octets)); + printk(" if_in_ucast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_ucast_pkts)); + printk(KERN_INFO " if_in_nucast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_nucast_pkts)); + printk(" if_in_underrun = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_underrun)); + printk(KERN_INFO " if_in_errors = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_errors)); + printk(" if_out_errors = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_errors)); + printk(KERN_INFO " if_out_octets = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_octets)); + printk(" if_out_ucast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_ucast_pkts)); + printk(KERN_INFO " if_out_multicast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_multicast_pkts)); + printk(" if_out_broadcast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_broadcast_pkts)); + printk(KERN_INFO " if_out_nucast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. 
+ if_out_nucast_pkts)); + printk(" if_out_ok = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp.if_out_ok)); + printk(KERN_INFO " if_in_ok = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp.if_in_ok)); + printk(" if_out_ucast_bytes = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_ucast_bytes)); + printk(KERN_INFO " if_out_multicast_bytes = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_multicast_bytes)); + printk(" if_out_broadcast_bytes = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_broadcast_bytes)); + printk(KERN_INFO " if_in_ucast_bytes = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_ucast_bytes)); + printk(" if_in_multicast_bytes = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_multicast_bytes)); + printk(KERN_INFO " if_in_broadcast_bytes = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_broadcast_bytes)); + printk(" ethernet_status = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + ethernet_status)); +} + +static void control_log_config_link_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_LINK\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " cmd_flags = %x\n", + pkt->cmd.config_link_req.cmd_flags); + if (pkt->cmd.config_link_req.cmd_flags & VNIC_FLAG_ENABLE_NIC) + printk(KERN_INFO + " VNIC_FLAG_ENABLE_NIC\n"); + if (pkt->cmd.config_link_req.cmd_flags & VNIC_FLAG_DISABLE_NIC) + printk(KERN_INFO + " VNIC_FLAG_DISABLE_NIC\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_ENABLE_MCAST_ALL) + printk(KERN_INFO + " VNIC_FLAG_ENABLE_" + "MCAST_ALL\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_DISABLE_MCAST_ALL) + printk(KERN_INFO + " VNIC_FLAG_DISABLE_" + "MCAST_ALL\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_ENABLE_PROMISC) + printk(KERN_INFO + " VNIC_FLAG_ENABLE_" + "PROMISC\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_DISABLE_PROMISC) + printk(KERN_INFO + " VNIC_FLAG_DISABLE_" + "PROMISC\n"); + if (pkt->cmd.config_link_req.cmd_flags & VNIC_FLAG_SET_MTU) + printk(KERN_INFO + " VNIC_FLAG_SET_MTU\n"); + printk(KERN_INFO + " lan_switch_num = %x, mtu_size = %d\n", + pkt->cmd.config_link_req.lan_switch_num, + be16_to_cpu(pkt->cmd.config_link_req.mtu_size)); + if (pkt->hdr.pkt_type == TYPE_RSP) { + printk(KERN_INFO + " default_vlan = %u," + " hw_mac_address =" + " %02x:%02x:%02x:%02x:%02x:%02x\n", + be16_to_cpu(pkt->cmd.config_link_req. + default_vlan), + pkt->cmd.config_link_req.hw_mac_address[0], + pkt->cmd.config_link_req.hw_mac_address[1], + pkt->cmd.config_link_req.hw_mac_address[2], + pkt->cmd.config_link_req.hw_mac_address[3], + pkt->cmd.config_link_req.hw_mac_address[4], + pkt->cmd.config_link_req.hw_mac_address[5]); + } +} + +static void print_config_addr(struct vnic_address_op *list, + int num_address_ops, size_t mgidoff) +{ + int i = 0; + + while (i < num_address_ops && i < 16) { + printk(KERN_INFO " list_address_ops[%u].index" + " = %u\n", i, be16_to_cpu(list->index)); + switch (list->operation) { + case VNIC_OP_GET_ENTRY: + printk(KERN_INFO " list_address_ops[%u]." + "operation = VNIC_OP_GET_ENTRY\n", i); + break; + case VNIC_OP_SET_ENTRY: + printk(KERN_INFO " list_address_ops[%u]." + "operation = VNIC_OP_SET_ENTRY\n", i); + break; + default: + printk(KERN_INFO " list_address_ops[%u]." 
+ "operation = UNKNOWN(%d)\n", i, + list->operation); + break; + } + printk(KERN_INFO " list_address_ops[%u].valid" + " = %u\n", i, list->valid); + printk(KERN_INFO " list_address_ops[%u].address" + " = %02x:%02x:%02x:%02x:%02x:%02x\n", i, + list->address[0], list->address[1], + list->address[2], list->address[3], + list->address[4], list->address[5]); + printk(KERN_INFO " list_address_ops[%u].vlan" + " = %u\n", i, be16_to_cpu(list->vlan)); + if (mgidoff) { + printk(KERN_INFO + " list_address_ops[%u].mgid" + " = " VNIC_GID_FMT "\n", i, + VNIC_GID_RAW_ARG((char *)list + mgidoff)); + list = (struct vnic_address_op *) + ((char *)list + sizeof(struct vnic_address_op2)); + } else + list = (struct vnic_address_op *) + ((char *)list + sizeof(struct vnic_address_op)); + i++; + } +} + +static void control_log_config_addrs_pkt(struct vnic_control_packet *pkt, + u8 addresses2) +{ + struct vnic_address_op *list; + int no_address_ops; + + if (addresses2) + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_ADDRESSES2\n"); + else + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_ADDRESSES\n"); + printk(KERN_INFO " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, pkt->hdr.pkt_retry_count); + if (addresses2) { + printk(KERN_INFO " num_address_ops = %x," + " lan_switch_num = %d\n", + pkt->cmd.config_addresses_req2.num_address_ops, + pkt->cmd.config_addresses_req2.lan_switch_num); + list = (struct vnic_address_op *) + pkt->cmd.config_addresses_req2.list_address_ops; + no_address_ops = pkt->cmd.config_addresses_req2.num_address_ops; + print_config_addr(list, no_address_ops, + offsetof(struct vnic_address_op2, mgid)); + } else { + printk(KERN_INFO " num_address_ops = %x," + " lan_switch_num = %d\n", + pkt->cmd.config_addresses_req.num_address_ops, + pkt->cmd.config_addresses_req.lan_switch_num); + list = pkt->cmd.config_addresses_req.list_address_ops; + no_address_ops = pkt->cmd.config_addresses_req.num_address_ops; + print_config_addr(list, no_address_ops, 0); + } +} + +static void control_log_exch_pools_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_EXCHANGE_POOLS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " datapath = %u\n", + pkt->cmd.exchange_pools_req.data_path); + printk(KERN_INFO " pool_rkey = %08x" + " pool_addr = %llx\n", + be32_to_cpu(pkt->cmd.exchange_pools_req.pool_rkey), + be64_to_cpu(pkt->cmd.exchange_pools_req.pool_addr)); +} + +static void control_log_data_path_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_DATA_PATH\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " path_identifier = %llx," + " data_path = %u\n", + pkt->cmd.config_data_path_req.path_identifier, + pkt->cmd.config_data_path_req.data_path); + printk(KERN_INFO + "host config size_recv_pool_entry = %u," + " num_recv_pool_entries = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config.size_recv_pool_entry), + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config.num_recv_pool_entries)); + printk(KERN_INFO + " timeout_before_kick = %u," + " num_recv_pool_entries_before_kick = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config.timeout_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config. 
+ num_recv_pool_entries_before_kick)); + printk(KERN_INFO + " num_recv_pool_bytes_before_kick = %u," + " free_recv_pool_entries_per_update = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config. + num_recv_pool_bytes_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config. + free_recv_pool_entries_per_update)); + printk(KERN_INFO + "eioc config size_recv_pool_entry = %u," + " num_recv_pool_entries = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config.size_recv_pool_entry), + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config.num_recv_pool_entries)); + printk(KERN_INFO + " timeout_before_kick = %u," + " num_recv_pool_entries_before_kick = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config.timeout_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config. + num_recv_pool_entries_before_kick)); + printk(KERN_INFO + " num_recv_pool_bytes_before_kick = %u," + " free_recv_pool_entries_per_update = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config. + num_recv_pool_bytes_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config. + free_recv_pool_entries_per_update)); +} + +static void control_log_init_vnic_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_INIT_VNIC\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO + " vnic_major_version = %u," + " vnic_minor_version = %u\n", + be16_to_cpu(pkt->cmd.init_vnic_req.vnic_major_version), + be16_to_cpu(pkt->cmd.init_vnic_req.vnic_minor_version)); + if (pkt->hdr.pkt_type == TYPE_REQ) { + printk(KERN_INFO + " vnic_instance = %u," + " num_data_paths = %u\n", + pkt->cmd.init_vnic_req.vnic_instance, + pkt->cmd.init_vnic_req.num_data_paths); + printk(KERN_INFO + " num_address_entries = %u\n", + be16_to_cpu(pkt->cmd.init_vnic_req. + num_address_entries)); + } else { + printk(KERN_INFO + " num_lan_switches = %u," + " num_data_paths = %u\n", + pkt->cmd.init_vnic_rsp.num_lan_switches, + pkt->cmd.init_vnic_rsp.num_data_paths); + printk(KERN_INFO + " num_address_entries = %u," + " features_supported = %08x\n", + be16_to_cpu(pkt->cmd.init_vnic_rsp. + num_address_entries), + be32_to_cpu(pkt->cmd.init_vnic_rsp. + features_supported)); + if (pkt->cmd.init_vnic_rsp.num_lan_switches != 0) { + printk(KERN_INFO + "lan_switch[0] lan_switch_num = %u," + " num_enet_ports = %08x\n", + pkt->cmd.init_vnic_rsp. + lan_switch[0].lan_switch_num, + pkt->cmd.init_vnic_rsp. + lan_switch[0].num_enet_ports); + printk(KERN_INFO + " default_vlan = %u," + " hw_mac_address =" + " %02x:%02x:%02x:%02x:%02x:%02x\n", + be16_to_cpu(pkt->cmd.init_vnic_rsp. + lan_switch[0].default_vlan), + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[0], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[1], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[2], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[3], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[4], + pkt->cmd.init_vnic_rsp.lan_switch[0]. 
+ hw_mac_address[5]); + } + } +} + +static void control_log_control_packet(struct vnic_control_packet *pkt) +{ + switch (pkt->hdr.pkt_type) { + case TYPE_INFO: + printk(KERN_INFO "control_packet: pkt_type = TYPE_INFO\n"); + break; + case TYPE_REQ: + printk(KERN_INFO "control_packet: pkt_type = TYPE_REQ\n"); + break; + case TYPE_RSP: + printk(KERN_INFO "control_packet: pkt_type = TYPE_RSP\n"); + break; + case TYPE_ERR: + printk(KERN_INFO "control_packet: pkt_type = TYPE_ERR\n"); + break; + default: + printk(KERN_INFO "control_packet: pkt_type = UNKNOWN\n"); + } + + switch (pkt->hdr.pkt_cmd) { + case CMD_INIT_VNIC: + control_log_init_vnic_pkt(pkt); + break; + case CMD_CONFIG_DATA_PATH: + control_log_data_path_pkt(pkt); + break; + case CMD_EXCHANGE_POOLS: + control_log_exch_pools_pkt(pkt); + break; + case CMD_CONFIG_ADDRESSES: + control_log_config_addrs_pkt(pkt, 0); + break; + case CMD_CONFIG_ADDRESSES2: + control_log_config_addrs_pkt(pkt, 1); + break; + case CMD_CONFIG_LINK: + control_log_config_link_pkt(pkt); + break; + case CMD_REPORT_STATISTICS: + control_log_report_stats_pkt(pkt); + break; + case CMD_CLEAR_STATISTICS: + printk(KERN_INFO + " pkt_cmd = CMD_CLEAR_STATISTICS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + break; + case CMD_REPORT_STATUS: + control_log_report_status_pkt(pkt); + + break; + case CMD_RESET: + printk(KERN_INFO + " pkt_cmd = CMD_RESET\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + break; + case CMD_HEARTBEAT: + printk(KERN_INFO + " pkt_cmd = CMD_HEARTBEAT\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " hb_interval = %d\n", + be32_to_cpu(pkt->cmd.heartbeat_req.hb_interval)); + break; + default: + printk(KERN_INFO + " pkt_cmd = UNKNOWN (%u)\n", + pkt->hdr.pkt_cmd); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + break; + } +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_control.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_control.h new file mode 100644 index 0000000..57fab67 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_control.h @@ -0,0 +1,179 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_CONTROL_H_INCLUDED +#define VNIC_CONTROL_H_INCLUDED + +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS +#include +#include +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ + +#include "vnic_ib.h" +#include "vnic_control_pkt.h" + +enum control_timer_state { + TIMER_IDLE = 0, + TIMER_ACTIVE = 1, + TIMER_EXPIRED = 2 +}; + +enum control_request_state { + REQ_INACTIVE, /* quiet state, all previous operations done + * response is NULL + * last_cmd = CMD_INVALID + * timer_state = IDLE + */ + REQ_POSTED, /* REQ put on send Q + * response is NULL + * last_cmd = command issued + * timer_state = ACTIVE + */ + REQ_SENT, /* Send completed for REQ + * response is NULL + * last_cmd = command issued + * timer_state = ACTIVE + */ + RSP_RECEIVED, /* Received Resp, but no Send completion yet + * response is response buffer received + * last_cmd = command issued + * timer_state = ACTIVE + */ + REQ_COMPLETED, /* all processing for REQ completed, ready to be gotten + * response is response buffer received + * last_cmd = command issued + * timer_state = ACTIVE + */ + REQ_FAILED, /* processing of REQ/RSP failed. + * response is NULL + * last_cmd = CMD_INVALID + * timer_state = IDLE or EXPIRED + * viport has been moved to error state to force + * recovery + */ +}; + +struct control { + struct viport *parent; + struct control_config *config; + struct ib_mr *mr; + struct vnic_ib_conn ib_conn; + struct vnic_control_packet *local_storage; + int send_len; + int recv_len; + u16 maj_ver; + u16 min_ver; + struct vnic_lan_switch_attribs lan_switch; + struct send_io send_io; + struct recv_io *recv_ios; + dma_addr_t send_dma; + dma_addr_t recv_dma; + enum control_timer_state timer_state; + enum control_request_state req_state; + struct timer_list timer; + u8 seq_num; + u8 last_cmd; + struct recv_io *response; + struct recv_io *info; + struct list_head failure_list; + spinlock_t io_lock; + struct completion done; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + struct { + cycles_t request_time; /* intermediate value */ + cycles_t response_time; + u32 response_num; + cycles_t response_max; + cycles_t response_min; + u32 timeout_num; + } statistics; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ +}; + +int control_init(struct control *control, struct viport *viport, + struct control_config *config, struct ib_pd *pd); + +void control_cleanup(struct control *control); + +void control_process_async(struct control *control); + +int control_init_vnic_req(struct control *control); +int control_init_vnic_rsp(struct control *control, u32 *features, + u8 *mac_address, u16 *num_addrs, u16 *vlan); + +int control_config_data_path_req(struct control *control, u64 path_id, + struct vnic_recv_pool_config *host, + struct vnic_recv_pool_config *eioc); +int control_config_data_path_rsp(struct control *control, + struct vnic_recv_pool_config *host, + struct vnic_recv_pool_config *eioc, + struct vnic_recv_pool_config *max_host, + struct vnic_recv_pool_config *max_eioc, + struct vnic_recv_pool_config *min_host, + struct vnic_recv_pool_config *min_eioc); + +int control_exchange_pools_req(struct control *control, + u64 addr, u32 rkey); +int control_exchange_pools_rsp(struct control *control, + u64 *addr, u32 *rkey); + +int control_config_link_req(struct control *control, 
+ u16 flags, u16 mtu); +int control_config_link_rsp(struct control *control, + u16 *flags, u16 *mtu); + +int control_config_addrs_req(struct control *control, + struct vnic_address_op2 *addrs, u16 num); +int control_config_addrs_rsp(struct control *control); + +int control_report_statistics_req(struct control *control); +int control_report_statistics_rsp(struct control *control, + struct vnic_cmd_report_stats_rsp *stats); + +int control_heartbeat_req(struct control *control, u32 hb_interval); +int control_heartbeat_rsp(struct control *control); + +int control_reset_req(struct control *control); +int control_reset_rsp(struct control *control); + +#define control_packet(io) \ + (struct vnic_control_packet *)(io)->virtual_addr +#define control_is_connected(control) \ + (vnic_ib_conn_connected(&((control)->ib_conn))) + +#define control_last_req(control) control_packet(&(control)->send_io) +#define control_features(control) (control)->features_supported + +#define control_get_mac_address(control,addr) \ + memcpy(addr, (control)->lan_switch.hw_mac_address, ETH_ALEN) + +#endif /* VNIC_CONTROL_H_INCLUDED */ diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h new file mode 100644 index 0000000..1fc62fb --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h @@ -0,0 +1,368 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */
+
+#ifndef VNIC_CONTROL_PKT_H_INCLUDED
+#define VNIC_CONTROL_PKT_H_INCLUDED
+
+#include <linux/if_ether.h>
+#include <rdma/ib_verbs.h>
+
+#define VNIC_MAX_NODENAME_LEN 64
+
+struct vnic_connection_data {
+	u64 path_id;
+	u8 vnic_instance;
+	u8 path_num;
+	u8 nodename[VNIC_MAX_NODENAME_LEN + 1];
+	u8 reserved;	/* for alignment */
+	__be32 features_supported;
+};
+
+struct vnic_control_header {
+	u8 pkt_type;
+	u8 pkt_cmd;
+	u8 pkt_seq_num;
+	u8 pkt_retry_count;
+	u32 reserved;	/* for 64-bit alignment */
+};
+
+/* pkt_type values */
+enum {
+	TYPE_INFO = 0,
+	TYPE_REQ = 1,
+	TYPE_RSP = 2,
+	TYPE_ERR = 3
+};
+
+/* pkt_cmd values */
+enum {
+	CMD_INVALID = 0,
+	CMD_INIT_VNIC = 1,
+	CMD_CONFIG_DATA_PATH = 2,
+	CMD_EXCHANGE_POOLS = 3,
+	CMD_CONFIG_ADDRESSES = 4,
+	CMD_CONFIG_LINK = 5,
+	CMD_REPORT_STATISTICS = 6,
+	CMD_CLEAR_STATISTICS = 7,
+	CMD_REPORT_STATUS = 8,
+	CMD_RESET = 9,
+	CMD_HEARTBEAT = 10,
+	CMD_CONFIG_ADDRESSES2 = 11,
+};
+
+/* pkt_cmd CMD_INIT_VNIC, pkt_type TYPE_REQ data format */
+struct vnic_cmd_init_vnic_req {
+	__be16 vnic_major_version;
+	__be16 vnic_minor_version;
+	u8 vnic_instance;
+	u8 num_data_paths;
+	__be16 num_address_entries;
+};
+
+/* pkt_cmd CMD_INIT_VNIC, pkt_type TYPE_RSP subdata format */
+struct vnic_lan_switch_attribs {
+	u8 lan_switch_num;
+	u8 num_enet_ports;
+	__be16 default_vlan;
+	u8 hw_mac_address[ETH_ALEN];
+};
+
+/* pkt_cmd CMD_INIT_VNIC, pkt_type TYPE_RSP data format */
+struct vnic_cmd_init_vnic_rsp {
+	__be16 vnic_major_version;
+	__be16 vnic_minor_version;
+	u8 num_lan_switches;
+	u8 num_data_paths;
+	__be16 num_address_entries;
+	__be32 features_supported;
+	struct vnic_lan_switch_attribs lan_switch[1];
+};
+
+/* features_supported values */
+enum {
+	VNIC_FEAT_IPV4_HEADERS = 0x0001,
+	VNIC_FEAT_IPV6_HEADERS = 0x0002,
+	VNIC_FEAT_IPV4_CSUM_RX = 0x0004,
+	VNIC_FEAT_IPV4_CSUM_TX = 0x0008,
+	VNIC_FEAT_TCP_CSUM_RX = 0x0010,
+	VNIC_FEAT_TCP_CSUM_TX = 0x0020,
+	VNIC_FEAT_UDP_CSUM_RX = 0x0040,
+	VNIC_FEAT_UDP_CSUM_TX = 0x0080,
+	VNIC_FEAT_TCP_SEGMENT = 0x0100,
+	VNIC_FEAT_IPV4_IPSEC_OFFLOAD = 0x0200,
+	VNIC_FEAT_IPV6_IPSEC_OFFLOAD = 0x0400,
+	VNIC_FEAT_FCS_PROPAGATE = 0x0800,
+	VNIC_FEAT_PF_KICK = 0x1000,
+	VNIC_FEAT_PF_FORCE_ROUTE = 0x2000,
+	VNIC_FEAT_CHASH_OFFLOAD = 0x4000,
+	/* host send with immediate data */
+	VNIC_FEAT_RDMA_IMMED = 0x8000,
+	/* host ignore inbound PF_VLAN_INSERT flag */
+	VNIC_FEAT_IGNORE_VLAN = 0x10000,
+	/* host supports IB multicast for inbound Ethernet mcast traffic */
+	VNIC_FEAT_INBOUND_IB_MC = 0x20000,
+};
+
+/* pkt_cmd CMD_CONFIG_DATA_PATH subdata format */
+struct vnic_recv_pool_config {
+	__be32 size_recv_pool_entry;
+	__be32 num_recv_pool_entries;
+	__be32 timeout_before_kick;
+	__be32 num_recv_pool_entries_before_kick;
+	__be32 num_recv_pool_bytes_before_kick;
+	__be32 free_recv_pool_entries_per_update;
+};
+
+/* pkt_cmd CMD_CONFIG_DATA_PATH data format */
+struct vnic_cmd_config_data_path {
+	u64 path_identifier;
+	u8 data_path;
+	u8 reserved[3];
+	struct vnic_recv_pool_config host_recv_pool_config;
+	struct vnic_recv_pool_config eioc_recv_pool_config;
+};
+
+/* pkt_cmd CMD_EXCHANGE_POOLS data format */
+struct vnic_cmd_exchange_pools {
+	u8 data_path;
+	u8 reserved[3];
+	__be32 pool_rkey;
+	__be64 pool_addr;
+};
+
+/* pkt_cmd CMD_CONFIG_ADDRESSES subdata format */
+struct vnic_address_op {
+	__be16 index;
+	u8 operation;
+	u8 valid;
+	u8 address[6];
+	__be16 vlan;
+};
+
+/* pkt_cmd CMD_CONFIG_ADDRESSES2 subdata format */
+struct vnic_address_op2 {
+	__be16 index;
+	u8 operation;
+	u8 valid;
+	u8 address[6];
+	__be16 vlan;
+	u32 reserved;	/* for
alignment */ + union ib_gid mgid; /* valid in rsp only if both ends support mcast */ +}; + +/* operation values */ +enum { + VNIC_OP_SET_ENTRY = 0x01, + VNIC_OP_GET_ENTRY = 0x02 +}; + +/* pkt_cmd CMD_CONFIG_ADDRESSES data format */ +struct vnic_cmd_config_addresses { + u8 num_address_ops; + u8 lan_switch_num; + struct vnic_address_op list_address_ops[1]; +}; + +/* pkt_cmd CMD_CONFIG_ADDRESSES2 data format */ +struct vnic_cmd_config_addresses2 { + u8 num_address_ops; + u8 lan_switch_num; + u8 reserved1; + u8 reserved2; + u8 reserved3; + struct vnic_address_op2 list_address_ops[1]; +}; + +/* CMD_CONFIG_LINK data format */ +struct vnic_cmd_config_link { + u8 cmd_flags; + u8 lan_switch_num; + __be16 mtu_size; + __be16 default_vlan; + u8 hw_mac_address[6]; + u32 reserved; /* for alignment */ + /* valid in rsp only if both ends support mcast */ + union ib_gid allmulti_mgid; +}; + +/* cmd_flags values */ +enum { + VNIC_FLAG_ENABLE_NIC = 0x01, + VNIC_FLAG_DISABLE_NIC = 0x02, + VNIC_FLAG_ENABLE_MCAST_ALL = 0x04, + VNIC_FLAG_DISABLE_MCAST_ALL = 0x08, + VNIC_FLAG_ENABLE_PROMISC = 0x10, + VNIC_FLAG_DISABLE_PROMISC = 0x20, + VNIC_FLAG_SET_MTU = 0x40 +}; + +/* pkt_cmd CMD_REPORT_STATISTICS, pkt_type TYPE_REQ data format */ +struct vnic_cmd_report_stats_req { + u8 lan_switch_num; +}; + +/* pkt_cmd CMD_REPORT_STATISTICS, pkt_type TYPE_RSP data format */ +struct vnic_cmd_report_stats_rsp { + u8 lan_switch_num; + u8 reserved[7]; /* for 64-bit alignment */ + __be64 if_in_broadcast_pkts; + __be64 if_in_multicast_pkts; + __be64 if_in_octets; + __be64 if_in_ucast_pkts; + __be64 if_in_nucast_pkts; /* if_in_broadcast_pkts + + if_in_multicast_pkts */ + __be64 if_in_underrun; /* (OID_GEN_RCV_NO_BUFFER) */ + __be64 if_in_errors; /* (OID_GEN_RCV_ERROR) */ + __be64 if_out_errors; /* (OID_GEN_XMIT_ERROR) */ + __be64 if_out_octets; + __be64 if_out_ucast_pkts; + __be64 if_out_multicast_pkts; + __be64 if_out_broadcast_pkts; + __be64 if_out_nucast_pkts; /* if_out_broadcast_pkts + + if_out_multicast_pkts */ + __be64 if_out_ok; /* if_out_nucast_pkts + + if_out_ucast_pkts(OID_GEN_XMIT_OK) */ + __be64 if_in_ok; /* if_in_nucast_pkts + + if_in_ucast_pkts(OID_GEN_RCV_OK) */ + __be64 if_out_ucast_bytes; /* (OID_GEN_DIRECTED_BYTES_XMT) */ + __be64 if_out_multicast_bytes; /* (OID_GEN_MULTICAST_BYTES_XMT) */ + __be64 if_out_broadcast_bytes; /* (OID_GEN_BROADCAST_BYTES_XMT) */ + __be64 if_in_ucast_bytes; /* (OID_GEN_DIRECTED_BYTES_RCV) */ + __be64 if_in_multicast_bytes; /* (OID_GEN_MULTICAST_BYTES_RCV) */ + __be64 if_in_broadcast_bytes; /* (OID_GEN_BROADCAST_BYTES_RCV) */ + __be64 ethernet_status; /* OID_GEN_MEDIA_CONNECT_STATUS) */ +}; + +/* pkt_cmd CMD_CLEAR_STATISTICS data format */ +struct vnic_cmd_clear_statistics { + u8 lan_switch_num; +}; + +/* pkt_cmd CMD_REPORT_STATUS data format */ +struct vnic_cmd_report_status { + u8 lan_switch_num; + u8 is_fatal; + u8 reserved[2]; /* for 32-bit alignment */ + __be32 status_number; + __be32 status_info; + u8 file_name[32]; + u8 routine[32]; + __be32 line_num; + __be32 error_parameter; + u8 desc_text[128]; +}; + +/* pkt_cmd CMD_HEARTBEAT data format */ +struct vnic_cmd_heartbeat { + __be32 hb_interval; +}; + +enum { + VNIC_STATUS_LINK_UP = 1, + VNIC_STATUS_LINK_DOWN = 2, + VNIC_STATUS_ENET_AGGREGATION_CHANGE = 3, + VNIC_STATUS_EIOC_SHUTDOWN = 4, + VNIC_STATUS_CONTROL_ERROR = 5, + VNIC_STATUS_EIOC_ERROR = 6 +}; + +#define VNIC_MAX_CONTROLPKTSZ 256 +#define VNIC_MAX_CONTROLDATASZ \ + (VNIC_MAX_CONTROLPKTSZ - sizeof(struct vnic_control_header)) + +struct vnic_control_packet { + struct 
vnic_control_header hdr;
+	union {
+		struct vnic_cmd_init_vnic_req init_vnic_req;
+		struct vnic_cmd_init_vnic_rsp init_vnic_rsp;
+		struct vnic_cmd_config_data_path config_data_path_req;
+		struct vnic_cmd_config_data_path config_data_path_rsp;
+		struct vnic_cmd_exchange_pools exchange_pools_req;
+		struct vnic_cmd_exchange_pools exchange_pools_rsp;
+		struct vnic_cmd_config_addresses config_addresses_req;
+		struct vnic_cmd_config_addresses2 config_addresses_req2;
+		struct vnic_cmd_config_addresses config_addresses_rsp;
+		struct vnic_cmd_config_addresses2 config_addresses_rsp2;
+		struct vnic_cmd_config_link config_link_req;
+		struct vnic_cmd_config_link config_link_rsp;
+		struct vnic_cmd_report_stats_req report_statistics_req;
+		struct vnic_cmd_report_stats_rsp report_statistics_rsp;
+		struct vnic_cmd_clear_statistics clear_statistics_req;
+		struct vnic_cmd_clear_statistics clear_statistics_rsp;
+		struct vnic_cmd_report_status report_status;
+		struct vnic_cmd_heartbeat heartbeat_req;
+		struct vnic_cmd_heartbeat heartbeat_rsp;
+
+		char cmd_data[VNIC_MAX_CONTROLDATASZ];
+	} cmd;
+};
+
+union ib_gid_cpu {
+	u8 raw[16];
+	struct {
+		u64 subnet_prefix;
+		u64 interface_id;
+	} global;
+};
+
+static inline void bswap_ib_gid(union ib_gid *mgid1, union ib_gid_cpu *mgid2)
+{
+	/* swap hi & low */
+	__be64 low = mgid1->global.subnet_prefix;
+	mgid2->global.subnet_prefix = be64_to_cpu(mgid1->global.interface_id);
+	mgid2->global.interface_id = be64_to_cpu(low);
+}
+
+#define VNIC_GID_FMT "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x"
+
+#define VNIC_GID_RAW_ARG(gid) be16_to_cpu(*(__be16 *)&(gid)[0]), \
+			      be16_to_cpu(*(__be16 *)&(gid)[2]), \
+			      be16_to_cpu(*(__be16 *)&(gid)[4]), \
+			      be16_to_cpu(*(__be16 *)&(gid)[6]), \
+			      be16_to_cpu(*(__be16 *)&(gid)[8]), \
+			      be16_to_cpu(*(__be16 *)&(gid)[10]), \
+			      be16_to_cpu(*(__be16 *)&(gid)[12]), \
+			      be16_to_cpu(*(__be16 *)&(gid)[14])
+
+/* These defines are used to figure out how many address entries can be passed
+ * in a config_addresses request.
+ */
+#define MAX_CONFIG_ADDR_ENTRIES \
+	((VNIC_MAX_CONTROLDATASZ - (sizeof(struct vnic_cmd_config_addresses) \
+	- sizeof(struct vnic_address_op)))/sizeof(struct vnic_address_op))
+#define MAX_CONFIG_ADDR_ENTRIES2 \
+	((VNIC_MAX_CONTROLDATASZ - (sizeof(struct vnic_cmd_config_addresses2) \
+	- sizeof(struct vnic_address_op2)))/sizeof(struct vnic_address_op2))
+
+#endif /* VNIC_CONTROL_PKT_H_INCLUDED */

From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:56:24 2008
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Thu, 29 May 2008 15:26:24 +0530
Subject: [ofa-general] [PATCH v3 05/13] QLogic VNIC: Implementation of Data path of communication protocol
In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain>
References: <20080529095126.9943.84692.stgit@localhost.localdomain>
Message-ID: <20080529095624.9943.8469.stgit@localhost.localdomain>

From: Ramachandra K

This patch implements the actual data transfer part of the communication
protocol with the EVIC/VEx. RDMA of Ethernet packets is implemented here.
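(Aside for reviewers: throughout the data path, each transmit slot holds the
copied frame data padded up to VIPORT_TRAILER_ALIGNMENT, followed by a
struct viport_trailer, which is the arithmetic data_init_pool_work_reqs()
below steps through. Here is a tiny standalone sketch of that size math.
ALIGN_UP, slot_bytes, VIPORT_TRAILER_SIZE and the 32/64 constants are
made-up placeholders for this sketch only; the real alignment and trailer
layout come from vnic_trailer.h in this patch.)

    #include <stdio.h>
    #include <stddef.h>

    #define VIPORT_TRAILER_ALIGNMENT 32   /* assumed placeholder */
    #define VIPORT_TRAILER_SIZE      64   /* assumed sizeof(struct viport_trailer) */

    /* Round x up to a multiple of a (a power of two), like the kernel ALIGN() */
    #define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((size_t)(a) - 1))

    /* Mirrors the per-buffer walk: data area padded to alignment, then trailer */
    static size_t slot_bytes(size_t copy_buf)
    {
            return ALIGN_UP(copy_buf, VIPORT_TRAILER_ALIGNMENT) +
                   VIPORT_TRAILER_SIZE;
    }

    int main(void)
    {
            size_t min_xmt_skb = 60;     /* module parameter default */
            size_t num_xmit_bufs = 8;    /* notify_bundle * 2, e.g. bundle of 4 */

            printf("bytes per xmit slot: %zu\n", slot_bytes(min_xmt_skb));
            printf("xmit_data region:    %zu\n",
                   slot_bytes(min_xmt_skb) * num_xmit_bufs);
            return 0;
    }

With the default min_xmt_skb of 60 and a 32-byte alignment, each copy slot
rounds to 96 bytes of data area plus the trailer; frames larger than
min_xmt_skb are, per the parameter description, not copied and presumably go
out from the skb buffer itself.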
Signed-off-by: Ramachandra K
Signed-off-by: Poornima Kamath
Signed-off-by: Amar Mudrankit
---

 drivers/infiniband/ulp/qlgc_vnic/vnic_data.c    | 1492 +++++++++++++++++++++++
 drivers/infiniband/ulp/qlgc_vnic/vnic_data.h    |  206 +++
 drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h |  103 ++
 3 files changed, 1801 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_data.c
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_data.h
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h

diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_data.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_data.c
new file mode 100644
index 0000000..b81fcde
--- /dev/null
+++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_data.c
@@ -0,0 +1,1492 @@
+/*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include
+#include
+#include
+#include
+
+#include "vnic_util.h"
+#include "vnic_viport.h"
+#include "vnic_main.h"
+#include "vnic_data.h"
+#include "vnic_trailer.h"
+#include "vnic_stats.h"
+
+static void data_received_kick(struct io *io);
+static void data_xmit_complete(struct io *io);
+
+static void mc_data_recv_routine(struct io *io);
+static void mc_data_post_recvs(struct mc_data *mc_data);
+static void mc_data_recv_to_skbuff(struct viport *viport, struct sk_buff *skb,
+				   struct viport_trailer *trailer);
+
+static u32 min_rcv_skb = 60;
+module_param(min_rcv_skb, int, 0444);
+MODULE_PARM_DESC(min_rcv_skb, "Packets of size (in bytes) less than"
+		 " or equal to this value will be copied during receive."
+		 " Default 60");
+
+static u32 min_xmt_skb = 60;
+module_param(min_xmt_skb, int, 0444);
+MODULE_PARM_DESC(min_xmt_skb, "Packets of size (in bytes) less than"
+		 " or equal to this value will be copied during transmit."
+		 " Default 60");
+
+int data_init(struct data *data, struct viport *viport,
+	      struct data_config *config, struct ib_pd *pd)
+{
+	DATA_FUNCTION("data_init()\n");
+
+	data->parent = viport;
+	data->config = config;
+	data->ib_conn.viport = viport;
+	data->ib_conn.ib_config = &config->ib_config;
+	data->ib_conn.state = IB_CONN_UNINITTED;
+	data->ib_conn.callback_thread = NULL;
+	data->ib_conn.callback_thread_end = 0;
+
+	if ((min_xmt_skb < 60) || (min_xmt_skb > 9000)) {
+		DATA_ERROR("min_xmt_skb (%d) must be between 60 and 9000\n",
+			   min_xmt_skb);
+		goto failure;
+	}
+	if (vnic_ib_conn_init(&data->ib_conn, viport, pd,
+			      &config->ib_config)) {
+		DATA_ERROR("Data IB connection initialization failed\n");
+		goto failure;
+	}
+	data->mr = ib_get_dma_mr(pd,
+				 IB_ACCESS_LOCAL_WRITE |
+				 IB_ACCESS_REMOTE_READ |
+				 IB_ACCESS_REMOTE_WRITE);
+	if (IS_ERR(data->mr)) {
+		DATA_ERROR("failed to register memory for"
+			   " data connection\n");
+		goto destroy_conn;
+	}
+
+	data->ib_conn.cm_id = ib_create_cm_id(viport->config->ibdev,
+					      vnic_ib_cm_handler,
+					      &data->ib_conn);
+
+	if (IS_ERR(data->ib_conn.cm_id)) {
+		DATA_ERROR("creating data CM ID failed\n");
+		goto dereg_mr;
+	}
+
+	return 0;
+
+dereg_mr:
+	ib_dereg_mr(data->mr);
+destroy_conn:
+	completion_callback_cleanup(&data->ib_conn);
+	ib_destroy_qp(data->ib_conn.qp);
+	ib_destroy_cq(data->ib_conn.cq);
+failure:
+	return -1;
+}
+
+static void data_post_recvs(struct data *data)
+{
+	unsigned long flags;
+	int i = 0;
+
+	DATA_FUNCTION("data_post_recvs()\n");
+	spin_lock_irqsave(&data->recv_ios_lock, flags);
+	while (!list_empty(&data->recv_ios)) {
+		struct io *io = list_entry(data->recv_ios.next,
+					   struct io, list_ptrs);
+		struct recv_io *recv_io = (struct recv_io *)io;
+
+		list_del(&recv_io->io.list_ptrs);
+		spin_unlock_irqrestore(&data->recv_ios_lock, flags);
+		if (vnic_ib_post_recv(&data->ib_conn, &recv_io->io)) {
+			viport_failure(data->parent);
+			return;
+		}
+		i++;
+		spin_lock_irqsave(&data->recv_ios_lock, flags);
+	}
+	spin_unlock_irqrestore(&data->recv_ios_lock, flags);
+	DATA_INFO("data posted %d %p\n", i, &data->recv_ios);
+}
+
+static void data_init_pool_work_reqs(struct data *data,
+				     struct recv_io *recv_io)
+{
+	struct recv_pool *recv_pool = &data->recv_pool;
+	struct xmit_pool *xmit_pool = &data->xmit_pool;
+	struct rdma_io *rdma_io;
+	struct rdma_dest *rdma_dest;
+	dma_addr_t xmit_dma;
+	u8 *xmit_data;
+	unsigned int i;
+
+	INIT_LIST_HEAD(&data->recv_ios);
+	spin_lock_init(&data->recv_ios_lock);
+	spin_lock_init(&data->xmit_buf_lock);
+	for (i = 0; i < data->config->num_recvs; i++) {
+		recv_io[i].io.viport = data->parent;
+		recv_io[i].io.routine = data_received_kick;
+		recv_io[i].list.addr = data->region_data_dma;
+		recv_io[i].list.length = 4;
+		recv_io[i].list.lkey = data->mr->lkey;
+
+		recv_io[i].io.rwr.wr_id = (u64)&recv_io[i].io;
+		recv_io[i].io.rwr.sg_list = &recv_io[i].list;
+		recv_io[i].io.rwr.num_sge = 1;
+
+		list_add(&recv_io[i].io.list_ptrs, &data->recv_ios);
+	}
+
+	INIT_LIST_HEAD(&recv_pool->avail_recv_bufs);
+	for (i = 0; i < recv_pool->pool_sz; i++) {
+		rdma_dest = &recv_pool->recv_bufs[i];
+		list_add(&rdma_dest->list_ptrs,
+			 &recv_pool->avail_recv_bufs);
+	}
+
+	xmit_dma = xmit_pool->xmitdata_dma;
+	xmit_data = xmit_pool->xmit_data;
+
+	for (i = 0; i < xmit_pool->num_xmit_bufs; i++) {
+		rdma_io = &xmit_pool->xmit_bufs[i];
+		rdma_io->index = i;
+		rdma_io->io.viport = data->parent;
+		rdma_io->io.routine = data_xmit_complete;
+
+		rdma_io->list[0].lkey = data->mr->lkey;
+		rdma_io->list[1].lkey = data->mr->lkey;
+		rdma_io->io.swr.wr_id =
(u64)rdma_io; + rdma_io->io.swr.sg_list = rdma_io->list; + rdma_io->io.swr.num_sge = 2; + rdma_io->io.swr.opcode = IB_WR_RDMA_WRITE; + rdma_io->io.swr.send_flags = IB_SEND_SIGNALED; + rdma_io->io.type = RDMA; + + rdma_io->data = xmit_data; + rdma_io->data_dma = xmit_dma; + + xmit_data += ALIGN(min_xmt_skb, VIPORT_TRAILER_ALIGNMENT); + xmit_dma += ALIGN(min_xmt_skb, VIPORT_TRAILER_ALIGNMENT); + rdma_io->trailer = (struct viport_trailer *)xmit_data; + rdma_io->trailer_dma = xmit_dma; + xmit_data += sizeof(struct viport_trailer); + xmit_dma += sizeof(struct viport_trailer); + } + + xmit_pool->rdma_rkey = data->mr->rkey; + xmit_pool->rdma_addr = xmit_pool->buf_pool_dma; +} + +static void data_init_free_bufs_swrs(struct data *data) +{ + struct rdma_io *rdma_io; + struct send_io *send_io; + + rdma_io = &data->free_bufs_io; + rdma_io->io.viport = data->parent; + rdma_io->io.routine = NULL; + + rdma_io->list[0].lkey = data->mr->lkey; + + rdma_io->io.swr.wr_id = (u64)rdma_io; + rdma_io->io.swr.sg_list = rdma_io->list; + rdma_io->io.swr.num_sge = 1; + rdma_io->io.swr.opcode = IB_WR_RDMA_WRITE; + rdma_io->io.swr.send_flags = IB_SEND_SIGNALED; + rdma_io->io.type = RDMA; + + send_io = &data->kick_io; + send_io->io.viport = data->parent; + send_io->io.routine = NULL; + + send_io->list.addr = data->region_data_dma; + send_io->list.length = 0; + send_io->list.lkey = data->mr->lkey; + + send_io->io.swr.wr_id = (u64)send_io; + send_io->io.swr.sg_list = &send_io->list; + send_io->io.swr.num_sge = 1; + send_io->io.swr.opcode = IB_WR_SEND; + send_io->io.swr.send_flags = IB_SEND_SIGNALED; + send_io->io.type = SEND; +} + +static int data_init_buf_pools(struct data *data) +{ + struct recv_pool *recv_pool = &data->recv_pool; + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct viport *viport = data->parent; + + recv_pool->buf_pool_len = + sizeof(struct buff_pool_entry) * recv_pool->eioc_pool_sz; + + recv_pool->buf_pool = kzalloc(recv_pool->buf_pool_len, GFP_KERNEL); + + if (!recv_pool->buf_pool) { + DATA_ERROR("failed allocating %d bytes" + " for recv pool bufpool\n", + recv_pool->buf_pool_len); + goto failure; + } + + recv_pool->buf_pool_dma = + ib_dma_map_single(viport->config->ibdev, + recv_pool->buf_pool, recv_pool->buf_pool_len, + DMA_TO_DEVICE); + + if (ib_dma_mapping_error(viport->config->ibdev, recv_pool->buf_pool_dma)) { + DATA_ERROR("xmit buf_pool dma map error\n"); + goto free_recv_pool; + } + + xmit_pool->buf_pool_len = + sizeof(struct buff_pool_entry) * xmit_pool->pool_sz; + xmit_pool->buf_pool = kzalloc(xmit_pool->buf_pool_len, GFP_KERNEL); + + if (!xmit_pool->buf_pool) { + DATA_ERROR("failed allocating %d bytes" + " for xmit pool bufpool\n", + xmit_pool->buf_pool_len); + goto unmap_recv_pool; + } + + xmit_pool->buf_pool_dma = + ib_dma_map_single(viport->config->ibdev, + xmit_pool->buf_pool, xmit_pool->buf_pool_len, + DMA_FROM_DEVICE); + + if (ib_dma_mapping_error(viport->config->ibdev, xmit_pool->buf_pool_dma)) { + DATA_ERROR("xmit buf_pool dma map error\n"); + goto free_xmit_pool; + } + + xmit_pool->xmit_data = kzalloc(xmit_pool->xmitdata_len, GFP_KERNEL); + + if (!xmit_pool->xmit_data) { + DATA_ERROR("failed allocating %d bytes for xmit data\n", + xmit_pool->xmitdata_len); + goto unmap_xmit_pool; + } + + xmit_pool->xmitdata_dma = + ib_dma_map_single(viport->config->ibdev, + xmit_pool->xmit_data, xmit_pool->xmitdata_len, + DMA_TO_DEVICE); + + if (ib_dma_mapping_error(viport->config->ibdev, xmit_pool->xmitdata_dma)) { + DATA_ERROR("xmit data dma map error\n"); + goto free_xmit_data; + } + + 
return 0; + +free_xmit_data: + kfree(xmit_pool->xmit_data); +unmap_xmit_pool: + ib_dma_unmap_single(data->parent->config->ibdev, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_FROM_DEVICE); +free_xmit_pool: + kfree(xmit_pool->buf_pool); +unmap_recv_pool: + ib_dma_unmap_single(data->parent->config->ibdev, + recv_pool->buf_pool_dma, + recv_pool->buf_pool_len, DMA_TO_DEVICE); +free_recv_pool: + kfree(recv_pool->buf_pool); +failure: + return -1; +} + +static void data_init_xmit_pool(struct data *data) +{ + struct xmit_pool *xmit_pool = &data->xmit_pool; + + xmit_pool->pool_sz = + be32_to_cpu(data->eioc_pool_parms.num_recv_pool_entries); + xmit_pool->buffer_sz = + be32_to_cpu(data->eioc_pool_parms.size_recv_pool_entry); + + xmit_pool->notify_count = 0; + xmit_pool->notify_bundle = data->config->notify_bundle; + xmit_pool->next_xmit_pool = 0; + xmit_pool->num_xmit_bufs = xmit_pool->notify_bundle * 2; + xmit_pool->next_xmit_buf = 0; + xmit_pool->last_comp_buf = xmit_pool->num_xmit_bufs - 1; + /* This assumes that data_init_recv_pool has been called + * before. + */ + data->max_mtu = MAX_PAYLOAD(min((data)->recv_pool.buffer_sz, + (data)->xmit_pool.buffer_sz)) - VLAN_ETH_HLEN; + + xmit_pool->kick_count = 0; + xmit_pool->kick_byte_count = 0; + + xmit_pool->send_kicks = + be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_entries_before_kick) + || be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_bytes_before_kick); + xmit_pool->kick_bundle = + be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_entries_before_kick); + xmit_pool->kick_byte_bundle = + be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_bytes_before_kick); + + xmit_pool->need_buffers = 1; + + xmit_pool->xmitdata_len = + BUFFER_SIZE(min_xmt_skb) * xmit_pool->num_xmit_bufs; +} + +static void data_init_recv_pool(struct data *data) +{ + struct recv_pool *recv_pool = &data->recv_pool; + + recv_pool->pool_sz = data->config->host_recv_pool_entries; + recv_pool->eioc_pool_sz = + be32_to_cpu(data->host_pool_parms.num_recv_pool_entries); + if (recv_pool->pool_sz > recv_pool->eioc_pool_sz) + recv_pool->pool_sz = + be32_to_cpu(data->host_pool_parms.num_recv_pool_entries); + + recv_pool->buffer_sz = + be32_to_cpu(data->host_pool_parms.size_recv_pool_entry); + + recv_pool->sz_free_bundle = + be32_to_cpu(data-> + host_pool_parms.free_recv_pool_entries_per_update); + recv_pool->num_free_bufs = 0; + recv_pool->num_posted_bufs = 0; + + recv_pool->next_full_buf = 0; + recv_pool->next_free_buf = 0; + recv_pool->kick_on_free = 0; +} + +int data_connect(struct data *data) +{ + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct recv_pool *recv_pool = &data->recv_pool; + struct recv_io *recv_io; + unsigned int sz; + struct viport *viport = data->parent; + + DATA_FUNCTION("data_connect()\n"); + + /* Do not interchange the order of the functions + * called below as this will affect the MAX MTU + * calculation + */ + + data_init_recv_pool(data); + data_init_xmit_pool(data); + + sz = sizeof(struct rdma_dest) * recv_pool->pool_sz + + sizeof(struct recv_io) * data->config->num_recvs + + sizeof(struct rdma_io) * xmit_pool->num_xmit_bufs; + + data->local_storage = vmalloc(sz); + + if (!data->local_storage) { + DATA_ERROR("failed allocating %d bytes" + " local storage\n", sz); + goto out; + } + + memset(data->local_storage, 0, sz); + + recv_pool->recv_bufs = (struct rdma_dest *)data->local_storage; + sz = sizeof(struct rdma_dest) * recv_pool->pool_sz; + + recv_io = (struct recv_io *)(data->local_storage + sz); + sz += sizeof(struct recv_io) * 
data->config->num_recvs; + + xmit_pool->xmit_bufs = (struct rdma_io *)(data->local_storage + sz); + data->region_data = kzalloc(4, GFP_KERNEL); + + if (!data->region_data) { + DATA_ERROR("failed to alloc memory for region data\n"); + goto free_local_storage; + } + + data->region_data_dma = + ib_dma_map_single(viport->config->ibdev, + data->region_data, 4, DMA_BIDIRECTIONAL); + + if (ib_dma_mapping_error(viport->config->ibdev, data->region_data_dma)) { + DATA_ERROR("region data dma map error\n"); + goto free_region_data; + } + + if (data_init_buf_pools(data)) + goto unmap_region_data; + + data_init_free_bufs_swrs(data); + data_init_pool_work_reqs(data, recv_io); + + data_post_recvs(data); + + if (vnic_ib_cm_connect(&data->ib_conn)) + goto unmap_region_data; + + return 0; + +unmap_region_data: + ib_dma_unmap_single(data->parent->config->ibdev, + data->region_data_dma, 4, DMA_BIDIRECTIONAL); +free_region_data: + kfree(data->region_data); +free_local_storage: + vfree(data->local_storage); +out: + return -1; +} + +static void data_add_free_buffer(struct data *data, int index, + struct rdma_dest *rdma_dest) +{ + struct recv_pool *pool = &data->recv_pool; + struct buff_pool_entry *bpe; + dma_addr_t vaddr_dma; + + DATA_FUNCTION("data_add_free_buffer()\n"); + rdma_dest->trailer->connection_hash_and_valid = 0; + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + bpe = &pool->buf_pool[index]; + bpe->rkey = cpu_to_be32(data->mr->rkey); + vaddr_dma = ib_dma_map_single(data->parent->config->ibdev, + rdma_dest->data, pool->buffer_sz, + DMA_FROM_DEVICE); + if (ib_dma_mapping_error(data->parent->config->ibdev, vaddr_dma)) { + DATA_ERROR("rdma_dest->data dma map error\n"); + goto failure; + } + bpe->remote_addr = cpu_to_be64(vaddr_dma); + bpe->valid = (u32) (rdma_dest - &pool->recv_bufs[0]) + 1; + ++pool->num_free_bufs; +failure: + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); +} + +/* NOTE: this routine is not reentrant */ +static void data_alloc_buffers(struct data *data, int initial_allocation) +{ + struct recv_pool *pool = &data->recv_pool; + struct rdma_dest *rdma_dest; + struct sk_buff *skb; + int index; + + DATA_FUNCTION("data_alloc_buffers()\n"); + index = ADD(pool->next_free_buf, pool->num_free_bufs, + pool->eioc_pool_sz); + + while (!list_empty(&pool->avail_recv_bufs)) { + rdma_dest = + list_entry(pool->avail_recv_bufs.next, + struct rdma_dest, list_ptrs); + if (!rdma_dest->skb) { + if (initial_allocation) + skb = alloc_skb(pool->buffer_sz + 2, + GFP_KERNEL); + else + skb = dev_alloc_skb(pool->buffer_sz + 2); + if (!skb) + break; + skb_reserve(skb, 2); + skb_put(skb, pool->buffer_sz); + rdma_dest->skb = skb; + rdma_dest->data = skb->data; + rdma_dest->trailer = + (struct viport_trailer *)(rdma_dest->data + + pool->buffer_sz - + sizeof(struct + viport_trailer)); + } + rdma_dest->trailer->connection_hash_and_valid = 0; + + list_del_init(&rdma_dest->list_ptrs); + + data_add_free_buffer(data, index, rdma_dest); + index = NEXT(index, pool->eioc_pool_sz); + } +} + +static void data_send_kick_message(struct data *data) +{ + struct xmit_pool *pool = &data->xmit_pool; + DATA_FUNCTION("data_send_kick_message()\n"); + /* stop timer for bundle_timeout */ + if (data->kick_timer_on) { + del_timer(&data->kick_timer); + data->kick_timer_on = 0; + } + pool->kick_count = 0; + pool->kick_byte_count = 0; + + /* TODO: keep track of when kick is outstanding, and + * don't reuse 
until complete + */ + if (vnic_ib_post_send(&data->ib_conn, &data->free_bufs_io.io)) { + DATA_ERROR("failed to post send\n"); + viport_failure(data->parent); + } +} + +static void data_send_free_recv_buffers(struct data *data) +{ + struct recv_pool *pool = &data->recv_pool; + struct ib_send_wr *swr = &data->free_bufs_io.io.swr; + + int bufs_sent = 0; + u64 rdma_addr; + u32 offset; + u32 sz; + unsigned int num_to_send, next_increment; + + DATA_FUNCTION("data_send_free_recv_buffers()\n"); + + for (num_to_send = pool->sz_free_bundle; + num_to_send <= pool->num_free_bufs; + num_to_send += pool->sz_free_bundle) { + /* handle multiple bundles as one when possible. */ + next_increment = num_to_send + pool->sz_free_bundle; + if ((next_increment <= pool->num_free_bufs) + && (pool->next_free_buf + next_increment <= + pool->eioc_pool_sz)) + continue; + + offset = pool->next_free_buf * + sizeof(struct buff_pool_entry); + sz = num_to_send * sizeof(struct buff_pool_entry); + rdma_addr = pool->eioc_rdma_addr + offset; + swr->sg_list->length = sz; + swr->sg_list->addr = pool->buf_pool_dma + offset; + swr->wr.rdma.remote_addr = rdma_addr; + + if (vnic_ib_post_send(&data->ib_conn, + &data->free_bufs_io.io)) { + DATA_ERROR("failed to post send\n"); + viport_failure(data->parent); + return; + } + INC(pool->next_free_buf, num_to_send, pool->eioc_pool_sz); + pool->num_free_bufs -= num_to_send; + pool->num_posted_bufs += num_to_send; + bufs_sent = 1; + } + + if (bufs_sent) { + if (pool->kick_on_free) + data_send_kick_message(data); + } + if (pool->num_posted_bufs == 0) { + struct vnic *vnic = data->parent->vnic; + unsigned long flags; + + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (vnic->current_path == &vnic->primary_path) { + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + DATA_ERROR("%s: primary path: " + "unable to allocate receive buffers\n", + vnic->config->name); + } else { + if (vnic->current_path == &vnic->secondary_path) { + spin_unlock_irqrestore(&vnic->current_path_lock, + flags); + DATA_ERROR("%s: secondary path: " + "unable to allocate receive buffers\n", + vnic->config->name); + } else + spin_unlock_irqrestore(&vnic->current_path_lock, + flags); + } + data->ib_conn.state = IB_CONN_ERRORED; + viport_failure(data->parent); + } +} + +void data_connected(struct data *data) +{ + DATA_FUNCTION("data_connected()\n"); + data->free_bufs_io.io.swr.wr.rdma.rkey = + data->recv_pool.eioc_rdma_rkey; + data_alloc_buffers(data, 1); + data_send_free_recv_buffers(data); + data->connected = 1; +} + +void data_disconnect(struct data *data) +{ + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct recv_pool *recv_pool = &data->recv_pool; + unsigned int i; + + DATA_FUNCTION("data_disconnect()\n"); + + data->connected = 0; + if (data->kick_timer_on) { + del_timer_sync(&data->kick_timer); + data->kick_timer_on = 0; + } + + if (ib_send_cm_dreq(data->ib_conn.cm_id, NULL, 0)) + DATA_ERROR("data CM DREQ sending failed\n"); + data->ib_conn.state = IB_CONN_DISCONNECTED; + + completion_callback_cleanup(&data->ib_conn); + + for (i = 0; i < xmit_pool->num_xmit_bufs; i++) { + if (xmit_pool->xmit_bufs[i].skb) + dev_kfree_skb(xmit_pool->xmit_bufs[i].skb); + xmit_pool->xmit_bufs[i].skb = NULL; + + } + for (i = 0; i < recv_pool->pool_sz; i++) { + if (data->recv_pool.recv_bufs[i].skb) + dev_kfree_skb(recv_pool->recv_bufs[i].skb); + recv_pool->recv_bufs[i].skb = NULL; + } + vfree(data->local_storage); + if (data->region_data) { + ib_dma_unmap_single(data->parent->config->ibdev, + data->region_data_dma, 4, + 
DMA_BIDIRECTIONAL); + kfree(data->region_data); + } + + if (recv_pool->buf_pool) { + ib_dma_unmap_single(data->parent->config->ibdev, + recv_pool->buf_pool_dma, + recv_pool->buf_pool_len, DMA_TO_DEVICE); + kfree(recv_pool->buf_pool); + } + + if (xmit_pool->buf_pool) { + ib_dma_unmap_single(data->parent->config->ibdev, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_FROM_DEVICE); + kfree(xmit_pool->buf_pool); + } + + if (xmit_pool->xmit_data) { + ib_dma_unmap_single(data->parent->config->ibdev, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); + kfree(xmit_pool->xmit_data); + } +} + +void data_cleanup(struct data *data) +{ + ib_destroy_cm_id(data->ib_conn.cm_id); + + /* Completion callback cleanup called again. + * This is to cleanup the threads in case there is an + * error before state LINK_DATACONNECT due to which + * data_disconnect is not called. + */ + completion_callback_cleanup(&data->ib_conn); + ib_destroy_qp(data->ib_conn.qp); + ib_destroy_cq(data->ib_conn.cq); + ib_dereg_mr(data->mr); + +} + +static int data_alloc_xmit_buffer(struct data *data, struct sk_buff *skb, + struct buff_pool_entry **pp_bpe, + struct rdma_io **pp_rdma_io, + int *last) +{ + struct xmit_pool *pool = &data->xmit_pool; + unsigned long flags; + int ret; + + DATA_FUNCTION("data_alloc_xmit_buffer()\n"); + + spin_lock_irqsave(&data->xmit_buf_lock, flags); + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + *last = 0; + *pp_rdma_io = &pool->xmit_bufs[pool->next_xmit_buf]; + *pp_bpe = &pool->buf_pool[pool->next_xmit_pool]; + + if ((*pp_bpe)->valid && pool->next_xmit_buf != + pool->last_comp_buf) { + INC(pool->next_xmit_buf, 1, pool->num_xmit_bufs); + INC(pool->next_xmit_pool, 1, pool->pool_sz); + if (!pool->buf_pool[pool->next_xmit_pool].valid) { + DATA_INFO("just used the last EIOU" + " receive buffer\n"); + *last = 1; + pool->need_buffers = 1; + vnic_stop_xmit(data->parent->vnic, + data->parent->parent); + data_kickreq_stats(data); + } else if (pool->next_xmit_buf == pool->last_comp_buf) { + DATA_INFO("just used our last xmit buffer\n"); + pool->need_buffers = 1; + vnic_stop_xmit(data->parent->vnic, + data->parent->parent); + } + (*pp_rdma_io)->skb = skb; + (*pp_bpe)->valid = 0; + ret = 0; + } else { + data_no_xmitbuf_stats(data); + DATA_ERROR("Out of xmit buffers\n"); + vnic_stop_xmit(data->parent->vnic, + data->parent->parent); + ret = -1; + } + + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->buf_pool_dma, + pool->buf_pool_len, DMA_TO_DEVICE); + spin_unlock_irqrestore(&data->xmit_buf_lock, flags); + return ret; +} + +static void data_rdma_packet(struct data *data, struct buff_pool_entry *bpe, + struct rdma_io *rdma_io) +{ + struct ib_send_wr *swr; + struct sk_buff *skb; + dma_addr_t trailer_data_dma; + dma_addr_t skb_data_dma; + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct viport *viport = data->parent; + u8 *d; + int len; + int fill_len; + + DATA_FUNCTION("data_rdma_packet()\n"); + swr = &rdma_io->io.swr; + skb = rdma_io->skb; + len = ALIGN(rdma_io->len, VIPORT_TRAILER_ALIGNMENT); + fill_len = len - skb->len; + + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); + + d = (u8 *) rdma_io->trailer - fill_len; + trailer_data_dma = rdma_io->trailer_dma - fill_len; + memset(d, 0, fill_len); + + swr->sg_list[0].length = skb->len; + if (skb->len <= min_xmt_skb) { + memcpy(rdma_io->data, skb->data, skb->len); + 
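+		/* The packet was copied into the pre-mapped xmit_data
+		 * region (packets of at most min_xmt_skb bytes take this
+		 * path), so the gather entry points at that copy and the
+		 * skb can be freed at once; larger packets are DMA-mapped
+		 * in place in the else branch.
+		 */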
swr->sg_list[0].lkey = data->mr->lkey; + swr->sg_list[0].addr = rdma_io->data_dma; + dev_kfree_skb_any(skb); + rdma_io->skb = NULL; + } else { + swr->sg_list[0].lkey = data->mr->lkey; + + skb_data_dma = ib_dma_map_single(viport->config->ibdev, + skb->data, skb->len, + DMA_TO_DEVICE); + + if (ib_dma_mapping_error(viport->config->ibdev, skb_data_dma)) { + DATA_ERROR("skb data dma map error\n"); + goto failure; + } + + rdma_io->skb_data_dma = skb_data_dma; + + swr->sg_list[0].addr = skb_data_dma; + skb_orphan(skb); + } + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_TO_DEVICE); + + swr->sg_list[1].addr = trailer_data_dma; + swr->sg_list[1].length = fill_len + sizeof(struct viport_trailer); + swr->sg_list[0].lkey = data->mr->lkey; + swr->wr.rdma.remote_addr = be64_to_cpu(bpe->remote_addr); + swr->wr.rdma.remote_addr += data->xmit_pool.buffer_sz; + swr->wr.rdma.remote_addr -= (sizeof(struct viport_trailer) + len); + swr->wr.rdma.rkey = be32_to_cpu(bpe->rkey); + + ib_dma_sync_single_for_device(data->parent->config->ibdev, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_TO_DEVICE); + + /* If VNIC_FEAT_RDMA_IMMED is supported then change the work request + * opcode to IB_WR_RDMA_WRITE_WITH_IMM + */ + + if (data->parent->features_supported & VNIC_FEAT_RDMA_IMMED) { + swr->ex.imm_data = 0; + swr->opcode = IB_WR_RDMA_WRITE_WITH_IMM; + } + + data->xmit_pool.notify_count++; + if (data->xmit_pool.notify_count >= data->xmit_pool.notify_bundle) { + data->xmit_pool.notify_count = 0; + swr->send_flags = IB_SEND_SIGNALED; + } else { + swr->send_flags = 0; + } + ib_dma_sync_single_for_device(data->parent->config->ibdev, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); + if (vnic_ib_post_send(&data->ib_conn, &rdma_io->io)) { + DATA_ERROR("failed to post send for data RDMA write\n"); + viport_failure(data->parent); + goto failure; + } + + data_xmits_stats(data); +failure: + ib_dma_sync_single_for_device(data->parent->config->ibdev, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); +} + +static void data_kick_timeout_handler(unsigned long arg) +{ + struct data *data = (struct data *)arg; + + DATA_FUNCTION("data_kick_timeout_handler()\n"); + data->kick_timer_on = 0; + data_send_kick_message(data); +} + +int data_xmit_packet(struct data *data, struct sk_buff *skb) +{ + struct xmit_pool *pool = &data->xmit_pool; + struct rdma_io *rdma_io; + struct buff_pool_entry *bpe; + struct viport_trailer *trailer; + unsigned int sz = skb->len; + int last; + + DATA_FUNCTION("data_xmit_packet()\n"); + if (sz > pool->buffer_sz) { + DATA_ERROR("outbound packet too large, size = %d\n", sz); + return -1; + } + + if (data_alloc_xmit_buffer(data, skb, &bpe, &rdma_io, &last)) { + DATA_ERROR("error in allocating data xmit buffer\n"); + return -1; + } + + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->xmitdata_dma, pool->xmitdata_len, + DMA_TO_DEVICE); + trailer = rdma_io->trailer; + + memset(trailer, 0, sizeof *trailer); + memcpy(trailer->dest_mac_addr, skb->data, ETH_ALEN); + + if (skb->sk) + trailer->connection_hash_and_valid = 0x40 | + ((be16_to_cpu(inet_sk(skb->sk)->sport) + + be16_to_cpu(inet_sk(skb->sk)->dport)) & 0x3f); + + trailer->connection_hash_and_valid |= CHV_VALID; + + if ((sz > 16) && (*(__be16 *) (skb->data + 12) == + __constant_cpu_to_be16(ETH_P_8021Q))) { + trailer->vlan = *(__be16 *) (skb->data + 14); + memmove(skb->data + 4, skb->data, 12); + skb_pull(skb, 4); + sz -= 4; + 
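+		/* The 802.1Q tag has been saved in trailer->vlan and the
+		 * MAC header shifted over it, shortening the frame by four
+		 * bytes; PF_VLAN_INSERT tells the receiver to re-insert
+		 * the tag.
+		 */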
trailer->pkt_flags |= PF_VLAN_INSERT; + } + if (last) + trailer->pkt_flags |= PF_KICK; + if (sz < ETH_ZLEN) { + /* EIOU requires all packets to be + * of ethernet minimum packet size. + */ + trailer->data_length = __constant_cpu_to_be16(ETH_ZLEN); + rdma_io->len = ETH_ZLEN; + } else { + trailer->data_length = cpu_to_be16(sz); + rdma_io->len = sz; + } + + if (skb->ip_summed == CHECKSUM_PARTIAL) { + trailer->tx_chksum_flags = TX_CHKSUM_FLAGS_CHECKSUM_V4 + | TX_CHKSUM_FLAGS_IP_CHECKSUM + | TX_CHKSUM_FLAGS_TCP_CHECKSUM + | TX_CHKSUM_FLAGS_UDP_CHECKSUM; + } + + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->xmitdata_dma, pool->xmitdata_len, + DMA_TO_DEVICE); + data_rdma_packet(data, bpe, rdma_io); + + if (pool->send_kicks) { + /* EIOC needs kicks to inform it of sent packets */ + pool->kick_count++; + pool->kick_byte_count += sz; + if ((pool->kick_count >= pool->kick_bundle) + || (pool->kick_byte_count >= pool->kick_byte_bundle)) { + data_send_kick_message(data); + } else if (pool->kick_count == 1) { + init_timer(&data->kick_timer); + /* timeout_before_kick is in usec */ + data->kick_timer.expires = + msecs_to_jiffies(be32_to_cpu(data-> + eioc_pool_parms.timeout_before_kick) * 1000) + + jiffies; + data->kick_timer.data = (unsigned long)data; + data->kick_timer.function = data_kick_timeout_handler; + add_timer(&data->kick_timer); + data->kick_timer_on = 1; + } + } + return 0; +} + +static void data_check_xmit_buffers(struct data *data) +{ + struct xmit_pool *pool = &data->xmit_pool; + unsigned long flags; + + DATA_FUNCTION("data_check_xmit_buffers()\n"); + spin_lock_irqsave(&data->xmit_buf_lock, flags); + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + if (data->xmit_pool.need_buffers + && pool->buf_pool[pool->next_xmit_pool].valid + && pool->next_xmit_buf != pool->last_comp_buf) { + data->xmit_pool.need_buffers = 0; + vnic_restart_xmit(data->parent->vnic, + data->parent->parent); + DATA_INFO("there are free xmit buffers\n"); + } + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + spin_unlock_irqrestore(&data->xmit_buf_lock, flags); +} + +static struct sk_buff *data_recv_to_skbuff(struct data *data, + struct rdma_dest *rdma_dest) +{ + struct viport_trailer *trailer; + struct sk_buff *skb = NULL; + int start; + unsigned int len; + u8 rx_chksum_flags; + + DATA_FUNCTION("data_recv_to_skbuff()\n"); + trailer = rdma_dest->trailer; + start = data_offset(data, trailer); + len = data_len(data, trailer); + + if (len <= min_rcv_skb) + skb = dev_alloc_skb(len + VLAN_HLEN + 2); + /* leave room for VLAN header and alignment */ + if (skb) { + skb_reserve(skb, VLAN_HLEN + 2); + memcpy(skb->data, rdma_dest->data + start, len); + skb_put(skb, len); + } else { + skb = rdma_dest->skb; + rdma_dest->skb = NULL; + rdma_dest->trailer = NULL; + rdma_dest->data = NULL; + skb_pull(skb, start); + skb_trim(skb, len); + } + + rx_chksum_flags = trailer->rx_chksum_flags; + DATA_INFO("rx_chksum_flags = %d, LOOP = %c, IP = %c," + " TCP = %c, UDP = %c\n", + rx_chksum_flags, + (rx_chksum_flags & RX_CHKSUM_FLAGS_LOOPBACK) ? 'Y' : 'N', + (rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED) ? 'Y' + : (rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_FAILED) ? 'N' : + '-', + (rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED) ? 'Y' + : (rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_FAILED) ? 
'N' : + '-', + (rx_chksum_flags & RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED) ? 'Y' + : (rx_chksum_flags & RX_CHKSUM_FLAGS_UDP_CHECKSUM_FAILED) ? 'N' : + '-'); + + if ((rx_chksum_flags & RX_CHKSUM_FLAGS_LOOPBACK) + || ((rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED) + && ((rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED) + || (rx_chksum_flags & + RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED)))) + skb->ip_summed = CHECKSUM_UNNECESSARY; + else + skb->ip_summed = CHECKSUM_NONE; + + if ((trailer->pkt_flags & PF_VLAN_INSERT) && + !(data->parent->features_supported & VNIC_FEAT_IGNORE_VLAN)) { + u8 *rv; + + rv = skb_push(skb, 4); + memmove(rv, rv + 4, 12); + *(__be16 *) (rv + 12) = __constant_cpu_to_be16(ETH_P_8021Q); + if (trailer->pkt_flags & PF_PVID_OVERRIDDEN) + *(__be16 *) (rv + 14) = trailer->vlan & + __constant_cpu_to_be16(0xF000); + else + *(__be16 *) (rv + 14) = trailer->vlan; + } + + return skb; +} + +static int data_incoming_recv(struct data *data) +{ + struct recv_pool *pool = &data->recv_pool; + struct rdma_dest *rdma_dest; + struct viport_trailer *trailer; + struct buff_pool_entry *bpe; + struct sk_buff *skb; + dma_addr_t vaddr_dma; + + DATA_FUNCTION("data_incoming_recv()\n"); + if (pool->next_full_buf == pool->next_free_buf) + return -1; + bpe = &pool->buf_pool[pool->next_full_buf]; + vaddr_dma = be64_to_cpu(bpe->remote_addr); + rdma_dest = &pool->recv_bufs[bpe->valid - 1]; + trailer = rdma_dest->trailer; + + if (!trailer + || !(trailer->connection_hash_and_valid & CHV_VALID)) + return -1; + + /* received a packet */ + if (trailer->pkt_flags & PF_KICK) + pool->kick_on_free = 1; + + skb = data_recv_to_skbuff(data, rdma_dest); + + if (skb) { + vnic_recv_packet(data->parent->vnic, + data->parent->parent, skb); + list_add(&rdma_dest->list_ptrs, &pool->avail_recv_bufs); + } + + ib_dma_unmap_single(data->parent->config->ibdev, + vaddr_dma, pool->buffer_sz, + DMA_FROM_DEVICE); + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + bpe->valid = 0; + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + INC(pool->next_full_buf, 1, pool->eioc_pool_sz); + pool->num_posted_bufs--; + data_recvs_stats(data); + return 0; +} + +static void data_received_kick(struct io *io) +{ + struct data *data = &io->viport->data; + unsigned long flags; + + DATA_FUNCTION("data_received_kick()\n"); + data_note_kickrcv_time(); + spin_lock_irqsave(&data->recv_ios_lock, flags); + list_add(&io->list_ptrs, &data->recv_ios); + spin_unlock_irqrestore(&data->recv_ios_lock, flags); + data_post_recvs(data); + data_rcvkicks_stats(data); + data_check_xmit_buffers(data); + + while (!data_incoming_recv(data)); + + if (data->connected) { + data_alloc_buffers(data, 0); + data_send_free_recv_buffers(data); + } +} + +static void data_xmit_complete(struct io *io) +{ + struct rdma_io *rdma_io = (struct rdma_io *)io; + struct data *data = &io->viport->data; + struct xmit_pool *pool = &data->xmit_pool; + struct sk_buff *skb; + + DATA_FUNCTION("data_xmit_complete()\n"); + + if (rdma_io->skb) + ib_dma_unmap_single(data->parent->config->ibdev, + rdma_io->skb_data_dma, rdma_io->skb->len, + DMA_TO_DEVICE); + + while (pool->last_comp_buf != rdma_io->index) { + INC(pool->last_comp_buf, 1, pool->num_xmit_bufs); + skb = pool->xmit_bufs[pool->last_comp_buf].skb; + if (skb) + dev_kfree_skb_any(skb); + pool->xmit_bufs[pool->last_comp_buf].skb = NULL; + } + + data_check_xmit_buffers(data); +} + +static 
int mc_data_alloc_skb(struct ud_recv_io *recv_io, u32 len, + int initial_allocation) +{ + struct sk_buff *skb; + struct mc_data *mc_data = &recv_io->io.viport->mc_data; + + DATA_FUNCTION("mc_data_alloc_skb\n"); + if (initial_allocation) + skb = alloc_skb(len, GFP_KERNEL); + else + skb = alloc_skb(len, GFP_ATOMIC); + if (!skb) { + DATA_ERROR("failed to alloc MULTICAST skb\n"); + return -1; + } + skb_put(skb, len); + recv_io->skb = skb; + + recv_io->skb_data_dma = ib_dma_map_single( + recv_io->io.viport->config->ibdev, + skb->data, skb->len, + DMA_FROM_DEVICE); + + if (ib_dma_mapping_error(recv_io->io.viport->config->ibdev, + recv_io->skb_data_dma)) { + DATA_ERROR("skb data dma map error\n"); + dev_kfree_skb(skb); + return -1; + } + + recv_io->list[0].addr = recv_io->skb_data_dma; + recv_io->list[0].length = sizeof(struct ib_grh); + recv_io->list[0].lkey = mc_data->mr->lkey; + + recv_io->list[1].addr = recv_io->skb_data_dma + sizeof(struct ib_grh); + recv_io->list[1].length = len - sizeof(struct ib_grh); + recv_io->list[1].lkey = mc_data->mr->lkey; + + recv_io->io.rwr.wr_id = (u64)&recv_io->io; + recv_io->io.rwr.sg_list = recv_io->list; + recv_io->io.rwr.num_sge = 2; + recv_io->io.rwr.next = NULL; + + return 0; +} + +static int mc_data_alloc_buffers(struct mc_data *mc_data) +{ + unsigned int i, num; + struct ud_recv_io *bufs = NULL, *recv_io; + + DATA_FUNCTION("mc_data_alloc_buffers\n"); + if (!mc_data->skb_len) { + unsigned int len; + /* align multicast msg buffer on viport_trailer boundary */ + len = (MCAST_MSG_SIZE + VIPORT_TRAILER_ALIGNMENT - 1) & + (~((unsigned int)VIPORT_TRAILER_ALIGNMENT - 1)); + /* + * Add size of grh and trailer - + * note, we don't need a + 4 for vlan because we have room in + * netbuf for grh & trailer and we'll strip them both, so there + * will be room enough to handle the 4 byte insertion for vlan. 
+ */ + len += sizeof(struct ib_grh) + + sizeof(struct viport_trailer); + mc_data->skb_len = len; + DATA_INFO("mc_data->skb_len %d (sizes:%d %d)\n", + len, (int)sizeof(struct ib_grh), + (int)sizeof(struct viport_trailer)); + } + mc_data->recv_len = sizeof(struct ud_recv_io) * mc_data->num_recvs; + bufs = kmalloc(mc_data->recv_len, GFP_KERNEL); + if (!bufs) { + DATA_ERROR("failed to allocate MULTICAST buffers size:%d\n", + mc_data->recv_len); + return -1; + } + DATA_INFO("allocated num_recvs:%d recv_len:%d \n", + mc_data->num_recvs, mc_data->recv_len); + for (num = 0; num < mc_data->num_recvs; num++) { + recv_io = &bufs[num]; + recv_io->len = mc_data->skb_len; + recv_io->io.type = RECV_UD; + recv_io->io.viport = mc_data->parent; + recv_io->io.routine = mc_data_recv_routine; + + if (mc_data_alloc_skb(recv_io, mc_data->skb_len, 1)) { + for (i = 0; i < num; i++) { + recv_io = &bufs[i]; + ib_dma_unmap_single(recv_io->io.viport->config->ibdev, + recv_io->skb_data_dma, + recv_io->skb->len, + DMA_FROM_DEVICE); + dev_kfree_skb(recv_io->skb); + } + kfree(bufs); + return -1; + } + list_add_tail(&recv_io->io.list_ptrs, + &mc_data->avail_recv_ios_list); + } + mc_data->recv_ios = bufs; + return 0; +} + +void vnic_mc_data_cleanup(struct mc_data *mc_data) +{ + unsigned int num; + + DATA_FUNCTION("vnic_mc_data_cleanup()\n"); + completion_callback_cleanup(&mc_data->ib_conn); + if (!IS_ERR(mc_data->ib_conn.qp)) { + ib_destroy_qp(mc_data->ib_conn.qp); + mc_data->ib_conn.qp = (struct ib_qp *)ERR_PTR(-EINVAL); + } + if (!IS_ERR(mc_data->ib_conn.cq)) { + ib_destroy_cq(mc_data->ib_conn.cq); + mc_data->ib_conn.cq = (struct ib_cq *)ERR_PTR(-EINVAL); + } + if (mc_data->recv_ios) { + for (num = 0; num < mc_data->num_recvs; num++) { + if (mc_data->recv_ios[num].skb) + dev_kfree_skb(mc_data->recv_ios[num].skb); + mc_data->recv_ios[num].skb = NULL; + } + kfree(mc_data->recv_ios); + mc_data->recv_ios = (struct ud_recv_io *)NULL; + } + if (mc_data->mr) { + ib_dereg_mr(mc_data->mr); + mc_data->mr = (struct ib_mr *)NULL; + } + DATA_FUNCTION("vnic_mc_data_cleanup done\n"); + +} + +int mc_data_init(struct mc_data *mc_data, struct viport *viport, + struct data_config *config, struct ib_pd *pd) +{ + DATA_FUNCTION("mc_data_init()\n"); + + mc_data->num_recvs = viport->data.config->num_recvs; + + INIT_LIST_HEAD(&mc_data->avail_recv_ios_list); + spin_lock_init(&mc_data->recv_lock); + + mc_data->parent = viport; + mc_data->config = config; + + mc_data->ib_conn.cm_id = NULL; + mc_data->ib_conn.viport = viport; + mc_data->ib_conn.ib_config = &config->ib_config; + mc_data->ib_conn.state = IB_CONN_UNINITTED; + mc_data->ib_conn.callback_thread = NULL; + mc_data->ib_conn.callback_thread_end = 0; + + if (vnic_ib_mc_init(mc_data, viport, pd, + &config->ib_config)) { + DATA_ERROR("vnic_ib_mc_init failed\n"); + goto failure; + } + mc_data->mr = ib_get_dma_mr(pd, + IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE); + if (IS_ERR(mc_data->mr)) { + DATA_ERROR("failed to register memory for" + " mc_data connection\n"); + goto destroy_conn; + } + + if (mc_data_alloc_buffers(mc_data)) + goto dereg_mr; + + mc_data_post_recvs(mc_data); + if (vnic_ib_mc_mod_qp_to_rts(mc_data->ib_conn.qp)) + goto dereg_mr; + + return 0; + +dereg_mr: + ib_dereg_mr(mc_data->mr); + mc_data->mr = (struct ib_mr *)NULL; +destroy_conn: + completion_callback_cleanup(&mc_data->ib_conn); + ib_destroy_qp(mc_data->ib_conn.qp); + mc_data->ib_conn.qp = (struct ib_qp *)ERR_PTR(-EINVAL); + ib_destroy_cq(mc_data->ib_conn.cq); + mc_data->ib_conn.cq = (struct ib_cq *)ERR_PTR(-EINVAL); 
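+	/* QP and CQ handles are reset to ERR_PTR(-EINVAL) so that
+	 * vnic_mc_data_cleanup() can tell which resources remain valid.
+	 */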
+failure: + return -1; +} + +static void mc_data_post_recvs(struct mc_data *mc_data) +{ + unsigned long flags; + int i = 0; + DATA_FUNCTION("mc_data_post_recvs\n"); + spin_lock_irqsave(&mc_data->recv_lock, flags); + while (!list_empty(&mc_data->avail_recv_ios_list)) { + struct io *io = list_entry(mc_data->avail_recv_ios_list.next, + struct io, list_ptrs); + struct ud_recv_io *recv_io = + container_of(io, struct ud_recv_io, io); + list_del(&recv_io->io.list_ptrs); + spin_unlock_irqrestore(&mc_data->recv_lock, flags); + if (vnic_ib_mc_post_recv(mc_data, &recv_io->io)) { + viport_failure(mc_data->parent); + return; + } + spin_lock_irqsave(&mc_data->recv_lock, flags); + i++; + } + DATA_INFO("mcdata posted %d %p\n", i, &mc_data->avail_recv_ios_list); + spin_unlock_irqrestore(&mc_data->recv_lock, flags); +} + +static void mc_data_recv_routine(struct io *io) +{ + struct sk_buff *skb; + struct ib_grh *grh; + struct viport_trailer *trailer; + struct mc_data *mc_data; + unsigned long flags; + struct ud_recv_io *recv_io = container_of(io, struct ud_recv_io, io); + union ib_gid_cpu sgid; + + DATA_FUNCTION("mc_data_recv_routine\n"); + skb = recv_io->skb; + grh = (struct ib_grh *)skb->data; + mc_data = &recv_io->io.viport->mc_data; + + ib_dma_unmap_single(recv_io->io.viport->config->ibdev, + recv_io->skb_data_dma, recv_io->skb->len, + DMA_FROM_DEVICE); + + /* first - check if we've got our own mc packet */ + /* convert sgid from host to cpu form before comparing */ + bswap_ib_gid(&grh->sgid, &sgid); + if (cpu_to_be64(sgid.global.interface_id) == + io->viport->config->path_info.path.sgid.global.interface_id) { + DATA_ERROR("dropping - our mc packet\n"); + dev_kfree_skb(skb); + } else { + /* GRH is at head and trailer at end. Remove GRH from head. */ + trailer = (struct viport_trailer *) + (skb->data + recv_io->len - + sizeof(struct viport_trailer)); + skb_pull(skb, sizeof(struct ib_grh)); + if (trailer->connection_hash_and_valid & CHV_VALID) { + mc_data_recv_to_skbuff(io->viport, skb, trailer); + vnic_recv_packet(io->viport->vnic, io->viport->parent, + skb); + vnic_multicast_recv_pkt_stats(io->viport->vnic); + } else { + DATA_ERROR("dropping - no CHV_VALID in HashAndValid\n"); + dev_kfree_skb(skb); + } + } + recv_io->skb = NULL; + if (mc_data_alloc_skb(recv_io, mc_data->skb_len, 0)) + return; + + spin_lock_irqsave(&mc_data->recv_lock, flags); + list_add_tail(&recv_io->io.list_ptrs, &mc_data->avail_recv_ios_list); + spin_unlock_irqrestore(&mc_data->recv_lock, flags); + mc_data_post_recvs(mc_data); + return; +} + +static void mc_data_recv_to_skbuff(struct viport *viport, struct sk_buff *skb, + struct viport_trailer *trailer) +{ + u8 rx_chksum_flags = trailer->rx_chksum_flags; + + /* drop alignment bytes at start */ + skb_pull(skb, trailer->data_alignment_offset); + /* drop excess from end */ + skb_trim(skb, __be16_to_cpu(trailer->data_length)); + + if ((rx_chksum_flags & RX_CHKSUM_FLAGS_LOOPBACK) + || ((rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED) + && ((rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED) + || (rx_chksum_flags & + RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED)))) + skb->ip_summed = CHECKSUM_UNNECESSARY; + else + skb->ip_summed = CHECKSUM_NONE; + + if ((trailer->pkt_flags & PF_VLAN_INSERT) && + !(viport->features_supported & VNIC_FEAT_IGNORE_VLAN)) { + u8 *rv; + + /* insert VLAN id between source & length */ + DATA_INFO("VLAN adjustment\n"); + rv = skb_push(skb, 4); + memmove(rv, rv + 4, 12); + *(__be16 *) (rv + 12) = __constant_cpu_to_be16(ETH_P_8021Q); + if (trailer->pkt_flags 
& PF_PVID_OVERRIDDEN) + /* + * Indicates VLAN is 0 but we keep the protocol id. + */ + *(__be16 *) (rv + 14) = trailer->vlan & + __constant_cpu_to_be16(0xF000); + else + *(__be16 *) (rv + 14) = trailer->vlan; + DATA_INFO("vlan:%x\n", *(int *)(rv+14)); + } + + return; +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_data.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_data.h new file mode 100644 index 0000000..866b9ee --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_data.h @@ -0,0 +1,206 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_DATA_H_INCLUDED +#define VNIC_DATA_H_INCLUDED + +#include + +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS +#include +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ + +#include "vnic_ib.h" +#include "vnic_control_pkt.h" +#include "vnic_trailer.h" + +struct rdma_dest { + struct list_head list_ptrs; + struct sk_buff *skb; + u8 *data; + struct viport_trailer *trailer __attribute__((aligned(32))); +}; + +struct buff_pool_entry { + __be64 remote_addr; + __be32 rkey; + u32 valid; +}; + +struct recv_pool { + u32 buffer_sz; + u32 pool_sz; + u32 eioc_pool_sz; + u32 eioc_rdma_rkey; + u64 eioc_rdma_addr; + u32 next_full_buf; + u32 next_free_buf; + u32 num_free_bufs; + u32 num_posted_bufs; + u32 sz_free_bundle; + int kick_on_free; + struct buff_pool_entry *buf_pool; + dma_addr_t buf_pool_dma; + int buf_pool_len; + struct rdma_dest *recv_bufs; + struct list_head avail_recv_bufs; +}; + +struct xmit_pool { + u32 buffer_sz; + u32 pool_sz; + u32 notify_count; + u32 notify_bundle; + u32 next_xmit_buf; + u32 last_comp_buf; + u32 num_xmit_bufs; + u32 next_xmit_pool; + u32 kick_count; + u32 kick_byte_count; + u32 kick_bundle; + u32 kick_byte_bundle; + int need_buffers; + int send_kicks; + uint32_t rdma_rkey; + u64 rdma_addr; + struct buff_pool_entry *buf_pool; + dma_addr_t buf_pool_dma; + int buf_pool_len; + struct rdma_io *xmit_bufs; + u8 *xmit_data; + dma_addr_t xmitdata_dma; + int xmitdata_len; +}; + +struct data { + struct viport *parent; + struct data_config *config; + struct ib_mr *mr; + struct vnic_ib_conn ib_conn; + u8 *local_storage; + struct vnic_recv_pool_config host_pool_parms; + struct vnic_recv_pool_config eioc_pool_parms; + struct recv_pool recv_pool; + struct xmit_pool xmit_pool; + u8 *region_data; + dma_addr_t region_data_dma; + struct rdma_io free_bufs_io; + struct send_io kick_io; + struct list_head recv_ios; + spinlock_t recv_ios_lock; + spinlock_t xmit_buf_lock; + int kick_timer_on; + int connected; + u16 max_mtu; + struct timer_list kick_timer; + struct completion done; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + struct { + u32 xmit_num; + u32 recv_num; + u32 free_buf_sends; + u32 free_buf_num; + u32 free_buf_min; + u32 kick_recvs; + u32 kick_reqs; + u32 no_xmit_bufs; + cycles_t no_xmit_buf_time; + } statistics; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ +}; + +struct mc_data { + struct viport *parent; + struct data_config *config; + struct ib_mr *mr; + struct vnic_ib_conn ib_conn; + + u32 num_recvs; + u32 skb_len; + spinlock_t recv_lock; + int recv_len; + struct ud_recv_io *recv_ios; + struct list_head avail_recv_ios_list; +}; + +int data_init(struct data *data, struct viport *viport, + struct data_config *config, struct ib_pd *pd); + +int data_connect(struct data *data); +void data_connected(struct data *data); +void data_disconnect(struct data *data); + +int data_xmit_packet(struct data *data, struct sk_buff *skb); + +void data_cleanup(struct data *data); + +#define data_is_connected(data) \ + (vnic_ib_conn_connected(&((data)->ib_conn))) +#define data_path_id(data) (data)->config->path_id +#define data_eioc_pool(data) &(data)->eioc_pool_parms +#define data_host_pool(data) &(data)->host_pool_parms +#define data_eioc_pool_min(data) &(data)->config->eioc_min +#define data_host_pool_min(data) &(data)->config->host_min +#define data_eioc_pool_max(data) &(data)->config->eioc_max +#define data_host_pool_max(data) &(data)->config->host_max +#define data_local_pool_addr(data) (data)->xmit_pool.rdma_addr +#define data_local_pool_rkey(data) (data)->xmit_pool.rdma_rkey 
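+
+/* The remote pool address/rkey identify the EIOC receive pool that
+ * data_send_free_recv_buffers() updates with RDMA writes.
+ */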
+#define data_remote_pool_addr(data) &(data)->recv_pool.eioc_rdma_addr +#define data_remote_pool_rkey(data) &(data)->recv_pool.eioc_rdma_rkey + +#define data_max_mtu(data) (data)->max_mtu + + +#define data_len(data, trailer) be16_to_cpu(trailer->data_length) +#define data_offset(data, trailer) \ + ((data)->recv_pool.buffer_sz - sizeof(struct viport_trailer) \ + - ALIGN(data_len((data), (trailer)), VIPORT_TRAILER_ALIGNMENT) \ + + (trailer->data_alignment_offset)) + +/* the following macros manipulate ring buffer indexes. + * the ring buffer size must be a power of 2. + */ +#define ADD(index, increment, size) (((index) + (increment))&((size) - 1)) +#define NEXT(index, size) ADD(index, 1, size) +#define INC(index, increment, size) (index) = ADD(index, increment, size) + +/* this is max multicast msg embedded will send */ +#define MCAST_MSG_SIZE \ + (2048 - sizeof(struct ib_grh) - sizeof(struct viport_trailer)) + +int mc_data_init(struct mc_data *mc_data, struct viport *viport, + struct data_config *config, + struct ib_pd *pd); + +void vnic_mc_data_cleanup(struct mc_data *mc_data); + +#endif /* VNIC_DATA_H_INCLUDED */ diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h new file mode 100644 index 0000000..dd8a073 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h @@ -0,0 +1,103 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */
+
+#ifndef VNIC_TRAILER_H_INCLUDED
+#define VNIC_TRAILER_H_INCLUDED
+
+/* pkt_flags values */
+enum {
+	PF_CHASH_VALID = 0x01,
+	PF_IPSEC_VALID = 0x02,
+	PF_TCP_SEGMENT = 0x04,
+	PF_KICK = 0x08,
+	PF_VLAN_INSERT = 0x10,
+	PF_PVID_OVERRIDDEN = 0x20,
+	PF_FCS_INCLUDED = 0x40,
+	PF_FORCE_ROUTE = 0x80
+};
+
+/* tx_chksum_flags values */
+enum {
+	TX_CHKSUM_FLAGS_CHECKSUM_V4 = 0x01,
+	TX_CHKSUM_FLAGS_CHECKSUM_V6 = 0x02,
+	TX_CHKSUM_FLAGS_TCP_CHECKSUM = 0x04,
+	TX_CHKSUM_FLAGS_UDP_CHECKSUM = 0x08,
+	TX_CHKSUM_FLAGS_IP_CHECKSUM = 0x10
+};
+
+/* rx_chksum_flags values */
+enum {
+	RX_CHKSUM_FLAGS_TCP_CHECKSUM_FAILED = 0x01,
+	RX_CHKSUM_FLAGS_UDP_CHECKSUM_FAILED = 0x02,
+	RX_CHKSUM_FLAGS_IP_CHECKSUM_FAILED = 0x04,
+	RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED = 0x08,
+	RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED = 0x10,
+	RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED = 0x20,
+	RX_CHKSUM_FLAGS_LOOPBACK = 0x40,
+	RX_CHKSUM_FLAGS_RESERVED = 0x80
+};
+
+/* connection_hash_and_valid values */
+enum {
+	CHV_VALID = 0x80,
+	CHV_HASH_MASH = 0x7f
+};
+
+struct viport_trailer {
+	s8 data_alignment_offset;
+	u8 rndis_header_length;	/* reserved for use by edp */
+	__be16 data_length;
+	u8 pkt_flags;
+	u8 tx_chksum_flags;
+	u8 rx_chksum_flags;
+	u8 ip_sec_flags;
+	u32 tcp_seq_no;
+	u32 ip_sec_offload_handle;
+	u32 ip_sec_next_offload_handle;
+	u8 dest_mac_addr[6];
+	__be16 vlan;
+	u16 time_stamp;
+	u8 origin;
+	u8 connection_hash_and_valid;
+};
+
+#define VIPORT_TRAILER_ALIGNMENT 32
+
+#define BUFFER_SIZE(len) \
+	(sizeof(struct viport_trailer) + \
+	 ALIGN((len), VIPORT_TRAILER_ALIGNMENT))
+
+#define MAX_PAYLOAD(len) \
+	ALIGN_DOWN((len) - sizeof(struct viport_trailer), \
+		   VIPORT_TRAILER_ALIGNMENT)
+
+#endif /* VNIC_TRAILER_H_INCLUDED */

From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:56:54 2008
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Thu, 29 May 2008 15:26:54 +0530
Subject: [ofa-general] [PATCH v3 06/13] QLogic VNIC: IB core stack interaction
In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain>
References: <20080529095126.9943.84692.stgit@localhost.localdomain>
Message-ID: <20080529095654.9943.72719.stgit@localhost.localdomain>

From: Ramachandra K

This patch implements the interaction of the QLogic VNIC driver with the
underlying core InfiniBand stack.

Signed-off-by: Ramachandra K
Signed-off-by: Poornima Kamath
Signed-off-by: Amar Mudrankit
---

 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c | 1043 ++++++++++++++++++++++++++++
 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h |  206 ++++++
 2 files changed, 1249 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h

diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c
new file mode 100644
index 0000000..c43e69e
--- /dev/null
+++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c
@@ -0,0 +1,1043 @@
+/*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_data.h" +#include "vnic_config.h" +#include "vnic_ib.h" +#include "vnic_viport.h" +#include "vnic_sys.h" +#include "vnic_main.h" +#include "vnic_stats.h" + +static int vnic_ib_inited; +static void vnic_add_one(struct ib_device *device); +static void vnic_remove_one(struct ib_device *device); +static int vnic_defer_completion(void *ptr); + +static int vnic_ib_mc_init_qp(struct mc_data *mc_data, + struct vnic_ib_config *config, + struct ib_pd *pd, + struct viport_config *viport_config); + +static struct ib_client vnic_client = { + .name = "vnic", + .add = vnic_add_one, + .remove = vnic_remove_one +}; + +struct ib_sa_client vnic_sa_client; + +int vnic_ib_init(void) +{ + int ret = -1; + + IB_FUNCTION("vnic_ib_init()\n"); + + /* class has to be registered before + * calling ib_register_client() because, that call + * will trigger vnic_add_port() which will register + * class_device for the port with the parent class + * as vnic_class + */ + ret = class_register(&vnic_class); + if (ret) { + printk(KERN_ERR PFX "couldn't register class" + " infiniband_qlgc_vnic; error %d", ret); + goto out; + } + + ib_sa_register_client(&vnic_sa_client); + ret = ib_register_client(&vnic_client); + if (ret) { + printk(KERN_ERR PFX "couldn't register IB client;" + " error %d", ret); + goto err_ib_reg; + } + + interface_dev.dev.class = &vnic_class; + interface_dev.dev.release = vnic_release_dev; + snprintf(interface_dev.dev.bus_id, + BUS_ID_SIZE, "interfaces"); + init_completion(&interface_dev.released); + ret = device_register(&interface_dev.dev); + if (ret) { + printk(KERN_ERR PFX "couldn't register class interfaces;" + " error %d", ret); + goto err_class_dev; + } + ret = device_create_file(&interface_dev.dev, + &dev_attr_delete_vnic); + if (ret) { + printk(KERN_ERR PFX "couldn't create class file" + " 'delete_vnic'; error %d", ret); + goto err_class_file; + } + + vnic_ib_inited = 1; + + return ret; +err_class_file: + device_unregister(&interface_dev.dev); +err_class_dev: + ib_unregister_client(&vnic_client); +err_ib_reg: + ib_sa_unregister_client(&vnic_sa_client); + class_unregister(&vnic_class); +out: + return ret; +} + +static struct vnic_ib_port *vnic_add_port(struct vnic_ib_device *device, + u8 
port_num) +{ + struct vnic_ib_port *port; + + port = kzalloc(sizeof *port, GFP_KERNEL); + if (!port) + return NULL; + + init_completion(&port->pdev_info.released); + port->dev = device; + port->port_num = port_num; + + port->pdev_info.dev.class = &vnic_class; + port->pdev_info.dev.parent = NULL; + port->pdev_info.dev.release = vnic_release_dev; + snprintf(port->pdev_info.dev.bus_id, BUS_ID_SIZE, + "vnic-%s-%d", device->dev->name, port_num); + + if (device_register(&port->pdev_info.dev)) + goto free_port; + + if (device_create_file(&port->pdev_info.dev, + &dev_attr_create_primary)) + goto err_class; + if (device_create_file(&port->pdev_info.dev, + &dev_attr_create_secondary)) + goto err_class; + + return port; +err_class: + device_unregister(&port->pdev_info.dev); +free_port: + kfree(port); + + return NULL; +} + +static void vnic_add_one(struct ib_device *device) +{ + struct vnic_ib_device *vnic_dev; + struct vnic_ib_port *port; + int s, e, p; + + vnic_dev = kmalloc(sizeof *vnic_dev, GFP_KERNEL); + if (!vnic_dev) + return; + + vnic_dev->dev = device; + INIT_LIST_HEAD(&vnic_dev->port_list); + + if (device->node_type == RDMA_NODE_IB_SWITCH) { + s = 0; + e = 0; + + } else { + s = 1; + e = device->phys_port_cnt; + + } + + for (p = s; p <= e; p++) { + port = vnic_add_port(vnic_dev, p); + if (port) + list_add_tail(&port->list, &vnic_dev->port_list); + } + + ib_set_client_data(device, &vnic_client, vnic_dev); + +} + +static void vnic_remove_one(struct ib_device *device) +{ + struct vnic_ib_device *vnic_dev; + struct vnic_ib_port *port, *tmp_port; + + vnic_dev = ib_get_client_data(device, &vnic_client); + list_for_each_entry_safe(port, tmp_port, + &vnic_dev->port_list, list) { + device_unregister(&port->pdev_info.dev); + /* + * wait for sysfs entries to go away, so that no new vnics + * are created + */ + wait_for_completion(&port->pdev_info.released); + kfree(port); + + } + kfree(vnic_dev); + + /* TODO Only those vnic interfaces associated with + * the HCA whose remove event is called should be freed + * Currently all the vnic interfaces are freed + */ + + while (!list_empty(&vnic_list)) { + struct vnic *vnic = + list_entry(vnic_list.next, struct vnic, list_ptrs); + vnic_free(vnic); + } + + vnic_npevent_cleanup(); + viport_cleanup(); + +} + +void vnic_ib_cleanup(void) +{ + IB_FUNCTION("vnic_ib_cleanup()\n"); + + if (!vnic_ib_inited) + return; + + device_unregister(&interface_dev.dev); + wait_for_completion(&interface_dev.released); + + ib_unregister_client(&vnic_client); + ib_sa_unregister_client(&vnic_sa_client); + class_unregister(&vnic_class); +} + +static void vnic_path_rec_completion(int status, + struct ib_sa_path_rec *pathrec, + void *context) +{ + struct vnic_ib_path_info *p = context; + p->status = status; + if (!status) + p->path = *pathrec; + + complete(&p->done); +} + +int vnic_ib_get_path(struct netpath *netpath, struct vnic *vnic) +{ + struct viport_config *config = netpath->viport->config; + int ret = 0; + + init_completion(&config->path_info.done); + IB_INFO("Using SA path rec get time out value of %d\n", + config->sa_path_rec_get_timeout); + config->path_info.path_query_id = + ib_sa_path_rec_get(&vnic_sa_client, + config->ibdev, + config->port, + &config->path_info.path, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + config->sa_path_rec_get_timeout, + GFP_KERNEL, + vnic_path_rec_completion, + &config->path_info, + &config->path_info.path_query); + + if (config->path_info.path_query_id < 0) { + IB_ERROR("SA path record query 
failed; error %d\n", + config->path_info.path_query_id); + ret = config->path_info.path_query_id; + goto out; + } + + wait_for_completion(&config->path_info.done); + + if (config->path_info.status < 0) { + printk(KERN_WARNING PFX "connection not available to dgid " + "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x", + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[0]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[2]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[4]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[6]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[8]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[10]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[12]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[14])); + + if (config->path_info.status == -ETIMEDOUT) + printk(KERN_INFO " path query timed out\n"); + else if (config->path_info.status == -EIO) + printk(KERN_INFO " path query sending error\n"); + else + printk(KERN_INFO " error %d\n", + config->path_info.status); + + ret = config->path_info.status; + } +out: + if (ret) + netpath_timer(netpath, vnic->config->no_path_timeout); + + return ret; +} + +static inline void vnic_ib_handle_completions(struct ib_wc *wc, + struct vnic_ib_conn *ib_conn, + u32 *comp_num, + cycles_t *comp_time) +{ + struct io *io; + + io = (struct io *)(wc->wr_id); + vnic_ib_comp_stats(ib_conn, comp_num); + if (wc->status) { + IB_INFO("completion error wc.status %d" + " wc.opcode %d vendor err 0x%x\n", + wc->status, wc->opcode, wc->vendor_err); + } else if (io) { + vnic_ib_io_stats(io, ib_conn, *comp_time); + if (io->type == RECV_UD) { + struct ud_recv_io *recv_io = + container_of(io, struct ud_recv_io, io); + recv_io->len = wc->byte_len; + } + if (io->routine) + (*io->routine) (io); + } +} + +static void ib_qp_event(struct ib_event *event, void *context) +{ + IB_ERROR("QP event %d\n", event->event); +} + +static void vnic_ib_completion(struct ib_cq *cq, void *ptr) +{ + struct vnic_ib_conn *ib_conn = ptr; + unsigned long flags; + int compl_received; + struct ib_wc wc; + cycles_t comp_time; + u32 comp_num = 0; + + /* for multicast, cm_id is NULL, so skip that test */ + if (ib_conn->cm_id && + (ib_conn->state != IB_CONN_CONNECTED)) + return; + + /* Check if completion processing is taking place in thread + * If not then process completions in this handler, + * else set compl_received if not set, to indicate that + * there are more completions to process in thread. 
+ */ + + spin_lock_irqsave(&ib_conn->compl_received_lock, flags); + compl_received = ib_conn->compl_received; + spin_unlock_irqrestore(&ib_conn->compl_received_lock, flags); + + if (ib_conn->in_thread || compl_received) { + if (!compl_received) { + spin_lock_irqsave(&ib_conn->compl_received_lock, flags); + ib_conn->compl_received = 1; + spin_unlock_irqrestore(&ib_conn->compl_received_lock, + flags); + } + wake_up(&(ib_conn->callback_wait_queue)); + } else { + vnic_ib_note_comptime_stats(&comp_time); + vnic_ib_callback_stats(ib_conn); + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + while (ib_poll_cq(cq, 1, &wc) > 0) { + vnic_ib_handle_completions(&wc, ib_conn, &comp_num, + &comp_time); + if (ib_conn->cm_id && + ib_conn->state != IB_CONN_CONNECTED) + break; + + /* If we get more completions than the completion limit + * defer completion to the thread + */ + if ((!ib_conn->in_thread) && + (comp_num >= ib_conn->ib_config->completion_limit)) { + ib_conn->in_thread = 1; + spin_lock_irqsave( + &ib_conn->compl_received_lock, flags); + ib_conn->compl_received = 1; + spin_unlock_irqrestore( + &ib_conn->compl_received_lock, flags); + wake_up(&(ib_conn->callback_wait_queue)); + break; + } + + } + vnic_ib_maxio_stats(ib_conn, comp_num); + } +} + +static int vnic_ib_mod_qp_to_rts(struct ib_cm_id *cm_id, + struct vnic_ib_conn *ib_conn) +{ + int attr_mask = 0; + int ret; + struct ib_qp_attr *qp_attr = NULL; + + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + return -ENOMEM; + + qp_attr->qp_state = IB_QPS_RTR; + + ret = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask); + if (ret) + goto out; + + ret = ib_modify_qp(ib_conn->qp, qp_attr, attr_mask); + if (ret) + goto out; + + IB_INFO("QP RTR\n"); + + qp_attr->qp_state = IB_QPS_RTS; + + ret = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask); + if (ret) + goto out; + + ret = ib_modify_qp(ib_conn->qp, qp_attr, attr_mask); + if (ret) + goto out; + + IB_INFO("QP RTS\n"); + + ret = ib_send_cm_rtu(cm_id, NULL, 0); + if (ret) + goto out; +out: + kfree(qp_attr); + return ret; +} + +int vnic_ib_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct vnic_ib_conn *ib_conn = cm_id->context; + struct viport *viport = ib_conn->viport; + int err = 0; + + switch (event->event) { + case IB_CM_REQ_ERROR: + IB_ERROR("sending CM REQ failed\n"); + err = 1; + viport->retry = 1; + break; + case IB_CM_REP_RECEIVED: + IB_INFO("CM REP recvd\n"); + if (vnic_ib_mod_qp_to_rts(cm_id, ib_conn)) + err = 1; + else { + ib_conn->state = IB_CONN_CONNECTED; + vnic_ib_connected_time_stats(ib_conn); + IB_INFO("RTU SENT\n"); + } + break; + case IB_CM_REJ_RECEIVED: + printk(KERN_ERR PFX " CM rejected control connection\n"); + if (event->param.rej_rcvd.reason == + IB_CM_REJ_INVALID_SERVICE_ID) + printk(KERN_ERR "reason: invalid service ID. 
" + "IOCGUID value specified may be incorrect\n"); + else + printk(KERN_ERR "reason code : 0x%x\n", + event->param.rej_rcvd.reason); + + err = 1; + viport->retry = 1; + break; + case IB_CM_MRA_RECEIVED: + IB_INFO("CM MRA received\n"); + break; + + case IB_CM_DREP_RECEIVED: + IB_INFO("CM DREP recvd\n"); + ib_conn->state = IB_CONN_DISCONNECTED; + break; + + case IB_CM_TIMEWAIT_EXIT: + IB_ERROR("CM timewait exit\n"); + err = 1; + break; + + default: + IB_INFO("unhandled CM event %d\n", event->event); + break; + + } + + if (err) { + ib_conn->state = IB_CONN_DISCONNECTED; + viport_failure(viport); + } + + viport_kick(viport); + return 0; +} + + +int vnic_ib_cm_connect(struct vnic_ib_conn *ib_conn) +{ + struct ib_cm_req_param *req = NULL; + struct viport *viport; + int ret = -1; + + if (!vnic_ib_conn_initted(ib_conn)) { + IB_ERROR("IB Connection out of state for CM connect (%d)\n", + ib_conn->state); + return -EINVAL; + } + + vnic_ib_conntime_stats(ib_conn); + req = kzalloc(sizeof *req, GFP_KERNEL); + if (!req) + return -ENOMEM; + + viport = ib_conn->viport; + + req->primary_path = &viport->config->path_info.path; + req->alternate_path = NULL; + req->qp_num = ib_conn->qp->qp_num; + req->qp_type = ib_conn->qp->qp_type; + req->service_id = ib_conn->ib_config->service_id; + req->private_data = &ib_conn->ib_config->conn_data; + req->private_data_len = sizeof(struct vnic_connection_data); + req->flow_control = 1; + + get_random_bytes(&req->starting_psn, 4); + req->starting_psn &= 0xffffff; + + /* + * Both responder_resources and initiator_depth are set to zero + * as we do not need RDMA read. + * + * They also must be set to zero, otherwise data connections + * are rejected by VEx. + */ + req->responder_resources = 0; + req->initiator_depth = 0; + req->remote_cm_response_timeout = 20; + req->local_cm_response_timeout = 20; + req->retry_count = ib_conn->ib_config->retry_count; + req->rnr_retry_count = ib_conn->ib_config->rnr_retry_count; + req->max_cm_retries = 15; + + ib_conn->state = IB_CONN_CONNECTING; + + ret = ib_send_cm_req(ib_conn->cm_id, req); + + kfree(req); + + if (ret) { + IB_ERROR("CM REQ sending failed; error %d \n", ret); + ib_conn->state = IB_CONN_DISCONNECTED; + } + + return ret; +} + +static int vnic_ib_init_qp(struct vnic_ib_conn *ib_conn, + struct vnic_ib_config *config, + struct ib_pd *pd, + struct viport_config *viport_config) +{ + struct ib_qp_init_attr *init_attr; + struct ib_qp_attr *attr; + int ret; + + init_attr = kzalloc(sizeof *init_attr, GFP_KERNEL); + if (!init_attr) + return -ENOMEM; + + init_attr->event_handler = ib_qp_event; + init_attr->cap.max_send_wr = config->num_sends; + init_attr->cap.max_recv_wr = config->num_recvs; + init_attr->cap.max_recv_sge = config->recv_scatter; + init_attr->cap.max_send_sge = config->send_gather; + init_attr->sq_sig_type = IB_SIGNAL_ALL_WR; + init_attr->qp_type = IB_QPT_RC; + init_attr->send_cq = ib_conn->cq; + init_attr->recv_cq = ib_conn->cq; + + ib_conn->qp = ib_create_qp(pd, init_attr); + + if (IS_ERR(ib_conn->qp)) { + ret = -1; + IB_ERROR("could not create QP\n"); + goto free_init_attr; + } + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) { + ret = -ENOMEM; + goto destroy_qp; + } + + ret = ib_find_pkey(viport_config->ibdev, viport_config->port, + be16_to_cpu(viport_config->path_info.path.pkey), + &attr->pkey_index); + if (ret) { + printk(KERN_WARNING PFX "ib_find_pkey() failed; " + "error %d\n", ret); + goto freeattr; + } + + attr->qp_state = IB_QPS_INIT; + attr->qp_access_flags = IB_ACCESS_REMOTE_WRITE; + attr->port_num = 
viport_config->port; + + ret = ib_modify_qp(ib_conn->qp, attr, + IB_QP_STATE | + IB_QP_PKEY_INDEX | + IB_QP_ACCESS_FLAGS | IB_QP_PORT); + if (ret) { + printk(KERN_WARNING PFX "could not modify QP; error %d \n", + ret); + goto freeattr; + } + + kfree(attr); + kfree(init_attr); + return ret; + +freeattr: + kfree(attr); +destroy_qp: + ib_destroy_qp(ib_conn->qp); +free_init_attr: + kfree(init_attr); + return ret; +} + +int vnic_ib_conn_init(struct vnic_ib_conn *ib_conn, struct viport *viport, + struct ib_pd *pd, struct vnic_ib_config *config) +{ + struct viport_config *viport_config = viport->config; + int ret = -1; + unsigned int cq_size = config->num_sends + config->num_recvs; + + + if (!vnic_ib_conn_uninitted(ib_conn)) { + IB_ERROR("IB Connection out of state for init (%d)\n", + ib_conn->state); + return -EINVAL; + } + + ib_conn->cq = ib_create_cq(viport_config->ibdev, vnic_ib_completion, +#ifdef BUILD_FOR_OFED_1_2 + NULL, ib_conn, cq_size); +#else + NULL, ib_conn, cq_size, 0); +#endif + if (IS_ERR(ib_conn->cq)) { + IB_ERROR("could not create CQ\n"); + goto out; + } + + IB_INFO("cq created %p %d\n", ib_conn->cq, cq_size); + ib_req_notify_cq(ib_conn->cq, IB_CQ_NEXT_COMP); + init_waitqueue_head(&(ib_conn->callback_wait_queue)); + init_completion(&(ib_conn->callback_thread_exit)); + + spin_lock_init(&ib_conn->compl_received_lock); + + ib_conn->callback_thread = kthread_run(vnic_defer_completion, ib_conn, + "qlgc_vnic_def_compl"); + if (IS_ERR(ib_conn->callback_thread)) { + IB_ERROR("Could not create vnic_callback_thread;" + " error %d\n", (int) PTR_ERR(ib_conn->callback_thread)); + ib_conn->callback_thread = NULL; + goto destroy_cq; + } + + ret = vnic_ib_init_qp(ib_conn, config, pd, viport_config); + + if (ret) + goto destroy_thread; + + spin_lock_init(&ib_conn->conn_lock); + ib_conn->state = IB_CONN_INITTED; + + return ret; + +destroy_thread: + completion_callback_cleanup(ib_conn); +destroy_cq: + ib_destroy_cq(ib_conn->cq); +out: + return ret; +} + +int vnic_ib_post_recv(struct vnic_ib_conn *ib_conn, struct io *io) +{ + cycles_t post_time; + struct ib_recv_wr *bad_wr; + int ret = -1; + unsigned long flags; + + IB_FUNCTION("vnic_ib_post_recv()\n"); + + spin_lock_irqsave(&ib_conn->conn_lock, flags); + + if (!vnic_ib_conn_initted(ib_conn) && + !vnic_ib_conn_connected(ib_conn)) { + ret = -EINVAL; + goto out; + } + + vnic_ib_pre_rcvpost_stats(ib_conn, io, &post_time); + io->type = RECV; + ret = ib_post_recv(ib_conn->qp, &io->rwr, &bad_wr); + if (ret) { + IB_ERROR("error in posting rcv wr; error %d\n", ret); + ib_conn->state = IB_CONN_ERRORED; + goto out; + } + + vnic_ib_post_rcvpost_stats(ib_conn, post_time); +out: + spin_unlock_irqrestore(&ib_conn->conn_lock, flags); + return ret; + +} + +int vnic_ib_post_send(struct vnic_ib_conn *ib_conn, struct io *io) +{ + cycles_t post_time; + unsigned long flags; + struct ib_send_wr *bad_wr; + int ret = -1; + + IB_FUNCTION("vnic_ib_post_send()\n"); + + spin_lock_irqsave(&ib_conn->conn_lock, flags); + if (!vnic_ib_conn_connected(ib_conn)) { + IB_ERROR("IB Connection out of state for" + " posting sends (%d)\n", ib_conn->state); + goto out; + } + + vnic_ib_pre_sendpost_stats(io, &post_time); + if (io->swr.opcode == IB_WR_RDMA_WRITE) + io->type = RDMA; + else + io->type = SEND; + + ret = ib_post_send(ib_conn->qp, &io->swr, &bad_wr); + if (ret) { + IB_ERROR("error in posting send wr; error %d\n", ret); + ib_conn->state = IB_CONN_ERRORED; + goto out; + } + + vnic_ib_post_sendpost_stats(ib_conn, io, post_time); +out: + spin_unlock_irqrestore(&ib_conn->conn_lock, 
flags); + return ret; +} + +static int vnic_defer_completion(void *ptr) +{ + struct vnic_ib_conn *ib_conn = ptr; + struct ib_wc wc; + struct ib_cq *cq = ib_conn->cq; + cycles_t comp_time; + u32 comp_num = 0; + unsigned long flags; + + while (!ib_conn->callback_thread_end) { + wait_event_interruptible(ib_conn->callback_wait_queue, + ib_conn->compl_received || + ib_conn->callback_thread_end); + ib_conn->in_thread = 1; + spin_lock_irqsave(&ib_conn->compl_received_lock, flags); + ib_conn->compl_received = 0; + spin_unlock_irqrestore(&ib_conn->compl_received_lock, flags); + if (ib_conn->cm_id && + ib_conn->state != IB_CONN_CONNECTED) + goto out_thread; + + vnic_ib_note_comptime_stats(&comp_time); + vnic_ib_callback_stats(ib_conn); + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + while (ib_poll_cq(cq, 1, &wc) > 0) { + vnic_ib_handle_completions(&wc, ib_conn, &comp_num, + &comp_time); + if (ib_conn->cm_id && + ib_conn->state != IB_CONN_CONNECTED) + break; + } + vnic_ib_maxio_stats(ib_conn, comp_num); +out_thread: + ib_conn->in_thread = 0; + } + complete_and_exit(&(ib_conn->callback_thread_exit), 0); + return 0; +} + +void completion_callback_cleanup(struct vnic_ib_conn *ib_conn) +{ + if (ib_conn->callback_thread) { + ib_conn->callback_thread_end = 1; + wake_up(&(ib_conn->callback_wait_queue)); + wait_for_completion(&(ib_conn->callback_thread_exit)); + ib_conn->callback_thread = NULL; + } +} + +int vnic_ib_mc_init(struct mc_data *mc_data, struct viport *viport, + struct ib_pd *pd, struct vnic_ib_config *config) +{ + struct viport_config *viport_config = viport->config; + int ret = -1; + unsigned int cq_size = config->num_recvs; /* recvs only */ + + IB_FUNCTION("vnic_ib_mc_init\n"); + + mc_data->ib_conn.cq = ib_create_cq(viport_config->ibdev, vnic_ib_completion, +#ifdef BUILD_FOR_OFED_1_2 + NULL, &mc_data->ib_conn, cq_size); +#else + NULL, &mc_data->ib_conn, cq_size, 0); +#endif + if (IS_ERR(mc_data->ib_conn.cq)) { + IB_ERROR("ib_create_cq failed\n"); + goto out; + } + IB_INFO("mc cq created %p %d\n", mc_data->ib_conn.cq, cq_size); + + ret = ib_req_notify_cq(mc_data->ib_conn.cq, IB_CQ_NEXT_COMP); + if (ret) { + IB_ERROR("ib_req_notify_cq failed %x \n", ret); + goto destroy_cq; + } + + init_waitqueue_head(&(mc_data->ib_conn.callback_wait_queue)); + init_completion(&(mc_data->ib_conn.callback_thread_exit)); + + spin_lock_init(&mc_data->ib_conn.compl_received_lock); + mc_data->ib_conn.callback_thread = kthread_run(vnic_defer_completion, + &mc_data->ib_conn, + "qlgc_vnic_mc_def_compl"); + if (IS_ERR(mc_data->ib_conn.callback_thread)) { + IB_ERROR("Could not create vnic_callback_thread for MULTICAST;" + " error %d\n", + (int) PTR_ERR(mc_data->ib_conn.callback_thread)); + mc_data->ib_conn.callback_thread = NULL; + goto destroy_cq; + } + IB_INFO("callback_thread created\n"); + + ret = vnic_ib_mc_init_qp(mc_data, config, pd, viport_config); + if (ret) + goto destroy_thread; + + spin_lock_init(&mc_data->ib_conn.conn_lock); + mc_data->ib_conn.state = IB_CONN_INITTED; /* stays in this state */ + + return ret; + +destroy_thread: + completion_callback_cleanup(&mc_data->ib_conn); +destroy_cq: + ib_destroy_cq(mc_data->ib_conn.cq); + mc_data->ib_conn.cq = (struct ib_cq *)ERR_PTR(-EINVAL); +out: + return ret; +} + +static int vnic_ib_mc_init_qp(struct mc_data *mc_data, + struct vnic_ib_config *config, + struct ib_pd *pd, + struct viport_config *viport_config) +{ + struct ib_qp_init_attr *init_attr; + struct ib_qp_attr *qp_attr; + int ret; + + IB_FUNCTION("vnic_ib_mc_init_qp\n"); + + if (!mc_data->ib_conn.cq) { + 
IB_ERROR("cq is null\n"); + return -ENOMEM; + } + + init_attr = kzalloc(sizeof *init_attr, GFP_KERNEL); + if (!init_attr) { + IB_ERROR("failed to alloc init_attr\n"); + return -ENOMEM; + } + + init_attr->cap.max_recv_wr = config->num_recvs; + init_attr->cap.max_send_wr = 1; + init_attr->cap.max_recv_sge = 2; + init_attr->cap.max_send_sge = 1; + + /* Completion for all work requests. */ + init_attr->sq_sig_type = IB_SIGNAL_ALL_WR; + + init_attr->qp_type = IB_QPT_UD; + + init_attr->send_cq = mc_data->ib_conn.cq; + init_attr->recv_cq = mc_data->ib_conn.cq; + + IB_INFO("creating qp %d \n", config->num_recvs); + + mc_data->ib_conn.qp = ib_create_qp(pd, init_attr); + + if (IS_ERR(mc_data->ib_conn.qp)) { + ret = -1; + IB_ERROR("could not create QP\n"); + goto free_init_attr; + } + + qp_attr = kzalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) { + ret = -ENOMEM; + goto destroy_qp; + } + + qp_attr->qp_state = IB_QPS_INIT; + qp_attr->port_num = viport_config->port; + qp_attr->qkey = IOC_NUMBER(be64_to_cpu(viport_config->ioc_guid)); + qp_attr->pkey_index = 0; + /* cannot set access flags for UD qp + qp_attr->qp_access_flags = IB_ACCESS_REMOTE_WRITE; */ + + IB_INFO("port_num:%d qkey:%d pkey:%d\n", qp_attr->port_num, + qp_attr->qkey, qp_attr->pkey_index); + ret = ib_modify_qp(mc_data->ib_conn.qp, qp_attr, + IB_QP_STATE | + IB_QP_PKEY_INDEX | + IB_QP_QKEY | + + /* cannot set this for UD + IB_QP_ACCESS_FLAGS | */ + + IB_QP_PORT); + if (ret) { + IB_ERROR("ib_modify_qp to INIT failed %d \n", ret); + goto free_qp_attr; + } + + kfree(qp_attr); + kfree(init_attr); + return ret; + +free_qp_attr: + kfree(qp_attr); +destroy_qp: + ib_destroy_qp(mc_data->ib_conn.qp); + mc_data->ib_conn.qp = ERR_PTR(-EINVAL); +free_init_attr: + kfree(init_attr); + return ret; +} + +int vnic_ib_mc_mod_qp_to_rts(struct ib_qp *qp) +{ + int ret; + struct ib_qp_attr *qp_attr = NULL; + + IB_FUNCTION("vnic_ib_mc_mod_qp_to_rts\n"); + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + return -ENOMEM; + + memset(qp_attr, 0, sizeof *qp_attr); + qp_attr->qp_state = IB_QPS_RTR; + + ret = ib_modify_qp(qp, qp_attr, IB_QP_STATE); + if (ret) { + IB_ERROR("ib_modify_qp to RTR failed %d\n", ret); + goto out; + } + IB_INFO("MC QP RTR\n"); + + memset(qp_attr, 0, sizeof *qp_attr); + qp_attr->qp_state = IB_QPS_RTS; + qp_attr->sq_psn = 0; + + ret = ib_modify_qp(qp, qp_attr, IB_QP_STATE | IB_QP_SQ_PSN); + if (ret) { + IB_ERROR("ib_modify_qp to RTS failed %d\n", ret); + goto out; + } + IB_INFO("MC QP RTS\n"); + + return 0; + +out: + kfree(qp_attr); + return -1; +} + +int vnic_ib_mc_post_recv(struct mc_data *mc_data, struct io *io) +{ + cycles_t post_time; + struct ib_recv_wr *bad_wr; + int ret = -1; + + IB_FUNCTION("vnic_ib_mc_post_recv()\n"); + + vnic_ib_pre_rcvpost_stats(&mc_data->ib_conn, io, &post_time); + io->type = RECV_UD; + ret = ib_post_recv(mc_data->ib_conn.qp, &io->rwr, &bad_wr); + if (ret) { + IB_ERROR("error in posting rcv wr; error %d\n", ret); + goto out; + } + vnic_ib_post_rcvpost_stats(&mc_data->ib_conn, post_time); + +out: + return ret; +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h new file mode 100644 index 0000000..ebf9ef5 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h @@ -0,0 +1,206 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_IB_H_INCLUDED +#define VNIC_IB_H_INCLUDED + +#include +#include +#include +#include +#include +#include + +#include "vnic_sys.h" +#include "vnic_netpath.h" +#define PFX "qlgc_vnic: " + +struct io; +typedef void (comp_routine_t) (struct io *io); + +enum vnic_ib_conn_state { + IB_CONN_UNINITTED = 0, + IB_CONN_INITTED = 1, + IB_CONN_CONNECTING = 2, + IB_CONN_CONNECTED = 3, + IB_CONN_DISCONNECTED = 4, + IB_CONN_ERRORED = 5 +}; + +struct vnic_ib_conn { + struct viport *viport; + struct vnic_ib_config *ib_config; + spinlock_t conn_lock; + enum vnic_ib_conn_state state; + struct ib_qp *qp; + struct ib_cq *cq; + struct ib_cm_id *cm_id; + int callback_thread_end; + struct task_struct *callback_thread; + wait_queue_head_t callback_wait_queue; + u32 in_thread; + u32 compl_received; + struct completion callback_thread_exit; + spinlock_t compl_received_lock; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + struct { + cycles_t connection_time; + cycles_t rdma_post_time; + u32 rdma_post_ios; + cycles_t rdma_comp_time; + u32 rdma_comp_ios; + cycles_t send_post_time; + u32 send_post_ios; + cycles_t send_comp_time; + u32 send_comp_ios; + cycles_t recv_post_time; + u32 recv_post_ios; + cycles_t recv_comp_time; + u32 recv_comp_ios; + u32 num_ios; + u32 num_callbacks; + u32 max_ios; + } statistics; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ +}; + +struct vnic_ib_path_info { + struct ib_sa_path_rec path; + struct ib_sa_query *path_query; + int path_query_id; + int status; + struct completion done; +}; + +struct vnic_ib_device { + struct ib_device *dev; + struct list_head port_list; +}; + +struct vnic_ib_port { + struct vnic_ib_device *dev; + u8 port_num; + struct dev_info pdev_info; + struct list_head list; +}; + +struct io { + struct list_head list_ptrs; + struct viport *viport; + comp_routine_t *routine; + struct ib_recv_wr rwr; + struct ib_send_wr swr; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + cycles_t time; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ + enum {RECV, RDMA, SEND, RECV_UD} type; +}; + +struct rdma_io { + struct io io; + struct ib_sge list[2]; + u16 index; + u16 len; + u8 *data; + dma_addr_t data_dma; + struct sk_buff *skb; + dma_addr_t skb_data_dma; + struct viport_trailer *trailer; + dma_addr_t trailer_dma; +}; + +struct send_io { + struct io io; 
+	struct ib_sge list;
+	u8 *virtual_addr;
+};
+
+struct recv_io {
+	struct io io;
+	struct ib_sge list;
+	u8 *virtual_addr;
+};
+
+struct ud_recv_io {
+	struct io io;
+	u16 len;
+	dma_addr_t skb_data_dma;
+	struct ib_sge list[2];	/* one for grh and other for rest of pkt. */
+	struct sk_buff *skb;
+};
+
+int vnic_ib_init(void);
+void vnic_ib_cleanup(void);
+
+struct vnic;
+int vnic_ib_get_path(struct netpath *netpath, struct vnic *vnic);
+int vnic_ib_conn_init(struct vnic_ib_conn *ib_conn, struct viport *viport,
+		      struct ib_pd *pd, struct vnic_ib_config *config);
+
+int vnic_ib_post_recv(struct vnic_ib_conn *ib_conn, struct io *io);
+int vnic_ib_post_send(struct vnic_ib_conn *ib_conn, struct io *io);
+int vnic_ib_cm_connect(struct vnic_ib_conn *ib_conn);
+int vnic_ib_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event);
+
+#define vnic_ib_conn_uninitted(ib_conn) \
+	((ib_conn)->state == IB_CONN_UNINITTED)
+#define vnic_ib_conn_initted(ib_conn) \
+	((ib_conn)->state == IB_CONN_INITTED)
+#define vnic_ib_conn_connecting(ib_conn) \
+	((ib_conn)->state == IB_CONN_CONNECTING)
+#define vnic_ib_conn_connected(ib_conn) \
+	((ib_conn)->state == IB_CONN_CONNECTED)
+#define vnic_ib_conn_disconnected(ib_conn) \
+	((ib_conn)->state == IB_CONN_DISCONNECTED)
+
+#define MCAST_GROUP_INVALID	0x00 /* viport failed to join or left mc group */
+#define MCAST_GROUP_JOINING	0x01 /* wait for completion */
+#define MCAST_GROUP_JOINED	0x02 /* join process completed successfully */
+
+/* vnic_sa_client is used to register with sa once. It is needed to join and
+ * leave multicast groups.
+ */
+extern struct ib_sa_client vnic_sa_client;
+
+/* The following functions are used to initialize and handle multicast
+ * components.
+ */
+struct mc_data;	/* forward declaration */
+/* Initialize all necessary mc components */
+int vnic_ib_mc_init(struct mc_data *mc_data, struct viport *viport,
+		    struct ib_pd *pd, struct vnic_ib_config *config);
+/* Put multicast qp in RTS */
+int vnic_ib_mc_mod_qp_to_rts(struct ib_qp *qp);
+/* Post multicast receive buffers */
+int vnic_ib_mc_post_recv(struct mc_data *mc_data, struct io *io);
+
+#endif	/* VNIC_IB_H_INCLUDED */

From ramachandra.kuchimanchi at qlogic.com  Thu May 29 02:57:24 2008
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Thu, 29 May 2008 15:27:24 +0530
Subject: [ofa-general] [PATCH v3 07/13] QLogic VNIC: Handling configurable parameters of the driver
In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain>
References: <20080529095126.9943.84692.stgit@localhost.localdomain>
Message-ID: <20080529095724.9943.92517.stgit@localhost.localdomain>

From: Poornima Kamath

This patch adds the files that handle the driver's configurable parameters:
configuration of the virtual NIC, of the control and data connections to the
EVIC, and of the general IB connection parameters.
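For reference, the control and data connections each derive their 64-bit CM
service ID from fixed fields plus the IOC number of the target. A minimal
userspace sketch of the layout follows; the constants are copied from
vnic_config.h in this patch, while the IOC GUID is a made-up example value,
so this is illustrative only and not part of the driver:

	#include <stdio.h>
	#include <stdint.h>

	#define SST_AGN		 0x10ULL
	#define SST_OUI		 0x00066AULL
	#define CONTROL_PATH_ID	 0x0ULL
	#define DATA_PATH_ID	 0x1ULL
	#define IOC_NUMBER(GUID) (((GUID) >> 32) & 0xFF)

	int main(void)
	{
		uint64_t ioc_guid = 0x00066a0100000123ULL; /* example only */
		uint64_t control_sid = (SST_AGN << 56) | (SST_OUI << 32) |
				       (CONTROL_PATH_ID << 8) |
				       IOC_NUMBER(ioc_guid);
		uint64_t data_sid = (SST_AGN << 56) | (SST_OUI << 32) |
				    (DATA_PATH_ID << 8) | IOC_NUMBER(ioc_guid);

		printf("control service id: 0x%016llx\n",
		       (unsigned long long)control_sid);
		printf("data service id:    0x%016llx\n",
		       (unsigned long long)data_sid);
		return 0;
	}

The two IDs differ only in the path-ID byte, which is how the target tells a
control connection from a data connection for the same IOC.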
Signed-off-by: Poornima Kamath Signed-off-by: Ramachandra K Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_config.c | 379 ++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_config.h | 242 +++++++++++++++ 2 files changed, 621 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_config.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_config.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_config.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_config.c new file mode 100644 index 0000000..8bde3d8 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_config.c @@ -0,0 +1,379 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_config.h" +#include "vnic_trailer.h" +#include "vnic_main.h" + +u16 vnic_max_mtu = MAX_MTU; + +static u32 default_no_path_timeout = DEFAULT_NO_PATH_TIMEOUT; +static u32 sa_path_rec_get_timeout = SA_PATH_REC_GET_TIMEOUT; +static u32 default_primary_reconnect_timeout = + DEFAULT_PRIMARY_RECONNECT_TIMEOUT; +static u32 default_primary_switch_timeout = DEFAULT_PRIMARY_SWITCH_TIMEOUT; +static int default_prefer_primary = DEFAULT_PREFER_PRIMARY; + +static int use_rx_csum = VNIC_USE_RX_CSUM; +static int use_tx_csum = VNIC_USE_TX_CSUM; + +static u32 control_response_timeout = CONTROL_RSP_TIMEOUT; +static u32 completion_limit = DEFAULT_COMPLETION_LIMIT; + +module_param(vnic_max_mtu, ushort, 0444); +MODULE_PARM_DESC(vnic_max_mtu, "Maximum MTU size (1500-9500). Default is 9500"); + +module_param(default_prefer_primary, bool, 0444); +MODULE_PARM_DESC(default_prefer_primary, "Determines if primary path is" + " preferred (1) or not (0). Defaults to 0"); +module_param(use_rx_csum, bool, 0444); +MODULE_PARM_DESC(use_rx_csum, "Determines if RX checksum is done on VEx (1)" + " or not (0). Defaults to 1"); +module_param(use_tx_csum, bool, 0444); +MODULE_PARM_DESC(use_tx_csum, "Determines if TX checksum is done on VEx (1)" + " or not (0). 
Defaults to 1"); +module_param(default_no_path_timeout, uint, 0444); +MODULE_PARM_DESC(default_no_path_timeout, "Time to wait in milliseconds" + " before reconnecting to VEx after connection loss"); +module_param(default_primary_reconnect_timeout, uint, 0444); +MODULE_PARM_DESC(default_primary_reconnect_timeout, "Time to wait in" + " milliseconds before reconnecting the" + " primary path to VEx"); +module_param(default_primary_switch_timeout, uint, 0444); +MODULE_PARM_DESC(default_primary_switch_timeout, "Time to wait before" + " switching back to primary path if" + " primary path is preferred"); +module_param(sa_path_rec_get_timeout, uint, 0444); +MODULE_PARM_DESC(sa_path_rec_get_timeout, "Time out value in milliseconds" + " for SA path record get queries"); + +module_param(control_response_timeout, uint, 0444); +MODULE_PARM_DESC(control_response_timeout, "Time out value in milliseconds" + " to wait for response to control requests"); + +module_param(completion_limit, uint, 0444); +MODULE_PARM_DESC(completion_limit, "Maximum completions to process" + " in a single completion callback invocation. Default is 100" + " Minimum value is 10"); + +static void config_control_defaults(struct control_config *control_config, + struct path_param *params) +{ + int len; + char *dot; + u64 sid; + + sid = (SST_AGN << 56) | (SST_OUI << 32) | (CONTROL_PATH_ID << 8) + | IOC_NUMBER(be64_to_cpu(params->ioc_guid)); + + control_config->ib_config.service_id = cpu_to_be64(sid); + control_config->ib_config.conn_data.path_id = 0; + control_config->ib_config.conn_data.vnic_instance = params->instance; + control_config->ib_config.conn_data.path_num = 0; + control_config->ib_config.conn_data.features_supported = + __constant_cpu_to_be32((u32) (VNIC_FEAT_IGNORE_VLAN | + VNIC_FEAT_RDMA_IMMED)); + dot = strchr(init_utsname()->nodename, '.'); + + if (dot) + len = dot - init_utsname()->nodename; + else + len = strlen(init_utsname()->nodename); + + if (len > VNIC_MAX_NODENAME_LEN) + len = VNIC_MAX_NODENAME_LEN; + + memcpy(control_config->ib_config.conn_data.nodename, + init_utsname()->nodename, len); + + if (params->ib_multicast == 1) + control_config->ib_multicast = 1; + else if (params->ib_multicast == 0) + control_config->ib_multicast = 0; + else { + /* parameter is not set - enable it by default */ + control_config->ib_multicast = 1; + CONFIG_ERROR("IOCGUID=%llx INSTANCE=%d IB_MULTICAST defaulted" + " to TRUE\n", + be64_to_cpu(params->ioc_guid), + (char)params->instance); + } + + if (control_config->ib_multicast) + control_config->ib_config.conn_data.features_supported |= + __constant_cpu_to_be32(VNIC_FEAT_INBOUND_IB_MC); + + control_config->ib_config.retry_count = RETRY_COUNT; + control_config->ib_config.rnr_retry_count = RETRY_COUNT; + control_config->ib_config.min_rnr_timer = MIN_RNR_TIMER; + + /* These values are not configurable*/ + control_config->ib_config.num_recvs = 5; + control_config->ib_config.num_sends = 1; + control_config->ib_config.recv_scatter = 1; + control_config->ib_config.send_gather = 1; + control_config->ib_config.completion_limit = completion_limit; + + control_config->num_recvs = control_config->ib_config.num_recvs; + + control_config->vnic_instance = params->instance; + control_config->max_address_entries = MAX_ADDRESS_ENTRIES; + control_config->min_address_entries = MIN_ADDRESS_ENTRIES; + control_config->rsp_timeout = msecs_to_jiffies(control_response_timeout); +} + +static void config_data_defaults(struct data_config *data_config, + struct path_param *params) +{ + u64 sid; + + sid = (SST_AGN 
<< 56) | (SST_OUI << 32) | (DATA_PATH_ID << 8) + | IOC_NUMBER(be64_to_cpu(params->ioc_guid)); + + data_config->ib_config.service_id = cpu_to_be64(sid); + data_config->ib_config.conn_data.path_id = jiffies; /* random */ + data_config->ib_config.conn_data.vnic_instance = params->instance; + data_config->ib_config.conn_data.path_num = 0; + + data_config->ib_config.retry_count = RETRY_COUNT; + data_config->ib_config.rnr_retry_count = RETRY_COUNT; + data_config->ib_config.min_rnr_timer = MIN_RNR_TIMER; + + /* + * NOTE: the num_recvs size assumes that the EIOC could + * RDMA enough packets to fill all of the host recv + * pool entries, plus send a kick message after each + * packet, plus RDMA new buffers for the size of + * the EIOC recv buffer pool, plus send kick messages + * after each min_host_update_sz of new buffers all + * before the host can even pull off the first completed + * receive off the completion queue, and repost the + * receive. NOT LIKELY! + */ + data_config->ib_config.num_recvs = HOST_RECV_POOL_ENTRIES + + (MAX_EIOC_POOL_SZ / MIN_HOST_UPDATE_SZ); + + data_config->ib_config.num_sends = (2 * NOTIFY_BUNDLE_SZ) + + (HOST_RECV_POOL_ENTRIES / MIN_EIOC_UPDATE_SZ) + 1; + + data_config->ib_config.recv_scatter = 1; /* not configurable */ + data_config->ib_config.send_gather = 2; /* not configurable */ + data_config->ib_config.completion_limit = completion_limit; + + data_config->num_recvs = data_config->ib_config.num_recvs; + data_config->path_id = data_config->ib_config.conn_data.path_id; + + + data_config->host_recv_pool_entries = HOST_RECV_POOL_ENTRIES; + + data_config->host_min.size_recv_pool_entry = + cpu_to_be32(BUFFER_SIZE(VLAN_ETH_HLEN + MIN_MTU)); + data_config->host_max.size_recv_pool_entry = + cpu_to_be32(BUFFER_SIZE(VLAN_ETH_HLEN + vnic_max_mtu)); + data_config->eioc_min.size_recv_pool_entry = + cpu_to_be32(BUFFER_SIZE(VLAN_ETH_HLEN + MIN_MTU)); + data_config->eioc_max.size_recv_pool_entry = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.num_recv_pool_entries = + __constant_cpu_to_be32(MIN_HOST_POOL_SZ); + data_config->host_max.num_recv_pool_entries = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + data_config->eioc_min.num_recv_pool_entries = + __constant_cpu_to_be32(MIN_EIOC_POOL_SZ); + data_config->eioc_max.num_recv_pool_entries = + __constant_cpu_to_be32(MAX_EIOC_POOL_SZ); + + data_config->host_min.timeout_before_kick = + __constant_cpu_to_be32(MIN_HOST_KICK_TIMEOUT); + data_config->host_max.timeout_before_kick = + __constant_cpu_to_be32(MAX_HOST_KICK_TIMEOUT); + data_config->eioc_min.timeout_before_kick = 0; + data_config->eioc_max.timeout_before_kick = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.num_recv_pool_entries_before_kick = + __constant_cpu_to_be32(MIN_HOST_KICK_ENTRIES); + data_config->host_max.num_recv_pool_entries_before_kick = + __constant_cpu_to_be32(MAX_HOST_KICK_ENTRIES); + data_config->eioc_min.num_recv_pool_entries_before_kick = 0; + data_config->eioc_max.num_recv_pool_entries_before_kick = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.num_recv_pool_bytes_before_kick = + __constant_cpu_to_be32(MIN_HOST_KICK_BYTES); + data_config->host_max.num_recv_pool_bytes_before_kick = + __constant_cpu_to_be32(MAX_HOST_KICK_BYTES); + data_config->eioc_min.num_recv_pool_bytes_before_kick = 0; + data_config->eioc_max.num_recv_pool_bytes_before_kick = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MIN_HOST_UPDATE_SZ); + 
data_config->host_max.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MAX_HOST_UPDATE_SZ); + data_config->eioc_min.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MIN_EIOC_UPDATE_SZ); + data_config->eioc_max.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MAX_EIOC_UPDATE_SZ); + + data_config->notify_bundle = NOTIFY_BUNDLE_SZ; +} + +static void config_path_info_defaults(struct viport_config *config, + struct path_param *params) +{ + int i; + ib_query_gid(config->ibdev, config->port, 0, + &config->path_info.path.sgid); + for (i = 0; i < 16; i++) + config->path_info.path.dgid.raw[i] = params->dgid[i]; + + config->path_info.path.pkey = params->pkey; + config->path_info.path.numb_path = 1; + config->sa_path_rec_get_timeout = sa_path_rec_get_timeout; + +} + +static void config_viport_defaults(struct viport_config *config, + struct path_param *params) +{ + config->ibdev = params->ibdev; + config->port = params->port; + config->ioc_guid = params->ioc_guid; + config->stats_interval = msecs_to_jiffies(VIPORT_STATS_INTERVAL); + config->hb_interval = msecs_to_jiffies(VIPORT_HEARTBEAT_INTERVAL); + config->hb_timeout = VIPORT_HEARTBEAT_TIMEOUT * 1000; + /*hb_timeout needs to be in usec*/ + strcpy(config->ioc_string, params->ioc_string); + config_path_info_defaults(config, params); + + config_control_defaults(&config->control_config, params); + config_data_defaults(&config->data_config, params); +} + +static void config_vnic_defaults(struct vnic_config *config) +{ + config->no_path_timeout = msecs_to_jiffies(default_no_path_timeout); + config->primary_connect_timeout = + msecs_to_jiffies(DEFAULT_PRIMARY_CONNECT_TIMEOUT); + config->primary_reconnect_timeout = + msecs_to_jiffies(default_primary_reconnect_timeout); + config->primary_switch_timeout = + msecs_to_jiffies(default_primary_switch_timeout); + config->prefer_primary = default_prefer_primary; + config->use_rx_csum = use_rx_csum; + config->use_tx_csum = use_tx_csum; +} + +struct viport_config *config_alloc_viport(struct path_param *params) +{ + struct viport_config *config; + + config = kzalloc(sizeof *config, GFP_KERNEL); + if (!config) { + CONFIG_ERROR("could not allocate memory for" + " struct viport_config\n"); + return NULL; + } + + config_viport_defaults(config, params); + + return config; +} + +struct vnic_config *config_alloc_vnic(void) +{ + struct vnic_config *config; + + config = kzalloc(sizeof *config, GFP_KERNEL); + if (!config) { + CONFIG_ERROR("couldn't allocate memory for" + " struct vnic_config\n"); + + return NULL; + } + + config_vnic_defaults(config); + return config; +} + +char *config_viport_name(struct viport_config *config) +{ + /* function only called by one thread, can return a static string */ + static char str[64]; + + sprintf(str, "GUID %llx instance %d", + be64_to_cpu(config->ioc_guid), + config->control_config.vnic_instance); + return str; +} + +int config_start(void) +{ + vnic_max_mtu = min_t(u16, vnic_max_mtu, MAX_MTU); + vnic_max_mtu = max_t(u16, vnic_max_mtu, MIN_MTU); + + sa_path_rec_get_timeout = min_t(u32, sa_path_rec_get_timeout, + MAX_SA_TIMEOUT); + sa_path_rec_get_timeout = max_t(u32, sa_path_rec_get_timeout, + MIN_SA_TIMEOUT); + + control_response_timeout = min_t(u32, control_response_timeout, + MAX_CONTROL_RSP_TIMEOUT); + + control_response_timeout = max_t(u32, control_response_timeout, + MIN_CONTROL_RSP_TIMEOUT); + + completion_limit = max_t(u32, completion_limit, + MIN_COMPLETION_LIMIT); + + if (!default_no_path_timeout) + default_no_path_timeout = 
DEFAULT_NO_PATH_TIMEOUT; + + if (!default_primary_reconnect_timeout) + default_primary_reconnect_timeout = + DEFAULT_PRIMARY_RECONNECT_TIMEOUT; + + if (!default_primary_switch_timeout) + default_primary_switch_timeout = + DEFAULT_PRIMARY_SWITCH_TIMEOUT; + + return 0; + +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_config.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_config.h new file mode 100644 index 0000000..dca5f98 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_config.h @@ -0,0 +1,242 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_CONFIG_H_INCLUDED +#define VNIC_CONFIG_H_INCLUDED + +#include +#include +#include + +#include "vnic_control.h" +#include "vnic_ib.h" + +#define SST_AGN 0x10ULL +#define SST_OUI 0x00066AULL + +enum { + CONTROL_PATH_ID = 0x0, + DATA_PATH_ID = 0x1 +}; + +#define IOC_NUMBER(GUID) (((GUID) >> 32) & 0xFF) + +enum { + VNIC_CLASS_SUBCLASS = 0x2000066A, + VNIC_PROTOCOL = 0, + VNIC_PROT_VERSION = 1 +}; + +enum { + MIN_MTU = 1500, /* minimum negotiated MTU size */ + MAX_MTU = 9500 /* jumbo frame */ +}; + +/* + * TODO: tune the pool parameter values + */ +enum { + MIN_ADDRESS_ENTRIES = 16, + MAX_ADDRESS_ENTRIES = 64 +}; + +enum { + HOST_RECV_POOL_ENTRIES = 512, + MIN_HOST_POOL_SZ = 64, + MIN_EIOC_POOL_SZ = 64, + MAX_EIOC_POOL_SZ = 256, + MIN_HOST_UPDATE_SZ = 8, + MAX_HOST_UPDATE_SZ = 32, + MIN_EIOC_UPDATE_SZ = 8, + MAX_EIOC_UPDATE_SZ = 32, + NOTIFY_BUNDLE_SZ = 32 +}; + +enum { + MIN_HOST_KICK_TIMEOUT = 10, /* in usec */ + MAX_HOST_KICK_TIMEOUT = 100 /* in usec */ +}; + +enum { + MIN_HOST_KICK_ENTRIES = 1, + MAX_HOST_KICK_ENTRIES = 128 +}; + +enum { + MIN_HOST_KICK_BYTES = 0, + MAX_HOST_KICK_BYTES = 5000 +}; + +enum { + DEFAULT_NO_PATH_TIMEOUT = 10000, + DEFAULT_PRIMARY_CONNECT_TIMEOUT = 10000, + DEFAULT_PRIMARY_RECONNECT_TIMEOUT = 10000, + DEFAULT_PRIMARY_SWITCH_TIMEOUT = 10000 +}; + +enum { + VIPORT_STATS_INTERVAL = 500, /* .5 sec */ + VIPORT_HEARTBEAT_INTERVAL = 1000, /* 1 second */ + VIPORT_HEARTBEAT_TIMEOUT = 64000 /* 64 sec */ +}; + +enum { + /* 5 sec increased for EVIC support for large number of + * host connections + */ + CONTROL_RSP_TIMEOUT = 5000, + MIN_CONTROL_RSP_TIMEOUT = 1000, /* 1 sec */ + MAX_CONTROL_RSP_TIMEOUT = 60000 /* 60 sec */ +}; + +/* Maximum number of completions to be processed + * during a single completion callback invocation + */ +enum { + DEFAULT_COMPLETION_LIMIT = 100, + MIN_COMPLETION_LIMIT = 10 +}; + +/* infiniband connection parameters */ +enum { + RETRY_COUNT = 3, + MIN_RNR_TIMER = 22, /* 20 ms */ + DEFAULT_PKEY = 0 /* pkey table index */ +}; + +enum { + SA_PATH_REC_GET_TIMEOUT = 1000, /* 1000 ms */ + MIN_SA_TIMEOUT = 100, /* 100 ms */ + MAX_SA_TIMEOUT = 20000 /* 20s */ +}; + +#define MAX_PARAM_VALUE 0x40000000 +#define VNIC_USE_RX_CSUM 1 +#define VNIC_USE_TX_CSUM 1 +#define DEFAULT_PREFER_PRIMARY 0 + +/* As per IBTA specification, IOCString Maximum length can be 512 bits. 
*/ +#define MAX_IOC_STRING_LEN (512/8) + +struct path_param { + __be64 ioc_guid; + u8 ioc_string[MAX_IOC_STRING_LEN+1]; + u8 port; + u8 instance; + struct ib_device *ibdev; + struct vnic_ib_port *ibport; + char name[IFNAMSIZ]; + u8 dgid[16]; + __be16 pkey; + int rx_csum; + int tx_csum; + int heartbeat; + int ib_multicast; +}; + +struct vnic_ib_config { + __be64 service_id; + struct vnic_connection_data conn_data; + u32 retry_count; + u32 rnr_retry_count; + u8 min_rnr_timer; + u32 num_sends; + u32 num_recvs; + u32 recv_scatter; /* 1 */ + u32 send_gather; /* 1 or 2 */ + u32 completion_limit; +}; + +struct control_config { + struct vnic_ib_config ib_config; + u32 num_recvs; + u8 vnic_instance; + u16 max_address_entries; + u16 min_address_entries; + u32 rsp_timeout; + u32 ib_multicast; +}; + +struct data_config { + struct vnic_ib_config ib_config; + u64 path_id; + u32 num_recvs; + u32 host_recv_pool_entries; + struct vnic_recv_pool_config host_min; + struct vnic_recv_pool_config host_max; + struct vnic_recv_pool_config eioc_min; + struct vnic_recv_pool_config eioc_max; + u32 notify_bundle; +}; + +struct viport_config { + struct viport *viport; + struct control_config control_config; + struct data_config data_config; + struct vnic_ib_path_info path_info; + u32 sa_path_rec_get_timeout; + struct ib_device *ibdev; + u32 port; + unsigned long stats_interval; + u32 hb_interval; + u32 hb_timeout; + __be64 ioc_guid; + u8 ioc_string[MAX_IOC_STRING_LEN+1]; + size_t path_idx; +}; + +/* + * primary_connect_timeout - if the secondary connects first, + * how long do we give the primary? + * primary_reconnect_timeout - same as above, but used when recovering + * from the case where both paths fail + * primary_switch_timeout - how long do we wait before switching to the + * primary when it comes back? + */ +struct vnic_config { + struct vnic *vnic; + char name[IFNAMSIZ]; + unsigned long no_path_timeout; + u32 primary_connect_timeout; + u32 primary_reconnect_timeout; + u32 primary_switch_timeout; + int prefer_primary; + int use_rx_csum; + int use_tx_csum; +}; + +int config_start(void); +struct viport_config *config_alloc_viport(struct path_param *params); +struct vnic_config *config_alloc_vnic(void); +char *config_viport_name(struct viport_config *config); + +#endif /* VNIC_CONFIG_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:57:54 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 29 May 2008 15:27:54 +0530 Subject: [ofa-general] [PATCH v3 08/13] QLogic VNIC: sysfs interface implementation for the driver In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> Message-ID: <20080529095754.9943.27936.stgit@localhost.localdomain> From: Amar Mudrankit The sysfs interface for the QLogic VNIC driver is implemented through this patch. 
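As background for reviewers: targets are created by writing a comma-separated
option string (ioc_guid=, dgid=, pkey=, name=, plus optional parameters) to
the create_primary and create_secondary attributes, and the 32-hex-digit
dgid= value is converted two digits at a time into 16 raw GID bytes. A
minimal userspace sketch of that conversion, mirroring the logic in
vnic_parse_options() (the GID string is a made-up example, and this is
illustrative only, not driver code):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	int main(void)
	{
		const char *p = "fe80000000000000000000000000abcd"; /* example */
		unsigned char dgid[16];
		char byte[3];
		int i;

		if (strlen(p) != 32)	/* same length check as the driver */
			return 1;
		for (i = 0; i < 16; i++) {
			/* two hex digits per byte, as in vnic_parse_options() */
			memcpy(byte, p + i * 2, 2);
			byte[2] = '\0';
			dgid[i] = (unsigned char)strtoul(byte, NULL, 16);
		}
		for (i = 0; i < 16; i++)
			printf("%02x", dgid[i]);
		printf("\n");
		return 0;
	}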
Signed-off-by: Amar Mudrankit Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath --- drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c | 1133 +++++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h | 51 + 2 files changed, 1184 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c new file mode 100644 index 0000000..40b3c77 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c @@ -0,0 +1,1133 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_config.h" +#include "vnic_ib.h" +#include "vnic_viport.h" +#include "vnic_main.h" +#include "vnic_stats.h" + +/* + * target eiocs are added by writing + * + * ioc_guid=,dgid=,pkey=,name= + * to the create_primary sysfs attribute. 
+ */ +enum { + VNIC_OPT_ERR = 0, + VNIC_OPT_IOC_GUID = 1 << 0, + VNIC_OPT_DGID = 1 << 1, + VNIC_OPT_PKEY = 1 << 2, + VNIC_OPT_NAME = 1 << 3, + VNIC_OPT_INSTANCE = 1 << 4, + VNIC_OPT_RXCSUM = 1 << 5, + VNIC_OPT_TXCSUM = 1 << 6, + VNIC_OPT_HEARTBEAT = 1 << 7, + VNIC_OPT_IOC_STRING = 1 << 8, + VNIC_OPT_IB_MULTICAST = 1 << 9, + VNIC_OPT_ALL = (VNIC_OPT_IOC_GUID | + VNIC_OPT_DGID | VNIC_OPT_NAME | VNIC_OPT_PKEY), +}; + +static match_table_t vnic_opt_tokens = { + {VNIC_OPT_IOC_GUID, "ioc_guid=%s"}, + {VNIC_OPT_DGID, "dgid=%s"}, + {VNIC_OPT_PKEY, "pkey=%x"}, + {VNIC_OPT_NAME, "name=%s"}, + {VNIC_OPT_INSTANCE, "instance=%d"}, + {VNIC_OPT_RXCSUM, "rx_csum=%s"}, + {VNIC_OPT_TXCSUM, "tx_csum=%s"}, + {VNIC_OPT_HEARTBEAT, "heartbeat=%d"}, + {VNIC_OPT_IOC_STRING, "ioc_string=\"%s"}, + {VNIC_OPT_IB_MULTICAST, "ib_multicast=%s"}, + {VNIC_OPT_ERR, NULL} +}; + +void vnic_release_dev(struct device *dev) +{ + struct dev_info *dev_info = + container_of(dev, struct dev_info, dev); + + complete(&dev_info->released); + +} + +struct class vnic_class = { + .name = "infiniband_qlgc_vnic", + .dev_release = vnic_release_dev +}; + +struct dev_info interface_dev; + +static int vnic_parse_options(const char *buf, struct path_param *param) +{ + char *options, *sep_opt; + char *p; + char dgid[3]; + substring_t args[MAX_OPT_ARGS]; + int opt_mask = 0; + int token; + int ret = -EINVAL; + int i, len; + + options = kstrdup(buf, GFP_KERNEL); + if (!options) + return -ENOMEM; + + sep_opt = options; + while ((p = strsep(&sep_opt, ",")) != NULL) { + if (!*p) + continue; + + token = match_token(p, vnic_opt_tokens, args); + opt_mask |= token; + + switch (token) { + case VNIC_OPT_IOC_GUID: + p = match_strdup(args); + param->ioc_guid = cpu_to_be64(simple_strtoull(p, NULL, + 16)); + kfree(p); + break; + + case VNIC_OPT_DGID: + p = match_strdup(args); + if (strlen(p) != 32) { + printk(KERN_WARNING PFX + "bad dest GID parameter '%s'\n", p); + kfree(p); + goto out; + } + + for (i = 0; i < 16; ++i) { + strlcpy(dgid, p + i * 2, 3); + param->dgid[i] = simple_strtoul(dgid, NULL, + 16); + + } + kfree(p); + break; + + case VNIC_OPT_PKEY: + if (match_hex(args, &token)) { + printk(KERN_WARNING PFX + "bad P_key parameter '%s'\n", p); + goto out; + } + param->pkey = cpu_to_be16(token); + break; + + case VNIC_OPT_NAME: + p = match_strdup(args); + if (strlen(p) >= IFNAMSIZ) { + printk(KERN_WARNING PFX + "interface name parameter too long\n"); + kfree(p); + goto out; + } + strcpy(param->name, p); + kfree(p); + break; + case VNIC_OPT_INSTANCE: + if (match_int(args, &token)) { + printk(KERN_WARNING PFX + "bad instance parameter '%s'\n", p); + goto out; + } + + if (token > 255 || token < 0) { + printk(KERN_WARNING PFX + "instance parameter must be" + " >= 0 and <= 255\n"); + goto out; + } + + param->instance = token; + break; + case VNIC_OPT_RXCSUM: + p = match_strdup(args); + if (!strncmp(p, "true", 4)) + param->rx_csum = 1; + else if (!strncmp(p, "false", 5)) + param->rx_csum = 0; + else { + printk(KERN_WARNING PFX + "bad rx_csum parameter." + " must be 'true' or 'false'\n"); + kfree(p); + goto out; + } + kfree(p); + break; + case VNIC_OPT_TXCSUM: + p = match_strdup(args); + if (!strncmp(p, "true", 4)) + param->tx_csum = 1; + else if (!strncmp(p, "false", 5)) + param->tx_csum = 0; + else { + printk(KERN_WARNING PFX + "bad tx_csum parameter." 
+ " must be 'true' or 'false'\n"); + kfree(p); + goto out; + } + kfree(p); + break; + case VNIC_OPT_HEARTBEAT: + if (match_int(args, &token)) { + printk(KERN_WARNING PFX + "bad instance parameter '%s'\n", p); + goto out; + } + + if (token > 6000 || token <= 0) { + printk(KERN_WARNING PFX + "heartbeat parameter must be" + " > 0 and <= 6000\n"); + goto out; + } + param->heartbeat = token; + break; + case VNIC_OPT_IOC_STRING: + p = match_strdup(args); + len = strlen(p); + if (len > MAX_IOC_STRING_LEN) { + printk(KERN_WARNING PFX + "ioc string parameter too long\n"); + kfree(p); + goto out; + } + strcpy(param->ioc_string, p); + if (*(p + len - 1) != '\"') { + strcat(param->ioc_string, ","); + kfree(p); + p = strsep(&sep_opt, "\""); + strcat(param->ioc_string, p); + sep_opt++; + } else { + *(param->ioc_string + len - 1) = '\0'; + kfree(p); + } + break; + case VNIC_OPT_IB_MULTICAST: + p = match_strdup(args); + if (!strncmp(p, "true", 4)) + param->ib_multicast = 1; + else if (!strncmp(p, "false", 5)) + param->ib_multicast = 0; + else { + printk(KERN_WARNING PFX + "bad ib_multicast parameter." + " must be 'true' or 'false'\n"); + kfree(p); + goto out; + } + kfree(p); + break; + default: + printk(KERN_WARNING PFX + "unknown parameter or missing value " + "'%s' in target creation request\n", p); + goto out; + } + + } + + if ((opt_mask & VNIC_OPT_ALL) == VNIC_OPT_ALL) + ret = 0; + else + for (i = 0; i < ARRAY_SIZE(vnic_opt_tokens); ++i) + if ((vnic_opt_tokens[i].token & VNIC_OPT_ALL) && + !(vnic_opt_tokens[i].token & opt_mask)) + printk(KERN_WARNING PFX + "target creation request is " + "missing parameter '%s'\n", + vnic_opt_tokens[i].pattern); + +out: + kfree(options); + return ret; + +} + +static ssize_t show_vnic_state(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, dev_info); + switch (vnic->state) { + case VNIC_UNINITIALIZED: + return sprintf(buf, "VNIC_UNINITIALIZED\n"); + case VNIC_REGISTERED: + return sprintf(buf, "VNIC_REGISTERED\n"); + default: + return sprintf(buf, "INVALID STATE\n"); + } + +} + +static DEVICE_ATTR(vnic_state, S_IRUGO, show_vnic_state, NULL); + +static ssize_t show_rx_csum(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, dev_info); + + if (vnic->config->use_rx_csum) + return sprintf(buf, "true\n"); + else + return sprintf(buf, "false\n"); +} + +static DEVICE_ATTR(rx_csum, S_IRUGO, show_rx_csum, NULL); + +static ssize_t show_tx_csum(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, dev_info); + + if (vnic->config->use_tx_csum) + return sprintf(buf, "true\n"); + else + return sprintf(buf, "false\n"); +} + +static DEVICE_ATTR(tx_csum, S_IRUGO, show_tx_csum, NULL); + +static ssize_t show_current_path(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, dev_info); + unsigned long flags; + size_t length; + + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (vnic->current_path == &vnic->primary_path) + length = sprintf(buf, "primary_path\n"); + else if (vnic->current_path == &vnic->secondary_path) + length = 
sprintf(buf, "secondary path\n"); + else + length = sprintf(buf, "none\n"); + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + return length; +} + +static DEVICE_ATTR(current_path, S_IRUGO, show_current_path, NULL); + +static struct attribute *vnic_dev_attrs[] = { + &dev_attr_vnic_state.attr, + &dev_attr_rx_csum.attr, + &dev_attr_tx_csum.attr, + &dev_attr_current_path.attr, + NULL +}; + +struct attribute_group vnic_dev_attr_group = { + .attrs = vnic_dev_attrs, +}; + +static inline void print_dgid(u8 *dgid) +{ + int i; + + for (i = 0; i < 16; i += 2) + printk("%04x", be16_to_cpu(*(__be16 *)&dgid[i])); +} + +static inline int is_dgid_zero(u8 *dgid) +{ + int i; + + for (i = 0; i < 16; i++) { + if (dgid[i] != 0) + return 1; + } + return 0; +} + +static int create_netpath(struct netpath *npdest, + struct path_param *p_params) +{ + struct viport_config *viport_config; + struct viport *viport; + struct vnic *vnic; + struct list_head *ptr; + int ret = 0; + + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (vnic->primary_path.viport) { + viport_config = vnic->primary_path.viport->config; + if ((viport_config->ioc_guid == p_params->ioc_guid) + && (viport_config->control_config.vnic_instance + == p_params->instance) + && (be64_to_cpu(p_params->ioc_guid))) { + SYS_ERROR("GUID %llx," + " INSTANCE %d already in use\n", + be64_to_cpu(p_params->ioc_guid), + p_params->instance); + ret = -EINVAL; + goto out; + } + } + + if (vnic->secondary_path.viport) { + viport_config = vnic->secondary_path.viport->config; + if ((viport_config->ioc_guid == p_params->ioc_guid) + && (viport_config->control_config.vnic_instance + == p_params->instance) + && (be64_to_cpu(p_params->ioc_guid))) { + SYS_ERROR("GUID %llx," + " INSTANCE %d already in use\n", + be64_to_cpu(p_params->ioc_guid), + p_params->instance); + ret = -EINVAL; + goto out; + } + } + } + + if (npdest->viport) { + SYS_ERROR("create_netpath: path already exists\n"); + ret = -EINVAL; + goto out; + } + + viport_config = config_alloc_viport(p_params); + if (!viport_config) { + SYS_ERROR("create_netpath: failed creating viport config\n"); + ret = -1; + goto out; + } + + /*User specified heartbeat value is in 1/100s of a sec*/ + if (p_params->heartbeat != -1) { + viport_config->hb_interval = + msecs_to_jiffies(p_params->heartbeat * 10); + viport_config->hb_timeout = + (p_params->heartbeat << 6) * 10000; /* usec */ + } + + viport_config->path_idx = 0; + + viport = viport_allocate(viport_config); + if (!viport) { + SYS_ERROR("create_netpath: failed creating viport\n"); + kfree(viport_config); + ret = -1; + goto out; + } + + npdest->viport = viport; + viport->parent = npdest; + viport->vnic = npdest->parent; + + if (is_dgid_zero(p_params->dgid) && p_params->ioc_guid != 0 + && p_params->pkey != 0) { + viport_kick(viport); + vnic_disconnected(npdest->parent, npdest); + } else { + printk(KERN_WARNING "Specified parameters IOCGUID=%llx, " + "P_Key=%x, DGID=", be64_to_cpu(p_params->ioc_guid), + p_params->pkey); + print_dgid(p_params->dgid); + printk(" insufficient for establishing %s path for interface " + "%s. Hence, path will not be established.\n", + (npdest->second_bias ? 
"secondary" : "primary"), + p_params->name); + } +out: + return ret; +} + +static struct vnic *create_vnic(struct path_param *param) +{ + struct vnic_config *vnic_config; + struct vnic *vnic; + struct list_head *ptr; + + SYS_INFO("create_vnic: name = %s\n", param->name); + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (!strcmp(vnic->config->name, param->name)) { + SYS_ERROR("vnic %s already exists\n", + param->name); + return NULL; + } + } + + vnic_config = config_alloc_vnic(); + if (!vnic_config) { + SYS_ERROR("create_vnic: failed creating vnic config\n"); + return NULL; + } + + if (param->rx_csum != -1) + vnic_config->use_rx_csum = param->rx_csum; + + if (param->tx_csum != -1) + vnic_config->use_tx_csum = param->tx_csum; + + strcpy(vnic_config->name, param->name); + vnic = vnic_allocate(vnic_config); + if (!vnic) { + SYS_ERROR("create_vnic: failed allocating vnic\n"); + goto free_vnic_config; + } + + init_completion(&vnic->dev_info.released); + + vnic->dev_info.dev.class = NULL; + vnic->dev_info.dev.parent = &interface_dev.dev; + vnic->dev_info.dev.release = vnic_release_dev; + snprintf(vnic->dev_info.dev.bus_id, BUS_ID_SIZE, + vnic_config->name); + + if (device_register(&vnic->dev_info.dev)) { + SYS_ERROR("create_vnic: error in registering" + " vnic class dev\n"); + goto free_vnic; + } + + if (sysfs_create_group(&vnic->dev_info.dev.kobj, + &vnic_dev_attr_group)) { + SYS_ERROR("create_vnic: error in creating" + "vnic attr group\n"); + goto err_attr; + + } + + if (vnic_setup_stats_files(vnic)) + goto err_stats; + + return vnic; +err_stats: + sysfs_remove_group(&vnic->dev_info.dev.kobj, + &vnic_dev_attr_group); +err_attr: + device_unregister(&vnic->dev_info.dev); + wait_for_completion(&vnic->dev_info.released); +free_vnic: + list_del(&vnic->list_ptrs); + kfree(vnic); +free_vnic_config: + kfree(vnic_config); + return NULL; +} + +static ssize_t vnic_delete(struct device *dev, struct device_attribute *dev_attr, + const char *buf, size_t count) +{ + struct vnic *vnic; + struct list_head *ptr; + int ret = -EINVAL; + + if (count > IFNAMSIZ) { + printk(KERN_WARNING PFX "invalid vnic interface name\n"); + return ret; + } + + SYS_INFO("vnic_delete: name = %s\n", buf); + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (!strcmp(vnic->config->name, buf)) { + vnic_free(vnic); + return count; + } + } + + printk(KERN_WARNING PFX "vnic interface '%s' does not exist\n", buf); + return ret; +} + +DEVICE_ATTR(delete_vnic, S_IWUSR, NULL, vnic_delete); + +static ssize_t show_viport_state(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct netpath *path = container_of(info, struct netpath, dev_info); + switch (path->viport->state) { + case VIPORT_DISCONNECTED: + return sprintf(buf, "VIPORT_DISCONNECTED\n"); + case VIPORT_CONNECTED: + return sprintf(buf, "VIPORT_CONNECTED\n"); + default: + return sprintf(buf, "INVALID STATE\n"); + } + +} + +static DEVICE_ATTR(viport_state, S_IRUGO, show_viport_state, NULL); + +static ssize_t show_link_state(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct netpath *path = container_of(info, struct netpath, dev_info); + + switch (path->viport->link_state) { + case LINK_UNINITIALIZED: + return sprintf(buf, "LINK_UNINITIALIZED\n"); + case LINK_INITIALIZE: + return sprintf(buf, "LINK_INITIALIZE\n"); + case 
LINK_INITIALIZECONTROL: + return sprintf(buf, "LINK_INITIALIZECONTROL\n"); + case LINK_INITIALIZEDATA: + return sprintf(buf, "LINK_INITIALIZEDATA\n"); + case LINK_CONTROLCONNECT: + return sprintf(buf, "LINK_CONTROLCONNECT\n"); + case LINK_CONTROLCONNECTWAIT: + return sprintf(buf, "LINK_CONTROLCONNECTWAIT\n"); + case LINK_INITVNICREQ: + return sprintf(buf, "LINK_INITVNICREQ\n"); + case LINK_INITVNICRSP: + return sprintf(buf, "LINK_INITVNICRSP\n"); + case LINK_BEGINDATAPATH: + return sprintf(buf, "LINK_BEGINDATAPATH\n"); + case LINK_CONFIGDATAPATHREQ: + return sprintf(buf, "LINK_CONFIGDATAPATHREQ\n"); + case LINK_CONFIGDATAPATHRSP: + return sprintf(buf, "LINK_CONFIGDATAPATHRSP\n"); + case LINK_DATACONNECT: + return sprintf(buf, "LINK_DATACONNECT\n"); + case LINK_DATACONNECTWAIT: + return sprintf(buf, "LINK_DATACONNECTWAIT\n"); + case LINK_XCHGPOOLREQ: + return sprintf(buf, "LINK_XCHGPOOLREQ\n"); + case LINK_XCHGPOOLRSP: + return sprintf(buf, "LINK_XCHGPOOLRSP\n"); + case LINK_INITIALIZED: + return sprintf(buf, "LINK_INITIALIZED\n"); + case LINK_IDLE: + return sprintf(buf, "LINK_IDLE\n"); + case LINK_IDLING: + return sprintf(buf, "LINK_IDLING\n"); + case LINK_CONFIGLINKREQ: + return sprintf(buf, "LINK_CONFIGLINKREQ\n"); + case LINK_CONFIGLINKRSP: + return sprintf(buf, "LINK_CONFIGLINKRSP\n"); + case LINK_CONFIGADDRSREQ: + return sprintf(buf, "LINK_CONFIGADDRSREQ\n"); + case LINK_CONFIGADDRSRSP: + return sprintf(buf, "LINK_CONFIGADDRSRSP\n"); + case LINK_REPORTSTATREQ: + return sprintf(buf, "LINK_REPORTSTATREQ\n"); + case LINK_REPORTSTATRSP: + return sprintf(buf, "LINK_REPORTSTATRSP\n"); + case LINK_HEARTBEATREQ: + return sprintf(buf, "LINK_HEARTBEATREQ\n"); + case LINK_HEARTBEATRSP: + return sprintf(buf, "LINK_HEARTBEATRSP\n"); + case LINK_RESET: + return sprintf(buf, "LINK_RESET\n"); + case LINK_RESETRSP: + return sprintf(buf, "LINK_RESETRSP\n"); + case LINK_RESETCONTROL: + return sprintf(buf, "LINK_RESETCONTROL\n"); + case LINK_RESETCONTROLRSP: + return sprintf(buf, "LINK_RESETCONTROLRSP\n"); + case LINK_DATADISCONNECT: + return sprintf(buf, "LINK_DATADISCONNECT\n"); + case LINK_CONTROLDISCONNECT: + return sprintf(buf, "LINK_CONTROLDISCONNECT\n"); + case LINK_CLEANUPDATA: + return sprintf(buf, "LINK_CLEANUPDATA\n"); + case LINK_CLEANUPCONTROL: + return sprintf(buf, "LINK_CLEANUPCONTROL\n"); + case LINK_DISCONNECTED: + return sprintf(buf, "LINK_DISCONNECTED\n"); + case LINK_RETRYWAIT: + return sprintf(buf, "LINK_RETRYWAIT\n"); + default: + return sprintf(buf, "INVALID STATE\n"); + + } + +} +static DEVICE_ATTR(link_state, S_IRUGO, show_link_state, NULL); + +static ssize_t show_heartbeat(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + /* hb_inteval is in jiffies, convert it back to + * 1/100ths of a second + */ + return sprintf(buf, "%d\n", + (jiffies_to_msecs(path->viport->config->hb_interval)/10)); +} + +static DEVICE_ATTR(heartbeat, S_IRUGO, show_heartbeat, NULL); + +static ssize_t show_ioc_guid(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + return sprintf(buf, "%llx\n", + __be64_to_cpu(path->viport->config->ioc_guid)); +} + +static DEVICE_ATTR(ioc_guid, S_IRUGO, show_ioc_guid, NULL); + +static inline void get_dgid_string(u8 *dgid, char *buf) +{ + int i; + char 
holder[5]; + + for (i = 0; i < 16; i += 2) { + sprintf(holder, "%04x", be16_to_cpu(*(__be16 *)&dgid[i])); + strcat(buf, holder); + } + + strcat(buf, "\n"); +} + +static ssize_t show_dgid(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + get_dgid_string(path->viport->config->path_info.path.dgid.raw, buf); + + return strlen(buf); +} + +static DEVICE_ATTR(dgid, S_IRUGO, show_dgid, NULL); + +static ssize_t show_pkey(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + return sprintf(buf, "%x\n", path->viport->config->path_info.path.pkey); +} + +static DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); + +static ssize_t show_hca_info(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + return sprintf(buf, "vnic-%s-%d\n", path->viport->config->ibdev->name, + path->viport->config->port); +} + +static DEVICE_ATTR(hca_info, S_IRUGO, show_hca_info, NULL); + +static ssize_t show_ioc_string(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + return sprintf(buf, "%s\n", path->viport->config->ioc_string); +} + +static DEVICE_ATTR(ioc_string, S_IRUGO, show_ioc_string, NULL); + +static ssize_t show_multicast_state(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + if (!(path->viport->features_supported & VNIC_FEAT_INBOUND_IB_MC)) + return sprintf(buf, "feature not enabled\n"); + + switch (path->viport->mc_info.state) { + case MCAST_STATE_INVALID: + return sprintf(buf, "state=Invalid\n"); + case MCAST_STATE_JOINING: + return sprintf(buf, "state=Joining MGID:" VNIC_GID_FMT "\n", + VNIC_GID_RAW_ARG(path->viport->mc_info.mgid.raw)); + case MCAST_STATE_ATTACHING: + return sprintf(buf, "state=Attaching MGID:" VNIC_GID_FMT + " MLID:%X\n", + VNIC_GID_RAW_ARG(path->viport->mc_info.mgid.raw), + path->viport->mc_info.mlid); + case MCAST_STATE_JOINED_ATTACHED: + return sprintf(buf, + "state=Joined & Attached MGID:" VNIC_GID_FMT + " MLID:%X\n", + VNIC_GID_RAW_ARG(path->viport->mc_info.mgid.raw), + path->viport->mc_info.mlid); + case MCAST_STATE_DETACHING: + return sprintf(buf, "state=Detaching MGID: " VNIC_GID_FMT "\n", + VNIC_GID_RAW_ARG(path->viport->mc_info.mgid.raw)); + case MCAST_STATE_RETRIED: + return sprintf(buf, "state=Retries Exceeded\n"); + } + return sprintf(buf, "invalid state\n"); +} + +static DEVICE_ATTR(multicast_state, S_IRUGO, show_multicast_state, NULL); + +static struct attribute *vnic_path_attrs[] = { + &dev_attr_viport_state.attr, + &dev_attr_link_state.attr, + &dev_attr_heartbeat.attr, + &dev_attr_ioc_guid.attr, + &dev_attr_dgid.attr, + &dev_attr_pkey.attr, + &dev_attr_hca_info.attr, + &dev_attr_ioc_string.attr, + &dev_attr_multicast_state.attr, + NULL +}; + +struct attribute_group vnic_path_attr_group = { + .attrs = vnic_path_attrs, +}; + + +static int setup_path_class_files(struct netpath *path, 
char *name) +{ + init_completion(&path->dev_info.released); + + path->dev_info.dev.class = NULL; + path->dev_info.dev.parent = &path->parent->dev_info.dev; + path->dev_info.dev.release = vnic_release_dev; + snprintf(path->dev_info.dev.bus_id, BUS_ID_SIZE, name); + + if (device_register(&path->dev_info.dev)) { + SYS_ERROR("error in registering path class dev\n"); + goto out; + } + + if (sysfs_create_group(&path->dev_info.dev.kobj, + &vnic_path_attr_group)) { + SYS_ERROR("error in creating vnic path group attrs"); + goto err_path; + } + + return 0; + +err_path: + device_unregister(&path->dev_info.dev); + wait_for_completion(&path->dev_info.released); +out: + return -1; + +} + +static inline void update_dgids(u8 *old, u8 *new, char *vnic_name, + char *path_name) +{ + int i; + + if (!memcmp(old, new, 16)) + return; + + printk(KERN_INFO PFX "Changing dgid from 0x"); + print_dgid(old); + printk(" to 0x"); + print_dgid(new); + printk(" for %s path of %s\n", path_name, vnic_name); + for (i = 0; i < 16; i++) + old[i] = new[i]; +} + +static inline void update_ioc_guids(struct path_param *params, + struct netpath *path, + char *vnic_name, char *path_name) +{ + u64 sid; + + if (path->viport->config->ioc_guid == params->ioc_guid) + return; + + printk(KERN_INFO PFX "Changing IOC GUID from 0x%llx to 0x%llx " + "for %s path of %s\n", + __be64_to_cpu(path->viport->config->ioc_guid), + __be64_to_cpu(params->ioc_guid), path_name, vnic_name); + + path->viport->config->ioc_guid = params->ioc_guid; + + sid = (SST_AGN << 56) | (SST_OUI << 32) | (CONTROL_PATH_ID << 8) + | IOC_NUMBER(be64_to_cpu(params->ioc_guid)); + + path->viport->config->control_config.ib_config.service_id = + cpu_to_be64(sid); + + sid = (SST_AGN << 56) | (SST_OUI << 32) | (DATA_PATH_ID << 8) + | IOC_NUMBER(be64_to_cpu(params->ioc_guid)); + + path->viport->config->data_config.ib_config.service_id = + cpu_to_be64(sid); +} + +static inline void update_pkeys(__be16 *old, __be16 *new, char *vnic_name, + char *path_name) +{ + if (*old == *new) + return; + + printk(KERN_INFO PFX "Changing P_Key from 0x%x to 0x%x " + "for %s path of %s\n", *old, *new, + path_name, vnic_name); + *old = *new; +} + +static void update_ioc_strings(struct path_param *params, struct netpath *path, + char *path_name) +{ + if (!strcmp(params->ioc_string, path->viport->config->ioc_string)) + return; + + printk(KERN_INFO PFX "Changing ioc_string to %s for %s path of %s\n", + params->ioc_string, path_name, params->name); + + strcpy(path->viport->config->ioc_string, params->ioc_string); +} + +static void update_path_parameters(struct path_param *params, + struct netpath *path) +{ + update_dgids(path->viport->config->path_info.path.dgid.raw, + params->dgid, params->name, + (path->second_bias ? "secondary" : "primary")); + + update_ioc_guids(params, path, params->name, + (path->second_bias ? "secondary" : "primary")); + + update_pkeys(&path->viport->config->path_info.path.pkey, + ¶ms->pkey, params->name, + (path->second_bias ? "secondary" : "primary")); + + update_ioc_strings(params, path, + (path->second_bias ? 
"secondary" : "primary")); +} + +static ssize_t update_params_and_connect(struct path_param *params, + struct netpath *path, size_t count) +{ + if (is_dgid_zero(params->dgid) && params->ioc_guid != 0 && + params->pkey != 0) { + + if (!memcmp(path->viport->config->path_info.path.dgid.raw, + params->dgid, 16) && + params->ioc_guid == path->viport->config->ioc_guid && + params->pkey == path->viport->config->path_info.path.pkey) { + + printk(KERN_WARNING PFX "All of the dgid, ioc_guid and " + "pkeys are same as the existing" + " one. Not updating values.\n"); + return -EINVAL; + } else { + if (path->viport->state == VIPORT_CONNECTED) { + printk(KERN_WARNING PFX "%s path of %s " + "interface is already in connected " + "state. Not updating values.\n", + (path->second_bias ? "Secondary" : "Primary"), + path->parent->config->name); + return -EINVAL; + } else { + update_path_parameters(params, path); + viport_kick(path->viport); + vnic_disconnected(path->parent, path); + return count; + } + } + } else { + printk(KERN_WARNING PFX "Either dgid, iocguid, pkey is zero. " + "No update.\n"); + return -EINVAL; + } +} + +static ssize_t vnic_create_primary(struct device *dev, + struct device_attribute *dev_attr, + const char *buf, size_t count) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic_ib_port *target = + container_of(info, struct vnic_ib_port, pdev_info); + + struct path_param param; + int ret = -EINVAL; + struct vnic *vnic; + struct list_head *ptr; + + param.instance = 0; + param.rx_csum = -1; + param.tx_csum = -1; + param.heartbeat = -1; + param.ib_multicast = -1; + *param.ioc_string = '\0'; + + ret = vnic_parse_options(buf, ¶m); + + if (ret) + goto out; + + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (!strcmp(vnic->config->name, param.name)) { + ret = update_params_and_connect(¶m, + &vnic->primary_path, + count); + goto out; + } + } + + param.ibdev = target->dev->dev; + param.ibport = target; + param.port = target->port_num; + + vnic = create_vnic(¶m); + if (!vnic) { + printk(KERN_ERR PFX "creating vnic failed\n"); + ret = -EINVAL; + goto out; + } + + if (create_netpath(&vnic->primary_path, ¶m)) { + printk(KERN_ERR PFX "creating primary netpath failed\n"); + goto free_vnic; + } + + if (setup_path_class_files(&vnic->primary_path, "primary_path")) + goto free_vnic; + + if (vnic && !vnic->primary_path.viport) { + printk(KERN_ERR PFX "no valid netpaths\n"); + goto free_vnic; + } + + return count; + +free_vnic: + vnic_free(vnic); + ret = -EINVAL; +out: + return ret; +} + +DEVICE_ATTR(create_primary, S_IWUSR, NULL, vnic_create_primary); + +static ssize_t vnic_create_secondary(struct device *dev, + struct device_attribute *dev_attr, + const char *buf, size_t count) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic_ib_port *target = + container_of(info, struct vnic_ib_port, pdev_info); + + struct path_param param; + struct vnic *vnic = NULL; + int ret = -EINVAL; + struct list_head *ptr; + int found = 0; + + param.instance = 0; + param.rx_csum = -1; + param.tx_csum = -1; + param.heartbeat = -1; + param.ib_multicast = -1; + *param.ioc_string = '\0'; + + ret = vnic_parse_options(buf, ¶m); + + if (ret) + goto out; + + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (!strncmp(vnic->config->name, param.name, IFNAMSIZ)) { + if (vnic->secondary_path.viport) { + ret = update_params_and_connect(¶m, + &vnic->secondary_path, + count); + goto out; + } + found = 1; + 
break; + } + } + + if (!found) { + printk(KERN_ERR PFX + "primary connection with name '%s' does not exist\n", + param.name); + ret = -EINVAL; + goto out; + } + + param.ibdev = target->dev->dev; + param.ibport = target; + param.port = target->port_num; + + if (create_netpath(&vnic->secondary_path, ¶m)) { + printk(KERN_ERR PFX "creating secondary netpath failed\n"); + ret = -EINVAL; + goto out; + } + + if (setup_path_class_files(&vnic->secondary_path, "secondary_path")) + goto free_vnic; + + return count; + +free_vnic: + vnic_free(vnic); + ret = -EINVAL; +out: + return ret; +} + +DEVICE_ATTR(create_secondary, S_IWUSR, NULL, vnic_create_secondary); diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h new file mode 100644 index 0000000..7e6aa8d --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h @@ -0,0 +1,51 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_SYS_H_INCLUDED +#define VNIC_SYS_H_INCLUDED + +struct dev_info { + struct device dev; + struct completion released; +}; + +extern struct class vnic_class; +extern struct dev_info interface_dev; +extern struct attribute_group vnic_dev_attr_group; +extern struct attribute_group vnic_path_attr_group; +extern struct device_attribute dev_attr_create_primary; +extern struct device_attribute dev_attr_create_secondary; +extern struct device_attribute dev_attr_delete_vnic; + +extern void vnic_release_dev(struct device *dev); + +#endif /*VNIC_SYS_H_INCLUDED*/ From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:58:24 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 29 May 2008 15:28:24 +0530 Subject: [ofa-general] [PATCH v3 09/13] QLogic VNIC: IB Multicast for Ethernet broadcast/multicast In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> Message-ID: <20080529095824.9943.36889.stgit@localhost.localdomain> From: Usha Srinivasan Implementation of ethernet broadcasting and multicasting for QLogic VNIC interface by making use of underlying IB multicasting. 
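
At a high level the driver follows the standard kernel ULP multicast sequence:
issue an SA join with ib_sa_join_multicast(), and once the join callback reports
success, attach the UD QP to the group's MGID/MLID with ib_attach_mcast() so that
EVIC-forwarded broadcast/multicast frames are delivered. A condensed sketch of
that sequence is below (my_ctx, my_join and my_join_done are illustrative names
only, not driver code; the driver itself defers the attach out of the callback
into its viport state machine via the NEED_MCAST_COMPLETION flag):

#include <linux/err.h>
#include <rdma/ib_verbs.h>
#include <rdma/ib_sa.h>

static struct ib_sa_client my_sa_client;	/* ib_sa_register_client() at init */

struct my_ctx {
	struct ib_device	*device;
	u8			port_num;
	struct ib_qp		*qp;		/* UD QP that should see mcast frames */
	union ib_gid		mgid;		/* group GID, e.g. handed out by EVIC */
	union ib_gid		port_gid;	/* our port GID */
	struct ib_sa_multicast	*mcast;
};

/* SA join completion; a non-zero return tells the SA layer to drop
 * its tracking structure for this join. */
static int my_join_done(int status, struct ib_sa_multicast *mcast)
{
	struct my_ctx *ctx = mcast->context;

	if (status)
		return status;

	/* attach the UD QP so it receives packets sent to the group */
	return ib_attach_mcast(ctx->qp, &mcast->rec.mgid,
			       be16_to_cpu(mcast->rec.mlid));
}

static int my_join(struct my_ctx *ctx)
{
	struct ib_sa_mcmember_rec rec = {
		.mgid		= ctx->mgid,
		.port_gid	= ctx->port_gid,
		.join_state	= 2,	/* non-member join, as this patch uses */
	};
	ib_sa_comp_mask mask = IB_SA_MCMEMBER_REC_MGID |
			       IB_SA_MCMEMBER_REC_PORT_GID |
			       IB_SA_MCMEMBER_REC_JOIN_STATE;

	ctx->mcast = ib_sa_join_multicast(&my_sa_client, ctx->device,
					  ctx->port_num, &rec, mask,
					  GFP_KERNEL, my_join_done, ctx);
	return IS_ERR(ctx->mcast) ? PTR_ERR(ctx->mcast) : 0;
}
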
Signed-off-by: Usha Srinivasan Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c | 319 +++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h | 77 +++++ 2 files changed, 396 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c new file mode 100644 index 0000000..f40ea20 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c @@ -0,0 +1,319 @@ +/* + * Copyright (c) 2008 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include "vnic_viport.h" +#include "vnic_main.h" +#include "vnic_util.h" + +static inline void vnic_set_multicast_state_invalid(struct viport *viport) +{ + viport->mc_info.state = MCAST_STATE_INVALID; + viport->mc_info.mc = NULL; + memset(&viport->mc_info.mgid, 0, sizeof(union ib_gid)); +} + +int vnic_mc_init(struct viport *viport) +{ + MCAST_FUNCTION("vnic_mc_init %p\n", viport); + vnic_set_multicast_state_invalid(viport); + viport->mc_info.retries = 0; + spin_lock_init(&viport->mc_info.lock); + + return 0; +} + +void vnic_mc_uninit(struct viport *viport) +{ + unsigned long flags; + MCAST_FUNCTION("vnic_mc_uninit %p\n", viport); + + spin_lock_irqsave(&viport->mc_info.lock, flags); + if ((viport->mc_info.state != MCAST_STATE_INVALID) && + (viport->mc_info.state != MCAST_STATE_RETRIED)) { + MCAST_ERROR("%s mcast state is not INVALID or RETRIED %d\n", + control_ifcfg_name(&viport->control), + viport->mc_info.state); + } + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + MCAST_FUNCTION("vnic_mc_uninit done\n"); +} + + +/* This function is called when NEED_MCAST_COMPLETION is set. + * It finishes off the join multicast work. 
+ */ +int vnic_mc_join_handle_completion(struct viport *viport) +{ + unsigned int ret = 0; + + MCAST_FUNCTION("vnic_mc_join_handle_completion()\n"); + if (viport->mc_info.state != MCAST_STATE_JOINING) { + MCAST_ERROR("%s unexpected mcast state in handle_completion: " + " %d\n", control_ifcfg_name(&viport->control), + viport->mc_info.state); + ret = -1; + goto out; + } + viport->mc_info.state = MCAST_STATE_ATTACHING; + MCAST_INFO("%s Attaching QP %lx mgid:" + VNIC_GID_FMT " mlid:%x\n", + control_ifcfg_name(&viport->control), jiffies, + VNIC_GID_RAW_ARG(viport->mc_info.mgid.raw), + viport->mc_info.mlid); + ret = ib_attach_mcast(viport->mc_data.ib_conn.qp, &viport->mc_info.mgid, + viport->mc_info.mlid); + if (ret) { + MCAST_ERROR("%s Attach mcast qp failed %d\n", + control_ifcfg_name(&viport->control), ret); + ret = -1; + goto out; + } + viport->mc_info.state = MCAST_STATE_JOINED_ATTACHED; + MCAST_INFO("%s UD QP successfully attached to mcast group\n", + control_ifcfg_name(&viport->control)); + +out: + return ret; +} + +/* NOTE: ib_sa.h says "returning a non-zero value from this callback will + * result in destroying the multicast tracking structure. + */ +static int vnic_mc_join_complete(int status, + struct ib_sa_multicast *multicast) +{ + struct viport *viport = (struct viport *)multicast->context; + unsigned long flags; + + MCAST_FUNCTION("vnic_mc_join_complete() status:%x\n", status); + if (status) { + spin_lock_irqsave(&viport->mc_info.lock, flags); + if (status == -ENETRESET) { + vnic_set_multicast_state_invalid(viport); + viport->mc_info.retries = 0; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + MCAST_ERROR("%s got ENETRESET\n", + control_ifcfg_name(&viport->control)); + goto out; + } + /* perhaps the mcgroup hasn't yet been created - retry */ + viport->mc_info.retries++; + viport->mc_info.mc = NULL; + if (viport->mc_info.retries > MAX_MCAST_JOIN_RETRIES) { + viport->mc_info.state = MCAST_STATE_RETRIED; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + MCAST_ERROR("%s join failed 0x%x - max retries:%d " + "exceeded\n", + control_ifcfg_name(&viport->control), + status, viport->mc_info.retries); + } else { + viport->mc_info.state = MCAST_STATE_INVALID; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + spin_lock_irqsave(&viport->lock, flags); + viport->updates |= NEED_MCAST_JOIN; + spin_unlock_irqrestore(&viport->lock, flags); + viport_kick(viport); + MCAST_ERROR("%s join failed 0x%x - retrying; " + "retries:%d\n", + control_ifcfg_name(&viport->control), + status, viport->mc_info.retries); + } + goto out; + } + + /* finish join work from main state loop for viport - in case + * the work itself cannot be done in a callback environment */ + spin_lock_irqsave(&viport->lock, flags); + viport->mc_info.mlid = be16_to_cpu(multicast->rec.mlid); + viport->updates |= NEED_MCAST_COMPLETION; + spin_unlock_irqrestore(&viport->lock, flags); + viport_kick(viport); + MCAST_INFO("%s setting NEED_MCAST_COMPLETION %x %x\n", + control_ifcfg_name(&viport->control), + multicast->rec.mlid, viport->mc_info.mlid); +out: + return status; +} + +void vnic_mc_join_setup(struct viport *viport, union ib_gid *mgid) +{ + unsigned long flags; + + MCAST_FUNCTION("in vnic_mc_join_setup\n"); + spin_lock_irqsave(&viport->mc_info.lock, flags); + if (viport->mc_info.state != MCAST_STATE_INVALID) { + if (viport->mc_info.state == MCAST_STATE_DETACHING) + MCAST_ERROR("%s detach in progress\n", + control_ifcfg_name(&viport->control)); + else if (viport->mc_info.state == MCAST_STATE_RETRIED) + 
MCAST_ERROR("%s max join retries exceeded\n", + control_ifcfg_name(&viport->control)); + else { + /* join/attach in progress or done */ + /* verify that the current mgid is same as prev mgid */ + if (memcmp(mgid, &viport->mc_info.mgid, sizeof(union ib_gid)) != 0) { + /* Separate MGID for each IOC */ + MCAST_ERROR("%s Multicast Group MGIDs not " + "unique; mgids: " VNIC_GID_FMT + " " VNIC_GID_FMT "\n", + control_ifcfg_name(&viport->control), + VNIC_GID_RAW_ARG(mgid->raw), + VNIC_GID_RAW_ARG(viport->mc_info.mgid.raw)); + } else + MCAST_INFO("%s join already issued: %d\n", + control_ifcfg_name(&viport->control), + viport->mc_info.state); + + } + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + return; + } + viport->mc_info.mgid = *mgid; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + spin_lock_irqsave(&viport->lock, flags); + viport->updates |= NEED_MCAST_JOIN; + spin_unlock_irqrestore(&viport->lock, flags); + viport_kick(viport); + MCAST_INFO("%s setting NEED_MCAST_JOIN \n", + control_ifcfg_name(&viport->control)); +} + +int vnic_mc_join(struct viport *viport) +{ + struct ib_sa_mcmember_rec rec; + ib_sa_comp_mask comp_mask; + unsigned long flags; + int ret = 0; + + MCAST_FUNCTION("vnic_mc_join()\n"); + if (!viport->mc_data.ib_conn.qp) { + MCAST_ERROR("%s qp is NULL\n", + control_ifcfg_name(&viport->control)); + ret = -1; + goto out; + } + spin_lock_irqsave(&viport->mc_info.lock, flags); + if (viport->mc_info.state != MCAST_STATE_INVALID) { + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + MCAST_INFO("%s Multicast join already issued\n", + control_ifcfg_name(&viport->control)); + goto out; + } + viport->mc_info.state = MCAST_STATE_JOINING; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + + memset(&rec, 0, sizeof(rec)); + rec.join_state = 2; /* bit 1 is Nonmember */ + rec.mgid = viport->mc_info.mgid; + rec.port_gid = viport->config->path_info.path.sgid; + + comp_mask = IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_JOIN_STATE; + + MCAST_INFO("%s Joining Multicast group%lx mgid:" + VNIC_GID_FMT " port_gid: " VNIC_GID_FMT "\n", + control_ifcfg_name(&viport->control), jiffies, + VNIC_GID_RAW_ARG(rec.mgid.raw), + VNIC_GID_RAW_ARG(rec.port_gid.raw)); + + viport->mc_info.mc = ib_sa_join_multicast(&vnic_sa_client, + viport->config->ibdev, viport->config->port, + &rec, comp_mask, GFP_KERNEL, + vnic_mc_join_complete, viport); + + if (IS_ERR(viport->mc_info.mc)) { + MCAST_ERROR("%s Multicast joining failed " VNIC_GID_FMT + ".\n", + control_ifcfg_name(&viport->control), + VNIC_GID_RAW_ARG(rec.mgid.raw)); + viport->mc_info.state = MCAST_STATE_INVALID; + ret = -1; + goto out; + } + MCAST_INFO("%s Multicast group join issued mgid:" + VNIC_GID_FMT " port_gid: " VNIC_GID_FMT "\n", + control_ifcfg_name(&viport->control), + VNIC_GID_RAW_ARG(rec.mgid.raw), + VNIC_GID_RAW_ARG(rec.port_gid.raw)); +out: + return ret; +} + +void vnic_mc_leave(struct viport *viport) +{ + unsigned long flags; + unsigned int ret; + struct ib_sa_multicast *mc; + + MCAST_FUNCTION("vnic_mc_leave()\n"); + + spin_lock_irqsave(&viport->mc_info.lock, flags); + if ((viport->mc_info.state == MCAST_STATE_INVALID) || + (viport->mc_info.state == MCAST_STATE_RETRIED)) { + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + return; + } + + if (viport->mc_info.state == MCAST_STATE_JOINED_ATTACHED) { + + viport->mc_info.state = MCAST_STATE_DETACHING; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + ret = ib_detach_mcast(viport->mc_data.ib_conn.qp, + &viport->mc_info.mgid, + 
viport->mc_info.mlid);
+		if (ret) {
+			MCAST_ERROR("%s UD QP Detach failed %d\n",
+				    control_ifcfg_name(&viport->control), ret);
+			return;
+		}
+		MCAST_INFO("%s UD QP detached successfully\n",
+			   control_ifcfg_name(&viport->control));
+		spin_lock_irqsave(&viport->mc_info.lock, flags);
+	}
+	mc = viport->mc_info.mc;
+	vnic_set_multicast_state_invalid(viport);
+	viport->mc_info.retries = 0;
+	spin_unlock_irqrestore(&viport->mc_info.lock, flags);
+
+	if (mc) {
+		MCAST_INFO("%s Freeing up multicast structure.\n",
+			   control_ifcfg_name(&viport->control));
+		ib_sa_free_multicast(mc);
+	}
+	MCAST_FUNCTION("vnic_mc_leave done\n");
+	return;
+}

diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h
new file mode 100644
index 0000000..e049180
--- /dev/null
+++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h
@@ -0,0 +1,77 @@
+/*
+ * Copyright (c) 2008 QLogic, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *	Redistribution and use in source and binary forms, with or
+ *	without modification, are permitted provided that the following
+ *	conditions are met:
+ *
+ *	- Redistributions of source code must retain the above
+ *	  copyright notice, this list of conditions and the following
+ *	  disclaimer.
+ *
+ *	- Redistributions in binary form must reproduce the above
+ *	  copyright notice, this list of conditions and the following
+ *	  disclaimer in the documentation and/or other materials
+ *	  provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef __VNIC_MULTICAST_H__
+#define __VNIC_MULTICAST_H__
+
+enum {
+	MCAST_STATE_INVALID = 0x00,	/* join not attempted or failed */
+	MCAST_STATE_JOINING = 0x01,	/* join mcgroup in progress */
+	MCAST_STATE_ATTACHING = 0x02,	/* join completed with success,
+					 * attach qp to mcgroup in progress
+					 */
+	MCAST_STATE_JOINED_ATTACHED = 0x03,	/* join completed with success */
+	MCAST_STATE_DETACHING = 0x04,	/* detach qp in progress */
+	MCAST_STATE_RETRIED = 0x05,	/* retried join and failed */
+};
+
+#define MAX_MCAST_JOIN_RETRIES	5	/* used to retry join */
+
+struct mc_info {
+	u8			state;
+	spinlock_t		lock;
+	union ib_gid		mgid;
+	u16			mlid;
+	struct ib_sa_multicast	*mc;
+	u8			retries;
+};
+
+
+int vnic_mc_init(struct viport *viport);
+void vnic_mc_uninit(struct viport *viport);
+extern char *control_ifcfg_name(struct control *control);
+
+/* This function is called when a viport gets a multicast mgid from EVIC
+   and must join the multicast group. It sets the NEED_MCAST_JOIN flag, which
+   results in vnic_mc_join being called later. */
+void vnic_mc_join_setup(struct viport *viport, union ib_gid *mgid);
+
+/* This function is called when the NEED_MCAST_JOIN flag is set.
*/ +int vnic_mc_join(struct viport *viport); + +/* This function is called when NEED_MCAST_COMPLETION is set. + It finishes off the join multicast work. */ +int vnic_mc_join_handle_completion(struct viport *viport); + +void vnic_mc_leave(struct viport *viport); + +#endif /* __VNIC_MULTICAST_H__ */ From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:58:54 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 29 May 2008 15:28:54 +0530 Subject: [ofa-general] [PATCH v3 10/13] QLogic VNIC: Driver Statistics collection In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> Message-ID: <20080529095854.9943.19624.stgit@localhost.localdomain> From: Amar Mudrankit Collection of statistics about QLogic VNIC interfaces is implemented in this patch. Signed-off-by: Amar Mudrankit Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath --- drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c | 234 ++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h | 497 +++++++++++++++++++++++++ 2 files changed, 731 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c new file mode 100644 index 0000000..d11a8df --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c @@ -0,0 +1,234 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include + +#include "vnic_main.h" + +cycles_t vnic_recv_ref; + +/* + * TODO: Statistics reporting for control path, data path, + * RDMA times, IOs etc + * + */ +static ssize_t show_lifetime(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + cycles_t time = get_cycles() - vnic->statistics.start_time; + + return sprintf(buf, "%llu\n", (unsigned long long)time); +} + +static DEVICE_ATTR(lifetime, S_IRUGO, show_lifetime, NULL); + +static ssize_t show_conntime(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + if (vnic->statistics.conn_time) + return sprintf(buf, "%llu\n", + (unsigned long long)vnic->statistics.conn_time); + return 0; +} + +static DEVICE_ATTR(connection_time, S_IRUGO, show_conntime, NULL); + +static ssize_t show_disconnects(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + u32 num; + + if (vnic->statistics.disconn_ref) + num = vnic->statistics.disconn_num + 1; + else + num = vnic->statistics.disconn_num; + + return sprintf(buf, "%d\n", num); +} + +static DEVICE_ATTR(disconnects, S_IRUGO, show_disconnects, NULL); + +static ssize_t show_total_disconn_time(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + cycles_t time; + + if (vnic->statistics.disconn_ref) + time = vnic->statistics.disconn_time + + get_cycles() - vnic->statistics.disconn_ref; + else + time = vnic->statistics.disconn_time; + + return sprintf(buf, "%llu\n", (unsigned long long)time); +} + +static DEVICE_ATTR(total_disconn_time, S_IRUGO, show_total_disconn_time, NULL); + +static ssize_t show_carrier_losses(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + u32 num; + + if (vnic->statistics.carrier_ref) + num = vnic->statistics.carrier_off_num + 1; + else + num = vnic->statistics.carrier_off_num; + + return sprintf(buf, "%d\n", num); +} + +static DEVICE_ATTR(carrier_losses, S_IRUGO, show_carrier_losses, NULL); + +static ssize_t show_total_carr_loss_time(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + cycles_t time; + + if (vnic->statistics.carrier_ref) + time = vnic->statistics.carrier_off_time + + get_cycles() - vnic->statistics.carrier_ref; + else + time = vnic->statistics.carrier_off_time; + + return sprintf(buf, "%llu\n", (unsigned long long)time); +} + +static DEVICE_ATTR(total_carrier_loss_time, S_IRUGO, + show_total_carr_loss_time, NULL); + +static ssize_t show_total_recv_time(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%llu\n", + (unsigned long 
long)vnic->statistics.recv_time); +} + +static DEVICE_ATTR(total_recv_time, S_IRUGO, show_total_recv_time, NULL); + +static ssize_t show_recvs(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.recv_num); +} + +static DEVICE_ATTR(recvs, S_IRUGO, show_recvs, NULL); + +static ssize_t show_multicast_recvs(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.multicast_recv_num); +} + +static DEVICE_ATTR(multicast_recvs, S_IRUGO, show_multicast_recvs, NULL); + +static ssize_t show_total_xmit_time(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%llu\n", + (unsigned long long)vnic->statistics.xmit_time); +} + +static DEVICE_ATTR(total_xmit_time, S_IRUGO, show_total_xmit_time, NULL); + +static ssize_t show_xmits(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.xmit_num); +} + +static DEVICE_ATTR(xmits, S_IRUGO, show_xmits, NULL); + +static ssize_t show_failed_xmits(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.xmit_fail); +} + +static DEVICE_ATTR(failed_xmits, S_IRUGO, show_failed_xmits, NULL); + +static struct attribute *vnic_stats_attrs[] = { + &dev_attr_lifetime.attr, + &dev_attr_xmits.attr, + &dev_attr_total_xmit_time.attr, + &dev_attr_failed_xmits.attr, + &dev_attr_recvs.attr, + &dev_attr_multicast_recvs.attr, + &dev_attr_total_recv_time.attr, + &dev_attr_connection_time.attr, + &dev_attr_disconnects.attr, + &dev_attr_total_disconn_time.attr, + &dev_attr_carrier_losses.attr, + &dev_attr_total_carrier_loss_time.attr, + NULL +}; + +struct attribute_group vnic_stats_attr_group = { + .attrs = vnic_stats_attrs, +}; diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h new file mode 100644 index 0000000..a241b71 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h @@ -0,0 +1,497 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_STATS_H_INCLUDED +#define VNIC_STATS_H_INCLUDED + +#include "vnic_main.h" +#include "vnic_ib.h" +#include "vnic_sys.h" + +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + +static inline void vnic_connected_stats(struct vnic *vnic) +{ + if (vnic->statistics.conn_time == 0) { + vnic->statistics.conn_time = + get_cycles() - vnic->statistics.start_time; + } + + if (vnic->statistics.disconn_ref != 0) { + vnic->statistics.disconn_time += + get_cycles() - vnic->statistics.disconn_ref; + vnic->statistics.disconn_num++; + vnic->statistics.disconn_ref = 0; + } + +} + +static inline void vnic_stop_xmit_stats(struct vnic *vnic) +{ + if (vnic->statistics.xmit_ref == 0) + vnic->statistics.xmit_ref = get_cycles(); +} + +static inline void vnic_restart_xmit_stats(struct vnic *vnic) +{ + if (vnic->statistics.xmit_ref != 0) { + vnic->statistics.xmit_off_time += + get_cycles() - vnic->statistics.xmit_ref; + vnic->statistics.xmit_off_num++; + vnic->statistics.xmit_ref = 0; + } +} + +static inline void vnic_recv_pkt_stats(struct vnic *vnic) +{ + vnic->statistics.recv_time += get_cycles() - vnic_recv_ref; + vnic->statistics.recv_num++; +} + +static inline void vnic_multicast_recv_pkt_stats(struct vnic *vnic) +{ + vnic->statistics.multicast_recv_num++; +} + +static inline void vnic_pre_pkt_xmit_stats(cycles_t *time) +{ + *time = get_cycles(); +} + +static inline void vnic_post_pkt_xmit_stats(struct vnic *vnic, + cycles_t time) +{ + vnic->statistics.xmit_time += get_cycles() - time; + vnic->statistics.xmit_num++; + +} + +static inline void vnic_xmit_fail_stats(struct vnic *vnic) +{ + vnic->statistics.xmit_fail++; +} + +static inline void vnic_carrier_loss_stats(struct vnic *vnic) +{ + if (vnic->statistics.carrier_ref != 0) { + vnic->statistics.carrier_off_time += + get_cycles() - vnic->statistics.carrier_ref; + vnic->statistics.carrier_off_num++; + vnic->statistics.carrier_ref = 0; + } +} + +static inline int vnic_setup_stats_files(struct vnic *vnic) +{ + init_completion(&vnic->stat_info.released); + vnic->stat_info.dev.class = NULL; + vnic->stat_info.dev.parent = &vnic->dev_info.dev; + vnic->stat_info.dev.release = vnic_release_dev; + snprintf(vnic->stat_info.dev.bus_id, BUS_ID_SIZE, + "stats"); + + if (device_register(&vnic->stat_info.dev)) { + SYS_ERROR("create_vnic: error in registering" + " stat class dev\n"); + goto stats_out; + } + + if (sysfs_create_group(&vnic->stat_info.dev.kobj, + &vnic_stats_attr_group)) + goto err_stats_file; + + return 0; +err_stats_file: + device_unregister(&vnic->stat_info.dev); + wait_for_completion(&vnic->stat_info.released); +stats_out: + return -1; +} + +static inline void vnic_cleanup_stats_files(struct vnic *vnic) +{ + sysfs_remove_group(&vnic->dev_info.dev.kobj, + &vnic_stats_attr_group); + device_unregister(&vnic->stat_info.dev); + 
wait_for_completion(&vnic->stat_info.released); +} + +static inline void vnic_disconn_stats(struct vnic *vnic) +{ + if (!vnic->statistics.disconn_ref) + vnic->statistics.disconn_ref = get_cycles(); + + if (vnic->statistics.carrier_ref == 0) + vnic->statistics.carrier_ref = get_cycles(); +} + +static inline void vnic_alloc_stats(struct vnic *vnic) +{ + vnic->statistics.start_time = get_cycles(); +} + +static inline void control_note_rsptime_stats(cycles_t *time) +{ + *time = get_cycles(); +} + +static inline void control_update_rsptime_stats(struct control *control, + cycles_t response_time) +{ + response_time -= control->statistics.request_time; + control->statistics.response_time += response_time; + control->statistics.response_num++; + if (control->statistics.response_max < response_time) + control->statistics.response_max = response_time; + if ((control->statistics.response_min == 0) || + (control->statistics.response_min > response_time)) + control->statistics.response_min = response_time; + +} + +static inline void control_note_reqtime_stats(struct control *control) +{ + control->statistics.request_time = get_cycles(); +} + +static inline void control_timeout_stats(struct control *control) +{ + control->statistics.timeout_num++; +} + +static inline void data_kickreq_stats(struct data *data) +{ + data->statistics.kick_reqs++; +} + +static inline void data_no_xmitbuf_stats(struct data *data) +{ + data->statistics.no_xmit_bufs++; +} + +static inline void data_xmits_stats(struct data *data) +{ + data->statistics.xmit_num++; +} + +static inline void data_recvs_stats(struct data *data) +{ + data->statistics.recv_num++; +} + +static inline void data_note_kickrcv_time(void) +{ + vnic_recv_ref = get_cycles(); +} + +static inline void data_rcvkicks_stats(struct data *data) +{ + data->statistics.kick_recvs++; +} + + +static inline void vnic_ib_conntime_stats(struct vnic_ib_conn *ib_conn) +{ + ib_conn->statistics.connection_time = get_cycles(); +} + +static inline void vnic_ib_note_comptime_stats(cycles_t *time) +{ + *time = get_cycles(); +} + +static inline void vnic_ib_callback_stats(struct vnic_ib_conn *ib_conn) +{ + ib_conn->statistics.num_callbacks++; +} + +static inline void vnic_ib_comp_stats(struct vnic_ib_conn *ib_conn, + u32 *comp_num) +{ + ib_conn->statistics.num_ios++; + *comp_num = *comp_num + 1; + +} + +static inline void vnic_ib_io_stats(struct io *io, + struct vnic_ib_conn *ib_conn, + cycles_t comp_time) +{ + if ((io->type == RECV) || (io->type == RECV_UD)) + io->time = comp_time; + else if (io->type == RDMA) { + ib_conn->statistics.rdma_comp_time += comp_time - io->time; + ib_conn->statistics.rdma_comp_ios++; + } else if (io->type == SEND) { + ib_conn->statistics.send_comp_time += comp_time - io->time; + ib_conn->statistics.send_comp_ios++; + } +} + +static inline void vnic_ib_maxio_stats(struct vnic_ib_conn *ib_conn, + u32 comp_num) +{ + if (comp_num > ib_conn->statistics.max_ios) + ib_conn->statistics.max_ios = comp_num; +} + +static inline void vnic_ib_connected_time_stats(struct vnic_ib_conn *ib_conn) +{ + ib_conn->statistics.connection_time = + get_cycles() - ib_conn->statistics.connection_time; + +} + +static inline void vnic_ib_pre_rcvpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t *time) +{ + *time = get_cycles(); + if (io->time != 0) { + ib_conn->statistics.recv_comp_time += *time - io->time; + ib_conn->statistics.recv_comp_ios++; + } + +} + +static inline void vnic_ib_post_rcvpost_stats(struct vnic_ib_conn *ib_conn, + cycles_t time) +{ + 
ib_conn->statistics.recv_post_time += get_cycles() - time; + ib_conn->statistics.recv_post_ios++; +} + +static inline void vnic_ib_pre_sendpost_stats(struct io *io, + cycles_t *time) +{ + io->time = *time = get_cycles(); +} + +static inline void vnic_ib_post_sendpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t time) +{ + time = get_cycles() - time; + if (io->swr.opcode == IB_WR_RDMA_WRITE) { + ib_conn->statistics.rdma_post_time += time; + ib_conn->statistics.rdma_post_ios++; + } else { + ib_conn->statistics.send_post_time += time; + ib_conn->statistics.send_post_ios++; + } +} +#else /*CONFIG_INIFINIBAND_VNIC_STATS*/ + +static inline void vnic_connected_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_stop_xmit_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_restart_xmit_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_recv_pkt_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_multicast_recv_pkt_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_pre_pkt_xmit_stats(cycles_t *time) +{ + ; +} + +static inline void vnic_post_pkt_xmit_stats(struct vnic *vnic, + cycles_t time) +{ + ; +} + +static inline void vnic_xmit_fail_stats(struct vnic *vnic) +{ + ; +} + +static inline int vnic_setup_stats_files(struct vnic *vnic) +{ + return 0; +} + +static inline void vnic_cleanup_stats_files(struct vnic *vnic) +{ + ; +} + +static inline void vnic_carrier_loss_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_disconn_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_alloc_stats(struct vnic *vnic) +{ + ; +} + +static inline void control_note_rsptime_stats(cycles_t *time) +{ + ; +} + +static inline void control_update_rsptime_stats(struct control *control, + cycles_t response_time) +{ + ; +} + +static inline void control_note_reqtime_stats(struct control *control) +{ + ; +} + +static inline void control_timeout_stats(struct control *control) +{ + ; +} + +static inline void data_kickreq_stats(struct data *data) +{ + ; +} + +static inline void data_no_xmitbuf_stats(struct data *data) +{ + ; +} + +static inline void data_xmits_stats(struct data *data) +{ + ; +} + +static inline void data_recvs_stats(struct data *data) +{ + ; +} + +static inline void data_note_kickrcv_time(void) +{ + ; +} + +static inline void data_rcvkicks_stats(struct data *data) +{ + ; +} + +static inline void vnic_ib_conntime_stats(struct vnic_ib_conn *ib_conn) +{ + ; +} + +static inline void vnic_ib_note_comptime_stats(cycles_t *time) +{ + ; +} + +static inline void vnic_ib_callback_stats(struct vnic_ib_conn *ib_conn) + +{ + ; +} +static inline void vnic_ib_comp_stats(struct vnic_ib_conn *ib_conn, + u32 *comp_num) +{ + ; +} + +static inline void vnic_ib_io_stats(struct io *io, + struct vnic_ib_conn *ib_conn, + cycles_t comp_time) +{ + ; +} + +static inline void vnic_ib_maxio_stats(struct vnic_ib_conn *ib_conn, + u32 comp_num) +{ + ; +} + +static inline void vnic_ib_connected_time_stats(struct vnic_ib_conn *ib_conn) +{ + ; +} + +static inline void vnic_ib_pre_rcvpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t *time) +{ + ; +} + +static inline void vnic_ib_post_rcvpost_stats(struct vnic_ib_conn *ib_conn, + cycles_t time) +{ + ; +} + +static inline void vnic_ib_pre_sendpost_stats(struct io *io, + cycles_t *time) +{ + ; +} + +static inline void vnic_ib_post_sendpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t time) +{ + ; +} +#endif /*CONFIG_INIFINIBAND_VNIC_STATS*/ + +#endif 
/*VNIC_STATS_H_INCLUDED*/ From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:59:25 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 29 May 2008 15:29:25 +0530 Subject: [ofa-general] [PATCH v3 11/13] QLogic VNIC: Driver utility file - implements various utility macros In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> Message-ID: <20080529095925.9943.21164.stgit@localhost.localdomain> From: Poornima Kamath This patch adds the driver utility file which mainly contains utility macros for debugging of QLogic VNIC driver. Signed-off-by: Poornima Kamath Signed-off-by: Ramachandra K Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_util.h | 236 ++++++++++++++++++++++++++ 1 files changed, 236 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_util.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_util.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_util.h new file mode 100644 index 0000000..095fa3a --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_util.h @@ -0,0 +1,236 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_UTIL_H_INCLUDED +#define VNIC_UTIL_H_INCLUDED + +#define MODULE_NAME "QLGC_VNIC" + +#define VNIC_MAJORVERSION 1 +#define VNIC_MINORVERSION 1 + +#define ALIGN_DOWN(x, a) ((x)&(~((a)-1))) + +extern u32 vnic_debug; + +enum { + DEBUG_IB_INFO = 0x00000001, + DEBUG_IB_FUNCTION = 0x00000002, + DEBUG_IB_FSTATUS = 0x00000004, + DEBUG_IB_ASSERTS = 0x00000008, + DEBUG_CONTROL_INFO = 0x00000010, + DEBUG_CONTROL_FUNCTION = 0x00000020, + DEBUG_CONTROL_PACKET = 0x00000040, + DEBUG_CONFIG_INFO = 0x00000100, + DEBUG_DATA_INFO = 0x00001000, + DEBUG_DATA_FUNCTION = 0x00002000, + DEBUG_NETPATH_INFO = 0x00010000, + DEBUG_VIPORT_INFO = 0x00100000, + DEBUG_VIPORT_FUNCTION = 0x00200000, + DEBUG_LINK_STATE = 0x00400000, + DEBUG_VNIC_INFO = 0x01000000, + DEBUG_VNIC_FUNCTION = 0x02000000, + DEBUG_MCAST_INFO = 0x04000000, + DEBUG_MCAST_FUNCTION = 0x08000000, + DEBUG_SYS_INFO = 0x10000000, + DEBUG_SYS_VERBOSE = 0x40000000 +}; + +#define PRINT(level, x, fmt, arg...) 
\ + printk(level "%s: " fmt, MODULE_NAME, ##arg) + +#define PRINT_CONDITIONAL(level, x, condition, fmt, arg...) \ + do { \ + if (condition) \ + printk(level "%s: %s: " fmt, \ + MODULE_NAME, x, ##arg); \ + } while (0) + +#define IB_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "IB", fmt, ##arg) +#define IB_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "IB", fmt, ##arg) + +#define IB_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "IB", \ + (vnic_debug & DEBUG_IB_FUNCTION), \ + fmt, ##arg) + +#define IB_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "IB", \ + (vnic_debug & DEBUG_IB_INFO), \ + fmt, ##arg) + +#define IB_ASSERT(x) \ + do { \ + if ((vnic_debug & DEBUG_IB_ASSERTS) && !(x)) \ + panic("%s assertion failed, file: %s," \ + " line %d: ", \ + MODULE_NAME, __FILE__, __LINE__) \ + } while (0) + +#define CONTROL_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "CONTROL", fmt, ##arg) +#define CONTROL_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "CONTROL", fmt, ##arg) + +#define CONTROL_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "CONTROL", \ + (vnic_debug & DEBUG_CONTROL_INFO), \ + fmt, ##arg) + +#define CONTROL_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "CONTROL", \ + (vnic_debug & DEBUG_CONTROL_FUNCTION), \ + fmt, ##arg) + +#define CONTROL_PACKET(pkt) \ + do { \ + if (vnic_debug & DEBUG_CONTROL_PACKET) \ + control_log_control_packet(pkt); \ + } while (0) + +#define CONFIG_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "CONFIG", fmt, ##arg) +#define CONFIG_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "CONFIG", fmt, ##arg) + +#define CONFIG_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "CONFIG", \ + (vnic_debug & DEBUG_CONFIG_INFO), \ + fmt, ##arg) + +#define DATA_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "DATA", fmt, ##arg) +#define DATA_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "DATA", fmt, ##arg) + +#define DATA_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "DATA", \ + (vnic_debug & DEBUG_DATA_INFO), \ + fmt, ##arg) + +#define DATA_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "DATA", \ + (vnic_debug & DEBUG_DATA_FUNCTION), \ + fmt, ##arg) + + +#define MCAST_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "MCAST", fmt, ##arg) +#define MCAST_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "MCAST", fmt, ##arg) + +#define MCAST_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "MCAST", \ + (vnic_debug & DEBUG_MCAST_INFO), \ + fmt, ##arg) + +#define MCAST_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "MCAST", \ + (vnic_debug & DEBUG_MCAST_FUNCTION), \ + fmt, ##arg) + +#define NETPATH_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "NETPATH", fmt, ##arg) +#define NETPATH_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "NETPATH", fmt, ##arg) + +#define NETPATH_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "NETPATH", \ + (vnic_debug & DEBUG_NETPATH_INFO), \ + fmt, ##arg) + +#define VIPORT_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "VIPORT", fmt, ##arg) +#define VIPORT_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "VIPORT", fmt, ##arg) + +#define VIPORT_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "VIPORT", \ + (vnic_debug & DEBUG_VIPORT_INFO), \ + fmt, ##arg) + +#define VIPORT_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "VIPORT", \ + (vnic_debug & DEBUG_VIPORT_FUNCTION), \ + fmt, ##arg) + +#define LINK_STATE(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "LINK", \ + (vnic_debug & DEBUG_LINK_STATE), \ + fmt, ##arg) + +#define VNIC_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "NIC", fmt, ##arg) +#define VNIC_ERROR(fmt, arg...) 
\ + PRINT(KERN_ERR, "NIC", fmt, ##arg) +#define VNIC_INIT(fmt, arg...) \ + PRINT(KERN_INFO, "NIC", fmt, ##arg) + +#define VNIC_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "NIC", \ + (vnic_debug & DEBUG_VNIC_INFO), \ + fmt, ##arg) + +#define VNIC_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "NIC", \ + (vnic_debug & DEBUG_VNIC_FUNCTION), \ + fmt, ##arg) + +#define SYS_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "SYS", fmt, ##arg) +#define SYS_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "SYS", fmt, ##arg) + +#define SYS_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "SYS", \ + (vnic_debug & DEBUG_SYS_INFO), \ + fmt, ##arg) + +#endif /* VNIC_UTIL_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:59:55 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 29 May 2008 15:29:55 +0530 Subject: [ofa-general] [PATCH v3 12/13] QLogic VNIC: Driver Kconfig and Makefile. In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> Message-ID: <20080529095955.9943.48616.stgit@localhost.localdomain> From: Ramachandra K Kconfig and Makefile for the QLogic VNIC driver. Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/Kconfig | 19 +++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/Makefile | 13 +++++++++++++ 2 files changed, 32 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/Kconfig create mode 100644 drivers/infiniband/ulp/qlgc_vnic/Makefile diff --git a/drivers/infiniband/ulp/qlgc_vnic/Kconfig b/drivers/infiniband/ulp/qlgc_vnic/Kconfig new file mode 100644 index 0000000..7b4030e --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/Kconfig @@ -0,0 +1,19 @@ +config INFINIBAND_QLGC_VNIC + tristate "QLogic VNIC - Support for QLogic Ethernet Virtual I/O Controller" + depends on INFINIBAND && NETDEVICES && INET + ---help--- + Support for the QLogic Ethernet Virtual I/O Controller + (EVIC). In conjunction with the EVIC, this provides virtual + ethernet interfaces and transports ethernet packets over + InfiniBand so that you can communicate with Ethernet networks + using your IB device. + +config INFINIBAND_QLGC_VNIC_STATS + bool "QLogic VNIC Statistics" + depends on INFINIBAND_QLGC_VNIC + default n + ---help--- + This option compiles statistics collecting code into the + data path of the QLogic VNIC driver to help in profiling and fine + tuning. This adds some overhead in the interest of gathering + data. 
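
The INFINIBAND_QLGC_VNIC_STATS option has a purely compile-time effect: when it
is enabled, vnic_stats.h (patch 10/13) provides static inline hooks that record
cycle counters in the data path; when it is disabled, the same hooks compile to
empty static inlines that the compiler optimizes away, so a stats-free build
pays no cost. A minimal sketch of the pattern (my_dev and my_xmit_stats are
illustrative names, not driver code):

#include <linux/types.h>
#include <asm/timex.h>	/* cycles_t, get_cycles() */

struct my_dev {
	struct {
		cycles_t	xmit_time;
		u32		xmit_num;
	} stats;
};

#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS
/* real hook: accumulate cycles spent transmitting */
static inline void my_xmit_stats(struct my_dev *dev, cycles_t start)
{
	dev->stats.xmit_time += get_cycles() - start;
	dev->stats.xmit_num++;
}
#else
/* stub: empty body, no overhead when the option is off */
static inline void my_xmit_stats(struct my_dev *dev, cycles_t start)
{
}
#endif
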
diff --git a/drivers/infiniband/ulp/qlgc_vnic/Makefile b/drivers/infiniband/ulp/qlgc_vnic/Makefile
new file mode 100644
index 0000000..509dd67
--- /dev/null
+++ b/drivers/infiniband/ulp/qlgc_vnic/Makefile
@@ -0,0 +1,13 @@
+obj-$(CONFIG_INFINIBAND_QLGC_VNIC)	+= qlgc_vnic.o
+
+qlgc_vnic-y :=	vnic_main.o \
+		vnic_ib.o \
+		vnic_viport.o \
+		vnic_control.o \
+		vnic_data.o \
+		vnic_netpath.o \
+		vnic_config.o \
+		vnic_sys.o \
+		vnic_multicast.o
+
+qlgc_vnic-$(CONFIG_INFINIBAND_QLGC_VNIC_STATS) += vnic_stats.o

From ramachandra.kuchimanchi at qlogic.com  Thu May 29 03:00:25 2008
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Thu, 29 May 2008 15:30:25 +0530
Subject: [ofa-general] [PATCH v3 13/13] QLogic VNIC: Modifications to IB Kconfig and Makefile
In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain>
References: <20080529095126.9943.84692.stgit@localhost.localdomain>
Message-ID: <20080529100025.9943.35838.stgit@localhost.localdomain>

From: Ramachandra K

This patch modifies the top-level InfiniBand Kconfig and Makefile to include
the QLogic VNIC driver as a new ULP.

Signed-off-by: Ramachandra K
Signed-off-by: Poornima Kamath
Signed-off-by: Amar Mudrankit
---

 drivers/infiniband/Kconfig  |    2 ++
 drivers/infiniband/Makefile |    1 +
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index a5dc78a..0775df5 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -53,4 +53,6 @@ source "drivers/infiniband/ulp/srp/Kconfig"
 source "drivers/infiniband/ulp/iser/Kconfig"
 
+source "drivers/infiniband/ulp/qlgc_vnic/Kconfig"
+
 endif # INFINIBAND

diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile
index ed35e44..845271e 100644
--- a/drivers/infiniband/Makefile
+++ b/drivers/infiniband/Makefile
@@ -9,3 +9,4 @@ obj-$(CONFIG_INFINIBAND_NES)	+= hw/nes/
 obj-$(CONFIG_INFINIBAND_IPOIB)	+= ulp/ipoib/
 obj-$(CONFIG_INFINIBAND_SRP)	+= ulp/srp/
 obj-$(CONFIG_INFINIBAND_ISER)	+= ulp/iser/
+obj-$(CONFIG_INFINIBAND_QLGC_VNIC)	+= ulp/qlgc_vnic/

From weiny2 at llnl.gov  Thu May 29 10:06:17 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 29 May 2008 10:06:17 -0700
Subject: [ofa-general] Infiniband back-to-back without OpenSM?
In-Reply-To: 
References: <1211978369.13185.351.camel@hrosenstock-ws.xsigo.com>
	<1211979817.13185.357.camel@hrosenstock-ws.xsigo.com>
	<1211981650.13185.362.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080529100617.11d9b492.weiny2@llnl.gov>

On Thu, 29 May 2008 11:37:15 -0400 (EDT)
James Lentini wrote:

> 
> 
> On Wed, 28 May 2008, Hal Rosenstock wrote:
> 
> > On Wed, 2008-05-28 at 09:24 -0400, Talpey, Thomas wrote:
> > 
> > > Maybe I'm getting ahead of myself though, still wondering if there's a way
> > > to do it with what we have.
> > 
> > The closest thing is OpenSM run once mode but I think you've been
> > describing a b2b mini SM command which wouldn't be hard to implement.
> 
> Unrelated to NFS/RDMA, I wrote a small kernel module that used MADs
> to assign a lid, and then transitioned the port to ARMED and ACTIVE.
> This worked for enabling IB communication, but not IPoIB. In
> retrospect, I probably could have implemented the same functionality
> in userspace.
> 

Have/could you release this? I would be interested in looking at it.
Thanks, Ira From shemminger at vyatta.com Thu May 29 10:27:52 2008 From: shemminger at vyatta.com (Stephen Hemminger) Date: Thu, 29 May 2008 10:27:52 -0700 Subject: [ofa-general] Re: [PATCH v3 01/13] QLogic VNIC: Driver - netdev implementation In-Reply-To: <20080529095423.9943.77528.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> <20080529095423.9943.77528.stgit@localhost.localdomain> Message-ID: <20080529102752.584147ee@extreme> On Thu, 29 May 2008 15:24:23 +0530 Ramachandra K wrote: > From: Ramachandra K > > QLogic Virtual NIC Driver. This patch implements netdev registration, > netdev functions and state maintenance of the QLogic Virtual NIC > corresponding to the various events associated with the QLogic Ethernet > Virtual I/O Controller (EVIC/VEx) connection. > > Signed-off-by: Ramachandra K > Signed-off-by: Poornima Kamath > Signed-off-by: Amar Mudrankit > --- > > drivers/infiniband/ulp/qlgc_vnic/vnic_main.c | 1098 ++++++++++++++++++++++++++ > drivers/infiniband/ulp/qlgc_vnic/vnic_main.h | 154 ++++ > 2 files changed, 1252 insertions(+), 0 deletions(-) > create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.c > create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.h > > diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c > new file mode 100644 > index 0000000..570c069 > --- /dev/null > +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c > @@ -0,0 +1,1098 @@ > +/* > + * Copyright (c) 2006 QLogic, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "vnic_util.h" > +#include "vnic_main.h" > +#include "vnic_netpath.h" > +#include "vnic_viport.h" > +#include "vnic_ib.h" > +#include "vnic_stats.h" > + > +#define MODULEVERSION "1.3.0.0.4" > +#define MODULEDETAILS \ > + "QLogic Corp. 
Virtual NIC (VNIC) driver version " MODULEVERSION > + > +MODULE_AUTHOR("QLogic Corp."); > +MODULE_DESCRIPTION(MODULEDETAILS); > +MODULE_LICENSE("Dual BSD/GPL"); > +MODULE_SUPPORTED_DEVICE("QLogic Ethernet Virtual I/O Controller"); > + > +u32 vnic_debug; > + > +module_param(vnic_debug, uint, 0444); > +MODULE_PARM_DESC(vnic_debug, "Enable debug tracing if > 0"); maybe migrate this to ethtool msg_level? > + > +LIST_HEAD(vnic_list); > + > +static DECLARE_WAIT_QUEUE_HEAD(vnic_npevent_queue); > +static LIST_HEAD(vnic_npevent_list); > +static DECLARE_COMPLETION(vnic_npevent_thread_exit); > +static spinlock_t vnic_npevent_list_lock; > +static struct task_struct *vnic_npevent_thread; > +static int vnic_npevent_thread_end; > + > +static const char *const vnic_npevent_str[] = { > + "PRIMARY CONNECTED", > + "PRIMARY DISCONNECTED", > + "PRIMARY CARRIER", > + "PRIMARY NO CARRIER", > + "PRIMARY TIMER EXPIRED", > + "PRIMARY SETLINK", > + "SECONDARY CONNECTED", > + "SECONDARY DISCONNECTED", > + "SECONDARY CARRIER", > + "SECONDARY NO CARRIER", > + "SECONDARY TIMER EXPIRED", > + "SECONDARY SETLINK", > + "FREE VNIC", > +}; > + > +void vnic_connected(struct vnic *vnic, struct netpath *netpath) > +{ > + VNIC_FUNCTION("vnic_connected()\n"); > + if (netpath->second_bias) > + vnic_npevent_queue_evt(netpath, VNIC_SECNP_CONNECTED); > + else > + vnic_npevent_queue_evt(netpath, VNIC_PRINP_CONNECTED); > + > + vnic_connected_stats(vnic); > +} > + > +void vnic_disconnected(struct vnic *vnic, struct netpath *netpath) > +{ > + VNIC_FUNCTION("vnic_disconnected()\n"); > + if (netpath->second_bias) > + vnic_npevent_queue_evt(netpath, VNIC_SECNP_DISCONNECTED); > + else > + vnic_npevent_queue_evt(netpath, VNIC_PRINP_DISCONNECTED); > +} > + > +void vnic_link_up(struct vnic *vnic, struct netpath *netpath) > +{ > + VNIC_FUNCTION("vnic_link_up()\n"); > + if (netpath->second_bias) > + vnic_npevent_queue_evt(netpath, VNIC_SECNP_LINKUP); > + else > + vnic_npevent_queue_evt(netpath, VNIC_PRINP_LINKUP); > +} > + > +void vnic_link_down(struct vnic *vnic, struct netpath *netpath) > +{ > + VNIC_FUNCTION("vnic_link_down()\n"); > + if (netpath->second_bias) > + vnic_npevent_queue_evt(netpath, VNIC_SECNP_LINKDOWN); > + else > + vnic_npevent_queue_evt(netpath, VNIC_PRINP_LINKDOWN); > +} > + > +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath) > +{ > + unsigned long flags; > + > + VNIC_FUNCTION("vnic_stop_xmit()\n"); > + spin_lock_irqsave(&vnic->current_path_lock, flags); > + if (netpath == vnic->current_path) { > + if (!netif_queue_stopped(vnic->netdevice)) { > + netif_stop_queue(vnic->netdevice); > + vnic->failed_over = 0; > + } > + > + vnic_stop_xmit_stats(vnic); > + } > + spin_unlock_irqrestore(&vnic->current_path_lock, flags); > +} > + > +void vnic_restart_xmit(struct vnic *vnic, struct netpath *netpath) > +{ > + unsigned long flags; > + > + VNIC_FUNCTION("vnic_restart_xmit()\n"); > + spin_lock_irqsave(&vnic->current_path_lock, flags); > + if (netpath == vnic->current_path) { > + if (netif_queue_stopped(vnic->netdevice)) > + netif_wake_queue(vnic->netdevice); > + > + vnic_restart_xmit_stats(vnic); > + } > + spin_unlock_irqrestore(&vnic->current_path_lock, flags); > +} > + > +void vnic_recv_packet(struct vnic *vnic, struct netpath *netpath, > + struct sk_buff *skb) > +{ > + VNIC_FUNCTION("vnic_recv_packet()\n"); > + if ((netpath != vnic->current_path) || !vnic->open) { > + VNIC_INFO("tossing packet\n"); > + dev_kfree_skb(skb); > + return; > + } > + > + vnic->netdevice->last_rx = jiffies; > + skb->dev = vnic->netdevice; > + 
skb->protocol = eth_type_trans(skb, skb->dev);
> +	if (!vnic->config->use_rx_csum)
> +		skb->ip_summed = CHECKSUM_NONE;
> +	netif_rx(skb);

Not sure, but if you are always calling this from softirq context (i.e.
NAPI or a tasklet), then there is no need for the additional queuing and
softirq that netif_rx adds.

> +	vnic_recv_pkt_stats(vnic);
> +}
> +
> +static struct net_device_stats *vnic_get_stats(struct net_device *device)
> +{
> +	struct vnic *vnic;
> +	struct netpath *np;
> +	unsigned long flags;
> +
> +	VNIC_FUNCTION("vnic_get_stats()\n");
> +	vnic = netdev_priv(device);
> +
> +	spin_lock_irqsave(&vnic->current_path_lock, flags);
> +	np = vnic->current_path;
> +	if (np && np->viport) {
> +		atomic_inc(&np->viport->reference_count);
> +		spin_unlock_irqrestore(&vnic->current_path_lock, flags);
> +		viport_get_stats(np->viport, &vnic->stats);
> +		atomic_dec(&np->viport->reference_count);
> +		wake_up(&np->viport->reference_queue);
> +	} else
> +		spin_unlock_irqrestore(&vnic->current_path_lock, flags);
> +
> +	return &vnic->stats;
> +}

You can use device->stats and delete vnic->stats to save space.

> +
> +static int vnic_open(struct net_device *device)
> +{
> +	struct vnic *vnic;
> +
> +	VNIC_FUNCTION("vnic_open()\n");
> +	vnic = netdev_priv(device);
> +
> +	vnic->open++;

You don't need this (vnic->open); use netif_running(device) instead.

> +	vnic_npevent_queue_evt(&vnic->primary_path, VNIC_PRINP_SETLINK);
> +	netif_start_queue(vnic->netdevice);
> +
> +	return 0;
> +}
> +
> +static int vnic_stop(struct net_device *device)
> +{
> +	struct vnic *vnic;
> +	int ret = 0;
> +
> +	VNIC_FUNCTION("vnic_stop()\n");
> +	vnic = netdev_priv(device);
> +	netif_stop_queue(device);
> +	vnic->open--;
> +	vnic_npevent_queue_evt(&vnic->primary_path, VNIC_PRINP_SETLINK);
> +
> +	return ret;
> +}
> +
> +static int vnic_hard_start_xmit(struct sk_buff *skb,
> +				struct net_device *device)
> +{
> +	struct vnic *vnic;
> +	struct netpath *np;
> +	cycles_t xmit_time;
> +	int ret = -1;
> +
> +	VNIC_FUNCTION("vnic_hard_start_xmit()\n");
> +	vnic = netdev_priv(device);
> +	np = vnic->current_path;
> +
> +	vnic_pre_pkt_xmit_stats(&xmit_time);
> +
> +	if (np && np->viport)
> +		ret = viport_xmit_packet(np->viport, skb);
> +
> +	if (ret) {
> +		vnic_xmit_fail_stats(vnic);
> +		dev_kfree_skb_any(skb);
> +		vnic->stats.tx_dropped++;
> +		goto out;
> +	}
> +
> +	device->trans_start = jiffies;
> +	vnic_post_pkt_xmit_stats(vnic, xmit_time);
> +out:
> +	return 0;
> +}

No flow control? You will just drop packets if overloaded?
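(For reference, a minimal sketch of the tx flow control being asked about
here, assuming a hypothetical viport_xmit_ring_full() helper; the real
driver would test whatever send-ring state it keeps, and its
send-completion handler would call netif_wake_queue() once space frees
up:)

static int vnic_hard_start_xmit(struct sk_buff *skb,
				struct net_device *device)
{
	struct vnic *vnic = netdev_priv(device);
	struct netpath *np = vnic->current_path;

	/* Back-pressure the stack instead of silently dropping: stop
	 * the queue while the ring is full and let the core requeue
	 * this skb. */
	if (np && np->viport && viport_xmit_ring_full(np->viport)) {
		netif_stop_queue(device);
		return NETDEV_TX_BUSY;
	}

	/* ... transmit path as in the patch above ... */
	return NETDEV_TX_OK;
}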
> + > +static void vnic_tx_timeout(struct net_device *device) > +{ > + struct vnic *vnic; > + struct viport *viport = NULL; > + unsigned long flags; > + > + VNIC_FUNCTION("vnic_tx_timeout()\n"); > + vnic = netdev_priv(device); > + device->trans_start = jiffies; > + > + spin_lock_irqsave(&vnic->current_path_lock, flags); > + if (vnic->current_path && vnic->current_path->viport) { > + if (vnic->failed_over) { > + if (vnic->current_path == &vnic->primary_path) > + viport = vnic->secondary_path.viport; > + else if (vnic->current_path == &vnic->secondary_path) > + viport = vnic->primary_path.viport; > + } else > + viport = vnic->current_path->viport; > + > + spin_unlock_irqrestore(&vnic->current_path_lock, flags); > + if (viport) > + viport_failure(viport); > + } else > + spin_unlock_irqrestore(&vnic->current_path_lock, flags); > + > + VNIC_ERROR("vnic_tx_timeout\n"); > +} > + > +static void vnic_set_multicast_list(struct net_device *device) > +{ > + struct vnic *vnic; > + unsigned long flags; > + > + VNIC_FUNCTION("vnic_set_multicast_list()\n"); > + vnic = netdev_priv(device); > + > + spin_lock_irqsave(&vnic->lock, flags); > + if (device->mc_count == 0) { > + if (vnic->mc_list_len) { > + vnic->mc_list_len = vnic->mc_count = 0; > + kfree(vnic->mc_list); > + } > + } else { > + struct dev_mc_list *mc_list = device->mc_list; > + int i; > + > + if (device->mc_count > vnic->mc_list_len) { > + if (vnic->mc_list_len) > + kfree(vnic->mc_list); > + vnic->mc_list_len = device->mc_count + 10; > + vnic->mc_list = kmalloc(vnic->mc_list_len * > + sizeof *mc_list, GFP_ATOMIC); > + if (!vnic->mc_list) { > + vnic->mc_list_len = vnic->mc_count = 0; > + VNIC_ERROR("failed allocating mc_list\n"); > + goto failure; > + } > + } > + vnic->mc_count = device->mc_count; > + for (i = 0; i < device->mc_count; i++) { > + vnic->mc_list[i] = *mc_list; > + vnic->mc_list[i].next = &vnic->mc_list[i + 1]; > + mc_list = mc_list->next; > + } > + } > + spin_unlock_irqrestore(&vnic->lock, flags); > + > + if (vnic->primary_path.viport) > + viport_set_multicast(vnic->primary_path.viport, > + vnic->mc_list, vnic->mc_count); > + > + if (vnic->secondary_path.viport) > + viport_set_multicast(vnic->secondary_path.viport, > + vnic->mc_list, vnic->mc_count); > + > + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_PRINP_SETLINK); > + return; > +failure: > + spin_unlock_irqrestore(&vnic->lock, flags); > +} > + > +/** > + * Following set of functions queues up the events for EVIC and the > + * kernel thread queuing up the event might return. 
> + */ > +static int vnic_set_mac_address(struct net_device *device, void *addr) > +{ > + struct vnic *vnic; > + struct sockaddr *sockaddr = addr; > + u8 *address; > + int ret = -1; > + > + VNIC_FUNCTION("vnic_set_mac_address()\n"); > + vnic = netdev_priv(device); > + > + if (!is_valid_ether_addr(sockaddr->sa_data)) > + return -EADDRNOTAVAIL; > + > + if (netif_running(device)) > + return -EBUSY; > + > + memcpy(device->dev_addr, sockaddr->sa_data, ETH_ALEN); > + address = sockaddr->sa_data; > + > + if (vnic->primary_path.viport) > + ret = viport_set_unicast(vnic->primary_path.viport, > + address); > + > + if (ret) > + return ret; > + > + if (vnic->secondary_path.viport) > + viport_set_unicast(vnic->secondary_path.viport, address); > + > + vnic->mac_set = 1; > + return 0; > +} > + > +static int vnic_change_mtu(struct net_device *device, int mtu) > +{ > + struct vnic *vnic; > + int ret = 0; > + int pri_max_mtu; > + int sec_max_mtu; > + > + VNIC_FUNCTION("vnic_change_mtu()\n"); > + vnic = netdev_priv(device); > + > + if (vnic->primary_path.viport) > + pri_max_mtu = viport_max_mtu(vnic->primary_path.viport); > + else > + pri_max_mtu = MAX_PARAM_VALUE; > + > + if (vnic->secondary_path.viport) > + sec_max_mtu = viport_max_mtu(vnic->secondary_path.viport); > + else > + sec_max_mtu = MAX_PARAM_VALUE; > + > + if ((mtu < pri_max_mtu) && (mtu < sec_max_mtu)) { > + device->mtu = mtu; > + vnic_npevent_queue_evt(&vnic->primary_path, > + VNIC_PRINP_SETLINK); > + vnic_npevent_queue_evt(&vnic->secondary_path, > + VNIC_SECNP_SETLINK); > + } else if (pri_max_mtu < sec_max_mtu) > + printk(KERN_WARNING PFX "%s: Maximum " > + "supported MTU size is %d. " > + "Cannot set MTU to %d\n", > + vnic->config->name, pri_max_mtu, mtu); > + else > + printk(KERN_WARNING PFX "%s: Maximum " > + "supported MTU size is %d. " > + "Cannot set MTU to %d\n", > + vnic->config->name, sec_max_mtu, mtu); > + > + return ret; > +} > + > +static int vnic_npevent_register(struct vnic *vnic, struct netpath *netpath) > +{ > + u8 *address; > + int ret; > + > + if (!vnic->mac_set) { > + /* if netpath == secondary_path, then the primary path isn't > + * connected. MAC address will be set when the primary > + * connects. 
> + */ > + netpath_get_hw_addr(netpath, vnic->netdevice->dev_addr); > + address = vnic->netdevice->dev_addr; > + > + if (vnic->secondary_path.viport) > + viport_set_unicast(vnic->secondary_path.viport, > + address); > + > + vnic->mac_set = 1; > + } > + ret = register_netdev(vnic->netdevice); > + if (ret) { > + printk(KERN_ERR PFX "%s failed registering netdev " > + "error %d - calling viport_failure\n", > + config_viport_name(vnic->primary_path.viport->config), > + ret); > + vnic_free(vnic); > + printk(KERN_ERR PFX "%s DELETED : register_netdev failure\n", > + config_viport_name(vnic->primary_path.viport->config)); > + return ret; > + } > + > + vnic->state = VNIC_REGISTERED; > + vnic->carrier = 2; /*special value to force netif_carrier_(on|off)*/ > + return 0; > +} > + > +static void vnic_npevent_dequeue_all(struct vnic *vnic) > +{ > + unsigned long flags; > + struct vnic_npevent *npevt, *tmp; > + > + spin_lock_irqsave(&vnic_npevent_list_lock, flags); > + if (list_empty(&vnic_npevent_list)) > + goto out; > + list_for_each_entry_safe(npevt, tmp, &vnic_npevent_list, > + list_ptrs) { > + if ((npevt->vnic == vnic)) { > + list_del(&npevt->list_ptrs); > + kfree(npevt); > + } > + } > +out: > + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); > +} > + > +static void update_path_and_reconnect(struct netpath *netpath, > + struct vnic *vnic) > +{ > + struct viport_config *config = netpath->viport->config; > + int delay = 1; > + > + if (vnic_ib_get_path(netpath, vnic)) > + return; > + /* > + * tell viport_connect to wait for default_no_path_timeout > + * before connecting if we are retrying the same path index > + * within default_no_path_timeout. > + * This prevents flooding connect requests to a path (or set > + * of paths) that aren't successfully connecting for some reason. 
> + */ > + if (time_after(jiffies, > + (netpath->connect_time + vnic->config->no_path_timeout))) { > + netpath->path_idx = config->path_idx; > + netpath->connect_time = jiffies; > + netpath->delay_reconnect = 0; > + delay = 0; > + } else if (config->path_idx != netpath->path_idx) { > + delay = netpath->delay_reconnect; > + netpath->path_idx = config->path_idx; > + netpath->delay_reconnect = 1; > + } else > + delay = 1; > + viport_connect(netpath->viport, delay); > +} > + > +static inline void vnic_set_checksum_flag(struct vnic *vnic, > + struct netpath *target_path) > +{ > + unsigned long flags; > + > + spin_lock_irqsave(&vnic->current_path_lock, flags); > + vnic->current_path = target_path; > + vnic->failed_over = 1; > + if (vnic->config->use_tx_csum && > + netpath_can_tx_csum(vnic->current_path)) > + vnic->netdevice->features |= NETIF_F_IP_CSUM; > + > + spin_unlock_irqrestore(&vnic->current_path_lock, flags); > +} > + > +static void vnic_set_uni_multicast(struct vnic *vnic, > + struct netpath *netpath) > +{ > + unsigned long flags; > + u8 *address; > + > + if (vnic->mac_set) { > + address = vnic->netdevice->dev_addr; > + > + if (netpath->viport) > + viport_set_unicast(netpath->viport, address); > + } > + spin_lock_irqsave(&vnic->lock, flags); > + > + if (vnic->mc_list && netpath->viport) > + viport_set_multicast(netpath->viport, vnic->mc_list, > + vnic->mc_count); > + > + spin_unlock_irqrestore(&vnic->lock, flags); > + if (vnic->state == VNIC_REGISTERED) { > + if (!netpath->viport) > + return; > + viport_set_link(netpath->viport, > + vnic->netdevice->flags & ~IFF_UP, > + vnic->netdevice->mtu); > + } > +} > + > +static void vnic_set_netpath_timers(struct vnic *vnic, > + struct netpath *netpath) > +{ > + switch (netpath->timer_state) { > + case NETPATH_TS_IDLE: > + netpath->timer_state = NETPATH_TS_ACTIVE; > + if (vnic->state == VNIC_UNINITIALIZED) > + netpath_timer(netpath, > + vnic->config-> > + primary_connect_timeout); > + else > + netpath_timer(netpath, > + vnic->config-> > + primary_reconnect_timeout); > + break; > + case NETPATH_TS_ACTIVE: > + /*nothing to do*/ > + break; > + case NETPATH_TS_EXPIRED: > + if (vnic->state == VNIC_UNINITIALIZED) > + vnic_npevent_register(vnic, netpath); > + > + break; > + } > +} > + > +static void vnic_check_primary_path_timer(struct vnic *vnic) > +{ > + switch (vnic->primary_path.timer_state) { > + case NETPATH_TS_ACTIVE: > + /* nothing to do. 
just wait */ > + break; > + case NETPATH_TS_IDLE: > + netpath_timer(&vnic->primary_path, > + vnic->config-> > + primary_switch_timeout); > + break; > + case NETPATH_TS_EXPIRED: > + printk(KERN_INFO PFX > + "%s: switching to primary path\n", > + vnic->config->name); > + > + vnic_set_checksum_flag(vnic, &vnic->primary_path); > + break; > + } > +} > + > +static void vnic_carrier_loss(struct vnic *vnic, > + struct netpath *last_path) > +{ > + if (vnic->primary_path.carrier) { > + vnic->carrier = 1; > + vnic_set_checksum_flag(vnic, &vnic->primary_path); > + > + if (last_path && last_path != vnic->current_path) > + printk(KERN_INFO PFX > + "%s: failing over to primary path\n", > + vnic->config->name); > + else if (!last_path) > + printk(KERN_INFO PFX "%s: using primary path\n", > + vnic->config->name); > + > + } else if ((vnic->secondary_path.carrier) && > + (vnic->secondary_path.timer_state != NETPATH_TS_ACTIVE)) { > + vnic->carrier = 1; > + vnic_set_checksum_flag(vnic, &vnic->secondary_path); > + > + if (last_path && last_path != vnic->current_path) > + printk(KERN_INFO PFX > + "%s: failing over to secondary path\n", > + vnic->config->name); > + else if (!last_path) > + printk(KERN_INFO PFX "%s: using secondary path\n", > + vnic->config->name); > + > + } > + > +} > + > +static void vnic_handle_path_change(struct vnic *vnic, > + struct netpath **path) > +{ > + struct netpath *last_path = *path; > + > + if (!last_path) { > + if (vnic->current_path == &vnic->primary_path) > + last_path = &vnic->secondary_path; > + else > + last_path = &vnic->primary_path; > + > + } > + > + if (vnic->current_path && vnic->current_path->viport) > + viport_set_link(vnic->current_path->viport, > + vnic->netdevice->flags, > + vnic->netdevice->mtu); > + > + if (last_path->viport) > + viport_set_link(last_path->viport, > + vnic->netdevice->flags & > + ~IFF_UP, vnic->netdevice->mtu); > + > + vnic_restart_xmit(vnic, vnic->current_path); > +} > + > +static void vnic_report_path_change(struct vnic *vnic, > + struct netpath *last_path, > + int other_path_ok) > +{ > + if (!vnic->current_path) { > + if (last_path == &vnic->primary_path) > + printk(KERN_INFO PFX "%s: primary path lost, " > + "no failover path available\n", > + vnic->config->name); > + else > + printk(KERN_INFO PFX "%s: secondary path lost, " > + "no failover path available\n", > + vnic->config->name); > + return; > + } > + > + if (last_path != vnic->current_path) > + return; > + > + if (vnic->current_path == &vnic->secondary_path) { > + if (other_path_ok != vnic->primary_path.carrier) { > + if (other_path_ok) > + printk(KERN_INFO PFX "%s: primary path no" > + " longer available for failover\n", > + vnic->config->name); > + else > + printk(KERN_INFO PFX "%s: primary path now" > + " available for failover\n", > + vnic->config->name); > + } > + } else { > + if (other_path_ok != vnic->secondary_path.carrier) { > + if (other_path_ok) > + printk(KERN_INFO PFX "%s: secondary path no" > + " longer available for failover\n", > + vnic->config->name); > + else > + printk(KERN_INFO PFX "%s: secondary path now" > + " available for failover\n", > + vnic->config->name); > + } > + } > +} > + > +static void vnic_handle_free_vnic_evt(struct vnic *vnic) > +{ > + unsigned long flags; > + > + if (!netif_queue_stopped(vnic->netdevice)) > + netif_stop_queue(vnic->netdevice); > + > + netpath_timer_stop(&vnic->primary_path); > + netpath_timer_stop(&vnic->secondary_path); > + spin_lock_irqsave(&vnic->current_path_lock, flags); > + vnic->current_path = NULL; > + 
spin_unlock_irqrestore(&vnic->current_path_lock, flags); > + netpath_free(&vnic->primary_path); > + netpath_free(&vnic->secondary_path); > + if (vnic->state == VNIC_REGISTERED) > + unregister_netdev(vnic->netdevice); > + > + vnic_npevent_dequeue_all(vnic); > + kfree(vnic->config); > + if (vnic->mc_list_len) { > + vnic->mc_list_len = vnic->mc_count = 0; > + kfree(vnic->mc_list); > + } > + > + sysfs_remove_group(&vnic->dev_info.dev.kobj, > + &vnic_dev_attr_group); > + vnic_cleanup_stats_files(vnic); > + device_unregister(&vnic->dev_info.dev); > + wait_for_completion(&vnic->dev_info.released); > + free_netdev(vnic->netdevice); > +} > + > +static struct vnic *vnic_handle_npevent(struct vnic *vnic, > + enum vnic_npevent_type npevt_type) > +{ > + struct netpath *netpath; > + const char *netpath_str; > + > + if (npevt_type <= VNIC_PRINP_LASTTYPE) > + netpath_str = netpath_to_string(vnic, &vnic->primary_path); > + else if (npevt_type <= VNIC_SECNP_LASTTYPE) > + netpath_str = netpath_to_string(vnic, &vnic->secondary_path); > + else > + netpath_str = netpath_to_string(vnic, vnic->current_path); > + > + VNIC_INFO("%s: processing %s, netpath=%s, carrier=%d\n", > + vnic->config->name, vnic_npevent_str[npevt_type], > + netpath_str, vnic->carrier); > + > + switch (npevt_type) { > + case VNIC_PRINP_CONNECTED: > + netpath = &vnic->primary_path; > + if (vnic->state == VNIC_UNINITIALIZED) { > + if (vnic_npevent_register(vnic, netpath)) > + break; > + } > + vnic_set_uni_multicast(vnic, netpath); > + break; > + case VNIC_SECNP_CONNECTED: > + vnic_set_uni_multicast(vnic, &vnic->secondary_path); > + break; > + case VNIC_PRINP_TIMEREXPIRED: > + netpath = &vnic->primary_path; > + netpath->timer_state = NETPATH_TS_EXPIRED; > + if (!netpath->carrier) > + update_path_and_reconnect(netpath, vnic); > + break; > + case VNIC_SECNP_TIMEREXPIRED: > + netpath = &vnic->secondary_path; > + netpath->timer_state = NETPATH_TS_EXPIRED; > + if (!netpath->carrier) > + update_path_and_reconnect(netpath, vnic); > + else { > + if (vnic->state == VNIC_UNINITIALIZED) > + vnic_npevent_register(vnic, netpath); > + } > + break; > + case VNIC_PRINP_LINKUP: > + vnic->primary_path.carrier = 1; > + break; > + case VNIC_SECNP_LINKUP: > + netpath = &vnic->secondary_path; > + netpath->carrier = 1; > + if (!vnic->carrier) > + vnic_set_netpath_timers(vnic, netpath); > + break; > + case VNIC_PRINP_LINKDOWN: > + vnic->primary_path.carrier = 0; > + break; > + case VNIC_SECNP_LINKDOWN: > + if (vnic->state == VNIC_UNINITIALIZED) > + netpath_timer_stop(&vnic->secondary_path); > + vnic->secondary_path.carrier = 0; > + break; > + case VNIC_PRINP_DISCONNECTED: > + netpath = &vnic->primary_path; > + netpath_timer_stop(netpath); > + netpath->carrier = 0; > + update_path_and_reconnect(netpath, vnic); > + break; > + case VNIC_SECNP_DISCONNECTED: > + netpath = &vnic->secondary_path; > + netpath_timer_stop(netpath); > + netpath->carrier = 0; > + update_path_and_reconnect(netpath, vnic); > + break; > + case VNIC_PRINP_SETLINK: > + netpath = vnic->current_path; > + if (!netpath || !netpath->viport) > + break; > + viport_set_link(netpath->viport, > + vnic->netdevice->flags, > + vnic->netdevice->mtu); > + break; > + case VNIC_SECNP_SETLINK: > + netpath = &vnic->secondary_path; > + if (!netpath || !netpath->viport) > + break; > + viport_set_link(netpath->viport, > + vnic->netdevice->flags, > + vnic->netdevice->mtu); > + break; > + case VNIC_NP_FREEVNIC: > + vnic_handle_free_vnic_evt(vnic); > + vnic = NULL; > + break; > + } > + return vnic; > +} > + > +static int 
vnic_npevent_statemachine(void *context) > +{ > + struct vnic_npevent *vnic_link_evt; > + enum vnic_npevent_type npevt_type; > + struct vnic *vnic; > + int last_carrier; > + int other_path_ok = 0; > + struct netpath *last_path; > + > + while (!vnic_npevent_thread_end || > + !list_empty(&vnic_npevent_list)) { > + unsigned long flags; > + > + wait_event_interruptible(vnic_npevent_queue, > + !list_empty(&vnic_npevent_list) > + || vnic_npevent_thread_end); > + spin_lock_irqsave(&vnic_npevent_list_lock, flags); > + if (list_empty(&vnic_npevent_list)) { > + spin_unlock_irqrestore(&vnic_npevent_list_lock, > + flags); > + VNIC_INFO("netpath statemachine wake" > + " on empty list\n"); > + continue; > + } > + > + vnic_link_evt = list_entry(vnic_npevent_list.next, > + struct vnic_npevent, > + list_ptrs); You could use new list_first_entry macro here. > + list_del(&vnic_link_evt->list_ptrs); > + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); > + vnic = vnic_link_evt->vnic; > + npevt_type = vnic_link_evt->event_type; > + kfree(vnic_link_evt); > + > + if (vnic->current_path == &vnic->secondary_path) > + other_path_ok = vnic->primary_path.carrier; > + else if (vnic->current_path == &vnic->primary_path) > + other_path_ok = vnic->secondary_path.carrier; > + > + vnic = vnic_handle_npevent(vnic, npevt_type); > + > + if (!vnic) > + continue; > + > + last_carrier = vnic->carrier; > + last_path = vnic->current_path; > + > + if (!vnic->current_path || > + !vnic->current_path->carrier) { > + vnic->carrier = 0; > + vnic->current_path = NULL; > + vnic->netdevice->features &= ~NETIF_F_IP_CSUM; > + } > + > + if (!vnic->carrier) > + vnic_carrier_loss(vnic, last_path); > + else if ((vnic->current_path != &vnic->primary_path) && > + (vnic->config->prefer_primary) && > + (vnic->primary_path.carrier)) > + vnic_check_primary_path_timer(vnic); > + > + if (last_path) > + vnic_report_path_change(vnic, last_path, > + other_path_ok); > + > + VNIC_INFO("new netpath=%s, carrier=%d\n", > + netpath_to_string(vnic, vnic->current_path), > + vnic->carrier); > + > + if (vnic->current_path != last_path) > + vnic_handle_path_change(vnic, &last_path); > + > + if (vnic->carrier != last_carrier) { > + if (vnic->carrier) { > + VNIC_INFO("netif_carrier_on\n"); > + netif_carrier_on(vnic->netdevice); > + vnic_carrier_loss_stats(vnic); > + } else { > + VNIC_INFO("netif_carrier_off\n"); > + netif_carrier_off(vnic->netdevice); > + vnic_disconn_stats(vnic); > + } > + > + } > + } > + complete_and_exit(&vnic_npevent_thread_exit, 0); > + return 0; > +} > + > +void vnic_npevent_queue_evt(struct netpath *netpath, > + enum vnic_npevent_type evt) > +{ > + struct vnic_npevent *npevent; > + unsigned long flags; > + > + npevent = kmalloc(sizeof *npevent, GFP_ATOMIC); > + if (!npevent) { > + VNIC_ERROR("Could not allocate memory for vnic event\n"); > + return; > + } > + npevent->vnic = netpath->parent; > + npevent->event_type = evt; > + INIT_LIST_HEAD(&npevent->list_ptrs); > + spin_lock_irqsave(&vnic_npevent_list_lock, flags); > + list_add_tail(&npevent->list_ptrs, &vnic_npevent_list); > + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); > + wake_up(&vnic_npevent_queue); > +} > + > +void vnic_npevent_dequeue_evt(struct netpath *netpath, > + enum vnic_npevent_type evt) > +{ > + unsigned long flags; > + struct vnic_npevent *npevt, *tmp; > + struct vnic *vnic = netpath->parent; > + > + spin_lock_irqsave(&vnic_npevent_list_lock, flags); > + if (list_empty(&vnic_npevent_list)) > + goto out; > + list_for_each_entry_safe(npevt, tmp, 
&vnic_npevent_list, > + list_ptrs) { > + if ((npevt->vnic == vnic) && > + (npevt->event_type == evt)) { > + list_del(&npevt->list_ptrs); > + kfree(npevt); > + break; > + } > + } > +out: > + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); > +} > + > +static int vnic_npevent_start(void) > +{ > + VNIC_FUNCTION("vnic_npevent_start()\n"); > + > + spin_lock_init(&vnic_npevent_list_lock); > + vnic_npevent_thread = kthread_run(vnic_npevent_statemachine, NULL, > + "qlgc_vnic_npevent_s_m"); > + if (IS_ERR(vnic_npevent_thread)) { > + printk(KERN_WARNING PFX "failed to create vnic npevent" > + " thread; error %d\n", > + (int) PTR_ERR(vnic_npevent_thread)); > + vnic_npevent_thread = NULL; > + return 1; > + } > + > + return 0; > +} > + > +void vnic_npevent_cleanup(void) > +{ > + if (vnic_npevent_thread) { > + vnic_npevent_thread_end = 1; > + wake_up(&vnic_npevent_queue); > + wait_for_completion(&vnic_npevent_thread_exit); > + vnic_npevent_thread = NULL; > + } > +} > + > +static void vnic_setup(struct net_device *device) > +{ > + ether_setup(device); > + > + /* ether_setup is used to fill > + * device parameters for ethernet devices. > + * We override some of the parameters > + * which are specific to VNIC. > + */ > + device->get_stats = vnic_get_stats; > + device->open = vnic_open; > + device->stop = vnic_stop; > + device->hard_start_xmit = vnic_hard_start_xmit; > + device->tx_timeout = vnic_tx_timeout; > + device->set_multicast_list = vnic_set_multicast_list; > + device->set_mac_address = vnic_set_mac_address; > + device->change_mtu = vnic_change_mtu; > + device->watchdog_timeo = 10 * HZ; > + device->features = 0; > +} > + > +struct vnic *vnic_allocate(struct vnic_config *config) > +{ > + struct vnic *vnic = NULL; > + struct net_device *netdev; > + > + VNIC_FUNCTION("vnic_allocate()\n"); > + netdev = alloc_netdev((int) sizeof(*vnic), config->name, vnic_setup); > + if (!netdev) { > + VNIC_ERROR("failed allocating vnic structure\n"); > + return NULL; > + } > + > + vnic = netdev_priv(netdev); > + vnic->netdevice = netdev; > + spin_lock_init(&vnic->lock); > + spin_lock_init(&vnic->current_path_lock); > + vnic_alloc_stats(vnic); > + vnic->state = VNIC_UNINITIALIZED; > + vnic->config = config; > + > + netpath_init(&vnic->primary_path, vnic, 0); > + netpath_init(&vnic->secondary_path, vnic, 1); > + > + vnic->current_path = NULL; > + vnic->failed_over = 0; > + > + list_add_tail(&vnic->list_ptrs, &vnic_list); > + > + return vnic; > +} > + > +void vnic_free(struct vnic *vnic) > +{ > + VNIC_FUNCTION("vnic_free()\n"); > + list_del(&vnic->list_ptrs); > + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_NP_FREEVNIC); > +} > + > +static void __exit vnic_cleanup(void) > +{ > + VNIC_FUNCTION("vnic_cleanup()\n"); > + > + VNIC_INIT("unloading %s\n", MODULEDETAILS); > + > + while (!list_empty(&vnic_list)) { > + struct vnic *vnic = > + list_entry(vnic_list.next, struct vnic, list_ptrs); Another place to use list_first_entry > + vnic_free(vnic); > + } > + > + vnic_npevent_cleanup(); > + viport_cleanup(); > + vnic_ib_cleanup(); > +} > + > +static int __init vnic_init(void) > +{ > + int ret; > + VNIC_FUNCTION("vnic_init()\n"); > + VNIC_INIT("Initializing %s\n", MODULEDETAILS); > + > + ret = config_start(); > + if (ret) { > + VNIC_ERROR("config_start failed\n"); > + goto failure; > + } > + > + ret = vnic_ib_init(); > + if (ret) { > + VNIC_ERROR("ib_start failed\n"); > + goto failure; > + } > + > + ret = viport_start(); > + if (ret) { > + VNIC_ERROR("viport_start failed\n"); > + goto failure; > + } > + > + ret = 
vnic_npevent_start(); > + if (ret) { > + VNIC_ERROR("vnic_npevent_start failed\n"); > + goto failure; > + } > + > + return 0; > +failure: > + vnic_cleanup(); > + return ret; > +} > + > +module_init(vnic_init); > +module_exit(vnic_cleanup); > diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_main.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.h > new file mode 100644 > index 0000000..7535124 > --- /dev/null > +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.h > @@ -0,0 +1,154 @@ > +/* > + * Copyright (c) 2006 QLogic, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > + > +#ifndef VNIC_MAIN_H_INCLUDED > +#define VNIC_MAIN_H_INCLUDED > + > +#include > +#include > +#include > +#include > + > +#include "vnic_config.h" > +#include "vnic_netpath.h" > + > +extern u16 vnic_max_mtu; > +extern struct list_head vnic_list; > +extern struct attribute_group vnic_stats_attr_group; > +extern cycles_t vnic_recv_ref; > + > +enum vnic_npevent_type { > + VNIC_PRINP_CONNECTED = 0, > + VNIC_PRINP_DISCONNECTED = 1, > + VNIC_PRINP_LINKUP = 2, > + VNIC_PRINP_LINKDOWN = 3, > + VNIC_PRINP_TIMEREXPIRED = 4, > + VNIC_PRINP_SETLINK = 5, > + > + /* used to figure out PRI vs SEC types for dbg msg*/ > + VNIC_PRINP_LASTTYPE = VNIC_PRINP_SETLINK, > + > + VNIC_SECNP_CONNECTED = 6, > + VNIC_SECNP_DISCONNECTED = 7, > + VNIC_SECNP_LINKUP = 8, > + VNIC_SECNP_LINKDOWN = 9, > + VNIC_SECNP_TIMEREXPIRED = 10, > + VNIC_SECNP_SETLINK = 11, > + > + /* used to figure out PRI vs SEC types for dbg msg*/ > + VNIC_SECNP_LASTTYPE = VNIC_SECNP_SETLINK, > + > + VNIC_NP_FREEVNIC = 12, > + > + /* > + * NOTE : If any new netpath event is being added, don't forget to > + * add corresponding netpath event string into vnic_main.c. 
> + */ > +}; > + > +struct vnic_npevent { > + struct list_head list_ptrs; > + struct vnic *vnic; > + enum vnic_npevent_type event_type; > +}; > + > +void vnic_npevent_queue_evt(struct netpath *netpath, > + enum vnic_npevent_type evt); > +void vnic_npevent_dequeue_evt(struct netpath *netpath, > + enum vnic_npevent_type evt); > + > +enum vnic_state { > + VNIC_UNINITIALIZED = 0, > + VNIC_REGISTERED = 1 > +}; > + > +struct vnic { > + struct list_head list_ptrs; > + enum vnic_state state; > + struct vnic_config *config; > + struct netpath *current_path; > + struct netpath primary_path; > + struct netpath secondary_path; > + int open; > + int carrier; > + int failed_over; > + int mac_set; > + struct net_device_stats stats; > + struct net_device *netdevice; > + struct dev_info dev_info; > + struct dev_mc_list *mc_list; > + int mc_list_len; > + int mc_count; > + spinlock_t lock; > + spinlock_t current_path_lock; > +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS > + struct { > + cycles_t start_time; > + cycles_t conn_time; > + cycles_t disconn_ref; /* intermediate time */ > + cycles_t disconn_time; > + u32 disconn_num; > + cycles_t xmit_time; > + u32 xmit_num; > + u32 xmit_fail; > + cycles_t recv_time; > + u32 recv_num; > + u32 multicast_recv_num; > + cycles_t xmit_ref; /* intermediate time */ > + cycles_t xmit_off_time; > + u32 xmit_off_num; > + cycles_t carrier_ref; /* intermediate time */ > + cycles_t carrier_off_time; > + u32 carrier_off_num; > + } statistics; > + struct dev_info stat_info; > +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ > +}; > + > +struct vnic *vnic_allocate(struct vnic_config *config); > + > +void vnic_free(struct vnic *vnic); > + > +void vnic_connected(struct vnic *vnic, struct netpath *netpath); > +void vnic_disconnected(struct vnic *vnic, struct netpath *netpath); > + > +void vnic_link_up(struct vnic *vnic, struct netpath *netpath); > +void vnic_link_down(struct vnic *vnic, struct netpath *netpath); > + > +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath); > +void vnic_restart_xmit(struct vnic *vnic, struct netpath *netpath); > + > +void vnic_recv_packet(struct vnic *vnic, struct netpath *netpath, > + struct sk_buff *skb); > +void vnic_npevent_cleanup(void); > +void completion_callback_cleanup(struct vnic_ib_conn *ib_conn); Change name to vnic_complete_cleanup or something like that for consistency. > +#endif /* VNIC_MAIN_H_INCLUDED */ > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html From shemminger at vyatta.com Thu May 29 10:30:03 2008 From: shemminger at vyatta.com (Stephen Hemminger) Date: Thu, 29 May 2008 10:30:03 -0700 Subject: [ofa-general] Re: [PATCH v3 08/13] QLogic VNIC: sysfs interface implementation for the driver In-Reply-To: <20080529095754.9943.27936.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> <20080529095754.9943.27936.stgit@localhost.localdomain> Message-ID: <20080529103003.010c4a08@extreme> On Thu, 29 May 2008 15:27:54 +0530 Ramachandra K wrote: > From: Amar Mudrankit > > The sysfs interface for the QLogic VNIC driver is implemented through > this patch. 
>
> Signed-off-by: Amar Mudrankit
> Signed-off-by: Ramachandra K
> Signed-off-by: Poornima Kamath
> ---
>
>  drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c | 1133 +++++++++++++++++++++++++++
>  drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h |   51 +
>  2 files changed, 1184 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c
>  create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h
>
> diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c
> new file mode 100644
> index 0000000..40b3c77
> --- /dev/null
> +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c
> @@ -0,0 +1,1133 @@
> +/*
> + * Copyright (c) 2006 QLogic, Inc. All rights reserved.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses. You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + * Redistribution and use in source and binary forms, with or
> + * without modification, are permitted provided that the following
> + * conditions are met:
> + *
> + * - Redistributions of source code must retain the above
> + * copyright notice, this list of conditions and the following
> + * disclaimer.
> + *
> + * - Redistributions in binary form must reproduce the above
> + * copyright notice, this list of conditions and the following
> + * disclaimer in the documentation and/or other materials
> + * provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#include
> +#include
> +#include
> +
> +#include "vnic_util.h"
> +#include "vnic_config.h"
> +#include "vnic_ib.h"
> +#include "vnic_viport.h"
> +#include "vnic_main.h"
> +#include "vnic_stats.h"
> +
> +/*
> + * target eiocs are added by writing
> + *
> + * ioc_guid=,dgid=,pkey=,name=
> + * to the create_primary sysfs attribute.
> + */
> +enum {
> +	VNIC_OPT_ERR = 0,
> +	VNIC_OPT_IOC_GUID = 1 << 0,
> +	VNIC_OPT_DGID = 1 << 1,
> +	VNIC_OPT_PKEY = 1 << 2,
> +	VNIC_OPT_NAME = 1 << 3,
> +	VNIC_OPT_INSTANCE = 1 << 4,
> +	VNIC_OPT_RXCSUM = 1 << 5,
> +	VNIC_OPT_TXCSUM = 1 << 6,
> +	VNIC_OPT_HEARTBEAT = 1 << 7,
> +	VNIC_OPT_IOC_STRING = 1 << 8,
> +	VNIC_OPT_IB_MULTICAST = 1 << 9,
> +	VNIC_OPT_ALL = (VNIC_OPT_IOC_GUID |
> +			VNIC_OPT_DGID | VNIC_OPT_NAME | VNIC_OPT_PKEY),
> +};
> +
> +static match_table_t vnic_opt_tokens = {
> +	{VNIC_OPT_IOC_GUID, "ioc_guid=%s"},
> +	{VNIC_OPT_DGID, "dgid=%s"},
> +	{VNIC_OPT_PKEY, "pkey=%x"},
> +	{VNIC_OPT_NAME, "name=%s"},
> +	{VNIC_OPT_INSTANCE, "instance=%d"},
> +	{VNIC_OPT_RXCSUM, "rx_csum=%s"},
> +	{VNIC_OPT_TXCSUM, "tx_csum=%s"},
> +	{VNIC_OPT_HEARTBEAT, "heartbeat=%d"},
> +	{VNIC_OPT_IOC_STRING, "ioc_string=\"%s"},
> +	{VNIC_OPT_IB_MULTICAST, "ib_multicast=%s"},
> +	{VNIC_OPT_ERR, NULL}
> +};
>

No, sysfs is supposed to be one value per file; use separate attributes
for each one. This also eliminates the parsing code.
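(For reference, the one-value-per-file style being requested looks
roughly like this; the pkey attribute, the config->pkey field, and the
drvdata lookup are illustrative stand-ins, not code from the patch:)

/* One parameter per attribute file; no token parsing needed. */
static ssize_t show_pkey(struct device *dev,
			 struct device_attribute *attr, char *buf)
{
	struct vnic *vnic = dev_get_drvdata(dev);	/* hypothetical lookup */

	return sprintf(buf, "0x%04x\n", vnic->config->pkey);
}

static ssize_t store_pkey(struct device *dev,
			  struct device_attribute *attr,
			  const char *buf, size_t count)
{
	struct vnic *vnic = dev_get_drvdata(dev);

	vnic->config->pkey = simple_strtoul(buf, NULL, 0);
	return count;
}

static DEVICE_ATTR(pkey, S_IRUGO | S_IWUSR, show_pkey, store_pkey);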
From greg at kroah.com Thu May 29 10:48:05 2008 From: greg at kroah.com (Greg KH) Date: Thu, 29 May 2008 10:48:05 -0700 Subject: [ofa-general] Re: [PATCH v3 08/13] QLogic VNIC: sysfs interface implementation for the driver In-Reply-To: <20080529103003.010c4a08@extreme> References: <20080529095126.9943.84692.stgit@localhost.localdomain> <20080529095754.9943.27936.stgit@localhost.localdomain> <20080529103003.010c4a08@extreme> Message-ID: <20080529174805.GA10903@kroah.com> On Thu, May 29, 2008 at 10:30:03AM -0700, Stephen Hemminger wrote: > On Thu, 29 May 2008 15:27:54 +0530 > Ramachandra K wrote: > > > From: Amar Mudrankit > > > > The sysfs interface for the QLogic VNIC driver is implemented through > > this patch. > > > > Signed-off-by: Amar Mudrankit > > Signed-off-by: Ramachandra K > > Signed-off-by: Poornima Kamath > > --- > > > > drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c | 1133 +++++++++++++++++++++++++++ > > drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h | 51 + > > 2 files changed, 1184 insertions(+), 0 deletions(-) > > create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c > > create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h > > > > diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c > > new file mode 100644 > > index 0000000..40b3c77 > > --- /dev/null > > +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c > > @@ -0,0 +1,1133 @@ > > +/* > > + * Copyright (c) 2006 QLogic, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. > > + */ > > + > > +#include > > +#include > > +#include > > + > > +#include "vnic_util.h" > > +#include "vnic_config.h" > > +#include "vnic_ib.h" > > +#include "vnic_viport.h" > > +#include "vnic_main.h" > > +#include "vnic_stats.h" > > + > > +/* > > + * target eiocs are added by writing > > + * > > + * ioc_guid=,dgid=,pkey=,name= > > + * to the create_primary sysfs attribute. 
> > + */ > > +enum { > > + VNIC_OPT_ERR = 0, > > + VNIC_OPT_IOC_GUID = 1 << 0, > > + VNIC_OPT_DGID = 1 << 1, > > + VNIC_OPT_PKEY = 1 << 2, > > + VNIC_OPT_NAME = 1 << 3, > > + VNIC_OPT_INSTANCE = 1 << 4, > > + VNIC_OPT_RXCSUM = 1 << 5, > > + VNIC_OPT_TXCSUM = 1 << 6, > > + VNIC_OPT_HEARTBEAT = 1 << 7, > > + VNIC_OPT_IOC_STRING = 1 << 8, > > + VNIC_OPT_IB_MULTICAST = 1 << 9, > > + VNIC_OPT_ALL = (VNIC_OPT_IOC_GUID | > > + VNIC_OPT_DGID | VNIC_OPT_NAME | VNIC_OPT_PKEY), > > +}; > > + > > +static match_table_t vnic_opt_tokens = { > > + {VNIC_OPT_IOC_GUID, "ioc_guid=%s"}, > > + {VNIC_OPT_DGID, "dgid=%s"}, > > + {VNIC_OPT_PKEY, "pkey=%x"}, > > + {VNIC_OPT_NAME, "name=%s"}, > > + {VNIC_OPT_INSTANCE, "instance=%d"}, > > + {VNIC_OPT_RXCSUM, "rx_csum=%s"}, > > + {VNIC_OPT_TXCSUM, "tx_csum=%s"}, > > + {VNIC_OPT_HEARTBEAT, "heartbeat=%d"}, > > + {VNIC_OPT_IOC_STRING, "ioc_string=\"%s"}, > > + {VNIC_OPT_IB_MULTICAST, "ib_multicast=%s"}, > > + {VNIC_OPT_ERR, NULL} > > +}; > > > > No sysfs is supposed to be one value per file use separate attributes > for each one. This also eliminates the parsing code. Also, every new sysfs file needs to have an entry in Documentation/ABI/ which shows how to use it and what the contents are. And yes, multiple values per sysfs file are not allowed, sorry, please change this. If you need to configure your device through an interface like this, consider using configfs instead, that is what it is there for. thanks, greg k-h From matthias at sgi.com Thu May 29 11:32:47 2008 From: matthias at sgi.com (Matthias Blankenhaus) Date: Thu, 29 May 2008 11:32:47 -0700 (PDT) Subject: [ofa-general] saquery port problems In-Reply-To: <20080522073703.GA31474@sashak.voltaire.com> References: <20080522073703.GA31474@sashak.voltaire.com> Message-ID: On Thu, 22 May 2008, Sasha Khapyorsky wrote: > Hi Matthias, > > On 16:48 Wed 21 May , Matthias Blankenhaus wrote: > > I have a patch that fixes the problem: > > > > diff -Narpu infiniband-diags-1.3.6.vanilla/src/saquery.c my/src/saquery.c > > --- infiniband-diags-1.3.6.vanilla/src/saquery.c 2008-02-28 > > 00:58:36.000000000 -0800 > > +++ my/src/saquery.c 2008-05-21 16:08:19.583221794 -0700 > > @@ -1304,13 +1304,13 @@ get_bind_handle(void) > > ca_name_index++; > > if (sa_port_num && sa_port_num != attr_array[i].port_num) > > continue; > > - if (sa_hca_name && i == 0) > > - continue; > > if (sa_hca_name > > && strcmp(sa_hca_name, vendor->ca_names[ca_name_index]) != 0) > > continue; > > - if (attr_array[i].link_state == IB_LINK_ACTIVE) > > + if (attr_array[i].link_state == IB_LINK_ACTIVE) { > > port_guid = attr_array[i].port_guid; > > + break; > > + } > > } > > > > > > I have tested it and it solves the problem. > > > > Does this look ok ? > > Yes, this looks correct. Thanks for fixing this. I just will need your > 'Signed-off-by:' line in order to apply the patch. Sorry, I don't know what that is :-) This is my first patch for OFED, excuse my ignorance. 
Please let me know if this helps:

Signed-off-by: matthias at sgi.com

Matthias

> 
> Sasha
> 

From tziporet at mellanox.co.il Thu May 29 11:34:48 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Thu, 29 May 2008 21:34:48 +0300
Subject: [ofa-general] OFED 1.3.1 RC3 release is available
Message-ID: <6C2C79E72C305246B504CBA17B5500C90282E6E3@mtlexch01.mtl.com>

Hi,
OFED 1.3.1 RC3 release is available on
http://www.openfabrics.org/downloads/OFED/ofed-1.3.1/OFED-1.3.1-rc3.tgz

To get BUILD_ID run ofed_info

Please report any issues in Bugzilla https://bugs.openfabrics.org/

The GA version is expected next week.

Release information:
--------------------
Linux Operating Systems:
   - RedHat EL4 up4:  2.6.9-42.ELsmp
   - RedHat EL4 up5:  2.6.9-55.ELsmp
   - RedHat EL4 up6:  2.6.9-67.ELsmp
   - RedHat EL5:      2.6.18-8.el5
   - RedHat EL5 up1:  2.6.18-53.el5
   - RedHat EL5 up2 beta: 2.6.18-84.el5 *
   - Fedora C6:       2.6.18-8.fc6 *
   - SLES10:          2.6.16.21-0.8-smp
   - SLES10 SP1:      2.6.16.46-0.12-smp
   - SLES10 SP1 up1:  2.6.16.53-0.16-smp
   - SLES10 SP2:      2.6.16.60-0.21-smp *
   - OpenSuSE 10.3:   2.6.22-*-* *
   - kernel.org:      2.6.23 and 2.6.24

* OSes that are partially tested

Systems:
  * x86_64
  * x86
  * ia64
  * ppc64

Main changes from OFED 1.3.1-rc2
================================
* Updated utilities:
  * mstflint
  * ibutils: Added rcv/snd data/pkt counters to port counters fetch (-pm)
  * opensm version 3.1.11
* ULPs changes:
  * RDS:
    - Fix a bug in RDMA signaling
    - Add 3 more stats counters
    - Fix kernel oops: swiotlb_unmap_sg+0x35/0x126
  * IPoIB:
    - Fix alignment of small SKBs in CM mode receive
    - Set max CM MTU when moving to CM mode
    - Fix neigh destructor oops on kernel versions between 2.6.17 and 2.6.20
* General:
  * 90-ib.rules: fix uat that has been deprecated

Main changes from OFED 1.3.1-rc1
================================
* Added backports for the OSes (with very limited testing):
  * SLES10 SP2 with kernel 2.6.16.60-0.21-smp
  * RedHat EL5 up2 beta with kernel 2.6.18-84.el5
* MPI packages update:
  * mvapich-1.0.1-2481
* Updated libraries:
  * dapl-v1 1.2.7-1
  * dapl-v2 2.0.9-1
  * libcxgb3 1.2.1
* ULPs changes:
  * OpenSM: Fix segmentation fault
  * iSER: Bug fixes since 2.6.24
  * RDS: fixes for RDMA API
  * IPoIB: Fix several kernel crashes (see attached list)
* Updated low level drivers:
  * nes
  * mlx4
  * cxgb3
  * ehca
  * ipath

Main Changes from OFED-1.3:
===========================
* MPI packages update:
  * mvapich-1.0.1-2434
  * mvapich2-1.0.3-1
  * openmpi-1.2.6-1
* Updated libraries:
  * dapl-v1 1.2.6
  * dapl-v2 2.0.8
  * libcxgb3 1.2.0
  * librdmacm 1.0.7
* ULPs changes:
  * IB Bonding: ib-bonding-0.9.0-24
  * IPoIB bug fixes
  * RDS fixes for RDMA API
  * SRP failover
* Updated low level drivers:
  * nes
  * mlx4
  * cxgb3
  * ehca

Vlad & Tziporet

From okir at lst.de Thu May 29 11:38:34 2008
From: okir at lst.de (Olaf Kirch)
Date: Thu, 29 May 2008 20:38:34 +0200
Subject: [ofa-general] Fwd: [rds-devel] RDS: Fix a bug in RDMA signalling
In-Reply-To: <483C0D40.5060902@dev.mellanox.co.il>
References: <200805271014.30175.okir@lst.de> <483C0D40.5060902@dev.mellanox.co.il>
Message-ID: <200805292038.35476.okir@lst.de>

On Tuesday 27 May 2008 15:31:44 Vladimir Sokolovsky wrote:
> Applied to OFED-1.3.1 kernel git tree.

Thanks a lot, Vlad!
Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From olaf.kirch at oracle.com Thu May 29 11:40:55 2008 From: olaf.kirch at oracle.com (Olaf Kirch) Date: Thu, 29 May 2008 20:40:55 +0200 Subject: [rds-devel] [ofa-general] Port space sharing in RDS In-Reply-To: <20080529000354.GD6288@opengridcomputing.com> References: <20080528225549.GC6288@opengridcomputing.com> <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> <20080529000354.GD6288@opengridcomputing.com> Message-ID: <200805292040.56097.olaf.kirch@oracle.com> On Thursday 29 May 2008 02:03:54 Jon Mason wrote: > On Wed, May 28, 2008 at 04:33:06PM -0700, Sean Hefty wrote: > > >During RDS init, rds_ib_init and rds_tcp_init will both individually bind to > > >the > > >RDS port for all IP addresses. Unfortunately, that will not work for iWARP for > > >2 major reasons. > > > > Can RDS use different port numbers for its RDMA and TCP protocols? The wire > > I do not know if this is desirable, but a quick test shows that having TCP and IB > on different ports works around the problem. Okay, fine with me. Since TCP is disabled in 1.3 anyway, this shouldn't be an issue there, but it'll certainly crop up - I'm re-enabling TCP for 1.4. Care to send me a patch? Any preference as to the port number? Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From eaburns at iol.unh.edu Thu May 29 11:44:25 2008 From: eaburns at iol.unh.edu (Ethan Burns) Date: Thu, 29 May 2008 14:44:25 -0400 Subject: [ofa-general] UNH-iSCSI v2.0 with iSER Message-ID: <20080529184425.GA32717@postal.iol.unh.edu> Hello, I would like to inform everyone who is interested that the UNH-iSCSI sourceforge page [1] has been updated to include the latest version of the UNH-iSCSI initiator and target. This latest version: - includes major and minor bug fixes - includes iSER support - allows for compilation in user-space (for development purposes) - *finally* has support for selecting which DISKIO mode SCSI devices are offered by the target to the initiator. The code has been implemented and tested with RHEL5 (and OFED-1.3 when iSER mode is enabled). There are still some rough edges that need to be smoothed out, but hopefully the community will be able to help out here. Further, the iSER support has only been tested over iWARP. Thanks, Ethan Burns [1] https://sourceforge.net/projects/unh-iscsi From jon at opengridcomputing.com Thu May 29 11:55:25 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Thu, 29 May 2008 13:55:25 -0500 Subject: [rds-devel] [ofa-general] Port space sharing in RDS In-Reply-To: <200805292040.56097.olaf.kirch@oracle.com> References: <20080528225549.GC6288@opengridcomputing.com> <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> <20080529000354.GD6288@opengridcomputing.com> <200805292040.56097.olaf.kirch@oracle.com> Message-ID: <20080529185525.GD7299@opengridcomputing.com> On Thu, May 29, 2008 at 08:40:55PM +0200, Olaf Kirch wrote: > On Thursday 29 May 2008 02:03:54 Jon Mason wrote: > > On Wed, May 28, 2008 at 04:33:06PM -0700, Sean Hefty wrote: > > > >During RDS init, rds_ib_init and rds_tcp_init will both individually bind to > > > >the > > > >RDS port for all IP addresses. Unfortunately, that will not work for iWARP for > > > >2 major reasons. > > > > > > Can RDS use different port numbers for its RDMA and TCP protocols? 
The wire > > > > I do not know if this is desirable, but a quick test shows that having TCP and IB > > on different ports works around the problem. > > Okay, fine with me. Since TCP is disabled in 1.3 anyway, this shouldn't be an > issue there, but it'll certainly crop up - I'm re-enabling TCP for 1.4. > > Care to send me a patch? Any preference as to the port number? Sure, I'll be happy to send a patch. The port numbers I picked were simply the current one for use in TCP and the next one for IB. Obviously, I will need to verify that there are no conflicts with its usage. I'll check this out and send it out shortly. Thanks, Jon > > Olaf > -- > Olaf Kirch | --- o --- Nous sommes du soleil we love when we play > okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From hrosenstock at xsigo.com Thu May 29 12:25:07 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 12:25:07 -0700 Subject: [ofa-general] [PATCH][TRIVIAL] opensm/osm_sa_mcmember_record.c: Improve log message and some comments relating to SNM Message-ID: <1212089107.17997.122.camel@hrosenstock-ws.xsigo.com> opensm/osm_sa_mcmember_record.c: Improve log message and some comments relating to SNM (solicited node multicast) Signed-off-by: Hal Rosenstock diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c index fd6714c..040068f 100644 --- a/opensm/opensm/osm_sa_mcmember_record.c +++ b/opensm/opensm/osm_sa_mcmember_record.c @@ -1082,10 +1082,10 @@ __search_mgrp_by_mgid(IN cl_map_item_t * const p_map_item, IN void *context) if (memcmp(&p_mgrp->mcmember_rec.mgid, p_recvd_mgid, sizeof(ib_gid_t))) { if (sa->p_subn->opt.consolidate_ipv6_snm_req) { - /* Special Case IPV6 Multicast Loopback addresses */ + /* Special Case IPv6 Solicited Node Multicast (SNM) addresses */ /* 0xff12601bXXXX0000 : 0x00000001ffYYYYYY */ - /* Where XXXX is the partition and YYYYYY is the last 24 bits - * of the port guid */ + /* Where XXXX is the P_Key and + * YYYYYY is the last 24 bits of the port guid */ #define PREFIX_MASK (0xff12601b00000000ULL) #define INT_ID_MASK (0x00000001ff000000ULL) uint64_t g_prefix = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix); @@ -1099,8 +1099,8 @@ __search_mgrp_by_mgid(IN cl_map_item_t * const p_map_item, IN void *context) (g_interface_id & INT_ID_MASK) == (rcv_interface_id & INT_ID_MASK)) { OSM_LOG(sa->p_log, OSM_LOG_INFO, - "Special Case Mcast Join for MGID " - " MGID 0x%016"PRIx64" : 0x%016"PRIx64"\n", + "Special Case Solicited Node Mcast Join " + " for MGID 0x%016"PRIx64" : 0x%016"PRIx64"\n", rcv_prefix, rcv_interface_id); } else return; From hrosenstock at xsigo.com Thu May 29 12:25:10 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 12:25:10 -0700 Subject: [ofa-general] [PATCH][TRIVIAL] opensm/main.c: Minor change to long option for consolidate_ipv6_snm_req Message-ID: <1212089110.17997.123.camel@hrosenstock-ws.xsigo.com> opensm/main.c: Minor change to long option for consolidate_ipv6_snm_req Signed-off-by: Hal Rosenstock diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index fe19a12..05b3dd5 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -658,7 +658,7 @@ int main(int argc, char *argv[]) {"perfmgr_sweep_time_s", 1, NULL, 2}, #endif {"prefix_routes_file", 1, NULL, 3}, - {"consolidate_ipv6_snm_reqests", 0, NULL, 4}, + {"consolidate_ipv6_snm_req", 0, NULL, 4}, {NULL, 0, NULL, 0} /* Required at the end of the array */ }; From hrosenstock at xsigo.com Thu May 29 12:27:12 2008 From: hrosenstock at 
xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 12:27:12 -0700 Subject: [ofa-general] OpenSM IPv6 consolidation Message-ID: <1212089232.17997.125.camel@hrosenstock-ws.xsigo.com> Ira, In osm_sa_mcmember_record.c:__search_mgrp_by_mgid, there is: #define PREFIX_MASK (0xff12601b00000000ULL) Shouldn't all scopes be consolidated so this should be: #define PREFIX_MASK (0xff10601b00000000ULL) or was this intentional for some reason ? -- Hal From weiny2 at llnl.gov Thu May 29 14:35:35 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 29 May 2008 14:35:35 -0700 Subject: [ofa-general] Re: OpenSM IPv6 consolidation In-Reply-To: <1212089232.17997.125.camel@hrosenstock-ws.xsigo.com> References: <1212089232.17997.125.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080529143535.60d02d75.weiny2@llnl.gov> On Thu, 29 May 2008 12:27:12 -0700 Hal Rosenstock wrote: > Ira, > > In osm_sa_mcmember_record.c:__search_mgrp_by_mgid, there is: > > #define PREFIX_MASK (0xff12601b00000000ULL) > > Shouldn't all scopes be consolidated so this should be: > > #define PREFIX_MASK (0xff10601b00000000ULL) > > or was this intentional for some reason ? > It seemed reasonable for this to consolidate link-local only because according to my IPv6 book, solicited node multicast is the particular range, ff02::1:ff00:0/104 However, I am a bit confused about how the scope bits map from the IP address to the MGID. The MGID refers only to the IB-subnet scope _not_ IP, therefore what I said above might not matter because we are now talking about the IB scope. But that begs the question: Can a node issue an SNM request to a node in another IB subnet? (I think the answer is yes if the IP subnet spans more than one IB subnet) In that case, the SNM address would be in the range ff02::1:ff00:0/104 but what MGID would that map onto in IB? I think the current mapping results in an IB link-local scope. So would a router have to forward it even though the IB scope is link-local? Now my head hurts... :-( Ira From meier3 at llnl.gov Thu May 29 14:43:50 2008 From: meier3 at llnl.gov (Timothy A. Meier) Date: Thu, 29 May 2008 14:43:50 -0700 Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts with error if not root In-Reply-To: <20080525191047.GS4616@sashak.voltaire.com> References: <48358DF8.2060603@llnl.gov> <1211469467.18236.196.camel@hrosenstock-ws.xsigo.com> <20080523103532.GA4640@sashak.voltaire.com> <4836E9B8.2080406@llnl.gov> <20080525191047.GS4616@sashak.voltaire.com> Message-ID: <483F2396.3050008@llnl.gov> Sasha Khapyorsky wrote: > Hi Tim, Ira, > > On 08:58 Fri 23 May , Timothy A. Meier wrote: >> Following Hals advice, authorization is based on the umad permissions. > > I will send some more comments about this method later today. But > basically still think that some things could be broken and that it is > not really trivial to separate in this way wrong usage from desired > behavior reliably (with some approximation it is possible of course). > >> The intent is simply to provide a consistent and >> non-silent fail mechanism. > > OTOH I fully agree with yours and Ira's arguments about this - 'Silent' > fails are bad. I thought about how to solve this and and started to run > diag perl scripts from unprivileged account in various conditions (cache > file exists or not, cache dir is readable or not, etc.). > > First thing I saw was that even on bad usage most scripts return 0. Then > I found that on many failures return status is not checked or ignored > and program return 0. 
I did those two patches (below) and up to now it > works fine for me (but likely I didn't cover everything). What do you > say? > I think this patch is fine, and helps solve the improper "usage" issue. (btw - should we prefer the "adapter" spelling over "adaptor"?) My patch was addressing non-authorized use. Our philosophy was to not allow "any" sort of functionality (even help) if not authorized. Fail, and provide a reason/code. So rather than go through each perl script to see if the proper thing is done (return code is checked, error msg provided, terminate, etc.) each time a privileged function is invoked, we just do it at the beginning of the script, using a common (consistent) function call ( auth_check() ). I don't know if this is the desired behavior, but it would have caught a few problems we have encountered with "silent" failures that produce misleading results. It would also catch any future (unauthorized) scripting issues. On 5-23, I submitted a patch which adds an auth_check() function to the common perl module. I agree, the implementation is non-ideal, but it is probably sufficient for the vast majority of installations. If you think the concept of an auth_check() function is desirable/acceptable, then I will pursue fixing the implementation in a more universal way. > Sasha > > >>From cbbc155996c9f6efe91b78f055a643809b997468 Mon Sep 17 00:00:00 2001 > From: root > Date: Sat, 24 May 2008 11:04:08 +0300 > Subject: [PATCH] infiniband-diags/scripts/*.pl: exit 2 on usage errors > > Add non-zero exit status (2) on usage errors for perl scripts. > > Signed-off-by: root > --- > infiniband-diags/scripts/check_lft_balance.pl | 2 +- > infiniband-diags/scripts/ibfindnodesusing.pl | 2 +- > infiniband-diags/scripts/ibidsverify.pl | 2 +- > infiniband-diags/scripts/iblinkinfo.pl | 2 +- > infiniband-diags/scripts/ibprintca.pl | 2 +- > infiniband-diags/scripts/ibprintrt.pl | 2 +- > infiniband-diags/scripts/ibprintswitch.pl | 2 +- > infiniband-diags/scripts/ibqueryerrors.pl | 2 +- > infiniband-diags/scripts/ibswportwatch.pl | 2 +- > 9 files changed, 9 insertions(+), 9 deletions(-) > > diff --git a/infiniband-diags/scripts/check_lft_balance.pl b/infiniband-diags/scripts/check_lft_balance.pl > index 66f5f0f..b0f0fef 100755 > --- a/infiniband-diags/scripts/check_lft_balance.pl > +++ b/infiniband-diags/scripts/check_lft_balance.pl > @@ -70,7 +70,7 @@ sub usage > print "Usage: $prog [-R -v]\n"; > print " -R recalculate all cached information\n"; > print " -v verbose output\n"; > - exit 0; > + exit 2; > } > > sub is_port_up > diff --git a/infiniband-diags/scripts/ibfindnodesusing.pl b/infiniband-diags/scripts/ibfindnodesusing.pl > index 1bf0987..71656b3 100755 > --- a/infiniband-diags/scripts/ibfindnodesusing.pl > +++ b/infiniband-diags/scripts/ibfindnodesusing.pl > @@ -80,7 +80,7 @@ sub usage_and_exit > print " -R Recalculate ibnetdiscover information\n"; > print " -C use selected Channel Adaptor name for queries\n"; > print " -P use selected channel adaptor port for queries\n"; > - exit 0; > + exit 2; > } > > my $argv0 = `basename $0`; > diff --git a/infiniband-diags/scripts/ibidsverify.pl b/infiniband-diags/scripts/ibidsverify.pl > index de78e6b..1a236c8 100755 > --- a/infiniband-diags/scripts/ibidsverify.pl > +++ b/infiniband-diags/scripts/ibidsverify.pl > @@ -46,7 +46,7 @@ sub usage_and_exit > print " -h This help message\n"; > print > " -R Recalculate ibnetdiscover information (Default is to reuse ibnetdiscover output)\n"; > - exit 0; > + exit 2; > } > > my $argv0 = `basename $0`; > diff --git 
a/infiniband-diags/scripts/iblinkinfo.pl b/infiniband-diags/scripts/iblinkinfo.pl > index a195474..a7a3df5 100755 > --- a/infiniband-diags/scripts/iblinkinfo.pl > +++ b/infiniband-diags/scripts/iblinkinfo.pl > @@ -62,7 +62,7 @@ sub usage_and_exit > print " -C use selected Channel Adaptor name for queries\n"; > print " -P use selected channel adaptor port for queries\n"; > print " -g print port guids instead of node guids\n"; > - exit 0; > + exit 2; > } > > my $argv0 = `basename $0`; > diff --git a/infiniband-diags/scripts/ibprintca.pl b/infiniband-diags/scripts/ibprintca.pl > index 38b4330..0baea0b 100755 > --- a/infiniband-diags/scripts/ibprintca.pl > +++ b/infiniband-diags/scripts/ibprintca.pl > @@ -51,7 +51,7 @@ sub usage_and_exit > print " -l list cas\n"; > print " -C use selected channel adaptor name for queries\n"; > print " -P use selected channel adaptor port for queries\n"; > - exit 0; > + exit 2; > } > > my $argv0 = `basename $0`; > diff --git a/infiniband-diags/scripts/ibprintrt.pl b/infiniband-diags/scripts/ibprintrt.pl > index 86dcb64..0b3db19 100755 > --- a/infiniband-diags/scripts/ibprintrt.pl > +++ b/infiniband-diags/scripts/ibprintrt.pl > @@ -51,7 +51,7 @@ sub usage_and_exit > print " -l list rts\n"; > print " -C use selected channel adaptor name for queries\n"; > print " -P use selected channel adaptor port for queries\n"; > - exit 0; > + exit 2; > } > > my $argv0 = `basename $0`; > diff --git a/infiniband-diags/scripts/ibprintswitch.pl b/infiniband-diags/scripts/ibprintswitch.pl > index 6712201..c7377a9 100755 > --- a/infiniband-diags/scripts/ibprintswitch.pl > +++ b/infiniband-diags/scripts/ibprintswitch.pl > @@ -50,7 +50,7 @@ sub usage_and_exit > print " -l list switches\n"; > print " -C use selected channel adaptor name for queries\n"; > print " -P use selected channel adaptor port for queries\n"; > - exit 0; > + exit 2; > } > > my $argv0 = `basename $0`; > diff --git a/infiniband-diags/scripts/ibqueryerrors.pl b/infiniband-diags/scripts/ibqueryerrors.pl > index c807c02..5f2e167 100755 > --- a/infiniband-diags/scripts/ibqueryerrors.pl > +++ b/infiniband-diags/scripts/ibqueryerrors.pl > @@ -149,7 +149,7 @@ sub usage_and_exit > print " -d include the data counters in the output\n"; > print " -C use selected Channel Adaptor name for queries\n"; > print " -P use selected channel adaptor port for queries\n"; > - exit 0; > + exit 2; > } > > my $argv0 = `basename $0`; > diff --git a/infiniband-diags/scripts/ibswportwatch.pl b/infiniband-diags/scripts/ibswportwatch.pl > index 6d6ba1c..d888f51 100755 > --- a/infiniband-diags/scripts/ibswportwatch.pl > +++ b/infiniband-diags/scripts/ibswportwatch.pl > @@ -81,7 +81,7 @@ sub usage_and_exit > print " -n run n cycles then exit (default -1 == forever)\n"; > print " -G Address provided is a GUID\n"; > print " -b report bytes/second packets/second\n"; > - exit 0; > + exit 2; > } > > # ========================================================================= -- Timothy A. 
Meier Computer Scientist ICCD/High Performance Computing 925.422.3341 meier3 at llnl.gov From jon at opengridcomputing.com Thu May 29 15:58:24 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Thu, 29 May 2008 17:58:24 -0500 Subject: [ofa-general] [PATCH] rds: use separate ports for TCP and IB In-Reply-To: <200805292040.56097.olaf.kirch@oracle.com> References: <20080528225549.GC6288@opengridcomputing.com> <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> <20080529000354.GD6288@opengridcomputing.com> <200805292040.56097.olaf.kirch@oracle.com> Message-ID: <20080529225824.GB7960@opengridcomputing.com> [PATCH] rds: use separate ports for TCP and IB Currently, RDS will bind to a single port during bring up of both the IB and TCP sub-modules. This binding of 2 different processes to a single port causes a port space collision to devices which are aware of both (e.g., iWARP). This prevents iWARP devices from working with RDS if both TCP and IB are compiled in. This patch works around this issue by having IB and TCP bind to separate ports, thus avoiding the port space collision. This enables iWARP to work over RDS TCP. Signed-off-by: Jon Mason diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index a49e394..9935c9b 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -628,7 +628,7 @@ int rds_ib_conn_connect(struct rds_connection *conn) dest.sin_family = AF_INET; dest.sin_addr.s_addr = (__force u32)conn->c_faddr; - dest.sin_port = (__force u16)htons(RDS_PORT); + dest.sin_port = (__force u16)htons(RDS_IB_PORT); ret = rdma_resolve_addr(ic->i_cm_id, (struct sockaddr *)&src, (struct sockaddr *)&dest, @@ -813,7 +813,7 @@ int __init rds_ib_listen_init(void) sin.sin_family = PF_INET, sin.sin_addr.s_addr = (__force u32)htonl(INADDR_ANY); - sin.sin_port = (__force u16)htons(RDS_PORT); + sin.sin_port = (__force u16)htons(RDS_IB_PORT); /* * XXX I bet this binds the cm_id to a device. If we want to support @@ -833,7 +833,7 @@ int __init rds_ib_listen_init(void) goto out; } - rdsdebug("cm %p listening on port %u\n", cm_id, RDS_PORT); + rdsdebug("cm %p listening on port %u\n", cm_id, RDS_IB_PORT); rds_ib_listen_id = cm_id; cm_id = NULL; diff --git a/net/rds/rds.h b/net/rds/rds.h index 03031e2..aa14fa6 100644 --- a/net/rds/rds.h +++ b/net/rds/rds.h @@ -25,9 +25,11 @@ * userspace from listening. * * port 18633 was the version that had ack frames on the wire. + * port 18634 was the version that had both TCP and IB transports on the + * same port. 
*/ -#define RDS_PORT 18634 - +#define RDS_IB_PORT 18635 +#define RDS_TCP_PORT 18636 #ifndef AF_RDS #define AF_RDS 28 /* Reliable Datagram Socket */ diff --git a/net/rds/tcp_connect.c b/net/rds/tcp_connect.c index 0389a99..298e372 100644 --- a/net/rds/tcp_connect.c +++ b/net/rds/tcp_connect.c @@ -96,7 +96,7 @@ int rds_tcp_conn_connect(struct rds_connection *conn) dest.sin_family = AF_INET; dest.sin_addr.s_addr = (__force u32)conn->c_faddr; - dest.sin_port = (__force u16)htons(RDS_PORT); + dest.sin_port = (__force u16)htons(RDS_TCP_PORT); /* * once we call connect() we can start getting callbacks and they diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c index caeacbe..50709b7 100644 --- a/net/rds/tcp_listen.c +++ b/net/rds/tcp_listen.c @@ -159,7 +159,7 @@ int __init rds_tcp_listen_init(void) sin.sin_family = PF_INET, sin.sin_addr.s_addr = (__force u32)htonl(INADDR_ANY); - sin.sin_port = (__force u16)htons(RDS_PORT); + sin.sin_port = (__force u16)htons(RDS_TCP_PORT); ret = sock->ops->bind(sock, (struct sockaddr *)&sin, sizeof(sin)); if (ret < 0) From jgunthorpe at obsidianresearch.com Thu May 29 16:00:27 2008 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 29 May 2008 17:00:27 -0600 Subject: [ofa-general] Re: OpenSM IPv6 consolidation In-Reply-To: <20080529143535.60d02d75.weiny2@llnl.gov> References: <1212089232.17997.125.camel@hrosenstock-ws.xsigo.com> <20080529143535.60d02d75.weiny2@llnl.gov> Message-ID: <20080529230027.GG8259@obsidianresearch.com> On Thu, May 29, 2008 at 02:35:35PM -0700, Ira Weiny wrote: > But that begs the question: Can a node issue an SNM request to a node in > another IB subnet? (I think the answer is yes if the IP subnet spans more than > one IB subnet) In that case, the SNM address would be in the range > ff02::1:ff00:0/104 but what MGID would that map onto in IB? I think the > current mapping results in an IB link-local scope. So would a router have to > forward it even though the IB scope is link-local? IP (v4 and v6) 'link local' traffic (ie IPv4 broadcasts and IPv6 link local multicast) use MGID scope bits that are dependent on the configuration of the IPoIB stack. Today linux and everyone else uses link local MGID scope. There are patches floating about to make this configurable like pkey so that you can have a global IB scope IPoIB subnet. We used that patch set at SC07 to demonstrate IPoIB running single subnet across IB routers. Jason From weiny2 at llnl.gov Thu May 29 16:08:51 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 29 May 2008 16:08:51 -0700 Subject: [ofa-general] Re: OpenSM IPv6 consolidation In-Reply-To: <20080529230027.GG8259@obsidianresearch.com> References: <1212089232.17997.125.camel@hrosenstock-ws.xsigo.com> <20080529143535.60d02d75.weiny2@llnl.gov> <20080529230027.GG8259@obsidianresearch.com> Message-ID: <20080529160851.6ec1ed06.weiny2@llnl.gov> On Thu, 29 May 2008 17:00:27 -0600 Jason Gunthorpe wrote: > On Thu, May 29, 2008 at 02:35:35PM -0700, Ira Weiny wrote: > > > But that begs the question: Can a node issue an SNM request to a node in > > another IB subnet? (I think the answer is yes if the IP subnet spans more than > > one IB subnet) In that case, the SNM address would be in the range > > ff02::1:ff00:0/104 but what MGID would that map onto in IB? I think the > > current mapping results in an IB link-local scope. So would a router have to > > forward it even though the IB scope is link-local? 
> > IP (v4 and v6) 'link local' traffic (ie IPv4 broadcasts and IPv6 link > local multicast) use MGID scope bits that are dependent on the > configuration of the IPoIB stack. Today linux and everyone else uses > link local MGID scope. There are patches floating about to make this > configurable like pkey so that you can have a global IB scope IPoIB > subnet. We used that patch set at SC07 to demonstrate IPoIB running > single subnet across IB routers. > So, in that case if one is having issues with MLID space and wants to use my hack it should consolidate all the scopes. BTW, I still have on the back burner plans to implement a "real" fix to this problem... If only there were say -- 100 hours in a day? ;-) Ira From rdreier at cisco.com Thu May 29 16:11:09 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 29 May 2008 16:11:09 -0700 Subject: [ofa-general] [PATCH] rds: use separate ports for TCP and IB In-Reply-To: <20080529225824.GB7960@opengridcomputing.com> (Jon Mason's message of "Thu, 29 May 2008 17:58:24 -0500") References: <20080528225549.GC6288@opengridcomputing.com> <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> <20080529000354.GD6288@opengridcomputing.com> <200805292040.56097.olaf.kirch@oracle.com> <20080529225824.GB7960@opengridcomputing.com> Message-ID: > Currently, RDS will bind to a single port during bring up of both the IB and TCP > sub-modules. This binding of 2 different processes to a single port causes a > port space collision to devices which are aware of both (e.g., iWARP). This > prevents iWARP devices from working with RDS if both TCP and IB are compiled in. Of course nothing prevents another hapless application from trying to use port 18635 with TCP... Not really much we can do about the general port space collision problem unless and until the network stack guys are willing to cooperate though. - R. From jon at opengridcomputing.com Thu May 29 16:44:54 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Thu, 29 May 2008 18:44:54 -0500 Subject: [ofa-general] [PATCH] rds: use separate ports for TCP and IB In-Reply-To: References: <20080528225549.GC6288@opengridcomputing.com> <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> <20080529000354.GD6288@opengridcomputing.com> <200805292040.56097.olaf.kirch@oracle.com> <20080529225824.GB7960@opengridcomputing.com> Message-ID: <20080529234454.GD7960@opengridcomputing.com> On Thu, May 29, 2008 at 04:11:09PM -0700, Roland Dreier wrote: > > Currently, RDS will bind to a single port during bring up of both the IB and TCP > > sub-modules. This binding of 2 different processes to a single port causes a > > port space collision to devices which are aware of both (e.g., iWARP). This > > prevents iWARP devices from working with RDS if both TCP and IB are compiled in. > > Of course nothing prevents another hapless application from trying to > use port 18635 with TCP... Yes, but that potential problem was already there. I suppose in the long run RDS should try to get IANA to give a reserved port (assuming that the RDS of ports 1540 and 1541 is a different RDS). > Not really much we can do about the general port space collision problem > unless and until the network stack guys are willing to cooperate though. While I agree it is a necessity, I don't think I want to be the one to start that fight again. Perhaps if/when RDS is merged with mainline. > - R. 
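For readers following the port-space discussion, the split-port listen pattern adopted by the RDS patch earlier in this thread can be sketched in user space with librdmacm. This is a minimal illustration, not actual RDS code: one rdma_cm listener and one plain TCP socket listener, each bound to its own port so the two port spaces cannot collide on an iWARP device. The port numbers are the ones proposed in the patch; error handling is omitted.

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <rdma/rdma_cma.h>

#define RDS_IB_PORT  18635	/* values taken from the patch above */
#define RDS_TCP_PORT 18636

int main(void)
{
	struct rdma_event_channel *ec = rdma_create_event_channel();
	struct rdma_cm_id *cm_id;
	struct sockaddr_in sin;
	int sock;

	/* RDMA listener on its own port */
	rdma_create_id(ec, &cm_id, NULL, RDMA_PS_TCP);
	memset(&sin, 0, sizeof sin);
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(RDS_IB_PORT);
	rdma_bind_addr(cm_id, (struct sockaddr *) &sin);
	rdma_listen(cm_id, 16);

	/* plain TCP listener on a different port */
	sock = socket(AF_INET, SOCK_STREAM, 0);
	sin.sin_port = htons(RDS_TCP_PORT);
	bind(sock, (struct sockaddr *) &sin, sizeof sin);
	listen(sock, 16);

	printf("listening: RDMA on %d, TCP on %d\n", RDS_IB_PORT, RDS_TCP_PORT);
	return 0;
}

As Roland notes above, nothing stops an unrelated TCP application from grabbing either port; the split only removes the collision between the two RDS transports themselves.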
From vidvuds at ucla.edu Thu May 29 18:00:47 2008 From: vidvuds at ucla.edu (Vidvuds Ozolins) Date: Thu, 29 May 2008 18:00:47 -0700 Subject: [ofa-general] OFED-1.3.1 fails on CentOS 5.0 in libmlx4 Message-ID: Hi All, When I try installing OFED-1.3.1 on CentOS 5.0 I get the following error message: Build libmlx4 RPM Running rpmbuild --rebuild --define '_topdir /var/tmp/OFED_topdir' -- define 'dist ' --target x86_64 --define '_prefix /usr' --define '_exec_prefix /usr' --define '_sysconfdir /etc' --define '_usr /usr' / home/vidvuds/OFED-1.3.1-rc3/SRPMS/libmlx4-1.0-0.1.ofed20080421.src.rpm Failed to build libmlx4 RPM See /tmp/OFED.18735.logs/libmlx4.rpmbuild.log Anyone knows what is going on? The contents of the logfile are: [root at smithers OFED-1.3.1-rc3]# more /tmp/OFED.18735.logs/ libmlx4.rpmbuild.log Running rpmbuild --rebuild --define '_topdir /var/tmp/OFED_topdir' -- define 'dist ' --target x86_64 --define '_prefix /usr' --define '_exec_prefix /usr' --define '_sysconfdir /etc' --de fine '_usr /usr' /home/vidvuds/OFED-1.3.1-rc3/SRPMS/ libmlx4-1.0-0.1.ofed20080421.src.rpm error: Macro %dist has empty body error: Macro %dist has empty body warning: user vlad does not exist - using root warning: group vlad does not exist - using root warning: user vlad does not exist - using root warning: group vlad does not exist - using root Installing /home/vidvuds/OFED-1.3.1-rc3/SRPMS/ libmlx4-1.0-0.1.ofed20080421.src.rpm Building target platforms: x86_64 Building for target x86_64 Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.87202 + umask 022 + cd /var/tmp/OFED_topdir/BUILD + LANG=C + export LANG + unset DISPLAY + cd /var/tmp/OFED_topdir/BUILD + rm -rf libmlx4-1.0 + /bin/gzip -dc /var/tmp/OFED_topdir/SOURCES/ libmlx4-1.0-0.1.ofed20080421.tar.gz + tar -xf - + STATUS=0 + '[' 0 -ne 0 ']' + cd libmlx4-1.0 ++ /usr/bin/id -u + '[' 0 = 0 ']' + /bin/chown -Rhf root . ++ /usr/bin/id -u + '[' 0 = 0 ']' + /bin/chgrp -Rhf root . + /bin/chmod -Rf a+rX,u+w,g-w,o-w . + exit 0 Executing(%build): /bin/sh -e /var/tmp/rpm-tmp.87202 + umask 022 + cd /var/tmp/OFED_topdir/BUILD + cd libmlx4-1.0 + LANG=C + export LANG + unset DISPLAY + CFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param=s sp-buffer-size=4 -m64 -mtune=generic' + export CFLAGS + CXXFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param =ssp-buffer-size=4 -m64 -mtune=generic' + export CXXFLAGS + FFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param=s sp-buffer-size=4 -m64 -mtune=generic' + export FFLAGS ++ find . -name config.guess -o -name config.sub + for i in '$(find . -name config.guess -o -name config.sub)' ++ basename ./config/config.sub + '[' -f /usr/lib/rpm/redhat/config.sub ']' + /bin/rm -f ./config/config.sub ++ basename ./config/config.sub + /bin/cp -fv /usr/lib/rpm/redhat/config.sub ./config/config.sub `/usr/lib/rpm/redhat/config.sub' -> `./config/config.sub' + for i in '$(find . 
-name config.guess -o -name config.sub)' ++ basename ./config/config.guess + '[' -f /usr/lib/rpm/redhat/config.guess ']' + /bin/rm -f ./config/config.guess ++ basename ./config/config.guess + /bin/cp -fv /usr/lib/rpm/redhat/config.guess ./config/config.guess `/usr/lib/rpm/redhat/config.guess' -> `./config/config.guess' + ./configure --build=x86_64-redhat-linux-gnu --host=x86_64-redhat- linux-gnu --target=x86_64- redhat-linux-gnu --program-prefix= --prefix=/usr --exec-prefix=/usr -- bindir=/usr/bin --sbind ir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/ include --libdir=/usr/l ib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/ usr/com --mandir=/usr/s hare/man --infodir=/usr/share/info checking for a BSD-compatible install... /usr/bin/install -c checking whether build environment is sane... yes checking for gawk... gawk checking whether make sets $(MAKE)... yes checking build system type... x86_64-redhat-linux-gnu checking host system type... x86_64-redhat-linux-gnu checking for style of include used by make... GNU checking for x86_64-redhat-linux-gnu-gcc... no checking for gcc... gcc checking for C compiler default output file name... a.out checking whether the C compiler works... yes checking whether we are cross compiling... no checking for suffix of executables... checking for suffix of object files... o checking whether we are using the GNU C compiler... yes checking whether gcc accepts -g... yes checking for gcc option to accept ANSI C... none needed checking dependency style of gcc... gcc3 checking for a sed that does not truncate output... /bin/sed checking for egrep... grep -E checking for ld used by gcc... /usr/bin/ld checking if the linker (/usr/bin/ld) is GNU ld... yes checking for /usr/bin/ld option to reload object files... -r checking for BSD-compatible nm... /usr/bin/nm -B checking whether ln -s works... yes checking how to recognise dependent libraries... pass_all checking how to run the C preprocessor... gcc -E checking for ANSI C header files... yes checking for sys/types.h... yes checking for sys/stat.h... yes checking for stdlib.h... yes checking for string.h... yes checking for memory.h... yes checking for strings.h... yes checking for inttypes.h... yes checking for stdint.h... yes checking for unistd.h... yes checking dlfcn.h usability... yes checking dlfcn.h presence... yes checking for dlfcn.h... yes checking for x86_64-redhat-linux-gnu-g++... no checking for x86_64-redhat-linux-gnu-c++... no checking for x86_64-redhat-linux-gnu-gpp... no checking for x86_64-redhat-linux-gnu-aCC... no checking for x86_64-redhat-linux-gnu-CC... no checking for x86_64-redhat-linux-gnu-cxx... no checking for x86_64-redhat-linux-gnu-cc++... no checking for x86_64-redhat-linux-gnu-cl... no checking for x86_64-redhat-linux-gnu-FCC... no checking for x86_64-redhat-linux-gnu-KCC... no checking for x86_64-redhat-linux-gnu-RCC... no checking for x86_64-redhat-linux-gnu-xlC_r... no checking for x86_64-redhat-linux-gnu-xlC... no checking for g++... g++ checking whether we are using the GNU C++ compiler... yes checking whether g++ accepts -g... yes checking dependency style of g++... gcc3 checking how to run the C++ preprocessor... g++ -E checking for x86_64-redhat-linux-gnu-g77... no checking for x86_64-redhat-linux-gnu-f77... no checking for x86_64-redhat-linux-gnu-xlf... no checking for x86_64-redhat-linux-gnu-frt... no checking for x86_64-redhat-linux-gnu-pgf77... no checking for x86_64-redhat-linux-gnu-fort77... 
no checking for x86_64-redhat-linux-gnu-fl32... no checking for x86_64-redhat-linux-gnu-af77... no checking for x86_64-redhat-linux-gnu-f90... no checking for x86_64-redhat-linux-gnu-xlf90... no checking for x86_64-redhat-linux-gnu-pgf90... no checking for x86_64-redhat-linux-gnu-epcf90... no checking for x86_64-redhat-linux-gnu-f95... no checking for x86_64-redhat-linux-gnu-fort... no checking for x86_64-redhat-linux-gnu-xlf95... no checking for x86_64-redhat-linux-gnu-ifc... no checking for x86_64-redhat-linux-gnu-efc... no checking for x86_64-redhat-linux-gnu-pgf95... no checking for x86_64-redhat-linux-gnu-lf95... no checking for x86_64-redhat-linux-gnu-gfortran... no checking for g77... g77 checking whether we are using the GNU Fortran 77 compiler... no checking whether g77 accepts -g... yes checking the maximum length of command line arguments... 32768 checking command to parse /usr/bin/nm -B output from gcc object... ok checking for objdir... .libs checking for x86_64-redhat-linux-gnu-ar... no checking for ar... ar checking for x86_64-redhat-linux-gnu-ranlib... no checking for ranlib... ranlib checking for x86_64-redhat-linux-gnu-strip... no checking for strip... strip checking if gcc supports -fno-rtti -fno-exceptions... no checking for gcc option to produce PIC... -fPIC checking if gcc PIC flag -fPIC works... yes checking if gcc static flag -static works... yes checking if gcc supports -c -o file.o... yes checking whether the gcc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes checking whether -lc should be explicitly linked in... no checking dynamic linker characteristics... GNU/Linux ld.so checking how to hardcode library paths into programs... immediate checking whether stripping libraries is possible... yes checking if libtool supports shared libraries... yes checking whether to build shared libraries... yes checking whether to build static libraries... yes configure: creating libtool appending configuration tag "CXX" to libtool checking for ld used by g++... /usr/bin/ld -m elf_x86_64 checking if the linker (/usr/bin/ld -m elf_x86_64) is GNU ld... yes checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes checking for g++ option to produce PIC... -fPIC checking if g++ PIC flag -fPIC works... yes checking if g++ static flag -static works... yes checking if g++ supports -c -o file.o... yes checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes checking dynamic linker characteristics... GNU/Linux ld.so checking how to hardcode library paths into programs... immediate appending configuration tag "F77" to libtool checking if libtool supports shared libraries... yes checking whether to build shared libraries... yes checking whether to build static libraries... yes checking for g77 option to produce PIC... -fPIC checking if g77 PIC flag -fPIC works... no checking if g77 static flag -static works... no checking if g77 supports -c -o file.o... no checking whether the g77 linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes checking dynamic linker characteristics... GNU/Linux ld.so checking how to hardcode library paths into programs... immediate checking for x86_64-redhat-linux-gnu-gcc... gcc checking whether we are using the GNU C compiler... (cached) yes checking whether gcc accepts -g... (cached) yes checking for gcc option to accept ANSI C... (cached) none needed checking dependency style of gcc... (cached) gcc3 checking for ibv_get_device_list in -libverbs... 
yes checking infiniband/driver.h usability... yes checking infiniband/driver.h presence... yes checking for infiniband/driver.h... yes checking for ANSI C header files... (cached) yes checking valgrind/memcheck.h usability... yes checking valgrind/memcheck.h presence... yes checking for valgrind/memcheck.h... yes checking for an ANSI C-conforming const... yes checking for long... yes checking size of long... 8 checking for struct ibv_context.xrc_ops... no checking for ibv_read_sysfs_file... yes checking for ibv_dontfork_range... yes checking for ibv_dofork_range... yes checking for ibv_register_driver... yes checking whether ld accepts --version-script... yes configure: creating ./config.status config.status: creating Makefile config.status: creating libmlx4.spec config.status: creating config.h config.status: executing depfiles commands + make -j8 make all-am make[1]: Entering directory `/var/tmp/OFED_topdir/BUILD/libmlx4-1.0' if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. - I. -I. -g -Wall -D_G NU_SOURCE -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param=s sp-buffer-size=4 -m64 -mtune=generic -MT buf.lo -MD -MP -MF ".deps/ buf.Tpo" -c -o buf.lo `tes t -f 'src/buf.c' || echo './'`src/buf.c; \ then mv -f ".deps/buf.Tpo" ".deps/buf.Plo"; else rm -f ".deps/ buf.Tpo"; exit 1; fi if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. - I. -I. -g -Wall -D_G NU_SOURCE -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param=s sp-buffer-size=4 -m64 -mtune=generic -MT cq.lo -MD -MP -MF ".deps/ cq.Tpo" -c -o cq.lo `test - f 'src/cq.c' || echo './'`src/cq.c; \ then mv -f ".deps/cq.Tpo" ".deps/cq.Plo"; else rm -f ".deps/cq.Tpo"; exit 1; fi if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. - I. -I. -g -Wall -D_G NU_SOURCE -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param=s sp-buffer-size=4 -m64 -mtune=generic -MT dbrec.lo -MD -MP -MF ".deps/ dbrec.Tpo" -c -o dbrec.l o `test -f 'src/dbrec.c' || echo './'`src/dbrec.c; \ then mv -f ".deps/dbrec.Tpo" ".deps/dbrec.Plo"; else rm -f ".deps/ dbrec.Tpo"; exit 1; fi if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. - I. -I. -g -Wall -D_G NU_SOURCE -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param=s sp-buffer-size=4 -m64 -mtune=generic -MT mlx4.lo -MD -MP -MF ".deps/ mlx4.Tpo" -c -o mlx4.lo ` test -f 'src/mlx4.c' || echo './'`src/mlx4.c; \ then mv -f ".deps/mlx4.Tpo" ".deps/mlx4.Plo"; else rm -f ".deps/ mlx4.Tpo"; exit 1; fi if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. - I. -I. -g -Wall -D_G NU_SOURCE -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param=s sp-buffer-size=4 -m64 -mtune=generic -MT qp.lo -MD -MP -MF ".deps/ qp.Tpo" -c -o qp.lo `test - f 'src/qp.c' || echo './'`src/qp.c; \ then mv -f ".deps/qp.Tpo" ".deps/qp.Plo"; else rm -f ".deps/qp.Tpo"; exit 1; fi if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. - I. -I. -g -Wall -D_G NU_SOURCE -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param=s sp-buffer-size=4 -m64 -mtune=generic -MT srq.lo -MD -MP -MF ".deps/ srq.Tpo" -c -o srq.lo `tes t -f 'src/srq.c' || echo './'`src/srq.c; \ then mv -f ".deps/srq.Tpo" ".deps/srq.Plo"; else rm -f ".deps/ srq.Tpo"; exit 1; fi if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. - I. -I. 
-g -Wall -D_G NU_SOURCE -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param=s sp-buffer-size=4 -m64 -mtune=generic -MT verbs.lo -MD -MP -MF ".deps/ verbs.Tpo" -c -o verbs.l o `test -f 'src/verbs.c' || echo './'`src/verbs.c; \ then mv -f ".deps/verbs.Tpo" ".deps/verbs.Plo"; else rm -f ".deps/ verbs.Tpo"; exit 1; fi mkdir .libs gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -O2 -g -pipe - Wall -Wp,-D_FORTIFY_SOU RCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 - mtune=generic -MT dbrec. lo -MD -MP -MF .deps/dbrec.Tpo -c src/dbrec.c -fPIC -DPIC -o .libs/ dbrec.o gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -O2 -g -pipe - Wall -Wp,-D_FORTIFY_SOU RCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 - mtune=generic -MT buf.lo -MD -MP -MF .deps/buf.Tpo -c src/buf.c -fPIC -DPIC -o .libs/buf.o gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -O2 -g -pipe - Wall -Wp,-D_FORTIFY_SOU RCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 - mtune=generic -MT mlx4.l o -MD -MP -MF .deps/mlx4.Tpo -c src/mlx4.c -fPIC -DPIC -o .libs/mlx4.o gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -O2 -g -pipe - Wall -Wp,-D_FORTIFY_SOU RCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 - mtune=generic -MT srq.lo -MD -MP -MF .deps/srq.Tpo -c src/srq.c -fPIC -DPIC -o .libs/srq.o gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -O2 -g -pipe - Wall -Wp,-D_FORTIFY_SOU RCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 - mtune=generic -MT cq.lo -MD -MP -MF .deps/cq.Tpo -c src/cq.c -fPIC -DPIC -o .libs/cq.o gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -O2 -g -pipe - Wall -Wp,-D_FORTIFY_SOU RCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 - mtune=generic -MT qp.lo -MD -MP -MF .deps/qp.Tpo -c src/qp.c -fPIC -DPIC -o .libs/qp.o gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -O2 -g -pipe - Wall -Wp,-D_FORTIFY_SOU RCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 - mtune=generic -MT verbs. lo -MD -MP -MF .deps/verbs.Tpo -c src/verbs.c -fPIC -DPIC -o .libs/ verbs.o In file included from src/buf.c:39: src/mlx4.h:282: error: field 'ibv_xrcd' has incomplete type make[1]: *** [buf.lo] Error 1 make[1]: *** Waiting for unfinished jobs.... In file included from src/dbrec.c:42: src/mlx4.h:282: error: field 'ibv_xrcd' has incomplete type make[1]: *** [dbrec.lo] Error 1 In file included from src/srq.c:42: src/mlx4.h:282: error: field 'ibv_xrcd' has incomplete type make[1]: *** [srq.lo] Error 1 In file included from src/cq.c:47: src/mlx4.h:282: error: field 'ibv_xrcd' has incomplete type In file included from src/mlx4.c:49: src/mlx4.h:282: error: field 'ibv_xrcd' has incomplete type In file included from src/qp.c:44: src/mlx4.h:282: error: field 'ibv_xrcd' has incomplete type src/cq.c: In function 'mlx4_cq_clean': src/cq.c:408: error: 'struct ibv_srq' has no member named 'xrc_cq' src/qp.c: In function 'mlx4_post_send': src/qp.c:244: error: 'IBV_QPT_XRC' undeclared (first use in this function) src/qp.c:244: error: (Each undeclared identifier is reported only once src/qp.c:244: error: for each function it appears in.) 
src/qp.c:245: error: 'struct ibv_send_wr' has no member named 'xrc_remote_srq_num' make[1]: *** [cq.lo] Error 1 make[1]: *** [mlx4.lo] Error 1 src/qp.c: In function 'mlx4_calc_sq_wqe_size': src/qp.c:547: error: 'IBV_QPT_XRC' undeclared (first use in this function) src/qp.c: In function 'mlx4_set_sq_sizes': src/qp.c:636: error: 'IBV_QPT_XRC' undeclared (first use in this function) make[1]: *** [qp.lo] Error 1 In file included from src/verbs.c:44: src/mlx4.h:282: error: field 'ibv_xrcd' has incomplete type src/verbs.c: In function 'mlx4_destroy_srq': src/verbs.c:329: error: 'struct ibv_srq' has no member named 'xrc_cq' src/verbs.c:331: error: 'struct ibv_srq' has no member named 'xrc_cq' src/verbs.c:340: error: 'struct ibv_srq' has no member named 'xrc_cq' src/verbs.c: In function 'mlx4_create_qp': src/verbs.c:388: error: 'IBV_QPT_XRC' undeclared (first use in this function) src/verbs.c:388: error: (Each undeclared identifier is reported only once src/verbs.c:388: error: for each function it appears in.) src/verbs.c: In function 'mlx4_modify_qp': src/verbs.c:517: error: 'IBV_QPT_XRC' undeclared (first use in this function) src/verbs.c: In function 'mlx4_destroy_qp': src/verbs.c:579: error: 'IBV_QPT_XRC' undeclared (first use in this function) make[1]: *** [verbs.lo] Error 1 make[1]: Leaving directory `/var/tmp/OFED_topdir/BUILD/libmlx4-1.0' make: *** [all] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.87202 (%build) RPM build errors: Macro %dist has empty body Macro %dist has empty body user vlad does not exist - using root group vlad does not exist - using root user vlad does not exist - using root group vlad does not exist - using root Bad exit status from /var/tmp/rpm-tmp.87202 (%build) Thanks, Vidvuds Vidvuds Ozolins Dept. of Materials Science and Engineering University of California, Los Angeles 3121E Engineering V P.O. Box 951595 Los Angeles, CA 90095-1595 Office: (310) 267-5538 E-mail: vidvuds at ucla.edu From hrosenstock at xsigo.com Thu May 29 19:31:26 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 19:31:26 -0700 Subject: [ofa-general] Re: OpenSM IPv6 consolidation In-Reply-To: <20080529143535.60d02d75.weiny2@llnl.gov> References: <1212089232.17997.125.camel@hrosenstock-ws.xsigo.com> <20080529143535.60d02d75.weiny2@llnl.gov> Message-ID: <1212114686.17997.139.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-29 at 14:35 -0700, Ira Weiny wrote: > On Thu, 29 May 2008 12:27:12 -0700 > Hal Rosenstock wrote: > > > Ira, > > > > In osm_sa_mcmember_record.c:__search_mgrp_by_mgid, there is: > > > > #define PREFIX_MASK (0xff12601b00000000ULL) > > > > Shouldn't all scopes be consolidated so this should be: > > > > #define PREFIX_MASK (0xff10601b00000000ULL) > > > > or was this intentional for some reason ? > > > > It seemed reasonable for this to consolidate link-local only because according > to my IPv6 book, solicited node multicast is the particular range, > ff02::1:ff00:0/104 > > However, I am a bit confused about how the scope bits map from the IP address > to the MGID. The MGID refers only to the IB-subnet scope _not_ IP, therefore > what I said above might not matter because we are now talking about the IB > scope. Right; the IP scope is different from the IB scope. > But that begs the question: Can a node issue an SNM request to a node in > another IB subnet? Yes, as IPoIB subnets can span IB subnets. 
-- Hal

> (I think the answer is yes if the IP subnet spans more than
> one IB subnet) In that case, the SNM address would be in the range
> ff02::1:ff00:0/104 but what MGID would that map onto in IB? I think the
> current mapping results in an IB link-local scope. So would a router have to
> forward it even though the IB scope is link-local?
> Now my head hurts... :-(
>
> Ira
>

From hrosenstock at xsigo.com Thu May 29 19:32:10 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Thu, 29 May 2008 19:32:10 -0700
Subject: [ofa-general] Re: OpenSM IPv6 consolidation
In-Reply-To: <20080529143535.60d02d75.weiny2@llnl.gov>
References: <1212089232.17997.125.camel@hrosenstock-ws.xsigo.com>
	<20080529143535.60d02d75.weiny2@llnl.gov>
Message-ID: <1212114730.17997.141.camel@hrosenstock-ws.xsigo.com>

On Thu, 2008-05-29 at 14:35 -0700, Ira Weiny wrote:
> On Thu, 29 May 2008 12:27:12 -0700
> Hal Rosenstock wrote:
>
> > Ira,
> >
> > In osm_sa_mcmember_record.c:__search_mgrp_by_mgid, there is:
> >
> > #define PREFIX_MASK (0xff12601b00000000ULL)
> >
> > Shouldn't all scopes be consolidated so this should be:
> >
> > #define PREFIX_MASK (0xff10601b00000000ULL)
> >
> > or was this intentional for some reason ?
> >
> It seemed reasonable for this to consolidate link-local only

Actually, the code doesn't quite even do that. Patch to follow in a bit.

-- Hal

> because according
> to my IPv6 book, solicited node multicast is the particular range,
> ff02::1:ff00:0/104
>
> However, I am a bit confused about how the scope bits map from the IP address
> to the MGID. The MGID refers only to the IB-subnet scope _not_ IP, therefore
> what I said above might not matter because we are now talking about the IB
> scope.
>
> But that begs the question: Can a node issue an SNM request to a node in
> another IB subnet? (I think the answer is yes if the IP subnet spans more than
> one IB subnet) In that case, the SNM address would be in the range
> ff02::1:ff00:0/104 but what MGID would that map onto in IB? I think the
> current mapping results in an IB link-local scope. So would a router have to
> forward it even though the IB scope is link-local?
>
> Now my head hurts... :-(
>
> Ira
>
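To make the mask arithmetic in this thread concrete, the scope-insensitive match can be written as a small standalone C helper. The mask values are the ones proposed in Hal's follow-up patches later in this digest; the helper name is purely illustrative.

#include <stdint.h>

/* An IPv6 Solicited Node Multicast MGID has the layout
 *   0xff1Z601bXXXX0000 : 0x00000001ffYYYYYY
 * where Z is the scope nibble, XXXX the P_Key, and YYYYYY the low
 * 24 bits of the port GUID. PREFIX_MASK zeroes both the scope nibble
 * and the P_Key, so SNM MGIDs of every scope and every partition
 * match the single signature below.
 */
#define PREFIX_MASK      0xff10ffff00000000ULL
#define PREFIX_SIGNATURE 0xff10601b00000000ULL
#define INT_ID_MASK      0x00000001ff000000ULL

/* prefix/iid are the two MGID halves in host byte order */
static int is_ipv6_snm_mgid(uint64_t prefix, uint64_t iid)
{
	return (prefix & PREFIX_MASK) == PREFIX_SIGNATURE &&
	       (iid & INT_ID_MASK) == INT_ID_MASK;
}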
From vlad at lists.openfabrics.org Fri May 30 03:09:07 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 30 May 2008 03:09:07 -0700 (PDT)
Subject: [ofa-general] ofa_1_3_kernel 20080530-0200 daily build status
Message-ID: <20080530100907.EF77AE60BE5@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.24
Passed on ppc64 with linux-2.6.19

Failed:

From cousin_vinnie at hotmail.fr Fri May 30 03:26:30 2008
From: cousin_vinnie at hotmail.fr (Renaud Durand)
Date: Fri, 30 May 2008 12:26:30 +0200
Subject: [ofa-general] iSCSI problem
Message-ID: 

Hello guys,

I tried to create an iSER target to load a hard drive from a remote computer, following this tutorial:
https://wiki.openfabrics.org/tiki-index.php?page=ISER

I can connect to the target, and I now have a device /dev/sdc. My problem is that I cannot mount this device. I try:

linux-cx5e:~ # mount /dev/sdc /mnt

and the computer freezes. I have no idea why it freezes, so if you have any suggestions...
thanks
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From hrosenstock at xsigo.com Fri May 30 04:07:39 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Fri, 30 May 2008 04:07:39 -0700
Subject: [ofa-general] [PATCH] OpenSM/osm_sa_mcmember_record.c: Collapse all scopes when consolidating IPv6 SNM
Message-ID: <1212145660.17997.150.camel@hrosenstock-ws.xsigo.com>

OpenSM/osm_sa_mcmember_record.c: In
__search_mgrp_by_mgid, collapse all scopes when consolidating IPv6 SNM Minor comment change from v1 of this patch Patch is cumulative on minor improvement patch to this file Signed-off-by: Hal Rosenstock --- opensm/osm_sa_mcmember_record.c.1 2008-05-30 03:58:01.129544000 -0700 +++ opensm/osm_sa_mcmember_record.c 2008-05-30 04:11:10.637837000 -0700 @@ -1083,17 +1083,18 @@ if (sa->p_subn->opt.consolidate_ipv6_snm_req) { /* Special Case IPv6 Solicited Node Multicast (SNM) addresses */ - /* 0xff12601bXXXX0000 : 0x00000001ffYYYYYY */ - /* Where XXXX is the P_Key and + /* 0xff1Z601bXXXX0000 : 0x00000001ffYYYYYY */ + /* Where Z is the scope, XXXX is the P_Key, and * YYYYYY is the last 24 bits of the port guid */ -#define PREFIX_MASK (0xff10601b00000000ULL) +#define PREFIX_MASK (0xff10ffff00000000ULL) +#define PREFIX_SIGNATURE (0xff10601b00000000ULL) #define INT_ID_MASK (0x00000001ff000000ULL) uint64_t g_prefix = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix); uint64_t g_interface_id = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.interface_id); uint64_t rcv_prefix = cl_ntoh64(p_recvd_mgid->unicast.prefix); uint64_t rcv_interface_id = cl_ntoh64(p_recvd_mgid->unicast.interface_id); - if ((rcv_prefix & PREFIX_MASK) == PREFIX_MASK && + if ((rcv_prefix & PREFIX_MASK) == PREFIX_SIGNATURE && (rcv_interface_id & INT_ID_MASK) == INT_ID_MASK && g_prefix == rcv_prefix && (g_interface_id & INT_ID_MASK) == From shuttingr8 at epicenter.org Fri May 30 04:27:43 2008 From: shuttingr8 at epicenter.org (Antoinette Battle) Date: Fri, 30 May 2008 20:27:43 +0900 Subject: [ofa-general] We have CEOs as students. Message-ID: <01c8c293$9cff0980$0811aa79@shuttingr8> Bacheelor, MasteerMBA, and Doctoraate diplomas available in the field of your choice that's right, you can even become a Doctor and receive all the benefits that comes with it! Our Diplomas/Certificates are recognised in most countries No required examination, tests, classes, books, or interviews. ** No one is turned down ** Confidentiality assured CALL US 24 HOURS A DAY, 7 DAYS A WEEK For US: 1-801-504-2132 Outside US: +1-801-504-2132 "Just leave your NAME & PHONE NO. (with CountryCode)" in the voicemail our staff will get back to you in next few days -------------- next part -------------- An HTML attachment was scrubbed... URL: From Thomas.Talpey at netapp.com Fri May 30 04:55:53 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Fri, 30 May 2008 07:55:53 -0400 Subject: [ofa-general] [PATCH] rds: use separate ports for TCP and IB In-Reply-To: <20080529234454.GD7960@opengridcomputing.com> References: <20080528225549.GC6288@opengridcomputing.com> <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> <20080529000354.GD6288@opengridcomputing.com> <200805292040.56097.olaf.kirch@oracle.com> <20080529225824.GB7960@opengridcomputing.com> <20080529234454.GD7960@opengridcomputing.com> Message-ID: At 07:44 PM 5/29/2008, Jon Mason wrote: >On Thu, May 29, 2008 at 04:11:09PM -0700, Roland Dreier wrote: >> Of course nothing prevents another hapless application from trying to >> use port 18635 with TCP... > >Yes, but that potential problem was already there. I suppose in the >long run RDS >should try to get IANA to give a reserved port (assuming that the RDS >of ports 1540 >and 1541 is a different RDS). RDS should do that right now, before the number is compiled into the code! 
The registry is at http://www.iana.org/assignments/port-numbers and the application form is http://www.iana.org/cgi-bin/usr-port-number.pl It's supposed to be a two-week process, but it can take longer if you have special requests like a specific port number, etc. > >> Not really much we can do about the general port space collision problem >> unless and until the network stack guys are willing to cooperate though. I don't think it has anything to do with the network stack code. It's basically an Internet license plate, issued by a separate authority. Tom. From hrosenstock at xsigo.com Fri May 30 05:50:24 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 30 May 2008 05:50:24 -0700 Subject: [ofa-general] More questions/comments on IPv6 SNM consolidation option in OpenSM Message-ID: <1212151825.17997.173.camel@hrosenstock-ws.xsigo.com> Ira, The IPv6 SNM consolidation option in OpenSM currently collapses the SNM groups down to 1 group aliased group. However, in a heterogeneous network, not all ports will be able to meet certain group parameters (MTU, rate). This has been discussed on the list before. My current read of the code indicates that these joins would be rejected. Is that right ? If so, my question is why not allow them to create and join with their original real multicast group for this case ? The downside would be that if there were a lot of ports like this, then the consolidation would reduce the number of groups but maybe not enough. So would we then want an additional option for doing this (and what the default should be) ? One cut on the default would be to keep it the same as now but does that really matter ? Ideally, those additional SNM groups would be collapsed too. I think that aspect was dealt with in Jason's approach to this in a thread entitled "IPv6 and IPoIB scalability issue": http://lists.openfabrics.org/pipermail/general/2006-November/029621.html in which he proposed an MGID range for collapsing IPv6 SNM groups. Also, have you tried IPv6 SNM consolidation with multiple partitions ? I may have more on this aspect later. -- Hal From hrosenstock at xsigo.com Fri May 30 05:56:13 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 30 May 2008 05:56:13 -0700 Subject: [ofa-general] Re: [PATCH] infiniband-diags: terminate perl scripts with error if not authorized In-Reply-To: <20080525191430.GT4616@sashak.voltaire.com> References: <4836EB27.7060707@llnl.gov> <20080525191430.GT4616@sashak.voltaire.com> Message-ID: <1212152173.17997.176.camel@hrosenstock-ws.xsigo.com> On Sun, 2008-05-25 at 22:14 +0300, Sasha Khapyorsky wrote: > Hi Tim, > > On 09:04 Fri 23 May , Timothy A. Meier wrote: > > > > +# ========================================================================= > > +# only authorized if uid is root, or matches umad ownership > > +# > > +sub auth_check > > +{ > > + my $file = "/dev/infiniband/umad0"; > > How would we know that it is "/dev/infiniband/umad0" and not another > device (when first port in not connected, or if -C and/or -P options are > used, or if udev is configured to put the entries in another place)? > > Really I don't see an easy (without reimplementing most of libibumad > device resolution functionality via sysfs in perl scripts) way to detect > device reliably. How about having a library function return the umad mapping so this doesn't need to be reimplemented ? 
-- Hal > > + my $uid = (stat $file)[4]; > > + my $gid = (stat $file)[5]; > > + if (($> != $uid) && ($> != $gid) && ($> != 0)){ > > The requirement here is not really ownership, but rather that the file > is readable and writable by user which runs script. Right? > > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From forum.san at gmail.com Fri May 30 06:05:21 2008 From: forum.san at gmail.com (Sangamesh B) Date: Fri, 30 May 2008 18:35:21 +0530 Subject: [ofa-general] ***SPAM*** How to test OFED install Message-ID: Hi all, Can some one send the link/document which can explain the OFA tests:To check drivers installed properly? Thanks, Sangamesh -------------- next part -------------- An HTML attachment was scrubbed... URL: From okir at lst.de Fri May 30 06:21:19 2008 From: okir at lst.de (Olaf Kirch) Date: Fri, 30 May 2008 15:21:19 +0200 Subject: [ofa-general] Port space sharing in RDS In-Reply-To: <20080528225549.GC6288@opengridcomputing.com> References: <20080528225549.GC6288@opengridcomputing.com> Message-ID: <200805301521.20476.okir@lst.de> On Thursday 29 May 2008 00:55:49 Jon Mason wrote: > During RDS init, rds_ib_init and rds_tcp_init will both individually bind to the > RDS port for all IP addresses. Unfortunately, that will not work for iWARP for > 2 major reasons. I looked at the code, and I think it is possible to bind to each interface address separately instead of binding to INADDR_ANY once. It's a bit of extra work, but it would take care of the conflict. For instance, in the IB case, the binding code could move to rds_ib_add_one, and we'd bind to each address associated with the interface - the problem there being of course that we're currently not watching for address changes. Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From jon at opengridcomputing.com Fri May 30 08:08:14 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Fri, 30 May 2008 10:08:14 -0500 Subject: [ofa-general] Port space sharing in RDS In-Reply-To: <200805301521.20476.okir@lst.de> References: <20080528225549.GC6288@opengridcomputing.com> <200805301521.20476.okir@lst.de> Message-ID: <20080530150814.GB8638@opengridcomputing.com> On Fri, May 30, 2008 at 03:21:19PM +0200, Olaf Kirch wrote: > On Thursday 29 May 2008 00:55:49 Jon Mason wrote: > > During RDS init, rds_ib_init and rds_tcp_init will both individually bind to the > > RDS port for all IP addresses. Unfortunately, that will not work for iWARP for > > 2 major reasons. > > I looked at the code, and I think it is possible to bind to each > interface address separately instead of binding to INADDR_ANY once. > It's a bit of extra work, but it would take care of the conflict. > > For instance, in the IB case, the binding code could move to > rds_ib_add_one, and we'd bind to each address associated with the interface - > the problem there being of course that we're currently not watching for > address changes. Yes, that was my original suggestion (though worded poorly). If that way is prefered, I can do it. 
Thanks,
Jon

>
> Olaf
> --
> Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
> okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

From freyp at student.ethz.ch  Fri May 30 09:00:26 2008
From: freyp at student.ethz.ch (Philip Frey)
Date: Fri, 30 May 2008 18:00:26 +0200
Subject: [ofa-general] Length of inbound RDMA send
Message-ID: <4840249A.2080009@student.ethz.ch>

Hello,

I was wondering if a receive work completion tells me how many bytes
have been placed. Is 'byte_len' the field indicating that value?

The various fields of 'struct ibv_wc' are not quite clear to me.
Can you point me to a document where this is described?

Many thanks and kind regards,
Philip

From rdreier at cisco.com  Fri May 30 09:09:15 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 30 May 2008 09:09:15 -0700
Subject: [ofa-general] [PATCH] rds: use separate ports for TCP and IB
In-Reply-To: (Thomas Talpey's message of "Fri, 30 May 2008 07:55:53 -0400")
References: <20080528225549.GC6288@opengridcomputing.com>
	<000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com>
	<20080529000354.GD6288@opengridcomputing.com>
	<200805292040.56097.olaf.kirch@oracle.com>
	<20080529225824.GB7960@opengridcomputing.com>
	<20080529234454.GD7960@opengridcomputing.com>
Message-ID:

> >> Not really much we can do about the general port space collision problem
> >> unless and until the network stack guys are willing to cooperate though.

> I don't think it has anything to do with the network stack code. It's basically
> an Internet license plate, issued by a separate authority.

I just meant that currently, I can bind an iWARP listen and a normal TCP
listen to the same port, and a connect attempt to one or the other will
fail.

From rdreier at cisco.com  Fri May 30 09:10:33 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 30 May 2008 09:10:33 -0700
Subject: [ofa-general] Length of inbound RDMA send
In-Reply-To: <4840249A.2080009@student.ethz.ch> (Philip Frey's message of
	"Fri, 30 May 2008 18:00:26 +0200")
References: <4840249A.2080009@student.ethz.ch>
Message-ID:

> I was wondering if a receive work completion tells me how
> many bytes have been placed. Is 'byte_len' the field indicating
> that value?

Not sure what an "RDMA send" is -- if you mean a normal send (as opposed
to an RDMA operation), then yes, the byte_len field has the length of
the message that was received.

> The various fields of 'struct ibv_wc' are not quite clear to me.
> Can you point me to a document where this is described?

The "poll CQ" section of chapter 11 of the IB spec should cover it.

 - R.

From Thomas.Talpey at netapp.com  Fri May 30 09:29:01 2008
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Fri, 30 May 2008 12:29:01 -0400
Subject: [ofa-general] [PATCH] rds: use separate ports for TCP and IB
In-Reply-To: References: <20080528225549.GC6288@opengridcomputing.com>
	<000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com>
	<20080529000354.GD6288@opengridcomputing.com>
	<200805292040.56097.olaf.kirch@oracle.com>
	<20080529225824.GB7960@opengridcomputing.com>
	<20080529234454.GD7960@opengridcomputing.com>
Message-ID:

At 12:09 PM 5/30/2008, Roland Dreier wrote:
>
> > >> Not really much we can do about the general port space collision problem
> > >> unless and until the network stack guys are willing to cooperate though.
> >
> > I don't think it has anything to do with the network stack code.
> >It's basically
> > an Internet license plate, issued by a separate authority.
>
>I just meant that currently, I can bind an iWARP listen and a normal TCP
>listen to the same port, and a connect attempt to one or the other will fail.

Oh THAT problem. :-) Yes, at the moment TCP to the iWARP NIC is like
talking to a different host.

But, RDMA-aware versions of a given protocol still need a second port,
unless there is explicit upper layer support for initiating the MPA
exchange. We have the same issue with NFSv3/RDMA, and we have applied for
a second port (the application is still pending within IANA). The second
port is not needed for the future NFSv4.1, which has RDMA negotiation in
its session establishment. And it's also not needed for IB, which doesn't
have an RDMA upgrade at all. But for simplicity, we'll continue to use
both.

Tom.

From hrosenstock at xsigo.com  Fri May 30 11:07:15 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Fri, 30 May 2008 11:07:15 -0700
Subject: [ofa-general] [PATCH][TRIVIAL] opensm/osm_subnet.c: Change comment for IPv6 SNM in options file
Message-ID: <1212170835.17997.231.camel@hrosenstock-ws.xsigo.com>

opensm/osm_subnet.c: Change comment for IPv6 SNM in options file

Signed-off-by: Hal Rosenstock

--- opensm/osm_subnet.c.2	2008-05-29 04:24:19.802169000 -0700
+++ opensm/osm_subnet.c	2008-05-30 11:04:00.098938000 -0700
@@ -1713,7 +1713,7 @@
 			p_opts->prefix_routes_file);

 	fprintf(opts_file,
-		"#\n# IPv6 MCast Options\n#\n"
+		"#\n# IPv6 Solicited Node Multicast (SNM) Options\n#\n"
 		"consolidate_ipv6_snm_req %s\n\n",
 		p_opts->consolidate_ipv6_snm_req ? "TRUE" : "FALSE");

From dotanba at gmail.com  Fri May 30 12:39:03 2008
From: dotanba at gmail.com (Dotan Barak)
Date: Fri, 30 May 2008 21:39:03 +0200
Subject: [ofa-general] Length of inbound RDMA send
In-Reply-To: <4840249A.2080009@student.ethz.ch>
References: <4840249A.2080009@student.ethz.ch>
Message-ID: <484057D7.2060904@gmail.com>

Hi.

Philip Frey wrote:
> Hello,
>
> I was wondering if a receive work completion tells me how
> many bytes have been placed. Is 'byte_len' the field indicating
> that value?

But you know how many bytes you sent in this message... byte_len is most
useful for incoming messages (to understand how many bytes were
received).

Dotan

From swise at opengridcomputing.com  Fri May 30 12:24:30 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 30 May 2008 14:24:30 -0500
Subject: [ofa-general] Port space sharing in RDS
In-Reply-To: <20080530150814.GB8638@opengridcomputing.com>
References: <20080528225549.GC6288@opengridcomputing.com>
	<200805301521.20476.okir@lst.de>
	<20080530150814.GB8638@opengridcomputing.com>
Message-ID: <4840546E.40009@opengridcomputing.com>

Jon Mason wrote:
> On Fri, May 30, 2008 at 03:21:19PM +0200, Olaf Kirch wrote:
>
>> On Thursday 29 May 2008 00:55:49 Jon Mason wrote:
>>
>>> During RDS init, rds_ib_init and rds_tcp_init will both individually bind to the
>>> RDS port for all IP addresses. Unfortunately, that will not work for iWARP for
>>> 2 major reasons.
>>>
>> I looked at the code, and I think it is possible to bind to each
>> interface address separately instead of binding to INADDR_ANY once.
>> It's a bit of extra work, but it would take care of the conflict.
>>
>> For instance, in the IB case, the binding code could move to
>> rds_ib_add_one, and we'd bind to each address associated with the interface -
>> the problem there being of course that we're currently not watching for
>> address changes.
>>

> Yes, that was my original suggestion (though worded poorly). If that way
> is preferred, I can do it.
>

Note that if you do bind to specific addresses, then you need to deal
with multiple addresses bound to the same interface...

> Thanks,
> Jon
>
>> Olaf
>> --
>> Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
>> okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

From swise at opengridcomputing.com  Fri May 30 13:21:57 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 30 May 2008 15:21:57 -0500
Subject: [ofa-general] [PATCH RFC v4 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
In-Reply-To: <483E78C9.7080209@mellanox.co.il>
References: <20080527183429.32168.14351.stgit@dell3.ogc.int>
	<20080527183549.32168.22959.stgit@dell3.ogc.int>
	<483CBDF0.7030209@opengridcomputing.com>
	<483DCEA8.20505@opengridcomputing.com>
	<483E78C9.7080209@mellanox.co.il>
Message-ID: <484061E5.5060600@opengridcomputing.com>

Tziporet Koren wrote:
> Steve Wise wrote:
>> Yes, I have already said I'll post a test case. :)
>>
>> The krping tool will be the culprit. It's the kernel equivalent of
>> rping and has been around for a long time in one form or another.
>>
>> It is available at git://git.openfabrics.org/~swise/krping
>>
> Do you think we should include it in OFED as we include user space
> examples?
>
> Tziporet

I would rather not ship it since then I'd have to support it. :)

From hrosenstock at xsigo.com  Fri May 30 13:22:14 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Fri, 30 May 2008 13:22:14 -0700
Subject: [ofa-general] [PATCHv3] OpenSM/osm_sa_mcmember_record.c: Collapse all scopes when consolidating IPv6 SNM
Message-ID: <1212178934.17997.244.camel@hrosenstock-ws.xsigo.com>

OpenSM/osm_sa_mcmember_record.c: In __search_mgrp_by_mgid, collapse all
scopes when consolidating IPv6 SNM

v2 compares masked prefixes rather than actual prefix in MCMemberRecord MGID and MGRP
v1 had a minor comment change

Patch is cumulative on minor improvement patch to this file

Signed-off-by: Hal Rosenstock

--- opensm/osm_sa_mcmember_record.c.1	2008-05-30 03:58:01.129544000 -0700
+++ opensm/osm_sa_mcmember_record.c	2008-05-30 13:13:59.344954000 -0700
@@ -1083,19 +1083,21 @@

 	if (sa->p_subn->opt.consolidate_ipv6_snm_req) {
 		/* Special Case IPv6 Solicited Node Multicast (SNM) addresses */
-		/* 0xff12601bXXXX0000 : 0x00000001ffYYYYYY */
-		/* Where XXXX is the P_Key and
+		/* 0xff1Z601bXXXX0000 : 0x00000001ffYYYYYY */
+		/* Where Z is the scope, XXXX is the P_Key, and
		 * YYYYYY is the last 24 bits of the port guid */
-#define PREFIX_MASK (0xff10601b00000000ULL)
+#define PREFIX_MASK (0xff10ffff00000000ULL)
+#define PREFIX_SIGNATURE (0xff10601b00000000ULL)
 #define INT_ID_MASK (0x00000001ff000000ULL)
		uint64_t g_prefix = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix);
		uint64_t g_interface_id = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.interface_id);
		uint64_t rcv_prefix = cl_ntoh64(p_recvd_mgid->unicast.prefix);
		uint64_t rcv_interface_id = cl_ntoh64(p_recvd_mgid->unicast.interface_id);

-		if ((rcv_prefix & PREFIX_MASK) == PREFIX_MASK &&
+		if ((rcv_prefix & PREFIX_MASK) == PREFIX_SIGNATURE &&
		    (rcv_interface_id & INT_ID_MASK) == INT_ID_MASK &&
-		    g_prefix == rcv_prefix &&
+		    (g_prefix & PREFIX_MASK) ==
+		    (rcv_prefix & PREFIX_MASK)
&&
		    (g_interface_id & INT_ID_MASK) ==
		    (rcv_interface_id & INT_ID_MASK)) {
			OSM_LOG(sa->p_log, OSM_LOG_INFO,

From weiny2 at llnl.gov  Fri May 30 14:41:15 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 30 May 2008 14:41:15 -0700
Subject: [ofa-general] Re: More questions/comments on IPv6 SNM consolidation option in OpenSM
In-Reply-To: <1212151825.17997.173.camel@hrosenstock-ws.xsigo.com>
References: <1212151825.17997.173.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080530144115.5fe80f0c.weiny2@llnl.gov>

On Fri, 30 May 2008 05:50:24 -0700
Hal Rosenstock wrote:

> Ira,
>
> The IPv6 SNM consolidation option in OpenSM currently collapses the SNM
> groups down to one aliased group. However, in a heterogeneous
> network, not all ports will be able to meet certain group parameters
> (MTU, rate). This has been discussed on the list before. My current read
> of the code indicates that these joins would be rejected. Is that
> right?

Yes, I believe so.

>
> If so, my question is: why not allow them to create and join their
> original real multicast group in this case?

That would be fine as long as there were not too many "odd" nodes.

>
> The downside would be that if there were a lot of ports like this, the
> consolidation would reduce the number of groups but maybe not enough.
> So would we then want an additional option for doing this (and what
> should the default be)?

I think it would be best to consolidate all the "like" ports. For
example, if you had 3 different MTUs on the fabric then you would have 3
different MGIDs and groups. The main reason I did not do this was that
it would have been a much larger change to the code and I did not want
to risk breaking things.

>
> One cut on the default would be to keep it the same as now, but does
> that really matter? Ideally, those additional SNM groups would be
> collapsed too. I think that aspect was dealt with in Jason's approach to
> this in a thread entitled "IPv6 and IPoIB scalability issue":
> http://lists.openfabrics.org/pipermail/general/2006-November/029621.html
> in which he proposed an MGID range for collapsing IPv6 SNM groups.

Ah yes... I guess I should have read this part before responding above!
;-)

>
> Also, have you tried IPv6 SNM consolidation with multiple partitions?
> I may have more on this aspect later.
>

No, as we don't really use partitions.
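To make the "like" ports idea concrete, the grouping test would be
something like this (standalone illustration only -- the struct and
field encodings are my own shorthand, not OpenSM types):

#include <stdint.h>

/* Two SNM join requests may share one consolidated group only if their
 * group parameters match; each distinct (pkey, mtu, rate) tuple would
 * get its own aliased group/MGID. */
struct snm_join {
	uint16_t pkey;
	uint8_t mtu;	/* IB-encoded MTU from the MCMemberRecord */
	uint8_t rate;	/* IB-encoded rate */
};

static int snm_joins_compatible(const struct snm_join *a,
				const struct snm_join *b)
{
	return a->pkey == b->pkey && a->mtu == b->mtu && a->rate == b->rate;
}

This is essentially where Jason's MGID range proposal would come in: one
aliased MGID per distinct tuple.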
Ira

From vlad at lists.openfabrics.org  Sat May 31 03:09:05 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 31 May 2008 03:09:05 -0700 (PDT)
Subject: [ofa-general] ofa_1_3_kernel 20080531-0200 daily build status
Message-ID: <20080531100905.7B0CBE60B3E@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod
--with-user_mad-mod --with-user_access-mod --with-mthca-mod
--with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod
--with-cxgb3-mod --with-nes-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.24
Passed on ppc64 with linux-2.6.19

Failed:
From sashak at voltaire.com  Sat May 31 05:18:58 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 15:18:58 +0300
Subject: [ofa-general] saquery port problems
In-Reply-To:
References: <20080522073703.GA31474@sashak.voltaire.com>
Message-ID: <20080531121858.GB22418@sashak.voltaire.com>

On 11:32 Thu 29 May , Matthias Blankenhaus wrote:
>
> Sorry, I don't know what that is :-) This is my first patch for OFED,
> excuse my ignorance.

One of the simplest ways to send the patch is to commit the change to
some branch in your local git tree and then to email the output of the
'git-format-patch --stdout HEAD^' command.

> Please, let me know if this helps:
>
> Signed-off-by: matthias at sgi.com

Yes, this helps. Applied. Thanks.

Sasha

From sashak at voltaire.com  Sat May 31 06:41:19 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 16:41:19 +0300
Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts with error if not root
In-Reply-To: <483F2396.3050008@llnl.gov>
References: <48358DF8.2060603@llnl.gov>
	<1211469467.18236.196.camel@hrosenstock-ws.xsigo.com>
	<20080523103532.GA4640@sashak.voltaire.com>
	<4836E9B8.2080406@llnl.gov>
	<20080525191047.GS4616@sashak.voltaire.com>
	<483F2396.3050008@llnl.gov>
Message-ID: <20080531134119.GE22418@sashak.voltaire.com>

On 14:43 Thu 29 May , Timothy A. Meier wrote:
> I think this patch is fine, and helps solve the improper "usage" issue.

I will apply it then.

> (btw - should we prefer the "adapter" spelling over "adaptor"?)

Originally it was added as "adaptor" with the "adding -C, -P options"
patch. I have nothing against changing this to "adapter".

> My patch was addressing non-authorized use. Our philosophy was to not
> allow "any" sort of functionality (even help) if not authorized. Fail,
> and provide a reason/code.

Doesn't 'chmod 0700 /usr/local/sbin/ib*.pl' (as root) solve this?

> So rather than go through each perl script to see if the proper thing
> is done (return code is checked, error msg provided, terminate, etc.)

It is bug fixing... :)

> On 5-23, I submitted a patch which adds an auth_check() function to the
> common perl module. I agree, the implementation is non-ideal, but it is
> probably sufficient for the vast majority of installations.
>
> If you think the concept of an auth_check() function is
> desirable/acceptable, then I will pursue fixing the implementation in a
> more universal way.

Basically I think the idea of limited access is useful, but I don't see
why a simple 'chmod' is insufficient. And if it is not, I think that
auth_check() should be optional (and of course not broken).

Sasha

From sashak at voltaire.com  Sat May 31 07:11:19 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 17:11:19 +0300
Subject: [ofa-general] Re: [PATCHv2] management: Support separate SA and SM keys as clarified in IBA 1.2.1
In-Reply-To: <1212067372.17997.43.camel@hrosenstock-ws.xsigo.com>
References: <1212067372.17997.43.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531141119.GG22418@sashak.voltaire.com>

On 06:22 Thu 29 May , Hal Rosenstock wrote:
> management: Support separate SA and SM keys as clarified in IBA 1.2.1
>
> v2 is just a rebase to latest tree
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.

A host order default value is obviously the wrong thing (like just
OSM_DEFAULT_SM_KEY, which is not resolved yet).
I think I will change it to something like:

#define OSM_DEFAULT_SA_KEY OSM_DEFAULT_SM_KEY

Sasha

From sashak at voltaire.com  Sat May 31 07:13:40 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 17:13:40 +0300
Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_sa_mcmember_record.c: Improve log message and some comments relating to SNM
In-Reply-To: <1212089107.17997.122.camel@hrosenstock-ws.xsigo.com>
References: <1212089107.17997.122.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531141340.GH22418@sashak.voltaire.com>

On 12:25 Thu 29 May , Hal Rosenstock wrote:
> opensm/osm_sa_mcmember_record.c: Improve log message and some comments
> relating to SNM (solicited node multicast)
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.

Sasha

From hrosenstock at xsigo.com  Sat May 31 07:19:32 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Sat, 31 May 2008 07:19:32 -0700
Subject: [ofa-general] Re: [PATCHv2] management: Support separate SA and SM keys as clarified in IBA 1.2.1
In-Reply-To: <20080531141119.GG22418@sashak.voltaire.com>
References: <1212067372.17997.43.camel@hrosenstock-ws.xsigo.com>
	<20080531141119.GG22418@sashak.voltaire.com>
Message-ID: <1212243572.17997.276.camel@hrosenstock-ws.xsigo.com>

On Sat, 2008-05-31 at 17:11 +0300, Sasha Khapyorsky wrote:
> On 06:22 Thu 29 May , Hal Rosenstock wrote:
> > management: Support separate SA and SM keys as clarified in IBA 1.2.1
> >
> > v2 is just a rebase to latest tree
> >
> > Signed-off-by: Hal Rosenstock
>
> Applied. Thanks.
>
> A host order default value is obviously the wrong thing

But it's compatible with what is there now, so it doesn't require a
change to saquery except in the PPC case as you noted.

> (like just OSM_DEFAULT_SM_KEY, which is not resolved yet).
> I think I will change it to something like:
>
> #define OSM_DEFAULT_SA_KEY OSM_DEFAULT_SM_KEY

It depends on how the default SM key finally settles out as to whether
it's better to do it this way. I agree that if we started with a clean
slate, this would be the way to do it.

-- Hal

> Sasha

From sashak at voltaire.com  Sat May 31 07:28:16 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 17:28:16 +0300
Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/main.c: Minor change to long option for consolidate_ipv6_snm_req
In-Reply-To: <1212089110.17997.123.camel@hrosenstock-ws.xsigo.com>
References: <1212089110.17997.123.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531142816.GI22418@sashak.voltaire.com>

On 12:25 Thu 29 May , Hal Rosenstock wrote:
> opensm/main.c: Minor change to long option for consolidate_ipv6_snm_req
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.

Sasha

From sashak at voltaire.com  Sat May 31 07:35:35 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 17:35:35 +0300
Subject: [ofa-general] Re: [PATCH][TRIVIAL] OpenSM: Add another HP OUI to recognized vendor IDs
In-Reply-To: <1212145677.17997.151.camel@hrosenstock-ws.xsigo.com>
References: <1212145677.17997.151.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531143535.GJ22418@sashak.voltaire.com>

On 04:07 Fri 30 May , Hal Rosenstock wrote:
> OpenSM: Add another HP OUI to recognized vendor IDs
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.
Sasha

From sashak at voltaire.com  Sat May 31 07:40:53 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 17:40:53 +0300
Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_subnet.c: Change comment for IPv6 SNM in options file
In-Reply-To: <1212170835.17997.231.camel@hrosenstock-ws.xsigo.com>
References: <1212170835.17997.231.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531144053.GK22418@sashak.voltaire.com>

On 11:07 Fri 30 May , Hal Rosenstock wrote:
> opensm/osm_subnet.c: Change comment for IPv6 SNM in options file
>
> Signed-off-by: Hal Rosenstock

Applied by hand (patch is whitespace mangled). Thanks.

Sasha

> --- opensm/osm_subnet.c.2	2008-05-29 04:24:19.802169000 -0700
> +++ opensm/osm_subnet.c	2008-05-30 11:04:00.098938000 -0700
> @@ -1713,7 +1713,7 @@
> 			p_opts->prefix_routes_file);
>
> 	fprintf(opts_file,
> -		"#\n# IPv6 MCast Options\n#\n"
> +		"#\n# IPv6 Solicited Node Multicast (SNM) Options\n#\n"
> 		"consolidate_ipv6_snm_req %s\n\n",
> 		p_opts->consolidate_ipv6_snm_req ? "TRUE" : "FALSE");

From sashak at voltaire.com  Sat May 31 08:09:22 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 18:09:22 +0300
Subject: [ofa-general] Multicast Performance
In-Reply-To: <483EBE95.60901@informatik.tu-chemnitz.de>
References: <4836E231.4000601@informatik.tu-chemnitz.de>
	<48371B2D.3040908@gmail.com>
	<483A7E40.5040407@informatik.tu-chemnitz.de>
	<483BBBDB.6000605@informatik.tu-chemnitz.de>
	<483DA512.2070403@gmail.com>
	<483E7520.1000302@informatik.tu-chemnitz.de>
	<1212065181.27600.96.camel@hrosenstock-ws.xsigo.com>
	<483EB11A.5000000@informatik.tu-chemnitz.de>
	<1212068243.17997.48.camel@hrosenstock-ws.xsigo.com>
	<483EBE95.60901@informatik.tu-chemnitz.de>
Message-ID: <20080531150922.GL22418@sashak.voltaire.com>

On 16:32 Thu 29 May , Marcel Heinz wrote:
>
> Now, all 3 instances measure 950MB/s throughput.
>
> The returned MCMember Records are absolutely identical except
> for the PortGid and the membership state.

So the difference is only membership. If you have just 2 full member
instances, do you still see the performance degradation?

Sasha

From sashak at voltaire.com  Sat May 31 08:37:30 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 18:37:30 +0300
Subject: [ofa-general] Re: [PATCHv3] OpenSM/osm_sa_mcmember_record.c: Collapse all scopes when consolidating IPv6 SNM
In-Reply-To: <1212178934.17997.244.camel@hrosenstock-ws.xsigo.com>
References: <1212178934.17997.244.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531153730.GM22418@sashak.voltaire.com>

Hi Hal,

I agree with the idea. But the patch itself does not seem to be against
mainline (or any known published branch). Comments are below.
On 13:22 Fri 30 May , Hal Rosenstock wrote:
> OpenSM/osm_sa_mcmember_record.c: In __search_mgrp_by_mgid, collapse all
> scopes when consolidating IPv6 SNM
>
> v2 compares masked prefixes rather than actual prefix in MCMemberRecord MGID and MGRP
> v1 had a minor comment change
>
> Patch is cumulative on minor improvement patch to this file
>
> Signed-off-by: Hal Rosenstock
>
> --- opensm/osm_sa_mcmember_record.c.1	2008-05-30 03:58:01.129544000 -0700
> +++ opensm/osm_sa_mcmember_record.c	2008-05-30 13:13:59.344954000 -0700

Please next time generate a diff at least at one level above (but better
is, as usual, at git tree level + 1).

> @@ -1083,19 +1083,21 @@
>
> 	if (sa->p_subn->opt.consolidate_ipv6_snm_req) {
> 		/* Special Case IPv6 Solicited Node Multicast (SNM) addresses */
> -		/* 0xff12601bXXXX0000 : 0x00000001ffYYYYYY */
> -		/* Where XXXX is the P_Key and
> +		/* 0xff1Z601bXXXX0000 : 0x00000001ffYYYYYY */
> +		/* Where Z is the scope, XXXX is the P_Key, and
> 		 * YYYYYY is the last 24 bits of the port guid */
> -#define PREFIX_MASK (0xff10601b00000000ULL)

There I have the value 0xff12601b00000000ULL (and likely that is what you
wanted to fix :)).

> +#define PREFIX_MASK (0xff10ffff00000000ULL)
> +#define PREFIX_SIGNATURE (0xff10601b00000000ULL)
> #define INT_ID_MASK (0x00000001ff000000ULL)
> 		uint64_t g_prefix = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix);
> 		uint64_t g_interface_id = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.interface_id);
> 		uint64_t rcv_prefix = cl_ntoh64(p_recvd_mgid->unicast.prefix);
> 		uint64_t rcv_interface_id = cl_ntoh64(p_recvd_mgid->unicast.interface_id);
>
> -		if ((rcv_prefix & PREFIX_MASK) == PREFIX_MASK &&
> +		if ((rcv_prefix & PREFIX_MASK) == PREFIX_SIGNATURE &&

If you are changing PREFIX_MASK to 0xff10601b00000000ULL, why is
PREFIX_SIGNATURE needed? Am I missing something?

Sasha

> 		    (rcv_interface_id & INT_ID_MASK) == INT_ID_MASK &&
> -		    g_prefix == rcv_prefix &&
> +		    (g_prefix & PREFIX_MASK) ==
> +		    (rcv_prefix & PREFIX_MASK) &&
> 		    (g_interface_id & INT_ID_MASK) ==
> 		    (rcv_interface_id & INT_ID_MASK)) {
> 			OSM_LOG(sa->p_log, OSM_LOG_INFO,

From hrosenstock at xsigo.com  Sat May 31 08:48:45 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Sat, 31 May 2008 08:48:45 -0700
Subject: [ofa-general] Re: [PATCHv3] OpenSM/osm_sa_mcmember_record.c: Collapse all scopes when consolidating IPv6 SNM
In-Reply-To: <20080531153730.GM22418@sashak.voltaire.com>
References: <1212178934.17997.244.camel@hrosenstock-ws.xsigo.com>
	<20080531153730.GM22418@sashak.voltaire.com>
Message-ID: <1212248925.17997.282.camel@hrosenstock-ws.xsigo.com>

Hi Sasha,

On Sat, 2008-05-31 at 18:37 +0300, Sasha Khapyorsky wrote:
> Hi Hal,
>
> I agree with the idea. But the patch itself does not seem to be against
> mainline (or any known published branch). Comments are below.

It is against master, but I made a mistake in generating it.

> On 13:22 Fri 30 May , Hal Rosenstock wrote:
> > OpenSM/osm_sa_mcmember_record.c: In __search_mgrp_by_mgid, collapse all
> > scopes when consolidating IPv6 SNM
> >
> > v2 compares masked prefixes rather than actual prefix in MCMemberRecord MGID and MGRP
> > v1 had a minor comment change
> >
> > Patch is cumulative on minor improvement patch to this file
> >
> > Signed-off-by: Hal Rosenstock
> >
> > --- opensm/osm_sa_mcmember_record.c.1	2008-05-30 03:58:01.129544000 -0700
> > +++ opensm/osm_sa_mcmember_record.c	2008-05-30 13:13:59.344954000 -0700
>
> Please next time generate a diff at least at one level above (but better
> is, as usual, at git tree level + 1).
> > @@ -1083,19 +1083,21 @@
> >
> > 	if (sa->p_subn->opt.consolidate_ipv6_snm_req) {
> > 		/* Special Case IPv6 Solicited Node Multicast (SNM) addresses */
> > -		/* 0xff12601bXXXX0000 : 0x00000001ffYYYYYY */
> > -		/* Where XXXX is the P_Key and
> > +		/* 0xff1Z601bXXXX0000 : 0x00000001ffYYYYYY */
> > +		/* Where Z is the scope, XXXX is the P_Key, and
> > 		 * YYYYYY is the last 24 bits of the port guid */
> > -#define PREFIX_MASK (0xff10601b00000000ULL)
>
> There I have the value 0xff12601b00000000ULL (and likely that is what you
> wanted to fix :)).

Correct. That was where the generated patch was broken.

> > +#define PREFIX_MASK (0xff10ffff00000000ULL)
> > +#define PREFIX_SIGNATURE (0xff10601b00000000ULL)
> > #define INT_ID_MASK (0x00000001ff000000ULL)
> > 		uint64_t g_prefix = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix);
> > 		uint64_t g_interface_id = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.interface_id);
> > 		uint64_t rcv_prefix = cl_ntoh64(p_recvd_mgid->unicast.prefix);
> > 		uint64_t rcv_interface_id = cl_ntoh64(p_recvd_mgid->unicast.interface_id);
> >
> > -		if ((rcv_prefix & PREFIX_MASK) == PREFIX_MASK &&
> > +		if ((rcv_prefix & PREFIX_MASK) == PREFIX_SIGNATURE &&
>
> If you are changing PREFIX_MASK to 0xff10601b00000000ULL,

No, PREFIX_MASK is being changed to 0xff10ffff00000000ULL to:
1. eliminate the scope part, and
2. get the entire signature
for subsequent comparison.

> why is PREFIX_SIGNATURE needed? Am I missing something?

Because PREFIX_MASK masks the bits to compare and SIGNATURE is what it
should be after masking.

I'll regenerate v4 and hopefully I'll get it right.

-- Hal

> Sasha
>
> > 		    (rcv_interface_id & INT_ID_MASK) == INT_ID_MASK &&
> > -		    g_prefix == rcv_prefix &&
> > +		    (g_prefix & PREFIX_MASK) ==
> > +		    (rcv_prefix & PREFIX_MASK) &&
> > 		    (g_interface_id & INT_ID_MASK) ==
> > 		    (rcv_interface_id & INT_ID_MASK)) {
> > 			OSM_LOG(sa->p_log, OSM_LOG_INFO,

From hrosenstock at xsigo.com  Sat May 31 09:03:47 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Sat, 31 May 2008 09:03:47 -0700
Subject: [ofa-general] [PATCHv4] OpenSM/osm_sa_mcmember_record.c: Collapse all scopes when consolidating IPv6 SNM
Message-ID: <1212249827.17997.294.camel@hrosenstock-ws.xsigo.com>

OpenSM/osm_sa_mcmember_record.c: In __search_mgrp_by_mgid, collapse all
scopes when consolidating IPv6 SNM

v4 fixes the original PREFIX_MASK as Sasha commented
v3 compares masked prefixes rather than actual prefix in MCMemberRecord
MGID and MGRP
v2 had a minor comment change

Patch is cumulative on minor improvement patch to this file

Signed-off-by: Hal Rosenstock

--- opensm/opensm/osm_sa_mcmember_record.c.1	2008-05-30 03:58:01.129544000 -0700
+++ opensm/opensm/osm_sa_mcmember_record.c	2008-05-30 13:13:59.344954000 -0700
@@ -1083,19 +1083,21 @@

 	if (sa->p_subn->opt.consolidate_ipv6_snm_req) {
 		/* Special Case IPv6 Solicited Node Multicast (SNM) addresses */
-		/* 0xff12601bXXXX0000 : 0x00000001ffYYYYYY */
-		/* Where XXXX is the P_Key and
+		/* 0xff1Z601bXXXX0000 : 0x00000001ffYYYYYY */
+		/* Where Z is the scope, XXXX is the P_Key, and
		 * YYYYYY is the last 24 bits of the port guid */
-#define PREFIX_MASK (0xff12601b00000000ULL)
+#define PREFIX_MASK (0xff10ffff00000000ULL)
+#define PREFIX_SIGNATURE (0xff10601b00000000ULL)
 #define INT_ID_MASK (0x00000001ff000000ULL)
		uint64_t g_prefix = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix);
		uint64_t g_interface_id = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.interface_id);
		uint64_t rcv_prefix = cl_ntoh64(p_recvd_mgid->unicast.prefix);
		uint64_t rcv_interface_id = cl_ntoh64(p_recvd_mgid->unicast.interface_id);

-		if ((rcv_prefix & PREFIX_MASK) == PREFIX_MASK &&
+		if ((rcv_prefix & PREFIX_MASK) == PREFIX_SIGNATURE &&
		    (rcv_interface_id & INT_ID_MASK) == INT_ID_MASK &&
-		    g_prefix == rcv_prefix &&
+		    (g_prefix & PREFIX_MASK) ==
+		    (rcv_prefix & PREFIX_MASK) &&
		    (g_interface_id & INT_ID_MASK) ==
		    (rcv_interface_id & INT_ID_MASK)) {
			OSM_LOG(sa->p_log, OSM_LOG_INFO,

From sashak at voltaire.com  Sat May 31 10:13:15 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 20:13:15 +0300
Subject: [ofa-general] Re: [PATCHv3] OpenSM/osm_sa_mcmember_record.c: Collapse all scopes when consolidating IPv6 SNM
In-Reply-To: <1212248925.17997.282.camel@hrosenstock-ws.xsigo.com>
References: <1212178934.17997.244.camel@hrosenstock-ws.xsigo.com>
	<20080531153730.GM22418@sashak.voltaire.com>
	<1212248925.17997.282.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531171315.GN22418@sashak.voltaire.com>

On 08:48 Sat 31 May , Hal Rosenstock wrote:
>
> No, PREFIX_MASK is being changed to 0xff10ffff00000000ULL to:
> 1. eliminate the scope part, and
> 2. get the entire signature
> for subsequent comparison.
>
> > why is PREFIX_SIGNATURE needed? Am I missing something?
>
> Because PREFIX_MASK masks the bits to compare and SIGNATURE is what it
> should be after masking.

I see. Then shouldn't it (the mask) be 0xff10ffff0000ffffULL?

Sasha

From sashak at voltaire.com  Sat May 31 10:28:15 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 20:28:15 +0300
Subject: [ofa-general] [PATCH v2] saquery: --smkey command line option
In-Reply-To: <1211972791.13185.334.camel@hrosenstock-ws.xsigo.com>
References: <20080522145607.GE32128@sashak.voltaire.com>
	<1211469029.18236.188.camel@hrosenstock-ws.xsigo.com>
	<20080523100634.GD4164@sashak.voltaire.com>
	<1211541313.13185.80.camel@hrosenstock-ws.xsigo.com>
	<20080523123414.GB4640@sashak.voltaire.com>
	<1211547161.13185.103.camel@hrosenstock-ws.xsigo.com>
	<20080527103341.GF12014@sashak.voltaire.com>
	<1211888036.13185.219.camel@hrosenstock-ws.xsigo.com>
	<20080527175343.GA14205@sashak.voltaire.com>
	<1211972791.13185.334.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531172815.GO22418@sashak.voltaire.com>

This adds the possibility to specify an SM_Key value with saquery. It
should work with queries where OSM_DEFAULT_SM_KEY was used. If a
non-numeric string (like 'x') is provided with the --smkey option then
saquery will prompt for the SM_Key value.
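For example (illustrative session only; the key value is made up and the
query type is incidental):

    $ saquery --smkey 0x12345 -p        # SM_Key given on the command line
    $ saquery --smkey x -p              # non-numeric, so saquery prompts
    SM_Key: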
Signed-off-by: Sasha Khapyorsky
---
SM_Key value prompting was added as an addition to v1 of the patch.

 infiniband-diags/src/saquery.c |   20 +++++++++++++++++---
 1 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index 3d4ab24..d3875fc 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -37,6 +37,7 @@
  *
  */

+#include <ctype.h>
 #include
 #include
 #include
@@ -69,6 +70,7 @@ char *argv0 = "saquery";

 static char *node_name_map_file = NULL;
 static nn_map_t *node_name_map = NULL;
+static ib_net64_t smkey = OSM_DEFAULT_SA_KEY;

 /**
  * Declare some globals because I don't want this to be too complex.
@@ -730,7 +732,7 @@ get_all_records(osm_bind_handle_t bind_handle, int trusted)
 {
 	return get_any_records(bind_handle, query_id, 0, 0, NULL, attr_offset,
-			       trusted ? OSM_DEFAULT_SA_KEY : 0);
+			       trusted ? smkey : 0);
 }

 /**
@@ -1254,8 +1256,7 @@ print_pkey_tbl_records(const struct query_cmd *q, osm_bind_handle_t bind_handle,

 	status = get_any_records(bind_handle, IB_MAD_ATTR_PKEY_TBL_RECORD, 0,
 				 comp_mask, &pktr,
-				 ib_get_attr_offset(sizeof(pktr)),
-				 OSM_DEFAULT_SA_KEY);
+				 ib_get_attr_offset(sizeof(pktr)), smkey);
 	if (status != IB_SUCCESS)
 		return status;

@@ -1411,6 +1412,10 @@ usage(void)
 		"IPv6 format\n");
 	fprintf(stderr, "   -C specify the SA query HCA\n");
 	fprintf(stderr, "   -P specify the SA query port\n");
+	fprintf(stderr, "   --smkey specify SM_Key value for the query."
+		" If non-numeric value \n"
+		"   (like 'x') is specified then "
+		"saquery will prompt for a value\n");
 	fprintf(stderr, "   -t | --timeout specify the SA query "
 		"response timeout (default %u msec)\n",
 		DEFAULT_SA_TIMEOUT_MS);
@@ -1466,6 +1471,7 @@ main(int argc, char **argv)
 		{"sgid-to-dgid", 1, 0, 2},
 		{"timeout", 1, 0, 't'},
 		{"node-name-map", 1, 0, 3},
+		{"smkey", 1, 0, 4},
 		{ }
 	};

@@ -1512,6 +1518,14 @@ main(int argc, char **argv)
 		case 3:
 			node_name_map_file = strdup(optarg);
 			break;
+		case 4:
+			if (!isxdigit(*optarg) &&
+			    !(optarg = getpass("SM_Key: "))) {
+				fprintf(stderr, "cannot get SM_Key\n");
+				usage();
+			}
+			smkey = cl_hton64(strtoull(optarg, NULL, 0));
+			break;
 		case 'p':
 			query_type = IB_MAD_ATTR_PATH_RECORD;
 			break;
--
1.5.5.1.178.g1f811

From hrosenstock at xsigo.com  Sat May 31 10:31:54 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Sat, 31 May 2008 10:31:54 -0700
Subject: [ofa-general] Re: [PATCHv3] OpenSM/osm_sa_mcmember_record.c: Collapse all scopes when consolidating IPv6 SNM
In-Reply-To: <20080531171315.GN22418@sashak.voltaire.com>
References: <1212178934.17997.244.camel@hrosenstock-ws.xsigo.com>
	<20080531153730.GM22418@sashak.voltaire.com>
	<1212248925.17997.282.camel@hrosenstock-ws.xsigo.com>
	<20080531171315.GN22418@sashak.voltaire.com>
Message-ID: <1212255114.17997.301.camel@hrosenstock-ws.xsigo.com>

On Sat, 2008-05-31 at 20:13 +0300, Sasha Khapyorsky wrote:
> On 08:48 Sat 31 May , Hal Rosenstock wrote:
> >
> > No, PREFIX_MASK is being changed to 0xff10ffff00000000ULL to:
> > 1. eliminate the scope part, and
> > 2. get the entire signature
> > for subsequent comparison.
> >
> > > why is PREFIX_SIGNATURE needed? Am I missing something?
> >
> > Because PREFIX_MASK masks the bits to compare and SIGNATURE is what it
> > should be after masking.
>
> I see. Then shouldn't it (the mask) be 0xff10ffff0000ffffULL?

Yes; updated patch to follow shortly. Similarly for INT_ID_MASK, and
I'll generate a separate patch for that.
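To spell out the mask/signature distinction with a standalone example
(host byte order, outside of OpenSM; the constants mirror the updated
patch):

#include <assert.h>
#include <stdint.h>

#define PREFIX_MASK	 0xff10ffff0000ffffULL
#define PREFIX_SIGNATURE 0xff10601b00000000ULL

/* The scope nibble and the P_Key bytes are masked out; what remains
 * must match the SNM signature exactly, including the low 16 bits,
 * which must be 0. */
static int is_ipv6_snm_prefix(uint64_t prefix)
{
	return (prefix & PREFIX_MASK) == PREFIX_SIGNATURE;
}

int main(void)
{
	assert(is_ipv6_snm_prefix(0xff12601b80010000ULL));  /* scope 2, P_Key 0x8001 */
	assert(!is_ipv6_snm_prefix(0xff12401b80010000ULL)); /* wrong signature */
	return 0;
}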
-- Hal

> Sasha

From hrosenstock at xsigo.com  Sat May 31 10:31:57 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Sat, 31 May 2008 10:31:57 -0700
Subject: [ofa-general] [PATCHv5] OpenSM/osm_sa_mcmember_record.c: Collapse all scopes when consolidating IPv6 SNM
Message-ID: <1212255117.17997.303.camel@hrosenstock-ws.xsigo.com>

OpenSM/osm_sa_mcmember_record.c: In __search_mgrp_by_mgid, collapse all
scopes when consolidating IPv6 SNM

v5 changes PREFIX_MASK so low 16 bits are validated as 0
v4 fixes the original PREFIX_MASK as Sasha commented
v3 compares masked prefixes rather than actual prefix in MCMemberRecord
MGID and MGRP
v2 had a minor comment change

Signed-off-by: Hal Rosenstock

diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c
index 040068f..c14632d 100644
--- a/opensm/opensm/osm_sa_mcmember_record.c
+++ b/opensm/opensm/osm_sa_mcmember_record.c
@@ -1083,19 +1083,21 @@ __search_mgrp_by_mgid(IN cl_map_item_t * const p_map_item, IN void *context)

 	if (sa->p_subn->opt.consolidate_ipv6_snm_req) {
 		/* Special Case IPv6 Solicited Node Multicast (SNM) addresses */
-		/* 0xff12601bXXXX0000 : 0x00000001ffYYYYYY */
-		/* Where XXXX is the P_Key and
+		/* 0xff1Z601bXXXX0000 : 0x00000001ffYYYYYY */
+		/* Where Z is the scope, XXXX is the P_Key, and
		 * YYYYYY is the last 24 bits of the port guid */
-#define PREFIX_MASK (0xff12601b00000000ULL)
+#define PREFIX_MASK (0xff10ffff0000ffffULL)
+#define PREFIX_SIGNATURE (0xff10601b00000000ULL)
 #define INT_ID_MASK (0x00000001ff000000ULL)
		uint64_t g_prefix = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix);
		uint64_t g_interface_id = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.interface_id);
		uint64_t rcv_prefix = cl_ntoh64(p_recvd_mgid->unicast.prefix);
		uint64_t rcv_interface_id = cl_ntoh64(p_recvd_mgid->unicast.interface_id);

-		if ((rcv_prefix & PREFIX_MASK) == PREFIX_MASK &&
+		if ((rcv_prefix & PREFIX_MASK) == PREFIX_SIGNATURE &&
		    (rcv_interface_id & INT_ID_MASK) == INT_ID_MASK &&
-		    g_prefix == rcv_prefix &&
+		    (g_prefix & PREFIX_MASK) ==
+		    (rcv_prefix & PREFIX_MASK) &&
		    (g_interface_id & INT_ID_MASK) ==
		    (rcv_interface_id & INT_ID_MASK)) {
			OSM_LOG(sa->p_log, OSM_LOG_INFO,

From sashak at voltaire.com  Sat May 31 12:25:51 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 22:25:51 +0300
Subject: [ofa-general] Re: [PATCHv5] OpenSM/osm_sa_mcmember_record.c: Collapse all scopes when consolidating IPv6 SNM
In-Reply-To: <1212255117.17997.303.camel@hrosenstock-ws.xsigo.com>
References: <1212255117.17997.303.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531192551.GP22418@sashak.voltaire.com>

On 10:31 Sat 31 May , Hal Rosenstock wrote:
> OpenSM/osm_sa_mcmember_record.c: In __search_mgrp_by_mgid, collapse all
> scopes when consolidating IPv6 SNM
>
> v5 changes PREFIX_MASK so low 16 bits are validated as 0
> v4 fixes the original PREFIX_MASK as Sasha commented
> v3 compares masked prefixes rather than actual prefix in MCMemberRecord
> MGID and MGRP
> v2 had a minor comment change
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.
Sasha

From sashak at voltaire.com  Sat May 31 14:49:19 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 1 Jun 2008 00:49:19 +0300
Subject: [ofa-general] Re: OSM_DEFAULT_SM_KEY byte order
In-Reply-To: <1211467961.18236.178.camel@hrosenstock-ws.xsigo.com>
References: <20080522140916.GC32128@sashak.voltaire.com>
	<1211467961.18236.178.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531214919.GS22418@sashak.voltaire.com>

On 07:52 Thu 22 May , Hal Rosenstock wrote:
> > +#define OSM_DEFAULT_SM_KEY CL_HTON64(1)
> > /********/
> > /****s* OpenSM: Base/OSM_DEFAULT_LMC
> > * NAME
> >
> >
> > , but sort of backward compatibility (currently I know that
> > OSM_DEFAULT_SM_KEY is used with 'osmtest' and 'saquery') could be lost.
> > Is this so important? Ideas?
>
> IMO yes, I think this breaks both backward compatibility and what was
> actually observed from some other SMs during interop testing.
>
> I agree it needs fixing but I think the proper thing is probably more
> like:
>
> #define OSM_DEFAULT_SM_KEY CL_HTON64(0x0100000000000000);

Using a value like this, we will break big endian machines, where the
original value is correct. I think that '1' in network byte order is
better (especially in the long term) - it is a more "native" non-zero
value. Also, I found at least one vendor SM which uses 1 as the default
SM key in network byte order (and this is expected; I doubt somebody
uses 0x0100000000000000).

Our own backward compatibility could be solved by configuring the SM key
(this will work with OpenSM and saquery).

Any other opinions?

Sasha

From sashak at voltaire.com  Sat May 31 15:13:22 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 1 Jun 2008 01:13:22 +0300
Subject: [ofa-general] [PATCH] opensm: remove osm_log reference from osm_mad_pool object
Message-ID: <20080531221322.GT22418@sashak.voltaire.com>

This removes the osm_log reference from the osm_mad_pool object, as well
as some noisy debug prints there. Recently osm_mad_pool was reworked to
use a plain malloc allocator instead of complib's cl_qlock_pool, so the
importance of those messages was reduced.

Signed-off-by: Sasha Khapyorsky
---
 opensm/include/opensm/osm_mad_pool.h |   12 +-------
 opensm/opensm/libopensm.ver          |    2 +-
 opensm/opensm/osm_mad_pool.c         |   50 +++------------------------------
 opensm/opensm/osm_opensm.c           |    2 +-
 4 files changed, 8 insertions(+), 58 deletions(-)

diff --git a/opensm/include/opensm/osm_mad_pool.h b/opensm/include/opensm/osm_mad_pool.h
index b8421b9..e3234f4 100644
--- a/opensm/include/opensm/osm_mad_pool.h
+++ b/opensm/include/opensm/osm_mad_pool.h
@@ -53,7 +53,6 @@
 #include
 #include
 #include
-#include <opensm/osm_log.h>

 #ifdef __cplusplus
 #  define BEGIN_C_DECLS extern "C" {
@@ -95,14 +94,10 @@ BEGIN_C_DECLS
 * SYNOPSIS
 */
 typedef struct _osm_mad_pool {
-	osm_log_t *p_log;
 	atomic32_t mads_out;
 } osm_mad_pool_t;
 /*
 * FIELDS
-*	p_log
-*		Pointer to the log object.
-*
 *	mads_out
 *		Running total of the number of MADs outstanding.
 *
@@ -176,17 +171,12 @@ void osm_mad_pool_destroy(IN osm_mad_pool_t * const p_pool);
 *
 * SYNOPSIS
 */
-ib_api_status_t osm_mad_pool_init(IN osm_mad_pool_t * const p_pool,
-				  IN osm_log_t * const p_log);
+ib_api_status_t osm_mad_pool_init(IN osm_mad_pool_t * const p_pool);
 /*
 * PARAMETERS
 *	p_pool
 *		[in] Pointer to an osm_mad_pool_t object to initialize.
 *
-*	p_log
-*		[in] Pointer to the log object.
-*
-*
 * RETURN VALUES
 *	CL_SUCCESS if the MAD Pool was initialized successfully.
 *

diff --git a/opensm/opensm/libopensm.ver b/opensm/opensm/libopensm.ver
index 3324b1a..0d5e9d4 100644
--- a/opensm/opensm/libopensm.ver
+++ b/opensm/opensm/libopensm.ver
@@ -6,4 +6,4 @@
 # API_REV - advance on any added API
 # RUNNING_REV - advance any change to the vendor files
 # AGE - number of backward versions the API still supports
-LIBVERSION=2:1:0
+LIBVERSION=2:2:0

diff --git a/opensm/opensm/osm_mad_pool.c b/opensm/opensm/osm_mad_pool.c
index 9b3812f..a7769d4 100644
--- a/opensm/opensm/osm_mad_pool.c
+++ b/opensm/opensm/osm_mad_pool.c
@@ -53,7 +53,6 @@
 #include
 #include
 #include
-#include <opensm/osm_log.h>
 #include

 /**********************************************************************
@@ -74,14 +73,10 @@ void osm_mad_pool_destroy(IN osm_mad_pool_t * const p_pool)

 /**********************************************************************
 **********************************************************************/
-ib_api_status_t
-osm_mad_pool_init(IN osm_mad_pool_t * const p_pool, IN osm_log_t * const p_log)
+ib_api_status_t osm_mad_pool_init(IN osm_mad_pool_t * const p_pool)
 {
-	OSM_LOG_ENTER(p_log);
+	p_pool->mads_out = 0;

-	p_pool->p_log = p_log;
-
-	OSM_LOG_EXIT(p_log);
 	return IB_SUCCESS;
 }

 /**********************************************************************
 **********************************************************************/
@@ -95,8 +90,6 @@ osm_madw_t *osm_mad_pool_get(IN osm_mad_pool_t * const p_pool,
 	osm_madw_t *p_madw;
 	ib_mad_t *p_mad;

-	OSM_LOG_ENTER(p_pool->p_log);
-
 	CL_ASSERT(h_bind != OSM_BIND_INVALID_HANDLE);
 	CL_ASSERT(total_size);

@@ -104,11 +97,8 @@ osm_madw_t *osm_mad_pool_get(IN osm_mad_pool_t * const p_pool,
 	   First, acquire a mad wrapper from the mad wrapper pool.
 	 */
 	p_madw = malloc(sizeof(*p_madw));
-	if (p_madw == NULL) {
-		OSM_LOG(p_pool->p_log, OSM_LOG_ERROR, "ERR 0703: "
-			"Unable to acquire MAD wrapper object\n");
+	if (p_madw == NULL)
 		goto Exit;
-	}

 	osm_madw_init(p_madw, h_bind, total_size, p_mad_addr);

@@ -117,9 +107,6 @@ osm_madw_t *osm_mad_pool_get(IN osm_mad_pool_t * const p_pool,
 	 */
 	p_mad = osm_vendor_get(h_bind, total_size, &p_madw->vend_wrap);
 	if (p_mad == NULL) {
-		OSM_LOG(p_pool->p_log, OSM_LOG_ERROR, "ERR 0704: "
-			"Unable to acquire wire MAD\n");
-
 		/* Don't leak wrappers! */
 		free(p_madw);
 		p_madw = NULL;

@@ -132,13 +119,8 @@ osm_madw_t *osm_mad_pool_get(IN osm_mad_pool_t * const p_pool,
 	 */
 	osm_madw_set_mad(p_madw, p_mad);

-	OSM_LOG(p_pool->p_log, OSM_LOG_DEBUG,
-		"Acquired p_madw = %p, p_mad = %p, size = %u\n",
-		p_madw, p_madw->p_mad, total_size);
-
 Exit:
-	OSM_LOG_EXIT(p_pool->p_log);
-	return (p_madw);
+	return p_madw;
 }

 /**********************************************************************
@@ -151,8 +133,6 @@ osm_madw_t *osm_mad_pool_get_wrapper(IN osm_mad_pool_t * const p_pool,
 {
 	osm_madw_t *p_madw;

-	OSM_LOG_ENTER(p_pool->p_log);
-
 	CL_ASSERT(h_bind != OSM_BIND_INVALID_HANDLE);
 	CL_ASSERT(total_size);
 	CL_ASSERT(p_mad);

@@ -161,11 +141,8 @@ osm_madw_t *osm_mad_pool_get_wrapper(IN osm_mad_pool_t * const p_pool,
 	   First, acquire a mad wrapper from the mad wrapper pool.
 	 */
 	p_madw = malloc(sizeof(*p_madw));
-	if (p_madw == NULL) {
-		OSM_LOG(p_pool->p_log, OSM_LOG_ERROR, "ERR 0705: "
-			"Unable to acquire MAD wrapper object\n");
+	if (p_madw == NULL)
 		goto Exit;
-	}

 	/*
 	   Finally, initialize the wrapper object.
@@ -174,12 +151,7 @@ osm_madw_t *osm_mad_pool_get_wrapper(IN osm_mad_pool_t * const p_pool,
 	osm_madw_init(p_madw, h_bind, total_size, p_mad_addr);
 	osm_madw_set_mad(p_madw, p_mad);

-	OSM_LOG(p_pool->p_log, OSM_LOG_DEBUG,
-		"Acquired p_madw = %p, p_mad = %p size = %u\n",
-		p_madw, p_madw->p_mad, total_size);
-
 Exit:
-	OSM_LOG_EXIT(p_pool->p_log);
 	return (p_madw);
 }

@@ -189,19 +161,14 @@ osm_madw_t *osm_mad_pool_get_wrapper_raw(IN osm_mad_pool_t * const p_pool)
 {
 	osm_madw_t *p_madw;

-	OSM_LOG_ENTER(p_pool->p_log);
-
 	p_madw = malloc(sizeof(*p_madw));
 	if (!p_madw)
 		return NULL;

-	OSM_LOG(p_pool->p_log, OSM_LOG_DEBUG, "Getting p_madw = %p\n", p_madw);
-
 	osm_madw_init(p_madw, 0, 0, 0);
 	osm_madw_set_mad(p_madw, 0);
 	cl_atomic_inc(&p_pool->mads_out);

-	OSM_LOG_EXIT(p_pool->p_log);
 	return (p_madw);
 }

@@ -210,13 +177,8 @@ osm_madw_t *osm_mad_pool_get_wrapper_raw(IN osm_mad_pool_t * const p_pool)
 void
 osm_mad_pool_put(IN osm_mad_pool_t * const p_pool, IN osm_madw_t * const p_madw)
 {
-	OSM_LOG_ENTER(p_pool->p_log);
-
 	CL_ASSERT(p_madw);

-	OSM_LOG(p_pool->p_log, OSM_LOG_DEBUG,
-		"Releasing p_madw = %p, p_mad = %p\n", p_madw, p_madw->p_mad);
-
 	/* First, return the wire mad to the pool */
@@ -174,12 +151,7 @@ osm_madw_t *osm_mad_pool_get_wrapper(IN osm_mad_pool_t * const p_pool, osm_madw_init(p_madw, h_bind, total_size, p_mad_addr); osm_madw_set_mad(p_madw, p_mad); - OSM_LOG(p_pool->p_log, OSM_LOG_DEBUG, - "Acquired p_madw = %p, p_mad = %p size = %u\n", - p_madw, p_madw->p_mad, total_size); - Exit: - OSM_LOG_EXIT(p_pool->p_log); return (p_madw); } @@ -189,19 +161,14 @@ osm_madw_t *osm_mad_pool_get_wrapper_raw(IN osm_mad_pool_t * const p_pool) { osm_madw_t *p_madw; - OSM_LOG_ENTER(p_pool->p_log); - p_madw = malloc(sizeof(*p_madw)); if (!p_madw) return NULL; - OSM_LOG(p_pool->p_log, OSM_LOG_DEBUG, "Getting p_madw = %p\n", p_madw); - osm_madw_init(p_madw, 0, 0, 0); osm_madw_set_mad(p_madw, 0); cl_atomic_inc(&p_pool->mads_out); - OSM_LOG_EXIT(p_pool->p_log); return (p_madw); } @@ -210,13 +177,8 @@ osm_madw_t *osm_mad_pool_get_wrapper_raw(IN osm_mad_pool_t * const p_pool) void osm_mad_pool_put(IN osm_mad_pool_t * const p_pool, IN osm_madw_t * const p_madw) { - OSM_LOG_ENTER(p_pool->p_log); - CL_ASSERT(p_madw); - OSM_LOG(p_pool->p_log, OSM_LOG_DEBUG, - "Releasing p_madw = %p, p_mad = %p\n", p_madw, p_madw->p_mad); - /* First, return the wire mad to the pool */ @@ -228,6 +190,4 @@ osm_mad_pool_put(IN osm_mad_pool_t * const p_pool, IN osm_madw_t * const p_madw) */ free(p_madw); cl_atomic_dec(&p_pool->mads_out); - - OSM_LOG_EXIT(p_pool->p_log); } diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c index aa7ded3..abe55b5 100644 --- a/opensm/opensm/osm_opensm.c +++ b/opensm/opensm/osm_opensm.c @@ -314,7 +314,7 @@ osm_opensm_init(IN osm_opensm_t * const p_osm, goto Exit; } - status = osm_mad_pool_init(&p_osm->mad_pool, &p_osm->log); + status = osm_mad_pool_init(&p_osm->mad_pool); if (status != IB_SUCCESS) goto Exit; -- 1.5.5.1.178.g1f811 From rdreier at cisco.com Sat May 31 22:46:05 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 31 May 2008 22:46:05 -0700 Subject: [ofa-general] Re: [PATCH v3 08/13] QLogic VNIC: sysfs interface implementation for the driver References: <20080529095126.9943.84692.stgit@localhost.localdomain> <20080529095754.9943.27936.stgit@localhost.localdomain> <20080529103003.010c4a08@extreme> <20080529174805.GA10903@kroah.com> Message-ID: > And yes, multiple values per sysfs file are not allowed, sorry, please > change this. If you need to configure your device through an interface > like this, consider using configfs instead, that is what it is there > for. Makes sense... I know that the SRP initiator uses the method of multiple 'token=' entries passed into sysfs, but the excuse is that SRP was merged before configfs. I'll also have a look at adding configfs support to SRP and deprecating the current sysfs method... - R.