From leonida at voltaire.com Thu Jun 1 01:50:14 2006 From: leonida at voltaire.com (Leonid Arsh) Date: Thu, 01 Jun 2006 11:50:14 +0300 Subject: [openib-general][PATCH 1 of 3] repost: Client Reregister support for kernel space In-Reply-To: References: <20060509060958.GA482@voltaire.com> Message-ID: <447EAA46.9080905@voltaire.com> Thank you! Looks fine. I don't see anything that can harm. 
The only thing I can't be completely sure about is the change in ipath_mad.c - I'm less familiar with that code and didn't check it myself. Anyway, the change seems fine too. Roland Dreier wrote: > OK, I cleaned up your patches and applied the following to my > for-2.6.18 tree. I think all of my changes were fixes and/or > cleanups, but you may want to check that I didn't break anything -- > I'm sending the 5 patches I ended up with to the list. > > - R. > 
From rkuchimanchi at silverstorm.com Thu Jun 1 07:12:13 2006 From: rkuchimanchi at silverstorm.com (Ramchandra K) Date: Thu, 01 Jun 2006 19:42:13 +0530 Subject: [openib-general] Re: [PATCH] SRP : Use correct port identifier format according to target io_class In-Reply-To: References: Message-ID: <1149171133.7588.45.camel@Prawra.gs-lab.com> On Mon, 2006-05-29 at 10:07 -0700, Roland Dreier wrote: > Overall seems OK. Some comments: I am resending the patch with the modifications you suggested. > > +#define SRP_REV10_IO_CLASS 0xFF00 > > +#define SRP_REV16A_IO_CLASS 0x0100 > > I think these should be in an enum in <scsi/srp.h>, since they're > generic constants from the SRP spec. > I have defined the IO class values as an enum in <scsi/srp.h>. I am sending this as a separate patch. I am not sure if those changes are to be submitted here, since srp.h is not in the Open Fabrics code base. But both the patches have to be applied together for the SRP code to compile. Signed-off-by: Ramachandra K Index: infiniband/ulp/srp/ib_srp.c =================================================================== --- infiniband/ulp/srp/ib_srp.c (revision 7615) +++ infiniband/ulp/srp/ib_srp.c (working copy) @@ -321,8 +321,33 @@ req->priv.req_it_iu_len = cpu_to_be32(srp_max_iu_len); req->priv.req_buf_fmt = cpu_to_be16(SRP_BUF_FORMAT_DIRECT | SRP_BUF_FORMAT_INDIRECT); - memcpy(req->priv.initiator_port_id, target->srp_host->initiator_port_id, 16); /* + * Older targets conforming to Rev 10 of the SRP specification + * use the port identifier format which is + * + * lower 8 bytes : GUID + * upper 8 bytes : extension + * + * Whereas according to the new SRP specification (Rev 16a), the + * port identifier format is + * + * lower 8 bytes : extension + * upper 8 bytes : GUID + * + * So check the IO class of the target to decide which format to use. 
+ */ + + /* If its Rev 10, flip the initiator port id fields */ + if (target->io_class == SRP_REV10_IO_CLASS) { + memcpy(req->priv.initiator_port_id, + target->srp_host->initiator_port_id + 8 , 8); + memcpy(req->priv.initiator_port_id + 8, + target->srp_host->initiator_port_id, 8); + } else { + memcpy(req->priv.initiator_port_id, + target->srp_host->initiator_port_id, 16); + } + /* * Topspin/Cisco SRP targets will reject our login unless we * zero out the first 8 bytes of our initiator port ID. The * second 8 bytes must be our local node GUID, but we always @@ -334,8 +359,13 @@ (unsigned long long) be64_to_cpu(target->ioc_guid)); memset(req->priv.initiator_port_id, 0, 8); } - memcpy(req->priv.target_port_id, &target->id_ext, 8); - memcpy(req->priv.target_port_id + 8, &target->ioc_guid, 8); + if (target->io_class == SRP_REV10_IO_CLASS) { + memcpy(req->priv.target_port_id, &target->ioc_guid, 8); + memcpy(req->priv.target_port_id + 8, &target->id_ext, 8); + } else { + memcpy(req->priv.target_port_id, &target->id_ext, 8); + memcpy(req->priv.target_port_id + 8, &target->ioc_guid, 8); + } status = ib_send_cm_req(target->cm_id, &req->param); @@ -1513,6 +1543,7 @@ SRP_OPT_SERVICE_ID = 1 << 4, SRP_OPT_MAX_SECT = 1 << 5, SRP_OPT_MAX_CMD_PER_LUN = 1 << 6, + SRP_OPT_IO_CLASS = 1 << 7, SRP_OPT_ALL = (SRP_OPT_ID_EXT | SRP_OPT_IOC_GUID | SRP_OPT_DGID | @@ -1528,6 +1559,7 @@ { SRP_OPT_SERVICE_ID, "service_id=%s" }, { SRP_OPT_MAX_SECT, "max_sect=%d" }, { SRP_OPT_MAX_CMD_PER_LUN, "max_cmd_per_lun=%d" }, + { SRP_OPT_IO_CLASS, "io_class=%x" }, { SRP_OPT_ERR, NULL } }; @@ -1611,7 +1643,19 @@ } target->scsi_host->cmd_per_lun = min(token, SRP_SQ_SIZE); break; - + case SRP_OPT_IO_CLASS: + if (match_hex(args, &token)) { + printk(KERN_WARNING PFX "bad IO class parameter '%s' \n", p); + goto out; + } + if (token == SRP_REV10_IO_CLASS || token == SRP_REV16A_IO_CLASS) + target->io_class = token; + else + printk(KERN_WARNING PFX "unknown IO class parameter value" + " %x specified. Use %x or %x. 
Defaulting to IO class %x\n", + token, SRP_REV10_IO_CLASS, SRP_REV16A_IO_CLASS, + SRP_REV16A_IO_CLASS); + break; default: printk(KERN_WARNING PFX "unknown parameter or missing value " "'%s' in target creation request\n", p); @@ -1654,6 +1698,7 @@ target = host_to_target(target_host); memset(target, 0, sizeof *target); + target->io_class = SRP_REV16A_IO_CLASS; target->scsi_host = target_host; target->srp_host = host; Index: infiniband/ulp/srp/ib_srp.h =================================================================== --- infiniband/ulp/srp/ib_srp.h (revision 7615) +++ infiniband/ulp/srp/ib_srp.h (working copy) @@ -122,6 +122,7 @@ __be64 id_ext; __be64 ioc_guid; __be64 service_id; + __be16 io_class; struct srp_host *srp_host; struct Scsi_Host *scsi_host; char target_name[32]; From rkuchimanchi at silverstorm.com Thu Jun 1 07:12:25 2006 From: rkuchimanchi at silverstorm.com (Ramchandra K) Date: Thu, 01 Jun 2006 19:42:25 +0530 Subject: [openib-general] [PATCH] Define IO class values in Message-ID: <1149171145.7588.46.camel@Prawra.gs-lab.com> Hi Roland, This patch adds IO class values of SRP Rev 10 and Rev 16a to aid in deciding the port identifier format to be used. Regards, Ram Signed-off-by: Ramachandra K --- orig/include/scsi/srp.h 2006-06-01 00:45:13.000000000 -0400 +++ wc/include/scsi/srp.h 2006-06-01 00:58:10.000000000 -0400 @@ -44,6 +44,11 @@ #include enum { + SRP_REV10_IO_CLASS = 0xFF00, + SRP_REV16A_IO_CLASS = 0x0100 +}; + +enum { SRP_LOGIN_REQ = 0x00, SRP_TSK_MGMT = 0x01, SRP_CMD = 0x02, From rkuchimanchi at silverstorm.com Thu Jun 1 07:12:32 2006 From: rkuchimanchi at silverstorm.com (Ramchandra K) Date: Thu, 01 Jun 2006 19:42:32 +0530 Subject: [openib-general] [PATCH] (Resend) SRPTOOLS: print out the target io_class in ibsrpdm Message-ID: <1149171152.7588.47.camel@Prawra.gs-lab.com> Hi Roland, Resending the patch that prints out the target io class value in ibsrpdm to aid in specifying the target creation parameter - io_class. 
Regards, Ram Signed-off-by: Ramachandra K Index: userspace/srptools/src/srp-dm.c =================================================================== --- userspace/srptools/src/srp-dm.c (revision 7617) +++ userspace/srptools/src/srp-dm.c (working copy) @@ -398,6 +398,7 @@ (unsigned long long) ntohll(ioc_prof.guid)); pr_human(" vendor ID: %06x\n", ntohl(ioc_prof.vendor_id) >> 8); pr_human(" device ID: %06x\n", ntohl(ioc_prof.device_id)); + pr_human(" IO class : %hx\n", ntohs(ioc_prof.io_class)); pr_human(" ID: %s\n", ioc_prof.id); pr_human(" service entries: %d\n", ioc_prof.service_entries); @@ -429,11 +430,13 @@ "ioc_guid=%016llx," "dgid=%016llx%016llx," "pkey=ffff," + "io_class=%hx," "service_id=%016llx\n", id_ext, (unsigned long long) ntohll(ioc_prof.guid), (unsigned long long) subnet_prefix, (unsigned long long) guid, + (unsigned short) ntohs(ioc_prof.io_class), (unsigned long long) ntohll(svc_entries.service[k].id)); } } From rdreier at cisco.com Thu Jun 1 07:29:32 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 01 Jun 2006 07:29:32 -0700 Subject: [openib-general] [PATCH 4/5] IB/mthca: Add client reregister event generation In-Reply-To: <447E7C9E.1060907@mellanox.co.il> (Eitan Zahavi's message of "Thu, 01 Jun 2006 08:35:26 +0300") References: <20060531223205.10506.51241.stgit@localhost.localdomain> <20060531223215.10506.28838.stgit@localhost.localdomain> <447E7C9E.1060907@mellanox.co.il> Message-ID: Eitan> Hi Roland, Is there a reason why the LID_CHANGE event is Eitan> happening even if the LID did not change? It was used as a proxy for client reregister-like events before client reregister existed. - R. 
From swise at opengridcomputing.com Thu Jun 1 10:00:33 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 01 Jun 2006 12:00:33 -0500 Subject: [openib-general] Re: [PATCH 1/2] iWARP Connection Manager. In-Reply-To: <447E1720.7000307@ichips.intel.com> References: <20060531182650.3308.81538.stgit@stevo-desktop> <20060531182652.3308.1244.stgit@stevo-desktop> <447E1720.7000307@ichips.intel.com> Message-ID: <1149181233.31610.34.camel@stevo-desktop> On Wed, 2006-05-31 at 15:22 -0700, Sean Hefty wrote: > Steve Wise wrote: > > +/* > > + * Release a reference on cm_id. If the last reference is being removed > > + * and iw_destroy_cm_id is waiting, wake up the waiting thread. > > + */ > > +static int iwcm_deref_id(struct iwcm_id_private *cm_id_priv) > > +{ > > + int ret = 0; > > + > > + BUG_ON(atomic_read(&cm_id_priv->refcount)==0); > > + if (atomic_dec_and_test(&cm_id_priv->refcount)) { > > + BUG_ON(!list_empty(&cm_id_priv->work_list)); > > + if (waitqueue_active(&cm_id_priv->destroy_wait)) { > > + BUG_ON(cm_id_priv->state != IW_CM_STATE_DESTROYING); > > + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, > > + &cm_id_priv->flags)); > > + ret = 1; > > + wake_up(&cm_id_priv->destroy_wait); > > We recently changed the RDMA CM, IB CM, and a couple of other modules from using > wait objects to completions. This avoids a race condition between decrementing > the reference count, which allows destruction to proceed, and calling wake_up on > a freed cm_id. My guess is that you may need to do the same. > Good catch. Yes, the IW CM suffers from the same race condition. 
I'll change this to use completions... > Can you also explain the use of the return value here? It's ignored below in > rem_ref() and destroy_cm_id(). > The return value is supposed to indicate whether this call to deref _may_ have resulted in waking up another thread and the cm_id being freed. It's used in cm_work_handler(), in conjunction with setting the IWCM_F_CALLBACK_DESTROY flag, to know whether the cm_id needs to be freed on the callback path. > > +static void add_ref(struct iw_cm_id *cm_id) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + atomic_inc(&cm_id_priv->refcount); > > +} > > + > > +static void rem_ref(struct iw_cm_id *cm_id) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + iwcm_deref_id(cm_id_priv); > > +} > > + > > > +/* > > + * CM_ID <-- CLOSING > > + * > > + * Block if a passive or active connection is currently being processed. Then > > + * process the event as follows: > > + * - If we are ESTABLISHED, move to CLOSING and modify the QP state > > + * based on the abrupt flag > > + * - If the connection is already in the CLOSING or IDLE state, the peer is > > + * disconnecting concurrently with us and we've already seen the > > + * DISCONNECT event -- ignore the request and return 0 > > + * - Disconnect on a listening endpoint returns -EINVAL > > + */ > > +int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + unsigned long flags; > > + int ret = 0; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + /* Wait if we're currently in a connect or accept downcall */ > > + wait_event(cm_id_priv->connect_wait, > > + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); > > Am I understanding this check correctly? 
You're checking to see if the user has > called iw_cm_disconnect() at the same time that they called iw_cm_connect() or > iw_cm_accept(). Are connect / accept blocking, or are you just waiting for an > event? The CM must wait for the low level provider to finish a connect() or accept() operation before telling the low level provider to disconnect via modifying the iwarp QP. Regardless of whether they block, this disconnect can happen concurrently with the connect/accept so we need to hold the disconnect until the connect/accept completes. > > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_ESTABLISHED: > > + cm_id_priv->state = IW_CM_STATE_CLOSING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + if (cm_id_priv->qp) { /* QP could be for user-mode client */ > > + if (abrupt) > > + ret = iwcm_modify_qp_err(cm_id_priv->qp); > > + else > > + ret = iwcm_modify_qp_sqd(cm_id_priv->qp); > > + /* > > + * If both sides are disconnecting the QP could > > + * already be in ERR or SQD states > > + */ > > + ret = 0; > > + } > > + else > > + ret = -EINVAL; > > + break; > > + case IW_CM_STATE_LISTEN: > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = -EINVAL; > > + break; > > + case IW_CM_STATE_CLOSING: > > + /* remote peer closed first */ > > + case IW_CM_STATE_IDLE: > > + /* accept or connect returned !0 */ > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + break; > > + case IW_CM_STATE_CONN_RECV: > > + /* > > + * App called disconnect before/without calling accept after > > + * connect_request event delivered. 
> > + */ > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + break; > > + case IW_CM_STATE_CONN_SENT: > > + /* Can only get here if wait above fails */ > > + default: > > + BUG_ON(1); > > + } > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_disconnect); > > +static void destroy_cm_id(struct iw_cm_id *cm_id) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + unsigned long flags; > > + int ret; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + /* Wait if we're currently in a connect or accept downcall. A > > + * listening endpoint should never block here. */ > > + wait_event(cm_id_priv->connect_wait, > > + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); > > Same question/comment as above. > Same answer. > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_LISTEN: > > + cm_id_priv->state = IW_CM_STATE_DESTROYING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + /* destroy the listening endpoint */ > > + ret = cm_id->device->iwcm->destroy_listen(cm_id); > > + break; > > + case IW_CM_STATE_ESTABLISHED: > > + cm_id_priv->state = IW_CM_STATE_DESTROYING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + /* Abrupt close of the connection */ > > + (void)iwcm_modify_qp_err(cm_id_priv->qp); > > + break; > > + case IW_CM_STATE_IDLE: > > + case IW_CM_STATE_CLOSING: > > + cm_id_priv->state = IW_CM_STATE_DESTROYING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + break; > > + case IW_CM_STATE_CONN_RECV: > > + /* > > + * App called destroy before/without calling accept after > > + * receiving connection request event notification. 
> > + */ > > + cm_id_priv->state = IW_CM_STATE_DESTROYING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + break; > > + case IW_CM_STATE_CONN_SENT: > > + case IW_CM_STATE_DESTROYING: > > + default: > > + BUG_ON(1); > > + break; > > + } > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > As an alternative, you could hold the lock from above, and let the LISTEN / > ESTABLISHED state checks release and reacquire. > Yes, perhaps that's cleaner. > > + if (cm_id_priv->qp) { > > + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); > > + cm_id_priv->qp = NULL; > > + } > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + > > + (void)iwcm_deref_id(cm_id_priv); > > +} > > + > > +/* > > + * This function is only called by the application thread and cannot > > + * be called by the event thread. The function will wait for all > > + * references to be released on the cm_id and then kfree the cm_id > > + * object. > > + */ > > +void iw_destroy_cm_id(struct iw_cm_id *cm_id) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags)); > > + > > + destroy_cm_id(cm_id); > > + > > + wait_event(cm_id_priv->destroy_wait, > > + !atomic_read(&cm_id_priv->refcount)); > > + > > + kfree(cm_id_priv); > > +} > > +EXPORT_SYMBOL(iw_destroy_cm_id); > > + > > +/* > > + * CM_ID <-- LISTEN > > + * > > + * Start listening for connect requests. Generates one CONNECT_REQUEST > > + * event for each inbound connect request. 
> > + */ > > +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + unsigned long flags; > > + int ret = 0; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_IDLE: > > + cm_id_priv->state = IW_CM_STATE_LISTEN; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = cm_id->device->iwcm->create_listen(cm_id, backlog); > > + if (ret) > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + break; > > + default: > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = -EINVAL; > > + } > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_listen); > > + > > +/* > > + * CM_ID <-- IDLE > > + * > > + * Rejects an inbound connection request. No events are generated. > > + */ > > +int iw_cm_reject(struct iw_cm_id *cm_id, > > + const void *private_data, > > + u8 private_data_len) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + unsigned long flags; > > + int ret; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + return -EINVAL; > > + } > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + > > + ret = cm_id->device->iwcm->reject(cm_id, private_data, > > + private_data_len); > > + > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_reject); > > + > > +/* > > + * CM_ID <-- ESTABLISHED > > + * > > + * Accepts an inbound connection request and generates an 
ESTABLISHED > > + * event. Callers of iw_cm_disconnect and iw_destroy_cm_id will block > > + * until the ESTABLISHED event is received from the provider. > > + */ > > This makes it sound like we're just waiting for an event. > disconnect/destroy paths wait for the provider to complete the accept or connect operation. > > +int iw_cm_accept(struct iw_cm_id *cm_id, > > + struct iw_cm_conn_param *iw_param) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + struct ib_qp *qp; > > + unsigned long flags; > > + int ret; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + return -EINVAL; > > + } > > + /* Get the ib_qp given the QPN */ > > + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); > > + if (!qp) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + return -EINVAL; > > + } > > + cm_id->device->iwcm->add_ref(qp); > > + cm_id_priv->qp = qp; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + > > + ret = cm_id->device->iwcm->accept(cm_id, iw_param); > > + if (ret) { > > + /* An error on accept precludes provider events */ > > + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV); > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->qp) { > > + cm_id->device->iwcm->rem_ref(qp); > > + cm_id_priv->qp = NULL; > > + } > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + printk("Accept failed, ret=%d\n", ret); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + } > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_accept); > > + > > +/* > > + * Active Side:
CM_ID <-- CONN_SENT > > + * > > + * If successful, results in the generation of a CONNECT_REPLY > > + * event. iw_cm_disconnect and iw_cm_destroy will block until the > > + * CONNECT_REPLY event is received from the provider. > > + */ > > +int iw_cm_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + int ret = 0; > > + unsigned long flags; > > + struct ib_qp *qp; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->state != IW_CM_STATE_IDLE) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + return -EINVAL; > > + } > > + > > + /* Get the ib_qp given the QPN */ > > + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); > > + if (!qp) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + return -EINVAL; > > + } > > + cm_id->device->iwcm->add_ref(qp); > > + cm_id_priv->qp = qp; > > + cm_id_priv->state = IW_CM_STATE_CONN_SENT; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + > > + ret = cm_id->device->iwcm->connect(cm_id, iw_param); > > + if (ret) { > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->qp) { > > + cm_id->device->iwcm->rem_ref(qp); > > + cm_id_priv->qp = NULL; > > + } > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT); > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + printk("Connect failed, ret=%d\n", ret); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + } > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_connect); > > + > > +/* > > + * Passive Side: new CM_ID <-- CONN_RECV > > + * > > + * Handles an inbound connect request. 
The function creates a new > > + * iw_cm_id to represent the new connection and inherits the client > > + * callback function and other attributes from the listening parent. > > + * > > + * The work item contains a pointer to the listen_cm_id and the event. The > > + * listen_cm_id contains the client cm_handler, context and > > + * device. These are copied when the device is cloned. The event > > + * contains the new four tuple. > > + * > > + * An error on the child should not affect the parent, so this > > + * function does not return a value. > > + */ > > +static void cm_conn_req_handler(struct iwcm_id_private *listen_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + struct iw_cm_id *cm_id; > > + struct iwcm_id_private *cm_id_priv; > > + int ret; > > + > > + /* The provider should never generate a connection request > > + * event with a bad status. > > + */ > > + BUG_ON(iw_event->status); > > + > > + /* We could be destroying the listening id. If so, ignore this > > + * upcall. 
*/ > > + spin_lock_irqsave(&listen_id_priv->lock, flags); > > + if (listen_id_priv->state != IW_CM_STATE_LISTEN) { > > + spin_unlock_irqrestore(&listen_id_priv->lock, flags); > > + return; > > + } > > + spin_unlock_irqrestore(&listen_id_priv->lock, flags); > > + > > + cm_id = iw_create_cm_id(listen_id_priv->id.device, > > + listen_id_priv->id.cm_handler, > > + listen_id_priv->id.context); > > + /* If the cm_id could not be created, ignore the request */ > > + if (IS_ERR(cm_id)) > > + return; > > + > > + cm_id->provider_data = iw_event->provider_data; > > + cm_id->local_addr = iw_event->local_addr; > > + cm_id->remote_addr = iw_event->remote_addr; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + cm_id_priv->state = IW_CM_STATE_CONN_RECV; > > + > > + /* Call the client CM handler */ > > + ret = cm_id->cm_handler(cm_id, iw_event); > > + if (ret) { > > + printk("destroying child id %p, ret=%d\n", > > + cm_id, ret); > > We probably don't always want to print a message here. > Yes. I'll change this to a pr_debug(). > > + set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); > > + destroy_cm_id(cm_id); > > + if (atomic_read(&cm_id_priv->refcount)==0) > > + kfree(cm_id); > > + } > > +} > > + > > +/* > > + * Passive Side: CM_ID <-- ESTABLISHED > > + * > > + * The provider generated an ESTABLISHED event which means that > > + * the MPA negotion has completed successfully and we are now in MPA > > + * FPDU mode. > > + * > > + * This event can only be received in the CONN_RECV state. If the > > + * remote peer closed, the ESTABLISHED event would be received followed > > + * by the CLOSE event. If the app closes, it will block until we wake > > + * it up after processing this event. 
> > + */ > > +static int cm_conn_est_handler(struct iwcm_id_private *cm_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + int ret = 0; > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + > > + /* We clear the CONNECT_WAIT bit here to allow the callback > > + * function to call iw_cm_disconnect. Calling iw_destroy_cm_id > > + * from a callback handler is not allowed */ > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_CONN_RECV: > > + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); > > + break; > > + default: > > + BUG_ON(1); > > Can just BUG_ON the state and avoid the switch. Same comment applies below. > ok. > > + } > > + wake_up_all(&cm_id_priv->connect_wait); > > + > > + return ret; > > +} > > + > > +/* > > + * Active Side: CM_ID <-- ESTABLISHED > > + * > > + * The app has called connect and is waiting for the established event to > > + * post it's requests to the server. This event will wake up anyone > > + * blocked in iw_cm_disconnect or iw_destroy_id. 
> > + */ > > +static int cm_conn_rep_handler(struct iwcm_id_private *cm_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + int ret = 0; > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + /* Clear the connect wait bit so a callback function calling > > + * iw_cm_disconnect will not wait and deadlock this thread */ > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_CONN_SENT: > > + if (iw_event->status == IW_CM_EVENT_STATUS_ACCEPTED) { > > + cm_id_priv->id.local_addr = iw_event->local_addr; > > + cm_id_priv->id.remote_addr = iw_event->remote_addr; > > + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; > > + } else { > > + /* REJECTED or RESET */ > > + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); > > + cm_id_priv->qp = NULL; > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + } > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); > > + break; > > + default: > > + BUG_ON(1); > > + } > > + /* Wake up waiters on connect complete */ > > + wake_up_all(&cm_id_priv->connect_wait); > > + > > + return ret; > > +} > > + > > +/* > > + * CM_ID <-- CLOSING > > + * > > + * If in the ESTABLISHED state, move to CLOSING. > > + */ > > +static void cm_disconnect_handler(struct iwcm_id_private *cm_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->state == IW_CM_STATE_ESTABLISHED) > > + cm_id_priv->state = IW_CM_STATE_CLOSING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > +} > > + > > +/* > > + * CM_ID <-- IDLE > > + * > > + * If in the ESTBLISHED or CLOSING states, the QP will have have been > > + * moved by the provider to the ERR state. Disassociate the CM_ID from > > + * the QP, move to IDLE, and remove the 'connected' reference. 
> > + * > > + * If in some other state, the cm_id was destroyed asynchronously. > > + * This is the last reference that will result in waking up > > + * the app thread blocked in iw_destroy_cm_id. > > + */ > > +static int cm_close_handler(struct iwcm_id_private *cm_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + int ret = 0; > > + /* TT */printk("%s:%d cm_id_priv=%p, state=%d\n", > > + __FUNCTION__, __LINE__, > > + cm_id_priv,cm_id_priv->state); > > Will want to remove this. > oops. yes... > - Sean From tom at opengridcomputing.com Thu Jun 1 10:11:58 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 01 Jun 2006 12:11:58 -0500 Subject: [openib-general] Re: [PATCH 1/2] iWARP Connection Manager. In-Reply-To: <447E1720.7000307@ichips.intel.com> References: <20060531182650.3308.81538.stgit@stevo-desktop> <20060531182652.3308.1244.stgit@stevo-desktop> <447E1720.7000307@ichips.intel.com> Message-ID: <1149181918.18855.23.camel@trinity.ogc.int> On Wed, 2006-05-31 at 15:22 -0700, Sean Hefty wrote: > Steve Wise wrote: > > +/* > > + * Release a reference on cm_id. If the last reference is being removed > > + * and iw_destroy_cm_id is waiting, wake up the waiting thread. > > + */ > > +static int iwcm_deref_id(struct iwcm_id_private *cm_id_priv) > > +{ > > + int ret = 0; > > + > > + BUG_ON(atomic_read(&cm_id_priv->refcount)==0); > > + if (atomic_dec_and_test(&cm_id_priv->refcount)) { > > + BUG_ON(!list_empty(&cm_id_priv->work_list)); > > + if (waitqueue_active(&cm_id_priv->destroy_wait)) { > > + BUG_ON(cm_id_priv->state != IW_CM_STATE_DESTROYING); > > + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, > > + &cm_id_priv->flags)); > > + ret = 1; > > + wake_up(&cm_id_priv->destroy_wait); > > We recently changed the RDMA CM, IB CM, and a couple of other modules from using > wait objects to completions. 
This avoids a race condition between decrementing > the reference count, which allows destruction to proceed, and calling wake_up on > a freed cm_id. My guess is that you may need to do the same. > > Can you also explain the use of the return value here? It's ignored below in > rem_ref() and destroy_cm_id(). > > > +static void add_ref(struct iw_cm_id *cm_id) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + atomic_inc(&cm_id_priv->refcount); > > +} > > + > > +static void rem_ref(struct iw_cm_id *cm_id) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + iwcm_deref_id(cm_id_priv); > > +} > > + > > > +/* > > + * CM_ID <-- CLOSING > > + * > > + * Block if a passive or active connection is currenlty being processed. Then > > + * process the event as follows: > > + * - If we are ESTABLISHED, move to CLOSING and modify the QP state > > + * based on the abrupt flag > > + * - If the connection is already in the CLOSING or IDLE state, the peer is > > + * disconnecting concurrently with us and we've already seen the > > + * DISCONNECT event -- ignore the request and return 0 > > + * - Disconnect on a listening endpoint returns -EINVAL > > + */ > > +int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + unsigned long flags; > > + int ret = 0; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + /* Wait if we're currently in a connect or accept downcall */ > > + wait_event(cm_id_priv->connect_wait, > > + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); > > Am I understanding this check correctly? You're checking to see if the user has > called iw_cm_disconnect() at the same time that they called iw_cm_connect() or > iw_cm_accept(). Are connect / accept blocking, or are you just waiting for an > event? Yes. 
The application (or the case I saw was user-mode exit logic after ctrl-C) cleaning up at random times relative to connection establishment. > > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_ESTABLISHED: > > + cm_id_priv->state = IW_CM_STATE_CLOSING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + if (cm_id_priv->qp) { /* QP could be for user-mode client */ > > + if (abrupt) > > + ret = iwcm_modify_qp_err(cm_id_priv->qp); > > + else > > + ret = iwcm_modify_qp_sqd(cm_id_priv->qp); > > + /* > > + * If both sides are disconnecting the QP could > > + * already be in ERR or SQD states > > + */ > > + ret = 0; > > + } > > + else > > + ret = -EINVAL; > > + break; > > + case IW_CM_STATE_LISTEN: > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = -EINVAL; > > + break; > > + case IW_CM_STATE_CLOSING: > > + /* remote peer closed first */ > > + case IW_CM_STATE_IDLE: > > + /* accept or connect returned !0 */ > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + break; > > + case IW_CM_STATE_CONN_RECV: > > + /* > > + * App called disconnect before/without calling accept after > > + * connect_request event delivered. > > + */ > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + break; > > + case IW_CM_STATE_CONN_SENT: > > + /* Can only get here if wait above fails */ > > + default: > > + BUG_ON(1); > > + } > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_disconnect); > > +static void destroy_cm_id(struct iw_cm_id *cm_id) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + unsigned long flags; > > + int ret; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + /* Wait if we're currently in a connect or accept downcall. A > > + * listening endpoint should never block here. */ > > + wait_event(cm_id_priv->connect_wait, > > + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); > > Same question/comment as above. 
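The IWCM_F_CONNECT_WAIT gate that both of these paths rely on can be sketched in plain userspace C, with pthreads standing in for the kernel's wait queue. All names below are illustrative, not taken from the patch: connect/accept set a flag around the provider downcall, and disconnect/destroy block until it clears.

```c
#include <pthread.h>
#include <stdbool.h>

/* Userspace stand-in for the IWCM_F_CONNECT_WAIT pattern: while a
 * connect/accept downcall is in flight, disconnect/destroy must wait. */
struct cm_gate {
    pthread_mutex_t lock;
    pthread_cond_t  connect_wait;   /* kernel: wait_event/wake_up_all */
    bool            in_downcall;    /* kernel: IWCM_F_CONNECT_WAIT bit */
};

static void gate_init(struct cm_gate *g)
{
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->connect_wait, NULL);
    g->in_downcall = false;
}

/* connect/accept path: set the flag before the provider downcall */
static void gate_enter(struct cm_gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->in_downcall = true;
    pthread_mutex_unlock(&g->lock);
}

/* event or error path: clear the flag and wake all waiters */
static void gate_exit(struct cm_gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->in_downcall = false;
    pthread_cond_broadcast(&g->connect_wait);
    pthread_mutex_unlock(&g->lock);
}

/* disconnect/destroy path: block while a downcall is in flight */
static void gate_wait(struct cm_gate *g)
{
    pthread_mutex_lock(&g->lock);
    while (g->in_downcall)
        pthread_cond_wait(&g->connect_wait, &g->lock);
    pthread_mutex_unlock(&g->lock);
}
```

This also makes the deadlock concern in the handlers concrete: a callback that calls the disconnect path (gate_wait) from inside the downcall would block forever unless the flag is cleared first, which is exactly why the event handlers clear CONNECT_WAIT before invoking the client callback.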
> > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_LISTEN: > > + cm_id_priv->state = IW_CM_STATE_DESTROYING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + /* destroy the listening endpoint */ > > + ret = cm_id->device->iwcm->destroy_listen(cm_id); > > + break; > > + case IW_CM_STATE_ESTABLISHED: > > + cm_id_priv->state = IW_CM_STATE_DESTROYING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + /* Abrupt close of the connection */ > > + (void)iwcm_modify_qp_err(cm_id_priv->qp); > > + break; > > + case IW_CM_STATE_IDLE: > > + case IW_CM_STATE_CLOSING: > > + cm_id_priv->state = IW_CM_STATE_DESTROYING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + break; > > + case IW_CM_STATE_CONN_RECV: > > + /* > > + * App called destroy before/without calling accept after > > + * receiving connection request event notification. > > + */ > > + cm_id_priv->state = IW_CM_STATE_DESTROYING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + break; > > + case IW_CM_STATE_CONN_SENT: > > + case IW_CM_STATE_DESTROYING: > > + default: > > + BUG_ON(1); > > + break; > > + } > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > As an alternative, you could hold the lock from above, an let the LISTEN / > ESTABLISHED state checks release and reacquire. > > > + if (cm_id_priv->qp) { > > + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); > > + cm_id_priv->qp = NULL; > > + } > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + > > + (void)iwcm_deref_id(cm_id_priv); > > +} > > + > > +/* > > + * This function is only called by the application thread and cannot > > + * be called by the event thread. The function will wait for all > > + * references to be released on the cm_id and then kfree the cm_id > > + * object. 
> > + */ > > +void iw_destroy_cm_id(struct iw_cm_id *cm_id) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags)); > > + > > + destroy_cm_id(cm_id); > > + > > + wait_event(cm_id_priv->destroy_wait, > > + !atomic_read(&cm_id_priv->refcount)); > > + > > + kfree(cm_id_priv); > > +} > > +EXPORT_SYMBOL(iw_destroy_cm_id); > > + > > +/* > > + * CM_ID <-- LISTEN > > + * > > + * Start listening for connect requests. Generates one CONNECT_REQUEST > > + * event for each inbound connect request. > > + */ > > +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + unsigned long flags; > > + int ret = 0; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_IDLE: > > + cm_id_priv->state = IW_CM_STATE_LISTEN; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = cm_id->device->iwcm->create_listen(cm_id, backlog); > > + if (ret) > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + break; > > + default: > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = -EINVAL; > > + } > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_listen); > > + > > +/* > > + * CM_ID <-- IDLE > > + * > > + * Rejects an inbound connection request. No events are generated. 
> > + */ > > +int iw_cm_reject(struct iw_cm_id *cm_id, > > + const void *private_data, > > + u8 private_data_len) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + unsigned long flags; > > + int ret; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + return -EINVAL; > > + } > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + > > + ret = cm_id->device->iwcm->reject(cm_id, private_data, > > + private_data_len); > > + > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_reject); > > + > > +/* > > + * CM_ID <-- ESTABLISHED > > + * > > + * Accepts an inbound connection request and generates an ESTABLISHED > > + * event. Callers of iw_cm_disconnect and iw_destroy_cm_id will block > > + * until the ESTABLISHED event is received from the provider. > > + */ > > This makes it sound like we're just waiting for an event. 
> > > +int iw_cm_accept(struct iw_cm_id *cm_id, > > + struct iw_cm_conn_param *iw_param) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + struct ib_qp *qp; > > + unsigned long flags; > > + int ret; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + return -EINVAL; > > + } > > + /* Get the ib_qp given the QPN */ > > + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); > > + if (!qp) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + return -EINVAL; > > + } > > + cm_id->device->iwcm->add_ref(qp); > > + cm_id_priv->qp = qp; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + > > + ret = cm_id->device->iwcm->accept(cm_id, iw_param); > > + if (ret) { > > + /* An error on accept precludes provider events */ > > + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV); > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->qp) { > > + cm_id->device->iwcm->rem_ref(qp); > > + cm_id_priv->qp = NULL; > > + } > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + printk("Accept failed, ret=%d\n", ret); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + } > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_accept); > > + > > +/* > > + * Active Side: CM_ID <-- CONN_SENT > > + * > > + * If successful, results in the generation of a CONNECT_REPLY > > + * event. iw_cm_disconnect and iw_cm_destroy will block until the > > + * CONNECT_REPLY event is received from the provider. 
> > + */ > > +int iw_cm_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) > > +{ > > + struct iwcm_id_private *cm_id_priv; > > + int ret = 0; > > + unsigned long flags; > > + struct ib_qp *qp; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->state != IW_CM_STATE_IDLE) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + return -EINVAL; > > + } > > + > > + /* Get the ib_qp given the QPN */ > > + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); > > + if (!qp) { > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + return -EINVAL; > > + } > > + cm_id->device->iwcm->add_ref(qp); > > + cm_id_priv->qp = qp; > > + cm_id_priv->state = IW_CM_STATE_CONN_SENT; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + > > + ret = cm_id->device->iwcm->connect(cm_id, iw_param); > > + if (ret) { > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->qp) { > > + cm_id->device->iwcm->rem_ref(qp); > > + cm_id_priv->qp = NULL; > > + } > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT); > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + printk("Connect failed, ret=%d\n", ret); > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + wake_up_all(&cm_id_priv->connect_wait); > > + } > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iw_cm_connect); > > + > > +/* > > + * Passive Side: new CM_ID <-- CONN_RECV > > + * > > + * Handles an inbound connect request. The function creates a new > > + * iw_cm_id to represent the new connection and inherits the client > > + * callback function and other attributes from the listening parent. 
> > + * > > + * The work item contains a pointer to the listen_cm_id and the event. The > > + * listen_cm_id contains the client cm_handler, context and > > + * device. These are copied when the device is cloned. The event > > + * contains the new four tuple. > > + * > > + * An error on the child should not affect the parent, so this > > + * function does not return a value. > > + */ > > +static void cm_conn_req_handler(struct iwcm_id_private *listen_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + struct iw_cm_id *cm_id; > > + struct iwcm_id_private *cm_id_priv; > > + int ret; > > + > > + /* The provider should never generate a connection request > > + * event with a bad status. > > + */ > > + BUG_ON(iw_event->status); > > + > > + /* We could be destroying the listening id. If so, ignore this > > + * upcall. */ > > + spin_lock_irqsave(&listen_id_priv->lock, flags); > > + if (listen_id_priv->state != IW_CM_STATE_LISTEN) { > > + spin_unlock_irqrestore(&listen_id_priv->lock, flags); > > + return; > > + } > > + spin_unlock_irqrestore(&listen_id_priv->lock, flags); > > + > > + cm_id = iw_create_cm_id(listen_id_priv->id.device, > > + listen_id_priv->id.cm_handler, > > + listen_id_priv->id.context); > > + /* If the cm_id could not be created, ignore the request */ > > + if (IS_ERR(cm_id)) > > + return; > > + > > + cm_id->provider_data = iw_event->provider_data; > > + cm_id->local_addr = iw_event->local_addr; > > + cm_id->remote_addr = iw_event->remote_addr; > > + > > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > > + cm_id_priv->state = IW_CM_STATE_CONN_RECV; > > + > > + /* Call the client CM handler */ > > + ret = cm_id->cm_handler(cm_id, iw_event); > > + if (ret) { > > + printk("destroying child id %p, ret=%d\n", > > + cm_id, ret); > > We probably don't always want to print a message here. 
> > > + set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); > > + destroy_cm_id(cm_id); > > + if (atomic_read(&cm_id_priv->refcount)==0) > > + kfree(cm_id); > > + } > > +} > > + > > +/* > > + * Passive Side: CM_ID <-- ESTABLISHED > > + * > > + * The provider generated an ESTABLISHED event which means that > > + * the MPA negotion has completed successfully and we are now in MPA > > + * FPDU mode. > > + * > > + * This event can only be received in the CONN_RECV state. If the > > + * remote peer closed, the ESTABLISHED event would be received followed > > + * by the CLOSE event. If the app closes, it will block until we wake > > + * it up after processing this event. > > + */ > > +static int cm_conn_est_handler(struct iwcm_id_private *cm_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + int ret = 0; > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + > > + /* We clear the CONNECT_WAIT bit here to allow the callback > > + * function to call iw_cm_disconnect. Calling iw_destroy_cm_id > > + * from a callback handler is not allowed */ > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_CONN_RECV: > > + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); > > + break; > > + default: > > + BUG_ON(1); > > Can just BUG_ON the state and avoid the switch. Same comment applies below. > > > + } > > + wake_up_all(&cm_id_priv->connect_wait); > > + > > + return ret; > > +} > > + > > +/* > > + * Active Side: CM_ID <-- ESTABLISHED > > + * > > + * The app has called connect and is waiting for the established event to > > + * post it's requests to the server. This event will wake up anyone > > + * blocked in iw_cm_disconnect or iw_destroy_id. 
> > + */ > > +static int cm_conn_rep_handler(struct iwcm_id_private *cm_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + int ret = 0; > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + /* Clear the connect wait bit so a callback function calling > > + * iw_cm_disconnect will not wait and deadlock this thread */ > > + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); > > + switch (cm_id_priv->state) { > > + case IW_CM_STATE_CONN_SENT: > > + if (iw_event->status == IW_CM_EVENT_STATUS_ACCEPTED) { > > + cm_id_priv->id.local_addr = iw_event->local_addr; > > + cm_id_priv->id.remote_addr = iw_event->remote_addr; > > + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; > > + } else { > > + /* REJECTED or RESET */ > > + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); > > + cm_id_priv->qp = NULL; > > + cm_id_priv->state = IW_CM_STATE_IDLE; > > + } > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); > > + break; > > + default: > > + BUG_ON(1); > > + } > > + /* Wake up waiters on connect complete */ > > + wake_up_all(&cm_id_priv->connect_wait); > > + > > + return ret; > > +} > > + > > +/* > > + * CM_ID <-- CLOSING > > + * > > + * If in the ESTABLISHED state, move to CLOSING. > > + */ > > +static void cm_disconnect_handler(struct iwcm_id_private *cm_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + > > + spin_lock_irqsave(&cm_id_priv->lock, flags); > > + if (cm_id_priv->state == IW_CM_STATE_ESTABLISHED) > > + cm_id_priv->state = IW_CM_STATE_CLOSING; > > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > +} > > + > > +/* > > + * CM_ID <-- IDLE > > + * > > + * If in the ESTBLISHED or CLOSING states, the QP will have have been > > + * moved by the provider to the ERR state. Disassociate the CM_ID from > > + * the QP, move to IDLE, and remove the 'connected' reference. 
> > + * > > + * If in some other state, the cm_id was destroyed asynchronously. > > + * This is the last reference that will result in waking up > > + * the app thread blocked in iw_destroy_cm_id. > > + */ > > +static int cm_close_handler(struct iwcm_id_private *cm_id_priv, > > + struct iw_cm_event *iw_event) > > +{ > > + unsigned long flags; > > + int ret = 0; > > + /* TT */printk("%s:%d cm_id_priv=%p, state=%d\n", > > + __FUNCTION__, __LINE__, > > + cm_id_priv,cm_id_priv->state); > > Will want to remove this. > > - Sean > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Thu Jun 1 10:48:38 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 1 Jun 2006 20:48:38 +0300 Subject: [openib-general] [PATCH TRIVIAL] opensm: fix comment in osm_matrix.h Message-ID: <20060601174838.GA12872@sashak.voltaire.com> This fixes the function description comment in osm_matrix.h Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_matrix.h | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/osm/include/opensm/osm_matrix.h b/osm/include/opensm/osm_matrix.h index c6c5107..0903708 100644 --- a/osm/include/opensm/osm_matrix.h +++ b/osm/include/opensm/osm_matrix.h @@ -321,7 +321,7 @@ osm_lid_matrix_get_num_ports( * osm_lid_matrix_get_least_hops * * DESCRIPTION -* Returns the number of ports in this lid matrix. +* Returns the least number of hops for specified lid * * SYNOPSIS */ @@ -345,7 +345,7 @@ osm_lid_matrix_get_least_hops( * [in] LID (host order) for which to retrieve the shortest hop count. * * RETURN VALUES -* Returns the number of ports in this lid matrix. 
+* Returns the least number of hops for specified lid * * NOTES * From sashak at voltaire.com Thu Jun 1 11:09:49 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 1 Jun 2006 21:09:49 +0300 Subject: [openib-general] QoS RFC - Resend using a friendly mailer In-Reply-To: <20060530224917.GM29770@esmail.cup.hp.com> References: <20060530190936.GD21212@sashak.voltaire.com> <20060530224917.GM29770@esmail.cup.hp.com> Message-ID: <20060601180949.GB14883@sashak.voltaire.com> On 15:49 Tue 30 May , Grant Grundler wrote: > On Tue, May 30, 2006 at 10:09:36PM +0300, Sasha Khapyorsky wrote: > > > XML style syntax is provided for the policy file. > > > > Why XML? It is not too much readable and writable (by human) format. > > It is human readable and very portable. > An example is here: > http://svn.gnumonks.org/trunk/mmio_test/mmio_test.xml Yes it is readable, but for many people it is _less_ readable and even _less_ writable than "plain" text. > And GPL libraries can parse XML. It is true, but currently we have "portability" complaints even against using libpthread. Sasha > So the new code is fairly short: > http://svn.gnumonks.org/trunk/mmio_test/xmlin.c > > hth, > grant From sashak at voltaire.com Thu Jun 1 11:51:03 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 1 Jun 2006 21:51:03 +0300 Subject: [openib-general] QoS RFC - Resend using a friendly mailer In-Reply-To: References: Message-ID: <20060601185103.GC14883@sashak.voltaire.com> Hi Eitan, Some more comments related to OpenSM.
On 17:53 Tue 30 May , Eitan Zahavi wrote: > > 9. OpenSM features > ------------------- > The QoS related functionality to be provided by OpenSM can be split into two > main parts: > > 3.1. Fabric Setup > During fabric initialization the SM should parse the policy and apply its > settings to the discovered fabric elements. The following actions should be > performed: > * Parsing of policy > * Node Group identification. Warning should be provided for each node not > specified but found. > * SL2VL settings validation should be checked: > + A warning will be provided if there are no matching targets for the SL2VL > setting statement. > + An error message will be printed to the log file if an invalid setting is > found. A setting is invalid if it refers to: > - Non existing port numbers of the target devices > - Unsupported VLs for the target device. In the later case the map to non > existing VLs should be replaced to VL15 i.e. packets will be dropped. I'm not sure that mapping unsupported VLs to VL15 is the best option. If SL2VL is specified per port group, this may mean that at least in the "generic" case all group members should have similar physical capabilities, or else the "reliable" part of the SLs will be limited by the lowest VLCap in the group (other SLs will simply be dropped somewhere). In the current SL2VL mapping implementation we use the following rule to replace unsupported VLs: (new VL) = (requested VL) % (operational data VLs) This may have some disadvantages too, but I think it is generally "safer". Also, I guess that by "unsupported VLs" you mean unsupported or non-configured VLs.
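The modulo replacement rule quoted above — (new VL) = (requested VL) % (operational data VLs) — can be sketched in a few lines. This helper is hypothetical, written for illustration rather than taken from OpenSM, and the VL15 pass-through is an assumption based on VL15 being the always-present management VL:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, NOT OpenSM code: fold a requested VL that the
 * port may not support onto one of its operational data VLs using
 * (new VL) = (requested VL) % (operational data VLs). */
static uint8_t remap_vl(uint8_t requested_vl, uint8_t op_data_vls)
{
	if (requested_vl == 15)		/* management VL, always supported */
		return 15;
	return requested_vl % op_data_vls;	/* wrap into supported range */
}
```

With 4 operational data VLs, for example, a request for VL5 lands on VL1 instead of being mapped to VL15 and dropped.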
> * SL2VL setting is to be performed > * VL Arbitration table settings should be validated according to the following > rules: > + A warning will be provided if there are no matching targets for the setting > statement > + An error will be provided if the port number exceeds the target ports > + An error will be generated if the table length exceeds device capabilities > + An warning will be generated if the table quote a VL that is not supported > by the target device Should there be a replacement rule for unsupported VLs? The IBTA spec (v.1, p.190, l.14) states that an entry with an unsupported VL may be skipped _OR_ "trusted" to another (supported) VL. I think that if we don't handle replacement of unsupported VLs, there may be a hole for "device/vendor dependent" behavior. Sasha
From iod00d at hp.com Thu Jun 1 12:07:45 2006 From: iod00d at hp.com (Grant Grundler) Date: Thu, 1 Jun 2006 12:07:45 -0700 Subject: [openib-general] QoS RFC - Resend using a friendly mailer In-Reply-To: <20060601180949.GB14883@sashak.voltaire.com> References: <20060530190936.GD21212@sashak.voltaire.com> <20060530224917.GM29770@esmail.cup.hp.com> <20060601180949.GB14883@sashak.voltaire.com> Message-ID: <20060601190745.GA7670@esmail.cup.hp.com> On Thu, Jun 01, 2006 at 09:09:49PM +0300, Sasha Khapyorsky wrote: > On 15:49 Tue 30 May , Grant Grundler wrote: > > On Tue, May 30, 2006 at 10:09:36PM +0300, Sasha Khapyorsky wrote: > > > > XML style syntax is provided for the policy file. > > > > > > Why XML? It is not too much readable and writable (by human) format. > > > > It is human readable and very portable. > > An example is here: > > http://svn.gnumonks.org/trunk/mmio_test/mmio_test.xml > > Yes it is readable, but for many people it is _less_ readable and even > _less_ writable than "plain" text. This might be a good starting point for "many people": http://ahds.ac.uk/creating/information-papers/xml-editors/ I tried conglomerate (debian) and it doesn't like mmio_test.xml for some reason.
But I suppose that could be fixed. Anyway, my point is there is no shortage of GUIs to edit XML files and verify syntactical correctness. hth, grant From rdreier at cisco.com Thu Jun 1 13:05:56 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 01 Jun 2006 13:05:56 -0700 Subject: [openib-general] Re: [PATCHv2 1/2] resend: mthca support for max_map_per_fmr device attribute In-Reply-To: (Or Gerlitz's message of "Tue, 30 May 2006 09:22:59 +0300 (IDT)") References: Message-ID: I had a chance to look at this, and I don't believe it is precisely correct for mem-free HCAs with the current FMR implementation. > + /* on memfull HCA an FMR can be remapped 2^B - 1 times where B < 32 is > + * the number of bits which are not used for MPT addressing, on memfree > + * HCA B=8 so an FMR can be remapped 255 times. > + */ > + if(!mthca_is_memfree(mdev)) > + props->max_map_per_fmr = (1 << (32 - > + long_log2(mdev->limits.num_mpts))) - 1; > + else > + props->max_map_per_fmr = (1 << 8) - 1; Look at mthca_arbel_map_phys_fmr(). The question is how often key will repeat after being indexed, and when MTHCA_FLAG_SINAI_OPT is not set, then the same increment is used in the mem-free case as in the Tavor case. So I think the code I quoted should really be: if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) props->max_map_per_fmr = (1 << (32 - long_log2(mdev->limits.num_mpts))) - 1; else props->max_map_per_fmr = (1 << 8) - 1; Do you agree? If so I can fix this patch up myself and apply it. - R. From mshefty at ichips.intel.com Thu Jun 1 14:09:12 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 01 Jun 2006 14:09:12 -0700 Subject: [openib-general] Re: [PATCH 1/2] iWARP Connection Manager. 
In-Reply-To: <1149181233.31610.34.camel@stevo-desktop> References: <20060531182650.3308.81538.stgit@stevo-desktop> <20060531182652.3308.1244.stgit@stevo-desktop> <447E1720.7000307@ichips.intel.com> <1149181233.31610.34.camel@stevo-desktop> Message-ID: <447F5778.6010202@ichips.intel.com> Steve Wise wrote: >>>+int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt) >>>+{ >>>+ struct iwcm_id_private *cm_id_priv; >>>+ unsigned long flags; >>>+ int ret = 0; >>>+ >>>+ cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); >>>+ /* Wait if we're currently in a connect or accept downcall */ >>>+ wait_event(cm_id_priv->connect_wait, >>>+ !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); >> >>Am I understanding this check correctly? You're checking to see if the user has >>called iw_cm_disconnect() at the same time that they called iw_cm_connect() or >>iw_cm_accept(). Are connect / accept blocking, or are you just waiting for an >>event? > > > The CM must wait for the low level provider to finish a connect() or > accept() operation before telling the low level provider to disconnect > via modifying the iwarp QP. Regardless of whether they block, this > disconnect can happen concurrently with the connect/accept so we need to > hold the disconnect until the connect/accept completes. > > >>>+EXPORT_SYMBOL(iw_cm_disconnect); >>>+static void destroy_cm_id(struct iw_cm_id *cm_id) >>>+{ >>>+ struct iwcm_id_private *cm_id_priv; >>>+ unsigned long flags; >>>+ int ret; >>>+ >>>+ cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); >>>+ /* Wait if we're currently in a connect or accept downcall. A >>>+ * listening endpoint should never block here. */ >>>+ wait_event(cm_id_priv->connect_wait, >>>+ !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); >> >>Same question/comment as above. >> > > > Same answer. 
There's a difference between trying to handle the user calling disconnect/destroy at the same time a call to accept/connect is active, versus the user calling disconnect/destroy after accept/connect have returned. In the latter case, I think you're fine. In the first case, this is allowing a user to call destroy at the same time that they're calling accept/connect. Additionally, there's no guarantee that the F_CONNECT_WAIT flag has been set by accept/connect by the time disconnect/destroy tests it. - Sean From gaoq at cse.ohio-state.edu Thu Jun 1 14:44:01 2006 From: gaoq at cse.ohio-state.edu (Qi Gao) Date: Thu, 1 Jun 2006 17:44:01 -0400 Subject: [openib-general] EINTR in ibv_get_cq_event Message-ID: <007501c685c4$86931c30$0763a8c0@Brunhild> Hi, I'm trying to use the ibv_get_cq_event, and I see the following behavior: This is my code: ---------- ret = ibv_get_cq_event(cm_ud_comp_ch, &ev_cq, &ev_ctx); if (ret) { fprintf(stderr, "Failed to get cq_event: %d\n", ret); perror("ibv_get_cq_event"); } ---------- Most times it's OK, but sometimes I see: ---------- Failed to get cq_event: -1 ibv_get_cq_event: Interrupted system call ---------- Could someone tell me what may be happening? Thanks, Qi From faulkner at opengridcomputing.com Thu Jun 1 15:06:05 2006 From: faulkner at opengridcomputing.com (Boyd R. Faulkner) Date: Thu, 1 Jun 2006 17:06:05 -0500 Subject: [openib-general] [PATCH] librdmacm: ucma_init reads past end of device_list Message-ID: <200606011706.05383.faulkner@opengridcomputing.com> The code currently in place seems to expect there to be a null element at the end of the dev_list to trigger the end of the loop. ibv_get_device_list does not provide such an entry, but the number of entries is available. This patch retrieves that number and loops based on it. If ibv_get_device_list should return a list with a null element at the end, then it is not working correctly. This patch will work with either of the possible intended behaviors of ibv_get_device_list. 
Fix spelling of "liste". Index: cma.c =================================================================== --- cma.c (revision 7568) +++ cma.c (working copy) @@ -183,6 +183,7 @@ static int ucma_init(void) { int i; + int num_devices; struct cma_device *cma_dev; struct ibv_device_attr attr; int ret; @@ -201,14 +202,14 @@ goto err; } - dev_list = ibv_get_device_list(NULL); + dev_list = ibv_get_device_list(&num_devices); if (!dev_list) { - printf("CMA: unable to get RDMA device liste\n"); + printf("CMA: unable to get RDMA device list\n"); ret = -ENODEV; goto err; } - for (i = 0; dev_list[i]; ++i) { + for (i = 0; i < num_devices; ++i) { cma_dev = malloc(sizeof *cma_dev); if (!cma_dev) { ret = -ENOMEM; From tom at opengridcomputing.com Thu Jun 1 15:21:16 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 01 Jun 2006 17:21:16 -0500 Subject: [openib-general] Re: [PATCH 1/2] iWARP Connection Manager. In-Reply-To: <447F5778.6010202@ichips.intel.com> References: <20060531182650.3308.81538.stgit@stevo-desktop> <20060531182652.3308.1244.stgit@stevo-desktop> <447E1720.7000307@ichips.intel.com> <1149181233.31610.34.camel@stevo-desktop> <447F5778.6010202@ichips.intel.com> Message-ID: <1149200476.18855.83.camel@trinity.ogc.int> On Thu, 2006-06-01 at 14:09 -0700, Sean Hefty wrote: > Steve Wise wrote: > >>>+int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt) > >>>+{ > >>>+ struct iwcm_id_private *cm_id_priv; > >>>+ unsigned long flags; > >>>+ int ret = 0; > >>>+ > >>>+ cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > >>>+ /* Wait if we're currently in a connect or accept downcall */ > >>>+ wait_event(cm_id_priv->connect_wait, > >>>+ !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); > >> > >>Am I understanding this check correctly? You're checking to see if the user has > >>called iw_cm_disconnect() at the same time that they called iw_cm_connect() or > >>iw_cm_accept(). Are connect / accept blocking, or are you just waiting for an > >>event? 
> > > > > > The CM must wait for the low level provider to finish a connect() or > > accept() operation before telling the low level provider to disconnect > > via modifying the iwarp QP. Regardless of whether they block, this > > disconnect can happen concurrently with the connect/accept so we need to > > hold the disconnect until the connect/accept completes. > > > > > >>>+EXPORT_SYMBOL(iw_cm_disconnect); > >>>+static void destroy_cm_id(struct iw_cm_id *cm_id) > >>>+{ > >>>+ struct iwcm_id_private *cm_id_priv; > >>>+ unsigned long flags; > >>>+ int ret; > >>>+ > >>>+ cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > >>>+ /* Wait if we're currently in a connect or accept downcall. A > >>>+ * listening endpoint should never block here. */ > >>>+ wait_event(cm_id_priv->connect_wait, > >>>+ !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); > >> > >>Same question/comment as above. > >> > > > > > > Same answer. > > There's a difference between trying to handle the user calling > disconnect/destroy at the same time a call to accept/connect is active, versus > the user calling disconnect/destroy after accept/connect have returned. In the > latter case, I think you're fine. In the first case, this is allowing a user to > call destroy at the same time that they're calling accept/connect. > Additionally, there's no guarantee that the F_CONNECT_WAIT flag has been set by > accept/connect by the time disconnect/destroy tests it. The problem is that we can't synchronously cancel an outstanding connect request. Once we've asked the adapter to connect, we can't tell him to stop, we have to wait for it to fail. During the time period between when we ask to connect and the adapter says yeah-or-nay, the user hits ctrl-C. This is the case where disconnect and/or destroy gets called and we have to block it waiting for the outstanding connect request to complete. One alternative to this approach is to do the kfree of the cm_id in the deref logic. 
This was the original design and leaves the object around to handle the completion of the connect and still allows the app to clean up and go away without all this waitin' around. When the adapter finally finishes and releases its reference, the object is kfree'd. Hope this helps. > > - Sean
From caitlinb at broadcom.com Thu Jun 1 15:28:24 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 1 Jun 2006 15:28:24 -0700 Subject: [openib-general] Re: [PATCH 1/2] iWARP Connection Manager. Message-ID: <54AD0F12E08D1541B826BE97C98F99F150D3E6@NT-SJCA-0751.brcm.ad.broadcom.com> >> >> There's a difference between trying to handle the user calling >> disconnect/destroy at the same time a call to accept/connect is >> active, versus the user calling disconnect/destroy after >> accept/connect have returned. In the latter case, I think you're >> fine. In the first case, this is allowing a user to call > destroy at the same time that they're calling accept/connect. >> Additionally, there's no guarantee that the F_CONNECT_WAIT flag has >> been set by accept/connect by the time disconnect/destroy tests it. > > The problem is that we can't synchronously cancel an > outstanding connect request. Once we've asked the adapter to > connect, we can't tell him to stop, we have to wait for it to > fail. During the time period between when we ask to connect > and the adapter says yeah-or-nay, the user hits ctrl-C. This > is the case where disconnect and/or destroy gets called and > we have to block it waiting for the outstanding connect > request to complete. > > One alternative to this approach is to do the kfree of the > cm_id in the deref logic.
This was the original design and > leaves the object around to handle the completion of the > connect and still allows the app to clean up and go away > without all this waitin' around. When the adapter finally > finishes and releases it's reference, the object is kfree'd. > > Hope this helps. > Why couldn't you synchronously put the cm_id in a state of "pending delete" and do the actual delete when the RNIC provides a response to the request? There could even be an optional method to see if the device is capable of cancelling the request. I know it can't yank a SYN back from the wire, but it could refrain from retransmitting. From mshefty at ichips.intel.com Thu Jun 1 15:34:43 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 01 Jun 2006 15:34:43 -0700 Subject: [openib-general] Re: [PATCH] librdmacm: ucma_init reads past end of device_list In-Reply-To: <200606011706.05383.faulkner@opengridcomputing.com> References: <200606011706.05383.faulkner@opengridcomputing.com> Message-ID: <447F6B83.6000902@ichips.intel.com> Boyd R. Faulkner wrote: > The code currently in place seems to expect there to be a null element at the > end of the dev_list to trigger the end of the loop. ibv_get_device_list > does not provide such an entry, but the number of entries is > available. This patch retrieves that number and loops based on it. > If ibv_get_device_list should return a list with a null element at the end, > then it is not working correctly. This patch will work with either of the > possible intended behaviors of ibv_get_device_list. > > Fix spelling of "liste". Thanks - can you please send a signed-off-by line? 
- Sean
From robert.j.woodruff at intel.com Thu Jun 1 15:40:02 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 1 Jun 2006 15:40:02 -0700 Subject: [openib-general] [PATCH] ipathverbs.c fails to compile on svn 7568 or on the ofed 1.0 branch Message-ID: <1AC79F16F5C5284499BB9591B33D6F0007DC4264@orsmsx408> I ran into a compile problem with userspace/libipathverbs/src/ipathverbs.c This patch fixes the compile problem. --- ipathverbs.c 2006-06-01 14:56:46.000000000 -0700 +++ ipathverbs.new.c 2006-06-01 14:54:48.000000000 -0700 @@ -41,6 +41,7 @@ #include #include #include +#include #include "ipathverbs.h" -------------- next part -------------- An HTML attachment was scrubbed... URL:
From rdreier at cisco.com Thu Jun 1 15:56:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 01 Jun 2006 15:56:58 -0700 Subject: [openib-general] EINTR in ibv_get_cq_event In-Reply-To: <007501c685c4$86931c30$0763a8c0@Brunhild> (Qi Gao's message of "Thu, 1 Jun 2006 17:44:01 -0400") References: <007501c685c4$86931c30$0763a8c0@Brunhild> Message-ID: Qi> Could someone tell me what may be happening? Your process is getting a signal that interrupts the underlying read() system call. - R.
From rdreier at cisco.com Thu Jun 1 15:59:11 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 01 Jun 2006 15:59:11 -0700 Subject: [openib-general] [PATCH] librdmacm: ucma_init reads past end of device_list In-Reply-To: <200606011706.05383.faulkner@opengridcomputing.com> (Boyd R. Faulkner's message of "Thu, 1 Jun 2006 17:06:05 -0500") References: <200606011706.05383.faulkner@opengridcomputing.com> Message-ID: Boyd> The code currently in place seems to expect there to be a Boyd> null element at the end of the dev_list to trigger the end Boyd> of the loop. ibv_get_device_list does not provide such an Boyd> entry, but the number of entries is available.
This patch Boyd> retrieves that number and loops based on it. If Boyd> ibv_get_device_list should return a list with a null element Boyd> at the end, then it is not working correctly. This patch Boyd> will work with either of the possible intended behaviors of Boyd> ibv_get_device_list. This is definitely a bug in libibverbs -- I clearly wrote * ibv_get_device_list - Get list of IB devices currently available * @num_devices: optional. if non-NULL, set to the number of devices * returned in the array. * * Return a NULL-terminated array of IB devices. The array can be * released with ibv_free_device_list(). so I intended to return a NULL-terminated array. I'll fix libibverbs up. - R. From rdreier at cisco.com Thu Jun 1 16:00:50 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 01 Jun 2006 16:00:50 -0700 Subject: [openib-general] Re: [openfabrics-ewg] [PATCH] ipathverbs.c fails to compile on svn 7568 or on the ofed 1.0 branch In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0007DC4264@orsmsx408> (Robert J. Woodruff's message of "Thu, 1 Jun 2006 15:40:02 -0700") References: <1AC79F16F5C5284499BB9591B33D6F0007DC4264@orsmsx408> Message-ID: > I ran into a compile problem with > userspace/libipathverbs/src/ipathverbs.c > > This patch fixes the compile problem. > > --- ipathverbs.c 2006-06-01 14:56:46.000000000 -0700 > +++ ipathverbs.new.c 2006-06-01 14:54:48.000000000 -0700 > @@ -41,6 +41,7 @@ > #include > #include > #include > +#include > > #include "ipathverbs.h" I don't think there's much point in this, since the resulting library won't actually work with the libibverbs 1.1 development tree anyway. Just build against libibverbs-1.0 until libipathverbs is fixed to work with development libibverbs versions. - R. 
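Roland's stated contract — ibv_get_device_list() returns a NULL-terminated array and also reports the count through its optional num_devices out-parameter — means Boyd's counted loop and the original sentinel loop visit exactly the same entries. A minimal sketch of the sentinel-style walk, using a stand-in string array (the entries in the usage example are made up) so it runs without libibverbs or RDMA hardware:

```c
#include <assert.h>
#include <stddef.h>

/* Count entries in a NULL-terminated array, the way a caller may walk
 * the array returned by ibv_get_device_list().  A counted loop bounded
 * by num_devices and this sentinel loop agree when the contract holds. */
static int count_null_terminated(const char **list)
{
	int n = 0;
	while (list[n])		/* stop at the NULL sentinel */
		++n;
	return n;
}
```

Against the real API this corresponds to `for (i = 0; dev_list[i]; ++i)` after `dev_list = ibv_get_device_list(&num_devices);`, with `ibv_free_device_list(dev_list)` to release the array when done.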
From bos at pathscale.com Thu Jun 1 16:27:48 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 01 Jun 2006 16:27:48 -0700 Subject: [openib-general] [PATCH 5/5] IB/ipath: Add client reregister event generation In-Reply-To: <20060531223218.10506.76076.stgit@localhost.localdomain> References: <20060531223205.10506.51241.stgit@localhost.localdomain> <20060531223218.10506.76076.stgit@localhost.localdomain> Message-ID: <1149204468.16993.8.camel@localhost.localdomain> On Wed, 2006-05-31 at 15:32 -0700, Roland Dreier wrote: > Generate client reregister event instead of LID change event when > client reregister bit is set. Please CC me on ipath driver patches, as I'm not guaranteed to see them otherwise. The code currently in place seems to expect there to be a null element at the end of the dev_list to trigger the end of the loop. ibv_get_device_list does not provide such an entry, but the number of entries is available. This patch retrieves that number and loops based on it. If ibv_get_device_list should return a list with a null element at the end then it is not working correctly. This patch will work with either of the possible intended behaviors of ibv_get_device_list. Roland has said that ibv_get_device_list should return a list with a null element at the end. Fix spelling of "liste".
Signed-off-by: Boyd Faulkner Index: cma.c =================================================================== --- cma.c (revision 7568) +++ cma.c (working copy) @@ -183,6 +183,7 @@ static int ucma_init(void) { int i; + int num_devices; struct cma_device *cma_dev; struct ibv_device_attr attr; int ret; @@ -201,14 +202,14 @@ goto err; } - dev_list = ibv_get_device_list(NULL); + dev_list = ibv_get_device_list(&num_devices); if (!dev_list) { - printf("CMA: unable to get RDMA device liste\n"); + printf("CMA: unable to get RDMA device list\n"); ret = -ENODEV; goto err; } - for (i = 0; dev_list[i]; ++i) { + for (i = 0; i < num_devices; ++i) { cma_dev = malloc(sizeof *cma_dev); if (!cma_dev) { ret = -ENOMEM;
From mashirle at us.ibm.com Thu Jun 1 09:58:36 2006 From: mashirle at us.ibm.com (Shirley Ma) Date: Thu, 01 Jun 2006 09:58:36 -0700 Subject: [openib-general] [PATCH] IPoIB skb panic Message-ID: <1149181116.8085.8.camel@ibm-khxoic5vfkn.beaverton.ibm.com> Roland, I found there are two problems in path_free(), it would cause kernel skb panic. 1. path_free() should dev_kfree_skb_any() (any context) instead of dev_kfree_skb_irq() (irq context) 2. path->queue should be protected by priv->lock since there is a possible race between unicast_send_arp() and ipoib_flush_paths() when bring interface down. It's safe to use priv->lock, because skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE, which is 3. Here is the patch. Please review it and let me know if there is a problem to apply this patch.
Signed-off-by: Shirley Ma diff -urpN infiniband/ulp/ipoib/ipoib_main.c infiniband-skb/ulp/ipoib/ipoib_main.c --- infiniband/ulp/ipoib/ipoib_main.c 2006-05-03 13:16:18.000000000 -0700 +++ infiniband-skb/ulp/ipoib/ipoib_main.c 2006-06-01 09:14:05.000000000 -0700 @@ -252,11 +252,11 @@ static void path_free(struct net_device struct sk_buff *skb; unsigned long flags; - while ((skb = __skb_dequeue(&path->queue))) - dev_kfree_skb_irq(skb); - spin_lock_irqsave(&priv->lock, flags); + while ((skb = __skb_dequeue(&path->queue))) + dev_kfree_skb_any(skb); + list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { /* * It's safe to call ipoib_put_ah() inside priv->lock Thanks Shirley Ma IBM LTC
From manpreet at gmail.com Thu Jun 1 18:22:53 2006 From: manpreet at gmail.com (Manpreet Singh) Date: Thu, 1 Jun 2006 18:22:53 -0700 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs Message-ID: <67897d690606011822j7b915876l57149508623c6c4f@mail.gmail.com> Hi, It seems that the number of outstanding RDMAs that a Mellanox HCA can handle has been configured at 4 (mthca_main.c: default_profile: rdb_per_qp), yet the HCAs can support a much higher value (128, I think). Could we move this value higher or at least make it configurable? Thanks, Manpreet. -------------- next part -------------- An HTML attachment was scrubbed... URL:
From troy at scl.ameslab.gov Thu Jun 1 18:57:08 2006 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Thu, 1 Jun 2006 20:57:08 -0500 Subject: [openib-general] EHCA broken for 2.6.16? Message-ID: <20060602015708.GE18223@scl.ameslab.gov> Okay guys, what's up this time? Kernel 2.6.16.. CC [M] drivers/infiniband/hw/ehca/ehca_main.o In file included from drivers/infiniband/hw/ehca/ehca_qes.h:47, from drivers/infiniband/hw/ehca/ipz_pt_fn.h:46, from drivers/infiniband/hw/ehca/ehca_classes.h:46, from drivers/infiniband/hw/ehca/ehca_main.c:43: drivers/infiniband/hw/ehca/ehca_tools.h: In function 'ehca2ib_return_code': drivers/infiniband/hw/ehca/ehca_tools.h:404: error: 'H_SUCCESS' undeclared (first use in this function) drivers/infiniband/hw/ehca/ehca_tools.h:404: error: (Each undeclared identifier is reported only once drivers/infiniband/hw/ehca/ehca_tools.h:404: error: for each function it appears in.)
drivers/infiniband/hw/ehca/ehca_tools.h:406: error: 'H_BUSY' undeclared (first use in this function) drivers/infiniband/hw/ehca/ehca_tools.h:408: error: 'H_NO_MEM' undeclared (first use in this function) drivers/infiniband/hw/ehca/ehca_main.c: In function 'ehca_sense_attributes':
From anton at samba.org Thu Jun 1 23:43:46 2006
From: anton at samba.org (Anton Blanchard)
Date: Fri, 2 Jun 2006 16:43:46 +1000
Subject: [openib-general] [PATCH] Fix some compile issues with libehca
Message-ID: <20060602064346.GE1736@krispykreme>

Hi,

Here's a patch to fix some warnings about missing prototypes (memset etc), and one compile error due to libsysfs not being included.

This was also giving a warning when built as 32bit:

	my_cq->ipz_queue.queue = (u8*)resp.ipz_queue.queue;

src/ehca_umain.c:239: warning: cast to pointer from integer of different size

So cast it to a long first. Is that code correct for 32bit?
Anton
---

Index: src/ehca_uinit.c
===================================================================
--- src/ehca_uinit.c	(revision 7621)
+++ src/ehca_uinit.c	(working copy)
@@ -44,6 +44,7 @@
 #include
 #include
+#include
 #include
 #include
 #include
@@ -51,6 +52,7 @@
 #include
 #include
 #include
+#include

 #include "ehca_uclasses.h"

Index: src/ehca_umain.c
===================================================================
--- src/ehca_umain.c	(revision 7621)
+++ src/ehca_umain.c	(working copy)
@@ -53,6 +53,7 @@
 #include
 #include
 #include
+#include
 #include
 #include

@@ -234,8 +235,8 @@
 	/* copy data returned from kernel */
 	my_cq->cq_number = resp.cq_number;
 	my_cq->token = resp.token;
-	my_cq->ipz_queue.queue = (u8*)resp.ipz_queue.queue;
-	my_cq->ipz_queue.current_q_addr = (u8*)resp.ipz_queue.queue;
+	my_cq->ipz_queue.queue = (u8*)(long)resp.ipz_queue.queue;
+	my_cq->ipz_queue.current_q_addr = (u8*)(long)resp.ipz_queue.queue;
 	my_cq->ipz_queue.qe_size = resp.ipz_queue.qe_size;
 	my_cq->ipz_queue.act_nr_of_sg = resp.ipz_queue.act_nr_of_sg;
 	my_cq->ipz_queue.queue_length = resp.ipz_queue.queue_length;
@@ -321,16 +322,16 @@
 	my_qp->qkey = resp.qkey;
 	my_qp->real_qp_num = resp.real_qp_num;
 	/* rqueue properties */
-	my_qp->ipz_rqueue.queue = (u8*)resp.ipz_rqueue.queue;
-	my_qp->ipz_rqueue.current_q_addr = (u8*)resp.ipz_rqueue.queue;
+	my_qp->ipz_rqueue.queue = (u8*)(long)resp.ipz_rqueue.queue;
+	my_qp->ipz_rqueue.current_q_addr = (u8*)(long)resp.ipz_rqueue.queue;
 	my_qp->ipz_rqueue.qe_size = resp.ipz_rqueue.qe_size;
 	my_qp->ipz_rqueue.act_nr_of_sg = resp.ipz_rqueue.act_nr_of_sg;
 	my_qp->ipz_rqueue.queue_length = resp.ipz_rqueue.queue_length;
 	my_qp->ipz_rqueue.pagesize = resp.ipz_rqueue.pagesize;
 	my_qp->ipz_rqueue.toggle_state = resp.ipz_rqueue.toggle_state;
 	/* squeue properties */
-	my_qp->ipz_squeue.queue = (u8*)resp.ipz_squeue.queue;
-	my_qp->ipz_squeue.current_q_addr = (u8*)resp.ipz_squeue.queue;
+	my_qp->ipz_squeue.queue = (u8*)(long)resp.ipz_squeue.queue;
+	my_qp->ipz_squeue.current_q_addr = (u8*)(long)resp.ipz_squeue.queue;
 	my_qp->ipz_squeue.qe_size = resp.ipz_squeue.qe_size;
 	my_qp->ipz_squeue.act_nr_of_sg = resp.ipz_squeue.act_nr_of_sg;
 	my_qp->ipz_squeue.queue_length = resp.ipz_squeue.queue_length;

From anton at samba.org Thu Jun 1 23:49:24 2006
From: anton at samba.org (Anton Blanchard)
Date: Fri, 2 Jun 2006 16:49:24 +1000
Subject: [openib-general] [PATCH] Fix ipathverbs compile
Message-ID: <20060602064924.GF1736@krispykreme>

Similar to libehca, I had to add a sysfs include to be able to compile it. Am I missing something or is this correct?

Anton
---

Index: src/ipathverbs.c
===================================================================
--- src/ipathverbs.c	(revision 7621)
+++ src/ipathverbs.c	(working copy)
@@ -41,6 +41,7 @@
 #include
 #include
 #include
+#include

 #include "ipathverbs.h"
From HNGUYEN at de.ibm.com Fri Jun 2 01:13:39 2006
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Fri, 2 Jun 2006 10:13:39 +0200
Subject: [openib-general] EHCA broken for 2.6.16?
In-Reply-To: <20060602015708.GE18223@scl.ameslab.gov>
Message-ID:

Hi Troy!

Please use kernel 2.6.17-rc1 or later instead: all the former hvcall defines like H_Success were uppercased (to H_SUCCESS) in 2.6.17-rc1, and our code has been built against that version. For details refer to this thread: http://patchwork.ozlabs.org/linuxppc/patch?id=4868

Thanks
Hoang-Nam Nguyen

openib-general-bounces at openib.org wrote on 02.06.2006 03:57:08:

> Okay guys, what's up this time?
>
> Kernel 2.6.16..
>
>   CC [M]  drivers/infiniband/hw/ehca/ehca_main.o
> In file included from drivers/infiniband/hw/ehca/ehca_qes.h:47,
>                  from drivers/infiniband/hw/ehca/ipz_pt_fn.h:46,
>                  from drivers/infiniband/hw/ehca/ehca_classes.h:46,
>                  from drivers/infiniband/hw/ehca/ehca_main.c:43:
> drivers/infiniband/hw/ehca/ehca_tools.h: In function 'ehca2ib_return_code':
> drivers/infiniband/hw/ehca/ehca_tools.h:404: error: 'H_SUCCESS'
> undeclared (first use in this function)
> drivers/infiniband/hw/ehca/ehca_tools.h:404: error: (Each undeclared
> identifier is reported only once
> drivers/infiniband/hw/ehca/ehca_tools.h:404: error: for each function it
> appears in.)
> drivers/infiniband/hw/ehca/ehca_tools.h:406: error: 'H_BUSY' undeclared
> (first use in this function)
> drivers/infiniband/hw/ehca/ehca_tools.h:408: error: 'H_NO_MEM'
> undeclared (first use in this function)
> drivers/infiniband/hw/ehca/ehca_main.c: In function
> 'ehca_sense_attributes':
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From halr at voltaire.com Fri Jun 2 03:29:36 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Jun 2006 06:29:36 -0400
Subject: [openib-general] Re: [PATCH TRIVIAL] opensm: fix comment in osm_matrix.h
In-Reply-To: <20060601174838.GA12872@sashak.voltaire.com>
References: <20060601174838.GA12872@sashak.voltaire.com>
Message-ID: <1149244174.4510.90851.camel@hal.voltaire.com>

On Thu, 2006-06-01 at 13:48, Sasha Khapyorsky wrote:
> This fixes the function description comment in osm_matrix.h
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied to trunk and 1.0 branch.

-- Hal

From HNGUYEN at de.ibm.com Fri Jun 2 03:41:24 2006
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Fri, 2 Jun 2006 12:41:24 +0200
Subject: [openib-general] [PATCH] Fix some compile issues with libehca
In-Reply-To: <20060602064346.GE1736@krispykreme>
Message-ID:

Hi,

Will incorporate those patches into our code. They should be correct for both the 64- and 32-bit versions of libehca. Thanks!
Hoang-Nam Nguyen

openib-general-bounces at openib.org wrote on 02.06.2006 08:43:46:

> Hi,
>
> Here's a patch to fix some warnings about missing prototypes (memset
> etc), and one compile error due to libsysfs not being included.
>
> This was also giving a warning when built as 32bit:
>
> 	my_cq->ipz_queue.queue = (u8*)resp.ipz_queue.queue;
>
> src/ehca_umain.c:239: warning: cast to pointer from integer of different size
>
> So cast it to a long first. Is that code correct for 32bit?
>
> Anton

From halr at voltaire.com Fri Jun 2 04:38:32 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Jun 2006 07:38:32 -0400
Subject: [openib-general] [PATCH] Some small fixes in osm_ucast_mgr.c
Message-ID: <1149248311.4510.92726.camel@hal.voltaire.com>

OpenSM/osm_ucast_mgr.c: Small cleanup in terms of dump file

Some small cleanup around removing the old dump file: replace CL_ASSERT() with a debug-unconditional check, and drop the redundant check before freeing.
Signed-off-by: Sasha Khapyorsky
Signed-off-by: Hal Rosenstock
---
 osm/opensm/osm_ucast_mgr.c |   15 ++++-----------
 1 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
index cb59a7b..6e0d6c6 100644
--- a/osm/opensm/osm_ucast_mgr.c
+++ b/osm/opensm/osm_ucast_mgr.c
@@ -1148,21 +1148,14 @@ osm_ucast_mgr_process(
     build and download the switch forwarding tables.
   */
 
-  /* initialize the fdb dump file: */
-  if( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) )
+  /* remove the old fdb dump file: */
+  if( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) && (file_name =
+      (char*)cl_malloc(strlen(p_mgr->p_subn->opt.dump_files_dir) + 10)) )
   {
-    file_name =
-      (char*)cl_malloc(strlen(p_mgr->p_subn->opt.dump_files_dir) + 10);
-
-    CL_ASSERT(file_name);
-
     strcpy(file_name, p_mgr->p_subn->opt.dump_files_dir);
     strcat(file_name, "/osm.fdbs");
-
     unlink(file_name);
-
-    if (file_name)
-      cl_free(file_name);
+    cl_free(file_name);
   }
 
   cl_qmap_apply_func( p_sw_guid_tbl,

From halr at voltaire.com Fri Jun 2 04:53:40 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Jun 2006 07:53:40 -0400
Subject: [openib-general] [PATCH] Some small fixes in osm_mcast_mgr.c
Message-ID: <1149248313.4510.92728.camel@hal.voltaire.com>

OpenSM/osm_mcast_mgr.c: Small cleanup in terms of dump file

Some small cleanup around removing the old dump file: replace CL_ASSERT() with a debug-unconditional check, and drop the redundant check before freeing.

Signed-off-by: Hal Rosenstock

Index: opensm/osm_mcast_mgr.c
===================================================================
--- opensm/osm_mcast_mgr.c	(revision 7614)
+++ opensm/osm_mcast_mgr.c	(working copy)
@@ -1466,18 +1466,17 @@ __unlink_mcast_fdb(IN osm_mcast_mgr_t* c
 {
   char *file_name = NULL;
 
+  /* remove the old fdb dump file: */
   file_name = (char*)cl_malloc(strlen(p_mgr->p_subn->opt.dump_files_dir) + 12);
 
-  CL_ASSERT(file_name);
-
-  strcpy(file_name, p_mgr->p_subn->opt.dump_files_dir);
-  strcat(file_name, "/osm.mcfdbs");
-
-  unlink(file_name);
-
-  if (file_name)
+  if( file_name )
+  {
+    strcpy(file_name, p_mgr->p_subn->opt.dump_files_dir);
+    strcat(file_name, "/osm.mcfdbs");
+    unlink(file_name);
     cl_free(file_name);
+  }
 }
 
 /**********************************************************************

From swise at opengridcomputing.com Fri Jun 2 06:57:44 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 02 Jun 2006 08:57:44 -0500
Subject: [openib-general] Re: [PATCH 1/2] iWARP Connection Manager.
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F150D3E6@NT-SJCA-0751.brcm.ad.broadcom.com>
References: <54AD0F12E08D1541B826BE97C98F99F150D3E6@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: <1149256664.791.3.camel@stevo-desktop>

> > The problem is that we can't synchronously cancel an
> > outstanding connect request. Once we've asked the adapter to
> > connect, we can't tell him to stop, we have to wait for it to
> > fail. During the time period between when we ask to connect
> > and the adapter says yeah-or-nay, the user hits ctrl-C. This
> > is the case where disconnect and/or destroy gets called and
> > we have to block it waiting for the outstanding connect
> > request to complete.
> >
> > One alternative to this approach is to do the kfree of the
> > cm_id in the deref logic.
> > This was the original design and
> > leaves the object around to handle the completion of the
> > connect and still allows the app to clean up and go away
> > without all this waitin' around. When the adapter finally
> > finishes and releases its reference, the object is kfree'd.
> >
> > Hope this helps.
>
> Why couldn't you synchronously put the cm_id in a state of
> "pending delete" and do the actual delete when the RNIC
> provides a response to the request?

This is Tom's "alternative" mentioned above. The provider already keeps an explicit reference on the cm_id while it might possibly deliver an event on that cm_id. So if you change deref to kfree the cm_id on its last deref (when the refcnt reaches 0), then you can avoid blocking during destroy...

> There could even be
> an optional method to see if the device is capable of
> cancelling the request. I know it can't yank a SYN back
> from the wire, but it could refrain from retransmitting.

I would suggest we don't add this optional method until we see an RNIC that supports cancelling a connect request or accept synchronously...

Steve.

From mashirle at us.ibm.com Fri Jun 2 04:08:14 2006
From: mashirle at us.ibm.com (Shirley Ma)
Date: Fri, 02 Jun 2006 04:08:14 -0700
Subject: [openib-general] [PATCH] Repost: IPoIB skb panic
Message-ID: <1149246495.8085.24.camel@ibm-khxoic5vfkn.beaverton.ibm.com>

Roland,

I posted the patch yesterday, but it seems it only went to the web site. I am reposting it here for you to review. Please let me know if there is any problem applying this patch.

There are two problems in path_free(), which caused a kernel skb panic during interface up/down stress testing:

1. path_free() should call dev_kfree_skb_any() (any context) instead of dev_kfree_skb_irq() (irq context), since it is called in process context.

2. path->queue should be protected by priv->lock, since there is a race between unicast_send_arp() and ipoib_flush_paths() to release the skb when bringing the interface down. It's safe to use priv->lock, because skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE, which is 3.
Signed-off-by: Shirley Ma

diff -urpN infiniband/ulp/ipoib/ipoib_main.c infiniband-skb/ulp/ipoib/ipoib_main.c
--- infiniband/ulp/ipoib/ipoib_main.c	2006-05-03 13:16:18.000000000 -0700
+++ infiniband-skb/ulp/ipoib/ipoib_main.c	2006-06-01 09:14:05.000000000 -0700
@@ -252,11 +252,11 @@ static void path_free(struct net_device
 	struct sk_buff *skb;
 	unsigned long flags;
 
-	while ((skb = __skb_dequeue(&path->queue)))
-		dev_kfree_skb_irq(skb);
-
 	spin_lock_irqsave(&priv->lock, flags);
 
+	while ((skb = __skb_dequeue(&path->queue)))
+		dev_kfree_skb_any(skb);
+
 	list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) {
 		/*
 		 * It's safe to call ipoib_put_ah() inside priv->lock

Thanks
Shirley Ma
IBM LTC

From sean.hefty at intel.com Fri Jun 2 11:43:22 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 2 Jun 2006 11:43:22 -0700
Subject: [openib-general] libmthca build issue
Message-ID:

I'm running into an issue trying to build libmthca. During the ./configure step, I get:

	checking size of long... configure: error: cannot compute sizeof (long), 77

Has anyone else run into this?

- Sean

From mshefty at ichips.intel.com Fri Jun 2 12:07:25 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 02 Jun 2006 12:07:25 -0700
Subject: [openib-general] libmthca build issue
In-Reply-To:
References:
Message-ID: <44808C6D.6030708@ichips.intel.com>

Sean Hefty wrote:
> I'm running into an issue trying to build libmthca.
>
> During the ./configure step, I get:
>
> checking size of long... configure: error: cannot compute sizeof (long), 77
>
> Has anyone else run into this?

Rebooting my system and rebuilding made this error go away.

- Sean

From swise at opengridcomputing.com Fri Jun 2 12:09:15 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 02 Jun 2006 14:09:15 -0500
Subject: [openib-general] libmthca build issue
In-Reply-To:
References:
Message-ID: <1149275355.11187.21.camel@stevo-desktop>

On Fri, 2006-06-02 at 11:43 -0700, Sean Hefty wrote:
> I'm running into an issue trying to build libmthca.
>
> During the ./configure step, I get:
>
> checking size of long... configure: error: cannot compute sizeof (long), 77
>
> Has anyone else run into this?
>
> - Sean

I just hit this too today. Inspecting the config log file revealed that it could find libibverbs.so. I ran ldconfig, then reran autogen and configure and it worked. Try that...

Stevo.
From swise at opengridcomputing.com Fri Jun 2 12:14:36 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 02 Jun 2006 14:14:36 -0500
Subject: [openib-general] libmthca build issue
In-Reply-To: <1149275355.11187.21.camel@stevo-desktop>
References: <1149275355.11187.21.camel@stevo-desktop>
Message-ID: <1149275676.11187.23.camel@stevo-desktop>

On Fri, 2006-06-02 at 14:09 -0500, Steve Wise wrote:
> On Fri, 2006-06-02 at 11:43 -0700, Sean Hefty wrote:
> > I'm running into an issue trying to build libmthca.
> >
> > During the ./configure step, I get:
> >
> > checking size of long... configure: error: cannot compute sizeof (long), 77
> >
> > Has anyone else run into this?
> >
> > - Sean
>
> I just hit this too today. Inspecting the config log file revealed that
> it could find libibverbs.so. I ran ldconfig, then reran autogen and
   ^^^^^^
Er, make that "could not"...

From sean.hefty at intel.com Fri Jun 2 12:13:37 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 2 Jun 2006 12:13:37 -0700
Subject: [openib-general] libmthca build issue
In-Reply-To: <1149275355.11187.21.camel@stevo-desktop>
Message-ID:

> I just hit this too today. Inspecting the config log file revealed that
> it could find libibverbs.so. I ran ldconfig, then reran autogen and
> configure and it worked. Try that...

Thanks - I'll try that next time.
From rdreier at cisco.com Fri Jun 2 13:02:33 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 02 Jun 2006 13:02:33 -0700
Subject: [openib-general] [PATCH 5/5] IB/ipath: Add client reregister event generation
In-Reply-To: <1149204468.16993.8.camel@localhost.localdomain> (Bryan O'Sullivan's message of "Thu, 01 Jun 2006 16:27:48 -0700")
References: <20060531223205.10506.51241.stgit@localhost.localdomain> <20060531223218.10506.76076.stgit@localhost.localdomain> <1149204468.16993.8.camel@localhost.localdomain>
Message-ID:

 Bryan> Please CC me on ipath driver patches, as I'm not guaranteed
 Bryan> to see them otherwise.
Sorry, I realized I forgot to do that and sent a heads-up as a reply to the patch email. BTW, you probably will want to update the entry in MAINTAINERS now that you are QLogic and not PathScale...

 - R.

From rdreier at cisco.com Fri Jun 2 13:09:07 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 02 Jun 2006 13:09:07 -0700
Subject: [openib-general] Re: [PATCH] Repost: IPoIB skb panic
In-Reply-To: <1149246495.8085.24.camel@ibm-khxoic5vfkn.beaverton.ibm.com> (Shirley Ma's message of "Fri, 02 Jun 2006 04:08:14 -0700")
References: <1149246495.8085.24.camel@ibm-khxoic5vfkn.beaverton.ibm.com>
Message-ID:

> 1. path_free() should call dev_kfree_skb_any() (any context) instead of
> dev_kfree_skb_irq() (irq context) since it is called in process
> context.

Agree -- although actually in the current code, plain dev_kfree_skb() would be fine. In fact, since your patch moves the free inside a spinlock, dev_kfree_skb_irq() would be correct.

> 2. path->queue should be protected by priv->lock since there is a race
> between unicast_send_arp() and ipoib_flush_paths() to release skb when
> bringing interface down. It's safe to use priv->lock, because
> skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE, which is 3.

I'm having a hard time understanding this race. path_free() should never be called on paths that are reachable via the list of paths or the rb-tree of paths, and unicast_send_arp() should never touch a path that is going to path_free(). Also, it seems that if there is a race here, then this fix is insufficient, because path_free() does a kfree() on the whole path structure, which would lead to a use-after-free if unicast_send_arp() might still touch it.

So could you diagram the race you are seeing? (i.e. what are the two different threads doing that causes a problem?)

Thanks,
  Roland

From rdreier at cisco.com Fri Jun 2 13:10:49 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 02 Jun 2006 13:10:49 -0700
Subject: [openib-general] [PATCH] Fix ipathverbs compile
In-Reply-To: <20060602064924.GF1736@krispykreme> (Anton Blanchard's message of "Fri, 2 Jun 2006 16:49:24 +1000")
References: <20060602064924.GF1736@krispykreme>
Message-ID:

 Anton> Similar to libehca, I had to add a sysfs include to be able
 Anton> to compile it. Am I missing something or is this correct?

The issue is that I changed the development libibverbs tree in svn to no longer use libsysfs, and libehca and libipathverbs are not yet updated to the new interface. So it is true that they won't compile against the development libibverbs tree without including the libsysfs header, but it's also true that just adding the header so they compile will lead to a driver library that doesn't work anyway.

So I think it's better to leave them not compiling until they are really fixed up.

 - R.

From rdreier at cisco.com Fri Jun 2 13:12:30 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 02 Jun 2006 13:12:30 -0700
Subject: [openib-general] Mellanox HCAs: outstanding RDMAs
In-Reply-To: <67897d690606011822j7b915876l57149508623c6c4f@mail.gmail.com> (Manpreet Singh's message of "Thu, 1 Jun 2006 18:22:53 -0700")
References: <67897d690606011822j7b915876l57149508623c6c4f@mail.gmail.com>
Message-ID:

 Manpreet> Mellanox HCA can handle has been configured at 4
 Manpreet> (mthca_main.c: default_profile: rdb_per_qp). And the
 Manpreet> HCAs can support a much higher value (128 I think).
 Manpreet> Could we move this value higher or at least make it
 Manpreet> configurable?

Leonid Arsh has a patch that I will integrate soon that makes this configurable. However, I'm curious: do you have a workload where this actually makes a measurable difference? It seems that having 4 RDMA requests outstanding on the wire should be enough to get things to pipeline pretty well.

If you haven't tested this, right now you can of course edit mthca_main.c to change the default value and recompile.

 - R.

From rjwalsh at pathscale.com Fri Jun 2 13:35:19 2006
From: rjwalsh at pathscale.com (Robert Walsh)
Date: Fri, 02 Jun 2006 13:35:19 -0700
Subject: [openib-general] [PATCH] Fix ipathverbs compile
In-Reply-To:
References: <20060602064924.GF1736@krispykreme>
Message-ID: <1149280519.13958.10.camel@hematite.pathscale.com>

On Fri, 2006-06-02 at 13:10 -0700, Roland Dreier wrote:
> So I think it's better to leave them not compiling until they are
> really fixed up.

We're in the middle of getting a new software release done here, and just haven't had the bandwidth to look at this yet. I'll get to it, hopefully by the middle of next week, and do the appropriate updates on the libipathverbs end.

Regards,
Robert.

--
Robert Walsh                        Email: rjwalsh at pathscale.com
PathScale, Inc.                     Phone: +1 650 934 8117
2071 Stierlin Court, Suite 200      Fax:   +1 650 428 1969
Mountain View, CA 94043.
Name: signature.asc Type: application/pgp-signature Size: 483 bytes Desc: This is a digitally signed message part URL:
From weiny2 at llnl.gov Fri Jun 2 15:03:28 2006 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 02 Jun 2006 15:03:28 -0700 Subject: [Fwd: [openib-general] [PATCH] ibv_*_pingpong examples : user option for pkey] In-Reply-To: <1149284656.4510.108332.camel@hal.voltaire.com> References: <1149284656.4510.108332.camel@hal.voltaire.com> Message-ID: <20060602150328.2bcd5e48.weiny2@llnl.gov> Hal, I changed the pkey_idx to pkey-idx per your comment. But other than that this is the same patch. Roland do I need to do something else? Thanks, Ira On Fri, 02 Jun 2006 17:44:38 -0400 Hal Rosenstock wrote: > Hey Ira, > > Roland didn't respond to this. You may want to resend this patch to > him and cc: openib-general. Does it need any updating due to other > changes in this ? > > -- Hal > > -----Forwarded Message----- > > From: Ira Weiny > To: openib-general at openib.org > Subject: [openib-general] [PATCH] ibv_*_pingpong examples : user > option for pkey Date: 26 May 2006 16:54:56 -0700 > > While testing the pkey features of opensm I added this patch to be > able to check out the use of different pkeys. > > Ira > > ---- > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- A non-text attachment was scrubbed... Name: pingpong-pkey-option.patch Type: application/octet-stream Size: 8496 bytes Desc: not available URL: From swise at opengridcomputing.com Fri Jun 2 15:03:52 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 02 Jun 2006 17:03:52 -0500 Subject: [openib-general] [ANNOUNCE] New iWARP Branch Message-ID: <1149285832.11187.33.camel@stevo-desktop> Hello, The gen2 iwarp branch has been merged up to the main trunk revision 7626.
The iwarp branch can be found at gen2/branches/iwarp and contains the Ammasso 1100 and Chelsio T3 drivers and user libs. If you are working on iwarp, please test out this new branch and lemme know if there are any problems. Thanks, Steve. From mashirle at us.ibm.com Fri Jun 2 08:25:13 2006 From: mashirle at us.ibm.com (Shirley Ma) Date: Fri, 02 Jun 2006 08:25:13 -0700 Subject: [openib-general] Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: References: <1149246495.8085.24.camel@ibm-khxoic5vfkn.beaverton.ibm.com> Message-ID: <1149261914.8085.58.camel@ibm-khxoic5vfkn.beaverton.ibm.com> Roland, More clarification: we saw two races here: 1. path_free() was called by both unicast_arp_send() and ipoib_flush_paths() in the same time. 0xc0000004bff0a0d0 3 1 1 0 R 0xc0000004bff0a580 *ksoftirqd/0 SP(esp) PC(eip) Function(args) 0xc00000000f707c80 0xc0000000003199d0 .skb_release_data +0x7c 0xc00000000f707c80 0xc000000000319688 (lr) .kfree_skbmem +0x20 0xc00000000f707d10 0xc000000000319688 .kfree_skbmem +0x20 0xc00000000f707da0 0xc0000000003197fc .__kfree_skb +0x148 0xc00000000f707e50 0xc00000000031e2a8 .net_tx_action +0xa4 0xc00000000f707f00 0xc00000000006ab38 .__do_softirq +0xa8 0xc00000000f707f90 0xc0000000000177b0 .call_do_softirq +0x14 0xc0000000cff83d90 0xc000000000012064 .do_softirq +0x90 0xc0000000cff83e20 0xc00000000006b0fc .ksoftirqd +0xfc 0xc0000000cff83ed0 0xc000000000081d74 .kthread +0x17c 0xc0000000cff83f90 0xc000000000017d24 .kernel_thread +0x4c KERNEL: assertion (!atomic_read(&skb->users)) failed at net/core/dev.c 2. during unicast arp skb retransmission, unicast_arp_send() appended the skb on the list, while ipoib_flush_paths() calling path_free() to free the same skb from the list. <3>KERNEL: assertion (!atomic_read(&skb->users)) failed at net/core/dev.c (1742) <4>Warning: kfree_skb passed an skb still on a list (from c00000000031e2a8). <2>kernel BUG in __kfree_skb at net/core/skbuff.c:225! 
(sles9 sp3 kernel) void __kfree_skb(struct sk_buff *skb) { if (skb->list) { printk(KERN_WARNING "Warning: kfree_skb passed an skb still " "on a list (from %p).\n", NET_CALLER(skb)); BUG(); } The patch will fix both problems by using priv->lock to protect path->queue list. Am I right? Thanks Shirley Ma IBM LTC From rdreier at cisco.com Fri Jun 2 16:15:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 02 Jun 2006 16:15:28 -0700 Subject: [openib-general] Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: <1149261914.8085.58.camel@ibm-khxoic5vfkn.beaverton.ibm.com> (Shirley Ma's message of "Fri, 02 Jun 2006 08:25:13 -0700") References: <1149246495.8085.24.camel@ibm-khxoic5vfkn.beaverton.ibm.com> <1149261914.8085.58.camel@ibm-khxoic5vfkn.beaverton.ibm.com> Message-ID: > 2. during unicast arp skb retransmission, unicast_arp_send() appended > the skb on the list, while ipoib_flush_paths() calling path_free() to > free the same skb from the list. I think I see what's going on. the skb ends up being on two lists at once I guess... - R. From rdreier at cisco.com Fri Jun 2 16:16:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 02 Jun 2006 16:16:28 -0700 Subject: [Fwd: [openib-general] [PATCH] ibv_*_pingpong examples : user option for pkey] In-Reply-To: <20060602150328.2bcd5e48.weiny2@llnl.gov> (Ira Weiny's message of "Fri, 02 Jun 2006 15:03:28 -0700") References: <1149284656.4510.108332.camel@hal.voltaire.com> <20060602150328.2bcd5e48.weiny2@llnl.gov> Message-ID: Ira> Hal, I changed the pkey_idx to pkey-idx per your comment. Ira> But other than that this is the same patch. Ira> Roland do I need to do something else? Sorry, I didn't see it the first time around. I'll take a look at it. - R. 
From mashirle at us.ibm.com Fri Jun 2 10:02:49 2006 From: mashirle at us.ibm.com (Shirley Ma) Date: Fri, 02 Jun 2006 10:02:49 -0700 Subject: [openib-general] Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: References: <1149246495.8085.24.camel@ibm-khxoic5vfkn.beaverton.ibm.com> <1149261914.8085.58.camel@ibm-khxoic5vfkn.beaverton.ibm.com> Message-ID: <1149267773.8085.68.camel@ibm-khxoic5vfkn.beaverton.ibm.com> On Fri, 2006-06-02 at 16:15 -0700, Roland Dreier wrote: > > 2. during unicast arp skb retransmission, unicast_arp_send() appended > > the skb on the list, while ipoib_flush_paths() calling path_free() to > > free the same skb from the list. > > I think I see what's going on. the skb ends up being on two lists at > once I guess... > > - R. The skb has only one prev and one next pointer, so it can only be on one list at a time. How could the skb go on two lists at once? Thanks Shirley From somenath at veritas.com Fri Jun 2 18:07:07 2006 From: somenath at veritas.com (somenath) Date: Fri, 02 Jun 2006 18:07:07 -0700 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs In-Reply-To: References: <67897d690606011822j7b915876l57149508623c6c4f@mail.gmail.com> Message-ID: <4480E0BB.5070707@veritas.com> What happens if one tries to do RDMA (say write for example) higher than 4 (or 128 in changed case)? does it just wait till the previous operation is completed? I don't remember seeing any error ....it was only limited by the send Q-depth, which can have a much larger value. thanks, som. Roland Dreier wrote: > Manpreet> Mellanox HCA can handle has been configured at 4 > Manpreet> (mthca_main.c: default_profile: rdb_per_qp). And the > Manpreet> HCAs can support a much higher value (128 I think). > > Manpreet> Could we move this value higher or atleast make it > Manpreet> configurable? > >Leonid Arsh has a patch that I will integrate soon that makes this >configurable. > >However, I'm curious. Do you have a workload where this actually >makes a measurable difference?
It seems that having 4 RDMA requests >outstanding on the wire should be enough to get things to pipeline >pretty well. > >If you haven't tested this, right now you can of course edit >mthca_main.c to change the default value and recompile. > > - R. >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From rdreier at cisco.com Fri Jun 2 18:11:57 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 02 Jun 2006 18:11:57 -0700 Subject: [openib-general] Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: <1149267773.8085.68.camel@ibm-khxoic5vfkn.beaverton.ibm.com> (Shirley Ma's message of "Fri, 02 Jun 2006 10:02:49 -0700") References: <1149246495.8085.24.camel@ibm-khxoic5vfkn.beaverton.ibm.com> <1149261914.8085.58.camel@ibm-khxoic5vfkn.beaverton.ibm.com> <1149267773.8085.68.camel@ibm-khxoic5vfkn.beaverton.ibm.com> Message-ID: > The skb has only one prev, one next pointers, it can only be on one list > at a time. How could skb go on two lists at once? Good question. Actually I was wrong about understanding things before. I don't see any way that path_free() and unicast_arp_send() can be operating on the same struct ipoib_path at the same time. And I don't see how unicast_arp_send() could be handling an skb that's already queued in a path's queue. path_free() only gets called from ipoib_flush_paths() after the path has been removed from the list of paths and the rb_tree of paths (both protected by priv->lock), so unicast_arp_send() wouldn't find the path to queue an skb. And ipoib_flush_paths() can't find a new path created by unicast_arp_send(). Obviously I'm missing something but I still don't see the real cause of your crash. - R.
From rdreier at cisco.com Fri Jun 2 18:23:24 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 02 Jun 2006 18:23:24 -0700 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs In-Reply-To: <4480E0BB.5070707@veritas.com> (somenath@veritas.com's message of "Fri, 02 Jun 2006 18:07:07 -0700") References: <67897d690606011822j7b915876l57149508623c6c4f@mail.gmail.com> <4480E0BB.5070707@veritas.com> Message-ID: > What happens if one tries to do RDMA (say write for example) higher than > 4 (or 128 in changed case)? does it just wait till previos operation > is completed? > I don't remember seeing any error ....it was only > limited by the send Q-depth which can go much larger value. Yes, the limit of outstanding RDMAs is not related to the send queue depth. Of course you can post many more than 4 RDMAs to a send queue -- the HCA just won't have more than 4 requests outstanding at a time. From trimmer at silverstorm.com Sat Jun 3 07:03:07 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Sat, 3 Jun 2006 10:03:07 -0400 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs Message-ID: > > What happens if one tries to do RDMA (say write for example) higher > than > > 4 (or 128 in changed case)? does it just wait till previos operation > > is completed? > > I don't remember seeing any error ....it was only > > limited by the send Q-depth which can go much larger value. > > Yes, the limit of outstanding RDMAs is not related to the send queue > depth. Of course you can post many more than 4 RDMAs to a send queue > -- the HCA just won't have more than 4 requests outstanding at a time. To further clarify, this parameter only affects the number of concurrent outstanding RDMA Reads which the HCA will process. Once it hits this limit, the send Q will stall waiting for issued reads to complete prior to initiating new reads. It does not affect RDMA Writes.
It is very analogous to the outstanding-reads parameters in PCI-X and PCIe (although this parameter is independent of those). The IB spec defines ordering rules for RDMA Reads and Writes. The number of outstanding RDMA Reads is negotiated by the CM during connection establishment and the QP which is sending the RDMA Read must have a value configured for this parameter which is <= the remote end's capability. In previous testing by Mellanox on SDR HCAs they indicated values beyond 2-4 did not improve performance (and in fact required more RDMA resources be allocated for the corresponding QP or HCA). Hence I suspect a very large value like 128 would offer no improvement over values in the 2-8 range. Todd Rimmer
From anton at samba.org Sat Jun 3 17:05:35 2006 From: anton at samba.org (Anton Blanchard) Date: Sun, 4 Jun 2006 10:05:35 +1000 Subject: [openib-general] [PATCH] Fix ipathverbs compile In-Reply-To: <1149280519.13958.10.camel@hematite.pathscale.com> References: <20060602064924.GF1736@krispykreme> <1149280519.13958.10.camel@hematite.pathscale.com> Message-ID: <20060604000535.GA986@krispykreme> Hi, > > The issue is that I changed the development libibverbs tree in svn to > > no longer use libsysfs, and libehca and libipathverbs are not updated > > to the new interface yet. So it is true that they won't compile > > against the development libibverbs tree without including the libsysfs > > header, but it's also true that just adding the header so they compile > > will lead to a driver library that doesn't work anyway. > > > > So I think it's better to leave them not compiling until they are > > really fixed up. > > We're in the middle of getting a new software release done here, and > just haven't had the bandwidth to look at this yet. I'll get around to > it hopefully by the middle of next week and do the appropriate updates > from the libipathverbs end. Thanks for the explanation, makes sense :) Anton From anton at samba.org Sat Jun 3 17:22:00 2006 From: anton at samba.org (Anton Blanchard) Date: Sun, 4 Jun 2006 10:22:00 +1000 Subject: [openib-general] Fix some suspicious ppc64 code in dapl Message-ID: <20060604002200.GB986@krispykreme> Hi, I was reading through the ppc64 specific code in dapl/ and noticed some suspicious inline assembly.
- EIEIO_ON_SMP and ISYNC_ON_SMP are in-kernel UP build optimisations; we shouldn't export them to userspace. Replace them with lwsync and isync. - The comment says it's implementing cmpxchg64 but in fact it's implementing cmpxchg32. Fix the comment. Index: dapl/udapl/linux/dapl_osd.h =================================================================== --- dapl/udapl/linux/dapl_osd.h (revision 7621) +++ dapl/udapl/linux/dapl_osd.h (working copy) @@ -238,14 +238,13 @@ #endif /* __ia64__ */ #elif defined(__PPC64__) __asm__ __volatile__ ( - EIEIO_ON_SMP -"1: lwarx %0,0,%2 # __cmpxchg_u64\n\ - cmpd 0,%0,%3\n\ +" lwsync\n\ +1: lwarx %0,0,%2 # __cmpxchg_u32\n\ + cmpw 0,%0,%3\n\ bne- 2f\n\ stwcx. %4,0,%2\n\ - bne- 1b" - ISYNC_ON_SMP - "\n\ + bne- 1b\n\ + isync\n\ 2:" : "=&r" (current_value), "=m" (*v) : "r" (v), "r" (match_value), "r" (new_value), "m" (*v) From tziporet at mellanox.co.il Sun Jun 4 00:26:46 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 04 Jun 2006 10:26:46 +0300 Subject: [openib-general] Re: OFED RC6 Tag In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0007D8CE2B@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0007D8CE2B@orsmsx408> Message-ID: <44828B36.5010302@mellanox.co.il> Woodruff, Robert J wrote: > Hi, > > I noticed that you now have a rc6 tag for the OFED kernel code. > Is there a tag for the userspace code ? or what SVN rev will be used > for RC6. > > woody > > There is no tag for the user level code since it is taken directly from the HEAD of the branch. In the release BUILD ID you can see the svn revision of the user level. Tziporet From mst at mellanox.co.il Sun Jun 4 00:42:35 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 4 Jun 2006 10:42:35 +0300 Subject: [openib-general] Re: [PATCH] RFC: use stdint.h types In-Reply-To: References: <20060531085029.GP21266@mellanox.co.il> Message-ID: <20060604074235.GV21266@mellanox.co.il> Quoting r.
Roland Dreier : > Subject: Re: [PATCH] RFC: use stdint.h types > > My initial reaction is that I don't like this, since it makes it > harder to keep the kernel ABI files in sync between libraries and the > kernel. How about a perl script? #!/usr/bin/perl -pi s/\b__u(64|32|16|8)\b/uint$1_t/; s/\b__s(64|32|16|8)\b/int$1_t/; and back #!/usr/bin/perl -pi s/\buint(64|32|16|8)_t\b/__u$1/; s/\bint(64|32|16|8)_t\b/__s$1/; > Does overriding offsetof() really cause any problems? Dunno, but I'm worried there's some subtle reason gcc 4.0 defines it by means of __builtin_offsetof rather than the traditional way. > Does including <stdint.h> break anything? Well, we are using an undocumented interface, and so it does make things fragile; take the compilation problem on sles10 as an example. -- MST From ishai at mellanox.co.il Sun Jun 4 02:43:22 2006 From: ishai at mellanox.co.il (Ishai Rabinovitz) Date: Sun, 4 Jun 2006 12:43:22 +0300 Subject: [openib-general] SRP: [PATCH] Misc cleanups in ib_srp Message-ID: <20060604094322.GA9091@mellanox.co.il> Hi, Misc cleanups in ib_srp. Please consider for 2.6.18. 1) I think that it is more efficient to move the req entries from req_list to free_list in srp_reconnect_target (rather than rebuild the free_list). (In any case this code is shorter). 2) This allows us to reuse code in srp_reset_device and srp_reconnect_target and call a new function srp_reset_req. 3) We can use list_move_tail in srp_remove_req.
Signed-off-by: Ishai Rabinovitz Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-05-19 11:14:35.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-05-21 17:41:25.000000000 +0300 @@ -451,14 +451,26 @@ static void srp_unmap_data(struct scsi_c scmnd->sc_data_direction); } +static void srp_remove_req(struct srp_target_port *target, struct srp_request *req) +{ + srp_unmap_data(req->scmnd, target, req); + list_move_tail(&req->list, &target->free_reqs); +} + +static void srp_reset_req(struct srp_target_port *target, struct srp_request *req) +{ + req->scmnd->result = DID_RESET << 16; + req->scmnd->scsi_done(req->scmnd); + srp_remove_req(target, req); +} + static int srp_reconnect_target(struct srp_target_port *target) { struct ib_cm_id *new_cm_id; struct ib_qp_attr qp_attr; - struct srp_request *req; + struct srp_request *req, *tmp; struct ib_wc wc; int ret; - int i; spin_lock_irq(target->scsi_host->host_lock); if (target->state != SRP_TARGET_LIVE) { @@ -494,19 +506,12 @@ static int srp_reconnect_target(struct s while (ib_poll_cq(target->cq, 1, &wc) > 0) ; /* nothing */ - list_for_each_entry(req, &target->req_queue, list) { - req->scmnd->result = DID_RESET << 16; - req->scmnd->scsi_done(req->scmnd); - srp_unmap_data(req->scmnd, target, req); - } + list_for_each_entry_safe(req, tmp, &target->req_queue, list) + srp_reset_req(target, req); target->rx_head = 0; target->tx_head = 0; target->tx_tail = 0; - INIT_LIST_HEAD(&target->free_reqs); - INIT_LIST_HEAD(&target->req_queue); - for (i = 0; i < SRP_SQ_SIZE; ++i) - list_add_tail(&target->req_ring[i].list, &target->free_reqs); ret = srp_connect_target(target); if (ret) @@ -706,13 +711,6 @@ static int srp_map_data(struct scsi_cmnd return len; } -static void srp_remove_req(struct srp_target_port *target, struct srp_request *req) -{ - srp_unmap_data(req->scmnd, target, req); - 
list_del(&req->list); - list_add_tail(&req->list, &target->free_reqs); -} - static void srp_process_rsp(struct srp_target_port *target, struct srp_rsp *rsp) { struct srp_request *req; @@ -1349,11 +1347,8 @@ static int srp_reset_device(struct scsi_ spin_lock_irq(target->scsi_host->host_lock); list_for_each_entry_safe(req, tmp, &target->req_queue, list) - if (req->scmnd->device == scmnd->device) { - req->scmnd->result = DID_RESET << 16; - req->scmnd->scsi_done(req->scmnd); - srp_remove_req(target, req); - } + if (req->scmnd->device == scmnd->device) + srp_reset_req(target, req); spin_unlock_irq(target->scsi_host->host_lock); -- Ishai Rabinovitz From mst at mellanox.co.il Sun Jun 4 04:16:15 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 4 Jun 2006 14:16:15 +0300 Subject: [openib-general] Re: [PATCH] ipoib: fix ah leak at interface down In-Reply-To: References: <200605281547.29313.eli@mellanox.co.il> <20060529151547.GO21266@mellanox.co.il> Message-ID: <20060604111615.GC21266@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] ipoib: fix ah leak at interface down > > Michael> If this makes sense, please push into 2.6.17. > > Yes, looks OK for 2.6.17. Out of curiousity: > > Michael> This might result in leaks (we see ah leaks which we > Michael> think can be attributed to this bug) as new packets get > Michael> posted while the interface is going down. > > with this patch applied, do the leaks go away? We've just got a confirmation from customer that this patch really fixes the AH leak. Please ask Linus to pull it into 2.6.17. -- MST From jackm at mellanox.co.il Sun Jun 4 07:10:04 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Sun, 4 Jun 2006 17:10:04 +0300 Subject: [openib-general] Removing mpi subtree from ofed branch Message-ID: <200606041710.04801.jackm@mellanox.co.il> I would like to remove the userspace mpi subtree from the ofed branch (https://openib.org/svn/gen2/branches/1.0/src/userspace). 
MPI is supplied in ofed as a separate package, which is not taken from the ofed branch. The presence of the mpi directory in the ofed branch is therefore misleading. If no one objects, I'll delete the mpi subtree from the ofed branch in a week (June 11). - Jack From eli at mellanox.co.il Sun Jun 4 07:17:08 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Sun, 04 Jun 2006 17:17:08 +0300 Subject: [openib-general] Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: <1149261914.8085.58.camel@ibm-khxoic5vfkn.beaverton.ibm.com> References: <1149246495.8085.24.camel@ibm-khxoic5vfkn.beaverton.ibm.com> <1149261914.8085.58.camel@ibm-khxoic5vfkn.beaverton.ibm.com> Message-ID: <1149430628.6779.14.camel@mtls03.yok.mtl.com> > More clarification: we saw two races here: > 1. path_free() was called by both unicast_arp_send() and > ipoib_flush_paths() in the same time. It is not possible to call path_free() on the same object from both unicast_arp_send() and ipoib_flush_paths(). This is because unicast_arp_send() calls it only for newly created objects for which path_rec_create() failed, in which case the object was never inserted into the list or the rb_tree. > 2. during unicast arp skb retransmission, unicast_arp_send() appended > the skb on the list, while ipoib_flush_paths() calling path_free() to > free the same skb from the list. I don't see any issue here either. Can you reproduce the crash? If you do, can you send how? From sweitzen at cisco.com Sun Jun 4 09:59:07 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Sun, 4 Jun 2006 09:59:07 -0700 Subject: [openib-general] Removing mpi subtree from ofed branch Message-ID: > I would like to remove the userspace mpi subtree from the ofed branch > (https://openib.org/svn/gen2/branches/1.0/src/userspace). > > MPI is supplied in ofed as a separate package, which is not > taken from the > ofed branch. The presence of the mpi directory in the ofed branch is > therefore misleading.
So why don't we put the OFED MVAPICH MPI source in the branch then? It is also kinda confusing that the OFED MVAPICH is a tarball and not in subversion, given that it is based off the code that is in subversion. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems From xma at us.ibm.com Sun Jun 4 10:49:36 2006 From: xma at us.ibm.com (Shirley Ma) Date: Sun, 4 Jun 2006 10:49:36 -0700 Subject: [openib-general] Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: <1149430628.6779.14.camel@mtls03.yok.mtl.com> Message-ID: Ohmm. That's a myth. So this problem is hardware-independent, right? It's not easy to reproduce it. ifconfig up and down stress test could hit this problem occasionally. thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From narravul at cse.ohio-state.edu Sun Jun 4 21:43:12 2006 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Mon, 5 Jun 2006 00:43:12 -0400 (EDT) Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: <1149285832.11187.33.camel@stevo-desktop> References: <1149285832.11187.33.camel@stevo-desktop> Message-ID: Hi Steve, We are trying the new iwarp branch on ammasso adapters. The installation has gone fine. However, on running rping there is an error during the disconnect phase. $ rping -c -vV -C4 -S4 -a 150.10.108.100 -p 9999 libibverbs: Warning: no userspace device-specific driver found for uverbs1 driver search path: /usr/local/lib/infiniband libibverbs: Warning: no userspace device-specific driver found for uverbs0 driver search path: /usr/local/lib/infiniband ping data: rdm ping data: rdm ping data: rdm ping data: rdm cq completion failed status 5 DISCONNECT EVENT... *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** Aborted There are no apparent errors showing up in dmesg. Is this error currently expected?
Thanks, --Sundeep. On Fri, 2 Jun 2006, Steve Wise wrote: > Hello, > > The gen2 iwarp branch has been merged up to the main trunk revision > 7626. The iwarp branch can be found at gen2/branches/iwarp and > contains the Ammasso 1100 and Chelsio T3 drivers and user libs. > > If you are working on iwarp, please test out this new branch and lemme > know if there are any problems. > > > Thanks, > > Steve. > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From k_mahesh85 at yahoo.co.in Sun Jun 4 22:39:56 2006 From: k_mahesh85 at yahoo.co.in (keshetti mahesh) Date: Mon, 5 Jun 2006 06:39:56 +0100 (BST) Subject: [openib-general] problem with memory registration-RDMA kernel utility Message-ID: <20060605053956.81862.qmail@web8313.mail.in.yahoo.com> i am trying to develop a kernel utility to perform RDMA read/write operations. i am facing a problem with memory registration in it. my code looks like......... u64 *addr_array; addr_array = kmalloc(sizeof(u64),GFP_KERNEL); //i am using only one page buffer test->mem = kmalloc(4096,GFP_KERNEL); // buffer on which RDMA_READ is to be performed test->fmr = ib_alloc_fmr(test->pd,IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE, fmr_attr); //fmr_attr is initialised properly addr_array[0] = virt_to_phys(test->mem) ; ret = ib_map_phys_fmr(test->fmr,addr_array[0],1,(u64)test->mem); All these operations are not generating any errors But when i pass this address (addr_array[0]) as the remote address, the RDMA_READ operation on this address is generating IB_WC_REM_ACCESS_ERROR completion. am i missing anything in the process of registering the memory????? Thanks n regards K.Mahesh
Link: http://in.mobile.yahoo.com/new/messenger/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From dotanb at mellanox.co.il Sun Jun 4 22:59:43 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 5 Jun 2006 08:59:43 +0300 Subject: [openib-general] problem with memory registration-RDMA kernel utliity In-Reply-To: <20060605053956.81862.qmail@web8313.mail.in.yahoo.com> References: <20060605053956.81862.qmail@web8313.mail.in.yahoo.com> Message-ID: <200606050859.44108.dotanb@mellanox.co.il> Hi. > All these operations are not generating any errors > But when i pass this address (addr_array[0]) as the remote address, the RDMA_READ operation on this address is generating IB_WC_REM_ACCESS_ERROR completion. > > am i missing anything in the process of registering the memory????? 1) Did you enable RDMA_READ + RDMA_WRITE in the modify QP (qp_access_flags) on the responder side? 2) Do you have more than one PD (the QP and MR PDs should be the same)? 3) You should check that the address + rkey that the requestor side uses are the right values. Dotan From mst at mellanox.co.il Sun Jun 4 23:39:23 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 5 Jun 2006 09:39:23 +0300 Subject: [openib-general] Re: problem with memory registration-RDMA kernel utliity In-Reply-To: <20060605053956.81862.qmail@web8313.mail.in.yahoo.com> References: <20060605053956.81862.qmail@web8313.mail.in.yahoo.com> Message-ID: <20060605063923.GI21266@mellanox.co.il> Quoting r. keshetti mahesh : > addr_array[0] = virt_to_phys(test->mem) ; Not related to your problem, but you really should be using the DMA API to get the DMA address and pass that to memory registration verbs. -- MST From mst at mellanox.co.il Mon Jun 5 01:11:37 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 5 Jun 2006 11:11:37 +0300 Subject: [openib-general] Re: Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: References: <1149430628.6779.14.camel@mtls03.yok.mtl.com> Message-ID: <20060605081136.GJ21266@mellanox.co.il> Quoting r. Shirley Ma : > Subject: Re: Re: [PATCH]Repost: IPoIB skb panic > > > Ohmm. That's a myth. So this problem is hardware independent, right? > It's not easy to reproduce it. ifconfig up and down stress test could hit this problem occasionally. Could be the same problem Eli's recent patch fixed. http://www.mail-archive.com/openib-general at openib.org/msg20894.html Please try with that applied. -- MST From eitan at mellanox.co.il Mon Jun 5 02:36:46 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 05 Jun 2006 12:36:46 +0300 Subject: [openib-general] [PATCH] osm: segfault fix in osm_get_gid_by_mad_addr Message-ID: <86lksceyfl.fsf@mtl066.yok.mtl.com> Hi Hal I got a report regarding crashes in osm_get_gid_by_mad_addr. It was missing a check on p_port looked up by LID. The affected flows are reports and multicast joins. The fix modified the function to return status (instead of GID). I did run some simulation flows after the fix but please double check before commit. Eitan Signed-off-by: Eitan Zahavi Index: include/opensm/osm_subnet.h =================================================================== --- include/opensm/osm_subnet.h (revision 7542) +++ include/opensm/osm_subnet.h (working copy) @@ -770,11 +770,12 @@ struct _osm_port; * * SYNOPSIS */ -ib_gid_t +ib_api_status_t osm_get_gid_by_mad_addr( IN struct _osm_log *p_log, IN const osm_subn_t *p_subn, - IN const struct _osm_mad_addr *p_mad_addr ); + IN const struct _osm_mad_addr *p_mad_addr, + OUT ib_gid_t *p_gid); /* * PARAMETERS * p_log @@ -786,8 +787,11 @@ osm_get_gid_by_mad_addr( * p_mad_addr * [in] Pointer to mad address object. * +* p_gid +* [out] Pointer to teh GID structure to fill in. +* * RETURN VALUES -* Requestor gid object if found. Null otherwise. 
+* IB_SUCCESS if was able to find the GID by address given * * NOTES * Index: opensm/osm_subnet.c =================================================================== --- opensm/osm_subnet.c (revision 7670) +++ opensm/osm_subnet.c (working copy) @@ -236,16 +236,24 @@ osm_subn_init( /********************************************************************** **********************************************************************/ -ib_gid_t +ib_api_status_t osm_get_gid_by_mad_addr( IN osm_log_t* p_log, IN const osm_subn_t *p_subn, - IN const osm_mad_addr_t *p_mad_addr ) + IN const osm_mad_addr_t *p_mad_addr, + OUT ib_gid_t *p_gid) { const cl_ptr_vector_t* p_tbl; const osm_port_t* p_port = NULL; const osm_physp_t* p_physp = NULL; - ib_gid_t request_gid; + + if ( p_gid == NULL ) + { + osm_log( p_log, OSM_LOG_ERROR, + "osm_get_gid_by_mad_addr: ERR 7505 " + "Provided output GID is NULL\n"); + return(IB_INVALID_PARAMETER); + } /* Find the port gid of the request in the subnet */ p_tbl = &p_subn->port_lid_tbl; @@ -256,9 +264,18 @@ osm_get_gid_by_mad_addr( cl_ntoh16(p_mad_addr->dest_lid)) { p_port = cl_ptr_vector_get( p_tbl, cl_ntoh16(p_mad_addr->dest_lid) ); + if ( p_port == NULL ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "osm_get_gid_by_mad_addr: " + "Did not find any port with LID: 0x%X\n", + cl_ntoh16(p_mad_addr->dest_lid) + ); + return(IB_INVALID_PARAMETER); + } p_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num); - request_gid.unicast.interface_id = p_physp->port_guid; - request_gid.unicast.prefix = p_subn->opt.subnet_prefix; + p_gid->unicast.interface_id = p_physp->port_guid; + p_gid->unicast.prefix = p_subn->opt.subnet_prefix; } else { @@ -270,7 +287,7 @@ osm_get_gid_by_mad_addr( ); } - return request_gid; + return( IB_SUCCESS ); } /********************************************************************** Index: opensm/osm_sa_informinfo.c =================================================================== --- opensm/osm_sa_informinfo.c (revision 7670) +++ 
opensm/osm_sa_informinfo.c (working copy) @@ -348,6 +348,7 @@ osm_infr_rcv_process_set_method( uint8_t subscribe; ib_net32_t qpn; uint8_t resp_time_val; + ib_api_status_t res; OSM_LOG_ENTER( p_rcv->p_log, osm_infr_rcv_process_set_method ); @@ -382,8 +383,24 @@ osm_infr_rcv_process_set_method( inform_info_rec.inform_record.subscriber_enum = 0; /* update the subscriber GID according to mad address */ - inform_info_rec.inform_record.subscriber_gid = - osm_get_gid_by_mad_addr( p_rcv->p_log, p_rcv->p_subn, &p_madw->mad_addr ); + res = osm_get_gid_by_mad_addr( + p_rcv->p_log, + p_rcv->p_subn, + &p_madw->mad_addr, + &inform_info_rec.inform_record.subscriber_gid); + if ( res != NULL ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_infr_rcv_process_set_method: ERR 4308 " + "Got Subscribe Request from unknown LID: 0x%04X\n", + cl_ntoh16(p_madw->mad_addr.dest_lid) + ); + osm_sa_send_error( + p_rcv->p_resp, + p_madw, + IB_SA_MAD_STATUS_REQ_INVALID); + goto Exit; + } /* * MODIFICATIONS DONE ON INCOMING REQUEST: Index: opensm/osm_sa_mcmember_record.c =================================================================== --- opensm/osm_sa_mcmember_record.c (revision 7670) +++ opensm/osm_sa_mcmember_record.c (working copy) @@ -437,12 +437,21 @@ __add_new_mgrp_port( { boolean_t proxy_join; ib_gid_t requester_gid; + ib_api_status_t res; /* set the proxy_join if the requester gid is not identical to the joined gid */ - requester_gid = osm_get_gid_by_mad_addr( p_rcv->p_log, + res = osm_get_gid_by_mad_addr( p_rcv->p_log, p_rcv->p_subn, - p_mad_addr ); + p_mad_addr, &requester_gid ); + if ( res != IB_SUCCESS ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__add_new_mgrp_port: ERR 1B22: " + "Could not find GUID for requestor.\n" ); + + return IB_INVALID_PARAMETER; + } if (!memcmp(&p_recvd_mcmember_rec->port_gid, &requester_gid, sizeof(ib_gid_t))) @@ -755,6 +764,7 @@ __validate_modify(IN osm_mcmr_recv_t* co ib_net64_t portguid; ib_gid_t request_gid; osm_physp_t* p_request_physp; + 
ib_api_status_t res; portguid = p_recvd_mcmember_rec->port_gid.unicast.interface_id; @@ -775,9 +785,19 @@ __validate_modify(IN osm_mcmr_recv_t* co { /* The proxy_join is not set. Modifying can by done only if the requester GID == PortGID */ - request_gid = osm_get_gid_by_mad_addr(p_rcv->p_log, + res = osm_get_gid_by_mad_addr(p_rcv->p_log, p_rcv->p_subn, - p_mad_addr ); + p_mad_addr, + &request_gid); + + if ( res != IB_SUCCESS ) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__validate_modify: " + "Could not find any port by given request address.\n" + ); + return FALSE; + } if (memcmp(&((*pp_mcm_port)->port_gid), &request_gid, sizeof(ib_gid_t))) { From halr at voltaire.com Mon Jun 5 03:07:57 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jun 2006 06:07:57 -0400 Subject: [openib-general] Re: [PATCH] osm: segfault fix in osm_get_gid_by_mad_addr In-Reply-To: <86lksceyfl.fsf@mtl066.yok.mtl.com> References: <86lksceyfl.fsf@mtl066.yok.mtl.com> Message-ID: <1149502076.4510.202028.camel@hal.voltaire.com> Hi Eitan, On Mon, 2006-06-05 at 05:36, Eitan Zahavi wrote: > Hi Hal > > I got a report regarding crashes in osm_get_gid_by_mad_addr. > It was missing a check on p_port looked up by LID. The affected > flows are reports and multicast joins. > > The fix modified the function to return status (instead of GID). > I did run some simulation flows after the fix but please double > check before commit. See comments below. 
> Eitan > > Signed-off-by: Eitan Zahavi > > Index: include/opensm/osm_subnet.h > =================================================================== > --- include/opensm/osm_subnet.h (revision 7542) > +++ include/opensm/osm_subnet.h (working copy) > @@ -770,11 +770,12 @@ struct _osm_port; > * > * SYNOPSIS > */ > -ib_gid_t > +ib_api_status_t > osm_get_gid_by_mad_addr( > IN struct _osm_log *p_log, > IN const osm_subn_t *p_subn, > - IN const struct _osm_mad_addr *p_mad_addr ); > + IN const struct _osm_mad_addr *p_mad_addr, > + OUT ib_gid_t *p_gid); > /* > * PARAMETERS > * p_log > @@ -786,8 +787,11 @@ osm_get_gid_by_mad_addr( > * p_mad_addr > * [in] Pointer to mad address object. > * > +* p_gid > +* [out] Pointer to teh GID structure to fill in. > +* > * RETURN VALUES > -* Requestor gid object if found. Null otherwise. > +* IB_SUCCESS if was able to find the GID by address given > * > * NOTES > * > Index: opensm/osm_subnet.c > =================================================================== > --- opensm/osm_subnet.c (revision 7670) > +++ opensm/osm_subnet.c (working copy) > @@ -236,16 +236,24 @@ osm_subn_init( > > /********************************************************************** > **********************************************************************/ > -ib_gid_t > +ib_api_status_t > osm_get_gid_by_mad_addr( > IN osm_log_t* p_log, > IN const osm_subn_t *p_subn, > - IN const osm_mad_addr_t *p_mad_addr ) > + IN const osm_mad_addr_t *p_mad_addr, > + OUT ib_gid_t *p_gid) > { > const cl_ptr_vector_t* p_tbl; > const osm_port_t* p_port = NULL; > const osm_physp_t* p_physp = NULL; > - ib_gid_t request_gid; > + > + if ( p_gid == NULL ) > + { > + osm_log( p_log, OSM_LOG_ERROR, > + "osm_get_gid_by_mad_addr: ERR 7505 " > + "Provided output GID is NULL\n"); > + return(IB_INVALID_PARAMETER); > + } > > /* Find the port gid of the request in the subnet */ > p_tbl = &p_subn->port_lid_tbl; > @@ -256,9 +264,18 @@ osm_get_gid_by_mad_addr( > cl_ntoh16(p_mad_addr->dest_lid)) > { > 
p_port = cl_ptr_vector_get( p_tbl, cl_ntoh16(p_mad_addr->dest_lid) ); > + if ( p_port == NULL ) > + { > + osm_log( p_log, OSM_LOG_DEBUG, > + "osm_get_gid_by_mad_addr: " > + "Did not find any port with LID: 0x%X\n", > + cl_ntoh16(p_mad_addr->dest_lid) > + ); > + return(IB_INVALID_PARAMETER); > + } > p_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num); > - request_gid.unicast.interface_id = p_physp->port_guid; > - request_gid.unicast.prefix = p_subn->opt.subnet_prefix; > + p_gid->unicast.interface_id = p_physp->port_guid; > + p_gid->unicast.prefix = p_subn->opt.subnet_prefix; > } > else > { Isn't an error status needed to be returned for this else ? > @@ -270,7 +287,7 @@ osm_get_gid_by_mad_addr( > ); > } > > - return request_gid; > + return( IB_SUCCESS ); > } > > /********************************************************************** > Index: opensm/osm_sa_informinfo.c > =================================================================== > --- opensm/osm_sa_informinfo.c (revision 7670) > +++ opensm/osm_sa_informinfo.c (working copy) > @@ -348,6 +348,7 @@ osm_infr_rcv_process_set_method( > uint8_t subscribe; > ib_net32_t qpn; > uint8_t resp_time_val; > + ib_api_status_t res; > > OSM_LOG_ENTER( p_rcv->p_log, osm_infr_rcv_process_set_method ); > > @@ -382,8 +383,24 @@ osm_infr_rcv_process_set_method( > inform_info_rec.inform_record.subscriber_enum = 0; > > /* update the subscriber GID according to mad address */ > - inform_info_rec.inform_record.subscriber_gid = > - osm_get_gid_by_mad_addr( p_rcv->p_log, p_rcv->p_subn, &p_madw->mad_addr ); > + res = osm_get_gid_by_mad_addr( > + p_rcv->p_log, > + p_rcv->p_subn, > + &p_madw->mad_addr, > + &inform_info_rec.inform_record.subscriber_gid); > + if ( res != NULL ) Should this be IB_SUCCESS rather than NULL ? 
> + { > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "osm_infr_rcv_process_set_method: ERR 4308 " > + "Got Subscribe Request from unknown LID: 0x%04X\n", > + cl_ntoh16(p_madw->mad_addr.dest_lid) > + ); > + osm_sa_send_error( > + p_rcv->p_resp, > + p_madw, > + IB_SA_MAD_STATUS_REQ_INVALID); > + goto Exit; > + } > > /* > * MODIFICATIONS DONE ON INCOMING REQUEST: > Index: opensm/osm_sa_mcmember_record.c > =================================================================== > --- opensm/osm_sa_mcmember_record.c (revision 7670) > +++ opensm/osm_sa_mcmember_record.c (working copy) > @@ -437,12 +437,21 @@ __add_new_mgrp_port( > { > boolean_t proxy_join; > ib_gid_t requester_gid; > + ib_api_status_t res; > > /* set the proxy_join if the requester gid is not identical to the > joined gid */ > - requester_gid = osm_get_gid_by_mad_addr( p_rcv->p_log, > + res = osm_get_gid_by_mad_addr( p_rcv->p_log, > p_rcv->p_subn, > - p_mad_addr ); > + p_mad_addr, &requester_gid ); > + if ( res != IB_SUCCESS ) > + { > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "__add_new_mgrp_port: ERR 1B22: " > + "Could not find GUID for requestor.\n" ); ERR 1B22 is already in use. > + > + return IB_INVALID_PARAMETER; > + } Also, based on this change, the caller of __add_new_mgrp_port should not just send SA error with IB_SA_MAD_STATUS_NO_RESOURCES but rather base it off the error status now. -- Hal > if (!memcmp(&p_recvd_mcmember_rec->port_gid, &requester_gid, > sizeof(ib_gid_t))) > @@ -755,6 +764,7 @@ __validate_modify(IN osm_mcmr_recv_t* co > ib_net64_t portguid; > ib_gid_t request_gid; > osm_physp_t* p_request_physp; > + ib_api_status_t res; > > portguid = p_recvd_mcmember_rec->port_gid.unicast.interface_id; > > @@ -775,9 +785,19 @@ __validate_modify(IN osm_mcmr_recv_t* co > { > /* The proxy_join is not set. 
Modifying can by done only > if the requester GID == PortGID */ > - request_gid = osm_get_gid_by_mad_addr(p_rcv->p_log, > + res = osm_get_gid_by_mad_addr(p_rcv->p_log, > p_rcv->p_subn, > - p_mad_addr ); > + p_mad_addr, > + &request_gid); > + > + if ( res != IB_SUCCESS ) > + { > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > + "__validate_modify: " > + "Could not find any port by given request address.\n" > + ); > + return FALSE; > + } > > if (memcmp(&((*pp_mcm_port)->port_gid), &request_gid, sizeof(ib_gid_t))) > { > From eitan at mellanox.co.il Mon Jun 5 04:33:47 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 5 Jun 2006 14:33:47 +0300 Subject: [openib-general] QoS RFC - Resend using a friendly mailer Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3023687C0@mtlexch01.mtl.com> Hi Sasha, Please see my comments below > > > > 9. OpenSM features > > ------------------- > > The QoS related functionality to be provided by OpenSM can be split into two > > main parts: > > > > 3.1. Fabric Setup > > During fabric initialization the SM should parse the policy and apply its > > settings to the discovered fabric elements. The following actions should be > > performed: > > * Parsing of policy > > * Node Group identification. Warning should be provided for each node not > > specified but found. > > * SL2VL settings validation should be checked: > > + A warning will be provided if there are no matching targets for the SL2VL > > setting statement. > > + An error message will be printed to the log file if an invalid setting is > > found. A setting is invalid if it refers to: > > - Non existing port numbers of the target devices > > - Unsupported VLs for the target device. In the later case the map to non > > existing VLs should be replaced to VL15 i.e. packets will be dropped. > > Not sure that unsupported VLs mapping to VL15 is best option. 
Actually > if SL2VL will be specified per port group this may mean that at least in > "generic" case all group members should have similar physical > capabilities or "reliable" part of SLs will be limited by lowest VLCap > in this group (other SLs will be just dropped somewhere). [EZ] I prefer not hiding the mismatch. In my mind the explicit setting should be provided for each of the groups of switches that do not share same VLs support. But this is not a strong requirement in my mind. In general I would prefer to get a clear error message when the fabric can not support the given policy. Once such error is provided I think we could use whatever "recovery" option you have in mind. > > In current SL2VL mapping implementation we are using such rule to replace > unsupported VLs: (new VL) = (requested VL) % (operational data VLs) > This may have some disadvantage too, but I think it is generally "safer". [EZ] It is safer since it will not cause data loss. But then the QoS will probably be broken. > > Also I guess that by "unsupported VLs" you are referring unsupported or > non-configured VLs. [EZ] Yes true. > > > * SL2VL setting is to be performed > > * VL Arbitration table settings should be validated according to the following > > rules: > > + A warning will be provided if there are no matching targets for the setting > > statement > > + An error will be provided if the port number exceeds the target ports > > + An error will be generated if the table length exceeds device capabilities > > + An warning will be generated if the table quote a VL that is not supported > > by the target device > > Should there be replacement rule for not supported VLs? > > In IBTA spec (v.1, p.190, l.14) is stated that entry with unsupported VL > may be skipped _OR_ "trusted" to other (supported) VL. I think if we will > not care about unsupported replacement there may be hole for > "device/vendor dependent" behavior. [EZ] OK good point. Lets have a replacement rule. 
> > Sasha From eitan at mellanox.co.il Mon Jun 5 05:33:07 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 5 Jun 2006 15:33:07 +0300 Subject: [openib-general] RE: [PATCH] osm: segfault fix in osm_get_gid_by_mad_addr Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3023687C4@mtlexch01.mtl.com> Hi Hal, I will re-send the patch with fixes. I also replied to the comments below > See comments below. > > > p_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num); > > - request_gid.unicast.interface_id = p_physp->port_guid; > > - request_gid.unicast.prefix = p_subn->opt.subnet_prefix; > > + p_gid->unicast.interface_id = p_physp->port_guid; > > + p_gid->unicast.prefix = p_subn->opt.subnet_prefix; > > } > > else > > { > > Isn't an error status needed to be returned for this else ? [EZ] Correct > > > @@ -382,8 +383,24 @@ osm_infr_rcv_process_set_method( > > inform_info_rec.inform_record.subscriber_enum = 0; > > > > /* update the subscriber GID according to mad address */ > > - inform_info_rec.inform_record.subscriber_gid = > > - osm_get_gid_by_mad_addr( p_rcv->p_log, p_rcv->p_subn, &p_madw- > >mad_addr ); > > + res = osm_get_gid_by_mad_addr( > > + p_rcv->p_log, > > + p_rcv->p_subn, > > + &p_madw->mad_addr, > > + &inform_info_rec.inform_record.subscriber_gid); > > + if ( res != NULL ) > > Should this be IB_SUCCESS rather than NULL ? [EZ] True. 
> > > + { > > * MODIFICATIONS DONE ON INCOMING REQUEST: > > Index: opensm/osm_sa_mcmember_record.c > > =================================================================== > > --- opensm/osm_sa_mcmember_record.c (revision 7670) > > +++ opensm/osm_sa_mcmember_record.c (working copy) > > @@ -437,12 +437,21 @@ __add_new_mgrp_port( > > { > > boolean_t proxy_join; > > ib_gid_t requester_gid; > > + ib_api_status_t res; > > > > /* set the proxy_join if the requester gid is not identical to the > > joined gid */ > > - requester_gid = osm_get_gid_by_mad_addr( p_rcv->p_log, > > + res = osm_get_gid_by_mad_addr( p_rcv->p_log, > > p_rcv->p_subn, > > - p_mad_addr ); > > + p_mad_addr, &requester_gid ); > > + if ( res != IB_SUCCESS ) > > + { > > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > + "__add_new_mgrp_port: ERR 1B22: " > > + "Could not find GUID for requestor.\n" ); > > ERR 1B22 is already in use. [EZ] OK last was 1B28 using 1B29 > > > + > > + return IB_INVALID_PARAMETER; > > + } > > Also, based on this change, the caller of __add_new_mgrp_port should not > just send SA error with IB_SA_MAD_STATUS_NO_RESOURCES but rather base it > off the error status now. [EZ] Correct. But I think there is no error message that fits exactly the case where the requester is not known to the SM. I will use invalid parameter. > > -- Hal > > > if (!memcmp(&p_recvd_mcmember_rec->port_gid, &requester_gid, > > sizeof(ib_gid_t))) > > @@ -755,6 +764,7 @@ __validate_modify(IN osm_mcmr_recv_t* co > > ib_net64_t portguid; > > ib_gid_t request_gid; > > osm_physp_t* p_request_physp; > > + ib_api_status_t res; > > > > portguid = p_recvd_mcmember_rec->port_gid.unicast.interface_id; > > > > @@ -775,9 +785,19 @@ __validate_modify(IN osm_mcmr_recv_t* co > > { > > /* The proxy_join is not set. 
Modifying can by done only > > if the requester GID == PortGID */ > > - request_gid = osm_get_gid_by_mad_addr(p_rcv->p_log, > > + res = osm_get_gid_by_mad_addr(p_rcv->p_log, > > p_rcv->p_subn, > > - p_mad_addr ); > > + p_mad_addr, > > + &request_gid); > > + > > + if ( res != IB_SUCCESS ) > > + { > > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > > + "__validate_modify: " > > + "Could not find any port by given request address.\n" > > + ); > > + return FALSE; > > + } > > > > if (memcmp(&((*pp_mcm_port)->port_gid), &request_gid, sizeof(ib_gid_t))) > > { > > From Thomas.Talpey at netapp.com Mon Jun 5 05:31:11 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 05 Jun 2006 08:31:11 -0400 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs In-Reply-To: References: Message-ID: <7.0.1.0.2.20060605081948.044849d0@netapp.com> At 10:03 AM 6/3/2006, Rimmer, Todd wrote: >> Yes, the limit of outstanding RDMAs is not related to the send queue >> depth. Of course you can post many more than 4 RDMAs to a send queue >> -- the HCA just won't have more than 4 requests outstanding at a time. > >To further clarity, this parameter only affects the number of concurrent >outstanding RDMA Reads which the HCA will process. Once it hits this >limit, the send Q will stall waiting for issued reads to complete prior >to initiating new reads. It's worse than that - the send queue must stall for *all* operations. Otherwise the hardware has to track in-progress operations which are queued after stalled ones. It really breaks the initiation model. Semantically, the provider is not required to provide any such flow control behavior by the way. The Mellanox one apparently does, but it is not a requirement of the verbs, it's a requirement on the upper layer. If more RDMA Reads are posted than the remote peer supports, the connection may break. 
>The number of outstanding RDMA Reads is negotiated by the CM during >connection establishment and the QP which is sending the RDMA Read must >have a value configured for this parameter which is <= the remote ends >capability. In other words, we're probably stuck at 4. :-) I don't think there is any Mellanox-based implementation that has ever supported > 4. >In previous testing by Mellanox on SDR HCAs they indicated values beyond >2-4 did not improve performance (and in fact required more RDMA >resources be allocated for the corresponding QP or HCA). Hence I >suspect a very large value like 128 would offer no improvement over >values in the 2-8 range. I am not so sure of that. For one thing, it's dependent on VERY small latencies. The presence of a switch, or link extenders will make a huge difference. Second, heavy multi-QP firmware loads will increase the latencies. Third, constants are pretty much never a good idea in networking. The NFS/RDMA client tries to set the maximum IRD value it can obtain. RDMA Read is used quite heavily by the server to fetch client data segments for NFS writes. Tom. From eitan at mellanox.co.il Mon Jun 5 05:34:03 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 05 Jun 2006 15:34:03 +0300 Subject: [openib-general] [PATCH] osm: segfault fix in osm_get_gid_by_mad_addr (take 2) Message-ID: <864pyzok78.fsf@mtl066.yok.mtl.com> Hi Hal I got a report regarding crashes in osm_get_gid_by_mad_addr. It was missing a check on p_port looked up by LID. The affected flows are reports and multicast joins. The fix modified the function to return status (instead of GID). I did run some simulation flows after the fix but please double check before commit. 
This time I hope I did not missed anything Eitan Signed-off-by: Eitan Zahavi Index: include/opensm/osm_subnet.h =================================================================== --- include/opensm/osm_subnet.h (revision 7542) +++ include/opensm/osm_subnet.h (working copy) @@ -770,11 +770,12 @@ struct _osm_port; * * SYNOPSIS */ -ib_gid_t +ib_api_status_t osm_get_gid_by_mad_addr( IN struct _osm_log *p_log, IN const osm_subn_t *p_subn, - IN const struct _osm_mad_addr *p_mad_addr ); + IN const struct _osm_mad_addr *p_mad_addr, + OUT ib_gid_t *p_gid); /* * PARAMETERS * p_log @@ -786,8 +787,11 @@ osm_get_gid_by_mad_addr( * p_mad_addr * [in] Pointer to mad address object. * +* p_gid +* [out] Pointer to teh GID structure to fill in. +* * RETURN VALUES -* Requestor gid object if found. Null otherwise. +* IB_SUCCESS if was able to find the GID by address given * * NOTES * Index: opensm/osm_subnet.c =================================================================== --- opensm/osm_subnet.c (revision 7670) +++ opensm/osm_subnet.c (working copy) @@ -236,16 +236,24 @@ osm_subn_init( /********************************************************************** **********************************************************************/ -ib_gid_t +ib_api_status_t osm_get_gid_by_mad_addr( IN osm_log_t* p_log, IN const osm_subn_t *p_subn, - IN const osm_mad_addr_t *p_mad_addr ) + IN const osm_mad_addr_t *p_mad_addr, + OUT ib_gid_t *p_gid) { const cl_ptr_vector_t* p_tbl; const osm_port_t* p_port = NULL; const osm_physp_t* p_physp = NULL; - ib_gid_t request_gid; + + if ( p_gid == NULL ) + { + osm_log( p_log, OSM_LOG_ERROR, + "osm_get_gid_by_mad_addr: ERR 7505 " + "Provided output GID is NULL\n"); + return(IB_INVALID_PARAMETER); + } /* Find the port gid of the request in the subnet */ p_tbl = &p_subn->port_lid_tbl; @@ -256,9 +264,18 @@ osm_get_gid_by_mad_addr( cl_ntoh16(p_mad_addr->dest_lid)) { p_port = cl_ptr_vector_get( p_tbl, cl_ntoh16(p_mad_addr->dest_lid) ); + if ( p_port == NULL ) + { + 
osm_log( p_log, OSM_LOG_DEBUG, + "osm_get_gid_by_mad_addr: " + "Did not find any port with LID: 0x%X\n", + cl_ntoh16(p_mad_addr->dest_lid) + ); + return(IB_INVALID_PARAMETER); + } p_physp = osm_port_get_phys_ptr( p_port, p_port->default_port_num); - request_gid.unicast.interface_id = p_physp->port_guid; - request_gid.unicast.prefix = p_subn->opt.subnet_prefix; + p_gid->unicast.interface_id = p_physp->port_guid; + p_gid->unicast.prefix = p_subn->opt.subnet_prefix; } else { @@ -268,9 +285,10 @@ osm_get_gid_by_mad_addr( "Lid is out of range: 0x%X\n", cl_ntoh16(p_mad_addr->dest_lid) ); + return(IB_INVALID_PARAMETER); } - return request_gid; + return( IB_SUCCESS ); } /********************************************************************** Index: opensm/osm_sa_informinfo.c =================================================================== --- opensm/osm_sa_informinfo.c (revision 7670) +++ opensm/osm_sa_informinfo.c (working copy) @@ -348,6 +348,7 @@ osm_infr_rcv_process_set_method( uint8_t subscribe; ib_net32_t qpn; uint8_t resp_time_val; + ib_api_status_t res; OSM_LOG_ENTER( p_rcv->p_log, osm_infr_rcv_process_set_method ); @@ -382,8 +383,24 @@ osm_infr_rcv_process_set_method( inform_info_rec.inform_record.subscriber_enum = 0; /* update the subscriber GID according to mad address */ - inform_info_rec.inform_record.subscriber_gid = - osm_get_gid_by_mad_addr( p_rcv->p_log, p_rcv->p_subn, &p_madw->mad_addr ); + res = osm_get_gid_by_mad_addr( + p_rcv->p_log, + p_rcv->p_subn, + &p_madw->mad_addr, + &inform_info_rec.inform_record.subscriber_gid); + if ( res != IB_SUCCESS ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_infr_rcv_process_set_method: ERR 4308 " + "Got Subscribe Request from unknown LID: 0x%04X\n", + cl_ntoh16(p_madw->mad_addr.dest_lid) + ); + osm_sa_send_error( + p_rcv->p_resp, + p_madw, + IB_SA_MAD_STATUS_REQ_INVALID); + goto Exit; + } /* * MODIFICATIONS DONE ON INCOMING REQUEST: Index: opensm/osm_sa_mcmember_record.c 
=================================================================== --- opensm/osm_sa_mcmember_record.c (revision 7670) +++ opensm/osm_sa_mcmember_record.c (working copy) @@ -437,12 +437,21 @@ __add_new_mgrp_port( { boolean_t proxy_join; ib_gid_t requester_gid; + ib_api_status_t res; /* set the proxy_join if the requester gid is not identical to the joined gid */ - requester_gid = osm_get_gid_by_mad_addr( p_rcv->p_log, + res = osm_get_gid_by_mad_addr( p_rcv->p_log, p_rcv->p_subn, - p_mad_addr ); + p_mad_addr, &requester_gid ); + if ( res != IB_SUCCESS ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__add_new_mgrp_port: ERR 1B29: " + "Could not find GUID for requestor.\n" ); + + return IB_INVALID_PARAMETER; + } if (!memcmp(&p_recvd_mcmember_rec->port_gid, &requester_gid, sizeof(ib_gid_t))) @@ -755,6 +764,7 @@ __validate_modify(IN osm_mcmr_recv_t* co ib_net64_t portguid; ib_gid_t request_gid; osm_physp_t* p_request_physp; + ib_api_status_t res; portguid = p_recvd_mcmember_rec->port_gid.unicast.interface_id; @@ -775,9 +785,19 @@ __validate_modify(IN osm_mcmr_recv_t* co { /* The proxy_join is not set. 
Modifying can by done only if the requester GID == PortGID */ - request_gid = osm_get_gid_by_mad_addr(p_rcv->p_log, + res = osm_get_gid_by_mad_addr(p_rcv->p_log, p_rcv->p_subn, - p_mad_addr ); + p_mad_addr, + &request_gid); + + if ( res != IB_SUCCESS ) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__validate_modify: " + "Could not find any port by given request address.\n" + ); + return FALSE; + } if (memcmp(&((*pp_mcm_port)->port_gid), &request_gid, sizeof(ib_gid_t))) { @@ -1759,7 +1779,11 @@ osm_mcmr_rcv_join_mgrp( __cleanup_mgrp(p_rcv, mlid); CL_PLOCK_RELEASE( p_rcv->p_lock ); + if (status == IB_INVALID_PARAMETER) + sa_status = IB_SA_MAD_STATUS_REQ_INVALID; + else sa_status = IB_SA_MAD_STATUS_NO_RESOURCES; + osm_sa_send_error( p_rcv->p_resp, p_madw, sa_status); goto Exit; } From k_mahesh85 at yahoo.co.in Mon Jun 5 05:37:33 2006 From: k_mahesh85 at yahoo.co.in (keshetti mahesh) Date: Mon, 5 Jun 2006 13:37:33 +0100 (BST) Subject: [openib-general] Re: problem with memory registration-RDMA kernel utliity Message-ID: <20060605123733.24901.qmail@web8315.mail.in.yahoo.com> i have added dma_map_single() and sent the address i got from that to perform RDMA_READ, now it is not at all generating any completion event and just halting there itself. below is the changed code...... ----------------------------------------------------------------------------------------------------------------- i am trying to develop a kernel utility to perform RDMA read/write operations i am facing a problem with memory regiatration in it. my code looks like......... 
u64 *addr_array;

addr_array = kmalloc(sizeof(u64), GFP_KERNEL);  // i am using only one page buffer
test->mem = kmalloc(4096, GFP_KERNEL);          // buffer on which RDMA_READ is to be performed
test->fmr = ib_alloc_fmr(test->pd, IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE, fmr_attr); // fmr_attr is initialised properly
addr_array[0] = dma_map_single(test->device->dma_device, test->mem, 4096, DMA_TO_DEVICE);
ret = ib_map_phys_fmr(test->fmr, addr_array[0], 1, addr_array);

None of these operations generates any errors, but when I pass this address (addr_array[0]) as the remote address, the RDMA_READ operation on it generates no completion event and just halts there. Am I missing anything in the process of registering the memory?

Thanks and regards,
K. Mahesh

Send instant messages to your online friends http://in.messenger.yahoo.com
Stay connected with your friends even when away from PC. Link: http://in.mobile.yahoo.com/new/messenger/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From eitan at mellanox.co.il Mon Jun 5 05:40:53 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: 05 Jun 2006 15:40:53 +0300
Subject: [openib-general] [PATCH] osm: management class constants are unit8 not uint16
Message-ID: <863bejojvu.fsf@mtl066.yok.mtl.com>

Hi Hal

Cleaning up compilation warnings I found that osm_vendor_mlx_svc.h was using NTOH16 on the class constants.
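[Editorial note] For readers wondering why the patch below matters: mgmt_class in the MAD header is a single byte, while CL_NTOH16 performs a 16-bit byte swap on little-endian hosts, so the pre-patch comparison could never match. A minimal, hedged sketch (swap16 stands in for CL_NTOH16 on a little-endian machine; the class values are the standard IB constants, and the helper names here are illustrative, not the OpenSM source):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for CL_NTOH16 on a little-endian host: a 16-bit byte swap. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Management class constants are one byte (values from the IB spec). */
enum { IB_MCLASS_SUBN_ADM = 0x03, IB_MCLASS_DEV_MGMT = 0x06 };

/* Pre-patch comparison: swapping widens 0x03 to 0x0300, which can
 * never equal a one-byte field, so the class test always failed. */
static int is_sa_or_devmgmt_buggy(uint8_t mgmt_class)
{
    return mgmt_class == swap16(IB_MCLASS_SUBN_ADM) ||
           mgmt_class == swap16(IB_MCLASS_DEV_MGMT);
}

/* Post-patch comparison: compare the byte directly. */
static int is_sa_or_devmgmt(uint8_t mgmt_class)
{
    return mgmt_class == IB_MCLASS_SUBN_ADM ||
           mgmt_class == IB_MCLASS_DEV_MGMT;
}
```

Since byte order is only meaningful for multi-byte quantities, dropping the swap (rather than switching to an 8-bit variant) is the right fix.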
Eitan Signed-off-by: Eitan Zahavi Index: include/vendor/osm_vendor_mlx_svc.h =================================================================== --- include/vendor/osm_vendor_mlx_svc.h (revision 7542) +++ include/vendor/osm_vendor_mlx_svc.h (working copy) @@ -119,8 +119,8 @@ osmv_mad_is_rmpp(IN const ib_mad_t *p_ma rmpp_flags = ((ib_rmpp_mad_t*)p_mad)->rmpp_flags; /* HACK - JUST SA and DevMgt for now - need to add BIS and DevAdm */ - if ( (p_mad->mgmt_class != CL_NTOH16(IB_MCLASS_SUBN_ADM)) && - (p_mad->mgmt_class != CL_NTOH16(IB_MCLASS_DEV_MGMT)) ) + if ( (p_mad->mgmt_class != IB_MCLASS_SUBN_ADM) && + (p_mad->mgmt_class != IB_MCLASS_DEV_MGMT) ) return(0); return (0 != (rmpp_flags & IB_RMPP_FLAG_ACTIVE)); } From ogerlitz at voltaire.com Mon Jun 5 05:41:42 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 05 Jun 2006 15:41:42 +0300 Subject: [openib-general] Re: [PATCHv2 1/2] resend: mthca support for max_map_per_fmr device attribute In-Reply-To: References: Message-ID: <44842686.10002@voltaire.com> Roland Dreier wrote: > I had a chance to look at this, and I don't believe it is precisely > correct for mem-free HCAs with the current FMR implementation. > > > + /* on memfull HCA an FMR can be remapped 2^B - 1 times where B < 32 is > > + * the number of bits which are not used for MPT addressing, on memfree > > + * HCA B=8 so an FMR can be remapped 255 times. > > + */ > > + if(!mthca_is_memfree(mdev)) > > + props->max_map_per_fmr = (1 << (32 - > > + long_log2(mdev->limits.num_mpts))) - 1; > > + else > > + props->max_map_per_fmr = (1 << 8) - 1; > > Look at mthca_arbel_map_phys_fmr(). The question is how often key > will repeat after being indexed, and when MTHCA_FLAG_SINAI_OPT is not > set, then the same increment is used in the mem-free case as in the > Tavor case. 
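[Editorial note] The remap-count arithmetic under discussion can be sketched in isolation. This is a hedged reconstruction, not the mthca source, using the flag polarity the thread finally agrees on (the wide formula applies when MTHCA_FLAG_SINAI_OPT is *not* set): the FMR key repeats after 2^B increments, where B is the number of address bits left over after MPT indexing, or after 2^8 increments in the optimized case.

```c
#include <assert.h>
#include <stdint.h>

/* Integer log2 of a power of two, like the kernel's long_log2(). */
static unsigned int ilog2_u32(uint32_t v)
{
    unsigned int r = 0;
    while (v >>= 1)
        r++;
    return r;
}

/* Sketch of the max_map_per_fmr computation being discussed:
 * without the SINAI key-increment optimization, B = 32 - log2(num_mpts)
 * bits remain for remap counting; with it, the key increment repeats
 * every 2^8, so the FMR can be remapped 255 times. */
static uint32_t max_map_per_fmr(uint32_t num_mpts, int sinai_opt)
{
    if (!sinai_opt)
        return (1u << (32 - ilog2_u32(num_mpts))) - 1;
    return (1u << 8) - 1;
}
```

For example, with 2^17 MPTs and no SINAI optimization this yields 2^15 - 1 = 32767 remaps before the key wraps.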
> > So I think the code I quoted should really be: > > if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) > props->max_map_per_fmr = (1 << (32 - > long_log2(mdev->limits.num_mpts))) - 1; > else > props->max_map_per_fmr = (1 << 8) - 1; > > Do you agree? If so I can fix this patch up myself and apply it. Yes it makes sense, but you need the check should be if (!(dev->mthca_flags & MTHCA_FLAG_SINAI_OPT)) instead of if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) also, what about the other patch which changes fmr_pool.c to query the device, have you got(reviewed/accepted) it? i have modified it to allocate the device attr struct on the heap as you have asked. Or. From eitan at mellanox.co.il Mon Jun 5 05:51:53 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 05 Jun 2006 15:51:53 +0300 Subject: [openib-general] [PATCH] osm: trivial missing header files fix Message-ID: <861wu3ojdi.fsf@mtl066.yok.mtl.com> Hi Hal Cleaning up compilation warnings I found there missing includes in various sources. Eitan Signed-off-by: Eitan Zahavi Index: include/vendor/osm_vendor_mlx_txn.h =================================================================== --- include/vendor/osm_vendor_mlx_txn.h (revision 7542) +++ include/vendor/osm_vendor_mlx_txn.h (working copy) @@ -37,6 +37,9 @@ #ifndef _OSMV_TXN_H_ #define _OSMV_TXN_H_ +#include +#include + #include #include #include Index: libvendor/osm_vendor_mlx_hca.c =================================================================== --- libvendor/osm_vendor_mlx_hca.c (revision 7542) +++ libvendor/osm_vendor_mlx_hca.c (working copy) @@ -39,6 +39,7 @@ # include #endif /* HAVE_CONFIG_H */ +#include #include #if defined(OSM_VENDOR_INTF_MTL) | defined(OSM_VENDOR_INTF_TS) #undef IN Index: libvendor/osm_vendor_mlx_hca_sim.c =================================================================== --- libvendor/osm_vendor_mlx_hca_sim.c (revision 7542) +++ libvendor/osm_vendor_mlx_hca_sim.c (working copy) @@ -43,6 +43,7 @@ #undef IN #undef OUT +#include #include 
#include #include Index: opensm/osm_node_info_rcv.c =================================================================== --- opensm/osm_node_info_rcv.c (revision 7670) +++ opensm/osm_node_info_rcv.c (working copy) @@ -55,6 +55,7 @@ # include #endif /* HAVE_CONFIG_H */ +#include #include #include #include Index: opensm/osm_drop_mgr.c =================================================================== --- opensm/osm_drop_mgr.c (revision 7670) +++ opensm/osm_drop_mgr.c (working copy) @@ -51,6 +51,7 @@ # include #endif /* HAVE_CONFIG_H */ +#include #include #include #include From dotanb at mellanox.co.il Mon Jun 5 05:57:22 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 5 Jun 2006 15:57:22 +0300 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060605081948.044849d0@netapp.com> References: <7.0.1.0.2.20060605081948.044849d0@netapp.com> Message-ID: <200606051557.22577.dotanb@mellanox.co.il> Hi. > In other words, we're probably stuck at 4. :-) I don't think there is any > Mellanox-based implementation that has ever supported > 4. The VAPI driver (gen1 driver for Mellanox HCAs) supported 8 outstanding RDMA Read/Atomic operations. I guess that the magic value "4" is a low level driver issue and not HCA issue. Dotan From eitan at mellanox.co.il Mon Jun 5 05:59:45 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 05 Jun 2006 15:59:45 +0300 Subject: [openib-general] [PATCH] osm: trivial missing cast in osmt_service call for memcmp Message-ID: <86zmgrn4fy.fsf@mtl066.yok.mtl.com> Hi Hal Last one of my cleaning up compilation warnings I found a missing cast in osmtest service name compare. 
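[Editorial note] For context on the cast fix below: the service-name fields in the IB headers are fixed-size uint8_t arrays rather than char arrays, so passing them straight to strcmp() draws a pointer-signedness warning. A minimal illustration (svc_name_equal is a hypothetical helper, not OpenSM code) — the casts silence the warning without changing the bytes compared:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Service names arrive as uint8_t buffers (ib_svc_name_t style);
 * strcmp() takes const char *, so the call sites cast. The comparison
 * is byte-identical with or without the cast. */
static int svc_name_equal(const uint8_t *a, const uint8_t *b)
{
    return strcmp((const char *)a, (const char *)b) == 0;
}
```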
Eitan Signed-off-by: Eitan Zahavi Index: osmtest/osmt_service.c =================================================================== --- osmtest/osmt_service.c (revision 7542) +++ osmtest/osmt_service.c (working copy) @@ -1138,8 +1138,8 @@ osmt_get_all_services_and_check_names( I "osmt_get_all_services_and_check_names: " "-I- Comparing source name : >%s<, with record name : >%s<, idx : %d\n", p_valid_service_names_arr[j],p_rec->service_name, p_checked_names[j]); - if ( strcmp(p_valid_service_names_arr[j], - p_rec->service_name) == 0 ) + if ( strcmp((char *)p_valid_service_names_arr[j], + (char *)p_rec->service_name) == 0 ) { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_all_services_and_check_names: " From halr at voltaire.com Mon Jun 5 05:57:47 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jun 2006 08:57:47 -0400 Subject: [openib-general] Re: [PATCH] osm: management class constants are unit8 not uint16 In-Reply-To: <863bejojvu.fsf@mtl066.yok.mtl.com> References: <863bejojvu.fsf@mtl066.yok.mtl.com> Message-ID: <1149512262.4510.206671.camel@hal.voltaire.com> On Mon, 2006-06-05 at 08:40, Eitan Zahavi wrote: > Hi Hal > > Cleaning up compilation warnings I found that the osm_vendor_mlx_svc.h > was using NTOH16 on the class constants. > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied to both trunk and 1.0 branch. 
-- Hal From jlentini at netapp.com Mon Jun 5 06:38:43 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 5 Jun 2006 09:38:43 -0400 (EDT) Subject: [openib-general] Fix some suspicious ppc64 code in dapl In-Reply-To: <20060604002200.GB986@krispykreme> References: <20060604002200.GB986@krispykreme> Message-ID: > Index: dapl/udapl/linux/dapl_osd.h > =================================================================== > --- dapl/udapl/linux/dapl_osd.h (revision 7621) > +++ dapl/udapl/linux/dapl_osd.h (working copy) > @@ -238,14 +238,13 @@ > #endif /* __ia64__ */ > #elif defined(__PPC64__) > __asm__ __volatile__ ( > - EIEIO_ON_SMP > -"1: lwarx %0,0,%2 # __cmpxchg_u64\n\ > - cmpd 0,%0,%3\n\ > +" lwsync\n\ > +1: lwarx %0,0,%2 # __cmpxchg_u32\n\ > + cmpw 0,%0,%3\n\ > bne- 2f\n\ > stwcx. %4,0,%2\n\ > - bne- 1b" > - ISYNC_ON_SMP > - "\n\ > + bne- 1b\n\ > + isync\n\ > 2:" > : "=&r" (current_value), "=m" (*v) > : "r" (v), "r" (match_value), "r" (new_value), "m" (*v) Thank you Anton. Could you replying with a signed off by line? I'll properly attribute this fix to you in the commit log. From halr at voltaire.com Mon Jun 5 07:21:05 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jun 2006 10:21:05 -0400 Subject: [openib-general] Re: [PATCH] osm: segfault fix in osm_get_gid_by_mad_addr (take 2) In-Reply-To: <864pyzok78.fsf@mtl066.yok.mtl.com> References: <864pyzok78.fsf@mtl066.yok.mtl.com> Message-ID: <1149517245.4510.208652.camel@hal.voltaire.com> On Mon, 2006-06-05 at 08:34, Eitan Zahavi wrote: > Hi Hal > > I got a report regarding crashes in osm_get_gid_by_mad_addr. > It was missing a check on p_port looked up by LID. The affected > flows are reports and multicast joins. > > The fix modified the function to return status (instead of GID). > I did run some simulation flows after the fix but please double > check before commit. > > This time I hope I did not missed anything > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. 
Applied (with some cosmetic changes) to both trunk and 1.0 branch. -- Hal From halr at voltaire.com Mon Jun 5 08:05:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jun 2006 11:05:34 -0400 Subject: [openib-general] [PATCH] OpenSM: Don't exit when log fills disk Message-ID: <1149519927.4510.209794.camel@hal.voltaire.com> OpenSM: Don't exit when log fills disk Signed-off-by: Hal Rosenstock Index: opensm/osm_log.c =================================================================== --- opensm/osm_log.c (revision 7645) +++ opensm/osm_log.c (working copy) @@ -80,6 +80,9 @@ static char *month_str[] = { }; #endif /* ndef WIN32 */ +static int log_exit_count = 0; + + void osm_log( IN osm_log_t* const p_log, @@ -175,8 +178,10 @@ osm_log( if (ret < 0) { - fprintf(stderr, "OSM LOG FAILURE! Probably quota exceeded\n"); - exit(1); + if (log_exit_count++ < 10) + { + fprintf(stderr, "OSM LOG FAILURE! Quota probably exceeded\n"); + } } } } From hbchen at lanl.gov Mon Jun 5 08:12:03 2006 From: hbchen at lanl.gov (hbchen) Date: Mon, 05 Jun 2006 09:12:03 -0600 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: <86lksceyfl.fsf@mtl066.yok.mtl.com> References: <86lksceyfl.fsf@mtl066.yok.mtl.com> Message-ID: <448449C3.9000705@lanl.gov> Hi, I have a question about the IPoIB bandwidth performance. I did netperf testing using Single GiGE, Myrinet D card, Myrinet 10G ethernet card, and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface). 
NIC (Jumbo enabled)     Line bandwidth (LB)    IPoverNIC bandwidth   Utilization (IPoNIC/LB)
---------------------   --------------------   -------------------   -----------------------------------
Single Gigabit NIC:     1Gb/sec = 125MB/sec    120MB/sec             96%  (PCI-X interface)
Myrinet D card:         250MB/sec              240~245MB/sec         96% ~ 98%  (PCI-X interface)
Myrinet 10G Ethernet:   10Gb/sec = 1280MB/sec  980MB/sec             76.6%  (my testing, Linux 2.6.14.6)
  (PCI-Express)                                1225MB/sec            95.7%  (data from the Myrinet website)
IB HCA4X (PCI-Express): 10Gb/sec = 1280MB/sec  420MB/sec             32.8%  (my testing, Linux 2.6.14.6)
                                               474MB/sec             37%  (best from the OpenIB list, 2.6.12-rc5 patch 1)

Why is the bandwidth utilization of IPoIB so low compared to the other NICs? There must be a lot of room to improve the IPoIB software to reach 75%+ bandwidth utilization.

HB Chen
Los Alamos National Lab
hbchen at lanl.gov

From halr at voltaire.com Mon Jun 5 08:21:23 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Jun 2006 11:21:23 -0400
Subject: [openib-general] Question about the IPoIB bandwidth performance ?
In-Reply-To: <448449C3.9000705@lanl.gov>
References: <86lksceyfl.fsf@mtl066.yok.mtl.com> <448449C3.9000705@lanl.gov>
Message-ID: <1149520880.4510.210194.camel@hal.voltaire.com>

On Mon, 2006-06-05 at 11:12, hbchen wrote:
> Hi,
> I have a question about the IPoIB bandwidth performance.
> I did netperf testing using Single GiGE, Myrinet D card, Myrinet 10G
> ethernet card,
> and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface).
> > > NIC (Jumbo enabled) Line bandwidth(LB) IPoverNIC bandwidth utilization > (IPoNIC/LB) > --------------------- ---------------- -------------- > ---------------------------------- > Single Gigabit NIC : 1Gb/sec=125MB/sec 120MB/sec 96% (PIC-X interface) > Myrinet D card : 250MB/sec 240~-245MB/sec 96% ~ 98% (PCI-X interface) > Myrinet 10G Ethernet: 10Gb/sec=1280MB/sec 980MB/sec 76.6% (My testing > using Linux 2.6.14.6) > (PCI-Express) 1225MB/sec 95.7% (Data from Myrinet website) > IB HCA4X(PCI-Express): 10Gb/sec=1280MB/sec 420MB/sec 32.8% (My testing > using Linux 2.6.14.6) > 474MB/sec 37% (the best from OpenIB mailing list) > (2.6.12-rc5 patch 1) > > Why the bandwidth utilization of IPoIB is so low compared to the others > NICs? One thing to note is that the max utilization of 10G IB (4x) is 8G due to the signalling being included in this rate (unlike ethernet whose rate represents the data rate and does not include the signalling overhead). -- Hal > There must be a lot of room to improve the IPoIB software to reach 75%+ > bandwidth utilization. > > > HB Chen > Los Alamos National Lab > hbchen at labl.gov > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hbchen at lanl.gov Mon Jun 5 08:38:24 2006 From: hbchen at lanl.gov (hbchen) Date: Mon, 05 Jun 2006 09:38:24 -0600 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: <1149520880.4510.210194.camel@hal.voltaire.com> References: <86lksceyfl.fsf@mtl066.yok.mtl.com> <448449C3.9000705@lanl.gov> <1149520880.4510.210194.camel@hal.voltaire.com> Message-ID: <44844FF0.9020309@lanl.gov> Hal Rosenstock wrote: >On Mon, 2006-06-05 at 11:12, hbchen wrote: > > >>Hi, >>I have a question about the IPoIB bandwidth performance. 
>>I did netperf testing using Single GiGE, Myrinet D card, Myrinet 10G >>ethernet card, >>and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface). >> >> >>NIC (Jumbo enabled) Line bandwidth(LB) IPoverNIC bandwidth utilization >>(IPoNIC/LB) >>--------------------- ---------------- -------------- >>---------------------------------- >>Single Gigabit NIC : 1Gb/sec=125MB/sec 120MB/sec 96% (PIC-X interface) >>Myrinet D card : 250MB/sec 240~-245MB/sec 96% ~ 98% (PCI-X interface) >>Myrinet 10G Ethernet: 10Gb/sec=1280MB/sec 980MB/sec 76.6% (My testing >>using Linux 2.6.14.6) >>(PCI-Express) 1225MB/sec 95.7% (Data from Myrinet website) >>IB HCA4X(PCI-Express): 10Gb/sec=1280MB/sec 420MB/sec 32.8% (My testing >>using Linux 2.6.14.6) >>474MB/sec 37% (the best from OpenIB mailing list) >>(2.6.12-rc5 patch 1) >> >>Why the bandwidth utilization of IPoIB is so low compared to the others >>NICs? >> >> > >One thing to note is that the max utilization of 10G IB (4x) is 8G due >to the signalling being included in this rate (unlike ethernet whose >rate represents the data rate and does not include the signalling >overhead). > > Hal, Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low. >> IPoIB=420MB/sec >> bandwidth utilization= 420/1024 = 41.01% HB >-- Hal > > > >>There must be a lot of room to improve the IPoIB software to reach 75%+ >>bandwidth utilization. >> >> >>HB Chen >>Los Alamos National Lab >>hbchen at labl.gov >> >>_______________________________________________ >>openib-general mailing list >>openib-general at openib.org >>http://openib.org/mailman/listinfo/openib-general >> >>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... 
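[Editorial note] The arithmetic behind the 41% figure above is worth making explicit. A small sketch using this thread's unit convention (1 Gb/s counted as 128 MB/s, so the 10 Gb/s 4x IB signalling rate yields 8 Gb/s = 1024 MB/s of data after 8b/10b coding; the helper names are illustrative only):

```c
#include <assert.h>

/* MB/s from Gb/s using the thread's convention: 1 Gb/s = 1/8 GB/s
 * = 128 MB/s, so 10 Gb/s -> 1280 MB/s and 8 Gb/s -> 1024 MB/s. */
static double mbs_from_gbs(double gbs)
{
    return gbs * 128.0;
}

/* 4x IB signals at 10 Gb/s, but 8b/10b coding leaves 80% (8 Gb/s)
 * for data -- the point Hal makes above. */
static double ib4x_data_mbs(void)
{
    return mbs_from_gbs(10.0 * 0.8);
}

/* Achieved throughput as a percentage of the link data rate. */
static double utilization_pct(double achieved_mbs, double link_mbs)
{
    return 100.0 * achieved_mbs / link_mbs;
}
```

Against the 1024 MB/s data rate, the measured 420 MB/s comes to roughly 41% utilization, matching the figure quoted in the reply.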
URL: From halr at voltaire.com Mon Jun 5 08:34:50 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jun 2006 11:34:50 -0400 Subject: [openib-general] Re: [PATCH] osm: trivial missing header files fix In-Reply-To: <861wu3ojdi.fsf@mtl066.yok.mtl.com> References: <861wu3ojdi.fsf@mtl066.yok.mtl.com> Message-ID: <1149521684.4510.210522.camel@hal.voltaire.com> On Mon, 2006-06-05 at 08:51, Eitan Zahavi wrote: > Hi Hal > > Cleaning up compilation warnings I found there missing includes in > various sources. > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied to trunk only. -- Hal From halr at voltaire.com Mon Jun 5 08:45:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jun 2006 11:45:28 -0400 Subject: [openib-general] Re: [PATCH] osm: trivial missing cast in osmt_service call for memcmp In-Reply-To: <86zmgrn4fy.fsf@mtl066.yok.mtl.com> References: <86zmgrn4fy.fsf@mtl066.yok.mtl.com> Message-ID: <1149522314.4510.210789.camel@hal.voltaire.com> Hi Eitan, On Mon, 2006-06-05 at 08:59, Eitan Zahavi wrote: > Hi Hal > > Last one of my cleaning up compilation warnings I found a missing > cast in osmtest service name compare. > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied to trunk only. -- Hal From wombat2 at us.ibm.com Mon Jun 5 08:54:42 2006 From: wombat2 at us.ibm.com (Bernard King-Smith) Date: Mon, 5 Jun 2006 11:54:42 -0400 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: <20060605152617.496942283DA@openib.ca.sandia.gov> Message-ID: Hal Rosenstock wrote: > On Mon, 2006-06-05 at 11:12, hbchen wrote: > > Hi, > > I have a question about the IPoIB bandwidth performance. > > I did netperf testing using Single GiGE, Myrinet D card, Myrinet 10G > > ethernet card, > > and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface). 
> >
> >
> > NIC (Jumbo enabled) Line bandwidth(LB) IPoverNIC bandwidth utilization
> > (IPoNIC/LB)
> > --------------------- ---------------- --------------
> > ----------------------------------
> > Single Gigabit NIC : 1Gb/sec=125MB/sec 120MB/sec 96% (PCI-X interface)
> > Myrinet D card : 250MB/sec 240~245MB/sec 96% ~ 98% (PCI-X interface)
> > Myrinet 10G Ethernet: 10Gb/sec=1280MB/sec 980MB/sec 76.6% (My testing
> > using Linux 2.6.14.6)
> > (PCI-Express) 1225MB/sec 95.7% (Data from Myrinet website)
> > IB HCA4X(PCI-Express): 10Gb/sec=1280MB/sec 420MB/sec 32.8% (My testing
> > using Linux 2.6.14.6)
> > 474MB/sec 37% (the best from OpenIB mailing list)
> > (2.6.12-rc5 patch 1)
> >
> > Why is the bandwidth utilization of IPoIB so low compared to the other
> > NICs?
>
> One thing to note is that the max utilization of 10G IB (4x) is 8G due
> to the signalling being included in this rate (unlike ethernet whose
> rate represents the data rate and does not include the signalling
> overhead).
>
> -- Hal

You also have larger IP packets when you use GigE ( especially in large
send/offload ) and Myrinet. I think Myrinet uses a 60K MTU and for GigE,
without large send you get a 9000 MTU. With large send you get a 64K buffer
to the adapter so fragmentation to 1500/9000 IP packets is offloaded in the
adapter.

Currently with IPoIB using UD mode, you have to generate lots of 2K packets.
With serialized IPoIB drivers you end up bottlenecking on a single CPU.
There is an IPoIB-CM IETF spec out which should significantly improve IPoIB
performance if implemented.

> > There must be a lot of room to improve the IPoIB software to reach 75%+
> > bandwidth utilization.
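[Editorial note] The message-rate point made above can be put in numbers: at 420 MB/s, a 2 KB UD datagram size means on the order of 205,000 packets per second for a serialized driver path to absorb, versus roughly 47,000 with a 9000-byte jumbo MTU. A sketch of that arithmetic (1 MB taken as 10^6 bytes as a round figure):

```c
#include <assert.h>

/* Packets per second needed to carry a given throughput (in MB/s,
 * with 1 MB = 10^6 bytes) at a given per-packet payload size. */
static double packets_per_sec(double mb_per_sec, double payload_bytes)
{
    return mb_per_sec * 1.0e6 / payload_bytes;
}
```

The roughly 4x difference in per-packet work (interrupts, header processing, lock acquisitions) against a 9000-byte MTU is one way to see why the 2 KB IPoIB UD path bottlenecks on a single CPU.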
> > > > > > HB Chen > > Los Alamos National Lab > > hbchen at labl.gov > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general Bernie King-Smith IBM Corporation Server Group Cluster System Performance wombat2 at us.ibm.com (845)433-8483 Tie. 293-8483 or wombat2 on NOTES "We are not responsible for the world we are born into, only for the world we leave when we die. So we have to accept what has gone before us and work to change the only thing we can, -- The Future." William Shatner From xma at us.ibm.com Mon Jun 5 09:02:36 2006 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 5 Jun 2006 09:02:36 -0700 Subject: [openib-general] Re: Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: <20060605081136.GJ21266@mellanox.co.il> Message-ID: Michael, I will apply this patch. This patch would reduce the race, not address the problem. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Mon Jun 5 09:01:14 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jun 2006 09:01:14 -0700 Subject: [openib-general] Re: [PATCHv2 1/2] resend: mthca support for max_map_per_fmr device attribute In-Reply-To: <44842686.10002@voltaire.com> (Or Gerlitz's message of "Mon, 05 Jun 2006 15:41:42 +0300") References: <44842686.10002@voltaire.com> Message-ID: > Yes it makes sense, but you need the check should be > > if (!(dev->mthca_flags & MTHCA_FLAG_SINAI_OPT)) > > instead of > > if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) Yep, you're right, I got it backwards. 
> also, what about the other patch which changes fmr_pool.c to query the > device, have you got(reviewed/accepted) it? i have modified it to > allocate the device attr struct on the heap as you have asked. It looks fine. I was just reviewing everything together. - R. From Thomas.Talpey at netapp.com Mon Jun 5 08:52:03 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 05 Jun 2006 11:52:03 -0400 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: <44844FF0.9020309@lanl.gov> References: <86lksceyfl.fsf@mtl066.yok.mtl.com> <448449C3.9000705@lanl.gov> <1149520880.4510.210194.camel@hal.voltaire.com> <44844FF0.9020309@lanl.gov> Message-ID: <7.0.1.0.2.20060605114203.043ad738@netapp.com> At 11:38 AM 6/5/2006, hbchen wrote: >Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low. >>> IPoIB=420MB/sec >>> bandwidth utilization= 420/1024 = 41.01% Helen, have you measured the CPU utilizations during these runs? Perhaps you are out of CPU. Outrageous opinion follows. Frankly, an IB HCA running Ethernet emulation is approximately the world's worst 10GbE adapter (not to put too fine of a point on it :-) ) There is no hardware checksumming, nor large-send offloading, both of which force overhead onto software. And, as you just discovered it isn't even 10Gb! In general, network emulation layers are always going to perform more poorly than native implementations. But this is only a generality learned from years of experience with them. Tom. From hbchen at lanl.gov Mon Jun 5 09:11:30 2006 From: hbchen at lanl.gov (hbchen) Date: Mon, 05 Jun 2006 10:11:30 -0600 Subject: [openib-general] Question about the IPoIB bandwidth performance ? 
In-Reply-To: <7.0.1.0.2.20060605114203.043ad738@netapp.com> References: <86lksceyfl.fsf@mtl066.yok.mtl.com> <448449C3.9000705@lanl.gov> <1149520880.4510.210194.camel@hal.voltaire.com> <44844FF0.9020309@lanl.gov> <7.0.1.0.2.20060605114203.043ad738@netapp.com> Message-ID: <448457B2.6050608@lanl.gov> Talpey, Thomas wrote: >At 11:38 AM 6/5/2006, hbchen wrote: > > >>Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low. >> >> >>>>IPoIB=420MB/sec >>>>bandwidth utilization= 420/1024 = 41.01% >>>> >>>> > > >Helen, have you measured the CPU utilizations during these runs? >Perhaps you are out of CPU. > > > Tom, I am HB Chen from LANL not the Helen Chen from SNL. I didn't run out of CPU. It is about 70-80 % of CPU utilization. >Outrageous opinion follows. > >Frankly, an IB HCA running Ethernet emulation is approximately the >world's worst 10GbE adapter (not to put too fine of a point on it :-) ) > > The IP over Myrinet ( Ethernet emulation) can reach upto 96%-98% bandwidth utilization why not the IPoIB ? HB Chen hbchen at lanl.gov >There is no hardware checksumming, nor large-send offloading, both >of which force overhead onto software. And, as you just discovered >it isn't even 10Gb! > >In general, network emulation layers are always going to perform more >poorly than native implementations. But this is only a generality learned >from years of experience with them. > >Tom. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Mon Jun 5 09:11:16 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jun 2006 09:11:16 -0700 Subject: [openib-general] Re: Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: (Shirley Ma's message of "Mon, 5 Jun 2006 09:02:36 -0700") References: Message-ID: Shirley> I will apply this patch. This patch would reduce the Shirley> race, not address the problem. Does anyone know what the problem really is? I sure don't. - R. 
From Thomas.Talpey at netapp.com Mon Jun 5 09:17:20 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 05 Jun 2006 12:17:20 -0400 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: <448457B2.6050608@lanl.gov> References: <86lksceyfl.fsf@mtl066.yok.mtl.com> <448449C3.9000705@lanl.gov> <1149520880.4510.210194.camel@hal.voltaire.com> <44844FF0.9020309@lanl.gov> <7.0.1.0.2.20060605114203.043ad738@netapp.com> <448457B2.6050608@lanl.gov> Message-ID: <7.0.1.0.2.20060605121321.043ad738@netapp.com> At 12:11 PM 6/5/2006, hbchen wrote: >>Perhaps you are out of CPU. >> >> >Tom, >I am HB Chen from LANL not the Helen Chen from SNL. Oops, sorry! I have too many email messages going by. :-) HB, then. >I didn't run out of CPU. It is about 70-80 % of CPU utilization. But, is one CPU at 100%? Interrupt processing, for example. > >> >>Outrageous opinion follows. >> >>Frankly, an IB HCA running Ethernet emulation is approximately the >>world's worst 10GbE adapter (not to put too fine of a point on it :-) ) >> >The IP over Myrinet ( Ethernet emulation) can reach upto 96%-98% bandwidth utilization why not the IPoIB ? I am not familiar with the implementation Myrinet uses. In any case, I am not saying that an emulation can't reach certain goals, just that they will pretty much always be inferior to native approaches. Sometimes far inferior. Tom. From felix at chelsio.com Mon Jun 5 09:32:10 2006 From: felix at chelsio.com (Felix Marti) Date: Mon, 5 Jun 2006 09:32:10 -0700 Subject: [openib-general] Question about the IPoIB bandwidth performance ? Message-ID: <8A71B368A89016469F72CD08050AD33486F05A@maui.asicdesigners.com> ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of hbchen Sent: Monday, June 05, 2006 9:12 AM To: Talpey, Thomas Cc: openib-general at openib.org Subject: Re: [openib-general] Question about the IPoIB bandwidth performance ? 
Talpey, Thomas wrote:
At 11:38 AM 6/5/2006, hbchen wrote:
Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low.
IPoIB=420MB/sec
bandwidth utilization= 420/1024 = 41.01%

Helen, have you measured the CPU utilizations during these runs?
Perhaps you are out of CPU.

Tom,
I am HB Chen from LANL not the Helen Chen from SNL.
I didn't run out of CPU. It is about 70-80 % of CPU utilization.

Outrageous opinion follows.

Frankly, an IB HCA running Ethernet emulation is approximately the
world's worst 10GbE adapter (not to put too fine of a point on it :-) )

The IP over Myrinet ( Ethernet emulation) can reach upto 96%-98% bandwidth utilization why not the IPoIB ?

[Felix:] As pointed out earlier: it is the message rate. If you change the mtu to 1500B (instead of the non-standard 9000B Jumbo frames) performance will drop into the same range as what you see with IPoIB (limited by the receiver).

HB Chen
hbchen at lanl.gov

There is no hardware checksumming, nor large-send offloading, both
of which force overhead onto software. And, as you just discovered
it isn't even 10Gb!

In general, network emulation layers are always going to perform more
poorly than native implementations. But this is only a generality learned
from years of experience with them.

Tom.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ishai at mellanox.co.il Mon Jun 5 08:32:13 2006
From: ishai at mellanox.co.il (Ishai Rabinovitz)
Date: Mon, 5 Jun 2006 18:32:13 +0300
Subject: [openib-general] SRP [PATCH 0/4] Kernel support for removal and restoration of target
Message-ID: <20060605153213.GA7472@mellanox.co.il>

Hi Roland,

I'm sending 4 patches that implement kernel support for removal and restoration of a target (will be used by ibsrpdm). Some comments about them:

1) The first patch splits reconnect into two functions: _srp_remove_target and _srp_restore_target.
_srp_remove_target uses the functions I sent in a previous patch (Misc cleanups in ib_srp). If you want I can resend this patch without using the previous patch (but then there will be a problem with the previous patch :( ).

2) These patches implement the following behavior: When someone writes the string "remove" to /sys/class/scsi_host/host?/remove_target the corresponding target goes to a DISCONNECTED state (after closing the CM and resetting all pending requests). Now when the SCSI layer performs queuecommand to this host, SCSI_MLQUEUE_HOST_BUSY is returned. This causes the SCSI layer to wait until the target returns to the LIVE state.

This is very nice if the user that initiated the remove_target knows what he is doing and will perform a restore_target later. On the other hand it may be problematic if the target remains DISCONNECTED and user applications that try to access this target remain stuck in the kernel (in the SCSI layer).

I have several ideas on how to handle it (have a timeout after which queuecommand will return failure, try to perform a restore_target after a timeout, make sure the daemon will run a restore_target after a timeout) but I'm not sure they are the correct thing to do. I'm waiting for suggestions. In any case I believe we should apply these patches and add a solution to this problem later.

Please comment.
--
Ishai Rabinovitz

From ishai at mellanox.co.il Mon Jun 5 08:33:32 2006
From: ishai at mellanox.co.il (Ishai Rabinovitz)
Date: Mon, 5 Jun 2006 18:33:32 +0300
Subject: [openib-general] SRP [PATCH 1/4] split srp_reconnect_target
Message-ID: <20060605153332.GB7472@mellanox.co.il>

Split srp_reconnect_target into two functions, _srp_remove_target and _srp_restore_target. These functions will also be used later in the patch series to allow removal and restoration of a target from sysfs.
I made some changes in order to support this: 1) There are two new states: SRP_TARGET_DISCONNECTED - The state after _srp_remove_target was successfully executed and before _srp_restore_target is executed. SRP_TARGET_DISCONNECTING - The state while _srp_remove_target is executed. SRP_TARGET_CONNECTING is now the state while _srp_restore_target is executed. 2) The value of target->cm_id can be NULL. This happens after _srp_remove_target destroyed the old cm_id and before _srp_restore_target created the new cm_id. Signed-off-by: Ishai Rabinovitz Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-06-04 10:03:25.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-06-04 10:54:26.000000000 +0300 @@ -40,6 +40,7 @@ #include #include #include +#include #include @@ -373,7 +374,8 @@ static void srp_remove_work(void *target spin_unlock(&target->srp_host->target_lock); scsi_remove_host(target->scsi_host); - ib_destroy_cm_id(target->cm_id); + if (target->cm_id) + ib_destroy_cm_id(target->cm_id); srp_free_target_ib(target); scsi_host_put(target->scsi_host); } @@ -464,20 +466,57 @@ static void srp_reset_req(struct srp_tar srp_remove_req(target, req); } -static int srp_reconnect_target(struct srp_target_port *target) +static void srp_remove_target_port(struct srp_target_port *target) +{ + /* + * Kill our target port off. + * However, we have to defer the real removal because we might + * be in the context of the SCSI error handler now, which + * would deadlock if we call scsi_remove_host(). 
+ */ + spin_lock_irq(target->scsi_host->host_lock); + if (target->state != SRP_TARGET_REMOVED) { + target->state = SRP_TARGET_DEAD; + INIT_WORK(&target->work, srp_remove_work, target); + schedule_work(&target->work); + } + spin_unlock_irq(target->scsi_host->host_lock); +} + +static int _srp_remove_target(struct srp_target_port *target) { - struct ib_cm_id *new_cm_id; struct ib_qp_attr qp_attr; struct srp_request *req, *tmp; struct ib_wc wc; - int ret; + int ret = 0; spin_lock_irq(target->scsi_host->host_lock); - if (target->state != SRP_TARGET_LIVE) { + switch (target->state) { + case SRP_TARGET_REMOVED: + case SRP_TARGET_DEAD: + ret = -ENOENT; + break; + + case SRP_TARGET_DISCONNECTING: + case SRP_TARGET_CONNECTING: + ret = -EAGAIN; /* So that the caller will try again later - + after the connection ends one way or another */ + break; + + case SRP_TARGET_DISCONNECTED: + ret = -ENOTCONN; + break; + + case SRP_TARGET_LIVE: + break; + } + + if (ret) { spin_unlock_irq(target->scsi_host->host_lock); - return -EAGAIN; + return ret; } - target->state = SRP_TARGET_CONNECTING; + + target->state = SRP_TARGET_DISCONNECTING; spin_unlock_irq(target->scsi_host->host_lock); srp_disconnect_target(target); @@ -485,24 +525,14 @@ static int srp_reconnect_target(struct s * Now get a new local CM ID so that we avoid confusing the * target in case things are really fouled up. 
*/ - new_cm_id = ib_create_cm_id(target->srp_host->dev->dev, - srp_cm_handler, target); - if (IS_ERR(new_cm_id)) { - ret = PTR_ERR(new_cm_id); - goto err; - } ib_destroy_cm_id(target->cm_id); - target->cm_id = new_cm_id; + target->cm_id = NULL; qp_attr.qp_state = IB_QPS_RESET; ret = ib_modify_qp(target->qp, &qp_attr, IB_QP_STATE); if (ret) goto err; - ret = srp_init_qp(target, target->qp); - if (ret) - goto err; - while (ib_poll_cq(target->cq, 1, &wc) > 0) ; /* nothing */ @@ -513,6 +543,49 @@ static int srp_reconnect_target(struct s target->tx_head = 0; target->tx_tail = 0; + spin_lock_irq(target->scsi_host->host_lock); + if (target->state == SRP_TARGET_DISCONNECTING) { + ret = 0; + target->state = SRP_TARGET_DISCONNECTED; + } else + ret = -EAGAIN; + spin_unlock_irq(target->scsi_host->host_lock); + + return ret; + +err: + printk(KERN_ERR PFX "remove failed (%d), removing target port.\n", ret); + + srp_remove_target_port(target); + + return ret; +} + +static int _srp_restore_target(struct srp_target_port *target) +{ + struct ib_cm_id *new_cm_id; + int ret; + + spin_lock_irq(target->scsi_host->host_lock); + if (target->state != SRP_TARGET_DISCONNECTED) { + spin_unlock_irq(target->scsi_host->host_lock); + return -EAGAIN; + } + target->state = SRP_TARGET_CONNECTING; + spin_unlock_irq(target->scsi_host->host_lock); + + new_cm_id = ib_create_cm_id(target->srp_host->dev->dev, + srp_cm_handler, target); + if (IS_ERR(new_cm_id)) { + ret = PTR_ERR(new_cm_id); + goto err; + } + target->cm_id = new_cm_id; + + ret = srp_init_qp(target, target->qp); + if (ret) + goto err; + ret = srp_connect_target(target); if (ret) goto err; @@ -528,25 +601,22 @@ static int srp_reconnect_target(struct s return ret; err: - printk(KERN_ERR PFX "reconnect failed (%d), removing target port.\n", ret); + printk(KERN_ERR PFX "restore failed (%d), removing target port.\n", ret); - /* - * We couldn't reconnect, so kill our target port off. 
- * However, we have to defer the real removal because we might - * be in the context of the SCSI error handler now, which - * would deadlock if we call scsi_remove_host(). - */ - spin_lock_irq(target->scsi_host->host_lock); - if (target->state == SRP_TARGET_CONNECTING) { - target->state = SRP_TARGET_DEAD; - INIT_WORK(&target->work, srp_remove_work, target); - schedule_work(&target->work); - } - spin_unlock_irq(target->scsi_host->host_lock); + srp_remove_target_port(target); return ret; } +static int srp_reconnect_target(struct srp_target_port *target) +{ + int ret = _srp_remove_target(target); + if (ret && ret != -ENOTCONN) + return ret; + + return _srp_restore_target(target); +} + static int srp_map_fmr(struct srp_device *dev, struct scatterlist *scat, int sg_cnt, struct srp_request *req, struct srp_direct_buf *buf) @@ -933,6 +1003,13 @@ static int __srp_post_send(struct srp_ta return ret; } +static int srp_target_is_not_connected(struct srp_target_port *target) +{ + return (1 << target->state) & + ((1 << SRP_TARGET_CONNECTING) | (1 << SRP_TARGET_DISCONNECTING) | + (1 << SRP_TARGET_DISCONNECTED)); +} + static int srp_queuecommand(struct scsi_cmnd *scmnd, void (*done)(struct scsi_cmnd *)) { @@ -942,7 +1019,7 @@ static int srp_queuecommand(struct scsi_ struct srp_cmd *cmd; int len; - if (target->state == SRP_TARGET_CONNECTING) + if (unlikely(srp_target_is_not_connected(target))) goto err; if (target->state == SRP_TARGET_DEAD || @@ -1292,6 +1369,9 @@ static int srp_abort(struct scsi_cmnd *s printk(KERN_ERR "SRP abort called\n"); + if (srp_target_is_not_connected(target)) + return FAILED; + if (srp_find_req(target, scmnd, &req)) return FAILED; if (srp_send_tsk_mgmt(target, req, SRP_TSK_ABORT_TASK)) @@ -1320,6 +1400,9 @@ static int srp_reset_device(struct scsi_ printk(KERN_ERR "SRP reset_device called\n"); + if (srp_target_is_not_connected(target)) + return FAILED; + if (srp_find_req(target, scmnd, &req)) return FAILED; if (srp_send_tsk_mgmt(target, req, 
SRP_TSK_LUN_RESET)) @@ -1914,8 +2000,10 @@ static void srp_remove_one(struct ib_dev list_for_each_entry_safe(target, tmp_target, &host->target_list, list) { scsi_remove_host(target->scsi_host); - srp_disconnect_target(target); - ib_destroy_cm_id(target->cm_id); + if (target->cm_id) { + srp_disconnect_target(target); + ib_destroy_cm_id(target->cm_id); + } srp_free_target_ib(target); scsi_host_put(target->scsi_host); } Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.h =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.h 2006-06-04 10:02:47.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.h 2006-06-04 10:03:25.000000000 +0300 @@ -75,6 +75,8 @@ enum { enum srp_target_state { SRP_TARGET_LIVE, SRP_TARGET_CONNECTING, + SRP_TARGET_DISCONNECTED, + SRP_TARGET_DISCONNECTING, SRP_TARGET_DEAD, SRP_TARGET_REMOVED }; -- Ishai Rabinovitz From ishai at mellanox.co.il Mon Jun 5 08:34:33 2006 From: ishai at mellanox.co.il (Ishai Rabinovitz) Date: Mon, 5 Jun 2006 18:34:33 +0300 Subject: [openib-general] SRP [PATCH 2/4] remove target Message-ID: <20060605153433.GC7472@mellanox.co.il> Add support to remove_target from sysfs. 
Signed-off-by: Ishai Rabinovitz Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-06-05 16:46:55.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-06-05 17:11:11.000000000 +0300 @@ -1516,6 +1516,10 @@ return sprintf(buf, "%d\n", target->zero_req_lim); } +static ssize_t srp_remove_target(struct class_device *cdev, + const char *buf, size_t count); + +static CLASS_DEVICE_ATTR(remove_target, S_IWUSR, NULL, srp_remove_target); static CLASS_DEVICE_ATTR(id_ext, S_IRUGO, show_id_ext, NULL); static CLASS_DEVICE_ATTR(ioc_guid, S_IRUGO, show_ioc_guid, NULL); static CLASS_DEVICE_ATTR(service_id, S_IRUGO, show_service_id, NULL); @@ -1524,6 +1528,7 @@ static CLASS_DEVICE_ATTR(zero_req_lim, S_IRUGO, show_zero_req_lim, NULL); static struct class_device_attribute *srp_host_attrs[] = { + &class_device_attr_remove_target, &class_device_attr_id_ext, &class_device_attr_ioc_guid, &class_device_attr_service_id, @@ -1814,6 +1819,23 @@ static CLASS_DEVICE_ATTR(add_target, S_IWUSR, NULL, srp_create_target); +static ssize_t srp_remove_target(struct class_device *cdev, + const char *buf, size_t count) +{ + int ret; + static const char remove_str[] = "remove"; + + if (strncmp(buf, remove_str, sizeof(remove_str) - 1)) + return -EINVAL; + + ret = _srp_remove_target(host_to_target(class_to_shost(cdev))); + + if (ret) + return ret; + + return count; +} + static ssize_t show_ibdev(struct class_device *class_dev, char *buf) { struct srp_host *host = -- Ishai Rabinovitz From ishai at mellanox.co.il Mon Jun 5 08:35:17 2006 From: ishai at mellanox.co.il (Ishai Rabinovitz) Date: Mon, 5 Jun 2006 18:35:17 +0300 Subject: [openib-general] SRP [PATCH 3/4] restore target Message-ID: <20060605153517.GD7472@mellanox.co.il> Add support to restore_target from sysfs.
Signed-off-by: Ishai Rabinovitz Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-06-04 10:01:50.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-06-04 10:02:27.000000000 +0300 @@ -1551,16 +1551,21 @@ static ssize_t show_zero_req_lim(struct static ssize_t srp_remove_target(struct class_device *cdev, const char *buf, size_t count); -static CLASS_DEVICE_ATTR(remove_target, S_IWUSR, NULL, srp_remove_target); -static CLASS_DEVICE_ATTR(id_ext, S_IRUGO, show_id_ext, NULL); -static CLASS_DEVICE_ATTR(ioc_guid, S_IRUGO, show_ioc_guid, NULL); -static CLASS_DEVICE_ATTR(service_id, S_IRUGO, show_service_id, NULL); -static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); -static CLASS_DEVICE_ATTR(dgid, S_IRUGO, show_dgid, NULL); -static CLASS_DEVICE_ATTR(zero_req_lim, S_IRUGO, show_zero_req_lim, NULL); +static ssize_t srp_restore_target(struct class_device *cdev, + const char *buf, size_t count); + +static CLASS_DEVICE_ATTR(remove_target, S_IWUSR, NULL, srp_remove_target); +static CLASS_DEVICE_ATTR(restore_target, S_IWUSR, NULL, srp_restore_target); +static CLASS_DEVICE_ATTR(id_ext, S_IRUGO, show_id_ext, NULL); +static CLASS_DEVICE_ATTR(ioc_guid, S_IRUGO, show_ioc_guid, NULL); +static CLASS_DEVICE_ATTR(service_id, S_IRUGO, show_service_id, NULL); +static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); +static CLASS_DEVICE_ATTR(dgid, S_IRUGO, show_dgid, NULL); +static CLASS_DEVICE_ATTR(zero_req_lim, S_IRUGO, show_zero_req_lim, NULL); static struct class_device_attribute *srp_host_attrs[] = { &class_device_attr_remove_target, + &class_device_attr_restore_target, &class_device_attr_id_ext, &class_device_attr_ioc_guid, &class_device_attr_service_id, @@ -1861,6 +1866,17 @@ static ssize_t srp_remove_target(struct return count; } +static ssize_t srp_restore_target(struct class_device *cdev, + const char *buf, size_t count) 
+{ + int ret = _srp_restore_target(host_to_target(class_to_shost(cdev))); + + if (ret) + return ret; + + return count; +} + static ssize_t show_ibdev(struct class_device *class_dev, char *buf) { struct srp_host *host = -- Ishai Rabinovitz From ishai at mellanox.co.il Mon Jun 5 08:36:06 2006 From: ishai at mellanox.co.il (Ishai Rabinovitz) Date: Mon, 5 Jun 2006 18:36:06 +0300 Subject: [openib-general] SRP [PATCH 4/4] show_srp_state Message-ID: <20060605153606.GE7472@mellanox.co.il> Add query for srp_state in sysfs. Signed-off-by: Ishai Rabinovitz Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-05-31 18:52:14.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-06-04 14:21:52.000000000 +0300 @@ -1362,6 +1362,26 @@ static int srp_reset_host(struct scsi_cm return ret; } +static ssize_t show_srp_state(struct class_device *cdev, char *buf) +{ + struct srp_target_port *target = host_to_target(class_to_shost(cdev)); + enum srp_target_state target_state = target->state; + + static const char *state_name[] = { + [SRP_TARGET_LIVE] = "LIVE", + [SRP_TARGET_CONNECTING] = "CONNECTING", + [SRP_TARGET_DISCONNECTING] = "DISCONNECTING", + [SRP_TARGET_DISCONNECTED] = "DISCONNECTED", + [SRP_TARGET_DEAD] = "DEAD", + [SRP_TARGET_REMOVED] = "REMOVED", + }; + + if (target_state >= 0 && target_state < ARRAY_SIZE(state_name)) + return sprintf(buf, "%s\n", state_name[target_state]); + + return sprintf(buf, "UNKNOWN\n"); +} + static ssize_t show_id_ext(struct class_device *cdev, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(cdev)); @@ -1439,6 +1459,7 @@ static ssize_t show_zero_req_lim(struct static CLASS_DEVICE_ATTR(remove_target, S_IWUSR, NULL, srp_remove_target); static CLASS_DEVICE_ATTR(restore_target, S_IWUSR, NULL, srp_restore_target); +static CLASS_DEVICE_ATTR(srp_state, S_IRUGO, show_srp_state, NULL); 
static CLASS_DEVICE_ATTR(id_ext, S_IRUGO, show_id_ext, NULL); static CLASS_DEVICE_ATTR(ioc_guid, S_IRUGO, show_ioc_guid, NULL); static CLASS_DEVICE_ATTR(service_id, S_IRUGO, show_service_id, NULL); @@ -1447,6 +1468,7 @@ static CLASS_DEVICE_ATTR(dgid, S_IRUGO, static struct class_device_attribute *srp_host_attrs[] = { &class_device_attr_remove_target, &class_device_attr_restore_target, + &class_device_attr_srp_state, &class_device_attr_id_ext, &class_device_attr_ioc_guid, &class_device_attr_service_id, -- Ishai Rabinovitz From mst at mellanox.co.il Mon Jun 5 09:40:14 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 5 Jun 2006 19:40:14 +0300 Subject: [openib-general] Re: Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: References: Message-ID: <20060605164014.GA32268@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [openib-general] Re: Re: [PATCH]Repost: IPoIB skb panic > > Shirley> I will apply this patch. This patch would reduce the > Shirley> race, not address the problem. > > Does anyone know what the problem really is? I sure don't. Not me :). I suspect Shirley is seeing results of memory corruption as a result of interface getting restarted - the problem fixed by Eli's patch. -- MST From wombat2 at us.ibm.com Mon Jun 5 09:53:02 2006 From: wombat2 at us.ibm.com (Bernard King-Smith) Date: Mon, 5 Jun 2006 12:53:02 -0400 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: <20060605161150.A21DE2283DA@openib.ca.sandia.gov> Message-ID: > Thomas Talpey said: > At 11:38 AM 6/5/2006, hbchen wrote: > >Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very > low. > >>> IPoIB=420MB/sec > >>> bandwidth utilization= 420/1024 = 41.01% > > > Helen, have you measured the CPU utilizations during these runs? > Perhaps you are out of CPU. > > Outrageous opinion follows. 
> > Frankly, an IB HCA running Ethernet emulation is approximately the > world's worst 10GbE adapter (not to put too fine of a point on it :-) ) > There is no hardware checksumming, nor large-send offloading, both > of which force overhead onto software. And, as you just discovered > it isn't even 10Gb! > > In general, network emulation layers are always going to perform more > poorly than native implementations. But this is only a generality learned > from years of experience with them. > > Tom. Hold on here.... Who said anything about Ethernet emulation? Hal said he is running straight netperf over IB, not Ethernet emulation. I don't think that any IB HCAs today support offloaded checksum and large send. You are comparing apples and oranges. The only appropriate comparison is to use the IBM HCA compared to the mthca adapters. I think Hal's point is actually comparing "any" IB adapter against GigE and Myrinet. Both the mthca and IBM HCAs should get similar IPoIB performance using identical OpenIB stacks. Bernie King-Smith IBM Corporation Server Group Cluster System Performance wombat2 at us.ibm.com (845)433-8483 Tie. 293-8483 or wombat2 on NOTES "We are not responsible for the world we are born into, only for the world we leave when we die. So we have to accept what has gone before us and work to change the only thing we can, -- The Future."
William Shatner openib-general-re quest at openib.org Sent by: To openib-general-bo openib-general at openib.org unces at openib.org cc Subject 06/05/2006 12:11 openib-general Digest, Vol 24, PM Issue 22 Please respond to openib-general at op enib.org Send openib-general mailing list submissions to openib-general at openib.org To subscribe or unsubscribe via the World Wide Web, visit http://openib.org/mailman/listinfo/openib-general or, via email, send a message with subject or body 'help' to openib-general-request at openib.org You can reach the person managing the list at openib-general-owner at openib.org When replying, please edit your Subject line so it is more specific than "Re: Contents of openib-general digest..." Today's Topics: 1. Re: Question about the IPoIB bandwidth performance ? (hbchen) 2. Re: [PATCH] osm: trivial missing header files fix (Hal Rosenstock) 3. Re: [PATCH] osm: trivial missing cast in osmt_service call for memcmp (Hal Rosenstock) 4. Re: Question about the IPoIB bandwidth performance ? (Bernard King-Smith) 5. Re: Re: [PATCH]Repost: IPoIB skb panic (Shirley Ma) 6. Re: [PATCHv2 1/2] resend: mthca support for max_map_per_fmr device attribute (Roland Dreier) 7. Re: Question about the IPoIB bandwidth performance ? (Talpey, Thomas) 8. Re: Question about the IPoIB bandwidth performance ? (hbchen) ----- Message from "hbchen" on Mon, 05 Jun 2006 09:38:24 -0600 ----- To: "Hal Rosenstock" cc: "OPENIB" Subject: Re: [openib-general] Question about the IPoIB bandwidth performance ? Hal Rosenstock wrote: On Mon, 2006-06-05 at 11:12, hbchen wrote: Hi, I have a question about the IPoIB bandwidth performance. I did netperf testing using Single GiGE, Myrinet D card, Myrinet 10G ethernet card, and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface). 
NIC (Jumbo enabled)       Line bandwidth (LB)      IP-over-NIC      Bandwidth utilization (IPoNIC/LB)
---------------------     --------------------     -------------    ---------------------------------
Single Gigabit NIC        1Gb/sec = 125MB/sec      120MB/sec        96% (PCI-X interface)
Myrinet D card            250MB/sec                240~245MB/sec    96% ~ 98% (PCI-X interface)
Myrinet 10G Ethernet      10Gb/sec = 1280MB/sec    980MB/sec        76.6% (my testing, Linux 2.6.14.6)
  (PCI-Express)                                    1225MB/sec       95.7% (data from Myrinet website)
IB HCA4X (PCI-Express)    10Gb/sec = 1280MB/sec    420MB/sec        32.8% (my testing, Linux 2.6.14.6)
                                                   474MB/sec        37% (best from OpenIB mailing list, 2.6.12-rc5 patch 1)
Why is the bandwidth utilization of IPoIB so low compared to the other NICs? One thing to note is that the max utilization of 10G IB (4x) is 8G due to the signalling being included in this rate (unlike Ethernet, whose rate represents the data rate and does not include the signalling overhead). Hal, Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low. >> IPoIB=420MB/sec >> bandwidth utilization= 420/1024 = 41.01% HB -- Hal There must be a lot of room to improve the IPoIB software to reach 75%+ bandwidth utilization. HB Chen Los Alamos National Lab hbchen at lanl.gov _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ----- Message from "Hal Rosenstock" on 05 Jun 2006 11:34:50 -0400 ----- To: "Eitan Zahavi" cc: "OPENIB" Subject: [openib-general] Re: [PATCH] osm: trivial missing header files fix On Mon, 2006-06-05 at 08:51, Eitan Zahavi wrote: > Hi Hal > > While cleaning up compilation warnings I found missing includes in > various sources. > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied to trunk only.
-- Hal ----- Message from "Hal Rosenstock" on 05 Jun 2006 11:45:28 -0400 ----- To: "Eitan Zahavi" cc: "OPENIB" Subject [openib-general] Re: [PATCH] osm: trivial missing cast in : osmt_service call for memcmp Hi Eitan, On Mon, 2006-06-05 at 08:59, Eitan Zahavi wrote: > Hi Hal > > Last one of my cleaning up compilation warnings I found a missing > cast in osmtest service name compare. > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied to trunk only. -- Hal ----- Message from "Bernard King-Smith" on Mon, 5 Jun 2006 11:54:42 -0400 ----- To: openib-general at openib.org Subject: Re: [openib-general] Question about the IPoIB bandwidth performance ? Hal Rosenstock wrote: > On Mon, 2006-06-05 at 11:12, hbchen wrote: > > Hi, > > I have a question about the IPoIB bandwidth performance. > > I did netperf testing using Single GiGE, Myrinet D card, Myrinet 10G > > ethernet card, > > and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface). > > > > > > NIC (Jumbo enabled) Line bandwidth(LB) IPoverNIC bandwidth utilization > > (IPoNIC/LB) > > --------------------- ---------------- -------------- > > ---------------------------------- > > Single Gigabit NIC : 1Gb/sec=125MB/sec 120MB/sec 96% (PIC-X interface) > > Myrinet D card : 250MB/sec 240~-245MB/sec 96% ~ 98% (PCI-X interface) > > Myrinet 10G Ethernet: 10Gb/sec=1280MB/sec 980MB/sec 76.6% (My testing > > > using Linux 2.6.14.6) > > (PCI-Express) 1225MB/sec 95.7% (Data from Myrinet website) > > IB HCA4X(PCI-Express): 10Gb/sec=1280MB/sec 420MB/sec 32.8% (My testing > > using Linux 2.6.14.6) > > 474MB/sec 37% (the best from OpenIB mailing list) > > (2.6.12-rc5 patch 1) > > > > Why the bandwidth utilization of IPoIB is so low compared to the others > > NICs? > > One thing to note is that the max utilization of 10G IB (4x) is 8G due > to the signalling being included in this rate (unlike ethernet whose > rate represents the data rate and does not include the signalling > overhead). 
> > -- Hal > You also have larger IP packets when you use GigE (especially in large send/offload) and Myrinet. I think Myrinet uses a 60K MTU and for GigE, without large send you get a 9000 MTU. With large send you get a 64K buffer to the adapter, so fragmentation to 1500/9000 IP packets is offloaded in the adapter. Currently with IPoIB using UD mode, you have to generate lots of 2K packets. With serialized IPoIB drivers you end up bottlenecking on a single CPU. There is an IPoIB-CM IETF spec out which should significantly improve IPoIB performance if implemented. > > There must be a lot of room to improve the IPoIB software to reach 75%+ > > bandwidth utilization. > > > > > > HB Chen > > Los Alamos National Lab > > hbchen at lanl.gov > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general Bernie King-Smith IBM Corporation Server Group Cluster System Performance wombat2 at us.ibm.com (845)433-8483 Tie. 293-8483 or wombat2 on NOTES "We are not responsible for the world we are born into, only for the world we leave when we die. So we have to accept what has gone before us and work to change the only thing we can, -- The Future." William Shatner ----- Message from "Shirley Ma" on Mon, 5 Jun 2006 09:02:36 -0700 ----- To: "Michael S. Tsirkin" cc: "Roland Dreier" , mashirle at us.ibm.com, openib-general at openib.org Subject: [openib-general] Re: Re: [PATCH]Repost: IPoIB skb panic Michael, I will apply this patch. This patch would reduce the race, not address the problem.
Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 ----- Message from "Roland Dreier" on Mon, 05 Jun 2006 09:01:14 -0700 ----- To: "Or Gerlitz" cc: openib-general at openib.org Subject: [openib-general] Re: [PATCHv2 1/2] resend: mthca support for max_map_per_fmr device attribute > Yes it makes sense, but the check should be > > if (!(dev->mthca_flags & MTHCA_FLAG_SINAI_OPT)) > > instead of > > if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) Yep, you're right, I got it backwards. > also, what about the other patch which changes fmr_pool.c to query the > device, have you got (reviewed/accepted) it? I have modified it to > allocate the device attr struct on the heap as you have asked. It looks fine. I was just reviewing everything together. - R. ----- Message from "Talpey, Thomas" on Mon, 05 Jun 2006 11:52:03 -0400 ----- To: "hbchen" cc: openib-general at openib.org Subject: Re: [openib-general] Question about the IPoIB bandwidth performance ? At 11:38 AM 6/5/2006, hbchen wrote: >Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low. >>> IPoIB=420MB/sec >>> bandwidth utilization= 420/1024 = 41.01% Helen, have you measured the CPU utilizations during these runs? Perhaps you are out of CPU. Outrageous opinion follows. Frankly, an IB HCA running Ethernet emulation is approximately the world's worst 10GbE adapter (not to put too fine of a point on it :-) ) There is no hardware checksumming, nor large-send offloading, both of which force overhead onto software. And, as you just discovered it isn't even 10Gb! In general, network emulation layers are always going to perform more poorly than native implementations. But this is only a generality learned from years of experience with them. Tom.
----- Message from "hbchen" on Mon, 05 Jun 2006 10:11:30 -0600 ----- To: "Talpey, Thomas" cc: openib-general at openib.org Subject: Re: [openib-general] Question about the IPoIB bandwidth performance ? Talpey, Thomas wrote: At 11:38 AM 6/5/2006, hbchen wrote: Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low. IPoIB=420MB/sec bandwidth utilization= 420/1024 = 41.01% Helen, have you measured the CPU utilizations during these runs? Perhaps you are out of CPU. Tom, I am HB Chen from LANL, not the Helen Chen from SNL. I didn't run out of CPU. It is about 70-80% CPU utilization. Outrageous opinion follows. Frankly, an IB HCA running Ethernet emulation is approximately the world's worst 10GbE adapter (not to put too fine of a point on it :-) ) The IP over Myrinet (Ethernet emulation) can reach up to 96%-98% bandwidth utilization; why not IPoIB? HB Chen hbchen at lanl.gov There is no hardware checksumming, nor large-send offloading, both of which force overhead onto software. And, as you just discovered it isn't even 10Gb! In general, network emulation layers are always going to perform more poorly than native implementations. But this is only a generality learned from years of experience with them. Tom. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general From hycsw at ca.sandia.gov Mon Jun 5 09:55:11 2006 From: hycsw at ca.sandia.gov (Helen Chen) Date: Mon, 5 Jun 2006 09:55:11 -0700 (PDT) Subject: [openib-general] Question about the IPoIB bandwidth performance ? Message-ID: <200606051655.JAA18854@ca.sandia.gov> Tom, We are in the process of measuring the CPU utilization on our NFS/RDMA experiments in contrast with regular NFS; we also intend to include netperf numbers and will keep you posted with our results as soon as possible.
Helen ----- original Message ----- >From openib-general-bounces at openib.org Mon Jun 5 09:03:56 2006 Helen, have you measured the CPU utilizations during these runs? Perhaps you are out of CPU. Outrageous opinion follows. Frankly, an IB HCA running Ethernet emulation is approximately the world's worst 10GbE adapter (not to put too fine of a point on it :-) ) There is no hardware checksumming, nor large-send offloading, both of which force overhead onto software. And, as you just discovered it isn't even 10Gb! In general, network emulation layers are always going to perform more poorly than native implementations. But this is only a generality learned from years of experience with them. Tom. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Thomas.Talpey at netapp.com Mon Jun 5 10:08:17 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 05 Jun 2006 13:08:17 -0400 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: References: <20060605161150.A21DE2283DA@openib.ca.sandia.gov> Message-ID: <7.0.1.0.2.20060605130607.086feab0@netapp.com> >Who said anything about Ethernnet emulation. Hal said he is running >straight Netperf over IB not ethernet emulation. I don't think that any IB >HCAs today support offloaded checksum and large send. You are comparing >apples and oranges. I consider IPoIB to be Ethernet emulation. As for apples and oranges, my point exactly. Tom. At 12:53 PM 6/5/2006, Bernard King-Smith wrote: >> Thomas Talpey said: >> At 11:38 AM 6/5/2006, hbchen wrote: >> >Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth >utilization is still very > low. >> >>> IPoIB=420MB/sec >> >>> bandwidth utilization= 420/1024 = 41.01% >> >> >> Helen, have you measured the CPU utilizations during these runs? 
>> Perhaps you are out of CPU. >> >> Outrageous opinion follows. >> >> Frankly, an IB HCA running Ethernet emulation is approximately the >> world's worst 10GbE adapter (not to put too fine of a point on it :-) ) >> There is no hardware checksumming, nor large-send offloading, both >> of which force overhead onto software. And, as you just discovered >> it isn't even 10Gb! >> >> In general, network emulation layers are always going to perform more >> poorly than native implementations. But this is only a generality learned >> from years of experience with them >> >> Tom. > >Hold on here.... > >Who said anything about Ethernnet emulation. Hal said he is running >straight Netperf over IB not ethernet emulation. I don't think that any IB >HCAs today support offloaded checksum and large send. You are comparing >apples and oranges. The only appropriate comparison is to use the IBM HCA >compared to the mthca adapters. I think Hal's point is actually comparing >"any" IB adapter against GigE and Myrinet. Both the mthca and IBM HCA's >should get similar IPoIB performance using identical OpenIB stacks. > > >Bernie King-Smith >IBM Corporation >Server Group >Cluster System Performance >wombat2 at us.ibm.com (845)433-8483 >Tie. 293-8483 or wombat2 on NOTES > >"We are not responsible for the world we are born into, only for the world >we leave when we die. >So we have to accept what has gone before us and work to change the only >thing we can, >-- The Future." 
William Shatner > > > > openib-general-re > quest at openib.org > Sent by: To > openib-general-bo openib-general at openib.org > unces at openib.org cc > > Subject > 06/05/2006 12:11 openib-general Digest, Vol 24, > PM Issue 22 > > > Please respond to > openib-general at op > enib.org > > > > > > >Send openib-general mailing list submissions to > openib-general at openib.org > >To subscribe or unsubscribe via the World Wide Web, visit > http://openib.org/mailman/listinfo/openib-general >or, via email, send a message with subject or body 'help' to > openib-general-request at openib.org > >You can reach the person managing the list at > openib-general-owner at openib.org > >When replying, please edit your Subject line so it is more specific >than "Re: Contents of openib-general digest..." >Today's Topics: > > 1. Re: Question about the IPoIB bandwidth performance ? >(hbchen) > 2. Re: [PATCH] osm: trivial missing header files fix (Hal Rosenstock) > 3. Re: [PATCH] osm: trivial missing cast in osmt_service call > for memcmp (Hal Rosenstock) > 4. Re: Question about the IPoIB bandwidth performance ? > (Bernard King-Smith) > 5. Re: Re: [PATCH]Repost: IPoIB skb panic (Shirley Ma) > 6. Re: [PATCHv2 1/2] resend: mthca support for >max_map_per_fmr > device attribute (Roland Dreier) > 7. Re: Question about the IPoIB bandwidth performance ? > (Talpey, Thomas) > 8. Re: Question about the IPoIB bandwidth performance ? (hbchen) > >----- Message from "hbchen" on Mon, 05 Jun 2006 09:38:24 >-0600 ----- > > To: "Hal Rosenstock" > > cc: "OPENIB" > > Subject: Re: [openib-general] Question about the IPoIB bandwidth > performance ? > > >Hal Rosenstock wrote: > On Mon, 2006-06-05 at 11:12, hbchen wrote: > > Hi, > I have a question about the IPoIB bandwidth performance. > I did netperf testing using Single GiGE, Myrinet D card, > Myrinet 10G > ethernet card, > and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface). 
> > > NIC (Jumbo enabled) Line bandwidth(LB) IPoverNIC bandwidth > utilization > (IPoNIC/LB) > --------------------- ---------------- -------------- > ---------------------------------- > Single Gigabit NIC : 1Gb/sec=125MB/sec 120MB/sec 96% (PCI-X > interface) > Myrinet D card : 250MB/sec 240~245MB/sec 96% ~ 98% (PCI-X > interface) > Myrinet 10G Ethernet: 10Gb/sec=1280MB/sec 980MB/sec 76.6% (My > testing > using Linux 2.6.14.6) > (PCI-Express) 1225MB/sec 95.7% (Data from Myrinet website) > IB HCA4X(PCI-Express): 10Gb/sec=1280MB/sec 420MB/sec 32.8% (My > testing > using Linux 2.6.14.6) > 474MB/sec 37% (the best from OpenIB mailing list) > (2.6.12-rc5 patch 1) > > Why is the bandwidth utilization of IPoIB so low compared to > the other > NICs? > > > One thing to note is that the max utilization of 10G IB (4x) is 8G > due > to the signalling being included in this rate (unlike ethernet whose > rate represents the data rate and does not include the signalling > overhead). > >Hal, >Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth >utilization is still very low. >>> IPoIB=420MB/sec >>> bandwidth utilization= 420/1024 = 41.01% > > >HB > > > > > -- Hal > > > There must be a lot of room to improve the IPoIB software to > reach 75%+ > bandwidth utilization. > > > HB Chen > Los Alamos National Lab > hbchen at lanl.gov > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > > > >----- Message from "Hal Rosenstock" on 05 Jun 2006 >11:34:50 -0400 ----- > > To: "Eitan Zahavi" > > cc: "OPENIB" > > Subject: [openib-general] Re: [PATCH] osm: trivial missing header files > fix > > >On Mon, 2006-06-05 at 08:51, Eitan Zahavi wrote: >> Hi Hal >> >> Cleaning up compilation warnings I found some missing includes in >> various sources. 
>> >> Eitan >> >> Signed-off-by: Eitan Zahavi > >Thanks. Applied to trunk only. > >-- Hal > > > >----- Message from "Hal Rosenstock" on 05 Jun 2006 >11:45:28 -0400 ----- > > To: "Eitan Zahavi" > > cc: "OPENIB" > > Subject: [openib-general] Re: [PATCH] osm: trivial missing cast in > osmt_service call for memcmp > > >Hi Eitan, > >On Mon, 2006-06-05 at 08:59, Eitan Zahavi wrote: >> Hi Hal >> >> In the last of my compilation warning cleanups I found a missing >> cast in the osmtest service name compare. >> >> Eitan >> >> Signed-off-by: Eitan Zahavi > >Thanks. Applied to trunk only. > >-- Hal > > > >----- Message from "Bernard King-Smith" on Mon, 5 Jun >2006 11:54:42 -0400 ----- > > To: openib-general at openib.org > > Subject: Re: [openib-general] Question about the IPoIB bandwidth > performance ? > > >Hal Rosenstock wrote: > >> On Mon, 2006-06-05 at 11:12, hbchen wrote: >> > Hi, >> > I have a question about the IPoIB bandwidth performance. >> > I did netperf testing using Single GiGE, Myrinet D card, Myrinet 10G >> > ethernet card, >> > and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface). >> > >> > >> > NIC (Jumbo enabled) Line bandwidth(LB) IPoverNIC bandwidth utilization >> > (IPoNIC/LB) >> > --------------------- ---------------- -------------- >> > ---------------------------------- >> > Single Gigabit NIC : 1Gb/sec=125MB/sec 120MB/sec 96% (PCI-X interface) >> > Myrinet D card : 250MB/sec 240~245MB/sec 96% ~ 98% (PCI-X interface) >> > Myrinet 10G Ethernet: 10Gb/sec=1280MB/sec 980MB/sec 76.6% (My testing >> > > using Linux 2.6.14.6) >> > (PCI-Express) 1225MB/sec 95.7% (Data from Myrinet website) >> > IB HCA4X(PCI-Express): 10Gb/sec=1280MB/sec 420MB/sec 32.8% (My testing >> > using Linux 2.6.14.6) >> > 474MB/sec 37% (the best from OpenIB mailing list) >> > (2.6.12-rc5 patch 1) >> > >> > Why is the bandwidth utilization of IPoIB so low compared to the other >> > NICs? 
>> >> One thing to note is that the max utilization of 10G IB (4x) is 8G due >> to the signalling being included in this rate (unlike ethernet whose >> rate represents the data rate and does not include the signalling >> overhead). >> >> -- Hal >> > >You also have larger IP packets when you use GigE ( especially in large >send/offload ) and Myrinet. I think Myrinet uses a 60K MTU and for GigE, >without large send you get a 9000 MTU. With large send you get a 64K buffer >to the adapter so fragmentation to 1500/9000 IP packets is offloaded in the >adapter. > >Currently with IPoIB using UD mode, you have to generate lots of 2K >packets. With serialized IPoIB drivers you end up bottlenecking on a single >CPU. There is an IPoIB-CM IETF spec out which should significantly improve >IPoIB performance if implemented. > >> > There must be a lot of room to improve the IPoIB software to reach 75%+ >> > bandwidth utilization. >> > >> > >> > HB Chen >> > Los Alamos National Lab >> > hbchen at lanl.gov >> > >> > _______________________________________________ >> > openib-general mailing list >> > openib-general at openib.org >> > http://openib.org/mailman/listinfo/openib-general >> > >> > To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general >> > > > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > > >Bernie King-Smith >IBM Corporation >Server Group >Cluster System Performance >wombat2 at us.ibm.com (845)433-8483 >Tie. 293-8483 or wombat2 on NOTES > >"We are not responsible for the world we are born into, only for the world >we leave when we die. >So we have to accept what has gone before us and work to change the only >thing we can, >-- The Future." William Shatner > > > > >----- Message from "Shirley Ma" on Mon, 5 Jun 2006 >09:02:36 -0700 ----- > > To: "Michael S. 
Tsirkin" > > cc: "Roland Dreier" , mashirle at us.ibm.com, > openib-general at openib.org > > Subject: [openib-general] Re: Re: [PATCH]Repost: IPoIB skb panic > > >Michael, > >I will apply this patch. This patch would reduce the race, not address the >problem. > >Thanks >Shirley Ma >IBM Linux Technology Center >15300 SW Koll Parkway >Beaverton, OR 97006-6063 >Phone(Fax): (503) 578-7638 >----- Message from "Roland Dreier" on Mon, 05 Jun 2006 >09:01:14 -0700 ----- > > To: "Or Gerlitz" > > cc: openib-general at openib.org > > Subject: [openib-general] Re: [PATCHv2 1/2] resend: mthca support for > max_map_per_fmr device attribute > > > > Yes it makes sense, but the check should be > > > > if (!(dev->mthca_flags & MTHCA_FLAG_SINAI_OPT)) > > > > instead of > > > > if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) > >Yep, you're right, I got it backwards. > > > Also, what about the other patch which changes fmr_pool.c to query the > > device, have you got (reviewed/accepted) it? I have modified it to > > allocate the device attr struct on the heap as you have asked. > >It looks fine. I was just reviewing everything together. > > - R. > > >----- Message from "Talpey, Thomas" on Mon, 05 >Jun 2006 11:52:03 -0400 ----- > > To: "hbchen" > > cc: openib-general at openib.org > > Subject: Re: [openib-general] Question about the IPoIB bandwidth > performance ? > > >At 11:38 AM 6/5/2006, hbchen wrote: >>Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth >utilization is still very low. >>>> IPoIB=420MB/sec >>>> bandwidth utilization= 420/1024 = 41.01% > > >Helen, have you measured the CPU utilizations during these runs? >Perhaps you are out of CPU. > >Outrageous opinion follows. > >Frankly, an IB HCA running Ethernet emulation is approximately the >world's worst 10GbE adapter (not to put too fine of a point on it :-) ) >There is no hardware checksumming, nor large-send offloading, both >of which force overhead onto software. 
And, as you just discovered >it isn't even 10Gb! > >In general, network emulation layers are always going to perform more >poorly than native implementations. But this is only a generality learned >from years of experience with them. > >Tom. > > > >----- Message from "hbchen" on Mon, 05 Jun 2006 10:11:30 >-0600 ----- > > To: "Talpey, Thomas" > > cc: openib-general at openib.org > > Subject: Re: [openib-general] Question about the IPoIB bandwidth > performance ? > > >Talpey, Thomas wrote: > At 11:38 AM 6/5/2006, hbchen wrote: > > Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB > bandwidth utilization is still very low. > > IPoIB=420MB/sec > bandwidth utilization= 420/1024 = 41.01% > > > > Helen, have you measured the CPU utilizations during these runs? > Perhaps you are out of CPU. > > >Tom, >I am HB Chen from LANL, not the Helen Chen from SNL. >I didn't run out of CPU. It is about 70-80% of CPU utilization. > > Outrageous opinion follows. > > Frankly, an IB HCA running Ethernet emulation is approximately the > world's worst 10GbE adapter (not to put too fine of a point on it :-) > ) > >The IP over Myrinet (Ethernet emulation) can reach up to 96%-98% bandwidth >utilization; why not IPoIB? > >HB Chen >hbchen at lanl.gov > There is no hardware checksumming, nor large-send offloading, both > of which force overhead onto software. And, as you just discovered > it isn't even 10Gb! > > In general, network emulation layers are always going to perform more > poorly than native implementations. But this is only a generality > learned > from years of experience with them. > > Tom. 
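The rate arithmetic being traded back and forth in this thread is easy to sanity-check. The sketch below follows the thread's own conventions (IB gigabits converted at 128 MB per Gb, and an 8b/10b encoding factor for 4X SDR links, which is why Hal's 10 Gb/s becomes 8 Gb/s of data); the function name is illustrative, not from any OpenIB tool.

```python
# Sanity check of the utilization figures quoted above.
# 4X SDR InfiniBand signals at 10 Gb/s, but 8b/10b encoding means only
# 8 of every 10 bits on the wire carry data, so the usable rate is 8 Gb/s.
MB_PER_GB = 128          # conversion used in the table above (8 Gb/s = 1024 MB/s)
ENCODING_8B10B = 0.8     # data bits per signalled bit on SDR links

def ipoib_utilization(measured_mb_s, signal_rate_gb_s=10):
    """Percent of the post-encoding data rate that IPoIB actually delivers."""
    data_rate_mb_s = signal_rate_gb_s * ENCODING_8B10B * MB_PER_GB
    return 100.0 * measured_mb_s / data_rate_mb_s

# 420 MB/s measured over a nominal 10 Gb/s (8 Gb/s data) 4X link:
print(round(ipoib_utilization(420), 2))   # ~41.02, matching the ~41% above
```

The same function applied to the 474 MB/s "best from the OpenIB mailing list" figure gives about 46% of the post-encoding rate, still well short of the 96%+ that GigE and Myrinet D reach in the table.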
> > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Mon Jun 5 10:16:13 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jun 2006 10:16:13 -0700 Subject: [openib-general] Re: [PATCHv2 2/2] resend: port the fmr pool to use the max_map_per_fmr device attribute In-Reply-To: (Or Gerlitz's message of "Tue, 30 May 2006 09:23:41 +0300 (IDT)") References: Message-ID: Thanks, applied both patches. From rdreier at cisco.com Mon Jun 5 10:21:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jun 2006 10:21:33 -0700 Subject: [openib-general] [git pull] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This has a one-line bug fix: Eli Cohen: IPoIB: Fix AH leak at interface down drivers/infiniband/ulp/ipoib/ipoib_ib.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index a54da42..8406839 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -275,6 +275,7 @@ static void ipoib_ib_handle_wc(struct ne spin_lock_irqsave(&priv->tx_lock, flags); ++priv->tx_tail; if (netif_queue_stopped(dev) && + test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags) && priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) netif_wake_queue(dev); spin_unlock_irqrestore(&priv->tx_lock, flags); From somenath at veritas.com Mon Jun 5 10:50:55 2006 From: somenath at veritas.com (somenath) Date: Mon, 05 Jun 2006 10:50:55 -0700 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060605081948.044849d0@netapp.com> References: 
<7.0.1.0.2.20060605081948.044849d0@netapp.com> Message-ID: <44846EFF.2020705@veritas.com> Talpey, Thomas wrote: >At 10:03 AM 6/3/2006, Rimmer, Todd wrote: > > >>>Yes, the limit of outstanding RDMAs is not related to the send queue >>>depth. Of course you can post many more than 4 RDMAs to a send queue >>>-- the HCA just won't have more than 4 requests outstanding at a time. >>> >>> >>To further clarify, this parameter only affects the number of concurrent >>outstanding RDMA Reads which the HCA will process. Once it hits this >>limit, the send Q will stall waiting for issued reads to complete prior >>to initiating new reads. >> >> > >It's worse than that - the send queue must stall for *all* operations. >Otherwise the hardware has to track in-progress operations which are >queued after stalled ones. It really breaks the initiation model. > > The possibility of stalling is scary! Is there any way one can figure out: 1. the number of outstanding sends at a given point of time in the send Q? 2. the maximum number of outstanding sends ever posted (during the lifetime of the Q)? It's possible to measure those in ULPs, but then that may not match exactly what is seen in the real Q... so, is there any low-level tool to measure this? thanks, som. >Semantically, the provider is not required to provide any such flow control >behavior by the way. The Mellanox one apparently does, but it is not >a requirement of the verbs, it's a requirement on the upper layer. If more >RDMA Reads are posted than the remote peer supports, the connection >may break. > > > >>The number of outstanding RDMA Reads is negotiated by the CM during >>connection establishment and the QP which is sending the RDMA Read must >>have a value configured for this parameter which is <= the remote end's >>capability. >> >> > >In other words, we're probably stuck at 4. :-) I don't think there is any >Mellanox-based implementation that has ever supported > 4. 
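The head-of-line stall Todd and Tom describe can be modeled in a few lines. The sketch below is a toy model, not verbs code (the class and method names are made up): it caps in-flight RDMA Reads at the negotiated limit of 4 and shows that a later send stalls behind queued reads even though the send itself needs no read resources.

```python
from collections import deque

class SendQueue:
    """Toy model of an in-order send queue with a cap on in-flight RDMA Reads."""

    def __init__(self, max_rd_atomic=4):
        self.max_rd_atomic = max_rd_atomic
        self.inflight_reads = 0
        self.pending = deque()   # posted but not yet issued to the wire
        self.issued = []         # work requests issued, in posting order

    def post(self, wr):
        """ULP posts a work request ('rdma_read' or 'send')."""
        self.pending.append(wr)
        self._issue()

    def complete_read(self):
        """A read response came back, freeing one read slot."""
        self.inflight_reads -= 1
        self._issue()

    def _issue(self):
        # The queue is processed strictly in order, so a stalled read at
        # the head blocks *everything* behind it -- the behavior Tom notes.
        while self.pending:
            wr = self.pending[0]
            if wr == "rdma_read":
                if self.inflight_reads >= self.max_rd_atomic:
                    return
                self.inflight_reads += 1
            self.issued.append(self.pending.popleft())

q = SendQueue(max_rd_atomic=4)
for _ in range(6):
    q.post("rdma_read")
q.post("send")                     # queued behind the 5th and 6th reads
assert len(q.issued) == 4          # only 4 reads made it to the wire
assert q.pending[-1] == "send"     # the send is stuck, needing no read slot
q.complete_read()                  # one read completes -> the 5th read issues
```

Counting this way in a ULP (increment at post, decrement at completion) is exactly the approximation som describes; it tracks what was handed to the HCA, not what the HCA has actually issued, which is why it can differ from the real queue state.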
> > > >>In previous testing by Mellanox on SDR HCAs they indicated values beyond >>2-4 did not improve performance (and in fact required more RDMA >>resources be allocated for the corresponding QP or HCA). Hence I >>suspect a very large value like 128 would offer no improvement over >>values in the 2-8 range. >> >> > >I am not so sure of that. For one thing, it's dependent on VERY small >latencies. The presence of a switch, or link extenders will make a huge >difference. Second, heavy multi-QP firmware loads will increase the >latencies. Third, constants are pretty much never a good idea in >networking. > >The NFS/RDMA client tries to set the maximum IRD value it can obtain. >RDMA Read is used quite heavily by the server to fetch client data >segments for NFS writes. > >Tom. > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From parks at lanl.gov Mon Jun 5 11:16:31 2006 From: parks at lanl.gov (Parks Fields) Date: Mon, 05 Jun 2006 12:16:31 -0600 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: <7.0.1.0.2.20060605130607.086feab0@netapp.com> References: <20060605161150.A21DE2283DA@openib.ca.sandia.gov> <7.0.1.0.2.20060605130607.086feab0@netapp.com> Message-ID: <7.0.1.0.2.20060605120638.025f6270@lanl.gov> > >I consider IPoIB to be Ethernet emulation. > >As for apples and oranges, my point exactly. It is not really about comparisons. Here at LANL we have an environment where all our new Clusters have to mount our global parallel file system Panasas. It is ethernet and will be for a while. Cluster interconnect is IB and the compute nodes do NOT have ethernet, so we created i-o nodes to "bridge " IB to ethernet. 
Compute node----IB---i/o node---10gig---ethernet switch ---- panasas We like to match / balance the network bandwidth to storage bandwidth plus try to achieve 1GB/sec per TF of the machine. EX: 50TF machine = 50 GB/sec of storage bandwidth needed. So if IPoIB would give us ~700 MB/sec and came out the other side with 10gigE at ~800 that would be nice. Hope this helps. We are now trying to find out if SDP will work end-to-end. thanks parks ***** Correspondence ***** This email contains no programmatic content that requires independent ADC review From swise at opengridcomputing.com Mon Jun 5 11:18:17 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 05 Jun 2006 13:18:17 -0500 Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: References: <1149285832.11187.33.camel@stevo-desktop> Message-ID: <1149531497.2766.12.camel@stevo-desktop> On Mon, 2006-06-05 at 00:43 -0400, Sundeep Narravula wrote: > Hi Steve, > We are trying the new iwarp branch on ammasso adapters. The installation > has gone fine. However, on running rping there is an error during > the disconnect phase. > > $ rping -c -vV -C4 -S4 -a 150.10.108.100 -p 9999 > libibverbs: Warning: no userspace device-specific driver found for uverbs1 > driver search path: /usr/local/lib/infiniband > libibverbs: Warning: no userspace device-specific driver found for uverbs0 > driver search path: /usr/local/lib/infiniband > ping data: rdm > ping data: rdm > ping data: rdm > ping data: rdm > cq completion failed status 5 > DISCONNECT EVENT... > *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** > Aborted > > There are no apparent errors showing up in dmesg. Is this error > currently expected? > > Thanks, > --Sundeep. > The cq completion failure is expected (rping doesn't try to gracefully close down). But the glibc error sounds like a bug. Can you try this on an IB transport? Also, why are you getting "no driver found" errors for uverbs0 and uverbs1? Are these amso devices? 
Boyd, can you please try and reproduce this here? Steve. From swise at opengridcomputing.com Mon Jun 5 11:32:53 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 05 Jun 2006 13:32:53 -0500 Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: References: <1149285832.11187.33.camel@stevo-desktop> Message-ID: <1149532374.15071.1.camel@stevo-desktop> By the way, I assume you configured, rebuilt and reinstalled libibverbs, librdmacm, and libamso? I do not see this on my systems using a 2.6.16.5 kernel on a SUSE 9.2 distro. What distro/kernel versions? Thanx, Steve. On Mon, 2006-06-05 at 00:43 -0400, Sundeep Narravula wrote: > Hi Steve, > We are trying the new iwarp branch on ammasso adapters. The installation > has gone fine. However, on running rping there is an error during > the disconnect phase. > > $ rping -c -vV -C4 -S4 -a 150.10.108.100 -p 9999 > libibverbs: Warning: no userspace device-specific driver found for uverbs1 > driver search path: /usr/local/lib/infiniband > libibverbs: Warning: no userspace device-specific driver found for uverbs0 > driver search path: /usr/local/lib/infiniband > ping data: rdm > ping data: rdm > ping data: rdm > ping data: rdm > cq completion failed status 5 > DISCONNECT EVENT... > *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** > Aborted > > There are no apparent errors showing up in dmesg. Is this error > currently expected? > > Thanks, > --Sundeep. > > On Fri, 2 Jun 2006, Steve Wise wrote: > > > Hello, > > > > The gen2 iwarp branch has been merged up to the main trunk revision > > 7626. The iwarp branch can be found at gen2/branches/iwarp and > > contains the Ammasso 1100 and Chelsio T3 drivers and user libs. > > > > If you are working on iwarp, please test out this new branch and lemme > > know if there are any problems. > > > > > > Thanks, > > > > Steve. 
> > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From Thomas.Talpey at netapp.com Mon Jun 5 11:36:27 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 05 Jun 2006 14:36:27 -0400 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: <7.0.1.0.2.20060605120638.025f6270@lanl.gov> References: <20060605161150.A21DE2283DA@openib.ca.sandia.gov> <7.0.1.0.2.20060605130607.086feab0@netapp.com> <7.0.1.0.2.20060605120638.025f6270@lanl.gov> Message-ID: <7.0.1.0.2.20060605143006.086feab0@netapp.com> Thanks Parks, this is a very interesting perspective. I will avoid going into my rant about edge devices for now, however. :-) I am not sure what you mean about using SDP "end to end". I assume you would perhaps use SDP to these edge nodes, but this would require terminating the SDP connection and re-issuing the stream over TCP to the Panasas box, wouldn't it? Would this bridging be done in-kernel, like your IPoIB/Ethernet solution today, or would you implement a daemon? It will be a difficult challenge, I predict. Tom. At 02:16 PM 6/5/2006, Parks Fields wrote: > >> >>I consider IPoIB to be Ethernet emulation. >> >>As for apples and oranges, my point exactly. > > >It is not really about comparisons. Here at LANL we have an >environment where all our new Clusters have to mount our global >parallel file system Panasas. It is ethernet and will be for a while. > >Cluster interconnect is IB and the compute nodes do NOT have >ethernet, so we created i-o nodes to "bridge" IB to ethernet. > >Compute node----IB---i/o node---10gig---ethernet switch ---- panasas > >We like to match / balance the network bandwidth to storage >bandwidth plus try to achieve 1GB/sec per TF of the machine. 
EX: >50TF machine = 50 GB/sec of storage bandwidth needed. > >So if IPoIB would give us ~700 MB/sec and came out the other side >with 10gigE at ~800 that would be nice. >Hope this helps. We are now trying to find out if SDP will work end-to-end. > >thanks >parks > > > > ***** Correspondence ***** > >This email contains no programmatic content that requires independent >ADC review > > > From parks at lanl.gov Mon Jun 5 11:54:04 2006 From: parks at lanl.gov (Parks Fields) Date: Mon, 05 Jun 2006 12:54:04 -0600 Subject: [openib-general] Question about the IPoIB bandwidth performance ? In-Reply-To: <7.0.1.0.2.20060605143006.086feab0@netapp.com> References: <20060605161150.A21DE2283DA@openib.ca.sandia.gov> <7.0.1.0.2.20060605130607.086feab0@netapp.com> <7.0.1.0.2.20060605120638.025f6270@lanl.gov> <7.0.1.0.2.20060605143006.086feab0@netapp.com> Message-ID: <7.0.1.0.2.20060605124604.02601c00@lanl.gov> At 12:36 PM 6/5/2006, Talpey, Thomas wrote: >Thanks Parks, this is a very interesting perspective. >I will avoid going into my rant about edge devices for >now, however. :-) Cool, you can send it direct if you want. >I am not sure what you mean about using SDP "end to end". >I assume you would perhaps use SDP to these edge nodes, >but this would require terminating the SDP connection and >re-issuing the stream over TCP to the Panasas box, wouldn't it? Yes, it would probably have to work that way. Another problem would be that SDP is not routeable. >Would this bridging be done in-kernel, like your IPoIB/Ethernet >solution today, or would you implement a daemon? It will be >a difficult challenge, I predict. We are just starting to think about things like this, and trying to keep an open mind to all possibilities. We have no solutions to do this yet. There might be better ways. So you are correct; we haven't thought it all the way through and have no alternative plan other than IPoIB at the moment. My next step will be testing 4x-DDR IPoIB before doing anything else. 
parks ***** Correspondence ***** This email contains no programmatic content that requires independent ADC review From rkuchimanchi at silverstorm.com Mon Jun 5 11:56:52 2006 From: rkuchimanchi at silverstorm.com (Ramachandra K) Date: Tue, 06 Jun 2006 00:26:52 +0530 Subject: [openib-general] Re: [PATCH] SRP : Use correct port identifier format according to target io_class In-Reply-To: <1149171133.7588.45.camel@Prawra.gs-lab.com> References: <1149171133.7588.45.camel@Prawra.gs-lab.com> Message-ID: <44847E74.3000409@silverstorm.com> Hi Roland, Did you get a chance to look at the modified SRP patches that I sent last week ? Regards, Ram Ramachandra K wrote: > On Mon, 2006-05-29 at 10:07 -0700, Roland Dreier wrote: >> Overall seems OK. Some comments: > > I am resending the patch with the modifications you suggested. > >> > +#define SRP_REV10_IO_CLASS 0xFF00 >> > +#define SRP_REV16A_IO_CLASS 0x0100 >> >> I think these should be in an enum in <scsi/srp.h>, since they're >> generic constants from the SRP spec. >> > I have defined the IO class values as an enum in <scsi/srp.h>. I am > sending this as a separate patch. I am not sure if those changes > are to be submitted here, since srp.h is not in the Open Fabrics > code base. But both the patches have to be applied together for > the SRP code to compile. 
> > > Signed-off-by: Ramachandra K > > Index: infiniband/ulp/srp/ib_srp.c > =================================================================== > --- infiniband/ulp/srp/ib_srp.c (revision 7615) > +++ infiniband/ulp/srp/ib_srp.c (working copy) > @@ -321,8 +321,33 @@ > req->priv.req_it_iu_len = cpu_to_be32(srp_max_iu_len); > req->priv.req_buf_fmt = cpu_to_be16(SRP_BUF_FORMAT_DIRECT | > SRP_BUF_FORMAT_INDIRECT); > - memcpy(req->priv.initiator_port_id, target->srp_host->initiator_port_id, 16); > /* > + * Older targets conforming to Rev 10 of the SRP specification > + * use the port identifier format which is > + * > + * lower 8 bytes : GUID > + * upper 8 bytes : extension > + * > + * Where as according to the new SRP specification (Rev 16a), the > + * port identifier format is > + * > + * lower 8 bytes : extension > + * upper 8 bytes : GUID > + * > + * So check the IO class of the target to decide which format to use. > + */ > + > + /* If its Rev 10, flip the initiator port id fields */ > + if (target->io_class == SRP_REV10_IO_CLASS) { > + memcpy(req->priv.initiator_port_id, > + target->srp_host->initiator_port_id + 8 , 8); > + memcpy(req->priv.initiator_port_id + 8, > + target->srp_host->initiator_port_id, 8); > + } else { > + memcpy(req->priv.initiator_port_id, > + target->srp_host->initiator_port_id, 16); > + } > + /* > * Topspin/Cisco SRP targets will reject our login unless we > * zero out the first 8 bytes of our initiator port ID. 
The > * second 8 bytes must be our local node GUID, but we always > @@ -334,8 +359,13 @@ > (unsigned long long) be64_to_cpu(target->ioc_guid)); > memset(req->priv.initiator_port_id, 0, 8); > } > - memcpy(req->priv.target_port_id, &target->id_ext, 8); > - memcpy(req->priv.target_port_id + 8, &target->ioc_guid, 8); > + if (target->io_class == SRP_REV10_IO_CLASS) { > + memcpy(req->priv.target_port_id, &target->ioc_guid, 8); > + memcpy(req->priv.target_port_id + 8, &target->id_ext, 8); > + } else { > + memcpy(req->priv.target_port_id, &target->id_ext, 8); > + memcpy(req->priv.target_port_id + 8, &target->ioc_guid, 8); > + } > > status = ib_send_cm_req(target->cm_id, &req->param); > > @@ -1513,6 +1543,7 @@ > SRP_OPT_SERVICE_ID = 1 << 4, > SRP_OPT_MAX_SECT = 1 << 5, > SRP_OPT_MAX_CMD_PER_LUN = 1 << 6, > + SRP_OPT_IO_CLASS = 1 << 7, > SRP_OPT_ALL = (SRP_OPT_ID_EXT | > SRP_OPT_IOC_GUID | > SRP_OPT_DGID | > @@ -1528,6 +1559,7 @@ > { SRP_OPT_SERVICE_ID, "service_id=%s" }, > { SRP_OPT_MAX_SECT, "max_sect=%d" }, > { SRP_OPT_MAX_CMD_PER_LUN, "max_cmd_per_lun=%d" }, > + { SRP_OPT_IO_CLASS, "io_class=%x" }, > { SRP_OPT_ERR, NULL } > }; > > @@ -1611,7 +1643,19 @@ > } > target->scsi_host->cmd_per_lun = min(token, SRP_SQ_SIZE); > break; > - > + case SRP_OPT_IO_CLASS: > + if (match_hex(args, &token)) { > + printk(KERN_WARNING PFX "bad IO class parameter '%s' \n", p); > + goto out; > + } > + if (token == SRP_REV10_IO_CLASS || token == SRP_REV16A_IO_CLASS) > + target->io_class = token; > + else > + printk(KERN_WARNING PFX "unknown IO class parameter value" > + " %x specified. Use %x or %x. 
Defaulting to IO class %x\n", > + token, SRP_REV10_IO_CLASS, SRP_REV16A_IO_CLASS, > + SRP_REV16A_IO_CLASS); > + break; > default: > printk(KERN_WARNING PFX "unknown parameter or missing value " > "'%s' in target creation request\n", p); > @@ -1654,6 +1698,7 @@ > target = host_to_target(target_host); > memset(target, 0, sizeof *target); > > + target->io_class = SRP_REV16A_IO_CLASS; > target->scsi_host = target_host; > target->srp_host = host; > > Index: infiniband/ulp/srp/ib_srp.h > =================================================================== > --- infiniband/ulp/srp/ib_srp.h (revision 7615) > +++ infiniband/ulp/srp/ib_srp.h (working copy) > @@ -122,6 +122,7 @@ > __be64 id_ext; > __be64 ioc_guid; > __be64 service_id; > + __be16 io_class; > struct srp_host *srp_host; > struct Scsi_Host *scsi_host; > char target_name[32]; > > From narravul at cse.ohio-state.edu Mon Jun 5 11:50:47 2006 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Mon, 5 Jun 2006 14:50:47 -0400 (EDT) Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: <1149531497.2766.12.camel@stevo-desktop> Message-ID: > The cq completion failure is expected (rping doesn't try to gracefully > close down). But the glibc error sounds like a bug. OK. > Can you try this on an IB transport? Also, why are you getting "no > driver found" errors for uverbs0 and uverbs1? Are these amso devices? I will try this on the IB transport. I am not sure about the "no driver found" warnings. btw, only the amso devices are installed on the nodes. Is there some tool to check which device uverbs0 is connected to? --Sundeep. > > Boyd, can you please try and reproduce this here? > > Steve. 
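On the question of mapping uverbs nodes to adapters: kernels of this vintage export the underlying RDMA device name through sysfs as an `ibdev` attribute under /sys/class/infiniband_verbs/uverbsN/. That layout is an assumption here (check your kernel's sysfs tree); the short sketch below just walks it.

```python
import os

def uverbs_to_ibdev(base="/sys/class/infiniband_verbs"):
    """Map each uverbsN node to its RDMA device name (e.g. mthca0).

    Assumes the sysfs layout where every uverbs entry carries an
    `ibdev` attribute; returns an empty dict if the tree is absent.
    """
    mapping = {}
    if not os.path.isdir(base):
        return mapping
    for entry in sorted(os.listdir(base)):
        attr = os.path.join(base, entry, "ibdev")
        if os.path.isfile(attr):
            with open(attr) as f:
                mapping[entry] = f.read().strip()
    return mapping

if __name__ == "__main__":
    for uverbs, ibdev in uverbs_to_ibdev().items():
        print(uverbs, "->", ibdev)
```

If the amso devices show up here but libibverbs still warns "no userspace device-specific driver found", the userspace driver (libamso) is likely not in the driver search path the warning prints.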
> > From swise at opengridcomputing.com Mon Jun 5 12:01:44 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 05 Jun 2006 14:01:44 -0500 Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: References: Message-ID: <1149534104.15071.4.camel@stevo-desktop> On Mon, 2006-06-05 at 14:50 -0400, Sundeep Narravula wrote: > > The cq completion failure is expected (rping doesn't try to gracefully > > close down). But the glibc error sounds like a bug. > > OK. > > > Can you try this on an IB transport? Also, why are you getting "no > > driver found" errors for uverbs0 and uverbs1? Are these amso devices? > > I will try this on the IB transport. > > I am not sure about the "no driver found" warnings. btw, only the amso > devices are installed on the nodes. Is there some tool to check which > device uverbs0 is connected to? > I'm not sure how to map these. But if you have an mthca adapter installed, and the libmthca driver isn't installed, you'll see these types of warnings. I'm guessing the glibc error is finding some rping bug. Maybe you have a later version of libc than my suse 9.2 distro? Stevo. From rdreier at cisco.com Mon Jun 5 12:01:27 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jun 2006 12:01:27 -0700 Subject: [openib-general] Re: [PATCH] SRP : Use correct port identifier format according to target io_class In-Reply-To: <44847E74.3000409@silverstorm.com> (Ramachandra K.'s message of "Tue, 06 Jun 2006 00:26:52 +0530") References: <1149171133.7588.45.camel@Prawra.gs-lab.com> <44847E74.3000409@silverstorm.com> Message-ID: Ramachandra> Hi Roland, Did you get a chance to look at the Ramachandra> modified SRP patches that I sent last week ? Yes, I will fix them up and apply them. - R. 
From narravul at cse.ohio-state.edu Mon Jun 5 11:58:48 2006 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Mon, 5 Jun 2006 14:58:48 -0400 (EDT) Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: <1149532374.15071.1.camel@stevo-desktop> Message-ID: > By the way, I assume you configured, rebuilt and reinstalled libibverbs, > librdmacm, and libamso? Yes. I have done these. > > I do not see this on my systems using a 2.6.16.5 kernel on a SUSE 9.2 > distro. What distro/kernel verions? The kernel used is 2.6.16 on a RH-AS4. --Sundeep. > > Thanx, > > > Steve. > > > On Mon, 2006-06-05 at 00:43 -0400, Sundeep Narravula wrote: > > Hi Steve, > > We are trying the new iwarp branch on ammasso adapters. The installation > > has gone fine. However, on running rping there is a error during > > disconnect phase. > > > > $ rping -c -vV -C4 -S4 -a 150.10.108.100 -p 9999 > > libibverbs: Warning: no userspace device-specific driver found for uverbs1 > > driver search path: /usr/local/lib/infiniband > > libibverbs: Warning: no userspace device-specific driver found for uverbs0 > > driver search path: /usr/local/lib/infiniband > > ping data: rdm > > ping data: rdm > > ping data: rdm > > ping data: rdm > > cq completion failed status 5 > > DISCONNECT EVENT... > > *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** > > Aborted > > > > There are no apparent errors showing up in dmesg. Is this error > > currently expected? > > > > Thanks, > > --Sundeep. > > > > On Fri, 2 Jun 2006, Steve Wise wrote: > > > > > Hello, > > > > > > The gen2 iwarp branch has been merged up to the main trunk revision > > > 7626. The iwarp branch can be found at gen2/branches/iwarp and > > > contains the Ammasso 1100 and Chelsio T3 drivers and user libs. > > > > > > If you are working on iwarp, please test out this new branch and lemme > > > know if there are any problems. > > > > > > > > > Thanks, > > > > > > Steve. 
> > > > > > > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From tziporet at mellanox.co.il Mon Jun 5 12:09:52 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 5 Jun 2006 22:09:52 +0300 Subject: [openib-general] Fix some suspicious ppc64 code in dapl Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7122@mtlexch01.mtl.com> Hi James, Is it important to take this patch to the OFED release? Thanks Tziporet -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of James Lentini Sent: Monday, June 05, 2006 4:39 PM To: Anton Blanchard Cc: openib-general at openib.org Subject: Re: [openib-general] Fix some suspicious ppc64 code in dapl > Index: dapl/udapl/linux/dapl_osd.h > =================================================================== > --- dapl/udapl/linux/dapl_osd.h (revision 7621) > +++ dapl/udapl/linux/dapl_osd.h (working copy) > @@ -238,14 +238,13 @@ > #endif /* __ia64__ */ > #elif defined(__PPC64__) > __asm__ __volatile__ ( > - EIEIO_ON_SMP -"1: lwarx %0,0,%2 # __cmpxchg_u64\n\ - cmpd 0,%0,%3\n\ +" lwsync\n\ +1: lwarx %0,0,%2 # __cmpxchg_u32\n\ + cmpw 0,%0,%3\n\ bne- 2f\n\ stwcx. %4,0,%2\n\ - bne- 1b" - ISYNC_ON_SMP - "\n\ + bne- 1b\n\ + isync\n\ 2:" : "=&r" (current_value), "=m" (*v) : "r" (v), "r" (match_value), "r" (new_value), "m" (*v) Thank you Anton. Could you reply with a Signed-off-by line? I'll properly attribute this fix to you in the commit log. 
_______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Mon Jun 5 12:17:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jun 2006 12:17:59 -0700 Subject: [openib-general] RFC: Add I/O class enum values to Message-ID: Does anyone have an objection to me merging the trivial patch below through my git tree? This will be used by the IB SRP initiator to work with SilverStorm targets, which still implement rev. 10 of the SRP spec. I could just make these values private to the IB initiator, but I figured that things directly from the SRP spec belong in rather than in a particular driver's private header. Thanks, Roland diff-tree a13ac0e9f99636a043d197f3349a67303ce4a701 (from bb61dd1fbf59f2291295986bed1f99b48f513fa4) Author: Ramachandra K Date: Mon Jun 5 12:13:52 2006 -0700 [SCSI] srp.h: Add I/O Class values Add enum values for I/O Class values from rev. 10 and rev. 16a SRP drafts. The values are used to detect targets that implement obsolete revisions of SRP, so that the initiator can use the old format for port identifier when connecting to them. 
Signed-off-by: Ramachandra K Signed-off-by: Roland Dreier diff --git a/include/scsi/srp.h b/include/scsi/srp.h index 637f77e..ad178fa 100644 --- a/include/scsi/srp.h +++ b/include/scsi/srp.h @@ -87,6 +87,11 @@ enum srp_login_rej_reason { SRP_LOGIN_REJ_CHANNEL_LIMIT_REACHED = 0x00010006 }; +enum { + SRP_REV10_IB_IO_CLASS = 0xff00, + SRP_REV16A_IB_IO_CLASS = 0x0100 +}; + struct srp_direct_buf { __be64 va; __be32 key; From narravul at cse.ohio-state.edu Mon Jun 5 12:23:53 2006 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Mon, 5 Jun 2006 15:23:53 -0400 (EDT) Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: <1149534104.15071.4.camel@stevo-desktop> Message-ID: > I'm not sure how to map these. But if you have mthca adapter installed, > and the libmthca driver isn't installed, you'll see these types of > warnings. This is the case. We have the adapters for mthca installed but not the drivers. > I'm guessing the glibc error is finding some rping bug. Maybe you have > a later version of libc than my suse 9.2 distro? The glibc version we are using is 2.3.4 --Sundeep. > > > Stevo. > > From James.Bottomley at SteelEye.com Mon Jun 5 12:40:57 2006 From: James.Bottomley at SteelEye.com (James Bottomley) Date: Mon, 05 Jun 2006 14:40:57 -0500 Subject: [openib-general] Re: RFC: Add I/O class enum values to In-Reply-To: References: Message-ID: <1149536457.3479.2.camel@mulgrave.il.steeleye.com> On Mon, 2006-06-05 at 12:17 -0700, Roland Dreier wrote: > Does anyone have an objection to me merging the trivial patch below > through my git tree? This will be used by the IB SRP initiator to > work with SilverStorm targets, which still implement rev. 10 of the > SRP spec. I could just make these values private to the IB initiator, > but I figured that things directly from the SRP spec belong in > rather than in a particular driver's private header. No objection here ... 
but if you do, it will entangle our git trees even more nastily, since the srp.h file is created in the scsi-misc-2.6 tree. James From rdreier at cisco.com Mon Jun 5 12:55:43 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jun 2006 12:55:43 -0700 Subject: [openib-general] Re: RFC: Add I/O class enum values to In-Reply-To: <1149536457.3479.2.camel@mulgrave.il.steeleye.com> (James Bottomley's message of "Mon, 05 Jun 2006 14:40:57 -0500") References: <1149536457.3479.2.camel@mulgrave.il.steeleye.com> Message-ID: James> No objection here ... but if you do, it will entangle our James> git trees even more nastily, since the srp.h file is James> created in the scsi-misc-2.6 tree. No, I think we're OK. srp.h is already in Linus's tree (it went in as part of the original IB SRP initiator merge), and scsi-misc doesn't have any changes after ec448a0a36 (which is already upstream) in it. So putting the IO Class change in my tree actually reduces the dependency between our trees, since I can put the IB SRP changes in my tree without worrying about you merging the srp.h change first. - R. From swise at opengridcomputing.com Mon Jun 5 13:07:46 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 05 Jun 2006 15:07:46 -0500 Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: References: Message-ID: <1149538066.15071.16.camel@stevo-desktop> On Mon, 2006-06-05 at 15:23 -0400, Sundeep Narravula wrote: > > I'm not sure how to map these. But if you have mthca adapter installed, > > and the libmthca driver isn't installed, you'll see these types of > > warnings. > > This is the case. We have the adapters for mthca installed but not the > drivers. > > > I'm guessing the glibc error is finding some rping bug. Maybe you have > > a later version of libc than my suse 9.2 distro? > > The glibc version we are using is 2.3.4 > My systems are 2.3.3-118 (that's the version in the rpm name). 
From faulkner at opengridcomputing.com Mon Jun 5 13:12:57 2006 From: faulkner at opengridcomputing.com (Boyd R. Faulkner) Date: Mon, 5 Jun 2006 15:12:57 -0500 Subject: [openib-general] Serialization in ib_uverbs Message-ID: <200606051512.57653.faulkner@opengridcomputing.com> I have a question about the intent of the mutex lock ib_uverbs_idr_mutex used in kernel interface from user libraries. It appears to be a lock on the idr linked lists but as a great many, if not all, of the ib_uverbs commands grab the mutex at the start of the function and hold it to the end, it acts to serialize all the library accesses to the kernel. Is this intended? If a driver, say, waits for all references to an object to be removed before the close completes, all accesses stop while that occurs and if the command to make that happen needs that mutex, everything stops. I have seen this in practice. Any insight would be appreciated. Thanks, Boyd -- Boyd R. Faulkner Open Grid Computing, Inc. Phone: 512-343-9196 x109 Fax: 512-343-5450 From rdreier at cisco.com Mon Jun 5 13:36:03 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jun 2006 13:36:03 -0700 Subject: [openib-general] Serialization in ib_uverbs In-Reply-To: <200606051512.57653.faulkner@opengridcomputing.com> (Boyd R. Faulkner's message of "Mon, 5 Jun 2006 15:12:57 -0500") References: <200606051512.57653.faulkner@opengridcomputing.com> Message-ID: Boyd> I have a question about the intent of the mutex lock Boyd> ib_uverbs_idr_mutex used in kernel interface from user Boyd> libraries. It appears to be a lock on the idr linked lists Boyd> but as a great many, if not all, of the ib_uverbs commands Boyd> grab the mutex at the start of the function and hold it to Boyd> the end, it acts to serialize all the library accesses to Boyd> the kernel. Is this intended? Yes, when I first implemented things it made things a lot easier to serialize things. 
For example holding the mutex during the entire process of creating a QP prevents the associated CQs from being destroyed in the middle of the operation. It does seem to be a scalability problem for some devices/workloads, and Robert Walsh from qlogic was planning on looking at this (replacing the mutex with a reference counting scheme). I don't know if he's started on this or not. - R. From faulkner at opengridcomputing.com Mon Jun 5 13:38:34 2006 From: faulkner at opengridcomputing.com (Boyd R. Faulkner) Date: Mon, 5 Jun 2006 15:38:34 -0500 Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: <1149534104.15071.4.camel@stevo-desktop> References: <1149534104.15071.4.camel@stevo-desktop> Message-ID: <200606051538.35084.faulkner@opengridcomputing.com> On Mon June 5 2006 14:01, Steve Wise wrote: > On Mon, 2006-06-05 at 14:50 -0400, Sundeep Narravula wrote: > > > The cq completion failure is expected (rping doesn't try to gracefully > > > close down). But the glibc error sounds like a bug. > > > > OK. > > > > > Can you try this on an IB transport? Also, why are you getting "no > > > driver found" errors for uverbs0 and uverbs1? Are these amso devices? > > > > I will try this on the IB transport. > > > > I am not sure about the "no driver found" warnings. btw, only the amso > > devises are installed on the nodes. Is there some tool to check which > > device uverbs0 is connected to? > > I'm not sure how to map these. But if you have mthca adapter installed, > and the libmthca driver isn't installed, you'll see these types of > warnings. You will also get this warning on the latest CM if you have not updated the library to use ibv_driver_init vs. openib_driver_init. This drop for libamso happened last Friday, Jun 2. Check and see if you have that. > > I'm guessing the glibc error is finding some rping bug. Maybe you have > a later version of libc than my suse 9.2 distro? > > > Stevo. -- Boyd R. Faulkner Open Grid Computing, Inc. 
Phone: 512-343-9196 x109 Fax: 512-343-5450 From rjwalsh at pathscale.com Mon Jun 5 13:50:11 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Mon, 05 Jun 2006 13:50:11 -0700 Subject: [openib-general] Serialization in ib_uverbs In-Reply-To: References: <200606051512.57653.faulkner@opengridcomputing.com> Message-ID: <1149540611.15423.6.camel@hematite.internal.keyresearch.com> > It does seem to be a scalability problem for some devices/workloads, > and Robert Walsh from qlogic was planning on looking at this > (replacing the mutex with a reference counting scheme). I don't know > if he's started on this or not. Not yet, but I'll be starting this as soon as I'm done with some other release-related work. Probably in the next few days I'll be starting. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From manpreet at gmail.com Mon Jun 5 14:15:36 2006 From: manpreet at gmail.com (Manpreet Singh) Date: Mon, 5 Jun 2006 14:15:36 -0700 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs In-Reply-To: References: <67897d690606011822j7b915876l57149508623c6c4f@mail.gmail.com> Message-ID: <67897d690606051415o3675207o549ce7e084d618b8@mail.gmail.com> We have seen this happen over an IB analyzer. Recompiling the mthca driver with a high value like 64 or 128 works around this problem. When the condition hits, the HCA receiving the 4+ RDMAs generates an invalid request error. Any ideas as to when this patch might enter the mainline sources? Thanks, Manpreet. On 6/2/06, Roland Dreier wrote: > > Manpreet> Mellanox HCA can handle has been configured at 4 > Manpreet> (mthca_main.c: default_profile: rdb_per_qp). 
And the > Manpreet> HCAs can support a much higher value (128 I think). > > Manpreet> Could we move this value higher or atleast make it > Manpreet> configurable? > > Leonid Arsh has a patch that I will integrate soon that makes this > configurable. > > However, I'm curious. Do you have a workload where this actually > makes a measurable difference? It seems that having 4 RDMA requests > outstanding on the wire should be enough to get things to pipeline > pretty well. > > If you haven't tested this, right now you can of course edit > mthca_main.c to change the default value and recompile. > > - R. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From faulkner at opengridcomputing.com Mon Jun 5 14:16:34 2006 From: faulkner at opengridcomputing.com (Boyd R. Faulkner) Date: Mon, 5 Jun 2006 16:16:34 -0500 Subject: [openib-general] Serialization in ib_uverbs In-Reply-To: References: <200606051512.57653.faulkner@opengridcomputing.com> Message-ID: <200606051616.34617.faulkner@opengridcomputing.com> On Mon June 5 2006 15:36, Roland Dreier wrote: > Boyd> I have a question about the intent of the mutex lock > Boyd> ib_uverbs_idr_mutex used in kernel interface from user > Boyd> libraries. It appears to be a lock on the idr linked lists > Boyd> but as a great many, if not all, of the ib_uverbs commands > Boyd> grab the mutex at the start of the function and hold it to > Boyd> the end, it acts to serialize all the library accesses to > Boyd> the kernel. Is this intended? > > Yes, when I first implemented things it made things a lot easier to > serialize things. For example holding the mutex during the entire > process of creating a QP prevents the associated CQs from being > destroyed in the middle of the operation. > > It does seem to be a scalability problem for some devices/workloads, > and Robert Walsh from qlogic was planning on looking at this > (replacing the mutex with a reference counting scheme). 
I don't know > if he's started on this or not. > > - R. It is in the pipe. Sweet. Thanks, Boyd -- Boyd R. Faulkner Open Grid Computing, Inc. Phone: 512-343-9196 x109 Fax: 512-343-5450 From sean.hefty at intel.com Mon Jun 5 17:05:25 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 5 Jun 2006 17:05:25 -0700 Subject: [openib-general] [PATCH 1/2] RDMA CM: allow user to set IB CM timeout and retries Message-ID: Allow users to override the default number of retries and timeout used by the RDMA CM when connecting over Infiniband. Some applications, like MPI, are unable to connect within the default timeout value when scaling up. Signed-off-by: Sean Hefty --- Index: include/rdma/rdma_user_cm.h =================================================================== --- include/rdma/rdma_user_cm.h (revision 7619) +++ include/rdma/rdma_user_cm.h (working copy) @@ -203,6 +203,7 @@ enum { /* IB specific option names for get/set. */ enum { IB_PATH_OPTIONS = 1, + IB_CM_REQ_OPTIONS = 2, }; struct rdma_ucm_get_option_resp { Index: include/rdma/rdma_cm_ib.h =================================================================== --- include/rdma/rdma_cm_ib.h (revision 7619) +++ include/rdma/rdma_cm_ib.h (working copy) @@ -44,4 +44,26 @@ int rdma_set_ib_paths(struct rdma_cm_id *id, struct ib_sa_path_rec *path_rec, int num_paths); +struct ib_cm_req_opt { + u8 remote_cm_response_timeout; + u8 local_cm_response_timeout; + u8 max_cm_retries; +}; + +/** + * rdma_get_ib_req_info - Retrieves the current IB CM REQ / SIDR REQ values + * that will be used when connection, or performing service ID resolution. + * @id: Connection identifier associated with the request. + * @info: Current values for CM REQ messages. + */ +int rdma_get_ib_req_info(struct rdma_cm_id *id, struct ib_cm_req_opt *info); + +/** + * rdma_set_ib_req_info - Sets the current IB CM REQ / SIDR REQ values + * that will be used when connection, or performing service ID resolution. 
+ * @id: Connection identifier associated with the request. + * @info: New values for CM REQ messages. + */ +int rdma_set_ib_req_info(struct rdma_cm_id *id, struct ib_cm_req_opt *info); + #endif /* RDMA_CM_IB_H */ Index: core/ucma_ib.c =================================================================== --- core/ucma_ib.c (revision 7619) +++ core/ucma_ib.c (working copy) @@ -81,12 +81,37 @@ static int ucma_get_paths(struct rdma_cm return ret; } +static int ucma_get_req_opt(struct rdma_cm_id *id, void __user *opt, + int *optlen) +{ + struct ib_cm_req_opt req_opt; + int ret = 0; + + if (!opt) + goto out; + + if (*optlen < sizeof req_opt) { + ret = -ENOMEM; + goto out; + } + + ret = rdma_get_ib_req_info(id, &req_opt); + if (!ret) + if (copy_to_user(opt, &req_opt, sizeof req_opt)) + ret = -EFAULT; +out: + *optlen = sizeof req_opt; + return ret; +} + int ucma_get_ib_option(struct rdma_cm_id *id, int optname, void *optval, int *optlen) { switch (optname) { case IB_PATH_OPTIONS: return ucma_get_paths(id, optval, optlen); + case IB_CM_REQ_OPTIONS: + return ucma_get_req_opt(id, optval, optlen); default: return -EINVAL; } @@ -132,12 +157,27 @@ out: return ret; } +static int ucma_set_req_opt(struct rdma_cm_id *id, void __user *opt, int optlen) +{ + struct ib_cm_req_opt req_opt; + + if (optlen != sizeof req_opt) + return -EINVAL; + + if (copy_from_user(&req_opt, opt, sizeof req_opt)) + return -EFAULT; + + return rdma_set_ib_req_info(id, &req_opt); +} + int ucma_set_ib_option(struct rdma_cm_id *id, int optname, void *optval, int optlen) { switch (optname) { case IB_PATH_OPTIONS: return ucma_set_paths(id, optval, optlen); + case IB_CM_REQ_OPTIONS: + return ucma_set_req_opt(id, optval, optlen); default: return -EINVAL; } Index: core/cma.c =================================================================== --- core/cma.c (revision 7619) +++ core/cma.c (working copy) @@ -126,6 +126,10 @@ struct rdma_id_private { struct ib_cm_id *ib; } cm_id; + union { + struct ib_cm_req_opt *req; + 
} options; + u32 seq_num; u32 qp_num; enum ib_qp_type qp_type; @@ -710,6 +714,7 @@ void rdma_destroy_id(struct rdma_cm_id * wait_for_completion(&id_priv->comp); kfree(id_priv->id.route.path_rec); + kfree(id_priv->options.req); kfree(id_priv); } EXPORT_SYMBOL(rdma_destroy_id); @@ -1240,6 +1245,65 @@ err: } EXPORT_SYMBOL(rdma_set_ib_paths); +static inline u8 cma_get_ib_remote_timeout(struct rdma_id_private *id_priv) +{ + return id_priv->options.req ? + id_priv->options.req->remote_cm_response_timeout : + CMA_CM_RESPONSE_TIMEOUT; +} + +static inline u8 cma_get_ib_local_timeout(struct rdma_id_private *id_priv) +{ + return id_priv->options.req ? + id_priv->options.req->local_cm_response_timeout : + CMA_CM_RESPONSE_TIMEOUT; +} + +static inline u8 cma_get_ib_cm_retries(struct rdma_id_private *id_priv) +{ + return id_priv->options.req ? + id_priv->options.req->max_cm_retries : CMA_MAX_CM_RETRIES; +} + +int rdma_get_ib_req_info(struct rdma_cm_id *id, struct ib_cm_req_opt *info) +{ + struct rdma_id_private *id_priv; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp(id_priv, CMA_ROUTE_RESOLVED)) + return -EINVAL; + + info->remote_cm_response_timeout = cma_get_ib_remote_timeout(id_priv); + info->local_cm_response_timeout = cma_get_ib_local_timeout(id_priv); + info->max_cm_retries = cma_get_ib_cm_retries(id_priv); + return 0; +} +EXPORT_SYMBOL(rdma_get_ib_req_info); + +int rdma_set_ib_req_info(struct rdma_cm_id *id, struct ib_cm_req_opt *info) +{ + struct rdma_id_private *id_priv; + + if (info->remote_cm_response_timeout > 0x1F || + info->local_cm_response_timeout > 0x1F || + info->max_cm_retries > 0xF) + return -EINVAL; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp(id_priv, CMA_ROUTE_RESOLVED)) + return -EINVAL; + + if (!id_priv->options.req) { + id_priv->options.req = kmalloc(sizeof *info, GFP_KERNEL); + if (!id_priv->options.req) + return -ENOMEM; + } + + *id_priv->options.req = *info; + return 0; +} 
+EXPORT_SYMBOL(rdma_set_ib_req_info); + int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) { struct rdma_id_private *id_priv; @@ -1646,9 +1710,9 @@ static int cma_connect_ib(struct rdma_id req.flow_control = conn_param->flow_control; req.retry_count = conn_param->retry_count; req.rnr_retry_count = conn_param->rnr_retry_count; - req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; - req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; - req.max_cm_retries = CMA_MAX_CM_RETRIES; + req.remote_cm_response_timeout = cma_get_ib_remote_timeout(id_priv); + req.local_cm_response_timeout = cma_get_ib_local_timeout(id_priv); + req.max_cm_retries = cma_get_ib_cm_retries(id_priv); req.srq = id_priv->srq ? 1 : 0; ret = ib_send_cm_req(id_priv->cm_id.ib, &req); @@ -1707,7 +1771,7 @@ static int cma_accept_ib(struct rdma_id_ rep.private_data_len = conn_param->private_data_len; rep.responder_resources = conn_param->responder_resources; rep.initiator_depth = conn_param->initiator_depth; - rep.target_ack_delay = CMA_CM_RESPONSE_TIMEOUT; + rep.target_ack_delay = cma_get_ib_local_timeout(id_priv); rep.failover_accepted = 0; rep.flow_control = conn_param->flow_control; rep.rnr_retry_count = conn_param->rnr_retry_count; From sean.hefty at intel.com Mon Jun 5 17:11:49 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 5 Jun 2006 17:11:49 -0700 Subject: [openib-general] [PATCH 2/2] librdmacm: allow user to set IB CM timeout and retries Message-ID: Userspace support to allow overriding the default timeout and retry used by the RDMA CM when connecting over Infiniband. This patch moves the Infiniband specific options into their own header file. Signed-off-by: Sean Hefty --- Index: include/rdma/rdma_cma_ib.h =================================================================== --- include/rdma/rdma_cma_ib.h (revision 0) +++ include/rdma/rdma_cma_ib.h (revision 0) @@ -0,0 +1,47 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. 
+ * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + * + */ + +#if !defined(RDMA_CMA_IB_H) +#define RDMA_CMA_IB_H + +#include + + +/* IB specific option names for get/set. */ +enum { + IB_PATH_OPTIONS = 1, /* struct ibv_kern_path_rec */ + IB_CM_REQ_OPTIONS = 2 /* struct ib_cm_req_opt */ +}; + +struct ib_cm_req_opt { + uint8_t remote_cm_response_timeout; + uint8_t local_cm_response_timeout; + uint8_t max_cm_retries; +}; + +#endif /* RDMA_CMA_IB_H */ Index: include/rdma/rdma_cma.h =================================================================== --- include/rdma/rdma_cma.h (revision 7636) +++ include/rdma/rdma_cma.h (working copy) @@ -60,11 +60,6 @@ enum { RDMA_PROTO_IB = 1, }; -/* IB specific option names for get/set. 
*/ -enum { - IB_PATH_OPTIONS = 1, -}; - struct ib_addr { union ibv_gid sgid; union ibv_gid dgid; Index: Makefile.am =================================================================== --- Makefile.am (revision 7636) +++ Makefile.am (working copy) @@ -27,10 +27,12 @@ examples_rping_LDADD = $(top_builddir)/s librdmacmincludedir = $(includedir)/rdma librdmacminclude_HEADERS = include/rdma/rdma_cma_abi.h \ - include/rdma/rdma_cma.h + include/rdma/rdma_cma.h \ + include/rdma/rdma_cma_ib.h EXTRA_DIST = include/rdma/rdma_cma_abi.h \ include/rdma/rdma_cma.h \ + include/rdma/rdma_cma_ib.h \ src/librdmacm.map \ librdmacm.spec.in From arlin.r.davis at intel.com Mon Jun 5 17:16:31 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 5 Jun 2006 17:16:31 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS Message-ID: James, Here is a patch to the openib-cma provider that uses the new set_option feature of the uCMA to adjust connect request timeout and retry values. The defaults are a little quick for some consumers. They are now bumped up from 3 retries to 15 and are tunable with uDAPL environment variables. Also, included a fix to disallow any event after a disconnect event. You need to sync up the commit with Sean's patch for the uCMA get/set IB_CM_REQ_OPTIONS. I would like to get this in OFED RC6 if possible. 
Thanks, -arlin Signed-off by: Arlin Davis ardavis at ichips.intel.com Index: dapl/openib_cma/dapl_ib_util.c =================================================================== --- dapl/openib_cma/dapl_ib_util.c (revision 7694) +++ dapl/openib_cma/dapl_ib_util.c (working copy) @@ -264,7 +264,15 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_N /* set inline max with env or default, get local lid and gid 0 */ hca_ptr->ib_trans.max_inline_send = dapl_os_get_env_val("DAPL_MAX_INLINE", INLINE_SEND_DEFAULT); - + + /* set CM timer defaults */ + hca_ptr->ib_trans.max_cm_timeout = + dapl_os_get_env_val("DAPL_MAX_CM_RESPONSE_TIME", + IB_CM_RESPONSE_TIMEOUT); + hca_ptr->ib_trans.max_cm_retries = + dapl_os_get_env_val("DAPL_MAX_CM_RETRIES", + IB_CM_RETRIES); + /* EVD events without direct CQ channels, non-blocking */ hca_ptr->ib_trans.ib_cq = ibv_create_comp_channel(hca_ptr->ib_hca_handle); Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 7694) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -58,6 +58,7 @@ #include "dapl_ib_util.h" #include #include +#include extern struct rdma_event_channel *g_cm_events; @@ -85,7 +86,6 @@ static inline uint64_t cpu_to_be64(uint6 (unsigned short)((SID % IB_PORT_MOD) + IB_PORT_BASE) :\ (unsigned short)SID) - static void dapli_addr_resolve(struct dapl_cm_id *conn) { int ret; @@ -114,6 +114,8 @@ static void dapli_addr_resolve(struct da static void dapli_route_resolve(struct dapl_cm_id *conn) { int ret; + size_t optlen = sizeof(struct ib_cm_req_opt); + struct ib_cm_req_opt req_opt; #ifdef DAPL_DBG struct rdma_addr *ipaddr = &conn->cm_id->route.addr; struct ib_addr *ibaddr = &conn->cm_id->route.addr.addr.ibaddr; @@ -143,13 +145,43 @@ static void dapli_route_resolve(struct d cpu_to_be64(ibaddr->dgid.global.interface_id)); dapl_dbg_log(DAPL_DBG_TYPE_CM, - " rdma_connect: cm_id %p pdata %p plen %d rr %d ind %d\n", + " route_resolve: cm_id %p pdata %p plen %d 
rr %d ind %d\n", conn->cm_id, conn->params.private_data, conn->params.private_data_len, conn->params.responder_resources, conn->params.initiator_depth ); + /* Get default connect request timeout values, and adjust */ + ret = rdma_get_option(conn->cm_id, RDMA_PROTO_IB, IB_CM_REQ_OPTIONS, + (void*)&req_opt, &optlen); + if (ret) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, " rdma_get_option failed: %s\n", + strerror(errno)); + goto bail; + } + + dapl_dbg_log(DAPL_DBG_TYPE_CM, " route_resolve: " + "Set CR times - response %d to %d, retry %d to %d\n", + req_opt.remote_cm_response_timeout, + conn->hca->ib_trans.max_cm_timeout, + req_opt.max_cm_retries, + conn->hca->ib_trans.max_cm_retries); + + /* Use hca response time setting for connect requests */ + req_opt.max_cm_retries = conn->hca->ib_trans.max_cm_retries; + req_opt.remote_cm_response_timeout = + conn->hca->ib_trans.max_cm_timeout; + req_opt.local_cm_response_timeout = + req_opt.remote_cm_response_timeout; + ret = rdma_set_option(conn->cm_id, RDMA_PROTO_IB, IB_CM_REQ_OPTIONS, + (void*)&req_opt, optlen); + if (ret) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, " rdma_set_option failed: %s\n", + strerror(errno)); + goto bail; + } + ret = rdma_connect(conn->cm_id, &conn->params); if (ret) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, " rdma_connect failed: %s\n", @@ -273,14 +305,37 @@ static void dapli_cm_active_cb(struct da } dapl_os_unlock(&conn->lock); + /* There is a chance that we can get events after + * the consumer calls disconnect in a pending state + * since the IB CM and uDAPL states are not shared. + * In some cases, IB CM could generate either a DCONN + * or CONN_ERR after the consumer returned from + * dapl_ep_disconnect with a DISCONNECTED event + * already queued. Check state here and bail to + * avoid any events after a disconnect. 
+ */ + if (DAPL_BAD_HANDLE(conn->ep, DAPL_MAGIC_EP)) + return; + + dapl_os_lock(&conn->ep->header.lock); + if (conn->ep->param.ep_state == DAT_EP_STATE_DISCONNECTED) { + dapl_os_unlock(&conn->ep->header.lock); + return; + } + if (event->event == RDMA_CM_EVENT_DISCONNECTED) + conn->ep->param.ep_state = DAT_EP_STATE_DISCONNECTED; + + dapl_os_unlock(&conn->ep->header.lock); + switch (event->event) { case RDMA_CM_EVENT_UNREACHABLE: case RDMA_CM_EVENT_CONNECT_ERROR: - dapl_dbg_log( - DAPL_DBG_TYPE_WARN, - " dapli_cm_active_handler: CONN_ERR " - " event=0x%x status=%d\n", - event->event, event->status); + dapl_dbg_log( + DAPL_DBG_TYPE_WARN, + " dapli_cm_active_handler: CONN_ERR " + " event=0x%x status=%d %s\n", + event->event, event->status, + (event->status == -110)?"TIMEOUT":"" ); dapl_evd_connection_callback(conn, IB_CME_DESTINATION_UNREACHABLE, @@ -368,25 +423,23 @@ static void dapli_cm_passive_cb(struct d event->private_data, new_conn->sp); break; case RDMA_CM_EVENT_UNREACHABLE: - dapls_cr_callback(conn, IB_CME_DESTINATION_UNREACHABLE, - NULL, conn->sp); - case RDMA_CM_EVENT_CONNECT_ERROR: dapl_dbg_log( - DAPL_DBG_TYPE_WARN, - " dapli_cm_passive: CONN_ERR " - " event=0x%x status=%d", - " on SRC 0x%x,0x%x DST 0x%x,0x%x\n", - event->event, event->status, - ntohl(((struct sockaddr_in *) - &ipaddr->src_addr)->sin_addr.s_addr), - ntohs(((struct sockaddr_in *) - &ipaddr->src_addr)->sin_port), - ntohl(((struct sockaddr_in *) - &ipaddr->dst_addr)->sin_addr.s_addr), - ntohs(((struct sockaddr_in *) - &ipaddr->dst_addr)->sin_port)); + DAPL_DBG_TYPE_WARN, + " dapli_cm_passive: CONN_ERR " + " event=0x%x status=%d %s" + " on SRC 0x%x,0x%x DST 0x%x,0x%x\n", + event->event, event->status, + (event->status == -110)?"TIMEOUT":"", + ntohl(((struct sockaddr_in *) + &ipaddr->src_addr)->sin_addr.s_addr), + ntohs(((struct sockaddr_in *) + &ipaddr->src_addr)->sin_port), + ntohl(((struct sockaddr_in *) + &ipaddr->dst_addr)->sin_addr.s_addr), + ntohs(((struct sockaddr_in *) + 
&ipaddr->dst_addr)->sin_port)); dapls_cr_callback(conn, IB_CME_DESTINATION_UNREACHABLE, NULL, conn->sp); Index: dapl/openib_cma/dapl_ib_util.h =================================================================== --- dapl/openib_cma/dapl_ib_util.h (revision 7694) +++ dapl/openib_cma/dapl_ib_util.h (working copy) @@ -67,8 +67,8 @@ typedef ib_hca_handle_t dapl_ibal_ca_t; #define IB_RC_RETRY_COUNT 7 #define IB_RNR_RETRY_COUNT 7 -#define IB_CM_RESPONSE_TIMEOUT 18 /* 1 sec */ -#define IB_MAX_CM_RETRIES 7 +#define IB_CM_RESPONSE_TIMEOUT 20 /* 4 sec */ +#define IB_CM_RETRIES 15 #define IB_REQ_MRA_TIMEOUT 27 /* a little over 9 minutes */ #define IB_MAX_AT_RETRY 3 #define IB_TARGET_MAX 4 /* max_qp_ous_rd_atom */ @@ -252,6 +252,8 @@ typedef struct _ib_hca_transport ib_async_cq_handler_t async_cq_error; ib_async_dto_handler_t async_cq; ib_async_qp_handler_t async_qp_error; + uint8_t max_cm_timeout; + uint8_t max_cm_retries; } ib_hca_transport_t; From rdreier at cisco.com Mon Jun 5 21:42:22 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jun 2006 21:42:22 -0700 Subject: [openib-general] Re: [PATCH] SRP : Use correct port identifier format according to target io_class In-Reply-To: <1149171133.7588.45.camel@Prawra.gs-lab.com> (Ramchandra K.'s message of "Thu, 01 Jun 2006 19:42:13 +0530") References: <1149171133.7588.45.camel@Prawra.gs-lab.com> Message-ID: Thanks, I applied this. From k_mahesh85 at yahoo.co.in Mon Jun 5 21:51:43 2006 From: k_mahesh85 at yahoo.co.in (keshetti mahesh) Date: Tue, 6 Jun 2006 05:51:43 +0100 (BST) Subject: [openib-general] repost-problem with memory registration-RDMA kernel utility Message-ID: <20060606045143.81301.qmail@web8327.mail.in.yahoo.com> Can anybody suggest the correct way to register a buffer for doing RDMA operations? I have already posted my code in the previous thread, but it is not working. 
It is a kernel utility and I have obtained the buffer using kmalloc; how can I register it in order to perform RDMA operations over it? -Mahesh -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Tue Jun 6 00:08:15 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 10:08:15 +0300 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <67897d690606051415o3675207o549ce7e084d618b8@mail.gmail.com> References: <67897d690606051415o3675207o549ce7e084d618b8@mail.gmail.com> Message-ID: <20060606070814.GA2432@mellanox.co.il> Quoting r. Manpreet Singh : > Subject: Re: Mellanox HCAs: outstanding RDMAs > > We have seen this happen over an IB analyzer. Recompiling the mthca driver with a high value like 64 or 128 works around this problem. > When the condition hits, the HCA receiving the 4+ RDMAs generates an invalid request error. Posting more read work requests than might be outstanding simultaneously on the wire is not an error. I think the fact you are getting an error means you are configuring max_rd_atomic/max_dest_rd_atomic on the local versus remote side incorrectly (these represent the Number of responder resources for RDMA Read/atomic ops and Number of Outstanding RDMA Read/atomic ops at destination, respectively). If so, this is a bug in the ULP; working around it by increasing the number of credits on both sides does not seem like the right thing to do. See 12.7.29 RESPONDER RESOURCES, and 12.7.30 INITIATOR DEPTH. -- MST From mst at mellanox.co.il Tue Jun 6 00:09:54 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Tue, 6 Jun 2006 10:09:54 +0300 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <44846EFF.2020705@veritas.com> References: <44846EFF.2020705@veritas.com> Message-ID: <20060606070954.GB2432@mellanox.co.il> Quoting r. somenath : > possibility of stalling is scary! You might want to review chapter 9.5 TRANSACTION ORDERING for info on when ordering rules will cause the IB QP to stall. -- MST From mst at mellanox.co.il Tue Jun 6 00:43:14 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 10:43:14 +0300 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060605081948.044849d0@netapp.com> References: <7.0.1.0.2.20060605081948.044849d0@netapp.com> Message-ID: <20060606074314.GC2432@mellanox.co.il> Quoting r. Talpey, Thomas : > Semantically, the provider is not required to provide any such flow control > behavior by the way. The Mellanox one apparently does, but it is not > a requirement of the verbs, it's a requirement on the upper layer. If more > RDMA Reads are posted than the remote peer supports, the connection > may break. This does not sound right. Isn't this the meaning of this field: "Initiator Depth: Number of RDMA Reads & atomic operations outstanding at any time"? Shouldn't any provider enforce this limit? -- MST From Thomas.Talpey at netapp.com Tue Jun 6 05:24:23 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 06 Jun 2006 08:24:23 -0400 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <20060606074314.GC2432@mellanox.co.il> References: <7.0.1.0.2.20060605081948.044849d0@netapp.com> <20060606074314.GC2432@mellanox.co.il> Message-ID: <7.0.1.0.2.20060606081548.0469bcc8@netapp.com> At 03:43 AM 6/6/2006, Michael S. Tsirkin wrote: >Quoting r. Talpey, Thomas : >> Semantically, the provider is not required to provide any such flow control >> behavior by the way.
The Mellanox one apparently does, but it is not >> a requirement of the verbs, it's a requirement on the upper layer. If more >> RDMA Reads are posted than the remote peer supports, the connection >> may break. > >This does not sound right. Isn't this the meaning of this field: >"Initiator Depth: Number of RDMA Reads & atomic operations >outstanding at any time"? Shouldn't any provider enforce this limit? The core spec does not require it. An implementation *may* enforce it, but is not *required* to do so. And as pointed out in the other message, there are repercussions of doing so. I believe the silent queue stalling is a bit of a time bomb for upper layers, whose implementers are quite likely unaware of the danger. I greatly prefer an implementation which simply sends the RDMA Read request, resulting in a failed (but unblocked!) connection. Silence is a very dangerous thing, no matter how helpful the intent. Tom. From Thomas.Talpey at netapp.com Tue Jun 6 05:13:32 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 06 Jun 2006 08:13:32 -0400 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <20060606070954.GB2432@mellanox.co.il> References: <44846EFF.2020705@veritas.com> <20060606070954.GB2432@mellanox.co.il> Message-ID: <7.0.1.0.2.20060606080728.086feab0@netapp.com> At 03:09 AM 6/6/2006, Michael S. Tsirkin wrote: >Quoting r. somenath : >> possibility of stalling is scary! > >You might want to review chapter 9.5 TRANSACTION ORDERING for info on when will >ordering rules cause the IB QP to stall. MST, are you disagreeing that RDMA Reads can stall the queue? Section 9.5, C9-25 lays it right out as the first requirement: >> C9-25: A requester shall transmit request messages in the order that the >> Work Queue Elements (WQEs) were posted. Therefore, a provider which implements flow control on RDMA Reads cannot transmit new sends until the prior RDMA Reads can be initiated. 
Of course, they may complete in a somewhat different order... It's all about flow control - which is not mandatory. It's a convenient, but very risky thing. Upper layers are often unaware of its ramifications. Tom. From mst at mellanox.co.il Tue Jun 6 05:44:26 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 15:44:26 +0300 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060606080728.086feab0@netapp.com> References: <7.0.1.0.2.20060606080728.086feab0@netapp.com> Message-ID: <20060606124426.GH2432@mellanox.co.il> Quoting r. Talpey, Thomas : > Subject: Re: Mellanox HCAs: outstanding RDMAs > > At 03:09 AM 6/6/2006, Michael S. Tsirkin wrote: > >Quoting r. somenath : > >> possibility of stalling is scary! > > > >You might want to review chapter 9.5 TRANSACTION ORDERING for info on when > >will ordering rules cause the IB QP to stall. > > MST, are you disagreeing that RDMA Reads can stall the queue? I don't disagree with this of course. I was simply suggesting to ULP designers to read the chapter 9.5 and become aware of the rules, taking them into account at early stages of protocol design. -- MST From Thomas.Talpey at netapp.com Tue Jun 6 05:52:04 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 06 Jun 2006 08:52:04 -0400 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <20060606124426.GH2432@mellanox.co.il> References: <7.0.1.0.2.20060606080728.086feab0@netapp.com> <20060606124426.GH2432@mellanox.co.il> Message-ID: <7.0.1.0.2.20060606084959.0469bcc8@netapp.com> At 08:44 AM 6/6/2006, Michael S. Tsirkin wrote: >> MST, are you disagreeing that RDMA Reads can stall the queue? > >I don't disagree with this of course. I was simply suggesting to ULP designers >to read the chapter 9.5 and become aware of the rules, taking them >into account at early stages of protocol design. :-) RTFM? I still think flow control is wrong and dangerous thing for RDMA Read. 
If it never happened, and the connections just failed, we'd never have the issue. Also, I'm certain we'll see upper layers that work on one provider, only to fail on another. Sigh. Tom. From mst at mellanox.co.il Tue Jun 6 05:56:34 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 15:56:34 +0300 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060606081548.0469bcc8@netapp.com> References: <7.0.1.0.2.20060606081548.0469bcc8@netapp.com> Message-ID: <20060606125634.GI2432@mellanox.co.il> Quoting r. Talpey, Thomas : > Subject: Re: Mellanox HCAs: outstanding RDMAs > > At 03:43 AM 6/6/2006, Michael S. Tsirkin wrote: > >Quoting r. Talpey, Thomas : > >> Semantically, the provider is not required to provide any such flow control > >> behavior by the way. The Mellanox one apparently does, but it is not > >> a requirement of the verbs, it's a requirement on the upper layer. If more > >> RDMA Reads are posted than the remote peer supports, the connection > >> may break. > > > >This does not sound right. Isn't this the meaning of this field: > >"Initiator Depth: Number of RDMA Reads & atomic operations > >outstanding at any time"? Shouldn't any provider enforce this limit? > > The core spec does not require it. An implementation *may* enforce it, > but is not *required* to do so. And as pointed out in the other message, > there are repercussions of doing so. Interesting, I wasn't aware of such interpretation of the spec. When QP is modified to RTS, the initiator depth is passed to it, which suggests that the provider must obey, not ignore this parameter. No? 
-- MST From Thomas.Talpey at netapp.com Tue Jun 6 06:42:15 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 06 Jun 2006 09:42:15 -0400 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <20060606125634.GI2432@mellanox.co.il> References: <7.0.1.0.2.20060606081548.0469bcc8@netapp.com> <20060606125634.GI2432@mellanox.co.il> Message-ID: <7.0.1.0.2.20060606093959.086feab0@netapp.com> At 08:56 AM 6/6/2006, Michael S. Tsirkin wrote: >> The core spec does not require it. An implementation *may* enforce it, >> but is not *required* to do so. And as pointed out in the other message, >> there are repercussions of doing so. > >Interesting, I wasn't aware of such interpretation of the spec. >When QP is modified to RTS, the initiator depth is passed to it, which >suggests that the provider must obey, not ignore this parameter. No? This is the difference between "may" and "must". The value is provided, but I don't see anything in the spec that makes a requirement on its enforcement. Table 107 says the consumer can query it, that's about as close as it comes. There's some discussion about CM exchange too. Don't forget about iWARP, btw. Tom. From jlentini at netapp.com Tue Jun 6 06:44:51 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 6 Jun 2006 09:44:51 -0400 (EDT) Subject: [openib-general] Fix some suspicious ppc64 code in dapl In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7122@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7122@mtlexch01.mtl.com> Message-ID: On Mon, 5 Jun 2006, Tziporet Koren wrote: > Is it important to take this patch to the OFED release? 
It may fix http://openib.org/bugzilla/show_bug.cgi?id=48 From halr at voltaire.com Tue Jun 6 06:45:00 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Jun 2006 09:45:00 -0400 Subject: [openib-general] [PATCH] [MINOR] OpenSM: Minor improvement to a couple of SA error paths Message-ID: <1149601493.4510.243499.camel@hal.voltaire.com> OpenSM: Minor improvement to a couple of SA error paths Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_slvl_record.c =================================================================== --- opensm/osm_sa_slvl_record.c (revision 7718) +++ opensm/osm_sa_slvl_record.c (working copy) @@ -158,15 +158,6 @@ __osm_sa_slvl_create( OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_slvl_create ); - if (p_physp->p_node->node_info.node_type != IB_NODE_TYPE_SWITCH) - { - lid = osm_physp_get_port_info_ptr( p_physp )->base_lid; - } - else - { - lid = osm_node_get_base_lid( p_physp->p_node, 0 ); - } - p_rec_item = (osm_slvl_item_t*)cl_qlock_pool_get( &p_rcv->pool ); if( p_rec_item == NULL ) { @@ -177,6 +168,15 @@ __osm_sa_slvl_create( goto Exit; } + if (p_physp->p_node->node_info.node_type != IB_NODE_TYPE_SWITCH) + { + lid = osm_physp_get_port_info_ptr( p_physp )->base_lid; + } + else + { + lid = osm_node_get_base_lid( p_physp->p_node, 0 ); + } + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, Index: opensm/osm_sa_vlarb_record.c =================================================================== --- opensm/osm_sa_vlarb_record.c (revision 7718) +++ opensm/osm_sa_vlarb_record.c (working copy) @@ -158,15 +158,6 @@ __osm_sa_vl_arb_create( OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_vl_arb_create ); - if (p_physp->p_node->node_info.node_type != IB_NODE_TYPE_SWITCH) - { - lid = osm_physp_get_port_info_ptr( p_physp )->base_lid; - } - else - { - lid = osm_node_get_base_lid( p_physp->p_node, 0 ); - } - p_rec_item = (osm_vl_arb_item_t*)cl_qlock_pool_get( &p_rcv->pool ); if( p_rec_item == NULL ) { @@ -177,6 +168,15 @@ 
__osm_sa_vl_arb_create( goto Exit; } + if (p_physp->p_node->node_info.node_type != IB_NODE_TYPE_SWITCH) + { + lid = osm_physp_get_port_info_ptr( p_physp )->base_lid; + } + else + { + lid = osm_node_get_base_lid( p_physp->p_node, 0 ); + } + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, From rdreier at cisco.com Tue Jun 6 07:40:26 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 06 Jun 2006 07:40:26 -0700 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060606093959.086feab0@netapp.com> (Thomas Talpey's message of "Tue, 06 Jun 2006 09:42:15 -0400") References: <7.0.1.0.2.20060606081548.0469bcc8@netapp.com> <20060606125634.GI2432@mellanox.co.il> <7.0.1.0.2.20060606093959.086feab0@netapp.com> Message-ID: Thomas> This is the difference between "may" and "must". The value Thomas> is provided, but I don't see anything in the spec that Thomas> makes a requirement on its enforcement. Table 107 says the Thomas> consumer can query it, that's about as close as it Thomas> comes. There's some discussion about CM exchange too. This seems like a very strained interpretation of the spec. For example, there's no explicit language in the IB spec that requires an HCA to use the destination LID passed via a modify QP operation, but I don't think anyone would seriously argue that an implementation that sent messages to some other random destination was compliant. In the same way, if I pass a limit for the number of outstanding RDMA/atomic operations in to a modify QP operation, I would expect the HCA to use that limit. - R. 
From Thomas.Talpey at netapp.com Tue Jun 6 07:49:08 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 06 Jun 2006 10:49:08 -0400 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: References: <7.0.1.0.2.20060606081548.0469bcc8@netapp.com> <20060606125634.GI2432@mellanox.co.il> <7.0.1.0.2.20060606093959.086feab0@netapp.com> Message-ID: <7.0.1.0.2.20060606104534.086feab0@netapp.com> At 10:40 AM 6/6/2006, Roland Dreier wrote: > Thomas> This is the difference between "may" and "must". The value > Thomas> is provided, but I don't see anything in the spec that > Thomas> makes a requirement on its enforcement. Table 107 says the > Thomas> consumer can query it, that's about as close as it > Thomas> comes. There's some discussion about CM exchange too. > >This seems like a very strained interpretation of the spec. For I don't see how strained has anything to do with it. It's not saying anything either way. So, a legal implementation can make either choice. We're talking about the spec! But, it really doesn't matter. The point is, an upper layer should be paying attention to the number of RDMA Reads it posts, or else suffer either the queue-stalling or connection-failing consequences. Bad stuff either way. Tom. >example, there's no explicit language in the IB spec that requires an >HCA to use the destination LID passed via a modify QP operation, but I >don't think anyone would seriously argue that an implementation that >sent messages to some other random destination was compliant. > >In the same way, if I pass a limit for the number of outstanding >RDMA/atomic operations in to a modify QP operation, I would expect the >HCA to use that limit. > > - R. 
From rdreier at cisco.com Tue Jun 6 08:00:16 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 06 Jun 2006 08:00:16 -0700 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060606104534.086feab0@netapp.com> (Thomas Talpey's message of "Tue, 06 Jun 2006 10:49:08 -0400") References: <7.0.1.0.2.20060606081548.0469bcc8@netapp.com> <20060606125634.GI2432@mellanox.co.il> <7.0.1.0.2.20060606093959.086feab0@netapp.com> <7.0.1.0.2.20060606104534.086feab0@netapp.com> Message-ID: Thomas> I don't see how strained has anything to do with it. It's Thomas> not saying anything either way. So, a legal implementation Thomas> can make either choice. We're talking about the spec! I guess the reason I say it is strained is because the spec does have the following compliance statement for the modify QP verb: C11-8 Upon invocation of this Verb, the CI shall modify the attributes for the specified QP... So what should I expect to happen if I modify the number of outstanding RDMA Read/atomic operations? That the HCA will ignore that attribute? To me the only sensible interpretation of the spec is that setting a limit on outstanding operations will limit the number of outstanding operations. If the attribute doesn't do anything, then why would the spec include it? - R. From trimmer at silverstorm.com Tue Jun 6 09:43:23 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Tue, 6 Jun 2006 12:43:23 -0400 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs Message-ID: > Talpey, Thomas > Sent: Tuesday, June 06, 2006 10:49 AM > > At 10:40 AM 6/6/2006, Roland Dreier wrote: > > Thomas> This is the difference between "may" and "must". The value > > Thomas> is provided, but I don't see anything in the spec that > > Thomas> makes a requirement on its enforcement. Table 107 says the > > Thomas> consumer can query it, that's about as close as it > > Thomas> comes. There's some discussion about CM exchange too. 
> > > >This seems like a very strained interpretation of the spec. For > > I don't see how strained has anything to do with it. It's not saying > anything > either way. So, a legal implementation can make either choice. We're > talking about the spec! > > But, it really doesn't matter. The point is, an upper layer should be > paying > attention to the number of RDMA Reads it posts, or else suffer either the > queue-stalling or connection-failing consequences. Bad stuff either way. > > Tom. Somewhere beneath this discussion is a bug in the application or IB stack. I'm not sure which "may" in the spec you are referring to, but the "may"s I have found all are for cases where the responder might support only 1 outstanding request. In all cases the negotiation protocol must be followed and the requestor is not allowed to exceed the negotiated limit. The mechanism should be: client queries its local HCA and determines responder resources (eg. number of concurrent outstanding RDMA reads on the wire from the remote end where this end will respond with the read data) and initiator depth (eg. number of concurrent outstanding RDMA reads which this end can initiate as the requestor). client puts the above information in the CM REQ. server similarly gets its information from its local CA and negotiates down the values to the MIN of each side (REP.InitiatorDepth = MIN(REQ.ResponderResources, server's local CAs Initiator depth); REP.ResponderResources = MIN(REQ.InitiatorDepth, server's local CAs responder resources). If server does not support RDMA Reads, it can REJ. If client decided the negotiated values are insufficient to meet its goals, it can disconnect. Each side sets its QP parameters via modify QP appropriately. 
Note they too will be mirror images of each other: client: QP.Max RDMA Reads as Initiator = REP.ResponderResources QP.Max RDMA reads as responder = REP.InitiatorDepth server: QP.Max RDMA Reads as responder = REP.ResponderResources QP.Max RDMA reads as initiator = REP.InitiatorDepth We have done a lot of high stress RDMA Read traffic with Mellanox HCAs and provided the above negotiation is followed, we have seen no issues. Note however that by default a Mellanox HCA typically reports a large InitiatorDepth (128) and a modest ResponderResources (4-8). Hence when I hear that Responder Resources must be grown to 128 for some application to reliably work, it implies the negotiation I outlined above is not being followed. Note that the ordering rules in table 76 of IBTA 1.2 show how reads and writes on a send queue are ordered. There are many cases where an op can pass an outstanding RDMA read, hence it is not always bad to queue extra RDMA reads. If needed, the Fence can be sent to force order. For many apps, it's going to be better to get the items onto the queue and let the QP handle the outstanding reads cases rather than have the app add a level of queuing for this purpose. Letting the HCA do the queuing will allow for a more rapid initiation of subsequent reads. Todd Rimmer From sean.hefty at intel.com Tue Jun 6 09:55:11 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 6 Jun 2006 09:55:11 -0700 Subject: [openib-general] multicast questions Message-ID: Does anyone know if the following multicast configurations have been tested? 1. Receiving messages on the same port that they were sent, but on a different QP. 2. Receiving messages on multiple QPs on the same port. - Sean From mst at mellanox.co.il Tue Jun 6 10:23:46 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Tue, 6 Jun 2006 20:23:46 +0300 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060606104534.086feab0@netapp.com> References: <7.0.1.0.2.20060606104534.086feab0@netapp.com> Message-ID: <20060606172345.GB4397@mellanox.co.il> Quoting r. Talpey, Thomas : > But, it really doesn't matter. The point is, an upper layer should be paying > attention to the number of RDMA Reads it posts, or else suffer either the > queue-stalling or connection-failing consequences. Bad stuff either way. Queue-stalling is not necessarily bad, for example if the ULP needs to perform multiple RDMA reads anyway. You can use multiple QPs if you do not require ordering between operations. Connection-failing *is* bad stuff, IMO it might be compliant but it's clearly broken in the same way that a NIC that drops all packets might be compliant but is broken. -- MST From halr at voltaire.com Tue Jun 6 10:27:07 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Jun 2006 13:27:07 -0400 Subject: [openib-general] [PATCH][MINOR] OpenSM: Fix inconsistent use of osm_log level Message-ID: <1149614823.4510.248559.camel@hal.voltaire.com> OpenSM: Fix inconsistent use of osm_log level Also, some other cosmetic changes Signed-off-by: Hal Rosenstock Index: opensm/osm_pkey_rcv.c =================================================================== --- opensm/osm_pkey_rcv.c (revision 7733) +++ opensm/osm_pkey_rcv.c (working copy) @@ -200,13 +200,10 @@ osm_pkey_rcv_process( */ if( !osm_physp_is_valid( p_physp ) ) { - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_VERBOSE ) ) - { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pkey_rcv_process: ERR 4807: " - "Got invalid port number 0x%X\n", - port_num ); - } + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_pkey_rcv_process: ERR 4807: " + "Got invalid port number 0x%X\n", + port_num ); goto Exit; } Index: opensm/osm_sa_guidinfo_record.c =================================================================== ---
opensm/osm_sa_guidinfo_record.c (revision 7733) +++ opensm/osm_sa_guidinfo_record.c (working copy) @@ -171,7 +171,7 @@ __osm_gir_rcv_new_gir( if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_gir_rcv_new_gir: " "New GUIDInfoRecord: lid 0x%X, block num %d\n", cl_ntoh16( match_lid ), block_num ); @@ -220,7 +220,7 @@ __osm_sa_gir_create_gir( if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_sa_gir_create_gir: " "Looking for GUIDRecord with LID: 0x%X GUID:0x%016" PRIx64 "\n", cl_ntoh16( match_lid ), @@ -282,7 +282,7 @@ __osm_sa_gir_create_gir( */ if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_sa_gir_create_gir: " "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", cl_ntoh16( base_lid_ho ), @@ -495,7 +495,7 @@ osm_gir_rcv_process( if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1)) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_gir_rcv_process: " + "osm_gir_rcv_process: ERR 5103: " "Got more than one record for SubnAdmGet (%u)\n", num_rec ); osm_sa_send_error( p_rcv->p_resp, p_madw, Index: opensm/osm_sa_vlarb_record.c =================================================================== --- opensm/osm_sa_vlarb_record.c (revision 7733) +++ opensm/osm_sa_vlarb_record.c (working copy) @@ -179,7 +179,7 @@ __osm_sa_vl_arb_create( if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_sa_vl_arb_create: " "New VLArbitration for: port 0x%016" PRIx64 ", lid 0x%X, port# 0x%X Block:%u\n", @@ -416,7 +416,7 @@ osm_vlarb_rec_rcv_process( else { /* port out of range */ osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_vlarb_rec_rcv_process: " + "osm_vlarb_rec_rcv_process: ERR 2A01: " "Given LID (%u) is out of range:%u\n", 
cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl) ); } @@ -444,7 +444,7 @@ osm_vlarb_rec_rcv_process( if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1)) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_vlarb_rec_rcv_process: " + "osm_vlarb_rec_rcv_process: ERR 2A08: " "Got more than one record for SubnAdmGet (%u)\n", num_rec ); osm_sa_send_error( p_rcv->p_resp, p_madw, Index: opensm/osm_sa_multipath_record.c =================================================================== --- opensm/osm_sa_multipath_record.c (revision 7733) +++ opensm/osm_sa_multipath_record.c (working copy) @@ -1281,7 +1281,8 @@ __osm_mpr_rcv_process_pairs( max_paths - total_paths, comp_mask, p_list ); total_paths += num_paths; - osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_mpr_rcv_process_pairs: " + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_mpr_rcv_process_pairs: " "%d paths %d total paths %d max paths\n", num_paths, total_paths, max_paths ); /* Just take first NumbPaths found */ @@ -1468,7 +1469,8 @@ osm_mpr_rcv_process( if ( sa_status != IB_SA_MAD_STATUS_SUCCESS || !nsrc || !ndest ) { if ( sa_status == IB_SA_MAD_STATUS_SUCCESS && ( !nsrc || !ndest ) ) - osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_mpr_rcv_process_cb: ERR 4512: " + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_mpr_rcv_process_cb: ERR 4512: " "__osm_mpr_rcv_get_end_points failed, not enough GIDs " "(nsrc %d ndest %d)\n", nsrc, ndest); Index: opensm/osm_subnet.c =================================================================== --- opensm/osm_subnet.c (revision 7733) +++ opensm/osm_subnet.c (working copy) @@ -250,7 +250,7 @@ osm_get_gid_by_mad_addr( if ( p_gid == NULL ) { osm_log( p_log, OSM_LOG_ERROR, - "osm_get_gid_by_mad_addr: ERR 7505 " + "osm_get_gid_by_mad_addr: ERR 7505: " "Provided output GID is NULL\n"); return(IB_INVALID_PARAMETER); } @@ -281,7 +281,7 @@ osm_get_gid_by_mad_addr( { /* The dest_lid is not in the subnet table - this is an error */ osm_log( p_log, OSM_LOG_ERROR, - 
"osm_get_gid_by_mad_addr: ERR 7501 " + "osm_get_gid_by_mad_addr: ERR 7501: " "LID is out of range: 0x%X\n", cl_ntoh16(p_mad_addr->dest_lid) ); @@ -316,7 +316,7 @@ osm_get_physp_by_mad_addr( { /* The port is not in the port_lid table - this is an error */ osm_log( p_log, OSM_LOG_ERROR, - "osm_get_physp_by_mad_addr: ERR 7502 " + "osm_get_physp_by_mad_addr: ERR 7502: " "Cannot locate port object by lid: 0x%X\n", cl_ntoh16(p_mad_addr->dest_lid) ); @@ -329,7 +329,7 @@ osm_get_physp_by_mad_addr( { /* The dest_lid is not in the subnet table - this is an error */ osm_log( p_log, OSM_LOG_ERROR, - "osm_get_physp_by_mad_addr: ERR 7503 " + "osm_get_physp_by_mad_addr: ERR 7503: " "Lid is out of range: 0x%X\n", cl_ntoh16(p_mad_addr->dest_lid) ); @@ -365,7 +365,7 @@ osm_get_port_by_mad_addr( { /* The dest_lid is not in the subnet table - this is an error */ osm_log( p_log, OSM_LOG_ERROR, - "osm_get_port_by_mad_addr: ERR 7504 " + "osm_get_port_by_mad_addr: ERR 7504: " "Lid is out of range: 0x%X\n", cl_ntoh16(p_mad_addr->dest_lid) ); Index: opensm/osm_sa_lft_record.c =================================================================== --- opensm/osm_sa_lft_record.c (revision 7733) +++ opensm/osm_sa_lft_record.c (working copy) @@ -510,7 +510,7 @@ osm_lftr_rcv_process( { osm_log(p_rcv->p_log, OSM_LOG_ERROR, "osm_lftr_rcv_process: ERR 4411: " - "osm_vendor_send. 
status = %s\n", + "osm_vendor_send status = %s\n", ib_get_err_str(status)); goto Exit; } Index: opensm/osm_pkey_rcv_ctrl.c =================================================================== --- opensm/osm_pkey_rcv_ctrl.c (revision 7733) +++ opensm/osm_pkey_rcv_ctrl.c (working copy) @@ -110,7 +110,7 @@ osm_pkey_rcv_ctrl_init( { osm_log( p_log, OSM_LOG_ERROR, "osm_pkey_rcv_ctrl_init: ERR 4901: " - "Dispatcher registration failed.\n" ); + "Dispatcher registration failed\n" ); status = IB_INSUFFICIENT_RESOURCES; goto Exit; } Index: opensm/osm_sa_service_record.c =================================================================== --- opensm/osm_sa_service_record.c (revision 7733) +++ opensm/osm_sa_service_record.c (working copy) @@ -1115,7 +1115,7 @@ osm_sr_rcv_process( default: osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "osm_sr_rcv_process: " - "Bad Method (%s)\n", ib_get_sa_method_str( p_sa_mad->method )); + "Bad Method (%s)\n", ib_get_sa_method_str( p_sa_mad->method ) ); osm_sa_send_error( p_rcv->p_resp, p_madw, sa_status ); break; } Index: opensm/osm_sa_portinfo_record.c =================================================================== --- opensm/osm_sa_portinfo_record.c (revision 7733) +++ opensm/osm_sa_portinfo_record.c (working copy) @@ -168,7 +168,7 @@ __osm_pir_rcv_new_pir( if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_pir_rcv_new_pir: " "New PortInfoRecord: port 0x%016" PRIx64 ", lid 0x%X, port# 0x%X\n", @@ -678,7 +678,7 @@ osm_pir_rcv_process( else { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pir_rcv_process: " + "osm_pir_rcv_process: ERR 2101: " "Given LID (%u) is out of range:%u\n", cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl)); } @@ -694,7 +694,7 @@ osm_pir_rcv_process( else { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pir_rcv_process: " + "osm_pir_rcv_process: ERR 2103: " "Given LID (%u) is out of range:%u\n", cl_ntoh16(p_pi->base_lid), 
cl_ptr_vector_get_size(p_tbl)); } @@ -721,7 +721,7 @@ osm_pir_rcv_process( if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1)) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pir_rcv_process: " + "osm_pir_rcv_process: ERR 2108: " "Got more than one record for SubnAdmGet (%u)\n", num_rec ); osm_sa_send_error( p_rcv->p_resp, p_madw, @@ -852,7 +852,7 @@ osm_pir_rcv_process( { osm_log(p_rcv->p_log, OSM_LOG_ERROR, "osm_pir_rcv_process: ERR 2107: " - "osm_vendor_send. status = %s\n", + "osm_vendor_send status = %s\n", ib_get_err_str(status)); goto Exit; } Index: opensm/osm_sa_pkey_record.c =================================================================== --- opensm/osm_sa_pkey_record.c (revision 7733) +++ opensm/osm_sa_pkey_record.c (working copy) @@ -169,7 +169,7 @@ __osm_sa_pkey_create( if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_sa_pkey_create: " "New P_Key table for: port 0x%016" PRIx64 ", lid 0x%X, port# 0x%X Block:%u\n", @@ -432,7 +432,7 @@ osm_pkey_rec_rcv_process( else { /* port out of range */ osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pkey_rec_rcv_process: " + "osm_pkey_rec_rcv_process: ERR 4609: " "Given LID (%u) is out of range:%u\n", cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl)); } @@ -460,7 +460,7 @@ osm_pkey_rec_rcv_process( if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1)) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pkey_rec_rcv_process: " + "osm_pkey_rec_rcv_process: ERR 460A: " "Got more than one record for SubnAdmGet (%u)\n", num_rec ); osm_sa_send_error( p_rcv->p_resp, p_madw, Index: opensm/osm_inform.c =================================================================== --- opensm/osm_inform.c (revision 7733) +++ opensm/osm_inform.c (working copy) @@ -283,7 +283,7 @@ osm_infr_insert_to_db( "Inserting a new InformInfo Record into Database\n"); osm_log( p_log, OSM_LOG_DEBUG, "osm_infr_insert_to_db: " - "Dump 
before insertion (size : %d) : \n", + "Dump before insertion (size : %d)\n", cl_qlist_count(&p_subn->sa_infr_list) ); __dump_all_informs(p_subn, p_log); @@ -295,7 +295,7 @@ osm_infr_insert_to_db( osm_log( p_log, OSM_LOG_DEBUG, "osm_infr_insert_to_db: " - "Dump after insertion (size : %d) : \n", + "Dump after insertion (size : %d)\n", cl_qlist_count(&p_subn->sa_infr_list) ); __dump_all_informs(p_subn, p_log); OSM_LOG_EXIT( p_log ); Index: opensm/osm_sa_slvl_record.c =================================================================== --- opensm/osm_sa_slvl_record.c (revision 7733) +++ opensm/osm_sa_slvl_record.c (working copy) @@ -179,7 +179,7 @@ __osm_sa_slvl_create( if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_sa_slvl_create: " "New SLtoVL Map for: OUT port 0x%016" PRIx64 ", lid 0x%X, port# 0x%X to In Port:%u\n", @@ -395,7 +395,7 @@ osm_slvl_rec_rcv_process( else { /* port out of range */ osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_slvl_rec_rcv_process: " + "osm_slvl_rec_rcv_process: ERR 2601: " "Given LID (%u) is out of range:%u\n", cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl)); } @@ -423,7 +423,7 @@ osm_slvl_rec_rcv_process( if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1)) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_slvl_rec_rcv_process: " + "osm_slvl_rec_rcv_process: ERR 2607: " "Got more than one record for SubnAdmGet (%u)\n", num_rec ); osm_sa_send_error( p_rcv->p_resp, p_madw, Index: opensm/osm_mcast_mgr.c =================================================================== --- opensm/osm_mcast_mgr.c (revision 7738) +++ opensm/osm_mcast_mgr.c (working copy) @@ -1130,7 +1130,7 @@ osm_mcast_mgr_process_single( p_sw_tbl = &p_mgr->p_subn->sw_guid_tbl; mlid_ho = cl_ntoh16( mlid ); - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, 
"osm_mcast_mgr_process_single: " @@ -1249,7 +1249,7 @@ osm_mcast_mgr_process_single( { if( join_state & IB_JOIN_STATE_SEND_ONLY ) { - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "osm_mcast_mgr_process_single: " @@ -1269,7 +1269,7 @@ osm_mcast_mgr_process_single( } else { - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "osm_mcast_mgr_process_single: " Index: opensm/osm_trap_rcv.c =================================================================== --- opensm/osm_trap_rcv.c (revision 7733) +++ opensm/osm_trap_rcv.c (working copy) @@ -678,7 +678,7 @@ __osm_trap_rcv_process_sm( OSM_LOG_ENTER( p_rcv->p_log, __osm_trap_rcv_process_sm ); osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_trap_rcv_process_sm: " + "__osm_trap_rcv_process_sm: ERR 3807: " "This function is not supported yet\n"); OSM_LOG_EXIT( p_rcv->p_log ); @@ -696,7 +696,7 @@ __osm_trap_rcv_process_response( OSM_LOG_ENTER( p_rcv->p_log, __osm_trap_rcv_process_response ); osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_trap_rcv_process_response: " + "__osm_trap_rcv_process_response: ERR 3808: " "This function is not supported yet\n"); OSM_LOG_EXIT( p_rcv->p_log ); Index: opensm/osm_sa_informinfo.c =================================================================== --- opensm/osm_sa_informinfo.c (revision 7733) +++ opensm/osm_sa_informinfo.c (working copy) @@ -357,14 +357,15 @@ osm_infr_rcv_process_set_method( p_recvd_inform_info = (ib_inform_info_t*)ib_sa_mad_get_payload_ptr( p_sa_mad ); - /* the dump routine is not defined yet - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) - { - osm_dump_inform_info_record( p_rcv->p_log, - p_recvd_service_rec, - OSM_LOG_DEBUG ); - } - */ +#if 0 + /* the dump routine is not implemented yet */ + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + { + 
osm_dump_inform_info_record( p_rcv->p_log, + p_recvd_inform_info, + OSM_LOG_DEBUG ); + } +#endif /* Grab the lock */ cl_plock_excl_acquire( p_rcv->p_lock ); Index: opensm/osm_ucast_updn.c =================================================================== --- opensm/osm_ucast_updn.c (revision 7733) +++ opensm/osm_ucast_updn.c (working copy) @@ -879,7 +879,7 @@ osm_subn_calc_up_down_min_hop_table( if (num_guids == 0) { osm_log(&(osm.log), OSM_LOG_ERROR, - "osm_subn_calc_up_down_min_hop_table: " + "osm_subn_calc_up_down_min_hop_table: ERR AA0A: " "No guids were given or number of guids is 0\n"); return 1; } Index: opensm/osm_sa_node_record.c =================================================================== --- opensm/osm_sa_node_record.c (revision 7733) +++ opensm/osm_sa_node_record.c (working copy) @@ -161,7 +161,7 @@ __osm_nr_rcv_new_nr( if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_nr_rcv_new_nr: " "New NodeRecord: node 0x%016" PRIx64 "\n\t\t\t\tport 0x%016" PRIx64 ", lid 0x%X\n", @@ -211,7 +211,7 @@ __osm_nr_rcv_create_nr( if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_nr_rcv_create_nr: " "Looking for NodeRecord with LID: 0x%X GUID:0x%016" PRIx64 "\n", cl_ntoh16( match_lid ), @@ -257,7 +257,7 @@ __osm_nr_rcv_create_nr( */ if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_nr_rcv_create_nr: " "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", cl_ntoh16( base_lid_ho ), @@ -326,7 +326,7 @@ __osm_nr_rcv_by_comp_mask( */ if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_nr_rcv_by_comp_mask: " "Looking for node 0x%016" PRIx64 ", found 0x%016" PRIx64 "\n", @@ -493,7 +493,7 @@ osm_nr_rcv_process( */ if ( 
(p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1) ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_nr_rcv_process: " + "osm_nr_rcv_process: ERR 1D03: " "Got more than one record for SubnAdmGet (%u)\n", num_rec ); osm_sa_send_error( p_rcv->p_resp, p_madw, Index: opensm/osm_sa_link_record.c =================================================================== --- opensm/osm_sa_link_record.c (revision 7733) +++ opensm/osm_sa_link_record.c (working copy) @@ -312,7 +312,7 @@ __osm_lr_rcv_get_physp_link( { osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_lr_rcv_get_physp_link: " - "Acquiring link record.\n" + "Acquiring link record\n" "\t\t\t\tsrc port 0x%" PRIx64 " (port 0x%X)" ", dest port 0x%" PRIx64 " (port 0x%X)\n", cl_ntoh64( osm_physp_get_port_guid( p_src_physp ) ), @@ -606,7 +606,7 @@ __osm_lr_rcv_respond( if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1)) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_lr_rcv_respond: " + "__osm_lr_rcv_respond: ERR 1806: " "Got more than one record for SubnAdmGet (%u)\n", num_rec ); osm_sa_send_error( p_rcv->p_resp, p_madw, Index: opensm/osm_slvl_map_rcv.c =================================================================== --- opensm/osm_slvl_map_rcv.c (revision 7733) +++ opensm/osm_slvl_map_rcv.c (working copy) @@ -211,13 +211,10 @@ osm_slvl_rcv_process( */ if( !osm_physp_is_valid( p_physp ) ) { - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_VERBOSE ) ) - { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_slvl_rcv_process: " - "Got invalid port number 0x%X\n", - out_port_num ); - } + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_slvl_rcv_process: " + "Got invalid port number 0x%X\n", + out_port_num ); goto Exit; } Index: opensm/osm_sa_link_record_ctrl.c =================================================================== --- opensm/osm_sa_link_record_ctrl.c (revision 7733) +++ opensm/osm_sa_link_record_ctrl.c (working copy) @@ -116,7 +116,7 @@ osm_lr_rcv_ctrl_init( { osm_log( p_log, OSM_LOG_ERROR, 
"osm_lr_rcv_ctrl_init: ERR 1901: " - "Dispatcher registration failed.\n" ); + "Dispatcher registration failed\n" ); status = IB_INSUFFICIENT_RESOURCES; goto Exit; } Index: opensm/osm_qos.c =================================================================== --- opensm/osm_qos.c (revision 7733) +++ opensm/osm_qos.c (working copy) @@ -279,7 +279,8 @@ static ib_api_status_t qos_physp_setup(o /* setup vl high limit */ status = vl_high_limit_update(p_req, p, qcfg); if (status != IB_SUCCESS) { - osm_log(p_log, OSM_LOG_ERROR, "qos_physp_setup: " + osm_log(p_log, OSM_LOG_ERROR, + "qos_physp_setup: ERR 6201 : " "failed to update VLHighLimit " "for port %" PRIx64 " #%d\n", cl_ntoh64(p->port_guid), port_num); @@ -289,7 +290,8 @@ static ib_api_status_t qos_physp_setup(o /* setup VLArbitration */ status = vlarb_update(p_req, p, port_num, qcfg); if (status != IB_SUCCESS) { - osm_log(p_log, OSM_LOG_ERROR, "qos_physp_setup: " + osm_log(p_log, OSM_LOG_ERROR, + "qos_physp_setup: ERR 6202 : " "failed to update VLArbitration tables " "for port %" PRIx64 " #%d\n", cl_ntoh64(p->port_guid), port_num); @@ -299,7 +301,8 @@ static ib_api_status_t qos_physp_setup(o /* setup Sl2VL tables */ status = sl2vl_update(p_req, p, port_num, qcfg); if (status != IB_SUCCESS) { - osm_log(p_log, OSM_LOG_ERROR, "qos_physp_setup: " + osm_log(p_log, OSM_LOG_ERROR, + "qos_physp_setup: ERR 6203 : " "failed to update SL2VLMapping tables " "for port %" PRIx64 " #%d\n", cl_ntoh64(p->port_guid), port_num); Index: opensm/osm_sa_mcmember_record.c =================================================================== --- opensm/osm_sa_mcmember_record.c (revision 7733) +++ opensm/osm_sa_mcmember_record.c (working copy) @@ -2286,7 +2286,7 @@ osm_mcmr_query_mgrp(IN osm_mcmr_recv_t* { osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_mcmr_query_mgrp: ERR 1B17: " - "osm_vendor_send. 
status = %s\n", + "osm_vendor_send status = %s\n", ib_get_err_str(status) ); goto Exit; } Index: opensm/osm_drop_mgr.c =================================================================== --- opensm/osm_drop_mgr.c (revision 7733) +++ opensm/osm_drop_mgr.c (working copy) @@ -512,7 +512,7 @@ __osm_drop_mgr_check_node( if ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) { - osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, + osm_log( p_mgr->p_log, OSM_LOG_ERROR, "__osm_drop_mgr_check_node: ERR 0107: " "Node 0x%016" PRIx64 " is not a switch node\n", cl_ntoh64( node_guid ) ); Index: opensm/osm_lid_mgr.c =================================================================== --- opensm/osm_lid_mgr.c (revision 7733) +++ opensm/osm_lid_mgr.c (working copy) @@ -637,7 +637,7 @@ __osm_lid_mgr_init_sweep( osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_init_sweep: " "final free lid range [0x%x:0x%x]\n", - p_range->min_lid, p_range->max_lid); + p_range->min_lid, p_range->max_lid ); OSM_LOG_EXIT( p_mgr->p_log ); return status; @@ -757,7 +757,7 @@ __osm_lid_mgr_find_free_lid_range( /* if we run out of lids, give an error and abort! 
*/ osm_log( p_mgr->p_log, OSM_LOG_ERROR, "__osm_lid_mgr_find_free_lid_range: ERR 0307: " - "OPENSM RAN OUT OF LIDS!!!\n"); + "OPENSM RAN OUT OF LIDS!!!\n" ); CL_ASSERT( 0 ); } @@ -827,7 +827,7 @@ __osm_lid_mgr_get_port_lid( osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_get_port_lid: " "0x%016" PRIx64" matches its known lid:0x%04x\n", - guid, min_lid); + guid, min_lid ); goto Exit; } else @@ -848,7 +848,7 @@ __osm_lid_mgr_get_port_lid( osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_get_port_lid: " "0x%016" PRIx64" has no persistent lid assigned\n", - guid); + guid ); } /* if the port info carries a lid it must be lmc aligned and not mapped @@ -872,7 +872,7 @@ __osm_lid_mgr_get_port_lid( osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_get_port_lid: " "0x%016" PRIx64" lid range:[0x%x-0x%x] is free\n", - guid, *p_min_lid, *p_max_lid); + guid, *p_min_lid, *p_max_lid ); goto NewLidSet; } else @@ -881,7 +881,7 @@ __osm_lid_mgr_get_port_lid( "__osm_lid_mgr_get_port_lid: " "0x%016" PRIx64 " existing lid range:[0x%x:0x%x] is not free\n", - guid, min_lid, min_lid + num_lids - 1); + guid, min_lid, min_lid + num_lids - 1 ); } } else @@ -890,7 +890,7 @@ __osm_lid_mgr_get_port_lid( "__osm_lid_mgr_get_port_lid: " "0x%016" PRIx64 " existing lid range:[0x%x:0x%x] is not lmc aligned\n", - guid, min_lid, min_lid + num_lids - 1); + guid, min_lid, min_lid + num_lids - 1 ); } } @@ -902,7 +902,7 @@ __osm_lid_mgr_get_port_lid( osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_get_port_lid: " "0x%016" PRIx64" assigned a new lid range:[0x%x-0x%x]\n", - guid, *p_min_lid, *p_max_lid); + guid, *p_min_lid, *p_max_lid ); lid_changed = 1; NewLidSet: @@ -1339,9 +1339,9 @@ osm_lid_mgr_process_sm( { osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, "osm_lid_mgr_process_sm: " - "Invoking UI function pfn_ui_pre_lid_assign\n"); + "Invoking UI function pfn_ui_pre_lid_assign\n" ); p_mgr->p_subn->opt.pfn_ui_pre_lid_assign( - p_mgr->p_subn->opt.ui_pre_lid_assign_ctx); + 
p_mgr->p_subn->opt.ui_pre_lid_assign_ctx ); } /* Set the send_set_reqs of the p_mgr to FALSE, and Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 7733) +++ opensm/osm_pkey_mgr.c (working copy) @@ -245,7 +245,7 @@ pkey_mgr_update_peer_port( if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: " + "pkey_mgr_update_peer_port: ERR 0502: " "pkey_mgr_enforce_partition() failed to update " "node 0x%016" PRIx64 " port %u\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), @@ -272,7 +272,7 @@ pkey_mgr_update_peer_port( ret_val = TRUE; else osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: " + "pkey_mgr_update_peer_port: ERR 0503: " "pkey_mgr_update_pkey_entry() failed to update " "pkey table block %d for node 0x%016" PRIx64 " port %u\n", @@ -332,7 +332,7 @@ static boolean_t pkey_mgr_update_port( ret_val = TRUE; else osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_port: " + "pkey_mgr_update_port: ERR 0504: " "pkey_mgr_update_pkey_entry() failed to update " "pkey table block %d for node 0x%016" PRIx64 " port %u\n", block_index, @@ -393,7 +393,8 @@ osm_pkey_mgr_process( if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) { - osm_log( &p_osm->log, OSM_LOG_ERROR, "osm_pkey_mgr_process: " + osm_log( &p_osm->log, OSM_LOG_ERROR, + "osm_pkey_mgr_process: ERR 0505: " "osm_prtn_make_partitions() failed\n" ); goto _err; } Index: opensm/osm_vl_arb_rcv.c =================================================================== --- opensm/osm_vl_arb_rcv.c (revision 7733) +++ opensm/osm_vl_arb_rcv.c (working copy) @@ -211,13 +211,10 @@ osm_vla_rcv_process( */ if( !osm_physp_is_valid( p_physp ) ) { - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_VERBOSE ) ) - { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_vla_rcv_process: " - "Got invalid port number 0x%X\n", - port_num ); - } + osm_log( 
p_rcv->p_log, OSM_LOG_ERROR, + "osm_vla_rcv_process: " + "Got invalid port number 0x%X\n", + port_num ); goto Exit; } From mst at mellanox.co.il Tue Jun 6 11:11:32 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 21:11:32 +0300 Subject: [openib-general] RFC: ib_cache_event problems Message-ID: <20060606181132.GA4701@mellanox.co.il> Hello! We are seeing the following problems in ib_cache_event: 1. If a GFP_ATOMIC allocation fails, it seems that the cache won't be updated. 2. Since the cache isn't updated immediately, but by queueing a work request, it is possible for a ULP, e.g. IP over IB, to query the cache as a result of the event and get a stale value. Consider for example ipoib - in this case ipoib_pkey_dev_check_presence returns an incorrect value. We are actually seeing this happening in stress testing. Since the SM will not retry the MAD, the event won't be regenerated, so the values the ULP gets from the cache may never get updated. Suggestions: 1. The cache should create ib_update_work objects statically upon a hotplug event. 2. We need a mechanism for the cache to consume events which trigger cache updates, and delay reporting them to ULPs until after the cache is updated. Opinions? -- MST From rkuchimanchi at silverstorm.com Tue Jun 6 11:17:34 2006 From: rkuchimanchi at silverstorm.com (Ramachandra K) Date: Tue, 06 Jun 2006 23:47:34 +0530 Subject: [openib-general] Re: [PATCH] SRP : Use correct port identifier format according to target io_class In-Reply-To: References: <1149171133.7588.45.camel@Prawra.gs-lab.com> (Ramchandra K.'s message of "Thu, 01 Jun 2006 19:42:13 +0530") Message-ID: <44861416.2864.88C6C5@rkuchimanchi.silverstorm.com> > Thanks, I applied this. Thanks a lot Roland. But there was also a patch for ibsrpdm to display the IO class of the target. I am including it below for your convenience.
Regards, Ram Signed-off-by: Ramachandra K Index: userspace/srptools/src/srp-dm.c =================================================================== --- userspace/srptools/src/srp-dm.c (revision 7738) +++ userspace/srptools/src/srp-dm.c (working copy) @@ -399,6 +399,7 @@ pr_human(" vendor ID: %06x\n", ntohl(ioc_prof.vendor_id) >> 8); pr_human(" device ID: %06x\n", ntohl(ioc_prof.device_id)); pr_human(" ID: %s\n", ioc_prof.id); + pr_human(" IO class : %hx\n", ntohs(ioc_prof.io_class)); pr_human(" service entries: %d\n", ioc_prof.service_entries); for (j = 0; j < ioc_prof.service_entries; j += 4) { @@ -429,11 +430,13 @@ "ioc_guid=%016llx," "dgid=%016llx%016llx," "pkey=ffff," + "io_class=%hx," "service_id=%016llx\n", id_ext, (unsigned long long) ntohll(ioc_prof.guid), (unsigned long long) subnet_prefix, (unsigned long long) guid, + ntohs(ioc_prof.io_class), (unsigned long long) ntohll(svc_entries.service[k].id)); } } From mshefty at ichips.intel.com Tue Jun 6 11:28:48 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 06 Jun 2006 11:28:48 -0700 Subject: [openib-general] multicast questions In-Reply-To: References: Message-ID: <4485C960.50802@ichips.intel.com> Sean Hefty wrote: > Does anyone know if the following multicast configurations have been tested? > > 1. Receiving messages on the same port that they were sent, but on a different > QP. > > 2. Receiving messages on multiple QPs on the same port. These are the situations that I'm seeing. (Note that this is with new code, so it's entirely possible that the errors are in the new code.) Works: Process A1 on node A starts and joins a group. Process B1 on node B starts and joins the same group. Process B1 sends messages. process A1 receives messages. Failure 1 (loopback): Process A1 on node A starts and joins a group. Process A2 on node A starts and joins the same group. Process A2 sends messages. Process A1 does not see messages. 
Failure 2 (multiple receivers): Process A1 on node A starts and joins a group. Process A2 on node A starts and joins the same group. Process B1 on node B starts and joins the same group. Process B1 sends messages. Process A1 receives messages. Process A2 does not see messages. It appears that the multicast groups are being created successfully, and the QPs are attaching to the groups. - Sean From mshefty at ichips.intel.com Tue Jun 6 11:43:53 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 06 Jun 2006 11:43:53 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS In-Reply-To: References: Message-ID: <4485CCE9.7020807@ichips.intel.com> Arlin Davis wrote: > Here is a patch to the openib-cma provider that uses the new set_option > feature of the uCMA to adjust connect request timeout and retry values. The > defaults are a little quick for some consumers. They are now bumped up from 3 > retries to 15 and are tunable with uDAPL environment variables. Also, > included a fix to disallow any event after a disconnect event. > > You need to sync up the commit with Sean's patch for the uCMA get/set > IB_CM_REQ_OPTIONS. The RDMA CM changes for kernel and userspace have been committed to allow setting the timeout / retry values. - Sean From eitan at mellanox.co.il Tue Jun 6 11:50:10 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 6 Jun 2006 21:50:10 +0300 Subject: [openib-general] RE: [PATCH][MINOR] OpenSM: Fix inconsistent use of osm_log level Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3023687E6@mtlexch01.mtl.com> Hi Hal, Thanks for cleaning this up. I see you also cleaned up missing ":" in errors etc. Good to go from my perspective Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, June 06, 2006 8:27 PM > To: openib-general at openib.org > Cc: Eitan Zahavi > Subject: [PATCH][MINOR] OpenSM: Fix inconsistent use of osm_log level > > OpenSM: Fix inconsistent use of osm_log level > Also, some other cosmetic changes > > Signed-off-by: Hal Rosenstock > > Index: opensm/osm_pkey_rcv.c > =================================================================== > --- opensm/osm_pkey_rcv.c (revision 7733) > +++ opensm/osm_pkey_rcv.c (working copy) > @@ -200,13 +200,10 @@ osm_pkey_rcv_process( > */ > if( !osm_physp_is_valid( p_physp ) ) > { > - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_VERBOSE ) ) > - { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_pkey_rcv_process: ERR 4807: " > - "Got invalid port number 0x%X\n", > - port_num ); > - } > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "osm_pkey_rcv_process: ERR 4807: " > + "Got invalid port number 0x%X\n", > + port_num ); > goto Exit; > } > > Index: opensm/osm_sa_guidinfo_record.c > =================================================================== > --- opensm/osm_sa_guidinfo_record.c (revision 7733) > +++ opensm/osm_sa_guidinfo_record.c (working copy) > @@ -171,7 +171,7 @@ __osm_gir_rcv_new_gir( > > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_gir_rcv_new_gir: " > "New GUIDInfoRecord: lid 0x%X, block num %d\n", > cl_ntoh16( match_lid ), block_num ); > @@ -220,7 +220,7 @@ __osm_sa_gir_create_gir( > > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_sa_gir_create_gir: " > "Looking for GUIDRecord with LID: 0x%X GUID:0x%016" PRIx64 "\n", > cl_ntoh16( match_lid ), > @@ -282,7 +282,7 @@ __osm_sa_gir_create_gir( > */ > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { 
> - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_sa_gir_create_gir: " > "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", > cl_ntoh16( base_lid_ho ), > @@ -495,7 +495,7 @@ osm_gir_rcv_process( > if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && > (num_rec > 1)) { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_gir_rcv_process: " > + "osm_gir_rcv_process: ERR 5103: " > "Got more than one record for SubnAdmGet (%u)\n", > num_rec ); > osm_sa_send_error( p_rcv->p_resp, p_madw, > Index: opensm/osm_sa_vlarb_record.c > =================================================================== > --- opensm/osm_sa_vlarb_record.c (revision 7733) > +++ opensm/osm_sa_vlarb_record.c (working copy) > @@ -179,7 +179,7 @@ __osm_sa_vl_arb_create( > > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_sa_vl_arb_create: " > "New VLArbitration for: port 0x%016" PRIx64 > ", lid 0x%X, port# 0x%X Block:%u\n", > @@ -416,7 +416,7 @@ osm_vlarb_rec_rcv_process( > else > { /* port out of range */ > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_vlarb_rec_rcv_process: " > + "osm_vlarb_rec_rcv_process: ERR 2A01: " > "Given LID (%u) is out of range:%u\n", > cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl) ); > } > @@ -444,7 +444,7 @@ osm_vlarb_rec_rcv_process( > if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && > (num_rec > 1)) { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_vlarb_rec_rcv_process: " > + "osm_vlarb_rec_rcv_process: ERR 2A08: " > "Got more than one record for SubnAdmGet (%u)\n", > num_rec ); > osm_sa_send_error( p_rcv->p_resp, p_madw, > Index: opensm/osm_sa_multipath_record.c > =================================================================== > --- opensm/osm_sa_multipath_record.c (revision 7733) > +++ opensm/osm_sa_multipath_record.c (working copy) > @@ -1281,7 +1281,8 @@ __osm_mpr_rcv_process_pairs( > max_paths - total_paths, > 
comp_mask, p_list ); > total_paths += num_paths; > - osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_mpr_rcv_process_pairs: " > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_mpr_rcv_process_pairs: " > "%d paths %d total paths %d max paths\n", > num_paths, total_paths, max_paths ); > /* Just take first NumbPaths found */ > @@ -1468,7 +1469,8 @@ osm_mpr_rcv_process( > if ( sa_status != IB_SA_MAD_STATUS_SUCCESS || !nsrc || !ndest ) > { > if ( sa_status == IB_SA_MAD_STATUS_SUCCESS && ( !nsrc || !ndest ) ) > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_mpr_rcv_process_cb: ERR > 4512: " > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "osm_mpr_rcv_process_cb: ERR 4512: " > "__osm_mpr_rcv_get_end_points failed, not enough GIDs " > "(nsrc %d ndest %d)\n", > nsrc, ndest); > Index: opensm/osm_subnet.c > =================================================================== > --- opensm/osm_subnet.c (revision 7733) > +++ opensm/osm_subnet.c (working copy) > @@ -250,7 +250,7 @@ osm_get_gid_by_mad_addr( > if ( p_gid == NULL ) > { > osm_log( p_log, OSM_LOG_ERROR, > - "osm_get_gid_by_mad_addr: ERR 7505 " > + "osm_get_gid_by_mad_addr: ERR 7505: " > "Provided output GID is NULL\n"); > return(IB_INVALID_PARAMETER); > } > @@ -281,7 +281,7 @@ osm_get_gid_by_mad_addr( > { > /* The dest_lid is not in the subnet table - this is an error */ > osm_log( p_log, OSM_LOG_ERROR, > - "osm_get_gid_by_mad_addr: ERR 7501 " > + "osm_get_gid_by_mad_addr: ERR 7501: " > "LID is out of range: 0x%X\n", > cl_ntoh16(p_mad_addr->dest_lid) > ); > @@ -316,7 +316,7 @@ osm_get_physp_by_mad_addr( > { > /* The port is not in the port_lid table - this is an error */ > osm_log( p_log, OSM_LOG_ERROR, > - "osm_get_physp_by_mad_addr: ERR 7502 " > + "osm_get_physp_by_mad_addr: ERR 7502: " > "Cannot locate port object by lid: 0x%X\n", > cl_ntoh16(p_mad_addr->dest_lid) > ); > @@ -329,7 +329,7 @@ osm_get_physp_by_mad_addr( > { > /* The dest_lid is not in the subnet table - this is an error */ > osm_log( p_log, OSM_LOG_ERROR, 
> - "osm_get_physp_by_mad_addr: ERR 7503 " > + "osm_get_physp_by_mad_addr: ERR 7503: " > "Lid is out of range: 0x%X\n", > cl_ntoh16(p_mad_addr->dest_lid) > ); > @@ -365,7 +365,7 @@ osm_get_port_by_mad_addr( > { > /* The dest_lid is not in the subnet table - this is an error */ > osm_log( p_log, OSM_LOG_ERROR, > - "osm_get_port_by_mad_addr: ERR 7504 " > + "osm_get_port_by_mad_addr: ERR 7504: " > "Lid is out of range: 0x%X\n", > cl_ntoh16(p_mad_addr->dest_lid) > ); > Index: opensm/osm_sa_lft_record.c > =================================================================== > --- opensm/osm_sa_lft_record.c (revision 7733) > +++ opensm/osm_sa_lft_record.c (working copy) > @@ -510,7 +510,7 @@ osm_lftr_rcv_process( > { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > "osm_lftr_rcv_process: ERR 4411: " > - "osm_vendor_send. status = %s\n", > + "osm_vendor_send status = %s\n", > ib_get_err_str(status)); > goto Exit; > } > Index: opensm/osm_pkey_rcv_ctrl.c > =================================================================== > --- opensm/osm_pkey_rcv_ctrl.c (revision 7733) > +++ opensm/osm_pkey_rcv_ctrl.c (working copy) > @@ -110,7 +110,7 @@ osm_pkey_rcv_ctrl_init( > { > osm_log( p_log, OSM_LOG_ERROR, > "osm_pkey_rcv_ctrl_init: ERR 4901: " > - "Dispatcher registration failed.\n" ); > + "Dispatcher registration failed\n" ); > status = IB_INSUFFICIENT_RESOURCES; > goto Exit; > } > Index: opensm/osm_sa_service_record.c > =================================================================== > --- opensm/osm_sa_service_record.c (revision 7733) > +++ opensm/osm_sa_service_record.c (working copy) > @@ -1115,7 +1115,7 @@ osm_sr_rcv_process( > default: > osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "osm_sr_rcv_process: " > - "Bad Method (%s)\n", ib_get_sa_method_str( p_sa_mad->method )); > + "Bad Method (%s)\n", ib_get_sa_method_str( p_sa_mad->method ) ); > osm_sa_send_error( p_rcv->p_resp, p_madw, sa_status ); > break; > } > Index: opensm/osm_sa_portinfo_record.c > 
=================================================================== > --- opensm/osm_sa_portinfo_record.c (revision 7733) > +++ opensm/osm_sa_portinfo_record.c (working copy) > @@ -168,7 +168,7 @@ __osm_pir_rcv_new_pir( > > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_pir_rcv_new_pir: " > "New PortInfoRecord: port 0x%016" PRIx64 > ", lid 0x%X, port# 0x%X\n", > @@ -678,7 +678,7 @@ osm_pir_rcv_process( > else > { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_pir_rcv_process: " > + "osm_pir_rcv_process: ERR 2101: " > "Given LID (%u) is out of range:%u\n", > cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl)); > } > @@ -694,7 +694,7 @@ osm_pir_rcv_process( > else > { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_pir_rcv_process: " > + "osm_pir_rcv_process: ERR 2103: " > "Given LID (%u) is out of range:%u\n", > cl_ntoh16(p_pi->base_lid), cl_ptr_vector_get_size(p_tbl)); > } > @@ -721,7 +721,7 @@ osm_pir_rcv_process( > if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && > (num_rec > 1)) { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_pir_rcv_process: " > + "osm_pir_rcv_process: ERR 2108: " > "Got more than one record for SubnAdmGet (%u)\n", > num_rec ); > osm_sa_send_error( p_rcv->p_resp, p_madw, > @@ -852,7 +852,7 @@ osm_pir_rcv_process( > { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > "osm_pir_rcv_process: ERR 2107: " > - "osm_vendor_send. 
status = %s\n", > + "osm_vendor_send status = %s\n", > ib_get_err_str(status)); > goto Exit; > } > Index: opensm/osm_sa_pkey_record.c > =================================================================== > --- opensm/osm_sa_pkey_record.c (revision 7733) > +++ opensm/osm_sa_pkey_record.c (working copy) > @@ -169,7 +169,7 @@ __osm_sa_pkey_create( > > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_sa_pkey_create: " > "New P_Key table for: port 0x%016" PRIx64 > ", lid 0x%X, port# 0x%X Block:%u\n", > @@ -432,7 +432,7 @@ osm_pkey_rec_rcv_process( > else > { /* port out of range */ > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_pkey_rec_rcv_process: " > + "osm_pkey_rec_rcv_process: ERR 4609: " > "Given LID (%u) is out of range:%u\n", > cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl)); > } > @@ -460,7 +460,7 @@ osm_pkey_rec_rcv_process( > if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && > (num_rec > 1)) { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_pkey_rec_rcv_process: " > + "osm_pkey_rec_rcv_process: ERR 460A: " > "Got more than one record for SubnAdmGet (%u)\n", > num_rec ); > osm_sa_send_error( p_rcv->p_resp, p_madw, > Index: opensm/osm_inform.c > =================================================================== > --- opensm/osm_inform.c (revision 7733) > +++ opensm/osm_inform.c (working copy) > @@ -283,7 +283,7 @@ osm_infr_insert_to_db( > "Inserting a new InformInfo Record into Database\n"); > osm_log( p_log, OSM_LOG_DEBUG, > "osm_infr_insert_to_db: " > - "Dump before insertion (size : %d) : \n", > + "Dump before insertion (size : %d)\n", > cl_qlist_count(&p_subn->sa_infr_list) ); > __dump_all_informs(p_subn, p_log); > > @@ -295,7 +295,7 @@ osm_infr_insert_to_db( > > osm_log( p_log, OSM_LOG_DEBUG, > "osm_infr_insert_to_db: " > - "Dump after insertion (size : %d) : \n", > + "Dump after insertion (size : %d)\n", > 
cl_qlist_count(&p_subn->sa_infr_list) ); > __dump_all_informs(p_subn, p_log); > OSM_LOG_EXIT( p_log ); > Index: opensm/osm_sa_slvl_record.c > =================================================================== > --- opensm/osm_sa_slvl_record.c (revision 7733) > +++ opensm/osm_sa_slvl_record.c (working copy) > @@ -179,7 +179,7 @@ __osm_sa_slvl_create( > > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_sa_slvl_create: " > "New SLtoVL Map for: OUT port 0x%016" PRIx64 > ", lid 0x%X, port# 0x%X to In Port:%u\n", > @@ -395,7 +395,7 @@ osm_slvl_rec_rcv_process( > else > { /* port out of range */ > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_slvl_rec_rcv_process: " > + "osm_slvl_rec_rcv_process: ERR 2601: " > "Given LID (%u) is out of range:%u\n", > cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl)); > } > @@ -423,7 +423,7 @@ osm_slvl_rec_rcv_process( > if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && > (num_rec > 1)) { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_slvl_rec_rcv_process: " > + "osm_slvl_rec_rcv_process: ERR 2607: " > "Got more than one record for SubnAdmGet (%u)\n", > num_rec ); > osm_sa_send_error( p_rcv->p_resp, p_madw, > Index: opensm/osm_mcast_mgr.c > =================================================================== > --- opensm/osm_mcast_mgr.c (revision 7738) > +++ opensm/osm_mcast_mgr.c (working copy) > @@ -1130,7 +1130,7 @@ osm_mcast_mgr_process_single( > p_sw_tbl = &p_mgr->p_subn->sw_guid_tbl; > mlid_ho = cl_ntoh16( mlid ); > > - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) > + if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) > { > osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > "osm_mcast_mgr_process_single: " > @@ -1249,7 +1249,7 @@ osm_mcast_mgr_process_single( > { > if( join_state & IB_JOIN_STATE_SEND_ONLY ) > { > - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) > + if( osm_log_is_active( p_mgr->p_log, 
OSM_LOG_DEBUG ) ) > { > osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > "osm_mcast_mgr_process_single: " > @@ -1269,7 +1269,7 @@ osm_mcast_mgr_process_single( > } > else > { > - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) > + if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) > { > osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > "osm_mcast_mgr_process_single: " > Index: opensm/osm_trap_rcv.c > =================================================================== > --- opensm/osm_trap_rcv.c (revision 7733) > +++ opensm/osm_trap_rcv.c (working copy) > @@ -678,7 +678,7 @@ __osm_trap_rcv_process_sm( > OSM_LOG_ENTER( p_rcv->p_log, __osm_trap_rcv_process_sm ); > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "__osm_trap_rcv_process_sm: " > + "__osm_trap_rcv_process_sm: ERR 3807: " > "This function is not supported yet\n"); > > OSM_LOG_EXIT( p_rcv->p_log ); > @@ -696,7 +696,7 @@ __osm_trap_rcv_process_response( > OSM_LOG_ENTER( p_rcv->p_log, __osm_trap_rcv_process_response ); > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "__osm_trap_rcv_process_response: " > + "__osm_trap_rcv_process_response: ERR 3808: " > "This function is not supported yet\n"); > > OSM_LOG_EXIT( p_rcv->p_log ); > Index: opensm/osm_sa_informinfo.c > =================================================================== > --- opensm/osm_sa_informinfo.c (revision 7733) > +++ opensm/osm_sa_informinfo.c (working copy) > @@ -357,14 +357,15 @@ osm_infr_rcv_process_set_method( > p_recvd_inform_info = > (ib_inform_info_t*)ib_sa_mad_get_payload_ptr( p_sa_mad ); > > - /* the dump routine is not defined yet > - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > - { > - osm_dump_inform_info_record( p_rcv->p_log, > - p_recvd_service_rec, > - OSM_LOG_DEBUG ); > - } > - */ > +#if 0 > + /* the dump routine is not implemented yet */ > + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > + { > + osm_dump_inform_info_record( p_rcv->p_log, > + p_recvd_inform_info, > + OSM_LOG_DEBUG ); > + } > +#endif > > /* Grab the 
lock */ > cl_plock_excl_acquire( p_rcv->p_lock ); > Index: opensm/osm_ucast_updn.c > =================================================================== > --- opensm/osm_ucast_updn.c (revision 7733) > +++ opensm/osm_ucast_updn.c (working copy) > @@ -879,7 +879,7 @@ osm_subn_calc_up_down_min_hop_table( > if (num_guids == 0) > { > osm_log(&(osm.log), OSM_LOG_ERROR, > - "osm_subn_calc_up_down_min_hop_table: " > + "osm_subn_calc_up_down_min_hop_table: ERR AA0A: " > "No guids were given or number of guids is 0\n"); > return 1; > } > Index: opensm/osm_sa_node_record.c > =================================================================== > --- opensm/osm_sa_node_record.c (revision 7733) > +++ opensm/osm_sa_node_record.c (working copy) > @@ -161,7 +161,7 @@ __osm_nr_rcv_new_nr( > > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_nr_rcv_new_nr: " > "New NodeRecord: node 0x%016" PRIx64 > "\n\t\t\t\tport 0x%016" PRIx64 ", lid 0x%X\n", > @@ -211,7 +211,7 @@ __osm_nr_rcv_create_nr( > > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_nr_rcv_create_nr: " > "Looking for NodeRecord with LID: 0x%X GUID:0x%016" PRIx64 "\n", > cl_ntoh16( match_lid ), > @@ -257,7 +257,7 @@ __osm_nr_rcv_create_nr( > */ > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_nr_rcv_create_nr: " > "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", > cl_ntoh16( base_lid_ho ), > @@ -326,7 +326,7 @@ __osm_nr_rcv_by_comp_mask( > */ > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_nr_rcv_by_comp_mask: " > "Looking for node 0x%016" PRIx64 > ", found 0x%016" PRIx64 "\n", > @@ -493,7 +493,7 @@ osm_nr_rcv_process( > */ > 
if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1) ) { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_nr_rcv_process: " > + "osm_nr_rcv_process: ERR 1D03: " > "Got more than one record for SubnAdmGet (%u)\n", > num_rec ); > osm_sa_send_error( p_rcv->p_resp, p_madw, > Index: opensm/osm_sa_link_record.c > =================================================================== > --- opensm/osm_sa_link_record.c (revision 7733) > +++ opensm/osm_sa_link_record.c (working copy) > @@ -312,7 +312,7 @@ __osm_lr_rcv_get_physp_link( > { > osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_lr_rcv_get_physp_link: " > - "Acquiring link record.\n" > + "Acquiring link record\n" > "\t\t\t\tsrc port 0x%" PRIx64 " (port 0x%X)" > ", dest port 0x%" PRIx64 " (port 0x%X)\n", > cl_ntoh64( osm_physp_get_port_guid( p_src_physp ) ), > @@ -606,7 +606,7 @@ __osm_lr_rcv_respond( > if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && > (num_rec > 1)) { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "__osm_lr_rcv_respond: " > + "__osm_lr_rcv_respond: ERR 1806: " > "Got more than one record for SubnAdmGet (%u)\n", > num_rec ); > osm_sa_send_error( p_rcv->p_resp, p_madw, > Index: opensm/osm_slvl_map_rcv.c > =================================================================== > --- opensm/osm_slvl_map_rcv.c (revision 7733) > +++ opensm/osm_slvl_map_rcv.c (working copy) > @@ -211,13 +211,10 @@ osm_slvl_rcv_process( > */ > if( !osm_physp_is_valid( p_physp ) ) > { > - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_VERBOSE ) ) > - { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_slvl_rcv_process: " > - "Got invalid port number 0x%X\n", > - out_port_num ); > - } > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "osm_slvl_rcv_process: " > + "Got invalid port number 0x%X\n", > + out_port_num ); > goto Exit; > } > > Index: opensm/osm_sa_link_record_ctrl.c > =================================================================== > --- opensm/osm_sa_link_record_ctrl.c (revision 7733) > +++ 
opensm/osm_sa_link_record_ctrl.c (working copy) > @@ -116,7 +116,7 @@ osm_lr_rcv_ctrl_init( > { > osm_log( p_log, OSM_LOG_ERROR, > "osm_lr_rcv_ctrl_init: ERR 1901: " > - "Dispatcher registration failed.\n" ); > + "Dispatcher registration failed\n" ); > status = IB_INSUFFICIENT_RESOURCES; > goto Exit; > } > Index: opensm/osm_qos.c > =================================================================== > --- opensm/osm_qos.c (revision 7733) > +++ opensm/osm_qos.c (working copy) > @@ -279,7 +279,8 @@ static ib_api_status_t qos_physp_setup(o > /* setup vl high limit */ > status = vl_high_limit_update(p_req, p, qcfg); > if (status != IB_SUCCESS) { > - osm_log(p_log, OSM_LOG_ERROR, "qos_physp_setup: " > + osm_log(p_log, OSM_LOG_ERROR, > + "qos_physp_setup: ERR 6201 : " > "failed to update VLHighLimit " > "for port %" PRIx64 " #%d\n", > cl_ntoh64(p->port_guid), port_num); > @@ -289,7 +290,8 @@ static ib_api_status_t qos_physp_setup(o > /* setup VLArbitration */ > status = vlarb_update(p_req, p, port_num, qcfg); > if (status != IB_SUCCESS) { > - osm_log(p_log, OSM_LOG_ERROR, "qos_physp_setup: " > + osm_log(p_log, OSM_LOG_ERROR, > + "qos_physp_setup: ERR 6202 : " > "failed to update VLArbitration tables " > "for port %" PRIx64 " #%d\n", > cl_ntoh64(p->port_guid), port_num); > @@ -299,7 +301,8 @@ static ib_api_status_t qos_physp_setup(o > /* setup Sl2VL tables */ > status = sl2vl_update(p_req, p, port_num, qcfg); > if (status != IB_SUCCESS) { > - osm_log(p_log, OSM_LOG_ERROR, "qos_physp_setup: " > + osm_log(p_log, OSM_LOG_ERROR, > + "qos_physp_setup: ERR 6203 : " > "failed to update SL2VLMapping tables " > "for port %" PRIx64 " #%d\n", > cl_ntoh64(p->port_guid), port_num); > Index: opensm/osm_sa_mcmember_record.c > =================================================================== > --- opensm/osm_sa_mcmember_record.c (revision 7733) > +++ opensm/osm_sa_mcmember_record.c (working copy) > @@ -2286,7 +2286,7 @@ osm_mcmr_query_mgrp(IN osm_mcmr_recv_t* > { > osm_log( 
p_rcv->p_log, OSM_LOG_ERROR, > "osm_mcmr_query_mgrp: ERR 1B17: " > - "osm_vendor_send. status = %s\n", > + "osm_vendor_send status = %s\n", > ib_get_err_str(status) ); > goto Exit; > } > Index: opensm/osm_drop_mgr.c > =================================================================== > --- opensm/osm_drop_mgr.c (revision 7733) > +++ opensm/osm_drop_mgr.c (working copy) > @@ -512,7 +512,7 @@ __osm_drop_mgr_check_node( > > if ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) > { > - osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, > + osm_log( p_mgr->p_log, OSM_LOG_ERROR, > "__osm_drop_mgr_check_node: ERR 0107: " > "Node 0x%016" PRIx64 " is not a switch node\n", > cl_ntoh64( node_guid ) ); > Index: opensm/osm_lid_mgr.c > =================================================================== > --- opensm/osm_lid_mgr.c (revision 7733) > +++ opensm/osm_lid_mgr.c (working copy) > @@ -637,7 +637,7 @@ __osm_lid_mgr_init_sweep( > osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > "__osm_lid_mgr_init_sweep: " > "final free lid range [0x%x:0x%x]\n", > - p_range->min_lid, p_range->max_lid); > + p_range->min_lid, p_range->max_lid ); > > OSM_LOG_EXIT( p_mgr->p_log ); > return status; > @@ -757,7 +757,7 @@ __osm_lid_mgr_find_free_lid_range( > /* if we run out of lids, give an error and abort! 
*/ > osm_log( p_mgr->p_log, OSM_LOG_ERROR, > "__osm_lid_mgr_find_free_lid_range: ERR 0307: " > - "OPENSM RAN OUT OF LIDS!!!\n"); > + "OPENSM RAN OUT OF LIDS!!!\n" ); > CL_ASSERT( 0 ); > } > > @@ -827,7 +827,7 @@ __osm_lid_mgr_get_port_lid( > osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > "__osm_lid_mgr_get_port_lid: " > "0x%016" PRIx64" matches its known lid:0x%04x\n", > - guid, min_lid); > + guid, min_lid ); > goto Exit; > } > else > @@ -848,7 +848,7 @@ __osm_lid_mgr_get_port_lid( > osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > "__osm_lid_mgr_get_port_lid: " > "0x%016" PRIx64" has no persistent lid assigned\n", > - guid); > + guid ); > } > > /* if the port info carries a lid it must be lmc aligned and not mapped > @@ -872,7 +872,7 @@ __osm_lid_mgr_get_port_lid( > osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > "__osm_lid_mgr_get_port_lid: " > "0x%016" PRIx64" lid range:[0x%x-0x%x] is free\n", > - guid, *p_min_lid, *p_max_lid); > + guid, *p_min_lid, *p_max_lid ); > goto NewLidSet; > } > else > @@ -881,7 +881,7 @@ __osm_lid_mgr_get_port_lid( > "__osm_lid_mgr_get_port_lid: " > "0x%016" PRIx64 > " existing lid range:[0x%x:0x%x] is not free\n", > - guid, min_lid, min_lid + num_lids - 1); > + guid, min_lid, min_lid + num_lids - 1 ); > } > } > else > @@ -890,7 +890,7 @@ __osm_lid_mgr_get_port_lid( > "__osm_lid_mgr_get_port_lid: " > "0x%016" PRIx64 > " existing lid range:[0x%x:0x%x] is not lmc aligned\n", > - guid, min_lid, min_lid + num_lids - 1); > + guid, min_lid, min_lid + num_lids - 1 ); > } > } > > @@ -902,7 +902,7 @@ __osm_lid_mgr_get_port_lid( > osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > "__osm_lid_mgr_get_port_lid: " > "0x%016" PRIx64" assigned a new lid range:[0x%x-0x%x]\n", > - guid, *p_min_lid, *p_max_lid); > + guid, *p_min_lid, *p_max_lid ); > lid_changed = 1; > > NewLidSet: > @@ -1339,9 +1339,9 @@ osm_lid_mgr_process_sm( > { > osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, > "osm_lid_mgr_process_sm: " > - "Invoking UI function pfn_ui_pre_lid_assign\n"); > + "Invoking UI function 
pfn_ui_pre_lid_assign\n" ); > p_mgr->p_subn->opt.pfn_ui_pre_lid_assign( > - p_mgr->p_subn->opt.ui_pre_lid_assign_ctx); > + p_mgr->p_subn->opt.ui_pre_lid_assign_ctx ); > } > > /* Set the send_set_reqs of the p_mgr to FALSE, and > Index: opensm/osm_pkey_mgr.c > =================================================================== > --- opensm/osm_pkey_mgr.c (revision 7733) > +++ opensm/osm_pkey_mgr.c (working copy) > @@ -245,7 +245,7 @@ pkey_mgr_update_peer_port( > if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) > { > osm_log( p_log, OSM_LOG_ERROR, > - "pkey_mgr_update_peer_port: " > + "pkey_mgr_update_peer_port: ERR 0502: " > "pkey_mgr_enforce_partition() failed to update " > "node 0x%016" PRIx64 " port %u\n", > cl_ntoh64( osm_node_get_node_guid( p_node ) ), > @@ -272,7 +272,7 @@ pkey_mgr_update_peer_port( > ret_val = TRUE; > else > osm_log( p_log, OSM_LOG_ERROR, > - "pkey_mgr_update_peer_port: " > + "pkey_mgr_update_peer_port: ERR 0503: " > "pkey_mgr_update_pkey_entry() failed to update " > "pkey table block %d for node 0x%016" PRIx64 > " port %u\n", > @@ -332,7 +332,7 @@ static boolean_t pkey_mgr_update_port( > ret_val = TRUE; > else > osm_log( p_log, OSM_LOG_ERROR, > - "pkey_mgr_update_port: " > + "pkey_mgr_update_port: ERR 0504: " > "pkey_mgr_update_pkey_entry() failed to update " > "pkey table block %d for node 0x%016" PRIx64 " port %u\n", > block_index, > @@ -393,7 +393,8 @@ osm_pkey_mgr_process( > > if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) > { > - osm_log( &p_osm->log, OSM_LOG_ERROR, "osm_pkey_mgr_process: " > + osm_log( &p_osm->log, OSM_LOG_ERROR, > + "osm_pkey_mgr_process: ERR 0505: " > "osm_prtn_make_partitions() failed\n" ); > goto _err; > } > Index: opensm/osm_vl_arb_rcv.c > =================================================================== > --- opensm/osm_vl_arb_rcv.c (revision 7733) > +++ opensm/osm_vl_arb_rcv.c (working copy) > @@ -211,13 +211,10 @@ osm_vla_rcv_process( > */ > if( 
!osm_physp_is_valid( p_physp ) ) > { > - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_VERBOSE ) ) > - { > - osm_log( p_rcv->p_log, OSM_LOG_ERROR, > - "osm_vla_rcv_process: " > - "Got invalid port number 0x%X\n", > - port_num ); > - } > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "osm_vla_rcv_process: " > + "Got invalid port number 0x%X\n", > + port_num ); > goto Exit; > } > > From Thomas.Talpey at netapp.com Tue Jun 6 12:07:34 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 06 Jun 2006 15:07:34 -0400 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: References: Message-ID: <7.0.1.0.2.20060606131933.04267008@netapp.com> Todd, thanks for the set-up. I'm really glad we're having this discussion! Let me give an NFS/RDMA example to illustrate why this upper layer, at least, doesn't want the HCA doing its flow control, or resource management. NFS/RDMA is a credit-based protocol which allows many operations in progress at the server. Let's say the client is currently running with an RPC slot table of 100 requests (a typical value). Of these requests, some workload-specific percentage will be reads, writes, or metadata. All NFS operations consist of one send from client to server, some number of RDMA writes (for NFS reads) or RDMA reads (for NFS writes), then terminated with one send from server to client. The number of RDMA read or write operations per NFS op depends on the amount of data being read or written, and also the memory registration strategy in use on the client. The highest-performing such strategy is an all-physical one, which results in one RDMA-able segment per physical page. NFS r/w requests are, by default, 32KB, or 8 pages typical. So, typically 8 RDMA requests (read or write) are the result. To illustrate, let's say the client is processing a multi-threaded workload, with (say) 50% reads, 20% writes, and 30% metadata such as lookup and getattr. A kernel build, for example. 
Therefore, of our 100 active operations, 50 are reads for 32KB each, 20 are writes of 32KB, and 30 are metadata (non-RDMA). To the server, this results in 100 requests, 100 replies, 400 RDMA writes, and 160 RDMA Reads. Of course, these overlap heavily due to the widely differing latency of each op and the highly distributed arrival times. But, for the example this is a snapshot of current load.

The latency of the metadata operations is quite low, because lookup and getattr are acting on what is effectively cached data. The reads and writes, however, are much longer, because they reference the filesystem. When disk queues are deep, they can take many ms.

Imagine what happens if the client's IRD is 4 and the server ignores its local ORD. As soon as a write begins execution, the server posts 8 RDMA Reads to fetch the client's write data. The first 4 RDMA Reads are sent, the fifth stalls, and stalls the send queue! Even when three RDMA Reads complete, the queue remains stalled; it doesn't unblock until the fourth is done and all the RDMA Reads have been initiated.

But, what just happened to all the other server send traffic? All those metadata replies, and other reads which completed? They're stuck, waiting for that one write request. In my example, these number 99 NFS ops, i.e. 654 WRs! All for one NFS write! The client operation stream effectively became single threaded.

What good is the "rapid initiation of RDMA Reads" you describe in the face of this? Yes, there are many arcane and resource-intensive ways around it. But the simplest by far is to count the RDMA Reads outstanding, and for the *upper layer* to honor ORD, not the HCA. Then, the send queue never blocks, and the operation streams never lose parallelism. This is what our NFS server does.

As to the depth of IRD, this is a different calculation: it's a DelayxBandwidth of the RDMA Read stream. 4 is good for local, low latency connections.
But over a complicated switch infrastructure, or heaven forbid a dark fiber long link, I guarantee it will cause a bottleneck. This isn't an issue except for operations that care, but it is certainly detectable. I would like to see if a pure RDMA Read stream can fully utilize a typical IB fabric, and how much headroom an IRD of 4 provides. Not much, I predict. Closing the connection if IRD is "insufficient to meet goals" isn't a good answer, IMO. How does that benefit interoperability? Thanks for the opportunity to spout off again. Comments welcome! Tom. At 12:43 PM 6/6/2006, Rimmer, Todd wrote: > > >> Talpey, Thomas >> Sent: Tuesday, June 06, 2006 10:49 AM >> >> At 10:40 AM 6/6/2006, Roland Dreier wrote: >> > Thomas> This is the difference between "may" and "must". The >value >> > Thomas> is provided, but I don't see anything in the spec that >> > Thomas> makes a requirement on its enforcement. Table 107 says >the >> > Thomas> consumer can query it, that's about as close as it >> > Thomas> comes. There's some discussion about CM exchange too. >> > >> >This seems like a very strained interpretation of the spec. For >> >> I don't see how strained has anything to do with it. It's not saying >> anything >> either way. So, a legal implementation can make either choice. We're >> talking about the spec! >> >> But, it really doesn't matter. The point is, an upper layer should be >> paying >> attention to the number of RDMA Reads it posts, or else suffer either >the >> queue-stalling or connection-failing consequences. Bad stuff either >way. >> >> Tom. > >Somewhere beneath this discussion is a bug in the application or IB >stack. I'm not sure which "may" in the spec you are referring to, but >the "may"s I have found all are for cases where the responder might >support only 1 outstanding request. In all cases the negotiation >protocol must be followed and the requestor is not allowed to exceed the >negotiated limit. 
>
>The mechanism should be:
>client queries its local HCA and determines responder resources (eg.
>number of concurrent outstanding RDMA reads on the wire from the remote
>end where this end will respond with the read data) and initiator depth
>(eg. number of concurrent outstanding RDMA reads which this end can
>initiate as the requestor).
>
>client puts the above information in the CM REQ.
>
>server similarly gets its information from its local CA and negotiates
>down the values to the MIN of each side (REP.InitiatorDepth =
>MIN(REQ.ResponderResources, server's local CA's Initiator depth);
>REP.ResponderResources = MIN(REQ.InitiatorDepth, server's local CA's
>responder resources). If server does not support RDMA Reads, it can
>REJ.
>
>If client decides the negotiated values are insufficient to meet its
>goals, it can disconnect.
>
>Each side sets its QP parameters via modify QP appropriately. Note they
>too will be mirror images of each other:
>client:
>QP.Max RDMA Reads as Initiator = REP.ResponderResources
>QP.Max RDMA reads as responder = REP.InitiatorDepth
>
>server:
>QP.Max RDMA Reads as responder = REP.ResponderResources
>QP.Max RDMA reads as initiator = REP.InitiatorDepth
>
>We have done a lot of high stress RDMA Read traffic with Mellanox HCAs
>and provided the above negotiation is followed, we have seen no issues.
>Note however that by default a Mellanox HCA typically reports a large
>InitiatorDepth (128) and a modest ResponderResources (4-8). Hence when
>I hear that Responder Resources must be grown to 128 for some
>application to reliably work, it implies the negotiation I outlined
>above is not being followed.
>
>Note that the ordering rules in table 76 of IBTA 1.2 show how reads and
>writes on a send queue are ordered. There are many cases where an op can
>pass an outstanding RDMA read, hence it is not always bad to queue extra
>RDMA reads. If needed, the Fence can be sent to force order.
>For many apps, it's going to be better to get the items onto queue and
>let the QP handle the outstanding reads cases rather than have the app
>add a level of queuing for this purpose. Letting the HCA do the queuing
>will allow for a more rapid initiation of subsequent reads.
>
>Todd Rimmer

From rdreier at cisco.com Tue Jun 6 13:21:13 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 06 Jun 2006 13:21:13 -0700
Subject: [openib-general] Re: [PATCH] SRP : Use correct port identifier format according to target io_class
In-Reply-To: <44861416.2864.88C6C5@rkuchimanchi.silverstorm.com> ( Ramachandra K.'s message of "Tue, 06 Jun 2006 23:47:34 +0530")
References: <1149171133.7588.45.camel@Prawra.gs-lab.com> <44861416.2864.88C6C5@rkuchimanchi.silverstorm.com>
Message-ID: 

Ramachandra> Thanks a lot Roland. But there was also a patch for
Ramachandra> ibsrpdm to display the IO class of the target. I am
Ramachandra> including it below for your convenience.

Yes, I still have this in my queue. I need to rewrite this because I don't think ibsrpdm should generate io_class options for rev. 16a targets, since that will break old kernels for no reason. But I will apply it soon.

- R.

From rdreier at cisco.com Tue Jun 6 13:23:58 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 06 Jun 2006 13:23:58 -0700
Subject: [openib-general] Re: RFC: ib_cache_event problems
In-Reply-To: <20060606181132.GA4701@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 6 Jun 2006 21:11:32 +0300")
References: <20060606181132.GA4701@mellanox.co.il>
Message-ID: 

> 1. Cache should create ib_update_work objects statically upon hotplug event.

Seems reasonable, since multiple pending cache update events can be safely coalesced into one.

> 2. Need a mechanism for cache to consume events which trigger cache updates,
> and delay reporting them to ULPs until after cache is updated.

This seems like overkill to me.
And I don't see how to avoid GFP_ATOMIC allocations that might fail, since the cache module would need to maintain an arbitrary-length queue of pending events. - R. From mst at mellanox.co.il Tue Jun 6 13:31:03 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 23:31:03 +0300 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: References: Message-ID: <20060606203103.GF4719@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: RFC: ib_cache_event problems > > > 1. Cache should create ib_update_work objects statically upon hotplug event. > > Seems reasonable, since multiple pending cache update events can be > safely coalesced into one. > > > 2. Need a mechanism for cache to consume events which trigger cache updates, > > and delay reporting them to ULPs until after cache is updated. > > This seems like overkill to me. How then can we solve the problem of IPoIB querying the cache as a result of an event, and getting a stale value? Note we are actually seeing this in practice when changing pkeys. > And I don't see how to avoid > GFP_ATOMIC allocations that might fail, since the cache module would > need to maintain an arbitrary-length queue of pending events. IMO order of events is typically not important, so we only need to handle up to 6 different events in some kind of bitmask. -- MST From mst at mellanox.co.il Tue Jun 6 13:35:38 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 23:35:38 +0300 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: References: Message-ID: <20060606203538.GG4719@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: RFC: ib_cache_event problems > > > 1. Cache should create ib_update_work objects statically upon hotplug event. > > Seems reasonable, since multiple pending cache update events can be > safely coalesced into one. > > > 2. 
Need a mechanism for cache to consume events which trigger cache updates,
> > and delay reporting them to ULPs until after cache is updated.
>
> This seems like overkill to me. And I don't see how to avoid
> GFP_ATOMIC allocations that might fail, since the cache module would
> need to maintain an arbitrary-length queue of pending events.

Hmm. Thinking about it some more - how about generating the events from a mad thread in core rather than from provider? Then this would be thread context so cache could simply perform updates inline in event handler.

--
MST

From rdreier at cisco.com Tue Jun 6 13:40:51 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 06 Jun 2006 13:40:51 -0700
Subject: [openib-general] Re: RFC: ib_cache_event problems
In-Reply-To: <20060606203103.GF4719@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 6 Jun 2006 23:31:03 +0300")
References: <20060606203103.GF4719@mellanox.co.il>
Message-ID: 

Michael> How then can we solve the problem of IPoIB querying the
Michael> cache as a result of an event, and getting a stale value?
Michael> Note we are actually seeing this in practice when
Michael> changing pkeys.

It doesn't seem like a severe problem to me -- IPoIB will just check again in another second, right? The whole intention of the cache interface is that it should only be used when a stale value is not fatal. So if this isn't good enough for IPoIB, then it should just query the P_Key table directly. But of course even that could return stale values, since there's no guarantee that the P_Key table doesn't change immediately after the query operation.

Michael> IMO order of events is typically not important, so we
Michael> only need to handle up to 6 different events in some kind
Michael> of bitmask.

This seems like a strong statement -- certainly there's a big difference between the sequence "port active" then "port error" vs. "port error" then "port active".
Also coalescing events means that the sequences "port error", "port active", "port error" vs. just "port error", "port active" can't be distinguished. So I think the proposed cure may be worse than the disease here. - R. From rdreier at cisco.com Tue Jun 6 13:42:48 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 06 Jun 2006 13:42:48 -0700 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: <20060606203538.GG4719@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 6 Jun 2006 23:35:38 +0300") References: <20060606203538.GG4719@mellanox.co.il> Message-ID: Michael> Hmm. Thinking about it some more - how about generating Michael> the events from a mad thread in core rather than from Michael> povider? Then this would be thread context so cache Michael> could simply perform updates inline in event handler. You have the same problem with allocating storage in atomic context for an arbitrary-length queue of events. - R. From mst at mellanox.co.il Tue Jun 6 13:43:54 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 23:43:54 +0300 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: References: Message-ID: <20060606204354.GH4719@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: RFC: ib_cache_event problems > > Michael> Hmm. Thinking about it some more - how about generating > Michael> the events from a mad thread in core rather than from > Michael> povider? Then this would be thread context so cache > Michael> could simply perform updates inline in event handler. > > You have the same problem with allocating storage in atomic context > for an arbitrary-length queue of events. No, on mad thread we can allocate with GFP_KERNEL I think. -- MST From mst at mellanox.co.il Tue Jun 6 13:48:11 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin)
Date: Tue, 6 Jun 2006 23:48:11 +0300
Subject: [openib-general] Re: RFC: ib_cache_event problems
In-Reply-To:
References:
Message-ID: <20060606204811.GA5472@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: RFC: ib_cache_event problems
>
> Michael> How then can we solve the problem of IPoIB querying the
> Michael> cache as a result of an event, and getting a stale value?
> Michael> Note we are actually seeing this in practice when
> Michael> changing pkeys.
>
> It doesn't seem like a severe problem to me -- IPoIB will just check
> again in another second, right?

That would solve the problem, but -

int ipoib_ib_dev_up(struct net_device *dev)
{
	struct ipoib_dev_priv *priv = netdev_priv(dev);

	ipoib_pkey_dev_check_presence(dev);

	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
		ipoib_dbg(priv, "PKEY is not assigned.\n");
		return 0;
	}

	set_bit(IPOIB_FLAG_OPER_UP, &priv->flags);

	return ipoib_mcast_start_thread(dev);
}

This doesn't seem to retry anything.

--
MST

From rdreier at cisco.com Tue Jun 6 13:49:05 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 06 Jun 2006 13:49:05 -0700
Subject: [openib-general] Re: RFC: ib_cache_event problems
In-Reply-To: <20060606204354.GH4719@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 6 Jun 2006 23:43:54 +0300")
References: <20060606204354.GH4719@mellanox.co.il>
Message-ID: 

Michael> No, on mad thread we can allocate with GFP_KERNEL I think.

But how do you get into thread context? Events are generated in interrupt context, and if you want to defer the work to process context, then you have to store the information somewhere.

- R.

From rdreier at cisco.com Tue Jun 6 13:51:59 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 06 Jun 2006 13:51:59 -0700
Subject: [openib-general] Re: RFC: ib_cache_event problems
In-Reply-To: <20060606204811.GA5472@mellanox.co.il> (Michael S.
Tsirkin's message of "Tue, 6 Jun 2006 23:48:11 +0300") References: <20060606204811.GA5472@mellanox.co.il> Message-ID: Michael> That would solve the problem, but - Michael> int ipoib_ib_dev_up(struct net_device *dev) Michael> This doesn't seem to retry anything. ipoib_main calls ipoib_pkey_dev_delay_open() before it tries ipoib_ib_dev_up(). So it should be OK if the P_Key isn't assigned yet. - R. From mst at mellanox.co.il Tue Jun 6 13:54:00 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 23:54:00 +0300 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: References: Message-ID: <20060606205400.GB5472@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: RFC: ib_cache_event problems > > Michael> That would solve the problem, but - > > Michael> int ipoib_ib_dev_up(struct net_device *dev) > > Michael> This doesn't seem to retry anything. > > ipoib_main calls ipoib_pkey_dev_delay_open() before it tries > ipoib_ib_dev_up(). So it should be OK if the P_Key isn't assigned > yet. But ipoib_ib_dev_flush doesn't? -- MST From mst at mellanox.co.il Tue Jun 6 13:56:34 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Jun 2006 23:56:34 +0300 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: References: Message-ID: <20060606205634.GC5472@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: RFC: ib_cache_event problems > > Michael> No, on mad thread we can allocate with GFP_KERNEL I think. > > But how do you get into thread context? Events are generated in > interrupt context, and if you want to defer the work to process > context, then you have to store the information somewhere. We already process incoming MADs in thread context in core. So all events related to MADs could be generated there. But you might be right - it might be easier to fix ULPs. -- MST From mst at mellanox.co.il Tue Jun 6 13:57:25 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 6 Jun 2006 23:57:25 +0300 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: References: Message-ID: <20060606205725.GD5472@mellanox.co.il> Quoting r. Roland Dreier : > Michael> IMO order of events is typically not important, so we > Michael> only need to handle up to 6 different events in some kind > Michael> of bitmask. > > This seems like a strong statement Maybe add "cache updated" event for ULPs to listen on? -- MST From rdreier at cisco.com Tue Jun 6 13:59:57 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 06 Jun 2006 13:59:57 -0700 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: <20060606205400.GB5472@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 6 Jun 2006 23:54:00 +0300") References: <20060606205400.GB5472@mellanox.co.il> Message-ID: Michael> But ipoib_ib_dev_flush doesn't? Ah, that looks like the bug I guess. What's the situation? SM clears P_Key table and then later readds a P_Key? - R. From mst at mellanox.co.il Tue Jun 6 14:03:36 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 7 Jun 2006 00:03:36 +0300 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: References: Message-ID: <20060606210336.GE5472@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: RFC: ib_cache_event problems > > Michael> But ipoib_ib_dev_flush doesn't? > > Ah, that looks like the bug I guess. What's the situation? SM clears > P_Key table and then later readds a P_Key? Yes, something like that. -- MST From mshefty at ichips.intel.com Tue Jun 6 14:26:15 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 06 Jun 2006 14:26:15 -0700 Subject: [openib-general] [PATCH 1/3] verbs: add call to initialize ib_ah_attr from a work completion In-Reply-To: References: Message-ID: <4485F2F7.3020807@ichips.intel.com> Sean Hefty wrote: > Expose a new call to initialize address handle attributes from a work > completion. 
This functionality is duplicated by both verbs and the CM. Is there any objection to committing this patch set? 1. Export ib_init_ah_from_wc() call in verbs. 2. Add ib_init_ah_from_path() call to ib_sa. 3. Convert CM to use exported calls. This eliminates some duplicated code, and centralizes the initialization of ib_ah_attr to reduce the chance of users setting the global routing flag incorrectly. - Sean From rdreier at cisco.com Tue Jun 6 14:27:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 06 Jun 2006 14:27:58 -0700 Subject: [openib-general] [PATCH 1/3] verbs: add call to initialize ib_ah_attr from a work completion In-Reply-To: <4485F2F7.3020807@ichips.intel.com> (Sean Hefty's message of "Tue, 06 Jun 2006 14:26:15 -0700") References: <4485F2F7.3020807@ichips.intel.com> Message-ID: Sean> Is there any objection to committing this patch set? I think it's fine. Should I queue it for 2.6.18 too? From sean.hefty at intel.com Tue Jun 6 14:31:40 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 6 Jun 2006 14:31:40 -0700 Subject: [openib-general] [PATCH 1/3] verbs: add call to initialize ib_ah_attr from a work completion In-Reply-To: Message-ID: >I think it's fine. Should I queue it for 2.6.18 too? That probably makes sense. I'll send a couple of svn revs that should be safe to pull into 2.6.18 after committing this. - Sean From sweitzen at cisco.com Tue Jun 6 14:35:25 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 6 Jun 2006 14:35:25 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS Message-ID: Arlin, I'm having trouble running Intel MPI 2.0.1 and OFED 1.0 rc5 with Intel MPI Benchmark 2.3 on a 32-node PCI-X RHEL4 U3 i686 cluster. This thread caught my eye, can you look at my output and tell me if this is the same issue? If not, are there other things I can tune, or should I file a bug somewhere? 
$ .../intelmpi-2.0.1-`uname -m`/bin/mpiexec -genv I_MPI_DEBUG 3 -genv I_MPI_DEVICE rdssm -genv LD_LIBRARY_PATH .../intelmpi-2.0.1-`uname -m`/lib -n 32 .../IMB_2.3/src/IMB-MPI1 PingPong
I_MPI: [0] set_up_devices(): will use device: libmpi.rdssm.so
I_MPI: [0] set_up_devices(): will use DAPL provider: OpenIB-cma
[the two lines above repeat once per rank, 32 times in all]
aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(531): Initialization failed
MPID_Init(146): channel initialization failed
MPIDI_CH3_Init(937):
MPIDI_CH3_Progress(328): MPIDI_CH3I_RDMA_wait_connect failed in VC_post_connect
(unknown)(): (null)
[the error block above repeats for each aborting rank]
aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(531): Initialization failed
MPID_Init(146): channel initialization failed MPIDI_CH3_Init(937): MPIDI_CH3_Progress(328): MPIDI_CH3I_RDMA_wait_connect failed in VC_post_connect (unknown)(): (null) rank 10 in job 1 192.168.1.1_33715 caused collective abort of all ranks exit status of rank 10: killed by signal 9 rank 1 in job 1 192.168.1.1_33715 caused collective abort of all ranks exit status of rank 1: killed by signal 9 rank 0 in job 1 192.168.1.1_33715 caused collective abort of all ranks exit status of rank 0: killed by signal 9 [releng at svbu-qaclus-1 intel.intel]$ Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Arlin Davis > Sent: Monday, June 05, 2006 5:17 PM > To: Lentini, James > Cc: 'openib-general' > Subject: [openib-general] [PATCH] uDAPL openib-cma provider - > add support for IB_CM_REQ_OPTIONS > > James, > > Here is a patch to the openib-cma provider that uses the new > set_option feature of the uCMA to > adjust connect request timeout and retry values. The defaults > are a little quick for some consumers. > They are now bumped up from 3 retries to 15 and are tunable > with uDAPL environment variables. Also, > included a fix to disallow any event after a disconnect event. > > You need to sync up the commit with Sean's patch for the uCMA > get/set IB_CM_REQ_OPTIONS. > > I would like to get this in OFED RC6 if possible. 
> > Thanks, > > -arlin > > > > Signed-off by: Arlin Davis ardavis at ichips.intel.com > > Index: dapl/openib_cma/dapl_ib_util.c > =================================================================== > --- dapl/openib_cma/dapl_ib_util.c (revision 7694) > +++ dapl/openib_cma/dapl_ib_util.c (working copy) > @@ -264,7 +264,15 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_N > /* set inline max with env or default, get local lid > and gid 0 */ > hca_ptr->ib_trans.max_inline_send = > dapl_os_get_env_val("DAPL_MAX_INLINE", > INLINE_SEND_DEFAULT); > - > + > + /* set CM timer defaults */ > + hca_ptr->ib_trans.max_cm_timeout = > + dapl_os_get_env_val("DAPL_MAX_CM_RESPONSE_TIME", > + IB_CM_RESPONSE_TIMEOUT); > + hca_ptr->ib_trans.max_cm_retries = > + dapl_os_get_env_val("DAPL_MAX_CM_RETRIES", > + IB_CM_RETRIES); > + > /* EVD events without direct CQ channels, non-blocking */ > hca_ptr->ib_trans.ib_cq = > ibv_create_comp_channel(hca_ptr->ib_hca_handle); > Index: dapl/openib_cma/dapl_ib_cm.c > =================================================================== > --- dapl/openib_cma/dapl_ib_cm.c (revision 7694) > +++ dapl/openib_cma/dapl_ib_cm.c (working copy) > @@ -58,6 +58,7 @@ > #include "dapl_ib_util.h" > #include > #include > +#include > > extern struct rdma_event_channel *g_cm_events; > > @@ -85,7 +86,6 @@ static inline uint64_t cpu_to_be64(uint6 > (unsigned short)((SID % IB_PORT_MOD) + IB_PORT_BASE) :\ > (unsigned short)SID) > > - > static void dapli_addr_resolve(struct dapl_cm_id *conn) > { > int ret; > @@ -114,6 +114,8 @@ static void dapli_addr_resolve(struct da > static void dapli_route_resolve(struct dapl_cm_id *conn) > { > int ret; > + size_t optlen = sizeof(struct ib_cm_req_opt); > + struct ib_cm_req_opt req_opt; > #ifdef DAPL_DBG > struct rdma_addr *ipaddr = &conn->cm_id->route.addr; > struct ib_addr *ibaddr = &conn->cm_id->route.addr.addr.ibaddr; > @@ -143,13 +145,43 @@ static void dapli_route_resolve(struct d > cpu_to_be64(ibaddr->dgid.global.interface_id)); > > 
dapl_dbg_log(DAPL_DBG_TYPE_CM, > - " rdma_connect: cm_id %p pdata %p plen %d rr %d > ind %d\n", > + " route_resolve: cm_id %p pdata %p plen %d rr > %d ind %d\n", > conn->cm_id, > conn->params.private_data, > conn->params.private_data_len, > conn->params.responder_resources, > conn->params.initiator_depth ); > > + /* Get default connect request timeout values, and adjust */ > + ret = rdma_get_option(conn->cm_id, RDMA_PROTO_IB, > IB_CM_REQ_OPTIONS, > + (void*)&req_opt, &optlen); > + if (ret) { > + dapl_dbg_log(DAPL_DBG_TYPE_ERR, " > rdma_get_option failed: %s\n", > + strerror(errno)); > + goto bail; > + } > + > + dapl_dbg_log(DAPL_DBG_TYPE_CM, " route_resolve: " > + "Set CR times - response %d to %d, retry > %d to %d\n", > + req_opt.remote_cm_response_timeout, > + conn->hca->ib_trans.max_cm_timeout, > + req_opt.max_cm_retries, > + conn->hca->ib_trans.max_cm_retries); > + > + /* Use hca response time setting for connect requests */ > + req_opt.max_cm_retries = conn->hca->ib_trans.max_cm_retries; > + req_opt.remote_cm_response_timeout = > + conn->hca->ib_trans.max_cm_timeout; > + req_opt.local_cm_response_timeout = > + req_opt.remote_cm_response_timeout; > + ret = rdma_set_option(conn->cm_id, RDMA_PROTO_IB, > IB_CM_REQ_OPTIONS, > + (void*)&req_opt, optlen); > + if (ret) { > + dapl_dbg_log(DAPL_DBG_TYPE_ERR, " > rdma_set_option failed: %s\n", > + strerror(errno)); > + goto bail; > + } > + > ret = rdma_connect(conn->cm_id, &conn->params); > if (ret) { > dapl_dbg_log(DAPL_DBG_TYPE_ERR, " rdma_connect > failed: %s\n", > @@ -273,14 +305,37 @@ static void dapli_cm_active_cb(struct da > } > dapl_os_unlock(&conn->lock); > > + /* There is a chance that we can get events after > + * the consumer calls disconnect in a pending state > + * since the IB CM and uDAPL states are not shared. > + * In some cases, IB CM could generate either a DCONN > + * or CONN_ERR after the consumer returned from > + * dapl_ep_disconnect with a DISCONNECTED event > + * already queued. 
Check state here and bail to > + * avoid any events after a disconnect. > + */ > + if (DAPL_BAD_HANDLE(conn->ep, DAPL_MAGIC_EP)) > + return; > + > + dapl_os_lock(&conn->ep->header.lock); > + if (conn->ep->param.ep_state == DAT_EP_STATE_DISCONNECTED) { > + dapl_os_unlock(&conn->ep->header.lock); > + return; > + } > + if (event->event == RDMA_CM_EVENT_DISCONNECTED) > + conn->ep->param.ep_state = DAT_EP_STATE_DISCONNECTED; > + > + dapl_os_unlock(&conn->ep->header.lock); > + > switch (event->event) { > case RDMA_CM_EVENT_UNREACHABLE: > case RDMA_CM_EVENT_CONNECT_ERROR: > - dapl_dbg_log( > - DAPL_DBG_TYPE_WARN, > - " dapli_cm_active_handler: CONN_ERR " > - " event=0x%x status=%d\n", > - event->event, event->status); > + dapl_dbg_log( > + DAPL_DBG_TYPE_WARN, > + " dapli_cm_active_handler: CONN_ERR " > + " event=0x%x status=%d %s\n", > + event->event, event->status, > + (event->status == -110)?"TIMEOUT":"" ); > > dapl_evd_connection_callback(conn, > > IB_CME_DESTINATION_UNREACHABLE, > @@ -368,25 +423,23 @@ static void dapli_cm_passive_cb(struct d > event->private_data, > new_conn->sp); > break; > case RDMA_CM_EVENT_UNREACHABLE: > - dapls_cr_callback(conn, IB_CME_DESTINATION_UNREACHABLE, > - NULL, conn->sp); > - > case RDMA_CM_EVENT_CONNECT_ERROR: > > dapl_dbg_log( > - DAPL_DBG_TYPE_WARN, > - " dapli_cm_passive: CONN_ERR " > - " event=0x%x status=%d", > - " on SRC 0x%x,0x%x DST 0x%x,0x%x\n", > - event->event, event->status, > - ntohl(((struct sockaddr_in *) > - &ipaddr->src_addr)->sin_addr.s_addr), > - ntohs(((struct sockaddr_in *) > - &ipaddr->src_addr)->sin_port), > - ntohl(((struct sockaddr_in *) > - &ipaddr->dst_addr)->sin_addr.s_addr), > - ntohs(((struct sockaddr_in *) > - &ipaddr->dst_addr)->sin_port)); > + DAPL_DBG_TYPE_WARN, > + " dapli_cm_passive: CONN_ERR " > + " event=0x%x status=%d %s" > + " on SRC 0x%x,0x%x DST 0x%x,0x%x\n", > + event->event, event->status, > + (event->status == -110)?"TIMEOUT":"", > + ntohl(((struct sockaddr_in *) > + 
&ipaddr->src_addr)->sin_addr.s_addr), > + ntohs(((struct sockaddr_in *) > + &ipaddr->src_addr)->sin_port), > + ntohl(((struct sockaddr_in *) > + &ipaddr->dst_addr)->sin_addr.s_addr), > + ntohs(((struct sockaddr_in *) > + &ipaddr->dst_addr)->sin_port)); > > dapls_cr_callback(conn, IB_CME_DESTINATION_UNREACHABLE, > NULL, conn->sp); > Index: dapl/openib_cma/dapl_ib_util.h > =================================================================== > --- dapl/openib_cma/dapl_ib_util.h (revision 7694) > +++ dapl/openib_cma/dapl_ib_util.h (working copy) > @@ -67,8 +67,8 @@ typedef ib_hca_handle_t dapl_ibal_ca_t; > > #define IB_RC_RETRY_COUNT 7 > #define IB_RNR_RETRY_COUNT 7 > -#define IB_CM_RESPONSE_TIMEOUT 18 /* 1 sec */ > -#define IB_MAX_CM_RETRIES 7 > +#define IB_CM_RESPONSE_TIMEOUT 20 /* 4 sec */ > +#define IB_CM_RETRIES 15 > #define IB_REQ_MRA_TIMEOUT 27 /* a little over 9 minutes */ > #define IB_MAX_AT_RETRY 3 > #define IB_TARGET_MAX 4 /* max_qp_ous_rd_atom */ > @@ -252,6 +252,8 @@ typedef struct _ib_hca_transport > ib_async_cq_handler_t async_cq_error; > ib_async_dto_handler_t async_cq; > ib_async_qp_handler_t async_qp_error; > + uint8_t max_cm_timeout; > + uint8_t max_cm_retries; > > } ib_hca_transport_t; > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From jlentini at netapp.com Tue Jun 6 14:51:17 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 6 Jun 2006 17:51:17 -0400 (EDT) Subject: [openib-general] Re: [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS In-Reply-To: References: Message-ID: On Mon, 5 Jun 2006, Arlin Davis wrote: > Here is a patch to the openib-cma provider that uses the new > set_option feature of the uCMA to adjust connect request timeout and > retry values. The defaults are a little quick for some consumers. 
> They are now bumped up from 3 retries to 15 and are tunable with > uDAPL environment variables. Also, included a fix to disallow any > event after a disconnect event. Committed in revision 7755. > I would like to get this in OFED RC6 if possible. Who is the gatekeeper for OFED? One of us should bring this to their attention, but I'm not sure who to contact. From mshefty at ichips.intel.com Tue Jun 6 14:57:00 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 06 Jun 2006 14:57:00 -0700 Subject: [openib-general] svn changes for git 2.6.18 Message-ID: <4485FA2C.3010900@ichips.intel.com> Roland, The following svn revision change sets should be safe for 2.6.18: 7748 - remove duplicated pkey from SIDR REQ API 7751 - init ah_attr from wc 7752 - init ah_attr from path 7754 - convert CM to use previous two calls - Sean From rdreier at cisco.com Tue Jun 6 14:57:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 06 Jun 2006 14:57:20 -0700 Subject: [openib-general] Re: SRP: [PATCH] Misc cleanups in ib_srp In-Reply-To: <20060604094322.GA9091@mellanox.co.il> (Ishai Rabinovitz's message of "Sun, 4 Jun 2006 12:43:22 +0300") References: <20060604094322.GA9091@mellanox.co.il> Message-ID: Thanks, looks good to me -- applied and queued for 2.6.18. ...for some reason I thought list_move_tail was only in new kernels, so I put it in my git tree but held it back from svn. But of course it's always been there, so I don't know what I was thinking of. - R. From sweitzen at cisco.com Tue Jun 6 15:00:30 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 6 Jun 2006 15:00:30 -0700 Subject: [openib-general] Re: [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS Message-ID: Tziporet is the gatekeeper (does that make me the keymaster? :-). 
Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of James Lentini > Sent: Tuesday, June 06, 2006 2:51 PM > To: Arlin Davis > Cc: 'openib-general' > Subject: [openib-general] Re: [PATCH] uDAPL openib-cma > provider - add support for IB_CM_REQ_OPTIONS > > > > On Mon, 5 Jun 2006, Arlin Davis wrote: > > > Here is a patch to the openib-cma provider that uses the new > > set_option feature of the uCMA to adjust connect request > timeout and > > retry values. The defaults are a little quick for some consumers. > > They are now bumped up from 3 retries to 15 and are tunable with > > uDAPL environment variables. Also, included a fix to disallow any > > event after a disconnect event. > > Committed in revision 7755. > > > I would like to get this in OFED RC6 if possible. > > Who is the gatekeeper for OFED? One of us should bring this to their > attention, but I'm not sure who to contact. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Tue Jun 6 15:11:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 06 Jun 2006 15:11:46 -0700 Subject: [openib-general] Re: SRP [PATCH 0/4] Kernel support for removal and restoration of target In-Reply-To: <20060605153213.GA7472@mellanox.co.il> (Ishai Rabinovitz's message of "Mon, 5 Jun 2006 18:32:13 +0300") References: <20060605153213.GA7472@mellanox.co.il> Message-ID: I haven't read too deeply yet, but something that would help me understand the overall plan here would be an explanation of how one would use the restore_target function. Why would I want to disconnect from a target but keep the kernel's SCSI device hanging around? - R. 
From rdreier at cisco.com Tue Jun 6 15:10:34 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 06 Jun 2006 15:10:34 -0700 Subject: [openib-general] Re: [PATCH] SRPTOOLS : print out the target io_class in ibsrpdm In-Reply-To: (Ramachandra Kuchimanchi's message of "Fri, 26 May 2006 13:31:44 -0400") References: Message-ID: Thanks, I applied this. From rdreier at cisco.com Tue Jun 6 15:27:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 06 Jun 2006 15:27:52 -0700 Subject: [openib-general] Re: svn changes for git 2.6.18 In-Reply-To: <4485FA2C.3010900@ichips.intel.com> (Sean Hefty's message of "Tue, 06 Jun 2006 14:57:00 -0700") References: <4485FA2C.3010900@ichips.intel.com> Message-ID: OK, I dropped them in for-2.6.18. From ardavis at ichips.intel.com Tue Jun 6 15:47:42 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 06 Jun 2006 15:47:42 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS In-Reply-To: References: Message-ID: <4486060E.8000500@ichips.intel.com> Scott Weitzenkamp (sweitzen) wrote: >Arlin, > >I'm having trouble running Intel MPI 2.0.1 and OFED 1.0 rc5 with Intel >MPI Benchmark 2.3 on a 32-node PCI-X RHEL4 U3 i686 cluster. This thread >caught my eye, can you look at my output and tell me if this is the same >issue? If not, are there other things I can tune, or should I file a >bug somewhere? > > > this looks like a configuration issue and not the timeout. The CR timeouts occurred with the rdma device and not the rdssm. Is IPoIB running on the ib0 interfaces across the fabric?
>$ .../intelmpi-2.0.1-`uname -m`/bin/mpiexec -genv I_MPI_DEBUG 3 -genv
>I_MPI_DEVICE rdssm -genv LD_LIBRARY_PATH .../intelmpi-2.0.1-`uname
>-m`/lib -n 32 .../IMB_2.3/src/IMB-MPI1 PingPong
>I_MPI: [0] set_up_devices(): will use device: libmpi.rdssm.so
>I_MPI: [0] set_up_devices(): will use DAPL provider: OpenIB-cma
>[lines repeated for the remaining ranks]
>aborting job:
>Fatal error in MPI_Init: Other MPI error, error stack:
>MPIR_Init_thread(531): Initialization failed
>MPID_Init(146): channel initialization failed
>MPIDI_CH3_Init(937):
>MPIDI_CH3_Progress(328): MPIDI_CH3I_RDMA_wait_connect failed in
>VC_post_connect
>(unknown)(): (null)
>[the error block above repeated for each aborting rank]
>aborting job:
>Fatal error in MPI_Init:
Other MPI error, error stack: >MPIR_Init_thread(531): Initialization failed >MPID_Init(146): channel initialization failed >MPIDI_CH3_Init(937): >MPIDI_CH3_Progress(328): MPIDI_CH3I_RDMA_wait_connect failed in >VC_post_connect >(unknown)(): (null) >aborting job: > > > From sweitzen at cisco.com Tue Jun 6 17:07:53 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 6 Jun 2006 17:07:53 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS Message-ID: > this looks like a configuration issue and not the timeout. The CR > timeouts occurred with > the rdma device and not the rdssm. Is IPoIB running on the ib0 > interfaces across the > fabric? Yes, IPoIB is running. Scott From sean.hefty at intel.com Tue Jun 6 19:36:23 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 6 Jun 2006 19:36:23 -0700 Subject: [openib-general] [PATCH 0/4] Add support for UD QPs Message-ID: The following patch series adds support for UD QPs to userspace through the RDMA CM. UD QPs are referenced by an IP address and UDP port number. The RDMA CM abstracts SIDR for InfiniBand clients. Signed-off-by: Sean Hefty --- A subsequent patch series will add multicast handling to the UD QPs. From sean.hefty at intel.com Tue Jun 6 19:43:13 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 6 Jun 2006 19:43:13 -0700 Subject: [openib-general] [PATCH 1/4] IB CM: Save and report remote UD QP attributes after SIDR In-Reply-To: Message-ID: Record remote QP information returned from SIDR. Expose attributes through a new API. This functionality is similar to the ib_cm_init_qp_attr() routine that exists for RC QPs.
Signed-off-by: Sean Hefty --- Index: core/cm.c =================================================================== --- core/cm.c (revision 7758) +++ core/cm.c (working copy) @@ -138,6 +138,7 @@ struct cm_id_private { __be64 tid; __be32 local_qpn; __be32 remote_qpn; + __be32 remote_qkey; enum ib_qp_type qp_type; __be32 sq_psn; __be32 rq_psn; @@ -2836,6 +2837,9 @@ static int cm_sidr_rep_handler(struct cm } cm_id_priv->id.state = IB_CM_IDLE; ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); + + cm_id_priv->remote_qpn = cm_sidr_rep_get_qpn(sidr_rep_msg); + cm_id_priv->remote_qkey = sidr_rep_msg->qkey; spin_unlock_irqrestore(&cm_id_priv->lock, flags); cm_format_sidr_rep_event(work); @@ -3230,6 +3234,29 @@ int ib_cm_init_qp_attr(struct ib_cm_id * } EXPORT_SYMBOL(ib_cm_init_qp_attr); +int ib_cm_get_dst_attr(struct ib_cm_id *cm_id, struct ib_ah_attr *ah_attr, + u32 *remote_qpn, u32 *remote_qkey) +{ + struct cm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = container_of(cm_id, struct cm_id_private, id); + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->id.state != IB_CM_IDLE) { + ret = -EINVAL; + goto out; + } + + *ah_attr = cm_id_priv->av.ah_attr; + *remote_qpn = be32_to_cpu(cm_id_priv->remote_qpn); + *remote_qkey = be32_to_cpu(cm_id_priv->remote_qkey); +out: + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return ret; +} +EXPORT_SYMBOL(ib_cm_get_dst_attr); + static void cm_add_one(struct ib_device *device) { struct cm_device *cm_dev; Index: include/rdma/ib_cm.h =================================================================== --- include/rdma/ib_cm.h (revision 7758) +++ include/rdma/ib_cm.h (working copy) @@ -521,6 +521,18 @@ int ib_cm_init_qp_attr(struct ib_cm_id * int *qp_attr_mask); /** + * ib_cm_get_dst_attr - Initializes the attributes for use in sending + * to a specified UD QP. + * @cm_id: Communication identifier that was used for the SIDR REQ. 
+ * @ah_attr: Address handle attributes that should be used to send to the + * destination QP. + * @remote_qpn: Remote QPN of the destination QP. + * @remote_qkey: Remote QKey of the destination QP. + */ +int ib_cm_get_dst_attr(struct ib_cm_id *cm_id, struct ib_ah_attr *ah_attr, + u32 *remote_qpn, u32 *remote_qkey); + +/** * ib_send_cm_apr - Sends an alternate path response message in response to * a load alternate path request. * @cm_id: Connection identifier associated with the alternate path response. From sean.hefty at intel.com Tue Jun 6 19:49:08 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 6 Jun 2006 19:49:08 -0700 Subject: [openib-general] [PATCH 2/4] Add support for UD QPs in RDMA CM In-Reply-To: Message-ID: Add support for UD QPs in the RDMA CM. UD QPs are identified by an IP address and UDP port number. The RDMA CM provides resolution of an IP address/port number to a remote QPN / QKey using existing address and route resolution and SIDR. This patch extends the RDMA CM protocol from IB CM REQ messages to IB CM SIDR REQ messages. 
Signed-off-by: Sean Hefty --- Index: core/cma.c =================================================================== --- core/cma.c (revision 7758) +++ core/cma.c (working copy) @@ -66,6 +66,7 @@ static DEFINE_MUTEX(lock); static struct workqueue_struct *cma_wq; static DEFINE_IDR(sdp_ps); static DEFINE_IDR(tcp_ps); +static DEFINE_IDR(udp_ps); struct cma_device { struct list_head list; @@ -473,6 +474,29 @@ int rdma_init_qp_attr(struct rdma_cm_id } EXPORT_SYMBOL(rdma_init_qp_attr); +int rdma_get_dst_attr(struct rdma_cm_id *id, struct sockaddr *addr, + struct ib_ah_attr *ah_attr, u32 *remote_qpn, + u32 *remote_qkey) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + case RDMA_TRANSPORT_IB: + if (!memcmp(&id->route.addr.dst_addr, addr, ip_addr_size(addr))) + ret = ib_cm_get_dst_attr(id_priv->cm_id.ib, ah_attr, + remote_qpn, remote_qkey); + break; + default: + ret = -ENOSYS; + break; + } + + return ret; +} +EXPORT_SYMBOL(rdma_get_dst_attr); + static inline int cma_zero_addr(struct sockaddr *addr) { struct in6_addr *ip6; @@ -496,9 +520,17 @@ static inline int cma_any_addr(struct so return cma_zero_addr(addr) || cma_loopback_addr(addr); } +static inline __be16 cma_port(struct sockaddr *addr) +{ + if (addr->sa_family == AF_INET) + return ((struct sockaddr_in *) addr)->sin_port; + else + return ((struct sockaddr_in6 *) addr)->sin6_port; +} + static inline int cma_any_port(struct sockaddr *addr) { - return !((struct sockaddr_in *) addr)->sin_port; + return !cma_port(addr); } static int cma_get_net_info(void *hdr, enum rdma_port_space ps, @@ -841,8 +873,8 @@ out: return ret; } -static struct rdma_id_private* cma_new_id(struct rdma_cm_id *listen_id, - struct ib_cm_event *ib_event) +static struct rdma_id_private* cma_new_conn_id(struct rdma_cm_id *listen_id, + struct ib_cm_event *ib_event) { struct rdma_id_private *id_priv; struct rdma_cm_id *id; @@ 
-885,6 +917,42 @@ err: return NULL; } +static struct rdma_id_private* cma_new_udp_id(struct rdma_cm_id *listen_id, + struct ib_cm_event *ib_event) +{ + struct rdma_id_private *id_priv; + struct rdma_cm_id *id; + union cma_ip_addr *src, *dst; + __u16 port; + u8 ip_ver; + int ret; + + id = rdma_create_id(listen_id->event_handler, listen_id->context, + listen_id->ps); + if (IS_ERR(id)) + return NULL; + + + if (cma_get_net_info(ib_event->private_data, listen_id->ps, + &ip_ver, &port, &src, &dst)) + goto err; + + cma_save_net_info(&id->route.addr, &listen_id->route.addr, + ip_ver, port, src, dst); + + ret = rdma_translate_ip(&id->route.addr.src_addr, + &id->route.addr.dev_addr); + if (ret) + goto err; + + id_priv = container_of(id, struct rdma_id_private, id); + id_priv->state = CMA_CONNECT; + return id_priv; +err: + rdma_destroy_id(id); + return NULL; +} + static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) { struct rdma_id_private *listen_id, *conn_id; @@ -897,7 +965,10 @@ static int cma_req_handler(struct ib_cm_ goto out; } - conn_id = cma_new_id(&listen_id->id, ib_event); + if (listen_id->id.ps == RDMA_PS_UDP) + conn_id = cma_new_udp_id(&listen_id->id, ib_event); + else + conn_id = cma_new_conn_id(&listen_id->id, ib_event); if (!conn_id) { ret = -ENOMEM; goto out; @@ -934,8 +1005,7 @@ out: static __be64 cma_get_service_id(enum rdma_port_space ps, struct sockaddr *addr) { - return cpu_to_be64(((u64)ps << 16) + - be16_to_cpu(((struct sockaddr_in *) addr)->sin_port)); + return cpu_to_be64(((u64)ps << 16) + be16_to_cpu(cma_port(addr))); } static void cma_set_compare_data(enum rdma_port_space ps, struct sockaddr *addr, @@ -1586,6 +1656,9 @@ static int cma_get_port(struct rdma_id_p case RDMA_PS_TCP: ps = &tcp_ps; break; + case RDMA_PS_UDP: + ps = &udp_ps; + break; default: return -EPROTONOSUPPORT; } @@ -1664,6 +1737,93 @@ static int cma_format_hdr(void *hdr, enu return 0; } +static int cma_sidr_rep_handler(struct ib_cm_id *cm_id, + struct 
ib_cm_event *ib_event) +{ + struct rdma_id_private *id_priv = cm_id->context; + enum rdma_cm_event_type event; + struct ib_cm_sidr_rep_event_param *rep = &ib_event->param.sidr_rep_rcvd; + struct rdma_route *route; + int ret = 0, status; + + atomic_inc(&id_priv->dev_remove); + if (!cma_comp(id_priv, CMA_CONNECT)) + goto out; + + switch (ib_event->event) { + case IB_CM_SIDR_REQ_ERROR: + event = RDMA_CM_EVENT_UNREACHABLE; + status = -ETIMEDOUT; + break; + case IB_CM_SIDR_REP_RECEIVED: + if (rep->status != IB_SIDR_SUCCESS) { + event = RDMA_CM_EVENT_UNREACHABLE; + status = ib_event->param.sidr_rep_rcvd.status; + break; + } + route = &id_priv->id.route; + if (rep->qkey != ntohs(cma_port(&route->addr.dst_addr))) { + event = RDMA_CM_EVENT_UNREACHABLE; + status = -EINVAL; + break; + } + event = RDMA_CM_EVENT_ESTABLISHED; + status = 0; + break; + default: + printk(KERN_ERR "RDMA CMA: unexpected IB CM event: %d", + ib_event->event); + goto out; + } + + ret = cma_notify_user(id_priv, event, status, NULL, 0); + if (ret) { + /* Destroy the CM ID by returning a non-zero value. 
*/ + id_priv->cm_id.ib = NULL; + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + rdma_destroy_id(&id_priv->id); + return ret; + } +out: + cma_release_remove(id_priv); + return ret; +} + +static int cma_resolve_ib_udp(struct rdma_id_private *id_priv) +{ + struct ib_cm_sidr_req_param req; + struct rdma_route *route; + struct cma_hdr hdr; + int ret; + + id_priv->cm_id.ib = ib_create_cm_id(id_priv->id.device, + cma_sidr_rep_handler, id_priv); + if (IS_ERR(id_priv->cm_id.ib)) + return PTR_ERR(id_priv->cm_id.ib); + + route = &id_priv->id.route; + ret = cma_format_hdr(&hdr, id_priv->id.ps, route); + if (ret) + goto out; + + req.path = route->path_rec; + req.service_id = cma_get_service_id(id_priv->id.ps, + &route->addr.dst_addr); + req.timeout_ms = 1 << max(cma_get_ib_remote_timeout(id_priv) - 8, 0); + req.private_data = &hdr; + req.private_data_len = sizeof hdr; + req.max_cm_retries = cma_get_ib_cm_retries(id_priv); + + ret = ib_send_cm_sidr_req(id_priv->cm_id.ib, &req); +out: + if (ret) { + ib_destroy_cm_id(id_priv->cm_id.ib); + id_priv->cm_id.ib = NULL; + } + return ret; +} + static int cma_connect_ib(struct rdma_id_private *id_priv, struct rdma_conn_param *conn_param) { @@ -1738,7 +1898,10 @@ int rdma_connect(struct rdma_cm_id *id, switch (rdma_node_get_transport(id->device->node_type)) { case RDMA_TRANSPORT_IB: - ret = cma_connect_ib(id_priv, conn_param); + if (id->ps == RDMA_PS_UDP) + ret = cma_resolve_ib_udp(id_priv); + else + ret = cma_connect_ib(id_priv, conn_param); break; default: ret = -ENOSYS; @@ -1780,6 +1943,21 @@ static int cma_accept_ib(struct rdma_id_ return ib_send_cm_rep(id_priv->cm_id.ib, &rep); } +static int cma_send_sidr_rep(struct rdma_id_private *id_priv, + enum ib_cm_sidr_status status) +{ + struct ib_cm_sidr_rep_param rep; + + memset(&rep, 0, sizeof rep); + rep.status = status; + if (status == IB_SIDR_SUCCESS) { + rep.qp_num = id_priv->qp_num; + rep.qkey = ntohs(cma_port(&id_priv->id.route.addr.src_addr)); + } + + return 
ib_send_cm_sidr_rep(id_priv->cm_id.ib, &rep); +} + int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) { struct rdma_id_private *id_priv; @@ -1797,7 +1975,9 @@ int rdma_accept(struct rdma_cm_id *id, s switch (rdma_node_get_transport(id->device->node_type)) { case RDMA_TRANSPORT_IB: - if (conn_param) + if (id->ps == RDMA_PS_UDP) + ret = cma_send_sidr_rep(id_priv, IB_SIDR_SUCCESS); + else if (conn_param) ret = cma_accept_ib(id_priv, conn_param); else ret = cma_rep_recv(id_priv); @@ -1830,9 +2010,12 @@ int rdma_reject(struct rdma_cm_id *id, c switch (rdma_node_get_transport(id->device->node_type)) { case RDMA_TRANSPORT_IB: - ret = ib_send_cm_rej(id_priv->cm_id.ib, - IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, - private_data, private_data_len); + if (id->ps == RDMA_PS_UDP) + ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT); + else + ret = ib_send_cm_rej(id_priv->cm_id.ib, + IB_CM_REJ_CONSUMER_DEFINED, NULL, + 0, private_data, private_data_len); break; default: ret = -ENOSYS; @@ -1995,6 +2178,7 @@ static void cma_cleanup(void) destroy_workqueue(cma_wq); idr_destroy(&sdp_ps); idr_destroy(&tcp_ps); + idr_destroy(&udp_ps); } module_init(cma_init); Index: include/rdma/rdma_cm.h =================================================================== --- include/rdma/rdma_cm.h (revision 7758) +++ include/rdma/rdma_cm.h (working copy) @@ -212,9 +212,15 @@ struct rdma_conn_param { /** * rdma_connect - Initiate an active connection request. + * @id: Connection identifier to connect. + * @conn_param: Connection information used for connected QPs. * * Users must have resolved a route for the rdma_cm_id to connect with * by having called rdma_resolve_route before calling this routine. + * + * This call will either connect to a remote QP or obtain remote QP + * information for unconnected rdma_cm_id's. The actual operation is + * based on the rdma_cm_id's port space. 
*/ int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param); Index: include/rdma/rdma_cm_ib.h =================================================================== --- include/rdma/rdma_cm_ib.h (revision 7758) +++ include/rdma/rdma_cm_ib.h (working copy) @@ -44,6 +44,22 @@ int rdma_set_ib_paths(struct rdma_cm_id *id, struct ib_sa_path_rec *path_rec, int num_paths); +/** + * rdma_get_dst_attr - Retrieve information about a UDP destination. + * @id: Connection identifier associated with the request. + * @addr: Address of remote destination to retrieve information about. + * @ah_attr: Address handle attributes. A caller uses these attributes to + * create an address handle when communicating with the destination. + * @remote_qpn: The remote QP number associated with the UDP address. + * @remote_qkey: The QKey of the remote QP. + * + * Users must have called rdma_connect() to resolve the destination for a + * UD QP, or rdma_join_multicast() for multicast destinations. + */ +int rdma_get_dst_attr(struct rdma_cm_id *id, struct sockaddr *addr, + struct ib_ah_attr *ah_attr, u32 *remote_qpn, + u32 *remote_qkey); + struct ib_cm_req_opt { u8 remote_cm_response_timeout; u8 local_cm_response_timeout; From sean.hefty at intel.com Tue Jun 6 19:52:47 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 6 Jun 2006 19:52:47 -0700 Subject: [openib-general] [PATCH 3/4] uverbs: export ib_copy_ah_attr_to_user In-Reply-To: Message-ID: Export the ib_copy_ah_attr_to_user() routine to allow copying ib_ah_attr to userspace to support UD QPs.
Signed-off-by: Sean Hefty --- Index: core/uverbs_marshall.c =================================================================== --- core/uverbs_marshall.c (revision 7758) +++ core/uverbs_marshall.c (working copy) @@ -32,8 +32,8 @@ #include -static void ib_copy_ah_attr_to_user(struct ib_uverbs_ah_attr *dst, - struct ib_ah_attr *src) +void ib_copy_ah_attr_to_user(struct ib_uverbs_ah_attr *dst, + struct ib_ah_attr *src) { memcpy(dst->grh.dgid, src->grh.dgid.raw, sizeof src->grh.dgid); dst->grh.flow_label = src->grh.flow_label; @@ -47,6 +47,7 @@ static void ib_copy_ah_attr_to_user(stru dst->is_global = src->ah_flags & IB_AH_GRH ? 1 : 0; dst->port_num = src->port_num; } +EXPORT_SYMBOL(ib_copy_ah_attr_to_user); void ib_copy_qp_attr_to_user(struct ib_uverbs_qp_attr *dst, struct ib_qp_attr *src) Index: include/rdma/ib_marshall.h =================================================================== --- include/rdma/ib_marshall.h (revision 7758) +++ include/rdma/ib_marshall.h (working copy) @@ -41,6 +41,9 @@ void ib_copy_qp_attr_to_user(struct ib_uverbs_qp_attr *dst, struct ib_qp_attr *src); +void ib_copy_ah_attr_to_user(struct ib_uverbs_ah_attr *dst, + struct ib_ah_attr *src); + void ib_copy_path_rec_to_user(struct ib_user_path_rec *dst, struct ib_sa_path_rec *src); From sean.hefty at intel.com Tue Jun 6 19:57:24 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 6 Jun 2006 19:57:24 -0700 Subject: [openib-general] [PATCH 4/4] uCMA: export UD QP support to userspace In-Reply-To: Message-ID: Export the RDMA CM's support of UD QPs to the userspace library. Signed-off-by: Sean Hefty --- My intent is to bump the ABI version only once. The multicast patches will not increment the ABI. 
Index: core/ucma.c =================================================================== --- core/ucma.c (revision 7758) +++ core/ucma.c (working copy) @@ -41,6 +41,7 @@ #include #include #include +#include #include "ucma_ib.h" @@ -291,7 +292,7 @@ static ssize_t ucma_create_id(struct ucm return -ENOMEM; ctx->uid = cmd.uid; - ctx->cm_id = rdma_create_id(ucma_event_handler, ctx, RDMA_PS_TCP); + ctx->cm_id = rdma_create_id(ucma_event_handler, ctx, cmd.ps); if (IS_ERR(ctx->cm_id)) { ret = PTR_ERR(ctx->cm_id); goto err1; @@ -736,6 +737,40 @@ static ssize_t ucma_set_option(struct uc return ret; } +static ssize_t ucma_get_dst_attr(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_get_dst_attr cmd; + struct rdma_ucm_dst_attr_resp resp; + struct ib_ah_attr ah_attr; + struct ucma_context *ctx; + int ret; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_get_dst_attr(ctx->cm_id, (struct sockaddr *) &cmd.addr, + &ah_attr, &resp.remote_qpn, &resp.remote_qkey); + if (ret) + goto out; + + ib_copy_ah_attr_to_user(&resp.ah_attr, &ah_attr); + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + ret = -EFAULT; +out: + ucma_put_ctx(ctx); + return ret; +} + static ssize_t (*ucma_cmd_table[])(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) = { @@ -753,7 +788,8 @@ static ssize_t (*ucma_cmd_table[])(struc [RDMA_USER_CM_CMD_INIT_QP_ATTR] = ucma_init_qp_attr, [RDMA_USER_CM_CMD_GET_EVENT] = ucma_get_event, [RDMA_USER_CM_CMD_GET_OPTION] = ucma_get_option, - [RDMA_USER_CM_CMD_SET_OPTION] = ucma_set_option + [RDMA_USER_CM_CMD_SET_OPTION] = ucma_set_option, + [RDMA_USER_CM_CMD_GET_DST_ATTR] = ucma_get_dst_attr }; static ssize_t ucma_write(struct file *filp, const char __user *buf, Index: include/rdma/rdma_user_cm.h 
=================================================================== --- include/rdma/rdma_user_cm.h (revision 7758) +++ include/rdma/rdma_user_cm.h (working copy) @@ -38,7 +38,7 @@ #include #include -#define RDMA_USER_CM_ABI_VERSION 1 +#define RDMA_USER_CM_ABI_VERSION 2 #define RDMA_MAX_PRIVATE_DATA 256 @@ -58,6 +58,7 @@ enum { RDMA_USER_CM_CMD_GET_EVENT, RDMA_USER_CM_CMD_GET_OPTION, RDMA_USER_CM_CMD_SET_OPTION, + RDMA_USER_CM_CMD_GET_DST_ATTR }; /* @@ -72,6 +73,8 @@ struct rdma_ucm_cmd_hdr { struct rdma_ucm_create_id { __u64 uid; __u64 response; + __u16 ps; + __u8 reserved[6]; }; struct rdma_ucm_create_id_resp { @@ -171,6 +174,18 @@ struct rdma_ucm_init_qp_attr { __u32 qp_state; }; +struct rdma_ucm_dst_attr_resp { + __u32 remote_qpn; + __u32 remote_qkey; + struct ib_uverbs_ah_attr ah_attr; +}; + +struct rdma_ucm_get_dst_attr { + __u64 response; + struct sockaddr_in6 addr; + __u32 id; +}; + struct rdma_ucm_get_event { __u64 response; }; From sean.hefty at intel.com Tue Jun 6 20:08:57 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 6 Jun 2006 20:08:57 -0700 Subject: [openib-general] [PATCH 1/2] libibverbs: add helper functions for UD QP support Message-ID: Adds some helper functions to simplify using UD QPs. Add new routines: ibv_init_ah_from_wc() and ibv_create_ah_from_wc() to simplify UD QP communication. Expose ibv_copy_ah_attr_from_kern to retrieve ibv_ah_attr from kernel for a UD QP. 
Signed-off-by: Sean Hefty --- Index: include/infiniband/verbs.h =================================================================== --- include/infiniband/verbs.h (revision 7636) +++ include/infiniband/verbs.h (working copy) @@ -298,6 +298,15 @@ struct ibv_global_route { uint8_t traffic_class; }; +struct ibv_grh { + uint32_t version_tclass_flow; + uint16_t paylen; + uint8_t next_hdr; + uint8_t hop_limit; + union ibv_gid sgid; + union ibv_gid dgid; +}; + enum ibv_rate { IBV_RATE_MAX = 0, IBV_RATE_2_5_GBPS = 2, @@ -952,6 +961,36 @@ static inline int ibv_post_recv(struct i struct ibv_ah *ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr); /** + * ibv_init_ah_from_wc - Initializes address handle attributes from a + * work completion. + * @context: Device context on which the received message arrived. + * @port_num: Port on which the received message arrived. + * @wc: Work completion associated with the received message. + * @grh: References the received global route header. This parameter is + * ignored unless the work completion indicates that the GRH is valid. + * @ah_attr: Returned attributes that can be used when creating an address + * handle for replying to the message. + */ +int ibv_init_ah_from_wc(struct ibv_context *context, uint8_t port_num, + struct ibv_wc *wc, struct ibv_grh *grh, + struct ibv_ah_attr *ah_attr); + +/** + * ibv_create_ah_from_wc - Creates an address handle associated with the + * sender of the specified work completion. + * @pd: The protection domain associated with the address handle. + * @wc: Work completion information associated with a received message. + * @grh: References the received global route header. This parameter is + * ignored unless the work completion indicates that the GRH is valid. + * @port_num: The outbound port number to associate with the address. + * + * The address handle is used to reference a local or global destination + * in all UD QP post sends. 
+ */ +struct ibv_ah *ibv_create_ah_from_wc(struct ibv_pd *pd, struct ibv_wc *wc, + struct ibv_grh *grh, uint8_t port_num); + +/** * ibv_destroy_ah - Destroy an address handle. */ int ibv_destroy_ah(struct ibv_ah *ah); Index: include/infiniband/marshall.h =================================================================== --- include/infiniband/marshall.h (revision 7636) +++ include/infiniband/marshall.h (working copy) @@ -51,6 +51,9 @@ BEGIN_C_DECLS void ibv_copy_qp_attr_from_kern(struct ibv_qp_attr *dst, struct ibv_kern_qp_attr *src); +void ibv_copy_ah_attr_from_kern(struct ibv_ah_attr *dst, + struct ibv_kern_ah_attr *src); + void ibv_copy_path_rec_from_kern(struct ibv_sa_path_rec *dst, struct ibv_kern_path_rec *src); Index: src/libibverbs.map =================================================================== --- src/libibverbs.map (revision 7636) +++ src/libibverbs.map (working copy) @@ -32,6 +32,8 @@ IBVERBS_1.0 { ibv_modify_qp; ibv_destroy_qp; ibv_create_ah; + ibv_init_ah_from_wc; + ibv_create_ah_from_wc; ibv_destroy_ah; ibv_attach_mcast; ibv_detach_mcast; @@ -65,6 +67,7 @@ IBVERBS_1.0 { ibv_cmd_attach_mcast; ibv_cmd_detach_mcast; ibv_copy_qp_attr_from_kern; + ibv_copy_ah_attr_from_kern; ibv_copy_path_rec_from_kern; ibv_copy_path_rec_to_kern; ibv_rate_to_mult; Index: src/verbs.c =================================================================== --- src/verbs.c (revision 7636) +++ src/verbs.c (working copy) @@ -42,6 +42,7 @@ #include #include #include +#include #include "ibverbs.h" @@ -392,6 +393,62 @@ struct ibv_ah *ibv_create_ah(struct ibv_ return ah; } +static int ibv_find_gid_index(struct ibv_context *context, uint8_t port_num, + union ibv_gid *gid) +{ + union ibv_gid sgid; + int i = 0, ret; + + do { + ret = ibv_query_gid(context, port_num, i++, &sgid); + } while (!ret && memcmp(&sgid, gid, sizeof *gid)); + + return ret ? 
ret : i - 1; +} + +int ibv_init_ah_from_wc(struct ibv_context *context, uint8_t port_num, + struct ibv_wc *wc, struct ibv_grh *grh, + struct ibv_ah_attr *ah_attr) +{ + uint32_t flow_class; + int ret; + + memset(ah_attr, 0, sizeof *ah_attr); + ah_attr->dlid = wc->slid; + ah_attr->sl = wc->sl; + ah_attr->src_path_bits = wc->dlid_path_bits; + ah_attr->port_num = port_num; + + if (wc->wc_flags & IBV_WC_GRH) { + ah_attr->is_global = 1; + ah_attr->grh.dgid = grh->sgid; + + ret = ibv_find_gid_index(context, port_num, &grh->dgid); + if (ret < 0) + return ret; + + ah_attr->grh.sgid_index = (uint8_t) ret; + flow_class = ntohl(grh->version_tclass_flow); + ah_attr->grh.flow_label = flow_class & 0xFFFFF; + ah_attr->grh.hop_limit = grh->hop_limit; + ah_attr->grh.traffic_class = (flow_class >> 20) & 0xFF; + } + return 0; +} + +struct ibv_ah *ibv_create_ah_from_wc(struct ibv_pd *pd, struct ibv_wc *wc, + struct ibv_grh *grh, uint8_t port_num) +{ + struct ibv_ah_attr ah_attr; + int ret; + + ret = ibv_init_ah_from_wc(pd->context, port_num, wc, grh, &ah_attr); + if (ret) + return NULL; + + return ibv_create_ah(pd, &ah_attr); +} + int ibv_destroy_ah(struct ibv_ah *ah) { return ah->context->ops.destroy_ah(ah); Index: src/marshall.c =================================================================== --- src/marshall.c (revision 7636) +++ src/marshall.c (working copy) @@ -38,8 +38,8 @@ #include -static void ibv_copy_ah_attr_from_kern(struct ibv_ah_attr *dst, - struct ibv_kern_ah_attr *src) +void ibv_copy_ah_attr_from_kern(struct ibv_ah_attr *dst, + struct ibv_kern_ah_attr *src) { memcpy(dst->grh.dgid.raw, src->grh.dgid, sizeof dst->grh.dgid); dst->grh.flow_label = src->grh.flow_label; Index: ChangeLog =================================================================== --- ChangeLog (revision 7636) +++ ChangeLog (working copy) @@ -1,3 +1,13 @@ +2006-06-07 Sean Hefty + + * src/verbs.c include/infiniband/verbs.h: Add new routines: + ibv_init_ah_from_wc() and ibv_create_ah_from_wc() to 
simplify UD QP + communication. + + * src/marshall.c include/infiniband/marshall.h: Expose + ibv_copy_ah_attr_from_kern to retrieve ibv_ah_attr from kernel for + a UD QP. + 2006-06-01 Roland Dreier * src/device.c (ibv_get_device_list): Actually return a From sean.hefty at intel.com Tue Jun 6 20:15:43 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 6 Jun 2006 20:15:43 -0700 Subject: [openib-general] [PATCH 2/2] librdmacm: add UD QP support for userspace clients In-Reply-To: Message-ID: Add support for UD QPs to the RDMA CM library, along with a goofy test program. Signed-off-by: Sean Hefty --- Index: include/rdma/rdma_cma_ib.h =================================================================== --- include/rdma/rdma_cma_ib.h (revision 7743) +++ include/rdma/rdma_cma_ib.h (working copy) @@ -44,4 +44,19 @@ struct ib_cm_req_opt { uint8_t max_cm_retries; }; +/** + * rdma_get_dst_attr - Retrieve information about a UDP destination. + * @id: Connection identifier associated with the request. + * @addr: Address of remote destination to retrieve information about. + * @ah_attr: Address handle attributes. A caller uses these attributes to + * create an address handle when communicating with the destination. + * @qpn: The remote QP number associated with the UDP address. + * @qkey: The QKey of the remote QP. + * + * Users must have called rdma_connect() to resolve the destination information. 
+ */ +int rdma_get_dst_attr(struct rdma_cm_id *id, struct sockaddr *addr, + struct ibv_ah_attr *ah_attr, uint32_t *remote_qpn, + uint32_t *remote_qkey); + #endif /* RDMA_CMA_IB_H */ Index: include/rdma/rdma_cma_abi.h =================================================================== --- include/rdma/rdma_cma_abi.h (revision 7636) +++ include/rdma/rdma_cma_abi.h (working copy) @@ -40,7 +40,7 @@ */ #define RDMA_USER_CM_MIN_ABI_VERSION 1 -#define RDMA_USER_CM_MAX_ABI_VERSION 1 +#define RDMA_USER_CM_MAX_ABI_VERSION 2 #define RDMA_MAX_PRIVATE_DATA 256 @@ -60,6 +60,7 @@ enum { UCMA_CMD_GET_EVENT, UCMA_CMD_GET_OPTION, UCMA_CMD_SET_OPTION, + UCMA_CMD_GET_DST_ATTR }; struct ucma_abi_cmd_hdr { @@ -68,9 +69,16 @@ struct ucma_abi_cmd_hdr { __u16 out; }; +struct ucma_abi_create_id_v1 { + __u64 uid; + __u64 response; +}; + struct ucma_abi_create_id { __u64 uid; __u64 response; + __u16 ps; + __u8 reserved[6]; }; struct ucma_abi_create_id_resp { @@ -170,6 +178,18 @@ struct ucma_abi_init_qp_attr { __u32 qp_state; }; +struct ucma_abi_dst_attr_resp { + __u32 remote_qpn; + __u32 remote_qkey; + struct ibv_kern_ah_attr ah_attr; +}; + +struct ucma_abi_get_dst_attr { + __u64 response; + struct sockaddr_in6 addr; + __u32 id; +}; + struct ucma_abi_get_event { __u64 response; }; Index: include/rdma/rdma_cma.h =================================================================== --- include/rdma/rdma_cma.h (revision 7743) +++ include/rdma/rdma_cma.h (working copy) @@ -54,6 +54,11 @@ enum rdma_cm_event_type { RDMA_CM_EVENT_DEVICE_REMOVAL, }; +enum rdma_port_space { + RDMA_PS_TCP = 0x0106, + RDMA_PS_UDP = 0x0111, +}; + /* Protocol levels for get/set options. */ enum { RDMA_PROTO_IP = 0, @@ -90,6 +95,7 @@ struct rdma_cm_id { void *context; struct ibv_qp *qp; struct rdma_route route; + enum rdma_port_space ps; uint8_t port_num; }; @@ -121,9 +127,11 @@ void rdma_destroy_event_channel(struct r * @id: A reference where the allocated communication identifier will be * returned. 
* @context: User specified context associated with the rdma_cm_id. + * @ps: RDMA port space. */ int rdma_create_id(struct rdma_event_channel *channel, - struct rdma_cm_id **id, void *context); + struct rdma_cm_id **id, void *context, + enum rdma_port_space ps); /** * rdma_destroy_id - Release a communication identifier. @@ -194,6 +202,10 @@ struct rdma_conn_param { uint8_t flow_control; uint8_t retry_count; /* ignored when accepting */ uint8_t rnr_retry_count; + /* Fields below ignored if a QP is created on the rdma_cm_id. */ + uint8_t srq; + uint32_t qp_num; + enum ibv_qp_type qp_type; }; /** @@ -227,7 +239,8 @@ int rdma_reject(struct rdma_cm_id *id, c uint8_t private_data_len); /** - * rdma_disconnect - This function disconnects the associated QP. + * rdma_disconnect - This function disconnects the associated QP and + * transitions it into the error state. */ int rdma_disconnect(struct rdma_cm_id *id); @@ -278,4 +291,18 @@ int rdma_get_option(struct rdma_cm_id *i int rdma_set_option(struct rdma_cm_id *id, int level, int optname, void *optval, size_t optlen); +static inline uint16_t rdma_get_src_port(struct rdma_cm_id *id) +{ + return id->route.addr.src_addr.sin6_family == PF_INET6 ? + id->route.addr.src_addr.sin6_port : + ((struct sockaddr_in *) &id->route.addr.src_addr)->sin_port; +} + +static inline uint16_t rdma_get_dst_port(struct rdma_cm_id *id) +{ + return id->route.addr.dst_addr.sin6_family == PF_INET6 ? 
+ id->route.addr.dst_addr.sin6_port : + ((struct sockaddr_in *) &id->route.addr.dst_addr)->sin_port; +} + #endif /* RDMA_CMA_H */ Index: src/cma.c =================================================================== --- src/cma.c (revision 7636) +++ src/cma.c (working copy) @@ -54,6 +54,7 @@ #include #include #include +#include #define PFX "librdmacm: " @@ -203,7 +204,7 @@ static int ucma_init(void) dev_list = ibv_get_device_list(NULL); if (!dev_list) { - printf("CMA: unable to get RDMA device liste\n"); + printf("CMA: unable to get RDMA device list\n"); ret = -ENODEV; goto err; } @@ -301,7 +302,8 @@ static void ucma_free_id(struct cma_id_p } static struct cma_id_private *ucma_alloc_id(struct rdma_event_channel *channel, - void *context) + void *context, + enum rdma_port_space ps) { struct cma_id_private *id_priv; @@ -311,6 +313,7 @@ static struct cma_id_private *ucma_alloc memset(id_priv, 0, sizeof *id_priv); id_priv->id.context = context; + id_priv->id.ps = ps; id_priv->id.channel = channel; pthread_mutex_init(&id_priv->mut, NULL); if (pthread_cond_init(&id_priv->cond, NULL)) @@ -322,8 +325,44 @@ err: ucma_free_id(id_priv); return NULL; } +static int ucma_create_id_v1(struct rdma_event_channel *channel, + struct rdma_cm_id **id, void *context, + enum rdma_port_space ps) +{ + struct ucma_abi_create_id_resp *resp; + struct ucma_abi_create_id_v1 *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size; + + if (ps != RDMA_PS_TCP) { + fprintf(stderr, "librdmacm: Kernel ABI does not support " + "requested port space.\n"); + return -EPROTONOSUPPORT; + } + + id_priv = ucma_alloc_id(channel, context, ps); + if (!id_priv) + return -ENOMEM; + + CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_CREATE_ID, size); + cmd->uid = (uintptr_t) id_priv; + + ret = write(channel->fd, msg, size); + if (ret != size) + goto err; + + id_priv->handle = resp->id; + *id = &id_priv->id; + return 0; + +err: ucma_free_id(id_priv); + return ret; +} + int rdma_create_id(struct 
rdma_event_channel *channel, - struct rdma_cm_id **id, void *context) + struct rdma_cm_id **id, void *context, + enum rdma_port_space ps) { struct ucma_abi_create_id_resp *resp; struct ucma_abi_create_id *cmd; @@ -335,12 +374,16 @@ int rdma_create_id(struct rdma_event_cha if (ret) return ret; - id_priv = ucma_alloc_id(channel, context); + if (abi_ver == 1) + return ucma_create_id_v1(channel, id, context, ps); + + id_priv = ucma_alloc_id(channel, context, ps); if (!id_priv) return -ENOMEM; CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_CREATE_ID, size); cmd->uid = (uintptr_t) id_priv; + cmd->ps = ps; ret = write(channel->fd, msg, size); if (ret != size) @@ -637,6 +680,36 @@ static int ucma_init_ib_qp(struct cma_id IBV_QP_PKEY_INDEX | IBV_QP_PORT); } +static int ucma_init_ud_qp(struct cma_id_private *id_priv, struct ibv_qp *qp) +{ + struct ibv_qp_attr qp_attr; + struct ib_addr *ibaddr; + int ret; + + ibaddr = &id_priv->id.route.addr.addr.ibaddr; + ret = ucma_find_pkey(id_priv->cma_dev, id_priv->id.port_num, + ibaddr->pkey, &qp_attr.pkey_index); + if (ret) + return ret; + + qp_attr.port_num = id_priv->id.port_num; + qp_attr.qp_state = IBV_QPS_INIT; + qp_attr.qkey = ntohs(rdma_get_src_port(&id_priv->id)); + ret = ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | + IBV_QP_PORT | IBV_QP_QKEY); + if (ret) + return ret; + + qp_attr.qp_state = IBV_QPS_RTR; + ret = ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE); + if (ret) + return ret; + + qp_attr.qp_state = IBV_QPS_RTS; + qp_attr.sq_psn = 0; + return ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_SQ_PSN); +} + int rdma_create_qp(struct rdma_cm_id *id, struct ibv_pd *pd, struct ibv_qp_init_attr *qp_init_attr) { @@ -652,7 +725,10 @@ int rdma_create_qp(struct rdma_cm_id *id if (!qp) return -ENOMEM; - ret = ucma_init_ib_qp(id_priv, qp); + if (id->ps == RDMA_PS_UDP) + ret = ucma_init_ud_qp(id_priv, qp); + else + ret = ucma_init_ib_qp(id_priv, qp); if (ret) goto err; @@ -670,11 +746,12 @@ void rdma_destroy_qp(struct 
rdma_cm_id * static void ucma_copy_conn_param_to_kern(struct ucma_abi_conn_param *dst, struct rdma_conn_param *src, - struct ibv_qp *qp) + uint32_t qp_num, + enum ibv_qp_type qp_type, uint8_t srq) { - dst->qp_num = qp->qp_num; - dst->qp_type = qp->qp_type; - dst->srq = (qp->srq != NULL); + dst->qp_num = qp_num; + dst->qp_type = qp_type; + dst->srq = srq; dst->responder_resources = src->responder_resources; dst->initiator_depth = src->initiator_depth; dst->flow_control = src->flow_control; @@ -700,7 +777,15 @@ int rdma_connect(struct rdma_cm_id *id, CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_CONNECT, size); id_priv = container_of(id, struct cma_id_private, id); cmd->id = id_priv->handle; - ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, id->qp); + if (id->qp) + ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, + id->qp->qp_num, id->qp->qp_type, + (id->qp->srq != NULL)); + else + ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, + conn_param->qp_num, + conn_param->qp_type, + conn_param->srq); ret = write(id->channel->fd, msg, size); if (ret != size) @@ -735,15 +820,25 @@ int rdma_accept(struct rdma_cm_id *id, s void *msg; int ret, size; - ret = ucma_modify_qp_rtr(id); - if (ret) - return ret; + if (id->ps != RDMA_PS_UDP) { + ret = ucma_modify_qp_rtr(id); + if (ret) + return ret; + } CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_ACCEPT, size); id_priv = container_of(id, struct cma_id_private, id); cmd->id = id_priv->handle; cmd->uid = (uintptr_t) id_priv; - ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, id->qp); + if (id->qp) + ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, + id->qp->qp_num, id->qp->qp_type, + (id->qp->srq != NULL)); + else + ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, + conn_param->qp_num, + conn_param->qp_type, + conn_param->srq); ret = write(id->channel->fd, msg, size); if (ret != size) { @@ -845,7 +940,8 @@ static int ucma_process_conn_req(struct int ret; listen_id_priv = 
container_of(event->id, struct cma_id_private, id); - id_priv = ucma_alloc_id(event->id->channel, event->id->context); + id_priv = ucma_alloc_id(event->id->channel, event->id->context, + event->id->ps); if (!id_priv) { ucma_destroy_kern_id(event->id->channel->fd, handle); ret = -ENOMEM; @@ -967,6 +1063,9 @@ retry: } break; case RDMA_CM_EVENT_ESTABLISHED: + if (id_priv->id.ps == RDMA_PS_UDP) + break; + evt->status = ucma_process_establish(&id_priv->id); if (evt->status) { evt->event = RDMA_CM_EVENT_CONNECT_ERROR; @@ -1041,3 +1140,32 @@ int rdma_set_option(struct rdma_cm_id *i return 0; } + +int rdma_get_dst_attr(struct rdma_cm_id *id, struct sockaddr *addr, + struct ibv_ah_attr *ah_attr, uint32_t *remote_qpn, + uint32_t *remote_qkey) +{ + struct ucma_abi_dst_attr_resp *resp; + struct ucma_abi_get_dst_attr *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size, addrlen; + + addrlen = ucma_addrlen(addr); + if (!addrlen) + return -EINVAL; + + CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_GET_DST_ATTR, size); + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + memcpy(&cmd->addr, addr, addrlen); + + ret = write(id->channel->fd, msg, size); + if (ret != size) + return (ret > 0) ? 
-ENODATA : ret; + + ibv_copy_ah_attr_from_kern(ah_attr, &resp->ah_attr); + *remote_qpn = resp->remote_qpn; + *remote_qkey = resp->remote_qkey; + return 0; +} Index: src/librdmacm.map =================================================================== --- src/librdmacm.map (revision 7636) +++ src/librdmacm.map (working copy) @@ -18,5 +18,6 @@ RDMACM_1.0 { rdma_ack_cm_event; rdma_get_option; rdma_set_option; + rdma_get_dst_attr; local: *; }; Index: librdmacm.spec.in =================================================================== --- librdmacm.spec.in (revision 7636) +++ librdmacm.spec.in (working copy) @@ -66,3 +66,4 @@ rm -rf $RPM_BUILD_ROOT %defattr(-,root,root) %{_bindir}/rping %{_bindir}/ucmatose +%{_bindir}/udaddy Index: Makefile.am =================================================================== --- Makefile.am (revision 7743) +++ Makefile.am (working copy) @@ -18,11 +18,13 @@ endif src_librdmacm_la_SOURCES = src/cma.c src_librdmacm_la_LDFLAGS = -avoid-version $(rdmacm_version_script) -bin_PROGRAMS = examples/ucmatose examples/rping +bin_PROGRAMS = examples/ucmatose examples/rping examples/udaddy examples_ucmatose_SOURCES = examples/cmatose.c examples_ucmatose_LDADD = $(top_builddir)/src/librdmacm.la examples_rping_SOURCES = examples/rping.c examples_rping_LDADD = $(top_builddir)/src/librdmacm.la +examples_udaddy_SOURCES = examples/udaddy.c +examples_udaddy_LDADD = $(top_builddir)/src/librdmacm.la librdmacmincludedir = $(includedir)/rdma Index: examples/rping.c =================================================================== --- examples/rping.c (revision 7636) +++ examples/rping.c (working copy) @@ -1028,7 +1028,7 @@ int main(int argc, char *argv[]) goto out; } - ret = rdma_create_id(cb->cm_channel, &cb->cm_id, cb); + ret = rdma_create_id(cb->cm_channel, &cb->cm_id, cb, RDMA_PS_TCP); if (ret) { ret = errno; fprintf(stderr, "rdma_create_id error %d\n", ret); Index: examples/udaddy.c =================================================================== 
--- examples/udaddy.c (revision 0) +++ examples/udaddy.c (revision 0) @@ -0,0 +1,636 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +/* + * To execute: + * Server: rdma_cmatose + * Client: rdma_cmatose "dst_ip=ip" + */ + +struct cmatest_node { + int id; + struct rdma_cm_id *cma_id; + int connected; + struct ibv_pd *pd; + struct ibv_cq *cq; + struct ibv_mr *mr; + struct ibv_ah *ah; + uint32_t remote_qpn; + uint32_t remote_qkey; + void *mem; +}; + +struct cmatest { + struct rdma_event_channel *channel; + struct cmatest_node *nodes; + int conn_index; + int connects_left; + + struct sockaddr_in dst_in; + struct sockaddr *dst_addr; + struct sockaddr_in src_in; + struct sockaddr *src_addr; +}; + +static struct cmatest test; +static int connections = 1; +static int message_size = 100; +static int message_count = 10; +static int is_server; + +static int create_message(struct cmatest_node *node) +{ + if (!message_size) + message_count = 0; + + if (!message_count) + return 0; + + node->mem = malloc(message_size + sizeof(struct ibv_grh)); + if (!node->mem) { + printf("failed message allocation\n"); + return -1; + } + node->mr = ibv_reg_mr(node->pd, node->mem, + message_size + sizeof(struct ibv_grh), + IBV_ACCESS_LOCAL_WRITE); + if (!node->mr) { + printf("failed to reg MR\n"); + goto err; + } + return 0; +err: + free(node->mem); + return -1; +} + +static int init_node(struct cmatest_node *node) +{ + struct ibv_qp_init_attr init_qp_attr; + int cqe, ret; + + node->pd = ibv_alloc_pd(node->cma_id->verbs); + if (!node->pd) { + ret = -ENOMEM; + printf("cmatose: unable to allocate PD\n"); + goto out; + } + + cqe = message_count ? message_count * 2 : 2; + node->cq = ibv_create_cq(node->cma_id->verbs, cqe, node, 0, 0); + if (!node->cq) { + ret = -ENOMEM; + printf("cmatose: unable to create CQ\n"); + goto out; + } + + memset(&init_qp_attr, 0, sizeof init_qp_attr); + init_qp_attr.cap.max_send_wr = message_count ? message_count : 1; + init_qp_attr.cap.max_recv_wr = message_count ? 
message_count : 1; + init_qp_attr.cap.max_send_sge = 1; + init_qp_attr.cap.max_recv_sge = 1; + init_qp_attr.qp_context = node; + init_qp_attr.sq_sig_all = 0; + init_qp_attr.qp_type = IBV_QPT_UD; + init_qp_attr.send_cq = node->cq; + init_qp_attr.recv_cq = node->cq; + ret = rdma_create_qp(node->cma_id, node->pd, &init_qp_attr); + if (ret) { + printf("cmatose: unable to create QP: %d\n", ret); + goto out; + } + + ret = create_message(node); + if (ret) { + printf("cmatose: failed to create messages: %d\n", ret); + goto out; + } +out: + return ret; +} + +static int post_recvs(struct cmatest_node *node) +{ + struct ibv_recv_wr recv_wr, *recv_failure; + struct ibv_sge sge; + int i, ret = 0; + + if (!message_count) + return 0; + + recv_wr.next = NULL; + recv_wr.sg_list = &sge; + recv_wr.num_sge = 1; + recv_wr.wr_id = (uintptr_t) node; + + sge.length = message_size + sizeof(struct ibv_grh); + sge.lkey = node->mr->lkey; + sge.addr = (uintptr_t) node->mem; + + for (i = 0; i < message_count && !ret; i++ ) { + ret = ibv_post_recv(node->cma_id->qp, &recv_wr, &recv_failure); + if (ret) { + printf("failed to post receives: %d\n", ret); + break; + } + } + return ret; +} + +static int post_sends(struct cmatest_node *node, int signal_flag) +{ + struct ibv_send_wr send_wr, *bad_send_wr; + struct ibv_sge sge; + int i, ret = 0; + + if (!node->connected || !message_count) + return 0; + + send_wr.next = NULL; + send_wr.sg_list = &sge; + send_wr.num_sge = 1; + send_wr.opcode = IBV_WR_SEND_WITH_IMM; + send_wr.send_flags = IBV_SEND_INLINE | signal_flag; + send_wr.wr_id = (unsigned long)node; + send_wr.imm_data = htonl(node->cma_id->qp->qp_num); + + send_wr.wr.ud.ah = node->ah; + send_wr.wr.ud.remote_qpn = node->remote_qpn; + send_wr.wr.ud.remote_qkey = node->remote_qkey; + + sge.length = message_size - sizeof(struct ibv_grh); + sge.lkey = node->mr->lkey; + sge.addr = (uintptr_t) node->mem; + + for (i = 0; i < message_count && !ret; i++) { + ret = ibv_post_send(node->cma_id->qp, &send_wr, 
&bad_send_wr); + if (ret) + printf("failed to post sends: %d\n", ret); + } + return ret; +} + +static void connect_error(void) +{ + test.connects_left--; +} + +static int addr_handler(struct cmatest_node *node) +{ + int ret; + + ret = rdma_resolve_route(node->cma_id, 2000); + if (ret) { + printf("cmatose: resolve route failed: %d\n", ret); + connect_error(); + } + return ret; +} + +static int route_handler(struct cmatest_node *node) +{ + struct rdma_conn_param conn_param; + int ret; + + ret = init_node(node); + if (ret) + goto err; + + ret = post_recvs(node); + if (ret) + goto err; + + memset(&conn_param, 0, sizeof conn_param); + conn_param.qp_num = node->cma_id->qp->qp_num; + conn_param.qp_type = node->cma_id->qp->qp_type; + conn_param.retry_count = 5; + ret = rdma_connect(node->cma_id, &conn_param); + if (ret) { + printf("cmatose: failure connecting: %d\n", ret); + goto err; + } + return 0; +err: + connect_error(); + return ret; +} + +static int connect_handler(struct rdma_cm_id *cma_id) +{ + struct cmatest_node *node; + struct rdma_conn_param conn_param; + int ret; + + if (test.conn_index == connections) { + ret = -ENOMEM; + goto err1; + } + node = &test.nodes[test.conn_index++]; + + node->cma_id = cma_id; + cma_id->context = node; + + ret = init_node(node); + if (ret) + goto err2; + + ret = post_recvs(node); + if (ret) + goto err2; + + memset(&conn_param, 0, sizeof conn_param); + conn_param.qp_num = node->cma_id->qp->qp_num; + conn_param.qp_type = node->cma_id->qp->qp_type; + ret = rdma_accept(node->cma_id, &conn_param); + if (ret) { + printf("cmatose: failure accepting: %d\n", ret); + goto err2; + } + node->connected = 1; + test.connects_left--; + return 0; + +err2: + node->cma_id = NULL; + connect_error(); +err1: + printf("cmatose: failing connection request\n"); + rdma_reject(cma_id, NULL, 0); + return ret; +} + +static int resolved_handler(struct cmatest_node *node) +{ + struct ibv_ah_attr ah_attr; + int ret; + + ret = rdma_get_dst_attr(node->cma_id, 
test.dst_addr, &ah_attr, + &node->remote_qpn, &node->remote_qkey); + if (ret) { + printf("udaddy: failure getting destination attributes\n"); + goto err; + } + + node->ah = ibv_create_ah(node->pd, &ah_attr); + if (!node->ah) { + printf("udaddy: failure creating address handle\n"); + goto err; + } + + node->connected = 1; + test.connects_left--; + return 0; +err: + connect_error(); + return ret; +} + +static int cma_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event) +{ + int ret = 0; + + switch (event->event) { + case RDMA_CM_EVENT_ADDR_RESOLVED: + ret = addr_handler(cma_id->context); + break; + case RDMA_CM_EVENT_ROUTE_RESOLVED: + ret = route_handler(cma_id->context); + break; + case RDMA_CM_EVENT_CONNECT_REQUEST: + ret = connect_handler(cma_id); + break; + case RDMA_CM_EVENT_ESTABLISHED: + ret = resolved_handler(cma_id->context); + break; + case RDMA_CM_EVENT_ADDR_ERROR: + case RDMA_CM_EVENT_ROUTE_ERROR: + case RDMA_CM_EVENT_CONNECT_ERROR: + case RDMA_CM_EVENT_UNREACHABLE: + case RDMA_CM_EVENT_REJECTED: + printf("cmatose: event: %d, error: %d\n", event->event, + event->status); + connect_error(); + ret = event->status; + break; + case RDMA_CM_EVENT_DEVICE_REMOVAL: + /* Cleanup will occur after test completes. 
*/ + break; + default: + break; + } + return ret; +} + +static void destroy_node(struct cmatest_node *node) +{ + if (!node->cma_id) + return; + + if (node->ah) + ibv_destroy_ah(node->ah); + + if (node->cma_id->qp) + rdma_destroy_qp(node->cma_id); + + if (node->cq) + ibv_destroy_cq(node->cq); + + if (node->mem) { + ibv_dereg_mr(node->mr); + free(node->mem); + } + + if (node->pd) + ibv_dealloc_pd(node->pd); + + /* Destroy the RDMA ID after all device resources */ + rdma_destroy_id(node->cma_id); +} + +static int alloc_nodes(void) +{ + int ret, i; + + test.nodes = malloc(sizeof *test.nodes * connections); + if (!test.nodes) { + printf("cmatose: unable to allocate memory for test nodes\n"); + return -ENOMEM; + } + memset(test.nodes, 0, sizeof *test.nodes * connections); + + for (i = 0; i < connections; i++) { + test.nodes[i].id = i; + if (!is_server) { + ret = rdma_create_id(test.channel, + &test.nodes[i].cma_id, + &test.nodes[i], RDMA_PS_UDP); + if (ret) + goto err; + } + } + return 0; +err: + while (--i >= 0) + rdma_destroy_id(test.nodes[i].cma_id); + free(test.nodes); + return ret; +} + +static void destroy_nodes(void) +{ + int i; + + for (i = 0; i < connections; i++) + destroy_node(&test.nodes[i]); + free(test.nodes); +} + +static void create_reply_ah(struct cmatest_node *node, struct ibv_wc *wc) +{ + node->ah = ibv_create_ah_from_wc(node->pd, wc, node->mem, + node->cma_id->port_num); + node->remote_qpn = ntohl(wc->imm_data); + node->remote_qkey = ntohs(rdma_get_dst_port(node->cma_id)); +} + +static int poll_cqs(void) +{ + struct ibv_wc wc[8]; + int done, i, ret; + + for (i = 0; i < connections; i++) { + if (!test.nodes[i].connected) + continue; + + for (done = 0; done < message_count; done += ret) { + ret = ibv_poll_cq(test.nodes[i].cq, 8, wc); + if (ret < 0) { + printf("cmatose: failed polling CQ: %d\n", ret); + return ret; + } + + if (ret && !test.nodes[i].ah) + create_reply_ah(&test.nodes[i], wc); + } + } + return 0; +} + +static int connect_events(void) +{ + 
struct rdma_cm_event *event; + int ret = 0; + + while (test.connects_left && !ret) { + ret = rdma_get_cm_event(test.channel, &event); + if (!ret) { + ret = cma_handler(event->id, event); + rdma_ack_cm_event(event); + } + } + return ret; +} + +static int run_server(void) +{ + struct rdma_cm_id *listen_id; + int i, ret; + + printf("cmatose: starting server\n"); + ret = rdma_create_id(test.channel, &listen_id, &test, RDMA_PS_UDP); + if (ret) { + printf("cmatose: listen request failed\n"); + return ret; + } + + test.src_in.sin_family = PF_INET; + test.src_in.sin_port = 7174; + ret = rdma_bind_addr(listen_id, test.src_addr); + if (ret) { + printf("cmatose: bind address failed: %d\n", ret); + return ret; + } + + ret = rdma_listen(listen_id, 0); + if (ret) { + printf("cmatose: failure trying to listen: %d\n", ret); + goto out; + } + + connect_events(); + + if (message_count) { + printf("receiving data transfers\n"); + ret = poll_cqs(); + if (ret) + goto out; + + printf("sending replies\n"); + for (i = 0; i < connections; i++) { + ret = post_sends(&test.nodes[i], IBV_SEND_SIGNALED); + if (ret) + goto out; + } + + ret = poll_cqs(); + if (ret) + goto out; + printf("data transfers complete\n"); + } +out: + rdma_destroy_id(listen_id); + return ret; +} + +static int get_addr(char *dst, struct sockaddr_in *addr) +{ + struct addrinfo *res; + int ret; + + ret = getaddrinfo(dst, NULL, NULL, &res); + if (ret) { + printf("getaddrinfo failed - invalid hostname or IP address\n"); + return ret; + } + + if (res->ai_family != PF_INET) { + ret = -1; + goto out; + } + + *addr = *(struct sockaddr_in *) res->ai_addr; +out: + freeaddrinfo(res); + return ret; +} + +static int run_client(char *dst, char *src) +{ + int i, ret; + + printf("cmatose: starting client\n"); + if (src) { + ret = get_addr(src, &test.src_in); + if (ret) + return ret; + } + + ret = get_addr(dst, &test.dst_in); + if (ret) + return ret; + + test.dst_in.sin_port = 7174; + + printf("cmatose: connecting\n"); + for (i = 0; i < 
connections; i++) { + ret = rdma_resolve_addr(test.nodes[i].cma_id, + src ? test.src_addr : NULL, + test.dst_addr, 2000); + if (ret) { + printf("cmatose: failure getting addr: %d\n", ret); + connect_error(); + return ret; + } + } + + ret = connect_events(); + if (ret) + goto out; + + if (message_count) { + printf("initiating data transfers\n"); + for (i = 0; i < connections; i++) { + ret = post_sends(&test.nodes[i], 0); + if (ret) + goto out; + } + printf("receiving data transfers\n"); + ret = poll_cqs(); + if (ret) + goto out; + + printf("data transfers complete\n"); + } +out: + return ret; +} + +int main(int argc, char **argv) +{ + int ret; + + if (argc > 3) { + printf("usage: %s [server_addr [src_addr]]\n", argv[0]); + exit(1); + } + is_server = (argc == 1); + + test.dst_addr = (struct sockaddr *) &test.dst_in; + test.src_addr = (struct sockaddr *) &test.src_in; + test.connects_left = connections; + + test.channel = rdma_create_event_channel(); + if (!test.channel) { + printf("failed to create event channel\n"); + exit(1); + } + + if (alloc_nodes()) + exit(1); + + if (is_server) + ret = run_server(); + else + ret = run_client(argv[1], (argc == 3) ? 
argv[2] : NULL); + + printf("test complete\n"); + destroy_nodes(); + rdma_destroy_event_channel(test.channel); + + printf("return status %d\n", ret); + return ret; +} Index: examples/cmatose.c =================================================================== --- examples/cmatose.c (revision 7636) +++ examples/cmatose.c (working copy) @@ -380,7 +380,7 @@ static int alloc_nodes(void) if (!is_server) { ret = rdma_create_id(test.channel, &test.nodes[i].cma_id, - &test.nodes[i]); + &test.nodes[i], RDMA_PS_TCP); if (ret) goto err; } @@ -466,7 +466,7 @@ static int run_server(void) int i, ret; printf("cmatose: starting server\n"); - ret = rdma_create_id(test.channel, &listen_id, &test); + ret = rdma_create_id(test.channel, &listen_id, &test, RDMA_PS_TCP); if (ret) { printf("cmatose: listen request failed\n"); return ret; From yipeeyipeeyipeeyipee at yahoo.com Tue Jun 6 23:14:33 2006 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Wed, 7 Jun 2006 06:14:33 +0000 (UTC) Subject: [openib-general] Mellanox raw QP Message-ID: Hi, Can I create raw QP (MLX type) using the openIB API? Is this possible from user space ? I searched for the right API for this but couldn't find any such way. Would I have to do this from kernel? Thanks, x From dotanb at mellanox.co.il Tue Jun 6 23:37:37 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 7 Jun 2006 09:37:37 +0300 Subject: [openib-general] Mellanox raw QP Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30243FCFE@mtlexch01.mtl.com> > Can I create raw QP (MLX type) using the openIB API? > Is this possible from user space ? I searched for the right > API for this but > couldn't find any such way. > Would I have to do this from kernel? There isn't any API to create raw QP from user level, so I believe that the answer is yes ... 
Dotan

From k_mahesh85 at yahoo.co.in Tue Jun 6 23:44:23 2006
From: k_mahesh85 at yahoo.co.in (keshetti mahesh)
Date: Wed, 7 Jun 2006 07:44:23 +0100 (BST)
Subject: [openib-general] repost - problem with memory registration - RDMA kernel utility
Message-ID: <20060607064423.80428.qmail@web8316.mail.in.yahoo.com>

Can anybody suggest the correct way to register a buffer for doing RDMA operations? I have already posted my code in the previous thread, but it is not working. It is a kernel utility and I obtained the buffer with kmalloc; how can I now register it in order to perform RDMA operations on it?

-Mahesh
-------------- next part --------------
An HTML attachment was scrubbed...
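On the registration question above: in the 2.6.16-era kernel verbs API, the usual approach was not to register each kmalloc'd buffer individually, but to take a DMA MR covering kernel memory and DMA-map the buffer for the HCA. The following is a C-style pseudocode sketch only — the names follow ib_verbs.h of that period and must be verified against the tree actually in use:

```
/* Pseudocode sketch, not a drop-in answer: names are from the
 * 2.6.16-era kernel verbs API (ib_verbs.h); check them against
 * the running tree. */

pd = ib_alloc_pd(device);

/* One DMA MR covers all of kernel physical memory, so arbitrary
 * kmalloc'd buffers need no per-buffer registration. */
mr = ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE |
                       IB_ACCESS_REMOTE_READ |
                       IB_ACCESS_REMOTE_WRITE);

buf = kmalloc(size, GFP_KERNEL);
dma_addr = dma_map_single(device->dma_device, buf, size,
                          DMA_BIDIRECTIONAL);

/* Use the mapped bus address and the DMA MR keys in the SGE. */
sge.addr = dma_addr;
sge.length = size;
sge.lkey = mr->lkey;      /* advertise mr->rkey for remote RDMA */
```

A common bug with kmalloc'd buffers was posting the kernel virtual address in the SGE instead of the address returned by dma_map_single().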
URL: From ogerlitz at voltaire.com Wed Jun 7 02:52:22 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 7 Jun 2006 12:52:22 +0300 (IDT) Subject: [openib-general] crash in ib_sa_mcmember_rec_callback while probing out ib_sa Message-ID: By mistake i was trying to bringup ib1 where port 1 and not 2 was the active port, and then got this crash on the rmmod script which is doing: ifconfig ib0 down ifconfig ib1 down modprobe -r ib_ipoib modprobe -r ib_mthca this is the dmesg crash - it happened over x86 with svn 7772 Or. ADDRCONF(NETDEV_UP): ib1: link is not ready Unable to handle kernel paging request at virtual address f8dd6758 printing eip: f8dd6758 *pde = 37c99067 *pte = 00000000 Oops: 0000 [#1] SMP Modules linked in: parport_pc lp parport autofs4 nfs lockd sunrpc button battery ac ipv6 ohci_hcd i2c_amd8111 i2c_core hw_random shpchp ib_mthca ib_sa ib_mad ib_core e100 mii tg3 floppy dm_snapshot dm_zero dm_mirror dm_mod sata_sil libata sd_mod scsi_mod CPU: 1 EIP: 0060:[] Not tainted VLI EFLAGS: 00210246 (2.6.16 #1) EIP is at 0xf8dd6758 eax: 00000000 ebx: ef2a2594 ecx: ef2a25a0 edx: f599beec esi: f38a5bec edi: f38a5bf4 ebp: fffffffc esp: f599be60 ds: 007b es: 007b ss: 0068 Process modprobe (pid: 20746, threadinfo=f599a000 task=f6411aa0) Stack: <0>f8dd1721 fffffffc 00000000 ecd95798 f66f9000 00000000 00000000 f599beb8 00000000 00000022 00000001 0000000f 00200286 f7878ec8 c03217dc c0150cfa f7fff200 6b00002c f4c9d668 f7fff200 ef2a2594 f38a5bf4 f599bef4 f599beec Call Trace: [] ib_sa_mcmember_rec_callback+0x43/0x4e [ib_sa] [] _spin_unlock_irqrestore+0x9/0xe [] poison_obj+0x21/0x41 [] send_handler+0x39/0x88 [ib_sa] [] cancel_mads+0x111/0x12f [ib_mad] [] unregister_mad_agent+0xe/0xae [ib_mad] [] ib_unregister_mad_agent+0x13/0x1f [ib_mad] [] ib_sa_remove_one+0x3c/0x6e [ib_sa] [] ib_unregister_client+0x34/0xb0 [ib_core] [] ib_sa_cleanup+0xa/0x17 [ib_sa] [] sys_delete_module+0x129/0x162 [] do_munmap+0xe7/0xf3 [] sys_munmap+0x4d/0x69 [] sysenter_past_esp+0x54/0x75 Code: 
Bad EIP value. BUG: modprobe/20746, lock held at task exit time! [f8db4280] {device_mutex} .. held by: modprobe:20746 [f6411aa0, 118] ... acquired at: ib_unregister_client+0x12/0xb0 [ib_core] From bpradip at in.ibm.com Wed Jun 7 05:54:39 2006 From: bpradip at in.ibm.com (Pradipta Kr. Banerjee) Date: Wed, 07 Jun 2006 18:24:39 +0530 Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: References: Message-ID: <4486CC8F.3050601@in.ibm.com> Sundeep Narravula wrote: >> By the way, I assume you configured, rebuilt and reinstalled libibverbs, >> librdmacm, and libamso? > > Yes. I have done these. > >> I do not see this on my systems using a 2.6.16.5 kernel on a SUSE 9.2 >> distro. What distro/kernel verions? > > The kernel used is 2.6.16 on a RH-AS4. > > --Sundeep. I don't see this problem at all. I am using kernel 2.6.16.16, SLES 9 glibc version 2.3.3-98, gcc version 3.3.3 and AMSO1100 RNIC. Will running it under gdb be of some help ? Thanks Pradipta Kumar. > >> Thanx, >> >> >> Steve. >> >> >> On Mon, 2006-06-05 at 00:43 -0400, Sundeep Narravula wrote: >>> Hi Steve, >>> We are trying the new iwarp branch on ammasso adapters. The installation >>> has gone fine. However, on running rping there is a error during >>> disconnect phase. >>> >>> $ rping -c -vV -C4 -S4 -a 150.10.108.100 -p 9999 >>> libibverbs: Warning: no userspace device-specific driver found for uverbs1 >>> driver search path: /usr/local/lib/infiniband >>> libibverbs: Warning: no userspace device-specific driver found for uverbs0 >>> driver search path: /usr/local/lib/infiniband >>> ping data: rdm >>> ping data: rdm >>> ping data: rdm >>> ping data: rdm >>> cq completion failed status 5 >>> DISCONNECT EVENT... >>> *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** >>> Aborted >>> >>> There are no apparent errors showing up in dmesg. Is this error >>> currently expected? >>> >>> Thanks, >>> --Sundeep. 
>>>

From jackm at mellanox.co.il Wed Jun 7 06:39:03 2006
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Wed, 7 Jun 2006 16:39:03 +0300
Subject: [openib-general] Re: [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS
In-Reply-To:
References:
Message-ID: <200606071639.03787.jackm@mellanox.co.il>

On Wednesday 07 June 2006 00:51, James Lentini wrote:
> On Mon, 5 Jun 2006, Arlin Davis wrote:
> > Here is a patch to the openib-cma provider that uses the new
> > set_option feature of the uCMA to adjust connect request timeout and
> > retry values.

After examining the patch (svn 7755), I noticed that it depends on changes to the kernel CMA module (svn 7742) which were checked in only last night (June 6), and which we did not see until this morning. These CMA changes were not included in today's OFED RC6 release. Therefore, this new feature (set_option to adjust timeout and retry values) will not be supported in the current OFED final release (next week). Possibly, it can be included in the next OFED release.

> > Also, included a fix to disallow any
> > event after a disconnect event.

This bug fix can still be included in next week's release if you think it is important (I have extracted it from the changes checked in at svn 7755).

- Jack

From ishai at mellanox.co.il Wed Jun 7 07:31:10 2006
From: ishai at mellanox.co.il (Ishai Rabinovitz)
Date: Wed, 7 Jun 2006 17:31:10 +0300
Subject: [openib-general] Re: SRP [PATCH 0/4] Kernel support for removal and restoration of target
In-Reply-To:
References: <20060605153213.GA7472@mellanox.co.il>
Message-ID: <20060607143110.GA7442@mellanox.co.il>

The idea is that the daemon will notice targets that leave the fabric (for a short time) and will activate remove_target. When the target returns to the fabric, the daemon will activate restore_target.
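The daemon flow described here can be sketched as follows; the hook names and the grace-period policy are placeholders standing in for whatever interface the patch set finally exposes, not the actual attributes:

```
/* Pseudocode sketch of the proposed SRP daemon loop. */
forever:
    for each known target t:
        if t not visible in fabric and t.attached:
            remove_target(t)          /* detach, keep the scsi_host */
            t.attached = false
            t.gone_since = now
        else if t visible in fabric and not t.attached:
            restore_target(t)         /* reattach to the same scsi_host */
            t.attached = true
        else if not t.attached and now - t.gone_since > GRACE_PERIOD:
            /* final removal policy: the open question in this thread */
            delete_scsi_host(t)
    sleep(poll_interval)
```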
This ensures that the scsi_host won't go offline (a state from which there is no return). I'm waiting for suggestions about the mechanism that will be responsible for removing the scsi_host when the target does not return to the fabric after a while. (See my previous mail for details.)

On Tue, Jun 06, 2006 at 03:11:46PM -0700, Roland Dreier wrote:
> I haven't read too deeply yet, but something that would help me
> understand the overall plan here would be an explanation of how one
> would use the restore_target function. Why would I want to disconnect
> from a target but keep the kernel's SCSI device hanging around?
>
> - R.

--
Ishai Rabinovitz

From tziporet at mellanox.co.il Wed Jun 7 07:40:05 2006
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Wed, 07 Jun 2006 17:40:05 +0300
Subject: [openib-general] Re: [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS
In-Reply-To:
References:
Message-ID: <4486E545.2030900@mellanox.co.il>

Scott Weitzenkamp (sweitzen) wrote:
> Tziporet is the gatekeeper (does that make me the keymaster? :-).
>
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems
>

Since we are in RC6 today, heading toward a release next week, we cannot take new CMA features that were implemented only this week. We plan another release pretty soon due to SDP problems, so we can include these changes in the next OFED release.

Jack already replied regarding the fix.

Tziporet

From tziporet at mellanox.co.il Wed Jun 7 07:58:32 2006
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Wed, 7 Jun 2006 17:58:32 +0300
Subject: [openib-general] OFED-1.0-rc6 is available
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA715E@mtlexch01.mtl.com>

Hi All,

We have prepared OFED 1.0 RC6.

Release location:
https://openib.org/svn/gen2/branches/1.0/ofed/releases
File: OFED-1.0-rc6.tgz

Note: This release is the code-freeze release for OFED 1.0. Only showstopper bugs will be fixed.
BUILD_ID: OFED-1.0-rc6 openib-1.0 (REV=7772) # User space https://openib.org/svn/gen2/branches/1.0/src/userspace # Kernel space https://openib.org/svn/gen2/branches/1.0/ofed/tags/rc6/linux-kernel Git: ref: refs/heads/for-2.6.17 commit d9ec5ad24ce80b7ef69a0717363db661d13aada5 # MPI mpi_osu-0.9.7-mlx2.1.0.tgz openmpi-1.1b1-1.src.rpm mpitests-1.0-0.src.rpm OSes: * RH EL4 up2: 2.6.9-22.ELsmp * RH EL4 up3: 2.6.9-34.ELsmp * Fedora C4: 2.6.11-1.1369_FC4 * SLES10 RC2: 2.6.16.16-1.6-smp * SUSE 10 Pro: 2.6.13-15-smp * kernel.org: 2.6.16.x Systems: * x86_64 * x86 * ia64 * ppc64 Main changes from RC5: 1. SDP - libsdp implementation of RFC proposed by Eitan Zahavi; bug fixes in kernel module. See details below. 2. SRP - bug fixes 3. Open MPI - new package based on 1.1b1-1 4. OSU-MPI - See details below. 5. iSER: Enhanced to support SLES 10 RC1. 6. IPoIB default configuration changed: a. IPoIB configuration at install time is now optional. b. The default configuration of IPoIB interfaces (if performed at install time) is DHCP; it can be changed during interactive installation. c. For unattended installation one can give a new configuration file. See the example below. 7. Bug Fixes. Package limitations: 1. The ipath driver does not compile/load on most systems. To be fixed in final release. Meanwhile, one must work with custom build and not choose ipath driver, or change in the conf file: ib_ipath=n. I attached a reference ofed-no_ipath.conf file. Once Qlogic fixes the backport patches I will publish them on the release page so any one interested can use them with this release. 2. 
iSER is working on SuSE SLES 10 RC1 only

IPoIB configuration file example:

If you are going to install OFED on a 32-node cluster and want to use a static IPoIB configuration based on the Ethernet device configuration, follow the instructions below. Assume that the Ethernet IP addresses (eth0 interfaces) of the cluster are 10.0.0.1 - 10.0.0.32, and that you want to assign to ib0 IP addresses in the range 192.168.0.1 - 192.168.0.32 and to ib1 IP addresses in the range 172.16.0.1 - 172.16.0.32. Then create the file ofed_net.conf with the following lines:

LAN_INTERFACE_ib0=eth0
IPADDR_ib0=192.168.'*'.'*'
NETMASK_ib0=255.255.0.0
NETWORK_ib0=192.168.0.0
BROADCAST_ib0=192.168.255.255
ONBOOT_ib0=1
LAN_INTERFACE_ib1=eth0
IPADDR_ib1=172.16.'*'.'*'
NETMASK_ib1=255.255.0.0
NETWORK_ib1=172.16.0.0
BROADCAST_ib1=172.16.255.255
ONBOOT_ib1=1

Note: '*' will be replaced by the corresponding octet from the eth0 IP address.

Assuming that you already have an OFED configuration file (ofed.conf) with the selected packages (created by running OFED-1.0/install.sh), run:

./install.sh -c ofed.conf -net ofed_net.conf

OSU MPI:
* Added mpi_alltoall fine tuning parameters
* Added default configuration/documentation file $MPIHOME/etc/mvapich.conf
* Added shell configuration files $MPIHOME/etc/mvapich.csh , $MPIHOME/etc/mvapich.csh
* Default MTU was changed back to 2K for InfiniHost III Ex and InfiniHost III Lx HCAs. For InfiniHost cards the recommended value is: VIADEV_DEFAULT_MTU=MTU1024

SDP Details:

libsdp enhancements according to the RFC:
1. New config syntax (please see libsdp.conf)
2. With no config or empty config use SIMPLE_LIBSDP mode
3. Support listening on both tcp and sdp
4. Support trying both connections (first SDP then TCP)
5. Support IPv4 embedded in IPv6 (also convert back address)
6. Comprehensive verbosity logging
7. BNF based config parser

Current SDP limitations:
* SDP currently does not support sending/receiving out-of-band data (MSG_OOB).
* Generally, SDP supports only SOL_SOCKET socket options.
* The following options can be set but actual support is missing:
  o SO_KEEPALIVE - no keepalives are sent
  o SO_OOBINLINE - out-of-band data is not supported
* SDP currently supports setting the following SOL_TCP socket options:
  o TCP_NODELAY, TCP_CORK - but actual support for these options is still missing
* SDP currently does not handle Zcopy mode messages correctly and does not set MaxAdverts properly in HH/HAH messages.

OFED components tested by Mellanox:
* Verbs over mthca
* IPoIB
* OpenSM
* OSU-MPI
* SRP
* SDP
* IB administration utils (ibutils)

Please send us any issues you encounter and/or test results.

Thanks,
Tziporet & Vlad

Tziporet Koren
Software Director
Mellanox Technologies
mailto: tziporet at mellanox.co.il
Tel +972-4-9097200, ext 380

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed-no_ipath.conf
Type: application/octet-stream
Size: 646 bytes
Desc: ofed-no_ipath.conf
URL:

From jlentini at netapp.com Wed Jun 7 08:26:35 2006
From: jlentini at netapp.com (James Lentini)
Date: Wed, 7 Jun 2006 11:26:35 -0400 (EDT)
Subject: [openib-general] Re: [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS
In-Reply-To: <200606071639.03787.jackm@mellanox.co.il>
References: <200606071639.03787.jackm@mellanox.co.il>
Message-ID:

On Wed, 7 Jun 2006, Jack Morgenstein wrote:
> > > Also, included a fix to disallow any
> > > event after a disconnect event.
>
> This (bug fix) can still be included in next-week's release, if you
> think it is important (I have extracted it from the changes checked
> in at svn 7755)

If you are going to make another release anyway, then I would include it.

From swise at opengridcomputing.com Wed Jun 7 08:56:44 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 07 Jun 2006 10:56:44 -0500
Subject: [openib-general] Re: [PATCH 2/7] AMSO1100 Low Level Driver.
In-Reply-To: <20060531115906.30f4bbda@localhost.localdomain> References: <20060531182733.3652.54755.stgit@stevo-desktop> <20060531182737.3652.24752.stgit@stevo-desktop> <20060531115906.30f4bbda@localhost.localdomain> Message-ID: <1149695804.27684.42.camel@stevo-desktop> On Wed, 2006-05-31 at 11:59 -0700, Stephen Hemminger wrote: > The following should be replaced with BUG_ON() or WARN_ON(). > and pr_debug() > > +#ifdef C2_DEBUG > +#define assert(expr) \ > + if(!(expr)) { \ > + printk(KERN_ERR PFX "Assertion failed! %s, %s, %s, line %d\n",\ > + #expr, __FILE__, __FUNCTION__, __LINE__); \ > + } > +#define dprintk(fmt, args...) do {printk(KERN_INFO PFX fmt, ##args);} while (0) > +#else > +#define assert(expr) do {} while (0) > +#define dprintk(fmt, args...) do {} while (0) > +#endif /* C2_DEBUG */ > > -------------------- > Also, you tend to use assert() as a bogus NULL pointer check. > If you get passed a NULL, it is a bug, and the deref will fail > and cause a pretty stack dump... > done. > > +static void c2_set_rxbufsize(struct c2_port *c2_port) > +{ > + struct net_device *netdev = c2_port->netdev; > + > + assert(netdev != NULL); > > Bogus, you will just fail on the deref below > done. 
> + > + if (netdev->mtu > RX_BUF_SIZE) > + c2_port->rx_buf_size = > + netdev->mtu + ETH_HLEN + sizeof(struct c2_rxp_hdr) + > + NET_IP_ALIGN; > + else > + c2_port->rx_buf_size = sizeof(struct c2_rxp_hdr) + RX_BUF_SIZE; > +} > > > +static void c2_rx_interrupt(struct net_device *netdev) > +{ > + struct c2_port *c2_port = netdev_priv(netdev); > + struct c2_dev *c2dev = c2_port->c2dev; > + struct c2_ring *rx_ring = &c2_port->rx_ring; > + struct c2_element *elem; > + struct c2_rx_desc *rx_desc; > + struct c2_rxp_hdr *rxp_hdr; > + struct sk_buff *skb; > + dma_addr_t mapaddr; > + u32 maplen, buflen; > + unsigned long flags; > + > + spin_lock_irqsave(&c2dev->lock, flags); > + > + /* Begin where we left off */ > + rx_ring->to_clean = rx_ring->start + c2dev->cur_rx; > + > + for (elem = rx_ring->to_clean; elem->next != rx_ring->to_clean; > + elem = elem->next) { > + rx_desc = elem->ht_desc; > + mapaddr = elem->mapaddr; > + maplen = elem->maplen; > + skb = elem->skb; > + rxp_hdr = (struct c2_rxp_hdr *) skb->data; > + > + if (rxp_hdr->flags != RXP_HRXD_DONE) > + break; > + buflen = rxp_hdr->len; > + > + /* Sanity check the RXP header */ > + if (rxp_hdr->status != RXP_HRXD_OK || > + buflen > (rx_desc->len - sizeof(*rxp_hdr))) { > + c2_rx_error(c2_port, elem); > + continue; > + } > + > + /* > + * Allocate and map a new skb for replenishing the host > + * RX desc > + */ > + if (c2_rx_alloc(c2_port, elem)) { > + c2_rx_error(c2_port, elem); > + continue; > + } > + > + /* Unmap the old skb */ > + pci_unmap_single(c2dev->pcidev, mapaddr, maplen, > + PCI_DMA_FROMDEVICE); > + > > prefetch(skb->data) here will help performance. > > good. ok. > + /* > + * Skip past the leading 8 bytes comprising of the > + * "struct c2_rxp_hdr", prepended by the adapter > + * to the usual Ethernet header ("struct ethhdr"), > + * to the start of the raw Ethernet packet. > + * > + * Fix up the various fields in the sk_buff before > + * passing it up to netif_rx(). 
The transfer size > + * (in bytes) specified by the adapter len field of > + * the "struct rxp_hdr_t" does NOT include the > + * "sizeof(struct c2_rxp_hdr)". > + */ > + skb->data += sizeof(*rxp_hdr); > + skb->tail = skb->data + buflen; > + skb->len = buflen; > + skb->dev = netdev; > + skb->protocol = eth_type_trans(skb, netdev); > + > + /* Drop arp requests to the pseudo nic ip addr */ > + if (unlikely(ntohs(skb->protocol) == ETH_P_ARP)) { > + u8 *tpa; > + > + /* pull out the tgt ip addr */ > + tpa = skb->data /* beginning of the arp packet */ > + + 8 /* arp addr fmts, lens, and opcode */ > + + 6 /* arp src hw addr */ > + + 4 /* arp src proto addr */ > + + 6; /* arp tgt hw addr */ > + if (is_rnic_addr(c2dev->pseudo_netdev, *((u32 *)tpa))) { > + dprintk("Dropping arp req for" > + " %03d.%03d.%03d.%03d\n", > + tpa[0], tpa[1], tpa[2], tpa[3]); > + kfree_skb(skb); > + continue; > + } > + } > > This is looks like a mess, please do it at a higher level or > code it with proper structure headers > This code can be removed entirely. It can be avoided having the c2 driver set in_dev->cnf.arp_ignore to 1 when loaded. > + > + netif_rx(skb); > + > + netdev->last_rx = jiffies; > + c2_port->netstats.rx_packets++; > + c2_port->netstats.rx_bytes += buflen; > + } > + > + /* Save where we left off */ > + rx_ring->to_clean = elem; > + c2dev->cur_rx = elem - rx_ring->start; > + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); > + > + spin_unlock_irqrestore(&c2dev->lock, flags); > +} > + > +/* > + * Handle netisr0 TX & RX interrupts. > + */ > +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs *regs) > +{ > + unsigned int netisr0, dmaisr; > + int handled = 0; > + struct c2_dev *c2dev = (struct c2_dev *) dev_id; > + > + assert(c2dev != NULL); > + > + /* Process CCILNET interrupts */ > + netisr0 = readl(c2dev->regs + C2_NISR0); > + if (netisr0) { > + > + /* > + * There is an issue with the firmware that always > + * provides the status of RX for both TX & RX > + * interrupts. 
So process both queues here. > + */ > + c2_rx_interrupt(c2dev->netdev); > + c2_tx_interrupt(c2dev->netdev); > + > + /* Clear the interrupt */ > + writel(netisr0, c2dev->regs + C2_NISR0); > + handled++; > + } > + > + /* Process RNIC interrupts */ > + dmaisr = readl(c2dev->regs + C2_DISR); > + if (dmaisr) { > + writel(dmaisr, c2dev->regs + C2_DISR); > + c2_rnic_interrupt(c2dev); > + handled++; > + } > + > + if (handled) { > + return IRQ_HANDLED; > + } else { > + return IRQ_NONE; > + } > > return IRQ_RETVAL(handled); > +} > + > +static int c2_up(struct net_device *netdev) > +{ > + struct c2_port *c2_port = netdev_priv(netdev); > + struct c2_dev *c2dev = c2_port->c2dev; > + struct c2_element *elem; > + struct c2_rxp_hdr *rxp_hdr; > + size_t rx_size, tx_size; > + int ret, i; > + unsigned int netimr0; > + > + assert(c2dev != NULL); > > More bogus asserts > removed. > +static struct net_device_stats *c2_get_stats(struct net_device *netdev) > +{ > + struct c2_port *c2_port = netdev_priv(netdev); > + > + return &c2_port->netstats; > +} > + > +static int c2_set_mac_address(struct net_device *netdev, void *p) > +{ > + return -1; > +} > > If you don't handle changing mac_address, just leaveing > dev->set_mac_address will do the right thing. > Also, if you need to return an error, use -ESOMEERROR, not -1. > I'll remove c2_set_mac_address() entirely. > This seems like log spam, or developer debug thing. > You need to learn to watch netlink event's from user space. > Yes, the entire block below will be removed. It's not needed. 
> > + > +#ifdef NETEVENT_NOTIFIER > +static int netevent_notifier(struct notifier_block *self, unsigned long event, > + void *data) > +{ > + int i; > + u8 *ha; > + struct neighbour *neigh = data; > + struct netevent_redirect *redir = data; > + struct netevent_route_change *rev = data; > + > + switch (event) { > + case NETEVENT_ROUTE_UPDATE: > + printk(KERN_ERR "NETEVENT_ROUTE_UPDATE:\n"); > + printk(KERN_ERR "fib_flags : %d\n", > + rev->fib_info->fib_flags); > + printk(KERN_ERR "fib_protocol : %d\n", > + rev->fib_info->fib_protocol); > + printk(KERN_ERR "fib_prefsrc : %08x\n", > + rev->fib_info->fib_prefsrc); > + printk(KERN_ERR "fib_priority : %d\n", > + rev->fib_info->fib_priority); > + break; > + > + case NETEVENT_NEIGH_UPDATE: > + printk(KERN_ERR "NETEVENT_NEIGH_UPDATE:\n"); > + printk(KERN_ERR "nud_state : %d\n", neigh->nud_state); > + printk(KERN_ERR "refcnt : %d\n", neigh->refcnt); > + printk(KERN_ERR "used : %d\n", neigh->used); > + printk(KERN_ERR "confirmed : %d\n", neigh->confirmed); > + printk(KERN_ERR " ha: "); > + for (i = 0; i < neigh->dev->addr_len; i += 4) { > + ha = &neigh->ha[i]; > + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], ha[2], > + ha[3]); > + } > + printk("\n"); > + > + printk(KERN_ERR "%8s: ", neigh->dev->name); > + for (i = 0; i < neigh->dev->addr_len; i += 4) { > + ha = &neigh->ha[i]; > + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], ha[2], > + ha[3]); > + } > + printk("\n"); > + break; > + > + case NETEVENT_REDIRECT: > + printk(KERN_ERR "NETEVENT_REDIRECT:\n"); > + printk(KERN_ERR "old: "); > + for (i = 0; i < redir->old->neighbour->dev->addr_len; i += 4) { > + ha = &redir->old->neighbour->ha[i]; > + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], ha[2], > + ha[3]); > + } > + printk("\n"); > + > + printk(KERN_ERR "new: "); > + for (i = 0; i < redir->new->neighbour->dev->addr_len; i += 4) { > + ha = &redir->new->neighbour->ha[i]; > + printk("%02x:%02x:%02x:%02x:", ha[0], ha[1], ha[2], > + ha[3]); > + } > + printk("\n"); > + break; > + > + 
default: > + printk(KERN_ERR "NETEVENT_WTFO:\n"); > + } > + > + return NOTIFY_DONE; > +} > + > +static struct notifier_block nb = { > + .notifier_call = netevent_notifier, > +}; > +#endif > +/* Thanks, Steve. From rdreier at cisco.com Wed Jun 7 09:00:37 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 Jun 2006 09:00:37 -0700 Subject: [openib-general] crash in ib_sa_mcmember_rec_callback while probing out ib_sa In-Reply-To: (Or Gerlitz's message of "Wed, 7 Jun 2006 12:52:22 +0300 (IDT)") References: Message-ID: Looks like the same crash mst saw related to the multicast module being unloaded and then having sa call back into it. One small clue: > esi: f38a5bec edi: f38a5bf4 ebp: fffffffc esp: f599be60 ebp is -4, which is -EINTR. So this may be a callback from sa_query's send_handler() caused by a IB_WC_WR_FLUSH_ERR status. - R. From mshefty at ichips.intel.com Wed Jun 7 09:52:45 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 07 Jun 2006 09:52:45 -0700 Subject: [openib-general] Re: [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS In-Reply-To: <200606071639.03787.jackm@mellanox.co.il> References: <200606071639.03787.jackm@mellanox.co.il> Message-ID: <4487045D.9000405@ichips.intel.com> Jack Morgenstein wrote: > After examining the patch (svn 7755), I noticed that it depends on changes to > the kernel CMA module (SVN 7742) which were checked only last night (June 6) > (and which we did not see until this morning). > These CMA changes were not included in today's OFED RC6 release. Therefore, > this new feature (set_option to adjust timeout and retry values) will not be > supported in the current OFED final release (next week). Possibly, it can be > included in the next OFED release. The changes were added as a solution to a scale up issue seen specifically by Intel MPI. 
- Sean From rkuchimanchi at silverstorm.com Wed Jun 7 10:28:05 2006 From: rkuchimanchi at silverstorm.com (Ramachandra K) Date: Wed, 07 Jun 2006 22:58:05 +0530 Subject: [openib-general] OFED-1.0-rc6 is available In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA715E@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA715E@mtlexch01.mtl.com> Message-ID: <44870CA5.3080406@silverstorm.com> Tziporet Koren wrote: > Hi All, > > We have prepared OFED 1.0 RC6. > From the openib source tar ball in OFED RC6, it looks like the SRP kernel changes (ulp/srp/ib_srp.c) in the trunk for supporting Rev 10 targets have been included in RC6, but the corresponding changes to the userspace srptool--ibsrpdm (userspace/srptools/src/srp-dm.c) for displaying the IO class of the target have not been made part of RC6. The changes to ibsrpdm were committed to the SVN repository trunk in revision number 7758. Will the latest version of ibsrpdm make it to the next OFED release ? Regards, Ram From mshefty at ichips.intel.com Wed Jun 7 10:41:59 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 07 Jun 2006 10:41:59 -0700 Subject: [openib-general] Re: crash in ib_sa_mcmember_rec_callback while probing out ib_sa In-Reply-To: References: Message-ID: <44870FE7.5090808@ichips.intel.com> I will look into this. - Sean From xma at us.ibm.com Wed Jun 7 10:48:11 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 7 Jun 2006 10:48:11 -0700 Subject: [openib-general] Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: Message-ID: Roland, We have seen several skb panic under heavy stress 48 hour test. I wonder whether there are duplicated or corrupted cookies received from device driver to reuse skb buff, since skb buff ring is indexed by wr_id. Is that possible? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mshefty at ichips.intel.com Wed Jun 7 11:25:08 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 07 Jun 2006 11:25:08 -0700 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <447DD2E4.3030709@ichips.intel.com> References: <1149024804.4510.1056.camel@hal.voltaire.com> <20060531090817.GQ21266@mellanox.co.il> <447DC8F8.60409@ichips.intel.com> <1149095100.4510.29902.camel@hal.voltaire.com> <447DD2E4.3030709@ichips.intel.com> Message-ID: <44871A04.9010705@ichips.intel.com> Sean Hefty wrote: > The multicast module should work in this specific case, since the only > client is ipoib, and ipoib first leaves the group before re-joining. I think that there's a race here. If ipoib leaves, then re-joins quickly enough, the join request will be processed before the leave. The result is that the join will be fulfilled locally, without an additional MAD sent. (Trying to process the leave immediately doesn't fix the problem in the generic case, where there may be multiple users of a group.) A temporary fix would be to always send a MAD, even if the join can be fulfilled locally. But I'm looking at having the multicast module re-join on an event. This raises the possibility that the new join request may fail, which would require the multicast module to report that a membership is no longer active. Another problem is if some nodes are joined as NonMembers or SendOnlyNonMembers, then the SM will not create the multicast group when they try to re-join. This leads to a race where NonMembers and SendOnlyNonMembers will fail to re-join until one of the FullMembers joins. 
- Sean From rdreier at cisco.com Wed Jun 7 11:30:40 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 Jun 2006 11:30:40 -0700 Subject: [openib-general] Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: (Shirley Ma's message of "Wed, 7 Jun 2006 10:48:11 -0700") References: Message-ID: Shirley> Roland, We have seen several skb panic under heavy stress Shirley> 48 hour test. I wonder whether there are duplicated or Shirley> corrupted cookies received from device driver to reuse Shirley> skb buff, since skb buff ring is indexed by wr_id. Is Shirley> that possible? It's possible, but I would say it's quite unlikely with mthca. With ehca I have no sense of how bug-free the driver is. Can you post a recipe to reproduce the crash? - R. From xma at us.ibm.com Wed Jun 7 11:48:34 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 7 Jun 2006 11:48:34 -0700 Subject: [openib-general] Re: [PATCH]Repost: IPoIB skb panic In-Reply-To: Message-ID: Roland, >Can you post a recipe to reproduce the crash? It happened on 32 nodes cluster (each node has 8 dual core cpus) running IBM applications over IPoIB. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Wed Jun 7 11:50:16 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Jun 2006 14:50:16 -0400 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <44871A04.9010705@ichips.intel.com> References: <1149024804.4510.1056.camel@hal.voltaire.com> <20060531090817.GQ21266@mellanox.co.il> <447DC8F8.60409@ichips.intel.com> <1149095100.4510.29902.camel@hal.voltaire.com> <447DD2E4.3030709@ichips.intel.com> <44871A04.9010705@ichips.intel.com> Message-ID: <1149706206.4510.292005.camel@hal.voltaire.com> On Wed, 2006-06-07 at 14:25, Sean Hefty wrote: > Sean Hefty wrote: > > The multicast module should work in this specific case, since the only > > client is ipoib, and ipoib first leaves the group before re-joining. > > I think that there's a race here. If ipoib leaves, then re-joins quickly > enough, the join request will be processed before the leave. The order of joins and leaves is important in terms of the SA. > The result is that > the join will be fulfilled locally, without an additional MAD sent. (Trying to > process the leave immediately doesn't fix the problem in the generic case, where > there may be multiple users of a group.) > > A temporary fix would be to always send a MAD, even if the join can be fulfilled > locally. But I'm looking at having the multicast module re-join on an event. > This raises the possibility that the new join request may fail, which would > require the multicast module to report that a membership is no longer active. A similar (as yet unresolved) problem exists with the SA if the topology changes and the previous group/members can no longer be satisfied. > Another problem is if some nodes are joined as NonMembers or SendOnlyNonMembers, > then the SM will not create the multicast group when they try to re-join. The same is true for FullMembers when there is insufficient components to create the group. 
In these cases, the group must either be precreated or the creator of the group must talk to the SA "first". > This > leads to a race where NonMembers and SendOnlyNonMembers will fail to re-join > until one of the FullMembers joins. Might also be true with joins (not creates) from FullMembers too. I would presume in such cases, the join would be retried. SendOnlyMembers (at least for IPoIB) do this if not joined every time a packet is sent. -- Hal > - Sean From narravul at cse.ohio-state.edu Wed Jun 7 11:49:51 2006 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Wed, 7 Jun 2006 14:49:51 -0400 (EDT) Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: <200606051538.35084.faulkner@opengridcomputing.com> Message-ID: > You will also get this warning on the latest CM if you have not updated the > library to use ibv_driver_init vs. openib_driver_init. This drop for libamso > happened last Friday, Jun 2. Check and see if you have that. This is the svn version I used for the test. (Looks like I have the changes from Jun 2.) $ svn info URL: https://openib.org/svn/gen2/branches/iwarp Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd Revision: 7668 Node Kind: directory Schedule: normal Last Changed Author: swise Last Changed Rev: 7638 Last Changed Date: 2006-06-02 17:13:02 -0400 (Fri, 02 Jun 2006) --Sundeep. > > > > > I'm guessing the glibc error is finding some rping bug. Maybe you have > > a later version of libc than my suse 9.2 distro? > > > > > > Stevo. > > -- > Boyd R. Faulkner > Open Grid Computing, Inc. > Phone: 512-343-9196 x109 > Fax: 512-343-5450 > From narravul at cse.ohio-state.edu Wed Jun 7 11:55:00 2006 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Wed, 7 Jun 2006 14:55:00 -0400 (EDT) Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: <4486CCF5.4050902@in.ibm.com> Message-ID: Hi, > I don't see this problem at all. 
I am using kernel 2.6.16.16, SLES 9 glibc > version 2.3.3-98, gcc version 3.3.3 and AMSO1100 RNIC. The versions I used are glibc 2.3.4, kernel 2.6.16 and gcc 3.4.3 and AMSO1100 RNIC. > Will running it under gdb be of some help ? I am able to reproduce this error with/without gdb. The glibc error disappears with higher number of iterations. (gdb) r -c -vV -C10 -S10 -a 150.111.111.100 -p 9999 Starting program: /usr/local/bin/rping -c -vV -C10 -S10 -a 150.111.111.100 -p 9999 Reading symbols from shared object read from target memory...done. Loaded system supplied DSO at 0xffffe000 [Thread debugging using libthread_db enabled] [New Thread -1208465728 (LWP 23960)] libibverbs: Warning: no userspace device-specific driver found for uverbs1 driver search path: /usr/local/lib/infiniband libibverbs: Warning: no userspace device-specific driver found for uverbs0 driver search path: /usr/local/lib/infiniband [New Thread -1208468560 (LWP 23963)] [New Thread -1216861264 (LWP 23964)] ping data: rdma-ping ping data: rdma-ping ping data: rdma-ping ping data: rdma-ping ping data: rdma-ping ping data: rdma-ping ping data: rdma-ping ping data: rdma-ping ping data: rdma-ping ping data: rdma-ping cq completion failed status 5 DISCONNECT EVENT... *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** Program received signal SIGABRT, Aborted. [Switching to Thread -1208465728 (LWP 23960)] 0xffffe410 in __kernel_vsyscall () (gdb) --Sundeep. > > Thanks > Pradipta Kumar. > > > >> Thanx, > >> > >> > >> Steve. > >> > >> > >> On Mon, 2006-06-05 at 00:43 -0400, Sundeep Narravula wrote: > >>> Hi Steve, > >>> We are trying the new iwarp branch on ammasso adapters. The installation > >>> has gone fine. However, on running rping there is a error during > >>> disconnect phase. 
> >>> > >>> $ rping -c -vV -C4 -S4 -a 150.10.108.100 -p 9999 > >>> libibverbs: Warning: no userspace device-specific driver found for uverbs1 > >>> driver search path: /usr/local/lib/infiniband > >>> libibverbs: Warning: no userspace device-specific driver found for uverbs0 > >>> driver search path: /usr/local/lib/infiniband > >>> ping data: rdm > >>> ping data: rdm > >>> ping data: rdm > >>> ping data: rdm > >>> cq completion failed status 5 > >>> DISCONNECT EVENT... > >>> *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** > >>> Aborted > >>> > >>> There are no apparent errors showing up in dmesg. Is this error > >>> currently expected? > >>> > >>> Thanks, > >>> --Sundeep. > >>> > From arlin.r.davis at intel.com Wed Jun 7 12:11:19 2006 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Wed, 7 Jun 2006 12:11:19 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS Message-ID: Scott, Can you take a look and see if rdma_cm and rdma_ucm modules are being loaded? I noticed on my latest OFED RC5 install that I had to start them manually. -arlin >-----Original Message----- >From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] >Sent: Tuesday, June 06, 2006 5:08 PM >To: Arlin Davis; Scott Weitzenkamp (sweitzen) >Cc: Davis, Arlin R; Lentini, James; openib-general >Subject: RE: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS > > >> this looks like a configuration issue and not the timeout. The CR >> timeouts occured with >> the rdma device and not the rdssm. Is IPoIB running on the ib0 >> interfaces across the >> fabric? > >Yes, IPoIB is running. > >Scott From sweitzen at cisco.com Wed Jun 7 12:57:13 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 7 Jun 2006 12:57:13 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS Message-ID: Yes, the modules were loaded. 
Each of the 32 hosts had 3 IB ports up. Does Intel MPI or uDAPL use multiple ports and/or multiple HCAs? I shut down all but one port on each host, and now Pallas is running better on the 32 nodes using Intel MPI 2.0.1. HP MPI 2.2 started working too with Pallas too over uDAPL, so maybe this is a uDAPL issue? I need to repeat the tests to make sure this isn't a fluke. Thanks for your help so far. Scott > -----Original Message----- > From: Davis, Arlin R [mailto:arlin.r.davis at intel.com] > Sent: Wednesday, June 07, 2006 12:11 PM > To: Scott Weitzenkamp (sweitzen); Arlin Davis > Cc: Lentini, James; openib-general > Subject: RE: [openib-general] [PATCH] uDAPL openib-cma > provider - add support for IB_CM_REQ_OPTIONS > > Scott, > > Can you take a look and see if rdma_cm and rdma_ucm modules are being > loaded? > > I noticed on my latest OFED RC5 install that I had to start them > manually. > > -arlin > > >-----Original Message----- > >From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > >Sent: Tuesday, June 06, 2006 5:08 PM > >To: Arlin Davis; Scott Weitzenkamp (sweitzen) > >Cc: Davis, Arlin R; Lentini, James; openib-general > >Subject: RE: [openib-general] [PATCH] uDAPL openib-cma provider - add > support for IB_CM_REQ_OPTIONS > > > > > >> this looks like a configuration issue and not the timeout. The CR > >> timeouts occured with > >> the rdma device and not the rdssm. Is IPoIB running on the ib0 > >> interfaces across the > >> fabric? > > > >Yes, IPoIB is running. 
> > > >Scott > From rdreier at cisco.com Wed Jun 7 13:03:06 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 Jun 2006 13:03:06 -0700 Subject: [openib-general] OFED-1.0-rc6 is available In-Reply-To: <44870CA5.3080406@silverstorm.com> (Ramachandra K.'s message of "Wed, 07 Jun 2006 22:58:05 +0530") References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA715E@mtlexch01.mtl.com> <44870CA5.3080406@silverstorm.com> Message-ID: We also just found a bug in how ibsrpdm discovers Cisco/Topspin FC gateways. The patch is below, and is also checked in to the trunk as svn rev 7803. Please include this in OFED 1.0 final. Thanks, Roland --- srptools/ChangeLog (revision 7796) +++ srptools/ChangeLog (working copy) @@ -1,3 +1,9 @@ +2006-06-07 Roland Dreier + * src/srp-dm.c (do_port): Use correct endianness when comparing + GUID against Topspin OUI. + + * src/srp-dm.c (set_class_port_info): Trivial whitespace fixes. + 2006-05-29 Ishai Rabinovitz * src/srp-dm.c (main): The agent ID array is declared with 0 --- srptools/src/srp-dm.c (revision 7796) +++ srptools/src/srp-dm.c (working copy) @@ -52,8 +52,6 @@ #include "ib_user_mad.h" #include "srp-dm.h" -static const uint8_t topspin_oui[3] = { 0x00, 0x05, 0xad }; - static char *umad_dev = "/dev/infiniband/umad0"; static char *port_sysfs_path; static int timeout_ms = 25000; @@ -249,7 +247,7 @@ static int set_class_port_info(int fd, u init_srp_dm_mad(&out_mad, agent[1], dlid, SRP_DM_ATTR_CLASS_PORT_INFO, 0); - out_dm_mad = (void *) out_mad.data; + out_dm_mad = (void *) out_mad.data; out_dm_mad->method = SRP_DM_METHOD_SET; cpi = (void *) out_dm_mad->data; @@ -266,9 +264,8 @@ static int set_class_port_info(int fd, u return -1; } - for (i = 0; i < 8; ++i) { + for (i = 0; i < 8; ++i) ((uint16_t *) cpi->trap_gid)[i] = htons(strtol(val + i * 5, NULL, 16)); - } if (send_and_get(fd, &out_mad, &in_mad, 0) < 0) return -1; @@ -371,7 +368,10 @@ static int do_port(int fd, uint32_t agen struct srp_dm_svc_entries svc_entries; int i, j, k; - if 
(!memcmp(&guid, topspin_oui, 3) && + static const uint64_t topspin_oui = 0x0005ad0000000000ull; + static const uint64_t oui_mask = 0xffffff0000000000ull; + + if ((guid & oui_mask) == topspin_oui && set_class_port_info(fd, agent, dlid)) fprintf(stderr, "Warning: set of ClassPortInfo failed\n"); From swise at opengridcomputing.com Wed Jun 7 13:06:00 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:06:00 -0500 Subject: [openib-general] [PATCH v2 0/2][RFC] iWARP Core Support Message-ID: <20060607200600.9003.56328.stgit@stevo-desktop> This patchset defines the modifications to the Linux infiniband subsystem to support iWARP devices. We're submitting it for review now with the goal of inclusion in the 2.6.19 kernel. This code has gone through several reviews in the openib-general list. Now we are submitting it for external review by the Linux community. This StGIT patchset is cloned from Roland Dreier's infiniband.git for-2.6.18 branch. The patchset consists of 2 patches: 1 - New iWARP CM implementation. 2 - Core changes to support iWARP. I believe I've addressed all the round 1 review comments. Details of the changes are tracked in each patch comment. Signed-off-by: Tom Tucker Signed-off-by: Steve Wise From swise at opengridcomputing.com Wed Jun 7 13:06:05 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:06:05 -0500 Subject: [openib-general] [PATCH v2 1/2] iWARP Connection Manager. In-Reply-To: <20060607200600.9003.56328.stgit@stevo-desktop> References: <20060607200600.9003.56328.stgit@stevo-desktop> Message-ID: <20060607200605.9003.25830.stgit@stevo-desktop> This patch provides the new files implementing the iWARP Connection Manager. Review Changes: - sizeof -> sizeof() - removed printks - removed TT debug code - cleaned up lock/unlock around switch statements. - waitqueue -> completion for destroy path. 
--- drivers/infiniband/core/iwcm.c | 877 ++++++++++++++++++++++++++++++++++++++++ include/rdma/iw_cm.h | 254 ++++++++++++ include/rdma/iw_cm_private.h | 62 +++ 3 files changed, 1193 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c new file mode 100644 index 0000000..994bc79 --- /dev/null +++ b/drivers/infiniband/core/iwcm.c @@ -0,0 +1,877 @@ +/* + * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Tom Tucker"); +MODULE_DESCRIPTION("iWARP CM"); +MODULE_LICENSE("Dual BSD/GPL"); + +static struct workqueue_struct *iwcm_wq; +struct iwcm_work { + struct work_struct work; + struct iwcm_id_private *cm_id; + struct list_head list; + struct iw_cm_event event; +}; + +/* + * Release a reference on cm_id. If the last reference is being removed + * and iw_destroy_cm_id is waiting, wake up the waiting thread. + */ +static int iwcm_deref_id(struct iwcm_id_private *cm_id_priv) +{ + int ret = 0; + + BUG_ON(atomic_read(&cm_id_priv->refcount)==0); + if (atomic_dec_and_test(&cm_id_priv->refcount)) { + BUG_ON(!list_empty(&cm_id_priv->work_list)); + if (waitqueue_active(&cm_id_priv->destroy_comp.wait)) { + BUG_ON(cm_id_priv->state != IW_CM_STATE_DESTROYING); + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, + &cm_id_priv->flags)); + ret = 1; + } + complete(&cm_id_priv->destroy_comp); + } + + return ret; +} + +static void add_ref(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + atomic_inc(&cm_id_priv->refcount); +} + +static void rem_ref(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + iwcm_deref_id(cm_id_priv); +} + +static void cm_event_handler(struct iw_cm_id *cm_id, struct iw_cm_event *event); + +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context) +{ + struct iwcm_id_private *cm_id_priv; + + cm_id_priv = kzalloc(sizeof(*cm_id_priv), GFP_KERNEL); + if 
(!cm_id_priv) + return ERR_PTR(-ENOMEM); + + cm_id_priv->state = IW_CM_STATE_IDLE; + cm_id_priv->id.device = device; + cm_id_priv->id.cm_handler = cm_handler; + cm_id_priv->id.context = context; + cm_id_priv->id.event_handler = cm_event_handler; + cm_id_priv->id.add_ref = add_ref; + cm_id_priv->id.rem_ref = rem_ref; + spin_lock_init(&cm_id_priv->lock); + atomic_set(&cm_id_priv->refcount, 1); + init_waitqueue_head(&cm_id_priv->connect_wait); + init_completion(&cm_id_priv->destroy_comp); + INIT_LIST_HEAD(&cm_id_priv->work_list); + + return &cm_id_priv->id; +} +EXPORT_SYMBOL(iw_create_cm_id); + + +static int iwcm_modify_qp_err(struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + if (!qp) + return -EINVAL; + + qp_attr.qp_state = IB_QPS_ERR; + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE); +} + +/* + * This is really the RDMAC CLOSING state. It is most similar to the + * IB SQD QP state. + */ +static int iwcm_modify_qp_sqd(struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + BUG_ON(qp == NULL); + qp_attr.qp_state = IB_QPS_SQD; + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE); +} + +/* + * CM_ID <-- CLOSING + * + * Block if a passive or active connection is currently being processed. 
Then + * process the event as follows: + * - If we are ESTABLISHED, move to CLOSING and modify the QP state + * based on the abrupt flag + * - If the connection is already in the CLOSING or IDLE state, the peer is + * disconnecting concurrently with us and we've already seen the + * DISCONNECT event -- ignore the request and return 0 + * - Disconnect on a listening endpoint returns -EINVAL + */ +int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + struct ib_qp *qp = NULL; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + /* Wait if we're currently in a connect or accept downcall */ + wait_event(cm_id_priv->connect_wait, + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_ESTABLISHED: + cm_id_priv->state = IW_CM_STATE_CLOSING; + + /* QP could be for user-mode client */ + if (cm_id_priv->qp) + qp = cm_id_priv->qp; + else + ret = -EINVAL; + break; + case IW_CM_STATE_LISTEN: + ret = -EINVAL; + break; + case IW_CM_STATE_CLOSING: + /* remote peer closed first */ + case IW_CM_STATE_IDLE: + /* accept or connect returned !0 */ + break; + case IW_CM_STATE_CONN_RECV: + /* + * App called disconnect before/without calling accept after + * connect_request event delivered. + */ + break; + case IW_CM_STATE_CONN_SENT: + /* Can only get here if wait above fails */ + default: + BUG_ON(1); + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + if (qp) { + if (abrupt) + ret = iwcm_modify_qp_err(qp); + else + ret = iwcm_modify_qp_sqd(qp); + + /* + * If both sides are disconnecting the QP could + * already be in ERR or SQD states + */ + ret = 0; + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_disconnect); + +/* + * CM_ID <-- DESTROYING + * + * Clean up all resources associated with the connection and release + * the initial reference taken by iw_create_cm_id. 
+ */ +static void destroy_cm_id(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + /* Wait if we're currently in a connect or accept downcall. A + * listening endpoint should never block here. */ + wait_event(cm_id_priv->connect_wait, + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_LISTEN: + cm_id_priv->state = IW_CM_STATE_DESTROYING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + /* destroy the listening endpoint */ + ret = cm_id->device->iwcm->destroy_listen(cm_id); + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_ESTABLISHED: + cm_id_priv->state = IW_CM_STATE_DESTROYING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + /* Abrupt close of the connection */ + (void)iwcm_modify_qp_err(cm_id_priv->qp); + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_IDLE: + case IW_CM_STATE_CLOSING: + cm_id_priv->state = IW_CM_STATE_DESTROYING; + break; + case IW_CM_STATE_CONN_RECV: + /* + * App called destroy before/without calling accept after + * receiving connection request event notification. + */ + cm_id_priv->state = IW_CM_STATE_DESTROYING; + break; + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_DESTROYING: + default: + BUG_ON(1); + break; + } + if (cm_id_priv->qp) { + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); + cm_id_priv->qp = NULL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + (void)iwcm_deref_id(cm_id_priv); +} + +/* + * This function is only called by the application thread and cannot + * be called by the event thread. The function will wait for all + * references to be released on the cm_id and then kfree the cm_id + * object. 
+ */ +void iw_destroy_cm_id(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags)); + + destroy_cm_id(cm_id); + + wait_for_completion(&cm_id_priv->destroy_comp); + + kfree(cm_id_priv); +} +EXPORT_SYMBOL(iw_destroy_cm_id); + +/* + * CM_ID <-- LISTEN + * + * Start listening for connect requests. Generates one CONNECT_REQUEST + * event for each inbound connect request. + */ +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_IDLE: + cm_id_priv->state = IW_CM_STATE_LISTEN; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id->device->iwcm->create_listen(cm_id, backlog); + if (ret) + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + default: + ret = -EINVAL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + return ret; +} +EXPORT_SYMBOL(iw_cm_listen); + +/* + * CM_ID <-- IDLE + * + * Rejects an inbound connection request. No events are generated. 
+ */ +int iw_cm_reject(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->reject(cm_id, private_data, + private_data_len); + + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + + return ret; +} +EXPORT_SYMBOL(iw_cm_reject); + +/* + * CM_ID <-- ESTABLISHED + * + * Accepts an inbound connection request and generates an ESTABLISHED + * event. Callers of iw_cm_disconnect and iw_destroy_cm_id will block + * until the ESTABLISHED event is received from the provider. 
+ */ +int iw_cm_accept(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *iw_param) +{ + struct iwcm_id_private *cm_id_priv; + struct ib_qp *qp; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + /* Get the ib_qp given the QPN */ + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); + if (!qp) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return -EINVAL; + } + cm_id->device->iwcm->add_ref(qp); + cm_id_priv->qp = qp; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->accept(cm_id, iw_param); + if (ret) { + /* An error on accept precludes provider events */ + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV); + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->qp) { + cm_id->device->iwcm->rem_ref(qp); + cm_id_priv->qp = NULL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + printk("Accept failed, ret=%d\n", ret); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_accept); + +/* + * Active Side: CM_ID <-- CONN_SENT + * + * If successful, results in the generation of a CONNECT_REPLY + * event. iw_cm_disconnect and iw_cm_destroy will block until the + * CONNECT_REPLY event is received from the provider. 
+ */ +int iw_cm_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + struct iwcm_id_private *cm_id_priv; + int ret = 0; + unsigned long flags; + struct ib_qp *qp; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state != IW_CM_STATE_IDLE) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + + /* Get the ib_qp given the QPN */ + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); + if (!qp) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return -EINVAL; + } + cm_id->device->iwcm->add_ref(qp); + cm_id_priv->qp = qp; + cm_id_priv->state = IW_CM_STATE_CONN_SENT; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->connect(cm_id, iw_param); + if (ret) { + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->qp) { + cm_id->device->iwcm->rem_ref(qp); + cm_id_priv->qp = NULL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT); + cm_id_priv->state = IW_CM_STATE_IDLE; + printk("Connect failed, ret=%d\n", ret); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_connect); + +/* + * Passive Side: new CM_ID <-- CONN_RECV + * + * Handles an inbound connect request. The function creates a new + * iw_cm_id to represent the new connection and inherits the client + * callback function and other attributes from the listening parent. + * + * The work item contains a pointer to the listen_cm_id and the event. The + * listen_cm_id contains the client cm_handler, context and + * device. These are copied when the device is cloned. The event + * contains the new four tuple. 
+ * + * An error on the child should not affect the parent, so this + * function does not return a value. + */ +static void cm_conn_req_handler(struct iwcm_id_private *listen_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + struct iw_cm_id *cm_id; + struct iwcm_id_private *cm_id_priv; + int ret; + + /* The provider should never generate a connection request + * event with a bad status. + */ + BUG_ON(iw_event->status); + + /* We could be destroying the listening id. If so, ignore this + * upcall. */ + spin_lock_irqsave(&listen_id_priv->lock, flags); + if (listen_id_priv->state != IW_CM_STATE_LISTEN) { + spin_unlock_irqrestore(&listen_id_priv->lock, flags); + return; + } + spin_unlock_irqrestore(&listen_id_priv->lock, flags); + + cm_id = iw_create_cm_id(listen_id_priv->id.device, + listen_id_priv->id.cm_handler, + listen_id_priv->id.context); + /* If the cm_id could not be created, ignore the request */ + if (IS_ERR(cm_id)) + return; + + cm_id->provider_data = iw_event->provider_data; + cm_id->local_addr = iw_event->local_addr; + cm_id->remote_addr = iw_event->remote_addr; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + cm_id_priv->state = IW_CM_STATE_CONN_RECV; + + /* Call the client CM handler */ + ret = cm_id->cm_handler(cm_id, iw_event); + if (ret) { + set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); + destroy_cm_id(cm_id); + if (atomic_read(&cm_id_priv->refcount)==0) + kfree(cm_id); + } +} + +/* + * Passive Side: CM_ID <-- ESTABLISHED + * + * The provider generated an ESTABLISHED event which means that + * the MPA negotiation has completed successfully and we are now in MPA + * FPDU mode. + * + * This event can only be received in the CONN_RECV state. If the + * remote peer closed, the ESTABLISHED event would be received followed + * by the CLOSE event. If the app closes, it will block until we wake + * it up after processing this event. 
+ */ +static int cm_conn_est_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + + /* We clear the CONNECT_WAIT bit here to allow the callback + * function to call iw_cm_disconnect. Calling iw_destroy_cm_id + * from a callback handler is not allowed */ + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV); + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); + wake_up_all(&cm_id_priv->connect_wait); + + return ret; +} + +/* + * Active Side: CM_ID <-- ESTABLISHED + * + * The app has called connect and is waiting for the established event to + * post its requests to the server. This event will wake up anyone + * blocked in iw_cm_disconnect or iw_destroy_id. + */ +static int cm_conn_rep_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + /* Clear the connect wait bit so a callback function calling + * iw_cm_disconnect will not wait and deadlock this thread */ + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT); + if (iw_event->status == IW_CM_EVENT_STATUS_ACCEPTED) { + cm_id_priv->id.local_addr = iw_event->local_addr; + cm_id_priv->id.remote_addr = iw_event->remote_addr; + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; + } else { + /* REJECTED or RESET */ + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); + cm_id_priv->qp = NULL; + cm_id_priv->state = IW_CM_STATE_IDLE; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); + + /* Wake up waiters on connect complete */ + wake_up_all(&cm_id_priv->connect_wait); + + return ret; +} + +/* + * CM_ID <-- CLOSING + * + * 
If in the ESTABLISHED state, move to CLOSING. + */ +static void cm_disconnect_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state == IW_CM_STATE_ESTABLISHED) + cm_id_priv->state = IW_CM_STATE_CLOSING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); +} + +/* + * CM_ID <-- IDLE + * + * If in the ESTABLISHED or CLOSING states, the QP will have been + * moved by the provider to the ERR state. Disassociate the CM_ID from + * the QP, move to IDLE, and remove the 'connected' reference. + * + * If in some other state, the cm_id was destroyed asynchronously. + * This is the last reference that will result in waking up + * the app thread blocked in iw_destroy_cm_id. + */ +static int cm_close_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + int ret = 0; + spin_lock_irqsave(&cm_id_priv->lock, flags); + + if (cm_id_priv->qp) { + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); + cm_id_priv->qp = NULL; + } + switch (cm_id_priv->state) { + case IW_CM_STATE_ESTABLISHED: + case IW_CM_STATE_CLOSING: + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_DESTROYING: + break; + default: + BUG_ON(1); + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + return ret; +} + +static int process_event(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + int ret = 0; + + switch (iw_event->event) { + case IW_CM_EVENT_CONNECT_REQUEST: + cm_conn_req_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_CONNECT_REPLY: + ret = cm_conn_rep_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_ESTABLISHED: + ret = cm_conn_est_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_DISCONNECT: + 
cm_disconnect_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_CLOSE: + ret = cm_close_handler(cm_id_priv, iw_event); + break; + default: + BUG_ON(1); + } + + return ret; +} + +/* + * Process events on the work_list for the cm_id. If the callback + * function requests that the cm_id be deleted, a flag is set in the + * cm_id flags to indicate that when the last reference is + * removed, the cm_id is to be destroyed. This is necessary to + * distinguish between an object that will be destroyed by the app + * thread asleep on the destroy_comp list vs. an object destroyed + * here synchronously when the last reference is removed. + */ +static void cm_work_handler(void *arg) +{ + struct iwcm_work *work = (struct iwcm_work*)arg; + struct iwcm_id_private *cm_id_priv = work->cm_id; + unsigned long flags; + int empty; + int ret = 0; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + empty = list_empty(&cm_id_priv->work_list); + while (!empty) { + work = list_entry(cm_id_priv->work_list.next, + struct iwcm_work, list); + list_del_init(&work->list); + empty = list_empty(&cm_id_priv->work_list); + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = process_event(cm_id_priv, &work->event); + kfree(work); + if (ret) { + set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); + destroy_cm_id(&cm_id_priv->id); + } + BUG_ON(atomic_read(&cm_id_priv->refcount)==0); + if (iwcm_deref_id(cm_id_priv)) + return; + + if (atomic_read(&cm_id_priv->refcount)==0 && + test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags)) { + kfree(cm_id_priv); + return; + } + spin_lock_irqsave(&cm_id_priv->lock, flags); + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); +} + +/* + * This function is called on interrupt context. Schedule events on + * the iwcm_wq thread to allow callback functions to downcall into + * the CM and/or block. Events are queued to a per-CM_ID + * work_list. If this is the first event on the work_list, the work + * element is also queued on the iwcm_wq thread. 
+ * + * Each event holds a reference on the cm_id. Until the last posted + * event has been delivered and processed, the cm_id cannot be + * deleted. + */ +static void cm_event_handler(struct iw_cm_id *cm_id, + struct iw_cm_event *iw_event) +{ + struct iwcm_work *work; + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + + work = kmalloc(sizeof(*work), GFP_ATOMIC); + if (!work) + return; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + atomic_inc(&cm_id_priv->refcount); + + INIT_WORK(&work->work, cm_work_handler, work); + work->cm_id = cm_id_priv; + work->event = *iw_event; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (list_empty(&cm_id_priv->work_list)) { + list_add_tail(&work->list, &cm_id_priv->work_list); + queue_work(iwcm_wq, &work->work); + } else + list_add_tail(&work->list, &cm_id_priv->work_list); + spin_unlock_irqrestore(&cm_id_priv->lock, flags); +} + +static int iwcm_init_qp_init_attr(struct iwcm_id_private *cm_id_priv, + struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_IDLE: + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_CONN_RECV: + case IW_CM_STATE_ESTABLISHED: + *qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS; + qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE| + IB_ACCESS_REMOTE_READ; + ret = 0; + break; + default: + ret = -EINVAL; + break; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return ret; +} + +static int iwcm_init_qp_rts_attr(struct iwcm_id_private *cm_id_priv, + struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_IDLE: + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_CONN_RECV: + case IW_CM_STATE_ESTABLISHED: + *qp_attr_mask = 0; + ret = 0; + break; + default: + ret = -EINVAL; + break; + } + 
spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return ret; +} + +int iw_cm_init_qp_attr(struct iw_cm_id *cm_id, + struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + struct iwcm_id_private *cm_id_priv; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + switch (qp_attr->qp_state) { + case IB_QPS_INIT: + case IB_QPS_RTR: + ret = iwcm_init_qp_init_attr(cm_id_priv, + qp_attr, qp_attr_mask); + break; + case IB_QPS_RTS: + ret = iwcm_init_qp_rts_attr(cm_id_priv, + qp_attr, qp_attr_mask); + break; + default: + ret = -EINVAL; + break; + } + return ret; +} +EXPORT_SYMBOL(iw_cm_init_qp_attr); + +static int __init iw_cm_init(void) +{ + iwcm_wq = create_singlethread_workqueue("iw_cm_wq"); + if (!iwcm_wq) + return -ENOMEM; + + return 0; +} + +static void __exit iw_cm_cleanup(void) +{ + destroy_workqueue(iwcm_wq); +} + +module_init(iw_cm_init); +module_exit(iw_cm_cleanup); diff --git a/include/rdma/iw_cm.h b/include/rdma/iw_cm.h new file mode 100644 index 0000000..0752a94 --- /dev/null +++ b/include/rdma/iw_cm.h @@ -0,0 +1,254 @@ +/* + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#if !defined(IW_CM_H) +#define IW_CM_H + +#include +#include + +struct iw_cm_id; + +enum iw_cm_event_type { + IW_CM_EVENT_CONNECT_REQUEST = 1, /* connect request received */ + IW_CM_EVENT_CONNECT_REPLY, /* reply from active connect request */ + IW_CM_EVENT_ESTABLISHED, /* passive side accept successful */ + IW_CM_EVENT_DISCONNECT, /* orderly shutdown */ + IW_CM_EVENT_CLOSE /* close complete */ +}; +enum iw_cm_event_status { + IW_CM_EVENT_STATUS_OK = 0, /* request successful */ + IW_CM_EVENT_STATUS_ACCEPTED = 0, /* connect request accepted */ + IW_CM_EVENT_STATUS_REJECTED, /* connect request rejected */ + IW_CM_EVENT_STATUS_TIMEOUT, /* the operation timed out */ + IW_CM_EVENT_STATUS_RESET, /* reset from remote peer */ + IW_CM_EVENT_STATUS_EINVAL, /* asynchronous failure for bad parm */ +}; +struct iw_cm_event { + enum iw_cm_event_type event; + enum iw_cm_event_status status; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + void *private_data; + u8 private_data_len; + void* provider_data; +}; + +/** + * iw_cm_handler - Function to be called by the IW CM when delivering events + * to the client. + * + * @cm_id: The IW CM identifier associated with the event. + * @event: Pointer to the event structure. 
+ */ +typedef int (*iw_cm_handler)(struct iw_cm_id *cm_id, + struct iw_cm_event *event); + +/** + * iw_event_handler - Function called by the provider when delivering provider + * events to the IW CM. + * + * @cm_id: The IW CM identifier associated with the event. + * @event: Pointer to the event structure. + */ +typedef void (*iw_event_handler)(struct iw_cm_id *cm_id, + struct iw_cm_event *event); +struct iw_cm_id { + iw_cm_handler cm_handler; /* client callback function */ + void *context; /* client cb context */ + struct ib_device *device; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + void *provider_data; /* provider private data */ + iw_event_handler event_handler; /* cb for provider + events */ + /* Used by provider to add and remove refs on IW cm_id */ + void (*add_ref)(struct iw_cm_id *); + void (*rem_ref)(struct iw_cm_id *); +}; + +struct iw_cm_conn_param { + const void *private_data; + u16 private_data_len; + u32 ord; + u32 ird; + u32 qpn; +}; + +struct iw_cm_verbs { + void (*add_ref)(struct ib_qp *qp); + + void (*rem_ref)(struct ib_qp *qp); + + struct ib_qp * (*get_qp)(struct ib_device *device, + int qpn); + + int (*connect)(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *conn_param); + + int (*accept)(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *conn_param); + + int (*reject)(struct iw_cm_id *cm_id, + const void *pdata, u8 pdata_len); + + int (*create_listen)(struct iw_cm_id *cm_id, + int backlog); + + int (*destroy_listen)(struct iw_cm_id *cm_id); +}; + +/** + * iw_create_cm_id - Create an IW CM identifier. + * + * @device: The IB device on which to create the IW CM identifier. + * @event_handler: User callback invoked to report events associated with the + * returned IW CM identifier. + * @context: User specified context associated with the id. + */ +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, void *context); + +/** + * iw_destroy_cm_id - Destroy an IW CM identifier. 
+ *
+ * @cm_id: The previously created IW CM identifier to destroy.
+ *
+ * The client can assume that no events will be delivered for the CM ID after
+ * this function returns.
+ */
+void iw_destroy_cm_id(struct iw_cm_id *cm_id);
+
+/**
+ * iw_cm_unbind_qp - Unbind the specified IW CM identifier and QP
+ *
+ * @cm_id: The IW CM identifier to unbind from the QP.
+ * @qp: The QP
+ *
+ * This is called by the provider when destroying the QP to ensure
+ * that any references held by the IWCM are released. It may also
+ * be called by the IWCM when destroying a CM_ID so that any
+ * references held by the provider are released.
+ */
+void iw_cm_unbind_qp(struct iw_cm_id *cm_id, struct ib_qp *qp);
+
+/**
+ * iw_cm_get_qp - Return the ib_qp associated with a QPN
+ *
+ * @device: The IB device
+ * @qpn: The queue pair number
+ */
+struct ib_qp *iw_cm_get_qp(struct ib_device *device, int qpn);
+
+/**
+ * iw_cm_listen - Listen for incoming connection requests on the
+ * specified IW CM id.
+ *
+ * @cm_id: The IW CM identifier.
+ * @backlog: The maximum number of outstanding un-accepted inbound listen
+ *   requests to queue.
+ *
+ * The source address and port number are specified in the IW CM identifier
+ * structure.
+ */
+int iw_cm_listen(struct iw_cm_id *cm_id, int backlog);
+
+/**
+ * iw_cm_accept - Called to accept an incoming connect request.
+ *
+ * @cm_id: The IW CM identifier associated with the connection request.
+ * @iw_param: Pointer to a structure containing connection establishment
+ *   parameters.
+ *
+ * The specified cm_id will have been provided in the event data for a
+ * CONNECT_REQUEST event. Subsequent events related to this connection will be
+ * delivered to the specified IW CM identifier and may occur prior to the
+ * return of this function. If this function returns a non-zero value, the
+ * client can assume that no events will be delivered to the specified IW CM
+ * identifier.
+ */
+int iw_cm_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param);
+
+/**
+ * iw_cm_reject - Reject an incoming connection request.
+ *
+ * @cm_id: Connection identifier associated with the request.
+ * @private_data: Pointer to data to deliver to the remote peer as part of the
+ *   reject message.
+ * @private_data_len: The number of bytes in the private_data parameter.
+ *
+ * The client can assume that no events will be delivered to the specified IW
+ * CM identifier following the return of this function. The private_data
+ * buffer is available for reuse when this function returns.
+ */
+int iw_cm_reject(struct iw_cm_id *cm_id, const void *private_data,
+		 u8 private_data_len);
+
+/**
+ * iw_cm_connect - Called to request a connection to a remote peer.
+ *
+ * @cm_id: The IW CM identifier for the connection.
+ * @iw_param: Pointer to a structure containing connection establishment
+ *   parameters.
+ *
+ * Events may be delivered to the specified IW CM identifier prior to the
+ * return of this function. If this function returns a non-zero value, the
+ * client can assume that no events will be delivered to the specified IW CM
+ * identifier.
+ */
+int iw_cm_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param);
+
+/**
+ * iw_cm_disconnect - Close the specified connection.
+ *
+ * @cm_id: The IW CM identifier to close.
+ * @abrupt: If 0, the connection will be closed gracefully, otherwise, the
+ *   connection will be reset.
+ *
+ * The IW CM identifier is still active until the IW_CM_EVENT_CLOSE event is
+ * delivered.
+ */
+int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt);
+
+/**
+ * iw_cm_init_qp_attr - Called to initialize the attributes of the QP
+ * associated with an IW CM identifier.
+ *
+ * @cm_id: The IW CM identifier associated with the QP
+ * @qp_attr: Pointer to the QP attributes structure.
+ * @qp_attr_mask: Pointer to a bit vector specifying which QP attributes are
+ *   valid.
+ */
+int iw_cm_init_qp_attr(struct iw_cm_id *cm_id, struct ib_qp_attr *qp_attr,
+		       int *qp_attr_mask);
+
+#endif /* IW_CM_H */
diff --git a/include/rdma/iw_cm_private.h b/include/rdma/iw_cm_private.h
new file mode 100644
index 0000000..aba8cb2
--- /dev/null
+++ b/include/rdma/iw_cm_private.h
@@ -0,0 +1,62 @@
+/*
+ * Copyright (c) 2005 Network Appliance, Inc. All rights reserved.
+ * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#if !defined(IW_CM_PRIVATE_H)
+#define IW_CM_PRIVATE_H
+
+#include
+
+enum iw_cm_state {
+	IW_CM_STATE_IDLE,		/* unbound, inactive */
+	IW_CM_STATE_LISTEN,		/* listen waiting for connect */
+	IW_CM_STATE_CONN_RECV,		/* inbound waiting for user accept */
+	IW_CM_STATE_CONN_SENT,		/* outbound waiting for peer accept */
+	IW_CM_STATE_ESTABLISHED,	/* established */
+	IW_CM_STATE_CLOSING,		/* disconnect */
+	IW_CM_STATE_DESTROYING		/* object being deleted */
+};
+
+struct iwcm_id_private {
+	struct iw_cm_id	id;
+	enum iw_cm_state state;
+	unsigned long flags;
+	struct ib_qp *qp;
+	struct completion destroy_comp;
+	wait_queue_head_t connect_wait;
+	struct list_head work_list;
+	spinlock_t lock;
+	atomic_t refcount;
+};
+#define IWCM_F_CALLBACK_DESTROY	1
+#define IWCM_F_CONNECT_WAIT	2
+
+#endif /* IW_CM_PRIVATE_H */

From swise at opengridcomputing.com  Wed Jun  7 13:06:10 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 07 Jun 2006 15:06:10 -0500
Subject: [openib-general] [PATCH v2 2/2] iWARP Core Changes.
In-Reply-To: <20060607200600.9003.56328.stgit@stevo-desktop>
References: <20060607200600.9003.56328.stgit@stevo-desktop>
Message-ID: <20060607200610.9003.54068.stgit@stevo-desktop>

This patch contains modifications to the existing rdma header files,
core files, drivers, and ulp files to support iWARP.

Review updates:

- copy_addr() -> rdma_copy_addr()
- dst_dev_addr param in rdma_copy_addr to const.
- various spacing nits with recasting
- include linux/inetdevice.h to get ip_dev_find() prototype.
---

 drivers/infiniband/core/Makefile             |    4
 drivers/infiniband/core/addr.c               |   19 +
 drivers/infiniband/core/cache.c              |    8 -
 drivers/infiniband/core/cm.c                 |    3
 drivers/infiniband/core/cma.c                |  353 +++++++++++++++++++++++---
 drivers/infiniband/core/device.c             |    6
 drivers/infiniband/core/mad.c                |   11 +
 drivers/infiniband/core/sa_query.c           |    5
 drivers/infiniband/core/smi.c                |   18 +
 drivers/infiniband/core/sysfs.c              |   18 +
 drivers/infiniband/core/ucm.c                |    5
 drivers/infiniband/core/user_mad.c           |    9 -
 drivers/infiniband/hw/ipath/ipath_verbs.c    |    2
 drivers/infiniband/hw/mthca/mthca_provider.c |    2
 drivers/infiniband/ulp/ipoib/ipoib_main.c    |    8 +
 drivers/infiniband/ulp/srp/ib_srp.c          |    2
 include/rdma/ib_addr.h                       |   15 +
 include/rdma/ib_verbs.h                      |   39 +++
 18 files changed, 435 insertions(+), 92 deletions(-)

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index 68e73ec..163d991 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -1,7 +1,7 @@
 infiniband-$(CONFIG_INFINIBAND_ADDR_TRANS) := ib_addr.o rdma_cm.o

 obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \
-				ib_cm.o $(infiniband-y)
+				ib_cm.o iw_cm.o $(infiniband-y)
 obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o
 obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o
@@ -14,6 +14,8 @@ ib_sa-y := sa_query.o

 ib_cm-y := cm.o

+iw_cm-y := iwcm.o
+
 rdma_cm-y := cma.o

 ib_addr-y := addr.o
diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index d294bbc..83f84ef 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -32,6 +32,7 @@
 #include
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -60,12 +61,15 @@ static LIST_HEAD(req_list);
 static DECLARE_WORK(work, process_req, NULL);
 static struct workqueue_struct *addr_wq;

-static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
-		     unsigned char *dst_dev_addr)
+int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
+		   const unsigned char *dst_dev_addr)
 {
 	switch (dev->type) {
 	case ARPHRD_INFINIBAND:
-		dev_addr->dev_type = IB_NODE_CA;
+		dev_addr->dev_type = RDMA_NODE_IB_CA;
+		break;
+	case ARPHRD_ETHER:
+		dev_addr->dev_type = RDMA_NODE_RNIC;
 		break;
 	default:
 		return -EADDRNOTAVAIL;
@@ -77,6 +81,7 @@ static int copy_addr(struct rdma_dev_add
 	memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN);
 	return 0;
 }
+EXPORT_SYMBOL(rdma_copy_addr);

 int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr)
 {
@@ -88,7 +93,7 @@ int rdma_translate_ip(struct sockaddr *a
 	if (!dev)
 		return -EADDRNOTAVAIL;

-	ret = copy_addr(dev_addr, dev, NULL);
+	ret = rdma_copy_addr(dev_addr, dev, NULL);
 	dev_put(dev);
 	return ret;
 }
@@ -160,7 +165,7 @@ static int addr_resolve_remote(struct so
 	/* If the device does ARP internally, return 'done' */
 	if (rt->idev->dev->flags & IFF_NOARP) {
-		copy_addr(addr, rt->idev->dev, NULL);
+		rdma_copy_addr(addr, rt->idev->dev, NULL);
 		goto put;
 	}
@@ -180,7 +185,7 @@ static int addr_resolve_remote(struct so
 		src_in->sin_addr.s_addr = rt->rt_src;
 	}

-	ret = copy_addr(addr, neigh->dev, neigh->ha);
+	ret = rdma_copy_addr(addr, neigh->dev, neigh->ha);
 release:
 	neigh_release(neigh);
 put:
@@ -244,7 +249,7 @@ static int addr_resolve_local(struct soc
 	if (ZERONET(src_ip)) {
 		src_in->sin_family = dst_in->sin_family;
 		src_in->sin_addr.s_addr = dst_ip;
-		ret = copy_addr(addr, dev, dev->dev_addr);
+		ret = rdma_copy_addr(addr, dev, dev->dev_addr);
 	} else if (LOOPBACK(src_ip)) {
 		ret = rdma_translate_ip((struct sockaddr *)dst_in, addr);
 		if (!ret)
diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index e05ca2c..061858c 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -32,13 +32,12 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: cache.c 1349 2004-12-16 21:09:43Z roland $
+ * $Id: cache.c 6885 2006-05-03 18:22:02Z sean.hefty $
  */

 #include
 #include
 #include
-#include	/* INIT_WORK, schedule_work(), flush_scheduled_work() */

 #include
@@ -62,12 +61,13 @@ struct ib_update_work {
 static inline int start_port(struct ib_device *device)
 {
-	return device->node_type == IB_NODE_SWITCH ? 0 : 1;
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1;
 }

 static inline int end_port(struct ib_device *device)
 {
-	return device->node_type == IB_NODE_SWITCH ? 0 : device->phys_port_cnt;
+	return (device->node_type == RDMA_NODE_IB_SWITCH) ?
+		0 : device->phys_port_cnt;
 }

 int ib_get_cached_gid(struct ib_device *device,
diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 1c7463b..cf43ccb 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -3253,6 +3253,9 @@ static void cm_add_one(struct ib_device
 	int ret;
 	u8 i;

+	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+		return;
+
 	cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) *
 			 device->phys_port_cnt, GFP_KERNEL);
 	if (!cm_dev)
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 94555d2..414600c 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -35,6 +35,7 @@
 #include
 #include
 #include
 #include
+#include

 #include
@@ -43,6 +44,7 @@
 #include
 #include
 #include
 #include
+#include

 MODULE_AUTHOR("Sean Hefty");
 MODULE_DESCRIPTION("Generic RDMA CM Agent");
@@ -124,6 +126,7 @@ struct rdma_id_private {
 	int query_id;
 	union {
 		struct ib_cm_id	*ib;
+		struct iw_cm_id	*iw;
 	} cm_id;

 	u32 seq_num;
@@ -259,13 +262,23 @@ static void cma_detach_from_dev(struct r
 	id_priv->cma_dev = NULL;
 }

-static int cma_acquire_ib_dev(struct rdma_id_private *id_priv)
+static int cma_acquire_dev(struct rdma_id_private *id_priv)
 {
+	enum rdma_node_type dev_type = id_priv->id.route.addr.dev_addr.dev_type;
 	struct cma_device *cma_dev;
 	union ib_gid *gid;
 	int ret = -ENODEV;

-	gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr);
+	switch (rdma_node_get_transport(dev_type)) {
+	case RDMA_TRANSPORT_IB:
+		gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr);
+		break;
+	case RDMA_TRANSPORT_IWARP:
+		gid = iw_addr_get_sgid(&id_priv->id.route.addr.dev_addr);
+		break;
+	default:
+		return -ENODEV;
+	}

 	mutex_lock(&lock);
 	list_for_each_entry(cma_dev, &dev_list, list) {
@@ -280,16 +293,6 @@ static int cma_acquire_ib_dev(struct rdm
 	return ret;
 }

-static int cma_acquire_dev(struct rdma_id_private *id_priv)
-{
-	switch (id_priv->id.route.addr.dev_addr.dev_type) {
-	case IB_NODE_CA:
-		return cma_acquire_ib_dev(id_priv);
-	default:
-		return -ENODEV;
-	}
-}
-
 static void cma_deref_id(struct rdma_id_private *id_priv)
 {
 	if (atomic_dec_and_test(&id_priv->refcount))
@@ -347,6 +350,16 @@ static int cma_init_ib_qp(struct rdma_id
 					  IB_QP_PKEY_INDEX | IB_QP_PORT);
 }

+static int cma_init_iw_qp(struct rdma_id_private *id_priv, struct ib_qp *qp)
+{
+	struct ib_qp_attr qp_attr;
+
+	qp_attr.qp_state = IB_QPS_INIT;
+	qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE;
+
+	return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS);
+}
+
 int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd,
 		   struct ib_qp_init_attr *qp_init_attr)
 {
@@ -362,10 +375,13 @@ int rdma_create_qp(struct rdma_cm_id *id
 	if (IS_ERR(qp))
 		return PTR_ERR(qp);

-	switch (id->device->node_type) {
-	case IB_NODE_CA:
+	switch (rdma_node_get_transport(id->device->node_type)) {
+	case RDMA_TRANSPORT_IB:
 		ret = cma_init_ib_qp(id_priv, qp);
 		break;
+	case RDMA_TRANSPORT_IWARP:
+		ret = cma_init_iw_qp(id_priv, qp);
+		break;
 	default:
 		ret = -ENOSYS;
 		break;
@@ -451,13 +467,17 @@ int rdma_init_qp_attr(struct rdma_cm_id
 	int ret;

 	id_priv = container_of(id, struct rdma_id_private, id);
-	switch (id_priv->id.device->node_type) {
-	case IB_NODE_CA:
+	switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
+	case RDMA_TRANSPORT_IB:
 		ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, qp_attr,
 					 qp_attr_mask);
 		if (qp_attr->qp_state == IB_QPS_RTR)
 			qp_attr->rq_psn = id_priv->seq_num;
 		break;
+	case RDMA_TRANSPORT_IWARP:
+		ret = iw_cm_init_qp_attr(id_priv->cm_id.iw, qp_attr,
+					 qp_attr_mask);
+		break;
 	default:
 		ret = -ENOSYS;
 		break;
@@ -590,8 +610,8 @@ static int cma_notify_user(struct rdma_i
 static void cma_cancel_route(struct rdma_id_private *id_priv)
 {
-	switch (id_priv->id.device->node_type) {
-	case IB_NODE_CA:
+	switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
+	case RDMA_TRANSPORT_IB:
 		if (id_priv->query)
 			ib_sa_cancel_query(id_priv->query_id, id_priv->query);
 		break;
@@ -611,11 +631,15 @@ static void cma_destroy_listen(struct rd
 	cma_exch(id_priv, CMA_DESTROYING);

 	if (id_priv->cma_dev) {
-		switch (id_priv->id.device->node_type) {
-		case IB_NODE_CA:
+		switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
+		case RDMA_TRANSPORT_IB:
 			if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib))
 				ib_destroy_cm_id(id_priv->cm_id.ib);
 			break;
+		case RDMA_TRANSPORT_IWARP:
+			if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw))
+				iw_destroy_cm_id(id_priv->cm_id.iw);
+			break;
 		default:
 			break;
 		}
@@ -690,11 +714,15 @@ void rdma_destroy_id(struct rdma_cm_id *
 	cma_cancel_operation(id_priv, state);

 	if (id_priv->cma_dev) {
-		switch (id->device->node_type) {
-		case IB_NODE_CA:
+		switch (rdma_node_get_transport(id->device->node_type)) {
+		case RDMA_TRANSPORT_IB:
 			if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib))
 				ib_destroy_cm_id(id_priv->cm_id.ib);
 			break;
+		case RDMA_TRANSPORT_IWARP:
+			if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw))
+				iw_destroy_cm_id(id_priv->cm_id.iw);
+			break;
 		default:
 			break;
 		}
@@ -868,7 +896,7 @@ static struct rdma_id_private *cma_new_i
 	ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid);
 	ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid);
 	ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey));
-	rt->addr.dev_addr.dev_type = IB_NODE_CA;
+	rt->addr.dev_addr.dev_type = RDMA_NODE_IB_CA;

 	id_priv = container_of(id, struct rdma_id_private, id);
 	id_priv->state = CMA_CONNECT;
@@ -897,7 +925,7 @@ static int cma_req_handler(struct ib_cm_
 	}

 	atomic_inc(&conn_id->dev_remove);
-	ret = cma_acquire_ib_dev(conn_id);
+	ret = cma_acquire_dev(conn_id);
 	if (ret) {
 		ret = -ENODEV;
 		cma_release_remove(conn_id);
@@ -981,6 +1009,123 @@ static void cma_set_compare_data(enum rd
 	}
 }

+static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event)
+{
+	struct rdma_id_private *id_priv = iw_id->context;
+	enum rdma_cm_event_type event = 0;
+	struct sockaddr_in *sin;
+	int ret = 0;
+
+	atomic_inc(&id_priv->dev_remove);
+
+	switch (iw_event->event) {
+	case IW_CM_EVENT_CLOSE:
+		event = RDMA_CM_EVENT_DISCONNECTED;
+		break;
+	case IW_CM_EVENT_CONNECT_REPLY:
+		sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr;
+		*sin = iw_event->local_addr;
+		sin = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr;
+		*sin = iw_event->remote_addr;
+		if (iw_event->status)
+			event = RDMA_CM_EVENT_REJECTED;
+		else
+			event = RDMA_CM_EVENT_ESTABLISHED;
+		break;
+	case IW_CM_EVENT_ESTABLISHED:
+		event = RDMA_CM_EVENT_ESTABLISHED;
+		break;
+	default:
+		BUG_ON(1);
+	}
+
+	ret = cma_notify_user(id_priv, event, iw_event->status,
+			      iw_event->private_data,
+			      iw_event->private_data_len);
+	if (ret) {
+		/* Destroy the CM ID by returning a non-zero value. */
+		id_priv->cm_id.iw = NULL;
+		cma_exch(id_priv, CMA_DESTROYING);
+		cma_release_remove(id_priv);
+		rdma_destroy_id(&id_priv->id);
+		return ret;
+	}
+
+	cma_release_remove(id_priv);
+	return ret;
+}
+
+static int iw_conn_req_handler(struct iw_cm_id *cm_id,
+			       struct iw_cm_event *iw_event)
+{
+	struct rdma_cm_id *new_cm_id;
+	struct rdma_id_private *listen_id, *conn_id;
+	struct sockaddr_in *sin;
+	struct net_device *dev;
+	int ret;
+
+	listen_id = cm_id->context;
+	atomic_inc(&listen_id->dev_remove);
+	if (!cma_comp(listen_id, CMA_LISTEN)) {
+		ret = -ECONNABORTED;
+		goto out;
+	}
+
+	/* Create a new RDMA id for the new IW CM ID */
+	new_cm_id = rdma_create_id(listen_id->id.event_handler,
+				   listen_id->id.context,
+				   RDMA_PS_TCP);
+	if (!new_cm_id) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	conn_id = container_of(new_cm_id, struct rdma_id_private, id);
+	atomic_inc(&conn_id->dev_remove);
+	conn_id->state = CMA_CONNECT;
+
+	dev = ip_dev_find(iw_event->local_addr.sin_addr.s_addr);
+	if (!dev) {
+		ret = -EADDRNOTAVAIL;
+		rdma_destroy_id(new_cm_id);
+		goto out;
+	}
+	ret = rdma_copy_addr(&conn_id->id.route.addr.dev_addr, dev, NULL);
+	if (ret) {
+		rdma_destroy_id(new_cm_id);
+		goto out;
+	}
+
+	ret = cma_acquire_dev(conn_id);
+	if (ret) {
+		rdma_destroy_id(new_cm_id);
+		goto out;
+	}
+
+	conn_id->cm_id.iw = cm_id;
+	cm_id->context = conn_id;
+	cm_id->cm_handler = cma_iw_handler;
+
+	sin = (struct sockaddr_in *) &new_cm_id->route.addr.src_addr;
+	*sin = iw_event->local_addr;
+	sin = (struct sockaddr_in *) &new_cm_id->route.addr.dst_addr;
+	*sin = iw_event->remote_addr;
+
+	ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0,
+			      iw_event->private_data,
+			      iw_event->private_data_len);
+	if (ret) {
+		/* User wants to destroy the CM ID */
+		conn_id->cm_id.iw = NULL;
+		cma_exch(conn_id, CMA_DESTROYING);
+		cma_release_remove(conn_id);
+		rdma_destroy_id(&conn_id->id);
+	}
+
+out:
+	cma_release_remove(listen_id);
+	return ret;
+}
+
 static int cma_ib_listen(struct rdma_id_private *id_priv)
 {
 	struct ib_cm_compare_data compare_data;
@@ -1010,6 +1155,30 @@ static int cma_ib_listen(struct rdma_id_
 	return ret;
 }

+static int cma_iw_listen(struct rdma_id_private *id_priv, int backlog)
+{
+	int ret;
+	struct sockaddr_in *sin;
+
+	id_priv->cm_id.iw = iw_create_cm_id(id_priv->id.device,
+					    iw_conn_req_handler,
+					    id_priv);
+	if (IS_ERR(id_priv->cm_id.iw))
+		return PTR_ERR(id_priv->cm_id.iw);
+
+	sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr;
+	id_priv->cm_id.iw->local_addr = *sin;
+
+	ret = iw_cm_listen(id_priv->cm_id.iw, backlog);
+
+	if (ret) {
+		iw_destroy_cm_id(id_priv->cm_id.iw);
+		id_priv->cm_id.iw = NULL;
+	}
+
+	return ret;
+}
+
 static int cma_listen_handler(struct rdma_cm_id *id,
 			      struct rdma_cm_event *event)
 {
@@ -1085,12 +1254,17 @@ int rdma_listen(struct rdma_cm_id *id, i
 		return -EINVAL;

 	if (id->device) {
-		switch (id->device->node_type) {
-		case IB_NODE_CA:
+		switch (rdma_node_get_transport(id->device->node_type)) {
+		case RDMA_TRANSPORT_IB:
 			ret = cma_ib_listen(id_priv);
 			if (ret)
 				goto err;
 			break;
+		case RDMA_TRANSPORT_IWARP:
+			ret = cma_iw_listen(id_priv, backlog);
+			if (ret)
+				goto err;
+			break;
 		default:
 			ret = -ENOSYS;
 			goto err;
@@ -1229,6 +1403,23 @@ err:
 }
 EXPORT_SYMBOL(rdma_set_ib_paths);

+static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms)
+{
+	struct cma_work *work;
+
+	work = kzalloc(sizeof *work, GFP_KERNEL);
+	if (!work)
+		return -ENOMEM;
+
+	work->id = id_priv;
+	INIT_WORK(&work->work, cma_work_handler, work);
+	work->old_state = CMA_ROUTE_QUERY;
+	work->new_state = CMA_ROUTE_RESOLVED;
+	work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED;
+	queue_work(cma_wq, &work->work);
+	return 0;
+}
+
 int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)
 {
 	struct rdma_id_private *id_priv;
@@ -1239,10 +1430,13 @@ int rdma_resolve_route(struct rdma_cm_id
 		return -EINVAL;

 	atomic_inc(&id_priv->refcount);
-	switch (id->device->node_type) {
-	case IB_NODE_CA:
+	switch (rdma_node_get_transport(id->device->node_type)) {
+	case RDMA_TRANSPORT_IB:
 		ret = cma_resolve_ib_route(id_priv, timeout_ms);
 		break;
+	case RDMA_TRANSPORT_IWARP:
+		ret = cma_resolve_iw_route(id_priv, timeout_ms);
+		break;
 	default:
 		ret = -ENOSYS;
 		break;
@@ -1354,8 +1548,8 @@ static int cma_resolve_loopback(struct r
 		   ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr));

 	if (cma_zero_addr(&id_priv->id.route.addr.src_addr)) {
-		src_in = (struct sockaddr_in *)&id_priv->id.route.addr.src_addr;
-		dst_in = (struct sockaddr_in *)&id_priv->id.route.addr.dst_addr;
+		src_in = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr;
+		dst_in = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr;
 		src_in->sin_family = dst_in->sin_family;
 		src_in->sin_addr.s_addr = dst_in->sin_addr.s_addr;
 	}
@@ -1646,6 +1840,47 @@ out:
 	return ret;
 }

+static int cma_connect_iw(struct rdma_id_private *id_priv,
+			  struct rdma_conn_param *conn_param)
+{
+	struct iw_cm_id *cm_id;
+	struct sockaddr_in *sin;
+	int ret;
+	struct iw_cm_conn_param iw_param;
+
+	cm_id = iw_create_cm_id(id_priv->id.device, cma_iw_handler, id_priv);
+	if (IS_ERR(cm_id)) {
+		ret = PTR_ERR(cm_id);
+		goto out;
+	}
+
+	id_priv->cm_id.iw = cm_id;
+
+	sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr;
+	cm_id->local_addr = *sin;
+
+	sin = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr;
+	cm_id->remote_addr = *sin;
+
+	ret = cma_modify_qp_rtr(&id_priv->id);
+	if (ret) {
+		iw_destroy_cm_id(cm_id);
+		return ret;
+	}
+
+	iw_param.ord = conn_param->initiator_depth;
+	iw_param.ird = conn_param->responder_resources;
+	iw_param.private_data = conn_param->private_data;
+	iw_param.private_data_len = conn_param->private_data_len;
+	if (id_priv->id.qp)
+		iw_param.qpn = id_priv->qp_num;
+	else
+		iw_param.qpn = conn_param->qp_num;
+	ret = iw_cm_connect(cm_id, &iw_param);
+out:
+	return ret;
+}
+
 int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
 {
 	struct rdma_id_private *id_priv;
@@ -1661,10 +1896,13 @@ int rdma_connect(struct rdma_cm_id *id,
 		id_priv->srq = conn_param->srq;
 	}

-	switch (id->device->node_type) {
-	case IB_NODE_CA:
+	switch (rdma_node_get_transport(id->device->node_type)) {
+	case RDMA_TRANSPORT_IB:
 		ret = cma_connect_ib(id_priv, conn_param);
 		break;
+	case RDMA_TRANSPORT_IWARP:
+		ret = cma_connect_iw(id_priv, conn_param);
+		break;
 	default:
 		ret = -ENOSYS;
 		break;
@@ -1705,6 +1943,28 @@ static int cma_accept_ib(struct rdma_id_
 	return ib_send_cm_rep(id_priv->cm_id.ib, &rep);
 }

+static int cma_accept_iw(struct rdma_id_private *id_priv,
+			 struct rdma_conn_param *conn_param)
+{
+	struct iw_cm_conn_param iw_param;
+	int ret;
+
+	ret = cma_modify_qp_rtr(&id_priv->id);
+	if (ret)
+		return ret;
+
+	iw_param.ord = conn_param->initiator_depth;
+	iw_param.ird = conn_param->responder_resources;
+	iw_param.private_data = conn_param->private_data;
+	iw_param.private_data_len = conn_param->private_data_len;
+	if (id_priv->id.qp) {
+		iw_param.qpn = id_priv->qp_num;
+	} else
+		iw_param.qpn = conn_param->qp_num;
+
+	return iw_cm_accept(id_priv->cm_id.iw, &iw_param);
+}
+
 int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
 {
 	struct rdma_id_private *id_priv;
@@ -1720,13 +1980,16 @@ int rdma_accept(struct rdma_cm_id *id, s
 		id_priv->srq = conn_param->srq;
 	}

-	switch (id->device->node_type) {
-	case IB_NODE_CA:
+	switch (rdma_node_get_transport(id->device->node_type)) {
+	case RDMA_TRANSPORT_IB:
 		if (conn_param)
 			ret = cma_accept_ib(id_priv, conn_param);
 		else
 			ret = cma_rep_recv(id_priv);
 		break;
+	case RDMA_TRANSPORT_IWARP:
+		ret = cma_accept_iw(id_priv, conn_param);
+		break;
 	default:
 		ret = -ENOSYS;
 		break;
@@ -1753,12 +2016,16 @@ int rdma_reject(struct rdma_cm_id *id, c
 	if (!cma_comp(id_priv, CMA_CONNECT))
 		return -EINVAL;

-	switch (id->device->node_type) {
-	case IB_NODE_CA:
+	switch (rdma_node_get_transport(id->device->node_type)) {
+	case RDMA_TRANSPORT_IB:
 		ret = ib_send_cm_rej(id_priv->cm_id.ib,
 				     IB_CM_REJ_CONSUMER_DEFINED, NULL, 0,
 				     private_data, private_data_len);
 		break;
+	case RDMA_TRANSPORT_IWARP:
+		ret = iw_cm_reject(id_priv->cm_id.iw,
+				   private_data, private_data_len);
+		break;
 	default:
 		ret = -ENOSYS;
 		break;
@@ -1777,16 +2044,18 @@ int rdma_disconnect(struct rdma_cm_id *i
 	    !cma_comp(id_priv, CMA_DISCONNECT))
 		return -EINVAL;

-	ret = cma_modify_qp_err(id);
-	if (ret)
-		goto out;
-
-	switch (id->device->node_type) {
-	case IB_NODE_CA:
+	switch (rdma_node_get_transport(id->device->node_type)) {
+	case RDMA_TRANSPORT_IB:
+		ret = cma_modify_qp_err(id);
+		if (ret)
+			goto out;
 		/* Initiate or respond to a disconnect. */
 		if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0))
 			ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0);
 		break;
+	case RDMA_TRANSPORT_IWARP:
+		ret = iw_cm_disconnect(id_priv->cm_id.iw, 0);
+		break;
 	default:
 		break;
 	}
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index b2f3cb9..7318fba 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -30,7 +30,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: device.c 1349 2004-12-16 21:09:43Z roland $
+ * $Id: device.c 5943 2006-03-22 00:58:04Z roland $
  */

 #include
@@ -505,7 +505,7 @@ int ib_query_port(struct ib_device *devi
 		  u8 port_num,
 		  struct ib_port_attr *port_attr)
 {
-	if (device->node_type == IB_NODE_SWITCH) {
+	if (device->node_type == RDMA_NODE_IB_SWITCH) {
 		if (port_num)
 			return -EINVAL;
 	} else if (port_num < 1 || port_num > device->phys_port_cnt)
@@ -580,7 +580,7 @@ int ib_modify_port(struct ib_device *dev
 		   u8 port_num, int port_modify_mask,
 		   struct ib_port_modify *port_modify)
 {
-	if (device->node_type == IB_NODE_SWITCH) {
+	if (device->node_type == RDMA_NODE_IB_SWITCH) {
 		if (port_num)
 			return -EINVAL;
 	} else if (port_num < 1 || port_num > device->phys_port_cnt)
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index b38e02a..a928ecf 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2005 Intel Corporation. All rights reserved.
  * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved.
  *
@@ -31,7 +31,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: mad.c 5596 2006-03-03 01:00:07Z sean.hefty $
+ * $Id: mad.c 7294 2006-05-17 18:12:30Z roland $
  */

 #include
 #include
@@ -2877,7 +2877,10 @@ static void ib_mad_init_device(struct ib
 {
 	int start, end, i;

-	if (device->node_type == IB_NODE_SWITCH) {
+	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+		return;
+
+	if (device->node_type == RDMA_NODE_IB_SWITCH) {
 		start = 0;
 		end = 0;
 	} else {
@@ -2924,7 +2927,7 @@ static void ib_mad_remove_device(struct
 {
 	int i, num_ports, cur_port;

-	if (device->node_type == IB_NODE_SWITCH) {
+	if (device->node_type == RDMA_NODE_IB_SWITCH) {
 		num_ports = 1;
 		cur_port = 0;
 	} else {
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 501cc05..4230277 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -887,7 +887,10 @@ static void ib_sa_add_one(struct ib_devi
 	struct ib_sa_device *sa_dev;
 	int s, e, i;

-	if (device->node_type == IB_NODE_SWITCH)
+	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+		return;
+
+	if (device->node_type == RDMA_NODE_IB_SWITCH)
 		s = e = 0;
 	else {
 		s = 1;
diff --git a/drivers/infiniband/core/smi.c b/drivers/infiniband/core/smi.c
index 35852e7..b81b2b9 100644
--- a/drivers/infiniband/core/smi.c
+++ b/drivers/infiniband/core/smi.c
@@ -34,7 +34,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: smi.c 1389 2004-12-27 22:56:47Z roland $
+ * $Id: smi.c 5258 2006-02-01 20:32:40Z sean.hefty $
  */

 #include
@@ -64,7 +64,7 @@ int smi_handle_dr_smp_send(struct ib_smp
 	/* C14-9:2 */
 	if (hop_ptr && hop_ptr < hop_cnt) {
-		if (node_type != IB_NODE_SWITCH)
+		if (node_type != RDMA_NODE_IB_SWITCH)
 			return 0;

 		/* smp->return_path set when received */
@@ -77,7 +77,7 @@ int smi_handle_dr_smp_send(struct ib_smp
 	if (hop_ptr == hop_cnt) {
 		/* smp->return_path set when received */
 		smp->hop_ptr++;
-		return (node_type == IB_NODE_SWITCH ||
+		return (node_type == RDMA_NODE_IB_SWITCH ||
 			smp->dr_dlid == IB_LID_PERMISSIVE);
 	}
@@ -95,7 +95,7 @@ int smi_handle_dr_smp_send(struct ib_smp
 	/* C14-13:2 */
 	if (2 <= hop_ptr && hop_ptr <= hop_cnt) {
-		if (node_type != IB_NODE_SWITCH)
+		if (node_type != RDMA_NODE_IB_SWITCH)
 			return 0;

 		smp->hop_ptr--;
@@ -107,7 +107,7 @@ int smi_handle_dr_smp_send(struct ib_smp
 	if (hop_ptr == 1) {
 		smp->hop_ptr--;
 		/* C14-13:3 -- SMPs destined for SM shouldn't be here */
-		return (node_type == IB_NODE_SWITCH ||
+		return (node_type == RDMA_NODE_IB_SWITCH ||
 			smp->dr_slid == IB_LID_PERMISSIVE);
 	}
@@ -142,7 +142,7 @@ int smi_handle_dr_smp_recv(struct ib_smp
 	/* C14-9:2 -- intermediate hop */
 	if (hop_ptr && hop_ptr < hop_cnt) {
-		if (node_type != IB_NODE_SWITCH)
+		if (node_type != RDMA_NODE_IB_SWITCH)
 			return 0;

 		smp->return_path[hop_ptr] = port_num;
@@ -156,7 +156,7 @@ int smi_handle_dr_smp_recv(struct ib_smp
 		smp->return_path[hop_ptr] = port_num;
 		/* smp->hop_ptr updated when sending */

-		return (node_type == IB_NODE_SWITCH ||
+		return (node_type == RDMA_NODE_IB_SWITCH ||
 			smp->dr_dlid == IB_LID_PERMISSIVE);
 	}
@@ -175,7 +175,7 @@ int smi_handle_dr_smp_recv(struct ib_smp
 	/* C14-13:2 */
 	if (2 <= hop_ptr && hop_ptr <= hop_cnt) {
-		if (node_type != IB_NODE_SWITCH)
+		if (node_type != RDMA_NODE_IB_SWITCH)
 			return 0;

 		/* smp->hop_ptr updated when sending */
@@ -190,7 +190,7 @@ int smi_handle_dr_smp_recv(struct ib_smp
 			return 1;
 		}
 		/* smp->hop_ptr updated when sending */
-		return (node_type == IB_NODE_SWITCH);
+		return (node_type == RDMA_NODE_IB_SWITCH);
 	}

 	/* C14-13:4 -- hop_ptr = 0 -> give to SM */
diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index 21f9282..cfd2c06 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -31,7 +31,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: sysfs.c 1349 2004-12-16 21:09:43Z roland $
+ * $Id: sysfs.c 6940 2006-05-04 17:04:55Z roland $
  */

 #include "core_priv.h"
@@ -589,10 +589,16 @@ static ssize_t show_node_type(struct cla
 		return -ENODEV;

 	switch (dev->node_type) {
-	case IB_NODE_CA:     return sprintf(buf, "%d: CA\n", dev->node_type);
-	case IB_NODE_SWITCH: return sprintf(buf, "%d: switch\n", dev->node_type);
-	case IB_NODE_ROUTER: return sprintf(buf, "%d: router\n", dev->node_type);
-	default:             return sprintf(buf, "%d: \n", dev->node_type);
+	case RDMA_NODE_IB_CA:
+		return sprintf(buf, "%d: CA\n", dev->node_type);
+	case RDMA_NODE_RNIC:
+		return sprintf(buf, "%d: RNIC\n", dev->node_type);
+	case RDMA_NODE_IB_SWITCH:
+		return sprintf(buf, "%d: switch\n", dev->node_type);
+	case RDMA_NODE_IB_ROUTER:
+		return sprintf(buf, "%d: router\n", dev->node_type);
+	default:
+		return sprintf(buf, "%d: \n", dev->node_type);
 	}
 }
@@ -708,7 +714,7 @@ int ib_device_register_sysfs(struct ib_d
 	if (ret)
 		goto err_put;

-	if (device->node_type == IB_NODE_SWITCH) {
+	if (device->node_type == RDMA_NODE_IB_SWITCH) {
 		ret = add_port(device, 0);
 		if (ret)
 			goto err_put;
diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
index 67caf36..ad2e417 100644
--- a/drivers/infiniband/core/ucm.c
+++ b/drivers/infiniband/core/ucm.c
@@ -30,7 +30,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
* - * $Id: ucm.c 4311 2005-12-05 18:42:01Z sean.hefty $ + * $Id: ucm.c 7119 2006-05-11 16:40:38Z sean.hefty $ */ #include @@ -1248,7 +1248,8 @@ static void ib_ucm_add_one(struct ib_dev { struct ib_ucm_device *ucm_dev; - if (!device->alloc_ucontext) + if (!device->alloc_ucontext || + rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) return; ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL); diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index afe70a5..0cbd692 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -31,7 +31,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: user_mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ + * $Id: user_mad.c 6041 2006-03-27 21:06:00Z halr $ */ #include @@ -967,7 +967,10 @@ static void ib_umad_add_one(struct ib_de struct ib_umad_device *umad_dev; int s, e, i; - if (device->node_type == IB_NODE_SWITCH) + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) s = e = 0; else { s = 1; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 28fdbda..e4b45d7 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -984,7 +984,7 @@ static void *ipath_register_ib_device(in (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | (1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV); - dev->node_type = IB_NODE_CA; + dev->node_type = RDMA_NODE_IB_CA; dev->phys_port_cnt = 1; dev->dma_device = ipath_layer_get_device(dd); dev->class_dev.dev = dev->dma_device; diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index a2eae8a..5c31819 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1273,7 +1273,7 @@ int mthca_register_device(struct mthca_d (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); - dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.node_type = RDMA_NODE_IB_CA; dev->ib_dev.phys_port_cnt = dev->limits.num_ports; dev->ib_dev.dma_device = &dev->pdev->dev; dev->ib_dev.class_dev.dev = &dev->pdev->dev; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 1c6ea1c..262427f 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1084,13 +1084,16 @@ static void ipoib_add_one(struct ib_devi struct ipoib_dev_priv *priv; int s, e, p; + if 
(rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if (!dev_list) return; INIT_LIST_HEAD(dev_list); - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { s = 0; e = 0; } else { @@ -1114,6 +1117,9 @@ static void ipoib_remove_one(struct ib_d struct ipoib_dev_priv *priv, *tmp; struct list_head *dev_list; + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + dev_list = ib_get_client_data(device, &ipoib_client); list_for_each_entry_safe(priv, tmp, dev_list, list) { diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index f1401e1..bba2956 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1845,7 +1845,7 @@ static void srp_add_one(struct ib_device if (IS_ERR(srp_dev->fmr_pool)) srp_dev->fmr_pool = NULL; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { s = 0; e = 0; } else { diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h index fcb5ba8..d95d3eb 100644 --- a/include/rdma/ib_addr.h +++ b/include/rdma/ib_addr.h @@ -40,7 +40,7 @@ struct rdma_dev_addr { unsigned char src_dev_addr[MAX_ADDR_LEN]; unsigned char dst_dev_addr[MAX_ADDR_LEN]; unsigned char broadcast[MAX_ADDR_LEN]; - enum ib_node_type dev_type; + enum rdma_node_type dev_type; }; /** @@ -72,6 +72,9 @@ int rdma_resolve_ip(struct sockaddr *src void rdma_addr_cancel(struct rdma_dev_addr *addr); +int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, + const unsigned char *dst_dev_addr); + static inline int ip_addr_size(struct sockaddr *addr) { return addr->sa_family == AF_INET6 ? 
@@ -111,4 +114,14 @@ static inline void ib_addr_set_dgid(stru memcpy(dev_addr->dst_dev_addr + 4, gid, sizeof *gid); } +static inline union ib_gid* iw_addr_get_sgid(struct rdma_dev_addr* rda) +{ + return (union ib_gid *) rda->src_dev_addr; +} + +static inline union ib_gid* iw_addr_get_dgid(struct rdma_dev_addr* rda) +{ + return (union ib_gid *) rda->dst_dev_addr; +} + #endif /* IB_ADDR_H */ diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index aeb4fcd..eac2d8f 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -35,7 +35,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_verbs.h 1349 2004-12-16 21:09:43Z roland $ + * $Id: ib_verbs.h 6885 2006-05-03 18:22:02Z sean.hefty $ */ #if !defined(IB_VERBS_H) @@ -56,12 +56,35 @@ union ib_gid { } global; }; -enum ib_node_type { - IB_NODE_CA = 1, - IB_NODE_SWITCH, - IB_NODE_ROUTER +enum rdma_node_type { + /* IB values map to NodeInfo:NodeType. */ + RDMA_NODE_IB_CA = 1, + RDMA_NODE_IB_SWITCH, + RDMA_NODE_IB_ROUTER, + RDMA_NODE_RNIC }; +enum rdma_transport_type { + RDMA_TRANSPORT_IB, + RDMA_TRANSPORT_IWARP +}; + +static inline enum rdma_transport_type +rdma_node_get_transport(enum rdma_node_type node_type) +{ + switch (node_type) { + case RDMA_NODE_IB_CA: + case RDMA_NODE_IB_SWITCH: + case RDMA_NODE_IB_ROUTER: + return RDMA_TRANSPORT_IB; + case RDMA_NODE_RNIC: + return RDMA_TRANSPORT_IWARP; + default: + BUG(); + return 0; + } +} + enum ib_device_cap_flags { IB_DEVICE_RESIZE_MAX_WR = 1, IB_DEVICE_BAD_PKEY_CNTR = (1<<1), @@ -78,6 +101,9 @@ enum ib_device_cap_flags { IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), IB_DEVICE_SRQ_RESIZE = (1<<13), IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_ZERO_STAG = (1<<15), + IB_DEVICE_SEND_W_INV = (1<<16), + IB_DEVICE_MEM_WINDOW = (1<<17) }; enum ib_atomic_cap { @@ -830,6 +856,7 @@ struct ib_cache { u8 *lmc_cache; }; +struct iw_cm_verbs; struct ib_device { struct device *dma_device; @@ -846,6 +873,8 @@ struct ib_device { u32 
flags; + struct iw_cm_verbs *iwcm; + int (*query_device)(struct ib_device *device, struct ib_device_attr *device_attr); int (*query_port)(struct ib_device *device,

From swise at opengridcomputing.com Wed Jun 7 13:06:46 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 07 Jun 2006 15:06:46 -0500
Subject: [openib-general] [PATCH v2 0/7][RFC] Ammasso 1100 iWARP Driver
Message-ID: <20060607200646.9259.24588.stgit@stevo-desktop>

This patchset implements the iWARP provider driver for the Ammasso 1100 RNIC. It depends on the "iWARP Core Support" patch set. We're submitting it for review with the goal of inclusion in the 2.6.19 kernel.

This code has gone through several reviews on the openib-general list. Now we are submitting it for external review by the Linux community.

This StGIT patchset is cloned from Roland Dreier's infiniband.git for-2.6.18 branch. The patchset consists of 7 patches:

1 - Low-level device interface and native stack support
2 - Work request definitions
3 - Provider interface
4 - Memory management
5 - User mode message queue implementation
6 - Verbs queue implementation
7 - Kconfig and Makefile

I believe I've addressed all the round 1 review comments. Details of the changes are tracked in each patch comment.

Signed-off-by: Tom Tucker
Signed-off-by: Steve Wise

From swise at opengridcomputing.com Wed Jun 7 13:06:55 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 07 Jun 2006 15:06:55 -0500
Subject: [openib-general] [PATCH v2 4/7] AMSO1100 Memory Management.
In-Reply-To: <20060607200646.9259.24588.stgit@stevo-desktop>
References: <20060607200646.9259.24588.stgit@stevo-desktop>
Message-ID: <20060607200655.9259.90768.stgit@stevo-desktop>

Review Changes:
- sizeof -> sizeof()

--- drivers/infiniband/hw/amso1100/c2_alloc.c | 256 ++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_mm.c | 378 +++++++++++++++++++++++++++++ 2 files changed, 634 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_alloc.c b/drivers/infiniband/hw/amso1100/c2_alloc.c new file mode 100644 index 0000000..e496eb7 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_alloc.c @@ -0,0 +1,256 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include + +#include "c2.h" + +/* Trivial bitmap-based allocator */ +u32 c2_alloc(struct c2_alloc *alloc) +{ + u32 obj; + + spin_lock(&alloc->lock); + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); + if (obj >= alloc->max) + obj = find_first_zero_bit(alloc->table, alloc->max); + if (obj < alloc->max) { + set_bit(obj, alloc->table); /* mark the slot allocated */ + alloc->last = obj + 1; + if (alloc->last >= alloc->max) + alloc->last = 0; + } + spin_unlock(&alloc->lock); + + return obj; +} + +void c2_free(struct c2_alloc *alloc, u32 obj) +{ + spin_lock(&alloc->lock); + clear_bit(obj, alloc->table); + spin_unlock(&alloc->lock); +} + +int c2_alloc_init(struct c2_alloc *alloc, u32 num, u32 reserved) +{ + int i; + + alloc->last = 0; + alloc->max = num; + spin_lock_init(&alloc->lock); + alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof(long), GFP_KERNEL); + if (!alloc->table) + return -ENOMEM; + + bitmap_zero(alloc->table, num); + for (i = 0; i < reserved; ++i) + set_bit(i, alloc->table); + + return 0; +} + +void c2_alloc_cleanup(struct c2_alloc *alloc) +{ + kfree(alloc->table); +} + +/* + * Array of pointers with lazy allocation of leaf pages. Callers of + * _get, _set and _clear methods must use a lock or otherwise + * serialize access to the array. + */ + +void *c2_array_get(struct c2_array *array, int index) +{ + int p = (index * sizeof(void *)) >> PAGE_SHIFT; + + if (array->page_list[p].page) { + int i = index & (PAGE_SIZE / sizeof(void *) - 1); + return array->page_list[p].page[i]; + } else + return NULL; +} + +int c2_array_set(struct c2_array *array, int index, void *value) +{ + int p = (index * sizeof(void *)) >> PAGE_SHIFT; + + /* Allocate with GFP_ATOMIC because we'll be called with locks held.
*/ + if (!array->page_list[p].page) + array->page_list[p].page = + (void **) get_zeroed_page(GFP_ATOMIC); + + if (!array->page_list[p].page) + return -ENOMEM; + + array->page_list[p].page[index & (PAGE_SIZE / sizeof(void *) - 1)] = + value; + ++array->page_list[p].used; + + return 0; +} + +void c2_array_clear(struct c2_array *array, int index) +{ + int p = (index * sizeof(void *)) >> PAGE_SHIFT; + + if (--array->page_list[p].used == 0) { + free_page((unsigned long) array->page_list[p].page); + array->page_list[p].page = NULL; + } + + if (array->page_list[p].used < 0) + pr_debug("Array %p index %d page %d with ref count %d < 0\n", + array, index, p, array->page_list[p].used); +} + +int c2_array_init(struct c2_array *array, int nent) +{ + int npage = (nent * sizeof(void *) + PAGE_SIZE - 1) / PAGE_SIZE; + int i; + + array->page_list = + kmalloc(npage * sizeof(*array->page_list), GFP_KERNEL); + if (!array->page_list) + return -ENOMEM; + + for (i = 0; i < npage; ++i) { + array->page_list[i].page = NULL; + array->page_list[i].used = 0; + } + + return 0; +} + +void c2_array_cleanup(struct c2_array *array, int nent) +{ + int i; + + for (i = 0; i < (nent * sizeof(void *) + PAGE_SIZE - 1) / PAGE_SIZE; + ++i) + free_page((unsigned long) array->page_list[i].page); + + kfree(array->page_list); +} + +static int c2_alloc_mqsp_chunk(gfp_t gfp_mask, struct sp_chunk **head) +{ + int i; + struct sp_chunk *new_head; + + new_head = (struct sp_chunk *) __get_free_page(gfp_mask | GFP_DMA); + if (new_head == NULL) + return -ENOMEM; + + new_head->next = NULL; + new_head->head = 0; + new_head->gfp_mask = gfp_mask; + + /* build list where each index is the next free slot */ + for (i = 0; + i < (PAGE_SIZE - sizeof(struct sp_chunk) - + sizeof(u16)) / sizeof(u16) - 1; + i++) { + new_head->shared_ptr[i] = i + 1; + } + /* terminate list */ + new_head->shared_ptr[i] = 0xFFFF; + + *head = new_head; + return 0; +} + +int c2_init_mqsp_pool(gfp_t gfp_mask, struct sp_chunk **root) +{ + return 
c2_alloc_mqsp_chunk(gfp_mask, root); +} + +void c2_free_mqsp_pool(struct sp_chunk *root) +{ + struct sp_chunk *next; + + while (root) { + next = root->next; + free_page((unsigned long) root); /* chunk was allocated with __get_free_page() */ + root = next; + } +} + +u16 *c2_alloc_mqsp(struct sp_chunk *head) +{ + u16 mqsp; + + while (head) { + mqsp = head->head; + if (mqsp != 0xFFFF) { + head->head = head->shared_ptr[mqsp]; + break; + } else if (head->next == NULL) { + if (c2_alloc_mqsp_chunk(head->gfp_mask, &head->next) == 0) { + head = head->next; + mqsp = head->head; + head->head = head->shared_ptr[mqsp]; + break; + } else + return NULL; + } else + head = head->next; + } + if (head) + return &(head->shared_ptr[mqsp]); + return NULL; +} + +void c2_free_mqsp(u16 *mqsp) +{ + struct sp_chunk *head; + u16 idx; + + /* The chunk containing this ptr begins at the page boundary */ + head = (struct sp_chunk *) ((unsigned long) mqsp & PAGE_MASK); + + /* Link head to new mqsp */ + *mqsp = head->head; + + /* Compute the shared_ptr index */ + idx = ((unsigned long) mqsp & ~PAGE_MASK) >> 1; + idx -= (unsigned long) &(((struct sp_chunk *) 0)->shared_ptr[0]) >> 1; + + /* Point this index at the head */ + head->shared_ptr[idx] = head->head; + + /* Point head at this index */ + head->head = idx; +} diff --git a/drivers/infiniband/hw/amso1100/c2_mm.c b/drivers/infiniband/hw/amso1100/c2_mm.c new file mode 100644 index 0000000..13c8122 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_mm.c @@ -0,0 +1,378 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include "c2_vq.h" + +#define PBL_VIRT 1 +#define PBL_PHYS 2 + +/* + * Send all the PBL messages to convey the remainder of the PBL + * Wait for the adapter's reply on the last one. + * This is indicated by setting the MEM_PBL_COMPLETE in the flags. + * + * NOTE: vq_req is _not_ freed by this function. The VQ Host + * Reply buffer _is_ freed by this function. + */ +static int +send_pbl_messages(struct c2_dev *c2dev, u32 stag_index, + unsigned long va, u32 pbl_depth, + struct c2_vq_req *vq_req, int pbl_type) +{ + u32 pbe_count; /* amt that fits in a PBL msg */ + u32 count; /* amt in this PBL MSG. 
*/ + struct c2wr_nsmr_pbl_req *wr; /* PBL WR ptr */ + struct c2wr_nsmr_pbl_rep *reply; /* reply ptr */ + int err, pbl_virt, pbl_index, i; + + switch (pbl_type) { + case PBL_VIRT: + pbl_virt = 1; + break; + case PBL_PHYS: + pbl_virt = 0; + break; + default: + return -EINVAL; + } + + pbe_count = (c2dev->req_vq.msg_size - + sizeof(struct c2wr_nsmr_pbl_req)) / sizeof(u64); + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + return -ENOMEM; + } + c2_wr_set_id(wr, CCWR_NSMR_PBL); + + /* + * Only the last PBL message will generate a reply from the verbs, + * so we set the context to 0 indicating there is no kernel verbs + * handler blocked awaiting this reply. + */ + wr->hdr.context = 0; + wr->rnic_handle = c2dev->adapter_handle; + wr->stag_index = stag_index; /* already swapped */ + wr->flags = 0; + pbl_index = 0; + while (pbl_depth) { + count = min(pbe_count, pbl_depth); + wr->addrs_length = cpu_to_be32(count); + + /* + * If this is the last message, then reference the + * vq request struct because we will wait for a reply. + * Also mark this PBL msg as the last one. + */ + if (count == pbl_depth) { + /* + * reference the request struct. dereferenced in the + * interrupt handler. + */ + vq_req_get(c2dev, vq_req); + wr->flags = cpu_to_be32(MEM_PBL_COMPLETE); + + /* + * This is the last PBL message. + * Set the context to our VQ Request Object so we can + * wait for the reply. + */ + wr->hdr.context = (unsigned long) vq_req; + } + + /* + * If pbl_virt is set then va is a virtual address + * that describes a virtually contiguous memory + * allocation. The wr needs the start of each virtual page + * to be converted to the corresponding physical address + * of the page. If pbl_virt is not set then va is an array + * of physical addresses and there is no conversion to do. + * Just fill in the wr with what is in the array.
+ */ + for (i = 0; i < count; i++) { + if (pbl_virt) { + /* XXX */ + //wr->paddrs[i] = + // cpu_to_be64(user_virt_to_phys(va)); + va += PAGE_SIZE; + } else { + wr->paddrs[i] = + cpu_to_be64(((u64 *)va)[pbl_index + i]); + } + } + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + if (count <= pbe_count) { + vq_req_put(c2dev, vq_req); + } + goto bail0; + } + pbl_depth -= count; + pbl_index += count; + } + + /* + * Now wait for the reply... + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (struct c2wr_nsmr_pbl_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + + vq_repbuf_free(c2dev, reply); + bail0: + kfree(wr); + return err; +} + +#define C2_PBL_MAX_DEPTH 131072 +int +c2_nsmr_register_phys_kern(struct c2_dev *c2dev, u64 *addr_list, + int page_size, int pbl_depth, u32 length, + u32 offset, u64 *va, enum c2_acf acf, + struct c2_mr *mr) +{ + struct c2_vq_req *vq_req; + struct c2wr_nsmr_register_req *wr; + struct c2wr_nsmr_register_rep *reply; + u16 flags; + int i, pbe_count, count; + int err; + + if (!va || !length || !addr_list || !pbl_depth) + return -EINTR; + + /* + * Verify PBL depth is within rnic max + */ + if (pbl_depth > C2_PBL_MAX_DEPTH) { + return -EINTR; + } + + /* + * allocate verbs request object + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + /* + * build the WR + */ + c2_wr_set_id(wr, CCWR_NSMR_REGISTER); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + + flags = (acf | MEM_VA_BASED | MEM_REMOTE); + + /* + * compute how many pbes can fit in the message + */ + pbe_count = (c2dev->req_vq.msg_size - + sizeof(struct c2wr_nsmr_register_req)) / sizeof(u64); + + if (pbl_depth <= pbe_count) { + flags |= 
MEM_PBL_COMPLETE; + } + wr->flags = cpu_to_be16(flags); + wr->stag_key = 0; //stag_key; + wr->va = cpu_to_be64(*va); + wr->pd_id = mr->pd->pd_id; + wr->pbe_size = cpu_to_be32(page_size); + wr->length = cpu_to_be32(length); + wr->pbl_depth = cpu_to_be32(pbl_depth); + wr->fbo = cpu_to_be32(offset); + count = min(pbl_depth, pbe_count); + wr->addrs_length = cpu_to_be32(count); + + /* + * fill out the PBL for this message + */ + for (i = 0; i < count; i++) { + wr->paddrs[i] = cpu_to_be64(addr_list[i]); + } + + /* + * reference the request struct + */ + vq_req_get(c2dev, vq_req); + + /* + * send the WR to the adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + /* + * wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail1; + } + + /* + * process reply + */ + reply = + (struct c2wr_nsmr_register_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + if ((err = c2_errno(reply))) { + goto bail2; + } + //*p_pb_entries = be32_to_cpu(reply->pbl_depth); + mr->ibmr.lkey = mr->ibmr.rkey = be32_to_cpu(reply->stag_index); + vq_repbuf_free(c2dev, reply); + + /* + * if there are still more PBEs we need to send them to + * the adapter and wait for a reply on the final one. + * reuse vq_req for this purpose.
+ */ + pbl_depth -= count; + if (pbl_depth) { + + vq_req->reply_msg = (unsigned long) NULL; + atomic_set(&vq_req->reply_ready, 0); + err = send_pbl_messages(c2dev, + cpu_to_be32(mr->ibmr.lkey), + (unsigned long) &addr_list[i], + pbl_depth, vq_req, PBL_PHYS); + if (err) { + goto bail1; + } + } + + vq_req_free(c2dev, vq_req); + kfree(wr); + + return err; + + bail2: + vq_repbuf_free(c2dev, reply); + bail1: + kfree(wr); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +int c2_stag_dealloc(struct c2_dev *c2dev, u32 stag_index) +{ + struct c2_vq_req *vq_req; /* verbs request object */ + struct c2wr_stag_dealloc_req wr; /* work request */ + struct c2wr_stag_dealloc_rep *reply; /* WR reply */ + int err; + + + /* + * allocate verbs request object + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + return -ENOMEM; + } + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_STAG_DEALLOC); + wr.hdr.context = (u64) (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.stag_index = cpu_to_be32(stag_index); + + /* + * reference the request struct. dereferenced in the int handler. + */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (struct c2wr_stag_dealloc_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} From swise at opengridcomputing.com Wed Jun 7 13:06:49 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:06:49 -0500 Subject: [openib-general] [PATCH v2 1/7] AMSO1100 Low Level Driver. 
In-Reply-To: <20060607200646.9259.24588.stgit@stevo-desktop>
References: <20060607200646.9259.24588.stgit@stevo-desktop>
Message-ID: <20060607200648.9259.69698.stgit@stevo-desktop>

This is the core of the driver and includes the hardware probe, low-level device interfaces and native Ethernet support.

Review Changes:
- sizeof -> sizeof()
- dprintk() -> pr_debug()
- removed useless asserts
- assert() -> BUG_ON()
- C2_DEBUG -> DEBUG
- removed debug netevent code
- removed arp request squelch code from intr handler, replacing it with setting arp_ignore when the c2 netdev is brought up.
- removed c2_set_mac_addr().

--- drivers/infiniband/hw/amso1100/c2.c | 1255 ++++++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2.h | 555 +++++++++++++ drivers/infiniband/hw/amso1100/c2_ae.c | 359 +++++++++ drivers/infiniband/hw/amso1100/c2_intr.c | 209 +++++ drivers/infiniband/hw/amso1100/c2_rnic.c | 631 +++++++++++++++ 5 files changed, 3009 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2.c b/drivers/infiniband/hw/amso1100/c2.c new file mode 100644 index 0000000..4fdbd80 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2.c @@ -0,0 +1,1255 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer.
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include "c2.h" +#include "c2_provider.h" + +MODULE_AUTHOR("Tom Tucker "); +MODULE_DESCRIPTION("Ammasso AMSO1100 Low-level iWARP Driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +static const u32 default_msg = NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK + | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN; + +static int debug = -1; /* defaults above */ +module_param(debug, int, 0); +MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)"); + +static int c2_up(struct net_device *netdev); +static int c2_down(struct net_device *netdev); +static int c2_xmit_frame(struct sk_buff *skb, struct net_device *netdev); +static void c2_tx_interrupt(struct net_device *netdev); +static void c2_rx_interrupt(struct net_device *netdev); +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs *regs); +static void c2_tx_timeout(struct net_device *netdev); +static int c2_change_mtu(struct net_device *netdev, int new_mtu); +static void c2_reset(struct c2_port *c2_port); +static struct net_device_stats *c2_get_stats(struct net_device 
*netdev); + +static struct pci_device_id c2_pci_table[] = { + {0x18b8, 0xb001, PCI_ANY_ID, PCI_ANY_ID}, + {0} +}; + +MODULE_DEVICE_TABLE(pci, c2_pci_table); + +static void c2_print_macaddr(struct net_device *netdev) +{ + pr_debug("%s: MAC %02X:%02X:%02X:%02X:%02X:%02X, " + "IRQ %u\n", netdev->name, + netdev->dev_addr[0], netdev->dev_addr[1], netdev->dev_addr[2], + netdev->dev_addr[3], netdev->dev_addr[4], netdev->dev_addr[5], + netdev->irq); +} + +static void c2_set_rxbufsize(struct c2_port *c2_port) +{ + struct net_device *netdev = c2_port->netdev; + + if (netdev->mtu > RX_BUF_SIZE) + c2_port->rx_buf_size = + netdev->mtu + ETH_HLEN + sizeof(struct c2_rxp_hdr) + + NET_IP_ALIGN; + else + c2_port->rx_buf_size = sizeof(struct c2_rxp_hdr) + RX_BUF_SIZE; +} + +/* + * Allocate TX ring elements and chain them together. + * One-to-one association of adapter descriptors with ring elements. + */ +static int c2_tx_ring_alloc(struct c2_ring *tx_ring, void *vaddr, + dma_addr_t base, void __iomem * mmio_txp_ring) +{ + struct c2_tx_desc *tx_desc; + struct c2_txp_desc __iomem *txp_desc; + struct c2_element *elem; + int i; + + tx_ring->start = kmalloc(sizeof(*elem) * tx_ring->count, GFP_KERNEL); + if (!tx_ring->start) + return -ENOMEM; + + elem = tx_ring->start; + tx_desc = vaddr; + txp_desc = mmio_txp_ring; + for (i = 0; i < tx_ring->count; i++, elem++, tx_desc++, txp_desc++) { + tx_desc->len = 0; + tx_desc->status = 0; + + /* Set TXP_HTXD_UNINIT */ + __raw_writeq(cpu_to_be64(0x1122334455667788ULL), + (void __iomem *) txp_desc + C2_TXP_ADDR); + __raw_writew(0, (void __iomem *) txp_desc + C2_TXP_LEN); + __raw_writew(cpu_to_be16(TXP_HTXD_UNINIT), + (void __iomem *) txp_desc + C2_TXP_FLAGS); + + elem->skb = NULL; + elem->ht_desc = tx_desc; + elem->hw_desc = txp_desc; + + if (i == tx_ring->count - 1) { + elem->next = tx_ring->start; + tx_desc->next_offset = base; + } else { + elem->next = elem + 1; + tx_desc->next_offset = + base + (i + 1) * sizeof(*tx_desc); + } + } + + 
tx_ring->to_use = tx_ring->to_clean = tx_ring->start; + + return 0; +} + +/* + * Allocate RX ring elements and chain them together. + * One-to-one association of adapter descriptors with ring elements. + */ +static int c2_rx_ring_alloc(struct c2_ring *rx_ring, void *vaddr, + dma_addr_t base, void __iomem * mmio_rxp_ring) +{ + struct c2_rx_desc *rx_desc; + struct c2_rxp_desc __iomem *rxp_desc; + struct c2_element *elem; + int i; + + rx_ring->start = kmalloc(sizeof(*elem) * rx_ring->count, GFP_KERNEL); + if (!rx_ring->start) + return -ENOMEM; + + elem = rx_ring->start; + rx_desc = vaddr; + rxp_desc = mmio_rxp_ring; + for (i = 0; i < rx_ring->count; i++, elem++, rx_desc++, rxp_desc++) { + rx_desc->len = 0; + rx_desc->status = 0; + + /* Set RXP_HRXD_UNINIT */ + __raw_writew(cpu_to_be16(RXP_HRXD_OK), + (void __iomem *) rxp_desc + C2_RXP_STATUS); + __raw_writew(0, (void __iomem *) rxp_desc + C2_RXP_COUNT); + __raw_writew(0, (void __iomem *) rxp_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(0x99aabbccddeeffULL), + (void __iomem *) rxp_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_UNINIT), + (void __iomem *) rxp_desc + C2_RXP_FLAGS); + + elem->skb = NULL; + elem->ht_desc = rx_desc; + elem->hw_desc = rxp_desc; + + if (i == rx_ring->count - 1) { + elem->next = rx_ring->start; + rx_desc->next_offset = base; + } else { + elem->next = elem + 1; + rx_desc->next_offset = + base + (i + 1) * sizeof(*rx_desc); + } + } + + rx_ring->to_use = rx_ring->to_clean = rx_ring->start; + + return 0; +} + +/* Setup buffer for receiving */ +static inline int c2_rx_alloc(struct c2_port *c2_port, struct c2_element *elem) +{ + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_rx_desc *rx_desc = elem->ht_desc; + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen; + struct c2_rxp_hdr *rxp_hdr; + + skb = dev_alloc_skb(c2_port->rx_buf_size); + if (unlikely(!skb)) { + pr_debug("%s: out of memory for receive\n", + c2_port->netdev->name); + return -ENOMEM; + } + + /* Zero out the rxp hdr 
in the sk_buff */ + memset(skb->data, 0, sizeof(*rxp_hdr)); + + skb->dev = c2_port->netdev; + + maplen = c2_port->rx_buf_size; + mapaddr = + pci_map_single(c2dev->pcidev, skb->data, maplen, + PCI_DMA_FROMDEVICE); + + /* Set the sk_buff RXP_header to RXP_HRXD_READY */ + rxp_hdr = (struct c2_rxp_hdr *) skb->data; + rxp_hdr->flags = RXP_HRXD_READY; + + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); + __raw_writew(cpu_to_be16((u16) maplen - sizeof(*rxp_hdr)), + elem->hw_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_READY), elem->hw_desc + C2_RXP_FLAGS); + + elem->skb = skb; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + rx_desc->len = maplen; + + return 0; +} + +/* + * Allocate buffers for the Rx ring + * For receive: rx_ring.to_clean is next received frame + */ +static int c2_rx_fill(struct c2_port *c2_port) +{ + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + int ret = 0; + + elem = rx_ring->start; + do { + if (c2_rx_alloc(c2_port, elem)) { + ret = 1; + break; + } + } while ((elem = elem->next) != rx_ring->start); + + rx_ring->to_clean = rx_ring->start; + return ret; +} + +/* Free all buffers in RX ring, assumes receiver stopped */ +static void c2_rx_clean(struct c2_port *c2_port) +{ + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + struct c2_rx_desc *rx_desc; + + elem = rx_ring->start; + do { + rx_desc = elem->ht_desc; + rx_desc->len = 0; + + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); + __raw_writew(0, elem->hw_desc + C2_RXP_COUNT); + __raw_writew(0, elem->hw_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(0x99aabbccddeeffULL), + elem->hw_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_UNINIT), + elem->hw_desc + C2_RXP_FLAGS); + + if (elem->skb) { + pci_unmap_single(c2dev->pcidev, elem->mapaddr, + elem->maplen, PCI_DMA_FROMDEVICE); + dev_kfree_skb(elem->skb); + elem->skb = NULL; 
+ } + } while ((elem = elem->next) != rx_ring->start); +} + +static inline int c2_tx_free(struct c2_dev *c2dev, struct c2_element *elem) +{ + struct c2_tx_desc *tx_desc = elem->ht_desc; + + tx_desc->len = 0; + + pci_unmap_single(c2dev->pcidev, elem->mapaddr, elem->maplen, + PCI_DMA_TODEVICE); + + if (elem->skb) { + dev_kfree_skb_any(elem->skb); + elem->skb = NULL; + } + + return 0; +} + +/* Free all buffers in TX ring, assumes transmitter stopped */ +static void c2_tx_clean(struct c2_port *c2_port) +{ + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + struct c2_txp_desc txp_htxd; + int retry; + unsigned long flags; + + spin_lock_irqsave(&c2_port->tx_lock, flags); + + elem = tx_ring->start; + + do { + retry = 0; + do { + txp_htxd.flags = + readw(elem->hw_desc + C2_TXP_FLAGS); + + if (txp_htxd.flags == TXP_HTXD_READY) { + retry = 1; + __raw_writew(0, + elem->hw_desc + C2_TXP_LEN); + __raw_writeq(0, + elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(TXP_HTXD_DONE), + elem->hw_desc + C2_TXP_FLAGS); + c2_port->netstats.tx_dropped++; + break; + } else { + __raw_writew(0, + elem->hw_desc + C2_TXP_LEN); + __raw_writeq(cpu_to_be64(0x1122334455667788ULL), + elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(TXP_HTXD_UNINIT), + elem->hw_desc + C2_TXP_FLAGS); + } + + c2_tx_free(c2_port->c2dev, elem); + + } while ((elem = elem->next) != tx_ring->start); + } while (retry); + + c2_port->tx_avail = c2_port->tx_ring.count - 1; + c2_port->c2dev->cur_tx = tx_ring->to_use - tx_ring->start; + + if (c2_port->tx_avail > MAX_SKB_FRAGS + 1) + netif_wake_queue(c2_port->netdev); + + spin_unlock_irqrestore(&c2_port->tx_lock, flags); +} + +/* + * Process transmit descriptors marked 'DONE' by the firmware, + * freeing up their unneeded sk_buffs. 
+ */ +static void c2_tx_interrupt(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + struct c2_txp_desc txp_htxd; + + spin_lock(&c2_port->tx_lock); + + for (elem = tx_ring->to_clean; elem != tx_ring->to_use; + elem = elem->next) { + txp_htxd.flags = + be16_to_cpu(readw(elem->hw_desc + C2_TXP_FLAGS)); + + if (txp_htxd.flags != TXP_HTXD_DONE) + break; + + if (netif_msg_tx_done(c2_port)) { + /* PCI reads are expensive in fast path */ + txp_htxd.len = + be16_to_cpu(readw(elem->hw_desc + C2_TXP_LEN)); + pr_debug("%s: tx done slot %3Zu status 0x%x len " + "%5u bytes\n", + netdev->name, elem - tx_ring->start, + txp_htxd.flags, txp_htxd.len); + } + + c2_tx_free(c2dev, elem); + ++(c2_port->tx_avail); + } + + tx_ring->to_clean = elem; + + if (netif_queue_stopped(netdev) + && c2_port->tx_avail > MAX_SKB_FRAGS + 1) + netif_wake_queue(netdev); + + spin_unlock(&c2_port->tx_lock); +} + +static void c2_rx_error(struct c2_port *c2_port, struct c2_element *elem) +{ + struct c2_rx_desc *rx_desc = elem->ht_desc; + struct c2_rxp_hdr *rxp_hdr = (struct c2_rxp_hdr *) elem->skb->data; + + if (rxp_hdr->status != RXP_HRXD_OK || + rxp_hdr->len > (rx_desc->len - sizeof(*rxp_hdr))) { + pr_debug("BAD RXP_HRXD\n"); + pr_debug(" rx_desc : %p\n", rx_desc); + pr_debug(" index : %Zu\n", + elem - c2_port->rx_ring.start); + pr_debug(" len : %u\n", rx_desc->len); + pr_debug(" rxp_hdr : %p [PA %p]\n", rxp_hdr, + (void *) __pa((unsigned long) rxp_hdr)); + pr_debug(" flags : 0x%x\n", rxp_hdr->flags); + pr_debug(" status: 0x%x\n", rxp_hdr->status); + pr_debug(" len : %u\n", rxp_hdr->len); + pr_debug(" rsvd : 0x%x\n", rxp_hdr->rsvd); + } + + /* Setup the skb for reuse since we're dropping this pkt */ + elem->skb->tail = elem->skb->data = elem->skb->head; + + /* Zero out the rxp hdr in the sk_buff */ + memset(elem->skb->data, 0, sizeof(*rxp_hdr)); + + /* Write the 
descriptor to the adapter's rx ring */ + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); + __raw_writew(0, elem->hw_desc + C2_RXP_COUNT); + __raw_writew(cpu_to_be16((u16) elem->maplen - sizeof(*rxp_hdr)), + elem->hw_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(elem->mapaddr), elem->hw_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_READY), elem->hw_desc + C2_RXP_FLAGS); + + pr_debug("packet dropped\n"); + c2_port->netstats.rx_dropped++; +} + +static void c2_rx_interrupt(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + struct c2_rx_desc *rx_desc; + struct c2_rxp_hdr *rxp_hdr; + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen, buflen; + unsigned long flags; + + spin_lock_irqsave(&c2dev->lock, flags); + + /* Begin where we left off */ + rx_ring->to_clean = rx_ring->start + c2dev->cur_rx; + + for (elem = rx_ring->to_clean; elem->next != rx_ring->to_clean; + elem = elem->next) { + rx_desc = elem->ht_desc; + mapaddr = elem->mapaddr; + maplen = elem->maplen; + skb = elem->skb; + rxp_hdr = (struct c2_rxp_hdr *) skb->data; + + if (rxp_hdr->flags != RXP_HRXD_DONE) + break; + buflen = rxp_hdr->len; + + /* Sanity check the RXP header */ + if (rxp_hdr->status != RXP_HRXD_OK || + buflen > (rx_desc->len - sizeof(*rxp_hdr))) { + c2_rx_error(c2_port, elem); + continue; + } + + /* + * Allocate and map a new skb for replenishing the host + * RX desc + */ + if (c2_rx_alloc(c2_port, elem)) { + c2_rx_error(c2_port, elem); + continue; + } + + /* Unmap the old skb */ + pci_unmap_single(c2dev->pcidev, mapaddr, maplen, + PCI_DMA_FROMDEVICE); + + prefetch(skb->data); + + /* + * Skip past the leading 8 bytes comprising the + * "struct c2_rxp_hdr", prepended by the adapter + * to the usual Ethernet header ("struct ethhdr"), + * to the start of the raw Ethernet packet. 
+ * + * Fix up the various fields in the sk_buff before + * passing it up to netif_rx(). The transfer size + * (in bytes) specified by the adapter len field of + * the "struct rxp_hdr_t" does NOT include the + * "sizeof(struct c2_rxp_hdr)". + */ + skb->data += sizeof(*rxp_hdr); + skb->tail = skb->data + buflen; + skb->len = buflen; + skb->dev = netdev; + skb->protocol = eth_type_trans(skb, netdev); + + netif_rx(skb); + + netdev->last_rx = jiffies; + c2_port->netstats.rx_packets++; + c2_port->netstats.rx_bytes += buflen; + } + + /* Save where we left off */ + rx_ring->to_clean = elem; + c2dev->cur_rx = elem - rx_ring->start; + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); + + spin_unlock_irqrestore(&c2dev->lock, flags); +} + +/* + * Handle netisr0 TX & RX interrupts. + */ +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs *regs) +{ + unsigned int netisr0, dmaisr; + int handled = 0; + struct c2_dev *c2dev = (struct c2_dev *) dev_id; + + /* Process CCILNET interrupts */ + netisr0 = readl(c2dev->regs + C2_NISR0); + if (netisr0) { + + /* + * There is an issue with the firmware that always + * provides the status of RX for both TX & RX + * interrupts. So process both queues here. 
+ */ + c2_rx_interrupt(c2dev->netdev); + c2_tx_interrupt(c2dev->netdev); + + /* Clear the interrupt */ + writel(netisr0, c2dev->regs + C2_NISR0); + handled++; + } + + /* Process RNIC interrupts */ + dmaisr = readl(c2dev->regs + C2_DISR); + if (dmaisr) { + writel(dmaisr, c2dev->regs + C2_DISR); + c2_rnic_interrupt(c2dev); + handled++; + } + + if (handled) { + return IRQ_HANDLED; + } else { + return IRQ_NONE; + } +} + +static int c2_up(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_element *elem; + struct c2_rxp_hdr *rxp_hdr; + struct in_device *in_dev; + size_t rx_size, tx_size; + int ret, i; + unsigned int netimr0; + + if (netif_msg_ifup(c2_port)) + pr_debug("%s: enabling interface\n", netdev->name); + + /* Set the Rx buffer size based on MTU */ + c2_set_rxbufsize(c2_port); + + /* Allocate DMA'able memory for Tx/Rx host descriptor rings */ + rx_size = c2_port->rx_ring.count * sizeof(struct c2_rx_desc); + tx_size = c2_port->tx_ring.count * sizeof(struct c2_tx_desc); + + c2_port->mem_size = tx_size + rx_size; + c2_port->mem = pci_alloc_consistent(c2dev->pcidev, c2_port->mem_size, + &c2_port->dma); + if (c2_port->mem == NULL) { + pr_debug("Unable to allocate memory for " + "host descriptor rings\n"); + return -ENOMEM; + } + + memset(c2_port->mem, 0, c2_port->mem_size); + + /* Create the Rx host descriptor ring */ + if ((ret = + c2_rx_ring_alloc(&c2_port->rx_ring, c2_port->mem, c2_port->dma, + c2dev->mmio_rxp_ring))) { + pr_debug("Unable to create RX ring\n"); + goto bail0; + } + + /* Allocate Rx buffers for the host descriptor ring */ + if (c2_rx_fill(c2_port)) { + pr_debug("Unable to fill RX ring\n"); + goto bail1; + } + + /* Create the Tx host descriptor ring */ + if ((ret = c2_tx_ring_alloc(&c2_port->tx_ring, c2_port->mem + rx_size, + c2_port->dma + rx_size, + c2dev->mmio_txp_ring))) { + pr_debug("Unable to create TX ring\n"); + goto bail1; + } + + /* Set the TX pointer to where we 
left off */ + c2_port->tx_avail = c2_port->tx_ring.count - 1; + c2_port->tx_ring.to_use = c2_port->tx_ring.to_clean = + c2_port->tx_ring.start + c2dev->cur_tx; + + /* missing: Initialize MAC */ + + BUG_ON(c2_port->tx_ring.to_use != c2_port->tx_ring.to_clean); + + /* Reset the adapter, ensures the driver is in sync with the RXP */ + c2_reset(c2_port); + + /* Reset the READY bit in the sk_buff RXP headers & adapter HRXDQ */ + for (i = 0, elem = c2_port->rx_ring.start; i < c2_port->rx_ring.count; + i++, elem++) { + rxp_hdr = (struct c2_rxp_hdr *) elem->skb->data; + rxp_hdr->flags = 0; + __raw_writew(cpu_to_be16(RXP_HRXD_READY), + elem->hw_desc + C2_RXP_FLAGS); + } + + /* Enable network packets */ + netif_start_queue(netdev); + + /* Enable IRQ */ + writel(0, c2dev->regs + C2_IDIS); + netimr0 = readl(c2dev->regs + C2_NIMR0); + netimr0 &= ~(C2_PCI_HTX_INT | C2_PCI_HRX_INT); + writel(netimr0, c2dev->regs + C2_NIMR0); + + /* Tell the stack to ignore arp requests for ipaddrs bound to + * other interfaces. This is needed to prevent the host stack + * from responding to arp requests to the ipaddr bound on the + * rdma interface. 
+ */ + in_dev = in_dev_get(netdev); + in_dev->cnf.arp_ignore = 1; + in_dev_put(in_dev); + + return 0; + + bail1: + c2_rx_clean(c2_port); + kfree(c2_port->rx_ring.start); + + bail0: + pci_free_consistent(c2dev->pcidev, c2_port->mem_size, c2_port->mem, + c2_port->dma); + + return ret; +} + +static int c2_down(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + + if (netif_msg_ifdown(c2_port)) + pr_debug("%s: disabling interface\n", + netdev->name); + + /* Wait for all the queued packets to get sent */ + c2_tx_interrupt(netdev); + + /* Disable network packets */ + netif_stop_queue(netdev); + + /* Disable IRQs by clearing the interrupt mask */ + writel(1, c2dev->regs + C2_IDIS); + writel(0, c2dev->regs + C2_NIMR0); + + /* missing: Stop transmitter */ + + /* missing: Stop receiver */ + + /* Reset the adapter, ensures the driver is in sync with the RXP */ + c2_reset(c2_port); + + /* missing: Turn off LEDs here */ + + /* Free all buffers in the host descriptor rings */ + c2_tx_clean(c2_port); + c2_rx_clean(c2_port); + + /* Free the host descriptor rings */ + kfree(c2_port->rx_ring.start); + kfree(c2_port->tx_ring.start); + pci_free_consistent(c2dev->pcidev, c2_port->mem_size, c2_port->mem, + c2_port->dma); + + return 0; +} + +static void c2_reset(struct c2_port *c2_port) +{ + struct c2_dev *c2dev = c2_port->c2dev; + unsigned int cur_rx = c2dev->cur_rx; + + /* Tell the hardware to quiesce */ + C2_SET_CUR_RX(c2dev, cur_rx | C2_PCI_HRX_QUI); + + /* + * The hardware will reset the C2_PCI_HRX_QUI bit once + * the RXP is quiesced. Wait 2 seconds for this. 
+ */ + ssleep(2); + + cur_rx = C2_GET_CUR_RX(c2dev); + + if (cur_rx & C2_PCI_HRX_QUI) + pr_debug("c2_reset: failed to quiesce the hardware!\n"); + + cur_rx &= ~C2_PCI_HRX_QUI; + + c2dev->cur_rx = cur_rx; + + pr_debug("Current RX: %u\n", c2dev->cur_rx); +} + +static int c2_xmit_frame(struct sk_buff *skb, struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + dma_addr_t mapaddr; + u32 maplen; + unsigned long flags; + unsigned int i; + + spin_lock_irqsave(&c2_port->tx_lock, flags); + + if (unlikely(c2_port->tx_avail < (skb_shinfo(skb)->nr_frags + 1))) { + netif_stop_queue(netdev); + spin_unlock_irqrestore(&c2_port->tx_lock, flags); + + pr_debug("%s: Tx ring full when queue awake!\n", + netdev->name); + return NETDEV_TX_BUSY; + } + + maplen = skb_headlen(skb); + mapaddr = + pci_map_single(c2dev->pcidev, skb->data, maplen, PCI_DMA_TODEVICE); + + elem = tx_ring->to_use; + elem->skb = skb; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + + /* Tell HW to xmit */ + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(maplen), elem->hw_desc + C2_TXP_LEN); + __raw_writew(cpu_to_be16(TXP_HTXD_READY), elem->hw_desc + C2_TXP_FLAGS); + + c2_port->netstats.tx_packets++; + c2_port->netstats.tx_bytes += maplen; + + /* Loop thru additional data fragments and queue them */ + if (skb_shinfo(skb)->nr_frags) { + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + maplen = frag->size; + mapaddr = + pci_map_page(c2dev->pcidev, frag->page, + frag->page_offset, maplen, + PCI_DMA_TODEVICE); + + elem = elem->next; + elem->skb = NULL; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + + /* Tell HW to xmit */ + __raw_writeq(cpu_to_be64(mapaddr), + elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(maplen), + elem->hw_desc + C2_TXP_LEN); + 
__raw_writew(cpu_to_be16(TXP_HTXD_READY), + elem->hw_desc + C2_TXP_FLAGS); + + c2_port->netstats.tx_packets++; + c2_port->netstats.tx_bytes += maplen; + } + } + + tx_ring->to_use = elem->next; + c2_port->tx_avail -= (skb_shinfo(skb)->nr_frags + 1); + + if (c2_port->tx_avail <= MAX_SKB_FRAGS + 1) { + netif_stop_queue(netdev); + if (netif_msg_tx_queued(c2_port)) + pr_debug("%s: transmit queue full\n", + netdev->name); + } + + spin_unlock_irqrestore(&c2_port->tx_lock, flags); + + netdev->trans_start = jiffies; + + return NETDEV_TX_OK; +} + +static struct net_device_stats *c2_get_stats(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + + return &c2_port->netstats; +} + +static void c2_tx_timeout(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + + if (netif_msg_timer(c2_port)) + pr_debug("%s: tx timeout\n", netdev->name); + + c2_tx_clean(c2_port); +} + +static int c2_change_mtu(struct net_device *netdev, int new_mtu) +{ + int ret = 0; + + if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU) + return -EINVAL; + + netdev->mtu = new_mtu; + + if (netif_running(netdev)) { + c2_down(netdev); + + c2_up(netdev); + } + + return ret; +} + +/* Initialize network device */ +static struct net_device *c2_devinit(struct c2_dev *c2dev, + void __iomem * mmio_addr) +{ + struct c2_port *c2_port = NULL; + struct net_device *netdev = alloc_etherdev(sizeof(*c2_port)); + + if (!netdev) { + pr_debug("c2_port etherdev alloc failed"); + return NULL; + } + + SET_MODULE_OWNER(netdev); + SET_NETDEV_DEV(netdev, &c2dev->pcidev->dev); + + netdev->open = c2_up; + netdev->stop = c2_down; + netdev->hard_start_xmit = c2_xmit_frame; + netdev->get_stats = c2_get_stats; + netdev->tx_timeout = c2_tx_timeout; + netdev->change_mtu = c2_change_mtu; + netdev->watchdog_timeo = C2_TX_TIMEOUT; + netdev->irq = c2dev->pcidev->irq; + + c2_port = netdev_priv(netdev); + c2_port->netdev = netdev; + c2_port->c2dev = c2dev; + c2_port->msg_enable = 
netif_msg_init(debug, default_msg); + c2_port->tx_ring.count = C2_NUM_TX_DESC; + c2_port->rx_ring.count = C2_NUM_RX_DESC; + + spin_lock_init(&c2_port->tx_lock); + + /* Copy our 48-bit ethernet hardware address */ + memcpy_fromio(netdev->dev_addr, mmio_addr + C2_REGS_ENADDR, 6); + + /* Validate the MAC address */ + if (!is_valid_ether_addr(netdev->dev_addr)) { + pr_debug("Invalid MAC Address\n"); + c2_print_macaddr(netdev); + free_netdev(netdev); + return NULL; + } + + c2dev->netdev = netdev; + + return netdev; +} + +static int __devinit c2_probe(struct pci_dev *pcidev, + const struct pci_device_id *ent) +{ + int ret = 0, i; + unsigned long reg0_start, reg0_flags, reg0_len; + unsigned long reg2_start, reg2_flags, reg2_len; + unsigned long reg4_start, reg4_flags, reg4_len; + unsigned kva_map_size; + struct net_device *netdev = NULL; + struct c2_dev *c2dev = NULL; + void __iomem *mmio_regs = NULL; + + printk(KERN_INFO PFX "AMSO1100 Gigabit Ethernet driver v%s loaded\n", + DRV_VERSION); + + /* Enable PCI device */ + ret = pci_enable_device(pcidev); + if (ret) { + printk(KERN_ERR PFX "%s: Unable to enable PCI device\n", + pci_name(pcidev)); + goto bail0; + } + + reg0_start = pci_resource_start(pcidev, BAR_0); + reg0_len = pci_resource_len(pcidev, BAR_0); + reg0_flags = pci_resource_flags(pcidev, BAR_0); + + reg2_start = pci_resource_start(pcidev, BAR_2); + reg2_len = pci_resource_len(pcidev, BAR_2); + reg2_flags = pci_resource_flags(pcidev, BAR_2); + + reg4_start = pci_resource_start(pcidev, BAR_4); + reg4_len = pci_resource_len(pcidev, BAR_4); + reg4_flags = pci_resource_flags(pcidev, BAR_4); + + pr_debug("BAR0 size = 0x%lX bytes\n", reg0_len); + pr_debug("BAR2 size = 0x%lX bytes\n", reg2_len); + pr_debug("BAR4 size = 0x%lX bytes\n", reg4_len); + + /* Make sure PCI base addr are MMIO */ + if (!(reg0_flags & IORESOURCE_MEM) || + !(reg2_flags & IORESOURCE_MEM) || !(reg4_flags & IORESOURCE_MEM)) { + printk(KERN_ERR PFX "PCI regions not an MMIO resource\n"); + ret = 
-ENODEV; + goto bail1; + } + + /* Check for weird/broken PCI region reporting */ + if ((reg0_len < C2_REG0_SIZE) || + (reg2_len < C2_REG2_SIZE) || (reg4_len < C2_REG4_SIZE)) { + printk(KERN_ERR PFX "Invalid PCI region sizes\n"); + ret = -ENODEV; + goto bail1; + } + + /* Reserve PCI I/O and memory resources */ + ret = pci_request_regions(pcidev, DRV_NAME); + if (ret) { + printk(KERN_ERR PFX "%s: Unable to request regions\n", + pci_name(pcidev)); + goto bail1; + } + + if ((sizeof(dma_addr_t) > 4)) { + ret = pci_set_dma_mask(pcidev, DMA_64BIT_MASK); + if (ret < 0) { + printk(KERN_ERR PFX "64b DMA configuration failed\n"); + goto bail2; + } + } else { + ret = pci_set_dma_mask(pcidev, DMA_32BIT_MASK); + if (ret < 0) { + printk(KERN_ERR PFX "32b DMA configuration failed\n"); + goto bail2; + } + } + + /* Enables bus-mastering on the device */ + pci_set_master(pcidev); + + /* Remap the adapter PCI registers in BAR4 */ + mmio_regs = ioremap_nocache(reg4_start + C2_PCI_REGS_OFFSET, + sizeof(struct c2_adapter_pci_regs)); + if (mmio_regs == 0UL) { + printk(KERN_ERR PFX + "Unable to remap adapter PCI registers in BAR4\n"); + ret = -EIO; + goto bail2; + } + + /* Validate PCI regs magic */ + for (i = 0; i < sizeof(c2_magic); i++) { + if (c2_magic[i] != readb(mmio_regs + C2_REGS_MAGIC + i)) { + printk(KERN_ERR PFX "Downlevel Firmware boot loader " + "[%d/%Zd: got 0x%x, exp 0x%x]. 
Use the cc_flash " + "utility to update your boot loader\n", + i + 1, sizeof(c2_magic), + readb(mmio_regs + C2_REGS_MAGIC + i), + c2_magic[i]); + printk(KERN_ERR PFX "Adapter not claimed\n"); + iounmap(mmio_regs); + ret = -EIO; + goto bail2; + } + } + + /* Validate the adapter version */ + if (be32_to_cpu(readl(mmio_regs + C2_REGS_VERS)) != C2_VERSION) { + printk(KERN_ERR PFX "Version mismatch " + "[fw=%u, c2=%u], Adapter not claimed\n", + be32_to_cpu(readl(mmio_regs + C2_REGS_VERS)), + C2_VERSION); + ret = -EINVAL; + iounmap(mmio_regs); + goto bail2; + } + + /* Validate the adapter IVN */ + if (be32_to_cpu(readl(mmio_regs + C2_REGS_IVN)) != C2_IVN) { + printk(KERN_ERR PFX "Downlevel Firmware level. You should be using " + "the OpenIB device support kit. " + "[fw=0x%x, c2=0x%x], Adapter not claimed\n", + be32_to_cpu(readl(mmio_regs + C2_REGS_IVN)), + C2_IVN); + ret = -EINVAL; + iounmap(mmio_regs); + goto bail2; + } + + /* Allocate hardware structure */ + c2dev = (struct c2_dev *) ib_alloc_device(sizeof(*c2dev)); + if (!c2dev) { + printk(KERN_ERR PFX "%s: Unable to alloc hardware struct\n", + pci_name(pcidev)); + ret = -ENOMEM; + iounmap(mmio_regs); + goto bail2; + } + + memset(c2dev, 0, sizeof(*c2dev)); + spin_lock_init(&c2dev->lock); + c2dev->pcidev = pcidev; + c2dev->cur_tx = 0; + + /* Get the last RX index */ + c2dev->cur_rx = + (be32_to_cpu(readl(mmio_regs + C2_REGS_HRX_CUR)) - + 0xffffc000) / sizeof(struct c2_rxp_desc); + + /* Request an interrupt line for the driver */ + ret = request_irq(pcidev->irq, c2_interrupt, SA_SHIRQ, DRV_NAME, c2dev); + if (ret) { + printk(KERN_ERR PFX "%s: requested IRQ %u is busy\n", + pci_name(pcidev), pcidev->irq); + iounmap(mmio_regs); + goto bail3; + } + + /* Set driver specific data */ + pci_set_drvdata(pcidev, c2dev); + + /* Initialize network device */ + if ((netdev = c2_devinit(c2dev, mmio_regs)) == NULL) { + ret = -ENOMEM; + iounmap(mmio_regs); + goto bail4; + } + + /* Save off the actual size prior to unmapping mmio_regs */ + kva_map_size 
= be32_to_cpu(readl(mmio_regs + C2_REGS_PCI_WINSIZE)); + + /* Unmap the adapter PCI registers in BAR4 */ + iounmap(mmio_regs); + + /* Register network device */ + ret = register_netdev(netdev); + if (ret) { + printk(KERN_ERR PFX "Unable to register netdev, ret = %d\n", + ret); + goto bail5; + } + + /* Disable network packets */ + netif_stop_queue(netdev); + + /* Remap the adapter HRXDQ PA space to kernel VA space */ + c2dev->mmio_rxp_ring = ioremap_nocache(reg4_start + C2_RXP_HRXDQ_OFFSET, + C2_RXP_HRXDQ_SIZE); + if (c2dev->mmio_rxp_ring == 0UL) { + printk(KERN_ERR PFX "Unable to remap MMIO HRXDQ region\n"); + ret = -EIO; + goto bail6; + } + + /* Remap the adapter HTXDQ PA space to kernel VA space */ + c2dev->mmio_txp_ring = ioremap_nocache(reg4_start + C2_TXP_HTXDQ_OFFSET, + C2_TXP_HTXDQ_SIZE); + if (c2dev->mmio_txp_ring == 0UL) { + printk(KERN_ERR PFX "Unable to remap MMIO HTXDQ region\n"); + ret = -EIO; + goto bail7; + } + + /* Save off the current RX index in the last 4 bytes of the TXP Ring */ + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); + + /* Remap the PCI registers in adapter BAR0 to kernel VA space */ + c2dev->regs = ioremap_nocache(reg0_start, reg0_len); + if (c2dev->regs == 0UL) { + printk(KERN_ERR PFX "Unable to remap BAR0\n"); + ret = -EIO; + goto bail8; + } + + /* Remap the PCI registers in adapter BAR4 to kernel VA space */ + c2dev->pa = reg4_start + C2_PCI_REGS_OFFSET; + c2dev->kva = ioremap_nocache(reg4_start + C2_PCI_REGS_OFFSET, + kva_map_size); + if (c2dev->kva == 0UL) { + printk(KERN_ERR PFX "Unable to remap BAR4\n"); + ret = -EIO; + goto bail9; + } + + /* Print out the MAC address */ + c2_print_macaddr(netdev); + + ret = c2_rnic_init(c2dev); + if (ret) { + printk(KERN_ERR PFX "c2_rnic_init failed: %d\n", ret); + goto bail10; + } + + c2_register_device(c2dev); + + return 0; + + bail10: + iounmap(c2dev->kva); + + bail9: + iounmap(c2dev->regs); + + bail8: + iounmap(c2dev->mmio_txp_ring); + + bail7: + iounmap(c2dev->mmio_rxp_ring); + + bail6: + 
unregister_netdev(netdev); + + bail5: + free_netdev(netdev); + + bail4: + free_irq(pcidev->irq, c2dev); + + bail3: + ib_dealloc_device(&c2dev->ibdev); + + bail2: + pci_release_regions(pcidev); + + bail1: + pci_disable_device(pcidev); + + bail0: + return ret; +} + +static void __devexit c2_remove(struct pci_dev *pcidev) +{ + struct c2_dev *c2dev = pci_get_drvdata(pcidev); + struct net_device *netdev = c2dev->netdev; + + /* Unregister with OpenIB */ + c2_unregister_device(c2dev); + + /* Clean up the RNIC resources */ + c2_rnic_term(c2dev); + + /* Remove network device from the kernel */ + unregister_netdev(netdev); + + /* Free network device */ + free_netdev(netdev); + + /* Free the interrupt line */ + free_irq(pcidev->irq, c2dev); + + /* missing: Turn LEDs off here */ + + /* Unmap adapter PA space */ + iounmap(c2dev->kva); + iounmap(c2dev->regs); + iounmap(c2dev->mmio_txp_ring); + iounmap(c2dev->mmio_rxp_ring); + + /* Free the hardware structure */ + ib_dealloc_device(&c2dev->ibdev); + + /* Release reserved PCI I/O and memory resources */ + pci_release_regions(pcidev); + + /* Disable PCI device */ + pci_disable_device(pcidev); + + /* Clear driver specific data */ + pci_set_drvdata(pcidev, NULL); +} + +static struct pci_driver c2_pci_driver = { + .name = DRV_NAME, + .id_table = c2_pci_table, + .probe = c2_probe, + .remove = __devexit_p(c2_remove), +}; + +static int __init c2_init_module(void) +{ + return pci_module_init(&c2_pci_driver); +} + +static void __exit c2_exit_module(void) +{ + pci_unregister_driver(&c2_pci_driver); +} + +module_init(c2_init_module); +module_exit(c2_exit_module); diff --git a/drivers/infiniband/hw/amso1100/c2.h b/drivers/infiniband/hw/amso1100/c2.h new file mode 100644 index 0000000..3251e8f --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2.h @@ -0,0 +1,555 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef __C2_H +#define __C2_H + +#include +#include +#include +#include +#include +#include + +#include "c2_provider.h" +#include "c2_mq.h" +#include "c2_status.h" + +#define DRV_NAME "c2" +#define DRV_VERSION "1.1" +#define PFX DRV_NAME ": " + +#define BAR_0 0 +#define BAR_2 2 +#define BAR_4 4 + +#define RX_BUF_SIZE (1536 + 8) +#define ETH_JUMBO_MTU 9000 +#define C2_MAGIC "CEPHEUS" +#define C2_VERSION 4 +#define C2_IVN (18 & 0x7fffffff) + +#define C2_REG0_SIZE (16 * 1024) +#define C2_REG2_SIZE (2 * 1024 * 1024) +#define C2_REG4_SIZE (256 * 1024 * 1024) +#define C2_NUM_TX_DESC 341 +#define C2_NUM_RX_DESC 256 +#define C2_PCI_REGS_OFFSET (0x10000) +#define C2_RXP_HRXDQ_OFFSET (((C2_REG4_SIZE)/2)) +#define C2_RXP_HRXDQ_SIZE (4096) +#define C2_TXP_HTXDQ_OFFSET (((C2_REG4_SIZE)/2) + C2_RXP_HRXDQ_SIZE) +#define C2_TXP_HTXDQ_SIZE (4096) +#define C2_TX_TIMEOUT (6*HZ) + +/* CEPHEUS */ +static const u8 c2_magic[] = { + 0x43, 0x45, 0x50, 0x48, 0x45, 0x55, 0x53 +}; + +enum adapter_pci_regs { + C2_REGS_MAGIC = 0x0000, + C2_REGS_VERS = 0x0008, + C2_REGS_IVN = 0x000C, + C2_REGS_PCI_WINSIZE = 0x0010, + C2_REGS_Q0_QSIZE = 0x0014, + C2_REGS_Q0_MSGSIZE = 0x0018, + C2_REGS_Q0_POOLSTART = 0x001C, + C2_REGS_Q0_SHARED = 0x0020, + C2_REGS_Q1_QSIZE = 0x0024, + C2_REGS_Q1_MSGSIZE = 0x0028, + C2_REGS_Q1_SHARED = 0x0030, + C2_REGS_Q2_QSIZE = 0x0034, + C2_REGS_Q2_MSGSIZE = 0x0038, + C2_REGS_Q2_SHARED = 0x0040, + C2_REGS_ENADDR = 0x004C, + C2_REGS_RDMA_ENADDR = 0x0054, + C2_REGS_HRX_CUR = 0x006C, +}; + +struct c2_adapter_pci_regs { + char reg_magic[8]; + u32 version; + u32 ivn; + u32 pci_window_size; + u32 q0_q_size; + u32 q0_msg_size; + u32 q0_pool_start; + u32 q0_shared; + u32 q1_q_size; + u32 q1_msg_size; + u32 q1_pool_start; + u32 q1_shared; + u32 q2_q_size; + u32 q2_msg_size; + u32 q2_pool_start; + u32 q2_shared; + u32 log_start; + u32 log_size; + u8 host_enaddr[8]; + u8 rdma_enaddr[8]; + u32 crash_entry; + u32 crash_ready[2]; + u32 fw_txd_cur; + u32 fw_hrxd_cur; + u32 fw_rxd_cur; 
+}; + +enum pci_regs { + C2_HISR = 0x0000, + C2_DISR = 0x0004, + C2_HIMR = 0x0008, + C2_DIMR = 0x000C, + C2_NISR0 = 0x0010, + C2_NISR1 = 0x0014, + C2_NIMR0 = 0x0018, + C2_NIMR1 = 0x001C, + C2_IDIS = 0x0020, +}; + +enum { + C2_PCI_HRX_INT = 1 << 8, + C2_PCI_HTX_INT = 1 << 17, + C2_PCI_HRX_QUI = 1 << 31, +}; + +/* + * Cepheus registers in BAR0. + */ +struct c2_pci_regs { + u32 hostisr; + u32 dmaisr; + u32 hostimr; + u32 dmaimr; + u32 netisr0; + u32 netisr1; + u32 netimr0; + u32 netimr1; + u32 int_disable; +}; + +/* TXP flags */ +enum c2_txp_flags { + TXP_HTXD_DONE = 0, + TXP_HTXD_READY = 1 << 0, + TXP_HTXD_UNINIT = 1 << 1, +}; + +/* RXP flags */ +enum c2_rxp_flags { + RXP_HRXD_UNINIT = 0, + RXP_HRXD_READY = 1 << 0, + RXP_HRXD_DONE = 1 << 1, +}; + +/* RXP status */ +enum c2_rxp_status { + RXP_HRXD_ZERO = 0, + RXP_HRXD_OK = 1 << 0, + RXP_HRXD_BUF_OV = 1 << 1, +}; + +/* TXP descriptor fields */ +enum txp_desc { + C2_TXP_FLAGS = 0x0000, + C2_TXP_LEN = 0x0002, + C2_TXP_ADDR = 0x0004, +}; + +/* RXP descriptor fields */ +enum rxp_desc { + C2_RXP_FLAGS = 0x0000, + C2_RXP_STATUS = 0x0002, + C2_RXP_COUNT = 0x0004, + C2_RXP_LEN = 0x0006, + C2_RXP_ADDR = 0x0008, +}; + +struct c2_txp_desc { + u16 flags; + u16 len; + u64 addr; +} __attribute__ ((packed)); + +struct c2_rxp_desc { + u16 flags; + u16 status; + u16 count; + u16 len; + u64 addr; +} __attribute__ ((packed)); + +struct c2_rxp_hdr { + u16 flags; + u16 status; + u16 len; + u16 rsvd; +} __attribute__ ((packed)); + +struct c2_tx_desc { + u32 len; + u32 status; + dma_addr_t next_offset; +}; + +struct c2_rx_desc { + u32 len; + u32 status; + dma_addr_t next_offset; +}; + +struct c2_alloc { + u32 last; + u32 max; + spinlock_t lock; + unsigned long *table; +}; + +struct c2_array { + struct { + void **page; + int used; + } *page_list; +}; + +/* + * The MQ shared pointer pool is organized as a linked list of + * chunks. Each chunk contains a linked list of free shared pointers + * that can be allocated to a given user mode client. 
+ * + */ +struct sp_chunk { + struct sp_chunk *next; + gfp_t gfp_mask; + u16 head; + u16 shared_ptr[0]; +}; + +struct c2_pd_table { + struct c2_alloc alloc; + struct c2_array pd; +}; + +struct c2_qp_table { + struct c2_alloc alloc; + spinlock_t lock; + struct c2_array qp; + struct c2_qp** map; +}; + +struct c2_element { + struct c2_element *next; + void *ht_desc; /* host descriptor */ + void __iomem *hw_desc; /* hardware descriptor */ + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen; +}; + +struct c2_ring { + struct c2_element *to_clean; + struct c2_element *to_use; + struct c2_element *start; + unsigned long count; +}; + +struct c2_dev { + struct ib_device ibdev; + void __iomem *regs; + void __iomem *mmio_txp_ring; /* remapped adapter memory for hw rings */ + void __iomem *mmio_rxp_ring; + spinlock_t lock; + struct pci_dev *pcidev; + struct net_device *netdev; + struct net_device *pseudo_netdev; + unsigned int cur_tx; + unsigned int cur_rx; + u32 adapter_handle; + int device_cap_flags; + void __iomem *kva; /* KVA device memory */ + unsigned long pa; /* PA device memory */ + void **qptr_array; + + kmem_cache_t *host_msg_cache; + + struct list_head cca_link; /* adapter list */ + struct list_head eh_wakeup_list; /* event wakeup list */ + wait_queue_head_t req_vq_wo; + + /* Cached RNIC properties */ + struct ib_device_attr props; + + struct c2_pd_table pd_table; + struct c2_qp_table qp_table; + int ports; /* num of GigE ports */ + int devnum; + spinlock_t vqlock; /* sync vbs req MQ */ + + /* Verbs Queues */ + struct c2_mq req_vq; /* Verbs Request MQ */ + struct c2_mq rep_vq; /* Verbs Reply MQ */ + struct c2_mq aeq; /* Async Events MQ */ + + /* Kernel client MQs */ + struct sp_chunk *kern_mqsp_pool; + + /* Device updates these values when posting messages to a host + * target queue */ + u16 req_vq_shared; + u16 rep_vq_shared; + u16 aeq_shared; + u16 irq_claimed; + + /* + * Shared host target pages for user-accessible MQs. 
+ */ + int hthead; /* index of first free entry */ + void *htpages; /* kernel vaddr */ + int htlen; /* length of htpages memory */ + void *htuva; /* user mapped vaddr */ + spinlock_t htlock; /* serialize allocation */ + + u64 adapter_hint_uva; /* access to the activity FIFO */ + + // spinlock_t aeq_lock; + // spinlock_t rnic_lock; + + u16 hint_count; + u16 hints_read; + + int init; /* TRUE if it's ready */ + char ae_cache_name[16]; + char vq_cache_name[16]; +}; + +struct c2_port { + u32 msg_enable; + struct c2_dev *c2dev; + struct net_device *netdev; + + spinlock_t tx_lock; + u32 tx_avail; + struct c2_ring tx_ring; + struct c2_ring rx_ring; + + void *mem; /* PCI memory for host rings */ + dma_addr_t dma; + unsigned long mem_size; + + u32 rx_buf_size; + + struct net_device_stats netstats; +}; + +/* + * Activity FIFO registers in BAR0. + */ +#define PCI_BAR0_HOST_HINT 0x100 +#define PCI_BAR0_ADAPTER_HINT 0x2000 + +/* + * Ammasso PCI vendor id and Cepheus PCI device id. + */ +#define CQ_ARMED 0x01 +#define CQ_WAIT_FOR_DMA 0x80 + +/* + * The format of a hint is as follows: + * Lower 16 bits are the count of hints for the queue. + * Next 15 bits are the qp_index + * Upper most bit depends on who reads it: + * If read by producer, then it means Full (1) or Not-Full (0) + * If read by consumer, then it means Empty (1) or Not-Empty (0) + */ +#define C2_HINT_MAKE(q_index, hint_count) (((q_index) << 16) | hint_count) +#define C2_HINT_GET_INDEX(hint) (((hint) & 0x7FFF0000) >> 16) +#define C2_HINT_GET_COUNT(hint) ((hint) & 0x0000FFFF) + + +/* + * The following defines the offset in SDRAM for the c2_adapter_pci_regs_t + * struct. 
+ */ +#define C2_ADAPTER_PCI_REGS_OFFSET 0x10000 + +#ifndef readq +static inline u64 readq(const void __iomem * addr) +{ + u64 ret = readl(addr + 4); + ret <<= 32; + ret |= readl(addr); + + return ret; +} +#endif + +#ifndef __raw_writeq +static inline void __raw_writeq(u64 val, void __iomem * addr) +{ + __raw_writel((u32) (val), addr); + __raw_writel((u32) (val >> 32), (addr + 4)); +} +#endif + +#define C2_SET_CUR_RX(c2dev, cur_rx) \ + __raw_writel(cpu_to_be32(cur_rx), c2dev->mmio_txp_ring + 4092) + +#define C2_GET_CUR_RX(c2dev) \ + be32_to_cpu(readl(c2dev->mmio_txp_ring + 4092)) + +static inline struct c2_dev *to_c2dev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct c2_dev, ibdev); +} + +static inline int c2_errno(void *reply) +{ + switch (c2_wr_get_result(reply)) { + case C2_OK: + return 0; + case CCERR_NO_BUFS: + case CCERR_INSUFFICIENT_RESOURCES: + case CCERR_ZERO_RDMA_READ_RESOURCES: + return -ENOMEM; + case CCERR_MR_IN_USE: + case CCERR_QP_IN_USE: + return -EBUSY; + case CCERR_ADDR_IN_USE: + return -EADDRINUSE; + case CCERR_ADDR_NOT_AVAIL: + return -EADDRNOTAVAIL; + case CCERR_CONN_RESET: + return -ECONNRESET; + case CCERR_NOT_IMPLEMENTED: + case CCERR_INVALID_WQE: + return -ENOSYS; + case CCERR_QP_NOT_PRIVILEGED: + return -EPERM; + case CCERR_STACK_ERROR: + return -EPROTO; + case CCERR_ACCESS_VIOLATION: + case CCERR_BASE_AND_BOUNDS_VIOLATION: + return -EFAULT; + case CCERR_STAG_STATE_NOT_INVALID: + case CCERR_INVALID_ADDRESS: + case CCERR_INVALID_CQ: + case CCERR_INVALID_EP: + case CCERR_INVALID_MODIFIER: + case CCERR_INVALID_MTU: + case CCERR_INVALID_PD_ID: + case CCERR_INVALID_QP: + case CCERR_INVALID_RNIC: + case CCERR_INVALID_STAG: + return -EINVAL; + default: + return -EAGAIN; + } +} + +/* Device */ +extern int c2_register_device(struct c2_dev *c2dev); +extern void c2_unregister_device(struct c2_dev *c2dev); +extern int c2_rnic_init(struct c2_dev *c2dev); +extern void c2_rnic_term(struct c2_dev *c2dev); +extern void 
c2_rnic_interrupt(struct c2_dev *c2dev); +extern int c2_rnic_query(struct c2_dev *c2dev, struct ib_device_attr *props); +extern int c2_del_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask); +extern int c2_add_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask); + +/* QPs */ +extern int c2_alloc_qp(struct c2_dev *c2dev, struct c2_pd *pd, + struct ib_qp_init_attr *qp_attrs, struct c2_qp *qp); +extern void c2_free_qp(struct c2_dev *c2dev, struct c2_qp *qp); +extern struct ib_qp *c2_get_qp(struct ib_device *device, int qpn); +extern int c2_qp_modify(struct c2_dev *c2dev, struct c2_qp *qp, + struct ib_qp_attr *attr, int attr_mask); +extern int c2_qp_set_read_limits(struct c2_dev *c2dev, struct c2_qp *qp, + int ord, int ird); +extern int c2_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, + struct ib_send_wr **bad_wr); +extern int c2_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr, + struct ib_recv_wr **bad_wr); +extern int __devinit c2_init_qp_table(struct c2_dev *c2dev); +extern void __devexit c2_cleanup_qp_table(struct c2_dev *c2dev); +extern void c2_set_qp_state(struct c2_qp *, int); + +/* PDs */ +extern int c2_pd_alloc(struct c2_dev *c2dev, int privileged, struct c2_pd *pd); +extern void c2_pd_free(struct c2_dev *c2dev, struct c2_pd *pd); +extern int __devinit c2_init_pd_table(struct c2_dev *c2dev); +extern void __devexit c2_cleanup_pd_table(struct c2_dev *c2dev); + +/* CQs */ +extern int c2_init_cq(struct c2_dev *c2dev, int entries, + struct c2_ucontext *ctx, struct c2_cq *cq); +extern void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq); +extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index); +extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index); +extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); +extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); + +/* CM */ +extern int c2_llp_connect(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *iw_param); +extern int 
c2_llp_accept(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *iw_param); +extern int c2_llp_reject(struct iw_cm_id *cm_id, const void *pdata, + u8 pdata_len); +extern int c2_llp_service_create(struct iw_cm_id *cm_id, int backlog); +extern int c2_llp_service_destroy(struct iw_cm_id *cm_id); + +/* MM */ +extern int c2_nsmr_register_phys_kern(struct c2_dev *c2dev, u64 *addr_list, + int page_size, int pbl_depth, u32 length, + u32 off, u64 *va, enum c2_acf acf, + struct c2_mr *mr); +extern int c2_stag_dealloc(struct c2_dev *c2dev, u32 stag_index); + +/* AE */ +extern void c2_ae_event(struct c2_dev *c2dev, u32 mq_index); + +/* Allocators */ +extern u32 c2_alloc(struct c2_alloc *alloc); +extern void c2_free(struct c2_alloc *alloc, u32 obj); +extern int c2_alloc_init(struct c2_alloc *alloc, u32 num, u32 reserved); +extern void c2_alloc_cleanup(struct c2_alloc *alloc); +extern int c2_init_mqsp_pool(gfp_t gfp_mask, struct sp_chunk **root); +extern void c2_free_mqsp_pool(struct sp_chunk *root); +extern u16 *c2_alloc_mqsp(struct sp_chunk *head); +extern void c2_free_mqsp(u16 * mqsp); +extern void c2_array_cleanup(struct c2_array *array, int nent); +extern int c2_array_init(struct c2_array *array, int nent); +extern void c2_array_clear(struct c2_array *array, int index); +extern int c2_array_set(struct c2_array *array, int index, void *value); +extern void *c2_array_get(struct c2_array *array, int index); + +#endif diff --git a/drivers/infiniband/hw/amso1100/c2_ae.c b/drivers/infiniband/hw/amso1100/c2_ae.c new file mode 100644 index 0000000..c979ef6 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_ae.c @@ -0,0 +1,359 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include "c2.h" +#include +#include "c2_status.h" +#include "c2_ae.h" + +static int c2_convert_cm_status(u32 c2_status) +{ + switch (c2_status) { + case C2_CONN_STATUS_SUCCESS: + return 0; + case C2_CONN_STATUS_REJECTED: + return -ENETRESET; + case C2_CONN_STATUS_REFUSED: + return -ECONNREFUSED; + case C2_CONN_STATUS_TIMEDOUT: + return -ETIMEDOUT; + case C2_CONN_STATUS_NETUNREACH: + return -ENETUNREACH; + case C2_CONN_STATUS_HOSTUNREACH: + return -EHOSTUNREACH; + case C2_CONN_STATUS_INVALID_RNIC: + return -EINVAL; + case C2_CONN_STATUS_INVALID_QP: + return -EINVAL; + case C2_CONN_STATUS_INVALID_QP_STATE: + return -EINVAL; + case C2_CONN_STATUS_ADDR_NOT_AVAIL: + return -EADDRNOTAVAIL; + default: + printk(KERN_ERR PFX + "%s - Unable to convert CM status: %d\n", + __FUNCTION__, c2_status); + return -EIO; + } +} + +#ifdef DEBUG +static const char* to_event_str(int event) +{ + static const char* event_str[] = { + "CCAE_REMOTE_SHUTDOWN", + "CCAE_ACTIVE_CONNECT_RESULTS", + "CCAE_CONNECTION_REQUEST", + "CCAE_LLP_CLOSE_COMPLETE", + "CCAE_TERMINATE_MESSAGE_RECEIVED", + "CCAE_LLP_CONNECTION_RESET", + "CCAE_LLP_CONNECTION_LOST", + "CCAE_LLP_SEGMENT_SIZE_INVALID", + "CCAE_LLP_INVALID_CRC", + "CCAE_LLP_BAD_FPDU", + "CCAE_INVALID_DDP_VERSION", + "CCAE_INVALID_RDMA_VERSION", + "CCAE_UNEXPECTED_OPCODE", + "CCAE_INVALID_DDP_QUEUE_NUMBER", + "CCAE_RDMA_READ_NOT_ENABLED", + "CCAE_RDMA_WRITE_NOT_ENABLED", + "CCAE_RDMA_READ_TOO_SMALL", + "CCAE_NO_L_BIT", + "CCAE_TAGGED_INVALID_STAG", + "CCAE_TAGGED_BASE_BOUNDS_VIOLATION", + "CCAE_TAGGED_ACCESS_RIGHTS_VIOLATION", + "CCAE_TAGGED_INVALID_PD", + "CCAE_WRAP_ERROR", + "CCAE_BAD_CLOSE", + "CCAE_BAD_LLP_CLOSE", + "CCAE_INVALID_MSN_RANGE", + "CCAE_INVALID_MSN_GAP", + "CCAE_IRRQ_OVERFLOW", + "CCAE_IRRQ_MSN_GAP", + "CCAE_IRRQ_MSN_RANGE", + "CCAE_IRRQ_INVALID_STAG", + "CCAE_IRRQ_BASE_BOUNDS_VIOLATION", + "CCAE_IRRQ_ACCESS_RIGHTS_VIOLATION", + "CCAE_IRRQ_INVALID_PD", + "CCAE_IRRQ_WRAP_ERROR", + "CCAE_CQ_SQ_COMPLETION_OVERFLOW", + 
"CCAE_CQ_RQ_COMPLETION_ERROR", + "CCAE_QP_SRQ_WQE_ERROR", + "CCAE_QP_LOCAL_CATASTROPHIC_ERROR", + "CCAE_CQ_OVERFLOW", + "CCAE_CQ_OPERATION_ERROR", + "CCAE_SRQ_LIMIT_REACHED", + "CCAE_QP_RQ_LIMIT_REACHED", + "CCAE_SRQ_CATASTROPHIC_ERROR", + "CCAE_RNIC_CATASTROPHIC_ERROR" + }; + + if (event < CCAE_REMOTE_SHUTDOWN || + event > CCAE_RNIC_CATASTROPHIC_ERROR) + return ""; + + event -= CCAE_REMOTE_SHUTDOWN; + return event_str[event]; +} + +const char *to_qp_state_str(int state) +{ + switch (state) { + case C2_QP_STATE_IDLE: + return "C2_QP_STATE_IDLE"; + case C2_QP_STATE_CONNECTING: + return "C2_QP_STATE_CONNECTING"; + case C2_QP_STATE_RTS: + return "C2_QP_STATE_RTS"; + case C2_QP_STATE_CLOSING: + return "C2_QP_STATE_CLOSING"; + case C2_QP_STATE_TERMINATE: + return "C2_QP_STATE_TERMINATE"; + case C2_QP_STATE_ERROR: + return "C2_QP_STATE_ERROR"; + default: + return ""; + }; +} +#endif + +void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) +{ + struct c2_mq *mq = c2dev->qptr_array[mq_index]; + union c2wr *wr; + void *resource_user_context; + struct iw_cm_event cm_event; + struct ib_event ib_event; + enum c2_resource_indicator resource_indicator; + enum c2_event_id event_id; + unsigned long flags; + u8 *pdata = NULL; + int status; + + /* + * retreive the message + */ + wr = c2_mq_consume(mq); + if (!wr) + return; + + memset(&ib_event, 0, sizeof(ib_event)); + memset(&cm_event, 0, sizeof(cm_event)); + + event_id = c2_wr_get_id(wr); + resource_indicator = be32_to_cpu(wr->ae.ae_generic.resource_type); + resource_user_context = + (void *) (unsigned long) wr->ae.ae_generic.user_context; + + status = cm_event.status = c2_convert_cm_status(c2_wr_get_result(wr)); + + pr_debug("event received c2_dev=%p, event_id=%d, " + "resource_indicator=%d, user_context=%p, status = %d\n", + c2dev, event_id, resource_indicator, resource_user_context, + status); + + switch (resource_indicator) { + case C2_RES_IND_QP:{ + + struct c2_qp *qp = (struct c2_qp *)resource_user_context; + struct iw_cm_id 
*cm_id = qp->cm_id; + struct c2wr_ae_active_connect_results *res; + + if (!cm_id) { + pr_debug("event received, but cm_id is NULL, qp=%p!\n", + qp); + goto ignore_it; + } + pr_debug("%s: event = %s, user_context=%llx, " + "resource_type=%x, " + "resource=%x, qp_state=%s\n", + __FUNCTION__, + to_event_str(event_id), + be64_to_cpu(wr->ae.ae_generic.user_context), + be32_to_cpu(wr->ae.ae_generic.resource_type), + be32_to_cpu(wr->ae.ae_generic.resource), + to_qp_state_str(be32_to_cpu(wr->ae.ae_generic.qp_state))); + + c2_set_qp_state(qp, be32_to_cpu(wr->ae.ae_generic.qp_state)); + + switch (event_id) { + case CCAE_ACTIVE_CONNECT_RESULTS: + res = &wr->ae.ae_active_connect_results; + cm_event.event = IW_CM_EVENT_CONNECT_REPLY; + cm_event.local_addr.sin_addr.s_addr = res->laddr; + cm_event.remote_addr.sin_addr.s_addr = res->raddr; + cm_event.local_addr.sin_port = res->lport; + cm_event.remote_addr.sin_port = res->rport; + if (status == 0) { + cm_event.private_data_len = + be32_to_cpu(res->private_data_length); + } else { + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id) { + qp->cm_id->rem_ref(qp->cm_id); + qp->cm_id = NULL; + } + spin_unlock_irqrestore(&qp->lock, flags); + cm_event.private_data_len = 0; + cm_event.private_data = NULL; + } + if (cm_event.private_data_len) { + /* copy private data */ + pdata = + kmalloc(cm_event.private_data_len, + GFP_ATOMIC); + if (!pdata) { + /* Ignore the request, maybe the + * remote peer will retry */ + pr_debug("Ignored connect request -- " + "no memory for pdata, " + "private_data_len=%d\n", + cm_event.private_data_len); + goto ignore_it; + } + + memcpy(pdata, res->private_data, + cm_event.private_data_len); + + cm_event.private_data = pdata; + } + if (cm_id->event_handler) + cm_id->event_handler(cm_id, &cm_event); + break; + case CCAE_TERMINATE_MESSAGE_RECEIVED: + case CCAE_CQ_SQ_COMPLETION_OVERFLOW: + ib_event.device = &c2dev->ibdev; + ib_event.element.qp = &qp->ibqp; + ib_event.event = IB_EVENT_QP_REQ_ERR; + + if
(qp->ibqp.event_handler) + qp->ibqp.event_handler(&ib_event, + qp->ibqp. + qp_context); + break; + case CCAE_BAD_CLOSE: + case CCAE_LLP_CLOSE_COMPLETE: + case CCAE_LLP_CONNECTION_RESET: + case CCAE_LLP_CONNECTION_LOST: + BUG_ON(cm_id->event_handler == (void *)0x6b6b6b6b); + + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id) { + qp->cm_id->rem_ref(qp->cm_id); + qp->cm_id = NULL; + } + spin_unlock_irqrestore(&qp->lock, flags); + cm_event.event = IW_CM_EVENT_CLOSE; + cm_event.status = 0; + if (cm_id->event_handler) + cm_id->event_handler(cm_id, &cm_event); + break; + default: + pr_debug("%s:%d Unexpected event_id=%d on QP=%p, " + "CM_ID=%p\n", + __FUNCTION__, __LINE__, + event_id, qp, cm_id); + BUG_ON(1); + break; + } + break; + } + + case C2_RES_IND_EP:{ + + struct c2wr_ae_connection_request *req = + &wr->ae.ae_connection_request; + struct iw_cm_id *cm_id = + (struct iw_cm_id *)resource_user_context; + + pr_debug("C2_RES_IND_EP event_id=%d\n", event_id); + if (event_id != CCAE_CONNECTION_REQUEST) { + pr_debug("%s: Invalid event_id: %d\n", + __FUNCTION__, event_id); + break; + } + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; + cm_event.provider_data = (void*)(unsigned long)req->cr_handle; + cm_event.local_addr.sin_addr.s_addr = req->laddr; + cm_event.remote_addr.sin_addr.s_addr = req->raddr; + cm_event.local_addr.sin_port = req->lport; + cm_event.remote_addr.sin_port = req->rport; + cm_event.private_data_len = + be32_to_cpu(req->private_data_length); + + if (cm_event.private_data_len) { + pdata = + kmalloc(cm_event.private_data_len, + GFP_ATOMIC); + if (!pdata) { + /* Ignore the request, maybe the remote peer + * will retry */ + pr_debug("Ignored connect request -- " + "no memory for pdata, " + "private_data_len=%d\n", + cm_event.private_data_len); + goto ignore_it; + } + memcpy(pdata, + req->private_data, + cm_event.private_data_len); + + cm_event.private_data = pdata; + } + if (cm_id->event_handler) + cm_id->event_handler(cm_id, &cm_event); + break; + } + + case
C2_RES_IND_CQ:{ + struct c2_cq *cq = + (struct c2_cq *) resource_user_context; + + pr_debug("IB_EVENT_CQ_ERR\n"); + ib_event.device = &c2dev->ibdev; + ib_event.element.cq = &cq->ibcq; + ib_event.event = IB_EVENT_CQ_ERR; + + if (cq->ibcq.event_handler) + cq->ibcq.event_handler(&ib_event, + cq->ibcq.cq_context); + break; + } + + default: + printk(KERN_ERR PFX "Bad resource indicator = %d\n", + resource_indicator); + break; + } + + ignore_it: + c2_mq_free(mq); +} diff --git a/drivers/infiniband/hw/amso1100/c2_intr.c b/drivers/infiniband/hw/amso1100/c2_intr.c new file mode 100644 index 0000000..75bb18c --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_intr.c @@ -0,0 +1,209 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include +#include "c2_vq.h" + +static void handle_mq(struct c2_dev *c2dev, u32 mq_index); +static void handle_vq(struct c2_dev *c2dev, u32 mq_index); + +/* + * Handle RNIC interrupts + */ +void c2_rnic_interrupt(struct c2_dev *c2dev) +{ + unsigned int mq_index; + + while (c2dev->hints_read != be16_to_cpu(c2dev->hint_count)) { + mq_index = readl(c2dev->regs + PCI_BAR0_HOST_HINT); + if (mq_index & 0x80000000) { + break; + } + + c2dev->hints_read++; + handle_mq(c2dev, mq_index); + } + +} + +/* + * Top level MQ handler + */ +static void handle_mq(struct c2_dev *c2dev, u32 mq_index) +{ + if (c2dev->qptr_array[mq_index] == NULL) { + pr_debug("handle_mq: stray activity for mq_index=%d\n", + mq_index); + return; + } + + switch (mq_index) { + case (0): + /* + * An index of 0 in the activity queue + * indicates the req vq now has messages + * available... + * + * Wake up any waiters waiting on req VQ + * message availability. + */ + wake_up(&c2dev->req_vq_wo); + break; + case (1): + handle_vq(c2dev, mq_index); + break; + case (2): + /* We have to purge the VQ in case there are pending + * accept reply requests that would result in the + * generation of an ESTABLISHED event. If we don't + * generate these first, a CLOSE event could end up + * being delivered before the ESTABLISHED event. + */ + handle_vq(c2dev, 1); + + c2_ae_event(c2dev, mq_index); + break; + default: + /* There is no event synchronization between CQ events + * and AE or CM events. In fact, CQE could be + * delivered for all of the I/O up to and including the + * FLUSH for a peer disconnect prior to the ESTABLISHED + * event being delivered to the app.
The reason for this + * is that CM events are delivered on a thread, while AE + * and CQ events are delivered on interrupt context. + */ + c2_cq_event(c2dev, mq_index); + break; + } + + return; +} + +/* + * Handles verbs WR replies. + */ +static void handle_vq(struct c2_dev *c2dev, u32 mq_index) +{ + void *adapter_msg, *reply_msg; + struct c2wr_hdr *host_msg; + struct c2wr_hdr tmp; + struct c2_mq *reply_vq; + struct c2_vq_req *req; + struct iw_cm_event cm_event; + int err; + + reply_vq = (struct c2_mq *) c2dev->qptr_array[mq_index]; + + /* + * get next msg from mq_index into adapter_msg. + * don't free it yet. + */ + adapter_msg = c2_mq_consume(reply_vq); + if (adapter_msg == NULL) { + return; + } + + host_msg = vq_repbuf_alloc(c2dev); + + /* + * If we can't get a host buffer, then we'll still + * wakeup the waiter, we just won't give him the msg. + * It is assumed the waiter will deal with this... + */ + if (!host_msg) { + pr_debug("handle_vq: no repbufs!\n"); + + /* + * just copy the WR header into a local variable. + * this allows us to still demux on the context + */ + host_msg = &tmp; + memcpy(host_msg, adapter_msg, sizeof(tmp)); + reply_msg = NULL; + } else { + memcpy(host_msg, adapter_msg, reply_vq->msg_size); + reply_msg = host_msg; + } + + /* + * consume the msg from the MQ + */ + c2_mq_free(reply_vq); + + /* + * wakeup the waiter. + */ + req = (struct c2_vq_req *) (unsigned long) host_msg->context; + if (req == NULL) { + /* + * We should never get here, as the adapter should + * never send us a reply that we're not expecting.
+ */ + vq_repbuf_free(c2dev, host_msg); + pr_debug("handle_vq: UNEXPECTEDLY got NULL req\n"); + return; + } + + err = c2_errno(reply_msg); + if (!err) switch (req->event) { + case IW_CM_EVENT_ESTABLISHED: + c2_set_qp_state(req->qp, + C2_QP_STATE_RTS); + case IW_CM_EVENT_CLOSE: + + /* + * Move the QP to RTS if this is + * the established event + */ + cm_event.event = req->event; + cm_event.status = 0; + cm_event.local_addr = req->cm_id->local_addr; + cm_event.remote_addr = req->cm_id->remote_addr; + cm_event.private_data = NULL; + cm_event.private_data_len = 0; + req->cm_id->event_handler(req->cm_id, &cm_event); + break; + default: + break; + } + + req->reply_msg = (u64) (unsigned long) (reply_msg); + atomic_set(&req->reply_ready, 1); + wake_up(&req->wait_object); + + /* + * If the request was cancelled, then this put will + * free the vq_req memory...and reply_msg!!! + */ + vq_req_put(c2dev, req); +} diff --git a/drivers/infiniband/hw/amso1100/c2_rnic.c b/drivers/infiniband/hw/amso1100/c2_rnic.c new file mode 100644 index 0000000..49645a9 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_rnic.c @@ -0,0 +1,631 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include +#include +#include +#include "c2.h" +#include "c2_vq.h" + +/* Device capabilities */ +#define C2_MIN_PAGESIZE 1024 + +#define C2_MAX_MRS 32768 +#define C2_MAX_QPS 16000 +#define C2_MAX_WQE_SZ 256 +#define C2_MAX_QP_WR ((128*1024)/C2_MAX_WQE_SZ) +#define C2_MAX_SGES 4 +#define C2_MAX_SGE_RD 1 +#define C2_MAX_CQS 32768 +#define C2_MAX_CQES 4096 +#define C2_MAX_PDS 16384 + +/* + * Send the adapter INIT message to the amso1100 + */ +static int c2_adapter_init(struct c2_dev *c2dev) +{ + struct c2wr_init_req wr; + int err; + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_INIT); + wr.hdr.context = 0; + wr.hint_count = cpu_to_be64(__pa(&c2dev->hint_count)); + wr.q0_host_shared = cpu_to_be64(__pa(c2dev->req_vq.shared)); + wr.q1_host_shared = cpu_to_be64(__pa(c2dev->rep_vq.shared)); + wr.q1_host_msg_pool = cpu_to_be64(__pa(c2dev->rep_vq.msg_pool.host)); + wr.q2_host_shared = cpu_to_be64(__pa(c2dev->aeq.shared)); + wr.q2_host_msg_pool = cpu_to_be64(__pa(c2dev->aeq.msg_pool.host)); + + /* Post the init message */ + err = vq_send_wr(c2dev, (union 
c2wr *) & wr); + + return err; +} + +/* + * Send the adapter TERM message to the amso1100 + */ +static void c2_adapter_term(struct c2_dev *c2dev) +{ + struct c2wr_init_req wr; + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_TERM); + wr.hdr.context = 0; + + /* Post the term message */ + vq_send_wr(c2dev, (union c2wr *) & wr); + c2dev->init = 0; + + return; +} + +/* + * Query the adapter + */ +int c2_rnic_query(struct c2_dev *c2dev, + struct ib_device_attr *props) +{ + struct c2_vq_req *vq_req; + struct c2wr_rnic_query_req wr; + struct c2wr_rnic_query_rep *reply; + int err; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + c2_wr_set_id(&wr, CCWR_RNIC_QUERY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) &wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = + (struct c2wr_rnic_query_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + err = c2_errno(reply); + if (err) + goto bail2; + + props->fw_ver = + ((u64)be32_to_cpu(reply->fw_ver_major) << 32) | + ((be32_to_cpu(reply->fw_ver_minor) & 0xFFFF) << 16) | + (be32_to_cpu(reply->fw_ver_patch) & 0xFFFF); + memcpy(&props->sys_image_guid, c2dev->netdev->dev_addr, 6); + props->max_mr_size = 0xFFFFFFFF; + props->page_size_cap = ~(C2_MIN_PAGESIZE-1); + props->vendor_id = be32_to_cpu(reply->vendor_id); + props->vendor_part_id = be32_to_cpu(reply->part_number); + props->hw_ver = be32_to_cpu(reply->hw_version); + props->max_qp = be32_to_cpu(reply->max_qps); + props->max_qp_wr = be32_to_cpu(reply->max_qp_depth); + props->device_cap_flags = c2dev->device_cap_flags; + props->max_sge = C2_MAX_SGES; + props->max_sge_rd = C2_MAX_SGE_RD; + props->max_cq = be32_to_cpu(reply->max_cqs); + props->max_cqe = be32_to_cpu(reply->max_cq_depth); + props->max_mr =
be32_to_cpu(reply->max_mrs); + props->max_pd = be32_to_cpu(reply->max_pds); + props->max_qp_rd_atom = be32_to_cpu(reply->max_qp_ird); + props->max_ee_rd_atom = 0; + props->max_res_rd_atom = be32_to_cpu(reply->max_global_ird); + props->max_qp_init_rd_atom = be32_to_cpu(reply->max_qp_ord); + props->max_ee_init_rd_atom = 0; + props->atomic_cap = IB_ATOMIC_NONE; + props->max_ee = 0; + props->max_rdd = 0; + props->max_mw = be32_to_cpu(reply->max_mws); + props->max_raw_ipv6_qp = 0; + props->max_raw_ethy_qp = 0; + props->max_mcast_grp = 0; + props->max_mcast_qp_attach = 0; + props->max_total_mcast_qp_attach = 0; + props->max_ah = 0; + props->max_fmr = 0; + props->max_map_per_fmr = 0; + props->max_srq = 0; + props->max_srq_wr = 0; + props->max_srq_sge = 0; + props->max_pkeys = 0; + props->local_ca_ack_delay = 0; + + bail2: + vq_repbuf_free(c2dev, reply); + + bail1: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Add an IP address to the RNIC interface + */ +int c2_add_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask) +{ + struct c2_vq_req *vq_req; + struct c2wr_rnic_setconfig_req *wr; + struct c2wr_rnic_setconfig_rep *reply; + struct c2_netaddr netaddr; + int err, len; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + len = sizeof(struct c2_netaddr); + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + c2_wr_set_id(wr, CCWR_RNIC_SETCONFIG); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + wr->option = cpu_to_be32(C2_CFG_ADD_ADDR); + + netaddr.ip_addr = inaddr; + netaddr.netmask = inmask; + netaddr.mtu = 0; + + memcpy(wr->data, &netaddr, len); + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = + (struct c2wr_rnic_setconfig_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err 
= -ENOMEM; + goto bail1; + } + + err = c2_errno(reply); + vq_repbuf_free(c2dev, reply); + + bail1: + kfree(wr); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Delete an IP address from the RNIC interface + */ +int c2_del_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask) +{ + struct c2_vq_req *vq_req; + struct c2wr_rnic_setconfig_req *wr; + struct c2wr_rnic_setconfig_rep *reply; + struct c2_netaddr netaddr; + int err, len; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + len = sizeof(struct c2_netaddr); + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + c2_wr_set_id(wr, CCWR_RNIC_SETCONFIG); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + wr->option = cpu_to_be32(C2_CFG_DEL_ADDR); + + netaddr.ip_addr = inaddr; + netaddr.netmask = inmask; + netaddr.mtu = 0; + + memcpy(wr->data, &netaddr, len); + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = + (struct c2wr_rnic_setconfig_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + err = c2_errno(reply); + vq_repbuf_free(c2dev, reply); + + bail1: + kfree(wr); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Open a single RNIC instance to use with all + * low level openib calls + */ +static int c2_rnic_open(struct c2_dev *c2dev) +{ + struct c2_vq_req *vq_req; + union c2wr wr; + struct c2wr_rnic_open_rep *reply; + int err; + + vq_req = vq_req_alloc(c2dev); + if (vq_req == NULL) { + return -ENOMEM; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_RNIC_OPEN); + wr.rnic_open.req.hdr.context = (unsigned long) (vq_req); + wr.rnic_open.req.flags = cpu_to_be16(RNIC_PRIV_MODE); + wr.rnic_open.req.port_num = cpu_to_be16(0); + wr.rnic_open.req.user_context = 
(unsigned long) c2dev; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, &wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + reply = (struct c2wr_rnic_open_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + if ((err = c2_errno(reply)) != 0) { + goto bail1; + } + + c2dev->adapter_handle = reply->rnic_handle; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Close the RNIC instance + */ +static int c2_rnic_close(struct c2_dev *c2dev) +{ + struct c2_vq_req *vq_req; + union c2wr wr; + struct c2wr_rnic_close_rep *reply; + int err; + + vq_req = vq_req_alloc(c2dev); + if (vq_req == NULL) { + return -ENOMEM; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_RNIC_CLOSE); + wr.rnic_close.req.hdr.context = (unsigned long) vq_req; + wr.rnic_close.req.rnic_handle = c2dev->adapter_handle; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, &wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + reply = (struct c2wr_rnic_close_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + if ((err = c2_errno(reply)) != 0) { + goto bail1; + } + + c2dev->adapter_handle = 0; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Called by c2_probe to initialize the RNIC. This principally + * involves initializing the various limits and resource pools that + * comprise the RNIC instance.
+ */ +int c2_rnic_init(struct c2_dev *c2dev) +{ + int err; + u32 qsize, msgsize; + void *q1_pages; + void *q2_pages; + void __iomem *mmio_regs; + + /* Device capabilities */ + c2dev->device_cap_flags = + (IB_DEVICE_RESIZE_MAX_WR | + IB_DEVICE_CURR_QP_STATE_MOD | + IB_DEVICE_SYS_IMAGE_GUID | + IB_DEVICE_ZERO_STAG | + IB_DEVICE_SEND_W_INV | IB_DEVICE_MEM_WINDOW); + + /* Allocate the qptr_array */ + c2dev->qptr_array = vmalloc(C2_MAX_CQS * sizeof(void *)); + if (!c2dev->qptr_array) { + return -ENOMEM; + } + + /* Initialize the qptr_array */ + memset(c2dev->qptr_array, 0, C2_MAX_CQS * sizeof(void *)); + c2dev->qptr_array[0] = (void *) &c2dev->req_vq; + c2dev->qptr_array[1] = (void *) &c2dev->rep_vq; + c2dev->qptr_array[2] = (void *) &c2dev->aeq; + + /* Initialize data structures */ + init_waitqueue_head(&c2dev->req_vq_wo); + spin_lock_init(&c2dev->vqlock); + spin_lock_init(&c2dev->lock); + + /* Allocate MQ shared pointer pool for kernel clients. User + * mode client pools are hung off the user context + */ + err = c2_init_mqsp_pool(GFP_KERNEL, &c2dev->kern_mqsp_pool); + if (err) { + goto bail0; + } + + /* Allocate shared pointers for Q0, Q1, and Q2 from + * the shared pointer pool.
+ */ + c2dev->req_vq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); + c2dev->rep_vq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); + c2dev->aeq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); + if (!c2dev->req_vq.shared || + !c2dev->rep_vq.shared || !c2dev->aeq.shared) { + err = -ENOMEM; + goto bail1; + } + + mmio_regs = c2dev->kva; + /* Initialize the Verbs Request Queue */ + c2_mq_req_init(&c2dev->req_vq, 0, + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_QSIZE)), + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_MSGSIZE)), + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_POOLSTART)), + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_SHARED)), + C2_MQ_ADAPTER_TARGET); + + /* Initialize the Verbs Reply Queue */ + qsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q1_QSIZE)); + msgsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q1_MSGSIZE)); + q1_pages = kmalloc(qsize * msgsize, GFP_KERNEL); + if (!q1_pages) { + err = -ENOMEM; + goto bail1; + } + c2_mq_rep_init(&c2dev->rep_vq, + 1, + qsize, + msgsize, + q1_pages, + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q1_SHARED)), + C2_MQ_HOST_TARGET); + + /* Initialize the Asynchronous Event Queue */ + qsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q2_QSIZE)); + msgsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q2_MSGSIZE)); + q2_pages = kmalloc(qsize * msgsize, GFP_KERNEL); + if (!q2_pages) { + err = -ENOMEM; + goto bail2; + } + c2_mq_rep_init(&c2dev->aeq, + 2, + qsize, + msgsize, + q2_pages, + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q2_SHARED)), + C2_MQ_HOST_TARGET); + + /* Initialize the verbs request allocator */ + err = vq_init(c2dev); + if (err) + goto bail3; + + /* Enable interrupts on the adapter */ + writel(0, c2dev->regs + C2_IDIS); + + /* create the WR init message */ + err = c2_adapter_init(c2dev); + if (err) + goto bail4; + c2dev->init++; + + /* open an adapter instance */ + err = c2_rnic_open(c2dev); + if (err) + goto bail4; + + /* Initialize the cached adapter limits */ + if (c2_rnic_query(c2dev,
&c2dev->props)) + goto bail4; + + /* Initialize the PD pool */ + err = c2_init_pd_table(c2dev); + if (err) + goto bail5; + + /* Initialize the QP pool */ + err = c2_init_qp_table(c2dev); + if (err) + goto bail6; + return 0; + + bail6: + c2_cleanup_pd_table(c2dev); + bail5: + c2_rnic_close(c2dev); + bail4: + vq_term(c2dev); + bail3: + kfree(q2_pages); + bail2: + kfree(q1_pages); + bail1: + c2_free_mqsp_pool(c2dev->kern_mqsp_pool); + bail0: + vfree(c2dev->qptr_array); + + return err; +} + +/* + * Called by c2_remove to clean up the RNIC resources. + */ +void c2_rnic_term(struct c2_dev *c2dev) +{ + + /* Close the open adapter instance */ + c2_rnic_close(c2dev); + + /* Send the TERM message to the adapter */ + c2_adapter_term(c2dev); + + /* Disable interrupts on the adapter */ + writel(1, c2dev->regs + C2_IDIS); + + /* Free the QP pool */ + c2_cleanup_qp_table(c2dev); + + /* Free the PD pool */ + c2_cleanup_pd_table(c2dev); + + /* Free the verbs request allocator */ + vq_term(c2dev); + + /* Free the asynchronous event queue */ + kfree(c2dev->aeq.msg_pool.host); + + /* Free the verbs reply queue */ + kfree(c2dev->rep_vq.msg_pool.host); + + /* Free the MQ shared pointer pool */ + c2_free_mqsp_pool(c2dev->kern_mqsp_pool); + + /* Free the qptr_array */ + vfree(c2dev->qptr_array); + + return; +} From swise at opengridcomputing.com Wed Jun 7 13:06:53 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:06:53 -0500 Subject: [openib-general] [PATCH v2 3/7] AMSO1100 OpenFabrics Provider.
In-Reply-To: <20060607200646.9259.24588.stgit@stevo-desktop> References: <20060607200646.9259.24588.stgit@stevo-desktop> Message-ID: <20060607200653.9259.31696.stgit@stevo-desktop> Review Changes: sizeof -> sizeof() dprintk() -> pr_debug() assert() -> BUG_ON() C2_DEBUG -> DEBUG --- drivers/infiniband/hw/amso1100/c2_cm.c | 452 ++++++++++++ drivers/infiniband/hw/amso1100/c2_cq.c | 423 +++++++++++ drivers/infiniband/hw/amso1100/c2_pd.c | 71 ++ drivers/infiniband/hw/amso1100/c2_provider.c | 867 +++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_provider.h | 182 +++++ drivers/infiniband/hw/amso1100/c2_qp.c | 975 ++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_user.h | 82 ++ 7 files changed, 3052 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_cm.c b/drivers/infiniband/hw/amso1100/c2_cm.c new file mode 100644 index 0000000..018d11f --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_cm.c @@ -0,0 +1,452 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include "c2.h" +#include "c2_wr.h" +#include "c2_vq.h" +#include + +int c2_llp_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + struct c2_dev *c2dev = to_c2dev(cm_id->device); + struct ib_qp *ibqp; + struct c2_qp *qp; + struct c2wr_qp_connect_req *wr; /* variable size needs a malloc. */ + struct c2_vq_req *vq_req; + int err; + + ibqp = c2_get_qp(cm_id->device, iw_param->qpn); + if (!ibqp) + return -EINVAL; + qp = to_c2qp(ibqp); + + /* Associate QP <--> CM_ID */ + cm_id->provider_data = qp; + cm_id->add_ref(cm_id); + qp->cm_id = cm_id; + + /* + * only support the max private_data length + */ + if (iw_param->private_data_len > C2_MAX_PRIVATE_DATA_SIZE) { + err = -EINVAL; + goto bail0; + } + /* + * Set the rdma read limits + */ + err = c2_qp_set_read_limits(c2dev, qp, iw_param->ord, iw_param->ird); + if (err) + goto bail0; + + /* + * Create and send a WR_QP_CONNECT... + */ + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + err = -ENOMEM; + goto bail1; + } + + c2_wr_set_id(wr, CCWR_QP_CONNECT); + wr->hdr.context = 0; + wr->rnic_handle = c2dev->adapter_handle; + wr->qp_handle = qp->adapter_handle; + + wr->remote_addr = cm_id->remote_addr.sin_addr.s_addr; + wr->remote_port = cm_id->remote_addr.sin_port; + + /* + * Move any private data from the caller's buf into + * the WR.
+ */ + if (iw_param->private_data) { + wr->private_data_length = + cpu_to_be32(iw_param->private_data_len); + memcpy(&wr->private_data[0], iw_param->private_data, + iw_param->private_data_len); + } else + wr->private_data_length = 0; + + /* + * Send WR to adapter. NOTE: There is no synch reply from + * the adapter. + */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + vq_req_free(c2dev, vq_req); + + bail1: + kfree(wr); + bail0: + if (err) { + /* + * If we fail, release reference on QP and + * disassociate QP from CM_ID + */ + cm_id->provider_data = NULL; + qp->cm_id = NULL; + cm_id->rem_ref(cm_id); + } + return err; +} + +int c2_llp_service_create(struct iw_cm_id *cm_id, int backlog) +{ + struct c2_dev *c2dev; + struct c2wr_ep_listen_create_req wr; + struct c2wr_ep_listen_create_rep *reply; + struct c2_vq_req *vq_req; + int err; + + c2dev = to_c2dev(cm_id->device); + if (c2dev == NULL) + return -EINVAL; + + /* + * Allocate verbs request. + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_EP_LISTEN_CREATE); + wr.hdr.context = (u64) (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.local_addr = cm_id->local_addr.sin_addr.s_addr; + wr.local_port = cm_id->local_addr.sin_port; + wr.backlog = cpu_to_be32(backlog); + wr.user_context = (u64) (unsigned long) cm_id; + + /* + * Reference the request struct. Dereferenced in the int handler. + */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + /* + * Process reply + */ + reply = + (struct c2wr_ep_listen_create_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + if ((err = c2_errno(reply)) != 0) + goto bail1; + + /* + * Keep the adapter handle. 
Used in subsequent destroy + */ + cm_id->provider_data = (void*)(unsigned long) reply->ep_handle; + + /* + * free vq stuff + */ + vq_repbuf_free(c2dev, reply); + vq_req_free(c2dev, vq_req); + + return 0; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + + +int c2_llp_service_destroy(struct iw_cm_id *cm_id) +{ + + struct c2_dev *c2dev; + struct c2wr_ep_listen_destroy_req wr; + struct c2wr_ep_listen_destroy_rep *reply; + struct c2_vq_req *vq_req; + int err; + + c2dev = to_c2dev(cm_id->device); + if (c2dev == NULL) + return -EINVAL; + + /* + * Allocate verbs request. + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_EP_LISTEN_DESTROY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.ep_handle = (u32)(unsigned long)cm_id->provider_data; + + /* + * reference the request struct. dereferenced in the int handler. + */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + /* + * Process reply + */ + reply=(struct c2wr_ep_listen_destroy_rep *)(unsigned long)vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + if ((err = c2_errno(reply)) != 0) + goto bail1; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +int c2_llp_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + struct c2_dev *c2dev = to_c2dev(cm_id->device); + struct c2_qp *qp; + struct ib_qp *ibqp; + struct c2wr_cr_accept_req *wr; /* variable length WR */ + struct c2_vq_req *vq_req; + struct c2wr_cr_accept_rep *reply; /* VQ Reply msg ptr. 
*/ + int err; + + ibqp = c2_get_qp(cm_id->device, iw_param->qpn); + if (!ibqp) + return -EINVAL; + qp = to_c2qp(ibqp); + + /* Set the RDMA read limits */ + err = c2_qp_set_read_limits(c2dev, qp, iw_param->ord, iw_param->ird); + if (err) + goto bail0; + + /* Allocate verbs request. */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + err = -ENOMEM; + goto bail1; + } + vq_req->qp = qp; + vq_req->cm_id = cm_id; + vq_req->event = IW_CM_EVENT_ESTABLISHED; + + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail2; + } + + /* Build the WR */ + c2_wr_set_id(wr, CCWR_CR_ACCEPT); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + wr->ep_handle = (u32) (unsigned long) cm_id->provider_data; + wr->qp_handle = qp->adapter_handle; + + /* Replace the cr_handle with the QP after accept */ + cm_id->provider_data = qp; + cm_id->add_ref(cm_id); + qp->cm_id = cm_id; + + cm_id->provider_data = qp; + + /* Validate private_data length */ + if (iw_param->private_data_len > C2_MAX_PRIVATE_DATA_SIZE) { + err = -EINVAL; + goto bail2; + } + + if (iw_param->private_data) { + wr->private_data_length = cpu_to_be32(iw_param->private_data_len); + memcpy(&wr->private_data[0], + iw_param->private_data, iw_param->private_data_len); + } else + wr->private_data_length = 0; + + /* Reference the request struct. Dereferenced in the int handler. 
*/ + vq_req_get(c2dev, vq_req); + + /* Send WR to adapter */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail2; + } + + /* Wait for reply from adapter */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail2; + + /* Check that reply is present */ + reply = (struct c2wr_cr_accept_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail2; + } + + err = c2_errno(reply); + vq_repbuf_free(c2dev, reply); + + if (!err) + c2_set_qp_state(qp, C2_QP_STATE_RTS); + bail2: + kfree(wr); + bail1: + vq_req_free(c2dev, vq_req); + bail0: + if (err) { + /* + * If we fail, release reference on QP and + * disassociate QP from CM_ID + */ + cm_id->provider_data = NULL; + qp->cm_id = NULL; + cm_id->rem_ref(cm_id); + } + return err; +} + +int c2_llp_reject(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len) +{ + struct c2_dev *c2dev; + struct c2wr_cr_reject_req wr; + struct c2_vq_req *vq_req; + struct c2wr_cr_reject_rep *reply; + int err; + + c2dev = to_c2dev(cm_id->device); + + /* + * Allocate verbs request. + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_CR_REJECT); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.ep_handle = (u32) (unsigned long) cm_id->provider_data; + + /* + * reference the request struct. dereferenced in the int handler. 
+ */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + /* + * Process reply + */ + reply = (struct c2wr_cr_reject_rep *) (unsigned long) + vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + err = c2_errno(reply); + /* + * free vq stuff + */ + vq_repbuf_free(c2dev, reply); + + bail0: + vq_req_free(c2dev, vq_req); + return err; +} diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c new file mode 100644 index 0000000..71128ff --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_cq.c @@ -0,0 +1,423 @@ +/* + * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005 Cisco Systems, Inc. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2004 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include "c2.h" +#include "c2_vq.h" +#include "c2_status.h" + +#define C2_CQ_MSG_SIZE ((sizeof(struct c2wr_ce) + 32-1) & ~(32-1)) + +struct c2_cq *c2_cq_get(struct c2_dev *c2dev, int cqn) +{ + struct c2_cq *cq; + unsigned long flags; + + spin_lock_irqsave(&c2dev->lock, flags); + cq = c2dev->qptr_array[cqn]; + if (!cq) { + spin_unlock_irqrestore(&c2dev->lock, flags); + return NULL; + } + atomic_inc(&cq->refcount); + spin_unlock_irqrestore(&c2dev->lock, flags); + return cq; +} + +void c2_cq_put(struct c2_cq *cq) +{ + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void c2_cq_event(struct c2_dev *c2dev, u32 mq_index) +{ + struct c2_cq *cq; + + cq = c2_cq_get(c2dev, mq_index); + if (!cq) { + printk("discarding events on destroyed CQN=%d\n", mq_index); + return; + } + + (*cq->ibcq.comp_handler) (&cq->ibcq, cq->ibcq.cq_context); + c2_cq_put(cq); +} + +void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index) +{ + struct c2_cq *cq; + struct c2_mq *q; + + cq = c2_cq_get(c2dev, mq_index); + if (!cq) + return; + + spin_lock_irq(&cq->lock); + q = &cq->mq; + if (q && !c2_mq_empty(q)) { + u16 priv = q->priv; + struct c2wr_ce *msg; + + while (priv != be16_to_cpu(*q->shared)) { + msg = (struct c2wr_ce *) + (q->msg_pool.host + 
priv * q->msg_size); + if (msg->qp_user_context == (u64) (unsigned long) qp) { + msg->qp_user_context = (u64) 0; + } + priv = (priv + 1) % q->q_size; + } + } + spin_unlock_irq(&cq->lock); + c2_cq_put(cq); +} + +static inline enum ib_wc_status c2_cqe_status_to_openib(u8 status) +{ + switch (status) { + case C2_OK: + return IB_WC_SUCCESS; + case CCERR_FLUSHED: + return IB_WC_WR_FLUSH_ERR; + case CCERR_BASE_AND_BOUNDS_VIOLATION: + return IB_WC_LOC_PROT_ERR; + case CCERR_ACCESS_VIOLATION: + return IB_WC_LOC_ACCESS_ERR; + case CCERR_TOTAL_LENGTH_TOO_BIG: + return IB_WC_LOC_LEN_ERR; + case CCERR_INVALID_WINDOW: + return IB_WC_MW_BIND_ERR; + default: + return IB_WC_GENERAL_ERR; + } +} + + +static inline int c2_poll_one(struct c2_dev *c2dev, + struct c2_cq *cq, struct ib_wc *entry) +{ + struct c2wr_ce *ce; + struct c2_qp *qp; + int is_recv = 0; + + ce = (struct c2wr_ce *) c2_mq_consume(&cq->mq); + if (!ce) { + return -EAGAIN; + } + + /* + * if the qp returned is null then this qp has already + * been freed and we are unable to process the completion.
+ * try pulling the next message + */ + while ((qp = + (struct c2_qp *) (unsigned long) ce->qp_user_context) == NULL) { + c2_mq_free(&cq->mq); + ce = (struct c2wr_ce *) c2_mq_consume(&cq->mq); + if (!ce) + return -EAGAIN; + } + + entry->status = c2_cqe_status_to_openib(c2_wr_get_result(ce)); + entry->wr_id = ce->hdr.context; + entry->qp_num = ce->handle; + entry->wc_flags = 0; + entry->slid = 0; + entry->sl = 0; + entry->src_qp = 0; + entry->dlid_path_bits = 0; + entry->pkey_index = 0; + + switch (c2_wr_get_id(ce)) { + case C2_WR_TYPE_SEND: + entry->opcode = IB_WC_SEND; + break; + case C2_WR_TYPE_RDMA_WRITE: + entry->opcode = IB_WC_RDMA_WRITE; + break; + case C2_WR_TYPE_RDMA_READ: + entry->opcode = IB_WC_RDMA_READ; + break; + case C2_WR_TYPE_BIND_MW: + entry->opcode = IB_WC_BIND_MW; + break; + case C2_WR_TYPE_RECV: + entry->byte_len = be32_to_cpu(ce->bytes_rcvd); + entry->opcode = IB_WC_RECV; + is_recv = 1; + break; + default: + break; + } + + /* consume the WQEs */ + if (is_recv) + c2_mq_lconsume(&qp->rq_mq, 1); + else + c2_mq_lconsume(&qp->sq_mq, + be32_to_cpu(c2_wr_get_wqe_count(ce)) + 1); + + /* free the message */ + c2_mq_free(&cq->mq); + + return 0; +} + +int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) +{ + struct c2_dev *c2dev = to_c2dev(ibcq->device); + struct c2_cq *cq = to_c2cq(ibcq); + unsigned long flags; + int npolled, err; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + + err = c2_poll_one(c2dev, cq, entry + npolled); + if (err) + break; + } + + spin_unlock_irqrestore(&cq->lock, flags); + + return npolled; +} + +int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +{ + struct c2_mq_shared __iomem *shared; + struct c2_cq *cq; + + cq = to_c2cq(ibcq); + shared = cq->mq.peer; + + if (notify == IB_CQ_NEXT_COMP) + writeb(C2_CQ_NOTIFICATION_TYPE_NEXT, &shared->notification_type); + else if (notify == IB_CQ_SOLICITED) + writeb(C2_CQ_NOTIFICATION_TYPE_NEXT_SE, 
&shared->notification_type); + else + return -EINVAL; + + writeb(CQ_WAIT_FOR_DMA | CQ_ARMED, &shared->armed); + + /* + * Now read back shared->armed to make the PCI + * write synchronous. This is necessary for + * correct cq notification semantics. + */ + readb(&shared->armed); + + return 0; +} + +static void c2_free_cq_buf(struct c2_mq *mq) +{ + free_pages((unsigned long) mq->msg_pool.host, + get_order(mq->q_size * mq->msg_size)); +} + +static int c2_alloc_cq_buf(struct c2_mq *mq, int q_size, int msg_size) +{ + unsigned long pool_start; + + pool_start = __get_free_pages(GFP_KERNEL, + get_order(q_size * msg_size)); + if (!pool_start) + return -ENOMEM; + + c2_mq_rep_init(mq, + 0, /* index (currently unknown) */ + q_size, + msg_size, + (u8 *) pool_start, + NULL, /* peer (currently unknown) */ + C2_MQ_HOST_TARGET); + + return 0; +} + +int c2_init_cq(struct c2_dev *c2dev, int entries, + struct c2_ucontext *ctx, struct c2_cq *cq) +{ + struct c2wr_cq_create_req wr; + struct c2wr_cq_create_rep *reply; + unsigned long peer_pa; + struct c2_vq_req *vq_req; + int err; + + might_sleep(); + + cq->ibcq.cqe = entries - 1; + cq->is_kernel = !ctx; + + /* Allocate a shared pointer */ + cq->mq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); + if (!cq->mq.shared) + return -ENOMEM; + + /* Allocate pages for the message pool */ + err = c2_alloc_cq_buf(&cq->mq, entries + 1, C2_CQ_MSG_SIZE); + if (err) + goto bail0; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + err = -ENOMEM; + goto bail1; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_CQ_CREATE); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.msg_size = cpu_to_be32(cq->mq.msg_size); + wr.depth = cpu_to_be32(cq->mq.q_size); + wr.shared_ht = cpu_to_be64(__pa(cq->mq.shared)); + wr.msg_pool = cpu_to_be64(__pa(cq->mq.msg_pool.host)); + wr.user_context = (u64) (unsigned long) (cq); + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + 
vq_req_put(c2dev, vq_req); + goto bail2; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail2; + + reply = (struct c2wr_cq_create_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail2; + } + + if ((err = c2_errno(reply)) != 0) + goto bail3; + + cq->adapter_handle = reply->cq_handle; + cq->mq.index = be32_to_cpu(reply->mq_index); + + peer_pa = c2dev->pa + be32_to_cpu(reply->adapter_shared); + cq->mq.peer = ioremap_nocache(peer_pa, PAGE_SIZE); + if (!cq->mq.peer) { + err = -ENOMEM; + goto bail3; + } + + vq_repbuf_free(c2dev, reply); + vq_req_free(c2dev, vq_req); + + spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + /* + * Use the MQ index allocated by the adapter to + * store the CQ in the qptr_array + */ + cq->cqn = cq->mq.index; + c2dev->qptr_array[cq->cqn] = cq; + + return 0; + + bail3: + vq_repbuf_free(c2dev, reply); + bail2: + vq_req_free(c2dev, vq_req); + bail1: + c2_free_cq_buf(&cq->mq); + bail0: + c2_free_mqsp(cq->mq.shared); + + return err; +} + +void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq) +{ + int err; + struct c2_vq_req *vq_req; + struct c2wr_cq_destroy_req wr; + struct c2wr_cq_destroy_rep *reply; + + might_sleep(); + + /* Clear CQ from the qptr array */ + spin_lock_irq(&c2dev->lock); + c2dev->qptr_array[cq->mq.index] = NULL; + atomic_dec(&cq->refcount); + spin_unlock_irq(&c2dev->lock); + + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + goto bail0; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_CQ_DESTROY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.cq_handle = cq->adapter_handle; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = (struct c2wr_cq_destroy_rep *) 
(unsigned long) (vq_req->reply_msg);
+
+	vq_repbuf_free(c2dev, reply);
+      bail1:
+	vq_req_free(c2dev, vq_req);
+      bail0:
+	if (cq->is_kernel) {
+		c2_free_cq_buf(&cq->mq);
+	}
+
+	return;
+}
diff --git a/drivers/infiniband/hw/amso1100/c2_pd.c b/drivers/infiniband/hw/amso1100/c2_pd.c
new file mode 100644
index 0000000..27459b8
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/c2_pd.c
@@ -0,0 +1,71 @@
+/*
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ * Copyright (c) 2005 Cisco Systems. All rights reserved.
+ * Copyright (c) 2005 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/init.h>
+#include <linux/errno.h>
+
+#include "c2.h"
+#include "c2_provider.h"
+
+int c2_pd_alloc(struct c2_dev *dev, int privileged, struct c2_pd *pd)
+{
+	int err = 0;
+
+	might_sleep();
+
+	atomic_set(&pd->sqp_count, 0);
+	pd->pd_id = c2_alloc(&dev->pd_table.alloc);
+	if (pd->pd_id == -1)
+		return -ENOMEM;
+
+	return err;
+}
+
+void c2_pd_free(struct c2_dev *dev, struct c2_pd *pd)
+{
+	might_sleep();
+	c2_free(&dev->pd_table.alloc, pd->pd_id);
+}
+
+int __devinit c2_init_pd_table(struct c2_dev *dev)
+{
+	return c2_alloc_init(&dev->pd_table.alloc, dev->props.max_pd, 0);
+}
+
+void __devexit c2_cleanup_pd_table(struct c2_dev *dev)
+{
+	/* XXX check if any PDs are still allocated? */
+	c2_alloc_cleanup(&dev->pd_table.alloc);
+}
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
new file mode 100644
index 0000000..eaf786e
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -0,0 +1,867 @@
+/*
+ * Copyright (c) 2005 Ammasso, Inc. All rights reserved.
+ * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/pci.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/inetdevice.h>
+#include <linux/delay.h>
+#include <linux/ethtool.h>
+#include <linux/mii.h>
+#include <linux/if_vlan.h>
+#include <linux/crc32.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <linux/init.h>
+#include <linux/dma-mapping.h>
+#include <linux/if_arp.h>
+
+#include <asm/io.h>
+#include <asm/irq.h>
+#include <asm/byteorder.h>
+
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+#include "c2.h"
+#include "c2_provider.h"
+#include "c2_user.h"
+
+static int c2_query_device(struct ib_device *ibdev,
+			   struct ib_device_attr *props)
+{
+	struct c2_dev *c2dev = to_c2dev(ibdev);
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	*props = c2dev->props;
+	return 0;
+}
+
+static int c2_query_port(struct ib_device *ibdev,
+			 u8 port, struct ib_port_attr *props)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	props->max_mtu = IB_MTU_4096;
+	props->lid = 0;
+	props->lmc = 0;
+	props->sm_lid = 0;
+	props->sm_sl = 0;
+	props->state = IB_PORT_ACTIVE;
+	props->phys_state = 0;
+	props->port_cap_flags =
+	    IB_PORT_CM_SUP |
+	    IB_PORT_REINIT_SUP |
+	    IB_PORT_VENDOR_CLASS_SUP | IB_PORT_BOOT_MGMT_SUP;
+	props->gid_tbl_len = 1;
+	props->pkey_tbl_len = 1;
+	props->qkey_viol_cntr = 0;
+	props->active_width = 1;
+	props->active_speed = 1;
+
+	return 0;
+}
+
+static int c2_modify_port(struct ib_device *ibdev,
+			  u8 port, int port_modify_mask,
+			  struct ib_port_modify *props)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return 0;
+}
+
+static int c2_query_pkey(struct ib_device *ibdev,
+			 u8 port, u16 index, u16 * pkey)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	*pkey = 0;
+	return 0;
+}
+
+static int c2_query_gid(struct ib_device *ibdev, u8 port,
+			int index, union ib_gid *gid)
+{
+	struct c2_dev *c2dev = to_c2dev(ibdev);
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	memset(&(gid->raw[0]), 0, sizeof(gid->raw));
+	memcpy(&(gid->raw[0]), c2dev->pseudo_netdev->dev_addr, 6);
+
+	return 0;
+}
+
+/* Allocate the user context data structure. This keeps track
+ * of all objects associated with a particular user-mode client.
+ */
+static struct ib_ucontext *c2_alloc_ucontext(struct ib_device *ibdev,
+					     struct ib_udata *udata)
+{
+	struct c2_ucontext *context;
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	context = kmalloc(sizeof(*context), GFP_KERNEL);
+	if (!context)
+		return ERR_PTR(-ENOMEM);
+
+	return &context->ibucontext;
+}
+
+static int c2_dealloc_ucontext(struct ib_ucontext *context)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	kfree(context);
+	return 0;
+}
+
+static int c2_mmap_uar(struct ib_ucontext *context, struct vm_area_struct *vma)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return -ENOSYS;
+}
+
+static struct ib_pd *c2_alloc_pd(struct ib_device *ibdev,
+				 struct ib_ucontext *context,
+				 struct ib_udata *udata)
+{
+	struct c2_pd *pd;
+	int err;
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	pd = kmalloc(sizeof(*pd), GFP_KERNEL);
+	if (!pd)
+		return ERR_PTR(-ENOMEM);
+
+	err = c2_pd_alloc(to_c2dev(ibdev), !context, pd);
+	if (err) {
+		kfree(pd);
+		return ERR_PTR(err);
+	}
+
+	if (context) {
+		if (ib_copy_to_udata(udata, &pd->pd_id, sizeof(__u32))) {
+			c2_pd_free(to_c2dev(ibdev), pd);
+			kfree(pd);
+			return ERR_PTR(-EFAULT);
+		}
+	}
+
+	return &pd->ibpd;
+}
+
+static int c2_dealloc_pd(struct ib_pd *pd)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	c2_pd_free(to_c2dev(pd->device), to_c2pd(pd));
+	kfree(pd);
+
+	return 0;
+}
+
+static struct ib_ah *c2_ah_create(struct ib_pd *pd, struct ib_ah_attr *ah_attr)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return ERR_PTR(-ENOSYS);
+}
+
+static int c2_ah_destroy(struct ib_ah *ah)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return -ENOSYS;
+}
+
+static void c2_add_ref(struct ib_qp *ibqp)
+{
+	struct c2_qp *qp;
+	BUG_ON(!ibqp);
+	qp = to_c2qp(ibqp);
+	atomic_inc(&qp->refcount);
+}
+
+static void c2_rem_ref(struct ib_qp *ibqp)
+{
+	struct c2_qp *qp;
+	BUG_ON(!ibqp);
+	qp = to_c2qp(ibqp);
+	if (atomic_dec_and_test(&qp->refcount))
+		wake_up(&qp->wait);
+}
+
+struct ib_qp *c2_get_qp(struct ib_device *device, int qpn)
+{
+	struct c2_dev* c2dev = to_c2dev(device);
+	struct c2_qp *qp;
+
+	qp = c2dev->qp_table.map[qpn];
+	pr_debug("%s Returning QP=%p for QPN=%d, device=%p, refcount=%d\n",
+		__FUNCTION__, qp, qpn, device,
+		(qp?atomic_read(&qp->refcount):0));
+
+	return (qp?&qp->ibqp:NULL);
+}
+
+static struct ib_qp *c2_create_qp(struct ib_pd *pd,
+				  struct ib_qp_init_attr *init_attr,
+				  struct ib_udata *udata)
+{
+	struct c2_qp *qp;
+	int err;
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	switch (init_attr->qp_type) {
+	case IB_QPT_RC:
+		qp = kzalloc(sizeof(*qp), GFP_KERNEL);
+		if (!qp) {
+			pr_debug("%s: Unable to allocate QP\n", __FUNCTION__);
+			return ERR_PTR(-ENOMEM);
+		}
+		spin_lock_init(&qp->lock);
+		if (pd->uobject) {
+			/* XXX userspace specific */
+		}
+
+		err = c2_alloc_qp(to_c2dev(pd->device),
+				  to_c2pd(pd), init_attr, qp);
+
+		if (err && pd->uobject) {
+			/* XXX userspace specific */
+		}
+
+		break;
+	default:
+		pr_debug("%s: Invalid QP type: %d\n", __FUNCTION__,
+			init_attr->qp_type);
+		return ERR_PTR(-EINVAL);
+		break;
+	}
+
+	if (err) {
+		kfree(qp);
+		return ERR_PTR(err);
+	}
+
+	return &qp->ibqp;
+}
+
+static int c2_destroy_qp(struct ib_qp *ib_qp)
+{
+	struct c2_qp *qp = to_c2qp(ib_qp);
+
+	pr_debug("%s:%u qp=%p,qp->state=%d\n",
+		__FUNCTION__, __LINE__,ib_qp,qp->state);
+	c2_free_qp(to_c2dev(ib_qp->device), qp);
+	kfree(qp);
+	return 0;
+}
+
+static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries,
+				  struct ib_ucontext *context,
+				  struct ib_udata *udata)
+{
+	struct c2_cq *cq;
+	int err;
+
+	cq = kmalloc(sizeof(*cq), GFP_KERNEL);
+	if (!cq) {
+		pr_debug("%s: Unable to allocate CQ\n", __FUNCTION__);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	err = c2_init_cq(to_c2dev(ibdev), entries, NULL, cq);
+	if (err) {
+		pr_debug("%s: error initializing CQ\n", __FUNCTION__);
+		kfree(cq);
+		return ERR_PTR(err);
+	}
+
+	return &cq->ibcq;
+}
+
+static int c2_destroy_cq(struct ib_cq *ib_cq)
+{
+	struct c2_cq *cq = to_c2cq(ib_cq);
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	c2_free_cq(to_c2dev(ib_cq->device), cq);
+	kfree(cq);
+
+	return 0;
+}
+
+static inline u32 c2_convert_access(int acc)
+{
+	return (acc & IB_ACCESS_REMOTE_WRITE ? C2_ACF_REMOTE_WRITE : 0) |
+	    (acc & IB_ACCESS_REMOTE_READ ? C2_ACF_REMOTE_READ : 0) |
+	    (acc & IB_ACCESS_LOCAL_WRITE ? C2_ACF_LOCAL_WRITE : 0) |
+	    C2_ACF_LOCAL_READ | C2_ACF_WINDOW_BIND;
+}
+
+static struct ib_mr *c2_reg_phys_mr(struct ib_pd *ib_pd,
+				    struct ib_phys_buf *buffer_list,
+				    int num_phys_buf, int acc, u64 * iova_start)
+{
+	struct c2_mr *mr;
+	u64 *page_list;
+	u32 total_len;
+	int err, i, j, k, page_shift, pbl_depth;
+
+	pbl_depth = 0;
+	total_len = 0;
+
+	page_shift = PAGE_SHIFT;
+	/*
+	 * If there is only 1 buffer we assume this could
+	 * be a map of all phy mem...use a 32k page_shift.
+	 */
+	if (num_phys_buf == 1)
+		page_shift += 3;	/* XXX */
+
+	for (i = 0; i < num_phys_buf; i++) {
+
+		if (buffer_list[i].addr & ~PAGE_MASK) {
+			pr_debug("Unaligned Memory Buffer: 0x%x\n",
+				(unsigned int) buffer_list[i].addr);
+			return ERR_PTR(-EINVAL);
+		}
+
+		if (!buffer_list[i].size) {
+			pr_debug("Invalid Buffer Size\n");
+			return ERR_PTR(-EINVAL);
+		}
+
+		total_len += buffer_list[i].size;
+		pbl_depth += ALIGN(buffer_list[i].size,
+				   (1 << page_shift)) >> page_shift;
+	}
+
+	page_list = vmalloc(sizeof(u64) * pbl_depth);
+	if (!page_list) {
+		pr_debug("couldn't vmalloc page_list of size %zd\n",
+			(sizeof(u64) * pbl_depth));
+		return ERR_PTR(-ENOMEM);
+	}
+
+	for (i = 0, j = 0; i < num_phys_buf; i++) {
+
+		int naddrs;
+
+		naddrs = ALIGN(buffer_list[i].size,
+			       (1 << page_shift)) >> page_shift;
+		for (k = 0; k < naddrs; k++)
+			page_list[j++] = (buffer_list[i].addr +
+					  (k << page_shift));
+	}
+
+	mr = kmalloc(sizeof(*mr), GFP_KERNEL);
+	if (!mr)
+		return ERR_PTR(-ENOMEM);
+
+	mr->pd = to_c2pd(ib_pd);
+	pr_debug("%s - page shift %d, pbl_depth %d, total_len %u, "
+		"*iova_start %llx, first pa %llx, last pa %llx\n",
+		__FUNCTION__, page_shift, pbl_depth, total_len,
+		*iova_start, page_list[0], page_list[pbl_depth-1]);
+	err = c2_nsmr_register_phys_kern(to_c2dev(ib_pd->device), page_list,
+					 (1 << page_shift), pbl_depth,
+					 total_len, 0, iova_start,
+					 c2_convert_access(acc), mr);
+	vfree(page_list);
+	if (err) {
+		kfree(mr);
+		return ERR_PTR(err);
+	}
+
+	return &mr->ibmr;
+}
+
+static struct ib_mr *c2_get_dma_mr(struct ib_pd *pd, int acc)
+{
+	struct ib_phys_buf bl;
+	u64 kva = 0;
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	/* AMSO1100 limit */
+	bl.size = 0xffffffff;
+	bl.addr = 0;
+	return c2_reg_phys_mr(pd, &bl, 1, acc, &kva);
+}
+
+static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem *region,
+				    int acc, struct ib_udata *udata)
+{
+	u64 *pages;
+	u64 kva = 0;
+	int shift, n, len;
+	int i, j, k;
+	int err = 0;
+	struct ib_umem_chunk *chunk;
+	struct c2_pd *c2pd = to_c2pd(pd);
+	struct c2_mr *c2mr;
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	shift = ffs(region->page_size) - 1;
+
+	c2mr = kmalloc(sizeof(*c2mr), GFP_KERNEL);
+	if (!c2mr)
+		return ERR_PTR(-ENOMEM);
+	c2mr->pd = c2pd;
+
+	n = 0;
+	list_for_each_entry(chunk, &region->chunk_list, list)
+		n += chunk->nents;
+
+	pages = kmalloc(n * sizeof(u64), GFP_KERNEL);
+	if (!pages) {
+		err = -ENOMEM;
+		goto err;
+	}
+
+	i = 0;
+	list_for_each_entry(chunk, &region->chunk_list, list) {
+		for (j = 0; j < chunk->nmap; ++j) {
+			len = sg_dma_len(&chunk->page_list[j]) >> shift;
+			for (k = 0; k < len; ++k) {
+				pages[i++] =
+				    sg_dma_address(&chunk->page_list[j]) +
+				    (region->page_size * k);
+			}
+		}
+	}
+
+	kva = (u64)region->virt_base;
+	err = c2_nsmr_register_phys_kern(to_c2dev(pd->device),
+					 pages,
+					 region->page_size,
+					 i,
+					 region->length,
+					 region->offset,
+					 &kva,
+					 c2_convert_access(acc),
+					 c2mr);
+	kfree(pages);
+	if (err) {
+		kfree(c2mr);
+		return ERR_PTR(err);
+	}
+	return &c2mr->ibmr;
+
+err:
+	kfree(c2mr);
+	return ERR_PTR(err);
+}
+
+static int c2_dereg_mr(struct ib_mr *ib_mr)
+{
+	struct c2_mr *mr = to_c2mr(ib_mr);
+	int err;
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	err = c2_stag_dealloc(to_c2dev(ib_mr->device), ib_mr->lkey);
+	if (err)
+		pr_debug("c2_stag_dealloc failed: %d\n", err);
+	else
+		kfree(mr);
+
+	return err;
+}
+
+static ssize_t show_rev(struct class_device *cdev, char *buf)
+{
+	struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev);
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return sprintf(buf, "%x\n", dev->props.hw_ver);
+}
+
+static ssize_t show_fw_ver(struct class_device *cdev, char *buf)
+{
+	struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev);
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return sprintf(buf, "%x.%x.%x\n",
+		       (int) (dev->props.fw_ver >> 32),
+		       (int) (dev->props.fw_ver >> 16) & 0xffff,
+		       (int) (dev->props.fw_ver & 0xffff));
+}
+
+static ssize_t show_hca(struct class_device *cdev, char *buf)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return sprintf(buf, "AMSO1100\n");
+}
+
+static ssize_t show_board(struct class_device *cdev, char *buf)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return sprintf(buf, "%.*s\n", 32, "AMSO1100 Board ID");
+}
+
+static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL);
+static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL);
+static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL);
+static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL);
+
+static struct class_device_attribute *c2_class_attributes[] = {
+	&class_device_attr_hw_rev,
+	&class_device_attr_fw_ver,
+	&class_device_attr_hca_type,
+	&class_device_attr_board_id
+};
+
+static int c2_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+			int attr_mask)
+{
+	int err;
+
+	err =
+	    c2_qp_modify(to_c2dev(ibqp->device), to_c2qp(ibqp), attr,
+			 attr_mask);
+
+	return err;
+}
+
+static int c2_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return -ENOSYS;
+}
+
+static int c2_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return -ENOSYS;
+}
+
+static int c2_process_mad(struct ib_device *ibdev,
+			  int mad_flags,
+			  u8 port_num,
+			  struct ib_wc *in_wc,
+			  struct ib_grh *in_grh,
+			  struct ib_mad *in_mad, struct ib_mad *out_mad)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return -ENOSYS;
+}
+
+static int c2_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	/* Request a connection */
+	return c2_llp_connect(cm_id, iw_param);
+}
+
+static int c2_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	/* Accept the new connection */
+	return c2_llp_accept(cm_id, iw_param);
+}
+
+static int c2_reject(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len)
+{
+	int err;
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	err = c2_llp_reject(cm_id, pdata, pdata_len);
+	return err;
+}
+
+static int c2_service_create(struct iw_cm_id *cm_id, int backlog)
+{
+	int err;
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	err = c2_llp_service_create(cm_id, backlog);
+	pr_debug("%s:%u err=%d\n",
+		__FUNCTION__, __LINE__,
+		err);
+	return err;
+}
+
+static int c2_service_destroy(struct iw_cm_id *cm_id)
+{
+	int err;
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+
+	err = c2_llp_service_destroy(cm_id);
+
+	return err;
+}
+
+static int c2_pseudo_up(struct net_device *netdev)
+{
+	struct in_device *ind;
+	struct c2_dev *c2dev = netdev->priv;
+
+	ind = in_dev_get(netdev);
+	if (!ind)
+		return 0;
+
+	pr_debug("adding...\n");
+	for_ifa(ind) {
+#ifdef DEBUG
+		u8 *ip = (u8 *) & ifa->ifa_address;
+
+		pr_debug("%s: %d.%d.%d.%d\n",
+		       ifa->ifa_label, ip[0], ip[1], ip[2], ip[3]);
+#endif
+		c2_add_addr(c2dev, ifa->ifa_address, ifa->ifa_mask);
+	}
+	endfor_ifa(ind);
+	in_dev_put(ind);
+
+	return 0;
+}
+
+static int c2_pseudo_down(struct net_device *netdev)
+{
+	struct in_device *ind;
+	struct c2_dev *c2dev = netdev->priv;
+
+	ind = in_dev_get(netdev);
+	if (!ind)
+		return 0;
+
+	pr_debug("deleting...\n");
+	for_ifa(ind) {
+#ifdef DEBUG
+		u8 *ip = (u8 *) & ifa->ifa_address;
+
+		pr_debug("%s: %d.%d.%d.%d\n",
+		       ifa->ifa_label, ip[0], ip[1], ip[2], ip[3]);
+#endif
+		c2_del_addr(c2dev, ifa->ifa_address, ifa->ifa_mask);
+	}
+	endfor_ifa(ind);
+	in_dev_put(ind);
+
+	return 0;
+}
+
+static int c2_pseudo_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
+{
+	kfree_skb(skb);
+	return NETDEV_TX_OK;
+}
+
+static int c2_pseudo_change_mtu(struct net_device *netdev, int new_mtu)
+{
+	int ret = 0;
+
+	if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU)
+		return -EINVAL;
+
+	netdev->mtu = new_mtu;
+
+	/* XXX tell rnic about new rmda interface mtu */
+	return ret;
+}
+
+static void setup(struct net_device *netdev)
+{
+	SET_MODULE_OWNER(netdev);
+	netdev->open = c2_pseudo_up;
+	netdev->stop = c2_pseudo_down;
+	netdev->hard_start_xmit = c2_pseudo_xmit_frame;
+	netdev->get_stats = NULL;
+	netdev->tx_timeout = NULL;
+	netdev->set_mac_address = NULL;
+	netdev->change_mtu = c2_pseudo_change_mtu;
+	netdev->watchdog_timeo = 0;
+	netdev->type = ARPHRD_ETHER;
+	netdev->mtu = 1500;
+	netdev->hard_header_len = ETH_HLEN;
+	netdev->addr_len = ETH_ALEN;
+	netdev->tx_queue_len = 0;
+	netdev->flags |= IFF_NOARP;
+	return;
+}
+
+static struct net_device *c2_pseudo_netdev_init(struct c2_dev *c2dev)
+{
+	char name[IFNAMSIZ];
+	struct net_device *netdev;
+
+	/* change ethxxx to iwxxx */
+	strcpy(name, "iw");
+	strcat(name, &c2dev->netdev->name[3]);
+	netdev = alloc_netdev(sizeof(*netdev), name, setup);
+	if (!netdev) {
+		printk(KERN_ERR PFX "%s - etherdev alloc failed",
+			__FUNCTION__);
+		return NULL;
+	}
+
+	netdev->priv = c2dev;
+
+	SET_NETDEV_DEV(netdev, &c2dev->pcidev->dev);
+
+	memcpy_fromio(netdev->dev_addr, c2dev->kva + C2_REGS_RDMA_ENADDR, 6);
+
+	/* Print out the MAC address */
+	pr_debug("%s: MAC %02X:%02X:%02X:%02X:%02X:%02X\n",
+		netdev->name,
+		netdev->dev_addr[0], netdev->dev_addr[1], netdev->dev_addr[2],
+		netdev->dev_addr[3], netdev->dev_addr[4], netdev->dev_addr[5]);
+
+	/* Disable network packets */
+	netif_stop_queue(netdev);
+	return netdev;
+}
+
+int c2_register_device(struct c2_dev *dev)
+{
+	int ret;
+	int i;
+
+	/* Register pseudo network device */
+	dev->pseudo_netdev = c2_pseudo_netdev_init(dev);
+	if (dev->pseudo_netdev) {
+		ret = register_netdev(dev->pseudo_netdev);
+		if (ret) {
+			printk(KERN_ERR PFX
+				"Unable to register netdev, ret = %d\n", ret);
+			free_netdev(dev->pseudo_netdev);
+			return ret;
+		}
+	}
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	strlcpy(dev->ibdev.name, "amso%d", IB_DEVICE_NAME_MAX);
+	dev->ibdev.owner = THIS_MODULE;
+	dev->ibdev.uverbs_cmd_mask =
+	    (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) |
+	    (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) |
+	    (1ull << IB_USER_VERBS_CMD_QUERY_PORT) |
+	    (1ull << IB_USER_VERBS_CMD_ALLOC_PD) |
+	    (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) |
+	    (1ull << IB_USER_VERBS_CMD_REG_MR) |
+	    (1ull << IB_USER_VERBS_CMD_DEREG_MR) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_QP) |
+	    (1ull << IB_USER_VERBS_CMD_MODIFY_QP) |
+	    (1ull << IB_USER_VERBS_CMD_POLL_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_DESTROY_QP) |
+	    (1ull << IB_USER_VERBS_CMD_POST_SEND) |
+	    (1ull << IB_USER_VERBS_CMD_POST_RECV);
+
+	dev->ibdev.node_type = RDMA_NODE_RNIC;
+	memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid));
+	memcpy(&dev->ibdev.node_guid, dev->pseudo_netdev->dev_addr, 6);
+	dev->ibdev.phys_port_cnt = 1;
+	dev->ibdev.dma_device = &dev->pcidev->dev;
+	dev->ibdev.class_dev.dev = &dev->pcidev->dev;
+	dev->ibdev.query_device = c2_query_device;
+	dev->ibdev.query_port = c2_query_port;
+	dev->ibdev.modify_port = c2_modify_port;
+	dev->ibdev.query_pkey = c2_query_pkey;
+	dev->ibdev.query_gid = c2_query_gid;
+	dev->ibdev.alloc_ucontext = c2_alloc_ucontext;
+	dev->ibdev.dealloc_ucontext = c2_dealloc_ucontext;
+	dev->ibdev.mmap = c2_mmap_uar;
+	dev->ibdev.alloc_pd = c2_alloc_pd;
+	dev->ibdev.dealloc_pd = c2_dealloc_pd;
+	dev->ibdev.create_ah = c2_ah_create;
+	dev->ibdev.destroy_ah = c2_ah_destroy;
+	dev->ibdev.create_qp = c2_create_qp;
+	dev->ibdev.modify_qp = c2_modify_qp;
+	dev->ibdev.destroy_qp = c2_destroy_qp;
+	dev->ibdev.create_cq = c2_create_cq;
+	dev->ibdev.destroy_cq = c2_destroy_cq;
+	dev->ibdev.poll_cq = c2_poll_cq;
+	dev->ibdev.get_dma_mr = c2_get_dma_mr;
+	dev->ibdev.reg_phys_mr = c2_reg_phys_mr;
+	dev->ibdev.reg_user_mr = c2_reg_user_mr;
+	dev->ibdev.dereg_mr = c2_dereg_mr;
+
+	dev->ibdev.alloc_fmr = NULL;
+	dev->ibdev.unmap_fmr = NULL;
+	dev->ibdev.dealloc_fmr = NULL;
+	dev->ibdev.map_phys_fmr = NULL;
+
+	dev->ibdev.attach_mcast = c2_multicast_attach;
+	dev->ibdev.detach_mcast = c2_multicast_detach;
+	dev->ibdev.process_mad = c2_process_mad;
+
+	dev->ibdev.req_notify_cq = c2_arm_cq;
+	dev->ibdev.post_send = c2_post_send;
+	dev->ibdev.post_recv = c2_post_receive;
+
+	dev->ibdev.iwcm = kmalloc(sizeof(*dev->ibdev.iwcm), GFP_KERNEL);
+	dev->ibdev.iwcm->add_ref = c2_add_ref;
+	dev->ibdev.iwcm->rem_ref = c2_rem_ref;
+	dev->ibdev.iwcm->get_qp = c2_get_qp;
+	dev->ibdev.iwcm->connect = c2_connect;
+	dev->ibdev.iwcm->accept = c2_accept;
+	dev->ibdev.iwcm->reject = c2_reject;
+	dev->ibdev.iwcm->create_listen = c2_service_create;
+	dev->ibdev.iwcm->destroy_listen = c2_service_destroy;
+
+	ret = ib_register_device(&dev->ibdev);
+	if (ret)
+		return ret;
+
+	for (i = 0; i < ARRAY_SIZE(c2_class_attributes); ++i) {
+		ret = class_device_create_file(&dev->ibdev.class_dev,
+					       c2_class_attributes[i]);
+		if (ret) {
+			unregister_netdev(dev->pseudo_netdev);
+			free_netdev(dev->pseudo_netdev);
+			ib_unregister_device(&dev->ibdev);
+			return ret;
+		}
+	}
+
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	return 0;
+}
+
+void c2_unregister_device(struct c2_dev *dev)
+{
+	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
+	unregister_netdev(dev->pseudo_netdev);
+	free_netdev(dev->pseudo_netdev);
+	ib_unregister_device(&dev->ibdev);
+}
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.h b/drivers/infiniband/hw/amso1100/c2_provider.h
new file mode 100644
index 0000000..05c4ab6
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/c2_provider.h
@@ -0,0 +1,182 @@
+/*
+ * Copyright (c) 2005 Ammasso, Inc. All rights reserved.
+ * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef C2_PROVIDER_H
+#define C2_PROVIDER_H
+#include <linux/inetdevice.h>
+
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_pack.h>
+
+#include "c2_mq.h"
+#include <rdma/iw_cm.h>
+
+#define C2_MPT_FLAG_ATOMIC (1 << 14)
+#define C2_MPT_FLAG_REMOTE_WRITE (1 << 13)
+#define C2_MPT_FLAG_REMOTE_READ (1 << 12)
+#define C2_MPT_FLAG_LOCAL_WRITE (1 << 11)
+#define C2_MPT_FLAG_LOCAL_READ (1 << 10)
+
+struct c2_buf_list {
+	void *buf;
+	DECLARE_PCI_UNMAP_ADDR(mapping)
+};
+
+
+/* The user context keeps track of objects allocated for a
+ * particular user-mode client. */
+struct c2_ucontext {
+	struct ib_ucontext ibucontext;
+};
+
+struct c2_mtt;
+
+/* All objects associated with a PD are kept in the
+ * associated user context if present.
+ */
+struct c2_pd {
+	struct ib_pd ibpd;
+	u32 pd_id;
+	atomic_t sqp_count;
+};
+
+struct c2_mr {
+	struct ib_mr ibmr;
+	struct c2_pd *pd;
+};
+
+struct c2_av;
+
+enum c2_ah_type {
+	C2_AH_ON_HCA,
+	C2_AH_PCI_POOL,
+	C2_AH_KMALLOC
+};
+
+struct c2_ah {
+	struct ib_ah ibah;
+};
+
+struct c2_cq {
+	struct ib_cq ibcq;
+	spinlock_t lock;
+	atomic_t refcount;
+	int cqn;
+	int is_kernel;
+	wait_queue_head_t wait;
+
+	u32 adapter_handle;
+	struct c2_mq mq;
+};
+
+struct c2_wq {
+	spinlock_t lock;
+};
+struct iw_cm_id;
+struct c2_qp {
+	struct ib_qp ibqp;
+	struct iw_cm_id *cm_id;
+	spinlock_t lock;
+	atomic_t refcount;
+	wait_queue_head_t wait;
+	int qpn;
+
+	u32 adapter_handle;
+	u32 send_sgl_depth;
+	u32 recv_sgl_depth;
+	u32 rdma_write_sgl_depth;
+	u8 state;
+
+	struct c2_mq sq_mq;
+	struct c2_mq rq_mq;
+};
+
+struct c2_cr_query_attrs {
+	u32 local_addr;
+	u32 remote_addr;
+	u16 local_port;
+	u16 remote_port;
+};
+
+static inline struct c2_pd *to_c2pd(struct ib_pd *ibpd)
+{
+	return container_of(ibpd, struct c2_pd, ibpd);
+}
+
+static inline struct c2_ucontext *to_c2ucontext(struct ib_ucontext *ibucontext)
+{
+	return container_of(ibucontext, struct c2_ucontext, ibucontext);
+}
+
+static inline struct c2_mr *to_c2mr(struct ib_mr *ibmr)
+{
+	return container_of(ibmr, struct c2_mr, ibmr);
+}
+
+
+static inline struct c2_ah *to_c2ah(struct ib_ah *ibah)
+{
+	return container_of(ibah, struct c2_ah, ibah);
+}
+
+static inline struct c2_cq *to_c2cq(struct ib_cq *ibcq)
+{
+	return container_of(ibcq, struct c2_cq, ibcq);
+}
+
+static inline struct c2_qp *to_c2qp(struct ib_qp *ibqp)
+{
+	return container_of(ibqp, struct c2_qp, ibqp);
+}
+
+static inline int is_rnic_addr(struct net_device *netdev, u32 addr)
+{
+	struct in_device *ind;
+	int ret = 0;
+
+	ind = in_dev_get(netdev);
+	if (!ind)
+		return 0;
+
+	for_ifa(ind) {
+		if (ifa->ifa_address == addr) {
+			ret = 1;
+			break;
+		}
+	}
+	endfor_ifa(ind);
+	in_dev_put(ind);
+	return ret;
+}
+#endif				/* C2_PROVIDER_H */
diff --git a/drivers/infiniband/hw/amso1100/c2_qp.c b/drivers/infiniband/hw/amso1100/c2_qp.c
new file mode 100644
index 0000000..6071cf0
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/c2_qp.c
@@ -0,0 +1,975 @@
+/*
+ * Copyright (c) 2004 Topspin Communications. All rights reserved.
+ * Copyright (c) 2005 Cisco Systems. All rights reserved.
+ * Copyright (c) 2005 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2004 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include "c2.h"
+#include "c2_vq.h"
+#include "c2_status.h"
+
+#define C2_MAX_ORD_PER_QP 128
+#define C2_MAX_IRD_PER_QP 128
+
+#define C2_HINT_MAKE(q_index, hint_count) (((q_index) << 16) | hint_count)
+#define C2_HINT_GET_INDEX(hint) (((hint) & 0x7FFF0000) >> 16)
+#define C2_HINT_GET_COUNT(hint) ((hint) & 0x0000FFFF)
+
+#define NO_SUPPORT -1
+static const u8 c2_opcode[] = {
+	[IB_WR_SEND] = C2_WR_TYPE_SEND,
+	[IB_WR_SEND_WITH_IMM] = NO_SUPPORT,
+	[IB_WR_RDMA_WRITE] = C2_WR_TYPE_RDMA_WRITE,
+	[IB_WR_RDMA_WRITE_WITH_IMM] = NO_SUPPORT,
+	[IB_WR_RDMA_READ] = C2_WR_TYPE_RDMA_READ,
+	[IB_WR_ATOMIC_CMP_AND_SWP] = NO_SUPPORT,
+	[IB_WR_ATOMIC_FETCH_AND_ADD] = NO_SUPPORT,
+};
+
+static int to_c2_state(enum ib_qp_state ib_state)
+{
+	switch (ib_state) {
+	case IB_QPS_RESET:
+		return C2_QP_STATE_IDLE;
+	case IB_QPS_RTS:
+		return C2_QP_STATE_RTS;
+	case IB_QPS_SQD:
+		return C2_QP_STATE_CLOSING;
+	case IB_QPS_SQE:
+		return C2_QP_STATE_CLOSING;
+	case IB_QPS_ERR:
+		return C2_QP_STATE_ERROR;
+	default:
+		return -1;
+	}
+}
+
+int to_ib_state(enum c2_qp_state c2_state)
+{
+	switch (c2_state) {
+	case C2_QP_STATE_IDLE:
+		return IB_QPS_RESET;
+	case C2_QP_STATE_CONNECTING:
+		return IB_QPS_RTR;
+	case C2_QP_STATE_RTS:
+		return IB_QPS_RTS;
+	case C2_QP_STATE_CLOSING:
+		return IB_QPS_SQD;
+	case C2_QP_STATE_ERROR:
+		return IB_QPS_ERR;
+	case C2_QP_STATE_TERMINATE:
+		return IB_QPS_SQE;
+	default:
+		return -1;
+	}
+}
+
+const char *to_ib_state_str(int ib_state)
+{
+	static const char *state_str[] = {
+		"IB_QPS_RESET",
+		"IB_QPS_INIT",
+		"IB_QPS_RTR",
+		"IB_QPS_RTS",
+		"IB_QPS_SQD",
+		"IB_QPS_SQE",
+		"IB_QPS_ERR"
+	};
+	if (ib_state < IB_QPS_RESET ||
+	    ib_state > IB_QPS_ERR)
+		return "<invalid IB QP state>";
+
+	ib_state -= IB_QPS_RESET;
+	return state_str[ib_state];
+}
+
+void c2_set_qp_state(struct c2_qp *qp, int c2_state)
+{
+	int new_state = to_ib_state(c2_state);
+
+	pr_debug("%s: qp[%p] state modify %s --> %s\n",
+		__FUNCTION__,
+		qp,
+		to_ib_state_str(qp->state),
+		to_ib_state_str(new_state));
+	qp->state = new_state;
+}
+
+#define C2_QP_NO_ATTR_CHANGE 0xFFFFFFFF
+
+int c2_qp_modify(struct c2_dev *c2dev, struct c2_qp *qp,
+		 struct ib_qp_attr *attr, int attr_mask)
+{
+	struct c2wr_qp_modify_req wr;
+	struct c2wr_qp_modify_rep *reply;
+	struct c2_vq_req *vq_req;
+	unsigned long flags;
+	u8 next_state;
+	int err;
+
+	pr_debug("%s:%d qp=%p, %s --> %s\n",
+		__FUNCTION__, __LINE__,
+		qp,
+		to_ib_state_str(qp->state),
+		to_ib_state_str(attr->qp_state));
+
+	vq_req = vq_req_alloc(c2dev);
+	if (!vq_req)
+		return -ENOMEM;
+
+	c2_wr_set_id(&wr, CCWR_QP_MODIFY);
+	wr.hdr.context = (unsigned long) vq_req;
+	wr.rnic_handle = c2dev->adapter_handle;
+	wr.qp_handle = qp->adapter_handle;
+	wr.ord = cpu_to_be32(C2_QP_NO_ATTR_CHANGE);
+	wr.ird = cpu_to_be32(C2_QP_NO_ATTR_CHANGE);
+	wr.sq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE);
+	wr.rq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE);
+
+	if (attr_mask & IB_QP_STATE) {
+		/* Ensure the state is valid */
+		if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR)
+			return -EINVAL;
+
+		wr.next_qp_state = cpu_to_be32(to_c2_state(attr->qp_state));
+
+		if (attr->qp_state == IB_QPS_ERR) {
+			spin_lock_irqsave(&qp->lock, flags);
+			if (qp->cm_id && qp->state == IB_QPS_RTS) {
+				pr_debug("Generating CLOSE event for QP-->ERR, "
+					"qp=%p, cm_id=%p\n",qp,qp->cm_id);
+				/* Generate an CLOSE event */
+				vq_req->cm_id = qp->cm_id;
+				vq_req->event = IW_CM_EVENT_CLOSE;
+			}
+			spin_unlock_irqrestore(&qp->lock, flags);
+		}
+		next_state = attr->qp_state;
+
+	} else if (attr_mask & IB_QP_CUR_STATE) {
+
+		if (attr->cur_qp_state != IB_QPS_RTR &&
+		    attr->cur_qp_state != IB_QPS_RTS &&
+		    attr->cur_qp_state != IB_QPS_SQD &&
+		    attr->cur_qp_state != IB_QPS_SQE)
+			return -EINVAL;
+		else
+			wr.next_qp_state =
+			    cpu_to_be32(to_c2_state(attr->cur_qp_state));
+
+		next_state = attr->cur_qp_state;
+
+	} else {
+		err = 0;
+		goto bail0;
+	}
+
+	/* reference the request struct */
+	vq_req_get(c2dev, vq_req);
+
+	err = vq_send_wr(c2dev, (union c2wr *) & wr);
+	if (err) {
+		vq_req_put(c2dev, vq_req);
+		goto bail0;
+	}
+
+	err = vq_wait_for_reply(c2dev, vq_req);
+	if (err)
+		goto bail0;
+
+	reply = (struct c2wr_qp_modify_rep *) (unsigned long) vq_req->reply_msg;
+	if (!reply) {
+		err = -ENOMEM;
+		goto bail0;
+	}
+
+	err = c2_errno(reply);
+	if (!err)
+		qp->state = next_state;
+#ifdef DEBUG
+	else
+		pr_debug("%s: c2_errno=%d\n", __FUNCTION__, err);
+#endif
+	/*
+	 * If we're going to error and generating the event here, then
+	 * we need to remove the reference because there will be no
+	 * close event generated by the adapter
+	 */
+	spin_lock_irqsave(&qp->lock, flags);
+	if (vq_req->event==IW_CM_EVENT_CLOSE && qp->cm_id) {
+		qp->cm_id->rem_ref(qp->cm_id);
+		qp->cm_id = NULL;
+	}
+	spin_unlock_irqrestore(&qp->lock, flags);
+
+	vq_repbuf_free(c2dev, reply);
+      bail0:
+	vq_req_free(c2dev, vq_req);
+
+	pr_debug("%s:%d qp=%p, cur_state=%s\n",
+		__FUNCTION__, __LINE__,
+		qp,
+		to_ib_state_str(qp->state));
+	return err;
+}
+
+int c2_qp_set_read_limits(struct c2_dev *c2dev, struct c2_qp *qp,
+			  int ord, int ird)
+{
+	struct c2wr_qp_modify_req wr;
+	struct c2wr_qp_modify_rep *reply;
+	struct c2_vq_req *vq_req;
+	int err;
+
+	vq_req = vq_req_alloc(c2dev);
+	if (!vq_req)
+		return -ENOMEM;
+
+	c2_wr_set_id(&wr, CCWR_QP_MODIFY);
+	wr.hdr.context = (unsigned long) vq_req;
+	wr.rnic_handle = c2dev->adapter_handle;
+	wr.qp_handle = qp->adapter_handle;
+	wr.ord = cpu_to_be32(ord);
+	wr.ird = cpu_to_be32(ird);
+	wr.sq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE);
+	wr.rq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE);
+	wr.next_qp_state = cpu_to_be32(C2_QP_NO_ATTR_CHANGE);
+
+	/* reference the request struct */
+	vq_req_get(c2dev, vq_req);
+
+	err = vq_send_wr(c2dev, (union c2wr *) & wr);
+	if (err) {
+		vq_req_put(c2dev, vq_req);
+		goto bail0;
+	}
+
+	err = vq_wait_for_reply(c2dev, vq_req);
+	if (err)
+		goto bail0;
+
+	reply = (struct c2wr_qp_modify_rep *) (unsigned long)
+		vq_req->reply_msg;
+	if (!reply) {
+		err = -ENOMEM;
+		goto bail0;
+	}
+
+	err = c2_errno(reply);
+	vq_repbuf_free(c2dev, reply);
+      bail0:
+	vq_req_free(c2dev, vq_req);
+	return err;
+}
+
+static int destroy_qp(struct c2_dev *c2dev, struct c2_qp *qp)
+{
+	struct c2_vq_req *vq_req;
+	struct c2wr_qp_destroy_req wr;
+	struct c2wr_qp_destroy_rep *reply;
+	unsigned long flags;
+	int err;
+
+	/*
+	 * Allocate a verb request message
+	 */
+	vq_req = vq_req_alloc(c2dev);
+	if (!vq_req) {
+		return -ENOMEM;
+	}
+
+	/*
+	 * Initialize the WR
+	 */
+	c2_wr_set_id(&wr, CCWR_QP_DESTROY);
+	wr.hdr.context = (unsigned long) vq_req;
+	wr.rnic_handle = c2dev->adapter_handle;
+	wr.qp_handle = qp->adapter_handle;
+
+	/*
+	 * reference the request struct. dereferenced in the int handler.
+	 */
+	vq_req_get(c2dev, vq_req);
+
+	spin_lock_irqsave(&qp->lock, flags);
+	if (qp->cm_id && qp->state == IB_QPS_RTS) {
+		pr_debug("destroy_qp: generating CLOSE event for QP-->ERR, "
+			"qp=%p, cm_id=%p\n",qp,qp->cm_id);
+		/* Generate an CLOSE event */
+		vq_req->qp = qp;
+		vq_req->cm_id = qp->cm_id;
+		vq_req->event = IW_CM_EVENT_CLOSE;
+	}
+	spin_unlock_irqrestore(&qp->lock, flags);
+
+	/*
+	 * Send WR to adapter
+	 */
+	err = vq_send_wr(c2dev, (union c2wr *) & wr);
+	if (err) {
+		vq_req_put(c2dev, vq_req);
+		goto bail0;
+	}
+
+	/*
+	 * Wait for reply from adapter
+	 */
+	err = vq_wait_for_reply(c2dev, vq_req);
+	if (err) {
+		goto bail0;
+	}
+
+	/*
+	 * Process reply
+	 */
+	reply = (struct c2wr_qp_destroy_rep *) (unsigned long) (vq_req->reply_msg);
+	if (!reply) {
+		err = -ENOMEM;
+		goto bail0;
+	}
+
+	spin_lock_irqsave(&qp->lock, flags);
+	if (qp->cm_id) {
+		qp->cm_id->rem_ref(qp->cm_id);
+		qp->cm_id = NULL;
+	}
+	spin_unlock_irqrestore(&qp->lock, flags);
+
+	vq_repbuf_free(c2dev, reply);
+      bail0:
+	vq_req_free(c2dev, vq_req);
+	return err;
+}
+
+int c2_alloc_qp(struct c2_dev *c2dev,
+		struct c2_pd *pd,
+		struct ib_qp_init_attr *qp_attrs, struct c2_qp *qp)
+{
+	struct c2wr_qp_create_req wr;
+	struct c2wr_qp_create_rep *reply;
+	struct c2_vq_req *vq_req;
+	struct c2_cq *send_cq = 
to_c2cq(qp_attrs->send_cq); + struct c2_cq *recv_cq = to_c2cq(qp_attrs->recv_cq); + unsigned long peer_pa; + u32 q_size, msg_size, mmap_size; + void __iomem *mmap; + int err; + + qp->qpn = c2_alloc(&c2dev->qp_table.alloc); + if (qp->qpn == -1) + return -ENOMEM; + + qp->ibqp.qp_num = qp->qpn; + qp->ibqp.qp_type = IB_QPT_RC; + + /* Allocate the SQ and RQ shared pointers */ + qp->sq_mq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); + if (!qp->sq_mq.shared) { + err = -ENOMEM; + goto bail0; + } + + qp->rq_mq.shared = c2_alloc_mqsp(c2dev->kern_mqsp_pool); + if (!qp->rq_mq.shared) { + err = -ENOMEM; + goto bail1; + } + + /* Allocate the verbs request */ + vq_req = vq_req_alloc(c2dev); + if (vq_req == NULL) { + err = -ENOMEM; + goto bail2; + } + + /* Initialize the work request */ + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_QP_CREATE); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.sq_cq_handle = send_cq->adapter_handle; + wr.rq_cq_handle = recv_cq->adapter_handle; + wr.sq_depth = cpu_to_be32(qp_attrs->cap.max_send_wr + 1); + wr.rq_depth = cpu_to_be32(qp_attrs->cap.max_recv_wr + 1); + wr.srq_handle = 0; + wr.flags = cpu_to_be32(QP_RDMA_READ | QP_RDMA_WRITE | QP_MW_BIND | + QP_ZERO_STAG | QP_RDMA_READ_RESPONSE); + wr.send_sgl_depth = cpu_to_be32(qp_attrs->cap.max_send_sge); + wr.recv_sgl_depth = cpu_to_be32(qp_attrs->cap.max_recv_sge); + wr.rdma_write_sgl_depth = cpu_to_be32(qp_attrs->cap.max_send_sge); + // XXX no write depth? 
+ wr.shared_sq_ht = cpu_to_be64(__pa(qp->sq_mq.shared)); + wr.shared_rq_ht = cpu_to_be64(__pa(qp->rq_mq.shared)); + wr.ord = cpu_to_be32(C2_MAX_ORD_PER_QP); + wr.ird = cpu_to_be32(C2_MAX_IRD_PER_QP); + wr.pd_id = pd->pd_id; + wr.user_context = (unsigned long) qp; + + vq_req_get(c2dev, vq_req); + + /* Send the WR to the adapter */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail3; + } + + /* Wait for the verb reply */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail3; + } + + /* Process the reply */ + reply = (struct c2wr_qp_create_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail3; + } + + if ((err = c2_wr_get_result(reply)) != 0) { + goto bail4; + } + + /* Fill in the kernel QP struct */ + atomic_set(&qp->refcount, 1); + qp->adapter_handle = reply->qp_handle; + qp->state = IB_QPS_RESET; + qp->send_sgl_depth = qp_attrs->cap.max_send_sge; + qp->rdma_write_sgl_depth = qp_attrs->cap.max_send_sge; + qp->recv_sgl_depth = qp_attrs->cap.max_recv_sge; + + /* Initialize the SQ MQ */ + q_size = be32_to_cpu(reply->sq_depth); + msg_size = be32_to_cpu(reply->sq_msg_size); + peer_pa = c2dev->pa + be32_to_cpu(reply->sq_mq_start); + mmap_size = PAGE_ALIGN(sizeof(struct c2_mq_shared) + msg_size * q_size); + mmap = ioremap_nocache(peer_pa, mmap_size); + if (!mmap) { + err = -ENOMEM; + goto bail5; + } + + c2_mq_req_init(&qp->sq_mq, + be32_to_cpu(reply->sq_mq_index), + q_size, + msg_size, + mmap + sizeof(struct c2_mq_shared), /* pool start */ + mmap, /* peer */ + C2_MQ_ADAPTER_TARGET); + + /* Initialize the RQ mq */ + q_size = be32_to_cpu(reply->rq_depth); + msg_size = be32_to_cpu(reply->rq_msg_size); + peer_pa = c2dev->pa + be32_to_cpu(reply->rq_mq_start); + mmap_size = PAGE_ALIGN(sizeof(struct c2_mq_shared) + msg_size * q_size); + mmap = ioremap_nocache(peer_pa, mmap_size); + if (!mmap) { + err = -ENOMEM; + goto bail6; + } + + c2_mq_req_init(&qp->rq_mq, + 
be32_to_cpu(reply->rq_mq_index), + q_size, + msg_size, + mmap + sizeof(struct c2_mq_shared), /* pool start */ + mmap, /* peer */ + C2_MQ_ADAPTER_TARGET); + + vq_repbuf_free(c2dev, reply); + vq_req_free(c2dev, vq_req); + + spin_lock_irq(&c2dev->qp_table.lock); + c2_array_set(&c2dev->qp_table.qp, qp->qpn & (c2dev->props.max_qp - 1), qp); + c2dev->qp_table.map[qp->qpn] = qp; + spin_unlock_irq(&c2dev->qp_table.lock); + + return 0; + + bail6: + iounmap(qp->sq_mq.peer); + bail5: + destroy_qp(c2dev, qp); + bail4: + vq_repbuf_free(c2dev, reply); + bail3: + vq_req_free(c2dev, vq_req); + bail2: + c2_free_mqsp(qp->rq_mq.shared); + bail1: + c2_free_mqsp(qp->sq_mq.shared); + bail0: + c2_free(&c2dev->qp_table.alloc, qp->qpn); + return err; +} + +void c2_free_qp(struct c2_dev *c2dev, struct c2_qp *qp) +{ + struct c2_cq *send_cq; + struct c2_cq *recv_cq; + + send_cq = to_c2cq(qp->ibqp.send_cq); + recv_cq = to_c2cq(qp->ibqp.recv_cq); + + /* + * Lock CQs here, so that CQ polling code can do QP lookup + * without taking a lock. + */ + spin_lock_irq(&send_cq->lock); + if (send_cq != recv_cq) + spin_lock(&recv_cq->lock); + + spin_lock(&c2dev->qp_table.lock); + c2_array_clear(&c2dev->qp_table.qp, qp->qpn & (c2dev->props.max_qp - 1)); + c2dev->qp_table.map[qp->qpn] = NULL; + spin_unlock(&c2dev->qp_table.lock); + + if (send_cq != recv_cq) + spin_unlock(&recv_cq->lock); + spin_unlock_irq(&send_cq->lock); + + /* + * Destroy the qp in the rnic... + */ + destroy_qp(c2dev, qp); + + /* + * Mark any unreaped CQEs as null and void. + */ + c2_cq_clean(c2dev, qp, send_cq->cqn); + if (send_cq != recv_cq) + c2_cq_clean(c2dev, qp, recv_cq->cqn); + /* + * Unmap the MQs and return the shared pointers + * to the message pool. 
+ */ + iounmap(qp->sq_mq.peer); + iounmap(qp->rq_mq.peer); + c2_free_mqsp(qp->sq_mq.shared); + c2_free_mqsp(qp->rq_mq.shared); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); + c2_free(&c2dev->qp_table.alloc, qp->qpn); +} + +/* + * Function: move_sgl + * + * Description: + * Move an SGL from the user's work request struct into a CCIL Work Request + * message, swapping to WR byte order and ensuring the total length doesn't + * overflow. + * + * IN: + * dst - ptr to CCIL Work Request message SGL memory. + * src - ptr to the consumer's SGL memory. + * + * OUT: none + * + * Return: + * CCIL status codes. + */ +static int +move_sgl(struct c2_data_addr * dst, struct ib_sge *src, int count, u32 * p_len, + u8 * actual_count) +{ + u32 tot = 0; /* running total */ + u8 acount = 0; /* running total non-0 len sge's */ + + while (count > 0) { + /* + * If the addition of this SGE causes the + * total SGL length to exceed 2^32-1, then + * fail-n-bail. + * + * If the current total plus the next element length + * wraps, then it will go negative and be less than the + * current total... + */ + if ((tot + src->length) < tot) { + return -EINVAL; + } + /* + * Bug: 1456 (as well as 1498 & 1643) + * Skip over any sge's supplied with len=0 + */ + if (src->length) { + tot += src->length; + dst->stag = cpu_to_be32(src->lkey); + dst->to = cpu_to_be64(src->addr); + dst->length = cpu_to_be32(src->length); + dst++; + acount++; + } + src++; + count--; + } + + if (acount == 0) { + /* + * Bug: 1476 (as well as 1498, 1456 and 1643) + * Setup the SGL in the WR to make it easier for the RNIC. + * This way, the FW doesn't have to deal with special cases. + * Setting length=0 should be sufficient. + */ + dst->stag = 0; + dst->to = 0; + dst->length = 0; + } + + *p_len = tot; + *actual_count = acount; + return 0; +} + +/* + * Function: c2_activity (private function) + * + * Description: + * Post an mq index to the host->adapter activity fifo. 
+ * + * IN: + * c2dev - ptr to c2dev structure + * mq_index - mq index to post + * shared - value most recently written to shared + * + * OUT: + * + * Return: + * none + */ +static inline void c2_activity(struct c2_dev *c2dev, u32 mq_index, u16 shared) +{ + /* + * First read the register to see if the FIFO is full, and if so, + * spin until it's not. This isn't perfect -- there is no + * synchronization among the clients of the register, but in + * practice it prevents multiple CPU from hammering the bus + * with PCI RETRY. Note that when this does happen, the card + * cannot get on the bus and the card and system hang in a + * deadlock -- thus the need for this code. [TOT] + */ + while (readl(c2dev->regs + PCI_BAR0_ADAPTER_HINT) & 0x80000000) { + set_current_state(TASK_UNINTERRUPTIBLE); + schedule_timeout(0); + } + + __raw_writel(C2_HINT_MAKE(mq_index, shared), + c2dev->regs + PCI_BAR0_ADAPTER_HINT); +} + +/* + * Function: qp_wr_post + * + * Description: + * This in-line function allocates a MQ msg, then moves the host-copy of + * the completed WR into msg. Then it posts the message. + * + * IN: + * q - ptr to user MQ. + * wr - ptr to host-copy of the WR. + * qp - ptr to user qp + * size - Number of bytes to post. Assumed to be divisible by 4. + * + * OUT: none + * + * Return: + * CCIL status codes. + */ +static int qp_wr_post(struct c2_mq *q, union c2wr * wr, struct c2_qp *qp, u32 size) +{ + union c2wr *msg; + + msg = c2_mq_alloc(q); + if (msg == NULL) { + return -EINVAL; + } +#ifdef CCMSGMAGIC + ((c2wr_hdr_t *) wr)->magic = cpu_to_be32(CCWR_MAGIC); +#endif + + /* + * Since all header fields in the WR are the same as the + * CQE, set the following so the adapter need not. 
+ */ + c2_wr_set_result(wr, CCERR_PENDING); + + /* + * Copy the wr down to the adapter + */ + memcpy((void *) msg, (void *) wr, size); + + c2_mq_produce(q); + return 0; +} + + +int c2_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, + struct ib_send_wr **bad_wr) +{ + struct c2_dev *c2dev = to_c2dev(ibqp->device); + struct c2_qp *qp = to_c2qp(ibqp); + union c2wr wr; + int err = 0; + + u32 flags; + u32 tot_len; + u8 actual_sge_count; + u32 msg_size; + + if (qp->state > IB_QPS_RTS) + return -EINVAL; + + while (ib_wr) { + + flags = 0; + wr.sqwr.sq_hdr.user_hdr.hdr.context = ib_wr->wr_id; + if (ib_wr->send_flags & IB_SEND_SIGNALED) { + flags |= SQ_SIGNALED; + } + + switch (ib_wr->opcode) { + case IB_WR_SEND: + if (ib_wr->send_flags & IB_SEND_SOLICITED) { + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE); + msg_size = sizeof(struct c2wr_send_req); + } else { + c2_wr_set_id(&wr, C2_WR_TYPE_SEND); + msg_size = sizeof(struct c2wr_send_req); + } + + wr.sqwr.send.remote_stag = 0; + msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge; + if (ib_wr->num_sge > qp->send_sgl_depth) { + err = -EINVAL; + break; + } + if (ib_wr->send_flags & IB_SEND_FENCE) { + flags |= SQ_READ_FENCE; + } + err = move_sgl((struct c2_data_addr *) & (wr.sqwr.send.data), + ib_wr->sg_list, + ib_wr->num_sge, + &tot_len, &actual_sge_count); + wr.sqwr.send.sge_len = cpu_to_be32(tot_len); + c2_wr_set_sge_count(&wr, actual_sge_count); + break; + case IB_WR_RDMA_WRITE: + c2_wr_set_id(&wr, C2_WR_TYPE_RDMA_WRITE); + msg_size = sizeof(struct c2wr_rdma_write_req) + + (sizeof(struct c2_data_addr) * ib_wr->num_sge); + if (ib_wr->num_sge > qp->rdma_write_sgl_depth) { + err = -EINVAL; + break; + } + if (ib_wr->send_flags & IB_SEND_FENCE) { + flags |= SQ_READ_FENCE; + } + wr.sqwr.rdma_write.remote_stag = + cpu_to_be32(ib_wr->wr.rdma.rkey); + wr.sqwr.rdma_write.remote_to = + cpu_to_be64(ib_wr->wr.rdma.remote_addr); + err = move_sgl((struct c2_data_addr *) + & (wr.sqwr.rdma_write.data), + ib_wr->sg_list, + 
ib_wr->num_sge, + &tot_len, &actual_sge_count); + wr.sqwr.rdma_write.sge_len = cpu_to_be32(tot_len); + c2_wr_set_sge_count(&wr, actual_sge_count); + break; + case IB_WR_RDMA_READ: + c2_wr_set_id(&wr, C2_WR_TYPE_RDMA_READ); + msg_size = sizeof(struct c2wr_rdma_read_req); + + /* iWARP only supports 1 sge for RDMA reads */ + if (ib_wr->num_sge > 1) { + err = -EINVAL; + break; + } + + /* + * Move the local and remote stag/to/len into the WR. + */ + wr.sqwr.rdma_read.local_stag = + cpu_to_be32(ib_wr->sg_list->lkey); + wr.sqwr.rdma_read.local_to = + cpu_to_be64(ib_wr->sg_list->addr); + wr.sqwr.rdma_read.remote_stag = + cpu_to_be32(ib_wr->wr.rdma.rkey); + wr.sqwr.rdma_read.remote_to = + cpu_to_be64(ib_wr->wr.rdma.remote_addr); + wr.sqwr.rdma_read.length = + cpu_to_be32(ib_wr->sg_list->length); + break; + default: + /* error */ + msg_size = 0; + err = -EINVAL; + break; + } + + /* + * If we had an error on the last wr build, then + * break out. Possible errors include bogus WR + * type, and a bogus SGL length... + */ + if (err) { + break; + } + + /* + * Store flags + */ + c2_wr_set_flags(&wr, flags); + + /* + * Post the puppy! + */ + err = qp_wr_post(&qp->sq_mq, &wr, qp, msg_size); + if (err) { + break; + } + + /* + * Enqueue mq index to activity FIFO. 
+ */ + c2_activity(c2dev, qp->sq_mq.index, qp->sq_mq.hint_count); + + ib_wr = ib_wr->next; + } + + if (err) + *bad_wr = ib_wr; + return err; +} + +int c2_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr, + struct ib_recv_wr **bad_wr) +{ + struct c2_dev *c2dev = to_c2dev(ibqp->device); + struct c2_qp *qp = to_c2qp(ibqp); + union c2wr wr; + int err = 0; + + if (qp->state > IB_QPS_RTS) + return -EINVAL; + + /* + * Try and post each work request + */ + while (ib_wr) { + u32 tot_len; + u8 actual_sge_count; + + if (ib_wr->num_sge > qp->recv_sgl_depth) { + err = -EINVAL; + break; + } + + /* + * Create local host-copy of the WR + */ + wr.rqwr.rq_hdr.user_hdr.hdr.context = ib_wr->wr_id; + c2_wr_set_id(&wr, CCWR_RECV); + c2_wr_set_flags(&wr, 0); + + /* sge_count is limited to eight bits. */ + BUG_ON(ib_wr->num_sge >= 256); + err = move_sgl((struct c2_data_addr *) & (wr.rqwr.data), + ib_wr->sg_list, + ib_wr->num_sge, &tot_len, &actual_sge_count); + c2_wr_set_sge_count(&wr, actual_sge_count); + + /* + * If we had an error on the last wr build, then + * break out. Possible errors include bogus WR + * type, and a bogus SGL length... 
+ */ + if (err) { + break; + } + + err = qp_wr_post(&qp->rq_mq, &wr, qp, qp->rq_mq.msg_size); + if (err) { + break; + } + + /* + * Enqueue mq index to activity FIFO + */ + c2_activity(c2dev, qp->rq_mq.index, qp->rq_mq.hint_count); + + ib_wr = ib_wr->next; + } + + if (err) + *bad_wr = ib_wr; + return err; +} + +int __devinit c2_init_qp_table(struct c2_dev *c2dev) +{ + int err; + + spin_lock_init(&c2dev->qp_table.lock); + + err = c2_alloc_init(&c2dev->qp_table.alloc, + c2dev->props.max_qp, 1); + if (err) + return err; + + err = c2_array_init(&c2dev->qp_table.qp, c2dev->props.max_qp); + if (err) { + c2_alloc_cleanup(&c2dev->qp_table.alloc); + return err; + } + + c2dev->qp_table.map = vmalloc(sizeof(struct c2_qp *) * c2dev->props.max_qp); + if (!c2dev->qp_table.map) { + pr_debug("Could not allocate QPN <-> QP map\n"); + c2_alloc_cleanup(&c2dev->qp_table.alloc); + c2_array_cleanup(&c2dev->qp_table.qp, c2dev->props.max_qp); + return -ENOMEM; + } + + return 0; +} + +void __devexit c2_cleanup_qp_table(struct c2_dev *c2dev) +{ + c2_alloc_cleanup(&c2dev->qp_table.alloc); + c2_array_cleanup(&c2dev->qp_table.qp, c2dev->props.max_qp); +} diff --git a/drivers/infiniband/hw/amso1100/c2_user.h b/drivers/infiniband/hw/amso1100/c2_user.h new file mode 100644 index 0000000..7e9e7ad --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_user.h @@ -0,0 +1,82 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef C2_USER_H +#define C2_USER_H + +#include + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + * In particular do not use pointer types -- pass pointers in __u64 + * instead. 
+ */ + +struct c2_alloc_ucontext_resp { + __u32 qp_tab_size; + __u32 uarc_size; +}; + +struct c2_alloc_pd_resp { + __u32 pdn; + __u32 reserved; +}; + +struct c2_create_cq { + __u32 lkey; + __u32 pdn; + __u64 arm_db_page; + __u64 set_db_page; + __u32 arm_db_index; + __u32 set_db_index; +}; + +struct c2_create_cq_resp { + __u32 cqn; + __u32 reserved; +}; + +struct c2_create_qp { + __u32 lkey; + __u32 reserved; + __u64 sq_db_page; + __u64 rq_db_page; + __u32 sq_db_index; + __u32 rq_db_index; +}; + +#endif /* C2_USER_H */ From swise at opengridcomputing.com Wed Jun 7 13:06:51 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:06:51 -0500 Subject: [openib-general] [PATCH v2 2/7] AMSO1100 WR / Event Definitions. In-Reply-To: <20060607200646.9259.24588.stgit@stevo-desktop> References: <20060607200646.9259.24588.stgit@stevo-desktop> Message-ID: <20060607200651.9259.73654.stgit@stevo-desktop> Review Changes: C2_DEBUG -> DEBUG --- drivers/infiniband/hw/amso1100/c2_ae.h | 108 ++ drivers/infiniband/hw/amso1100/c2_status.h | 158 +++ drivers/infiniband/hw/amso1100/c2_wr.h | 1523 ++++++++++++++++++++++++++++ 3 files changed, 1789 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_ae.h b/drivers/infiniband/hw/amso1100/c2_ae.h new file mode 100644 index 0000000..3a065c3 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_ae.h @@ -0,0 +1,108 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _C2_AE_H_ +#define _C2_AE_H_ + +/* + * WARNING: If you change this file, also bump C2_IVN_BASE + * in common/include/clustercore/c2_ivn.h. + */ + +/* + * Asynchronous Event Identifiers + * + * These start at 0x80 only so it's obvious from inspection that + * they are not work-request statuses. This isn't critical. + * + * NOTE: these event id's must fit in eight bits. 
+ */ +enum c2_event_id { + CCAE_REMOTE_SHUTDOWN = 0x80, + CCAE_ACTIVE_CONNECT_RESULTS, + CCAE_CONNECTION_REQUEST, + CCAE_LLP_CLOSE_COMPLETE, + CCAE_TERMINATE_MESSAGE_RECEIVED, + CCAE_LLP_CONNECTION_RESET, + CCAE_LLP_CONNECTION_LOST, + CCAE_LLP_SEGMENT_SIZE_INVALID, + CCAE_LLP_INVALID_CRC, + CCAE_LLP_BAD_FPDU, + CCAE_INVALID_DDP_VERSION, + CCAE_INVALID_RDMA_VERSION, + CCAE_UNEXPECTED_OPCODE, + CCAE_INVALID_DDP_QUEUE_NUMBER, + CCAE_RDMA_READ_NOT_ENABLED, + CCAE_RDMA_WRITE_NOT_ENABLED, + CCAE_RDMA_READ_TOO_SMALL, + CCAE_NO_L_BIT, + CCAE_TAGGED_INVALID_STAG, + CCAE_TAGGED_BASE_BOUNDS_VIOLATION, + CCAE_TAGGED_ACCESS_RIGHTS_VIOLATION, + CCAE_TAGGED_INVALID_PD, + CCAE_WRAP_ERROR, + CCAE_BAD_CLOSE, + CCAE_BAD_LLP_CLOSE, + CCAE_INVALID_MSN_RANGE, + CCAE_INVALID_MSN_GAP, + CCAE_IRRQ_OVERFLOW, + CCAE_IRRQ_MSN_GAP, + CCAE_IRRQ_MSN_RANGE, + CCAE_IRRQ_INVALID_STAG, + CCAE_IRRQ_BASE_BOUNDS_VIOLATION, + CCAE_IRRQ_ACCESS_RIGHTS_VIOLATION, + CCAE_IRRQ_INVALID_PD, + CCAE_IRRQ_WRAP_ERROR, + CCAE_CQ_SQ_COMPLETION_OVERFLOW, + CCAE_CQ_RQ_COMPLETION_ERROR, + CCAE_QP_SRQ_WQE_ERROR, + CCAE_QP_LOCAL_CATASTROPHIC_ERROR, + CCAE_CQ_OVERFLOW, + CCAE_CQ_OPERATION_ERROR, + CCAE_SRQ_LIMIT_REACHED, + CCAE_QP_RQ_LIMIT_REACHED, + CCAE_SRQ_CATASTROPHIC_ERROR, + CCAE_RNIC_CATASTROPHIC_ERROR +/* WARNING If you add more id's, make sure their values fit in eight bits. */ +}; + +/* + * Resource Indicators and Identifiers + */ +enum c2_resource_indicator { + C2_RES_IND_QP = 1, + C2_RES_IND_EP, + C2_RES_IND_CQ, + C2_RES_IND_SRQ, +}; + +#endif /* _C2_AE_H_ */ diff --git a/drivers/infiniband/hw/amso1100/c2_status.h b/drivers/infiniband/hw/amso1100/c2_status.h new file mode 100644 index 0000000..6ee4aa9 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_status.h @@ -0,0 +1,158 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef _C2_STATUS_H_ +#define _C2_STATUS_H_ + +/* + * Verbs Status Codes + */ +enum c2_status { + C2_OK = 0, /* This must be zero */ + CCERR_INSUFFICIENT_RESOURCES = 1, + CCERR_INVALID_MODIFIER = 2, + CCERR_INVALID_MODE = 3, + CCERR_IN_USE = 4, + CCERR_INVALID_RNIC = 5, + CCERR_INTERRUPTED_OPERATION = 6, + CCERR_INVALID_EH = 7, + CCERR_INVALID_CQ = 8, + CCERR_CQ_EMPTY = 9, + CCERR_NOT_IMPLEMENTED = 10, + CCERR_CQ_DEPTH_TOO_SMALL = 11, + CCERR_PD_IN_USE = 12, + CCERR_INVALID_PD = 13, + CCERR_INVALID_SRQ = 14, + CCERR_INVALID_ADDRESS = 15, + CCERR_INVALID_NETMASK = 16, + CCERR_INVALID_QP = 17, + CCERR_INVALID_QP_STATE = 18, + CCERR_TOO_MANY_WRS_POSTED = 19, + CCERR_INVALID_WR_TYPE = 20, + CCERR_INVALID_SGL_LENGTH = 21, + CCERR_INVALID_SQ_DEPTH = 22, + CCERR_INVALID_RQ_DEPTH = 23, + CCERR_INVALID_ORD = 24, + CCERR_INVALID_IRD = 25, + CCERR_QP_ATTR_CANNOT_CHANGE = 26, + CCERR_INVALID_STAG = 27, + CCERR_QP_IN_USE = 28, + CCERR_OUTSTANDING_WRS = 29, + CCERR_STAG_IN_USE = 30, + CCERR_INVALID_STAG_INDEX = 31, + CCERR_INVALID_SGL_FORMAT = 32, + CCERR_ADAPTER_TIMEOUT = 33, + CCERR_INVALID_CQ_DEPTH = 34, + CCERR_INVALID_PRIVATE_DATA_LENGTH = 35, + CCERR_INVALID_EP = 36, + CCERR_MR_IN_USE = CCERR_STAG_IN_USE, + CCERR_FLUSHED = 38, + CCERR_INVALID_WQE = 39, + CCERR_LOCAL_QP_CATASTROPHIC_ERROR = 40, + CCERR_REMOTE_TERMINATION_ERROR = 41, + CCERR_BASE_AND_BOUNDS_VIOLATION = 42, + CCERR_ACCESS_VIOLATION = 43, + CCERR_INVALID_PD_ID = 44, + CCERR_WRAP_ERROR = 45, + CCERR_INV_STAG_ACCESS_ERROR = 46, + CCERR_ZERO_RDMA_READ_RESOURCES = 47, + CCERR_QP_NOT_PRIVILEGED = 48, + CCERR_STAG_STATE_NOT_INVALID = 49, + CCERR_INVALID_PAGE_SIZE = 50, + CCERR_INVALID_BUFFER_SIZE = 51, + CCERR_INVALID_PBE = 52, + CCERR_INVALID_FBO = 53, + CCERR_INVALID_LENGTH = 54, + CCERR_INVALID_ACCESS_RIGHTS = 55, + CCERR_PBL_TOO_BIG = 56, + CCERR_INVALID_VA = 57, + CCERR_INVALID_REGION = 58, + CCERR_INVALID_WINDOW = 59, + CCERR_TOTAL_LENGTH_TOO_BIG = 60, + CCERR_INVALID_QP_ID = 61, + CCERR_ADDR_IN_USE = 
62, + CCERR_ADDR_NOT_AVAIL = 63, + CCERR_NET_DOWN = 64, + CCERR_NET_UNREACHABLE = 65, + CCERR_CONN_ABORTED = 66, + CCERR_CONN_RESET = 67, + CCERR_NO_BUFS = 68, + CCERR_CONN_TIMEDOUT = 69, + CCERR_CONN_REFUSED = 70, + CCERR_HOST_UNREACHABLE = 71, + CCERR_INVALID_SEND_SGL_DEPTH = 72, + CCERR_INVALID_RECV_SGL_DEPTH = 73, + CCERR_INVALID_RDMA_WRITE_SGL_DEPTH = 74, + CCERR_INSUFFICIENT_PRIVILEGES = 75, + CCERR_STACK_ERROR = 76, + CCERR_INVALID_VERSION = 77, + CCERR_INVALID_MTU = 78, + CCERR_INVALID_IMAGE = 79, + CCERR_PENDING = 98, /* not an error; used internally by adapter */ + CCERR_DEFER = 99, /* not an error; used internally by adapter */ + CCERR_FAILED_WRITE = 100, + CCERR_FAILED_ERASE = 101, + CCERR_FAILED_VERIFICATION = 102, + CCERR_NOT_FOUND = 103, + +}; + +/* + * CCAE_ACTIVE_CONNECT_RESULTS status result codes. + */ +enum c2_connect_status { + C2_CONN_STATUS_SUCCESS = C2_OK, + C2_CONN_STATUS_NO_MEM = CCERR_INSUFFICIENT_RESOURCES, + C2_CONN_STATUS_TIMEDOUT = CCERR_CONN_TIMEDOUT, + C2_CONN_STATUS_REFUSED = CCERR_CONN_REFUSED, + C2_CONN_STATUS_NETUNREACH = CCERR_NET_UNREACHABLE, + C2_CONN_STATUS_HOSTUNREACH = CCERR_HOST_UNREACHABLE, + C2_CONN_STATUS_INVALID_RNIC = CCERR_INVALID_RNIC, + C2_CONN_STATUS_INVALID_QP = CCERR_INVALID_QP, + C2_CONN_STATUS_INVALID_QP_STATE = CCERR_INVALID_QP_STATE, + C2_CONN_STATUS_REJECTED = CCERR_CONN_RESET, + C2_CONN_STATUS_ADDR_NOT_AVAIL = CCERR_ADDR_NOT_AVAIL, +}; + +/* + * Flash programming status codes. + */ +enum c2_flash_status { + C2_FLASH_STATUS_SUCCESS = 0x0000, + C2_FLASH_STATUS_VERIFY_ERR = 0x0002, + C2_FLASH_STATUS_IMAGE_ERR = 0x0004, + C2_FLASH_STATUS_ECLBS = 0x0400, + C2_FLASH_STATUS_PSLBS = 0x0800, + C2_FLASH_STATUS_VPENS = 0x1000, +}; + +#endif /* _C2_STATUS_H_ */ diff --git a/drivers/infiniband/hw/amso1100/c2_wr.h b/drivers/infiniband/hw/amso1100/c2_wr.h new file mode 100644 index 0000000..9d6468d --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_wr.h @@ -0,0 +1,1523 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. 
All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _C2_WR_H_ +#define _C2_WR_H_ + +#ifdef CCDEBUG +#define CCWR_MAGIC 0xb07700b0 +#endif + +#define C2_QP_NO_ATTR_CHANGE 0xFFFFFFFF + +/* Maximum allowed size in bytes of private_data exchange + * on connect. + */ +#define C2_MAX_PRIVATE_DATA_SIZE 200 + +/* + * These types are shared among the adapter, host, and CCIL consumer. 
+ */ +enum c2_cq_notification_type { + C2_CQ_NOTIFICATION_TYPE_NONE = 1, + C2_CQ_NOTIFICATION_TYPE_NEXT, + C2_CQ_NOTIFICATION_TYPE_NEXT_SE +}; + +enum c2_setconfig_cmd { + C2_CFG_ADD_ADDR = 1, + C2_CFG_DEL_ADDR = 2, + C2_CFG_ADD_ROUTE = 3, + C2_CFG_DEL_ROUTE = 4 +}; + +enum c2_getconfig_cmd { + C2_GETCONFIG_ROUTES = 1, + C2_GETCONFIG_ADDRS +}; + +/* + * CCIL Work Request Identifiers + */ +enum c2wr_ids { + CCWR_RNIC_OPEN = 1, + CCWR_RNIC_QUERY, + CCWR_RNIC_SETCONFIG, + CCWR_RNIC_GETCONFIG, + CCWR_RNIC_CLOSE, + CCWR_CQ_CREATE, + CCWR_CQ_QUERY, + CCWR_CQ_MODIFY, + CCWR_CQ_DESTROY, + CCWR_QP_CONNECT, + CCWR_PD_ALLOC, + CCWR_PD_DEALLOC, + CCWR_SRQ_CREATE, + CCWR_SRQ_QUERY, + CCWR_SRQ_MODIFY, + CCWR_SRQ_DESTROY, + CCWR_QP_CREATE, + CCWR_QP_QUERY, + CCWR_QP_MODIFY, + CCWR_QP_DESTROY, + CCWR_NSMR_STAG_ALLOC, + CCWR_NSMR_REGISTER, + CCWR_NSMR_PBL, + CCWR_STAG_DEALLOC, + CCWR_NSMR_REREGISTER, + CCWR_SMR_REGISTER, + CCWR_MR_QUERY, + CCWR_MW_ALLOC, + CCWR_MW_QUERY, + CCWR_EP_CREATE, + CCWR_EP_GETOPT, + CCWR_EP_SETOPT, + CCWR_EP_DESTROY, + CCWR_EP_BIND, + CCWR_EP_CONNECT, + CCWR_EP_LISTEN, + CCWR_EP_SHUTDOWN, + CCWR_EP_LISTEN_CREATE, + CCWR_EP_LISTEN_DESTROY, + CCWR_EP_QUERY, + CCWR_CR_ACCEPT, + CCWR_CR_REJECT, + CCWR_CONSOLE, + CCWR_TERM, + CCWR_FLASH_INIT, + CCWR_FLASH, + CCWR_BUF_ALLOC, + CCWR_BUF_FREE, + CCWR_FLASH_WRITE, + CCWR_INIT, /* WARNING: Don't move this ever again! */ + + + + /* Add new IDs here */ + + + + /* + * WARNING: CCWR_LAST must always be the last verbs id defined! + * All the preceding IDs are fixed, and must not change. + * You can add new IDs, but must not remove or reorder + * any IDs. If you do, YOU will ruin any hope of + * compatibility between versions. + */ + CCWR_LAST, + + /* + * Start over at 1 so that arrays indexed by user wr id's + * begin at 1. This is OK since the verbs and user wr id's + * are always used on disjoint sets of queues. 
+ */ + /* + * The order of the CCWR_SEND_XX verbs must + * match the order of the RDMA_OPs + */ + CCWR_SEND = 1, + CCWR_SEND_INV, + CCWR_SEND_SE, + CCWR_SEND_SE_INV, + CCWR_RDMA_WRITE, + CCWR_RDMA_READ, + CCWR_RDMA_READ_INV, + CCWR_MW_BIND, + CCWR_NSMR_FASTREG, + CCWR_STAG_INVALIDATE, + CCWR_RECV, + CCWR_NOP, + CCWR_UNIMPL, +/* WARNING: This must always be the last user wr id defined! */ +}; +#define RDMA_SEND_OPCODE_FROM_WR_ID(x) (x+2) + +/* + * SQ/RQ Work Request Types + */ +enum c2_wr_type { + C2_WR_TYPE_SEND = CCWR_SEND, + C2_WR_TYPE_SEND_SE = CCWR_SEND_SE, + C2_WR_TYPE_SEND_INV = CCWR_SEND_INV, + C2_WR_TYPE_SEND_SE_INV = CCWR_SEND_SE_INV, + C2_WR_TYPE_RDMA_WRITE = CCWR_RDMA_WRITE, + C2_WR_TYPE_RDMA_READ = CCWR_RDMA_READ, + C2_WR_TYPE_RDMA_READ_INV_STAG = CCWR_RDMA_READ_INV, + C2_WR_TYPE_BIND_MW = CCWR_MW_BIND, + C2_WR_TYPE_FASTREG_NSMR = CCWR_NSMR_FASTREG, + C2_WR_TYPE_INV_STAG = CCWR_STAG_INVALIDATE, + C2_WR_TYPE_RECV = CCWR_RECV, + C2_WR_TYPE_NOP = CCWR_NOP, +}; + +struct c2_netaddr { + u32 ip_addr; + u32 netmask; + u32 mtu; +}; + +struct c2_route { + u32 ip_addr; /* 0 indicates the default route */ + u32 netmask; /* netmask associated with dst */ + u32 flags; + union { + u32 ipaddr; /* address of the nexthop interface */ + u8 enaddr[6]; + } nexthop; +}; + +/* + * A Scatter Gather Entry. + */ +struct c2_data_addr { + u32 stag; + u32 length; + u64 to; +}; + +/* + * MR and MW flags used by the consumer, RI, and RNIC. + */ +enum c2_mm_flags { + MEM_REMOTE = 0x0001, /* allow mw binds with remote access. 
*/ + MEM_VA_BASED = 0x0002, /* Not Zero-based */ + MEM_PBL_COMPLETE = 0x0004, /* PBL array is complete in this msg */ + MEM_LOCAL_READ = 0x0008, /* allow local reads */ + MEM_LOCAL_WRITE = 0x0010, /* allow local writes */ + MEM_REMOTE_READ = 0x0020, /* allow remote reads */ + MEM_REMOTE_WRITE = 0x0040, /* allow remote writes */ + MEM_WINDOW_BIND = 0x0080, /* binds allowed */ + MEM_SHARED = 0x0100, /* set if MR is shared */ + MEM_STAG_VALID = 0x0200 /* set if STAG is in valid state */ +}; + +/* + * CCIL API ACF flags defined in terms of the low level mem flags. + * This minimizes the translation needed in the user API. + */ +enum c2_acf { + C2_ACF_LOCAL_READ = MEM_LOCAL_READ, + C2_ACF_LOCAL_WRITE = MEM_LOCAL_WRITE, + C2_ACF_REMOTE_READ = MEM_REMOTE_READ, + C2_ACF_REMOTE_WRITE = MEM_REMOTE_WRITE, + C2_ACF_WINDOW_BIND = MEM_WINDOW_BIND +}; + +/* + * Image types of objects written to flash + */ +#define C2_FLASH_IMG_BITFILE 1 +#define C2_FLASH_IMG_OPTION_ROM 2 +#define C2_FLASH_IMG_VPD 3 + +/* + * To fix bug 1815 we define the max allowable size of the + * terminate message (per the IETF spec). Refer to the IETF + * protocol specification, section 12.1.6, page 64. + * The message is prefixed by 20 bytes of DDP info. + * + * Then the message has 6 bytes for the terminate control + * and DDP segment length info plus a DDP header (either + * 14 or 18 bytes) plus 28 bytes for the RDMA header. + * Thus the max size is: + * 20 + (6 + 18 + 28) = 72 + */ +#define C2_MAX_TERMINATE_MESSAGE_SIZE (72) + +/* + * Build String Length. It must be the same as C2_BUILD_STR_LEN in ccil_api.h + */ +#define WR_BUILD_STR_LEN 64 + +/* + * WARNING: All of these structs need to align any 64bit types on + * 64 bit boundaries! 64bit types include u64. + */ + +/* + * Clustercore Work Request Header. Be sensitive to field layout + * and alignment. + */ +struct c2wr_hdr { + /* wqe_count is part of the cqe.
It is put here so the + * adapter can write to it while the wr is pending without + * clobbering part of the wr. This word need not be dma'd + * from the host to adapter by libccil, but we copy it anyway + * to make the memcpy to the adapter better aligned. + */ + u32 wqe_count; + + /* Put these fields next so that later 32- and 64-bit + * quantities are naturally aligned. + */ + u8 id; + u8 result; /* adapter -> host */ + u8 sge_count; /* host -> adapter */ + u8 flags; /* host -> adapter */ + + u64 context; +#ifdef CCMSGMAGIC + u32 magic; + u32 pad; +#endif +} __attribute__((packed)); + +/* + *------------------------ RNIC ------------------------ + */ + +/* + * WR_RNIC_OPEN + */ + +/* + * Flags for the RNIC WRs + */ +enum c2_rnic_flags { + RNIC_IRD_STATIC = 0x0001, + RNIC_ORD_STATIC = 0x0002, + RNIC_QP_STATIC = 0x0004, + RNIC_SRQ_SUPPORTED = 0x0008, + RNIC_PBL_BLOCK_MODE = 0x0010, + RNIC_SRQ_MODEL_ARRIVAL = 0x0020, + RNIC_CQ_OVF_DETECTED = 0x0040, + RNIC_PRIV_MODE = 0x0080 +}; + +struct c2wr_rnic_open_req { + struct c2wr_hdr hdr; + u64 user_context; + u16 flags; /* See enum c2_rnic_flags */ + u16 port_num; +} __attribute__((packed)); + +struct c2wr_rnic_open_rep { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)); + +union c2wr_rnic_open { + struct c2wr_rnic_open_req req; + struct c2wr_rnic_open_rep rep; +} __attribute__((packed)); + +struct c2wr_rnic_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)); + +/* + * WR_RNIC_QUERY + */ +struct c2wr_rnic_query_rep { + struct c2wr_hdr hdr; + u64 user_context; + u32 vendor_id; + u32 part_number; + u32 hw_version; + u32 fw_ver_major; + u32 fw_ver_minor; + u32 fw_ver_patch; + char fw_ver_build_str[WR_BUILD_STR_LEN]; + u32 max_qps; + u32 max_qp_depth; + u32 max_srq_depth; + u32 max_send_sgl_depth; + u32 max_rdma_sgl_depth; + u32 max_cqs; + u32 max_cq_depth; + u32 max_cq_event_handlers; + u32 max_mrs; + u32 max_pbl_depth; + u32 max_pds; + u32 max_global_ird; + u32 
max_global_ord; + u32 max_qp_ird; + u32 max_qp_ord; + u32 flags; + u32 max_mws; + u32 pbe_range_low; + u32 pbe_range_high; + u32 max_srqs; + u32 page_size; +} __attribute__((packed)); + +union c2wr_rnic_query { + struct c2wr_rnic_query_req req; + struct c2wr_rnic_query_rep rep; +} __attribute__((packed)); + +/* + * WR_RNIC_GETCONFIG + */ + +struct c2wr_rnic_getconfig_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 option; /* see c2_getconfig_cmd_t */ + u64 reply_buf; + u32 reply_buf_len; +} __attribute__((packed)) ; + +struct c2wr_rnic_getconfig_rep { + struct c2wr_hdr hdr; + u32 option; /* see c2_getconfig_cmd_t */ + u32 count_len; /* length of the number of addresses configured */ +} __attribute__((packed)) ; + +union c2wr_rnic_getconfig { + struct c2wr_rnic_getconfig_req req; + struct c2wr_rnic_getconfig_rep rep; +} __attribute__((packed)) ; + +/* + * WR_RNIC_SETCONFIG + */ +struct c2wr_rnic_setconfig_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 option; /* See c2_setconfig_cmd_t */ + /* variable data and pad. 
See c2_netaddr and c2_route */ + u8 data[0]; +} __attribute__((packed)) ; + +struct c2wr_rnic_setconfig_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_rnic_setconfig { + struct c2wr_rnic_setconfig_req req; + struct c2wr_rnic_setconfig_rep rep; +} __attribute__((packed)) ; + +/* + * WR_RNIC_CLOSE + */ +struct c2wr_rnic_close_req { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)) ; + +struct c2wr_rnic_close_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_rnic_close { + struct c2wr_rnic_close_req req; + struct c2wr_rnic_close_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ CQ ------------------------ + */ +struct c2wr_cq_create_req { + struct c2wr_hdr hdr; + u64 shared_ht; + u64 user_context; + u64 msg_pool; + u32 rnic_handle; + u32 msg_size; + u32 depth; +} __attribute__((packed)) ; + +struct c2wr_cq_create_rep { + struct c2wr_hdr hdr; + u32 mq_index; + u32 adapter_shared; + u32 cq_handle; +} __attribute__((packed)) ; + +union c2wr_cq_create { + struct c2wr_cq_create_req req; + struct c2wr_cq_create_rep rep; +} __attribute__((packed)) ; + +struct c2wr_cq_modify_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 cq_handle; + u32 new_depth; + u64 new_msg_pool; +} __attribute__((packed)) ; + +struct c2wr_cq_modify_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_cq_modify { + struct c2wr_cq_modify_req req; + struct c2wr_cq_modify_rep rep; +} __attribute__((packed)) ; + +struct c2wr_cq_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 cq_handle; +} __attribute__((packed)) ; + +struct c2wr_cq_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_cq_destroy { + struct c2wr_cq_destroy_req req; + struct c2wr_cq_destroy_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ PD ------------------------ + */ +struct c2wr_pd_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pd_id; +} 
__attribute__((packed)) ; + +struct c2wr_pd_alloc_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_pd_alloc { + struct c2wr_pd_alloc_req req; + struct c2wr_pd_alloc_rep rep; +} __attribute__((packed)) ; + +struct c2wr_pd_dealloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_pd_dealloc_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_pd_dealloc { + struct c2wr_pd_dealloc_req req; + struct c2wr_pd_dealloc_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ SRQ ------------------------ + */ +struct c2wr_srq_create_req { + struct c2wr_hdr hdr; + u64 shared_ht; + u64 user_context; + u32 rnic_handle; + u32 srq_depth; + u32 srq_limit; + u32 sgl_depth; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_srq_create_rep { + struct c2wr_hdr hdr; + u32 srq_depth; + u32 sgl_depth; + u32 msg_size; + u32 mq_index; + u32 mq_start; + u32 srq_handle; +} __attribute__((packed)) ; + +union c2wr_srq_create { + struct c2wr_srq_create_req req; + struct c2wr_srq_create_rep rep; +} __attribute__((packed)) ; + +struct c2wr_srq_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 srq_handle; +} __attribute__((packed)) ; + +struct c2wr_srq_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_srq_destroy { + struct c2wr_srq_destroy_req req; + struct c2wr_srq_destroy_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ QP ------------------------ + */ +enum c2wr_qp_flags { + QP_RDMA_READ = 0x00000001, /* RDMA read enabled? */ + QP_RDMA_WRITE = 0x00000002, /* RDMA write enabled? */ + QP_MW_BIND = 0x00000004, /* MWs enabled */ + QP_ZERO_STAG = 0x00000008, /* enabled? */ + QP_REMOTE_TERMINATION = 0x00000010, /* remote end terminated */ + QP_RDMA_READ_RESPONSE = 0x00000020 /* Remote RDMA read */ + /* enabled? 
*/ +}; + +struct c2wr_qp_create_req { + struct c2wr_hdr hdr; + u64 shared_sq_ht; + u64 shared_rq_ht; + u64 user_context; + u32 rnic_handle; + u32 sq_cq_handle; + u32 rq_cq_handle; + u32 sq_depth; + u32 rq_depth; + u32 srq_handle; + u32 srq_limit; + u32 flags; /* see enum c2wr_qp_flags */ + u32 send_sgl_depth; + u32 recv_sgl_depth; + u32 rdma_write_sgl_depth; + u32 ord; + u32 ird; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_qp_create_rep { + struct c2wr_hdr hdr; + u32 sq_depth; + u32 rq_depth; + u32 send_sgl_depth; + u32 recv_sgl_depth; + u32 rdma_write_sgl_depth; + u32 ord; + u32 ird; + u32 sq_msg_size; + u32 sq_mq_index; + u32 sq_mq_start; + u32 rq_msg_size; + u32 rq_mq_index; + u32 rq_mq_start; + u32 qp_handle; +} __attribute__((packed)) ; + +union c2wr_qp_create { + struct c2wr_qp_create_req req; + struct c2wr_qp_create_rep rep; +} __attribute__((packed)) ; + +struct c2wr_qp_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; +} __attribute__((packed)) ; + +struct c2wr_qp_query_rep { + struct c2wr_hdr hdr; + u64 user_context; + u32 rnic_handle; + u32 sq_depth; + u32 rq_depth; + u32 send_sgl_depth; + u32 rdma_write_sgl_depth; + u32 recv_sgl_depth; + u32 ord; + u32 ird; + u16 qp_state; + u16 flags; /* see c2wr_qp_flags_t */ + u32 qp_id; + u32 local_addr; + u32 remote_addr; + u16 local_port; + u16 remote_port; + u32 terminate_msg_length; /* 0 if not present */ + u8 data[0]; + /* Terminate Message in-line here. 
*/ +} __attribute__((packed)) ; + +union c2wr_qp_query { + struct c2wr_qp_query_req req; + struct c2wr_qp_query_rep rep; +} __attribute__((packed)) ; + +struct c2wr_qp_modify_req { + struct c2wr_hdr hdr; + u64 stream_msg; + u32 stream_msg_length; + u32 rnic_handle; + u32 qp_handle; + u32 next_qp_state; + u32 ord; + u32 ird; + u32 sq_depth; + u32 rq_depth; + u32 llp_ep_handle; +} __attribute__((packed)) ; + +struct c2wr_qp_modify_rep { + struct c2wr_hdr hdr; + u32 ord; + u32 ird; + u32 sq_depth; + u32 rq_depth; + u32 sq_msg_size; + u32 sq_mq_index; + u32 sq_mq_start; + u32 rq_msg_size; + u32 rq_mq_index; + u32 rq_mq_start; +} __attribute__((packed)) ; + +union c2wr_qp_modify { + struct c2wr_qp_modify_req req; + struct c2wr_qp_modify_rep rep; +} __attribute__((packed)) ; + +struct c2wr_qp_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; +} __attribute__((packed)) ; + +struct c2wr_qp_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_qp_destroy { + struct c2wr_qp_destroy_req req; + struct c2wr_qp_destroy_rep rep; +} __attribute__((packed)) ; + +/* + * The CCWR_QP_CONNECT msg is posted on the verbs request queue. It can + * only be posted when a QP is in IDLE state. After the connect request is + * submitted to the LLP, the adapter moves the QP to CONNECT_PENDING state. + * No synchronous reply from adapter to this WR. The results of + * connection are passed back in an async event CCAE_ACTIVE_CONNECT_RESULTS + * See c2wr_ae_active_connect_results_t + */ +struct c2wr_qp_connect_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; + u32 remote_addr; + u16 remote_port; + u16 pad; + u32 private_data_length; + u8 private_data[0]; /* Private data in-line. */ +} __attribute__((packed)) ; + +struct c2wr_qp_connect { + struct c2wr_qp_connect_req req; + /* no synchronous reply. 
*/ +} __attribute__((packed)) ; + + +/* + *------------------------ MM ------------------------ + */ + +struct c2wr_nsmr_stag_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pbl_depth; + u32 pd_id; + u32 flags; +} __attribute__((packed)) ; + +struct c2wr_nsmr_stag_alloc_rep { + struct c2wr_hdr hdr; + u32 pbl_depth; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_nsmr_stag_alloc { + struct c2wr_nsmr_stag_alloc_req req; + struct c2wr_nsmr_stag_alloc_rep rep; +} __attribute__((packed)) ; + +struct c2wr_nsmr_register_req { + struct c2wr_hdr hdr; + u64 va; + u32 rnic_handle; + u16 flags; + u8 stag_key; + u8 pad; + u32 pd_id; + u32 pbl_depth; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)) ; + +struct c2wr_nsmr_register_rep { + struct c2wr_hdr hdr; + u32 pbl_depth; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_nsmr_register { + struct c2wr_nsmr_register_req req; + struct c2wr_nsmr_register_rep rep; +} __attribute__((packed)) ; + +struct c2wr_nsmr_pbl_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 flags; + u32 stag_index; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)) ; + +struct c2wr_nsmr_pbl_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_nsmr_pbl { + struct c2wr_nsmr_pbl_req req; + struct c2wr_nsmr_pbl_rep rep; +} __attribute__((packed)) ; + +struct c2wr_mr_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 stag_index; +} __attribute__((packed)) ; + +struct c2wr_mr_query_rep { + struct c2wr_hdr hdr; + u8 stag_key; + u8 pad[3]; + u32 pd_id; + u32 flags; + u32 pbl_depth; +} __attribute__((packed)) ; + +union c2wr_mr_query { + struct c2wr_mr_query_req req; + struct c2wr_mr_query_rep rep; +} __attribute__((packed)) ; + +struct c2wr_mw_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 
stag_index; +} __attribute__((packed)) ; + +struct c2wr_mw_query_rep { + struct c2wr_hdr hdr; + u8 stag_key; + u8 pad[3]; + u32 pd_id; + u32 flags; +} __attribute__((packed)) ; + +union c2wr_mw_query { + struct c2wr_mw_query_req req; + struct c2wr_mw_query_rep rep; +} __attribute__((packed)) ; + + +struct c2wr_stag_dealloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 stag_index; +} __attribute__((packed)) ; + +struct c2wr_stag_dealloc_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_stag_dealloc { + struct c2wr_stag_dealloc_req req; + struct c2wr_stag_dealloc_rep rep; +} __attribute__((packed)) ; + +struct c2wr_nsmr_reregister_req { + struct c2wr_hdr hdr; + u64 va; + u32 rnic_handle; + u16 flags; + u8 stag_key; + u8 pad; + u32 stag_index; + u32 pd_id; + u32 pbl_depth; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + u32 pad1; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)) ; + +struct c2wr_nsmr_reregister_rep { + struct c2wr_hdr hdr; + u32 pbl_depth; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_nsmr_reregister { + struct c2wr_nsmr_reregister_req req; + struct c2wr_nsmr_reregister_rep rep; +} __attribute__((packed)) ; + +struct c2wr_smr_register_req { + struct c2wr_hdr hdr; + u64 va; + u32 rnic_handle; + u16 flags; + u8 stag_key; + u8 pad; + u32 stag_index; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_smr_register_rep { + struct c2wr_hdr hdr; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_smr_register { + struct c2wr_smr_register_req req; + struct c2wr_smr_register_rep rep; +} __attribute__((packed)) ; + +struct c2wr_mw_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_mw_alloc_rep { + struct c2wr_hdr hdr; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_mw_alloc { + struct c2wr_mw_alloc_req req; + struct c2wr_mw_alloc_rep rep; +} 
__attribute__((packed)) ; + +/* + *------------------------ WRs ----------------------- + */ + +struct c2wr_user_hdr { + struct c2wr_hdr hdr; /* Has status and WR Type */ +} __attribute__((packed)) ; + +enum c2_qp_state { + C2_QP_STATE_IDLE = 0x01, + C2_QP_STATE_CONNECTING = 0x02, + C2_QP_STATE_RTS = 0x04, + C2_QP_STATE_CLOSING = 0x08, + C2_QP_STATE_TERMINATE = 0x10, + C2_QP_STATE_ERROR = 0x20, +}; + +/* Completion queue entry. */ +struct c2wr_ce { + struct c2wr_hdr hdr; /* Has status and WR Type */ + u64 qp_user_context; /* c2_user_qp_t * */ + u32 qp_state; /* Current QP State */ + u32 handle; /* QPID or EP Handle */ + u32 bytes_rcvd; /* valid for RECV WCs */ + u32 stag; +} __attribute__((packed)) ; + + +/* + * Flags used for all post-sq WRs. These must fit in the flags + * field of the struct c2wr_hdr (eight bits). + */ +enum { + SQ_SIGNALED = 0x01, + SQ_READ_FENCE = 0x02, + SQ_FENCE = 0x04, +}; + +/* + * Common fields for all post-sq WRs. Namely the standard header and a + * secondary header with fields common to all post-sq WRs. + */ +struct c2_sq_hdr { + struct c2wr_user_hdr user_hdr; +} __attribute__((packed)); + +/* + * Same as above but for post-rq WRs. + */ +struct c2_rq_hdr { + struct c2wr_user_hdr user_hdr; +} __attribute__((packed)); + +/* + * use the same struct for all sends. 
+ */ +struct c2wr_send_req { + struct c2_sq_hdr sq_hdr; + u32 sge_len; + u32 remote_stag; + u8 data[0]; /* SGE array */ +} __attribute__((packed)); +/* XXX c2wr_send_req_t, c2wr_send_se_req_t, c2wr_send_inv_req_t, + c2wr_send_se_inv_req_t;*/ + +union c2wr_send { + struct c2wr_send_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_rdma_write_req { + struct c2_sq_hdr sq_hdr; + u64 remote_to; + u32 remote_stag; + u32 sge_len; + u8 data[0]; /* SGE array */ +} __attribute__((packed)); + +union c2wr_rdma_write { + struct c2wr_rdma_write_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_rdma_read_req { + struct c2_sq_hdr sq_hdr; + u64 local_to; + u64 remote_to; + u32 local_stag; + u32 remote_stag; + u32 length; +} __attribute__((packed)); + +union c2wr_rdma_read { + struct c2wr_rdma_read_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_mw_bind_req { + struct c2_sq_hdr sq_hdr; + u64 va; + u8 stag_key; + u8 pad[3]; + u32 mw_stag_index; + u32 mr_stag_index; + u32 length; + u32 flags; +} __attribute__((packed)); + +union c2wr_mw_bind { + struct c2wr_mw_bind_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_nsmr_fastreg_req { + struct c2_sq_hdr sq_hdr; + u64 va; + u8 stag_key; + u8 pad[3]; + u32 stag_index; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)); + +union c2wr_nsmr_fastreg { + struct c2wr_nsmr_fastreg_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_stag_invalidate_req { + struct c2_sq_hdr sq_hdr; + u8 stag_key; + u8 pad[3]; + u32 stag_index; +} __attribute__((packed)); + +union c2wr_stag_invalidate { + struct c2wr_stag_invalidate_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +union c2wr_sqwr { + struct c2_sq_hdr sq_hdr; + struct c2wr_send_req send; + struct c2wr_send_req send_se; + struct c2wr_send_req send_inv; + 
struct c2wr_send_req send_se_inv; + struct c2wr_rdma_write_req rdma_write; + struct c2wr_rdma_read_req rdma_read; + struct c2wr_mw_bind_req mw_bind; + struct c2wr_nsmr_fastreg_req nsmr_fastreg; + struct c2wr_stag_invalidate_req stag_inv; +} __attribute__((packed)); + + +/* + * RQ WRs + */ +struct c2wr_rqwr { + struct c2_rq_hdr rq_hdr; + u8 data[0]; /* array of SGEs */ +} __attribute__((packed)); +/* XXX c2wr_rqwr_t, c2wr_recv_req_t; */ + +union c2wr_recv { + struct c2wr_rqwr req; + struct c2wr_ce rep; +} __attribute__((packed)); + +/* + * All AEs start with this header. Most AEs only need to convey the + * information in the header. Some, like LLP connection events, need + * more info. The union typedef c2wr_ae_t has all the possible AEs. + * + * hdr.context is the user_context from the rnic_open WR. NULL if this + * is not affiliated with an RNIC. + * + * hdr.id is the AE identifier (e.g. CCAE_REMOTE_SHUTDOWN, + * CCAE_LLP_CLOSE_COMPLETE) + * + * resource_type is one of: C2_RES_IND_QP, C2_RES_IND_CQ, C2_RES_IND_SRQ + * + * user_context is the context passed down when the host created the resource. + */ +struct c2wr_ae_hdr { + struct c2wr_hdr hdr; + u64 user_context; /* user context for this res. */ + u32 resource_type; /* see enum c2_resource_indicator */ + u32 resource; /* handle for resource */ + u32 qp_state; /* current QP State */ +} __attribute__((packed)); + +/* + * After submitting the CCAE_ACTIVE_CONNECT_RESULTS message on the AEQ, + * the adapter moves the QP into RTS state. + */ +struct c2wr_ae_active_connect_results { + struct c2wr_ae_hdr ae_hdr; + u32 laddr; + u32 raddr; + u16 lport; + u16 rport; + u32 private_data_length; + u8 private_data[0]; /* data is in-line in the msg. */ +} __attribute__((packed)); + +/* + * When connections are established by the stack (and the private data + * MPA frame is received), the adapter will generate an event to the host.
+ * The details of the connection, any private data, and the new connection + * request handle are passed up via the CCAE_CONNECTION_REQUEST msg on the + * AE queue: + */ +struct c2wr_ae_connection_request { + struct c2wr_ae_hdr ae_hdr; + u32 cr_handle; /* connreq handle (sock ptr) */ + u32 laddr; + u32 raddr; + u16 lport; + u16 rport; + u32 private_data_length; + u8 private_data[0]; /* data is in-line in the msg. */ +} __attribute__((packed)); + +union c2wr_ae { + struct c2wr_ae_hdr ae_generic; + struct c2wr_ae_active_connect_results ae_active_connect_results; + struct c2wr_ae_connection_request ae_connection_request; +} __attribute__((packed)); + +struct c2wr_init_req { + struct c2wr_hdr hdr; + u64 hint_count; + u64 q0_host_shared; + u64 q1_host_shared; + u64 q1_host_msg_pool; + u64 q2_host_shared; + u64 q2_host_msg_pool; +} __attribute__((packed)); + +struct c2wr_init_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_init { + struct c2wr_init_req req; + struct c2wr_init_rep rep; +} __attribute__((packed)); + +/* + * For upgrading flash.
+ */ + +struct c2wr_flash_init_req { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)); + +struct c2wr_flash_init_rep { + struct c2wr_hdr hdr; + u32 adapter_flash_buf_offset; + u32 adapter_flash_len; +} __attribute__((packed)); + +union c2wr_flash_init { + struct c2wr_flash_init_req req; + struct c2wr_flash_init_rep rep; +} __attribute__((packed)); + +struct c2wr_flash_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 len; +} __attribute__((packed)); + +struct c2wr_flash_rep { + struct c2wr_hdr hdr; + u32 status; +} __attribute__((packed)); + +union c2wr_flash { + struct c2wr_flash_req req; + struct c2wr_flash_rep rep; +} __attribute__((packed)); + +struct c2wr_buf_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 size; +} __attribute__((packed)); + +struct c2wr_buf_alloc_rep { + struct c2wr_hdr hdr; + u32 offset; /* 0 if mem not available */ + u32 size; /* 0 if mem not available */ +} __attribute__((packed)); + +union c2wr_buf_alloc { + struct c2wr_buf_alloc_req req; + struct c2wr_buf_alloc_rep rep; +} __attribute__((packed)); + +struct c2wr_buf_free_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 offset; /* Must match value from alloc */ + u32 size; /* Must match value from alloc */ +} __attribute__((packed)); + +struct c2wr_buf_free_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_buf_free { + struct c2wr_buf_free_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_flash_write_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 offset; + u32 size; + u32 type; + u32 flags; +} __attribute__((packed)); + +struct c2wr_flash_write_rep { + struct c2wr_hdr hdr; + u32 status; +} __attribute__((packed)); + +union c2wr_flash_write { + struct c2wr_flash_write_req req; + struct c2wr_flash_write_rep rep; +} __attribute__((packed)); + +/* + * Messages for LLP connection setup. + */ + +/* + * Listen Request. This allocates a listening endpoint to allow passive + * connection setup. 
Newly established LLP connections are passed up + * via an AE. See c2wr_ae_connection_request_t + */ +struct c2wr_ep_listen_create_req { + struct c2wr_hdr hdr; + u64 user_context; /* returned in AEs. */ + u32 rnic_handle; + u32 local_addr; /* local addr, or 0 */ + u16 local_port; /* 0 means "pick one" */ + u16 pad; + u32 backlog; /* traditional TCP listen backlog */ +} __attribute__((packed)); + +struct c2wr_ep_listen_create_rep { + struct c2wr_hdr hdr; + u32 ep_handle; /* handle to new listening ep */ + u16 local_port; /* resulting port... */ + u16 pad; +} __attribute__((packed)); + +union c2wr_ep_listen_create { + struct c2wr_ep_listen_create_req req; + struct c2wr_ep_listen_create_rep rep; +} __attribute__((packed)); + +struct c2wr_ep_listen_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 ep_handle; +} __attribute__((packed)); + +struct c2wr_ep_listen_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_ep_listen_destroy { + struct c2wr_ep_listen_destroy_req req; + struct c2wr_ep_listen_destroy_rep rep; +} __attribute__((packed)); + +struct c2wr_ep_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 ep_handle; +} __attribute__((packed)); + +struct c2wr_ep_query_rep { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 local_addr; + u32 remote_addr; + u16 local_port; + u16 remote_port; +} __attribute__((packed)); + +union c2wr_ep_query { + struct c2wr_ep_query_req req; + struct c2wr_ep_query_rep rep; +} __attribute__((packed)); + + +/* + * The host passes this down to indicate acceptance of a pending iWARP + * connection. The cr_handle was obtained from the CONNECTION_REQUEST + * AE passed up by the adapter. See c2wr_ae_connection_request_t. + */ +struct c2wr_cr_accept_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; /* QP to bind to this LLP conn */ + u32 ep_handle; /* LLP handle to accept */ + u32 private_data_length; + u8 private_data[0]; /* data in-line in msg.
*/ +} __attribute__((packed)); + +/* + * adapter sends reply when private data is successfully submitted to + * the LLP. + */ +struct c2wr_cr_accept_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_cr_accept { + struct c2wr_cr_accept_req req; + struct c2wr_cr_accept_rep rep; +} __attribute__((packed)); + +/* + * The host sends this down if a given iWARP connection request was + * rejected by the consumer. The cr_handle was obtained from a + * previous c2wr_ae_connection_request_t AE sent by the adapter. + */ +struct c2wr_cr_reject_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 ep_handle; /* LLP handle to reject */ +} __attribute__((packed)); + +/* + * Dunno if this is needed, but we'll add it for now. The adapter will + * send the reject_reply after the LLP endpoint has been destroyed. + */ +struct c2wr_cr_reject_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_cr_reject { + struct c2wr_cr_reject_req req; + struct c2wr_cr_reject_rep rep; +} __attribute__((packed)); + +/* + * console command. Used to implement a debug console over the verbs + * request and reply queues. + */ + +/* + * Console request message. It contains: + * - message hdr with id = CCWR_CONSOLE + * - the physaddr/len of host memory to be used for the reply. + * - the command string. eg: "netstat -s" or "zoneinfo" + */ +struct c2wr_console_req { + struct c2wr_hdr hdr; /* id = CCWR_CONSOLE */ + u64 reply_buf; /* pinned host buf for reply */ + u32 reply_buf_len; /* length of reply buffer */ + u8 command[0]; /* NUL terminated ascii string */ + /* containing the command req */ +} __attribute__((packed)); + +/* + * flags used in the console reply. + */ +enum c2_console_flags { + CONS_REPLY_TRUNCATED = 0x00000001 /* reply was truncated */ +} __attribute__((packed)); + +/* + * Console reply message. + * hdr.result contains the c2_status_t error if the reply was _not_ generated, + * or C2_OK if the reply was generated. 
+ */ +struct c2wr_console_rep { + struct c2wr_hdr hdr; /* id = CCWR_CONSOLE */ + u32 flags; +} __attribute__((packed)); + +union c2wr_console { + struct c2wr_console_req req; + struct c2wr_console_rep rep; +} __attribute__((packed)); + + +/* + * Giant union with all WRs. Makes life easier... + */ +union c2wr { + struct c2wr_hdr hdr; + struct c2wr_user_hdr user_hdr; + union c2wr_rnic_open rnic_open; + union c2wr_rnic_query rnic_query; + union c2wr_rnic_getconfig rnic_getconfig; + union c2wr_rnic_setconfig rnic_setconfig; + union c2wr_rnic_close rnic_close; + union c2wr_cq_create cq_create; + union c2wr_cq_modify cq_modify; + union c2wr_cq_destroy cq_destroy; + union c2wr_pd_alloc pd_alloc; + union c2wr_pd_dealloc pd_dealloc; + union c2wr_srq_create srq_create; + union c2wr_srq_destroy srq_destroy; + union c2wr_qp_create qp_create; + union c2wr_qp_query qp_query; + union c2wr_qp_modify qp_modify; + union c2wr_qp_destroy qp_destroy; + struct c2wr_qp_connect qp_connect; + union c2wr_nsmr_stag_alloc nsmr_stag_alloc; + union c2wr_nsmr_register nsmr_register; + union c2wr_nsmr_pbl nsmr_pbl; + union c2wr_mr_query mr_query; + union c2wr_mw_query mw_query; + union c2wr_stag_dealloc stag_dealloc; + union c2wr_sqwr sqwr; + struct c2wr_rqwr rqwr; + struct c2wr_ce ce; + union c2wr_ae ae; + union c2wr_init init; + union c2wr_ep_listen_create ep_listen_create; + union c2wr_ep_listen_destroy ep_listen_destroy; + union c2wr_cr_accept cr_accept; + union c2wr_cr_reject cr_reject; + union c2wr_console console; + union c2wr_flash_init flash_init; + union c2wr_flash flash; + union c2wr_buf_alloc buf_alloc; + union c2wr_buf_free buf_free; + union c2wr_flash_write flash_write; +} __attribute__((packed)); + + +/* + * Accessors for the wr fields that are packed together tightly to + * reduce the wr message size. The wr arguments are void* so that + * either a struct c2wr*, a struct c2wr_hdr*, or a pointer to any of the types + * in the struct c2wr union can be passed in. 
+ */ +static __inline__ u8 c2_wr_get_id(void *wr) +{ + return ((struct c2wr_hdr *) wr)->id; +} +static __inline__ void c2_wr_set_id(void *wr, u8 id) +{ + ((struct c2wr_hdr *) wr)->id = id; +} +static __inline__ u8 c2_wr_get_result(void *wr) +{ + return ((struct c2wr_hdr *) wr)->result; +} +static __inline__ void c2_wr_set_result(void *wr, u8 result) +{ + ((struct c2wr_hdr *) wr)->result = result; +} +static __inline__ u8 c2_wr_get_flags(void *wr) +{ + return ((struct c2wr_hdr *) wr)->flags; +} +static __inline__ void c2_wr_set_flags(void *wr, u8 flags) +{ + ((struct c2wr_hdr *) wr)->flags = flags; +} +static __inline__ u8 c2_wr_get_sge_count(void *wr) +{ + return ((struct c2wr_hdr *) wr)->sge_count; +} +static __inline__ void c2_wr_set_sge_count(void *wr, u8 sge_count) +{ + ((struct c2wr_hdr *) wr)->sge_count = sge_count; +} +static __inline__ u32 c2_wr_get_wqe_count(void *wr) +{ + return ((struct c2wr_hdr *) wr)->wqe_count; +} +static __inline__ void c2_wr_set_wqe_count(void *wr, u32 wqe_count) +{ + ((struct c2wr_hdr *) wr)->wqe_count = wqe_count; +} + +#endif /* _C2_WR_H_ */ From swise at opengridcomputing.com Wed Jun 7 13:07:02 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:07:02 -0500 Subject: [openib-general] [PATCH v2 7/7] AMSO1100 Makefiles and Kconfig changes. 
In-Reply-To: <20060607200646.9259.24588.stgit@stevo-desktop> References: <20060607200646.9259.24588.stgit@stevo-desktop> Message-ID: <20060607200702.9259.62339.stgit@stevo-desktop> Review Changes: - C2DEBUG -> DEBUG --- drivers/infiniband/Kconfig | 1 + drivers/infiniband/Makefile | 1 + drivers/infiniband/hw/amso1100/Kbuild | 10 ++++++++++ drivers/infiniband/hw/amso1100/Kconfig | 15 +++++++++++++++ drivers/infiniband/hw/amso1100/README | 11 +++++++++++ 5 files changed, 38 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index ba2d650..04e6d4f 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -36,6 +36,7 @@ config INFINIBAND_ADDR_TRANS source "drivers/infiniband/hw/mthca/Kconfig" source "drivers/infiniband/hw/ipath/Kconfig" +source "drivers/infiniband/hw/amso1100/Kconfig" source "drivers/infiniband/ulp/ipoib/Kconfig" diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile index eea2732..e2b93f9 100644 --- a/drivers/infiniband/Makefile +++ b/drivers/infiniband/Makefile @@ -1,5 +1,6 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ obj-$(CONFIG_IPATH_CORE) += hw/ipath/ +obj-$(CONFIG_INFINIBAND_AMSO1100) += hw/amso1100/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ diff --git a/drivers/infiniband/hw/amso1100/Kbuild b/drivers/infiniband/hw/amso1100/Kbuild new file mode 100644 index 0000000..e1f10ab --- /dev/null +++ b/drivers/infiniband/hw/amso1100/Kbuild @@ -0,0 +1,10 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +ifdef CONFIG_INFINIBAND_AMSO1100_DEBUG +EXTRA_CFLAGS += -DDEBUG +endif + +obj-$(CONFIG_INFINIBAND_AMSO1100) += iw_c2.o + +iw_c2-y := c2.o c2_provider.o c2_rnic.o c2_alloc.o c2_mq.o c2_ae.o c2_vq.o \ + c2_intr.o c2_cq.o c2_qp.o c2_cm.o c2_mm.o c2_pd.o diff --git a/drivers/infiniband/hw/amso1100/Kconfig b/drivers/infiniband/hw/amso1100/Kconfig new file mode 100644 index 0000000..809cb14 --- 
/dev/null +++ b/drivers/infiniband/hw/amso1100/Kconfig @@ -0,0 +1,15 @@ +config INFINIBAND_AMSO1100 + tristate "Ammasso 1100 HCA support" + depends on PCI && INET && INFINIBAND + ---help--- + This is a low-level driver for the Ammasso 1100 host + channel adapter (HCA). + +config INFINIBAND_AMSO1100_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_AMSO1100 + default n + ---help--- + This option causes the amso1100 driver to produce a bunch of + debug messages. Select this if you are developing the driver + or trying to diagnose a problem. diff --git a/drivers/infiniband/hw/amso1100/README b/drivers/infiniband/hw/amso1100/README new file mode 100644 index 0000000..1331353 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/README @@ -0,0 +1,11 @@ +This is the OpenFabrics provider driver for the +AMSO1100 1Gb RNIC adapter. + +This adapter is available in limited quantities +for development purposes from Open Grid Computing. + +This driver requires the IWCM and CMA mods necessary +to support iWARP. + +Contact tom at opengridcomputing.com for more information. + From swise at opengridcomputing.com Wed Jun 7 13:06:57 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:06:57 -0500 Subject: [openib-general] [PATCH v2 5/7] AMSO1100 Message Queues. 
In-Reply-To: <20060607200646.9259.24588.stgit@stevo-desktop> References: <20060607200646.9259.24588.stgit@stevo-desktop> Message-ID: <20060607200657.9259.48820.stgit@stevo-desktop> Review Changes: - remove useless asserts - assert() -> BUG_ON() - C2_DEBUG -> DEBUG --- drivers/infiniband/hw/amso1100/c2_mq.c | 175 ++++++++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_mq.h | 103 +++++++++++++++++++ 2 files changed, 278 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_mq.c b/drivers/infiniband/hw/amso1100/c2_mq.c new file mode 100644 index 0000000..0b0ab02 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_mq.c @@ -0,0 +1,175 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include "c2_mq.h" + +void *c2_mq_alloc(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_ADAPTER_TARGET); + + if (c2_mq_full(q)) { + return NULL; + } else { +#ifdef DEBUG + struct c2wr_hdr *m = + (struct c2wr_hdr *) (q->msg_pool.host + q->priv * q->msg_size); +#ifdef CCMSGMAGIC + BUG_ON(m->magic != be32_to_cpu(~CCWR_MAGIC)); + m->magic = cpu_to_be32(CCWR_MAGIC); +#endif + return m; +#else + return q->msg_pool.host + q->priv * q->msg_size; +#endif + } +} + +void c2_mq_produce(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_ADAPTER_TARGET); + + if (!c2_mq_full(q)) { + q->priv = (q->priv + 1) % q->q_size; + q->hint_count++; + /* Update peer's offset. */ + __raw_writew(cpu_to_be16(q->priv), &q->peer->shared); + } +} + +void *c2_mq_consume(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_HOST_TARGET); + + if (c2_mq_empty(q)) { + return NULL; + } else { +#ifdef DEBUG + struct c2wr_hdr *m = (struct c2wr_hdr *) + (q->msg_pool.host + q->priv * q->msg_size); +#ifdef CCMSGMAGIC + BUG_ON(m->magic != be32_to_cpu(CCWR_MAGIC)); +#endif + return m; +#else + return q->msg_pool.host + q->priv * q->msg_size; +#endif + } +} + +void c2_mq_free(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_HOST_TARGET); + + if (!c2_mq_empty(q)) { + +#ifdef CCMSGMAGIC + { + struct c2wr_hdr __iomem *m = (struct c2wr_hdr __iomem *) + (q->msg_pool.adapter + q->priv * q->msg_size); + __raw_writel(cpu_to_be32(~CCWR_MAGIC), &m->magic); + } +#endif + q->priv = (q->priv + 1) % q->q_size; + /* Update peer's offset. 
*/ + __raw_writew(cpu_to_be16(q->priv), &q->peer->shared); + } +} + + +void c2_mq_lconsume(struct c2_mq *q, u32 wqe_count) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_ADAPTER_TARGET); + + while (wqe_count--) { + BUG_ON(c2_mq_empty(q)); + *q->shared = cpu_to_be16((be16_to_cpu(*q->shared)+1) % q->q_size); + } +} + + +u32 c2_mq_count(struct c2_mq *q) +{ + s32 count; + + if (q->type == C2_MQ_HOST_TARGET) { + count = be16_to_cpu(*q->shared) - q->priv; + } else { + count = q->priv - be16_to_cpu(*q->shared); + } + + if (count < 0) { + count += q->q_size; + } + + return (u32) count; +} + +void c2_mq_req_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 __iomem *pool_start, u16 __iomem *peer, u32 type) +{ + BUG_ON(!q->shared); + + /* This code assumes the byte swapping has already been done! */ + q->index = index; + q->q_size = q_size; + q->msg_size = msg_size; + q->msg_pool.adapter = pool_start; + q->peer = (struct c2_mq_shared __iomem *) peer; + q->magic = C2_MQ_MAGIC; + q->type = type; + q->priv = 0; + q->hint_count = 0; + return; +} +void c2_mq_rep_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 *pool_start, u16 __iomem *peer, u32 type) +{ + BUG_ON(!q->shared); + + /* This code assumes the byte swapping has already been done! */ + q->index = index; + q->q_size = q_size; + q->msg_size = msg_size; + q->msg_pool.host = pool_start; + q->peer = (struct c2_mq_shared __iomem *) peer; + q->magic = C2_MQ_MAGIC; + q->type = type; + q->priv = 0; + q->hint_count = 0; + return; +} diff --git a/drivers/infiniband/hw/amso1100/c2_mq.h b/drivers/infiniband/hw/amso1100/c2_mq.h new file mode 100644 index 0000000..de00184 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_mq.h @@ -0,0 +1,103 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
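The fullness and count logic in the MQ code above relies on a private index plus a shared, peer-updated index, compared modulo the queue size. A host-only model of that arithmetic (no adapter, no byte swapping, no I/O memory — purely illustrative, with hypothetical names) looks like:

```c
#include <assert.h>
#include <stdint.h>

/* Host-only model of the MQ index scheme: `priv` is our private index,
 * `shared` stands in for the peer-updated index. One slot is always kept
 * empty to distinguish full from empty, as in c2_mq_full(). */
struct mq_model {
    uint16_t priv;
    uint16_t shared;
    uint32_t q_size;
};

static int mq_empty(const struct mq_model *q)
{
    return q->priv == q->shared;
}

static int mq_full(const struct mq_model *q)
{
    return q->priv == (q->shared + q->q_size - 1) % q->q_size;
}

/* Mirrors the wraparound handling in c2_mq_count(), seen from the
 * producer side: distance between the two indices, wrapped. */
static uint32_t mq_count(const struct mq_model *q)
{
    int32_t count = (int32_t) q->priv - (int32_t) q->shared;
    if (count < 0)
        count += q->q_size;
    return (uint32_t) count;
}

static void mq_produce(struct mq_model *q)
{
    if (!mq_full(q))
        q->priv = (q->priv + 1) % q->q_size;
}
```

With `q_size` slots, at most `q_size - 1` messages can be outstanding; that reserved slot is what lets the same two indices encode both "empty" and "full" without an extra flag.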
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef _C2_MQ_H_ +#define _C2_MQ_H_ +#include +#include "c2_wr.h" + +enum c2_shared_regs { + + C2_SHARED_ARMED = 0x10, + C2_SHARED_NOTIFY = 0x18, + C2_SHARED_SHARED = 0x40, +}; + +struct c2_mq_shared { + u16 unused1; + u8 armed; + u8 notification_type; + u32 unused2; + u16 shared; + /* Pad to 64 bytes. */ + u8 pad[64 - sizeof(u16) - 2 * sizeof(u8) - sizeof(u32) - sizeof(u16)]; +}; + +enum c2_mq_type { + C2_MQ_HOST_TARGET = 1, + C2_MQ_ADAPTER_TARGET = 2, +}; + +/* + * c2_mq_t is for kernel-mode MQs like the VQs and the AEQ. + * c2_user_mq_t (which is the same format) is for user-mode MQs... 
+ */ +#define C2_MQ_MAGIC 0x4d512020 /* 'MQ ' */ +struct c2_mq { + u32 magic; + union { + u8 *host; + u8 __iomem *adapter; + } msg_pool; + u16 hint_count; + u16 priv; + struct c2_mq_shared __iomem *peer; + u16 *shared; + u32 q_size; + u32 msg_size; + u32 index; + enum c2_mq_type type; +}; + +static __inline__ int c2_mq_empty(struct c2_mq *q) +{ + return q->priv == be16_to_cpu(*q->shared); +} + +static __inline__ int c2_mq_full(struct c2_mq *q) +{ + return q->priv == (be16_to_cpu(*q->shared) + q->q_size - 1) % q->q_size; +} + +extern void c2_mq_lconsume(struct c2_mq *q, u32 wqe_count); +extern void *c2_mq_alloc(struct c2_mq *q); +extern void c2_mq_produce(struct c2_mq *q); +extern void *c2_mq_consume(struct c2_mq *q); +extern void c2_mq_free(struct c2_mq *q); +extern u32 c2_mq_count(struct c2_mq *q); +extern void c2_mq_req_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 __iomem *pool_start, u16 __iomem *peer, u32 type); +extern void c2_mq_rep_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 *pool_start, u16 __iomem *peer, u32 type); + +#endif /* _C2_MQ_H_ */ From swise at opengridcomputing.com Wed Jun 7 13:07:00 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:07:00 -0500 Subject: [openib-general] [PATCH v2 6/7] AMSO1100: Privileged Verbs Queues. In-Reply-To: <20060607200646.9259.24588.stgit@stevo-desktop> References: <20060607200646.9259.24588.stgit@stevo-desktop> Message-ID: <20060607200659.9259.85242.stgit@stevo-desktop> Review Changes: dprintk() -> pr_debug() --- drivers/infiniband/hw/amso1100/c2_vq.c | 260 ++++++++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_vq.h | 63 ++++++++ 2 files changed, 323 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_vq.c b/drivers/infiniband/hw/amso1100/c2_vq.c new file mode 100644 index 0000000..445b1ed --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_vq.c @@ -0,0 +1,260 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. 
All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include + +#include "c2_vq.h" +#include "c2_provider.h" + +/* + * Verbs Request Objects: + * + * VQ Request Objects are allocated by the kernel verbs handlers. + * They contain a wait object, a refcnt, an atomic bool indicating that the + * adapter has replied, and a copy of the verb reply work request. + * A pointer to the VQ Request Object is passed down in the context + * field of the work request message, and reflected back by the adapter + * in the verbs reply message. 
The function handle_vq() in the interrupt + * path will use this pointer to: + * 1) append a copy of the verbs reply message + * 2) mark that the reply is ready + * 3) wake up the kernel verbs handler blocked awaiting the reply. + * + * + * The kernel verbs handlers do a "get" to put a 2nd reference on the + * VQ Request object. If the kernel verbs handler exits before the adapter + * can respond, this extra reference will keep the VQ Request object around + * until the adapter's reply can be processed. The reason we need this is + * because a pointer to this object is stuffed into the context field of + * the verbs work request message, and reflected back in the reply message. + * It is used in the interrupt handler (handle_vq()) to wake up the appropriate + * kernel verb handler that is blocked awaiting the verb reply. + * So handle_vq() will do a "put" on the object when it's done accessing it. + * NOTE: If we guarantee that the kernel verb handler will never bail before + * getting the reply, then we don't need these refcnts. + * + * + * VQ Request objects are freed by the kernel verbs handlers only + * after the verb has been processed, or when the adapter fails and + * does not reply. + * + * + * Verbs Reply Buffers: + * + * VQ Reply bufs are local host memory copies of an + * outstanding Verb Request reply + * message. They are always allocated by the kernel verbs handlers, and _may_ be + * freed by either the kernel verbs handler -or- the interrupt handler. The + * kernel verbs handler _must_ free the repbuf, then free the vq request object + * in that order. 
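The two-reference scheme described above (the allocator holds one reference, the context pointer handed to the adapter effectively holds the other, and whoever drops the count to zero frees) can be modeled in user-space C with a plain counter. The names here are hypothetical stand-ins; the kernel version uses `atomic_t` and wait queues:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for struct c2_vq_req: a refcount and a reply-buffer slot.
 * The freed_flag pointer is only a test aid so callers can observe the free. */
struct vq_req_model {
    int refcnt;
    void *reply;       /* reply buffer, may stay NULL */
    int *freed_flag;
};

static struct vq_req_model *req_alloc(int *freed_flag)
{
    struct vq_req_model *r = calloc(1, sizeof(*r));
    if (r) {
        r->refcnt = 1;          /* the allocating handler's reference */
        r->freed_flag = freed_flag;
    }
    return r;
}

static void req_get(struct vq_req_model *r)
{
    r->refcnt++;                /* 2nd ref before the ptr goes to the "adapter" */
}

static void req_put(struct vq_req_model *r)
{
    if (--r->refcnt == 0) {
        free(r->reply);         /* free reply buffer first, then the object */
        if (r->freed_flag)
            *r->freed_flag = 1;
        free(r);
    }
}
```

The ordering requirement in the comment (free the reply buffer, then the request object) falls out naturally here because the buffer pointer lives inside the object being freed.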
+ */ + +int vq_init(struct c2_dev *c2dev) +{ + sprintf(c2dev->vq_cache_name, "c2-vq:dev%c", + (char) ('0' + c2dev->devnum)); + c2dev->host_msg_cache = + kmem_cache_create(c2dev->vq_cache_name, c2dev->rep_vq.msg_size, 0, + SLAB_HWCACHE_ALIGN, NULL, NULL); + if (c2dev->host_msg_cache == NULL) { + return -ENOMEM; + } + return 0; +} + +void vq_term(struct c2_dev *c2dev) +{ + kmem_cache_destroy(c2dev->host_msg_cache); +} + +/* vq_req_alloc - allocate a VQ Request Object and initialize it. + * The refcnt is set to 1. + */ +struct c2_vq_req *vq_req_alloc(struct c2_dev *c2dev) +{ + struct c2_vq_req *r; + + r = kmalloc(sizeof(struct c2_vq_req), GFP_KERNEL); + if (r) { + init_waitqueue_head(&r->wait_object); + r->reply_msg = (u64) NULL; + r->event = 0; + r->cm_id = NULL; + r->qp = NULL; + atomic_set(&r->refcnt, 1); + atomic_set(&r->reply_ready, 0); + } + return r; +} + + +/* vq_req_free - free the VQ Request Object. It is assumed the verbs handler + * has already freed the VQ Reply Buffer if it existed. + */ +void vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *r) +{ + r->reply_msg = (u64) NULL; + if (atomic_dec_and_test(&r->refcnt)) { + kfree(r); + } +} + +/* vq_req_get - reference a VQ Request Object. Done + * only in the kernel verbs handlers. + */ +void vq_req_get(struct c2_dev *c2dev, struct c2_vq_req *r) +{ + atomic_inc(&r->refcnt); +} + + +/* vq_req_put - dereference and potentially free a VQ Request Object. + * + * This is only called by handle_vq() on the + * interrupt when it is done processing + * a verb reply message. If the associated + * kernel verbs handler has already bailed, + * then this put will actually free the VQ + * Request object _and_ the VQ Reply Buffer + * if it exists. 
+ */ +void vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *r) +{ + if (atomic_dec_and_test(&r->refcnt)) { + if (r->reply_msg != (u64) NULL) + vq_repbuf_free(c2dev, + (void *) (unsigned long) r->reply_msg); + kfree(r); + } +} + + +/* + * vq_repbuf_alloc - allocate a VQ Reply Buffer. + */ +void *vq_repbuf_alloc(struct c2_dev *c2dev) +{ + return kmem_cache_alloc(c2dev->host_msg_cache, SLAB_ATOMIC); +} + +/* + * vq_send_wr - post a verbs request message to the Verbs Request Queue. + * If a message is not available in the MQ, then block until one is available. + * NOTE: handle_mq() in interrupt context will wake up threads blocked here. + * When the adapter drains the Verbs Request Queue, + * it inserts MQ index 0 into the + * adapter->host activity fifo and interrupts the host. + */ +int vq_send_wr(struct c2_dev *c2dev, union c2wr *wr) +{ + void *msg; + wait_queue_t __wait; + + /* + * grab adapter vq lock + */ + spin_lock(&c2dev->vqlock); + + /* + * allocate msg + */ + msg = c2_mq_alloc(&c2dev->req_vq); + + /* + * If we cannot get a msg, then we'll wait. + * When messages are available, the int handler will wake_up() + * any waiters. + */ + while (msg == NULL) { + pr_debug("%s:%d no available msg in VQ, waiting...\n", + __FUNCTION__, __LINE__); + init_waitqueue_entry(&__wait, current); + add_wait_queue(&c2dev->req_vq_wo, &__wait); + spin_unlock(&c2dev->vqlock); + for (;;) { + set_current_state(TASK_INTERRUPTIBLE); + if (!c2_mq_full(&c2dev->req_vq)) { + break; + } + if (!signal_pending(current)) { + schedule_timeout(1 * HZ); /* 1 second... 
*/ + continue; + } + set_current_state(TASK_RUNNING); + remove_wait_queue(&c2dev->req_vq_wo, &__wait); + return -EINTR; + } + set_current_state(TASK_RUNNING); + remove_wait_queue(&c2dev->req_vq_wo, &__wait); + spin_lock(&c2dev->vqlock); + msg = c2_mq_alloc(&c2dev->req_vq); + } + + /* + * copy wr into adapter msg + */ + memcpy(msg, wr, c2dev->req_vq.msg_size); + + /* + * post msg + */ + c2_mq_produce(&c2dev->req_vq); + + /* + * release adapter vq lock + */ + spin_unlock(&c2dev->vqlock); + return 0; +} + + +/* + * vq_wait_for_reply - block until the adapter posts a Verb Reply Message. + */ +int vq_wait_for_reply(struct c2_dev *c2dev, struct c2_vq_req *req) +{ + if (!wait_event_timeout(req->wait_object, + atomic_read(&req->reply_ready), + 60*HZ)) + return -ETIMEDOUT; + + return 0; +} + +/* + * vq_repbuf_free - Free a Verbs Reply Buffer. + */ +void vq_repbuf_free(struct c2_dev *c2dev, void *reply) +{ + kmem_cache_free(c2dev->host_msg_cache, reply); +} diff --git a/drivers/infiniband/hw/amso1100/c2_vq.h b/drivers/infiniband/hw/amso1100/c2_vq.h new file mode 100644 index 0000000..3380562 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_vq.h @@ -0,0 +1,63 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _C2_VQ_H_ +#define _C2_VQ_H_ +#include +#include "c2.h" +#include "c2_wr.h" +#include "c2_provider.h" + +struct c2_vq_req { + u64 reply_msg; /* ptr to reply msg */ + wait_queue_head_t wait_object; /* wait object for vq reqs */ + atomic_t reply_ready; /* set when reply is ready */ + atomic_t refcnt; /* used to cancel WRs... */ + int event; + struct iw_cm_id *cm_id; + struct c2_qp *qp; +}; + +extern int vq_init(struct c2_dev *c2dev); +extern void vq_term(struct c2_dev *c2dev); + +extern struct c2_vq_req *vq_req_alloc(struct c2_dev *c2dev); +extern void vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *req); +extern void vq_req_get(struct c2_dev *c2dev, struct c2_vq_req *req); +extern void vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *req); +extern int vq_send_wr(struct c2_dev *c2dev, union c2wr * wr); + +extern void *vq_repbuf_alloc(struct c2_dev *c2dev); +extern void vq_repbuf_free(struct c2_dev *c2dev, void *reply); + +extern int vq_wait_for_reply(struct c2_dev *c2dev, struct c2_vq_req *req); +#endif /* _C2_VQ_H_ */ From swise at opengridcomputing.com Wed Jun 7 13:39:22 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:39:22 -0500 Subject: [openib-general] [PATCH v2 2/7] AMSO1100 WR / Event Definitions. 
In-Reply-To: <20060607200646.9259.24588.stgit@stevo-desktop> References: <20060607200646.9259.24588.stgit@stevo-desktop> Message-ID: <1149712762.27684.82.camel@stevo-desktop> Resending 2/7 gzipped. linux-kernel and netdev mailing lists didn't forward the plain text patch... If anyone knows how to address this issue, please email me directly cuz I don't know why 2/7 didn't get forwarded. Sorry. Steve. -------------- next part -------------- A non-text attachment was scrubbed... Name: amso1100_wr.gz Type: application/x-gzip Size: 10387 bytes Desc: not available URL: From rdreier at cisco.com Wed Jun 7 13:43:08 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 Jun 2006 13:43:08 -0700 Subject: [openib-general] Re: [PATCH v2 2/7] AMSO1100 WR / Event Definitions. In-Reply-To: <1149712762.27684.82.camel@stevo-desktop> (Steve Wise's message of "Wed, 07 Jun 2006 15:39:22 -0500") References: <20060607200646.9259.24588.stgit@stevo-desktop> <1149712762.27684.82.camel@stevo-desktop> Message-ID: I just realized it could be the spam filters. You have some comments with three 'X's in a row which might be getting it blocked. Is that possible? - R. From swise at opengridcomputing.com Wed Jun 7 13:59:32 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 Jun 2006 15:59:32 -0500 Subject: [openib-general] Re: [PATCH v2 2/7] AMSO1100 WR / Event Definitions. In-Reply-To: References: <20060607200646.9259.24588.stgit@stevo-desktop> <1149712762.27684.82.camel@stevo-desktop> Message-ID: <1149713972.27684.97.camel@stevo-desktop> On Wed, 2006-06-07 at 13:43 -0700, Roland Dreier wrote: > I just realized it could be the spam filters. You have some comments > with three 'X's in a row which might be getting it blocked. Is that > possible? There are other files that have comments with 'XXX' like c2_provider.c and c2_qp.c which is in patch 3/7 and it made it through. 
These 'XXX' comments need to be cleaned up anyway, so I'll remove them (or address the issue if there is one) and we'll see next time I post a new version. Steve. From tom at opengridcomputing.com Wed Jun 7 15:13:27 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Wed, 07 Jun 2006 17:13:27 -0500 Subject: [openib-general] Re: [PATCH v2 2/2] iWARP Core Changes. In-Reply-To: <20060607200610.9003.54068.stgit@stevo-desktop> References: <20060607200600.9003.56328.stgit@stevo-desktop> <20060607200610.9003.54068.stgit@stevo-desktop> Message-ID: <1149718407.9716.15.camel@trinity.ogc.int> A reference is being taken on an iWARP device that is never getting released. This prevents a participating iWARP netdev device from being unloaded after a connection has been established on the passive side. Search for ip_dev_find below... On Wed, 2006-06-07 at 15:06 -0500, Steve Wise wrote: > This patch contains modifications to the existing rdma header files, > core files, drivers, and ulp files to support iWARP. 
> > Review updates: > > - copy_addr() -> rdma_copy_addr() > > - dst_dev_addr param in rdma_copy_addr to const. > > - various spacing nits with recasting > > - include linux/inetdevice.h to get ip_dev_find() prototype. > --- > > drivers/infiniband/core/Makefile | 4 > drivers/infiniband/core/addr.c | 19 + > drivers/infiniband/core/cache.c | 8 - > drivers/infiniband/core/cm.c | 3 > drivers/infiniband/core/cma.c | 353 +++++++++++++++++++++++--- > drivers/infiniband/core/device.c | 6 > drivers/infiniband/core/mad.c | 11 + > drivers/infiniband/core/sa_query.c | 5 > drivers/infiniband/core/smi.c | 18 + > drivers/infiniband/core/sysfs.c | 18 + > drivers/infiniband/core/ucm.c | 5 > drivers/infiniband/core/user_mad.c | 9 - > drivers/infiniband/hw/ipath/ipath_verbs.c | 2 > drivers/infiniband/hw/mthca/mthca_provider.c | 2 > drivers/infiniband/ulp/ipoib/ipoib_main.c | 8 + > drivers/infiniband/ulp/srp/ib_srp.c | 2 > include/rdma/ib_addr.h | 15 + > include/rdma/ib_verbs.h | 39 +++ > 18 files changed, 435 insertions(+), 92 deletions(-) > > diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile > index 68e73ec..163d991 100644 > --- a/drivers/infiniband/core/Makefile > +++ b/drivers/infiniband/core/Makefile > @@ -1,7 +1,7 @@ > infiniband-$(CONFIG_INFINIBAND_ADDR_TRANS) := ib_addr.o rdma_cm.o > > obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \ > - ib_cm.o $(infiniband-y) > + ib_cm.o iw_cm.o $(infiniband-y) > obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o > obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o > > @@ -14,6 +14,8 @@ ib_sa-y := sa_query.o > > ib_cm-y := cm.o > > +iw_cm-y := iwcm.o > + > rdma_cm-y := cma.o > > ib_addr-y := addr.o > diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c > index d294bbc..83f84ef 100644 > --- a/drivers/infiniband/core/addr.c > +++ b/drivers/infiniband/core/addr.c > @@ -32,6 +32,7 @@ #include > #include > #include > #include > +#include > #include > #include > #include > @@ 
-60,12 +61,15 @@ static LIST_HEAD(req_list); > static DECLARE_WORK(work, process_req, NULL); > static struct workqueue_struct *addr_wq; > > -static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, > - unsigned char *dst_dev_addr) > +int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, > + const unsigned char *dst_dev_addr) > { > switch (dev->type) { > case ARPHRD_INFINIBAND: > - dev_addr->dev_type = IB_NODE_CA; > + dev_addr->dev_type = RDMA_NODE_IB_CA; > + break; > + case ARPHRD_ETHER: > + dev_addr->dev_type = RDMA_NODE_RNIC; > break; > default: > return -EADDRNOTAVAIL; > @@ -77,6 +81,7 @@ static int copy_addr(struct rdma_dev_add > memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN); > return 0; > } > +EXPORT_SYMBOL(rdma_copy_addr); > > int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr) > { > @@ -88,7 +93,7 @@ int rdma_translate_ip(struct sockaddr *a > if (!dev) > return -EADDRNOTAVAIL; > > - ret = copy_addr(dev_addr, dev, NULL); > + ret = rdma_copy_addr(dev_addr, dev, NULL); > dev_put(dev); > return ret; > } > @@ -160,7 +165,7 @@ static int addr_resolve_remote(struct so > > /* If the device does ARP internally, return 'done' */ > if (rt->idev->dev->flags & IFF_NOARP) { > - copy_addr(addr, rt->idev->dev, NULL); > + rdma_copy_addr(addr, rt->idev->dev, NULL); > goto put; > } > > @@ -180,7 +185,7 @@ static int addr_resolve_remote(struct so > src_in->sin_addr.s_addr = rt->rt_src; > } > > - ret = copy_addr(addr, neigh->dev, neigh->ha); > + ret = rdma_copy_addr(addr, neigh->dev, neigh->ha); > release: > neigh_release(neigh); > put: > @@ -244,7 +249,7 @@ static int addr_resolve_local(struct soc > if (ZERONET(src_ip)) { > src_in->sin_family = dst_in->sin_family; > src_in->sin_addr.s_addr = dst_ip; > - ret = copy_addr(addr, dev, dev->dev_addr); > + ret = rdma_copy_addr(addr, dev, dev->dev_addr); > } else if (LOOPBACK(src_ip)) { > ret = rdma_translate_ip((struct sockaddr *)dst_in, addr); > if (!ret) 
> diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c > index e05ca2c..061858c 100644 > --- a/drivers/infiniband/core/cache.c > +++ b/drivers/infiniband/core/cache.c > @@ -32,13 +32,12 @@ > * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > * SOFTWARE. > * > - * $Id: cache.c 1349 2004-12-16 21:09:43Z roland $ > + * $Id: cache.c 6885 2006-05-03 18:22:02Z sean.hefty $ > */ > > #include > #include > #include > -#include /* INIT_WORK, schedule_work(), flush_scheduled_work() */ > > #include > > @@ -62,12 +61,13 @@ struct ib_update_work { > > static inline int start_port(struct ib_device *device) > { > - return device->node_type == IB_NODE_SWITCH ? 0 : 1; > + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; > } > > static inline int end_port(struct ib_device *device) > { > - return device->node_type == IB_NODE_SWITCH ? 0 : device->phys_port_cnt; > + return (device->node_type == RDMA_NODE_IB_SWITCH) ? > + 0 : device->phys_port_cnt; > } > > int ib_get_cached_gid(struct ib_device *device, > diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c > index 1c7463b..cf43ccb 100644 > --- a/drivers/infiniband/core/cm.c > +++ b/drivers/infiniband/core/cm.c > @@ -3253,6 +3253,9 @@ static void cm_add_one(struct ib_device > int ret; > u8 i; > > + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) > + return; > + > cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * > device->phys_port_cnt, GFP_KERNEL); > if (!cm_dev) > diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c > index 94555d2..414600c 100644 > --- a/drivers/infiniband/core/cma.c > +++ b/drivers/infiniband/core/cma.c > @@ -35,6 +35,7 @@ #include > #include > #include > #include > +#include > > #include > > @@ -43,6 +44,7 @@ #include > #include > #include > #include > +#include > > MODULE_AUTHOR("Sean Hefty"); > MODULE_DESCRIPTION("Generic RDMA CM Agent"); > @@ -124,6 +126,7 @@ struct rdma_id_private { > int 
query_id; > union { > struct ib_cm_id *ib; > + struct iw_cm_id *iw; > } cm_id; > > u32 seq_num; > @@ -259,13 +262,23 @@ static void cma_detach_from_dev(struct r > id_priv->cma_dev = NULL; > } > > -static int cma_acquire_ib_dev(struct rdma_id_private *id_priv) > +static int cma_acquire_dev(struct rdma_id_private *id_priv) > { > + enum rdma_node_type dev_type = id_priv->id.route.addr.dev_addr.dev_type; > struct cma_device *cma_dev; > union ib_gid *gid; > int ret = -ENODEV; > > - gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); > + switch (rdma_node_get_transport(dev_type)) { > + case RDMA_TRANSPORT_IB: > + gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); > + break; > + case RDMA_TRANSPORT_IWARP: > + gid = iw_addr_get_sgid(&id_priv->id.route.addr.dev_addr); > + break; > + default: > + return -ENODEV; > + } > > mutex_lock(&lock); > list_for_each_entry(cma_dev, &dev_list, list) { > @@ -280,16 +293,6 @@ static int cma_acquire_ib_dev(struct rdm > return ret; > } > > -static int cma_acquire_dev(struct rdma_id_private *id_priv) > -{ > - switch (id_priv->id.route.addr.dev_addr.dev_type) { > - case IB_NODE_CA: > - return cma_acquire_ib_dev(id_priv); > - default: > - return -ENODEV; > - } > -} > - > static void cma_deref_id(struct rdma_id_private *id_priv) > { > if (atomic_dec_and_test(&id_priv->refcount)) > @@ -347,6 +350,16 @@ static int cma_init_ib_qp(struct rdma_id > IB_QP_PKEY_INDEX | IB_QP_PORT); > } > > +static int cma_init_iw_qp(struct rdma_id_private *id_priv, struct ib_qp *qp) > +{ > + struct ib_qp_attr qp_attr; > + > + qp_attr.qp_state = IB_QPS_INIT; > + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; > + > + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS); > +} > + > int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd, > struct ib_qp_init_attr *qp_init_attr) > { > @@ -362,10 +375,13 @@ int rdma_create_qp(struct rdma_cm_id *id > if (IS_ERR(qp)) > return PTR_ERR(qp); > > - switch (id->device->node_type) { > - case 
IB_NODE_CA: > + switch (rdma_node_get_transport(id->device->node_type)) { > + case RDMA_TRANSPORT_IB: > ret = cma_init_ib_qp(id_priv, qp); > break; > + case RDMA_TRANSPORT_IWARP: > + ret = cma_init_iw_qp(id_priv, qp); > + break; > default: > ret = -ENOSYS; > break; > @@ -451,13 +467,17 @@ int rdma_init_qp_attr(struct rdma_cm_id > int ret; > > id_priv = container_of(id, struct rdma_id_private, id); > - switch (id_priv->id.device->node_type) { > - case IB_NODE_CA: > + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { > + case RDMA_TRANSPORT_IB: > ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, qp_attr, > qp_attr_mask); > if (qp_attr->qp_state == IB_QPS_RTR) > qp_attr->rq_psn = id_priv->seq_num; > break; > + case RDMA_TRANSPORT_IWARP: > + ret = iw_cm_init_qp_attr(id_priv->cm_id.iw, qp_attr, > + qp_attr_mask); > + break; > default: > ret = -ENOSYS; > break; > @@ -590,8 +610,8 @@ static int cma_notify_user(struct rdma_i > > static void cma_cancel_route(struct rdma_id_private *id_priv) > { > - switch (id_priv->id.device->node_type) { > - case IB_NODE_CA: > + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { > + case RDMA_TRANSPORT_IB: > if (id_priv->query) > ib_sa_cancel_query(id_priv->query_id, id_priv->query); > break; > @@ -611,11 +631,15 @@ static void cma_destroy_listen(struct rd > cma_exch(id_priv, CMA_DESTROYING); > > if (id_priv->cma_dev) { > - switch (id_priv->id.device->node_type) { > - case IB_NODE_CA: > + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { > + case RDMA_TRANSPORT_IB: > if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) > ib_destroy_cm_id(id_priv->cm_id.ib); > break; > + case RDMA_TRANSPORT_IWARP: > + if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) > + iw_destroy_cm_id(id_priv->cm_id.iw); > + break; > default: > break; > } > @@ -690,11 +714,15 @@ void rdma_destroy_id(struct rdma_cm_id * > cma_cancel_operation(id_priv, state); > > if (id_priv->cma_dev) { > - switch (id->device->node_type) { > - 
case IB_NODE_CA: > + switch (rdma_node_get_transport(id->device->node_type)) { > + case RDMA_TRANSPORT_IB: > if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) > ib_destroy_cm_id(id_priv->cm_id.ib); > break; > + case RDMA_TRANSPORT_IWARP: > + if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) > + iw_destroy_cm_id(id_priv->cm_id.iw); > + break; > default: > break; > } > @@ -868,7 +896,7 @@ static struct rdma_id_private *cma_new_i > ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid); > ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid); > ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey)); > - rt->addr.dev_addr.dev_type = IB_NODE_CA; > + rt->addr.dev_addr.dev_type = RDMA_NODE_IB_CA; > > id_priv = container_of(id, struct rdma_id_private, id); > id_priv->state = CMA_CONNECT; > @@ -897,7 +925,7 @@ static int cma_req_handler(struct ib_cm_ > } > > atomic_inc(&conn_id->dev_remove); > - ret = cma_acquire_ib_dev(conn_id); > + ret = cma_acquire_dev(conn_id); > if (ret) { > ret = -ENODEV; > cma_release_remove(conn_id); > @@ -981,6 +1009,123 @@ static void cma_set_compare_data(enum rd > } > } > > +static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event) > +{ > + struct rdma_id_private *id_priv = iw_id->context; > + enum rdma_cm_event_type event = 0; > + struct sockaddr_in *sin; > + int ret = 0; > + > + atomic_inc(&id_priv->dev_remove); > + > + switch (iw_event->event) { > + case IW_CM_EVENT_CLOSE: > + event = RDMA_CM_EVENT_DISCONNECTED; > + break; > + case IW_CM_EVENT_CONNECT_REPLY: > + sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; > + *sin = iw_event->local_addr; > + sin = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr; > + *sin = iw_event->remote_addr; > + if (iw_event->status) > + event = RDMA_CM_EVENT_REJECTED; > + else > + event = RDMA_CM_EVENT_ESTABLISHED; > + break; > + case IW_CM_EVENT_ESTABLISHED: > + event = RDMA_CM_EVENT_ESTABLISHED; > + break; > + default: > + BUG_ON(1); > + 
} > + > + ret = cma_notify_user(id_priv, event, iw_event->status, > + iw_event->private_data, > + iw_event->private_data_len); > + if (ret) { > + /* Destroy the CM ID by returning a non-zero value. */ > + id_priv->cm_id.iw = NULL; > + cma_exch(id_priv, CMA_DESTROYING); > + cma_release_remove(id_priv); > + rdma_destroy_id(&id_priv->id); > + return ret; > + } > + > + cma_release_remove(id_priv); > + return ret; > +} > + > +static int iw_conn_req_handler(struct iw_cm_id *cm_id, > + struct iw_cm_event *iw_event) > +{ > + struct rdma_cm_id *new_cm_id; > + struct rdma_id_private *listen_id, *conn_id; > + struct sockaddr_in *sin; > + struct net_device *dev; > + int ret; > + > + listen_id = cm_id->context; > + atomic_inc(&listen_id->dev_remove); > + if (!cma_comp(listen_id, CMA_LISTEN)) { > + ret = -ECONNABORTED; > + goto out; > + } > + > + /* Create a new RDMA id for the new IW CM ID */ > + new_cm_id = rdma_create_id(listen_id->id.event_handler, > + listen_id->id.context, > + RDMA_PS_TCP); > + if (!new_cm_id) { > + ret = -ENOMEM; > + goto out; > + } > + conn_id = container_of(new_cm_id, struct rdma_id_private, id); > + atomic_inc(&conn_id->dev_remove); > + conn_id->state = CMA_CONNECT; [review note: we take a reference on the iWARP device here that we never release] > + dev = ip_dev_find(iw_event->local_addr.sin_addr.s_addr); > + if (!dev) { > + ret = -EADDRNOTAVAIL; > + rdma_destroy_id(new_cm_id); > + goto out; > + } > + ret = rdma_copy_addr(&conn_id->id.route.addr.dev_addr, dev, NULL); > + if (ret) { > + rdma_destroy_id(new_cm_id); > + goto out; > + } > + > + ret = cma_acquire_dev(conn_id); > + if (ret) { > + rdma_destroy_id(new_cm_id); > + goto out; > + } > + > + conn_id->cm_id.iw = cm_id; > + cm_id->context = conn_id; > + cm_id->cm_handler = cma_iw_handler; > + > + sin = (struct sockaddr_in *) &new_cm_id->route.addr.src_addr; > + *sin = iw_event->local_addr; > + sin = (struct sockaddr_in *) &new_cm_id->route.addr.dst_addr; > + *sin = iw_event->remote_addr; > + > + ret =
cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0, > + iw_event->private_data, > + iw_event->private_data_len); > + if (ret) { > + /* User wants to destroy the CM ID */ > + conn_id->cm_id.iw = NULL; > + cma_exch(conn_id, CMA_DESTROYING); > + cma_release_remove(conn_id); > + rdma_destroy_id(&conn_id->id); > + } > + > +out: [review note: we need to put a dev_put here, or the reference on the device will never get released and you won't be able to remove it after you've had at least one connection. This is my bug.] dev_put(dev); > + cma_release_remove(listen_id); > + return ret; > +} > + > static int cma_ib_listen(struct rdma_id_private *id_priv) > { > struct ib_cm_compare_data compare_data; > @@ -1010,6 +1155,30 @@ static int cma_ib_listen(struct rdma_id_ > return ret; > } > > +static int cma_iw_listen(struct rdma_id_private *id_priv, int backlog) > +{ > + int ret; > + struct sockaddr_in *sin; > + > + id_priv->cm_id.iw = iw_create_cm_id(id_priv->id.device, > + iw_conn_req_handler, > + id_priv); > + if (IS_ERR(id_priv->cm_id.iw)) > + return PTR_ERR(id_priv->cm_id.iw); > + > + sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; > + id_priv->cm_id.iw->local_addr = *sin; > + > + ret = iw_cm_listen(id_priv->cm_id.iw, backlog); > + > + if (ret) { > + iw_destroy_cm_id(id_priv->cm_id.iw); > + id_priv->cm_id.iw = NULL; > + } > + > + return ret; > +} > + > static int cma_listen_handler(struct rdma_cm_id *id, > struct rdma_cm_event *event) > { > @@ -1085,12 +1254,17 @@ int rdma_listen(struct rdma_cm_id *id, i > return -EINVAL; > > if (id->device) { > - switch (id->device->node_type) { > - case IB_NODE_CA: > + switch (rdma_node_get_transport(id->device->node_type)) { > + case RDMA_TRANSPORT_IB: > ret = cma_ib_listen(id_priv); > if (ret) > goto err; > break; > + case RDMA_TRANSPORT_IWARP: > + ret = cma_iw_listen(id_priv, backlog); > + if (ret) > + goto err; > + break; > default: > ret = -ENOSYS; > goto err; > @@ -1229,6 +1403,23 @@ err: > }
EXPORT_SYMBOL(rdma_set_ib_paths); > > +static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms) > +{ > + struct cma_work *work; > + > + work = kzalloc(sizeof *work, GFP_KERNEL); > + if (!work) > + return -ENOMEM; > + > + work->id = id_priv; > + INIT_WORK(&work->work, cma_work_handler, work); > + work->old_state = CMA_ROUTE_QUERY; > + work->new_state = CMA_ROUTE_RESOLVED; > + work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED; > + queue_work(cma_wq, &work->work); > + return 0; > +} > + > int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) > { > struct rdma_id_private *id_priv; > @@ -1239,10 +1430,13 @@ int rdma_resolve_route(struct rdma_cm_id > return -EINVAL; > > atomic_inc(&id_priv->refcount); > - switch (id->device->node_type) { > - case IB_NODE_CA: > + switch (rdma_node_get_transport(id->device->node_type)) { > + case RDMA_TRANSPORT_IB: > ret = cma_resolve_ib_route(id_priv, timeout_ms); > break; > + case RDMA_TRANSPORT_IWARP: > + ret = cma_resolve_iw_route(id_priv, timeout_ms); > + break; > default: > ret = -ENOSYS; > break; > @@ -1354,8 +1548,8 @@ static int cma_resolve_loopback(struct r > ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr)); > > if (cma_zero_addr(&id_priv->id.route.addr.src_addr)) { > - src_in = (struct sockaddr_in *)&id_priv->id.route.addr.src_addr; > - dst_in = (struct sockaddr_in *)&id_priv->id.route.addr.dst_addr; > + src_in = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; > + dst_in = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr; > src_in->sin_family = dst_in->sin_family; > src_in->sin_addr.s_addr = dst_in->sin_addr.s_addr; > } > @@ -1646,6 +1840,47 @@ out: > return ret; > } > > +static int cma_connect_iw(struct rdma_id_private *id_priv, > + struct rdma_conn_param *conn_param) > +{ > + struct iw_cm_id *cm_id; > + struct sockaddr_in* sin; > + int ret; > + struct iw_cm_conn_param iw_param; > + > + cm_id = iw_create_cm_id(id_priv->id.device, cma_iw_handler, id_priv); > + if (IS_ERR(cm_id)) 
{ > + ret = PTR_ERR(cm_id); > + goto out; > + } > + > + id_priv->cm_id.iw = cm_id; > + > + sin = (struct sockaddr_in*) &id_priv->id.route.addr.src_addr; > + cm_id->local_addr = *sin; > + > + sin = (struct sockaddr_in*) &id_priv->id.route.addr.dst_addr; > + cm_id->remote_addr = *sin; > + > + ret = cma_modify_qp_rtr(&id_priv->id); > + if (ret) { > + iw_destroy_cm_id(cm_id); > + return ret; > + } > + > + iw_param.ord = conn_param->initiator_depth; > + iw_param.ird = conn_param->responder_resources; > + iw_param.private_data = conn_param->private_data; > + iw_param.private_data_len = conn_param->private_data_len; > + if (id_priv->id.qp) > + iw_param.qpn = id_priv->qp_num; > + else > + iw_param.qpn = conn_param->qp_num; > + ret = iw_cm_connect(cm_id, &iw_param); > +out: > + return ret; > +} > + > int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) > { > struct rdma_id_private *id_priv; > @@ -1661,10 +1896,13 @@ int rdma_connect(struct rdma_cm_id *id, > id_priv->srq = conn_param->srq; > } > > - switch (id->device->node_type) { > - case IB_NODE_CA: > + switch (rdma_node_get_transport(id->device->node_type)) { > + case RDMA_TRANSPORT_IB: > ret = cma_connect_ib(id_priv, conn_param); > break; > + case RDMA_TRANSPORT_IWARP: > + ret = cma_connect_iw(id_priv, conn_param); > + break; > default: > ret = -ENOSYS; > break; > @@ -1705,6 +1943,28 @@ static int cma_accept_ib(struct rdma_id_ > return ib_send_cm_rep(id_priv->cm_id.ib, &rep); > } > > +static int cma_accept_iw(struct rdma_id_private *id_priv, > + struct rdma_conn_param *conn_param) > +{ > + struct iw_cm_conn_param iw_param; > + int ret; > + > + ret = cma_modify_qp_rtr(&id_priv->id); > + if (ret) > + return ret; > + > + iw_param.ord = conn_param->initiator_depth; > + iw_param.ird = conn_param->responder_resources; > + iw_param.private_data = conn_param->private_data; > + iw_param.private_data_len = conn_param->private_data_len; > + if (id_priv->id.qp) { > + iw_param.qpn = id_priv->qp_num; > + } else 
> + iw_param.qpn = conn_param->qp_num; > + > + return iw_cm_accept(id_priv->cm_id.iw, &iw_param); > +} > + > int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) > { > struct rdma_id_private *id_priv; > @@ -1720,13 +1980,16 @@ int rdma_accept(struct rdma_cm_id *id, s > id_priv->srq = conn_param->srq; > } > > - switch (id->device->node_type) { > - case IB_NODE_CA: > + switch (rdma_node_get_transport(id->device->node_type)) { > + case RDMA_TRANSPORT_IB: > if (conn_param) > ret = cma_accept_ib(id_priv, conn_param); > else > ret = cma_rep_recv(id_priv); > break; > + case RDMA_TRANSPORT_IWARP: > + ret = cma_accept_iw(id_priv, conn_param); > + break; > default: > ret = -ENOSYS; > break; > @@ -1753,12 +2016,16 @@ int rdma_reject(struct rdma_cm_id *id, c > if (!cma_comp(id_priv, CMA_CONNECT)) > return -EINVAL; > > - switch (id->device->node_type) { > - case IB_NODE_CA: > + switch (rdma_node_get_transport(id->device->node_type)) { > + case RDMA_TRANSPORT_IB: > ret = ib_send_cm_rej(id_priv->cm_id.ib, > IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, > private_data, private_data_len); > break; > + case RDMA_TRANSPORT_IWARP: > + ret = iw_cm_reject(id_priv->cm_id.iw, > + private_data, private_data_len); > + break; > default: > ret = -ENOSYS; > break; > @@ -1777,16 +2044,18 @@ int rdma_disconnect(struct rdma_cm_id *i > !cma_comp(id_priv, CMA_DISCONNECT)) > return -EINVAL; > > - ret = cma_modify_qp_err(id); > - if (ret) > - goto out; > - > - switch (id->device->node_type) { > - case IB_NODE_CA: > + switch (rdma_node_get_transport(id->device->node_type)) { > + case RDMA_TRANSPORT_IB: > + ret = cma_modify_qp_err(id); > + if (ret) > + goto out; > /* Initiate or respond to a disconnect. 
*/ > if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0)) > ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0); > break; > + case RDMA_TRANSPORT_IWARP: > + ret = iw_cm_disconnect(id_priv->cm_id.iw, 0); > + break; > default: > break; > } > diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c > index b2f3cb9..7318fba 100644 > --- a/drivers/infiniband/core/device.c > +++ b/drivers/infiniband/core/device.c > @@ -30,7 +30,7 @@ > * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > * SOFTWARE. > * > - * $Id: device.c 1349 2004-12-16 21:09:43Z roland $ > + * $Id: device.c 5943 2006-03-22 00:58:04Z roland $ > */ > > #include > @@ -505,7 +505,7 @@ int ib_query_port(struct ib_device *devi > u8 port_num, > struct ib_port_attr *port_attr) > { > - if (device->node_type == IB_NODE_SWITCH) { > + if (device->node_type == RDMA_NODE_IB_SWITCH) { > if (port_num) > return -EINVAL; > } else if (port_num < 1 || port_num > device->phys_port_cnt) > @@ -580,7 +580,7 @@ int ib_modify_port(struct ib_device *dev > u8 port_num, int port_modify_mask, > struct ib_port_modify *port_modify) > { > - if (device->node_type == IB_NODE_SWITCH) { > + if (device->node_type == RDMA_NODE_IB_SWITCH) { > if (port_num) > return -EINVAL; > } else if (port_num < 1 || port_num > device->phys_port_cnt) > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > index b38e02a..a928ecf 100644 > --- a/drivers/infiniband/core/mad.c > +++ b/drivers/infiniband/core/mad.c > @@ -1,5 +1,5 @@ > /* > - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. > * Copyright (c) 2005 Intel Corporation. All rights reserved. > * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved. > * > @@ -31,7 +31,7 @@ > * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > * SOFTWARE. 
> * > - * $Id: mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ > + * $Id: mad.c 7294 2006-05-17 18:12:30Z roland $ > */ > #include > #include > @@ -2877,7 +2877,10 @@ static void ib_mad_init_device(struct ib > { > int start, end, i; > > - if (device->node_type == IB_NODE_SWITCH) { > + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) > + return; > + > + if (device->node_type == RDMA_NODE_IB_SWITCH) { > start = 0; > end = 0; > } else { > @@ -2924,7 +2927,7 @@ static void ib_mad_remove_device(struct > { > int i, num_ports, cur_port; > > - if (device->node_type == IB_NODE_SWITCH) { > + if (device->node_type == RDMA_NODE_IB_SWITCH) { > num_ports = 1; > cur_port = 0; > } else { > diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c > index 501cc05..4230277 100644 > --- a/drivers/infiniband/core/sa_query.c > +++ b/drivers/infiniband/core/sa_query.c > @@ -887,7 +887,10 @@ static void ib_sa_add_one(struct ib_devi > struct ib_sa_device *sa_dev; > int s, e, i; > > - if (device->node_type == IB_NODE_SWITCH) > + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) > + return; > + > + if (device->node_type == RDMA_NODE_IB_SWITCH) > s = e = 0; > else { > s = 1; > diff --git a/drivers/infiniband/core/smi.c b/drivers/infiniband/core/smi.c > index 35852e7..b81b2b9 100644 > --- a/drivers/infiniband/core/smi.c > +++ b/drivers/infiniband/core/smi.c > @@ -34,7 +34,7 @@ > * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > * SOFTWARE. 
> * > - * $Id: smi.c 1389 2004-12-27 22:56:47Z roland $ > + * $Id: smi.c 5258 2006-02-01 20:32:40Z sean.hefty $ > */ > > #include > @@ -64,7 +64,7 @@ int smi_handle_dr_smp_send(struct ib_smp > > /* C14-9:2 */ > if (hop_ptr && hop_ptr < hop_cnt) { > - if (node_type != IB_NODE_SWITCH) > + if (node_type != RDMA_NODE_IB_SWITCH) > return 0; > > /* smp->return_path set when received */ > @@ -77,7 +77,7 @@ int smi_handle_dr_smp_send(struct ib_smp > if (hop_ptr == hop_cnt) { > /* smp->return_path set when received */ > smp->hop_ptr++; > - return (node_type == IB_NODE_SWITCH || > + return (node_type == RDMA_NODE_IB_SWITCH || > smp->dr_dlid == IB_LID_PERMISSIVE); > } > > @@ -95,7 +95,7 @@ int smi_handle_dr_smp_send(struct ib_smp > > /* C14-13:2 */ > if (2 <= hop_ptr && hop_ptr <= hop_cnt) { > - if (node_type != IB_NODE_SWITCH) > + if (node_type != RDMA_NODE_IB_SWITCH) > return 0; > > smp->hop_ptr--; > @@ -107,7 +107,7 @@ int smi_handle_dr_smp_send(struct ib_smp > if (hop_ptr == 1) { > smp->hop_ptr--; > /* C14-13:3 -- SMPs destined for SM shouldn't be here */ > - return (node_type == IB_NODE_SWITCH || > + return (node_type == RDMA_NODE_IB_SWITCH || > smp->dr_slid == IB_LID_PERMISSIVE); > } > > @@ -142,7 +142,7 @@ int smi_handle_dr_smp_recv(struct ib_smp > > /* C14-9:2 -- intermediate hop */ > if (hop_ptr && hop_ptr < hop_cnt) { > - if (node_type != IB_NODE_SWITCH) > + if (node_type != RDMA_NODE_IB_SWITCH) > return 0; > > smp->return_path[hop_ptr] = port_num; > @@ -156,7 +156,7 @@ int smi_handle_dr_smp_recv(struct ib_smp > smp->return_path[hop_ptr] = port_num; > /* smp->hop_ptr updated when sending */ > > - return (node_type == IB_NODE_SWITCH || > + return (node_type == RDMA_NODE_IB_SWITCH || > smp->dr_dlid == IB_LID_PERMISSIVE); > } > > @@ -175,7 +175,7 @@ int smi_handle_dr_smp_recv(struct ib_smp > > /* C14-13:2 */ > if (2 <= hop_ptr && hop_ptr <= hop_cnt) { > - if (node_type != IB_NODE_SWITCH) > + if (node_type != RDMA_NODE_IB_SWITCH) > return 0; > > /* smp->hop_ptr updated 
when sending */ > @@ -190,7 +190,7 @@ int smi_handle_dr_smp_recv(struct ib_smp > return 1; > } > /* smp->hop_ptr updated when sending */ > - return (node_type == IB_NODE_SWITCH); > + return (node_type == RDMA_NODE_IB_SWITCH); > } > > /* C14-13:4 -- hop_ptr = 0 -> give to SM */ > diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c > index 21f9282..cfd2c06 100644 > --- a/drivers/infiniband/core/sysfs.c > +++ b/drivers/infiniband/core/sysfs.c > @@ -31,7 +31,7 @@ > * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > * SOFTWARE. > * > - * $Id: sysfs.c 1349 2004-12-16 21:09:43Z roland $ > + * $Id: sysfs.c 6940 2006-05-04 17:04:55Z roland $ > */ > > #include "core_priv.h" > @@ -589,10 +589,16 @@ static ssize_t show_node_type(struct cla > return -ENODEV; > > switch (dev->node_type) { > - case IB_NODE_CA: return sprintf(buf, "%d: CA\n", dev->node_type); > - case IB_NODE_SWITCH: return sprintf(buf, "%d: switch\n", dev->node_type); > - case IB_NODE_ROUTER: return sprintf(buf, "%d: router\n", dev->node_type); > - default: return sprintf(buf, "%d: \n", dev->node_type); > + case RDMA_NODE_IB_CA: > + return sprintf(buf, "%d: CA\n", dev->node_type); > + case RDMA_NODE_RNIC: > + return sprintf(buf, "%d: RNIC\n", dev->node_type); > + case RDMA_NODE_IB_SWITCH: > + return sprintf(buf, "%d: switch\n", dev->node_type); > + case RDMA_NODE_IB_ROUTER: > + return sprintf(buf, "%d: router\n", dev->node_type); > + default: > + return sprintf(buf, "%d: \n", dev->node_type); > } > } > > @@ -708,7 +714,7 @@ int ib_device_register_sysfs(struct ib_d > if (ret) > goto err_put; > > - if (device->node_type == IB_NODE_SWITCH) { > + if (device->node_type == RDMA_NODE_IB_SWITCH) { > ret = add_port(device, 0); > if (ret) > goto err_put; > diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c > index 67caf36..ad2e417 100644 > --- a/drivers/infiniband/core/ucm.c > +++ b/drivers/infiniband/core/ucm.c > @@ -30,7 +30,7 @@ > * CONNECTION WITH THE 
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > * SOFTWARE. > * > - * $Id: ucm.c 4311 2005-12-05 18:42:01Z sean.hefty $ > + * $Id: ucm.c 7119 2006-05-11 16:40:38Z sean.hefty $ > */ > > #include > @@ -1248,7 +1248,8 @@ static void ib_ucm_add_one(struct ib_dev > { > struct ib_ucm_device *ucm_dev; > > - if (!device->alloc_ucontext) > + if (!device->alloc_ucontext || > + rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) > return; > > ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL); > diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c > index afe70a5..0cbd692 100644 > --- a/drivers/infiniband/core/user_mad.c > +++ b/drivers/infiniband/core/user_mad.c > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004 Topspin Communications. All rights reserved. > - * Copyright (c) 2005 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2005-2006 Voltaire, Inc. All rights reserved. > * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. > * > * This software is available to you under a choice of one of two > @@ -31,7 +31,7 @@ > * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > * SOFTWARE. 
> * > - * $Id: user_mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ > + * $Id: user_mad.c 6041 2006-03-27 21:06:00Z halr $ > */ > > #include > @@ -967,7 +967,10 @@ static void ib_umad_add_one(struct ib_de > struct ib_umad_device *umad_dev; > int s, e, i; > > - if (device->node_type == IB_NODE_SWITCH) > + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) > + return; > + > + if (device->node_type == RDMA_NODE_IB_SWITCH) > s = e = 0; > else { > s = 1; > diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c > index 28fdbda..e4b45d7 100644 > --- a/drivers/infiniband/hw/ipath/ipath_verbs.c > +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c > @@ -984,7 +984,7 @@ static void *ipath_register_ib_device(in > (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | > (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | > (1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV); > - dev->node_type = IB_NODE_CA; > + dev->node_type = RDMA_NODE_IB_CA; > dev->phys_port_cnt = 1; > dev->dma_device = ipath_layer_get_device(dd); > dev->class_dev.dev = dev->dma_device; > diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c > index a2eae8a..5c31819 100644 > --- a/drivers/infiniband/hw/mthca/mthca_provider.c > +++ b/drivers/infiniband/hw/mthca/mthca_provider.c > @@ -1273,7 +1273,7 @@ int mthca_register_device(struct mthca_d > (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | > (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | > (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); > - dev->ib_dev.node_type = IB_NODE_CA; > + dev->ib_dev.node_type = RDMA_NODE_IB_CA; > dev->ib_dev.phys_port_cnt = dev->limits.num_ports; > dev->ib_dev.dma_device = &dev->pdev->dev; > dev->ib_dev.class_dev.dev = &dev->pdev->dev; > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 1c6ea1c..262427f 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -1084,13 
+1084,16 @@ static void ipoib_add_one(struct ib_devi > struct ipoib_dev_priv *priv; > int s, e, p; > > + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) > + return; > + > dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); > if (!dev_list) > return; > > INIT_LIST_HEAD(dev_list); > > - if (device->node_type == IB_NODE_SWITCH) { > + if (device->node_type == RDMA_NODE_IB_SWITCH) { > s = 0; > e = 0; > } else { > @@ -1114,6 +1117,9 @@ static void ipoib_remove_one(struct ib_d > struct ipoib_dev_priv *priv, *tmp; > struct list_head *dev_list; > > + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) > + return; > + > dev_list = ib_get_client_data(device, &ipoib_client); > > list_for_each_entry_safe(priv, tmp, dev_list, list) { > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c > index f1401e1..bba2956 100644 > --- a/drivers/infiniband/ulp/srp/ib_srp.c > +++ b/drivers/infiniband/ulp/srp/ib_srp.c > @@ -1845,7 +1845,7 @@ static void srp_add_one(struct ib_device > if (IS_ERR(srp_dev->fmr_pool)) > srp_dev->fmr_pool = NULL; > > - if (device->node_type == IB_NODE_SWITCH) { > + if (device->node_type == RDMA_NODE_IB_SWITCH) { > s = 0; > e = 0; > } else { > diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h > index fcb5ba8..d95d3eb 100644 > --- a/include/rdma/ib_addr.h > +++ b/include/rdma/ib_addr.h > @@ -40,7 +40,7 @@ struct rdma_dev_addr { > unsigned char src_dev_addr[MAX_ADDR_LEN]; > unsigned char dst_dev_addr[MAX_ADDR_LEN]; > unsigned char broadcast[MAX_ADDR_LEN]; > - enum ib_node_type dev_type; > + enum rdma_node_type dev_type; > }; > > /** > @@ -72,6 +72,9 @@ int rdma_resolve_ip(struct sockaddr *src > > void rdma_addr_cancel(struct rdma_dev_addr *addr); > > +int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, > + const unsigned char *dst_dev_addr); > + > static inline int ip_addr_size(struct sockaddr *addr) > { > return addr->sa_family == AF_INET6 ? 
> @@ -111,4 +114,14 @@ static inline void ib_addr_set_dgid(stru > memcpy(dev_addr->dst_dev_addr + 4, gid, sizeof *gid); > } > > +static inline union ib_gid* iw_addr_get_sgid(struct rdma_dev_addr* rda) > +{ > + return (union ib_gid *) rda->src_dev_addr; > +} > + > +static inline union ib_gid* iw_addr_get_dgid(struct rdma_dev_addr* rda) > +{ > + return (union ib_gid *) rda->dst_dev_addr; > +} > + > #endif /* IB_ADDR_H */ > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index aeb4fcd..eac2d8f 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -35,7 +35,7 @@ > * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > * SOFTWARE. > * > - * $Id: ib_verbs.h 1349 2004-12-16 21:09:43Z roland $ > + * $Id: ib_verbs.h 6885 2006-05-03 18:22:02Z sean.hefty $ > */ > > #if !defined(IB_VERBS_H) > @@ -56,12 +56,35 @@ union ib_gid { > } global; > }; > > -enum ib_node_type { > - IB_NODE_CA = 1, > - IB_NODE_SWITCH, > - IB_NODE_ROUTER > +enum rdma_node_type { > + /* IB values map to NodeInfo:NodeType. 
*/ > + RDMA_NODE_IB_CA = 1, > + RDMA_NODE_IB_SWITCH, > + RDMA_NODE_IB_ROUTER, > + RDMA_NODE_RNIC > }; > > +enum rdma_transport_type { > + RDMA_TRANSPORT_IB, > + RDMA_TRANSPORT_IWARP > +}; > + > +static inline enum rdma_transport_type > +rdma_node_get_transport(enum rdma_node_type node_type) > +{ > + switch (node_type) { > + case RDMA_NODE_IB_CA: > + case RDMA_NODE_IB_SWITCH: > + case RDMA_NODE_IB_ROUTER: > + return RDMA_TRANSPORT_IB; > + case RDMA_NODE_RNIC: > + return RDMA_TRANSPORT_IWARP; > + default: > + BUG(); > + return 0; > + } > +} > + > enum ib_device_cap_flags { > IB_DEVICE_RESIZE_MAX_WR = 1, > IB_DEVICE_BAD_PKEY_CNTR = (1<<1), > @@ -78,6 +101,9 @@ enum ib_device_cap_flags { > IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), > IB_DEVICE_SRQ_RESIZE = (1<<13), > IB_DEVICE_N_NOTIFY_CQ = (1<<14), > + IB_DEVICE_ZERO_STAG = (1<<15), > + IB_DEVICE_SEND_W_INV = (1<<16), > + IB_DEVICE_MEM_WINDOW = (1<<17) > }; > > enum ib_atomic_cap { > @@ -830,6 +856,7 @@ struct ib_cache { > u8 *lmc_cache; > }; > > +struct iw_cm_verbs; > struct ib_device { > struct device *dma_device; > > @@ -846,6 +873,8 @@ struct ib_device { > > u32 flags; > > + struct iw_cm_verbs *iwcm; > + > int (*query_device)(struct ib_device *device, > struct ib_device_attr *device_attr); > int (*query_port)(struct ib_device *device, > - > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html From mshefty at ichips.intel.com Wed Jun 7 15:21:27 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 07 Jun 2006 15:21:27 -0700 Subject: [openib-general] crash in ib_sa_mcmember_rec_callback while probing out ib_sa In-Reply-To: References: Message-ID: <44875167.2090405@ichips.intel.com> Roland Dreier wrote: > Looks like the same crash mst saw related to the multicast module > being unloaded and then having sa call back into it. 
One small clue: > > > esi: f38a5bec edi: f38a5bf4 ebp: fffffffc esp: f599be60 > > ebp is -4, which is -EINTR. So this may be a callback from sa_query's > send_handler() caused by an IB_WC_WR_FLUSH_ERR status. This makes sense given the call trace. When ib_sa is unloading, it unregisters its mad_agent, which results in canceling all outstanding MADs. What doesn't make sense to me is how ib_multicast could have unloaded while there are any outstanding SA queries. All queries hold a reference on a MC group until they complete. And all groups reference a port. The module shouldn't unload until all references are released on all ports. I removed some code that was intended to speed up cleanup, but is unnecessary. We can see if that helps, but I'm skeptical. - Sean From ardavis at ichips.intel.com Wed Jun 7 15:24:46 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 07 Jun 2006 15:24:46 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS In-Reply-To: References: Message-ID: <4487522E.7060808@ichips.intel.com> Scott Weitzenkamp (sweitzen) wrote: >Yes, the modules were loaded. > >Each of the 32 hosts had 3 IB ports up. Does Intel MPI or uDAPL use >multiple ports and/or multiple HCAs? > >I shut down all but one port on each host, and now Pallas is running >better on the 32 nodes using Intel MPI 2.0.1. HP MPI 2.2 started >working with Pallas too over uDAPL, so maybe this is a uDAPL issue? > > Can you tell me what adapters are installed (ibstat), how they are configured (ifconfig), and what your dat.conf looks like? It sounds like a device mapping issue during the dat_ia_open() processing. Multiple ports and HCAs should work fine, but some care is required in configuring the dat.conf so you consistently pick up the correct device across the cluster.
Intel MPI will simply open a device based on the provider/device name (example: setenv I_MPI_DAPL_PROVIDER=OpenIB-cma) defined in the dat.conf and query uDAPL for the address to be used for connections. This line in the dat.conf will determine which library to load and which IB device to open and bind to. If you have the same exact configuration on each node and know that ib0, ib1, ib2, etc. will always come up in the same order then you can simply use the same netdev names across the cluster and use the same exact copy of dat.conf on each node. Here are the dat.conf options for OpenIB-cma configurations.
# For cma version you specify as:
# network address, network hostname, or netdev name and 0 for port
#
# Simple (OpenIB-cma) default with netdev name provided first on list
# to enable use of same dat.conf version on all nodes
#
OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
OpenIB-cma-ip u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "192.168.0.22 0" ""
OpenIB-cma-name u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "svr1-ib0 0" ""
OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
Which type are you using? address, hostname, or netdev names? Also, Intel MPI is sometimes too smart for its own good when opening rdma devices via uDAPL. If the open fails with the first rdma device specified in the dat.conf it will continue onto the next line until one is successful. If all rdma devices fail it will then go onto the static device automatically. This sometimes does more harm than good since one node could be failing over to the second device in your configuration and the other nodes are all on the first device. If they are all on the same subnet then it would work fine but if they are on different subnets then we would not be able to connect. If you send me your configuration, we can set it up here and hopefully duplicate your error case.
-arlin From sweitzen at cisco.com Wed Jun 7 15:44:42 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 7 Jun 2006 15:44:42 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS Message-ID: I have not touched /etc/dat.conf, so I am using whatever comes with OFED 1.0 rc5. For whatever reason, things have improved some. I am now running Intel MPI right after bringing up hosts (previously I was trying MVAPICH, then Open MPI, then HP MPI, then Intel MPI). I've run twice, and see these failures: Run #1 (after rebooting all hosts): rank 13 in job 1 192.168.1.1_34674 caused collective abort of all ranks^M exit status of rank 13: killed by signal 11 ^M ^[_releng at svbu-qaclus-1:/data/home/scott/builds/TopspinOS-2.7.0/build013 /protes\ t/Lk3/060706_123945/intel.intel^[\[releng at svbu-qaclus-1 intel.intel]$ ### TEST-W: Could not run /data/home/scott/builds/TopspinOS-2.7.0/build013/prot\ est/Lk3/060706_123945/intel.intel/1149709233/IMB_2.3/src/IMB-MPI1 Allreduce : 0\ Run #2 (after rebooting all hosts): rank 6 in job 1 192.168.1.1_33649 caused collective abort of all ranks^M exit status of rank 6: killed by signal 11 ^M ^[_releng at svbu-qaclus-1:/data/home/scott/builds/TopspinOS-2.7.0/build013 /protes\ t/Lk3/060706_145739/intel.intel^[\[releng at svbu-qaclus-1 intel.intel]$ ### TEST-W: Could not run /data/home/scott/builds/TopspinOS-2.7.0/build013/prot\ est/Lk3/060706_145739/intel.intel/1149717497/IMB_2.3/src/IMB-MPI1 Exchange : 0 rank 21 in job 1 192.168.1.1_34734 caused collective abort of all ranks^M exit status of rank 21: killed by signal 11 ^M ^[_releng at svbu-qaclus-1:/data/home/scott/builds/TopspinOS-2.7.0/build013 /protes\ t/Lk3/060706_145739/intel.intel^[\[releng at svbu-qaclus-1 intel.intel]$ ### TEST-W: Could not run /data/home/scott/builds/TopspinOS-2.7.0/build013/prot\ est/Lk3/060706_145739/intel.intel/1149717497/IMB_2.3/src/IMB-MPI1 Allgatherrv -\ multi 1: 0 Scott Weitzenkamp SQA and 
Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: Arlin Davis [mailto:ardavis at ichips.intel.com] > Sent: Wednesday, June 07, 2006 3:25 PM > To: Scott Weitzenkamp (sweitzen) > Cc: Davis, Arlin R; Lentini, James; openib-general > Subject: Re: [openib-general] [PATCH] uDAPL openib-cma > provider - add support for IB_CM_REQ_OPTIONS > > Scott Weitzenkamp (sweitzen) wrote: > > >Yes, the modules were loaded. > > > >Each of the 32 hosts had 3 IB ports up. Does Intel MPI or uDAPL use > >multiple ports and/or multiple HCAs? > > > >I shut down all but one port on each host, and now Pallas is running > >better on the 32 nodes using Intel MPI 2.0.1. HP MPI 2.2 started > >working too with Pallas too over uDAPL, so maybe this is a > uDAPL issue? > > > > > Can you tell me what adapters are installed (ibstat), how they are > configured (ifconfig), and what your dat.conf looks like? It sounds > like a device mapping issue during the dat_ia_open() processing. > > Multiple ports and HCAs should work fine but there is some > care required > in configuration of the dat.conf so you consitantly pick up > the correct > device across the cluster. Intel MPI will simply open a > device based on > the provider/device name (example: setenv > I_MPI_DAPL_PROVIDER=OpenIB-cma) defined in the dat.conf and > query dapl > for the address to be used for connections. This line in the dat.conf > will determine which library to load and which IB device to open and > bind too. If you have the same exact configuration on each > node and know > that the ib0,ib1,ib2, etc will always come up in the same > order then you > can simply use the same netdev names across the cluster and > use the same > exact copy of dat.conf on each node. > > Here are the dat.conf options for OpenIB-cma configurations. 
> > # For cma version you specify as: > # network address, network hostname, or netdev name and > 0 for port > # > # Simple (OpenIB-cma) default with netdev name provided first on list > # to enable use of same dat.conf version on all nodes > # > OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so > mv_dapl.1.2 > "ib0 0" "" > OpenIB-cma-ip u1.2 nonthreadsafe default /usr/lib/libdaplcma.so > mv_dapl.1.2 "192.168.0.22 0" "" > OpenIB-cma-name u1.2 nonthreadsafe default /usr/lib/libdaplcma.so > mv_dapl.1.2 "svr1-ib0 0" "" > OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/lib/libdaplcma.so > mv_dapl.1.2 "ib0 0" "" > > Which type are you using? address, hostname, or netdev names? > > Also, Intel MPI is sometimes too smart for its own good when opening > rdma devices via uDAPL. If the open fails with the first rdma device > specified in the dat.conf it will continue onto the next line > until one > is successfull. If all rdma devices fail it will then go onto > the static > device automatcally. This sometimes does more harm then good > since one > node could be failing over to the second device in your configuration > and the other nodes are all on the first device. If they are > all on the > same subnet then it would work fine but if they are on > different subnets > then we would not be able to connect. > > If you send me your configuration, we can set it up here and > hopefully > duplicate your error case. 
> > -arlin > From mshefty at ichips.intel.com Wed Jun 7 17:19:37 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 07 Jun 2006 17:19:37 -0700 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <1149706206.4510.292005.camel@hal.voltaire.com> References: <1149024804.4510.1056.camel@hal.voltaire.com> <20060531090817.GQ21266@mellanox.co.il> <447DC8F8.60409@ichips.intel.com> <1149095100.4510.29902.camel@hal.voltaire.com> <447DD2E4.3030709@ichips.intel.com> <44871A04.9010705@ichips.intel.com> <1149706206.4510.292005.camel@hal.voltaire.com> Message-ID: <44876D19.5040205@ichips.intel.com> Hal Rosenstock wrote: >> This >>leads to a race where NonMembers and SendOnlyNonMembers will fail to re-join >>until one of the FullMembers joins. > > Might also be true with joins (not creates) from FullMembers too. I > would presume in such cases, the join would be retried. SendOnlyMembers > (at least for IPoIB) do this if not joined every time a packet is sent. Correct. But all clients trying to rejoin groups must be aware of this, and delay / retry until their groups are recreated. Let me know if I'm off here, but it also appears that clients can't rely on an existing QP attachment or address handle to send to the new group. Even if a group is re-created, there's no guarantee that the SA didn't assign a different MLID to the group. So, the only safe thing to do is for all multicast clients to detach from all multicast groups, destroy all address handles, possibly wait for a new group to be created, and then start all over again. Is this correct? 
- Sean From halr at voltaire.com Wed Jun 7 17:55:27 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Jun 2006 20:55:27 -0400 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <44876D19.5040205@ichips.intel.com> References: <1149024804.4510.1056.camel@hal.voltaire.com> <20060531090817.GQ21266@mellanox.co.il> <447DC8F8.60409@ichips.intel.com> <1149095100.4510.29902.camel@hal.voltaire.com> <447DD2E4.3030709@ichips.intel.com> <44871A04.9010705@ichips.intel.com> <1149706206.4510.292005.camel@hal.voltaire.com> <44876D19.5040205@ichips.intel.com> Message-ID: <1149728121.4510.301957.camel@hal.voltaire.com> On Wed, 2006-06-07 at 20:19, Sean Hefty wrote: > Hal Rosenstock wrote: > >> This > >>leads to a race where NonMembers and SendOnlyNonMembers will fail to re-join > >>until one of the FullMembers joins. > > > > Might also be true with joins (not creates) from FullMembers too. I > > would presume in such cases, the join would be retried. SendOnlyMembers > > (at least for IPoIB) do this if not joined every time a packet is sent. > > Correct. But all clients trying to rejoin groups must be aware of this, and > delay / retry until their groups are recreated. I might be missing your point but UD is unreliable so the sends can be dropped. The delay/retry is to make sure the join does occur. > Let me know if I'm off here, but it also appears that clients can't rely on an > existing QP attachment or address handle to send to the new group. Even if a > group is re-created, there's no guarantee that the SA didn't assign a different > MLID to the group. Correct. I have seen this behavior with various dynamic groups. I know there was code in IPoIB to handle local LID changes (adjusting the AH). I'm not sure about whether multicast changes were handled too but I don't recall this. > So, the only safe thing to do is for all multicast clients to detach from all > multicast groups, destroy all address handles, Why all groups ?
> possibly wait for a new group to be created, and then start all over again. Start what all over again ? > Is this correct? I'm not completely following you yet. -- Hal > - Sean From sean.hefty at intel.com Wed Jun 7 19:48:42 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 7 Jun 2006 19:48:42 -0700 Subject: [openib-general] RE: Failed multicast join withnew multicast module In-Reply-To: <1149728121.4510.301957.camel@hal.voltaire.com> Message-ID: >I might be missing your point but UD is unreliable so the sends can be >dropped. The delay/retry is to make sure the join does occur, This is different than a dropped request or reply. In this case, the receiver gets a reply, but it will be a failure from the SA to join the group. For example, a NonMember tries to re-join before a FullMember which would have created the group does. The result is that requests that receive a reply also need to be retried, with the timeout dependent on some remote node in the fabric creating the group. >> So, the only safe thing to do is for all multicast clients to detach from all >> multicast groups, destroy all address handles, > >Why all groups ? Because the SM has lost track that any groups in the fabric existed, so those groups must be recreated, all potentially with different mlids. >> possibly wait for a new group to be created, and then start all over again. > >Start what all over again ? I meant attach the QP to the new group and allocate a new address handle. This is a general comment, and not directed at anyone specific, but is this really the architecture and implementation that we want to aim for? I really think that we need to look at solutions that don't break existing communication, unless the links providing that communication actually go down, even if this means extending the architecture. 
- Sean From zhushisongzhu at yahoo.com Wed Jun 7 22:00:10 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Wed, 7 Jun 2006 22:00:10 -0700 (PDT) Subject: [openib-general] how about sdp progress In-Reply-To: <20060607174343.315582283DE@openib.ca.sandia.gov> Message-ID: <20060608050010.83564.qmail@web36910.mail.mud.yahoo.com> MST, how about sdp progress now? zhu __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From tziporet at mellanox.co.il Wed Jun 7 23:25:55 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 8 Jun 2006 09:25:55 +0300 Subject: [openib-general] OFED-1.0-rc6 is available Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7162@mtlexch01.mtl.com> We sure will. -----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Wednesday, June 07, 2006 11:03 PM To: Tziporet Koren; openfabrics-ewg at openib.org; openib-general Subject: Re: [openib-general] OFED-1.0-rc6 is available We also just found a bug in how ibsrpdm discovers Cisco/Topspin FC gateways. The patch is below, and is also checked in to the trunk as svn rev 7803. Please include this in OFED 1.0 final. Thanks, Roland --- srptools/ChangeLog (revision 7796) +++ srptools/ChangeLog (working copy) @@ -1,3 +1,9 @@ +2006-06-07 Roland Dreier + * src/srp-dm.c (do_port): Use correct endianness when comparing + GUID against Topspin OUI. + + * src/srp-dm.c (set_class_port_info): Trivial whitespace fixes. 
+ 2006-05-29 Ishai Rabinovitz * src/srp-dm.c (main): The agent ID array is declared with 0 --- srptools/src/srp-dm.c (revision 7796) +++ srptools/src/srp-dm.c (working copy) @@ -52,8 +52,6 @@ #include "ib_user_mad.h" #include "srp-dm.h" -static const uint8_t topspin_oui[3] = { 0x00, 0x05, 0xad }; - static char *umad_dev = "/dev/infiniband/umad0"; static char *port_sysfs_path; static int timeout_ms = 25000; @@ -249,7 +247,7 @@ static int set_class_port_info(int fd, u init_srp_dm_mad(&out_mad, agent[1], dlid, SRP_DM_ATTR_CLASS_PORT_INFO, 0); - out_dm_mad = (void *) out_mad.data; + out_dm_mad = (void *) out_mad.data; out_dm_mad->method = SRP_DM_METHOD_SET; cpi = (void *) out_dm_mad->data; @@ -266,9 +264,8 @@ static int set_class_port_info(int fd, u return -1; } - for (i = 0; i < 8; ++i) { + for (i = 0; i < 8; ++i) ((uint16_t *) cpi->trap_gid)[i] = htons(strtol(val + i * 5, NULL, 16)); - } if (send_and_get(fd, &out_mad, &in_mad, 0) < 0) return -1; @@ -371,7 +368,10 @@ static int do_port(int fd, uint32_t agen struct srp_dm_svc_entries svc_entries; int i, j, k; - if (!memcmp(&guid, topspin_oui, 3) && + static const uint64_t topspin_oui = 0x0005ad0000000000ull; + static const uint64_t oui_mask = 0xffffff0000000000ull; + + if ((guid & oui_mask) == topspin_oui && set_class_port_info(fd, agent, dlid)) fprintf(stderr, "Warning: set of ClassPortInfo failed\n"); From jackm at mellanox.co.il Wed Jun 7 23:42:48 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Thu, 8 Jun 2006 09:42:48 +0300 Subject: [openib-general] Re: [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS In-Reply-To: References: <200606071639.03787.jackm@mellanox.co.il> Message-ID: <200606080942.48767.jackm@mellanox.co.il> On Wednesday 07 June 2006 18:26, James Lentini wrote: > On Wed, 7 Jun 2006, Jack Morgenstein wrote: > > This (bug fix) can still be included in next-week's release, if you > > think it is important (I have extracted it from the changes checked > > in at svn 7755) > > 
If you are going to make another release anyway, then I would included > it. Do you mean -- include the fix in next week's release -- or -- wait with the fix for the following release? - Jack From akpm at osdl.org Thu Jun 8 00:54:52 2006 From: akpm at osdl.org (Andrew Morton) Date: Thu, 8 Jun 2006 00:54:52 -0700 Subject: [openib-general] Re: [PATCH v2 1/2] iWARP Connection Manager. In-Reply-To: <20060607200605.9003.25830.stgit@stevo-desktop> References: <20060607200600.9003.56328.stgit@stevo-desktop> <20060607200605.9003.25830.stgit@stevo-desktop> Message-ID: <20060608005452.087b34db.akpm@osdl.org> On Wed, 07 Jun 2006 15:06:05 -0500 Steve Wise wrote: > > This patch provides the new files implementing the iWARP Connection > Manager. > > Review Changes: > > - sizeof -> sizeof() > > - removed printks > > - removed TT debug code > > - cleaned up lock/unlock around switch statements. > > - waitqueue -> completion for destroy path. > > ... > > +/* > + * This function is called on interrupt context. Schedule events on > + * the iwcm_wq thread to allow callback functions to downcall into > + * the CM and/or block. Events are queued to a per-CM_ID > + * work_list. If this is the first event on the work_list, the work > + * element is also queued on the iwcm_wq thread. > + * > + * Each event holds a reference on the cm_id. Until the last posted > + * event has been delivered and processed, the cm_id cannot be > + * deleted. > + */ > +static void cm_event_handler(struct iw_cm_id *cm_id, > + struct iw_cm_event *iw_event) > +{ > + struct iwcm_work *work; > + struct iwcm_id_private *cm_id_priv; > + unsigned long flags; > + > + work = kmalloc(sizeof(*work), GFP_ATOMIC); > + if (!work) > + return; This allocation _will_ fail sometimes. The driver must recover from it. Will it do so? > +EXPORT_SYMBOL(iw_cm_init_qp_attr); This file exports a ton of symbols. It's usual to provide some justifying commentary in the changelog when this happens. 
> +/* > + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > +#if !defined(IW_CM_PRIVATE_H) > +#define IW_CM_PRIVATE_H We normally use #ifndef here. From akpm at osdl.org Thu Jun 8 01:17:44 2006 From: akpm at osdl.org (Andrew Morton) Date: Thu, 8 Jun 2006 01:17:44 -0700 Subject: [openib-general] Re: [PATCH v2 4/7] AMSO1100 Memory Management. 
In-Reply-To: <20060607200655.9259.90768.stgit@stevo-desktop> References: <20060607200646.9259.24588.stgit@stevo-desktop> <20060607200655.9259.90768.stgit@stevo-desktop> Message-ID: <20060608011744.1a66e85a.akpm@osdl.org> On Wed, 07 Jun 2006 15:06:55 -0500 Steve Wise wrote: > > +void c2_free(struct c2_alloc *alloc, u32 obj) > +{ > + spin_lock(&alloc->lock); > + clear_bit(obj, alloc->table); > + spin_unlock(&alloc->lock); > +} The spinlock is unneeded here. What does all the code in this file do, anyway? It looks totally generic (and hence inappropriate for drivers/infiniband/hw/amso1100/) and somewhat similar to idr trees, perhaps. > +int c2_array_set(struct c2_array *array, int index, void *value) > +{ > + int p = (index * sizeof(void *)) >> PAGE_SHIFT; > + > + /* Allocate with GFP_ATOMIC because we'll be called with locks held. */ > + if (!array->page_list[p].page) > + array->page_list[p].page = > + (void **) get_zeroed_page(GFP_ATOMIC); > + > + if (!array->page_list[p].page) > + return -ENOMEM; This _will_ happen under load. What will the result of that be, in the context of this driver? This function is incorrectly designed - it should receive a gfp_t argument. Because you don't *know* that the caller will always hold a spinlock. And GFP_KERNEL is far, far stronger than GFP_ATOMIC. > +static int c2_alloc_mqsp_chunk(gfp_t gfp_mask, struct sp_chunk **head) > +{ > + int i; > + struct sp_chunk *new_head; > + > + new_head = (struct sp_chunk *) __get_free_page(gfp_mask | GFP_DMA); Why is __GFP_DMA in there? Unless you've cornered the ISA bus infiniband market, it's likely to be wrong. From eitan at mellanox.co.il Thu Jun 8 01:40:30 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 08 Jun 2006 11:40:30 +0300 Subject: [openib-general] [PATCH] osm: fix mlx vendor rmpp sender fail to send zero size RMPP Message-ID: <86hd2wkpkx.fsf@mtl066.yok.mtl.com> Hi Hal Run into this by chance.
Some changes introduced lately to the SA queries now sends zero size RMPP (single segment with only headers). It used to send them as non-RMPP responses. Anyway, this broke the mlx vendor code that I use for simulation. This patch resolves this new problem. Eitan Signed-off-by: Eitan Zahavi Index: libvendor/osm_vendor_mlx_sar.c =================================================================== --- libvendor/osm_vendor_mlx_sar.c (revision 7703) +++ libvendor/osm_vendor_mlx_sar.c (working copy) @@ -91,7 +91,7 @@ osmv_rmpp_sar_get_mad_seg( num_segs++; } - if ( seg_idx > num_segs) + if ( (seg_idx > num_segs) && (seg_idx != 1) ) { return IB_NOT_FOUND; } @@ -102,18 +102,14 @@ osmv_rmpp_sar_get_mad_seg( /* attach header */ memcpy(p_buf,p_sar->p_arbt_mad,p_sar->hdr_sz); - /* fill data */ p_seg = (char*)p_sar->p_arbt_mad + p_sar->hdr_sz + ((seg_idx-1) * p_sar->data_sz); sz_left = p_sar->data_len - ((seg_idx -1) * p_sar->data_sz); if (sz_left > p_sar->data_sz) - { memcpy((char*)p_buf+p_sar->hdr_sz,(char*)p_seg,p_sar->data_sz); - } else memcpy((char*)p_buf+ p_sar->hdr_sz, (char*)p_seg, sz_left); - return IB_SUCCESS; } From tziporet at mellanox.co.il Thu Jun 8 01:53:05 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 8 Jun 2006 11:53:05 +0300 Subject: [openib-general] OFED-1.0-rc6 is available Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7173@mtlexch01.mtl.com> Roland did the fix on Trunk and I took it to OFED 1.0 branch. Tziporet -----Original Message----- From: Ramachandra K [mailto:rkuchimanchi at silverstorm.com] Sent: Wednesday, June 07, 2006 8:28 PM To: Tziporet Koren Cc: openfabrics-ewg at openib.org; openib-general; Ramachandra K Subject: Re: [openib-general] OFED-1.0-rc6 is available Tziporet Koren wrote: > Hi All, > > We have prepared OFED 1.0 RC6. 
> From the openib source tar ball in OFED RC6, it looks like the SRP kernel changes (ulp/srp/ib_srp.c) in the trunk for supporting Rev 10 targets have been included in RC6, but the corresponding changes to the userspace srptool--ibsrpdm (userspace/srptools/src/srp-dm.c) for displaying the IO class of the target have not been made part of RC6. The changes to ibsrpdm were committed to the SVN repository trunk in revision number 7758. Will the latest version of ibsrpdm make it to the next OFED release ? Regards, Ram From halr at voltaire.com Thu Jun 8 03:41:58 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Jun 2006 06:41:58 -0400 Subject: [openib-general] Re: [PATCH] osm: fix mlx vendor rmpp sender fail to send zero size RMPP In-Reply-To: <86hd2wkpkx.fsf@mtl066.yok.mtl.com> References: <86hd2wkpkx.fsf@mtl066.yok.mtl.com> Message-ID: <1149763300.4510.319639.camel@hal.voltaire.com> Hi Eitan, On Thu, 2006-06-08 at 04:40, Eitan Zahavi wrote: > Hi Hal > > Run into this by chance. Some changes introduced lately to the SA queries > now sends zero size RMPP (single segment with only headers). It used to send > them as non-RMPP responses. Not sure what that change was. > Anyway, this broke the mlx vendor code that I use > for simulation. > > This patch resolves this new problem. Thanks. Applied to trunk only. Any idea of OFED RC6 has this issue ? -- Hal > Eitan From eitan at mellanox.co.il Thu Jun 8 04:08:53 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 8 Jun 2006 14:08:53 +0300 Subject: [openib-general] RE: [PATCH] osm: fix mlx vendor rmpp sender fail to send zero sizeRMPP Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236880E@mtlexch01.mtl.com> This does not have to get into OFED. I did not see these failures there. Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Thursday, June 08, 2006 1:42 PM > To: Eitan Zahavi > Cc: OPENIB > Subject: Re: [PATCH] osm: fix mlx vendor rmpp sender fail to send zero sizeRMPP > > Hi Eitan, > > On Thu, 2006-06-08 at 04:40, Eitan Zahavi wrote: > > Hi Hal > > > > Run into this by chance. Some changes introduced lately to the SA queries > > now sends zero size RMPP (single segment with only headers). It used to send > > them as non-RMPP responses. > > Not sure what that change was. > > > Anyway, this broke the mlx vendor code that I use > > for simulation. > > > > This patch resolves this new problem. > > Thanks. Applied to trunk only. Any idea of OFED RC6 has this issue ? > > -- Hal > > > Eitan From halr at voltaire.com Thu Jun 8 04:03:09 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Jun 2006 07:03:09 -0400 Subject: [openib-general] RE: Failed multicast join withnew multicast module In-Reply-To: References: Message-ID: <1149764555.4510.320252.camel@hal.voltaire.com> On Wed, 2006-06-07 at 22:48, Sean Hefty wrote: > >I might be missing your point but UD is unreliable so the sends can be > >dropped. The delay/retry is to make sure the join does occur, > > This is different than a dropped request or reply. In this case, the receiver > gets a reply, but it will be a failure from the SA to join the group. By receiver, I think you are referring to SA requester. Yes, the SA would reject the request with a status ERR_REQ_INSUFFICIENT_COMPONENTS. > For example, a NonMember tries to re-join before a FullMember which would have > created the group does. The result is that requests that receive a reply also > need to be retried, with the timeout dependent on some remote node in the fabric > creating the group. and it is unknown when such a multicast registration (to create the group) would occur. So the proper timeout is unknown. 
That's why IPoIB has a couple of different strategies for handling this depending on the JoinState, > >> So, the only safe thing to do is for all multicast clients to detach from all > >> multicast groups, destroy all address handles, > > > >Why all groups ? > > Because the SM has lost track that any groups in the fabric existed, so those > groups must be recreated, all potentially with different mlids. Yes, in the case of client reregister. > >> possibly wait for a new group to be created, and then start all over again. > > > >Start what all over again ? > > I meant attach the QP to the new group and allocate a new address handle. Couldn't it modify the old one as an alternative strategy ? > This is a general comment, and not directed at anyone specific, Don't worry. I'm not taking it personally. Just want to give you my $0.02 worth on what I think you are saying below: > but is this > really the architecture and implementation that we want to aim for? I really > think that we need to look at solutions that don't break existing communication, > unless the links providing that communication actually go down, even if this > means extending the architecture. If this comment is directed at client reregister mechanism, you should note that when this was brought up there was resistance to it based on the recommendation (probably not a strong enough word for this) that SMs be redundant in the subnet. There was a fair bit of anecdotal evidence that this was not how they were being used at the time but it may have been a chicken and egg problem. -- Hal > - Sean From eitan at mellanox.co.il Thu Jun 8 04:24:03 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 08 Jun 2006 14:24:03 +0300 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query Message-ID: <86fyiflwks.fsf@mtl066.yok.mtl.com> Hi Hal I'm working on passing osmtest check. 
Found a bug in the new GUIDInfoRecord query: If you had a physical port with zero guid_cap the code would loop on blocks 0..255 instead of trying the next port. I am still looking for why we might have a guid_cap == 0 on some ports. This patch resolves this new problem. osmtest passes on some arbitrary networks. Eitan Signed-off-by: Eitan Zahavi Index: opensm/osm_sa_guidinfo_record.c =================================================================== --- opensm/osm_sa_guidinfo_record.c (revision 7703) +++ opensm/osm_sa_guidinfo_record.c (working copy) @@ -255,6 +255,10 @@ __osm_sa_gir_create_gir( continue; p_pi = osm_physp_get_port_info_ptr( p_physp ); + + if ( p_pi->guid_cap == 0 ) + continue; + num_blocks = p_pi->guid_cap / 8; if ( p_pi->guid_cap % 8 ) num_blocks++; From cganapathi at novell.com Thu Jun 8 05:12:14 2006 From: cganapathi at novell.com (CH Ganapathi) Date: Thu, 08 Jun 2006 06:12:14 -0600 Subject: [openib-general] [PATCH] ib_uverbs_get_context does not unlock file->mutex in error path Message-ID: <44886176.6C2D.007B.0@novell.com> Hi, If ibdev->alloc_ucontext(ibdev, &udata) fails then ib_uverbs_get_context does not unlock file->mutex before returning error. Thanks, Ganapathi Novell Inc. 
Signed-off by: Ganapathi CH Index: linux-kernel/infiniband/core/uverbs_cmd.c =================================================================== --- infiniband/core/uverbs_cmd.c 2006-06-08 11:52:29.000000000 +0530 +++ infiniband-fix/core/uverbs_cmd.c 2006-06-08 17:16:10.000000000 +0530 @@ -80,8 +80,10 @@ ssize_t ib_uverbs_get_context(struct ib_ in_len - sizeof cmd, out_len - sizeof resp); ucontext = ibdev->alloc_ucontext(ibdev, &udata); - if (IS_ERR(ucontext)) - return PTR_ERR(file->ucontext); + if (IS_ERR(ucontext)) { + ret = PTR_ERR(file->ucontext); + goto err; + } ucontext->device = ibdev; INIT_LIST_HEAD(&ucontext->pd_list); From halr at voltaire.com Thu Jun 8 05:54:06 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Jun 2006 08:54:06 -0400 Subject: [openib-general] Re: [PATCH] osm: fix num of blocks of GUIDInfo GetTable query In-Reply-To: <86fyiflwks.fsf@mtl066.yok.mtl.com> References: <86fyiflwks.fsf@mtl066.yok.mtl.com> Message-ID: <1149771197.4510.323092.camel@hal.voltaire.com> Hi Eitan, On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > Hi Hal > > I'm working on passing osmtest check. Found a bug in the new > GUIDInfoRecord query: If you had a physical port with zero guid_cap > the code would loop on blocks 0..255 instead of trying the next port. OK; that's definitely a problem. > I am still looking for why we might have a guid_cap == 0 on some > ports. PortInfo:GuidCap is not used for switch external ports. > This patch resolves this new problem. osmtest passes on some arbitrary > networks. 
>
> Eitan
>
> Signed-off-by: Eitan Zahavi
>
> Index: opensm/osm_sa_guidinfo_record.c
> ===================================================================
> --- opensm/osm_sa_guidinfo_record.c	(revision 7703)
> +++ opensm/osm_sa_guidinfo_record.c	(working copy)
> @@ -255,6 +255,10 @@ __osm_sa_gir_create_gir(
>         continue;
>
>       p_pi = osm_physp_get_port_info_ptr( p_physp );
> +
> +     if ( p_pi->guid_cap == 0 )
> +       continue;
> +

I think the right fix is to detect switch external ports and use the GuidCap from port 0 rather than from the switch external port (unless that concept is broken, in which case it should return 0 records).

-- Hal

>       num_blocks = p_pi->guid_cap / 8;
>       if ( p_pi->guid_cap % 8 )
>         num_blocks++;
>

From jlentini at netapp.com Thu Jun 8 07:23:08 2006
From: jlentini at netapp.com (James Lentini)
Date: Thu, 8 Jun 2006 10:23:08 -0400 (EDT)
Subject: [openib-general] Re: [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS
In-Reply-To: <200606080942.48767.jackm@mellanox.co.il>
References: <200606071639.03787.jackm@mellanox.co.il> <200606080942.48767.jackm@mellanox.co.il>
Message-ID: 

On Thu, 8 Jun 2006, Jack Morgenstein wrote:

> On Wednesday 07 June 2006 18:26, James Lentini wrote:
> > On Wed, 7 Jun 2006, Jack Morgenstein wrote:
> > > This (bug fix) can still be included in next-week's release, if you
> > > think it is important (I have extracted it from the changes checked
> > > in at svn 7755)
> >
> > If you are going to make another release anyway, then I would include
> > it.
>
> Do you mean -- include the fix in next week's release -- or -- wait
> with the fix for the following release?

I'd include the fix in the next release, but I wouldn't create a special release just for this fix.
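An editorial aside on the num_blocks arithmetic quoted in the osm patch above: it is a ceiling division of guid_cap by the 8-GUID block size, and a guid_cap of 0 yields 0 blocks, which is why the patch simply skips such ports. A minimal standalone sketch (guid_blocks is a hypothetical helper name for illustration, not OpenSM code):

```c
#include <assert.h>

/* Ceiling division of guid_cap by the 8-GUID block size, mirroring the
 * num_blocks computation in osm_sa_guidinfo_record.c quoted above.
 * guid_blocks() is a hypothetical name; it is not part of OpenSM. */
static unsigned guid_blocks(unsigned guid_cap)
{
	unsigned num_blocks = guid_cap / 8;

	if (guid_cap % 8)
		num_blocks++;
	return num_blocks;
}
```

With guid_cap == 0 this returns 0 blocks, matching the patch's decision to skip such ports entirely.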
From rdreier at cisco.com Thu Jun 8 09:21:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 08 Jun 2006 09:21:58 -0700 Subject: [openib-general] OFED-1.0-rc6 is available In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7162@mtlexch01.mtl.com> (Tziporet Koren's message of "Thu, 8 Jun 2006 09:25:55 +0300") References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA7162@mtlexch01.mtl.com> Message-ID: Thanks... one further fix for Cisco gateways: sometimes the IsDM bit is set on switch ports as well, so ibsrpdm should not be limited to just CA ports. Here's the patch, also on the trunk as r7836. --- srptools/ChangeLog (revision 7803) +++ srptools/ChangeLog (working copy) @@ -1,3 +1,10 @@ +2006-06-08 Roland Dreier + + * src/srp-dm.c (get_port_list): In some setups (eg Cisco SFS 3001 + with an FC gateway), there will be switches with the IsDM bit set + on port 0. So the initial get of NodeRecords must retrieve all + records, not just CA ports. + 2006-06-07 Roland Dreier * src/srp-dm.c (do_port): Use correct endianness when comparing GUID against Topspin OUI. 
--- srptools/src/srp-dm.c (revision 7803) +++ srptools/src/srp-dm.c (working copy) @@ -523,11 +523,9 @@ static int get_port_list(int fd, uint32_ out_sa_mad->mgmt_class = SRP_MGMT_CLASS_SA; out_sa_mad->method = SRP_SA_METHOD_GET_TABLE; out_sa_mad->class_version = 2; - out_sa_mad->comp_mask = htonll(1ul << 4); /* node type */ + out_sa_mad->comp_mask = 0; /* Get all end ports */ out_sa_mad->rmpp_version = 1; out_sa_mad->rmpp_type = 1; - node = (void *) out_sa_mad->data; - node->type = 1; /* CA */ len = send_and_get(fd, &out_mad, in_mad, node_table_response_size); if (len < 0) From rdreier at cisco.com Thu Jun 8 09:28:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 08 Jun 2006 09:28:12 -0700 Subject: [openib-general] Re: [PATCH] ib_uverbs_get_context does not unlock file->mutex in error path In-Reply-To: <44886176.6C2D.007B.0@novell.com> (CH Ganapathi's message of "Thu, 08 Jun 2006 06:12:14 -0600") References: <44886176.6C2D.007B.0@novell.com> Message-ID: Good catch. Applied. From sean.hefty at intel.com Thu Jun 8 09:49:35 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 8 Jun 2006 09:49:35 -0700 Subject: [openib-general] RE: Failed multicast join withnew multicast module In-Reply-To: <1149764555.4510.320252.camel@hal.voltaire.com> Message-ID: >If this comment is directed at client reregister mechanism, you should >note that when this was brought up there was resistance to it based on >the recommendation (probably not a strong enough word for this) that SMs >be redundant in the subnet. There was a fair bit of anecdotal evidence >that this was not how they were being used at the time but it may have >been a chicken and egg problem. Even with redundant SMs, we wouldn't want them to reassign all of the LIDs in the subnet just because of failover. I don't think of MLIDs as being any different. Client reregister support is optional, so what if the node(s) that need to re-create the group doesn't support it? 
What if we started with something like the following compliance statement, and tried to add this to the spec?

   An SM, upon becoming the master, shall respect all existing communication in the fabric, where possible.

- Sean

From bpradip at in.ibm.com Thu Jun 8 10:42:03 2006
From: bpradip at in.ibm.com (Pradipta Kumar Banerjee)
Date: Thu, 08 Jun 2006 23:12:03 +0530
Subject: [openib-general] [ANNOUNCE] New iWARP Branch
In-Reply-To: 
References: 
Message-ID: <4488616B.7030701@in.ibm.com>

Sundeep Narravula wrote:
> Hi,
>
>> I don't see this problem at all. I am using kernel 2.6.16.16, SLES 9 glibc
>> version 2.3.3-98, gcc version 3.3.3 and AMSO1100 RNIC.
>
> The versions I used are glibc 2.3.4, kernel 2.6.16 and gcc 3.4.3 and
> AMSO1100 RNIC.
>
>> Will running it under gdb be of some help ?
>
> I am able to reproduce this error with/without gdb. The glibc error
> disappears with higher number of iterations.
>
> (gdb) r -c -vV -C10 -S10 -a 150.111.111.100 -p 9999

The problem is due to specifying an insufficient size (-S10, -S4) for the buffer. If you look at the following lines from the function rping_test_client in rping.c:

	for (ping = 0; !cb->count || ping < cb->count; ping++) {
		cb->state = RDMA_READ_ADV;

		/* Put some ascii text in the buffer. */
------>		cc = sprintf(cb->start_buf, "rdma-ping-%d: ", ping);

From the above it's clear that the minimum size for start_buf should be at least sufficient to hold the string, which in the invocations mentioned here (-S10 or -S4) is not the case. Hence you notice the glibc errors.
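As a sketch of the sizing argument above: rather than hand-counting digits, one can let snprintf measure the worst-case length of the ping string (rping_min_bufsize is a hypothetical helper, not part of rping.c; it assumes ping is a non-negative int as in the loop quoted above):

```c
#include <limits.h>
#include <stdio.h>
#include <assert.h>

/* Smallest buffer that can hold sprintf(buf, "rdma-ping-%d: ", ping)
 * for any non-negative int ping, including the terminating NUL.
 * snprintf with a NULL buffer and size 0 only measures; it writes
 * nothing.  Hypothetical helper name, not part of rping.c. */
static int rping_min_bufsize(void)
{
	return snprintf(NULL, 0, "rdma-ping-%d: ", INT_MAX) + 1;
}
```

For 32-bit int this yields 23 ("rdma-ping-" is 10 characters, INT_MAX is 10 digits, ": " is 2, plus the NUL), so -S10 and -S4 are indeed too small.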
cb->start_buf is allocated in rping_setup_buffers() as cb->start_buf = malloc(cb->size); Basically the check if ((cb->size < 1) || (cb->size > (RPING_BUFSIZE - 1))) { in the main() should be changed to something like this #define RPING_MIN_BUFSIZE sizeof(itoa(INT_MAX)) + sizeof("rdma-ping-%d: ") ---> 'ping' is defined as a signed int, its maximum permissible value is defined in limits.h (INT_MAX = 2147483647) We can even hardcode the RPING_MIN_BUFSIZE to '19' if desired/ if ((cb->size < RPING_MIN_BUFSIZE) || (cb->size > (RPING_BUFSIZE - 1))) { Steve what do you say ?? Thanks, Pradipta Kumar. > Starting program: /usr/local/bin/rping -c -vV -C10 -S10 -a 150.111.111.100 > -p 9999 > Reading symbols from shared object read from target memory...done. > Loaded system supplied DSO at 0xffffe000 > [Thread debugging using libthread_db enabled] > [New Thread -1208465728 (LWP 23960)] > libibverbs: Warning: no userspace device-specific driver found for uverbs1 > driver search path: /usr/local/lib/infiniband > libibverbs: Warning: no userspace device-specific driver found for uverbs0 > driver search path: /usr/local/lib/infiniband > [New Thread -1208468560 (LWP 23963)] > [New Thread -1216861264 (LWP 23964)] > ping data: rdma-ping > ping data: rdma-ping > ping data: rdma-ping > ping data: rdma-ping > ping data: rdma-ping > ping data: rdma-ping > ping data: rdma-ping > ping data: rdma-ping > ping data: rdma-ping > ping data: rdma-ping > cq completion failed status 5 > DISCONNECT EVENT... > *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** > > Program received signal SIGABRT, Aborted. > [Switching to Thread -1208465728 (LWP 23960)] > 0xffffe410 in __kernel_vsyscall () > (gdb) > > --Sundeep. > >> Thanks >> Pradipta Kumar. >>>> Thanx, >>>> >>>> >>>> Steve. >>>> >>>> >>>> On Mon, 2006-06-05 at 00:43 -0400, Sundeep Narravula wrote: >>>>> Hi Steve, >>>>> We are trying the new iwarp branch on ammasso adapters. The installation >>>>> has gone fine. 
However, on running rping there is a error during >>>>> disconnect phase. >>>>> >>>>> $ rping -c -vV -C4 -S4 -a 150.10.108.100 -p 9999 >>>>> libibverbs: Warning: no userspace device-specific driver found for uverbs1 >>>>> driver search path: /usr/local/lib/infiniband >>>>> libibverbs: Warning: no userspace device-specific driver found for uverbs0 >>>>> driver search path: /usr/local/lib/infiniband >>>>> ping data: rdm >>>>> ping data: rdm >>>>> ping data: rdm >>>>> ping data: rdm >>>>> cq completion failed status 5 >>>>> DISCONNECT EVENT... >>>>> *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** >>>>> Aborted >>>>> >>>>> There are no apparent errors showing up in dmesg. Is this error >>>>> currently expected? >>>>> >>>>> Thanks, >>>>> --Sundeep. >>>>> > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From mshefty at ichips.intel.com Thu Jun 8 11:07:03 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 08 Jun 2006 11:07:03 -0700 Subject: [openib-general] [PATCH 0/4] Add support for UD QPs In-Reply-To: References: Message-ID: <44886747.4040004@ichips.intel.com> > The following patch series adds support for UD QPs to userspace through the RDMA > CM. UD QPs are referenced by an IP address, UDP port number. The RDMA CM > abstracts SIDR for Infiniband clients. Roland, Do you see any issues with this patch series or the related userspace changes? There's a small change to uverbs, and new APIs added to libibverbs. 
- Sean From rdreier at cisco.com Thu Jun 8 11:12:53 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 08 Jun 2006 11:12:53 -0700 Subject: [openib-general] [PATCH 0/4] Add support for UD QPs In-Reply-To: <44886747.4040004@ichips.intel.com> (Sean Hefty's message of "Thu, 08 Jun 2006 11:07:03 -0700") References: <44886747.4040004@ichips.intel.com> Message-ID: Sean> Do you see any issues with this patch series or the related Sean> userspace changes? There's a small change to uverbs, and new Sean> APIs added to libibverbs. I haven't looked too carefully yet. What's the motivation? It seems strange to put an IB-only transport into the RDMA CM -- iWARP can't handle datagrams, can it? - R. From greg.lindahl at qlogic.com Thu Jun 8 11:18:09 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Thu, 8 Jun 2006 11:18:09 -0700 Subject: [openib-general] RE: Failed multicast join withnew multicast module In-Reply-To: References: <1149764555.4510.320252.camel@hal.voltaire.com> Message-ID: <20060608181809.GI1359@greglaptop.internal.keyresearch.com> On Thu, Jun 08, 2006 at 09:49:35AM -0700, Sean Hefty wrote: > What if we started with something like the following compliance statement, and > tried to add this to the spec? > > An SM, upon becoming the master, shall respect all existing communication in the > fabric, where possible. Isn't this a quality of implementation issue? It's hard to imagine a SM author not realizing this is a good thing to do. If it was in the standard, how would you test it for compliance? -- g From mshefty at ichips.intel.com Thu Jun 8 11:33:10 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 08 Jun 2006 11:33:10 -0700 Subject: [openib-general] [PATCH 0/4] Add support for UD QPs In-Reply-To: References: <44886747.4040004@ichips.intel.com> Message-ID: <44886D66.7000703@ichips.intel.com> Roland Dreier wrote: > I haven't looked too carefully yet. > > What's the motivation? 
> It seems strange to put an IB-only transport
> into the RDMA CM -- iWARP can't handle datagrams, can it?

This allows using the address translation to locate the remote service. The RDMA CM also provides an IP-based interface for IB. From a user's perspective, this extends the RDMA CM to include the UDP port space, in addition to TCP.

- Sean

From mshefty at ichips.intel.com Thu Jun 8 11:43:24 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 08 Jun 2006 11:43:24 -0700
Subject: [openib-general] RE: Failed multicast join withnew multicast module
In-Reply-To: <20060608181809.GI1359@greglaptop.internal.keyresearch.com>
References: <1149764555.4510.320252.camel@hal.voltaire.com> <20060608181809.GI1359@greglaptop.internal.keyresearch.com>
Message-ID: <44886FCC.3040108@ichips.intel.com>

Greg Lindahl wrote:
> Isn't this a quality of implementation issue? It's hard to imagine a
> SM author not realizing this is a good thing to do.

I don't know if any SM implementation actually does this today. I think that they all break all multicast groups.

> If it was in the standard, how would you test it for compliance?

Stopping / restarting the SM and seeing if any existing RC, UD, MCast communication breaks could be an easy first test.

- Sean

From mst at mellanox.co.il Thu Jun 8 11:57:52 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 8 Jun 2006 21:57:52 +0300
Subject: [openib-general] Re: RFC: ib_cache_event problems
In-Reply-To: 
References: 
Message-ID: <20060608185752.GA9039@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: RFC: ib_cache_event problems
>
> Michael> But ipoib_ib_dev_flush doesn't?
>
> Ah, that looks like the bug I guess. What's the situation? SM clears
> P_Key table and then later re-adds a P_Key?

Any ideas on how to fix this?

-- MST

From mst at mellanox.co.il Thu Jun 8 12:03:54 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Thu, 8 Jun 2006 22:03:54 +0300 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: References: Message-ID: <20060608190354.GB9039@mellanox.co.il> Quoting r. Sean Hefty : > What if we started with something like the following compliance statement, and > tried to add this to the spec? > > An SM, upon becoming the master, shall respect all existing communication in > the fabric, where possible. To me, "where possible" doesn't sound like an appropriate language for a compliance statement. Is there precedent for this in IB spec? -- MST From sean.hefty at intel.com Thu Jun 8 12:06:13 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 8 Jun 2006 12:06:13 -0700 Subject: [openib-general] RE: Failed multicast join withnew multicast module In-Reply-To: <20060608190354.GB9039@mellanox.co.il> Message-ID: >> An SM, upon becoming the master, shall respect all existing communication in >> the fabric, where possible. > >To me, "where possible" doesn't sound like an appropriate language for a >compliance statement. Is there precedent for this in IB spec? I was trying to express a concept, not formulate exact wording here... From rdreier at cisco.com Thu Jun 8 12:15:44 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 08 Jun 2006 12:15:44 -0700 Subject: [openib-general] Re: RFC: ib_cache_event problems In-Reply-To: <20060608185752.GA9039@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 8 Jun 2006 21:57:52 +0300") References: <20060608185752.GA9039@mellanox.co.il> Message-ID: > > Ah, that looks like the bug I guess. What's the situation? SM clears > > P_Key table and then later readds a P_Key? > Any ideas on how to fix this? Does it work to just start the pkey_task if ipoib_ib_dev_flush() wants for a P_Key that's not there? Or is it trickier? - R. From mst at mellanox.co.il Thu Jun 8 12:28:48 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin)
Date: Thu, 8 Jun 2006 22:28:48 +0300
Subject: [openib-general] Re: RFC: ib_cache_event problems
In-Reply-To: 
References: 
Message-ID: <20060608192848.GC9039@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: RFC: ib_cache_event problems
>
> > > Ah, that looks like the bug I guess. What's the situation? SM clears
> > > P_Key table and then later re-adds a P_Key?
> > Any ideas on how to fix this?
>
> Does it work to just start the pkey_task if ipoib_ib_dev_flush() wants
> for a P_Key that's not there? Or is it trickier?

If this works, why is dev_up playing with pkey_check_presence at all? Can we kill all of this then?

	ipoib_pkey_dev_check_presence(dev);

	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
		ipoib_dbg(priv, "PKEY is not assigned.\n");
		return 0;
	}

It seems we must avoid joining multicast groups while the key isn't assigned ...

-- MST

From bugzilla-daemon at openib.org Thu Jun 8 12:41:31 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Thu, 8 Jun 2006 12:41:31 -0700 (PDT)
Subject: [openib-general] [Bug 122] New: mad layer problem
Message-ID: <20060608194131.B3EEC2283E0@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=122

Summary: mad layer problem
Product: OpenFabrics Linux
Version: gen2
Platform: All
OS/Version: Other
Status: NEW
Severity: blocker
Priority: P2
Component: IB Core
AssignedTo: sean.hefty at intel.com
ReportedBy: eli at mellanox.co.il
CC: bugzilla at openib.org

We were running polygraph http://freshmeat.net/projects/polygraph/ over ipoib and at some time ipoib connectivity was lost. When looking at the state of the machines (two machines connected through a switch) I noticed that on one of the machines, I could not run any program that uses mads. Specifically I tried sminfo and then opensm; both got stuck.
I assume what happened is that at some point the kernel refreshed its arp cache, and since there was already a problem sending mads, the kernel could not resolve the address, so ipoib connectivity was lost.

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

From bugzilla-daemon at openib.org Thu Jun 8 12:44:35 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Thu, 8 Jun 2006 12:44:35 -0700 (PDT)
Subject: [openib-general] [Bug 122] mad layer problem
Message-ID: <20060608194435.7BDE82283E0@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=122

------- Comment #1 from rolandd at cisco.com 2006-06-08 12:44 -------
To debug this we probably need to know where sminfo and/or opensm were getting stuck. sysrq-T output for the stuck processes would probably be the most helpful.

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

From bugzilla-daemon at openib.org Thu Jun 8 12:51:57 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Thu, 8 Jun 2006 12:51:57 -0700 (PDT)
Subject: [openib-general] [Bug 122] mad layer problem
Message-ID: <20060608195157.7FF482283E0@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=122

------- Comment #2 from sean.hefty at intel.com 2006-06-08 12:51 -------
I'm not aware of any relationship between ARP and MADs. I'd like to verify that this is indeed a MAD layer issue, and not a problem in the user-to-kernel interface or the lower-level driver. After the hang, were any applications able to run? Did you try running any kernel tests, like grmpp or cmatose? Loading madeye after connectivity is lost could also be helpful. How easily is this reproduced?

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
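Before the mthca race thread that follows, a standalone illustration of the table aliasing it debates: the qp_table lookup masks the QPN with the table size, so two QPNs that differ only in their high bits land in the same slot; the check MST proposes compares the stored QPN against the CQE's QPN to reject a stale hit. NUM_QPS and both helper names below are hypothetical simplifications, not mthca code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical power-of-two table size; mthca uses dev->limits.num_qps. */
#define NUM_QPS (1u << 16)

/* Slot index: low bits of the QPN, as in the mthca_array_get() call
 * discussed in the thread. */
static unsigned qp_slot(uint32_t qpn)
{
	return qpn & (NUM_QPS - 1);
}

/* Stale-entry check along the lines of the proposed fix: a table hit is
 * trustworthy only if the stored QPN matches the CQE's QPN exactly. */
static int qp_matches(uint32_t stored_qpn, uint32_t cqe_qpn)
{
	return stored_qpn == cqe_qpn;
}
```

Two QPNs such as 0x1ABCD and 0x2ABCD share a slot under the mask but fail the exact comparison, which is the whole point of the proposed test.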
From mst at mellanox.co.il Thu Jun 8 13:26:35 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 8 Jun 2006 23:26:35 +0300
Subject: [openib-general] race in mthca_cq.c?
Message-ID: <20060608202635.GA9877@mellanox.co.il>

Roland, I think I see a race in mthca: let's assume that a QP is destroyed. We remove the qpn from qp_table.

Before we have the chance to clean up the CQ, another QP is created and put in the same slot in the table. If the user now polls the CQ he'll see a completion for a wrong QP, since poll CQ does:

	*cur_qp = mthca_array_get(&dev->qp_table.qp,
				  be32_to_cpu(cqe->my_qpn) &
				  (dev->limits.num_qps - 1));

Is this analysis right? If yes, I think we can fix this by testing (*cur_qp)->qpn == be32_to_cpu(cqe->my_qpn), does this make sense?

Same for userspace I guess?

It seems a similar issue exists for CQs, does it not? And I think it can be solved in a similar way, checking the CQN?

-- MST

From rdreier at cisco.com Thu Jun 8 13:43:24 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 08 Jun 2006 13:43:24 -0700
Subject: [openib-general] Re: race in mthca_cq.c?
In-Reply-To: <20060608202635.GA9877@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 8 Jun 2006 23:26:35 +0300")
References: <20060608202635.GA9877@mellanox.co.il>
Message-ID: 

> Roland, I think I see a race in mthca: let's assume that
> a QP is destroyed. We remove the qpn from qp_table.
>
> Before we have the chance to clean up the CQ, another QP is created
> and put in the same slot in the table. If the user now polls the CQ he'll see a
> completion for a wrong QP, since poll CQ does:
>
> *cur_qp = mthca_array_get(&dev->qp_table.qp,
> 			  be32_to_cpu(cqe->my_qpn) &
> 			  (dev->limits.num_qps - 1));
>
> Is this analysis right?

I don't think so. There's no way for another QP to be assigned the same number, since the mthca_free() to clear out the QPN bitmap doesn't happen until after the CQs are cleaned up.

> It seems a similar issue exists for CQs, does it not?
> And I think it can be solved in a similar way, checking the CQN?

I don't see anything there either. When destroying a CQ, mthca does HW2SW_CQ and synchronize_irq() before a new CQ could be created with the same number.

- R.

From halr at voltaire.com Thu Jun 8 13:39:03 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Jun 2006 16:39:03 -0400
Subject: [openib-general] RE: Failed multicast join withnew multicast module
In-Reply-To: 
References: 
Message-ID: <1149799142.4510.13468.camel@hal.voltaire.com>

On Thu, 2006-06-08 at 12:49, Sean Hefty wrote:
> >If this comment is directed at client reregister mechanism, you should
> >note that when this was brought up there was resistance to it based on
> >the recommendation (probably not a strong enough word for this) that SMs
> >be redundant in the subnet. There was a fair bit of anecdotal evidence
> >that this was not how they were being used at the time but it may have
> >been a chicken and egg problem.
>
> Even with redundant SMs, we wouldn't want them to reassign all of the LIDs in
> the subnet just because of failover. I don't think of MLIDs as being any
> different.

Do you mean without redundant SMs (rather than with)?

There are a couple of things about MLIDs that are different:

1. There are far fewer of them (not necessarily architecturally but in some implementations).

2. Lazy deletion of MC groups is allowed, so reclamation may be difficult.

This is not to say it can't be done, but there are some hurdles to clear.

> Client reregister support is optional, so what if the node(s) that
> need to re-create the group doesn't support it?

The endport SMAs are claiming they do support client reregistration, but it takes more than that for the endport/node to behave properly.

> What if we started with something like the following compliance statement, and
> tried to add this to the spec?
>
> An SM, upon becoming the master, shall respect all existing communication
> in the fabric, where possible.
At the 50K-foot level, I can see where you are coming from and think there is merit in this, but first, I'm not sure I know how to define this, and second, I'm not sure it is achievable, though that could wait until we see whether some definition can be agreed on. I know it is a conceptual rather than actual compliance. One issue would be defining what it means to respect all existing communication. Then we would need to look at whether that was feasible or not and perhaps rescope what it means to a set of achievable things. Another issue would be defining where it is possible or not. If that is totally vendor dependent, then this would have no substance to it. It is largely a matter of being a "better" SM.

-- Hal

> - Sean

From mst at mellanox.co.il Thu Jun 8 13:48:26 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 8 Jun 2006 23:48:26 +0300
Subject: [openib-general] Re: race in mthca_cq.c?
In-Reply-To: 
References: 
Message-ID: <20060608204826.GB9957@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: race in mthca_cq.c?
>
> > Roland, I think I see a race in mthca: let's assume that
> > a QP is destroyed. We remove the qpn from qp_table.
> >
> > Before we have the chance to clean up the CQ, another QP is created
> > and put in the same slot in the table. If the user now polls the CQ he'll see a
> > completion for a wrong QP, since poll CQ does:
> >
> > *cur_qp = mthca_array_get(&dev->qp_table.qp,
> > 			    be32_to_cpu(cqe->my_qpn) &
> > 			    (dev->limits.num_qps - 1));
> >
> > Is this analysis right?
>
> I don't think so. There's no way for another QP to be assigned the
> same number, since the mthca_free() to clear out the QPN bitmap
> doesn't happen until after the CQs are cleaned up.

Not in the driver I have: mthca_array_clear is at line 1351, mthca_cq_clean at line 1372. Isn't mthca_array_clear freeing the slot in the QP table?

> > It seems a similar issue exists for CQs, does it not?
> > I don't see anything there either. When destroying a CQ, mthca does > HW2SW_CQ and synchronize_irq() before a new CQ could be created with > the same number. But there might be more EQEs for this CQN outstanding in the EQ which we have not seen yet. -- MST From sweitzen at cisco.com Thu Jun 8 13:59:33 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 8 Jun 2006 13:59:33 -0700 Subject: [openib-general] Compilation issues on rhel4 u3 ppc64 sysfs.o Message-ID: This is working for us on RHEL4 U3, thanks! Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: Vladimir Sokolovsky [mailto:vlad at mellanox.co.il] > Sent: Thursday, May 25, 2006 2:49 AM > To: Scott Weitzenkamp (sweitzen) > Cc: Paul; openib-general at openib.org > Subject: Re: [openib-general] Compilation issues on rhel4 u3 > ppc64 sysfs.o > > In OFED-1.0-rc5 all binaries and libraries will be compiled on *ppc64 > *with *-m64* flag. > This requires sysfsutils and sysfsutils-devel 64-bit RPM to > be installed > (in order to build libibverbs). > Also pciutils and pciutils-devel 64-bit required for tvflash package. > > libsdp will be built both 32 and 64 bit libraries. > > Note: in order to build sysfsutils 64-bit RPM run: > CC="gcc -m64" rpmbuild --rebuild > sysfsutils-1.3.0-1.2.1.src.rpm > (This was tested on Fedora C4 PPC64) > > Regards, > Vladimir > > Scott Weitzenkamp (sweitzen) wrote: > > I know Vlad made some changes for rc5 in this area, at least for > > libsdp, not sure if other libs got changed as well. 
> > > > Scott Weitzenkamp > > SQA and Release Manager > > Server Virtualization Business Unit > > Cisco Systems > > > > > > > -------------------------------------------------------------- > ---------- > > *From:* Paul [mailto:paul.lundin at gmail.com] > > *Sent:* Wednesday, May 24, 2006 11:00 AM > > *To:* Scott Weitzenkamp (sweitzen) > > *Cc:* openib-general at openib.org > > *Subject:* Re: [openib-general] Compilation issues on rhel4 u3 > > ppc64 sysfs.o > > > > Scott, > > Upon further inspection the build.sh and > install.sh scripts > > built 32bit libraries and binaries. If I export CFLAGS (and the > > like) to include -m64 then the build dies while looking for a > > 64bit libsysfs. rhel4 u3 does not include a ppc64 > sysfsutils, nor > > have I been able to find an actual 64bit version of it. > Is there a > > workaround for getting things to build actual ppc64 > > binaries/libraries ? > > > > The actual error is: > > checking for dlsym in -ldl... yes > > checking for pthread_mutex_init in -lpthread... yes > > checking for sysfs_open_class in -lsysfs... no > > configure: error: sysfs_open_class() not found. libibverbs > > requires libsysfs. > > > From mst at mellanox.co.il Thu Jun 8 14:11:33 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 9 Jun 2006 00:11:33 +0300 Subject: [openib-general] Re: race in mthca_cq.c? In-Reply-To: <20060608202635.GA9877@mellanox.co.il> References: <20060608202635.GA9877@mellanox.co.il> Message-ID: <20060608211133.GA10263@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: race in mthca_cq.c? > > Roland, I think I see a race in mthca: let's assume that > a QP is destroyed. We remove the qpn from qp_table. > > Before we have the chance to cleanup the CQ, another QP is created > and put in the same slot in table. 
> If the user now polls the CQ he'll see a
> completion for a wrong QP, since poll CQ does:
>
> *cur_qp = mthca_array_get(&dev->qp_table.qp,
> 			  be32_to_cpu(cqe->my_qpn) &
> 			  (dev->limits.num_qps - 1));
>
> Is this analysis right?
> If yes, I think we can fix this by testing (*cur_qp)->qpn ==
> be32_to_cpu(cqe->my_qpn), does this make sense?
>
> Same for userspace I guess?
>
> It seems a similar issue exists for CQs, does it not?
> And I think it can be solved in a similar way, checking the CQN?

The following seems to work. How does it look?

---

Make sure a completion/completion event is not for a stale QP/CQ before reporting it to the user.

Signed-off-by: Michael S. Tsirkin

--- openib/drivers/infiniband/hw/mthca/mthca_cq.c	2006-05-09 21:07:28.623383000 +0300
+++ /mswg/work/mst/tmp/infiniband1/hw/mthca/mthca_cq.c	2006-06-08 23:46:52.404499000 +0300
@@ -217,9 +217,9 @@ void mthca_cq_completion(struct mthca_de
 {
 	struct mthca_cq *cq;
 
 	cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1));
 
-	if (!cq) {
+	if (!cq || cq->cqn != cqn) {
 		mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn);
 		return;
 	}
@@ -513,10 +515,10 @@ static inline int mthca_poll_one(struct
 	 * because CQs will be locked while QPs are removed
 	 * from the table.
 	 */
 	*cur_qp = mthca_array_get(&dev->qp_table.qp,
 				  be32_to_cpu(cqe->my_qpn) &
 				  (dev->limits.num_qps - 1));
 
-	if (!*cur_qp) {
+	if (!*cur_qp || (*cur_qp)->qpn != be32_to_cpu(cqe->my_qpn)) {
 		mthca_warn(dev, "CQ entry for unknown QP %06x\n",
 			   be32_to_cpu(cqe->my_qpn) & 0xffffff);
 		err = -EINVAL;

-- MST

From rdreier at cisco.com Thu Jun 8 14:19:46 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 08 Jun 2006 14:19:46 -0700
Subject: [openib-general] Re: race in mthca_cq.c?
In-Reply-To: <20060608211133.GA10263@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 9 Jun 2006 00:11:33 +0300")
References: <20060608202635.GA9877@mellanox.co.il> <20060608211133.GA10263@mellanox.co.il>
Message-ID: 
How does it look? I don't think it's needed, and anyway I don't see how it fixes things. The problem only happens when the new CQ or QP has the same number as an old CQ/QP, so the test of cq->cqn == cqn might still pass even if the cq has changed (there's no guarantee the upper bits won't repeat -- or someone could be using 24 bits for index) - R. From rdreier at cisco.com Thu Jun 8 14:23:22 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 08 Jun 2006 14:23:22 -0700 Subject: [openib-general] Re: race in mthca_cq.c? In-Reply-To: <20060608204826.GB9957@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 8 Jun 2006 23:48:26 +0300") References: <20060608204826.GB9957@mellanox.co.il> Message-ID: Michael> Not in the driver I have: mthca_array_clear is at line Michael> 1351, mthca_cq_clean at line 1372. Isn't Michael> mthca_array_clear freeing the slot in QP table? Nope, the bitmap slot isn't freed until mthca_free(). Michael> But there might be more EQEs for this CQN outstanding in Michael> the EQ which we have not seen yet. Now that you mention it, that could be a real problem I guess. synchronize_irq() isn't enough because the interrupt handler might not have even started yet. But on the other hand a CQ can't be destroyed until after all associated QPs have been destroyed. So could we really miss EQEs for that long? - R. From mshefty at ichips.intel.com Thu Jun 8 15:00:56 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 08 Jun 2006 15:00:56 -0700 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <1149799142.4510.13468.camel@hal.voltaire.com> References: <1149799142.4510.13468.camel@hal.voltaire.com> Message-ID: <44889E18.8010507@ichips.intel.com> Hal Rosenstock wrote: > 2. There is lazy deletion of MC groups allowed so the reclamation may be > difficult. I'm not familiar with the switch programming. 
Does the SM set the entire MulticastForwardingTable for a switch every time a new group is created, or a new member joins? If the SM loses track of all multicast groups, how are the stale groups on the switches deleted? > The endport SMAs are claiming they do support client reregistration but > it does take more than that for the endport/node to behave properly. My original plan was to have the ib_multicast module rejoin all groups, but since the MLIDs can change I can't see any way to handle reregistration safely without involving the application. My latest changes are just to report errors on existing multicast groups on an affected port. > I know it is a conceptual rather than actual compliance. One issue would > be defining what it means to respect all existing communication. Then we > would need to look at whether that was feasible or not and perhaps > rescope what it means to a set of things achievable. Another issue would > be defining where it is possible or not. If that is totally vendor > dependent, then this would have no substance to it. It is largely a > matter of being a "better" SM. We could use the phrase, "except where such communication is no longer realizable" instead of "where possible". Where unrealizable means impossible because the communication uses properties that are physically impossible to achieve given the hardware configuration of the subnet. (See bottom of page 910 of the spec.) If an SM could just query switches for their MulticastForwardingTables or the end nodes, would we be able to avoid these issues? - Sean From mst at mellanox.co.il Thu Jun 8 15:06:49 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 9 Jun 2006 01:06:49 +0300 Subject: [openib-general] Re: race in mthca_cq.c? In-Reply-To: References: Message-ID: <20060608220649.GB10263@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: race in mthca_cq.c?
> > Michael> Not in the driver I have: mthca_array_clear is at line > Michael> 1351, mthca_cq_clean at line 1372. Isn't > Michael> mthca_array_clear freeing the slot in QP table? > > Nope, the bitmap slot isn't freed until mthca_free(). Oh. Right. I see it now. > Michael> But there might be more EQEs for this CQN outstanding in > Michael> the EQ which we have not seen yet. > > Now that you mention it, that could be a real problem I guess. > synchronize_irq() isn't enough because the interrupt handler might not > have even started yet. > > But on the other hand a CQ can't be destroyed until after all > associated QPs have been destroyed. So could we really miss EQEs for > that long? Yes, I think there might be spurious EQEs and they might get delayed in HW for a long time. Destroying QPs does not flush completion events out. So just this bit? -- Check EQE is not for a stale CQ number. Since high bits in CQ number are allocated by round-robin, we can be reasonably sure CQ number is different even for CQs which share a slot in the CQ table. Signed-off-by: Michael S.
Tsirkin --- openib/drivers/infiniband/hw/mthca/mthca_cq.c 2006-05-09 21:07:28.623383000 +0300 +++ /mswg/work/mst/tmp/infiniband1/hw/mthca/mthca_cq.c 2006-06-08 23:46:52.404499000 +0300 @@ -217,9 +217,9 @@ void mthca_cq_completion(struct mthca_de { struct mthca_cq *cq; cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); - if (!cq) { + if (!cq || cq->cqn != cqn) { mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn); return; } -- MST From tom at opengridcomputing.com Thu Jun 8 15:27:55 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 08 Jun 2006 17:27:55 -0500 Subject: [openib-general] [ANNOUNCE] New iWARP Branch In-Reply-To: <4488616B.7030701@in.ibm.com> References: <4488616B.7030701@in.ibm.com> Message-ID: <1149805675.12361.5.camel@trinity.ogc.int> Steve is fishing in the Florida Keys right now (or will be by morning), but if he were here, I think he would say -- "...sounds like you've found an rping bug, please post a patch" ;-) I would prefer the #define you proposed, e.g. #define RPING_MSG_FMT "rdma-ping-%d" #define RPING_MIN_BUFSIZ sizeof(itoa(INT_MAX))+sizeof(RPING_MSG_FMT) Then use the RPING_MSG_FMT symbol in the code that prepares the contents of the message. then if someone decides to change the string, the error checking still works. > Tom On Thu, 2006-06-08 at 23:12 +0530, Pradipta Kumar Banerjee wrote: > Sundeep Narravula wrote: > > Hi, > > > >> I don't see this problem at all. I am using kernel 2.6.16.16, SLES 9 glibc > >> version 2.3.3-98, gcc version 3.3.3 and AMSO1100 RNIC. > > > > The versions I used are glibc 2.3.4, kernel 2.6.16 and gcc 3.4.3 and > > AMSO1100 RNIC. > > > >> Will running it under gdb be of some help ? > > > > I am able to reproduce this error with/without gdb. The glibc error > > disappears with higher number of iterations. > > > > (gdb) r -c -vV -C10 -S10 -a 150.111.111.100 -p 9999 > > The problem is due to specifying a less than sufficient size (-S10, -S4) for the > buffer. 
If you look into the following lines from the function rping_test_client > in rping.c > > for (ping = 0; !cb->count || ping < cb->count; ping++) { > cb->state = RDMA_READ_ADV; > > /* Put some ascii text in the buffer. */ > ------> cc = sprintf(cb->start_buf, "rdma-ping-%d: ", ping); > > From the above it's clear that the minimum size for start_buf should be at least > sufficient to hold the string, which in the invocations mentioned here (-S10 or > -S4) is not the case. Hence you notice the glibc errors. > > > cb->start_buf is allocated in rping_setup_buffers() as > cb->start_buf = malloc(cb->size); > > Basically the check > > if ((cb->size < 1) || > (cb->size > (RPING_BUFSIZE - 1))) { > > in the main() should be changed to something like this > > #define RPING_MIN_BUFSIZE sizeof(itoa(INT_MAX)) + sizeof("rdma-ping-%d: ") > > ---> 'ping' is defined as a signed int, its maximum permissible value is defined > in limits.h (INT_MAX = 2147483647) > We can even hardcode the RPING_MIN_BUFSIZE to '19' if desired. > > if ((cb->size < RPING_MIN_BUFSIZE) || > (cb->size > (RPING_BUFSIZE - 1))) { > > Steve what do you say ?? > > > Thanks, > Pradipta Kumar. > > > > Starting program: /usr/local/bin/rping -c -vV -C10 -S10 -a 150.111.111.100 > > -p 9999 > > Reading symbols from shared object read from target memory...done.
> > Loaded system supplied DSO at 0xffffe000 > > [Thread debugging using libthread_db enabled] > > [New Thread -1208465728 (LWP 23960)] > > libibverbs: Warning: no userspace device-specific driver found for uverbs1 > > driver search path: /usr/local/lib/infiniband > > libibverbs: Warning: no userspace device-specific driver found for uverbs0 > > driver search path: /usr/local/lib/infiniband > > [New Thread -1208468560 (LWP 23963)] > > [New Thread -1216861264 (LWP 23964)] > > ping data: rdma-ping > > ping data: rdma-ping > > ping data: rdma-ping > > ping data: rdma-ping > > ping data: rdma-ping > > ping data: rdma-ping > > ping data: rdma-ping > > ping data: rdma-ping > > ping data: rdma-ping > > ping data: rdma-ping > > cq completion failed status 5 > > DISCONNECT EVENT... > > *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** > > > > Program received signal SIGABRT, Aborted. > > [Switching to Thread -1208465728 (LWP 23960)] > > 0xffffe410 in __kernel_vsyscall () > > (gdb) > > > > --Sundeep. > > > >> Thanks > >> Pradipta Kumar. > >>>> Thanx, > >>>> > >>>> > >>>> Steve. > >>>> > >>>> > >>>> On Mon, 2006-06-05 at 00:43 -0400, Sundeep Narravula wrote: > >>>>> Hi Steve, > >>>>> We are trying the new iwarp branch on ammasso adapters. The installation > >>>>> has gone fine. However, on running rping there is a error during > >>>>> disconnect phase. > >>>>> > >>>>> $ rping -c -vV -C4 -S4 -a 150.10.108.100 -p 9999 > >>>>> libibverbs: Warning: no userspace device-specific driver found for uverbs1 > >>>>> driver search path: /usr/local/lib/infiniband > >>>>> libibverbs: Warning: no userspace device-specific driver found for uverbs0 > >>>>> driver search path: /usr/local/lib/infiniband > >>>>> ping data: rdm > >>>>> ping data: rdm > >>>>> ping data: rdm > >>>>> ping data: rdm > >>>>> cq completion failed status 5 > >>>>> DISCONNECT EVENT... 
> >>>>> *** glibc detected *** free(): invalid next size (fast): 0x0804ea80 *** > >>>>> Aborted > >>>>> > >>>>> There are no apparent errors showing up in dmesg. Is this error > >>>>> currently expected? > >>>>> > >>>>> Thanks, > >>>>> --Sundeep. > >>>>> > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Thu Jun 8 15:47:53 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 9 Jun 2006 01:47:53 +0300 Subject: [openib-general] Re: race in mthca_cq.c? In-Reply-To: References: Message-ID: <20060608224753.GC10263@mellanox.co.il> Quoting r. Roland Dreier : > there's no guarantee the upper bits won't > repeat -- or someone could be using 24 bits for index So we need something like mthca_clean_eq? -- MST From rdreier at cisco.com Thu Jun 8 16:01:47 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 08 Jun 2006 16:01:47 -0700 Subject: [openib-general] Re: race in mthca_cq.c? In-Reply-To: <20060608224753.GC10263@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 9 Jun 2006 01:47:53 +0300") References: <20060608224753.GC10263@mellanox.co.il> Message-ID: Roland> there's no guarantee the upper bits won't repeat -- or Roland> someone could be using 24 bits for index Michael> So we need something like mthca_clean_eq? That's one obvious way to handle it. We could also keep a list of freed CQNs and make sure we don't reuse the CQNs until their associated EQ has been drained once. Or just call the handler for that EQ an extra time after freeing the CQ.
But I guess that would lead to tricky races against the regular interrupt handler. - R. From rdreier at cisco.com Thu Jun 8 16:09:18 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 08 Jun 2006 16:09:18 -0700 Subject: [openib-general] Re: race in mthca_cq.c? In-Reply-To: (Roland Dreier's message of "Thu, 08 Jun 2006 16:01:47 -0700") References: <20060608224753.GC10263@mellanox.co.il> Message-ID: Michael> So we need something like mthca_clean_eq? Roland> That's one obvious way to handle it. Actually that looks very hard without adding locks to the interrupt handling fast path. - R. From sweitzen at cisco.com Thu Jun 8 16:38:12 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 8 Jun 2006 16:38:12 -0700 Subject: [openib-general] OFED-1.0-rc6 is available Message-ID: The MTU change undoes the changes for bug 81, so I have reopened bug 81 ( http://openib.org/bugzilla/show_bug.cgi?id=81). With rc6, PCI-X osu_bw and osu_bibw performance is bad, and PCI-E osu_bibw performance is bad. I've enclosed some performance data, look at rc4 vs rc5 vs rc6 for Cougar/Cheetah/LionMini. Are there other benchmarks driving the changes in rc6 (and rc4)? Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems OSU MPI: * Added mpi_alltoall fine tuning parameters * Added default configuration/documentation file $MPIHOME/etc/mvapich.conf * Added shell configuration files $MPIHOME/etc/mvapich.csh , $MPIHOME/etc/mvapich.csh * Default MTU was changed back to 2K for InfiniHost III Ex and InfiniHost III Lx HCAs. For InfiniHost card recommended value is: VIADEV_DEFAULT_MTU=MTU1024 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: mpi_perf.xls Type: application/vnd.ms-excel Size: 33280 bytes Desc: mpi_perf.xls URL: From sean.hefty at intel.com Thu Jun 8 21:38:07 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 8 Jun 2006 21:38:07 -0700 Subject: [openib-general] [PATCH 1/2] multicast: notify users on membership errors Message-ID: Modify ib_multicast module to detect events that require clients to rejoin multicast groups. Add tracking of clients which are members of any groups, and provide notification to those clients when such an event occurs. This patch tracks all active members of a group. When an event occurs that requires clients to rejoin a multicast group, the active members are moved into an error state, and the clients are notified of a network reset error. The group is then reset to force additional join requests to generate requests to the SA. Signed-off-by: Sean Hefty --- Hal, can you apply these patches and see if it fixes the issues that you are experiencing. These should eliminate any races with ipoib leaving, then quickly re-joining a group as a result of an event. 
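The error flow the patch description outlines can be sketched in plain C. This is a simplified userspace illustration with made-up names (`member`, `process_group_error`, the `-102` stand-in for `-ENETRESET`), not the kernel code: each active member is moved into an error state and notified with a reset status, and a nonzero callback return tells the core to free that membership, mirroring the convention the patch relies on.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel types: each member
 * of a multicast group has a state and a callback; the callback's
 * return value tells the core whether to free the membership (nonzero)
 * or keep it (zero). */

enum member_state { MCAST_MEMBER = 0, MCAST_ERROR = 1 };

struct member {
    enum member_state state;
    int (*callback)(int status, struct member *m);
    int freed;
};

#define ENETRESET_STATUS (-102)  /* stand-in for -ENETRESET */

/* Analog of process_group_error(): move every active member into the
 * error state, deliver the reset status, and free memberships whose
 * callback asks for it.  Locking is omitted in this sketch. */
void process_group_error(struct member *members, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        members[i].state = MCAST_ERROR;
        if (members[i].callback(ENETRESET_STATUS, &members[i]))
            members[i].freed = 1;
    }
}

/* An ipoib-style client: it traps port events itself, so it ignores
 * the per-group reset and keeps its membership. */
int keep_on_reset(int status, struct member *m)
{
    (void)m;
    return status == ENETRESET_STATUS ? 0 : 1;
}

/* A client that simply gives up when the group errors out. */
int drop_on_reset(int status, struct member *m)
{
    (void)status; (void)m;
    return 1;
}
```

Running `process_group_error` over a group containing one client of each kind leaves both in the error state, but only the second membership is freed.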
Index: multicast.c =================================================================== --- multicast.c (revision 7805) +++ multicast.c (working copy) @@ -61,6 +61,7 @@ static struct ib_client mcast_client = { .remove = mcast_remove_one }; +static struct ib_event_handler event_handler; static struct workqueue_struct *mcast_wq; struct mcast_device; @@ -86,6 +87,7 @@ enum mcast_state { MCAST_JOINING, MCAST_MEMBER, MCAST_BUSY, + MCAST_ERROR }; struct mcast_member; @@ -97,6 +99,7 @@ struct mcast_group { spinlock_t lock; struct work_struct work; struct list_head pending_list; + struct list_head active_list; struct mcast_member *last_join; int members[3]; atomic_t refcount; @@ -338,6 +341,8 @@ static void join_group(struct mcast_grou group->rec.join_state |= join_state; member->multicast.rec = group->rec; member->multicast.rec.join_state = join_state; + list_del(&member->list); + list_add(&member->list, &group->active_list); } static int fail_join(struct mcast_group *group, struct mcast_member *member, @@ -349,6 +354,34 @@ static int fail_join(struct mcast_group return member->multicast.callback(status, &member->multicast); } +static void process_group_error(struct mcast_group *group) +{ + struct mcast_member *member; + int ret; + + spin_lock_irq(&group->lock); + while (!list_empty(&group->active_list)) { + member = list_entry(group->active_list.next, + struct mcast_member, list); + atomic_inc(&member->refcount); + list_del_init(&member->list); + adjust_membership(group, member->multicast.rec.join_state, -1); + member->state = MCAST_ERROR; + spin_unlock_irq(&group->lock); + + ret = member->multicast.callback(-ENETRESET, + &member->multicast); + deref_member(member); + if (ret) + ib_free_multicast(&member->multicast); + spin_lock_irq(&group->lock); + } + + group->rec.join_state = 0; + group->state = MCAST_BUSY; + spin_unlock_irq(&group->lock); +} + static void mcast_work_handler(void *data) { struct mcast_group *group = data; @@ -359,6 +392,12 @@ static void 
mcast_work_handler(void *dat retest: spin_lock_irq(&group->lock); + if (group->state == MCAST_ERROR) { + spin_unlock_irq(&group->lock); + process_group_error(group); + goto retest; + } + while (!list_empty(&group->pending_list)) { member = list_entry(group->pending_list.next, struct mcast_member, list); @@ -371,8 +410,8 @@ retest: multicast->comp_mask); if (!status) join_group(group, member, join_state); - - list_del_init(&member->list); + else + list_del_init(&member->list); spin_unlock_irq(&group->lock); ret = multicast->callback(status, multicast); } else { @@ -467,6 +506,7 @@ static struct mcast_group *acquire_group group->port = port; group->rec.mgid = *mgid; INIT_LIST_HEAD(&group->pending_list); + INIT_LIST_HEAD(&group->active_list); INIT_WORK(&group->work, mcast_work_handler, group); spin_lock_init(&group->lock); @@ -551,16 +591,10 @@ void ib_free_multicast(struct ib_multica group = member->group; spin_lock_irq(&group->lock); - switch (member->state) { - case MCAST_MEMBER: + if (member->state == MCAST_MEMBER) adjust_membership(group, multicast->rec.join_state, -1); - break; - case MCAST_JOINING: - list_del_init(&member->list); - break; - default: - break; - } + + list_del_init(&member->list); if (group->state == MCAST_IDLE) { group->state = MCAST_BUSY; @@ -578,6 +612,48 @@ void ib_free_multicast(struct ib_multica } EXPORT_SYMBOL(ib_free_multicast); +static void mcast_groups_lost(struct mcast_port *port) +{ + struct mcast_group *group; + struct rb_node *node; + unsigned long flags; + + spin_lock_irqsave(&port->lock, flags); + for (node = rb_first(&port->table); node; node = rb_next(node)) { + group = rb_entry(node, struct mcast_group, node); + spin_lock(&group->lock); + if (group->state == MCAST_IDLE) { + atomic_inc(&group->refcount); + queue_work(mcast_wq, &group->work); + } + group->state = MCAST_ERROR; + spin_unlock(&group->lock); + } + spin_unlock_irqrestore(&port->lock, flags); +} + +static void mcast_event_handler(struct ib_event_handler *handler, + 
struct ib_event *event) +{ + struct mcast_device *dev; + + dev = ib_get_client_data(event->device, &mcast_client); + if (!dev) + return; + + switch (event->event) { + case IB_EVENT_PORT_ERR: + case IB_EVENT_LID_CHANGE: + case IB_EVENT_SM_CHANGE: + case IB_EVENT_CLIENT_REREGISTER: + mcast_groups_lost(&dev->port[event->element.port_num - + dev->start_port]); + break; + default: + break; + } +} + static void mcast_add_one(struct ib_device *device) { struct mcast_device *dev; @@ -611,6 +687,9 @@ static void mcast_add_one(struct ib_devi dev->device = device; ib_set_client_data(device, &mcast_client, dev); + + INIT_IB_EVENT_HANDLER(&event_handler, device, mcast_event_handler); + ib_register_event_handler(&event_handler); } static void mcast_remove_one(struct ib_device *device) @@ -623,6 +702,7 @@ static void mcast_remove_one(struct ib_d if (!dev) return; + ib_unregister_event_handler(&event_handler); flush_workqueue(mcast_wq); for (i = 0; i < dev->end_port - dev->start_port; i++) { From sean.hefty at intel.com Thu Jun 8 21:54:59 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 8 Jun 2006 21:54:59 -0700 Subject: [openib-general] [PATCH 2/2] ipoib: handle multicast group reset notification In-Reply-To: Message-ID: Ipoib already checks for events that require rejoining multicast groups. We just need to add code to handle (i.e. ignore) multicast group reset notifications. Signed-off-by: Sean Hefty --- Ignoring the callback is a simple fix. I didn't try to see what it would take to have ipoib use the ib_multicast event to trigger a re-join. My guess is that it would be less efficient, since ipoib would get a callback for every group on the affected port. 
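The client-side convention described above can be condensed into one small function. This is an illustrative sketch (the name `join_complete` and the `-102` stand-in for `-ENETRESET` are assumptions, not the actual ipoib code): a client that already re-joins on port events ignores the reset notification by returning its previous state unchanged, while any other status is treated as the real join result.

```c
/* Returns the new "joined" flag given the previous one and a
 * completion status, mirroring the early return the patch adds:
 * the reset notification is a no-op because the client's own port
 * event handler drives the re-join. */

#define SKETCH_ENETRESET (-102)   /* stand-in for -ENETRESET */

int join_complete(int prev_joined, int status)
{
    if (status == SKETCH_ENETRESET)
        return prev_joined;   /* we trap port events ourselves */
    return status == 0;       /* 0 = join finished, else join failed */
}
```

So a successful join sets the flag, a reset leaves it alone, and a genuine error clears it.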
Index: ipoib_multicast.c =================================================================== --- ipoib_multicast.c (revision 7758) +++ ipoib_multicast.c (working copy) @@ -306,6 +306,10 @@ ipoib_mcast_sendonly_join_complete(int s struct net_device *dev = mcast->dev; struct ipoib_dev_priv *priv = netdev_priv(dev); + /* We trap for port events ourselves. */ + if (status == -ENETRESET) + return 0; + if (!status) status = ipoib_mcast_join_finish(mcast, &multicast->rec); @@ -390,6 +394,10 @@ static int ipoib_mcast_join_complete(int " (status %d)\n", IPOIB_GID_ARG(mcast->mcmember.mgid), status); + /* We trap for port events ourselves. */ + if (status == -ENETRESET) + return 0; + if (!status) status = ipoib_mcast_join_finish(mcast, &multicast->rec); From mst at mellanox.co.il Thu Jun 8 22:09:10 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 9 Jun 2006 08:09:10 +0300 Subject: [openib-general] Re: [PATCH 1/2] multicast: notify users on membership errors In-Reply-To: References: Message-ID: <20060609050910.GD10263@mellanox.co.il> Quoting r. Sean Hefty : > These should eliminate any races with ipoib leaving, > then quickly re-joining a group as a result of an event. Is there a chance this will fix the crashes Or and I were seeing?
-- MST From bpradip at in.ibm.com Thu Jun 8 23:01:39 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Fri, 9 Jun 2006 11:31:39 +0530 Subject: [openib-general] [PATCH] rping: Erroneous check for minimum ping buffer size Message-ID: <20060609060138.GB13602@harry-potter.in.ibm.com> rping didn't check correctly for the minimum size of the ping buffer, resulting in the following error from glibc "*** glibc detected *** free(): invalid next size (fast)" Signed-off-by: Pradipta Kumar Banerjee --- Index: rping.c ============================================================= --- rping.org 2006-06-09 10:57:43.000000000 +0530 +++ rping.c 2006-06-09 11:00:28.000000000 +0530 @@ -96,6 +96,12 @@ struct rping_rdma_info { #define RPING_BUFSIZE 64*1024 #define RPING_SQ_DEPTH 16 +/* Default string for print data and + * minimum buffer size + */ +#define RPING_MSG_FMT "rdma-ping-%d: " +#define RPING_MIN_BUFSIZE sizeof(itoa(INT_MAX))+sizeof(RPING_MSG_FMT) + /* * Control block struct. */ @@ -774,7 +780,7 @@ static void rping_test_client(struct rpi cb->state = RDMA_READ_ADV; /* Put some ascii text in the buffer.
*/ - cc = sprintf(cb->start_buf, "rdma-ping-%d: ", ping); + cc = sprintf(cb->start_buf, RPING_MSG_FMT, ping); for (i = cc, c = start; i < cb->size; i++) { cb->start_buf[i] = c; c++; @@ -977,11 +983,11 @@ int main(int argc, char *argv[]) break; case 'S': cb->size = atoi(optarg); - if ((cb->size < 1) || + if ((cb->size < RPING_MIN_BUFSIZE) || (cb->size > (RPING_BUFSIZE - 1))) { fprintf(stderr, "Invalid size %d " - "(valid range is 1 to %d)\n", - cb->size, RPING_BUFSIZE); + "(valid range is %d to %d)\n", + cb->size, RPING_MIN_BUFSIZE, RPING_BUFSIZE); ret = EINVAL; } else DEBUG_LOG("size %d\n", (int) atoi(optarg)); From bpradip at in.ibm.com Fri Jun 9 00:14:28 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Fri, 09 Jun 2006 12:44:28 +0530 Subject: [openib-general] [PATCH] rping: Erroneous check for minimum ping buffer size In-Reply-To: <20060609060138.GB13602@harry-potter.in.ibm.com> References: <20060609060138.GB13602@harry-potter.in.ibm.com> Message-ID: <44891FD4.30104@in.ibm.com> Pradipta Kumar Banerjee wrote: > rping didn't check correctly for the minimum size of the ping > buffer, resulting in the following error from glibc > > "*** glibc detected *** free(): invalid next size (fast)" > > Signed-off-by: Pradipta Kumar Banerjee > --- > > Index: rping.c > ============================================================= > --- rping.org 2006-06-09 10:57:43.000000000 +0530 > +++ rping.c 2006-06-09 11:00:28.000000000 +0530 > @@ -96,6 +96,12 @@ struct rping_rdma_info { > #define RPING_BUFSIZE 64*1024 > #define RPING_SQ_DEPTH 16 > > +/* Default string for print data and > + * minimum buffer size > + */ > +#define RPING_MSG_FMT "rdma-ping-%d: " > +#define RPING_MIN_BUFSIZE sizeof(itoa(INT_MAX))+sizeof(RPING_MSG_FMT) > + Tom, Just found that 'itoa' is not a built-in library function. The sizeof is returning '4' which is not what we really want. Do we hard-code the value to 10 ( like #define RPING_MIN_BUFSIZE 10 + sizeof(RPING_MSG_FMT) )?
INT_MAX is 2147483647 (10 chars). Other options might include writing our own 'itoa'. Thanks, Pradipta Kumar. > /* > * Control block struct. > */ > @@ -774,7 +780,7 @@ static void rping_test_client(struct rpi > cb->state = RDMA_READ_ADV; > > /* Put some ascii text in the buffer. */ > - cc = sprintf(cb->start_buf, "rdma-ping-%d: ", ping); > + cc = sprintf(cb->start_buf, RPING_MSG_FMT, ping); > for (i = cc, c = start; i < cb->size; i++) { > cb->start_buf[i] = c; > c++; > @@ -977,11 +983,11 @@ int main(int argc, char *argv[]) > break; > case 'S': > cb->size = atoi(optarg); > - if ((cb->size < 1) || > + if ((cb->size < RPING_MIN_BUFSIZE) || > (cb->size > (RPING_BUFSIZE - 1))) { > fprintf(stderr, "Invalid size %d " > - "(valid range is 1 to %d)\n", > - cb->size, RPING_BUFSIZE); > + "(valid range is %d to %d)\n", > + cb->size, RPING_MIN_BUFSIZE, RPING_BUFSIZE); > ret = EINVAL; > } else > DEBUG_LOG("size %d\n", (int) atoi(optarg)); > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From mst at mellanox.co.il Fri Jun 9 00:59:12 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 9 Jun 2006 10:59:12 +0300 Subject: [openib-general] [PATCH] mthca: send opcode in error CQE for debug Message-ID: <20060609075912.GB25811@mellanox.co.il> I find the following helpful for debug. Pls consider for 2.6.18 -- While the IB spec does not require opcode to be valid in error CQEs, Mellanox HCAs differentiate between send/receive errors, which is useful for debugging purposes. Signed-off-by: Michael S.
Tsirkin Index: last_stable/drivers/infiniband/hw/mthca/mthca_cq.c =================================================================== --- last_stable.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2006-06-09 10:14:53.000000000 +0300 +++ last_stable/drivers/infiniband/hw/mthca/mthca_cq.c 2006-06-09 10:15:08.000000000 +0300 @@ -562,6 +562,7 @@ static inline int mthca_poll_one(struct handle_error_cqe(dev, cq, *cur_qp, wqe_index, is_send, (struct mthca_err_cqe *) cqe, entry, &free_cqe); + entry->opcode = is_send ? IB_WC_SEND : IB_WC_RECV; goto out; } -- MST From halr at voltaire.com Fri Jun 9 03:43:13 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jun 2006 06:43:13 -0400 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <44889E18.8010507@ichips.intel.com> References: <1149799142.4510.13468.camel@hal.voltaire.com> <44889E18.8010507@ichips.intel.com> Message-ID: <1149849791.4510.41634.camel@hal.voltaire.com> On Thu, 2006-06-08 at 18:00, Sean Hefty wrote: > Hal Rosenstock wrote: > > 2. There is lazy deletion of MC groups allowed so the reclamation may be > > difficult. > > I'm not familiar with the switch programming. Note the MGRPs are MGIDs and switches are programmed with MLIDs and these can be 1:1 or many:1 depending on the implementation. Most do not do the many:1 but this is allowed by the spec. Also, note that switches know nothing about the groups themselves (only MLIDs and which ports) so most of the information is in the SM. > Does the SM set the entire > MulticastForwardingTable for a switch every time a new group is created, or a > new member joins? No. It only needs to program the affected block(s) of the MFT based on the MLID and the portmask (ports for replication). > If the SM loses track of all multicast groups, how are the > stale groups on the switches deleted? There are different strategies for dealing with this. It could clear out all the MFTs in all the switches but that is expensive. 
It could also wait for multicast registrations and then program the needed MFT blocks in the affected switches only caring about those. In this case, packets on those MLIDs would still be forwarded until the MLID is reclaimed. > > The endport SMAs are claiming they do support client reregistration but > > it does take more than that for the endport/node to behave properly. > > My original plan was to have the ib_multicast module rejoin all groups, but > since the MLIDs can change I can't see any way to handle reregistration safely > without involving the application. Because the application needs to modify the QP for this ? As I said, I'm not sure IPoIB was handling this before. I'm sure Roland knows for sure. > My latest changes are just to report errors > on existing multicast groups on an affected port. How ? > > I know it is a conceptual rather than actual compliance. One issue would > > be defining what it means to repect all existing communication. Then we > > would need to look at whether that was feasible or not and perhaps > > rescope what it means to a set of things achievable. Another issue would > > be defining where it is possible or not. If that is totally vendor > > dependent, then this would have no substance to it. It is largely a > > matter of being a "better" SM. > > We could use the phrase, "except where such communication is no longer > realizable" instead of "where possible". Where unrealizable means impossible > because the communication uses properties that are physically impossible to > achieve given the hardware configuration of the subnet. (See bottom of page 910 > of the spec.) That specific text is defined there for the case of unrealizable joins which is very different from the case being discussed. The specific property mismatches are listed. Still not sure what determines this in the case we are discussing. > If an SM could just query switches for their MulticastForwardingTables or the > end nodes, It can. 
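The block arithmetic behind programming (or querying) only the affected MFT entries can be sketched as follows. This is a simplified illustration assuming the standard attribute layout — multicast LIDs based at 0xC000, each MulticastForwardingTable block holding 32 16-bit PortMask entries — so the SM only has to touch the one block containing a changed MLID; the function names are made up for the sketch.

```c
#include <stdint.h>

#define MLID_BASE      0xC000u   /* first multicast LID */
#define MFT_BLOCK_SIZE 32u       /* PortMask entries per MFT block */

/* Which 32-entry MFT block an MLID falls into, and its offset
 * within that block. */
unsigned mft_block(uint16_t mlid)  { return (mlid - MLID_BASE) / MFT_BLOCK_SIZE; }
unsigned mft_offset(uint16_t mlid) { return (mlid - MLID_BASE) % MFT_BLOCK_SIZE; }

/* Is the given port (0..15 within the selected port group) set in
 * the PortMask for this MLID, i.e. should the switch replicate
 * packets for the MLID out that port? */
int mft_port_set(uint16_t portmask, unsigned port)
{
    return (portmask >> port) & 1;
}
```

For example, MLID 0xC021 lands at offset 1 of block 1, so only that block needs to be rewritten when its membership changes.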
> would we be able to avoid these issues? How ? Not all the group information is in the switches. -- Hal > - Sean From eli at mellanox.co.il Fri Jun 9 03:53:54 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Fri, 9 Jun 2006 13:53:54 +0300 Subject: [openib-general] [Bug 122] New: mad layer problem Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30249F8EE@mtlexch01.mtl.com> Hi, Here is some info: 1. Attached are the SysRq messages. 2. The relation of MADs to ARP is that after ARP resolves a hardware address it is required to use an SM query to resolve the path to the host bearing the hardware address. 3. How to invoke the tests: Attached one readme file and one configuration file. Eli -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: messages.bz2 Type: application/octet-stream Size: 10202 bytes Desc: messages.bz2 URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 114-115-800conn.conf Type: application/octet-stream Size: 1571 bytes Desc: 114-115-800conn.conf URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: polyReadme.txt URL: From halr at voltaire.com Fri Jun 9 04:26:18 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jun 2006 07:26:18 -0400 Subject: [openib-general] [PATCH] ibroute: When multiple paths, indicate port GUID on alternate paths Message-ID: <1149852273.4510.42972.camel@hal.voltaire.com> ibroute: When multiple paths, indicate port GUID on alternate paths Signed-off-by: Hal Rosenstock Index: diags/src/ibroute.c =================================================================== --- diags/src/ibroute.c (revision 7646) +++ diags/src/ibroute.c (working copy) @@ -272,10 +272,22 @@ dump_lid(char *str, int strlen, int lid, if (!valid) return snprintf(str, strlen, ": (path #%d - illegal port)", lid - base_port_lid); - else - return snprintf(str, strlen, ": (path #%d out of %d)", - lid - base_port_lid + 1, - last_port_lid - base_port_lid + 1); + else { + lidport.lid = lid; + if (!smp_query(ni, &lidport, IB_ATTR_NODE_INFO, 0, 100)) + return snprintf(str, strlen, + ": (path #%d out of %d)", + lid - base_port_lid + 1, + last_port_lid - base_port_lid + 1); + else { + mad_decode_field(ni, IB_NODE_PORT_GUID_F, &portguid); + return snprintf(str, strlen, + ": (path #%d out of %d: portguid %s)", + lid - base_port_lid + 1, + last_port_lid - base_port_lid + 1, + mad_dump_val(IB_NODE_PORT_GUID_F, sguid, sizeof sguid, &portguid)); + } + } } if (!valid) From halr at voltaire.com Fri Jun 9 06:07:24 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jun 2006 09:07:24 -0400 Subject: [openib-general] Re: [PATCH 1/2] multicast: notify users on membership errors In-Reply-To: References: Message-ID: <1149858444.4744.73.camel@hal.voltaire.com> On Fri, 2006-06-09 at 00:38, Sean Hefty wrote: > Modify ib_multicast module to detect events that require clients to rejoin > multicast groups. Add tracking of clients which are members of any groups, > and provide notification to those clients when such an event occurs. > > This patch tracks all active members of a group. 
When an event occurs that > requires clients to rejoin a multicast group, the active members are moved > into an error state, and the clients are notified of a network reset error. > The group is then reset to force additional join requests to generate requests > to the SA. > > Signed-off-by: Sean Hefty > --- > Hal, can you apply these patches and see if it fixes the issues that you > are experiencing. These should eliminate any races with ipoib leaving, > then quickly re-joining a group as a result of an event. This is working better now. Thanks! -- Hal From halr at voltaire.com Fri Jun 9 06:11:56 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jun 2006 09:11:56 -0400 Subject: [openib-general] [PATCH] osmtest: Support LMC > 0 Message-ID: <1149858716.4744.224.camel@hal.voltaire.com> osmtest: Support LMC > 0 Signed-off-by: Hal Rosenstock Index: osmtest/osmtest.c =================================================================== --- osmtest/osmtest.c (revision 7839) +++ osmtest/osmtest.c (working copy) @@ -1609,6 +1609,74 @@ osmtest_stress_port_recs_small( IN osmte } /********************************************************************** + **********************************************************************/ +ib_api_status_t +osmtest_get_local_port_lmc( IN osmtest_t * const p_osmt, + OUT uint8_t * const p_lmc ) +{ + osmtest_req_context_t context; + ib_portinfo_record_t *p_rec; + uint32_t i; + cl_status_t status; + uint32_t num_recs = 0; + + OSM_LOG_ENTER( &p_osmt->log, osmtest_get_local_port_lmc ); + + memset( &context, 0, sizeof( context ) ); + + /* + * Do a blocking query for our own PortRecord in the subnet. 
+ */ + status = osmtest_get_port_rec( p_osmt, + cl_ntoh16(p_osmt->local_port.lid), + &context ); + + if( status != IB_SUCCESS ) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_get_local_port_lmc: ERR 001A: " + "osmtest_get_port_rec failed (%s)\n", + ib_get_err_str( status ) ); + goto Exit; + } + + num_recs = context.result.result_cnt; + + if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE ) ) + { + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_get_local_port_lmc: " + "Received %u records\n", num_recs ); + } + + for( i = 0; i < num_recs; i++ ) + { + p_rec = osmv_get_query_portinfo_rec( context.result.p_result_madw, i ); + osm_dump_portinfo_record( &p_osmt->log, p_rec, OSM_LOG_VERBOSE ); + if ( p_lmc) + { + *p_lmc = ib_port_info_get_lmc( &p_rec->port_info ); + osm_log( &p_osmt->log, OSM_LOG_DEBUG, + "osmtest_get_local_port_lmc: " + "LMC %d\n", *p_lmc ); + } + } + + Exit: + /* + * Return the IB query MAD to the pool as necessary. + */ + if( context.result.p_result_madw != NULL ) + { + osm_mad_pool_put( &p_osmt->mad_pool, context.result.p_result_madw ); + context.result.p_result_madw = NULL; + } + + OSM_LOG_EXIT( &p_osmt->log ); + return ( status ); +} + +/********************************************************************** * Use a wrong SM_Key in a simple port query and report success if * failed. **********************************************************************/ @@ -3100,6 +3168,7 @@ osmtest_validate_path_data( IN osmtest_t IN const ib_path_rec_t * const p_rec ) { cl_status_t status = IB_SUCCESS; + uint8_t lmc = 0; OSM_LOG_ENTER( &p_osmt->log, osmtest_validate_path_data ); @@ -3111,17 +3180,38 @@ osmtest_validate_path_data( IN osmtest_t cl_ntoh16( p_rec->slid ), cl_ntoh16( p_rec->dlid ) ); } - /* - * Has this record already been returned? 
- */ - if( p_path->count != 0 ) + status = osmtest_get_local_port_lmc( p_osmt, &lmc ); + + /* HACK: Assume uniform LMC across endports in the subnet */ + /* In absence of this assumption, validation of this is much more complicated */ + if ( lmc == 0 ) { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmtest_validate_path_data: ERR 0056: " - "Already received path SLID 0x%X to DLID 0x%X\n", - cl_ntoh16( p_rec->slid ), cl_ntoh16( p_rec->dlid ) ); - status = IB_ERROR; - goto Exit; + /* + * Has this record already been returned? + */ + if( p_path->count != 0 ) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_validate_path_data: ERR 0056: " + "Already received path SLID 0x%X to DLID 0x%X\n", + cl_ntoh16( p_rec->slid ), cl_ntoh16( p_rec->dlid ) ); + status = IB_ERROR; + goto Exit; + } + } + else + { + /* Also, this doesn't detect fewer than the correct number of paths being returned */ + if ( p_path->count >= ( 1 << lmc ) * ( 1 << lmc ) ) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_validate_path_data: ERR 0052: " + "Already received path SLID 0x%X to DLID 0x%X count %d LMC %d\n", + cl_ntoh16( p_rec->slid ), cl_ntoh16( p_rec->dlid ), + p_path->count, lmc ); + status = IB_ERROR; + goto Exit; + } } ++p_path->count; From halr at voltaire.com Fri Jun 9 06:52:02 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jun 2006 09:52:02 -0400 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <1149849791.4510.41634.camel@hal.voltaire.com> References: <1149799142.4510.13468.camel@hal.voltaire.com> <44889E18.8010507@ichips.intel.com> <1149849791.4510.41634.camel@hal.voltaire.com> Message-ID: <1149861120.4744.1558.camel@hal.voltaire.com> On Fri, 2006-06-09 at 06:43, Hal Rosenstock wrote: > On Thu, 2006-06-08 at 18:00, Sean Hefty wrote: > > Hal Rosenstock wrote: > > > 2. There is lazy deletion of MC groups allowed so the reclamation may be > > > difficult. > > > > I'm not familiar with the switch programming. 
> > Note the MGRPs are MGIDs and switches are programmed with MLIDs and > these can be 1:1 or many:1 depending on the implementation. Most do not > do the many:1 but this is allowed by the spec. Also, note that switches > know nothing about the groups themselves (only MLIDs and which ports) so > most of the information is in the SM. > > > Does the SM set the entire > > MulticastForwardingTable for a switch every time a new group is created, or a > new member joins? > > No. It only needs to program the affected block(s) of the MFT based on > the MLID and the portmask (ports for replication). > > > If the SM loses track of all multicast groups, how are the > > stale groups on the switches deleted? > > There are different strategies for dealing with this. It could clear out > all the MFTs in all the switches but that is expensive. It could also > wait for multicast registrations and then program the needed MFT blocks > in the affected switches only caring about those. In this case, packets > on those MLIDs would still be forwarded until the MLID is reclaimed. > > > The endport SMAs are claiming they do support client reregistration but > > it does take more than that for the endport/node to behave properly. > > > > My original plan was to have the ib_multicast module rejoin all groups, but > since the MLIDs can change I can't see any way to handle reregistration safely > without involving the application. > > Because the application needs to modify the QP for this ? As I said, I'm > not sure IPoIB was handling this before. I'm sure Roland knows for sure. It does look to me like the pre multicast module IPoIB does leave and then rejoin on receipt of a client reregister from the SM. -- Hal > > My latest changes are just to report errors > > on existing multicast groups on an affected port. > > How ? > > > I know it is a conceptual rather than actual compliance. One issue would > > > be defining what it means to respect all existing communication.
Then we > > > would need to look at whether that was feasible or not and perhaps > > > rescope what it means to a set of things achievable. Another issue would > > > be defining where it is possible or not. If that is totally vendor > > > dependent, then this would have no substance to it. It is largely a > > > matter of being a "better" SM. > > > > We could use the phrase, "except where such communication is no longer > > realizable" instead of "where possible". Where unrealizable means impossible > > because the communication uses properties that are physically impossible to > > achieve given the hardware configuration of the subnet. (See bottom of page 910 > > of the spec.) > > That specific text is defined there for the case of unrealizable joins > which is very different from the case being discussed. The specific > property mismatches are listed. Still not sure what determines this in > the case we are discussing. > > > If an SM could just query switches for their MulticastForwardingTables or the > > end nodes, > > It can. > > > would we be able to avoid these issues? > > How ? Not all the group information is in the switches. 
> > -- Hal > > > - Sean > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From krause at cup.hp.com Fri Jun 9 06:59:30 2006 From: krause at cup.hp.com (Michael Krause) Date: Fri, 09 Jun 2006 06:59:30 -0700 Subject: [openib-general] Re: Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060606081548.0469bcc8@netapp.com> References: <7.0.1.0.2.20060605081948.044849d0@netapp.com> <20060606074314.GC2432@mellanox.co.il> <7.0.1.0.2.20060606081548.0469bcc8@netapp.com> Message-ID: <6.2.0.14.2.20060609065546.02cd8b38@esmail.cup.hp.com> Whether iWARP or IB, there is a fixed number of RDMA Requests allowed to be outstanding at any given time. If one posts more RDMA Read requests than the fixed number, the transmit queue is stalled. This is documented in both technology specifications. It is something that all ULP should be aware of and some go so far as to communicate that as part of the Hello / login exchange. This allows the ULP implementation to determine whether it wants to stall or wants to wait until Read Responses complete before sending another request. This isn't something silent; this isn't something new; this is something for the ULP implementation to decide how to deal with the issue. BTW, this is part of the hardware and associated specifications so it is up to software to deal with the limited hardware resources and the associated consequences. Please keep in mind that there are a limited number of RDMA Request / Atomic resource "slots" at the receiving HCA / RNIC. These are kept in hardware thus one must know the exact limit to avoid creating protocol problems. A ULP transmitter may post to the transmit queue more than the allotted slots but the transmitting (source) HCA / RNIC must not issue them to the remote. 
These requests do cause the source to stall. This is a well understood problem and if people give the iSCSI / iSER and DA specs (or SDP) a good read they can see that this issue is comprehended. I agree with people that ULP designers / implementers must pay close attention to this constraint as it is in the iWARP / IB specifications for a very good reason and these semantics must be preserved to maintain the ordering requirements that are used by the overall RDMA protocols themselves. Mike At 05:24 AM 6/6/2006, Talpey, Thomas wrote: >At 03:43 AM 6/6/2006, Michael S. Tsirkin wrote: > >Quoting r. Talpey, Thomas : > >> Semantically, the provider is not required to provide any such flow > control > >> behavior by the way. The Mellanox one apparently does, but it is not > >> a requirement of the verbs, it's a requirement on the upper layer. If more > >> RDMA Reads are posted than the remote peer supports, the connection > >> may break. > > > >This does not sound right. Isn't this the meaning of this field: > >"Initiator Depth: Number of RDMA Reads & atomic operations > >outstanding at any time"? Shouldn't any provider enforce this limit? > >The core spec does not require it. An implementation *may* enforce it, >but is not *required* to do so. And as pointed out in the other message, >there are repercussions of doing so. > >I believe the silent queue stalling is a bit of a time bomb for upper layers, >whose implementers are quite likely unaware of the danger. I greatly >prefer an implementation which simply sends the RDMA Read request, >resulting in a failed (but unblocked!) connection. Silence is a very >dangerous thing, no matter how helpful the intent. >Tom.
> > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Fri Jun 9 07:23:57 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jun 2006 10:23:57 -0400 Subject: [openib-general] [PATCH] ibnetdiscover: Add LMC display to switch port 0 Message-ID: <1149863036.5093.773.camel@hal.voltaire.com> ibnetdiscover: Add LMC display to switch port 0 Signed-off-by: Hal Rosenstock Index: src/ibnetdiscover.c =================================================================== --- src/ibnetdiscover.c (revision 7841) +++ src/ibnetdiscover.c (working copy) @@ -158,6 +158,7 @@ get_node(Node *node, Port *port, ib_port return 0; node->smalid = port->lid; + node->smalmc = port->lmc; DEBUG("portid %s: got switch node %Lx '%s'", portid2str(portid), node->nodeguid, nd); @@ -530,9 +531,9 @@ out_switch(Node *node, int group) } } - fprintf(f, "\nSwitch\t%d %s\t\t# %s port 0 lid %d\n", + fprintf(f, "\nSwitch\t%d %s\t\t# %s port 0 lid %d lmc %d\n", node->numports, node_name(node), - clean_nodedesc(node->nodedesc), node->smalid); + clean_nodedesc(node->nodedesc), node->smalid, node->smalmc); } void Index: include/ibnetdiscover.h =================================================================== --- include/ibnetdiscover.h (revision 7841) +++ include/ibnetdiscover.h (working copy) @@ -82,6 +82,7 @@ struct Node { int numports; int localport; int smalid; + int smalmc; uint32_t devid; uint32_t vendid; uint64_t sysimgguid; From halr at voltaire.com Fri Jun 9 08:01:44 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jun 2006 11:01:44 -0400 Subject: [openib-general] [PATCH] ibnetdiscover: Indicate SP0 type Message-ID: <1149865304.5093.2035.camel@hal.voltaire.com> ibnetdiscover: Indicate SP0 type 
Signed-off-by: Hal Rosenstock Index: diags/src/ibnetdiscover.c =================================================================== --- diags/src/ibnetdiscover.c (revision 7842) +++ diags/src/ibnetdiscover.c (working copy) @@ -126,7 +126,9 @@ int get_node(Node *node, Port *port, ib_portid_t *portid) { char portinfo[64]; + char switchinfo[32]; void *pi = portinfo, *ni = node->nodeinfo, *nd = node->nodedesc; + void *si = switchinfo; if (!smp_query(ni, portid, IB_ATTR_NODE_INFO, 0, timeout)) return -1; @@ -160,6 +162,12 @@ get_node(Node *node, Port *port, ib_port node->smalid = port->lid; node->smalmc = port->lmc; + if (!smp_query(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout)) + node->smaenhsp0 = 0; /* assume base SP0 */ + else { + mad_decode_field(si, IB_SW_ENHANCED_PORT0_F, &node->smaenhsp0); + } + DEBUG("portid %s: got switch node %Lx '%s'", portid2str(portid), node->nodeguid, nd); return 1; @@ -531,9 +539,11 @@ out_switch(Node *node, int group) } } - fprintf(f, "\nSwitch\t%d %s\t\t# %s port 0 lid %d lmc %d\n", + fprintf(f, "\nSwitch\t%d %s\t\t# %s %s port 0 lid %d lmc %d\n", node->numports, node_name(node), - clean_nodedesc(node->nodedesc), node->smalid, node->smalmc); + clean_nodedesc(node->nodedesc), + node->smaenhsp0 ? 
"enhanced" : "base", + node->smalid, node->smalmc); } void Index: diags/include/ibnetdiscover.h =================================================================== --- diags/include/ibnetdiscover.h (revision 7842) +++ diags/include/ibnetdiscover.h (working copy) @@ -83,6 +83,7 @@ struct Node { int localport; int smalid; int smalmc; + int smaenhsp0; uint32_t devid; uint32_t vendid; uint64_t sysimgguid; From mshefty at ichips.intel.com Fri Jun 9 09:46:37 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 09 Jun 2006 09:46:37 -0700 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <1149849791.4510.41634.camel@hal.voltaire.com> References: <1149799142.4510.13468.camel@hal.voltaire.com> <44889E18.8010507@ichips.intel.com> <1149849791.4510.41634.camel@hal.voltaire.com> Message-ID: <4489A5ED.8060200@ichips.intel.com> Hal Rosenstock wrote: > Note the MGRPs are MGIDs and switches are programmed with MLIDs and > these can be 1:1 or many:1 depending on the implementation. Most do not > do the many:1 but this is allowed by the spec. Also, note that switches > know nothing about the groups themselves (only MLIDs and which ports) so > most of the information is in the SM. Is there any chance that someone using an "old" join can receive data on a group that was created after an SM restart? My guess is that the QP would discard the message unless both the MLID and MGIDs matched, so there's probably not a real issue here. > How ? Not all the group information is in the switches. It's likely that the end nodes have the mcmember records from previous joins. Isn't that along with the switch information enough to reconstruct the group information? - Sean From sweitzen at cisco.com Fri Jun 9 10:44:56 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Fri, 9 Jun 2006 10:44:56 -0700 Subject: [openib-general] IB MTU tunable for uDAPL and/or Intel MPI? 
Message-ID: While we're talking about MTUs, is the IB MTU tunable in uDAPL and/or Intel MPI via env var or config file? Looks like Intel MPI 2.0.1 uses 2K for IB MTU like MVAPICH does in OFED 1.0 rc4 and rc6, I'd like to try 1K with Intel MPI. Scott ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Scott Weitzenkamp (sweitzen) Sent: Thursday, June 08, 2006 4:38 PM To: Tziporet Koren; openfabrics-ewg at openib.org Cc: openib-general Subject: RE: [openib-general] OFED-1.0-rc6 is available The MTU change undoes the changes for bug 81, so I have reopened bug 81 (http://openib.org/bugzilla/show_bug.cgi?id=81). With rc6, PCI-X osu_bw and osu_bibw performance is bad, and PCI-E osu_bibw performance is bad. I've enclosed some performance data, look at rc4 vs rc5 vs rc6 for Cougar/Cheetah/LionMini. Are there other benchmarks driving the changes in rc6 (and rc4)? Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems OSU MPI: * Added mpi_alltoall fine tuning parameters * Added default configuration/documentation file $MPIHOME/etc/mvapich.conf * Added shell configuration files $MPIHOME/etc/mvapich.csh , $MPIHOME/etc/mvapich.csh * Default MTU was changed back to 2K for InfiniHost III Ex and InfiniHost III Lx HCAs. For InfiniHost card recommended value is: VIADEV_DEFAULT_MTU=MTU1024 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Fri Jun 9 11:14:16 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 09 Jun 2006 11:14:16 -0700 Subject: [openib-general] Re: [PATCH 1/2] multicast: notify users on membership errors In-Reply-To: <20060609050910.GD10263@mellanox.co.il> References: <20060609050910.GD10263@mellanox.co.il> Message-ID: <4489BA78.4050700@ichips.intel.com> Michael S.
Tsirkin wrote: >>These should eliminate any races with ipoib leaving, >>then quickly re-joining a group as a result of an event. > > Is there a chance this will fix the crashes me and Or were seeing? It shouldn't. The race that I was referring to only involved whether or not a MAD is sent when a new join request is received by the ib_multicast module. I literally spent hours reviewing the ib_multicast code trying to identify how the crash you and Or saw could occur. My conclusion is that it's impossible. :) So, at this point, I'm still not sure how it happens. I'll continue to spend some time looking at this, in the hope that I can at least add in some assertions that might indicate when the problem starts. - Sean From mshefty at ichips.intel.com Fri Jun 9 11:34:48 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 09 Jun 2006 11:34:48 -0700 Subject: [openib-general] [PATCH 2/2] ipoib: handle multicast group reset notification In-Reply-To: References: Message-ID: <4489BF48.8010804@ichips.intel.com> Sean Hefty wrote: > Ipoib already checks for events that require rejoining multicast groups. > We just need to add code to handle (i.e. ignore) multicast group reset > notifications. Roland, Any issue committing this? - Sean From ardavis at ichips.intel.com Fri Jun 9 11:38:07 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Fri, 09 Jun 2006 11:38:07 -0700 Subject: [openib-general] IB MTU tunable for uDAPL and/or Intel MPI? In-Reply-To: References: Message-ID: <4489C00F.2000902@ichips.intel.com> Scott Weitzenkamp (sweitzen) wrote: > While we're talking about MTUs, is the IB MTU tunable in uDAPL and/or > Intel MPI via env var or config file? > > Looks like Intel MPI 2.0.1 uses 2K for IB MTU like MVAPICH does in > OFED 1.0 rc4 and rc6, I'd like to try 1K with Intel MPI. > > Scott There is no mechanism for me to modify the MTU using rdma_cm so whatever is returned in the path record is what you get with the OpenIB-cma provider. 
However, you could use the OpenIB-scm provider, which is hard-coded for 1K MTU, as a comparison. Can you run with "-genv I_MPI_DAPL_PROVIDER OpenIB-scm" on your cluster? -arlin > > ------------------------------------------------------------------------ > *From:* openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] *On Behalf Of *Scott > Weitzenkamp (sweitzen) > *Sent:* Thursday, June 08, 2006 4:38 PM > *To:* Tziporet Koren; openfabrics-ewg at openib.org > *Cc:* openib-general > *Subject:* RE: [openib-general] OFED-1.0-rc6 is available > > The MTU change undoes the changes for bug 81, so I have reopened > bug 81 (http://openib.org/bugzilla/show_bug.cgi?id=81). > > With rc6, PCI-X osu_bw and osu_bibw performance is bad, and PCI-E > osu_bibw performance is bad. I've enclosed some performance data, > look at rc4 vs rc5 vs rc6 for Cougar/Cheetah/LionMini. > > Are there other benchmarks driving the changes in rc6 (and rc4)? > > Scott Weitzenkamp > SQA and Release Manager > Server Virtualization Business Unit > Cisco Systems > > > > > *OSU MPI:* > > · Added mpi_alltoall fine tuning parameters > > · Added default configuration/documentation file > $MPIHOME/etc/mvapich.conf > > · Added shell configuration files > $MPIHOME/etc/mvapich.csh , $MPIHOME/etc/mvapich.csh > > · Default MTU was changed back to 2K for InfiniHost III > Ex and InfiniHost III Lx HCAs.
For InfiniHost card recommended > value is: > VIADEV_DEFAULT_MTU=MTU1024 > >------------------------------------------------------------------------ > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From tom at opengridcomputing.com Fri Jun 9 13:10:01 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 09 Jun 2006 15:10:01 -0500 Subject: [openib-general] [PATCH] rping: Erroneous check for minimum ping buffer size In-Reply-To: <44891FD4.30104@in.ibm.com> References: <20060609060138.GB13602@harry-potter.in.ibm.com> <44891FD4.30104@in.ibm.com> Message-ID: <1149883801.29808.52.camel@trinity.ogc.int> Well, it's almost a puzzle at this point. Just hard-coding 10 with a comment is probably easier to read. But ... for the curious, this will do what you want ... but may cause you to lose your breakfast. #define _stringify( _x ) # _x #define stringify( _x ) _stringify( _x ) Then printf("%s %d\n", stringify(INT_MAX), (int)(sizeof(stringify(INT_MAX)) - 1)) will get you... 2147483647 10 just like you "expected".
The double nested macro call is necessary to get cpp to substitute 2147483647 for INT_MAX, otherwise you get INT_MAX 7 Later, On Fri, 2006-06-09 at 12:44 +0530, Pradipta Kumar Banerjee wrote: > Pradipta Kumar Banerjee wrote: > > rping didn't check correctly for the minimum size of the ping > > buffer resulting in the following error from glibc > > > > "*** glibc detected *** free(): invalid next size (fast)" > > > > Signed-off-by: Pradipta Kumar Banerjee > > --- > > > > Index: rping.c > > ============================================================= > > --- rping.org 2006-06-09 10:57:43.000000000 +0530 > > +++ rping.c 2006-06-09 11:00:28.000000000 +0530 > > @@ -96,6 +96,12 @@ struct rping_rdma_info { > > #define RPING_BUFSIZE 64*1024 > > #define RPING_SQ_DEPTH 16 > > > > +/* Default string for print data and > > + * minimum buffer size > > + */ > > +#define RPING_MSG_FMT "rdma-ping-%d: " > > +#define RPING_MIN_BUFSIZE sizeof(itoa(INT_MAX))+sizeof(RPING_MSG_FMT) > > + > Tom, > Just found that 'itoa' is not a built-in library function. The sizeof is > returning '4' which is not what we really want. Do we hard-code the value to 10 > ( like #define RPING_MIN_BUFSIZE 10 + sizeof(RPING_MSG_FMT) )? > INT_MAX is 2147483647 (10 - chars). Other options might include writing our own > 'itoa'. > > Thanks, > Pradipta Kumar. > > > /* > > * Control block struct. > > */ > > @@ -774,7 +780,7 @@ static void rping_test_client(struct rpi > > cb->state = RDMA_READ_ADV; > > > > /* Put some ascii text in the buffer.
*/ > > - cc = sprintf(cb->start_buf, "rdma-ping-%d: ", ping); > > + cc = sprintf(cb->start_buf, RPING_MSG_FMT, ping); > > for (i = cc, c = start; i < cb->size; i++) { > > cb->start_buf[i] = c; > > c++; > > @@ -977,11 +983,11 @@ int main(int argc, char *argv[]) > > break; > > case 'S': > > cb->size = atoi(optarg); > > - if ((cb->size < 1) || > > + if ((cb->size < RPING_MIN_BUFSIZE) || > > (cb->size > (RPING_BUFSIZE - 1))) { > > fprintf(stderr, "Invalid size %d " > > - "(valid range is 1 to %d)\n", > > - cb->size, RPING_BUFSIZE); > > + "(valid range is %d to %d)\n", > > + cb->size, RPING_MIN_BUFSIZE, RPING_BUFSIZE); > > ret = EINVAL; > > } else > > DEBUG_LOG("size %d\n", (int) atoi(optarg)); > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From halr at voltaire.com Fri Jun 9 13:12:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jun 2006 16:12:28 -0400 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <4489A5ED.8060200@ichips.intel.com> References: <1149799142.4510.13468.camel@hal.voltaire.com> <44889E18.8010507@ichips.intel.com> <1149849791.4510.41634.camel@hal.voltaire.com> <4489A5ED.8060200@ichips.intel.com> Message-ID: <1149883948.5093.12572.camel@hal.voltaire.com> On Fri, 2006-06-09 at 12:46, Sean Hefty wrote: > Hal Rosenstock wrote: > > Note the MGRPs are MGIDs and switches are programmed with MLIDs and > > these can be 1:1 or many:1 depending on the implementation. Most do not > > do the many:1 but this is allowed by the spec. Also, note that switches > > know nothing about the groups themselves (only MLIDs and which ports) so > > most of the information is in the SM. > > Is there any chance that someone using an "old" join can receive data on a group > that was created after an SM restart? 
I think so. One can also view this as another aspect of lazy deletion. Actually the deletion can be so slow as to never occur. > My guess is that the QP would discard the > message unless both the MLID and MGIDs matched, That would be my guess too but I'm not sure. > so there's probably not a real issue here. > > How ? Not all the group information is in the switches. > > It's likely that the end nodes have the mcmember records from previous joins. > Isn't that along with the switch information enough to reconstruct the group > information? No. The MCMemberRecord joins don't match the MGIDs to the MLIDs. You would need more info than that although it is available. The other issue is whether you trust the state of the network or not when the SM comes up. That's sometimes a dangerous proposition. -- Hal > - Sean From robert.j.woodruff at intel.com Fri Jun 9 13:21:23 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 9 Jun 2006 13:21:23 -0700 Subject: [openib-general] RE: [openfabrics-ewg] OFED-1.0-rc6 is available Message-ID: <1AC79F16F5C5284499BB9591B33D6F0007ED438C@orsmsx408> Is there any plan to release an RC6 package (or an RC7) that has a Pathscale driver that compiles on RHEL4 - U3 that we can test before the release ? woody ________________________________ From: openfabrics-ewg-bounces at openib.org [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of Tziporet Koren Sent: Wednesday, June 07, 2006 7:59 AM To: Tziporet Koren; openfabrics-ewg at openib.org Cc: openib-general Subject: [openfabrics-ewg] OFED-1.0-rc6 is available Hi All, We have prepared OFED 1.0 RC6. Release location: https://openib.org/svn/gen2/branches/1.0/ofed/releases File: OFED-1.0-rc6.tgz Note: This release is the code freeze release for OFED 1.0. Only showstopper bugs will be fixed. 
BUILD_ID: OFED-1.0-rc6 openib-1.0 (REV=7772) # User space https://openib.org/svn/gen2/branches/1.0/src/userspace # Kernel space https://openib.org/svn/gen2/branches/1.0/ofed/tags/rc6/linux-kernel Git: ref: refs/heads/for-2.6.17 commit d9ec5ad24ce80b7ef69a0717363db661d13aada5 # MPI mpi_osu-0.9.7-mlx2.1.0.tgz openmpi-1.1b1-1.src.rpm mpitests-1.0-0.src.rpm OSes: * RH EL4 up2: 2.6.9-22.ELsmp * RH EL4 up3: 2.6.9-34.ELsmp * Fedora C4: 2.6.11-1.1369_FC4 * SLES10 RC2: 2.6.16.16-1.6-smp * SUSE 10 Pro: 2.6.13-15-smp * kernel.org: 2.6.16.x Systems: * x86_64 * x86 * ia64 * ppc64 Main changes from RC5: 1. SDP - libsdp implementation of RFC proposed by Eitan Zahavi; bug fixes in kernel module. See details below. 2. SRP - bug fixes 3. Open MPI - new package based on 1.1b1-1 4. OSU-MPI - See details below. 5. iSER: Enhanced to support SLES 10 RC1. 6. IPoIB default configuration changed: a. IPoIB configuration at install time is now optional. b. The default configuration of IPoIB interfaces (if performed at install time) is DHCP; it can be changed during interactive installation. c. For unattended installation one can give a new configuration file. See the example below. 7. Bug Fixes. Package limitations: 1. The ipath driver does not compile/load on most systems. To be fixed in final release. Meanwhile, one must work with custom build and not choose ipath driver, or change in the conf file: ib_ipath=n. I attached a reference ofed-no_ipath.conf file. Once Qlogic fixes the backport patches I will publish them on the release page so any one interested can use them with this release. 2. 
iSER is working on SuSE SLES 10 RC1 only IPoIB configuration file example: If you are going to install OFED on a 32 node cluster and want to use static IPoIB configuration based on Ethernet device configuration follow instructions below: Assume that the Ethernet IP addresses (eth0 interfaces) of the cluster are: 10.0.0.1 - 10.0.0.32 and you want to assign to ib0 IP addresses in the range: 192.168.0.1 - 192.168.0.32 and to ib1 IP addresses in the range: 172.16.0.1 - 172.16.0.32 Then create the file ofed_net.conf with the following lines: LAN_INTERFACE_ib0=eth0 IPADDR_ib0=192.168.'*'.'*' NETMASK_ib0=255.255.0.0 NETWORK_ib0=192.168.0.0 BROADCAST_ib0=192.168.255.255 ONBOOT_ib0=1 LAN_INTERFACE_ib1=eth0 IPADDR_ib1=172.16.'*'.'*' NETMASK_ib1=255.255.0.0 NETWORK_ib1=172.16.0.0 BROADCAST_ib1=172.16.255.255 ONBOOT_ib1=1 Note: '*' will be replaced by the corresponding octet from the eth0 IP address. Assuming that you already have an OFED configuration file (ofed.conf) with selected packages (created by running OFED-1.0/install.sh) Run: ./install.sh -c ofed.conf -net ofed_net.conf OSU MPI: * Added mpi_alltoall fine tuning parameters * Added default configuration/documentation file $MPIHOME/etc/mvapich.conf * Added shell configuration files $MPIHOME/etc/mvapich.csh , $MPIHOME/etc/mvapich.csh * Default MTU was changed back to 2K for InfiniHost III Ex and InfiniHost III Lx HCAs. For InfiniHost card recommended value is: VIADEV_DEFAULT_MTU=MTU1024 SDP Details: libsdp enhancements according to the RFC: 1. New config syntax (please see libsdp.conf) 2. With no config or empty config use SIMPLE_LIBSDP mode 3. Support listening on both tcp and sdp 4. Support trying both connections (first SDP then TCP) 5. Support IPv4 embedded in IPv6 (also convert back address) 6. Comprehensive verbosity logging 7. BNF based config parser Current SDP limitations: * SDP currently does not support sending/receiving out of band data (MSG_OOB). * Generally, SDP supports only SOL_SOCKET socket options.
* The following options can be set but actual support is missing: o SO_KEEPALIVE - no keepalives are sent o SO_OOBINLINE - out of band data is not supported o SDP currently supports setting the following SOL_TCP socket options: o TCP_NODELAY, TCP_CORK - but actual support for these options is still missing * SDP currently does not handle Zcopy mode messages correctly and does not set MaxAdverts properly in HH/HAH messages. OFED components tested by Mellanox: * Verbs over mthca * IPoIB * OpenSM * OSU-MPI * SRP * SDP * IB administration utils (ibutils) Please send us any issues you encounter and/or test results. Thanks Tziporet & Vlad Tziporet Koren Software Director Mellanox Technologies mailto: tziporet at mellanox.co.il Tel +972-4-9097200, ext 380 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Fri Jun 9 13:35:55 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 09 Jun 2006 13:35:55 -0700 Subject: [openib-general] Re: Failed multicast join withnew multicast module In-Reply-To: <1149883948.5093.12572.camel@hal.voltaire.com> References: <1149799142.4510.13468.camel@hal.voltaire.com> <44889E18.8010507@ichips.intel.com> <1149849791.4510.41634.camel@hal.voltaire.com> <4489A5ED.8060200@ichips.intel.com> <1149883948.5093.12572.camel@hal.voltaire.com> Message-ID: <4489DBAB.3080707@ichips.intel.com> Hal Rosenstock wrote: > The other issue is whether you trust the state of the network or not > when the SM comes up. That's sometimes a dangerous proposition. I considered this, but I think there's a difference between trusting one of the systems on the network, versus the network as a whole. For example, as long as the MCMember records from the end nodes mesh with MulticastForwarding tables on the switches, then we may be okay. Also, the MCMember records carry both the MGID and MLID, so what more would you need? 
- Sean

From bugzilla-daemon at openib.org Fri Jun 9 13:45:20 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Fri, 9 Jun 2006 13:45:20 -0700 (PDT)
Subject: [openib-general] [Bug 126] New: RDMA_CM and UCM not loaded on boot
Message-ID: <20060609204520.6FD342283FD@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=126

Summary: RDMA_CM and UCM not loaded on boot
Product: OpenFabrics Linux
Version: 1.0rc6
Platform: Other
OS/Version: Other
Status: NEW
Severity: normal
Priority: P2
Component: RDMA CM
AssignedTo: bugzilla at openib.org
ReportedBy: robert.j.woodruff at intel.com

The RDMA_CM and UCM are not being loaded automatically when the system boots (RHEL4-U3). This causes uDAPL and Intel MPI to fail.

woody

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at openib.org Fri Jun 9 13:55:14 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Fri, 9 Jun 2006 13:55:14 -0700 (PDT)
Subject: [openib-general] [Bug 122] mad layer problem
Message-ID: <20060609205514.AD40A228766@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=122

sean.hefty at intel.com changed:

           What    |Removed    |Added
----------------------------------------------------------------------------
           Status  |NEW        |ASSIGNED

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
From halr at voltaire.com Fri Jun 9 13:53:46 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 09 Jun 2006 16:53:46 -0400
Subject: [openib-general] Failed multicast join with new multicast module
In-Reply-To: <4489DBAB.3080707@ichips.intel.com>
References: <1149799142.4510.13468.camel@hal.voltaire.com> <44889E18.8010507@ichips.intel.com> <1149849791.4510.41634.camel@hal.voltaire.com> <4489A5ED.8060200@ichips.intel.com> <1149883948.5093.12572.camel@hal.voltaire.com> <4489DBAB.3080707@ichips.intel.com>
Message-ID: <1149886425.5093.13966.camel@hal.voltaire.com>

On Fri, 2006-06-09 at 16:35, Sean Hefty wrote:
> Hal Rosenstock wrote:
> > The other issue is whether you trust the state of the network or not
> > when the SM comes up. That's sometimes a dangerous proposition.
>
> I considered this, but I think there's a difference between trusting one of the
> systems on the network, versus the network as a whole. For example, as long as
> the MCMember records from the end nodes mesh with MulticastForwarding tables on
> the switches, then we may be okay.

What does "mesh" mean in this instance? How do you know the multicast routing tables are indeed valid and that the SM didn't corrupt them? (Why did the SM need restarting?)

> Also, the MCMember records carry both the MGID and MLID, so what more would you
> need?

The MLID is supplied by the SA in response to a group request from the end node, not the other way around. The end node doesn't tell the SA what MLID to use for a group.
-- Hal

> - Sean

From mshefty at ichips.intel.com Fri Jun 9 14:18:58 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 09 Jun 2006 14:18:58 -0700
Subject: [openib-general] Failed multicast join with new multicast module
In-Reply-To: <1149886425.5093.13966.camel@hal.voltaire.com>
References: <1149799142.4510.13468.camel@hal.voltaire.com> <44889E18.8010507@ichips.intel.com> <1149849791.4510.41634.camel@hal.voltaire.com> <4489A5ED.8060200@ichips.intel.com> <1149883948.5093.12572.camel@hal.voltaire.com> <4489DBAB.3080707@ichips.intel.com> <1149886425.5093.13966.camel@hal.voltaire.com>
Message-ID: <4489E5C2.3090905@ichips.intel.com>

Hal Rosenstock wrote:
> What does mesh mean in this instance ? How do you know the multicast
> routing tables are indeed valid and that the SM didn't corrupt them ?
> (Why did the SM need restarting ?)

I meant that the values agree with each other, and there are no conflicts.

> The MLID is supplied by the SA in response to a group request from the
> end node, not the other way around. The end node doesn't tell the SA
> what MLID to use for a group.

One of the ideas is for the end nodes to provide this data, even if that means extending the architecture. The problem is that the SA lost its state, but the network is working fine. The end nodes know which groups they have joined and the mapping of MGIDs to MLIDs. And the switches are already programmed correctly.

Even if we have the ability for an SM to transparently fail over to another SM, because of the architecture the end nodes are being forced to assume that all multicast group information has been lost.

How about this? What if the end nodes only re-joined their groups on LID_CHANGE or CLIENT_REREGISTER events? That is, an SM_CHANGE would not result in clients needing to rejoin any groups. This puts the burden on the SM to generate a CLIENT_REREGISTER event only if it's needed. SMs that can fail over and maintain multicast state in the process would be able to do so.
- Sean

From sean.hefty at intel.com Fri Jun 9 14:19:24 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 9 Jun 2006 14:19:24 -0700
Subject: [openib-general] [PATCH 0/5] multicast abstraction
Message-ID: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com>

This patch series enhances support for joining and leaving multicast groups, providing the following functionality:

1. Users identify a multicast group by a multicast IP address.
2. A user binds to a local RDMA device based on resolving the IP address.
3. A new multicast group is created. The parameters for the multicast group are obtained based on the ipoib broadcast group, and the MGID is derived using the same algorithm as ipoib, except with a different signature.
4. Any QP associated with the join is attached to the group once the join operation completes.
5. A QP may join multiple groups.

Signed-off-by: Sean Hefty

From sean.hefty at intel.com Fri Jun 9 14:40:45 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 9 Jun 2006 14:40:45 -0700
Subject: [openib-general] [PATCH 1/5] ib_addr: retrieve MGID from device address
In-Reply-To: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com>
Message-ID: <000e01c68c0d$5d31b500$ff0da8c0@amr.corp.intel.com>

Extract the MGID used by ipoib for broadcast traffic from the device address.

Signed-off-by: Sean Hefty
---
This will be used to get the MCMemberRecord for the ipoib broadcast group.
--- svn3/gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_addr.h 2006-05-25 11:18:47.000000000 -0700 +++ svn/gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_addr.h 2006-06-06 16:14:11.000000000 -0700 @@ -89,6 +89,11 @@ static inline void ib_addr_set_pkey(stru dev_addr->broadcast[9] = (unsigned char) pkey; } +static inline union ib_gid *ib_addr_get_mgid(struct rdma_dev_addr *dev_addr) +{ + return (union ib_gid *) (dev_addr->broadcast + 4); +} + static inline union ib_gid *ib_addr_get_sgid(struct rdma_dev_addr *dev_addr) { return (union ib_gid *) (dev_addr->src_dev_addr + 4); From sean.hefty at intel.com Fri Jun 9 14:46:56 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 9 Jun 2006 14:46:56 -0700 Subject: [openib-general] [PATCH 2/5] multicast: allow retrieving an MCMemberRecord based on MGID In-Reply-To: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com> Message-ID: <000f01c68c0e$3acafc50$ff0da8c0@amr.corp.intel.com> Add an API to allow retrieving an MCMemberRecord from the local cache based on an MGID. Signed-off-by: Sean Hefty --- This allows an existing MCMemberRecord to be used as a template for creating other multicast groups. --- svn3/gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_multicast.h 2006-05-25 11:18:47.000000000 -0700 +++ svn/gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_multicast.h 2006-05-23 14:58:06.000000000 -0700 @@ -82,4 +82,16 @@ struct ib_multicast *ib_join_multicast(s */ void ib_free_multicast(struct ib_multicast *multicast); +/** + * ib_get_mcmember_rec - Looks up a multicast member record by its MGID and + * returns it if found. + * @device: Device associated with the multicast group. + * @port_num: Port on the specified device to associate with the multicast + * group. + * @mgid: MGID of multicast group. + * @rec: Location to copy SA multicast member record. 
+ */ +int ib_get_mcmember_rec(struct ib_device *device, u8 port_num, + union ib_gid *mgid, struct ib_sa_mcmember_rec *rec); + #endif /* IB_MULTICAST_H */ --- svn3/gen2/trunk/src/linux-kernel/infiniband/core/multicast.c 2006-06-08 21:53:21.000000000 -0700 +++ svn/gen2/trunk/src/linux-kernel/infiniband/core/multicast.c 2006-06-08 17:14:01.000000000 -0700 @@ -612,6 +612,29 @@ void ib_free_multicast(struct ib_multica } EXPORT_SYMBOL(ib_free_multicast); +int ib_get_mcmember_rec(struct ib_device *device, u8 port_num, + union ib_gid *mgid, struct ib_sa_mcmember_rec *rec) +{ + struct mcast_device *dev; + struct mcast_port *port; + struct mcast_group *group; + unsigned long flags; + + dev = ib_get_client_data(device, &mcast_client); + if (!dev) + return -ENODEV; + + port = &dev->port[port_num - dev->start_port]; + spin_lock_irqsave(&port->lock, flags); + group = mcast_find(port, mgid); + if (group) + *rec = group->rec; + spin_unlock_irqrestore(&port->lock, flags); + + return group ? 0 : -EADDRNOTAVAIL; +} +EXPORT_SYMBOL(ib_get_mcmember_rec); + static void mcast_groups_lost(struct mcast_port *port) { struct mcast_group *group; From sean.hefty at intel.com Fri Jun 9 14:49:28 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 9 Jun 2006 14:49:28 -0700 Subject: [openib-general] [PATCH 3/5] sa_query: add call to initialize ah_attr from an mcmember record In-Reply-To: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com> Message-ID: <001001c68c0e$950f3370$ff0da8c0@amr.corp.intel.com> Export a call to initialize an ib_ah_attr structure based on an MCMemberRecord returned from a multicast join request. 
Signed-off-by: Sean Hefty --- --- svn3/gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_sa.h 2006-06-06 15:21:05.000000000 -0700 +++ svn/gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_sa.h 2006-06-06 15:56:37.000000000 -0700 @@ -380,6 +380,14 @@ int ib_init_ah_from_path(struct ib_devic struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr); + /** + * ib_init_ah_from_mcmember - Initialize address handle attributes based on an + * SA mcmember record. + */ +int ib_init_ah_from_mcmember(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + struct ib_ah_attr *ah_attr); + /** * ib_sa_pack_attr - Copy an SA attribute from a host defined structure to * a network packed structure. --- svn3/gen2/trunk/src/linux-kernel/infiniband/core/sa_query.c 2006-06-06 15:21:05.000000000 -0700 +++ svn/gen2/trunk/src/linux-kernel/infiniband/core/sa_query.c 2006-06-06 15:57:21.000000000 -0700 @@ -471,6 +471,36 @@ int ib_init_ah_from_path(struct ib_devic } EXPORT_SYMBOL(ib_init_ah_from_path); +int ib_init_ah_from_mcmember(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + struct ib_ah_attr *ah_attr) +{ + int ret; + u16 gid_index; + u8 p; + + ret = ib_find_cached_gid(device, &rec->port_gid, &p, &gid_index); + if (ret) + return ret; + + memset(ah_attr, 0, sizeof *ah_attr); + ah_attr->dlid = be16_to_cpu(rec->mlid); + ah_attr->sl = rec->sl; + ah_attr->port_num = port_num; + ah_attr->static_rate = rec->rate; + + ah_attr->ah_flags = IB_AH_GRH; + ah_attr->grh.dgid = rec->mgid; + + ah_attr->grh.sgid_index = (u8) gid_index; + ah_attr->grh.flow_label = be32_to_cpu(rec->flow_label); + ah_attr->grh.hop_limit = rec->hop_limit; + ah_attr->grh.traffic_class = rec->traffic_class; + + return 0; +} +EXPORT_SYMBOL(ib_init_ah_from_mcmember); + int ib_sa_pack_attr(void *dst, void *src, int attr_id) { switch (attr_id) { From ardavis at ichips.intel.com Fri Jun 9 14:54:36 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Fri, 09 Jun 2006 14:54:36 
-0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS In-Reply-To: References: <200606071639.03787.jackm@mellanox.co.il> <200606080942.48767.jackm@mellanox.co.il> Message-ID: <4489EE1C.1090200@ichips.intel.com> James Lentini wrote: >On Thu, 8 Jun 2006, Jack Morgenstein wrote: > > > >>On Wednesday 07 June 2006 18:26, James Lentini wrote: >> >> >>>On Wed, 7 Jun 2006, Jack Morgenstein wrote: >>> >>> >>>>This (bug fix) can still be included in next-week's release, if you >>>>think it is important (I have extracted it from the changes checked >>>>in at svn 7755) >>>> >>>> >>>If you are going to make another release anyway, then I would included >>>it. >>> >>> >>Do you mean -- include the fix in next week's release -- or -- wait >>with the fix for the following release? >> >> > >I'd include the fix in the next release, but I wouldn't create a >special release just for this fix. > > So are we getting this in next weeks release or not? I think we need it. 
>_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From halr at voltaire.com Fri Jun 9 15:01:15 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jun 2006 18:01:15 -0400 Subject: [openib-general] Failed multicast join withnew multicast module In-Reply-To: <4489E5C2.3090905@ichips.intel.com> References: <1149799142.4510.13468.camel@hal.voltaire.com> <44889E18.8010507@ichips.intel.com> <1149849791.4510.41634.camel@hal.voltaire.com> <4489A5ED.8060200@ichips.intel.com> <1149883948.5093.12572.camel@hal.voltaire.com> <4489DBAB.3080707@ichips.intel.com> <1149886425.5093.13966.camel@hal.voltaire.com> <4489E5C2.3090905@ichips.intel.com> Message-ID: <1149890474.5093.16250.camel@hal.voltaire.com> On Fri, 2006-06-09 at 17:18, Sean Hefty wrote: > Hal Rosenstock wrote: > > What does mesh mean in this instance ? How do you know the multicast > > routing tables are indeed valid and that the SM didn't corrupt them ? > > (Why did the SM need restarting ?) > > I meant that the values agree with each other, and there are no conflicts. How are conflicts determined ? The SA has no way of querying the end nodes for their multicast information; it currently is the other way around. > > The MLID is supplied by the SA in response to a group request from the > > end node, not the other way around. The end node doesn't tell the SA > > what MLID to use for a group. > > One of the ideas is for the end nodes to provide this data, even if that means > extending the architecture. OK. What if the SM already put the MLID to use for something else ? > The problem is that the SA lost its state, but the network is working fine. How does the SM know that the network is working fine ? > The end nodes know which groups they have joined and the mapping of MGIDs to MLIDs. 
> And the switches are already programmed correctly. I'm not sure what constitutes a correctness criterion here. > Even if we have the ability for an SM to transparently fail over to another SM, > because of the architecture, the end nodes are being forced to assume that all > multicast group information has been lost. In the case of an SM which replicated its database, it would replicate the registrations which include multicast so this reregistration shouldn't be necessary. But I don't know of a way that the end node knows whether the SM is doing this database replication. > How about this? What if the end nodes only re-joined their groups on LID_CHANGE > or CLIENT_REREGISTER events? That is, an SM_CHANGE would not result in clients > needing to rejoin any groups. This puts the burden on the SM to generate a > CLIENT_REREGISTER event only if it's needed. SMs that can fail over and > maintain multicast state in the process would be able to do so. I think more than this is needed. -- Hal > - Sean From jlentini at netapp.com Fri Jun 9 15:11:34 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 9 Jun 2006 18:11:34 -0400 (EDT) Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS In-Reply-To: <4489EE1C.1090200@ichips.intel.com> References: <200606071639.03787.jackm@mellanox.co.il> <200606080942.48767.jackm@mellanox.co.il> <4489EE1C.1090200@ichips.intel.com> Message-ID: On Fri, 9 Jun 2006, Arlin Davis wrote: > James Lentini wrote: > > > On Thu, 8 Jun 2006, Jack Morgenstein wrote: > > > > > > > On Wednesday 07 June 2006 18:26, James Lentini wrote: > > > > > > > On Wed, 7 Jun 2006, Jack Morgenstein wrote: > > > > > > > > > This (bug fix) can still be included in next-week's release, if you > > > > > think it is important (I have extracted it from the changes checked > > > > > in at svn 7755) > > > > > > > > > If you are going to make another release anyway, then I would included > > > > it. 
> > > > > > > Do you mean -- include the fix in next week's release -- or -- wait with > > > the fix for the following release? > > > > > > > I'd include the fix in the next release, but I wouldn't create a special > > release just for this fix. > > > So are we getting this in next weeks release or not? I think we need it. Tziporet, Will this be in this fix be in the next OFED release? From sean.hefty at intel.com Fri Jun 9 15:15:18 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 9 Jun 2006 15:15:18 -0700 Subject: [openib-general] [PATCH 4/5] rdma cm: add support to join / leave multicast groups In-Reply-To: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com> Message-ID: <001101c68c12$31021d80$ff0da8c0@amr.corp.intel.com> Add IB multicast abstraction to the CMA. Signed-off-by: Sean Hefty --- --- svn3/gen2/trunk/src/linux-kernel/infiniband/include/rdma/rdma_cm.h 2006-06-06 16:53:56.000000000 -0700 +++ svn/gen2/trunk/src/linux-kernel/infiniband/include/rdma/rdma_cm.h 2006-06-02 10:22:29.000000000 -0700 @@ -52,6 +52,8 @@ enum rdma_cm_event_type { RDMA_CM_EVENT_ESTABLISHED, RDMA_CM_EVENT_DISCONNECTED, RDMA_CM_EVENT_DEVICE_REMOVAL, + RDMA_CM_EVENT_MULTICAST_JOIN, + RDMA_CM_EVENT_MULTICAST_ERROR }; enum rdma_port_space { @@ -77,6 +79,13 @@ struct rdma_route { int num_paths; }; +struct rdma_multicast_data { + void *context; + struct sockaddr addr; + u8 pad[sizeof(struct sockaddr_in6) - + sizeof(struct sockaddr)]; +}; + struct rdma_cm_event { enum rdma_cm_event_type event; int status; @@ -258,5 +267,20 @@ int rdma_reject(struct rdma_cm_id *id, c */ int rdma_disconnect(struct rdma_cm_id *id); -#endif /* RDMA_CM_H */ +/** + * rdma_join_multicast - Join the multicast group specified by the given + * address. + * @id: Communication identifier associated with the request. + * @addr: Multicast address identifying the group to join. + * @context: User-defined context associated with the join request. 
+ */ +int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, + void *context); +/** + * rdma_leave_multicast - Leave the multicast group specified by the given + * address. + */ +void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr); + +#endif /* RDMA_CM_H */ --- svn3/gen2/trunk/src/linux-kernel/infiniband/core/cma.c 2006-06-06 19:30:12.000000000 -0700 +++ svn/gen2/trunk/src/linux-kernel/infiniband/core/cma.c 2006-06-06 16:12:42.000000000 -0700 @@ -43,6 +43,7 @@ #include #include #include +#include MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("Generic RDMA CM Agent"); @@ -111,6 +112,7 @@ struct rdma_id_private { struct list_head list; struct list_head listen_list; struct cma_device *cma_dev; + struct list_head mc_list; enum cma_state state; spinlock_t lock; @@ -137,6 +139,15 @@ struct rdma_id_private { u8 srq; }; +struct cma_multicast { + struct rdma_id_private *id_priv; + union { + struct ib_multicast *ib; + } multicast; + struct list_head list; + struct rdma_multicast_data data; +}; + struct cma_work { struct work_struct work; struct rdma_id_private *id; @@ -328,6 +339,7 @@ struct rdma_cm_id* rdma_create_id(rdma_c init_waitqueue_head(&id_priv->wait_remove); atomic_set(&id_priv->dev_remove, 0); INIT_LIST_HEAD(&id_priv->listen_list); + INIT_LIST_HEAD(&id_priv->mc_list); get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num); return &id_priv->id; @@ -474,6 +486,32 @@ int rdma_init_qp_attr(struct rdma_cm_id } EXPORT_SYMBOL(rdma_init_qp_attr); +static int cma_get_ib_mc_attr(struct rdma_id_private *id_priv, + struct sockaddr *addr, + struct ib_ah_attr *ah_attr, uint32_t *remote_qpn, + uint32_t *remote_qkey) +{ + struct cma_multicast *mc; + unsigned long flags; + int ret = -EADDRNOTAVAIL; + + spin_lock_irqsave(&id_priv->lock, flags); + list_for_each_entry(mc, &id_priv->mc_list, list) { + if (!memcmp(&mc->data.addr, addr, ip_addr_size(addr))) { + ib_init_ah_from_mcmember(id_priv->id.device, + id_priv->id.port_num, + 
&mc->multicast.ib->rec, + ah_attr); + *remote_qpn = 0xFFFFFF; + *remote_qkey = be32_to_cpu(mc->multicast.ib->rec.qkey); + ret = 0; + break; + } + } + spin_unlock_irqrestore(&id_priv->lock, flags); + return ret; +} + int rdma_get_dst_attr(struct rdma_cm_id *id, struct sockaddr *addr, struct ib_ah_attr *ah_attr, u32 *remote_qpn, u32 *remote_qkey) @@ -484,7 +522,10 @@ int rdma_get_dst_attr(struct rdma_cm_id id_priv = container_of(id, struct rdma_id_private, id); switch (rdma_node_get_transport(id_priv->id.device->node_type)) { case RDMA_TRANSPORT_IB: - if (!memcmp(&id->route.addr.dst_addr, addr, ip_addr_size(addr))) + ret = cma_get_ib_mc_attr(id_priv, addr, ah_attr, + remote_qpn, remote_qkey); + if (ret && id_priv->cm_id.ib && + !memcmp(&id->route.addr.dst_addr, addr, ip_addr_size(addr))) ret = ib_cm_get_dst_attr(id_priv->cm_id.ib, ah_attr, remote_qpn, remote_qkey); break; @@ -718,6 +759,19 @@ static void cma_release_port(struct rdma mutex_unlock(&lock); } +static void cma_leave_mc_groups(struct rdma_id_private *id_priv) +{ + struct cma_multicast *mc; + + while (!list_empty(&id_priv->mc_list)) { + mc = container_of(id_priv->mc_list.next, + struct cma_multicast, list); + list_del(&mc->list); + ib_free_multicast(mc->multicast.ib); + kfree(mc); + } +} + void rdma_destroy_id(struct rdma_cm_id *id) { struct rdma_id_private *id_priv; @@ -736,6 +790,7 @@ void rdma_destroy_id(struct rdma_cm_id * default: break; } + cma_leave_mc_groups(id_priv); mutex_lock(&lock); cma_detach_from_dev(id_priv); mutex_unlock(&lock); @@ -2053,6 +2108,150 @@ out: } EXPORT_SYMBOL(rdma_disconnect); +static int cma_ib_join_handler(int status, struct ib_multicast *multicast) +{ + struct rdma_id_private *id_priv; + struct cma_multicast *mc = multicast->context; + enum rdma_cm_event_type event; + int ret; + + id_priv = mc->id_priv; + atomic_inc(&id_priv->dev_remove); + if (!cma_comp(id_priv, CMA_ADDR_BOUND) && + !cma_comp(id_priv, CMA_ADDR_RESOLVED)) + goto out; + + if (!status && id_priv->id.qp) { + 
status = ib_attach_mcast(id_priv->id.qp, &multicast->rec.mgid, + multicast->rec.mlid); + } + + event = status ? RDMA_CM_EVENT_MULTICAST_ERROR : + RDMA_CM_EVENT_MULTICAST_JOIN; + + ret = cma_notify_user(id_priv, event, status, &mc->data, + sizeof mc->data); + if (ret) { + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + rdma_destroy_id(&id_priv->id); + return 0; + } +out: + cma_release_remove(id_priv); + return 0; +} + +static int cma_join_ib_multicast(struct rdma_id_private *id_priv, + struct cma_multicast *mc) +{ + struct ib_sa_mcmember_rec rec; + unsigned char mc_map[MAX_ADDR_LEN]; + struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr; + struct sockaddr_in *sin = (struct sockaddr_in *) &mc->data.addr; + ib_sa_comp_mask comp_mask; + int ret; + + ret = ib_get_mcmember_rec(id_priv->id.device, id_priv->id.port_num, + ib_addr_get_mgid(dev_addr), &rec); + if (ret) + return ret; + + ip_ib_mc_map(sin->sin_addr.s_addr, mc_map); + mc_map[7] = 0x01; /* Use RDMA CM signature */ + mc_map[8] = ib_addr_get_pkey(dev_addr) >> 8; + mc_map[9] = (unsigned char) ib_addr_get_pkey(dev_addr); + + rec.mgid = *(union ib_gid *) (mc_map + 4); + rec.port_gid = *ib_addr_get_sgid(dev_addr); + rec.pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr)); + rec.join_state = 1; + rec.qkey = sin->sin_addr.s_addr; + + comp_mask = IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | IB_SA_MCMEMBER_REC_JOIN_STATE | + IB_SA_MCMEMBER_REC_QKEY | IB_SA_MCMEMBER_REC_SL | + IB_SA_MCMEMBER_REC_FLOW_LABEL | + IB_SA_MCMEMBER_REC_TRAFFIC_CLASS; + + mc->multicast.ib = ib_join_multicast(id_priv->id.device, + id_priv->id.port_num, &rec, + comp_mask, GFP_KERNEL, + cma_ib_join_handler, mc); + if (IS_ERR(mc->multicast.ib)) + return PTR_ERR(mc->multicast.ib); + + return 0; +} + +int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, + void *context) +{ + struct rdma_id_private *id_priv; + struct cma_multicast *mc; + int ret; + + id_priv = 
container_of(id, struct rdma_id_private, id); + if (!cma_comp(id_priv, CMA_ADDR_BOUND) && + !cma_comp(id_priv, CMA_ADDR_RESOLVED)) + return -EINVAL; + + mc = kmalloc(sizeof *mc, GFP_KERNEL); + if (!mc) + return -ENOMEM; + + memcpy(&mc->data.addr, addr, ip_addr_size(addr)); + mc->data.context = context; + mc->id_priv = id_priv; + + spin_lock(&id_priv->lock); + list_add(&mc->list, &id_priv->mc_list); + spin_unlock(&id_priv->lock); + + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: + ret = cma_join_ib_multicast(id_priv, mc); + break; + default: + ret = -ENOSYS; + break; + } + + if (ret) { + spin_lock_irq(&id_priv->lock); + list_del(&mc->list); + spin_unlock_irq(&id_priv->lock); + kfree(mc); + } + return ret; +} +EXPORT_SYMBOL(rdma_join_multicast); + +void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr) +{ + struct rdma_id_private *id_priv; + struct cma_multicast *mc; + + id_priv = container_of(id, struct rdma_id_private, id); + spin_lock_irq(&id_priv->lock); + list_for_each_entry(mc, &id_priv->mc_list, list) { + if (!memcmp(&mc->data.addr, addr, ip_addr_size(addr))) { + list_del(&mc->list); + spin_unlock_irq(&id_priv->lock); + + if (id->qp) + ib_detach_mcast(id->qp, + &mc->multicast.ib->rec.mgid, + mc->multicast.ib->rec.mlid); + ib_free_multicast(mc->multicast.ib); + kfree(mc); + return; + } + } + spin_unlock_irq(&id_priv->lock); +} +EXPORT_SYMBOL(rdma_leave_multicast); + static void cma_add_one(struct ib_device *device) { struct cma_device *cma_dev; From sean.hefty at intel.com Fri Jun 9 15:16:28 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 9 Jun 2006 15:16:28 -0700 Subject: [openib-general] [PATCH 5/5] ucma: export multicast suport to userspace In-Reply-To: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com> Message-ID: <001201c68c12$5adc3a00$ff0da8c0@amr.corp.intel.com> Expose multicast abstraction through the CMA to userspace. 
Signed-off-by: Sean Hefty --- --- svn3/gen2/trunk/src/linux-kernel/infiniband/include/rdma/rdma_user_cm.h 2006-06-06 16:53:46.000000000 -0700 +++ svn/gen2/trunk/src/linux-kernel/infiniband/include/rdma/rdma_user_cm.h 2006-06-06 12:22:57.000000000 -0700 @@ -58,6 +58,8 @@ enum { RDMA_USER_CM_CMD_GET_EVENT, RDMA_USER_CM_CMD_GET_OPTION, RDMA_USER_CM_CMD_SET_OPTION, + RDMA_USER_CM_CMD_JOIN_MCAST, + RDMA_USER_CM_CMD_LEAVE_MCAST, RDMA_USER_CM_CMD_GET_DST_ATTR }; @@ -174,6 +176,17 @@ struct rdma_ucm_init_qp_attr { __u32 qp_state; }; +struct rdma_ucm_join_mcast { + __u32 id; + struct sockaddr_in6 addr; + __u64 uid; +}; + +struct rdma_ucm_leave_mcast { + __u32 id; + struct sockaddr_in6 addr; +}; + struct rdma_ucm_dst_attr_resp { __u32 remote_qpn; __u32 remote_qkey; --- svn3/gen2/trunk/src/linux-kernel/infiniband/core/ucma.c 2006-06-06 16:56:53.000000000 -0700 +++ svn/gen2/trunk/src/linux-kernel/infiniband/core/ucma.c 2006-06-01 17:48:42.000000000 -0700 @@ -167,6 +167,21 @@ error: return NULL; } +static void ucma_copy_multicast_data(struct ucma_context *ctx, + struct ucma_event *uevent, + struct rdma_cm_event *event) +{ + struct rdma_multicast_data *mc_data = event->private_data; + struct rdma_ucm_join_mcast *umc_data; + + umc_data = (struct rdma_ucm_join_mcast *) uevent->resp.private_data; + + uevent->resp.private_data_len = sizeof *umc_data; + umc_data->id = ctx->id; + memcpy(&umc_data->addr, &mc_data->addr, ip_addr_size(&mc_data->addr)); + umc_data->uid = (unsigned long) mc_data->context; +} + static int ucma_event_handler(struct rdma_cm_id *cm_id, struct rdma_cm_event *event) { @@ -184,9 +199,17 @@ static int ucma_event_handler(struct rdm uevent->resp.id = ctx->id; uevent->resp.event = event->event; uevent->resp.status = event->status; - if ((uevent->resp.private_data_len = event->private_data_len)) - memcpy(uevent->resp.private_data, event->private_data, - event->private_data_len); + switch (event->event) { + case RDMA_CM_EVENT_MULTICAST_JOIN: + case 
RDMA_CM_EVENT_MULTICAST_ERROR: + ucma_copy_multicast_data(ctx, uevent, event); + break; + default: + if ((uevent->resp.private_data_len = event->private_data_len)) + memcpy(uevent->resp.private_data, event->private_data, + event->private_data_len); + break; + } mutex_lock(&ctx->file->file_mutex); if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) { @@ -737,6 +760,45 @@ static ssize_t ucma_set_option(struct uc return ret; } +static ssize_t ucma_join_mcast(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_join_mcast cmd; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_join_multicast(ctx->cm_id, (struct sockaddr *) &cmd.addr, + (void *) (unsigned long) cmd.uid); + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_leave_mcast(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_leave_mcast cmd; + struct ucma_context *ctx; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + rdma_leave_multicast(ctx->cm_id, (struct sockaddr *) &cmd.addr); + ucma_put_ctx(ctx); + return 0; +} + static ssize_t ucma_get_dst_attr(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) @@ -789,6 +851,8 @@ static ssize_t (*ucma_cmd_table[])(struc [RDMA_USER_CM_CMD_GET_EVENT] = ucma_get_event, [RDMA_USER_CM_CMD_GET_OPTION] = ucma_get_option, [RDMA_USER_CM_CMD_SET_OPTION] = ucma_set_option, + [RDMA_USER_CM_CMD_JOIN_MCAST] = ucma_join_mcast, + [RDMA_USER_CM_CMD_LEAVE_MCAST] = ucma_leave_mcast, [RDMA_USER_CM_CMD_GET_DST_ATTR] = ucma_get_dst_attr }; From mshefty at ichips.intel.com Fri Jun 9 15:20:45 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 09 Jun 2006 15:20:45 -0700 Subject: [openib-general] [PATCH 0/5] 
multicast abstraction In-Reply-To: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com> References: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com> Message-ID: <4489F43D.60502@ichips.intel.com> Sean Hefty wrote: > This patch series enhances support for joining and leaving multicast groups, > providing the following functionality: > > 1. Users identify a multicast group by a multicast IP address. > 2. A user binds to a local RDMA device based on resolving the IP address. > 3. A new multicast group is created. The parameters for the multicast group are > obtained based on the ipoib broadcast group, and the MGID is derived using the > same algorithm as ipoib, except with a different signature. > 4. Any QP associated with the join is attached to the group once the join > operation completes. > 5. A QP may join multiple groups. I forgot to mention that this patch series is dependent on adding UD QP support to the RDMA CM. - Sean From arlin.r.davis at intel.com Fri Jun 9 15:37:34 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Fri, 9 Jun 2006 15:37:34 -0700 Subject: [openib-general] [PATCH] uDAPL openib_cma, cleanup reported CM error events, add TIMEOUT Message-ID: James, I cleaned up the connection error events to report the proper events during address resolution errors and timeouts. It was returning incorrect DAT event codes. 
-arlin Signed-off by: Arlin Davis Index: dapl_ib_cm.c =================================================================== --- dapl_ib_cm.c (revision 7839) +++ dapl_ib_cm.c (working copy) @@ -330,6 +330,8 @@ static void dapli_cm_active_cb(struct da switch (event->event) { case RDMA_CM_EVENT_UNREACHABLE: case RDMA_CM_EVENT_CONNECT_ERROR: + { + ib_cm_events_t cm_event; dapl_dbg_log( DAPL_DBG_TYPE_WARN, " dapli_cm_active_handler: CONN_ERR " @@ -337,10 +339,15 @@ static void dapli_cm_active_cb(struct da event->event, event->status, (event->status == -110)?"TIMEOUT":"" ); - dapl_evd_connection_callback(conn, - IB_CME_DESTINATION_UNREACHABLE, - NULL, conn->ep); + /* no device type specified so assume IB for now */ + if (event->status == -110) /* IB timeout */ + cm_event = IB_CME_TIMEOUT; + else + cm_event = IB_CME_DESTINATION_UNREACHABLE; + + dapl_evd_connection_callback(conn, cm_event, NULL, conn->ep); break; + } case RDMA_CM_EVENT_REJECTED: { ib_cm_events_t cm_event; @@ -357,7 +364,6 @@ static void dapli_cm_active_cb(struct da event->status); dapl_evd_connection_callback(conn, cm_event, NULL, conn->ep); - break; } case RDMA_CM_EVENT_ESTABLISHED: @@ -1028,7 +1034,7 @@ int dapls_ib_private_data_size(IN DAPL_P /* * Map all socket CM event codes to the DAT equivelent. 
*/ -#define DAPL_IB_EVENT_CNT 12 +#define DAPL_IB_EVENT_CNT 13 static struct ib_cm_event_map { @@ -1058,7 +1064,9 @@ static struct ib_cm_event_map /* 10 */ { IB_CME_LOCAL_FAILURE, DAT_CONNECTION_EVENT_BROKEN}, /* 11 */ { IB_CME_BROKEN, - DAT_CONNECTION_EVENT_BROKEN} + DAT_CONNECTION_EVENT_BROKEN}, + /* 12 */ { IB_CME_TIMEOUT, + DAT_CONNECTION_EVENT_TIMED_OUT}, }; /* @@ -1164,7 +1172,7 @@ void dapli_cma_event_cb(void) case RDMA_CM_EVENT_ADDR_ERROR: case RDMA_CM_EVENT_ROUTE_ERROR: dapl_evd_connection_callback(conn, - IB_CME_LOCAL_FAILURE, + IB_CME_DESTINATION_UNREACHABLE, NULL, conn->ep); break; case RDMA_CM_EVENT_DEVICE_REMOVAL: Index: dapl_ib_util.h =================================================================== --- dapl_ib_util.h (revision 7839) +++ dapl_ib_util.h (working copy) @@ -86,7 +86,8 @@ typedef enum { IB_CME_DESTINATION_UNREACHABLE, IB_CME_TOO_MANY_CONNECTION_REQUESTS, IB_CME_LOCAL_FAILURE, - IB_CME_BROKEN + IB_CME_BROKEN, + IB_CME_TIMEOUT } ib_cm_events_t; /* CQ notifications */ From betsy at pathscale.com Fri Jun 9 15:50:02 2006 From: betsy at pathscale.com (Betsy Zeller) Date: Fri, 09 Jun 2006 15:50:02 -0700 Subject: [openib-general] [openfabrics-ewg] OFED-1.0-rc6 is available In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0007ED438C@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0007ED438C@orsmsx408> Message-ID: <1149893403.3034.55.camel@sarium.pathscale.com> Woody - The short answer is yes - Bryan has created patches in the subversion tree, which will install on top of what Tziporet pulled from Roland's tree. These will be in the 1.0 release (and, we will be testing an early version of that on Monday). We've tested the ipath driver code pretty thoroughly in-house. Bryan will send you a pointer to a tarball with these changes, so you can try them out today. He's planning to have those to you before 4:30. 
- Betsy On Fri, 2006-06-09 at 13:21 -0700, Woodruff, Robert J wrote: > Is there any plan to release an RC6 package (or an RC7) that has a > Pathscale driver that > compiles on RHEL4 - U3 that we can test before the release ? > > woody > > > > ______________________________________________________________________ > From: openfabrics-ewg-bounces at openib.org > [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of Tziporet > Koren > Sent: Wednesday, June 07, 2006 7:59 AM > To: Tziporet Koren; openfabrics-ewg at openib.org > Cc: openib-general > Subject: [openfabrics-ewg] OFED-1.0-rc6 is available > > > > Hi All, > > > > We have prepared OFED 1.0 RC6. > > Release location: > https://openib.org/svn/gen2/branches/1.0/ofed/releases > > File: OFED-1.0-rc6.tgz > > > > Note: This release is the code freeze release for OFED 1.0. Only > showstopper bugs will be fixed. > > > > BUILD_ID: > > OFED-1.0-rc6 > > > > openib-1.0 (REV=7772) > > # User space > > https://openib.org/svn/gen2/branches/1.0/src/userspace > > # Kernel space > > https://openib.org/svn/gen2/branches/1.0/ofed/tags/rc6/linux-kernel > > Git: > > ref: refs/heads/for-2.6.17 > > commit d9ec5ad24ce80b7ef69a0717363db661d13aada5 > > > > # MPI > > mpi_osu-0.9.7-mlx2.1.0.tgz > > openmpi-1.1b1-1.src.rpm > > mpitests-1.0-0.src.rpm > > > > OSes: > > * RH EL4 up2: 2.6.9-22.ELsmp > > * RH EL4 up3: 2.6.9-34.ELsmp > > * Fedora C4: 2.6.11-1.1369_FC4 > > * SLES10 RC2: 2.6.16.16-1.6-smp > > * SUSE 10 Pro: 2.6.13-15-smp > > * kernel.org: 2.6.16.x > > > > Systems: > > * x86_64 > > * x86 > > * ia64 > > * ppc64 > > > > Main changes from RC5: > > 1. SDP – libsdp implementation of RFC proposed by Eitan Zahavi; > bug fixes in kernel module. See details below. > > 2. SRP – bug fixes > > 3. Open MPI – new package based on 1.1b1-1 > > 4. OSU-MPI – See details below. > > 5. iSER: Enhanced to support SLES 10 RC1. > > 6. IPoIB default configuration changed: > > a. IPoIB configuration at install time is now optional. > > b. 
The default configuration of IPoIB interfaces (if performed > at install time) is DHCP; it can be changed during interactive > installation. > > c. For unattended installation one can give a new configuration > file. See the example below. > > 7. Bug Fixes. > > > > > > Package limitations: > > 1. The ipath driver does not compile/load on most systems. To be > fixed in final release. > Meanwhile, one must work with custom build and not choose ipath > driver, or change in the conf file: ib_ipath=n. > I attached a reference ofed-no_ipath.conf file. > Once Qlogic fixes the backport patches I will publish them on the > release page so any one interested can use them with this release. > > 2. iSER is working on SuSE SLES 10 RC1 only > > > > > > IPoIB configuration file example: > > If you are going to install OFED on a 32 node cluster and want to use > static IPoIB configuration based on Ethernet device configuration > follow instructions below: > > > > Assume that the Ethernet IP addresses (eth0 interfaces) of the cluster > are: 10.0.0.1 - 10.0.0.32 > > and you want to assign to ib0 IP addresses in the range: 192.168.0.1 - > 192.168.0.32 > > and to ib1 IP addresses in the range: 172.16.0.1 - 172.16.0.32 > > > > Then create the file ofed_net.conf with the following lines: > > > > LAN_INTERFACE_ib0=eth0 > IPADDR_ib0=192.168.'*'.'*' > NETMASK_ib0=255.255.0.0 > NETWORK_ib0=192.168.0.0 > BROADCAST_ib0=192.168.255.255 > ONBOOT_ib0=1 > LAN_INTERFACE_ib1=eth0 > IPADDR_ib1=172.16.'*'.'*' > NETMASK_ib1=255.255.0.0 > NETWORK_ib1=172.16.0.0 > BROADCAST_ib1=172.16.255.255 > ONBOOT_ib1=1 > > > > Note: ‘*’ will be replaced by the corresponding octet from the eth0 IP > address. 
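The wildcard substitution the note describes can be illustrated with a short sketch. This models the stated rule (each `'*'` in `IPADDR_ib0` takes the matching octet of the eth0 address); `make_ib_addr` is a hypothetical helper, not the installer's actual code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fill the wildcard octets of a template like IPADDR_ib0=192.168.'*'.'*'
 * from the corresponding octets of the eth0 address.  Illustrative only. */
static void make_ib_addr(const char *eth_ip, const char *prefix,
			 char *out, size_t len)
{
	unsigned a, b, c, d;

	if (sscanf(eth_ip, "%u.%u.%u.%u", &a, &b, &c, &d) == 4)
		snprintf(out, len, "%s.%u.%u", prefix, c, d);
	else
		out[0] = '\0';	/* not a dotted-quad address */
}
```

With the example ranges above, a node whose eth0 is 10.0.0.7 would get ib0 address 192.168.0.7.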
> > > > Assuming that you already have OFED configuration file (ofed.conf) > with selected packages (created by running OFED-1.0/install.sh) > > Run: ./install.sh -c ofed.conf -net ofed_net.conf > > > > > > > > OSU MPI: > > · Added mpi_alltoall fine tuning parameters > > · Added default configuration/documentation file > $MPIHOME/etc/mvapich.conf > > · Added shell configuration files $MPIHOME/etc/mvapich.csh , > $MPIHOME/etc/mvapich.csh > > · Default MTU was changed back to 2K for InfiniHost III Ex and > InfiniHost III Lx HCAs. For InfiniHost card recommended value is: > VIADEV_DEFAULT_MTU=MTU1024 > > > > > > SDP Details: > > libsdp enhancements according to the RFC: > > 1. New config syntax (please see libsdp.conf) > 2. With no config or empty config use SIMPLE_LIBSDP mode > 3. Support listening on both tcp and sdp > 4. Support trying both connections (first SDP then TCP) > 5. Support IPv4 embedded in IPv6 (also convert back address) > 6. Comprehensive verbosity logging > 7. BNF based config parser > > > > Current SDP limitations: > > · SDP currently does not support sending/receiving out of band > data (MSG_OOB). > > · Generally, SDP supports only SOL_SOCKET socket options. > > · The following options can be set but actual support is > missing: > > o SO_KEEPALIVE - no keepalives are sent > > o SO_OOBINLINE - out of band data is not supported > > o SDP currently supports setting the following SOL_TCP socket > options: > > o TCP_NODELAY, TCP_CORK - but actual support for these options > is still missing > > · SDP currently does not handle Zcopy mode messages correctly > and does not set MaxAdverts properly in HH/HAH messages. > > > > > > OFED components tested by Mellanox: > > * Verbs over mthca > * IPoIB > * OpenSM > * OSU-MPI > * SRP > * SDP > * IB administration utils (ibutils) > > > > > > Please send us any issues you encounter and/or test results. 
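The SDP limitations above note that `TCP_NODELAY` and `TCP_CORK` can be *set* even though SDP does not yet honor them. Setting the option is the ordinary `setsockopt()` sequence, sketched here on a plain TCP socket (`IPPROTO_TCP` is the portable spelling of the Linux-specific `SOL_TCP` level; `set_nodelay` is an illustrative helper):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle coalescing on a stream socket.  On an SDP socket the
 * call succeeds per the notes above, but the transport ignores it. */
static int set_nodelay(int fd)
{
	int one = 1;

	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}
```

An application written this way keeps working unchanged when `libsdp` redirects its sockets to SDP, which is exactly why SDP accepts-but-ignores these options rather than failing the call.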
> > > > Thanks > > Tziporet & Vlad > > > > > > Tziporet Koren > > Software Director > > Mellanox Technologies > > mailto: tziporet at mellanox.co.il > Tel +972-4-9097200, ext 380 > > > > From robert.j.woodruff at intel.com Fri Jun 9 16:13:32 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 9 Jun 2006 16:13:32 -0700 Subject: [openib-general] [openfabrics-ewg] OFED-1.0-rc6 is available Message-ID: <1AC79F16F5C5284499BB9591B33D6F0007ED46E6@orsmsx408> Betsy wrote, >Woody - The short answer is yes - Bryan has created patches in >the subversion tree, which will install on top of what Tziporet >pulled from Roland's tree. These will be in the 1.0 release (and, >we will be testing an early version of that on Monday). We've >tested the ipath driver code pretty thoroughly in-house. Thanks, I probably cannot get to it until Monday, as I needed to pull the pathscale cards from my OFED test systems for now so I can test RC6 with the Mellanox DDR cards over the weekend, but I should be able to get back to this on Monday. Will there be an RC candidate tarball that has the patches included? woody From bos at pathscale.com Fri Jun 9 16:20:36 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 09 Jun 2006 16:20:36 -0700 Subject: [openib-general] OFED 1.0-rc6 tarball available with working ipath driver Message-ID: <1149895236.27921.2.camel@pelerin.serpentine.com> Due to unfortunate timing, the ipath driver in OFED 1.0-rc6 does not work correctly. You can download an updated tarball from here, for which the ipath driver works fine: http://openib.red-bean.com/OFED-1.0-rc6+ipath.tar.bz2 Alternatively, pull the necessary patches from SVN. http://openib.org/bugzilla/show_bug.cgi?id=9 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |CLOSED ------- Comment #2 from sweitzen at cisco.com 2006-06-09 22:41 ------- Close old bug. 
------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Fri Jun 9 22:41:23 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Fri, 9 Jun 2006 22:41:23 -0700 (PDT) Subject: [openib-general] [Bug 10] [CHECKER] NULL deref in drivers/infiniband/core/ucm.c:ib_ucm_event_process Message-ID: <20060610054123.AEBDF2287AD@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=10 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |CLOSED ------- Comment #2 from sweitzen at cisco.com 2006-06-09 22:41 ------- Close INVALID bug. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Fri Jun 9 22:40:41 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Fri, 9 Jun 2006 22:40:41 -0700 (PDT) Subject: [openib-general] [Bug 8] [CHECKER] Leak in drivers/infiniband/core/sysfs.c:alloc_group_attrs Message-ID: <20060610054041.13F752287AB@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=8 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |CLOSED ------- Comment #3 from sweitzen at cisco.com 2006-06-09 22:40 ------- Close old bug. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at openib.org Fri Jun 9 22:41:42 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Fri, 9 Jun 2006 22:41:42 -0700 (PDT) Subject: [openib-general] [Bug 11] [CHECKER] Return value of idr_find not checked for NULL Message-ID: <20060610054142.B0B932287AE@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=11 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |CLOSED ------- Comment #3 from sweitzen at cisco.com 2006-06-09 22:41 ------- Close INVALID bug. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Fri Jun 9 22:42:05 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Fri, 9 Jun 2006 22:42:05 -0700 (PDT) Subject: [openib-general] [Bug 12] [CHECKER] drivers/infiniband/ulp/ipoib/ipoib_main.c: confusion over NULL pointer Message-ID: <20060610054205.4FD842287AF@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=12 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |CLOSED ------- Comment #2 from sweitzen at cisco.com 2006-06-09 22:42 ------- Close INVALID bug. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at openib.org Fri Jun 9 22:44:46 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Fri, 9 Jun 2006 22:44:46 -0700 (PDT) Subject: [openib-general] [Bug 17] [CHECKER] NULL deref in drivers/infiniband/ulp/srp/ib_srp.c Message-ID: <20060610054446.C0B3D2287AB@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=17 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |CLOSED ------- Comment #2 from sweitzen at cisco.com 2006-06-09 22:44 ------- Close INVALID bug. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eitan at mellanox.co.il Sat Jun 10 10:12:45 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 10 Jun 2006 20:12:45 +0300 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query In-Reply-To: <1149771197.4510.323092.camel@hal.voltaire.com> References: <86fyiflwks.fsf@mtl066.yok.mtl.com> <1149771197.4510.323092.camel@hal.voltaire.com> Message-ID: <448AFD8D.3030809@mellanox.co.il> Hal Rosenstock wrote: > Hi Eitan, > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > >>Hi Hal >> >>I'm working on passing osmtest check. Found a bug in the new >>GUIDInfoRecord query: If you had a physical port with zero guid_cap >>the code would loop on blocks 0..255 instead of trying the next port. > > > OK; that's definitely a problem. > > >>I am still looking for why we might have a guid_cap == 0 on some >>ports. > > > PortInfo:GuidCap is not used for switch external ports. > > >>This patch resolves this new problem. osmtest passes on some arbitrary >>networks. 
>> >>Eitan >> >>Signed-off-by: Eitan Zahavi >> >>Index: opensm/osm_sa_guidinfo_record.c >>=================================================================== >>--- opensm/osm_sa_guidinfo_record.c (revision 7703) >>+++ opensm/osm_sa_guidinfo_record.c (working copy) >>@@ -255,6 +255,10 @@ __osm_sa_gir_create_gir( >> continue; >> >> p_pi = osm_physp_get_port_info_ptr( p_physp ); >>+ >>+ if ( p_pi->guid_cap == 0 ) >>+ continue; >>+ > > > I think the right fix is to detect switch external ports and use the > VLCap from port 0 rather than from the switch external port (unless that > concept is broken in which case it should return 0 records). I think switch external ports do not have any PortGUID assigned to them since they are not "end port" (i.e. addressable). So I think this patch is good enough. What if a port reports guid_cap == 0? (I understand it is illegal for addressable port but for the SM it is probably better not to assume all ports are legal...) EZ > > -- Hal > > >> num_blocks = p_pi->guid_cap / 8; >> if ( p_pi->guid_cap % 8 ) >> num_blocks++; >> From bpradip at in.ibm.com Sat Jun 10 10:34:27 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Sat, 10 Jun 2006 23:04:27 +0530 Subject: [openib-general] [PATCH] rping: Erroneous check for minumum ping buffer size Message-ID: <20060610173417.GA14280@harry-potter.ibm.com> This includes the changes suggested by Tom. 
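Pradipta's patch below sizes the minimum buffer with a two-level stringize macro. A standalone sketch of that idiom — `ANSWER` is an illustrative macro added here, while `_stringify`/`stringify` and the format string mirror the patch:

```c
#include <limits.h>
#include <string.h>

/* Two-level stringize, as in the rping patch: the outer macro expands
 * its argument first, so the inner `#` sees the expansion (e.g. the
 * numeric text of INT_MAX) rather than the macro name itself. */
#define _stringify(_x) #_x
#define stringify(_x)  _stringify(_x)

#define ANSWER 42			/* illustrative, not from the patch */
#define MSG_FMT "rdma-ping-%d: "

/* Minimum buffer: the format string plus the longest decimal an int
 * can print, both measured with sizeof on string literals. */
#define MIN_BUFSIZE (sizeof(stringify(INT_MAX)) + sizeof(MSG_FMT))
```

Without the indirection, `_stringify(INT_MAX)` would yield the 8-byte literal `"INT_MAX"` instead of the text of the limit, which is why the patch needs both levels.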
Signed-off-by: Pradipta Kumar Banerjee --- Index: rping.c ================================================================= --- rping.org 2006-06-09 10:57:43.000000000 +0530 +++ rping.c.new 2006-06-10 22:48:53.000000000 +0530 @@ -96,6 +96,15 @@ struct rping_rdma_info { #define RPING_BUFSIZE 64*1024 #define RPING_SQ_DEPTH 16 +/* Default string for print data and + * minimum buffer size + */ +#define _stringify( _x ) # _x +#define stringify( _x ) _stringify( _x ) + +#define RPING_MSG_FMT "rdma-ping-%d: " +#define RPING_MIN_BUFSIZE sizeof(stringify(INT_MAX)) + sizeof(RPING_MSG_FMT) + /* * Control block struct. */ @@ -774,7 +783,7 @@ static void rping_test_client(struct rpi cb->state = RDMA_READ_ADV; /* Put some ascii text in the buffer. */ - cc = sprintf(cb->start_buf, "rdma-ping-%d: ", ping); + cc = sprintf(cb->start_buf, RPING_MSG_FMT, ping); for (i = cc, c = start; i < cb->size; i++) { cb->start_buf[i] = c; c++; @@ -977,11 +986,11 @@ int main(int argc, char *argv[]) break; case 'S': cb->size = atoi(optarg); - if ((cb->size < 1) || + if ((cb->size < RPING_MIN_BUFSIZE) || (cb->size > (RPING_BUFSIZE - 1))) { fprintf(stderr, "Invalid size %d " - "(valid range is 1 to %d)\n", - cb->size, RPING_BUFSIZE); + "(valid range is %d to %d)\n", + cb->size, RPING_MIN_BUFSIZE, RPING_BUFSIZE); ret = EINVAL; } else DEBUG_LOG("size %d\n", (int) atoi(optarg)); From halr at voltaire.com Sat Jun 10 11:11:21 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Jun 2006 14:11:21 -0400 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query In-Reply-To: <448AFD8D.3030809@mellanox.co.il> References: <86fyiflwks.fsf@mtl066.yok.mtl.com> <1149771197.4510.323092.camel@hal.voltaire.com> <448AFD8D.3030809@mellanox.co.il> Message-ID: <1149963035.5093.58165.camel@hal.voltaire.com> Hi Eitan, On Sat, 2006-06-10 at 13:12, Eitan Zahavi wrote: > Hal Rosenstock wrote: > > Hi Eitan, > > > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > > > >>Hi Hal > >> > >>I'm 
working on passing osmtest check. Found a bug in the new > >>GUIDInfoRecord query: If you had a physical port with zero guid_cap > >>the code would loop on blocks 0..255 instead of trying the next port. > > > > > > OK; that's definitely a problem. > > > > > >>I am still looking for why we might have a guid_cap == 0 on some > >>ports. > > > > > > PortInfo:GuidCap is not used for switch external ports. > > > > > >>This patch resolves this new problem. osmtest passes on some arbitrary > >>networks. > >> > >>Eitan > >> > >>Signed-off-by: Eitan Zahavi > >> > >>Index: opensm/osm_sa_guidinfo_record.c > >>=================================================================== > >>--- opensm/osm_sa_guidinfo_record.c (revision 7703) > >>+++ opensm/osm_sa_guidinfo_record.c (working copy) > >>@@ -255,6 +255,10 @@ __osm_sa_gir_create_gir( > >> continue; > >> > >> p_pi = osm_physp_get_port_info_ptr( p_physp ); > >>+ > >>+ if ( p_pi->guid_cap == 0 ) > >>+ continue; > >>+ > > > > > > I think the right fix is to detect switch external ports and use the > > VLCap from port 0 rather than from the switch external port (unless that > > concept is broken in which case it should return 0 records). > I think switch external ports do not have any PortGUID assigned to them since > they are not "end port" (i.e. addressable). Right; that's what I said earlier in a different way (PortGUID is not used for switch external ports). > So I think this patch is good enough. I think its better (an improvement) but not a complete fix for this issue. > What if a port reports guid_cap == 0? Is that legal ? Shouldn't any port where GUIDCap is valid have a non zero GUIDCap ? On any port where GUIDCap is not used (e.g. invalid), it should be ignored. > (I understand it is illegal for addressable port > but for the SM it is probably better not to assume all ports are legal...) That's my point on what a complete fix for this would include. 
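The computation being discussed is the rounding of `guid_cap` up to whole 8-GUID GUIDInfo blocks, with Eitan's zero-cap guard in front of it. A minimal sketch (function name is illustrative; the arithmetic matches the quoted `num_blocks` lines):

```c
#include <stdint.h>

/* Blocks needed for guid_cap GUIDs at 8 GUIDs per GUIDInfo block.
 * A port reporting guid_cap == 0 yields no blocks at all -- the bug
 * was that such a port was iterated over blocks 0..255 instead. */
static unsigned guidinfo_num_blocks(uint8_t guid_cap)
{
	if (guid_cap == 0)
		return 0;	/* skip the port, as the patch does */
	return guid_cap / 8 + (guid_cap % 8 ? 1 : 0);
}
```

The guard is the "simple fix"; the "complete fix" Hal describes would additionally recognize switch external ports (where GUIDCap is not meaningful) rather than trusting whatever value they report.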
-- Hal > EZ > > > > -- Hal > > > > > >> num_blocks = p_pi->guid_cap / 8; > >> if ( p_pi->guid_cap % 8 ) > >> num_blocks++; > >> > From eitan at mellanox.co.il Sat Jun 10 14:02:47 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 11 Jun 2006 00:02:47 +0300 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236881B@mtlexch01.mtl.com> Hi Hal, When is a complete fix expected? Meanwhile, osmtest on a large enough cluster is not passing due to the huge number of GUID blocks... If this full fix is not anticipated soon, can we have the simple fix applied first? Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Saturday, June 10, 2006 9:11 PM > To: Eitan Zahavi > Cc: OPENIB > Subject: Re: [PATCH] osm: fix num of blocks of GUIDInfo GetTable query > > Hi Eitan, > > On Sat, 2006-06-10 at 13:12, Eitan Zahavi wrote: > > Hal Rosenstock wrote: > > > Hi Eitan, > > > > > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > > > > > >>Hi Hal > > >> > > >>I'm working on passing osmtest check. Found a bug in the new > > >>GUIDInfoRecord query: If you had a physical port with zero guid_cap > > >>the code would loop on blocks 0..255 instead of trying the next port. > > > > > > > > > OK; that's definitely a problem. > > > > > > > > >>I am still looking for why we might have a guid_cap == 0 on some > > >>ports. 
> > >> > > >>Eitan > > >> > > >>Signed-off-by: Eitan Zahavi > > >> > > >>Index: opensm/osm_sa_guidinfo_record.c > > >>=================================================================== > > >>--- opensm/osm_sa_guidinfo_record.c (revision 7703) > > >>+++ opensm/osm_sa_guidinfo_record.c (working copy) > > >>@@ -255,6 +255,10 @@ __osm_sa_gir_create_gir( > > >> continue; > > >> > > >> p_pi = osm_physp_get_port_info_ptr( p_physp ); > > >>+ > > >>+ if ( p_pi->guid_cap == 0 ) > > >>+ continue; > > >>+ > > > > > > > > > I think the right fix is to detect switch external ports and use the > > > VLCap from port 0 rather than from the switch external port (unless that > > > concept is broken in which case it should return 0 records). > > I think switch external ports do not have any PortGUID assigned to them since > > they are not "end port" (i.e. addressable). > > Right; that's what I said earlier in a different way (PortGUID is not > used for switch external ports). > > > So I think this patch is good enough. > > I think its better (an improvement) but not a complete fix for this > issue. > > > What if a port reports guid_cap == 0? > > Is that legal ? Shouldn't any port where GUIDCap is valid have a non > zero GUIDCap ? On any port where GUIDCap is not used (e.g. invalid), it > should be ignored. > > > (I understand it is illegal for addressable port > > but for the SM it is probably better not to assume all ports are legal...) > > That's my point on what a complete fix for this would include. 
> > -- Hal > > > EZ > > > > > > -- Hal > > > > > > > > >> num_blocks = p_pi->guid_cap / 8; > > >> if ( p_pi->guid_cap % 8 ) > > >> num_blocks++; > > >> > > From halr at voltaire.com Sat Jun 10 14:07:33 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Jun 2006 17:07:33 -0400 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30236881B@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30236881B@mtlexch01.mtl.com> Message-ID: <1149973652.5093.64803.camel@hal.voltaire.com> On Sat, 2006-06-10 at 17:02, Eitan Zahavi wrote: > Hi Hal, > > When is a complete fix expected? > Meanwhile osmtest on large enough cluster is not passing due to the huge > number of GUID blocks... > > If this full fix not anticipated soon can we have the simple fix applied > first? Sure. Let me know if this is also needed on the 1.0 branch. -- Hal > Eitan Zahavi > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Saturday, June 10, 2006 9:11 PM > > To: Eitan Zahavi > > Cc: OPENIB > > Subject: Re: [PATCH] osm: fix num of blocks of GUIDInfo GetTable query > > > > Hi Eitan, > > > > On Sat, 2006-06-10 at 13:12, Eitan Zahavi wrote: > > > Hal Rosenstock wrote: > > > > Hi Eitan, > > > > > > > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > > > > > > > >>Hi Hal > > > >> > > > >>I'm working on passing osmtest check. Found a bug in the new > > > >>GUIDInfoRecord query: If you had a physical port with zero > guid_cap > > > >>the code would loop on blocks 0..255 instead of trying the next > port. > > > > > > > > > > > > OK; that's definitely a problem. > > > > > > > > > > > >>I am still looking for why we might have a guid_cap == 0 on some > > > >>ports. 
> > > > > > > > > > > > PortInfo:GuidCap is not used for switch external ports. > > > > > > > > > > > >>This patch resolves this new problem. osmtest passes on some > arbitrary > > > >>networks. > > > >> > > > >>Eitan > > > >> > > > >>Signed-off-by: Eitan Zahavi > > > >> > > > >>Index: opensm/osm_sa_guidinfo_record.c > > > > >>=================================================================== > > > >>--- opensm/osm_sa_guidinfo_record.c (revision 7703) > > > >>+++ opensm/osm_sa_guidinfo_record.c (working copy) > > > >>@@ -255,6 +255,10 @@ __osm_sa_gir_create_gir( > > > >> continue; > > > >> > > > >> p_pi = osm_physp_get_port_info_ptr( p_physp ); > > > >>+ > > > >>+ if ( p_pi->guid_cap == 0 ) > > > >>+ continue; > > > >>+ > > > > > > > > > > > > I think the right fix is to detect switch external ports and use > the > > > > VLCap from port 0 rather than from the switch external port > (unless that > > > > concept is broken in which case it should return 0 records). > > > I think switch external ports do not have any PortGUID assigned to > them since > > > they are not "end port" (i.e. addressable). > > > > Right; that's what I said earlier in a different way (PortGUID is not > > used for switch external ports). > > > > > So I think this patch is good enough. > > > > I think its better (an improvement) but not a complete fix for this > > issue. > > > > > What if a port reports guid_cap == 0? > > > > Is that legal ? Shouldn't any port where GUIDCap is valid have a non > > zero GUIDCap ? On any port where GUIDCap is not used (e.g. invalid), > it > > should be ignored. > > > > > (I understand it is illegal for addressable port > > > but for the SM it is probably better not to assume all ports are > legal...) > > > > That's my point on what a complete fix for this would include. 
> > > > -- Hal > > > > > EZ > > > > > > > > -- Hal > > > > > > > > > > > >> num_blocks = p_pi->guid_cap / 8; > > > >> if ( p_pi->guid_cap % 8 ) > > > >> num_blocks++; > > > >> > > > From halr at voltaire.com Sat Jun 10 14:21:36 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Jun 2006 17:21:36 -0400 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query In-Reply-To: <86fyiflwks.fsf@mtl066.yok.mtl.com> References: <86fyiflwks.fsf@mtl066.yok.mtl.com> Message-ID: <1149974496.5093.65332.camel@hal.voltaire.com> Eitan, On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > Hi Hal > > I'm working on passing osmtest check. Found a bug in the new > GUIDInfoRecord query: If you had a physical port with zero guid_cap > the code would loop on blocks 0..255 instead of trying the next port. > > I am still looking for why we might have a guid_cap == 0 on some > ports. > > This patch resolves this new problem. osmtest passes on some arbitrary > networks. > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied to trunk only. Let me know if it also should be applied to 1.0. -- Hal From tom at opengridcomputing.com Sat Jun 10 15:22:59 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Sat, 10 Jun 2006 17:22:59 -0500 Subject: [openib-general] [PATCH] rping: Erroneous check for minumum ping buffer size In-Reply-To: <20060610173417.GA14280@harry-potter.ibm.com> References: <20060610173417.GA14280@harry-potter.ibm.com> Message-ID: <1149978179.7311.29.camel@trinity.ogc.int> Thanks Pradipta, I'll apply test, and check these in. Tom. On Sat, 2006-06-10 at 23:04 +0530, Pradipta Kumar Banerjee wrote: > This includes the changes suggested by Tom. 
> > Signed-off-by: Pradipta Kumar Banerjee > --- > > Index: rping.c > ================================================================= > --- rping.org 2006-06-09 10:57:43.000000000 +0530 > +++ rping.c.new 2006-06-10 22:48:53.000000000 +0530 > @@ -96,6 +96,15 @@ struct rping_rdma_info { > #define RPING_BUFSIZE 64*1024 > #define RPING_SQ_DEPTH 16 > > +/* Default string for print data and > + * minimum buffer size > + */ > +#define _stringify( _x ) # _x > +#define stringify( _x ) _stringify( _x ) > + > +#define RPING_MSG_FMT "rdma-ping-%d: " > +#define RPING_MIN_BUFSIZE sizeof(stringify(INT_MAX)) + sizeof(RPING_MSG_FMT) > + > /* > * Control block struct. > */ > @@ -774,7 +783,7 @@ static void rping_test_client(struct rpi > cb->state = RDMA_READ_ADV; > > /* Put some ascii text in the buffer. */ > - cc = sprintf(cb->start_buf, "rdma-ping-%d: ", ping); > + cc = sprintf(cb->start_buf, RPING_MSG_FMT, ping); > for (i = cc, c = start; i < cb->size; i++) { > cb->start_buf[i] = c; > c++; > @@ -977,11 +986,11 @@ int main(int argc, char *argv[]) > break; > case 'S': > cb->size = atoi(optarg); > - if ((cb->size < 1) || > + if ((cb->size < RPING_MIN_BUFSIZE) || > (cb->size > (RPING_BUFSIZE - 1))) { > fprintf(stderr, "Invalid size %d " > - "(valid range is 1 to %d)\n", > - cb->size, RPING_BUFSIZE); > + "(valid range is %d to %d)\n", > + cb->size, RPING_MIN_BUFSIZE, RPING_BUFSIZE); > ret = EINVAL; > } else > DEBUG_LOG("size %d\n", (int) atoi(optarg)); From sashak at voltaire.com Sat Jun 10 17:27:58 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 11 Jun 2006 03:27:58 +0300 Subject: [openib-general] [PATCH 0/4] opensm: Loading unicast routes from the file Message-ID: <20060611002758.22430.63061.stgit@sashak.voltaire.com> Hi, There are couple of unicast routing related patches for OpenSM. Basically it implements routing module which provides possibility to load switch forwarding tables from pre-created dump file. 
Currently only unicast table loading is supported; multicast may be added in the future. Short patch descriptions (more details may be found in the emails with the patches): 1. Ucast dump file simplification. 2. Modular routing - a preliminary, generic model for plugging new routing engines into OpenSM. 3. A new simple unicast routing engine that loads LFTs from a pre-created dump file. 4. Example of a ucast dump generation script. Please comment and test. Thanks. Sasha From sashak at voltaire.com Sat Jun 10 17:32:45 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 11 Jun 2006 03:32:45 +0300 Subject: [openib-general] [PATCH 4/4] diags: ucast routing dump file generator example - dump_lfts.sh In-Reply-To: <20060611002758.22430.63061.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> Message-ID: <20060611003245.22430.93904.stgit@sashak.voltaire.com> A new simple script, dump_lfts.sh, which may be used for ucast dump file generation. Signed-off-by: Sasha Khapyorsky --- diags/Makefile.am | 2 +- diags/scripts/dump_lfts.sh | 41 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 42 insertions(+), 1 deletions(-) diff --git a/diags/Makefile.am b/diags/Makefile.am index bf0c077..9654675 100644 --- a/diags/Makefile.am +++ b/diags/Makefile.am @@ -24,7 +24,7 @@ bin_SCRIPTS = scripts/ibcheckerrs script scripts/ibcheckstate scripts/ibcheckportstate \ scripts/ibcheckerrors scripts/ibclearerrors \ scripts/ibclearcounters scripts/discover.pl \ - scripts/set_mthca_nodedesc.sh + scripts/set_mthca_nodedesc.sh scripts/dump_lfts.sh src_ibaddr_SOURCES = src/ibaddr.c src_ibaddr_CFLAGS = -Wall $(DBGFLAGS) diff --git a/diags/scripts/dump_lfts.sh b/diags/scripts/dump_lfts.sh new file mode 100755 index 0000000..bed4778 --- /dev/null +++ b/diags/scripts/dump_lfts.sh @@ -0,0 +1,41 @@ +#!/bin/sh +# +# This simple script will collect outputs of ibroute for all switches +# on the subnet and drop it on stdout. 
May be used for LFTs dump +# generation. +# + +usage () +{ + echo "usage: $0 [-D]" + exit 2 +} + +dump_by_lid () +{ +for sw_lid in `ibswitches \ + | sed -ne 's/^.* lid \([1-9a-f]*\) .*$/\1/p'` ; do + ibroute $sw_lid +done +} + +dump_by_dr_path () +{ +for sw_dr in `ibnetdiscover -v \ + | sed -ne '/^DR path .* switch /s/^DR path \[\(.*\)\].*$/\1/p' \ + | sed -e 's/\]\[/,/g' \ + | sort -u` ; do + ibroute -D ${sw_dr} +done +} + + +if [ "$1" = "-D" ] ; then + dump_by_dr_path +elif [ -z "$1" ] ; then + dump_by_lid +else + usage +fi + +exit From sashak at voltaire.com Sat Jun 10 17:32:38 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 11 Jun 2006 03:32:38 +0300 Subject: [openib-general] [PATCH 1/4] Simplification of the ucast fdb dumps. In-Reply-To: <20060611002758.22430.63061.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> Message-ID: <20060611003238.22430.62423.stgit@sashak.voltaire.com> This separates the dump procedure from the rest of the flow and prevents multiple fopen()/fclose() calls (one pair per switch) - one fopen() and one fclose() are used instead.
Signed-off-by: Sasha Khapyorsky --- osm/opensm/osm_ucast_mgr.c | 187 +++++++++++++++++++++++--------------------- 1 files changed, 96 insertions(+), 91 deletions(-) diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c index 40422e5..cac7f9b 100644 --- a/osm/opensm/osm_ucast_mgr.c +++ b/osm/opensm/osm_ucast_mgr.c @@ -128,7 +128,7 @@ osm_ucast_mgr_init( /********************************************************************** **********************************************************************/ -void +static void osm_ucast_mgr_dump_path_distribution( IN const osm_ucast_mgr_t* const p_mgr, IN const osm_switch_t* const p_sw ) @@ -143,70 +143,65 @@ osm_ucast_mgr_dump_path_distribution( OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_dump_path_distribution ); - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) - { - p_node = osm_switch_get_node_ptr( p_sw ); + p_node = osm_switch_get_node_ptr( p_sw ); - num_ports = osm_switch_get_num_ports( p_sw ); - sprintf( p_mgr->p_report_buf, "osm_ucast_mgr_dump_path_distribution: " - "Switch 0x%" PRIx64 "\n" - "Port : Path Count Through Port", - cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); + num_ports = osm_switch_get_num_ports( p_sw ); + sprintf( p_mgr->p_report_buf, "osm_ucast_mgr_dump_path_distribution: " + "Switch 0x%" PRIx64 "\n" + "Port : Path Count Through Port", + cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); - for( i = 0; i < num_ports; i++ ) + for( i = 0; i < num_ports; i++ ) + { + num_paths = osm_switch_path_count_get( p_sw , i ); + sprintf( line, "\n %03u : %u", i, num_paths ); + strcat( p_mgr->p_report_buf, line ); + if( i == 0 ) { - num_paths = osm_switch_path_count_get( p_sw , i ); - sprintf( line, "\n %03u : %u", i, num_paths ); - strcat( p_mgr->p_report_buf, line ); - if( i == 0 ) - { - strcat( p_mgr->p_report_buf, " (switch management port)" ); - continue; - } - - p_remote_node = osm_node_get_remote_node( - p_node, i, NULL ); - - if( p_remote_node == NULL ) - continue; + strcat( 
p_mgr->p_report_buf, " (switch management port)" ); + continue; + } - remote_guid_ho = cl_ntoh64( - osm_node_get_node_guid( p_remote_node ) ); + p_remote_node = osm_node_get_remote_node( p_node, i, NULL ); + if( p_remote_node == NULL ) + continue; - switch( osm_node_get_remote_type( p_node, i ) ) - { - case IB_NODE_TYPE_SWITCH: - strcat( p_mgr->p_report_buf, " (link to switch" ); - break; - case IB_NODE_TYPE_ROUTER: - strcat( p_mgr->p_report_buf, " (link to router" ); - break; - case IB_NODE_TYPE_CA: - strcat( p_mgr->p_report_buf, " (link to CA" ); - break; - default: - strcat( p_mgr->p_report_buf, " (link to unknown type, node" ); - break; - } + remote_guid_ho = cl_ntoh64( osm_node_get_node_guid( p_remote_node ) ); - sprintf( line, " 0x%" PRIx64 ")", remote_guid_ho ); - strcat( p_mgr->p_report_buf, line ); + switch( osm_node_get_remote_type( p_node, i ) ) + { + case IB_NODE_TYPE_SWITCH: + strcat( p_mgr->p_report_buf, " (link to switch" ); + break; + case IB_NODE_TYPE_ROUTER: + strcat( p_mgr->p_report_buf, " (link to router" ); + break; + case IB_NODE_TYPE_CA: + strcat( p_mgr->p_report_buf, " (link to CA" ); + break; + default: + strcat( p_mgr->p_report_buf, " (link to unknown type, node" ); + break; } - strcat( p_mgr->p_report_buf, "\n" ); - - osm_log_raw( p_mgr->p_log, OSM_LOG_ROUTING, p_mgr->p_report_buf ); + sprintf( line, " 0x%" PRIx64 ")", remote_guid_ho ); + strcat( p_mgr->p_report_buf, line ); } + strcat( p_mgr->p_report_buf, "\n" ); + + osm_log_raw( p_mgr->p_log, OSM_LOG_ROUTING, p_mgr->p_report_buf ); + OSM_LOG_EXIT( p_mgr->p_log ); } /********************************************************************** **********************************************************************/ -void +static void osm_ucast_mgr_dump_ucast_routes( IN const osm_ucast_mgr_t* const p_mgr, - IN const osm_switch_t* const p_sw ) + IN const osm_switch_t* const p_sw, + IN FILE *p_fdbFile) { const osm_node_t* p_node; uint8_t port_num; @@ -217,34 +212,10 @@ 
osm_ucast_mgr_dump_ucast_routes( uint16_t lid_ho; char line[OSM_REPORT_LINE_SIZE]; uint32_t line_num = 0; - FILE * p_fdbFile; boolean_t ui_ucast_fdb_assign_func_defined; - char *file_name = NULL; OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_dump_ucast_routes ); - if( !osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) ) - goto Exit; - - file_name = - (char*)malloc(strlen(p_mgr->p_subn->opt.dump_files_dir) + 10); - - CL_ASSERT(file_name); - - strcpy(file_name, p_mgr->p_subn->opt.dump_files_dir); - strcat(file_name,"/osm.fdbs"); - - /* Open the file or error */ - p_fdbFile = fopen(file_name, "a"); - if (! p_fdbFile) - { - osm_log( p_mgr->p_log, OSM_LOG_ERROR, - "osm_ucast_mgr_dump_ucast_routes: ERR 3A12: " - "Failed to open fdb file (%s)\n", - file_name ); - goto Exit; - } - p_node = osm_switch_get_node_ptr( p_sw ); max_lid_ho = osm_switch_get_max_lid_ho( p_sw ); @@ -324,15 +295,59 @@ osm_ucast_mgr_dump_ucast_routes( if( line_num != 0 ) fprintf(p_fdbFile,"%s\n",p_mgr->p_report_buf ); - fclose(p_fdbFile); - - Exit: - if (file_name) - free(file_name); OSM_LOG_EXIT( p_mgr->p_log ); } /********************************************************************** + **********************************************************************/ +struct ucast_mgr_dump_context { + osm_ucast_mgr_t *p_mgr; + FILE *file; +}; + +static void +__osm_ucast_mgr_dump_table( + IN cl_map_item_t* const p_map_item, + IN void* context ) +{ + osm_switch_t* const p_sw = (osm_switch_t*)p_map_item; + struct ucast_mgr_dump_context *cxt = context; + + if( osm_log_is_active( cxt->p_mgr->p_log, OSM_LOG_DEBUG ) ) + osm_ucast_mgr_dump_path_distribution( cxt->p_mgr, p_sw ); + osm_ucast_mgr_dump_ucast_routes( cxt->p_mgr, p_sw, cxt->file ); +} + +static void osm_ucast_mgr_dump_tables( + IN osm_ucast_mgr_t *p_mgr) +{ + char file_name[1024]; + struct ucast_mgr_dump_context dump_context; + FILE *file; + + strncpy(file_name, p_mgr->p_subn->opt.dump_files_dir, sizeof(file_name) - 1); + strncat(file_name, "/osm.fdbs", 
sizeof(file_name) - strlen(file_name) - 1); + + file = fopen(file_name, "w"); + if (!file) + { + osm_log( p_mgr->p_log, OSM_LOG_ERROR, + "osm_ucast_mgr_dump_ucast_routes: ERR 3A12: " + "Failed to open fdb file (%s)\n", + file_name ); + return; + } + + dump_context.p_mgr = p_mgr; + dump_context.file = file; + + cl_qmap_apply_func( &p_mgr->p_subn->sw_guid_tbl, + __osm_ucast_mgr_dump_table, &dump_context ); + + fclose(file); +} + +/********************************************************************** Add each switch's own LID to its LID matrix. **********************************************************************/ static void @@ -952,8 +967,6 @@ __osm_ucast_mgr_process_tbl( __osm_ucast_mgr_set_table( p_mgr, p_sw ); - osm_ucast_mgr_dump_path_distribution( p_mgr, p_sw ); - osm_ucast_mgr_dump_ucast_routes( p_mgr, p_sw ); OSM_LOG_EXIT( p_mgr->p_log ); } @@ -1047,7 +1060,6 @@ osm_ucast_mgr_process( uint32_t iteration_max; osm_signal_t signal; cl_qmap_t *p_sw_guid_tbl; - char *file_name = NULL; OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_process ); @@ -1148,26 +1160,19 @@ osm_ucast_mgr_process( build and download the switch forwarding tables. */ - /* remove the old fdb dump file: */ - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) && (file_name = - (char*)malloc(strlen(p_mgr->p_subn->opt.dump_files_dir) + 10)) ) - { - strcpy(file_name, p_mgr->p_subn->opt.dump_files_dir); - strcat(file_name, "/osm.fdbs"); - unlink(file_name); - free(file_name); - } - cl_qmap_apply_func( p_sw_guid_tbl, __osm_ucast_mgr_process_tbl, p_mgr ); + /* dump fdb into file: */ + if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) ) + osm_ucast_mgr_dump_tables( p_mgr ); + /* For now don't bother checking if the switch forwarding tables actually needed updating. The current code will always update them, and thus leave transactions pending on the wire. Therefore, return OSM_SIGNAL_DONE_PENDING. 
*/ - signal = OSM_SIGNAL_DONE_PENDING; } else From sashak at voltaire.com Sat Jun 10 17:32:43 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 11 Jun 2006 03:32:43 +0300 Subject: [openib-general] [PATCH 3/4] New routing module which loads LFT tables from dump file. In-Reply-To: <20060611002758.22430.63061.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> Message-ID: <20060611003243.22430.56582.stgit@sashak.voltaire.com> This patch implements a trivial routing module which is able to load LFT tables from a dump file. Main features: - support for unicast LFTs only; support for multicast can be added later - this will run after the min hop matrix calculation - this will load switch LFTs according to the path entries found in the dump file - no additional checks will be performed (such as whether the port is connected, etc.) - in case the fabric LIDs were changed, this will try to reconstruct LFTs correctly if endport GUIDs are present in the dump file (to disable this, the GUIDs may be removed from the dump file or zeroed) The dump file format is compatible with the output of the 'ibroute' utility, and for the whole fabric it may be generated with a script like this: for sw_lid in `ibswitches | awk '{print $NF}'` ; do ibroute $sw_lid done > /path/to/dump_file , or using DR paths: for sw_dr in `ibnetdiscover -v \ | sed -ne '/^DR path .* switch /s/^DR path \[\(.*\)\].*$/\1/p' \ | sed -e 's/\]\[/,/g' \ | sort -u` ; do ibroute -D ${sw_dr} done > /path/to/dump_file In order to activate the new module, use: opensm -R file -U /path/to/dump_file Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_subnet.h | 5 + osm/opensm/Makefile.am | 2 osm/opensm/main.c | 16 ++ osm/opensm/osm_opensm.c | 2 osm/opensm/osm_subnet.c | 10 ++ osm/opensm/osm_ucast_file.c | 258 +++++++++++++++++++++++++++++++++++++++ 6 files changed, 289 insertions(+), 4 deletions(-) diff --git a/osm/include/opensm/osm_subnet.h b/osm/include/opensm/osm_subnet.h index a637367..ec1d056
100644 --- a/osm/include/opensm/osm_subnet.h +++ b/osm/include/opensm/osm_subnet.h @@ -277,6 +277,7 @@ typedef struct _osm_subn_opt boolean_t sweep_on_trap; osm_testability_modes_t testability_mode; char * routing_engine_name; + char * ucast_dump_file; char * updn_guid_file; boolean_t exit_on_fatal; boolean_t honor_guid2lid_file; @@ -423,6 +424,10 @@ typedef struct _osm_subn_opt * routing_engine_name * Name of used routing engine (other than default Min Hop Algorithm) * +* ucast_dump_file +* Name of the unicast routing dump file from where switch +* forwearding tables will be loaded +* * updn_guid_file * Pointer to name of the UPDN guid file given by User * diff --git a/osm/opensm/Makefile.am b/osm/opensm/Makefile.am index 7b1060a..5da88a4 100644 --- a/osm/opensm/Makefile.am +++ b/osm/opensm/Makefile.am @@ -83,7 +83,7 @@ opensm_SOURCES = main.c osm_console.c os osm_sw_info_rcv_ctrl.c osm_switch.c \ osm_prtn.c osm_prtn_config.c osm_qos.c \ osm_trap_rcv.c osm_trap_rcv_ctrl.c \ - osm_ucast_mgr.c osm_ucast_updn.c \ + osm_ucast_mgr.c osm_ucast_updn.c osm_ucast_file.c \ osm_vl15intf.c osm_vl_arb_rcv.c \ osm_vl_arb_rcv_ctrl.c st.c opensm_CFLAGS = -Wall $(OSMV_CFLAGS) -fno-strict-aliasing -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 diff --git a/osm/opensm/main.c b/osm/opensm/main.c index c888ed4..dfb2aec 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -175,8 +175,12 @@ show_usage(void) " LID assignments resolving multiple use of same LID.\n\n"); printf( "-R\n" "--routing_engine \n" - " This option choose routing engine instead of Min Hop\n" - " algorithm (default). Supported engines: updn\n"); + " This option chooses routing engine instead of Min Hop\n" + " algorithm (default). 
Supported engines: updn, file\n"); + printf( "-U\n" + "--ucast_file \n" + " This option specifies name of the unicast dump file\n" + " from where switch forwarding tables will be loaded.\nn"); printf ("-a\n" "--add_guid_file \n" " Set the root nodes for the Up/Down routing algorithm\n" @@ -523,7 +527,7 @@ #endif boolean_t cache_options = FALSE; char *ignore_guids_file_name = NULL; uint32_t val; - const char * const short_option = "i:f:ed:g:l:s:t:a:R:P:NQvVhorcyx"; + const char * const short_option = "i:f:ed:g:l:s:t:a:R:U:P:NQvVhorcyx"; /* In the array below, the 2nd parameter specified the number @@ -556,6 +560,7 @@ #endif { "priority", 1, NULL, 'p'}, { "smkey", 1, NULL, 'k'}, { "routing_engine",1, NULL, 'R'}, + { "ucast_file" ,1, NULL, 'U'}, { "add_guid_file", 1, NULL, 'a'}, { "cache-options", 0, NULL, 'c'}, { "stay_on_fatal", 0, NULL, 'y'}, @@ -780,6 +785,11 @@ #endif printf(" Activate \'%s\' routing engine\n", optarg); break; + case 'U': + opt.ucast_dump_file = optarg; + printf(" Ucast dump file is \'%s\'\n", optarg); + break; + case 'a': /* Specifies port guids file diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c index 52f06da..a189591 100644 --- a/osm/opensm/osm_opensm.c +++ b/osm/opensm/osm_opensm.c @@ -74,10 +74,12 @@ struct routing_engine_module { }; extern int osm_ucast_updn_setup(osm_opensm_t *p_osm); +extern int osm_ucast_file_setup(osm_opensm_t *p_osm); const static struct routing_engine_module routing_modules[] = { {"null", NULL}, {"updn", osm_ucast_updn_setup }, + {"file", osm_ucast_file_setup }, {} }; diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c index 27f97ab..0d46f85 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -489,6 +489,7 @@ osm_subn_set_default_opt( p_opt->sweep_on_trap = TRUE; p_opt->testability_mode = OSM_TEST_MODE_NONE; p_opt->routing_engine_name = NULL; + p_opt->ucast_dump_file = NULL; p_opt->updn_guid_file = NULL; p_opt->exit_on_fatal = TRUE; 
subn_set_default_qos_options(&p_opt->qos_options); @@ -937,6 +938,10 @@ osm_subn_parse_conf_file( p_key, p_val, &p_opts->dump_files_dir); __osm_subn_opts_unpack_charp( + "ucast_dump_file" , + p_key, p_val, &p_opts->ucast_dump_file); + + __osm_subn_opts_unpack_charp( "updn_guid_file" , p_key, p_val, &p_opts->updn_guid_file); @@ -1094,6 +1099,11 @@ osm_subn_write_conf_file( "# Routing engine\n" "routing_engine %s\n\n", p_opts->routing_engine_name); + if (p_opts->ucast_dump_file) + fprintf( opts_file, + "# Ucast dump file name\n" + "ucast_dump_file %s\n\n", + p_opts->ucast_dump_file); if (p_opts->updn_guid_file) fprintf( opts_file, "# The file holding the Up/Down root node guids\n" diff --git a/osm/opensm/osm_ucast_file.c b/osm/opensm/osm_ucast_file.c new file mode 100644 index 0000000..a68d9ec --- /dev/null +++ b/osm/opensm/osm_ucast_file.c @@ -0,0 +1,258 @@ +/* + * Copyright (c) 2006 Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +/* + * Abstract: + * Implementation of OpenSM unicast routing module which loads + * routes from the dump file + * + * Environment: + * Linux User Mode + * + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include + +#include +#include +#include +#include +#include + +#define PARSEERR(log, file_name, lineno, fmt, arg...) \ + osm_log(log, OSM_LOG_ERROR, "PARSE ERROR: %s:%u: " fmt , \ + file_name, lineno, ##arg ) + +#define PARSEWARN(log, file_name, lineno, fmt, arg...) \ + osm_log(log, OSM_LOG_VERBOSE, "PARSE WARN: %s:%u: " fmt , \ + file_name, lineno, ##arg ) + +static uint16_t remap_lid(osm_opensm_t *p_osm, uint16_t lid, ib_net64_t guid) +{ + osm_port_t *p_port; + uint16_t min_lid, max_lid; + uint8_t lmc; + + p_port = (osm_port_t *)cl_qmap_get(&p_osm->subn.port_guid_tbl, guid); + if (!p_port || + p_port == (osm_port_t *)cl_qmap_end(&p_osm->subn.port_guid_tbl)) { + osm_log(&p_osm->log, OSM_LOG_VERBOSE, + "remap_lid: cannot find port guid 0x%016" PRIx64 + " , will use the same lid.\n", cl_ntoh64(guid)); + return lid; + } + + osm_port_get_lid_range_ho(p_port, &min_lid, &max_lid); + if (min_lid <= lid && lid <= max_lid) + return lid; + + lmc = osm_port_get_lmc(p_port); + return min_lid + (lid & ((1 << lmc) - 1)); +} + +static void add_path(osm_opensm_t * p_osm, + osm_switch_t * p_sw, uint16_t lid, uint8_t port_num, + ib_net64_t port_guid) +{ + uint16_t new_lid; + uint8_t old_port; + + new_lid = port_guid ? 
remap_lid(p_osm, lid, port_guid) : lid; + old_port = osm_fwd_tbl_get(osm_switch_get_fwd_tbl_ptr(p_sw), new_lid); + if (old_port != OSM_NO_PATH && old_port != port_num) { + osm_log(&p_osm->log, OSM_LOG_VERBOSE, + "add_path: LID collision is detected on switch " + "0x016%" PRIx64 ", will overwrite LID 0x%x entry.\n", + cl_ntoh64(osm_node_get_node_guid + (osm_switch_get_node_ptr(p_sw))), new_lid); + } + + osm_switch_set_path(p_sw, new_lid, port_num, TRUE); + + osm_log(&p_osm->log, OSM_LOG_DEBUG, + "add_path: route 0x%04x(was 0x%04x) %u 0x%016" PRIx64 + " is added to switch 0x%016" PRIx64 "\n", + new_lid, lid, port_num, cl_ntoh64(port_guid), + cl_ntoh64(osm_node_get_node_guid + (osm_switch_get_node_ptr(p_sw)))); +} + +static void clean_sw_fwd_table(void *arg, void *context) +{ + osm_switch_t *p_sw = arg; + uint16_t lid, max_lid; + + max_lid = osm_switch_get_max_lid_ho(p_sw); + for (lid = 1 ; lid <= max_lid ; lid++) + osm_switch_set_path(p_sw, lid, OSM_NO_PATH, TRUE); +} + +static int do_ucast_file_load(void *context) +{ + char line[1024]; + char *file_name; + FILE *file; + ib_net64_t sw_guid, port_guid; + osm_opensm_t *p_osm = context; + osm_switch_t *p_sw; + uint16_t lid; + uint8_t port_num; + unsigned lineno; + + file_name = p_osm->subn.opt.ucast_dump_file; + + if (!file_name) { + osm_log(&p_osm->log, OSM_LOG_ERROR, + "do_ucast_file_load: " + "ucast dump file name is not defined.\n"); + return -1; + } + + file = fopen(file_name, "r"); + if (!file) { + osm_log(&p_osm->log, OSM_LOG_ERROR, + "do_ucast_file_load: " + "cannot open ucast dump file \'%s\'\n", file_name); + return -1; + } + + cl_qmap_apply_func(&p_osm->subn.sw_guid_tbl, clean_sw_fwd_table, NULL); + + lineno = 0; + p_sw = NULL; + + while (fgets(line, sizeof(line) - 1, file) != NULL) { + char *p, *q; + lineno++; + + p = line; + while (isspace(*p)) + p++; + + if (*p == '#') + continue; + + if (!strncmp(p, "Multicast mlids", 15)) { + osm_log(&p_osm->log, OSM_LOG_ERROR, + "do_ucast_file_load: " + "Multicast dump 
file is detected. " + "Skip parsing.\n"); + } + else if (!strncmp(p, "Unicast lids", 12)) { + q = strstr(p, " guid 0x"); + if (!q) { + PARSEERR(&p_osm->log, file_name, lineno, + "cannot parse switch definition\n"); + return -1; + } + p = q + 6; + sw_guid = strtoll(p, &q, 16); + if (q && !isspace(*q)) { + PARSEERR(&p_osm->log, file_name, lineno, + "cannot parse switch guid: \'%s\'\n", + p); + return -1; + } + sw_guid = cl_hton64(sw_guid); + + p_sw = (osm_switch_t *)cl_qmap_get(&p_osm->subn.sw_guid_tbl, + sw_guid); + if (!p_sw || + p_sw == (osm_switch_t *)cl_qmap_end(&p_osm->subn.sw_guid_tbl)) { + p_sw = NULL; + osm_log(&p_osm->log, OSM_LOG_VERBOSE, + "do_ucast_file_load: " + "cannot find switch %016" PRIx64 ".\n", + cl_ntoh64(sw_guid)); + continue; + } + } + else if (p_sw && !strncmp(p, "0x", 2)) { + lid = strtoul(p, &q, 16); + if (q && !isspace(*q)) { + PARSEERR(&p_osm->log, file_name, lineno, + "cannot parse lid: \'%s\'\n", p); + return -1; + } + p = q; + while (isspace(*p)) + p++; + port_num = strtoul(p, &q, 10); + if (q && !isspace(*q)) { + PARSEERR(&p_osm->log, file_name, lineno, + "cannot parse port: \'%s\'\n", p); + return -1; + } + p = q; + /* additionally try to exract guid */ + q = strstr(p, " portguid 0x"); + if (!q) { + PARSEWARN(&p_osm->log, file_name, lineno, + "cannot find port guid " + "(maybe broken dump): \'%s\'\n", p); + port_guid = 0; + } + else { + p = q + 10; + port_guid = strtoll(p, &q, 16); + if (!q && !isspace(*q) && *q != ':') { + PARSEWARN(&p_osm->log, file_name, + lineno, + "cannot parse port guid " + "(maybe broken dump): " + "\'%s\'\n", p); + port_guid = 0; + } + } + port_guid = cl_hton64(port_guid); + add_path(p_osm, p_sw, lid, port_num, port_guid); + } + } + + fclose(file); + return 0; +} + +int osm_ucast_file_setup(osm_opensm_t * p_osm) +{ + p_osm->routing_engine.context = (void *)p_osm; + p_osm->routing_engine.ucast_build_fwd_tables = do_ucast_file_load; + return 0; +} From sashak at voltaire.com Sat Jun 10 17:32:41 2006 From: 
sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 11 Jun 2006 03:32:41 +0300 Subject: [openib-general] [PATCH 2/4] Modular routing engine (unicast only yet). In-Reply-To: <20060611002758.22430.63061.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> Message-ID: <20060611003240.22430.88414.stgit@sashak.voltaire.com> This patch introduces a routing_engine structure which may be used for "plugging" in a new routing module. Currently only unicast callbacks are supported (multicast can be added later). The existing up-down routing module, 'updn', may be activated with the '-R updn' option (instead of the old '-u'). General usage is: $ opensm -R 'module-name' Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_opensm.h | 17 ++++++++- osm/include/opensm/osm_subnet.h | 16 ++------ osm/include/opensm/osm_ucast_updn.h | 26 ------------- osm/opensm/main.c | 26 +++++-------- osm/opensm/osm_opensm.c | 41 ++++++++++++++++++--- osm/opensm/osm_subnet.c | 23 ++++++------ osm/opensm/osm_ucast_mgr.c | 69 ++++++++++++++++++++++++----------- osm/opensm/osm_ucast_updn.c | 69 ++++++++++++++++++----------------- 8 files changed, 156 insertions(+), 131 deletions(-) diff --git a/osm/include/opensm/osm_opensm.h b/osm/include/opensm/osm_opensm.h index 3235ad4..3e6e120 100644 --- a/osm/include/opensm/osm_opensm.h +++ b/osm/include/opensm/osm_opensm.h @@ -92,6 +92,18 @@ BEGIN_C_DECLS * *********/ +/* + * routing engine structure - yet limited by ucast_fdb_assign and + * ucast_build_fwd_tables (multicast callbacks may be added later) + */ +struct osm_routing_engine { + const char *name; + void *context; + int (*ucast_build_fwd_tables)(void *context); + int (*ucast_fdb_assign)(void *context); + void (*delete)(void *context); +}; + /****s* OpenSM: OpenSM/osm_opensm_t * NAME * osm_opensm_t @@ -116,7 +128,7 @@ typedef struct _osm_opensm_t osm_log_t log; cl_dispatcher_t disp; cl_plock_t lock; - updn_t *p_updn_ucast_routing; + struct osm_routing_engine
routing_engine; osm_stats_t stats; } osm_opensm_t; /* @@ -153,6 +165,9 @@ typedef struct _osm_opensm_t * lock * Shared lock guarding most OpenSM structures. * +* routing_engine +* Routing engine, will be initialized then used +* * stats * Open SM statistics block * diff --git a/osm/include/opensm/osm_subnet.h b/osm/include/opensm/osm_subnet.h index 4db449d..a637367 100644 --- a/osm/include/opensm/osm_subnet.h +++ b/osm/include/opensm/osm_subnet.h @@ -272,13 +272,11 @@ typedef struct _osm_subn_opt uint32_t max_port_profile; osm_pfn_ui_extension_t pfn_ui_pre_lid_assign; void * ui_pre_lid_assign_ctx; - osm_pfn_ui_extension_t pfn_ui_ucast_fdb_assign; - void * ui_ucast_fdb_assign_ctx; osm_pfn_ui_mcast_extension_t pfn_ui_mcast_fdb_assign; void * ui_mcast_fdb_assign_ctx; boolean_t sweep_on_trap; osm_testability_modes_t testability_mode; - boolean_t updn_activate; + char * routing_engine_name; char * updn_guid_file; boolean_t exit_on_fatal; boolean_t honor_guid2lid_file; @@ -407,13 +405,6 @@ typedef struct _osm_subn_opt * ui_pre_lid_assign_ctx * A UI context (void *) to be provided to the pfn_ui_pre_lid_assign * -* pfn_ui_ucast_fdb_assign -* A UI function to be called instead of the ucast manager FDB -* configuration. -* -* ui_ucast_fdb_assign_ctx -* A UI context (void *) to be provided to the pfn_ui_ucast_fdb_assign -* * pfn_ui_mcast_fdb_assign * A UI function to be called inside the mcast manager instead of the * call for the build spanning tree. This will be called on every @@ -429,9 +420,8 @@ typedef struct _osm_subn_opt * testability_mode * Object that indicates if we are running in a special testability mode. 
* -* updn_activate -* Object that indicates if we are running the UPDN algorithm (TRUE) or -* Min Hop Algorithm (FALSE) +* routing_engine_name +* Name of used routing engine (other than default Min Hop Algorithm) * * updn_guid_file * Pointer to name of the UPDN guid file given by User diff --git a/osm/include/opensm/osm_ucast_updn.h b/osm/include/opensm/osm_ucast_updn.h index 027056c..fbf8782 100644 --- a/osm/include/opensm/osm_ucast_updn.h +++ b/osm/include/opensm/osm_ucast_updn.h @@ -421,32 +421,6 @@ osm_subn_calc_up_down_min_hop_table( * This function returns 0 when rankning has succeded , otherwise 1. ******/ -/****f* OpenSM: OpenSM/osm_updn_reg_calc_min_hop_table -* NAME -* osm_updn_reg_calc_min_hop_table -* -* DESCRIPTION -* Registration function to ucast routing manager (instead of -* Min Hop Algorithm) -* -* SYNOPSIS -*/ -int -osm_updn_reg_calc_min_hop_table( - IN updn_t * p_updn, - IN osm_subn_opt_t* p_opt ); -/* -* PARAMETERS -* -* RETURN VALUES -* 0 - on success , 1 - on failure -* -* NOTES -* -* SEE ALSO -* osm_subn_calc_up_down_min_hop_table -*********/ - /****** Osmsh: UpDown/osm_updn_find_root_nodes_by_min_hop * NAME * osm_updn_find_root_nodes_by_min_hop diff --git a/osm/opensm/main.c b/osm/opensm/main.c index 22591eb..c888ed4 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -60,7 +60,6 @@ #include #include #include #include -#include #include /******************************************************************** @@ -174,10 +173,10 @@ show_usage(void) " may disrupt subnet traffic.\n" " Without -r, OpenSM attempts to preserve existing\n" " LID assignments resolving multiple use of same LID.\n\n"); - printf( "-u\n" - "--updn\n" - " This option activate UPDN algorithm instead of Min Hop\n" - " algorithm (default).\n"); + printf( "-R\n" + "--routing_engine \n" + " This option choose routing engine instead of Min Hop\n" + " algorithm (default). 
Supported engines: updn\n"); printf ("-a\n" "--add_guid_file \n" " Set the root nodes for the Up/Down routing algorithm\n" @@ -524,7 +523,7 @@ #endif boolean_t cache_options = FALSE; char *ignore_guids_file_name = NULL; uint32_t val; - const char * const short_option = "i:f:ed:g:l:s:t:a:P:NQuvVhorcyx"; + const char * const short_option = "i:f:ed:g:l:s:t:a:R:P:NQvVhorcyx"; /* In the array below, the 2nd parameter specified the number @@ -556,7 +555,7 @@ #endif { "reassign_lids", 0, NULL, 'r'}, { "priority", 1, NULL, 'p'}, { "smkey", 1, NULL, 'k'}, - { "updn", 0, NULL, 'u'}, + { "routing_engine",1, NULL, 'R'}, { "add_guid_file", 1, NULL, 'a'}, { "cache-options", 0, NULL, 'c'}, { "stay_on_fatal", 0, NULL, 'y'}, @@ -776,9 +775,9 @@ #endif opt.sm_key = sm_key; break; - case 'u': - opt.updn_activate = TRUE; - printf(" Activate UPDN algorithm\n"); + case 'R': + opt.routing_engine_name = optarg; + printf(" Activate \'%s\' routing engine\n", optarg); break; case 'a': @@ -885,13 +884,6 @@ #endif setup_signals(); osm_opensm_sweep( &osm ); - /* since osm_opensm_init get opt as RO we'll set the opt value with UI pfn here */ - /* Now do the registration */ - if (opt.updn_activate) - if (osm_updn_reg_calc_min_hop_table(osm.p_updn_ucast_routing, &(osm.subn.opt))) { - status = IB_ERROR; - goto Exit; - } if( run_once_flag == TRUE ) { diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c index 8c422b5..52f06da 100644 --- a/osm/opensm/osm_opensm.c +++ b/osm/opensm/osm_opensm.c @@ -68,6 +68,37 @@ #include #include #include +struct routing_engine_module { + const char *name; + int (*setup)(osm_opensm_t *p_osm); +}; + +extern int osm_ucast_updn_setup(osm_opensm_t *p_osm); + +const static struct routing_engine_module routing_modules[] = { + {"null", NULL}, + {"updn", osm_ucast_updn_setup }, + {} +}; + +static int setup_routing_engine(osm_opensm_t *p_osm, const char *name) +{ + const struct routing_engine_module *r; + for (r = routing_modules ; r->name && *r->name ; r++) { + 
if(!strcmp(r->name, name)) { + p_osm->routing_engine.name = r->name; + if (r->setup(p_osm)) + break; + osm_log (&p_osm->log, OSM_LOG_DEBUG, + "opensm: setup_routing_engine: " + "\'%s\' routing engine set up.\n", + p_osm->routing_engine.name); + return 0; + } + } + return -1; +} + /********************************************************************** **********************************************************************/ void @@ -118,7 +149,8 @@ osm_opensm_destroy( cl_disp_shutdown( &p_osm->disp ); /* do the destruction in reverse order as init */ - updn_destroy( p_osm->p_updn_ucast_routing ); + if (p_osm->routing_engine.delete) + p_osm->routing_engine.delete(p_osm->routing_engine.context); osm_sa_destroy( &p_osm->sa ); osm_sm_destroy( &p_osm->sm ); osm_db_destroy( &p_osm->db ); @@ -252,11 +284,8 @@ #endif if( status != IB_SUCCESS ) goto Exit; - /* HACK - the UpDown manager should have been a part of the osm_sm_t */ - /* Init updn struct */ - p_osm->p_updn_ucast_routing = updn_construct( ); - status = updn_init( p_osm->p_updn_ucast_routing ); - if( status != IB_SUCCESS ) + if( p_opt->routing_engine_name && + setup_routing_engine(p_osm, p_opt->routing_engine_name)) goto Exit; Exit: diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c index 7c08556..27f97ab 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -484,13 +484,11 @@ osm_subn_set_default_opt( p_opt->max_port_profile = 0xffffffff; p_opt->pfn_ui_pre_lid_assign = NULL; p_opt->ui_pre_lid_assign_ctx = NULL; - p_opt->pfn_ui_ucast_fdb_assign = NULL; - p_opt->ui_ucast_fdb_assign_ctx = NULL; p_opt->pfn_ui_mcast_fdb_assign = NULL; p_opt->ui_mcast_fdb_assign_ctx = NULL; p_opt->sweep_on_trap = TRUE; p_opt->testability_mode = OSM_TEST_MODE_NONE; - p_opt->updn_activate = FALSE; + p_opt->routing_engine_name = NULL; p_opt->updn_guid_file = NULL; p_opt->exit_on_fatal = TRUE; subn_set_default_qos_options(&p_opt->qos_options); @@ -911,9 +909,9 @@ osm_subn_parse_conf_file( "sweep_on_trap", p_key, 
p_val, &p_opts->sweep_on_trap); - __osm_subn_opts_unpack_boolean( - "updn_activate", - p_key, p_val, &p_opts->updn_activate); + __osm_subn_opts_unpack_charp( + "routing_engine", + p_key, p_val, &p_opts->routing_engine_name); __osm_subn_opts_unpack_charp( "log_file", p_key, p_val, &p_opts->log_file); @@ -1089,12 +1087,13 @@ osm_subn_write_conf_file( opts_file, "#\n# ROUTING OPTIONS\n#\n" "# If true do not count switches as link subscriptions\n" - "port_profile_switch_nodes %s\n\n" - "# Activate the Up/Down routing algorithm\n" - "updn_activate %s\n\n", - p_opts->port_profile_switch_nodes ? "TRUE" : "FALSE", - p_opts->updn_activate ? "TRUE" : "FALSE" - ); + "port_profile_switch_nodes %s\n\n", + p_opts->port_profile_switch_nodes ? "TRUE" : "FALSE"); + if (p_opts->routing_engine_name) + fprintf( opts_file, + "# Routing engine\n" + "routing_engine %s\n\n", + p_opts->routing_engine_name); if (p_opts->updn_guid_file) fprintf( opts_file, "# The file holding the Up/Down root node guids\n" diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c index cac7f9b..0c0d635 100644 --- a/osm/opensm/osm_ucast_mgr.c +++ b/osm/opensm/osm_ucast_mgr.c @@ -62,6 +62,7 @@ #include #include #include #include +#include #define LINE_LENGTH 256 @@ -269,7 +270,7 @@ osm_ucast_mgr_dump_ucast_routes( strcat( p_mgr->p_report_buf, "yes" ); else { - if (p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign) { + if (p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) { ui_ucast_fdb_assign_func_defined = TRUE; } else { ui_ucast_fdb_assign_func_defined = FALSE; @@ -708,7 +709,7 @@ __osm_ucast_mgr_process_port( node_guid = osm_node_get_node_guid(osm_switch_get_node_ptr( p_sw ) ); /* Flag to mark whether or not a ui ucast fdb assign function was given */ - if (p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign) + if (p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) ui_ucast_fdb_assign_func_defined = TRUE; else ui_ucast_fdb_assign_func_defined = FALSE; @@ -753,7 +754,7 @@ __osm_ucast_mgr_process_port( /* 
Up/Down routing can cause unreachable routes between some switches so we do not report that as an error in that case */ - if (!p_mgr->p_subn->opt.updn_activate) + if (!p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) { osm_log( p_mgr->p_log, OSM_LOG_ERROR, "__osm_ucast_mgr_process_port: ERR 3A08: " @@ -973,6 +974,18 @@ __osm_ucast_mgr_process_tbl( /********************************************************************** **********************************************************************/ static void +__osm_ucast_mgr_set_table_cb( + IN cl_map_item_t* const p_map_item, + IN void* context ) +{ + osm_switch_t* const p_sw = (osm_switch_t*)p_map_item; + osm_ucast_mgr_t* const p_mgr = (osm_ucast_mgr_t*)context; + __osm_ucast_mgr_set_table( p_mgr, p_sw ); +} + +/********************************************************************** + **********************************************************************/ +static void __osm_ucast_mgr_process_neighbors( IN cl_map_item_t* const p_map_item, IN void* context ) @@ -1058,12 +1071,14 @@ osm_ucast_mgr_process( { uint32_t i; uint32_t iteration_max; + struct osm_routing_engine *p_routing_eng; osm_signal_t signal; cl_qmap_t *p_sw_guid_tbl; OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_process ); p_sw_guid_tbl = &p_mgr->p_subn->sw_guid_tbl; + p_routing_eng = &p_mgr->p_subn->p_osm->routing_engine; CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); @@ -1129,6 +1144,14 @@ osm_ucast_mgr_process( i ); + if (p_routing_eng->ucast_build_fwd_tables && + p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == 0) + { + cl_qmap_apply_func( p_sw_guid_tbl, + __osm_ucast_mgr_set_table_cb, p_mgr ); + } /* fallback on the regular path in case of failures */ + else + { /* This is the place where we can load pre-defined routes into the switches fwd_tbl structures. @@ -1136,32 +1159,34 @@ osm_ucast_mgr_process( Later code will use these values if not configured for re-assignment. 
*/ - if (p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign) - { - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) + if (p_routing_eng->ucast_fdb_assign) { - osm_log( p_mgr->p_log, OSM_LOG_DEBUG, - "osm_ucast_mgr_process: " - "Invoking UI function pfn_ui_ucast_fdb_assign\n"); - } - p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign(p_mgr->p_subn->opt.ui_ucast_fdb_assign_ctx); - } else { + if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_mgr->p_log, OSM_LOG_DEBUG, + "osm_ucast_mgr_process: " + "Invoking \'%s\' function ucast_fdb_assign\n", + p_routing_eng->name); + } + p_routing_eng->ucast_fdb_assign(p_routing_eng->context); + } else { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "osm_ucast_mgr_process: " "UI pfn was not invoked\n"); - } + } - osm_log(p_mgr->p_log, OSM_LOG_INFO, - "osm_ucast_mgr_process: " - "Min Hop Tables configured on all switches\n"); + osm_log(p_mgr->p_log, OSM_LOG_INFO, + "osm_ucast_mgr_process: " + "Min Hop Tables configured on all switches\n"); - /* - Now that the lid matrixes have been built, we can - build and download the switch forwarding tables. - */ + /* + Now that the lid matrixes have been built, we can + build and download the switch forwarding tables. 
+ */ - cl_qmap_apply_func( p_sw_guid_tbl, - __osm_ucast_mgr_process_tbl, p_mgr ); + cl_qmap_apply_func( p_sw_guid_tbl, + __osm_ucast_mgr_process_tbl, p_mgr ); + } /* dump fdb into file: */ if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) ) diff --git a/osm/opensm/osm_ucast_updn.c b/osm/opensm/osm_ucast_updn.c index d80f7eb..8e36854 100644 --- a/osm/opensm/osm_ucast_updn.c +++ b/osm/opensm/osm_ucast_updn.c @@ -76,8 +76,9 @@ __updn_get_dir(IN uint8_t cur_rank, IN uint64_t cur_guid, IN uint64_t rem_guid) { - uint32_t i = 0, max_num_guids = osm.p_updn_ucast_routing->updn_ucast_reg_inputs.num_guids; - uint64_t *p_guid = osm.p_updn_ucast_routing->updn_ucast_reg_inputs.guid_list; + updn_t *p_updn = osm.routing_engine.context; + uint32_t i = 0, max_num_guids = p_updn->updn_ucast_reg_inputs.num_guids; + uint64_t *p_guid = p_updn->updn_ucast_reg_inputs.guid_list; boolean_t cur_is_root = FALSE , rem_is_root = FALSE; /* HACK: comes to solve root nodes connection, in a classic subnet root nodes does not connect @@ -540,7 +541,7 @@ updn_init( p_updn->updn_ucast_reg_inputs.guid_list = NULL; p_updn->auto_detect_root_nodes = FALSE; /* Check if updn is activated , then fetch root nodes */ - if (osm.subn.opt.updn_activate) + if (osm.routing_engine.context) { /* Check the source for root node list, if file parse it, otherwise @@ -569,7 +570,7 @@ updn_init( { p_tmp = malloc(sizeof(uint64_t)); *p_tmp = strtoull(line, NULL, 16); - cl_list_insert_tail(osm.p_updn_ucast_routing->p_root_nodes, p_tmp); + cl_list_insert_tail(p_updn->p_root_nodes, p_tmp); } } else @@ -588,8 +589,8 @@ updn_init( "osm_opensm_init: " "UPDN - Root nodes fetching by file %s\n", osm.subn.opt.updn_guid_file); - guid_iterator = cl_list_head(osm.p_updn_ucast_routing->p_root_nodes); - while( guid_iterator != cl_list_end(osm.p_updn_ucast_routing->p_root_nodes) ) + guid_iterator = cl_list_head(p_updn->p_root_nodes); + while( guid_iterator != cl_list_end(p_updn->p_root_nodes) ) { osm_log( &osm.log, OSM_LOG_DEBUG, 
"osm_opensm_init: " @@ -600,7 +601,7 @@ updn_init( } else { - osm.p_updn_ucast_routing->auto_detect_root_nodes = TRUE; + p_updn->auto_detect_root_nodes = TRUE; } /* If auto mode detection reuired - will be executed in main b4 the assignment of UI Ucast */ } @@ -985,33 +986,6 @@ void __osm_updn_convert_list2array(IN up /********************************************************************** **********************************************************************/ -/* Registration function to ucast routing manager (instead of - Min Hop Algorithm) */ -int -osm_updn_reg_calc_min_hop_table( - IN updn_t * p_updn, - IN osm_subn_opt_t* p_opt ) -{ - OSM_LOG_ENTER(&(osm.log), osm_updn_reg_calc_min_hop_table); - /* - If root nodes were supplied by the user - we need to convert into array - otherwise, will be created & converted in callback function activation - */ - if (!p_updn->auto_detect_root_nodes) - { - __osm_updn_convert_list2array(p_updn); - } - osm_log (&(osm.log), OSM_LOG_DEBUG, - "osm_updn_reg_calc_min_hop_table: " - "assigning ucast fdb UI function with updn callback\n"); - p_opt->pfn_ui_ucast_fdb_assign = __osm_updn_call; - p_opt->ui_ucast_fdb_assign_ctx = (void *)p_updn; - OSM_LOG_EXIT(&(osm.log)); - return 0; -} - -/********************************************************************** - **********************************************************************/ /* Find Root nodes automatically by Min Hop Table info */ int osm_updn_find_root_nodes_by_min_hop( OUT updn_t * p_updn ) @@ -1210,3 +1184,30 @@ osm_updn_find_root_nodes_by_min_hop( OUT OSM_LOG_EXIT(&(osm.log)); return 0; } + +/********************************************************************** + **********************************************************************/ + +static void __osm_updn_delete(void *context) +{ + updn_t *p_updn = context; + updn_destroy(p_updn); +} + +int osm_ucast_updn_setup(osm_opensm_t *p_osm) +{ + updn_t *p_updn; + p_updn = updn_construct(); + if (!p_updn) + return -1; + 
p_osm->routing_engine.context = p_updn; + p_osm->routing_engine.delete = __osm_updn_delete; + p_osm->routing_engine.ucast_fdb_assign = __osm_updn_call; + + if (updn_init(p_updn) != IB_SUCCESS) + return -1; + if (!p_updn->auto_detect_root_nodes) + __osm_updn_convert_list2array(p_updn); + + return 0; +} From eitan at mellanox.co.il Sat Jun 10 23:07:56 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 11 Jun 2006 09:07:56 +0300 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236881C@mtlexch01.mtl.com> Hi Hal, As the 1.0 does not support GUIDInfo, I do not think this patch is relevant to 1.0. Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Sunday, June 11, 2006 12:22 AM > To: Eitan Zahavi > Cc: OPENIB > Subject: Re: [PATCH] osm: fix num of blocks of GUIDInfo GetTable query > > Eitan, > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > > Hi Hal > > > > I'm working on passing osmtest check. Found a bug in the new > > GUIDInfoRecord query: If you had a physical port with zero guid_cap > > the code would loop on blocks 0..255 instead of trying the next port. > > > > I am still looking for why we might have a guid_cap == 0 on some > > ports. > > > > This patch resolves this new problem. osmtest passes on some arbitrary > > networks. > > > > Eitan > > > > Signed-off-by: Eitan Zahavi > > Thanks. Applied to trunk only. > > Let me know if it also should be applied to 1.0.
> > -- Hal From bugzilla-daemon at openib.org Sat Jun 10 23:23:13 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Sat, 10 Jun 2006 23:23:13 -0700 (PDT) Subject: [openib-general] [Bug 126] RDMA_CM and UCM not loaded on boot Message-ID: <20060611062313.10CDC2287AC@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=126 vlad at mellanox.co.il changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |WONTFIX ------- Comment #1 from vlad at mellanox.co.il 2006-06-10 23:23 ------- RDMA_CM and RDMA_UCM are not loaded by default. In order to load them upon boot, edit the /etc/infiniband/openib.conf file and set RDMA_CM_LOAD=yes and RDMA_UCM_LOAD=yes: # Start HCA driver upon boot ONBOOT=yes # Load UCM module UCM_LOAD=no # Load RDMA_CM module RDMA_CM_LOAD=no # Load RDMA_UCM module RDMA_UCM_LOAD=no ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eitan at mellanox.co.il Sat Jun 10 23:36:45 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 11 Jun 2006 09:36:45 +0300 Subject: [openib-general] [PATCH 0/4] opensm: Loading unicast routes from the file Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236881F@mtlexch01.mtl.com> Hi Sasha, General comments: 1. I hope the change in osm.fdbs is not going to break the parser in ibdm:Fabric.cpp - was it really a necessary change, or just nice to have? 2. The modular routing is a great idea. From my first glance it seems that it assumes the calculation of min-hop-tables is common to all routing engines. I think it should be a callback provided by the engine too. Please note that the Min-Hop engine takes most of the routing time, so in the future if we could avoid that stage it would be even better. [EZ] We should start thinking about testing of this new feature too. Further comments on the patches themselves.
> There are couple of unicast routing related patches for OpenSM. > > Basically it implements routing module which provides possibility to load > switch forwarding tables from pre-created dump file. Currently unicast > tables loading is only supported, multicast may be added in a future. > > Short patch descriptions (more details may be found in emails with > patches): > > 1. Ucast dump file simplification. > 2. Modular routing - preliminary implements generic model to plug new > routing engine to OpenSM. > 3. New simple unicast routing engine which allows to load LFTs from > pre-created dump file. > 4. Example of ucast dump generation script. > > Please comment and test. Thanks. > > Sasha From mst at mellanox.co.il Sat Jun 10 23:38:42 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 11 Jun 2006 09:38:42 +0300 Subject: [openib-general] race in mthca_cq.c? In-Reply-To: References: Message-ID: <20060611063842.GU7359@mellanox.co.il> Quoting r. Roland Dreier : > Michael> But there might be more EQEs for this CQN outstanding in > Michael> the EQ which we have not seen yet. > > Now that you mention it, that could be a real problem I guess. > synchronize_irq() isn't enough because the interrupt handler might not > have even started yet. Only in MSI configurations though: with regular interrupts command interface shares IRQ with completions so the EQ will be emptied before interrupt handler is done. 
-- MST From halr at voltaire.com Sun Jun 11 03:12:08 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Jun 2006 06:12:08 -0400 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30236881C@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30236881C@mtlexch01.mtl.com> Message-ID: <1150020727.570.29434.camel@hal.voltaire.com> Hi Eitan, On Sun, 2006-06-11 at 02:07, Eitan Zahavi wrote: > Hi Hal, > > As the 1.0 does not support GUIDInfo I do not this patch is relevant to > 1.0 Huh ? What's https://openfabrics.org/svn/gen2/branches/1.0/src/userspace/management/osm/opensm/osm_sa_guidinfo_record.c -- Hal > > Eitan Zahavi > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Sunday, June 11, 2006 12:22 AM > > To: Eitan Zahavi > > Cc: OPENIB > > Subject: Re: [PATCH] osm: fix num of blocks of GUIDInfo GetTable query > > > > Eitan, > > > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > > > Hi Hal > > > > > > I'm working on passing osmtest check. Found a bug in the new > > > GUIDInfoRecord query: If you had a physical port with zero guid_cap > > > the code would loop on blocks 0..255 instead of trying the next > port. > > > > > > I am still looking for why we might have a guid_cap == 0 on some > > > ports. > > > > > > This patch resolves this new problem. osmtest passes on some > arbitrary > > > networks. > > > > > > Eitan > > > > > > Signed-off-by: Eitan Zahavi > > > > Thanks. Applied to trunk only. > > > > Let me know if it also should be applied to 1.0. 
> > > > -- Hal From eitan at mellanox.co.il Sun Jun 11 03:46:37 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 11 Jun 2006 13:46:37 +0300 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302368825@mtlexch01.mtl.com> Auuuch, it is there! My mistake. So please apply the patch to the OFED 1.0 branch too. BTW: Does the osmtest -f a exercise this query on the OFED 1.0? > Huh ? What's > https://openfabrics.org/svn/gen2/branches/1.0/src/userspace/management/o sm/opens > m/osm_sa_guidinfo_record.c > > -- Hal > > > > > Eitan Zahavi > > Senior Engineering Director, Software Architect > > Mellanox Technologies LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > -----Original Message----- > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > Sent: Sunday, June 11, 2006 12:22 AM > > > To: Eitan Zahavi > > > Cc: OPENIB > > > Subject: Re: [PATCH] osm: fix num of blocks of GUIDInfo GetTable query > > > > > > Eitan, > > > > > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > > > > Hi Hal > > > > > > > > I'm working on passing osmtest check. Found a bug in the new > > > > GUIDInfoRecord query: If you had a physical port with zero guid_cap > > > > the code would loop on blocks 0..255 instead of trying the next > > port. > > > > > > > > I am still looking for why we might have a guid_cap == 0 on some > > > > ports. > > > > > > > > This patch resolves this new problem. osmtest passes on some > > arbitrary > > > > networks. > > > > > > > > Eitan > > > > > > > > Signed-off-by: Eitan Zahavi > > > > > > Thanks. Applied to trunk only. > > > > > > Let me know if it also should be applied to 1.0.
> > > > -- Hal From tziporet at mellanox.co.il Sun Jun 11 03:48:33 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 11 Jun 2006 13:48:33 +0300 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA719C@mtlexch01.mtl.com> Jack put the bug fix into OFED 1.0. Tziporet -----Original Message----- From: James Lentini [mailto:jlentini at netapp.com] Sent: Saturday, June 10, 2006 1:12 AM To: Tziporet Koren Cc: Jack Morgenstein; openib-general; Arlin Davis Subject: Re: [openib-general] Re: [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS On Fri, 9 Jun 2006, Arlin Davis wrote: > James Lentini wrote: > > > On Thu, 8 Jun 2006, Jack Morgenstein wrote: > > > > > > > On Wednesday 07 June 2006 18:26, James Lentini wrote: > > > > > > > On Wed, 7 Jun 2006, Jack Morgenstein wrote: > > > > > > > > > This (bug fix) can still be included in next-week's release, if you > > > > > think it is important (I have extracted it from the changes checked > > > > > in at svn 7755) > > > > > > > > > If you are going to make another release anyway, then I would included > > > > it. > > > > > > > Do you mean -- include the fix in next week's release -- or -- wait with > > > the fix for the following release? > > > > > > > I'd include the fix in the next release, but I wouldn't create a special > > release just for this fix. > > > So are we getting this in next weeks release or not? I think we need it. Tziporet, Will this fix be in the next OFED release? From mst at mellanox.co.il Sun Jun 11 03:52:10 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 11 Jun 2006 13:52:10 +0300 Subject: [openib-general] [PATCH 1/5] ib_addr: retrieve MGID from device address In-Reply-To: <000e01c68c0d$5d31b500$ff0da8c0@amr.corp.intel.com> References: <000e01c68c0d$5d31b500$ff0da8c0@amr.corp.intel.com> Message-ID: <20060611105210.GA7359@mellanox.co.il> Quoting r.
Sean Hefty : > Subject: [PATCH 1/5] ib_addr: retrieve MGID from device address > > Extract the MGID used by ipoib for broadcast traffic from the device > address. > > Signed-off-by: Sean Hefty > --- > This will be used to get the MCMemberRecord for the ipoib broadcast group. > > --- svn3/gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_addr.h 2006-05-25 11:18:47.000000000 -0700 > +++ svn/gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_addr.h 2006-06-06 16:14:11.000000000 -0700 > @@ -89,6 +89,11 @@ static inline void ib_addr_set_pkey(stru > dev_addr->broadcast[9] = (unsigned char) pkey; > } > > +static inline union ib_gid *ib_addr_get_mgid(struct rdma_dev_addr *dev_addr) > +{ > + return (union ib_gid *) (dev_addr->broadcast + 4); > +} > + > static inline union ib_gid *ib_addr_get_sgid(struct rdma_dev_addr *dev_addr) > { > return (union ib_gid *) (dev_addr->src_dev_addr + 4); > dev_addr->broadcast + 4/dev_addr->src_dev_addr + 4 may not be naturally aligned, so casting this pointer to a structure type may cause the compiler to generate incorrect code. In particular, this will generate misaligned access faults on ia64 when used, as we have already seen in the case of IPoIB. Please fix these to return the gid as char[16] instead, so that the user uses memcpy properly and so that the compiler knows the address may not be aligned. -- MST From halr at voltaire.com Sun Jun 11 04:14:22 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Jun 2006 07:14:22 -0400 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E302368825@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E302368825@mtlexch01.mtl.com> Message-ID: <1150024461.570.31906.camel@hal.voltaire.com> On Sun, 2006-06-11 at 06:46, Eitan Zahavi wrote: > Auuuch it is there! > My mistake. Sp please apply the patch to the OFED 1.0 branch too. > BTW: Is the osmtest -f a excersizes this query on the OFED 1.0 ? Yes. -- Hal > > Huh ?
What's > > > https://openfabrics.org/svn/gen2/branches/1.0/src/userspace/management/o > sm/opens > > m/osm_sa_guidinfo_record.c > > > > -- Hal > > > > > > > > Eitan Zahavi > > > Senior Engineering Director, Software Architect > > > Mellanox Technologies LTD > > > Tel:+972-4-9097208 > > > Fax:+972-4-9593245 > > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > -----Original Message----- > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > > Sent: Sunday, June 11, 2006 12:22 AM > > > > To: Eitan Zahavi > > > > Cc: OPENIB > > > > Subject: Re: [PATCH] osm: fix num of blocks of GUIDInfo GetTable > query > > > > > > > > Eitan, > > > > > > > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > > > > > Hi Hal > > > > > > > > > > I'm working on passing osmtest check. Found a bug in the new > > > > > GUIDInfoRecord query: If you had a physical port with zero > guid_cap > > > > > the code would loop on blocks 0..255 instead of trying the next > > > port. > > > > > > > > > > I am still looking for why we might have a guid_cap == 0 on some > > > > > ports. > > > > > > > > > > This patch resolves this new problem. osmtest passes on some > > > arbitrary > > > > > networks. > > > > > > > > > > Eitan > > > > > > > > > > Signed-off-by: Eitan Zahavi > > > > > > > > Thanks. Applied to trunk only. > > > > > > > > Let me know if it also should be applied to 1.0. 
> > > > > > > > -- Hal From bugzilla-daemon at openib.org Sun Jun 11 05:55:54 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Sun, 11 Jun 2006 05:55:54 -0700 (PDT) Subject: [openib-general] [Bug 131] New: working with huge pages may crash the kernel on Suse10 Message-ID: <20060611125554.9656E2287AC@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=131 Summary: working with huge pages may crash the kernel on Suse10 Product: OpenFabrics Linux Version: 1.0rc6 Platform: X86-64 OS/Version: Other Status: NEW Severity: normal Priority: P2 Component: IB Core AssignedTo: bugzilla at openib.org ReportedBy: dotanb at mellanox.co.il ************************************************************* Host Architecture : x86_64 Linux Distribution: SUSE LINUX 10.0 (X86-64) OSS VERSION = 10.0 Kernel Version : 2.6.13-15-smp Memory size : 5099744 kB Driver Version : OFED-1.0-rc6-post1 HCA ID(s) : mthca0 HCA model(s) : 25218 FW version(s) : 5.1.915 Board(s) : MT_0200000001 ************************************************************* working with huge pages may cause a kernel crash in sus10: kernel 2.6.13-15-smp. everything was fine when we used kenels 2.6.9, 2.6.16 . 
here is the back trace from the /var/log/messages: Jun 9 15:15:03 sw030 kernel: general protection fault: 0000 [1] SMP Jun 9 15:15:03 sw030 kernel: CPU 1 Jun 9 15:15:03 sw030 kernel: Modules linked in: rdma_ucm ib_sdp rdma_cm ib_addr ib_cm ib_local_sa findex ib_ipoib ib_sa ib_uverbs ib_umad ib_mthca ib_mad ib_core memtrack mst_pciconf mst_pci hfsplus vfat fat subfs freq_table autofs4 edd ipv6 button battery ac af_packet floppy e1000 i2c_i801 i2c_core generic ide_core ehci_hcd hw_random uhci_hcd usbcore shpchp pci_hotplug parport_pc lp parport dm_mod ext3 jbd fan thermal processor aic79xx scsi_transport_spi sg sr_mod cdrom ata_piix libata sd_mod scsi_mod Jun 9 15:15:03 sw030 kernel: Pid: 1822, comm: mr_test Tainted: G U 2.6.13-15-smp Jun 9 15:15:03 sw030 kernel: RIP: 0010:[] {set_page_dirty+34} Jun 9 15:15:03 sw030 kernel: RSP: 0018:ffff81007c5a9e20 EFLAGS: 00010286 Jun 9 15:15:03 sw030 kernel: RAX: 803d9290c7c7485b RBX: 0000000000000001 RCX: ffff8100016cf000 Jun 9 15:15:03 sw030 kernel: RDX: ffffffff80183550 RSI: ffff8100016cf000 RDI: ffff8100016cf038 Jun 9 15:15:03 sw030 kernel: RBP: ffff8100016cf038 R08: 0000000000001000 R09: ffff810051568cd8 Jun 9 15:15:03 sw030 kernel: R10: 000000000000003f R11: ffffffff801dd920 R12: 0000000000000001 Jun 9 15:15:03 sw030 kernel: R13: ffff810064415ca8 R14: ffff81000dc86000 R15: 0000000000000001 Jun 9 15:15:03 sw030 kernel: FS: 00002aaaab21c0a0(0000) GS:ffffffff8050e880(0000) knlGS:0000000000000000 Jun 9 15:15:03 sw030 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jun 9 15:15:03 sw030 kernel: CR2: 0000000000603018 CR3: 000000003d2df000 CR4: 00000000000006e0 Jun 9 15:15:03 sw030 kernel: Process mr_test (pid: 1822, threadinfo ffff81007c5a8000, task ffff81007bc743f0) Jun 9 15:15:03 sw030 kernel: Stack: ffffffff8016ce49 ffff810072047a00 0000000000000001 ffff81005259f000 Jun 9 15:15:03 sw030 kernel: ffffffff882e5c3a ffff810072047a00 ffff8100585fe000 ffff810064415cd0 Jun 9 15:15:03 sw030 kernel: ffff810064415ca8 
ffff810064415c80 Jun 9 15:15:03 sw030 kernel: Call Trace:{set_page_dirty_lock+41} {:ib_uverbs:__ib_umem_release+122} Jun 9 15:15:03 sw030 kernel: {:ib_uverbs:ib_umem_release+14} {:ib_uverbs:ib_uverbs_dereg_mr+245} Jun 9 15:15:03 sw030 kernel: {tty_write+578} {:ib_uverbs:ib_uverbs_write+158} Jun 9 15:15:03 sw030 kernel: {vfs_write+234} {sys_write+83} Jun 9 15:15:03 sw030 kernel: {system_call+126} Jun 9 15:15:03 sw030 kernel: Jun 9 15:15:03 sw030 kernel: Code: 48 8b 40 20 48 85 c0 74 06 49 89 c3 41 ff e3 e9 4a 17 02 00 Jun 9 15:15:03 sw030 kernel: RIP {set_page_dirty+34} RSP ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From tziporet at mellanox.co.il Sun Jun 11 07:31:34 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 11 Jun 2006 17:31:34 +0300 Subject: [openib-general] [openfabrics-ewg] OFED 1.0-rc6 tarball available with working ipath driver In-Reply-To: <1149895236.27921.2.camel@pelerin.serpentine.com> References: <1149895236.27921.2.camel@pelerin.serpentine.com> Message-ID: <448C2946.5010707@mellanox.co.il> Bryan O'Sullivan wrote: > Due to unfortunate timing, the ipath driver in OFED 1.0-rc6 does not > work correctly. You can download an updated tarball from here, for > which the ipath driver works fine: > > http://openib.red-bean.com/OFED-1.0-rc6+ipath.tar.bz2 > > Alternatively, pull the necessary patches from SVN. > > > > __ Hi Bryan You have forgotten some of the patches in your tarball file, thus several OSes do not pass (e.g. RH EL4 U3).
/openib-1.0/patches/ > ls */ipath* 2.6.11_FC4/ipath_backport.patch 2.6.13/ipath_backport.patch 2.6.15/ipath_backport.patch 2.6.11/ipath_backport.patch 2.6.13_suse10_0_u/ipath_backport.patch 2.6.9/ipath_backport.patch 2.6.12/ipath_backport.patch 2.6.14/ipath_backport.patch fixes/ipath_rollup.patch /openib-1.0/patches/ > ls 2.6.11/ 2.6.12/ 2.6.13_suse10_0_u/ 2.6.15/ 2.6.16_sles10/ 2.6.9/ dapl/ memtrack/ 2.6.11_FC4/ 2.6.13/ 2.6.14/ 2.6.16/ 2.6.17/ 2.6.9_U3/ fixes/ I took the liberty of copying these patches to svn since I noticed that the ipath backport patches are the same for all OSes & kernels. Please take a look and make sure the ipath driver compiles & loads on all supported systems (including ia64, PPC64 etc.) Tziporet From mamidala at cse.ohio-state.edu Sun Jun 11 09:51:02 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Sun, 11 Jun 2006 12:51:02 -0400 (EDT) Subject: [openib-general] [PATCH 0/5] multicast abstraction In-Reply-To: <4489F43D.60502@ichips.intel.com> Message-ID: Hi Sean, I am planning to use RDMA CM for multicast functionality. It would be great if you could point me to a simple multicast test program using RDMA CM. Thanks, Amith On Fri, 9 Jun 2006, Sean Hefty wrote: > Sean Hefty wrote: > > This patch series enhances support for joining and leaving multicast groups, > > providing the following functionality: > > > > 1. Users identify a multicast group by a multicast IP address. > > 2. A user binds to a local RDMA device based on resolving the IP address. > > 3. A new multicast group is created. The parameters for the multicast group are > > obtained based on the ipoib broadcast group, and the MGID is derived using the > > same algorithm as ipoib, except with a different signature. > > 4. Any QP associated with the join is attached to the group once the join > > operation completes. > > 5. A QP may join multiple groups. > > I forgot to mention that this patch series is dependent on adding UD QP support > to the RDMA CM.
> > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Sun Jun 11 10:42:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 11 Jun 2006 20:42:41 +0300 Subject: [openib-general] bug report: mad.c: ib_req_notify_cq called without polling cq Message-ID: <20060611174241.GA2993@mellanox.co.il> Hello, Sean! mad.c calls ib_req_notify_cq on a hotplug event in ib_mad_port_start, after QPs are attached to a CQ. Since this function does not poll the CQ, if a sufficient number of MADs arrive at the QP before ib_req_notify_cq is called, the RQ might become empty and no completion events will ever be generated. Please comment. -- MST From mst at mellanox.co.il Sun Jun 11 10:50:06 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 11 Jun 2006 20:50:06 +0300 Subject: [openib-general] [PATCH] mad: prevent duplicate RMPP sessions on responder side In-Reply-To: <4473371C.6040504@ichips.intel.com> References: <200605231459.46326.jackm@mellanox.co.il> <4473371C.6040504@ichips.intel.com> Message-ID: <20060611175006.GB2993@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH] mad: prevent duplicate RMPP sessions on responder side > > Jack Morgenstein wrote: > >Prevent opening multiple RMPP MAD transaction sessions at responder side > >with the same TID, GID/LID, class. > > > >Could happen if RMPP requests are retried while response is in progress. > > My preference for handling this is to detect and discard duplicate > requests, and verify that response MADs match a request when being sent. > See the mail thread starting at: > > http://openib.org/pipermail/openib-general/2006-April/020703.html > > This will also help us add in support for DS RMPP. > > For kernel clients, I anticipate that this sort of change is fairly small.
> Userspace support requires a bit more work, especially if we don't want to > change the ABI. Sean, is anyone looking at this? If not, given that Jack's approach does not touch ABI or API, might it make sense to merge Jack's patch after all and use that as a starting point? With the current code in 2.6.17, large RMPPs often get aborted because of the duplicate problem. On the other hand, I'm not aware of users for DS RMPP. -- MST From pasha at mellanox.co.il Sun Jun 11 12:15:41 2006 From: pasha at mellanox.co.il (Pavel Shamis (Pasha)) Date: Sun, 11 Jun 2006 22:15:41 +0300 Subject: [openib-general] [openfabrics-ewg] RE: OFED-1.0-rc6 is available In-Reply-To: References: Message-ID: <448C6BDD.6030007@mellanox.co.il> We also did performance checks on different platforms, and the default MTU was changed to 1K (2K is more optimal for DDR platforms). Thank you for pointing out the issue. Pavel Shamis (Pasha) Scott Weitzenkamp (sweitzen) wrote: > The MTU change undos the changes for bug 81, so I have reopened bug 81 > (http://openib.org/bugzilla/show_bug.cgi?id=81). > > With rc6, PCI-X osu_bw and osu_bibw performance is bad, and PCI-E > osu_bibw performance is bad. I've enclosed some performance data, look > at rc4 vs rc5 vs rc6 for Cougar/Cheetah/LionMini. > > Are there other benchmarks driving the changes in rc6 (and rc4)? > > Scott Weitzenkamp > SQA and Release Manager > Server Virtualization Business Unit > Cisco Systems > > > > > *OSU MPI:* > > · Added mpi_alltoall fine tuning parameters > > · Added default configuration/documentation file > $MPIHOME/etc/mvapich.conf > > · Added shell configuration files $MPIHOME/etc/mvapich.csh , > $MPIHOME/etc/mvapich.csh > > · Default MTU was changed back to 2K for InfiniHost III Ex > and InfiniHost III Lx HCAs.
For InfiniHost card recommended value is: > VIADEV_DEFAULT_MTU=MTU1024 > > > ------------------------------------------------------------------------ > > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg From bugzilla-daemon at openib.org Sun Jun 11 13:57:05 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Sun, 11 Jun 2006 13:57:05 -0700 (PDT) Subject: [openib-general] [Bug 1] kernel prints out error message for each ib interface Message-ID: <20060611205705.0D8A52287AC@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=1 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |CLOSED ------- Comment #3 from sweitzen at cisco.com 2006-06-11 13:57 ------- Message still there in RHEL4 U3. Close bug because it is benign. [root at svbu-qa1850-1 ~]# uname -a Linux svbu-qa1850-1 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x8 6_64 x86_64 GNU/Linux [root at svbu-qa1850-1 ~]# dmesg | grep divert divert: not allocating divert_blk for non-ethernet device lo divert: allocating divert_blk for eth0 divert: allocating divert_blk for eth1 divert: not allocating divert_blk for non-ethernet device ib0 divert: not allocating divert_blk for non-ethernet device sit0 ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From rdreier at cisco.com Sun Jun 11 17:02:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 11 Jun 2006 17:02:45 -0700 Subject: [openib-general] [PATCH 2/2] ipoib: handle multicast group reset notification In-Reply-To: <4489BF48.8010804@ichips.intel.com> (Sean Hefty's message of "Fri, 09 Jun 2006 11:34:48 -0700") References: <4489BF48.8010804@ichips.intel.com> Message-ID: Sean> Any issue committing this? No, looks fine. - R. 
From rdreier at cisco.com Sun Jun 11 17:06:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 11 Jun 2006 17:06:12 -0700 Subject: [openib-general] [PATCH 5/5] ucma: export multicast support to userspace In-Reply-To: <001201c68c12$5adc3a00$ff0da8c0@amr.corp.intel.com> (Sean Hefty's message of "Fri, 9 Jun 2006 15:16:28 -0700") References: <001201c68c12$5adc3a00$ff0da8c0@amr.corp.intel.com> Message-ID: > @@ -58,6 +58,8 @@ enum { > RDMA_USER_CM_CMD_GET_EVENT, > RDMA_USER_CM_CMD_GET_OPTION, > RDMA_USER_CM_CMD_SET_OPTION, > + RDMA_USER_CM_CMD_JOIN_MCAST, > + RDMA_USER_CM_CMD_LEAVE_MCAST, > RDMA_USER_CM_CMD_GET_DST_ATTR > }; I think this changes the exported ABI by changing the value of RDMA_USER_CM_CMD_GET_DST_ATTR, right? - R. From greg.lindahl at qlogic.com Sun Jun 11 17:40:29 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Sun, 11 Jun 2006 17:40:29 -0700 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E302368825@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E302368825@mtlexch01.mtl.com> Message-ID: <20060612004029.GA16596@greglaptop.hsd1.ca.comcast.net> So this is a _critical_ bugfix? > Auuuch it is there! > My mistake. So please apply the patch to the OFED 1.0 branch too. > BTW: Does osmtest -f exercise this query on OFED 1.0? > > > Huh? What's > > > https://openfabrics.org/svn/gen2/branches/1.0/src/userspace/management/o > sm/opens > > m/osm_sa_guidinfo_record.c > > > > -- Hal > > > > > > > > Eitan Zahavi > > > Senior Engineering Director, Software Architect > > > Mellanox Technologies LTD > > > Tel:+972-4-9097208 > > > Fax:+972-4-9593245 > > > P.O.
Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > -----Original Message----- > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > > Sent: Sunday, June 11, 2006 12:22 AM > > > > To: Eitan Zahavi > > > > Cc: OPENIB > > > > Subject: Re: [PATCH] osm: fix num of blocks of GUIDInfo GetTable > query > > > > > > > > Eitan, > > > > > > > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > > > > > Hi Hal > > > > > > > > > > I'm working on passing osmtest check. Found a bug in the new > > > > > GUIDInfoRecord query: If you had a physical port with zero > guid_cap > > > > > the code would loop on blocks 0..255 instead of trying the next > > > port. > > > > > > > > > > I am still looking for why we might have a guid_cap == 0 on some > > > > > ports. > > > > > > > > > > This patch resolves this new problem. osmtest passes on some > > > arbitrary > > > > > networks. > > > > > > > > > > Eitan > > > > > > > > > > Signed-off-by: Eitan Zahavi > > > > > > > > Thanks. Applied to trunk only. > > > > > > > > Let me know if it also should be applied to 1.0. > > > > > > > > -- Hal > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Sun Jun 11 17:46:35 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Jun 2006 20:46:35 -0400 Subject: [openib-general] [PATCH] osm: fix num of blocks of GUIDInfo GetTable query In-Reply-To: <20060612004029.GA16596@greglaptop.hsd1.ca.comcast.net> References: <6AB138A2AB8C8E4A98B9C0C3D52670E302368825@mtlexch01.mtl.com> <20060612004029.GA16596@greglaptop.hsd1.ca.comcast.net> Message-ID: <1150073195.570.63586.camel@hal.voltaire.com> On Sun, 2006-06-11 at 20:40, Greg Lindahl wrote: > So this is a _critical_ bugfix ? Depends on one's definition. Anyhow, it's been applied to 1.0. -- Hal > > > Auuuch it is there! > > My mistake. 
Sp please apply the patch to the OFED 1.0 branch too. > > BTW: Is the osmtest -f a excersizes this query on the OFED 1.0 ? > > > > > Huh ? What's > > > > > https://openfabrics.org/svn/gen2/branches/1.0/src/userspace/management/o > > sm/opens > > > m/osm_sa_guidinfo_record.c > > > > > > -- Hal > > > > > > > > > > > Eitan Zahavi > > > > Senior Engineering Director, Software Architect > > > > Mellanox Technologies LTD > > > > Tel:+972-4-9097208 > > > > Fax:+972-4-9593245 > > > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > > > > -----Original Message----- > > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > > > Sent: Sunday, June 11, 2006 12:22 AM > > > > > To: Eitan Zahavi > > > > > Cc: OPENIB > > > > > Subject: Re: [PATCH] osm: fix num of blocks of GUIDInfo GetTable > > query > > > > > > > > > > Eitan, > > > > > > > > > > On Thu, 2006-06-08 at 07:24, Eitan Zahavi wrote: > > > > > > Hi Hal > > > > > > > > > > > > I'm working on passing osmtest check. Found a bug in the new > > > > > > GUIDInfoRecord query: If you had a physical port with zero > > guid_cap > > > > > > the code would loop on blocks 0..255 instead of trying the next > > > > port. > > > > > > > > > > > > I am still looking for why we might have a guid_cap == 0 on some > > > > > > ports. > > > > > > > > > > > > This patch resolves this new problem. osmtest passes on some > > > > arbitrary > > > > > > networks. > > > > > > > > > > > > Eitan > > > > > > > > > > > > Signed-off-by: Eitan Zahavi > > > > > > > > > > Thanks. Applied to trunk only. > > > > > > > > > > Let me know if it also should be applied to 1.0. 
> > > > > > > > > > -- Hal > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sean.hefty at intel.com Sun Jun 11 20:54:54 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 11 Jun 2006 20:54:54 -0700 Subject: [openib-general] [PATCH 5/5] ucma: export multicast support to userspace In-Reply-To: Message-ID: <000001c68dd3$f70d55e0$68fc070a@amr.corp.intel.com> > > @@ -58,6 +58,8 @@ enum { > > RDMA_USER_CM_CMD_GET_EVENT, > > RDMA_USER_CM_CMD_GET_OPTION, > > RDMA_USER_CM_CMD_SET_OPTION, > > + RDMA_USER_CM_CMD_JOIN_MCAST, > > + RDMA_USER_CM_CMD_LEAVE_MCAST, > > RDMA_USER_CM_CMD_GET_DST_ATTR > > }; > >I think this changes the exported ABI by changing the value of >RDMA_USER_CM_CMD_GET_DST_ATTR, right? Good catch. Originally the last 3 commands were only for multicast support, but I went back and added GET_DST_ATTR as part of the UD QP patch. I will swap those around, so that the check-in doesn't change the values. My intent is to commit the UD QP patch, followed immediately by the multicast patch, and only update the ABI once. - Sean From sean.hefty at intel.com Sun Jun 11 20:57:53 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 11 Jun 2006 20:57:53 -0700 Subject: [openib-general] [PATCH] mad: prevent duplicate RMPP sessions on responder side In-Reply-To: <20060611175006.GB2993@mellanox.co.il> Message-ID: <000101c68dd4$61724620$68fc070a@amr.corp.intel.com> >Sean, is anyone looking at this?
If not, given that Jack's approach does not >touch ABI or API, might it make sense to merge Jack's patch after all and use >that as a starting point? With current code in 2.6.17 large RMPPs often get >aborted because of the duplicate-session problem. On the other hand, I'm not >aware of users for DS RMPP. Yes - I am looking into this. I have it in my plan to fix this after multicast support for userspace. (Which is now done, pending review and updates.) - Sean From sean.hefty at intel.com Sun Jun 11 21:02:05 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 11 Jun 2006 21:02:05 -0700 Subject: [openib-general] [PATCH 0/5] multicast abstraction In-Reply-To: Message-ID: <000201c68dd4$f7edd560$68fc070a@amr.corp.intel.com> >I am planning to use RDMA CM for multicast functionality. It would be >great if you could point me to a simple multicast test program using >RDMA CM? There is a userspace test program (mckey) that will be available, but has not been posted yet. (A kernel test program would look fairly similar.) I will post that tomorrow, along with the changes to the librdmacm. Let me know if you have any questions. - Sean From sean.hefty at intel.com Sun Jun 11 21:04:59 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 11 Jun 2006 21:04:59 -0700 Subject: [openib-general] [PATCH 1/5] ib_addr: retrieve MGID from device address In-Reply-To: <20060611105210.GA7359@mellanox.co.il> Message-ID: <000301c68dd5$5f569ca0$68fc070a@amr.corp.intel.com> >dev_addr->broadcast + 4/dev_addr->src_dev_addr + 4 may not be naturally >aligned, >so casting this pointer to structure type may cause the compiler to generate >incorrect code. Thanks - I'll update this.
- Sean From sean.hefty at intel.com Sun Jun 11 21:31:46 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 11 Jun 2006 21:31:46 -0700 Subject: [openib-general] bug report: mad.c: ib_req_notify_cq called without polling cq In-Reply-To: <20060611174241.GA2993@mellanox.co.il> Message-ID: <000401c68dd9$1d9a29e0$68fc070a@amr.corp.intel.com> >mad.c calls ib_req_notify_cq on hotplug event in ib_mad_port_start, after QPs >are attached to a CQ. Since this function does not poll the CQ, if sufficient >number of MADs arrive at the QP before ib_req_notify_cq is called, RQ might get >empty and no completion events will ever be generated. This is arming the CQ _before_ we post MADs to the receive queue of the QP. I don't think that there's a race here. - Sean From mst at mellanox.co.il Sun Jun 11 22:21:49 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Jun 2006 08:21:49 +0300 Subject: [openib-general] bug report: mad.c: ib_req_notify_cq called without polling cq In-Reply-To: <000401c68dd9$1d9a29e0$68fc070a@amr.corp.intel.com> References: <000401c68dd9$1d9a29e0$68fc070a@amr.corp.intel.com> Message-ID: <20060612052149.GA3390@mellanox.co.il> Quoting r. Sean Hefty : > This is arming the CQ _before_ we post MADs to the receive queue of the QP. I > don't think that there's a race here. Good point, thanks. -- MST From mst at mellanox.co.il Mon Jun 12 05:16:35 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Jun 2006 15:16:35 +0300 Subject: [openib-general] [PATCH] mthca: memfree completion with error workaround Message-ID: <20060612121635.GX7359@mellanox.co.il> Roland, please consider the following for 2.6.17. --- Memfree firmware is in rare cases reporting WQE index == -1 in receive completion with error instead of (rq size - 1). Here is a patch to avoid kernel crash and report a correct WR id in this case. 
Since reporting a wrong WR id has severe consequences for ULPs, make the test as restrictive as possible, and report an error if we see an unexpected value. Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/hw/mthca/mthca_cq.c =================================================================== --- openib/drivers/infiniband/hw/mthca/mthca_cq.c (revision 7837) +++ openib/drivers/infiniband/hw/mthca/mthca_cq.c (working copy) @@ -542,6 +542,22 @@ } else { wq = &(*cur_qp)->rq; wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + /* WQE index == -1 might be reported by + Sinai FW 1.0.800, Arbel FW 5.1.400 and should be fixed + in later revisions. */ + if (unlikely(wqe_index >= (*cur_qp)->rq.max)) { + if (unlikely(is_error) && + unlikely(wqe_index == 0xffffffff >> wq->wqe_shift) && + mthca_is_memfree(dev)) + wqe_index = wq->max - 1; + else { + mthca_err(dev, "Corrupted RQ CQE. " + "CQ 0x%x QP 0x%x idx 0x%x > 0x%x\n", + cq->cqn, entry->qp_num, wqe_index, + wq->max); + return -EINVAL; + } + } entry->wr_id = (*cur_qp)->wrid[wqe_index]; } -- MST From mst at mellanox.co.il Mon Jun 12 05:16:47 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Jun 2006 15:16:47 +0300 Subject: [openib-general] [PATCH] libmthca: work around for cqe with error Message-ID: <20060612121647.GY7359@mellanox.co.il> Same patch as posted earlier for kernel. --- libmthca: completion with error for memfree might get WQE index == -1. Signed-off-by: Michael S. Tsirkin Index: openib/src/userspace/libmthca/src/cq.c =================================================================== --- openib/src/userspace/libmthca/src/cq.c (revision 7890) +++ openib/src/userspace/libmthca/src/cq.c (working copy) @@ -349,6 +349,22 @@ } else { wq = &(*cur_qp)->rq; wqe_index = ntohl(cqe->wqe) >> wq->wqe_shift; + /* WQE index == -1 might be reported by + Sinai FW 1.0.800, Arbel FW 5.1.400 and should be fixed + in later revisions. 
*/ + if (wqe_index >= (*cur_qp)->rq.max) { + if (is_error && + (wqe_index == 0xffffffff >> wq->wqe_shift) && + mthca_is_memfree(cq->ibv_cq.context)) + wqe_index = wq->max - 1; + else { + printf("Corrupted RQ CQE. " + "CQ 0x%x QP 0x%x idx 0x%x > 0x%x\n", + cq->cqn, wc->qp_num, wqe_index, + wq->max); + return -1; + } + } wc->wr_id = (*cur_qp)->wrid[wqe_index]; } -- MST From mst at mellanox.co.il Mon Jun 12 05:48:33 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Jun 2006 15:48:33 +0300 Subject: [openib-general] potential bug: ipoib freeing ah before completion Message-ID: <20060612124833.GA19452@mellanox.co.il> Hello, Roland! The following was noted by Eli Cohen: It seems that ipoib_flush_paths can be called while completions are still outstanding on IPoIB QP (e.g. from ipoib_ib_dev_flush). If this happens, an address handle might get freed while a work request is still outstanding for it. This can trigger a local QP error, and IPoIB will stop working, until QP is reset. Please comment. -- MST From mst at mellanox.co.il Mon Jun 12 06:57:51 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Jun 2006 16:57:51 +0300 Subject: [openib-general] [PATCH] mthca: restore missing registers Message-ID: <20060612135751.GB19518@mellanox.co.il> Roland, please consider the following for 2.6.17. --- mthca misses restoring the following PCI-X/PCI-Express registers at reset: PCI-X device: PCI-X command register PCI-X bridge: upstream and downstream split transaction registers PCI-Express : PCI-Express device control and link control registers This causes instability and/or bad performance on systems where one of these registers is set to a non-default value by BIOS. Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.16/drivers/infiniband/hw/mthca/mthca_reset.c =================================================================== --- linux-2.6.16.orig/drivers/infiniband/hw/mthca/mthca_reset.c 2006-04-26 15:04:26.000000000 +0300 +++ linux-2.6.16/drivers/infiniband/hw/mthca/mthca_reset.c 2006-06-11 21:52:44.000000000 +0300 @@ -48,6 +48,12 @@ int mthca_reset(struct mthca_dev *mdev) u32 *hca_header = NULL; u32 *bridge_header = NULL; struct pci_dev *bridge = NULL; + int bridge_pcix_cap = 0; + int hca_pcie_cap = 0; + int hca_pcix_cap = 0; + + u16 devctl; + u16 linkctl; #define MTHCA_RESET_OFFSET 0xf0010 #define MTHCA_RESET_VALUE swab32(1) @@ -109,6 +115,9 @@ int mthca_reset(struct mthca_dev *mdev) } } + hca_pcix_cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + hca_pcie_cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (bridge) { bridge_header = kmalloc(256, GFP_KERNEL); if (!bridge_header) { @@ -128,6 +137,13 @@ int mthca_reset(struct mthca_dev *mdev) goto out; } } + bridge_pcix_cap = pci_find_capability(bridge, PCI_CAP_ID_PCIX); + if (!bridge_pcix_cap) { + err = -ENODEV; + mthca_err(mdev, "Couldn't locate HCA bridge " + "PCI-X capability, aborting.\n"); + goto out; + } } /* actually hit reset */ @@ -177,6 +193,20 @@ int mthca_reset(struct mthca_dev *mdev) good: /* Now restore the PCI headers */ if (bridge) { + if (pci_write_config_dword(bridge, bridge_pcix_cap + 0x8, + bridge_header[(bridge_pcix_cap + 0x8)/ 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge Upstream " + "split transaction control, aborting.\n"); + goto out; + } + if (pci_write_config_dword(bridge, bridge_pcix_cap + 0xc, + bridge_header[(bridge_pcix_cap + 0xc)/ 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge Downstream " + "split transaction control, aborting.\n"); + goto out; + } /* * Bridge control register is at 0x3e, so we'll * naturally restore it last in this loop. 
@@ -202,6 +232,35 @@ good: } } + if (hca_pcix_cap) { + if (pci_write_config_dword(mdev->pdev, hca_pcix_cap, + hca_header[hca_pcix_cap / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA PCI-X " + "command register, aborting.\n"); + goto out; + } + } + + if (hca_pcie_cap) { + devctl = hca_header[(hca_pcie_cap + 0x8)/ 4]; + if (pci_write_config_word(mdev->pdev, hca_pcie_cap + 0x8, + devctl)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA PCI-X " + "Device Control register, aborting.\n"); + goto out; + } + linkctl = hca_header[(hca_pcie_cap + 0x10)/ 4]; + if (pci_write_config_word(mdev->pdev, hca_pcie_cap + 0x10, + linkctl)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA PCI-Express " + "Link control register, aborting.\n"); + goto out; + } + } + for (i = 0; i < 16; ++i) { if (i * 4 == PCI_COMMAND) continue; -- MST From eitan at mellanox.co.il Mon Jun 12 06:59:14 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 12 Jun 2006 16:59:14 +0300 Subject: [openib-general] [PATCH] osm: partition manager force policy (and other fixes) Message-ID: <86ejxulbkd.fsf@mtl066.yok.mtl.com> Hi Hal As I started to test the partition manager code (using ibmgtsim pkey test), I realized the implementation does not really enforces the partition policy on the given fabric. This patch fixes that. It was verified using the simulation test. Several other corner cases were fixed too. Eitan Signed-off-by: Eitan Zahavi Index: include/opensm/osm_port.h =================================================================== --- include/opensm/osm_port.h (revision 7867) +++ include/opensm/osm_port.h (working copy) @@ -586,6 +586,39 @@ osm_physp_get_pkey_tbl( IN const osm_phy * Port, Physical Port *********/ +/****f* OpenSM: Physical Port/osm_physp_get_mod_pkey_tbl +* NAME +* osm_physp_get_mod_pkey_tbl +* +* DESCRIPTION +* Returns a NON CONST pointer to the P_Key table object of the Physical Port object. 
+* +* SYNOPSIS +*/ +static inline osm_pkey_tbl_t * +osm_physp_get_mod_pkey_tbl( IN osm_physp_t* const p_physp ) +{ + CL_ASSERT( osm_physp_is_valid( p_physp ) ); + /* + (14.2.5.7) - the block number valid values are 0-2047, and are further + limited by the size of the P_Key table specified by the PartitionCap on the node. + */ + return( &p_physp->pkeys ); +}; +/* +* PARAMETERS +* p_physp +* [in] Pointer to an osm_physp_t object. +* +* RETURN VALUES +* The pointer to the P_Key table object. +* +* NOTES +* +* SEE ALSO +* Port, Physical Port +*********/ + /****f* OpenSM: Physical Port/osm_physp_set_slvl_tbl * NAME * osm_physp_set_slvl_tbl Index: include/opensm/osm_pkey.h =================================================================== --- include/opensm/osm_pkey.h (revision 7867) +++ include/opensm/osm_pkey.h (working copy) @@ -92,6 +92,8 @@ typedef struct _osm_pkey_tbl cl_ptr_vector_t blocks; cl_ptr_vector_t new_blocks; cl_map_t keys; + cl_qlist_t pending; + uint16_t used_blocks; } osm_pkey_tbl_t; /* * FIELDS @@ -104,6 +106,13 @@ typedef struct _osm_pkey_tbl * keys * A set holding all keys * +* pending +* A list osm_pending_pkey structs that is temporarily set by the +* pkey mgr and used during pkey mgr algorithm only +* +* used_blocks +* Tracks the number of blocks having non-zero pkeys +* * NOTES * 'blocks' vector should be used to store pkey values obtained from * the port and SM pkey manager should not change it directly, for this @@ -114,6 +123,39 @@ typedef struct _osm_pkey_tbl * *********/ +/****s* OpenSM: osm_pending_pkey_t +* NAME +* osm_pending_pkey_t +* +* DESCRIPTION +* This objects stores temporary information on pkeys their target block and index +* during the pkey manager operation +* +* SYNOPSIS +*/ +typedef struct _osm_pending_pkey { + cl_list_item_t list_item; + uint16_t pkey; + uint32_t block; + uint8_t index; + boolean_t is_new; +} osm_pending_pkey_t; +/* +* FIELDS +* pkey +* The actual P_Key +* +* block +* The block index based on the previous 
table extracted from the device +* +* index +* The index of the pky within the block +* +* is_new +* TRUE for new P_Keys such that the block and index are invalid in that case +* +*********/ + /****f* OpenSM: osm_pkey_tbl_construct * NAME * osm_pkey_tbl_construct @@ -263,6 +305,41 @@ void osm_pkey_tbl_sync_new_blocks( * *********/ +/****f* OpenSM: osm_pkey_tbl_get_block_and_idx +* NAME +* osm_pkey_tbl_get_block_and_idx +* +* DESCRIPTION +* set the block index and pkey index the given +* pkey is found in. return 1 if cound not find +* it, 0 if OK +* +* SYNOPSIS +*/ +int +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *block_idx, + OUT uint8_t *pkey_index); +/* +* p_pkey_tbl +* [in] Pointer to osm_pkey_tbl_t object. +* +* p_pkey +* [in] Pointer to the P_Key entry searched +* +* p_block_idx +* [out] Pointer to the block index to be updated +* +* p_pkey_idx +* [out] Pointer to the pkey index (in the block) to be updated +* +* +* NOTES +* +*********/ + /****f* OpenSM: osm_pkey_tbl_set * NAME * osm_pkey_tbl_set Index: opensm/osm_prtn.c =================================================================== --- opensm/osm_prtn.c (revision 7904) +++ opensm/osm_prtn.c (working copy) @@ -140,6 +140,12 @@ ib_api_status_t osm_prtn_add_port(osm_lo p_tbl = (full == TRUE) ? &p->full_guid_tbl : &p->part_guid_tbl ; + osm_log(p_log, OSM_LOG_VERBOSE, "osm_prtn_add_port: " + "Added port 0x%" PRIx64 " to " + "partition \'%s\' (0x%04x) As %s member\n", + cl_ntoh64(guid), p->name, cl_ntoh16(p->pkey), + full ? 
"full" : "partial" ); + if (cl_map_insert(p_tbl, guid, p_physp) == NULL) return IB_INSUFFICIENT_MEMORY; Index: opensm/osm_pkey.c =================================================================== --- opensm/osm_pkey.c (revision 7904) +++ opensm/osm_pkey.c (working copy) @@ -100,6 +100,8 @@ int osm_pkey_tbl_init( cl_ptr_vector_init( &p_pkey_tbl->blocks, 0, 1); cl_ptr_vector_init( &p_pkey_tbl->new_blocks, 0, 1); cl_map_init( &p_pkey_tbl->keys, 1 ); + cl_qlist_init( &p_pkey_tbl->pending ); + return(IB_SUCCESS); } @@ -118,14 +120,28 @@ void osm_pkey_tbl_sync_new_blocks( p_block = cl_ptr_vector_get(&p_pkey_tbl->blocks, b); if ( b < new_blocks ) p_new_block = cl_ptr_vector_get(&p_pkey_tbl->new_blocks, b); - else { + else + { p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); if (!p_new_block) break; - memset(p_new_block, 0, sizeof(*p_new_block)); cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, p_new_block); } - memcpy(p_new_block, p_block, sizeof(*p_new_block)); + + memset(p_new_block, 0, sizeof(*p_new_block)); + } +} + +/********************************************************************** + **********************************************************************/ +void osm_pkey_tbl_cleanup_pending( + IN osm_pkey_tbl_t *p_pkey_tbl) +{ + cl_list_item_t *p_item; + p_item = cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_item != cl_qlist_end( &p_pkey_tbl->pending ) ) + { + free( (osm_pending_pkey_t *)p_item ); } } @@ -202,6 +218,38 @@ int osm_pkey_tbl_set( /********************************************************************** **********************************************************************/ +int +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *p_block_idx, + OUT uint8_t *p_pkey_index) +{ + uint32_t num_of_blocks; + uint32_t block_index; + ib_pkey_table_t *block; + + CL_ASSERT( p_pkey_tbl ); + CL_ASSERT( p_block_idx != NULL ); + CL_ASSERT( p_pkey_idx != NULL ); + + 
num_of_blocks = cl_ptr_vector_get_size( &p_pkey_tbl->blocks); + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + { + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + if ( ( block->pkey_entry <= p_pkey ) && + ( p_pkey < block->pkey_entry + IB_NUM_PKEY_ELEMENTS_IN_BLOCK)) + { + *p_block_idx = block_index; + *p_pkey_index = p_pkey - block->pkey_entry; + return 0; + } + } + return 1; +} + +/********************************************************************** + **********************************************************************/ static boolean_t __osm_match_pkey ( IN const ib_net16_t *pkey1, IN const ib_net16_t *pkey2 ) { @@ -321,7 +369,8 @@ osm_port_share_pkey( OSM_LOG_ENTER( p_log, osm_port_share_pkey ); - if (!p_port_1 || !p_port_2) { + if (!p_port_1 || !p_port_2) + { ret = FALSE; goto Exit; } @@ -329,7 +378,8 @@ osm_port_share_pkey( p_physp1 = osm_port_get_default_phys_ptr(p_port_1); p_physp2 = osm_port_get_default_phys_ptr(p_port_2); - if (!p_physp1 || !p_physp2) { + if (!p_physp1 || !p_physp2) + { ret = FALSE; goto Exit; } Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 7904) +++ opensm/osm_pkey_mgr.c (working copy) @@ -62,6 +62,138 @@ /********************************************************************** **********************************************************************/ +/* + the max number of pkey blocks for a physical port is located in + different place for switch external ports (SwitchInfo) and the + rest of the ports (NodeInfo) +*/ +static int pkey_mgr_get_physp_max_blocks( + IN const osm_subn_t *p_subn, + IN const osm_physp_t *p_physp) +{ + osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); + osm_switch_t *p_sw; + uint16_t num_pkeys = 0; + + if ( (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) || + (osm_physp_get_port_num( p_physp ) == 0)) + num_pkeys = cl_ntoh16( p_node->node_info.partition_cap ); + else + { + p_sw = 
osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); + if (p_sw) + num_pkeys = cl_ntoh16( p_sw->switch_info.enforce_cap ); + } + return( (num_pkeys + 31) / 32 ); +} + +/********************************************************************** + **********************************************************************/ +/* + * Insert the new pending pkey entry to the specific port pkey table + * pending pkeys. new entries are inserted at the back. + */ +static void pkey_mgr_process_physical_port( + IN osm_log_t *p_log, + IN const osm_req_t *p_req, + IN const ib_net16_t pkey, + IN osm_physp_t *p_physp ) +{ + osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); + osm_pkey_tbl_t *p_pkey_tbl; + ib_net16_t *p_orig_pkey; + char *stat = NULL; + osm_pending_pkey_t *p_pending; + + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); + if (! p_pkey_tbl) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0501: " + "No pkey table found for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + + p_pending = (osm_pending_pkey_t *)malloc(sizeof(osm_pending_pkey_t)); + if (! 
p_pending) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0502: " + "Fail to allocate new pending pkey entry for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + p_pending->pkey = pkey; + p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + if ( !p_orig_pkey || (ib_pkey_get_base(*p_orig_pkey) != ib_pkey_get_base(pkey) )) + { + p_pending->is_new = TRUE; + cl_qlist_insert_tail(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "inserted"; + } + else + { + p_pending->is_new = FALSE; + if (osm_pkey_tbl_get_block_and_idx(p_pkey_tbl, p_orig_pkey, + &p_pending->block, &p_pending->index)) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0503: " + "Fail to obtain P_Key 0x%04x block and index for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + cl_qlist_insert_head(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "updated"; + } + + osm_log( p_log, OSM_LOG_VERBOSE, + "pkey_mgr_process_physical_port: " + "pkey 0x%04x was %s for node 0x%016" PRIx64 + " port %u\n", + cl_ntoh16( pkey ), stat, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); +} + +/********************************************************************** + **********************************************************************/ +static void +pkey_mgr_process_partition_table( + osm_log_t *p_log, + const osm_req_t *p_req, + const osm_prtn_t *p_prtn, + const boolean_t full ) +{ + const cl_map_t *p_tbl = full ? 
+ &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; + cl_map_iterator_t i, i_next; + ib_net16_t pkey = p_prtn->pkey; + osm_physp_t *p_physp; + + if ( full ) + pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); + + i_next = cl_map_head( p_tbl ); + while ( i_next != cl_map_end( p_tbl ) ) + { + i = i_next; + i_next = cl_map_next( i ); + p_physp = cl_map_obj( i ); + if ( p_physp && osm_physp_is_valid( p_physp ) ) + pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); + } +} + +/********************************************************************** + **********************************************************************/ static ib_api_status_t pkey_mgr_update_pkey_entry( IN const osm_req_t *p_req, @@ -131,80 +263,153 @@ pkey_mgr_enforce_partition( /********************************************************************** **********************************************************************/ -/* - * Prepare a new entry for the pkey table for this port when this pkey - * does not exist. Update existed entry when membership was changed. 
- */ -static void pkey_mgr_process_physical_port( - IN osm_log_t *p_log, - IN const osm_req_t *p_req, - IN const ib_net16_t pkey, - IN osm_physp_t *p_physp ) +static boolean_t pkey_mgr_update_port( + osm_log_t *p_log, + osm_req_t *p_req, + const osm_port_t * const p_port ) { - osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); - ib_pkey_table_t *block; - uint16_t block_index; + osm_physp_t *p_physp; + osm_node_t *p_node; + ib_pkey_table_t *block, *new_block, *p_old_block; + osm_pkey_tbl_t *p_pkey_tbl; + uint16_t block_index = 0; + uint16_t last_free_block_index = 0; + uint16_t last_free_entry_index = 0; uint16_t num_of_blocks; - const osm_pkey_tbl_t *p_pkey_tbl; - ib_net16_t *p_orig_pkey; - char *stat = NULL; - uint32_t i; + uint16_t max_num_of_blocks; - p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); + ib_api_status_t status; + boolean_t ret_val = FALSE; + osm_pending_pkey_t *p_pending; + boolean_t found; + + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) + return FALSE; + + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, p_physp ); - p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + osm_pkey_tbl_sync_new_blocks( p_pkey_tbl ); + cl_map_remove_all( &p_pkey_tbl->keys ); + p_pkey_tbl->used_blocks = 0; - if ( !p_orig_pkey ) - { - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + /* process every pending pkey in order - first must be "updated" last are "new" */ + p_pending = (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_pending != (osm_pending_pkey_t *)cl_qlist_end( &p_pkey_tbl->pending ) ) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) + if (p_pending->is_new == FALSE) { - if ( ib_pkey_is_invalid( block->pkey_entry[i] ) ) + block = 
osm_pkey_tbl_new_block_get( p_pkey_tbl, p_pending->block ); + if (block == NULL) { - block->pkey_entry[i] = pkey; - stat = "inserted"; - goto _done; + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0504: " + "failed to get block %d for node 0x%016" PRIx64 " port %u\n", + p_pending->block, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); } + else + { + p_old_block = osm_pkey_tbl_block_get( p_pkey_tbl, p_pending->block ); + CL_ASSERT( p_old_block != NULL ); + cl_map_insert( &p_pkey_tbl->keys, + ib_pkey_get_base(p_pending->pkey), + &(p_old_block->pkey_entry[p_pending->index])); + block->pkey_entry[p_pending->index] = p_pending->pkey; + if (p_pkey_tbl->used_blocks < p_pending->index) + p_pending->index = p_pending->index; } } + else + { + /* need either an empty entry or next block */ + block = osm_pkey_tbl_new_block_get( p_pkey_tbl, last_free_block_index ); + found = FALSE; + while ( !found && (last_free_block_index < max_num_of_blocks)) + { + if ( block->pkey_entry[last_free_entry_index] == 0) + found = TRUE; + else + { + if (last_free_entry_index == IB_NUM_PKEY_ELEMENTS_IN_BLOCK) + { + last_free_entry_index = 0; + last_free_block_index++; + block = osm_pkey_tbl_new_block_get( p_pkey_tbl, last_free_block_index ); + if ((!block) && (last_free_block_index < max_num_of_blocks)) + { + block = (ib_pkey_table_t *)malloc(sizeof(*block)); + if (!block) + { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_process_physical_port: ERR 0501: " - "No empty pkey entry was found to insert 0x%04x for node " - "0x%016" PRIx64 " port %u\n", - cl_ntoh16( pkey ), + "pkey_mgr_update_port: ERR 0513: " + "failed to allocate new block %d for node 0x%016" PRIx64 " port %u\n", + last_free_block_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); + continue; + } + cl_ptr_vector_set(&p_pkey_tbl->new_blocks, last_free_block_index, block); } - else if ( *p_orig_pkey != pkey ) + } + else { - for ( block_index = 0; 
block_index < num_of_blocks; block_index++ ) + last_free_entry_index++; + } + } + } + + if ( !found ) { - /* we need real block (not just new_block) in order - * to resolve block/pkey indices */ - block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - i = p_orig_pkey - block->pkey_entry; - if (i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - block->pkey_entry[i] = pkey; - stat = "updated"; - goto _done; + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0505: " + "failed to empty space for new pkey 0x%04x for node 0x%016" PRIx64 " port %u\n", + cl_ntoh16(p_pending->pkey), + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); } + else + { + p_old_block = osm_pkey_tbl_block_get( p_pkey_tbl, last_free_entry_index); + CL_ASSERT( p_old_block != NULL ); + block->pkey_entry[last_free_entry_index] = p_pending->pkey; + cl_map_insert( &p_pkey_tbl->keys, + ib_pkey_get_base(p_pending->pkey), + &(p_old_block->pkey_entry[last_free_entry_index])); + if (p_pkey_tbl->used_blocks < last_free_entry_index) + p_pending->index = last_free_entry_index; } } + free( p_pending ); + p_pending = (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + } - _done: - if (stat) { - osm_log( p_log, OSM_LOG_VERBOSE, - "pkey_mgr_process_physical_port: " - "pkey 0x%04x was %s for node 0x%016" PRIx64 - " port %u\n", - cl_ntoh16( pkey ), stat, + /* now look for changes and store */ + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + { + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); + + if (!new_block || !memcmp( new_block, block, sizeof( *block ) ) ) + continue; + + status = pkey_mgr_update_pkey_entry( p_req, p_physp , new_block, block_index ); + if (status == IB_SUCCESS) + ret_val = TRUE; + else + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0506: " + 
"pkey_mgr_update_pkey_entry() failed to update " + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", + block_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } + + return ret_val; } /********************************************************************** @@ -217,21 +422,23 @@ pkey_mgr_update_peer_port( const osm_port_t * const p_port, boolean_t enforce ) { - osm_physp_t *p, *peer; + osm_physp_t *p_physp, *peer; osm_node_t *p_node; ib_pkey_table_t *block, *peer_block; - const osm_pkey_tbl_t *p_pkey_tbl, *p_peer_pkey_tbl; + const osm_pkey_tbl_t *p_pkey_tbl; + osm_pkey_tbl_t *p_peer_pkey_tbl; osm_switch_t *p_sw; ib_switch_info_t *p_si; uint16_t block_index; uint16_t num_of_blocks; + uint16_t peer_max_blocks; ib_api_status_t status = IB_SUCCESS; boolean_t ret_val = FALSE; - p = osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) return FALSE; - peer = osm_physp_get_remote( p ); + peer = osm_physp_get_remote( p_physp ); if ( !peer || !osm_physp_is_valid( peer ) ) return FALSE; p_node = osm_physp_get_node_ptr( peer ); @@ -245,7 +452,7 @@ pkey_mgr_update_peer_port( if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0502: " + "pkey_mgr_update_peer_port: ERR 0507: " "pkey_mgr_enforce_partition() failed to update " "node 0x%016" PRIx64 " port %u\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), @@ -255,24 +462,36 @@ pkey_mgr_update_peer_port( if (enforce == FALSE) return FALSE; - p_pkey_tbl = osm_physp_get_pkey_tbl( p ); - p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); + p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); + p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) - num_of_blocks = 
osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); + peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); + if (peer_max_blocks < p_pkey_tbl->used_blocks) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_peer_port: ERR 0508: " + "not enough entries (%u < %u) on switch 0x%016" PRIx64 + " port %u\n", + peer_max_blocks, num_of_blocks, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( peer ) ); + return FALSE; + } - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; + for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++ ) { block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); if ( memcmp( peer_block, block, sizeof( *peer_block ) ) ) { + osm_pkey_tbl_set(p_peer_pkey_tbl, block_index, block); status = pkey_mgr_update_pkey_entry( p_req, peer, block, block_index ); if ( status == IB_SUCCESS ) ret_val = TRUE; else osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0503: " + "pkey_mgr_update_peer_port: ERR 0509: " "pkey_mgr_update_pkey_entry() failed to update " "pkey table block %d for node 0x%016" PRIx64 " port %u\n", @@ -282,7 +501,7 @@ pkey_mgr_update_peer_port( } } - if ( ret_val == TRUE && + if ( (ret_val == TRUE) && osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) { osm_log( p_log, OSM_LOG_VERBOSE, @@ -298,82 +517,6 @@ pkey_mgr_update_peer_port( /********************************************************************** **********************************************************************/ -static boolean_t pkey_mgr_update_port( - osm_log_t *p_log, - osm_req_t *p_req, - const osm_port_t * const p_port ) -{ - osm_physp_t *p; - osm_node_t *p_node; - ib_pkey_table_t *block, *new_block; - const osm_pkey_tbl_t *p_pkey_tbl; - uint16_t block_index; - uint16_t num_of_blocks; - ib_api_status_t status; - boolean_t ret_val = FALSE; - - p = 
osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) - return FALSE; - - p_pkey_tbl = osm_physp_get_pkey_tbl(p); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) - { - block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - - if (!new_block || !memcmp( new_block, block, sizeof( *block ) ) ) - continue; - - status = pkey_mgr_update_pkey_entry( p_req, p, new_block, block_index ); - if (status == IB_SUCCESS) - ret_val = TRUE; - else - osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_port: ERR 0504: " - "pkey_mgr_update_pkey_entry() failed to update " - "pkey table block %d for node 0x%016" PRIx64 " port %u\n", - block_index, - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( p ) ); - } - - return ret_val; -} - -/********************************************************************** - **********************************************************************/ -static void -pkey_mgr_process_partition_table( - osm_log_t *p_log, - const osm_req_t *p_req, - const osm_prtn_t *p_prtn, - const boolean_t full ) -{ - const cl_map_t *p_tbl = full ? 
- &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; - cl_map_iterator_t i, i_next; - ib_net16_t pkey = p_prtn->pkey; - osm_physp_t *p_physp; - - if ( full ) - pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); - - i_next = cl_map_head( p_tbl ); - while ( i_next != cl_map_end( p_tbl ) ) - { - i = i_next; - i_next = cl_map_next( i ); - p_physp = cl_map_obj( i ); - if ( p_physp && osm_physp_is_valid( p_physp ) ) - pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); - } -} - -/********************************************************************** - **********************************************************************/ osm_signal_t osm_pkey_mgr_process( IN osm_opensm_t *p_osm ) @@ -383,7 +526,6 @@ osm_pkey_mgr_process( osm_prtn_t *p_prtn; osm_port_t *p_port; osm_signal_t signal = OSM_SIGNAL_DONE; - osm_physp_t *p_physp; CL_ASSERT( p_osm ); @@ -394,22 +536,12 @@ osm_pkey_mgr_process( if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) { osm_log( &p_osm->log, OSM_LOG_ERROR, - "osm_pkey_mgr_process: ERR 0505: " + "osm_pkey_mgr_process: ERR 0510: " "osm_prtn_make_partitions() failed\n" ); goto _err; } - p_tbl = &p_osm->subn.port_guid_tbl; - p_next = cl_qmap_head( p_tbl ); - while ( p_next != cl_qmap_end( p_tbl ) ) - { - p_port = ( osm_port_t * ) p_next; - p_next = cl_qmap_next( p_next ); - p_physp = osm_port_get_default_phys_ptr( p_port ); - if ( osm_physp_is_valid( p_physp ) ) - osm_pkey_tbl_sync_new_blocks( osm_physp_get_pkey_tbl( p_physp ) ); - } - + /* populate the pending pkey entries by scanning all partitions */ p_tbl = &p_osm->subn.prtn_pkey_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) @@ -420,6 +552,7 @@ osm_pkey_mgr_process( pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); } + /* calculate new pkey tables and set */ p_tbl = &p_osm->subn.port_guid_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) @@ -428,7 +561,7 @@ osm_pkey_mgr_process( p_next 
= cl_qmap_next( p_next ); if ( pkey_mgr_update_port( &p_osm->log, &p_osm->sm.req, p_port ) ) signal = OSM_SIGNAL_DONE_PENDING; - if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != IB_NODE_TYPE_SWITCH && + if ( ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != IB_NODE_TYPE_SWITCH ) && pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, &p_osm->subn, p_port, !p_osm->subn.opt.no_partition_enforcement ) ) From eli at mellanox.co.il Mon Jun 12 07:59:00 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 12 Jun 2006 17:59:00 +0300 Subject: [openib-general] potential bug: ipoib freeing ah before completion Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30249F8FF@mtlexch01.mtl.com> The issue was originally raised by Eitan Rabin. -----Original Message----- From: Michael S. Tsirkin Sent: Monday, June 12, 2006 3:49 PM To: openib-general at openib.org; Roland Dreier Cc: Eli Cohen Subject: potential bug: ipoib freeing ah before completion Hello, Roland! The following was noted by Eli Cohen: It seems that ipoib_flush_paths can be called while completions are still outstanding on IPoIB QP (e.g. from ipoib_ib_dev_flush). If this happens, an address handle might get freed while a work request is still outstanding for it. This can trigger a local QP error, and IPoIB will stop working, until QP is reset. Please comment. -- MST From jlentini at netapp.com Mon Jun 12 08:44:29 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 12 Jun 2006 11:44:29 -0400 (EDT) Subject: [openib-general] [PATCH] uDAPL openib_cma, cleanup reported CM error events, add TIMEOUT In-Reply-To: References: Message-ID: On Fri, 9 Jun 2006, Arlin Davis wrote: > James, > > I cleaned up the connection error events to report the proper events > during address resolution errors and timeouts. It was returning > incorrect DAT event codes. Looks good. I committed in revision 7931 with a few minor additions (see below). 
> Index: dapl_ib_cm.c > =================================================================== > --- dapl_ib_cm.c (revision 7839) > +++ dapl_ib_cm.c (working copy) > @@ -330,6 +330,8 @@ static void dapli_cm_active_cb(struct da > switch (event->event) { > case RDMA_CM_EVENT_UNREACHABLE: > case RDMA_CM_EVENT_CONNECT_ERROR: > + { > + ib_cm_events_t cm_event; > dapl_dbg_log( > DAPL_DBG_TYPE_WARN, > " dapli_cm_active_handler: CONN_ERR " > @@ -337,10 +339,15 @@ static void dapli_cm_active_cb(struct da > event->event, event->status, > (event->status == -110)?"TIMEOUT":"" ); > > - dapl_evd_connection_callback(conn, > - IB_CME_DESTINATION_UNREACHABLE, > - NULL, conn->ep); > + /* no device type specified so assume IB for now */ > + if (event->status == -110) /* IB timeout */ I changed -110 to -ETIMEDOUT > + cm_event = IB_CME_TIMEOUT; > + else > + cm_event = IB_CME_DESTINATION_UNREACHABLE; > + > + dapl_evd_connection_callback(conn, cm_event, NULL, conn->ep); > break; > + } > case RDMA_CM_EVENT_REJECTED: > { > ib_cm_events_t cm_event; > @@ -357,7 +364,6 @@ static void dapli_cm_active_cb(struct da > event->status); > > dapl_evd_connection_callback(conn, cm_event, NULL, conn->ep); > - > break; > } > case RDMA_CM_EVENT_ESTABLISHED: > @@ -1028,7 +1034,7 @@ int dapls_ib_private_data_size(IN DAPL_P > /* > * Map all socket CM event codes to the DAT equivelent. I corrected this comment. 
> */ > -#define DAPL_IB_EVENT_CNT 12 > +#define DAPL_IB_EVENT_CNT 13 > > static struct ib_cm_event_map > { > @@ -1058,7 +1064,9 @@ static struct ib_cm_event_map > /* 10 */ { IB_CME_LOCAL_FAILURE, > DAT_CONNECTION_EVENT_BROKEN}, > /* 11 */ { IB_CME_BROKEN, > - DAT_CONNECTION_EVENT_BROKEN} > + DAT_CONNECTION_EVENT_BROKEN}, > + /* 12 */ { IB_CME_TIMEOUT, > + DAT_CONNECTION_EVENT_TIMED_OUT}, > }; > > /* > @@ -1164,7 +1172,7 @@ void dapli_cma_event_cb(void) > case RDMA_CM_EVENT_ADDR_ERROR: > case RDMA_CM_EVENT_ROUTE_ERROR: > dapl_evd_connection_callback(conn, > - IB_CME_LOCAL_FAILURE, > + IB_CME_DESTINATION_UNREACHABLE, > NULL, conn->ep); > break; > case RDMA_CM_EVENT_DEVICE_REMOVAL: > Index: dapl_ib_util.h > =================================================================== > --- dapl_ib_util.h (revision 7839) > +++ dapl_ib_util.h (working copy) > @@ -86,7 +86,8 @@ typedef enum { > IB_CME_DESTINATION_UNREACHABLE, > IB_CME_TOO_MANY_CONNECTION_REQUESTS, > IB_CME_LOCAL_FAILURE, > - IB_CME_BROKEN > + IB_CME_BROKEN, > + IB_CME_TIMEOUT > } ib_cm_events_t; > > /* CQ notifications */ > From tom at opengridcomputing.com Mon Jun 12 09:05:49 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 12 Jun 2006 11:05:49 -0500 Subject: [openib-general] [PATCH v2 4/7] AMSO1100 Memory Management. In-Reply-To: <20060608011744.1a66e85a.akpm@osdl.org> References: <20060607200646.9259.24588.stgit@stevo-desktop> <20060607200655.9259.90768.stgit@stevo-desktop> <20060608011744.1a66e85a.akpm@osdl.org> Message-ID: <1150128349.22704.20.camel@trinity.ogc.int> On Thu, 2006-06-08 at 01:17 -0700, Andrew Morton wrote: > On Wed, 07 Jun 2006 15:06:55 -0500 > Steve Wise wrote: > > > > > +void c2_free(struct c2_alloc *alloc, u32 obj) > > +{ > > + spin_lock(&alloc->lock); > > + clear_bit(obj, alloc->table); > > + spin_unlock(&alloc->lock); > > +} > > The spinlock is unneeded here. Good point. > > > What does all the code in this file do, anyway? 
It looks totally generic
> (and hence inappropriate for drivers/infiniband/hw/amso1100/) and somewhat
> similar to idr trees, perhaps.
>

We mimicked the mthca driver. It may be code that should be replaced
with Linux core services for new drivers. We'll investigate.

> > +int c2_array_set(struct c2_array *array, int index, void *value)
> > +{
> > +	int p = (index * sizeof(void *)) >> PAGE_SHIFT;
> > +
> > +	/* Allocate with GFP_ATOMIC because we'll be called with locks held. */
> > +	if (!array->page_list[p].page)
> > +		array->page_list[p].page =
> > +			(void **) get_zeroed_page(GFP_ATOMIC);
> > +
> > +	if (!array->page_list[p].page)
> > +		return -ENOMEM;
>
> This _will_ happen under load.  What will the result of that be, in the
> context of this driver?

A higher level object allocation will fail. In this case, a kernel
application request will fail and the application must handle the error.

> This function is incorrectly designed - it should receive a gfp_t argument.
> Because you don't *know* that the caller will always hold a spinlock.  And
> GFP_KERNEL is far, far stronger than GFP_ATOMIC.

This service is allocating a page that the adapter will DMA 2B message
indices into.

> > +static int c2_alloc_mqsp_chunk(gfp_t gfp_mask, struct sp_chunk **head)
> > +{
> > +	int i;
> > +	struct sp_chunk *new_head;
> > +
> > +	new_head = (struct sp_chunk *) __get_free_page(gfp_mask | GFP_DMA);
>
> Why is __GFP_DMA in there?  Unless you've cornered the ISA bus infiniband
> market, it's likely to be wrong.
>

Flag confusion about what GFP_DMA means. We'll revisit this whole file ...
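Andrew's point, threading the allocation context through as a parameter rather than hardcoding GFP_ATOMIC, can be illustrated with a small userspace sketch. This is not the amso1100 code: calloc() stands in for get_zeroed_page(), the gfp parameter stands in for the kernel's gfp_t, and the structure names are only borrowed for illustration.

```c
#include <stdlib.h>
#include <stddef.h>

#define PAGE_SIZE      4096
#define PTRS_PER_PAGE  (PAGE_SIZE / sizeof(void *))

struct page_list { void **page; };

struct c2_array {
	struct page_list page_list[16];  /* lazily allocated pointer pages */
};

/* The gfp argument is the caller's choice of allocation context; in the
 * kernel this lets a sleepable path pass GFP_KERNEL and only lock-holding
 * callers pay the GFP_ATOMIC price.  Here it is accepted but unused, since
 * userspace calloc() has no such distinction. */
static int c2_array_set(struct c2_array *array, int index, void *value, int gfp)
{
	int p = index / PTRS_PER_PAGE;

	(void)gfp;
	if (!array->page_list[p].page)
		array->page_list[p].page = calloc(PTRS_PER_PAGE, sizeof(void *));
	if (!array->page_list[p].page)
		return -1;  /* -ENOMEM in the kernel */

	array->page_list[p].page[index % PTRS_PER_PAGE] = value;
	return 0;
}

static void *c2_array_get(struct c2_array *array, int index)
{
	int p = index / PTRS_PER_PAGE;

	return array->page_list[p].page ?
		array->page_list[p].page[index % PTRS_PER_PAGE] : NULL;
}
```

The design choice being debated is exactly this signature: once the flag is a parameter, only the call sites that genuinely hold a spinlock need the weaker atomic allocation.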
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

From rdreier at cisco.com  Mon Jun 12 09:04:02 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 12 Jun 2006 09:04:02 -0700
Subject: [openib-general] potential bug: ipoib freeing ah before completion
In-Reply-To: <20060612124833.GA19452@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 12 Jun 2006 15:48:33 +0300")
References: <20060612124833.GA19452@mellanox.co.il>
Message-ID: 

    Michael> It seems that ipoib_flush_paths can be called while
    Michael> completions are still outstanding on IPoIB QP (e.g. from
    Michael> ipoib_ib_dev_flush).  If this happens, an address handle
    Michael> might get freed while a work request is still outstanding
    Michael> for it.  This can trigger a local QP error, and IPoIB
    Michael> will stop working, until QP is reset.

So what if path_free is called early?  The address handle shouldn't
get freed until tx_tail is past ah->last_send, so all associated work
requests are complete.  Am I missing something?

Have you actually seen this happen?

From rdreier at cisco.com  Mon Jun 12 09:06:43 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 12 Jun 2006 09:06:43 -0700
Subject: [openib-general] [PATCH] mthca: restore missing registers
In-Reply-To: <20060612135751.GB19518@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 12 Jun 2006 16:57:51 +0300")
References: <20060612135751.GB19518@mellanox.co.il>
Message-ID: 

    Michael> mthca misses restoring the following PCI-X/PCI-Express
    Michael> registers at reset:
    Michael> PCI-X device: PCI-X command register
    Michael> PCI-X bridge: upstream and downstream split transaction registers
    Michael> PCI-Express: PCI-Express device control and link control registers

Would it be simpler to just restore the full 256-byte PCI headers instead?

 - R.
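The deferred-free scheme Roland describes, where each address handle records the sequence number of its last posted send and the real free happens only once the completion tail has advanced past that point, can be sketched in self-contained C. This is purely illustrative, not the actual ipoib implementation; all names and fields here are made up for the sketch.

```c
#include <stddef.h>

struct ah {
	unsigned int last_send;  /* seq number of the last send posted with this AH */
	int dead;                /* consumer has "freed" the AH */
	int freed;               /* storage actually released */
};

static unsigned int tx_head;  /* seq of last posted send */
static unsigned int tx_tail;  /* seq of last completed send */

static void post_send(struct ah *ah)
{
	ah->last_send = ++tx_head;
}

static void free_ah(struct ah *ah)
{
	/* Only mark the handle; the real release is deferred to the
	 * completion path, so outstanding WRs can still reference it. */
	ah->dead = 1;
}

static void tx_completed(struct ah *ah, unsigned int completed_seq)
{
	tx_tail = completed_seq;
	/* Signed distance tolerates counter wraparound. */
	if (ah->dead && !ah->freed && (int)(tx_tail - ah->last_send) >= 0)
		ah->freed = 1;  /* safe: no WR can still reference this AH */
}
```

The invariant is the one Roland states: the handle cannot be released while tx_tail has not yet passed ah->last_send, because a work request posted with it may still be in flight.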
From mst at mellanox.co.il  Mon Jun 12 09:48:03 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 12 Jun 2006 19:48:03 +0300
Subject: [openib-general] potential bug: ipoib freeing ah before completion
In-Reply-To: 
References: 
Message-ID: <20060612164803.GD19518@mellanox.co.il>

Quoting r. Roland Dreier :
> The address handle shouldn't get freed until tx_tail is past ah->last_send, so
> all associated work requests are complete.  Am I missing something?

I think you are right - I forgot that ipoib_free_ah doesn't actually
free the ah. Looks like there's no issue.

> Have you actually seen this happen?

No.

-- 
MST

From mst at mellanox.co.il  Mon Jun 12 09:53:27 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 12 Jun 2006 19:53:27 +0300
Subject: [openib-general] [PATCH] mthca: restore missing registers
In-Reply-To: 
References: 
Message-ID: <20060612165327.GE19518@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [PATCH] mthca: restore missing registers
>
> Michael> mthca misses restoring the following PCI-X/PCI-Express
> Michael> registers at reset:
> Michael> PCI-X device: PCI-X command register
> Michael> PCI-X bridge: upstream and downstream split transaction registers
> Michael> PCI-Express: PCI-Express device control and link control registers
>
> Would it be simpler to just restore the full 256-byte PCI headers instead?

You can't do that - there are lots of registers where reading a value
and then writing it back has side effects. Consider fields that are
write 1 to clear, or the VPD flag register. Further, there is no
guarantee what writing fields outside any capability list will do -
consider the Tavor 22/23 debugging registers as an example.
-- 
MST

From mshefty at ichips.intel.com  Mon Jun 12 10:12:15 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 12 Jun 2006 10:12:15 -0700
Subject: [openib-general] [PATCH 0/4] Add support for UD QPs
In-Reply-To: <44886D66.7000703@ichips.intel.com>
References: <44886747.4040004@ichips.intel.com> <44886D66.7000703@ichips.intel.com>
Message-ID: <448DA06F.4090802@ichips.intel.com>

To clarify the motivation more, the question to answer is: if we ignore
iWarp completely, does it still make sense to provide a higher-level
communication manager for IB?  I believe that it does, especially for
userspace applications.  This lets us leverage existing name services
and ipoib, and provides an interface that is simpler for most users.

A lot of existing applications and test programs exchange IB QP
information over TCP connections, including programs that communicate
using UD QPs.  If you compare the amount of work necessary to establish
communication over an RC QP versus a UD QP, the difference is basically
a single MAD, which isn't even needed.  So there's a fair amount of work
that can be shared by a higher level CM.

- Sean

From ardavis at ichips.intel.com  Mon Jun 12 10:30:16 2006
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Mon, 12 Jun 2006 10:30:16 -0700
Subject: [openib-general] [Bug 126] RDMA_CM and UCM not loaded on boot
In-Reply-To: <20060611062313.10CDC2287AC@openib.ca.sandia.gov>
References: <20060611062313.10CDC2287AC@openib.ca.sandia.gov>
Message-ID: <448DA4A8.10209@ichips.intel.com>

bugzilla-daemon at openib.org wrote:

>http://openib.org/bugzilla/show_bug.cgi?id=126
>
>
>vlad at mellanox.co.il changed:
>
>           What    |Removed                     |Added
>----------------------------------------------------------------------------
>             Status|NEW                         |RESOLVED
>         Resolution|                            |WONTFIX
>
>
>
>------- Comment #1 from vlad at mellanox.co.il  2006-06-10 23:23 -------
>RDMA_CM and RDMA_UCM are not loaded by default.
In order to load them upon boot >edit /etc/infiniband/openib.conf file and set RDMA_CM_LOAD=yes and >RDMA_UCM_LOAD=yes: > ># Start HCA driver upon boot >ONBOOT=yes > ># Load UCM module >UCM_LOAD=no > ># Load RDMA_CM module >RDMA_CM_LOAD=no > ># Load RDMA_UCM module >RDMA_UCM_LOAD=no > > > Did the default openib.conf script get updated with: RDMA_CM_LOAD=yes RDMA_UCM_LOAD=yes -arlin -arlin From sean.hefty at intel.com Mon Jun 12 10:39:53 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 12 Jun 2006 10:39:53 -0700 Subject: [openib-general] [PATCH 1/2] librdmacm: userspace support for multicast abstraction Message-ID: <000001c68e47$36caf250$ff0da8c0@amr.corp.intel.com> Add support to the userspace RDMA CM library for joining multicast group based on IP addressing. Signed-off-by: Sean Hefty --- diff -up svn3/gen2/trunk/src/userspace/librdmacm/include/rdma/rdma_cma_abi.h svn/gen2/trunk/src/userspace/librdmacm/include/rdma/rdma_cma_abi.h --- svn3/gen2/trunk/src/userspace/librdmacm/include/rdma/rdma_cma_abi.h 2006-06-06 17:35:31.000000000 -0700 +++ svn/gen2/trunk/src/userspace/librdmacm/include/rdma/rdma_cma_abi.h 2006-06-12 10:16:44.598117880 -0700 @@ -60,7 +60,9 @@ enum { UCMA_CMD_GET_EVENT, UCMA_CMD_GET_OPTION, UCMA_CMD_SET_OPTION, - UCMA_CMD_GET_DST_ATTR + UCMA_CMD_GET_DST_ATTR, + UCMA_CMD_JOIN_MCAST, + UCMA_CMD_LEAVE_MCAST }; struct ucma_abi_cmd_hdr { @@ -178,6 +180,17 @@ struct ucma_abi_init_qp_attr { __u32 qp_state; }; +struct ucma_abi_join_mcast { + __u32 id; + struct sockaddr_in6 addr; + __u64 uid; +}; + +struct ucma_abi_leave_mcast { + __u32 id; + struct sockaddr_in6 addr; +}; + struct ucma_abi_dst_attr_resp { __u32 remote_qpn; __u32 remote_qkey; diff -up svn3/gen2/trunk/src/userspace/librdmacm/include/rdma/rdma_cma.h svn/gen2/trunk/src/userspace/librdmacm/include/rdma/rdma_cma.h --- svn3/gen2/trunk/src/userspace/librdmacm/include/rdma/rdma_cma.h 2006-06-06 17:35:31.000000000 -0700 +++ svn/gen2/trunk/src/userspace/librdmacm/include/rdma/rdma_cma.h 
2006-06-06 12:26:21.000000000 -0700 @@ -52,6 +52,8 @@ enum rdma_cm_event_type { RDMA_CM_EVENT_ESTABLISHED, RDMA_CM_EVENT_DISCONNECTED, RDMA_CM_EVENT_DEVICE_REMOVAL, + RDMA_CM_EVENT_MULTICAST_JOIN, + RDMA_CM_EVENT_MULTICAST_ERROR }; enum rdma_port_space { @@ -99,6 +101,13 @@ struct rdma_cm_id { uint8_t port_num; }; +struct rdma_multicast_data { + void *context; + struct sockaddr addr; + uint8_t pad[sizeof(struct sockaddr_in6) - + sizeof(struct sockaddr)]; +}; + struct rdma_cm_event { struct rdma_cm_id *id; struct rdma_cm_id *listen_id; @@ -245,6 +254,24 @@ int rdma_reject(struct rdma_cm_id *id, c int rdma_disconnect(struct rdma_cm_id *id); /** + * rdma_join_multicast - Join the multicast group specified by the given + * address. + * @id: Communication identifier associated with the request. + * @addr: Multicast address identifying the group to join. + * @context: User-defined context associated with the join request. The + * context is returned to the user through the private_data field in + * the rdma_cm_event. + */ +int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, + void *context); + +/** + * rdma_leave_multicast - Leave the multicast group specified by the given + * address. + */ +int rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr); + +/** * rdma_get_cm_event - Retrieves the next pending communications event, * if no event is pending waits for an event. * @channel: Event channel to check for events. 
diff -up svn3/gen2/trunk/src/userspace/librdmacm/src/cma.c svn/gen2/trunk/src/userspace/librdmacm/src/cma.c --- svn3/gen2/trunk/src/userspace/librdmacm/src/cma.c 2006-06-06 17:35:31.000000000 -0700 +++ svn/gen2/trunk/src/userspace/librdmacm/src/cma.c 2006-06-06 17:30:17.000000000 -0700 @@ -896,6 +896,66 @@ int rdma_disconnect(struct rdma_cm_id *i return 0; } +int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, + void *context) +{ + struct ucma_abi_join_mcast *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size, addrlen; + + addrlen = ucma_addrlen(addr); + if (!addrlen) + return -EINVAL; + + CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_JOIN_MCAST, size); + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + memcpy(&cmd->addr, addr, addrlen); + cmd->uid = (uintptr_t) context; + + ret = write(id->channel->fd, msg, size); + if (ret != size) + return (ret > 0) ? -ENODATA : ret; + + return 0; +} + +int rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr) +{ + struct ucma_abi_leave_mcast *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size, addrlen; + struct ibv_ah_attr ah_attr; + uint32_t qp_info; + + addrlen = ucma_addrlen(addr); + if (!addrlen) + return -EINVAL; + + CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_LEAVE_MCAST, size); + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + memcpy(&cmd->addr, addr, addrlen); + + if (id->qp) { + ret = rdma_get_dst_attr(id, addr, &ah_attr, &qp_info, &qp_info); + if (ret) + goto out; + + ret = ibv_detach_mcast(id->qp, &ah_attr.grh.dgid, ah_attr.dlid); + if (ret) + goto out; + } + + ret = write(id->channel->fd, msg, size); + if (ret != size) + ret = (ret > 0) ? 
-ENODATA : ret; +out: + return ret; +} + static void ucma_copy_event_from_kern(struct rdma_cm_event *dst, struct ucma_abi_event_resp *src) { @@ -1004,6 +1064,36 @@ static int ucma_process_establish(struct return ret; } +static void ucma_process_mcast(struct rdma_cm_id *id, struct rdma_cm_event *evt) +{ + struct ucma_abi_join_mcast kmc_data; + struct rdma_multicast_data *mc_data; + struct ibv_ah_attr ah_attr; + uint32_t qp_info; + + kmc_data = *(struct ucma_abi_join_mcast *) evt->private_data; + + mc_data = evt->private_data; + mc_data->context = (void *) (uintptr_t) kmc_data.uid; + memcpy(&mc_data->addr, &kmc_data.addr, + ucma_addrlen((struct sockaddr *) &kmc_data.addr)); + + if (evt->status || !id->qp) + return; + + evt->status = rdma_get_dst_attr(id, &mc_data->addr, &ah_attr, + &qp_info, &qp_info); + if (evt->status) + goto err; + + evt->status = ibv_attach_mcast(id->qp, &ah_attr.grh.dgid, ah_attr.dlid); + if (evt->status) + goto err; + return; +err: + evt->event = RDMA_CM_EVENT_MULTICAST_ERROR; +} + int rdma_get_cm_event(struct rdma_event_channel *channel, struct rdma_cm_event **event) { @@ -1085,6 +1175,10 @@ retry: goto retry; } break; + case RDMA_CM_EVENT_MULTICAST_JOIN: + case RDMA_CM_EVENT_MULTICAST_ERROR: + ucma_process_mcast(&id_priv->id, evt); + break; default: break; } diff -up svn3/gen2/trunk/src/userspace/librdmacm/src/librdmacm.map svn/gen2/trunk/src/userspace/librdmacm/src/librdmacm.map --- svn3/gen2/trunk/src/userspace/librdmacm/src/librdmacm.map 2006-06-06 17:35:31.000000000 -0700 +++ svn/gen2/trunk/src/userspace/librdmacm/src/librdmacm.map 2006-06-01 15:03:13.000000000 -0700 @@ -19,5 +19,7 @@ RDMACM_1.0 { rdma_get_option; rdma_set_option; rdma_get_dst_attr; + rdma_join_multicast; + rdma_leave_multicast; local: *; }; From sean.hefty at intel.com Mon Jun 12 10:43:38 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 12 Jun 2006 10:43:38 -0700 Subject: [openib-general] [PATCH 2/2] librdmacm: add multicast test program to examples 
In-Reply-To: <000001c68e47$36caf250$ff0da8c0@amr.corp.intel.com> Message-ID: <000101c68e47$bc5dbd80$ff0da8c0@amr.corp.intel.com> Simple multicast test program. When run, the client creates a QP and joins it to a multicast group. It then either sends or receives messages on the group. Signed-off-by: Sean Hefty --- diff -up svn3/gen2/trunk/src/userspace/librdmacm/librdmacm.spec.in svn/gen2/trunk/src/userspace/librdmacm/librdmacm.spec.in --- svn3/gen2/trunk/src/userspace/librdmacm/librdmacm.spec.in 2006-06-06 17:35:31.000000000 -0700 +++ svn/gen2/trunk/src/userspace/librdmacm/librdmacm.spec.in 2006-06-01 14:53:47.000000000 -0700 @@ -67,3 +67,4 @@ rm -rf $RPM_BUILD_ROOT %{_bindir}/rping %{_bindir}/ucmatose %{_bindir}/udaddy +%{_bindir}/mckey diff -up svn3/gen2/trunk/src/userspace/librdmacm/Makefile.am svn/gen2/trunk/src/userspace/librdmacm/Makefile.am --- svn3/gen2/trunk/src/userspace/librdmacm/Makefile.am 2006-06-06 17:35:31.000000000 -0700 +++ svn/gen2/trunk/src/userspace/librdmacm/Makefile.am 2006-06-06 14:48:23.000000000 -0700 @@ -18,13 +18,15 @@ endif src_librdmacm_la_SOURCES = src/cma.c src_librdmacm_la_LDFLAGS = -avoid-version $(rdmacm_version_script) -bin_PROGRAMS = examples/ucmatose examples/rping examples/udaddy +bin_PROGRAMS = examples/ucmatose examples/rping examples/udaddy examples/mckey examples_ucmatose_SOURCES = examples/cmatose.c examples_ucmatose_LDADD = $(top_builddir)/src/librdmacm.la examples_rping_SOURCES = examples/rping.c examples_rping_LDADD = $(top_builddir)/src/librdmacm.la examples_udaddy_SOURCES = examples/udaddy.c examples_udaddy_LDADD = $(top_builddir)/src/librdmacm.la +examples_mckey_SOURCES = examples/mckey.c +examples_mckey_LDADD = $(top_builddir)/src/librdmacm.la librdmacmincludedir = $(includedir)/rdma diff -upN svn3/gen2/trunk/src/userspace/librdmacm/examples/mckey.c svn/gen2/trunk/src/userspace/librdmacm/examples/mckey.c --- svn3/gen2/trunk/src/userspace/librdmacm/examples/mckey.c 1969-12-31 16:00:00.000000000 -0800 +++ 
svn/gen2/trunk/src/userspace/librdmacm/examples/mckey.c 2006-06-06 12:56:35.000000000 -0700 @@ -0,0 +1,505 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +struct cmatest_node { + int id; + struct rdma_cm_id *cma_id; + int connected; + struct ibv_pd *pd; + struct ibv_cq *cq; + struct ibv_mr *mr; + struct ibv_ah *ah; + uint32_t remote_qpn; + uint32_t remote_qkey; + void *mem; +}; + +struct cmatest { + struct rdma_event_channel *channel; + struct cmatest_node *nodes; + int conn_index; + int connects_left; + + struct sockaddr_in dst_in; + struct sockaddr *dst_addr; + struct sockaddr_in src_in; + struct sockaddr *src_addr; +}; + +static struct cmatest test; +static int connections = 1; +static int message_size = 100; +static int message_count = 10; +static int is_sender; + +static int create_message(struct cmatest_node *node) +{ + if (!message_size) + message_count = 0; + + if (!message_count) + return 0; + + node->mem = malloc(message_size + sizeof(struct ibv_grh)); + if (!node->mem) { + printf("failed message allocation\n"); + return -1; + } + node->mr = ibv_reg_mr(node->pd, node->mem, + message_size + sizeof(struct ibv_grh), + IBV_ACCESS_LOCAL_WRITE); + if (!node->mr) { + printf("failed to reg MR\n"); + goto err; + } + return 0; +err: + free(node->mem); + return -1; +} + +static int init_node(struct cmatest_node *node) +{ + struct ibv_qp_init_attr init_qp_attr; + int cqe, ret; + + node->pd = ibv_alloc_pd(node->cma_id->verbs); + if (!node->pd) { + ret = -ENOMEM; + printf("cmatose: unable to allocate PD\n"); + goto out; + } + + cqe = message_count ? message_count * 2 : 2; + node->cq = ibv_create_cq(node->cma_id->verbs, cqe, node, 0, 0); + if (!node->cq) { + ret = -ENOMEM; + printf("cmatose: unable to create CQ\n"); + goto out; + } + + memset(&init_qp_attr, 0, sizeof init_qp_attr); + init_qp_attr.cap.max_send_wr = message_count ? message_count : 1; + init_qp_attr.cap.max_recv_wr = message_count ? 
message_count : 1; + init_qp_attr.cap.max_send_sge = 1; + init_qp_attr.cap.max_recv_sge = 1; + init_qp_attr.qp_context = node; + init_qp_attr.sq_sig_all = 0; + init_qp_attr.qp_type = IBV_QPT_UD; + init_qp_attr.send_cq = node->cq; + init_qp_attr.recv_cq = node->cq; + ret = rdma_create_qp(node->cma_id, node->pd, &init_qp_attr); + if (ret) { + printf("cmatose: unable to create QP: %d\n", ret); + goto out; + } + + ret = create_message(node); + if (ret) { + printf("cmatose: failed to create messages: %d\n", ret); + goto out; + } +out: + return ret; +} + +static int post_recvs(struct cmatest_node *node) +{ + struct ibv_recv_wr recv_wr, *recv_failure; + struct ibv_sge sge; + int i, ret = 0; + + if (!message_count) + return 0; + + recv_wr.next = NULL; + recv_wr.sg_list = &sge; + recv_wr.num_sge = 1; + recv_wr.wr_id = (uintptr_t) node; + + sge.length = message_size + sizeof(struct ibv_grh); + sge.lkey = node->mr->lkey; + sge.addr = (uintptr_t) node->mem; + + for (i = 0; i < message_count && !ret; i++ ) { + ret = ibv_post_recv(node->cma_id->qp, &recv_wr, &recv_failure); + if (ret) { + printf("failed to post receives: %d\n", ret); + break; + } + } + return ret; +} + +static int post_sends(struct cmatest_node *node, int signal_flag) +{ + struct ibv_send_wr send_wr, *bad_send_wr; + struct ibv_sge sge; + int i, ret = 0; + + if (!node->connected || !message_count) + return 0; + + send_wr.next = NULL; + send_wr.sg_list = &sge; + send_wr.num_sge = 1; + send_wr.opcode = IBV_WR_SEND_WITH_IMM; + send_wr.send_flags = IBV_SEND_INLINE | signal_flag; + send_wr.wr_id = (unsigned long)node; + send_wr.imm_data = htonl(node->cma_id->qp->qp_num); + + send_wr.wr.ud.ah = node->ah; + send_wr.wr.ud.remote_qpn = node->remote_qpn; + send_wr.wr.ud.remote_qkey = node->remote_qkey; + + sge.length = message_size - sizeof(struct ibv_grh); + sge.lkey = node->mr->lkey; + sge.addr = (uintptr_t) node->mem; + + for (i = 0; i < message_count && !ret; i++) { + ret = ibv_post_send(node->cma_id->qp, &send_wr, 
&bad_send_wr); + if (ret) + printf("failed to post sends: %d\n", ret); + } + return ret; +} + +static void connect_error(void) +{ + test.connects_left--; +} + +static int addr_handler(struct cmatest_node *node) +{ + int ret; + + ret = init_node(node); + if (ret) + goto err; + + if (!is_sender) { + ret = post_recvs(node); + if (ret) + goto err; + } + + ret = rdma_join_multicast(node->cma_id, test.dst_addr, node); + if (ret) { + printf("cmatose: failure joining: %d\n", ret); + goto err; + } + return 0; +err: + connect_error(); + return ret; +} + +static int join_handler(struct cmatest_node *node) +{ + struct ibv_ah_attr ah_attr; + int ret; + + ret = rdma_get_dst_attr(node->cma_id, test.dst_addr, &ah_attr, + &node->remote_qpn, &node->remote_qkey); + if (ret) { + printf("mckey: failure getting destination attributes\n"); + goto err; + } + + node->ah = ibv_create_ah(node->pd, &ah_attr); + if (!node->ah) { + printf("mckey: failure creating address handle\n"); + goto err; + } + + node->connected = 1; + test.connects_left--; + return 0; +err: + connect_error(); + return ret; +} + +static int cma_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event) +{ + int ret = 0; + + switch (event->event) { + case RDMA_CM_EVENT_ADDR_RESOLVED: + ret = addr_handler(cma_id->context); + break; + case RDMA_CM_EVENT_MULTICAST_JOIN: + ret = join_handler(cma_id->context); + break; + case RDMA_CM_EVENT_ADDR_ERROR: + case RDMA_CM_EVENT_ROUTE_ERROR: + case RDMA_CM_EVENT_MULTICAST_ERROR: + printf("cmatose: event: %d, error: %d\n", event->event, + event->status); + connect_error(); + ret = event->status; + break; + case RDMA_CM_EVENT_DEVICE_REMOVAL: + /* Cleanup will occur after test completes. 
*/ + break; + default: + break; + } + return ret; +} + +static void destroy_node(struct cmatest_node *node) +{ + if (!node->cma_id) + return; + + if (node->ah) + ibv_destroy_ah(node->ah); + + if (node->cma_id->qp) + rdma_destroy_qp(node->cma_id); + + if (node->cq) + ibv_destroy_cq(node->cq); + + if (node->mem) { + ibv_dereg_mr(node->mr); + free(node->mem); + } + + if (node->pd) + ibv_dealloc_pd(node->pd); + + /* Destroy the RDMA ID after all device resources */ + rdma_destroy_id(node->cma_id); +} + +static int alloc_nodes(void) +{ + int ret, i; + + test.nodes = malloc(sizeof *test.nodes * connections); + if (!test.nodes) { + printf("cmatose: unable to allocate memory for test nodes\n"); + return -ENOMEM; + } + memset(test.nodes, 0, sizeof *test.nodes * connections); + + for (i = 0; i < connections; i++) { + test.nodes[i].id = i; + ret = rdma_create_id(test.channel, &test.nodes[i].cma_id, + &test.nodes[i], RDMA_PS_UDP); + if (ret) + goto err; + } + return 0; +err: + while (--i >= 0) + rdma_destroy_id(test.nodes[i].cma_id); + free(test.nodes); + return ret; +} + +static void destroy_nodes(void) +{ + int i; + + for (i = 0; i < connections; i++) + destroy_node(&test.nodes[i]); + free(test.nodes); +} + +static int poll_cqs(void) +{ + struct ibv_wc wc[8]; + int done, i, ret; + + for (i = 0; i < connections; i++) { + if (!test.nodes[i].connected) + continue; + + for (done = 0; done < message_count; done += ret) { + ret = ibv_poll_cq(test.nodes[i].cq, 8, wc); + if (ret < 0) { + printf("cmatose: failed polling CQ: %d\n", ret); + return ret; + } + } + } + return 0; +} + +static int connect_events(void) +{ + struct rdma_cm_event *event; + int ret = 0; + + while (test.connects_left && !ret) { + ret = rdma_get_cm_event(test.channel, &event); + if (!ret) { + ret = cma_handler(event->id, event); + rdma_ack_cm_event(event); + } + } + return ret; +} + +static int get_addr(char *dst, struct sockaddr_in *addr) +{ + struct addrinfo *res; + int ret; + + ret = getaddrinfo(dst, NULL, 
NULL, &res); + if (ret) { + printf("getaddrinfo failed - invalid hostname or IP address\n"); + return ret; + } + + if (res->ai_family != PF_INET) { + ret = -1; + goto out; + } + + *addr = *(struct sockaddr_in *) res->ai_addr; +out: + freeaddrinfo(res); + return ret; +} + +static int run(char *dst, char *src) +{ + int i, ret; + + printf("cmatose: starting client\n"); + if (src) { + ret = get_addr(src, &test.src_in); + if (ret) + return ret; + } + + ret = get_addr(dst, &test.dst_in); + if (ret) + return ret; + + test.dst_in.sin_port = 7174; + + printf("cmatose: joining\n"); + for (i = 0; i < connections; i++) { + ret = rdma_resolve_addr(test.nodes[i].cma_id, + src ? test.src_addr : NULL, + test.dst_addr, 2000); + if (ret) { + printf("cmatose: failure getting addr: %d\n", ret); + connect_error(); + return ret; + } + } + + ret = connect_events(); + if (ret) + goto out; + + /* + * Pause to give SM chance to configure switches. We don't want to + * handle reliability issue in this simple test program. + */ + sleep(3); + + if (message_count) { + if (is_sender) { + printf("initiating data transfers\n"); + for (i = 0; i < connections; i++) { + ret = post_sends(&test.nodes[i], 0); + if (ret) + goto out; + } + } else { + printf("receiving data transfers\n"); + ret = poll_cqs(); + if (ret) + goto out; + } + printf("data transfers complete\n"); + } +out: + return ret; +} + +int main(int argc, char **argv) +{ + int ret; + + if (argc < 3 || argc > 4) { + printf("usage: %s {s[end] | r[ecv]} mcast_addr [bind_addr]]\n", + argv[0]); + exit(1); + } + is_sender = (argv[1][0] == 's'); + + test.dst_addr = (struct sockaddr *) &test.dst_in; + test.src_addr = (struct sockaddr *) &test.src_in; + test.connects_left = connections; + + test.channel = rdma_create_event_channel(); + if (!test.channel) { + printf("failed to create event channel\n"); + exit(1); + } + + if (alloc_nodes()) + exit(1); + + ret = run(argv[2], (argc == 4) ? 
argv[3] : NULL); + + printf("test complete\n"); + destroy_nodes(); + rdma_destroy_event_channel(test.channel); + + printf("return status %d\n", ret); + return ret; +} From robert.j.woodruff at intel.com Mon Jun 12 10:49:59 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 12 Jun 2006 10:49:59 -0700 Subject: [openib-general] OFED 1.0-rc6 tarball available with working ipath driver Message-ID: <1AC79F16F5C5284499BB9591B33D6F0007F0B6BD@orsmsx408> Bryan wrote, >Due to unfortunate timing, the ipath driver in OFED 1.0-rc6 does not >work correctly. You can download an updated tarball from here, for >which the ipath driver works fine: http://openib.red-bean.com/OFED-1.0-rc6+ipath.tar.bz2 >Alternatively, pull the necessary patches from SVN. Still does not seem to compile. In file included from /var/tmp/OFED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_cq.c:36: /var/tmp/OFED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h:399: error: `BITS_PER_BYTE' undeclared here (not in a function) make[3]: *** [/var/tmp/OFED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_cq.o] Error 1 make[2]: *** [/var/tmp/OFED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath] Error 2 make[1]: *** [_module_/var/tmp/OFED/tmp/openib/openib/src/linux-kernel/infiniband] Error 2 make[1]: Leaving directory `/usr/src/kernels/2.6.9-34.EL-smp-x86_64' make: *** [kernel] Error 2 ERROR: Failed to execute: make kernel ~ From mst at mellanox.co.il Mon Jun 12 11:32:47 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Jun 2006 21:32:47 +0300 Subject: [openib-general] another ipoib question Message-ID: <20060612183247.GD20500@mellanox.co.il> Hello, Roland!
Here's another question from code review conducted by Eitan Rabin: could the flush task set the ipoib_neigh pointer encoded inside the neighbour hardware address to NULL and free the neighbour, while ipoib_start_xmit is accessing the ipoib_neigh through the pointer it has loaded from the hardware address? The flush task does not seem to hold xmit_lock. -- MST From rdreier at cisco.com Mon Jun 12 12:36:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 Jun 2006 12:36:36 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex Message-ID: IB/uverbs: Don't serialize with ib_uverbs_idr_mutex Currently, all userspace verbs operations that call into the kernel are serialized by ib_uverbs_idr_mutex. This can be a scalability issue for some workloads, especially for devices driven by the ipath driver, which needs to call into the kernel even for datapath operations. Fix this by adding reference counts to the userspace objects, and then converting ib_uverbs_idr_mutex into a spinlock that only protects the idrs long enough to take a reference on the object being looked up. Because remove operations may fail, we have to do a slightly funky two-step deletion, which is described in the comments at the top of uverbs_cmd.c. This also still leaves ib_uverbs_idr_lock as a single lock that is possibly subject to contention. However, the lock hold time will only be a single idr operation, so multiple threads should still be able to make progress, even if ib_uverbs_idr_lock is being ping-ponged. Surprisingly, these changes even shrink the object code: add/remove: 23/5 grow/shrink: 4/21 up/down: 589/-688 (-99) Signed-off-by: Roland Dreier --- I started thinking about the "kill ib_uverbs_idr_mutex" problem, and I realized that there are actually some interesting issues there (as described in the comment at the top of uverbs_cmd.c). In fact I ended up coding the solution below.
This passes some basic tests but it could probably use some review. I'm thinking of checking it into svn for some further cooking in the next day or two, so let me know if you see any issues. diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index 3372d67..bb9bee5 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -132,7 +132,7 @@ struct ib_ucq_object { u32 async_events_reported; }; -extern struct mutex ib_uverbs_idr_mutex; +extern spinlock_t ib_uverbs_idr_lock; extern struct idr ib_uverbs_pd_idr; extern struct idr ib_uverbs_mr_idr; extern struct idr ib_uverbs_mw_idr; @@ -141,6 +141,8 @@ extern struct idr ib_uverbs_cq_idr; extern struct idr ib_uverbs_qp_idr; extern struct idr ib_uverbs_srq_idr; +void idr_remove_uobj(struct idr *idp, struct ib_uobject *uobj); + struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file, int is_async, int *fd); void ib_uverbs_release_event_file(struct kref *ref); diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 403dd81..7968b5f 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -50,7 +50,64 @@ #define INIT_UDATA(udata, ibuf, obuf, il (udata)->outlen = (olen); \ } while (0) -static int idr_add_uobj(struct idr *idr, void *obj, struct ib_uobject *uobj) +/* + * The ib_uobject locking scheme is as follows: + * + * - ib_uverbs_idr_lock protects the uverbs idrs themselves, so it + * needs to be held during all idr operations. When an object is + * looked up, a reference must be taken on the object's kref before + * dropping this lock. + * + * - Each object also has an rwsem. This rwsem must be held for + * reading while an operation that uses the object is performed. + * For example, while registering an MR, the associated PD's + * uobject.mutex must be held for reading. The rwsem must be held + * for writing while initializing or destroying an object. 
+ * + * - In addition, each object has a "live" flag. If this flag is not + * set, then lookups of the object will fail even if it is found in + * the idr. This handles a reader that blocks and does not acquire + * the rwsem until after the object is destroyed. The destroy + * operation will set the live flag to 0 and then drop the rwsem; + * this will allow the reader to acquire the rwsem, see that the + * live flag is 0, and then drop the rwsem and its reference to + * object. The underlying storage will not be freed until the last + * reference to the object is dropped. + */ + +static void init_uobj(struct ib_uobject *uobj, u64 user_handle, + struct ib_ucontext *context) +{ + uobj->user_handle = user_handle; + uobj->context = context; + kref_init(&uobj->ref); + init_rwsem(&uobj->mutex); + uobj->live = 0; +} + +static void release_uobj(struct kref *kref) +{ + kfree(container_of(kref, struct ib_uobject, ref)); +} + +static void put_uobj(struct ib_uobject *uobj) +{ + kref_put(&uobj->ref, release_uobj); +} + +static void put_uobj_read(struct ib_uobject *uobj) +{ + up_read(&uobj->mutex); + put_uobj(uobj); +} + +static void put_uobj_write(struct ib_uobject *uobj) +{ + up_write(&uobj->mutex); + put_uobj(uobj); +} + +static int idr_add_uobj(struct idr *idr, struct ib_uobject *uobj) { int ret; @@ -58,7 +115,9 @@ retry: if (!idr_pre_get(idr, GFP_KERNEL)) return -ENOMEM; + spin_lock(&ib_uverbs_idr_lock); ret = idr_get_new(idr, uobj, &uobj->id); + spin_unlock(&ib_uverbs_idr_lock); if (ret == -EAGAIN) goto retry; @@ -66,6 +125,121 @@ retry: return ret; } +void idr_remove_uobj(struct idr *idr, struct ib_uobject *uobj) +{ + spin_lock(&ib_uverbs_idr_lock); + idr_remove(idr, uobj->id); + spin_unlock(&ib_uverbs_idr_lock); +} + +static struct ib_uobject *__idr_get_uobj(struct idr *idr, int id, + struct ib_ucontext *context) +{ + struct ib_uobject *uobj; + + spin_lock(&ib_uverbs_idr_lock); + uobj = idr_find(idr, id); + if (uobj) + kref_get(&uobj->ref); + 
spin_unlock(&ib_uverbs_idr_lock); + + return uobj; +} + +static struct ib_uobject *idr_read_uobj(struct idr *idr, int id, + struct ib_ucontext *context) +{ + struct ib_uobject *uobj; + + uobj = __idr_get_uobj(idr, id, context); + if (!uobj) + return NULL; + + down_read(&uobj->mutex); + if (!uobj->live) { + put_uobj_read(uobj); + return NULL; + } + + return uobj; +} + +static struct ib_uobject *idr_write_uobj(struct idr *idr, int id, + struct ib_ucontext *context) +{ + struct ib_uobject *uobj; + + uobj = __idr_get_uobj(idr, id, context); + if (!uobj) + return NULL; + + down_write(&uobj->mutex); + if (!uobj->live) { + put_uobj_write(uobj); + return NULL; + } + + return uobj; +} + +static void *idr_read_obj(struct idr *idr, int id, struct ib_ucontext *context) +{ + struct ib_uobject *uobj; + + uobj = idr_read_uobj(idr, id, context); + return uobj ? uobj->object : NULL; +} + +static struct ib_pd *idr_read_pd(int pd_handle, struct ib_ucontext *context) +{ + return idr_read_obj(&ib_uverbs_pd_idr, pd_handle, context); +} + +static void put_pd_read(struct ib_pd *pd) +{ + put_uobj_read(pd->uobject); +} + +static struct ib_cq *idr_read_cq(int cq_handle, struct ib_ucontext *context) +{ + return idr_read_obj(&ib_uverbs_cq_idr, cq_handle, context); +} + +static void put_cq_read(struct ib_cq *cq) +{ + put_uobj_read(cq->uobject); +} + +static struct ib_ah *idr_read_ah(int ah_handle, struct ib_ucontext *context) +{ + return idr_read_obj(&ib_uverbs_ah_idr, ah_handle, context); +} + +static void put_ah_read(struct ib_ah *ah) +{ + put_uobj_read(ah->uobject); +} + +static struct ib_qp *idr_read_qp(int qp_handle, struct ib_ucontext *context) +{ + return idr_read_obj(&ib_uverbs_qp_idr, qp_handle, context); +} + +static void put_qp_read(struct ib_qp *qp) +{ + put_uobj_read(qp->uobject); +} + +static struct ib_srq *idr_read_srq(int srq_handle, struct ib_ucontext *context) +{ + return idr_read_obj(&ib_uverbs_srq_idr, srq_handle, context); +} + +static void put_srq_read(struct ib_srq *srq) 
+{ + put_uobj_read(srq->uobject); +} + ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, const char __user *buf, int in_len, int out_len) @@ -296,7 +470,8 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uve if (!uobj) return -ENOMEM; - uobj->context = file->ucontext; + init_uobj(uobj, 0, file->ucontext); + down_write(&uobj->mutex); pd = file->device->ib_dev->alloc_pd(file->device->ib_dev, file->ucontext, &udata); @@ -309,11 +484,10 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uve pd->uobject = uobj; atomic_set(&pd->usecnt, 0); - mutex_lock(&ib_uverbs_idr_mutex); - - ret = idr_add_uobj(&ib_uverbs_pd_idr, pd, uobj); + uobj->object = pd; + ret = idr_add_uobj(&ib_uverbs_pd_idr, uobj); if (ret) - goto err_up; + goto err_idr; memset(&resp, 0, sizeof resp); resp.pd_handle = uobj->id; @@ -321,26 +495,27 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uve if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } mutex_lock(&file->mutex); list_add_tail(&uobj->list, &file->ucontext->pd_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + uobj->live = 1; + + up_write(&uobj->mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_pd_idr, uobj->id); +err_copy: + idr_remove_uobj(&ib_uverbs_pd_idr, uobj); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); +err_idr: ib_dealloc_pd(pd); err: - kfree(uobj); + put_uobj_write(uobj); return ret; } @@ -349,37 +524,34 @@ ssize_t ib_uverbs_dealloc_pd(struct ib_u int in_len, int out_len) { struct ib_uverbs_dealloc_pd cmd; - struct ib_pd *pd; struct ib_uobject *uobj; - int ret = -EINVAL; + int ret; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); + uobj = idr_write_uobj(&ib_uverbs_pd_idr, cmd.pd_handle, file->ucontext); + if (!uobj) + return -EINVAL; - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); - if (!pd || pd->uobject->context != file->ucontext) - goto out; + ret = ib_dealloc_pd(uobj->object); + if (!ret) + 
uobj->live = 0; - uobj = pd->uobject; + put_uobj_write(uobj); - ret = ib_dealloc_pd(pd); if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_pd_idr, cmd.pd_handle); + idr_remove_uobj(&ib_uverbs_pd_idr, uobj); mutex_lock(&file->mutex); list_del(&uobj->list); mutex_unlock(&file->mutex); - kfree(uobj); - -out: - mutex_unlock(&ib_uverbs_idr_mutex); + put_uobj(uobj); - return ret ? ret : in_len; + return in_len; } ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, @@ -419,7 +591,8 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverb if (!obj) return -ENOMEM; - obj->uobject.context = file->ucontext; + init_uobj(&obj->uobject, 0, file->ucontext); + down_write(&obj->uobject.mutex); /* * We ask for writable memory if any access flags other than @@ -436,23 +609,14 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverb obj->umem.virt_base = cmd.hca_va; - mutex_lock(&ib_uverbs_idr_mutex); - - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); - if (!pd || pd->uobject->context != file->ucontext) { - ret = -EINVAL; - goto err_up; - } - - if (!pd->device->reg_user_mr) { - ret = -ENOSYS; - goto err_up; - } + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + if (!pd) + goto err_release; mr = pd->device->reg_user_mr(pd, &obj->umem, cmd.access_flags, &udata); if (IS_ERR(mr)) { ret = PTR_ERR(mr); - goto err_up; + goto err_put; } mr->device = pd->device; @@ -461,43 +625,48 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverb atomic_inc(&pd->usecnt); atomic_set(&mr->usecnt, 0); - memset(&resp, 0, sizeof resp); - resp.lkey = mr->lkey; - resp.rkey = mr->rkey; - - ret = idr_add_uobj(&ib_uverbs_mr_idr, mr, &obj->uobject); + obj->uobject.object = mr; + ret = idr_add_uobj(&ib_uverbs_mr_idr, &obj->uobject); if (ret) goto err_unreg; + memset(&resp, 0, sizeof resp); + resp.lkey = mr->lkey; + resp.rkey = mr->rkey; resp.mr_handle = obj->uobject.id; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } + put_pd_read(pd); + 
mutex_lock(&file->mutex); list_add_tail(&obj->uobject.list, &file->ucontext->mr_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + obj->uobject.live = 1; + + up_write(&obj->uobject.mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_mr_idr, obj->uobject.id); +err_copy: + idr_remove_uobj(&ib_uverbs_mr_idr, &obj->uobject); err_unreg: ib_dereg_mr(mr); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); +err_put: + put_pd_read(pd); +err_release: ib_umem_release(file->device->ib_dev, &obj->umem); err_free: - kfree(obj); + put_uobj_write(&obj->uobject); return ret; } @@ -507,37 +676,40 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uve { struct ib_uverbs_dereg_mr cmd; struct ib_mr *mr; + struct ib_uobject *uobj; struct ib_umem_object *memobj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - mr = idr_find(&ib_uverbs_mr_idr, cmd.mr_handle); - if (!mr || mr->uobject->context != file->ucontext) - goto out; + uobj = idr_write_uobj(&ib_uverbs_mr_idr, cmd.mr_handle, file->ucontext); + if (!uobj) + return -EINVAL; - memobj = container_of(mr->uobject, struct ib_umem_object, uobject); + memobj = container_of(uobj, struct ib_umem_object, uobject); + mr = uobj->object; ret = ib_dereg_mr(mr); + if (!ret) + uobj->live = 0; + + put_uobj_write(uobj); + if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_mr_idr, cmd.mr_handle); + idr_remove_uobj(&ib_uverbs_mr_idr, uobj); mutex_lock(&file->mutex); - list_del(&memobj->uobject.list); + list_del(&uobj->list); mutex_unlock(&file->mutex); ib_umem_release(file->device->ib_dev, &memobj->umem); - kfree(memobj); -out: - mutex_unlock(&ib_uverbs_idr_mutex); + put_uobj(uobj); - return ret ? 
ret : in_len; + return in_len; } ssize_t ib_uverbs_create_comp_channel(struct ib_uverbs_file *file, @@ -576,7 +748,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uv struct ib_uverbs_create_cq cmd; struct ib_uverbs_create_cq_resp resp; struct ib_udata udata; - struct ib_ucq_object *uobj; + struct ib_ucq_object *obj; struct ib_uverbs_event_file *ev_file = NULL; struct ib_cq *cq; int ret; @@ -594,10 +766,13 @@ ssize_t ib_uverbs_create_cq(struct ib_uv if (cmd.comp_vector >= file->device->num_comp_vectors) return -EINVAL; - uobj = kmalloc(sizeof *uobj, GFP_KERNEL); - if (!uobj) + obj = kmalloc(sizeof *obj, GFP_KERNEL); + if (!obj) return -ENOMEM; + init_uobj(&obj->uobject, cmd.user_handle, file->ucontext); + down_write(&obj->uobject.mutex); + if (cmd.comp_channel >= 0) { ev_file = ib_uverbs_lookup_comp_file(cmd.comp_channel); if (!ev_file) { @@ -606,63 +781,64 @@ ssize_t ib_uverbs_create_cq(struct ib_uv } } - uobj->uobject.user_handle = cmd.user_handle; - uobj->uobject.context = file->ucontext; - uobj->uverbs_file = file; - uobj->comp_events_reported = 0; - uobj->async_events_reported = 0; - INIT_LIST_HEAD(&uobj->comp_list); - INIT_LIST_HEAD(&uobj->async_list); + obj->uverbs_file = file; + obj->comp_events_reported = 0; + obj->async_events_reported = 0; + INIT_LIST_HEAD(&obj->comp_list); + INIT_LIST_HEAD(&obj->async_list); cq = file->device->ib_dev->create_cq(file->device->ib_dev, cmd.cqe, file->ucontext, &udata); if (IS_ERR(cq)) { ret = PTR_ERR(cq); - goto err; + goto err_file; } cq->device = file->device->ib_dev; - cq->uobject = &uobj->uobject; + cq->uobject = &obj->uobject; cq->comp_handler = ib_uverbs_comp_handler; cq->event_handler = ib_uverbs_cq_event_handler; cq->cq_context = ev_file; atomic_set(&cq->usecnt, 0); - mutex_lock(&ib_uverbs_idr_mutex); - - ret = idr_add_uobj(&ib_uverbs_cq_idr, cq, &uobj->uobject); + obj->uobject.object = cq; + ret = idr_add_uobj(&ib_uverbs_cq_idr, &obj->uobject); if (ret) - goto err_up; + goto err_free; memset(&resp, 0, sizeof resp); - 
resp.cq_handle = uobj->uobject.id; + resp.cq_handle = obj->uobject.id; resp.cqe = cq->cqe; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } mutex_lock(&file->mutex); - list_add_tail(&uobj->uobject.list, &file->ucontext->cq_list); + list_add_tail(&obj->uobject.list, &file->ucontext->cq_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + obj->uobject.live = 1; + + up_write(&obj->uobject.mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_cq_idr, uobj->uobject.id); +err_copy: + idr_remove_uobj(&ib_uverbs_cq_idr, &obj->uobject); + -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); +err_free: ib_destroy_cq(cq); -err: +err_file: if (ev_file) - ib_uverbs_release_ucq(file, ev_file, uobj); - kfree(uobj); + ib_uverbs_release_ucq(file, ev_file, obj); + +err: + put_uobj_write(&obj->uobject); return ret; } @@ -673,6 +849,7 @@ ssize_t ib_uverbs_resize_cq(struct ib_uv struct ib_uverbs_resize_cq cmd; struct ib_uverbs_resize_cq_resp resp; struct ib_udata udata; + struct ib_uobject *uobj; struct ib_cq *cq; int ret = -EINVAL; @@ -683,11 +860,10 @@ ssize_t ib_uverbs_resize_cq(struct ib_uv (unsigned long) cmd.response + sizeof resp, in_len - sizeof cmd, out_len - sizeof resp); - mutex_lock(&ib_uverbs_idr_mutex); - - cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); - if (!cq || cq->uobject->context != file->ucontext || !cq->device->resize_cq) - goto out; + uobj = idr_read_uobj(&ib_uverbs_cq_idr, cmd.cq_handle, file->ucontext); + if (!uobj) + return -EINVAL; + cq = uobj->object; ret = cq->device->resize_cq(cq, cmd.cqe, &udata); if (ret) @@ -701,7 +877,7 @@ ssize_t ib_uverbs_resize_cq(struct ib_uv ret = -EFAULT; out: - mutex_unlock(&ib_uverbs_idr_mutex); + put_uobj_read(uobj); return ret ? 
ret : in_len; } @@ -712,6 +888,7 @@ ssize_t ib_uverbs_poll_cq(struct ib_uver { struct ib_uverbs_poll_cq cmd; struct ib_uverbs_poll_cq_resp *resp; + struct ib_uobject *uobj; struct ib_cq *cq; struct ib_wc *wc; int ret = 0; @@ -732,15 +909,17 @@ ssize_t ib_uverbs_poll_cq(struct ib_uver goto out_wc; } - mutex_lock(&ib_uverbs_idr_mutex); - cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); - if (!cq || cq->uobject->context != file->ucontext) { + uobj = idr_read_uobj(&ib_uverbs_cq_idr, cmd.cq_handle, file->ucontext); + if (!uobj) { ret = -EINVAL; goto out; } + cq = uobj->object; resp->count = ib_poll_cq(cq, cmd.ne, wc); + put_uobj_read(uobj); + for (i = 0; i < resp->count; i++) { resp->wc[i].wr_id = wc[i].wr_id; resp->wc[i].status = wc[i].status; @@ -762,7 +941,6 @@ ssize_t ib_uverbs_poll_cq(struct ib_uver ret = -EFAULT; out: - mutex_unlock(&ib_uverbs_idr_mutex); kfree(resp); out_wc: @@ -775,22 +953,23 @@ ssize_t ib_uverbs_req_notify_cq(struct i int out_len) { struct ib_uverbs_req_notify_cq cmd; + struct ib_uobject *uobj; struct ib_cq *cq; - int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); - if (cq && cq->uobject->context == file->ucontext) { - ib_req_notify_cq(cq, cmd.solicited_only ? - IB_CQ_SOLICITED : IB_CQ_NEXT_COMP); - ret = in_len; - } - mutex_unlock(&ib_uverbs_idr_mutex); + uobj = idr_read_uobj(&ib_uverbs_cq_idr, cmd.cq_handle, file->ucontext); + if (!uobj) + return -EINVAL; + cq = uobj->object; - return ret; + ib_req_notify_cq(cq, cmd.solicited_only ? 
+ IB_CQ_SOLICITED : IB_CQ_NEXT_COMP); + + put_uobj_read(uobj); + + return in_len; } ssize_t ib_uverbs_destroy_cq(struct ib_uverbs_file *file, @@ -799,52 +978,50 @@ ssize_t ib_uverbs_destroy_cq(struct ib_u { struct ib_uverbs_destroy_cq cmd; struct ib_uverbs_destroy_cq_resp resp; + struct ib_uobject *uobj; struct ib_cq *cq; - struct ib_ucq_object *uobj; + struct ib_ucq_object *obj; struct ib_uverbs_event_file *ev_file; - u64 user_handle; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - memset(&resp, 0, sizeof resp); - - mutex_lock(&ib_uverbs_idr_mutex); + uobj = idr_write_uobj(&ib_uverbs_cq_idr, cmd.cq_handle, file->ucontext); + if (!uobj) + return -EINVAL; + cq = uobj->object; + ev_file = cq->cq_context; + obj = container_of(cq->uobject, struct ib_ucq_object, uobject); - cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); - if (!cq || cq->uobject->context != file->ucontext) - goto out; + ret = ib_destroy_cq(cq); + if (!ret) + uobj->live = 0; - user_handle = cq->uobject->user_handle; - uobj = container_of(cq->uobject, struct ib_ucq_object, uobject); - ev_file = cq->cq_context; + put_uobj_write(uobj); - ret = ib_destroy_cq(cq); if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_cq_idr, cmd.cq_handle); + idr_remove_uobj(&ib_uverbs_cq_idr, uobj); mutex_lock(&file->mutex); - list_del(&uobj->uobject.list); + list_del(&uobj->list); mutex_unlock(&file->mutex); - ib_uverbs_release_ucq(file, ev_file, uobj); + ib_uverbs_release_ucq(file, ev_file, obj); - resp.comp_events_reported = uobj->comp_events_reported; - resp.async_events_reported = uobj->async_events_reported; + memset(&resp, 0, sizeof resp); + resp.comp_events_reported = obj->comp_events_reported; + resp.async_events_reported = obj->async_events_reported; - kfree(uobj); + put_uobj(uobj); if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) - ret = -EFAULT; - -out: - mutex_unlock(&ib_uverbs_idr_mutex); + return -EFAULT; - return ret ? 
ret : in_len; + return in_len; } ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file, @@ -854,7 +1031,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv struct ib_uverbs_create_qp cmd; struct ib_uverbs_create_qp_resp resp; struct ib_udata udata; - struct ib_uqp_object *uobj; + struct ib_uqp_object *obj; struct ib_pd *pd; struct ib_cq *scq, *rcq; struct ib_srq *srq; @@ -872,23 +1049,21 @@ ssize_t ib_uverbs_create_qp(struct ib_uv (unsigned long) cmd.response + sizeof resp, in_len - sizeof cmd, out_len - sizeof resp); - uobj = kmalloc(sizeof *uobj, GFP_KERNEL); - if (!uobj) + obj = kmalloc(sizeof *obj, GFP_KERNEL); + if (!obj) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); + init_uobj(&obj->uevent.uobject, cmd.user_handle, file->ucontext); + down_write(&obj->uevent.uobject.mutex); - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); - scq = idr_find(&ib_uverbs_cq_idr, cmd.send_cq_handle); - rcq = idr_find(&ib_uverbs_cq_idr, cmd.recv_cq_handle); - srq = cmd.is_srq ? idr_find(&ib_uverbs_srq_idr, cmd.srq_handle) : NULL; + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + scq = idr_read_cq(cmd.send_cq_handle, file->ucontext); + rcq = idr_read_cq(cmd.recv_cq_handle, file->ucontext); + srq = cmd.is_srq ? 
idr_read_srq(cmd.srq_handle, file->ucontext) : NULL; - if (!pd || pd->uobject->context != file->ucontext || - !scq || scq->uobject->context != file->ucontext || - !rcq || rcq->uobject->context != file->ucontext || - (cmd.is_srq && (!srq || srq->uobject->context != file->ucontext))) { + if (!pd || !scq || !rcq || (cmd.is_srq && !srq)) { ret = -EINVAL; - goto err_up; + goto err_put; } attr.event_handler = ib_uverbs_qp_event_handler; @@ -905,16 +1080,14 @@ ssize_t ib_uverbs_create_qp(struct ib_uv attr.cap.max_recv_sge = cmd.max_recv_sge; attr.cap.max_inline_data = cmd.max_inline_data; - uobj->uevent.uobject.user_handle = cmd.user_handle; - uobj->uevent.uobject.context = file->ucontext; - uobj->uevent.events_reported = 0; - INIT_LIST_HEAD(&uobj->uevent.event_list); - INIT_LIST_HEAD(&uobj->mcast_list); + obj->uevent.events_reported = 0; + INIT_LIST_HEAD(&obj->uevent.event_list); + INIT_LIST_HEAD(&obj->mcast_list); qp = pd->device->create_qp(pd, &attr, &udata); if (IS_ERR(qp)) { ret = PTR_ERR(qp); - goto err_up; + goto err_put; } qp->device = pd->device; @@ -922,7 +1095,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv qp->send_cq = attr.send_cq; qp->recv_cq = attr.recv_cq; qp->srq = attr.srq; - qp->uobject = &uobj->uevent.uobject; + qp->uobject = &obj->uevent.uobject; qp->event_handler = attr.event_handler; qp->qp_context = attr.qp_context; qp->qp_type = attr.qp_type; @@ -932,14 +1105,14 @@ ssize_t ib_uverbs_create_qp(struct ib_uv if (attr.srq) atomic_inc(&attr.srq->usecnt); - memset(&resp, 0, sizeof resp); - resp.qpn = qp->qp_num; - - ret = idr_add_uobj(&ib_uverbs_qp_idr, qp, &uobj->uevent.uobject); + obj->uevent.uobject.object = qp; + ret = idr_add_uobj(&ib_uverbs_qp_idr, &obj->uevent.uobject); if (ret) goto err_destroy; - resp.qp_handle = uobj->uevent.uobject.id; + memset(&resp, 0, sizeof resp); + resp.qpn = qp->qp_num; + resp.qp_handle = obj->uevent.uobject.id; resp.max_recv_sge = attr.cap.max_recv_sge; resp.max_send_sge = attr.cap.max_send_sge; resp.max_recv_wr = 
attr.cap.max_recv_wr; @@ -949,27 +1122,36 @@ ssize_t ib_uverbs_create_qp(struct ib_uv if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } mutex_lock(&file->mutex); - list_add_tail(&uobj->uevent.uobject.list, &file->ucontext->qp_list); + list_add_tail(&obj->uevent.uobject.list, &file->ucontext->qp_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + obj->uevent.uobject.live = 1; + + up_write(&obj->uevent.uobject.mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_qp_idr, uobj->uevent.uobject.id); +err_copy: + idr_remove_uobj(&ib_uverbs_qp_idr, &obj->uevent.uobject); err_destroy: ib_destroy_qp(qp); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); - - kfree(uobj); +err_put: + if (pd) + put_pd_read(pd); + if (scq) + put_cq_read(scq); + if (rcq) + put_cq_read(rcq); + if (srq) + put_srq_read(srq); + + put_uobj_write(&obj->uevent.uobject); return ret; } @@ -994,15 +1176,15 @@ ssize_t ib_uverbs_query_qp(struct ib_uve goto out; } - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (qp && qp->uobject->context == file->ucontext) - ret = ib_query_qp(qp, attr, cmd.attr_mask, init_attr); - else + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) { ret = -EINVAL; + goto out; + } - mutex_unlock(&ib_uverbs_idr_mutex); + ret = ib_query_qp(qp, attr, cmd.attr_mask, init_attr); + + put_qp_read(qp); if (ret) goto out; @@ -1089,10 +1271,8 @@ ssize_t ib_uverbs_modify_qp(struct ib_uv if (!attr) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) { + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) { ret = -EINVAL; goto out; } @@ -1144,13 +1324,15 @@ ssize_t ib_uverbs_modify_qp(struct ib_uv attr->alt_ah_attr.port_num = cmd.alt_dest.port_num; ret = ib_modify_qp(qp, attr, cmd.attr_mask); + + put_qp_read(qp); + if (ret) 
goto out; ret = in_len; out: - mutex_unlock(&ib_uverbs_idr_mutex); kfree(attr); return ret; @@ -1162,8 +1344,9 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u { struct ib_uverbs_destroy_qp cmd; struct ib_uverbs_destroy_qp_resp resp; + struct ib_uobject *uobj; struct ib_qp *qp; - struct ib_uqp_object *uobj; + struct ib_uqp_object *obj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1171,43 +1354,43 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u memset(&resp, 0, sizeof resp); - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) - goto out; - - uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + uobj = idr_write_uobj(&ib_uverbs_qp_idr, cmd.qp_handle, file->ucontext); + if (!uobj) + return -EINVAL; + qp = uobj->object; + obj = container_of(uobj, struct ib_uqp_object, uevent.uobject); - if (!list_empty(&uobj->mcast_list)) { - ret = -EBUSY; - goto out; + if (!list_empty(&obj->mcast_list)) { + put_uobj_write(uobj); + return -EBUSY; } ret = ib_destroy_qp(qp); + if (!ret) + uobj->live = 0; + + put_uobj_write(uobj); + if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_qp_idr, cmd.qp_handle); + idr_remove_uobj(&ib_uverbs_qp_idr, uobj); mutex_lock(&file->mutex); - list_del(&uobj->uevent.uobject.list); + list_del(&uobj->list); mutex_unlock(&file->mutex); - ib_uverbs_release_uevent(file, &uobj->uevent); + ib_uverbs_release_uevent(file, &obj->uevent); - resp.events_reported = uobj->uevent.events_reported; + resp.events_reported = obj->uevent.events_reported; - kfree(uobj); + put_uobj(uobj); if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) - ret = -EFAULT; - -out: - mutex_unlock(&ib_uverbs_idr_mutex); + return -EFAULT; - return ret ? 
ret : in_len; + return in_len; } ssize_t ib_uverbs_post_send(struct ib_uverbs_file *file, @@ -1220,6 +1403,7 @@ ssize_t ib_uverbs_post_send(struct ib_uv struct ib_send_wr *wr = NULL, *last, *next, *bad_wr; struct ib_qp *qp; int i, sg_ind; + int is_ud; ssize_t ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1236,12 +1420,11 @@ ssize_t ib_uverbs_post_send(struct ib_uv if (!user_wr) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) goto out; + is_ud = qp->qp_type == IB_QPT_UD; sg_ind = 0; last = NULL; for (i = 0; i < cmd.wr_count; ++i) { @@ -1249,12 +1432,12 @@ ssize_t ib_uverbs_post_send(struct ib_uv buf + sizeof cmd + i * cmd.wqe_size, cmd.wqe_size)) { ret = -EFAULT; - goto out; + goto out_put; } if (user_wr->num_sge + sg_ind > cmd.sge_count) { ret = -EINVAL; - goto out; + goto out_put; } next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) + @@ -1262,7 +1445,7 @@ ssize_t ib_uverbs_post_send(struct ib_uv GFP_KERNEL); if (!next) { ret = -ENOMEM; - goto out; + goto out_put; } if (!last) @@ -1278,12 +1461,12 @@ ssize_t ib_uverbs_post_send(struct ib_uv next->send_flags = user_wr->send_flags; next->imm_data = (__be32 __force) user_wr->imm_data; - if (qp->qp_type == IB_QPT_UD) { - next->wr.ud.ah = idr_find(&ib_uverbs_ah_idr, - user_wr->wr.ud.ah); + if (is_ud) { + next->wr.ud.ah = idr_read_ah(user_wr->wr.ud.ah, + file->ucontext); if (!next->wr.ud.ah) { ret = -EINVAL; - goto out; + goto out_put; } next->wr.ud.remote_qpn = user_wr->wr.ud.remote_qpn; next->wr.ud.remote_qkey = user_wr->wr.ud.remote_qkey; @@ -1320,7 +1503,7 @@ ssize_t ib_uverbs_post_send(struct ib_uv sg_ind * sizeof (struct ib_sge), next->num_sge * sizeof (struct ib_sge))) { ret = -EFAULT; - goto out; + goto out_put; } sg_ind += next->num_sge; } else @@ -1340,10 +1523,13 @@ ssize_t ib_uverbs_post_send(struct ib_uv &resp, 
sizeof resp)) ret = -EFAULT; -out: - mutex_unlock(&ib_uverbs_idr_mutex); +out_put: + put_qp_read(qp); +out: while (wr) { + if (is_ud && wr->wr.ud.ah) + put_ah_read(wr->wr.ud.ah); next = wr->next; kfree(wr); wr = next; @@ -1458,14 +1644,15 @@ ssize_t ib_uverbs_post_recv(struct ib_uv if (IS_ERR(wr)) return PTR_ERR(wr); - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) goto out; resp.bad_wr = 0; ret = qp->device->post_recv(qp, wr, &bad_wr); + + put_qp_read(qp); + if (ret) for (next = wr; next; next = next->next) { ++resp.bad_wr; @@ -1479,8 +1666,6 @@ ssize_t ib_uverbs_post_recv(struct ib_uv ret = -EFAULT; out: - mutex_unlock(&ib_uverbs_idr_mutex); - while (wr) { next = wr->next; kfree(wr); @@ -1509,14 +1694,15 @@ ssize_t ib_uverbs_post_srq_recv(struct i if (IS_ERR(wr)) return PTR_ERR(wr); - mutex_lock(&ib_uverbs_idr_mutex); - - srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); - if (!srq || srq->uobject->context != file->ucontext) + srq = idr_read_srq(cmd.srq_handle, file->ucontext); + if (!srq) goto out; resp.bad_wr = 0; ret = srq->device->post_srq_recv(srq, wr, &bad_wr); + + put_srq_read(srq); + if (ret) for (next = wr; next; next = next->next) { ++resp.bad_wr; @@ -1530,8 +1716,6 @@ ssize_t ib_uverbs_post_srq_recv(struct i ret = -EFAULT; out: - mutex_unlock(&ib_uverbs_idr_mutex); - while (wr) { next = wr->next; kfree(wr); @@ -1563,17 +1747,15 @@ ssize_t ib_uverbs_create_ah(struct ib_uv if (!uobj) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); + init_uobj(uobj, cmd.user_handle, file->ucontext); + down_write(&uobj->mutex); - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); - if (!pd || pd->uobject->context != file->ucontext) { + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + if (!pd) { ret = -EINVAL; - goto err_up; + goto err; } - uobj->user_handle = cmd.user_handle; - uobj->context = file->ucontext; - 
attr.dlid = cmd.attr.dlid; attr.sl = cmd.attr.sl; attr.src_path_bits = cmd.attr.src_path_bits; @@ -1589,12 +1771,11 @@ ssize_t ib_uverbs_create_ah(struct ib_uv ah = ib_create_ah(pd, &attr); if (IS_ERR(ah)) { ret = PTR_ERR(ah); - goto err_up; + goto err; } ah->uobject = uobj; - - ret = idr_add_uobj(&ib_uverbs_ah_idr, ah, uobj); + ret = idr_add_uobj(&ib_uverbs_ah_idr, uobj); if (ret) goto err_destroy; @@ -1603,27 +1784,29 @@ ssize_t ib_uverbs_create_ah(struct ib_uv if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } + put_pd_read(pd); + mutex_lock(&file->mutex); list_add_tail(&uobj->list, &file->ucontext->ah_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + uobj->live = 1; + + up_write(&uobj->mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_ah_idr, uobj->id); +err_copy: + idr_remove_uobj(&ib_uverbs_ah_idr, uobj); err_destroy: ib_destroy_ah(ah); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); - - kfree(uobj); +err: + put_uobj_write(uobj); return ret; } @@ -1633,35 +1816,34 @@ ssize_t ib_uverbs_destroy_ah(struct ib_u struct ib_uverbs_destroy_ah cmd; struct ib_ah *ah; struct ib_uobject *uobj; - int ret = -EINVAL; + int ret; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); + uobj = idr_write_uobj(&ib_uverbs_ah_idr, cmd.ah_handle, file->ucontext); + if (!uobj) + return -EINVAL; + ah = uobj->object; - ah = idr_find(&ib_uverbs_ah_idr, cmd.ah_handle); - if (!ah || ah->uobject->context != file->ucontext) - goto out; + ret = ib_destroy_ah(ah); + if (!ret) + uobj->live = 0; - uobj = ah->uobject; + put_uobj_write(uobj); - ret = ib_destroy_ah(ah); if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_ah_idr, cmd.ah_handle); + idr_remove_uobj(&ib_uverbs_ah_idr, uobj); mutex_lock(&file->mutex); list_del(&uobj->list); mutex_unlock(&file->mutex); - kfree(uobj); + put_uobj(uobj); -out: - 
mutex_unlock(&ib_uverbs_idr_mutex); - - return ret ? ret : in_len; + return in_len; } ssize_t ib_uverbs_attach_mcast(struct ib_uverbs_file *file, @@ -1670,47 +1852,43 @@ ssize_t ib_uverbs_attach_mcast(struct ib { struct ib_uverbs_attach_mcast cmd; struct ib_qp *qp; - struct ib_uqp_object *uobj; + struct ib_uqp_object *obj; struct ib_uverbs_mcast_entry *mcast; - int ret = -EINVAL; + int ret; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) - goto out; + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) + return -EINVAL; - uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + obj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); - list_for_each_entry(mcast, &uobj->mcast_list, list) + list_for_each_entry(mcast, &obj->mcast_list, list) if (cmd.mlid == mcast->lid && !memcmp(cmd.gid, mcast->gid.raw, sizeof mcast->gid.raw)) { ret = 0; - goto out; + goto out_put; } mcast = kmalloc(sizeof *mcast, GFP_KERNEL); if (!mcast) { ret = -ENOMEM; - goto out; + goto out_put; } mcast->lid = cmd.mlid; memcpy(mcast->gid.raw, cmd.gid, sizeof mcast->gid.raw); ret = ib_attach_mcast(qp, &mcast->gid, cmd.mlid); - if (!ret) { - uobj = container_of(qp->uobject, struct ib_uqp_object, - uevent.uobject); - list_add_tail(&mcast->list, &uobj->mcast_list); - } else + if (!ret) + list_add_tail(&mcast->list, &obj->mcast_list); + else kfree(mcast); -out: - mutex_unlock(&ib_uverbs_idr_mutex); +out_put: + put_qp_read(qp); return ret ? 
ret : in_len; } @@ -1720,7 +1898,7 @@ ssize_t ib_uverbs_detach_mcast(struct ib int out_len) { struct ib_uverbs_detach_mcast cmd; - struct ib_uqp_object *uobj; + struct ib_uqp_object *obj; struct ib_qp *qp; struct ib_uverbs_mcast_entry *mcast; int ret = -EINVAL; @@ -1728,19 +1906,17 @@ ssize_t ib_uverbs_detach_mcast(struct ib if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) - goto out; + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) + return -EINVAL; ret = ib_detach_mcast(qp, (union ib_gid *) cmd.gid, cmd.mlid); if (ret) - goto out; + goto out_put; - uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + obj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); - list_for_each_entry(mcast, &uobj->mcast_list, list) + list_for_each_entry(mcast, &obj->mcast_list, list) if (cmd.mlid == mcast->lid && !memcmp(cmd.gid, mcast->gid.raw, sizeof mcast->gid.raw)) { list_del(&mcast->list); @@ -1748,8 +1924,8 @@ ssize_t ib_uverbs_detach_mcast(struct ib break; } -out: - mutex_unlock(&ib_uverbs_idr_mutex); +out_put: + put_qp_read(qp); return ret ? 
ret : in_len; } @@ -1761,7 +1937,7 @@ ssize_t ib_uverbs_create_srq(struct ib_u struct ib_uverbs_create_srq cmd; struct ib_uverbs_create_srq_resp resp; struct ib_udata udata; - struct ib_uevent_object *uobj; + struct ib_uevent_object *obj; struct ib_pd *pd; struct ib_srq *srq; struct ib_srq_init_attr attr; @@ -1777,17 +1953,17 @@ ssize_t ib_uverbs_create_srq(struct ib_u (unsigned long) cmd.response + sizeof resp, in_len - sizeof cmd, out_len - sizeof resp); - uobj = kmalloc(sizeof *uobj, GFP_KERNEL); - if (!uobj) + obj = kmalloc(sizeof *obj, GFP_KERNEL); + if (!obj) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); - - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); + init_uobj(&obj->uobject, 0, file->ucontext); + down_write(&obj->uobject.mutex); - if (!pd || pd->uobject->context != file->ucontext) { + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + if (!pd) { ret = -EINVAL; - goto err_up; + goto err; } attr.event_handler = ib_uverbs_srq_event_handler; @@ -1796,59 +1972,59 @@ ssize_t ib_uverbs_create_srq(struct ib_u attr.attr.max_sge = cmd.max_sge; attr.attr.srq_limit = cmd.srq_limit; - uobj->uobject.user_handle = cmd.user_handle; - uobj->uobject.context = file->ucontext; - uobj->events_reported = 0; - INIT_LIST_HEAD(&uobj->event_list); + obj->events_reported = 0; + INIT_LIST_HEAD(&obj->event_list); srq = pd->device->create_srq(pd, &attr, &udata); if (IS_ERR(srq)) { ret = PTR_ERR(srq); - goto err_up; + goto err; } srq->device = pd->device; srq->pd = pd; - srq->uobject = &uobj->uobject; + srq->uobject = &obj->uobject; srq->event_handler = attr.event_handler; srq->srq_context = attr.srq_context; atomic_inc(&pd->usecnt); atomic_set(&srq->usecnt, 0); - memset(&resp, 0, sizeof resp); - - ret = idr_add_uobj(&ib_uverbs_srq_idr, srq, &uobj->uobject); + obj->uobject.object = srq; + ret = idr_add_uobj(&ib_uverbs_srq_idr, &obj->uobject); if (ret) goto err_destroy; - resp.srq_handle = uobj->uobject.id; + memset(&resp, 0, sizeof resp); + resp.srq_handle = obj->uobject.id; 
resp.max_wr = attr.attr.max_wr; resp.max_sge = attr.attr.max_sge; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } + put_pd_read(pd); + mutex_lock(&file->mutex); - list_add_tail(&uobj->uobject.list, &file->ucontext->srq_list); + list_add_tail(&obj->uobject.list, &file->ucontext->srq_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + obj->uobject.live = 1; + + up_write(&obj->uobject.mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_srq_idr, uobj->uobject.id); +err_copy: + idr_remove_uobj(&ib_uverbs_srq_idr, &obj->uobject); err_destroy: ib_destroy_srq(srq); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); - - kfree(uobj); +err: + put_uobj_write(&obj->uobject); return ret; } @@ -1864,21 +2040,16 @@ ssize_t ib_uverbs_modify_srq(struct ib_u if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); - if (!srq || srq->uobject->context != file->ucontext) { - ret = -EINVAL; - goto out; - } + srq = idr_read_srq(cmd.srq_handle, file->ucontext); + if (!srq) + return -EINVAL; attr.max_wr = cmd.max_wr; attr.srq_limit = cmd.srq_limit; ret = ib_modify_srq(srq, &attr, cmd.attr_mask); -out: - mutex_unlock(&ib_uverbs_idr_mutex); + put_srq_read(srq); return ret ? 
ret : in_len; } @@ -1899,18 +2070,16 @@ ssize_t ib_uverbs_query_srq(struct ib_uv if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); + srq = idr_read_srq(cmd.srq_handle, file->ucontext); + if (!srq) + return -EINVAL; - srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); - if (srq && srq->uobject->context == file->ucontext) - ret = ib_query_srq(srq, &attr); - else - ret = -EINVAL; + ret = ib_query_srq(srq, &attr); - mutex_unlock(&ib_uverbs_idr_mutex); + put_srq_read(srq); if (ret) - goto out; + return ret; memset(&resp, 0, sizeof resp); @@ -1920,10 +2089,9 @@ ssize_t ib_uverbs_query_srq(struct ib_uv if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) - ret = -EFAULT; + return -EFAULT; -out: - return ret ? ret : in_len; + return in_len; } ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file, @@ -1932,45 +2100,45 @@ ssize_t ib_uverbs_destroy_srq(struct ib_ { struct ib_uverbs_destroy_srq cmd; struct ib_uverbs_destroy_srq_resp resp; + struct ib_uobject *uobj; struct ib_srq *srq; - struct ib_uevent_object *uobj; + struct ib_uevent_object *obj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - memset(&resp, 0, sizeof resp); + uobj = idr_write_uobj(&ib_uverbs_srq_idr, cmd.srq_handle, file->ucontext); + if (!uobj) + return -EINVAL; + srq = uobj->object; + obj = container_of(uobj, struct ib_uevent_object, uobject); - srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); - if (!srq || srq->uobject->context != file->ucontext) - goto out; + ret = ib_destroy_srq(srq); + if (!ret) + uobj->live = 0; - uobj = container_of(srq->uobject, struct ib_uevent_object, uobject); + put_uobj_write(uobj); - ret = ib_destroy_srq(srq); if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_srq_idr, cmd.srq_handle); + idr_remove_uobj(&ib_uverbs_srq_idr, uobj); mutex_lock(&file->mutex); - list_del(&uobj->uobject.list); + list_del(&uobj->list); 
mutex_unlock(&file->mutex); - ib_uverbs_release_uevent(file, uobj); + ib_uverbs_release_uevent(file, obj); - resp.events_reported = uobj->events_reported; + memset(&resp, 0, sizeof resp); + resp.events_reported = obj->events_reported; - kfree(uobj); + put_uobj(uobj); if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) ret = -EFAULT; -out: - mutex_unlock(&ib_uverbs_idr_mutex); - return ret ? ret : in_len; } diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index ff092a0..5ec2d49 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -66,7 +66,7 @@ #define IB_UVERBS_BASE_DEV MKDEV(IB_UVER static struct class *uverbs_class; -DEFINE_MUTEX(ib_uverbs_idr_mutex); +DEFINE_SPINLOCK(ib_uverbs_idr_lock); DEFINE_IDR(ib_uverbs_pd_idr); DEFINE_IDR(ib_uverbs_mr_idr); DEFINE_IDR(ib_uverbs_mw_idr); @@ -183,21 +183,21 @@ static int ib_uverbs_cleanup_ucontext(st if (!context) return 0; - mutex_lock(&ib_uverbs_idr_mutex); - list_for_each_entry_safe(uobj, tmp, &context->ah_list, list) { - struct ib_ah *ah = idr_find(&ib_uverbs_ah_idr, uobj->id); - idr_remove(&ib_uverbs_ah_idr, uobj->id); + struct ib_ah *ah = uobj->object; + + idr_remove_uobj(&ib_uverbs_ah_idr, uobj); ib_destroy_ah(ah); list_del(&uobj->list); kfree(uobj); } list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) { - struct ib_qp *qp = idr_find(&ib_uverbs_qp_idr, uobj->id); + struct ib_qp *qp = uobj->object; struct ib_uqp_object *uqp = container_of(uobj, struct ib_uqp_object, uevent.uobject); - idr_remove(&ib_uverbs_qp_idr, uobj->id); + + idr_remove_uobj(&ib_uverbs_qp_idr, uobj); ib_uverbs_detach_umcast(qp, uqp); ib_destroy_qp(qp); list_del(&uobj->list); @@ -206,11 +206,12 @@ static int ib_uverbs_cleanup_ucontext(st } list_for_each_entry_safe(uobj, tmp, &context->cq_list, list) { - struct ib_cq *cq = idr_find(&ib_uverbs_cq_idr, uobj->id); + struct ib_cq *cq = uobj->object; struct ib_uverbs_event_file 
*ev_file = cq->cq_context; struct ib_ucq_object *ucq = container_of(uobj, struct ib_ucq_object, uobject); - idr_remove(&ib_uverbs_cq_idr, uobj->id); + + idr_remove_uobj(&ib_uverbs_cq_idr, uobj); ib_destroy_cq(cq); list_del(&uobj->list); ib_uverbs_release_ucq(file, ev_file, ucq); @@ -218,10 +219,11 @@ static int ib_uverbs_cleanup_ucontext(st } list_for_each_entry_safe(uobj, tmp, &context->srq_list, list) { - struct ib_srq *srq = idr_find(&ib_uverbs_srq_idr, uobj->id); + struct ib_srq *srq = uobj->object; struct ib_uevent_object *uevent = container_of(uobj, struct ib_uevent_object, uobject); - idr_remove(&ib_uverbs_srq_idr, uobj->id); + + idr_remove_uobj(&ib_uverbs_srq_idr, uobj); ib_destroy_srq(srq); list_del(&uobj->list); ib_uverbs_release_uevent(file, uevent); @@ -231,11 +233,11 @@ static int ib_uverbs_cleanup_ucontext(st /* XXX Free MWs */ list_for_each_entry_safe(uobj, tmp, &context->mr_list, list) { - struct ib_mr *mr = idr_find(&ib_uverbs_mr_idr, uobj->id); + struct ib_mr *mr = uobj->object; struct ib_device *mrdev = mr->device; struct ib_umem_object *memobj; - idr_remove(&ib_uverbs_mr_idr, uobj->id); + idr_remove_uobj(&ib_uverbs_mr_idr, uobj); ib_dereg_mr(mr); memobj = container_of(uobj, struct ib_umem_object, uobject); @@ -246,15 +248,14 @@ static int ib_uverbs_cleanup_ucontext(st } list_for_each_entry_safe(uobj, tmp, &context->pd_list, list) { - struct ib_pd *pd = idr_find(&ib_uverbs_pd_idr, uobj->id); - idr_remove(&ib_uverbs_pd_idr, uobj->id); + struct ib_pd *pd = uobj->object; + + idr_remove_uobj(&ib_uverbs_pd_idr, uobj); ib_dealloc_pd(pd); list_del(&uobj->list); kfree(uobj); } - mutex_unlock(&ib_uverbs_idr_mutex); - return context->device->dealloc_ucontext(context); } diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 7ced208..ee1f3a3 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -697,8 +697,12 @@ struct ib_ucontext { struct ib_uobject { u64 user_handle; /* handle given to us by userspace */ struct ib_ucontext 
*context; /* associated user context */ + void *object; /* containing object */ struct list_head list; /* link to context's list */ u32 id; /* index into kernel idr */ + struct kref ref; + struct rw_semaphore mutex; /* protects .live */ + int live; }; struct ib_umem { From mshefty at ichips.intel.com Mon Jun 12 13:04:45 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 12 Jun 2006 13:04:45 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: References: Message-ID: <448DC8DD.20800@ichips.intel.com> Sean Hefty wrote: > I'd like to propose that the MAD layer detect duplicate requests. After a > request MAD has been handed to a client, its context would be maintained until > the user calls ib_free_recv_mad(), allowing duplicate requests to be discarded. > {snip} > > Finally, a way would need to be found for when to call ib_free_recv_mad() for > userspace clients. I've been trying to come up with a way to handle userspace clients. Here are a few ideas: 1. Export ib_free_recv_mad() to userspace. This changes the ABI, and would require changes to all existing clients for things to work properly. My preference would be to avoid this option. 2. Change the MAD registration, so that clients indicate which methods generate responses. Again, this changes the ABI. 3. Hard-code which methods generate responses. For most management classes, there's only 3-6 methods that generate responses. The kernel umad module would only free a request MAD after a response had been generated. This would make umad class aware, and would not work for user-defined classes. 4. Modify umad to learn which requests generate responses, by examining response MADs. When a response is sent, umad would mark which method the response is for by flipping the R-bit. Based on the algorithm, this could result in losing responses the first time that a request is seen. Some additional hard-coding would be needed for a Set, since a Set request generates GetResp MADs. 
Comments?

- Sean

From sweitzen at cisco.com  Mon Jun 12 13:21:34 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Mon, 12 Jun 2006 13:21:34 -0700
Subject: [openib-general] IB MTU tunable for uDAPL and/or Intel MPI?
Message-ID: 

This didn't help: osu_bibw.c still reports max bi bandwidth in the
1600s; it should be in the 1900s.  I looked back at my notes, and OFED
1.0 rc4 had the desired max bi bandwidth.  Did the uDAPL IB MTU change?

$ mpiexec -genv I_MPI_DAPL_PROVIDER OpenIB-scm -genv I_MPI_DEBUG 3 -genv I_MPI_DEVICE rdssm -genv LD_LIBRARY_PATH .../lib -n 2 ../osu_bibw.x
I_MPI: [0] set_up_devices(): will use device: libmpi.rdssm.so
I_MPI: [0] set_up_devices(): will use DAPL provider: OpenIB-cma
I_MPI: [0] set_up_devices(): will use device: libmpi.rdssm.so
I_MPI: [0] set_up_devices(): will use DAPL provider: OpenIB-cma

# OSU MPI Bidirectional Bandwidth Test (Version 2.1)
# Size          Bi-Bandwidth (MB/s)
1               0.813478
2               1.637650
4               3.260333
8               6.627831
16              12.168080
32              25.683379
64              50.580351
128             95.035855
256             174.132061
512             310.656179
1024            513.066433
2048            726.685587
4096            877.233753
8192            973.311995
16384           1040.096136
32768           849.790165
65536           1088.723063
131072          1296.584344
262144          1428.176271
524288          1540.248671
1048576         1579.665660
2097152         1608.765475
4194304         1628.157462

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems

> -----Original Message-----
> From: Arlin Davis [mailto:ardavis at ichips.intel.com]
> Sent: Friday, June 09, 2006 11:38 AM
> To: Scott Weitzenkamp (sweitzen)
> Cc: Tziporet Koren; openfabrics-ewg at openib.org; Davis, Arlin
> R; Lentini, James; openib-general
> Subject: Re: [openib-general] IB MTU tunable for uDAPL and/or
> Intel MPI?
>
> Scott Weitzenkamp (sweitzen) wrote:
>
> > While we're talking about MTUs, is the IB MTU tunable in
> uDAPL and/or
> > Intel MPI via env var or config file?
> > > > Looks like Intel MPI 2.0.1 uses 2K for IB MTU like MVAPICH does in > > OFED 1.0 rc4 and rc6, I'd like to try 1K with Intel MPI. > > > > Scott > > > There is no mechanism for me to modify the MTU using rdma_cm > so whatever > is returned in the path record is what you get with the OpenIB-cma > provider. However, you could use the OpenIB-scm provider > which is hard > coded for 1K MTU as a comparision. Can you run with "-genv > I_MPI_DAPL_PROVIDER OpenIB-scm" on your cluster? > > -arlin > > > > > > -------------------------------------------------------------- > ---------- > > *From:* openib-general-bounces at openib.org > > [mailto:openib-general-bounces at openib.org] *On Behalf Of *Scott > > Weitzenkamp (sweitzen) > > *Sent:* Thursday, June 08, 2006 4:38 PM > > *To:* Tziporet Koren; openfabrics-ewg at openib.org > > *Cc:* openib-general > > *Subject:* RE: [openib-general] OFED-1.0-rc6 is available > > > > The MTU change undos the changes for bug 81, so I have reopened > > bug 81 (http://openib.org/bugzilla/show_bug.cgi?id=81). > > > > With rc6, PCI-X osu_bw and osu_bibw performance is bad, > and PCI-E > > osu_bibw performance is bad. I've enclosed some > performance data, > > look at rc4 vs rc5 vs rc6 for Cougar/Cheetah/LionMini. > > > > Are there other benchmarks driving the changes in rc6 (and rc4)? > > > > Scott Weitzenkamp > > SQA and Release Manager > > Server Virtualization Business Unit > > Cisco Systems > > > > > > > > > > *OSU MPI:* > > > > * Added mpi_alltoall fine tuning parameters > > > > * Added default configuration/documentation file > > $MPIHOME/etc/mvapich.conf > > > > * Added shell configuration files > > $MPIHOME/etc/mvapich.csh , $MPIHOME/etc/mvapich.csh > > > > * Default MTU was changed back to 2K for > InfiniHost III > > Ex and InfiniHost III Lx HCAs. 
For InfiniHost card > recommended > > value is: > > VIADEV_DEFAULT_MTU=MTU1024 > > > >------------------------------------------------------------- > ----------- > > > >_______________________________________________ > >openib-general mailing list > >openib-general at openib.org > >http://openib.org/mailman/listinfo/openib-general > > > >To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > From rdreier at cisco.com Mon Jun 12 13:37:11 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 Jun 2006 13:37:11 -0700 Subject: [openib-general] [PATCH] mthca: memfree completion with error workaround In-Reply-To: <20060612121635.GX7359@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 12 Jun 2006 15:16:35 +0300") References: <20060612121635.GX7359@mellanox.co.il> Message-ID: This makes me sad. We're adding considerable code to the CQ polling fast path to handle a rare FW bug. I wish there were a better way. - R. From mst at mellanox.co.il Mon Jun 12 13:47:59 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Jun 2006 23:47:59 +0300 Subject: [openib-general] [PATCH] mthca: memfree completion with error workaround In-Reply-To: References: Message-ID: <20060612204759.GB17643@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] mthca: memfree completion with error workaround > > This makes me sad. We're adding considerable code to the CQ polling > fast path It might not be too bad - there's a single additional test on fastpath, and I am guessing both wqe_index and rq.max should be in registers already. Once wqe_index is out of rq.max range we are on slow path. > I wish there were a better way. We can make it a compile-time option, so that users can disable it once there's a firmware that does not need this code. 
-- MST From rdreier at cisco.com Mon Jun 12 13:47:40 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 Jun 2006 13:47:40 -0700 Subject: [openib-general] [PATCH] mthca: memfree completion with error workaround In-Reply-To: <20060612204759.GB17643@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 12 Jun 2006 23:47:59 +0300") References: <20060612204759.GB17643@mellanox.co.il> Message-ID: Michael> It might not be too bad - there's a single additional Michael> test on fastpath, and I am guessing both wqe_index and Michael> rq.max should be in registers already. Once wqe_index is Michael> out of rq.max range we are on slow path. But it bloats the function and adds to i-cache footprint. I'm sure it benchmarks fine but it adds to general cache usage that pushes useful code out of cache. Unfortunately I don't see a clean way to move it out of line. Michael> We can make it a compile-time option, so that users can Michael> disable it once there's a firmware that does not need Michael> this code. No distro is ever going to turn it off though. - R. From ardavis at ichips.intel.com Mon Jun 12 14:00:43 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Mon, 12 Jun 2006 14:00:43 -0700 Subject: [openib-general] IB MTU tunable for uDAPL and/or Intel MPI? In-Reply-To: References: Message-ID: <448DD5FB.5090205@ichips.intel.com> Scott Weitzenkamp (sweitzen) wrote: >This didn't help. Osu_bibw.c still reports max bi bandwidth in the >1600s, should be in the 1900s. I looked back at my notes, and OFED 1.0 >rc4 had desired max bi bandwidth with OFED 1.0 rc4, did the uDAPL IB MTU >change? > > uDAPL does not have any control over IB MTU using OpenIB-cma. We just use the path record that is supplied from Open SM. Not sure where or when the change occurred but it is not in uDAPL.
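For reference, the 1K/2K values being compared in this thread travel as the IB path-MTU enum (mirrored by libibverbs as enum ibv_mtu). A small standalone helper, written here as a sketch, converts the encoding to bytes:

```c
#include <assert.h>

/* IB path MTU encoding per the InfiniBand spec (libibverbs mirrors
 * these values in enum ibv_mtu): 1 -> 256 bytes up to 5 -> 4096. */
enum ib_mtu {
	IB_MTU_256  = 1,
	IB_MTU_512  = 2,
	IB_MTU_1024 = 3,
	IB_MTU_2048 = 4,
	IB_MTU_4096 = 5,
};

static int ib_mtu_to_bytes(int mtu)
{
	if (mtu < IB_MTU_256 || mtu > IB_MTU_4096)
		return -1;	/* not a valid encoding */
	return 128 << mtu;	/* 128 << 3 == 1024, 128 << 4 == 2048 */
}
```

So VIADEV_DEFAULT_MTU=MTU1024 selects encoding 3, and the 2K default selects encoding 4; the path record carries the same encoding.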
>$ mpiexec -genv I_MPI_DAPL_PROVIDER OpenIB-scm -genv I_MPI_DEBUG 3 -genv >I_MPI_DEVICE rdssm -genv LD_LIBRARY_PATH .../lib -n 2 ../osu_bibw.x >I_MPI: [0] set_up_devices(): will use device: libmpi.rdssm.so >I_MPI: [0] set_up_devices(): will use DAPL provider: OpenIB-cma >I_MPI: [0] set_up_devices(): will use device: libmpi.rdssm.so >I_MPI: [0] set_up_devices(): will use DAPL provider: OpenIB-cma > > It picked up the OpenIB-cma device instead of OpenIB-scm. -arlin From ftillier at silverstorm.com Mon Jun 12 14:00:46 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Mon, 12 Jun 2006 14:00:46 -0700 Subject: [openib-general] [PATCH] mthca: memfree completion with error workaround In-Reply-To: References: <20060612204759.GB17643@mellanox.co.il> Message-ID: <79ae2f320606121400j320ee074v939d61435ee93cad@mail.gmail.com> Hi Roland, On 6/12/06, Roland Dreier wrote: > Michael> It might not be too bad - there's a single additional > Michael> test on fastpath, and I am guessing both wqe_index and > Michael> rq.max should be in registers already. Once wqe_index is > Michael> out of rq.max range we are on slow path. > > But it bloats the function and adds to i-cache footprint. I'm sure it > benchmarks fine but it adds to general cache usage that pushes useful > code out of cache. > > Unfortunately I don't see a clean way to move it out of line. Why not just have multiple implementations of the function, and set up the function pointer in the verbs according to what firmware and device is in use? That way devices not affected could continue to use the optimized version... Just a thought - it does have the drawback of having multiple similar functions.
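Fabian's suggestion in outline: pick the polling routine once, at device init, from the firmware revision, so unaffected devices keep the lean fast path. All names below are hypothetical stand-ins, not the actual mthca or verbs structures:

```c
#include <assert.h>

/* Hypothetical device with a per-device CQ-polling entry point. */
struct toy_dev {
	unsigned int fw_rev;
	int (*poll_cq)(struct toy_dev *dev);
};

static int poll_cq_fast(struct toy_dev *dev)
{
	(void) dev;
	return 0;	/* lean path, no WQE-index fixup */
}

static int poll_cq_workaround(struct toy_dev *dev)
{
	(void) dev;
	return 1;	/* path carrying the WQE-index fixup */
}

/* Hypothetical predicate for firmware known to report bad WQE
 * indices in error CQEs. */
static int fw_needs_workaround(unsigned int fw_rev)
{
	return fw_rev == 0x100800 || fw_rev == 0x501400;
}

/* Dispatch is chosen once here, not tested on every completion. */
static void toy_dev_init(struct toy_dev *dev, unsigned int fw_rev)
{
	dev->fw_rev = fw_rev;
	dev->poll_cq = fw_needs_workaround(fw_rev) ?
		poll_cq_workaround : poll_cq_fast;
}
```

The drawback he concedes remains: two near-identical poll routines to keep in sync.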
- Fab From mshefty at ichips.intel.com Mon Jun 12 14:03:51 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 12 Jun 2006 14:03:51 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: References: Message-ID: <448DD6B7.7010305@ichips.intel.com> > I started thinking about the "kill ib_uverbs_idr_mutex" problem, and I > realized that there are actually some interesting issues there (as > described in the comment at the top of uverbs_cmd.c). In fact I ended > up coding the solution below. This passes some basic tests but it > could probably use some review. The basic approach seems fine to me. It would be nice to eliminate the live flag, but I can't think of a way to do so. - Sean From rdreier at cisco.com Mon Jun 12 14:18:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 Jun 2006 14:18:09 -0700 Subject: [openib-general] [PATCH] mthca: memfree completion with error workaround In-Reply-To: <79ae2f320606121400j320ee074v939d61435ee93cad@mail.gmail.com> (Fabian Tillier's message of "Mon, 12 Jun 2006 14:00:46 -0700") References: <20060612204759.GB17643@mellanox.co.il> <79ae2f320606121400j320ee074v939d61435ee93cad@mail.gmail.com> Message-ID: Fabian> Why not just have multiple implemenations of the function, Fabian> and setup the function pointer in the verbs according to Fabian> what firmware and device is in use? That way devices not Fabian> affected could continue to use the optimized version... Yup, that's the obvious solution. But I'm not sure it's worth that much bloat in this case. 
From rdreier at cisco.com Mon Jun 12 14:19:32 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 Jun 2006 14:19:32 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: <448DD6B7.7010305@ichips.intel.com> (Sean Hefty's message of "Mon, 12 Jun 2006 14:03:51 -0700") References: <448DD6B7.7010305@ichips.intel.com> Message-ID: Sean> It would be nice to eliminate the live flag, but I can't Sean> think of a way to do so. Agreed -- unfortunately I'm not smart enough either (and believe me I came up with some complicated attempts) From rjwalsh at pathscale.com Mon Jun 12 14:29:16 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Mon, 12 Jun 2006 14:29:16 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: References: Message-ID: <1150147756.23063.16.camel@hematite.internal.keyresearch.com> On Mon, 2006-06-12 at 12:36 -0700, Roland Dreier wrote: > IB/uverbs: Don't serialize with ib_uverbs_idr_mutex This looks good - I had started something similar but your solution solves some problems mine had (by using the live flag, even if it is kind of bleh.) Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From rdreier at cisco.com Mon Jun 12 14:49:15 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 Jun 2006 14:49:15 -0700 Subject: [openib-general] [PATCH] mthca: memfree completion with error workaround In-Reply-To: <20060612121635.GX7359@mellanox.co.il> (Michael S. 
Tsirkin's message of "Mon, 12 Jun 2006 15:16:35 +0300") References: <20060612121635.GX7359@mellanox.co.il> Message-ID: > + /* WQE index == -1 might be reported by > + Sinai FW 1.0.800, Arbel FW 5.1.400 and should be fixed > + in later revisions. */ In the future please use /* * comment */ style comments. > + if (unlikely(wqe_index >= (*cur_qp)->rq.max)) { > + if (unlikely(is_error) && > + unlikely(wqe_index == 0xffffffff >> wq->wqe_shift) && seems like the inside unlikely()s are wrong here -- the reason we expect to be here is exactly the reason being marked unlikely, which is backwards. > + mthca_is_memfree(dev)) > + wqe_index = wq->max - 1; > + else { > + mthca_err(dev, "Corrupted RQ CQE. " > + "CQ 0x%x QP 0x%x idx 0x%x > 0x%x\n", > + cq->cqn, entry->qp_num, wqe_index, > + wq->max); > + return -EINVAL; This should probably be "err = -EINVAL; goto out;" right? > + } > + } > entry->wr_id = (*cur_qp)->wrid[wqe_index]; > } > From bos at pathscale.com Mon Jun 12 15:09:45 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 12 Jun 2006 15:09:45 -0700 Subject: [openib-general] OFED 1.0-rc6 tarball available with working ipath driver In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0007F0B6BD@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0007F0B6BD@orsmsx408> Message-ID: <1150150185.3217.0.camel@chalcedony.pathscale.com> On Mon, 2006-06-12 at 10:49 -0700, Woodruff, Robert J wrote: > Still does not seem to compile. Please try this one instead: http://openib.red-bean.com/OFED-1.0-rc6+ipath-2.tar.bz2 From mshefty at ichips.intel.com Mon Jun 12 15:22:19 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 12 Jun 2006 15:22:19 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <448DC8DD.20800@ichips.intel.com> References: <448DC8DD.20800@ichips.intel.com> Message-ID: <448DE91B.3080607@ichips.intel.com> Sean Hefty wrote: > 4. Modify umad to learn which requests generate responses, by examining response > MADs. 
When a response is sent, umad would mark which method the response is for > by flipping the R-bit. Based on the algorithm, this could result in losing > responses the first time that a request is seen. Some additional hard-coding > would be needed for a Set, since a Set request generates GetResp MADs. This brings up a concern. There doesn't seem to be a limit to the number of received MADs that can be queued for a user. Should we have such a limit? - Sean From robert.j.woodruff at intel.com Mon Jun 12 15:41:36 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 12 Jun 2006 15:41:36 -0700 Subject: [openib-general] OFED 1.0-rc6 tarball available with working ipath driver Message-ID: <1AC79F16F5C5284499BB9591B33D6F0007F0BDBB@orsmsx408> Brian wrote, >Please try this one instead: http://openib.red-bean.com/OFED-1.0-rc6+ipath-2.tar.bz2 Got farther, but now fails trying to build DAPL/rdmacm. This did not fail with the original RC6. woody gcc: /var/tmp/OFED/tmp/openib/openib/src/userspace/librdmacm/src/.libs/.libs/ librdmacm.so: No such file or directory make[2]: *** [dapl/udapl/libdaplcma.la] Error 1 make[2]: Leaving directory `/var/tmp/OFED/tmp/openib/openib/src/userspace/dapl' make[1]: *** [all] Error 2 make[1]: Leaving directory `/var/tmp/OFED/tmp/openib/openib/src/userspace/dapl' make: *** [dapl] Error 2 ERROR: Failed to execute: env make user ~ From krause at cup.hp.com Mon Jun 12 14:18:27 2006 From: krause at cup.hp.com (Michael Krause) Date: Mon, 12 Jun 2006 14:18:27 -0700 Subject: [openib-general] IB MTU tunable for uDAPL and/or Intel MPI? 
In-Reply-To: References: Message-ID: <6.2.0.14.2.20060612135825.02de0fe8@esmail.cup.hp.com> At 10:44 AM 6/9/2006, Scott Weitzenkamp (sweitzen) wrote: >Content-class: urn:content-classes:message >Content-Type: multipart/alternative; > boundary="----_=_NextPart_001_01C68BEC.6C768F57" >Content-Transfer-Encoding: 7bit > >While we're talking about MTUs, is the IB MTU tunable in uDAPL and/or >Intel MPI via env var or config file? > >Looks like Intel MPI 2.0.1 uses 2K for IB MTU like MVAPICH does in OFED >1.0 rc4 and rc6, I'd like to try 1K with Intel MPI. IB MTU should be set on a per path basis by the SM. An application should examine the PMTU for a given path and take appropriate action - really only applies to UD as connected mode should automatically SAR requests. Communicating PMTU to an application should not occur unless it is datagram based. The same is true for iWARP where TCP / IP takes care of the PMTU on behalf of the ULP / application. If you want to control PMTU, then do so via the SM directly which was the intention of the architecture and specification. Mike > >Scott > >---------- >From: openib-general-bounces at openib.org >[mailto:openib-general-bounces at openib.org] On Behalf Of Scott Weitzenkamp >(sweitzen) >Sent: Thursday, June 08, 2006 4:38 PM >To: Tziporet Koren; openfabrics-ewg at openib.org >Cc: openib-general >Subject: RE: [openib-general] OFED-1.0-rc6 is available >The MTU change undoes the changes for bug 81, so I have reopened bug 81 >(http://openib.org/bugzilla/show_bug.cgi?id=81). > > >With rc6, PCI-X osu_bw and osu_bibw performance is bad, and PCI-E osu_bibw >performance is bad. I've enclosed some performance data, look at rc4 vs >rc5 vs rc6 for Cougar/Cheetah/LionMini. > >Are there other benchmarks driving the changes in rc6 (and rc4)?
> >Scott Weitzenkamp >SQA and Release Manager >Server Virtualization Business Unit >Cisco Systems > > > > >OSU MPI: >· Added mpi_alltoall fine tuning parameters >· Added default configuration/documentation file >$MPIHOME/etc/mvapich.conf >· Added shell configuration files $MPIHOME/etc/mvapich.csh , >$MPIHOME/etc/mvapich.csh >· Default MTU was changed back to 2K for InfiniHost III Ex and >InfiniHost III Lx HCAs. For InfiniHost card recommended value is: >VIADEV_DEFAULT_MTU=MTU1024 >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From bos at pathscale.com Mon Jun 12 15:50:53 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 12 Jun 2006 15:50:53 -0700 Subject: [openib-general] OFED 1.0-rc6 tarball available with working ipath driver In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0007F0BDBB@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0007F0BDBB@orsmsx408> Message-ID: <1150152653.3217.8.camel@chalcedony.pathscale.com> On Mon, 2006-06-12 at 15:41 -0700, Woodruff, Robert J wrote: > Brian wrote, > > >Please try this one instead: > > http://openib.red-bean.com/OFED-1.0-rc6+ipath-2.tar.bz2 > > Got farther, but now fails trying to build > DAPL/rdmacm. This did not fail with the original RC6. Yeah, I see that too. As long as the ipath driver built, I'm happy enough for now :-) References: <448DC8DD.20800@ichips.intel.com> <448DE91B.3080607@ichips.intel.com> Message-ID: <1150152620.570.113031.camel@hal.voltaire.com> On Mon, 2006-06-12 at 18:22, Sean Hefty wrote: > Sean Hefty wrote: > > 4. Modify umad to learn which requests generate responses, by examining response > > MADs. When a response is sent, umad would mark which method the response is for > > by flipping the R-bit. 
Based on the algorithm, this could result in losing > > responses the first time that a request is seen. Some additional hard-coding > > would be needed for a Set, since a Set request generates GetResp MADs. > > This brings up a concern. There doesn't seem to be a limit to the number of > received MADs that can be queued for a user. Should we have such a limit? How are MADs counted ? Is a multisegment MAD 1 MAD or multiple MADs ? If the latter, it seems problematic to limit this as the response to a get response might be very large. -- Hal > > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From betsy at pathscale.com Mon Jun 12 16:13:40 2006 From: betsy at pathscale.com (Betsy Zeller) Date: Mon, 12 Jun 2006 16:13:40 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.0-rc6 tarball available with working ipath driver In-Reply-To: <448C2946.5010707@mellanox.co.il> References: <1149895236.27921.2.camel@pelerin.serpentine.com> <448C2946.5010707@mellanox.co.il> Message-ID: <1150154020.3034.107.camel@sarium.pathscale.com> Tziporet - Bryan has confirmed that with the patches you've copied, things should work correctly. We've been testing with our version, but I really want to test on the OFED-1.0 version that you've built. Can you send us a pointer to it? Thanks, Betsy On Sun, 2006-06-11 at 17:31 +0300, Tziporet Koren wrote: > Bryan O'Sullivan wrote: > > Due to unfortunate timing, the ipath driver in OFED 1.0-rc6 does not > > work correctly. You can download an updated tarball from here, for > > which the ipath driver works fine: > > > > http://openib.red-bean.com/OFED-1.0-rc6+ipath.tar.bz2 > > > > Alternatively, pull the necessary patches from SVN. 
> > > > > > > __ > Hi Bryan > > You have forgotten some of the patches in your tarball file, thus several > OSes do not pass (e.g. RH EL4 up3). > > /openib-1.0/patches/ > ls */ipath* > 2.6.11_FC4/ipath_backport.patch 2.6.13/ipath_backport.patch > 2.6.15/ipath_backport.patch > 2.6.11/ipath_backport.patch 2.6.13_suse10_0_u/ipath_backport.patch > 2.6.9/ipath_backport.patch > 2.6.12/ipath_backport.patch 2.6.14/ipath_backport.patch > fixes/ipath_rollup.patch > /openib-1.0/patches/ > ls > 2.6.11/ 2.6.12/ 2.6.13_suse10_0_u/ 2.6.15/ 2.6.16_sles10/ > 2.6.9/ dapl/ memtrack/ > 2.6.11_FC4/ 2.6.13/ 2.6.14/ 2.6.16/ 2.6.17/ > 2.6.9_U3/ fixes/ > > I took the liberty of copying these patches to svn since I noticed that > ipath backport patches are the same for all OSes & kernels. > Please take a look and make sure the ipath driver compiles & loads on all > supported systems (including ia64, PPC64 etc.) > > Tziporet > > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg > From sean.hefty at intel.com Mon Jun 12 16:32:36 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 12 Jun 2006 16:32:36 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.0-rc6 tarball available with working ipath driver In-Reply-To: <1150154020.3034.107.camel@sarium.pathscale.com> Message-ID: <000b01c68e78$7d214570$ff0da8c0@amr.corp.intel.com> >Tziporet - Bryan has confirmed that with the patches you've copied, >things should work correctly. We've been testing with our version, but >I really want to test on the OFED-1.0 version that you've built. Can you >send us a pointer to it? How can you go from an RC6 that doesn't build to a 1.0 release? Shouldn't you at least get a release candidate that builds first?
- Sean From mshefty at ichips.intel.com Mon Jun 12 16:43:26 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 12 Jun 2006 16:43:26 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <1150152620.570.113031.camel@hal.voltaire.com> References: <448DC8DD.20800@ichips.intel.com> <448DE91B.3080607@ichips.intel.com> <1150152620.570.113031.camel@hal.voltaire.com> Message-ID: <448DFC1E.5080602@ichips.intel.com> Hal Rosenstock wrote: >>This brings up a concern. There doesn't seem to be a limit to the number of >>received MADs that can be queued for a user. Should we have such a limit? > > > How are MADs counted ? Is a multisegment MAD 1 MAD or multiple MADs ? If > the latter, it seems problematic to limit this as the response to a get > response might be very large. I could go either way, or use a hybrid of some sort. Counting a multisegment MAD as 1 MAD might be a little easier. We could also allow something like 100 segments, or at least 1 reassembled MAD. So, the user could have 100 single segment MADs, 50 2-segment MADs, etc. Without some sort of restriction, a userspace app that's slow to pull receive MADs from the kernel would result in consuming a large amount of kernel memory. - Sean From robert.j.woodruff at intel.com Mon Jun 12 16:47:23 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 12 Jun 2006 16:47:23 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.0-rc6 tarball available with working ipath driver Message-ID: <1AC79F16F5C5284499BB9591B33D6F0007F40A7C@orsmsx408> Sean wrote, >>Tziporet - Bryan has confirmed that with the patches you've copied, >>things should work correctly. We've been testing with our version, but >>I really want to test on the OFED-1.0 version that you've built. Can you >>send us a pointer to it? >How can you go from an RC6 that doesn't build to a 1.0 release? Shouldn't you >at least get a release candidate that builds first?
>- Sean I agree, don't see how we can go from something that has never been tested by the wider community to released. Has anyone run uDAPL tests or Intel MPI with Pathscale ? woody From sweitzen at cisco.com Mon Jun 12 17:29:38 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 12 Jun 2006 17:29:38 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.0-rc6 tarball available with working ipath driver Message-ID: I agree, having an rc7 then ~three days to test it for regressions seems appropriate. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of > Woodruff, Robert J > Sent: Monday, June 12, 2006 4:47 PM > To: Hefty, Sean; Betsy Zeller; Tziporet Koren > Cc: OpenFabricsEWG; openib-general > Subject: Re: [openib-general] [openfabrics-ewg] OFED 1.0-rc6 > tarball available with working ipath driver > > Sean wrote, > >>Tziporet - Bryan has confirmed that with the patches you've copied, > >>things should work correctly. We've been testing with our > version, but > >>I really want to test on the OFED-1.0 version that you've built. Can > you > >>send us a pointer to it? > > >How can you go from an RC6 that doesn't build to a 1.0 release? > Shouldn't you > >at least get a release candidate that builds first? > > >- Sean > > I agree, don't see how we can go from something that has never been > tested > by the wider community to released. Has anyone run uDAPL > tests or Intel > MPI > with Pathscale ? 
> > woody > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From boris at mellanox.com Mon Jun 12 17:53:22 2006 From: boris at mellanox.com (Boris Shpolyansky) Date: Mon, 12 Jun 2006 17:53:22 -0700 Subject: [openib-general] MVAPICH failure on IBM PPC-64 Linux machine Message-ID: <1E3DCD1C63492545881FACB6063A57C1324241@mtiexch01.mti.com> Hi, I've run into the following failure running OSU MPI out of OFED-rc5 on an IBM PPC-64 platform: [1] Abort: Error creating QP at line 820 in file viainit.c mpirun: executable version 1 does not match our version 3, This seems to be a memory allocation issue which could be easily explained (and overcome) if the job is launched with regular user permissions, but in my case it's root who launches it. Has anybody tested OFED's OSU MPI on the PPC-64 platform recently and can comment on this ? Thanks, Boris Shpolyansky Application Engineer Mellanox Technologies Inc. 2900 Stender Way Santa Clara, CA 95054 Tel.: (408) 916 0014 Fax: (408) 970 3403 Cell: (408) 834 9365 www.mellanox.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From bos at pathscale.com Mon Jun 12 19:03:43 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 12 Jun 2006 19:03:43 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.0-rc6 tarball available with working ipath driver In-Reply-To: <448C2946.5010707@mellanox.co.il> References: <1149895236.27921.2.camel@pelerin.serpentine.com> <448C2946.5010707@mellanox.co.il> Message-ID: <1150164223.741.3.camel@pelerin.serpentine.com> On Sun, 2006-06-11 at 17:31 +0300, Tziporet Koren wrote: > Please take a look and make sure the ipath driver compiles & loads on all > supported systems (including ia64, PPC64 etc.)
I can't build successfully on RHEL4 U3, because SDP isn't compiling there; it complains about the parameter count on line 1168 of sdp_main.c. Looks like the ipath stuff is fine on the systems we tried (SLES10 RC2 and RHEL4 U3). References: <448DC8DD.20800@ichips.intel.com> <448DE91B.3080607@ichips.intel.com> <1150152620.570.113031.camel@hal.voltaire.com> <448DFC1E.5080602@ichips.intel.com> Message-ID: <1150167512.570.122321.camel@hal.voltaire.com> On Mon, 2006-06-12 at 19:43, Sean Hefty wrote: > Hal Rosenstock wrote: > >>This brings up a concern. There doesn't seem to be a limit to the number of > >>received MADs that can be queued for a user. Should we have such a limit? > > > > > > How are MADs counted ? Is a multisegment MAD 1 MAD or multiple MADs ? If > > the latter, it seems problematic to limit this as the response to a get > > response might be very large. > > I could go either way, or use a hybrid of some sort Counting a multisegment MAD > as 1 MAD might be a little easier. We could also allow something like 100 > segments, or at least 1 reassembled MAD. So, the user could have 100 single > segment MADs, 50 2-segment MADs, etc. This seems prone to introducing a different problem and that this number would need to scale with the size of the subnet. > Without some sort of restriction, a userspace app that's slow to pull receive > MADs from the kernel would result in consuming a large amount of kernel memory. Understood but dropping a MAD after acknowledging also seems like a bad thing to me. Couldn't this be controlled on the request side (assuming the request has a response as opposed to unsolicited sends/receives) ? 
-- Hal > - Sean From viswa.krish at gmail.com Mon Jun 12 20:16:00 2006 From: viswa.krish at gmail.com (Viswanath Krishnamurthy) Date: Mon, 12 Jun 2006 20:16:00 -0700 Subject: [openib-general] opensm and NPTL Message-ID: <4df28be40606122016t35a7a6d0s57f218dfea344283@mail.gmail.com> There were some issues with opensm running with NPTL (thread library). Has the issues been resolved ? Regards, Viswa -------------- next part -------------- An HTML attachment was scrubbed... URL: From zhushisongzhu at yahoo.com Mon Jun 12 20:28:31 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Mon, 12 Jun 2006 20:28:31 -0700 (PDT) Subject: [openib-general] it's hard to download through svn In-Reply-To: Message-ID: <20060613032831.5296.qmail@web36913.mail.mud.yahoo.com> It's hard to download OFED-rc6 release through svn. Do you suggest some more direct way to get them? tks zhu __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From mst at mellanox.co.il Mon Jun 12 21:26:47 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 13 Jun 2006 07:26:47 +0300 Subject: [openib-general] [PATCH] mthca: memfree completion with error workaround In-Reply-To: References: Message-ID: <20060613042647.GA4621@mellanox.co.il> Quoting r. Roland Dreier : > > + if (unlikely(wqe_index >= (*cur_qp)->rq.max)) { > > + if (unlikely(is_error) && > > + unlikely(wqe_index == 0xffffffff >> wq->wqe_shift) && > > seems like the inside unlikely()s are wrong here -- the reason we > expect to be here is exactly the reason being marked unlikely, which > is backwards. Hmm, right. > > + mthca_is_memfree(dev)) > > + wqe_index = wq->max - 1; > > + else { > > + mthca_err(dev, "Corrupted RQ CQE. " > > + "CQ 0x%x QP 0x%x idx 0x%x > 0x%x\n", > > + cq->cqn, entry->qp_num, wqe_index, > > + wq->max); > > + return -EINVAL; > > This should probably be "err = -EINVAL; goto out;" right? 
But note this branch is there "just in case". -- MST From mst at mellanox.co.il Mon Jun 12 21:40:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 13 Jun 2006 07:40:21 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <1150167512.570.122321.camel@hal.voltaire.com> References: <1150167512.570.122321.camel@hal.voltaire.com> Message-ID: <20060613044021.GC4621@mellanox.co.il> Quoting r. Hal Rosenstock : > > Without some sort of restriction, a userspace app that's slow to pull receive > > MADs from the kernel would result in consuming a large amount of kernel memory. > > Understood but dropping a MAD after acknowledging also seems like a bad > thing to me. True. Maybe we can find a way to avoid acknowledging the MAD? > Couldn't this be controlled on the request side (assuming > the request has a response as opposed to unsolicited sends/receives) ? Sounds like the wrong thing to do. -- MST From mst at mellanox.co.il Mon Jun 12 21:47:40 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 13 Jun 2006 07:47:40 +0300 Subject: [openib-general] [PATCH] mthca: memfree completion with error workaround In-Reply-To: References: Message-ID: <20060613044740.GD4621@mellanox.co.il> Quoting r. Roland Dreier : > But it bloats the function and adds to i-cache footprint. I'm sure it > benchmarks fine but it adds to general cache usage that pushes useful > code out of cache. Hmm. Would you be more comfortable with just + if (unlikely(wqe_index >= wq->max)) + wqe_index = wq->max - 1; The other case is there just in case, to catch firmware errors. -- MST From mst at mellanox.co.il Mon Jun 12 22:11:49 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 13 Jun 2006 08:11:49 +0300 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: References: Message-ID: <20060613051149.GE4621@mellanox.co.il> Quoting r. 
Roland Dreier : > @@ -1089,10 +1271,8 @@ ssize_t ib_uverbs_modify_qp(struct ib_uv > if (!attr) > return -ENOMEM; > > - mutex_lock(&ib_uverbs_idr_mutex); > - > - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); > - if (!qp || qp->uobject->context != file->ucontext) { > + qp = idr_read_qp(cmd.qp_handle, file->ucontext); > + if (!qp) { > ret = -EINVAL; > goto out; > } > @@ -1144,13 +1324,15 @@ ssize_t ib_uverbs_modify_qp(struct ib_uv > attr->alt_ah_attr.port_num = cmd.alt_dest.port_num; > > ret = ib_modify_qp(qp, attr, cmd.attr_mask); > + > + put_qp_read(qp); > + > if (ret) > goto out; > > ret = in_len; > > out: > - mutex_unlock(&ib_uverbs_idr_mutex); > kfree(attr); > > return ret; Won't this let the user issue multiple modify QP commands in parallel on the same QP? mthca at least does not protect against such attempts, and doing this will confuse the hardware. -- MST From halr at voltaire.com Tue Jun 13 03:10:33 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 06:10:33 -0400 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060613044021.GC4621@mellanox.co.il> References: <1150167512.570.122321.camel@hal.voltaire.com> <20060613044021.GC4621@mellanox.co.il> Message-ID: <1150193430.570.138279.camel@hal.voltaire.com> On Tue, 2006-06-13 at 00:40, Michael S. Tsirkin wrote: > Quoting r. Hal Rosenstock : > > > Without some sort of restriction, a userspace app that's slow to pull receive > > > MADs from the kernel would result in consuming a large amount of kernel memory. > > > > Understood but dropping a MAD after acknowledging also seems like a bad > > thing to me. > > True. Maybe we can find a way to avoid acknowledging the MAD? There are architected ways to do that. There's busy for MADs which could be used for some MADs. For RMPP, would the transfer be ABORTed ? I don't think you can switch to BUSY in the middle (but I'm not 100% sure). 
I don't know how this limit is being used exactly, but it might be best if the RMPP receive were treated as 1 MAD regardless of how many segments it was. -- Hal > > Couldn't this be controlled on the request side (assuming > > the request has a response as opposed to unsolicited sends/receives) ? > > Sounds like the wrong thing to do. From halr at voltaire.com Tue Jun 13 03:15:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 06:15:34 -0400 Subject: [openib-general] opensm and NPTL In-Reply-To: <4df28be40606122016t35a7a6d0s57f218dfea344283@mail.gmail.com> References: <4df28be40606122016t35a7a6d0s57f218dfea344283@mail.gmail.com> Message-ID: <1150193732.570.138496.camel@hal.voltaire.com> Hi Viswa, On Mon, 2006-06-12 at 23:16, Viswanath Krishnamurthy wrote: > There were some issues with opensm running with NPTL (thread > library). Has the issues been resolved ? There were some fixes to the signal handling which went in back in the Feb/early March time frame. OpenSM should be better with NPTL now. Is it working for you or are you asking before stepping into these waters again ? -- Hal > Regards, > Viswa > > > > ______________________________________________________________________ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From eitan at mellanox.co.il Tue Jun 13 03:44:57 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 13 Jun 2006 13:44:57 +0300 Subject: [openib-general] [PATCH] osm: Provide SUBNET UP message every heavy sweep Message-ID: <86r71tgwra.fsf@mtl066.yok.mtl.com> Hi Hal This trivial patch provides a "SUBNET UP" message (with level INFO) every time the SM completes a full heavy sweep. It is most useful for cases where you want to make sure the SM responded to some change in the fabric.
Also used to sync the various test flows to the end of sweeps. Eitan Signed-off-by: Eitan Zahavi Index: opensm/osm_state_mgr.c =================================================================== --- opensm/osm_state_mgr.c (revision 7904) +++ opensm/osm_state_mgr.c (working copy) @@ -199,6 +199,8 @@ __osm_state_mgr_up_msg( osm_log( p_mgr->p_log, OSM_LOG_SYS, "SUBNET UP\n" ); /* Format Waived */ /* clear the signal */ p_mgr->p_subn->moved_to_master_state = FALSE; + } else { + osm_log( p_mgr->p_log, OSM_LOG_INFO, "SUBNET UP\n" ); /* Format Waived */ } if( p_mgr->p_subn->opt.sweep_interval ) From moshek at voltaire.com Tue Jun 13 03:51:11 2006 From: moshek at voltaire.com (Moshe Kazir) Date: Tue, 13 Jun 2006 13:51:11 +0300 Subject: [openib-general] OFED-RC4 backport to sles9 sp3 kernel 2.6.5-7.244 Message-ID: The enclosed diff file includes the sles9 sp3 backport changes. I performed the work on RC4. Changes description: - openib-1.0/configure - changed to enable make in kernel 2.6.5 - build_env.sh - changed to support gcc 3.3.3 (the current compiler version) - I created a new directory -> patches/2.6.5-7.244/ - copied all the patches from patches/2.6.9 to the newly created directory. - all the patches created by me are *_6922_to_2_6_5-7.244.patch . Limitations / known bugs: - I checked the work with build.sh / install.sh of the basic package. - ib_ipath was changed to enable compilation; it does not pass insmod as some entry points are not resolved. - As a result of the ipath problem, after /etc/init.d/openibd start you'll get -> [fail], but all the modules are in place and working. - ipoib and the ibv_* programs are working o.k. I performed very short testing on x86_64 and em64t. Moshe ____________________________________________________________ Moshe Katzir | +972-9971-8639 (o) | +972-52-860-6042 (m) Voltaire - The Grid Backbone www.voltaire.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: OFED-1.0-rc4.backport_to_sles9_sp3.patch Type: application/octet-stream Size: 162278 bytes Desc: OFED-1.0-rc4.backport_to_sles9_sp3.patch URL: From tziporet at mellanox.co.il Tue Jun 13 04:21:47 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 13 Jun 2006 14:21:47 +0300 Subject: [openib-general] [Bug 126] RDMA_CM and UCM not loaded on boot Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA71DB@mtlexch01.mtl.com> It's not in the default, since CM and CMA are not defined as basic HPC components (basic components are only mthca, ipath, core and ipoib). Thus anyone who wants these modules should change the file /etc/infiniband/openib.conf Tziporet -----Original Message----- From: Arlin Davis [mailto:ardavis at ichips.intel.com] Sent: Monday, June 12, 2006 8:30 PM To: Tziporet Koren Cc: openib; Woodruff, Robert J Subject: Re: [openib-general] [Bug 126] RDMA_CM and UCM not loaded on boot bugzilla-daemon at openib.org wrote: Did the default openib.conf script get updated with: RDMA_CM_LOAD=yes RDMA_UCM_LOAD=yes -arlin From halr at voltaire.com Tue Jun 13 04:12:20 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 07:12:20 -0400 Subject: [openib-general] [PATCH] osm: Provide SUBNET UP message every heavy sweep In-Reply-To: <86r71tgwra.fsf@mtl066.yok.mtl.com> References: <86r71tgwra.fsf@mtl066.yok.mtl.com> Message-ID: <1150197138.570.140615.camel@hal.voltaire.com> Hi Eitan, On Tue, 2006-06-13 at 06:44, Eitan Zahavi wrote: > Hi Hal > > This trivial patch provides a "SUBNET UP" message (with level INFO) > every time the SM completes a full heavy sweep. It is most useful for > cases where you want to make sure teh SM responded to some change in > the fabric. Also used to sync the various test flows to the end of sweeps. > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied to trunk only. 
-- Hal From tziporet at mellanox.co.il Tue Jun 13 04:36:29 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 13 Jun 2006 14:36:29 +0300 Subject: [openib-general] OFED-RC4 backport to sles9 sp3 kernel 2.6.5-7.244 In-Reply-To: References: Message-ID: <448EA33D.10800@mellanox.co.il> Moshe Kazir wrote: > The enclosed diff file include sles9 sp3 backporrt changes. > > Great, you did it! You understand we cannot include it in OFED 1.0, since it should be out this week, but we can add it to OFED 1.1, which is due in July. Tziporet From eitan at mellanox.co.il Tue Jun 13 04:30:37 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 13 Jun 2006 14:30:37 +0300 Subject: [openib-general] [PATCH 3/4] New routing module which loads LFT tables from dump file. In-Reply-To: <20060611003243.22430.56582.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003243.22430.56582.stgit@sashak.voltaire.com> Message-ID: <448EA1DD.7090204@mellanox.co.il> Hi Sasha, Please see my comments inside Sasha Khapyorsky wrote: > This patch implements trivial routing module which able to load LFT > tables from dump file. Main features: > - support for unicast LFTs only, support for multicast can be added later > - this will run after min hop matrix calculation > - this will load switch LFTs according to the path entries introduced in > the dump file > - no additional checks will be performed (like is port connected, etc) > - in case when fabric LIDs were changed this will try to reconstruct LFTs > correctly if endport GUIDs are represented in the dump file (in order > to disable this GUIDs may be removed from the dump file or zeroed) I think you could use the concept of directed routes for storing the LIDs too. So in case of new LID assignments you can extract the old -> new mapping by scanning the LIDs of end ports by their DR path. 
Anyway, I think it is required that you also perform topology matching such that if someone changed the topology you are able to figure it out and stop. THIS IS A SERIOUS LIMITATION OF YOUR PROPOSAL. > > The dump file format is compatible with output of 'ibroute' util and for > whole fabric may be generated with script like this: > > for sw_lid in `ibswitches | awk '{print $NF}'` ; do > ibroute $sw_lid > done > /path/to/dump_file > > , or using DR paths: > > > for sw_dr in `ibnetdiscover -v \ > | sed -ne '/^DR path .* switch /s/^DR path \[\(.*\)\].*$/\1/p' \ > | sed -e 's/\]\[/,/g' \ > | sort -u` ; do > ibroute -D ${sw_dr} > done > /path/to/dump_file WE SHOULD ALSO PROVIDE A DUMP FILE VIA: 1. OpenSM should dump its routes using this format (like it does today using osm.fdbs) 2. ibdiagnet > > > > diff --git a/osm/include/opensm/osm_subnet.h b/osm/include/opensm/osm_subnet.h > index a637367..ec1d056 100644 > --- a/osm/include/opensm/osm_subnet.h > +++ b/osm/include/opensm/osm_subnet.h > @@ -423,6 +424,10 @@ typedef struct _osm_subn_opt > * routing_engine_name > * Name of used routing engine (other than default Min Hop Algorithm) > * > +* ucast_dump_file > +* Name of the unicast routing dump file from where switch > +* forwearding tables will be loaded ^^^^^^^^^^^ forwarding > +* > * updn_guid_file > * Pointer to name of the UPDN guid file given by User > * > > diff --git a/osm/opensm/osm_ucast_file.c b/osm/opensm/osm_ucast_file.c > new file mode 100644 > index 0000000..a68d9ec > --- /dev/null > +++ b/osm/opensm/osm_ucast_file.c > @@ -0,0 +1,258 @@ > +/* > + * Copyright (c) 2006 Voltaire, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. 
You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + * $Id$ > + */ > + > +/* > + * Abstract: > + * Implementation of OpenSM unicast routing module which loads > + * routes from the dump file > + * > + * Environment: > + * Linux User Mode > + * > + */ > + > +#if HAVE_CONFIG_H > +# include > +#endif /* HAVE_CONFIG_H */ > + > +#include > +#include > +#include > + > +#include > +#include > +#include > +#include > +#include > + > +#define PARSEERR(log, file_name, lineno, fmt, arg...) \ > + osm_log(log, OSM_LOG_ERROR, "PARSE ERROR: %s:%u: " fmt , \ > + file_name, lineno, ##arg ) > + > +#define PARSEWARN(log, file_name, lineno, fmt, arg...) 
\ > + osm_log(log, OSM_LOG_VERBOSE, "PARSE WARN: %s:%u: " fmt , \ > + file_name, lineno, ##arg ) > + > +static uint16_t remap_lid(osm_opensm_t *p_osm, uint16_t lid, ib_net64_t guid) > +{ > + osm_port_t *p_port; > + uint16_t min_lid, max_lid; > + uint8_t lmc; > + > + p_port = (osm_port_t *)cl_qmap_get(&p_osm->subn.port_guid_tbl, guid); > + if (!p_port || > + p_port == (osm_port_t *)cl_qmap_end(&p_osm->subn.port_guid_tbl)) { > + osm_log(&p_osm->log, OSM_LOG_VERBOSE, > + "remap_lid: cannot find port guid 0x%016" PRIx64 > + " , will use the same lid.\n", cl_ntoh64(guid)); > + return lid; > + } > + > + osm_port_get_lid_range_ho(p_port, &min_lid, &max_lid); > + if (min_lid <= lid && lid <= max_lid) > + return lid; > + > + lmc = osm_port_get_lmc(p_port); > + return min_lid + (lid & ((1 << lmc) - 1)); > +} > + > +static void add_path(osm_opensm_t * p_osm, > + osm_switch_t * p_sw, uint16_t lid, uint8_t port_num, > + ib_net64_t port_guid) > +{ > + uint16_t new_lid; > + uint8_t old_port; > + > + new_lid = port_guid ? 
remap_lid(p_osm, lid, port_guid) : lid; > + old_port = osm_fwd_tbl_get(osm_switch_get_fwd_tbl_ptr(p_sw), new_lid); > + if (old_port != OSM_NO_PATH && old_port != port_num) { > + osm_log(&p_osm->log, OSM_LOG_VERBOSE, > + "add_path: LID collision is detected on switch " > + "0x016%" PRIx64 ", will overwrite LID 0x%x entry.\n", > + cl_ntoh64(osm_node_get_node_guid > + (osm_switch_get_node_ptr(p_sw))), new_lid); > + } > + > + osm_switch_set_path(p_sw, new_lid, port_num, TRUE); > + > + osm_log(&p_osm->log, OSM_LOG_DEBUG, > + "add_path: route 0x%04x(was 0x%04x) %u 0x%016" PRIx64 > + " is added to switch 0x%016" PRIx64 "\n", > + new_lid, lid, port_num, cl_ntoh64(port_guid), > + cl_ntoh64(osm_node_get_node_guid > + (osm_switch_get_node_ptr(p_sw)))); > +} > + > +static void clean_sw_fwd_table(void *arg, void *context) > +{ > + osm_switch_t *p_sw = arg; > + uint16_t lid, max_lid; > + > + max_lid = osm_switch_get_max_lid_ho(p_sw); > + for (lid = 1 ; lid <= max_lid ; lid++) > + osm_switch_set_path(p_sw, lid, OSM_NO_PATH, TRUE); > +} > + > +static int do_ucast_file_load(void *context) > +{ > + char line[1024]; > + char *file_name; > + FILE *file; > + ib_net64_t sw_guid, port_guid; > + osm_opensm_t *p_osm = context; > + osm_switch_t *p_sw; > + uint16_t lid; > + uint8_t port_num; > + unsigned lineno; > + > + file_name = p_osm->subn.opt.ucast_dump_file; > + > + if (!file_name) { > + osm_log(&p_osm->log, OSM_LOG_ERROR, > + "do_ucast_file_load: " > + "ucast dump file name is not defined.\n"); > + return -1; > + } > + > + file = fopen(file_name, "r"); > + if (!file) { > + osm_log(&p_osm->log, OSM_LOG_ERROR, > + "do_ucast_file_load: " > + "cannot open ucast dump file \'%s\'\n", file_name); > + return -1; > + } > + > + cl_qmap_apply_func(&p_osm->subn.sw_guid_tbl, clean_sw_fwd_table, NULL); > + > + lineno = 0; > + p_sw = NULL; > + > + while (fgets(line, sizeof(line) - 1, file) != NULL) { > + char *p, *q; > + lineno++; > + > + p = line; > + while (isspace(*p)) > + p++; > + > + if (*p == 
'#') > + continue; > + > + if (!strncmp(p, "Multicast mlids", 15)) { > + osm_log(&p_osm->log, OSM_LOG_ERROR, > + "do_ucast_file_load: " > + "Multicast dump file is detected. " > + "Skip parsing.\n"); > + } > + else if (!strncmp(p, "Unicast lids", 12)) { > + q = strstr(p, " guid 0x"); > + if (!q) { > + PARSEERR(&p_osm->log, file_name, lineno, > + "cannot parse switch definition\n"); > + return -1; > + } > + p = q + 6; > + sw_guid = strtoll(p, &q, 16); > + if (q && !isspace(*q)) { > + PARSEERR(&p_osm->log, file_name, lineno, > + "cannot parse switch guid: \'%s\'\n", > + p); > + return -1; > + } > + sw_guid = cl_hton64(sw_guid); > + > + p_sw = (osm_switch_t *)cl_qmap_get(&p_osm->subn.sw_guid_tbl, > + sw_guid); > + if (!p_sw || > + p_sw == (osm_switch_t *)cl_qmap_end(&p_osm->subn.sw_guid_tbl)) { > + p_sw = NULL; > + osm_log(&p_osm->log, OSM_LOG_VERBOSE, > + "do_ucast_file_load: " > + "cannot find switch %016" PRIx64 ".\n", > + cl_ntoh64(sw_guid)); > + continue; > + } > + } > + else if (p_sw && !strncmp(p, "0x", 2)) { > + lid = strtoul(p, &q, 16); > + if (q && !isspace(*q)) { > + PARSEERR(&p_osm->log, file_name, lineno, > + "cannot parse lid: \'%s\'\n", p); > + return -1; > + } > + p = q; > + while (isspace(*p)) > + p++; > + port_num = strtoul(p, &q, 10); > + if (q && !isspace(*q)) { > + PARSEERR(&p_osm->log, file_name, lineno, > + "cannot parse port: \'%s\'\n", p); > + return -1; > + } > + p = q; > + /* additionally try to exract guid */ > + q = strstr(p, " portguid 0x"); > + if (!q) { > + PARSEWARN(&p_osm->log, file_name, lineno, > + "cannot find port guid " > + "(maybe broken dump): \'%s\'\n", p); > + port_guid = 0; > + } > + else { > + p = q + 10; > + port_guid = strtoll(p, &q, 16); > + if (!q && !isspace(*q) && *q != ':') { > + PARSEWARN(&p_osm->log, file_name, > + lineno, > + "cannot parse port guid " > + "(maybe broken dump): " > + "\'%s\'\n", p); > + port_guid = 0; > + } > + } > + port_guid = cl_hton64(port_guid); > + add_path(p_osm, p_sw, lid, port_num, 
port_guid); > + } > + } > + > + fclose(file); > + return 0; > +} In OpenSM we write with this style: if () { } else if () { } else { }, not any other combination. > + > +int osm_ucast_file_setup(osm_opensm_t * p_osm) > +{ > + p_osm->routing_engine.context = (void *)p_osm; > + p_osm->routing_engine.ucast_build_fwd_tables = do_ucast_file_load; > + return 0; > +} From eitan at mellanox.co.il Tue Jun 13 04:55:13 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 13 Jun 2006 14:55:13 +0300 Subject: [openib-general] [PATCH 2/4] Modular routing engine (unicast only yet). In-Reply-To: <20060611003240.22430.88414.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003240.22430.88414.stgit@sashak.voltaire.com> Message-ID: <448EA7A1.8060206@mellanox.co.il> Hi Sasha, As noted in my previous patch 1/4 comments, I think the callbacks should also have an entry for the MinHop stage (maybe this is the ucast_build_fwd_tables?) I have some algorithms in mind that will skip that stage altogether. Also it might make sense for each routing engine to provide its own "dump" routine such that each could support a different file format if needed. The rest of the comments are inline EZ Sasha Khapyorsky wrote: > > diff --git a/osm/include/opensm/osm_opensm.h b/osm/include/opensm/osm_opensm.h > index 3235ad4..3e6e120 100644 > --- a/osm/include/opensm/osm_opensm.h > +++ b/osm/include/opensm/osm_opensm.h > @@ -92,6 +92,18 @@ BEGIN_C_DECLS > * > *********/ > > +/* > + * routing engine structure - yet limited by ucast_fdb_assign and > + * ucast_build_fwd_tables (multicast callbacks may be added later) > + */ > +struct osm_routing_engine { > + const char *name; > + void *context; > + int (*ucast_build_fwd_tables)(void *context); > + int (*ucast_fdb_assign)(void *context); > + void (*delete)(void *context); > +}; It would be nice if you added a standard header to this struct. 
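For illustration only, a documentation header for this struct in the style used elsewhere in OpenSM might look like the sketch below; the callback descriptions are my reading of the patch, not Sasha's wording:

```c
#include <assert.h>

/****s* OpenSM: OpenSM/osm_routing_engine
* NAME
*	osm_routing_engine
*
* DESCRIPTION
*	Pluggable unicast routing engine. ucast_build_fwd_tables is
*	invoked once per sweep to fill all switch forwarding tables;
*	ucast_fdb_assign is invoked per port to choose an output port.
*	Both return 0 on success. delete releases the engine context.
*
* SYNOPSIS
*/
struct osm_routing_engine {
	const char *name;
	void *context;
	int (*ucast_build_fwd_tables)(void *context);
	int (*ucast_fdb_assign)(void *context);
	void (*delete)(void *context);
};
```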
It is not clear to me what ucast_build_fwd_tables and ucast_fdb_assign are mapping to. Please see the next section as an example for a struct header. > + > /****s* OpenSM: OpenSM/osm_opensm_t > * NAME > * osm_opensm_t > @@ -116,7 +128,7 @@ typedef struct _osm_opensm_t > osm_log_t log; > cl_dispatcher_t disp; > cl_plock_t lock; > - updn_t *p_updn_ucast_routing; > + struct osm_routing_engine routing_engine; > osm_stats_t stats; > } osm_opensm_t; > /* > @@ -153,6 +165,9 @@ typedef struct _osm_opensm_t > * lock > * Shared lock guarding most OpenSM structures. > * > +* routing_engine > +* Routing engine, will be initialized then used > +* > * stats > * Open SM statistics block > * > diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c > index cac7f9b..0c0d635 100644 > --- a/osm/opensm/osm_ucast_mgr.c > +++ b/osm/opensm/osm_ucast_mgr.c > @@ -62,6 +62,7 @@ #include > #include > #include > #include > +#include > > #define LINE_LENGTH 256 > > @@ -269,7 +270,7 @@ osm_ucast_mgr_dump_ucast_routes( > strcat( p_mgr->p_report_buf, "yes" ); > else > { > - if (p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign) { > + if (p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) { > ui_ucast_fdb_assign_func_defined = TRUE; > } else { > ui_ucast_fdb_assign_func_defined = FALSE; > @@ -708,7 +709,7 @@ __osm_ucast_mgr_process_port( > node_guid = osm_node_get_node_guid(osm_switch_get_node_ptr( p_sw ) ); > > /* Flag to mark whether or not a ui ucast fdb assign function was given */ > - if (p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign) > + if (p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) > ui_ucast_fdb_assign_func_defined = TRUE; > else > ui_ucast_fdb_assign_func_defined = FALSE; > @@ -753,7 +754,7 @@ __osm_ucast_mgr_process_port( > > /* Up/Down routing can cause unreachable routes between some > switches so we do not report that as an error in that case */ > - if (!p_mgr->p_subn->opt.updn_activate) > + if (!p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) > { > osm_log( 
p_mgr->p_log, OSM_LOG_ERROR, > "__osm_ucast_mgr_process_port: ERR 3A08: " > @@ -973,6 +974,18 @@ __osm_ucast_mgr_process_tbl( > /********************************************************************** > **********************************************************************/ > static void > +__osm_ucast_mgr_set_table_cb( > + IN cl_map_item_t* const p_map_item, > + IN void* context ) > +{ > + osm_switch_t* const p_sw = (osm_switch_t*)p_map_item; > + osm_ucast_mgr_t* const p_mgr = (osm_ucast_mgr_t*)context; > + __osm_ucast_mgr_set_table( p_mgr, p_sw ); > +} > + > +/********************************************************************** > + **********************************************************************/ > +static void > __osm_ucast_mgr_process_neighbors( > IN cl_map_item_t* const p_map_item, > IN void* context ) > @@ -1058,12 +1071,14 @@ osm_ucast_mgr_process( > { > uint32_t i; > uint32_t iteration_max; > + struct osm_routing_engine *p_routing_eng; > osm_signal_t signal; > cl_qmap_t *p_sw_guid_tbl; > > OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_process ); > > p_sw_guid_tbl = &p_mgr->p_subn->sw_guid_tbl; > + p_routing_eng = &p_mgr->p_subn->p_osm->routing_engine; > > CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); > > @@ -1129,6 +1144,14 @@ osm_ucast_mgr_process( > i > ); > > + if (p_routing_eng->ucast_build_fwd_tables && > + p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == 0) > + { > + cl_qmap_apply_func( p_sw_guid_tbl, > + __osm_ucast_mgr_set_table_cb, p_mgr ); > + } /* fallback on the regular path in case of failures */ > + else > + { Please explain why this step is needed and why if the routing engine function is returning 0 you still invoke the standard __osm_ucast_mgr_set_table_cb. > /* > This is the place where we can load pre-defined routes > into the switches fwd_tbl structures. 
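On the question of why __osm_ucast_mgr_set_table_cb still runs when the engine returns 0: as I read the patch, a zero return means the engine has already filled the in-memory fwd_tbl structures, so only the step that pushes them to the switches remains; any other return falls back to the regular MinHop path. A minimal sketch of that control flow (all names are illustrative stand-ins):

```c
#include <assert.h>
#include <stddef.h>

typedef int (*build_fn)(void *context);

/* If a routing engine is plugged in and succeeds (returns 0), only the
 * "push precomputed tables to switches" step runs; otherwise fall back
 * to the default MinHop computation, which pushes tables itself. */
static int route_with_fallback(build_fn engine, void *ctx,
                               int (*push_tables)(void),
                               int (*default_minhop)(void))
{
    if (engine && engine(ctx) == 0)
        return push_tables();
    return default_minhop();
}

/* Stubs for demonstration: return values mark which path was taken. */
static int demo_engine_ok(void *ctx)   { (void)ctx; return 0;  }
static int demo_engine_fail(void *ctx) { (void)ctx; return -1; }
static int demo_push(void)             { return 1; }
static int demo_minhop(void)           { return 2; }
```

The fallback keeps the SM functional even when a dump file is missing or unparsable, at the cost of silently routing differently than the operator asked for.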
From eitan at mellanox.co.il Tue Jun 13 05:03:31 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 13 Jun 2006 15:03:31 +0300 Subject: [openib-general] [PATCH 1/4] Simplification of the ucast fdb dumps. In-Reply-To: <20060611003238.22430.62423.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003238.22430.62423.stgit@sashak.voltaire.com> Message-ID: <448EA993.6010000@mellanox.co.il> Hi Sasha, I still need to see if there are no real problematic changes in the osm.fdbs file syntax (need to update ibdm to support those) but I like the patch and the clean way you resolved the multiple opens of the dump file. EZ From eitan at mellanox.co.il Tue Jun 13 05:39:15 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 13 Jun 2006 15:39:15 +0300 Subject: [openib-general] [PATCH] osm: Provide SUBNET UP message every heavy sweep - resend Message-ID: <86pshdgrgs.fsf@mtl066.yok.mtl.com> Hi Hal Sorry about the previous patch - I got the } else { in it. This trivial patch provides a "SUBNET UP" message (with level INFO) every time the SM completes a full heavy sweep. It is most useful for cases where you want to make sure the SM responded to some change in the fabric. Also used to sync the various test flows to the end of sweeps. 
Eitan Signed-off-by: Eitan Zahavi Index: opensm/osm_state_mgr.c =================================================================== --- opensm/osm_state_mgr.c (revision 7904) +++ opensm/osm_state_mgr.c (working copy) @@ -200,6 +200,10 @@ __osm_state_mgr_up_msg( /* clear the signal */ p_mgr->p_subn->moved_to_master_state = FALSE; } + else + { + osm_log( p_mgr->p_log, OSM_LOG_INFO, "SUBNET UP\n" ); /* Format Waived */ + } if( p_mgr->p_subn->opt.sweep_interval ) { From eitan at mellanox.co.il Tue Jun 13 05:54:15 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 13 Jun 2006 15:54:15 +0300 Subject: [openib-general] [PATCH] osm: partition manager force policy Message-ID: <86odwxgqrs.fsf@mtl066.yok.mtl.com> Hi Hal This is a second take after debug and cleanup of the partition manager patch I have previously provided. The functionality is the same but this one is after 2 days of testing on the simulator. I also did some code restructuring for clarity. Tests passed were both dedicated pkey enforcements (pkey.*) and stress test (osmStress.*). As I started to test the partition manager code (using ibmgtsim pkey test), I realized the implementation does not really enforce the partition policy on the given fabric. This patch fixes that. It was verified using the simulation test. Several other corner cases were fixed too. Eitan Signed-off-by: Eitan Zahavi Index: include/opensm/osm_port.h =================================================================== --- include/opensm/osm_port.h (revision 7867) +++ include/opensm/osm_port.h (working copy) @@ -586,6 +586,39 @@ osm_physp_get_pkey_tbl( IN const osm_phy * Port, Physical Port *********/ +/****f* OpenSM: Physical Port/osm_physp_get_mod_pkey_tbl +* NAME +* osm_physp_get_mod_pkey_tbl +* +* DESCRIPTION +* Returns a NON CONST pointer to the P_Key table object of the Physical Port object. 
+* +* SYNOPSIS +*/ +static inline osm_pkey_tbl_t * +osm_physp_get_mod_pkey_tbl( IN osm_physp_t* const p_physp ) +{ + CL_ASSERT( osm_physp_is_valid( p_physp ) ); + /* + (14.2.5.7) - the block number valid values are 0-2047, and are further + limited by the size of the P_Key table specified by the PartitionCap on the node. + */ + return( &p_physp->pkeys ); +}; +/* +* PARAMETERS +* p_physp +* [in] Pointer to an osm_physp_t object. +* +* RETURN VALUES +* The pointer to the P_Key table object. +* +* NOTES +* +* SEE ALSO +* Port, Physical Port +*********/ + /****f* OpenSM: Physical Port/osm_physp_set_slvl_tbl * NAME * osm_physp_set_slvl_tbl Index: include/opensm/osm_pkey.h =================================================================== --- include/opensm/osm_pkey.h (revision 7867) +++ include/opensm/osm_pkey.h (working copy) @@ -92,6 +92,9 @@ typedef struct _osm_pkey_tbl cl_ptr_vector_t blocks; cl_ptr_vector_t new_blocks; cl_map_t keys; + cl_qlist_t pending; + uint16_t used_blocks; + uint16_t max_blocks; } osm_pkey_tbl_t; /* * FIELDS @@ -104,6 +107,18 @@ typedef struct _osm_pkey_tbl * keys * A set holding all keys * +* pending +* A list osm_pending_pkey structs that is temporarily set by the +* pkey mgr and used during pkey mgr algorithm only +* +* used_blocks +* Tracks the number of blocks having non-zero pkeys +* +* max_blocks +* The maximal number of blocks this partition table might hold +* this value is based on node_info (for port 0 or CA) or switch_info +* updated on receiving the node_info or switch_info GetResp +* * NOTES * 'blocks' vector should be used to store pkey values obtained from * the port and SM pkey manager should not change it directly, for this @@ -114,6 +129,39 @@ typedef struct _osm_pkey_tbl * *********/ +/****s* OpenSM: osm_pending_pkey_t +* NAME +* osm_pending_pkey_t +* +* DESCRIPTION +* This objects stores temporary information on pkeys their target block and index +* during the pkey manager operation +* +* SYNOPSIS +*/ +typedef struct 
_osm_pending_pkey { + cl_list_item_t list_item; + uint16_t pkey; + uint32_t block; + uint8_t index; + boolean_t is_new; +} osm_pending_pkey_t; +/* +* FIELDS +* pkey +* The actual P_Key +* +* block +* The block index based on the previous table extracted from the device +* +* index +* The index of the pky within the block +* +* is_new +* TRUE for new P_Keys such that the block and index are invalid in that case +* +*********/ + /****f* OpenSM: osm_pkey_tbl_construct * NAME * osm_pkey_tbl_construct @@ -209,8 +257,8 @@ osm_pkey_tbl_get_num_blocks( static inline ib_pkey_table_t *osm_pkey_tbl_block_get( const osm_pkey_tbl_t *p_pkey_tbl, uint16_t block) { - CL_ASSERT(block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)); - return(cl_ptr_vector_get(&p_pkey_tbl->blocks, block)); + return( (block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)) ? + cl_ptr_vector_get(&p_pkey_tbl->blocks, block) : NULL); }; /* * p_pkey_tbl @@ -244,6 +292,106 @@ static inline ib_pkey_table_t *osm_pkey_ /* *********/ + +/****f* OpenSM: osm_pkey_tbl_make_block_pair +* NAME +* osm_pkey_tbl_make_block_pair +* +* DESCRIPTION +* Find or create a pair of "old" and "new" blocks for the +* given block index +* +* SYNOPSIS +*/ +int osm_pkey_tbl_make_block_pair( + osm_pkey_tbl_t *p_pkey_tbl, + uint16_t block_idx, + ib_pkey_table_t **pp_old_block, + ib_pkey_table_t **pp_new_block); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* block_idx +* [in] The block index to use +* +* pp_old_block +* [out] Pointer to the old block pointer arg +* +* pp_new_block +* [out] Pointer to the new block pointer arg +* +* RETURN VALUES +* 0 if OK 1 if failed +* +*********/ + +/****f* OpenSM: osm_pkey_tbl_set_new_entry +* NAME +* osm_pkey_tbl_set_new_entry +* +* DESCRIPTION +* stores the given pkey in the "new" blocks array and update +* the "map" to show that on the "old" blocks +* +* SYNOPSIS +*/ +int +osm_pkey_tbl_set_new_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t block_idx, + IN uint8_t pkey_idx, + IN 
uint16_t pkey); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* block_idx +* [in] The block index to use +* +* pkey_idx +* [in] The index within the block +* +* pkey +* [in] PKey to store +* +* RETURN VALUES +* 0 if OK 1 if failed +* +*********/ + +/****f* OpenSM: osm_pkey_find_next_free_entry +* NAME +* osm_pkey_find_next_free_entry +* +* DESCRIPTION +* Find the next free entry in the PKey table. Starting at the given +* index and block number. The user should increment pkey_idx before +* next call +* Inspect the "new" blocks array for empty space. +* +* SYNOPSIS +*/ +boolean_t +osm_pkey_find_next_free_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + OUT uint16_t *p_block_idx, + OUT uint8_t *p_pkey_idx); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* p_block_idx +* [out] The block index to use +* +* p_pkey_idx +* [out] The index within the block to use +* +* RETURN VALUES +* TRUE if found FALSE if did not find +* +*********/ + /****f* OpenSM: osm_pkey_tbl_sync_new_blocks * NAME * osm_pkey_tbl_sync_new_blocks @@ -263,9 +411,44 @@ void osm_pkey_tbl_sync_new_blocks( * *********/ +/****f* OpenSM: osm_pkey_tbl_get_block_and_idx +* NAME +* osm_pkey_tbl_get_block_and_idx +* +* DESCRIPTION +* set the block index and pkey index the given +* pkey is found in. return 1 if cound not find +* it, 0 if OK +* +* SYNOPSIS +*/ +int +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *block_idx, + OUT uint8_t *pkey_index); +/* +* p_pkey_tbl +* [in] Pointer to osm_pkey_tbl_t object. 
+* +* p_pkey +* [in] Pointer to the P_Key entry searched +* +* p_block_idx +* [out] Pointer to the block index to be updated +* +* p_pkey_idx +* [out] Pointer to the pkey index (in the block) to be updated +* +* +* NOTES +* +*********/ + /****f* OpenSM: osm_pkey_tbl_set * NAME * osm_pkey_tbl_set Index: opensm/osm_pkey.c =================================================================== --- opensm/osm_pkey.c (revision 7904) +++ opensm/osm_pkey.c (working copy) @@ -100,6 +100,9 @@ int osm_pkey_tbl_init( cl_ptr_vector_init( &p_pkey_tbl->blocks, 0, 1); cl_ptr_vector_init( &p_pkey_tbl->new_blocks, 0, 1); cl_map_init( &p_pkey_tbl->keys, 1 ); + cl_qlist_init( &p_pkey_tbl->pending ); + p_pkey_tbl->used_blocks = 0; + p_pkey_tbl->max_blocks = 0; return(IB_SUCCESS); } @@ -118,14 +121,29 @@ void osm_pkey_tbl_sync_new_blocks( p_block = cl_ptr_vector_get(&p_pkey_tbl->blocks, b); if ( b < new_blocks ) p_new_block = cl_ptr_vector_get(&p_pkey_tbl->new_blocks, b); - else { + else + { p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); if (!p_new_block) break; + cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, + b, p_new_block); + } + memset(p_new_block, 0, sizeof(*p_new_block)); - cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, p_new_block); } - memcpy(p_new_block, p_block, sizeof(*p_new_block)); +} + +/********************************************************************** + **********************************************************************/ +void osm_pkey_tbl_cleanup_pending( + IN osm_pkey_tbl_t *p_pkey_tbl) +{ + cl_list_item_t *p_item; + p_item = cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_item != cl_qlist_end( &p_pkey_tbl->pending ) ) + { + free( (osm_pending_pkey_t *)p_item ); } } @@ -202,6 +220,138 @@ int osm_pkey_tbl_set( /********************************************************************** **********************************************************************/ +int osm_pkey_tbl_make_block_pair( + osm_pkey_tbl_t 
*p_pkey_tbl, + uint16_t block_idx, + ib_pkey_table_t **pp_old_block, + ib_pkey_table_t **pp_new_block) +{ + if (block_idx >= p_pkey_tbl->max_blocks) return 1; + + if (pp_old_block) + { + *pp_old_block = osm_pkey_tbl_block_get( p_pkey_tbl, block_idx ); + if (! *pp_old_block) + { + *pp_old_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); + if (!*pp_old_block) return 1; + memset(*pp_old_block, 0, sizeof(ib_pkey_table_t)); + cl_ptr_vector_set(&p_pkey_tbl->blocks, block_idx, *pp_old_block); + } + } + + if (pp_new_block) + { + *pp_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_idx ); + if (! *pp_new_block) + { + *pp_new_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); + if (!*pp_new_block) return 1; + memset(*pp_new_block, 0, sizeof(ib_pkey_table_t)); + cl_ptr_vector_set(&p_pkey_tbl->new_blocks, block_idx, *pp_new_block); + } + } + return 0; +} + +/********************************************************************** + **********************************************************************/ +/* + store the given pkey in the "new" blocks array and update the "map" + to show that on the "old" blocks +*/ +int +osm_pkey_tbl_set_new_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t block_idx, + IN uint8_t pkey_idx, + IN uint16_t pkey) +{ + ib_pkey_table_t *p_old_block; + ib_pkey_table_t *p_new_block; + + if (osm_pkey_tbl_make_block_pair( + p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) + return 1; + + cl_map_insert( &p_pkey_tbl->keys, + ib_pkey_get_base(pkey), + &(p_old_block->pkey_entry[pkey_idx])); + p_new_block->pkey_entry[pkey_idx] = pkey; + if (p_pkey_tbl->used_blocks < block_idx) + p_pkey_tbl->used_blocks = block_idx; + + return 0; +} + +/********************************************************************** + **********************************************************************/ +boolean_t +osm_pkey_find_next_free_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + OUT uint16_t *p_block_idx, + OUT uint8_t *p_pkey_idx) +{ + 
ib_pkey_table_t *p_new_block; + + CL_ASSERT(p_block_idx); + CL_ASSERT(p_pkey_idx); + + while ( *p_block_idx < p_pkey_tbl->max_blocks) + { + if (*p_pkey_idx > IB_NUM_PKEY_ELEMENTS_IN_BLOCK - 1) + { + *p_pkey_idx = 0; + (*p_block_idx)++; + if (*p_block_idx >= p_pkey_tbl->max_blocks) + return FALSE; + } + + p_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, *p_block_idx); + + if ( !p_new_block || + ib_pkey_is_invalid(p_new_block->pkey_entry[*p_pkey_idx])) + return TRUE; + else + (*p_pkey_idx)++; + } + return FALSE; +} + +/********************************************************************** + **********************************************************************/ +int +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *p_block_idx, + OUT uint8_t *p_pkey_index) +{ + uint32_t num_of_blocks; + uint32_t block_index; + ib_pkey_table_t *block; + + CL_ASSERT( p_pkey_tbl ); + CL_ASSERT( p_block_idx != NULL ); + CL_ASSERT( p_pkey_index != NULL ); + + num_of_blocks = cl_ptr_vector_get_size( &p_pkey_tbl->blocks); + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + { + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + if ( block && ( block->pkey_entry <= p_pkey ) && + ( p_pkey < block->pkey_entry + IB_NUM_PKEY_ELEMENTS_IN_BLOCK)) + { + *p_block_idx = block_index; + *p_pkey_index = p_pkey - block->pkey_entry; + return 0; + } + } + return 1; +} + +/********************************************************************** + **********************************************************************/ static boolean_t __osm_match_pkey ( IN const ib_net16_t *pkey1, IN const ib_net16_t *pkey2 ) { @@ -305,7 +455,8 @@ osm_physp_share_pkey( if (cl_is_map_empty(&pkey_tbl1->keys) || cl_is_map_empty(&pkey_tbl2->keys)) return TRUE; - return !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); + return + !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); }
/********************************************************************** @@ -321,7 +472,8 @@ osm_port_share_pkey( OSM_LOG_ENTER( p_log, osm_port_share_pkey ); - if (!p_port_1 || !p_port_2) { + if (!p_port_1 || !p_port_2) + { ret = FALSE; goto Exit; } @@ -329,7 +481,8 @@ osm_port_share_pkey( p_physp1 = osm_port_get_default_phys_ptr(p_port_1); p_physp2 = osm_port_get_default_phys_ptr(p_port_2); - if (!p_physp1 || !p_physp2) { + if (!p_physp1 || !p_physp2) + { ret = FALSE; goto Exit; } Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 7904) +++ opensm/osm_pkey_mgr.c (working copy) @@ -62,6 +62,139 @@ /********************************************************************** **********************************************************************/ +/* + the max number of pkey blocks for a physical port is located in + different place for switch external ports (SwitchInfo) and the + rest of the ports (NodeInfo) +*/ +static int pkey_mgr_get_physp_max_blocks( + IN const osm_subn_t *p_subn, + IN const osm_physp_t *p_physp) +{ + osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); + osm_switch_t *p_sw; + uint16_t num_pkeys = 0; + + if ( (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) || + (osm_physp_get_port_num( p_physp ) == 0)) + num_pkeys = cl_ntoh16( p_node->node_info.partition_cap ); + else + { + p_sw = osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); + if (p_sw) + num_pkeys = cl_ntoh16( p_sw->switch_info.enforce_cap ); + } + return( (num_pkeys + 31) / 32 ); +} + +/********************************************************************** + **********************************************************************/ +/* + * Insert the new pending pkey entry to the specific port pkey table + * pending pkeys. new entries are inserted at the back. 
+ */ +static void pkey_mgr_process_physical_port( + IN osm_log_t *p_log, + IN const osm_req_t *p_req, + IN const ib_net16_t pkey, + IN osm_physp_t *p_physp ) +{ + osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); + osm_pkey_tbl_t *p_pkey_tbl; + ib_net16_t *p_orig_pkey; + char *stat = NULL; + osm_pending_pkey_t *p_pending; + + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); + if (! p_pkey_tbl) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0501: " + "No pkey table found for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + + p_pending = (osm_pending_pkey_t *)malloc(sizeof(osm_pending_pkey_t)); + if (! p_pending) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0502: " + "Fail to allocate new pending pkey entry for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + p_pending->pkey = pkey; + p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + if ( !p_orig_pkey || + (ib_pkey_get_base(*p_orig_pkey) != ib_pkey_get_base(pkey) )) + { + p_pending->is_new = TRUE; + cl_qlist_insert_tail(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "inserted"; + } + else + { + p_pending->is_new = FALSE; + if (osm_pkey_tbl_get_block_and_idx(p_pkey_tbl, p_orig_pkey, + &p_pending->block, &p_pending->index)) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0503: " + "Fail to obtain P_Key 0x%04x block and index for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + cl_qlist_insert_head(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "updated"; + } + + osm_log( p_log, OSM_LOG_DEBUG, + "pkey_mgr_process_physical_port: " + "pkey 0x%04x was %s for node 0x%016" PRIx64 + " port %u\n", + 
cl_ntoh16( pkey ), stat, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); +} + +/********************************************************************** + **********************************************************************/ +static void +pkey_mgr_process_partition_table( + osm_log_t *p_log, + const osm_req_t *p_req, + const osm_prtn_t *p_prtn, + const boolean_t full ) +{ + const cl_map_t *p_tbl = full ? + &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; + cl_map_iterator_t i, i_next; + ib_net16_t pkey = p_prtn->pkey; + osm_physp_t *p_physp; + + if ( full ) + pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); + + i_next = cl_map_head( p_tbl ); + while ( i_next != cl_map_end( p_tbl ) ) + { + i = i_next; + i_next = cl_map_next( i ); + p_physp = cl_map_obj( i ); + if ( p_physp && osm_physp_is_valid( p_physp ) ) + pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); + } +} + +/********************************************************************** + **********************************************************************/ static ib_api_status_t pkey_mgr_update_pkey_entry( IN const osm_req_t *p_req, @@ -114,7 +247,8 @@ pkey_mgr_enforce_partition( p_pi->state_info2 = 0; ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); - context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); + context.pi_context.node_guid = + osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); context.pi_context.set_method = TRUE; context.pi_context.update_master_sm_base_lid = FALSE; @@ -131,80 +265,132 @@ pkey_mgr_enforce_partition( /********************************************************************** **********************************************************************/ -/* - * Prepare a new entry for the pkey table for this port when this pkey - * does not exist. Update existed entry when membership was changed. 
- */ -static void pkey_mgr_process_physical_port( - IN osm_log_t *p_log, - IN const osm_req_t *p_req, - IN const ib_net16_t pkey, - IN osm_physp_t *p_physp ) +static boolean_t pkey_mgr_update_port( + osm_log_t *p_log, + osm_req_t *p_req, + const osm_port_t * const p_port ) { - osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); - ib_pkey_table_t *block; + osm_physp_t *p_physp; + osm_node_t *p_node; + ib_pkey_table_t *block, *new_block; + osm_pkey_tbl_t *p_pkey_tbl; uint16_t block_index; + uint8_t pkey_index; + uint16_t last_free_block_index = 0; + uint16_t last_free_pkey_index = 0; uint16_t num_of_blocks; - const osm_pkey_tbl_t *p_pkey_tbl; - ib_net16_t *p_orig_pkey; - char *stat = NULL; - uint32_t i; + uint16_t max_num_of_blocks; - p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + ib_api_status_t status; + boolean_t ret_val = FALSE; + osm_pending_pkey_t *p_pending; + boolean_t found; - p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) + return FALSE; - if ( !p_orig_pkey ) - { - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, p_physp ); + if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) + osm_log( p_log, OSM_LOG_INFO, + "pkey_mgr_update_port: " + "Max number of blocks reduced from %u to %u " + "for node 0x%016" PRIx64 " port %u\n", + p_pkey_tbl->max_blocks, max_num_of_blocks, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + } + p_pkey_tbl->max_blocks = max_num_of_blocks; + + osm_pkey_tbl_sync_new_blocks( p_pkey_tbl ); + 
cl_map_remove_all( &p_pkey_tbl->keys ); + p_pkey_tbl->used_blocks = 0; + + /* + process every pending pkey in order - + first must be "updated" last are "new" + */ + p_pending = + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_pending != + (osm_pending_pkey_t *)cl_qlist_end( &p_pkey_tbl->pending ) ) + { + if (p_pending->is_new == FALSE) + { + block_index = p_pending->block; + pkey_index = p_pending->index; + found = TRUE; + } + else { - if ( ib_pkey_is_invalid( block->pkey_entry[i] ) ) + found = osm_pkey_find_next_free_entry(p_pkey_tbl, + &last_free_block_index, + &last_free_pkey_index); + if ( !found ) { - block->pkey_entry[i] = pkey; - stat = "inserted"; - goto _done; + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0504: " + "failed to find empty space for new pkey 0x%04x " + "of node 0x%016" PRIx64 " port %u\n", + cl_ntoh16(p_pending->pkey), + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); } + else + { + block_index = last_free_block_index; + pkey_index = last_free_pkey_index++; } } + + if (found) + { + if (osm_pkey_tbl_set_new_entry( + p_pkey_tbl, block_index, pkey_index, p_pending->pkey) ) + { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_process_physical_port: ERR 0501: " - "No empty pkey entry was found to insert 0x%04x for node " - "0x%016" PRIx64 " port %u\n", - cl_ntoh16( pkey ), + "pkey_mgr_update_port: ERR 0505: " + "failed to set PKey 0x%04x in block %u idx %u " + "of node 0x%016" PRIx64 " port %u\n", + p_pending->pkey, block_index, pkey_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } - else if ( *p_orig_pkey != pkey ) - { + } + + free( p_pending ); + p_pending = + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + } + + /* now look for changes and store */ for ( block_index = 0; block_index < num_of_blocks; block_index++ ) { - /* we need real block (not just new_block) in order - * to resolve block/pkey 
indices */ block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - i = p_orig_pkey - block->pkey_entry; - if (i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - block->pkey_entry[i] = pkey; - stat = "updated"; - goto _done; - } - } - } + new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - _done: - if (stat) { - osm_log( p_log, OSM_LOG_VERBOSE, - "pkey_mgr_process_physical_port: " - "pkey 0x%04x was %s for node 0x%016" PRIx64 - " port %u\n", - cl_ntoh16( pkey ), stat, + if (block && + (!new_block || !memcmp( new_block, block, sizeof( *block ) )) ) + continue; + + status = pkey_mgr_update_pkey_entry( + p_req, p_physp , new_block, block_index ); + if (status == IB_SUCCESS) + ret_val = TRUE; + else + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0506: " + "pkey_mgr_update_pkey_entry() failed to update " + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", + block_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } + + return ret_val; } /********************************************************************** @@ -217,21 +403,23 @@ pkey_mgr_update_peer_port( const osm_port_t * const p_port, boolean_t enforce ) { - osm_physp_t *p, *peer; + osm_physp_t *p_physp, *peer; osm_node_t *p_node; ib_pkey_table_t *block, *peer_block; - const osm_pkey_tbl_t *p_pkey_tbl, *p_peer_pkey_tbl; + const osm_pkey_tbl_t *p_pkey_tbl; + osm_pkey_tbl_t *p_peer_pkey_tbl; osm_switch_t *p_sw; ib_switch_info_t *p_si; uint16_t block_index; uint16_t num_of_blocks; + uint16_t peer_max_blocks; ib_api_status_t status = IB_SUCCESS; boolean_t ret_val = FALSE; - p = osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) return FALSE; - peer = osm_physp_get_remote( p ); + peer = osm_physp_get_remote( p_physp ); if ( !peer || !osm_physp_is_valid( peer ) ) return 
FALSE; p_node = osm_physp_get_node_ptr( peer ); @@ -245,7 +433,7 @@ pkey_mgr_update_peer_port( if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0502: " + "pkey_mgr_update_peer_port: ERR 0507: " "pkey_mgr_enforce_partition() failed to update " "node 0x%016" PRIx64 " port %u\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), @@ -255,24 +443,36 @@ pkey_mgr_update_peer_port( if (enforce == FALSE) return FALSE; - p_pkey_tbl = osm_physp_get_pkey_tbl( p ); - p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); + p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); + p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); + peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); + if (peer_max_blocks < p_pkey_tbl->used_blocks) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_peer_port: ERR 0508: " + "not enough entries (%u < %u) on switch 0x%016" PRIx64 + " port %u\n", + peer_max_blocks, num_of_blocks, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( peer ) ); + return FALSE; + } - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; + for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) { block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); if ( memcmp( peer_block, block, sizeof( *peer_block ) ) ) { + osm_pkey_tbl_set(p_peer_pkey_tbl, block_index, block); status = pkey_mgr_update_pkey_entry( p_req, peer, block, block_index ); if ( status == IB_SUCCESS ) ret_val = TRUE; else osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0503: " + "pkey_mgr_update_peer_port: ERR 0509: " 
"pkey_mgr_update_pkey_entry() failed to update " "pkey table block %d for node 0x%016" PRIx64 " port %u\n", @@ -282,10 +482,10 @@ pkey_mgr_update_peer_port( } } - if ( ret_val == TRUE && - osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) + if ( (ret_val == TRUE) && + osm_log_is_active( p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_log, OSM_LOG_VERBOSE, + osm_log( p_log, OSM_LOG_DEBUG, "pkey_mgr_update_peer_port: " "pkey table was updated for node 0x%016" PRIx64 " port %u\n", @@ -298,82 +498,6 @@ pkey_mgr_update_peer_port( /********************************************************************** **********************************************************************/ -static boolean_t pkey_mgr_update_port( - osm_log_t *p_log, - osm_req_t *p_req, - const osm_port_t * const p_port ) -{ - osm_physp_t *p; - osm_node_t *p_node; - ib_pkey_table_t *block, *new_block; - const osm_pkey_tbl_t *p_pkey_tbl; - uint16_t block_index; - uint16_t num_of_blocks; - ib_api_status_t status; - boolean_t ret_val = FALSE; - - p = osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) - return FALSE; - - p_pkey_tbl = osm_physp_get_pkey_tbl(p); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) - { - block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - - if (!new_block || !memcmp( new_block, block, sizeof( *block ) ) ) - continue; - - status = pkey_mgr_update_pkey_entry( p_req, p, new_block, block_index ); - if (status == IB_SUCCESS) - ret_val = TRUE; - else - osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_port: ERR 0504: " - "pkey_mgr_update_pkey_entry() failed to update " - "pkey table block %d for node 0x%016" PRIx64 " port %u\n", - block_index, - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( p ) ); - } - - return ret_val; -} - 
-/********************************************************************** - **********************************************************************/ -static void -pkey_mgr_process_partition_table( - osm_log_t *p_log, - const osm_req_t *p_req, - const osm_prtn_t *p_prtn, - const boolean_t full ) -{ - const cl_map_t *p_tbl = full ? - &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; - cl_map_iterator_t i, i_next; - ib_net16_t pkey = p_prtn->pkey; - osm_physp_t *p_physp; - - if ( full ) - pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); - - i_next = cl_map_head( p_tbl ); - while ( i_next != cl_map_end( p_tbl ) ) - { - i = i_next; - i_next = cl_map_next( i ); - p_physp = cl_map_obj( i ); - if ( p_physp && osm_physp_is_valid( p_physp ) ) - pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); - } -} - -/********************************************************************** - **********************************************************************/ osm_signal_t osm_pkey_mgr_process( IN osm_opensm_t *p_osm ) @@ -383,8 +507,7 @@ osm_pkey_mgr_process( osm_prtn_t *p_prtn; osm_port_t *p_port; osm_signal_t signal = OSM_SIGNAL_DONE; - osm_physp_t *p_physp; - + osm_node_t *p_node; CL_ASSERT( p_osm ); OSM_LOG_ENTER( &p_osm->log, osm_pkey_mgr_process ); @@ -394,32 +517,25 @@ osm_pkey_mgr_process( if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) { osm_log( &p_osm->log, OSM_LOG_ERROR, - "osm_pkey_mgr_process: ERR 0505: " + "osm_pkey_mgr_process: ERR 0510: " "osm_prtn_make_partitions() failed\n" ); goto _err; } - p_tbl = &p_osm->subn.port_guid_tbl; - p_next = cl_qmap_head( p_tbl ); - while ( p_next != cl_qmap_end( p_tbl ) ) - { - p_port = ( osm_port_t * ) p_next; - p_next = cl_qmap_next( p_next ); - p_physp = osm_port_get_default_phys_ptr( p_port ); - if ( osm_physp_is_valid( p_physp ) ) - osm_pkey_tbl_sync_new_blocks( osm_physp_get_pkey_tbl( p_physp ) ); - } - + /* populate the pending pkey entries by scanning all partitions */ p_tbl = 
&p_osm->subn.prtn_pkey_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) { p_prtn = ( osm_prtn_t * ) p_next; p_next = cl_qmap_next( p_next ); - pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); - pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); + pkey_mgr_process_partition_table( + &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); + pkey_mgr_process_partition_table( + &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); } + /* calculate new pkey tables and set */ p_tbl = &p_osm->subn.port_guid_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) @@ -428,8 +544,10 @@ osm_pkey_mgr_process( p_next = cl_qmap_next( p_next ); if ( pkey_mgr_update_port( &p_osm->log, &p_osm->sm.req, p_port ) ) signal = OSM_SIGNAL_DONE_PENDING; - if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != IB_NODE_TYPE_SWITCH && - pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, + p_node = osm_port_get_parent_node( p_port ); + if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && + pkey_mgr_update_peer_port( + &p_osm->log, &p_osm->sm.req, &p_osm->subn, p_port, !p_osm->subn.opt.no_partition_enforcement ) ) signal = OSM_SIGNAL_DONE_PENDING;

From tziporet at mellanox.co.il Tue Jun 13 06:07:33 2006
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 13 Jun 2006 16:07:33 +0300
Subject: [openib-general] OFED 1.0 release schedule
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA71E5@mtlexch01.mtl.com>

Hi All,

After reading the mail thread regarding the OFED release I have decided this:
We upload OFED-1.0-pre1.tgz to
https://openib.org/svn/gen2/branches/1.0/ofed/releases/
We checked that all modules compile and load on this build (including ipath
and uDAPL).
The only missing parts of this release from the final release are the
documents, and the scripts rpm that Scott requested.

I think testing this version for 3 days (Tuesday, Wednesday and Thursday)
should be enough, as Scott wrote. So - we can do the official OFED 1.0
release on Friday 16-June.

Matt - please check with Novell if this date is acceptable to them. If not,
then the earliest we can do the release is Thursday 15-June.

Tziporet Koren
Software Director
Mellanox Technologies
mailto: tziporet at mellanox.co.il
Tel +972-4-9097200, ext 380

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From halr at voltaire.com Tue Jun 13 06:09:11 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Jun 2006 09:09:11 -0400
Subject: [openib-general] [PATCH] osm: Provide SUBNET UP message every heavy sweep - resend
In-Reply-To: <86pshdgrgs.fsf@mtl066.yok.mtl.com>
References: <86pshdgrgs.fsf@mtl066.yok.mtl.com>
Message-ID: <1150203379.570.144617.camel@hal.voltaire.com>

Hi Eitan,

On Tue, 2006-06-13 at 08:39, Eitan Zahavi wrote:
> Hi Hal
>
> Sorry about the previous patch - I got the } else { in it.
>
> This trivial patch provides a "SUBNET UP" message (with level INFO)
> every time the SM completes a full heavy sweep. It is most useful for
> cases where you want to make sure the SM responded to some change in
> the fabric. Also used to sync the various test flows to the end of sweeps.

I already had fixed this prior to committing it. I thought that was easier
than "going 'round the block" on it.
> Eitan
>
> Signed-off-by: Eitan Zahavi
>
> Index: opensm/osm_state_mgr.c
> ===================================================================
> --- opensm/osm_state_mgr.c (revision 7904)
> +++ opensm/osm_state_mgr.c (working copy)
> @@ -200,6 +200,10 @@ __osm_state_mgr_up_msg(
> /* clear the signal */
> p_mgr->p_subn->moved_to_master_state = FALSE;
> }
> + else
> + {
> + osm_log( p_mgr->p_log, OSM_LOG_INFO, "SUBNET UP\n" ); /* Format Waived */
> + }

Even if tab is supposed to be the convention, spaces are used in most OpenSM
modules, and I have been trying to keep to the convention used in the
particular module.

-- Hal

> if( p_mgr->p_subn->opt.sweep_interval )
> {
>

From halr at voltaire.com Tue Jun 13 06:17:33 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Jun 2006 09:17:33 -0400
Subject: [openib-general] [PATCH] osm: partition manager force policy
In-Reply-To: <86odwxgqrs.fsf@mtl066.yok.mtl.com>
References: <86odwxgqrs.fsf@mtl066.yok.mtl.com>
Message-ID: <1150204529.570.145313.camel@hal.voltaire.com>

Hi Eitan,

On Tue, 2006-06-13 at 08:54, Eitan Zahavi wrote:
> --text follows this line--
> Hi Hal
>
> This is a second take after debug and cleanup of the partition manager
> patch I have previously provided.

Thanks.

So this patch supersedes the previous version ? If so, in the future,
just indicate [PATCHv2] for this.

> The functionality is the same but
> this one is after 2 days of testing on the simulator.

Are you still working on this (more testing) ?

> I also did some code restructuring for clarity.
> Tests passed were both dedicated pkey enforcements (pkey.*) and
> stress test (osmStress.*)
>
> As I started to test the partition manager code (using ibmgtsim pkey test),
> I realized the implementation does not really enforce the partition policy
> on the given fabric. This patch fixes that. It was verified using the
> simulation test. Several other corner cases were fixed too.

Can you elaborate on these cases ?
-- Hal From bpradip at in.ibm.com Tue Jun 13 06:47:51 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Tue, 13 Jun 2006 19:17:51 +0530 Subject: [openib-general] [PATCH] libamso: fix erroneous return and memory leak in verbs.c Message-ID: <20060613134743.GA17393@harry-potter.ibm.com> Hi, This patch fixes an erroneous return in amso_create_cq() and a memory leak in amso_create_qp(). --- Index: libamso/verbs.c ============================================================================ --- verbs.org 2006-06-13 18:56:50.000000000 +0530 +++ verbs.c 2006-06-13 19:02:03.000000000 +0530 @@ -154,9 +154,8 @@ struct ibv_cq *amso_create_cq(struct ibv int ret; cq = malloc(sizeof *cq); - if (!cq) { - goto err; - } + if (!cq) + return NULL; ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector, &cq->ibv_cq, &cmd.ibv_cmd, sizeof cmd, @@ -248,14 +247,15 @@ struct ibv_qp *amso_create_qp(struct ibv ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd.ibv_cmd, sizeof cmd, &resp.ibv_resp, sizeof resp); if (ret) - return NULL; + goto err; #if 0 /* A reminder for bypass functionality */ qp->physaddr = resp.physaddr; #endif return &qp->ibv_qp; - +err: + free(qp); return NULL; } From mst at mellanox.co.il Tue Jun 13 07:19:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 13 Jun 2006 17:19:55 +0300 Subject: [openib-general] [PATCH updated] libmthca: memfree completion with error Message-ID: <20060613141955.GX19518@mellanox.co.il> Same thing for userspace. --- Fix up completion with error for memfree. Signed-off-by: Michael S.
Tsirkin Index: openib/src/userspace/libmthca/src/cq.c =================================================================== --- openib/src/userspace/libmthca/src/cq.c (revision 7890) +++ openib/src/userspace/libmthca/src/cq.c (working copy) @@ -347,8 +347,17 @@ wc->wr_id = srq->wrid[wqe_index]; mthca_free_srq_wqe(srq, wqe_index); } else { + int32_t wqe; wq = &(*cur_qp)->rq; - wqe_index = ntohl(cqe->wqe) >> wq->wqe_shift; + wqe = ntohl(cqe->wqe); + wqe_index = wqe >> wq->wqe_shift; + /* + * WQE addr == base - 1 might be reported in receive completion + * with error instead of (rq size - 1) by Sinai FW 1.0.800, + * Arbel FW 5.1.400 and should be fixed in later revisions. + */ + if (wqe_index < 0) + wqe_index = wq->max - 1; wc->wr_id = (*cur_qp)->wrid[wqe_index]; } -- MST From mst at mellanox.co.il Tue Jun 13 07:19:42 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 13 Jun 2006 17:19:42 +0300 Subject: [openib-general] [PATCH updated] mthca: memfree completion with error workaround In-Reply-To: <20060612121635.GX7359@mellanox.co.il> References: <20060612121635.GX7359@mellanox.co.il> Message-ID: <20060613141942.GW19518@mellanox.co.il> OK, here's an optimized version of the fix. With this, I see: before 5994 0 0 5994 176a drivers/infiniband/hw/mthca/mthca_cq.o after 5995 0 0 5995 176b drivers/infiniband/hw/mthca/mthca_cq.o So the cost is minimal. Please consider for 2.6.17. --- Memfree firmware is in rare cases reporting WQE index == base - 1 in receive completion with error instead of (rq size - 1); base is 0 in mthca. Here is a patch to avoid kernel crash and report a correct WR id in this case. Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.16/drivers/infiniband/hw/mthca/mthca_cq.c =================================================================== --- linux-2.6.16.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2006-05-16 12:33:05.000000000 +0300 +++ linux-2.6.16/drivers/infiniband/hw/mthca/mthca_cq.c 2006-06-13 12:14:13.000000000 +0300 @@ -540,8 +540,17 @@ static inline int mthca_poll_one(struct entry->wr_id = srq->wrid[wqe_index]; mthca_free_srq_wqe(srq, wqe); } else { + s32 wqe; wq = &(*cur_qp)->rq; - wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + wqe = be32_to_cpu(cqe->wqe); + wqe_index = wqe >> wq->wqe_shift; + /* + * WQE addr == base - 1 might be reported in receive completion + * with error instead of (rq size - 1) by Sinai FW 1.0.800, + * Arbel FW 5.1.400 and should be fixed in later revisions. + */ + if (unlikely(wqe_index < 0)) + wqe_index = wq->max - 1; entry->wr_id = (*cur_qp)->wrid[wqe_index]; } -- MST From eitan at mellanox.co.il Tue Jun 13 07:21:24 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 13 Jun 2006 17:21:24 +0300 Subject: [openib-general] [PATCH] osm: partition manager force policy In-Reply-To: <1150204529.570.145313.camel@hal.voltaire.com> References: <86odwxgqrs.fsf@mtl066.yok.mtl.com> <1150204529.570.145313.camel@hal.voltaire.com> Message-ID: <448EC9E4.3020409@mellanox.co.il> Hi Hal, Hal Rosenstock wrote: > Hi Eitan, > > On Tue, 2006-06-13 at 08:54, Eitan Zahavi wrote: > >>--text follows this line-- >>Hi Hal >> >>This is a second take after debug and cleanup of the partition manager >>patch I have previously provided. > > > Thanks. > > So this patch superceeds the previous version ? If so, in the future, > just indicate [PATCHv2] for this. > > >> The functionality is the same but >>this one is after 2 days of testing on the simulator. > > > Are you still working on this (more testing) ? > > >>I also did some code restructuring for clarity. 
>
>
>>Tests passed were both dedicated pkey enforcements (pkey.*) and
>>stress test (osmStress.*)
>>
>>As I started to test the partition manager code (using ibmgtsim pkey test),
>>I realized the implementation does not really enforce the partition policy
>>on the given fabric. This patch fixes that. It was verified using the
>>simulation test. Several other corner cases were fixed too.
>
> Can you elaborate on these cases ?

If you ask about the corner cases:

1. A bug in avoiding switch enforcement when the HCA had more pkey table blocks than the switch.
2. Similar, but when the HCA blocks are unused, so the switch does not actually need that many blocks.
3. Segfaults due to fabric instability.

If you ask about the test code, it is checked in
https://openib.org/svn/gen2/utils/src/linux-user/ibmgtsim/tests
The file names start with pkey.* and osmStress.*.

In general the pkey test does:
* Randomize 3 pkeys p1 p2 p3 (the first 2 are full, 1 is partial)
* Assign ports into 3 groups: G1 which uses p1, G2 which uses p2, and G3 which uses p1, p2 and p3
* For each HCA port, randomize pkey tables with a random number of entries (including the ones above, at random locations)
* For some ports, override the tables with an incorrect set
* Write a partition policy file
* Start the SM, wait for subnet up
* Randomly select HCA ports and verify (using osmtest -f c) that the all-to-all path records they see are limited by the partitions they belong to
* Forcefully null all default pkey entries on the fabric ports
* Set a change bit on a switch to force a sweep
* Wait for subnet up and check that all ports have the correct default pkey set

The stress test does:
* Set up LIDs
* Force some random LID violations (duplicated, misaligned, zero)
* Write a guid2lid file with some random changes
* Disconnect some random nodes
* Run OpenSM, wait for subnet up
* Repeat 10 times: reconnect all nodes, then disconnect some random nodes
* Wait for subnet up
* Check all LID values are correct (according to guid2lid)
* Start 240 iterations of selecting one of the following:
  connect random port,
  disconnect random port,
  register random service,
  query random paths from random nodes,
  join random port to 0xC000,
  leave random port from 0xC000
* Eventually: connect all nodes, join 0xC000 from all HCA ports, wait for subnet up, check connectivity and FDB validity etc. using ibdiagnet

From ishai at mellanox.co.il Tue Jun 13 07:57:47 2006
From: ishai at mellanox.co.il (Ishai Rabinovitz)
Date: Tue, 13 Jun 2006 17:57:47 +0300
Subject: [openib-general] [PATCH] SRP: Avoid a potential race on target->req_queue
Message-ID: <20060613145747.GA18628@mellanox.co.il>

Hi Roland,

There is a potential race between srp_reconnect_target and srp_reset_device when they access target->req_queue. These functions can execute at the same time because srp_reconnect_target is called from srp_reconnect_work, which is scheduled by srp_completion, while srp_reset_device is called from the scsi layer.

The race arises because srp_reconnect_target does not hold host_lock while accessing target->req_queue. It assumes that since the state is CONNECTING no other function will access target->req_queue (and this is the case with srp_reset_host, for example).

There are two possible solutions:
1) Change srp_reset_device: after locking host_lock, it checks the state and executes the loop that accesses target->req_queue only if the state is LIVE.
2) Change srp_reconnect_target: it locks host_lock before executing the loop that accesses target->req_queue and releases it after the loop.

I'm sending a patch for the second solution. If you prefer the first, I have another patch for it (it is a bit longer). Which solution do you like better?
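The second solution (the one the patch below takes) amounts to holding the lock across the whole queue traversal rather than relying on the CONNECTING state for exclusion. A hedged userspace sketch of that pattern, with a pthread mutex standing in for the kernel's host_lock and a toy singly linked list standing in for target->req_queue (names and types are illustrative, not the SRP driver's):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct req { struct req *next; int id; };

static pthread_mutex_t host_lock = PTHREAD_MUTEX_INITIALIZER;
static struct req *req_queue;      /* stands in for target->req_queue */

static void enqueue_req(struct req *r)
{
    pthread_mutex_lock(&host_lock);
    r->next = req_queue;
    req_queue = r;
    pthread_mutex_unlock(&host_lock);
}

/* Analogue of the fixed srp_reconnect_target loop: the lock is held for
 * the entire walk, so a concurrent srp_reset_device-style caller cannot
 * observe or modify the queue mid-traversal. Returns number drained. */
static int drain_req_queue(void)
{
    int n = 0;

    pthread_mutex_lock(&host_lock);
    while (req_queue) {
        struct req *r = req_queue;
        req_queue = r->next;
        n++;                       /* real code would fail r back to SCSI */
    }
    pthread_mutex_unlock(&host_lock);
    return n;
}
```

Taking the lock once around the loop (rather than per element) is what makes the drain atomic with respect to the other path; it mirrors the `spin_lock_irq`/`spin_unlock_irq` pair added in the patch.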
Signed-off-by: Ishai Rabinovitz Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-06-13 02:24:22.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-06-13 02:26:07.000000000 +0300 @@ -641,8 +641,10 @@ static int srp_reconnect_target(struct s while (ib_poll_cq(target->cq, 1, &wc) > 0) ; /* nothing */ + spin_lock_irq(target->scsi_host->host_lock); list_for_each_entry_safe(req, tmp, &target->req_queue, list) srp_reset_req(target, req); + spin_unlock_irq(target->scsi_host->host_lock); target->rx_head = 0; target->tx_head = 0; -- Ishai Rabinovitz From viswa.krish at gmail.com Tue Jun 13 09:21:03 2006 From: viswa.krish at gmail.com (Viswanath Krishnamurthy) Date: Tue, 13 Jun 2006 09:21:03 -0700 Subject: [openib-general] opensm and NPTL In-Reply-To: <1150193732.570.138496.camel@hal.voltaire.com> References: <4df28be40606122016t35a7a6d0s57f218dfea344283@mail.gmail.com> <1150193732.570.138496.camel@hal.voltaire.com> Message-ID: <4df28be40606130921t1d5eb51dof06280721d1bf1e9@mail.gmail.com> Yes.. I want to test waters again and see if the issues went away. -Viswa On 13 Jun 2006 06:15:34 -0400, Hal Rosenstock wrote: > > Hi Viswa, > > On Mon, 2006-06-12 at 23:16, Viswanath Krishnamurthy wrote: > > There were some issues with opensm running with NPTL (thread > > library). Has the issues been resolved ? > > There were some fixes to the signal handling which went in back in the > Feb/early March time frame. OpenSM should be better with NPTL now. Is it > working for you or are you asking before stepping into these waters > again ? 
> > -- Hal > > > Regards, > > Viswa > > > > > > > > ______________________________________________________________________ > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From faulkner at opengridcomputing.com Tue Jun 13 09:24:07 2006 From: faulkner at opengridcomputing.com (Boyd R. Faulkner) Date: Tue, 13 Jun 2006 11:24:07 -0500 Subject: [openib-general] [PATCH] librdmacm/examples/rping.c Message-ID: <200606131124.08110.faulkner@opengridcomputing.com> This patch resolves a race condition between the receipt of a connection established event and a receive completion from the client. The server no longer goes to connected state but merely waits for the READ_ADV state to begin its looping. This keeps the server from going back to CONNECTED from the later states if the connection established event comes in after the receive completion (i.e. the loop starts). Signed-off-by: Boyd Faulkner Index: rping.c =================================================================== --- rping.c (revision 7960) +++ rping.c (working copy) @@ -182,7 +182,13 @@ case RDMA_CM_EVENT_ESTABLISHED: DEBUG_LOG("ESTABLISHED\n"); - cb->state = CONNECTED; + + /* + * Server will wake up when first RECV completes. + */ + if (!cb->server) { + cb->state = CONNECTED; + } sem_post(&cb->sem); break; @@ -197,7 +203,7 @@ break; case RDMA_CM_EVENT_DISCONNECTED: - fprintf(stderr, "DISCONNECT EVENT...\n"); + fprintf(stderr, "%s DISCONNECT EVENT...\n", cb->server ? 
"server" : "client"); sem_post(&cb->sem); break; @@ -225,7 +231,7 @@ DEBUG_LOG("Received rkey %x addr %" PRIx64 "len %d from peer\n", cb->remote_rkey, cb->remote_addr, cb->remote_len); - if (cb->state == CONNECTED || cb->state == RDMA_WRITE_COMPLETE) + if (cb->state <= CONNECTED || cb->state == RDMA_WRITE_COMPLETE) cb->state = RDMA_READ_ADV; else cb->state = RDMA_WRITE_ADV; -- Boyd R. Faulkner Open Grid Computing, Inc. Phone: 512-343-9196 x109 Fax: 512-343-5450 From halr at voltaire.com Tue Jun 13 09:35:17 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 12:35:17 -0400 Subject: [openib-general] opensm and NPTL In-Reply-To: <4df28be40606130921t1d5eb51dof06280721d1bf1e9@mail.gmail.com> References: <4df28be40606122016t35a7a6d0s57f218dfea344283@mail.gmail.com> <1150193732.570.138496.camel@hal.voltaire.com> <4df28be40606130921t1d5eb51dof06280721d1bf1e9@mail.gmail.com> Message-ID: <1150216346.570.152323.camel@hal.voltaire.com> On Tue, 2006-06-13 at 12:21, Viswanath Krishnamurthy wrote: > Yes.. I want to test waters again and see if the issues went away. Are you using the trunk or 1.0 ? -- Hal > -Viswa > > > On 13 Jun 2006 06:15:34 -0400, Hal Rosenstock > wrote: > Hi Viswa, > > On Mon, 2006-06-12 at 23:16, Viswanath Krishnamurthy wrote: > > There were some issues with opensm running with > NPTL (thread > > library). Has the issues been resolved ? > > There were some fixes to the signal handling which went in > back in the > Feb/early March time frame. OpenSM should be better with NPTL > now. Is it > working for you or are you asking before stepping into these > waters > again ? 
> > -- Hal > > > Regards, > > Viswa > > > > > > > > > ______________________________________________________________________ > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > From halr at voltaire.com Tue Jun 13 09:38:33 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 12:38:33 -0400 Subject: [openib-general] [PATCH] osmtest: Add test for non base LID SA PortInfoRecord request when LMC > 0 Message-ID: <1150216507.570.152411.camel@hal.voltaire.com> osmtest: Add test for non base LID SA PortInfoRecord request when LMC > 0 Signed-off-by: Hal Rosenstock Index: osmtest/osmtest.c =================================================================== --- osmtest/osmtest.c (revision 7961) +++ osmtest/osmtest.c (working copy) @@ -1613,6 +1613,7 @@ osmtest_stress_port_recs_small( IN osmte **********************************************************************/ ib_api_status_t osmtest_get_local_port_lmc( IN osmtest_t * const p_osmt, + IN ib_net16_t lid, OUT uint8_t * const p_lmc ) { osmtest_req_context_t context; @@ -1629,7 +1630,7 @@ osmtest_get_local_port_lmc( IN osmtest_t * Do a blocking query for our own PortRecord in the subnet. 
*/ status = osmtest_get_port_rec( p_osmt, - cl_ntoh16(p_osmt->local_port.lid), + cl_ntoh16( lid ), &context ); if( status != IB_SUCCESS ) @@ -3181,7 +3182,7 @@ osmtest_validate_path_data( IN osmtest_t cl_ntoh16( p_rec->slid ), cl_ntoh16( p_rec->dlid ) ); } - status = osmtest_get_local_port_lmc( p_osmt, &lmc ); + status = osmtest_get_local_port_lmc( p_osmt, p_osmt->local_port.lid, &lmc ); /* HACK: Assume uniform LMC across endports in the subnet */ /* In absence of this assumption, validation of this is much more complicated */ @@ -4885,10 +4886,13 @@ static ib_api_status_t osmtest_validate_against_db( IN osmtest_t * const p_osmt ) { ib_api_status_t status = IB_SUCCESS; -#if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP) +#ifdef VENDOR_RMPP_SUPPORT + uint8_t lmc; +#ifdef DUAL_SIDED_RMPP osmtest_req_context_t context; osmv_multipath_req_t request; #endif +#endif OSM_LOG_ENTER( &p_osmt->log, osmtest_validate_against_db ); @@ -4999,6 +5003,18 @@ osmtest_validate_against_db( IN osmtest_ if( status != IB_SUCCESS ) goto Exit; + /* If LMC > 0, test non base LID SA PortInfoRecord request */ + status = osmtest_get_local_port_lmc( p_osmt, p_osmt->local_port.lid, &lmc ); + if ( status != IB_SUCCESS ) + goto Exit; + + if (lmc != 0) + { + status = osmtest_get_local_port_lmc( p_osmt, p_osmt->local_port.lid + 1, NULL); + if ( status != IB_SUCCESS ) + goto Exit; + } + if (! 
p_osmt->opt.ignore_path_records) { status = osmtest_validate_all_path_recs( p_osmt ); From halr at voltaire.com Tue Jun 13 09:42:19 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 12:42:19 -0400 Subject: [openib-general] [PATCH] OpenSM/SA: Properly handle non base LID requests to some SA records Message-ID: <1150216933.570.152671.camel@hal.voltaire.com> OpenSM/SA: Properly handle non base LID requests to some SA records In osm_sa_node_record.c and osm_sa_portinfo_record.c, properly handle non base LID requests per C15-0.1.11: Query responses shall contain a port's base LID in any LID component of a RID. So when LMC is non 0, the only records that appear are those with the base LID and not with any masked LIDs. Furthermore, if a query comes in on a non base LID, the LID in the RID returned is only with the base LID. Also, fixed some endian issues in osm_log messages. Note: Similar patch for other affected SA records will follow. Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_node_record.c =================================================================== --- opensm/osm_sa_node_record.c (revision 7961) +++ opensm/osm_sa_node_record.c (working copy) @@ -200,12 +200,11 @@ __osm_nr_rcv_create_nr( uint8_t port_num; uint8_t num_ports; uint16_t match_lid_ho; - uint16_t lid_ho; + ib_net16_t base_lid; ib_net16_t base_lid_ho; ib_net16_t max_lid_ho; uint8_t lmc; ib_net64_t port_guid; - ib_api_status_t status; OSM_LOG_ENTER( p_rcv->p_log, __osm_nr_rcv_create_nr ); @@ -245,7 +244,8 @@ __osm_nr_rcv_create_nr( if( match_port_guid && ( port_guid != match_port_guid ) ) continue; - base_lid_ho = cl_ntoh16( osm_physp_get_base_lid( p_physp ) ); + base_lid = osm_physp_get_base_lid( p_physp ); + base_lid_ho = cl_ntoh16( base_lid ); lmc = osm_physp_get_lmc( p_physp ); max_lid_ho = (uint16_t)( base_lid_ho + (1 << lmc) - 1 ); match_lid_ho = cl_ntoh16( match_lid ); @@ -260,29 +260,18 @@ __osm_nr_rcv_create_nr( osm_log( p_rcv->p_log, OSM_LOG_DEBUG, 
"__osm_nr_rcv_create_nr: " "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", - cl_ntoh16( base_lid_ho ), - cl_ntoh16( match_lid_ho ), - cl_ntoh16( max_lid_ho ) + base_lid_ho, match_lid_ho, max_lid_ho ); } if( (match_lid_ho <= max_lid_ho) && (match_lid_ho >= base_lid_ho) ) { - __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, port_guid, match_lid ); + __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, port_guid, base_lid ); } } else { - /* - For every lid value create a Node Record. - */ - for( lid_ho = base_lid_ho; lid_ho <= max_lid_ho; lid_ho++ ) - { - status = __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, - port_guid, cl_hton16( lid_ho ) ); - if( status != IB_SUCCESS ) - break; - } + __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, port_guid, base_lid ); } } Index: opensm/osm_sa_portinfo_record.c =================================================================== --- opensm/osm_sa_portinfo_record.c (revision 7961) +++ opensm/osm_sa_portinfo_record.c (working copy) @@ -194,9 +194,9 @@ __osm_sa_pir_create( IN osm_pir_search_ctxt_t* const p_ctxt ) { uint8_t lmc; - uint16_t lid_ho; uint16_t max_lid_ho; uint16_t base_lid_ho; + uint16_t match_lid_ho; OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_pir_create ); @@ -218,17 +218,28 @@ __osm_sa_pir_create( if( p_ctxt->comp_mask & IB_PIR_COMPMASK_LID ) { - __osm_pir_rcv_new_pir( p_rcv, p_physp, p_ctxt->p_list, - p_ctxt->p_rcvd_rec->lid ); - } - else - { - for( lid_ho = base_lid_ho; lid_ho <= max_lid_ho; lid_ho++ ) + match_lid_ho = cl_ntoh16( p_ctxt->p_rcvd_rec->lid ); + + /* + We validate that the lid belongs to this node. 
+ */ + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { - __osm_pir_rcv_new_pir( p_rcv, p_physp, p_ctxt->p_list, - cl_hton16( lid_ho ) ); + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_sa_pir_create: " + "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", + base_lid_ho, match_lid_ho, max_lid_ho + ); } + + if ( match_lid_ho < base_lid_ho || match_lid_ho > max_lid_ho ) + goto Exit; } + + __osm_pir_rcv_new_pir( p_rcv, p_physp, p_ctxt->p_list, + cl_hton16( base_lid_ho ) ); + + Exit: OSM_LOG_EXIT( p_rcv->p_log ); } From viswa.krish at gmail.com Tue Jun 13 09:56:08 2006 From: viswa.krish at gmail.com (Viswanath Krishnamurthy) Date: Tue, 13 Jun 2006 09:56:08 -0700 Subject: [openib-general] opensm and NPTL In-Reply-To: <1150216346.570.152323.camel@hal.voltaire.com> References: <4df28be40606122016t35a7a6d0s57f218dfea344283@mail.gmail.com> <1150193732.570.138496.camel@hal.voltaire.com> <4df28be40606130921t1d5eb51dof06280721d1bf1e9@mail.gmail.com> <1150216346.570.152323.camel@hal.voltaire.com> Message-ID: <4df28be40606130956v4f945921ncbd13f2b6d0ff517@mail.gmail.com> I am using the trunk. Should I be using 1.0 ? -Viswa On 13 Jun 2006 12:35:17 -0400, Hal Rosenstock wrote: > > On Tue, 2006-06-13 at 12:21, Viswanath Krishnamurthy wrote: > > Yes.. I want to test waters again and see if the issues went away. > > Are you using the trunk or 1.0 ? > > -- Hal > > > -Viswa > > > > > > On 13 Jun 2006 06:15:34 -0400, Hal Rosenstock > > wrote: > > Hi Viswa, > > > > On Mon, 2006-06-12 at 23:16, Viswanath Krishnamurthy wrote: > > > There were some issues with opensm running with > > NPTL (thread > > > library). Has the issues been resolved ? > > > > There were some fixes to the signal handling which went in > > back in the > > Feb/early March time frame. OpenSM should be better with NPTL > > now. Is it > > working for you or are you asking before stepping into these > > waters > > again ? 
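Stepping back to the LMC arithmetic used in both the osmtest and SA patches above: with LMC = l, a port answers to 2^l consecutive LIDs starting at its base LID, and per C15-0.1.11 any RID in a query response must carry only the base LID. A small sketch of the range check (standalone helper names, host byte order throughout; not OpenSM's actual functions):

```c
#include <assert.h>
#include <stdint.h>

/* max_lid_ho = base_lid_ho + (1 << lmc) - 1, the same expression the
 * patches above compute. */
static uint16_t max_lid_ho(uint16_t base_lid_ho, uint8_t lmc)
{
    return (uint16_t)(base_lid_ho + (1u << lmc) - 1);
}

/* A requested LID matches a port iff it falls in [base, max]; the
 * record returned still carries only the base LID. */
static int lid_matches(uint16_t match_lid_ho, uint16_t base_lid_ho,
                       uint8_t lmc)
{
    return match_lid_ho >= base_lid_ho &&
           match_lid_ho <= max_lid_ho(base_lid_ho, lmc);
}
```

This is also why the old per-LID loops could be dropped: instead of emitting one record per masked LID, a single record with the base LID is emitted whenever the requested LID lands anywhere in the range.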
> > > > -- Hal > > > > > Regards, > > > Viswa > > > > > > > > > > > > > > > ______________________________________________________________________ > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gjohnson at lanl.gov Tue Jun 13 10:02:46 2006 From: gjohnson at lanl.gov (Greg Johnson) Date: Tue, 13 Jun 2006 11:02:46 -0600 Subject: [openib-general] [PATCH 0/4] opensm: Loading unicast routes from the file In-Reply-To: <20060611002758.22430.63061.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> Message-ID: <20060613170246.GH23320@durango.c3.lanl.gov> On Sun, Jun 11, 2006 at 03:27:58AM +0300, Sasha Khapyorsky wrote: > Hi, > > There are couple of unicast routing related patches for OpenSM. > > Basically it implements routing module which provides possibility to load > switch forwarding tables from pre-created dump file. Currently unicast > tables loading is only supported, multicast may be added in a future. > > Short patch descriptions (more details may be found in emails with > patches): > > 1. Ucast dump file simplification. > 2. Modular routing - preliminary implements generic model to plug new > routing engine to OpenSM. > 3. New simple unicast routing engine which allows to load LFTs from > pre-created dump file. > 4. Example of ucast dump generation script. > > Please comment and test. Thanks. We tried this on our 256-node cluster with a single chassis Voltaire 288-port switch. It seems to load the routes generated by the dump script, but afterward it is not possible to dump the routes again. I would like to re-dump the routes after loading to ensure that they were loaded correctly. 
After loading routes with "opensm -R file -U dump_file", dump_lfts.sh gives: nodeinfo 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 ibroute: iberror: dump tables failed: node info failed: valid addr? for each switch. Also, I had to delete a space in the sed script on line 17 of dump_lfts.sh: sed -ne 's/^.* lid \([1-9a-f]*\) .*$/\1/p' became sed -ne 's/^.* lid \([1-9a-f]*\).*$/\1/p' Thanks for the work! Greg From halr at voltaire.com Tue Jun 13 10:06:43 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 13:06:43 -0400 Subject: [openib-general] opensm and NPTL In-Reply-To: <4df28be40606130956v4f945921ncbd13f2b6d0ff517@mail.gmail.com> References: <4df28be40606122016t35a7a6d0s57f218dfea344283@mail.gmail.com> <1150193732.570.138496.camel@hal.voltaire.com> <4df28be40606130921t1d5eb51dof06280721d1bf1e9@mail.gmail.com> <1150216346.570.152323.camel@hal.voltaire.com> <4df28be40606130956v4f945921ncbd13f2b6d0ff517@mail.gmail.com> Message-ID: <1150218085.570.153354.camel@hal.voltaire.com> On Tue, 2006-06-13 at 12:56, Viswanath Krishnamurthy wrote: > I am using the trunk. Should I be using 1.0 ? No; I didn't check but if my memory serves me correctly, the trunk may have some fixes 1.0 doesn't towards this but I'm not 100% sure right now and since you are using the trunk, I'm not going to do my homework on whether that is really the case or my memory is just fuzzy on this. -- Hal > > -Viswa > > > On 13 Jun 2006 12:35:17 -0400, Hal Rosenstock > wrote: > On Tue, 2006-06-13 at 12:21, Viswanath Krishnamurthy wrote: > > Yes.. I want to test waters again and see if the issues went > away. > > Are you using the trunk or 1.0 ? 
> > -- Hal > > > -Viswa > > > > > > On 13 Jun 2006 06:15:34 -0400, Hal Rosenstock > > > wrote: > > Hi Viswa, > > > > On Mon, 2006-06-12 at 23:16, Viswanath Krishnamurthy > wrote: > > > There were some issues with opensm running with > > NPTL (thread > > > library). Has the issues been resolved ? > > > > There were some fixes to the signal handling which > went in > > back in the > > Feb/early March time frame. OpenSM should be better > with NPTL > > now. Is it > > working for you or are you asking before stepping > into these > > waters > > again ? > > > > -- Hal > > > > > Regards, > > > Viswa > > > > > > > > > > > > > > > ______________________________________________________________________ > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > > > From swise at opengridcomputing.com Tue Jun 13 10:25:52 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 13 Jun 2006 12:25:52 -0500 Subject: [openib-general] [PATCH] librdmacm/examples/rping.c In-Reply-To: <200606131124.08110.faulkner@opengridcomputing.com> References: <200606131124.08110.faulkner@opengridcomputing.com> Message-ID: <1150219552.17394.23.camel@stevo-desktop> Thanks, applied. iwarp branch: r7964 trunk: r7966 On Tue, 2006-06-13 at 11:24 -0500, Boyd R. Faulkner wrote: > This patch resolves a race condition between the receipt of > a connection established event and a receive completion from > the client. The server no longer goes to connected state but > merely waits for the READ_ADV state to begin its looping. This > keeps the server from going back to CONNECTED from the later > states if the connection established event comes in after the > receive completion (i.e. the loop starts). 
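The `cb->state <= CONNECTED` comparison in Boyd's rping patch works because the state constants are declared in connection-progress order, so any state at or before CONNECTED means the first RECV completion may legitimately advance the server to READ_ADV, whichever of the two racing events lands first. A hedged sketch (the enum names follow the patch context, but the exact set and values here are illustrative, not rping's real enum):

```c
#include <assert.h>

/* States in connection-progress order; only the relative order matters
 * for the "<=" test. Illustrative, not rping's exact declaration. */
enum rping_state {
    IDLE = 1,
    CONNECT_REQUEST,
    ADDR_RESOLVED,
    ROUTE_RESOLVED,
    CONNECTED,
    RDMA_READ_ADV,
    RDMA_WRITE_ADV,
    RDMA_WRITE_COMPLETE,
};

/* Analogue of the fixed receive-completion handler: whether the
 * ESTABLISHED event has already arrived (state == CONNECTED) or is
 * still pending (server: state below CONNECTED), the first RECV
 * moves the connection to RDMA_READ_ADV. */
static enum rping_state on_recv_complete(enum rping_state s)
{
    if (s <= CONNECTED || s == RDMA_WRITE_COMPLETE)
        return RDMA_READ_ADV;
    return RDMA_WRITE_ADV;
}
```

With the old `== CONNECTED` test, a RECV arriving before the ESTABLISHED event found the server in an earlier state and mis-routed it to RDMA_WRITE_ADV, which is the race the patch closes.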
> > Signed-off-by: Boyd Faulkner From swise at opengridcomputing.com Tue Jun 13 10:31:10 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 13 Jun 2006 12:31:10 -0500 Subject: [openib-general] [PATCH] rping: Erroneous check for minumum ping buffer size In-Reply-To: <20060610173417.GA14280@harry-potter.ibm.com> References: <20060610173417.GA14280@harry-potter.ibm.com> Message-ID: <1150219870.17394.26.camel@stevo-desktop> Thanks. Committed under revisions: trunk: r7968 iwarp branch: r7969 Steve. On Sat, 2006-06-10 at 23:04 +0530, Pradipta Kumar Banerjee wrote: > This includes the changes suggested by Tom. > > Signed-off-by: Pradipta Kumar Banerjee > --- > > Index: rping.c > ================================================================= > --- rping.org 2006-06-09 10:57:43.000000000 +0530 > +++ rping.c.new 2006-06-10 22:48:53.000000000 +0530 > @@ -96,6 +96,15 @@ struct rping_rdma_info { > #define RPING_BUFSIZE 64*1024 > #define RPING_SQ_DEPTH 16 > > +/* Default string for print data and > + * minimum buffer size > + */ > +#define _stringify( _x ) # _x > +#define stringify( _x ) _stringify( _x ) > + > +#define RPING_MSG_FMT "rdma-ping-%d: " > +#define RPING_MIN_BUFSIZE sizeof(stringify(INT_MAX)) + sizeof(RPING_MSG_FMT) > + > /* > * Control block struct. > */ > @@ -774,7 +783,7 @@ static void rping_test_client(struct rpi > cb->state = RDMA_READ_ADV; > > /* Put some ascii text in the buffer. 
*/ > - cc = sprintf(cb->start_buf, "rdma-ping-%d: ", ping); > + cc = sprintf(cb->start_buf, RPING_MSG_FMT, ping); > for (i = cc, c = start; i < cb->size; i++) { > cb->start_buf[i] = c; > c++; > @@ -977,11 +986,11 @@ int main(int argc, char *argv[]) > break; > case 'S': > cb->size = atoi(optarg); > - if ((cb->size < 1) || > + if ((cb->size < RPING_MIN_BUFSIZE) || > (cb->size > (RPING_BUFSIZE - 1))) { > fprintf(stderr, "Invalid size %d " > - "(valid range is 1 to %d)\n", > - cb->size, RPING_BUFSIZE); > + "(valid range is %d to %d)\n", > + cb->size, RPING_MIN_BUFSIZE, RPING_BUFSIZE); > ret = EINVAL; > } else > DEBUG_LOG("size %d\n", (int) atoi(optarg)); > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Tue Jun 13 10:26:49 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 13:26:49 -0400 Subject: [openib-general] [PATCH 4/4] diags: ucast routing dump file generator example - dump_lfts.sh In-Reply-To: <20060611003245.22430.93904.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003245.22430.93904.stgit@sashak.voltaire.com> Message-ID: <1150219599.570.154302.camel@hal.voltaire.com> On Sat, 2006-06-10 at 20:32, Sasha Khapyorsky wrote: > New simple script - dump_lfts.sh, may be used for ucast dump file > generation. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From rdreier at cisco.com Tue Jun 13 10:55:57 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 13 Jun 2006 10:55:57 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: <20060613051149.GE4621@mellanox.co.il> (Michael S. 
Tsirkin's message of "Tue, 13 Jun 2006 08:11:49 +0300") References: <20060613051149.GE4621@mellanox.co.il> Message-ID: Michael> Won't this let the user issue multiple modify QP commands Michael> in parallel on the same QP? mthca at least does not Michael> protect against such attempts, and doing this will Michael> confuse the hardware. Hmm, that's a good point. But I did write the following in Documentation/infiniband/core_locking.txt: All of the methods in struct ib_device exported by a low-level driver must be fully reentrant. The low-level driver is required to perform all synchronization necessary to maintain consistency, even if multiple function calls using the same object are run simultaneously. The IB midlayer does not perform any serialization of function calls. So I guess this is a bug in mthca. I think modify_srq at least has the same problem. I'll audit this and fix it up in mthca. - R. From bpradip at in.ibm.com Tue Jun 13 10:55:07 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Tue, 13 Jun 2006 23:25:07 +0530 Subject: [openib-general] [PATCH resend] libamso: fix erroneous return and memory leak in verbs.c Message-ID: <20060613175457.GA8976@harry-potter.ibm.com> Forgot to add the 'Signed-off-by' This patch fixes an erroneous return in func amso_create_cq() and a memory leak in amso_create_qp(). 
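The create_qp leak described above is the classic argument for the goto-cleanup idiom the fix applies: once an allocation has happened, every failure exit must pass through a label that frees it, while failures before the allocation can return directly. A minimal sketch with dummy types (not libamso's real structures):

```c
#include <assert.h>
#include <stdlib.h>

struct dummy_qp { int handle; };

/* Stands in for ibv_cmd_create_qp(); nonzero return models failure. */
static int create_cmd_stub(struct dummy_qp *qp, int fail)
{
    (void)qp;
    return fail;
}

static struct dummy_qp *create_qp_sketch(int fail)
{
    struct dummy_qp *qp = malloc(sizeof *qp);
    if (!qp)
        return NULL;        /* nothing allocated yet: plain return is fine */

    if (create_cmd_stub(qp, fail))
        goto err;           /* qp already allocated: must unwind */

    qp->handle = 0;
    return qp;
err:
    free(qp);
    return NULL;
}
```

The companion create_cq fix is the mirror image: there the `goto err` was wrong because nothing had been allocated yet, so a plain `return NULL` is correct.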
Signed-off-by: Pradipta Kumar Banerjee --- Index = libamso/verbs.c ============================================================================ --- verbs.org 2006-06-13 18:56:50.000000000 +0530 +++ verbs.c 2006-06-13 19:02:03.000000000 +0530 @@ -154,9 +154,8 @@ struct ibv_cq *amso_create_cq(struct ibv int ret; cq = malloc(sizeof *cq); - if (!cq) { - goto err; - } + if (!cq) + return NULL; ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector, &cq->ibv_cq, &cmd.ibv_cmd, sizeof cmd, @@ -248,14 +247,15 @@ struct ibv_qp *amso_create_qp(struct ibv ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd.ibv_cmd, sizeof cmd, &resp.ibv_resp, sizeof resp); if (ret) - return NULL; + goto err; #if 0 /* A reminder for bypass functionality */ qp->physaddr = resp.physaddr; #endif return &qp->ibv_qp; - +err: + free(qp); return NULL; } From swise at opengridcomputing.com Tue Jun 13 11:03:57 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 13 Jun 2006 13:03:57 -0500 Subject: [openib-general] [PATCH resend] libamso: fix erroneous return and memory leak in verbs.c In-Reply-To: <20060613175457.GA8976@harry-potter.ibm.com> References: <20060613175457.GA8976@harry-potter.ibm.com> Message-ID: <1150221837.17394.46.camel@stevo-desktop> On Tue, 2006-06-13 at 23:25 +0530, Pradipta Kumar Banerjee wrote: > Forgot to add the 'Signed-off-by' > > This patch fixes an erroneous return in func amso_create_cq() and a memory > leak in amso_create_qp(). > > Signed-off-by: Pradipta Kumar Banerjee > Committed revision 7971. Thanks, Steve. From sean.hefty at intel.com Tue Jun 13 11:05:23 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 13 Jun 2006 11:05:23 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <1150193430.570.138279.camel@hal.voltaire.com> Message-ID: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> >There are architected ways to do that. There's busy for MADs which could >be used for some MADs. 
For RMPP, would the transfer be ABORTed ? I don't >think you can switch to BUSY in the middle (but I'm not 100% sure). I >don't know how this limit is being used exactly, but it might be best if >the RMPP receive were treated as 1 MAD regardless of of how many >segments it was. Maybe I should back-up some here. There are a couple problems that I'm trying to solve, but the main goal is to prevent sending duplicate responses. I'd like to do this by detecting and dropping duplicate requests. To detect a duplicate request, my proposal is to move completed MADs to a "done_list". Newly received MADs would also check the done_list to determine if the MAD is a duplicate. When a user sends a response MAD, a check would be made against the done_list for a matching request that has not generated a response yet. If one is not found, then the send would be failed. Received MADs would be removed from the done_list when they are freed. My guess is that for kernel clients, the changes would probably be minimal. For usermode clients, the problem is more difficult, since we cannot trust usermode clients to generate responses correctly, and there's no free_mad call that maps to the kernel. One of the ideas then, is for the kernel umad module to learn which MADs generate responses. It would do this by updating an entry to a table whenever a response MAD is generated. A received MAD would check against the table to see if a response is supposed to be generated. If not, then the MAD would be freed after userspace claims it. If a response is expected, then the MAD would not be freed until the response was generated. Assuming minimal hard-coding of which methods are requests, a client would drop only about 1 MAD per method during start-up. Considering most requests are not sent reliably, this shouldn't be a big issue. (In fact, outside of a MultiPathRecord query, I don't believe any requests are sent reliably.) 
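The done_list idea above can be sketched as a small duplicate filter keyed on transaction ID and management class: a completed request is parked on the list, and a newly received request that matches an entry is dropped instead of processed. Everything here (the fixed-size table, the key fields, the absence of aging) is an illustrative assumption, not the eventual ib_mad implementation:

```c
#include <assert.h>
#include <stdint.h>

#define DONE_LIST_LEN 8

/* Hypothetical key; a real MAD match would also consider the source. */
struct mad_key { uint64_t tid; uint8_t mgmt_class; };

static struct mad_key done_list[DONE_LIST_LEN];
static unsigned int done_count;

static int is_duplicate(struct mad_key k)
{
    unsigned int i;

    for (i = 0; i < done_count; i++)
        if (done_list[i].tid == k.tid &&
            done_list[i].mgmt_class == k.mgmt_class)
            return 1;
    return 0;
}

/* Returns 1 if the request should be processed, 0 if dropped as a
 * duplicate. Real code would age entries out (the "older than ~20s"
 * eviction described above) instead of just capping the table. */
static int receive_request(struct mad_key k)
{
    if (is_duplicate(k))
        return 0;
    if (done_count < DONE_LIST_LEN)
        done_list[done_count++] = k;
    return 1;
}
```

The same table is what a send path would consult before emitting a response: a response whose key has no pending entry on the list would be failed, which is the "no duplicate responses" half of the proposal.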
And I would argue that even if a request has been acknowledged, the sender of the request would still need to deal with the case that no response is ever generated. If this approach were taken, then, it brings up the issue that MADs are being stored in the kernel waiting for a response. But what if a response is never generated? This problem is somewhat related to MADs being queued in the kernel, but the userspace app doesn't call down to receive them. Ideally, we could come up with a single solution to both problems, but that may not be possible. My current thoughts on how to handle requests are to time when each request MAD is received, and queue it. Once the queue is full, if another request is received, it would check the MAD at the head of the queue. If the MAD at the head was older than some selected value (say 20 seconds), it would be bumped from the queue, and the new request would be added to the tail. - Sean From rdreier at cisco.com Tue Jun 13 11:05:55 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 13 Jun 2006 11:05:55 -0700 Subject: [openib-general] [PATCH] mthca: restore missing registers In-Reply-To: <20060612135751.GB19518@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 12 Jun 2006 16:57:51 +0300") References: <20060612135751.GB19518@mellanox.co.il> Message-ID: Thanks, applied for 2.6.17 From rdreier at cisco.com Tue Jun 13 11:08:22 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 13 Jun 2006 11:08:22 -0700 Subject: [openib-general] [PATCH updated] mthca: memfree completion with error workaround In-Reply-To: <20060613141942.GW19518@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 13 Jun 2006 17:19:42 +0300") References: <20060612121635.GX7359@mellanox.co.il> <20060613141942.GW19518@mellanox.co.il> Message-ID: Yeah, I like this much more. It doesn't seem that likely that there will be another firmware bug with the same symptoms, and we have to trust some of what the hardware tells us... - R. 
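The mthca fix Roland describes comes down to the driver serializing calls on the same object itself, since core_locking.txt says the midlayer will not. A toy sketch of per-QP serialization, with a userspace pthread mutex standing in for the kernel lock (types are illustrative, not mthca's):

```c
#include <assert.h>
#include <pthread.h>

struct sketch_qp {
    pthread_mutex_t mutex;   /* per-QP: concurrent modify calls serialize here */
    int state;
};

/* Analogue of a reentrant modify_qp: two callers racing on the same QP
 * now take turns instead of issuing overlapping firmware commands. */
static int modify_qp_sketch(struct sketch_qp *qp, int new_state)
{
    pthread_mutex_lock(&qp->mutex);
    qp->state = new_state;   /* real code would post the firmware command */
    pthread_mutex_unlock(&qp->mutex);
    return 0;
}
```

A per-object lock keeps the common case cheap (different QPs never contend), which is presumably why the audit targets modify_qp and modify_srq rather than adding one device-wide lock.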
From eitan at mellanox.co.il Tue Jun 13 11:21:11 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 13 Jun 2006 21:21:11 +0300 Subject: [openib-general] [PATCH] OpenSM/SA: Properly handle non base LID requests to someSA records Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302368841@mtlexch01.mtl.com> Sure. Looks good to me Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, June 13, 2006 7:42 PM > To: openib-general at openib.org > Cc: Eitan Zahavi > Subject: [PATCH] OpenSM/SA: Properly handle non base LID requests to someSA > records > > OpenSM/SA: Properly handle non base LID requests to some SA records > > In osm_sa_node_record.c and osm_sa_portinfo_record.c, properly handle > non base LID requests per C15-0.1.11: Query responses shall contain a > port's base LID in any LID component of a RID. So when LMC is non 0, > the only records that appear are those with the base LID and not with > any masked LIDs. Furthermore, if a query comes in on a non base LID, the > LID in the RID returned is only with the base LID. > > Also, fixed some endian issues in osm_log messages. > > Note: Similar patch for other affected SA records will follow. 
> > Signed-off-by: Hal Rosenstock > > Index: opensm/osm_sa_node_record.c > =================================================================== > --- opensm/osm_sa_node_record.c (revision 7961) > +++ opensm/osm_sa_node_record.c (working copy) > @@ -200,12 +200,11 @@ __osm_nr_rcv_create_nr( > uint8_t port_num; > uint8_t num_ports; > uint16_t match_lid_ho; > - uint16_t lid_ho; > + ib_net16_t base_lid; > ib_net16_t base_lid_ho; > ib_net16_t max_lid_ho; > uint8_t lmc; > ib_net64_t port_guid; > - ib_api_status_t status; > > OSM_LOG_ENTER( p_rcv->p_log, __osm_nr_rcv_create_nr ); > > @@ -245,7 +244,8 @@ __osm_nr_rcv_create_nr( > if( match_port_guid && ( port_guid != match_port_guid ) ) > continue; > > - base_lid_ho = cl_ntoh16( osm_physp_get_base_lid( p_physp ) ); > + base_lid = osm_physp_get_base_lid( p_physp ); > + base_lid_ho = cl_ntoh16( base_lid ); > lmc = osm_physp_get_lmc( p_physp ); > max_lid_ho = (uint16_t)( base_lid_ho + (1 << lmc) - 1 ); > match_lid_ho = cl_ntoh16( match_lid ); > @@ -260,29 +260,18 @@ __osm_nr_rcv_create_nr( > osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_nr_rcv_create_nr: " > "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", > - cl_ntoh16( base_lid_ho ), > - cl_ntoh16( match_lid_ho ), > - cl_ntoh16( max_lid_ho ) > + base_lid_ho, match_lid_ho, max_lid_ho > ); > } > > if( (match_lid_ho <= max_lid_ho) && (match_lid_ho >= base_lid_ho) ) > { > - __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, port_guid, match_lid ); > + __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, port_guid, base_lid ); > } > } > else > { > - /* > - For every lid value create a Node Record. 
> - */ > - for( lid_ho = base_lid_ho; lid_ho <= max_lid_ho; lid_ho++ ) > - { > - status = __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, > - port_guid, cl_hton16( lid_ho ) ); > - if( status != IB_SUCCESS ) > - break; > - } > + __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, port_guid, base_lid ); > } > } > > Index: opensm/osm_sa_portinfo_record.c > =================================================================== > --- opensm/osm_sa_portinfo_record.c (revision 7961) > +++ opensm/osm_sa_portinfo_record.c (working copy) > @@ -194,9 +194,9 @@ __osm_sa_pir_create( > IN osm_pir_search_ctxt_t* const p_ctxt ) > { > uint8_t lmc; > - uint16_t lid_ho; > uint16_t max_lid_ho; > uint16_t base_lid_ho; > + uint16_t match_lid_ho; > > OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_pir_create ); > > @@ -218,17 +218,28 @@ __osm_sa_pir_create( > > if( p_ctxt->comp_mask & IB_PIR_COMPMASK_LID ) > { > - __osm_pir_rcv_new_pir( p_rcv, p_physp, p_ctxt->p_list, > - p_ctxt->p_rcvd_rec->lid ); > - } > - else > - { > - for( lid_ho = base_lid_ho; lid_ho <= max_lid_ho; lid_ho++ ) > + match_lid_ho = cl_ntoh16( p_ctxt->p_rcvd_rec->lid ); > + > + /* > + We validate that the lid belongs to this node. 
> + */ > + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > { > - __osm_pir_rcv_new_pir( p_rcv, p_physp, p_ctxt->p_list, > - cl_hton16( lid_ho ) ); > + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_sa_pir_create: " > + "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", > + base_lid_ho, match_lid_ho, max_lid_ho > + ); > } > + > + if ( match_lid_ho < base_lid_ho || match_lid_ho > max_lid_ho ) > + goto Exit; > } > + > + __osm_pir_rcv_new_pir( p_rcv, p_physp, p_ctxt->p_list, > + cl_hton16( base_lid_ho ) ); > + > + Exit: > OSM_LOG_EXIT( p_rcv->p_log ); > } > > From rdreier at cisco.com Tue Jun 13 11:19:13 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 13 Jun 2006 11:19:13 -0700 Subject: [openib-general] [git pull] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This has a couple of mthca driver bug fixes: Michael S. 
Tsirkin: IB/mthca: restore missing PCI registers after reset IB/mthca: memfree completion with error FW bug workaround drivers/infiniband/hw/mthca/mthca_cq.c | 11 +++++ drivers/infiniband/hw/mthca/mthca_reset.c | 59 +++++++++++++++++++++++++++++ 2 files changed, 69 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index 205854e..87a8f11 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -540,8 +540,17 @@ static inline int mthca_poll_one(struct entry->wr_id = srq->wrid[wqe_index]; mthca_free_srq_wqe(srq, wqe); } else { + s32 wqe; wq = &(*cur_qp)->rq; - wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + wqe = be32_to_cpu(cqe->wqe); + wqe_index = wqe >> wq->wqe_shift; + /* + * WQE addr == base - 1 might be reported in receive completion + * with error instead of (rq size - 1) by Sinai FW 1.0.800 and + * Arbel FW 5.1.400. This bug should be fixed in later FW revs. + */ + if (unlikely(wqe_index < 0)) + wqe_index = wq->max - 1; entry->wr_id = (*cur_qp)->wrid[wqe_index]; } diff --git a/drivers/infiniband/hw/mthca/mthca_reset.c b/drivers/infiniband/hw/mthca/mthca_reset.c index df5e494..f4fddd5 100644 --- a/drivers/infiniband/hw/mthca/mthca_reset.c +++ b/drivers/infiniband/hw/mthca/mthca_reset.c @@ -49,6 +49,12 @@ int mthca_reset(struct mthca_dev *mdev) u32 *hca_header = NULL; u32 *bridge_header = NULL; struct pci_dev *bridge = NULL; + int bridge_pcix_cap = 0; + int hca_pcie_cap = 0; + int hca_pcix_cap = 0; + + u16 devctl; + u16 linkctl; #define MTHCA_RESET_OFFSET 0xf0010 #define MTHCA_RESET_VALUE swab32(1) @@ -110,6 +116,9 @@ #define MTHCA_RESET_VALUE swab32(1) } } + hca_pcix_cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + hca_pcie_cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (bridge) { bridge_header = kmalloc(256, GFP_KERNEL); if (!bridge_header) { @@ -129,6 +138,13 @@ #define MTHCA_RESET_VALUE swab32(1) goto out; } } + 
bridge_pcix_cap = pci_find_capability(bridge, PCI_CAP_ID_PCIX); + if (!bridge_pcix_cap) { + err = -ENODEV; + mthca_err(mdev, "Couldn't locate HCA bridge " + "PCI-X capability, aborting.\n"); + goto out; + } } /* actually hit reset */ @@ -178,6 +194,20 @@ #define MTHCA_RESET_VALUE swab32(1) good: /* Now restore the PCI headers */ if (bridge) { + if (pci_write_config_dword(bridge, bridge_pcix_cap + 0x8, + bridge_header[(bridge_pcix_cap + 0x8) / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge Upstream " + "split transaction control, aborting.\n"); + goto out; + } + if (pci_write_config_dword(bridge, bridge_pcix_cap + 0xc, + bridge_header[(bridge_pcix_cap + 0xc) / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge Downstream " + "split transaction control, aborting.\n"); + goto out; + } /* * Bridge control register is at 0x3e, so we'll * naturally restore it last in this loop. @@ -203,6 +233,35 @@ good: } } + if (hca_pcix_cap) { + if (pci_write_config_dword(mdev->pdev, hca_pcix_cap, + hca_header[hca_pcix_cap / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA PCI-X " + "command register, aborting.\n"); + goto out; + } + } + + if (hca_pcie_cap) { + devctl = hca_header[(hca_pcie_cap + PCI_EXP_DEVCTL) / 4]; + if (pci_write_config_word(mdev->pdev, hca_pcie_cap + PCI_EXP_DEVCTL, + devctl)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA PCI Express " + "Device Control register, aborting.\n"); + goto out; + } + linkctl = hca_header[(hca_pcie_cap + PCI_EXP_LNKCTL) / 4]; + if (pci_write_config_word(mdev->pdev, hca_pcie_cap + PCI_EXP_LNKCTL, + linkctl)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA PCI Express " + "Link control register, aborting.\n"); + goto out; + } + } + for (i = 0; i < 16; ++i) { if (i * 4 == PCI_COMMAND) continue; From rjwalsh at pathscale.com Tue Jun 13 11:25:39 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Tue, 13 Jun 2006 11:25:39 -0700 Subject: [openib-general] 
[RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: References: <20060613051149.GE4621@mellanox.co.il> Message-ID: <1150223140.11881.2.camel@hematite.internal.keyresearch.com> On Tue, 2006-06-13 at 10:55 -0700, Roland Dreier wrote: > Michael> Won't this let the user issue multiple modify QP commands > Michael> in parallel on the same QP? mthca at least does not > Michael> protect against such attempts, and doing this will > Michael> confuse the hardware. > > Hmm, that's a good point. But I did write the following in > Documentation/infiniband/core_locking.txt: > > All of the methods in struct ib_device exported by a low-level > driver must be fully reentrant. The low-level driver is required to > perform all synchronization necessary to maintain consistency, even > if multiple function calls using the same object are run > simultaneously. > > The IB midlayer does not perform any serialization of function calls. > > So I guess this is a bug in mthca. We have a similar problem in resource checking - we were relying on the idr lock to keep us safe. I'll fix that up, too. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Tue Jun 13 11:32:54 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 14:32:54 -0400 Subject: [openib-general] [PATCH 1/4] Simplification of the ucast fdb dumps. 
In-Reply-To: <20060611003238.22430.62423.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003238.22430.62423.stgit@sashak.voltaire.com> Message-ID: <1150223563.570.156637.camel@hal.voltaire.com> On Sat, 2006-06-10 at 20:32, Sasha Khapyorsky wrote: > This separates the dump procedure from rest of the flow and prevents > multiple fopen()/fclose() (one pair per switch) - one fopen() and one > fclose() instead. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied (with some cosmetic changes). -- Hal From sashak at voltaire.com Tue Jun 13 13:00:35 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 13 Jun 2006 23:00:35 +0300 Subject: [openib-general] [PATCH 0/4] opensm: Loading unicast routes from the file In-Reply-To: <20060613170246.GH23320@durango.c3.lanl.gov> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060613170246.GH23320@durango.c3.lanl.gov> Message-ID: <20060613200035.GG10482@sashak.voltaire.com> Hi Greg, On 11:02 Tue 13 Jun , Greg Johnson wrote: > On Sun, Jun 11, 2006 at 03:27:58AM +0300, Sasha Khapyorsky wrote: > > Hi, > > > > There are couple of unicast routing related patches for OpenSM. > > > > Basically it implements routing module which provides possibility to load > > switch forwarding tables from pre-created dump file. Currently unicast > > tables loading is only supported, multicast may be added in a future. > > > > Short patch descriptions (more details may be found in emails with > > patches): > > > > 1. Ucast dump file simplification. > > 2. Modular routing - preliminary implements generic model to plug new > > routing engine to OpenSM. > > 3. New simple unicast routing engine which allows to load LFTs from > > pre-created dump file. > > 4. Example of ucast dump generation script. > > > > Please comment and test. Thanks. > > We tried this on our 256-node cluster with a single chassis Voltaire > 288-port switch. Thanks. 
> It seems to load the routes generated by the dump > script, but afterward it is not possible to dump the routes again. This means you have broken LFTs now. Probably I know what is going on here - new LFTs don't have " 0" entries, and switches are not accessible by LIDs anymore. Please update 'ibroute' utility (diags/) from the trunk and recreate the dump file - this should fix the problem. (Sorry, I forgot to mention 'ibroute' upgrade issue in patch announcement). > I > would like to re-dump the routes after loading to ensure that they were > loaded correctly. > > After loading routes with "opensm -R file -U dump_file", dump_lfts.sh > gives: > > nodeinfo > 0000 0000 0000 0000 0000 0000 0000 0000 > 0000 0000 0000 0000 0000 0000 0000 0000 > 0000 0000 0000 0000 0000 0000 0000 0000 > 0000 0000 0000 0000 0000 0000 0000 0000 > ibroute: iberror: dump tables failed: node info failed: valid addr? > > for each switch. > > Also, I had to delete a space in the sed script on line 17 of > dump_lfts.sh: > > sed -ne 's/^.* lid \([1-9a-f]*\) .*$/\1/p' > > became > > sed -ne 's/^.* lid \([1-9a-f]*\).*$/\1/p' I see. I've used ibswitches/ibnetdiscover from the trunk, there is some minor difference in the output (' lmc N' was added). I think with your change the script will work with both old and new outputs. Thanks for the fix. > Thanks for the work! Thanks for trying this. Sasha From swise at opengridcomputing.com Tue Jun 13 13:34:31 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 13 Jun 2006 15:34:31 -0500 Subject: [openib-general] [PATCH v2 1/2] iWARP Connection Manager. 
In-Reply-To: <1150127698.22704.9.camel@trinity.ogc.int> References: <20060607200600.9003.56328.stgit@stevo-desktop> <20060607200605.9003.25830.stgit@stevo-desktop> <20060608005452.087b34db.akpm@osdl.org> <1150127698.22704.9.camel@trinity.ogc.int> Message-ID: <1150230871.17394.68.camel@stevo-desktop> > > > +static void cm_event_handler(struct iw_cm_id *cm_id, > > > + struct iw_cm_event *iw_event) > > > +{ > > > + struct iwcm_work *work; > > > + struct iwcm_id_private *cm_id_priv; > > > + unsigned long flags; > > > + > > > + work = kmalloc(sizeof(*work), GFP_ATOMIC); > > > + if (!work) > > > + return; > > > > This allocation _will_ fail sometimes. The driver must recover from it. > > Will it do so? > > Er...no. It will lose this event. Depending on the event...the carnage > varies. We'll take a look at this. > This behavior is consistent with the Infiniband CM (see drivers/infiniband/core/cm.c function cm_recv_handler()). But I think we should at least log an error because a lost event will usually stall the rdma connection. > > > > > +EXPORT_SYMBOL(iw_cm_init_qp_attr); > > > > This file exports a ton of symbols. It's usual to provide some justifying > > commentary in the changelog when this happens. > > This module is a logical instance of the xx_cm where xx is the transport > type. I think there is some discussion warranted on whether or not these > should all be built into and exported by rdma_cm. One rationale would be > that the rdma_cm is the only client for many of these functions (this > being a particularly good example) and doing so would reduce the export > count. Others would be reasonably needed for any application (connect, > etc...) > Transport-dependent ULPs, in theory, are able to use the transport-specific CM directly if they don't wish to use the RDMA CM. I think that's the rationale for have the xx_cm modules seperate from the rdma_cm module and exporting the various functions. 
> All that said, we'll be sure to document the exported symbols in a > follow-up patch. > I'll add commentary explaining this. Steve. From sashak at voltaire.com Tue Jun 13 13:39:58 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 13 Jun 2006 23:39:58 +0300 Subject: [openib-general] [PATCH 0/4] opensm: Loading unicast routes from the file In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30236881F@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30236881F@mtlexch01.mtl.com> Message-ID: <20060613203958.GI10482@sashak.voltaire.com> Hi Eitan, On 09:36 Sun 11 Jun , Eitan Zahavi wrote: > Hi Sasha, > > General comments: > 1. I hope the change in osm.fdbs is not going to break the parser in > ibdm:Fabric.cpp - The file format was not changed, I don't expect brokenness. > was it really necessary change? Yes, in order to create unified osm.fdbs with any routing engine. > or just nice to have ? This is the nice side effect. > 2. The modular routing is a great idea. From my first glance it seems > that it assumes calculation of min-hop-tables is common to all routing > engines. Yes and no. Currently the min-hop-tables are used with multicast, so it is common code. But I expect this will be different in the future (for instance extend this loader to handle multicast tables too). > I think it should be a callback provided by the engine too. Yes, when it will be useful. > Please note that the Min-Hop engine takes most of the routing time so in > the future if we could avoid that stage it would be even better. Agree. Thanks for the comments. Sasha > [EZ] We should start thinking about testing of this new feature too. > > Further comment on the patches themselves. > > > There are couple of unicast routing related patches for OpenSM. > > > > Basically it implements routing module which provides possibility to > load > > switch forwarding tables from pre-created dump file. 
Currently unicast > > tables loading is only supported, multicast may be added in a future. > > > > Short patch descriptions (more details may be found in emails with > > patches): > > > > 1. Ucast dump file simplification. > > 2. Modular routing - preliminary implements generic model to plug new > > routing engine to OpenSM. > > 3. New simple unicast routing engine which allows to load LFTs from > > pre-created dump file. > > 4. Example of ucast dump generation script. > > > > Please comment and test. Thanks. > > > > Sasha From betsy at pathscale.com Tue Jun 13 13:44:16 2006 From: betsy at pathscale.com (Betsy Zeller) Date: Tue, 13 Jun 2006 13:44:16 -0700 Subject: [openib-general] OFED 1.0 release schedule In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA71E5@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA71E5@mtlexch01.mtl.com> Message-ID: <1150231456.3034.219.camel@sarium.pathscale.com> Tziporet - this plan makes sense. We'll let you know how the testing goes. BTW, for some reason, if you click on the URL you sent out, it just hangs but if you type it in, it works. Not sure why. Thanks, Betsy On Tue, 2006-06-13 at 16:07 +0300, Tziporet Koren wrote: > Hi All, > > > > After reading the mail thread regarding OFED release I have decided > this: > > > > We upload OFED-1.0-pre1.tgz to > https://openib.org/svn/gen2/branches/1.0/ofed/releases/ > > > > We checked that all modules compile and loaded on this build > (including ipath and uDAPL) > > The only missing parts of this release from the final release are the > documents, and the scripts rpm that Scott requested. > > > > I think testing this version 3 days (Tuesday, Wednesday and Thursday) > should be enough as Scott wrote. > > So – we can do the official OFED 1.0 release on Friday 16-June. > > > > Matt – please check with Novel if this date is acceptable by them. > > > > If not then the earliest we can do the release if Thursday 15-June. 
> Tziporet Koren
> Software Director
> Mellanox Technologies
> mailto: tziporet at mellanox.co.il
> Tel +972-4-9097200, ext 380

From halr at voltaire.com Tue Jun 13 13:39:01 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Jun 2006 16:39:01 -0400
Subject: [openib-general] [PATCH] OpenSM/modular-routing.txt: Add description of modular routing
Message-ID: <1150231129.570.161009.camel@hal.voltaire.com>

OpenSM/doc/modular_routing.txt: Add description of modular routing

Signed-off-by: Hal Rosenstock

Index: osm/doc/modular-routing.txt
===================================================================
--- osm/doc/modular-routing.txt (revision 0)
+++ osm/doc/modular-routing.txt (revision 0)
@@ -0,0 +1,53 @@
+Modular routing engine structure has been added to allow
+for ease of "plugging" in a new routing module.
+
+Currently, only unicast callbacks are supported. Multicast
+can be added later.
+
+An existing routing module is up-down "updn", which may be
+activated with the '-R updn' option (instead of the old '-u').
+
+General usage is:
+$ opensm -R 'module-name'
+
+There is also a trivial routing module which is able
+to load LFT tables from a dump file.
+
+Main features:
+
+- support for unicast LFTs only, support for multicast can be added later
+- this will run after min hop matrix calculation
+- this will load switch LFTs according to the path entries introduced in
+  the dump file
+- no additional checks will be performed (like is the port connected, etc.)
+- in case fabric LIDs were changed, this will try to reconstruct LFTs
+  correctly if endport GUIDs are represented in the dump file (in order
+  to disable this, GUIDs may be removed from the dump file or zeroed)
+
+The dump file format is compatible with the output of the 'ibroute' util, and
+for the whole fabric may be generated with a script like this:
+
+  for sw_lid in `ibswitches | awk '{print $NF}'` ; do
+    ibroute $sw_lid
+  done > /path/to/dump_file
+
+, or using DR paths:
+
+  for sw_dr in `ibnetdiscover -v \
+      | sed -ne '/^DR path .* switch /s/^DR path \[\(.*\)\].*$/\1/p' \
+      | sed -e 's/\]\[/,/g' \
+      | sort -u` ; do
+    ibroute -D ${sw_dr}
+  done > /path/to/dump_file
+
+In order to activate the new module use:
+
+  opensm -R file -U /path/to/dump_file
+
+NOTE: ibroute has been updated to support this (for switch management ports).
+Also, lmc was added to switch management ports. ibroute needs to be 7855 or
+later from the trunk.

From halr at voltaire.com Tue Jun 13 14:15:44 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Jun 2006 17:15:44 -0400
Subject: [openib-general] RFC: detecting duplicate MAD requests
In-Reply-To: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com>
References: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com>
Message-ID: <1150233334.570.162326.camel@hal.voltaire.com>

On Tue, 2006-06-13 at 14:05, Sean Hefty wrote:
> >There are architected ways to do that. There's busy for MADs which could
> >be used for some MADs. For RMPP, would the transfer be ABORTed ? I don't
> >think you can switch to BUSY in the middle (but I'm not 100% sure).
I > >don't know how this limit is being used exactly, but it might be best if > >the RMPP receive were treated as 1 MAD regardless of of how many > >segments it was. > > Maybe I should back-up some here. There are a couple problems that I'm trying > to solve, but the main goal is to prevent sending duplicate responses. I'd like > to do this by detecting and dropping duplicate requests. > > To detect a duplicate request, my proposal is to move completed MADs to a > "done_list". Newly received MADs would also check the done_list to determine if > the MAD is a duplicate. When a user sends a response MAD, a check would be made > against the done_list for a matching request that has not generated a response > yet. If one is not found, then the send would be failed. > > Received MADs would be removed from the done_list when they are freed. My guess > is that for kernel clients, the changes would probably be minimal. For usermode > clients, the problem is more difficult, since we cannot trust usermode clients > to generate responses correctly, and there's no free_mad call that maps to the > kernel. > > One of the ideas then, is for the kernel umad module to learn which MADs > generate responses. It would do this by updating an entry to a table whenever a > response MAD is generated. A received MAD would check against the table to see > if a response is supposed to be generated. If not, then the MAD would be freed > after userspace claims it. If a response is expected, then the MAD would not be > freed until the response was generated. > > Assuming minimal hard-coding of which methods are requests, a client would drop > only about 1 MAD per method during start-up. Is this only the new methods which are not hard coded ? Would this invoke a timeout (and hopefully retry) ? > Considering most requests are not > sent reliably, this shouldn't be a big issue. (In fact, outside of a > MultiPathRecord query, I don't believe any requests are sent reliably.) 
If you mean sent via RMPP, then yes, only GetMulti is sent this way. > And I > would argue that even if a request has been acknowledged, the sender of the > request would still need to deal with the case that no response is ever > generated. Are you referring to a request being acknowledged but the response is not sent (yet) ? > If this approach were taken, then, it brings up the issue that MADs are being > stored in the kernel waiting for a response. But what if a response is never > generated? This problem is somewhat related to MADs being queued in the kernel, > but the userspace app doesn't call down to receive them. Ideally, we could come > up with a single solution to both problems, but that may not be possible. > > My current thoughts on how to handle requests are to time when each request MAD > is received, and queue it. Once the queue is full, if another request is > received, it would check the MAD at the head of the queue. If the MAD at the > head was older than some selected value (say 20 seconds), it would be bumped > from the queue, and the new request would be added to the tail. For RMPP, this time should start when the last segment is received. Is that how you would envision it working ? I'm also not sure what the right timeout value would be for this. Where did 20 seconds come from ? -- Hal > - Sean From sashak at voltaire.com Tue Jun 13 14:36:06 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 Jun 2006 00:36:06 +0300 Subject: [openib-general] [PATCH 3/4] New routing module which loads LFT tables from dump file. 
In-Reply-To: <448EA1DD.7090204@mellanox.co.il> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003243.22430.56582.stgit@sashak.voltaire.com> <448EA1DD.7090204@mellanox.co.il> Message-ID: <20060613213606.GJ10482@sashak.voltaire.com> Hi Eitan, On 14:30 Tue 13 Jun , Eitan Zahavi wrote: > Hi Sasha, > > Please see my comments inside > > Sasha Khapyorsky wrote: > >This patch implements trivial routing module which able to load LFT > >tables from dump file. Main features: > >- support for unicast LFTs only, support for multicast can be added later > >- this will run after min hop matrix calculation > >- this will load switch LFTs according to the path entries introduced in > > the dump file > >- no additional checks will be performed (like is port connected, etc) > >- in case when fabric LIDs were changed this will try to reconstruct LFTs > > correctly if endport GUIDs are represented in the dump file (in order > > to disable this GUIDs may be removed from the dump file or zeroed) > I think you cold use the concept of directed routes for storing the LIDs > too. Maybe. But there is one disadvantage - such dump file will be node dependent, we will not be able to generate it on one node and load on another. Anyway the goal of LID/GUID checking is to provide minimal fixing for trivial case and not to limit the subnet administrator in what He/She wants to do. > So in case of new LID assignments you can extract the old -> new mapping by > scanning the LIDs of end ports by their DR path. I do it with GUID. > Anyway, I think it is required that you also perform topology matching such > that > if someone changed the topology you are able to figure it out and stop. > THIS IS A SERIOUS LIMITATION OF YOUR PROPOSAL. I think this is limitation of the subnet administrator's choice - one may want to create LFT with entries for yet not connected nodes. 
If you are about more "safe" dump loader, this may be done (and the code may be reused), but I think this should be different routing method. > >The dump file format is compatible with output of 'ibroute' util and for > >whole fabric may be generated with script like this: > > > > for sw_lid in `ibswitches | awk '{print $NF}'` ; do > > ibroute $sw_lid > > done > /path/to/dump_file > > > >, or using DR paths: > > > > > > for sw_dr in `ibnetdiscover -v \ > > | sed -ne '/^DR path .* switch /s/^DR path > > \[\(.*\)\].*$/\1/p' \ > > | sed -e 's/\]\[/,/g' \ > > | sort -u` ; do > > ibroute -D ${sw_dr} > > done > /path/to/dump_file > WE SHOULD ALSO PROVIDE A DUMP FILE VIA: > 1. OpenSM should dump its routes using this format (like it does today > using osm.fdbs) In this way you may generate dump with LFTs created only by OpenSM (and not by other SMs). This is unnecessary limitation for primary method. However I agree that as additional method this may be good and useful. Please feel free to provide the path for this. > 2. ibdiagnet Ditto > > > > > > > >diff --git a/osm/include/opensm/osm_subnet.h > >b/osm/include/opensm/osm_subnet.h > >index a637367..ec1d056 100644 > >--- a/osm/include/opensm/osm_subnet.h > >+++ b/osm/include/opensm/osm_subnet.h > >@@ -423,6 +424,10 @@ typedef struct _osm_subn_opt > > * routing_engine_name > > * Name of used routing engine (other than default Min Hop Algorithm) > > * > >+* ucast_dump_file > >+* Name of the unicast routing dump file from where switch > >+* forwearding tables will be loaded > ^^^^^^^^^^^ > forwarding Thanks. Will fix. > >+ "cannot parse port guid " > >+ "(maybe broken dump): " > >+ "\'%s\'\n", p); > >+ port_guid = 0; > >+ } > >+ } > >+ port_guid = cl_hton64(port_guid); > >+ add_path(p_osm, p_sw, lid, port_num, port_guid); > >+ } > >+ } > >+ > >+ fclose(file); > >+ return 0; > >+} > In OpenSM we write with style: > if () { > } > else if () > { > } > else > { > } > > Not any other combination Really? 
Don't want to bother with examples, but I may see almost any "combination" in OpenSM and it is not clear for me which one is common (the coding style and identation are different even from file to file). Thanks for comments. Sasha From sean.hefty at intel.com Tue Jun 13 14:36:46 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 13 Jun 2006 14:36:46 -0700 Subject: [openib-general] [PATCH v2 1/2] iWARP Connection Manager. In-Reply-To: <1150230871.17394.68.camel@stevo-desktop> Message-ID: <000001c68f31$78910fe0$24268686@amr.corp.intel.com> >> Er...no. It will lose this event. Depending on the event...the carnage >> varies. We'll take a look at this. >> > >This behavior is consistent with the Infiniband CM (see >drivers/infiniband/core/cm.c function cm_recv_handler()). But I think >we should at least log an error because a lost event will usually stall >the rdma connection. I believe that there's a difference here. For the Infiniband CM, an allocation error behaves the same as if the received MAD were lost or dropped. Since MADs are unreliable anyway, it's not so much that an IB CM event gets lost, as it doesn't ever occur. A remote CM should retry the send, which hopefully allows the connection to make forward progress. - Sean From swise at opengridcomputing.com Tue Jun 13 14:46:36 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 13 Jun 2006 16:46:36 -0500 Subject: [openib-general] [PATCH v2 1/2] iWARP Connection Manager. In-Reply-To: <000001c68f31$78910fe0$24268686@amr.corp.intel.com> References: <000001c68f31$78910fe0$24268686@amr.corp.intel.com> Message-ID: <1150235196.17394.91.camel@stevo-desktop> On Tue, 2006-06-13 at 14:36 -0700, Sean Hefty wrote: > >> Er...no. It will lose this event. Depending on the event...the carnage > >> varies. We'll take a look at this. > >> > > > >This behavior is consistent with the Infiniband CM (see > >drivers/infiniband/core/cm.c function cm_recv_handler()). 
But I think > >we should at least log an error because a lost event will usually stall > >the rdma connection. > > I believe that there's a difference here. For the Infiniband CM, an allocation > error behaves the same as if the received MAD were lost or dropped. Since MADs > are unreliable anyway, it's not so much that an IB CM event gets lost, as it > doesn't ever occur. A remote CM should retry the send, which hopefully allows > the connection to make forward progress. > hmm. Ok. I see. I misunderstood the code in cm_recv_handler(). Tom and I have been talking about what we can do to not drop the event. Stay tuned. Steve. From sean.hefty at intel.com Tue Jun 13 14:58:33 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 13 Jun 2006 14:58:33 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <1150233334.570.162326.camel@hal.voltaire.com> Message-ID: <000201c68f34$83c2c950$24268686@amr.corp.intel.com> >> Assuming minimal hard-coding of which methods are requests, a client would >drop >> only about 1 MAD per method during start-up. > >Is this only the new methods which are not hard coded ? Would this >invoke a timeout (and hopefully retry) ? We can hard-code existing methods to avoid this problem. So only unknown methods would be affected, which would affect user-defined classes more than the existing classes. In most cases, I would expect the sender to timeout and retry the request, which hopefully comes after the request table has been updated. >> And I >> would argue that even if a request has been acknowledged, the sender of the >> request would still need to deal with the case that no response is ever >> generated. > >Are you referring to a request being acknowledged but the response is >not sent (yet) ? Yes. >> My current thoughts on how to handle requests are to time when each request >MAD >> is received, and queue it. 
Once the queue is full, if another request is >> received, it would check the MAD at the head of the queue. If the MAD at the >> head was older than some selected value (say 20 seconds), it would be bumped >> from the queue, and the new request would be added to the tail. > >For RMPP, this time should start when the last segment is received. Is >that how you would envision it working ? Correct. Part of the motivation here is if a client cannot or will not generate a response for some reason, we don't want to keep the MAD hanging around forever. >I'm also not sure what the right timeout value would be for this. Where >did 20 seconds come from ? I just made that up. Something like this would probably have to be adaptable, and would likely depend on the size of the fabric. In most cases, I would guess that a timeout indicates some sort of error in the client, so I would tend towards a larger timeout. - Sean From halr at voltaire.com Tue Jun 13 15:26:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Jun 2006 18:26:34 -0400 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <000201c68f34$83c2c950$24268686@amr.corp.intel.com> References: <000201c68f34$83c2c950$24268686@amr.corp.intel.com> Message-ID: <1150237590.570.164947.camel@hal.voltaire.com> On Tue, 2006-06-13 at 17:58, Sean Hefty wrote: > >> Assuming minimal hard-coding of which methods are requests, a client would > >drop > >> only about 1 MAD per method during start-up. > > > >Is this only the new methods which are not hard coded ? Would this > >invoke a timeout (and hopefully retry) ? > > We can hard-code existing methods to avoid this problem. So only unknown > methods would be affected, which would affect user-defined classes more than the > existing classes. I would expect vendor classes to follow the standard methods unless they need something different. 
> In most cases, I would expect the sender to timeout and retry the request, which > hopefully comes after the request table has been updated. > > >> And I > >> would argue that even if a request has been acknowledged, the sender of the > >> request would still need to deal with the case that no response is ever > >> generated. > > > >Are you referring to a request being acknowledged but the response is > >not sent (yet) ? > > Yes. > > >> My current thoughts on how to handle requests are to time when each request > >MAD > >> is received, and queue it. Once the queue is full, if another request is > >> received, it would check the MAD at the head of the queue. If the MAD at the > >> head was older than some selected value (say 20 seconds), it would be bumped > >> from the queue, and the new request would be added to the tail. > > > >For RMPP, this time should start when the last segment is received. Is > >that how you would envision it working ? > > Correct. Part of the motivation here is if a client cannot or will not generate > a response for some reason, we don't want to keep the MAD hanging around > forever. > > >I'm also not sure what the right timeout value would be for this. Where > >did 20 seconds come from ? > > I just made that up. Something like this would probably have to be adaptable, > and would likely depend on the size of the fabric. In most cases, I would guess > that a timeout indicates some sort of error in the client, so I would tend > towards a larger timeout. Is the only downside of a larger timeout that potentially more memory accumulates (until the timeout occurs) before it is freed ? 
-- Hal > - Sean From robert.j.woodruff at intel.com Tue Jun 13 16:02:58 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 13 Jun 2006 16:02:58 -0700 Subject: [openib-general] OFED 1.0 release schedule Message-ID: <1AC79F16F5C5284499BB9591B33D6F0007F7377F@orsmsx408> Tziporet wrote, >We upload OFED-1.0-pre1.tgz to > https://openib.org/svn/gen2/branches/1.0/ofed/releases/ > I tried the new tar ball and the pathscale driver now compiles (on Redhat EL4 - U3) and IPoIB and OpenSM appear to work OK, but Intel MPI/uDAPL and NetPipe/uDAPL are broken. It appears to be a problem with rdma operations. I also tried SDP/pathscale and it does not work either. Finally, the rdma_cm is missing the changes that match the uDAPL fix that was put in for the new setops for the CM timeouts. Arlin will provide specifics. We'd really like the rdma_cm fix in the release. woody -----Original Message----- From: Betsy Zeller [mailto:betsy at pathscale.com] Sent: Tuesday, June 13, 2006 1:44 PM To: Tziporet Koren Cc: Matt L. Leininger; Scott Weitzenkamp (sweitzen); Matters, Todd; Moni Levy; Woodruff, Robert J; openib; OpenFabricsEWG Subject: Re: OFED 1.0 release schedule Tziporet - this plan makes sense. We'll let you know how the testing goes. BTW, for some reason, if you click on the URL you sent out, it just hangs, but if you type it in, it works. Not sure why. Thanks, Betsy On Tue, 2006-06-13 at 16:07 +0300, Tziporet Koren wrote: > Hi All, > > > > After reading the mail thread regarding the OFED release I have decided > this: > > > > We upload OFED-1.0-pre1.tgz to > https://openib.org/svn/gen2/branches/1.0/ofed/releases/ > > > > We checked that all modules compile and load on this build > (including ipath and uDAPL) > > The only missing parts of this release from the final release are the > documents, and the scripts rpm that Scott requested. > > > > I think testing this version for 3 days (Tuesday, Wednesday and Thursday) > should be enough, as Scott wrote.
> > So - we can do the official OFED 1.0 release on Friday 16-June. > > > > Matt - please check with Novell if this date is acceptable by them. > > > > If not, then the earliest we can do the release is Thursday 15-June. > > > > > > Tziporet Koren > > Software Director > > Mellanox Technologies > > mailto: tziporet at mellanox.co.il > Tel +972-4-9097200, ext 380 > > > > From ardavis at ichips.intel.com Tue Jun 13 16:07:26 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 13 Jun 2006 16:07:26 -0700 Subject: [openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA719C@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA719C@mtlexch01.mtl.com> Message-ID: <448F452E.3090606@ichips.intel.com> Tziporet Koren wrote: >Jack put the bug fix into OFED 1.0. > >Tziporet > > Great. Did the CMA module (SVN 7742) changes also get in? If not, uDAPL is out of sync with CMA and will not work. -arlin From sashak at voltaire.com Tue Jun 13 16:20:47 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 Jun 2006 02:20:47 +0300 Subject: [openib-general] [PATCH 2/4] Modular routing engine (unicast only yet). In-Reply-To: <448EA7A1.8060206@mellanox.co.il> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003240.22430.88414.stgit@sashak.voltaire.com> <448EA7A1.8060206@mellanox.co.il> Message-ID: <20060613232047.GO10482@sashak.voltaire.com> Hi Eitan, On 14:55 Tue 13 Jun , Eitan Zahavi wrote: > > As provided in my previous patch 1/4 comments > I think the callbacks should also have an entry for the MinHop stage (maybe > this is the ucast_build_fwd_tables?) I have some algorithms in mind that > will > skip that stage altogether. We may add a new callback when it becomes useful. > Also it might make sense for each routing engine to provide its own "dump" > routine such that each could support a different file format if needed.
Why would we want a dump format per routing engine? Even if we do, you could put it into routing-engine-specific code. > > Rest of the comments are inline > > EZ > > Sasha Khapyorsky wrote: > > > >diff --git a/osm/include/opensm/osm_opensm.h > >b/osm/include/opensm/osm_opensm.h > >index 3235ad4..3e6e120 100644 > >--- a/osm/include/opensm/osm_opensm.h > >+++ b/osm/include/opensm/osm_opensm.h > >@@ -92,6 +92,18 @@ BEGIN_C_DECLS > > * > > *********/ > > > >+/* > >+ * routing engine structure - yet limited by ucast_fdb_assign and > >+ * ucast_build_fwd_tables (multicast callbacks may be added later) > >+ */ > >+struct osm_routing_engine { > >+ const char *name; > >+ void *context; > >+ int (*ucast_build_fwd_tables)(void *context); > >+ int (*ucast_fdb_assign)(void *context); > >+ void (*delete)(void *context); > >+}; > It would be nice if you added a standard header to this struct. > It is not clear to me what ucast_build_fwd_tables and > ucast_fdb_assign are mapping to. Ok, will add. BTW, it seems the OpenSM declarations were used for generating manuals or other docs. Do you know whether those /****h* /****s* /****f* markers are still in use? And with what tool? > Please see the next section as an example for a struct header. > >+ > > /****s* OpenSM: OpenSM/osm_opensm_t > > * NAME > > * osm_opensm_t > >@@ -1129,6 +1144,14 @@ osm_ucast_mgr_process( > > i > > ); > > > >+ if (p_routing_eng->ucast_build_fwd_tables && > >+ p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == > >0) > >+ { > >+ cl_qmap_apply_func( p_sw_guid_tbl, > >+ __osm_ucast_mgr_set_table_cb, p_mgr ); > >+ } /* fallback on the regular path in case of failures */ > >+ else > >+ { > Please explain why this step is needed and why if the routing engine > function is > returning 0 you still invoke the standard __osm_ucast_mgr_set_table_cb. ->ucast_build_fwd_tables() creates the fwd tables and __osm_ucast_mgr_set_table_cb() uploads them to the switches.
If ->ucast_build_fwd_tables() fails fatally (return status != 0), the table upload is skipped and the flow falls back to the default routing code. Thanks for the comments. Sasha From ardavis at ichips.intel.com Tue Jun 13 16:20:53 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 13 Jun 2006 16:20:53 -0700 Subject: [openib-general] OFED 1.0 release schedule In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0007F7377F@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0007F7377F@orsmsx408> Message-ID: <448F4855.7060204@ichips.intel.com> Woodruff, Robert J wrote: >Tziporet wrote, > > >>We upload OFED-1.0-pre1.tgz to >>https://openib.org/svn/gen2/branches/1.0/ofed/releases/ >> >> >> > >I tried the new tar ball and the pathscale driver now >compiles (on Redhat EL4 - U3) and IPoIB and OpenSM appear to work OK, >but Intel MPI/uDAPL and NetPipe/uDAPL are broken. It appears to >be a problem with rdma operations. I also tried SDP/pathscale and >it does not work either. >Finally, the rdma_cm is missing the changes that match the uDAPL fix >that >was put in for the new setops for the CM timeouts. >Arlin will provide specifics. We'd really like the rdma_cm fix in the >release. > > > Here is a pointer to Sean's email/patches with the details: http://openib.org/pipermail/openib-general/2006-June/022654.html http://openib.org/pipermail/openib-general/2006-June/022655.html -arlin >woody > > >-----Original Message----- >From: Betsy Zeller [mailto:betsy at pathscale.com] >Sent: Tuesday, June 13, 2006 1:44 PM >To: Tziporet Koren >Cc: Matt L. Leininger; Scott Weitzenkamp (sweitzen); Matters, Todd; Moni >Levy; Woodruff, Robert J; openib; OpenFabricsEWG >Subject: Re: OFED 1.0 release schedule > >Tziporet - this plan makes sense. We'll let you know how the testing >goes. BTW, for some reason, if you click on the URL you sent out, it >just hangs but if you type it in, it works. Not sure why.
> >Thanks, Betsy > >On Tue, 2006-06-13 at 16:07 +0300, Tziporet Koren wrote: > > >>Hi All, >> >> >> >>After reading the mail thread regarding the OFED release I have decided >>this: >> >> >> >>We upload OFED-1.0-pre1.tgz to >>https://openib.org/svn/gen2/branches/1.0/ofed/releases/ >> >> >> >>We checked that all modules compile and load on this build >>(including ipath and uDAPL) >> >>The only missing parts of this release from the final release are the >>documents, and the scripts rpm that Scott requested. >> >> >> >>I think testing this version for 3 days (Tuesday, Wednesday and Thursday) >>should be enough, as Scott wrote. >> >>So - we can do the official OFED 1.0 release on Friday 16-June. >> >> >> >>Matt - please check with Novell if this date is acceptable by them. >> >> >> >>If not, then the earliest we can do the release is Thursday 15-June. >> >> >> >> >> >>Tziporet Koren >> >>Software Director >> >>Mellanox Technologies >> >>mailto: tziporet at mellanox.co.il >>Tel +972-4-9097200, ext 380 >> >> >> >> >> >> > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From sashak at voltaire.com Tue Jun 13 16:31:36 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 Jun 2006 02:31:36 +0300 Subject: [openib-general] [PATCH 2/4 v2] Modular routing engine (unicast only yet). In-Reply-To: <20060611003240.22430.88414.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003240.22430.88414.stgit@sashak.voltaire.com> Message-ID: <20060613233136.GA12137@sashak.voltaire.com> Hi, The same patch, but with an added comment describing the osm_routing_engine structure. Sasha. This patch introduces a routing_engine structure which may be used for "plugging in" a new routing module.
Currently only unicast callbacks are supported (multicast can be added later). The existing up-down routing module, 'updn', may be activated with the '-R updn' option (instead of the old '-u'). General usage is: $ opensm -R 'module-name' Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_opensm.h | 45 ++++++++++++++++++++++- osm/include/opensm/osm_subnet.h | 16 ++------ osm/include/opensm/osm_ucast_updn.h | 26 ------------- osm/opensm/main.c | 26 +++++-------- osm/opensm/osm_opensm.c | 41 ++++++++++++++++++--- osm/opensm/osm_subnet.c | 23 ++++++------ osm/opensm/osm_ucast_mgr.c | 69 ++++++++++++++++++++++++----------- osm/opensm/osm_ucast_updn.c | 69 ++++++++++++++++++----------------- 8 files changed, 184 insertions(+), 131 deletions(-) diff --git a/osm/include/opensm/osm_opensm.h b/osm/include/opensm/osm_opensm.h index 3235ad4..77d2a86 100644 --- a/osm/include/opensm/osm_opensm.h +++ b/osm/include/opensm/osm_opensm.h @@ -92,6 +92,46 @@ BEGIN_C_DECLS * *********/ +/****s* OpenSM: OpenSM/osm_routing_engine +* NAME +* struct osm_routing_engine +* +* DESCRIPTION +* OpenSM routing engine module definition. +* NOTES +* routing engine structure - yet limited by ucast_fdb_assign and +* ucast_build_fwd_tables (multicast callbacks may be added later) +*/ +struct osm_routing_engine { + const char *name; + void *context; + int (*ucast_build_fwd_tables)(void *context); + int (*ucast_fdb_assign)(void *context); + void (*delete)(void *context); +}; +/* +* FIELDS +* name +* The routing engine name (will be used in logs). +* +* context +* The routing engine context. Will be passed as parameter +* to the callback functions. +* +* ucast_build_fwd_tables +* The callback for unicast forwarding table generation. +* +* ucast_fdb_assign +* The same as above, but pretty integrated with default +* routing flow. Look at osm_ucast_mgr_process() and +* osm_ucast_updn.c for details. In future may be merged +* with ucast_build_fwd_tables() callback.
+* +* delete +* The delete method, may be used for routing engine +* internals cleanup. +*/ + /****s* OpenSM: OpenSM/osm_opensm_t * NAME * osm_opensm_t @@ -116,7 +156,7 @@ typedef struct _osm_opensm_t osm_log_t log; cl_dispatcher_t disp; cl_plock_t lock; - updn_t *p_updn_ucast_routing; + struct osm_routing_engine routing_engine; osm_stats_t stats; } osm_opensm_t; /* @@ -153,6 +193,9 @@ typedef struct _osm_opensm_t * lock * Shared lock guarding most OpenSM structures. * +* routing_engine +* Routing engine, will be initialized then used +* * stats * Open SM statistics block * diff --git a/osm/include/opensm/osm_subnet.h b/osm/include/opensm/osm_subnet.h index 4db449d..a637367 100644 --- a/osm/include/opensm/osm_subnet.h +++ b/osm/include/opensm/osm_subnet.h @@ -272,13 +272,11 @@ typedef struct _osm_subn_opt uint32_t max_port_profile; osm_pfn_ui_extension_t pfn_ui_pre_lid_assign; void * ui_pre_lid_assign_ctx; - osm_pfn_ui_extension_t pfn_ui_ucast_fdb_assign; - void * ui_ucast_fdb_assign_ctx; osm_pfn_ui_mcast_extension_t pfn_ui_mcast_fdb_assign; void * ui_mcast_fdb_assign_ctx; boolean_t sweep_on_trap; osm_testability_modes_t testability_mode; - boolean_t updn_activate; + char * routing_engine_name; char * updn_guid_file; boolean_t exit_on_fatal; boolean_t honor_guid2lid_file; @@ -407,13 +405,6 @@ typedef struct _osm_subn_opt * ui_pre_lid_assign_ctx * A UI context (void *) to be provided to the pfn_ui_pre_lid_assign * -* pfn_ui_ucast_fdb_assign -* A UI function to be called instead of the ucast manager FDB -* configuration. -* -* ui_ucast_fdb_assign_ctx -* A UI context (void *) to be provided to the pfn_ui_ucast_fdb_assign -* * pfn_ui_mcast_fdb_assign * A UI function to be called inside the mcast manager instead of the * call for the build spanning tree. This will be called on every @@ -429,9 +420,8 @@ typedef struct _osm_subn_opt * testability_mode * Object that indicates if we are running in a special testability mode. 
* -* updn_activate -* Object that indicates if we are running the UPDN algorithm (TRUE) or -* Min Hop Algorithm (FALSE) +* routing_engine_name +* Name of used routing engine (other than default Min Hop Algorithm) * * updn_guid_file * Pointer to name of the UPDN guid file given by User diff --git a/osm/include/opensm/osm_ucast_updn.h b/osm/include/opensm/osm_ucast_updn.h index 027056c..fbf8782 100644 --- a/osm/include/opensm/osm_ucast_updn.h +++ b/osm/include/opensm/osm_ucast_updn.h @@ -421,32 +421,6 @@ osm_subn_calc_up_down_min_hop_table( * This function returns 0 when rankning has succeded , otherwise 1. ******/ -/****f* OpenSM: OpenSM/osm_updn_reg_calc_min_hop_table -* NAME -* osm_updn_reg_calc_min_hop_table -* -* DESCRIPTION -* Registration function to ucast routing manager (instead of -* Min Hop Algorithm) -* -* SYNOPSIS -*/ -int -osm_updn_reg_calc_min_hop_table( - IN updn_t * p_updn, - IN osm_subn_opt_t* p_opt ); -/* -* PARAMETERS -* -* RETURN VALUES -* 0 - on success , 1 - on failure -* -* NOTES -* -* SEE ALSO -* osm_subn_calc_up_down_min_hop_table -*********/ - /****** Osmsh: UpDown/osm_updn_find_root_nodes_by_min_hop * NAME * osm_updn_find_root_nodes_by_min_hop diff --git a/osm/opensm/main.c b/osm/opensm/main.c index 22591eb..c888ed4 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -60,7 +60,6 @@ #include #include #include #include -#include #include /******************************************************************** @@ -174,10 +173,10 @@ show_usage(void) " may disrupt subnet traffic.\n" " Without -r, OpenSM attempts to preserve existing\n" " LID assignments resolving multiple use of same LID.\n\n"); - printf( "-u\n" - "--updn\n" - " This option activate UPDN algorithm instead of Min Hop\n" - " algorithm (default).\n"); + printf( "-R\n" + "--routing_engine \n" + " This option choose routing engine instead of Min Hop\n" + " algorithm (default). 
Supported engines: updn\n"); printf ("-a\n" "--add_guid_file \n" " Set the root nodes for the Up/Down routing algorithm\n" @@ -524,7 +523,7 @@ #endif boolean_t cache_options = FALSE; char *ignore_guids_file_name = NULL; uint32_t val; - const char * const short_option = "i:f:ed:g:l:s:t:a:P:NQuvVhorcyx"; + const char * const short_option = "i:f:ed:g:l:s:t:a:R:P:NQvVhorcyx"; /* In the array below, the 2nd parameter specified the number @@ -556,7 +555,7 @@ #endif { "reassign_lids", 0, NULL, 'r'}, { "priority", 1, NULL, 'p'}, { "smkey", 1, NULL, 'k'}, - { "updn", 0, NULL, 'u'}, + { "routing_engine",1, NULL, 'R'}, { "add_guid_file", 1, NULL, 'a'}, { "cache-options", 0, NULL, 'c'}, { "stay_on_fatal", 0, NULL, 'y'}, @@ -776,9 +775,9 @@ #endif opt.sm_key = sm_key; break; - case 'u': - opt.updn_activate = TRUE; - printf(" Activate UPDN algorithm\n"); + case 'R': + opt.routing_engine_name = optarg; + printf(" Activate \'%s\' routing engine\n", optarg); break; case 'a': @@ -885,13 +884,6 @@ #endif setup_signals(); osm_opensm_sweep( &osm ); - /* since osm_opensm_init get opt as RO we'll set the opt value with UI pfn here */ - /* Now do the registration */ - if (opt.updn_activate) - if (osm_updn_reg_calc_min_hop_table(osm.p_updn_ucast_routing, &(osm.subn.opt))) { - status = IB_ERROR; - goto Exit; - } if( run_once_flag == TRUE ) { diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c index 8c422b5..52f06da 100644 --- a/osm/opensm/osm_opensm.c +++ b/osm/opensm/osm_opensm.c @@ -68,6 +68,37 @@ #include #include #include +struct routing_engine_module { + const char *name; + int (*setup)(osm_opensm_t *p_osm); +}; + +extern int osm_ucast_updn_setup(osm_opensm_t *p_osm); + +const static struct routing_engine_module routing_modules[] = { + {"null", NULL}, + {"updn", osm_ucast_updn_setup }, + {} +}; + +static int setup_routing_engine(osm_opensm_t *p_osm, const char *name) +{ + const struct routing_engine_module *r; + for (r = routing_modules ; r->name && *r->name ; r++) { + 
if(!strcmp(r->name, name)) { + p_osm->routing_engine.name = r->name; + if (r->setup(p_osm)) + break; + osm_log (&p_osm->log, OSM_LOG_DEBUG, + "opensm: setup_routing_engine: " + "\'%s\' routing engine set up.\n", + p_osm->routing_engine.name); + return 0; + } + } + return -1; +} + /********************************************************************** **********************************************************************/ void @@ -118,7 +149,8 @@ osm_opensm_destroy( cl_disp_shutdown( &p_osm->disp ); /* do the destruction in reverse order as init */ - updn_destroy( p_osm->p_updn_ucast_routing ); + if (p_osm->routing_engine.delete) + p_osm->routing_engine.delete(p_osm->routing_engine.context); osm_sa_destroy( &p_osm->sa ); osm_sm_destroy( &p_osm->sm ); osm_db_destroy( &p_osm->db ); @@ -252,11 +284,8 @@ #endif if( status != IB_SUCCESS ) goto Exit; - /* HACK - the UpDown manager should have been a part of the osm_sm_t */ - /* Init updn struct */ - p_osm->p_updn_ucast_routing = updn_construct( ); - status = updn_init( p_osm->p_updn_ucast_routing ); - if( status != IB_SUCCESS ) + if( p_opt->routing_engine_name && + setup_routing_engine(p_osm, p_opt->routing_engine_name)) goto Exit; Exit: diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c index 7c08556..27f97ab 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -484,13 +484,11 @@ osm_subn_set_default_opt( p_opt->max_port_profile = 0xffffffff; p_opt->pfn_ui_pre_lid_assign = NULL; p_opt->ui_pre_lid_assign_ctx = NULL; - p_opt->pfn_ui_ucast_fdb_assign = NULL; - p_opt->ui_ucast_fdb_assign_ctx = NULL; p_opt->pfn_ui_mcast_fdb_assign = NULL; p_opt->ui_mcast_fdb_assign_ctx = NULL; p_opt->sweep_on_trap = TRUE; p_opt->testability_mode = OSM_TEST_MODE_NONE; - p_opt->updn_activate = FALSE; + p_opt->routing_engine_name = NULL; p_opt->updn_guid_file = NULL; p_opt->exit_on_fatal = TRUE; subn_set_default_qos_options(&p_opt->qos_options); @@ -911,9 +909,9 @@ osm_subn_parse_conf_file( "sweep_on_trap", p_key, 
p_val, &p_opts->sweep_on_trap); - __osm_subn_opts_unpack_boolean( - "updn_activate", - p_key, p_val, &p_opts->updn_activate); + __osm_subn_opts_unpack_charp( + "routing_engine", + p_key, p_val, &p_opts->routing_engine_name); __osm_subn_opts_unpack_charp( "log_file", p_key, p_val, &p_opts->log_file); @@ -1089,12 +1087,13 @@ osm_subn_write_conf_file( opts_file, "#\n# ROUTING OPTIONS\n#\n" "# If true do not count switches as link subscriptions\n" - "port_profile_switch_nodes %s\n\n" - "# Activate the Up/Down routing algorithm\n" - "updn_activate %s\n\n", - p_opts->port_profile_switch_nodes ? "TRUE" : "FALSE", - p_opts->updn_activate ? "TRUE" : "FALSE" - ); + "port_profile_switch_nodes %s\n\n", + p_opts->port_profile_switch_nodes ? "TRUE" : "FALSE"); + if (p_opts->routing_engine_name) + fprintf( opts_file, + "# Routing engine\n" + "routing_engine %s\n\n", + p_opts->routing_engine_name); if (p_opts->updn_guid_file) fprintf( opts_file, "# The file holding the Up/Down root node guids\n" diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c index 301aea5..787ae02 100644 --- a/osm/opensm/osm_ucast_mgr.c +++ b/osm/opensm/osm_ucast_mgr.c @@ -62,6 +62,7 @@ #include #include #include #include +#include #define LINE_LENGTH 256 @@ -269,7 +270,7 @@ __osm_ucast_mgr_dump_ucast_routes( strcat( p_mgr->p_report_buf, "yes" ); else { - if (p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign) { + if (p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) { ui_ucast_fdb_assign_func_defined = TRUE; } else { ui_ucast_fdb_assign_func_defined = FALSE; @@ -708,7 +709,7 @@ __osm_ucast_mgr_process_port( node_guid = osm_node_get_node_guid(osm_switch_get_node_ptr( p_sw ) ); /* Flag to mark whether or not a ui ucast fdb assign function was given */ - if (p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign) + if (p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) ui_ucast_fdb_assign_func_defined = TRUE; else ui_ucast_fdb_assign_func_defined = FALSE; @@ -753,7 +754,7 @@ __osm_ucast_mgr_process_port( 
/* Up/Down routing can cause unreachable routes between some switches so we do not report that as an error in that case */ - if (!p_mgr->p_subn->opt.updn_activate) + if (!p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) { osm_log( p_mgr->p_log, OSM_LOG_ERROR, "__osm_ucast_mgr_process_port: ERR 3A08: " @@ -973,6 +974,18 @@ __osm_ucast_mgr_process_tbl( /********************************************************************** **********************************************************************/ static void +__osm_ucast_mgr_set_table_cb( + IN cl_map_item_t* const p_map_item, + IN void* context ) +{ + osm_switch_t* const p_sw = (osm_switch_t*)p_map_item; + osm_ucast_mgr_t* const p_mgr = (osm_ucast_mgr_t*)context; + __osm_ucast_mgr_set_table( p_mgr, p_sw ); +} + +/********************************************************************** + **********************************************************************/ +static void __osm_ucast_mgr_process_neighbors( IN cl_map_item_t* const p_map_item, IN void* context ) @@ -1058,12 +1071,14 @@ osm_ucast_mgr_process( { uint32_t i; uint32_t iteration_max; + struct osm_routing_engine *p_routing_eng; osm_signal_t signal; cl_qmap_t *p_sw_guid_tbl; OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_process ); p_sw_guid_tbl = &p_mgr->p_subn->sw_guid_tbl; + p_routing_eng = &p_mgr->p_subn->p_osm->routing_engine; CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); @@ -1129,6 +1144,14 @@ osm_ucast_mgr_process( i ); + if (p_routing_eng->ucast_build_fwd_tables && + p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == 0) + { + cl_qmap_apply_func( p_sw_guid_tbl, + __osm_ucast_mgr_set_table_cb, p_mgr ); + } /* fallback on the regular path in case of failures */ + else + { /* This is the place where we can load pre-defined routes into the switches fwd_tbl structures. @@ -1136,32 +1159,34 @@ osm_ucast_mgr_process( Later code will use these values if not configured for re-assignment. 
*/ - if (p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign) - { - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) + if (p_routing_eng->ucast_fdb_assign) { - osm_log( p_mgr->p_log, OSM_LOG_DEBUG, - "osm_ucast_mgr_process: " - "Invoking UI function pfn_ui_ucast_fdb_assign\n"); - } - p_mgr->p_subn->opt.pfn_ui_ucast_fdb_assign(p_mgr->p_subn->opt.ui_ucast_fdb_assign_ctx); - } else { + if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_mgr->p_log, OSM_LOG_DEBUG, + "osm_ucast_mgr_process: " + "Invoking \'%s\' function ucast_fdb_assign\n", + p_routing_eng->name); + } + p_routing_eng->ucast_fdb_assign(p_routing_eng->context); + } else { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "osm_ucast_mgr_process: " "UI pfn was not invoked\n"); - } + } - osm_log(p_mgr->p_log, OSM_LOG_INFO, - "osm_ucast_mgr_process: " - "Min Hop Tables configured on all switches\n"); + osm_log(p_mgr->p_log, OSM_LOG_INFO, + "osm_ucast_mgr_process: " + "Min Hop Tables configured on all switches\n"); - /* - Now that the lid matrixes have been built, we can - build and download the switch forwarding tables. - */ + /* + Now that the lid matrixes have been built, we can + build and download the switch forwarding tables. 
+ */ - cl_qmap_apply_func( p_sw_guid_tbl, - __osm_ucast_mgr_process_tbl, p_mgr ); + cl_qmap_apply_func( p_sw_guid_tbl, + __osm_ucast_mgr_process_tbl, p_mgr ); + } /* dump fdb into file: */ if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) ) diff --git a/osm/opensm/osm_ucast_updn.c b/osm/opensm/osm_ucast_updn.c index d80f7eb..8e36854 100644 --- a/osm/opensm/osm_ucast_updn.c +++ b/osm/opensm/osm_ucast_updn.c @@ -76,8 +76,9 @@ __updn_get_dir(IN uint8_t cur_rank, IN uint64_t cur_guid, IN uint64_t rem_guid) { - uint32_t i = 0, max_num_guids = osm.p_updn_ucast_routing->updn_ucast_reg_inputs.num_guids; - uint64_t *p_guid = osm.p_updn_ucast_routing->updn_ucast_reg_inputs.guid_list; + updn_t *p_updn = osm.routing_engine.context; + uint32_t i = 0, max_num_guids = p_updn->updn_ucast_reg_inputs.num_guids; + uint64_t *p_guid = p_updn->updn_ucast_reg_inputs.guid_list; boolean_t cur_is_root = FALSE , rem_is_root = FALSE; /* HACK: comes to solve root nodes connection, in a classic subnet root nodes does not connect @@ -540,7 +541,7 @@ updn_init( p_updn->updn_ucast_reg_inputs.guid_list = NULL; p_updn->auto_detect_root_nodes = FALSE; /* Check if updn is activated , then fetch root nodes */ - if (osm.subn.opt.updn_activate) + if (osm.routing_engine.context) { /* Check the source for root node list, if file parse it, otherwise @@ -569,7 +570,7 @@ updn_init( { p_tmp = malloc(sizeof(uint64_t)); *p_tmp = strtoull(line, NULL, 16); - cl_list_insert_tail(osm.p_updn_ucast_routing->p_root_nodes, p_tmp); + cl_list_insert_tail(p_updn->p_root_nodes, p_tmp); } } else @@ -588,8 +589,8 @@ updn_init( "osm_opensm_init: " "UPDN - Root nodes fetching by file %s\n", osm.subn.opt.updn_guid_file); - guid_iterator = cl_list_head(osm.p_updn_ucast_routing->p_root_nodes); - while( guid_iterator != cl_list_end(osm.p_updn_ucast_routing->p_root_nodes) ) + guid_iterator = cl_list_head(p_updn->p_root_nodes); + while( guid_iterator != cl_list_end(p_updn->p_root_nodes) ) { osm_log( &osm.log, OSM_LOG_DEBUG, 
"osm_opensm_init: " @@ -600,7 +601,7 @@ updn_init( } else { - osm.p_updn_ucast_routing->auto_detect_root_nodes = TRUE; + p_updn->auto_detect_root_nodes = TRUE; } /* If auto mode detection reuired - will be executed in main b4 the assignment of UI Ucast */ } @@ -985,33 +986,6 @@ void __osm_updn_convert_list2array(IN up /********************************************************************** **********************************************************************/ -/* Registration function to ucast routing manager (instead of - Min Hop Algorithm) */ -int -osm_updn_reg_calc_min_hop_table( - IN updn_t * p_updn, - IN osm_subn_opt_t* p_opt ) -{ - OSM_LOG_ENTER(&(osm.log), osm_updn_reg_calc_min_hop_table); - /* - If root nodes were supplied by the user - we need to convert into array - otherwise, will be created & converted in callback function activation - */ - if (!p_updn->auto_detect_root_nodes) - { - __osm_updn_convert_list2array(p_updn); - } - osm_log (&(osm.log), OSM_LOG_DEBUG, - "osm_updn_reg_calc_min_hop_table: " - "assigning ucast fdb UI function with updn callback\n"); - p_opt->pfn_ui_ucast_fdb_assign = __osm_updn_call; - p_opt->ui_ucast_fdb_assign_ctx = (void *)p_updn; - OSM_LOG_EXIT(&(osm.log)); - return 0; -} - -/********************************************************************** - **********************************************************************/ /* Find Root nodes automatically by Min Hop Table info */ int osm_updn_find_root_nodes_by_min_hop( OUT updn_t * p_updn ) @@ -1210,3 +1184,30 @@ osm_updn_find_root_nodes_by_min_hop( OUT OSM_LOG_EXIT(&(osm.log)); return 0; } + +/********************************************************************** + **********************************************************************/ + +static void __osm_updn_delete(void *context) +{ + updn_t *p_updn = context; + updn_destroy(p_updn); +} + +int osm_ucast_updn_setup(osm_opensm_t *p_osm) +{ + updn_t *p_updn; + p_updn = updn_construct(); + if (!p_updn) + return -1; + 
p_osm->routing_engine.context = p_updn; + p_osm->routing_engine.delete = __osm_updn_delete; + p_osm->routing_engine.ucast_fdb_assign = __osm_updn_call; + + if (updn_init(p_updn) != IB_SUCCESS) + return -1; + if (!p_updn->auto_detect_root_nodes) + __osm_updn_convert_list2array(p_updn); + + return 0; +} From sashak at voltaire.com Tue Jun 13 16:40:57 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 Jun 2006 02:40:57 +0300 Subject: [openib-general] [PATCH 1/4] Simplification of the ucast fdb dumps. In-Reply-To: <448EA993.6010000@mellanox.co.il> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003238.22430.62423.stgit@sashak.voltaire.com> <448EA993.6010000@mellanox.co.il> Message-ID: <20060613234057.GR10482@sashak.voltaire.com> Hi Eitan, On 15:03 Tue 13 Jun , Eitan Zahavi wrote: > Hi Sasha, > > I still need to see if there are no real problematic changes in the osm.fdbs > file syntax (need to update ibdm to support those) but I like the patch and > the clean way you resolved the multiple opens of the dump file. Thanks. Sasha From arlin.r.davis at intel.com Tue Jun 13 17:03:40 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Tue, 13 Jun 2006 17:03:40 -0700 Subject: [openib-general] [PATCH] uDAPL cma provider - add missing ia_attributes for the ia_query Message-ID: James, Here are some changes to include some missing IA attributes during a query. 
-arlin Signed-off by: Arlin Davis ardavis at ichips.intel.com Index: dapl/openib_cma/dapl_ib_util.c =================================================================== --- dapl/openib_cma/dapl_ib_util.c (revision 7935) +++ dapl/openib_cma/dapl_ib_util.c (working copy) @@ -444,7 +444,10 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HC ia_attr->hardware_version_major = dev_attr.hw_ver; ia_attr->max_eps = dev_attr.max_qp; ia_attr->max_dto_per_ep = dev_attr.max_qp_wr; - ia_attr->max_rdma_read_per_ep = dev_attr.max_qp_rd_atom; + ia_attr->max_rdma_read_per_ep_in = dev_attr.max_qp_rd_atom; + ia_attr->max_rdma_read_per_ep_out = dev_attr.max_qp_rd_atom; + ia_attr->max_rdma_read_per_ep_in_guaranteed = DAT_TRUE; + ia_attr->max_rdma_read_per_ep_out_guaranteed = DAT_TRUE; ia_attr->max_evds = dev_attr.max_cq; ia_attr->max_evd_qlen = dev_attr.max_cqe; ia_attr->max_iov_segments_per_dto = dev_attr.max_sge; @@ -468,10 +471,11 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HC ia_attr->max_eps, ia_attr->max_dto_per_ep, ia_attr->max_evds, ia_attr->max_evd_qlen ); dapl_dbg_log(DAPL_DBG_TYPE_UTIL, - " query_hca: msg %llu rdma %llu iov %d lmr %d rmr %d\n", + " query_hca: msg %llu rdma %llu iov %d lmr %d rmr %d" + " rd_io %d\n", ia_attr->max_mtu_size, ia_attr->max_rdma_size, ia_attr->max_iov_segments_per_dto, ia_attr->max_lmrs, - ia_attr->max_rmrs ); + ia_attr->max_rmrs, ia_attr->max_rdma_read_per_ep_in ); } if (ep_attr != NULL) { From gjohnson at lanl.gov Tue Jun 13 17:06:10 2006 From: gjohnson at lanl.gov (Greg Johnson) Date: Tue, 13 Jun 2006 18:06:10 -0600 Subject: [openib-general] [PATCH 0/4] opensm: Loading unicast routes from the file In-Reply-To: <20060613200035.GG10482@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060613170246.GH23320@durango.c3.lanl.gov> <20060613200035.GG10482@sashak.voltaire.com> Message-ID: <20060614000610.GJ23320@durango.c3.lanl.gov> On Tue, Jun 13, 2006 at 11:00:35PM +0300, Sasha Khapyorsky wrote: > Hi Greg, > > On 11:02 Tue 
13 Jun , Greg Johnson wrote: > > It seems to load the routes generated by the dump > > script, but afterward it is not possible to dump the routes again. > > This means you have broken LFTs now. Probably I know what is going on > here - new LFTs don't have " 0" entries, and switches are > not accessible by LIDs anymore. > > Please update 'ibroute' utility (diags/) from the trunk and recreate the > dump file - this should fix the problem. > > (Sorry, I forgot to mention 'ibroute' upgrade issue in patch announcement). Ok, that fixed it. It works fine now. Any chance of making our own lid -> guid assignments while we are at it? Thanks, Greg From robert.j.woodruff at intel.com Tue Jun 13 17:09:02 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 13 Jun 2006 17:09:02 -0700 Subject: [openib-general] OFED 1.0 release schedule Message-ID: <1AC79F16F5C5284499BB9591B33D6F0007F738B1@orsmsx408> Tziporet wrote, >We upload OFED-1.0-pre1.tgz to > https://openib.org/svn/gen2/branches/1.0/ofed/releases/ > One other thing I noticed is that you do not enable MSI interrupt mode by default. You will get lower performance if you do not enable MSI. I think you can set it when you load the driver with a modprobe parameter. woody From weiny2 at llnl.gov Tue Jun 13 17:11:47 2006 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 13 Jun 2006 17:11:47 -0700 Subject: [openib-general] MPI error when using a "system" call in mpi job. Message-ID: <20060613171147.35787125.weiny2@llnl.gov> A co-worker here was seeing the following MPI error from his job: [1] Abort: [ldev2:1] Got completion with error, code=1 at line 2148 in file viacheck.c After some tracking down he found that apparently if he used a "system" call [int system(const char *string)] the next MPI command will fail. I have been able to reproduce this with the attached simple "hello" program. Perhaps someone has seen this type of error? 
Here is the output from 2 runs: weiny2 at ldev0:~/ior-test 17:04:04 > mpirun_rsh -rsh -hostfile hostfile -np 2 ./hello x ldev1 [0] Abort: [ldev1:0] Got completion with error, code=1 at line 2148 in file viacheck.c ldev2 mpirun_rsh: Abort signaled from [0] done. weiny2 at ldev0:~/ior-test 17:05:23 > mpirun_rsh -rsh -hostfile hostfile -np 2 ./hello now = 0.000000 now = 0.000052 now = 0.000094 now = 0.000121 now = 0.000151 now = 0.001072 now = 0.001102 now = 0.001118 now = 0.001141 now = 0.001160 done. We are running mvapich 0.9.7 and the openib trunk rev 6829. Thanks, Ira -------------- next part -------------- A non-text attachment was scrubbed... Name: hello.c Type: application/octet-stream Size: 2784 bytes Desc: not available URL: From sean.hefty at intel.com Tue Jun 13 18:46:09 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 13 Jun 2006 18:46:09 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <1150237590.570.164947.camel@hal.voltaire.com> Message-ID: <000001c68f54$4f7350a0$5fc8180a@amr.corp.intel.com> >Is the only downside of a larger timeout that potentially more memory >accumulates (until the timeout occurs) before it is freed ? This is the only one that I can think of. Can anyone think of others? - Sean From betsy at pathscale.com Tue Jun 13 20:09:45 2006 From: betsy at pathscale.com (Betsy Zeller) Date: Tue, 13 Jun 2006 20:09:45 -0700 Subject: [openib-general] OFED 1.0 release schedule In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0007F738B1@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0007F738B1@orsmsx408> Message-ID: <1150254585.3425.6.camel@sarium.pathscale.com> Woody - you are absolutely correct for ipath - you definitely want MSI interrupts enabled. We (QLogic) need to submit this information for inclusion in the OFED 1.0 release notes. 
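[For mthca specifically, the modinfo listing later in this digest shows the knobs: "parm: msi" and "parm: msi_x". Assuming those parameters (check modinfo ib_mthca on your own build), enabling MSI-X at module load time looks roughly like this:]

```shell
# One-off: load the driver asking for MSI-X (msi=1 would request plain MSI)
modprobe ib_mthca msi_x=1

# Persistent: have modprobe pass the option on every load
echo "options ib_mthca msi_x=1" >> /etc/modprobe.conf
```

[This sketches the mechanism only; whether MSI/MSI-X should be the default is the policy question being discussed in this thread.]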
Thanks, Betsy On Tue, 2006-06-13 at 17:09 -0700, Woodruff, Robert J wrote: > Tziporet wrote, > >We upload OFED-1.0-pre1.tgz to > > https://openib.org/svn/gen2/branches/1.0/ofed/releases/ > > > > One other thing I noticed is that you do not enable MSI interrupt > mode by default. You will get lower performance if you do not > enable MSI. I think you can set it when you load the driver with a > modprobe parameter. > > woody From tziporet at mellanox.co.il Tue Jun 13 22:48:17 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Wed, 14 Jun 2006 08:48:17 +0300 Subject: [openib-general] MSI enabled (was OFED 1.0 release schedule) Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA71FC@mtlexch01.mtl.com> Since this is the case in the git tree too we have not changed it. Most our QA so far run in this way so I don't want to change the default now. I will add this option in mthca release notes. Tziporet -----Original Message----- From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] Sent: Wednesday, June 14, 2006 3:09 AM To: Woodruff, Robert J; Betsy Zeller; Tziporet Koren; Davis, Arlin R Cc: Matt L. Leininger; OpenFabricsEWG; openib; Matters, Todd Subject: RE: [openib-general] OFED 1.0 release schedule Tziporet wrote, >We upload OFED-1.0-pre1.tgz to > https://openib.org/svn/gen2/branches/1.0/ofed/releases/ > One other thing I noticed is that you do not enable MSI interrupt mode by default. You will get lower performance if you do not enable MSI. I think you can set it when you load the driver with a modprobe parameter. woody From eitan at mellanox.co.il Tue Jun 13 23:32:30 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 14 Jun 2006 09:32:30 +0300 Subject: [openib-general] [PATCH 2/4] Modular routing engine (unicast only yet). Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236884C@mtlexch01.mtl.com> Hi Sasha, OpenSM header files were used for generating documents using RoboDoc which was slightly modified by Intel. 
I found it very useful when I was learning the code. I attach the robodoc sources and my scripts for generating the doc for all headers in a dir. EZ Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Wednesday, June 14, 2006 2:21 AM > To: Eitan Zahavi > Cc: Hal Rosenstock; openib-general at openib.org; Greg Johnson; michael k lang; Yael > Kalka; Ofer Gigi > Subject: Re: [PATCH 2/4] Modular routing engine (unicast only yet). > > Hi Eitan, > > On 14:55 Tue 13 Jun , Eitan Zahavi wrote: > > > > As provided in my previous patch 1/4 comments > > I think the callbacks should also have an entry for the MinHop stage (maybe > > this is the ucast_build_fwd_tables?) I have some algorithms in mind that > > will > > skip that stage all-together. > > We may add new callback when it will be useful. > > > Also it might make sense for each routing engine to provide its own "dump" > > routine such that each could support difference file format if needed. > > Why we may want dump format per routing engine? Even if we are, you may > put it into routing engine specific code. 
> > > > > Rest of the comments are inline > > > > EZ > > > > Sasha Khapyorsky wrote: > > > > > >diff --git a/osm/include/opensm/osm_opensm.h > > >b/osm/include/opensm/osm_opensm.h > > >index 3235ad4..3e6e120 100644 > > >--- a/osm/include/opensm/osm_opensm.h > > >+++ b/osm/include/opensm/osm_opensm.h > > >@@ -92,6 +92,18 @@ BEGIN_C_DECLS > > > * > > > *********/ > > > > > >+/* > > >+ * routing engine structure - yet limited by ucast_fdb_assign and > > >+ * ucast_build_fwd_tables (multicast callbacks may be added later) > > >+ */ > > >+struct osm_routing_engine { > > >+ const char *name; > > >+ void *context; > > >+ int (*ucast_build_fwd_tables)(void *context); > > >+ int (*ucast_fdb_assign)(void *context); > > >+ void (*delete)(void *context); > > >+}; > > It would be nice if you added a standard header to this struct. > > It is not clear to me what ucast_build_fwd_tables and > > ucast_fdb_assign are mapping to. > > Ok, will add. > > BTW, seems OpenSM declarations were used for generation manuals or other > docs. Do you know are those > > /****h* > /****s* > /****f* > > in use anymore? And with what is the tool? > > > Please see the next section as an example for a struct header. > > >+ > > > /****s* OpenSM: OpenSM/osm_opensm_t > > > * NAME > > > * osm_opensm_t > > > >@@ -1129,6 +1144,14 @@ osm_ucast_mgr_process( > > > i > > > ); > > > > > >+ if (p_routing_eng->ucast_build_fwd_tables && > > >+ p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == > > >0) > > >+ { > > >+ cl_qmap_apply_func( p_sw_guid_tbl, > > >+ __osm_ucast_mgr_set_table_cb, p_mgr ); > > >+ } /* fallback on the regular path in case of failures */ > > >+ else > > >+ { > > Please explain why this step is needed and why if the routing engine > > function is > > returning 0 you still invoke the standard __osm_ucast_mgr_set_table_cb. > > ->ucast_build_fwd_tables() creates fwd tables and > __osm_ucast_mgr_set_table_cb() upload them on the switches. 
In case of > ->ucast_build_fwd_tables() fatal failure (when return status is != 0), > tables uploading will be skipped and flow will continue with default > routing code. > > Thanks for the comments. > Sasha -------------- next part -------------- A non-text attachment was scrubbed... Name: robodoc-3.2.3.tar.gz Type: application/x-gzip Size: 112042 bytes Desc: robodoc-3.2.3.tar.gz URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: roboDocDir Type: application/octet-stream Size: 1322 bytes Desc: roboDocDir URL: From eitan at mellanox.co.il Tue Jun 13 23:48:15 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 14 Jun 2006 09:48:15 +0300 Subject: [openib-general] [PATCH 3/4] New routing module which loads LFT tables from dump file. Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236884D@mtlexch01.mtl.com> Hi Hal, Sasha, Regarding OpenSM coding style: Sasha wrote: > > Really? Don't want to bother with examples, but I may see almost any > "combination" in OpenSM and it is not clear for me which one is common > (the coding style and indentation are different even from file to file). [EZ] This bothers me as I think we should use a consistent coding style. You might also remember we had put in place both a script to do automatic indentation and coding style rule fixes (osm_indent and osm_check_n_fix). I did check for all "else" statements: osm/opensm>grep else *.c | wc -l 397 osm/opensm>grep else *.c | grep -v "{" | grep -v "}" | wc -l 361 So you can see only <10% (36 out of 397) "else" statements are not coding-style consistent.
Checking what code is "non standard": osm/opensm>grep else *.c | grep "{" | awk '{print $1}' | sort | uniq -c | sort -rn 7 osm_console.c: 6 osm_prtn_config.c: 3 st.c: 3 osm_sa_multipath_record.c: 2 osm_ucast_mgr.c: 2 osm_sa_path_record.c: 1 osm_sa_mcmember_record.c: 1 osm_sa_informinfo.c: 1 osm_sa_class_port_info.c: 1 osm_multicast.c: You can see the majority of these mismatches are in code introduced by Hal and yourself. I think OpenSM should use a single coding style. My proposal is that we update our osm_indent script with a set of rules we agree on and apply it to the entire tree. From jackm at mellanox.co.il Wed Jun 14 00:55:42 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Wed, 14 Jun 2006 10:55:42 +0300 Subject: [openib-general] OFED 1.0 release schedule In-Reply-To: <448F4855.7060204@ichips.intel.com> References: <1AC79F16F5C5284499BB9591B33D6F0007F7377F@orsmsx408> <448F4855.7060204@ichips.intel.com> Message-ID: <200606141055.42449.jackm@mellanox.co.il> On Wednesday 14 June 2006 02:20, Arlin Davis wrote: > Woodruff, Robert J wrote: > >>We upload OFED-1.0-pre1.tgz to > >>https://openib.org/svn/gen2/branches/1.0/ofed/releases/ > > > >I tried the new tar ball and the pathscale driver now > >compiles (on Redhat EL4 - U3) and IPoIB and OpenSM appear to work OK, > >but Intel MPI/uDAPL and NetPipe/uDAPL are broken. It appears to > >be a problem with rdma operations. I also tried SDP/pathscale and > >it does not work either. > >Finally, the rdma_cm is missing the changes that match the uDAPL fix > >that > >was put in for the new setops for the CM timeouts. > >Arlin will provide specifics. We'd really like the rdma_cm fix in the > >release.
> > Here is a pointer to Sean's email/patches with the details: > > http://openib.org/pipermail/openib-general/2006-June/022654.html > http://openib.org/pipermail/openib-general/2006-June/022655.html > > -arlin > As I posted to ipoib-general on June 7 ( http://openib.org/pipermail/openib-general/2006-June/022725.html ) the new setops for CM timeouts will not be available in OFED 1.0 , so please don't try to use them as yet. We tested out IntelMPI over uDapl (from OFED 1.0-pre1) using the PALLAS test suite, and it worked fine -- no problems. Evidently, you are trying to use these new (and absent/unsupported) features. We do appreciate that these features are very important for scalability, and we plan to include them in the 1.1 release which will follow shortly. From greg.lindahl at qlogic.com Wed Jun 14 01:36:42 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Wed, 14 Jun 2006 01:36:42 -0700 Subject: [openib-general] MPI error when using a "system" call in mpi job. In-Reply-To: <20060613171147.35787125.weiny2@llnl.gov> References: <20060613171147.35787125.weiny2@llnl.gov> Message-ID: <20060614083642.GG2741@greglaptop.hsd1.ca.comcast.net> On Tue, Jun 13, 2006 at 05:11:47PM -0700, Ira Weiny wrote: > After some tracking down he found that apparently if he used a "system" call > [int system(const char *string)] the next MPI command will fail. Are you sure MVAPICH supports fork()? It is not unusual for MPI implementations to not support fork(). system() uses fork(). -- greg From mst at mellanox.co.il Wed Jun 14 01:40:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Jun 2006 11:40:41 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> References: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> Message-ID: <20060614084041.GA19518@mellanox.co.il> Quoting r. 
Sean Hefty : > One of the ideas then, is for the kernel umad module to learn which MADs > generate responses. I thought about this a bit, this seems to add even more state for MAD processing engine, which sounds like a wrong approach: Would keeping around MADs in the done list consume significant extra memory resources? What limits this memory? Would a small client that would normally just send RMPP, get a response and exit will be slowed down significantly while the agent learns? Would a buggy application confuse the umad module, corrupting MAD processing for all other applications? The original approach by Jack of detecting, and dropping, duplicate responses instead of duplicate requests seemed much easier to me. The only disadvantage it has that I'm aware of is a slight performance hit for duplicate processing of each request. But all the done_list scans proposed seem even more CPU intensive. Can we discuss that approach once again please? The patch is here: https://openib.org/svn/trunk/contrib/mellanox/patches/mad_rmpp_requester_retry.patch -- MST From mst at mellanox.co.il Wed Jun 14 01:49:20 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 14 Jun 2006 11:49:20 +0300 Subject: [openib-general] oops on trunk Message-ID: <20060614084920.GC19518@mellanox.co.il> Here's another oops while unloading modules: Unable to handle kernel paging request at ffffffff8803d8ad RIP: [] PGD 103027 PUD 105027 PMD 11f99f067 PTE 0 Oops: 0010 [1] SMP CPU 1 Modules linked in: ib_mthca ib_umad ib_sa ib_mad ib_core Pid: 12364, comm: modprobe Not tainted 2.6.16 #1 RIP: 0010:[] [] RSP: 0000:ffff810118835d80 EFLAGS: 00010246 RAX: 0000000000000005 RBX: ffff810118835e10 RCX: ffffffff8801996e RDX: ffff81011fe11c00 RSI: 0000000000000000 RDI: 00000000fffffffc RBP: ffff81010fee5c90 R08: ffff8101199d9d00 R09: 0000000000000000 R10: ffff81011fc227c8 R11: ffff81011c70e8c0 R12: 00000000fffffffc R13: 0000000000000000 R14: 0000000000000080 R15: 0000000000000000 FS: 00002b68af872b00(0000) GS:ffff81011fc74bc0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffffffff8803d8ad CR3: 00000001094a4000 CR4: 00000000000006e0 Process modprobe (pid: 12364, threadinfo ffff810118834000, task ffff81011fe56f00) Stack: ffffffff880199ae ffff810109461040 ffff81011fe57120 0000000000000008 ffff810118835e78 ffffffff8049f880 ffff810118835e78 0000000000000080 0000000000000000 ffff810118835e10 Call Trace: {:ib_sa:ib_sa_mcmember_rec_callback+64} {:ib_sa:send_handler+74} {:ib_mad:ib_unregister_mad_agent+366} {cond_resched+76} {:ib_sa:ib_sa_remove_one+71} {:ib_core:ib_unregister_client+64} {:ib_sa:ib_sa_cleanup+13} {sys_delete_module+481} {__up_write+20} {sys_munmap+80} {system_call+126} Code: Bad RIP value. RIP [] RSP CR2: ffffffff8803d8ad -- MST From glebn at voltaire.com Wed Jun 14 02:29:28 2006 From: glebn at voltaire.com (glebn at voltaire.com) Date: Wed, 14 Jun 2006 12:29:28 +0300 Subject: [openib-general] MPI error when using a "system" call in mpi job. 
In-Reply-To: <20060614083642.GG2741@greglaptop.hsd1.ca.comcast.net> References: <20060613171147.35787125.weiny2@llnl.gov> <20060614083642.GG2741@greglaptop.hsd1.ca.comcast.net> Message-ID: <20060614092928.GB17758@minantech.com> On Wed, Jun 14, 2006 at 01:36:42AM -0700, Greg Lindahl wrote: > On Tue, Jun 13, 2006 at 05:11:47PM -0700, Ira Weiny wrote: > > > After some tracking down he found that apparently if he used a "system" call > > [int system(const char *string)] the next MPI command will fail. > > Are you sure MVAPICH supports fork()? It is not unusual for MPI > implementations to not support fork(). system() uses fork(). > On kernel 2.6.12 or newer system() should works OK (in non threaded application). -- Gleb. From mst at mellanox.co.il Wed Jun 14 02:47:16 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Jun 2006 12:47:16 +0300 Subject: [openib-general] [PATCH 2/4] Modular routing engine (unicast only yet). In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30236884C@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30236884C@mtlexch01.mtl.com> Message-ID: <20060614094716.GE19518@mellanox.co.il> Quoting r. Eitan Zahavi : > Subject: Re: [PATCH 2/4] Modular routing engine (unicast only yet). > > Hi Sasha, > > OpenSM header files were used for generating documents using RoboDoc > which was slightly modified by Intel. I found it very useful when I was > learning the code. > > I attach the robodoc sources and my scripts for generating the doc for > all headers in a dir. > > EZ Put it all in svn somewhere? https://openib.org/svn/gen2/trunk/build/ -- MST From halr at voltaire.com Wed Jun 14 04:03:06 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Jun 2006 07:03:06 -0400 Subject: [openib-general] [PATCH 2/4 v2] Modular routing engine (unicast only yet). 
In-Reply-To: <20060613233136.GA12137@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003240.22430.88414.stgit@sashak.voltaire.com> <20060613233136.GA12137@sashak.voltaire.com> Message-ID: <1150282969.570.191546.camel@hal.voltaire.com> On Tue, 2006-06-13 at 19:31, Sasha Khapyorsky wrote: > Hi, > > The same patch, but with comment addition about osm_routing_engine > structure. > > Sasha. > > > This patch introduces routing_engine structure which may be used for > "plugging" new routing module. Currently only unicast callbacks are > supported (multicast can be added later). And existing routing module > is up-down 'updn', may be activated with '-R updn' option (instead of > old '-u'). General usage is: > > $ opensm -R 'module-name' > > Signed-off-by: Sasha Khapyorsky Thanks. Applied (with some cosmetic changes). -- Hal From ogerlitz at voltaire.com Wed Jun 14 06:04:16 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 14 Jun 2006 16:04:16 +0300 (IDT) Subject: [openib-general] ib_mthca not loaded by pci hotplug on SLES10 RC2 system Message-ID: I have a SLES10 RC2 system whose infiniband drivers are the ones provided by the distro (ie not replaced by OFED). I have noticed that ib_mthca is not loaded when the system comes up, however it is loaded fine if i do it manually, and ping -f over ipoib works fine so the system is very much operative. Below are the output of modinfo and lspci and attached is /proc/config.gz Doing a diff on drivers/infiniband and include/rdma with 2.6.16 they are exactly the same as those of 2.6.16.16-1.6-smp (the sles10 kernel), the HCA FW is 4.6.2 Anyone has an idea what might be the issue? Or. 
rosemary:/usr/src # uname -a Linux rosemary 2.6.16.16-1.6-smp #1 SMP Mon May 22 14:37:02 UTC 2006 x86_64 x86_64 x86_64 GNU/Linux rosemary:/usr/src # modinfo ib_mthca filename: /lib/modules/2.6.16.16-1.6-smp/kernel/drivers/infiniband/hw/mthca/ib_mthca.ko author: Roland Dreier description: Mellanox InfiniBand HCA low-level driver license: Dual BSD/GPL version: 0.07 vermagic: 2.6.16.16-1.6-smp SMP gcc-4.1 depends: ib_mad,ib_core alias: pci:v000015B3d00005A44sv*sd*bc*sc*i* alias: pci:v00001867d00005A44sv*sd*bc*sc*i* alias: pci:v000015B3d00006278sv*sd*bc*sc*i* alias: pci:v00001867d00006278sv*sd*bc*sc*i* alias: pci:v000015B3d00006282sv*sd*bc*sc*i* alias: pci:v00001867d00006282sv*sd*bc*sc*i* alias: pci:v000015B3d00006274sv*sd*bc*sc*i* alias: pci:v00001867d00006274sv*sd*bc*sc*i* alias: pci:v000015B3d00005E8Csv*sd*bc*sc*i* alias: pci:v00001867d00005E8Csv*sd*bc*sc*i* srcversion: 8494F031EF8F0C77769CB89 parm: msi:attempt to use MSI if nonzero (int) parm: msi_x:attempt to use MSI-X if nonzero (int) rosemary:/usr/src # lspci 00:00.0 Host bridge: Intel Corporation E7520 Memory Controller Hub (rev 0c) 00:00.1 Class ff00: Intel Corporation E7525/E7520 Error Reporting Registers (rev 0c) 00:01.0 System peripheral: Intel Corporation E7520 DMA Controller (rev 0c) 00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev 0c) 00:06.0 PCI bridge: Intel Corporation E7520 PCI Express Port C (rev 0c) 00:1d.0 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02) 00:1d.1 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02) 00:1d.2 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #3 (rev 02) 00:1d.7 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02) 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2) 00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02) 00:1f.1 IDE interface: Intel 
Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02) 00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA Controller (rev 02) 00:1f.3 SMBus: Intel Corporation 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02) 01:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev 09) 01:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev 09) 03:04.0 Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev 03) 03:04.1 Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev 03) 04:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev a0) 05:0c.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27) -------------- next part -------------- A non-text attachment was scrubbed... Name: config.gz Type: application/x-gzip Size: 15590 bytes Desc: URL: From surs at cse.ohio-state.edu Wed Jun 14 06:04:48 2006 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Wed, 14 Jun 2006 09:04:48 -0400 Subject: [openib-general] MPI error when using a "system" call in mpi job. In-Reply-To: <20060613171147.35787125.weiny2@llnl.gov> References: <20060613171147.35787125.weiny2@llnl.gov> Message-ID: <44900970.9050006@cse.ohio-state.edu> Hello Ira, I am running the program on 2.6.15 (EM64T machine) and 2.6.16 (IA32 machine). The program seems to be running fine. Can you tell us which kernel you are using? We are using drivers pulled out of the trunk about 3-4 weeks back. Thanks, Sayantan. Ira Weiny wrote: >A co-worker here was seeing the following MPI error from his job: > >[1] Abort: [ldev2:1] Got completion with error, code=1 > at line 2148 in file viacheck.c > >After some tracking down he found that apparently if he used a "system" call >[int system(const char *string)] the next MPI command will fail. > >I have been able to reproduce this with the attached simple "hello" program. > >Perhaps someone has seen this type of error? 
Here is the output from 2 runs: > >weiny2 at ldev0:~/ior-test >17:04:04 > mpirun_rsh -rsh -hostfile hostfile -np 2 ./hello x >ldev1 >[0] Abort: [ldev1:0] Got completion with error, code=1 > at line 2148 in file viacheck.c >ldev2 >mpirun_rsh: Abort signaled from [0] >done. >weiny2 at ldev0:~/ior-test >17:05:23 > mpirun_rsh -rsh -hostfile hostfile -np 2 ./hello >now = 0.000000 >now = 0.000052 >now = 0.000094 >now = 0.000121 >now = 0.000151 >now = 0.001072 >now = 0.001102 >now = 0.001118 >now = 0.001141 >now = 0.001160 >done. > >We are running mvapich 0.9.7 and the openib trunk rev 6829. > >Thanks, >Ira > > > >------------------------------------------------------------------------ > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > -- http://www.cse.ohio-state.edu/~surs From mst at mellanox.co.il Wed Jun 14 06:17:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Jun 2006 16:17:55 +0300 Subject: [openib-general] ib_mthca not loaded by pci hotplug on SLES10 RC2 system In-Reply-To: References: Message-ID: <20060614131755.GA25417@mellanox.co.il> Quoting r. Or Gerlitz : > Subject: ib_mthca not loaded by pci hotplug on SLES10 RC2 system > > I have a SLES10 RC2 system whose infiniband drivers are the > ones provided by the distro (ie not replaced by OFED). > > I have noticed that ib_mthca is not loaded when the system comes up, > however it is loaded fine if i do it manually, and ping -f over ipoib > works fine so the system is very much operative. Generally you need to look at scripts under /etc/hotplug to figure out. -- MST From trimmer at silverstorm.com Wed Jun 14 06:24:10 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Wed, 14 Jun 2006 09:24:10 -0400 Subject: [openib-general] MPI error when using a "system" call in mpi job. 
Message-ID: > -----Original Message----- > From: Ira Weiny > Sent: Tuesday, June 13, 2006 8:12 PM > A co-worker here was seeing the following MPI error from his job: > > [1] Abort: [ldev2:1] Got completion with error, code=1 > at line 2148 in file viacheck.c > > After some tracking down he found that apparently if he used a "system" > call > [int system(const char *string)] the next MPI command will fail. > > I have been able to reproduce this with the attached simple "hello" > program. I have seen this type of problem a couple years ago with our proprietary stack and it took a bit of work to correct it. Here is what it could be: This sounds like a conflict between with fork() and the Vma handling in Open IB for registered memory. system() is a fork(), exec(), wait() sequence. fork generally shares the VMAs and marks the pages as copy on write. In your case it sounds like one of the pages written by the child process includes memory previously registered by the main process, and the child ended up with the original page. The result is that the virtual address in the main process is now pointing to the wrong physical page. It sounds like you happened on a "magic sequence" which demonstrates the problem. Do you have information on the OS version, CPU type, and server config? Todd Rimmer From halr at voltaire.com Wed Jun 14 06:24:23 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Jun 2006 09:24:23 -0400 Subject: [openib-general] [PATCH 3/4] New routing module which loads LFT tables from dump file. In-Reply-To: <20060611003243.22430.56582.stgit@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060611003243.22430.56582.stgit@sashak.voltaire.com> Message-ID: <1150291429.570.196696.camel@hal.voltaire.com> On Sat, 2006-06-10 at 20:32, Sasha Khapyorsky wrote: > This patch implements trivial routing module which able to load LFT > tables from dump file. 
Main features: > - support for unicast LFTs only, support for multicast can be added later > - this will run after min hop matrix calculation > - this will load switch LFTs according to the path entries introduced in > the dump file > - no additional checks will be performed (like is port connected, etc) > - in case when fabric LIDs were changed this will try to reconstruct LFTs > correctly if endport GUIDs are represented in the dump file (in order > to disable this GUIDs may be removed from the dump file or zeroed) > > The dump file format is compatible with output of 'ibroute' util and for > whole fabric may be generated with script like this: > > for sw_lid in `ibswitches | awk '{print $NF}'` ; do > ibroute $sw_lid > done > /path/to/dump_file > > , or using DR paths: > > > for sw_dr in `ibnetdiscover -v \ > | sed -ne '/^DR path .* switch /s/^DR path \[\(.*\)\].*$/\1/p' \ > | sed -e 's/\]\[/,/g' \ > | sort -u` ; do > ibroute -D ${sw_dr} > done > /path/to/dump_file > > > In order to activate new module use: > > opensm -R file -U /path/to/dump_file > > Signed-off-by: Sasha Khapyorsky Thanks! Applied with some cosmetic changes. 
-- Hal From halr at voltaire.com Wed Jun 14 06:39:04 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Jun 2006 09:39:04 -0400 Subject: [openib-general] [PATCH] OpenSM/osm_ucast_file: Eliminate compiler warning Message-ID: <1150292338.570.197234.camel@hal.voltaire.com> OpenSM/osm_ucast_file: Eliminate compiler warning osm_ucast_file.c: In function `do_ucast_file_load': osm_ucast_file.c:156: warning: passing arg 2 of `cl_qmap_apply_func' from incompatible pointer type Signed-off-by: Hal Rosenstock Index: opensm/osm_ucast_file.c =================================================================== --- opensm/osm_ucast_file.c (revision 8000) +++ opensm/osm_ucast_file.c (working copy) @@ -114,9 +114,9 @@ static void add_path(osm_opensm_t * p_os (osm_switch_get_node_ptr(p_sw)))); } -static void clean_sw_fwd_table(void *arg, void *context) +static void clean_sw_fwd_table(cl_map_item_t* const p_map_item, void *context) { - osm_switch_t *p_sw = arg; + osm_switch_t * const p_sw = (osm_switch_t *)p_map_item; uint16_t lid, max_lid; max_lid = osm_switch_get_max_lid_ho(p_sw); From ogerlitz at voltaire.com Wed Jun 14 06:48:09 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 14 Jun 2006 16:48:09 +0300 Subject: [openib-general] ib_mthca not loaded by pci hotplug on SLES10 RC2 system In-Reply-To: <20060614131755.GA25417@mellanox.co.il> References: <20060614131755.GA25417@mellanox.co.il> Message-ID: <44901399.7040408@voltaire.com> Michael S. Tsirkin wrote: > Quoting r. Or Gerlitz : >> Subject: ib_mthca not loaded by pci hotplug on SLES10 RC2 system >> >> I have a SLES10 RC2 system whose infiniband drivers are the >> ones provided by the distro (ie not replaced by OFED). >> >> I have noticed that ib_mthca is not loaded when the system comes up, >> however it is loaded fine if i do it manually, and ping -f over ipoib >> works fine so the system is very much operative. > > Generally you need to look at scripts under /etc/hotplug to figure out. OK, thanks... 
it turns out no hotplug package was installed, nor is anything related to hotplug found by the yast2 lookup, so i have installed hotplug-0.44-32.46 which is working for me on a sles9 system running kernel.org 2.6.16, yet the driver is not loaded on boot but does load manually (and works fine). Do i need to set up something other than installing the package & reboot? Or. From trimmer at silverstorm.com Wed Jun 14 06:54:38 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Wed, 14 Jun 2006 09:54:38 -0400 Subject: [openib-general] Maintainers List Message-ID: Is there a convenient list of the maintainers for all the various OFED components? Thanks, Todd Rimmer From mst at mellanox.co.il Wed Jun 14 07:38:59 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Jun 2006 17:38:59 +0300 Subject: [openib-general] ib_mthca not loaded by pci hotplug on SLES10 RC2 system In-Reply-To: <44901E8E.20503@voltaire.com> References: <44901E8E.20503@voltaire.com> Message-ID: <20060614143859.GE25417@mellanox.co.il> Quoting r. Or Gerlitz : > OK, thanks, well the content related to mthca of modules.pcimap on the > sles10 system is the same as in the sles9 system, see below, and still > the sles10 does not load the module. Fine, now all you need is to have a script run on hotplug, read this, and load the modules. -- MST From robert.j.woodruff at intel.com Wed Jun 14 08:37:57 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 14 Jun 2006 08:37:57 -0700 Subject: [openib-general] OFED 1.0 release schedule In-Reply-To: <200606141055.42449.jackm@mellanox.co.il> Message-ID: <000001c68fc8$849646b0$50a9070a@amr.corp.intel.com> Jack Morgenstein wrote, >We tested out IntelMPI over uDapl (from OFED 1.0-pre1) using the PALLAS test >suite, and it worked fine -- no problems. Evidently, you are trying to use >these new (and absent/unsupported) features.
>We do appreciate that these features are very important for scalability, and >we plan to include them in the 1.1 release which will follow shortly. These new options are needed to allow Intel MPI to scale up to larger clusters, 128+. If you did not run on a large cluster you would not have seen the problems. >As I posted to ipoib-general on June 7 >( http://openib.org/pipermail/openib-general/2006-June/022725.html ) Unfortunately, due to a problem with our email server, I was not receiving openib-general emails for the last week and missed this thread or would have spoken up then. Anyway, What is the criteria/decision process for deciding that something will or will not be included. I think that we have an equal say as to what should go in and what should not. Seems like there are double standards here. You are including last minute fixes for things like the Pathscale driver after RC6, but will not allow a fix that is needed by our product. woody From jackm at mellanox.co.il Wed Jun 14 09:04:22 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Wed, 14 Jun 2006 19:04:22 +0300 Subject: [openib-general] OFED 1.0 release schedule In-Reply-To: <000001c68fc8$849646b0$50a9070a@amr.corp.intel.com> References: <000001c68fc8$849646b0$50a9070a@amr.corp.intel.com> Message-ID: <200606141904.22321.jackm@mellanox.co.il> On Wednesday 14 June 2006 18:37, Bob Woodruff wrote: > Unfortunately, due to a problem with our email server, I was not receiving > openib-general emails for the last week and missed this thread or would > have spoken up then. > > Anyway, What is the criteria/decision process for deciding that something > will > or will not be included. I think that we have an equal say as to what > should go in and what should not. Seems like there are double standards > here. You are including last minute fixes for things like the Pathscale > driver after RC6, but will not allow a fix that is needed by our product. > The Pathscale fixes affect ONLY Pathscale users. 
Unfortunately, the changes you are requesting affect ALL the ulp's -- IPoIB, SDP, iSer,... , and NOT just your product. These changes would mean activating the ib_local_sa module (which has NOT been QA'd under OFED). There was a long thread quite a while ago on this topic, starting May 4: see ( http://openib.org/pipermail/openib-general/2006-May/020977.html ) We decided then not to include the local_sa module, and heard no objections. Since the change you request (which is a kernel-level change) affects many products, not just IntelMPI, it is not possible to just include it at the last minute and hope for the best (due to lack of QA). - Jack From swise at opengridcomputing.com Wed Jun 14 09:11:08 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 14 Jun 2006 11:11:08 -0500 Subject: [openib-general] [PATCH v2 1/2] iWARP Connection Manager. In-Reply-To: <1150235196.17394.91.camel@stevo-desktop> References: <000001c68f31$78910fe0$24268686@amr.corp.intel.com> <1150235196.17394.91.camel@stevo-desktop> Message-ID: <1150301468.28999.22.camel@stevo-desktop> On Tue, 2006-06-13 at 16:46 -0500, Steve Wise wrote: > On Tue, 2006-06-13 at 14:36 -0700, Sean Hefty wrote: > > >> Er...no. It will lose this event. Depending on the event...the carnage > > >> varies. We'll take a look at this. > > >> > > > > > >This behavior is consistent with the Infiniband CM (see > > >drivers/infiniband/core/cm.c function cm_recv_handler()). But I think > > >we should at least log an error because a lost event will usually stall > > >the rdma connection. > > > > I believe that there's a difference here. For the Infiniband CM, an allocation > > error behaves the same as if the received MAD were lost or dropped. Since MADs > > are unreliable anyway, it's not so much that an IB CM event gets lost, as it > > doesn't ever occur. A remote CM should retry the send, which hopefully allows > > the connection to make forward progress. > > > > hmm. Ok. I see. 
I misunderstood the code in cm_recv_handler(). > > Tom and I have been talking about what we can do to not drop the event. > Stay tuned. Here's a simple solution that solves the problem: For any given cm_id, there are a finite (and small) number of outstanding CM events that can be posted. So we just pre-allocate them when the cm_id is created and keep them on a free list hanging off of the cm_id struct. Then the event handler function will pull from this free list. The only case where there is any non-finite issue is on the passive listening cm_id. Each incoming connection request will consume a work struct. So based on client connects, we could run out of work structs. However, the CMA has the concept of a backlog, which is defined as the max number of pending unaccepted connection requests. So we allocate these work structs based on that number (or a computation based on that number), and if we run out, we simply drop the incoming connection request due to backlog overflow (I suggest we log the drop event too). When a MPA connection request is dropped, the (IETF conforming) MPA client will eventually time out the connection and the consumer can retry. Comments? From robert.j.woodruff at intel.com Wed Jun 14 09:12:40 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 14 Jun 2006 09:12:40 -0700 Subject: [openib-general] OFED 1.0 release schedule In-Reply-To: <200606141904.22321.jackm@mellanox.co.il> Message-ID: <000101c68fcd$5c2c69c0$50a9070a@amr.corp.intel.com> Jack wrote, >Since the change you request (which is a kernel-level change) affects many >products, not just IntelMPI, it is not possible to just include it at the >last minute and hope for the best (due to lack of QA). >- Jack At this point if the 1.0-pre1 tar ball goes gold this Friday as scheduled, then I guess we will have to live with what we have and people running larger clusters will need to use the trunk until OFED 1.1. 
If, however, it is decided that there needs to be another RC to fix the other problems with things like pathscale not working with SDP or uDAPL, then I think we should allow the setops fixes in also, but I understand that would mean going through another QA cycle. my 2 cents. woody From jlentini at netapp.com Wed Jun 14 09:17:15 2006 From: jlentini at netapp.com (James Lentini) Date: Wed, 14 Jun 2006 12:17:15 -0400 (EDT) Subject: [openib-general] [PATCH] uDAPL cma provider - add missing ia_attributes for the ia_query In-Reply-To: References: Message-ID: On Tue, 13 Jun 2006, Arlin Davis wrote: > > James, > > Here are some changes to include some missing IA attributes during a > query. Looks good. Committed in revision 8008. From betsy at pathscale.com Wed Jun 14 09:20:54 2006 From: betsy at pathscale.com (Betsy Zeller) Date: Wed, 14 Jun 2006 09:20:54 -0700 Subject: [openib-general] OFED 1.0 release schedule In-Reply-To: <000101c68fcd$5c2c69c0$50a9070a@amr.corp.intel.com> References: <000101c68fcd$5c2c69c0$50a9070a@amr.corp.intel.com> Message-ID: <1150302054.3425.66.camel@sarium.pathscale.com> On Wed, 2006-06-14 at 09:12 -0700, Bob Woodruff wrote: > with things like pathscale not working with SDP or uDAPL, then I think Woody - For us, SDP ran just fine on InfiniPath on the RHEL4 tests we ran yesterday with the OFED pre-release candidate. Can you send me the output you got when you tried it? Thanks, Betsy -- Betsy Zeller Director of Software Engineering QLogic Corporation System Interconnect Group (formerly PathScale, Inc) 2071 Stierlin Court, Suite 200 Mountain View, CA, 94043 1-650-934-8088 From jlentini at netapp.com Wed Jun 14 09:34:24 2006 From: jlentini at netapp.com (James Lentini) Date: Wed, 14 Jun 2006 12:34:24 -0400 (EDT) Subject: [openib-general] communication established affiliated asynchronous event Message-ID: The IBTA spec (volume 1, version 1.2) describes a communication established affiliated asynchronous event.
Is this event supposed to be delivered to the verbs consumer or the IB CM? We've seen this event delivered to our NFS-RDMA server and aren't sure what to do with it. james From jlentini at netapp.com Wed Jun 14 09:39:06 2006 From: jlentini at netapp.com (James Lentini) Date: Wed, 14 Jun 2006 12:39:06 -0400 (EDT) Subject: [openib-general] communication established affiliated asynchronous event In-Reply-To: References: Message-ID: On Wed, 14 Jun 2006, James Lentini wrote: > > The IBTA spec (volume 1, version 1.2) describes a communication > established affiliated asynchronous event. The description is on page 637. > Is this event supposed to be delivered to the verbs consumer or the IB > CM? > > We've seen this event delivered to our NFS-RDMA server and aren't sure > what to do with it. > > james From mshefty at ichips.intel.com Wed Jun 14 09:39:34 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 09:39:34 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060614084041.GA19518@mellanox.co.il> References: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> <20060614084041.GA19518@mellanox.co.il> Message-ID: <44903BC6.8020302@ichips.intel.com> Michael S. Tsirkin wrote: > Would keeping around MADs in the done list consume significant extra memory > resources? For kernel clients, it shouldn't consume any additional memory. For userspace clients, it would continue to consume memory until a response were generated. Currently, that memory is freed once the userspace application retrieves the MAD from the kernel. > What limits this memory? That's part of the discussion. Today, there is NO limit on how much memory a userspace application can consume. It will continue to consume memory as long as it doesn't call to receive a MAD. > Would a small client that would normally > just send RMPP, get a response and exit will be slowed down significantly while > the agent learns? Clients that send requests are unaffected.
Clients that use one of the pre-defined classes or known methods would also be unaffected. The learning only affects new methods, and would typically be limited to receiving one MAD for each method. > Would a buggy application confuse the umad module, corrupting what > the agent learns? Would a buggy application confuse the umad module, corrupting > MAD processing for all other applications? A buggy application would only affect itself, plus whoever it was trying to communicate with. We can't really fix the latter though. > The original approach by Jack of detecting, and dropping, duplicate responses > instead of duplicate requests seemed much easier to me. The only disadvantage > it has that I'm aware of is a slight performance hit for duplicate processing of > each request. But all the done_list scans proposed seem even more CPU intensive. Jack's approach results in scanning a list, plus has the overhead of duplicating the processing. The other problem is that DS RMPP requires maintaining state between receiving a request and the generation of a response. This approach provides a mechanism that can be used to maintain that state (i.e. the received request). By applying Jack's patch, I'll end up having to invent another way to store and retrieve the state. - Sean From mshefty at ichips.intel.com Wed Jun 14 09:46:21 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 09:46:21 -0700 Subject: [openib-general] communication established affiliated asynchronous event In-Reply-To: References: Message-ID: <44903D5D.10102@ichips.intel.com> James Lentini wrote: > The IBTA spec (volume 1, version 1.2) describes a communication > established affiliated asynchronous event. > > Is this event supposed to be delivered to the verbs consumer or the IB > CM? > > We've seen this event delivered to our NFS-RDMA server and aren't sure > what to do with it. This event is delivered to the verbs consumer, since it occurs on the QP.
It's expected that the consumer will call ib_cm_establish. Although, I would guess that you can probably ignore the event, under the assumption that the RTU will eventually be received by the local CM. - Sean From mst at mellanox.co.il Wed Jun 14 09:53:27 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Jun 2006 19:53:27 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <44903BC6.8020302@ichips.intel.com> References: <44903BC6.8020302@ichips.intel.com> Message-ID: <20060614165327.GI25417@mellanox.co.il> Quoting r. Sean Hefty : > > Would a small client that would normally just send RMPP, get a response and > > exit will be slowed down significantly while the agent learns? > > Clients that send requests are unaffected. Clients that use one of the > pre-defined classes or known methods would also be unaffected. The learning > only affects new methods, and would typically be limited to the receiving one > MAD for each method. Is that per-agent, or global? If per-agent, can this hurt user that writes scripts using management utilities? These will typically send or receive something and exit. No? > > Would a buggy application confuse the umad module, corrupting the agent > > learns? Would a buggy application confuse the umad module, corrupting MAD > > processing for all other applications? > > A buggy application would only affect itself, plus whoever it was trying to > communicate with. We can't really fix the latter though. Is the table of methods maintained per agent then? > The other problem is that DS RMPP requires maintaining state between receiving > a request and the generation of a response. It does? Why does it? 
-- MST From robert.j.woodruff at intel.com Wed Jun 14 09:56:04 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 14 Jun 2006 09:56:04 -0700 Subject: [openib-general] OFED 1.0 release schedule Message-ID: <1AC79F16F5C5284499BB9591B33D6F0007F73FBD@orsmsx408> >Woody - For us, SDP ran just fine on InfiniPath on the RHEL4 tests we >ran yesterday the the OFED pre-release candidate. Can you send me the >output you got when you tried it? I ran a modified netpipe over SDP and it hung somewhere around size > 4k. It was on a production Lindenhurst Xeon system. This works fine with the Mellanox cards. I also had problems with uDAPL over pathscale (and thus Intel MPI) and suspect problems with RDMA operations. I did not have time to debug it any further. Were you able to get perftest running, as Arlin suggested to your developers a couple of weeks back ? Right now, I had to pull the pathscale cards to complete regression testing of 1.0-pre1 with Intel MPI and since pathscale does not work with Intel MPI, I put the Mellanox cards back in. From weiny2 at llnl.gov Wed Jun 14 09:59:58 2006 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 14 Jun 2006 09:59:58 -0700 Subject: [openib-general] MPI error when using a "system" call in mpi job. In-Reply-To: <44900970.9050006@cse.ohio-state.edu> References: <20060613171147.35787125.weiny2@llnl.gov> <44900970.9050006@cse.ohio-state.edu> Message-ID: <20060614095958.59c7dcc7.weiny2@llnl.gov> We are on a modified RedHat RHEL4 kernel. Roughly 2.6.9. :-( I am going to try a 2.6.16 kernel I have built to see if it changes. Ira On Wed, 14 Jun 2006 09:04:48 -0400 Sayantan Sur wrote: > Hello Ira, > > I am running the program on 2.6.15 (EM64T machine) and 2.6.16 (IA32 > machine). The program seems to be running fine. Can you tell us which > kernel you are using? We are using drivers pulled out of the trunk > about 3-4 weeks back. > > Thanks, > Sayantan. 
> > Ira Weiny wrote: > > >A co-worker here was seeing the following MPI error from his job: > > > >[1] Abort: [ldev2:1] Got completion with error, code=1 > > at line 2148 in file viacheck.c > > > >After some tracking down he found that apparently if he used a > >"system" call [int system(const char *string)] the next MPI command > >will fail. > > > >I have been able to reproduce this with the attached simple "hello" > >program. > > > >Perhaps someone has seen this type of error? Here is the output > >from 2 runs: > > > >weiny2 at ldev0:~/ior-test > >17:04:04 > mpirun_rsh -rsh -hostfile hostfile -np 2 ./hello x > >ldev1 > >[0] Abort: [ldev1:0] Got completion with error, code=1 > > at line 2148 in file viacheck.c > >ldev2 > >mpirun_rsh: Abort signaled from [0] > >done. > >weiny2 at ldev0:~/ior-test > >17:05:23 > mpirun_rsh -rsh -hostfile hostfile -np 2 ./hello > >now = 0.000000 > >now = 0.000052 > >now = 0.000094 > >now = 0.000121 > >now = 0.000151 > >now = 0.001072 > >now = 0.001102 > >now = 0.001118 > >now = 0.001141 > >now = 0.001160 > >done. > > > >We are running mvapich 0.9.7 and the openib trunk rev 6829. > > > >Thanks, > >Ira > > > > > > > >------------------------------------------------------------------------ > > > >_______________________________________________ > >openib-general mailing list > >openib-general at openib.org > >http://openib.org/mailman/listinfo/openib-general > > > >To unsubscribe, please visit > >http://openib.org/mailman/listinfo/openib-general > > > > -- > http://www.cse.ohio-state.edu/~surs > From mshefty at ichips.intel.com Wed Jun 14 10:11:40 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 10:11:40 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060614165327.GI25417@mellanox.co.il> References: <44903BC6.8020302@ichips.intel.com> <20060614165327.GI25417@mellanox.co.il> Message-ID: <4490434C.6020003@ichips.intel.com> Michael S. 
Tsirkin wrote: > Is that per-agent, or global? If per-agent, can this hurt user that writes > scripts using management utilities? These will typically send or receive > something and exit. No? This is per agent. The proposal would only affect applications that generate the responses. (Think of it as enforcing that all response MADs match with a received request, so a user can't generate a response for a request that they never received.) An agent that sends a request, and receives the response is unaffected. > Is the table of methods maintained per agent then? That would be my plan; although, we could probably make it global. >>The other problem is that DS RMPP requires maintaining state between receiving >>a request and the generation of a response. > > It does? Why does it? It needs to track receiving an ACK of the final ACK to the request, which carries the initial window size for the response. Conceptually, what happens is: -- request --> <-- ACK request -- -- ACK (response window) --> <-- response -- -- ACK response -> - Sean From robert.j.woodruff at intel.com Wed Jun 14 10:13:24 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 14 Jun 2006 10:13:24 -0700 Subject: [openib-general] MPI error when using a "system" call in mpi job. In-Reply-To: <20060614095958.59c7dcc7.weiny2@llnl.gov> Message-ID: <000201c68fd5$d958f690$50a9070a@amr.corp.intel.com> >Subject: Re: [openib-general] MPI error when using a "system" call in mpi job. >We are on a modified RedHat RHEL4 kernel. Roughly 2.6.9. :-( >I am going to try a 2.6.16 kernel I have built to see if it changes. >Ira We have also seen problems with the 2.6.9 kernel and system call with Intel MPI. The problem seems to be fixed in the VM system somewhere around 2.6.15. I tried to look at what was changed to see if there was an easy patch that one could make to the 2.6.9 kernel to fix the problem, but it was not intuitively obvious what exactly they changed that fixed the problem. 
woody From mshefty at ichips.intel.com Wed Jun 14 10:25:59 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 10:25:59 -0700 Subject: [openib-general] oops on trunk In-Reply-To: <20060614084920.GC19518@mellanox.co.il> References: <20060614084920.GC19518@mellanox.co.il> Message-ID: <449046A7.8090809@ichips.intel.com> How many nodes were running on the fabric when this happened? This was just caused by executing modprobe -r ib_ipoib, right? I'm still completely stumped on how this is occurring, and haven't been able to reproduce it. - Sean From halr at voltaire.com Wed Jun 14 10:20:53 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Jun 2006 13:20:53 -0400 Subject: [openib-general] [PATCH 3/4] New routing module which loads LFT tables from dump file. In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30236884D@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30236884D@mtlexch01.mtl.com> Message-ID: <1150305614.570.205033.camel@hal.voltaire.com> Hi Eitan, On Wed, 2006-06-14 at 02:48, Eitan Zahavi wrote: > Hi Hal, Sasha, > > Regarding OpenSM coding style: > > Sasha wrote: > > > > Really? Don't want to bother with examples, but I may see almost any > > "combination" in OpenSM and it is not clear for me which one is common > > (the coding style and identation are different even from file to > file). > [EZ] This bothers me as I think we should use a consistent coding style. > You might also remember we had put in place a both a script to do > automatic indentation and coding style rule fixes (osm_indent and > osm_check_n_fix) > > I did check for all "else" statements: > osm/opensm>grep else *.c | wc -l > 397 > osm/opensm>grep else *.c | grep -v "{" | grep -v "}" | wc -l > 361 > > So you can see only <10% (36 out of 397) "else" statement are not > coding style consistent. 
> Checking what is the code that is "non standard": > osm/opensm>grep else *.c | grep "{" | awk '{print $1}' | sort | uniq -c > | sort -rn > 7 osm_console.c: > 6 osm_prtn_config.c: > 3 st.c: > 3 osm_sa_multipath_record.c: > 2 osm_ucast_mgr.c: > 2 osm_sa_path_record.c: > 1 osm_sa_mcmember_record.c: > 1 osm_sa_informinfo.c: > 1 osm_sa_class_port_info.c: > 1 osm_multicast.c: > > You can see the majority of these mismatches are in code introduced by > Hal and yourself. While some of those are ours (clearly osm_console.c, osm_prtn_config.c, and osm_sa_multipath_record.c), not all of them are. I'm sure some came from you, Yael, and Ofer so let's not be pointing fingers. I don't bother to kick back each patch on these details. If I did, we would get nowhere. I fixed a number of the ones you pointed to above just now. But let's back up a bit... > I think OpenSM should use a single coding style. This is the key but now is (still) not the time. How about we take this up in about a month, maybe sooner if things settle down a little quicker? I'll bring this up on the list when I think the time is right. I do think it will take time to agree on this and a lot of the rules will be arbitrary. > My proposal is that we > update our osm_indent script with a set of rules we agree on and apply > to the entire tree. I'm unconvinced that osm_indent is sufficient. I think a lot of human attention is needed afterwards. I've seen that happen before. How much time do you have to invest in doing this?
-- Hal From mshefty at ichips.intel.com Wed Jun 14 10:40:47 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 10:40:47 -0700 Subject: [openib-general] [PATCH 0/5] multicast abstraction In-Reply-To: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com> References: <000d01c68c0a$6222b080$ff0da8c0@amr.corp.intel.com> Message-ID: <44904A1F.9010701@ichips.intel.com> Sean Hefty wrote: > This patch series enhances support for joining and leaving multicast groups, > providing the following functionality: I'd like to commit both the multicast and UD QP support change sets. Are there any disagreements with committing these to the trunk? This would provide a single interface for setting up RC, UD, and multicast communication. The only mentioned drawback is that iWarp does not define support for UD or multicast communication. - Sean From mst at mellanox.co.il Wed Jun 14 10:59:34 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Jun 2006 20:59:34 +0300 Subject: [openib-general] oops on trunk In-Reply-To: <449046A7.8090809@ichips.intel.com> References: <449046A7.8090809@ichips.intel.com> Message-ID: <20060614175934.GA27134@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] oops on trunk > > How many nodes were running on the fabric when this happened? back to back > This was just > caused by executing modprobe -r ib_ipoib, right? yes -- MST From mst at mellanox.co.il Wed Jun 14 11:05:02 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Jun 2006 21:05:02 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> References: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> Message-ID: <20060614180502.GB27134@mellanox.co.il> Quoting r. Sean Hefty : > One of the ideas then, is for the kernel umad module to learn which MADs > generate responses. 
It would do this by updating an entry to a table whenever > a response MAD is generated. A received MAD would check against the table to > see if a response is supposed to be generated. If not, then the MAD would be > freed after userspace claims it. If a response is expected, then the MAD > would not be freed until the response was generated. Another concern with this approach: consider an application that accepts incoming MAD requests and drops some of them. With current code it can do this safely and remote side will retry. With the duplicate tracking in umad module that you propose, MAD will stay in the list forever, and application will never again get called. This kind of subtle behaviour change seems to me worse than outright ABI breakage. -- MST From caitlinb at broadcom.com Wed Jun 14 11:06:44 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 14 Jun 2006 11:06:44 -0700 Subject: [openib-general] [PATCH v2 1/2] iWARP Connection Manager. Message-ID: <54AD0F12E08D1541B826BE97C98F99F1576635@NT-SJCA-0751.brcm.ad.broadcom.com> netdev-owner at vger.kernel.org wrote: > On Tue, 2006-06-13 at 16:46 -0500, Steve Wise wrote: >> On Tue, 2006-06-13 at 14:36 -0700, Sean Hefty wrote: >>>>> Er...no. It will lose this event. Depending on the event...the >>>>> carnage varies. We'll take a look at this. >>>>> >>>> >>>> This behavior is consistent with the Infiniband CM (see >>>> drivers/infiniband/core/cm.c function cm_recv_handler()). But I >>>> think we should at least log an error because a lost event will >>>> usually stall the rdma connection. >>> >>> I believe that there's a difference here. For the Infiniband CM, an >>> allocation error behaves the same as if the received MAD were lost >>> or dropped. Since MADs are unreliable anyway, it's not so much that >>> an IB CM event gets lost, as it doesn't ever occur. A remote CM >>> should retry the send, which hopefully allows the > connection to make forward progress. >>> >> >> hmm. Ok. I see. 
I misunderstood the code in cm_recv_handler(). >> >> Tom and I have been talking about what we can do to not drop the >> event. Stay tuned. > > Here's a simple solution that solves the problem: > > For any given cm_id, there are a finite (and small) number of > outstanding CM events that can be posted. So we just > pre-allocate them when the cm_id is created and keep them on > a free list hanging off of the cm_id struct. Then the event > handler function will pull from this free list. > > The only case where there is any non-finite issue is on the > passive listening cm_id. Each incoming connection request > will consume a work struct. So based on client connects, we > could run out of work structs. > However, the CMA has the concept of a backlog, which is > defined as the max number of pending unaccepted connection > requests. So we allocate these work structs based on that > number (or a computation based on that number), and if we run > out, we simply drop the incoming connection request due to > backlog overflow (I suggest we log the drop event too). > When a MPA connection request is dropped, the (IETF > conforming) MPA client will eventually time out the > connection and the consumer can retry. > > Comments? > If the IWCM cannot accept a Connection Request event from the driver then *someone* should generate a non-peer reject MPA Response frame. Since the IWCM does not have the resources to relay the event, it probably does not have the resources to generate the MPA Response frame either. So simply returning an "I'm Busy" error and expecting the driver to handle it makes sense to me. 
From mshefty at ichips.intel.com Wed Jun 14 11:27:06 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 11:27:06 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060614180502.GB27134@mellanox.co.il> References: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> <20060614180502.GB27134@mellanox.co.il> Message-ID: <449054FA.7060304@ichips.intel.com> Michael S. Tsirkin wrote: > Another concern with this approach: consider an application that accepts > incoming MAD requests and drops some of them. With current code it can do this > safely and remote side will retry. With the duplicate tracking in umad module > that you propose, MAD will stay in the list forever, and application will never > again get called. This is why I proposed a timeout for responses. > This kind of subtle behaviour change seems to me worse than outright ABI > breakage. If everyone is okay with breaking the ABI, then I would add send completion notification to umad, and put the responsibility on callers not to generate duplicate responses. - Sean From sashak at voltaire.com Wed Jun 14 11:39:32 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 Jun 2006 21:39:32 +0300 Subject: [openib-general] [PATCH 0/4] opensm: Loading unicast routes from the file In-Reply-To: <20060614000610.GJ23320@durango.c3.lanl.gov> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060613170246.GH23320@durango.c3.lanl.gov> <20060613200035.GG10482@sashak.voltaire.com> <20060614000610.GJ23320@durango.c3.lanl.gov> Message-ID: <20060614183932.GB10544@sashak.voltaire.com> Hi Greg, On 18:06 Tue 13 Jun , Greg Johnson wrote: > On Tue, Jun 13, 2006 at 11:00:35PM +0300, Sasha Khapyorsky wrote: > > Hi Greg, > > > > On 11:02 Tue 13 Jun , Greg Johnson wrote: > > > It seems to load the routes generated by the dump > > > script, but afterward it is not possible to dump the routes again. > > > > This means you have broken LFTs now. 
Probably I know what is going on > > here - new LFTs don't have " 0" entries, and switches are > > not accessible by LIDs anymore. > > > > Please update 'ibroute' utility (diags/) from the trunk and recreate the > > dump file - this should fix the problem. > > > > (Sorry, I forgot to mention 'ibroute' upgrade issue in patch announcement). > > Ok, that fixed it. It works fine now. Good. Thanks for trying this. > Any chance of making our own lid -> guid assignments while we are at it? Does guid2lid file not help? As I understand you want to load predefined LIDs, right? Sasha. From mlang at lanl.gov Wed Jun 14 11:48:06 2006 From: mlang at lanl.gov (michael k lang) Date: Wed, 14 Jun 2006 12:48:06 -0600 Subject: [openib-general] [PATCH 0/4] opensm: Loading unicast routes from the file In-Reply-To: <20060614183932.GB10544@sashak.voltaire.com> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060613170246.GH23320@durango.c3.lanl.gov> <20060613200035.GG10482@sashak.voltaire.com> <20060614000610.GJ23320@durango.c3.lanl.gov> <20060614183932.GB10544@sashak.voltaire.com> Message-ID: <1150310886.16684.12.camel@jumper.c3.lanl.gov.c3.lanl.gov> On Wed, 2006-06-14 at 21:39 +0300, Sasha Khapyorsky wrote: > Hi Greg, > > On 18:06 Tue 13 Jun , Greg Johnson wrote: > > On Tue, Jun 13, 2006 at 11:00:35PM +0300, Sasha Khapyorsky wrote: > > > Hi Greg, > > > > > > On 11:02 Tue 13 Jun , Greg Johnson wrote: > > > > It seems to load the routes generated by the dump > > > > script, but afterward it is not possible to dump the routes again. > > > > > > This means you have broken LFTs now. Probably I know what is going on > > > here - new LFTs don't have " 0" entries, and switches are > > > not accessible by LIDs anymore. > > > > > > Please update 'ibroute' utility (diags/) from the trunk and recreate the > > > dump file - this should fix the problem. > > > > > > (Sorry, I forgot to mention 'ibroute' upgrade issue in patch announcement). > > > > Ok, that fixed it. 
It works fine now. > > Good. Thanks for trying this. > > > Any chance of making our own lid -> guid assignments while we are at it? > > Does guid2lid file not help? Ya, guid2lid has all the info we need, we were just trying to take off one more level of indirection; it's not necessary. > > As I understand you want to load predefined LIDs, right? > > Sasha. --Mike From halr at voltaire.com Wed Jun 14 11:58:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Jun 2006 14:58:34 -0400 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <449054FA.7060304@ichips.intel.com> References: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> <20060614180502.GB27134@mellanox.co.il> <449054FA.7060304@ichips.intel.com> Message-ID: <1150311514.4506.1171.camel@hal.voltaire.com> On Wed, 2006-06-14 at 14:27, Sean Hefty wrote: > Michael S. Tsirkin wrote: > > Another concern with this approach: consider an application that accepts > > incoming MAD requests and drops some of them. With current code it can do this > > safely and remote side will retry. With the duplicate tracking in umad module > > that you propose, MAD will stay in the list forever, and application will never > > again get called. > > This is why I proposed a timeout for responses. > > > This kind of subtle behaviour change seems to me worse than outright ABI > > breakage. > > If everyone is okay with breaking the ABI, then I would add send completion > notification to umad, and put the responsibility on callers not to generate > duplicate responses. Is this a better architectural solution ? I'm not sure I totally understand what the new ABI would be and its impact on existing applications. Is there an example of what this might look like ?
-- Hal > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Wed Jun 14 12:13:02 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 Jun 2006 22:13:02 +0300 Subject: [openib-general] [PATCH] OpenSM/osm_ucast_file: Eliminate compiler warning In-Reply-To: <1150292338.570.197234.camel@hal.voltaire.com> References: <1150292338.570.197234.camel@hal.voltaire.com> Message-ID: <20060614191302.GH10544@sashak.voltaire.com> On 09:39 Wed 14 Jun , Hal Rosenstock wrote: > OpenSM/osm_ucast_file: Eliminate compiler warning > > osm_ucast_file.c: In function `do_ucast_file_load': > osm_ucast_file.c:156: warning: passing arg 2 of `cl_qmap_apply_func' > from incompatible pointer type > > Signed-off-by: Hal Rosenstock Missed that. Thanks for fixing. Sasha From eitan at mellanox.co.il Wed Jun 14 12:17:06 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 14 Jun 2006 22:17:06 +0300 Subject: [openib-general] [PATCH 3/4] New routing module which loads LFT tables from dump file. Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236885E@mtlexch01.mtl.com> Hi Hal, My point is clear: the fact there are some files that are inconsistent should not open the door for a total mess. I think the simple statistics do answer the question raised: " ... and it is not clear for me which one is common". If we do delay the major cleanup we should at least try to stick to the existing standard. From the mail thread it seems the intention is different and this makes me really uncomfortable. Eitan > > Hi Eitan, > > On Wed, 2006-06-14 at 02:48, Eitan Zahavi wrote: > > Hi Hal, Sasha, > > > > Regarding OpenSM coding style: > > > > Sasha wrote: > > > > > > Really?
Don't want to bother with examples, but I may see almost any > > "combination" in OpenSM and it is not clear for me which one is common > > (the coding style and indentation are different even from file to > > file). > > [EZ] This bothers me as I think we should use a consistent coding style. > > You might also remember we had put in place both a script to do > > automatic indentation and coding style rule fixes (osm_indent and > > osm_check_n_fix) > > > > I did check for all "else" statements: > > osm/opensm>grep else *.c | wc -l > > 397 > > osm/opensm>grep else *.c | grep -v "{" | grep -v "}" | wc -l > > 361 > > > > So you can see only <10% (36 out of 397) "else" statements are not > > coding style consistent. > > Checking what is the code that is "non standard": > > osm/opensm>grep else *.c | grep "{" | awk '{print $1}' | sort | uniq -c > > | sort -rn > > 7 osm_console.c: > > 6 osm_prtn_config.c: > > 3 st.c: > > 3 osm_sa_multipath_record.c: > > 2 osm_ucast_mgr.c: > > 2 osm_sa_path_record.c: > > 1 osm_sa_mcmember_record.c: > > 1 osm_sa_informinfo.c: > > 1 osm_sa_class_port_info.c: > > 1 osm_multicast.c: > > > > You can see the majority of these mismatches are in code introduced by > > Hal and yourself. > > While some of those are ours (clearly osm_console.c, osm_prtn_config.c, > and osm_sa_multipath_record.c), not all of them are. I'm sure some came > from you, Yael, and Ofer so let's not be pointing fingers. I don't > bother to kick back each patch on these details. If I did, we would get > nowhere. I fixed a number of the ones you pointed to above just now. > > But let's back up a bit... > > > I think OpenSM should use a single code style. > > This is the key but now is (still) not the time. How about we take this up in about a month, maybe sooner if things settle down a little quicker ? I'll bring this up on the list when I think the time is right. I do think it will take time to agree on this and a lot of the rules will be arbitrary.
> > > My proposal is that we > update our osm_indent script with a set of rules we agree on and apply > to the entire tree. > > I'm unconvinced that osm_indent is sufficient. I think a lot of human > attention is needed afterwards. I've seen that happen before. How much > time do you have to invest in doing this ? > > -- Hal From mshefty at ichips.intel.com Wed Jun 14 12:23:27 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 12:23:27 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <1150311514.4506.1171.camel@hal.voltaire.com> References: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> <20060614180502.GB27134@mellanox.co.il> <449054FA.7060304@ichips.intel.com> <1150311514.4506.1171.camel@hal.voltaire.com> Message-ID: <4490622F.30802@ichips.intel.com> Hal Rosenstock wrote: >>If everyone is okay with breaking the ABI, then I would add send completion >>notification to umad, and put the responsibility on callers not to generate >>duplicate responses. > > Is this a better architectural solution ? Not sure. It doesn't solve supporting DS RMPP, which requires maintaining state between receiving a request and the generation of a response. > I'm not sure I totally understand what the new ABI would be and its > impact on existing applications. Is there an example of what this might > look like ? Currently, the only send MADs that are reported to the user are requests that time out waiting for a response. We could probably change that to report all send completions. Failed sends are reported using a status of timeout, with the MAD header copied to userspace. So the length of the MAD indicates if it was a send or receive. From an implementation standpoint, this approach likely requires only minor changes to the kernel code. But any userspace applications that send MADs would need to change to handle this. The list of applications that do send MADs is likely fairly small however.
If we wanted to be more restrictive on which applications would be affected, we could only generate send completions for response MADs. - Sean From halr at voltaire.com Wed Jun 14 12:25:03 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Jun 2006 15:25:03 -0400 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <4490622F.30802@ichips.intel.com> References: <000101c68f13$f1075ce0$34cc180a@amr.corp.intel.com> <20060614180502.GB27134@mellanox.co.il> <449054FA.7060304@ichips.intel.com> <1150311514.4506.1171.camel@hal.voltaire.com> <4490622F.30802@ichips.intel.com> Message-ID: <1150313103.4506.2186.camel@hal.voltaire.com> On Wed, 2006-06-14 at 15:23, Sean Hefty wrote: > Hal Rosenstock wrote: > >>If everyone is okay with breaking the ABI, then I would add send completion > >>notification to umad, and put the responsibility on callers not to generate > >>duplicate responses. > > > > Is this a better architectural solution ? > > Not sure. Then it's likely not worth breaking the ABI which will cause more pain than it's worth. > It doesn't solve supporting DS RMPP, which requires maintaining state > between receiving a request and the generation of a response. > > > I'm not sure I totally understand what the new ABI would be and its > > impact on existing applications. Is there an example of what this might > > look like ? > > Currently, the only send MADs that are reported to the user are requests that > time out waiting for a response. We could probably change that to report all > send completions. Failed sends are reported using a status of timeout, with the > MAD header copied to userspace. So the length of the MAD indicates if it was a > send or receive. > > From an implementation stand point, this approach likely requires only minor > changes to the kernel code. But any userspace applications that send MADs would > need to change to handle this. The list of application that do send MADs is > likely fairly small however. 
It's not so small. > If we wanted to be more restrictive on which applications would be affected, we > could only generate send completions for response MADs. I think that would only pare it down a little. -- Hal > - Sean From sashak at voltaire.com Wed Jun 14 12:42:06 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 Jun 2006 22:42:06 +0300 Subject: [openib-general] [PATCH 0/4] opensm: Loading unicast routes from the file In-Reply-To: <1150310886.16684.12.camel@jumper.c3.lanl.gov.c3.lanl.gov> References: <20060611002758.22430.63061.stgit@sashak.voltaire.com> <20060613170246.GH23320@durango.c3.lanl.gov> <20060613200035.GG10482@sashak.voltaire.com> <20060614000610.GJ23320@durango.c3.lanl.gov> <20060614183932.GB10544@sashak.voltaire.com> <1150310886.16684.12.camel@jumper.c3.lanl.gov.c3.lanl.gov> Message-ID: <20060614194206.GJ10544@sashak.voltaire.com> On 12:48 Wed 14 Jun , michael k lang wrote: > On Wed, 2006-06-14 at 21:39 +0300, Sasha Khapyorsky wrote: > > Hi Greg, > > > > On 18:06 Tue 13 Jun , Greg Johnson wrote: > > > On Tue, Jun 13, 2006 at 11:00:35PM +0300, Sasha Khapyorsky wrote: > > > > Hi Greg, > > > > > > > > On 11:02 Tue 13 Jun , Greg Johnson wrote: > > > > > It seems to load the routes generated by the dump > > > > > script, but afterward it is not possible to dump the routes again. > > > > > > > > This means you have broken LFTs now. Probably I know what is going on > > > > here - new LFTs don't have " 0" entries, and switches are > > > > not accessible by LIDs anymore. > > > > > > > > Please update 'ibroute' utility (diags/) from the trunk and recreate the > > > > dump file - this should fix the problem. > > > > > > > > (Sorry, I forgot to mention 'ibroute' upgrade issue in patch announcement). > > > > > > Ok, that fixed it. It works fine now. > > > > Good. Thanks for trying this. > > > > > Any chance of making our own lid -> guid assignments while we are at it? > > > > Does guid2lid file not help? 
> Ya, guid2lid has all the info we need, we were just trying to take off > one more level of indirection; it's not necessary. You want to have all the info in one file. Right? It could be an interesting idea to extend the routing engine with a 'lid loader'. Will need to think about it. Sasha From mst at mellanox.co.il Wed Jun 14 13:11:05 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Jun 2006 23:11:05 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <4490434C.6020003@ichips.intel.com> References: <4490434C.6020003@ichips.intel.com> Message-ID: <20060614201105.GA27868@mellanox.co.il> Quoting r. Sean Hefty : > >>The other problem is that DS RMPP requires maintaining state between > >>receiving a request and the generation of a response. > > > > It does? Why does it? > > It needs to track receiving an ACK of the final ACK to the request, which > carries the initial window size for the response. Conceptually, what happens is: > > -- request --> > <-- ACK request -- > -- ACK (response window) --> > <-- response -- > -- ACK response -> OK, so apparently, what we have with dual-sided, after the ack with the response window arrives, is a sender that can't send data since userspace did not give us the response. I see how this approach would require significant change in core, and I'm not really happy with this. Here's an alternative idea: instead of making huge changes all over, how about we delay passing the RMPP transaction up to the user until we have the ACK with the response window, and ask the user to give us back this ACK packet (or just the window?) when he sends the response? Since we didn't support dual-sided transfers this extends rather than breaks both the ABI and the API. The issue of duplicates can then be dealt with by Jack's patch, detecting duplicate requests which does not require additional state. Sounds good?
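Michael's delay-until-ACK idea amounts to a small per-receive state machine. A minimal sketch, with entirely illustrative names and layout (this is not the real ib_mad/ib_umad code, which tracks far more state):

```c
/* Sketch of holding a completed dual-sided RMPP receive until the
 * direction-switch ACK arrives, then delivering the request together
 * with the peer's initial response window (names are illustrative). */
#include <assert.h>

enum ds_state {
	DS_RECV_IN_PROGRESS,	/* still reassembling request segments */
	DS_AWAIT_ACK,		/* request complete, holding for the ACK */
	DS_DELIVERED		/* request + window handed up to the user */
};

struct ds_rmpp_recv {
	enum ds_state state;
	unsigned int resp_window;	/* initial window carried by the ACK */
};

/* Last data segment reassembled: do NOT deliver to userspace yet. */
static void ds_recv_done(struct ds_rmpp_recv *rx)
{
	rx->state = DS_AWAIT_ACK;
}

/* The ACK carrying the response window arrives: now deliver the request
 * and the window together. A duplicate or out-of-order ACK is rejected
 * without allocating any additional state. */
static int ds_ack(struct ds_rmpp_recv *rx, unsigned int window)
{
	if (rx->state != DS_AWAIT_ACK)
		return -1;
	rx->resp_window = window;
	rx->state = DS_DELIVERED;
	return 0;
}
```

The appeal of this shape is exactly what the thread argues: the user hands the window back when sending the response, so nothing new has to persist in the kernel between the request and the response.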
-- MST From paul.lundin at gmail.com Wed Jun 14 13:37:44 2006 From: paul.lundin at gmail.com (Paul) Date: Wed, 14 Jun 2006 16:37:44 -0400 Subject: [openib-general] OFED 1.0-pre 1 build issues. Message-ID: Hello All, Using the default build.sh script on x86_64 rhel4u3 works flawlessly. However when doing the same thing on ppc64 the build fails (both are "everything" installs). The frustrating thing about the failure is that it's failing while looking in the wrong locations for some libraries. Instead of looking in the lib64 directories it's looking in lib. I have tried setting LDFLAGS, CXXFLAGS, CCFLAGS and CFLAGS to -m64 with no change, lib64 stuff is listed before lib in ld.so.conf (which I think only affects runtime ...). Here is the exact error: g++ -shared -nostdlib /usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../../../lib/crti.o /usr/lib/gcc/ppc64-redhat-linux/3.4.5/crtbeginS.o .libs/client.o .libs/simmsg.o .libs/msgmgr.o .libs/tcpcomm.o -L/usr/lib/gcc/ppc64-redhat-linux/3.4.5 -L/usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../../../lib -L/usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../..
-L/lib/../lib -L/usr/lib/../lib -lstdc++ -lm -lc -lgcc_s /usr/lib/gcc/ppc64-redhat-linux/3.4.5/crtsavres.o /usr/lib/gcc/ppc64-redhat-linux/3.4.5/crtendS.o /usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../../../lib/crtn.o -m64 -mminimal-toc -Wl,-soname -Wl,libibmscli.so.1 -o .libs/libibmscli.so.1.0.0 /usr/bin/ld: skipping incompatible /usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../../../lib/libc.so when searching for -lc /usr/bin/ld: skipping incompatible /usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../../../lib/libc.a when searching for -lc /usr/bin/ld: skipping incompatible /usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../../libc.so when searching for -lc /usr/bin/ld: skipping incompatible /usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../../libc.a when searching for -lc /usr/bin/ld: skipping incompatible /usr/lib/../lib/libc.so when searching for -lc /usr/bin/ld: skipping incompatible /usr/lib/../lib/libc.a when searching for -lc /usr/bin/ld: warning: powerpc:common architecture of input file `/usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../../../lib/crti.o' is incompatible with powerpc:common64 output /usr/bin/ld: warning: powerpc:common architecture of input file `/usr/lib/gcc/ppc64-redhat-linux/3.4.5/crtbeginS.o' is incompatible with powerpc:common64 output /usr/bin/ld: warning: powerpc:common architecture of input file `/usr/lib/gcc/ppc64-redhat-linux/3.4.5/crtsavres.o' is incompatible with powerpc:common64 output /usr/bin/ld: warning: powerpc:common architecture of input file `/usr/lib/gcc/ppc64-redhat-linux/3.4.5/crtendS.o' is incompatible with powerpc:common64 output /usr/bin/ld: warning: powerpc:common architecture of input file `/usr/lib/gcc/ppc64-redhat-linux/3.4.5/../../../../lib/crtn.o' is incompatible with powerpc:common64 output /usr/bin/ld: can not size stub section: Bad value /usr/bin/ld: .libs/libibmscli.so.1.0.0: Not enough room for program headers, try linking with -N /usr/bin/ld: final link failed: Bad value collect2: ld returned 1 exit status make[3]: *** [libibmscli.la] Error 1 make[3]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibmgtsim/src' make[2]: *** [all-recursive] Error 1 make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibmgtsim' make[1]: *** [all] Error 2 make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibmgtsim' make: *** [all-recursive] Error 1 error: Bad exit status from /var/tmp/rpm-tmp.18200 (%install) Regards. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Wed Jun 14 13:50:27 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 13:50:27 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060614201105.GA27868@mellanox.co.il> References: <4490434C.6020003@ichips.intel.com> <20060614201105.GA27868@mellanox.co.il> Message-ID: <44907693.7010001@ichips.intel.com> Michael S. Tsirkin wrote: > Here's an alternative idea: instead of making huge changes all over, how about > we delay passing the RMPP transaction up to the user until we have the ACK with > the response window, and ask the user to give us back this ACK packet (or just > the window?) when he sends the response? Since we didn't support dual-sided > transfers this extends rather than breaks both the ABI and the API. I thought about this as well, and I think there was a discussion about doing this. The window size could be exchanged in the RMPP header if needed. We're kind of left with the same issue of trying to determine if a received MAD will generate a response.
> The issue of duplicates can then be dealt with by Jack's patch, > detecting duplicate requests which does not require additional state. > > Sounds good? > If an alternative for handling DS RMPP can be found, I'm fine with this. - Sean From mst at mellanox.co.il Wed Jun 14 13:56:18 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Jun 2006 23:56:18 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <44907693.7010001@ichips.intel.com> References: <44907693.7010001@ichips.intel.com> Message-ID: <20060614205618.GB28111@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: RFC: detecting duplicate MAD requests > > Michael S. Tsirkin wrote: > > Here's an alternative idea: instead of making huge changes all over, how > > about we delay passing the RMPP transaction up to the user until we have the > > ACK with the response window, and ask the user to give us back this ACK > > packet (or just the window?) when he sends the response? Since we didn't > > support dual-sided tansfers this extends rather than breaks both the ABI and > > the API. > > I thought about this as well, and I think there was a discussion about doing > this. The window size could be exchanged in the RMPP header if needed. Sounds good. > We're kind of left with the same issue of trying to determine if a received > MAD will generate a response. How do you mean? We have IsDS=1 flag for dual-sided, don't we? Dual-sided transfer always has a response, doesn't it? -- MST From mshefty at ichips.intel.com Wed Jun 14 14:03:22 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 14:03:22 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060614205618.GB28111@mellanox.co.il> References: <44907693.7010001@ichips.intel.com> <20060614205618.GB28111@mellanox.co.il> Message-ID: <4490799A.9040802@ichips.intel.com> Michael S. 
Tsirkin wrote: >>We're kind of left with the same issue of trying to determine if a received >>MAD will generate a response. > > > How do you mean? We have IsDS=1 flag for dual-sided, don't we? Dual-sided > transfer always has a response, doesn't it? Unless I completely missed something, there is no IsDS flag. - Sean From mst at mellanox.co.il Wed Jun 14 14:22:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Jun 2006 00:22:50 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <4490799A.9040802@ichips.intel.com> References: <4490799A.9040802@ichips.intel.com> Message-ID: <20060614212250.GC28111@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: RFC: detecting duplicate MAD requests > > Michael S. Tsirkin wrote: > >>We're kind of left with the same issue of trying to determine if a received > >>MAD will generate a response. > > > > > > How do you mean? We have IsDS=1 flag for dual-sided, don't we? Dual-sided > > transfer always has a response, doesn't it? I mean, the flag in the application that says that the transfer is dual-sided. The spec seems to imply that user can figure *from the method* that IsDS=1, so I assume users will have this logic: "2) Begin the initial transfer by starting the send operation at the point labelled Send. The method or other indication should be interpreted on the other side as initiating a double-sided transfer, causing the receive context to set IsDS=1." So why does the MAD layer care whether a received MAD will generate a response? A request arrives - we pass it up. Now the ACK for the direction switch arrives - we pass it up too, application should be waiting for it, it should take the window and pass the response back to us.
-- MST From halr at voltaire.com Wed Jun 14 14:20:08 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Jun 2006 17:20:08 -0400 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <4490799A.9040802@ichips.intel.com> References: <44907693.7010001@ichips.intel.com> <20060614205618.GB28111@mellanox.co.il> <4490799A.9040802@ichips.intel.com> Message-ID: <1150320008.4506.6189.camel@hal.voltaire.com> On Wed, 2006-06-14 at 17:03, Sean Hefty wrote: > Michael S. Tsirkin wrote: > >>We're kind of left with the same issue of trying to determine if a received > >>MAD will generate a response. > > > > > > How do you mean? We have IsDS=1 flag for dual-sided, don't we? Dual-sided > > transfer always has a response, doesn't it? > > Unless I completely missed something, there is no IsDS flag. IsDS is an internal state variable and not an on wire part of the protocol. -- Hal > - Sean From mst at mellanox.co.il Wed Jun 14 14:28:48 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Jun 2006 00:28:48 +0300 Subject: [openib-general] OFED 1.0-pre 1 build issues. In-Reply-To: References: Message-ID: <20060614212848.GD28111@mellanox.co.il> Quoting r. Paul : > Subject: OFED 1.0-pre 1 build issues. > > Hello All, > Using the default build.sh script on x86_64 rhel4u3 works flawlessly. However when doing the same thing on ppc64 the build fails (both are "everything" installs). The frustrating thing about the failure is that its failing while looking in the wrong > locations for some libraries. Instead of looking in the lib64 directories its looking in lib. I have tried setting LDFLAGS, CXXFLAGS, CCFLAGS and CFLAGS to -m64 with no change, lib64 stuff is listed before lib in ld.so.conf (which I think only affects > runtime ...). Here is the exact error: > Maybe, write a small script along the lines of #!/bin/perl my $name = $0; $name =~ s#.*/##; exec("/usr/bin/$name", "-m64", @ARGV); and have it linked as ld, gcc and g++ on path before /usr/bin? 
-- MST From halr at voltaire.com Wed Jun 14 14:23:25 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Jun 2006 17:23:25 -0400 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060614212250.GC28111@mellanox.co.il> References: <4490799A.9040802@ichips.intel.com> <20060614212250.GC28111@mellanox.co.il> Message-ID: <1150320204.4506.6311.camel@hal.voltaire.com> On Wed, 2006-06-14 at 17:22, Michael S. Tsirkin wrote: > Quoting r. Sean Hefty : > > Subject: Re: RFC: detecting duplicate MAD requests > > > > Michael S. Tsirkin wrote: > > >>We're kind of left with the same issue of trying to determine if a received > > >>MAD will generate a response. > > > > > > > > > How do you mean? We have IsDS=1 flag for dual-sided, don't we? Dual-sided > > > transfer always has a response, doesn't it? > > I mean, the flag in the application that says that the transfer is dual-sided. > The spec seems to imply that user can figure *from the method* that IsDS=1, so I > assume users will have this logic: > > "2) > Begin the initial transfer by starting the send operation at the point labelled > Send. The method or other indication should be interpreted on > the other side as initiating a double-sided transfer, causing the receive > context to set IsDS=1." > > > So why does the MAD layer care whether a received MAD will generate a resonse? > A request arrives - we pass it up. Now the ACK for the direction switch arrives > - we pass it up too, application should be waiting for it, it should take the > window and pass the response back to us. The ACKs are transparent to the application/user. -- Hal From mst at mellanox.co.il Wed Jun 14 14:30:55 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 15 Jun 2006 00:30:55 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <1150320008.4506.6189.camel@hal.voltaire.com> References: <1150320008.4506.6189.camel@hal.voltaire.com> Message-ID: <20060614213055.GE28111@mellanox.co.il> Quoting r. Hal Rosenstock : > > >>We're kind of left with the same issue of trying to determine if a received > > >>MAD will generate a response. > > > > > > > > > How do you mean? We have IsDS=1 flag for dual-sided, don't we? Dual-sided > > > transfer always has a response, doesn't it? > > > > Unless I completely missed something, there is no IsDS flag. > > IsDS is an internal state variable and not an on wire part of the > protocol. Yes, I know, but user knows IsDS is 1 so why does MAD layer care whether there will be a response? It's up to the application to switch to the sender flow. -- MST From mst at mellanox.co.il Wed Jun 14 14:33:42 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Jun 2006 00:33:42 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <1150320204.4506.6311.camel@hal.voltaire.com> References: <1150320204.4506.6311.camel@hal.voltaire.com> Message-ID: <20060614213342.GF28111@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: RFC: detecting duplicate MAD requests > > On Wed, 2006-06-14 at 17:22, Michael S. Tsirkin wrote: > > Quoting r. Sean Hefty : > > > Subject: Re: RFC: detecting duplicate MAD requests > > > > > > Michael S. Tsirkin wrote: > > > >>We're kind of left with the same issue of trying to determine if a received > > > >>MAD will generate a response. > > > > > > > > > > > > How do you mean? We have IsDS=1 flag for dual-sided, don't we? Dual-sided > > > > transfer always has a response, doesn't it? > > > > I mean, the flag in the application that says that the transfer is dual-sided. 
> > The spec seems to imply that user can figure *from the method* that IsDS=1, so I > > assume users will have this logic: > > > > "2) > > Begin the initial transfer by starting the send operation at the point labelled > > Send. The method or other indication should be interpreted on > > the other side as initiating a double-sided transfer, causing the receive > > context to set IsDS=1." > > > > > > So why does the MAD layer care whether a received MAD will generate a > > resonse? A request arrives - we pass it up. Now the ACK for the direction > > switch arrives - we pass it up too, application should be waiting for it, it > > should take the window and pass the response back to us. > > The ACKs are transparent to the application/user. Well the ACK for the direction switch is special, isn't it? All I'm saying, let's pass it up to the application. -- MST From mst at mellanox.co.il Wed Jun 14 14:37:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Jun 2006 00:37:50 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060614213342.GF28111@mellanox.co.il> References: <1150320204.4506.6311.camel@hal.voltaire.com> <20060614213342.GF28111@mellanox.co.il> Message-ID: <20060614213750.GG28111@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: Re: RFC: detecting duplicate MAD requests > > Quoting r. Hal Rosenstock : > > Subject: Re: RFC: detecting duplicate MAD requests > > > > On Wed, 2006-06-14 at 17:22, Michael S. Tsirkin wrote: > > > Quoting r. Sean Hefty : > > > > Subject: Re: RFC: detecting duplicate MAD requests > > > > > > > > Michael S. Tsirkin wrote: > > > > >>We're kind of left with the same issue of trying to determine if a received > > > > >>MAD will generate a response. > > > > > > > > > > > > > > > How do you mean? We have IsDS=1 flag for dual-sided, don't we? Dual-sided > > > > > transfer always has a response, doesn't it? 
> > > > > > I mean, the flag in the application that says that the transfer is dual-sided. > > > The spec seems to imply that user can figure *from the method* that IsDS=1, so I > > > assume users will have this logic: > > > > > > "2) > > > Begin the initial transfer by starting the send operation at the point labelled > > > Send. The method or other indication should be interpreted on > > > the other side as initiating a double-sided transfer, causing the receive > > > context to set IsDS=1." > > > > > > > > > So why does the MAD layer care whether a received MAD will generate a > > > resonse? A request arrives - we pass it up. Now the ACK for the direction > > > switch arrives - we pass it up too, application should be waiting for it, it > > > should take the window and pass the response back to us. > > > > The ACKs are transparent to the application/user. > > Well the ACK for the direction switch is special, isn't it? > All I'm saying, let's pass it up to the application. I suggest a rule along the lines of "if an ACK arrives with segment number of 0 this means sender is requesting dual sided RMPP, pass it up to the application". What's the problem with this approach? I think this does not break existing apps since these don't do DS RMPP and so never get such an ACK. Right? -- MST From bos at pathscale.com Wed Jun 14 15:30:03 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 14 Jun 2006 15:30:03 -0700 Subject: [openib-general] OFED 1.0-pre 1 build issues. In-Reply-To: References: Message-ID: <1150324203.10676.17.camel@chalcedony.pathscale.com> On Wed, 2006-06-14 at 16:37 -0400, Paul wrote: > Using the default build.sh script on x86_64 rhel4u3 works > flawlessly. However when doing the same thing on ppc64 the build fails > (both are "everything" installs). Looks like you don't have the gcc-devel.ppc64 RPM installed. Isn't building in a multiarch environment fun? 
It appears that processes are not exiting cleanly on SVN7946 trunk backported to 2.6.9-34 EL. They seem to be stuck in a state of "DL" and I cannot even attach to them wil gdb or kill them with a kill -9. [root at iclust-1 core]# ps -uax | grep IMB woody 4087 0.0 0.0 58500 3172 pts/3 T 14:45 0:00 gdb ./IMB-MPI1 -p 4067 woody 4067 2.3 0.0 33108 2708 ? DL 14:44 0:12 ./IMB-MPI1 woody 4109 3.1 0.0 40148 2572 ? DL 14:47 0:12 ./IMB-MPI1 root 4156 0.0 0.0 51080 732 pts/3 S+ 14:53 0:00 grep IMB The last code I pulled SVN7843 did not have this problem. Any ideas on what might be causing this ? woody -------------- next part -------------- An HTML attachment was scrubbed... URL: From paul.lundin at gmail.com Wed Jun 14 16:53:24 2006 From: paul.lundin at gmail.com (Paul) Date: Wed, 14 Jun 2006 19:53:24 -0400 Subject: [openib-general] OFED 1.0-pre 1 build issues. In-Reply-To: <1150324203.10676.17.camel@chalcedony.pathscale.com> References: <1150324203.10676.17.camel@chalcedony.pathscale.com> Message-ID: Bryan, There is no such rpm for rhel 4 u3, perhaps you meant glibc-devel (installed) ? Also worth noting is that this is on a different machine than the x86_64 build (which was an opteron). This is a standalone power5 system, not cross-compiling. Michael, I performed the same work-around in bash (not so good with perl these days) it gets past the prior point. Thanks. Should something that takes care of this be included in the build.sh or build_env.sh scripts ? We would certainly need it covered in the docs at least. Now the build is dying on some undefined references. (log attached) Regards. On 6/14/06, Bryan O'Sullivan wrote: > > On Wed, 2006-06-14 at 16:37 -0400, Paul wrote: > > > Using the default build.sh script on x86_64 rhel4u3 works > > flawlessly. However when doing the same thing on ppc64 the build fails > > (both are "everything" installs). > > Looks like you don't have the gcc-devel.ppc64 RPM installed. Isn't > building in a multiarch environment fun? 
> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: make_user.log.gz Type: application/x-gzip Size: 9915 bytes Desc: not available URL: From sean.hefty at intel.com Wed Jun 14 17:01:16 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 14 Jun 2006 17:01:16 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060614213342.GF28111@mellanox.co.il> Message-ID: <000001c6900e$d34b09d0$1d268686@amr.corp.intel.com> >Well the ACK for the direction switch is special, isn't it? >All I'm saying, let's pass it up to the application. I really don't think that this is the direction that we want to take the interface. A multithreaded application could see the ACK before the request. Multiple ACKs could be received for the same request, or no ACK could be received at all. This pushes timeout handling and duplicate detection up to the any application using DS RMPP. We should work for a simpler interface, especially one exposed to userspace. Let's start with an interface that's efficient and works well in the kernel, and then determine how to expose that interface up to userspace. Let's try to keep the complexity in one location. Btw, it looks like Jack's patch has the MAD layer read MAD data while it is in transfer. I don't think that we can do this while the data is mapped. - Sean From bugzilla-daemon at openib.org Wed Jun 14 18:09:42 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Wed, 14 Jun 2006 18:09:42 -0700 (PDT) Subject: [openib-general] [Bug 3] openIB can run on SGI ia64 paltform! 
Message-ID: <20060615010942.7D040228738@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=3 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |CLOSED ------- Comment #2 from sweitzen at cisco.com 2006-06-14 18:09 ------- Close out old bug. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Wed Jun 14 18:10:07 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Wed, 14 Jun 2006 18:10:07 -0700 (PDT) Subject: [openib-general] [Bug 4] Re: ipoib_ib_post_receive failed for buf 111 Message-ID: <20060615011007.D3370228735@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=4 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |CLOSED ------- Comment #2 from sweitzen at cisco.com 2006-06-14 18:10 ------- close out old bug. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bsharp at NetEffect.com Wed Jun 14 18:35:50 2006 From: bsharp at NetEffect.com (Bob Sharp) Date: Wed, 14 Jun 2006 20:35:50 -0500 Subject: [openib-general] [PATCH v2 1/7] AMSO1100 Low Level Driver. 
Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC05D4E2D8@venom2> > +void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) > +{ > + struct c2_mq *mq = c2dev->qptr_array[mq_index]; > + union c2wr *wr; > + void *resource_user_context; > + struct iw_cm_event cm_event; > + struct ib_event ib_event; > + enum c2_resource_indicator resource_indicator; > + enum c2_event_id event_id; > + unsigned long flags; > + u8 *pdata = NULL; > + int status; > + > + /* > + * retreive the message > + */ > + wr = c2_mq_consume(mq); > + if (!wr) > + return; > + > + memset(&ib_event, 0, sizeof(ib_event)); > + memset(&cm_event, 0, sizeof(cm_event)); > + > + event_id = c2_wr_get_id(wr); > + resource_indicator = be32_to_cpu(wr->ae.ae_generic.resource_type); > + resource_user_context > + (void *) (unsigned long) wr->ae.ae_generic.user_context; > + > + status = cm_event.status = > c2_convert_cm_status(c2_wr_get_result(wr)); > + > + pr_debug("event received c2_dev=%p, event_id=%d, " > + "resource_indicator=%d, user_context=%p, status = %d\n", > + c2dev, event_id, resource_indicator, resource_user_context, > + status); > + > + switch (resource_indicator) { > + case C2_RES_IND_QP:{ > + > + struct c2_qp *qp = (struct c2_qp *)resource_user_context; > + struct iw_cm_id *cm_id = qp->cm_id; > + struct c2wr_ae_active_connect_results *res; > + > + if (!cm_id) { > + pr_debug("event received, but cm_id is , qp=%p!\n", > + qp); > + goto ignore_it; > + } > + pr_debug("%s: event = %s, user_context=%llx, " > + "resource_type=%x, " > + "resource=%x, qp_state=%s\n", > + __FUNCTION__, > + to_event_str(event_id), > + be64_to_cpu(wr->ae.ae_generic.user_context), > + be32_to_cpu(wr->ae.ae_generic.resource_type), > + be32_to_cpu(wr->ae.ae_generic.resource), > + to_qp_state_str(be32_to_cpu(wr- > >ae.ae_generic.qp_state))); > + > + c2_set_qp_state(qp, be32_to_cpu(wr->ae.ae_generic.qp_state)); > + > + switch (event_id) { > + case CCAE_ACTIVE_CONNECT_RESULTS: > + res = &wr->ae.ae_active_connect_results; > + cm_event.event = 
IW_CM_EVENT_CONNECT_REPLY; > + cm_event.local_addr.sin_addr.s_addr = res->laddr; > + cm_event.remote_addr.sin_addr.s_addr = res->raddr; > + cm_event.local_addr.sin_port = res->lport; > + cm_event.remote_addr.sin_port = res->rport; > + if (status == 0) { > + cm_event.private_data_len = > + be32_to_cpu(res->private_data_length); > + } else { > + spin_lock_irqsave(&qp->lock, flags); > + if (qp->cm_id) { > + qp->cm_id->rem_ref(qp->cm_id); > + qp->cm_id = NULL; > + } > + spin_unlock_irqrestore(&qp->lock, flags); > + cm_event.private_data_len = 0; > + cm_event.private_data = NULL; > + } > + if (cm_event.private_data_len) { > + /* copy private data */ > + pdata > + kmalloc(cm_event.private_data_len, > + GFP_ATOMIC); > + if (!pdata) { > + /* Ignore the request, maybe the > + * remote peer will retry */ > + pr_debug ("Ignored connect request -- " > + "no memory for pdata" > + "private_data_len=%d\n", > + cm_event.private_data_len); > + goto ignore_it; > + } > + > + memcpy(pdata, res->private_data, > + cm_event.private_data_len); > + > + cm_event.private_data = pdata; > + } > + if (cm_id->event_handler) > + cm_id->event_handler(cm_id, &cm_event); > + break; > + case CCAE_TERMINATE_MESSAGE_RECEIVED: > + case CCAE_CQ_SQ_COMPLETION_OVERFLOW: > + ib_event.device = &c2dev->ibdev; > + ib_event.element.qp = &qp->ibqp; > + ib_event.event = IB_EVENT_QP_REQ_ERR; > + > + if (qp->ibqp.event_handler) > + qp->ibqp.event_handler(&ib_event, > + qp->ibqp. 
> + qp_context); > + break; > + case CCAE_BAD_CLOSE: > + case CCAE_LLP_CLOSE_COMPLETE: > + case CCAE_LLP_CONNECTION_RESET: > + case CCAE_LLP_CONNECTION_LOST: > + BUG_ON(cm_id->event_handler==(void*)0x6b6b6b6b); > + > + spin_lock_irqsave(&qp->lock, flags); > + if (qp->cm_id) { > + qp->cm_id->rem_ref(qp->cm_id); > + qp->cm_id = NULL; > + } > + spin_unlock_irqrestore(&qp->lock, flags); > + cm_event.event = IW_CM_EVENT_CLOSE; > + cm_event.status = 0; > + if (cm_id->event_handler) > + cm_id->event_handler(cm_id, &cm_event); > + break; > + default: > + BUG_ON(1); > + pr_debug("%s:%d Unexpected event_id=%d on QP=%p, " > + "CM_ID=%p\n", > + __FUNCTION__, __LINE__, > + event_id, qp, cm_id); > + break; > + } > + break; > + } > + > + case C2_RES_IND_EP:{ > + > + struct c2wr_ae_connection_request *req = > + &wr->ae.ae_connection_request; > + struct iw_cm_id *cm_id = > + (struct iw_cm_id *)resource_user_context; > + > + pr_debug("C2_RES_IND_EP event_id=%d\n", event_id); > + if (event_id != CCAE_CONNECTION_REQUEST) { > + pr_debug("%s: Invalid event_id: %d\n", > + __FUNCTION__, event_id); > + break; > + } > + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; > + cm_event.provider_data = (void*)(unsigned long)req->cr_handle; > + cm_event.local_addr.sin_addr.s_addr = req->laddr; > + cm_event.remote_addr.sin_addr.s_addr = req->raddr; > + cm_event.local_addr.sin_port = req->lport; > + cm_event.remote_addr.sin_port = req->rport; > + cm_event.private_data_len = > + be32_to_cpu(req->private_data_length); > + > + if (cm_event.private_data_len) { It looks to me as if pdata is leaking here since it is not tracked and the upper layers do not free it. Also, if pdata is freed after the call to cm_id->event_handler returns, it exposes an issue in user space where the private data is garbage. I suspect the iwarp cm should be copying this data before it returns. 
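[Editor's note] A minimal user-space sketch of the ownership rule Bob's comment implies (all names are illustrative stand-ins, not the actual AMSO1100 or iW CM API): the driver allocates `pdata`, the event handler must deep-copy whatever private data it needs while the callback is running, and the driver frees the buffer as soon as the handler returns — so the allocation neither leaks nor escapes as a stale pointer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for the driver's CM event structure. */
struct cm_event {
    const void *private_data;
    size_t private_data_len;
};

/* Consumer side: copy the private data while the callback runs;
 * the pointer must not be used after the handler returns. */
static char saved[64];

static void event_handler(struct cm_event *ev)
{
    memcpy(saved, ev->private_data, ev->private_data_len);
}

/* Driver side: allocate, deliver, then free unconditionally, so
 * ownership of the buffer never leaves the driver. */
static int deliver_connect_request(const void *wire_data, size_t len)
{
    struct cm_event ev;
    void *pdata = malloc(len);

    if (!pdata)
        return -1;  /* drop the request; the remote peer will retry */
    memcpy(pdata, wire_data, len);
    ev.private_data = pdata;
    ev.private_data_len = len;
    event_handler(&ev);
    free(pdata);    /* no leak, no dangling pointer upstream */
    return 0;
}
```

Under this rule the iWARP CM (and any userspace relay above it) has to copy the private data before its callback returns, which matches the issue observed above.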
> + pdata > + kmalloc(cm_event.private_data_len, > + GFP_ATOMIC); > + if (!pdata) { > + /* Ignore the request, maybe the remote peer > + * will retry */ > + pr_debug ("Ignored connect request -- " > + "no memory for pdata" > + "private_data_len=%d\n", > + cm_event.private_data_len); > + goto ignore_it; > + } > + memcpy(pdata, > + req->private_data, > + cm_event.private_data_len); > + > + cm_event.private_data = pdata; > + } > + if (cm_id->event_handler) > + cm_id->event_handler(cm_id, &cm_event); > + break; > + } > + Bob From sweitzen at cisco.com Wed Jun 14 20:38:34 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 14 Jun 2006 20:38:34 -0700 Subject: [openib-general] please add RHEL4 to OS list on OpenIB bugzilla Message-ID: Bryan, Would you please add RHEL4 as an OS for OpenIB bugs? Scott -------------- next part -------------- An HTML attachment was scrubbed... URL: From sweitzen at cisco.com Wed Jun 14 20:45:40 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 14 Jun 2006 20:45:40 -0700 Subject: [openib-general] MVAPICH failure on IBM PPC-64 Linux machine Message-ID: I agree it's not working, and I have opened bug 135 (OFED 1.0: MVAPICH doesn't work on RHEL4 U3 ppc64). 
Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Boris Shpolyansky Sent: Monday, June 12, 2006 5:53 PM To: openib-general at openib.org Subject: [openib-general] MVAPICH failure on IBM PPC-64 Linux machine Hi, I've run into following failure running OSU MPI out of OFED-rc5 on IBM PPC-64 platform: [1] Abort: Error creating QP at line 820 in file viainit.c mpirun: executable version 1 does not match our version 3, This seems to be memory allocation issue which could be easily explained (and overcome) if the job is launched with regular user permissions, but in my case it's root who launches it. Have anybody tested OFED's OSU MPI on PPC-64 platform recently and can comment on this ? Thanks, Boris Shpolyansky Application Engineer Mellanox Technologies Inc. 2900 Stender Way Santa Clara, CA 95054 Tel.: (408) 916 0014 Fax: (408) 970 3403 Cell: (408) 834 9365 www.mellanox.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Wed Jun 14 21:10:42 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Jun 2006 00:10:42 -0400 Subject: [openib-general] [PATCH] [MINOR} OpenSM/osm_port_info_rcv.c: Move assert to before where PortInfo is assumed Message-ID: <1150344641.4506.20803.camel@hal.voltaire.com> OpenSM/osm_port_info_rcv.c: Move assert to before where PortInfo is assumed as shouldn't be processing as PortInfo unless it really is Signed-off-by: Hal Rosenstock Index: opensm/osm_port_info_rcv.c =================================================================== --- opensm/osm_port_info_rcv.c (revision 7961) +++ opensm/osm_port_info_rcv.c (working copy) @@ -683,6 +683,8 @@ osm_pi_rcv_process( p_context = osm_madw_get_pi_context_ptr( p_madw ); p_pi = (ib_port_info_t*)ib_smp_get_payload_ptr( p_smp ); + CL_ASSERT( p_smp->attr_id == IB_MAD_ATTR_PORT_INFO ); + /* On receipt of client reregister, clear the reregister bit so reregistering won't be sent again and again */ if ( ib_port_info_get_client_rereg( p_pi ) ) @@ -698,8 +700,6 @@ osm_pi_rcv_process( port_guid = p_context->port_guid; node_guid = p_context->node_guid; - CL_ASSERT( p_smp->attr_id == IB_MAD_ATTR_PORT_INFO ); - osm_dump_port_info( p_rcv->p_log, node_guid, port_guid, port_num, p_pi, OSM_LOG_DEBUG); From mst at mellanox.co.il Wed Jun 14 22:12:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Jun 2006 08:12:55 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <000001c6900e$d34b09d0$1d268686@amr.corp.intel.com> References: <000001c6900e$d34b09d0$1d268686@amr.corp.intel.com> Message-ID: <20060615051255.GA11911@mellanox.co.il> Quoting r. Sean Hefty : > A multithreaded application could see the ACK before the request. Yes, this is a problem. 
-- MST From betsy at pathscale.com Wed Jun 14 22:13:42 2006 From: betsy at pathscale.com (Betsy Zeller) Date: Wed, 14 Jun 2006 22:13:42 -0700 Subject: [openib-general] OFED 1.0 release schedule In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0007F73FBD@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0007F73FBD@orsmsx408> Message-ID: <1150348422.3425.145.camel@sarium.pathscale.com> On Wed, 2006-06-14 at 09:56 -0700, Woodruff, Robert J wrote: > > I ran a modified netpipe over SDP and it hung somewhere around size > > 4k. > It was on a production Lindenhurst Xeon system. This works fine with the > > Mellanox cards. Hmm - netpipe on SDP ran fine for us, as well. We tried it on RHEL4, FC4, and SLES10 RC2. > > I also had problems with uDAPL over pathscale (and thus Intel MPI) and > suspect problems with > RDMA operations. I did not have time to debug it any further. > Were you able to get perftest running, We also ran perftest. > as Arlin suggested to your developers a couple of weeks back ? > > Right now, I had to pull the pathscale cards to complete regression > testing of 1.0-pre1 with Intel MPI and since pathscale does not > work with Intel MPI, I put the Mellanox cards back in. We hope to get PathScale working Intel MPI in the near future. If you could send any error messages, or a description of behavior, that would be very useful. - Betsy -- Betsy Zeller Director of Software Engineering QLogic Corporation System Interconnect Group (formerly PathScale, Inc) 2071 Stierlin Court, Suite 200 Mountain View, CA, 94043 1-650-934-8088 From glebn at voltaire.com Wed Jun 14 22:19:12 2006 From: glebn at voltaire.com (glebn at voltaire.com) Date: Thu, 15 Jun 2006 08:19:12 +0300 Subject: [openib-general] MPI error when using a "system" call in mpi job. 
In-Reply-To: <20060614095958.59c7dcc7.weiny2@llnl.gov> References: <20060613171147.35787125.weiny2@llnl.gov> <44900970.9050006@cse.ohio-state.edu> <20060614095958.59c7dcc7.weiny2@llnl.gov> Message-ID: <20060615051912.GI17758@minantech.com> On Wed, Jun 14, 2006 at 09:59:58AM -0700, Ira Weiny wrote: > We are on a modified RedHat RHEL4 kernel. Roughly 2.6.9. :-( > This is known problen with kernel 2.6.9. > I am going to try a 2.6.16 kernel I have built to see if it changes. > This will work with system(), but not with fork(). > Ira > > > On Wed, 14 Jun 2006 09:04:48 -0400 > Sayantan Sur wrote: > > > Hello Ira, > > > > I am running the program on 2.6.15 (EM64T machine) and 2.6.16 (IA32 > > machine). The program seems to be running fine. Can you tell us which > > kernel you are using? We are using drivers pulled out of the trunk > > about 3-4 weeks back. > > > > Thanks, > > Sayantan. > > > > Ira Weiny wrote: > > > > >A co-worker here was seeing the following MPI error from his job: > > > > > >[1] Abort: [ldev2:1] Got completion with error, code=1 > > > at line 2148 in file viacheck.c > > > > > >After some tracking down he found that apparently if he used a > > >"system" call [int system(const char *string)] the next MPI command > > >will fail. > > > > > >I have been able to reproduce this with the attached simple "hello" > > >program. > > > > > >Perhaps someone has seen this type of error? Here is the output > > >from 2 runs: > > > > > >weiny2 at ldev0:~/ior-test > > >17:04:04 > mpirun_rsh -rsh -hostfile hostfile -np 2 ./hello x > > >ldev1 > > >[0] Abort: [ldev1:0] Got completion with error, code=1 > > > at line 2148 in file viacheck.c > > >ldev2 > > >mpirun_rsh: Abort signaled from [0] > > >done. 
> > >weiny2 at ldev0:~/ior-test > > >17:05:23 > mpirun_rsh -rsh -hostfile hostfile -np 2 ./hello > > >now = 0.000000 > > >now = 0.000052 > > >now = 0.000094 > > >now = 0.000121 > > >now = 0.000151 > > >now = 0.001072 > > >now = 0.001102 > > >now = 0.001118 > > >now = 0.001141 > > >now = 0.001160 > > >done. > > > > > >We are running mvapich 0.9.7 and the openib trunk rev 6829. > > > > > >Thanks, > > >Ira > > > > > > > > > > > >------------------------------------------------------------------------ > > > > > >_______________________________________________ > > >openib-general mailing list > > >openib-general at openib.org > > >http://openib.org/mailman/listinfo/openib-general > > > > > >To unsubscribe, please visit > > >http://openib.org/mailman/listinfo/openib-general > > > > > > > -- > > http://www.cse.ohio-state.edu/~surs > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Gleb. From eitan at mellanox.co.il Wed Jun 14 23:02:07 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 15 Jun 2006 09:02:07 +0300 Subject: [openib-general] [PATCH] [MINOR} OpenSM/osm_port_info_rcv.c: Move assert to beforewhere PortInfo is assumed Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236885F@mtlexch01.mtl.com> Makes perfect sense. 
> -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Thursday, June 15, 2006 7:11 AM > To: openib-general at openib.org > Cc: Eitan Zahavi > Subject: [PATCH] [MINOR} OpenSM/osm_port_info_rcv.c: Move assert to > beforewhere PortInfo is assumed > > OpenSM/osm_port_info_rcv.c: Move assert to before where PortInfo is > assumed as shouldn't be processing as PortInfo unless it really is > > Signed-off-by: Hal Rosenstock > > Index: opensm/osm_port_info_rcv.c > =================================================================== > --- opensm/osm_port_info_rcv.c (revision 7961) > +++ opensm/osm_port_info_rcv.c (working copy) > @@ -683,6 +683,8 @@ osm_pi_rcv_process( > p_context = osm_madw_get_pi_context_ptr( p_madw ); > p_pi = (ib_port_info_t*)ib_smp_get_payload_ptr( p_smp ); > > + CL_ASSERT( p_smp->attr_id == IB_MAD_ATTR_PORT_INFO ); > + > /* On receipt of client reregister, clear the reregister bit so > reregistering won't be sent again and again */ > if ( ib_port_info_get_client_rereg( p_pi ) ) > @@ -698,8 +700,6 @@ osm_pi_rcv_process( > port_guid = p_context->port_guid; > node_guid = p_context->node_guid; > > - CL_ASSERT( p_smp->attr_id == IB_MAD_ATTR_PORT_INFO ); > - > osm_dump_port_info( > p_rcv->p_log, node_guid, port_guid, port_num, p_pi, OSM_LOG_DEBUG); > > From ogerlitz at voltaire.com Thu Jun 15 01:26:22 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 Jun 2006 11:26:22 +0300 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <44903D5D.10102@ichips.intel.com> References: <44903D5D.10102@ichips.intel.com> Message-ID: <449119AE.2010703@voltaire.com> Sean Hefty wrote: > James Lentini wrote: >> The IBTA spec (volume 1, version 1.2) describes a communication >> established affiliated asynchronous event. >> We've seen this event delivered to our NFS-RDMA server and aren't sure >> what to do with it. 
> This event is delivered to the verbs consumer, since it occurs on the QP. It's > expected that the consumer will call ib_cm_establish. Although, I would guess > that you can probably ignore the event, under the assumption that the RTU will > eventually be received by the local CM. Sean, The cma/verbs consumer can't just ignore the event since its qp state is still RTR which means an attempt to tx replying the rx would fail. On the other hand it can't call ib_cm_establish since the CMA does not expose an API for that, nor the CM can register a cb to get this event and emulate an RTU reception since the CMA is the one to create the QP and the CMA consumer providing the qp_init_attr along with event handler... I suggest the following design: the CMA would replace the event handler provided with the qp_init_attr struct with a callback of its own and keep the original handler/context on a private structure. On the delivery of IB_EVENT_COMM_EST event, the CMA would call down the CM to emulate RTU reception (ib_cm_establish) and then call up the consumer original handler, typical CMA consumers would just ignore this event, i think. The CM should be able to allow ib_cm_established to be called in the context over which the event handler is called (or jump the treatment to higher context). The CM must also ignore the actual RTU if it arrives later/in parallel to when ib_cm_establish was called. By this design the verbs consumer is guaranteed to always get RDMA_CM_EVENT_ESTABLISHED no matter if the RTU is just late or never arrives but it still can get a CQ RX completion(s) before getting the CMA established event; in that case it can queue these completion elements for the short time window before the established event arrives and then process them. A design similar to that was implemented at the Voltaire gen1 stack and it works in production with iSER target and VIBNAL (CFS Lustre NAL for voltaire gen1 ib) server side. 
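[Editor's note] The proxy design described above can be modeled in a few lines of plain C (all names here are hypothetical; the real code would live in the CMA and call `ib_cm_establish()` toward the IB CM): the CMA saves the consumer's QP event handler at QP creation, installs its own handler in its place, reacts to the COMM_EST event itself, and then forwards every event to the original handler.

```c
#include <assert.h>

enum qp_event { EVENT_COMM_EST, EVENT_PATH_MIG };

/* Consumer handler/context saved by the "CMA" at QP creation time. */
struct qp_ctx {
    void (*orig_handler)(enum qp_event ev, void *context);
    void *orig_context;
};

static int establish_calls;   /* counts emulated RTU receptions */
static int consumer_events;   /* counts events seen by the consumer */

static void fake_cm_establish(void)
{
    establish_calls++;        /* stands in for ib_cm_establish() */
}

static void consumer_handler(enum qp_event ev, void *context)
{
    (void)ev;
    (void)context;
    consumer_events++;        /* typical consumers just note the event */
}

/* The handler the CMA installs in place of the consumer's. */
static void cma_qp_event_handler(enum qp_event ev, struct qp_ctx *qp)
{
    if (ev == EVENT_COMM_EST)
        fake_cm_establish();  /* emulate RTU reception first... */
    /* ...then always forward to the consumer's original handler. */
    qp->orig_handler(ev, qp->orig_context);
}
```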
Does anyone know on what context (hard_irq, soft_irq, thread) are the event handlers being called? Or. From ogerlitz at voltaire.com Thu Jun 15 01:31:46 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 Jun 2006 11:31:46 +0300 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <449119AE.2010703@voltaire.com> References: <44903D5D.10102@ichips.intel.com> <449119AE.2010703@voltaire.com> Message-ID: <44911AF2.4060800@voltaire.com> Or Gerlitz wrote: > I suggest the following design: the CMA would replace the event handler > provided with the qp_init_attr struct with a callback of its own and > keep the original handler/context on a private structure. > > On the delivery of IB_EVENT_COMM_EST event, the CMA would call down the > CM to emulate RTU reception (ib_cm_establish) and then call up the > consumer original handler, typical CMA consumers would just ignore this > event, i think. and on other qp affiliated events the CMA would just call up the consumer callback. This proxy-ing of qp events can help us down the road to add support for path migration in the CMA. Or. From tziporet at mellanox.co.il Thu Jun 15 03:51:17 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 15 Jun 2006 13:51:17 +0300 Subject: [openib-general] Maintainers List Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA723B@mtlexch01.mtl.com> Usually you can see the owners in bugzilla Tziporet -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Rimmer, Todd Sent: Wednesday, June 14, 2006 4:55 PM To: openib-general at openib.org Subject: [openib-general] Maintainers List Is there a convenient list of the maintainers for all the various OFED components? 
Thanks, Todd Rimmer _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Thu Jun 15 04:06:17 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 15 Jun 2006 14:06:17 +0300 Subject: [openib-general] [PATCH] osm: partition manager force policy In-Reply-To: <86odwxgqrs.fsf@mtl066.yok.mtl.com> References: <86odwxgqrs.fsf@mtl066.yok.mtl.com> Message-ID: <20060615110617.GA21560@sashak.voltaire.com> Hi Eitan, Some comments about the patch. Personally I'm glad to see that you are using tab instead of spaces as identaion character. But it would be nice if next time you will not mix the functional changes and identaion fixes in the same patch, but instead will provide two different patches. Also it would be nice if your identation fixes will cover whole file(s) and not just selected lines. The same is about massive code moving, the patch separation may simplify review. The rest is below. On 15:54 Tue 13 Jun , Eitan Zahavi wrote: > --text follows this line-- > Hi Hal > > This is a second take after debug and cleanup of the partition manager > patch I have previously provided. The functionality is the same but > this one is after 2 days of testing on the simulator. > I also did some code restructuring for clarity. > > Tests passed were both dedicated pkey enforcements (pkey.*) and > stress test (osmStress.*) > > As I started to test the partition manager code (using ibmgtsim pkey test), > I realized the implementation does not really enforces the partition policy > on the given fabric. This patch fixes that. It was verified using the > simulation test. Several other corner cases were fixed too. 
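[Editor's note] For readers following the const question, the two options can be shown with simplified stand-ins (not the real OpenSM types): C cannot overload on const, so either the existing accessor drops its const qualifier, or a second non-const accessor is added next to it, as the patch does.

```c
#include <assert.h>

typedef struct {
    int pkey_tbl;   /* stand-in for the real P_Key table object */
} physp_t;

/* Option A (the patch): keep the const accessor for readers and add
 * a separate accessor for callers that need to modify the table. */
static const int *get_pkey_tbl(const physp_t *p)
{
    return &p->pkey_tbl;
}

static int *get_mod_pkey_tbl(physp_t *p)
{
    return &p->pkey_tbl;
}

/* Option B (the review suggestion): one accessor without const; it
 * serves both kinds of caller but no longer documents read-only
 * intent in the type system. */
static int *get_pkey_tbl_mut(physp_t *p)
{
    return &p->pkey_tbl;
}
```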
> > Eitan > > Signed-off-by: Eitan Zahavi > > Index: include/opensm/osm_port.h > =================================================================== > --- include/opensm/osm_port.h (revision 7867) > +++ include/opensm/osm_port.h (working copy) > @@ -586,6 +586,39 @@ osm_physp_get_pkey_tbl( IN const osm_phy > * Port, Physical Port > *********/ > > +/****f* OpenSM: Physical Port/osm_physp_get_mod_pkey_tbl > +* NAME > +* osm_physp_get_mod_pkey_tbl > +* > +* DESCRIPTION > +* Returns a NON CONST pointer to the P_Key table object of the Physical Port object. > +* > +* SYNOPSIS > +*/ > +static inline osm_pkey_tbl_t * > +osm_physp_get_mod_pkey_tbl( IN osm_physp_t* const p_physp ) > +{ > + CL_ASSERT( osm_physp_is_valid( p_physp ) ); > + /* > + (14.2.5.7) - the block number valid values are 0-2047, and are further > + limited by the size of the P_Key table specified by the PartitionCap on the node. > + */ > + return( &p_physp->pkeys ); > +}; > +/* > +* PARAMETERS > +* p_physp > +* [in] Pointer to an osm_physp_t object. > +* > +* RETURN VALUES > +* The pointer to the P_Key table object. > +* > +* NOTES > +* > +* SEE ALSO > +* Port, Physical Port > +*********/ > + Is not this simpler to remove 'const' from existing osm_physp_get_pkey_tbl() function instead of using new one? 
> /****f* OpenSM: Physical Port/osm_physp_set_slvl_tbl > * NAME > * osm_physp_set_slvl_tbl > Index: include/opensm/osm_pkey.h > =================================================================== > --- include/opensm/osm_pkey.h (revision 7867) > +++ include/opensm/osm_pkey.h (working copy) > @@ -92,6 +92,9 @@ typedef struct _osm_pkey_tbl > cl_ptr_vector_t blocks; > cl_ptr_vector_t new_blocks; > cl_map_t keys; > + cl_qlist_t pending; > + uint16_t used_blocks; > + uint16_t max_blocks; > } osm_pkey_tbl_t; > /* > * FIELDS > @@ -104,6 +107,18 @@ typedef struct _osm_pkey_tbl > * keys > * A set holding all keys > * > +* pending > +* A list osm_pending_pkey structs that is temporarily set by the > +* pkey mgr and used during pkey mgr algorithm only > +* > +* used_blocks > +* Tracks the number of blocks having non-zero pkeys > +* > +* max_blocks > +* The maximal number of blocks this partition table might hold > +* this value is based on node_info (for port 0 or CA) or switch_info > +* updated on receiving the node_info or switch_info GetResp > +* > * NOTES > * 'blocks' vector should be used to store pkey values obtained from > * the port and SM pkey manager should not change it directly, for this > @@ -114,6 +129,39 @@ typedef struct _osm_pkey_tbl > * > *********/ > > +/****s* OpenSM: osm_pending_pkey_t > +* NAME > +* osm_pending_pkey_t > +* > +* DESCRIPTION > +* This objects stores temporary information on pkeys their target block and index > +* during the pkey manager operation > +* > +* SYNOPSIS > +*/ > +typedef struct _osm_pending_pkey { > + cl_list_item_t list_item; > + uint16_t pkey; > + uint32_t block; > + uint8_t index; > + boolean_t is_new; > +} osm_pending_pkey_t; > +/* > +* FIELDS > +* pkey > +* The actual P_Key > +* > +* block > +* The block index based on the previous table extracted from the device > +* > +* index > +* The index of the pky within the block > +* > +* is_new > +* TRUE for new P_Keys such that the block and index are invalid in that case > +* > 
+*********/ > + > /****f* OpenSM: osm_pkey_tbl_construct > * NAME > * osm_pkey_tbl_construct > @@ -209,8 +257,8 @@ osm_pkey_tbl_get_num_blocks( > static inline ib_pkey_table_t *osm_pkey_tbl_block_get( > const osm_pkey_tbl_t *p_pkey_tbl, uint16_t block) > { > - CL_ASSERT(block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)); > - return(cl_ptr_vector_get(&p_pkey_tbl->blocks, block)); > + return( (block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)) ? > + cl_ptr_vector_get(&p_pkey_tbl->blocks, block) : NULL); > }; > /* > * p_pkey_tbl > @@ -244,6 +292,106 @@ static inline ib_pkey_table_t *osm_pkey_ > /* > *********/ > > + > +/****f* OpenSM: osm_pkey_tbl_make_block_pair > +* NAME > +* osm_pkey_tbl_make_block_pair > +* > +* DESCRIPTION > +* Find or create a pair of "old" and "new" blocks for the > +* given block index > +* > +* SYNOPSIS > +*/ > +int osm_pkey_tbl_make_block_pair( > + osm_pkey_tbl_t *p_pkey_tbl, > + uint16_t block_idx, > + ib_pkey_table_t **pp_old_block, > + ib_pkey_table_t **pp_new_block); > +/* > +* p_pkey_tbl > +* [in] Pointer to the PKey table > +* > +* block_idx > +* [in] The block index to use > +* > +* pp_old_block > +* [out] Pointer to the old block pointer arg > +* > +* pp_new_block > +* [out] Pointer to the new block pointer arg > +* > +* RETURN VALUES > +* 0 if OK 1 if failed It is better (conventional) to use -1 as failure return status. 
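The hunk above replaces a `CL_ASSERT` on the block index with a bounds check that returns NULL, which matters because asserts are typically compiled out of release builds while an out-of-range `cl_ptr_vector_get` is not. A small sketch of the pattern (generic pointer vector, not the real complib call):

```c
#include <assert.h>
#include <stddef.h>

/* Bounds-checked accessor in the spirit of the revised
 * osm_pkey_tbl_block_get(): out-of-range lookups yield NULL rather
 * than relying on an assert that vanishes in release builds. */
static void *ptr_vector_get_checked(void *const *vec, size_t size, size_t idx)
{
    return (idx < size) ? vec[idx] : NULL;
}
```

Callers then treat NULL as "block not present", which the later `osm_pkey_find_next_free_entry` code does explicitly.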
> +* > +*********/ > + > +/****f* OpenSM: osm_pkey_tbl_set_new_entry > +* NAME > +* osm_pkey_tbl_set_new_entry > +* > +* DESCRIPTION > +* stores the given pkey in the "new" blocks array and update > +* the "map" to show that on the "old" blocks > +* > +* SYNOPSIS > +*/ > +int > +osm_pkey_tbl_set_new_entry( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + IN uint16_t block_idx, > + IN uint8_t pkey_idx, > + IN uint16_t pkey); > +/* > +* p_pkey_tbl > +* [in] Pointer to the PKey table > +* > +* block_idx > +* [in] The block index to use > +* > +* pkey_idx > +* [in] The index within the block > +* > +* pkey > +* [in] PKey to store > +* > +* RETURN VALUES > +* 0 if OK 1 if failed Ditto > +* > +*********/ > + > +/****f* OpenSM: osm_pkey_find_next_free_entry > +* NAME > +* osm_pkey_find_next_free_entry > +* > +* DESCRIPTION > +* Find the next free entry in the PKey table. Starting at the given > +* index and block number. The user should increment pkey_idx before > +* next call > +* Inspect the "new" blocks array for empty space. > +* > +* SYNOPSIS > +*/ > +boolean_t > +osm_pkey_find_next_free_entry( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + OUT uint16_t *p_block_idx, > + OUT uint8_t *p_pkey_idx); > +/* > +* p_pkey_tbl > +* [in] Pointer to the PKey table > +* > +* p_block_idx > +* [out] The block index to use > +* > +* p_pkey_idx > +* [out] The index within the block to use > +* > +* RETURN VALUES > +* TRUE if found FALSE if did not find > +* > +*********/ > + > /****f* OpenSM: osm_pkey_tbl_sync_new_blocks > * NAME > * osm_pkey_tbl_sync_new_blocks > @@ -263,9 +411,44 @@ void osm_pkey_tbl_sync_new_blocks( > * > *********/ > > +/****f* OpenSM: osm_pkey_tbl_get_block_and_idx > +* NAME > +* osm_pkey_tbl_get_block_and_idx > +* > +* DESCRIPTION > +* set the block index and pkey index the given > +* pkey is found in. 
return 1 if cound not find > +* it, 0 if OK > +* > +* SYNOPSIS > +*/ > +int > +osm_pkey_tbl_get_block_and_idx( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + IN uint16_t *p_pkey, > + OUT uint32_t *block_idx, > + OUT uint8_t *pkey_index); > +/* > +* p_pkey_tbl > +* [in] Pointer to osm_pkey_tbl_t object. > +* > +* p_pkey > +* [in] Pointer to the P_Key entry searched > +* > +* p_block_idx > +* [out] Pointer to the block index to be updated > +* > +* p_pkey_idx > +* [out] Pointer to the pkey index (in the block) to be updated > +* > +* > +* NOTES > +* > +*********/ > + > /****f* OpenSM: osm_pkey_tbl_set > * NAME > * osm_pkey_tbl_set > Index: opensm/osm_pkey.c > =================================================================== > --- opensm/osm_pkey.c (revision 7904) > +++ opensm/osm_pkey.c (working copy) > @@ -100,6 +100,9 @@ int osm_pkey_tbl_init( > cl_ptr_vector_init( &p_pkey_tbl->blocks, 0, 1); > cl_ptr_vector_init( &p_pkey_tbl->new_blocks, 0, 1); > cl_map_init( &p_pkey_tbl->keys, 1 ); > + cl_qlist_init( &p_pkey_tbl->pending ); > + p_pkey_tbl->used_blocks = 0; > + p_pkey_tbl->max_blocks = 0; > return(IB_SUCCESS); > } > > @@ -118,14 +121,29 @@ void osm_pkey_tbl_sync_new_blocks( > p_block = cl_ptr_vector_get(&p_pkey_tbl->blocks, b); > if ( b < new_blocks ) > p_new_block = cl_ptr_vector_get(&p_pkey_tbl->new_blocks, b); > - else { > + else > + { > p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); > if (!p_new_block) > break; > + cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, > + b, p_new_block); > + } > + > memset(p_new_block, 0, sizeof(*p_new_block)); > - cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, p_new_block); > } > - memcpy(p_new_block, p_block, sizeof(*p_new_block)); > +} You changed this function so it does not do any sync anymore. Should function name be changed too? 
> + > +/********************************************************************** > + **********************************************************************/ > +void osm_pkey_tbl_cleanup_pending( > + IN osm_pkey_tbl_t *p_pkey_tbl) > +{ > + cl_list_item_t *p_item; > + p_item = cl_qlist_remove_head( &p_pkey_tbl->pending ); > + while (p_item != cl_qlist_end( &p_pkey_tbl->pending ) ) > + { > + free( (osm_pending_pkey_t *)p_item ); > } > } > > @@ -202,6 +220,138 @@ int osm_pkey_tbl_set( > > /********************************************************************** > **********************************************************************/ > +int osm_pkey_tbl_make_block_pair( > + osm_pkey_tbl_t *p_pkey_tbl, > + uint16_t block_idx, > + ib_pkey_table_t **pp_old_block, > + ib_pkey_table_t **pp_new_block) > +{ > + if (block_idx >= p_pkey_tbl->max_blocks) return 1; > + > + if (pp_old_block) > + { > + *pp_old_block = osm_pkey_tbl_block_get( p_pkey_tbl, block_idx ); > + if (! *pp_old_block) > + { > + *pp_old_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); > + if (!*pp_old_block) return 1; > + memset(*pp_old_block, 0, sizeof(ib_pkey_table_t)); > + cl_ptr_vector_set(&p_pkey_tbl->blocks, block_idx, *pp_old_block); > + } > + } > + > + if (pp_new_block) > + { > + *pp_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_idx ); > + if (! 
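As quoted in the mail, the `osm_pkey_tbl_cleanup_pending` loop frees `p_item` but never fetches the next head inside the loop body, so it would spin on a stale pointer; this may just be a line lost to mail wrapping, but a drain-and-free loop needs the re-fetch. A self-contained sketch (plain malloc'd singly linked list, not the cl_qlist API):

```c
#include <assert.h>
#include <stdlib.h>

struct node { struct node *next; int payload; };

/* Drain-and-free: the successor must be captured each iteration,
 * before free(), otherwise the loop body touches a dangling pointer. */
static int drain(struct node **head)
{
    int freed = 0;
    struct node *n = *head;            /* "remove head" */
    while (n != NULL) {
        struct node *next = n->next;   /* fetch successor BEFORE free */
        free(n);
        freed++;
        n = next;                      /* advance, unlike the quoted loop */
    }
    *head = NULL;
    return freed;
}
```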
*pp_new_block) > + { > + *pp_new_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); > + if (!*pp_new_block) return 1; > + memset(*pp_new_block, 0, sizeof(ib_pkey_table_t)); > + cl_ptr_vector_set(&p_pkey_tbl->new_blocks, block_idx, *pp_new_block); > + } > + } > + return 0; > +} > + > +/********************************************************************** > + **********************************************************************/ > +/* > + store the given pkey in the "new" blocks array and update the "map" > + to show that on the "old" blocks > +*/ > +int > +osm_pkey_tbl_set_new_entry( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + IN uint16_t block_idx, > + IN uint8_t pkey_idx, > + IN uint16_t pkey) > +{ > + ib_pkey_table_t *p_old_block; > + ib_pkey_table_t *p_new_block; > + > + if (osm_pkey_tbl_make_block_pair( > + p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) > + return 1; > + > + cl_map_insert( &p_pkey_tbl->keys, > + ib_pkey_get_base(pkey), > + &(p_old_block->pkey_entry[pkey_idx])); Here you map potentially empty pkey entry. Why? "old block" will be remapped anyway on pkey receiving. Actually I don't see why you want this pretty tricky and pkey_mgr specific procedure as generic function. 
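A side note on the `used_blocks` update just below: elsewhere in the patch (`pkey_mgr_update_peer_port`) `used_blocks` is used as a loop bound, i.e. a count, yet `osm_pkey_tbl_set_new_entry` stores the bare block index, which undercounts by one. Assuming count semantics, the high-water update would look like:

```c
#include <assert.h>
#include <stdint.h>

/* High-water mark of blocks in use. If used_blocks is a COUNT, a write
 * into block_idx means at least block_idx + 1 blocks are occupied; the
 * quoted patch stores the bare index, which is one short when the value
 * is later used as a loop bound. (Assumption: count semantics.) */
static void note_block_used(uint16_t *used_blocks, uint16_t block_idx)
{
    if (*used_blocks < (uint16_t)(block_idx + 1))
        *used_blocks = (uint16_t)(block_idx + 1);
}
```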
> + p_new_block->pkey_entry[pkey_idx] = pkey; > + if (p_pkey_tbl->used_blocks < block_idx) > + p_pkey_tbl->used_blocks = block_idx; > + > + return 0; > +} > + > +/********************************************************************** > + **********************************************************************/ > +boolean_t > +osm_pkey_find_next_free_entry( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + OUT uint16_t *p_block_idx, > + OUT uint8_t *p_pkey_idx) > +{ > + ib_pkey_table_t *p_new_block; > + > + CL_ASSERT(p_block_idx); > + CL_ASSERT(p_pkey_idx); > + > + while ( *p_block_idx < p_pkey_tbl->max_blocks) > + { > + if (*p_pkey_idx > IB_NUM_PKEY_ELEMENTS_IN_BLOCK - 1) > + { > + *p_pkey_idx = 0; > + (*p_block_idx)++; > + if (*p_block_idx >= p_pkey_tbl->max_blocks) > + return FALSE; > + } > + > + p_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, *p_block_idx); > + > + if ( !p_new_block || > + ib_pkey_is_invalid(p_new_block->pkey_entry[*p_pkey_idx])) > + return TRUE; > + else > + (*p_pkey_idx)++; > + } > + return FALSE; > +} > + > +/********************************************************************** > + **********************************************************************/ > +int > +osm_pkey_tbl_get_block_and_idx( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + IN uint16_t *p_pkey, > + OUT uint32_t *p_block_idx, > + OUT uint8_t *p_pkey_index) > +{ > + uint32_t num_of_blocks; > + uint32_t block_index; > + ib_pkey_table_t *block; > + > + CL_ASSERT( p_pkey_tbl ); > + CL_ASSERT( p_block_idx != NULL ); > + CL_ASSERT( p_pkey_idx != NULL ); Why last two CL_ASSERTs? What should be problem with uninitialized pointers here? 
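The `osm_pkey_find_next_free_entry` scan above is resumable: the block/index cursors are in/out parameters, and per the header comment the caller bumps the pkey index past a returned hit before the next call. A flat sketch of the same cursor discipline over a small 2D table (sizes shrunk from the real 32-entry IB blocks for the demo):

```c
#include <assert.h>
#include <stdint.h>

#define N_BLOCKS 3
#define N_SLOTS  4   /* real IB pkey blocks hold 32 entries */

/* Resumable scan for the next zero (free) slot, mirroring
 * osm_pkey_find_next_free_entry(): cursors persist across calls, and
 * the caller advances *slot past a returned hit before calling again. */
static int find_next_free(const uint16_t tbl[N_BLOCKS][N_SLOTS],
                          uint16_t *block, uint8_t *slot)
{
    while (*block < N_BLOCKS) {
        if (*slot >= N_SLOTS) {        /* wrap to next block */
            *slot = 0;
            (*block)++;
            continue;
        }
        if (tbl[*block][*slot] == 0)
            return 1;                  /* free slot at (*block, *slot) */
        (*slot)++;
    }
    return 0;                          /* table exhausted */
}
```

Note the reviewer's warning later in the thread: the real function takes `uint8_t *` for the pkey index, so passing a `uint16_t *` cursor (as `pkey_mgr_update_port` does) should draw a compiler warning.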
> + > + num_of_blocks = cl_ptr_vector_get_size( &p_pkey_tbl->blocks); > + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) > + { > + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); > + if ( ( block->pkey_entry <= p_pkey ) && > + ( p_pkey < block->pkey_entry + IB_NUM_PKEY_ELEMENTS_IN_BLOCK)) > + { > + *p_block_idx = block_index; > + *p_pkey_index = p_pkey - block->pkey_entry; > + return 0; > + } > + } > + return 1; > +} > + > +/********************************************************************** > + **********************************************************************/ > static boolean_t __osm_match_pkey ( > IN const ib_net16_t *pkey1, > IN const ib_net16_t *pkey2 ) { > @@ -305,7 +455,8 @@ osm_physp_share_pkey( > if (cl_is_map_empty(&pkey_tbl1->keys) || cl_is_map_empty(&pkey_tbl2->keys)) > return TRUE; > > - return !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); > + return > + !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); > } > > /********************************************************************** > @@ -321,7 +472,8 @@ osm_port_share_pkey( > > OSM_LOG_ENTER( p_log, osm_port_share_pkey ); > > - if (!p_port_1 || !p_port_2) { > + if (!p_port_1 || !p_port_2) > + { > ret = FALSE; > goto Exit; > } > @@ -329,7 +481,8 @@ osm_port_share_pkey( > p_physp1 = osm_port_get_default_phys_ptr(p_port_1); > p_physp2 = osm_port_get_default_phys_ptr(p_port_2); > > - if (!p_physp1 || !p_physp2) { > + if (!p_physp1 || !p_physp2) > + { > ret = FALSE; > goto Exit; > } > Index: opensm/osm_pkey_mgr.c > =================================================================== > --- opensm/osm_pkey_mgr.c (revision 7904) > +++ opensm/osm_pkey_mgr.c (working copy) > @@ -62,6 +62,139 @@ > > /********************************************************************** > **********************************************************************/ > +/* > + the max number of pkey blocks for a physical port is located in > + different 
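`osm_pkey_tbl_get_block_and_idx` locates the owning block by range-testing the entry pointer against each block's array and recovering the slot by pointer subtraction, exactly as the hunk above shows. (Also note that, as quoted, the asserts name `p_pkey_idx` while the parameter is `p_pkey_index`, which would not compile.) A self-contained sketch of the pointer-range lookup:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 4

/* Which block owns this entry pointer? Mirrors the range test in
 * osm_pkey_tbl_get_block_and_idx(): base <= entry < base + SLOTS.
 * Relies on each block being one contiguous array, as in the original. */
static int locate_entry(uint16_t *const blocks[], size_t n_blocks,
                        const uint16_t *entry,
                        size_t *block_idx, size_t *slot_idx)
{
    for (size_t b = 0; b < n_blocks; b++) {
        if (entry >= blocks[b] && entry < blocks[b] + SLOTS) {
            *block_idx = b;
            *slot_idx = (size_t)(entry - blocks[b]);
            return 0;                  /* found */
        }
    }
    return 1;                          /* not in any block */
}
```

Strictly, relational comparison of pointers into different arrays is unspecified territory in ISO C; the quoted OpenSM code makes the same assumption and it holds on flat-address-space targets.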
place for switch external ports (SwitchInfo) and the > + rest of the ports (NodeInfo) > +*/ > +static int pkey_mgr_get_physp_max_blocks( I would suggest to add _cap_ to function name. Not too much critical since it is static function. > + IN const osm_subn_t *p_subn, > + IN const osm_physp_t *p_physp) > +{ > + osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); > + osm_switch_t *p_sw; > + uint16_t num_pkeys = 0; > + > + if ( (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) || > + (osm_physp_get_port_num( p_physp ) == 0)) > + num_pkeys = cl_ntoh16( p_node->node_info.partition_cap ); > + else > + { > + p_sw = osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); > + if (p_sw) > + num_pkeys = cl_ntoh16( p_sw->switch_info.enforce_cap ); > + } > + return( (num_pkeys + 31) / 32 ); > +} > + > +/********************************************************************** > + **********************************************************************/ > +/* > + * Insert the new pending pkey entry to the specific port pkey table > + * pending pkeys. new entries are inserted at the back. > + */ > +static void pkey_mgr_process_physical_port( > + IN osm_log_t *p_log, > + IN const osm_req_t *p_req, > + IN const ib_net16_t pkey, > + IN osm_physp_t *p_physp ) > +{ > + osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); > + osm_pkey_tbl_t *p_pkey_tbl; > + ib_net16_t *p_orig_pkey; > + char *stat = NULL; > + osm_pending_pkey_t *p_pending; > + > + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); > + if (! p_pkey_tbl) ^^^^^^^^^^^^^ Is it possible? > + { > + osm_log( p_log, OSM_LOG_ERROR, > + "pkey_mgr_process_physical_port: ERR 0501: " > + "No pkey table found for node " > + "0x%016" PRIx64 " port %u\n", > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( p_physp ) ); > + return; > + } > + > + p_pending = (osm_pending_pkey_t *)malloc(sizeof(osm_pending_pkey_t)); > + if (! 
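The `(num_pkeys + 31) / 32` at the end of `pkey_mgr_get_physp_max_blocks` is the usual add-then-truncate ceiling division: PartitionCap (or enforce_cap) counts pkeys, and each PKeyTable block holds 32 of them. Isolated for clarity:

```c
#include <assert.h>
#include <stdint.h>

/* Blocks needed to hold num_pkeys entries at 32 pkeys per block:
 * integer ceiling division via the add-(divisor-1)-then-divide idiom,
 * as in pkey_mgr_get_physp_max_blocks(). */
static uint16_t pkey_blocks_needed(uint16_t num_pkeys)
{
    return (uint16_t)((num_pkeys + 31) / 32);
}
```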
p_pending) > + { > + osm_log( p_log, OSM_LOG_ERROR, > + "pkey_mgr_process_physical_port: ERR 0502: " > + "Fail to allocate new pending pkey entry for node " > + "0x%016" PRIx64 " port %u\n", > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( p_physp ) ); > + return; > + } > + p_pending->pkey = pkey; > + p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); > + if ( !p_orig_pkey || > + (ib_pkey_get_base(*p_orig_pkey) != ib_pkey_get_base(pkey) )) There the cases of new pkey and updated pkey membership is mixed. Why? > + { > + p_pending->is_new = TRUE; > + cl_qlist_insert_tail(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); > + stat = "inserted"; > + } > + else > + { > + p_pending->is_new = FALSE; > + if (osm_pkey_tbl_get_block_and_idx(p_pkey_tbl, p_orig_pkey, > + &p_pending->block, &p_pending->index)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ AFAIK in this function there were CL_ASSERTs which check for uinitialized pointers. > + { > + osm_log( p_log, OSM_LOG_ERROR, > + "pkey_mgr_process_physical_port: ERR 0503: " > + "Fail to obtain P_Key 0x%04x block and index for node " > + "0x%016" PRIx64 " port %u\n", > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( p_physp ) ); > + return; > + } > + cl_qlist_insert_head(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); > + stat = "updated"; Is it will be updated? It is likely "already there" case. No? Also in this case you can already put the pkey in new_block instead of holding it in pending list. Then later you will only need to add new pkeys. This may simplify the flow and even save some mem. 
> + } > + > + osm_log( p_log, OSM_LOG_DEBUG, > + "pkey_mgr_process_physical_port: " > + "pkey 0x%04x was %s for node 0x%016" PRIx64 > + " port %u\n", > + cl_ntoh16( pkey ), stat, > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( p_physp ) ); > +} > + > +/********************************************************************** > + **********************************************************************/ > +static void > +pkey_mgr_process_partition_table( > + osm_log_t *p_log, > + const osm_req_t *p_req, > + const osm_prtn_t *p_prtn, > + const boolean_t full ) > +{ > + const cl_map_t *p_tbl = full ? > + &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; > + cl_map_iterator_t i, i_next; > + ib_net16_t pkey = p_prtn->pkey; > + osm_physp_t *p_physp; > + > + if ( full ) > + pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); > + > + i_next = cl_map_head( p_tbl ); > + while ( i_next != cl_map_end( p_tbl ) ) > + { > + i = i_next; > + i_next = cl_map_next( i ); > + p_physp = cl_map_obj( i ); > + if ( p_physp && osm_physp_is_valid( p_physp ) ) > + pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); > + } > +} > + > +/********************************************************************** > + **********************************************************************/ > static ib_api_status_t > pkey_mgr_update_pkey_entry( > IN const osm_req_t *p_req, > @@ -114,7 +247,8 @@ pkey_mgr_enforce_partition( > p_pi->state_info2 = 0; > ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); > > - context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); > + context.pi_context.node_guid = > + osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); > context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); > context.pi_context.set_method = TRUE; > context.pi_context.update_master_sm_base_lid = FALSE; > @@ -131,80 +265,132 @@ pkey_mgr_enforce_partition( > > 
/********************************************************************** > **********************************************************************/ > -/* > - * Prepare a new entry for the pkey table for this port when this pkey > - * does not exist. Update existed entry when membership was changed. > - */ > -static void pkey_mgr_process_physical_port( > - IN osm_log_t *p_log, > - IN const osm_req_t *p_req, > - IN const ib_net16_t pkey, > - IN osm_physp_t *p_physp ) > +static boolean_t pkey_mgr_update_port( > + osm_log_t *p_log, > + osm_req_t *p_req, > + const osm_port_t * const p_port ) > { > - osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); > - ib_pkey_table_t *block; > + osm_physp_t *p_physp; > + osm_node_t *p_node; > + ib_pkey_table_t *block, *new_block; > + osm_pkey_tbl_t *p_pkey_tbl; > uint16_t block_index; > + uint8_t pkey_index; > + uint16_t last_free_block_index = 0; > + uint16_t last_free_pkey_index = 0; > uint16_t num_of_blocks; > - const osm_pkey_tbl_t *p_pkey_tbl; > - ib_net16_t *p_orig_pkey; > - char *stat = NULL; > - uint32_t i; > + uint16_t max_num_of_blocks; > > - p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); > - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > + ib_api_status_t status; > + boolean_t ret_val = FALSE; > + osm_pending_pkey_t *p_pending; > + boolean_t found; > > - p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); > + p_physp = osm_port_get_default_phys_ptr( p_port ); > + if ( !osm_physp_is_valid( p_physp ) ) > + return FALSE; > > - if ( !p_orig_pkey ) > - { > - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) > + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); > + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > + max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, p_physp ); > + if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) > { > - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > - for ( i = 0; i < 
IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) > + osm_log( p_log, OSM_LOG_INFO, > + "pkey_mgr_update_port: " > + "Max number of blocks reduced from %u to %u " > + "for node 0x%016" PRIx64 " port %u\n", > + p_pkey_tbl->max_blocks, max_num_of_blocks, > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( p_physp ) ); > + } > + p_pkey_tbl->max_blocks = max_num_of_blocks; > + > + osm_pkey_tbl_sync_new_blocks( p_pkey_tbl ); > + cl_map_remove_all( &p_pkey_tbl->keys ); What is the reason to drop map here? AFAIK it will be reinitialized later anyway when pkey blocks will be received. > + p_pkey_tbl->used_blocks = 0; > + > + /* > + process every pending pkey in order - > + first must be "updated" last are "new" > + */ > + p_pending = > + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); > + while (p_pending != > + (osm_pending_pkey_t *)cl_qlist_end( &p_pkey_tbl->pending ) ) > + { > + if (p_pending->is_new == FALSE) > + { > + block_index = p_pending->block; > + pkey_index = p_pending->index; > + found = TRUE; > + } > + else > { > - if ( ib_pkey_is_invalid( block->pkey_entry[i] ) ) > + found = osm_pkey_find_next_free_entry(p_pkey_tbl, > + &last_free_block_index, > + &last_free_pkey_index); There should be warning: expected third arg is uint8_t* > + if ( !found ) > { > - block->pkey_entry[i] = pkey; > - stat = "inserted"; > - goto _done; > + osm_log( p_log, OSM_LOG_ERROR, > + "pkey_mgr_update_port: ERR 0504: " > + "failed to find empty space for new pkey 0x%04x " > + "of node 0x%016" PRIx64 " port %u\n", > + cl_ntoh16(p_pending->pkey), > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( p_physp ) ); > } > + else > + { > + block_index = last_free_block_index; > + pkey_index = last_free_pkey_index++; > } > } > + > + if (found) > + { > + if (osm_pkey_tbl_set_new_entry( > + p_pkey_tbl, block_index, pkey_index, p_pending->pkey) ) > + { > osm_log( p_log, OSM_LOG_ERROR, > - "pkey_mgr_process_physical_port: ERR 0501: " > - 
"No empty pkey entry was found to insert 0x%04x for node " > - "0x%016" PRIx64 " port %u\n", > - cl_ntoh16( pkey ), > + "pkey_mgr_update_port: ERR 0505: " > + "failed to set PKey 0x%04x in block %u idx %u " > + "of node 0x%016" PRIx64 " port %u\n", > + p_pending->pkey, block_index, pkey_index, > cl_ntoh64( osm_node_get_node_guid( p_node ) ), > osm_physp_get_port_num( p_physp ) ); > } > - else if ( *p_orig_pkey != pkey ) > - { > + } > + > + free( p_pending ); > + p_pending = > + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); > + } > + > + /* now look for changes and store */ > for ( block_index = 0; block_index < num_of_blocks; block_index++ ) > { > - /* we need real block (not just new_block) in order > - * to resolve block/pkey indices */ > block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); > - i = p_orig_pkey - block->pkey_entry; > - if (i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK) { > - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > - block->pkey_entry[i] = pkey; > - stat = "updated"; > - goto _done; > - } > - } > - } > + new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > > - _done: > - if (stat) { > - osm_log( p_log, OSM_LOG_VERBOSE, > - "pkey_mgr_process_physical_port: " > - "pkey 0x%04x was %s for node 0x%016" PRIx64 > - " port %u\n", > - cl_ntoh16( pkey ), stat, > + if (block && > + (!new_block || !memcmp( new_block, block, sizeof( *block ) )) ) > + continue; > + > + status = pkey_mgr_update_pkey_entry( > + p_req, p_physp , new_block, block_index ); > + if (status == IB_SUCCESS) > + ret_val = TRUE; > + else > + osm_log( p_log, OSM_LOG_ERROR, > + "pkey_mgr_update_port: ERR 0506: " > + "pkey_mgr_update_pkey_entry() failed to update " > + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", > + block_index, > cl_ntoh64( osm_node_get_node_guid( p_node ) ), > osm_physp_get_port_num( p_physp ) ); > } > + > + return ret_val; > } > > /********************************************************************** > 
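The "now look for changes and store" loop above only issues a PKeyTable Set() when the old and new copies of a block actually differ, gated by `memcmp`; this keeps the SM from re-sending unchanged blocks every sweep. The diff-and-count skeleton, reduced to standalone form:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLOTS 4

/* Count blocks whose "new" contents differ from the "old" ones. The
 * pkey manager sends a Set() only for these, via the same memcmp gate
 * used in the quoted update loop. */
static unsigned count_dirty_blocks(const uint16_t (*old_tbl)[SLOTS],
                                   const uint16_t (*new_tbl)[SLOTS],
                                   unsigned n_blocks)
{
    unsigned dirty = 0;
    for (unsigned b = 0; b < n_blocks; b++)
        if (memcmp(old_tbl[b], new_tbl[b], sizeof old_tbl[b]) != 0)
            dirty++;                   /* would trigger an update MAD */
    return dirty;
}
```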
@@ -217,21 +403,23 @@ pkey_mgr_update_peer_port( > const osm_port_t * const p_port, > boolean_t enforce ) > { > - osm_physp_t *p, *peer; > + osm_physp_t *p_physp, *peer; > osm_node_t *p_node; > ib_pkey_table_t *block, *peer_block; > - const osm_pkey_tbl_t *p_pkey_tbl, *p_peer_pkey_tbl; > + const osm_pkey_tbl_t *p_pkey_tbl; > + osm_pkey_tbl_t *p_peer_pkey_tbl; > osm_switch_t *p_sw; > ib_switch_info_t *p_si; > uint16_t block_index; > uint16_t num_of_blocks; > + uint16_t peer_max_blocks; > ib_api_status_t status = IB_SUCCESS; > boolean_t ret_val = FALSE; > > - p = osm_port_get_default_phys_ptr( p_port ); > - if ( !osm_physp_is_valid( p ) ) > + p_physp = osm_port_get_default_phys_ptr( p_port ); > + if ( !osm_physp_is_valid( p_physp ) ) > return FALSE; > - peer = osm_physp_get_remote( p ); > + peer = osm_physp_get_remote( p_physp ); > if ( !peer || !osm_physp_is_valid( peer ) ) > return FALSE; > p_node = osm_physp_get_node_ptr( peer ); > @@ -245,7 +433,7 @@ pkey_mgr_update_peer_port( > if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) > { > osm_log( p_log, OSM_LOG_ERROR, > - "pkey_mgr_update_peer_port: ERR 0502: " > + "pkey_mgr_update_peer_port: ERR 0507: " > "pkey_mgr_enforce_partition() failed to update " > "node 0x%016" PRIx64 " port %u\n", > cl_ntoh64( osm_node_get_node_guid( p_node ) ), > @@ -255,24 +443,36 @@ pkey_mgr_update_peer_port( > if (enforce == FALSE) > return FALSE; > > - p_pkey_tbl = osm_physp_get_pkey_tbl( p ); > - p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); > + p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); > + p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); > num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > - if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) > - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); > + peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); > + if (peer_max_blocks < p_pkey_tbl->used_blocks) > + { > + osm_log( p_log, OSM_LOG_ERROR, > + 
"pkey_mgr_update_peer_port: ERR 0508: " > + "not enough entries (%u < %u) on switch 0x%016" PRIx64 > + " port %u\n", > + peer_max_blocks, num_of_blocks, > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( peer ) ); > + return FALSE; Do you think it is the best way, just to skip update - partitions are enforced already on the switch. May be better to truncate pkey tables in order to meet peer's capabilities? > + } > > - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) > + p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; > + for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) > { > block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); > if ( memcmp( peer_block, block, sizeof( *peer_block ) ) ) > { > + osm_pkey_tbl_set(p_peer_pkey_tbl, block_index, block); Why this (osm_pkey_tbl_set())? This will be called by receiver. > status = pkey_mgr_update_pkey_entry( p_req, peer, block, block_index ); > if ( status == IB_SUCCESS ) > ret_val = TRUE; > else > osm_log( p_log, OSM_LOG_ERROR, > - "pkey_mgr_update_peer_port: ERR 0503: " > + "pkey_mgr_update_peer_port: ERR 0509: " > "pkey_mgr_update_pkey_entry() failed to update " > "pkey table block %d for node 0x%016" PRIx64 > " port %u\n", > @@ -282,10 +482,10 @@ pkey_mgr_update_peer_port( > } > } > > - if ( ret_val == TRUE && > - osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) > + if ( (ret_val == TRUE) && > + osm_log_is_active( p_log, OSM_LOG_DEBUG ) ) > { > - osm_log( p_log, OSM_LOG_VERBOSE, > + osm_log( p_log, OSM_LOG_DEBUG, > "pkey_mgr_update_peer_port: " > "pkey table was updated for node 0x%016" PRIx64 > " port %u\n", > @@ -298,82 +498,6 @@ pkey_mgr_update_peer_port( > > /********************************************************************** > **********************************************************************/ > -static boolean_t pkey_mgr_update_port( > - osm_log_t 
*p_log, > - osm_req_t *p_req, > - const osm_port_t * const p_port ) > -{ > - osm_physp_t *p; > - osm_node_t *p_node; > - ib_pkey_table_t *block, *new_block; > - const osm_pkey_tbl_t *p_pkey_tbl; > - uint16_t block_index; > - uint16_t num_of_blocks; > - ib_api_status_t status; > - boolean_t ret_val = FALSE; > - > - p = osm_port_get_default_phys_ptr( p_port ); > - if ( !osm_physp_is_valid( p ) ) > - return FALSE; > - > - p_pkey_tbl = osm_physp_get_pkey_tbl(p); > - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > - > - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) > - { > - block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); > - new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > - > - if (!new_block || !memcmp( new_block, block, sizeof( *block ) ) ) > - continue; > - > - status = pkey_mgr_update_pkey_entry( p_req, p, new_block, block_index ); > - if (status == IB_SUCCESS) > - ret_val = TRUE; > - else > - osm_log( p_log, OSM_LOG_ERROR, > - "pkey_mgr_update_port: ERR 0504: " > - "pkey_mgr_update_pkey_entry() failed to update " > - "pkey table block %d for node 0x%016" PRIx64 " port %u\n", > - block_index, > - cl_ntoh64( osm_node_get_node_guid( p_node ) ), > - osm_physp_get_port_num( p ) ); > - } > - > - return ret_val; > -} > - > -/********************************************************************** > - **********************************************************************/ > -static void > -pkey_mgr_process_partition_table( > - osm_log_t *p_log, > - const osm_req_t *p_req, > - const osm_prtn_t *p_prtn, > - const boolean_t full ) > -{ > - const cl_map_t *p_tbl = full ? 
> - &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; > - cl_map_iterator_t i, i_next; > - ib_net16_t pkey = p_prtn->pkey; > - osm_physp_t *p_physp; > - > - if ( full ) > - pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); > - > - i_next = cl_map_head( p_tbl ); > - while ( i_next != cl_map_end( p_tbl ) ) > - { > - i = i_next; > - i_next = cl_map_next( i ); > - p_physp = cl_map_obj( i ); > - if ( p_physp && osm_physp_is_valid( p_physp ) ) > - pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); > - } > -} > - > -/********************************************************************** > - **********************************************************************/ > osm_signal_t > osm_pkey_mgr_process( > IN osm_opensm_t *p_osm ) > @@ -383,8 +507,7 @@ osm_pkey_mgr_process( > osm_prtn_t *p_prtn; > osm_port_t *p_port; > osm_signal_t signal = OSM_SIGNAL_DONE; > - osm_physp_t *p_physp; > - > + osm_node_t *p_node; > CL_ASSERT( p_osm ); > > OSM_LOG_ENTER( &p_osm->log, osm_pkey_mgr_process ); > @@ -394,32 +517,25 @@ osm_pkey_mgr_process( > if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) > { > osm_log( &p_osm->log, OSM_LOG_ERROR, > - "osm_pkey_mgr_process: ERR 0505: " > + "osm_pkey_mgr_process: ERR 0510: " > "osm_prtn_make_partitions() failed\n" ); > goto _err; > } > > - p_tbl = &p_osm->subn.port_guid_tbl; > - p_next = cl_qmap_head( p_tbl ); > - while ( p_next != cl_qmap_end( p_tbl ) ) > - { > - p_port = ( osm_port_t * ) p_next; > - p_next = cl_qmap_next( p_next ); > - p_physp = osm_port_get_default_phys_ptr( p_port ); > - if ( osm_physp_is_valid( p_physp ) ) > - osm_pkey_tbl_sync_new_blocks( osm_physp_get_pkey_tbl( p_physp ) ); > - } > - > + /* populate the pending pkey entries by scanning all partitions */ > p_tbl = &p_osm->subn.prtn_pkey_tbl; > p_next = cl_qmap_head( p_tbl ); > while ( p_next != cl_qmap_end( p_tbl ) ) > { > p_prtn = ( osm_prtn_t * ) p_next; > p_next = cl_qmap_next( p_next ); > - pkey_mgr_process_partition_table( &p_osm->log, 
&p_osm->sm.req, p_prtn, FALSE ); > - pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); > + pkey_mgr_process_partition_table( > + &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); > + pkey_mgr_process_partition_table( > + &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); > } > > + /* calculate new pkey tables and set */ > p_tbl = &p_osm->subn.port_guid_tbl; > p_next = cl_qmap_head( p_tbl ); > while ( p_next != cl_qmap_end( p_tbl ) ) > @@ -428,8 +544,10 @@ osm_pkey_mgr_process( > p_next = cl_qmap_next( p_next ); > if ( pkey_mgr_update_port( &p_osm->log, &p_osm->sm.req, p_port ) ) > signal = OSM_SIGNAL_DONE_PENDING; > - if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != IB_NODE_TYPE_SWITCH && > - pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, > + p_node = osm_port_get_parent_node( p_port ); > + if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && > + pkey_mgr_update_peer_port( > + &p_osm->log, &p_osm->sm.req, > &p_osm->subn, p_port, > !p_osm->subn.opt.no_partition_enforcement ) ) > signal = OSM_SIGNAL_DONE_PENDING; > > Thanks, Sasha From hnbhuvaneshwar at novell.com Thu Jun 15 04:59:59 2006 From: hnbhuvaneshwar at novell.com (Bhuvaneshwar HN) Date: Thu, 15 Jun 2006 05:59:59 -0600 Subject: [openib-general] Bond0 Driver support for IB In-Reply-To: References: Message-ID: <449199170200005F0000ACC8@lucius.provo.novell.com> Hi We were thinking of using Linux Bond0 driver for Load balancing and Fault tolerance for IB, any thoughts on this would be welcome Regards Bhuvi From eitan at mellanox.co.il Thu Jun 15 05:19:44 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 15 Jun 2006 15:19:44 +0300 Subject: [openib-general] [PATCH] osm: partition manager force policy In-Reply-To: <20060615110617.GA21560@sashak.voltaire.com> References: <86odwxgqrs.fsf@mtl066.yok.mtl.com> <20060615110617.GA21560@sashak.voltaire.com> Message-ID: <44915060.6090103@mellanox.co.il> Sasha Khapyorsky wrote: > Hi Eitan, > > Some comments 
about the patch. Thanks for the review. The major point you bring up is that I intentionally impose the result of the pkey settings on the SMDB instead of waiting for the GetResp to do that for me. The idea I had was that once the Pkey Manager calculates the new tables, any SA query that involves PKey matching would use the results immediately. But this actually opens up another, bigger bug: what if the setting failed? The SMDB will not know that on the next sweep and will avoid sending the update. So I think the best approach is to not set anything and rely on the receiver to perform the setting for me. I will make the changes, test them, and send a new patch. > > Personally I'm glad to see that you are using tabs instead of spaces as > the indentation character. But it would be nice if next time you did not mix > the functional changes and indentation fixes in the same patch, and instead > provided two different patches. It would also be nice if your > indentation fixes covered whole file(s) and not just selected lines. > > The same goes for massive code moving; separating the patches may simplify > review. Yes, you are correct about this. I will use this method in the future: a first patch with code changes and a second one with ordering and style changes. > > The rest is below. > > On 15:54 Tue 13 Jun , Eitan Zahavi wrote: > >>Hi Hal >> >>This is a second take after debug and cleanup of the partition manager >>patch I previously provided. The functionality is the same, but >>this one comes after 2 days of testing on the simulator. >>I also did some code restructuring for clarity. >> >>Tests passed were both the dedicated pkey enforcement tests (pkey.*) and >>the stress test (osmStress.*) >> >>As I started to test the partition manager code (using the ibmgtsim pkey test), >>I realized the implementation does not really enforce the partition policy >>on the given fabric. This patch fixes that. It was verified using the >>simulation test. 
Several other corner cases were fixed too. >> >>Eitan >> >>Signed-off-by: Eitan Zahavi >> >>Index: include/opensm/osm_port.h >>=================================================================== >>--- include/opensm/osm_port.h (revision 7867) >>+++ include/opensm/osm_port.h (working copy) >>@@ -586,6 +586,39 @@ osm_physp_get_pkey_tbl( IN const osm_phy >> * Port, Physical Port >> *********/ >> >>+/****f* OpenSM: Physical Port/osm_physp_get_mod_pkey_tbl >>+* NAME >>+* osm_physp_get_mod_pkey_tbl >>+* >>+* DESCRIPTION >>+* Returns a NON CONST pointer to the P_Key table object of the Physical Port object. >>+* >>+* SYNOPSIS >>+*/ >>+static inline osm_pkey_tbl_t * >>+osm_physp_get_mod_pkey_tbl( IN osm_physp_t* const p_physp ) >>+{ >>+ CL_ASSERT( osm_physp_is_valid( p_physp ) ); >>+ /* >>+ (14.2.5.7) - the block number valid values are 0-2047, and are further >>+ limited by the size of the P_Key table specified by the PartitionCap on the node. >>+ */ >>+ return( &p_physp->pkeys ); >>+}; >>+/* >>+* PARAMETERS >>+* p_physp >>+* [in] Pointer to an osm_physp_t object. >>+* >>+* RETURN VALUES >>+* The pointer to the P_Key table object. >>+* >>+* NOTES >>+* >>+* SEE ALSO >>+* Port, Physical Port >>+*********/ >>+ > > > Is not this simpler to remove 'const' from existing > osm_physp_get_pkey_tbl() function instead of using new one? There are plenty of const functions using this function internally so I would have need to fix them too. 
> > >> /****f* OpenSM: Physical Port/osm_physp_set_slvl_tbl >> * NAME >> * osm_physp_set_slvl_tbl >>Index: include/opensm/osm_pkey.h >>=================================================================== >>--- include/opensm/osm_pkey.h (revision 7867) >>+++ include/opensm/osm_pkey.h (working copy) >>@@ -92,6 +92,9 @@ typedef struct _osm_pkey_tbl >> cl_ptr_vector_t blocks; >> cl_ptr_vector_t new_blocks; >> cl_map_t keys; >>+ cl_qlist_t pending; >>+ uint16_t used_blocks; >>+ uint16_t max_blocks; >> } osm_pkey_tbl_t; >> /* >> * FIELDS >>@@ -104,6 +107,18 @@ typedef struct _osm_pkey_tbl >> * keys >> * A set holding all keys >> * >>+* pending >>+* A list osm_pending_pkey structs that is temporarily set by the >>+* pkey mgr and used during pkey mgr algorithm only >>+* >>+* used_blocks >>+* Tracks the number of blocks having non-zero pkeys >>+* >>+* max_blocks >>+* The maximal number of blocks this partition table might hold >>+* this value is based on node_info (for port 0 or CA) or switch_info >>+* updated on receiving the node_info or switch_info GetResp >>+* >> * NOTES >> * 'blocks' vector should be used to store pkey values obtained from >> * the port and SM pkey manager should not change it directly, for this >>@@ -114,6 +129,39 @@ typedef struct _osm_pkey_tbl >> * >> *********/ >> >>+/****s* OpenSM: osm_pending_pkey_t >>+* NAME >>+* osm_pending_pkey_t >>+* >>+* DESCRIPTION >>+* This objects stores temporary information on pkeys their target block and index >>+* during the pkey manager operation >>+* >>+* SYNOPSIS >>+*/ >>+typedef struct _osm_pending_pkey { >>+ cl_list_item_t list_item; >>+ uint16_t pkey; >>+ uint32_t block; >>+ uint8_t index; >>+ boolean_t is_new; >>+} osm_pending_pkey_t; >>+/* >>+* FIELDS >>+* pkey >>+* The actual P_Key >>+* >>+* block >>+* The block index based on the previous table extracted from the device >>+* >>+* index >>+* The index of the pky within the block >>+* >>+* is_new >>+* TRUE for new P_Keys such that the block and index are 
invalid in that case >>+* >>+*********/ >>+ >> /****f* OpenSM: osm_pkey_tbl_construct >> * NAME >> * osm_pkey_tbl_construct >>@@ -209,8 +257,8 @@ osm_pkey_tbl_get_num_blocks( >> static inline ib_pkey_table_t *osm_pkey_tbl_block_get( >> const osm_pkey_tbl_t *p_pkey_tbl, uint16_t block) >> { >>- CL_ASSERT(block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)); >>- return(cl_ptr_vector_get(&p_pkey_tbl->blocks, block)); >>+ return( (block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)) ? >>+ cl_ptr_vector_get(&p_pkey_tbl->blocks, block) : NULL); >> }; >> /* >> * p_pkey_tbl >>@@ -244,6 +292,106 @@ static inline ib_pkey_table_t *osm_pkey_ >> /* >> *********/ >> >>+ >>+/****f* OpenSM: osm_pkey_tbl_make_block_pair >>+* NAME >>+* osm_pkey_tbl_make_block_pair >>+* >>+* DESCRIPTION >>+* Find or create a pair of "old" and "new" blocks for the >>+* given block index >>+* >>+* SYNOPSIS >>+*/ >>+int osm_pkey_tbl_make_block_pair( >>+ osm_pkey_tbl_t *p_pkey_tbl, >>+ uint16_t block_idx, >>+ ib_pkey_table_t **pp_old_block, >>+ ib_pkey_table_t **pp_new_block); >>+/* >>+* p_pkey_tbl >>+* [in] Pointer to the PKey table >>+* >>+* block_idx >>+* [in] The block index to use >>+* >>+* pp_old_block >>+* [out] Pointer to the old block pointer arg >>+* >>+* pp_new_block >>+* [out] Pointer to the new block pointer arg >>+* >>+* RETURN VALUES >>+* 0 if OK 1 if failed > > > It is better (conventional) to use -1 as failure return status. I have seen and used both - depend on the application. I think I should have used IB_SUCCESS or IB_ERROR but I do not mind changing that to -1 too. 
> > >>+* >>+*********/ >>+ >>+/****f* OpenSM: osm_pkey_tbl_set_new_entry >>+* NAME >>+* osm_pkey_tbl_set_new_entry >>+* >>+* DESCRIPTION >>+* stores the given pkey in the "new" blocks array and update >>+* the "map" to show that on the "old" blocks >>+* >>+* SYNOPSIS >>+*/ >>+int >>+osm_pkey_tbl_set_new_entry( >>+ IN osm_pkey_tbl_t *p_pkey_tbl, >>+ IN uint16_t block_idx, >>+ IN uint8_t pkey_idx, >>+ IN uint16_t pkey); >>+/* >>+* p_pkey_tbl >>+* [in] Pointer to the PKey table >>+* >>+* block_idx >>+* [in] The block index to use >>+* >>+* pkey_idx >>+* [in] The index within the block >>+* >>+* pkey >>+* [in] PKey to store >>+* >>+* RETURN VALUES >>+* 0 if OK 1 if failed > > > Ditto > > >>+* >>+*********/ >>+ >>+/****f* OpenSM: osm_pkey_find_next_free_entry >>+* NAME >>+* osm_pkey_find_next_free_entry >>+* >>+* DESCRIPTION >>+* Find the next free entry in the PKey table. Starting at the given >>+* index and block number. The user should increment pkey_idx before >>+* next call >>+* Inspect the "new" blocks array for empty space. >>+* >>+* SYNOPSIS >>+*/ >>+boolean_t >>+osm_pkey_find_next_free_entry( >>+ IN osm_pkey_tbl_t *p_pkey_tbl, >>+ OUT uint16_t *p_block_idx, >>+ OUT uint8_t *p_pkey_idx); >>+/* >>+* p_pkey_tbl >>+* [in] Pointer to the PKey table >>+* >>+* p_block_idx >>+* [out] The block index to use >>+* >>+* p_pkey_idx >>+* [out] The index within the block to use >>+* >>+* RETURN VALUES >>+* TRUE if found FALSE if did not find >>+* >>+*********/ >>+ >> /****f* OpenSM: osm_pkey_tbl_sync_new_blocks >> * NAME >> * osm_pkey_tbl_sync_new_blocks >>@@ -263,9 +411,44 @@ void osm_pkey_tbl_sync_new_blocks( >> * >> *********/ >> >>+/****f* OpenSM: osm_pkey_tbl_get_block_and_idx >>+* NAME >>+* osm_pkey_tbl_get_block_and_idx >>+* >>+* DESCRIPTION >>+* set the block index and pkey index the given >>+* pkey is found in. 
return 1 if cound not find >>+* it, 0 if OK >>+* >>+* SYNOPSIS >>+*/ >>+int >>+osm_pkey_tbl_get_block_and_idx( >>+ IN osm_pkey_tbl_t *p_pkey_tbl, >>+ IN uint16_t *p_pkey, >>+ OUT uint32_t *block_idx, >>+ OUT uint8_t *pkey_index); >>+/* >>+* p_pkey_tbl >>+* [in] Pointer to osm_pkey_tbl_t object. >>+* >>+* p_pkey >>+* [in] Pointer to the P_Key entry searched >>+* >>+* p_block_idx >>+* [out] Pointer to the block index to be updated >>+* >>+* p_pkey_idx >>+* [out] Pointer to the pkey index (in the block) to be updated >>+* >>+* >>+* NOTES >>+* >>+*********/ >>+ >> /****f* OpenSM: osm_pkey_tbl_set >> * NAME >> * osm_pkey_tbl_set >>Index: opensm/osm_pkey.c >>=================================================================== >>--- opensm/osm_pkey.c (revision 7904) >>+++ opensm/osm_pkey.c (working copy) >>@@ -100,6 +100,9 @@ int osm_pkey_tbl_init( >> cl_ptr_vector_init( &p_pkey_tbl->blocks, 0, 1); >> cl_ptr_vector_init( &p_pkey_tbl->new_blocks, 0, 1); >> cl_map_init( &p_pkey_tbl->keys, 1 ); >>+ cl_qlist_init( &p_pkey_tbl->pending ); >>+ p_pkey_tbl->used_blocks = 0; >>+ p_pkey_tbl->max_blocks = 0; >> return(IB_SUCCESS); >> } >> >>@@ -118,14 +121,29 @@ void osm_pkey_tbl_sync_new_blocks( >> p_block = cl_ptr_vector_get(&p_pkey_tbl->blocks, b); >> if ( b < new_blocks ) >> p_new_block = cl_ptr_vector_get(&p_pkey_tbl->new_blocks, b); >>- else { >>+ else >>+ { >> p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); >> if (!p_new_block) >> break; >>+ cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, >>+ b, p_new_block); >>+ } >>+ >> memset(p_new_block, 0, sizeof(*p_new_block)); >>- cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, p_new_block); >> } >>- memcpy(p_new_block, p_block, sizeof(*p_new_block)); >>+} > > > You changed this function so it does not do any sync anymore. Should > function name be changed too? Yes correct I will change it. Is a better name: osm_pkey_tbl_init_new_blocks ? 
> > >>+ >>+/********************************************************************** >>+ **********************************************************************/ >>+void osm_pkey_tbl_cleanup_pending( >>+ IN osm_pkey_tbl_t *p_pkey_tbl) >>+{ >>+ cl_list_item_t *p_item; >>+ p_item = cl_qlist_remove_head( &p_pkey_tbl->pending ); >>+ while (p_item != cl_qlist_end( &p_pkey_tbl->pending ) ) >>+ { >>+ free( (osm_pending_pkey_t *)p_item ); >> } >> } >> >>@@ -202,6 +220,138 @@ int osm_pkey_tbl_set( >> >> /********************************************************************** >> **********************************************************************/ >>+int osm_pkey_tbl_make_block_pair( >>+ osm_pkey_tbl_t *p_pkey_tbl, >>+ uint16_t block_idx, >>+ ib_pkey_table_t **pp_old_block, >>+ ib_pkey_table_t **pp_new_block) >>+{ >>+ if (block_idx >= p_pkey_tbl->max_blocks) return 1; >>+ >>+ if (pp_old_block) >>+ { >>+ *pp_old_block = osm_pkey_tbl_block_get( p_pkey_tbl, block_idx ); >>+ if (! *pp_old_block) >>+ { >>+ *pp_old_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); >>+ if (!*pp_old_block) return 1; >>+ memset(*pp_old_block, 0, sizeof(ib_pkey_table_t)); >>+ cl_ptr_vector_set(&p_pkey_tbl->blocks, block_idx, *pp_old_block); >>+ } >>+ } >>+ >>+ if (pp_new_block) >>+ { >>+ *pp_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_idx ); >>+ if (! 
*pp_new_block) >>+ { >>+ *pp_new_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); >>+ if (!*pp_new_block) return 1; >>+ memset(*pp_new_block, 0, sizeof(ib_pkey_table_t)); >>+ cl_ptr_vector_set(&p_pkey_tbl->new_blocks, block_idx, *pp_new_block); >>+ } >>+ } >>+ return 0; >>+} >>+ >>+/********************************************************************** >>+ **********************************************************************/ >>+/* >>+ store the given pkey in the "new" blocks array and update the "map" >>+ to show that on the "old" blocks >>+*/ >>+int >>+osm_pkey_tbl_set_new_entry( >>+ IN osm_pkey_tbl_t *p_pkey_tbl, >>+ IN uint16_t block_idx, >>+ IN uint8_t pkey_idx, >>+ IN uint16_t pkey) >>+{ >>+ ib_pkey_table_t *p_old_block; >>+ ib_pkey_table_t *p_new_block; >>+ >>+ if (osm_pkey_tbl_make_block_pair( >>+ p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) >>+ return 1; >>+ >>+ cl_map_insert( &p_pkey_tbl->keys, >>+ ib_pkey_get_base(pkey), >>+ &(p_old_block->pkey_entry[pkey_idx])); > > > Here you map potentially empty pkey entry. Why? "old block" will be > remapped anyway on pkey receiving. The reason I did this was that if the GetResp will fail I still want to represent the settings in the map.But actually it might be better not to do that so next time we run we will not find it without a GetResp. > > Actually I don't see why you want this pretty tricky and pkey_mgr > specific procedure as generic function. I think once the new_blocks was made available through the osm_pkey.h we actually burden the pkey table object with the full complexity of the pkey manager. 
So I think the right place for the functions changing the pkey table is in the osm_pkey.* > > >>+ p_new_block->pkey_entry[pkey_idx] = pkey; >>+ if (p_pkey_tbl->used_blocks < block_idx) >>+ p_pkey_tbl->used_blocks = block_idx; >>+ >>+ return 0; >>+} >>+ >>+/********************************************************************** >>+ **********************************************************************/ >>+boolean_t >>+osm_pkey_find_next_free_entry( >>+ IN osm_pkey_tbl_t *p_pkey_tbl, >>+ OUT uint16_t *p_block_idx, >>+ OUT uint8_t *p_pkey_idx) >>+{ >>+ ib_pkey_table_t *p_new_block; >>+ >>+ CL_ASSERT(p_block_idx); >>+ CL_ASSERT(p_pkey_idx); >>+ >>+ while ( *p_block_idx < p_pkey_tbl->max_blocks) >>+ { >>+ if (*p_pkey_idx > IB_NUM_PKEY_ELEMENTS_IN_BLOCK - 1) >>+ { >>+ *p_pkey_idx = 0; >>+ (*p_block_idx)++; >>+ if (*p_block_idx >= p_pkey_tbl->max_blocks) >>+ return FALSE; >>+ } >>+ >>+ p_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, *p_block_idx); >>+ >>+ if ( !p_new_block || >>+ ib_pkey_is_invalid(p_new_block->pkey_entry[*p_pkey_idx])) >>+ return TRUE; >>+ else >>+ (*p_pkey_idx)++; >>+ } >>+ return FALSE; >>+} >>+ >>+/********************************************************************** >>+ **********************************************************************/ >>+int >>+osm_pkey_tbl_get_block_and_idx( >>+ IN osm_pkey_tbl_t *p_pkey_tbl, >>+ IN uint16_t *p_pkey, >>+ OUT uint32_t *p_block_idx, >>+ OUT uint8_t *p_pkey_index) >>+{ >>+ uint32_t num_of_blocks; >>+ uint32_t block_index; >>+ ib_pkey_table_t *block; >>+ >>+ CL_ASSERT( p_pkey_tbl ); >>+ CL_ASSERT( p_block_idx != NULL ); >>+ CL_ASSERT( p_pkey_idx != NULL ); > > > Why last two CL_ASSERTs? What should be problem with uninitialized > pointers here? > These are the outputs of the function. It does not make sense to call the functions with null output pointers (calling by ref) . 
Anyway instead of putting the check in the free build I used an assert > >>+ >>+ num_of_blocks = cl_ptr_vector_get_size( &p_pkey_tbl->blocks); >>+ for ( block_index = 0; block_index < num_of_blocks; block_index++ ) >>+ { >>+ block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); >>+ if ( ( block->pkey_entry <= p_pkey ) && >>+ ( p_pkey < block->pkey_entry + IB_NUM_PKEY_ELEMENTS_IN_BLOCK)) >>+ { >>+ *p_block_idx = block_index; >>+ *p_pkey_index = p_pkey - block->pkey_entry; >>+ return 0; >>+ } >>+ } >>+ return 1; >>+} >>+ >>+/********************************************************************** >>+ **********************************************************************/ >> static boolean_t __osm_match_pkey ( >> IN const ib_net16_t *pkey1, >> IN const ib_net16_t *pkey2 ) { >>@@ -305,7 +455,8 @@ osm_physp_share_pkey( >> if (cl_is_map_empty(&pkey_tbl1->keys) || cl_is_map_empty(&pkey_tbl2->keys)) >> return TRUE; >> >>- return !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); >>+ return >>+ !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); >> } >> >> /********************************************************************** >>@@ -321,7 +472,8 @@ osm_port_share_pkey( >> >> OSM_LOG_ENTER( p_log, osm_port_share_pkey ); >> >>- if (!p_port_1 || !p_port_2) { >>+ if (!p_port_1 || !p_port_2) >>+ { >> ret = FALSE; >> goto Exit; >> } >>@@ -329,7 +481,8 @@ osm_port_share_pkey( >> p_physp1 = osm_port_get_default_phys_ptr(p_port_1); >> p_physp2 = osm_port_get_default_phys_ptr(p_port_2); >> >>- if (!p_physp1 || !p_physp2) { >>+ if (!p_physp1 || !p_physp2) >>+ { >> ret = FALSE; >> goto Exit; >> } >>Index: opensm/osm_pkey_mgr.c >>=================================================================== >>--- opensm/osm_pkey_mgr.c (revision 7904) >>+++ opensm/osm_pkey_mgr.c (working copy) >>@@ -62,6 +62,139 @@ >> >> /********************************************************************** >> 
**********************************************************************/ >>+/* >>+ the max number of pkey blocks for a physical port is located in >>+ different place for switch external ports (SwitchInfo) and the >>+ rest of the ports (NodeInfo) >>+*/ >>+static int pkey_mgr_get_physp_max_blocks( > > > I would suggest to add _cap_ to function name. Not too much critical > since it is static function. > > >>+ IN const osm_subn_t *p_subn, >>+ IN const osm_physp_t *p_physp) >>+{ >>+ osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); >>+ osm_switch_t *p_sw; >>+ uint16_t num_pkeys = 0; >>+ >>+ if ( (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) || >>+ (osm_physp_get_port_num( p_physp ) == 0)) >>+ num_pkeys = cl_ntoh16( p_node->node_info.partition_cap ); >>+ else >>+ { >>+ p_sw = osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); >>+ if (p_sw) >>+ num_pkeys = cl_ntoh16( p_sw->switch_info.enforce_cap ); >>+ } >>+ return( (num_pkeys + 31) / 32 ); >>+} >>+ >>+/********************************************************************** >>+ **********************************************************************/ >>+/* >>+ * Insert the new pending pkey entry to the specific port pkey table >>+ * pending pkeys. new entries are inserted at the back. >>+ */ >>+static void pkey_mgr_process_physical_port( >>+ IN osm_log_t *p_log, >>+ IN const osm_req_t *p_req, >>+ IN const ib_net16_t pkey, >>+ IN osm_physp_t *p_physp ) >>+{ >>+ osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); >>+ osm_pkey_tbl_t *p_pkey_tbl; >>+ ib_net16_t *p_orig_pkey; >>+ char *stat = NULL; >>+ osm_pending_pkey_t *p_pending; >>+ >>+ p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); >>+ if (! p_pkey_tbl) > > ^^^^^^^^^^^^^ > Is it possible? Yes it is ! I run into it during testing. The port did not have any pkey table. 
> > >>+ { >>+ osm_log( p_log, OSM_LOG_ERROR, >>+ "pkey_mgr_process_physical_port: ERR 0501: " >>+ "No pkey table found for node " >>+ "0x%016" PRIx64 " port %u\n", >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>+ osm_physp_get_port_num( p_physp ) ); >>+ return; >>+ } >>+ >>+ p_pending = (osm_pending_pkey_t *)malloc(sizeof(osm_pending_pkey_t)); >>+ if (! p_pending) >>+ { >>+ osm_log( p_log, OSM_LOG_ERROR, >>+ "pkey_mgr_process_physical_port: ERR 0502: " >>+ "Fail to allocate new pending pkey entry for node " >>+ "0x%016" PRIx64 " port %u\n", >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>+ osm_physp_get_port_num( p_physp ) ); >>+ return; >>+ } >>+ p_pending->pkey = pkey; >>+ p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); >>+ if ( !p_orig_pkey || >>+ (ib_pkey_get_base(*p_orig_pkey) != ib_pkey_get_base(pkey) )) > > > There the cases of new pkey and updated pkey membership is mixed. Why? I am not following your question. The specific case I am trying to catch is the one that for some reason the map points to a pkey entry that was modified somehow and is different then the one you would expect by the map. > > >>+ { >>+ p_pending->is_new = TRUE; >>+ cl_qlist_insert_tail(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); >>+ stat = "inserted"; >>+ } >>+ else >>+ { >>+ p_pending->is_new = FALSE; >>+ if (osm_pkey_tbl_get_block_and_idx(p_pkey_tbl, p_orig_pkey, >>+ &p_pending->block, &p_pending->index)) > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > AFAIK in this function there were CL_ASSERTs which check for uinitialized > pointers. True. So the asserts are not required in this case. 
> > >>+ { >>+ osm_log( p_log, OSM_LOG_ERROR, >>+ "pkey_mgr_process_physical_port: ERR 0503: " >>+ "Fail to obtain P_Key 0x%04x block and index for node " >>+ "0x%016" PRIx64 " port %u\n", >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>+ osm_physp_get_port_num( p_physp ) ); >>+ return; >>+ } >>+ cl_qlist_insert_head(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); >>+ stat = "updated"; > > > Is it will be updated? It is likely "already there" case. No? > > Also in this case you can already put the pkey in new_block instead of > holding it in pending list. Then later you will only need to add new > pkeys. This may simplify the flow and even save some mem. True but in my mind it does not simplify - on the contrary it makes the partition between populating each port pending list and actually setting the pkey tables mixed. I do not think the memory impact deserves this mix of staging > > >>+ } >>+ >>+ osm_log( p_log, OSM_LOG_DEBUG, >>+ "pkey_mgr_process_physical_port: " >>+ "pkey 0x%04x was %s for node 0x%016" PRIx64 >>+ " port %u\n", >>+ cl_ntoh16( pkey ), stat, >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>+ osm_physp_get_port_num( p_physp ) ); >>+} >>+ >>+/********************************************************************** >>+ **********************************************************************/ >>+static void >>+pkey_mgr_process_partition_table( >>+ osm_log_t *p_log, >>+ const osm_req_t *p_req, >>+ const osm_prtn_t *p_prtn, >>+ const boolean_t full ) >>+{ >>+ const cl_map_t *p_tbl = full ? 
>>+ &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; >>+ cl_map_iterator_t i, i_next; >>+ ib_net16_t pkey = p_prtn->pkey; >>+ osm_physp_t *p_physp; >>+ >>+ if ( full ) >>+ pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); >>+ >>+ i_next = cl_map_head( p_tbl ); >>+ while ( i_next != cl_map_end( p_tbl ) ) >>+ { >>+ i = i_next; >>+ i_next = cl_map_next( i ); >>+ p_physp = cl_map_obj( i ); >>+ if ( p_physp && osm_physp_is_valid( p_physp ) ) >>+ pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); >>+ } >>+} >>+ >>+/********************************************************************** >>+ **********************************************************************/ >> static ib_api_status_t >> pkey_mgr_update_pkey_entry( >> IN const osm_req_t *p_req, >>@@ -114,7 +247,8 @@ pkey_mgr_enforce_partition( >> p_pi->state_info2 = 0; >> ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); >> >>- context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); >>+ context.pi_context.node_guid = >>+ osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); >> context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); >> context.pi_context.set_method = TRUE; >> context.pi_context.update_master_sm_base_lid = FALSE; >>@@ -131,80 +265,132 @@ pkey_mgr_enforce_partition( >> >> /********************************************************************** >> **********************************************************************/ >>-/* >>- * Prepare a new entry for the pkey table for this port when this pkey >>- * does not exist. Update existed entry when membership was changed. 
>>- */ >>-static void pkey_mgr_process_physical_port( >>- IN osm_log_t *p_log, >>- IN const osm_req_t *p_req, >>- IN const ib_net16_t pkey, >>- IN osm_physp_t *p_physp ) >>+static boolean_t pkey_mgr_update_port( >>+ osm_log_t *p_log, >>+ osm_req_t *p_req, >>+ const osm_port_t * const p_port ) >> { >>- osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); >>- ib_pkey_table_t *block; >>+ osm_physp_t *p_physp; >>+ osm_node_t *p_node; >>+ ib_pkey_table_t *block, *new_block; >>+ osm_pkey_tbl_t *p_pkey_tbl; >> uint16_t block_index; >>+ uint8_t pkey_index; >>+ uint16_t last_free_block_index = 0; >>+ uint16_t last_free_pkey_index = 0; >> uint16_t num_of_blocks; >>- const osm_pkey_tbl_t *p_pkey_tbl; >>- ib_net16_t *p_orig_pkey; >>- char *stat = NULL; >>- uint32_t i; >>+ uint16_t max_num_of_blocks; >> >>- p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); >>- num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); >>+ ib_api_status_t status; >>+ boolean_t ret_val = FALSE; >>+ osm_pending_pkey_t *p_pending; >>+ boolean_t found; >> >>- p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); >>+ p_physp = osm_port_get_default_phys_ptr( p_port ); >>+ if ( !osm_physp_is_valid( p_physp ) ) >>+ return FALSE; >> >>- if ( !p_orig_pkey ) >>- { >>- for ( block_index = 0; block_index < num_of_blocks; block_index++ ) >>+ p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); >>+ num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); >>+ max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, p_physp ); >>+ if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) >> { >>- block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); >>- for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) >>+ osm_log( p_log, OSM_LOG_INFO, >>+ "pkey_mgr_update_port: " >>+ "Max number of blocks reduced from %u to %u " >>+ "for node 0x%016" PRIx64 " port %u\n", >>+ p_pkey_tbl->max_blocks, max_num_of_blocks, >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>+ 
osm_physp_get_port_num( p_physp ) ); >>+ } >>+ p_pkey_tbl->max_blocks = max_num_of_blocks; >>+ >>+ osm_pkey_tbl_sync_new_blocks( p_pkey_tbl ); >>+ cl_map_remove_all( &p_pkey_tbl->keys ); > > > What is the reason to drop map here? AFAIK it will be reinitialized later > anyway when pkey blocks will be received. What if it is not received? > > >>+ p_pkey_tbl->used_blocks = 0; >>+ >>+ /* >>+ process every pending pkey in order - >>+ first must be "updated" last are "new" >>+ */ >>+ p_pending = >>+ (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); >>+ while (p_pending != >>+ (osm_pending_pkey_t *)cl_qlist_end( &p_pkey_tbl->pending ) ) >>+ { >>+ if (p_pending->is_new == FALSE) >>+ { >>+ block_index = p_pending->block; >>+ pkey_index = p_pending->index; >>+ found = TRUE; >>+ } >>+ else >> { >>- if ( ib_pkey_is_invalid( block->pkey_entry[i] ) ) >>+ found = osm_pkey_find_next_free_entry(p_pkey_tbl, >>+ &last_free_block_index, >>+ &last_free_pkey_index); > > > There should be warning: expected third arg is uint8_t* True. 
I will fix the variable declaration to uint8_t > > >>+ if ( !found ) >> { >>- block->pkey_entry[i] = pkey; >>- stat = "inserted"; >>- goto _done; >>+ osm_log( p_log, OSM_LOG_ERROR, >>+ "pkey_mgr_update_port: ERR 0504: " >>+ "failed to find empty space for new pkey 0x%04x " >>+ "of node 0x%016" PRIx64 " port %u\n", >>+ cl_ntoh16(p_pending->pkey), >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>+ osm_physp_get_port_num( p_physp ) ); >> } >>+ else >>+ { >>+ block_index = last_free_block_index; >>+ pkey_index = last_free_pkey_index++; >> } >> } >>+ >>+ if (found) >>+ { >>+ if (osm_pkey_tbl_set_new_entry( >>+ p_pkey_tbl, block_index, pkey_index, p_pending->pkey) ) >>+ { >> osm_log( p_log, OSM_LOG_ERROR, >>- "pkey_mgr_process_physical_port: ERR 0501: " >>- "No empty pkey entry was found to insert 0x%04x for node " >>- "0x%016" PRIx64 " port %u\n", >>- cl_ntoh16( pkey ), >>+ "pkey_mgr_update_port: ERR 0505: " >>+ "failed to set PKey 0x%04x in block %u idx %u " >>+ "of node 0x%016" PRIx64 " port %u\n", >>+ p_pending->pkey, block_index, pkey_index, >> cl_ntoh64( osm_node_get_node_guid( p_node ) ), >> osm_physp_get_port_num( p_physp ) ); >> } >>- else if ( *p_orig_pkey != pkey ) >>- { >>+ } >>+ >>+ free( p_pending ); >>+ p_pending = >>+ (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); >>+ } >>+ >>+ /* now look for changes and store */ >> for ( block_index = 0; block_index < num_of_blocks; block_index++ ) >> { >>- /* we need real block (not just new_block) in order >>- * to resolve block/pkey indices */ >> block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); >>- i = p_orig_pkey - block->pkey_entry; >>- if (i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK) { >>- block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); >>- block->pkey_entry[i] = pkey; >>- stat = "updated"; >>- goto _done; >>- } >>- } >>- } >>+ new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); >> >>- _done: >>- if (stat) { >>- osm_log( p_log, OSM_LOG_VERBOSE, >>- 
"pkey_mgr_process_physical_port: " >>- "pkey 0x%04x was %s for node 0x%016" PRIx64 >>- " port %u\n", >>- cl_ntoh16( pkey ), stat, >>+ if (block && >>+ (!new_block || !memcmp( new_block, block, sizeof( *block ) )) ) >>+ continue; >>+ >>+ status = pkey_mgr_update_pkey_entry( >>+ p_req, p_physp , new_block, block_index ); >>+ if (status == IB_SUCCESS) >>+ ret_val = TRUE; >>+ else >>+ osm_log( p_log, OSM_LOG_ERROR, >>+ "pkey_mgr_update_port: ERR 0506: " >>+ "pkey_mgr_update_pkey_entry() failed to update " >>+ "pkey table block %d for node 0x%016" PRIx64 " port %u\n", >>+ block_index, >> cl_ntoh64( osm_node_get_node_guid( p_node ) ), >> osm_physp_get_port_num( p_physp ) ); >> } >>+ >>+ return ret_val; >> } >> >> /********************************************************************** >>@@ -217,21 +403,23 @@ pkey_mgr_update_peer_port( >> const osm_port_t * const p_port, >> boolean_t enforce ) >> { >>- osm_physp_t *p, *peer; >>+ osm_physp_t *p_physp, *peer; >> osm_node_t *p_node; >> ib_pkey_table_t *block, *peer_block; >>- const osm_pkey_tbl_t *p_pkey_tbl, *p_peer_pkey_tbl; >>+ const osm_pkey_tbl_t *p_pkey_tbl; >>+ osm_pkey_tbl_t *p_peer_pkey_tbl; >> osm_switch_t *p_sw; >> ib_switch_info_t *p_si; >> uint16_t block_index; >> uint16_t num_of_blocks; >>+ uint16_t peer_max_blocks; >> ib_api_status_t status = IB_SUCCESS; >> boolean_t ret_val = FALSE; >> >>- p = osm_port_get_default_phys_ptr( p_port ); >>- if ( !osm_physp_is_valid( p ) ) >>+ p_physp = osm_port_get_default_phys_ptr( p_port ); >>+ if ( !osm_physp_is_valid( p_physp ) ) >> return FALSE; >>- peer = osm_physp_get_remote( p ); >>+ peer = osm_physp_get_remote( p_physp ); >> if ( !peer || !osm_physp_is_valid( peer ) ) >> return FALSE; >> p_node = osm_physp_get_node_ptr( peer ); >>@@ -245,7 +433,7 @@ pkey_mgr_update_peer_port( >> if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) >> { >> osm_log( p_log, OSM_LOG_ERROR, >>- "pkey_mgr_update_peer_port: ERR 0502: " >>+ "pkey_mgr_update_peer_port: ERR 0507: 
" >> "pkey_mgr_enforce_partition() failed to update " >> "node 0x%016" PRIx64 " port %u\n", >> cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>@@ -255,24 +443,36 @@ pkey_mgr_update_peer_port( >> if (enforce == FALSE) >> return FALSE; >> >>- p_pkey_tbl = osm_physp_get_pkey_tbl( p ); >>- p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); >>+ p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); >>+ p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); >> num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); >>- if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) >>- num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); >>+ peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); >>+ if (peer_max_blocks < p_pkey_tbl->used_blocks) >>+ { >>+ osm_log( p_log, OSM_LOG_ERROR, >>+ "pkey_mgr_update_peer_port: ERR 0508: " >>+ "not enough entries (%u < %u) on switch 0x%016" PRIx64 >>+ " port %u\n", >>+ peer_max_blocks, num_of_blocks, >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>+ osm_physp_get_port_num( peer ) ); >>+ return FALSE; > > > Do you think it is the best way, just to skip update - partitions are > enforced already on the switch. May be better to truncate pkey tables > in order to meet peer's capabilities? You are right about that - Its a bug! I think the best approach here is to turn off the enforcement on the switch. If we truncate the table we actually impact connectivity of the fabric. I prefer a softer approach - an error in the log. 
> > >>+ } >> >>- for ( block_index = 0; block_index < num_of_blocks; block_index++ ) >>+ p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; >>+ for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) >> { >> block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); >> peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); >> if ( memcmp( peer_block, block, sizeof( *peer_block ) ) ) >> { >>+ osm_pkey_tbl_set(p_peer_pkey_tbl, block_index, block); > > > Why this (osm_pkey_tbl_set())? This will be called by receiver. Same as the above note about updating the map I wanted to avoid to wait for the GetResp. I think it is a mistake and we can actually remove it. > > >> status = pkey_mgr_update_pkey_entry( p_req, peer, block, block_index ); >> if ( status == IB_SUCCESS ) >> ret_val = TRUE; >> else >> osm_log( p_log, OSM_LOG_ERROR, >>- "pkey_mgr_update_peer_port: ERR 0503: " >>+ "pkey_mgr_update_peer_port: ERR 0509: " >> "pkey_mgr_update_pkey_entry() failed to update " >> "pkey table block %d for node 0x%016" PRIx64 >> " port %u\n", >>@@ -282,10 +482,10 @@ pkey_mgr_update_peer_port( >> } >> } >> >>- if ( ret_val == TRUE && >>- osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) >>+ if ( (ret_val == TRUE) && >>+ osm_log_is_active( p_log, OSM_LOG_DEBUG ) ) >> { >>- osm_log( p_log, OSM_LOG_VERBOSE, >>+ osm_log( p_log, OSM_LOG_DEBUG, >> "pkey_mgr_update_peer_port: " >> "pkey table was updated for node 0x%016" PRIx64 >> " port %u\n", >>@@ -298,82 +498,6 @@ pkey_mgr_update_peer_port( >> >> /********************************************************************** >> **********************************************************************/ >>-static boolean_t pkey_mgr_update_port( >>- osm_log_t *p_log, >>- osm_req_t *p_req, >>- const osm_port_t * const p_port ) >>-{ >>- osm_physp_t *p; >>- osm_node_t *p_node; >>- ib_pkey_table_t *block, *new_block; >>- const osm_pkey_tbl_t *p_pkey_tbl; >>- uint16_t block_index; >>- uint16_t num_of_blocks; >>- 
ib_api_status_t status; >>- boolean_t ret_val = FALSE; >>- >>- p = osm_port_get_default_phys_ptr( p_port ); >>- if ( !osm_physp_is_valid( p ) ) >>- return FALSE; >>- >>- p_pkey_tbl = osm_physp_get_pkey_tbl(p); >>- num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); >>- >>- for ( block_index = 0; block_index < num_of_blocks; block_index++ ) >>- { >>- block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); >>- new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); >>- >>- if (!new_block || !memcmp( new_block, block, sizeof( *block ) ) ) >>- continue; >>- >>- status = pkey_mgr_update_pkey_entry( p_req, p, new_block, block_index ); >>- if (status == IB_SUCCESS) >>- ret_val = TRUE; >>- else >>- osm_log( p_log, OSM_LOG_ERROR, >>- "pkey_mgr_update_port: ERR 0504: " >>- "pkey_mgr_update_pkey_entry() failed to update " >>- "pkey table block %d for node 0x%016" PRIx64 " port %u\n", >>- block_index, >>- cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>- osm_physp_get_port_num( p ) ); >>- } >>- >>- return ret_val; >>-} >>- >>-/********************************************************************** >>- **********************************************************************/ >>-static void >>-pkey_mgr_process_partition_table( >>- osm_log_t *p_log, >>- const osm_req_t *p_req, >>- const osm_prtn_t *p_prtn, >>- const boolean_t full ) >>-{ >>- const cl_map_t *p_tbl = full ? 
>>- &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; >>- cl_map_iterator_t i, i_next; >>- ib_net16_t pkey = p_prtn->pkey; >>- osm_physp_t *p_physp; >>- >>- if ( full ) >>- pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); >>- >>- i_next = cl_map_head( p_tbl ); >>- while ( i_next != cl_map_end( p_tbl ) ) >>- { >>- i = i_next; >>- i_next = cl_map_next( i ); >>- p_physp = cl_map_obj( i ); >>- if ( p_physp && osm_physp_is_valid( p_physp ) ) >>- pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); >>- } >>-} >>- >>-/********************************************************************** >>- **********************************************************************/ >> osm_signal_t >> osm_pkey_mgr_process( >> IN osm_opensm_t *p_osm ) >>@@ -383,8 +507,7 @@ osm_pkey_mgr_process( >> osm_prtn_t *p_prtn; >> osm_port_t *p_port; >> osm_signal_t signal = OSM_SIGNAL_DONE; >>- osm_physp_t *p_physp; >>- >>+ osm_node_t *p_node; >> CL_ASSERT( p_osm ); >> >> OSM_LOG_ENTER( &p_osm->log, osm_pkey_mgr_process ); >>@@ -394,32 +517,25 @@ osm_pkey_mgr_process( >> if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) >> { >> osm_log( &p_osm->log, OSM_LOG_ERROR, >>- "osm_pkey_mgr_process: ERR 0505: " >>+ "osm_pkey_mgr_process: ERR 0510: " >> "osm_prtn_make_partitions() failed\n" ); >> goto _err; >> } >> >>- p_tbl = &p_osm->subn.port_guid_tbl; >>- p_next = cl_qmap_head( p_tbl ); >>- while ( p_next != cl_qmap_end( p_tbl ) ) >>- { >>- p_port = ( osm_port_t * ) p_next; >>- p_next = cl_qmap_next( p_next ); >>- p_physp = osm_port_get_default_phys_ptr( p_port ); >>- if ( osm_physp_is_valid( p_physp ) ) >>- osm_pkey_tbl_sync_new_blocks( osm_physp_get_pkey_tbl( p_physp ) ); >>- } >>- >>+ /* populate the pending pkey entries by scanning all partitions */ >> p_tbl = &p_osm->subn.prtn_pkey_tbl; >> p_next = cl_qmap_head( p_tbl ); >> while ( p_next != cl_qmap_end( p_tbl ) ) >> { >> p_prtn = ( osm_prtn_t * ) p_next; >> p_next = cl_qmap_next( p_next ); >>- 
pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); >>- pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); >>+ pkey_mgr_process_partition_table( >>+ &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); >>+ pkey_mgr_process_partition_table( >>+ &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); >> } >> >>+ /* calculate new pkey tables and set */ >> p_tbl = &p_osm->subn.port_guid_tbl; >> p_next = cl_qmap_head( p_tbl ); >> while ( p_next != cl_qmap_end( p_tbl ) ) >>@@ -428,8 +544,10 @@ osm_pkey_mgr_process( >> p_next = cl_qmap_next( p_next ); >> if ( pkey_mgr_update_port( &p_osm->log, &p_osm->sm.req, p_port ) ) >> signal = OSM_SIGNAL_DONE_PENDING; >>- if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != IB_NODE_TYPE_SWITCH && >>- pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, >>+ p_node = osm_port_get_parent_node( p_port ); >>+ if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && >>+ pkey_mgr_update_peer_port( >>+ &p_osm->log, &p_osm->sm.req, >> &p_osm->subn, p_port, >> !p_osm->subn.opt.no_partition_enforcement ) ) >> signal = OSM_SIGNAL_DONE_PENDING; >> >> > > > Thanks, > Sasha From mst at mellanox.co.il Thu Jun 15 05:43:23 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Jun 2006 15:43:23 +0300 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <000001c6900e$d34b09d0$1d268686@amr.corp.intel.com> References: <000001c6900e$d34b09d0$1d268686@amr.corp.intel.com> Message-ID: <20060615124323.GB13121@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: [openib-general] RFC: detecting duplicate MAD requests > > >Well the ACK for the direction switch is special, isn't it? > >All I'm saying, let's pass it up to the application. > > I really don't think that this is the direction that we want to take the > interface. Yes, you are right. 
So, I thought about this some more, and I think I see how your approach can be adapted without breaking applications in subtle ways: When a transaction arrives, pass it to the user and don't keep any state. When the ACK for segment 0 arrives, we know there will be a response in this transaction, so we can queue it up already (but don't send yet, as we don't have the data). Start sending when the user responds. To solve the case where the user responds before the ACK for segment 0 arrives, a responder in DS RMPP will pass an IsDS flag when it sends the response. The MAD core will then not start sending until the ACK for segment 0 arrives. -- MST From halr at voltaire.com Thu Jun 15 05:54:48 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Jun 2006 08:54:48 -0400 Subject: [openib-general] [PATCH] osm: partition manager force policy In-Reply-To: <44915060.6090103@mellanox.co.il> References: <86odwxgqrs.fsf@mtl066.yok.mtl.com> <20060615110617.GA21560@sashak.voltaire.com> <44915060.6090103@mellanox.co.il> Message-ID: <1150376088.4506.40087.camel@hal.voltaire.com> On Thu, 2006-06-15 at 08:19, Eitan Zahavi wrote: > >>+ p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); > >>+ if (! p_pkey_tbl) > > > > ^^^^^^^^^^^^^ > > Is it possible? > Yes it is! I ran into it during testing. The port did not have any pkey table. PKey tables are optional and predicated on NodeInfo:PartitionCap for endports, which has a minimum of 1, and SwitchInfo:PartitionEnforcementCap for switch external (physical) ports, which can be 0. Is this routine used for an endport (CA, router, switch management port), switch external port, or both?
> >>@@ -217,21 +403,23 @@ pkey_mgr_update_peer_port( > >> const osm_port_t * const p_port, > >> boolean_t enforce ) > >> { > >>- osm_physp_t *p, *peer; > >>+ osm_physp_t *p_physp, *peer; > >> osm_node_t *p_node; > >> ib_pkey_table_t *block, *peer_block; > >>- const osm_pkey_tbl_t *p_pkey_tbl, *p_peer_pkey_tbl; > >>+ const osm_pkey_tbl_t *p_pkey_tbl; > >>+ osm_pkey_tbl_t *p_peer_pkey_tbl; > >> osm_switch_t *p_sw; > >> ib_switch_info_t *p_si; > >> uint16_t block_index; > >> uint16_t num_of_blocks; > >>+ uint16_t peer_max_blocks; > >> ib_api_status_t status = IB_SUCCESS; > >> boolean_t ret_val = FALSE; > >> > >>- p = osm_port_get_default_phys_ptr( p_port ); > >>- if ( !osm_physp_is_valid( p ) ) > >>+ p_physp = osm_port_get_default_phys_ptr( p_port ); > >>+ if ( !osm_physp_is_valid( p_physp ) ) > >> return FALSE; > >>- peer = osm_physp_get_remote( p ); > >>+ peer = osm_physp_get_remote( p_physp ); > >> if ( !peer || !osm_physp_is_valid( peer ) ) > >> return FALSE; > >> p_node = osm_physp_get_node_ptr( peer ); > >>@@ -245,7 +433,7 @@ pkey_mgr_update_peer_port( > >> if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) > >> { > >> osm_log( p_log, OSM_LOG_ERROR, > >>- "pkey_mgr_update_peer_port: ERR 0502: " > >>+ "pkey_mgr_update_peer_port: ERR 0507: " > >> "pkey_mgr_enforce_partition() failed to update " > >> "node 0x%016" PRIx64 " port %u\n", > >> cl_ntoh64( osm_node_get_node_guid( p_node ) ), > >>@@ -255,24 +443,36 @@ pkey_mgr_update_peer_port( > >> if (enforce == FALSE) > >> return FALSE; > >> > >>- p_pkey_tbl = osm_physp_get_pkey_tbl( p ); > >>- p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); > >>+ p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); > >>+ p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); > >> num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > >>- if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) > >>- num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); > >>+ peer_max_blocks = 
pkey_mgr_get_physp_max_blocks( p_subn, peer ); > >>+ if (peer_max_blocks < p_pkey_tbl->used_blocks) > >>+ { > >>+ osm_log( p_log, OSM_LOG_ERROR, > >>+ "pkey_mgr_update_peer_port: ERR 0508: " > >>+ "not enough entries (%u < %u) on switch 0x%016" PRIx64 > >>+ " port %u\n", > >>+ peer_max_blocks, num_of_blocks, > >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), > >>+ osm_physp_get_port_num( peer ) ); > >>+ return FALSE; > > > > > > Do you think it is the best way, just to skip the update - partitions are > > enforced already on the switch. Maybe it would be better to truncate pkey tables > > in order to meet the peer's capabilities? > You are right about that - It's a bug! > I think the best approach here is to turn off the enforcement on the switch. > If we truncate the table we actually impact connectivity of the fabric. > I prefer a softer approach - an error in the log. Makes sense to me. It is better to give the administrator as close to what he wants as possible, and not punish him for something like this, but warn him that his policy is weakened. In addition to an error in the log, this should also go to OSM_LOG_SYS so it might be noticed without checking the log. -- Hal From swise at opengridcomputing.com Thu Jun 15 06:41:03 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 Jun 2006 08:41:03 -0500 Subject: [openib-general] [PATCH v2 1/7] AMSO1100 Low Level Driver.
In-Reply-To: <5E701717F2B2ED4EA60F87C8AA57B7CC05D4E2D8@venom2> References: <5E701717F2B2ED4EA60F87C8AA57B7CC05D4E2D8@venom2> Message-ID: <1150378863.22603.12.camel@stevo-desktop> On Wed, 2006-06-14 at 20:35 -0500, Bob Sharp wrote: > > +void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) > > +{ > > + > > + case C2_RES_IND_EP:{ > > + > > + struct c2wr_ae_connection_request *req = > > + &wr->ae.ae_connection_request; > > + struct iw_cm_id *cm_id = > > + (struct iw_cm_id *)resource_user_context; > > + > > + pr_debug("C2_RES_IND_EP event_id=%d\n", event_id); > > + if (event_id != CCAE_CONNECTION_REQUEST) { > > + pr_debug("%s: Invalid event_id: %d\n", > > + __FUNCTION__, event_id); > > + break; > > + } > > + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; > > + cm_event.provider_data = (void*)(unsigned > long)req->cr_handle; > > + cm_event.local_addr.sin_addr.s_addr = req->laddr; > > + cm_event.remote_addr.sin_addr.s_addr = req->raddr; > > + cm_event.local_addr.sin_port = req->lport; > > + cm_event.remote_addr.sin_port = req->rport; > > + cm_event.private_data_len = > > + be32_to_cpu(req->private_data_length); > > + > > + if (cm_event.private_data_len) { > > > It looks to me as if pdata is leaking here since it is not tracked and > the upper layers do not free it. Also, if pdata is freed after the call > to cm_id->event_handler returns, it exposes an issue in user space where > the private data is garbage. I suspect the iwarp cm should be copying > this data before it returns. > Good catch. Yes, I think the IWCM should copy the private data in the upcall. If it does, then the amso driver doesn't need to kmalloc()/copy at all. It can pass a ptr to its MQ entry directly... Thanks, Steve. From mst at mellanox.co.il Thu Jun 15 06:56:09 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Jun 2006 16:56:09 +0300 Subject: [openib-general] on vacation through June 24 Message-ID: <20060615135609.GA2281@mellanox.co.il> I'll be on vacation through June 24. 
I won't be online most of the time. -- MST From swise at opengridcomputing.com Thu Jun 15 07:03:31 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 Jun 2006 09:03:31 -0500 Subject: [openib-general] [PATCH v2 1/7] AMSO1100 Low Level Driver. In-Reply-To: <1150378863.22603.12.camel@stevo-desktop> References: <5E701717F2B2ED4EA60F87C8AA57B7CC05D4E2D8@venom2> <1150378863.22603.12.camel@stevo-desktop> Message-ID: <1150380211.22603.17.camel@stevo-desktop> On Thu, 2006-06-15 at 08:41 -0500, Steve Wise wrote: > On Wed, 2006-06-14 at 20:35 -0500, Bob Sharp wrote: > > > > +void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) > > > +{ > > > + > > > > > > + case C2_RES_IND_EP:{ > > > + > > > + struct c2wr_ae_connection_request *req = > > > + &wr->ae.ae_connection_request; > > > + struct iw_cm_id *cm_id = > > > + (struct iw_cm_id *)resource_user_context; > > > + > > > + pr_debug("C2_RES_IND_EP event_id=%d\n", event_id); > > > + if (event_id != CCAE_CONNECTION_REQUEST) { > > > + pr_debug("%s: Invalid event_id: %d\n", > > > + __FUNCTION__, event_id); > > > + break; > > > + } > > > + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; > > > + cm_event.provider_data = (void*)(unsigned > > long)req->cr_handle; > > > + cm_event.local_addr.sin_addr.s_addr = req->laddr; > > > + cm_event.remote_addr.sin_addr.s_addr = req->raddr; > > > + cm_event.local_addr.sin_port = req->lport; > > > + cm_event.remote_addr.sin_port = req->rport; > > > + cm_event.private_data_len = > > > + be32_to_cpu(req->private_data_length); > > > + > > > + if (cm_event.private_data_len) { > > > > > > It looks to me as if pdata is leaking here since it is not tracked and > > the upper layers do not free it. Also, if pdata is freed after the call > > to cm_id->event_handler returns, it exposes an issue in user space where > > the private data is garbage. I suspect the iwarp cm should be copying > > this data before it returns. > > > > Good catch. 
> > Yes, I think the IWCM should copy the private data in the upcall. If it > does, then the amso driver doesn't need to kmalloc()/copy at all. It > can pass a ptr to its MQ entry directly... > Now that I've looked more into this, I'm not sure there's a simple way for the IWCM to copy the pdata on the upcall. Currently, the IWCM's event upcall, cm_event_handler(), simply queues the work for processing on a workqueue thread. So there's no per-event logic at all there. Lemme think on this more. Stay tuned. Either way, the amso driver has a memory leak... Steve. From jlentini at netapp.com Thu Jun 15 08:05:23 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 15 Jun 2006 11:05:23 -0400 (EDT) Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <449119AE.2010703@voltaire.com> References: <44903D5D.10102@ichips.intel.com> <449119AE.2010703@voltaire.com> Message-ID: On Thu, 15 Jun 2006, Or Gerlitz wrote: > Sean Hefty wrote: > > James Lentini wrote: > >> The IBTA spec (volume 1, version 1.2) describes a communication > >> established affiliated asynchronous event. > >> We've seen this event delivered to our NFS-RDMA server and aren't sure > >> what to do with it. > > > This event is delivered to the verbs consumer, since it occurs on > > the QP. It's expected that the consumer will call > > ib_cm_establish. Although, I would guess that you can probably > > ignore the event, under the assumption that the RTU will > > eventually be received by the local CM. > > Sean, > > The cma/verbs consumer can't just ignore the event since its qp > state is still RTR which means an attempt to tx replying the rx > would fail. Good point. > On the other hand it can't call ib_cm_establish since the CMA does > not expose an API for that, This is a problem. 
> nor the CM can register a cb to get this event and emulate an RTU > reception since the CMA is the one to create the QP and the CMA > consumer providing the qp_init_attr along with event handler... > > I suggest the following design: the CMA would replace the event > handler provided with the qp_init_attr struct with a callback of its > own and keep the original handler/context on a private structure. > > On the delivery of IB_EVENT_COMM_EST event, the CMA would call down > the CM to emulate RTU reception (ib_cm_establish) and then call up ib_cm_establish() doesn't emulate an RTU reception. It generates an IB_CM_USER_ESTABLISHED event (not an IB_CM_RTU_RECEIVED event). The CMA's cma_ib_handler() doesn't recognize a IB_CM_USER_ESTABLISHED event. The QP's state will not be moved to RTS. > the consumer original handler, typical CMA consumers would just > ignore this event, i think. > > The CM should be able to allow ib_cm_established to be called in the > context over which the event handler is called (or jump the > treatment to higher context). The CM must also ignore the actual RTU > if it arrives later/in parallel to when ib_cm_establish was called. > > By this design the verbs consumer is guaranteed to always get > RDMA_CM_EVENT_ESTABLISHED no matter if the RTU is just late or never > arrives The CMA's cma_ib_handler() needs to be modified for this to be true. > but it still can get a CQ RX completion(s) before getting the CMA > established event; in that case it can queue these completion > elements for the short time window before the established event > arrives and then process them. Consumers don't actually have to queue the completions, they have to defer posting sends (either in response to the recvs or otherwise) until the QP moves to RTS. Could the implementations queue up the requests for the consumers? Strictly speaking, IB requires an error to be generated (C10-29 in the IBTA spec. vol 1, page 456). 
Still, it would be nice if consumers didn't have to worry about this issue. > A design similar to that was implemented at the Voltaire gen1 stack > and it works in production with iSER target and VIBNAL (CFS Lustre > NAL for voltaire gen1 ib) server side. > > Does anyone know on what context (hard_irq, soft_irq, thread) are > the event handlers being called? > > Or. From mamidala at cse.ohio-state.edu Thu Jun 15 09:10:11 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Thu, 15 Jun 2006 12:10:11 -0400 (EDT) Subject: [openib-general] librdmacm error with rping In-Reply-To: Message-ID: Hi, I have installed the latest infiniband stack with 2.6.16.20 kernel. I tested the installation using ibv_rc_pingpong and it works fine. But, while trying to use rping, I get the following error: librdmacm: couldn't open rdma_cm ABI version. rdma_create_event_channel error 2 Any clues as to why this might be happening will be of great help, Thanks, Amith From swise at opengridcomputing.com Thu Jun 15 09:40:09 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 Jun 2006 11:40:09 -0500 Subject: [openib-general] librdmacm error with rping In-Reply-To: References: Message-ID: <1150389609.6371.1.camel@stevo-desktop> Sounds like maybe the librdmacm.so that's installed is down-level... Did you nuke the old one, rerun autogen.sh, configure, make, make install in the librdmacm directory? Stevo. On Thu, 2006-06-15 at 12:10 -0400, amith rajith mamidala wrote: > Hi, > > I have installed the latest infiniband stack with 2.6.16.20 kernel. > I tested the installation using ibv_rc_pingpong and it works fine. > But, while trying to use rping, I get the following error: > > librdmacm: couldn't open rdma_cm ABI version.
> rdma_create_event_channel error 2 > > Any clues as to why this might be happening will be of great help, > > > Thanks, > Amith > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Thu Jun 15 11:15:24 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 15 Jun 2006 21:15:24 +0300 Subject: [openib-general] [PATCH] osm: partition manager force policy In-Reply-To: <44915060.6090103@mellanox.co.il> References: <86odwxgqrs.fsf@mtl066.yok.mtl.com> <20060615110617.GA21560@sashak.voltaire.com> <44915060.6090103@mellanox.co.il> Message-ID: <20060615181524.GB24808@sashak.voltaire.com> Hi Eitan, On 15:19 Thu 15 Jun , Eitan Zahavi wrote: > >>+/* > >>+* PARAMETERS > >>+* p_physp > >>+* [in] Pointer to an osm_physp_t object. > >>+* > >>+* RETURN VALUES > >>+* The pointer to the P_Key table object. > >>+* > >>+* NOTES > >>+* > >>+* SEE ALSO > >>+* Port, Physical Port > >>+*********/ > >>+ > > > > > >Is not this simpler to remove 'const' from existing > >osm_physp_get_pkey_tbl() function instead of using new one? > There are plenty of const functions using this function internally > so I would have need to fix them too. You are right. Maybe separate patch for this? 
> >>@@ -118,14 +121,29 @@ void osm_pkey_tbl_sync_new_blocks( > >> p_block = cl_ptr_vector_get(&p_pkey_tbl->blocks, b); > >> if ( b < new_blocks ) > >> p_new_block = cl_ptr_vector_get(&p_pkey_tbl->new_blocks, b); > >>- else { > >>+ else > >>+ { > >> p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); > >> if (!p_new_block) > >> break; > >>+ cl_ptr_vector_set(&((osm_pkey_tbl_t > >>*)p_pkey_tbl)->new_blocks, + b, > >>p_new_block); > >>+ } > >>+ > >> memset(p_new_block, 0, sizeof(*p_new_block)); > >>- cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, > >>p_new_block); > >> } > >>- memcpy(p_new_block, p_block, sizeof(*p_new_block)); > >>+} > > > > > >You changed this function so it does not do any sync anymore. Should > >function name be changed too? > Yes correct I will change it. Is a better name: > osm_pkey_tbl_init_new_blocks ? Great name. > >>+ to show that on the "old" blocks > >>+*/ > >>+int > >>+osm_pkey_tbl_set_new_entry( > >>+ IN osm_pkey_tbl_t *p_pkey_tbl, > >>+ IN uint16_t block_idx, > >>+ IN uint8_t pkey_idx, > >>+ IN uint16_t pkey) > >>+{ > >>+ ib_pkey_table_t *p_old_block; > >>+ ib_pkey_table_t *p_new_block; > >>+ > >>+ if (osm_pkey_tbl_make_block_pair( > >>+ p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) > >>+ return 1; > >>+ > >>+ cl_map_insert( &p_pkey_tbl->keys, > >>+ ib_pkey_get_base(pkey), > >>+ > >>&(p_old_block->pkey_entry[pkey_idx])); > > > > > >Here you map potentially empty pkey entry. Why? "old block" will be > >remapped anyway on pkey receiving. > The reason I did this was that if the GetResp will fail I still want to > represent > the settings in the map.But actually it might be better not to do that so > next > time we run we will not find it without a GetResp. Agree. 
> >>+ IN uint16_t *p_pkey, > >>+ OUT uint32_t *p_block_idx, > >>+ OUT uint8_t *p_pkey_index) > >>+{ > >>+ uint32_t num_of_blocks; > >>+ uint32_t block_index; > >>+ ib_pkey_table_t *block; > >>+ > >>+ CL_ASSERT( p_pkey_tbl ); > >>+ CL_ASSERT( p_block_idx != NULL ); > >>+ CL_ASSERT( p_pkey_idx != NULL ); > > > > > >Why last two CL_ASSERTs? What should be problem with uninitialized > >pointers here? > > > These are the outputs of the function. It does not make sense to call the > functions with > null output pointers (calling by ref) . Anyway instead of putting the check > in the free build > I used an assert I see. Actually I've overlooked that addresses and not values are checked. Please ignore this comment. > >>+ > >>+ p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); > >>+ if (! p_pkey_tbl) > > > > ^^^^^^^^^^^^^ > >Is it possible? > Yes it is ! I run into it during testing. The port did not have any pkey > table. static inline osm_pkey_tbl_t * osm_physp_get_mod_pkey_tbl( IN osm_physp_t* const p_physp ) { ... return( &p_physp->pkeys ); }; This returns the address of physp's pkeys field. Right? Then if ( &p_physp->pkeys == NULL ) p_physp pointer should be equal to unsigned equivalent of -(offset of pkey field in physp struct). > >>+ "Fail to allocate new pending pkey > >>entry for node " > >>+ "0x%016" PRIx64 " port %u\n", > >>+ cl_ntoh64( osm_node_get_node_guid( > >>p_node ) ), > >>+ osm_physp_get_port_num( p_physp ) ); > >>+ return; > >>+ } > >>+ p_pending->pkey = pkey; > >>+ p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey > >>) ); > >>+ if ( !p_orig_pkey || > >>+ (ib_pkey_get_base(*p_orig_pkey) != ib_pkey_get_base(pkey) > >>)) > > > > > >There the cases of new pkey and updated pkey membership is mixed. Why? > I am not following your question. > The specific case I am trying to catch is the one that for some reason the > map points to > a pkey entry that was modified somehow and is different then the one you > would expect by > the map. 
Didn't understand it at first pass, now it is clearer. If pkey entry was modified somehow (how? bugs?), the assumption is that mapping still be valid? Then it is not new entry (or we will change pkey's index in the real table). > >>+ { > >>+ p_pending->is_new = TRUE; > >>+ cl_qlist_insert_tail(&p_pkey_tbl->pending, > >>(cl_list_item_t*)p_pending); > >>+ stat = "inserted"; > >>+ } > >>+ else > >>+ { > >>+ p_pending->is_new = FALSE; > >>+ if (osm_pkey_tbl_get_block_and_idx(p_pkey_tbl, p_orig_pkey, > >>+ > >>&p_pending->block, &p_pending->index)) > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > >AFAIK in this function there were CL_ASSERTs which check for uinitialized > >pointers. > True. So the asserts are not required in this case. Up to you. Actually this my comment may be ignored, as stated above I didn't read this correctly. > > > > > >>+ { > >>+ osm_log( p_log, OSM_LOG_ERROR, > >>+ "pkey_mgr_process_physical_port: > >>ERR 0503: " > >>+ "Fail to obtain P_Key 0x%04x > >>block and index for node " > >>+ "0x%016" PRIx64 " port %u\n", > >>+ cl_ntoh64( > >>osm_node_get_node_guid( p_node ) ), > >>+ osm_physp_get_port_num( > >>p_physp ) ); > >>+ return; > >>+ } > >>+ cl_qlist_insert_head(&p_pkey_tbl->pending, > >>(cl_list_item_t*)p_pending); > >>+ stat = "updated"; > > > > > >Is it will be updated? It is likely "already there" case. No? > > > >Also in this case you can already put the pkey in new_block instead of > >holding it in pending list. Then later you will only need to add new > >pkeys. This may simplify the flow and even save some mem. > True but in my mind it does not simplify - on the contrary it makes the > partition between > populating each port pending list and actually setting the pkey tables > mixed. I meant new_block filling, not actual setting. You will be able to remove whole if { } else { } flow, as well as is_new, block and index fields from 'pending' structure (actually only pkey value itself will matter) - is it not nice simplification? 
> I do not think the memory impact deserves this mix of staging > > > > > > >>+ max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, > >>p_physp ); > >>+ if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) > >> { > >>- block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > >>- for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) > >>+ osm_log( p_log, OSM_LOG_INFO, > >>+ "pkey_mgr_update_port: " > >>+ "Max number of blocks reduced from > >>%u to %u " + "for node 0x%016" PRIx64 " > >>port %u\n", > >>+ p_pkey_tbl->max_blocks, > >>max_num_of_blocks, > >>+ cl_ntoh64( osm_node_get_node_guid( > >>p_node ) ), > >>+ osm_physp_get_port_num( p_physp ) ); > >>+ } > >>+ p_pkey_tbl->max_blocks = max_num_of_blocks; > >>+ > >>+ osm_pkey_tbl_sync_new_blocks( p_pkey_tbl ); > >>+ cl_map_remove_all( &p_pkey_tbl->keys ); > > > > > >What is the reason to drop map here? AFAIK it will be reinitialized later > >anyway when pkey blocks will be received. > What if it is not received? Then we will have unreliable data there. Maybe I know why you wanted this - this is part of "use pkey tables before sending/receiving to/from ports" idea? 
> >>@@ -255,24 +443,36 @@ pkey_mgr_update_peer_port( > >> if (enforce == FALSE) > >> return FALSE; > >> > >>- p_pkey_tbl = osm_physp_get_pkey_tbl( p ); > >>- p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); > >>+ p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); > >>+ p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); > >> num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > >>- if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) > >>- num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); > >>+ peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); > >>+ if (peer_max_blocks < p_pkey_tbl->used_blocks) > >>+ { > >>+ osm_log( p_log, OSM_LOG_ERROR, > >>+ "pkey_mgr_update_peer_port: ERR > >>0508: " > >>+ "not enough entries (%u < %u) on > >>switch 0x%016" PRIx64 > >>+ " port %u\n", > >>+ peer_max_blocks, num_of_blocks, > >>+ cl_ntoh64( osm_node_get_node_guid( > >>p_node ) ), > >>+ osm_physp_get_port_num( peer ) ); > >>+ return FALSE; > > > > > >Do you think it is the best way, just to skip update - partitions are > >enforced already on the switch. May be better to truncate pkey tables > >in order to meet peer's capabilities? > You are right about that - Its a bug! > I think the best approach here is to turn off the enforcement on the switch. > If we truncate the table we actually impact connectivity of the fabric. > I prefer a softer approach - an error in the log. Yes this should be good way to handle this. 
> > > > > >>+ } > >> > >>- for ( block_index = 0; block_index < num_of_blocks; block_index++ ) > >>+ p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; > >>+ for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; > >>block_index++) > >> { > >> block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > >> peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); > >> if ( memcmp( peer_block, block, sizeof( *peer_block ) ) ) > >> { > >>+ osm_pkey_tbl_set(p_peer_pkey_tbl, block_index, > >>block); > > > > > >Why this (osm_pkey_tbl_set())? This will be called by receiver. > Same as the above note about updating the map > I wanted to avoid to wait for the GetResp. > I think it is a mistake and we can actually remove it. Agree. Sasha. From krause at cup.hp.com Thu Jun 15 10:55:06 2006 From: krause at cup.hp.com (Michael Krause) Date: Thu, 15 Jun 2006 10:55:06 -0700 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs In-Reply-To: <7.0.1.0.2.20060606131933.04267008@netapp.com> References: <7.0.1.0.2.20060606131933.04267008@netapp.com> Message-ID: <6.2.0.14.2.20060615104459.06451e28@esmail.cup.hp.com> As one of the authors of IB and iWARP, I can say that both Roland and Todd's responses are correct and the intent of the specifications. The number of outstanding RDMA Reads are bounded and that is communicated during session establishment. The ULP can choose to be aware of this requirement (certainly when we wrote iSER and DA we were well aware of the requirement and we documented as such in the ULP specs) and track from above so that it does not see a stall or it can stay ignorant and deal with the stall as a result. This is a ULP choice and has been intentionally done that way so that the hardware can be kept as simple as possible and as low cost as well while meeting the breadth of ULP needs that were used to develop these technologies. 
Tom, you raised this issue during iWARP's definition, and the debate was conducted at least several times. The outcome of those debates is reflected in iWARP and remains aligned with IB. So, unless you really want to have the IETF and IBTA go and modify their specs, I believe you'll have to deal with the issue just as other ULPs are doing today: be aware of the constraint and write the software accordingly. The open source community isn't really the right forum to change the iWARP and IB specifications at the end of the day. Build a case in the IETF and IBTA and let those bodies determine whether it is appropriate to modify their specs or not. And yes, it would be a modification of the specs, and therefore of the hardware implementations as well, to address any interoperability requirements that would result (the change proposed could fragment the hardware offerings, as there are many thousands of devices in the market that would not necessarily support this change).

Mike

At 12:07 PM 6/6/2006, Talpey, Thomas wrote:
>Todd, thanks for the set-up. I'm really glad we're having this discussion!
>
>Let me give an NFS/RDMA example to illustrate why this upper layer,
>at least, doesn't want the HCA doing its flow control, or resource
>management.
>
>NFS/RDMA is a credit-based protocol which allows many operations in
>progress at the server. Let's say the client is currently running with
>an RPC slot table of 100 requests (a typical value).
>
>Of these requests, some workload-specific percentage will be reads,
>writes, or metadata. All NFS operations consist of one send from
>client to server, some number of RDMA writes (for NFS reads) or
>RDMA reads (for NFS writes), then terminated with one send from
>server to client.
>
>The number of RDMA read or write operations per NFS op depends
>on the amount of data being read or written, and also the memory
>registration strategy in use on the client.
The highest-performing >such strategy is an all-physical one, which results in one RDMA-able >segment per physical page. NFS r/w requests are, by default, 32KB, >or 8 pages typical. So, typically 8 RDMA requests (read or write) are >the result. > >To illustrate, let's say the client is processing a multi-threaded >workload, with (say) 50% reads, 20% writes, and 30% metadata >such as lookup and getattr. A kernel build, for example. Therefore, >of our 100 active operations, 50 are reads for 32KB each, 20 are >writes of 32KB, and 30 are metadata (non-RDMA). > >To the server, this results in 100 requests, 100 replies, 400 RDMA >writes, and 160 RDMA Reads. Of course, these overlap heavily due >to the widely differing latency of each op and the highly distributed >arrival times. But, for the example this is a snapshot of current load. > >The latency of the metadata operations is quite low, because lookup >and getattr are acting on what is effectively cached data. The reads >and writes however, are much longer, because they reference the >filesystem. When disk queues are deep, they can take many ms. > >Imagine what happens if the client's IRD is 4 and the server ignores >its local ORD. As soon as a write begins execution, the server posts >8 RDMA Reads to fetch the client's write data. The first 4 RDMA Reads >are sent, the fifth stalls, and stalls the send queue! Even when three >RDMA Reads complete, the queue remains stalled, it doesn't unblock >until the fourth is done and all the RDMA Reads have been initiated. > >But, what just happened to all the other server send traffic? All those >metadata replies, and other reads which completed? They're stuck, >waiting for that one write request. In my example, these number 99 NFS >ops, i.e. 654 WRs! All for one NFS write! The client operation stream >effectively became single threaded. What good is the "rapid initiation >of RDMA Reads" you describe in the face of this? 
>
>Yes, there are many arcane and resource-intensive ways around it.
>But the simplest by far is to count the RDMA Reads outstanding, and
>for the *upper layer* to honor ORD, not the HCA. Then, the send queue
>never blocks, and the operation stream never loses parallelism. This
>is what our NFS server does.
>
>As to the depth of IRD, this is a different calculation; it's a
>delay-bandwidth product of the RDMA Read stream. 4 is good for local,
>low latency connections. But over a complicated switch infrastructure,
>or heaven forbid a dark fiber long link, I guarantee it will cause a
>bottleneck. This isn't an issue except for operations that care, but it
>is certainly detectable. I would like to see if a pure RDMA Read stream
>can fully utilize a typical IB fabric, and how much headroom an IRD of 4
>provides. Not much, I predict.
>
>Closing the connection if IRD is "insufficient to meet goals" isn't a good
>answer, IMO. How does that benefit interoperability?
>
>Thanks for the opportunity to spout off again. Comments welcome!
>
>Tom.
>
>At 12:43 PM 6/6/2006, Rimmer, Todd wrote:
> >
> >> Talpey, Thomas
> >> Sent: Tuesday, June 06, 2006 10:49 AM
> >>
> >> At 10:40 AM 6/6/2006, Roland Dreier wrote:
> >> > Thomas> This is the difference between "may" and "must". The value
> >> > Thomas> is provided, but I don't see anything in the spec that
> >> > Thomas> makes a requirement on its enforcement. Table 107 says the
> >> > Thomas> consumer can query it, that's about as close as it
> >> > Thomas> comes. There's some discussion about CM exchange too.
> >> >
> >> >This seems like a very strained interpretation of the spec. For
> >>
> >> I don't see how strained has anything to do with it. It's not saying
> >> anything either way. So, a legal implementation can make either choice.
> >> We're talking about the spec!
> >>
> >> But, it really doesn't matter.
The point is, an upper layer should be paying
> >> attention to the number of RDMA Reads it posts, or else suffer either
> >> the queue-stalling or connection-failing consequences. Bad stuff either
> >> way.
> >>
> >> Tom.
> >
> >Somewhere beneath this discussion is a bug in the application or IB
> >stack. I'm not sure which "may" in the spec you are referring to, but
> >the "may"s I have found are all for cases where the responder might
> >support only 1 outstanding request. In all cases the negotiation
> >protocol must be followed and the requestor is not allowed to exceed the
> >negotiated limit.
> >
> >The mechanism should be:
> >the client queries its local HCA and determines responder resources (e.g.
> >the number of concurrent outstanding RDMA reads on the wire from the
> >remote end, where this end will respond with the read data) and initiator
> >depth (e.g. the number of concurrent outstanding RDMA reads which this
> >end can initiate as the requestor).
> >
> >The client puts the above information in the CM REQ.
> >
> >The server similarly gets its information from its local CA and negotiates
> >the values down to the MIN of each side (REP.InitiatorDepth =
> >MIN(REQ.ResponderResources, server's local CA's initiator depth);
> >REP.ResponderResources = MIN(REQ.InitiatorDepth, server's local CA's
> >responder resources)). If the server does not support RDMA Reads, it can
> >REJ.
> >
> >If the client decides the negotiated values are insufficient to meet its
> >goals, it can disconnect.
> >
> >Each side sets its QP parameters via modify QP appropriately.
> >Note they too will be mirror images of each other:
> >client:
> >QP.Max RDMA Reads as Initiator = REP.ResponderResources
> >QP.Max RDMA Reads as responder = REP.InitiatorDepth
> >
> >server:
> >QP.Max RDMA Reads as responder = REP.ResponderResources
> >QP.Max RDMA Reads as initiator = REP.InitiatorDepth
> >
> >We have done a lot of high-stress RDMA Read traffic with Mellanox HCAs
> >and, provided the above negotiation is followed, we have seen no issues.
> >Note however that by default a Mellanox HCA typically reports a large
> >InitiatorDepth (128) and a modest ResponderResources (4-8). Hence when
> >I hear that Responder Resources must be grown to 128 for some
> >application to reliably work, it implies the negotiation I outlined
> >above is not being followed.
> >
> >Note that the ordering rules in table 76 of IBTA 1.2 show how reads and
> >writes on a send queue are ordered. There are many cases where an op can
> >pass an outstanding RDMA read, hence it is not always bad to queue extra
> >RDMA reads. If needed, the Fence can be sent to force order.
> >
> >For many apps, it's going to be better to get the items onto the queue and
> >let the QP handle the outstanding-reads cases rather than have the app
> >add a level of queuing for this purpose. Letting the HCA do the queuing
> >will allow for a more rapid initiation of subsequent reads.
> >
> >Todd Rimmer
>
>_______________________________________________
>openib-general mailing list
>openib-general at openib.org
>http://openib.org/mailman/listinfo/openib-general
>
>To unsubscribe, please visit
>http://openib.org/mailman/listinfo/openib-general
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From ralphc at pathscale.com Thu Jun 15 11:31:20 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Thu, 15 Jun 2006 11:31:20 -0700 Subject: [openib-general] [PATCH] add HW specific data to libibverbs modify QP, SRQ response Message-ID: <1150396280.32252.46.camel@brick.pathscale.com> I am working on a ipathverbs.so version of ibv_poll_cq(), ibv_post_recv(), and ibv_post_srq_recv() which mmaps the queue into user space. I found that I needed to modify the core libibverbs and kernel uverbs code in order to return the information I need from ib_ipath to the ipathverbs.so library. This patch adds those generic code changes. A subsequent patch will add the InfiniPath specific changes. Note that I didn't include matching changes to ehca since I don't have HW to test with but I can try to make a patch that allows it compile if requested to. Signed-off-by: Ralph Campbell Index: src/userspace/libibverbs/src/cmd.c =================================================================== --- src/userspace/libibverbs/src/cmd.c (revision 8021) +++ src/userspace/libibverbs/src/cmd.c (working copy) @@ -384,6 +384,23 @@ return 0; } +int ibv_cmd_resize_cq_resp(struct ibv_cq *cq, int cqe, + struct ibv_resize_cq *cmd, size_t cmd_size, + struct ibv_resize_cq_resp *resp, size_t resp_size) +{ + + IBV_INIT_CMD_RESP(cmd, cmd_size, RESIZE_CQ, resp, resp_size); + cmd->cq_handle = cq->handle; + cmd->cqe = cqe; + + if (write(cq->context->cmd_fd, cmd, cmd_size) != cmd_size) + return errno; + + cq->cqe = resp->cqe; + + return 0; +} + static int ibv_cmd_destroy_cq_v1(struct ibv_cq *cq) { struct ibv_destroy_cq_v1 cmd; Index: src/userspace/libibverbs/src/libibverbs.map =================================================================== --- src/userspace/libibverbs/src/libibverbs.map (revision 8021) +++ src/userspace/libibverbs/src/libibverbs.map (working copy) @@ -48,6 +48,7 @@ ibv_cmd_poll_cq; ibv_cmd_req_notify_cq; ibv_cmd_resize_cq; + ibv_cmd_resize_cq_resp; ibv_cmd_destroy_cq; 
ibv_cmd_create_srq; ibv_cmd_modify_srq; Index: src/userspace/libibverbs/include/infiniband/driver.h =================================================================== --- src/userspace/libibverbs/include/infiniband/driver.h (revision 8021) +++ src/userspace/libibverbs/include/infiniband/driver.h (working copy) @@ -96,6 +96,9 @@ int ibv_cmd_req_notify_cq(struct ibv_cq *cq, int solicited_only); int ibv_cmd_resize_cq(struct ibv_cq *cq, int cqe, struct ibv_resize_cq *cmd, size_t cmd_size); +int ibv_cmd_resize_cq_resp(struct ibv_cq *cq, int cqe, + struct ibv_resize_cq *cmd, size_t cmd_size, + struct ibv_resize_cq_resp *resp, size_t resp_size); int ibv_cmd_destroy_cq(struct ibv_cq *cq); int ibv_cmd_create_srq(struct ibv_pd *pd, Index: src/userspace/libibverbs/include/infiniband/kern-abi.h =================================================================== --- src/userspace/libibverbs/include/infiniband/kern-abi.h (revision 8021) +++ src/userspace/libibverbs/include/infiniband/kern-abi.h (working copy) @@ -355,6 +355,8 @@ struct ibv_resize_cq_resp { __u32 cqe; + __u32 reserved; + __u64 driver_data[0]; }; struct ibv_destroy_cq { Index: src/linux-kernel/infiniband/core/uverbs_cmd.c =================================================================== --- src/linux-kernel/infiniband/core/uverbs_cmd.c (revision 8021) +++ src/linux-kernel/infiniband/core/uverbs_cmd.c (working copy) @@ -1258,6 +1258,7 @@ int out_len) { struct ib_uverbs_modify_qp cmd; + struct ib_udata udata; struct ib_qp *qp; struct ib_qp_attr *attr; int ret; @@ -1265,6 +1266,9 @@ if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; + INIT_UDATA(&udata, buf + sizeof cmd, NULL, in_len - sizeof cmd, + out_len); + attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) return -ENOMEM; @@ -1321,7 +1325,7 @@ attr->alt_ah_attr.ah_flags = cmd.alt_dest.is_global ? 
IB_AH_GRH : 0; attr->alt_ah_attr.port_num = cmd.alt_dest.port_num; - ret = ib_modify_qp(qp, attr, cmd.attr_mask); + ret = qp->device->modify_qp(qp, attr, cmd.attr_mask, &udata); put_qp_read(qp); @@ -2031,6 +2035,7 @@ int out_len) { struct ib_uverbs_modify_srq cmd; + struct ib_udata udata; struct ib_srq *srq; struct ib_srq_attr attr; int ret; @@ -2038,6 +2043,9 @@ if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; + INIT_UDATA(&udata, buf + sizeof cmd, NULL, in_len - sizeof cmd, + out_len); + srq = idr_read_srq(cmd.srq_handle, file->ucontext); if (!srq) return -EINVAL; @@ -2045,7 +2053,7 @@ attr.max_wr = cmd.max_wr; attr.srq_limit = cmd.srq_limit; - ret = ib_modify_srq(srq, &attr, cmd.attr_mask); + ret = srq->device->modify_srq(srq, &attr, cmd.attr_mask, &udata); put_srq_read(srq); Index: src/linux-kernel/infiniband/core/verbs.c =================================================================== --- src/linux-kernel/infiniband/core/verbs.c (revision 8021) +++ src/linux-kernel/infiniband/core/verbs.c (working copy) @@ -231,7 +231,7 @@ struct ib_srq_attr *srq_attr, enum ib_srq_attr_mask srq_attr_mask) { - return srq->device->modify_srq(srq, srq_attr, srq_attr_mask); + return srq->device->modify_srq(srq, srq_attr, srq_attr_mask, NULL); } EXPORT_SYMBOL(ib_modify_srq); @@ -547,7 +547,7 @@ struct ib_qp_attr *qp_attr, int qp_attr_mask) { - return qp->device->modify_qp(qp, qp_attr, qp_attr_mask); + return qp->device->modify_qp(qp, qp_attr, qp_attr_mask, NULL); } EXPORT_SYMBOL(ib_modify_qp); Index: src/linux-kernel/infiniband/include/rdma/ib_user_verbs.h =================================================================== --- src/linux-kernel/infiniband/include/rdma/ib_user_verbs.h (revision 8021) +++ src/linux-kernel/infiniband/include/rdma/ib_user_verbs.h (working copy) @@ -275,6 +275,8 @@ struct ib_uverbs_resize_cq_resp { __u32 cqe; + __u32 reserved; + __u64 driver_data[0]; }; struct ib_uverbs_poll_cq { Index: src/linux-kernel/infiniband/include/rdma/ib_verbs.h 
=================================================================== --- src/linux-kernel/infiniband/include/rdma/ib_verbs.h (revision 8021) +++ src/linux-kernel/infiniband/include/rdma/ib_verbs.h (working copy) @@ -911,7 +911,8 @@ struct ib_udata *udata); int (*modify_srq)(struct ib_srq *srq, struct ib_srq_attr *srq_attr, - enum ib_srq_attr_mask srq_attr_mask); + enum ib_srq_attr_mask srq_attr_mask, + struct ib_udata *udata); int (*query_srq)(struct ib_srq *srq, struct ib_srq_attr *srq_attr); int (*destroy_srq)(struct ib_srq *srq); @@ -923,7 +924,8 @@ struct ib_udata *udata); int (*modify_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, - int qp_attr_mask); + int qp_attr_mask, + struct ib_udata *udata); int (*query_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask, Index: src/linux-kernel/infiniband/hw/mthca/mthca_dev.h =================================================================== --- src/linux-kernel/infiniband/hw/mthca/mthca_dev.h (revision 8021) +++ src/linux-kernel/infiniband/hw/mthca/mthca_dev.h (working copy) @@ -506,7 +506,7 @@ struct ib_srq_attr *attr, struct mthca_srq *srq); void mthca_free_srq(struct mthca_dev *dev, struct mthca_srq *srq); int mthca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask); + enum ib_srq_attr_mask attr_mask, struct ib_udata *udata); int mthca_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr); int mthca_max_srq_sge(struct mthca_dev *dev); void mthca_srq_event(struct mthca_dev *dev, u32 srqn, @@ -521,7 +521,8 @@ enum ib_event_type event_type); int mthca_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr, int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr); -int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, + struct ib_udata *udata); int mthca_tavor_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr); int 
mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, Index: src/linux-kernel/infiniband/hw/mthca/mthca_qp.c =================================================================== --- src/linux-kernel/infiniband/hw/mthca/mthca_qp.c (revision 8021) +++ src/linux-kernel/infiniband/hw/mthca/mthca_qp.c (working copy) @@ -522,7 +522,8 @@ return 0; } -int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, + struct ib_udata *udata) { struct mthca_dev *dev = to_mdev(ibqp->device); struct mthca_qp *qp = to_mqp(ibqp); Index: src/linux-kernel/infiniband/hw/mthca/mthca_srq.c =================================================================== --- src/linux-kernel/infiniband/hw/mthca/mthca_srq.c (revision 8021) +++ src/linux-kernel/infiniband/hw/mthca/mthca_srq.c (working copy) @@ -357,7 +357,7 @@ } int mthca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask) + enum ib_srq_attr_mask attr_mask, struct ib_udata *udata) { struct mthca_dev *dev = to_mdev(ibsrq->device); struct mthca_srq *srq = to_msrq(ibsrq); Index: src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h (working copy) @@ -577,7 +577,7 @@ int ipath_destroy_qp(struct ib_qp *ibqp); int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, - int attr_mask); + int attr_mask, struct ib_udata *udata); int ipath_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, struct ib_qp_init_attr *init_attr); @@ -636,7 +636,8 @@ struct ib_udata *udata); int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask); + enum ib_srq_attr_mask attr_mask, + struct ib_udata *udata); int ipath_query_srq(struct ib_srq *ibsrq, struct 
ib_srq_attr *attr); Index: src/linux-kernel/infiniband/hw/ipath/ipath_qp.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_qp.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_qp.c (working copy) @@ -425,11 +425,12 @@ * @ibqp: the queue pair who's attributes we're modifying * @attr: the new attributes * @attr_mask: the mask of attributes to modify + * @udata: not used by the InfiniPath verbs driver * * Returns 0 on success, otherwise returns an errno. */ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, - int attr_mask) + int attr_mask, struct ib_udata *udata) { struct ipath_ibdev *dev = to_idev(ibqp->device); struct ipath_qp *qp = to_iqp(ibqp); Index: src/linux-kernel/infiniband/hw/ipath/ipath_srq.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_srq.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_srq.c (working copy) @@ -187,9 +187,10 @@ * @ibsrq: the SRQ to modify * @attr: the new attributes of the SRQ * @attr_mask: indicates which attributes to modify + * @udata: not used by the InfiniPath verbs driver */ int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask) + enum ib_srq_attr_mask attr_mask, struct ib_udata *udata) { struct ipath_srq *srq = to_isrq(ibsrq); unsigned long flags; -- Ralph Campbell From swise at opengridcomputing.com Thu Jun 15 13:11:57 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 Jun 2006 15:11:57 -0500 Subject: [openib-general] [PATCH] backlog ignored when listening on all devs Message-ID: <1150402317.6612.8.camel@stevo-desktop> Sean, I think this is a bug, eh? If you listen on 0.0.0.0, then the backlog isn't passed down to the devices because its not stored in the id_priv struct before calling cma_listen_on_all(). See cma_list_on_dev() which uses id_priv->backlog... 
Signed-off-by: Steve Wise

----------

Index: cma.c
===================================================================
--- cma.c	(revision 7626)
+++ cma.c	(working copy)
@@ -1086,6 +1086,7 @@
 	if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_LISTEN))
 		return -EINVAL;
 
+	id_priv->backlog = backlog;
 	if (id->device) {
 		switch (rdma_node_get_transport(id->device->node_type)) {
 		case RDMA_TRANSPORT_IB:
@@ -1100,9 +1101,9 @@
 	} else
 		cma_listen_on_all(id_priv);
 
-	id_priv->backlog = backlog;
 	return 0;
 err:
+	id_priv->backlog = 0;
 	cma_comp_exch(id_priv, CMA_LISTEN, CMA_ADDR_BOUND);
 	return ret;
 }

From rdreier at cisco.com Thu Jun 15 13:55:11 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 Jun 2006 13:55:11 -0700 Subject: [openib-general] [PATCH] add HW specific data to libibverbs modify QP, SRQ response In-Reply-To: <1150396280.32252.46.camel@brick.pathscale.com> (Ralph Campbell's message of "Thu, 15 Jun 2006 11:31:20 -0700") References: <1150396280.32252.46.camel@brick.pathscale.com> Message-ID: 

 > I am working on a ipathverbs.so version of ibv_poll_cq(),
 > ibv_post_recv(), and ibv_post_srq_recv() which mmaps the
 > queue into user space. I found that I needed to modify the
 > core libibverbs and kernel uverbs code in order to return
 > the information I need from ib_ipath to the ipathverbs.so
 > library.

That's weird, since libmthca operates on userspace queues and doesn't need this. I'd like to see the ipath changes before we merge this, because there may be a better way to do things.

 > Note that I didn't include matching changes to ehca since
 > I don't have HW to test with but I can try to make a patch
 > that allows it compile if requested to.

Yes, please include that. The changes should be the same obvious ones as for mthca. Also, this breaks the ABI, and I don't see any compatibility code for libibverbs.

 > ibv_cmd_resize_cq;
 > + ibv_cmd_resize_cq_resp;

I guess we need to include a new function if we want to support this in libibverbs 1.0.
But for libibverbs 1.1 I would prefer to just change the prototype for ibv_resize_cq.

 - R.

From ardavis at ichips.intel.com Thu Jun 15 14:31:01 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 15 Jun 2006 14:31:01 -0700 Subject: [openib-general] Processes not exiting on SVN7946 In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0007F74705@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0007F74705@orsmsx408> Message-ID: <4491D195.8030106@ichips.intel.com>

Woodruff, Robert J wrote:
> It appears that processes are not exiting cleanly on SVN7946 trunk
> backported to 2.6.9-34 EL.
>
> They seem to be stuck in a state of "DL" and I cannot even attach to them
> with gdb or kill them with a kill -9.
>
> [root at iclust-1 core]# ps -uax | grep IMB
> woody  4087  0.0  0.0 58500 3172 pts/3 T  14:45 0:00 gdb ./IMB-MPI1 -p 4067
> woody  4067  2.3  0.0 33108 2708 ?     DL 14:44 0:12 ./IMB-MPI1
> woody  4109  3.1  0.0 40148 2572 ?     DL 14:47 0:12 ./IMB-MPI1
> root   4156  0.0  0.0 51080  732 pts/3 S+ 14:53 0:00 grep IMB
>
> The last code I pulled, SVN7843, did not have this problem.
>
> Any ideas on what might be causing this?

I see the same thing running the uDAPL test (dapl/test/dtest). I am running a 2.6.16 kernel and svn8805, and it appears to be deadlocked (uninterruptible sleep) in the ibv_destroy_cq() call. This all worked fine on svn7843, my last update on these systems.
-arlin > woody > >------------------------------------------------------------------------ > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mamidala at cse.ohio-state.edu Thu Jun 15 14:24:38 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Thu, 15 Jun 2006 17:24:38 -0400 (EDT) Subject: [openib-general] [PATCH] librdmacm/examples/rping.c In-Reply-To: <1150219552.17394.23.camel@stevo-desktop> Message-ID: Hi, With the latest rping code (Revision: 8055) I am still able to see this race condition. server side: [@k62-oib examples]$ ./rping -s -vV -C10 -S26 -a 0.0.0.0 -p 9997 server ping data: rdma-ping-0: ABCDEFGHIJKL server ping data: rdma-ping-1: BCDEFGHIJKLM server ping data: rdma-ping-2: CDEFGHIJKLMN server ping data: rdma-ping-3: DEFGHIJKLMNO server ping data: rdma-ping-4: EFGHIJKLMNOP server ping data: rdma-ping-5: FGHIJKLMNOPQ server ping data: rdma-ping-6: GHIJKLMNOPQR server ping data: rdma-ping-7: HIJKLMNOPQRS server ping data: rdma-ping-8: IJKLMNOPQRST server ping data: rdma-ping-9: JKLMNOPQRSTU server DISCONNECT EVENT... wait for RDMA_READ_ADV state 9 cq completion failed status 5 Client side: [@k63-oib examples]$ ./rping -c -vV -C10 -S26 -a 192.168.111.66 -p 9997 ping data: rdma-ping-0: ABCDEFGHIJKL ping data: rdma-ping-1: BCDEFGHIJKLM ping data: rdma-ping-2: CDEFGHIJKLMN ping data: rdma-ping-3: DEFGHIJKLMNO ping data: rdma-ping-4: EFGHIJKLMNOP ping data: rdma-ping-5: FGHIJKLMNOPQ ping data: rdma-ping-6: GHIJKLMNOPQR ping data: rdma-ping-7: HIJKLMNOPQRS ping data: rdma-ping-8: IJKLMNOPQRST ping data: rdma-ping-9: JKLMNOPQRSTU cq completion failed status 5 client DISCONNECT EVENT... Thanks, Amith On Tue, 13 Jun 2006, Steve Wise wrote: > Thanks, applied. > > iwarp branch: r7964 > trunk: r7966 > > > On Tue, 2006-06-13 at 11:24 -0500, Boyd R. 
Faulkner wrote: > > This patch resolves a race condition between the receipt of > > a connection established event and a receive completion from > > the client. The server no longer goes to connected state but > > merely waits for the READ_ADV state to begin its looping. This > > keeps the server from going back to CONNECTED from the later > > states if the connection established event comes in after the > > receive completion (i.e. the loop starts). > > > > Signed-off-by: Boyd Faulkner > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Thu Jun 15 14:33:41 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 Jun 2006 14:33:41 -0700 Subject: [openib-general] Processes not exiting on SVN7946 In-Reply-To: <4491D195.8030106@ichips.intel.com> (Arlin Davis's message of "Thu, 15 Jun 2006 14:31:01 -0700") References: <1AC79F16F5C5284499BB9591B33D6F0007F74705@orsmsx408> <4491D195.8030106@ichips.intel.com> Message-ID: Arlin> I see the same thing running the uDAPL test Arlin> (dapl/test/dtest). I am running a 2.6.16 kernel and svn8805 Arlin> and it appears to be deadlocked (uninterruptible sleep) in Arlin> the ibv_destroy_cq() call. This all worked fine on Arlin> svn7843; my last update on these systems. Hmm, any further clue where in ibv_destroy_cq() it's stuck? Is it doing down_write() or something? This is probably fallout from my kill-ib_uverbs_idr_mutex change... - R. 
From rdreier at cisco.com Thu Jun 15 14:35:03 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 Jun 2006 14:35:03 -0700 Subject: [openib-general] Processes not exiting on SVN7946 In-Reply-To: (Roland Dreier's message of "Thu, 15 Jun 2006 14:33:41 -0700") References: <1AC79F16F5C5284499BB9591B33D6F0007F74705@orsmsx408> <4491D195.8030106@ichips.intel.com> Message-ID: Roland> Hmm, any further clue where in ibv_destroy_cq() it's Roland> stuck? Is it doing down_write() or something? Can you send me full sysrq-t output when it gets stuck? Thanks... From ralphc at pathscale.com Thu Jun 15 14:41:44 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Thu, 15 Jun 2006 14:41:44 -0700 Subject: [openib-general] [PATCH] add HW specific data to libibverbs modify QP, SRQ response In-Reply-To: References: <1150396280.32252.46.camel@brick.pathscale.com> Message-ID: <1150407704.32252.65.camel@brick.pathscale.com> On Thu, 2006-06-15 at 13:55 -0700, Roland Dreier wrote: > > I am working on a ipathverbs.so version of ibv_poll_cq(), > > ibv_post_recv(), and ibv_post_srq_recv() which mmaps the > > queue into user space. I found that I needed to modify the > > core libibverbs and kernel uverbs code in order to return > > the information I need from ib_ipath to the ipathverbs.so > > library. > > That's weird, since libmthca operates on userspace queues and doesn't > need this. I'd like to see the ipath changes before we merge this, > because there may be a better way to do things. libmthca uses a single shared page which is created at driver open time. I'm mmaping vmalloc memory created at ibv_create_cq(), qp, srq time so I need a way to return the offset to ipathverbs.so to then pass to mmap(). > > Note that I didn't include matching changes to ehca since > > I don't have HW to test with but I can try to make a patch > > that allows it compile if requested to. > > Yes, please include that. The changes should be the same obvious ones > as for mthca. OK. 
> Also, this breaks the ABI, and I don't see any compatibility code for > libibverbs. The new kernel drivers work with the old libibverbs and vice versa since only the cqe entry in struct ibv_resize_cq_resp is used. The reserved entry is only needed to avoid using "packed" structs if struct ibv_resize_cq_resp is included in another struct. > > ibv_cmd_resize_cq; > > + ibv_cmd_resize_cq_resp; > > I guess we need to include a new function if we want to support this > in libibverbs 1.0. But for libibverbs 1.1 I would prefer to just > change the prototype for ibv_resize_cq. I thought about this trade off too. Either way is OK with me. I will post the current HW specific changes soon. I have code for everything except resizing the QP's receive queue. > - R. -- Ralph Campbell From caitlinb at broadcom.com Thu Jun 15 14:45:57 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 15 Jun 2006 14:45:57 -0700 Subject: [openib-general] [PATCH v2 1/7] AMSO1100 Low Level Driver. Message-ID: <54AD0F12E08D1541B826BE97C98F99F15767E3@NT-SJCA-0751.brcm.ad.broadcom.com> netdev-owner at vger.kernel.org wrote: > On Thu, 2006-06-15 at 08:41 -0500, Steve Wise wrote: >> On Wed, 2006-06-14 at 20:35 -0500, Bob Sharp wrote: >> >>>> +void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) { >>>> + >> >> >> >>>> + case C2_RES_IND_EP:{ >>>> + >>>> + struct c2wr_ae_connection_request *req = >>>> + &wr->ae.ae_connection_request; >>>> + struct iw_cm_id *cm_id = >>>> + (struct iw_cm_id *)resource_user_context; >>>> + >>>> + pr_debug("C2_RES_IND_EP event_id=%d\n", event_id); >>>> + if (event_id != CCAE_CONNECTION_REQUEST) { >>>> + pr_debug("%s: Invalid event_id: %d\n", >>>> + __FUNCTION__, event_id); >>>> + break; >>>> + } >>>> + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; >>>> + cm_event.provider_data = (void*)(unsigned long)req->cr_handle; >>>> + cm_event.local_addr.sin_addr.s_addr = req->laddr; >>>> + cm_event.remote_addr.sin_addr.s_addr = req->raddr; >>>> + cm_event.local_addr.sin_port 
= req->lport; >>>> + cm_event.remote_addr.sin_port = req->rport; >>>> + cm_event.private_data_len = >>>> + be32_to_cpu(req->private_data_length); >>>> + >>>> + if (cm_event.private_data_len) { >>> >>> >>> It looks to me as if pdata is leaking here since it is not tracked >>> and the upper layers do not free it. Also, if pdata is freed after >>> the call to cm_id->event_handler returns, it exposes an issue in >>> user space where the private data is garbage. I suspect the iwarp >>> cm should be copying this data before it returns. >>> >> >> Good catch. >> >> Yes, I think the IWCM should copy the private data in the upcall. If >> it does, then the amso driver doesn't need to kmalloc()/copy at all. >> It can pass a ptr to its MQ entry directly... >> > > Now that I've looked more into this, I'm not sure there's a > simple way for the IWCM to copy the pdata on the upcall. > Currently, the IWCM's event upcall, cm_event_handler(), > simply queues the work for processing on a workqueue thread. > So there's no per-event logic at all there. > Lemme think on this more. Stay tuned. > > Either way, the amso driver has a memory leak... > Having the IWCM copy the pdata during the upcall also leaves the greatest flexibility for the driver on how/where the pdata is captured. The IWCM has to deal with user-mode, indefinite delays waiting for a response and user-mode processes that die while holding a connection request. So it makes sense for that layer to do the allocating and copying. 
From rdreier at cisco.com Thu Jun 15 14:56:48 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 Jun 2006 14:56:48 -0700 Subject: [openib-general] [PATCH] add HW specific data to libibverbs modify QP, SRQ response In-Reply-To: <1150407704.32252.65.camel@brick.pathscale.com> (Ralph Campbell's message of "Thu, 15 Jun 2006 14:41:44 -0700") References: <1150396280.32252.46.camel@brick.pathscale.com> <1150407704.32252.65.camel@brick.pathscale.com> Message-ID: Ralph> libmthca uses a single shared page which is created at Ralph> driver open time. I'm mmaping vmalloc memory created at Ralph> ibv_create_cq(), qp, srq time so I need a way to return the Ralph> offset to ipathverbs.so to then pass to mmap(). Hmm... it seems simpler to have userspace allocate the memory with mmap() before the resize_cq call, and then pass that new buffer into the resize_cq call. That way you don't have a window where the kernel is putting completions into a buffer that userspace doesn't know about. Ralph> The new kernel drivers work with the old libibverbs and Ralph> vice versa since only the cqe entry in struct Ralph> ibv_resize_cq_resp is used. The reserved entry is only Ralph> needed to avoid using "packed" structs if struct Ralph> ibv_resize_cq_resp is included in another struct. OK, I guess we're OK, since the kernel isn't checking the size of the response buffer. old libipathverbs does need to bail out on a new ipath kernel driver, though, or else you'll get corruption when responses go off the end of a buffer. - R. From swise at opengridcomputing.com Thu Jun 15 14:58:29 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 Jun 2006 16:58:29 -0500 Subject: [openib-general] [PATCH v2 1/7] AMSO1100 Low Level Driver. 
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F15767E3@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F15767E3@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <1150408709.6612.16.camel@stevo-desktop> > > Now that I've looked more into this, I'm not sure there's a > > simple way for the IWCM to copy the pdata on the upcall. > > Currently, the IWCM's event upcall, cm_event_handler(), > > simply queues the work for processing on a workqueue thread. > > So there's no per-event logic at all there. > > Lemme think on this more. Stay tuned. > > > > Either way, the amso driver has a memory leak... > > > > Having the IWCM copy the pdata during the upcall also leaves > the greatest flexibility for the driver on how/where the pdata > is captured. The IWCM has to deal with user-mode, indefinite > delays waiting for a response and user-mode processes that die > while holding a connection request. So it makes sense for that > layer to do the allocating and copying. I've already coded and tested this. The IWCM will copy the pdata... Steve. From ardavis at ichips.intel.com Thu Jun 15 14:57:53 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 15 Jun 2006 14:57:53 -0700 Subject: [openib-general] Processes not exiting on SVN7946 In-Reply-To: References: <1AC79F16F5C5284499BB9591B33D6F0007F74705@orsmsx408> <4491D195.8030106@ichips.intel.com> Message-ID: <4491D7E1.5050504@ichips.intel.com> Roland Dreier wrote: > Roland> Hmm, any further clue where in ibv_destroy_cq() it's > Roland> stuck? Is it doing down_write() or something? > >Can you send me full sysrq-t output when it gets stuck? > >Thanks... > > > I just added ibv_destroy_cq() to ibv_rc_pingpong test. Here's the output....
open("/sys/class/infiniband_verbs/abi_version", O_RDONLY) = 3 read(3, "6\n", 8) = 2 close(3) = 0 open("/sys/class/infiniband_verbs", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 fstat(3, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0 fcntl(3, F_SETFD, FD_CLOEXEC) = 0 getdents64(3, /* 4 entries */, 4096) = 112 open("/sys/class/infiniband_verbs/uverbs0/abi_version", O_RDONLY) = 4 read(4, "1\n", 8) = 2 close(4) = 0 open("/sys/class/infiniband_verbs/uverbs0/ibdev", O_RDONLY) = 4 read(4, "mthca0\n", 64) = 7 close(4) = 0 open("/sys/class/infiniband_verbs/uverbs0/device/vendor", O_RDONLY) = 4 read(4, "0x15b3\n", 8) = 7 close(4) = 0 open("/sys/class/infiniband_verbs/uverbs0/device/device", O_RDONLY) = 4 read(4, "0x6278\n", 8) = 7 close(4) = 0 getdents64(3, /* 0 entries */, 4096) = 0 close(3) = 0 open("/dev/infiniband/uverbs0", O_RDWR) = 3 write(3, "\0\0\0\0\4\0\4\0\300\227\221\377\377\177\0\0", 16) = 16 mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, 3, 0) = 0x2b318fa6f000 write(3, "\3\0\0\0\4\0\3\0\200\227\221\377\377\177\0\0", 16) = 16 write(3, "\3\0\0\0\4\0\3\0\320\227\221\377\377\177\0\0", 16) = 16 write(3, "\t\0\0\0\f\0\3\0`\227\221\377\377\177\0\0\0pP\0\0\0\0\0"..., 48) = 48 write(3, "\t\0\0\0\f\0\3\0\240\226\221\377\377\177\0\0\0\240P\0\0"..., 48) = 48 write(3, "\22\0\0\0\22\0\4\0p\227\221\377\377\177\0\0\320nP\0\0\0"..., 72) = 72 write(3, "\t\0\0\0\f\0\3\0\240\226\221\377\377\177\0\0\0\360P\0\0"..., 48) = 48 write(3, "\30\0\0\0\30\0\10\0`\227\221\377\377\177\0\0p\221P\0\0"..., 96) = 96 write(3, "\32\0\0\0\36\0\0\0\250Y\1a9\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 120) = 120 write(3, "\2\0\0\0\6\0\n\0`\227\221\377\377\177\0\0\1lQ\0\0\0\0\0"..., 24) = 24 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 7), ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b318fa70000 write(1, " local address: LID 0x0004, QP"..., 57 local address: LID 0x0004, QPN 0x040407, PSN 0xce99bd ) = 57 socket(PF_INET6, SOCK_DGRAM, IPPROTO_IP) = 5 connect(5, {sa_family=AF_INET6, 
sin6_port=htons(18515), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0 getsockname(5, {sa_family=AF_INET6, sin6_port=htons(32770), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [22635233564164124]) = 0 close(5) = 0 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 5 connect(5, {sa_family=AF_INET, sin_port=htons(18515), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 getsockname(5, {sa_family=AF_INET, sin_port=htons(32770), sin_addr=inet_addr("127.0.0.1")}, [22635233564164112]) = 0 close(5) = 0 socket(PF_INET6, SOCK_STREAM, IPPROTO_TCP) = 5 setsockopt(5, SOL_SOCKET, SO_REUSEADDR, [22635233564164097], 4) = 0 bind(5, {sa_family=AF_INET6, sin6_port=htons(18515), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0 listen(5, 1) = 0 accept(5, 0, NULL) = 6 close(5) = 0 read(6, "0005:040407:abb228\0", 19) = 19 write(3, "\32\0\0\0\36\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 120) = 120 write(3, "\32\0\0\0\36\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 120) = 120 write(6, "0004:040407:ce99bd\0", 19) = 19 read(6, "done\0", 19) = 5 close(6) = 0 write(1, " remote address: LID 0x0005, QP"..., 57 remote address: LID 0x0005, QPN 0x040407, PSN 0xabb228 ) = 57 write(1, " calling destroy_cq\n", 20 calling destroy_cq ) = 20 write(3, "\24\0\0\0\6\0\2\0\250\227\221\377\377\177\0\0\7\0\0\0\0"..., 24 From swise at opengridcomputing.com Thu Jun 15 15:03:47 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 Jun 2006 17:03:47 -0500 Subject: [openib-general] [PATCH] librdmacm/examples/rping.c In-Reply-To: References: Message-ID: <1150409027.6612.20.camel@stevo-desktop> This is the normal output for rping... The status error on the completion is 5 (FLUSHED), which is normal. Steve. On Thu, 2006-06-15 at 17:24 -0400, amith rajith mamidala wrote: > Hi, > > With the latest rping code (Revision: 8055) I am still able to see this > race condition. 
> > server side: > > [@k62-oib examples]$ ./rping -s -vV -C10 -S26 -a 0.0.0.0 -p 9997 > server ping data: rdma-ping-0: ABCDEFGHIJKL > server ping data: rdma-ping-1: BCDEFGHIJKLM > server ping data: rdma-ping-2: CDEFGHIJKLMN > server ping data: rdma-ping-3: DEFGHIJKLMNO > server ping data: rdma-ping-4: EFGHIJKLMNOP > server ping data: rdma-ping-5: FGHIJKLMNOPQ > server ping data: rdma-ping-6: GHIJKLMNOPQR > server ping data: rdma-ping-7: HIJKLMNOPQRS > server ping data: rdma-ping-8: IJKLMNOPQRST > server ping data: rdma-ping-9: JKLMNOPQRSTU > server DISCONNECT EVENT... > wait for RDMA_READ_ADV state 9 > cq completion failed status 5 > > Client side: > > [@k63-oib examples]$ ./rping -c -vV -C10 -S26 -a 192.168.111.66 -p 9997 > ping data: rdma-ping-0: ABCDEFGHIJKL > ping data: rdma-ping-1: BCDEFGHIJKLM > ping data: rdma-ping-2: CDEFGHIJKLMN > ping data: rdma-ping-3: DEFGHIJKLMNO > ping data: rdma-ping-4: EFGHIJKLMNOP > ping data: rdma-ping-5: FGHIJKLMNOPQ > ping data: rdma-ping-6: GHIJKLMNOPQR > ping data: rdma-ping-7: HIJKLMNOPQRS > ping data: rdma-ping-8: IJKLMNOPQRST > ping data: rdma-ping-9: JKLMNOPQRSTU > cq completion failed status 5 > client DISCONNECT EVENT... > > > Thanks, > Amith > > > On Tue, 13 Jun 2006, Steve Wise wrote: > > > Thanks, applied. > > > > iwarp branch: r7964 > > trunk: r7966 > > > > > > On Tue, 2006-06-13 at 11:24 -0500, Boyd R. Faulkner wrote: > > > This patch resolves a race condition between the receipt of > > > a connection established event and a receive completion from > > > the client. The server no longer goes to connected state but > > > merely waits for the READ_ADV state to begin its looping. This > > > keeps the server from going back to CONNECTED from the later > > > states if the connection established event comes in after the > > > receive completion (i.e. the loop starts). 
> > > > > > Signed-off-by: Boyd Faulkner > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From rdreier at cisco.com Thu Jun 15 15:03:14 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 Jun 2006 15:03:14 -0700 Subject: [openib-general] Processes not exiting on SVN7946 In-Reply-To: <4491D7E1.5050504@ichips.intel.com> (Arlin Davis's message of "Thu, 15 Jun 2006 14:57:53 -0700") References: <1AC79F16F5C5284499BB9591B33D6F0007F74705@orsmsx408> <4491D195.8030106@ichips.intel.com> <4491D7E1.5050504@ichips.intel.com> Message-ID: Thanks, reproduced it locally. From sean.hefty at intel.com Thu Jun 15 15:04:57 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 15 Jun 2006 15:04:57 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <449119AE.2010703@voltaire.com> Message-ID: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> >The cma/verbs consumer can't just ignore the event since its qp state is >still RTR which means an attempt to tx replying the rx would fail. In most cases, I would expect that the IB CM will eventually receive the RTU, which will generate an event to the RDMA CM to transition the QP into RTS. This is why I think that the event can safely be ignored. It does however mean that a user cannot send on the QP until the user sees RDMA_CM_EVENT_ESTABLISHED. >I suggest the following design: the CMA would replace the event handler >provided with the qp_init_attr struct with a callback of its own and >keep the original handler/context on a private structure. This sounds like it would work. I don't think that there are any events where the additional delay would matter. 
As an alternative, I don't think that there's any reason why the QP can't be transitioned to RTS when the CM REP is sent. A user just can't post to the send queue until either an RDMA_CM_EVENT_ESTABLISHED, IB_EVENT_COMM_EST, or a completion occurs on the QP. (This doesn't change the fact that the IB CM still needs to know that the connection has been established, or it risks putting the connection into an error state if an RTU is never received.) - Sean From rdreier at cisco.com Thu Jun 15 15:13:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 Jun 2006 15:13:12 -0700 Subject: [openib-general] Processes not exiting on SVN7946 In-Reply-To: (Roland Dreier's message of "Thu, 15 Jun 2006 15:03:14 -0700") References: <1AC79F16F5C5284499BB9591B33D6F0007F74705@orsmsx408> <4491D195.8030106@ichips.intel.com> <4491D7E1.5050504@ichips.intel.com> Message-ID: OK, just a dumb oversight on my part. The change below (already checked in) fixes it for me: --- infiniband/core/uverbs_cmd.c (revision 8055) +++ infiniband/core/uverbs_cmd.c (working copy) @@ -1123,6 +1123,12 @@ ssize_t ib_uverbs_create_qp(struct ib_uv goto err_copy; } + put_pd_read(pd); + put_cq_read(scq); + put_cq_read(rcq); + if (srq) + put_srq_read(srq); + mutex_lock(&file->mutex); list_add_tail(&obj->uevent.uobject.list, &file->ucontext->qp_list); mutex_unlock(&file->mutex); From sean.hefty at intel.com Thu Jun 15 15:13:01 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 15 Jun 2006 15:13:01 -0700 Subject: [openib-general] [PATCH] backlog ignored when listening on all devs In-Reply-To: <1150402317.6612.8.camel@stevo-desktop> Message-ID: <000101c690c8$de077f10$62268686@amr.corp.intel.com> Roland, can you pick up this patch for 2.6.18? Thanks - committed in 8057. - Sean >If you listen on 0.0.0.0, then the backlog isn't passed down to the >devices because it's not stored in the id_priv struct before calling >cma_listen_on_all(). See cma_list_on_dev() which uses >id_priv->backlog...
> >Signed-off-by: Steve Wise > >---------- > >Index: cma.c >=================================================================== >--- cma.c (revision 7626) >+++ cma.c (working copy) >@@ -1086,6 +1086,7 @@ > if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_LISTEN)) > return -EINVAL; > >+ id_priv->backlog = backlog; > if (id->device) { > switch (rdma_node_get_transport(id->device->node_type)) { > case RDMA_TRANSPORT_IB: >@@ -1100,9 +1101,9 @@ > } else > cma_listen_on_all(id_priv); > >- id_priv->backlog = backlog; > return 0; > err: >+ id_priv->backlog = 0; > cma_comp_exch(id_priv, CMA_LISTEN, CMA_ADDR_BOUND); > return ret; > } From robert.j.woodruff at intel.com Thu Jun 15 15:16:28 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Thu, 15 Jun 2006 15:16:28 -0700 Subject: [openib-general] Processes not exiting on SVN7946 In-Reply-To: Message-ID: <000001c690c9$5a03e9f0$010fa8c0@amr.corp.intel.com> Roland wrote, >OK, just a dumb oversight on my part. The change below (already >checked in) fixes it for me: Great thanks for the quick response, woody From rdreier at cisco.com Thu Jun 15 15:34:37 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 Jun 2006 15:34:37 -0700 Subject: [openib-general] [PATCH] backlog ignored when listening on all devs In-Reply-To: <1150402317.6612.8.camel@stevo-desktop> (Steve Wise's message of "Thu, 15 Jun 2006 15:11:57 -0500") References: <1150402317.6612.8.camel@stevo-desktop> Message-ID: OK, I rolled this into the cma patch in for-2.6.18 branch. Unfortunately this means that all of the patches in that branch are rebased, so you'll have to repull my tree if you're tracking it. - R. 
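Steve's backlog patch above follows a common pattern worth making explicit: record the value in the private structure *before* dispatching (so the wildcard-listen path sees it), and roll it back on the error path. A minimal user-space sketch, with invented names (`fake_id_priv`, `fake_rdma_listen`) standing in for the rdma_cm structures:

```c
/* Hypothetical reduction of the rdma_listen() backlog fix. */
struct fake_id_priv {
	int backlog;
};

/* Stand-in for the transport-specific listen (or cma_listen_on_all());
 * 'fail' simulates the error path. */
static int dispatch_listen(struct fake_id_priv *id_priv, int fail)
{
	(void)id_priv;
	return fail ? -1 : 0;
}

int fake_rdma_listen(struct fake_id_priv *id_priv, int backlog, int fail)
{
	id_priv->backlog = backlog;	/* set before dispatch */
	if (dispatch_listen(id_priv, fail)) {
		id_priv->backlog = 0;	/* undo on error */
		return -1;
	}
	return 0;
}
```

The original bug was exactly the ordering: assigning `id_priv->backlog` after the dispatch meant `cma_listen_on_all()` read a stale zero when listening on 0.0.0.0.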
From ralphc at pathscale.com Thu Jun 15 15:34:58 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Thu, 15 Jun 2006 15:34:58 -0700 Subject: [openib-general] [PATCH] add HW specific data to libibverbs modify QP, SRQ response In-Reply-To: References: <1150396280.32252.46.camel@brick.pathscale.com> <1150407704.32252.65.camel@brick.pathscale.com> Message-ID: <1150410898.32252.69.camel@brick.pathscale.com> On Thu, 2006-06-15 at 14:56 -0700, Roland Dreier wrote: > Ralph> libmthca uses a single shared page which is created at > Ralph> driver open time. I'm mmaping vmalloc memory created at > Ralph> ibv_create_cq(), qp, srq time so I need a way to return the > Ralph> offset to ipathverbs.so to then pass to mmap(). > > Hmm... it seems simpler to have userspace allocate the memory with > mmap() before the resize_cq call, and then pass that new buffer into > the resize_cq call. That way you don't have a window where the kernel > is putting completions into a buffer that userspace doesn't know about. Perhaps. But this way, the code is the same for kernel and user allocated queues. > Ralph> The new kernel drivers work with the old libibverbs and > Ralph> vice versa since only the cqe entry in struct > Ralph> ibv_resize_cq_resp is used. The reserved entry is only > Ralph> needed to avoid using "packed" structs if struct > Ralph> ibv_resize_cq_resp is included in another struct. > > OK, I guess we're OK, since the kernel isn't checking the size of the > response buffer. old libipathverbs does need to bail out on a new ipath > kernel driver, though, or else you'll get corruption when responses go > off the end of a buffer. Or the new kernel driver needs to handle the old way and the new way. > - R. 
-- Ralph Campbell From ralphc at pathscale.com Thu Jun 15 15:40:54 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Thu, 15 Jun 2006 15:40:54 -0700 Subject: [openib-general] Patch for review: ipath mmaped CQs, QPs, SRQs [1 of 2] Message-ID: <1150411254.32252.76.camel@brick.pathscale.com> Here are the diffs Roland requested for the ipath driver changes to mmap the completion and receive queues into the user library. This isn't quite the final version though since I need to implement QP receive queue resizing and some version checking/handling. Index: src/userspace/libipathverbs/src/verbs.c =================================================================== --- src/userspace/libipathverbs/src/verbs.c (revision 8021) +++ src/userspace/libipathverbs/src/verbs.c (working copy) @@ -40,11 +40,14 @@ #include #include -#include +#include #include #include +#include +#include #include "ipathverbs.h" +#include "ipath-abi.h" int ipath_query_device(struct ibv_context *context, struct ibv_device_attr *attr) @@ -83,11 +86,11 @@ struct ibv_pd *pd; pd = malloc(sizeof *pd); - if(!pd) + if (!pd) return NULL; - if(ibv_cmd_alloc_pd(context, pd, &cmd, sizeof cmd, - &resp, sizeof resp)) { + if (ibv_cmd_alloc_pd(context, pd, &cmd, sizeof cmd, + &resp, sizeof resp)) { free(pd); return NULL; } @@ -142,129 +145,396 @@ struct ibv_comp_channel *channel, int comp_vector) { - struct ibv_cq *cq; - struct ibv_create_cq cmd; - struct ibv_create_cq_resp resp; - int ret; + struct ipath_cq *cq; + struct ibv_create_cq cmd; + struct ipath_create_cq_resp resp; + int ret; + size_t size; cq = malloc(sizeof *cq); if (!cq) return NULL; - ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector, cq, - &cmd, sizeof cmd, &resp, sizeof resp); + ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector, + &cq->ibv_cq, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); if (ret) { free(cq); return NULL; } - return cq; + size = sizeof(struct ipath_cq_wc) + sizeof(struct ipath_wc) * cqe; + cq->queue = 
mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, + context->cmd_fd, resp.offset); + if ((void *) cq->queue == MAP_FAILED) { + free(cq); + return NULL; + } + + pthread_spin_init(&cq->lock, PTHREAD_PROCESS_PRIVATE); + return &cq->ibv_cq; } -int ipath_destroy_cq(struct ibv_cq *cq) +int ipath_resize_cq(struct ibv_cq *ibcq, int cqe) { + struct ipath_cq *cq = to_icq(ibcq); + struct ibv_resize_cq cmd; + struct ipath_resize_cq_resp resp; + size_t size; + int ret; + + pthread_spin_lock(&cq->lock); + /* Unmap the old queue so we can resize it. */ + size = sizeof(struct ipath_cq_wc) + + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe); + (void) munmap(cq->queue, size); + ret = ibv_cmd_resize_cq_resp(ibcq, cqe, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); + if (ret) { + pthread_spin_unlock(&cq->lock); + return ret; + } + size = sizeof(struct ipath_cq_wc) + + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe); + cq->queue = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, + ibcq->context->cmd_fd, resp.offset); + ret = errno; + pthread_spin_unlock(&cq->lock); + if ((void *) cq->queue == MAP_FAILED) + return ret; + return 0; +} + +int ipath_destroy_cq(struct ibv_cq *ibcq) +{ + struct ipath_cq *cq = to_icq(ibcq); int ret; - ret = ibv_cmd_destroy_cq(cq); + ret = ibv_cmd_destroy_cq(ibcq); if (ret) return ret; + (void) munmap(cq->queue, sizeof(struct ipath_cq_wc) + + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe)); free(cq); return 0; } +int ipath_poll_cq(struct ibv_cq *ibcq, int ne, struct ibv_wc *wc) +{ + struct ipath_cq *cq = to_icq(ibcq); + struct ipath_cq_wc *q; + int npolled; + uint32_t tail; + + pthread_spin_lock(&cq->lock); + q = cq->queue; + tail = q->tail; + for (npolled = 0; npolled < ne; ++npolled, ++wc) { + if (tail == q->head) + break; + memcpy(wc, &q->queue[tail], sizeof(*wc)); + if (tail == cq->ibv_cq.cqe) + tail = 0; + else + tail++; + } + q->tail = tail; + pthread_spin_unlock(&cq->lock); + + return npolled; +} + struct ibv_qp *ipath_create_qp(struct ibv_pd *pd, struct 
ibv_qp_init_attr *attr) { - struct ibv_create_qp cmd; - struct ibv_create_qp_resp resp; - struct ibv_qp *qp; - int ret; + struct ibv_create_qp cmd; + struct ipath_create_qp_resp resp; + struct ipath_qp *qp; + int ret; + size_t size; qp = malloc(sizeof *qp); if (!qp) return NULL; - ret = ibv_cmd_create_qp(pd, qp, attr, &cmd, sizeof cmd, &resp, sizeof resp); + ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); if (ret) { free(qp); return NULL; } - return qp; + if (attr->srq) { + qp->rq.size = 0; + qp->rq.max_sge = 0; + qp->rq.rwq = NULL; + } else { + qp->rq.size = attr->cap.max_recv_wr + 1; + qp->rq.max_sge = attr->cap.max_recv_sge; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * qp->rq.max_sge)) * + qp->rq.size; + qp->rq.rwq = mmap(NULL, size, + PROT_READ | PROT_WRITE, MAP_SHARED, + pd->context->cmd_fd, resp.offset); + if ((void *) qp->rq.rwq == MAP_FAILED) { + free(qp); + return NULL; + } + } + + pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE); + return &qp->ibv_qp; } -int ipath_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, +int ipath_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + enum ibv_qp_attr_mask attr_mask, + struct ibv_qp_init_attr *init_attr) +{ + struct ibv_query_qp cmd; + + return ibv_cmd_query_qp(qp, attr, attr_mask, init_attr, + &cmd, sizeof cmd); +} + +int ipath_modify_qp(struct ibv_qp *ibqp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask) { - struct ibv_modify_qp cmd; + struct ipath_qp *qp = to_iqp(ibqp); + struct ipath_modify_qp_cmd cmd; + __u64 offset; + size_t size; + int ret; - return ibv_cmd_modify_qp(qp, attr, attr_mask, &cmd, sizeof cmd); + if (attr_mask & IBV_QP_CAP) { + /* Can't resize receive queue if we haved a shared one. */ + if (ibqp->srq) + return EINVAL; + pthread_spin_lock(&qp->rq.lock); + /* Unmap the old queue so we can resize it. 
*/ + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * qp->rq.max_sge)) * + qp->rq.size; + (void) munmap(qp->rq.rwq, size); + } + cmd.offset_addr = (__u64) &offset; + ret = ibv_cmd_modify_qp(ibqp, attr, attr_mask, + &cmd.ibv_cmd, sizeof cmd); + if (ret) { + if (attr_mask & IBV_QP_CAP) + pthread_spin_unlock(&qp->rq.lock); + return ret; + } + if (attr_mask & IBV_QP_CAP) { + qp->rq.size = attr->cap.max_recv_wr + 1; + qp->rq.max_sge = attr->cap.max_recv_sge; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * qp->rq.max_sge)) * + qp->rq.size; + qp->rq.rwq = mmap(NULL, size, + PROT_READ | PROT_WRITE, MAP_SHARED, + ibqp->context->cmd_fd, offset); + pthread_spin_unlock(&qp->rq.lock); + /* XXX Now we have no receive queue. */ + if ((void *) qp->rq.rwq == MAP_FAILED) + return errno; + } + return 0; } -int ipath_destroy_qp(struct ibv_qp *qp) +int ipath_destroy_qp(struct ibv_qp *ibqp) { + struct ipath_qp *qp = to_iqp(ibqp); int ret; - ret = ibv_cmd_destroy_qp(qp); + ret = ibv_cmd_destroy_qp(ibqp); if (ret) return ret; + if (qp->rq.rwq) { + size_t size; + + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * qp->rq.max_sge)) * + qp->rq.size; + (void) munmap(qp->rq.rwq, size); + } free(qp); return 0; } +static int post_recv(struct ipath_rq *rq, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + struct ibv_recv_wr *i; + struct ipath_rwq *rwq; + struct ipath_rwqe *wqe; + uint32_t head; + int n, ret; + + pthread_spin_lock(&rq->lock); + rwq = rq->rwq; + head = rwq->head; + for (i = wr; i; i = i->next) { + if ((unsigned) i->num_sge > rq->max_sge) + goto bad; + wqe = get_rwqe_ptr(rq, head); + if (++head >= rq->size) + head = 0; + if (head == rwq->tail) + goto bad; + wqe->wr_id = i->wr_id; + wqe->num_sge = i->num_sge; + for (n = 0; n < wqe->num_sge; n++) + wqe->sg_list[n] = i->sg_list[n]; + rwq->head = head; + } + ret = 0; + goto done; + +bad: + ret = 
-ENOMEM; + if (bad_wr) + *bad_wr = i; +done: + pthread_spin_unlock(&rq->lock); + return ret; +} + +int ipath_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + struct ipath_qp *qp = to_iqp(ibqp); + + return post_recv(&qp->rq, wr, bad_wr); +} + struct ibv_srq *ipath_create_srq(struct ibv_pd *pd, struct ibv_srq_init_attr *attr) { - struct ibv_srq *srq; + struct ipath_srq *srq; struct ibv_create_srq cmd; - struct ibv_create_srq_resp resp; + struct ipath_create_srq_resp resp; int ret; + size_t size; srq = malloc(sizeof *srq); - if(srq == NULL) + if (srq == NULL) return NULL; - ret = ibv_cmd_create_srq(pd, srq, attr, &cmd, sizeof cmd, - &resp, sizeof resp); + ret = ibv_cmd_create_srq(pd, &srq->ibv_srq, attr, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); if (ret) { free(srq); return NULL; } - return srq; + srq->rq.size = attr->attr.max_wr + 1; + srq->rq.max_sge = attr->attr.max_sge; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * srq->rq.size; + srq->rq.rwq = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, + pd->context->cmd_fd, resp.offset); + if ((void *) srq->rq.rwq == MAP_FAILED) { + free(srq); + return NULL; + } + + pthread_spin_init(&srq->rq.lock, PTHREAD_PROCESS_PRIVATE); + return &srq->ibv_srq; } -int ipath_modify_srq(struct ibv_srq *srq, +int ipath_modify_srq(struct ibv_srq *ibsrq, struct ibv_srq_attr *attr, enum ibv_srq_attr_mask attr_mask) { - struct ibv_modify_srq cmd; + struct ipath_srq *srq = to_isrq(ibsrq); + struct ipath_modify_srq_cmd cmd; + __u64 offset; + size_t size; + int ret; - return ibv_cmd_modify_srq(srq, attr, attr_mask, &cmd, sizeof cmd); + if (attr_mask & IBV_SRQ_MAX_WR) { + pthread_spin_lock(&srq->rq.lock); + /* Unmap the old queue so we can resize it. 
*/ + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * + srq->rq.size; + (void) munmap(srq->rq.rwq, size); + } + cmd.offset_addr = (__u64) &offset; + ret = ibv_cmd_modify_srq(ibsrq, attr, attr_mask, + &cmd.ibv_cmd, sizeof cmd); + if (ret) { + if (attr_mask & IBV_SRQ_MAX_WR) + pthread_spin_unlock(&srq->rq.lock); + return ret; + } + if (attr_mask & IBV_SRQ_MAX_WR) { + srq->rq.size = attr->max_wr + 1; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * + srq->rq.size; + srq->rq.rwq = mmap(NULL, size, + PROT_READ | PROT_WRITE, MAP_SHARED, + ibsrq->context->cmd_fd, offset); + pthread_spin_unlock(&srq->rq.lock); + /* XXX Now we have no receive queue. */ + if ((void *) srq->rq.rwq == MAP_FAILED) + return errno; + } + return 0; } -int ipath_destroy_srq(struct ibv_srq *srq) +int ipath_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr) { + struct ibv_query_srq cmd; + + return ibv_cmd_query_srq(srq, attr, &cmd, sizeof cmd); +} + +int ipath_destroy_srq(struct ibv_srq *ibsrq) +{ + struct ipath_srq *srq = to_isrq(ibsrq); + size_t size; int ret; - ret = ibv_cmd_destroy_srq(srq); + ret = ibv_cmd_destroy_srq(ibsrq); if (ret) return ret; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * srq->rq.size; + (void) munmap(srq->rq.rwq, size); free(srq); return 0; } +int ipath_post_srq_recv(struct ibv_srq *ibsrq, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + struct ipath_srq *srq = to_isrq(ibsrq); + + return post_recv(&srq->rq, wr, bad_wr); +} + struct ibv_ah *ipath_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr) { struct ibv_ah *ah; ah = malloc(sizeof *ah); - if(ah == NULL) + if (ah == NULL) return NULL; - if(ibv_cmd_create_ah(pd, ah, attr)) { + if (ibv_cmd_create_ah(pd, ah, attr)) { free(ah); return NULL; } Index: src/userspace/libipathverbs/src/ipath-abi.h 
=================================================================== --- src/userspace/libipathverbs/src/ipath-abi.h (revision 0) +++ src/userspace/libipathverbs/src/ipath-abi.h (revision 0) @@ -0,0 +1,72 @@ +/* + * Copyright (c) 2006. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. 
+ */ + +#ifndef IPATH_ABI_H +#define IPATH_ABI_H + +#include + +struct ipath_create_cq_resp { + struct ibv_create_cq_resp ibv_resp; + __u64 offset; +}; + +struct ipath_resize_cq_resp { + struct ibv_resize_cq_resp ibv_resp; + __u64 offset; +}; + +struct ipath_create_qp_resp { + struct ibv_create_qp_resp ibv_resp; + __u64 offset; +}; + +struct ipath_modify_qp_cmd { + struct ibv_modify_qp ibv_cmd; + __u64 offset_addr; +}; + +struct ipath_create_srq_resp { + struct ibv_create_srq_resp ibv_resp; + __u64 offset; +}; + +struct ipath_modify_srq_cmd { + struct ibv_modify_srq ibv_cmd; + __u64 offset_addr; +}; + +#endif /* IPATH_ABI_H */ Index: src/userspace/libipathverbs/src/ipathverbs.c =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.c (revision 8021) +++ src/userspace/libipathverbs/src/ipathverbs.c (working copy) @@ -86,22 +86,25 @@ .dereg_mr = ipath_dereg_mr, .create_cq = ipath_create_cq, - .poll_cq = ibv_cmd_poll_cq, + .poll_cq = ipath_poll_cq, .req_notify_cq = ibv_cmd_req_notify_cq, .cq_event = NULL, + .resize_cq = ipath_resize_cq, .destroy_cq = ipath_destroy_cq, .create_srq = ipath_create_srq, .modify_srq = ipath_modify_srq, + .query_srq = ipath_query_srq, .destroy_srq = ipath_destroy_srq, - .post_srq_recv = ibv_cmd_post_srq_recv, + .post_srq_recv = ipath_post_srq_recv, .create_qp = ipath_create_qp, + .query_qp = ipath_query_qp, .modify_qp = ipath_modify_qp, .destroy_qp = ipath_destroy_qp, .post_send = ibv_cmd_post_send, - .post_recv = ibv_cmd_post_recv, + .post_recv = ipath_post_recv, .create_ah = ipath_create_ah, .destroy_ah = ipath_destroy_ah, Index: src/userspace/libipathverbs/src/ipathverbs.h =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.h (revision 8021) +++ src/userspace/libipathverbs/src/ipathverbs.h (working copy) @@ -39,6 +39,7 @@ #include #include +#include #include #include @@ -64,6 +65,81 @@ struct ibv_context ibv_ctx; }; 
+/* + * This structure needs to have the same size and offsets as + * the kernel's ib_wc structure since it is memory mapped. + */ +struct ipath_wc { + uint64_t wr_id; + enum ibv_wc_status status; + enum ibv_wc_opcode opcode; + uint32_t vendor_err; + uint32_t byte_len; + uint32_t imm_data; /* in network byte order */ + uint32_t qp_num; + uint32_t src_qp; + enum ibv_wc_flags wc_flags; + uint16_t pkey_index; + uint16_t slid; + uint8_t sl; + uint8_t dlid_path_bits; + uint8_t port_num; +}; + +struct ipath_cq_wc { + uint32_t head; + uint32_t tail; + struct ipath_wc queue[1]; +}; + +struct ipath_cq { + struct ibv_cq ibv_cq; + struct ipath_cq_wc *queue; + pthread_spinlock_t lock; +}; + +/* + * Receive work request queue entry. + * The size of the sg_list is determined when the QP is created and stored + * in qp->r_max_sge. + */ +struct ipath_rwqe { + uint64_t wr_id; + uint8_t num_sge; + struct ibv_sge sg_list[0]; +}; + +/* + * This structure is used to contain the head pointer, tail pointer, + * and receive work queue entries as a single memory allocation so + * it can be mmap'ed into user space. + * Note that the wq array elements are variable size so you can't + * just index into the array to get the N'th element; + * use get_rwqe_ptr() instead. + */ +struct ipath_rwq { + uint32_t head; /* new requests posted to the head */ + uint32_t tail; /* receives pull requests from here. 
*/ + struct ipath_rwqe wq[0]; +}; + +struct ipath_rq { + struct ipath_rwq *rwq; + pthread_spinlock_t lock; + uint32_t size; + uint32_t max_sge; +}; + +struct ipath_qp { + struct ibv_qp ibv_qp; + struct ipath_rq rq; +}; + +struct ipath_srq { + struct ibv_srq ibv_srq; + struct ipath_rq rq; +}; + #define to_ixxx(xxx, type) \ ((struct ipath_##type *) \ ((void *) ib##xxx - offsetof(struct ipath_##type, ibv_##xxx))) @@ -73,6 +149,34 @@ return to_ixxx(ctx, context); } +static inline struct ipath_cq *to_icq(struct ibv_cq *ibcq) +{ + return to_ixxx(cq, cq); +} + +static inline struct ipath_qp *to_iqp(struct ibv_qp *ibqp) +{ + return to_ixxx(qp, qp); +} + +static inline struct ipath_srq *to_isrq(struct ibv_srq *ibsrq) +{ + return to_ixxx(srq, srq); +} + +/* + * Since struct ipath_rwqe is not a fixed size, we can't simply index into + * struct ipath_rq.wq. This function does the array index computation. + */ +static inline struct ipath_rwqe *get_rwqe_ptr(struct ipath_rq *rq, + unsigned n) +{ + return (struct ipath_rwqe *) + ((char *) rq->rwq->wq + + (sizeof(struct ipath_rwqe) + + rq->max_sge * sizeof(struct ibv_sge)) * n); +} + extern int ipath_query_device(struct ibv_context *context, struct ibv_device_attr *attr); @@ -92,11 +196,19 @@ struct ibv_comp_channel *channel, int comp_vector); +int ipath_resize_cq(struct ibv_cq *cq, int cqe); + int ipath_destroy_cq(struct ibv_cq *cq); +int ipath_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); + struct ibv_qp *ipath_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr); +int ipath_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + enum ibv_qp_attr_mask attr_mask, + struct ibv_qp_init_attr *init_attr); + int ipath_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask); @@ -115,8 +227,12 @@ struct ibv_srq_attr *attr, enum ibv_srq_attr_mask attr_mask); +int ipath_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr); + int ipath_destroy_srq(struct ibv_srq *srq); +int 
ipath_post_srq_recv(struct ibv_srq *srq, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr); struct ibv_ah *ipath_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr); -- Ralph Campbell From ralphc at pathscale.com Thu Jun 15 15:42:12 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Thu, 15 Jun 2006 15:42:12 -0700 Subject: [openib-general] Patch for review: ipath mmaped CQs, QPs, SRQs [2 of 2] Message-ID: <1150411332.32252.78.camel@brick.pathscale.com> Here are the kernel driver changes that go with the user library changes just posted. Index: src/linux-kernel/infiniband/hw/ipath/ipath_qp.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_qp.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_qp.c (working copy) @@ -425,11 +425,12 @@ * @ibqp: the queue pair whose attributes we're modifying * @attr: the new attributes * @attr_mask: the mask of attributes to modify + * @udata: not used by the InfiniPath verbs driver * * Returns 0 on success, otherwise returns an errno. 
*/ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, - int attr_mask) + int attr_mask, struct ib_udata *udata) { struct ipath_ibdev *dev = to_idev(ibqp->device); struct ipath_qp *qp = to_iqp(ibqp); Index: src/linux-kernel/infiniband/hw/ipath/ipath_ruc.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_ruc.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_ruc.c (working copy) @@ -105,6 +105,54 @@ spin_unlock_irqrestore(&dev->pending_lock, flags); } +static int init_sge(struct ipath_qp *qp, struct ipath_rwqe *wqe) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + int user = to_ipd(qp->ibqp.pd)->user; + int i, j, ret; + struct ib_wc wc; + + qp->r_len = 0; + for (i = j = 0; i < wqe->num_sge; i++) { + if (wqe->sg_list[i].length == 0) + continue; + /* Check LKEY */ + if ((user && wqe->sg_list[i].lkey == 0) || + !ipath_lkey_ok(&dev->lk_table, + &qp->r_sg_list[j], &wqe->sg_list[i], + IB_ACCESS_LOCAL_WRITE)) + goto bad_lkey; + qp->r_len += wqe->sg_list[i].length; + j++; + } + qp->r_sge.sge = qp->r_sg_list[0]; + qp->r_sge.sg_list = qp->r_sg_list + 1; + qp->r_sge.num_sge = j; + ret = 1; + goto bail; + +bad_lkey: + wc.wr_id = wqe->wr_id; + wc.status = IB_WC_LOC_PROT_ERR; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = 0; + wc.imm_data = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = 0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = 0; + wc.sl = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + /* Signal solicited completion event. 
*/ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + ret = 0; +bail: + return ret; +} + /** * ipath_get_rwqe - copy the next RWQE into the QP's RWQE * @qp: the QP @@ -118,73 +166,69 @@ { unsigned long flags; struct ipath_rq *rq; + struct ipath_rwq *wq; struct ipath_srq *srq; struct ipath_rwqe *wqe; + void (*handler)(struct ib_event *, void *); + u32 tail; int ret; - if (!qp->ibqp.srq) { + if (qp->ibqp.srq) { + srq = to_isrq(qp->ibqp.srq); + handler = srq->ibsrq.event_handler; + rq = &srq->rq; + } else { + srq = NULL; + handler = NULL; rq = &qp->r_rq; - spin_lock_irqsave(&rq->lock, flags); + } - if (unlikely(rq->tail == rq->head)) { + spin_lock_irqsave(&rq->lock, flags); + wq = rq->wq; + tail = wq->tail; + do { + if (unlikely(tail == wq->head)) { + spin_unlock_irqrestore(&rq->lock, flags); ret = 0; goto bail; } - wqe = get_rwqe_ptr(rq, rq->tail); - qp->r_wr_id = wqe->wr_id; - if (!wr_id_only) { - qp->r_sge.sge = wqe->sg_list[0]; - qp->r_sge.sg_list = wqe->sg_list + 1; - qp->r_sge.num_sge = wqe->num_sge; - qp->r_len = wqe->length; - } - if (++rq->tail >= rq->size) - rq->tail = 0; - goto done; - } + wqe = get_rwqe_ptr(rq, tail); + if (++tail >= rq->size) + tail = 0; + } while (!wr_id_only && !init_sge(qp, wqe)); + qp->r_wr_id = wqe->wr_id; + wq->tail = tail; - srq = to_isrq(qp->ibqp.srq); - rq = &srq->rq; - spin_lock_irqsave(&rq->lock, flags); - - if (unlikely(rq->tail == rq->head)) { - ret = 0; - goto bail; - } - wqe = get_rwqe_ptr(rq, rq->tail); - qp->r_wr_id = wqe->wr_id; - if (!wr_id_only) { - qp->r_sge.sge = wqe->sg_list[0]; - qp->r_sge.sg_list = wqe->sg_list + 1; - qp->r_sge.num_sge = wqe->num_sge; - qp->r_len = wqe->length; - } - if (++rq->tail >= rq->size) - rq->tail = 0; - if (srq->ibsrq.event_handler) { - struct ib_event ev; + ret = 1; + if (handler) { u32 n; - if (rq->head < rq->tail) - n = rq->size + rq->head - rq->tail; + /* + * validate head pointer value and compute + * the number of remaining WQEs. 
+ */ + n = wq->head; + if (n >= rq->size) + n = 0; + if (n < tail) + n += rq->size - tail; else - n = rq->head - rq->tail; + n -= tail; if (n < srq->limit) { + struct ib_event ev; + srq->limit = 0; spin_unlock_irqrestore(&rq->lock, flags); ev.device = qp->ibqp.device; ev.element.srq = qp->ibqp.srq; ev.event = IB_EVENT_SRQ_LIMIT_REACHED; - srq->ibsrq.event_handler(&ev, - srq->ibsrq.srq_context); - spin_lock_irqsave(&rq->lock, flags); + handler(&ev, srq->ibsrq.srq_context); + goto bail; } } -done: - ret = 1; + spin_unlock_irqrestore(&rq->lock, flags); bail: - spin_unlock_irqrestore(&rq->lock, flags); return ret; } Index: src/linux-kernel/infiniband/hw/ipath/Makefile =================================================================== --- src/linux-kernel/infiniband/hw/ipath/Makefile (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/Makefile (working copy) @@ -25,6 +25,7 @@ ipath_cq.o \ ipath_keys.o \ ipath_mad.o \ + ipath_mmap.o \ ipath_mr.o \ ipath_qp.o \ ipath_rc.o \ Index: src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c (working copy) @@ -280,11 +280,12 @@ struct ib_recv_wr **bad_wr) { struct ipath_qp *qp = to_iqp(ibqp); + struct ipath_rwq *wq = qp->r_rq.wq; unsigned long flags; int ret; /* Check that state is OK to post receive. 
*/ - if (!(ib_ipath_state_ops[qp->state] & IPATH_POST_RECV_OK)) { + if (!(ib_ipath_state_ops[qp->state] & IPATH_POST_RECV_OK) || !wq) { *bad_wr = wr; ret = -EINVAL; goto bail; @@ -293,59 +294,31 @@ for (; wr; wr = wr->next) { struct ipath_rwqe *wqe; u32 next; - int i, j; + int i; - if (wr->num_sge > qp->r_rq.max_sge) { + if ((unsigned) wr->num_sge > qp->r_rq.max_sge) { *bad_wr = wr; ret = -ENOMEM; goto bail; } spin_lock_irqsave(&qp->r_rq.lock, flags); - next = qp->r_rq.head + 1; + next = wq->head + 1; if (next >= qp->r_rq.size) next = 0; - if (next == qp->r_rq.tail) { + if (next == wq->tail) { spin_unlock_irqrestore(&qp->r_rq.lock, flags); *bad_wr = wr; ret = -ENOMEM; goto bail; } - wqe = get_rwqe_ptr(&qp->r_rq, qp->r_rq.head); + wqe = get_rwqe_ptr(&qp->r_rq, wq->head); wqe->wr_id = wr->wr_id; - wqe->sg_list[0].mr = NULL; - wqe->sg_list[0].vaddr = NULL; - wqe->sg_list[0].length = 0; - wqe->sg_list[0].sge_length = 0; - wqe->length = 0; - for (i = 0, j = 0; i < wr->num_sge; i++) { - /* Check LKEY */ - if (to_ipd(qp->ibqp.pd)->user && - wr->sg_list[i].lkey == 0) { - spin_unlock_irqrestore(&qp->r_rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - if (wr->sg_list[i].length == 0) - continue; - if (!ipath_lkey_ok( - &to_idev(qp->ibqp.device)->lk_table, - &wqe->sg_list[j], &wr->sg_list[i], - IB_ACCESS_LOCAL_WRITE)) { - spin_unlock_irqrestore(&qp->r_rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - wqe->length += wr->sg_list[i].length; - j++; - } - wqe->num_sge = j; - qp->r_rq.head = next; + wqe->num_sge = wr->num_sge; + for (i = 0; i < wr->num_sge; i++) + wqe->sg_list[i] = wr->sg_list[i]; + wq->head = next; spin_unlock_irqrestore(&qp->r_rq.lock, flags); } ret = 0; @@ -694,7 +667,7 @@ ipath_layer_get_lastibcstat(dev->dd) & 0xf]; props->port_cap_flags = dev->port_cap_flags; props->gid_tbl_len = 1; - props->max_msg_sz = 4096; + props->max_msg_sz = 0x80000000; props->pkey_tbl_len = ipath_layer_get_npkeys(dev->dd); props->bad_pkey_cntr = 
ipath_layer_get_cr_errpkey(dev->dd) - dev->z_pkey_violations; @@ -871,7 +844,7 @@ goto bail; } - if (ah_attr->port_num != 1 || + if (ah_attr->port_num < 1 || ah_attr->port_num > pd->device->phys_port_cnt) { ret = ERR_PTR(-EINVAL); goto bail; @@ -883,6 +856,8 @@ goto bail; } + dev->n_ahs_allocated++; + /* ib_create_ah() will initialize ah->ibah. */ ah->attr = *ah_attr; @@ -1137,6 +1112,7 @@ dev->attach_mcast = ipath_multicast_attach; dev->detach_mcast = ipath_multicast_detach; dev->process_mad = ipath_process_mad; + dev->mmap = ipath_mmap; snprintf(dev->node_desc, sizeof(dev->node_desc), IPATH_IDSTR " %s kernel_SMA", system_utsname.nodename); Index: src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h (working copy) @@ -577,7 +577,7 @@ int ipath_destroy_qp(struct ib_qp *ibqp); int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, - int attr_mask); + int attr_mask, struct ib_udata *udata); int ipath_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, struct ib_qp_init_attr *init_attr); @@ -636,7 +636,8 @@ struct ib_udata *udata); int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask); + enum ib_srq_attr_mask attr_mask, + struct ib_udata *udata); int ipath_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr); Index: src/linux-kernel/infiniband/hw/ipath/ipath_mmap.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_mmap.c (revision 0) +++ src/linux-kernel/infiniband/hw/ipath/ipath_mmap.c (revision 0) @@ -0,0 +1,147 @@ +/* + * Copyright (c) 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include + +#include +#include +#include + +#include "ipath_verbs.h" + +/** + * ipath_release_mmap_info - free mmap info structure + * @ref: a pointer to the kref within struct ipath_mmap_info + */ +void ipath_release_mmap_info(struct kref *ref) +{ + struct ipath_mmap_info *ip = + container_of(ref, struct ipath_mmap_info, ref); + + vfree(ip->obj); + kfree(ip); +} + +/* + * open and close keep track of how many times the CQ is mapped, + * to avoid releasing it. 
+ */ +static void ipath_vma_open(struct vm_area_struct *vma) +{ + struct ipath_mmap_info *ip = vma->vm_private_data; + + kref_get(&ip->ref); + ip->mmap_cnt++; +} + +static void ipath_vma_close(struct vm_area_struct *vma) +{ + struct ipath_mmap_info *ip = vma->vm_private_data; + + ip->mmap_cnt--; + kref_put(&ip->ref, ipath_release_mmap_info); +} + +/* + * ipath_vma_nopage - handle a VMA page fault. + */ +static struct page *ipath_vma_nopage(struct vm_area_struct *vma, + unsigned long address, int *type) +{ + struct ipath_mmap_info *ip = vma->vm_private_data; + unsigned long offset = address - vma->vm_start; + struct page *page = NOPAGE_SIGBUS; + void *pageptr; + + if (offset >= ip->size) + goto out; /* out of range */ + + /* + * Convert the vmalloc address into a struct page. + */ + pageptr = (void *)(offset + (vma->vm_pgoff << PAGE_SHIFT)); + page = vmalloc_to_page(pageptr); + + /* Increment the reference count. */ + get_page(page); + if (type) + *type = VM_FAULT_MINOR; +out: + return page; +} + +static struct vm_operations_struct ipath_vm_ops = { + .open = ipath_vma_open, + .close = ipath_vma_close, + .nopage = ipath_vma_nopage, +}; + +/** + * ipath_mmap - create a new mmap region + * @context: the IB user context of the process making the mmap() call + * @vma: the VMA to be initialized + * Return zero if the mmap is OK. Otherwise, return an errno. + */ +int ipath_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + struct ipath_ibdev *dev = to_idev(context->device); + unsigned long offset = vma->vm_pgoff << PAGE_SHIFT; + unsigned long size = vma->vm_end - vma->vm_start; + struct ipath_mmap_info *ip, **pp; + + /* + * Search the device's list of objects waiting for a mmap call. + * Normally, this list is very short since a call to create a + * CQ, QP, or SRQ is soon followed by a call to mmap(). 
+ */ + spin_lock_irq(&dev->pending_lock); + for (pp = &dev->pending_mmaps; (ip = *pp); pp = &ip->next) { + /* Only the creator is allowed to mmap the object */ + if (context != ip->context || (void *) offset != ip->obj) + continue; + /* Don't allow a mmap larger than the object. */ + if (size > ip->size) + break; + + *pp = ip->next; + spin_unlock_irq(&dev->pending_lock); + + vma->vm_ops = &ipath_vm_ops; + vma->vm_flags |= VM_RESERVED; + vma->vm_private_data = ip; + ipath_vma_open(vma); + return 0; + } + spin_unlock_irq(&dev->pending_lock); + return -EINVAL; +} Index: src/linux-kernel/infiniband/hw/ipath/ipath_cq.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_cq.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_cq.c (working copy) @@ -41,20 +41,28 @@ * @entry: work completion entry to add * @sig: true if @entry is a solicited entry * - * This may be called with one of the qp->s_lock or qp->r_rq.lock held. + * This may be called with qp->s_lock held. */ void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int solicited) { + struct ipath_cq_wc *wc = cq->queue; unsigned long flags; + u32 head; u32 next; spin_lock_irqsave(&cq->lock, flags); - if (cq->head == cq->ibcq.cqe) + /* + * Note that the head pointer might be writable by user processes. + * Take care to verify it is a sane value. 
+ */ + head = wc->head; + if (head >= (unsigned) cq->ibcq.cqe) { + head = cq->ibcq.cqe; next = 0; - else - next = cq->head + 1; - if (unlikely(next == cq->tail)) { + } else + next = head + 1; + if (unlikely(next == wc->tail)) { spin_unlock_irqrestore(&cq->lock, flags); if (cq->ibcq.event_handler) { struct ib_event ev; @@ -66,8 +74,8 @@ } return; } - cq->queue[cq->head] = *entry; - cq->head = next; + wc->queue[head] = *entry; + wc->head = next; if (cq->notify == IB_CQ_NEXT_COMP || (cq->notify == IB_CQ_SOLICITED && solicited)) { @@ -100,19 +108,20 @@ int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) { struct ipath_cq *cq = to_icq(ibcq); + struct ipath_cq_wc *wc = cq->queue; unsigned long flags; int npolled; spin_lock_irqsave(&cq->lock, flags); for (npolled = 0; npolled < num_entries; ++npolled, ++entry) { - if (cq->tail == cq->head) + if (wc->tail == wc->head) break; - *entry = cq->queue[cq->tail]; - if (cq->tail == cq->ibcq.cqe) - cq->tail = 0; + *entry = wc->queue[wc->tail]; + if (wc->tail >= cq->ibcq.cqe) + wc->tail = 0; else - cq->tail++; + wc->tail++; } spin_unlock_irqrestore(&cq->lock, flags); @@ -159,7 +168,7 @@ { struct ipath_ibdev *dev = to_idev(ibdev); struct ipath_cq *cq; - struct ib_wc *wc; + struct ipath_cq_wc *wc; struct ib_cq *ret; if (entries > ib_ipath_max_cqes) { @@ -172,10 +181,7 @@ goto bail; } - /* - * Need to use vmalloc() if we want to support large #s of - * entries. - */ + /* Allocate the completion queue structure. */ cq = kmalloc(sizeof(*cq), GFP_KERNEL); if (!cq) { ret = ERR_PTR(-ENOMEM); @@ -183,15 +189,54 @@ } /* - * Need to use vmalloc() if we want to support large #s of entries. + * Allocate the completion queue entries and head/tail pointers. + * This is allocated separately so that it can be resized and + * also mapped into user space. + * We need to use vmalloc() in order to support mmap and large + * numbers of entries. 
*/ - wc = vmalloc(sizeof(*wc) * (entries + 1)); + wc = vmalloc(sizeof(*wc) + sizeof(struct ib_wc) * entries); if (!wc) { - kfree(cq); ret = ERR_PTR(-ENOMEM); - goto bail; + goto free_cq; } + /* + * Return the address of the WC as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata) { + struct ipath_mmap_info *ip; + __u64 offset = (__u64) wc; + int err; + + err = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (err) { + ret = ERR_PTR(err); + goto free_wc; + } + + /* Allocate info for ipath_mmap(). */ + ip = kmalloc(sizeof(*ip), GFP_KERNEL); + if (!ip) { + ret = ERR_PTR(-ENOMEM); + goto free_wc; + } + cq->ip = ip; + ip->context = context; + ip->obj = wc; + kref_init(&ip->ref); + ip->mmap_cnt = 0; + ip->size = PAGE_ALIGN(sizeof(*wc) + + sizeof(struct ib_wc) * entries); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } else + cq->ip = NULL; + + /* * ib_create_cq() will initialize cq->ibcq except for cq->ibcq.cqe. * The number of entries should be >= the number requested or return * an error. @@ -201,14 +246,18 @@ cq->triggered = 0; spin_lock_init(&cq->lock); tasklet_init(&cq->comptask, send_complete, (unsigned long)cq); - cq->head = 0; - cq->tail = 0; + wc->head = 0; + wc->tail = 0; cq->queue = wc; ret = &cq->ibcq; - dev->n_cqs_allocated++; + goto bail; +free_wc: + vfree(wc); +free_cq: + kfree(cq); bail: return ret; } @@ -228,7 +277,10 @@ tasklet_kill(&cq->comptask); dev->n_cqs_allocated--; - vfree(cq->queue); + if (cq->ip) + kref_put(&cq->ip->ref, ipath_release_mmap_info); + else + vfree(cq->queue); kfree(cq); return 0; @@ -252,7 +304,7 @@ spin_lock_irqsave(&cq->lock, flags); /* * Don't change IB_CQ_NEXT_COMP to IB_CQ_SOLICITED but allow - * any other transitions. + * any other transitions (see C11-31 and C11-32 in ch. 11.4.2.2). 
*/ if (cq->notify != IB_CQ_NEXT_COMP) cq->notify = notify; @@ -263,46 +315,87 @@ int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata) { struct ipath_cq *cq = to_icq(ibcq); - struct ib_wc *wc, *old_wc; - u32 n; + struct ipath_cq_wc *old_wc = cq->queue; + struct ipath_cq_wc *wc; + u32 head, tail, n; int ret; + /* Don't allow resize if completion queue is mmapped. */ + if (cq->ip && cq->ip->mmap_cnt) { + ret = -EBUSY; + goto bail; + } + /* * Need to use vmalloc() if we want to support large #s of entries. */ - wc = vmalloc(sizeof(*wc) * (cqe + 1)); + wc = vmalloc(sizeof(*wc) + sizeof(struct ib_wc) * cqe); if (!wc) { ret = -ENOMEM; goto bail; } + /* + * Return the address of the WC as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata) { + __u64 offset = (__u64) wc; + + ret = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (ret) + goto bail; + } + spin_lock_irq(&cq->lock); - if (cq->head < cq->tail) - n = cq->ibcq.cqe + 1 + cq->head - cq->tail; + /* + * Make sure head and tail are sane since they + * might be user writable. 
+ */ + head = old_wc->head; + if (head > (u32) cq->ibcq.cqe) + head = (u32) cq->ibcq.cqe; + tail = old_wc->tail; + if (tail > (u32) cq->ibcq.cqe) + tail = (u32) cq->ibcq.cqe; + if (head < tail) + n = cq->ibcq.cqe + 1 + head - tail; else - n = cq->head - cq->tail; + n = head - tail; if (unlikely((u32)cqe < n)) { spin_unlock_irq(&cq->lock); vfree(wc); ret = -EOVERFLOW; goto bail; } - for (n = 0; cq->tail != cq->head; n++) { - wc[n] = cq->queue[cq->tail]; - if (cq->tail == cq->ibcq.cqe) - cq->tail = 0; + for (n = 0; tail != head; n++) { + wc->queue[n] = old_wc->queue[tail]; + if (tail == (u32) cq->ibcq.cqe) + tail = 0; else - cq->tail++; + tail++; } cq->ibcq.cqe = cqe; - cq->head = n; - cq->tail = 0; - old_wc = cq->queue; + wc->head = n; + wc->tail = 0; cq->queue = wc; spin_unlock_irq(&cq->lock); vfree(old_wc); + if (cq->ip) { + struct ipath_ibdev *dev = to_idev(ibcq->device); + struct ipath_mmap_info *ip = cq->ip; + + ip->obj = wc; + ip->size = PAGE_ALIGN(sizeof(*wc) + + sizeof(struct ib_wc) * cqe); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } + ret = 0; bail: Index: src/linux-kernel/infiniband/hw/ipath/ipath_srq.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_srq.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_srq.c (working copy) @@ -47,66 +47,38 @@ struct ib_recv_wr **bad_wr) { struct ipath_srq *srq = to_isrq(ibsrq); - struct ipath_ibdev *dev = to_idev(ibsrq->device); + struct ipath_rwq *wq; unsigned long flags; int ret; for (; wr; wr = wr->next) { struct ipath_rwqe *wqe; u32 next; - int i, j; + int i; - if (wr->num_sge > srq->rq.max_sge) { + if ((unsigned) wr->num_sge > srq->rq.max_sge) { *bad_wr = wr; ret = -ENOMEM; goto bail; } spin_lock_irqsave(&srq->rq.lock, flags); - next = srq->rq.head + 1; + wq = srq->rq.wq; + next = wq->head + 1; if (next >= srq->rq.size) next = 0; - if (next 
== srq->rq.tail) { + if (next == wq->tail) { spin_unlock_irqrestore(&srq->rq.lock, flags); *bad_wr = wr; ret = -ENOMEM; goto bail; } - wqe = get_rwqe_ptr(&srq->rq, srq->rq.head); + wqe = get_rwqe_ptr(&srq->rq, wq->head); wqe->wr_id = wr->wr_id; - wqe->sg_list[0].mr = NULL; - wqe->sg_list[0].vaddr = NULL; - wqe->sg_list[0].length = 0; - wqe->sg_list[0].sge_length = 0; - wqe->length = 0; - for (i = 0, j = 0; i < wr->num_sge; i++) { - /* Check LKEY */ - if (to_ipd(srq->ibsrq.pd)->user && - wr->sg_list[i].lkey == 0) { - spin_unlock_irqrestore(&srq->rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - if (wr->sg_list[i].length == 0) - continue; - if (!ipath_lkey_ok(&dev->lk_table, - &wqe->sg_list[j], - &wr->sg_list[i], - IB_ACCESS_LOCAL_WRITE)) { - spin_unlock_irqrestore(&srq->rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - wqe->length += wr->sg_list[i].length; - j++; - } - wqe->num_sge = j; - srq->rq.head = next; + wqe->num_sge = wr->num_sge; + for (i = 0; i < wr->num_sge; i++) + wqe->sg_list[i] = wr->sg_list[i]; + wq->head = next; spin_unlock_irqrestore(&srq->rq.lock, flags); } ret = 0; @@ -156,28 +128,67 @@ * Need to use vmalloc() if we want to support large #s of entries. */ srq->rq.size = srq_init_attr->attr.max_wr + 1; - sz = sizeof(struct ipath_sge) * srq_init_attr->attr.max_sge + + srq->rq.max_sge = srq_init_attr->attr.max_sge; + sz = sizeof(struct ib_sge) * srq->rq.max_sge + sizeof(struct ipath_rwqe); - srq->rq.wq = vmalloc(srq->rq.size * sz); + srq->rq.wq = vmalloc(sizeof(struct ipath_rwq) + srq->rq.size * sz); if (!srq->rq.wq) { - kfree(srq); ret = ERR_PTR(-ENOMEM); - goto bail; + goto free_srq; } /* + * Return the address of the RWQ as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata) { + struct ipath_mmap_info *ip; + __u64 offset = (__u64) srq->rq.wq; + int err; + + err = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (err) { + ret = ERR_PTR(err); + goto free_rwq; + } + + /* Allocate info for ipath_mmap(). 
*/ + ip = kmalloc(sizeof(*ip), GFP_KERNEL); + if (!ip) { + ret = ERR_PTR(-ENOMEM); + goto free_rwq; + } + srq->ip = ip; + ip->context = ibpd->uobject->context; + ip->obj = srq->rq.wq; + kref_init(&ip->ref); + ip->mmap_cnt = 0; + ip->size = PAGE_ALIGN(sizeof(struct ipath_rwq) + + srq->rq.size * sz); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } else + srq->ip = NULL; + + /* * ib_create_srq() will initialize srq->ibsrq. */ spin_lock_init(&srq->rq.lock); - srq->rq.head = 0; - srq->rq.tail = 0; - srq->rq.max_sge = srq_init_attr->attr.max_sge; + srq->rq.wq->head = 0; + srq->rq.wq->tail = 0; srq->limit = srq_init_attr->attr.srq_limit; + dev->n_srqs_allocated++; + ret = &srq->ibsrq; + goto bail; - dev->n_srqs_allocated++; - +free_rwq: + vfree(srq->rq.wq); +free_srq: + kfree(srq); bail: return ret; } @@ -187,83 +198,143 @@ * @ibsrq: the SRQ to modify * @attr: the new attributes of the SRQ * @attr_mask: indicates which attributes to modify + * @udata: user data for ipathverbs.so */ int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask) + enum ib_srq_attr_mask attr_mask, + struct ib_udata *udata) { struct ipath_srq *srq = to_isrq(ibsrq); - unsigned long flags; - int ret; + int ret = 0; - if (attr_mask & IB_SRQ_MAX_WR) - if ((attr->max_wr > ib_ipath_max_srq_wrs) || - (attr->max_sge > srq->rq.max_sge)) { + if (attr_mask & IB_SRQ_MAX_WR) { + struct ipath_rwq *owq; + struct ipath_rwq *wq; + struct ipath_rwqe *p; + u32 sz, size, n, head, tail; + + /* Don't allow resize if mmapped */ + if (srq->ip && srq->ip->mmap_cnt) { ret = -EINVAL; goto bail; } - if (attr_mask & IB_SRQ_LIMIT) - if (attr->srq_limit >= srq->rq.size) { + /* + * Check that the requested sizes are below the limits + * and that user/kernel SRQs are only resized by the + * user/kernel. 
+	 */
+	if ((attr->max_wr > ib_ipath_max_srq_wrs) ||
+	    (!udata != !srq->ip) ||
+	    ((attr_mask & IB_SRQ_LIMIT) &&
+	     attr->srq_limit > attr->max_wr) ||
+	    (!(attr_mask & IB_SRQ_LIMIT) &&
+	     srq->limit > attr->max_wr)) {
 		ret = -EINVAL;
 		goto bail;
 	}
-	if (attr_mask & IB_SRQ_MAX_WR) {
-		struct ipath_rwqe *wq, *p;
-		u32 sz, size, n;
-
 		sz = sizeof(struct ipath_rwqe) +
-			attr->max_sge * sizeof(struct ipath_sge);
+			srq->rq.max_sge * sizeof(struct ib_sge);
 		size = attr->max_wr + 1;
-		wq = vmalloc(size * sz);
+		wq = vmalloc(sizeof(struct ipath_rwq) + size * sz);
 		if (!wq) {
 			ret = -ENOMEM;
 			goto bail;
 		}
-		spin_lock_irqsave(&srq->rq.lock, flags);
-		if (srq->rq.head < srq->rq.tail)
-			n = srq->rq.size + srq->rq.head - srq->rq.tail;
+		/*
+		 * Return the address of the RWQ as the offset to mmap.
+		 * See ipath_mmap() for details.
+		 */
+		if (udata) {
+			__u64 offset_addr;
+			__u64 offset = (__u64) wq;
+
+			ret = ib_copy_from_udata(&offset_addr, udata,
+						 sizeof(offset_addr));
+			if (ret) {
+				vfree(wq);
+				goto bail;
+			}
+			udata->outbuf = (void __user *) offset_addr;
+			ret = ib_copy_to_udata(udata, &offset,
+					       sizeof(offset));
+			if (ret) {
+				vfree(wq);
+				goto bail;
+			}
+		}
+
+		spin_lock_irq(&srq->rq.lock);
+		/*
+		 * validate head pointer value and compute
+		 * the number of remaining WQEs.
+		 */
+		owq = srq->rq.wq;
+		head = owq->head;
+		if (head >= srq->rq.size)
+			head = 0;
+		tail = owq->tail;
+		if (tail >= srq->rq.size)
+			tail = 0;
+		n = head;
+		if (n < tail)
+			n += srq->rq.size - tail;
 		else
-			n = srq->rq.head - srq->rq.tail;
-		if (size <= n || size <= srq->limit) {
-			spin_unlock_irqrestore(&srq->rq.lock, flags);
+			n -= tail;
+		if (size <= n) {
+			spin_unlock_irq(&srq->rq.lock);
 			vfree(wq);
 			ret = -EINVAL;
 			goto bail;
 		}
 		n = 0;
-		p = wq;
-		while (srq->rq.tail != srq->rq.head) {
+		p = wq->wq;
+		while (tail != head) {
 			struct ipath_rwqe *wqe;
 			int i;

-			wqe = get_rwqe_ptr(&srq->rq, srq->rq.tail);
+			wqe = get_rwqe_ptr(&srq->rq, tail);
 			p->wr_id = wqe->wr_id;
-			p->length = wqe->length;
 			p->num_sge = wqe->num_sge;
 			for (i = 0; i < wqe->num_sge; i++)
 				p->sg_list[i] = wqe->sg_list[i];
 			n++;
 			p = (struct ipath_rwqe *)((char *) p + sz);
-			if (++srq->rq.tail >= srq->rq.size)
-				srq->rq.tail = 0;
+			if (++tail >= srq->rq.size)
+				tail = 0;
 		}
-		vfree(srq->rq.wq);
 		srq->rq.wq = wq;
 		srq->rq.size = size;
-		srq->rq.head = n;
-		srq->rq.tail = 0;
-		srq->rq.max_sge = attr->max_sge;
-		spin_unlock_irqrestore(&srq->rq.lock, flags);
-	}
+		wq->head = n;
+		wq->tail = 0;
+		if (attr_mask & IB_SRQ_LIMIT)
+			srq->limit = attr->srq_limit;
+		spin_unlock_irq(&srq->rq.lock);

-	if (attr_mask & IB_SRQ_LIMIT) {
-		spin_lock_irqsave(&srq->rq.lock, flags);
-		srq->limit = attr->srq_limit;
-		spin_unlock_irqrestore(&srq->rq.lock, flags);
+		vfree(owq);
+
+		if (srq->ip) {
+			struct ipath_mmap_info *ip = srq->ip;
+			struct ipath_ibdev *dev = to_idev(srq->ibsrq.device);
+
+			ip->obj = wq;
+			ip->size = PAGE_ALIGN(sizeof(struct ipath_rwq) +
+					      size * sz);
+			spin_lock_irq(&dev->pending_lock);
+			ip->next = dev->pending_mmaps;
+			dev->pending_mmaps = ip;
+			spin_unlock_irq(&dev->pending_lock);
+		}
+	} else if (attr_mask & IB_SRQ_LIMIT) {
+		spin_lock_irq(&srq->rq.lock);
+		if (attr->srq_limit >= srq->rq.size)
+			ret = -EINVAL;
+		else
+			srq->limit = attr->srq_limit;
+		spin_unlock_irq(&srq->rq.lock);
 	}
-	ret = 0;
bail:
	return ret;
}

@@ -289,7 +360,10 @@
 	struct ipath_ibdev *dev = to_idev(ibsrq->device);

 	dev->n_srqs_allocated--;
-	vfree(srq->rq.wq);
+	if (srq->ip)
+		kref_put(&srq->ip->ref, ipath_release_mmap_info);
+	else
+		vfree(srq->rq.wq);
 	kfree(srq);

 	return 0;

Index: src/linux-kernel/infiniband/hw/ipath/ipath_ud.c
===================================================================
--- src/linux-kernel/infiniband/hw/ipath/ipath_ud.c	(revision 8021)
+++ src/linux-kernel/infiniband/hw/ipath/ipath_ud.c	(working copy)
@@ -35,6 +35,53 @@
 #include "ipath_verbs.h"
 #include "ips_common.h"

+static int init_sge(struct ipath_qp *qp, struct ipath_rwqe *wqe,
+		    u32 *lengthp, struct ipath_sge_state *ss)
+{
+	struct ipath_ibdev *dev = to_idev(qp->ibqp.device);
+	int user = to_ipd(qp->ibqp.pd)->user;
+	int i, j, ret;
+	struct ib_wc wc;
+
+	*lengthp = 0;
+	for (i = j = 0; i < wqe->num_sge; i++) {
+		if (wqe->sg_list[i].length == 0)
+			continue;
+		/* Check LKEY */
+		if ((user && wqe->sg_list[i].lkey == 0) ||
+		    !ipath_lkey_ok(&dev->lk_table,
+				   j ? &ss->sg_list[j - 1] : &ss->sge,
+				   &wqe->sg_list[i], IB_ACCESS_LOCAL_WRITE))
+			goto bad_lkey;
+		*lengthp += wqe->sg_list[i].length;
+		j++;
+	}
+	ss->num_sge = j;
+	ret = 1;
+	goto bail;
+
+bad_lkey:
+	wc.wr_id = wqe->wr_id;
+	wc.status = IB_WC_LOC_PROT_ERR;
+	wc.opcode = IB_WC_RECV;
+	wc.vendor_err = 0;
+	wc.byte_len = 0;
+	wc.imm_data = 0;
+	wc.qp_num = qp->ibqp.qp_num;
+	wc.src_qp = 0;
+	wc.wc_flags = 0;
+	wc.pkey_index = 0;
+	wc.slid = 0;
+	wc.sl = 0;
+	wc.dlid_path_bits = 0;
+	wc.port_num = 0;
+	/* Signal solicited completion event. */
+	ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1);
+	ret = 0;
+bail:
+	return ret;
+}
+
 /**
  * ipath_ud_loopback - handle send on loopback QPs
  * @sqp: the QP
@@ -45,6 +92,8 @@
 *
 * This is called from ipath_post_ud_send() to forward a WQE addressed
 * to the same HCA.
+ * Note that the receive interrupt handler may be calling ipath_ud_rcv()
+ * while this is being called.
 */
static void ipath_ud_loopback(struct ipath_qp *sqp,
			      struct ipath_sge_state *ss,
@@ -59,7 +108,11 @@
 	struct ipath_srq *srq;
 	struct ipath_sge_state rsge;
 	struct ipath_sge *sge;
+	struct ipath_rwq *wq;
 	struct ipath_rwqe *wqe;
+	void (*handler)(struct ib_event *, void *);
+	u32 tail;
+	u32 rlen;

 	qp = ipath_lookup_qpn(&dev->qp_table, wr->wr.ud.remote_qpn);
 	if (!qp)
@@ -93,6 +146,13 @@
 		wc->imm_data = 0;
 	}

+	if (wr->num_sge > 1) {
+		rsge.sg_list = kmalloc((wr->num_sge - 1) *
+				       sizeof(struct ipath_sge),
+				       GFP_ATOMIC);
+	} else
+		rsge.sg_list = NULL;
+
 	/*
 	 * Get the next work request entry to find where to put the data.
 	 * Note that it is safe to drop the lock after changing rq->tail
 	 */
 	if (qp->ibqp.srq) {
 		srq = to_isrq(qp->ibqp.srq);
+		handler = srq->ibsrq.event_handler;
 		rq = &srq->rq;
 	} else {
 		srq = NULL;
+		handler = NULL;
 		rq = &qp->r_rq;
 	}
+
 	spin_lock_irqsave(&rq->lock, flags);
-	if (rq->tail == rq->head) {
-		spin_unlock_irqrestore(&rq->lock, flags);
-		dev->n_pkt_drops++;
-		goto done;
+	wq = rq->wq;
+	tail = wq->tail;
+	while (1) {
+		if (unlikely(tail == wq->head)) {
+			spin_unlock_irqrestore(&rq->lock, flags);
+			dev->n_pkt_drops++;
+			goto free_sge;
+		}
+		wqe = get_rwqe_ptr(rq, tail);
+		if (++tail >= rq->size)
+			tail = 0;
+		if (init_sge(qp, wqe, &rlen, &rsge))
+			break;
+		wq->tail = tail;
 	}
 	/* Silently drop packets which are too big. */
-	wqe = get_rwqe_ptr(rq, rq->tail);
-	if (wc->byte_len > wqe->length) {
+	if (wc->byte_len > rlen) {
 		spin_unlock_irqrestore(&rq->lock, flags);
 		dev->n_pkt_drops++;
-		goto done;
+		goto free_sge;
 	}
+	wq->tail = tail;
 	wc->wr_id = wqe->wr_id;
-	rsge.sge = wqe->sg_list[0];
-	rsge.sg_list = wqe->sg_list + 1;
-	rsge.num_sge = wqe->num_sge;
-	if (++rq->tail >= rq->size)
-		rq->tail = 0;
-	if (srq && srq->ibsrq.event_handler) {
+	if (handler) {
 		u32 n;

-		if (rq->head < rq->tail)
-			n = rq->size + rq->head - rq->tail;
+		/*
+		 * validate head pointer value and compute
+		 * the number of remaining WQEs.
+		 */
+		n = wq->head;
+		if (n >= rq->size)
+			n = 0;
+		if (n < tail)
+			n += rq->size - tail;
 		else
-			n = rq->head - rq->tail;
+			n -= tail;
 		if (n < srq->limit) {
 			struct ib_event ev;

@@ -139,12 +214,12 @@
 			ev.device = qp->ibqp.device;
 			ev.element.srq = qp->ibqp.srq;
 			ev.event = IB_EVENT_SRQ_LIMIT_REACHED;
-			srq->ibsrq.event_handler(&ev,
-						 srq->ibsrq.srq_context);
+			handler(&ev, srq->ibsrq.srq_context);
 		} else
 			spin_unlock_irqrestore(&rq->lock, flags);
 	} else
 		spin_unlock_irqrestore(&rq->lock, flags);
+
 	ah_attr = &to_iah(wr->wr.ud.ah)->attr;
 	if (ah_attr->ah_flags & IB_AH_GRH) {
 		ipath_copy_sge(&rsge, &ah_attr->grh, sizeof(struct ib_grh));
@@ -195,6 +270,8 @@
 	ipath_cq_enter(to_icq(qp->ibqp.recv_cq), wc,
		       wr->send_flags & IB_SEND_SOLICITED);

+free_sge:
+	kfree(rsge.sg_list);
done:
 	if (atomic_dec_and_test(&qp->refcount))
 		wake_up(&qp->wait);
@@ -432,13 +509,9 @@
 	int opcode;
 	u32 hdrsize;
 	u32 pad;
-	unsigned long flags;
 	struct ib_wc wc;
 	u32 qkey;
 	u32 src_qp;
-	struct ipath_rq *rq;
-	struct ipath_srq *srq;
-	struct ipath_rwqe *wqe;
 	u16 dlid;
 	int header_in_data;
@@ -546,19 +619,10 @@
 	/*
 	 * Get the next work request entry to find where to put the data.
-	 * Note that it is safe to drop the lock after changing rq->tail
-	 * since ipath_post_receive() won't fill the empty slot.
 	 */
-	if (qp->ibqp.srq) {
-		srq = to_isrq(qp->ibqp.srq);
-		rq = &srq->rq;
-	} else {
-		srq = NULL;
-		rq = &qp->r_rq;
-	}
-	spin_lock_irqsave(&rq->lock, flags);
-	if (rq->tail == rq->head) {
-		spin_unlock_irqrestore(&rq->lock, flags);
+	if (qp->r_reuse_sge)
+		qp->r_reuse_sge = 0;
+	else if (!ipath_get_rwqe(qp, 0)) {
 		/*
 		 * Count VL15 packets dropped due to no receive buffer.
 		 * Otherwise, count them as buffer overruns since usually,
@@ -572,39 +636,11 @@
 		goto bail;
 	}
 	/* Silently drop packets which are too big.
 	 */
-	wqe = get_rwqe_ptr(rq, rq->tail);
-	if (wc.byte_len > wqe->length) {
-		spin_unlock_irqrestore(&rq->lock, flags);
+	if (wc.byte_len > qp->r_len) {
+		qp->r_reuse_sge = 1;
 		dev->n_pkt_drops++;
 		goto bail;
 	}
-	wc.wr_id = wqe->wr_id;
-	qp->r_sge.sge = wqe->sg_list[0];
-	qp->r_sge.sg_list = wqe->sg_list + 1;
-	qp->r_sge.num_sge = wqe->num_sge;
-	if (++rq->tail >= rq->size)
-		rq->tail = 0;
-	if (srq && srq->ibsrq.event_handler) {
-		u32 n;
-
-		if (rq->head < rq->tail)
-			n = rq->size + rq->head - rq->tail;
-		else
-			n = rq->head - rq->tail;
-		if (n < srq->limit) {
-			struct ib_event ev;
-
-			srq->limit = 0;
-			spin_unlock_irqrestore(&rq->lock, flags);
-			ev.device = qp->ibqp.device;
-			ev.element.srq = qp->ibqp.srq;
-			ev.event = IB_EVENT_SRQ_LIMIT_REACHED;
-			srq->ibsrq.event_handler(&ev,
-						 srq->ibsrq.srq_context);
-		} else
-			spin_unlock_irqrestore(&rq->lock, flags);
-	} else
-		spin_unlock_irqrestore(&rq->lock, flags);
 	if (has_grh) {
 		ipath_copy_sge(&qp->r_sge, &hdr->u.l.grh,
			       sizeof(struct ib_grh));
@@ -613,6 +649,7 @@
 		ipath_skip_sge(&qp->r_sge, sizeof(struct ib_grh));
 	ipath_copy_sge(&qp->r_sge, data, wc.byte_len - sizeof(struct ib_grh));
+	wc.wr_id = qp->r_wr_id;
 	wc.status = IB_WC_SUCCESS;
 	wc.opcode = IB_WC_RECV;
 	wc.vendor_err = 0;

-- 
Ralph Campbell

From ardavis at ichips.intel.com  Thu Jun 15 16:04:40 2006
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Thu, 15 Jun 2006 16:04:40 -0700
Subject: [openib-general] Processes not exiting on SVN7946
In-Reply-To:
References: <1AC79F16F5C5284499BB9591B33D6F0007F74705@orsmsx408>
	<4491D195.8030106@ichips.intel.com> <4491D7E1.5050504@ichips.intel.com>
Message-ID: <4491E788.3000609@ichips.intel.com>

Roland Dreier wrote:

>OK, just a dumb oversight on my part.  The change below (already
The change below (already >checked in) fixes it for me: > >--- infiniband/core/uverbs_cmd.c (revision 8055) >+++ infiniband/core/uverbs_cmd.c (working copy) >@@ -1123,6 +1123,12 @@ ssize_t ib_uverbs_create_qp(struct ib_uv > goto err_copy; > } > >+ put_pd_read(pd); >+ put_cq_read(scq); >+ put_cq_read(rcq); >+ if (srq) >+ put_srq_read(srq); >+ > mutex_lock(&file->mutex); > list_add_tail(&obj->uevent.uobject.list, &file->ucontext->qp_list); > mutex_unlock(&file->mutex); > > > Works for me too. Thanks! -arlin From tziporet at mellanox.co.il Fri Jun 16 01:54:37 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Fri, 16 Jun 2006 11:54:37 +0300 Subject: [openib-general] OFED 1.0 - Official Release Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA724C@mtlexch01.mtl.com> I am happy to announce that OFED 1.0 Official Release is now available. The release can be found under: https://openib.org/svn/gen2/branches/1.0/ofed/releases/ And later today it will be on the OpenFabrics download page: http://www.openfabrics.org/downloads.html. This is the first release that was done in a joint effort of the following companies: * Cisco * SilverStorm * Voltaire * QLogic * Intel * Mellanox Technologies I wish to thank all who contributed to the success of this release. Tziporet ======================================================================== ======= Release summary: The OFED software package is composed of several software modules intended for use on a computer cluster constructed as an InfiniBand network. 
The OFED package contains the following components:

o OpenFabrics core and ULPs:
	- HCA drivers (mthca, ipath)
	- core
	- Upper Layer Protocols: IPoIB, SDP, SRP Initiator, iSER Host,
	  RDS and uDAPL
o OpenFabrics utilities:
	- OpenSM: InfiniBand Subnet Manager
	- Diagnostic tools
	- Performance tests
o MPI:
	- OSU MPI stack supporting the InfiniBand interface
	- Open MPI stack supporting the InfiniBand interface
	- MPI benchmark tests (OSU BW/LAT, Pallas, Presta)
o Sources of all software modules (under conditions mentioned in the
  modules' LICENSE files)
o Documentation

Notes:
1. SDP and RDS are in technology preview state.
2. The SRP Initiator and Open MPI are in beta state.
3. All other OFED components are in production state.

Supported Platforms and Operating Systems

CPU architectures:
* x86_64
* x86
* ia64
* ppc64

Linux Operating Systems:
* RedHat EL4 up2:	2.6.9-22.ELsmp
* RedHat EL4 up3:	2.6.9-34.ELsmp
* Fedora C4:		2.6.11-1.1369_FC4
* SLES10 RC2:		2.6.16.16-1.6-smp (or RC 2.5 2.6.16.14-6-smp)
* SLES10 RC1:		2.6.16.14-6-smp
* SUSE 10 Pro:		2.6.13-15-smp
* kernel.org:		2.6.16.x

HCAs Supported

Mellanox HCAs:
- InfiniHost
- InfiniHost III Ex (both modes: with memory and MemFree)
- InfiniHost III Lx

Both SDR and DDR mode of the InfiniHost III family are supported.
For official FW versions please see:
http://www.mellanox.com/support/firmware_table.php

Qlogic HCAs:
- QHT6040 (PathScale InfiniPath HT-460)
- QHT6140 (PathScale InfiniPath HT-465)
- QLE6140 (PathScale InfiniPath PE-880)

Switches Supported

This release was tested with switches and gateways provided by the
following companies:
- Cisco
- Voltaire
- SilverStorm
- Flextronics

Attached are the release notes.

Tziporet Koren
Software Director
Mellanox Technologies
mailto: tziporet at mellanox.co.il
Tel +972-4-9097200, ext 380

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: OFED_release_notes.txt
URL: 

From zhushisongzhu at yahoo.com  Fri Jun 16 03:05:47 2006
From: zhushisongzhu at yahoo.com (zhu shi song)
Date: Fri, 16 Jun 2006 03:05:47 -0700 (PDT)
Subject: [openib-general] OFED 1.0 - Official Release (Tziporet Koren)
In-Reply-To:
Message-ID: <20060616100547.13864.qmail@web36915.mail.mud.yahoo.com>

> Notes:
>
> 1. SDP and RDS are in technology preview state.
>
> 2. The SRP Initiator and Open MPI are in beta state.
>
> 3. All other OFED components are in production state.

I'm sorry to see that SDP is not in production state. SDP is very
important for our application, and we are waiting for it to become
mature enough to use in our product. Do you have a schedule for when
SDP will work reliably (in particular, supporting many large numbers
of concurrent connections, just like TCP)? I would very much
appreciate being able to test the new SDP before the end of June.
tks
zhu

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

From or.gerlitz at gmail.com  Fri Jun 16 03:51:39 2006
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Fri, 16 Jun 2006 12:51:39 +0200
Subject: [openib-general] design for communication established affiliated
	asynchronous event handling
In-Reply-To:
References: <44903D5D.10102@ichips.intel.com> <449119AE.2010703@voltaire.com>
Message-ID: <15ddcffd0606160351p276a227v18ca42301256455b@mail.gmail.com>

On 6/15/06, James Lentini wrote:
> ib_cm_establish() doesn't emulate an RTU reception. It generates an
> IB_CM_USER_ESTABLISHED event (not an IB_CM_RTU_RECEIVED event). The
> CMA's cma_ib_handler() doesn't recognize a IB_CM_USER_ESTABLISHED
> event. The QP's state will not be moved to RTS.

This is what I was suspecting; Sean, can you confirm that? If it does
not emulate RTU reception, then what does it do?

> Consumers don't actually have to queue the completions, they have to
> defer posting sends (either in response to the recvs or otherwise)
> until the QP moves to RTS.
> Could the implementations queue up the
> requests for the consumers?

No. The CM/CMA are not in charge of the consumer CQ, so there is no way
for them to queue those completions. In any case, I think it is wrong
for a lower layer to queue completions. This "race" exists by IB's
nature (the RTU goes to QP1 while the data goes to the user's QP, and
the two QPs are totally unrelated), so if you want production quality
with IB you need to handle this case in your code, as others do.

> Strictly speaking, IB requires an error to be generated (C10-29 in the
> IBTA spec. vol 1, page 456). Still, it would be nice if consumers
> didn't have to be worry about this issue.

What do you mean by an error? This async event happens all the time;
you can't fail the establishment just because it happened. I don't have
access to the spec right now, so I can't say what I understand from the
section you have pointed to.

Or.

From or.gerlitz at gmail.com  Fri Jun 16 03:54:37 2006
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Fri, 16 Jun 2006 12:54:37 +0200
Subject: [openib-general] design for communication established affiliated
	asynchronous event handling
In-Reply-To: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com>
References: <449119AE.2010703@voltaire.com>
	<000001c690c7$bd5231d0$62268686@amr.corp.intel.com>
Message-ID: <15ddcffd0606160354q516ffdccj14c721bcb60d254a@mail.gmail.com>

On 6/16/06, Sean Hefty wrote:
>>The cma/verbs consumer can't just ignore the event since its qp state is
>>still RTR which means an attempt to tx replying the rx would fail.

> In most cases, I would expect that the IB CM will eventually receive the RTU,
> which will generate an event to the RDMA CM to transition the QP into RTS.

But we want an IB stack and a set of ULPs that work in production, so
they also need to handle irregular cases... e.g. when the RTU is lost
over and over.
Or

From swise at opengridcomputing.com  Fri Jun 16 06:42:35 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 16 Jun 2006 08:42:35 -0500
Subject: [openib-general] ucma into kernel.org
Message-ID: <1150465355.29508.4.camel@stevo-desktop>

Hey Roland/Sean,

Will the ucma make it into 2.6.18? I notice it's not in Roland's
for-2.6.18 tree right now.

Thanks,

Steve.

From halr at voltaire.com  Fri Jun 16 07:12:10 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 16 Jun 2006 10:12:10 -0400
Subject: [openib-general] [PATCH] osmtest/osmtest.c: Start enabling link
	records
Message-ID: <1150467129.4506.97759.camel@hal.voltaire.com>

osmtest/osmtest.c: Start enabling link records

Enable obtaining SA LinkRecords, writing them to the database, and
parsing them, while ignoring their contents for the time being.

Signed-off-by: Hal Rosenstock

Index: osmtest/osmtest.c
===================================================================
--- osmtest/osmtest.c	(revision 8064)
+++ osmtest/osmtest.c	(working copy)
@@ -138,6 +138,10 @@ typedef enum _osmtest_token_val
   OSMTEST_TOKEN_RESP_TIME_VAL,
   OSMTEST_TOKEN_ERR_THRESHOLD,
   OSMTEST_TOKEN_MTU,
+  OSMTEST_TOKEN_FROMLID,
+  OSMTEST_TOKEN_FROMPORTNUM,
+  OSMTEST_TOKEN_TOPORTNUM,
+  OSMTEST_TOKEN_TOLID,
   OSMTEST_TOKEN_UNKNOWN
} osmtest_token_val_t;

@@ -213,6 +217,10 @@ const osmtest_token_t token_array[] = {
   {OSMTEST_TOKEN_RESP_TIME_VAL, 15, "resp_time_value"},
   {OSMTEST_TOKEN_ERR_THRESHOLD, 15, "error_threshold"},
   {OSMTEST_TOKEN_MTU, 3, "MTU"},	/* must be after the other mtu... tokens. */
+  {OSMTEST_TOKEN_FROMLID, 8, "from_lid"},
+  {OSMTEST_TOKEN_FROMPORTNUM, 13, "from_port_num"},
+  {OSMTEST_TOKEN_TOPORTNUM, 11, "to_port_num"},
+  {OSMTEST_TOKEN_TOLID, 6, "to_lid"},
   {OSMTEST_TOKEN_UNKNOWN, 0, ""}	/* must be last entry */
};

@@ -1962,9 +1970,6 @@ osmtest_write_node_info( IN osmtest_t *
   return ( status );
}

-#if 0
-/* HACK: we do not support link records for now. */
-
 /**********************************************************************
 **********************************************************************/
static ib_api_status_t
@@ -2076,7 +2081,6 @@ osmtest_write_all_link_recs( IN osmtest_
   OSM_LOG_EXIT( &p_osmt->log );
   return ( status );
}
-#endif

 /**********************************************************************
 **********************************************************************/

@@ -2727,11 +2731,9 @@ osmtest_create_inventory_file( IN osmtes
     goto Exit;
   }

-#if 0
   status = osmtest_write_all_link_recs( p_osmt, fh );
   if( status != IB_SUCCESS )
     goto Exit;
-#endif

   fclose( fh );

@@ -6114,6 +6116,94 @@ osmtest_parse_path( IN osmtest_t * const
 /**********************************************************************
 **********************************************************************/
static ib_api_status_t
+osmtest_parse_link( IN osmtest_t * const p_osmt,
+                    IN FILE * const fh,
+                    IN OUT uint32_t * const p_line_num )
+{
+  ib_api_status_t status = IB_SUCCESS;
+  uint32_t offset;
+  char line[OSMTEST_MAX_LINE_LEN];
+  boolean_t done = FALSE;
+  const osmtest_token_t *p_tok;
+
+  OSM_LOG_ENTER( &p_osmt->log, osmtest_parse_link);
+
+  /*
+   * Parse the inventory file and create the database.
+   */
+  while( !done )
+  {
+    if( fgets( line, OSMTEST_MAX_LINE_LEN, fh ) == NULL )
+    {
+      /*
+       * End of file in the middle of a definition.
+       */
+      osm_log( &p_osmt->log, OSM_LOG_ERROR,
+               "osmtest_parse_link: ERR 012A: "
+               "Unexpected end of file\n" );
+      status = IB_ERROR;
+      goto Exit;
+    }
+
+    ++*p_line_num;
+
+    /*
+     * Skip whitespace
+     */
+    offset = 0;
+    if( !str_skip_white( line, &offset ) )
+      continue;                 /* whole line was whitespace */
+
+    p_tok = str_get_token( &line[offset] );
+    if( p_tok == NULL )
+    {
+      osm_log( &p_osmt->log, OSM_LOG_ERROR,
+               "osmtest_parse_link: ERR 012B: "
+               "Ignoring line %u with unknown token: %s\n",
+               *p_line_num, &line[offset] );
+      continue;
+    }
+
+    if( osm_log_is_active( &p_osmt->log, OSM_LOG_DEBUG ) )
+    {
+      osm_log( &p_osmt->log, OSM_LOG_DEBUG,
+               "osmtest_parse_link: "
+               "Found '%s' (line %u)\n", p_tok->str, *p_line_num );
+    }
+
+    str_skip_token( line, &offset );
+
+    switch ( p_tok->val )
+    {
+    case OSMTEST_TOKEN_FROMLID:
+    case OSMTEST_TOKEN_FROMPORTNUM:
+    case OSMTEST_TOKEN_TOPORTNUM:
+    case OSMTEST_TOKEN_TOLID:
+      /* For now */
+      break;
+
+    case OSMTEST_TOKEN_END:
+      done = TRUE;
+      break;
+
+    default:
+      osm_log( &p_osmt->log, OSM_LOG_ERROR,
+               "osmtest_parse_link: ERR 012C: "
+               "Ignoring line %u with unknown token: %s\n",
+               *p_line_num, &line[offset] );
+      break;
+    }
+  }
+
+ Exit:
+  OSM_LOG_EXIT( &p_osmt->log );
+  return ( status );
+}
+
+/**********************************************************************
+ **********************************************************************/
+static ib_api_status_t
 osmtest_create_db( IN osmtest_t * const p_osmt )
{
   FILE *fh;
@@ -6182,6 +6272,10 @@ osmtest_create_db( IN osmtest_t * const
       status = osmtest_parse_path( p_osmt, fh, &line_num );
       break;

+    case OSMTEST_TOKEN_DEFINE_LINK:
+      status = osmtest_parse_link( p_osmt, fh, &line_num );
+      break;
+
     default:
       osm_log( &p_osmt->log, OSM_LOG_ERROR,
                "osmtest_create_db: ERR 0132: "

From swise at opengridcomputing.com  Fri Jun 16 07:20:31 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 16 Jun 2006 09:20:31 -0500
Subject: [openib-general] [PATCH v2 4/7] AMSO1100 Memory
Management. In-Reply-To: <1150128349.22704.20.camel@trinity.ogc.int> References: <20060607200646.9259.24588.stgit@stevo-desktop> <20060607200655.9259.90768.stgit@stevo-desktop> <20060608011744.1a66e85a.akpm@osdl.org> <1150128349.22704.20.camel@trinity.ogc.int> Message-ID: <1150467631.29508.11.camel@stevo-desktop> On Mon, 2006-06-12 at 11:05 -0500, Tom Tucker wrote: > On Thu, 2006-06-08 at 01:17 -0700, Andrew Morton wrote: > > On Wed, 07 Jun 2006 15:06:55 -0500 > > Steve Wise wrote: > > > > > > > > +void c2_free(struct c2_alloc *alloc, u32 obj) > > > +{ > > > + spin_lock(&alloc->lock); > > > + clear_bit(obj, alloc->table); > > > + spin_unlock(&alloc->lock); > > > +} > > > > The spinlock is unneeded here. > > Good point. > > > > > > > What does all the code in this file do, anyway? It looks totally generic > > (and hence inappropriate for drivers/infiniband/hw/amso1100/) and somewhat > > similar to idr trees, perhaps. > > > > We mimicked the mthca driver. It may be code that should be replaced > with Linux core services for new drivers. We'll investigate. > The code in this file implements 2 sets of services: 1) allocating unique qp identifiers (type integer). This is the c2_alloc struct and functions. 2) maintaining a sparsely allocated array of ptrs indexed by the qp identifier. This allows for quick mapping to the qp structure ptr given the qp identifier. This is the c2_array struct and functions. I believe I can use an IDR tree to provide both of these services. Steve. 
From jlentini at netapp.com Fri Jun 16 08:15:46 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 16 Jun 2006 11:15:46 -0400 (EDT) Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> References: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> Message-ID: On Thu, 15 Jun 2006, Sean Hefty wrote: > >The cma/verbs consumer can't just ignore the event since its qp state is > >still RTR which means an attempt to tx replying the rx would fail. > > In most cases, I would expect that the IB CM will eventually receive the RTU, > which will generate an event to the RDMA CM to transition the QP into RTS. This > is why I think that the event can safely be ignored. It does however mean that > a user cannot send on the QP until the user sees RDMA_CM_EVENT_ESTABLISHED. > > >I suggest the following design: the CMA would replace the event handler > >provided with the qp_init_attr struct with a callback of its own and > >keep the original handler/context on a private structure. > > This sounds like it would work. I don't think that there are any events where > the additional delay would matter. > > As an alternative, I don't think that there's any reason why the QP > can't be transition to RTS when the CM REP is sent. I like this idea. It simplifies how ULPs handle this issue. Are there any spec. compliance issues with this? > A user just can't post to the send queue until either an > RDMA_CM_EVENT_ESTABLISHED, IB_EVENT_COMM_EST, or a completion occurs > on the QP. (This doesn't change the fact that the IB CM still needs > to know that the connection has been established, or it risks > putting the connection into an error state if an RTU is never > received.) If the passive side CM doesn't receive an RTU, the passive side CM should retransmit the REP. At least that is how I read 12.9.8.6 "Timeouts and Retries" in the IBTA spec. 
I can't find where this happens in the code. Did I miss it? From swise at opengridcomputing.com Fri Jun 16 08:23:31 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 Jun 2006 10:23:31 -0500 Subject: [openib-general] [PATCH] librdmacm/examples/rping.c In-Reply-To: References: Message-ID: <1150471411.29508.17.camel@stevo-desktop> On Fri, 2006-06-16 at 11:20 -0400, amith rajith mamidala wrote: > Hi Steve, > > The rping also doesn't exit after printing these error messages. Is this > expected? > It should exit! :-( Maybe rping is not acking all the CM or Async events? Or we've got a bug in our refcnts on the iw_cm_ids in the kernel. Can you get a gdb stack trace when its stalled? And if you kdb, a kernel mode stack trace of the same thread would be nice too... What systems/distros/etc are you running this on? Thanks, Stevo. > Thanks, > Amith > > On Thu, 15 Jun 2006, Steve Wise wrote: > > > This is the normal output for rping... > > > > The status error on the completion is 5 (FLUSHED), which is normal. > > > > Steve. > > > > > > On Thu, 2006-06-15 at 17:24 -0400, amith rajith mamidala wrote: > > > Hi, > > > > > > With the latest rping code (Revision: 8055) I am still able to see this > > > race condition. > > > > > > server side: > > > > > > [@k62-oib examples]$ ./rping -s -vV -C10 -S26 -a 0.0.0.0 -p 9997 > > > server ping data: rdma-ping-0: ABCDEFGHIJKL > > > server ping data: rdma-ping-1: BCDEFGHIJKLM > > > server ping data: rdma-ping-2: CDEFGHIJKLMN > > > server ping data: rdma-ping-3: DEFGHIJKLMNO > > > server ping data: rdma-ping-4: EFGHIJKLMNOP > > > server ping data: rdma-ping-5: FGHIJKLMNOPQ > > > server ping data: rdma-ping-6: GHIJKLMNOPQR > > > server ping data: rdma-ping-7: HIJKLMNOPQRS > > > server ping data: rdma-ping-8: IJKLMNOPQRST > > > server ping data: rdma-ping-9: JKLMNOPQRSTU > > > server DISCONNECT EVENT... 
> > > wait for RDMA_READ_ADV state 9 > > > cq completion failed status 5 > > > > > > Client side: > > > > > > [@k63-oib examples]$ ./rping -c -vV -C10 -S26 -a 192.168.111.66 -p 9997 > > > ping data: rdma-ping-0: ABCDEFGHIJKL > > > ping data: rdma-ping-1: BCDEFGHIJKLM > > > ping data: rdma-ping-2: CDEFGHIJKLMN > > > ping data: rdma-ping-3: DEFGHIJKLMNO > > > ping data: rdma-ping-4: EFGHIJKLMNOP > > > ping data: rdma-ping-5: FGHIJKLMNOPQ > > > ping data: rdma-ping-6: GHIJKLMNOPQR > > > ping data: rdma-ping-7: HIJKLMNOPQRS > > > ping data: rdma-ping-8: IJKLMNOPQRST > > > ping data: rdma-ping-9: JKLMNOPQRSTU > > > cq completion failed status 5 > > > client DISCONNECT EVENT... > > > > > > > > > Thanks, > > > Amith > > > > > > > > > On Tue, 13 Jun 2006, Steve Wise wrote: > > > > > > > Thanks, applied. > > > > > > > > iwarp branch: r7964 > > > > trunk: r7966 > > > > > > > > > > > > On Tue, 2006-06-13 at 11:24 -0500, Boyd R. Faulkner wrote: > > > > > This patch resolves a race condition between the receipt of > > > > > a connection established event and a receive completion from > > > > > the client. The server no longer goes to connected state but > > > > > merely waits for the READ_ADV state to begin its looping. This > > > > > keeps the server from going back to CONNECTED from the later > > > > > states if the connection established event comes in after the > > > > > receive completion (i.e. the loop starts). 
> > > > > > > > > > Signed-off-by: Boyd Faulkner > > > > > > > > > > > > _______________________________________________ > > > > openib-general mailing list > > > > openib-general at openib.org > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > From jlentini at netapp.com Fri Jun 16 08:25:06 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 16 Jun 2006 11:25:06 -0400 (EDT) Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <15ddcffd0606160351p276a227v18ca42301256455b@mail.gmail.com> References: <44903D5D.10102@ichips.intel.com> <449119AE.2010703@voltaire.com> <15ddcffd0606160351p276a227v18ca42301256455b@mail.gmail.com> Message-ID: On Fri, 16 Jun 2006, Or Gerlitz wrote: > On 6/15/06, James Lentini wrote: > > ib_cm_establish() doesn't emulate an RTU reception. It generates an > > IB_CM_USER_ESTABLISHED event (not an IB_CM_RTU_RECEIVED event). The > > CMA's cma_ib_handler() doesn't recognize a IB_CM_USER_ESTABLISHED > > event. The QP's state will not be moved to RTS. > > This is what i was suspecting, Sean can you confirm that? if it does > not emulate RTU > reception, than what it does do? > > > Consumers don't actually have to queue the completions, they have to > > defer posting sends (either in response to the recvs or otherwise) > > until the QP moves to RTS. Could the implementations queue up the > > requests for the consumers? > > nope the CM/CMA are not in charge of the consumer CQ, so there is no > way for them to queue those completions and anyway, i think its I was refering to requests, not completions. In any event, I like Sean's idea of moving the QP to RTS when a REP is sent better. 
> wrong for lower layer to queue completions, this "race" exists by > IB's nature (since the RTU goes to QP1 and the data to the user's QP > and the two QPs are totally unrelated) so if you want to have > production with IB you need to handle this case in your code, as > others do. Agreed. > > Strictly speaking, IB requires an error to be generated (C10-29 in > > the IBTA spec. vol 1, page 456). Still, it would be nice if > > consumers didn't have to be worry about this issue. > > What do you mean by error, this async event happens all the time, > you can't error the establishment just b/c it happend. I don't have > access now to the spec, so i can't say what i understand from the > section you have pointed to. Again, I was refering to requests, not completions. From mamidala at cse.ohio-state.edu Fri Jun 16 08:20:29 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Fri, 16 Jun 2006 11:20:29 -0400 (EDT) Subject: [openib-general] [PATCH] librdmacm/examples/rping.c In-Reply-To: <1150409027.6612.20.camel@stevo-desktop> Message-ID: Hi Steve, The rping also doesn't exit after printing these error messages. Is this expected? Thanks, Amith On Thu, 15 Jun 2006, Steve Wise wrote: > This is the normal output for rping... > > The status error on the completion is 5 (FLUSHED), which is normal. > > Steve. > > > On Thu, 2006-06-15 at 17:24 -0400, amith rajith mamidala wrote: > > Hi, > > > > With the latest rping code (Revision: 8055) I am still able to see this > > race condition. 
> > > > server side: > > > > [@k62-oib examples]$ ./rping -s -vV -C10 -S26 -a 0.0.0.0 -p 9997 > > server ping data: rdma-ping-0: ABCDEFGHIJKL > > server ping data: rdma-ping-1: BCDEFGHIJKLM > > server ping data: rdma-ping-2: CDEFGHIJKLMN > > server ping data: rdma-ping-3: DEFGHIJKLMNO > > server ping data: rdma-ping-4: EFGHIJKLMNOP > > server ping data: rdma-ping-5: FGHIJKLMNOPQ > > server ping data: rdma-ping-6: GHIJKLMNOPQR > > server ping data: rdma-ping-7: HIJKLMNOPQRS > > server ping data: rdma-ping-8: IJKLMNOPQRST > > server ping data: rdma-ping-9: JKLMNOPQRSTU > > server DISCONNECT EVENT... > > wait for RDMA_READ_ADV state 9 > > cq completion failed status 5 > > > > Client side: > > > > [@k63-oib examples]$ ./rping -c -vV -C10 -S26 -a 192.168.111.66 -p 9997 > > ping data: rdma-ping-0: ABCDEFGHIJKL > > ping data: rdma-ping-1: BCDEFGHIJKLM > > ping data: rdma-ping-2: CDEFGHIJKLMN > > ping data: rdma-ping-3: DEFGHIJKLMNO > > ping data: rdma-ping-4: EFGHIJKLMNOP > > ping data: rdma-ping-5: FGHIJKLMNOPQ > > ping data: rdma-ping-6: GHIJKLMNOPQR > > ping data: rdma-ping-7: HIJKLMNOPQRS > > ping data: rdma-ping-8: IJKLMNOPQRST > > ping data: rdma-ping-9: JKLMNOPQRSTU > > cq completion failed status 5 > > client DISCONNECT EVENT... > > > > > > Thanks, > > Amith > > > > > > On Tue, 13 Jun 2006, Steve Wise wrote: > > > > > Thanks, applied. > > > > > > iwarp branch: r7964 > > > trunk: r7966 > > > > > > > > > On Tue, 2006-06-13 at 11:24 -0500, Boyd R. Faulkner wrote: > > > > This patch resolves a race condition between the receipt of > > > > a connection established event and a receive completion from > > > > the client. The server no longer goes to connected state but > > > > merely waits for the READ_ADV state to begin its looping. This > > > > keeps the server from going back to CONNECTED from the later > > > > states if the connection established event comes in after the > > > > receive completion (i.e. the loop starts). 
> > > > > > > > Signed-off-by: Boyd Faulkner > > > > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From Thomas.Talpey at netapp.com Fri Jun 16 08:29:27 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Fri, 16 Jun 2006 11:29:27 -0400 Subject: [openib-general] Mellanox HCAs: outstanding RDMAs In-Reply-To: <6.2.0.14.2.20060615104459.06451e28@esmail.cup.hp.com> References: <7.0.1.0.2.20060606131933.04267008@netapp.com> <6.2.0.14.2.20060615104459.06451e28@esmail.cup.hp.com> Message-ID: <7.0.1.0.2.20060616112445.042523e0@netapp.com> Mike, I am not arguing to change the standard. I am simply saying I do not want to be a victim of the default. It is my belief that very few upper layer programmers are aware of this, btw. The Linux NFS/RDMA upper layer implementation already deals with the issue, as I mentioned. It would certainly welcome a higher available IRD on Mellanox hardware, however. Thanks for your comments. Tom. At 01:55 PM 6/15/2006, Michael Krause wrote: >As one of the authors of IB and iWARP, I can say that both Roland's and Todd's responses are correct and the intent of the specifications. The number of outstanding RDMA Reads is bounded, and that bound is communicated during session establishment. The ULP can choose to be aware of this requirement (certainly when we wrote iSER and DA we were well aware of the requirement, and we documented it as such in the ULP specs) and track it from above so that it does not see a stall, or it can stay ignorant and deal with the stall as a result. This is a ULP choice and has been intentionally done that way so that the hardware can be kept as simple and as low-cost as possible while meeting the breadth of ULP needs that were used to develop these technologies. 
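The bound described above is exchanged in the CM REQ/REP handshake: each side advertises its initiator depth and responder resources, and the server clamps the client's values against its own CA limits by taking the MIN of each pair. A minimal sketch of that arithmetic follows; the struct and helper names are illustrative, not a real CM API:

```c
#include <assert.h>

/* Per-side limits a CA advertises for RDMA Reads (illustrative). */
struct rd_limits {
    int initiator_depth;     /* reads this side may have in flight as requestor */
    int responder_resources; /* reads this side can service as responder        */
};

static int min_i(int a, int b) { return a < b ? a : b; }

/* Server-side negotiation: fold the client's REQ values together with
 * the server's local CA limits to produce the REP values. */
static struct rd_limits negotiate_rep(struct rd_limits req,
                                      struct rd_limits local)
{
    struct rd_limits rep;
    rep.initiator_depth     = min_i(req.responder_resources,
                                    local.initiator_depth);
    rep.responder_resources = min_i(req.initiator_depth,
                                    local.responder_resources);
    return rep;
}
```

With the defaults quoted later in this thread for a Mellanox HCA (InitiatorDepth 128, ResponderResources 4), two such HCAs negotiate down to 4 outstanding reads in each direction; each side then mirrors the REP values into its QP attributes via modify QP.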
> >Tom, you raised this issue during iWARP's definition, and the debate was conducted at least several times. The outcome of these debates is reflected in iWARP and remains aligned with IB. So, unless you really want to have the IETF and IBTA go and modify their specs, I believe you'll have to deal with the issue just as other ULPs are doing today: be aware of the constraint and write the software accordingly. The open source community isn't really the right forum to change the iWARP and IB specifications at the end of the day. Build a case in the IETF and IBTA and let those bodies determine whether it is appropriate to modify their specs or not. And yes, it is a modification of the specs, and therefore of the hardware implementations as well, to address any interoperability requirements that would result (the change proposed could fragment the hardware offerings, as there are many thousands of devices in the market that would not necessarily support this change). > >Mike > > > > >At 12:07 PM 6/6/2006, Talpey, Thomas wrote: >>Todd, thanks for the set-up. I'm really glad we're having this discussion! >> >>Let me give an NFS/RDMA example to illustrate why this upper layer, >>at least, doesn't want the HCA doing its flow control, or resource >>management. >> >>NFS/RDMA is a credit-based protocol which allows many operations in >>progress at the server. Let's say the client is currently running with >>an RPC slot table of 100 requests (a typical value). >> >>Of these requests, some workload-specific percentage will be reads, >>writes, or metadata. All NFS operations consist of one send from >>client to server, some number of RDMA writes (for NFS reads) or >>RDMA reads (for NFS writes), then terminated with one send from >>server to client. >> >>The number of RDMA read or write operations per NFS op depends >>on the amount of data being read or written, and also on the memory >>registration strategy in use on the client. 
The highest-performing >>such strategy is an all-physical one, which results in one RDMA-able >>segment per physical page. NFS r/w requests are, by default, 32KB, >>or 8 pages typically. So, typically 8 RDMA requests (read or write) are >>the result. >> >>To illustrate, let's say the client is processing a multi-threaded >>workload, with (say) 50% reads, 20% writes, and 30% metadata >>such as lookup and getattr. A kernel build, for example. Therefore, >>of our 100 active operations, 50 are reads of 32KB each, 20 are >>writes of 32KB, and 30 are metadata (non-RDMA). >> >>To the server, this results in 100 requests, 100 replies, 400 RDMA >>writes, and 160 RDMA Reads. Of course, these overlap heavily due >>to the widely differing latency of each op and the highly distributed >>arrival times. But, for the example, this is a snapshot of current load. >> >>The latency of the metadata operations is quite low, because lookup >>and getattr are acting on what is effectively cached data. The reads >>and writes, however, are much longer, because they reference the >>filesystem. When disk queues are deep, they can take many ms. >> >>Imagine what happens if the client's IRD is 4 and the server ignores >>its local ORD. As soon as a write begins execution, the server posts >>8 RDMA Reads to fetch the client's write data. The first 4 RDMA Reads >>are sent; the fifth stalls, and stalls the send queue! Even when three >>RDMA Reads complete, the queue remains stalled; it doesn't unblock >>until the fourth is done and all the RDMA Reads have been initiated. >> >>But, what just happened to all the other server send traffic? All those >>metadata replies, and other reads which completed? They're stuck, >>waiting for that one write request. In my example, these number 99 NFS >>ops, i.e. 654 WRs! All for one NFS write! The client operation stream >>effectively became single-threaded. What good is the "rapid initiation >>of RDMA Reads" you describe in the face of this? 
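The traffic mix in the example above can be tallied mechanically. A quick sketch of the counting, purely illustrative, reproducing the stated numbers (one client-to-server send and one reply per NFS op, plus 8 RDMA segments per 32KB data op under all-physical registration):

```c
#include <assert.h>

struct server_load {
    int recvs;       /* one send from client per NFS op    */
    int replies;     /* one send back to client per NFS op */
    int rdma_writes; /* server pushes NFS read data        */
    int rdma_reads;  /* server pulls NFS write data        */
};

/* segs_per_op: RDMA segments per data op -- 8 for 32KB requests
 * with one RDMA-able segment per physical page. */
static struct server_load tally(int nfs_reads, int nfs_writes,
                                int nfs_meta, int segs_per_op)
{
    struct server_load l;
    int ops = nfs_reads + nfs_writes + nfs_meta;
    l.recvs       = ops;
    l.replies     = ops;
    l.rdma_writes = nfs_reads  * segs_per_op;
    l.rdma_reads  = nfs_writes * segs_per_op;
    return l;
}
```

For the 50/20/30 split of 100 active ops this yields 100 requests, 100 replies, 400 RDMA Writes, and 160 RDMA Reads at the server, matching the snapshot described above.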
>> >>Yes, there are many arcane and resource-intensive ways around it. >>But the simplest by far is to count the RDMA Reads outstanding, and >>for the *upper layer* to honor ORD, not the HCA. Then, the send queue >>never blocks, and the operation stream never loses parallelism. This >>is what our NFS server does. >> >>As to the depth of IRD, this is a different calculation; it's a delay-bandwidth product >>of the RDMA Read stream. 4 is good for local, low-latency connections. >>But over a complicated switch infrastructure, or heaven forbid a dark fiber >>long link, I guarantee it will cause a bottleneck. This isn't an issue except >>for operations that care, but it is certainly detectable. I would like to see >>if a pure RDMA Read stream can fully utilize a typical IB fabric, and how >>much headroom an IRD of 4 provides. Not much, I predict. >> >>Closing the connection if IRD is "insufficient to meet goals" isn't a good >>answer, IMO. How does that benefit interoperability? >> >>Thanks for the opportunity to spout off again. Comments welcome! >> >>Tom. >> >>At 12:43 PM 6/6/2006, Rimmer, Todd wrote: >>> >>> >>>> Talpey, Thomas >>>> Sent: Tuesday, June 06, 2006 10:49 AM >>>> >>>> At 10:40 AM 6/6/2006, Roland Dreier wrote: >>>> > Thomas> This is the difference between "may" and "must". The >>>value >>>> > Thomas> is provided, but I don't see anything in the spec that >>>> > Thomas> makes a requirement on its enforcement. Table 107 says >>>the >>>> > Thomas> consumer can query it, that's about as close as it >>>> > Thomas> comes. There's some discussion about CM exchange too. >>>> > >>>> >This seems like a very strained interpretation of the spec. For >>>> >>>> I don't see how strained has anything to do with it. It's not saying >>>> anything >>>> either way. So, a legal implementation can make either choice. We're >>>> talking about the spec! >>>> >>>> But, it really doesn't matter. 
The point is, an upper layer should be >>>> paying >>>> attention to the number of RDMA Reads it posts, or else suffer either >>>the >>>> queue-stalling or connection-failing consequences. Bad stuff either >>>way. >>>> >>>> Tom. >>> >>>Somewhere beneath this discussion is a bug in the application or IB >>>stack. I'm not sure which "may" in the spec you are referring to, but >>>the "may"s I have found all are for cases where the responder might >>>support only 1 outstanding request. In all cases the negotiation >>>protocol must be followed, and the requestor is not allowed to exceed the >>>negotiated limit. >>> >>>The mechanism should be: >>>the client queries its local HCA and determines responder resources (e.g. >>>the number of concurrent outstanding RDMA reads on the wire from the remote >>>end where this end will respond with the read data) and initiator depth >>>(e.g. the number of concurrent outstanding RDMA reads which this end can >>>initiate as the requestor). >>> >>>The client puts the above information in the CM REQ. >>> >>>The server similarly gets its information from its local CA and negotiates >>>down the values to the MIN of each side (REP.InitiatorDepth = >>>MIN(REQ.ResponderResources, server's local CA's initiator depth); >>>REP.ResponderResources = MIN(REQ.InitiatorDepth, server's local CA's >>>responder resources)). If the server does not support RDMA Reads, it can >>>REJ. >>> >>>If the client decides the negotiated values are insufficient to meet its >>>goals, it can disconnect. >>> >>>Each side sets its QP parameters via modify QP appropriately. 
Note they >>>too will be mirror images of each other: >>>client: >>>QP.Max RDMA Reads as Initiator = REP.ResponderResources >>>QP.Max RDMA reads as responder = REP.InitiatorDepth >>> >>>server: >>>QP.Max RDMA Reads as responder = REP.ResponderResources >>>QP.Max RDMA reads as initiator = REP.InitiatorDepth >>> >>>We have done a lot of high-stress RDMA Read traffic with Mellanox HCAs, >>>and provided the above negotiation is followed, we have seen no issues. >>>Note however that by default a Mellanox HCA typically reports a large >>>InitiatorDepth (128) and a modest ResponderResources (4-8). Hence when >>>I hear that Responder Resources must be grown to 128 for some >>>application to reliably work, it implies the negotiation I outlined >>>above is not being followed. >>> >>>Note that the ordering rules in table 76 of IBTA 1.2 show how reads and >>>writes on a send queue are ordered. There are many cases where an op can >>>pass an outstanding RDMA read, hence it is not always bad to queue extra >>>RDMA reads. If needed, the Fence can be used to force ordering. >>> >>>For many apps, it's going to be better to get the items onto the queue and >>>let the QP handle the outstanding-reads case rather than have the app >>>add a level of queuing for this purpose. Letting the HCA do the queuing >>>will allow for a more rapid initiation of subsequent reads. >>> >>>Todd Rimmer >> >> >>_______________________________________________ >>openib-general mailing list >>openib-general at openib.org >>http://openib.org/mailman/listinfo/openib-general >> >>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sweitzen at cisco.com Fri Jun 16 08:58:04 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Fri, 16 Jun 2006 08:58:04 -0700 Subject: [openib-general] OFED 1.0 - Official Release Message-ID: Tziporet, I see a few C code changes from pre1 in the form of patches. What are these and why were they added after pre1? 
$ diff -r OFED-1.0-pre1/SOURCES/openib-1.0/patches/ OFED-1.0/SOURCES/openib-1.0/patches/ 2>&1 | less ... Only in OFED-1.0-pre1/SOURCES/openib-1.0/patches/fixes: handle_reconnect_of_offline_host.patch Only in OFED-1.0/SOURCES/openib-1.0/patches/fixes: sdp_fix.patch Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Tziporet Koren Sent: Friday, June 16, 2006 1:55 AM To: OpenFabricsEWG; openib Subject: [openib-general] OFED 1.0 - Official Release I am happy to announce that the OFED 1.0 Official Release is now available. The release can be found under: https://openib.org/svn/gen2/branches/1.0/ofed/releases/ And later today it will be on the OpenFabrics download page: http://www.openfabrics.org/downloads.html. This is the first release that was done in a joint effort of the following companies: * Cisco * SilverStorm * Voltaire * QLogic * Intel * Mellanox Technologies I wish to thank all who contributed to the success of this release. Tziporet =============================================================================== Release summary: The OFED software package is composed of several software modules intended for use on a computer cluster constructed as an InfiniBand network. The OFED package contains the following components: o OpenFabrics core and ULPs: - HCA drivers (mthca, ipath) - core - Upper Layer Protocols: IPoIB, SDP, SRP Initiator, iSER Host, RDS and uDAPL o OpenFabrics utilities: - OpenSM: InfiniBand Subnet Manager - Diagnostic tools - Performance tests o MPI: - OSU MPI stack supporting the InfiniBand interface - Open MPI stack supporting the InfiniBand interface - MPI benchmark tests (OSU BW/LAT, Pallas, Presta) o Sources of all software modules (under conditions mentioned in the modules' LICENSE files) o Documentation Notes: 1. SDP and RDS are in technology preview state. 2. 
The SRP Initiator and Open MPI are in beta state. 3. All other OFED components are in production state. Supported Platforms and Operating Systems CPU architectures: * x86_64 * x86 * ia64 * ppc64 Linux Operating Systems: * RedHat EL4 up2: 2.6.9-22.ELsmp * RedHat EL4 up3: 2.6.9-34.ELsmp * Fedora C4: 2.6.11-1.1369_FC4 * SLES10 RC2: 2.6.16.16-1.6-smp (or RC 2.5 2.6.16.14-6-smp) * SLES10 RC1: 2.6.16.14-6-smp * SUSE 10 Pro: 2.6.13-15-smp * kernel.org: 2.6.16.x HCAs Supported Mellanox HCAs: - InfiniHost - InfiniHost III Ex (both modes: with memory and MemFree) - InfiniHost III Lx Both SDR and DDR modes of the InfiniHost III family are supported. For official FW versions please see: http://www.mellanox.com/support/firmware_table.php QLogic HCAs: - QHT6040 (PathScale InfiniPath HT-460) - QHT6140 (PathScale InfiniPath HT-465) - QLE6140 (PathScale InfiniPath PE-880) Switches Supported This release was tested with switches and gateways provided by the following companies: - Cisco - Voltaire - SilverStorm - Flextronics Attached are the release notes. Tziporet Koren Software Director Mellanox Technologies mailto: tziporet at mellanox.co.il Tel +972-4-9097200, ext 380 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Fri Jun 16 09:06:30 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 16 Jun 2006 09:06:30 -0700 Subject: [openib-general] ucma into kernel.org In-Reply-To: <1150465355.29508.4.camel@stevo-desktop> References: <1150465355.29508.4.camel@stevo-desktop> Message-ID: <4492D706.4060106@ichips.intel.com> Steve Wise wrote: > Will the ucma make it into 2.6.18? I notice it's not in Roland's > for-2.6.18 tree right now. The plan is to allow the userspace interface to mature some before trying to merge it upstream. This is why it is not included in 2.6.18. 
- Sean From mshefty at ichips.intel.com Fri Jun 16 09:13:34 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 16 Jun 2006 09:13:34 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: References: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> Message-ID: <4492D8AE.2010104@ichips.intel.com> James Lentini wrote: >>As an alternative, I don't think that there's any reason why the QP >>can't be transition to RTS when the CM REP is sent. > > I like this idea. It simplifies how ULPs handle this issue. Are there > any spec. compliance issues with this? There's no spec compliance issues that I can readily find. I will make a note to fix this, as well as handle the connection established event as Or suggested, but it will be a couple of weeks before I get to this. (I will be attending the workshop next week.) > If the passive side CM doesn't receive an RTU, the passive side CM > should retransmit the REP. At least that is how I read 12.9.8.6 > "Timeouts and Retries" in the IBTA spec. I can't find where this > happens in the code. Did I miss it? The MAD layer retries the CM messages, typically until the CM cancels the operation. - Sean From halr at voltaire.com Fri Jun 16 09:08:36 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Jun 2006 12:08:36 -0400 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: References: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> Message-ID: <1150474113.4506.102460.camel@hal.voltaire.com> On Fri, 2006-06-16 at 11:15, James Lentini wrote: [snip...] > > As an alternative, I don't think that there's any reason why the QP > > can't be transition to RTS when the CM REP is sent. > > I like this idea. It simplifies how ULPs handle this issue. Are there > any spec. compliance issues with this? 
IMO, it would violate the CM state machine and the passive CM transition specification in 12.9.7.2 and have the effect of circumventing the retransmission of REP on lost RTU. Data can't fly until either the RTU or the first data message is received from the other direction. -- Hal From mshefty at ichips.intel.com Fri Jun 16 09:20:07 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 16 Jun 2006 09:20:07 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <15ddcffd0606160351p276a227v18ca42301256455b@mail.gmail.com> References: <44903D5D.10102@ichips.intel.com> <449119AE.2010703@voltaire.com> <15ddcffd0606160351p276a227v18ca42301256455b@mail.gmail.com> Message-ID: <4492DA37.4040402@ichips.intel.com> Or Gerlitz wrote: > This is what i was suspecting, Sean can you confirm that? if it does > not emulate RTU > reception, than what it does do? Both receiving an RTU and getting a connection established event move the connection into the established state. They generate different events to the user of the IB CM because RTUs carry private data. - Sean From rdreier at cisco.com Fri Jun 16 09:22:29 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jun 2006 09:22:29 -0700 Subject: [openib-general] [PATCH] add HW specific data to libibverbs modify QP, SRQ response In-Reply-To: <1150410898.32252.69.camel@brick.pathscale.com> (Ralph Campbell's message of "Thu, 15 Jun 2006 15:34:58 -0700") References: <1150396280.32252.46.camel@brick.pathscale.com> <1150407704.32252.65.camel@brick.pathscale.com> <1150410898.32252.69.camel@brick.pathscale.com> Message-ID: Roland> Hmm... it seems simpler to have userspace allocate the Roland> memory with mmap() before the resize_cq call, and then Roland> pass that new buffer into the resize_cq call. That way Roland> you don't have a window where the kernel is putting Roland> completions into a buffer that userspace doesn't know Roland> about. Ralph> Perhaps. 
But this way, the code is the same for kernel and Ralph> user allocated queues. I guess there is some benefit there. Ralph> Or the new kernel driver needs to handle the old way and Ralph> the new way. Yeah. From rdreier at cisco.com Fri Jun 16 09:25:38 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jun 2006 09:25:38 -0700 Subject: [openib-general] Patch for review: ipath mmaped CQs, QPs, SRQs [1 of 2] In-Reply-To: <1150411254.32252.76.camel@brick.pathscale.com> (Ralph Campbell's message of "Thu, 15 Jun 2006 15:40:54 -0700") References: <1150411254.32252.76.camel@brick.pathscale.com> Message-ID: > + /* Unmap the old queue so we can resize it. */ > + size = sizeof(struct ipath_cq_wc) + > + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe); > + (void) munmap(cq->queue, size); > + ret = ibv_cmd_resize_cq_resp(ibcq, cqe, &cmd, sizeof cmd, > + &resp.ibv_resp, sizeof resp); > + if (ret) { > + pthread_spin_unlock(&cq->lock); > + return ret; > + } It seems that this method of returning a new buffer address to mmap from the resize operation leads to some really nasty error handling though. If the resize operation fails (either because of bad userspace values or because the kernel is out of memory and can't allocate a new buffer) then the old CQ is gone, possibly beyond recovery. mthca avoids all this by allocating a resize buffer in advance. - R. 
From mshefty at ichips.intel.com Fri Jun 16 09:31:36 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 16 Jun 2006 09:31:36 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <1150474113.4506.102460.camel@hal.voltaire.com> References: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> <1150474113.4506.102460.camel@hal.voltaire.com> Message-ID: <4492DCE8.2000408@ichips.intel.com> Hal Rosenstock wrote: > IMO, it would violate the CM state machine and the passive CM transition > specification in 12.9.7.2 and have the effect of circumventing the > retransmission of REP on lost RTU. Data can't fly until either the RTU > or the first data message is received from the other direction. This moves the QP state to RTS, as opposed to the CEP state to connected. So I don't believe that it violates the spec. A drawback to moving the QP to RTS is that the communication established event will not be generated. This forces us to wait for the RTU to move the CEP to connected, or we need to do it upon receiving the first completion. The RDMA CM has no knowledge when the latter occurs, so would need user input. - Sean From rdreier at cisco.com Fri Jun 16 09:36:51 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jun 2006 09:36:51 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> (Sean Hefty's message of "Thu, 15 Jun 2006 15:04:57 -0700") References: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> Message-ID: >I suggest the following design: the CMA would replace the event handler >provided with the qp_init_attr struct with a callback of its own and >keep the original handler/context on a private structure. This is probably fine. There is one further situation where the connection needs to be established, beyond RTU and the communication established async event. 
Namely, if a receive completion is polled. Since async events are, well, asynchronous, there's no guarantee that the communication established event will be reported any time soon... From halr at voltaire.com Fri Jun 16 09:37:39 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Jun 2006 12:37:39 -0400 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <4492DCE8.2000408@ichips.intel.com> References: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> <1150474113.4506.102460.camel@hal.voltaire.com> <4492DCE8.2000408@ichips.intel.com> Message-ID: <1150475858.4506.103650.camel@hal.voltaire.com> On Fri, 2006-06-16 at 12:31, Sean Hefty wrote: > Hal Rosenstock wrote: > > IMO, it would violate the CM state machine and the passive CM transition > > specification in 12.9.7.2 and have the effect of circumventing the > > retransmission of REP on lost RTU. Data can't fly until either the RTU > > or the first data message is received from the other direction. > > This moves the QP state to RTS, as opposed to the CEP state to connected. So I > don't believe that it violates the spec. Isn't the CEP the QP (see p. 689 line 7) ? > A drawback to moving the QP to RTS is that the communication established event > will not be generated. This forces us to wait for the RTU to move the CEP to > connected, or we need to do it upon receiving the first completion. > The RDMA CM has no knowledge when the latter occurs, so would need user input. It sounds like I may have been looking at the wrong state but nonetheless the CEP/QP states are defined there and this would be different from what is in the spec. I wasn't saying it couldn't be made to work though. I haven't looked at it enough to know. If it does work, maybe the spec should get updated to cover this option too. 
-- Hal > - Sean From johann.george at qlogic.com Fri Jun 16 09:59:16 2006 From: johann.george at qlogic.com (Johann George) Date: Fri, 16 Jun 2006 09:59:16 -0700 Subject: [openib-general] OFED 1.0 - Official Release In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA724C@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA724C@mtlexch01.mtl.com> Message-ID: <20060616165916.GA1866@cuprite.pathscale.com> > I am happy to announce that OFED 1.0 Official Release is now available. Congratulations to everyone involved; and especially to you, Tziporet. You have done a fabulous job in pulling this all together. Johann From mshefty at ichips.intel.com Fri Jun 16 09:58:50 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 16 Jun 2006 09:58:50 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <1150475858.4506.103650.camel@hal.voltaire.com> References: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> <1150474113.4506.102460.camel@hal.voltaire.com> <4492DCE8.2000408@ichips.intel.com> <1150475858.4506.103650.camel@hal.voltaire.com> Message-ID: <4492E34A.30209@ichips.intel.com> Hal Rosenstock wrote: >>This moves the QP state to RTS, as opposed to the CEP state to connected. So I >>don't believe that it violates the spec. > > > Isn't the CEP the QP (see p. 689 line 7) ? Hmm... I was viewing the CEP as moving through the states described in 12.9.5 and 12.9.6. (Idle, REQ sent, REP wait, etc.) I see what you're saying now. > It sounds like I may have been looking at the wrong state but > nonetheless the CEP/QP states are defined there and this would be > different from what is in the spec. I wasn't saying it couldn't be made > to work though. I haven't looked at it enough to know. If it does work, > maybe the spec should get updated to cover this option too. What I'd like to find is a way that a user, upon receiving a message, can send a response. 
Today, a user cannot send the response until after they get a connection established event from the IB CM, and then the RDMA CM. So, it sounds like even the RDMA CM needs some sort of rdma_establish() call to finish connecting a QP. I don't think that iWarp would run into this issue. - Sean From ralphc at pathscale.com Fri Jun 16 10:06:34 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 16 Jun 2006 10:06:34 -0700 Subject: [openib-general] Patch for review: ipath mmaped CQs, QPs, SRQs [1 of 2] In-Reply-To: References: <1150411254.32252.76.camel@brick.pathscale.com> Message-ID: <1150477594.32252.89.camel@brick.pathscale.com> On Fri, 2006-06-16 at 09:25 -0700, Roland Dreier wrote: > > + /* Unmap the old queue so we can resize it. */ > > + size = sizeof(struct ipath_cq_wc) + > > + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe); > > + (void) munmap(cq->queue, size); > > + ret = ibv_cmd_resize_cq_resp(ibcq, cqe, &cmd, sizeof cmd, > > + &resp.ibv_resp, sizeof resp); > > + if (ret) { > > + pthread_spin_unlock(&cq->lock); > > + return ret; > > + } > > It seems that this method of returning a new buffer address to mmap > from the resize operation leads to some really nasty error handling > though. If the resize operation fails (either because of bad > userspace values or because the kernel is out of memory and can't > allocate a new buffer) then the old CQ is gone, possibly beyond recovery. > > mthca avoids all this by allocating a resize buffer in advance. > > - R. I agree. The kernel driver is careful to allocate the new queue and copy the old contents to the new one atomically. The issue is making sure the old queue isn't still being used by some other thread. I guess if the semantics for resize are that on error the old mmap is still valid, and on success the old mmap is invalid, then there isn't an error-recovery issue. All user-level threads lock before using the queue address, so the change of address is protected. 
-- Ralph Campbell From rdreier at cisco.com Fri Jun 16 10:13:57 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jun 2006 10:13:57 -0700 Subject: [openib-general] Patch for review: ipath mmaped CQs, QPs, SRQs [1 of 2] In-Reply-To: <1150477594.32252.89.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 16 Jun 2006 10:06:34 -0700") References: <1150411254.32252.76.camel@brick.pathscale.com> <1150477594.32252.89.camel@brick.pathscale.com> Message-ID: Ralph> I agree. The kernel driver is careful to allocate the new Ralph> queue and copy the old contents to the new one Ralph> atomically. The issue is making sure the old queue isn't Ralph> still being used by some other thread. I guess if the Ralph> semantics for resize are that on error the old mmap is Ralph> still valid, and on success the old mmap is invalid, then Ralph> there isn't an error-recovery issue. All user-level Ralph> threads lock before using the queue address, so the Ralph> change of address is protected. Those seem like the only sane semantics -- if the operation fails, then the state of the CQ shouldn't change. - R. From mamidala at cse.ohio-state.edu Fri Jun 16 10:40:38 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Fri, 16 Jun 2006 13:40:38 -0400 (EDT) Subject: [openib-general] [PATCH] librdmacm/examples/rping.c In-Reply-To: <1150471411.29508.17.camel@stevo-desktop> Message-ID: Hi, I tried using gdb, but it also hangs at the end. The system used is an IA32 platform using Red Hat Enterprise Linux AS release 4 (Nahant Update 3). 
Or we've got a > bug in our refcnts on the iw_cm_ids in the kernel. Can you get a gdb > stack trace when its stalled? And if you kdb, a kernel mode stack > trace of the same thread would be nice too... > > What systems/distros/etc are you running this on? > > Thanks, > > Stevo. > > > > > Thanks, > > Amith > > > > On Thu, 15 Jun 2006, Steve Wise wrote: > > > > > This is the normal output for rping... > > > > > > The status error on the completion is 5 (FLUSHED), which is normal. > > > > > > Steve. > > > > > > > > > On Thu, 2006-06-15 at 17:24 -0400, amith rajith mamidala wrote: > > > > Hi, > > > > > > > > With the latest rping code (Revision: 8055) I am still able to see this > > > > race condition. > > > > > > > > server side: > > > > > > > > [@k62-oib examples]$ ./rping -s -vV -C10 -S26 -a 0.0.0.0 -p 9997 > > > > server ping data: rdma-ping-0: ABCDEFGHIJKL > > > > server ping data: rdma-ping-1: BCDEFGHIJKLM > > > > server ping data: rdma-ping-2: CDEFGHIJKLMN > > > > server ping data: rdma-ping-3: DEFGHIJKLMNO > > > > server ping data: rdma-ping-4: EFGHIJKLMNOP > > > > server ping data: rdma-ping-5: FGHIJKLMNOPQ > > > > server ping data: rdma-ping-6: GHIJKLMNOPQR > > > > server ping data: rdma-ping-7: HIJKLMNOPQRS > > > > server ping data: rdma-ping-8: IJKLMNOPQRST > > > > server ping data: rdma-ping-9: JKLMNOPQRSTU > > > > server DISCONNECT EVENT... 
> > > > wait for RDMA_READ_ADV state 9 > > > > cq completion failed status 5 > > > > > > > > Client side: > > > > > > > > [@k63-oib examples]$ ./rping -c -vV -C10 -S26 -a 192.168.111.66 -p 9997 > > > > ping data: rdma-ping-0: ABCDEFGHIJKL > > > > ping data: rdma-ping-1: BCDEFGHIJKLM > > > > ping data: rdma-ping-2: CDEFGHIJKLMN > > > > ping data: rdma-ping-3: DEFGHIJKLMNO > > > > ping data: rdma-ping-4: EFGHIJKLMNOP > > > > ping data: rdma-ping-5: FGHIJKLMNOPQ > > > > ping data: rdma-ping-6: GHIJKLMNOPQR > > > > ping data: rdma-ping-7: HIJKLMNOPQRS > > > > ping data: rdma-ping-8: IJKLMNOPQRST > > > > ping data: rdma-ping-9: JKLMNOPQRSTU > > > > cq completion failed status 5 > > > > client DISCONNECT EVENT... > > > > > > > > > > > > Thanks, > > > > Amith > > > > > > > > > > > > On Tue, 13 Jun 2006, Steve Wise wrote: > > > > > > > > > Thanks, applied. > > > > > > > > > > iwarp branch: r7964 > > > > > trunk: r7966 > > > > > > > > > > > > > > > On Tue, 2006-06-13 at 11:24 -0500, Boyd R. Faulkner wrote: > > > > > > This patch resolves a race condition between the receipt of > > > > > > a connection established event and a receive completion from > > > > > > the client. The server no longer goes to connected state but > > > > > > merely waits for the READ_ADV state to begin its looping. This > > > > > > keeps the server from going back to CONNECTED from the later > > > > > > states if the connection established event comes in after the > > > > > > receive completion (i.e. the loop starts). 
> > > > > > > > > > > > Signed-off-by: Boyd Faulkner > > > > > > > > > > > > > > > _______________________________________________ > > > > > openib-general mailing list > > > > > openib-general at openib.org > > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > > > > From Sujal at Mellanox.com Fri Jun 16 10:54:18 2006 From: Sujal at Mellanox.com (Sujal Das) Date: Fri, 16 Jun 2006 10:54:18 -0700 Subject: [openib-general] [openfabrics-ewg] OFED 1.0 - Official Release Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F1A8F76@mtiexch01.mti.com> Yes, this is a great achievement. Congrats! -----Original Message----- From: openfabrics-ewg-bounces at openib.org [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of Johann George Sent: Friday, June 16, 2006 9:59 AM To: Tziporet Koren Cc: OpenFabricsEWG; openib Subject: Re: [openfabrics-ewg] [openib-general] OFED 1.0 - Official Release > I am happy to announce that OFED 1.0 Official Release is now available. Congratulations to everyone involved; and especially to you, Tziporet. You have done a fabulous job in pulling this all together. Johann _______________________________________________ openfabrics-ewg mailing list openfabrics-ewg at openib.org http://openib.org/mailman/listinfo/openfabrics-ewg From swise at opengridcomputing.com Fri Jun 16 11:43:53 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 Jun 2006 13:43:53 -0500 Subject: [openib-general] [PATCH] librdmacm/examples/rping.c In-Reply-To: References: Message-ID: <1150483433.29508.28.camel@stevo-desktop> On Fri, 2006-06-16 at 13:40 -0400, amith rajith mamidala wrote: > Hi, > > I tried using gdb but it also hangs at the end. The system used is a IA32 > platform using Red Hat Enterprise Linux AS release 4 (Nahant Update 3). 
> kernel info:Linux k63-oib 2.6.16.20 #2 SMP Wed Jun 14 15:02:47 EDT 2006 > i686 i686 i386 GNU/Linux, > Try breaking in rdma_destroy_id() and see if it ever returns from that function... STevo. > Thanks, > Amith > > > On Fri, 16 Jun 2006, Steve Wise wrote: > > > On Fri, 2006-06-16 at 11:20 -0400, amith rajith mamidala wrote: > > > Hi Steve, > > > > > > The rping also doesn't exit after printing these error messages. Is this > > > expected? > > > > > > > It should exit! :-( > > > > Maybe rping is not acking all the CM or Async events? Or we've got a > > bug in our refcnts on the iw_cm_ids in the kernel. Can you get a gdb > > stack trace when its stalled? And if you kdb, a kernel mode stack > > trace of the same thread would be nice too... > > > > What systems/distros/etc are you running this on? > > > > Thanks, > > > > Stevo. > > > > > > > > > Thanks, > > > Amith > > > > > > On Thu, 15 Jun 2006, Steve Wise wrote: > > > > > > > This is the normal output for rping... > > > > > > > > The status error on the completion is 5 (FLUSHED), which is normal. > > > > > > > > Steve. > > > > > > > > > > > > On Thu, 2006-06-15 at 17:24 -0400, amith rajith mamidala wrote: > > > > > Hi, > > > > > > > > > > With the latest rping code (Revision: 8055) I am still able to see this > > > > > race condition. 
> > > > > > > > > > server side: > > > > > > > > > > [@k62-oib examples]$ ./rping -s -vV -C10 -S26 -a 0.0.0.0 -p 9997 > > > > > server ping data: rdma-ping-0: ABCDEFGHIJKL > > > > > server ping data: rdma-ping-1: BCDEFGHIJKLM > > > > > server ping data: rdma-ping-2: CDEFGHIJKLMN > > > > > server ping data: rdma-ping-3: DEFGHIJKLMNO > > > > > server ping data: rdma-ping-4: EFGHIJKLMNOP > > > > > server ping data: rdma-ping-5: FGHIJKLMNOPQ > > > > > server ping data: rdma-ping-6: GHIJKLMNOPQR > > > > > server ping data: rdma-ping-7: HIJKLMNOPQRS > > > > > server ping data: rdma-ping-8: IJKLMNOPQRST > > > > > server ping data: rdma-ping-9: JKLMNOPQRSTU > > > > > server DISCONNECT EVENT... > > > > > wait for RDMA_READ_ADV state 9 > > > > > cq completion failed status 5 > > > > > > > > > > Client side: > > > > > > > > > > [@k63-oib examples]$ ./rping -c -vV -C10 -S26 -a 192.168.111.66 -p 9997 > > > > > ping data: rdma-ping-0: ABCDEFGHIJKL > > > > > ping data: rdma-ping-1: BCDEFGHIJKLM > > > > > ping data: rdma-ping-2: CDEFGHIJKLMN > > > > > ping data: rdma-ping-3: DEFGHIJKLMNO > > > > > ping data: rdma-ping-4: EFGHIJKLMNOP > > > > > ping data: rdma-ping-5: FGHIJKLMNOPQ > > > > > ping data: rdma-ping-6: GHIJKLMNOPQR > > > > > ping data: rdma-ping-7: HIJKLMNOPQRS > > > > > ping data: rdma-ping-8: IJKLMNOPQRST > > > > > ping data: rdma-ping-9: JKLMNOPQRSTU > > > > > cq completion failed status 5 > > > > > client DISCONNECT EVENT... > > > > > > > > > > > > > > > Thanks, > > > > > Amith > > > > > > > > > > > > > > > On Tue, 13 Jun 2006, Steve Wise wrote: > > > > > > > > > > > Thanks, applied. > > > > > > > > > > > > iwarp branch: r7964 > > > > > > trunk: r7966 > > > > > > > > > > > > > > > > > > On Tue, 2006-06-13 at 11:24 -0500, Boyd R. Faulkner wrote: > > > > > > > This patch resolves a race condition between the receipt of > > > > > > > a connection established event and a receive completion from > > > > > > > the client. 
The server no longer goes to connected state but > > > > > > > merely waits for the READ_ADV state to begin its looping. This > > > > > > > keeps the server from going back to CONNECTED from the later > > > > > > > states if the connection established event comes in after the > > > > > > > receive completion (i.e. the loop starts). > > > > > > > > > > > > > > Signed-off-by: Boyd Faulkner > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > openib-general mailing list > > > > > > openib-general at openib.org > > > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > > > > From mshefty at ichips.intel.com Fri Jun 16 11:52:24 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 16 Jun 2006 11:52:24 -0700 Subject: [openib-general] [PATCH 1/5] ib_addr: retrieve MGID from device address In-Reply-To: <000301c68dd5$5f569ca0$68fc070a@amr.corp.intel.com> References: <000301c68dd5$5f569ca0$68fc070a@amr.corp.intel.com> Message-ID: <4492FDE8.6080404@ichips.intel.com> Sean Hefty wrote: >>dev_addr->broadcast + 4/dev_addr->src_dev_addr + 4 may not be naturally >>aligned, >>so casting this pointer to structure type may cause compiler to generate >>incorrect code. > > Thanks - I'll update this. An update for this ends up working out better as a separate patch. Fixes are needed in the existing cma and multicast code. 
- Sean From johnip at sgi.com Fri Jun 16 12:51:06 2006 From: johnip at sgi.com (John Partridge) Date: Fri, 16 Jun 2006 14:51:06 -0500 Subject: [openib-general] MVAPICH failure on SGI Altix SLES10 Message-ID: <44930BAA.6030300@sgi.com> I am trying to run the example from MPI_README.txt (and other MPI apps like pallas), but I keep getting a Couldn't modify SRQ limit error message :- mig129:~/OFED-1.0-pre1 # /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.1.0/bin/mpirun_rsh -rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.1.0/tests/osutests-1.0/bw 1000 16 [1] Abort: Couldn't modify SRQ limit at line 995 in file viainit.c mpirun_rsh: Abort signaled from [1] [0] Abort: [mig125:0] Got completion with error, code=12 at line 2143 in file viacheck.c done. I am using OFED-1.0-pre1 (kernel modules are from OFED-1.0-pre1 also) OS is SLES10 SUSE Linux Enterprise Server 10 (ia64) VERSION = 10 HW is SGI Altix ia64 Can anyone help please ? Thanks John -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com From trimmer at silverstorm.com Fri Jun 16 13:02:47 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Fri, 16 Jun 2006 16:02:47 -0400 Subject: [openib-general] design for communication established affiliated asynchronous event handling Message-ID: > -----Original Message----- > From: Or Gerlitz; openib-general > > In most cases, I would expect that the IB CM will eventually receive the > RTU, > > which will generate an event to the RDMA CM to transition the QP into > RTS. > > But we want an IB stack and set of ULPs which would work in production so > they > need to handle also irregular cases... eg when the RTU is lost over and > over. Agreed. The missing RTU case must be handled for a few reasons: 1. The RTU could honestly be lost (GSI QPs are UD, they could overflow, the fabric could lose the packet, etc.) 2. 
The RC send could beat the processing of the RTU (packets on wire may be out of order if there are different SLs/VLs involved with GSI vs application QP). Also, it's possible the CM is slower getting to its queue of packets (such as when bombarded by many connections) while the application/ULP gets its RC send quickly. [I have observed this situation in various real-world stress tests.] This problem is quite simple to handle (I did it a few years ago in the SilverStorm stack) and the IB spec completely covers this issue: CM - have a hook so the CM can get the Async Events for all CAs. On getting the Async Event for the first packet received while in RTR (Communication established), the CM should treat this exactly like an RTU (with no private data). The CM will need to cross-reference the CA/QP this event was reported for to identify the applicable connection endpoint. If you check the IBTA spec and the CM state machines you will see the CM is supposed to handle this event. Also, if the RTU does arrive later, the CM state machine handles that correctly by discarding the RTU as if it were a duplicate. Note: this is why applications should not depend on private data in the RTU. ULPs - all ULPs should be written so they are fully ready to process inbound data before they tell the CM to send the REP. It is very likely the ULP will get a CQ completion for the inbound RQ data before the CM has completed its processing. In general, IB allows for this situation quite nicely. The ULP can process the inbound data normally and queue it to the Send Q. Putting data on a Send Q is permitted in RTR, but the QP will not initiate sending until moved to RTS. As such, the ULP can let the CM RTU processing (which will race with the RQ data completion) do its normal thing and move the QP to RTS. 
Todd Rimmer From boris at mellanox.com Fri Jun 16 13:23:28 2006 From: boris at mellanox.com (Boris Shpolyansky) Date: Fri, 16 Jun 2006 13:23:28 -0700 Subject: [openib-general] MVAPICH failure on SGI Altix SLES10 Message-ID: <1E3DCD1C63492545881FACB6063A57C1324280@mtiexch01.mti.com> Hi John, Most probably you need to upgrade the FW on your HCAs. See the following section from the MVAPICH 0.9.7 User Guide: 7.2.5 Couldn't modify SRQ limit This means that your HCA card doesn't support the ibv_modify_srq feature. Please upgrade the firmware version and OpenIB Gen2 libraries on your cluster. You can obtain the latest Mellanox firmware images from this webpage. If, even after updating your firmware and OpenIB Gen2 libraries, you continue to experience this problem, please edit make.mvapich.gcc and replace -DMEMORY_SCALE with -DADAPTIVE_RDMA_FAST_PATH. After making this change you need to re-build the MVAPICH library. Note that you should first try to update your firmware and OpenIB Gen2 libraries before taking this measure. If you believe that your HCA supports this feature, yet you are experiencing this problem, please contact the MVAPICH community at mvapich-discuss at cse.ohio-state.edu. Regards, Boris Shpolyansky Application Engineer Mellanox Technologies Inc. 
2900 Stender Way Santa Clara, CA 95054 Tel.: (408) 916 0014 Fax: (408) 970 3403 Cell: (408) 834 9365 www.mellanox.com -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of John Partridge Sent: Friday, June 16, 2006 12:51 PM To: openib-general at openib.org Subject: [openib-general] MVAPICH failure on SGI Altix SLES10 I am trying to run the example from MPI_README.txt (and other MPI apps like pallas), but I keep getting a Couldn't modify SRQ limit error message :- mig129:~/OFED-1.0-pre1 # /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.1.0/bin/mpirun_rsh -rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.1.0/tests/osutests-1.0/bw 1000 16 [1] Abort: Couldn't modify SRQ limit at line 995 in file viainit.c mpirun_rsh: Abort signaled from [1] [0] Abort: [mig125:0] Got completion with error, code=12 at line 2143 in file viacheck.c done. I am using OFED-1.0-pre1 (kernel modules are from OFED-1.0-pre1 also) OS is SLES10 SUSE Linux Enterprise Server 10 (ia64) VERSION = 10 HW is SGI Altix ia64 Can anyone help please ? Thanks John -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From ralphc at pathscale.com Fri Jun 16 13:30:31 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 16 Jun 2006 13:30:31 -0700 Subject: [openib-general] [PATCH] ib_uverbs_create_ah() doesn't initialize ib_uobject.object pointer Message-ID: <1150489831.32252.102.camel@brick.pathscale.com> I get a NULL pointer panic when trying to use the current trunk SVN (rev 8088). I traced it down to ib_uverbs_create_ah() failing to initialize the ib_uobject.object pointer. 
Signed-off-by: Ralph Campbell Index: src/linux-kernel/infiniband/core/uverbs_cmd.c =================================================================== --- src/linux-kernel/infiniband/core/uverbs_cmd.c (revision 8088) +++ src/linux-kernel/infiniband/core/uverbs_cmd.c (working copy) @@ -1779,6 +1779,7 @@ } ah->uobject = uobj; + uobj->object = ah; ret = idr_add_uobj(&ib_uverbs_ah_idr, uobj); if (ret) goto err_destroy; -- Ralph Campbell From rdreier at cisco.com Fri Jun 16 13:38:16 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jun 2006 13:38:16 -0700 Subject: [openib-general] [PATCH] ib_uverbs_create_ah() doesn't initialize ib_uobject.object pointer In-Reply-To: <1150489831.32252.102.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 16 Jun 2006 13:30:31 -0700") References: <1150489831.32252.102.camel@brick.pathscale.com> Message-ID: Thanks, applied. From johnip at sgi.com Fri Jun 16 13:51:07 2006 From: johnip at sgi.com (John Partridge) Date: Fri, 16 Jun 2006 15:51:07 -0500 Subject: [openib-general] MVAPICH failure on SGI Altix SLES10 In-Reply-To: <1E3DCD1C63492545881FACB6063A57C1324280@mtiexch01.mti.com> References: <1E3DCD1C63492545881FACB6063A57C1324280@mtiexch01.mti.com> Message-ID: <449319BB.3070702@sgi.com> Thank You Boris that seems to have fixed it. Regards John Boris Shpolyansky wrote: > Hi John, > > Most probably you need to upgrade the FW on your HCAs. > See the following section from MVAPICH 0.9.7 User Guide: > > 7.2.5 Couldn't modify SRQ limit > > This means that your HCA card doesn't support the ibv_modify_srq > feature. Please upgrade > the firmware version and OpenIB Gen2 libraries on your cluster. You can > obtain the latest > Mellanox firmware images from this webpage. > Even after updating your firmware and OpenIB Gen2 libraries, you > continue to experience > this problem, please edit make.mvapich.gcc and replace -DMEMORY_SCALE > with > -DADAPTIVE_RDMA_FAST_PATH. 
After making this change you need to re-build > the MVAPICH > library. Note that you should first try to update your firmware and > OpenIB Gen2 libraries > before taking this measure. > If you believe that your HCA supports this feature, yet you are > experiencing this problem, > please contact the MVAPICH community at > mvapich-discuss at cse.ohio-state.edu. > > Regards, > Boris Shpolyansky > Application Engineer > Mellanox Technologies Inc. > 2900 Stender Way > Santa Clara, CA 95054 > Tel.: (408) 916 0014 > Fax: (408) 970 3403 > Cell: (408) 834 9365 > www.mellanox.com > > > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of John Partridge > Sent: Friday, June 16, 2006 12:51 PM > To: openib-general at openib.org > Subject: [openib-general] MVAPICH failure on SGI Altix SLES10 > > I am trying to run the example from MPI_README.txt (and other MPI apps > like pallas), but I keep getting a Couldn't modify SRQ limit error > message :- > > mig129:~/OFED-1.0-pre1 # > /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.1.0/bin/mpirun_rsh -rsh -np 2 > -hostfile /root/cluster > /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.1.0/tests/osutests-1.0/bw > 1000 16 [1] Abort: Couldn't modify SRQ limit > at line 995 in file viainit.c > mpirun_rsh: Abort signaled from [1] > [0] Abort: [mig125:0] Got completion with error, code=12 > at line 2143 in file viacheck.c > done. > > I am using OFED-1.0-pre1 (kernel modules are from OFED-1.0-pre1 also) OS > is SLES10 SUSE Linux Enterprise Server 10 (ia64) VERSION = 10 > > HW is SGI Altix ia64 > > Can anyone help please ? 
> > Thanks > John > > -- > John Partridge > > Silicon Graphics Inc > Tel: 651-683-3428 > Vnet: 233-3428 > E-Mail: johnip at sgi.com > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com From mshefty at ichips.intel.com Fri Jun 16 14:24:19 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 16 Jun 2006 14:24:19 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: References: Message-ID: <44932183.8040107@ichips.intel.com> Rimmer, Todd wrote: > CM - have a hook so the CM can get the Async Events for all CAs. On > getting the Async Event for packet first packet received while in RTR > (Communication established), the CM should treat this exactly like an > RTU (with no private data). The CM will need to cross reference the > CA/QP this event was reported for to identify the applicable connection > endpoint. If you check the IBTA spec and the CM state machines you will > see the CM is supposed to handle this event. Also if the RTU does > arrive later, the CM state machine also handles that correctly by > discarding the RTU as if it was a duplicate. Note: this is why > applications should not depend on private data in the RTU. The IB CM has this capability, and behaves as indicated. The missing piece is for the RDMA CM to handle this situation. I believe that Or's approach of replacing the user's QP handler with the CMA's will fix this. > has completed its processing. In general IB allows for this situation > quite nicely. The ULP can process the inbound data normally and queue > it to the Send Q. 
Putting data on a Send Q is permitted in RTR, but the This is a good point, which indicates to me that nothing more is needed than handling the communication established event by the RDMA CM. - Sean From rdreier at cisco.com Fri Jun 16 15:05:54 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jun 2006 15:05:54 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: (Roland Dreier's message of "Tue, 13 Jun 2006 10:55:57 -0700") References: <20060613051149.GE4621@mellanox.co.il> Message-ID: OK, I think that the modify_qp, modify_srq and resize_cq calls all need to be serialized. Unfortunately the only good way I could see to serialize these calls is to add a mutex to mthca's CQ, QP and SRQ structures (which bloats the structures somewhat). The patch I committed is below -- with this change I think we're OK even if userspace does crazy multithreaded stuff. IB/mthca: Make all device methods truly reentrant Documentation/infiniband/core_locking.txt says: All of the methods in struct ib_device exported by a low-level driver must be fully reentrant. The low-level driver is required to perform all synchronization necessary to maintain consistency, even if multiple function calls using the same object are run simultaneously. However, mthca's modify_qp, modify_srq and resize_cq methods are currently not reentrant. Add a mutex to the QP, SRQ and CQ structures so that these calls can be properly serialized. 
Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index 205854e..f20a5b6 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -813,6 +813,7 @@ int mthca_init_cq(struct mthca_dev *dev, spin_lock_init(&cq->lock); cq->refcount = 1; init_waitqueue_head(&cq->wait); + mutex_init(&cq->mutex); memset(cq_context, 0, sizeof *cq_context); cq_context->flags = cpu_to_be32(MTHCA_CQ_STATUS_OK | diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 8f89ba7..230ae21 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -793,18 +793,24 @@ static int mthca_resize_cq(struct ib_cq if (entries < 1 || entries > dev->limits.max_cqes) return -EINVAL; + mutex_lock(&cq->mutex); + entries = roundup_pow_of_two(entries + 1); - if (entries == ibcq->cqe + 1) - return 0; + if (entries == ibcq->cqe + 1) { + ret = 0; + goto out; + } if (cq->is_kernel) { ret = mthca_alloc_resize_buf(dev, cq, entries); if (ret) - return ret; + goto out; lkey = cq->resize_buf->buf.mr.ibmr.lkey; } else { - if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) - return -EFAULT; + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) { + ret = -EFAULT; + goto out; + } lkey = ucmd.lkey; } @@ -821,7 +827,7 @@ static int mthca_resize_cq(struct ib_cq cq->resize_buf = NULL; spin_unlock_irq(&cq->lock); } - return ret; + goto out; } if (cq->is_kernel) { @@ -848,7 +854,10 @@ static int mthca_resize_cq(struct ib_cq } else ibcq->cqe = entries - 1; - return 0; +out: + mutex_unlock(&cq->mutex); + + return ret; } static int mthca_destroy_cq(struct ib_cq *cq) diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 322bc32..16c387d 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -536,6 +536,8 @@ int 
mthca_modify_qp(struct ib_qp *ibqp, u8 status; int err = -EINVAL; + mutex_lock(&qp->mutex); + if (attr_mask & IB_QP_CUR_STATE) { cur_state = attr->cur_qp_state; } else { @@ -553,39 +555,41 @@ int mthca_modify_qp(struct ib_qp *ibqp, "%d->%d with attr 0x%08x\n", qp->transport, cur_state, new_state, attr_mask); - return -EINVAL; + goto out; } if ((attr_mask & IB_QP_PKEY_INDEX) && attr->pkey_index >= dev->limits.pkey_table_len) { mthca_dbg(dev, "P_Key index (%u) too large. max is %d\n", attr->pkey_index, dev->limits.pkey_table_len-1); - return -EINVAL; + goto out; } if ((attr_mask & IB_QP_PORT) && (attr->port_num == 0 || attr->port_num > dev->limits.num_ports)) { mthca_dbg(dev, "Port number (%u) is invalid\n", attr->port_num); - return -EINVAL; + goto out; } if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC && attr->max_rd_atomic > dev->limits.max_qp_init_rdma) { mthca_dbg(dev, "Max rdma_atomic as initiator %u too large (max is %d)\n", attr->max_rd_atomic, dev->limits.max_qp_init_rdma); - return -EINVAL; + goto out; } if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC && attr->max_dest_rd_atomic > 1 << dev->qp_table.rdb_shift) { mthca_dbg(dev, "Max rdma_atomic as responder %u too large (max %d)\n", attr->max_dest_rd_atomic, 1 << dev->qp_table.rdb_shift); - return -EINVAL; + goto out; } mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL); - if (IS_ERR(mailbox)) - return PTR_ERR(mailbox); + if (IS_ERR(mailbox)) { + err = PTR_ERR(mailbox); + goto out; + } qp_param = mailbox->buf; qp_context = &qp_param->context; memset(qp_param, 0, sizeof *qp_param); @@ -618,7 +622,7 @@ int mthca_modify_qp(struct ib_qp *ibqp, if (attr->path_mtu < IB_MTU_256 || attr->path_mtu > IB_MTU_2048) { mthca_dbg(dev, "path MTU (%u) is invalid\n", attr->path_mtu); - goto out; + goto out_mailbox; } qp_context->mtu_msgmax = (attr->path_mtu << 5) | 31; } @@ -672,7 +676,7 @@ int mthca_modify_qp(struct ib_qp *ibqp, if (attr_mask & IB_QP_AV) { if (mthca_path_set(dev, &attr->ah_attr, &qp_context->pri_path, attr_mask & IB_QP_PORT ? 
attr->port_num : qp->port)) - goto out; + goto out_mailbox; qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); } @@ -686,18 +690,18 @@ int mthca_modify_qp(struct ib_qp *ibqp, if (attr->alt_pkey_index >= dev->limits.pkey_table_len) { mthca_dbg(dev, "Alternate P_Key index (%u) too large. max is %d\n", attr->alt_pkey_index, dev->limits.pkey_table_len-1); - goto out; + goto out_mailbox; } if (attr->alt_port_num == 0 || attr->alt_port_num > dev->limits.num_ports) { mthca_dbg(dev, "Alternate port number (%u) is invalid\n", attr->alt_port_num); - goto out; + goto out_mailbox; } if (mthca_path_set(dev, &attr->alt_ah_attr, &qp_context->alt_path, attr->alt_ah_attr.port_num)) - goto out; + goto out_mailbox; qp_context->alt_path.port_pkey |= cpu_to_be32(attr->alt_pkey_index | attr->alt_port_num << 24); @@ -793,12 +797,12 @@ int mthca_modify_qp(struct ib_qp *ibqp, err = mthca_MODIFY_QP(dev, cur_state, new_state, qp->qpn, 0, mailbox, sqd_event, &status); if (err) - goto out; + goto out_mailbox; if (status) { mthca_warn(dev, "modify QP %d->%d returned status %02x.\n", cur_state, new_state, status); err = -EINVAL; - goto out; + goto out_mailbox; } qp->state = new_state; @@ -853,8 +857,11 @@ int mthca_modify_qp(struct ib_qp *ibqp, } } -out: +out_mailbox: mthca_free_mailbox(dev, mailbox); + +out: + mutex_unlock(&qp->mutex); return err; } @@ -1100,6 +1107,7 @@ static int mthca_alloc_qp_common(struct qp->refcount = 1; init_waitqueue_head(&qp->wait); + mutex_init(&qp->mutex); qp->state = IB_QPS_RESET; qp->atomic_rd_en = 0; qp->resp_depth = 0; diff --git a/drivers/infiniband/hw/mthca/mthca_srq.c b/drivers/infiniband/hw/mthca/mthca_srq.c index b292fef..fab417c 100644 --- a/drivers/infiniband/hw/mthca/mthca_srq.c +++ b/drivers/infiniband/hw/mthca/mthca_srq.c @@ -243,6 +243,7 @@ int mthca_alloc_srq(struct mthca_dev *de spin_lock_init(&srq->lock); srq->refcount = 1; init_waitqueue_head(&srq->wait); + mutex_init(&srq->mutex); if (mthca_is_memfree(dev)) 
mthca_arbel_init_srq_context(dev, pd, srq, mailbox->buf); @@ -371,7 +372,11 @@ int mthca_modify_srq(struct ib_srq *ibsr if (attr_mask & IB_SRQ_LIMIT) { if (attr->srq_limit > srq->max) return -EINVAL; + + mutex_lock(&srq->mutex); ret = mthca_ARM_SRQ(dev, srq->srqn, attr->srq_limit, &status); + mutex_unlock(&srq->mutex); + if (ret) return ret; if (status) From rdreier at cisco.com Fri Jun 16 15:07:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jun 2006 15:07:12 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: <1150223140.11881.2.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Tue, 13 Jun 2006 11:25:39 -0700") References: <20060613051149.GE4621@mellanox.co.il> <1150223140.11881.2.camel@hematite.internal.keyresearch.com> Message-ID: Robert, can you confirm that the new uverbs locking scheme helps the performance problems you're having? I'm planning on queueing the patch below for 2.6.18 (which has all fixes rolled up in it): IB/uverbs: Don't serialize with ib_uverbs_idr_mutex Currently, all userspace verbs operations that call into the kernel are serialized by ib_uverbs_idr_mutex. This can be a scalability issue for some workloads, especially for devices driven by the ipath driver, which needs to call into the kernel even for datapath operations. Fix this by adding reference counts to the userspace objects, and then converting ib_uverbs_idr_mutex into a spinlock that only protects the idrs long enough to take a reference on the object being looked up. Because remove operations may fail, we have to do a slightly funky two-step deletion, which is described in the comments at the top of uverbs_cmd.c. This also still leaves ib_uverbs_idr_lock as a single lock that is possibly subject to contention. However, the lock hold time will only be a single idr operation, so multiple threads should still be able to make progress, even if ib_uverbs_idr_lock is being ping-ponged. 
Surprisingly, these changes even shrink the object code: add/remove: 23/5 grow/shrink: 4/21 up/down: 633/-693 (-60) Signed-off-by: Roland Dreier --- diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index 3372d67..bb9bee5 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -132,7 +132,7 @@ struct ib_ucq_object { u32 async_events_reported; }; -extern struct mutex ib_uverbs_idr_mutex; +extern spinlock_t ib_uverbs_idr_lock; extern struct idr ib_uverbs_pd_idr; extern struct idr ib_uverbs_mr_idr; extern struct idr ib_uverbs_mw_idr; @@ -141,6 +141,8 @@ extern struct idr ib_uverbs_cq_idr; extern struct idr ib_uverbs_qp_idr; extern struct idr ib_uverbs_srq_idr; +void idr_remove_uobj(struct idr *idp, struct ib_uobject *uobj); + struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file, int is_async, int *fd); void ib_uverbs_release_event_file(struct kref *ref); diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 403dd81..76bf61e 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -50,7 +50,64 @@ #define INIT_UDATA(udata, ibuf, obuf, il (udata)->outlen = (olen); \ } while (0) -static int idr_add_uobj(struct idr *idr, void *obj, struct ib_uobject *uobj) +/* + * The ib_uobject locking scheme is as follows: + * + * - ib_uverbs_idr_lock protects the uverbs idrs themselves, so it + * needs to be held during all idr operations. When an object is + * looked up, a reference must be taken on the object's kref before + * dropping this lock. + * + * - Each object also has an rwsem. This rwsem must be held for + * reading while an operation that uses the object is performed. + * For example, while registering an MR, the associated PD's + * uobject.mutex must be held for reading. The rwsem must be held + * for writing while initializing or destroying an object. + * + * - In addition, each object has a "live" flag. 
If this flag is not + * set, then lookups of the object will fail even if it is found in + * the idr. This handles a reader that blocks and does not acquire + * the rwsem until after the object is destroyed. The destroy + * operation will set the live flag to 0 and then drop the rwsem; + * this will allow the reader to acquire the rwsem, see that the + * live flag is 0, and then drop the rwsem and its reference to + * object. The underlying storage will not be freed until the last + * reference to the object is dropped. + */ + +static void init_uobj(struct ib_uobject *uobj, u64 user_handle, + struct ib_ucontext *context) +{ + uobj->user_handle = user_handle; + uobj->context = context; + kref_init(&uobj->ref); + init_rwsem(&uobj->mutex); + uobj->live = 0; +} + +static void release_uobj(struct kref *kref) +{ + kfree(container_of(kref, struct ib_uobject, ref)); +} + +static void put_uobj(struct ib_uobject *uobj) +{ + kref_put(&uobj->ref, release_uobj); +} + +static void put_uobj_read(struct ib_uobject *uobj) +{ + up_read(&uobj->mutex); + put_uobj(uobj); +} + +static void put_uobj_write(struct ib_uobject *uobj) +{ + up_write(&uobj->mutex); + put_uobj(uobj); +} + +static int idr_add_uobj(struct idr *idr, struct ib_uobject *uobj) { int ret; @@ -58,7 +115,9 @@ retry: if (!idr_pre_get(idr, GFP_KERNEL)) return -ENOMEM; + spin_lock(&ib_uverbs_idr_lock); ret = idr_get_new(idr, uobj, &uobj->id); + spin_unlock(&ib_uverbs_idr_lock); if (ret == -EAGAIN) goto retry; @@ -66,6 +125,121 @@ retry: return ret; } +void idr_remove_uobj(struct idr *idr, struct ib_uobject *uobj) +{ + spin_lock(&ib_uverbs_idr_lock); + idr_remove(idr, uobj->id); + spin_unlock(&ib_uverbs_idr_lock); +} + +static struct ib_uobject *__idr_get_uobj(struct idr *idr, int id, + struct ib_ucontext *context) +{ + struct ib_uobject *uobj; + + spin_lock(&ib_uverbs_idr_lock); + uobj = idr_find(idr, id); + if (uobj) + kref_get(&uobj->ref); + spin_unlock(&ib_uverbs_idr_lock); + + return uobj; +} + +static struct ib_uobject 
*idr_read_uobj(struct idr *idr, int id,
+					struct ib_ucontext *context)
+{
+	struct ib_uobject *uobj;
+
+	uobj = __idr_get_uobj(idr, id, context);
+	if (!uobj)
+		return NULL;
+
+	down_read(&uobj->mutex);
+	if (!uobj->live) {
+		put_uobj_read(uobj);
+		return NULL;
+	}
+
+	return uobj;
+}
+
+static struct ib_uobject *idr_write_uobj(struct idr *idr, int id,
+					 struct ib_ucontext *context)
+{
+	struct ib_uobject *uobj;
+
+	uobj = __idr_get_uobj(idr, id, context);
+	if (!uobj)
+		return NULL;
+
+	down_write(&uobj->mutex);
+	if (!uobj->live) {
+		put_uobj_write(uobj);
+		return NULL;
+	}
+
+	return uobj;
+}
+
+static void *idr_read_obj(struct idr *idr, int id, struct ib_ucontext *context)
+{
+	struct ib_uobject *uobj;
+
+	uobj = idr_read_uobj(idr, id, context);
+	return uobj ? uobj->object : NULL;
+}
+
+static struct ib_pd *idr_read_pd(int pd_handle, struct ib_ucontext *context)
+{
+	return idr_read_obj(&ib_uverbs_pd_idr, pd_handle, context);
+}
+
+static void put_pd_read(struct ib_pd *pd)
+{
+	put_uobj_read(pd->uobject);
+}
+
+static struct ib_cq *idr_read_cq(int cq_handle, struct ib_ucontext *context)
+{
+	return idr_read_obj(&ib_uverbs_cq_idr, cq_handle, context);
+}
+
+static void put_cq_read(struct ib_cq *cq)
+{
+	put_uobj_read(cq->uobject);
+}
+
+static struct ib_ah *idr_read_ah(int ah_handle, struct ib_ucontext *context)
+{
+	return idr_read_obj(&ib_uverbs_ah_idr, ah_handle, context);
+}
+
+static void put_ah_read(struct ib_ah *ah)
+{
+	put_uobj_read(ah->uobject);
+}
+
+static struct ib_qp *idr_read_qp(int qp_handle, struct ib_ucontext *context)
+{
+	return idr_read_obj(&ib_uverbs_qp_idr, qp_handle, context);
+}
+
+static void put_qp_read(struct ib_qp *qp)
+{
+	put_uobj_read(qp->uobject);
+}
+
+static struct ib_srq *idr_read_srq(int srq_handle, struct ib_ucontext *context)
+{
+	return idr_read_obj(&ib_uverbs_srq_idr, srq_handle, context);
+}
+
+static void put_srq_read(struct ib_srq *srq)
+{
+	put_uobj_read(srq->uobject);
+}
+
 ssize_t ib_uverbs_get_context(struct
ib_uverbs_file *file, const char __user *buf, int in_len, int out_len) @@ -296,7 +470,8 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uve if (!uobj) return -ENOMEM; - uobj->context = file->ucontext; + init_uobj(uobj, 0, file->ucontext); + down_write(&uobj->mutex); pd = file->device->ib_dev->alloc_pd(file->device->ib_dev, file->ucontext, &udata); @@ -309,11 +484,10 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uve pd->uobject = uobj; atomic_set(&pd->usecnt, 0); - mutex_lock(&ib_uverbs_idr_mutex); - - ret = idr_add_uobj(&ib_uverbs_pd_idr, pd, uobj); + uobj->object = pd; + ret = idr_add_uobj(&ib_uverbs_pd_idr, uobj); if (ret) - goto err_up; + goto err_idr; memset(&resp, 0, sizeof resp); resp.pd_handle = uobj->id; @@ -321,26 +495,27 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uve if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } mutex_lock(&file->mutex); list_add_tail(&uobj->list, &file->ucontext->pd_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + uobj->live = 1; + + up_write(&uobj->mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_pd_idr, uobj->id); +err_copy: + idr_remove_uobj(&ib_uverbs_pd_idr, uobj); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); +err_idr: ib_dealloc_pd(pd); err: - kfree(uobj); + put_uobj_write(uobj); return ret; } @@ -349,37 +524,34 @@ ssize_t ib_uverbs_dealloc_pd(struct ib_u int in_len, int out_len) { struct ib_uverbs_dealloc_pd cmd; - struct ib_pd *pd; struct ib_uobject *uobj; - int ret = -EINVAL; + int ret; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); + uobj = idr_write_uobj(&ib_uverbs_pd_idr, cmd.pd_handle, file->ucontext); + if (!uobj) + return -EINVAL; - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); - if (!pd || pd->uobject->context != file->ucontext) - goto out; + ret = ib_dealloc_pd(uobj->object); + if (!ret) + uobj->live = 0; - uobj = pd->uobject; + put_uobj_write(uobj); - ret = 
ib_dealloc_pd(pd); if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_pd_idr, cmd.pd_handle); + idr_remove_uobj(&ib_uverbs_pd_idr, uobj); mutex_lock(&file->mutex); list_del(&uobj->list); mutex_unlock(&file->mutex); - kfree(uobj); + put_uobj(uobj); -out: - mutex_unlock(&ib_uverbs_idr_mutex); - - return ret ? ret : in_len; + return in_len; } ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, @@ -419,7 +591,8 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverb if (!obj) return -ENOMEM; - obj->uobject.context = file->ucontext; + init_uobj(&obj->uobject, 0, file->ucontext); + down_write(&obj->uobject.mutex); /* * We ask for writable memory if any access flags other than @@ -436,23 +609,14 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverb obj->umem.virt_base = cmd.hca_va; - mutex_lock(&ib_uverbs_idr_mutex); - - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); - if (!pd || pd->uobject->context != file->ucontext) { - ret = -EINVAL; - goto err_up; - } - - if (!pd->device->reg_user_mr) { - ret = -ENOSYS; - goto err_up; - } + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + if (!pd) + goto err_release; mr = pd->device->reg_user_mr(pd, &obj->umem, cmd.access_flags, &udata); if (IS_ERR(mr)) { ret = PTR_ERR(mr); - goto err_up; + goto err_put; } mr->device = pd->device; @@ -461,43 +625,48 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverb atomic_inc(&pd->usecnt); atomic_set(&mr->usecnt, 0); - memset(&resp, 0, sizeof resp); - resp.lkey = mr->lkey; - resp.rkey = mr->rkey; - - ret = idr_add_uobj(&ib_uverbs_mr_idr, mr, &obj->uobject); + obj->uobject.object = mr; + ret = idr_add_uobj(&ib_uverbs_mr_idr, &obj->uobject); if (ret) goto err_unreg; + memset(&resp, 0, sizeof resp); + resp.lkey = mr->lkey; + resp.rkey = mr->rkey; resp.mr_handle = obj->uobject.id; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } + put_pd_read(pd); + mutex_lock(&file->mutex); list_add_tail(&obj->uobject.list, 
&file->ucontext->mr_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + obj->uobject.live = 1; + + up_write(&obj->uobject.mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_mr_idr, obj->uobject.id); +err_copy: + idr_remove_uobj(&ib_uverbs_mr_idr, &obj->uobject); err_unreg: ib_dereg_mr(mr); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); +err_put: + put_pd_read(pd); +err_release: ib_umem_release(file->device->ib_dev, &obj->umem); err_free: - kfree(obj); + put_uobj_write(&obj->uobject); return ret; } @@ -507,37 +676,40 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uve { struct ib_uverbs_dereg_mr cmd; struct ib_mr *mr; + struct ib_uobject *uobj; struct ib_umem_object *memobj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - mr = idr_find(&ib_uverbs_mr_idr, cmd.mr_handle); - if (!mr || mr->uobject->context != file->ucontext) - goto out; + uobj = idr_write_uobj(&ib_uverbs_mr_idr, cmd.mr_handle, file->ucontext); + if (!uobj) + return -EINVAL; - memobj = container_of(mr->uobject, struct ib_umem_object, uobject); + memobj = container_of(uobj, struct ib_umem_object, uobject); + mr = uobj->object; ret = ib_dereg_mr(mr); + if (!ret) + uobj->live = 0; + + put_uobj_write(uobj); + if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_mr_idr, cmd.mr_handle); + idr_remove_uobj(&ib_uverbs_mr_idr, uobj); mutex_lock(&file->mutex); - list_del(&memobj->uobject.list); + list_del(&uobj->list); mutex_unlock(&file->mutex); ib_umem_release(file->device->ib_dev, &memobj->umem); - kfree(memobj); -out: - mutex_unlock(&ib_uverbs_idr_mutex); + put_uobj(uobj); - return ret ? 
ret : in_len; + return in_len; } ssize_t ib_uverbs_create_comp_channel(struct ib_uverbs_file *file, @@ -576,7 +748,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uv struct ib_uverbs_create_cq cmd; struct ib_uverbs_create_cq_resp resp; struct ib_udata udata; - struct ib_ucq_object *uobj; + struct ib_ucq_object *obj; struct ib_uverbs_event_file *ev_file = NULL; struct ib_cq *cq; int ret; @@ -594,10 +766,13 @@ ssize_t ib_uverbs_create_cq(struct ib_uv if (cmd.comp_vector >= file->device->num_comp_vectors) return -EINVAL; - uobj = kmalloc(sizeof *uobj, GFP_KERNEL); - if (!uobj) + obj = kmalloc(sizeof *obj, GFP_KERNEL); + if (!obj) return -ENOMEM; + init_uobj(&obj->uobject, cmd.user_handle, file->ucontext); + down_write(&obj->uobject.mutex); + if (cmd.comp_channel >= 0) { ev_file = ib_uverbs_lookup_comp_file(cmd.comp_channel); if (!ev_file) { @@ -606,63 +781,64 @@ ssize_t ib_uverbs_create_cq(struct ib_uv } } - uobj->uobject.user_handle = cmd.user_handle; - uobj->uobject.context = file->ucontext; - uobj->uverbs_file = file; - uobj->comp_events_reported = 0; - uobj->async_events_reported = 0; - INIT_LIST_HEAD(&uobj->comp_list); - INIT_LIST_HEAD(&uobj->async_list); + obj->uverbs_file = file; + obj->comp_events_reported = 0; + obj->async_events_reported = 0; + INIT_LIST_HEAD(&obj->comp_list); + INIT_LIST_HEAD(&obj->async_list); cq = file->device->ib_dev->create_cq(file->device->ib_dev, cmd.cqe, file->ucontext, &udata); if (IS_ERR(cq)) { ret = PTR_ERR(cq); - goto err; + goto err_file; } cq->device = file->device->ib_dev; - cq->uobject = &uobj->uobject; + cq->uobject = &obj->uobject; cq->comp_handler = ib_uverbs_comp_handler; cq->event_handler = ib_uverbs_cq_event_handler; cq->cq_context = ev_file; atomic_set(&cq->usecnt, 0); - mutex_lock(&ib_uverbs_idr_mutex); - - ret = idr_add_uobj(&ib_uverbs_cq_idr, cq, &uobj->uobject); + obj->uobject.object = cq; + ret = idr_add_uobj(&ib_uverbs_cq_idr, &obj->uobject); if (ret) - goto err_up; + goto err_free; memset(&resp, 0, sizeof resp); - 
resp.cq_handle = uobj->uobject.id; + resp.cq_handle = obj->uobject.id; resp.cqe = cq->cqe; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } mutex_lock(&file->mutex); - list_add_tail(&uobj->uobject.list, &file->ucontext->cq_list); + list_add_tail(&obj->uobject.list, &file->ucontext->cq_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + obj->uobject.live = 1; + + up_write(&obj->uobject.mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_cq_idr, uobj->uobject.id); +err_copy: + idr_remove_uobj(&ib_uverbs_cq_idr, &obj->uobject); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); + +err_free: ib_destroy_cq(cq); -err: +err_file: if (ev_file) - ib_uverbs_release_ucq(file, ev_file, uobj); - kfree(uobj); + ib_uverbs_release_ucq(file, ev_file, obj); + +err: + put_uobj_write(&obj->uobject); return ret; } @@ -683,11 +859,9 @@ ssize_t ib_uverbs_resize_cq(struct ib_uv (unsigned long) cmd.response + sizeof resp, in_len - sizeof cmd, out_len - sizeof resp); - mutex_lock(&ib_uverbs_idr_mutex); - - cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); - if (!cq || cq->uobject->context != file->ucontext || !cq->device->resize_cq) - goto out; + cq = idr_read_cq(cmd.cq_handle, file->ucontext); + if (!cq) + return -EINVAL; ret = cq->device->resize_cq(cq, cmd.cqe, &udata); if (ret) @@ -701,7 +875,7 @@ ssize_t ib_uverbs_resize_cq(struct ib_uv ret = -EFAULT; out: - mutex_unlock(&ib_uverbs_idr_mutex); + put_cq_read(cq); return ret ? 
ret : in_len; } @@ -712,6 +886,7 @@ ssize_t ib_uverbs_poll_cq(struct ib_uver { struct ib_uverbs_poll_cq cmd; struct ib_uverbs_poll_cq_resp *resp; + struct ib_uobject *uobj; struct ib_cq *cq; struct ib_wc *wc; int ret = 0; @@ -732,15 +907,17 @@ ssize_t ib_uverbs_poll_cq(struct ib_uver goto out_wc; } - mutex_lock(&ib_uverbs_idr_mutex); - cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); - if (!cq || cq->uobject->context != file->ucontext) { + uobj = idr_read_uobj(&ib_uverbs_cq_idr, cmd.cq_handle, file->ucontext); + if (!uobj) { ret = -EINVAL; goto out; } + cq = uobj->object; resp->count = ib_poll_cq(cq, cmd.ne, wc); + put_uobj_read(uobj); + for (i = 0; i < resp->count; i++) { resp->wc[i].wr_id = wc[i].wr_id; resp->wc[i].status = wc[i].status; @@ -762,7 +939,6 @@ ssize_t ib_uverbs_poll_cq(struct ib_uver ret = -EFAULT; out: - mutex_unlock(&ib_uverbs_idr_mutex); kfree(resp); out_wc: @@ -775,22 +951,23 @@ ssize_t ib_uverbs_req_notify_cq(struct i int out_len) { struct ib_uverbs_req_notify_cq cmd; + struct ib_uobject *uobj; struct ib_cq *cq; - int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); - if (cq && cq->uobject->context == file->ucontext) { - ib_req_notify_cq(cq, cmd.solicited_only ? - IB_CQ_SOLICITED : IB_CQ_NEXT_COMP); - ret = in_len; - } - mutex_unlock(&ib_uverbs_idr_mutex); + uobj = idr_read_uobj(&ib_uverbs_cq_idr, cmd.cq_handle, file->ucontext); + if (!uobj) + return -EINVAL; + cq = uobj->object; - return ret; + ib_req_notify_cq(cq, cmd.solicited_only ? 
+ IB_CQ_SOLICITED : IB_CQ_NEXT_COMP); + + put_uobj_read(uobj); + + return in_len; } ssize_t ib_uverbs_destroy_cq(struct ib_uverbs_file *file, @@ -799,52 +976,50 @@ ssize_t ib_uverbs_destroy_cq(struct ib_u { struct ib_uverbs_destroy_cq cmd; struct ib_uverbs_destroy_cq_resp resp; + struct ib_uobject *uobj; struct ib_cq *cq; - struct ib_ucq_object *uobj; + struct ib_ucq_object *obj; struct ib_uverbs_event_file *ev_file; - u64 user_handle; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - memset(&resp, 0, sizeof resp); - - mutex_lock(&ib_uverbs_idr_mutex); + uobj = idr_write_uobj(&ib_uverbs_cq_idr, cmd.cq_handle, file->ucontext); + if (!uobj) + return -EINVAL; + cq = uobj->object; + ev_file = cq->cq_context; + obj = container_of(cq->uobject, struct ib_ucq_object, uobject); - cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); - if (!cq || cq->uobject->context != file->ucontext) - goto out; + ret = ib_destroy_cq(cq); + if (!ret) + uobj->live = 0; - user_handle = cq->uobject->user_handle; - uobj = container_of(cq->uobject, struct ib_ucq_object, uobject); - ev_file = cq->cq_context; + put_uobj_write(uobj); - ret = ib_destroy_cq(cq); if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_cq_idr, cmd.cq_handle); + idr_remove_uobj(&ib_uverbs_cq_idr, uobj); mutex_lock(&file->mutex); - list_del(&uobj->uobject.list); + list_del(&uobj->list); mutex_unlock(&file->mutex); - ib_uverbs_release_ucq(file, ev_file, uobj); + ib_uverbs_release_ucq(file, ev_file, obj); - resp.comp_events_reported = uobj->comp_events_reported; - resp.async_events_reported = uobj->async_events_reported; + memset(&resp, 0, sizeof resp); + resp.comp_events_reported = obj->comp_events_reported; + resp.async_events_reported = obj->async_events_reported; - kfree(uobj); + put_uobj(uobj); if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) - ret = -EFAULT; - -out: - mutex_unlock(&ib_uverbs_idr_mutex); + return -EFAULT; - return ret ? 
ret : in_len; + return in_len; } ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file, @@ -854,7 +1029,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv struct ib_uverbs_create_qp cmd; struct ib_uverbs_create_qp_resp resp; struct ib_udata udata; - struct ib_uqp_object *uobj; + struct ib_uqp_object *obj; struct ib_pd *pd; struct ib_cq *scq, *rcq; struct ib_srq *srq; @@ -872,23 +1047,21 @@ ssize_t ib_uverbs_create_qp(struct ib_uv (unsigned long) cmd.response + sizeof resp, in_len - sizeof cmd, out_len - sizeof resp); - uobj = kmalloc(sizeof *uobj, GFP_KERNEL); - if (!uobj) + obj = kmalloc(sizeof *obj, GFP_KERNEL); + if (!obj) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); + init_uobj(&obj->uevent.uobject, cmd.user_handle, file->ucontext); + down_write(&obj->uevent.uobject.mutex); - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); - scq = idr_find(&ib_uverbs_cq_idr, cmd.send_cq_handle); - rcq = idr_find(&ib_uverbs_cq_idr, cmd.recv_cq_handle); - srq = cmd.is_srq ? idr_find(&ib_uverbs_srq_idr, cmd.srq_handle) : NULL; + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + scq = idr_read_cq(cmd.send_cq_handle, file->ucontext); + rcq = idr_read_cq(cmd.recv_cq_handle, file->ucontext); + srq = cmd.is_srq ? 
idr_read_srq(cmd.srq_handle, file->ucontext) : NULL; - if (!pd || pd->uobject->context != file->ucontext || - !scq || scq->uobject->context != file->ucontext || - !rcq || rcq->uobject->context != file->ucontext || - (cmd.is_srq && (!srq || srq->uobject->context != file->ucontext))) { + if (!pd || !scq || !rcq || (cmd.is_srq && !srq)) { ret = -EINVAL; - goto err_up; + goto err_put; } attr.event_handler = ib_uverbs_qp_event_handler; @@ -905,16 +1078,14 @@ ssize_t ib_uverbs_create_qp(struct ib_uv attr.cap.max_recv_sge = cmd.max_recv_sge; attr.cap.max_inline_data = cmd.max_inline_data; - uobj->uevent.uobject.user_handle = cmd.user_handle; - uobj->uevent.uobject.context = file->ucontext; - uobj->uevent.events_reported = 0; - INIT_LIST_HEAD(&uobj->uevent.event_list); - INIT_LIST_HEAD(&uobj->mcast_list); + obj->uevent.events_reported = 0; + INIT_LIST_HEAD(&obj->uevent.event_list); + INIT_LIST_HEAD(&obj->mcast_list); qp = pd->device->create_qp(pd, &attr, &udata); if (IS_ERR(qp)) { ret = PTR_ERR(qp); - goto err_up; + goto err_put; } qp->device = pd->device; @@ -922,7 +1093,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv qp->send_cq = attr.send_cq; qp->recv_cq = attr.recv_cq; qp->srq = attr.srq; - qp->uobject = &uobj->uevent.uobject; + qp->uobject = &obj->uevent.uobject; qp->event_handler = attr.event_handler; qp->qp_context = attr.qp_context; qp->qp_type = attr.qp_type; @@ -932,14 +1103,14 @@ ssize_t ib_uverbs_create_qp(struct ib_uv if (attr.srq) atomic_inc(&attr.srq->usecnt); - memset(&resp, 0, sizeof resp); - resp.qpn = qp->qp_num; - - ret = idr_add_uobj(&ib_uverbs_qp_idr, qp, &uobj->uevent.uobject); + obj->uevent.uobject.object = qp; + ret = idr_add_uobj(&ib_uverbs_qp_idr, &obj->uevent.uobject); if (ret) goto err_destroy; - resp.qp_handle = uobj->uevent.uobject.id; + memset(&resp, 0, sizeof resp); + resp.qpn = qp->qp_num; + resp.qp_handle = obj->uevent.uobject.id; resp.max_recv_sge = attr.cap.max_recv_sge; resp.max_send_sge = attr.cap.max_send_sge; resp.max_recv_wr = 
attr.cap.max_recv_wr; @@ -949,27 +1120,42 @@ ssize_t ib_uverbs_create_qp(struct ib_uv if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } + put_pd_read(pd); + put_cq_read(scq); + put_cq_read(rcq); + if (srq) + put_srq_read(srq); + mutex_lock(&file->mutex); - list_add_tail(&uobj->uevent.uobject.list, &file->ucontext->qp_list); + list_add_tail(&obj->uevent.uobject.list, &file->ucontext->qp_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + obj->uevent.uobject.live = 1; + + up_write(&obj->uevent.uobject.mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_qp_idr, uobj->uevent.uobject.id); +err_copy: + idr_remove_uobj(&ib_uverbs_qp_idr, &obj->uevent.uobject); err_destroy: ib_destroy_qp(qp); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); - - kfree(uobj); +err_put: + if (pd) + put_pd_read(pd); + if (scq) + put_cq_read(scq); + if (rcq) + put_cq_read(rcq); + if (srq) + put_srq_read(srq); + + put_uobj_write(&obj->uevent.uobject); return ret; } @@ -994,15 +1180,15 @@ ssize_t ib_uverbs_query_qp(struct ib_uve goto out; } - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (qp && qp->uobject->context == file->ucontext) - ret = ib_query_qp(qp, attr, cmd.attr_mask, init_attr); - else + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) { ret = -EINVAL; + goto out; + } + + ret = ib_query_qp(qp, attr, cmd.attr_mask, init_attr); - mutex_unlock(&ib_uverbs_idr_mutex); + put_qp_read(qp); if (ret) goto out; @@ -1089,10 +1275,8 @@ ssize_t ib_uverbs_modify_qp(struct ib_uv if (!attr) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) { + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) { ret = -EINVAL; goto out; } @@ -1144,13 +1328,15 @@ ssize_t ib_uverbs_modify_qp(struct ib_uv attr->alt_ah_attr.port_num = 
cmd.alt_dest.port_num; ret = ib_modify_qp(qp, attr, cmd.attr_mask); + + put_qp_read(qp); + if (ret) goto out; ret = in_len; out: - mutex_unlock(&ib_uverbs_idr_mutex); kfree(attr); return ret; @@ -1162,8 +1348,9 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u { struct ib_uverbs_destroy_qp cmd; struct ib_uverbs_destroy_qp_resp resp; + struct ib_uobject *uobj; struct ib_qp *qp; - struct ib_uqp_object *uobj; + struct ib_uqp_object *obj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1171,43 +1358,43 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u memset(&resp, 0, sizeof resp); - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) - goto out; - - uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + uobj = idr_write_uobj(&ib_uverbs_qp_idr, cmd.qp_handle, file->ucontext); + if (!uobj) + return -EINVAL; + qp = uobj->object; + obj = container_of(uobj, struct ib_uqp_object, uevent.uobject); - if (!list_empty(&uobj->mcast_list)) { - ret = -EBUSY; - goto out; + if (!list_empty(&obj->mcast_list)) { + put_uobj_write(uobj); + return -EBUSY; } ret = ib_destroy_qp(qp); + if (!ret) + uobj->live = 0; + + put_uobj_write(uobj); + if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_qp_idr, cmd.qp_handle); + idr_remove_uobj(&ib_uverbs_qp_idr, uobj); mutex_lock(&file->mutex); - list_del(&uobj->uevent.uobject.list); + list_del(&uobj->list); mutex_unlock(&file->mutex); - ib_uverbs_release_uevent(file, &uobj->uevent); + ib_uverbs_release_uevent(file, &obj->uevent); - resp.events_reported = uobj->uevent.events_reported; + resp.events_reported = obj->uevent.events_reported; - kfree(uobj); + put_uobj(uobj); if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) - ret = -EFAULT; - -out: - mutex_unlock(&ib_uverbs_idr_mutex); + return -EFAULT; - return ret ? 
ret : in_len; + return in_len; } ssize_t ib_uverbs_post_send(struct ib_uverbs_file *file, @@ -1220,6 +1407,7 @@ ssize_t ib_uverbs_post_send(struct ib_uv struct ib_send_wr *wr = NULL, *last, *next, *bad_wr; struct ib_qp *qp; int i, sg_ind; + int is_ud; ssize_t ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1236,12 +1424,11 @@ ssize_t ib_uverbs_post_send(struct ib_uv if (!user_wr) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) goto out; + is_ud = qp->qp_type == IB_QPT_UD; sg_ind = 0; last = NULL; for (i = 0; i < cmd.wr_count; ++i) { @@ -1249,12 +1436,12 @@ ssize_t ib_uverbs_post_send(struct ib_uv buf + sizeof cmd + i * cmd.wqe_size, cmd.wqe_size)) { ret = -EFAULT; - goto out; + goto out_put; } if (user_wr->num_sge + sg_ind > cmd.sge_count) { ret = -EINVAL; - goto out; + goto out_put; } next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) + @@ -1262,7 +1449,7 @@ ssize_t ib_uverbs_post_send(struct ib_uv GFP_KERNEL); if (!next) { ret = -ENOMEM; - goto out; + goto out_put; } if (!last) @@ -1278,12 +1465,12 @@ ssize_t ib_uverbs_post_send(struct ib_uv next->send_flags = user_wr->send_flags; next->imm_data = (__be32 __force) user_wr->imm_data; - if (qp->qp_type == IB_QPT_UD) { - next->wr.ud.ah = idr_find(&ib_uverbs_ah_idr, - user_wr->wr.ud.ah); + if (is_ud) { + next->wr.ud.ah = idr_read_ah(user_wr->wr.ud.ah, + file->ucontext); if (!next->wr.ud.ah) { ret = -EINVAL; - goto out; + goto out_put; } next->wr.ud.remote_qpn = user_wr->wr.ud.remote_qpn; next->wr.ud.remote_qkey = user_wr->wr.ud.remote_qkey; @@ -1320,7 +1507,7 @@ ssize_t ib_uverbs_post_send(struct ib_uv sg_ind * sizeof (struct ib_sge), next->num_sge * sizeof (struct ib_sge))) { ret = -EFAULT; - goto out; + goto out_put; } sg_ind += next->num_sge; } else @@ -1340,10 +1527,13 @@ ssize_t ib_uverbs_post_send(struct ib_uv &resp, 
sizeof resp)) ret = -EFAULT; -out: - mutex_unlock(&ib_uverbs_idr_mutex); +out_put: + put_qp_read(qp); +out: while (wr) { + if (is_ud && wr->wr.ud.ah) + put_ah_read(wr->wr.ud.ah); next = wr->next; kfree(wr); wr = next; @@ -1458,14 +1648,15 @@ ssize_t ib_uverbs_post_recv(struct ib_uv if (IS_ERR(wr)) return PTR_ERR(wr); - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) goto out; resp.bad_wr = 0; ret = qp->device->post_recv(qp, wr, &bad_wr); + + put_qp_read(qp); + if (ret) for (next = wr; next; next = next->next) { ++resp.bad_wr; @@ -1479,8 +1670,6 @@ ssize_t ib_uverbs_post_recv(struct ib_uv ret = -EFAULT; out: - mutex_unlock(&ib_uverbs_idr_mutex); - while (wr) { next = wr->next; kfree(wr); @@ -1509,14 +1698,15 @@ ssize_t ib_uverbs_post_srq_recv(struct i if (IS_ERR(wr)) return PTR_ERR(wr); - mutex_lock(&ib_uverbs_idr_mutex); - - srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); - if (!srq || srq->uobject->context != file->ucontext) + srq = idr_read_srq(cmd.srq_handle, file->ucontext); + if (!srq) goto out; resp.bad_wr = 0; ret = srq->device->post_srq_recv(srq, wr, &bad_wr); + + put_srq_read(srq); + if (ret) for (next = wr; next; next = next->next) { ++resp.bad_wr; @@ -1530,8 +1720,6 @@ ssize_t ib_uverbs_post_srq_recv(struct i ret = -EFAULT; out: - mutex_unlock(&ib_uverbs_idr_mutex); - while (wr) { next = wr->next; kfree(wr); @@ -1563,17 +1751,15 @@ ssize_t ib_uverbs_create_ah(struct ib_uv if (!uobj) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); + init_uobj(uobj, cmd.user_handle, file->ucontext); + down_write(&uobj->mutex); - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); - if (!pd || pd->uobject->context != file->ucontext) { + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + if (!pd) { ret = -EINVAL; - goto err_up; + goto err; } - uobj->user_handle = cmd.user_handle; - uobj->context = file->ucontext; - 
attr.dlid = cmd.attr.dlid; attr.sl = cmd.attr.sl; attr.src_path_bits = cmd.attr.src_path_bits; @@ -1589,12 +1775,13 @@ ssize_t ib_uverbs_create_ah(struct ib_uv ah = ib_create_ah(pd, &attr); if (IS_ERR(ah)) { ret = PTR_ERR(ah); - goto err_up; + goto err; } - ah->uobject = uobj; + ah->uobject = uobj; + uobj->object = ah; - ret = idr_add_uobj(&ib_uverbs_ah_idr, ah, uobj); + ret = idr_add_uobj(&ib_uverbs_ah_idr, uobj); if (ret) goto err_destroy; @@ -1603,27 +1790,29 @@ ssize_t ib_uverbs_create_ah(struct ib_uv if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } + put_pd_read(pd); + mutex_lock(&file->mutex); list_add_tail(&uobj->list, &file->ucontext->ah_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + uobj->live = 1; + + up_write(&uobj->mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_ah_idr, uobj->id); +err_copy: + idr_remove_uobj(&ib_uverbs_ah_idr, uobj); err_destroy: ib_destroy_ah(ah); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); - - kfree(uobj); +err: + put_uobj_write(uobj); return ret; } @@ -1633,35 +1822,34 @@ ssize_t ib_uverbs_destroy_ah(struct ib_u struct ib_uverbs_destroy_ah cmd; struct ib_ah *ah; struct ib_uobject *uobj; - int ret = -EINVAL; + int ret; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); + uobj = idr_write_uobj(&ib_uverbs_ah_idr, cmd.ah_handle, file->ucontext); + if (!uobj) + return -EINVAL; + ah = uobj->object; - ah = idr_find(&ib_uverbs_ah_idr, cmd.ah_handle); - if (!ah || ah->uobject->context != file->ucontext) - goto out; + ret = ib_destroy_ah(ah); + if (!ret) + uobj->live = 0; - uobj = ah->uobject; + put_uobj_write(uobj); - ret = ib_destroy_ah(ah); if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_ah_idr, cmd.ah_handle); + idr_remove_uobj(&ib_uverbs_ah_idr, uobj); mutex_lock(&file->mutex); list_del(&uobj->list); mutex_unlock(&file->mutex); - kfree(uobj); + 
put_uobj(uobj); -out: - mutex_unlock(&ib_uverbs_idr_mutex); - - return ret ? ret : in_len; + return in_len; } ssize_t ib_uverbs_attach_mcast(struct ib_uverbs_file *file, @@ -1670,47 +1858,43 @@ ssize_t ib_uverbs_attach_mcast(struct ib { struct ib_uverbs_attach_mcast cmd; struct ib_qp *qp; - struct ib_uqp_object *uobj; + struct ib_uqp_object *obj; struct ib_uverbs_mcast_entry *mcast; - int ret = -EINVAL; + int ret; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) - goto out; + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) + return -EINVAL; - uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + obj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); - list_for_each_entry(mcast, &uobj->mcast_list, list) + list_for_each_entry(mcast, &obj->mcast_list, list) if (cmd.mlid == mcast->lid && !memcmp(cmd.gid, mcast->gid.raw, sizeof mcast->gid.raw)) { ret = 0; - goto out; + goto out_put; } mcast = kmalloc(sizeof *mcast, GFP_KERNEL); if (!mcast) { ret = -ENOMEM; - goto out; + goto out_put; } mcast->lid = cmd.mlid; memcpy(mcast->gid.raw, cmd.gid, sizeof mcast->gid.raw); ret = ib_attach_mcast(qp, &mcast->gid, cmd.mlid); - if (!ret) { - uobj = container_of(qp->uobject, struct ib_uqp_object, - uevent.uobject); - list_add_tail(&mcast->list, &uobj->mcast_list); - } else + if (!ret) + list_add_tail(&mcast->list, &obj->mcast_list); + else kfree(mcast); -out: - mutex_unlock(&ib_uverbs_idr_mutex); +out_put: + put_qp_read(qp); return ret ? 
ret : in_len; } @@ -1720,7 +1904,7 @@ ssize_t ib_uverbs_detach_mcast(struct ib int out_len) { struct ib_uverbs_detach_mcast cmd; - struct ib_uqp_object *uobj; + struct ib_uqp_object *obj; struct ib_qp *qp; struct ib_uverbs_mcast_entry *mcast; int ret = -EINVAL; @@ -1728,19 +1912,17 @@ ssize_t ib_uverbs_detach_mcast(struct ib if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (!qp || qp->uobject->context != file->ucontext) - goto out; + qp = idr_read_qp(cmd.qp_handle, file->ucontext); + if (!qp) + return -EINVAL; ret = ib_detach_mcast(qp, (union ib_gid *) cmd.gid, cmd.mlid); if (ret) - goto out; + goto out_put; - uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + obj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); - list_for_each_entry(mcast, &uobj->mcast_list, list) + list_for_each_entry(mcast, &obj->mcast_list, list) if (cmd.mlid == mcast->lid && !memcmp(cmd.gid, mcast->gid.raw, sizeof mcast->gid.raw)) { list_del(&mcast->list); @@ -1748,8 +1930,8 @@ ssize_t ib_uverbs_detach_mcast(struct ib break; } -out: - mutex_unlock(&ib_uverbs_idr_mutex); +out_put: + put_qp_read(qp); return ret ? 
ret : in_len; } @@ -1761,7 +1943,7 @@ ssize_t ib_uverbs_create_srq(struct ib_u struct ib_uverbs_create_srq cmd; struct ib_uverbs_create_srq_resp resp; struct ib_udata udata; - struct ib_uevent_object *uobj; + struct ib_uevent_object *obj; struct ib_pd *pd; struct ib_srq *srq; struct ib_srq_init_attr attr; @@ -1777,17 +1959,17 @@ ssize_t ib_uverbs_create_srq(struct ib_u (unsigned long) cmd.response + sizeof resp, in_len - sizeof cmd, out_len - sizeof resp); - uobj = kmalloc(sizeof *uobj, GFP_KERNEL); - if (!uobj) + obj = kmalloc(sizeof *obj, GFP_KERNEL); + if (!obj) return -ENOMEM; - mutex_lock(&ib_uverbs_idr_mutex); - - pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); + init_uobj(&obj->uobject, 0, file->ucontext); + down_write(&obj->uobject.mutex); - if (!pd || pd->uobject->context != file->ucontext) { + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + if (!pd) { ret = -EINVAL; - goto err_up; + goto err; } attr.event_handler = ib_uverbs_srq_event_handler; @@ -1796,59 +1978,59 @@ ssize_t ib_uverbs_create_srq(struct ib_u attr.attr.max_sge = cmd.max_sge; attr.attr.srq_limit = cmd.srq_limit; - uobj->uobject.user_handle = cmd.user_handle; - uobj->uobject.context = file->ucontext; - uobj->events_reported = 0; - INIT_LIST_HEAD(&uobj->event_list); + obj->events_reported = 0; + INIT_LIST_HEAD(&obj->event_list); srq = pd->device->create_srq(pd, &attr, &udata); if (IS_ERR(srq)) { ret = PTR_ERR(srq); - goto err_up; + goto err; } srq->device = pd->device; srq->pd = pd; - srq->uobject = &uobj->uobject; + srq->uobject = &obj->uobject; srq->event_handler = attr.event_handler; srq->srq_context = attr.srq_context; atomic_inc(&pd->usecnt); atomic_set(&srq->usecnt, 0); - memset(&resp, 0, sizeof resp); - - ret = idr_add_uobj(&ib_uverbs_srq_idr, srq, &uobj->uobject); + obj->uobject.object = srq; + ret = idr_add_uobj(&ib_uverbs_srq_idr, &obj->uobject); if (ret) goto err_destroy; - resp.srq_handle = uobj->uobject.id; + memset(&resp, 0, sizeof resp); + resp.srq_handle = obj->uobject.id; 
resp.max_wr = attr.attr.max_wr; resp.max_sge = attr.attr.max_sge; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; - goto err_idr; + goto err_copy; } + put_pd_read(pd); + mutex_lock(&file->mutex); - list_add_tail(&uobj->uobject.list, &file->ucontext->srq_list); + list_add_tail(&obj->uobject.list, &file->ucontext->srq_list); mutex_unlock(&file->mutex); - mutex_unlock(&ib_uverbs_idr_mutex); + obj->uobject.live = 1; + + up_write(&obj->uobject.mutex); return in_len; -err_idr: - idr_remove(&ib_uverbs_srq_idr, uobj->uobject.id); +err_copy: + idr_remove_uobj(&ib_uverbs_srq_idr, &obj->uobject); err_destroy: ib_destroy_srq(srq); -err_up: - mutex_unlock(&ib_uverbs_idr_mutex); - - kfree(uobj); +err: + put_uobj_write(&obj->uobject); return ret; } @@ -1864,21 +2046,16 @@ ssize_t ib_uverbs_modify_srq(struct ib_u if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); - if (!srq || srq->uobject->context != file->ucontext) { - ret = -EINVAL; - goto out; - } + srq = idr_read_srq(cmd.srq_handle, file->ucontext); + if (!srq) + return -EINVAL; attr.max_wr = cmd.max_wr; attr.srq_limit = cmd.srq_limit; ret = ib_modify_srq(srq, &attr, cmd.attr_mask); -out: - mutex_unlock(&ib_uverbs_idr_mutex); + put_srq_read(srq); return ret ? 
ret : in_len; } @@ -1899,18 +2076,16 @@ ssize_t ib_uverbs_query_srq(struct ib_uv if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); + srq = idr_read_srq(cmd.srq_handle, file->ucontext); + if (!srq) + return -EINVAL; - srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); - if (srq && srq->uobject->context == file->ucontext) - ret = ib_query_srq(srq, &attr); - else - ret = -EINVAL; + ret = ib_query_srq(srq, &attr); - mutex_unlock(&ib_uverbs_idr_mutex); + put_srq_read(srq); if (ret) - goto out; + return ret; memset(&resp, 0, sizeof resp); @@ -1920,10 +2095,9 @@ ssize_t ib_uverbs_query_srq(struct ib_uv if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) - ret = -EFAULT; + return -EFAULT; -out: - return ret ? ret : in_len; + return in_len; } ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file, @@ -1932,45 +2106,45 @@ ssize_t ib_uverbs_destroy_srq(struct ib_ { struct ib_uverbs_destroy_srq cmd; struct ib_uverbs_destroy_srq_resp resp; + struct ib_uobject *uobj; struct ib_srq *srq; - struct ib_uevent_object *uobj; + struct ib_uevent_object *obj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - mutex_lock(&ib_uverbs_idr_mutex); - - memset(&resp, 0, sizeof resp); + uobj = idr_write_uobj(&ib_uverbs_srq_idr, cmd.srq_handle, file->ucontext); + if (!uobj) + return -EINVAL; + srq = uobj->object; + obj = container_of(uobj, struct ib_uevent_object, uobject); - srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); - if (!srq || srq->uobject->context != file->ucontext) - goto out; + ret = ib_destroy_srq(srq); + if (!ret) + uobj->live = 0; - uobj = container_of(srq->uobject, struct ib_uevent_object, uobject); + put_uobj_write(uobj); - ret = ib_destroy_srq(srq); if (ret) - goto out; + return ret; - idr_remove(&ib_uverbs_srq_idr, cmd.srq_handle); + idr_remove_uobj(&ib_uverbs_srq_idr, uobj); mutex_lock(&file->mutex); - list_del(&uobj->uobject.list); + list_del(&uobj->list); 
mutex_unlock(&file->mutex); - ib_uverbs_release_uevent(file, uobj); + ib_uverbs_release_uevent(file, obj); - resp.events_reported = uobj->events_reported; + memset(&resp, 0, sizeof resp); + resp.events_reported = obj->events_reported; - kfree(uobj); + put_uobj(uobj); if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) ret = -EFAULT; -out: - mutex_unlock(&ib_uverbs_idr_mutex); - return ret ? ret : in_len; } diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index ff092a0..5ec2d49 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -66,7 +66,7 @@ #define IB_UVERBS_BASE_DEV MKDEV(IB_UVER static struct class *uverbs_class; -DEFINE_MUTEX(ib_uverbs_idr_mutex); +DEFINE_SPINLOCK(ib_uverbs_idr_lock); DEFINE_IDR(ib_uverbs_pd_idr); DEFINE_IDR(ib_uverbs_mr_idr); DEFINE_IDR(ib_uverbs_mw_idr); @@ -183,21 +183,21 @@ static int ib_uverbs_cleanup_ucontext(st if (!context) return 0; - mutex_lock(&ib_uverbs_idr_mutex); - list_for_each_entry_safe(uobj, tmp, &context->ah_list, list) { - struct ib_ah *ah = idr_find(&ib_uverbs_ah_idr, uobj->id); - idr_remove(&ib_uverbs_ah_idr, uobj->id); + struct ib_ah *ah = uobj->object; + + idr_remove_uobj(&ib_uverbs_ah_idr, uobj); ib_destroy_ah(ah); list_del(&uobj->list); kfree(uobj); } list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) { - struct ib_qp *qp = idr_find(&ib_uverbs_qp_idr, uobj->id); + struct ib_qp *qp = uobj->object; struct ib_uqp_object *uqp = container_of(uobj, struct ib_uqp_object, uevent.uobject); - idr_remove(&ib_uverbs_qp_idr, uobj->id); + + idr_remove_uobj(&ib_uverbs_qp_idr, uobj); ib_uverbs_detach_umcast(qp, uqp); ib_destroy_qp(qp); list_del(&uobj->list); @@ -206,11 +206,12 @@ static int ib_uverbs_cleanup_ucontext(st } list_for_each_entry_safe(uobj, tmp, &context->cq_list, list) { - struct ib_cq *cq = idr_find(&ib_uverbs_cq_idr, uobj->id); + struct ib_cq *cq = uobj->object; struct ib_uverbs_event_file 
*ev_file = cq->cq_context; struct ib_ucq_object *ucq = container_of(uobj, struct ib_ucq_object, uobject); - idr_remove(&ib_uverbs_cq_idr, uobj->id); + + idr_remove_uobj(&ib_uverbs_cq_idr, uobj); ib_destroy_cq(cq); list_del(&uobj->list); ib_uverbs_release_ucq(file, ev_file, ucq); @@ -218,10 +219,11 @@ static int ib_uverbs_cleanup_ucontext(st } list_for_each_entry_safe(uobj, tmp, &context->srq_list, list) { - struct ib_srq *srq = idr_find(&ib_uverbs_srq_idr, uobj->id); + struct ib_srq *srq = uobj->object; struct ib_uevent_object *uevent = container_of(uobj, struct ib_uevent_object, uobject); - idr_remove(&ib_uverbs_srq_idr, uobj->id); + + idr_remove_uobj(&ib_uverbs_srq_idr, uobj); ib_destroy_srq(srq); list_del(&uobj->list); ib_uverbs_release_uevent(file, uevent); @@ -231,11 +233,11 @@ static int ib_uverbs_cleanup_ucontext(st /* XXX Free MWs */ list_for_each_entry_safe(uobj, tmp, &context->mr_list, list) { - struct ib_mr *mr = idr_find(&ib_uverbs_mr_idr, uobj->id); + struct ib_mr *mr = uobj->object; struct ib_device *mrdev = mr->device; struct ib_umem_object *memobj; - idr_remove(&ib_uverbs_mr_idr, uobj->id); + idr_remove_uobj(&ib_uverbs_mr_idr, uobj); ib_dereg_mr(mr); memobj = container_of(uobj, struct ib_umem_object, uobject); @@ -246,15 +248,14 @@ static int ib_uverbs_cleanup_ucontext(st } list_for_each_entry_safe(uobj, tmp, &context->pd_list, list) { - struct ib_pd *pd = idr_find(&ib_uverbs_pd_idr, uobj->id); - idr_remove(&ib_uverbs_pd_idr, uobj->id); + struct ib_pd *pd = uobj->object; + + idr_remove_uobj(&ib_uverbs_pd_idr, uobj); ib_dealloc_pd(pd); list_del(&uobj->list); kfree(uobj); } - mutex_unlock(&ib_uverbs_idr_mutex); - return context->device->dealloc_ucontext(context); } diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 7ced208..ee1f3a3 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -697,8 +697,12 @@ struct ib_ucontext { struct ib_uobject { u64 user_handle; /* handle given to us by userspace */ struct ib_ucontext 
*context; /* associated user context */ + void *object; /* containing object */ struct list_head list; /* link to context's list */ u32 id; /* index into kernel idr */ + struct kref ref; + struct rw_semaphore mutex; /* protects .live */ + int live; }; struct ib_umem { From ralphc at pathscale.com Fri Jun 16 15:48:53 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 16 Jun 2006 15:48:53 -0700 Subject: [openib-general] [PATCH] update libipathverbs library to the new initialization method Message-ID: <1150498133.32252.111.camel@brick.pathscale.com> The current libipathverbs driver in the trunk doesn't conform to the new module initialization convention for libibverbs.so. This patch corrects that. Also, with this patch, we can now try testing the performance of Roland's changes to eliminate the single ib_uverbs_idr_mutex. Signed-off-by: Ralph Campbell Index: src/userspace/libipathverbs/src/ipathverbs.c =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.c (revision 8089) +++ src/userspace/libipathverbs/src/ipathverbs.c (working copy) @@ -145,30 +145,24 @@ .free_context = ipath_free_context }; -struct ibv_device *openib_driver_init(struct sysfs_class_device *sysdev) +struct ibv_device *ibv_driver_init(const char *uverbs_sys_path, + int abi_version) { - struct sysfs_device *pcidev; - struct sysfs_attribute *attr; + char value[8]; struct ipath_device *dev; - unsigned vendor, device; - int i; + unsigned vendor, device; + int i; - pcidev = sysfs_get_classdev_device(sysdev); - if (!pcidev) + if (ibv_read_sysfs_file(uverbs_sys_path, "device/vendor", + value, sizeof value) < 0) return NULL; + sscanf(value, "%i", &vendor); - attr = sysfs_get_device_attr(pcidev, "vendor"); - if (!attr) + if (ibv_read_sysfs_file(uverbs_sys_path, "device/device", + value, sizeof value) < 0) return NULL; - sscanf(attr->value, "%i", &vendor); - sysfs_close_attribute(attr); + sscanf(value, "%i", &device); - attr = 
sysfs_get_device_attr(pcidev, "device"); - if (!attr) - return NULL; - sscanf(attr->value, "%i", &device); - sysfs_close_attribute(attr); - for (i = 0; i < sizeof hca_table / sizeof hca_table[0]; ++i) if (vendor == hca_table[i].vendor && device == hca_table[i].device) @@ -180,13 +174,12 @@ dev = malloc(sizeof *dev); if (!dev) { fprintf(stderr, PFX "Fatal: couldn't allocate device for %s\n", - sysdev->name); - abort(); + uverbs_sys_path); + return NULL; } dev->ibv_dev.ops = ipath_dev_ops; dev->hca_type = hca_table[i].type; - dev->page_size = sysconf(_SC_PAGESIZE); return &dev->ibv_dev; } Index: libipathverbs/src/ipathverbs.h =================================================================== --- libipathverbs/src/ipathverbs.h (revision 8089) +++ libipathverbs/src/ipathverbs.h (working copy) @@ -57,7 +57,6 @@ struct ipath_device { struct ibv_device ibv_dev; enum ipath_hca_type hca_type; - int page_size; }; struct ipath_context { -- Ralph Campbell From rjwalsh at pathscale.com Fri Jun 16 15:51:22 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Fri, 16 Jun 2006 15:51:22 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: References: <20060613051149.GE4621@mellanox.co.il> <1150223140.11881.2.camel@hematite.internal.keyresearch.com> Message-ID: <1150498282.13304.0.camel@hematite.internal.keyresearch.com> On Fri, 2006-06-16 at 15:07 -0700, Roland Dreier wrote: > Robert, can you confirm that the new uverbs locking scheme helps the > performance problems you're having? Sure - I'll take a look on Monday. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From rdreier at cisco.com Fri Jun 16 16:12:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jun 2006 16:12:59 -0700 Subject: [openib-general] [PATCH] update libipathverbs library to the new initialization method In-Reply-To: <1150498133.32252.111.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 16 Jun 2006 15:48:53 -0700") References: <1150498133.32252.111.camel@brick.pathscale.com> Message-ID: > The current libipathverbs driver in the trunk doesn't > conform to the new module initialization convention for > libibverbs.so. This patch corrects that. Looks OK but you're now only compatible with unreleased development versions of libibverbs -- this won't work against the stable libibverbs 1.0 code shipped with Fedora and Debian for example. You might want to follow the approach libmthca uses to build against both libibverbs 1.0 and also pre-1.1 development code. > Also, with this patch, we can now try testing the performance > of Roland's changes to eliminate the single ib_uverbs_idr_mutex. Glad you're going to test, but why do you need this patch? Couldn't you just have put a new kernel onto a system with libibverbs 1.0 and the old libipathverbs? - R. From ralphc at pathscale.com Fri Jun 16 16:15:41 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 16 Jun 2006 16:15:41 -0700 Subject: [openib-general] [PATCH] resend: update libipathverbs library to the new initialization method Message-ID: <1150499741.32252.119.camel@brick.pathscale.com> The patch I just sent left out a minor change so please ignore the previous patch and apply this one instead. 
(I forgot to include the change to the map file) Signed-off-by: Ralph Campbell Index: src/userspace/libipathverbs/src/ipathverbs.c =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.c (revision 8089) +++ src/userspace/libipathverbs/src/ipathverbs.c (working copy) @@ -145,30 +145,24 @@ .free_context = ipath_free_context }; -struct ibv_device *openib_driver_init(struct sysfs_class_device *sysdev) +struct ibv_device *ibv_driver_init(const char *uverbs_sys_path, + int abi_version) { - struct sysfs_device *pcidev; - struct sysfs_attribute *attr; + char value[8]; struct ipath_device *dev; - unsigned vendor, device; - int i; + unsigned vendor, device; + int i; - pcidev = sysfs_get_classdev_device(sysdev); - if (!pcidev) + if (ibv_read_sysfs_file(uverbs_sys_path, "device/vendor", + value, sizeof value) < 0) return NULL; + sscanf(value, "%i", &vendor); - attr = sysfs_get_device_attr(pcidev, "vendor"); - if (!attr) + if (ibv_read_sysfs_file(uverbs_sys_path, "device/device", + value, sizeof value) < 0) return NULL; - sscanf(attr->value, "%i", &vendor); - sysfs_close_attribute(attr); + sscanf(value, "%i", &device); - attr = sysfs_get_device_attr(pcidev, "device"); - if (!attr) - return NULL; - sscanf(attr->value, "%i", &device); - sysfs_close_attribute(attr); - for (i = 0; i < sizeof hca_table / sizeof hca_table[0]; ++i) if (vendor == hca_table[i].vendor && device == hca_table[i].device) @@ -180,13 +174,12 @@ dev = malloc(sizeof *dev); if (!dev) { fprintf(stderr, PFX "Fatal: couldn't allocate device for %s\n", - sysdev->name); - abort(); + uverbs_sys_path); + return NULL; } dev->ibv_dev.ops = ipath_dev_ops; dev->hca_type = hca_table[i].type; - dev->page_size = sysconf(_SC_PAGESIZE); return &dev->ibv_dev; } Index: src/usrspace/libipathverbs/src/ipathverbs.h =================================================================== --- src/usrspace/libipathverbs/src/ipathverbs.h (revision 8089) +++ 
src/usrspace/libipathverbs/src/ipathverbs.h (working copy) @@ -57,7 +57,6 @@ struct ipath_device { struct ibv_device ibv_dev; enum ipath_hca_type hca_type; - int page_size; }; struct ipath_context { Index: src/userspace/libipathverbs/src/ipathverbs.map =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.map (revision 8089) +++ src/userspace/libipathverbs/src/ipathverbs.map (working copy) @@ -1,4 +1,4 @@ { - global: openib_driver_init; + global: ibv_driver_init; local: *; }; -- Ralph Campbell From rdreier at cisco.com Fri Jun 16 16:26:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jun 2006 16:26:20 -0700 Subject: [openib-general] [PATCH] resend: update libipathverbs library to the new initialization method In-Reply-To: <1150499741.32252.119.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 16 Jun 2006 16:15:41 -0700") References: <1150499741.32252.119.camel@brick.pathscale.com> Message-ID: > The patch I just sent left out a minor change so please > ignore the previous patch and apply this one instead. > (I forgot to include the change to the map file) You can just go ahead and check in libipathverbs changes yourself -- qlogic is definitely going to be the maintainer of that code. - R. From ralphc at pathscale.com Fri Jun 16 16:30:32 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 16 Jun 2006 16:30:32 -0700 Subject: [openib-general] [PATCH] update libipathverbs library to the new initialization method In-Reply-To: References: <1150498133.32252.111.camel@brick.pathscale.com> Message-ID: <1150500632.32252.128.camel@brick.pathscale.com> On Fri, 2006-06-16 at 16:12 -0700, Roland Dreier wrote: > > The current libipathverbs driver in the trunk doesn't > > conform to the new module initialization convention for > > libibverbs.so. This patch corrects that. 
>
> Looks OK but you're now only compatible with unreleased development
> versions of libibverbs -- this won't work against the stable
> libibverbs 1.0 code shipped with Fedora and Debian for example.
>
> You might want to follow the approach libmthca uses to build against
> both libibverbs 1.0 and also pre-1.1 development code.

It's not hard to allow 1.1 libipathverbs to build against 1.0 libibverbs
for just this change, but I suspect the mmap stuff and other 1.1 changes
might not be so easy. I don't really think it makes sense to support
every combination of up- and down-rev compile-time and run-time
compatibility.

> > Also, with this patch, we can now try testing the performance
> > of Roland's changes to eliminate the single ib_uverbs_idr_mutex.
>
> Glad you're going to test, but why do you need this patch? Couldn't
> you just have put a new kernel onto a system with libibverbs 1.0 and
> the old libipathverbs?
>
> - R.

Sure. I was just in the middle of getting the trunk to run again when
you sent your request.
-- Ralph Campbell

From nickpiggin at yahoo.com.au Fri Jun 16 20:59:12 2006
From: nickpiggin at yahoo.com.au (Nick Piggin)
Date: Sat, 17 Jun 2006 13:59:12 +1000
Subject: [openib-general] [PATCH v2 4/7] AMSO1100 Memory Management.
In-Reply-To: <1150128349.22704.20.camel@trinity.ogc.int>
References: <20060607200646.9259.24588.stgit@stevo-desktop> <20060607200655.9259.90768.stgit@stevo-desktop> <20060608011744.1a66e85a.akpm@osdl.org> <1150128349.22704.20.camel@trinity.ogc.int>
Message-ID: <44937E10.3000006@yahoo.com.au>

Tom Tucker wrote:
> On Thu, 2006-06-08 at 01:17 -0700, Andrew Morton wrote:
>
>> On Wed, 07 Jun 2006 15:06:55 -0500 Steve Wise wrote:
>>
>>> +void c2_free(struct c2_alloc *alloc, u32 obj)
>>> +{
>>> +	spin_lock(&alloc->lock);
>>> +	clear_bit(obj, alloc->table);
>>> +	spin_unlock(&alloc->lock);
>>> +}
>>
>> The spinlock is unneeded here.
>
> Good point.

Really? clear_bit does not give you any memory ordering, so you can have
the situation where another CPU sees the bit cleared, but this CPU still
has stores pending to whatever is being freed. Or any number of other
nasty memory ordering badness.

I'd just use the spinlocks, and prepend the clear_bit with a double
underscore (so you get the non-atomic version), if that is appropriate.
The spinlocks nicely handle all the memory ordering issues, and serve to
document the concurrency. If you need every last bit of performance and
scalability, that's OK, but you need comments and I suspect you'd need
more memory barriers.

--
SUSE Labs, Novell Inc.
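Nick's point -- that an atomic clear_bit by itself carries no ordering guarantee, while a lock/unlock pair both serializes writers and orders the freeing CPU's earlier stores -- can be illustrated outside the kernel. The sketch below is a userspace model of the c2_free pattern, with a pthread mutex standing in for the kernel spinlock; all names (toy_alloc, toy_get, toy_free) are invented for illustration, not taken from the AMSO1100 driver.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define TABLE_WORDS 4

/*
 * Toy userspace model of a bitmap object allocator like c2_alloc.
 * The mutex plays the role of the kernel spinlock.
 */
struct toy_alloc {
	pthread_mutex_t lock;        /* stands in for spinlock_t */
	uint32_t table[TABLE_WORDS]; /* one bit per object; set = in use */
};

static void toy_init(struct toy_alloc *a)
{
	pthread_mutex_init(&a->lock, NULL);
	for (int i = 0; i < TABLE_WORDS; ++i)
		a->table[i] = 0;
}

/* Claim object 'obj'; returns 1 if it was free, 0 if already taken. */
static int toy_get(struct toy_alloc *a, uint32_t obj)
{
	uint32_t mask = 1u << (obj % 32);
	int was_free;

	pthread_mutex_lock(&a->lock);
	was_free = !(a->table[obj / 32] & mask);
	a->table[obj / 32] |= mask;
	pthread_mutex_unlock(&a->lock);
	return was_free;
}

/*
 * Free object 'obj'.  The clear is a plain, non-atomic read-modify-write
 * (the kernel analogue would be __clear_bit) because the lock already
 * serializes writers AND orders the freeing thread's earlier stores
 * before any thread that later observes the bit as clear.
 */
static void toy_free(struct toy_alloc *a, uint32_t obj)
{
	pthread_mutex_lock(&a->lock);
	a->table[obj / 32] &= ~(1u << (obj % 32));
	pthread_mutex_unlock(&a->lock);
}
```

Dropping the lock and keeping only an atomic clear would remove the ordering: another CPU could observe the bit clear, reallocate the object, and race with stores the freeing CPU had not yet made visible -- exactly the hazard Nick describes.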
From panda at cse.ohio-state.edu Fri Jun 16 20:57:04 2006
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Fri, 16 Jun 2006 23:57:04 -0400 (EDT)
Subject: [openib-general] Announcing the availability of MVAPICH 0.9.8-rc0 with on-demand connection management, fault-tolerance and advanced multi-rail scheduling support
Message-ID: <200606170357.k5H3v4w4025857@xi.cse.ohio-state.edu>

The MVAPICH team is pleased to announce the availability of MVAPICH 0.9.8-rc0 with the following new features:

- On-demand connection management using native InfiniBand Unreliable
  Datagram (UD) support. This feature enables InfiniBand connections to
  be set up dynamically, enhancing the scalability of MVAPICH on
  multi-thousand node clusters.

- Support for fault tolerance: mem-to-mem reliable data transfer
  (detection of I/O bus errors with 32-bit CRC and retransmission in
  case of error). This mode enables MVAPICH to deliver messages reliably
  in the presence of I/O bus errors.

- Multi-rail communication support with flexible scheduling policies:
  - Separate control of small and large message scheduling
  - Three different scheduling policies for small messages:
    Using First Subchannel, Round Robin and Process Binding
  - Six different scheduling policies for large messages:
    Round Robin, Weighted Striping, Even Striping, Stripe Blocking,
    Adaptive Striping and Process Binding

- Shared library support for Solaris

- Integrated and easy-to-use build script which automatically detects
  system architecture and InfiniBand adapter types and optimizes MVAPICH
  for any particular installation

More details on all features and supported platforms can be obtained by visiting the project's web page -> Overview -> features.
For downloading MVAPICH 0.9.8-rc0 package and accessing the anonymous SVN, please visit the following URL: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ A stripped down version of this release is also available at the OpenIB SVN. Under the download page of the above URL, the latest testing results of this rc0 version for different platforms and test suites are shown. It also shows the rigorous testing procedures being used by the team for MVAPICH and MVAPICH2 releases. As soon as the remaining tests are done, we will make a formal release for MVAPICH 0.9.8. All feedbacks, including bug reports, hints for performance tuning, patches and enhancements are welcome. Please post it to mvapich-discuss mailing list. Thanks, MVAPICH Team at OSU/NBCL From eitan at mellanox.co.il Sat Jun 17 12:36:40 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 17 Jun 2006 22:36:40 +0300 Subject: [openib-general] [PATCH] osm: partition manager force policy In-Reply-To: <20060615181524.GB24808@sashak.voltaire.com> References: <86odwxgqrs.fsf@mtl066.yok.mtl.com> <20060615110617.GA21560@sashak.voltaire.com> <44915060.6090103@mellanox.co.il> <20060615181524.GB24808@sashak.voltaire.com> Message-ID: <449459C8.9050300@mellanox.co.il> Sasha Khapyorsky wrote: I'm working on the changes below. I will send them all as one patch EZ > Hi Eitan, > > On 15:19 Thu 15 Jun , Eitan Zahavi wrote: > >>>>+/* >>>>+* PARAMETERS >>>>+* p_physp >>>>+* [in] Pointer to an osm_physp_t object. >>>>+* >>>>+* RETURN VALUES >>>>+* The pointer to the P_Key table object. >>>>+* >>>>+* NOTES >>>>+* >>>>+* SEE ALSO >>>>+* Port, Physical Port >>>>+*********/ >>>>+ >>> >>> >>>Is not this simpler to remove 'const' from existing >>>osm_physp_get_pkey_tbl() function instead of using new one? >> >>There are plenty of const functions using this function internally >>so I would have need to fix them too. > > > You are right. Maybe separate patch for this? > I think it is preferable to keep the const function. 
> >>>>@@ -118,14 +121,29 @@ void osm_pkey_tbl_sync_new_blocks( >>>> p_block = cl_ptr_vector_get(&p_pkey_tbl->blocks, b); >>>> if ( b < new_blocks ) >>>> p_new_block = cl_ptr_vector_get(&p_pkey_tbl->new_blocks, b); >>>>- else { >>>>+ else >>>>+ { >>>> p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); >>>> if (!p_new_block) >>>> break; >>>>+ cl_ptr_vector_set(&((osm_pkey_tbl_t >>>>*)p_pkey_tbl)->new_blocks, + b, >>>>p_new_block); >>>>+ } >>>>+ >>>> memset(p_new_block, 0, sizeof(*p_new_block)); >>>>- cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, >>>>p_new_block); >>>> } >>>>- memcpy(p_new_block, p_block, sizeof(*p_new_block)); >>>>+} >>> >>> >>>You changed this function so it does not do any sync anymore. Should >>>function name be changed too? >> >>Yes correct I will change it. Is a better name: >>osm_pkey_tbl_init_new_blocks ? > > > Great name. > > >>>>+ to show that on the "old" blocks >>>>+*/ >>>>+int >>>>+osm_pkey_tbl_set_new_entry( >>>>+ IN osm_pkey_tbl_t *p_pkey_tbl, >>>>+ IN uint16_t block_idx, >>>>+ IN uint8_t pkey_idx, >>>>+ IN uint16_t pkey) >>>>+{ >>>>+ ib_pkey_table_t *p_old_block; >>>>+ ib_pkey_table_t *p_new_block; >>>>+ >>>>+ if (osm_pkey_tbl_make_block_pair( >>>>+ p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) >>>>+ return 1; >>>>+ >>>>+ cl_map_insert( &p_pkey_tbl->keys, >>>>+ ib_pkey_get_base(pkey), >>>>+ >>>>&(p_old_block->pkey_entry[pkey_idx])); >>> >>> >>>Here you map potentially empty pkey entry. Why? "old block" will be >>>remapped anyway on pkey receiving. >> >>The reason I did this was that if the GetResp will fail I still want to >>represent >>the settings in the map.But actually it might be better not to do that so >>next >>time we run we will not find it without a GetResp. > > > Agree. 
> > >>>>+ IN uint16_t *p_pkey, >>>>+ OUT uint32_t *p_block_idx, >>>>+ OUT uint8_t *p_pkey_index) >>>>+{ >>>>+ uint32_t num_of_blocks; >>>>+ uint32_t block_index; >>>>+ ib_pkey_table_t *block; >>>>+ >>>>+ CL_ASSERT( p_pkey_tbl ); >>>>+ CL_ASSERT( p_block_idx != NULL ); >>>>+ CL_ASSERT( p_pkey_idx != NULL ); >>> >>> >>>Why last two CL_ASSERTs? What should be problem with uninitialized >>>pointers here? >>> >> >>These are the outputs of the function. It does not make sense to call the >>functions with >>null output pointers (calling by ref) . Anyway instead of putting the check >>in the free build >>I used an assert > > > I see. Actually I've overlooked that addresses and not values are > checked. Please ignore this comment. > > >>>>+ >>>>+ p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); >>>>+ if (! p_pkey_tbl) >>> >>> ^^^^^^^^^^^^^ >>>Is it possible? >> >>Yes it is ! I run into it during testing. The port did not have any pkey >>table. > > > static inline osm_pkey_tbl_t * > osm_physp_get_mod_pkey_tbl( IN osm_physp_t* const p_physp ) > { > ... > return( &p_physp->pkeys ); > }; > > This returns the address of physp's pkeys field. Right? > Then if ( &p_physp->pkeys == NULL ) p_physp pointer should be equal to > unsigned equivalent of -(offset of pkey field in physp struct). Correct. I will remove the check. > > >>>>+ "Fail to allocate new pending pkey >>>>entry for node " >>>>+ "0x%016" PRIx64 " port %u\n", >>>>+ cl_ntoh64( osm_node_get_node_guid( >>>>p_node ) ), >>>>+ osm_physp_get_port_num( p_physp ) ); >>>>+ return; >>>>+ } >>>>+ p_pending->pkey = pkey; >>>>+ p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey >>>>) ); >>>>+ if ( !p_orig_pkey || >>>>+ (ib_pkey_get_base(*p_orig_pkey) != ib_pkey_get_base(pkey) >>>>)) >>> >>> >>>There the cases of new pkey and updated pkey membership is mixed. Why? >> >>I am not following your question. 
>>The specific case I am trying to catch is the one that for some reason the >>map points to >>a pkey entry that was modified somehow and is different then the one you >>would expect by >>the map. > > > Didn't understand it at first pass, now it is clearer. > > If pkey entry was modified somehow (how? bugs?), the assumption is that > mapping still be valid? Then it is not new entry (or we will change > pkey's index in the real table). > PKey table mismatch between the block and map should never happen. I will remove the check and replace that with an ASSERT so I catch the bug if we hit it. > >>>>+ { >>>>+ p_pending->is_new = TRUE; >>>>+ cl_qlist_insert_tail(&p_pkey_tbl->pending, >>>>(cl_list_item_t*)p_pending); >>>>+ stat = "inserted"; >>>>+ } >>>>+ else >>>>+ { >>>>+ p_pending->is_new = FALSE; >>>>+ if (osm_pkey_tbl_get_block_and_idx(p_pkey_tbl, p_orig_pkey, >>>>+ >>>>&p_pending->block, &p_pending->index)) >>> >>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >>>AFAIK in this function there were CL_ASSERTs which check for uinitialized >>>pointers. >> >>True. So the asserts are not required in this case. > > > Up to you. Actually this my comment may be ignored, as stated above I > didn't read this correctly. > > >>> >>>>+ { >>>>+ osm_log( p_log, OSM_LOG_ERROR, >>>>+ "pkey_mgr_process_physical_port: >>>>ERR 0503: " >>>>+ "Fail to obtain P_Key 0x%04x >>>>block and index for node " >>>>+ "0x%016" PRIx64 " port %u\n", >>>>+ cl_ntoh64( >>>>osm_node_get_node_guid( p_node ) ), >>>>+ osm_physp_get_port_num( >>>>p_physp ) ); >>>>+ return; >>>>+ } >>>>+ cl_qlist_insert_head(&p_pkey_tbl->pending, >>>>(cl_list_item_t*)p_pending); >>>>+ stat = "updated"; >>> >>> >>>Is it will be updated? It is likely "already there" case. No? >>> >>>Also in this case you can already put the pkey in new_block instead of >>>holding it in pending list. Then later you will only need to add new >>>pkeys. This may simplify the flow and even save some mem. 
>> >>True but in my mind it does not simplify - on the contrary it makes the >>partition between >>populating each port pending list and actually setting the pkey tables >>mixed. > > > I meant new_block filling, not actual setting. You will be able to > remove whole if { } else { } flow, as well as is_new, block and index > fields from 'pending' structure (actually only pkey value itself will > matter) - is it not nice simplification? I still prefer the clear staging: append to list when scanning the partitions and filling in the tables when looping on all ports. > > >>I do not think the memory impact deserves this mix of staging >> >> >>> > >>>>+ max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, >>>>p_physp ); >>>>+ if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) >>>> { >>>>- block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); >>>>- for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) >>>>+ osm_log( p_log, OSM_LOG_INFO, >>>>+ "pkey_mgr_update_port: " >>>>+ "Max number of blocks reduced from >>>>%u to %u " + "for node 0x%016" PRIx64 " >>>>port %u\n", >>>>+ p_pkey_tbl->max_blocks, >>>>max_num_of_blocks, >>>>+ cl_ntoh64( osm_node_get_node_guid( >>>>p_node ) ), >>>>+ osm_physp_get_port_num( p_physp ) ); >>>>+ } >>>>+ p_pkey_tbl->max_blocks = max_num_of_blocks; >>>>+ >>>>+ osm_pkey_tbl_sync_new_blocks( p_pkey_tbl ); >>>>+ cl_map_remove_all( &p_pkey_tbl->keys ); >>> >>> >>>What is the reason to drop map here? AFAIK it will be reinitialized later >>>anyway when pkey blocks will be received. >> >>What if it is not received? > > > Then we will have unreliable data there. > > Maybe I know why you wanted this - this is part of "use pkey tables > before sending/receiving to/from ports" idea? 
> > >>>>@@ -255,24 +443,36 @@ pkey_mgr_update_peer_port( >>>> if (enforce == FALSE) >>>> return FALSE; >>>> >>>>- p_pkey_tbl = osm_physp_get_pkey_tbl( p ); >>>>- p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); >>>>+ p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); >>>>+ p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); >>>> num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); >>>>- if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) >>>>- num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); >>>>+ peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); >>>>+ if (peer_max_blocks < p_pkey_tbl->used_blocks) >>>>+ { >>>>+ osm_log( p_log, OSM_LOG_ERROR, >>>>+ "pkey_mgr_update_peer_port: ERR >>>>0508: " >>>>+ "not enough entries (%u < %u) on >>>>switch 0x%016" PRIx64 >>>>+ " port %u\n", >>>>+ peer_max_blocks, num_of_blocks, >>>>+ cl_ntoh64( osm_node_get_node_guid( >>>>p_node ) ), >>>>+ osm_physp_get_port_num( peer ) ); >>>>+ return FALSE; >>> >>> >>>Do you think it is the best way, just to skip update - partitions are >>>enforced already on the switch. May be better to truncate pkey tables >>>in order to meet peer's capabilities? >> >>You are right about that - Its a bug! >>I think the best approach here is to turn off the enforcement on the switch. >>If we truncate the table we actually impact connectivity of the fabric. >>I prefer a softer approach - an error in the log. > > > Yes this should be good way to handle this. 
> >
> >
>>>
>>>>+ }
>>>>
>>>>- for ( block_index = 0; block_index < num_of_blocks; block_index++ )
>>>>+ p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks;
>>>>+ for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++)
>>>> {
>>>> block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index );
>>>> peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index );
>>>> if ( memcmp( peer_block, block, sizeof( *peer_block ) ) )
>>>> {
>>>>+ osm_pkey_tbl_set(p_peer_pkey_tbl, block_index, block);
>>>
>>>
>>>Why this (osm_pkey_tbl_set())? This will be called by the receiver.
>>
>>Same as the above note about updating the map:
>>I wanted to avoid waiting for the GetResp.
>>I think it is a mistake and we can actually remove it.
>
>
> Agree.
>
> Sasha.

From or.gerlitz at gmail.com Sun Jun 18 04:35:27 2006
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Sun, 18 Jun 2006 14:35:27 +0300
Subject: [openib-general] ucma into kernel.org
In-Reply-To: <4492D706.4060106@ichips.intel.com>
References: <1150465355.29508.4.camel@stevo-desktop> <4492D706.4060106@ichips.intel.com>
Message-ID: <15ddcffd0606180435g366a6effs4d4826c8b3fbbd4f@mail.gmail.com>

On 6/16/06, Sean Hefty wrote:
> Steve Wise wrote:
> > Will the ucma make it into 2.6.18? I notice it's not in Roland's
> > for-2.6.18 tree right now.
>
> The plan is to allow the userspace interface to mature some before trying to
> merge them upstream. This is why it is not included in 2.6.18.

Hi Sean,

Can you remind (me...) which areas of the cma u/k interface seem to be not mature enough?

An upstream CMA would be a significant step toward making distro kernels (e.g. SLES10 SP1 and RH5) IB-functional enough for production. As the primary interface for RDMA communication management, the uCMA is and will be vastly used, so there should be a good reason not to push it for 2.6.18.

Or.
From eitan at mellanox.co.il Sun Jun 18 04:46:17 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 18 Jun 2006 14:46:17 +0300 Subject: [openib-general] [PATCHv3] osm: partition manager force policy Message-ID: <86fyi2hek6.fsf@mtl066.yok.mtl.com> Hi Hal This is a third take after incorporating Sasha's comments to the partition manager patch I have previously provided. The main difference is that the manager does not touch the current set of pkey tables but only sends Set(PKeyTable). Another one is the handling of switch limited partition cap by clearing the switch enforcement bit (on the specific port). Also modified interface of SMDB access functions from 0/1 to IB_SUCCESS/IB_ERROR/IB_NOT_FOUND appropriately. ~100 Tests passed both dedicated pkey enforcement (pkey.*) and stress test (osmStress.*). The pkey.* test was enhanced to verify correct pkey index is used by the manager (it should keep the original). BTW: the patch intentionally uses tabs and not spaces as I did not know what we have decided to use. To modify back simply replace every tab with 3 spaces. Eitan Signed-off-by: Eitan Zahavi Index: include/opensm/osm_port.h =================================================================== --- include/opensm/osm_port.h (revision 8100) +++ include/opensm/osm_port.h (working copy) @@ -591,6 +591,39 @@ osm_physp_get_pkey_tbl( IN const osm_phy * Port, Physical Port *********/ +/****f* OpenSM: Physical Port/osm_physp_get_mod_pkey_tbl +* NAME +* osm_physp_get_mod_pkey_tbl +* +* DESCRIPTION +* Returns a NON CONST pointer to the P_Key table object of the Physical Port object. +* +* SYNOPSIS +*/ +static inline osm_pkey_tbl_t * +osm_physp_get_mod_pkey_tbl( IN osm_physp_t* const p_physp ) +{ + CL_ASSERT( osm_physp_is_valid( p_physp ) ); + /* + (14.2.5.7) - the block number valid values are 0-2047, and are further + limited by the size of the P_Key table specified by the PartitionCap on the node. 
+ */ + return( &p_physp->pkeys ); +}; +/* +* PARAMETERS +* p_physp +* [in] Pointer to an osm_physp_t object. +* +* RETURN VALUES +* The pointer to the P_Key table object. +* +* NOTES +* +* SEE ALSO +* Port, Physical Port +*********/ + /****f* OpenSM: Physical Port/osm_physp_set_slvl_tbl * NAME * osm_physp_set_slvl_tbl Index: include/opensm/osm_pkey.h =================================================================== --- include/opensm/osm_pkey.h (revision 8100) +++ include/opensm/osm_pkey.h (working copy) @@ -92,6 +92,9 @@ typedef struct _osm_pkey_tbl cl_ptr_vector_t blocks; cl_ptr_vector_t new_blocks; cl_map_t keys; + cl_qlist_t pending; + uint16_t used_blocks; + uint16_t max_blocks; } osm_pkey_tbl_t; /* * FIELDS @@ -104,6 +107,18 @@ typedef struct _osm_pkey_tbl * keys * A set holding all keys * +* pending +* A list osm_pending_pkey structs that is temporarily set by the +* pkey mgr and used during pkey mgr algorithm only +* +* used_blocks +* Tracks the number of blocks having non-zero pkeys +* +* max_blocks +* The maximal number of blocks this partition table might hold +* this value is based on node_info (for port 0 or CA) or switch_info +* updated on receiving the node_info or switch_info GetResp +* * NOTES * 'blocks' vector should be used to store pkey values obtained from * the port and SM pkey manager should not change it directly, for this @@ -114,6 +129,39 @@ typedef struct _osm_pkey_tbl * *********/ +/****s* OpenSM: osm_pending_pkey_t +* NAME +* osm_pending_pkey_t +* +* DESCRIPTION +* This objects stores temporary information on pkeys their target block and index +* during the pkey manager operation +* +* SYNOPSIS +*/ +typedef struct _osm_pending_pkey { + cl_list_item_t list_item; + uint16_t pkey; + uint32_t block; + uint8_t index; + boolean_t is_new; +} osm_pending_pkey_t; +/* +* FIELDS +* pkey +* The actual P_Key +* +* block +* The block index based on the previous table extracted from the device +* +* index +* The index of the pky within the block +* 
+* is_new +* TRUE for new P_Keys such that the block and index are invalid in that case +* +*********/ + /****f* OpenSM: osm_pkey_tbl_construct * NAME * osm_pkey_tbl_construct @@ -142,7 +190,8 @@ void osm_pkey_tbl_construct( * * SYNOPSIS */ -int osm_pkey_tbl_init( +ib_api_status_t +osm_pkey_tbl_init( IN osm_pkey_tbl_t *p_pkey_tbl); /* * p_pkey_tbl @@ -209,8 +258,8 @@ osm_pkey_tbl_get_num_blocks( static inline ib_pkey_table_t *osm_pkey_tbl_block_get( const osm_pkey_tbl_t *p_pkey_tbl, uint16_t block) { - CL_ASSERT(block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)); - return(cl_ptr_vector_get(&p_pkey_tbl->blocks, block)); + return( (block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)) ? + cl_ptr_vector_get(&p_pkey_tbl->blocks, block) : NULL); }; /* * p_pkey_tbl @@ -244,16 +293,117 @@ static inline ib_pkey_table_t *osm_pkey_ /* *********/ -/****f* OpenSM: osm_pkey_tbl_sync_new_blocks + +/****f* OpenSM: osm_pkey_tbl_make_block_pair +* NAME +* osm_pkey_tbl_make_block_pair +* +* DESCRIPTION +* Find or create a pair of "old" and "new" blocks for the +* given block index +* +* SYNOPSIS +*/ +ib_api_status_t +osm_pkey_tbl_make_block_pair( + osm_pkey_tbl_t *p_pkey_tbl, + uint16_t block_idx, + ib_pkey_table_t **pp_old_block, + ib_pkey_table_t **pp_new_block); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* block_idx +* [in] The block index to use +* +* pp_old_block +* [out] Pointer to the old block pointer arg +* +* pp_new_block +* [out] Pointer to the new block pointer arg +* +* RETURN VALUES +* IB_SUCCESS if OK IB_ERROR if failed +* +*********/ + +/****f* OpenSM: osm_pkey_tbl_set_new_entry * NAME -* osm_pkey_tbl_sync_new_blocks +* osm_pkey_tbl_set_new_entry * * DESCRIPTION -* Syncs new_blocks vector content with current pkey table blocks +* stores the given pkey in the "new" blocks array and update +* the "map" to show that on the "old" blocks * * SYNOPSIS */ -void osm_pkey_tbl_sync_new_blocks( +ib_api_status_t +osm_pkey_tbl_set_new_entry( + IN osm_pkey_tbl_t 
*p_pkey_tbl, + IN uint16_t block_idx, + IN uint8_t pkey_idx, + IN uint16_t pkey); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* block_idx +* [in] The block index to use +* +* pkey_idx +* [in] The index within the block +* +* pkey +* [in] PKey to store +* +* RETURN VALUES +* IB_SUCCESS if OK IB_ERROR if failed +* +*********/ + +/****f* OpenSM: osm_pkey_find_next_free_entry +* NAME +* osm_pkey_find_next_free_entry +* +* DESCRIPTION +* Find the next free entry in the PKey table. Starting at the given +* index and block number. The user should increment pkey_idx before +* next call +* Inspect the "new" blocks array for empty space. +* +* SYNOPSIS +*/ +boolean_t +osm_pkey_find_next_free_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + OUT uint16_t *p_block_idx, + OUT uint8_t *p_pkey_idx); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* p_block_idx +* [out] The block index to use +* +* p_pkey_idx +* [out] The index within the block to use +* +* RETURN VALUES +* TRUE if found FALSE if did not find +* +*********/ + +/****f* OpenSM: osm_pkey_tbl_init_new_blocks +* NAME +* osm_pkey_tbl_init_new_blocks +* +* DESCRIPTION +* Initializes new_blocks vector content (clear and allocate) +* +* SYNOPSIS +*/ +void osm_pkey_tbl_init_new_blocks( const osm_pkey_tbl_t *p_pkey_tbl); /* * p_pkey_tbl @@ -263,6 +413,41 @@ void osm_pkey_tbl_sync_new_blocks( * *********/ +/****f* OpenSM: osm_pkey_tbl_get_block_and_idx +* NAME +* osm_pkey_tbl_get_block_and_idx +* +* DESCRIPTION +* set the block index and pkey index the given +* pkey is found in. return IB_NOT_FOUND if cound not find +* it, IB_SUCCESS if OK +* +* SYNOPSIS +*/ +ib_api_status_t +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *block_idx, + OUT uint8_t *pkey_index); +/* +* p_pkey_tbl +* [in] Pointer to osm_pkey_tbl_t object. 
+* +* p_pkey +* [in] Pointer to the P_Key entry searched +* +* p_block_idx +* [out] Pointer to the block index to be updated +* +* p_pkey_idx +* [out] Pointer to the pkey index (in the block) to be updated +* +* +* NOTES +* +*********/ + /****f* OpenSM: osm_pkey_tbl_set * NAME * osm_pkey_tbl_set @@ -272,7 +457,8 @@ void osm_pkey_tbl_sync_new_blocks( * * SYNOPSIS */ -int osm_pkey_tbl_set( +ib_api_status_t +osm_pkey_tbl_set( IN osm_pkey_tbl_t *p_pkey_tbl, IN uint16_t block, IN ib_pkey_table_t *p_tbl); Index: opensm/osm_prtn.c =================================================================== --- opensm/osm_prtn.c (revision 8100) +++ opensm/osm_prtn.c (working copy) @@ -140,6 +140,12 @@ ib_api_status_t osm_prtn_add_port(osm_lo p_tbl = (full == TRUE) ? &p->full_guid_tbl : &p->part_guid_tbl ; + osm_log(p_log, OSM_LOG_VERBOSE, "osm_prtn_add_port: " + "Added port 0x%" PRIx64 " to " + "partition \'%s\' (0x%04x) As %s member\n", + cl_ntoh64(guid), p->name, cl_ntoh16(p->pkey), + full ? "full" : "partial" ); + if (cl_map_insert(p_tbl, guid, p_physp) == NULL) return IB_INSUFFICIENT_MEMORY; Index: opensm/osm_pkey.c =================================================================== --- opensm/osm_pkey.c (revision 8100) +++ opensm/osm_pkey.c (working copy) @@ -94,18 +94,22 @@ void osm_pkey_tbl_destroy( /********************************************************************** **********************************************************************/ -int osm_pkey_tbl_init( +ib_api_status_t +osm_pkey_tbl_init( IN osm_pkey_tbl_t *p_pkey_tbl) { cl_ptr_vector_init( &p_pkey_tbl->blocks, 0, 1); cl_ptr_vector_init( &p_pkey_tbl->new_blocks, 0, 1); cl_map_init( &p_pkey_tbl->keys, 1 ); + cl_qlist_init( &p_pkey_tbl->pending ); + p_pkey_tbl->used_blocks = 0; + p_pkey_tbl->max_blocks = 0; return(IB_SUCCESS); } /********************************************************************** **********************************************************************/ -void osm_pkey_tbl_sync_new_blocks( +void 
osm_pkey_tbl_init_new_blocks( IN const osm_pkey_tbl_t *p_pkey_tbl) { ib_pkey_table_t *p_block, *p_new_block; @@ -123,16 +127,31 @@ void osm_pkey_tbl_sync_new_blocks( p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); if (!p_new_block) break; + cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, + b, p_new_block); + } + memset(p_new_block, 0, sizeof(*p_new_block)); - cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, p_new_block); } - memcpy(p_new_block, p_block, sizeof(*p_new_block)); +} + +/********************************************************************** + **********************************************************************/ +void osm_pkey_tbl_cleanup_pending( + IN osm_pkey_tbl_t *p_pkey_tbl) +{ + cl_list_item_t *p_item; + p_item = cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_item != cl_qlist_end( &p_pkey_tbl->pending ) ) + { + free( (osm_pending_pkey_t *)p_item ); } } /********************************************************************** **********************************************************************/ -int osm_pkey_tbl_set( +ib_api_status_t +osm_pkey_tbl_set( IN osm_pkey_tbl_t *p_pkey_tbl, IN uint16_t block, IN ib_pkey_table_t *p_tbl) @@ -203,7 +222,138 @@ int osm_pkey_tbl_set( /********************************************************************** **********************************************************************/ -static boolean_t __osm_match_pkey ( +ib_api_status_t +osm_pkey_tbl_make_block_pair( + osm_pkey_tbl_t *p_pkey_tbl, + uint16_t block_idx, + ib_pkey_table_t **pp_old_block, + ib_pkey_table_t **pp_new_block) +{ + if (block_idx >= p_pkey_tbl->max_blocks) return(IB_ERROR); + + if (pp_old_block) + { + *pp_old_block = osm_pkey_tbl_block_get( p_pkey_tbl, block_idx ); + if (! 
*pp_old_block) + { + *pp_old_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); + if (!*pp_old_block) return(IB_ERROR); + memset(*pp_old_block, 0, sizeof(ib_pkey_table_t)); + cl_ptr_vector_set(&p_pkey_tbl->blocks, block_idx, *pp_old_block); + } + } + + if (pp_new_block) + { + *pp_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_idx ); + if (! *pp_new_block) + { + *pp_new_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); + if (!*pp_new_block) return(IB_ERROR); + memset(*pp_new_block, 0, sizeof(ib_pkey_table_t)); + cl_ptr_vector_set(&p_pkey_tbl->new_blocks, block_idx, *pp_new_block); + } + } + return( IB_SUCCESS ); +} + +/********************************************************************** + **********************************************************************/ +/* + store the given pkey in the "new" blocks array + also makes sure the regular block exists. +*/ +ib_api_status_t +osm_pkey_tbl_set_new_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t block_idx, + IN uint8_t pkey_idx, + IN uint16_t pkey) +{ + ib_pkey_table_t *p_old_block; + ib_pkey_table_t *p_new_block; + + if (osm_pkey_tbl_make_block_pair( + p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) + return( IB_ERROR ); + + p_new_block->pkey_entry[pkey_idx] = pkey; + if (p_pkey_tbl->used_blocks < block_idx) + p_pkey_tbl->used_blocks = block_idx; + + return( IB_SUCCESS ); +} + +/********************************************************************** + **********************************************************************/ +boolean_t +osm_pkey_find_next_free_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + OUT uint16_t *p_block_idx, + OUT uint8_t *p_pkey_idx) +{ + ib_pkey_table_t *p_new_block; + + CL_ASSERT(p_block_idx); + CL_ASSERT(p_pkey_idx); + + while ( *p_block_idx < p_pkey_tbl->max_blocks) + { + if (*p_pkey_idx > IB_NUM_PKEY_ELEMENTS_IN_BLOCK - 1) + { + *p_pkey_idx = 0; + (*p_block_idx)++; + if (*p_block_idx >= p_pkey_tbl->max_blocks) + return FALSE; + } + + p_new_block = 
osm_pkey_tbl_new_block_get( p_pkey_tbl, *p_block_idx); + + if ( !p_new_block || + ib_pkey_is_invalid(p_new_block->pkey_entry[*p_pkey_idx])) + return TRUE; + else + (*p_pkey_idx)++; + } + return FALSE; +} + +/********************************************************************** + **********************************************************************/ +ib_api_status_t +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *p_block_idx, + OUT uint8_t *p_pkey_index) +{ + uint32_t num_of_blocks; + uint32_t block_index; + ib_pkey_table_t *block; + + CL_ASSERT( p_pkey_tbl ); + CL_ASSERT( p_block_idx != NULL ); + CL_ASSERT( p_pkey_idx != NULL ); + + num_of_blocks = cl_ptr_vector_get_size( &p_pkey_tbl->blocks); + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + { + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + if ( ( block->pkey_entry <= p_pkey ) && + ( p_pkey < block->pkey_entry + IB_NUM_PKEY_ELEMENTS_IN_BLOCK)) + { + *p_block_idx = block_index; + *p_pkey_index = p_pkey - block->pkey_entry; + return( IB_SUCCESS ); + } + } + return( IB_NOT_FOUND ); +} + +/********************************************************************** + **********************************************************************/ +static boolean_t +__osm_match_pkey ( IN const ib_net16_t *pkey1, IN const ib_net16_t *pkey2 ) { @@ -306,7 +456,8 @@ osm_physp_share_pkey( if (cl_is_map_empty(&pkey_tbl1->keys) || cl_is_map_empty(&pkey_tbl2->keys)) return TRUE; - return !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); + return + !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); } /********************************************************************** @@ -322,7 +473,8 @@ osm_port_share_pkey( OSM_LOG_ENTER( p_log, osm_port_share_pkey ); - if (!p_port_1 || !p_port_2) { + if (!p_port_1 || !p_port_2) + { ret = FALSE; goto Exit; } @@ -330,7 +482,8 @@ osm_port_share_pkey( p_physp1 = 
osm_port_get_default_phys_ptr(p_port_1); p_physp2 = osm_port_get_default_phys_ptr(p_port_2); - if (!p_physp1 || !p_physp2) { + if (!p_physp1 || !p_physp2) + { ret = FALSE; goto Exit; } Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 8100) +++ opensm/osm_pkey_mgr.c (working copy) @@ -62,6 +62,131 @@ /********************************************************************** **********************************************************************/ +/* + the max number of pkey blocks for a physical port is located in + different place for switch external ports (SwitchInfo) and the + rest of the ports (NodeInfo) +*/ +static int +pkey_mgr_get_physp_max_blocks( + IN const osm_subn_t *p_subn, + IN const osm_physp_t *p_physp) +{ + osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); + osm_switch_t *p_sw; + uint16_t num_pkeys = 0; + + if ( (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) || + (osm_physp_get_port_num( p_physp ) == 0)) + num_pkeys = cl_ntoh16( p_node->node_info.partition_cap ); + else + { + p_sw = osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); + if (p_sw) + num_pkeys = cl_ntoh16( p_sw->switch_info.enforce_cap ); + } + return( (num_pkeys + 31) / 32 ); +} + +/********************************************************************** + **********************************************************************/ +/* + * Insert the new pending pkey entry to the specific port pkey table + * pending pkeys. new entries are inserted at the back. 
+ */ +static void +pkey_mgr_process_physical_port( + IN osm_log_t *p_log, + IN const osm_req_t *p_req, + IN const ib_net16_t pkey, + IN osm_physp_t *p_physp ) +{ + osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); + osm_pkey_tbl_t *p_pkey_tbl; + ib_net16_t *p_orig_pkey; + char *stat = NULL; + osm_pending_pkey_t *p_pending; + + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); + p_pending = (osm_pending_pkey_t *)malloc(sizeof(osm_pending_pkey_t)); + if (! p_pending) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0502: " + "Fail to allocate new pending pkey entry for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + p_pending->pkey = pkey; + p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + if ( !p_orig_pkey ) + { + p_pending->is_new = TRUE; + cl_qlist_insert_tail(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "inserted"; + } + else + { + CL_ASSERT( ib_pkey_get_base(*p_orig_pkey) == ib_pkey_get_base(pkey) ); + p_pending->is_new = FALSE; + if (osm_pkey_tbl_get_block_and_idx( + p_pkey_tbl, p_orig_pkey, + &p_pending->block, &p_pending->index) != IB_SUCCESS) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0503: " + "Fail to obtain P_Key 0x%04x block and index for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + cl_qlist_insert_head(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "updated"; + } + + osm_log( p_log, OSM_LOG_DEBUG, + "pkey_mgr_process_physical_port: " + "pkey 0x%04x was %s for node 0x%016" PRIx64 + " port %u\n", + cl_ntoh16( pkey ), stat, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); +} + +/********************************************************************** + 
**********************************************************************/ +static void +pkey_mgr_process_partition_table( + osm_log_t *p_log, + const osm_req_t *p_req, + const osm_prtn_t *p_prtn, + const boolean_t full ) +{ + const cl_map_t *p_tbl = + full ? &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; + cl_map_iterator_t i, i_next; + ib_net16_t pkey = p_prtn->pkey; + osm_physp_t *p_physp; + + if ( full ) + pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); + + i_next = cl_map_head( p_tbl ); + while ( i_next != cl_map_end( p_tbl ) ) + { + i = i_next; + i_next = cl_map_next( i ); + p_physp = cl_map_obj( i ); + if ( p_physp && osm_physp_is_valid( p_physp ) ) + pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); + } +} + +/********************************************************************** + **********************************************************************/ static ib_api_status_t pkey_mgr_update_pkey_entry( IN const osm_req_t *p_req, @@ -114,7 +239,8 @@ pkey_mgr_enforce_partition( p_pi->state_info2 = 0; ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); - context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); + context.pi_context.node_guid = + osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); context.pi_context.set_method = TRUE; context.pi_context.update_master_sm_base_lid = FALSE; @@ -131,80 +257,131 @@ pkey_mgr_enforce_partition( /********************************************************************** **********************************************************************/ -/* - * Prepare a new entry for the pkey table for this port when this pkey - * does not exist. Update existed entry when membership was changed. 
- */ -static void pkey_mgr_process_physical_port( - IN osm_log_t *p_log, - IN const osm_req_t *p_req, - IN const ib_net16_t pkey, - IN osm_physp_t *p_physp ) +static boolean_t pkey_mgr_update_port( + osm_log_t *p_log, + osm_req_t *p_req, + const osm_port_t * const p_port ) { - osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); - ib_pkey_table_t *block; + osm_physp_t *p_physp; + osm_node_t *p_node; + ib_pkey_table_t *block, *new_block; + osm_pkey_tbl_t *p_pkey_tbl; uint16_t block_index; + uint8_t pkey_index; + uint16_t last_free_block_index = 0; + uint8_t last_free_pkey_index = 0; uint16_t num_of_blocks; - const osm_pkey_tbl_t *p_pkey_tbl; - ib_net16_t *p_orig_pkey; - char *stat = NULL; - uint32_t i; + uint16_t max_num_of_blocks; - p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + ib_api_status_t status; + boolean_t ret_val = FALSE; + osm_pending_pkey_t *p_pending; + boolean_t found; - p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) + return FALSE; - if ( !p_orig_pkey ) - { - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, p_physp ); + if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) + osm_log( p_log, OSM_LOG_INFO, + "pkey_mgr_update_port: " + "Max number of blocks reduced from %u to %u " + "for node 0x%016" PRIx64 " port %u\n", + p_pkey_tbl->max_blocks, max_num_of_blocks, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + } + p_pkey_tbl->max_blocks = max_num_of_blocks; + + osm_pkey_tbl_init_new_blocks( p_pkey_tbl ); + 
p_pkey_tbl->used_blocks = 0; + + /* + process every pending pkey in order - + first must be "updated" last are "new" + */ + p_pending = + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_pending != + (osm_pending_pkey_t *)cl_qlist_end( &p_pkey_tbl->pending ) ) + { + if (p_pending->is_new == FALSE) + { + block_index = p_pending->block; + pkey_index = p_pending->index; + found = TRUE; + } + else { - if ( ib_pkey_is_invalid( block->pkey_entry[i] ) ) + found = osm_pkey_find_next_free_entry(p_pkey_tbl, + &last_free_block_index, + &last_free_pkey_index); + if ( !found ) { - block->pkey_entry[i] = pkey; - stat = "inserted"; - goto _done; + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0504: " + "failed to find empty space for new pkey 0x%04x " + "of node 0x%016" PRIx64 " port %u\n", + cl_ntoh16(p_pending->pkey), + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); } + else + { + block_index = last_free_block_index; + pkey_index = last_free_pkey_index++; } } + + if (found) + { + if ( IB_SUCCESS != osm_pkey_tbl_set_new_entry( + p_pkey_tbl, block_index, pkey_index, p_pending->pkey) ) + { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_process_physical_port: ERR 0501: " - "No empty pkey entry was found to insert 0x%04x for node " - "0x%016" PRIx64 " port %u\n", - cl_ntoh16( pkey ), + "pkey_mgr_update_port: ERR 0505: " + "failed to set PKey 0x%04x in block %u idx %u " + "of node 0x%016" PRIx64 " port %u\n", + p_pending->pkey, block_index, pkey_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } - else if ( *p_orig_pkey != pkey ) - { + } + + free( p_pending ); + p_pending = + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + } + + /* now look for changes and store */ for ( block_index = 0; block_index < num_of_blocks; block_index++ ) { - /* we need real block (not just new_block) in order - * to resolve block/pkey indices */ block = 
osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - i = p_orig_pkey - block->pkey_entry; - if (i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - block->pkey_entry[i] = pkey; - stat = "updated"; - goto _done; - } - } - } + new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - _done: - if (stat) { - osm_log( p_log, OSM_LOG_VERBOSE, - "pkey_mgr_process_physical_port: " - "pkey 0x%04x was %s for node 0x%016" PRIx64 - " port %u\n", - cl_ntoh16( pkey ), stat, + if (block && + (!new_block || !memcmp( new_block, block, sizeof( *block ) )) ) + continue; + + status = pkey_mgr_update_pkey_entry( + p_req, p_physp , new_block, block_index ); + if (status == IB_SUCCESS) + ret_val = TRUE; + else + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0506: " + "pkey_mgr_update_pkey_entry() failed to update " + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", + block_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } + + return ret_val; } /********************************************************************** @@ -217,21 +394,23 @@ pkey_mgr_update_peer_port( const osm_port_t * const p_port, boolean_t enforce ) { - osm_physp_t *p, *peer; + osm_physp_t *p_physp, *peer; osm_node_t *p_node; ib_pkey_table_t *block, *peer_block; - const osm_pkey_tbl_t *p_pkey_tbl, *p_peer_pkey_tbl; + const osm_pkey_tbl_t *p_pkey_tbl; + osm_pkey_tbl_t *p_peer_pkey_tbl; osm_switch_t *p_sw; ib_switch_info_t *p_si; uint16_t block_index; uint16_t num_of_blocks; + uint16_t peer_max_blocks; ib_api_status_t status = IB_SUCCESS; boolean_t ret_val = FALSE; - p = osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) return FALSE; - peer = osm_physp_get_remote( p ); + peer = osm_physp_get_remote( p_physp ); if ( !peer || !osm_physp_is_valid( peer ) ) return FALSE; p_node = 
osm_physp_get_node_ptr( peer ); @@ -242,10 +421,26 @@ pkey_mgr_update_peer_port( if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || !p_si->enforce_cap) return FALSE; + p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); + p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); + if (peer_max_blocks < p_pkey_tbl->used_blocks) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_peer_port: ERR 0508: " + "not enough entries (%u < %u) on switch 0x%016" PRIx64 + " port %u. Clearing Enforcement bit.\n", + peer_max_blocks, num_of_blocks, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( peer ) ); + enforce = FALSE; + } + if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0502: " + "pkey_mgr_update_peer_port: ERR 0507: " "pkey_mgr_enforce_partition() failed to update " "node 0x%016" PRIx64 " port %u\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), @@ -255,13 +450,8 @@ pkey_mgr_update_peer_port( if (enforce == FALSE) return FALSE; - p_pkey_tbl = osm_physp_get_pkey_tbl( p ); - p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); - - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; + for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) { block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); @@ -272,7 +462,7 @@ pkey_mgr_update_peer_port( ret_val = TRUE; else osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0503: " + "pkey_mgr_update_peer_port: ERR 0509: " 
"pkey_mgr_update_pkey_entry() failed to update " "pkey table block %d for node 0x%016" PRIx64 " port %u\n", @@ -282,10 +472,10 @@ pkey_mgr_update_peer_port( } } - if ( ret_val == TRUE && - osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) + if ( (ret_val == TRUE) && + osm_log_is_active( p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_log, OSM_LOG_VERBOSE, + osm_log( p_log, OSM_LOG_DEBUG, "pkey_mgr_update_peer_port: " "pkey table was updated for node 0x%016" PRIx64 " port %u\n", @@ -298,82 +488,6 @@ pkey_mgr_update_peer_port( /********************************************************************** **********************************************************************/ -static boolean_t pkey_mgr_update_port( - osm_log_t *p_log, - osm_req_t *p_req, - const osm_port_t * const p_port ) -{ - osm_physp_t *p; - osm_node_t *p_node; - ib_pkey_table_t *block, *new_block; - const osm_pkey_tbl_t *p_pkey_tbl; - uint16_t block_index; - uint16_t num_of_blocks; - ib_api_status_t status; - boolean_t ret_val = FALSE; - - p = osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) - return FALSE; - - p_pkey_tbl = osm_physp_get_pkey_tbl(p); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) - { - block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - - if (!new_block || !memcmp( new_block, block, sizeof( *block ) ) ) - continue; - - status = pkey_mgr_update_pkey_entry( p_req, p, new_block, block_index ); - if (status == IB_SUCCESS) - ret_val = TRUE; - else - osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_port: ERR 0504: " - "pkey_mgr_update_pkey_entry() failed to update " - "pkey table block %d for node 0x%016" PRIx64 " port %u\n", - block_index, - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( p ) ); - } - - return ret_val; -} - 
-/********************************************************************** - **********************************************************************/ -static void -pkey_mgr_process_partition_table( - osm_log_t *p_log, - const osm_req_t *p_req, - const osm_prtn_t *p_prtn, - const boolean_t full ) -{ - const cl_map_t *p_tbl = full ? - &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; - cl_map_iterator_t i, i_next; - ib_net16_t pkey = p_prtn->pkey; - osm_physp_t *p_physp; - - if ( full ) - pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); - - i_next = cl_map_head( p_tbl ); - while ( i_next != cl_map_end( p_tbl ) ) - { - i = i_next; - i_next = cl_map_next( i ); - p_physp = cl_map_obj( i ); - if ( p_physp && osm_physp_is_valid( p_physp ) ) - pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); - } -} - -/********************************************************************** - **********************************************************************/ osm_signal_t osm_pkey_mgr_process( IN osm_opensm_t *p_osm ) @@ -383,8 +497,7 @@ osm_pkey_mgr_process( osm_prtn_t *p_prtn; osm_port_t *p_port; osm_signal_t signal = OSM_SIGNAL_DONE; - osm_physp_t *p_physp; - + osm_node_t *p_node; CL_ASSERT( p_osm ); OSM_LOG_ENTER( &p_osm->log, osm_pkey_mgr_process ); @@ -394,32 +507,25 @@ osm_pkey_mgr_process( if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) { osm_log( &p_osm->log, OSM_LOG_ERROR, - "osm_pkey_mgr_process: ERR 0505: " + "osm_pkey_mgr_process: ERR 0510: " "osm_prtn_make_partitions() failed\n" ); goto _err; } - p_tbl = &p_osm->subn.port_guid_tbl; - p_next = cl_qmap_head( p_tbl ); - while ( p_next != cl_qmap_end( p_tbl ) ) - { - p_port = ( osm_port_t * ) p_next; - p_next = cl_qmap_next( p_next ); - p_physp = osm_port_get_default_phys_ptr( p_port ); - if ( osm_physp_is_valid( p_physp ) ) - osm_pkey_tbl_sync_new_blocks( osm_physp_get_pkey_tbl( p_physp ) ); - } - + /* populate the pending pkey entries by scanning all partitions */ p_tbl = 
&p_osm->subn.prtn_pkey_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) { p_prtn = ( osm_prtn_t * ) p_next; p_next = cl_qmap_next( p_next ); - pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); - pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); + pkey_mgr_process_partition_table( + &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); + pkey_mgr_process_partition_table( + &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); } + /* calculate new pkey tables and set */ p_tbl = &p_osm->subn.port_guid_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) @@ -428,8 +534,10 @@ osm_pkey_mgr_process( p_next = cl_qmap_next( p_next ); if ( pkey_mgr_update_port( &p_osm->log, &p_osm->sm.req, p_port ) ) signal = OSM_SIGNAL_DONE_PENDING; - if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != IB_NODE_TYPE_SWITCH && - pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, + p_node = osm_port_get_parent_node( p_port ); + if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && + pkey_mgr_update_peer_port( + &p_osm->log, &p_osm->sm.req, &p_osm->subn, p_port, !p_osm->subn.opt.no_partition_enforcement ) ) signal = OSM_SIGNAL_DONE_PENDING; From rdreier at cisco.com Sun Jun 18 04:49:25 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 18 Jun 2006 04:49:25 -0700 Subject: [openib-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus Opening the floodgates for 2.6.18: Ganapathi CH: IB/uverbs: Release lock on error path Ishai Rabinovitz: IB/srp: Clean up loop in srp_remove_one() IB/srp: Handle DREQ events from CM IB/srp: Factor out common request reset code Jack Morgenstein: IB: Add caching of ports' LMC IB/mad: Check GID/LID when 
matching requests IPoIB: Fix kernel unaligned access on ia64 Leonid Arsh: IB: Add client reregister event type IPoIB: Handle client reregister events IB: Move struct port_info from ipath to IB/mthca: Add client reregister event generation Matthew Wilcox: IB/srp: Use SCAN_WILD_CARD from SCSI headers IB/srp: Get rid of unneeded use of list_for_each_entry_safe() IB/srp: Change target_mutex to a spinlock Michael S. Tsirkin: IB/mthca: restore missing PCI registers after reset IB/mthca: memfree completion with error FW bug workaround IB/mthca: Remove dead code IB/cm: remove unneeded flush_workqueue Or Gerlitz: IB/mthca: Fill in max_map_per_fmr device attribute IB/fmr: Use device's max_map_map_per_fmr attribute in FMR pool. Ramachandra K: [SCSI] srp.h: Add I/O Class values IB/srp: Support SRP rev. 10 targets Roland Dreier: IB/srp: Use FMRs to map gather/scatter lists IB/mthca: Convert FW commands to use wait_for_completion_timeout() IB: Make needlessly global ib_mad_cache static IPoIB: Mention RFC numbers in documentation IB/srp: Get rid of "Target has req_lim 0" messages IPoIB: Avoid using stale last_send counter when reaping AHs IB/ipath: Add client reregister event generation IB/uverbs: Don't decrement usecnt on error paths IB/uverbs: Factor out common idr code IB/mthca: Fix memory leak on modify_qp error paths IB/mthca: Make all device methods truly reentrant IB/uverbs: Don't serialize with ib_uverbs_idr_mutex Sean Hefty: IB: common handling for marshalling parameters to/from userspace IB/cm: Match connection requests based on private data [NET]: Export ip_dev_find() IB: address translation to map IP toIB addresses (GIDs) IB: IP address based RDMA connection manager IB/ucm: convert semaphore to mutex IB/ucm: Get rid of duplicate P_Key parameter IB: Add ib_init_ah_from_wc() IB/sa: Add ib_init_ah_from_path() IB/cm: Use address handle helpers Vu Pham: IB/srp: Allow cmd_per_lun to be set per target port IB/srp: Allow sg_tablesize to be adjusted 
Documentation/infiniband/ipoib.txt | 12 drivers/infiniband/Kconfig | 5 drivers/infiniband/core/Makefile | 11 drivers/infiniband/core/addr.c | 367 +++++ drivers/infiniband/core/cache.c | 30 drivers/infiniband/core/cm.c | 119 + drivers/infiniband/core/cma.c | 1927 ++++++++++++++++++++++++ drivers/infiniband/core/fmr_pool.c | 30 drivers/infiniband/core/mad.c | 97 + drivers/infiniband/core/mad_priv.h | 2 drivers/infiniband/core/sa_query.c | 31 drivers/infiniband/core/ucm.c | 183 +- drivers/infiniband/core/uverbs.h | 4 drivers/infiniband/core/uverbs_cmd.c | 971 +++++++----- drivers/infiniband/core/uverbs_main.c | 35 drivers/infiniband/core/uverbs_marshall.c | 138 ++ drivers/infiniband/core/verbs.c | 44 - drivers/infiniband/hw/ipath/ipath_mad.c | 42 - drivers/infiniband/hw/mthca/mthca_cmd.c | 23 drivers/infiniband/hw/mthca/mthca_cq.c | 12 drivers/infiniband/hw/mthca/mthca_eq.c | 4 drivers/infiniband/hw/mthca/mthca_mad.c | 14 drivers/infiniband/hw/mthca/mthca_provider.c | 33 drivers/infiniband/hw/mthca/mthca_provider.h | 3 drivers/infiniband/hw/mthca/mthca_qp.c | 40 drivers/infiniband/hw/mthca/mthca_reset.c | 59 + drivers/infiniband/hw/mthca/mthca_srq.c | 5 drivers/infiniband/ulp/ipoib/ipoib.h | 34 drivers/infiniband/ulp/ipoib/ipoib_ib.c | 27 drivers/infiniband/ulp/ipoib/ipoib_main.c | 28 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 11 drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 3 drivers/infiniband/ulp/srp/ib_srp.c | 482 ++++-- drivers/infiniband/ulp/srp/ib_srp.h | 33 include/rdma/ib_addr.h | 114 + include/rdma/ib_cache.h | 13 include/rdma/ib_cm.h | 26 include/rdma/ib_marshall.h | 50 + include/rdma/ib_sa.h | 7 include/rdma/ib_smi.h | 36 include/rdma/ib_user_cm.h | 86 - include/rdma/ib_user_sa.h | 60 + include/rdma/ib_user_verbs.h | 80 + include/rdma/ib_verbs.h | 22 include/rdma/rdma_cm.h | 256 +++ include/rdma/rdma_cm_ib.h | 47 + include/scsi/srp.h | 5 net/ipv4/fib_frontend.c | 1 48 files changed, 4590 insertions(+), 1072 deletions(-) create mode 100644 
drivers/infiniband/core/addr.c create mode 100644 drivers/infiniband/core/cma.c create mode 100644 drivers/infiniband/core/uverbs_marshall.c create mode 100644 include/rdma/ib_addr.h create mode 100644 include/rdma/ib_marshall.h create mode 100644 include/rdma/ib_user_sa.h create mode 100644 include/rdma/rdma_cm.h create mode 100644 include/rdma/rdma_cm_ib.h From ogerlitz at voltaire.com Sun Jun 18 05:13:06 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 18 Jun 2006 15:13:06 +0300 Subject: [openib-general] OFED 1.0 - error while running ib_rdma_bw Message-ID: Running ib_rdma_bw (e.g. from the trunk but also with OFED) from time to time outputs the following message: server read: Success 0/45: Couldn't read remote address Looking in the code, line 148 (and actually 142 as well) seems to be buggy: 133 struct pingpong_dest * pp_client_exch_dest(int sockfd, 134 const struct pingpong_dest *my_dest) 135 { 136 struct pingpong_dest *rem_dest = NULL; 137 char msg[sizeof "0000:000000:000000:00000000:0000000000000000"]; 138 int parsed; 139 140 sprintf(msg, "%04x:%06x:%06x:%08x:%016Lx", my_dest->lid, my_dest->qpn, 141 my_dest->psn,my_dest->rkey,my_dest->vaddr); 142 if (write(sockfd, msg, sizeof msg) != sizeof msg) { 143 perror("client write"); 144 fprintf(stderr, "Couldn't send local address\n"); 145 goto out; 146 } 147 148 if (read(sockfd, msg, sizeof msg) != sizeof msg) { 149 perror("client read"); 150 fprintf(stderr, "Couldn't read remote address\n"); 151 goto out; 152 } as read(2) can return fewer than the maximum (expected) byte count, and indeed errno is 0 (no error) when the message is printed. The script below would allow you to easily reproduce it. At some point, there's also an IB completion with error printed, but it might be related to the socket handling bug. Or. 
SERVER=dill echo "" for i in 16384 32768 65536 131072 262144 524288 1048576 2097152 do for k in 4 do ssh $SERVER "/usr/local/ofed/bin/ib_rdma_bw" & sleep 5 echo $(date) -s = $i -n = $((512*1024*1024/$i)) -t = $k start /usr/local/ofed/bin/ib_rdma_bw $SERVER -s $i -n $((512*1024*1024/$i)) -t $k echo $(date) -s = $i -n = $((512*1024*1024/$i)) sleeping 3 seconds..... sleep 3 echo $(date) -s = $i -n = $((512*1024*1024/$i)) end echo "" wait done done -------------- next part -------------- An HTML attachment was scrubbed... URL: From dotanb at mellanox.co.il Sun Jun 18 07:04:39 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Sun, 18 Jun 2006 17:04:39 +0300 Subject: [openib-general] is there is any SA client in user level? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30257933F@mtlexch01.mtl.com> Hi. I want to send a join message to the SA from user space. I know that I can use the umad or the osm_vendor in order to do it.. what is the best way to do it? is there is any SA client implementation in the user level (or is it a transparent layer?) thanks Dotan Barak Software Verification Engineer Mellanox Technologies Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 -------------- next part -------------- An HTML attachment was scrubbed... URL: From swise at opengridcomputing.com Sun Jun 18 10:57:53 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 18 Jun 2006 12:57:53 -0500 Subject: [openib-general] ucma into kernel.org References: <1150465355.29508.4.camel@stevo-desktop> <4492D706.4060106@ichips.intel.com> <15ddcffd0606180435g366a6effs4d4826c8b3fbbd4f@mail.gmail.com> Message-ID: <001e01c69300$b9020c00$020010ac@haggard> | On 6/16/06, Sean Hefty wrote: | > Steve Wise wrote: | > > Will the ucma make it into 2.6.18? I notice its not in Roland's | > > for-2.6.18 tree right now. 
| > | > The plan is to allow the userspace interface to mature some before trying to | > merge them upstream. This is why it is not included in 2.6.18. | | Hi Sean, | | Can you remind (me...) what areas of the cma u/k interface seem to be | not mature enough? | | upstream CMA can be a significant step in the sense of distros (e.g. | SLES10 SP1 and RH5) kernel IB functional enough for production; as the | primary interface for "RDMA communication management" the uCMA is and | would be vastly used, so there should be a good reason why not to push | it for 2.6.18. | I agree that it would be nice to get this into 2.6.18. It seems stable enough IMO. Steve. From sean.hefty at intel.com Sun Jun 18 16:41:16 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 18 Jun 2006 16:41:16 -0700 Subject: [openib-general] is there is any SA client in user level? In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30257933F@mtlexch01.mtl.com> Message-ID: <000001c69330$b1a67fb0$0e278686@amr.corp.intel.com> I want to send a join message to the SA from user space. I know that I can use the umad or the osm_vendor in order to do it.. what is the best way to do it? is there is any SA client implementation in the user level (or is it a transparent layer?) There is no SA client in userspace. (I'm not sure that one would be that much simpler than calling umad directly.) Ideally, join requests should go through the kernel through the ib_multicast module to allow for proper reference counting. Currently, the only interface to that from userspace is through the rdma_cm. - Sean -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Mon Jun 19 03:23:36 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Jun 2006 06:23:36 -0400 Subject: [openib-general] [PATCH] OpenSM/SA: In some SA records, send ERR_REQ_INVALID response on LID out of range error Message-ID: <1150712615.4391.55627.camel@hal.voltaire.com> OpenSM/SA: In some SA records, send ERR_REQ_INVALID response on LID out of range error Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_vlarb_record.c =================================================================== --- opensm/osm_sa_vlarb_record.c (revision 8105) +++ opensm/osm_sa_vlarb_record.c (working copy) @@ -413,10 +413,14 @@ osm_vlarb_rec_rcv_process( } else { /* port out of range */ + cl_plock_release( p_rcv->p_lock ); + osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_vlarb_rec_rcv_process: ERR 2A01: " "Given LID (0x%X) is out of range:0x%X\n", cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl) ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_REQ_INVALID ); + goto Exit; } } Index: opensm/osm_sa_pkey_record.c =================================================================== --- opensm/osm_sa_pkey_record.c (revision 8105) +++ opensm/osm_sa_pkey_record.c (working copy) @@ -425,10 +425,14 @@ osm_pkey_rec_rcv_process( } else { /* port out of range */ + cl_plock_release( p_rcv->p_lock ); + osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_pkey_rec_rcv_process: ERR 4609: " "Given LID (0x%X) is out of range:0x%X\n", cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl) ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_REQ_INVALID ); + goto Exit; } } Index: opensm/osm_sa_slvl_record.c =================================================================== --- opensm/osm_sa_slvl_record.c (revision 8105) +++ opensm/osm_sa_slvl_record.c (working copy) @@ -393,10 +393,14 @@ osm_slvl_rec_rcv_process( } else { /* port out of range */ + cl_plock_release( p_rcv->p_lock ); + osm_log( p_rcv->p_log, OSM_LOG_ERROR, 
"osm_slvl_rec_rcv_process: ERR 2601: " "Given LID (0x%X) is out of range:0x%X\n", cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl)); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_REQ_INVALID ); + goto Exit; } } From eitan at mellanox.co.il Mon Jun 19 03:45:49 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 19 Jun 2006 13:45:49 +0300 Subject: [openib-general] [PATCH] OpenSM/SA: In some SA records, send ERR_REQ_INVALIDresponse on LID out of range error Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236888C@mtlexch01.mtl.com> Hi Hal, Thanks for finding and fixing. Looks good to me. > Subject: [PATCH] OpenSM/SA: In some SA records, send > ERR_REQ_INVALIDresponse on LID out of range error > > OpenSM/SA: In some SA records, send ERR_REQ_INVALID response on LID out > of range error > From tziporet at mellanox.co.il Mon Jun 19 04:25:11 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 19 Jun 2006 14:25:11 +0300 Subject: [openib-general] OFED 1.0 - Official Release Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA726A@mtlexch01.mtl.com> Yes indeed we inserted one more critical bug fix in SDP. This bug is cause kernel oops in case server and client do not open the same number of sockets. Thus it can easily happened by any user level application using socket. The reason we added it as a patch was to decrease the risk, so if it cause someone a problem it can be reverted easily. Note that we did code review for the fix + tested it on all OS matrix we have to make sure this patch is safe. Tziporet -----Original Message----- From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] Sent: Friday, June 16, 2006 6:58 PM To: Tziporet Koren; OpenFabricsEWG; openib Subject: RE: [openib-general] OFED 1.0 - Official Release Tziporet, I see a few C code changes from pre1 in the form of patches. What are these and why were they added after pre1? 
$ diff -r OFED-1.0-pre1/SOURCES/openib-1.0/patches/OFED-1.0/SOURCES/openib-1.0/pat ches/ 2>&1 | less ... Only in OFED-1.0-pre1/SOURCES/openib-1.0/patches/fixes: handle_reconnect_of_offline_host.patch Only in OFED-1.0/SOURCES/openib-1.0/patches/fixes: sdp_fix.patch Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Tziporet Koren Sent: Friday, June 16, 2006 1:55 AM To: OpenFabricsEWG; openib Subject: [openib-general] OFED 1.0 - Official Release I am happy to announce that OFED 1.0 Official Release is now available. The release can be found under: https://openib.org/svn/gen2/branches/1.0/ofed/releases/ And later today it will be on the OpenFabrics download page: http://www.openfabrics.org/downloads.html. This is the first release that was done in a joint effort of the following companies: * Cisco * SilverStorm * Voltaire * QLogic * Intel * Mellanox Technologies I wish to thank all who contributed to the success of this release. Tziporet ======================================================================== ======= Release summary: The OFED software package is composed of several software modules intended for use on a computer cluster constructed as an InfiniBand network. OFED package contains the following components: o OpenFabrics core and ULPs: - HCA drivers (mthca, ipath) - core - Upper Layer Protocols: IPoIB, SDP, SRP Initiator, iSER Host, RDS and uDAPL o OpenFabrics utilities: - OpenSM: InfiniBand Subnet Manager - Diagnostic tools - Performance tests o MPI: - OSU MPI stack supporting the InfiniBand interface - Open MPI stack supporting the InfiniBand interface - MPI benchmark tests (OSU BW/LAT, Pallas, Presta) o Sources of all software modules (under conditions mentioned in the modules' LICENSE files) o Documentation Notes: 1. SDP and RDS are in technology preview state. 2. 
The SRP Initiator and Open MPI are in beta state. 3. All other OFED components are in production state. Supported Platforms and Operating Systems CPU architectures: * x86_64 * x86 * ia64 * ppc64 Linux Operating Systems: * RedHat EL4 up2: 2.6.9-22.ELsmp * RedHat EL4 up3: 2.6.9-34.ELsmp * Fedora C4: 2.6.11-1.1369_FC4 * SLES10 RC2: 2.6.16.16-1.6-smp (or RC 2.5 2.6.16.14-6-smp) * SLES10 RC1: 2.6.16.14-6-smp * SUSE 10 Pro: 2.6.13-15-smp * kernel.org: 2.6.16.x HCAs Supported Mellanox HCAs: - InfiniHost - InfiniHost III Ex (both modes: with memory and MemFree) - InfiniHost III Lx Both SDR and DDR mode of the InfiniHost III family are supported. For official FW versions please see: http://www.mellanox.com/support/firmware_table.php Qlogic HCAs: - QHT6040 (PathScale InfiniPath HT-460) - QHT6140 (PathScale InfiniPath HT-465) - QLE6140 (PathScale InfiniPath PE-880) Switches Supported This release was tested with switches and gateways provided by the following companies: - Cisco - Voltaire - SilverStorm - Flextronics Attached are the release notes Tziporet Koren Software Director Mellanox Technologies mailto: tziporet at mellanox.co.il Tel +972-4-9097200, ext 380 -------------- next part -------------- An HTML attachment was scrubbed... URL: From svenar at simula.no Mon Jun 19 04:35:11 2006 From: svenar at simula.no (Sven-Arne Reinemo) Date: Mon, 19 Jun 2006 13:35:11 +0200 Subject: [openib-general] A few questions about IBMgtSim Message-ID: <44968BEF.9030401@simula.no> Hi, After some testing of IBMgtSim I have a few questions: 1) If I try to build topologies using the MTS14400.ibnl as a building block my simulation fails with a "child process exited abnormally" message. I guess this is related to ibdmchk since the ibdmchk log contains lots of errors like the following: -I- Tracing all CA to CA paths for Credit Loops potential ... 
-E- Potential Credit Loop on Path from:H-1/U1/1 to:H-11/U1/1 Going:Down from:node:0002c9000000007d to:node:0002c9000000006a Going:Up from:node:0002c9000000006a to:node:0002c90000000076 -I- Generating non blocking full link coverage plan into:/tmp/ibdmchk.non_block_ all_links -E- After 32 stages some switch ports are still not covered: -E- Fail to cover port:system:0002c90000000054/node:0002c90000000054/P15 I have included two topology files. One that works and one that fails; the only difference is that the number of hosts is increased from 18 to 20. Also, if I create my own simple ibnl file for a switch with 144 (or other sizes) ports I am able to run simulations. Any suggestions as to what the problem might be? 2) The included example ibmgtsim/tests/RhinoBased10K.topo never finishes (at least not in 24 hours). Does this work for anyone else? All other examples work fine. 3) If I would like to use IBMgtSim with my own (simplified) SM, would it be straightforward? It looks to me like RunSimTest talks to any SM given the correct path, node and port number for the location of the SM. Best regards, -- Sven-Arne Reinemo [simula.research laboratory] http://www.simula.no/ ++++ GnuPG public key - http://home.simula.no/~svenar/gpg.asc ++++ -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: mts14400_n20_not_working.topo URL: From sashak at voltaire.com Mon Jun 19 06:56:53 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 Jun 2006 16:56:53 +0300 Subject: [openib-general] [PATCHv3] osm: partition manager force policy In-Reply-To: <86fyi2hek6.fsf@mtl066.yok.mtl.com> References: <86fyi2hek6.fsf@mtl066.yok.mtl.com> Message-ID: <20060619135653.GB5521@sashak.voltaire.com> Hi Eitan, On 14:46 Sun 18 Jun , Eitan Zahavi wrote: > > This is a third take after incorporating Sasha's comments to the > partition manager patch I have previously provided. Two small comments below. > /********************************************************************** > **********************************************************************/ > -/* > - * Prepare a new entry for the pkey table for this port when this pkey > - * does not exist. Update existed entry when membership was changed. > - */ > -static void pkey_mgr_process_physical_port( > - IN osm_log_t *p_log, > - IN const osm_req_t *p_req, > - IN const ib_net16_t pkey, > - IN osm_physp_t *p_physp ) > +static boolean_t pkey_mgr_update_port( > + osm_log_t *p_log, > + osm_req_t *p_req, > + const osm_port_t * const p_port ) > { > - osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); > - ib_pkey_table_t *block; > + osm_physp_t *p_physp; > + osm_node_t *p_node; p_node is uninitialized and used in osm_log() later, > + ib_pkey_table_t *block, *new_block; > + osm_pkey_tbl_t *p_pkey_tbl; > uint16_t block_index; > + uint8_t pkey_index; > + uint16_t last_free_block_index = 0; > + uint8_t last_free_pkey_index = 0; > uint16_t num_of_blocks; > - const osm_pkey_tbl_t *p_pkey_tbl; > - ib_net16_t *p_orig_pkey; > - char *stat = NULL; > - uint32_t i; > + uint16_t max_num_of_blocks; > > - p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); > - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > + ib_api_status_t status; > + boolean_t ret_val = FALSE; > + osm_pending_pkey_t *p_pending; > + boolean_t found; > > - 
p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); > + p_physp = osm_port_get_default_phys_ptr( p_port ); > + if ( !osm_physp_is_valid( p_physp ) ) > + return FALSE; > > - if ( !p_orig_pkey ) > - { > - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) > + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); > + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > + max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, p_physp ); > + if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) > { > - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > - for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) > + osm_log( p_log, OSM_LOG_INFO, > + "pkey_mgr_update_port: " > + "Max number of blocks reduced from %u to %u " > + "for node 0x%016" PRIx64 " port %u\n", > + p_pkey_tbl->max_blocks, max_num_of_blocks, > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( p_physp ) ); > + } > @@ -255,13 +450,8 @@ pkey_mgr_update_peer_port( > if (enforce == FALSE) > return FALSE; > > - p_pkey_tbl = osm_physp_get_pkey_tbl( p ); > - p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); > - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > - if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) > - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); > - > - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) > + p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; Peer's pkey table blocks may be not initialized yet, and then > + for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) > { > block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); peer_block can be NULL. Later in the code (not in this patch) there is 'if (memcmp(peer_block, ...))', should be changed to 'if (!peer_block || memcmp(peer_block, ...))'. 
Sasha From ogerlitz at voltaire.com Mon Jun 19 07:30:16 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 19 Jun 2006 17:30:16 +0300 (IDT) Subject: [openib-general] trunk's udapl does not compile Message-ID: I've just noted an inconsistency with librdmacm of udapl calling rdma_create_id without providing the PS param. This is the trivial patch i was using to fix the compilation. Or. Index: dapl/openib_cma/dapl_ib_util.c =================================================================== --- dapl/openib_cma/dapl_ib_util.c (revision 8106) +++ dapl/openib_cma/dapl_ib_util.c (working copy) @@ -235,7 +235,7 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_N return DAT_INVALID_ADDRESS; /* cm_id will bind local device/GID based on IP address */ - if (rdma_create_id(g_cm_events, &cm_id, (void*)hca_ptr)) + if (rdma_create_id(g_cm_events, &cm_id, (void*)hca_ptr, RDMA_PS_TCP)) return DAT_INTERNAL_ERROR; ret = rdma_bind_addr(cm_id, Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 8106) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -694,7 +694,7 @@ dapls_ib_setup_conn_listener(IN DAPL_IA dapl_os_lock_init(&conn->lock); /* create CM_ID, bind to local device, create QP */ - if (rdma_create_id(g_cm_events, &conn->cm_id, (void*)conn)) { + if (rdma_create_id(g_cm_events, &conn->cm_id, (void*)conn, RDMA_PS_TCP)) { dapl_os_free(conn, sizeof(*conn)); return(dapl_convert_errno(errno,"setup_listener")); } Index: dapl/openib_cma/dapl_ib_qp.c =================================================================== --- dapl/openib_cma/dapl_ib_qp.c (revision 8106) +++ dapl/openib_cma/dapl_ib_qp.c (working copy) @@ -130,7 +130,7 @@ DAT_RETURN dapls_ib_qp_alloc(IN DAPL_IA dapl_os_lock_init(&conn->lock); /* create CM_ID, bind to local device, create QP */ - if (rdma_create_id(g_cm_events, &cm_id, (void*)conn)) { + if (rdma_create_id(g_cm_events, &cm_id, (void*)conn, RDMA_PS_TCP)) { 
dapl_os_free(conn, sizeof(*conn)); return(dapl_convert_errno(errno, "create_qp")); } From ogerlitz at voltaire.com Mon Jun 19 07:43:14 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 19 Jun 2006 17:43:14 +0300 (IDT) Subject: [openib-general] dapltest gets segfaulted in librdmacm init Message-ID: After fixing the ucma/port space issue with the calls to rdma_create_id i am now trying to run $ ./Target/dapltest -T S -D OpenIB-cma and getting an immediate segfault with the below trace, any idea? Or. #0 0x00002af6d3a97685 in ibv_open_device (device=0x537440) at device.c:128 128 context = device->ops.alloc_context(device, cmd_fd); (gdb) where #0 0x00002af6d3a97685 in ibv_open_device (device=0x537440) at device.c:128 #1 0x00002af6d3cc4076 in ucma_init () at cma.c:220 #2 0x00002af6d3cc4182 in rdma_create_event_channel () at cma.c:257 #3 0x00002af6d3bb20e3 in dapls_ib_open_hca (hca_name=0x534430 "ib0", hca_ptr=0x532870) at dapl_ib_util.c:222 #4 0x00002af6d3bab454 in dapl_ia_open (name=0x530028 "OpenIB-cma", async_evd_qlen=8, async_evd_handle_ptr=0x52e690, ia_handle_ptr=0x52e660) at dapl_ia_open.c:145 #5 0x00002af6d352e422 in dat_ia_openv (name=0x530028 "OpenIB-cma", async_event_qlen=8, async_event_handle=0x52e690, ia_handle=0x52e660, dapl_major=1, dapl_minor=2, thread_safety=DAT_FALSE) at udat.c:229 #6 0x000000000041461f in DT_cs_Server (params_ptr=0x530020) at dapl_server.c:105 #7 0x0000000000407aa2 in DT_Execute_Test (params_ptr=0x530020) at dapl_execute.c:55 #8 0x000000000041e9d9 in DT_Tdep_Execute_Test (params_ptr=0x530020) at udapl_tdep.c:48 #9 0x0000000000403669 in dapltest (argc=5, argv=0x7fffd7693748) at dapl_main.c:95 #10 0x00000000004035bb in main (argc=5, argv=0x7fffd7693748) at dapl_main.c:37 (gdb) info sharedlibrary >From To Syms Read Shared Object Library 0x00002af6d352e0e0 0x00002af6d3533e38 Yes /usr/local/ib/lib/libdat.so.1 0x00002af6d365d470 0x00002af6d3664d48 Yes /lib64/tls/libpthread.so.0 0x00002af6d37888b0 0x00002af6d3852ce0 Yes 
/lib64/tls/libc.so.6 0x00002af6d398f450 0x00002af6d3990128 Yes /lib64/libdl.so.2 0x00002af6d3a94690 0x00002af6d3a99aa8 Yes /usr/local/ib/lib/libibverbs.so.2 0x00002af6d3415cf0 0x00002af6d3426ab7 Yes /lib64/ld-linux-x86-64.so.2 0x00002af6d3b9ffc0 0x00002af6d3bb7028 Yes /usr/local/ib/lib/libdaplcma.so 0x00002af6d3cc3ca0 0x00002af6d3cc6d18 Yes /usr/local/ib/lib/librdmacm.so 0x00002af6d3deb200 0x00002af6d3df2348 Yes /usr/local/lib/libsysfs.so.1 0x00002af6d3ef5b50 0x00002af6d3efc138 Yes /usr/local/ib/lib/infiniband/mthca.so 0x00002af6d40006c0 0x00002af6d4005838 Yes /usr/local/ib/lib/libibverbs.so.1 From eitan at mellanox.co.il Mon Jun 19 07:43:23 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 19 Jun 2006 17:43:23 +0300 Subject: [openib-general] [PATCHv3] osm: partition manager force policy In-Reply-To: <20060619135653.GB5521@sashak.voltaire.com> References: <86fyi2hek6.fsf@mtl066.yok.mtl.com> <20060619135653.GB5521@sashak.voltaire.com> Message-ID: <4496B80B.8090504@mellanox.co.il> Hi Sasha, Thanks! These two are real bugs. I am sending PATCHv4... Sasha Khapyorsky wrote: > Hi Eitan, > > On 14:46 Sun 18 Jun , Eitan Zahavi wrote: > >>This is a third take after incorporating Sasha's comments to the >>partition manager patch I have previously provided. > > > Two small comments below. > > >> /********************************************************************** >> **********************************************************************/ >>-/* >>- * Prepare a new entry for the pkey table for this port when this pkey >>- * does not exist. Update existed entry when membership was changed. 
>>- */ >>-static void pkey_mgr_process_physical_port( >>- IN osm_log_t *p_log, >>- IN const osm_req_t *p_req, >>- IN const ib_net16_t pkey, >>- IN osm_physp_t *p_physp ) >>+static boolean_t pkey_mgr_update_port( >>+ osm_log_t *p_log, >>+ osm_req_t *p_req, >>+ const osm_port_t * const p_port ) >> { >>- osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); >>- ib_pkey_table_t *block; >>+ osm_physp_t *p_physp; >>+ osm_node_t *p_node; > > > p_node is uninitialized and used in osm_log() later, Thanks. I wonder how I missed this one. > > >>+ ib_pkey_table_t *block, *new_block; >>+ osm_pkey_tbl_t *p_pkey_tbl; >> uint16_t block_index; >>+ uint8_t pkey_index; >>+ uint16_t last_free_block_index = 0; >>+ uint8_t last_free_pkey_index = 0; >> uint16_t num_of_blocks; >>- const osm_pkey_tbl_t *p_pkey_tbl; >>- ib_net16_t *p_orig_pkey; >>- char *stat = NULL; >>- uint32_t i; >>+ uint16_t max_num_of_blocks; >> >>- p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); >>- num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); >>+ ib_api_status_t status; >>+ boolean_t ret_val = FALSE; >>+ osm_pending_pkey_t *p_pending; >>+ boolean_t found; >> >>- p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); >>+ p_physp = osm_port_get_default_phys_ptr( p_port ); >>+ if ( !osm_physp_is_valid( p_physp ) ) >>+ return FALSE; >> >>- if ( !p_orig_pkey ) >>- { >>- for ( block_index = 0; block_index < num_of_blocks; block_index++ ) >>+ p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); >>+ num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); >>+ max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, p_physp ); >>+ if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) >> { >>- block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); >>- for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) >>+ osm_log( p_log, OSM_LOG_INFO, >>+ "pkey_mgr_update_port: " >>+ "Max number of blocks reduced from %u to %u " >>+ "for node 0x%016" PRIx64 " port %u\n", >>+ 
p_pkey_tbl->max_blocks, max_num_of_blocks, >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>+ osm_physp_get_port_num( p_physp ) ); >>+ } > > > >>@@ -255,13 +450,8 @@ pkey_mgr_update_peer_port( >> if (enforce == FALSE) >> return FALSE; >> >>- p_pkey_tbl = osm_physp_get_pkey_tbl( p ); >>- p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); >>- num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); >>- if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) >>- num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); >>- >>- for ( block_index = 0; block_index < num_of_blocks; block_index++ ) >>+ p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; > > > The peer's pkey table blocks may not be initialized yet, and then > > >>+ for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) >> { >> block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); >> peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); > > > peer_block can be NULL. > > Later in the code (not in this patch) there is > 'if (memcmp(peer_block, ...))', which should be changed to > 'if (!peer_block || memcmp(peer_block, ...))'. > > > Sasha > From sashak at voltaire.com Mon Jun 19 07:50:30 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 Jun 2006 17:50:30 +0300 Subject: [openib-general] [PATCHv3] osm: partition manager force policy In-Reply-To: <86fyi2hek6.fsf@mtl066.yok.mtl.com> References: <86fyi2hek6.fsf@mtl066.yok.mtl.com> Message-ID: <20060619145030.GC5521@sashak.voltaire.com> On 14:46 Sun 18 Jun , Eitan Zahavi wrote: > > Another one is the handling of switch limited partition cap by > clearing the switch enforcement bit (on the specific port). A comment about this too. See below. 
> +ib_api_status_t > +osm_pkey_tbl_set_new_entry( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + IN uint16_t block_idx, > + IN uint8_t pkey_idx, > + IN uint16_t pkey) > +{ > + ib_pkey_table_t *p_old_block; > + ib_pkey_table_t *p_new_block; > + > + if (osm_pkey_tbl_make_block_pair( > + p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) > + return( IB_ERROR ); > + > + p_new_block->pkey_entry[pkey_idx] = pkey; > + if (p_pkey_tbl->used_blocks < block_idx) > + p_pkey_tbl->used_blocks = block_idx; > + > + return( IB_SUCCESS ); > +} p_pkey_tbl->used_blocks is updated as block index in range 0,1,2.... > @@ -242,10 +421,26 @@ pkey_mgr_update_peer_port( > if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || !p_si->enforce_cap) > return FALSE; > > + p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); > + p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); > + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > + peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); > + if (peer_max_blocks < p_pkey_tbl->used_blocks) > + { But compared with total number of blocks (ranged 1,2,3,...). In case where switch supports N pkey blocks and CA - N+1, switch's ports will be updated and partitioning enforced. Sasha > + osm_log( p_log, OSM_LOG_ERROR, > + "pkey_mgr_update_peer_port: ERR 0508: " > + "not enough entries (%u < %u) on switch 0x%016" PRIx64 > + " port %u. Clearing Enforcement bit.\n", > + peer_max_blocks, num_of_blocks, > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( peer ) ); > + enforce = FALSE; > + } > + From eitan at mellanox.co.il Mon Jun 19 07:50:36 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 19 Jun 2006 17:50:36 +0300 Subject: [openib-general] [PATCHv4] osm: partition manager force policy Message-ID: <86ejxlgpxf.fsf@mtl066.yok.mtl.com> Hi Hal This is a 4th take after incorporating Sasha's new 2 bug reports for the PATCHv3 for partition manager. The difference from previous patch is very minor: 1. 
p_node is initialized in pkey_mgr_update_port 2. checking for a change in peer port pkey block first check for that block is not null Eitan Signed-off-by: Eitan Zahavi Index: include/opensm/osm_port.h =================================================================== --- include/opensm/osm_port.h (revision 8100) +++ include/opensm/osm_port.h (working copy) @@ -591,6 +591,39 @@ osm_physp_get_pkey_tbl( IN const osm_phy * Port, Physical Port *********/ +/****f* OpenSM: Physical Port/osm_physp_get_mod_pkey_tbl +* NAME +* osm_physp_get_mod_pkey_tbl +* +* DESCRIPTION +* Returns a NON CONST pointer to the P_Key table object of the Physical Port object. +* +* SYNOPSIS +*/ +static inline osm_pkey_tbl_t * +osm_physp_get_mod_pkey_tbl( IN osm_physp_t* const p_physp ) +{ + CL_ASSERT( osm_physp_is_valid( p_physp ) ); + /* + (14.2.5.7) - the block number valid values are 0-2047, and are further + limited by the size of the P_Key table specified by the PartitionCap on the node. + */ + return( &p_physp->pkeys ); +}; +/* +* PARAMETERS +* p_physp +* [in] Pointer to an osm_physp_t object. +* +* RETURN VALUES +* The pointer to the P_Key table object. 
+* +* NOTES +* +* SEE ALSO +* Port, Physical Port +*********/ + /****f* OpenSM: Physical Port/osm_physp_set_slvl_tbl * NAME * osm_physp_set_slvl_tbl Index: include/opensm/osm_pkey.h =================================================================== --- include/opensm/osm_pkey.h (revision 8100) +++ include/opensm/osm_pkey.h (working copy) @@ -92,6 +92,9 @@ typedef struct _osm_pkey_tbl cl_ptr_vector_t blocks; cl_ptr_vector_t new_blocks; cl_map_t keys; + cl_qlist_t pending; + uint16_t used_blocks; + uint16_t max_blocks; } osm_pkey_tbl_t; /* * FIELDS @@ -104,6 +107,18 @@ typedef struct _osm_pkey_tbl * keys * A set holding all keys * +* pending +* A list of osm_pending_pkey structs that is temporarily set by the +* pkey mgr and used only during the pkey mgr algorithm +* +* used_blocks +* Tracks the number of blocks having non-zero pkeys +* +* max_blocks +* The maximal number of blocks this partition table might hold; +* this value is based on node_info (for port 0 or CA) or switch_info, +* updated on receiving the node_info or switch_info GetResp +* * NOTES * 'blocks' vector should be used to store pkey values obtained from * the port and SM pkey manager should not change it directly, for this @@ -114,6 +129,39 @@ typedef struct _osm_pkey_tbl * *********/ +/****s* OpenSM: osm_pending_pkey_t +* NAME +* osm_pending_pkey_t +* +* DESCRIPTION +* This object stores temporary information on pkeys, their target block and index, +* during the pkey manager operation +* +* SYNOPSIS +*/ +typedef struct _osm_pending_pkey { + cl_list_item_t list_item; + uint16_t pkey; + uint32_t block; + uint8_t index; + boolean_t is_new; +} osm_pending_pkey_t; +/* +* FIELDS +* pkey +* The actual P_Key +* +* block +* The block index based on the previous table extracted from the device +* +* index +* The index of the pkey within the block +* +* is_new +* TRUE for new P_Keys such that the block and index are invalid in that case +* +*********/ + /****f* OpenSM: osm_pkey_tbl_construct * NAME * 
osm_pkey_tbl_construct @@ -142,7 +190,8 @@ void osm_pkey_tbl_construct( * * SYNOPSIS */ -int osm_pkey_tbl_init( +ib_api_status_t +osm_pkey_tbl_init( IN osm_pkey_tbl_t *p_pkey_tbl); /* * p_pkey_tbl @@ -209,8 +258,8 @@ osm_pkey_tbl_get_num_blocks( static inline ib_pkey_table_t *osm_pkey_tbl_block_get( const osm_pkey_tbl_t *p_pkey_tbl, uint16_t block) { - CL_ASSERT(block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)); - return(cl_ptr_vector_get(&p_pkey_tbl->blocks, block)); + return( (block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)) ? + cl_ptr_vector_get(&p_pkey_tbl->blocks, block) : NULL); }; /* * p_pkey_tbl @@ -244,16 +293,117 @@ static inline ib_pkey_table_t *osm_pkey_ /* *********/ -/****f* OpenSM: osm_pkey_tbl_sync_new_blocks + +/****f* OpenSM: osm_pkey_tbl_make_block_pair +* NAME +* osm_pkey_tbl_make_block_pair +* +* DESCRIPTION +* Find or create a pair of "old" and "new" blocks for the +* given block index +* +* SYNOPSIS +*/ +ib_api_status_t +osm_pkey_tbl_make_block_pair( + osm_pkey_tbl_t *p_pkey_tbl, + uint16_t block_idx, + ib_pkey_table_t **pp_old_block, + ib_pkey_table_t **pp_new_block); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* block_idx +* [in] The block index to use +* +* pp_old_block +* [out] Pointer to the old block pointer arg +* +* pp_new_block +* [out] Pointer to the new block pointer arg +* +* RETURN VALUES +* IB_SUCCESS if OK IB_ERROR if failed +* +*********/ + +/****f* OpenSM: osm_pkey_tbl_set_new_entry * NAME -* osm_pkey_tbl_sync_new_blocks +* osm_pkey_tbl_set_new_entry * * DESCRIPTION -* Syncs new_blocks vector content with current pkey table blocks +* stores the given pkey in the "new" blocks array and update +* the "map" to show that on the "old" blocks * * SYNOPSIS */ -void osm_pkey_tbl_sync_new_blocks( +ib_api_status_t +osm_pkey_tbl_set_new_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t block_idx, + IN uint8_t pkey_idx, + IN uint16_t pkey); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* block_idx +* [in] 
The block index to use +* +* pkey_idx +* [in] The index within the block +* +* pkey +* [in] PKey to store +* +* RETURN VALUES +* IB_SUCCESS if OK, IB_ERROR if failed +* +*********/ + +/****f* OpenSM: osm_pkey_find_next_free_entry +* NAME +* osm_pkey_find_next_free_entry +* +* DESCRIPTION +* Find the next free entry in the PKey table, starting at the given +* index and block number. The user should increment pkey_idx before +* the next call. +* Inspects the "new" blocks array for empty space. +* +* SYNOPSIS +*/ +boolean_t +osm_pkey_find_next_free_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + OUT uint16_t *p_block_idx, + OUT uint8_t *p_pkey_idx); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* p_block_idx +* [out] The block index to use +* +* p_pkey_idx +* [out] The index within the block to use +* +* RETURN VALUES +* TRUE if found, FALSE if not found +* +*********/ + +/****f* OpenSM: osm_pkey_tbl_init_new_blocks +* NAME +* osm_pkey_tbl_init_new_blocks +* +* DESCRIPTION +* Initializes new_blocks vector content (clear and allocate) +* +* SYNOPSIS +*/ +void osm_pkey_tbl_init_new_blocks( const osm_pkey_tbl_t *p_pkey_tbl); /* * p_pkey_tbl @@ -263,6 +413,41 @@ void osm_pkey_tbl_sync_new_blocks( * *********/ +/****f* OpenSM: osm_pkey_tbl_get_block_and_idx +* NAME +* osm_pkey_tbl_get_block_and_idx +* +* DESCRIPTION +* Sets the block index and pkey index the given +* pkey is found in. Returns IB_NOT_FOUND if it could not find +* it, IB_SUCCESS if OK +* +* SYNOPSIS +*/ +ib_api_status_t +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *block_idx, + OUT uint8_t *pkey_index); +/* +* p_pkey_tbl +* [in] Pointer to osm_pkey_tbl_t object. 
+* +* p_pkey +* [in] Pointer to the P_Key entry searched +* +* p_block_idx +* [out] Pointer to the block index to be updated +* +* p_pkey_idx +* [out] Pointer to the pkey index (in the block) to be updated +* +* +* NOTES +* +*********/ + /****f* OpenSM: osm_pkey_tbl_set * NAME * osm_pkey_tbl_set @@ -272,7 +457,8 @@ void osm_pkey_tbl_sync_new_blocks( * * SYNOPSIS */ -int osm_pkey_tbl_set( +ib_api_status_t +osm_pkey_tbl_set( IN osm_pkey_tbl_t *p_pkey_tbl, IN uint16_t block, IN ib_pkey_table_t *p_tbl); Index: opensm/osm_prtn.c =================================================================== --- opensm/osm_prtn.c (revision 8100) +++ opensm/osm_prtn.c (working copy) @@ -140,6 +140,12 @@ ib_api_status_t osm_prtn_add_port(osm_lo p_tbl = (full == TRUE) ? &p->full_guid_tbl : &p->part_guid_tbl ; + osm_log(p_log, OSM_LOG_VERBOSE, "osm_prtn_add_port: " + "Added port 0x%" PRIx64 " to " + "partition \'%s\' (0x%04x) As %s member\n", + cl_ntoh64(guid), p->name, cl_ntoh16(p->pkey), + full ? "full" : "partial" ); + if (cl_map_insert(p_tbl, guid, p_physp) == NULL) return IB_INSUFFICIENT_MEMORY; Index: opensm/osm_pkey.c =================================================================== --- opensm/osm_pkey.c (revision 8100) +++ opensm/osm_pkey.c (working copy) @@ -94,18 +94,22 @@ void osm_pkey_tbl_destroy( /********************************************************************** **********************************************************************/ -int osm_pkey_tbl_init( +ib_api_status_t +osm_pkey_tbl_init( IN osm_pkey_tbl_t *p_pkey_tbl) { cl_ptr_vector_init( &p_pkey_tbl->blocks, 0, 1); cl_ptr_vector_init( &p_pkey_tbl->new_blocks, 0, 1); cl_map_init( &p_pkey_tbl->keys, 1 ); + cl_qlist_init( &p_pkey_tbl->pending ); + p_pkey_tbl->used_blocks = 0; + p_pkey_tbl->max_blocks = 0; return(IB_SUCCESS); } /********************************************************************** **********************************************************************/ -void osm_pkey_tbl_sync_new_blocks( +void 
osm_pkey_tbl_init_new_blocks( IN const osm_pkey_tbl_t *p_pkey_tbl) { ib_pkey_table_t *p_block, *p_new_block; @@ -123,16 +127,31 @@ void osm_pkey_tbl_sync_new_blocks( p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); if (!p_new_block) break; + cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, + b, p_new_block); + } + memset(p_new_block, 0, sizeof(*p_new_block)); - cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, p_new_block); } - memcpy(p_new_block, p_block, sizeof(*p_new_block)); +} + +/********************************************************************** + **********************************************************************/ +void osm_pkey_tbl_cleanup_pending( + IN osm_pkey_tbl_t *p_pkey_tbl) +{ + cl_list_item_t *p_item; + p_item = cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_item != cl_qlist_end( &p_pkey_tbl->pending ) ) + { + free( (osm_pending_pkey_t *)p_item ); } } /********************************************************************** **********************************************************************/ -int osm_pkey_tbl_set( +ib_api_status_t +osm_pkey_tbl_set( IN osm_pkey_tbl_t *p_pkey_tbl, IN uint16_t block, IN ib_pkey_table_t *p_tbl) @@ -203,7 +222,138 @@ int osm_pkey_tbl_set( /********************************************************************** **********************************************************************/ -static boolean_t __osm_match_pkey ( +ib_api_status_t +osm_pkey_tbl_make_block_pair( + osm_pkey_tbl_t *p_pkey_tbl, + uint16_t block_idx, + ib_pkey_table_t **pp_old_block, + ib_pkey_table_t **pp_new_block) +{ + if (block_idx >= p_pkey_tbl->max_blocks) return(IB_ERROR); + + if (pp_old_block) + { + *pp_old_block = osm_pkey_tbl_block_get( p_pkey_tbl, block_idx ); + if (! 
*pp_old_block) + { + *pp_old_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); + if (!*pp_old_block) return(IB_ERROR); + memset(*pp_old_block, 0, sizeof(ib_pkey_table_t)); + cl_ptr_vector_set(&p_pkey_tbl->blocks, block_idx, *pp_old_block); + } + } + + if (pp_new_block) + { + *pp_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_idx ); + if (! *pp_new_block) + { + *pp_new_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); + if (!*pp_new_block) return(IB_ERROR); + memset(*pp_new_block, 0, sizeof(ib_pkey_table_t)); + cl_ptr_vector_set(&p_pkey_tbl->new_blocks, block_idx, *pp_new_block); + } + } + return( IB_SUCCESS ); +} + +/********************************************************************** + **********************************************************************/ +/* + store the given pkey in the "new" blocks array + also makes sure the regular block exists. +*/ +ib_api_status_t +osm_pkey_tbl_set_new_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t block_idx, + IN uint8_t pkey_idx, + IN uint16_t pkey) +{ + ib_pkey_table_t *p_old_block; + ib_pkey_table_t *p_new_block; + + if (osm_pkey_tbl_make_block_pair( + p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) + return( IB_ERROR ); + + p_new_block->pkey_entry[pkey_idx] = pkey; + if (p_pkey_tbl->used_blocks < block_idx) + p_pkey_tbl->used_blocks = block_idx; + + return( IB_SUCCESS ); +} + +/********************************************************************** + **********************************************************************/ +boolean_t +osm_pkey_find_next_free_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + OUT uint16_t *p_block_idx, + OUT uint8_t *p_pkey_idx) +{ + ib_pkey_table_t *p_new_block; + + CL_ASSERT(p_block_idx); + CL_ASSERT(p_pkey_idx); + + while ( *p_block_idx < p_pkey_tbl->max_blocks) + { + if (*p_pkey_idx > IB_NUM_PKEY_ELEMENTS_IN_BLOCK - 1) + { + *p_pkey_idx = 0; + (*p_block_idx)++; + if (*p_block_idx >= p_pkey_tbl->max_blocks) + return FALSE; + } + + p_new_block = 
osm_pkey_tbl_new_block_get( p_pkey_tbl, *p_block_idx); + + if ( !p_new_block || + ib_pkey_is_invalid(p_new_block->pkey_entry[*p_pkey_idx])) + return TRUE; + else + (*p_pkey_idx)++; + } + return FALSE; +} + +/********************************************************************** + **********************************************************************/ +ib_api_status_t +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *p_block_idx, + OUT uint8_t *p_pkey_index) +{ + uint32_t num_of_blocks; + uint32_t block_index; + ib_pkey_table_t *block; + + CL_ASSERT( p_pkey_tbl ); + CL_ASSERT( p_block_idx != NULL ); + CL_ASSERT( p_pkey_idx != NULL ); + + num_of_blocks = cl_ptr_vector_get_size( &p_pkey_tbl->blocks); + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + { + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + if ( ( block->pkey_entry <= p_pkey ) && + ( p_pkey < block->pkey_entry + IB_NUM_PKEY_ELEMENTS_IN_BLOCK)) + { + *p_block_idx = block_index; + *p_pkey_index = p_pkey - block->pkey_entry; + return( IB_SUCCESS ); + } + } + return( IB_NOT_FOUND ); +} + +/********************************************************************** + **********************************************************************/ +static boolean_t +__osm_match_pkey ( IN const ib_net16_t *pkey1, IN const ib_net16_t *pkey2 ) { @@ -306,7 +456,8 @@ osm_physp_share_pkey( if (cl_is_map_empty(&pkey_tbl1->keys) || cl_is_map_empty(&pkey_tbl2->keys)) return TRUE; - return !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); + return + !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); } /********************************************************************** @@ -322,7 +473,8 @@ osm_port_share_pkey( OSM_LOG_ENTER( p_log, osm_port_share_pkey ); - if (!p_port_1 || !p_port_2) { + if (!p_port_1 || !p_port_2) + { ret = FALSE; goto Exit; } @@ -330,7 +482,8 @@ osm_port_share_pkey( p_physp1 = 
osm_port_get_default_phys_ptr(p_port_1); p_physp2 = osm_port_get_default_phys_ptr(p_port_2); - if (!p_physp1 || !p_physp2) { + if (!p_physp1 || !p_physp2) + { ret = FALSE; goto Exit; } Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 8100) +++ opensm/osm_pkey_mgr.c (working copy) @@ -62,6 +62,131 @@ /********************************************************************** **********************************************************************/ +/* + the max number of pkey blocks for a physical port is located in + different place for switch external ports (SwitchInfo) and the + rest of the ports (NodeInfo) +*/ +static int +pkey_mgr_get_physp_max_blocks( + IN const osm_subn_t *p_subn, + IN const osm_physp_t *p_physp) +{ + osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); + osm_switch_t *p_sw; + uint16_t num_pkeys = 0; + + if ( (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) || + (osm_physp_get_port_num( p_physp ) == 0)) + num_pkeys = cl_ntoh16( p_node->node_info.partition_cap ); + else + { + p_sw = osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); + if (p_sw) + num_pkeys = cl_ntoh16( p_sw->switch_info.enforce_cap ); + } + return( (num_pkeys + 31) / 32 ); +} + +/********************************************************************** + **********************************************************************/ +/* + * Insert the new pending pkey entry to the specific port pkey table + * pending pkeys. new entries are inserted at the back. 
+ */ +static void +pkey_mgr_process_physical_port( + IN osm_log_t *p_log, + IN const osm_req_t *p_req, + IN const ib_net16_t pkey, + IN osm_physp_t *p_physp ) +{ + osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); + osm_pkey_tbl_t *p_pkey_tbl; + ib_net16_t *p_orig_pkey; + char *stat = NULL; + osm_pending_pkey_t *p_pending; + + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); + p_pending = (osm_pending_pkey_t *)malloc(sizeof(osm_pending_pkey_t)); + if (! p_pending) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0502: " + "Fail to allocate new pending pkey entry for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + p_pending->pkey = pkey; + p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + if ( !p_orig_pkey ) + { + p_pending->is_new = TRUE; + cl_qlist_insert_tail(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "inserted"; + } + else + { + CL_ASSERT( ib_pkey_get_base(*p_orig_pkey) == ib_pkey_get_base(pkey) ); + p_pending->is_new = FALSE; + if (osm_pkey_tbl_get_block_and_idx( + p_pkey_tbl, p_orig_pkey, + &p_pending->block, &p_pending->index) != IB_SUCCESS) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0503: " + "Fail to obtain P_Key 0x%04x block and index for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + cl_qlist_insert_head(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "updated"; + } + + osm_log( p_log, OSM_LOG_DEBUG, + "pkey_mgr_process_physical_port: " + "pkey 0x%04x was %s for node 0x%016" PRIx64 + " port %u\n", + cl_ntoh16( pkey ), stat, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); +} + +/********************************************************************** + 
**********************************************************************/ +static void +pkey_mgr_process_partition_table( + osm_log_t *p_log, + const osm_req_t *p_req, + const osm_prtn_t *p_prtn, + const boolean_t full ) +{ + const cl_map_t *p_tbl = + full ? &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; + cl_map_iterator_t i, i_next; + ib_net16_t pkey = p_prtn->pkey; + osm_physp_t *p_physp; + + if ( full ) + pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); + + i_next = cl_map_head( p_tbl ); + while ( i_next != cl_map_end( p_tbl ) ) + { + i = i_next; + i_next = cl_map_next( i ); + p_physp = cl_map_obj( i ); + if ( p_physp && osm_physp_is_valid( p_physp ) ) + pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); + } +} + +/********************************************************************** + **********************************************************************/ static ib_api_status_t pkey_mgr_update_pkey_entry( IN const osm_req_t *p_req, @@ -114,7 +239,8 @@ pkey_mgr_enforce_partition( p_pi->state_info2 = 0; ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); - context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); + context.pi_context.node_guid = + osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); context.pi_context.set_method = TRUE; context.pi_context.update_master_sm_base_lid = FALSE; @@ -131,80 +257,132 @@ pkey_mgr_enforce_partition( /********************************************************************** **********************************************************************/ -/* - * Prepare a new entry for the pkey table for this port when this pkey - * does not exist. Update existed entry when membership was changed. 
- */ -static void pkey_mgr_process_physical_port( - IN osm_log_t *p_log, - IN const osm_req_t *p_req, - IN const ib_net16_t pkey, - IN osm_physp_t *p_physp ) +static boolean_t pkey_mgr_update_port( + osm_log_t *p_log, + osm_req_t *p_req, + const osm_port_t * const p_port ) { - osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); - ib_pkey_table_t *block; + osm_physp_t *p_physp; + osm_node_t *p_node; + ib_pkey_table_t *block, *new_block; + osm_pkey_tbl_t *p_pkey_tbl; uint16_t block_index; + uint8_t pkey_index; + uint16_t last_free_block_index = 0; + uint8_t last_free_pkey_index = 0; uint16_t num_of_blocks; - const osm_pkey_tbl_t *p_pkey_tbl; - ib_net16_t *p_orig_pkey; - char *stat = NULL; - uint32_t i; + uint16_t max_num_of_blocks; - p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + ib_api_status_t status; + boolean_t ret_val = FALSE; + osm_pending_pkey_t *p_pending; + boolean_t found; - p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) + return FALSE; - if ( !p_orig_pkey ) - { - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + p_node = osm_physp_get_node_ptr( p_physp ); + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, p_physp ); + if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) + osm_log( p_log, OSM_LOG_INFO, + "pkey_mgr_update_port: " + "Max number of blocks reduced from %u to %u " + "for node 0x%016" PRIx64 " port %u\n", + p_pkey_tbl->max_blocks, max_num_of_blocks, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + } + p_pkey_tbl->max_blocks = max_num_of_blocks; + + 
osm_pkey_tbl_init_new_blocks( p_pkey_tbl ); + p_pkey_tbl->used_blocks = 0; + + /* + process every pending pkey in order - + first must be "updated" last are "new" + */ + p_pending = + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_pending != + (osm_pending_pkey_t *)cl_qlist_end( &p_pkey_tbl->pending ) ) + { + if (p_pending->is_new == FALSE) + { + block_index = p_pending->block; + pkey_index = p_pending->index; + found = TRUE; + } + else { - if ( ib_pkey_is_invalid( block->pkey_entry[i] ) ) + found = osm_pkey_find_next_free_entry(p_pkey_tbl, + &last_free_block_index, + &last_free_pkey_index); + if ( !found ) { - block->pkey_entry[i] = pkey; - stat = "inserted"; - goto _done; + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0504: " + "failed to find empty space for new pkey 0x%04x " + "of node 0x%016" PRIx64 " port %u\n", + cl_ntoh16(p_pending->pkey), + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); } + else + { + block_index = last_free_block_index; + pkey_index = last_free_pkey_index++; } } + + if (found) + { + if ( IB_SUCCESS != osm_pkey_tbl_set_new_entry( + p_pkey_tbl, block_index, pkey_index, p_pending->pkey) ) + { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_process_physical_port: ERR 0501: " - "No empty pkey entry was found to insert 0x%04x for node " - "0x%016" PRIx64 " port %u\n", - cl_ntoh16( pkey ), + "pkey_mgr_update_port: ERR 0505: " + "failed to set PKey 0x%04x in block %u idx %u " + "of node 0x%016" PRIx64 " port %u\n", + p_pending->pkey, block_index, pkey_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } - else if ( *p_orig_pkey != pkey ) - { + } + + free( p_pending ); + p_pending = + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + } + + /* now look for changes and store */ for ( block_index = 0; block_index < num_of_blocks; block_index++ ) { - /* we need real block (not just new_block) in order - * to 
resolve block/pkey indices */ block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - i = p_orig_pkey - block->pkey_entry; - if (i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - block->pkey_entry[i] = pkey; - stat = "updated"; - goto _done; - } - } - } + new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - _done: - if (stat) { - osm_log( p_log, OSM_LOG_VERBOSE, - "pkey_mgr_process_physical_port: " - "pkey 0x%04x was %s for node 0x%016" PRIx64 - " port %u\n", - cl_ntoh16( pkey ), stat, + if (block && + (!new_block || !memcmp( new_block, block, sizeof( *block ) )) ) + continue; + + status = pkey_mgr_update_pkey_entry( + p_req, p_physp , new_block, block_index ); + if (status == IB_SUCCESS) + ret_val = TRUE; + else + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0506: " + "pkey_mgr_update_pkey_entry() failed to update " + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", + block_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } + + return ret_val; } /********************************************************************** @@ -217,21 +395,23 @@ pkey_mgr_update_peer_port( const osm_port_t * const p_port, boolean_t enforce ) { - osm_physp_t *p, *peer; + osm_physp_t *p_physp, *peer; osm_node_t *p_node; ib_pkey_table_t *block, *peer_block; - const osm_pkey_tbl_t *p_pkey_tbl, *p_peer_pkey_tbl; + const osm_pkey_tbl_t *p_pkey_tbl; + osm_pkey_tbl_t *p_peer_pkey_tbl; osm_switch_t *p_sw; ib_switch_info_t *p_si; uint16_t block_index; uint16_t num_of_blocks; + uint16_t peer_max_blocks; ib_api_status_t status = IB_SUCCESS; boolean_t ret_val = FALSE; - p = osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) return FALSE; - peer = osm_physp_get_remote( p ); + peer = osm_physp_get_remote( p_physp ); if ( !peer || !osm_physp_is_valid( 
peer ) ) return FALSE; p_node = osm_physp_get_node_ptr( peer ); @@ -242,10 +422,26 @@ pkey_mgr_update_peer_port( if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || !p_si->enforce_cap) return FALSE; + p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); + p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); + if (peer_max_blocks < p_pkey_tbl->used_blocks) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_peer_port: ERR 0508: " + "not enough entries (%u < %u) on switch 0x%016" PRIx64 + " port %u. Clearing Enforcement bit.\n", + peer_max_blocks, num_of_blocks, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( peer ) ); + enforce = FALSE; + } + if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0502: " + "pkey_mgr_update_peer_port: ERR 0507: " "pkey_mgr_enforce_partition() failed to update " "node 0x%016" PRIx64 " port %u\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), @@ -255,24 +451,19 @@ pkey_mgr_update_peer_port( if (enforce == FALSE) return FALSE; - p_pkey_tbl = osm_physp_get_pkey_tbl( p ); - p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); - - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; + for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) { block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); - if ( memcmp( peer_block, block, sizeof( *peer_block ) ) ) + if ( !peer_block || memcmp( peer_block, block, sizeof( *peer_block ) ) ) { status = 
pkey_mgr_update_pkey_entry( p_req, peer, block, block_index ); if ( status == IB_SUCCESS ) ret_val = TRUE; else osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0503: " + "pkey_mgr_update_peer_port: ERR 0509: " "pkey_mgr_update_pkey_entry() failed to update " "pkey table block %d for node 0x%016" PRIx64 " port %u\n", @@ -282,10 +473,10 @@ pkey_mgr_update_peer_port( } } - if ( ret_val == TRUE && - osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) + if ( (ret_val == TRUE) && + osm_log_is_active( p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_log, OSM_LOG_VERBOSE, + osm_log( p_log, OSM_LOG_DEBUG, "pkey_mgr_update_peer_port: " "pkey table was updated for node 0x%016" PRIx64 " port %u\n", @@ -298,82 +489,6 @@ pkey_mgr_update_peer_port( /********************************************************************** **********************************************************************/ -static boolean_t pkey_mgr_update_port( - osm_log_t *p_log, - osm_req_t *p_req, - const osm_port_t * const p_port ) -{ - osm_physp_t *p; - osm_node_t *p_node; - ib_pkey_table_t *block, *new_block; - const osm_pkey_tbl_t *p_pkey_tbl; - uint16_t block_index; - uint16_t num_of_blocks; - ib_api_status_t status; - boolean_t ret_val = FALSE; - - p = osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) - return FALSE; - - p_pkey_tbl = osm_physp_get_pkey_tbl(p); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) - { - block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - - if (!new_block || !memcmp( new_block, block, sizeof( *block ) ) ) - continue; - - status = pkey_mgr_update_pkey_entry( p_req, p, new_block, block_index ); - if (status == IB_SUCCESS) - ret_val = TRUE; - else - osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_port: ERR 0504: " - "pkey_mgr_update_pkey_entry() failed to update " - "pkey table block %d for 
node 0x%016" PRIx64 " port %u\n", - block_index, - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( p ) ); - } - - return ret_val; -} - -/********************************************************************** - **********************************************************************/ -static void -pkey_mgr_process_partition_table( - osm_log_t *p_log, - const osm_req_t *p_req, - const osm_prtn_t *p_prtn, - const boolean_t full ) -{ - const cl_map_t *p_tbl = full ? - &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; - cl_map_iterator_t i, i_next; - ib_net16_t pkey = p_prtn->pkey; - osm_physp_t *p_physp; - - if ( full ) - pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); - - i_next = cl_map_head( p_tbl ); - while ( i_next != cl_map_end( p_tbl ) ) - { - i = i_next; - i_next = cl_map_next( i ); - p_physp = cl_map_obj( i ); - if ( p_physp && osm_physp_is_valid( p_physp ) ) - pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); - } -} - -/********************************************************************** - **********************************************************************/ osm_signal_t osm_pkey_mgr_process( IN osm_opensm_t *p_osm ) @@ -383,8 +498,7 @@ osm_pkey_mgr_process( osm_prtn_t *p_prtn; osm_port_t *p_port; osm_signal_t signal = OSM_SIGNAL_DONE; - osm_physp_t *p_physp; - + osm_node_t *p_node; CL_ASSERT( p_osm ); OSM_LOG_ENTER( &p_osm->log, osm_pkey_mgr_process ); @@ -394,32 +508,25 @@ osm_pkey_mgr_process( if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) { osm_log( &p_osm->log, OSM_LOG_ERROR, - "osm_pkey_mgr_process: ERR 0505: " + "osm_pkey_mgr_process: ERR 0510: " "osm_prtn_make_partitions() failed\n" ); goto _err; } - p_tbl = &p_osm->subn.port_guid_tbl; - p_next = cl_qmap_head( p_tbl ); - while ( p_next != cl_qmap_end( p_tbl ) ) - { - p_port = ( osm_port_t * ) p_next; - p_next = cl_qmap_next( p_next ); - p_physp = osm_port_get_default_phys_ptr( p_port ); - if ( osm_physp_is_valid( p_physp ) ) - 
osm_pkey_tbl_sync_new_blocks( osm_physp_get_pkey_tbl( p_physp ) ); - } - + /* populate the pending pkey entries by scanning all partitions */ p_tbl = &p_osm->subn.prtn_pkey_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) { p_prtn = ( osm_prtn_t * ) p_next; p_next = cl_qmap_next( p_next ); - pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); - pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); + pkey_mgr_process_partition_table( + &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); + pkey_mgr_process_partition_table( + &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); } + /* calculate new pkey tables and set */ p_tbl = &p_osm->subn.port_guid_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) @@ -428,8 +535,10 @@ osm_pkey_mgr_process( p_next = cl_qmap_next( p_next ); if ( pkey_mgr_update_port( &p_osm->log, &p_osm->sm.req, p_port ) ) signal = OSM_SIGNAL_DONE_PENDING; - if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != IB_NODE_TYPE_SWITCH && - pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, + p_node = osm_port_get_parent_node( p_port ); + if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && + pkey_mgr_update_peer_port( + &p_osm->log, &p_osm->sm.req, &p_osm->subn, p_port, !p_osm->subn.opt.no_partition_enforcement ) ) signal = OSM_SIGNAL_DONE_PENDING; From bill at Princeton.EDU Mon Jun 19 08:11:12 2006 From: bill at Princeton.EDU (Bill Wichser) Date: Mon, 19 Jun 2006 11:11:12 -0400 Subject: [openib-general] Problem with mca_mpool_openib_register - Cannot allocate memory Message-ID: <4496BE90.40607@princeton.edu> Running the openib stack from Redhat on a 2.6.9-34.ELsmp kernel, dual Xeon. Running with openmpi v1.0.2 compiled w/gcc. 
While we still have the problem with btl_openib_endpoint.c returning 0 bytes for max inline data (we realize that another IB stack addresses this), another problem pops up when running across more than a single host, generating huge amounts of error messages. The errors go something like this: mca_mpool_openib_register: ibv_reg_mr(0x2ac2622000,1052672) failed with error: Cannot allocate memory [0,1,1][btl_openib.c:496:mca_btl_openib_prepare_dst] mpool_register(0x2ac2622040,1048576) failed: base 0x2ac2222040 lb 0 offset 4194304 We fixed the /etc/security/limits.conf problem but I don't know what to do about this one. The job seems to complete without error on 2 nodes (4 processors), but scaling any larger just generates megabyte-sized files of these error messages. Any insights into this problem? All searches lead me to limits.conf, which we have set to 8192. These are 8G machines, if that makes any difference. Thanks, Bill From sashak at voltaire.com Mon Jun 19 08:20:04 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 Jun 2006 18:20:04 +0300 Subject: [openib-general] [PATCH TRIVIAL] opensm: fix typo in the usage Message-ID: <20060619152004.GD5521@sashak.voltaire.com> Hi Hal, This fixes a typo in the usage.
Signed-off-by: Sasha Khapyorsky --- osm/opensm/main.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/osm/opensm/main.c b/osm/opensm/main.c index dfb2aec..4382fdb 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -180,7 +180,7 @@ show_usage(void) printf( "-U\n" "--ucast_file \n" " This option specifies name of the unicast dump file\n" - " from where switch forwarding tables will be loaded.\nn"); + " from where switch forwarding tables will be loaded.\n\n"); printf ("-a\n" "--add_guid_file \n" " Set the root nodes for the Up/Down routing algorithm\n" From eitan at mellanox.co.il Mon Jun 19 08:24:40 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 19 Jun 2006 18:24:40 +0300 Subject: [openib-general] [PATCHv3] osm: partition manager force policy In-Reply-To: <20060619145030.GC5521@sashak.voltaire.com> References: <86fyi2hek6.fsf@mtl066.yok.mtl.com> <20060619145030.GC5521@sashak.voltaire.com> Message-ID: <4496C1B8.40200@mellanox.co.il> Hi Sasha, Thanks. This is yet another bug. The fix is trivial and is noted below. Please let me know when you are done reviewing and I will post a new patch. EZ Sasha Khapyorsky wrote: > On 14:46 Sun 18 Jun , Eitan Zahavi wrote: > >>Another one is the handling of switch limited partition cap by >>clearing the switch enforcement bit (on the specific port). > > > Some comment about this too. See below. 
> > >>+ib_api_status_t >>+osm_pkey_tbl_set_new_entry( >>+ IN osm_pkey_tbl_t *p_pkey_tbl, >>+ IN uint16_t block_idx, >>+ IN uint8_t pkey_idx, >>+ IN uint16_t pkey) >>+{ >>+ ib_pkey_table_t *p_old_block; >>+ ib_pkey_table_t *p_new_block; >>+ >>+ if (osm_pkey_tbl_make_block_pair( >>+ p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) >>+ return( IB_ERROR ); >>+ >>+ p_new_block->pkey_entry[pkey_idx] = pkey; >>+ if (p_pkey_tbl->used_blocks < block_idx) >>+ p_pkey_tbl->used_blocks = block_idx; Fix: if (p_pkey_tbl->used_blocks <= block_idx) p_pkey_tbl->used_blocks = block_idx + 1; >>+ >>+ return( IB_SUCCESS ); >>+} > > > p_pkey_tbl->used_blocks is updated as block index in range 0,1,2.... > > >>@@ -242,10 +421,26 @@ pkey_mgr_update_peer_port( >> if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || !p_si->enforce_cap) >> return FALSE; >> >>+ p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); >>+ p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); >>+ num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); >>+ peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); >>+ if (peer_max_blocks < p_pkey_tbl->used_blocks) >>+ { > > > But compared with total number of blocks (ranged 1,2,3,...). In case > where switch supports N pkey blocks and CA - N+1, switch's ports will be > updated and partitioning enforced. > > Sasha > > >>+ osm_log( p_log, OSM_LOG_ERROR, >>+ "pkey_mgr_update_peer_port: ERR 0508: " >>+ "not enough entries (%u < %u) on switch 0x%016" PRIx64 >>+ " port %u. 
Clearing Enforcement bit.\n", >>+ peer_max_blocks, num_of_blocks, >>+ cl_ntoh64( osm_node_get_node_guid( p_node ) ), >>+ osm_physp_get_port_num( peer ) ); >>+ enforce = FALSE; >>+ } >>+ > > From halr at voltaire.com Mon Jun 19 08:24:41 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Jun 2006 11:24:41 -0400 Subject: [openib-general] [PATCH TRIVIAL] opensm: fix typo in the usage In-Reply-To: <20060619152004.GD5521@sashak.voltaire.com> References: <20060619152004.GD5521@sashak.voltaire.com> Message-ID: <1150730675.4391.67038.camel@hal.voltaire.com> On Mon, 2006-06-19 at 11:20, Sasha Khapyorsky wrote: > Hi Hal, > > This fixes a typo in the usage. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From bugzilla-daemon at openib.org Mon Jun 19 08:53:25 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 19 Jun 2006 08:53:25 -0700 (PDT) Subject: [openib-general] [Bug 145] New: IB Core unable to communicate IPoIB on Fedora Core 4 Message-ID: <20060619155325.75BDD228735@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=145 Summary: IB Core unable to communicate IPoIB on Fedora Core 4 Product: OpenFabrics Linux Version: 1.0rc5 Platform: X86-64 OS/Version: Other Status: NEW Severity: major Priority: P2 Component: IB Core AssignedTo: bugzilla at openib.org ReportedBy: smarsh at analogic.com I have installed OFED 1.0rc5 on a dual-core Intel X86-64 system with Fedora Core 4 (2.6.11-1.1369) installed. I installed using the "everything" option and typed "no" for mpi_osu with the gcc install. Everything compiles without error. After the install and a reboot, the ib0 and ib1 connections are apparent. I can ping over the TCP/IP stack but cannot ibping (I suspect I have issues with SDP; the daemon seems to be running, though). I receive the following message in verbose mode: "ibwarn: [3494: ibping to Lid 0xc failed". Any help would be greatly appreciated.
------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From sashak at voltaire.com Mon Jun 19 09:25:45 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 Jun 2006 19:25:45 +0300 Subject: [openib-general] [PATCHv3] osm: partition manager force policy In-Reply-To: <4496C1B8.40200@mellanox.co.il> References: <86fyi2hek6.fsf@mtl066.yok.mtl.com> <20060619145030.GC5521@sashak.voltaire.com> <4496C1B8.40200@mellanox.co.il> Message-ID: <20060619162545.GE5521@sashak.voltaire.com> On 18:24 Mon 19 Jun , Eitan Zahavi wrote: > Hi Sasha, > > Thanks. This is yet another bug. > The fix is trivial and is noted below. > > Please let me know when you are done reviewing and I will post a new patch. I'm done. Did some running, enforcement works as expected now. Sasha > > EZ > Sasha Khapyorsky wrote: > >On 14:46 Sun 18 Jun , Eitan Zahavi wrote: > > > >>Another one is the handling of switch limited partition cap by > >>clearing the switch enforcement bit (on the specific port). > > > > > >Some comment about this too. See below. > > > > > >>+ib_api_status_t > >>+osm_pkey_tbl_set_new_entry( > >>+ IN osm_pkey_tbl_t *p_pkey_tbl, > >>+ IN uint16_t block_idx, > >>+ IN uint8_t pkey_idx, > >>+ IN uint16_t pkey) > >>+{ > >>+ ib_pkey_table_t *p_old_block; > >>+ ib_pkey_table_t *p_new_block; > >>+ > >>+ if (osm_pkey_tbl_make_block_pair( > >>+ p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) > >>+ return( IB_ERROR ); > >>+ > >>+ p_new_block->pkey_entry[pkey_idx] = pkey; > >>+ if (p_pkey_tbl->used_blocks < block_idx) > >>+ p_pkey_tbl->used_blocks = block_idx; > Fix: > if (p_pkey_tbl->used_blocks <= block_idx) > p_pkey_tbl->used_blocks = block_idx + 1; > >>+ > >>+ return( IB_SUCCESS ); > >>+} > > > > > >p_pkey_tbl->used_blocks is updated as block index in range 0,1,2.... 
> > > > > >>@@ -242,10 +421,26 @@ pkey_mgr_update_peer_port( > >> if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || > >> !p_si->enforce_cap) > >> return FALSE; > >> > >>+ p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); > >>+ p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); > >>+ num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); > >>+ peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); > >>+ if (peer_max_blocks < p_pkey_tbl->used_blocks) > >>+ { > > > > > >But compared with total number of blocks (ranged 1,2,3,...). In case > >where switch supports N pkey blocks and CA - N+1, switch's ports will be > >updated and partitioning enforced. > > > >Sasha > > > > > >>+ osm_log( p_log, OSM_LOG_ERROR, > >>+ "pkey_mgr_update_peer_port: ERR > >>0508: " > >>+ "not enough entries (%u < %u) on > >>switch 0x%016" PRIx64 > >>+ " port %u. Clearing Enforcement > >>bit.\n", > >>+ peer_max_blocks, num_of_blocks, > >>+ cl_ntoh64( osm_node_get_node_guid( > >>p_node ) ), > >>+ osm_physp_get_port_num( peer ) ); > >>+ enforce = FALSE; > >>+ } > >>+ > > > > > From halr at voltaire.com Mon Jun 19 09:39:54 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Jun 2006 12:39:54 -0400 Subject: [openib-general] [PATCH] OpenSM/osm_sa_link_record.c: Fix LMC > 0 handling Message-ID: <1150735193.4391.69996.camel@hal.voltaire.com> OpenSM/osm_sa_link_record.c: Fix LMC > 0 handling In osm_sa_link_record.c, properly handle non base LID requests per C15-0.1.11: Query responses shall contain a port's base LID in any LID component of a RID. So when LMC is non 0, the only records that appear are those with the base LID and not with any masked LIDs. Furthermore, if a query comes in on a non base LID, the LID in the RID returned is only with the base LID. To do this, added new routine osm_get_port_by_base_lid in osm_port.c for use by other SA records. Also, fixed some error handling for SA GetTable LinkRecord requests. 
Also, added more SA LinkRecord test cases to osmtest/osmtest.c Signed-off-by: Hal Rosenstock Index: include/opensm/osm_port.h =================================================================== --- include/opensm/osm_port.h (revision 8108) +++ include/opensm/osm_port.h (working copy) @@ -1737,6 +1737,42 @@ osm_port_get_lid_range_ho( * Port *********/ +/****f* OpenSM: Port/osm_get_port_by_base_lid +* NAME +* osm_get_port_by_base_lid +* +* DESCRIPTION +* Returns a status on whether a Port was able to be +* determined based on the LID supplied and if so, return the Port. +* +* SYNOPSIS +*/ +ib_api_status_t +osm_get_port_by_base_lid( + IN const osm_subn_t* const p_subn, + IN const ib_net16_t lid, + IN OUT const osm_port_t** const pp_port ); +/* +* PARAMETERS +* p_subn +* [in] Pointer to the subnet data structure. +* +* lid +* [in] LID requested. +* +* pp_port +* [in][out] Pointer to pointer to Port object. +* +* RETURN VALUES +* IB_SUCCESS +* IB_NOT_FOUND +* +* NOTES +* +* SEE ALSO +* Port +*********/ + /****f* OpenSM: Port/osm_port_add_new_physp * NAME * osm_port_add_new_physp Index: opensm/osm_port.c =================================================================== --- opensm/osm_port.c (revision 8108) +++ opensm/osm_port.c (working copy) @@ -266,6 +266,44 @@ osm_port_get_lid_range_ho( /********************************************************************** **********************************************************************/ +ib_api_status_t +osm_get_port_by_base_lid( + IN const osm_subn_t* const p_subn, + IN const ib_net16_t lid, + IN OUT const osm_port_t** const pp_port ) +{ + ib_api_status_t status; + uint16_t base_lid; + uint8_t lmc; + + *pp_port = NULL; + + /* Loop on lmc from 0 up through max LMC */ + for (lmc = 0; lmc <= IB_PORT_LMC_MAX; lmc++) + { + /* Calculate a base LID assuming this is the real LMC */ + base_lid = (cl_ntoh16(lid) & ~(1 << lmc)); + + /* Look for a match */ + status = cl_ptr_vector_at( &p_subn->port_lid_tbl, + base_lid, + 
(void**)pp_port ); + if ((status == CL_SUCCESS) && (*pp_port != NULL)) + { + /* Determine if base LID "tested" is the real base LID */ + /* This is true if the LMC "tested" is the port's actual LMC */ + if (lmc == osm_port_get_lmc( *pp_port ) ) + goto Found; + } + } + status = IB_NOT_FOUND; + + Found: + return status; +} + +/********************************************************************** + **********************************************************************/ void osm_port_add_new_physp( IN osm_port_t* const p_port, Index: opensm/osm_sa_link_record.c =================================================================== --- opensm/osm_sa_link_record.c (revision 8108) +++ opensm/osm_sa_link_record.c (working copy) @@ -209,7 +209,6 @@ __osm_lr_rcv_get_physp_link( ib_net16_t from_max_lid_ho; ib_net16_t to_max_lid_ho; ib_net16_t to_base_lid_ho; - uint16_t i, j; OSM_LOG_ENTER( p_rcv->p_log, __osm_lr_rcv_get_physp_link ); @@ -313,30 +312,12 @@ __osm_lr_rcv_get_physp_link( dest_port_num ); } - if( comp_mask & IB_LR_COMPMASK_FROM_LID ) - { - from_max_lid_ho = from_base_lid_ho = cl_ntoh16(p_lr->from_lid); - } - else - { - __get_lid_range(p_src_physp, &from_base_lid_ho, &from_max_lid_ho); - } + __get_lid_range(p_src_physp, &from_base_lid_ho, &from_max_lid_ho); + __get_lid_range(p_dest_physp, &to_base_lid_ho, &to_max_lid_ho); - if( comp_mask & IB_LR_COMPMASK_TO_LID ) - { - to_max_lid_ho = to_base_lid_ho = cl_ntoh16(p_lr->to_lid); - } - else - { - __get_lid_range(p_dest_physp, &to_base_lid_ho, &to_max_lid_ho); - } - - for (i = from_base_lid_ho; i <= from_max_lid_ho; i++) - { - for(j = to_base_lid_ho; j <= to_max_lid_ho; j++) - __osm_lr_rcv_build_physp_link(p_rcv, cl_ntoh16(i), cl_ntoh16(j), - src_port_num, dest_port_num, p_list); - } + __osm_lr_rcv_build_physp_link(p_rcv, cl_ntoh16(from_base_lid_ho), + cl_ntoh16(to_base_lid_ho), + src_port_num, dest_port_num, p_list); Exit: OSM_LOG_EXIT( p_rcv->p_log ); @@ -515,12 +496,11 @@ __osm_lr_rcv_get_end_points( if( 
p_sa_mad->comp_mask & IB_LR_COMPMASK_FROM_LID ) { - status = cl_ptr_vector_at( &p_rcv->p_subn->port_lid_tbl, - cl_ntoh16(p_lr->from_lid), - (void**)pp_src_port ); + status = osm_get_port_by_base_lid( p_rcv->p_subn, + p_lr->from_lid, + pp_src_port ); - if( ( (status != CL_SUCCESS) || (*pp_src_port == NULL) ) && - (p_sa_mad->method == IB_MAD_METHOD_GET) ) + if( (status != CL_SUCCESS) || (*pp_src_port == NULL) ) { /* This 'error' is the client's fault (bad lid) so @@ -539,12 +519,11 @@ __osm_lr_rcv_get_end_points( if( p_sa_mad->comp_mask & IB_LR_COMPMASK_TO_LID ) { - status = cl_ptr_vector_at( &p_rcv->p_subn->port_lid_tbl, - cl_ntoh16(p_lr->to_lid), - (void**)pp_dest_port ); + status = osm_get_port_by_base_lid( p_rcv->p_subn, + p_lr->to_lid, + pp_dest_port ); - if( ( (status != CL_SUCCESS) || (*pp_dest_port == NULL) ) && - (p_sa_mad->method == IB_MAD_METHOD_GET) ) + if( (status != CL_SUCCESS) || (*pp_dest_port == NULL) ) { /* This 'error' is the client's fault (bad lid) so @@ -732,8 +711,8 @@ osm_lr_rcv_process( { const ib_link_record_t* p_lr; const ib_sa_mad_t* p_sa_mad; - const osm_port_t* p_src_port = NULL; - const osm_port_t* p_dest_port = NULL; + const osm_port_t* p_src_port; + const osm_port_t* p_dest_port; cl_qlist_t lr_list; ib_net16_t sa_status; osm_physp_t* p_req_physp; @@ -784,16 +763,12 @@ osm_lr_rcv_process( sa_status = __osm_lr_rcv_get_end_points( p_rcv, p_madw, &p_src_port, &p_dest_port ); - if( sa_status != IB_SA_MAD_STATUS_SUCCESS ) + if( sa_status == IB_SA_MAD_STATUS_SUCCESS ) { - cl_plock_release( p_rcv->p_lock ); - osm_sa_send_error( p_rcv->p_resp, p_madw, sa_status ); - goto Exit; + __osm_lr_rcv_get_port_links( p_rcv, p_lr, p_src_port, p_dest_port, + p_sa_mad->comp_mask, &lr_list, p_req_physp ); } - __osm_lr_rcv_get_port_links( p_rcv, p_lr, p_src_port, p_dest_port, - p_sa_mad->comp_mask, &lr_list, p_req_physp ); - cl_plock_release( p_rcv->p_lock ); if( (cl_qlist_count( &lr_list ) == 0) && Index: osmtest/osmtest.c 
=================================================================== --- osmtest/osmtest.c (revision 8109) +++ osmtest/osmtest.c (working copy) @@ -4309,6 +4309,99 @@ osmtest_validate_all_path_recs( IN osmte OSM_LOG_EXIT( &p_osmt->log ); return ( status ); } + +/********************************************************************** + * Get link record by LID + **********************************************************************/ +ib_api_status_t +osmtest_get_link_rec_by_lid( IN osmtest_t * const p_osmt, + IN ib_net16_t const from_lid, + IN ib_net16_t const to_lid, + IN OUT osmtest_req_context_t * const p_context ) +{ + ib_api_status_t status = IB_SUCCESS; + osmv_user_query_t user; + osmv_query_req_t req; + ib_link_record_t record; + ib_mad_t *p_mad; + + OSM_LOG_ENTER( &p_osmt->log, osmtest_get_link_rec_by_lid ); + + if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE ) ) + { + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_get_link_rec_by_lid: " + "Getting link record from LID 0x%02X to LID 0x%02X\n", + cl_ntoh16( from_lid ), cl_ntoh16( to_lid ) ); + } + + /* + * Do a blocking query for this record in the subnet. + * The result is returned in the result field of the caller's + * context structure. + * + * The query structures are locals. 
+ */ + memset( &req, 0, sizeof( req ) ); + memset( &user, 0, sizeof( user ) ); + memset( &record, 0, sizeof( record ) ); + + record.from_lid = from_lid; + record.to_lid = to_lid; + p_context->p_osmt = p_osmt; + if (from_lid) + user.comp_mask |= IB_LR_COMPMASK_FROM_LID; + if (to_lid) + user.comp_mask |= IB_LR_COMPMASK_TO_LID; + user.attr_id = IB_MAD_ATTR_LINK_RECORD; + user.attr_offset = cl_ntoh16( ( uint16_t ) ( sizeof( record ) >> 3 ) ); + user.p_attr = &record; + + req.query_type = OSMV_QUERY_USER_DEFINED; + req.timeout_ms = p_osmt->opt.transaction_timeout; + req.retry_cnt = p_osmt->opt.retry_count; + req.flags = OSM_SA_FLAGS_SYNC; + req.query_context = p_context; + req.pfn_query_cb = osmtest_query_res_cb; + req.p_query_input = &user; + req.sm_key = 0; + + status = osmv_query_sa( p_osmt->h_bind, &req ); + if( status != IB_SUCCESS ) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_get_link_rec_by_lid: ERR 007A: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); + goto Exit; + } + + status = p_context->result.status; + + if( status != IB_SUCCESS ) + { + if (status != IB_INVALID_PARAMETER) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_get_link_rec_by_lid: ERR 007B: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); + } + if( status == IB_REMOTE_ERROR ) + { + p_mad = osm_madw_get_mad_ptr( p_context->result.p_result_madw ); + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_get_link_rec_by_lid: " + "Remote error = %s\n", + ib_get_mad_status_str( p_mad )); + + status = (ib_net16_t) (p_mad->status & IB_SMP_STATUS_MASK ); + } + goto Exit; + } + + Exit: + OSM_LOG_EXIT( &p_osmt->log ); + return ( status ); +} #endif /********************************************************************** @@ -4891,9 +4984,10 @@ osmtest_validate_against_db( IN osmtest_ { ib_api_status_t status = IB_SUCCESS; #ifdef VENDOR_RMPP_SUPPORT + ib_net16_t test_lid; uint8_t lmc; -#ifdef DUAL_SIDED_RMPP osmtest_req_context_t context; +#ifdef DUAL_SIDED_RMPP 
osmv_multipath_req_t request; #endif #endif @@ -5003,6 +5097,7 @@ osmtest_validate_against_db( IN osmtest_ #endif #ifdef VENDOR_RMPP_SUPPORT + /* GUIDInfoRecords */ status = osmtest_validate_all_guidinfo_recs( p_osmt ); if( status != IB_SUCCESS ) goto Exit; @@ -5019,6 +5114,43 @@ osmtest_validate_against_db( IN osmtest_ goto Exit; } + /* Some LinkRecord tests */ + test_lid = cl_ntoh16( p_osmt->local_port.lid ); + /* FromLID */ + memset( &context, 0, sizeof( context ) ); + status = osmtest_get_link_rec_by_lid( p_osmt, test_lid, 0, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + /* ToLID */ + memset( &context, 0, sizeof( context ) ); + status = osmtest_get_link_rec_by_lid( p_osmt, 0, test_lid, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + /* FromLID & ToLID */ + memset( &context, 0, sizeof( context ) ); + status = osmtest_get_link_rec_by_lid( p_osmt, test_lid, test_lid, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + if (lmc != 0) + { + test_lid = cl_ntoh16( p_osmt->local_port.lid + 1 ); + /* FromLID */ + memset( &context, 0, sizeof( context ) ); + status = osmtest_get_link_rec_by_lid( p_osmt, test_lid, 0, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + /* ToLID */ + memset( &context, 0, sizeof( context ) ); + status = osmtest_get_link_rec_by_lid( p_osmt, 0, test_lid, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + } + + /* PathRecords */ if (! p_osmt->opt.ignore_path_records) { status = osmtest_validate_all_path_recs( p_osmt ); From jlentini at netapp.com Mon Jun 19 10:19:59 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 19 Jun 2006 13:19:59 -0400 (EDT) Subject: [openib-general] trunk's udapl does not compile In-Reply-To: References: Message-ID: On Mon, 19 Jun 2006, Or Gerlitz wrote: > I've just noted an inconsistency with librdmacm of udapl calling > rdma_create_id without providing the PS param. > > This is the trivial patch i was using to fix the compilation. Yup. 
The RDMA CM update on Friday afternoon broke uDAPL. Fixed in revision 8112. From jlentini at netapp.com Mon Jun 19 10:23:39 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 19 Jun 2006 13:23:39 -0400 (EDT) Subject: [openib-general] dapltest gets segfaulted in librdmacm init In-Reply-To: References: Message-ID: I don't see this. The gdb sharedlibrary output looks suspicious. /usr/local/ib isn't a standard path for our binaries. Are you sure everything is up-to-date on your system? Is the provided library that you have configured to handle IA "OpenIB-cma" the latest and greatest? On Mon, 19 Jun 2006, Or Gerlitz wrote: > After fixing the ucma/port space issue with the calls to rdma_create_id i > am now trying to run > > $ ./Target/dapltest -T S -D OpenIB-cma > > and getting an immediate segfault with the below trace, any idea? > > Or. > > #0 0x00002af6d3a97685 in ibv_open_device (device=0x537440) at device.c:128 > 128 context = device->ops.alloc_context(device, cmd_fd); > (gdb) where > #0 0x00002af6d3a97685 in ibv_open_device (device=0x537440) at device.c:128 > #1 0x00002af6d3cc4076 in ucma_init () at cma.c:220 > #2 0x00002af6d3cc4182 in rdma_create_event_channel () at cma.c:257 > #3 0x00002af6d3bb20e3 in dapls_ib_open_hca (hca_name=0x534430 "ib0", hca_ptr=0x532870) at dapl_ib_util.c:222 > #4 0x00002af6d3bab454 in dapl_ia_open (name=0x530028 "OpenIB-cma", async_evd_qlen=8, async_evd_handle_ptr=0x52e690, > ia_handle_ptr=0x52e660) at dapl_ia_open.c:145 > #5 0x00002af6d352e422 in dat_ia_openv (name=0x530028 "OpenIB-cma", async_event_qlen=8, async_event_handle=0x52e690, > ia_handle=0x52e660, dapl_major=1, dapl_minor=2, thread_safety=DAT_FALSE) at udat.c:229 > #6 0x000000000041461f in DT_cs_Server (params_ptr=0x530020) at dapl_server.c:105 > #7 0x0000000000407aa2 in DT_Execute_Test (params_ptr=0x530020) at dapl_execute.c:55 > #8 0x000000000041e9d9 in DT_Tdep_Execute_Test (params_ptr=0x530020) at udapl_tdep.c:48 > #9 0x0000000000403669 in dapltest (argc=5, 
argv=0x7fffd7693748) at dapl_main.c:95 > #10 0x00000000004035bb in main (argc=5, argv=0x7fffd7693748) at dapl_main.c:37 > (gdb) info sharedlibrary > >From To Syms Read Shared Object Library > 0x00002af6d352e0e0 0x00002af6d3533e38 Yes /usr/local/ib/lib/libdat.so.1 > 0x00002af6d365d470 0x00002af6d3664d48 Yes /lib64/tls/libpthread.so.0 > 0x00002af6d37888b0 0x00002af6d3852ce0 Yes /lib64/tls/libc.so.6 > 0x00002af6d398f450 0x00002af6d3990128 Yes /lib64/libdl.so.2 > 0x00002af6d3a94690 0x00002af6d3a99aa8 Yes /usr/local/ib/lib/libibverbs.so.2 > 0x00002af6d3415cf0 0x00002af6d3426ab7 Yes /lib64/ld-linux-x86-64.so.2 > 0x00002af6d3b9ffc0 0x00002af6d3bb7028 Yes /usr/local/ib/lib/libdaplcma.so > 0x00002af6d3cc3ca0 0x00002af6d3cc6d18 Yes /usr/local/ib/lib/librdmacm.so > 0x00002af6d3deb200 0x00002af6d3df2348 Yes /usr/local/lib/libsysfs.so.1 > 0x00002af6d3ef5b50 0x00002af6d3efc138 Yes /usr/local/ib/lib/infiniband/mthca.so > 0x00002af6d40006c0 0x00002af6d4005838 Yes /usr/local/ib/lib/libibverbs.so.1 > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From swise at opengridcomputing.com Mon Jun 19 10:27:18 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 19 Jun 2006 12:27:18 -0500 Subject: [openib-general] MVAPICH and librdmacm Message-ID: <1150738038.26165.5.camel@stevo-desktop> Hello, Anybody working on porting the MVAPICH code to use the RDMA CM for connection setup? Just wondering how much work is needed to make MVAPICH run on the iwarp devices. Thanks, Steve. 
From bugzilla-daemon at openib.org Mon Jun 19 10:32:29 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 19 Jun 2006 10:32:29 -0700 (PDT) Subject: [openib-general] [Bug 145] IB Core unable to communicate IPoIB on Fedora Core 4 Message-ID: <20060619173229.EB5CA228738@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=145 ------- Comment #1 from halr at voltaire.com 2006-06-19 10:32 ------- If I understand what you wrote correctly, IPoIB is running fine but ibping reports some error. What is LID 0xC (and how was this determined) ? Is the ibping kernel module running or the user space daemon for ibping running on LID 0xC ? This may or may not be separate from whatever SDP issue you may have. Can you do an ibnetdiscover and attach the output ? Can you do an /sbin/lsmod | grep ib_ on the remote node (LID 0xC) ? ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From ardavis at ichips.intel.com Mon Jun 19 11:16:15 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Mon, 19 Jun 2006 11:16:15 -0700 Subject: [openib-general] dapltest gets segfaulted in librdmacm init In-Reply-To: References: Message-ID: <4496E9EF.1090607@ichips.intel.com> Or Gerlitz wrote: >After fixing the ucma/port space issue with the calls to rdma_create_id i >am now trying to run > > $ ./Target/dapltest -T S -D OpenIB-cma > >and getting an immediate segfault with the below trace, any idea? > > Hmm, no idea. I just updated to 8112 and everything runs fine for me (2.6.17). >Or. 
> >#0 0x00002af6d3a97685 in ibv_open_device (device=0x537440) at device.c:128 >128 context = device->ops.alloc_context(device, cmd_fd); >(gdb) where >#0 0x00002af6d3a97685 in ibv_open_device (device=0x537440) at device.c:128 >#1 0x00002af6d3cc4076 in ucma_init () at cma.c:220 >#2 0x00002af6d3cc4182 in rdma_create_event_channel () at cma.c:257 >#3 0x00002af6d3bb20e3 in dapls_ib_open_hca (hca_name=0x534430 "ib0", hca_ptr=0x532870) at dapl_ib_util.c:222 >#4 0x00002af6d3bab454 in dapl_ia_open (name=0x530028 "OpenIB-cma", async_evd_qlen=8, async_evd_handle_ptr=0x52e690, > ia_handle_ptr=0x52e660) at dapl_ia_open.c:145 >#5 0x00002af6d352e422 in dat_ia_openv (name=0x530028 "OpenIB-cma", async_event_qlen=8, async_event_handle=0x52e690, > ia_handle=0x52e660, dapl_major=1, dapl_minor=2, thread_safety=DAT_FALSE) at udat.c:229 >#6 0x000000000041461f in DT_cs_Server (params_ptr=0x530020) at dapl_server.c:105 >#7 0x0000000000407aa2 in DT_Execute_Test (params_ptr=0x530020) at dapl_execute.c:55 >#8 0x000000000041e9d9 in DT_Tdep_Execute_Test (params_ptr=0x530020) at udapl_tdep.c:48 >#9 0x0000000000403669 in dapltest (argc=5, argv=0x7fffd7693748) at dapl_main.c:95 >#10 0x00000000004035bb in main (argc=5, argv=0x7fffd7693748) at dapl_main.c:37 >(gdb) info sharedlibrary >>From To Syms Read Shared Object Library >0x00002af6d352e0e0 0x00002af6d3533e38 Yes /usr/local/ib/lib/libdat.so.1 >0x00002af6d365d470 0x00002af6d3664d48 Yes /lib64/tls/libpthread.so.0 >0x00002af6d37888b0 0x00002af6d3852ce0 Yes /lib64/tls/libc.so.6 >0x00002af6d398f450 0x00002af6d3990128 Yes /lib64/libdl.so.2 >0x00002af6d3a94690 0x00002af6d3a99aa8 Yes /usr/local/ib/lib/libibverbs.so.2 >0x00002af6d3415cf0 0x00002af6d3426ab7 Yes /lib64/ld-linux-x86-64.so.2 >0x00002af6d3b9ffc0 0x00002af6d3bb7028 Yes /usr/local/ib/lib/libdaplcma.so >0x00002af6d3cc3ca0 0x00002af6d3cc6d18 Yes /usr/local/ib/lib/librdmacm.so >0x00002af6d3deb200 0x00002af6d3df2348 Yes /usr/local/lib/libsysfs.so.1 >0x00002af6d3ef5b50 0x00002af6d3efc138 Yes 
/usr/local/ib/lib/infiniband/mthca.so >0x00002af6d40006c0 0x00002af6d4005838 Yes /usr/local/ib/lib/libibverbs.so.1 > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From sashak at voltaire.com Mon Jun 19 11:30:46 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 Jun 2006 21:30:46 +0300 Subject: [openib-general] [PATCH TRIVIAL] opensm: libibmad: fix umad retry counter Message-ID: <20060619183046.GF5521@sashak.voltaire.com> Hi Hal, This fixes the umad send/recv retry counter in the error report. Signed-off-by: Sasha Khapyorsky --- libibmad/src/rpc.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index a3b29c9..e929ba4 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -132,7 +132,7 @@ _do_madrpc(void *umad, int agentid, int for (retries = 0; retries < madrpc_retries; retries++) { if (retries) { - ERRS("retry %d (timeout %d ms)", retries + 1, timeout); + ERRS("retry %d (timeout %d ms)", retries, timeout); /* Restore user MAD header */ memcpy(&mad->addr, &addr, sizeof addr); } From eitan at mellanox.co.il Mon Jun 19 12:05:11 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 19 Jun 2006 22:05:11 +0300 Subject: [openib-general] [PATCHv5] osm: partition manager force policy Message-ID: <86d5d5ge54.fsf@mtl066.yok.mtl.com> Hi Hal, This is the 5th take, incorporating a fix for Sasha's last reported bug: a bad assignment of used_blocks. This code was run again through my verification flow, and Sasha has run some tests as well.
Eitan Signed-off-by: Eitan Zahavi Index: include/opensm/osm_port.h =================================================================== --- include/opensm/osm_port.h (revision 8113) +++ include/opensm/osm_port.h (working copy) @@ -591,6 +591,39 @@ osm_physp_get_pkey_tbl( IN const osm_phy * Port, Physical Port *********/ +/****f* OpenSM: Physical Port/osm_physp_get_mod_pkey_tbl +* NAME +* osm_physp_get_mod_pkey_tbl +* +* DESCRIPTION +* Returns a NON CONST pointer to the P_Key table object of the Physical Port object. +* +* SYNOPSIS +*/ +static inline osm_pkey_tbl_t * +osm_physp_get_mod_pkey_tbl( IN osm_physp_t* const p_physp ) +{ + CL_ASSERT( osm_physp_is_valid( p_physp ) ); + /* + (14.2.5.7) - the block number valid values are 0-2047, and are further + limited by the size of the P_Key table specified by the PartitionCap on the node. + */ + return( &p_physp->pkeys ); +}; +/* +* PARAMETERS +* p_physp +* [in] Pointer to an osm_physp_t object. +* +* RETURN VALUES +* The pointer to the P_Key table object. 
+* +* NOTES +* +* SEE ALSO +* Port, Physical Port +*********/ + /****f* OpenSM: Physical Port/osm_physp_set_slvl_tbl * NAME * osm_physp_set_slvl_tbl Index: include/opensm/osm_pkey.h =================================================================== --- include/opensm/osm_pkey.h (revision 8113) +++ include/opensm/osm_pkey.h (working copy) @@ -92,6 +92,9 @@ typedef struct _osm_pkey_tbl cl_ptr_vector_t blocks; cl_ptr_vector_t new_blocks; cl_map_t keys; + cl_qlist_t pending; + uint16_t used_blocks; + uint16_t max_blocks; } osm_pkey_tbl_t; /* * FIELDS @@ -104,6 +107,18 @@ typedef struct _osm_pkey_tbl * keys * A set holding all keys * +* pending +* A list osm_pending_pkey structs that is temporarily set by the +* pkey mgr and used during pkey mgr algorithm only +* +* used_blocks +* Tracks the number of blocks having non-zero pkeys +* +* max_blocks +* The maximal number of blocks this partition table might hold +* this value is based on node_info (for port 0 or CA) or switch_info +* updated on receiving the node_info or switch_info GetResp +* * NOTES * 'blocks' vector should be used to store pkey values obtained from * the port and SM pkey manager should not change it directly, for this @@ -114,6 +129,39 @@ typedef struct _osm_pkey_tbl * *********/ +/****s* OpenSM: osm_pending_pkey_t +* NAME +* osm_pending_pkey_t +* +* DESCRIPTION +* This objects stores temporary information on pkeys their target block and index +* during the pkey manager operation +* +* SYNOPSIS +*/ +typedef struct _osm_pending_pkey { + cl_list_item_t list_item; + uint16_t pkey; + uint32_t block; + uint8_t index; + boolean_t is_new; +} osm_pending_pkey_t; +/* +* FIELDS +* pkey +* The actual P_Key +* +* block +* The block index based on the previous table extracted from the device +* +* index +* The index of the pky within the block +* +* is_new +* TRUE for new P_Keys such that the block and index are invalid in that case +* +*********/ + /****f* OpenSM: osm_pkey_tbl_construct * NAME * 
osm_pkey_tbl_construct @@ -142,7 +190,8 @@ void osm_pkey_tbl_construct( * * SYNOPSIS */ -int osm_pkey_tbl_init( +ib_api_status_t +osm_pkey_tbl_init( IN osm_pkey_tbl_t *p_pkey_tbl); /* * p_pkey_tbl @@ -209,8 +258,8 @@ osm_pkey_tbl_get_num_blocks( static inline ib_pkey_table_t *osm_pkey_tbl_block_get( const osm_pkey_tbl_t *p_pkey_tbl, uint16_t block) { - CL_ASSERT(block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)); - return(cl_ptr_vector_get(&p_pkey_tbl->blocks, block)); + return( (block < cl_ptr_vector_get_size(&p_pkey_tbl->blocks)) ? + cl_ptr_vector_get(&p_pkey_tbl->blocks, block) : NULL); }; /* * p_pkey_tbl @@ -244,16 +293,117 @@ static inline ib_pkey_table_t *osm_pkey_ /* *********/ -/****f* OpenSM: osm_pkey_tbl_sync_new_blocks + +/****f* OpenSM: osm_pkey_tbl_make_block_pair +* NAME +* osm_pkey_tbl_make_block_pair +* +* DESCRIPTION +* Find or create a pair of "old" and "new" blocks for the +* given block index +* +* SYNOPSIS +*/ +ib_api_status_t +osm_pkey_tbl_make_block_pair( + osm_pkey_tbl_t *p_pkey_tbl, + uint16_t block_idx, + ib_pkey_table_t **pp_old_block, + ib_pkey_table_t **pp_new_block); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* block_idx +* [in] The block index to use +* +* pp_old_block +* [out] Pointer to the old block pointer arg +* +* pp_new_block +* [out] Pointer to the new block pointer arg +* +* RETURN VALUES +* IB_SUCCESS if OK IB_ERROR if failed +* +*********/ + +/****f* OpenSM: osm_pkey_tbl_set_new_entry * NAME -* osm_pkey_tbl_sync_new_blocks +* osm_pkey_tbl_set_new_entry * * DESCRIPTION -* Syncs new_blocks vector content with current pkey table blocks +* stores the given pkey in the "new" blocks array and update +* the "map" to show that on the "old" blocks * * SYNOPSIS */ -void osm_pkey_tbl_sync_new_blocks( +ib_api_status_t +osm_pkey_tbl_set_new_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t block_idx, + IN uint8_t pkey_idx, + IN uint16_t pkey); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* block_idx +* [in] 
The block index to use +* +* pkey_idx +* [in] The index within the block +* +* pkey +* [in] PKey to store +* +* RETURN VALUES +* IB_SUCCESS if OK IB_ERROR if failed +* +*********/ + +/****f* OpenSM: osm_pkey_find_next_free_entry +* NAME +* osm_pkey_find_next_free_entry +* +* DESCRIPTION +* Find the next free entry in the PKey table. Starting at the given +* index and block number. The user should increment pkey_idx before +* next call +* Inspect the "new" blocks array for empty space. +* +* SYNOPSIS +*/ +boolean_t +osm_pkey_find_next_free_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + OUT uint16_t *p_block_idx, + OUT uint8_t *p_pkey_idx); +/* +* p_pkey_tbl +* [in] Pointer to the PKey table +* +* p_block_idx +* [out] The block index to use +* +* p_pkey_idx +* [out] The index within the block to use +* +* RETURN VALUES +* TRUE if found FALSE if did not find +* +*********/ + +/****f* OpenSM: osm_pkey_tbl_init_new_blocks +* NAME +* osm_pkey_tbl_init_new_blocks +* +* DESCRIPTION +* Initializes new_blocks vector content (clear and allocate) +* +* SYNOPSIS +*/ +void osm_pkey_tbl_init_new_blocks( const osm_pkey_tbl_t *p_pkey_tbl); /* * p_pkey_tbl @@ -263,6 +413,41 @@ void osm_pkey_tbl_sync_new_blocks( * *********/ +/****f* OpenSM: osm_pkey_tbl_get_block_and_idx +* NAME +* osm_pkey_tbl_get_block_and_idx +* +* DESCRIPTION +* set the block index and pkey index the given +* pkey is found in. return IB_NOT_FOUND if cound not find +* it, IB_SUCCESS if OK +* +* SYNOPSIS +*/ +ib_api_status_t +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *block_idx, + OUT uint8_t *pkey_index); +/* +* p_pkey_tbl +* [in] Pointer to osm_pkey_tbl_t object. 
+* +* p_pkey +* [in] Pointer to the P_Key entry searched +* +* p_block_idx +* [out] Pointer to the block index to be updated +* +* p_pkey_idx +* [out] Pointer to the pkey index (in the block) to be updated +* +* +* NOTES +* +*********/ + /****f* OpenSM: osm_pkey_tbl_set * NAME * osm_pkey_tbl_set @@ -272,7 +457,8 @@ void osm_pkey_tbl_sync_new_blocks( * * SYNOPSIS */ -int osm_pkey_tbl_set( +ib_api_status_t +osm_pkey_tbl_set( IN osm_pkey_tbl_t *p_pkey_tbl, IN uint16_t block, IN ib_pkey_table_t *p_tbl); Index: opensm/osm_pkey.c =================================================================== --- opensm/osm_pkey.c (revision 8113) +++ opensm/osm_pkey.c (working copy) @@ -94,18 +94,22 @@ void osm_pkey_tbl_destroy( /********************************************************************** **********************************************************************/ -int osm_pkey_tbl_init( +ib_api_status_t +osm_pkey_tbl_init( IN osm_pkey_tbl_t *p_pkey_tbl) { cl_ptr_vector_init( &p_pkey_tbl->blocks, 0, 1); cl_ptr_vector_init( &p_pkey_tbl->new_blocks, 0, 1); cl_map_init( &p_pkey_tbl->keys, 1 ); + cl_qlist_init( &p_pkey_tbl->pending ); + p_pkey_tbl->used_blocks = 0; + p_pkey_tbl->max_blocks = 0; return(IB_SUCCESS); } /********************************************************************** **********************************************************************/ -void osm_pkey_tbl_sync_new_blocks( +void osm_pkey_tbl_init_new_blocks( IN const osm_pkey_tbl_t *p_pkey_tbl) { ib_pkey_table_t *p_block, *p_new_block; @@ -123,16 +127,31 @@ void osm_pkey_tbl_sync_new_blocks( p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); if (!p_new_block) break; + cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, + b, p_new_block); + } + memset(p_new_block, 0, sizeof(*p_new_block)); - cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, p_new_block); } - memcpy(p_new_block, p_block, sizeof(*p_new_block)); +} + 
+/********************************************************************** + **********************************************************************/ +void osm_pkey_tbl_cleanup_pending( + IN osm_pkey_tbl_t *p_pkey_tbl) +{ + cl_list_item_t *p_item; + p_item = cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_item != cl_qlist_end( &p_pkey_tbl->pending ) ) + { + free( (osm_pending_pkey_t *)p_item ); } } /********************************************************************** **********************************************************************/ -int osm_pkey_tbl_set( +ib_api_status_t +osm_pkey_tbl_set( IN osm_pkey_tbl_t *p_pkey_tbl, IN uint16_t block, IN ib_pkey_table_t *p_tbl) @@ -203,7 +222,138 @@ int osm_pkey_tbl_set( /********************************************************************** **********************************************************************/ -static boolean_t __osm_match_pkey ( +ib_api_status_t +osm_pkey_tbl_make_block_pair( + osm_pkey_tbl_t *p_pkey_tbl, + uint16_t block_idx, + ib_pkey_table_t **pp_old_block, + ib_pkey_table_t **pp_new_block) +{ + if (block_idx >= p_pkey_tbl->max_blocks) return(IB_ERROR); + + if (pp_old_block) + { + *pp_old_block = osm_pkey_tbl_block_get( p_pkey_tbl, block_idx ); + if (! *pp_old_block) + { + *pp_old_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); + if (!*pp_old_block) return(IB_ERROR); + memset(*pp_old_block, 0, sizeof(ib_pkey_table_t)); + cl_ptr_vector_set(&p_pkey_tbl->blocks, block_idx, *pp_old_block); + } + } + + if (pp_new_block) + { + *pp_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_idx ); + if (! 
*pp_new_block) + { + *pp_new_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); + if (!*pp_new_block) return(IB_ERROR); + memset(*pp_new_block, 0, sizeof(ib_pkey_table_t)); + cl_ptr_vector_set(&p_pkey_tbl->new_blocks, block_idx, *pp_new_block); + } + } + return( IB_SUCCESS ); +} + +/********************************************************************** + **********************************************************************/ +/* + store the given pkey in the "new" blocks array + also makes sure the regular block exists. +*/ +ib_api_status_t +osm_pkey_tbl_set_new_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t block_idx, + IN uint8_t pkey_idx, + IN uint16_t pkey) +{ + ib_pkey_table_t *p_old_block; + ib_pkey_table_t *p_new_block; + + if (osm_pkey_tbl_make_block_pair( + p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) + return( IB_ERROR ); + + p_new_block->pkey_entry[pkey_idx] = pkey; + if (p_pkey_tbl->used_blocks <= block_idx) + p_pkey_tbl->used_blocks = block_idx + 1; + + return( IB_SUCCESS ); +} + +/********************************************************************** + **********************************************************************/ +boolean_t +osm_pkey_find_next_free_entry( + IN osm_pkey_tbl_t *p_pkey_tbl, + OUT uint16_t *p_block_idx, + OUT uint8_t *p_pkey_idx) +{ + ib_pkey_table_t *p_new_block; + + CL_ASSERT(p_block_idx); + CL_ASSERT(p_pkey_idx); + + while ( *p_block_idx < p_pkey_tbl->max_blocks) + { + if (*p_pkey_idx > IB_NUM_PKEY_ELEMENTS_IN_BLOCK - 1) + { + *p_pkey_idx = 0; + (*p_block_idx)++; + if (*p_block_idx >= p_pkey_tbl->max_blocks) + return FALSE; + } + + p_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, *p_block_idx); + + if ( !p_new_block || + ib_pkey_is_invalid(p_new_block->pkey_entry[*p_pkey_idx])) + return TRUE; + else + (*p_pkey_idx)++; + } + return FALSE; +} + +/********************************************************************** + **********************************************************************/ 
+ib_api_status_t +osm_pkey_tbl_get_block_and_idx( + IN osm_pkey_tbl_t *p_pkey_tbl, + IN uint16_t *p_pkey, + OUT uint32_t *p_block_idx, + OUT uint8_t *p_pkey_index) +{ + uint32_t num_of_blocks; + uint32_t block_index; + ib_pkey_table_t *block; + + CL_ASSERT( p_pkey_tbl ); + CL_ASSERT( p_block_idx != NULL ); + CL_ASSERT( p_pkey_idx != NULL ); + + num_of_blocks = cl_ptr_vector_get_size( &p_pkey_tbl->blocks); + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + { + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + if ( ( block->pkey_entry <= p_pkey ) && + ( p_pkey < block->pkey_entry + IB_NUM_PKEY_ELEMENTS_IN_BLOCK)) + { + *p_block_idx = block_index; + *p_pkey_index = p_pkey - block->pkey_entry; + return( IB_SUCCESS ); + } + } + return( IB_NOT_FOUND ); +} + +/********************************************************************** + **********************************************************************/ +static boolean_t +__osm_match_pkey ( IN const ib_net16_t *pkey1, IN const ib_net16_t *pkey2 ) { @@ -306,7 +456,8 @@ osm_physp_share_pkey( if (cl_is_map_empty(&pkey_tbl1->keys) || cl_is_map_empty(&pkey_tbl2->keys)) return TRUE; - return !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); + return + !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); } /********************************************************************** @@ -322,7 +473,8 @@ osm_port_share_pkey( OSM_LOG_ENTER( p_log, osm_port_share_pkey ); - if (!p_port_1 || !p_port_2) { + if (!p_port_1 || !p_port_2) + { ret = FALSE; goto Exit; } @@ -330,7 +482,8 @@ osm_port_share_pkey( p_physp1 = osm_port_get_default_phys_ptr(p_port_1); p_physp2 = osm_port_get_default_phys_ptr(p_port_2); - if (!p_physp1 || !p_physp2) { + if (!p_physp1 || !p_physp2) + { ret = FALSE; goto Exit; } Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 8113) +++ opensm/osm_pkey_mgr.c (working copy) 
@@ -62,6 +62,131 @@ /********************************************************************** **********************************************************************/ +/* + the max number of pkey blocks for a physical port is located in + different place for switch external ports (SwitchInfo) and the + rest of the ports (NodeInfo) +*/ +static int +pkey_mgr_get_physp_max_blocks( + IN const osm_subn_t *p_subn, + IN const osm_physp_t *p_physp) +{ + osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); + osm_switch_t *p_sw; + uint16_t num_pkeys = 0; + + if ( (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) || + (osm_physp_get_port_num( p_physp ) == 0)) + num_pkeys = cl_ntoh16( p_node->node_info.partition_cap ); + else + { + p_sw = osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); + if (p_sw) + num_pkeys = cl_ntoh16( p_sw->switch_info.enforce_cap ); + } + return( (num_pkeys + 31) / 32 ); +} + +/********************************************************************** + **********************************************************************/ +/* + * Insert the new pending pkey entry to the specific port pkey table + * pending pkeys. new entries are inserted at the back. + */ +static void +pkey_mgr_process_physical_port( + IN osm_log_t *p_log, + IN const osm_req_t *p_req, + IN const ib_net16_t pkey, + IN osm_physp_t *p_physp ) +{ + osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); + osm_pkey_tbl_t *p_pkey_tbl; + ib_net16_t *p_orig_pkey; + char *stat = NULL; + osm_pending_pkey_t *p_pending; + + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); + p_pending = (osm_pending_pkey_t *)malloc(sizeof(osm_pending_pkey_t)); + if (! 
p_pending) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0502: " + "Fail to allocate new pending pkey entry for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + p_pending->pkey = pkey; + p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + if ( !p_orig_pkey ) + { + p_pending->is_new = TRUE; + cl_qlist_insert_tail(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "inserted"; + } + else + { + CL_ASSERT( ib_pkey_get_base(*p_orig_pkey) == ib_pkey_get_base(pkey) ); + p_pending->is_new = FALSE; + if (osm_pkey_tbl_get_block_and_idx( + p_pkey_tbl, p_orig_pkey, + &p_pending->block, &p_pending->index) != IB_SUCCESS) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0503: " + "Fail to obtain P_Key 0x%04x block and index for node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + return; + } + cl_qlist_insert_head(&p_pkey_tbl->pending, (cl_list_item_t*)p_pending); + stat = "updated"; + } + + osm_log( p_log, OSM_LOG_DEBUG, + "pkey_mgr_process_physical_port: " + "pkey 0x%04x was %s for node 0x%016" PRIx64 + " port %u\n", + cl_ntoh16( pkey ), stat, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); +} + +/********************************************************************** + **********************************************************************/ +static void +pkey_mgr_process_partition_table( + osm_log_t *p_log, + const osm_req_t *p_req, + const osm_prtn_t *p_prtn, + const boolean_t full ) +{ + const cl_map_t *p_tbl = + full ? 
&p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; + cl_map_iterator_t i, i_next; + ib_net16_t pkey = p_prtn->pkey; + osm_physp_t *p_physp; + + if ( full ) + pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); + + i_next = cl_map_head( p_tbl ); + while ( i_next != cl_map_end( p_tbl ) ) + { + i = i_next; + i_next = cl_map_next( i ); + p_physp = cl_map_obj( i ); + if ( p_physp && osm_physp_is_valid( p_physp ) ) + pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); + } +} + +/********************************************************************** + **********************************************************************/ static ib_api_status_t pkey_mgr_update_pkey_entry( IN const osm_req_t *p_req, @@ -114,7 +239,8 @@ pkey_mgr_enforce_partition( p_pi->state_info2 = 0; ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); - context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); + context.pi_context.node_guid = + osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); context.pi_context.set_method = TRUE; context.pi_context.update_master_sm_base_lid = FALSE; @@ -131,80 +257,132 @@ pkey_mgr_enforce_partition( /********************************************************************** **********************************************************************/ -/* - * Prepare a new entry for the pkey table for this port when this pkey - * does not exist. Update existed entry when membership was changed. 
- */ -static void pkey_mgr_process_physical_port( - IN osm_log_t *p_log, - IN const osm_req_t *p_req, - IN const ib_net16_t pkey, - IN osm_physp_t *p_physp ) +static boolean_t pkey_mgr_update_port( + osm_log_t *p_log, + osm_req_t *p_req, + const osm_port_t * const p_port ) { - osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); - ib_pkey_table_t *block; + osm_physp_t *p_physp; + osm_node_t *p_node; + ib_pkey_table_t *block, *new_block; + osm_pkey_tbl_t *p_pkey_tbl; uint16_t block_index; + uint8_t pkey_index; + uint16_t last_free_block_index = 0; + uint8_t last_free_pkey_index = 0; uint16_t num_of_blocks; - const osm_pkey_tbl_t *p_pkey_tbl; - ib_net16_t *p_orig_pkey; - char *stat = NULL; - uint32_t i; + uint16_t max_num_of_blocks; - p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + ib_api_status_t status; + boolean_t ret_val = FALSE; + osm_pending_pkey_t *p_pending; + boolean_t found; - p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) + return FALSE; - if ( !p_orig_pkey ) - { - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + p_node = osm_physp_get_node_ptr( p_physp ); + p_pkey_tbl = osm_physp_get_mod_pkey_tbl( p_physp ); + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + max_num_of_blocks = pkey_mgr_get_physp_max_blocks( p_req->p_subn, p_physp ); + if ( p_pkey_tbl->max_blocks > max_num_of_blocks ) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) + osm_log( p_log, OSM_LOG_INFO, + "pkey_mgr_update_port: " + "Max number of blocks reduced from %u to %u " + "for node 0x%016" PRIx64 " port %u\n", + p_pkey_tbl->max_blocks, max_num_of_blocks, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + } + p_pkey_tbl->max_blocks = max_num_of_blocks; + + 
osm_pkey_tbl_init_new_blocks( p_pkey_tbl ); + p_pkey_tbl->used_blocks = 0; + + /* + process every pending pkey in order - + first must be "updated" last are "new" + */ + p_pending = + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + while (p_pending != + (osm_pending_pkey_t *)cl_qlist_end( &p_pkey_tbl->pending ) ) + { + if (p_pending->is_new == FALSE) + { + block_index = p_pending->block; + pkey_index = p_pending->index; + found = TRUE; + } + else { - if ( ib_pkey_is_invalid( block->pkey_entry[i] ) ) + found = osm_pkey_find_next_free_entry(p_pkey_tbl, + &last_free_block_index, + &last_free_pkey_index); + if ( !found ) { - block->pkey_entry[i] = pkey; - stat = "inserted"; - goto _done; + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0504: " + "failed to find empty space for new pkey 0x%04x " + "of node 0x%016" PRIx64 " port %u\n", + cl_ntoh16(p_pending->pkey), + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); } + else + { + block_index = last_free_block_index; + pkey_index = last_free_pkey_index++; } } + + if (found) + { + if ( IB_SUCCESS != osm_pkey_tbl_set_new_entry( + p_pkey_tbl, block_index, pkey_index, p_pending->pkey) ) + { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_process_physical_port: ERR 0501: " - "No empty pkey entry was found to insert 0x%04x for node " - "0x%016" PRIx64 " port %u\n", - cl_ntoh16( pkey ), + "pkey_mgr_update_port: ERR 0505: " + "failed to set PKey 0x%04x in block %u idx %u " + "of node 0x%016" PRIx64 " port %u\n", + p_pending->pkey, block_index, pkey_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } - else if ( *p_orig_pkey != pkey ) - { + } + + free( p_pending ); + p_pending = + (osm_pending_pkey_t *)cl_qlist_remove_head( &p_pkey_tbl->pending ); + } + + /* now look for changes and store */ for ( block_index = 0; block_index < num_of_blocks; block_index++ ) { - /* we need real block (not just new_block) in order - * to 
resolve block/pkey indices */ block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - i = p_orig_pkey - block->pkey_entry; - if (i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK) { - block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - block->pkey_entry[i] = pkey; - stat = "updated"; - goto _done; - } - } - } + new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - _done: - if (stat) { - osm_log( p_log, OSM_LOG_VERBOSE, - "pkey_mgr_process_physical_port: " - "pkey 0x%04x was %s for node 0x%016" PRIx64 - " port %u\n", - cl_ntoh16( pkey ), stat, + if (block && + (!new_block || !memcmp( new_block, block, sizeof( *block ) )) ) + continue; + + status = pkey_mgr_update_pkey_entry( + p_req, p_physp , new_block, block_index ); + if (status == IB_SUCCESS) + ret_val = TRUE; + else + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0506: " + "pkey_mgr_update_pkey_entry() failed to update " + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", + block_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } + + return ret_val; } /********************************************************************** @@ -217,21 +395,23 @@ pkey_mgr_update_peer_port( const osm_port_t * const p_port, boolean_t enforce ) { - osm_physp_t *p, *peer; + osm_physp_t *p_physp, *peer; osm_node_t *p_node; ib_pkey_table_t *block, *peer_block; - const osm_pkey_tbl_t *p_pkey_tbl, *p_peer_pkey_tbl; + const osm_pkey_tbl_t *p_pkey_tbl; + osm_pkey_tbl_t *p_peer_pkey_tbl; osm_switch_t *p_sw; ib_switch_info_t *p_si; uint16_t block_index; uint16_t num_of_blocks; + uint16_t peer_max_blocks; ib_api_status_t status = IB_SUCCESS; boolean_t ret_val = FALSE; - p = osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) + p_physp = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p_physp ) ) return FALSE; - peer = osm_physp_get_remote( p ); + peer = osm_physp_get_remote( p_physp ); if ( !peer || !osm_physp_is_valid( 
peer ) ) return FALSE; p_node = osm_physp_get_node_ptr( peer ); @@ -242,10 +422,26 @@ pkey_mgr_update_peer_port( if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || !p_si->enforce_cap) return FALSE; + p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); + p_peer_pkey_tbl = osm_physp_get_mod_pkey_tbl( peer ); + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + peer_max_blocks = pkey_mgr_get_physp_max_blocks( p_subn, peer ); + if (peer_max_blocks < p_pkey_tbl->used_blocks) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_peer_port: ERR 0508: " + "not enough entries (%u < %u) on switch 0x%016" PRIx64 + " port %u. Clearing Enforcement bit.\n", + peer_max_blocks, num_of_blocks, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( peer ) ); + enforce = FALSE; + } + if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) { osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0502: " + "pkey_mgr_update_peer_port: ERR 0507: " "pkey_mgr_enforce_partition() failed to update " "node 0x%016" PRIx64 " port %u\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), @@ -255,24 +451,19 @@ pkey_mgr_update_peer_port( if (enforce == FALSE) return FALSE; - p_pkey_tbl = osm_physp_get_pkey_tbl( p ); - p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - if ( num_of_blocks > osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ) ) - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_peer_pkey_tbl ); - - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; + for ( block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) { block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); - if ( memcmp( peer_block, block, sizeof( *peer_block ) ) ) + if ( !peer_block || memcmp( peer_block, block, sizeof( *peer_block ) ) ) { status = 
pkey_mgr_update_pkey_entry( p_req, peer, block, block_index ); if ( status == IB_SUCCESS ) ret_val = TRUE; else osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0503: " + "pkey_mgr_update_peer_port: ERR 0509: " "pkey_mgr_update_pkey_entry() failed to update " "pkey table block %d for node 0x%016" PRIx64 " port %u\n", @@ -282,10 +473,10 @@ pkey_mgr_update_peer_port( } } - if ( ret_val == TRUE && - osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) + if ( (ret_val == TRUE) && + osm_log_is_active( p_log, OSM_LOG_DEBUG ) ) { - osm_log( p_log, OSM_LOG_VERBOSE, + osm_log( p_log, OSM_LOG_DEBUG, "pkey_mgr_update_peer_port: " "pkey table was updated for node 0x%016" PRIx64 " port %u\n", @@ -298,82 +489,6 @@ pkey_mgr_update_peer_port( /********************************************************************** **********************************************************************/ -static boolean_t pkey_mgr_update_port( - osm_log_t *p_log, - osm_req_t *p_req, - const osm_port_t * const p_port ) -{ - osm_physp_t *p; - osm_node_t *p_node; - ib_pkey_table_t *block, *new_block; - const osm_pkey_tbl_t *p_pkey_tbl; - uint16_t block_index; - uint16_t num_of_blocks; - ib_api_status_t status; - boolean_t ret_val = FALSE; - - p = osm_port_get_default_phys_ptr( p_port ); - if ( !osm_physp_is_valid( p ) ) - return FALSE; - - p_pkey_tbl = osm_physp_get_pkey_tbl(p); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) - { - block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - - if (!new_block || !memcmp( new_block, block, sizeof( *block ) ) ) - continue; - - status = pkey_mgr_update_pkey_entry( p_req, p, new_block, block_index ); - if (status == IB_SUCCESS) - ret_val = TRUE; - else - osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_port: ERR 0504: " - "pkey_mgr_update_pkey_entry() failed to update " - "pkey table block %d for 
node 0x%016" PRIx64 " port %u\n", - block_index, - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( p ) ); - } - - return ret_val; -} - -/********************************************************************** - **********************************************************************/ -static void -pkey_mgr_process_partition_table( - osm_log_t *p_log, - const osm_req_t *p_req, - const osm_prtn_t *p_prtn, - const boolean_t full ) -{ - const cl_map_t *p_tbl = full ? - &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; - cl_map_iterator_t i, i_next; - ib_net16_t pkey = p_prtn->pkey; - osm_physp_t *p_physp; - - if ( full ) - pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); - - i_next = cl_map_head( p_tbl ); - while ( i_next != cl_map_end( p_tbl ) ) - { - i = i_next; - i_next = cl_map_next( i ); - p_physp = cl_map_obj( i ); - if ( p_physp && osm_physp_is_valid( p_physp ) ) - pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); - } -} - -/********************************************************************** - **********************************************************************/ osm_signal_t osm_pkey_mgr_process( IN osm_opensm_t *p_osm ) @@ -383,8 +498,7 @@ osm_pkey_mgr_process( osm_prtn_t *p_prtn; osm_port_t *p_port; osm_signal_t signal = OSM_SIGNAL_DONE; - osm_physp_t *p_physp; - + osm_node_t *p_node; CL_ASSERT( p_osm ); OSM_LOG_ENTER( &p_osm->log, osm_pkey_mgr_process ); @@ -394,32 +508,25 @@ osm_pkey_mgr_process( if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) { osm_log( &p_osm->log, OSM_LOG_ERROR, - "osm_pkey_mgr_process: ERR 0505: " + "osm_pkey_mgr_process: ERR 0510: " "osm_prtn_make_partitions() failed\n" ); goto _err; } - p_tbl = &p_osm->subn.port_guid_tbl; - p_next = cl_qmap_head( p_tbl ); - while ( p_next != cl_qmap_end( p_tbl ) ) - { - p_port = ( osm_port_t * ) p_next; - p_next = cl_qmap_next( p_next ); - p_physp = osm_port_get_default_phys_ptr( p_port ); - if ( osm_physp_is_valid( p_physp ) ) - 
osm_pkey_tbl_sync_new_blocks( osm_physp_get_pkey_tbl( p_physp ) ); - } - + /* populate the pending pkey entries by scanning all partitions */ p_tbl = &p_osm->subn.prtn_pkey_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) { p_prtn = ( osm_prtn_t * ) p_next; p_next = cl_qmap_next( p_next ); - pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); - pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); + pkey_mgr_process_partition_table( + &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); + pkey_mgr_process_partition_table( + &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); } + /* calculate new pkey tables and set */ p_tbl = &p_osm->subn.port_guid_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) @@ -428,8 +535,10 @@ osm_pkey_mgr_process( p_next = cl_qmap_next( p_next ); if ( pkey_mgr_update_port( &p_osm->log, &p_osm->sm.req, p_port ) ) signal = OSM_SIGNAL_DONE_PENDING; - if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != IB_NODE_TYPE_SWITCH && - pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, + p_node = osm_port_get_parent_node( p_port ); + if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && + pkey_mgr_update_peer_port( + &p_osm->log, &p_osm->sm.req, &p_osm->subn, p_port, !p_osm->subn.opt.no_partition_enforcement ) ) signal = OSM_SIGNAL_DONE_PENDING; From eitan at mellanox.co.il Mon Jun 19 12:12:07 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 19 Jun 2006 22:12:07 +0300 Subject: [openib-general] [PATCH] osm: fix segfault due to unprotected access to InformInfo DB Message-ID: <86bqspgdtk.fsf@mtl066.yok.mtl.com> Hi Hal I have added InformInfo requests to the osmStress simulator flow. Running it overnight exposed a bug as OpenSM segfaulted during osm_report_notice. Some debug shows the following two flows were missing a lock. 
Such that under stress the InformInfo DB was altered while being accessed by the code in osm_report_notice. I have verified the other flows calling osm_report_notice are under a lock. The fixed code is running for a while with no crash so far. Eitan Signed-off-by: Eitan Zahavi Index: opensm/osm_state_mgr.c =================================================================== --- opensm/osm_state_mgr.c (revision 8113) +++ opensm/osm_state_mgr.c (working copy) @@ -1709,6 +1709,7 @@ __osm_state_mgr_report_new_ports( OSM_LOG_ENTER( p_mgr->p_log, __osm_state_mgr_report_new_ports ); + CL_PLOCK_ACQUIRE( p_mgr->p_lock ); p_port = ( osm_port_t * ) ( cl_list_remove_head( &p_mgr->p_subn->new_ports_list ) ); @@ -1759,6 +1760,7 @@ __osm_state_mgr_report_new_ports( ( osm_port_t * ) ( cl_list_remove_head( &p_mgr->p_subn->new_ports_list ) ); } + CL_PLOCK_RELEASE( p_mgr->p_lock ); OSM_LOG_EXIT( p_mgr->p_log ); } Index: opensm/osm_trap_rcv.c =================================================================== --- opensm/osm_trap_rcv.c (revision 8113) +++ opensm/osm_trap_rcv.c (working copy) @@ -652,7 +652,10 @@ __osm_trap_rcv_process_request( p_ntci->issuer_gid.unicast.interface_id = p_port->guid; } + /* we need a lock here as the InformInfo DB must be stable */ + CL_PLOCK_ACQUIRE( p_rcv->p_lock ); status = osm_report_notice(p_rcv->p_log, p_rcv->p_subn, p_ntci); + CL_PLOCK_RELEASE( p_rcv->p_lock ); if( status != IB_SUCCESS ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, From eitan at mellanox.co.il Mon Jun 19 12:24:41 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 19 Jun 2006 22:24:41 +0300 Subject: [openib-general] A few questions about IBMgtSim In-Reply-To: <44968BEF.9030401@simula.no> References: <44968BEF.9030401@simula.no> Message-ID: <4496F9F9.90101@mellanox.co.il> Hi Sven, Please see my response below: Eitan Sven-Arne Reinemo wrote: > Hi, > > After some testing of IBMgtSim I have a few questions: > > 1) If I try to build topologies using the MTS14400.ibnl as a building > 
> block my simulation fails with a "child process exited abnormally"
> message. I guess this is related to ibdmchk since the ibdmchk log
> contains lots of errors like the following:
>
> -I- Tracing all CA to CA paths for Credit Loops potential ...
> -E- Potential Credit Loop on Path from:H-1/U1/1 to:H-11/U1/1
> Going:Down from:node:0002c9000000007d to:node:0002c9000000006a
> Going:Up from:node:0002c9000000006a to:node:0002c90000000076

This error indicates exactly what it says: the resulting routing has a potential credit loop, as it does not follow an up/down routing scheme. Credit loops can really be generated by the OpenSM on some topologies and can be avoided by adding the -R updn flag, and possibly also --add_guid_file if the SM is not able to recognize the root nodes automatically (e.g. if the topology is highly asymmetric).

> -I- Generating non blocking full link coverage plan
> into:/tmp/ibdmchk.non_block_
> all_links
> -E- After 32 stages some switch ports are still not covered:
> -E- Fail to cover port:system:0002c90000000054/node:0002c90000000054/P15

This means that there is no route that goes through that port, i.e. if you trace the paths from every HCA to every other HCA, you never go through that port.

> I have included two topology files. One that works and one that fails,
> the only difference is that the number of hosts are increased from 18 to
> 20. Also, if I create my own simple ibnl file for a switch with 144 (or
> other sizes) ports I am able to run simulations. Any suggestions to what
> the problem might be?

As described above, the reason is the credit loop potential of the specific topology and routing algorithm used. Please try the -R updn and --add_guid_file options. You can scan the ibmgtsim.guids.txt file to learn the GUIDs assigned to the spine switches.

> 2) The included example ibmgtsim/tests/RhinoBased10K.topo never finishes
> (at least not in 24 hours). Does this work for anyone else? All other
> examples work fine.

I was able to simulate it by: 1.
Decreasing the verbosity, and 2. running the simulator on one machine and the OpenSM on another.

> 3) If I would like to use IBMgtSim with my own (simplified) SM would it
> be straightforward? It looks to me like RunSimTest talks to any SM
> given the correct path, node and port number for location of the SM.

You can use libibmscli.so/.a to integrate your SM with ibmgtsim. This library's API is provided in ibms_client_api.h. It mainly enables connecting to the ibmgtsim server TCP/IP port, declaring the port the SM is attached to, registering to receive some MAD classes/attributes, and sending and receiving MADs.

From halr at voltaire.com  Mon Jun 19 13:22:40 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Jun 2006 16:22:40 -0400
Subject: [openib-general] [PATCH TRIVIAL] opensm: libibmad: fix umad retry counter
In-Reply-To: <20060619183046.GF5521@sashak.voltaire.com>
References: <20060619183046.GF5521@sashak.voltaire.com>
Message-ID: <1150748559.4391.78544.camel@hal.voltaire.com>

Hi Sasha,

On Mon, 2006-06-19 at 14:30, Sasha Khapyorsky wrote:
> Hi Hal,
>
> This fixes umad send/recv retry counter in error report.
>
> Signed-off-by: Sasha Khapyorsky
> ---

Thanks. Applied.

-- Hal

From halr at voltaire.com  Mon Jun 19 13:39:09 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Jun 2006 16:39:09 -0400
Subject: [openib-general] [PATCH] osm: fix segfault due to unprotected access to InformInfo DB
In-Reply-To: <86bqspgdtk.fsf@mtl066.yok.mtl.com>
References: <86bqspgdtk.fsf@mtl066.yok.mtl.com>
Message-ID: <1150749203.4391.78989.camel@hal.voltaire.com>

Hi Eitan,

On Mon, 2006-06-19 at 15:12, Eitan Zahavi wrote:
> Hi Hal
>
> I have added InformInfo requests to the osmStress simulator flow.
> Running it overnight exposed a bug as OpenSM segfaulted during
> osm_report_notice. Some debug shows the following two flows were
> missing a lock. Such that under stress the InformInfo DB was altered
> while being accessed by the code in osm_report_notice.
> > I have verified the other flows calling osm_report_notice are under a > lock. > > The fixed code is running for a while with no crash so far. > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied. -- Hal From ralphc at pathscale.com Mon Jun 19 16:37:30 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Mon, 19 Jun 2006 16:37:30 -0700 Subject: [openib-general] [PATCH 1/4] ipath mmaped CQs, QPs, SRQs Message-ID: <1150760250.32252.158.camel@brick.pathscale.com> Here is a set of patches which adds mmapped completion queues and receive queues for the InfiniPath HCA. This required changing some of the core code in order to return HW specific data for the ibv_resize_cq(), ibv_modify_qp(), and ibv_modify_srq(). I have included the minimal changes to mthca and ehca to match the function signature changes and incorporated Roland's review comments on the earlier code posted. The first patch contains the core changes, the second contains mthca and ehca specific changes, the third contains libipathverbs changes, the fourth contains ib_ipath changes. 
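The "HW specific data" plumbing in this series works by extending the generic response structs with a trailing driver area, and by having each driver declare an extended response whose first member is the common struct, so one buffer serves both layers: the core fills the generic fields, the driver reads its private trailing ones. A self-contained toy sketch of that layout convention (base_resize_cq_resp, toy_resize_cq_resp, and core_resize_cq are illustrative stand-ins, not the real ibverbs types):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for ibv_resize_cq_resp: the generic fields the core knows. */
struct base_resize_cq_resp {
	uint32_t cqe;
	uint32_t reserved;
};

/* Stand-in for a driver's extended response (cf. ipath_resize_cq_resp):
 * the base struct must be the first member so the same pointer is valid
 * for both the core and the driver view of the buffer. */
struct toy_resize_cq_resp {
	struct base_resize_cq_resp base;
	uint64_t offset;	/* driver-private, e.g. an mmap offset */
};

/* "Core" layer: fills the generic fields, then copies whatever
 * driver-private payload fits into the remaining bytes of the buffer,
 * mirroring how the kernel writes past the common response header. */
static void core_resize_cq(struct base_resize_cq_resp *resp, size_t resp_size,
			   uint32_t new_cqe, const void *drv, size_t drv_size)
{
	resp->cqe = new_cqe;
	resp->reserved = 0;
	if (resp_size >= sizeof(*resp) + drv_size)
		memcpy((char *)resp + sizeof(*resp), drv, drv_size);
}
```

This is why ibv_cmd_resize_cq() grows a resp/resp_size pair in the patch: the caller (the driver) now owns the buffer and can size it for its extended struct instead of the core allocating only the base one.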
Signed-off-by: Ralph Campbell Index: src/userspace/libibverbs/include/infiniband/driver.h =================================================================== --- src/userspace/libibverbs/include/infiniband/driver.h (revision 8021) +++ src/userspace/libibverbs/include/infiniband/driver.h (working copy) @@ -95,7 +95,8 @@ int ibv_cmd_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); int ibv_cmd_req_notify_cq(struct ibv_cq *cq, int solicited_only); int ibv_cmd_resize_cq(struct ibv_cq *cq, int cqe, - struct ibv_resize_cq *cmd, size_t cmd_size); + struct ibv_resize_cq *cmd, size_t cmd_size, + struct ibv_resize_cq_resp *resp, size_t resp_size); int ibv_cmd_destroy_cq(struct ibv_cq *cq); int ibv_cmd_create_srq(struct ibv_pd *pd, Index: src/userspace/libibverbs/include/infiniband/kern-abi.h =================================================================== --- src/userspace/libibverbs/include/infiniband/kern-abi.h (revision 8021) +++ src/userspace/libibverbs/include/infiniband/kern-abi.h (working copy) @@ -355,6 +355,8 @@ struct ibv_resize_cq_resp { __u32 cqe; + __u32 reserved; + __u64 driver_data[0]; }; struct ibv_destroy_cq { Index: src/userspace/libibverbs/src/cmd.c =================================================================== --- src/userspace/libibverbs/src/cmd.c (revision 8021) +++ src/userspace/libibverbs/src/cmd.c (working copy) @@ -368,18 +368,18 @@ } int ibv_cmd_resize_cq(struct ibv_cq *cq, int cqe, - struct ibv_resize_cq *cmd, size_t cmd_size) + struct ibv_resize_cq *cmd, size_t cmd_size, + struct ibv_resize_cq_resp *resp, size_t resp_size) { - struct ibv_resize_cq_resp resp; - IBV_INIT_CMD_RESP(cmd, cmd_size, RESIZE_CQ, &resp, sizeof resp); + IBV_INIT_CMD_RESP(cmd, cmd_size, RESIZE_CQ, resp, resp_size); cmd->cq_handle = cq->handle; cmd->cqe = cqe; if (write(cq->context->cmd_fd, cmd, cmd_size) != cmd_size) return errno; - cq->cqe = resp.cqe; + cq->cqe = resp->cqe; return 0; } Index: src/linux-kernel/infiniband/include/rdma/ib_user_verbs.h 
=================================================================== --- src/linux-kernel/infiniband/include/rdma/ib_user_verbs.h (revision 8021) +++ src/linux-kernel/infiniband/include/rdma/ib_user_verbs.h (working copy) @@ -275,6 +275,8 @@ struct ib_uverbs_resize_cq_resp { __u32 cqe; + __u32 reserved; + __u64 driver_data[0]; }; struct ib_uverbs_poll_cq { Index: src/linux-kernel/infiniband/include/rdma/ib_verbs.h =================================================================== --- src/linux-kernel/infiniband/include/rdma/ib_verbs.h (revision 8021) +++ src/linux-kernel/infiniband/include/rdma/ib_verbs.h (working copy) @@ -911,7 +911,8 @@ struct ib_udata *udata); int (*modify_srq)(struct ib_srq *srq, struct ib_srq_attr *srq_attr, - enum ib_srq_attr_mask srq_attr_mask); + enum ib_srq_attr_mask srq_attr_mask, + struct ib_udata *udata); int (*query_srq)(struct ib_srq *srq, struct ib_srq_attr *srq_attr); int (*destroy_srq)(struct ib_srq *srq); @@ -923,7 +924,8 @@ struct ib_udata *udata); int (*modify_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, - int qp_attr_mask); + int qp_attr_mask, + struct ib_udata *udata); int (*query_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask, Index: src/linux-kernel/infiniband/core/verbs.c =================================================================== --- src/linux-kernel/infiniband/core/verbs.c (revision 8021) +++ src/linux-kernel/infiniband/core/verbs.c (working copy) @@ -231,7 +231,7 @@ struct ib_srq_attr *srq_attr, enum ib_srq_attr_mask srq_attr_mask) { - return srq->device->modify_srq(srq, srq_attr, srq_attr_mask); + return srq->device->modify_srq(srq, srq_attr, srq_attr_mask, NULL); } EXPORT_SYMBOL(ib_modify_srq); @@ -547,7 +547,7 @@ struct ib_qp_attr *qp_attr, int qp_attr_mask) { - return qp->device->modify_qp(qp, qp_attr, qp_attr_mask); + return qp->device->modify_qp(qp, qp_attr, qp_attr_mask, NULL); } EXPORT_SYMBOL(ib_modify_qp); Index: src/linux-kernel/infiniband/core/uverbs_cmd.c 
=================================================================== --- src/linux-kernel/infiniband/core/uverbs_cmd.c (revision 8021) +++ src/linux-kernel/infiniband/core/uverbs_cmd.c (working copy) @@ -1258,6 +1258,7 @@ int out_len) { struct ib_uverbs_modify_qp cmd; + struct ib_udata udata; struct ib_qp *qp; struct ib_qp_attr *attr; int ret; @@ -1265,6 +1266,9 @@ if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; + INIT_UDATA(&udata, buf + sizeof cmd, NULL, in_len - sizeof cmd, + out_len); + attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) return -ENOMEM; @@ -1321,7 +1325,7 @@ attr->alt_ah_attr.ah_flags = cmd.alt_dest.is_global ? IB_AH_GRH : 0; attr->alt_ah_attr.port_num = cmd.alt_dest.port_num; - ret = ib_modify_qp(qp, attr, cmd.attr_mask); + ret = qp->device->modify_qp(qp, attr, cmd.attr_mask, &udata); put_qp_read(qp); @@ -1773,6 +1777,7 @@ } ah->uobject = uobj; + uobj->object = ah; ret = idr_add_uobj(&ib_uverbs_ah_idr, uobj); if (ret) goto err_destroy; @@ -2031,6 +2036,7 @@ int out_len) { struct ib_uverbs_modify_srq cmd; + struct ib_udata udata; struct ib_srq *srq; struct ib_srq_attr attr; int ret; @@ -2038,6 +2044,9 @@ if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; + INIT_UDATA(&udata, buf + sizeof cmd, NULL, in_len - sizeof cmd, + out_len); + srq = idr_read_srq(cmd.srq_handle, file->ucontext); if (!srq) return -EINVAL; @@ -2045,7 +2054,7 @@ attr.max_wr = cmd.max_wr; attr.srq_limit = cmd.srq_limit; - ret = ib_modify_srq(srq, &attr, cmd.attr_mask); + ret = srq->device->modify_srq(srq, &attr, cmd.attr_mask, &udata); put_srq_read(srq); -- Ralph Campbell From ralphc at pathscale.com Mon Jun 19 16:41:51 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Mon, 19 Jun 2006 16:41:51 -0700 Subject: [openib-general] [PATCH 2/4] ipath mmaped CQs, QPs, SRQs Message-ID: <1150760512.32252.164.camel@brick.pathscale.com> This patch contains the mthca and ehca specific changes. 
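The change these driver updates track is purely a signature extension: modify_qp()/modify_srq() gain a struct ib_udata * through which the uverbs layer can hand the user's trailing command bytes to the driver, while in-kernel callers (ib_modify_qp, ib_modify_srq) pass NULL and drivers with nothing user-visible to exchange simply ignore it. A minimal sketch of that convention, with toy stand-in names (toy_udata and toy_modify_qp are illustrations, not the kernel API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct ib_udata: the opaque user-space buffer the
 * uverbs layer now threads through to the driver's modify hooks. */
struct toy_udata {
	const void *inbuf;
	void *outbuf;
	size_t inlen;
	size_t outlen;
};

/* Driver hook with the extended signature. A driver that has nothing
 * to report back (as mthca/ehca here) can ignore udata entirely;
 * udata is also NULL for purely in-kernel callers. */
static int toy_modify_qp(int *qp_state, int new_state,
			 struct toy_udata *udata)
{
	*qp_state = new_state;
	if (udata && udata->outlen >= sizeof(uint32_t)) {
		/* hypothetical HW-specific reply, e.g. an mmap key */
		uint32_t key = 0x1234;
		memcpy(udata->outbuf, &key, sizeof key);
	}
	return 0;
}
```

The design choice matches the uverbs_cmd.c hunk in patch 1/4: rather than inventing a new entry point, the existing driver hook is widened and the uverbs path calls qp->device->modify_qp() directly with a populated udata.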
Signed-off-by: Ralph Campbell Index: src/userspace/libmthca/src/verbs.c =================================================================== --- src/userspace/libmthca/src/verbs.c (revision 8021) +++ src/userspace/libmthca/src/verbs.c (working copy) @@ -259,6 +259,7 @@ { struct mthca_cq *cq = to_mcq(ibcq); struct mthca_resize_cq cmd; + struct ibv_resize_cq_resp resp; struct ibv_mr *mr; void *buf; int old_cqe; @@ -292,7 +293,8 @@ old_cqe = ibcq->cqe; cmd.lkey = mr->lkey; - ret = ibv_cmd_resize_cq(ibcq, cqe - 1, &cmd.ibv_cmd, sizeof cmd); + ret = ibv_cmd_resize_cq(ibcq, cqe - 1, &cmd.ibv_cmd, sizeof cmd, + &resp, sizeof resp); if (ret) { mthca_dereg_mr(mr); free(buf); Index: src/linux-kernel/infiniband/hw/mthca/mthca_srq.c =================================================================== --- src/linux-kernel/infiniband/hw/mthca/mthca_srq.c (revision 8021) +++ src/linux-kernel/infiniband/hw/mthca/mthca_srq.c (working copy) @@ -357,7 +357,7 @@ } int mthca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask) + enum ib_srq_attr_mask attr_mask, struct ib_udata *udata) { struct mthca_dev *dev = to_mdev(ibsrq->device); struct mthca_srq *srq = to_msrq(ibsrq); Index: src/linux-kernel/infiniband/hw/mthca/mthca_dev.h =================================================================== --- src/linux-kernel/infiniband/hw/mthca/mthca_dev.h (revision 8021) +++ src/linux-kernel/infiniband/hw/mthca/mthca_dev.h (working copy) @@ -506,7 +506,7 @@ struct ib_srq_attr *attr, struct mthca_srq *srq); void mthca_free_srq(struct mthca_dev *dev, struct mthca_srq *srq); int mthca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask); + enum ib_srq_attr_mask attr_mask, struct ib_udata *udata); int mthca_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr); int mthca_max_srq_sge(struct mthca_dev *dev); void mthca_srq_event(struct mthca_dev *dev, u32 srqn, @@ -521,7 +521,8 @@ enum ib_event_type event_type); int 
mthca_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr, int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr); -int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, + struct ib_udata *udata); int mthca_tavor_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr); int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, Index: src/linux-kernel/infiniband/hw/mthca/mthca_qp.c =================================================================== --- src/linux-kernel/infiniband/hw/mthca/mthca_qp.c (revision 8021) +++ src/linux-kernel/infiniband/hw/mthca/mthca_qp.c (working copy) @@ -522,7 +522,8 @@ return 0; } -int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, + struct ib_udata *udata) { struct mthca_dev *dev = to_mdev(ibqp->device); struct mthca_qp *qp = to_mqp(ibqp); Index: src/linux-kernel/infiniband/hw/ehca/ehca_qp.c =================================================================== --- src/linux-kernel/infiniband/hw/ehca/ehca_qp.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ehca/ehca_qp.c (working copy) @@ -1288,7 +1288,8 @@ return ret; } -int ehca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +int ehca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, + struct ib_udata *udata) { int ret = 0; struct ehca_qp *my_qp = NULL; Index: src/linux-kernel/infiniband/hw/ehca/ehca_iverbs.h =================================================================== --- src/linux-kernel/infiniband/hw/ehca/ehca_iverbs.h (revision 8021) +++ src/linux-kernel/infiniband/hw/ehca/ehca_iverbs.h (working copy) @@ -143,7 +143,8 @@ int ehca_destroy_qp(struct ib_qp *qp); -int ehca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int ehca_modify_qp(struct ib_qp 
*ibqp, struct ib_qp_attr *attr, int attr_mask, + struct ib_udata *udata); int ehca_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr); -- Ralph Campbell From ralphc at pathscale.com Mon Jun 19 16:43:33 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Mon, 19 Jun 2006 16:43:33 -0700 Subject: [openib-general] [PATCH 3/4] ipath mmaped CQs, QPs, SRQs Message-ID: <1150760613.32252.166.camel@brick.pathscale.com> This patch contains the libipathverbs specific changes. Signed-off-by: Ralph Campbell Index: src/userspace/libipathverbs/src/verbs.c =================================================================== --- src/userspace/libipathverbs/src/verbs.c (revision 8021) +++ src/userspace/libipathverbs/src/verbs.c (working copy) @@ -40,11 +40,14 @@ #include #include -#include +#include #include #include +#include +#include #include "ipathverbs.h" +#include "ipath-abi.h" int ipath_query_device(struct ibv_context *context, struct ibv_device_attr *attr) @@ -83,11 +86,11 @@ struct ibv_pd *pd; pd = malloc(sizeof *pd); - if(!pd) + if (!pd) return NULL; - if(ibv_cmd_alloc_pd(context, pd, &cmd, sizeof cmd, - &resp, sizeof resp)) { + if (ibv_cmd_alloc_pd(context, pd, &cmd, sizeof cmd, + &resp, sizeof resp)) { free(pd); return NULL; } @@ -142,57 +145,159 @@ struct ibv_comp_channel *channel, int comp_vector) { - struct ibv_cq *cq; - struct ibv_create_cq cmd; - struct ibv_create_cq_resp resp; - int ret; + struct ipath_cq *cq; + struct ibv_create_cq cmd; + struct ipath_create_cq_resp resp; + int ret; + size_t size; cq = malloc(sizeof *cq); if (!cq) return NULL; - ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector, cq, - &cmd, sizeof cmd, &resp, sizeof resp); + ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector, + &cq->ibv_cq, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); if (ret) { free(cq); return NULL; } - return cq; + size = sizeof(struct ipath_cq_wc) + sizeof(struct ipath_wc) * cqe; + cq->queue 
= mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, + context->cmd_fd, resp.offset); + if ((void *) cq->queue == MAP_FAILED) { + free(cq); + return NULL; + } + + pthread_spin_init(&cq->lock, PTHREAD_PROCESS_PRIVATE); + return &cq->ibv_cq; } -int ipath_destroy_cq(struct ibv_cq *cq) +int ipath_resize_cq(struct ibv_cq *ibcq, int cqe) { + struct ipath_cq *cq = to_icq(ibcq); + struct ibv_resize_cq cmd; + struct ipath_resize_cq_resp resp; + size_t size; + int ret; + + pthread_spin_lock(&cq->lock); + /* Save the old size so we can unmmap the queue. */ + size = sizeof(struct ipath_cq_wc) + + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe); + ret = ibv_cmd_resize_cq(ibcq, cqe, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); + if (ret) { + pthread_spin_unlock(&cq->lock); + return ret; + } + (void) munmap(cq->queue, size); + size = sizeof(struct ipath_cq_wc) + + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe); + cq->queue = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, + ibcq->context->cmd_fd, resp.offset); + ret = errno; + pthread_spin_unlock(&cq->lock); + if ((void *) cq->queue == MAP_FAILED) + return ret; + return 0; +} + +int ipath_destroy_cq(struct ibv_cq *ibcq) +{ + struct ipath_cq *cq = to_icq(ibcq); int ret; - ret = ibv_cmd_destroy_cq(cq); + ret = ibv_cmd_destroy_cq(ibcq); if (ret) return ret; + (void) munmap(cq->queue, sizeof(struct ipath_cq_wc) + + (sizeof(struct ipath_wc) * cq->ibv_cq.cqe)); free(cq); return 0; } +int ipath_poll_cq(struct ibv_cq *ibcq, int ne, struct ibv_wc *wc) +{ + struct ipath_cq *cq = to_icq(ibcq); + struct ipath_cq_wc *q; + int npolled; + uint32_t tail; + + pthread_spin_lock(&cq->lock); + q = cq->queue; + tail = q->tail; + for (npolled = 0; npolled < ne; ++npolled, ++wc) { + if (tail == q->head) + break; + memcpy(wc, &q->queue[tail], sizeof(*wc)); + if (tail == cq->ibv_cq.cqe) + tail = 0; + else + tail++; + } + q->tail = tail; + pthread_spin_unlock(&cq->lock); + + return npolled; +} + struct ibv_qp *ipath_create_qp(struct ibv_pd *pd, struct 
ibv_qp_init_attr *attr) { - struct ibv_create_qp cmd; - struct ibv_create_qp_resp resp; - struct ibv_qp *qp; - int ret; + struct ibv_create_qp cmd; + struct ipath_create_qp_resp resp; + struct ipath_qp *qp; + int ret; + size_t size; qp = malloc(sizeof *qp); if (!qp) return NULL; - ret = ibv_cmd_create_qp(pd, qp, attr, &cmd, sizeof cmd, &resp, sizeof resp); + ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); if (ret) { free(qp); return NULL; } - return qp; + if (attr->srq) { + qp->rq.size = 0; + qp->rq.max_sge = 0; + qp->rq.rwq = NULL; + } else { + qp->rq.size = attr->cap.max_recv_wr + 1; + qp->rq.max_sge = attr->cap.max_recv_sge; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * qp->rq.max_sge)) * + qp->rq.size; + qp->rq.rwq = mmap(NULL, size, + PROT_READ | PROT_WRITE, MAP_SHARED, + pd->context->cmd_fd, resp.offset); + if ((void *) qp->rq.rwq == MAP_FAILED) { + free(qp); + return NULL; + } + } + + pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE); + return &qp->ibv_qp; } +int ipath_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + enum ibv_qp_attr_mask attr_mask, + struct ibv_qp_init_attr *init_attr) +{ + struct ibv_query_qp cmd; + + return ibv_cmd_query_qp(qp, attr, attr_mask, init_attr, + &cmd, sizeof cmd); +} + int ipath_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask) { @@ -201,70 +306,196 @@ return ibv_cmd_modify_qp(qp, attr, attr_mask, &cmd, sizeof cmd); } -int ipath_destroy_qp(struct ibv_qp *qp) +int ipath_destroy_qp(struct ibv_qp *ibqp) { + struct ipath_qp *qp = to_iqp(ibqp); int ret; - ret = ibv_cmd_destroy_qp(qp); + ret = ibv_cmd_destroy_qp(ibqp); if (ret) return ret; + if (qp->rq.rwq) { + size_t size; + + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * qp->rq.max_sge)) * + qp->rq.size; + (void) munmap(qp->rq.rwq, size); + } free(qp); return 0; } +static int 
post_recv(struct ipath_rq *rq, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + struct ibv_recv_wr *i; + struct ipath_rwq *rwq; + struct ipath_rwqe *wqe; + uint32_t head; + int n, ret; + + pthread_spin_lock(&rq->lock); + rwq = rq->rwq; + head = rwq->head; + for (i = wr; i; i = i->next) { + if ((unsigned) i->num_sge > rq->max_sge) + goto bad; + wqe = get_rwqe_ptr(rq, head); + if (++head >= rq->size) + head = 0; + if (head == rwq->tail) + goto bad; + wqe->wr_id = i->wr_id; + wqe->num_sge = i->num_sge; + for (n = 0; n < wqe->num_sge; n++) + wqe->sg_list[n] = i->sg_list[n]; + rwq->head = head; + } + ret = 0; + goto done; + +bad: + ret = -ENOMEM; + if (bad_wr) + *bad_wr = i; +done: + pthread_spin_unlock(&rq->lock); + return ret; +} + +int ipath_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + struct ipath_qp *qp = to_iqp(ibqp); + + return post_recv(&qp->rq, wr, bad_wr); +} + struct ibv_srq *ipath_create_srq(struct ibv_pd *pd, struct ibv_srq_init_attr *attr) { - struct ibv_srq *srq; + struct ipath_srq *srq; struct ibv_create_srq cmd; - struct ibv_create_srq_resp resp; + struct ipath_create_srq_resp resp; int ret; + size_t size; srq = malloc(sizeof *srq); - if(srq == NULL) + if (srq == NULL) return NULL; - ret = ibv_cmd_create_srq(pd, srq, attr, &cmd, sizeof cmd, - &resp, sizeof resp); + ret = ibv_cmd_create_srq(pd, &srq->ibv_srq, attr, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); if (ret) { free(srq); return NULL; } - return srq; + srq->rq.size = attr->attr.max_wr + 1; + srq->rq.max_sge = attr->attr.max_sge; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * srq->rq.size; + srq->rq.rwq = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, + pd->context->cmd_fd, resp.offset); + if ((void *) srq->rq.rwq == MAP_FAILED) { + free(srq); + return NULL; + } + + pthread_spin_init(&srq->rq.lock, PTHREAD_PROCESS_PRIVATE); + return &srq->ibv_srq; } -int 
ipath_modify_srq(struct ibv_srq *srq, +int ipath_modify_srq(struct ibv_srq *ibsrq, struct ibv_srq_attr *attr, enum ibv_srq_attr_mask attr_mask) { - struct ibv_modify_srq cmd; + struct ipath_srq *srq = to_isrq(ibsrq); + struct ipath_modify_srq_cmd cmd; + __u64 offset; + size_t size; + int ret; - return ibv_cmd_modify_srq(srq, attr, attr_mask, &cmd, sizeof cmd); + if (attr_mask & IBV_SRQ_MAX_WR) { + pthread_spin_lock(&srq->rq.lock); + /* Save the old size so we can unmmap the queue. */ + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * + srq->rq.size; + } + cmd.offset_addr = (__u64) &offset; + ret = ibv_cmd_modify_srq(ibsrq, attr, attr_mask, + &cmd.ibv_cmd, sizeof cmd); + if (ret) { + if (attr_mask & IBV_SRQ_MAX_WR) + pthread_spin_unlock(&srq->rq.lock); + return ret; + } + if (attr_mask & IBV_SRQ_MAX_WR) { + (void) munmap(srq->rq.rwq, size); + srq->rq.size = attr->max_wr + 1; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * + srq->rq.size; + srq->rq.rwq = mmap(NULL, size, + PROT_READ | PROT_WRITE, MAP_SHARED, + ibsrq->context->cmd_fd, offset); + pthread_spin_unlock(&srq->rq.lock); + /* XXX Now we have no receive queue. 
*/ + if ((void *) srq->rq.rwq == MAP_FAILED) + return errno; + } + return 0; } -int ipath_destroy_srq(struct ibv_srq *srq) +int ipath_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr) { + struct ibv_query_srq cmd; + + return ibv_cmd_query_srq(srq, attr, &cmd, sizeof cmd); +} + +int ipath_destroy_srq(struct ibv_srq *ibsrq) +{ + struct ipath_srq *srq = to_isrq(ibsrq); + size_t size; int ret; - ret = ibv_cmd_destroy_srq(srq); + ret = ibv_cmd_destroy_srq(ibsrq); if (ret) return ret; + size = sizeof(struct ipath_rwq) + + (sizeof(struct ipath_rwqe) + + (sizeof(struct ibv_sge) * srq->rq.max_sge)) * srq->rq.size; + (void) munmap(srq->rq.rwq, size); free(srq); return 0; } +int ipath_post_srq_recv(struct ibv_srq *ibsrq, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + struct ipath_srq *srq = to_isrq(ibsrq); + + return post_recv(&srq->rq, wr, bad_wr); +} + struct ibv_ah *ipath_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr) { struct ibv_ah *ah; ah = malloc(sizeof *ah); - if(ah == NULL) + if (ah == NULL) return NULL; - if(ibv_cmd_create_ah(pd, ah, attr)) { + if (ibv_cmd_create_ah(pd, ah, attr)) { free(ah); return NULL; } Index: src/userspace/libipathverbs/src/ipathverbs.map =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.map (revision 8021) +++ src/userspace/libipathverbs/src/ipathverbs.map (working copy) @@ -1,4 +1,4 @@ { - global: openib_driver_init; + global: ibv_driver_init; local: *; }; Index: src/userspace/libipathverbs/src/ipath-abi.h =================================================================== --- src/userspace/libipathverbs/src/ipath-abi.h (revision 0) +++ src/userspace/libipathverbs/src/ipath-abi.h (revision 0) @@ -0,0 +1,67 @@ +/* + * Copyright (c) 2006. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. 
+ */ + +#ifndef IPATH_ABI_H +#define IPATH_ABI_H + +#include + +struct ipath_create_cq_resp { + struct ibv_create_cq_resp ibv_resp; + __u64 offset; +}; + +struct ipath_resize_cq_resp { + struct ibv_resize_cq_resp ibv_resp; + __u64 offset; +}; + +struct ipath_create_qp_resp { + struct ibv_create_qp_resp ibv_resp; + __u64 offset; +}; + +struct ipath_create_srq_resp { + struct ibv_create_srq_resp ibv_resp; + __u64 offset; +}; + +struct ipath_modify_srq_cmd { + struct ibv_modify_srq ibv_cmd; + __u64 offset_addr; +}; + +#endif /* IPATH_ABI_H */ Index: src/userspace/libipathverbs/src/ipathverbs.c =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.c (revision 8021) +++ src/userspace/libipathverbs/src/ipathverbs.c (working copy) @@ -86,22 +86,25 @@ .dereg_mr = ipath_dereg_mr, .create_cq = ipath_create_cq, - .poll_cq = ibv_cmd_poll_cq, + .poll_cq = ipath_poll_cq, .req_notify_cq = ibv_cmd_req_notify_cq, .cq_event = NULL, + .resize_cq = ipath_resize_cq, .destroy_cq = ipath_destroy_cq, .create_srq = ipath_create_srq, .modify_srq = ipath_modify_srq, + .query_srq = ipath_query_srq, .destroy_srq = ipath_destroy_srq, - .post_srq_recv = ibv_cmd_post_srq_recv, + .post_srq_recv = ipath_post_srq_recv, .create_qp = ipath_create_qp, + .query_qp = ipath_query_qp, .modify_qp = ipath_modify_qp, .destroy_qp = ipath_destroy_qp, .post_send = ibv_cmd_post_send, - .post_recv = ibv_cmd_post_recv, + .post_recv = ipath_post_recv, .create_ah = ipath_create_ah, .destroy_ah = ipath_destroy_ah, @@ -145,30 +148,24 @@ .free_context = ipath_free_context }; -struct ibv_device *openib_driver_init(struct sysfs_class_device *sysdev) +struct ibv_device *ibv_driver_init(const char *uverbs_sys_path, + int abi_version) { - struct sysfs_device *pcidev; - struct sysfs_attribute *attr; + char value[8]; struct ipath_device *dev; - unsigned vendor, device; - int i; + unsigned vendor, device; + int i; - pcidev = sysfs_get_classdev_device(sysdev); - if 
(!pcidev) + if (ibv_read_sysfs_file(uverbs_sys_path, "device/vendor", + value, sizeof value) < 0) return NULL; + sscanf(value, "%i", &vendor); - attr = sysfs_get_device_attr(pcidev, "vendor"); - if (!attr) + if (ibv_read_sysfs_file(uverbs_sys_path, "device/device", + value, sizeof value) < 0) return NULL; - sscanf(attr->value, "%i", &vendor); - sysfs_close_attribute(attr); + sscanf(value, "%i", &device); - attr = sysfs_get_device_attr(pcidev, "device"); - if (!attr) - return NULL; - sscanf(attr->value, "%i", &device); - sysfs_close_attribute(attr); - for (i = 0; i < sizeof hca_table / sizeof hca_table[0]; ++i) if (vendor == hca_table[i].vendor && device == hca_table[i].device) @@ -180,13 +177,12 @@ dev = malloc(sizeof *dev); if (!dev) { fprintf(stderr, PFX "Fatal: couldn't allocate device for %s\n", - sysdev->name); - abort(); + uverbs_sys_path); + return NULL; } dev->ibv_dev.ops = ipath_dev_ops; dev->hca_type = hca_table[i].type; - dev->page_size = sysconf(_SC_PAGESIZE); return &dev->ibv_dev; } Index: src/userspace/libipathverbs/src/ipathverbs.h =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.h (revision 8021) +++ src/userspace/libipathverbs/src/ipathverbs.h (working copy) @@ -39,6 +39,7 @@ #include #include +#include #include #include @@ -57,13 +58,87 @@ struct ipath_device { struct ibv_device ibv_dev; enum ipath_hca_type hca_type; - int page_size; }; struct ipath_context { struct ibv_context ibv_ctx; }; +/* + * This structure needs to have the same size and offsets as + * the kernel's ib_wc structure since it is memory mapped. 
+ */ +struct ipath_wc { + uint64_t wr_id; + enum ibv_wc_status status; + enum ibv_wc_opcode opcode; + uint32_t vendor_err; + uint32_t byte_len; + uint32_t imm_data; /* in network byte order */ + uint32_t qp_num; + uint32_t src_qp; + enum ibv_wc_flags wc_flags; + uint16_t pkey_index; + uint16_t slid; + uint8_t sl; + uint8_t dlid_path_bits; + uint8_t port_num; +}; + +struct ipath_cq_wc { + uint32_t head; + uint32_t tail; + struct ipath_wc queue[1]; +}; + +struct ipath_cq { + struct ibv_cq ibv_cq; + struct ipath_cq_wc *queue; + pthread_spinlock_t lock; +}; + +/* + * Receive work request queue entry. + * The size of the sg_list is determined when the QP is created and stored + * in qp->r_max_sge. + */ +struct ipath_rwqe { + uint64_t wr_id; + uint8_t num_sge; + struct ibv_sge sg_list[0]; +}; + +/* + * This struture is used to contain the head pointer, tail pointer, + * and receive work queue entries as a single memory allocation so + * it can be mmap'ed into user space. + * Note that the wq array elements are variable size so you can't + * just index into the array to get the N'th element; + * use get_rwqe_ptr() instead. + */ +struct ipath_rwq { + uint32_t head; /* new requests posted to the head */ + uint32_t tail; /* receives pull requests from here. 
*/ + struct ipath_rwqe wq[0]; +}; + +struct ipath_rq { + struct ipath_rwq *rwq; + pthread_spinlock_t lock; + uint32_t size; + uint32_t max_sge; +}; + +struct ipath_qp { + struct ibv_qp ibv_qp; + struct ipath_rq rq; +}; + +struct ipath_srq { + struct ibv_srq ibv_srq; + struct ipath_rq rq; +}; + #define to_ixxx(xxx, type) \ ((struct ipath_##type *) \ ((void *) ib##xxx - offsetof(struct ipath_##type, ibv_##xxx))) @@ -73,6 +148,34 @@ return to_ixxx(ctx, context); } +static inline struct ipath_cq *to_icq(struct ibv_cq *ibcq) +{ + return to_ixxx(cq, cq); +} + +static inline struct ipath_qp *to_iqp(struct ibv_qp *ibqp) +{ + return to_ixxx(qp, qp); +} + +static inline struct ipath_srq *to_isrq(struct ibv_srq *ibsrq) +{ + return to_ixxx(srq, srq); +} + +/* + * Since struct ipath_rwqe is not a fixed size, we can't simply index into + * struct ipath_rq.wq. This function does the array index computation. + */ +static inline struct ipath_rwqe *get_rwqe_ptr(struct ipath_rq *rq, + unsigned n) +{ + return (struct ipath_rwqe *) + ((char *) rq->rwq->wq + + (sizeof(struct ipath_rwqe) + + rq->max_sge * sizeof(struct ibv_sge)) * n); +} + extern int ipath_query_device(struct ibv_context *context, struct ibv_device_attr *attr); @@ -92,11 +195,19 @@ struct ibv_comp_channel *channel, int comp_vector); +int ipath_resize_cq(struct ibv_cq *cq, int cqe); + int ipath_destroy_cq(struct ibv_cq *cq); +int ipath_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); + struct ibv_qp *ipath_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr); +int ipath_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + enum ibv_qp_attr_mask attr_mask, + struct ibv_qp_init_attr *init_attr); + int ipath_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask); @@ -115,8 +226,12 @@ struct ibv_srq_attr *attr, enum ibv_srq_attr_mask attr_mask); +int ipath_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr); + int ipath_destroy_srq(struct ibv_srq *srq); +int 
ipath_post_srq_recv(struct ibv_srq *srq, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr); struct ibv_ah *ipath_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr); -- Ralph Campbell From ralphc at pathscale.com Mon Jun 19 16:45:46 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Mon, 19 Jun 2006 16:45:46 -0700 Subject: [openib-general] [PATCH 4/4] ipath mmaped CQs, QPs, SRQs Message-ID: <1150760746.32252.169.camel@brick.pathscale.com> This patch contains the ib_ipath kernel driver specific changes. Signed-off-by: Ralph Campbell Index: src/linux-kernel/infiniband/hw/ipath/ipath_qp.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_qp.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_qp.c (working copy) @@ -354,8 +354,10 @@ qp->s_last = 0; qp->s_ssn = 1; qp->s_lsn = 0; - qp->r_rq.head = 0; - qp->r_rq.tail = 0; + if (qp->r_rq.wq) { + qp->r_rq.wq->head = 0; + qp->r_rq.wq->tail = 0; + } qp->r_reuse_sge = 0; } @@ -364,7 +366,7 @@ * @qp: the QP to put into an error state * * Flushes both send and receive work queues. - * QP s_lock should be held. + * QP s_lock should be held and interrupts disabled. 
*/ void ipath_error_qp(struct ipath_qp *qp) @@ -409,15 +411,32 @@ qp->s_hdrwords = 0; qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; - wc.opcode = IB_WC_RECV; - spin_lock(&qp->r_rq.lock); - while (qp->r_rq.tail != qp->r_rq.head) { - wc.wr_id = get_rwqe_ptr(&qp->r_rq, qp->r_rq.tail)->wr_id; - if (++qp->r_rq.tail >= qp->r_rq.size) - qp->r_rq.tail = 0; - ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + if (qp->r_rq.wq) { + struct ipath_rwq *wq; + u32 head; + u32 tail; + + spin_lock(&qp->r_rq.lock); + + /* sanity check pointers before trusting them */ + wq = qp->r_rq.wq; + head = wq->head; + if (head >= qp->r_rq.size) + head = 0; + tail = wq->tail; + if (tail >= qp->r_rq.size) + tail = 0; + wc.opcode = IB_WC_RECV; + while (tail != head) { + wc.wr_id = get_rwqe_ptr(&qp->r_rq, tail)->wr_id; + if (++tail >= qp->r_rq.size) + tail = 0; + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + } + wq->tail = tail; + + spin_unlock(&qp->r_rq.lock); } - spin_unlock(&qp->r_rq.lock); } /** @@ -425,11 +444,12 @@ * @ibqp: the queue pair who's attributes we're modifying * @attr: the new attributes * @attr_mask: the mask of attributes to modify + * @udata: user data for ipathverbs.so * * Returns 0 on success, otherwise returns an errno. */ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, - int attr_mask) + int attr_mask, struct ib_udata *udata) { struct ipath_ibdev *dev = to_idev(ibqp->device); struct ipath_qp *qp = to_iqp(ibqp); @@ -542,7 +562,7 @@ attr->dest_qp_num = qp->remote_qpn; attr->qp_access_flags = qp->qp_access_flags; attr->cap.max_send_wr = qp->s_size - 1; - attr->cap.max_recv_wr = qp->r_rq.size - 1; + attr->cap.max_recv_wr = qp->ibqp.srq ? 
0 : qp->r_rq.size - 1; attr->cap.max_send_sge = qp->s_max_sge; attr->cap.max_recv_sge = qp->r_rq.max_sge; attr->cap.max_inline_data = 0; @@ -595,13 +615,23 @@ } else { u32 min, max, x; u32 credits; + struct ipath_rwq *wq = qp->r_rq.wq; + u32 head; + u32 tail; + /* sanity check pointers before trusting them */ + head = wq->head; + if (head >= qp->r_rq.size) + head = 0; + tail = wq->tail; + if (tail >= qp->r_rq.size) + tail = 0; /* * Compute the number of credits available (RWQEs). * XXX Not holding the r_rq.lock here so there is a small * chance that the pair of reads are not atomic. */ - credits = qp->r_rq.head - qp->r_rq.tail; + credits = head - tail; if ((int)credits < 0) credits += qp->r_rq.size; /* @@ -678,27 +708,32 @@ case IB_QPT_UD: case IB_QPT_SMI: case IB_QPT_GSI: - qp = kmalloc(sizeof(*qp), GFP_KERNEL); + sz = sizeof(*qp); + if (!init_attr->srq) + sz += sizeof(*qp->r_sg_list) * + init_attr->cap.max_recv_sge; + qp = kmalloc(sz, GFP_KERNEL); if (!qp) { - vfree(swq); ret = ERR_PTR(-ENOMEM); - goto bail; + goto free_swq; } if (init_attr->srq) { + sz = 0; qp->r_rq.size = 0; qp->r_rq.max_sge = 0; qp->r_rq.wq = NULL; + init_attr->cap.max_recv_wr = 0; + init_attr->cap.max_recv_sge = 0; } else { qp->r_rq.size = init_attr->cap.max_recv_wr + 1; qp->r_rq.max_sge = init_attr->cap.max_recv_sge; - sz = (sizeof(struct ipath_sge) * qp->r_rq.max_sge) + + sz = (sizeof(struct ib_sge) * qp->r_rq.max_sge) + sizeof(struct ipath_rwqe); - qp->r_rq.wq = vmalloc(qp->r_rq.size * sz); + qp->r_rq.wq = vmalloc(sizeof(struct ipath_rwq) + + qp->r_rq.size * sz); if (!qp->r_rq.wq) { - kfree(qp); - vfree(swq); ret = ERR_PTR(-ENOMEM); - goto bail; + goto free_qp; } } @@ -724,16 +759,14 @@ err = ipath_alloc_qpn(&dev->qp_table, qp, init_attr->qp_type); if (err) { - vfree(swq); - vfree(qp->r_rq.wq); - kfree(qp); ret = ERR_PTR(err); - goto bail; + goto free_rwq; } + qp->ip = NULL; ipath_reset_qp(qp); /* Tell the core driver that the kernel SMA is present. 
*/ - if (qp->ibqp.qp_type == IB_QPT_SMI) + if (init_attr->qp_type == IB_QPT_SMI) ipath_layer_set_verbs_flags(dev->dd, IPATH_VERBS_KERNEL_SMA); break; @@ -746,8 +779,51 @@ init_attr->cap.max_inline_data = 0; + /* + * Return the address of the RWQ as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata) { + struct ipath_mmap_info *ip; + __u64 offset = (__u64) qp->r_rq.wq; + int err; + + err = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (err) { + ret = ERR_PTR(err); + goto free_rwq; + } + + if (qp->r_rq.wq) { + /* Allocate info for ipath_mmap(). */ + ip = kmalloc(sizeof(*ip), GFP_KERNEL); + if (!ip) { + ret = ERR_PTR(-ENOMEM); + goto free_rwq; + } + qp->ip = ip; + ip->context = ibpd->uobject->context; + ip->obj = qp->r_rq.wq; + kref_init(&ip->ref); + ip->mmap_cnt = 0; + ip->size = PAGE_ALIGN(sizeof(struct ipath_rwq) + + qp->r_rq.size * sz); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } + } + ret = &qp->ibqp; + goto bail; +free_rwq: + vfree(qp->r_rq.wq); +free_qp: + kfree(qp); +free_swq: + vfree(swq); bail: return ret; } @@ -771,11 +847,9 @@ if (qp->ibqp.qp_type == IB_QPT_SMI) ipath_layer_set_verbs_flags(dev->dd, 0); - spin_lock_irqsave(&qp->r_rq.lock, flags); - spin_lock(&qp->s_lock); + spin_lock_irqsave(&qp->s_lock, flags); qp->state = IB_QPS_ERR; - spin_unlock(&qp->s_lock); - spin_unlock_irqrestore(&qp->r_rq.lock, flags); + spin_unlock_irqrestore(&qp->s_lock, flags); /* Stop the sending tasklet. 
*/ tasklet_kill(&qp->s_task); @@ -796,8 +870,11 @@ if (atomic_read(&qp->refcount) != 0) ipath_free_qp(&dev->qp_table, qp); + if (qp->ip) + kref_put(&qp->ip->ref, ipath_release_mmap_info); + else + vfree(qp->r_rq.wq); vfree(qp->s_wq); - vfree(qp->r_rq.wq); kfree(qp); return 0; } Index: src/linux-kernel/infiniband/hw/ipath/ipath_ruc.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_ruc.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_ruc.c (working copy) @@ -105,6 +105,54 @@ spin_unlock_irqrestore(&dev->pending_lock, flags); } +static int init_sge(struct ipath_qp *qp, struct ipath_rwqe *wqe) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + int user = to_ipd(qp->ibqp.pd)->user; + int i, j, ret; + struct ib_wc wc; + + qp->r_len = 0; + for (i = j = 0; i < wqe->num_sge; i++) { + if (wqe->sg_list[i].length == 0) + continue; + /* Check LKEY */ + if ((user && wqe->sg_list[i].lkey == 0) || + !ipath_lkey_ok(&dev->lk_table, + &qp->r_sg_list[j], &wqe->sg_list[i], + IB_ACCESS_LOCAL_WRITE)) + goto bad_lkey; + qp->r_len += wqe->sg_list[i].length; + j++; + } + qp->r_sge.sge = qp->r_sg_list[0]; + qp->r_sge.sg_list = qp->r_sg_list + 1; + qp->r_sge.num_sge = j; + ret = 1; + goto bail; + +bad_lkey: + wc.wr_id = wqe->wr_id; + wc.status = IB_WC_LOC_PROT_ERR; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = 0; + wc.imm_data = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = 0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = 0; + wc.sl = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + /* Signal solicited completion event. 
*/ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + ret = 0; +bail: + return ret; +} + /** * ipath_get_rwqe - copy the next RWQE into the QP's RWQE * @qp: the QP @@ -118,73 +166,69 @@ { unsigned long flags; struct ipath_rq *rq; + struct ipath_rwq *wq; struct ipath_srq *srq; struct ipath_rwqe *wqe; + void (*handler)(struct ib_event *, void *); + u32 tail; int ret; - if (!qp->ibqp.srq) { + if (qp->ibqp.srq) { + srq = to_isrq(qp->ibqp.srq); + handler = srq->ibsrq.event_handler; + rq = &srq->rq; + } else { + srq = NULL; + handler = NULL; rq = &qp->r_rq; - spin_lock_irqsave(&rq->lock, flags); + } - if (unlikely(rq->tail == rq->head)) { + spin_lock_irqsave(&rq->lock, flags); + wq = rq->wq; + tail = wq->tail; + do { + if (unlikely(tail == wq->head)) { + spin_unlock_irqrestore(&rq->lock, flags); ret = 0; goto bail; } - wqe = get_rwqe_ptr(rq, rq->tail); - qp->r_wr_id = wqe->wr_id; - if (!wr_id_only) { - qp->r_sge.sge = wqe->sg_list[0]; - qp->r_sge.sg_list = wqe->sg_list + 1; - qp->r_sge.num_sge = wqe->num_sge; - qp->r_len = wqe->length; - } - if (++rq->tail >= rq->size) - rq->tail = 0; - goto done; - } + wqe = get_rwqe_ptr(rq, tail); + if (++tail >= rq->size) + tail = 0; + } while (!wr_id_only && !init_sge(qp, wqe)); + qp->r_wr_id = wqe->wr_id; + wq->tail = tail; - srq = to_isrq(qp->ibqp.srq); - rq = &srq->rq; - spin_lock_irqsave(&rq->lock, flags); - - if (unlikely(rq->tail == rq->head)) { - ret = 0; - goto bail; - } - wqe = get_rwqe_ptr(rq, rq->tail); - qp->r_wr_id = wqe->wr_id; - if (!wr_id_only) { - qp->r_sge.sge = wqe->sg_list[0]; - qp->r_sge.sg_list = wqe->sg_list + 1; - qp->r_sge.num_sge = wqe->num_sge; - qp->r_len = wqe->length; - } - if (++rq->tail >= rq->size) - rq->tail = 0; - if (srq->ibsrq.event_handler) { - struct ib_event ev; + ret = 1; + if (handler) { u32 n; - if (rq->head < rq->tail) - n = rq->size + rq->head - rq->tail; + /* + * validate head pointer value and compute + * the number of remaining WQEs. 
+ */ + n = wq->head; + if (n >= rq->size) + n = 0; + if (n < tail) + n += rq->size - tail; else - n = rq->head - rq->tail; + n -= tail; if (n < srq->limit) { + struct ib_event ev; + srq->limit = 0; spin_unlock_irqrestore(&rq->lock, flags); ev.device = qp->ibqp.device; ev.element.srq = qp->ibqp.srq; ev.event = IB_EVENT_SRQ_LIMIT_REACHED; - srq->ibsrq.event_handler(&ev, - srq->ibsrq.srq_context); - spin_lock_irqsave(&rq->lock, flags); + handler(&ev, srq->ibsrq.srq_context); + goto bail; } } -done: - ret = 1; + spin_unlock_irqrestore(&rq->lock, flags); bail: - spin_unlock_irqrestore(&rq->lock, flags); return ret; } Index: src/linux-kernel/infiniband/hw/ipath/Makefile =================================================================== --- src/linux-kernel/infiniband/hw/ipath/Makefile (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/Makefile (working copy) @@ -25,6 +25,7 @@ ipath_cq.o \ ipath_keys.o \ ipath_mad.o \ + ipath_mmap.o \ ipath_mr.o \ ipath_qp.o \ ipath_rc.o \ Index: src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c (working copy) @@ -280,11 +280,12 @@ struct ib_recv_wr **bad_wr) { struct ipath_qp *qp = to_iqp(ibqp); + struct ipath_rwq *wq = qp->r_rq.wq; unsigned long flags; int ret; /* Check that state is OK to post receive. 
*/ - if (!(ib_ipath_state_ops[qp->state] & IPATH_POST_RECV_OK)) { + if (!(ib_ipath_state_ops[qp->state] & IPATH_POST_RECV_OK) || !wq) { *bad_wr = wr; ret = -EINVAL; goto bail; @@ -293,59 +294,31 @@ for (; wr; wr = wr->next) { struct ipath_rwqe *wqe; u32 next; - int i, j; + int i; - if (wr->num_sge > qp->r_rq.max_sge) { + if ((unsigned) wr->num_sge > qp->r_rq.max_sge) { *bad_wr = wr; ret = -ENOMEM; goto bail; } spin_lock_irqsave(&qp->r_rq.lock, flags); - next = qp->r_rq.head + 1; + next = wq->head + 1; if (next >= qp->r_rq.size) next = 0; - if (next == qp->r_rq.tail) { + if (next == wq->tail) { spin_unlock_irqrestore(&qp->r_rq.lock, flags); *bad_wr = wr; ret = -ENOMEM; goto bail; } - wqe = get_rwqe_ptr(&qp->r_rq, qp->r_rq.head); + wqe = get_rwqe_ptr(&qp->r_rq, wq->head); wqe->wr_id = wr->wr_id; - wqe->sg_list[0].mr = NULL; - wqe->sg_list[0].vaddr = NULL; - wqe->sg_list[0].length = 0; - wqe->sg_list[0].sge_length = 0; - wqe->length = 0; - for (i = 0, j = 0; i < wr->num_sge; i++) { - /* Check LKEY */ - if (to_ipd(qp->ibqp.pd)->user && - wr->sg_list[i].lkey == 0) { - spin_unlock_irqrestore(&qp->r_rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - if (wr->sg_list[i].length == 0) - continue; - if (!ipath_lkey_ok( - &to_idev(qp->ibqp.device)->lk_table, - &wqe->sg_list[j], &wr->sg_list[i], - IB_ACCESS_LOCAL_WRITE)) { - spin_unlock_irqrestore(&qp->r_rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - wqe->length += wr->sg_list[i].length; - j++; - } - wqe->num_sge = j; - qp->r_rq.head = next; + wqe->num_sge = wr->num_sge; + for (i = 0; i < wr->num_sge; i++) + wqe->sg_list[i] = wr->sg_list[i]; + wq->head = next; spin_unlock_irqrestore(&qp->r_rq.lock, flags); } ret = 0; @@ -694,7 +667,7 @@ ipath_layer_get_lastibcstat(dev->dd) & 0xf]; props->port_cap_flags = dev->port_cap_flags; props->gid_tbl_len = 1; - props->max_msg_sz = 4096; + props->max_msg_sz = 0x80000000; props->pkey_tbl_len = ipath_layer_get_npkeys(dev->dd); props->bad_pkey_cntr = 
ipath_layer_get_cr_errpkey(dev->dd) - dev->z_pkey_violations; @@ -871,7 +844,7 @@ goto bail; } - if (ah_attr->port_num != 1 || + if (ah_attr->port_num < 1 || ah_attr->port_num > pd->device->phys_port_cnt) { ret = ERR_PTR(-EINVAL); goto bail; @@ -883,6 +856,8 @@ goto bail; } + dev->n_ahs_allocated++; + /* ib_create_ah() will initialize ah->ibah. */ ah->attr = *ah_attr; @@ -1137,6 +1112,7 @@ dev->attach_mcast = ipath_multicast_attach; dev->detach_mcast = ipath_multicast_detach; dev->process_mad = ipath_process_mad; + dev->mmap = ipath_mmap; snprintf(dev->node_desc, sizeof(dev->node_desc), IPATH_IDSTR " %s kernel_SMA", system_utsname.nodename); Index: src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h (working copy) @@ -37,6 +37,7 @@ #include #include #include +#include #include #include "ipath_layer.h" @@ -177,58 +178,41 @@ }; /* - * Quick description of our CQ/QP locking scheme: - * - * We have one global lock that protects dev->cq/qp_table. Each - * struct ipath_cq/qp also has its own lock. An individual qp lock - * may be taken inside of an individual cq lock. Both cqs attached to - * a qp may be locked, with the send cq locked first. No other - * nesting should be done. - * - * Each struct ipath_cq/qp also has an atomic_t ref count. The - * pointer from the cq/qp_table to the struct counts as one reference. - * This reference also is good for access through the consumer API, so - * modifying the CQ/QP etc doesn't need to take another reference. - * Access because of a completion being polled does need a reference. - * - * Finally, each struct ipath_cq/qp has a wait_queue_head_t for the - * destroy function to sleep on. - * - * This means that access from the consumer API requires nothing but - * taking the struct's lock. 
- * - * Access because of a completion event should go as follows: - * - lock cq/qp_table and look up struct - * - increment ref count in struct - * - drop cq/qp_table lock - * - lock struct, do your thing, and unlock struct - * - decrement ref count; if zero, wake up waiters - * - * To destroy a CQ/QP, we can do the following: - * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock - * - decrement ref count - * - wait_event until ref count is zero - * - * It is the consumer's responsibilty to make sure that no QP - * operations (WQE posting or state modification) are pending when the - * QP is destroyed. Also, the consumer must make sure that calls to - * qp_modify are serialized. - * - * Possible optimizations (wait for profile data to see if/where we - * have locks bouncing between CPUs): - * - split cq/qp table lock into n separate (cache-aligned) locks, - * indexed (say) by the page in the table + * This structure is used by ipath_mmap() to validate an offset + * when an mmap() request is made. The vm_area_struct then uses + * this as its vm_private_data. */ +struct ipath_mmap_info { + struct ipath_mmap_info *next; + struct ib_ucontext *context; + void *obj; + struct kref ref; + unsigned size; + unsigned mmap_cnt; +}; +/* + * This struture is used to contain the head pointer, tail pointer, + * and completion queue entries as a single memory allocation so + * it can be mmap'ed into user space. + */ +struct ipath_cq_wc { + u32 head; /* index of next entry to fill */ + u32 tail; /* index of next ib_poll_cq() entry */ + struct ib_wc queue[1]; /* this is actually size ibcq.cqe + 1 */ +}; + +/* + * The completion queue structure. + */ struct ipath_cq { struct ib_cq ibcq; struct tasklet_struct comptask; spinlock_t lock; u8 notify; u8 triggered; - u32 head; /* new records added to the head */ - u32 tail; /* poll_cq() reads from here. 
*/ - struct ib_wc *queue; /* this is actually ibcq.cqe + 1 */ + struct ipath_cq_wc *queue; + struct ipath_mmap_info *ip; }; /* @@ -247,28 +231,40 @@ /* * Receive work request queue entry. - * The size of the sg_list is determined when the QP is created and stored - * in qp->r_max_sge. + * The size of the sg_list is determined when the QP (or SRQ) is created + * and stored in qp->r_rq.max_sge (or srq->rq.max_sge). */ struct ipath_rwqe { u64 wr_id; - u32 length; /* total length of data in sg_list */ u8 num_sge; - struct ipath_sge sg_list[0]; + struct ib_sge sg_list[0]; }; +/* + * This struture is used to contain the head pointer, tail pointer, + * and receive work queue entries as a single memory allocation so + * it can be mmap'ed into user space. + * Note that the wq array elements are variable size so you can't + * just index into the array to get the N'th element; + * use get_rwqe_ptr() instead. + */ +struct ipath_rwq { + u32 head; /* new work requests posted to the head */ + u32 tail; /* receives pull requests from here. */ + struct ipath_rwqe wq[0]; +}; + struct ipath_rq { + struct ipath_rwq *wq; spinlock_t lock; - u32 head; /* new work requests posted to the head */ - u32 tail; /* receives pull requests from here. 
*/ u32 size; /* size of RWQE array */ u8 max_sge; - struct ipath_rwqe *wq; /* RWQE array */ }; struct ipath_srq { struct ib_srq ibsrq; struct ipath_rq rq; + struct ipath_mmap_info *ip; /* send signal when number of RWQEs < limit */ u32 limit; }; @@ -292,6 +288,7 @@ atomic_t refcount; wait_queue_head_t wait; struct tasklet_struct s_task; + struct ipath_mmap_info *ip; struct ipath_sge_state *s_cur_sge; struct ipath_sge_state s_sge; /* current send request data */ /* current RDMA read send data */ @@ -343,7 +340,8 @@ u32 s_ssn; /* SSN of tail entry */ u32 s_lsn; /* limit sequence number (credit) */ struct ipath_swqe *s_wq; /* send work queue */ - struct ipath_rq r_rq; /* receive work queue */ + struct ipath_rq r_rq; /* receive work queue */ + struct ipath_sge r_sg_list[0]; /* verified SGEs */ }; /* @@ -367,15 +365,15 @@ /* * Since struct ipath_rwqe is not a fixed size, we can't simply index into - * struct ipath_rq.wq. This function does the array index computation. + * struct ipath_rwq.wq. This function does the array index computation. 
*/ static inline struct ipath_rwqe *get_rwqe_ptr(struct ipath_rq *rq, unsigned n) { return (struct ipath_rwqe *) - ((char *) rq->wq + + ((char *) rq->wq->wq + (sizeof(struct ipath_rwqe) + - rq->max_sge * sizeof(struct ipath_sge)) * n); + rq->max_sge * sizeof(struct ib_sge)) * n); } /* @@ -415,6 +413,7 @@ struct ib_device ibdev; struct list_head dev_list; struct ipath_devdata *dd; + struct ipath_mmap_info *pending_mmaps; int ib_unit; /* This is the device number */ u16 sm_lid; /* in host order */ u8 sm_sl; @@ -577,7 +576,7 @@ int ipath_destroy_qp(struct ib_qp *ibqp); int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, - int attr_mask); + int attr_mask, struct ib_udata *udata); int ipath_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, struct ib_qp_init_attr *init_attr); @@ -636,7 +635,8 @@ struct ib_udata *udata); int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask); + enum ib_srq_attr_mask attr_mask, + struct ib_udata *udata); int ipath_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr); @@ -678,6 +678,10 @@ int ipath_dealloc_fmr(struct ib_fmr *ibfmr); +void ipath_release_mmap_info(struct kref *ref); + +int ipath_mmap(struct ib_ucontext *context, struct vm_area_struct *vma); + void ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev); void ipath_insert_rnr_queue(struct ipath_qp *qp); Index: src/linux-kernel/infiniband/hw/ipath/ipath_mmap.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_mmap.c (revision 0) +++ src/linux-kernel/infiniband/hw/ipath/ipath_mmap.c (revision 0) @@ -0,0 +1,147 @@ +/* + * Copyright (c) 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include + +#include "ipath_verbs.h" + +/** + * ipath_release_mmap_info - free mmap info structure + * @ref: a pointer to the kref within struct ipath_mmap_info + */ +void ipath_release_mmap_info(struct kref *ref) +{ + struct ipath_mmap_info *ip = + container_of(ref, struct ipath_mmap_info, ref); + + vfree(ip->obj); + kfree(ip); +} + +/* + * open and close keep track of how many times the CQ is mapped, + * to avoid releasing it. 
+ */ +static void ipath_vma_open(struct vm_area_struct *vma) +{ + struct ipath_mmap_info *ip = vma->vm_private_data; + + kref_get(&ip->ref); + ip->mmap_cnt++; +} + +static void ipath_vma_close(struct vm_area_struct *vma) +{ + struct ipath_mmap_info *ip = vma->vm_private_data; + + ip->mmap_cnt--; + kref_put(&ip->ref, ipath_release_mmap_info); +} + +/* + * ipath_vma_nopage - handle a VMA page fault. + */ +static struct page *ipath_vma_nopage(struct vm_area_struct *vma, + unsigned long address, int *type) +{ + struct ipath_mmap_info *ip = vma->vm_private_data; + unsigned long offset = address - vma->vm_start; + struct page *page = NOPAGE_SIGBUS; + void *pageptr; + + if (offset >= ip->size) + goto out; /* out of range */ + + /* + * Convert the vmalloc address into a struct page. + */ + pageptr = (void *)(offset + (vma->vm_pgoff << PAGE_SHIFT)); + page = vmalloc_to_page(pageptr); + + /* Increment the reference count. */ + get_page(page); + if (type) + *type = VM_FAULT_MINOR; +out: + return page; +} + +static struct vm_operations_struct ipath_vm_ops = { + .open = ipath_vma_open, + .close = ipath_vma_close, + .nopage = ipath_vma_nopage, +}; + +/** + * ipath_mmap - create a new mmap region + * @context: the IB user context of the process making the mmap() call + * @vma: the VMA to be initialized + * Return zero if the mmap is OK. Otherwise, return an errno. + */ +int ipath_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + struct ipath_ibdev *dev = to_idev(context->device); + unsigned long offset = vma->vm_pgoff << PAGE_SHIFT; + unsigned long size = vma->vm_end - vma->vm_start; + struct ipath_mmap_info *ip, **pp; + + /* + * Search the device's list of objects waiting for a mmap call. + * Normally, this list is very short since a call to create a + * CQ, QP, or SRQ is soon followed by a call to mmap(). 
+	 */
+	spin_lock_irq(&dev->pending_lock);
+	for (pp = &dev->pending_mmaps; (ip = *pp); pp = &ip->next) {
+		/* Only the creator is allowed to mmap the object */
+		if (context != ip->context || (void *) offset != ip->obj)
+			continue;
+		/* Don't allow a mmap larger than the object. */
+		if (size > ip->size)
+			break;
+
+		*pp = ip->next;
+		spin_unlock_irq(&dev->pending_lock);
+
+		vma->vm_ops = &ipath_vm_ops;
+		vma->vm_flags |= VM_RESERVED;
+		vma->vm_private_data = ip;
+		ipath_vma_open(vma);
+		return 0;
+	}
+	spin_unlock_irq(&dev->pending_lock);
+	return -EINVAL;
+}
Index: src/linux-kernel/infiniband/hw/ipath/ipath_cq.c
===================================================================
--- src/linux-kernel/infiniband/hw/ipath/ipath_cq.c	(revision 8021)
+++ src/linux-kernel/infiniband/hw/ipath/ipath_cq.c	(working copy)
@@ -41,20 +41,28 @@
  * @entry: work completion entry to add
  * @sig: true if @entry is a solicited entry
  *
- * This may be called with one of the qp->s_lock or qp->r_rq.lock held.
+ * This may be called with qp->s_lock held.
  */
 void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int solicited)
 {
+	struct ipath_cq_wc *wc = cq->queue;
 	unsigned long flags;
+	u32 head;
 	u32 next;
 
 	spin_lock_irqsave(&cq->lock, flags);
 
-	if (cq->head == cq->ibcq.cqe)
+	/*
+	 * Note that the head pointer might be writable by user processes.
+	 * Take care to verify it is a sane value.
+ */ + head = wc->head; + if (head >= (unsigned) cq->ibcq.cqe) { + head = cq->ibcq.cqe; next = 0; - else - next = cq->head + 1; - if (unlikely(next == cq->tail)) { + } else + next = head + 1; + if (unlikely(next == wc->tail)) { spin_unlock_irqrestore(&cq->lock, flags); if (cq->ibcq.event_handler) { struct ib_event ev; @@ -66,8 +74,8 @@ } return; } - cq->queue[cq->head] = *entry; - cq->head = next; + wc->queue[head] = *entry; + wc->head = next; if (cq->notify == IB_CQ_NEXT_COMP || (cq->notify == IB_CQ_SOLICITED && solicited)) { @@ -100,19 +108,20 @@ int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) { struct ipath_cq *cq = to_icq(ibcq); + struct ipath_cq_wc *wc = cq->queue; unsigned long flags; int npolled; spin_lock_irqsave(&cq->lock, flags); for (npolled = 0; npolled < num_entries; ++npolled, ++entry) { - if (cq->tail == cq->head) + if (wc->tail == wc->head) break; - *entry = cq->queue[cq->tail]; - if (cq->tail == cq->ibcq.cqe) - cq->tail = 0; + *entry = wc->queue[wc->tail]; + if (wc->tail >= cq->ibcq.cqe) + wc->tail = 0; else - cq->tail++; + wc->tail++; } spin_unlock_irqrestore(&cq->lock, flags); @@ -159,7 +168,7 @@ { struct ipath_ibdev *dev = to_idev(ibdev); struct ipath_cq *cq; - struct ib_wc *wc; + struct ipath_cq_wc *wc; struct ib_cq *ret; if (entries > ib_ipath_max_cqes) { @@ -172,10 +181,7 @@ goto bail; } - /* - * Need to use vmalloc() if we want to support large #s of - * entries. - */ + /* Allocate the completion queue structure. */ cq = kmalloc(sizeof(*cq), GFP_KERNEL); if (!cq) { ret = ERR_PTR(-ENOMEM); @@ -183,15 +189,54 @@ } /* - * Need to use vmalloc() if we want to support large #s of entries. + * Allocate the completion queue entries and head/tail pointers. + * This is allocated separately so that it can be resized and + * also mapped into user space. + * We need to use vmalloc() in order to support mmap and large + * numbers of entries. 
*/ - wc = vmalloc(sizeof(*wc) * (entries + 1)); + wc = vmalloc(sizeof(*wc) + sizeof(struct ib_wc) * entries); if (!wc) { - kfree(cq); ret = ERR_PTR(-ENOMEM); - goto bail; + goto free_cq; } + /* + * Return the address of the WC as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata) { + struct ipath_mmap_info *ip; + __u64 offset = (__u64) wc; + int err; + + err = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (err) { + ret = ERR_PTR(err); + goto free_wc; + } + + /* Allocate info for ipath_mmap(). */ + ip = kmalloc(sizeof(*ip), GFP_KERNEL); + if (!ip) { + ret = ERR_PTR(-ENOMEM); + goto free_wc; + } + cq->ip = ip; + ip->context = context; + ip->obj = wc; + kref_init(&ip->ref); + ip->mmap_cnt = 0; + ip->size = PAGE_ALIGN(sizeof(*wc) + + sizeof(struct ib_wc) * entries); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } else + cq->ip = NULL; + + /* * ib_create_cq() will initialize cq->ibcq except for cq->ibcq.cqe. * The number of entries should be >= the number requested or return * an error. @@ -201,14 +246,18 @@ cq->triggered = 0; spin_lock_init(&cq->lock); tasklet_init(&cq->comptask, send_complete, (unsigned long)cq); - cq->head = 0; - cq->tail = 0; + wc->head = 0; + wc->tail = 0; cq->queue = wc; ret = &cq->ibcq; - dev->n_cqs_allocated++; + goto bail; +free_wc: + vfree(wc); +free_cq: + kfree(cq); bail: return ret; } @@ -228,7 +277,10 @@ tasklet_kill(&cq->comptask); dev->n_cqs_allocated--; - vfree(cq->queue); + if (cq->ip) + kref_put(&cq->ip->ref, ipath_release_mmap_info); + else + vfree(cq->queue); kfree(cq); return 0; @@ -252,7 +304,7 @@ spin_lock_irqsave(&cq->lock, flags); /* * Don't change IB_CQ_NEXT_COMP to IB_CQ_SOLICITED but allow - * any other transitions. + * any other transitions (see C11-31 and C11-32 in ch. 11.4.2.2). 
*/ if (cq->notify != IB_CQ_NEXT_COMP) cq->notify = notify; @@ -263,46 +315,81 @@ int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata) { struct ipath_cq *cq = to_icq(ibcq); - struct ib_wc *wc, *old_wc; - u32 n; + struct ipath_cq_wc *old_wc = cq->queue; + struct ipath_cq_wc *wc; + u32 head, tail, n; int ret; /* * Need to use vmalloc() if we want to support large #s of entries. */ - wc = vmalloc(sizeof(*wc) * (cqe + 1)); + wc = vmalloc(sizeof(*wc) + sizeof(struct ib_wc) * cqe); if (!wc) { ret = -ENOMEM; goto bail; } + /* + * Return the address of the WC as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata) { + __u64 offset = (__u64) wc; + + ret = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (ret) + goto bail; + } + spin_lock_irq(&cq->lock); - if (cq->head < cq->tail) - n = cq->ibcq.cqe + 1 + cq->head - cq->tail; + /* + * Make sure head and tail are sane since they + * might be user writable. + */ + head = old_wc->head; + if (head > (u32) cq->ibcq.cqe) + head = (u32) cq->ibcq.cqe; + tail = old_wc->tail; + if (tail > (u32) cq->ibcq.cqe) + tail = (u32) cq->ibcq.cqe; + if (head < tail) + n = cq->ibcq.cqe + 1 + head - tail; else - n = cq->head - cq->tail; + n = head - tail; if (unlikely((u32)cqe < n)) { spin_unlock_irq(&cq->lock); vfree(wc); ret = -EOVERFLOW; goto bail; } - for (n = 0; cq->tail != cq->head; n++) { - wc[n] = cq->queue[cq->tail]; - if (cq->tail == cq->ibcq.cqe) - cq->tail = 0; + for (n = 0; tail != head; n++) { + wc->queue[n] = old_wc->queue[tail]; + if (tail == (u32) cq->ibcq.cqe) + tail = 0; else - cq->tail++; + tail++; } cq->ibcq.cqe = cqe; - cq->head = n; - cq->tail = 0; - old_wc = cq->queue; + wc->head = n; + wc->tail = 0; cq->queue = wc; spin_unlock_irq(&cq->lock); vfree(old_wc); + if (cq->ip) { + struct ipath_ibdev *dev = to_idev(ibcq->device); + struct ipath_mmap_info *ip = cq->ip; + + ip->obj = wc; + ip->size = PAGE_ALIGN(sizeof(*wc) + + sizeof(struct ib_wc) * cqe); + 
spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } + ret = 0; bail: Index: src/linux-kernel/infiniband/hw/ipath/ipath_srq.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_srq.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_srq.c (working copy) @@ -47,66 +47,38 @@ struct ib_recv_wr **bad_wr) { struct ipath_srq *srq = to_isrq(ibsrq); - struct ipath_ibdev *dev = to_idev(ibsrq->device); + struct ipath_rwq *wq; unsigned long flags; int ret; for (; wr; wr = wr->next) { struct ipath_rwqe *wqe; u32 next; - int i, j; + int i; - if (wr->num_sge > srq->rq.max_sge) { + if ((unsigned) wr->num_sge > srq->rq.max_sge) { *bad_wr = wr; ret = -ENOMEM; goto bail; } spin_lock_irqsave(&srq->rq.lock, flags); - next = srq->rq.head + 1; + wq = srq->rq.wq; + next = wq->head + 1; if (next >= srq->rq.size) next = 0; - if (next == srq->rq.tail) { + if (next == wq->tail) { spin_unlock_irqrestore(&srq->rq.lock, flags); *bad_wr = wr; ret = -ENOMEM; goto bail; } - wqe = get_rwqe_ptr(&srq->rq, srq->rq.head); + wqe = get_rwqe_ptr(&srq->rq, wq->head); wqe->wr_id = wr->wr_id; - wqe->sg_list[0].mr = NULL; - wqe->sg_list[0].vaddr = NULL; - wqe->sg_list[0].length = 0; - wqe->sg_list[0].sge_length = 0; - wqe->length = 0; - for (i = 0, j = 0; i < wr->num_sge; i++) { - /* Check LKEY */ - if (to_ipd(srq->ibsrq.pd)->user && - wr->sg_list[i].lkey == 0) { - spin_unlock_irqrestore(&srq->rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - if (wr->sg_list[i].length == 0) - continue; - if (!ipath_lkey_ok(&dev->lk_table, - &wqe->sg_list[j], - &wr->sg_list[i], - IB_ACCESS_LOCAL_WRITE)) { - spin_unlock_irqrestore(&srq->rq.lock, - flags); - *bad_wr = wr; - ret = -EINVAL; - goto bail; - } - wqe->length += wr->sg_list[i].length; - j++; - } - wqe->num_sge = j; - srq->rq.head = next; + for (i = 0; i < wr->num_sge; i++) + 
wqe->sg_list[i] = wr->sg_list[i]; + wqe->num_sge = wr->num_sge; + wq->head = next; spin_unlock_irqrestore(&srq->rq.lock, flags); } ret = 0; @@ -156,28 +128,67 @@ * Need to use vmalloc() if we want to support large #s of entries. */ srq->rq.size = srq_init_attr->attr.max_wr + 1; - sz = sizeof(struct ipath_sge) * srq_init_attr->attr.max_sge + + srq->rq.max_sge = srq_init_attr->attr.max_sge; + sz = sizeof(struct ib_sge) * srq->rq.max_sge + sizeof(struct ipath_rwqe); - srq->rq.wq = vmalloc(srq->rq.size * sz); + srq->rq.wq = vmalloc(sizeof(struct ipath_rwq) + srq->rq.size * sz); if (!srq->rq.wq) { - kfree(srq); ret = ERR_PTR(-ENOMEM); - goto bail; + goto free_srq; } /* + * Return the address of the RWQ as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata) { + struct ipath_mmap_info *ip; + __u64 offset = (__u64) srq->rq.wq; + int err; + + err = ib_copy_to_udata(udata, &offset, sizeof(offset)); + if (err) { + ret = ERR_PTR(err); + goto free_rwq; + } + + /* Allocate info for ipath_mmap(). */ + ip = kmalloc(sizeof(*ip), GFP_KERNEL); + if (!ip) { + ret = ERR_PTR(-ENOMEM); + goto free_rwq; + } + srq->ip = ip; + ip->context = ibpd->uobject->context; + ip->obj = srq->rq.wq; + kref_init(&ip->ref); + ip->mmap_cnt = 0; + ip->size = PAGE_ALIGN(sizeof(struct ipath_rwq) + + srq->rq.size * sz); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } else + srq->ip = NULL; + + /* * ib_create_srq() will initialize srq->ibsrq. 
*/ spin_lock_init(&srq->rq.lock); - srq->rq.head = 0; - srq->rq.tail = 0; - srq->rq.max_sge = srq_init_attr->attr.max_sge; + srq->rq.wq->head = 0; + srq->rq.wq->tail = 0; srq->limit = srq_init_attr->attr.srq_limit; + dev->n_srqs_allocated++; + ret = &srq->ibsrq; + goto bail; - dev->n_srqs_allocated++; - +free_rwq: + vfree(srq->rq.wq); +free_srq: + kfree(srq); bail: return ret; } @@ -187,83 +198,137 @@ * @ibsrq: the SRQ to modify * @attr: the new attributes of the SRQ * @attr_mask: indicates which attributes to modify + * @udata: user data for ipathverbs.so */ int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, - enum ib_srq_attr_mask attr_mask) + enum ib_srq_attr_mask attr_mask, + struct ib_udata *udata) { struct ipath_srq *srq = to_isrq(ibsrq); - unsigned long flags; - int ret; + int ret = 0; - if (attr_mask & IB_SRQ_MAX_WR) + if (attr_mask & IB_SRQ_MAX_WR) { + struct ipath_rwq *owq; + struct ipath_rwq *wq; + struct ipath_rwqe *p; + u32 sz, size, n, head, tail; + + /* + * Check that the requested sizes are below the limits + * and that user/kernel SRQs are only resized by the + * user/kernel. 
+ */ if ((attr->max_wr > ib_ipath_max_srq_wrs) || - (attr->max_sge > srq->rq.max_sge)) { + (!udata != !srq->ip) || + ((attr_mask & IB_SRQ_LIMIT) && + attr->srq_limit > attr->max_wr) || + (!(attr_mask & IB_SRQ_LIMIT) && + srq->limit > attr->max_wr)) { ret = -EINVAL; goto bail; } - if (attr_mask & IB_SRQ_LIMIT) - if (attr->srq_limit >= srq->rq.size) { - ret = -EINVAL; - goto bail; - } - - if (attr_mask & IB_SRQ_MAX_WR) { - struct ipath_rwqe *wq, *p; - u32 sz, size, n; - sz = sizeof(struct ipath_rwqe) + - attr->max_sge * sizeof(struct ipath_sge); + srq->rq.max_sge * sizeof(struct ib_sge); size = attr->max_wr + 1; - wq = vmalloc(size * sz); + wq = vmalloc(sizeof(struct ipath_rwq) + size * sz); if (!wq) { ret = -ENOMEM; goto bail; } - spin_lock_irqsave(&srq->rq.lock, flags); - if (srq->rq.head < srq->rq.tail) - n = srq->rq.size + srq->rq.head - srq->rq.tail; + /* + * Return the address of the RWQ as the offset to mmap. + * See ipath_mmap() for details. + */ + if (udata) { + __u64 offset_addr; + __u64 offset = (__u64) wq; + + ret = ib_copy_from_udata(&offset_addr, udata, + sizeof(offset_addr)); + if (ret) { + vfree(wq); + goto bail; + } + udata->outbuf = (void __user *) offset_addr; + ret = ib_copy_to_udata(udata, &offset, + sizeof(offset)); + if (ret) { + vfree(wq); + goto bail; + } + } + + spin_lock_irq(&srq->rq.lock); + /* + * validate head pointer value and compute + * the number of remaining WQEs. 
+ */ + owq = srq->rq.wq; + head = owq->head; + if (head >= srq->rq.size) + head = 0; + tail = owq->tail; + if (tail >= srq->rq.size) + tail = 0; + n = head; + if (n < tail) + n += srq->rq.size - tail; else - n = srq->rq.head - srq->rq.tail; - if (size <= n || size <= srq->limit) { - spin_unlock_irqrestore(&srq->rq.lock, flags); + n -= tail; + if (size <= n) { + spin_unlock_irq(&srq->rq.lock); vfree(wq); ret = -EINVAL; goto bail; } n = 0; - p = wq; - while (srq->rq.tail != srq->rq.head) { + p = wq->wq; + while (tail != head) { struct ipath_rwqe *wqe; int i; - wqe = get_rwqe_ptr(&srq->rq, srq->rq.tail); + wqe = get_rwqe_ptr(&srq->rq, tail); p->wr_id = wqe->wr_id; - p->length = wqe->length; p->num_sge = wqe->num_sge; for (i = 0; i < wqe->num_sge; i++) p->sg_list[i] = wqe->sg_list[i]; n++; p = (struct ipath_rwqe *)((char *) p + sz); - if (++srq->rq.tail >= srq->rq.size) - srq->rq.tail = 0; + if (++tail >= srq->rq.size) + tail = 0; } - vfree(srq->rq.wq); srq->rq.wq = wq; srq->rq.size = size; - srq->rq.head = n; - srq->rq.tail = 0; - srq->rq.max_sge = attr->max_sge; - spin_unlock_irqrestore(&srq->rq.lock, flags); - } + wq->head = n; + wq->tail = 0; + if (attr_mask & IB_SRQ_LIMIT) + srq->limit = attr->srq_limit; + spin_unlock_irq(&srq->rq.lock); - if (attr_mask & IB_SRQ_LIMIT) { - spin_lock_irqsave(&srq->rq.lock, flags); - srq->limit = attr->srq_limit; - spin_unlock_irqrestore(&srq->rq.lock, flags); + vfree(owq); + + if (srq->ip) { + struct ipath_mmap_info *ip = srq->ip; + struct ipath_ibdev *dev = to_idev(srq->ibsrq.device); + + ip->obj = wq; + ip->size = PAGE_ALIGN(sizeof(struct ipath_rwq) + + size * sz); + spin_lock_irq(&dev->pending_lock); + ip->next = dev->pending_mmaps; + dev->pending_mmaps = ip; + spin_unlock_irq(&dev->pending_lock); + } + } else if (attr_mask & IB_SRQ_LIMIT) { + spin_lock_irq(&srq->rq.lock); + if (attr->srq_limit >= srq->rq.size) + ret = -EINVAL; + else + srq->limit = attr->srq_limit; + spin_unlock_irq(&srq->rq.lock); } - ret = 0; bail: return 
ret; @@ -289,7 +354,10 @@ struct ipath_ibdev *dev = to_idev(ibsrq->device); dev->n_srqs_allocated--; - vfree(srq->rq.wq); + if (srq->ip) + kref_put(&srq->ip->ref, ipath_release_mmap_info); + else + vfree(srq->rq.wq); kfree(srq); return 0; Index: src/linux-kernel/infiniband/hw/ipath/ipath_ud.c =================================================================== --- src/linux-kernel/infiniband/hw/ipath/ipath_ud.c (revision 8021) +++ src/linux-kernel/infiniband/hw/ipath/ipath_ud.c (working copy) @@ -35,6 +35,53 @@ #include "ipath_verbs.h" #include "ips_common.h" +static int init_sge(struct ipath_qp *qp, struct ipath_rwqe *wqe, + u32 *lengthp, struct ipath_sge_state *ss) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + int user = to_ipd(qp->ibqp.pd)->user; + int i, j, ret; + struct ib_wc wc; + + *lengthp = 0; + for (i = j = 0; i < wqe->num_sge; i++) { + if (wqe->sg_list[i].length == 0) + continue; + /* Check LKEY */ + if ((user && wqe->sg_list[i].lkey == 0) || + !ipath_lkey_ok(&dev->lk_table, + j ? &ss->sg_list[j - 1] : &ss->sge, + &wqe->sg_list[i], IB_ACCESS_LOCAL_WRITE)) + goto bad_lkey; + *lengthp += wqe->sg_list[i].length; + j++; + } + ss->num_sge = j; + ret = 1; + goto bail; + +bad_lkey: + wc.wr_id = wqe->wr_id; + wc.status = IB_WC_LOC_PROT_ERR; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = 0; + wc.imm_data = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = 0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = 0; + wc.sl = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + /* Signal solicited completion event. */ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + ret = 0; +bail: + return ret; +} + /** * ipath_ud_loopback - handle send on loopback QPs * @sqp: the QP @@ -45,6 +92,8 @@ * * This is called from ipath_post_ud_send() to forward a WQE addressed * to the same HCA. + * Note that the receive interrupt handler may be calling ipath_ud_rcv() + * while this is being called. 
*/ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss, @@ -59,7 +108,11 @@ struct ipath_srq *srq; struct ipath_sge_state rsge; struct ipath_sge *sge; + struct ipath_rwq *wq; struct ipath_rwqe *wqe; + void (*handler)(struct ib_event *, void *); + u32 tail; + u32 rlen; qp = ipath_lookup_qpn(&dev->qp_table, wr->wr.ud.remote_qpn); if (!qp) @@ -93,6 +146,13 @@ wc->imm_data = 0; } + if (wr->num_sge > 1) { + rsge.sg_list = kmalloc((wr->num_sge - 1) * + sizeof(struct ipath_sge), + GFP_ATOMIC); + } else + rsge.sg_list = NULL; + /* * Get the next work request entry to find where to put the data. * Note that it is safe to drop the lock after changing rq->tail @@ -100,37 +160,52 @@ */ if (qp->ibqp.srq) { srq = to_isrq(qp->ibqp.srq); + handler = srq->ibsrq.event_handler; rq = &srq->rq; } else { srq = NULL; + handler = NULL; rq = &qp->r_rq; } + spin_lock_irqsave(&rq->lock, flags); - if (rq->tail == rq->head) { - spin_unlock_irqrestore(&rq->lock, flags); - dev->n_pkt_drops++; - goto done; + wq = rq->wq; + tail = wq->tail; + while (1) { + if (unlikely(tail == wq->head)) { + spin_unlock_irqrestore(&rq->lock, flags); + dev->n_pkt_drops++; + goto free_sge; + } + wqe = get_rwqe_ptr(rq, tail); + if (++tail >= rq->size) + tail = 0; + if (init_sge(qp, wqe, &rlen, &rsge)) + break; + wq->tail = tail; } /* Silently drop packets which are too big. */ - wqe = get_rwqe_ptr(rq, rq->tail); - if (wc->byte_len > wqe->length) { + if (wc->byte_len > rlen) { spin_unlock_irqrestore(&rq->lock, flags); dev->n_pkt_drops++; - goto done; + goto free_sge; } + wq->tail = tail; wc->wr_id = wqe->wr_id; - rsge.sge = wqe->sg_list[0]; - rsge.sg_list = wqe->sg_list + 1; - rsge.num_sge = wqe->num_sge; - if (++rq->tail >= rq->size) - rq->tail = 0; - if (srq && srq->ibsrq.event_handler) { + if (handler) { u32 n; - if (rq->head < rq->tail) - n = rq->size + rq->head - rq->tail; + /* + * validate head pointer value and compute + * the number of remaining WQEs. 
+ */ + n = wq->head; + if (n >= rq->size) + n = 0; + if (n < tail) + n += rq->size - tail; else - n = rq->head - rq->tail; + n -= tail; if (n < srq->limit) { struct ib_event ev; @@ -139,12 +214,12 @@ ev.device = qp->ibqp.device; ev.element.srq = qp->ibqp.srq; ev.event = IB_EVENT_SRQ_LIMIT_REACHED; - srq->ibsrq.event_handler(&ev, - srq->ibsrq.srq_context); + handler(&ev, srq->ibsrq.srq_context); } else spin_unlock_irqrestore(&rq->lock, flags); } else spin_unlock_irqrestore(&rq->lock, flags); + ah_attr = &to_iah(wr->wr.ud.ah)->attr; if (ah_attr->ah_flags & IB_AH_GRH) { ipath_copy_sge(&rsge, &ah_attr->grh, sizeof(struct ib_grh)); @@ -195,6 +270,8 @@ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), wc, wr->send_flags & IB_SEND_SOLICITED); +free_sge: + kfree(rsge.sg_list); done: if (atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); @@ -432,13 +509,9 @@ int opcode; u32 hdrsize; u32 pad; - unsigned long flags; struct ib_wc wc; u32 qkey; u32 src_qp; - struct ipath_rq *rq; - struct ipath_srq *srq; - struct ipath_rwqe *wqe; u16 dlid; int header_in_data; @@ -546,19 +619,10 @@ /* * Get the next work request entry to find where to put the data. - * Note that it is safe to drop the lock after changing rq->tail - * since ipath_post_receive() won't fill the empty slot. */ - if (qp->ibqp.srq) { - srq = to_isrq(qp->ibqp.srq); - rq = &srq->rq; - } else { - srq = NULL; - rq = &qp->r_rq; - } - spin_lock_irqsave(&rq->lock, flags); - if (rq->tail == rq->head) { - spin_unlock_irqrestore(&rq->lock, flags); + if (qp->r_reuse_sge) + qp->r_reuse_sge = 0; + else if (!ipath_get_rwqe(qp, 0)) { /* * Count VL15 packets dropped due to no receive buffer. * Otherwise, count them as buffer overruns since usually, @@ -572,39 +636,11 @@ goto bail; } /* Silently drop packets which are too big. 
*/ - wqe = get_rwqe_ptr(rq, rq->tail); - if (wc.byte_len > wqe->length) { - spin_unlock_irqrestore(&rq->lock, flags); + if (wc.byte_len > qp->r_len) { + qp->r_reuse_sge = 1; dev->n_pkt_drops++; goto bail; } - wc.wr_id = wqe->wr_id; - qp->r_sge.sge = wqe->sg_list[0]; - qp->r_sge.sg_list = wqe->sg_list + 1; - qp->r_sge.num_sge = wqe->num_sge; - if (++rq->tail >= rq->size) - rq->tail = 0; - if (srq && srq->ibsrq.event_handler) { - u32 n; - - if (rq->head < rq->tail) - n = rq->size + rq->head - rq->tail; - else - n = rq->head - rq->tail; - if (n < srq->limit) { - struct ib_event ev; - - srq->limit = 0; - spin_unlock_irqrestore(&rq->lock, flags); - ev.device = qp->ibqp.device; - ev.element.srq = qp->ibqp.srq; - ev.event = IB_EVENT_SRQ_LIMIT_REACHED; - srq->ibsrq.event_handler(&ev, - srq->ibsrq.srq_context); - } else - spin_unlock_irqrestore(&rq->lock, flags); - } else - spin_unlock_irqrestore(&rq->lock, flags); if (has_grh) { ipath_copy_sge(&qp->r_sge, &hdr->u.l.grh, sizeof(struct ib_grh)); @@ -613,6 +649,7 @@ ipath_skip_sge(&qp->r_sge, sizeof(struct ib_grh)); ipath_copy_sge(&qp->r_sge, data, wc.byte_len - sizeof(struct ib_grh)); + wc.wr_id = qp->r_wr_id; wc.status = IB_WC_SUCCESS; wc.opcode = IB_WC_RECV; wc.vendor_err = 0; -- Ralph Campbell From amit_byron at yahoo.com Mon Jun 19 17:36:46 2006 From: amit_byron at yahoo.com (Amit Byron) Date: Tue, 20 Jun 2006 00:36:46 +0000 (UTC) Subject: [openib-general] =?utf-8?q?ib=5Fgid_lookup?= Message-ID: hello, i'm trying to find whether i can do a lookup of ib_gid by either node name or node's ip address. is this information available from the subnet manager? thanks, Amit. 
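Stepping back to Ralph Campbell's ipath patch above: once the CQ ring and its head/tail indices live in memory that is mmap()ed writable into user space, the kernel can no longer trust those indices, so ipath_cq_enter() and ipath_resize_cq() clamp them before use. A minimal user-space sketch of that pattern follows; the helper names are hypothetical, not part of the driver:

```c
#include <assert.h>
#include <stdint.h>

/* Clamp an index read from untrusted shared memory to the valid range.
 * 'cqe' is the largest legal index; the ring holds cqe + 1 slots. */
static uint32_t sanitize_index(uint32_t idx, uint32_t cqe)
{
    return idx > cqe ? cqe : idx;
}

/* Advance a ring index, wrapping from cqe back to slot 0, as the
 * patched ipath_cq_enter() does when computing 'next'. */
static uint32_t advance_index(uint32_t idx, uint32_t cqe)
{
    return idx >= cqe ? 0 : idx + 1;
}
```

The property this buys is that a hostile value in the shared head or tail can at worst drop or overwrite a completion entry; it can never steer a store outside the ring's bounds in kernel memory.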
From rjwalsh at pathscale.com Mon Jun 19 18:34:44 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Mon, 19 Jun 2006 18:34:44 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: References: <20060613051149.GE4621@mellanox.co.il> <1150223140.11881.2.camel@hematite.internal.keyresearch.com> Message-ID: <1150767284.15618.59.camel@hematite.internal.keyresearch.com> On Fri, 2006-06-16 at 15:07 -0700, Roland Dreier wrote: > Robert, can you confirm that the new uverbs locking scheme helps the > performance problems you're having? Yup - that was a big help. Thanks! Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From rdreier at cisco.com Mon Jun 19 20:46:23 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 Jun 2006 20:46:23 -0700 Subject: [openib-general] [RFC] [PATCH] IB/uverbs: Don't serialize with ib_uverbs_idr_mutex In-Reply-To: <1150767284.15618.59.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Mon, 19 Jun 2006 18:34:44 -0700") References: <20060613051149.GE4621@mellanox.co.il> <1150223140.11881.2.camel@hematite.internal.keyresearch.com> <1150767284.15618.59.camel@hematite.internal.keyresearch.com> Message-ID: > > Robert, can you confirm that the new uverbs locking scheme helps the > > performance problems you're having? > Yup - that was a big help. Thanks! Good, because it's upstream now... 
From panda at cse.ohio-state.edu Mon Jun 19 21:06:33 2006 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Tue, 20 Jun 2006 00:06:33 -0400 (EDT) Subject: [openib-general] MVAPICH and librdmacm In-Reply-To: <1150738038.26165.5.camel@stevo-desktop> from "Steve Wise" at Jun 19, 2006 12:27:18 PM Message-ID: <200606200406.k5K46XcM029580@xi.cse.ohio-state.edu> Steve, > Anybody working on porting the MVAPICH code to use the RDMA CM for > connection setup? Just wondering how much work is needed to make > MVAPICH run on the iwarp devices. Thanks for your note. We are taking a look at these issues. During the next few weeks, we will have a complete understanding of these issues and plan to make some concrete progress. We will keep you and the OpenIB community updated about our progress. Thanks, DK > Thanks, > > > Steve. > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Mon Jun 19 21:27:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 Jun 2006 21:27:09 -0700 Subject: [openib-general] iSER updates Message-ID: Linus pulled my for-2.6.18 branch today, so the RDMA CM is upstream now. He still has not pulled scsi-misc-2.6 so AFAIK there is still more required before we can merge iSER. However, I had to add the patch below to keep iSER compiling in my tree, since <scsi/scsi_request.h> is now gone from the upstream kernel. Also, I notice that there seem to be a few changes to the svn tree that you have not sent to me for my git tree (r7784 and r7924 at least). Please send me a patch if there are updates to my git tree that you would like before it gets merged... - R. 
diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c index c051b65..7099e94 100644 --- a/drivers/infiniband/ulp/iser/iscsi_iser.c +++ b/drivers/infiniband/ulp/iser/iscsi_iser.c @@ -68,7 +68,6 @@ #include #include #include #include -#include #include #include #include From ogerlitz at voltaire.com Mon Jun 19 21:54:25 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 Jun 2006 07:54:25 +0300 Subject: [openib-general] iSER updates In-Reply-To: References: Message-ID: <44977F81.9080206@voltaire.com> Roland Dreier wrote: > Linus pulled my for-2.6.18 branch today, so the RDMA CM is upstream > now. He still has not pulled scsi-misc-2.6 so AFAIK there is still > more required before we can merge iSER. > > However, I had to add the patch below to keep iSER compiling in my > tree, since <scsi/scsi_request.h> is now gone from the upstream > kernel. Also, I notice that there seem to be a few changes to the svn > tree that you have not sent to me for my git tree (r7784 and r7924 at > least). Please send me a patch if there are updates to my git tree > that you would like before it gets merged... I was aware that scsi/scsi_request.h was killed by Christoph, but as iSER kept compiling under my copy of your tree with James's tree pulled into it, I had not noticed the breakage... I guess the reason for that was an update by James which I missed. Anyway, thanks for catching that; I will update the SVN. As for the two updates in the SVN since my last patches were sent to you, these are two bug fixes which can go to 2.6.18-rc2, but as I understand it's fine with you to add them into what's pushed for 2.6.18-rc1, I will send them today. Or. 
> diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c > b/drivers/infiniband/ulp/iser/iscsi_iser.c > index c051b65..7099e94 100644 > --- a/drivers/infiniband/ulp/iser/iscsi_iser.c > +++ b/drivers/infiniband/ulp/iser/iscsi_iser.c > @@ -68,7 +68,6 @@ #include > #include > #include > #include > -#include > #include > #include > #include > From krkumar2 at in.ibm.com Mon Jun 19 22:22:19 2006 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Tue, 20 Jun 2006 10:52:19 +0530 Subject: [openib-general] [PATCH] Remove redundant uninitialized warning Message-ID: This removes a compile warning : "is_ud might be used uninitialized in this function". Signed-off-by: Krishna Kumar --- diff -ruNp 1/core/uverbs_cmd.c 2/core/uverbs_cmd.c --- 1/core/uverbs_cmd.c 2006-06-20 10:14:46.000000000 +0530 +++ 2/core/uverbs_cmd.c 2006-06-20 10:23:50.000000000 +0530 @@ -1530,7 +1530,6 @@ ssize_t ib_uverbs_post_send(struct ib_uv out_put: put_qp_read(qp); -out: while (wr) { if (is_ud && wr->wr.ud.ah) put_ah_read(wr->wr.ud.ah); @@ -1539,6 +1538,7 @@ out: wr = next; } +out: kfree(user_wr); return ret ? ret : in_len; -------------- next part -------------- A non-text attachment was scrubbed... Name: diff. 
Type: application/octet-stream Size: 449 bytes Desc: not available URL: From rdreier at cisco.com Mon Jun 19 23:31:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 Jun 2006 23:31:58 -0700 Subject: [openib-general] [PATCH] Remove redundant uninitialized warning In-Reply-To: (Krishna Kumar2's message of "Tue, 20 Jun 2006 10:52:19 +0530") References: Message-ID: Thanks, applied and queued for 2.6.18 From ogerlitz at voltaire.com Tue Jun 20 00:09:08 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 Jun 2006 10:09:08 +0300 Subject: [openib-general] iSER updates In-Reply-To: <44977F81.9080206@voltaire.com> References: <44977F81.9080206@voltaire.com> Message-ID: <44979F14.9000905@voltaire.com> Or Gerlitz wrote: > Roland Dreier wrote: >> However, I had to add the patch below to keep iSER compiling in my >> tree, since <scsi/scsi_request.h> is now gone from the upstream >> kernel. I see that the patch is applied at the for-mm branch but not at the iser branch, is it fine? Or. From ogerlitz at voltaire.com Tue Jun 20 00:23:25 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 Jun 2006 10:23:25 +0300 Subject: [openib-general] dapltest gets segfaulted in librdmacm init In-Reply-To: <4496E9EF.1090607@ichips.intel.com> References: <4496E9EF.1090607@ichips.intel.com> Message-ID: <4497A26D.9090606@voltaire.com> Arlin Davis wrote: > Or Gerlitz wrote: > >> After fixing the ucma/port space issue with the calls to rdma_create_id i >> am now trying to run >> >> $ ./Target/dapltest -T S -D OpenIB-cma >> >> and getting an immediate segfault with the below trace, any idea? >> >> > Hmm, no idea. I just updated to 8112 and everything runs fine for me > (2.6.17). OK, sorry, I suspect I had some inconsistency between libibverbs and libmthca; after recompiling & installing them, things now work fine. Or. 
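The uverbs post_send patch above works by moving the `out:` label below the cleanup loop, so that the early error path (taken before `is_ud` is assigned) jumps past the loop instead of running it with an uninitialized flag. A hypothetical stand-alone sketch of that control-flow shape, with invented names, might look like:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the label reordering: on the early exit path, execution
 * must bypass the cleanup loop, because that loop reads state (is_ud
 * here) that is only meaningful on the normal path. */
struct fake_wr { struct fake_wr *next; int has_ah; };

static int drain_list(struct fake_wr *list, int is_ud)
{
    int puts = 0;

    if (list == NULL)
        goto out;               /* early exit: the loop never runs */

    /* cleanup loop, analogous to the put_ah_read() loop in post_send */
    while (list) {
        if (is_ud && list->has_ah)
            puts++;
        list = list->next;
    }
out:
    return puts;
}
```

With the label placed after the loop, the early `goto out` can no longer reach code that depends on `is_ud`, which is exactly why the compiler warning disappears without initializing the variable.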
From ogerlitz at voltaire.com Tue Jun 20 00:25:55 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 Jun 2006 10:25:55 +0300 Subject: [openib-general] dapltest gets segfaulted in librdmacm init In-Reply-To: References: Message-ID: <4497A303.8090806@voltaire.com> James Lentini wrote: > I don't see this. > > The gdb sharedlibrary output looks suspicious. /usr/local/ib isn't a > standard path for our binaries. I have added --prefix=/usr/local/ib to the configure input; we do it all the time to test multiple things. Or. From tziporet at mellanox.co.il Tue Jun 20 01:47:48 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 20 Jun 2006 11:47:48 +0300 Subject: [openib-general] OFED 1.0-pre 1 build issues. In-Reply-To: References: <1150324203.10676.17.camel@chalcedony.pathscale.com> Message-ID: <4497B634.2070704@mellanox.co.il> Paul wrote: > Michael, > I performed the same work-around in bash (not so good with perl > these days) and it gets past the prior point. Thanks. Should something > that takes care of this be included in the build.sh or build_env.sh > scripts? We would certainly need it covered in the docs at least. > > Now the build is dying on some undefined references. (log attached) > > Regards. > I will ask Vlad to look into it. 
Tziporet From tziporet at mellanox.co.il Tue Jun 20 01:57:08 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 20 Jun 2006 11:57:08 +0300 Subject: [openib-general] MVAPICH failure on SGI Altix SLES10 In-Reply-To: <44930BAA.6030300@sgi.com> References: <44930BAA.6030300@sgi.com> Message-ID: <4497B864.7040107@mellanox.co.il> John Partridge wrote: > I am trying to run the example from MPI_README.txt (and other MPI apps > like pallas), but I keep getting a Couldn't modify SRQ limit error > message :- > > mig129:~/OFED-1.0-pre1 # > /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.1.0/bin/mpirun_rsh -rsh -np 2 > -hostfile /root/cluster > /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.1.0/tests/osutests-1.0/bw 1000 16 > [1] Abort: Couldn't modify SRQ limit > at line 995 in file viainit.c > mpirun_rsh: Abort signaled from [1] > [0] Abort: [mig125:0] Got completion with error, code=12 > at line 2143 in file viacheck.c > done. > > I am using OFED-1.0-pre1 (kernel modules are from OFED-1.0-pre1 also) > OS is SLES10 SUSE Linux Enterprise Server 10 (ia64) VERSION = 10 > > HW is SGI Altix ia64 > > Can anyone help please ? > > Thanks > John > > I guess you use older FW version. See in osu_mpi release notes: - For users of Mellanox Technologies firmware fw-23108 or fw-25208 only: OSU MPI may fail in its default configuration if your HCA is burnt with an fw-23108 version that is earlier than 3.4.000, or with an fw-25208 version 4.7.400 or earlier. Workaround: Option 1 - Update the firmware. 
Option 2 - In mvapich.conf, set VIADEV_SRQ_ENABLE=0 Tziporet From ogerlitz at voltaire.com Tue Jun 20 02:33:49 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 Jun 2006 12:33:49 +0300 (IDT) Subject: [openib-general] [PATCH 1/2] IB/iser: don't access sc->request_buffer when sc->request_bufflen is zero Message-ID: calling scsi_init_one on sc->request_buffer when sc->request_bufflen is zero is unsafe Signed-off-by: Or Gerlitz Index: infiniband-git/drivers/infiniband/ulp/iser/iser_initiator.c =================================================================== --- infiniband-git.orig/drivers/infiniband/ulp/iser/iser_initiator.c 2006-06-20 12:26:17.000000000 +0300 +++ infiniband-git/drivers/infiniband/ulp/iser/iser_initiator.c 2006-06-20 12:27:42.000000000 +0300 @@ -391,7 +391,8 @@ if (sc->use_sg) { /* using a scatter list */ data_buf->buf = sc->request_buffer; data_buf->size = sc->use_sg; - } else { /* using a single buffer - convert it into one entry SG */ + } else if (sc->request_bufflen) { + /* using a single buffer - convert it into one entry SG */ sg_init_one(&data_buf->sg_single, sc->request_buffer, sc->request_bufflen); data_buf->buf = &data_buf->sg_single; From ogerlitz at voltaire.com Tue Jun 20 02:35:51 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 Jun 2006 12:35:51 +0300 (IDT) Subject: [openib-general] [PATCH 2/2] IB/iser: bugfix for the reconnect flow In-Reply-To: References: Message-ID: for iscsi reconnect flow the sequence of calls would be conn stop/bind/start i.e conn create is not called; fixed the post receive code to take that into account, also moved setting conn->recv_lock into conn bind which is called for both connect and reconnect flows. 
Signed-off-by: Erez Zilber Signed-off-by: Or Gerlitz Index: infiniband-git/drivers/infiniband/ulp/iser/iscsi_iser.c =================================================================== --- infiniband-git.orig/drivers/infiniband/ulp/iser/iscsi_iser.c 2006-06-20 12:27:42.000000000 +0300 +++ infiniband-git/drivers/infiniband/ulp/iser/iscsi_iser.c 2006-06-20 12:28:14.000000000 +0300 @@ -311,8 +311,6 @@ /* currently this is the only field which need to be initiated */ rwlock_init(&iser_conn->lock); - conn->recv_lock = &iser_conn->lock; - conn->dd_data = iser_conn; iser_conn->iscsi_conn = conn; @@ -363,6 +361,8 @@ ib_conn->iser_conn = iser_conn; iser_conn->ib_conn = ib_conn; + conn->recv_lock = &iser_conn->lock; + return 0; } Index: infiniband-git/drivers/infiniband/ulp/iser/iser_initiator.c =================================================================== --- infiniband-git.orig/drivers/infiniband/ulp/iser/iser_initiator.c 2006-06-20 12:27:42.000000000 +0300 +++ infiniband-git/drivers/infiniband/ulp/iser/iser_initiator.c 2006-06-20 12:28:14.000000000 +0300 @@ -232,8 +232,11 @@ } rx_desc->type = ISCSI_RX; - /* for the login sequence we must support rx of upto 8K */ - if (conn->c_stage == ISCSI_CONN_INITIAL_STAGE) + /* for the login sequence we must support rx of upto 8K; login is done + * after conn create/bind (connect) and conn stop/bind (reconnect), + * what's common for both schemes is that the connection is not started + */ + if (conn->c_stage != ISCSI_CONN_STARTED) rx_data_size = DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH; else /* FIXME till user space sets conn->max_recv_dlength correctly */ rx_data_size = 128; From ogerlitz at voltaire.com Tue Jun 20 02:39:22 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 Jun 2006 12:39:22 +0300 (IDT) Subject: [openib-general] resend [PATCH 1/2] IB/iser: don't access sc->request_buffer when sc->request_bufflen is zero In-Reply-To: References: Message-ID: Roland, there was an error in the changelog comment, here it's 
again.

calling sg_init_one on sc->request_buffer when sc->request_bufflen
is zero is unsafe

Signed-off-by: Or Gerlitz

Index: infiniband-git/drivers/infiniband/ulp/iser/iser_initiator.c
===================================================================
--- infiniband-git.orig/drivers/infiniband/ulp/iser/iser_initiator.c	2006-06-20 12:26:17.000000000 +0300
+++ infiniband-git/drivers/infiniband/ulp/iser/iser_initiator.c	2006-06-20 12:27:42.000000000 +0300
@@ -391,7 +391,8 @@
 	if (sc->use_sg) { /* using a scatter list */
 		data_buf->buf  = sc->request_buffer;
 		data_buf->size = sc->use_sg;
-	} else { /* using a single buffer - convert it into one entry SG */
+	} else if (sc->request_bufflen) {
+		/* using a single buffer - convert it into one entry SG */
 		sg_init_one(&data_buf->sg_single,
 			    sc->request_buffer, sc->request_bufflen);
 		data_buf->buf  = &data_buf->sg_single;

From halr at voltaire.com Tue Jun 20 03:08:32 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Jun 2006 06:08:32 -0400
Subject: [openib-general] ib_gid lookup
In-Reply-To:
References:
Message-ID: <1150798111.4391.111384.camel@hal.voltaire.com>

Hi Amit,

On Mon, 2006-06-19 at 20:36, Amit Byron wrote:
> hello,
> i'm trying to find whether i can do a lookup of ib_gid by either
> node name or node's ip address. is this information available from
> the subnet manager?

The SM doesn't know the node name, but you might be able to do this by
NodeDescription, depending on how the subnet was set up (the
NodeDescriptions would need to be made unique on each node; a script for
this was supplied for mthca; there is also a current standards issue,
being worked on, with the SM detecting that these had changed). If that
were done, the SA could be queried by NodeDescription, which would
return a NodeRecord containing the NodeInfo, which includes the NodeGUID
and PortGUID. Note it also returns the base LID as well.
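[Editor's note: the NodeDescription -> NodeRecord -> GUID chain described above can be sketched as below. The record layout and the in-memory table are simplified stand-ins for illustration, not the real SA wire formats or openib headers; a real implementation would issue an SA query instead of scanning a local array.]

```c
/* Rough sketch of the lookup chain: resolve a (unique) NodeDescription
 * to a NodeRecord, which carries the NodeGUID, PortGUID and base LID.
 * The types here are simplified stand-ins, NOT the real SA structures. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct node_record {            /* stand-in for an SA NodeRecord */
	char     node_desc[64]; /* NodeDescription; must be unique per node */
	uint64_t node_guid;
	uint64_t port_guid;     /* port GID = subnet prefix | PortGUID */
	uint16_t base_lid;
};

/* Pretend SA database; a real lookup would query the SA by
 * NodeDescription rather than scan a local table. */
static const struct node_record sa_table[] = {
	{ "node-a HCA-1", 0x0002c90200001111ULL, 0x0002c90200001112ULL, 4 },
	{ "node-b HCA-1", 0x0002c90200002221ULL, 0x0002c90200002222ULL, 7 },
};

/* Return the NodeRecord matching desc, or NULL if none matches. */
static const struct node_record *lookup_by_desc(const char *desc)
{
	size_t i;

	for (i = 0; i < sizeof(sa_table) / sizeof(sa_table[0]); i++)
		if (!strcmp(sa_table[i].node_desc, desc))
			return &sa_table[i];
	return NULL;
}
```

As Hal notes, this only works if the NodeDescriptions have been made unique on each node; with the default firmware strings every HCA would match the same description.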
The SM does not know the IP addresses unless they are registered by DAPL
(via ServiceRecords), but I'm not sure that is done anymore or whether
DAPL runs in your environment.

-- Hal

> thanks,
> Amit.
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From ogerlitz at voltaire.com Tue Jun 20 03:25:23 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 20 Jun 2006 13:25:23 +0300
Subject: [openib-general] IB/iser upstream push for 2.6.18 awaiting for the SCSI/iscsi updates
Message-ID: <4497CD13.6080200@voltaire.com>

Hi James,

Roland is ready to push iSER for 2.6.18 through his tree, but it can't
be done before the 2.6.18 iscsi updates on which iSER depends (libiscsi
etc.) are pushed (and pulled by Linus), so ... just wondering when you
plan to push the iscsi updates?

Or.

From halr at voltaire.com Tue Jun 20 06:06:43 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Jun 2006 09:06:43 -0400
Subject: [openib-general] [PATCHv5] osm: partition manager force policy
In-Reply-To: <86d5d5ge54.fsf@mtl066.yok.mtl.com>
References: <86d5d5ge54.fsf@mtl066.yok.mtl.com>
Message-ID: <1150808795.4391.118133.camel@hal.voltaire.com>

On Mon, 2006-06-19 at 15:05, Eitan Zahavi wrote:
> Hi Hal
>
> This is a 5th take after incorporating Sasha's last reported bug
> on bad assignment of the used_blocks.
>
> This code was run again through my verification flow and also Sasha
> had run some tests too.
>
> Eitan
>
> Signed-off-by: Eitan Zahavi

[snip...]
> Index: opensm/osm_pkey.c > =================================================================== > --- opensm/osm_pkey.c (revision 8113) > +++ opensm/osm_pkey.c (working copy) > @@ -94,18 +94,22 @@ void osm_pkey_tbl_destroy( > > /********************************************************************** > **********************************************************************/ > -int osm_pkey_tbl_init( > +ib_api_status_t > +osm_pkey_tbl_init( > IN osm_pkey_tbl_t *p_pkey_tbl) > { > cl_ptr_vector_init( &p_pkey_tbl->blocks, 0, 1); > cl_ptr_vector_init( &p_pkey_tbl->new_blocks, 0, 1); > cl_map_init( &p_pkey_tbl->keys, 1 ); > + cl_qlist_init( &p_pkey_tbl->pending ); > + p_pkey_tbl->used_blocks = 0; > + p_pkey_tbl->max_blocks = 0; > return(IB_SUCCESS); > } > > /********************************************************************** > **********************************************************************/ > -void osm_pkey_tbl_sync_new_blocks( > +void osm_pkey_tbl_init_new_blocks( > IN const osm_pkey_tbl_t *p_pkey_tbl) > { > ib_pkey_table_t *p_block, *p_new_block; > @@ -123,16 +127,31 @@ void osm_pkey_tbl_sync_new_blocks( > p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); > if (!p_new_block) > break; > + cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, > + b, p_new_block); > + } > + > memset(p_new_block, 0, sizeof(*p_new_block)); > - cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, p_new_block); > } > - memcpy(p_new_block, p_block, sizeof(*p_new_block)); > +} > + > +/********************************************************************** > + **********************************************************************/ > +void osm_pkey_tbl_cleanup_pending( > + IN osm_pkey_tbl_t *p_pkey_tbl) > +{ > + cl_list_item_t *p_item; > + p_item = cl_qlist_remove_head( &p_pkey_tbl->pending ); > + while (p_item != cl_qlist_end( &p_pkey_tbl->pending ) ) > + { > + free( (osm_pending_pkey_t *)p_item ); > } > } > > 
/********************************************************************** > **********************************************************************/ > -int osm_pkey_tbl_set( > +ib_api_status_t > +osm_pkey_tbl_set( > IN osm_pkey_tbl_t *p_pkey_tbl, > IN uint16_t block, > IN ib_pkey_table_t *p_tbl) > @@ -203,7 +222,138 @@ int osm_pkey_tbl_set( > > /********************************************************************** > **********************************************************************/ > -static boolean_t __osm_match_pkey ( > +ib_api_status_t > +osm_pkey_tbl_make_block_pair( > + osm_pkey_tbl_t *p_pkey_tbl, > + uint16_t block_idx, > + ib_pkey_table_t **pp_old_block, > + ib_pkey_table_t **pp_new_block) > +{ > + if (block_idx >= p_pkey_tbl->max_blocks) return(IB_ERROR); > + > + if (pp_old_block) > + { > + *pp_old_block = osm_pkey_tbl_block_get( p_pkey_tbl, block_idx ); > + if (! *pp_old_block) > + { > + *pp_old_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); > + if (!*pp_old_block) return(IB_ERROR); > + memset(*pp_old_block, 0, sizeof(ib_pkey_table_t)); > + cl_ptr_vector_set(&p_pkey_tbl->blocks, block_idx, *pp_old_block); > + } > + } > + > + if (pp_new_block) > + { > + *pp_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_idx ); > + if (! *pp_new_block) > + { > + *pp_new_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); > + if (!*pp_new_block) return(IB_ERROR); > + memset(*pp_new_block, 0, sizeof(ib_pkey_table_t)); > + cl_ptr_vector_set(&p_pkey_tbl->new_blocks, block_idx, *pp_new_block); > + } > + } > + return( IB_SUCCESS ); > +} > + > +/********************************************************************** > + **********************************************************************/ > +/* > + store the given pkey in the "new" blocks array > + also makes sure the regular block exists. 
> +*/ > +ib_api_status_t > +osm_pkey_tbl_set_new_entry( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + IN uint16_t block_idx, > + IN uint8_t pkey_idx, > + IN uint16_t pkey) > +{ > + ib_pkey_table_t *p_old_block; > + ib_pkey_table_t *p_new_block; > + > + if (osm_pkey_tbl_make_block_pair( > + p_pkey_tbl, block_idx, &p_old_block, &p_new_block)) > + return( IB_ERROR ); > + > + p_new_block->pkey_entry[pkey_idx] = pkey; > + if (p_pkey_tbl->used_blocks <= block_idx) > + p_pkey_tbl->used_blocks = block_idx + 1; > + > + return( IB_SUCCESS ); > +} > + > +/********************************************************************** > + **********************************************************************/ > +boolean_t > +osm_pkey_find_next_free_entry( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + OUT uint16_t *p_block_idx, > + OUT uint8_t *p_pkey_idx) > +{ > + ib_pkey_table_t *p_new_block; > + > + CL_ASSERT(p_block_idx); > + CL_ASSERT(p_pkey_idx); > + > + while ( *p_block_idx < p_pkey_tbl->max_blocks) > + { > + if (*p_pkey_idx > IB_NUM_PKEY_ELEMENTS_IN_BLOCK - 1) > + { > + *p_pkey_idx = 0; > + (*p_block_idx)++; > + if (*p_block_idx >= p_pkey_tbl->max_blocks) > + return FALSE; > + } > + > + p_new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, *p_block_idx); > + > + if ( !p_new_block || > + ib_pkey_is_invalid(p_new_block->pkey_entry[*p_pkey_idx])) > + return TRUE; > + else > + (*p_pkey_idx)++; > + } > + return FALSE; > +} > + > +/********************************************************************** > + **********************************************************************/ > +ib_api_status_t > +osm_pkey_tbl_get_block_and_idx( > + IN osm_pkey_tbl_t *p_pkey_tbl, > + IN uint16_t *p_pkey, > + OUT uint32_t *p_block_idx, > + OUT uint8_t *p_pkey_index) > +{ > + uint32_t num_of_blocks; > + uint32_t block_index; > + ib_pkey_table_t *block; > + > + CL_ASSERT( p_pkey_tbl ); Should the other routines also assert on this or should this be consistent with the others ? 
> + CL_ASSERT( p_block_idx != NULL ); > + CL_ASSERT( p_pkey_idx != NULL ); There is no p_pkey_idx parameter. I presume this should be p_pkey_index. Also, should there be: CL_ASSERT( p_pkey ); as well ? > + num_of_blocks = cl_ptr_vector_get_size( &p_pkey_tbl->blocks); > + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) > + { > + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); > + if ( ( block->pkey_entry <= p_pkey ) && > + ( p_pkey < block->pkey_entry + IB_NUM_PKEY_ELEMENTS_IN_BLOCK)) > + { > + *p_block_idx = block_index; > + *p_pkey_index = p_pkey - block->pkey_entry; > + return( IB_SUCCESS ); > + } > + } > + return( IB_NOT_FOUND ); > +} > + > +/********************************************************************** > + **********************************************************************/ > +static boolean_t > +__osm_match_pkey ( > IN const ib_net16_t *pkey1, > IN const ib_net16_t *pkey2 ) { > > @@ -306,7 +456,8 @@ osm_physp_share_pkey( > if (cl_is_map_empty(&pkey_tbl1->keys) || cl_is_map_empty(&pkey_tbl2->keys)) > return TRUE; > > - return !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); > + return > + !ib_pkey_is_invalid(osm_physp_find_common_pkey(p_physp_1, p_physp_2)); > } > > /********************************************************************** [snip...] Also, two things about osm_pkey_mgr.c: Was there a need to reorder the routines ? This broke the diff so it had to be done largely by hand. Also, it would have been nice not to mix the format changes with the substantive changes. Try to keep it to "one thought per patch". This patch has been applied with cosmetic changes. We will go from here... 
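[Editor's note: the search reviewed above, osm_pkey_tbl_get_block_and_idx(), recovers a block index and within-block index from a pointer into one of the pkey blocks. The standalone sketch below illustrates that pointer-range scan; the block type is a simplified stand-in for the osm structures, not the real opensm code.]

```c
/* Standalone sketch of the block/index recovery pattern: given a
 * pointer to a pkey entry somewhere in the table, find which block it
 * lives in and its index inside that block.  Simplified stand-in types. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PKEY_ELEMENTS_IN_BLOCK 32   /* entries per pkey block in IB */

struct pkey_block {
	uint16_t pkey_entry[NUM_PKEY_ELEMENTS_IN_BLOCK];
};

/* Fill *block_idx / *pkey_idx if p_pkey points into one of the blocks;
 * return 0 on success, -1 if not found (cf. IB_NOT_FOUND). */
static int get_block_and_idx(const struct pkey_block *blocks,
			     size_t num_blocks, const uint16_t *p_pkey,
			     size_t *block_idx, size_t *pkey_idx)
{
	size_t b;

	for (b = 0; b < num_blocks; b++) {
		const uint16_t *base = blocks[b].pkey_entry;

		if (p_pkey >= base &&
		    p_pkey < base + NUM_PKEY_ELEMENTS_IN_BLOCK) {
			*block_idx = b;
			*pkey_idx  = (size_t)(p_pkey - base);
			return 0;
		}
	}
	return -1;
}

/* Test helper: pack the result as block*1000 + idx, or -1 if absent. */
static long locate(const struct pkey_block *blocks, size_t n,
		   const uint16_t *p)
{
	size_t b, i;

	if (get_block_and_idx(blocks, n, p, &b, &i))
		return -1;
	return (long)(b * 1000 + i);
}

static struct pkey_block tbl[2];
```

Note that, as in the patch, the caller gets both IB_SUCCESS/IB_NOT_FOUND style status and the two out-parameters; asserting on non-NULL out-pointers up front (Hal's review point) is cheap and catches misuse early.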
-- Hal From dotanb at mellanox.co.il Tue Jun 20 07:30:44 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 20 Jun 2006 17:30:44 +0300 Subject: [openib-general] [librdmacm] check return value in operations of rping Message-ID: <200606201730.45253.dotanb@mellanox.co.il> Added checks to the return values of all of the functions that may fail (in order to add this test to the regression system). Signed-off-by: Dotan Barak Index: last_stable/src/userspace/librdmacm/examples/rping.c =================================================================== --- last_stable.orig/src/userspace/librdmacm/examples/rping.c 2006-06-20 14:41:47.000000000 +0300 +++ last_stable/src/userspace/librdmacm/examples/rping.c 2006-06-20 14:42:12.000000000 +0300 @@ -157,10 +157,10 @@ struct rping_cb { struct rdma_cm_id *child_cm_id; /* connection on server side */ }; -static void rping_cma_event_handler(struct rdma_cm_id *cma_id, +static int rping_cma_event_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event) { - int ret; + int ret = 0; struct rping_cb *cb = cma_id->context; DEBUG_LOG("cma_event type %d cma_id %p (%s)\n", event->event, cma_id, @@ -209,6 +209,7 @@ static void rping_cma_event_handler(stru fprintf(stderr, "cma event %d, error %d\n", event->event, event->status); sem_post(&cb->sem); + ret = -1; break; case RDMA_CM_EVENT_DISCONNECTED: @@ -218,13 +219,17 @@ static void rping_cma_event_handler(stru case RDMA_CM_EVENT_DEVICE_REMOVAL: fprintf(stderr, "cma detected device removal!!!!\n"); + ret = -1; break; default: fprintf(stderr, "oof bad type!\n"); sem_post(&cb->sem); + ret = -1; break; } + + return ret; } static int server_recv(struct rping_cb *cb, struct ibv_wc *wc) @@ -263,16 +268,20 @@ static int client_recv(struct rping_cb * return 0; } -static void rping_cq_event_handler(struct rping_cb *cb) +static int rping_cq_event_handler(struct rping_cb *cb) { struct ibv_wc wc; struct ibv_recv_wr *bad_wr; int ret; while ((ret = ibv_poll_cq(cb->cq, 1, &wc)) == 1) { + ret 
= 0; + if (wc.status) { fprintf(stderr, "cq completion failed status %d\n", wc.status); + if (wc.status != IBV_WC_WR_FLUSH_ERR) + ret = -1; goto error; } @@ -312,6 +321,7 @@ static void rping_cq_event_handler(struc default: DEBUG_LOG("unknown!!!!! completion\n"); + ret = -1; goto error; } } @@ -319,11 +329,12 @@ static void rping_cq_event_handler(struc fprintf(stderr, "poll error %d\n", ret); goto error; } - return; + return 0; error: cb->state = ERROR; sem_post(&cb->sem); + return ret; } static int rping_accept(struct rping_cb *cb) @@ -560,7 +571,9 @@ static void *cm_thread(void *arg) fprintf(stderr, "rdma_get_cm_event err %d\n", ret); exit(ret); } - rping_cma_event_handler(event->id, event); + ret = rping_cma_event_handler(event->id, event); + if (ret) + exit(ret); rdma_ack_cm_event(event); } } @@ -589,7 +602,9 @@ static void *cq_thread(void *arg) fprintf(stderr, "Failed to set notify!\n"); exit(ret); } - rping_cq_event_handler(cb); + ret = rping_cq_event_handler(cb); + if (ret) + exit(ret); ibv_ack_cq_events(cb->cq, 1); } } @@ -606,7 +621,7 @@ static void rping_format_send(struct rpi info->buf, info->rkey, info->size); } -static void rping_test_server(struct rping_cb *cb) +static int rping_test_server(struct rping_cb *cb) { struct ibv_send_wr *bad_wr; int ret; @@ -617,6 +632,7 @@ static void rping_test_server(struct rpi if (cb->state != RDMA_READ_ADV) { fprintf(stderr, "wait for RDMA_READ_ADV state %d\n", cb->state); + ret = -1; break; } @@ -640,6 +656,7 @@ static void rping_test_server(struct rpi if (cb->state != RDMA_READ_COMPLETE) { fprintf(stderr, "wait for RDMA_READ_COMPLETE state %d\n", cb->state); + ret = -1; break; } DEBUG_LOG("server received read complete\n"); @@ -661,6 +678,7 @@ static void rping_test_server(struct rpi if (cb->state != RDMA_WRITE_ADV) { fprintf(stderr, "wait for RDMA_WRITE_ADV state %d\n", cb->state); + ret = -1; break; } DEBUG_LOG("server received sink adv\n"); @@ -686,6 +704,7 @@ static void rping_test_server(struct rpi if 
(cb->state != RDMA_WRITE_COMPLETE) { fprintf(stderr, "wait for RDMA_WRITE_COMPLETE state %d\n", cb->state); + ret = -1; break; } DEBUG_LOG("server rdma write complete \n"); @@ -698,6 +717,8 @@ static void rping_test_server(struct rpi } DEBUG_LOG("server posted go ahead\n"); } + + return ret; } static int rping_bind_server(struct rping_cb *cb) @@ -734,19 +755,19 @@ static int rping_bind_server(struct rpin return 0; } -static void rping_run_server(struct rping_cb *cb) +static int rping_run_server(struct rping_cb *cb) { struct ibv_recv_wr *bad_wr; int ret; ret = rping_bind_server(cb); if (ret) - return; + return ret; ret = rping_setup_qp(cb, cb->child_cm_id); if (ret) { fprintf(stderr, "setup_qp failed: %d\n", ret); - return; + return ret; } ret = rping_setup_buffers(cb); @@ -776,11 +797,13 @@ err2: rping_free_buffers(cb); err1: rping_free_qp(cb); + + return ret; } -static void rping_test_client(struct rping_cb *cb) +static int rping_test_client(struct rping_cb *cb) { - int ping, start, cc, i, ret; + int ping, start, cc, i, ret = 0; struct ibv_send_wr *bad_wr; unsigned char c; @@ -813,6 +836,7 @@ static void rping_test_client(struct rpi if (cb->state != RDMA_WRITE_ADV) { fprintf(stderr, "wait for RDMA_WRITE_ADV state %d\n", cb->state); + ret = -1; break; } @@ -828,18 +852,22 @@ static void rping_test_client(struct rpi if (cb->state != RDMA_WRITE_COMPLETE) { fprintf(stderr, "wait for RDMA_WRITE_COMPLETE state %d\n", cb->state); + ret = -1; break; } if (cb->validate) if (memcmp(cb->start_buf, cb->rdma_buf, cb->size)) { fprintf(stderr, "data mismatch!\n"); + ret = -1; break; } if (cb->verbose) printf("ping data: %s\n", cb->rdma_buf); } + + return ret; } static int rping_connect_client(struct rping_cb *cb) @@ -896,19 +924,19 @@ static int rping_bind_client(struct rpin return 0; } -static void rping_run_client(struct rping_cb *cb) +static int rping_run_client(struct rping_cb *cb) { struct ibv_recv_wr *bad_wr; int ret; ret = rping_bind_client(cb); if (ret) - return; + 
return ret; ret = rping_setup_qp(cb, cb->cm_id); if (ret) { fprintf(stderr, "setup_qp failed: %d\n", ret); - return; + return ret; } ret = rping_setup_buffers(cb); @@ -937,6 +965,8 @@ err2: rping_free_buffers(cb); err1: rping_free_qp(cb); + + return ret; } static void usage(char *name) @@ -1054,9 +1084,9 @@ int main(int argc, char *argv[]) pthread_create(&cb->cmthread, NULL, cm_thread, cb); if (cb->server) - rping_run_server(cb); + ret = rping_run_server(cb); else - rping_run_client(cb); + ret = rping_run_client(cb); DEBUG_LOG("destroy cm_id %p\n", cb->cm_id); rdma_destroy_id(cb->cm_id); From rdreier at cisco.com Tue Jun 20 07:59:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 Jun 2006 07:59:59 -0700 Subject: [openib-general] iSER updates In-Reply-To: <44979F14.9000905@voltaire.com> (Or Gerlitz's message of "Tue, 20 Jun 2006 10:09:08 +0300") References: <44977F81.9080206@voltaire.com> <44979F14.9000905@voltaire.com> Message-ID: Or> I see that the patch is applied at the for-mm branch but not Or> at the iser branch, is it fine? Sorry, I forgot to push out an update the iser branch on master.kernel.org. It should be OK now. - R. From swise at opengridcomputing.com Tue Jun 20 08:23:43 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 10:23:43 -0500 Subject: [openib-general] [librdmacm] check return value in operations of rping In-Reply-To: <200606201730.45253.dotanb@mellanox.co.il> References: <200606201730.45253.dotanb@mellanox.co.il> Message-ID: <1150817023.22519.22.camel@stevo-desktop> This patch is malformed, I think. Did your mailer munge it? On Tue, 2006-06-20 at 17:30 +0300, Dotan Barak wrote: > Added checks to the return values of all of the functions that may fail > (in order to add this test to the regression system). 
>
> Signed-off-by: Dotan Barak

[snip...]

From swise at opengridcomputing.com Tue Jun 20 08:49:16 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 20 Jun 2006 10:49:16 -0500
Subject: [openib-general] [librdmacm] check return value in operations of rping
In-Reply-To: <200606201844.34381.dotanb@mellanox.co.il>
References: <200606201730.45253.dotanb@mellanox.co.il> <1150817023.22519.22.camel@stevo-desktop> <200606201844.34381.dotanb@mellanox.co.il>
Message-ID: <1150818556.22519.28.camel@stevo-desktop>

This is still balled up. It's like the tabs have been converted to
spaces... Go ahead and send it as an attachment and I'll review it...

Stevo.

On Tue, 2006-06-20 at 18:44 +0300, Dotan Barak wrote:
> On Tuesday 20 June 2006 18:23, Steve Wise wrote:
> > This patch is malformed, I think. Did your mailer munge it?
>
> Sorry, i changed the mail client recently and it wasnt' configured properly ...
>
> i hope that this patch looks better ..
>
> Added checks to the return values of all of the functions that may fail
> (in order to add this test to the regression system).
>
> Signed-off-by: Dotan Barak

[snip...]
ibv_recv_wr *bad_wr; > int ret; > > ret = rping_bind_client(cb); > if (ret) > - return; > + return ret; > > ret = rping_setup_qp(cb, cb->cm_id); > if (ret) { > fprintf(stderr, "setup_qp failed: %d\n", ret); > - return; > + return ret; > } > > ret = rping_setup_buffers(cb); > @@ -937,6 +965,8 @@ err2: > rping_free_buffers(cb); > err1: > rping_free_qp(cb); > + > + return ret; > } > > static void usage(char *name) > @@ -1054,9 +1084,9 @@ int main(int argc, char *argv[]) > pthread_create(&cb->cmthread, NULL, cm_thread, cb); > > if (cb->server) > - rping_run_server(cb); > + ret = rping_run_server(cb); > else > - rping_run_client(cb); > + ret = rping_run_client(cb); > > DEBUG_LOG("destroy cm_id %p\n", cb->cm_id); > rdma_destroy_id(cb->cm_id); From bchang at atipa.com Tue Jun 20 08:42:23 2006 From: bchang at atipa.com (Brady Chang) Date: Tue, 20 Jun 2006 10:42:23 -0500 Subject: [openib-general] FW: mvapich xhpl memory usage References: <0D6FBA307D01EA42BAC8715725643AA01EDB53@EXCHG2003.microtech-ks.com> Message-ID: <0D6FBA307D01EA42BAC8715725643AA01EDB54@EXCHG2003.microtech-ks.com> Hello, I installed OFED 1.0 (mvapich 0.97) and compiled the Linpack benchmark. When I run xhpl, the memory usage creeps up with each NB, and as each N changes, the memory allocated is not freed. LAZY_MEM_REGISTER is not defined per http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2006-March/000057.html I removed it from Make.mvapich.gen2, tarred it back up, and reran the install. Hardware: dual-core Opteron, 4 GB memory. InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev a0) xhpl compiled using libmpich.so thanks -Brady -------------- next part -------------- An HTML attachment was scrubbed...
URL: From rdreier at cisco.com Tue Jun 20 09:29:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 Jun 2006 09:29:20 -0700 Subject: [openib-general] [PATCH 2/2] IB/iser: bugfix for the reconnect flow In-Reply-To: (Or Gerlitz's message of "Tue, 20 Jun 2006 12:35:51 +0300 (IDT)") References: Message-ID: Thanks, I rolled up both of these patches into what I have queued. From amit_byron at yahoo.com Tue Jun 20 09:27:50 2006 From: amit_byron at yahoo.com (amit byron) Date: Tue, 20 Jun 2006 16:27:50 +0000 (UTC) Subject: [openib-general] =?utf-8?q?ib=5Fgid_lookup?= References: <1150798111.4391.111384.camel@hal.voltaire.com> Message-ID: > Hal Rosenstock voltaire.com> writes: > > > Hi Amit, > > On Mon, 2006-06-19 at 20:36, Amit Byron wrote: > > hello, > > i'm trying to find whether i can do a lookup of ib_gid by either > > node name or node's ip address. is this information available from > > the subnet manager? > > The SM doesn't know the node name but you might be able to do this by > NodeDescription depending on how the subnet was setup (the > NodeDescriptions would need to be made unique on each node; a script for > this was supplied for mthca; there is also a current standards issue > with the SM detecting that these had changed which is being worked on). > If that were to be done, the SA could be queried by NodeDescription > which would return a NodeRecord which would obtain the NodeInfo which > includes the NodeGUID and PortGUID. Note it also returns the base LID as > well. hi Hal, thank you very much for your suggestions. do you mean to say setting up subnet through the topology file? are there any examples on how to setup the topology file? also, where can i find the mthca script that you mention above. > > The SM does not know the IP addresses unless they are registered by DAPL > (via ServiceRecords) but I'm not sure that is done anymore or whether > DAPL runs in your environment. 
> if i run DAPL in my environment will it work or this is already made obsolete? thanks again, Amit From halr at voltaire.com Tue Jun 20 09:44:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Jun 2006 12:44:34 -0400 Subject: [openib-general] [PATCH] OpenSM/SA: Message-ID: <1150821864.4391.126438.camel@hal.voltaire.com> OpenSM/SA: Properly handle non base LID requests per C15-0.1.11 on remaining SA records where this hasn't been fixed already. C15-0.1.11: Query responses shall contain a port's base LID in any LID component of a RID. So when LMC is non 0, the only records that appear are those with the base LID and not with any masked LIDs. Furthermore, if a query comes in on a non base LID, the LID in the RID returned is only with the base LID. Also, fixed some error handling for SA GetTable requests in these SA records. Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_guidinfo_record.c =================================================================== --- opensm/osm_sa_guidinfo_record.c (revision 8140) +++ opensm/osm_sa_guidinfo_record.c (working copy) @@ -201,12 +201,10 @@ __osm_sa_gir_create_gir( uint8_t port_num; uint8_t num_ports; uint16_t match_lid_ho; - uint16_t lid_ho; ib_net16_t base_lid_ho; ib_net16_t max_lid_ho; uint8_t lmc; ib_net64_t port_guid; - ib_api_status_t status; const ib_port_info_t* p_pi; uint8_t block_num, start_block_num, end_block_num, num_blocks; @@ -276,11 +274,12 @@ __osm_sa_gir_create_gir( } base_lid_ho = cl_ntoh16( osm_physp_get_base_lid( p_physp ) ); - lmc = osm_physp_get_lmc( p_physp ); - max_lid_ho = (uint16_t)( base_lid_ho + (1 << lmc) - 1 ); match_lid_ho = cl_ntoh16( match_lid ); if( match_lid_ho ) { + lmc = osm_physp_get_lmc( p_physp ); + max_lid_ho = (uint16_t)( base_lid_ho + (1 << lmc) - 1 ); + /* We validate that the lid belongs to this node. */ @@ -295,34 +294,15 @@ __osm_sa_gir_create_gir( ); } - if( match_lid_ho <= max_lid_ho && match_lid_ho >= base_lid_ho ) - { - /* - Ignore return code for now. 
- */ - for (block_num = start_block_num; block_num <= end_block_num; block_num++) - __osm_gir_rcv_new_gir( p_rcv, p_node, p_list, - port_guid, match_lid, - p_physp, block_num ); - } - } - else - { - /* - For every lid value create the GUIDInfo record(s). - */ - for( lid_ho = base_lid_ho; lid_ho <= max_lid_ho; lid_ho++ ) - { - for (block_num = start_block_num; block_num <= end_block_num; block_num++) - { - status = __osm_gir_rcv_new_gir( p_rcv, p_node, p_list, - port_guid, cl_hton16( lid_ho ), - p_physp, block_num ); - if( status != IB_SUCCESS ) - break; - } - } + if ( match_lid_ho < base_lid_ho || match_lid_ho > max_lid_ho ) + continue; } + + for (block_num = start_block_num; block_num <= end_block_num; block_num++) + __osm_gir_rcv_new_gir( p_rcv, p_node, p_list, + port_guid, cl_ntoh16(base_lid_ho), + p_physp, block_num ); + } OSM_LOG_EXIT( p_rcv->p_log ); @@ -496,24 +476,32 @@ osm_gir_rcv_process( * C15-0.1.30: * If we do a SubnAdmGet and got more than one record it is an error ! */ - if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && - (num_rec > 1)) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_gir_rcv_process: ERR 5103: " - "Got more than one record for SubnAdmGet (%u)\n", - num_rec ); - osm_sa_send_error( p_rcv->p_resp, p_madw, - IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); - - /* need to set the mem free ... 
*/ - p_rec_item = (osm_gir_item_t*)cl_qlist_remove_head( &rec_list ); - while( p_rec_item != (osm_gir_item_t*)cl_qlist_end( &rec_list ) ) + if (p_rcvd_mad->method == IB_MAD_METHOD_GET) + { + if (num_rec == 0) { - cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); - p_rec_item = (osm_gir_item_t*)cl_qlist_remove_head( &rec_list ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; } + if (num_rec > 1) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_gir_rcv_process: ERR 5103: " + "Got more than one record for SubnAdmGet (%u)\n", + num_rec ); + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); - goto Exit; + /* need to set the mem free ... */ + p_rec_item = (osm_gir_item_t*)cl_qlist_remove_head( &rec_list ); + while( p_rec_item != (osm_gir_item_t*)cl_qlist_end( &rec_list ) ) + { + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_rec_item = (osm_gir_item_t*)cl_qlist_remove_head( &rec_list ); + } + + goto Exit; + } } pre_trim_num_rec = num_rec; Index: opensm/osm_sa_lft_record.c =================================================================== --- opensm/osm_sa_lft_record.c (revision 8140) +++ opensm/osm_sa_lft_record.c (working copy) @@ -199,7 +199,6 @@ __osm_lftr_get_port_by_guid( p_port = (osm_port_t *)cl_qmap_get(&p_rcv->p_subn->port_guid_tbl, port_guid); - if(p_port == (osm_port_t *)cl_qmap_end(&p_rcv->p_subn->port_guid_tbl)) { osm_log( p_rcv->p_log, OSM_LOG_DEBUG, @@ -249,9 +248,6 @@ __osm_lftr_rcv_by_comp_mask( return; } - /* get the port 0 of the switch */ - osm_port_get_lid_range_ho( p_port, &min_lid_ho, &max_lid_ho ); - /* check that the requester physp and the current physp are under the same partition. */ p_physp = osm_port_get_default_phys_ptr( p_port ); @@ -268,6 +264,9 @@ __osm_lftr_rcv_by_comp_mask( if (! 
osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_physp )) return; + /* get the port 0 of the switch */ + osm_port_get_lid_range_ho( p_port, &min_lid_ho, &max_lid_ho ); + /* compare the lids - if required */ if( comp_mask & IB_LFTR_COMPMASK_LID ) { @@ -277,8 +276,8 @@ __osm_lftr_rcv_by_comp_mask( cl_ntoh16( p_rcvd_rec->lid ), min_lid_ho, max_lid_ho ); /* ok we are ready for range check */ - if ((min_lid_ho > cl_ntoh16(p_rcvd_rec->lid)) || - (max_lid_ho < cl_ntoh16(p_rcvd_rec->lid))) + if (min_lid_ho > cl_ntoh16(p_rcvd_rec->lid) || + max_lid_ho < cl_ntoh16(p_rcvd_rec->lid)) return; } @@ -323,7 +322,7 @@ osm_lftr_rcv_process( uint32_t i; osm_lftr_search_ctxt_t context; osm_lftr_item_t* p_rec_item; - ib_api_status_t status; + ib_api_status_t status = IB_SUCCESS; osm_physp_t* p_req_physp; CL_ASSERT( p_rcv ); @@ -382,24 +381,32 @@ osm_lftr_rcv_process( * C15-0.1.30: * If we do a SubnAdmGet and got more than one record it is an error ! */ - if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && - (num_rec > 1)) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_lftr_rcv_process: ERR 4409: " - "Got more than one record for SubnAdmGet (%u)\n", - num_rec ); - osm_sa_send_error( p_rcv->p_resp, p_madw, - IB_SA_MAD_STATUS_TOO_MANY_RECORDS); - - /* need to set the mem free ... 
*/ - p_rec_item = (osm_lftr_item_t*)cl_qlist_remove_head( &rec_list ); - while( p_rec_item != (osm_lftr_item_t*)cl_qlist_end( &rec_list ) ) + if (p_rcvd_mad->method == IB_MAD_METHOD_GET) + { + if (num_rec == 0) { - cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); - p_rec_item = (osm_lftr_item_t*)cl_qlist_remove_head( &rec_list ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; } + if (num_rec > 1) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_lftr_rcv_process: ERR 4409: " + "Got more than one record for SubnAdmGet (%u)\n", + num_rec ); + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_TOO_MANY_RECORDS); - goto Exit; + /* need to set the mem free ... */ + p_rec_item = (osm_lftr_item_t*)cl_qlist_remove_head( &rec_list ); + while( p_rec_item != (osm_lftr_item_t*)cl_qlist_end( &rec_list ) ) + { + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_rec_item = (osm_lftr_item_t*)cl_qlist_remove_head( &rec_list ); + } + + goto Exit; + } } pre_trim_num_rec = num_rec; Index: opensm/osm_sa_node_record.c =================================================================== --- opensm/osm_sa_node_record.c (revision 8140) +++ opensm/osm_sa_node_record.c (working copy) @@ -264,15 +264,12 @@ __osm_nr_rcv_create_nr( ); } - if( (match_lid_ho <= max_lid_ho) && (match_lid_ho >= base_lid_ho) ) - { - __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, port_guid, base_lid ); - } - } - else - { - __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, port_guid, base_lid ); + if ( match_lid_ho < base_lid_ho || match_lid_ho > max_lid_ho ) + continue; } + + __osm_nr_rcv_new_nr( p_rcv, p_node, p_list, port_guid, base_lid ); + } OSM_LOG_EXIT( p_rcv->p_log ); Index: opensm/osm_sa_path_record.c =================================================================== --- opensm/osm_sa_path_record.c (revision 8140) +++ opensm/osm_sa_path_record.c (working copy) @@ -1027,8 +1027,7 @@ __osm_pr_rcv_get_end_points( status = cl_ptr_vector_at( 
&p_rcv->p_subn->port_lid_tbl, cl_ntoh16(p_pr->slid), (void**)pp_src_port ); - if( ( (status != CL_SUCCESS) || (*pp_src_port == NULL) ) && - (p_sa_mad->method == IB_MAD_METHOD_GET) ) + if( (status != CL_SUCCESS) || (*pp_src_port == NULL) ) { /* This 'error' is the client's fault (bad lid) so @@ -1077,8 +1076,7 @@ __osm_pr_rcv_get_end_points( status = cl_ptr_vector_at( &p_rcv->p_subn->port_lid_tbl, cl_ntoh16(p_pr->dlid), (void**)pp_dest_port ); - if( ( (status != CL_SUCCESS) || (*pp_dest_port == NULL) ) && - (p_sa_mad->method == IB_MAD_METHOD_GET) ) + if( (status != CL_SUCCESS) || (*pp_dest_port == NULL) ) { /* This 'error' is the client's fault (bad lid) so @@ -1521,22 +1519,30 @@ __osm_pr_rcv_respond( * C15-0.1.30: * If we do a SubnAdmGet and got more than one record it is an error ! */ - if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && - (num_rec > 1)) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_pr_rcv_respond: ERR 1F13: " - "Got more than one record for SubnAdmGet (%u)\n", - num_rec ); - osm_sa_send_error( p_rcv->p_resp, p_madw, - IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); - /* need to set the mem free ... */ - p_pr_item = (osm_pr_item_t*)cl_qlist_remove_head( p_list ); - while( p_pr_item != (osm_pr_item_t*)cl_qlist_end( p_list ) ) + if (p_rcvd_mad->method == IB_MAD_METHOD_GET) + { + if (num_rec == 0) { - cl_qlock_pool_put( &p_rcv->pr_pool, &p_pr_item->pool_item ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; + } + if (num_rec > 1) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_respond: ERR 1F13: " + "Got more than one record for SubnAdmGet (%u)\n", + num_rec ); + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); + /* need to set the mem free ... 
*/ p_pr_item = (osm_pr_item_t*)cl_qlist_remove_head( p_list ); + while( p_pr_item != (osm_pr_item_t*)cl_qlist_end( p_list ) ) + { + cl_qlock_pool_put( &p_rcv->pr_pool, &p_pr_item->pool_item ); + p_pr_item = (osm_pr_item_t*)cl_qlist_remove_head( p_list ); + } + goto Exit; } - goto Exit; } pre_trim_num_rec = num_rec; @@ -1704,40 +1710,36 @@ osm_pr_rcv_process( sa_status = __osm_pr_rcv_get_end_points( p_rcv, p_madw, &p_src_port, &p_dest_port ); - if( sa_status != IB_SA_MAD_STATUS_SUCCESS ) + if( sa_status == IB_SA_MAD_STATUS_SUCCESS ) { - cl_plock_release( p_rcv->p_lock ); - osm_sa_send_error( p_rcv->p_resp, p_madw, sa_status ); - goto Exit; - } - - /* - What happens next depends on the type of endpoint information - that was specified.... - */ - if( p_src_port ) - { - if( p_dest_port ) - __osm_pr_rcv_process_pair( p_rcv, p_madw, requester_port, - p_src_port, p_dest_port, - p_sa_mad->comp_mask, &pr_list ); - else - __osm_pr_rcv_process_half( p_rcv, p_madw, requester_port, - p_src_port, NULL, - p_sa_mad->comp_mask, &pr_list ); - } - else - { - if( p_dest_port ) - __osm_pr_rcv_process_half( p_rcv, p_madw, requester_port, - NULL, p_dest_port, - p_sa_mad->comp_mask, &pr_list ); + /* + What happens next depends on the type of endpoint information + that was specified.... + */ + if( p_src_port ) + { + if( p_dest_port ) + __osm_pr_rcv_process_pair( p_rcv, p_madw, requester_port, + p_src_port, p_dest_port, + p_sa_mad->comp_mask, &pr_list ); + else + __osm_pr_rcv_process_half( p_rcv, p_madw, requester_port, + p_src_port, NULL, + p_sa_mad->comp_mask, &pr_list ); + } else - /* - Katie, bar the door! - */ - __osm_pr_rcv_process_world( p_rcv, p_madw, requester_port, - p_sa_mad->comp_mask, &pr_list ); + { + if( p_dest_port ) + __osm_pr_rcv_process_half( p_rcv, p_madw, requester_port, + NULL, p_dest_port, + p_sa_mad->comp_mask, &pr_list ); + else + /* + Katie, bar the door! 
+ */ + __osm_pr_rcv_process_world( p_rcv, p_madw, requester_port, + p_sa_mad->comp_mask, &pr_list ); + } } goto Unlock; Index: opensm/osm_sa_pkey_record.c =================================================================== --- opensm/osm_sa_pkey_record.c (revision 8140) +++ opensm/osm_sa_pkey_record.c (working copy) @@ -332,7 +332,7 @@ osm_pkey_rec_rcv_process( uint32_t i; osm_pkey_search_ctxt_t context; osm_pkey_item_t* p_rec_item; - ib_api_status_t status; + ib_api_status_t status = IB_SUCCESS; ib_net64_t comp_mask; osm_physp_t* p_req_physp; @@ -421,30 +421,38 @@ osm_pkey_rec_rcv_process( if ((uint16_t)cl_ptr_vector_get_size(p_tbl) > cl_ntoh16(p_rcvd_rec->lid)) { - p_port = cl_ptr_vector_get( p_tbl, cl_ntoh16(p_rcvd_rec->lid) ); + status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); + if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) + { + status = IB_NOT_FOUND; + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_pkey_rec_rcv_process: ERR 460B: " + "No port found with LID 0x%x\n", + cl_ntoh16(p_rcvd_rec->lid) ); + } } else { /* port out of range */ - cl_plock_release( p_rcv->p_lock ); - + status = IB_NOT_FOUND; osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_pkey_rec_rcv_process: ERR 4609: " "Given LID (0x%X) is out of range:0x%X\n", cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl) ); - osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_REQ_INVALID ); - goto Exit; } } - /* if we got a unique port - no need for a port search */ - if( p_port ) - /* this does the loop on all the port phys ports */ - __osm_sa_pkey_by_comp_mask( p_rcv, p_port, &context ); - else - { - cl_qmap_apply_func( &p_rcv->p_subn->port_guid_tbl, - __osm_sa_pkey_by_comp_mask_cb, - &context ); + if ( status == IB_SUCCESS ) + { + /* if we got a unique port - no need for a port search */ + if( p_port ) + /* this does the loop on all the port phys ports */ + __osm_sa_pkey_by_comp_mask( p_rcv, p_port, &context ); + else + { + cl_qmap_apply_func( 
&p_rcv->p_subn->port_guid_tbl, + __osm_sa_pkey_by_comp_mask_cb, + &context ); + } } cl_plock_release( p_rcv->p_lock ); @@ -455,24 +463,32 @@ osm_pkey_rec_rcv_process( * C15-0.1.30: * If we do a SubnAdmGet and got more than one record it is an error ! */ - if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && - (num_rec > 1)) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pkey_rec_rcv_process: ERR 460A: " - "Got more than one record for SubnAdmGet (%u)\n", - num_rec ); - osm_sa_send_error( p_rcv->p_resp, p_madw, - IB_SA_MAD_STATUS_TOO_MANY_RECORDS); - - /* need to set the mem free ... */ - p_rec_item = (osm_pkey_item_t*)cl_qlist_remove_head( &rec_list ); - while( p_rec_item != (osm_pkey_item_t*)cl_qlist_end( &rec_list ) ) + if (p_rcvd_mad->method == IB_MAD_METHOD_GET) + { + if (num_rec == 0) { - cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); - p_rec_item = (osm_pkey_item_t*)cl_qlist_remove_head( &rec_list ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; } + if (num_rec > 1) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_pkey_rec_rcv_process: ERR 460A: " + "Got more than one record for SubnAdmGet (%u)\n", + num_rec ); + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_TOO_MANY_RECORDS); - goto Exit; + /* need to set the mem free ... 
*/ + p_rec_item = (osm_pkey_item_t*)cl_qlist_remove_head( &rec_list ); + while( p_rec_item != (osm_pkey_item_t*)cl_qlist_end( &rec_list ) ) + { + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_rec_item = (osm_pkey_item_t*)cl_qlist_remove_head( &rec_list ); + } + + goto Exit; + } } pre_trim_num_rec = num_rec; Index: opensm/osm_sa_slvl_record.c =================================================================== --- opensm/osm_sa_slvl_record.c (revision 8140) +++ opensm/osm_sa_slvl_record.c (working copy) @@ -317,7 +317,7 @@ osm_slvl_rec_rcv_process( uint32_t i; osm_slvl_search_ctxt_t context; osm_slvl_item_t* p_rec_item; - ib_api_status_t status; + ib_api_status_t status = IB_SUCCESS; ib_net64_t comp_mask; osm_physp_t* p_req_physp; @@ -389,30 +389,38 @@ osm_slvl_rec_rcv_process( if ((uint16_t)cl_ptr_vector_get_size(p_tbl) > cl_ntoh16(p_rcvd_rec->lid)) { - p_port = cl_ptr_vector_get( p_tbl, cl_ntoh16(p_rcvd_rec->lid) ); + status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); + if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) + { + status = IB_NOT_FOUND; + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_slvl_rec_rcv_process: ERR 2608: " + "No port found with LID 0x%x\n", + cl_ntoh16(p_rcvd_rec->lid) ); + } } else { /* port out of range */ - cl_plock_release( p_rcv->p_lock ); - + status = IB_NOT_FOUND; osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_slvl_rec_rcv_process: ERR 2601: " "Given LID (0x%X) is out of range:0x%X\n", cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl)); - osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_REQ_INVALID ); - goto Exit; } } - /* if we have a unique port - no need for a port search */ - if( p_port ) - /* this does the loop on all the port phys ports */ - __osm_sa_slvl_by_comp_mask( p_rcv, p_port, &context ); - else - { - cl_qmap_apply_func( &p_rcv->p_subn->port_guid_tbl, - __osm_sa_slvl_by_comp_mask_cb, - &context ); + if ( status == IB_SUCCESS ) + { + /* if we have a unique port - 
no need for a port search */ + if( p_port ) + /* this does the loop on all the port phys ports */ + __osm_sa_slvl_by_comp_mask( p_rcv, p_port, &context ); + else + { + cl_qmap_apply_func( &p_rcv->p_subn->port_guid_tbl, + __osm_sa_slvl_by_comp_mask_cb, + &context ); + } } cl_plock_release( p_rcv->p_lock ); @@ -423,24 +431,32 @@ osm_slvl_rec_rcv_process( * C15-0.1.30: * If we do a SubnAdmGet and got more than one record it is an error ! */ - if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && - (num_rec > 1)) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_slvl_rec_rcv_process: ERR 2607: " - "Got more than one record for SubnAdmGet (%u)\n", - num_rec ); - osm_sa_send_error( p_rcv->p_resp, p_madw, - IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); - - /* need to set the mem free ... */ - p_rec_item = (osm_slvl_item_t*)cl_qlist_remove_head( &rec_list ); - while( p_rec_item != (osm_slvl_item_t*)cl_qlist_end( &rec_list ) ) + if (p_rcvd_mad->method == IB_MAD_METHOD_GET) + { + if (num_rec == 0) { - cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); - p_rec_item = (osm_slvl_item_t*)cl_qlist_remove_head( &rec_list ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; } + if (num_rec > 1) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_slvl_rec_rcv_process: ERR 2607: " + "Got more than one record for SubnAdmGet (%u)\n", + num_rec ); + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); - goto Exit; + /* need to set the mem free ... 
*/ + p_rec_item = (osm_slvl_item_t*)cl_qlist_remove_head( &rec_list ); + while( p_rec_item != (osm_slvl_item_t*)cl_qlist_end( &rec_list ) ) + { + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_rec_item = (osm_slvl_item_t*)cl_qlist_remove_head( &rec_list ); + } + + goto Exit; + } } pre_trim_num_rec = num_rec; Index: opensm/osm_sa_vlarb_record.c =================================================================== --- opensm/osm_sa_vlarb_record.c (revision 8140) +++ opensm/osm_sa_vlarb_record.c (working copy) @@ -337,7 +337,7 @@ osm_vlarb_rec_rcv_process( uint32_t i; osm_vl_arb_search_ctxt_t context; osm_vl_arb_item_t* p_rec_item; - ib_api_status_t status; + ib_api_status_t status = IB_SUCCESS; ib_net64_t comp_mask; osm_physp_t* p_req_physp; @@ -409,30 +409,38 @@ osm_vlarb_rec_rcv_process( if ((uint16_t)cl_ptr_vector_get_size(p_tbl) > cl_ntoh16(p_rcvd_rec->lid)) { - p_port = cl_ptr_vector_get( p_tbl, cl_ntoh16(p_rcvd_rec->lid) ); + status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); + if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) + { + status = IB_NOT_FOUND; + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_vlarb_rec_rcv_process: ERR 2A09: " + "No port found with LID 0x%x\n", + cl_ntoh16(p_rcvd_rec->lid) ); + } } else { /* port out of range */ - cl_plock_release( p_rcv->p_lock ); - + status = IB_NOT_FOUND; osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_vlarb_rec_rcv_process: ERR 2A01: " "Given LID (0x%X) is out of range:0x%X\n", cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl) ); - osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_REQ_INVALID ); - goto Exit; } } - /* if we got a unique port - no need for a port search */ - if( p_port ) - /* this does the loop on all the port phys ports */ - __osm_sa_vl_arb_by_comp_mask( p_rcv, p_port, &context ); - else - { - cl_qmap_apply_func( &p_rcv->p_subn->port_guid_tbl, - __osm_sa_vl_arb_by_comp_mask_cb, - &context ); + if ( status == IB_SUCCESS ) + { + /* if we got a 
unique port - no need for a port search */ + if( p_port ) + /* this does the loop on all the port phys ports */ + __osm_sa_vl_arb_by_comp_mask( p_rcv, p_port, &context ); + else + { + cl_qmap_apply_func( &p_rcv->p_subn->port_guid_tbl, + __osm_sa_vl_arb_by_comp_mask_cb, + &context ); + } } cl_plock_release( p_rcv->p_lock ); @@ -443,24 +451,32 @@ osm_vlarb_rec_rcv_process( * C15-0.1.30: * If we do a SubnAdmGet and got more than one record it is an error ! */ - if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && - (num_rec > 1)) { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_vlarb_rec_rcv_process: ERR 2A08: " - "Got more than one record for SubnAdmGet (%u)\n", - num_rec ); - osm_sa_send_error( p_rcv->p_resp, p_madw, - IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); - - /* need to set the mem free ... */ - p_rec_item = (osm_vl_arb_item_t*)cl_qlist_remove_head( &rec_list ); - while( p_rec_item != (osm_vl_arb_item_t*)cl_qlist_end( &rec_list ) ) + if (p_rcvd_mad->method == IB_MAD_METHOD_GET) + { + if (num_rec == 0) { - cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); - p_rec_item = (osm_vl_arb_item_t*)cl_qlist_remove_head( &rec_list ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; } + if (num_rec > 1) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_vlarb_rec_rcv_process: ERR 2A08: " + "Got more than one record for SubnAdmGet (%u)\n", + num_rec ); + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); - goto Exit; + /* need to set the mem free ... 
*/ + p_rec_item = (osm_vl_arb_item_t*)cl_qlist_remove_head( &rec_list ); + while( p_rec_item != (osm_vl_arb_item_t*)cl_qlist_end( &rec_list ) ) + { + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_rec_item = (osm_vl_arb_item_t*)cl_qlist_remove_head( &rec_list ); + } + + goto Exit; + } } pre_trim_num_rec = num_rec; From halr at voltaire.com Tue Jun 20 09:50:01 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Jun 2006 12:50:01 -0400 Subject: [openib-general] ib_gid lookup In-Reply-To: References: <1150798111.4391.111384.camel@hal.voltaire.com> Message-ID: <1150822037.4391.126581.camel@hal.voltaire.com> Hi again Amit, On Tue, 2006-06-20 at 12:27, amit byron wrote: > > Hal Rosenstock voltaire.com> writes: > > > > > > Hi Amit, > > > > On Mon, 2006-06-19 at 20:36, Amit Byron wrote: > > > hello, > > > i'm trying to find whether i can do a lookup of ib_gid by either > > > node name or node's ip address. is this information available from > > > the subnet manager? > > > > The SM doesn't know the node name but you might be able to do this by > > NodeDescription depending on how the subnet was setup (the > > NodeDescriptions would need to be made unique on each node; a script for > > this was supplied for mthca; there is also a current standards issue > > with the SM detecting that these had changed which is being worked on). > > If that were to be done, the SA could be queried by NodeDescription > > which would return a NodeRecord which would obtain the NodeInfo which > > includes the NodeGUID and PortGUID. Note it also returns the base LID as > > well. > > hi Hal, > thank you very much for your suggestions. > > do you mean to say setting up subnet through the topology file? No (although the topology file does display this information). > are > there any examples on how to setup the topology file? also, where can > i find the mthca script that you mention above. 
management/diags/scripts/set_mthca_nodedesc.sh > > The SM does not know the IP addresses unless they are registered by DAPL > > (via ServiceRecords) but I'm not sure that is done anymore or whether > > DAPL runs in your environment. > > > > if i run DAPL in my environment will it work or this is already made > obsolete? I don't know. James or maybe Arlin would be the ones to answer. You could also look at the code to figure this out. -- Hal > thanks again, > Amit > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From robert.j.woodruff at intel.com Tue Jun 20 09:55:57 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 20 Jun 2006 09:55:57 -0700 Subject: [openib-general] ipath verbs does not compile against the latest SVN trunk verbs Message-ID: <1AC79F16F5C5284499BB9591B33D6F0008057B9B@orsmsx408> When I try to build SVN 8112 I get the following errors trying to build the ipath verbs. 
src/ipathverbs.c:148: warning: its scope is only this definition or declaration, which is probably not what you want src/ipathverbs.c: In function `openib_driver_init': src/ipathverbs.c:156: warning: implicit declaration of function `sysfs_get_classdev_device' src/ipathverbs.c:156: warning: assignment makes pointer from integer without a cast src/ipathverbs.c:160: warning: implicit declaration of function `sysfs_get_device_attr' src/ipathverbs.c:160: warning: assignment makes pointer from integer without a cast src/ipathverbs.c:163: error: dereferencing pointer to incomplete type src/ipathverbs.c:164: warning: implicit declaration of function `sysfs_close_attribute' src/ipathverbs.c:166: warning: assignment makes pointer from integer without a cast src/ipathverbs.c:169: error: dereferencing pointer to incomplete type src/ipathverbs.c:183: error: dereferencing pointer to incomplete type -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Tue Jun 20 10:24:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Jun 2006 13:24:28 -0400 Subject: [openib-general] [PATCH] OpenSM/osm_sa_link_record.c: Only need base LID rather than LID range in __osm_lr_rcv_get_physp_link Message-ID: <1150824264.4391.127940.camel@hal.voltaire.com> OpenSM/osm_sa_link_record.c: Only need base LID rather than LID range in __osm_lr_rcv_get_physp_link Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_link_record.c =================================================================== --- opensm/osm_sa_link_record.c (revision 8140) +++ opensm/osm_sa_link_record.c (working copy) @@ -166,13 +166,10 @@ __osm_lr_rcv_build_physp_link( /********************************************************************** **********************************************************************/ static void -__get_lid_range( +__get_base_lid( IN const osm_physp_t* p_physp, - OUT uint16_t * p_base_lid, - OUT uint16_t * p_max_lid ) + OUT uint16_t * p_base_lid ) { - 
uint8_t lmc; - if(p_physp->p_node->node_info.node_type == IB_NODE_TYPE_SWITCH) { *p_base_lid = @@ -180,14 +177,11 @@ __get_lid_range( osm_physp_get_base_lid( osm_node_get_physp_ptr(p_physp->p_node, 0)) ); - *p_max_lid = *p_base_lid; } else { *p_base_lid = cl_ntoh16(osm_physp_get_base_lid(p_physp)); - lmc = osm_physp_get_lmc( p_physp ); - *p_max_lid = (uint16_t)(*p_base_lid + (1<<lmc)-1); } } OSM_LOG_ENTER( p_rcv->p_log, __osm_lr_rcv_get_physp_link ); @@ -312,8 +304,8 @@ __osm_lr_rcv_get_physp_link( dest_port_num ); } - __get_lid_range(p_src_physp, &from_base_lid_ho, &from_max_lid_ho); - __get_lid_range(p_dest_physp, &to_base_lid_ho, &to_max_lid_ho); + __get_base_lid(p_src_physp, &from_base_lid_ho); + __get_base_lid(p_dest_physp, &to_base_lid_ho); __osm_lr_rcv_build_physp_link(p_rcv, cl_ntoh16(from_base_lid_ho), cl_ntoh16(to_base_lid_ho), From halr at voltaire.com Tue Jun 20 10:42:18 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Jun 2006 13:42:18 -0400 Subject: [openib-general] ib_gid lookup In-Reply-To: <1150798111.4391.111384.camel@hal.voltaire.com> References: <1150798111.4391.111384.camel@hal.voltaire.com> Message-ID: <1150825337.4391.128609.camel@hal.voltaire.com> Hi again Amit, On Tue, 2006-06-20 at 06:08, Hal Rosenstock wrote: > Hi Amit, > > On Mon, 2006-06-19 at 20:36, Amit Byron wrote: > > hello, > > i'm trying to find whether i can do a lookup of ib_gid by either > > node name or node's ip address. is this information available from > > the subnet manager? > > The SM doesn't know the node name but you might be able to do this by > NodeDescription depending on how the subnet was setup (the > NodeDescriptions would need to be made unique on each node; a script for > this was supplied for mthca; there is also a current standards issue > with the SM detecting that these had changed which is being worked on).
> If that were to be done, the SA could be queried by NodeDescription > which would return a NodeRecord which would obtain the NodeInfo which > includes the NodeGUID and PortGUID. Note it also returns the base LID as > well. > > The SM does not know the IP addresses unless they are registered by DAPL > (via ServiceRecords) but I'm not sure that is done anymore or whether > DAPL runs in your environment. Generating an ARP to the IP address could resolve the GID. This API is exposed through the RDMA CM (in both kernel and user space). That might be your best option. -- Hal > -- Hal > > > thanks, > > Amit. > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From koop at cse.ohio-state.edu Tue Jun 20 11:35:09 2006 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Tue, 20 Jun 2006 14:35:09 -0400 (EDT) Subject: [openib-general] [mvapich-discuss] mvapich xhpl memory usage In-Reply-To: <0D6FBA307D01EA42BAC8715725643AA01EDB53@EXCHG2003.microtech-ks.com> Message-ID: Brady, It appears that the OFED 1.0 release uses a script other than the make.mvapich.gen2 script to specify the CFLAGS before building the RPMs. When using the default MVAPICH from our website/svn make.mvapich.gen2 is still correct though. For this reason, your change did not update the compilation flags. To change the OFED 1.0 CFLAGS you will need to edit the "mvapich.make" script (instead of make.mvapich.gen2) in mvapich-0.9.7-mlx2.1.0 and remove "-DLAZY_MEM_UNREGISTER" from line 308. 
Please let us know if you have any other questions or if this does not solve your issue. Thanks, Matthew Koop - Network-Based Computing Laboratory Ohio State University > Hello, I installed OFED 1.0 (mvapich 0.9.7) and compiled the Linpack > benchmark. When I run xhpl, the memory usage creeps up with each NB, > and as each N changes the allocated memory is not freed. > LAZY_MEM_UNREGISTER is not defined per > http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2006-March/000057.html > I removed it from make.mvapich.gen2, tarred it back up, and reran the > install. From swise at opengridcomputing.com Tue Jun 20 13:03:08 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:03:08 -0500 Subject: [openib-general] [PATCH v2 1/2] iWARP changes to libibverbs. In-Reply-To: <20060620200304.20092.44110.stgit@stevo-desktop> References: <20060620200304.20092.44110.stgit@stevo-desktop> Message-ID: <20060620200308.20092.76324.stgit@stevo-desktop> Cache the node type (iWARP vs IB) in the ib_device struct to enable transport-dependent logic.
--- libibverbs/include/infiniband/verbs.h | 44 ++++++++++++++++++++++++++++++++- libibverbs/src/device.c | 16 ++++++++++++ 2 files changed, 59 insertions(+), 1 deletions(-) diff --git a/libibverbs/include/infiniband/verbs.h b/libibverbs/include/infiniband/verbs.h index 7679436..0ff97e9 100644 --- a/libibverbs/include/infiniband/verbs.h +++ b/libibverbs/include/infiniband/verbs.h @@ -66,9 +66,17 @@ union ibv_gid { }; enum ibv_node_type { + IBV_NODE_UNKNOWN=-1, IBV_NODE_CA = 1, IBV_NODE_SWITCH, - IBV_NODE_ROUTER + IBV_NODE_ROUTER, + IBV_NODE_RNIC +}; + +enum ibv_transport_type { + IBV_TRANSPORT_UNKNOWN=0, + IBV_TRANSPORT_IB=1, + IBV_TRANSPORT_IWARP=2 }; enum ibv_device_cap_flags { @@ -574,6 +582,7 @@ enum { struct ibv_device { struct ibv_driver *driver; + enum ibv_node_type node_type; struct ibv_device_ops ops; /* Name of underlying kernel IB device, eg "mthca0" */ char name[IBV_SYSFS_NAME_MAX]; @@ -673,6 +682,39 @@ const char *ibv_get_device_name(struct i uint64_t ibv_get_device_guid(struct ibv_device *device); /** + * ibv_get_transport_type - Return device's network transport type + */ +static inline enum ibv_transport_type +ibv_get_transport_type(struct ibv_context *context) +{ + if (!context->device) + return IBV_TRANSPORT_UNKNOWN; + + switch (context->device->node_type) { + case IBV_NODE_CA: + case IBV_NODE_SWITCH: + case IBV_NODE_ROUTER: + return IBV_TRANSPORT_IB; + case IBV_NODE_RNIC: + return IBV_TRANSPORT_IWARP; + default: + return IBV_TRANSPORT_UNKNOWN; + } +} + +/** + * ibv_get_node_type - Return device's node type + */ +static inline enum ibv_node_type +ibv_get_node_type(struct ibv_context *context) +{ + if (!context->device) + return IBV_NODE_UNKNOWN; + + return context->device->node_type; +} + +/** * ibv_open_device - Initialize device for use */ struct ibv_context *ibv_open_device(struct ibv_device *device); diff --git a/libibverbs/src/device.c b/libibverbs/src/device.c index de97d4d..f08059e 100644 --- a/libibverbs/src/device.c +++ 
b/libibverbs/src/device.c @@ -107,6 +107,20 @@ uint64_t ibv_get_device_guid(struct ibv_ return htonll(guid); } +static enum ibv_node_type query_node_type(struct ibv_device *device) +{ + char node_desc[24]; + char node_str[24]; + int node_type; + + if (ibv_read_sysfs_file(device->ibdev_path, "node_type", + node_desc, sizeof(node_desc)) < 0) + return IBV_NODE_UNKNOWN; + + sscanf(node_desc, "%d: %s\n", (int*)&node_type, node_str); + return (enum ibv_node_type) node_type; +} + struct ibv_context *ibv_open_device(struct ibv_device *device) { char *devpath; @@ -125,6 +139,8 @@ struct ibv_context *ibv_open_device(stru if (cmd_fd < 0) return NULL; + device->node_type = query_node_type(device); + context = device->ops.alloc_context(device, cmd_fd); if (!context) goto err; From swise at opengridcomputing.com Tue Jun 20 13:03:04 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:03:04 -0500 Subject: [openib-general] [PATCH v2 0/2] [RFC] iWARP Core Usermode Support Message-ID: <20060620200304.20092.44110.stgit@stevo-desktop> This patchset defines the modifications to the Open Fabrics gen2 userspace tree to support iWARP devices. This is the 2nd review of most of these changes and we have incorporated all comments from the 1st review. We're submitting it for review with the goal for inclusion in the gen2 svn trunk. It is not dependent on the kernel iWARP patchset currently under review, so we could commit this to the svn trunk now if desired. This patchset is based on revision 7620 of the svn trunk. It consists of 2 patches: 1 - Changes to libibverbs/ 2 - Changes to librdmacm/ Signed-off-by: Tom Tucker Signed-off-by: Steve Wise From swise at opengridcomputing.com Tue Jun 20 13:03:12 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:03:12 -0500 Subject: [openib-general] [PATCH v2 2/2] iWARP changes to librdmacm. 
In-Reply-To: <20060620200304.20092.44110.stgit@stevo-desktop> References: <20060620200304.20092.44110.stgit@stevo-desktop> Message-ID: <20060620200312.20092.87834.stgit@stevo-desktop> For iWARP, rdma_disconnect() moves the QP to SQD instead of ERR. The iWARP providers map SQD to the RDMAC verbs CLOSING state. --- librdmacm/src/cma.c | 22 +++++++++++++++++++++- 1 files changed, 21 insertions(+), 1 deletions(-) diff --git a/librdmacm/src/cma.c b/librdmacm/src/cma.c index e99d15c..a250f69 100644 --- a/librdmacm/src/cma.c +++ b/librdmacm/src/cma.c @@ -633,6 +633,17 @@ static int ucma_modify_qp_rts(struct rdm return ibv_modify_qp(id->qp, &qp_attr, qp_attr_mask); } +static int ucma_modify_qp_sqd(struct rdma_cm_id *id) +{ + struct ibv_qp_attr qp_attr; + + if (!id->qp) + return 0; + + qp_attr.qp_state = IBV_QPS_SQD; + return ibv_modify_qp(id->qp, &qp_attr, IBV_QP_STATE); +} + static int ucma_modify_qp_err(struct rdma_cm_id *id) { struct ibv_qp_attr qp_attr; @@ -881,7 +892,16 @@ int rdma_disconnect(struct rdma_cm_id *i void *msg; int ret, size; - ret = ucma_modify_qp_err(id); + switch (ibv_get_transport_type(id->verbs)) { + case IBV_TRANSPORT_IB: + ret = ucma_modify_qp_err(id); + break; + case IBV_TRANSPORT_IWARP: + ret = ucma_modify_qp_sqd(id); + break; + default: + ret = -EINVAL; + } if (ret) return ret; From swise at opengridcomputing.com Tue Jun 20 13:04:30 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:04:30 -0500 Subject: [openib-general] [PATCH v1 0/2] [RFC] Ammasso 1100 iWARP Library Message-ID: <20060620200430.20732.58792.stgit@stevo-desktop> This patchset implements a user verbs library for the Ammasso 1100 device. We're submitting it for review with the goal for inclusion in the gen2 trunk. 
Signed-off-by: Tom Tucker Signed-off-by: Steve Wise From swise at opengridcomputing.com Tue Jun 20 13:04:39 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:04:39 -0500 Subject: [openib-general] [PATCH v1 2/2] AMSO1100 Makefiles. In-Reply-To: <20060620200430.20732.58792.stgit@stevo-desktop> References: <20060620200430.20732.58792.stgit@stevo-desktop> Message-ID: <20060620200439.20732.71569.stgit@stevo-desktop> --- libamso/Makefile.am | 27 +++++++++++++++++++++++ libamso/autogen.sh | 8 +++++++ libamso/configure.in | 41 ++++++++++++++++++++++++++++++++++ libamso/libamso.spec.in | 56 +++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 132 insertions(+), 0 deletions(-) diff --git a/libamso/Makefile.am b/libamso/Makefile.am new file mode 100644 index 0000000..9e2cbc1 --- /dev/null +++ b/libamso/Makefile.am @@ -0,0 +1,27 @@ +# $Id: $ + +amsolibdir = $(libdir)/infiniband + +amsolib_LTLIBRARIES = src/amso.la + +src_amso_la_CFLAGS = -g -Wall -D_GNU_SOURCE + +if HAVE_LD_VERSION_SCRIPT + amso_version_script = -Wl,--version-script=$(srcdir)/src/amso.map +else + amso_version_script = +endif + +src_amso_la_SOURCES = src/cq.c src/amso.c src/qp.c \ + src/verbs.c +src_amso_la_LDFLAGS = -avoid-version -module \ + $(amso_version_script) + +#DEBIAN = debian/changelog debian/compat debian/control debian/copyright \ +# debian/libamso1.install debian/libamso-dev.install debian/rules + +EXTRA_DIST = src/amso.h src/amso-abi.h \ + src/amso.map libamso.spec.in $(DEBIAN) + +dist-hook: libamso.spec + cp libamso.spec $(distdir) diff --git a/libamso/autogen.sh b/libamso/autogen.sh new file mode 100755 index 0000000..fd47839 --- /dev/null +++ b/libamso/autogen.sh @@ -0,0 +1,8 @@ +#! 
/bin/sh + +set -x +aclocal -I config +libtoolize --force --copy +autoheader +automake --foreign --add-missing --copy +autoconf diff --git a/libamso/configure.in b/libamso/configure.in new file mode 100644 index 0000000..4a920c4 --- /dev/null +++ b/libamso/configure.in @@ -0,0 +1,41 @@ +dnl Process this file with autoconf to produce a configure script. + +AC_PREREQ(2.57) +AC_INIT(libamso, 1.0-rc4, openib-general at openib.org) +AC_CONFIG_SRCDIR([src/amso.h]) +AC_CONFIG_AUX_DIR(config) +AM_CONFIG_HEADER(config.h) +AM_INIT_AUTOMAKE(libamso, 1.0-rc4) +AM_PROG_LIBTOOL + +dnl Checks for programs +AC_PROG_CC + +dnl Checks for libraries +AC_CHECK_LIB(ibverbs, ibv_get_device_list, [], + AC_MSG_ERROR([ibv_get_device_list() not found. libamso requires libibverbs.])) + +dnl Checks for header files. +AC_CHECK_HEADERS(sysfs/libsysfs.h) +AC_CHECK_HEADER(infiniband/driver.h, [], + AC_MSG_ERROR([<infiniband/driver.h> not found. Is libibverbs installed?])) +AC_HEADER_STDC + +dnl Checks for typedefs, structures, and compiler characteristics.
+AC_C_CONST +AC_CHECK_SIZEOF(long) + +dnl Checks for library functions +AC_CHECK_FUNCS(ibv_read_sysfs_file) + +AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, + if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then + ac_cv_version_script=yes + else + ac_cv_version_script=no + fi) + +AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") + +AC_CONFIG_FILES([Makefile libamso.spec]) +AC_OUTPUT diff --git a/libamso/libamso.spec.in b/libamso/libamso.spec.in new file mode 100644 index 0000000..1bbb9cb --- /dev/null +++ b/libamso/libamso.spec.in @@ -0,0 +1,56 @@ +# $Id: $ + +%define ver @VERSION@ + +Name: libamso +Version: 1.0 +Release: 0.2.rc4%{?dist} +Summary: AMSO1100 Userspace Library + +Group: System Environment/Libraries +License: GPL/BSD +Url: http://openib.org/ +Source: http://openib.org/downloads/%{name}-%{ver}.tar.gz +BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) +BuildRequires: libibverbs-devel + +%description +libamso provides a device-specific userspace driver for Ammasso RNICs +for use with the libibverbs library. + +%package devel +Summary: Development files for the libamso driver +Group: System Environment/Libraries +Requires: %{name} = %{version}-%{release} + +%description devel +Static version of libamso that may be linked directly to an +application, which may be useful for debugging.
+ +%prep +%setup -q -n %{name}-%{ver} + +%build +%configure +make %{?_smp_mflags} + +%install +rm -rf $RPM_BUILD_ROOT +%makeinstall +# remove unpackaged files from the buildroot +rm -f $RPM_BUILD_ROOT%{_libdir}/infiniband/*.la + +%clean +rm -rf $RPM_BUILD_ROOT + +%files +%defattr(-,root,root,-) +%{_libdir}/infiniband/amso.so +%doc AUTHORS COPYING ChangeLog README + +%files devel +%defattr(-,root,root,-) +%{_libdir}/infiniband/amso.a + +%changelog From swise at opengridcomputing.com Tue Jun 20 13:04:34 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:04:34 -0500 Subject: [openib-general] [PATCH v1 1/2] AMSO1100 Verbs Library. In-Reply-To: <20060620200430.20732.58792.stgit@stevo-desktop> References: <20060620200430.20732.58792.stgit@stevo-desktop> Message-ID: <20060620200434.20732.99171.stgit@stevo-desktop> This code implements user verbs for the Ammasso 1100 device. This library doesn't do kernel bypass (but it could someday). --- libamso/src/amso-abi.h | 79 +++++++++++++ libamso/src/amso.c | 180 +++++++++++++++++++++++++++++ libamso/src/amso.h | 156 +++++++++++++++++++++++++ libamso/src/amso.map | 6 + libamso/src/cq.c | 57 +++++++++ libamso/src/qp.c | 55 +++++++++ libamso/src/verbs.c | 303 ++++++++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 836 insertions(+), 0 deletions(-) diff --git a/libamso/src/amso-abi.h b/libamso/src/amso-abi.h new file mode 100644 index 0000000..a3df617 --- /dev/null +++ b/libamso/src/amso-abi.h @@ -0,0 +1,79 @@ +/* + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef AMSO_ABI_H +#define AMSO_ABI_H + +#include + +struct amso_alloc_ucontext_resp { + struct ibv_get_context_resp ibv_resp; +}; + +struct amso_alloc_pd_resp { + struct ibv_alloc_pd_resp ibv_resp; +}; + +struct amso_create_cq { + struct ibv_create_cq ibv_cmd; +}; + + +struct amso_create_cq_resp { + struct ibv_create_cq_resp ibv_resp; + __u32 cqid; + __u32 entries; + __u64 physaddr; /* library mmaps this to get addressability */ + __u64 queue; +}; + +struct amso_create_qp { + struct ibv_create_qp ibv_cmd; +}; + +struct amso_create_qp_resp { + struct ibv_create_qp_resp ibv_resp; + __u32 qpid; + __u32 entries; /* actual number of entries after creation */ + __u64 physaddr; /* library mmaps this to get addressability */ + __u64 physsize; /* library mmaps this to get addressability */ + __u64 queue; +}; + + +struct t3_cqe { + __u32 header:32; + __u32 len:32; + __u32 wrid_hi_stag:32; + __u32 wrid_low_msn:32; +}; + +#endif /* AMSO_ABI_H */ diff --git a/libamso/src/amso.c b/libamso/src/amso.c new file mode 100644 index 0000000..c017281 --- /dev/null +++ b/libamso/src/amso.c @@ -0,0 +1,180 @@ +/* + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include + +#include "amso.h" +#include "amso-abi.h" + +#define PCI_VENDOR_ID_AMSO 0x18b8 +#define PCI_DEVICE_ID_AMSO_1100 0xb001 + +#define HCA(v, d, t) \ + { .vendor = PCI_VENDOR_ID_##v, \ + .device = PCI_DEVICE_ID_AMSO_##d, \ + .type = AMSO_##t } + +struct { + unsigned vendor; + unsigned device; + enum amso_hca_type type; +} hca_table[] = { + HCA(AMSO, 1100, 1100), +}; + +static struct ibv_context_ops amso_ctx_ops = { + .query_device = amso_query_device, + .query_port = amso_query_port, + .alloc_pd = amso_alloc_pd, + .dealloc_pd = amso_free_pd, + .reg_mr = amso_reg_mr, + .dereg_mr = amso_dereg_mr, + .create_cq = amso_create_cq, + .resize_cq = amso_resize_cq, + .poll_cq = amso_poll_cq, + .destroy_cq = amso_destroy_cq, + .create_srq = amso_create_srq, + .modify_srq = amso_modify_srq, + .destroy_srq = amso_destroy_srq, + .create_qp = amso_create_qp, + .modify_qp = amso_modify_qp, + .destroy_qp = amso_destroy_qp, + .create_ah = amso_create_ah, + .destroy_ah = amso_destroy_ah, + .attach_mcast = amso_attach_mcast, + .detach_mcast = amso_detach_mcast +}; + +static struct ibv_context *amso_alloc_context(struct ibv_device *ibdev, + int 
cmd_fd) +{ + struct amso_context *context; + struct ibv_get_context cmd; + struct amso_alloc_ucontext_resp resp; + + context = malloc(sizeof *context); + if (!context) + return NULL; + + context->ibv_ctx.cmd_fd = cmd_fd; + + if (ibv_cmd_get_context(&context->ibv_ctx, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp)) + goto err_free; + + context->ibv_ctx.device = ibdev; + context->ibv_ctx.ops = amso_ctx_ops; + context->ibv_ctx.ops.req_notify_cq = amso_arm_cq; + context->ibv_ctx.ops.cq_event = NULL; + context->ibv_ctx.ops.post_send = amso_post_send; + context->ibv_ctx.ops.post_recv = amso_post_recv; + context->ibv_ctx.ops.post_srq_recv = amso_post_srq_recv; + + return &context->ibv_ctx; +err_free: + free(context); + return NULL; +} + +static void amso_free_context(struct ibv_context *ibctx) +{ + struct amso_context *context = to_amso_ctx(ibctx); + + free(context); +} + +static struct ibv_device_ops amso_dev_ops = { + .alloc_context = amso_alloc_context, + .free_context = amso_free_context +}; + +struct ibv_device *ibv_driver_init(const char *uverbs_sys_path, + int abi_version) +{ + char value[8]; + struct amso_device *dev; + unsigned vendor, device; + int i; + + if (ibv_read_sysfs_file(uverbs_sys_path, "device/vendor", + value, sizeof value) < 0) + return NULL; + sscanf(value, "%i", &vendor); + + if (ibv_read_sysfs_file(uverbs_sys_path, "device/device", + value, sizeof value) < 0) + return NULL; + sscanf(value, "%i", &device); + + + for (i = 0; i < sizeof hca_table / sizeof hca_table[0]; ++i) + if (vendor == hca_table[i].vendor && + device == hca_table[i].device) + goto found; + + return NULL; + +found: + dev = malloc(sizeof *dev); + if (!dev) { + return NULL; + } + + dev->ibv_dev.ops = amso_dev_ops; + dev->hca_type = hca_table[i].type; + dev->page_size = sysconf(_SC_PAGESIZE); + + return &dev->ibv_dev; +} + +#ifdef HAVE_SYSFS_LIBSYSFS_H +struct ibv_device *openib_driver_init(struct sysfs_class_device *sysdev) +{ + int abi_ver = 0; + char value[8]; + + if 
(ibv_read_sysfs_file(sysdev->path, "abi_version", + value, sizeof value) > 0) + abi_ver = strtol(value, NULL, 10); + + return ibv_driver_init(sysdev->path, abi_ver); +} +#endif /* HAVE_SYSFS_LIBSYSFS_H */ diff --git a/libamso/src/amso.h b/libamso/src/amso.h new file mode 100644 index 0000000..eea4319 --- /dev/null +++ b/libamso/src/amso.h @@ -0,0 +1,156 @@ +/* + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef AMSO_H +#define AMSO_H + +#include +#include + +#define HIDDEN __attribute__((visibility ("hidden"))) + +#define PFX "amso: " + +enum amso_hca_type { + AMSO_1100 +}; + +struct amso_device { + struct ibv_device ibv_dev; + enum amso_hca_type hca_type; + int page_size; +}; + +struct amso_context { + struct ibv_context ibv_ctx; +}; + +struct amso_pd { + struct ibv_pd ibv_pd; +}; + +struct amso_cq { + struct ibv_cq ibv_cq; + __u32 cqid; + __u32 entries; + __u64 physaddr; + __u64 queue; +}; + +struct amso_qp { + struct ibv_qp ibv_qp; + __u32 qpid; + __u32 entries; + __u64 physaddr; + __u64 physsize; + __u64 queue; +}; + +#define to_amso_xxx(xxx, type) \ + ((struct amso_##type *) \ + ((void *) ib##xxx - offsetof(struct amso_##type, ibv_##xxx))) + +static inline struct amso_device *to_amso_dev(struct ibv_device *ibdev) +{ + return to_amso_xxx(dev, device); +} + +static inline struct amso_context *to_amso_ctx(struct ibv_context *ibctx) +{ + return to_amso_xxx(ctx, context); +} + +static inline struct amso_pd *to_amso_pd(struct ibv_pd *ibpd) +{ + return to_amso_xxx(pd, pd); +} + +static inline struct amso_cq *to_amso_cq(struct ibv_cq *ibcq) +{ + return to_amso_xxx(cq, cq); +} + +static inline struct amso_qp *to_amso_qp(struct ibv_qp *ibqp) +{ + return to_amso_xxx(qp, qp); +} + + +extern int amso_query_device(struct ibv_context *context, + struct ibv_device_attr *attr); +extern int amso_query_port(struct ibv_context *context, uint8_t port, + struct ibv_port_attr *attr); + +extern struct ibv_pd *amso_alloc_pd(struct ibv_context *context); +extern int amso_free_pd(struct ibv_pd *pd); + +extern struct ibv_mr *amso_reg_mr(struct ibv_pd *pd, void *addr, + size_t length, enum ibv_access_flags access); +extern int amso_dereg_mr(struct ibv_mr *mr); + +struct ibv_cq *amso_create_cq(struct ibv_context *context, int cqe, + struct ibv_comp_channel *channel, + int comp_vector); +extern int amso_resize_cq(struct ibv_cq *cq, int cqe); +extern int amso_destroy_cq(struct ibv_cq 
*cq); +extern int amso_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); +extern int amso_arm_cq(struct ibv_cq *cq, int solicited); +extern void amso_cq_event(struct ibv_cq *cq); +extern void amso_init_cq_buf(struct amso_cq *cq, int nent); + +extern struct ibv_srq *amso_create_srq(struct ibv_pd *pd, + struct ibv_srq_init_attr *attr); +extern int amso_modify_srq(struct ibv_srq *srq, + struct ibv_srq_attr *attr, + enum ibv_srq_attr_mask mask); +extern int amso_destroy_srq(struct ibv_srq *srq); +extern int amso_post_srq_recv(struct ibv_srq *ibsrq, + struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr); + +extern struct ibv_qp *amso_create_qp(struct ibv_pd *pd, + struct ibv_qp_init_attr *attr); +extern int amso_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + enum ibv_qp_attr_mask attr_mask); +extern int amso_destroy_qp(struct ibv_qp *qp); +extern int amso_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, + struct ibv_send_wr **bad_wr); +extern int amso_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr); +extern struct ibv_ah *amso_create_ah(struct ibv_pd *pd, + struct ibv_ah_attr *ah_attr); +extern int amso_destroy_ah(struct ibv_ah *ah); +extern int amso_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, + uint16_t lid); +extern int amso_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, + uint16_t lid); + +#endif /* AMSO_H */ diff --git a/libamso/src/amso.map b/libamso/src/amso.map new file mode 100644 index 0000000..59a8bae --- /dev/null +++ b/libamso/src/amso.map @@ -0,0 +1,6 @@ +{ + global: + ibv_driver_init; + openib_driver_init; + local: *; +}; diff --git a/libamso/src/cq.c b/libamso/src/cq.c new file mode 100644 index 0000000..65360ce --- /dev/null +++ b/libamso/src/cq.c @@ -0,0 +1,57 @@ +/* + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include + +#include + +#include "amso.h" +#include "amso-abi.h" + + +int amso_poll_cq(struct ibv_cq *ibcq, int ne, struct ibv_wc *wc) +{ + return ibv_cmd_poll_cq(ibcq, ne, wc); +} + + +int amso_arm_cq(struct ibv_cq *cq, int solicited) +{ + return ibv_cmd_req_notify_cq(cq, solicited); +} + + diff --git a/libamso/src/qp.c b/libamso/src/qp.c new file mode 100644 index 0000000..e0d99bb --- /dev/null +++ b/libamso/src/qp.c @@ -0,0 +1,55 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include + +#include "amso.h" +#include + +int amso_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, + struct ibv_send_wr **bad_wr) +{ + return ibv_cmd_post_send(ibqp, wr, bad_wr); +} + +int amso_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + return ibv_cmd_post_recv(ibqp, wr, bad_wr); +} + diff --git a/libamso/src/verbs.c b/libamso/src/verbs.c new file mode 100644 index 0000000..1cd79d8 --- /dev/null +++ b/libamso/src/verbs.c @@ -0,0 +1,303 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include + +#include "amso.h" +#include "amso-abi.h" + + +int amso_query_device(struct ibv_context *context, struct ibv_device_attr *attr) +{ + struct ibv_query_device cmd; + uint64_t raw_fw_ver; + unsigned major, minor, sub_minor; + int ret; + + ret = + ibv_cmd_query_device(context, attr, &raw_fw_ver, &cmd, sizeof cmd); + if (ret) + return ret; + + major = (raw_fw_ver >> 32) & 0xffff; + minor = (raw_fw_ver >> 16) & 0xffff; + sub_minor = raw_fw_ver & 0xffff; + + snprintf(attr->fw_ver, sizeof attr->fw_ver, + "%d.%d.%d", major, minor, sub_minor); + + return 0; +} + +int amso_query_port(struct ibv_context *context, uint8_t port, + struct ibv_port_attr *attr) +{ + struct ibv_query_port cmd; + + return ibv_cmd_query_port(context, port, attr, &cmd, sizeof cmd); +} + +struct ibv_pd *amso_alloc_pd(struct ibv_context *context) +{ + struct ibv_alloc_pd cmd; + struct amso_alloc_pd_resp resp; + struct amso_pd *pd; + + pd = malloc(sizeof *pd); + if (!pd) + return NULL; + + if (ibv_cmd_alloc_pd(context, &pd->ibv_pd, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp)) { + free(pd); + return NULL; + } + + return &pd->ibv_pd; +} + +int amso_free_pd(struct ibv_pd *pd) +{ + int ret; + + ret = ibv_cmd_dealloc_pd(pd); + if (ret) + return ret; + + free(pd); + return 0; +} + +static struct ibv_mr *__amso_reg_mr(struct ibv_pd *pd, void *addr, + size_t length, uint64_t hca_va, + enum ibv_access_flags access) +{ + struct ibv_mr *mr; + struct ibv_reg_mr cmd; + + mr = malloc(sizeof *mr); + if (!mr) + return NULL; + + if (ibv_cmd_reg_mr(pd, addr, length, hca_va, + access, mr, &cmd, sizeof cmd)) { + free(mr); + return NULL; + } + + return mr; +} + +struct ibv_mr *amso_reg_mr(struct ibv_pd *pd, void *addr, + size_t length, enum ibv_access_flags access) +{ + return __amso_reg_mr(pd, addr, length, (uintptr_t) addr, access); +} + +int amso_dereg_mr(struct ibv_mr *mr) +{ + int 
ret; + + ret = ibv_cmd_dereg_mr(mr); + if (ret) + return ret; + + free(mr); + return 0; +} + +struct ibv_cq *amso_create_cq(struct ibv_context *context, int cqe, + struct ibv_comp_channel *channel, int comp_vector) +{ + struct amso_create_cq cmd; + struct amso_create_cq_resp resp; + struct amso_cq *cq; + int ret; + + cq = malloc(sizeof *cq); + if (!cq) { + goto err; + } + + ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector, + &cq->ibv_cq, &cmd.ibv_cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); + if (ret) + goto err; + +#if 0 /* A reminder for bypass functionality */ + cq->physaddr = resp.physaddr; + cq->queue = + (unsigned long) mmap(NULL, cqe * sizeof(struct t3_cqe), PROT_WRITE, + MAP_SHARED, context->cmd_fd, cq->physaddr); +#endif + + return &cq->ibv_cq; + + +err: + free(cq); + + return NULL; +} + +int amso_resize_cq(struct ibv_cq *cq, int cqe) +{ + int ret; + struct ibv_resize_cq cmd; + + ret = ibv_cmd_resize_cq(cq, cqe, &cmd, sizeof cmd); + if (ret) + return ret; + /* We will need to unmap and remap when we implement user mode */ + + return 0; +} + +int amso_destroy_cq(struct ibv_cq *cq) +{ + int ret; + + ret = ibv_cmd_destroy_cq(cq); + if (ret) + return ret; + + return 0; +} + +struct ibv_srq *amso_create_srq(struct ibv_pd *pd, + struct ibv_srq_init_attr *attr) +{ + return (void *) -ENOSYS; +} + +int amso_modify_srq(struct ibv_srq *srq, + struct ibv_srq_attr *attr, enum ibv_srq_attr_mask attr_mask) +{ + return -ENOSYS; +} + +int amso_destroy_srq(struct ibv_srq *srq) +{ + return -ENOSYS; +} + +int amso_post_srq_recv(struct ibv_srq *ibsrq, + struct ibv_recv_wr *wr, struct ibv_recv_wr **bad_wr) +{ + return -ENOSYS; +} + +struct ibv_qp *amso_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) +{ + struct amso_create_qp cmd; + struct amso_create_qp_resp resp; + struct amso_qp *qp; + int ret; + + /* Sanity check QP size before proceeding */ + if (attr->cap.max_send_wr > 65536 || + attr->cap.max_recv_wr > 65536 || + attr->cap.max_send_sge > 4 || + 
attr->cap.max_recv_sge > 4 || attr->cap.max_inline_data > 1024) + return NULL; + + qp = malloc(sizeof *qp); + if (!qp) + return NULL; + + ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd.ibv_cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); + if (ret) { + free(qp); + return NULL; + } + +#if 0 /* A reminder for bypass functionality */ + qp->physaddr = resp.physaddr; +#endif + + return &qp->ibv_qp; +} + +int amso_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + enum ibv_qp_attr_mask attr_mask) +{ + struct ibv_modify_qp cmd; + + return ibv_cmd_modify_qp(qp, attr, attr_mask, &cmd, sizeof cmd); +} + +int amso_destroy_qp(struct ibv_qp *qp) +{ + int ret; + + ret = ibv_cmd_destroy_qp(qp); + if (ret) + return ret; + + free(qp); + + return 0; +} + +struct ibv_ah *amso_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr) +{ + return (void *) -ENOSYS; +} + +int amso_destroy_ah(struct ibv_ah *ah) +{ + return -ENOSYS; +} + +int amso_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) +{ + return -ENOSYS; +} + +int amso_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) +{ + return -ENOSYS; +} + From swise at opengridcomputing.com Tue Jun 20 13:24:42 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:24:42 -0500 Subject: [openib-general] [PATCH v3 0/2][RFC] iWARP Core Support Message-ID: <20060620202442.28922.27402.stgit@stevo-desktop> This patchset defines the modifications to the Linux infiniband subsystem to support iWARP devices. We're submitting it for review now with the goal of inclusion in the 2.6.19 kernel. This code has gone through several reviews on the openib-general list. Now we are submitting it for external review by the Linux community. This StGIT patchset is cloned from Roland Dreier's infiniband.git for-2.6.19 branch. The patchset consists of 2 patches: 1 - New iWARP CM implementation. 2 - Core changes to support iWARP. I believe I've addressed all the round 1 and 2 review comments.
Details of the changes are tracked in each patch comment. Signed-off-by: Tom Tucker Signed-off-by: Steve Wise From swise at opengridcomputing.com Tue Jun 20 13:24:47 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:24:47 -0500 Subject: [openib-general] [PATCH v3 1/2] iWARP Connection Manager. In-Reply-To: <20060620202442.28922.27402.stgit@stevo-desktop> References: <20060620202442.28922.27402.stgit@stevo-desktop> Message-ID: <20060620202447.28922.42550.stgit@stevo-desktop> This patch provides the new files implementing the iWARP Connection Manager. This module is a logical instance of the xx_cm where xx is the transport type (ib or iw). The symbols exported are used by the transport-independent rdma_cm module, and are also available to transport-dependent ULPs. V2 Review Changes: - BUG_ON(1) -> BUG() - Don't typecast when assigning between something* and void* - pre-allocate iwcm_work objects to avoid allocating them in the interrupt context. - copy private data on connect request and connect reply events. - #if !defined() -> #ifndef V1 Review Changes: - sizeof -> sizeof() - removed printks - removed TT debug code - cleaned up lock/unlock around switch statements. - waitqueue -> completion for destroy path. --- drivers/infiniband/core/iwcm.c | 1008 ++++++++++++++++++++++++++++++++++++++++ include/rdma/iw_cm.h | 255 ++++++++++ include/rdma/iw_cm_private.h | 63 +++ 3 files changed, 1326 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c new file mode 100644 index 0000000..fe43c00 --- /dev/null +++ b/drivers/infiniband/core/iwcm.c @@ -0,0 +1,1008 @@ +/* + * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc.
All rights reserved. + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Tom Tucker"); +MODULE_DESCRIPTION("iWARP CM"); +MODULE_LICENSE("Dual BSD/GPL"); + +static struct workqueue_struct *iwcm_wq; +struct iwcm_work { + struct work_struct work; + struct iwcm_id_private *cm_id; + struct list_head list; + struct iw_cm_event event; + struct list_head free_list; +}; + +/* + * The following services provide a mechanism for pre-allocating iwcm_work + * elements. 
The design pre-allocates them based on the cm_id type: + * LISTENING IDS: Get enough elements preallocated to handle the + * listen backlog. + * ACTIVE IDS: 4: CONNECT_REPLY, ESTABLISHED, DISCONNECT, CLOSE + * PASSIVE IDS: 3: ESTABLISHED, DISCONNECT, CLOSE + * + * Allocating them in connect and listen avoids having to deal + * with allocation failures on the event upcall from the provider (which + * is called in the interrupt context). + * + * One exception is when creating the cm_id for incoming connection requests. + * There are two cases: + * 1) in the event upcall, cm_event_handler(), for a listening cm_id. If + * the backlog is exceeded, then no more connection request events will + * be processed. cm_event_handler() returns -ENOMEM in this case. It's up + * to the provider to reject the connection request. + * 2) in the connection request workqueue handler, cm_conn_req_handler(). + * If work elements cannot be allocated for the new connect request cm_id, + * then IWCM will call the provider reject method. This is ok since + * cm_conn_req_handler() runs in the workqueue thread context.
+ */ + +static struct iwcm_work *get_work(struct iwcm_id_private *cm_id_priv) +{ + struct iwcm_work *work; + + if (list_empty(&cm_id_priv->work_free_list)) + return NULL; + work = list_entry(cm_id_priv->work_free_list.next, struct iwcm_work, + free_list); + list_del_init(&work->free_list); + return work; +} + +static void put_work(struct iwcm_work *work) +{ + list_add(&work->free_list, &work->cm_id->work_free_list); +} + +static void dealloc_work_entries(struct iwcm_id_private *cm_id_priv) +{ + struct list_head *e, *tmp; + + list_for_each_safe(e, tmp, &cm_id_priv->work_free_list) + kfree(list_entry(e, struct iwcm_work, free_list)); +} + +static int alloc_work_entries(struct iwcm_id_private *cm_id_priv, int count) +{ + struct iwcm_work *work; + + BUG_ON(!list_empty(&cm_id_priv->work_free_list)); + while (count--) { + work = kmalloc(sizeof(struct iwcm_work), GFP_KERNEL); + if (!work) { + dealloc_work_entries(cm_id_priv); + return -ENOMEM; + } + work->cm_id = cm_id_priv; + INIT_LIST_HEAD(&work->list); + put_work(work); + } + return 0; +} + +/* + * Save private data from incoming connection requests in the + * cm_id_priv so the low level driver doesn't have to. Adjust + * the event ptr to point to the local copy. + */ +static int copy_private_data(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *event) +{ + void *p; + + p = kmalloc(event->private_data_len, GFP_ATOMIC); + if (!p) + return -ENOMEM; + memcpy(p, event->private_data, event->private_data_len); + event->private_data = p; + return 0; +} + +/* + * Release a reference on cm_id. If the last reference is being removed + * and iw_destroy_cm_id is waiting, wake up the waiting thread. 
+ */ +static int iwcm_deref_id(struct iwcm_id_private *cm_id_priv) +{ + int ret = 0; + + BUG_ON(atomic_read(&cm_id_priv->refcount)==0); + if (atomic_dec_and_test(&cm_id_priv->refcount)) { + BUG_ON(!list_empty(&cm_id_priv->work_list)); + if (waitqueue_active(&cm_id_priv->destroy_comp.wait)) { + BUG_ON(cm_id_priv->state != IW_CM_STATE_DESTROYING); + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, + &cm_id_priv->flags)); + ret = 1; + } + complete(&cm_id_priv->destroy_comp); + } + + return ret; +} + +static void add_ref(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + atomic_inc(&cm_id_priv->refcount); +} + +static void rem_ref(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + iwcm_deref_id(cm_id_priv); +} + +static int cm_event_handler(struct iw_cm_id *cm_id, struct iw_cm_event *event); + +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context) +{ + struct iwcm_id_private *cm_id_priv; + + cm_id_priv = kzalloc(sizeof(*cm_id_priv), GFP_KERNEL); + if (!cm_id_priv) + return ERR_PTR(-ENOMEM); + + cm_id_priv->state = IW_CM_STATE_IDLE; + cm_id_priv->id.device = device; + cm_id_priv->id.cm_handler = cm_handler; + cm_id_priv->id.context = context; + cm_id_priv->id.event_handler = cm_event_handler; + cm_id_priv->id.add_ref = add_ref; + cm_id_priv->id.rem_ref = rem_ref; + spin_lock_init(&cm_id_priv->lock); + atomic_set(&cm_id_priv->refcount, 1); + init_waitqueue_head(&cm_id_priv->connect_wait); + init_completion(&cm_id_priv->destroy_comp); + INIT_LIST_HEAD(&cm_id_priv->work_list); + INIT_LIST_HEAD(&cm_id_priv->work_free_list); + + return &cm_id_priv->id; +} +EXPORT_SYMBOL(iw_create_cm_id); + + +static int iwcm_modify_qp_err(struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + if (!qp) + return -EINVAL; + + qp_attr.qp_state = IB_QPS_ERR; + return ib_modify_qp(qp, &qp_attr, 
IB_QP_STATE); +} + +/* + * This is really the RDMAC CLOSING state. It is most similar to the + * IB SQD QP state. + */ +static int iwcm_modify_qp_sqd(struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + BUG_ON(qp == NULL); + qp_attr.qp_state = IB_QPS_SQD; + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE); +} + +/* + * CM_ID <-- CLOSING + * + * Block if a passive or active connection is currently being processed. Then + * process the event as follows: + * - If we are ESTABLISHED, move to CLOSING and modify the QP state + * based on the abrupt flag + * - If the connection is already in the CLOSING or IDLE state, the peer is + * disconnecting concurrently with us and we've already seen the + * DISCONNECT event -- ignore the request and return 0 + * - Disconnect on a listening endpoint returns -EINVAL + */ +int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + struct ib_qp *qp = NULL; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + /* Wait if we're currently in a connect or accept downcall */ + wait_event(cm_id_priv->connect_wait, + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_ESTABLISHED: + cm_id_priv->state = IW_CM_STATE_CLOSING; + + /* QP could be for user-mode client */ + if (cm_id_priv->qp) + qp = cm_id_priv->qp; + else + ret = -EINVAL; + break; + case IW_CM_STATE_LISTEN: + ret = -EINVAL; + break; + case IW_CM_STATE_CLOSING: + /* remote peer closed first */ + case IW_CM_STATE_IDLE: + /* accept or connect returned !0 */ + break; + case IW_CM_STATE_CONN_RECV: + /* + * App called disconnect before/without calling accept after + * connect_request event delivered.
+ */ + break; + case IW_CM_STATE_CONN_SENT: + /* Can only get here if wait above fails */ + default: + BUG(); + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + if (qp) { + if (abrupt) + ret = iwcm_modify_qp_err(qp); + else + ret = iwcm_modify_qp_sqd(qp); + + /* + * If both sides are disconnecting the QP could + * already be in ERR or SQD states + */ + ret = 0; + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_disconnect); + +/* + * CM_ID <-- DESTROYING + * + * Clean up all resources associated with the connection and release + * the initial reference taken by iw_create_cm_id. + */ +static void destroy_cm_id(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + /* Wait if we're currently in a connect or accept downcall. A + * listening endpoint should never block here. */ + wait_event(cm_id_priv->connect_wait, + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_LISTEN: + cm_id_priv->state = IW_CM_STATE_DESTROYING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + /* destroy the listening endpoint */ + ret = cm_id->device->iwcm->destroy_listen(cm_id); + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_ESTABLISHED: + cm_id_priv->state = IW_CM_STATE_DESTROYING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + /* Abrupt close of the connection */ + (void)iwcm_modify_qp_err(cm_id_priv->qp); + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_IDLE: + case IW_CM_STATE_CLOSING: + cm_id_priv->state = IW_CM_STATE_DESTROYING; + break; + case IW_CM_STATE_CONN_RECV: + /* + * App called destroy before/without calling accept after + * receiving connection request event notification. 
+ */ + cm_id_priv->state = IW_CM_STATE_DESTROYING; + break; + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_DESTROYING: + default: + BUG(); + break; + } + if (cm_id_priv->qp) { + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); + cm_id_priv->qp = NULL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + (void)iwcm_deref_id(cm_id_priv); +} + +/* + * This function is only called by the application thread and cannot + * be called by the event thread. The function will wait for all + * references to be released on the cm_id and then kfree the cm_id + * object. + */ +void iw_destroy_cm_id(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags)); + + destroy_cm_id(cm_id); + + wait_for_completion(&cm_id_priv->destroy_comp); + + dealloc_work_entries(cm_id_priv); + + kfree(cm_id_priv); +} +EXPORT_SYMBOL(iw_destroy_cm_id); + +/* + * CM_ID <-- LISTEN + * + * Start listening for connect requests. Generates one CONNECT_REQUEST + * event for each inbound connect request. + */ +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + ret = alloc_work_entries(cm_id_priv, backlog); + if (ret) + return ret; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_IDLE: + cm_id_priv->state = IW_CM_STATE_LISTEN; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id->device->iwcm->create_listen(cm_id, backlog); + if (ret) + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + default: + ret = -EINVAL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + return ret; +} +EXPORT_SYMBOL(iw_cm_listen); + +/* + * CM_ID <-- IDLE + * + * Rejects an inbound connection request. 
No events are generated. + */ +int iw_cm_reject(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->reject(cm_id, private_data, + private_data_len); + + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + + return ret; +} +EXPORT_SYMBOL(iw_cm_reject); + +/* + * CM_ID <-- ESTABLISHED + * + * Accepts an inbound connection request and generates an ESTABLISHED + * event. Callers of iw_cm_disconnect and iw_destroy_cm_id will block + * until the ESTABLISHED event is received from the provider. 
+ */ +int iw_cm_accept(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *iw_param) +{ + struct iwcm_id_private *cm_id_priv; + struct ib_qp *qp; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state != IW_CM_STATE_CONN_RECV) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + /* Get the ib_qp given the QPN */ + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); + if (!qp) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + cm_id->device->iwcm->add_ref(qp); + cm_id_priv->qp = qp; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->accept(cm_id, iw_param); + if (ret) { + /* An error on accept precludes provider events */ + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV); + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->qp) { + cm_id->device->iwcm->rem_ref(qp); + cm_id_priv->qp = NULL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_accept); + +/* + * Active Side: CM_ID <-- CONN_SENT + * + * If successful, results in the generation of a CONNECT_REPLY + * event. iw_cm_disconnect and iw_destroy_cm_id will block until the + * CONNECT_REPLY event is received from the provider.
+ */ +int iw_cm_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + struct iwcm_id_private *cm_id_priv; + int ret = 0; + unsigned long flags; + struct ib_qp *qp; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + ret = alloc_work_entries(cm_id_priv, 4); + if (ret) + return ret; + + set_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + spin_lock_irqsave(&cm_id_priv->lock, flags); + + if (cm_id_priv->state != IW_CM_STATE_IDLE) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + + /* Get the ib_qp given the QPN */ + qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); + if (!qp) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + return -EINVAL; + } + cm_id->device->iwcm->add_ref(qp); + cm_id_priv->qp = qp; + cm_id_priv->state = IW_CM_STATE_CONN_SENT; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->connect(cm_id, iw_param); + if (ret) { + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->qp) { + cm_id->device->iwcm->rem_ref(qp); + cm_id_priv->qp = NULL; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT); + cm_id_priv->state = IW_CM_STATE_IDLE; + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + wake_up_all(&cm_id_priv->connect_wait); + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_connect); + +/* + * Passive Side: new CM_ID <-- CONN_RECV + * + * Handles an inbound connect request. The function creates a new + * iw_cm_id to represent the new connection and inherits the client + * callback function and other attributes from the listening parent. + * + * The work item contains a pointer to the listen_cm_id and the event. The + * listen_cm_id contains the client cm_handler, context and + * device. These are copied when the device is cloned. The event + * contains the new four tuple.
+ * + * An error on the child should not affect the parent, so this + * function does not return a value. + */ +static void cm_conn_req_handler(struct iwcm_id_private *listen_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + struct iw_cm_id *cm_id; + struct iwcm_id_private *cm_id_priv; + int ret; + + /* The provider should never generate a connection request + * event with a bad status. + */ + BUG_ON(iw_event->status); + + /* We could be destroying the listening id. If so, ignore this + * upcall. */ + spin_lock_irqsave(&listen_id_priv->lock, flags); + if (listen_id_priv->state != IW_CM_STATE_LISTEN) { + spin_unlock_irqrestore(&listen_id_priv->lock, flags); + return; + } + spin_unlock_irqrestore(&listen_id_priv->lock, flags); + + cm_id = iw_create_cm_id(listen_id_priv->id.device, + listen_id_priv->id.cm_handler, + listen_id_priv->id.context); + /* If the cm_id could not be created, ignore the request */ + if (IS_ERR(cm_id)) + return; + + cm_id->provider_data = iw_event->provider_data; + cm_id->local_addr = iw_event->local_addr; + cm_id->remote_addr = iw_event->remote_addr; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + cm_id_priv->state = IW_CM_STATE_CONN_RECV; + + ret = alloc_work_entries(cm_id_priv, 3); + if (ret) { + iw_cm_reject(cm_id, NULL, 0); + iw_destroy_cm_id(cm_id); + return; + } + + /* Call the client CM handler */ + ret = cm_id->cm_handler(cm_id, iw_event); + if (ret) { + set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); + destroy_cm_id(cm_id); + if (atomic_read(&cm_id_priv->refcount)==0) + kfree(cm_id_priv); + } + + if (iw_event->private_data_len) + kfree(iw_event->private_data); +} + +/* + * Passive Side: CM_ID <-- ESTABLISHED + * + * The provider generated an ESTABLISHED event which means that + * the MPA negotiation has completed successfully and we are now in MPA + * FPDU mode. + * + * This event can only be received in the CONN_RECV state.
If the + * remote peer closed, the ESTABLISHED event would be received followed + * by the CLOSE event. If the app closes, it will block until we wake + * it up after processing this event. + */ +static int cm_conn_est_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + + /* We clear the CONNECT_WAIT bit here to allow the callback + * function to call iw_cm_disconnect. Calling iw_destroy_cm_id + * from a callback handler is not allowed */ + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV); + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); + wake_up_all(&cm_id_priv->connect_wait); + + return ret; +} + +/* + * Active Side: CM_ID <-- ESTABLISHED + * + * The app has called connect and is waiting for the established event to + * post its requests to the server. This event will wake up anyone + * blocked in iw_cm_disconnect or iw_destroy_cm_id.
+ */ +static int cm_conn_rep_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + /* Clear the connect wait bit so a callback function calling + * iw_cm_disconnect will not wait and deadlock this thread */ + clear_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags); + BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT); + if (iw_event->status == IW_CM_EVENT_STATUS_ACCEPTED) { + cm_id_priv->id.local_addr = iw_event->local_addr; + cm_id_priv->id.remote_addr = iw_event->remote_addr; + cm_id_priv->state = IW_CM_STATE_ESTABLISHED; + } else { + /* REJECTED or RESET */ + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); + cm_id_priv->qp = NULL; + cm_id_priv->state = IW_CM_STATE_IDLE; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); + + if (iw_event->private_data_len) + kfree(iw_event->private_data); + + /* Wake up waiters on connect complete */ + wake_up_all(&cm_id_priv->connect_wait); + + return ret; +} + +/* + * CM_ID <-- CLOSING + * + * If in the ESTABLISHED state, move to CLOSING. + */ +static void cm_disconnect_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->state == IW_CM_STATE_ESTABLISHED) + cm_id_priv->state = IW_CM_STATE_CLOSING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); +} + +/* + * CM_ID <-- IDLE + * + * If in the ESTABLISHED or CLOSING states, the QP will have been + * moved by the provider to the ERR state. Disassociate the CM_ID from + * the QP, move to IDLE, and remove the 'connected' reference. + * + * If in some other state, the cm_id was destroyed asynchronously. + * This is the last reference that will result in waking up + * the app thread blocked in iw_destroy_cm_id.
+ */ +static int cm_close_handler(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + unsigned long flags; + int ret = 0; + spin_lock_irqsave(&cm_id_priv->lock, flags); + + if (cm_id_priv->qp) { + cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); + cm_id_priv->qp = NULL; + } + switch (cm_id_priv->state) { + case IW_CM_STATE_ESTABLISHED: + case IW_CM_STATE_CLOSING: + cm_id_priv->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, iw_event); + spin_lock_irqsave(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_DESTROYING: + break; + default: + BUG(); + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + return ret; +} + +static int process_event(struct iwcm_id_private *cm_id_priv, + struct iw_cm_event *iw_event) +{ + int ret = 0; + + switch (iw_event->event) { + case IW_CM_EVENT_CONNECT_REQUEST: + cm_conn_req_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_CONNECT_REPLY: + ret = cm_conn_rep_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_ESTABLISHED: + ret = cm_conn_est_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_DISCONNECT: + cm_disconnect_handler(cm_id_priv, iw_event); + break; + case IW_CM_EVENT_CLOSE: + ret = cm_close_handler(cm_id_priv, iw_event); + break; + default: + BUG(); + } + + return ret; +} + +/* + * Process events on the work_list for the cm_id. If the callback + * function requests that the cm_id be deleted, a flag is set in the + * cm_id flags to indicate that when the last reference is + * removed, the cm_id is to be destroyed. This is necessary to + * distinguish between an object that will be destroyed by the app + * thread asleep on the destroy_comp list vs. an object destroyed + * here synchronously when the last reference is removed. 
+ */ +static void cm_work_handler(void *arg) +{ + struct iwcm_work *work = arg, lwork; + struct iwcm_id_private *cm_id_priv = work->cm_id; + unsigned long flags; + int empty; + int ret = 0; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + empty = list_empty(&cm_id_priv->work_list); + while (!empty) { + work = list_entry(cm_id_priv->work_list.next, + struct iwcm_work, list); + list_del_init(&work->list); + empty = list_empty(&cm_id_priv->work_list); + lwork = *work; + put_work(work); + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + /* Use the local copy: 'work' was returned to the free list + * by put_work() and may be reused once the lock is dropped */ + ret = process_event(cm_id_priv, &lwork.event); + if (ret) { + set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); + destroy_cm_id(&cm_id_priv->id); + } + BUG_ON(atomic_read(&cm_id_priv->refcount)==0); + if (iwcm_deref_id(cm_id_priv)) + return; + + if (atomic_read(&cm_id_priv->refcount)==0 && + test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags)) { + dealloc_work_entries(cm_id_priv); + kfree(cm_id_priv); + return; + } + spin_lock_irqsave(&cm_id_priv->lock, flags); + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); +} + +/* + * This function is called in interrupt context. Schedule events on + * the iwcm_wq thread to allow callback functions to downcall into + * the CM and/or block. Events are queued to a per-CM_ID + * work_list. If this is the first event on the work_list, the work + * element is also queued on the iwcm_wq thread. + * + * Each event holds a reference on the cm_id. Until the last posted + * event has been delivered and processed, the cm_id cannot be + * deleted. + * + * Returns: + * 0 - the event was handled. + * -ENOMEM - the event was not handled due to lack of resources. 
+ */ +static int cm_event_handler(struct iw_cm_id *cm_id, + struct iw_cm_event *iw_event) +{ + struct iwcm_work *work; + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + work = get_work(cm_id_priv); + if (!work) { + ret = -ENOMEM; + goto out; + } + + INIT_WORK(&work->work, cm_work_handler, work); + work->cm_id = cm_id_priv; + work->event = *iw_event; + + if ((work->event.event == IW_CM_EVENT_CONNECT_REQUEST || + work->event.event == IW_CM_EVENT_CONNECT_REPLY) && + work->event.private_data_len) { + ret = copy_private_data(cm_id_priv, &work->event); + if (ret) { + put_work(work); + goto out; + } + } + + atomic_inc(&cm_id_priv->refcount); + if (list_empty(&cm_id_priv->work_list)) { + list_add_tail(&work->list, &cm_id_priv->work_list); + queue_work(iwcm_wq, &work->work); + } else + list_add_tail(&work->list, &cm_id_priv->work_list); +out: + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return ret; +} + +static int iwcm_init_qp_init_attr(struct iwcm_id_private *cm_id_priv, + struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_IDLE: + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_CONN_RECV: + case IW_CM_STATE_ESTABLISHED: + *qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS; + qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE| + IB_ACCESS_REMOTE_READ; + ret = 0; + break; + default: + ret = -EINVAL; + break; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return ret; +} + +static int iwcm_init_qp_rts_attr(struct iwcm_id_private *cm_id_priv, + struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_IDLE: + case IW_CM_STATE_CONN_SENT: + 
case IW_CM_STATE_CONN_RECV: + case IW_CM_STATE_ESTABLISHED: + *qp_attr_mask = 0; + ret = 0; + break; + default: + ret = -EINVAL; + break; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + return ret; +} + +int iw_cm_init_qp_attr(struct iw_cm_id *cm_id, + struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + struct iwcm_id_private *cm_id_priv; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + switch (qp_attr->qp_state) { + case IB_QPS_INIT: + case IB_QPS_RTR: + ret = iwcm_init_qp_init_attr(cm_id_priv, + qp_attr, qp_attr_mask); + break; + case IB_QPS_RTS: + ret = iwcm_init_qp_rts_attr(cm_id_priv, + qp_attr, qp_attr_mask); + break; + default: + ret = -EINVAL; + break; + } + return ret; +} +EXPORT_SYMBOL(iw_cm_init_qp_attr); + +static int __init iw_cm_init(void) +{ + iwcm_wq = create_singlethread_workqueue("iw_cm_wq"); + if (!iwcm_wq) + return -ENOMEM; + + return 0; +} + +static void __exit iw_cm_cleanup(void) +{ + destroy_workqueue(iwcm_wq); +} + +module_init(iw_cm_init); +module_exit(iw_cm_cleanup); diff --git a/include/rdma/iw_cm.h b/include/rdma/iw_cm.h new file mode 100644 index 0000000..36f44aa --- /dev/null +++ b/include/rdma/iw_cm.h @@ -0,0 +1,255 @@ +/* + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef IW_CM_H +#define IW_CM_H + +#include <linux/in.h> +#include <rdma/ib_cm.h> + +struct iw_cm_id; + +enum iw_cm_event_type { + IW_CM_EVENT_CONNECT_REQUEST = 1, /* connect request received */ + IW_CM_EVENT_CONNECT_REPLY, /* reply from active connect request */ + IW_CM_EVENT_ESTABLISHED, /* passive side accept successful */ + IW_CM_EVENT_DISCONNECT, /* orderly shutdown */ + IW_CM_EVENT_CLOSE /* close complete */ +}; +enum iw_cm_event_status { + IW_CM_EVENT_STATUS_OK = 0, /* request successful */ + IW_CM_EVENT_STATUS_ACCEPTED = 0, /* connect request accepted */ + IW_CM_EVENT_STATUS_REJECTED, /* connect request rejected */ + IW_CM_EVENT_STATUS_TIMEOUT, /* the operation timed out */ + IW_CM_EVENT_STATUS_RESET, /* reset from remote peer */ + IW_CM_EVENT_STATUS_EINVAL, /* asynchronous failure for bad parm */ +}; +struct iw_cm_event { + enum iw_cm_event_type event; + enum iw_cm_event_status status; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + void *private_data; + u8 private_data_len; + void *provider_data; +}; + +/** + * iw_cm_handler - Function to be called by the IW CM when delivering events + * to the client. + * + * @cm_id: The IW CM identifier associated with the event. + * @event: Pointer to the event structure. 
+ */ +typedef int (*iw_cm_handler)(struct iw_cm_id *cm_id, + struct iw_cm_event *event); + +/** + * iw_event_handler - Function called by the provider when delivering provider + * events to the IW CM. Returns either 0 indicating the event was processed + * or -errno if the event could not be processed. + * + * @cm_id: The IW CM identifier associated with the event. + * @event: Pointer to the event structure. + */ +typedef int (*iw_event_handler)(struct iw_cm_id *cm_id, + struct iw_cm_event *event); +struct iw_cm_id { + iw_cm_handler cm_handler; /* client callback function */ + void *context; /* client cb context */ + struct ib_device *device; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + void *provider_data; /* provider private data */ + iw_event_handler event_handler; /* cb for provider + events */ + /* Used by provider to add and remove refs on IW cm_id */ + void (*add_ref)(struct iw_cm_id *); + void (*rem_ref)(struct iw_cm_id *); +}; + +struct iw_cm_conn_param { + const void *private_data; + u16 private_data_len; + u32 ord; + u32 ird; + u32 qpn; +}; + +struct iw_cm_verbs { + void (*add_ref)(struct ib_qp *qp); + + void (*rem_ref)(struct ib_qp *qp); + + struct ib_qp * (*get_qp)(struct ib_device *device, + int qpn); + + int (*connect)(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *conn_param); + + int (*accept)(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *conn_param); + + int (*reject)(struct iw_cm_id *cm_id, + const void *pdata, u8 pdata_len); + + int (*create_listen)(struct iw_cm_id *cm_id, + int backlog); + + int (*destroy_listen)(struct iw_cm_id *cm_id); +}; + +/** + * iw_create_cm_id - Create an IW CM identifier. + * + * @device: The IB device on which to create the IW CM identifier. + * @cm_handler: User callback invoked to report events associated with the + * returned IW CM identifier. + * @context: User specified context associated with the id. 
+ */ +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, void *context); + +/** + * iw_destroy_cm_id - Destroy an IW CM identifier. + * + * @cm_id: The previously created IW CM identifier to destroy. + * + * The client can assume that no events will be delivered for the CM ID after + * this function returns. + */ +void iw_destroy_cm_id(struct iw_cm_id *cm_id); + +/** + * iw_cm_unbind_qp - Unbind the specified IW CM identifier and QP + * + * @cm_id: The IW CM identifier to unbind from the QP. + * @qp: The QP + * + * This is called by the provider when destroying the QP to ensure + * that any references held by the IWCM are released. It may also + * be called by the IWCM when destroying a CM_ID so that any + * references held by the provider are released. + */ +void iw_cm_unbind_qp(struct iw_cm_id *cm_id, struct ib_qp *qp); + +/** + * iw_cm_get_qp - Return the ib_qp associated with a QPN + * + * @device: The IB device + * @qpn: The queue pair number + */ +struct ib_qp *iw_cm_get_qp(struct ib_device *device, int qpn); + +/** + * iw_cm_listen - Listen for incoming connection requests on the + * specified IW CM id. + * + * @cm_id: The IW CM identifier. + * @backlog: The maximum number of outstanding un-accepted inbound listen + * requests to queue. + * + * The source address and port number are specified in the IW CM identifier + * structure. + */ +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog); + +/** + * iw_cm_accept - Called to accept an incoming connect request. + * + * @cm_id: The IW CM identifier associated with the connection request. + * @iw_param: Pointer to a structure containing connection establishment + * parameters. + * + * The specified cm_id will have been provided in the event data for a + * CONNECT_REQUEST event. Subsequent events related to this connection will be + * delivered to the specified IW CM identifier and may occur prior to + * the return of this function. 
If this function returns a non-zero value, the + * client can assume that no events will be delivered to the specified IW CM + * identifier. + */ +int iw_cm_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param); + +/** + * iw_cm_reject - Reject an incoming connection request. + * + * @cm_id: Connection identifier associated with the request. + * @private_data: Pointer to data to deliver to the remote peer as part of the + * reject message. + * @private_data_len: The number of bytes in the private_data parameter. + * + * The client can assume that no events will be delivered to the specified IW + * CM identifier following the return of this function. The private_data + * buffer is available for reuse when this function returns. + */ +int iw_cm_reject(struct iw_cm_id *cm_id, const void *private_data, + u8 private_data_len); + +/** + * iw_cm_connect - Called to request a connection to a remote peer. + * + * @cm_id: The IW CM identifier for the connection. + * @iw_param: Pointer to a structure containing connection establishment + * parameters. + * + * Events may be delivered to the specified IW CM identifier prior to the + * return of this function. If this function returns a non-zero value, the + * client can assume that no events will be delivered to the specified IW CM + * identifier. + */ +int iw_cm_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param); + +/** + * iw_cm_disconnect - Close the specified connection. + * + * @cm_id: The IW CM identifier to close. + * @abrupt: If 0, the connection will be closed gracefully, otherwise, the + * connection will be reset. + * + * The IW CM identifier is still active until the IW_CM_EVENT_CLOSE event is + * delivered. + */ +int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt); + +/** + * iw_cm_init_qp_attr - Called to initialize the attributes of the QP + * associated with an IW CM identifier. 
+ * + * @cm_id: The IW CM identifier associated with the QP + * @qp_attr: Pointer to the QP attributes structure. + * @qp_attr_mask: Pointer to a bit vector specifying which QP attributes are + * valid. + */ +int iw_cm_init_qp_attr(struct iw_cm_id *cm_id, struct ib_qp_attr *qp_attr, + int *qp_attr_mask); + +#endif /* IW_CM_H */ diff --git a/include/rdma/iw_cm_private.h b/include/rdma/iw_cm_private.h new file mode 100644 index 0000000..fc28e34 --- /dev/null +++ b/include/rdma/iw_cm_private.h @@ -0,0 +1,63 @@ +/* + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef IW_CM_PRIVATE_H +#define IW_CM_PRIVATE_H + +#include <rdma/iw_cm.h> + +enum iw_cm_state { + IW_CM_STATE_IDLE, /* unbound, inactive */ + IW_CM_STATE_LISTEN, /* listen waiting for connect */ + IW_CM_STATE_CONN_RECV, /* inbound waiting for user accept */ + IW_CM_STATE_CONN_SENT, /* outbound waiting for peer accept */ + IW_CM_STATE_ESTABLISHED, /* established */ + IW_CM_STATE_CLOSING, /* disconnect */ + IW_CM_STATE_DESTROYING /* object being deleted */ +}; + +struct iwcm_id_private { + struct iw_cm_id id; + enum iw_cm_state state; + unsigned long flags; + struct ib_qp *qp; + struct completion destroy_comp; + wait_queue_head_t connect_wait; + struct list_head work_list; + spinlock_t lock; + atomic_t refcount; + struct list_head work_free_list; +}; +#define IWCM_F_CALLBACK_DESTROY 1 +#define IWCM_F_CONNECT_WAIT 2 + +#endif /* IW_CM_PRIVATE_H */ From swise at opengridcomputing.com Tue Jun 20 13:24:52 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:24:52 -0500 Subject: [openib-general] [PATCH v3 2/2] iWARP Core Changes. In-Reply-To: <20060620202442.28922.27402.stgit@stevo-desktop> References: <20060620202442.28922.27402.stgit@stevo-desktop> Message-ID: <20060620202452.28922.39114.stgit@stevo-desktop> This patch contains modifications to the existing rdma header files, core files, drivers, and ulp files to support iWARP. V2 Review updates: V1 Review updates: - copy_addr() -> rdma_copy_addr() - dst_dev_addr param in rdma_copy_addr to const. - various spacing nits with recasting - include linux/inetdevice.h to get ip_dev_find() prototype. 
- dev_put() after successful ip_dev_find() --- drivers/infiniband/core/Makefile | 4 drivers/infiniband/core/addr.c | 19 + drivers/infiniband/core/cache.c | 8 - drivers/infiniband/core/cm.c | 3 drivers/infiniband/core/cma.c | 355 +++++++++++++++++++++++--- drivers/infiniband/core/device.c | 6 drivers/infiniband/core/mad.c | 11 + drivers/infiniband/core/sa_query.c | 5 drivers/infiniband/core/smi.c | 18 + drivers/infiniband/core/sysfs.c | 18 + drivers/infiniband/core/ucm.c | 5 drivers/infiniband/core/user_mad.c | 9 - drivers/infiniband/hw/ipath/ipath_verbs.c | 2 drivers/infiniband/hw/mthca/mthca_provider.c | 2 drivers/infiniband/ulp/ipoib/ipoib_main.c | 8 + drivers/infiniband/ulp/srp/ib_srp.c | 2 include/rdma/ib_addr.h | 15 + include/rdma/ib_verbs.h | 39 ++- 18 files changed, 437 insertions(+), 92 deletions(-) diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index 68e73ec..163d991 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -1,7 +1,7 @@ infiniband-$(CONFIG_INFINIBAND_ADDR_TRANS) := ib_addr.o rdma_cm.o obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \ - ib_cm.o $(infiniband-y) + ib_cm.o iw_cm.o $(infiniband-y) obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o @@ -14,6 +14,8 @@ ib_sa-y := sa_query.o ib_cm-y := cm.o +iw_cm-y := iwcm.o + rdma_cm-y := cma.o ib_addr-y := addr.o diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index d294bbc..83f84ef 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -32,6 +32,7 @@ #include #include #include #include +#include <linux/inetdevice.h> #include #include #include @@ -60,12 +61,15 @@ static LIST_HEAD(req_list); static DECLARE_WORK(work, process_req, NULL); static struct workqueue_struct *addr_wq; -static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, - unsigned char *dst_dev_addr) +int rdma_copy_addr(struct rdma_dev_addr *dev_addr, 
struct net_device *dev, + const unsigned char *dst_dev_addr) { switch (dev->type) { case ARPHRD_INFINIBAND: - dev_addr->dev_type = IB_NODE_CA; + dev_addr->dev_type = RDMA_NODE_IB_CA; + break; + case ARPHRD_ETHER: + dev_addr->dev_type = RDMA_NODE_RNIC; break; default: return -EADDRNOTAVAIL; @@ -77,6 +81,7 @@ static int copy_addr(struct rdma_dev_add memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN); return 0; } +EXPORT_SYMBOL(rdma_copy_addr); int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr) { @@ -88,7 +93,7 @@ int rdma_translate_ip(struct sockaddr *a if (!dev) return -EADDRNOTAVAIL; - ret = copy_addr(dev_addr, dev, NULL); + ret = rdma_copy_addr(dev_addr, dev, NULL); dev_put(dev); return ret; } @@ -160,7 +165,7 @@ static int addr_resolve_remote(struct so /* If the device does ARP internally, return 'done' */ if (rt->idev->dev->flags & IFF_NOARP) { - copy_addr(addr, rt->idev->dev, NULL); + rdma_copy_addr(addr, rt->idev->dev, NULL); goto put; } @@ -180,7 +185,7 @@ static int addr_resolve_remote(struct so src_in->sin_addr.s_addr = rt->rt_src; } - ret = copy_addr(addr, neigh->dev, neigh->ha); + ret = rdma_copy_addr(addr, neigh->dev, neigh->ha); release: neigh_release(neigh); put: @@ -244,7 +249,7 @@ static int addr_resolve_local(struct soc if (ZERONET(src_ip)) { src_in->sin_family = dst_in->sin_family; src_in->sin_addr.s_addr = dst_ip; - ret = copy_addr(addr, dev, dev->dev_addr); + ret = rdma_copy_addr(addr, dev, dev->dev_addr); } else if (LOOPBACK(src_ip)) { ret = rdma_translate_ip((struct sockaddr *)dst_in, addr); if (!ret) diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c index e05ca2c..061858c 100644 --- a/drivers/infiniband/core/cache.c +++ b/drivers/infiniband/core/cache.c @@ -32,13 +32,12 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: cache.c 1349 2004-12-16 21:09:43Z roland $ + * $Id: cache.c 6885 2006-05-03 18:22:02Z sean.hefty $ */ #include #include #include -#include /* INIT_WORK, schedule_work(), flush_scheduled_work() */ #include @@ -62,12 +61,13 @@ struct ib_update_work { static inline int start_port(struct ib_device *device) { - return device->node_type == IB_NODE_SWITCH ? 0 : 1; + return (device->node_type == RDMA_NODE_IB_SWITCH) ? 0 : 1; } static inline int end_port(struct ib_device *device) { - return device->node_type == IB_NODE_SWITCH ? 0 : device->phys_port_cnt; + return (device->node_type == RDMA_NODE_IB_SWITCH) ? + 0 : device->phys_port_cnt; } int ib_get_cached_gid(struct ib_device *device, diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 450adfe..070dda9 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -3244,6 +3244,9 @@ static void cm_add_one(struct ib_device int ret; u8 i; + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * device->phys_port_cnt, GFP_KERNEL); if (!cm_dev) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index a76834e..52a74f5 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -35,6 +35,7 @@ #include #include #include #include +#include #include @@ -43,6 +44,7 @@ #include #include #include #include +#include MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("Generic RDMA CM Agent"); @@ -124,6 +126,7 @@ struct rdma_id_private { int query_id; union { struct ib_cm_id *ib; + struct iw_cm_id *iw; } cm_id; u32 seq_num; @@ -259,13 +262,23 @@ static void cma_detach_from_dev(struct r id_priv->cma_dev = NULL; } -static int cma_acquire_ib_dev(struct rdma_id_private *id_priv) +static int cma_acquire_dev(struct rdma_id_private *id_priv) { + enum rdma_node_type dev_type = id_priv->id.route.addr.dev_addr.dev_type; struct cma_device *cma_dev; union ib_gid *gid; int ret = 
-ENODEV; - gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); + switch (rdma_node_get_transport(dev_type)) { + case RDMA_TRANSPORT_IB: + gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); + break; + case RDMA_TRANSPORT_IWARP: + gid = iw_addr_get_sgid(&id_priv->id.route.addr.dev_addr); + break; + default: + return -ENODEV; + } mutex_lock(&lock); list_for_each_entry(cma_dev, &dev_list, list) { @@ -280,16 +293,6 @@ static int cma_acquire_ib_dev(struct rdm return ret; } -static int cma_acquire_dev(struct rdma_id_private *id_priv) -{ - switch (id_priv->id.route.addr.dev_addr.dev_type) { - case IB_NODE_CA: - return cma_acquire_ib_dev(id_priv); - default: - return -ENODEV; - } -} - static void cma_deref_id(struct rdma_id_private *id_priv) { if (atomic_dec_and_test(&id_priv->refcount)) @@ -347,6 +350,16 @@ static int cma_init_ib_qp(struct rdma_id IB_QP_PKEY_INDEX | IB_QP_PORT); } +static int cma_init_iw_qp(struct rdma_id_private *id_priv, struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; + + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS); +} + int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd, struct ib_qp_init_attr *qp_init_attr) { @@ -362,10 +375,13 @@ int rdma_create_qp(struct rdma_cm_id *id if (IS_ERR(qp)) return PTR_ERR(qp); - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_init_ib_qp(id_priv, qp); break; + case RDMA_TRANSPORT_IWARP: + ret = cma_init_iw_qp(id_priv, qp); + break; default: ret = -ENOSYS; break; @@ -451,13 +467,17 @@ int rdma_init_qp_attr(struct rdma_cm_id int ret; id_priv = container_of(id, struct rdma_id_private, id); - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + case RDMA_TRANSPORT_IB: ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, qp_attr, 
qp_attr_mask); if (qp_attr->qp_state == IB_QPS_RTR) qp_attr->rq_psn = id_priv->seq_num; break; + case RDMA_TRANSPORT_IWARP: + ret = iw_cm_init_qp_attr(id_priv->cm_id.iw, qp_attr, + qp_attr_mask); + break; default: ret = -ENOSYS; break; @@ -590,8 +610,8 @@ static int cma_notify_user(struct rdma_i static void cma_cancel_route(struct rdma_id_private *id_priv) { - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + case RDMA_TRANSPORT_IB: if (id_priv->query) ib_sa_cancel_query(id_priv->query_id, id_priv->query); break; @@ -611,11 +631,15 @@ static void cma_destroy_listen(struct rd cma_exch(id_priv, CMA_DESTROYING); if (id_priv->cma_dev) { - switch (id_priv->id.device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + case RDMA_TRANSPORT_IB: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) ib_destroy_cm_id(id_priv->cm_id.ib); break; + case RDMA_TRANSPORT_IWARP: + if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) + iw_destroy_cm_id(id_priv->cm_id.iw); + break; default: break; } @@ -690,11 +714,15 @@ void rdma_destroy_id(struct rdma_cm_id * cma_cancel_operation(id_priv, state); if (id_priv->cma_dev) { - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) ib_destroy_cm_id(id_priv->cm_id.ib); break; + case RDMA_TRANSPORT_IWARP: + if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) + iw_destroy_cm_id(id_priv->cm_id.iw); + break; default: break; } @@ -868,7 +896,7 @@ static struct rdma_id_private *cma_new_i ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid); ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid); ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey)); - rt->addr.dev_addr.dev_type = IB_NODE_CA; + rt->addr.dev_addr.dev_type = RDMA_NODE_IB_CA; id_priv = 
container_of(id, struct rdma_id_private, id); id_priv->state = CMA_CONNECT; @@ -897,7 +925,7 @@ static int cma_req_handler(struct ib_cm_ } atomic_inc(&conn_id->dev_remove); - ret = cma_acquire_ib_dev(conn_id); + ret = cma_acquire_dev(conn_id); if (ret) { ret = -ENODEV; cma_release_remove(conn_id); @@ -981,6 +1009,125 @@ static void cma_set_compare_data(enum rd } } +static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event) +{ + struct rdma_id_private *id_priv = iw_id->context; + enum rdma_cm_event_type event = 0; + struct sockaddr_in *sin; + int ret = 0; + + atomic_inc(&id_priv->dev_remove); + + switch (iw_event->event) { + case IW_CM_EVENT_CLOSE: + event = RDMA_CM_EVENT_DISCONNECTED; + break; + case IW_CM_EVENT_CONNECT_REPLY: + sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; + *sin = iw_event->local_addr; + sin = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr; + *sin = iw_event->remote_addr; + if (iw_event->status) + event = RDMA_CM_EVENT_REJECTED; + else + event = RDMA_CM_EVENT_ESTABLISHED; + break; + case IW_CM_EVENT_ESTABLISHED: + event = RDMA_CM_EVENT_ESTABLISHED; + break; + default: + BUG_ON(1); + } + + ret = cma_notify_user(id_priv, event, iw_event->status, + iw_event->private_data, + iw_event->private_data_len); + if (ret) { + /* Destroy the CM ID by returning a non-zero value. 
*/ + id_priv->cm_id.iw = NULL; + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + rdma_destroy_id(&id_priv->id); + return ret; + } + + cma_release_remove(id_priv); + return ret; +} + +static int iw_conn_req_handler(struct iw_cm_id *cm_id, + struct iw_cm_event *iw_event) +{ + struct rdma_cm_id *new_cm_id; + struct rdma_id_private *listen_id, *conn_id; + struct sockaddr_in *sin; + struct net_device *dev = NULL; + int ret; + + listen_id = cm_id->context; + atomic_inc(&listen_id->dev_remove); + if (!cma_comp(listen_id, CMA_LISTEN)) { + ret = -ECONNABORTED; + goto out; + } + + /* Create a new RDMA id for the new IW CM ID */ + new_cm_id = rdma_create_id(listen_id->id.event_handler, + listen_id->id.context, + RDMA_PS_TCP); + if (!new_cm_id) { + ret = -ENOMEM; + goto out; + } + conn_id = container_of(new_cm_id, struct rdma_id_private, id); + atomic_inc(&conn_id->dev_remove); + conn_id->state = CMA_CONNECT; + + dev = ip_dev_find(iw_event->local_addr.sin_addr.s_addr); + if (!dev) { + ret = -EADDRNOTAVAIL; + rdma_destroy_id(new_cm_id); + goto out; + } + ret = rdma_copy_addr(&conn_id->id.route.addr.dev_addr, dev, NULL); + if (ret) { + rdma_destroy_id(new_cm_id); + goto out; + } + + ret = cma_acquire_dev(conn_id); + if (ret) { + rdma_destroy_id(new_cm_id); + goto out; + } + + conn_id->cm_id.iw = cm_id; + cm_id->context = conn_id; + cm_id->cm_handler = cma_iw_handler; + + sin = (struct sockaddr_in *) &new_cm_id->route.addr.src_addr; + *sin = iw_event->local_addr; + sin = (struct sockaddr_in *) &new_cm_id->route.addr.dst_addr; + *sin = iw_event->remote_addr; + + ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0, + iw_event->private_data, + iw_event->private_data_len); + if (ret) { + /* User wants to destroy the CM ID */ + conn_id->cm_id.iw = NULL; + cma_exch(conn_id, CMA_DESTROYING); + cma_release_remove(conn_id); + rdma_destroy_id(&conn_id->id); + } + +out: + if (dev) + dev_put(dev); + cma_release_remove(listen_id); + return ret; +} + +static int 
cma_ib_listen(struct rdma_id_private *id_priv) { struct ib_cm_compare_data compare_data; @@ -1010,6 +1157,30 @@ static int cma_ib_listen(struct rdma_id_ return ret; } +static int cma_iw_listen(struct rdma_id_private *id_priv, int backlog) +{ + int ret; + struct sockaddr_in *sin; + + id_priv->cm_id.iw = iw_create_cm_id(id_priv->id.device, + iw_conn_req_handler, + id_priv); + if (IS_ERR(id_priv->cm_id.iw)) + return PTR_ERR(id_priv->cm_id.iw); + + sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; + id_priv->cm_id.iw->local_addr = *sin; + + ret = iw_cm_listen(id_priv->cm_id.iw, backlog); + + if (ret) { + iw_destroy_cm_id(id_priv->cm_id.iw); + id_priv->cm_id.iw = NULL; + } + + return ret; +} + static int cma_listen_handler(struct rdma_cm_id *id, struct rdma_cm_event *event) { @@ -1086,12 +1257,17 @@ int rdma_listen(struct rdma_cm_id *id, i id_priv->backlog = backlog; if (id->device) { - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_ib_listen(id_priv); if (ret) goto err; break; + case RDMA_TRANSPORT_IWARP: + ret = cma_iw_listen(id_priv, backlog); + if (ret) + goto err; + break; default: ret = -ENOSYS; goto err; @@ -1230,6 +1406,23 @@ err: } EXPORT_SYMBOL(rdma_set_ib_paths); +static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms) +{ + struct cma_work *work; + + work = kzalloc(sizeof *work, GFP_KERNEL); + if (!work) + return -ENOMEM; + + work->id = id_priv; + INIT_WORK(&work->work, cma_work_handler, work); + work->old_state = CMA_ROUTE_QUERY; + work->new_state = CMA_ROUTE_RESOLVED; + work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED; + queue_work(cma_wq, &work->work); + return 0; +} + int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) { struct rdma_id_private *id_priv; @@ -1240,10 +1433,13 @@ int rdma_resolve_route(struct rdma_cm_id return -EINVAL; atomic_inc(&id_priv->refcount); - switch (id->device->node_type) { - case 
IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_resolve_ib_route(id_priv, timeout_ms); break; + case RDMA_TRANSPORT_IWARP: + ret = cma_resolve_iw_route(id_priv, timeout_ms); + break; default: ret = -ENOSYS; break; @@ -1355,8 +1551,8 @@ static int cma_resolve_loopback(struct r ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr)); if (cma_zero_addr(&id_priv->id.route.addr.src_addr)) { - src_in = (struct sockaddr_in *)&id_priv->id.route.addr.src_addr; - dst_in = (struct sockaddr_in *)&id_priv->id.route.addr.dst_addr; + src_in = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; + dst_in = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr; src_in->sin_family = dst_in->sin_family; src_in->sin_addr.s_addr = dst_in->sin_addr.s_addr; } @@ -1647,6 +1843,47 @@ out: return ret; } +static int cma_connect_iw(struct rdma_id_private *id_priv, + struct rdma_conn_param *conn_param) +{ + struct iw_cm_id *cm_id; + struct sockaddr_in* sin; + int ret; + struct iw_cm_conn_param iw_param; + + cm_id = iw_create_cm_id(id_priv->id.device, cma_iw_handler, id_priv); + if (IS_ERR(cm_id)) { + ret = PTR_ERR(cm_id); + goto out; + } + + id_priv->cm_id.iw = cm_id; + + sin = (struct sockaddr_in*) &id_priv->id.route.addr.src_addr; + cm_id->local_addr = *sin; + + sin = (struct sockaddr_in*) &id_priv->id.route.addr.dst_addr; + cm_id->remote_addr = *sin; + + ret = cma_modify_qp_rtr(&id_priv->id); + if (ret) { + iw_destroy_cm_id(cm_id); + return ret; + } + + iw_param.ord = conn_param->initiator_depth; + iw_param.ird = conn_param->responder_resources; + iw_param.private_data = conn_param->private_data; + iw_param.private_data_len = conn_param->private_data_len; + if (id_priv->id.qp) + iw_param.qpn = id_priv->qp_num; + else + iw_param.qpn = conn_param->qp_num; + ret = iw_cm_connect(cm_id, &iw_param); +out: + return ret; +} + int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) { struct rdma_id_private *id_priv; @@ 
-1662,10 +1899,13 @@ int rdma_connect(struct rdma_cm_id *id, id_priv->srq = conn_param->srq; } - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = cma_connect_ib(id_priv, conn_param); break; + case RDMA_TRANSPORT_IWARP: + ret = cma_connect_iw(id_priv, conn_param); + break; default: ret = -ENOSYS; break; @@ -1706,6 +1946,28 @@ static int cma_accept_ib(struct rdma_id_ return ib_send_cm_rep(id_priv->cm_id.ib, &rep); } +static int cma_accept_iw(struct rdma_id_private *id_priv, + struct rdma_conn_param *conn_param) +{ + struct iw_cm_conn_param iw_param; + int ret; + + ret = cma_modify_qp_rtr(&id_priv->id); + if (ret) + return ret; + + iw_param.ord = conn_param->initiator_depth; + iw_param.ird = conn_param->responder_resources; + iw_param.private_data = conn_param->private_data; + iw_param.private_data_len = conn_param->private_data_len; + if (id_priv->id.qp) { + iw_param.qpn = id_priv->qp_num; + } else + iw_param.qpn = conn_param->qp_num; + + return iw_cm_accept(id_priv->cm_id.iw, &iw_param); +} + int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) { struct rdma_id_private *id_priv; @@ -1721,13 +1983,16 @@ int rdma_accept(struct rdma_cm_id *id, s id_priv->srq = conn_param->srq; } - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: if (conn_param) ret = cma_accept_ib(id_priv, conn_param); else ret = cma_rep_recv(id_priv); break; + case RDMA_TRANSPORT_IWARP: + ret = cma_accept_iw(id_priv, conn_param); + break; default: ret = -ENOSYS; break; @@ -1754,12 +2019,16 @@ int rdma_reject(struct rdma_cm_id *id, c if (!cma_comp(id_priv, CMA_CONNECT)) return -EINVAL; - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: ret = ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, 
private_data, private_data_len); break; + case RDMA_TRANSPORT_IWARP: + ret = iw_cm_reject(id_priv->cm_id.iw, + private_data, private_data_len); + break; default: ret = -ENOSYS; break; @@ -1778,16 +2047,18 @@ int rdma_disconnect(struct rdma_cm_id *i !cma_comp(id_priv, CMA_DISCONNECT)) return -EINVAL; - ret = cma_modify_qp_err(id); - if (ret) - goto out; - - switch (id->device->node_type) { - case IB_NODE_CA: + switch (rdma_node_get_transport(id->device->node_type)) { + case RDMA_TRANSPORT_IB: + ret = cma_modify_qp_err(id); + if (ret) + goto out; /* Initiate or respond to a disconnect. */ if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0)) ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0); break; + case RDMA_TRANSPORT_IWARP: + ret = iw_cm_disconnect(id_priv->cm_id.iw, 0); + break; default: break; } diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index b2f3cb9..7318fba 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -30,7 +30,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: device.c 1349 2004-12-16 21:09:43Z roland $ + * $Id: device.c 5943 2006-03-22 00:58:04Z roland $ */ #include @@ -505,7 +505,7 @@ int ib_query_port(struct ib_device *devi u8 port_num, struct ib_port_attr *port_attr) { - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { if (port_num) return -EINVAL; } else if (port_num < 1 || port_num > device->phys_port_cnt) @@ -580,7 +580,7 @@ int ib_modify_port(struct ib_device *dev u8 port_num, int port_modify_mask, struct ib_port_modify *port_modify) { - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { if (port_num) return -EINVAL; } else if (port_num < 1 || port_num > device->phys_port_cnt) diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index b38e02a..a928ecf 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved. * @@ -31,7 +31,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ + * $Id: mad.c 7294 2006-05-17 18:12:30Z roland $ */ #include #include @@ -2877,7 +2877,10 @@ static void ib_mad_init_device(struct ib { int start, end, i; - if (device->node_type == IB_NODE_SWITCH) { + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) { start = 0; end = 0; } else { @@ -2924,7 +2927,7 @@ static void ib_mad_remove_device(struct { int i, num_ports, cur_port; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { num_ports = 1; cur_port = 0; } else { diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index e911c99..12a9425 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -918,7 +918,10 @@ static void ib_sa_add_one(struct ib_devi struct ib_sa_device *sa_dev; int s, e, i; - if (device->node_type == IB_NODE_SWITCH) + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) s = e = 0; else { s = 1; diff --git a/drivers/infiniband/core/smi.c b/drivers/infiniband/core/smi.c index 35852e7..b81b2b9 100644 --- a/drivers/infiniband/core/smi.c +++ b/drivers/infiniband/core/smi.c @@ -34,7 +34,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: smi.c 1389 2004-12-27 22:56:47Z roland $ + * $Id: smi.c 5258 2006-02-01 20:32:40Z sean.hefty $ */ #include @@ -64,7 +64,7 @@ int smi_handle_dr_smp_send(struct ib_smp /* C14-9:2 */ if (hop_ptr && hop_ptr < hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; /* smp->return_path set when received */ @@ -77,7 +77,7 @@ int smi_handle_dr_smp_send(struct ib_smp if (hop_ptr == hop_cnt) { /* smp->return_path set when received */ smp->hop_ptr++; - return (node_type == IB_NODE_SWITCH || + return (node_type == RDMA_NODE_IB_SWITCH || smp->dr_dlid == IB_LID_PERMISSIVE); } @@ -95,7 +95,7 @@ int smi_handle_dr_smp_send(struct ib_smp /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; smp->hop_ptr--; @@ -107,7 +107,7 @@ int smi_handle_dr_smp_send(struct ib_smp if (hop_ptr == 1) { smp->hop_ptr--; /* C14-13:3 -- SMPs destined for SM shouldn't be here */ - return (node_type == IB_NODE_SWITCH || + return (node_type == RDMA_NODE_IB_SWITCH || smp->dr_slid == IB_LID_PERMISSIVE); } @@ -142,7 +142,7 @@ int smi_handle_dr_smp_recv(struct ib_smp /* C14-9:2 -- intermediate hop */ if (hop_ptr && hop_ptr < hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; smp->return_path[hop_ptr] = port_num; @@ -156,7 +156,7 @@ int smi_handle_dr_smp_recv(struct ib_smp smp->return_path[hop_ptr] = port_num; /* smp->hop_ptr updated when sending */ - return (node_type == IB_NODE_SWITCH || + return (node_type == RDMA_NODE_IB_SWITCH || smp->dr_dlid == IB_LID_PERMISSIVE); } @@ -175,7 +175,7 @@ int smi_handle_dr_smp_recv(struct ib_smp /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { - if (node_type != IB_NODE_SWITCH) + if (node_type != RDMA_NODE_IB_SWITCH) return 0; /* smp->hop_ptr updated when sending */ @@ -190,7 +190,7 @@ int smi_handle_dr_smp_recv(struct ib_smp return 1; } /* smp->hop_ptr updated when sending */ - return 
(node_type == IB_NODE_SWITCH); + return (node_type == RDMA_NODE_IB_SWITCH); } /* C14-13:4 -- hop_ptr = 0 -> give to SM */ diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c index 21f9282..cfd2c06 100644 --- a/drivers/infiniband/core/sysfs.c +++ b/drivers/infiniband/core/sysfs.c @@ -31,7 +31,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: sysfs.c 1349 2004-12-16 21:09:43Z roland $ + * $Id: sysfs.c 6940 2006-05-04 17:04:55Z roland $ */ #include "core_priv.h" @@ -589,10 +589,16 @@ static ssize_t show_node_type(struct cla return -ENODEV; switch (dev->node_type) { - case IB_NODE_CA: return sprintf(buf, "%d: CA\n", dev->node_type); - case IB_NODE_SWITCH: return sprintf(buf, "%d: switch\n", dev->node_type); - case IB_NODE_ROUTER: return sprintf(buf, "%d: router\n", dev->node_type); - default: return sprintf(buf, "%d: \n", dev->node_type); + case RDMA_NODE_IB_CA: + return sprintf(buf, "%d: CA\n", dev->node_type); + case RDMA_NODE_RNIC: + return sprintf(buf, "%d: RNIC\n", dev->node_type); + case RDMA_NODE_IB_SWITCH: + return sprintf(buf, "%d: switch\n", dev->node_type); + case RDMA_NODE_IB_ROUTER: + return sprintf(buf, "%d: router\n", dev->node_type); + default: + return sprintf(buf, "%d: \n", dev->node_type); } } @@ -708,7 +714,7 @@ int ib_device_register_sysfs(struct ib_d if (ret) goto err_put; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { ret = add_port(device, 0); if (ret) goto err_put; diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c index c1c6fda..936afc8 100644 --- a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -30,7 +30,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: ucm.c 4311 2005-12-05 18:42:01Z sean.hefty $ + * $Id: ucm.c 7119 2006-05-11 16:40:38Z sean.hefty $ */ #include @@ -1247,7 +1247,8 @@ static void ib_ucm_add_one(struct ib_dev { struct ib_ucm_device *ucm_dev; - if (!device->alloc_ucontext) + if (!device->alloc_ucontext || + rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) return; ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL); diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index afe70a5..0cbd692 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004 Topspin Communications. All rights reserved. - * Copyright (c) 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -31,7 +31,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: user_mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ + * $Id: user_mad.c 6041 2006-03-27 21:06:00Z halr $ */ #include @@ -967,7 +967,10 @@ static void ib_umad_add_one(struct ib_de struct ib_umad_device *umad_dev; int s, e, i; - if (device->node_type == IB_NODE_SWITCH) + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) s = e = 0; else { s = 1; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 28fdbda..e4b45d7 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -984,7 +984,7 @@ static void *ipath_register_ib_device(in (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | (1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV); - dev->node_type = IB_NODE_CA; + dev->node_type = RDMA_NODE_IB_CA; dev->phys_port_cnt = 1; dev->dma_device = ipath_layer_get_device(dd); dev->class_dev.dev = dev->dma_device; diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 230ae21..2103ee8 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1292,7 +1292,7 @@ int mthca_register_device(struct mthca_d (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); - dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.node_type = RDMA_NODE_IB_CA; dev->ib_dev.phys_port_cnt = dev->limits.num_ports; dev->ib_dev.dma_device = &dev->pdev->dev; dev->ib_dev.class_dev.dev = &dev->pdev->dev; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 1c6ea1c..262427f 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1084,13 +1084,16 @@ static void ipoib_add_one(struct ib_devi struct ipoib_dev_priv *priv; int s, e, p; + if 
(rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if (!dev_list) return; INIT_LIST_HEAD(dev_list); - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { s = 0; e = 0; } else { @@ -1114,6 +1117,9 @@ static void ipoib_remove_one(struct ib_d struct ipoib_dev_priv *priv, *tmp; struct list_head *dev_list; + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + dev_list = ib_get_client_data(device, &ipoib_client); list_for_each_entry_safe(priv, tmp, dev_list, list) { diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 4e22afe..37ea240 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1879,7 +1879,7 @@ static void srp_add_one(struct ib_device if (IS_ERR(srp_dev->fmr_pool)) srp_dev->fmr_pool = NULL; - if (device->node_type == IB_NODE_SWITCH) { + if (device->node_type == RDMA_NODE_IB_SWITCH) { s = 0; e = 0; } else { diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h index fcb5ba8..d95d3eb 100644 --- a/include/rdma/ib_addr.h +++ b/include/rdma/ib_addr.h @@ -40,7 +40,7 @@ struct rdma_dev_addr { unsigned char src_dev_addr[MAX_ADDR_LEN]; unsigned char dst_dev_addr[MAX_ADDR_LEN]; unsigned char broadcast[MAX_ADDR_LEN]; - enum ib_node_type dev_type; + enum rdma_node_type dev_type; }; /** @@ -72,6 +72,9 @@ int rdma_resolve_ip(struct sockaddr *src void rdma_addr_cancel(struct rdma_dev_addr *addr); +int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, + const unsigned char *dst_dev_addr); + static inline int ip_addr_size(struct sockaddr *addr) { return addr->sa_family == AF_INET6 ? 
@@ -111,4 +114,14 @@ static inline void ib_addr_set_dgid(stru memcpy(dev_addr->dst_dev_addr + 4, gid, sizeof *gid); } +static inline union ib_gid* iw_addr_get_sgid(struct rdma_dev_addr* rda) +{ + return (union ib_gid *) rda->src_dev_addr; +} + +static inline union ib_gid* iw_addr_get_dgid(struct rdma_dev_addr* rda) +{ + return (union ib_gid *) rda->dst_dev_addr; +} + #endif /* IB_ADDR_H */ diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index ee1f3a3..4b4c30a 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -35,7 +35,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_verbs.h 1349 2004-12-16 21:09:43Z roland $ + * $Id: ib_verbs.h 6885 2006-05-03 18:22:02Z sean.hefty $ */ #if !defined(IB_VERBS_H) @@ -56,12 +56,35 @@ union ib_gid { } global; }; -enum ib_node_type { - IB_NODE_CA = 1, - IB_NODE_SWITCH, - IB_NODE_ROUTER +enum rdma_node_type { + /* IB values map to NodeInfo:NodeType. */ + RDMA_NODE_IB_CA = 1, + RDMA_NODE_IB_SWITCH, + RDMA_NODE_IB_ROUTER, + RDMA_NODE_RNIC }; +enum rdma_transport_type { + RDMA_TRANSPORT_IB, + RDMA_TRANSPORT_IWARP +}; + +static inline enum rdma_transport_type +rdma_node_get_transport(enum rdma_node_type node_type) +{ + switch (node_type) { + case RDMA_NODE_IB_CA: + case RDMA_NODE_IB_SWITCH: + case RDMA_NODE_IB_ROUTER: + return RDMA_TRANSPORT_IB; + case RDMA_NODE_RNIC: + return RDMA_TRANSPORT_IWARP; + default: + BUG(); + return 0; + } +} + enum ib_device_cap_flags { IB_DEVICE_RESIZE_MAX_WR = 1, IB_DEVICE_BAD_PKEY_CNTR = (1<<1), @@ -78,6 +101,9 @@ enum ib_device_cap_flags { IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), IB_DEVICE_SRQ_RESIZE = (1<<13), IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_ZERO_STAG = (1<<15), + IB_DEVICE_SEND_W_INV = (1<<16), + IB_DEVICE_MEM_WINDOW = (1<<17) }; enum ib_atomic_cap { @@ -835,6 +861,7 @@ struct ib_cache { u8 *lmc_cache; }; +struct iw_cm_verbs; struct ib_device { struct device *dma_device; @@ -851,6 +878,8 @@ struct ib_device { u32 
flags; + struct iw_cm_verbs *iwcm; + int (*query_device)(struct ib_device *device, struct ib_device_attr *device_attr); int (*query_port)(struct ib_device *device, From swise at opengridcomputing.com Tue Jun 20 13:30:50 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:30:50 -0500 Subject: [openib-general] [PATCH v3 0/7][RFC] Ammasso 1100 iWARP Driver Message-ID: <20060620203050.31536.5341.stgit@stevo-desktop> This patchset implements the iWARP provider driver for the Ammasso 1100 RNIC. It is dependent on the "iWARP Core Support" patch set. We're submitting it for review with the goal for inclusion in the 2.6.19 kernel. This code has gone through several reviews in the openib-general list. Now we are submitting it for external review by the linux community. This StGIT patchset is cloned from Roland Dreier's infiniband.git for-2.6.19 branch. The patchset consists of 7 patches: 1 - Low-level device interface and native stack support 2 - Work request definitions 3 - Provider interface 4 - Memory management 5 - User mode message queue implementation 6 - Verbs queue implementation 7 - Kconfig and Makefile I believe I've addressed all the round 1 and 2 review comments. Details of the changes are tracked in each patch comment. Signed-off-by: Tom Tucker Signed-off-by: Steve Wise From swise at opengridcomputing.com Tue Jun 20 13:31:00 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:31:00 -0500 Subject: [openib-general] [PATCH v3 2/7] AMSO1100 WR / Event Definitions. 
In-Reply-To: <20060620203050.31536.5341.stgit@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> Message-ID: <20060620203100.31536.50860.stgit@stevo-desktop> Review Changes: - C2_DEBUG -> DEBUG - removed useless comments --- drivers/infiniband/hw/amso1100/c2_ae.h | 108 ++ drivers/infiniband/hw/amso1100/c2_status.h | 158 +++ drivers/infiniband/hw/amso1100/c2_wr.h | 1520 ++++++++++++++++++++++++++++ 3 files changed, 1786 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_ae.h b/drivers/infiniband/hw/amso1100/c2_ae.h new file mode 100644 index 0000000..3a065c3 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_ae.h @@ -0,0 +1,108 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _C2_AE_H_ +#define _C2_AE_H_ + +/* + * WARNING: If you change this file, also bump C2_IVN_BASE + * in common/include/clustercore/c2_ivn.h. + */ + +/* + * Asynchronous Event Identifiers + * + * These start at 0x80 only so it's obvious from inspection that + * they are not work-request statuses. This isn't critical. + * + * NOTE: these event id's must fit in eight bits. + */ +enum c2_event_id { + CCAE_REMOTE_SHUTDOWN = 0x80, + CCAE_ACTIVE_CONNECT_RESULTS, + CCAE_CONNECTION_REQUEST, + CCAE_LLP_CLOSE_COMPLETE, + CCAE_TERMINATE_MESSAGE_RECEIVED, + CCAE_LLP_CONNECTION_RESET, + CCAE_LLP_CONNECTION_LOST, + CCAE_LLP_SEGMENT_SIZE_INVALID, + CCAE_LLP_INVALID_CRC, + CCAE_LLP_BAD_FPDU, + CCAE_INVALID_DDP_VERSION, + CCAE_INVALID_RDMA_VERSION, + CCAE_UNEXPECTED_OPCODE, + CCAE_INVALID_DDP_QUEUE_NUMBER, + CCAE_RDMA_READ_NOT_ENABLED, + CCAE_RDMA_WRITE_NOT_ENABLED, + CCAE_RDMA_READ_TOO_SMALL, + CCAE_NO_L_BIT, + CCAE_TAGGED_INVALID_STAG, + CCAE_TAGGED_BASE_BOUNDS_VIOLATION, + CCAE_TAGGED_ACCESS_RIGHTS_VIOLATION, + CCAE_TAGGED_INVALID_PD, + CCAE_WRAP_ERROR, + CCAE_BAD_CLOSE, + CCAE_BAD_LLP_CLOSE, + CCAE_INVALID_MSN_RANGE, + CCAE_INVALID_MSN_GAP, + CCAE_IRRQ_OVERFLOW, + CCAE_IRRQ_MSN_GAP, + CCAE_IRRQ_MSN_RANGE, + CCAE_IRRQ_INVALID_STAG, + CCAE_IRRQ_BASE_BOUNDS_VIOLATION, + CCAE_IRRQ_ACCESS_RIGHTS_VIOLATION, + CCAE_IRRQ_INVALID_PD, + CCAE_IRRQ_WRAP_ERROR, + CCAE_CQ_SQ_COMPLETION_OVERFLOW, + CCAE_CQ_RQ_COMPLETION_ERROR, + CCAE_QP_SRQ_WQE_ERROR, + CCAE_QP_LOCAL_CATASTROPHIC_ERROR, + CCAE_CQ_OVERFLOW, + CCAE_CQ_OPERATION_ERROR, + CCAE_SRQ_LIMIT_REACHED, + CCAE_QP_RQ_LIMIT_REACHED, + CCAE_SRQ_CATASTROPHIC_ERROR, + CCAE_RNIC_CATASTROPHIC_ERROR +/* WARNING If you add more id's, make sure their values fit in 
eight bits. */ +}; + +/* + * Resource Indicators and Identifiers + */ +enum c2_resource_indicator { + C2_RES_IND_QP = 1, + C2_RES_IND_EP, + C2_RES_IND_CQ, + C2_RES_IND_SRQ, +}; + +#endif /* _C2_AE_H_ */ diff --git a/drivers/infiniband/hw/amso1100/c2_status.h b/drivers/infiniband/hw/amso1100/c2_status.h new file mode 100644 index 0000000..6ee4aa9 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_status.h @@ -0,0 +1,158 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef _C2_STATUS_H_ +#define _C2_STATUS_H_ + +/* + * Verbs Status Codes + */ +enum c2_status { + C2_OK = 0, /* This must be zero */ + CCERR_INSUFFICIENT_RESOURCES = 1, + CCERR_INVALID_MODIFIER = 2, + CCERR_INVALID_MODE = 3, + CCERR_IN_USE = 4, + CCERR_INVALID_RNIC = 5, + CCERR_INTERRUPTED_OPERATION = 6, + CCERR_INVALID_EH = 7, + CCERR_INVALID_CQ = 8, + CCERR_CQ_EMPTY = 9, + CCERR_NOT_IMPLEMENTED = 10, + CCERR_CQ_DEPTH_TOO_SMALL = 11, + CCERR_PD_IN_USE = 12, + CCERR_INVALID_PD = 13, + CCERR_INVALID_SRQ = 14, + CCERR_INVALID_ADDRESS = 15, + CCERR_INVALID_NETMASK = 16, + CCERR_INVALID_QP = 17, + CCERR_INVALID_QP_STATE = 18, + CCERR_TOO_MANY_WRS_POSTED = 19, + CCERR_INVALID_WR_TYPE = 20, + CCERR_INVALID_SGL_LENGTH = 21, + CCERR_INVALID_SQ_DEPTH = 22, + CCERR_INVALID_RQ_DEPTH = 23, + CCERR_INVALID_ORD = 24, + CCERR_INVALID_IRD = 25, + CCERR_QP_ATTR_CANNOT_CHANGE = 26, + CCERR_INVALID_STAG = 27, + CCERR_QP_IN_USE = 28, + CCERR_OUTSTANDING_WRS = 29, + CCERR_STAG_IN_USE = 30, + CCERR_INVALID_STAG_INDEX = 31, + CCERR_INVALID_SGL_FORMAT = 32, + CCERR_ADAPTER_TIMEOUT = 33, + CCERR_INVALID_CQ_DEPTH = 34, + CCERR_INVALID_PRIVATE_DATA_LENGTH = 35, + CCERR_INVALID_EP = 36, + CCERR_MR_IN_USE = CCERR_STAG_IN_USE, + CCERR_FLUSHED = 38, + CCERR_INVALID_WQE = 39, + CCERR_LOCAL_QP_CATASTROPHIC_ERROR = 40, + CCERR_REMOTE_TERMINATION_ERROR = 41, + CCERR_BASE_AND_BOUNDS_VIOLATION = 42, + CCERR_ACCESS_VIOLATION = 43, + CCERR_INVALID_PD_ID = 44, + CCERR_WRAP_ERROR = 45, + CCERR_INV_STAG_ACCESS_ERROR = 46, + CCERR_ZERO_RDMA_READ_RESOURCES = 47, + CCERR_QP_NOT_PRIVILEGED = 48, + CCERR_STAG_STATE_NOT_INVALID = 49, + CCERR_INVALID_PAGE_SIZE = 50, + CCERR_INVALID_BUFFER_SIZE = 51, + CCERR_INVALID_PBE = 52, + CCERR_INVALID_FBO = 53, + CCERR_INVALID_LENGTH = 54, + CCERR_INVALID_ACCESS_RIGHTS = 55, + CCERR_PBL_TOO_BIG = 56, + CCERR_INVALID_VA = 57, + CCERR_INVALID_REGION = 58, + CCERR_INVALID_WINDOW = 59, + CCERR_TOTAL_LENGTH_TOO_BIG = 60, + CCERR_INVALID_QP_ID = 61, + CCERR_ADDR_IN_USE = 
62, + CCERR_ADDR_NOT_AVAIL = 63, + CCERR_NET_DOWN = 64, + CCERR_NET_UNREACHABLE = 65, + CCERR_CONN_ABORTED = 66, + CCERR_CONN_RESET = 67, + CCERR_NO_BUFS = 68, + CCERR_CONN_TIMEDOUT = 69, + CCERR_CONN_REFUSED = 70, + CCERR_HOST_UNREACHABLE = 71, + CCERR_INVALID_SEND_SGL_DEPTH = 72, + CCERR_INVALID_RECV_SGL_DEPTH = 73, + CCERR_INVALID_RDMA_WRITE_SGL_DEPTH = 74, + CCERR_INSUFFICIENT_PRIVILEGES = 75, + CCERR_STACK_ERROR = 76, + CCERR_INVALID_VERSION = 77, + CCERR_INVALID_MTU = 78, + CCERR_INVALID_IMAGE = 79, + CCERR_PENDING = 98, /* not an error; used internally by adapter */ + CCERR_DEFER = 99, /* not an error; used internally by adapter */ + CCERR_FAILED_WRITE = 100, + CCERR_FAILED_ERASE = 101, + CCERR_FAILED_VERIFICATION = 102, + CCERR_NOT_FOUND = 103, + +}; + +/* + * CCAE_ACTIVE_CONNECT_RESULTS status result codes. + */ +enum c2_connect_status { + C2_CONN_STATUS_SUCCESS = C2_OK, + C2_CONN_STATUS_NO_MEM = CCERR_INSUFFICIENT_RESOURCES, + C2_CONN_STATUS_TIMEDOUT = CCERR_CONN_TIMEDOUT, + C2_CONN_STATUS_REFUSED = CCERR_CONN_REFUSED, + C2_CONN_STATUS_NETUNREACH = CCERR_NET_UNREACHABLE, + C2_CONN_STATUS_HOSTUNREACH = CCERR_HOST_UNREACHABLE, + C2_CONN_STATUS_INVALID_RNIC = CCERR_INVALID_RNIC, + C2_CONN_STATUS_INVALID_QP = CCERR_INVALID_QP, + C2_CONN_STATUS_INVALID_QP_STATE = CCERR_INVALID_QP_STATE, + C2_CONN_STATUS_REJECTED = CCERR_CONN_RESET, + C2_CONN_STATUS_ADDR_NOT_AVAIL = CCERR_ADDR_NOT_AVAIL, +}; + +/* + * Flash programming status codes. + */ +enum c2_flash_status { + C2_FLASH_STATUS_SUCCESS = 0x0000, + C2_FLASH_STATUS_VERIFY_ERR = 0x0002, + C2_FLASH_STATUS_IMAGE_ERR = 0x0004, + C2_FLASH_STATUS_ECLBS = 0x0400, + C2_FLASH_STATUS_PSLBS = 0x0800, + C2_FLASH_STATUS_VPENS = 0x1000, +}; + +#endif /* _C2_STATUS_H_ */ diff --git a/drivers/infiniband/hw/amso1100/c2_wr.h b/drivers/infiniband/hw/amso1100/c2_wr.h new file mode 100644 index 0000000..bd9905b --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_wr.h @@ -0,0 +1,1520 @@ +/* + * Copyright (c) 2005 Ammasso, Inc.
All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _C2_WR_H_ +#define _C2_WR_H_ + +#ifdef CCDEBUG +#define CCWR_MAGIC 0xb07700b0 +#endif + +#define C2_QP_NO_ATTR_CHANGE 0xFFFFFFFF + +/* Maximum allowed size in bytes of private_data exchange + * on connect. + */ +#define C2_MAX_PRIVATE_DATA_SIZE 200 + +/* + * These types are shared among the adapter, host, and CCIL consumer. 
+ */ +enum c2_cq_notification_type { + C2_CQ_NOTIFICATION_TYPE_NONE = 1, + C2_CQ_NOTIFICATION_TYPE_NEXT, + C2_CQ_NOTIFICATION_TYPE_NEXT_SE +}; + +enum c2_setconfig_cmd { + C2_CFG_ADD_ADDR = 1, + C2_CFG_DEL_ADDR = 2, + C2_CFG_ADD_ROUTE = 3, + C2_CFG_DEL_ROUTE = 4 +}; + +enum c2_getconfig_cmd { + C2_GETCONFIG_ROUTES = 1, + C2_GETCONFIG_ADDRS +}; + +/* + * CCIL Work Request Identifiers + */ +enum c2wr_ids { + CCWR_RNIC_OPEN = 1, + CCWR_RNIC_QUERY, + CCWR_RNIC_SETCONFIG, + CCWR_RNIC_GETCONFIG, + CCWR_RNIC_CLOSE, + CCWR_CQ_CREATE, + CCWR_CQ_QUERY, + CCWR_CQ_MODIFY, + CCWR_CQ_DESTROY, + CCWR_QP_CONNECT, + CCWR_PD_ALLOC, + CCWR_PD_DEALLOC, + CCWR_SRQ_CREATE, + CCWR_SRQ_QUERY, + CCWR_SRQ_MODIFY, + CCWR_SRQ_DESTROY, + CCWR_QP_CREATE, + CCWR_QP_QUERY, + CCWR_QP_MODIFY, + CCWR_QP_DESTROY, + CCWR_NSMR_STAG_ALLOC, + CCWR_NSMR_REGISTER, + CCWR_NSMR_PBL, + CCWR_STAG_DEALLOC, + CCWR_NSMR_REREGISTER, + CCWR_SMR_REGISTER, + CCWR_MR_QUERY, + CCWR_MW_ALLOC, + CCWR_MW_QUERY, + CCWR_EP_CREATE, + CCWR_EP_GETOPT, + CCWR_EP_SETOPT, + CCWR_EP_DESTROY, + CCWR_EP_BIND, + CCWR_EP_CONNECT, + CCWR_EP_LISTEN, + CCWR_EP_SHUTDOWN, + CCWR_EP_LISTEN_CREATE, + CCWR_EP_LISTEN_DESTROY, + CCWR_EP_QUERY, + CCWR_CR_ACCEPT, + CCWR_CR_REJECT, + CCWR_CONSOLE, + CCWR_TERM, + CCWR_FLASH_INIT, + CCWR_FLASH, + CCWR_BUF_ALLOC, + CCWR_BUF_FREE, + CCWR_FLASH_WRITE, + CCWR_INIT, /* WARNING: Don't move this ever again! */ + + + + /* Add new IDs here */ + + + + /* + * WARNING: CCWR_LAST must always be the last verbs id defined! + * All the preceding IDs are fixed, and must not change. + * You can add new IDs, but must not remove or reorder + * any IDs. If you do, YOU will ruin any hope of + * compatibility between versions. + */ + CCWR_LAST, + + /* + * Start over at 1 so that arrays indexed by user wr id's + * begin at 1. This is OK since the verbs and user wr id's + * are always used on disjoint sets of queues.
+ */ + /* + * The order of the CCWR_SEND_XX verbs must + * match the order of the RDMA_OPs + */ + CCWR_SEND = 1, + CCWR_SEND_INV, + CCWR_SEND_SE, + CCWR_SEND_SE_INV, + CCWR_RDMA_WRITE, + CCWR_RDMA_READ, + CCWR_RDMA_READ_INV, + CCWR_MW_BIND, + CCWR_NSMR_FASTREG, + CCWR_STAG_INVALIDATE, + CCWR_RECV, + CCWR_NOP, + CCWR_UNIMPL, +/* WARNING: This must always be the last user wr id defined! */ +}; +#define RDMA_SEND_OPCODE_FROM_WR_ID(x) (x+2) + +/* + * SQ/RQ Work Request Types + */ +enum c2_wr_type { + C2_WR_TYPE_SEND = CCWR_SEND, + C2_WR_TYPE_SEND_SE = CCWR_SEND_SE, + C2_WR_TYPE_SEND_INV = CCWR_SEND_INV, + C2_WR_TYPE_SEND_SE_INV = CCWR_SEND_SE_INV, + C2_WR_TYPE_RDMA_WRITE = CCWR_RDMA_WRITE, + C2_WR_TYPE_RDMA_READ = CCWR_RDMA_READ, + C2_WR_TYPE_RDMA_READ_INV_STAG = CCWR_RDMA_READ_INV, + C2_WR_TYPE_BIND_MW = CCWR_MW_BIND, + C2_WR_TYPE_FASTREG_NSMR = CCWR_NSMR_FASTREG, + C2_WR_TYPE_INV_STAG = CCWR_STAG_INVALIDATE, + C2_WR_TYPE_RECV = CCWR_RECV, + C2_WR_TYPE_NOP = CCWR_NOP, +}; + +struct c2_netaddr { + u32 ip_addr; + u32 netmask; + u32 mtu; +}; + +struct c2_route { + u32 ip_addr; /* 0 indicates the default route */ + u32 netmask; /* netmask associated with dst */ + u32 flags; + union { + u32 ipaddr; /* address of the nexthop interface */ + u8 enaddr[6]; + } nexthop; +}; + +/* + * A Scatter Gather Entry. + */ +struct c2_data_addr { + u32 stag; + u32 length; + u64 to; +}; + +/* + * MR and MW flags used by the consumer, RI, and RNIC. + */ +enum c2_mm_flags { + MEM_REMOTE = 0x0001, /* allow mw binds with remote access. 
*/ + MEM_VA_BASED = 0x0002, /* Not Zero-based */ + MEM_PBL_COMPLETE = 0x0004, /* PBL array is complete in this msg */ + MEM_LOCAL_READ = 0x0008, /* allow local reads */ + MEM_LOCAL_WRITE = 0x0010, /* allow local writes */ + MEM_REMOTE_READ = 0x0020, /* allow remote reads */ + MEM_REMOTE_WRITE = 0x0040, /* allow remote writes */ + MEM_WINDOW_BIND = 0x0080, /* binds allowed */ + MEM_SHARED = 0x0100, /* set if MR is shared */ + MEM_STAG_VALID = 0x0200 /* set if STAG is in valid state */ +}; + +/* + * CCIL API ACF flags defined in terms of the low level mem flags. + * This minimizes translation needed in the user API + */ +enum c2_acf { + C2_ACF_LOCAL_READ = MEM_LOCAL_READ, + C2_ACF_LOCAL_WRITE = MEM_LOCAL_WRITE, + C2_ACF_REMOTE_READ = MEM_REMOTE_READ, + C2_ACF_REMOTE_WRITE = MEM_REMOTE_WRITE, + C2_ACF_WINDOW_BIND = MEM_WINDOW_BIND +}; + +/* + * Image types of objects written to flash + */ +#define C2_FLASH_IMG_BITFILE 1 +#define C2_FLASH_IMG_OPTION_ROM 2 +#define C2_FLASH_IMG_VPD 3 + +/* + * to fix bug 1815 we define the max size allowable of the + * terminate message (per the IETF spec). Refer to the IETF + * protocol specification, section 12.1.6, page 64. + * The message is prefixed by 20 bytes of DDP info. + * + * Then the message has 6 bytes for the terminate control + * and DDP segment length info plus a DDP header (either + * 14 or 18 bytes) plus 28 bytes for the RDMA header. + * Thus the max size is: + * 20 + (6 + 18 + 28) = 72 + */ +#define C2_MAX_TERMINATE_MESSAGE_SIZE (72) + +/* + * Build String Length. It must be the same as C2_BUILD_STR_LEN in ccil_api.h + */ +#define WR_BUILD_STR_LEN 64 + +/* + * WARNING: All of these structs need to align any 64bit types on + * 64 bit boundaries! 64bit types include u64. + */ + +/* + * Clustercore Work Request Header. Be sensitive to field layout + * and alignment. + */ +struct c2wr_hdr { + /* wqe_count is part of the cqe.
It is put here so the + * adapter can write to it while the wr is pending without + * clobbering part of the wr. This word need not be dma'd + * from the host to adapter by libccil, but we copy it anyway + * to make the memcpy to the adapter better aligned. + */ + u32 wqe_count; + + /* Put these fields next so that later 32- and 64-bit + * quantities are naturally aligned. + */ + u8 id; + u8 result; /* adapter -> host */ + u8 sge_count; /* host -> adapter */ + u8 flags; /* host -> adapter */ + + u64 context; +#ifdef CCMSGMAGIC + u32 magic; + u32 pad; +#endif +} __attribute__((packed)); + +/* + *------------------------ RNIC ------------------------ + */ + +/* + * WR_RNIC_OPEN + */ + +/* + * Flags for the RNIC WRs + */ +enum c2_rnic_flags { + RNIC_IRD_STATIC = 0x0001, + RNIC_ORD_STATIC = 0x0002, + RNIC_QP_STATIC = 0x0004, + RNIC_SRQ_SUPPORTED = 0x0008, + RNIC_PBL_BLOCK_MODE = 0x0010, + RNIC_SRQ_MODEL_ARRIVAL = 0x0020, + RNIC_CQ_OVF_DETECTED = 0x0040, + RNIC_PRIV_MODE = 0x0080 +}; + +struct c2wr_rnic_open_req { + struct c2wr_hdr hdr; + u64 user_context; + u16 flags; /* See enum c2_rnic_flags */ + u16 port_num; +} __attribute__((packed)); + +struct c2wr_rnic_open_rep { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)); + +union c2wr_rnic_open { + struct c2wr_rnic_open_req req; + struct c2wr_rnic_open_rep rep; +} __attribute__((packed)); + +struct c2wr_rnic_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)); + +/* + * WR_RNIC_QUERY + */ +struct c2wr_rnic_query_rep { + struct c2wr_hdr hdr; + u64 user_context; + u32 vendor_id; + u32 part_number; + u32 hw_version; + u32 fw_ver_major; + u32 fw_ver_minor; + u32 fw_ver_patch; + char fw_ver_build_str[WR_BUILD_STR_LEN]; + u32 max_qps; + u32 max_qp_depth; + u32 max_srq_depth; + u32 max_send_sgl_depth; + u32 max_rdma_sgl_depth; + u32 max_cqs; + u32 max_cq_depth; + u32 max_cq_event_handlers; + u32 max_mrs; + u32 max_pbl_depth; + u32 max_pds; + u32 max_global_ird; + u32 
max_global_ord; + u32 max_qp_ird; + u32 max_qp_ord; + u32 flags; + u32 max_mws; + u32 pbe_range_low; + u32 pbe_range_high; + u32 max_srqs; + u32 page_size; +} __attribute__((packed)); + +union c2wr_rnic_query { + struct c2wr_rnic_query_req req; + struct c2wr_rnic_query_rep rep; +} __attribute__((packed)); + +/* + * WR_RNIC_GETCONFIG + */ + +struct c2wr_rnic_getconfig_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 option; /* see c2_getconfig_cmd_t */ + u64 reply_buf; + u32 reply_buf_len; +} __attribute__((packed)) ; + +struct c2wr_rnic_getconfig_rep { + struct c2wr_hdr hdr; + u32 option; /* see c2_getconfig_cmd_t */ + u32 count_len; /* length of the number of addresses configured */ +} __attribute__((packed)) ; + +union c2wr_rnic_getconfig { + struct c2wr_rnic_getconfig_req req; + struct c2wr_rnic_getconfig_rep rep; +} __attribute__((packed)) ; + +/* + * WR_RNIC_SETCONFIG + */ +struct c2wr_rnic_setconfig_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 option; /* See c2_setconfig_cmd_t */ + /* variable data and pad. 
See c2_netaddr and c2_route */ + u8 data[0]; +} __attribute__((packed)) ; + +struct c2wr_rnic_setconfig_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_rnic_setconfig { + struct c2wr_rnic_setconfig_req req; + struct c2wr_rnic_setconfig_rep rep; +} __attribute__((packed)) ; + +/* + * WR_RNIC_CLOSE + */ +struct c2wr_rnic_close_req { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)) ; + +struct c2wr_rnic_close_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_rnic_close { + struct c2wr_rnic_close_req req; + struct c2wr_rnic_close_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ CQ ------------------------ + */ +struct c2wr_cq_create_req { + struct c2wr_hdr hdr; + u64 shared_ht; + u64 user_context; + u64 msg_pool; + u32 rnic_handle; + u32 msg_size; + u32 depth; +} __attribute__((packed)) ; + +struct c2wr_cq_create_rep { + struct c2wr_hdr hdr; + u32 mq_index; + u32 adapter_shared; + u32 cq_handle; +} __attribute__((packed)) ; + +union c2wr_cq_create { + struct c2wr_cq_create_req req; + struct c2wr_cq_create_rep rep; +} __attribute__((packed)) ; + +struct c2wr_cq_modify_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 cq_handle; + u32 new_depth; + u64 new_msg_pool; +} __attribute__((packed)) ; + +struct c2wr_cq_modify_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_cq_modify { + struct c2wr_cq_modify_req req; + struct c2wr_cq_modify_rep rep; +} __attribute__((packed)) ; + +struct c2wr_cq_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 cq_handle; +} __attribute__((packed)) ; + +struct c2wr_cq_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_cq_destroy { + struct c2wr_cq_destroy_req req; + struct c2wr_cq_destroy_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ PD ------------------------ + */ +struct c2wr_pd_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pd_id; +} 
__attribute__((packed)) ; + +struct c2wr_pd_alloc_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_pd_alloc { + struct c2wr_pd_alloc_req req; + struct c2wr_pd_alloc_rep rep; +} __attribute__((packed)) ; + +struct c2wr_pd_dealloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_pd_dealloc_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_pd_dealloc { + struct c2wr_pd_dealloc_req req; + struct c2wr_pd_dealloc_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ SRQ ------------------------ + */ +struct c2wr_srq_create_req { + struct c2wr_hdr hdr; + u64 shared_ht; + u64 user_context; + u32 rnic_handle; + u32 srq_depth; + u32 srq_limit; + u32 sgl_depth; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_srq_create_rep { + struct c2wr_hdr hdr; + u32 srq_depth; + u32 sgl_depth; + u32 msg_size; + u32 mq_index; + u32 mq_start; + u32 srq_handle; +} __attribute__((packed)) ; + +union c2wr_srq_create { + struct c2wr_srq_create_req req; + struct c2wr_srq_create_rep rep; +} __attribute__((packed)) ; + +struct c2wr_srq_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 srq_handle; +} __attribute__((packed)) ; + +struct c2wr_srq_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_srq_destroy { + struct c2wr_srq_destroy_req req; + struct c2wr_srq_destroy_rep rep; +} __attribute__((packed)) ; + +/* + *------------------------ QP ------------------------ + */ +enum c2wr_qp_flags { + QP_RDMA_READ = 0x00000001, /* RDMA read enabled? */ + QP_RDMA_WRITE = 0x00000002, /* RDMA write enabled? */ + QP_MW_BIND = 0x00000004, /* MWs enabled */ + QP_ZERO_STAG = 0x00000008, /* enabled? */ + QP_REMOTE_TERMINATION = 0x00000010, /* remote end terminated */ + QP_RDMA_READ_RESPONSE = 0x00000020 /* Remote RDMA read */ + /* enabled? 
*/ +}; + +struct c2wr_qp_create_req { + struct c2wr_hdr hdr; + u64 shared_sq_ht; + u64 shared_rq_ht; + u64 user_context; + u32 rnic_handle; + u32 sq_cq_handle; + u32 rq_cq_handle; + u32 sq_depth; + u32 rq_depth; + u32 srq_handle; + u32 srq_limit; + u32 flags; /* see enum c2wr_qp_flags */ + u32 send_sgl_depth; + u32 recv_sgl_depth; + u32 rdma_write_sgl_depth; + u32 ord; + u32 ird; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_qp_create_rep { + struct c2wr_hdr hdr; + u32 sq_depth; + u32 rq_depth; + u32 send_sgl_depth; + u32 recv_sgl_depth; + u32 rdma_write_sgl_depth; + u32 ord; + u32 ird; + u32 sq_msg_size; + u32 sq_mq_index; + u32 sq_mq_start; + u32 rq_msg_size; + u32 rq_mq_index; + u32 rq_mq_start; + u32 qp_handle; +} __attribute__((packed)) ; + +union c2wr_qp_create { + struct c2wr_qp_create_req req; + struct c2wr_qp_create_rep rep; +} __attribute__((packed)) ; + +struct c2wr_qp_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; +} __attribute__((packed)) ; + +struct c2wr_qp_query_rep { + struct c2wr_hdr hdr; + u64 user_context; + u32 rnic_handle; + u32 sq_depth; + u32 rq_depth; + u32 send_sgl_depth; + u32 rdma_write_sgl_depth; + u32 recv_sgl_depth; + u32 ord; + u32 ird; + u16 qp_state; + u16 flags; /* see c2wr_qp_flags_t */ + u32 qp_id; + u32 local_addr; + u32 remote_addr; + u16 local_port; + u16 remote_port; + u32 terminate_msg_length; /* 0 if not present */ + u8 data[0]; + /* Terminate Message in-line here. 
*/ +} __attribute__((packed)) ; + +union c2wr_qp_query { + struct c2wr_qp_query_req req; + struct c2wr_qp_query_rep rep; +} __attribute__((packed)) ; + +struct c2wr_qp_modify_req { + struct c2wr_hdr hdr; + u64 stream_msg; + u32 stream_msg_length; + u32 rnic_handle; + u32 qp_handle; + u32 next_qp_state; + u32 ord; + u32 ird; + u32 sq_depth; + u32 rq_depth; + u32 llp_ep_handle; +} __attribute__((packed)) ; + +struct c2wr_qp_modify_rep { + struct c2wr_hdr hdr; + u32 ord; + u32 ird; + u32 sq_depth; + u32 rq_depth; + u32 sq_msg_size; + u32 sq_mq_index; + u32 sq_mq_start; + u32 rq_msg_size; + u32 rq_mq_index; + u32 rq_mq_start; +} __attribute__((packed)) ; + +union c2wr_qp_modify { + struct c2wr_qp_modify_req req; + struct c2wr_qp_modify_rep rep; +} __attribute__((packed)) ; + +struct c2wr_qp_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; +} __attribute__((packed)) ; + +struct c2wr_qp_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_qp_destroy { + struct c2wr_qp_destroy_req req; + struct c2wr_qp_destroy_rep rep; +} __attribute__((packed)) ; + +/* + * The CCWR_QP_CONNECT msg is posted on the verbs request queue. It can + * only be posted when a QP is in IDLE state. After the connect request is + * submitted to the LLP, the adapter moves the QP to CONNECT_PENDING state. + * No synchronous reply from adapter to this WR. The results of + * connection are passed back in an async event CCAE_ACTIVE_CONNECT_RESULTS + * See c2wr_ae_active_connect_results_t + */ +struct c2wr_qp_connect_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; + u32 remote_addr; + u16 remote_port; + u16 pad; + u32 private_data_length; + u8 private_data[0]; /* Private data in-line. */ +} __attribute__((packed)) ; + +struct c2wr_qp_connect { + struct c2wr_qp_connect_req req; + /* no synchronous reply. 
*/ +} __attribute__((packed)) ; + + +/* + *------------------------ MM ------------------------ + */ + +struct c2wr_nsmr_stag_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pbl_depth; + u32 pd_id; + u32 flags; +} __attribute__((packed)) ; + +struct c2wr_nsmr_stag_alloc_rep { + struct c2wr_hdr hdr; + u32 pbl_depth; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_nsmr_stag_alloc { + struct c2wr_nsmr_stag_alloc_req req; + struct c2wr_nsmr_stag_alloc_rep rep; +} __attribute__((packed)) ; + +struct c2wr_nsmr_register_req { + struct c2wr_hdr hdr; + u64 va; + u32 rnic_handle; + u16 flags; + u8 stag_key; + u8 pad; + u32 pd_id; + u32 pbl_depth; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)) ; + +struct c2wr_nsmr_register_rep { + struct c2wr_hdr hdr; + u32 pbl_depth; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_nsmr_register { + struct c2wr_nsmr_register_req req; + struct c2wr_nsmr_register_rep rep; +} __attribute__((packed)) ; + +struct c2wr_nsmr_pbl_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 flags; + u32 stag_index; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)) ; + +struct c2wr_nsmr_pbl_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_nsmr_pbl { + struct c2wr_nsmr_pbl_req req; + struct c2wr_nsmr_pbl_rep rep; +} __attribute__((packed)) ; + +struct c2wr_mr_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 stag_index; +} __attribute__((packed)) ; + +struct c2wr_mr_query_rep { + struct c2wr_hdr hdr; + u8 stag_key; + u8 pad[3]; + u32 pd_id; + u32 flags; + u32 pbl_depth; +} __attribute__((packed)) ; + +union c2wr_mr_query { + struct c2wr_mr_query_req req; + struct c2wr_mr_query_rep rep; +} __attribute__((packed)) ; + +struct c2wr_mw_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 
stag_index; +} __attribute__((packed)) ; + +struct c2wr_mw_query_rep { + struct c2wr_hdr hdr; + u8 stag_key; + u8 pad[3]; + u32 pd_id; + u32 flags; +} __attribute__((packed)) ; + +union c2wr_mw_query { + struct c2wr_mw_query_req req; + struct c2wr_mw_query_rep rep; +} __attribute__((packed)) ; + + +struct c2wr_stag_dealloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 stag_index; +} __attribute__((packed)) ; + +struct c2wr_stag_dealloc_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)) ; + +union c2wr_stag_dealloc { + struct c2wr_stag_dealloc_req req; + struct c2wr_stag_dealloc_rep rep; +} __attribute__((packed)) ; + +struct c2wr_nsmr_reregister_req { + struct c2wr_hdr hdr; + u64 va; + u32 rnic_handle; + u16 flags; + u8 stag_key; + u8 pad; + u32 stag_index; + u32 pd_id; + u32 pbl_depth; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + u32 pad1; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)) ; + +struct c2wr_nsmr_reregister_rep { + struct c2wr_hdr hdr; + u32 pbl_depth; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_nsmr_reregister { + struct c2wr_nsmr_reregister_req req; + struct c2wr_nsmr_reregister_rep rep; +} __attribute__((packed)) ; + +struct c2wr_smr_register_req { + struct c2wr_hdr hdr; + u64 va; + u32 rnic_handle; + u16 flags; + u8 stag_key; + u8 pad; + u32 stag_index; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_smr_register_rep { + struct c2wr_hdr hdr; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_smr_register { + struct c2wr_smr_register_req req; + struct c2wr_smr_register_rep rep; +} __attribute__((packed)) ; + +struct c2wr_mw_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 pd_id; +} __attribute__((packed)) ; + +struct c2wr_mw_alloc_rep { + struct c2wr_hdr hdr; + u32 stag_index; +} __attribute__((packed)) ; + +union c2wr_mw_alloc { + struct c2wr_mw_alloc_req req; + struct c2wr_mw_alloc_rep rep; +} 
__attribute__((packed)) ; + +/* + *------------------------ WRs ----------------------- + */ + +struct c2wr_user_hdr { + struct c2wr_hdr hdr; /* Has status and WR Type */ +} __attribute__((packed)) ; + +enum c2_qp_state { + C2_QP_STATE_IDLE = 0x01, + C2_QP_STATE_CONNECTING = 0x02, + C2_QP_STATE_RTS = 0x04, + C2_QP_STATE_CLOSING = 0x08, + C2_QP_STATE_TERMINATE = 0x10, + C2_QP_STATE_ERROR = 0x20, +}; + +/* Completion queue entry. */ +struct c2wr_ce { + struct c2wr_hdr hdr; /* Has status and WR Type */ + u64 qp_user_context; /* c2_user_qp_t * */ + u32 qp_state; /* Current QP State */ + u32 handle; /* QPID or EP Handle */ + u32 bytes_rcvd; /* valid for RECV WCs */ + u32 stag; +} __attribute__((packed)) ; + + +/* + * Flags used for all post-sq WRs. These must fit in the flags + * field of the struct c2wr_hdr (eight bits). + */ +enum { + SQ_SIGNALED = 0x01, + SQ_READ_FENCE = 0x02, + SQ_FENCE = 0x04, +}; + +/* + * Common fields for all post-sq WRs. Namely the standard header and a + * secondary header with fields common to all post-sq WRs. + */ +struct c2_sq_hdr { + struct c2wr_user_hdr user_hdr; +} __attribute__((packed)); + +/* + * Same as above but for post-rq WRs. + */ +struct c2_rq_hdr { + struct c2wr_user_hdr user_hdr; +} __attribute__((packed)); + +/* + * use the same struct for all sends. 
+ */ +struct c2wr_send_req { + struct c2_sq_hdr sq_hdr; + u32 sge_len; + u32 remote_stag; + u8 data[0]; /* SGE array */ +} __attribute__((packed)); + +union c2wr_send { + struct c2wr_send_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_rdma_write_req { + struct c2_sq_hdr sq_hdr; + u64 remote_to; + u32 remote_stag; + u32 sge_len; + u8 data[0]; /* SGE array */ +} __attribute__((packed)); + +union c2wr_rdma_write { + struct c2wr_rdma_write_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_rdma_read_req { + struct c2_sq_hdr sq_hdr; + u64 local_to; + u64 remote_to; + u32 local_stag; + u32 remote_stag; + u32 length; +} __attribute__((packed)); + +union c2wr_rdma_read { + struct c2wr_rdma_read_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_mw_bind_req { + struct c2_sq_hdr sq_hdr; + u64 va; + u8 stag_key; + u8 pad[3]; + u32 mw_stag_index; + u32 mr_stag_index; + u32 length; + u32 flags; +} __attribute__((packed)); + +union c2wr_mw_bind { + struct c2wr_mw_bind_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_nsmr_fastreg_req { + struct c2_sq_hdr sq_hdr; + u64 va; + u8 stag_key; + u8 pad[3]; + u32 stag_index; + u32 pbe_size; + u32 fbo; + u32 length; + u32 addrs_length; + /* array of paddrs (must be aligned on a 64bit boundary) */ + u64 paddrs[0]; +} __attribute__((packed)); + +union c2wr_nsmr_fastreg { + struct c2wr_nsmr_fastreg_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_stag_invalidate_req { + struct c2_sq_hdr sq_hdr; + u8 stag_key; + u8 pad[3]; + u32 stag_index; +} __attribute__((packed)); + +union c2wr_stag_invalidate { + struct c2wr_stag_invalidate_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +union c2wr_sqwr { + struct c2_sq_hdr sq_hdr; + struct c2wr_send_req send; + struct c2wr_send_req send_se; + struct c2wr_send_req send_inv; + struct c2wr_send_req send_se_inv; + struct c2wr_rdma_write_req rdma_write; + struct 
c2wr_rdma_read_req rdma_read; + struct c2wr_mw_bind_req mw_bind; + struct c2wr_nsmr_fastreg_req nsmr_fastreg; + struct c2wr_stag_invalidate_req stag_inv; +} __attribute__((packed)); + + +/* + * RQ WRs + */ +struct c2wr_rqwr { + struct c2_rq_hdr rq_hdr; + u8 data[0]; /* array of SGEs */ +} __attribute__((packed)); + +union c2wr_recv { + struct c2wr_rqwr req; + struct c2wr_ce rep; +} __attribute__((packed)); + +/* + * All AEs start with this header. Most AEs only need to convey the + * information in the header. Some, like LLP connection events, need + * more info. The union typedef c2wr_ae_t has all the possible AEs. + * + * hdr.context is the user_context from the rnic_open WR. NULL if this + * is not affiliated with an rnic. + * + * hdr.id is the AE identifier (e.g. CCAE_REMOTE_SHUTDOWN, + * CCAE_LLP_CLOSE_COMPLETE) + * + * resource_type is one of: C2_RES_IND_QP, C2_RES_IND_CQ, C2_RES_IND_SRQ + * + * user_context is the context passed down when the host created the resource. + */ +struct c2wr_ae_hdr { + struct c2wr_hdr hdr; + u64 user_context; /* user context for this res. */ + u32 resource_type; /* see enum c2_resource_indicator */ + u32 resource; /* handle for resource */ + u32 qp_state; /* current QP State */ +} __attribute__((packed)); + +/* + * After submitting the CCAE_ACTIVE_CONNECT_RESULTS message on the AEQ, + * the adapter moves the QP into RTS state + */ +struct c2wr_ae_active_connect_results { + struct c2wr_ae_hdr ae_hdr; + u32 laddr; + u32 raddr; + u16 lport; + u16 rport; + u32 private_data_length; + u8 private_data[0]; /* data is in-line in the msg. */ +} __attribute__((packed)); + +/* + * When connections are established by the stack (and the private data + * MPA frame is received), the adapter will generate an event to the host.
+ * The details of the connection, any private data, and the new connection + * request handle are passed up via the CCAE_CONNECTION_REQUEST msg on the + * AE queue: + */ +struct c2wr_ae_connection_request { + struct c2wr_ae_hdr ae_hdr; + u32 cr_handle; /* connreq handle (sock ptr) */ + u32 laddr; + u32 raddr; + u16 lport; + u16 rport; + u32 private_data_length; + u8 private_data[0]; /* data is in-line in the msg. */ +} __attribute__((packed)); + +union c2wr_ae { + struct c2wr_ae_hdr ae_generic; + struct c2wr_ae_active_connect_results ae_active_connect_results; + struct c2wr_ae_connection_request ae_connection_request; +} __attribute__((packed)); + +struct c2wr_init_req { + struct c2wr_hdr hdr; + u64 hint_count; + u64 q0_host_shared; + u64 q1_host_shared; + u64 q1_host_msg_pool; + u64 q2_host_shared; + u64 q2_host_msg_pool; +} __attribute__((packed)); + +struct c2wr_init_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_init { + struct c2wr_init_req req; + struct c2wr_init_rep rep; +} __attribute__((packed)); + +/* + * For upgrading flash.
+ */ + +struct c2wr_flash_init_req { + struct c2wr_hdr hdr; + u32 rnic_handle; +} __attribute__((packed)); + +struct c2wr_flash_init_rep { + struct c2wr_hdr hdr; + u32 adapter_flash_buf_offset; + u32 adapter_flash_len; +} __attribute__((packed)); + +union c2wr_flash_init { + struct c2wr_flash_init_req req; + struct c2wr_flash_init_rep rep; +} __attribute__((packed)); + +struct c2wr_flash_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 len; +} __attribute__((packed)); + +struct c2wr_flash_rep { + struct c2wr_hdr hdr; + u32 status; +} __attribute__((packed)); + +union c2wr_flash { + struct c2wr_flash_req req; + struct c2wr_flash_rep rep; +} __attribute__((packed)); + +struct c2wr_buf_alloc_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 size; +} __attribute__((packed)); + +struct c2wr_buf_alloc_rep { + struct c2wr_hdr hdr; + u32 offset; /* 0 if mem not available */ + u32 size; /* 0 if mem not available */ +} __attribute__((packed)); + +union c2wr_buf_alloc { + struct c2wr_buf_alloc_req req; + struct c2wr_buf_alloc_rep rep; +} __attribute__((packed)); + +struct c2wr_buf_free_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 offset; /* Must match value from alloc */ + u32 size; /* Must match value from alloc */ +} __attribute__((packed)); + +struct c2wr_buf_free_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_buf_free { + struct c2wr_buf_free_req req; + struct c2wr_ce rep; +} __attribute__((packed)); + +struct c2wr_flash_write_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 offset; + u32 size; + u32 type; + u32 flags; +} __attribute__((packed)); + +struct c2wr_flash_write_rep { + struct c2wr_hdr hdr; + u32 status; +} __attribute__((packed)); + +union c2wr_flash_write { + struct c2wr_flash_write_req req; + struct c2wr_flash_write_rep rep; +} __attribute__((packed)); + +/* + * Messages for LLP connection setup. + */ + +/* + * Listen Request. This allocates a listening endpoint to allow passive + * connection setup. 
Newly established LLP connections are passed up + * via an AE. See c2wr_ae_connection_request_t + */ +struct c2wr_ep_listen_create_req { + struct c2wr_hdr hdr; + u64 user_context; /* returned in AEs. */ + u32 rnic_handle; + u32 local_addr; /* local addr, or 0 */ + u16 local_port; /* 0 means "pick one" */ + u16 pad; + u32 backlog; /* traditional TCP listen backlog */ +} __attribute__((packed)); + +struct c2wr_ep_listen_create_rep { + struct c2wr_hdr hdr; + u32 ep_handle; /* handle to new listening ep */ + u16 local_port; /* resulting port... */ + u16 pad; +} __attribute__((packed)); + +union c2wr_ep_listen_create { + struct c2wr_ep_listen_create_req req; + struct c2wr_ep_listen_create_rep rep; +} __attribute__((packed)); + +struct c2wr_ep_listen_destroy_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 ep_handle; +} __attribute__((packed)); + +struct c2wr_ep_listen_destroy_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_ep_listen_destroy { + struct c2wr_ep_listen_destroy_req req; + struct c2wr_ep_listen_destroy_rep rep; +} __attribute__((packed)); + +struct c2wr_ep_query_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 ep_handle; +} __attribute__((packed)); + +struct c2wr_ep_query_rep { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 local_addr; + u32 remote_addr; + u16 local_port; + u16 remote_port; +} __attribute__((packed)); + +union c2wr_ep_query { + struct c2wr_ep_query_req req; + struct c2wr_ep_query_rep rep; +} __attribute__((packed)); + + +/* + * The host passes this down to indicate acceptance of a pending iWARP + * connection. The cr_handle was obtained from the CONNECTION_REQUEST + * AE passed up by the adapter. See c2wr_ae_connection_request_t. + */ +struct c2wr_cr_accept_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 qp_handle; /* QP to bind to this LLP conn */ + u32 ep_handle; /* LLP handle to accept */ + u32 private_data_length; + u8 private_data[0]; /* data in-line in msg.
*/ +} __attribute__((packed)); + +/* + * adapter sends reply when private data is successfully submitted to + * the LLP. + */ +struct c2wr_cr_accept_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_cr_accept { + struct c2wr_cr_accept_req req; + struct c2wr_cr_accept_rep rep; +} __attribute__((packed)); + +/* + * The host sends this down if a given iWARP connection request was + * rejected by the consumer. The cr_handle was obtained from a + * previous c2wr_ae_connection_request_t AE sent by the adapter. + */ +struct c2wr_cr_reject_req { + struct c2wr_hdr hdr; + u32 rnic_handle; + u32 ep_handle; /* LLP handle to reject */ +} __attribute__((packed)); + +/* + * Dunno if this is needed, but we'll add it for now. The adapter will + * send the reject_reply after the LLP endpoint has been destroyed. + */ +struct c2wr_cr_reject_rep { + struct c2wr_hdr hdr; +} __attribute__((packed)); + +union c2wr_cr_reject { + struct c2wr_cr_reject_req req; + struct c2wr_cr_reject_rep rep; +} __attribute__((packed)); + +/* + * console command. Used to implement a debug console over the verbs + * request and reply queues. + */ + +/* + * Console request message. It contains: + * - message hdr with id = CCWR_CONSOLE + * - the physaddr/len of host memory to be used for the reply. + * - the command string. eg: "netstat -s" or "zoneinfo" + */ +struct c2wr_console_req { + struct c2wr_hdr hdr; /* id = CCWR_CONSOLE */ + u64 reply_buf; /* pinned host buf for reply */ + u32 reply_buf_len; /* length of reply buffer */ + u8 command[0]; /* NUL terminated ascii string */ + /* containing the command req */ +} __attribute__((packed)); + +/* + * flags used in the console reply. + */ +enum c2_console_flags { + CONS_REPLY_TRUNCATED = 0x00000001 /* reply was truncated */ +} __attribute__((packed)); + +/* + * Console reply message. + * hdr.result contains the c2_status_t error if the reply was _not_ generated, + * or C2_OK if the reply was generated. 
+ */ +struct c2wr_console_rep { + struct c2wr_hdr hdr; /* id = CCWR_CONSOLE */ + u32 flags; +} __attribute__((packed)); + +union c2wr_console { + struct c2wr_console_req req; + struct c2wr_console_rep rep; +} __attribute__((packed)); + + +/* + * Giant union with all WRs. Makes life easier... + */ +union c2wr { + struct c2wr_hdr hdr; + struct c2wr_user_hdr user_hdr; + union c2wr_rnic_open rnic_open; + union c2wr_rnic_query rnic_query; + union c2wr_rnic_getconfig rnic_getconfig; + union c2wr_rnic_setconfig rnic_setconfig; + union c2wr_rnic_close rnic_close; + union c2wr_cq_create cq_create; + union c2wr_cq_modify cq_modify; + union c2wr_cq_destroy cq_destroy; + union c2wr_pd_alloc pd_alloc; + union c2wr_pd_dealloc pd_dealloc; + union c2wr_srq_create srq_create; + union c2wr_srq_destroy srq_destroy; + union c2wr_qp_create qp_create; + union c2wr_qp_query qp_query; + union c2wr_qp_modify qp_modify; + union c2wr_qp_destroy qp_destroy; + struct c2wr_qp_connect qp_connect; + union c2wr_nsmr_stag_alloc nsmr_stag_alloc; + union c2wr_nsmr_register nsmr_register; + union c2wr_nsmr_pbl nsmr_pbl; + union c2wr_mr_query mr_query; + union c2wr_mw_query mw_query; + union c2wr_stag_dealloc stag_dealloc; + union c2wr_sqwr sqwr; + struct c2wr_rqwr rqwr; + struct c2wr_ce ce; + union c2wr_ae ae; + union c2wr_init init; + union c2wr_ep_listen_create ep_listen_create; + union c2wr_ep_listen_destroy ep_listen_destroy; + union c2wr_cr_accept cr_accept; + union c2wr_cr_reject cr_reject; + union c2wr_console console; + union c2wr_flash_init flash_init; + union c2wr_flash flash; + union c2wr_buf_alloc buf_alloc; + union c2wr_buf_free buf_free; + union c2wr_flash_write flash_write; +} __attribute__((packed)); + + +/* + * Accessors for the wr fields that are packed together tightly to + * reduce the wr message size. The wr arguments are void* so that + * either a struct c2wr*, a struct c2wr_hdr*, or a pointer to any of the types + * in the struct c2wr union can be passed in. 
+ */ +static __inline__ u8 c2_wr_get_id(void *wr) +{ + return ((struct c2wr_hdr *) wr)->id; +} +static __inline__ void c2_wr_set_id(void *wr, u8 id) +{ + ((struct c2wr_hdr *) wr)->id = id; +} +static __inline__ u8 c2_wr_get_result(void *wr) +{ + return ((struct c2wr_hdr *) wr)->result; +} +static __inline__ void c2_wr_set_result(void *wr, u8 result) +{ + ((struct c2wr_hdr *) wr)->result = result; +} +static __inline__ u8 c2_wr_get_flags(void *wr) +{ + return ((struct c2wr_hdr *) wr)->flags; +} +static __inline__ void c2_wr_set_flags(void *wr, u8 flags) +{ + ((struct c2wr_hdr *) wr)->flags = flags; +} +static __inline__ u8 c2_wr_get_sge_count(void *wr) +{ + return ((struct c2wr_hdr *) wr)->sge_count; +} +static __inline__ void c2_wr_set_sge_count(void *wr, u8 sge_count) +{ + ((struct c2wr_hdr *) wr)->sge_count = sge_count; +} +static __inline__ u32 c2_wr_get_wqe_count(void *wr) +{ + return ((struct c2wr_hdr *) wr)->wqe_count; +} +static __inline__ void c2_wr_set_wqe_count(void *wr, u32 wqe_count) +{ + ((struct c2wr_hdr *) wr)->wqe_count = wqe_count; +} + +#endif /* _C2_WR_H_ */ From swise at opengridcomputing.com Tue Jun 20 13:30:55 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:30:55 -0500 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. In-Reply-To: <20060620203050.31536.5341.stgit@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> Message-ID: <20060620203055.31536.15131.stgit@stevo-desktop> This is the core of the driver and includes the hardware probe, low-level device interfaces and native Ethernet support. V2 Review Changes: - fixed private data memory leak on incoming connect requests. No longer need to copy the private data. The IWCM will. - correctly map host memory for DMA (don't use __pa()).
V1 Review Changes - sizeof -> sizeof() - dprintk() -> pr_debug() - removed useless asserts - assert() -> BUG_ON() - C2_DEBUG -> DEBUG - removed debug netevent code - removed arp request squelch code from intr handler, replacing it with setting arp_ignore when the c2 netdev is brought up. - removed c2_set_mac_addr(). --- drivers/infiniband/hw/amso1100/c2.c | 1255 ++++++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2.h | 552 +++++++++++++ drivers/infiniband/hw/amso1100/c2_ae.c | 321 ++++++++ drivers/infiniband/hw/amso1100/c2_intr.c | 209 +++++ drivers/infiniband/hw/amso1100/c2_rnic.c | 664 ++++++++++++++++ 5 files changed, 3001 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2.c b/drivers/infiniband/hw/amso1100/c2.c new file mode 100644 index 0000000..4fdbd80 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2.c @@ -0,0 +1,1255 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include "c2.h" +#include "c2_provider.h" + +MODULE_AUTHOR("Tom Tucker "); +MODULE_DESCRIPTION("Ammasso AMSO1100 Low-level iWARP Driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +static const u32 default_msg = NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK + | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN; + +static int debug = -1; /* defaults above */ +module_param(debug, int, 0); +MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)"); + +static int c2_up(struct net_device *netdev); +static int c2_down(struct net_device *netdev); +static int c2_xmit_frame(struct sk_buff *skb, struct net_device *netdev); +static void c2_tx_interrupt(struct net_device *netdev); +static void c2_rx_interrupt(struct net_device *netdev); +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs *regs); +static void c2_tx_timeout(struct net_device *netdev); +static int c2_change_mtu(struct net_device *netdev, int new_mtu); +static void c2_reset(struct c2_port *c2_port); +static struct net_device_stats *c2_get_stats(struct net_device *netdev); + +static struct pci_device_id c2_pci_table[] = { + {0x18b8, 0xb001, PCI_ANY_ID, PCI_ANY_ID}, + {0} +}; + +MODULE_DEVICE_TABLE(pci, c2_pci_table); + +static void c2_print_macaddr(struct net_device *netdev) +{ + 
pr_debug("%s: MAC %02X:%02X:%02X:%02X:%02X:%02X, " + "IRQ %u\n", netdev->name, + netdev->dev_addr[0], netdev->dev_addr[1], netdev->dev_addr[2], + netdev->dev_addr[3], netdev->dev_addr[4], netdev->dev_addr[5], + netdev->irq); +} + +static void c2_set_rxbufsize(struct c2_port *c2_port) +{ + struct net_device *netdev = c2_port->netdev; + + if (netdev->mtu > RX_BUF_SIZE) + c2_port->rx_buf_size = + netdev->mtu + ETH_HLEN + sizeof(struct c2_rxp_hdr) + + NET_IP_ALIGN; + else + c2_port->rx_buf_size = sizeof(struct c2_rxp_hdr) + RX_BUF_SIZE; +} + +/* + * Allocate TX ring elements and chain them together. + * One-to-one association of adapter descriptors with ring elements. + */ +static int c2_tx_ring_alloc(struct c2_ring *tx_ring, void *vaddr, + dma_addr_t base, void __iomem * mmio_txp_ring) +{ + struct c2_tx_desc *tx_desc; + struct c2_txp_desc __iomem *txp_desc; + struct c2_element *elem; + int i; + + tx_ring->start = kmalloc(sizeof(*elem) * tx_ring->count, GFP_KERNEL); + if (!tx_ring->start) + return -ENOMEM; + + elem = tx_ring->start; + tx_desc = vaddr; + txp_desc = mmio_txp_ring; + for (i = 0; i < tx_ring->count; i++, elem++, tx_desc++, txp_desc++) { + tx_desc->len = 0; + tx_desc->status = 0; + + /* Set TXP_HTXD_UNINIT */ + __raw_writeq(cpu_to_be64(0x1122334455667788ULL), + (void __iomem *) txp_desc + C2_TXP_ADDR); + __raw_writew(0, (void __iomem *) txp_desc + C2_TXP_LEN); + __raw_writew(cpu_to_be16(TXP_HTXD_UNINIT), + (void __iomem *) txp_desc + C2_TXP_FLAGS); + + elem->skb = NULL; + elem->ht_desc = tx_desc; + elem->hw_desc = txp_desc; + + if (i == tx_ring->count - 1) { + elem->next = tx_ring->start; + tx_desc->next_offset = base; + } else { + elem->next = elem + 1; + tx_desc->next_offset = + base + (i + 1) * sizeof(*tx_desc); + } + } + + tx_ring->to_use = tx_ring->to_clean = tx_ring->start; + + return 0; +} + +/* + * Allocate RX ring elements and chain them together. + * One-to-one association of adapter descriptors with ring elements. 
+ */ +static int c2_rx_ring_alloc(struct c2_ring *rx_ring, void *vaddr, + dma_addr_t base, void __iomem * mmio_rxp_ring) +{ + struct c2_rx_desc *rx_desc; + struct c2_rxp_desc __iomem *rxp_desc; + struct c2_element *elem; + int i; + + rx_ring->start = kmalloc(sizeof(*elem) * rx_ring->count, GFP_KERNEL); + if (!rx_ring->start) + return -ENOMEM; + + elem = rx_ring->start; + rx_desc = vaddr; + rxp_desc = mmio_rxp_ring; + for (i = 0; i < rx_ring->count; i++, elem++, rx_desc++, rxp_desc++) { + rx_desc->len = 0; + rx_desc->status = 0; + + /* Set RXP_HRXD_UNINIT */ + __raw_writew(cpu_to_be16(RXP_HRXD_OK), + (void __iomem *) rxp_desc + C2_RXP_STATUS); + __raw_writew(0, (void __iomem *) rxp_desc + C2_RXP_COUNT); + __raw_writew(0, (void __iomem *) rxp_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(0x99aabbccddeeffULL), + (void __iomem *) rxp_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_UNINIT), + (void __iomem *) rxp_desc + C2_RXP_FLAGS); + + elem->skb = NULL; + elem->ht_desc = rx_desc; + elem->hw_desc = rxp_desc; + + if (i == rx_ring->count - 1) { + elem->next = rx_ring->start; + rx_desc->next_offset = base; + } else { + elem->next = elem + 1; + rx_desc->next_offset = + base + (i + 1) * sizeof(*rx_desc); + } + } + + rx_ring->to_use = rx_ring->to_clean = rx_ring->start; + + return 0; +} + +/* Setup buffer for receiving */ +static inline int c2_rx_alloc(struct c2_port *c2_port, struct c2_element *elem) +{ + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_rx_desc *rx_desc = elem->ht_desc; + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen; + struct c2_rxp_hdr *rxp_hdr; + + skb = dev_alloc_skb(c2_port->rx_buf_size); + if (unlikely(!skb)) { + pr_debug("%s: out of memory for receive\n", + c2_port->netdev->name); + return -ENOMEM; + } + + /* Zero out the rxp hdr in the sk_buff */ + memset(skb->data, 0, sizeof(*rxp_hdr)); + + skb->dev = c2_port->netdev; + + maplen = c2_port->rx_buf_size; + mapaddr = + pci_map_single(c2dev->pcidev, skb->data, maplen, + 
PCI_DMA_FROMDEVICE); + + /* Set the sk_buff RXP_header to RXP_HRXD_READY */ + rxp_hdr = (struct c2_rxp_hdr *) skb->data; + rxp_hdr->flags = RXP_HRXD_READY; + + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); + __raw_writew(cpu_to_be16((u16) maplen - sizeof(*rxp_hdr)), + elem->hw_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_READY), elem->hw_desc + C2_RXP_FLAGS); + + elem->skb = skb; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + rx_desc->len = maplen; + + return 0; +} + +/* + * Allocate buffers for the Rx ring + * For receive: rx_ring.to_clean is next received frame + */ +static int c2_rx_fill(struct c2_port *c2_port) +{ + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + int ret = 0; + + elem = rx_ring->start; + do { + if (c2_rx_alloc(c2_port, elem)) { + ret = 1; + break; + } + } while ((elem = elem->next) != rx_ring->start); + + rx_ring->to_clean = rx_ring->start; + return ret; +} + +/* Free all buffers in RX ring, assumes receiver stopped */ +static void c2_rx_clean(struct c2_port *c2_port) +{ + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + struct c2_rx_desc *rx_desc; + + elem = rx_ring->start; + do { + rx_desc = elem->ht_desc; + rx_desc->len = 0; + + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); + __raw_writew(0, elem->hw_desc + C2_RXP_COUNT); + __raw_writew(0, elem->hw_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(0x99aabbccddeeffULL), + elem->hw_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_UNINIT), + elem->hw_desc + C2_RXP_FLAGS); + + if (elem->skb) { + pci_unmap_single(c2dev->pcidev, elem->mapaddr, + elem->maplen, PCI_DMA_FROMDEVICE); + dev_kfree_skb(elem->skb); + elem->skb = NULL; + } + } while ((elem = elem->next) != rx_ring->start); +} + +static inline int c2_tx_free(struct c2_dev *c2dev, struct c2_element *elem) +{ + struct c2_tx_desc *tx_desc = elem->ht_desc; + + 
tx_desc->len = 0; + + pci_unmap_single(c2dev->pcidev, elem->mapaddr, elem->maplen, + PCI_DMA_TODEVICE); + + if (elem->skb) { + dev_kfree_skb_any(elem->skb); + elem->skb = NULL; + } + + return 0; +} + +/* Free all buffers in TX ring, assumes transmitter stopped */ +static void c2_tx_clean(struct c2_port *c2_port) +{ + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + struct c2_txp_desc txp_htxd; + int retry; + unsigned long flags; + + spin_lock_irqsave(&c2_port->tx_lock, flags); + + elem = tx_ring->start; + + do { + retry = 0; + do { + txp_htxd.flags = + readw(elem->hw_desc + C2_TXP_FLAGS); + + if (txp_htxd.flags == TXP_HTXD_READY) { + retry = 1; + __raw_writew(0, + elem->hw_desc + C2_TXP_LEN); + __raw_writeq(0, + elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(TXP_HTXD_DONE), + elem->hw_desc + C2_TXP_FLAGS); + c2_port->netstats.tx_dropped++; + break; + } else { + __raw_writew(0, + elem->hw_desc + C2_TXP_LEN); + __raw_writeq(cpu_to_be64(0x1122334455667788ULL), + elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(TXP_HTXD_UNINIT), + elem->hw_desc + C2_TXP_FLAGS); + } + + c2_tx_free(c2_port->c2dev, elem); + + } while ((elem = elem->next) != tx_ring->start); + } while (retry); + + c2_port->tx_avail = c2_port->tx_ring.count - 1; + c2_port->c2dev->cur_tx = tx_ring->to_use - tx_ring->start; + + if (c2_port->tx_avail > MAX_SKB_FRAGS + 1) + netif_wake_queue(c2_port->netdev); + + spin_unlock_irqrestore(&c2_port->tx_lock, flags); +} + +/* + * Process transmit descriptors marked 'DONE' by the firmware, + * freeing up their unneeded sk_buffs. 
+ */ +static void c2_tx_interrupt(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + struct c2_txp_desc txp_htxd; + + spin_lock(&c2_port->tx_lock); + + for (elem = tx_ring->to_clean; elem != tx_ring->to_use; + elem = elem->next) { + txp_htxd.flags = + be16_to_cpu(readw(elem->hw_desc + C2_TXP_FLAGS)); + + if (txp_htxd.flags != TXP_HTXD_DONE) + break; + + if (netif_msg_tx_done(c2_port)) { + /* PCI reads are expensive in fast path */ + txp_htxd.len = + be16_to_cpu(readw(elem->hw_desc + C2_TXP_LEN)); + pr_debug("%s: tx done slot %3Zu status 0x%x len " + "%5u bytes\n", + netdev->name, elem - tx_ring->start, + txp_htxd.flags, txp_htxd.len); + } + + c2_tx_free(c2dev, elem); + ++(c2_port->tx_avail); + } + + tx_ring->to_clean = elem; + + if (netif_queue_stopped(netdev) + && c2_port->tx_avail > MAX_SKB_FRAGS + 1) + netif_wake_queue(netdev); + + spin_unlock(&c2_port->tx_lock); +} + +static void c2_rx_error(struct c2_port *c2_port, struct c2_element *elem) +{ + struct c2_rx_desc *rx_desc = elem->ht_desc; + struct c2_rxp_hdr *rxp_hdr = (struct c2_rxp_hdr *) elem->skb->data; + + if (rxp_hdr->status != RXP_HRXD_OK || + rxp_hdr->len > (rx_desc->len - sizeof(*rxp_hdr))) { + pr_debug("BAD RXP_HRXD\n"); + pr_debug(" rx_desc : %p\n", rx_desc); + pr_debug(" index : %Zu\n", + elem - c2_port->rx_ring.start); + pr_debug(" len : %u\n", rx_desc->len); + pr_debug(" rxp_hdr : %p [PA %p]\n", rxp_hdr, + (void *) __pa((unsigned long) rxp_hdr)); + pr_debug(" flags : 0x%x\n", rxp_hdr->flags); + pr_debug(" status: 0x%x\n", rxp_hdr->status); + pr_debug(" len : %u\n", rxp_hdr->len); + pr_debug(" rsvd : 0x%x\n", rxp_hdr->rsvd); + } + + /* Setup the skb for reuse since we're dropping this pkt */ + elem->skb->tail = elem->skb->data = elem->skb->head; + + /* Zero out the rxp hdr in the sk_buff */ + memset(elem->skb->data, 0, sizeof(*rxp_hdr)); + + /* Write the 
descriptor to the adapter's rx ring */ + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); + __raw_writew(0, elem->hw_desc + C2_RXP_COUNT); + __raw_writew(cpu_to_be16((u16) elem->maplen - sizeof(*rxp_hdr)), + elem->hw_desc + C2_RXP_LEN); + __raw_writeq(cpu_to_be64(elem->mapaddr), elem->hw_desc + C2_RXP_ADDR); + __raw_writew(cpu_to_be16(RXP_HRXD_READY), elem->hw_desc + C2_RXP_FLAGS); + + pr_debug("packet dropped\n"); + c2_port->netstats.rx_dropped++; +} + +static void c2_rx_interrupt(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + struct c2_rx_desc *rx_desc; + struct c2_rxp_hdr *rxp_hdr; + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen, buflen; + unsigned long flags; + + spin_lock_irqsave(&c2dev->lock, flags); + + /* Begin where we left off */ + rx_ring->to_clean = rx_ring->start + c2dev->cur_rx; + + for (elem = rx_ring->to_clean; elem->next != rx_ring->to_clean; + elem = elem->next) { + rx_desc = elem->ht_desc; + mapaddr = elem->mapaddr; + maplen = elem->maplen; + skb = elem->skb; + rxp_hdr = (struct c2_rxp_hdr *) skb->data; + + if (rxp_hdr->flags != RXP_HRXD_DONE) + break; + buflen = rxp_hdr->len; + + /* Sanity check the RXP header */ + if (rxp_hdr->status != RXP_HRXD_OK || + buflen > (rx_desc->len - sizeof(*rxp_hdr))) { + c2_rx_error(c2_port, elem); + continue; + } + + /* + * Allocate and map a new skb for replenishing the host + * RX desc + */ + if (c2_rx_alloc(c2_port, elem)) { + c2_rx_error(c2_port, elem); + continue; + } + + /* Unmap the old skb */ + pci_unmap_single(c2dev->pcidev, mapaddr, maplen, + PCI_DMA_FROMDEVICE); + + prefetch(skb->data); + + /* + * Skip past the leading 8 bytes comprising of the + * "struct c2_rxp_hdr", prepended by the adapter + * to the usual Ethernet header ("struct ethhdr"), + * to the start of the raw Ethernet packet. 
+ * + * Fix up the various fields in the sk_buff before + * passing it up to netif_rx(). The transfer size + * (in bytes) specified by the adapter len field of + * the "struct rxp_hdr_t" does NOT include the + * "sizeof(struct c2_rxp_hdr)". + */ + skb->data += sizeof(*rxp_hdr); + skb->tail = skb->data + buflen; + skb->len = buflen; + skb->dev = netdev; + skb->protocol = eth_type_trans(skb, netdev); + + netif_rx(skb); + + netdev->last_rx = jiffies; + c2_port->netstats.rx_packets++; + c2_port->netstats.rx_bytes += buflen; + } + + /* Save where we left off */ + rx_ring->to_clean = elem; + c2dev->cur_rx = elem - rx_ring->start; + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); + + spin_unlock_irqrestore(&c2dev->lock, flags); +} + +/* + * Handle netisr0 TX & RX interrupts. + */ +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs *regs) +{ + unsigned int netisr0, dmaisr; + int handled = 0; + struct c2_dev *c2dev = (struct c2_dev *) dev_id; + + /* Process CCILNET interrupts */ + netisr0 = readl(c2dev->regs + C2_NISR0); + if (netisr0) { + + /* + * There is an issue with the firmware that always + * provides the status of RX for both TX & RX + * interrupts. So process both queues here. 
+ */ + c2_rx_interrupt(c2dev->netdev); + c2_tx_interrupt(c2dev->netdev); + + /* Clear the interrupt */ + writel(netisr0, c2dev->regs + C2_NISR0); + handled++; + } + + /* Process RNIC interrupts */ + dmaisr = readl(c2dev->regs + C2_DISR); + if (dmaisr) { + writel(dmaisr, c2dev->regs + C2_DISR); + c2_rnic_interrupt(c2dev); + handled++; + } + + if (handled) { + return IRQ_HANDLED; + } else { + return IRQ_NONE; + } +} + +static int c2_up(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_element *elem; + struct c2_rxp_hdr *rxp_hdr; + struct in_device *in_dev; + size_t rx_size, tx_size; + int ret, i; + unsigned int netimr0; + + if (netif_msg_ifup(c2_port)) + pr_debug("%s: enabling interface\n", netdev->name); + + /* Set the Rx buffer size based on MTU */ + c2_set_rxbufsize(c2_port); + + /* Allocate DMA'able memory for Tx/Rx host descriptor rings */ + rx_size = c2_port->rx_ring.count * sizeof(struct c2_rx_desc); + tx_size = c2_port->tx_ring.count * sizeof(struct c2_tx_desc); + + c2_port->mem_size = tx_size + rx_size; + c2_port->mem = pci_alloc_consistent(c2dev->pcidev, c2_port->mem_size, + &c2_port->dma); + if (c2_port->mem == NULL) { + pr_debug("Unable to allocate memory for " + "host descriptor rings\n"); + return -ENOMEM; + } + + memset(c2_port->mem, 0, c2_port->mem_size); + + /* Create the Rx host descriptor ring */ + if ((ret = + c2_rx_ring_alloc(&c2_port->rx_ring, c2_port->mem, c2_port->dma, + c2dev->mmio_rxp_ring))) { + pr_debug("Unable to create RX ring\n"); + goto bail0; + } + + /* Allocate Rx buffers for the host descriptor ring */ + if (c2_rx_fill(c2_port)) { + pr_debug("Unable to fill RX ring\n"); + goto bail1; + } + + /* Create the Tx host descriptor ring */ + if ((ret = c2_tx_ring_alloc(&c2_port->tx_ring, c2_port->mem + rx_size, + c2_port->dma + rx_size, + c2dev->mmio_txp_ring))) { + pr_debug("Unable to create TX ring\n"); + goto bail1; + } + + /* Set the TX pointer to where we 
left off */ + c2_port->tx_avail = c2_port->tx_ring.count - 1; + c2_port->tx_ring.to_use = c2_port->tx_ring.to_clean = + c2_port->tx_ring.start + c2dev->cur_tx; + + /* missing: Initialize MAC */ + + BUG_ON(c2_port->tx_ring.to_use != c2_port->tx_ring.to_clean); + + /* Reset the adapter, ensures the driver is in sync with the RXP */ + c2_reset(c2_port); + + /* Reset the READY bit in the sk_buff RXP headers & adapter HRXDQ */ + for (i = 0, elem = c2_port->rx_ring.start; i < c2_port->rx_ring.count; + i++, elem++) { + rxp_hdr = (struct c2_rxp_hdr *) elem->skb->data; + rxp_hdr->flags = 0; + __raw_writew(cpu_to_be16(RXP_HRXD_READY), + elem->hw_desc + C2_RXP_FLAGS); + } + + /* Enable network packets */ + netif_start_queue(netdev); + + /* Enable IRQ */ + writel(0, c2dev->regs + C2_IDIS); + netimr0 = readl(c2dev->regs + C2_NIMR0); + netimr0 &= ~(C2_PCI_HTX_INT | C2_PCI_HRX_INT); + writel(netimr0, c2dev->regs + C2_NIMR0); + + /* Tell the stack to ignore arp requests for ipaddrs bound to + * other interfaces. This is needed to prevent the host stack + * from responding to arp requests to the ipaddr bound on the + * rdma interface. 
+ */ + in_dev = in_dev_get(netdev); + in_dev->cnf.arp_ignore = 1; + in_dev_put(in_dev); + + return 0; + + bail1: + c2_rx_clean(c2_port); + kfree(c2_port->rx_ring.start); + + bail0: + pci_free_consistent(c2dev->pcidev, c2_port->mem_size, c2_port->mem, + c2_port->dma); + + return ret; +} + +static int c2_down(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + + if (netif_msg_ifdown(c2_port)) + pr_debug("%s: disabling interface\n", + netdev->name); + + /* Wait for all the queued packets to get sent */ + c2_tx_interrupt(netdev); + + /* Disable network packets */ + netif_stop_queue(netdev); + + /* Disable IRQs by clearing the interrupt mask */ + writel(1, c2dev->regs + C2_IDIS); + writel(0, c2dev->regs + C2_NIMR0); + + /* missing: Stop transmitter */ + + /* missing: Stop receiver */ + + /* Reset the adapter, ensures the driver is in sync with the RXP */ + c2_reset(c2_port); + + /* missing: Turn off LEDs here */ + + /* Free all buffers in the host descriptor rings */ + c2_tx_clean(c2_port); + c2_rx_clean(c2_port); + + /* Free the host descriptor rings */ + kfree(c2_port->rx_ring.start); + kfree(c2_port->tx_ring.start); + pci_free_consistent(c2dev->pcidev, c2_port->mem_size, c2_port->mem, + c2_port->dma); + + return 0; +} + +static void c2_reset(struct c2_port *c2_port) +{ + struct c2_dev *c2dev = c2_port->c2dev; + unsigned int cur_rx = c2dev->cur_rx; + + /* Tell the hardware to quiesce */ + C2_SET_CUR_RX(c2dev, cur_rx | C2_PCI_HRX_QUI); + + /* + * The hardware will reset the C2_PCI_HRX_QUI bit once + * the RXP is quiesced. Wait 2 seconds for this. 
+ */ + ssleep(2); + + cur_rx = C2_GET_CUR_RX(c2dev); + + if (cur_rx & C2_PCI_HRX_QUI) + pr_debug("c2_reset: failed to quiesce the hardware!\n"); + + cur_rx &= ~C2_PCI_HRX_QUI; + + c2dev->cur_rx = cur_rx; + + pr_debug("Current RX: %u\n", c2dev->cur_rx); +} + +static int c2_xmit_frame(struct sk_buff *skb, struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + dma_addr_t mapaddr; + u32 maplen; + unsigned long flags; + unsigned int i; + + spin_lock_irqsave(&c2_port->tx_lock, flags); + + if (unlikely(c2_port->tx_avail < (skb_shinfo(skb)->nr_frags + 1))) { + netif_stop_queue(netdev); + spin_unlock_irqrestore(&c2_port->tx_lock, flags); + + pr_debug("%s: Tx ring full when queue awake!\n", + netdev->name); + return NETDEV_TX_BUSY; + } + + maplen = skb_headlen(skb); + mapaddr = + pci_map_single(c2dev->pcidev, skb->data, maplen, PCI_DMA_TODEVICE); + + elem = tx_ring->to_use; + elem->skb = skb; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + + /* Tell HW to xmit */ + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(maplen), elem->hw_desc + C2_TXP_LEN); + __raw_writew(cpu_to_be16(TXP_HTXD_READY), elem->hw_desc + C2_TXP_FLAGS); + + c2_port->netstats.tx_packets++; + c2_port->netstats.tx_bytes += maplen; + + /* Loop thru additional data fragments and queue them */ + if (skb_shinfo(skb)->nr_frags) { + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + maplen = frag->size; + mapaddr = + pci_map_page(c2dev->pcidev, frag->page, + frag->page_offset, maplen, + PCI_DMA_TODEVICE); + + elem = elem->next; + elem->skb = NULL; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + + /* Tell HW to xmit */ + __raw_writeq(cpu_to_be64(mapaddr), + elem->hw_desc + C2_TXP_ADDR); + __raw_writew(cpu_to_be16(maplen), + elem->hw_desc + C2_TXP_LEN); + 
__raw_writew(cpu_to_be16(TXP_HTXD_READY), + elem->hw_desc + C2_TXP_FLAGS); + + c2_port->netstats.tx_packets++; + c2_port->netstats.tx_bytes += maplen; + } + } + + tx_ring->to_use = elem->next; + c2_port->tx_avail -= (skb_shinfo(skb)->nr_frags + 1); + + if (c2_port->tx_avail <= MAX_SKB_FRAGS + 1) { + netif_stop_queue(netdev); + if (netif_msg_tx_queued(c2_port)) + pr_debug("%s: transmit queue full\n", + netdev->name); + } + + spin_unlock_irqrestore(&c2_port->tx_lock, flags); + + netdev->trans_start = jiffies; + + return NETDEV_TX_OK; +} + +static struct net_device_stats *c2_get_stats(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + + return &c2_port->netstats; +} + +static void c2_tx_timeout(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + + if (netif_msg_timer(c2_port)) + pr_debug("%s: tx timeout\n", netdev->name); + + c2_tx_clean(c2_port); +} + +static int c2_change_mtu(struct net_device *netdev, int new_mtu) +{ + int ret = 0; + + if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU) + return -EINVAL; + + netdev->mtu = new_mtu; + + if (netif_running(netdev)) { + c2_down(netdev); + + c2_up(netdev); + } + + return ret; +} + +/* Initialize network device */ +static struct net_device *c2_devinit(struct c2_dev *c2dev, + void __iomem * mmio_addr) +{ + struct c2_port *c2_port = NULL; + struct net_device *netdev = alloc_etherdev(sizeof(*c2_port)); + + if (!netdev) { + pr_debug("c2_port etherdev alloc failed"); + return NULL; + } + + SET_MODULE_OWNER(netdev); + SET_NETDEV_DEV(netdev, &c2dev->pcidev->dev); + + netdev->open = c2_up; + netdev->stop = c2_down; + netdev->hard_start_xmit = c2_xmit_frame; + netdev->get_stats = c2_get_stats; + netdev->tx_timeout = c2_tx_timeout; + netdev->change_mtu = c2_change_mtu; + netdev->watchdog_timeo = C2_TX_TIMEOUT; + netdev->irq = c2dev->pcidev->irq; + + c2_port = netdev_priv(netdev); + c2_port->netdev = netdev; + c2_port->c2dev = c2dev; + c2_port->msg_enable = 
netif_msg_init(debug, default_msg); + c2_port->tx_ring.count = C2_NUM_TX_DESC; + c2_port->rx_ring.count = C2_NUM_RX_DESC; + + spin_lock_init(&c2_port->tx_lock); + + /* Copy our 48-bit ethernet hardware address */ + memcpy_fromio(netdev->dev_addr, mmio_addr + C2_REGS_ENADDR, 6); + + /* Validate the MAC address */ + if (!is_valid_ether_addr(netdev->dev_addr)) { + pr_debug("Invalid MAC Address\n"); + c2_print_macaddr(netdev); + free_netdev(netdev); + return NULL; + } + + c2dev->netdev = netdev; + + return netdev; +} + +static int __devinit c2_probe(struct pci_dev *pcidev, + const struct pci_device_id *ent) +{ + int ret = 0, i; + unsigned long reg0_start, reg0_flags, reg0_len; + unsigned long reg2_start, reg2_flags, reg2_len; + unsigned long reg4_start, reg4_flags, reg4_len; + unsigned kva_map_size; + struct net_device *netdev = NULL; + struct c2_dev *c2dev = NULL; + void __iomem *mmio_regs = NULL; + + printk(KERN_INFO PFX "AMSO1100 Gigabit Ethernet driver v%s loaded\n", + DRV_VERSION); + + /* Enable PCI device */ + ret = pci_enable_device(pcidev); + if (ret) { + printk(KERN_ERR PFX "%s: Unable to enable PCI device\n", + pci_name(pcidev)); + goto bail0; + } + + reg0_start = pci_resource_start(pcidev, BAR_0); + reg0_len = pci_resource_len(pcidev, BAR_0); + reg0_flags = pci_resource_flags(pcidev, BAR_0); + + reg2_start = pci_resource_start(pcidev, BAR_2); + reg2_len = pci_resource_len(pcidev, BAR_2); + reg2_flags = pci_resource_flags(pcidev, BAR_2); + + reg4_start = pci_resource_start(pcidev, BAR_4); + reg4_len = pci_resource_len(pcidev, BAR_4); + reg4_flags = pci_resource_flags(pcidev, BAR_4); + + pr_debug("BAR0 size = 0x%lX bytes\n", reg0_len); + pr_debug("BAR2 size = 0x%lX bytes\n", reg2_len); + pr_debug("BAR4 size = 0x%lX bytes\n", reg4_len); + + /* Make sure PCI base addr are MMIO */ + if (!(reg0_flags & IORESOURCE_MEM) || + !(reg2_flags & IORESOURCE_MEM) || !(reg4_flags & IORESOURCE_MEM)) { + printk(KERN_ERR PFX "PCI regions not an MMIO resource\n"); + ret = 
-ENODEV; + goto bail1; + } + + /* Check for weird/broken PCI region reporting */ + if ((reg0_len < C2_REG0_SIZE) || + (reg2_len < C2_REG2_SIZE) || (reg4_len < C2_REG4_SIZE)) { + printk(KERN_ERR PFX "Invalid PCI region sizes\n"); + ret = -ENODEV; + goto bail1; + } + + /* Reserve PCI I/O and memory resources */ + ret = pci_request_regions(pcidev, DRV_NAME); + if (ret) { + printk(KERN_ERR PFX "%s: Unable to request regions\n", + pci_name(pcidev)); + goto bail1; + } + + if ((sizeof(dma_addr_t) > 4)) { + ret = pci_set_dma_mask(pcidev, DMA_64BIT_MASK); + if (ret < 0) { + printk(KERN_ERR PFX "64b DMA configuration failed\n"); + goto bail2; + } + } else { + ret = pci_set_dma_mask(pcidev, DMA_32BIT_MASK); + if (ret < 0) { + printk(KERN_ERR PFX "32b DMA configuration failed\n"); + goto bail2; + } + } + + /* Enables bus-mastering on the device */ + pci_set_master(pcidev); + + /* Remap the adapter PCI registers in BAR4 */ + mmio_regs = ioremap_nocache(reg4_start + C2_PCI_REGS_OFFSET, + sizeof(struct c2_adapter_pci_regs)); + if (mmio_regs == 0UL) { + printk(KERN_ERR PFX + "Unable to remap adapter PCI registers in BAR4\n"); + ret = -EIO; + goto bail2; + } + + /* Validate PCI regs magic */ + for (i = 0; i < sizeof(c2_magic); i++) { + if (c2_magic[i] != readb(mmio_regs + C2_REGS_MAGIC + i)) { + printk(KERN_ERR PFX "Downlevel Firmware boot loader " + "[%d/%Zd: got 0x%x, exp 0x%x]. 
Use the cc_flash " + "utility to update your boot loader\n", + i + 1, sizeof(c2_magic), + readb(mmio_regs + C2_REGS_MAGIC + i), + c2_magic[i]); + printk(KERN_ERR PFX "Adapter not claimed\n"); + iounmap(mmio_regs); + ret = -EIO; + goto bail2; + } + } + + /* Validate the adapter version */ + if (be32_to_cpu(readl(mmio_regs + C2_REGS_VERS)) != C2_VERSION) { + printk(KERN_ERR PFX "Version mismatch " + "[fw=%u, c2=%u], Adapter not claimed\n", + be32_to_cpu(readl(mmio_regs + C2_REGS_VERS)), + C2_VERSION); + ret = -EINVAL; + iounmap(mmio_regs); + goto bail2; + } + + /* Validate the adapter IVN */ + if (be32_to_cpu(readl(mmio_regs + C2_REGS_IVN)) != C2_IVN) { + printk(KERN_ERR PFX "Downlevel Firmware level. You should be using " + "the OpenIB device support kit. " + "[fw=0x%x, c2=0x%x], Adapter not claimed\n", + be32_to_cpu(readl(mmio_regs + C2_REGS_IVN)), + C2_IVN); + ret = -EINVAL; + iounmap(mmio_regs); + goto bail2; + } + + /* Allocate hardware structure */ + c2dev = (struct c2_dev *) ib_alloc_device(sizeof(*c2dev)); + if (!c2dev) { + printk(KERN_ERR PFX "%s: Unable to alloc hardware struct\n", + pci_name(pcidev)); + ret = -ENOMEM; + iounmap(mmio_regs); + goto bail2; + } + + memset(c2dev, 0, sizeof(*c2dev)); + spin_lock_init(&c2dev->lock); + c2dev->pcidev = pcidev; + c2dev->cur_tx = 0; + + /* Get the last RX index */ + c2dev->cur_rx = + (be32_to_cpu(readl(mmio_regs + C2_REGS_HRX_CUR)) - + 0xffffc000) / sizeof(struct c2_rxp_desc); + + /* Request an interrupt line for the driver */ + ret = request_irq(pcidev->irq, c2_interrupt, SA_SHIRQ, DRV_NAME, c2dev); + if (ret) { + printk(KERN_ERR PFX "%s: requested IRQ %u is busy\n", + pci_name(pcidev), pcidev->irq); + iounmap(mmio_regs); + goto bail3; + } + + /* Set driver specific data */ + pci_set_drvdata(pcidev, c2dev); + + /* Initialize network device */ + if ((netdev = c2_devinit(c2dev, mmio_regs)) == NULL) { + iounmap(mmio_regs); + goto bail4; + } + + /* Save off the actual size prior to unmapping mmio_regs */ + kva_map_size 
= be32_to_cpu(readl(mmio_regs + C2_REGS_PCI_WINSIZE)); + + /* Unmap the adapter PCI registers in BAR4 */ + iounmap(mmio_regs); + + /* Register network device */ + ret = register_netdev(netdev); + if (ret) { + printk(KERN_ERR PFX "Unable to register netdev, ret = %d\n", + ret); + goto bail5; + } + + /* Disable network packets */ + netif_stop_queue(netdev); + + /* Remap the adapter HRXDQ PA space to kernel VA space */ + c2dev->mmio_rxp_ring = ioremap_nocache(reg4_start + C2_RXP_HRXDQ_OFFSET, + C2_RXP_HRXDQ_SIZE); + if (c2dev->mmio_rxp_ring == 0UL) { + printk(KERN_ERR PFX "Unable to remap MMIO HRXDQ region\n"); + ret = -EIO; + goto bail6; + } + + /* Remap the adapter HTXDQ PA space to kernel VA space */ + c2dev->mmio_txp_ring = ioremap_nocache(reg4_start + C2_TXP_HTXDQ_OFFSET, + C2_TXP_HTXDQ_SIZE); + if (c2dev->mmio_txp_ring == 0UL) { + printk(KERN_ERR PFX "Unable to remap MMIO HTXDQ region\n"); + ret = -EIO; + goto bail7; + } + + /* Save off the current RX index in the last 4 bytes of the TXP Ring */ + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); + + /* Remap the PCI registers in adapter BAR0 to kernel VA space */ + c2dev->regs = ioremap_nocache(reg0_start, reg0_len); + if (c2dev->regs == 0UL) { + printk(KERN_ERR PFX "Unable to remap BAR0\n"); + ret = -EIO; + goto bail8; + } + + /* Remap the PCI registers in adapter BAR4 to kernel VA space */ + c2dev->pa = reg4_start + C2_PCI_REGS_OFFSET; + c2dev->kva = ioremap_nocache(reg4_start + C2_PCI_REGS_OFFSET, + kva_map_size); + if (c2dev->kva == 0UL) { + printk(KERN_ERR PFX "Unable to remap BAR4\n"); + ret = -EIO; + goto bail9; + } + + /* Print out the MAC address */ + c2_print_macaddr(netdev); + + ret = c2_rnic_init(c2dev); + if (ret) { + printk(KERN_ERR PFX "c2_rnic_init failed: %d\n", ret); + goto bail10; + } + + c2_register_device(c2dev); + + return 0; + + bail10: + iounmap(c2dev->kva); + + bail9: + iounmap(c2dev->regs); + + bail8: + iounmap(c2dev->mmio_txp_ring); + + bail7: + iounmap(c2dev->mmio_rxp_ring); + + bail6: + 
unregister_netdev(netdev); + + bail5: + free_netdev(netdev); + + bail4: + free_irq(pcidev->irq, c2dev); + + bail3: + ib_dealloc_device(&c2dev->ibdev); + + bail2: + pci_release_regions(pcidev); + + bail1: + pci_disable_device(pcidev); + + bail0: + return ret; +} + +static void __devexit c2_remove(struct pci_dev *pcidev) +{ + struct c2_dev *c2dev = pci_get_drvdata(pcidev); + struct net_device *netdev = c2dev->netdev; + + /* Unregister with OpenIB */ + c2_unregister_device(c2dev); + + /* Clean up the RNIC resources */ + c2_rnic_term(c2dev); + + /* Remove network device from the kernel */ + unregister_netdev(netdev); + + /* Free network device */ + free_netdev(netdev); + + /* Free the interrupt line */ + free_irq(pcidev->irq, c2dev); + + /* missing: Turn LEDs off here */ + + /* Unmap adapter PA space */ + iounmap(c2dev->kva); + iounmap(c2dev->regs); + iounmap(c2dev->mmio_txp_ring); + iounmap(c2dev->mmio_rxp_ring); + + /* Free the hardware structure */ + ib_dealloc_device(&c2dev->ibdev); + + /* Release reserved PCI I/O and memory resources */ + pci_release_regions(pcidev); + + /* Disable PCI device */ + pci_disable_device(pcidev); + + /* Clear driver specific data */ + pci_set_drvdata(pcidev, NULL); +} + +static struct pci_driver c2_pci_driver = { + .name = DRV_NAME, + .id_table = c2_pci_table, + .probe = c2_probe, + .remove = __devexit_p(c2_remove), +}; + +static int __init c2_init_module(void) +{ + return pci_module_init(&c2_pci_driver); +} + +static void __exit c2_exit_module(void) +{ + pci_unregister_driver(&c2_pci_driver); +} + +module_init(c2_init_module); +module_exit(c2_exit_module); diff --git a/drivers/infiniband/hw/amso1100/c2.h b/drivers/infiniband/hw/amso1100/c2.h new file mode 100644 index 0000000..3b17530 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2.h @@ -0,0 +1,552 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef __C2_H +#define __C2_H + +#include <linux/netdevice.h> +#include <linux/spinlock.h> +#include <linux/kernel.h> +#include <linux/pci.h> +#include <linux/dma-mapping.h> +#include <linux/idr.h> +#include <asm/semaphore.h> + +#include "c2_provider.h" +#include "c2_mq.h" +#include "c2_status.h" + +#define DRV_NAME "c2" +#define DRV_VERSION "1.1" +#define PFX DRV_NAME ": " + +#define BAR_0 0 +#define BAR_2 2 +#define BAR_4 4 + +#define RX_BUF_SIZE (1536 + 8) +#define ETH_JUMBO_MTU 9000 +#define C2_MAGIC "CEPHEUS" +#define C2_VERSION 4 +#define C2_IVN (18 & 0x7fffffff) + +#define C2_REG0_SIZE (16 * 1024) +#define C2_REG2_SIZE (2 * 1024 * 1024) +#define C2_REG4_SIZE (256 * 1024 * 1024) +#define C2_NUM_TX_DESC 341 +#define C2_NUM_RX_DESC 256 +#define C2_PCI_REGS_OFFSET (0x10000) +#define C2_RXP_HRXDQ_OFFSET (((C2_REG4_SIZE)/2)) +#define C2_RXP_HRXDQ_SIZE (4096) +#define C2_TXP_HTXDQ_OFFSET (((C2_REG4_SIZE)/2) + C2_RXP_HRXDQ_SIZE) +#define C2_TXP_HTXDQ_SIZE (4096) +#define C2_TX_TIMEOUT (6*HZ) + +/* CEPHEUS */ +static const u8 c2_magic[] = { + 0x43, 0x45, 0x50, 0x48, 0x45, 0x55, 0x53 +}; + +enum adapter_pci_regs { + C2_REGS_MAGIC = 0x0000, + C2_REGS_VERS = 0x0008, + C2_REGS_IVN = 0x000C, + C2_REGS_PCI_WINSIZE = 0x0010, + C2_REGS_Q0_QSIZE = 0x0014, + C2_REGS_Q0_MSGSIZE = 0x0018, + C2_REGS_Q0_POOLSTART = 0x001C, + C2_REGS_Q0_SHARED = 0x0020, + C2_REGS_Q1_QSIZE = 0x0024, + C2_REGS_Q1_MSGSIZE = 0x0028, + C2_REGS_Q1_SHARED = 0x0030, + C2_REGS_Q2_QSIZE = 0x0034, + C2_REGS_Q2_MSGSIZE = 0x0038, + C2_REGS_Q2_SHARED = 0x0040, + C2_REGS_ENADDR = 0x004C, + C2_REGS_RDMA_ENADDR = 0x0054, + C2_REGS_HRX_CUR = 0x006C, +}; + +struct c2_adapter_pci_regs { + char reg_magic[8]; + u32 version; + u32 ivn; + u32 pci_window_size; + u32 q0_q_size; + u32 q0_msg_size; + u32 q0_pool_start; + u32 q0_shared; + u32 q1_q_size; + u32 q1_msg_size; + u32 q1_pool_start; + u32 q1_shared; + u32 q2_q_size; + u32 q2_msg_size; + u32 q2_pool_start; + u32 q2_shared; + u32 log_start; + u32 log_size; + u8 host_enaddr[8]; + u8 rdma_enaddr[8]; + u32 crash_entry; + u32 crash_ready[2]; + u32 fw_txd_cur; + u32 fw_hrxd_cur; + u32 
fw_rxd_cur; +}; + +enum pci_regs { + C2_HISR = 0x0000, + C2_DISR = 0x0004, + C2_HIMR = 0x0008, + C2_DIMR = 0x000C, + C2_NISR0 = 0x0010, + C2_NISR1 = 0x0014, + C2_NIMR0 = 0x0018, + C2_NIMR1 = 0x001C, + C2_IDIS = 0x0020, +}; + +enum { + C2_PCI_HRX_INT = 1 << 8, + C2_PCI_HTX_INT = 1 << 17, + C2_PCI_HRX_QUI = 1 << 31, +}; + +/* + * Cepheus registers in BAR0. + */ +struct c2_pci_regs { + u32 hostisr; + u32 dmaisr; + u32 hostimr; + u32 dmaimr; + u32 netisr0; + u32 netisr1; + u32 netimr0; + u32 netimr1; + u32 int_disable; +}; + +/* TXP flags */ +enum c2_txp_flags { + TXP_HTXD_DONE = 0, + TXP_HTXD_READY = 1 << 0, + TXP_HTXD_UNINIT = 1 << 1, +}; + +/* RXP flags */ +enum c2_rxp_flags { + RXP_HRXD_UNINIT = 0, + RXP_HRXD_READY = 1 << 0, + RXP_HRXD_DONE = 1 << 1, +}; + +/* RXP status */ +enum c2_rxp_status { + RXP_HRXD_ZERO = 0, + RXP_HRXD_OK = 1 << 0, + RXP_HRXD_BUF_OV = 1 << 1, +}; + +/* TXP descriptor fields */ +enum txp_desc { + C2_TXP_FLAGS = 0x0000, + C2_TXP_LEN = 0x0002, + C2_TXP_ADDR = 0x0004, +}; + +/* RXP descriptor fields */ +enum rxp_desc { + C2_RXP_FLAGS = 0x0000, + C2_RXP_STATUS = 0x0002, + C2_RXP_COUNT = 0x0004, + C2_RXP_LEN = 0x0006, + C2_RXP_ADDR = 0x0008, +}; + +struct c2_txp_desc { + u16 flags; + u16 len; + u64 addr; +} __attribute__ ((packed)); + +struct c2_rxp_desc { + u16 flags; + u16 status; + u16 count; + u16 len; + u64 addr; +} __attribute__ ((packed)); + +struct c2_rxp_hdr { + u16 flags; + u16 status; + u16 len; + u16 rsvd; +} __attribute__ ((packed)); + +struct c2_tx_desc { + u32 len; + u32 status; + dma_addr_t next_offset; +}; + +struct c2_rx_desc { + u32 len; + u32 status; + dma_addr_t next_offset; +}; + +struct c2_alloc { + u32 last; + u32 max; + spinlock_t lock; + unsigned long *table; +}; + +struct c2_array { + struct { + void **page; + int used; + } *page_list; +}; + +/* + * The MQ shared pointer pool is organized as a linked list of + * chunks. 
Each chunk contains a linked list of free shared pointers + * that can be allocated to a given user mode client. + * + */ +struct sp_chunk { + struct sp_chunk *next; + dma_addr_t dma_addr; + DECLARE_PCI_UNMAP_ADDR(mapping); + u16 head; + u16 shared_ptr[0]; +}; + +struct c2_pd_table { + u32 last; + u32 max; + spinlock_t lock; + unsigned long *table; +}; + +struct c2_qp_table { + struct idr idr; + spinlock_t lock; + int last; +}; + +struct c2_element { + struct c2_element *next; + void *ht_desc; /* host descriptor */ + void __iomem *hw_desc; /* hardware descriptor */ + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen; +}; + +struct c2_ring { + struct c2_element *to_clean; + struct c2_element *to_use; + struct c2_element *start; + unsigned long count; +}; + +struct c2_dev { + struct ib_device ibdev; + void __iomem *regs; + void __iomem *mmio_txp_ring; /* remapped adapter memory for hw rings */ + void __iomem *mmio_rxp_ring; + spinlock_t lock; + struct pci_dev *pcidev; + struct net_device *netdev; + struct net_device *pseudo_netdev; + unsigned int cur_tx; + unsigned int cur_rx; + u32 adapter_handle; + int device_cap_flags; + void __iomem *kva; /* KVA device memory */ + unsigned long pa; /* PA device memory */ + void **qptr_array; + + kmem_cache_t *host_msg_cache; + + struct list_head cca_link; /* adapter list */ + struct list_head eh_wakeup_list; /* event wakeup list */ + wait_queue_head_t req_vq_wo; + + /* Cached RNIC properties */ + struct ib_device_attr props; + + struct c2_pd_table pd_table; + struct c2_qp_table qp_table; + int ports; /* num of GigE ports */ + int devnum; + spinlock_t vqlock; /* sync vbs req MQ */ + + /* Verbs Queues */ + struct c2_mq req_vq; /* Verbs Request MQ */ + struct c2_mq rep_vq; /* Verbs Reply MQ */ + struct c2_mq aeq; /* Async Events MQ */ + + /* Kernel client MQs */ + struct sp_chunk *kern_mqsp_pool; + + /* Device updates these values when posting messages to a host + * target queue */ + u16 req_vq_shared; + u16 rep_vq_shared; + 
u16 aeq_shared; + u16 irq_claimed; + + /* + * Shared host target pages for user-accessible MQs. + */ + int hthead; /* index of first free entry */ + void *htpages; /* kernel vaddr */ + int htlen; /* length of htpages memory */ + void *htuva; /* user mapped vaddr */ + spinlock_t htlock; /* serialize allocation */ + + u64 adapter_hint_uva; /* access to the activity FIFO */ + + // spinlock_t aeq_lock; + // spinlock_t rnic_lock; + + u16 *hint_count; + dma_addr_t hint_count_dma; + u16 hints_read; + + int init; /* TRUE if it's ready */ + char ae_cache_name[16]; + char vq_cache_name[16]; +}; + +struct c2_port { + u32 msg_enable; + struct c2_dev *c2dev; + struct net_device *netdev; + + spinlock_t tx_lock; + u32 tx_avail; + struct c2_ring tx_ring; + struct c2_ring rx_ring; + + void *mem; /* PCI memory for host rings */ + dma_addr_t dma; + unsigned long mem_size; + + u32 rx_buf_size; + + struct net_device_stats netstats; +}; + +/* + * Activity FIFO registers in BAR0. + */ +#define PCI_BAR0_HOST_HINT 0x100 +#define PCI_BAR0_ADAPTER_HINT 0x2000 + +/* + * Ammasso PCI vendor id and Cepheus PCI device id. + */ +#define CQ_ARMED 0x01 +#define CQ_WAIT_FOR_DMA 0x80 + +/* + * The format of a hint is as follows: + * Lower 16 bits are the count of hints for the queue. + * Next 15 bits are the qp_index + * Upper most bit depends on who reads it: + * If read by producer, then it means Full (1) or Not-Full (0) + * If read by consumer, then it means Empty (1) or Not-Empty (0) + */ +#define C2_HINT_MAKE(q_index, hint_count) (((q_index) << 16) | hint_count) +#define C2_HINT_GET_INDEX(hint) (((hint) & 0x7FFF0000) >> 16) +#define C2_HINT_GET_COUNT(hint) ((hint) & 0x0000FFFF) + + +/* + * The following defines the offset in SDRAM for the c2_adapter_pci_regs_t + * struct. 
+ */ +#define C2_ADAPTER_PCI_REGS_OFFSET 0x10000 + +#ifndef readq +static inline u64 readq(const void __iomem * addr) +{ + u64 ret = readl(addr + 4); + ret <<= 32; + ret |= readl(addr); + + return ret; +} +#endif + +#ifndef __raw_writeq +static inline void __raw_writeq(u64 val, void __iomem * addr) +{ + __raw_writel((u32) (val), addr); + __raw_writel((u32) (val >> 32), (addr + 4)); +} +#endif + +#define C2_SET_CUR_RX(c2dev, cur_rx) \ + __raw_writel(cpu_to_be32(cur_rx), c2dev->mmio_txp_ring + 4092) + +#define C2_GET_CUR_RX(c2dev) \ + be32_to_cpu(readl(c2dev->mmio_txp_ring + 4092)) + +static inline struct c2_dev *to_c2dev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct c2_dev, ibdev); +} + +static inline int c2_errno(void *reply) +{ + switch (c2_wr_get_result(reply)) { + case C2_OK: + return 0; + case CCERR_NO_BUFS: + case CCERR_INSUFFICIENT_RESOURCES: + case CCERR_ZERO_RDMA_READ_RESOURCES: + return -ENOMEM; + case CCERR_MR_IN_USE: + case CCERR_QP_IN_USE: + return -EBUSY; + case CCERR_ADDR_IN_USE: + return -EADDRINUSE; + case CCERR_ADDR_NOT_AVAIL: + return -EADDRNOTAVAIL; + case CCERR_CONN_RESET: + return -ECONNRESET; + case CCERR_NOT_IMPLEMENTED: + case CCERR_INVALID_WQE: + return -ENOSYS; + case CCERR_QP_NOT_PRIVILEGED: + return -EPERM; + case CCERR_STACK_ERROR: + return -EPROTO; + case CCERR_ACCESS_VIOLATION: + case CCERR_BASE_AND_BOUNDS_VIOLATION: + return -EFAULT; + case CCERR_STAG_STATE_NOT_INVALID: + case CCERR_INVALID_ADDRESS: + case CCERR_INVALID_CQ: + case CCERR_INVALID_EP: + case CCERR_INVALID_MODIFIER: + case CCERR_INVALID_MTU: + case CCERR_INVALID_PD_ID: + case CCERR_INVALID_QP: + case CCERR_INVALID_RNIC: + case CCERR_INVALID_STAG: + return -EINVAL; + default: + return -EAGAIN; + } +} + +/* Device */ +extern int c2_register_device(struct c2_dev *c2dev); +extern void c2_unregister_device(struct c2_dev *c2dev); +extern int c2_rnic_init(struct c2_dev *c2dev); +extern void c2_rnic_term(struct c2_dev *c2dev); +extern void 
c2_rnic_interrupt(struct c2_dev *c2dev); +extern int c2_rnic_query(struct c2_dev *c2dev, struct ib_device_attr *props); +extern int c2_del_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask); +extern int c2_add_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask); + +/* QPs */ +extern int c2_alloc_qp(struct c2_dev *c2dev, struct c2_pd *pd, + struct ib_qp_init_attr *qp_attrs, struct c2_qp *qp); +extern void c2_free_qp(struct c2_dev *c2dev, struct c2_qp *qp); +extern struct ib_qp *c2_get_qp(struct ib_device *device, int qpn); +extern int c2_qp_modify(struct c2_dev *c2dev, struct c2_qp *qp, + struct ib_qp_attr *attr, int attr_mask); +extern int c2_qp_set_read_limits(struct c2_dev *c2dev, struct c2_qp *qp, + int ord, int ird); +extern int c2_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, + struct ib_send_wr **bad_wr); +extern int c2_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr, + struct ib_recv_wr **bad_wr); +extern void __devinit c2_init_qp_table(struct c2_dev *c2dev); +extern void __devexit c2_cleanup_qp_table(struct c2_dev *c2dev); +extern void c2_set_qp_state(struct c2_qp *, int); +extern struct c2_qp *c2_find_qpn(struct c2_dev *c2dev, int qpn); + +/* PDs */ +extern int c2_pd_alloc(struct c2_dev *c2dev, int privileged, struct c2_pd *pd); +extern void c2_pd_free(struct c2_dev *c2dev, struct c2_pd *pd); +extern int __devinit c2_init_pd_table(struct c2_dev *c2dev); +extern void __devexit c2_cleanup_pd_table(struct c2_dev *c2dev); + +/* CQs */ +extern int c2_init_cq(struct c2_dev *c2dev, int entries, + struct c2_ucontext *ctx, struct c2_cq *cq); +extern void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq); +extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index); +extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index); +extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); +extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); + +/* CM */ +extern int c2_llp_connect(struct iw_cm_id 
*cm_id, + struct iw_cm_conn_param *iw_param); +extern int c2_llp_accept(struct iw_cm_id *cm_id, + struct iw_cm_conn_param *iw_param); +extern int c2_llp_reject(struct iw_cm_id *cm_id, const void *pdata, + u8 pdata_len); +extern int c2_llp_service_create(struct iw_cm_id *cm_id, int backlog); +extern int c2_llp_service_destroy(struct iw_cm_id *cm_id); + +/* MM */ +extern int c2_nsmr_register_phys_kern(struct c2_dev *c2dev, u64 *addr_list, + int page_size, int pbl_depth, u32 length, + u32 off, u64 *va, enum c2_acf acf, + struct c2_mr *mr); +extern int c2_stag_dealloc(struct c2_dev *c2dev, u32 stag_index); + +/* AE */ +extern void c2_ae_event(struct c2_dev *c2dev, u32 mq_index); + +/* MQSP Allocator */ +extern int c2_init_mqsp_pool(struct c2_dev *c2dev, gfp_t gfp_mask, + struct sp_chunk **root); +extern void c2_free_mqsp_pool(struct c2_dev *c2dev, struct sp_chunk *root); +extern u16 *c2_alloc_mqsp(struct c2_dev *c2dev, struct sp_chunk *head, + dma_addr_t *dma_addr, gfp_t gfp_mask); +extern void c2_free_mqsp(u16 * mqsp); +#endif diff --git a/drivers/infiniband/hw/amso1100/c2_ae.c b/drivers/infiniband/hw/amso1100/c2_ae.c new file mode 100644 index 0000000..495e614 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_ae.c @@ -0,0 +1,321 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include <rdma/iw_cm.h> +#include "c2_status.h" +#include "c2_ae.h" + +static int c2_convert_cm_status(u32 c2_status) +{ + switch (c2_status) { + case C2_CONN_STATUS_SUCCESS: + return 0; + case C2_CONN_STATUS_REJECTED: + return -ENETRESET; + case C2_CONN_STATUS_REFUSED: + return -ECONNREFUSED; + case C2_CONN_STATUS_TIMEDOUT: + return -ETIMEDOUT; + case C2_CONN_STATUS_NETUNREACH: + return -ENETUNREACH; + case C2_CONN_STATUS_HOSTUNREACH: + return -EHOSTUNREACH; + case C2_CONN_STATUS_INVALID_RNIC: + return -EINVAL; + case C2_CONN_STATUS_INVALID_QP: + return -EINVAL; + case C2_CONN_STATUS_INVALID_QP_STATE: + return -EINVAL; + case C2_CONN_STATUS_ADDR_NOT_AVAIL: + return -EADDRNOTAVAIL; + default: + printk(KERN_ERR PFX + "%s - Unable to convert CM status: %d\n", + __FUNCTION__, c2_status); + return -EIO; + } +} + +#ifdef DEBUG +static const char* to_event_str(int event) +{ + static const char* event_str[] = { + "CCAE_REMOTE_SHUTDOWN", + "CCAE_ACTIVE_CONNECT_RESULTS", + "CCAE_CONNECTION_REQUEST", + "CCAE_LLP_CLOSE_COMPLETE", + "CCAE_TERMINATE_MESSAGE_RECEIVED", + "CCAE_LLP_CONNECTION_RESET", + "CCAE_LLP_CONNECTION_LOST", + "CCAE_LLP_SEGMENT_SIZE_INVALID", + "CCAE_LLP_INVALID_CRC", + "CCAE_LLP_BAD_FPDU", + "CCAE_INVALID_DDP_VERSION", + 
"CCAE_INVALID_RDMA_VERSION", + "CCAE_UNEXPECTED_OPCODE", + "CCAE_INVALID_DDP_QUEUE_NUMBER", + "CCAE_RDMA_READ_NOT_ENABLED", + "CCAE_RDMA_WRITE_NOT_ENABLED", + "CCAE_RDMA_READ_TOO_SMALL", + "CCAE_NO_L_BIT", + "CCAE_TAGGED_INVALID_STAG", + "CCAE_TAGGED_BASE_BOUNDS_VIOLATION", + "CCAE_TAGGED_ACCESS_RIGHTS_VIOLATION", + "CCAE_TAGGED_INVALID_PD", + "CCAE_WRAP_ERROR", + "CCAE_BAD_CLOSE", + "CCAE_BAD_LLP_CLOSE", + "CCAE_INVALID_MSN_RANGE", + "CCAE_INVALID_MSN_GAP", + "CCAE_IRRQ_OVERFLOW", + "CCAE_IRRQ_MSN_GAP", + "CCAE_IRRQ_MSN_RANGE", + "CCAE_IRRQ_INVALID_STAG", + "CCAE_IRRQ_BASE_BOUNDS_VIOLATION", + "CCAE_IRRQ_ACCESS_RIGHTS_VIOLATION", + "CCAE_IRRQ_INVALID_PD", + "CCAE_IRRQ_WRAP_ERROR", + "CCAE_CQ_SQ_COMPLETION_OVERFLOW", + "CCAE_CQ_RQ_COMPLETION_ERROR", + "CCAE_QP_SRQ_WQE_ERROR", + "CCAE_QP_LOCAL_CATASTROPHIC_ERROR", + "CCAE_CQ_OVERFLOW", + "CCAE_CQ_OPERATION_ERROR", + "CCAE_SRQ_LIMIT_REACHED", + "CCAE_QP_RQ_LIMIT_REACHED", + "CCAE_SRQ_CATASTROPHIC_ERROR", + "CCAE_RNIC_CATASTROPHIC_ERROR" + }; + + if (event < CCAE_REMOTE_SHUTDOWN || + event > CCAE_RNIC_CATASTROPHIC_ERROR) + return "<invalid event>"; + + event -= CCAE_REMOTE_SHUTDOWN; + return event_str[event]; +} + +const char *to_qp_state_str(int state) +{ + switch (state) { + case C2_QP_STATE_IDLE: + return "C2_QP_STATE_IDLE"; + case C2_QP_STATE_CONNECTING: + return "C2_QP_STATE_CONNECTING"; + case C2_QP_STATE_RTS: + return "C2_QP_STATE_RTS"; + case C2_QP_STATE_CLOSING: + return "C2_QP_STATE_CLOSING"; + case C2_QP_STATE_TERMINATE: + return "C2_QP_STATE_TERMINATE"; + case C2_QP_STATE_ERROR: + return "C2_QP_STATE_ERROR"; + default: + return "<invalid QP state>"; + } +} +#endif + +void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) +{ + struct c2_mq *mq = c2dev->qptr_array[mq_index]; + union c2wr *wr; + void *resource_user_context; + struct iw_cm_event cm_event; + struct ib_event ib_event; + enum c2_resource_indicator resource_indicator; + enum c2_event_id event_id; + unsigned long flags; + int status; + + /* + * retrieve the message + */ + wr = 
c2_mq_consume(mq); + if (!wr) + return; + + memset(&ib_event, 0, sizeof(ib_event)); + memset(&cm_event, 0, sizeof(cm_event)); + + event_id = c2_wr_get_id(wr); + resource_indicator = be32_to_cpu(wr->ae.ae_generic.resource_type); + resource_user_context = + (void *) (unsigned long) wr->ae.ae_generic.user_context; + + status = cm_event.status = c2_convert_cm_status(c2_wr_get_result(wr)); + + pr_debug("event received c2_dev=%p, event_id=%d, " + "resource_indicator=%d, user_context=%p, status = %d\n", + c2dev, event_id, resource_indicator, resource_user_context, + status); + + switch (resource_indicator) { + case C2_RES_IND_QP:{ + + struct c2_qp *qp = (struct c2_qp *)resource_user_context; + struct iw_cm_id *cm_id = qp->cm_id; + struct c2wr_ae_active_connect_results *res; + + if (!cm_id) { + pr_debug("event received, but cm_id is <nil>, qp=%p!\n", + qp); + goto ignore_it; + } + pr_debug("%s: event = %s, user_context=%llx, " + "resource_type=%x, " + "resource=%x, qp_state=%s\n", + __FUNCTION__, + to_event_str(event_id), + be64_to_cpu(wr->ae.ae_generic.user_context), + be32_to_cpu(wr->ae.ae_generic.resource_type), + be32_to_cpu(wr->ae.ae_generic.resource), + to_qp_state_str(be32_to_cpu(wr->ae.ae_generic.qp_state))); + + c2_set_qp_state(qp, be32_to_cpu(wr->ae.ae_generic.qp_state)); + + switch (event_id) { + case CCAE_ACTIVE_CONNECT_RESULTS: + res = &wr->ae.ae_active_connect_results; + cm_event.event = IW_CM_EVENT_CONNECT_REPLY; + cm_event.local_addr.sin_addr.s_addr = res->laddr; + cm_event.remote_addr.sin_addr.s_addr = res->raddr; + cm_event.local_addr.sin_port = res->lport; + cm_event.remote_addr.sin_port = res->rport; + if (status == 0) { + cm_event.private_data_len = + be32_to_cpu(res->private_data_length); + cm_event.private_data = res->private_data; + } else { + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id) { + qp->cm_id->rem_ref(qp->cm_id); + qp->cm_id = NULL; + } + spin_unlock_irqrestore(&qp->lock, flags); + cm_event.private_data_len = 0; + cm_event.private_data 
= NULL; + } + if (cm_id->event_handler) + cm_id->event_handler(cm_id, &cm_event); + break; + case CCAE_TERMINATE_MESSAGE_RECEIVED: + case CCAE_CQ_SQ_COMPLETION_OVERFLOW: + ib_event.device = &c2dev->ibdev; + ib_event.element.qp = &qp->ibqp; + ib_event.event = IB_EVENT_QP_REQ_ERR; + + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&ib_event, + qp->ibqp. + qp_context); + break; + case CCAE_BAD_CLOSE: + case CCAE_LLP_CLOSE_COMPLETE: + case CCAE_LLP_CONNECTION_RESET: + case CCAE_LLP_CONNECTION_LOST: + BUG_ON(cm_id->event_handler==(void*)0x6b6b6b6b); + + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id) { + qp->cm_id->rem_ref(qp->cm_id); + qp->cm_id = NULL; + } + spin_unlock_irqrestore(&qp->lock, flags); + cm_event.event = IW_CM_EVENT_CLOSE; + cm_event.status = 0; + if (cm_id->event_handler) + cm_id->event_handler(cm_id, &cm_event); + break; + default: + BUG_ON(1); + pr_debug("%s:%d Unexpected event_id=%d on QP=%p, " + "CM_ID=%p\n", + __FUNCTION__, __LINE__, + event_id, qp, cm_id); + break; + } + break; + } + + case C2_RES_IND_EP:{ + + struct c2wr_ae_connection_request *req = + &wr->ae.ae_connection_request; + struct iw_cm_id *cm_id = + (struct iw_cm_id *)resource_user_context; + + pr_debug("C2_RES_IND_EP event_id=%d\n", event_id); + if (event_id != CCAE_CONNECTION_REQUEST) { + pr_debug("%s: Invalid event_id: %d\n", + __FUNCTION__, event_id); + break; + } + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; + cm_event.provider_data = (void*)(unsigned long)req->cr_handle; + cm_event.local_addr.sin_addr.s_addr = req->laddr; + cm_event.remote_addr.sin_addr.s_addr = req->raddr; + cm_event.local_addr.sin_port = req->lport; + cm_event.remote_addr.sin_port = req->rport; + cm_event.private_data_len = + be32_to_cpu(req->private_data_length); + cm_event.private_data = req->private_data; + + if (cm_id->event_handler) + cm_id->event_handler(cm_id, &cm_event); + break; + } + + case C2_RES_IND_CQ:{ + struct c2_cq *cq = + (struct c2_cq *) resource_user_context; + + 
pr_debug("IB_EVENT_CQ_ERR\n"); + ib_event.device = &c2dev->ibdev; + ib_event.element.cq = &cq->ibcq; + ib_event.event = IB_EVENT_CQ_ERR; + + if (cq->ibcq.event_handler) + cq->ibcq.event_handler(&ib_event, + cq->ibcq.cq_context); + break; + } + default: + printk(KERN_ERR PFX "Bad resource indicator = %d\n", + resource_indicator); + break; + } + + ignore_it: + c2_mq_free(mq); +} diff --git a/drivers/infiniband/hw/amso1100/c2_intr.c b/drivers/infiniband/hw/amso1100/c2_intr.c new file mode 100644 index 0000000..454e3e0 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_intr.c @@ -0,0 +1,209 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include <rdma/iw_cm.h> +#include "c2_vq.h" + +static void handle_mq(struct c2_dev *c2dev, u32 index); +static void handle_vq(struct c2_dev *c2dev, u32 mq_index); + +/* + * Handle RNIC interrupts + */ +void c2_rnic_interrupt(struct c2_dev *c2dev) +{ + unsigned int mq_index; + + while (c2dev->hints_read != be16_to_cpu(*c2dev->hint_count)) { + mq_index = readl(c2dev->regs + PCI_BAR0_HOST_HINT); + if (mq_index & 0x80000000) { + break; + } + + c2dev->hints_read++; + handle_mq(c2dev, mq_index); + } + +} + +/* + * Top level MQ handler + */ +static void handle_mq(struct c2_dev *c2dev, u32 mq_index) +{ + if (c2dev->qptr_array[mq_index] == NULL) { + pr_debug("handle_mq: stray activity for mq_index=%d\n", + mq_index); + return; + } + + switch (mq_index) { + case (0): + /* + * An index of 0 in the activity queue + * indicates the req vq now has messages + * available... + * + * Wake up any waiters waiting on req VQ + * message availability. + */ + wake_up(&c2dev->req_vq_wo); + break; + case (1): + handle_vq(c2dev, mq_index); + break; + case (2): + /* We have to purge the VQ in case there are pending + * accept reply requests that would result in the + * generation of an ESTABLISHED event. If we don't + * generate these first, a CLOSE event could end up + * being delivered before the ESTABLISHED event. + */ + handle_vq(c2dev, 1); + + c2_ae_event(c2dev, mq_index); + break; + default: + /* There is no event synchronization between CQ events + * and AE or CM events. In fact, CQE could be + * delivered for all of the I/O up to and including the + * FLUSH for a peer disconnect prior to the ESTABLISHED + * event being delivered to the app. 
The reason for this + * is that CM events are delivered on a thread, while AE + * and CQ events are delivered in interrupt context. + */ + c2_cq_event(c2dev, mq_index); + break; + } + + return; +} + +/* + * Handles verbs WR replies. + */ +static void handle_vq(struct c2_dev *c2dev, u32 mq_index) +{ + void *adapter_msg, *reply_msg; + struct c2wr_hdr *host_msg; + struct c2wr_hdr tmp; + struct c2_mq *reply_vq; + struct c2_vq_req *req; + struct iw_cm_event cm_event; + int err; + + reply_vq = (struct c2_mq *) c2dev->qptr_array[mq_index]; + + /* + * get next msg from mq_index into adapter_msg. + * don't free it yet. + */ + adapter_msg = c2_mq_consume(reply_vq); + if (adapter_msg == NULL) { + return; + } + + host_msg = vq_repbuf_alloc(c2dev); + + /* + * If we can't get a host buffer, then we'll still + * wakeup the waiter, we just won't give him the msg. + * It is assumed the waiter will deal with this... + */ + if (!host_msg) { + pr_debug("handle_vq: no repbufs!\n"); + + /* + * just copy the WR header into a local variable. + * this allows us to still demux on the context + */ + host_msg = &tmp; + memcpy(host_msg, adapter_msg, sizeof(tmp)); + reply_msg = NULL; + } else { + memcpy(host_msg, adapter_msg, reply_vq->msg_size); + reply_msg = host_msg; + } + + /* + * consume the msg from the MQ + */ + c2_mq_free(reply_vq); + + /* + * wakeup the waiter. + */ + req = (struct c2_vq_req *) (unsigned long) host_msg->context; + if (req == NULL) { + /* + * We should never get here, as the adapter should + * never send us a reply that we're not expecting. 
+ */ + vq_repbuf_free(c2dev, host_msg); + pr_debug("handle_vq: UNEXPECTEDLY got NULL req\n"); + return; + } + + err = c2_errno(reply_msg); + if (!err) switch (req->event) { + case IW_CM_EVENT_ESTABLISHED: + c2_set_qp_state(req->qp, + C2_QP_STATE_RTS); + case IW_CM_EVENT_CLOSE: + + /* + * Move the QP to RTS if this is + * the established event + */ + cm_event.event = req->event; + cm_event.status = 0; + cm_event.local_addr = req->cm_id->local_addr; + cm_event.remote_addr = req->cm_id->remote_addr; + cm_event.private_data = NULL; + cm_event.private_data_len = 0; + req->cm_id->event_handler(req->cm_id, &cm_event); + break; + default: + break; + } + + req->reply_msg = (u64) (unsigned long) (reply_msg); + atomic_set(&req->reply_ready, 1); + wake_up(&req->wait_object); + + /* + * If the request was cancelled, then this put will + * free the vq_req memory...and reply_msg!!! + */ + vq_req_put(c2dev, req); +} diff --git a/drivers/infiniband/hw/amso1100/c2_rnic.c b/drivers/infiniband/hw/amso1100/c2_rnic.c new file mode 100644 index 0000000..4d9cc57 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_rnic.c @@ -0,0 +1,664 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include +#include +#include +#include "c2.h" +#include "c2_vq.h" + +/* Device capabilities */ +#define C2_MIN_PAGESIZE 1024 + +#define C2_MAX_MRS 32768 +#define C2_MAX_QPS 16000 +#define C2_MAX_WQE_SZ 256 +#define C2_MAX_QP_WR ((128*1024)/C2_MAX_WQE_SZ) +#define C2_MAX_SGES 4 +#define C2_MAX_SGE_RD 1 +#define C2_MAX_CQS 32768 +#define C2_MAX_CQES 4096 +#define C2_MAX_PDS 16384 + +/* + * Send the adapter INIT message to the amso1100 + */ +static int c2_adapter_init(struct c2_dev *c2dev) +{ + struct c2wr_init_req wr; + int err; + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_INIT); + wr.hdr.context = 0; + wr.hint_count = cpu_to_be64(c2dev->hint_count_dma); + wr.q0_host_shared = cpu_to_be64(c2dev->req_vq.shared_dma); + wr.q1_host_shared = cpu_to_be64(c2dev->rep_vq.shared_dma); + wr.q1_host_msg_pool = cpu_to_be64(c2dev->rep_vq.host_dma); + wr.q2_host_shared = cpu_to_be64(c2dev->aeq.shared_dma); + wr.q2_host_msg_pool = cpu_to_be64(c2dev->aeq.host_dma); + + /* Post the init message */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + + return err; 
+} + +/* + * Send the adapter TERM message to the amso1100 + */ +static void c2_adapter_term(struct c2_dev *c2dev) +{ + struct c2wr_init_req wr; + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_TERM); + wr.hdr.context = 0; + + /* Post the TERM message */ + vq_send_wr(c2dev, (union c2wr *) & wr); + c2dev->init = 0; + + return; +} + +/* + * Query the adapter + */ +int c2_rnic_query(struct c2_dev *c2dev, + struct ib_device_attr *props) +{ + struct c2_vq_req *vq_req; + struct c2wr_rnic_query_req wr; + struct c2wr_rnic_query_rep *reply; + int err; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + c2_wr_set_id(&wr, CCWR_RNIC_QUERY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) &wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = + (struct c2wr_rnic_query_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + err = c2_errno(reply); + if (err) + goto bail2; + + props->fw_ver = + ((u64)be32_to_cpu(reply->fw_ver_major) << 32) | + ((be32_to_cpu(reply->fw_ver_minor) & 0xFFFF) << 16) | + (be32_to_cpu(reply->fw_ver_patch) & 0xFFFF); + memcpy(&props->sys_image_guid, c2dev->netdev->dev_addr, 6); + props->max_mr_size = 0xFFFFFFFF; + props->page_size_cap = ~(C2_MIN_PAGESIZE-1); + props->vendor_id = be32_to_cpu(reply->vendor_id); + props->vendor_part_id = be32_to_cpu(reply->part_number); + props->hw_ver = be32_to_cpu(reply->hw_version); + props->max_qp = be32_to_cpu(reply->max_qps); + props->max_qp_wr = be32_to_cpu(reply->max_qp_depth); + props->device_cap_flags = c2dev->device_cap_flags; + props->max_sge = C2_MAX_SGES; + props->max_sge_rd = C2_MAX_SGE_RD; + props->max_cq = be32_to_cpu(reply->max_cqs); + props->max_cqe = be32_to_cpu(reply->max_cq_depth); + props->max_mr = be32_to_cpu(reply->max_mrs); + props->max_pd =
be32_to_cpu(reply->max_pds); + props->max_qp_rd_atom = be32_to_cpu(reply->max_qp_ird); + props->max_ee_rd_atom = 0; + props->max_res_rd_atom = be32_to_cpu(reply->max_global_ird); + props->max_qp_init_rd_atom = be32_to_cpu(reply->max_qp_ord); + props->max_ee_init_rd_atom = 0; + props->atomic_cap = IB_ATOMIC_NONE; + props->max_ee = 0; + props->max_rdd = 0; + props->max_mw = be32_to_cpu(reply->max_mws); + props->max_raw_ipv6_qp = 0; + props->max_raw_ethy_qp = 0; + props->max_mcast_grp = 0; + props->max_mcast_qp_attach = 0; + props->max_total_mcast_qp_attach = 0; + props->max_ah = 0; + props->max_fmr = 0; + props->max_map_per_fmr = 0; + props->max_srq = 0; + props->max_srq_wr = 0; + props->max_srq_sge = 0; + props->max_pkeys = 0; + props->local_ca_ack_delay = 0; + + bail2: + vq_repbuf_free(c2dev, reply); + + bail1: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Add an IP address to the RNIC interface + */ +int c2_add_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask) +{ + struct c2_vq_req *vq_req; + struct c2wr_rnic_setconfig_req *wr; + struct c2wr_rnic_setconfig_rep *reply; + struct c2_netaddr netaddr; + int err, len; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + len = sizeof(struct c2_netaddr); + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + c2_wr_set_id(wr, CCWR_RNIC_SETCONFIG); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + wr->option = cpu_to_be32(C2_CFG_ADD_ADDR); + + netaddr.ip_addr = inaddr; + netaddr.netmask = inmask; + netaddr.mtu = 0; + + memcpy(wr->data, &netaddr, len); + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = + (struct c2wr_rnic_setconfig_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + err = 
c2_errno(reply); + vq_repbuf_free(c2dev, reply); + + bail1: + kfree(wr); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Delete an IP address from the RNIC interface + */ +int c2_del_addr(struct c2_dev *c2dev, u32 inaddr, u32 inmask) +{ + struct c2_vq_req *vq_req; + struct c2wr_rnic_setconfig_req *wr; + struct c2wr_rnic_setconfig_rep *reply; + struct c2_netaddr netaddr; + int err, len; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + len = sizeof(struct c2_netaddr); + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + c2_wr_set_id(wr, CCWR_RNIC_SETCONFIG); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + wr->option = cpu_to_be32(C2_CFG_DEL_ADDR); + + netaddr.ip_addr = inaddr; + netaddr.netmask = inmask; + netaddr.mtu = 0; + + memcpy(wr->data, &netaddr, len); + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = + (struct c2wr_rnic_setconfig_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + err = c2_errno(reply); + vq_repbuf_free(c2dev, reply); + + bail1: + kfree(wr); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Open a single RNIC instance to use with all + * low level openib calls + */ +static int c2_rnic_open(struct c2_dev *c2dev) +{ + struct c2_vq_req *vq_req; + union c2wr wr; + struct c2wr_rnic_open_rep *reply; + int err; + + vq_req = vq_req_alloc(c2dev); + if (vq_req == NULL) { + return -ENOMEM; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_RNIC_OPEN); + wr.rnic_open.req.hdr.context = (unsigned long) (vq_req); + wr.rnic_open.req.flags = cpu_to_be16(RNIC_PRIV_MODE); + wr.rnic_open.req.port_num = cpu_to_be16(0); + wr.rnic_open.req.user_context = (unsigned long) c2dev; + + 
vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, &wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + reply = (struct c2wr_rnic_open_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + if ((err = c2_errno(reply)) != 0) { + goto bail1; + } + + c2dev->adapter_handle = reply->rnic_handle; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Close the RNIC instance + */ +static int c2_rnic_close(struct c2_dev *c2dev) +{ + struct c2_vq_req *vq_req; + union c2wr wr; + struct c2wr_rnic_close_rep *reply; + int err; + + vq_req = vq_req_alloc(c2dev); + if (vq_req == NULL) { + return -ENOMEM; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_RNIC_CLOSE); + wr.rnic_close.req.hdr.context = (unsigned long) vq_req; + wr.rnic_close.req.rnic_handle = c2dev->adapter_handle; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, &wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + reply = (struct c2wr_rnic_close_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + if ((err = c2_errno(reply)) != 0) { + goto bail1; + } + + c2dev->adapter_handle = 0; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +/* + * Called by c2_probe to initialize the RNIC. This principally + * involves initializing the various limits and resource pools that + * comprise the RNIC instance.
+ */ +int c2_rnic_init(struct c2_dev *c2dev) +{ + int err; + u32 qsize, msgsize; + void *q1_pages; + void *q2_pages; + void __iomem *mmio_regs; + + /* Device capabilities */ + c2dev->device_cap_flags = + (IB_DEVICE_RESIZE_MAX_WR | + IB_DEVICE_CURR_QP_STATE_MOD | + IB_DEVICE_SYS_IMAGE_GUID | + IB_DEVICE_ZERO_STAG | + IB_DEVICE_SEND_W_INV | IB_DEVICE_MEM_WINDOW); + + /* Allocate the qptr_array */ + c2dev->qptr_array = vmalloc(C2_MAX_CQS * sizeof(void *)); + if (!c2dev->qptr_array) { + return -ENOMEM; + } + + /* Initialize the qptr_array */ + memset(c2dev->qptr_array, 0, C2_MAX_CQS * sizeof(void *)); + c2dev->qptr_array[0] = (void *) &c2dev->req_vq; + c2dev->qptr_array[1] = (void *) &c2dev->rep_vq; + c2dev->qptr_array[2] = (void *) &c2dev->aeq; + + /* Initialize data structures */ + init_waitqueue_head(&c2dev->req_vq_wo); + spin_lock_init(&c2dev->vqlock); + spin_lock_init(&c2dev->lock); + + /* Allocate MQ shared pointer pool for kernel clients. User + * mode client pools are hung off the user context + */ + err = c2_init_mqsp_pool(c2dev, GFP_KERNEL, &c2dev->kern_mqsp_pool); + if (err) { + goto bail0; + } + + /* Allocate shared pointers for Q0, Q1, and Q2 from + * the shared pointer pool.
+ */ + + c2dev->hint_count = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &c2dev->hint_count_dma, + GFP_KERNEL); + c2dev->req_vq.shared = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &c2dev->req_vq.shared_dma, + GFP_KERNEL); + c2dev->rep_vq.shared = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &c2dev->rep_vq.shared_dma, + GFP_KERNEL); + c2dev->aeq.shared = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &c2dev->aeq.shared_dma, GFP_KERNEL); + if (!c2dev->hint_count || !c2dev->req_vq.shared || + !c2dev->rep_vq.shared || !c2dev->aeq.shared) { + err = -ENOMEM; + goto bail1; + } + + mmio_regs = c2dev->kva; + /* Initialize the Verbs Request Queue */ + c2_mq_req_init(&c2dev->req_vq, 0, + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_QSIZE)), + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_MSGSIZE)), + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_POOLSTART)), + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q0_SHARED)), + C2_MQ_ADAPTER_TARGET); + + /* Initialize the Verbs Reply Queue */ + qsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q1_QSIZE)); + msgsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q1_MSGSIZE)); + q1_pages = kmalloc(qsize * msgsize, GFP_KERNEL); + if (!q1_pages) { + err = -ENOMEM; + goto bail1; + } + c2dev->rep_vq.host_dma = dma_map_single(c2dev->ibdev.dma_device, + (void *)q1_pages, qsize * msgsize, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&c2dev->rep_vq, mapping, c2dev->rep_vq.host_dma); + pr_debug("%s rep_vq va %p dma %llx\n", __FUNCTION__, q1_pages, + (u64)c2dev->rep_vq.host_dma); + c2_mq_rep_init(&c2dev->rep_vq, + 1, + qsize, + msgsize, + q1_pages, + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q1_SHARED)), + C2_MQ_HOST_TARGET); + + /* Initialize the Asynchronous Event Queue */ + qsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q2_QSIZE)); + msgsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q2_MSGSIZE)); + q2_pages = kmalloc(qsize * msgsize, GFP_KERNEL); + if (!q2_pages) { + err = -ENOMEM; + goto bail2; + } + c2dev->aeq.host_dma =
dma_map_single(c2dev->ibdev.dma_device, + (void *)q2_pages, qsize * msgsize, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&c2dev->aeq, mapping, c2dev->aeq.host_dma); + pr_debug("%s aeq va %p dma %llx\n", __FUNCTION__, q2_pages, + (u64)c2dev->aeq.host_dma); + c2_mq_rep_init(&c2dev->aeq, + 2, + qsize, + msgsize, + q2_pages, + mmio_regs + + be32_to_cpu(readl(mmio_regs + C2_REGS_Q2_SHARED)), + C2_MQ_HOST_TARGET); + + /* Initialize the verbs request allocator */ + err = vq_init(c2dev); + if (err) + goto bail3; + + /* Enable interrupts on the adapter */ + writel(0, c2dev->regs + C2_IDIS); + + /* create the WR init message */ + err = c2_adapter_init(c2dev); + if (err) + goto bail4; + c2dev->init++; + + /* open an adapter instance */ + err = c2_rnic_open(c2dev); + if (err) + goto bail4; + + /* Initialize the cached adapter limits */ + err = c2_rnic_query(c2dev, &c2dev->props); + if (err) + goto bail5; + + /* Initialize the PD pool */ + err = c2_init_pd_table(c2dev); + if (err) + goto bail5; + + /* Initialize the QP pool */ + c2_init_qp_table(c2dev); + return 0; + + bail5: + c2_rnic_close(c2dev); + bail4: + vq_term(c2dev); + bail3: + dma_unmap_single(c2dev->ibdev.dma_device, + pci_unmap_addr(&c2dev->aeq, mapping), + c2dev->aeq.q_size * c2dev->aeq.msg_size, + DMA_FROM_DEVICE); + kfree(q2_pages); + bail2: + dma_unmap_single(c2dev->ibdev.dma_device, + pci_unmap_addr(&c2dev->rep_vq, mapping), + c2dev->rep_vq.q_size * c2dev->rep_vq.msg_size, + DMA_FROM_DEVICE); + kfree(q1_pages); + bail1: + c2_free_mqsp_pool(c2dev, c2dev->kern_mqsp_pool); + bail0: + vfree(c2dev->qptr_array); + + return err; +} + +/* + * Called by c2_remove to clean up the RNIC resources.
+ */ +void c2_rnic_term(struct c2_dev *c2dev) +{ + + /* Close the open adapter instance */ + c2_rnic_close(c2dev); + + /* Send the TERM message to the adapter */ + c2_adapter_term(c2dev); + + /* Disable interrupts on the adapter */ + writel(1, c2dev->regs + C2_IDIS); + + /* Free the QP pool */ + c2_cleanup_qp_table(c2dev); + + /* Free the PD pool */ + c2_cleanup_pd_table(c2dev); + + /* Free the verbs request allocator */ + vq_term(c2dev); + + /* Unmap and free the asynchronous event queue */ + dma_unmap_single(c2dev->ibdev.dma_device, + pci_unmap_addr(&c2dev->aeq, mapping), + c2dev->aeq.q_size * c2dev->aeq.msg_size, + DMA_FROM_DEVICE); + kfree(c2dev->aeq.msg_pool.host); + + /* Unmap and free the verbs reply queue */ + dma_unmap_single(c2dev->ibdev.dma_device, + pci_unmap_addr(&c2dev->rep_vq, mapping), + c2dev->rep_vq.q_size * c2dev->rep_vq.msg_size, + DMA_FROM_DEVICE); + kfree(c2dev->rep_vq.msg_pool.host); + + /* Free the MQ shared pointer pool */ + c2_free_mqsp_pool(c2dev, c2dev->kern_mqsp_pool); + + /* Free the qptr_array */ + vfree(c2dev->qptr_array); + + return; +} From swise at opengridcomputing.com Tue Jun 20 13:31:11 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:31:11 -0500 Subject: [openib-general] [PATCH v3 4/7] AMSO1100 Memory Management. In-Reply-To: <20060620203050.31536.5341.stgit@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> Message-ID: <20060620203111.31536.80453.stgit@stevo-desktop> V2 Review Changes: - removed c2_array services and replaced them with the idr. - removed c2_alloc services and made them pd-specific. - don't use GFP_DMA. - correctly map host memory for DMA (don't use __pa()).
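The c2_alloc.c code in this patch hands out 16-bit shared-pointer slots from page-sized chunks by threading a free list through the slots themselves: each free slot stores the index of the next free slot, and 0xFFFF terminates the chain. The following userspace sketch models just that free-list technique; the names (`sp_pool`, `pool_alloc`, etc.) and the pool size are hypothetical, not the driver's, and the DMA mapping and chunk chaining are omitted:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the intrusive u16 free list used by the MQ
 * shared-pointer pool: free slots chain to the next free index. */
#define POOL_SLOTS 8
#define POOL_END 0xFFFF          /* list terminator, as in c2_alloc.c */

struct sp_pool {
	uint16_t head;               /* index of first free slot */
	uint16_t slots[POOL_SLOTS];  /* each free slot holds the next index */
};

static void pool_init(struct sp_pool *p)
{
	int i;

	for (i = 0; i < POOL_SLOTS - 1; i++)
		p->slots[i] = (uint16_t)(i + 1);  /* slot i -> slot i+1 */
	p->slots[POOL_SLOTS - 1] = POOL_END;      /* terminate the list */
	p->head = 0;
}

/* Pop the head of the free list; returns POOL_END when exhausted. */
static uint16_t pool_alloc(struct sp_pool *p)
{
	uint16_t idx = p->head;

	if (idx != POOL_END)
		p->head = p->slots[idx];
	return idx;
}

/* Push a slot back: chain it to the old head, then point head at it. */
static void pool_free(struct sp_pool *p, uint16_t idx)
{
	p->slots[idx] = p->head;
	p->head = idx;
}
```

Allocation and free are both O(1) with no external bookkeeping, which is why the driver can keep the whole allocator inside the DMA-mapped page itself.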
V1 Review Changes: - sizeof -> sizeof() - cleaned up comments --- drivers/infiniband/hw/amso1100/c2_alloc.c | 144 +++++++++++ drivers/infiniband/hw/amso1100/c2_mm.c | 375 +++++++++++++++++++++++++++++ 2 files changed, 519 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_alloc.c b/drivers/infiniband/hw/amso1100/c2_alloc.c new file mode 100644 index 0000000..013b152 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_alloc.c @@ -0,0 +1,144 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include + +#include "c2.h" + +static int c2_alloc_mqsp_chunk(struct c2_dev *c2dev, gfp_t gfp_mask, + struct sp_chunk **head) +{ + int i; + struct sp_chunk *new_head; + + new_head = (struct sp_chunk *) __get_free_page(gfp_mask); + if (new_head == NULL) + return -ENOMEM; + + new_head->dma_addr = dma_map_single(c2dev->ibdev.dma_device, new_head, + PAGE_SIZE, DMA_FROM_DEVICE); + pci_unmap_addr_set(new_head, mapping, new_head->dma_addr); + + new_head->next = NULL; + new_head->head = 0; + + /* build list where each index is the next free slot */ + for (i = 0; + i < (PAGE_SIZE - sizeof(struct sp_chunk) - + sizeof(u16)) / sizeof(u16) - 1; + i++) { + new_head->shared_ptr[i] = i + 1; + } + /* terminate list */ + new_head->shared_ptr[i] = 0xFFFF; + + *head = new_head; + return 0; +} + +int c2_init_mqsp_pool(struct c2_dev *c2dev, gfp_t gfp_mask, + struct sp_chunk **root) +{ + return c2_alloc_mqsp_chunk(c2dev, gfp_mask, root); +} + +void c2_free_mqsp_pool(struct c2_dev *c2dev, struct sp_chunk *root) +{ + struct sp_chunk *next; + + while (root) { + next = root->next; + dma_unmap_single(c2dev->ibdev.dma_device, + pci_unmap_addr(root, mapping), PAGE_SIZE, + DMA_FROM_DEVICE); + free_page((unsigned long) root); + root = next; + } +} + +u16 *c2_alloc_mqsp(struct c2_dev *c2dev, struct sp_chunk *head, + dma_addr_t *dma_addr, gfp_t gfp_mask) +{ + u16 mqsp; + + while (head) { + mqsp = head->head; + if (mqsp != 0xFFFF) { + head->head = head->shared_ptr[mqsp]; + break; + } else if (head->next == NULL) { + if (c2_alloc_mqsp_chunk(c2dev, gfp_mask, &head->next) == + 0) { + head = head->next; + mqsp = head->head; + head->head = head->shared_ptr[mqsp]; + break; + } else + return NULL; + } else + head = head->next; + } + if (head) { + *dma_addr = head->dma_addr + + ((unsigned long) &(head->shared_ptr[mqsp]) - + (unsigned long) head); + pr_debug("%s addr %p dma_addr %llx\n", __FUNCTION__, + &(head->shared_ptr[mqsp]), (u64)*dma_addr); + return
&(head->shared_ptr[mqsp]); + } + return NULL; +} + +void c2_free_mqsp(u16 * mqsp) +{ + struct sp_chunk *head; + u16 idx; + + /* The chunk containing this ptr begins at the page boundary */ + head = (struct sp_chunk *) ((unsigned long) mqsp & PAGE_MASK); + + /* Link head to new mqsp */ + *mqsp = head->head; + + /* Compute the shared_ptr index */ + idx = ((unsigned long) mqsp & ~PAGE_MASK) >> 1; + idx -= (unsigned long) &(((struct sp_chunk *) 0)->shared_ptr[0]) >> 1; + + /* Point this index at the head */ + head->shared_ptr[idx] = head->head; + + /* Point head at this index */ + head->head = idx; +} diff --git a/drivers/infiniband/hw/amso1100/c2_mm.c b/drivers/infiniband/hw/amso1100/c2_mm.c new file mode 100644 index 0000000..314ec07 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_mm.c @@ -0,0 +1,375 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include "c2_vq.h" + +#define PBL_VIRT 1 +#define PBL_PHYS 2 + +/* + * Send all the PBL messages to convey the remainder of the PBL + * Wait for the adapter's reply on the last one. + * This is indicated by setting the MEM_PBL_COMPLETE in the flags. + * + * NOTE: vq_req is _not_ freed by this function. The VQ Host + * Reply buffer _is_ freed by this function. + */ +static int +send_pbl_messages(struct c2_dev *c2dev, u32 stag_index, + unsigned long va, u32 pbl_depth, + struct c2_vq_req *vq_req, int pbl_type) +{ + u32 pbe_count; /* amt that fits in a PBL msg */ + u32 count; /* amt in this PBL MSG. */ + struct c2wr_nsmr_pbl_req *wr; /* PBL WR ptr */ + struct c2wr_nsmr_pbl_rep *reply; /* reply ptr */ + int err, pbl_virt, pbl_index, i; + + switch (pbl_type) { + case PBL_VIRT: + pbl_virt = 1; + break; + case PBL_PHYS: + pbl_virt = 0; + break; + default: + return -EINVAL; + break; + } + + pbe_count = (c2dev->req_vq.msg_size - + sizeof(struct c2wr_nsmr_pbl_req)) / sizeof(u64); + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + return -ENOMEM; + } + c2_wr_set_id(wr, CCWR_NSMR_PBL); + + /* + * Only the last PBL message will generate a reply from the verbs, + * so we set the context to 0 indicating there is no kernel verbs + * handler blocked awaiting this reply. 
+ */ + wr->hdr.context = 0; + wr->rnic_handle = c2dev->adapter_handle; + wr->stag_index = stag_index; /* already swapped */ + wr->flags = 0; + pbl_index = 0; + while (pbl_depth) { + count = min(pbe_count, pbl_depth); + wr->addrs_length = cpu_to_be32(count); + + /* + * If this is the last message, then reference the + * vq request struct because we're going to wait for a reply. + * Also mark this PBL msg as the last one. + */ + if (count == pbl_depth) { + /* + * reference the request struct. dereferenced in the + * int handler. + */ + vq_req_get(c2dev, vq_req); + wr->flags = cpu_to_be32(MEM_PBL_COMPLETE); + + /* + * This is the last PBL message. + * Set the context to our VQ Request Object so we can + * wait for the reply. + */ + wr->hdr.context = (unsigned long) vq_req; + } + + /* + * If pbl_virt is set then va is a virtual address + * that describes a virtually contiguous memory + * allocation. The wr needs the start of each virtual page + * to be converted to the corresponding physical address + * of the page. If pbl_virt is not set then va is an array + * of physical addresses and there is no conversion to do. + * Just fill in the wr with what is in the array. + */ + for (i = 0; i < count; i++) { + if (pbl_virt) { + va += PAGE_SIZE; + } else { + wr->paddrs[i] = + cpu_to_be64(((u64 *)va)[pbl_index + i]); + } + } + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + /* + * Only drop the reference if we took one above, + * i.e. on the last message. + */ + if (count == pbl_depth) { + vq_req_put(c2dev, vq_req); + } + goto bail0; + } + pbl_depth -= count; + pbl_index += count; + } + + /* + * Now wait for the reply...
+ */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (struct c2wr_nsmr_pbl_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + + vq_repbuf_free(c2dev, reply); + bail0: + kfree(wr); + return err; +} + +#define C2_PBL_MAX_DEPTH 131072 +int +c2_nsmr_register_phys_kern(struct c2_dev *c2dev, u64 *addr_list, + int page_size, int pbl_depth, u32 length, + u32 offset, u64 *va, enum c2_acf acf, + struct c2_mr *mr) +{ + struct c2_vq_req *vq_req; + struct c2wr_nsmr_register_req *wr; + struct c2wr_nsmr_register_rep *reply; + u16 flags; + int i, pbe_count, count; + int err; + + if (!va || !length || !addr_list || !pbl_depth) + return -EINTR; + + /* + * Verify PBL depth is within rnic max + */ + if (pbl_depth > C2_PBL_MAX_DEPTH) { + return -EINTR; + } + + /* + * allocate verbs request object + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + /* + * build the WR + */ + c2_wr_set_id(wr, CCWR_NSMR_REGISTER); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + + flags = (acf | MEM_VA_BASED | MEM_REMOTE); + + /* + * compute how many pbes can fit in the message + */ + pbe_count = (c2dev->req_vq.msg_size - + sizeof(struct c2wr_nsmr_register_req)) / sizeof(u64); + + if (pbl_depth <= pbe_count) { + flags |= MEM_PBL_COMPLETE; + } + wr->flags = cpu_to_be16(flags); + wr->stag_key = 0; //stag_key; + wr->va = cpu_to_be64(*va); + wr->pd_id = mr->pd->pd_id; + wr->pbe_size = cpu_to_be32(page_size); + wr->length = cpu_to_be32(length); + wr->pbl_depth = cpu_to_be32(pbl_depth); + wr->fbo = cpu_to_be32(offset); + count = min(pbl_depth, pbe_count); + wr->addrs_length = cpu_to_be32(count); + + /* + * fill out the PBL for this message + */ + for (i = 0; i < count; i++) { + wr->paddrs[i] = 
cpu_to_be64(addr_list[i]); + } + + /* + * reference the request struct + */ + vq_req_get(c2dev, vq_req); + + /* + * send the WR to the adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + /* + * wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail1; + } + + /* + * process reply + */ + reply = + (struct c2wr_nsmr_register_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail1; + } + if ((err = c2_errno(reply))) { + goto bail2; + } + //*p_pb_entries = be32_to_cpu(reply->pbl_depth); + mr->ibmr.lkey = mr->ibmr.rkey = be32_to_cpu(reply->stag_index); + vq_repbuf_free(c2dev, reply); + + /* + * if there are still more PBEs we need to send them to + * the adapter and wait for a reply on the final one. + * reuse vq_req for this purpose. + */ + pbl_depth -= count; + if (pbl_depth) { + + vq_req->reply_msg = (unsigned long) NULL; + atomic_set(&vq_req->reply_ready, 0); + err = send_pbl_messages(c2dev, + cpu_to_be32(mr->ibmr.lkey), + (unsigned long) &addr_list[i], + pbl_depth, vq_req, PBL_PHYS); + if (err) { + goto bail1; + } + } + + vq_req_free(c2dev, vq_req); + kfree(wr); + + return err; + + bail2: + vq_repbuf_free(c2dev, reply); + bail1: + kfree(wr); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +int c2_stag_dealloc(struct c2_dev *c2dev, u32 stag_index) +{ + struct c2_vq_req *vq_req; /* verbs request object */ + struct c2wr_stag_dealloc_req wr; /* work request */ + struct c2wr_stag_dealloc_rep *reply; /* WR reply */ + int err; + + + /* + * allocate verbs request object + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + return -ENOMEM; + } + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_STAG_DEALLOC); + wr.hdr.context = (u64) (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.stag_index = cpu_to_be32(stag_index); + + /* + * reference the request struct. dereferenced in the int handler.
+ */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (struct c2wr_stag_dealloc_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} From swise at opengridcomputing.com Tue Jun 20 13:31:06 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:31:06 -0500 Subject: [openib-general] [PATCH v3 3/7] AMSO1100 OpenFabrics Provider. In-Reply-To: <20060620203050.31536.5341.stgit@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> Message-ID: <20060620203105.31536.53569.stgit@stevo-desktop> V2 Review Changes: - removed useless atomic_t in c2_pd struct. - qp ids now allocated and mapped using IDR. - pd ids now allocated using private bit-array allocator. - correctly map host memory for DMA (don't use __pa()).
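The changelog above mentions allocating pd ids from a private bit-array allocator. As an illustration only, a minimal userspace sketch of that kind of allocator (all names hypothetical, not the driver's actual c2_pd.c code) could look like this: bit i set means id i is in use, and a next-fit cursor spreads allocations instead of always reusing the lowest id:

```c
#include <stdint.h>
#include <string.h>

/* Toy bit-array id allocator: one bit per id, 1 = allocated. */
#define MAX_IDS 64

struct id_table {
	uint64_t bits;       /* allocation bitmap */
	unsigned int last;   /* next-fit search start */
};

static void id_table_init(struct id_table *t)
{
	memset(t, 0, sizeof(*t));
}

/* Scan for a clear bit starting at 'last'; claim it, or -1 if full. */
static int id_alloc(struct id_table *t)
{
	unsigned int i;

	for (i = 0; i < MAX_IDS; i++) {
		unsigned int id = (t->last + i) % MAX_IDS;

		if (!(t->bits & ((uint64_t)1 << id))) {
			t->bits |= (uint64_t)1 << id;
			t->last = (id + 1) % MAX_IDS;
			return (int)id;
		}
	}
	return -1;
}

static void id_free(struct id_table *t, int id)
{
	t->bits &= ~((uint64_t)1 << id);
}
```

The bitmap costs one bit per id and makes double-free and leak checks trivial, which is the usual reason to prefer it over handing out raw counters for small fixed id spaces like PDs.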
V1 Review Changes: - sizeof -> sizeof() - dprintk() -> pr_debug() - assert() -> BUG_ON() - C2_DEBUG -> DEBUG - cleaned up comments --- drivers/infiniband/hw/amso1100/c2_cm.c | 452 ++++++++++++ drivers/infiniband/hw/amso1100/c2_cq.c | 433 ++++++++++++ drivers/infiniband/hw/amso1100/c2_pd.c | 89 ++ drivers/infiniband/hw/amso1100/c2_provider.c | 867 +++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_provider.h | 181 +++++ drivers/infiniband/hw/amso1100/c2_qp.c | 975 ++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_user.h | 82 ++ 7 files changed, 3079 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_cm.c b/drivers/infiniband/hw/amso1100/c2_cm.c new file mode 100644 index 0000000..018d11f --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_cm.c @@ -0,0 +1,452 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include "c2.h" +#include "c2_wr.h" +#include "c2_vq.h" +#include + +int c2_llp_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + struct c2_dev *c2dev = to_c2dev(cm_id->device); + struct ib_qp *ibqp; + struct c2_qp *qp; + struct c2wr_qp_connect_req *wr; /* variable size needs a malloc. */ + struct c2_vq_req *vq_req; + int err; + + ibqp = c2_get_qp(cm_id->device, iw_param->qpn); + if (!ibqp) + return -EINVAL; + qp = to_c2qp(ibqp); + + /* Associate QP <--> CM_ID */ + cm_id->provider_data = qp; + cm_id->add_ref(cm_id); + qp->cm_id = cm_id; + + /* + * only support the max private_data length + */ + if (iw_param->private_data_len > C2_MAX_PRIVATE_DATA_SIZE) { + err = -EINVAL; + goto bail0; + } + /* + * Set the rdma read limits + */ + err = c2_qp_set_read_limits(c2dev, qp, iw_param->ord, iw_param->ird); + if (err) + goto bail0; + + /* + * Create and send a WR_QP_CONNECT... + */ + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail0; + } + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + err = -ENOMEM; + goto bail1; + } + + c2_wr_set_id(wr, CCWR_QP_CONNECT); + wr->hdr.context = 0; + wr->rnic_handle = c2dev->adapter_handle; + wr->qp_handle = qp->adapter_handle; + + wr->remote_addr = cm_id->remote_addr.sin_addr.s_addr; + wr->remote_port = cm_id->remote_addr.sin_port; + + /* + * Move any private data from the callers's buf into + * the WR. 
+ */ + if (iw_param->private_data) { + wr->private_data_length = + cpu_to_be32(iw_param->private_data_len); + memcpy(&wr->private_data[0], iw_param->private_data, + iw_param->private_data_len); + } else + wr->private_data_length = 0; + + /* + * Send WR to adapter. NOTE: There is no synch reply from + * the adapter. + */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + vq_req_free(c2dev, vq_req); + + bail1: + kfree(wr); + bail0: + if (err) { + /* + * If we fail, release reference on QP and + * disassociate QP from CM_ID + */ + cm_id->provider_data = NULL; + qp->cm_id = NULL; + cm_id->rem_ref(cm_id); + } + return err; +} + +int c2_llp_service_create(struct iw_cm_id *cm_id, int backlog) +{ + struct c2_dev *c2dev; + struct c2wr_ep_listen_create_req wr; + struct c2wr_ep_listen_create_rep *reply; + struct c2_vq_req *vq_req; + int err; + + c2dev = to_c2dev(cm_id->device); + if (c2dev == NULL) + return -EINVAL; + + /* + * Allocate verbs request. + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_EP_LISTEN_CREATE); + wr.hdr.context = (u64) (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.local_addr = cm_id->local_addr.sin_addr.s_addr; + wr.local_port = cm_id->local_addr.sin_port; + wr.backlog = cpu_to_be32(backlog); + wr.user_context = (u64) (unsigned long) cm_id; + + /* + * Reference the request struct. Dereferenced in the int handler. + */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + /* + * Process reply + */ + reply = + (struct c2wr_ep_listen_create_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail1; + } + + if ((err = c2_errno(reply)) != 0) + goto bail1; + + /* + * Keep the adapter handle. 
Used in subsequent destroy + */ + cm_id->provider_data = (void*)(unsigned long) reply->ep_handle; + + /* + * free vq stuff + */ + vq_repbuf_free(c2dev, reply); + vq_req_free(c2dev, vq_req); + + return 0; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + + +int c2_llp_service_destroy(struct iw_cm_id *cm_id) +{ + + struct c2_dev *c2dev; + struct c2wr_ep_listen_destroy_req wr; + struct c2wr_ep_listen_destroy_rep *reply; + struct c2_vq_req *vq_req; + int err; + + c2dev = to_c2dev(cm_id->device); + if (c2dev == NULL) + return -EINVAL; + + /* + * Allocate verbs request. + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_EP_LISTEN_DESTROY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.ep_handle = (u32)(unsigned long)cm_id->provider_data; + + /* + * reference the request struct. dereferenced in the int handler. + */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + /* + * Process reply + */ + reply=(struct c2wr_ep_listen_destroy_rep *)(unsigned long)vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + if ((err = c2_errno(reply)) != 0) + goto bail1; + + bail1: + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +int c2_llp_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + struct c2_dev *c2dev = to_c2dev(cm_id->device); + struct c2_qp *qp; + struct ib_qp *ibqp; + struct c2wr_cr_accept_req *wr; /* variable length WR */ + struct c2_vq_req *vq_req; + struct c2wr_cr_accept_rep *reply; /* VQ Reply msg ptr. 
*/ + int err; + + ibqp = c2_get_qp(cm_id->device, iw_param->qpn); + if (!ibqp) + return -EINVAL; + qp = to_c2qp(ibqp); + + /* Set the RDMA read limits */ + err = c2_qp_set_read_limits(c2dev, qp, iw_param->ord, iw_param->ird); + if (err) + goto bail0; + + /* Allocate verbs request. */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + err = -ENOMEM; + goto bail1; + } + vq_req->qp = qp; + vq_req->cm_id = cm_id; + vq_req->event = IW_CM_EVENT_ESTABLISHED; + + wr = kmalloc(c2dev->req_vq.msg_size, GFP_KERNEL); + if (!wr) { + err = -ENOMEM; + goto bail2; + } + + /* Build the WR */ + c2_wr_set_id(wr, CCWR_CR_ACCEPT); + wr->hdr.context = (unsigned long) vq_req; + wr->rnic_handle = c2dev->adapter_handle; + wr->ep_handle = (u32) (unsigned long) cm_id->provider_data; + wr->qp_handle = qp->adapter_handle; + + /* Replace the cr_handle with the QP after accept */ + cm_id->provider_data = qp; + cm_id->add_ref(cm_id); + qp->cm_id = cm_id; + + /* Validate private_data length */ + if (iw_param->private_data_len > C2_MAX_PRIVATE_DATA_SIZE) { + err = -EINVAL; + goto bail2; + } + + if (iw_param->private_data) { + wr->private_data_length = cpu_to_be32(iw_param->private_data_len); + memcpy(&wr->private_data[0], + iw_param->private_data, iw_param->private_data_len); + } else + wr->private_data_length = 0; + + /* Reference the request struct. Dereferenced in the int handler. 
*/ + vq_req_get(c2dev, vq_req); + + /* Send WR to adapter */ + err = vq_send_wr(c2dev, (union c2wr *) wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail2; + } + + /* Wait for reply from adapter */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail2; + + /* Check that reply is present */ + reply = (struct c2wr_cr_accept_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail2; + } + + err = c2_errno(reply); + vq_repbuf_free(c2dev, reply); + + if (!err) + c2_set_qp_state(qp, C2_QP_STATE_RTS); + bail2: + kfree(wr); + bail1: + vq_req_free(c2dev, vq_req); + bail0: + if (err) { + /* + * If we fail, release reference on QP and + * disassociate QP from CM_ID + */ + cm_id->provider_data = NULL; + qp->cm_id = NULL; + cm_id->rem_ref(cm_id); + } + return err; +} + +int c2_llp_reject(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len) +{ + struct c2_dev *c2dev; + struct c2wr_cr_reject_req wr; + struct c2_vq_req *vq_req; + struct c2wr_cr_reject_rep *reply; + int err; + + c2dev = to_c2dev(cm_id->device); + + /* + * Allocate verbs request. + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + /* + * Build the WR + */ + c2_wr_set_id(&wr, CCWR_CR_REJECT); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.ep_handle = (u32) (unsigned long) cm_id->provider_data; + + /* + * reference the request struct. dereferenced in the int handler. 
+ */ + vq_req_get(c2dev, vq_req); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + /* + * Process reply + */ + reply = (struct c2wr_cr_reject_rep *) (unsigned long) + vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + err = c2_errno(reply); + /* + * free vq stuff + */ + vq_repbuf_free(c2dev, reply); + + bail0: + vq_req_free(c2dev, vq_req); + return err; +} diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c new file mode 100644 index 0000000..d24da05 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_cq.c @@ -0,0 +1,433 @@ +/* + * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005 Cisco Systems, Inc. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2004 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include "c2.h" +#include "c2_vq.h" +#include "c2_status.h" + +#define C2_CQ_MSG_SIZE ((sizeof(struct c2wr_ce) + 32-1) & ~(32-1)) + +struct c2_cq *c2_cq_get(struct c2_dev *c2dev, int cqn) +{ + struct c2_cq *cq; + unsigned long flags; + + spin_lock_irqsave(&c2dev->lock, flags); + cq = c2dev->qptr_array[cqn]; + if (!cq) { + spin_unlock_irqrestore(&c2dev->lock, flags); + return NULL; + } + atomic_inc(&cq->refcount); + spin_unlock_irqrestore(&c2dev->lock, flags); + return cq; +} + +void c2_cq_put(struct c2_cq *cq) +{ + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void c2_cq_event(struct c2_dev *c2dev, u32 mq_index) +{ + struct c2_cq *cq; + + cq = c2_cq_get(c2dev, mq_index); + if (!cq) { + printk("discarding events on destroyed CQN=%d\n", mq_index); + return; + } + + (*cq->ibcq.comp_handler) (&cq->ibcq, cq->ibcq.cq_context); + c2_cq_put(cq); +} + +void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index) +{ + struct c2_cq *cq; + struct c2_mq *q; + + cq = c2_cq_get(c2dev, mq_index); + if (!cq) + return; + + spin_lock_irq(&cq->lock); + q = &cq->mq; + if (q && !c2_mq_empty(q)) { + u16 priv = q->priv; + struct c2wr_ce *msg; + + while (priv != be16_to_cpu(*q->shared)) { + msg = (struct c2wr_ce *) + (q->msg_pool.host + 
priv * q->msg_size); + if (msg->qp_user_context == (u64) (unsigned long) qp) { + msg->qp_user_context = (u64) 0; + } + priv = (priv + 1) % q->q_size; + } + } + spin_unlock_irq(&cq->lock); + c2_cq_put(cq); +} + +static inline enum ib_wc_status c2_cqe_status_to_openib(u8 status) +{ + switch (status) { + case C2_OK: + return IB_WC_SUCCESS; + case CCERR_FLUSHED: + return IB_WC_WR_FLUSH_ERR; + case CCERR_BASE_AND_BOUNDS_VIOLATION: + return IB_WC_LOC_PROT_ERR; + case CCERR_ACCESS_VIOLATION: + return IB_WC_LOC_ACCESS_ERR; + case CCERR_TOTAL_LENGTH_TOO_BIG: + return IB_WC_LOC_LEN_ERR; + case CCERR_INVALID_WINDOW: + return IB_WC_MW_BIND_ERR; + default: + return IB_WC_GENERAL_ERR; + } +} + + +static inline int c2_poll_one(struct c2_dev *c2dev, + struct c2_cq *cq, struct ib_wc *entry) +{ + struct c2wr_ce *ce; + struct c2_qp *qp; + int is_recv = 0; + + ce = (struct c2wr_ce *) c2_mq_consume(&cq->mq); + if (!ce) { + return -EAGAIN; + } + + /* + * if the qp returned is null then this qp has already + * been freed and we are unable to process the completion. 
+ * try pulling the next message + */ + while ((qp = + (struct c2_qp *) (unsigned long) ce->qp_user_context) == NULL) { + c2_mq_free(&cq->mq); + ce = (struct c2wr_ce *) c2_mq_consume(&cq->mq); + if (!ce) + return -EAGAIN; + } + + entry->status = c2_cqe_status_to_openib(c2_wr_get_result(ce)); + entry->wr_id = ce->hdr.context; + entry->qp_num = ce->handle; + entry->wc_flags = 0; + entry->slid = 0; + entry->sl = 0; + entry->src_qp = 0; + entry->dlid_path_bits = 0; + entry->pkey_index = 0; + + switch (c2_wr_get_id(ce)) { + case C2_WR_TYPE_SEND: + entry->opcode = IB_WC_SEND; + break; + case C2_WR_TYPE_RDMA_WRITE: + entry->opcode = IB_WC_RDMA_WRITE; + break; + case C2_WR_TYPE_RDMA_READ: + entry->opcode = IB_WC_RDMA_READ; + break; + case C2_WR_TYPE_BIND_MW: + entry->opcode = IB_WC_BIND_MW; + break; + case C2_WR_TYPE_RECV: + entry->byte_len = be32_to_cpu(ce->bytes_rcvd); + entry->opcode = IB_WC_RECV; + is_recv = 1; + break; + default: + break; + } + + /* consume the WQEs */ + if (is_recv) + c2_mq_lconsume(&qp->rq_mq, 1); + else + c2_mq_lconsume(&qp->sq_mq, + be32_to_cpu(c2_wr_get_wqe_count(ce)) + 1); + + /* free the message */ + c2_mq_free(&cq->mq); + + return 0; +} + +int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) +{ + struct c2_dev *c2dev = to_c2dev(ibcq->device); + struct c2_cq *cq = to_c2cq(ibcq); + unsigned long flags; + int npolled, err; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + + err = c2_poll_one(c2dev, cq, entry + npolled); + if (err) + break; + } + + spin_unlock_irqrestore(&cq->lock, flags); + + return npolled; +} + +int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +{ + struct c2_mq_shared __iomem *shared; + struct c2_cq *cq; + + cq = to_c2cq(ibcq); + shared = cq->mq.peer; + + if (notify == IB_CQ_NEXT_COMP) + writeb(C2_CQ_NOTIFICATION_TYPE_NEXT, &shared->notification_type); + else if (notify == IB_CQ_SOLICITED) + writeb(C2_CQ_NOTIFICATION_TYPE_NEXT_SE, 
&shared->notification_type); + else + return -EINVAL; + + writeb(CQ_WAIT_FOR_DMA | CQ_ARMED, &shared->armed); + + /* + * Now read back shared->armed to make the PCI + * write synchronous. This is necessary for + * correct cq notification semantics. + */ + readb(&shared->armed); + + return 0; +} + +static void c2_free_cq_buf(struct c2_dev *c2dev, struct c2_mq *mq) +{ + + dma_unmap_single(c2dev->ibdev.dma_device, pci_unmap_addr(mq, mapping), + mq->q_size * mq->msg_size, DMA_FROM_DEVICE); + free_pages((unsigned long) mq->msg_pool.host, + get_order(mq->q_size * mq->msg_size)); +} + +static int c2_alloc_cq_buf(struct c2_dev *c2dev, struct c2_mq *mq, int q_size, + int msg_size) +{ + unsigned long pool_start; + + pool_start = __get_free_pages(GFP_KERNEL, + get_order(q_size * msg_size)); + if (!pool_start) + return -ENOMEM; + + c2_mq_rep_init(mq, + 0, /* index (currently unknown) */ + q_size, + msg_size, + (u8 *) pool_start, + NULL, /* peer (currently unknown) */ + C2_MQ_HOST_TARGET); + + mq->host_dma = dma_map_single(c2dev->ibdev.dma_device, + (void *)pool_start, + q_size * msg_size, DMA_FROM_DEVICE); + pci_unmap_addr_set(mq, mapping, mq->host_dma); + + return 0; +} + +int c2_init_cq(struct c2_dev *c2dev, int entries, + struct c2_ucontext *ctx, struct c2_cq *cq) +{ + struct c2wr_cq_create_req wr; + struct c2wr_cq_create_rep *reply; + unsigned long peer_pa; + struct c2_vq_req *vq_req; + int err; + + might_sleep(); + + cq->ibcq.cqe = entries - 1; + cq->is_kernel = !ctx; + + /* Allocate a shared pointer */ + cq->mq.shared = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &cq->mq.shared_dma, GFP_KERNEL); + if (!cq->mq.shared) + return -ENOMEM; + + /* Allocate pages for the message pool */ + err = c2_alloc_cq_buf(c2dev, &cq->mq, entries + 1, C2_CQ_MSG_SIZE); + if (err) + goto bail0; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + err = -ENOMEM; + goto bail1; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_CQ_CREATE); + wr.hdr.context = (unsigned long) vq_req; + 
wr.rnic_handle = c2dev->adapter_handle; + wr.msg_size = cpu_to_be32(cq->mq.msg_size); + wr.depth = cpu_to_be32(cq->mq.q_size); + wr.shared_ht = cpu_to_be64(cq->mq.shared_dma); + wr.msg_pool = cpu_to_be64(cq->mq.host_dma); + wr.user_context = (u64) (unsigned long) (cq); + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail2; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail2; + + reply = (struct c2wr_cq_create_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail2; + } + + if ((err = c2_errno(reply)) != 0) + goto bail3; + + cq->adapter_handle = reply->cq_handle; + cq->mq.index = be32_to_cpu(reply->mq_index); + + peer_pa = c2dev->pa + be32_to_cpu(reply->adapter_shared); + cq->mq.peer = ioremap_nocache(peer_pa, PAGE_SIZE); + if (!cq->mq.peer) { + err = -ENOMEM; + goto bail3; + } + + vq_repbuf_free(c2dev, reply); + vq_req_free(c2dev, vq_req); + + spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + /* + * Use the MQ index allocated by the adapter to + * store the CQ in the qptr_array + */ + cq->cqn = cq->mq.index; + c2dev->qptr_array[cq->cqn] = cq; + + return 0; + + bail3: + vq_repbuf_free(c2dev, reply); + bail2: + vq_req_free(c2dev, vq_req); + bail1: + c2_free_cq_buf(c2dev, &cq->mq); + bail0: + c2_free_mqsp(cq->mq.shared); + + return err; +} + +void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq) +{ + int err; + struct c2_vq_req *vq_req; + struct c2wr_cq_destroy_req wr; + struct c2wr_cq_destroy_rep *reply; + + might_sleep(); + + /* Clear CQ from the qptr array */ + spin_lock_irq(&c2dev->lock); + c2dev->qptr_array[cq->mq.index] = NULL; + atomic_dec(&cq->refcount); + spin_unlock_irq(&c2dev->lock); + + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + goto bail0; + } + + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_CQ_DESTROY); + 
wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.cq_handle = cq->adapter_handle; + + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail1; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail1; + + reply = (struct c2wr_cq_destroy_rep *) (unsigned long) (vq_req->reply_msg); + + vq_repbuf_free(c2dev, reply); + bail1: + vq_req_free(c2dev, vq_req); + bail0: + if (cq->is_kernel) { + c2_free_cq_buf(c2dev, &cq->mq); + } + + return; +} diff --git a/drivers/infiniband/hw/amso1100/c2_pd.c b/drivers/infiniband/hw/amso1100/c2_pd.c new file mode 100644 index 0000000..b9a647a --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_pd.c @@ -0,0 +1,89 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include + +#include "c2.h" +#include "c2_provider.h" + +int c2_pd_alloc(struct c2_dev *c2dev, int privileged, struct c2_pd *pd) +{ + u32 obj; + int ret = 0; + + spin_lock(&c2dev->pd_table.lock); + obj = find_next_zero_bit(c2dev->pd_table.table, c2dev->pd_table.max, + c2dev->pd_table.last); + if (obj >= c2dev->pd_table.max) + obj = find_first_zero_bit(c2dev->pd_table.table, + c2dev->pd_table.max); + if (obj < c2dev->pd_table.max) { + pd->pd_id = obj; + __set_bit(obj, c2dev->pd_table.table); + c2dev->pd_table.last = obj+1; + if (c2dev->pd_table.last >= c2dev->pd_table.max) + c2dev->pd_table.last = 0; + } else + ret = -ENOMEM; + spin_unlock(&c2dev->pd_table.lock); + return ret; +} + +void c2_pd_free(struct c2_dev *c2dev, struct c2_pd *pd) +{ + spin_lock(&c2dev->pd_table.lock); + __clear_bit(pd->pd_id, c2dev->pd_table.table); + spin_unlock(&c2dev->pd_table.lock); +} + +int __devinit c2_init_pd_table(struct c2_dev *c2dev) +{ + + c2dev->pd_table.last = 0; + c2dev->pd_table.max = c2dev->props.max_pd; + spin_lock_init(&c2dev->pd_table.lock); + c2dev->pd_table.table = kmalloc(BITS_TO_LONGS(c2dev->props.max_pd) * + sizeof(long), GFP_KERNEL); + if (!c2dev->pd_table.table) + return -ENOMEM; + bitmap_zero(c2dev->pd_table.table, c2dev->props.max_pd); + return 0; +} + +void __devexit c2_cleanup_pd_table(struct c2_dev *c2dev) +{ + kfree(c2dev->pd_table.table); +} diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c new file 
mode 100644 index 0000000..a0c176e --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_provider.c @@ -0,0 +1,867 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include +#include "c2.h" +#include "c2_provider.h" +#include "c2_user.h" + +static int c2_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct c2_dev *c2dev = to_c2dev(ibdev); + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + *props = c2dev->props; + return 0; +} + +static int c2_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + props->max_mtu = IB_MTU_4096; + props->lid = 0; + props->lmc = 0; + props->sm_lid = 0; + props->sm_sl = 0; + props->state = IB_PORT_ACTIVE; + props->phys_state = 0; + props->port_cap_flags = + IB_PORT_CM_SUP | + IB_PORT_REINIT_SUP | + IB_PORT_VENDOR_CLASS_SUP | IB_PORT_BOOT_MGMT_SUP; + props->gid_tbl_len = 1; + props->pkey_tbl_len = 1; + props->qkey_viol_cntr = 0; + props->active_width = 1; + props->active_speed = 1; + + return 0; +} + +static int c2_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return 0; +} + +static int c2_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 * pkey) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + *pkey = 0; + return 0; +} + +static int c2_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct c2_dev *c2dev = to_c2dev(ibdev); + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + memset(&(gid->raw[0]), 0, sizeof(gid->raw)); + memcpy(&(gid->raw[0]), c2dev->pseudo_netdev->dev_addr, 6); + + return 0; +} + +/* Allocate the user context data structure. This keeps track + * of all objects associated with a particular user-mode client. 
+ */ +static struct ib_ucontext *c2_alloc_ucontext(struct ib_device *ibdev, + struct ib_udata *udata) +{ + struct c2_ucontext *context; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + context = kmalloc(sizeof(*context), GFP_KERNEL); + if (!context) + return ERR_PTR(-ENOMEM); + + return &context->ibucontext; +} + +static int c2_dealloc_ucontext(struct ib_ucontext *context) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + kfree(context); + return 0; +} + +static int c2_mmap_uar(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static struct ib_pd *c2_alloc_pd(struct ib_device *ibdev, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct c2_pd *pd; + int err; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + pd = kmalloc(sizeof(*pd), GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = c2_pd_alloc(to_c2dev(ibdev), !context, pd); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + if (context) { + if (ib_copy_to_udata(udata, &pd->pd_id, sizeof(__u32))) { + c2_pd_free(to_c2dev(ibdev), pd); + kfree(pd); + return ERR_PTR(-EFAULT); + } + } + + return &pd->ibpd; +} + +static int c2_dealloc_pd(struct ib_pd *pd) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + c2_pd_free(to_c2dev(pd->device), to_c2pd(pd)); + kfree(pd); + + return 0; +} + +static struct ib_ah *c2_ah_create(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return ERR_PTR(-ENOSYS); +} + +static int c2_ah_destroy(struct ib_ah *ah) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static void c2_add_ref(struct ib_qp *ibqp) +{ + struct c2_qp *qp; + BUG_ON(!ibqp); + qp = to_c2qp(ibqp); + atomic_inc(&qp->refcount); +} + +static void c2_rem_ref(struct ib_qp *ibqp) +{ + struct c2_qp *qp; + BUG_ON(!ibqp); + qp = to_c2qp(ibqp); + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +struct ib_qp *c2_get_qp(struct 
ib_device *device, int qpn) +{ + struct c2_dev* c2dev = to_c2dev(device); + struct c2_qp *qp; + + qp = c2_find_qpn(c2dev, qpn); + pr_debug("%s Returning QP=%p for QPN=%d, device=%p, refcount=%d\n", + __FUNCTION__, qp, qpn, device, + (qp?atomic_read(&qp->refcount):0)); + + return (qp?&qp->ibqp:NULL); +} + +static struct ib_qp *c2_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr, + struct ib_udata *udata) +{ + struct c2_qp *qp; + int err; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + switch (init_attr->qp_type) { + case IB_QPT_RC: + qp = kzalloc(sizeof(*qp), GFP_KERNEL); + if (!qp) { + pr_debug("%s: Unable to allocate QP\n", __FUNCTION__); + return ERR_PTR(-ENOMEM); + } + spin_lock_init(&qp->lock); + if (pd->uobject) { + /* userspace specific */ + } + + err = c2_alloc_qp(to_c2dev(pd->device), + to_c2pd(pd), init_attr, qp); + + if (err && pd->uobject) { + /* userspace specific */ + } + + break; + default: + pr_debug("%s: Invalid QP type: %d\n", __FUNCTION__, + init_attr->qp_type); + return ERR_PTR(-EINVAL); + break; + } + + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + return &qp->ibqp; +} + +static int c2_destroy_qp(struct ib_qp *ib_qp) +{ + struct c2_qp *qp = to_c2qp(ib_qp); + + pr_debug("%s:%u qp=%p,qp->state=%d\n", + __FUNCTION__, __LINE__,ib_qp,qp->state); + c2_free_qp(to_c2dev(ib_qp->device), qp); + kfree(qp); + return 0; +} + +static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct c2_cq *cq; + int err; + + cq = kmalloc(sizeof(*cq), GFP_KERNEL); + if (!cq) { + pr_debug("%s: Unable to allocate CQ\n", __FUNCTION__); + return ERR_PTR(-ENOMEM); + } + + err = c2_init_cq(to_c2dev(ibdev), entries, NULL, cq); + if (err) { + pr_debug("%s: error initializing CQ\n", __FUNCTION__); + kfree(cq); + return ERR_PTR(err); + } + + return &cq->ibcq; +} + +static int c2_destroy_cq(struct ib_cq *ib_cq) +{ + struct c2_cq *cq = to_c2cq(ib_cq); + + pr_debug("%s:%u\n", 
__FUNCTION__, __LINE__); + + c2_free_cq(to_c2dev(ib_cq->device), cq); + kfree(cq); + + return 0; +} + +static inline u32 c2_convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_WRITE ? C2_ACF_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? C2_ACF_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? C2_ACF_LOCAL_WRITE : 0) | + C2_ACF_LOCAL_READ | C2_ACF_WINDOW_BIND; +} + +static struct ib_mr *c2_reg_phys_mr(struct ib_pd *ib_pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, int acc, u64 * iova_start) +{ + struct c2_mr *mr; + u64 *page_list; + u32 total_len; + int err, i, j, k, page_shift, pbl_depth; + + pbl_depth = 0; + total_len = 0; + + page_shift = PAGE_SHIFT; + /* + * If there is only 1 buffer we assume this could + * be a map of all phy mem...use a 32k page_shift. + */ + if (num_phys_buf == 1) + page_shift += 3; + + for (i = 0; i < num_phys_buf; i++) { + + if (buffer_list[i].addr & ~PAGE_MASK) { + pr_debug("Unaligned Memory Buffer: 0x%x\n", + (unsigned int) buffer_list[i].addr); + return ERR_PTR(-EINVAL); + } + + if (!buffer_list[i].size) { + pr_debug("Invalid Buffer Size\n"); + return ERR_PTR(-EINVAL); + } + + total_len += buffer_list[i].size; + pbl_depth += ALIGN(buffer_list[i].size, + (1 << page_shift)) >> page_shift; + } + + page_list = vmalloc(sizeof(u64) * pbl_depth); + if (!page_list) { + pr_debug("couldn't vmalloc page_list of size %zd\n", + (sizeof(u64) * pbl_depth)); + return ERR_PTR(-ENOMEM); + } + + for (i = 0, j = 0; i < num_phys_buf; i++) { + + int naddrs; + + naddrs = ALIGN(buffer_list[i].size, + (1 << page_shift)) >> page_shift; + for (k = 0; k < naddrs; k++) + page_list[j++] = (buffer_list[i].addr + + (k << page_shift)); + } + + mr = kmalloc(sizeof(*mr), GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + mr->pd = to_c2pd(ib_pd); + pr_debug("%s - page shift %d, pbl_depth %d, total_len %u, " + "*iova_start %llx, first pa %llx, last pa %llx\n", + __FUNCTION__, page_shift, pbl_depth, total_len, + *iova_start, page_list[0], 
page_list[pbl_depth-1]); + err = c2_nsmr_register_phys_kern(to_c2dev(ib_pd->device), page_list, + (1 << page_shift), pbl_depth, + total_len, 0, iova_start, + c2_convert_access(acc), mr); + vfree(page_list); + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + return &mr->ibmr; +} + +static struct ib_mr *c2_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct ib_phys_buf bl; + u64 kva = 0; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + /* AMSO1100 limit */ + bl.size = 0xffffffff; + bl.addr = 0; + return c2_reg_phys_mr(pd, &bl, 1, acc, &kva); +} + +static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, + int acc, struct ib_udata *udata) +{ + u64 *pages; + u64 kva = 0; + int shift, n, len; + int i, j, k; + int err = 0; + struct ib_umem_chunk *chunk; + struct c2_pd *c2pd = to_c2pd(pd); + struct c2_mr *c2mr; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + shift = ffs(region->page_size) - 1; + + c2mr = kmalloc(sizeof(*c2mr), GFP_KERNEL); + if (!c2mr) + return ERR_PTR(-ENOMEM); + c2mr->pd = c2pd; + + n = 0; + list_for_each_entry(chunk, ®ion->chunk_list, list) + n += chunk->nents; + + pages = kmalloc(n * sizeof(u64), GFP_KERNEL); + if (!pages) { + err = -ENOMEM; + goto err; + } + + i = 0; + list_for_each_entry(chunk, ®ion->chunk_list, list) { + for (j = 0; j < chunk->nmap; ++j) { + len = sg_dma_len(&chunk->page_list[j]) >> shift; + for (k = 0; k < len; ++k) { + pages[i++] = + sg_dma_address(&chunk->page_list[j]) + + (region->page_size * k); + } + } + } + + kva = (u64)region->virt_base; + err = c2_nsmr_register_phys_kern(to_c2dev(pd->device), + pages, + region->page_size, + i, + region->length, + region->offset, + &kva, + c2_convert_access(acc), + c2mr); + kfree(pages); + if (err) { + kfree(c2mr); + return ERR_PTR(err); + } + return &c2mr->ibmr; + +err: + kfree(c2mr); + return ERR_PTR(err); +} + +static int c2_dereg_mr(struct ib_mr *ib_mr) +{ + struct c2_mr *mr = to_c2mr(ib_mr); + int err; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + 
err = c2_stag_dealloc(to_c2dev(ib_mr->device), ib_mr->lkey); + if (err) + pr_debug("c2_stag_dealloc failed: %d\n", err); + else + kfree(mr); + + return err; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev); + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return sprintf(buf, "%x\n", dev->props.hw_ver); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev); + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return sprintf(buf, "%x.%x.%x\n", + (int) (dev->props.fw_ver >> 32), + (int) (dev->props.fw_ver >> 16) & 0xffff, + (int) (dev->props.fw_ver & 0xffff)); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return sprintf(buf, "AMSO1100\n"); +} + +static ssize_t show_board(struct class_device *cdev, char *buf) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return sprintf(buf, "%.*s\n", 32, "AMSO1100 Board ID"); +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); +static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); + +static struct class_device_attribute *c2_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type, + &class_device_attr_board_id +}; + +static int c2_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, + int attr_mask) +{ + int err; + + err = + c2_qp_modify(to_c2dev(ibqp->device), to_c2qp(ibqp), attr, + attr_mask); + + return err; +} + +static int c2_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static int c2_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + pr_debug("%s:%u\n", 
__FUNCTION__, __LINE__); + return -ENOSYS; +} + +static int c2_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + struct ib_wc *in_wc, + struct ib_grh *in_grh, + struct ib_mad *in_mad, struct ib_mad *out_mad) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static int c2_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + /* Request a connection */ + return c2_llp_connect(cm_id, iw_param); +} + +static int c2_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + /* Accept the new connection */ + return c2_llp_accept(cm_id, iw_param); +} + +static int c2_reject(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len) +{ + int err; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + err = c2_llp_reject(cm_id, pdata, pdata_len); + return err; +} + +static int c2_service_create(struct iw_cm_id *cm_id, int backlog) +{ + int err; + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + err = c2_llp_service_create(cm_id, backlog); + pr_debug("%s:%u err=%d\n", + __FUNCTION__, __LINE__, + err); + return err; +} + +static int c2_service_destroy(struct iw_cm_id *cm_id) +{ + int err; + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + + err = c2_llp_service_destroy(cm_id); + + return err; +} + +static int c2_pseudo_up(struct net_device *netdev) +{ + struct in_device *ind; + struct c2_dev *c2dev = netdev->priv; + + ind = in_dev_get(netdev); + if (!ind) + return 0; + + pr_debug("adding...\n"); + for_ifa(ind) { +#ifdef DEBUG + u8 *ip = (u8 *) & ifa->ifa_address; + + pr_debug("%s: %d.%d.%d.%d\n", + ifa->ifa_label, ip[0], ip[1], ip[2], ip[3]); +#endif + c2_add_addr(c2dev, ifa->ifa_address, ifa->ifa_mask); + } + endfor_ifa(ind); + in_dev_put(ind); + + return 0; +} + +static int c2_pseudo_down(struct net_device *netdev) +{ + struct in_device *ind; + struct c2_dev *c2dev = netdev->priv; + + ind = in_dev_get(netdev); 
+ if (!ind) + return 0; + + pr_debug("deleting...\n"); + for_ifa(ind) { +#ifdef DEBUG + u8 *ip = (u8 *) & ifa->ifa_address; + + pr_debug("%s: %d.%d.%d.%d\n", + ifa->ifa_label, ip[0], ip[1], ip[2], ip[3]); +#endif + c2_del_addr(c2dev, ifa->ifa_address, ifa->ifa_mask); + } + endfor_ifa(ind); + in_dev_put(ind); + + return 0; +} + +static int c2_pseudo_xmit_frame(struct sk_buff *skb, struct net_device *netdev) +{ + kfree_skb(skb); + return NETDEV_TX_OK; +} + +static int c2_pseudo_change_mtu(struct net_device *netdev, int new_mtu) +{ + int ret = 0; + + if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU) + return -EINVAL; + + netdev->mtu = new_mtu; + + /* TODO: Tell rnic about new rdma interface mtu */ + return ret; +} + +static void setup(struct net_device *netdev) +{ + SET_MODULE_OWNER(netdev); + netdev->open = c2_pseudo_up; + netdev->stop = c2_pseudo_down; + netdev->hard_start_xmit = c2_pseudo_xmit_frame; + netdev->get_stats = NULL; + netdev->tx_timeout = NULL; + netdev->set_mac_address = NULL; + netdev->change_mtu = c2_pseudo_change_mtu; + netdev->watchdog_timeo = 0; + netdev->type = ARPHRD_ETHER; + netdev->mtu = 1500; + netdev->hard_header_len = ETH_HLEN; + netdev->addr_len = ETH_ALEN; + netdev->tx_queue_len = 0; + netdev->flags |= IFF_NOARP; + return; +} + +static struct net_device *c2_pseudo_netdev_init(struct c2_dev *c2dev) +{ + char name[IFNAMSIZ]; + struct net_device *netdev; + + /* change ethxxx to iwxxx */ + strcpy(name, "iw"); + strcat(name, &c2dev->netdev->name[3]); + netdev = alloc_netdev(sizeof(*netdev), name, setup); + if (!netdev) { + printk(KERN_ERR PFX "%s - etherdev alloc failed", + __FUNCTION__); + return NULL; + } + + netdev->priv = c2dev; + + SET_NETDEV_DEV(netdev, &c2dev->pcidev->dev); + + memcpy_fromio(netdev->dev_addr, c2dev->kva + C2_REGS_RDMA_ENADDR, 6); + + /* Print out the MAC address */ + pr_debug("%s: MAC %02X:%02X:%02X:%02X:%02X:%02X\n", + netdev->name, + netdev->dev_addr[0], netdev->dev_addr[1], netdev->dev_addr[2], + 
netdev->dev_addr[3], netdev->dev_addr[4], netdev->dev_addr[5]); + + /* Disable network packets */ + netif_stop_queue(netdev); + return netdev; +} + +int c2_register_device(struct c2_dev *dev) +{ + int ret; + int i; + + /* Register pseudo network device */ + dev->pseudo_netdev = c2_pseudo_netdev_init(dev); + if (dev->pseudo_netdev) { + ret = register_netdev(dev->pseudo_netdev); + if (ret) { + printk(KERN_ERR PFX + "Unable to register netdev, ret = %d\n", ret); + free_netdev(dev->pseudo_netdev); + return ret; + } + } + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + strlcpy(dev->ibdev.name, "amso%d", IB_DEVICE_NAME_MAX); + dev->ibdev.owner = THIS_MODULE; + dev->ibdev.uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_CMD_QUERY_PORT) | + (1ull << IB_USER_VERBS_CMD_ALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_REG_MR) | + (1ull << IB_USER_VERBS_CMD_DEREG_MR) | + (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | + (1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | + (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_QP) | + (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | + (1ull << IB_USER_VERBS_CMD_POLL_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | + (1ull << IB_USER_VERBS_CMD_POST_SEND) | + (1ull << IB_USER_VERBS_CMD_POST_RECV); + + dev->ibdev.node_type = RDMA_NODE_RNIC; + memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid)); + memcpy(&dev->ibdev.node_guid, dev->pseudo_netdev->dev_addr, 6); + dev->ibdev.phys_port_cnt = 1; + dev->ibdev.dma_device = &dev->pcidev->dev; + dev->ibdev.class_dev.dev = &dev->pcidev->dev; + dev->ibdev.query_device = c2_query_device; + dev->ibdev.query_port = c2_query_port; + dev->ibdev.modify_port = c2_modify_port; + dev->ibdev.query_pkey = c2_query_pkey; + dev->ibdev.query_gid = c2_query_gid; + dev->ibdev.alloc_ucontext = c2_alloc_ucontext; + 
dev->ibdev.dealloc_ucontext = c2_dealloc_ucontext; + dev->ibdev.mmap = c2_mmap_uar; + dev->ibdev.alloc_pd = c2_alloc_pd; + dev->ibdev.dealloc_pd = c2_dealloc_pd; + dev->ibdev.create_ah = c2_ah_create; + dev->ibdev.destroy_ah = c2_ah_destroy; + dev->ibdev.create_qp = c2_create_qp; + dev->ibdev.modify_qp = c2_modify_qp; + dev->ibdev.destroy_qp = c2_destroy_qp; + dev->ibdev.create_cq = c2_create_cq; + dev->ibdev.destroy_cq = c2_destroy_cq; + dev->ibdev.poll_cq = c2_poll_cq; + dev->ibdev.get_dma_mr = c2_get_dma_mr; + dev->ibdev.reg_phys_mr = c2_reg_phys_mr; + dev->ibdev.reg_user_mr = c2_reg_user_mr; + dev->ibdev.dereg_mr = c2_dereg_mr; + + dev->ibdev.alloc_fmr = NULL; + dev->ibdev.unmap_fmr = NULL; + dev->ibdev.dealloc_fmr = NULL; + dev->ibdev.map_phys_fmr = NULL; + + dev->ibdev.attach_mcast = c2_multicast_attach; + dev->ibdev.detach_mcast = c2_multicast_detach; + dev->ibdev.process_mad = c2_process_mad; + + dev->ibdev.req_notify_cq = c2_arm_cq; + dev->ibdev.post_send = c2_post_send; + dev->ibdev.post_recv = c2_post_receive; + + dev->ibdev.iwcm = kmalloc(sizeof(*dev->ibdev.iwcm), GFP_KERNEL); + dev->ibdev.iwcm->add_ref = c2_add_ref; + dev->ibdev.iwcm->rem_ref = c2_rem_ref; + dev->ibdev.iwcm->get_qp = c2_get_qp; + dev->ibdev.iwcm->connect = c2_connect; + dev->ibdev.iwcm->accept = c2_accept; + dev->ibdev.iwcm->reject = c2_reject; + dev->ibdev.iwcm->create_listen = c2_service_create; + dev->ibdev.iwcm->destroy_listen = c2_service_destroy; + + ret = ib_register_device(&dev->ibdev); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(c2_class_attributes); ++i) { + ret = class_device_create_file(&dev->ibdev.class_dev, + c2_class_attributes[i]); + if (ret) { + unregister_netdev(dev->pseudo_netdev); + free_netdev(dev->pseudo_netdev); + ib_unregister_device(&dev->ibdev); + return ret; + } + } + + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + return 0; +} + +void c2_unregister_device(struct c2_dev *dev) +{ + pr_debug("%s:%u\n", __FUNCTION__, __LINE__); + 
unregister_netdev(dev->pseudo_netdev); + free_netdev(dev->pseudo_netdev); + ib_unregister_device(&dev->ibdev); +} diff --git a/drivers/infiniband/hw/amso1100/c2_provider.h b/drivers/infiniband/hw/amso1100/c2_provider.h new file mode 100644 index 0000000..0fb6f1c --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_provider.h @@ -0,0 +1,181 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#ifndef C2_PROVIDER_H +#define C2_PROVIDER_H +#include + +#include +#include + +#include "c2_mq.h" +#include + +#define C2_MPT_FLAG_ATOMIC (1 << 14) +#define C2_MPT_FLAG_REMOTE_WRITE (1 << 13) +#define C2_MPT_FLAG_REMOTE_READ (1 << 12) +#define C2_MPT_FLAG_LOCAL_WRITE (1 << 11) +#define C2_MPT_FLAG_LOCAL_READ (1 << 10) + +struct c2_buf_list { + void *buf; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + + +/* The user context keeps track of objects allocated for a + * particular user-mode client. */ +struct c2_ucontext { + struct ib_ucontext ibucontext; +}; + +struct c2_mtt; + +/* All objects associated with a PD are kept in the + * associated user context if present. + */ +struct c2_pd { + struct ib_pd ibpd; + u32 pd_id; +}; + +struct c2_mr { + struct ib_mr ibmr; + struct c2_pd *pd; +}; + +struct c2_av; + +enum c2_ah_type { + C2_AH_ON_HCA, + C2_AH_PCI_POOL, + C2_AH_KMALLOC +}; + +struct c2_ah { + struct ib_ah ibah; +}; + +struct c2_cq { + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + int is_kernel; + wait_queue_head_t wait; + + u32 adapter_handle; + struct c2_mq mq; +}; + +struct c2_wq { + spinlock_t lock; +}; +struct iw_cm_id; +struct c2_qp { + struct ib_qp ibqp; + struct iw_cm_id *cm_id; + spinlock_t lock; + atomic_t refcount; + wait_queue_head_t wait; + int qpn; + + u32 adapter_handle; + u32 send_sgl_depth; + u32 recv_sgl_depth; + u32 rdma_write_sgl_depth; + u8 state; + + struct c2_mq sq_mq; + struct c2_mq rq_mq; +}; + +struct c2_cr_query_attrs { + u32 local_addr; + u32 remote_addr; + u16 local_port; + u16 remote_port; +}; + +static inline struct c2_pd *to_c2pd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct c2_pd, ibpd); +} + +static inline struct c2_ucontext *to_c2ucontext(struct ib_ucontext *ibucontext) +{ + return container_of(ibucontext, struct c2_ucontext, ibucontext); +} + +static inline struct c2_mr *to_c2mr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct c2_mr, ibmr); +} + + +static inline struct c2_ah 
*to_c2ah(struct ib_ah *ibah) +{ + return container_of(ibah, struct c2_ah, ibah); +} + +static inline struct c2_cq *to_c2cq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct c2_cq, ibcq); +} + +static inline struct c2_qp *to_c2qp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct c2_qp, ibqp); +} + +static inline int is_rnic_addr(struct net_device *netdev, u32 addr) +{ + struct in_device *ind; + int ret = 0; + + ind = in_dev_get(netdev); + if (!ind) + return 0; + + for_ifa(ind) { + if (ifa->ifa_address == addr) { + ret = 1; + break; + } + } + endfor_ifa(ind); + in_dev_put(ind); + return ret; +} +#endif /* C2_PROVIDER_H */ diff --git a/drivers/infiniband/hw/amso1100/c2_qp.c b/drivers/infiniband/hw/amso1100/c2_qp.c new file mode 100644 index 0000000..76a60bc --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_qp.c @@ -0,0 +1,975 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2004 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#include "c2.h" +#include "c2_vq.h" +#include "c2_status.h" + +#define C2_MAX_ORD_PER_QP 128 +#define C2_MAX_IRD_PER_QP 128 + +#define C2_HINT_MAKE(q_index, hint_count) (((q_index) << 16) | hint_count) +#define C2_HINT_GET_INDEX(hint) (((hint) & 0x7FFF0000) >> 16) +#define C2_HINT_GET_COUNT(hint) ((hint) & 0x0000FFFF) + +#define NO_SUPPORT -1 +static const u8 c2_opcode[] = { + [IB_WR_SEND] = C2_WR_TYPE_SEND, + [IB_WR_SEND_WITH_IMM] = NO_SUPPORT, + [IB_WR_RDMA_WRITE] = C2_WR_TYPE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = NO_SUPPORT, + [IB_WR_RDMA_READ] = C2_WR_TYPE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = NO_SUPPORT, + [IB_WR_ATOMIC_FETCH_AND_ADD] = NO_SUPPORT, +}; + +static int to_c2_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: + return C2_QP_STATE_IDLE; + case IB_QPS_RTS: + return C2_QP_STATE_RTS; + case IB_QPS_SQD: + return C2_QP_STATE_CLOSING; + case IB_QPS_SQE: + return C2_QP_STATE_CLOSING; + case IB_QPS_ERR: + return C2_QP_STATE_ERROR; + default: + return -1; + } +} + +int to_ib_state(enum c2_qp_state c2_state) +{ + switch (c2_state) { + case C2_QP_STATE_IDLE: + return IB_QPS_RESET; + case C2_QP_STATE_CONNECTING: + return IB_QPS_RTR; + case C2_QP_STATE_RTS: + return IB_QPS_RTS; + case 
C2_QP_STATE_CLOSING: + return IB_QPS_SQD; + case C2_QP_STATE_ERROR: + return IB_QPS_ERR; + case C2_QP_STATE_TERMINATE: + return IB_QPS_SQE; + default: + return -1; + } +} + +const char *to_ib_state_str(int ib_state) +{ + static const char *state_str[] = { + "IB_QPS_RESET", + "IB_QPS_INIT", + "IB_QPS_RTR", + "IB_QPS_RTS", + "IB_QPS_SQD", + "IB_QPS_SQE", + "IB_QPS_ERR" + }; + if (ib_state < IB_QPS_RESET || + ib_state > IB_QPS_ERR) + return ""; + + ib_state -= IB_QPS_RESET; + return state_str[ib_state]; +} + +void c2_set_qp_state(struct c2_qp *qp, int c2_state) +{ + int new_state = to_ib_state(c2_state); + + pr_debug("%s: qp[%p] state modify %s --> %s\n", + __FUNCTION__, + qp, + to_ib_state_str(qp->state), + to_ib_state_str(new_state)); + qp->state = new_state; +} + +#define C2_QP_NO_ATTR_CHANGE 0xFFFFFFFF + +int c2_qp_modify(struct c2_dev *c2dev, struct c2_qp *qp, + struct ib_qp_attr *attr, int attr_mask) +{ + struct c2wr_qp_modify_req wr; + struct c2wr_qp_modify_rep *reply; + struct c2_vq_req *vq_req; + unsigned long flags; + u8 next_state; + int err; + + pr_debug("%s:%d qp=%p, %s --> %s\n", + __FUNCTION__, __LINE__, + qp, + to_ib_state_str(qp->state), + to_ib_state_str(attr->qp_state)); + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + c2_wr_set_id(&wr, CCWR_QP_MODIFY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.qp_handle = qp->adapter_handle; + wr.ord = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + wr.ird = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + wr.sq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + wr.rq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + + if (attr_mask & IB_QP_STATE) { + /* Ensure the state is valid */ + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) + return -EINVAL; + + wr.next_qp_state = cpu_to_be32(to_c2_state(attr->qp_state)); + + if (attr->qp_state == IB_QPS_ERR) { + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id && qp->state == IB_QPS_RTS) { + pr_debug("Generating CLOSE event 
for QP-->ERR, " + "qp=%p, cm_id=%p\n",qp,qp->cm_id); + /* Generate a CLOSE event */ + vq_req->cm_id = qp->cm_id; + vq_req->event = IW_CM_EVENT_CLOSE; + } + spin_unlock_irqrestore(&qp->lock, flags); + next_state = attr->qp_state; + + } else if (attr_mask & IB_QP_CUR_STATE) { + + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + wr.next_qp_state = + cpu_to_be32(to_c2_state(attr->cur_qp_state)); + + next_state = attr->cur_qp_state; + + } else { + err = 0; + goto bail0; + } + + /* reference the request struct */ + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + reply = (struct c2wr_qp_modify_rep *) (unsigned long) vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + if (!err) + qp->state = next_state; +#ifdef DEBUG + else + pr_debug("%s: c2_errno=%d\n", __FUNCTION__, err); +#endif + /* + * If we're going to error and generating the event here, then + * we need to remove the reference because there will be no + * close event generated by the adapter + */ + spin_lock_irqsave(&qp->lock, flags); + if (vq_req->event==IW_CM_EVENT_CLOSE && qp->cm_id) { + qp->cm_id->rem_ref(qp->cm_id); + qp->cm_id = NULL; + } + spin_unlock_irqrestore(&qp->lock, flags); + + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + + pr_debug("%s:%d qp=%p, cur_state=%s\n", + __FUNCTION__, __LINE__, + qp, + to_ib_state_str(qp->state)); + return err; +} + +int c2_qp_set_read_limits(struct c2_dev *c2dev, struct c2_qp *qp, + int ord, int ird) +{ + struct c2wr_qp_modify_req wr; + struct c2wr_qp_modify_rep *reply; + struct c2_vq_req *vq_req; + int err; + + vq_req = vq_req_alloc(c2dev); + if (!vq_req) + return -ENOMEM; + + c2_wr_set_id(&wr, 
CCWR_QP_MODIFY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.qp_handle = qp->adapter_handle; + wr.ord = cpu_to_be32(ord); + wr.ird = cpu_to_be32(ird); + wr.sq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + wr.rq_depth = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + wr.next_qp_state = cpu_to_be32(C2_QP_NO_ATTR_CHANGE); + + /* reference the request struct */ + vq_req_get(c2dev, vq_req); + + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + err = vq_wait_for_reply(c2dev, vq_req); + if (err) + goto bail0; + + reply = (struct c2wr_qp_modify_rep *) (unsigned long) + vq_req->reply_msg; + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + err = c2_errno(reply); + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +static int destroy_qp(struct c2_dev *c2dev, struct c2_qp *qp) +{ + struct c2_vq_req *vq_req; + struct c2wr_qp_destroy_req wr; + struct c2wr_qp_destroy_rep *reply; + unsigned long flags; + int err; + + /* + * Allocate a verb request message + */ + vq_req = vq_req_alloc(c2dev); + if (!vq_req) { + return -ENOMEM; + } + + /* + * Initialize the WR + */ + c2_wr_set_id(&wr, CCWR_QP_DESTROY); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.qp_handle = qp->adapter_handle; + + /* + * reference the request struct. dereferenced in the int handler. 
+ */ + vq_req_get(c2dev, vq_req); + + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id && qp->state == IB_QPS_RTS) { + pr_debug("destroy_qp: generating CLOSE event for QP-->ERR, " + "qp=%p, cm_id=%p\n",qp,qp->cm_id); + /* Generate a CLOSE event */ + vq_req->qp = qp; + vq_req->cm_id = qp->cm_id; + vq_req->event = IW_CM_EVENT_CLOSE; + } + spin_unlock_irqrestore(&qp->lock, flags); + + /* + * Send WR to adapter + */ + err = vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail0; + } + + /* + * Wait for reply from adapter + */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail0; + } + + /* + * Process reply + */ + reply = (struct c2wr_qp_destroy_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail0; + } + + spin_lock_irqsave(&qp->lock, flags); + if (qp->cm_id) { + qp->cm_id->rem_ref(qp->cm_id); + qp->cm_id = NULL; + } + spin_unlock_irqrestore(&qp->lock, flags); + + vq_repbuf_free(c2dev, reply); + bail0: + vq_req_free(c2dev, vq_req); + return err; +} + +static int c2_alloc_qpn(struct c2_dev *c2dev, struct c2_qp *qp) +{ + int ret; + + do { + spin_lock_irq(&c2dev->qp_table.lock); + ret = idr_get_new_above(&c2dev->qp_table.idr, qp, + c2dev->qp_table.last++, &qp->qpn); + spin_unlock_irq(&c2dev->qp_table.lock); + } while ((ret == -EAGAIN) && + idr_pre_get(&c2dev->qp_table.idr, GFP_KERNEL)); + return ret; +} + +static void c2_free_qpn(struct c2_dev *c2dev, int qpn) +{ + spin_lock_irq(&c2dev->qp_table.lock); + idr_remove(&c2dev->qp_table.idr, qpn); + spin_unlock_irq(&c2dev->qp_table.lock); +} + +struct c2_qp *c2_find_qpn(struct c2_dev *c2dev, int qpn) +{ + unsigned long flags; + struct c2_qp *qp; + + spin_lock_irqsave(&c2dev->qp_table.lock, flags); + qp = idr_find(&c2dev->qp_table.idr, qpn); + spin_unlock_irqrestore(&c2dev->qp_table.lock, flags); + return qp; +} + +int c2_alloc_qp(struct c2_dev *c2dev, + struct c2_pd *pd, + struct ib_qp_init_attr *qp_attrs, struct c2_qp *qp) +{ + 
struct c2wr_qp_create_req wr; + struct c2wr_qp_create_rep *reply; + struct c2_vq_req *vq_req; + struct c2_cq *send_cq = to_c2cq(qp_attrs->send_cq); + struct c2_cq *recv_cq = to_c2cq(qp_attrs->recv_cq); + unsigned long peer_pa; + u32 q_size, msg_size, mmap_size; + void __iomem *mmap; + int err; + + err = c2_alloc_qpn(c2dev, qp); + if (err) + return err; + qp->ibqp.qp_num = qp->qpn; + qp->ibqp.qp_type = IB_QPT_RC; + + /* Allocate the SQ and RQ shared pointers */ + qp->sq_mq.shared = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &qp->sq_mq.shared_dma, GFP_KERNEL); + if (!qp->sq_mq.shared) { + err = -ENOMEM; + goto bail0; + } + + qp->rq_mq.shared = c2_alloc_mqsp(c2dev, c2dev->kern_mqsp_pool, + &qp->rq_mq.shared_dma, GFP_KERNEL); + if (!qp->rq_mq.shared) { + err = -ENOMEM; + goto bail1; + } + + /* Allocate the verbs request */ + vq_req = vq_req_alloc(c2dev); + if (vq_req == NULL) { + err = -ENOMEM; + goto bail2; + } + + /* Initialize the work request */ + memset(&wr, 0, sizeof(wr)); + c2_wr_set_id(&wr, CCWR_QP_CREATE); + wr.hdr.context = (unsigned long) vq_req; + wr.rnic_handle = c2dev->adapter_handle; + wr.sq_cq_handle = send_cq->adapter_handle; + wr.rq_cq_handle = recv_cq->adapter_handle; + wr.sq_depth = cpu_to_be32(qp_attrs->cap.max_send_wr + 1); + wr.rq_depth = cpu_to_be32(qp_attrs->cap.max_recv_wr + 1); + wr.srq_handle = 0; + wr.flags = cpu_to_be32(QP_RDMA_READ | QP_RDMA_WRITE | QP_MW_BIND | + QP_ZERO_STAG | QP_RDMA_READ_RESPONSE); + wr.send_sgl_depth = cpu_to_be32(qp_attrs->cap.max_send_sge); + wr.recv_sgl_depth = cpu_to_be32(qp_attrs->cap.max_recv_sge); + wr.rdma_write_sgl_depth = cpu_to_be32(qp_attrs->cap.max_send_sge); + wr.shared_sq_ht = cpu_to_be64(qp->sq_mq.shared_dma); + wr.shared_rq_ht = cpu_to_be64(qp->rq_mq.shared_dma); + wr.ord = cpu_to_be32(C2_MAX_ORD_PER_QP); + wr.ird = cpu_to_be32(C2_MAX_IRD_PER_QP); + wr.pd_id = pd->pd_id; + wr.user_context = (unsigned long) qp; + + vq_req_get(c2dev, vq_req); + + /* Send the WR to the adapter */ + err = 
vq_send_wr(c2dev, (union c2wr *) & wr); + if (err) { + vq_req_put(c2dev, vq_req); + goto bail3; + } + + /* Wait for the verb reply */ + err = vq_wait_for_reply(c2dev, vq_req); + if (err) { + goto bail3; + } + + /* Process the reply */ + reply = (struct c2wr_qp_create_rep *) (unsigned long) (vq_req->reply_msg); + if (!reply) { + err = -ENOMEM; + goto bail3; + } + + if ((err = c2_wr_get_result(reply)) != 0) { + goto bail4; + } + + /* Fill in the kernel QP struct */ + atomic_set(&qp->refcount, 1); + qp->adapter_handle = reply->qp_handle; + qp->state = IB_QPS_RESET; + qp->send_sgl_depth = qp_attrs->cap.max_send_sge; + qp->rdma_write_sgl_depth = qp_attrs->cap.max_send_sge; + qp->recv_sgl_depth = qp_attrs->cap.max_recv_sge; + + /* Initialize the SQ MQ */ + q_size = be32_to_cpu(reply->sq_depth); + msg_size = be32_to_cpu(reply->sq_msg_size); + peer_pa = c2dev->pa + be32_to_cpu(reply->sq_mq_start); + mmap_size = PAGE_ALIGN(sizeof(struct c2_mq_shared) + msg_size * q_size); + mmap = ioremap_nocache(peer_pa, mmap_size); + if (!mmap) { + err = -ENOMEM; + goto bail5; + } + + c2_mq_req_init(&qp->sq_mq, + be32_to_cpu(reply->sq_mq_index), + q_size, + msg_size, + mmap + sizeof(struct c2_mq_shared), /* pool start */ + mmap, /* peer */ + C2_MQ_ADAPTER_TARGET); + + /* Initialize the RQ mq */ + q_size = be32_to_cpu(reply->rq_depth); + msg_size = be32_to_cpu(reply->rq_msg_size); + peer_pa = c2dev->pa + be32_to_cpu(reply->rq_mq_start); + mmap_size = PAGE_ALIGN(sizeof(struct c2_mq_shared) + msg_size * q_size); + mmap = ioremap_nocache(peer_pa, mmap_size); + if (!mmap) { + err = -ENOMEM; + goto bail6; + } + + c2_mq_req_init(&qp->rq_mq, + be32_to_cpu(reply->rq_mq_index), + q_size, + msg_size, + mmap + sizeof(struct c2_mq_shared), /* pool start */ + mmap, /* peer */ + C2_MQ_ADAPTER_TARGET); + + vq_repbuf_free(c2dev, reply); + vq_req_free(c2dev, vq_req); + + return 0; + + bail6: + iounmap(qp->sq_mq.peer); + bail5: + destroy_qp(c2dev, qp); + bail4: + vq_repbuf_free(c2dev, reply); + bail3: + 
vq_req_free(c2dev, vq_req); + bail2: + c2_free_mqsp(qp->rq_mq.shared); + bail1: + c2_free_mqsp(qp->sq_mq.shared); + bail0: + c2_free_qpn(c2dev, qp->qpn); + return err; +} + +void c2_free_qp(struct c2_dev *c2dev, struct c2_qp *qp) +{ + struct c2_cq *send_cq; + struct c2_cq *recv_cq; + + send_cq = to_c2cq(qp->ibqp.send_cq); + recv_cq = to_c2cq(qp->ibqp.recv_cq); + + /* + * Lock CQs here, so that CQ polling code can do QP lookup + * without taking a lock. + */ + spin_lock_irq(&send_cq->lock); + if (send_cq != recv_cq) + spin_lock(&recv_cq->lock); + + c2_free_qpn(c2dev, qp->qpn); + + if (send_cq != recv_cq) + spin_unlock(&recv_cq->lock); + spin_unlock_irq(&send_cq->lock); + + /* + * Destroy qp in the rnic... + */ + destroy_qp(c2dev, qp); + + /* + * Mark any unreaped CQEs as null and void. + */ + c2_cq_clean(c2dev, qp, send_cq->cqn); + if (send_cq != recv_cq) + c2_cq_clean(c2dev, qp, recv_cq->cqn); + /* + * Unmap the MQs and return the shared pointers + * to the message pool. + */ + iounmap(qp->sq_mq.peer); + iounmap(qp->rq_mq.peer); + c2_free_mqsp(qp->sq_mq.shared); + c2_free_mqsp(qp->rq_mq.shared); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); +} + +/* + * Function: move_sgl + * + * Description: + * Move an SGL from the user's work request struct into a CCIL Work Request + * message, swapping to WR byte order and ensuring the total length doesn't + * overflow. + * + * IN: + * dst - ptr to CCIL Work Request message SGL memory. + * src - ptr to the consumer's SGL memory. + * + * OUT: none + * + * Return: + * CCIL status codes. + */ +static int +move_sgl(struct c2_data_addr * dst, struct ib_sge *src, int count, u32 * p_len, + u8 * actual_count) +{ + u32 tot = 0; /* running total */ + u8 acount = 0; /* running total non-0 len sge's */ + + while (count > 0) { + /* + * If the addition of this SGE causes the + * total SGL length to exceed 2^32-1, then + * fail-n-bail. 
+ * + * If the current total plus the next element length + * wraps, then it will go negative and be less than the + * current total... + */ + if ((tot + src->length) < tot) { + return -EINVAL; + } + /* + * Bug: 1456 (as well as 1498 & 1643) + * Skip over any sge's supplied with len=0 + */ + if (src->length) { + tot += src->length; + dst->stag = cpu_to_be32(src->lkey); + dst->to = cpu_to_be64(src->addr); + dst->length = cpu_to_be32(src->length); + dst++; + acount++; + } + src++; + count--; + } + + if (acount == 0) { + /* + * Bug: 1476 (as well as 1498, 1456 and 1643) + * Setup the SGL in the WR to make it easier for the RNIC. + * This way, the FW doesn't have to deal with special cases. + * Setting length=0 should be sufficient. + */ + dst->stag = 0; + dst->to = 0; + dst->length = 0; + } + + *p_len = tot; + *actual_count = acount; + return 0; +} + +/* + * Function: c2_activity (private function) + * + * Description: + * Post an mq index to the host->adapter activity fifo. + * + * IN: + * c2dev - ptr to c2dev structure + * mq_index - mq index to post + * shared - value most recently written to shared + * + * OUT: + * + * Return: + * none + */ +static inline void c2_activity(struct c2_dev *c2dev, u32 mq_index, u16 shared) +{ + /* + * First read the register to see if the FIFO is full, and if so, + * spin until it's not. This isn't perfect -- there is no + * synchronization among the clients of the register, but in + * practice it prevents multiple CPUs from hammering the bus + * with PCI RETRY. Note that when this does happen, the card + * cannot get on the bus and the card and system hang in a + * deadlock -- thus the need for this code. 
+ */ + while (readl(c2dev->regs + PCI_BAR0_ADAPTER_HINT) & 0x80000000) { + set_current_state(TASK_UNINTERRUPTIBLE); + schedule_timeout(0); + } + + __raw_writel(C2_HINT_MAKE(mq_index, shared), + c2dev->regs + PCI_BAR0_ADAPTER_HINT); +} + +/* + * Function: qp_wr_post + * + * Description: + * This in-line function allocates an MQ msg, then moves the host-copy of + * the completed WR into msg. Then it posts the message. + * + * IN: + * q - ptr to user MQ. + * wr - ptr to host-copy of the WR. + * qp - ptr to user qp + * size - Number of bytes to post. Assumed to be divisible by 4. + * + * OUT: none + * + * Return: + * CCIL status codes. + */ +static int qp_wr_post(struct c2_mq *q, union c2wr * wr, struct c2_qp *qp, u32 size) +{ + union c2wr *msg; + + msg = c2_mq_alloc(q); + if (msg == NULL) { + return -EINVAL; + } +#ifdef CCMSGMAGIC + ((c2wr_hdr_t *) wr)->magic = cpu_to_be32(CCWR_MAGIC); +#endif + + /* + * Since all header fields in the WR are the same as the + * CQE, set the following so the adapter need not. 
+ */ + c2_wr_set_result(wr, CCERR_PENDING); + + /* + * Copy the wr down to the adapter + */ + memcpy((void *) msg, (void *) wr, size); + + c2_mq_produce(q); + return 0; +} + + +int c2_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, + struct ib_send_wr **bad_wr) +{ + struct c2_dev *c2dev = to_c2dev(ibqp->device); + struct c2_qp *qp = to_c2qp(ibqp); + union c2wr wr; + int err = 0; + + u32 flags; + u32 tot_len; + u8 actual_sge_count; + u32 msg_size; + + if (qp->state > IB_QPS_RTS) + return -EINVAL; + + while (ib_wr) { + + flags = 0; + wr.sqwr.sq_hdr.user_hdr.hdr.context = ib_wr->wr_id; + if (ib_wr->send_flags & IB_SEND_SIGNALED) { + flags |= SQ_SIGNALED; + } + + switch (ib_wr->opcode) { + case IB_WR_SEND: + if (ib_wr->send_flags & IB_SEND_SOLICITED) { + c2_wr_set_id(&wr, C2_WR_TYPE_SEND_SE); + msg_size = sizeof(struct c2wr_send_req); + } else { + c2_wr_set_id(&wr, C2_WR_TYPE_SEND); + msg_size = sizeof(struct c2wr_send_req); + } + + wr.sqwr.send.remote_stag = 0; + msg_size += sizeof(struct c2_data_addr) * ib_wr->num_sge; + if (ib_wr->num_sge > qp->send_sgl_depth) { + err = -EINVAL; + break; + } + if (ib_wr->send_flags & IB_SEND_FENCE) { + flags |= SQ_READ_FENCE; + } + err = move_sgl((struct c2_data_addr *) & (wr.sqwr.send.data), + ib_wr->sg_list, + ib_wr->num_sge, + &tot_len, &actual_sge_count); + wr.sqwr.send.sge_len = cpu_to_be32(tot_len); + c2_wr_set_sge_count(&wr, actual_sge_count); + break; + case IB_WR_RDMA_WRITE: + c2_wr_set_id(&wr, C2_WR_TYPE_RDMA_WRITE); + msg_size = sizeof(struct c2wr_rdma_write_req) + + (sizeof(struct c2_data_addr) * ib_wr->num_sge); + if (ib_wr->num_sge > qp->rdma_write_sgl_depth) { + err = -EINVAL; + break; + } + if (ib_wr->send_flags & IB_SEND_FENCE) { + flags |= SQ_READ_FENCE; + } + wr.sqwr.rdma_write.remote_stag = + cpu_to_be32(ib_wr->wr.rdma.rkey); + wr.sqwr.rdma_write.remote_to = + cpu_to_be64(ib_wr->wr.rdma.remote_addr); + err = move_sgl((struct c2_data_addr *) + & (wr.sqwr.rdma_write.data), + ib_wr->sg_list, + 
ib_wr->num_sge, + &tot_len, &actual_sge_count); + wr.sqwr.rdma_write.sge_len = cpu_to_be32(tot_len); + c2_wr_set_sge_count(&wr, actual_sge_count); + break; + case IB_WR_RDMA_READ: + c2_wr_set_id(&wr, C2_WR_TYPE_RDMA_READ); + msg_size = sizeof(struct c2wr_rdma_read_req); + + /* iWARP only supports 1 SGE for RDMA reads */ + if (ib_wr->num_sge > 1) { + err = -EINVAL; + break; + } + + /* + * Move the local and remote stag/to/len into the WR. + */ + wr.sqwr.rdma_read.local_stag = + cpu_to_be32(ib_wr->sg_list->lkey); + wr.sqwr.rdma_read.local_to = + cpu_to_be64(ib_wr->sg_list->addr); + wr.sqwr.rdma_read.remote_stag = + cpu_to_be32(ib_wr->wr.rdma.rkey); + wr.sqwr.rdma_read.remote_to = + cpu_to_be64(ib_wr->wr.rdma.remote_addr); + wr.sqwr.rdma_read.length = + cpu_to_be32(ib_wr->sg_list->length); + break; + default: + /* error */ + msg_size = 0; + err = -EINVAL; + break; + } + + /* + * If we had an error on the last wr build, then + * break out. Possible errors include bogus WR + * type, and a bogus SGL length... + */ + if (err) { + break; + } + + /* + * Store flags + */ + c2_wr_set_flags(&wr, flags); + + /* + * Post the puppy! + */ + err = qp_wr_post(&qp->sq_mq, &wr, qp, msg_size); + if (err) { + break; + } + + /* + * Enqueue mq index to activity FIFO. 
+ */ + c2_activity(c2dev, qp->sq_mq.index, qp->sq_mq.hint_count); + + ib_wr = ib_wr->next; + } + + if (err) + *bad_wr = ib_wr; + return err; +} + +int c2_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr, + struct ib_recv_wr **bad_wr) +{ + struct c2_dev *c2dev = to_c2dev(ibqp->device); + struct c2_qp *qp = to_c2qp(ibqp); + union c2wr wr; + int err = 0; + + if (qp->state > IB_QPS_RTS) + return -EINVAL; + + /* + * Try and post each work request + */ + while (ib_wr) { + u32 tot_len; + u8 actual_sge_count; + + if (ib_wr->num_sge > qp->recv_sgl_depth) { + err = -EINVAL; + break; + } + + /* + * Create local host-copy of the WR + */ + wr.rqwr.rq_hdr.user_hdr.hdr.context = ib_wr->wr_id; + c2_wr_set_id(&wr, CCWR_RECV); + c2_wr_set_flags(&wr, 0); + + /* sge_count is limited to eight bits. */ + BUG_ON(ib_wr->num_sge >= 256); + err = move_sgl((struct c2_data_addr *) & (wr.rqwr.data), + ib_wr->sg_list, + ib_wr->num_sge, &tot_len, &actual_sge_count); + c2_wr_set_sge_count(&wr, actual_sge_count); + + /* + * If we had an error on the last wr build, then + * break out. Possible errors include bogus WR + * type, and a bogus SGL length... + */ + if (err) { + break; + } + + err = qp_wr_post(&qp->rq_mq, &wr, qp, qp->rq_mq.msg_size); + if (err) { + break; + } + + /* + * Enqueue mq index to activity FIFO + */ + c2_activity(c2dev, qp->rq_mq.index, qp->rq_mq.hint_count); + + ib_wr = ib_wr->next; + } + + if (err) + *bad_wr = ib_wr; + return err; +} + +void __devinit c2_init_qp_table(struct c2_dev *c2dev) +{ + spin_lock_init(&c2dev->qp_table.lock); + idr_init(&c2dev->qp_table.idr); +} + +void __devexit c2_cleanup_qp_table(struct c2_dev *c2dev) +{ + idr_destroy(&c2dev->qp_table.idr); +} diff --git a/drivers/infiniband/hw/amso1100/c2_user.h b/drivers/infiniband/hw/amso1100/c2_user.h new file mode 100644 index 0000000..7e9e7ad --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_user.h @@ -0,0 +1,82 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. 
+ * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef C2_USER_H +#define C2_USER_H + +#include <linux/types.h> + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + * In particular do not use pointer types -- pass pointers in __u64 + * instead. 
+ */ + +struct c2_alloc_ucontext_resp { + __u32 qp_tab_size; + __u32 uarc_size; +}; + +struct c2_alloc_pd_resp { + __u32 pdn; + __u32 reserved; +}; + +struct c2_create_cq { + __u32 lkey; + __u32 pdn; + __u64 arm_db_page; + __u64 set_db_page; + __u32 arm_db_index; + __u32 set_db_index; +}; + +struct c2_create_cq_resp { + __u32 cqn; + __u32 reserved; +}; + +struct c2_create_qp { + __u32 lkey; + __u32 reserved; + __u64 sq_db_page; + __u64 rq_db_page; + __u32 sq_db_index; + __u32 rq_db_index; +}; + +#endif /* C2_USER_H */ From swise at opengridcomputing.com Tue Jun 20 13:31:16 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:31:16 -0500 Subject: [openib-general] [PATCH v3 5/7] AMSO1100 Message Queues. In-Reply-To: <20060620203050.31536.5341.stgit@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> Message-ID: <20060620203116.31536.27965.stgit@stevo-desktop> V2 Review Changes: - correctly map host memory for DMA (don't use __pa()). V1 Review Changes: - remove useless asserts - assert() -> BUG_ON() - C2_DEBUG -> DEBUG --- drivers/infiniband/hw/amso1100/c2_mq.c | 175 ++++++++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_mq.h | 107 ++++++++++++++++++++ 2 files changed, 282 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_mq.c b/drivers/infiniband/hw/amso1100/c2_mq.c new file mode 100644 index 0000000..96bbe9a --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_mq.c @@ -0,0 +1,175 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "c2.h" +#include "c2_mq.h" + +void *c2_mq_alloc(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_ADAPTER_TARGET); + + if (c2_mq_full(q)) { + return NULL; + } else { +#ifdef DEBUG + struct c2wr_hdr *m = + (struct c2wr_hdr *) (q->msg_pool.host + q->priv * q->msg_size); +#ifdef CCMSGMAGIC + BUG_ON(m->magic != be32_to_cpu(~CCWR_MAGIC)); + m->magic = cpu_to_be32(CCWR_MAGIC); +#endif + return m; +#else + return q->msg_pool.host + q->priv * q->msg_size; +#endif + } +} + +void c2_mq_produce(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_ADAPTER_TARGET); + + if (!c2_mq_full(q)) { + q->priv = (q->priv + 1) % q->q_size; + q->hint_count++; + /* Update peer's offset. 
*/ + __raw_writew(cpu_to_be16(q->priv), &q->peer->shared); + } +} + +void *c2_mq_consume(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_HOST_TARGET); + + if (c2_mq_empty(q)) { + return NULL; + } else { +#ifdef DEBUG + struct c2wr_hdr *m = (struct c2wr_hdr *) + (q->msg_pool.host + q->priv * q->msg_size); +#ifdef CCMSGMAGIC + BUG_ON(m->magic != be32_to_cpu(CCWR_MAGIC)); +#endif + return m; +#else + return q->msg_pool.host + q->priv * q->msg_size; +#endif + } +} + +void c2_mq_free(struct c2_mq *q) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_HOST_TARGET); + + if (!c2_mq_empty(q)) { + +#ifdef CCMSGMAGIC + { + struct c2wr_hdr __iomem *m = (struct c2wr_hdr __iomem *) + (q->msg_pool.adapter + q->priv * q->msg_size); + __raw_writel(cpu_to_be32(~CCWR_MAGIC), &m->magic); + } +#endif + q->priv = (q->priv + 1) % q->q_size; + /* Update peer's offset. */ + __raw_writew(cpu_to_be16(q->priv), &q->peer->shared); + } +} + + +void c2_mq_lconsume(struct c2_mq *q, u32 wqe_count) +{ + BUG_ON(q->magic != C2_MQ_MAGIC); + BUG_ON(q->type != C2_MQ_ADAPTER_TARGET); + + while (wqe_count--) { + BUG_ON(c2_mq_empty(q)); + *q->shared = cpu_to_be16((be16_to_cpu(*q->shared)+1) % q->q_size); + } +} + + +u32 c2_mq_count(struct c2_mq *q) +{ + s32 count; + + if (q->type == C2_MQ_HOST_TARGET) { + count = be16_to_cpu(*q->shared) - q->priv; + } else { + count = q->priv - be16_to_cpu(*q->shared); + } + + if (count < 0) { + count += q->q_size; + } + + return (u32) count; +} + +void c2_mq_req_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 __iomem *pool_start, u16 __iomem *peer, u32 type) +{ + BUG_ON(!q->shared); + + /* This code assumes the byte swapping has already been done! 
*/ + q->index = index; + q->q_size = q_size; + q->msg_size = msg_size; + q->msg_pool.adapter = pool_start; + q->peer = (struct c2_mq_shared __iomem *) peer; + q->magic = C2_MQ_MAGIC; + q->type = type; + q->priv = 0; + q->hint_count = 0; + return; +} +void c2_mq_rep_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 *pool_start, u16 __iomem *peer, u32 type) +{ + BUG_ON(!q->shared); + + /* This code assumes the byte swapping has already been done! */ + q->index = index; + q->q_size = q_size; + q->msg_size = msg_size; + q->msg_pool.host = pool_start; + q->peer = (struct c2_mq_shared __iomem *) peer; + q->magic = C2_MQ_MAGIC; + q->type = type; + q->priv = 0; + q->hint_count = 0; + return; +} diff --git a/drivers/infiniband/hw/amso1100/c2_mq.h b/drivers/infiniband/hw/amso1100/c2_mq.h new file mode 100644 index 0000000..9b1296e --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_mq.h @@ -0,0 +1,107 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef _C2_MQ_H_ +#define _C2_MQ_H_ +#include +#include +#include "c2_wr.h" + +enum c2_shared_regs { + + C2_SHARED_ARMED = 0x10, + C2_SHARED_NOTIFY = 0x18, + C2_SHARED_SHARED = 0x40, +}; + +struct c2_mq_shared { + u16 unused1; + u8 armed; + u8 notification_type; + u32 unused2; + u16 shared; + /* Pad to 64 bytes. */ + u8 pad[64 - sizeof(u16) - 2 * sizeof(u8) - sizeof(u32) - sizeof(u16)]; +}; + +enum c2_mq_type { + C2_MQ_HOST_TARGET = 1, + C2_MQ_ADAPTER_TARGET = 2, +}; + +/* + * c2_mq_t is for kernel-mode MQs like the VQs and the AEQ. + * c2_user_mq_t (which is the same format) is for user-mode MQs... 
+ */ +#define C2_MQ_MAGIC 0x4d512020 /* 'MQ ' */ +struct c2_mq { + u32 magic; + union { + u8 *host; + u8 __iomem *adapter; + } msg_pool; + dma_addr_t host_dma; + DECLARE_PCI_UNMAP_ADDR(mapping); + u16 hint_count; + u16 priv; + struct c2_mq_shared __iomem *peer; + u16 *shared; + dma_addr_t shared_dma; + u32 q_size; + u32 msg_size; + u32 index; + enum c2_mq_type type; +}; + +static __inline__ int c2_mq_empty(struct c2_mq *q) +{ + return q->priv == be16_to_cpu(*q->shared); +} + +static __inline__ int c2_mq_full(struct c2_mq *q) +{ + return q->priv == (be16_to_cpu(*q->shared) + q->q_size - 1) % q->q_size; +} + +extern void c2_mq_lconsume(struct c2_mq *q, u32 wqe_count); +extern void *c2_mq_alloc(struct c2_mq *q); +extern void c2_mq_produce(struct c2_mq *q); +extern void *c2_mq_consume(struct c2_mq *q); +extern void c2_mq_free(struct c2_mq *q); +extern u32 c2_mq_count(struct c2_mq *q); +extern void c2_mq_req_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 __iomem *pool_start, u16 __iomem *peer, u32 type); +extern void c2_mq_rep_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size, + u8 *pool_start, u16 __iomem *peer, u32 type); + +#endif /* _C2_MQ_H_ */ From swise at opengridcomputing.com Tue Jun 20 13:31:21 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 Jun 2006 15:31:21 -0500 Subject: [openib-general] [PATCH v3 6/7] AMSO1100: Privileged Verbs Queues. 
In-Reply-To: <20060620203050.31536.5341.stgit@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> Message-ID: <20060620203121.31536.73315.stgit@stevo-desktop> Review Changes: dprintk() -> pr_debug() --- drivers/infiniband/hw/amso1100/c2_vq.c | 260 ++++++++++++++++++++++++++++++++ drivers/infiniband/hw/amso1100/c2_vq.h | 63 ++++++++ 2 files changed, 323 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_vq.c b/drivers/infiniband/hw/amso1100/c2_vq.c new file mode 100644 index 0000000..445b1ed --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_vq.c @@ -0,0 +1,260 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include + +#include "c2_vq.h" +#include "c2_provider.h" + +/* + * Verbs Request Objects: + * + * VQ Request Objects are allocated by the kernel verbs handlers. + * They contain a wait object, a refcnt, an atomic bool indicating that the + * adapter has replied, and a copy of the verb reply work request. + * A pointer to the VQ Request Object is passed down in the context + * field of the work request message, and reflected back by the adapter + * in the verbs reply message. The function handle_vq() in the interrupt + * path will use this pointer to: + * 1) append a copy of the verbs reply message + * 2) mark that the reply is ready + * 3) wake up the kernel verbs handler blocked awaiting the reply. + * + * + * The kernel verbs handlers do a "get" to put a 2nd reference on the + * VQ Request object. If the kernel verbs handler exits before the adapter + * can respond, this extra reference will keep the VQ Request object around + * until the adapter's reply can be processed. The reason we need this is + * because a pointer to this object is stuffed into the context field of + * the verbs work request message, and reflected back in the reply message. + * It is used in the interrupt handler (handle_vq()) to wake up the appropriate + * kernel verb handler that is blocked awaiting the verb reply. + * So handle_vq() will do a "put" on the object when it's done accessing it. + * NOTE: If we guarantee that the kernel verb handler will never bail before + * getting the reply, then we don't need these refcnts. + * + * + * VQ Request objects are freed by the kernel verbs handlers only + * after the verb has been processed, or when the adapter fails and + * does not reply. 
+ * + * + * Verbs Reply Buffers: + * + * VQ Reply bufs are local host memory copies of an + * outstanding Verb Request reply + * message. They are always allocated by the kernel verbs handlers, and _may_ be + * freed by either the kernel verbs handler -or- the interrupt handler. The + * kernel verbs handler _must_ free the repbuf, then free the vq request object + * in that order. + */ + +int vq_init(struct c2_dev *c2dev) +{ + sprintf(c2dev->vq_cache_name, "c2-vq:dev%c", + (char) ('0' + c2dev->devnum)); + c2dev->host_msg_cache = + kmem_cache_create(c2dev->vq_cache_name, c2dev->rep_vq.msg_size, 0, + SLAB_HWCACHE_ALIGN, NULL, NULL); + if (c2dev->host_msg_cache == NULL) { + return -ENOMEM; + } + return 0; +} + +void vq_term(struct c2_dev *c2dev) +{ + kmem_cache_destroy(c2dev->host_msg_cache); +} + +/* vq_req_alloc - allocate a VQ Request Object and initialize it. + * The refcnt is set to 1. + */ +struct c2_vq_req *vq_req_alloc(struct c2_dev *c2dev) +{ + struct c2_vq_req *r; + + r = kmalloc(sizeof(struct c2_vq_req), GFP_KERNEL); + if (r) { + init_waitqueue_head(&r->wait_object); + r->reply_msg = (u64) NULL; + r->event = 0; + r->cm_id = NULL; + r->qp = NULL; + atomic_set(&r->refcnt, 1); + atomic_set(&r->reply_ready, 0); + } + return r; +} + + +/* vq_req_free - free the VQ Request Object. It is assumed the verbs handler + * has already freed the VQ Reply Buffer if it existed. + */ +void vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *r) +{ + r->reply_msg = (u64) NULL; + if (atomic_dec_and_test(&r->refcnt)) { + kfree(r); + } +} + +/* vq_req_get - reference a VQ Request Object. Done + * only in the kernel verbs handlers. + */ +void vq_req_get(struct c2_dev *c2dev, struct c2_vq_req *r) +{ + atomic_inc(&r->refcnt); +} + + +/* vq_req_put - dereference and potentially free a VQ Request Object. + * + * This is only called by handle_vq() on the + * interrupt when it is done processing + * a verb reply message. 
If the associated + * kernel verbs handler has already bailed, + * then this put will actually free the VQ + * Request object _and_ the VQ Reply Buffer + * if it exists. + */ +void vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *r) +{ + if (atomic_dec_and_test(&r->refcnt)) { + if (r->reply_msg != (u64) NULL) + vq_repbuf_free(c2dev, + (void *) (unsigned long) r->reply_msg); + kfree(r); + } +} + + +/* + * vq_repbuf_alloc - allocate a VQ Reply Buffer. + */ +void *vq_repbuf_alloc(struct c2_dev *c2dev) +{ + return kmem_cache_alloc(c2dev->host_msg_cache, SLAB_ATOMIC); +} + +/* + * vq_send_wr - post a verbs request message to the Verbs Request Queue. + * If a message is not available in the MQ, then block until one is available. + * NOTE: handle_mq() on the interrupt context will wake up threads blocked here. + * When the adapter drains the Verbs Request Queue, + * it inserts MQ index 0 into the + * adapter->host activity fifo and interrupts the host. + */ +int vq_send_wr(struct c2_dev *c2dev, union c2wr *wr) +{ + void *msg; + wait_queue_t __wait; + + /* + * grab adapter vq lock + */ + spin_lock(&c2dev->vqlock); + + /* + * allocate msg + */ + msg = c2_mq_alloc(&c2dev->req_vq); + + /* + * If we cannot get a msg, then we'll wait. + * When messages are available, the int handler will wake_up() + * any waiters. + */ + while (msg == NULL) { + pr_debug("%s:%d no available msg in VQ, waiting...\n", + __FUNCTION__, __LINE__); + init_waitqueue_entry(&__wait, current); + add_wait_queue(&c2dev->req_vq_wo, &__wait); + spin_unlock(&c2dev->vqlock); + for (;;) { + set_current_state(TASK_INTERRUPTIBLE); + if (!c2_mq_full(&c2dev->req_vq)) { + break; + } + if (!signal_pending(current)) { + schedule_timeout(1 * HZ); /* 1 second... 
*/ + continue; + } + set_current_state(TASK_RUNNING); + remove_wait_queue(&c2dev->req_vq_wo, &__wait); + return -EINTR; + } + set_current_state(TASK_RUNNING); + remove_wait_queue(&c2dev->req_vq_wo, &__wait); + spin_lock(&c2dev->vqlock); + msg = c2_mq_alloc(&c2dev->req_vq); + } + + /* + * copy wr into adapter msg + */ + memcpy(msg, wr, c2dev->req_vq.msg_size); + + /* + * post msg + */ + c2_mq_produce(&c2dev->req_vq); + + /* + * release adapter vq lock + */ + spin_unlock(&c2dev->vqlock); + return 0; +} + + +/* + * vq_wait_for_reply - block until the adapter posts a Verb Reply Message. + */ +int vq_wait_for_reply(struct c2_dev *c2dev, struct c2_vq_req *req) +{ + if (!wait_event_timeout(req->wait_object, + atomic_read(&req->reply_ready), + 60*HZ)) + return -ETIMEDOUT; + + return 0; +} + +/* + * vq_repbuf_free - Free a Verbs Reply Buffer. + */ +void vq_repbuf_free(struct c2_dev *c2dev, void *reply) +{ + kmem_cache_free(c2dev->host_msg_cache, reply); +} diff --git a/drivers/infiniband/hw/amso1100/c2_vq.h b/drivers/infiniband/hw/amso1100/c2_vq.h new file mode 100644 index 0000000..3380562 --- /dev/null +++ b/drivers/infiniband/hw/amso1100/c2_vq.h @@ -0,0 +1,63 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef _C2_VQ_H_
+#define _C2_VQ_H_
+#include
+#include "c2.h"
+#include "c2_wr.h"
+#include "c2_provider.h"
+
+struct c2_vq_req {
+	u64 reply_msg;			/* ptr to reply msg */
+	wait_queue_head_t wait_object;	/* wait object for vq reqs */
+	atomic_t reply_ready;		/* set when reply is ready */
+	atomic_t refcnt;		/* used to cancel WRs... */
+	int event;
+	struct iw_cm_id *cm_id;
+	struct c2_qp *qp;
+};
+
+extern int vq_init(struct c2_dev *c2dev);
+extern void vq_term(struct c2_dev *c2dev);
+
+extern struct c2_vq_req *vq_req_alloc(struct c2_dev *c2dev);
+extern void vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *req);
+extern void vq_req_get(struct c2_dev *c2dev, struct c2_vq_req *req);
+extern void vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *req);
+extern int vq_send_wr(struct c2_dev *c2dev, union c2wr * wr);
+
+extern void *vq_repbuf_alloc(struct c2_dev *c2dev);
+extern void vq_repbuf_free(struct c2_dev *c2dev, void *reply);
+
+extern int vq_wait_for_reply(struct c2_dev *c2dev, struct c2_vq_req *req);
+#endif /* _C2_VQ_H_ */

From swise at opengridcomputing.com  Tue Jun 20 13:31:26 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 20 Jun 2006 15:31:26 -0500
Subject: [openib-general] [PATCH v3 7/7] AMSO1100 Makefiles and Kconfig changes.
In-Reply-To: <20060620203050.31536.5341.stgit@stevo-desktop>
References: <20060620203050.31536.5341.stgit@stevo-desktop>
Message-ID: <20060620203126.31536.78501.stgit@stevo-desktop>

Review Changes:

- C2DEBUG -> DEBUG

---

 drivers/infiniband/Kconfig             |    1 +
 drivers/infiniband/Makefile            |    1 +
 drivers/infiniband/hw/amso1100/Kbuild  |   10 ++++++++++
 drivers/infiniband/hw/amso1100/Kconfig |   15 +++++++++++++++
 drivers/infiniband/hw/amso1100/README  |   11 +++++++++++
 5 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index ba2d650..04e6d4f 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -36,6 +36,7 @@ config INFINIBAND_ADDR_TRANS
 source "drivers/infiniband/hw/mthca/Kconfig"
 source "drivers/infiniband/hw/ipath/Kconfig"
+source "drivers/infiniband/hw/amso1100/Kconfig"
 source "drivers/infiniband/ulp/ipoib/Kconfig"

diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile
index eea2732..e2b93f9 100644
--- a/drivers/infiniband/Makefile
+++ b/drivers/infiniband/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_INFINIBAND) += core/
 obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/
 obj-$(CONFIG_IPATH_CORE) += hw/ipath/
+obj-$(CONFIG_INFINIBAND_AMSO1100) += hw/amso1100/
 obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/
 obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/

diff --git a/drivers/infiniband/hw/amso1100/Kbuild b/drivers/infiniband/hw/amso1100/Kbuild
new file mode 100644
index 0000000..e1f10ab
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/Kbuild
@@ -0,0 +1,10 @@
+EXTRA_CFLAGS += -Idrivers/infiniband/include
+
+ifdef CONFIG_INFINIBAND_AMSO1100_DEBUG
+EXTRA_CFLAGS += -DDEBUG
+endif
+
+obj-$(CONFIG_INFINIBAND_AMSO1100) += iw_c2.o
+
+iw_c2-y := c2.o c2_provider.o c2_rnic.o c2_alloc.o c2_mq.o c2_ae.o c2_vq.o \
+	c2_intr.o c2_cq.o c2_qp.o c2_cm.o c2_mm.o c2_pd.o

diff --git a/drivers/infiniband/hw/amso1100/Kconfig b/drivers/infiniband/hw/amso1100/Kconfig
new file mode 100644
index 0000000..809cb14
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/Kconfig
@@ -0,0 +1,15 @@
+config INFINIBAND_AMSO1100
+	tristate "Ammasso 1100 HCA support"
+	depends on PCI && INET && INFINIBAND
+	---help---
+	  This is a low-level driver for the Ammasso 1100 host
+	  channel adapter (HCA).
+
+config INFINIBAND_AMSO1100_DEBUG
+	bool "Verbose debugging output"
+	depends on INFINIBAND_AMSO1100
+	default n
+	---help---
+	  This option causes the amso1100 driver to produce a bunch of
+	  debug messages. Select this if you are developing the driver
+	  or trying to diagnose a problem.

diff --git a/drivers/infiniband/hw/amso1100/README b/drivers/infiniband/hw/amso1100/README
new file mode 100644
index 0000000..1331353
--- /dev/null
+++ b/drivers/infiniband/hw/amso1100/README
@@ -0,0 +1,11 @@
+This is the OpenFabrics provider driver for the
+AMSO1100 1Gb RNIC adapter.
+
+This adapter is available in limited quantities
+for development purposes from Open Grid Computing.
+
+This driver requires the IWCM and CMA mods necessary
+to support iWARP.
+ +Contact tom at opengridcomputing.com for more information. + From arjan at infradead.org Tue Jun 20 13:43:46 2006 From: arjan at infradead.org (Arjan van de Ven) Date: Tue, 20 Jun 2006 22:43:46 +0200 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. In-Reply-To: <20060620203055.31536.15131.stgit@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> Message-ID: <1150836226.2891.231.camel@laptopd505.fenrus.org> On Tue, 2006-06-20 at 15:30 -0500, Steve Wise wrote: > +/* > + * Allocate TX ring elements and chain them together. > + * One-to-one association of adapter descriptors with ring elements. > + */ > +static int c2_tx_ring_alloc(struct c2_ring *tx_ring, void *vaddr, > + dma_addr_t base, void __iomem * mmio_txp_ring) > +{ > + struct c2_tx_desc *tx_desc; > + struct c2_txp_desc __iomem *txp_desc; > + struct c2_element *elem; > + int i; > + > + tx_ring->start = kmalloc(sizeof(*elem) * tx_ring->count, GFP_KERNEL); I would think this needs a dma_alloc_coherent() rather than a kmalloc... > + > +/* Free all buffers in RX ring, assumes receiver stopped */ > +static void c2_rx_clean(struct c2_port *c2_port) > +{ > + struct c2_dev *c2dev = c2_port->c2dev; > + struct c2_ring *rx_ring = &c2_port->rx_ring; > + struct c2_element *elem; > + struct c2_rx_desc *rx_desc; > + > + elem = rx_ring->start; > + do { > + rx_desc = elem->ht_desc; > + rx_desc->len = 0; > + > + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); > + __raw_writew(0, elem->hw_desc + C2_RXP_COUNT); > + __raw_writew(0, elem->hw_desc + C2_RXP_LEN); you seem to be a fan of the __raw_write() functions... any reason why? __raw_ is not a magic "go faster" prefix.... Also on a related note, have you checked the driver for the needed PCI posting flushes? > + > + /* Disable IRQs by clearing the interrupt mask */ > + writel(1, c2dev->regs + C2_IDIS); > + writel(0, c2dev->regs + C2_NIMR0); like here... 
> + > + elem = tx_ring->to_use; > + elem->skb = skb; > + elem->mapaddr = mapaddr; > + elem->maplen = maplen; > + > + /* Tell HW to xmit */ > + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_TXP_ADDR); > + __raw_writew(cpu_to_be16(maplen), elem->hw_desc + C2_TXP_LEN); > + __raw_writew(cpu_to_be16(TXP_HTXD_READY), elem->hw_desc + C2_TXP_FLAGS); or here > +static int c2_change_mtu(struct net_device *netdev, int new_mtu) > +{ > + int ret = 0; > + > + if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU) > + return -EINVAL; > + > + netdev->mtu = new_mtu; > + > + if (netif_running(netdev)) { > + c2_down(netdev); > + > + c2_up(netdev); > + } this looks odd... From rdreier at cisco.com Tue Jun 20 15:27:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 Jun 2006 15:27:45 -0700 Subject: [openib-general] [PATCH v2 1/2] iWARP changes to libibverbs. In-Reply-To: <20060620200308.20092.76324.stgit@stevo-desktop> (Steve Wise's message of "Tue, 20 Jun 2006 15:03:08 -0500") References: <20060620200304.20092.44110.stgit@stevo-desktop> <20060620200308.20092.76324.stgit@stevo-desktop> Message-ID: Looks pretty good. I'll get this into the libibverbs development tree soon (I'm working on the MADV_DONTFORK stuff right now). - R. From sean.hefty at intel.com Tue Jun 20 16:27:05 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 20 Jun 2006 16:27:05 -0700 Subject: [openib-general] ib_gid lookup In-Reply-To: Message-ID: <000001c694c1$0c963f50$36781cac@amr.corp.intel.com> > i'm trying to find whether i can do a lookup of ib_gid by either >node name or node's ip address. is this information available from >the subnet manager? A lookup is done from IP address to GID using the address translation module (ib_addr). This functionality is exposed to userspace through the rdma_cm (resolve_addr routine). 
- Sean From eitan at mellanox.co.il Tue Jun 20 23:10:10 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 21 Jun 2006 09:10:10 +0300 Subject: [openib-general] [PATCHv5] osm: partition manager force policy In-Reply-To: <1150808795.4391.118133.camel@hal.voltaire.com> References: <86d5d5ge54.fsf@mtl066.yok.mtl.com> <1150808795.4391.118133.camel@hal.voltaire.com> Message-ID: <4498E2C2.2080906@mellanox.co.il> Hi Hal, Thanks for applying the patch. Regarding the issues: Hal Rosenstock wrote: >>+ >>+ CL_ASSERT( p_pkey_tbl ); > > > Should the other routines also assert on this or should this be > consistent with the others ? Yes, it should be consistent. Normally I add assertions on OUT parameters so that a "misuse" is caught. The idea is that parameters provided by reference are more likely to be passed by mistake as NULL. So I would remove the assert on p_pkey_tbl. > > >>+ CL_ASSERT( p_block_idx != NULL ); >>+ CL_ASSERT( p_pkey_idx != NULL ); > > > There is no p_pkey_idx parameter. I presume this should be p_pkey_index. Oops - this means the code will not compile in debug mode! I see you fixed that. > > > Also, two things about osm_pkey_mgr.c: > > Was there a need to reorder the routines ? This broke the diff so it had > to be done largely by hand. I reordered them to be defined in the order used. I already agreed with Sasha that I should have done that in a separate patch. > > Also, it would have been nice not to mix the format changes with the > substantive changes. Try to keep it to "one thought per patch". OK. > > This patch has been applied with cosmetic changes. We will go from > here... 
Thanks Eitan From halr at voltaire.com Wed Jun 21 03:53:48 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Jun 2006 06:53:48 -0400 Subject: [openib-general] [PATCHv5] osm: partition manager force policy In-Reply-To: <4498E2C2.2080906@mellanox.co.il> References: <86d5d5ge54.fsf@mtl066.yok.mtl.com> <1150808795.4391.118133.camel@hal.voltaire.com> <4498E2C2.2080906@mellanox.co.il> Message-ID: <1150887225.4391.167876.camel@hal.voltaire.com> Hi Eitan, On Wed, 2006-06-21 at 02:10, Eitan Zahavi wrote: > Hi Hal, > > Thanks for applying the patch. > > Regarding the issues : > > Hal Rosenstock wrote: > >>+ > >>+ CL_ASSERT( p_pkey_tbl ); > > > > > > Should the other routines also assert on this or should this be > > consistent with the others ? > Yes it should b consistent. > Normally I add assertion on OUT parameters such that a "misuse" is caught. > The idea is that parameters provided by reference are more likely to be passed > by mistake as NULL. > So I would remove the assert on p_key_tbl. p_pkey_tbl is a pointer so wouldn't that rule apply ? I do notice that in the particular usage in osm_pkey_mgr.c it would already get caught by the assert in osm_physp_get_mod_pkey_tbl. > >>+ CL_ASSERT( p_block_idx != NULL ); > >>+ CL_ASSERT( p_pkey_idx != NULL ); > > > > > > There is no p_pkey_idx parameter. I presume this should be p_pkey_index. > Ooops - this means the code will not compile in debug mode ! > I see you fixed that. > > > > > > Also, two things about osm_pkey_mgr.c: > > > > Was there a need to reorder the routines ? This broke the diff so it had > > to be done largely by hand. > I reordered to to be defined in the order used. > Already agree with Sasha that I should have done that on separate patch. I thought since the patch was reissued several times this comment would have been addressed. -- Hal > > Also, it would have been nice not to mix the format changes with the > > substantive changes. Try to keep it to "one thought per patch". > OK. 
> > > > This patch has been applied with cosmetic changes. We will go from > > here... > Thanks > > Eitan From halr at voltaire.com Wed Jun 21 04:00:10 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Jun 2006 07:00:10 -0400 Subject: [openib-general] [PATCH] osm: add release notes to doc dir In-Reply-To: <86ac87gdxy.fsf@mtl066.yok.mtl.com> References: <86ac87gdxy.fsf@mtl066.yok.mtl.com> Message-ID: <1150887610.4391.168097.camel@hal.voltaire.com> Hi Eitan, On Wed, 2006-06-21 at 03:34, Eitan Zahavi wrote: > Hi Hal > > Following the OFED 1.0 release I think it will be very handy to > user to see OpenSM release notes accumulate in the doc dir. Sure; that seems reasonable. > As release notes always refer back to old releases - by not specifying > all features just new ones - I propose a file naming scheme that is > based on the release date. I think a naming scheme based on the OpenSM version is clearer. So the OFED 1.0 release was openib-1.2.1. > This patch adds the two latest releases notes. > One from Jan 2006 and one from Jun 2006. What was the Jan 2006 release ? -- Hal From sashak at voltaire.com Wed Jun 21 05:49:27 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 21 Jun 2006 15:49:27 +0300 Subject: [openib-general] [PATCH] opensm: add check for not intialized pkey blocks Message-ID: <20060621124927.GA24726@sashak.voltaire.com> Hi Hal, The lower block of pkey tables' 'blocks' vector may be not initialized due to lost MADs. We need to check it for NULL. Some duplicated code removal as well. 
Signed-off-by: Sasha Khapyorsky --- osm/opensm/osm_pkey.c | 33 +++++++++++---------------------- 1 files changed, 11 insertions(+), 22 deletions(-) diff --git a/osm/opensm/osm_pkey.c b/osm/opensm/osm_pkey.c index 8166c90..caefe18 100644 --- a/osm/opensm/osm_pkey.c +++ b/osm/opensm/osm_pkey.c @@ -76,16 +76,19 @@ void osm_pkey_tbl_construct( void osm_pkey_tbl_destroy( IN osm_pkey_tbl_t *p_pkey_tbl) { + ib_pkey_table_t *p_block; uint16_t num_blocks, i; num_blocks = (uint16_t)(cl_ptr_vector_get_size( &p_pkey_tbl->blocks )); for (i = 0; i < num_blocks; i++) - free(cl_ptr_vector_get( &p_pkey_tbl->blocks, i )); + if ((p_block = cl_ptr_vector_get( &p_pkey_tbl->blocks, i ))) + free(p_block); cl_ptr_vector_destroy( &p_pkey_tbl->blocks ); num_blocks = (uint16_t)(cl_ptr_vector_get_size( &p_pkey_tbl->new_blocks )); for (i = 0; i < num_blocks; i++) - free(cl_ptr_vector_get( &p_pkey_tbl->new_blocks, i )); + if ((p_block = cl_ptr_vector_get( &p_pkey_tbl->new_blocks, i ))) + free(p_block); cl_ptr_vector_destroy( &p_pkey_tbl->new_blocks ); cl_map_remove_all( &p_pkey_tbl->keys ); @@ -112,26 +115,12 @@ osm_pkey_tbl_init( void osm_pkey_tbl_init_new_blocks( IN const osm_pkey_tbl_t *p_pkey_tbl) { - ib_pkey_table_t *p_block, *p_new_block; - int16_t b, num_blocks, new_blocks; + ib_pkey_table_t *p_block; + int16_t b, num_blocks = cl_ptr_vector_get_size(&p_pkey_tbl->new_blocks); - num_blocks = cl_ptr_vector_get_size(&p_pkey_tbl->blocks); - new_blocks = cl_ptr_vector_get_size(&p_pkey_tbl->new_blocks); - - for (b = 0; b < num_blocks; b++) { - p_block = cl_ptr_vector_get(&p_pkey_tbl->blocks, b); - if ( b < new_blocks ) - p_new_block = cl_ptr_vector_get(&p_pkey_tbl->new_blocks, b); - else - { - p_new_block = (ib_pkey_table_t *)malloc(sizeof(*p_new_block)); - if (!p_new_block) - break; - cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, - b, p_new_block); - } - memset(p_new_block, 0, sizeof(*p_new_block)); - } + for (b = 0; b < num_blocks; b++) + if ((p_block = 
cl_ptr_vector_get(&p_pkey_tbl->new_blocks, b))) + memset(p_block, 0, sizeof(*p_block)); } /********************************************************************** @@ -296,7 +285,7 @@ osm_pkey_find_next_free_entry( OUT uint8_t *p_pkey_idx) { ib_pkey_table_t *p_new_block; - + CL_ASSERT(p_block_idx); CL_ASSERT(p_pkey_idx); From sashak at voltaire.com Wed Jun 21 06:52:38 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 21 Jun 2006 16:52:38 +0300 Subject: [openib-general] [PATCH] opensm: osm_pkey_tbl_make_block_pair() removal Message-ID: <20060621135238.GB24726@sashak.voltaire.com> Since 'blocks' pkey vector is updated only by receiver, remove it from osm_pkey_tbl_set_new_entry(), as well as osm_pkey_tbl_make_block_pair(). Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_pkey.h | 35 ------------------------ osm/opensm/osm_pkey.c | 59 ++++++++--------------------------------- 2 files changed, 11 insertions(+), 83 deletions(-) diff --git a/osm/include/opensm/osm_pkey.h b/osm/include/opensm/osm_pkey.h index a353ad0..44e932d 100644 --- a/osm/include/opensm/osm_pkey.h +++ b/osm/include/opensm/osm_pkey.h @@ -296,41 +296,6 @@ static inline ib_pkey_table_t *osm_pkey_ /* *********/ -/****f* OpenSM: osm_pkey_tbl_make_block_pair -* NAME -* osm_pkey_tbl_make_block_pair -* -* DESCRIPTION -* Find or create a pair of "old" and "new" blocks for the -* given block index -* -* SYNOPSIS -*/ -ib_api_status_t -osm_pkey_tbl_make_block_pair( - osm_pkey_tbl_t *p_pkey_tbl, - uint16_t block_idx, - ib_pkey_table_t **pp_old_block, - ib_pkey_table_t **pp_new_block); -/* -* p_pkey_tbl -* [in] Pointer to the PKey table -* -* block_idx -* [in] The block index to use -* -* pp_old_block -* [out] Pointer to the old block pointer arg -* -* pp_new_block -* [out] Pointer to the new block pointer arg -* -* RETURN VALUES -* IB_SUCCESS if OK -* IB_ERROR if failed -* -*********/ - /****f* OpenSM: osm_pkey_tbl_set_new_entry * NAME * osm_pkey_tbl_set_new_entry diff --git 
a/osm/opensm/osm_pkey.c b/osm/opensm/osm_pkey.c index caefe18..2937ac8 100644 --- a/osm/opensm/osm_pkey.c +++ b/osm/opensm/osm_pkey.c @@ -211,46 +211,6 @@ osm_pkey_tbl_set( /********************************************************************** **********************************************************************/ -ib_api_status_t -osm_pkey_tbl_make_block_pair( - osm_pkey_tbl_t *p_pkey_tbl, - uint16_t block_idx, - ib_pkey_table_t **pp_old_block, - ib_pkey_table_t **pp_new_block) -{ - if (block_idx >= p_pkey_tbl->max_blocks) - return(IB_ERROR); - - if (pp_old_block) - { - *pp_old_block = osm_pkey_tbl_block_get(p_pkey_tbl, block_idx); - if (! *pp_old_block) - { - *pp_old_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); - if (!*pp_old_block) - return(IB_ERROR); - memset(*pp_old_block, 0, sizeof(ib_pkey_table_t)); - cl_ptr_vector_set(&p_pkey_tbl->blocks, block_idx, *pp_old_block); - } - } - - if (pp_new_block) - { - *pp_new_block = osm_pkey_tbl_new_block_get(p_pkey_tbl, block_idx); - if (! *pp_new_block) - { - *pp_new_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); - if (!*pp_new_block) - return(IB_ERROR); - memset(*pp_new_block, 0, sizeof(ib_pkey_table_t)); - cl_ptr_vector_set(&p_pkey_tbl->new_blocks, block_idx, *pp_new_block); - } - } - return(IB_SUCCESS); -} - -/********************************************************************** - **********************************************************************/ /* store the given pkey in the "new" blocks array also makes sure the regular block exists. 
@@ -262,14 +222,17 @@ osm_pkey_tbl_set_new_entry( IN uint8_t pkey_idx, IN uint16_t pkey) { - ib_pkey_table_t *p_old_block; - ib_pkey_table_t *p_new_block; - - if (osm_pkey_tbl_make_block_pair(p_pkey_tbl, block_idx, &p_old_block, - &p_new_block)) - return(IB_ERROR); - - p_new_block->pkey_entry[pkey_idx] = pkey; + ib_pkey_table_t *p_block; + + if (!(p_block = osm_pkey_tbl_new_block_get(p_pkey_tbl, block_idx))) { + p_block = (ib_pkey_table_t *)malloc(sizeof(ib_pkey_table_t)); + if (!p_block) + return(IB_ERROR); + memset(p_block, 0, sizeof(ib_pkey_table_t)); + cl_ptr_vector_set(&p_pkey_tbl->new_blocks, block_idx, p_block); + } + + p_block->pkey_entry[pkey_idx] = pkey; if (p_pkey_tbl->used_blocks <= block_idx) p_pkey_tbl->used_blocks = block_idx + 1; From halr at voltaire.com Wed Jun 21 07:02:15 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Jun 2006 10:02:15 -0400 Subject: [openib-general] [PATCH] opensm: add check for not intialized pkey blocks In-Reply-To: <20060621124927.GA24726@sashak.voltaire.com> References: <20060621124927.GA24726@sashak.voltaire.com> Message-ID: <1150898531.4391.174435.camel@hal.voltaire.com> Hi Sasha, On Wed, 2006-06-21 at 08:49, Sasha Khapyorsky wrote: > Hi Hal, > > The lower block of pkey tables' 'blocks' vector may be not initialized > due to lost MADs. We need to check it for NULL. Some duplicated code > removal as well. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From swise at opengridcomputing.com Wed Jun 21 07:12:46 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 21 Jun 2006 09:12:46 -0500 Subject: [openib-general] [librdmacm] check return value in operations of rping (as an attachment) In-Reply-To: <200606201853.15272.dotanb@mellanox.co.il> References: <200606201853.15272.dotanb@mellanox.co.il> Message-ID: <1150899166.11051.7.camel@stevo-desktop> Thanks, Committed with minor change to always ack events before exiting. trunk: revision 8159 iwarp branch: revision 8160 Steve. 
On Tue, 2006-06-20 at 18:53 +0300, Dotan Barak wrote: > Added checks to the return values of all of the functions that may > fail > (in order to add this test to the regression system). > > Signed-off-by: Dotan Barak > From mamidala at cse.ohio-state.edu Wed Jun 21 07:07:04 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Wed, 21 Jun 2006 10:07:04 -0400 (EDT) Subject: [openib-general] [librdmacm] compile error In-Reply-To: <1150818556.22519.28.camel@stevo-desktop> Message-ID: Hi, I have a quick question. I updated the infiniband kernel modules and while compiling them got this error. (Redhat AS 4, linux-2.6.16.20 on IA-32 platform ) drivers/infiniband/ulp/ipoib/ipoib_main.c: In function `ipoib_neigh_setup_dev': drivers/infiniband/ulp/ipoib/ipoib_main.c:794: error: structure has no member named `neigh_destructor' make[3]: *** [drivers/infiniband/ulp/ipoib/ipoib_main.o] Error 1 make[2]: *** [drivers/infiniband/ulp/ipoib] Error 2 make[1]: *** [drivers/infiniband] Error 2 make: *** [drivers] Error 2 Do I need to update any other module? Thanks, Amith From halr at voltaire.com Wed Jun 21 07:21:44 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Jun 2006 10:21:44 -0400 Subject: [openib-general] [librdmacm] compile error In-Reply-To: References: Message-ID: <1150899700.4391.175085.camel@hal.voltaire.com> Hi Amith, On Wed, 2006-06-21 at 10:07, amith rajith mamidala wrote: > Hi, > > I have a quick question. I updated the infiniband kernel modules and while > compiling them got this error. 
> (Redhat AS 4, linux-2.6.16.20 on IA-32 platform ) > > drivers/infiniband/ulp/ipoib/ipoib_main.c: In function > `ipoib_neigh_setup_dev': > drivers/infiniband/ulp/ipoib/ipoib_main.c:794: error: structure has no > member named `neigh_destructor' > make[3]: *** [drivers/infiniband/ulp/ipoib/ipoib_main.o] Error 1 > make[2]: *** [drivers/infiniband/ulp/ipoib] Error 2 > make[1]: *** [drivers/infiniband] Error 2 > make: *** [drivers] Error 2 > > Do I need to update any other module? This is due to the backwards compatibility cruft for pre-2.6.17 kernels from IPoIB being removed at r8111. You will need to revert that change. -- Hal > > Thanks, > Amith > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Wed Jun 21 07:27:51 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 21 Jun 2006 17:27:51 +0300 Subject: [openib-general] [PATCH] osm: add release notes to doc dir Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3023688AD@mtlexch01.mtl.com> > > This patch adds the two latest releases notes. > > One from Jan 2006 and one from Jun 2006. > > What was the Jan 2006 release ? [EZ] That was the first gen2 distribution. > > -- Hal From halr at voltaire.com Wed Jun 21 07:33:08 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Jun 2006 10:33:08 -0400 Subject: [openib-general] [PATCH] osm: add release notes to doc dir In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3023688AD@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3023688AD@mtlexch01.mtl.com> Message-ID: <1150900386.4391.175519.camel@hal.voltaire.com> On Wed, 2006-06-21 at 10:27, Eitan Zahavi wrote: > > > This patch adds the two latest releases notes. > > > One from Jan 2006 and one from Jun 2006. > > > > What was the Jan 2006 release ? 
> [EZ] That was the first gen2 distribution. As I mentioned in the previous email on this, my preference would be to name the release notes by that version string. Any objections ? What was the OpenSM version string for this ? -- Hal > > > > > -- Hal From eitan at mellanox.co.il Wed Jun 21 07:56:26 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 21 Jun 2006 17:56:26 +0300 Subject: [openib-general] [PATCH] osm: add release notes to doc dir Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3023688B2@mtlexch01.mtl.com> > > As I mentioned in the previous email on this, my preference would be to > name the release notes by that version string. Any objections ? > > What was the OpenSM version string for this ? [EZ] It was 2.0.1 > From paul.lundin at gmail.com Wed Jun 21 08:01:07 2006 From: paul.lundin at gmail.com (Paul) Date: Wed, 21 Jun 2006 11:01:07 -0400 Subject: [openib-general] OFED 1.0-pre 1 build issues. In-Reply-To: <4497B634.2070704@mellanox.co.il> References: <1150324203.10676.17.camel@chalcedony.pathscale.com> <4497B634.2070704@mellanox.co.il> Message-ID: Tziporet, Thanks. I also opened a bug on this. Bug # 142. Regards. On 6/20/06, Tziporet Koren wrote: > > Paul wrote: > > Michael, > > I performed the same work-around in bash (not so good with perl > > these days) it gets past the prior point. Thanks. Should something > > that takes care of this be included in the build.sh or build_env.sh > > scripts ? We would certainly need it covered in the docs at least. > > > > Now the build is dying on some undefined references. (log attached) > > > > Regards. > > > > I will ask Vlad to look into it. > > Tziporet > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jlentini at netapp.com Wed Jun 21 09:21:45 2006 From: jlentini at netapp.com (James Lentini) Date: Wed, 21 Jun 2006 12:21:45 -0400 (EDT) Subject: [openib-general] ib_gid lookup In-Reply-To: <1150822037.4391.126581.camel@hal.voltaire.com> References: <1150798111.4391.111384.camel@hal.voltaire.com> <1150822037.4391.126581.camel@hal.voltaire.com> Message-ID: On Tue, 20 Jun 2006, Hal Rosenstock wrote: > > > The SM does not know the IP addresses unless they are registered > > > by DAPL (via ServiceRecords) but I'm not sure that is done > > > anymore or whether DAPL runs in your environment. > > > > > > > if i run DAPL in my environment will it work or this is already > > made obsolete? > > I don't know. James or maybe Arlin would be the ones to answer. You > could also look at the code to figure this out. DAPL used to use the Address Translation Service (ATS) to map between IP addresses to GIDs. It now uses IPoIB for this purpose. You could use IPoIB to determine a node's GID using a IP or install the (unsupported?) ATS software on your systems (https://openib.org/svn/gen2/branches/ibat/). From swise at opengridcomputing.com Wed Jun 21 09:32:51 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 21 Jun 2006 11:32:51 -0500 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. In-Reply-To: <1150836226.2891.231.camel@laptopd505.fenrus.org> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> Message-ID: <1150907571.31600.31.camel@stevo-desktop> On Tue, 2006-06-20 at 22:43 +0200, Arjan van de Ven wrote: > On Tue, 2006-06-20 at 15:30 -0500, Steve Wise wrote: > > > +/* > > + * Allocate TX ring elements and chain them together. > > + * One-to-one association of adapter descriptors with ring elements. 
> > + */ > > +static int c2_tx_ring_alloc(struct c2_ring *tx_ring, void *vaddr, > > + dma_addr_t base, void __iomem * mmio_txp_ring) > > +{ > > + struct c2_tx_desc *tx_desc; > > + struct c2_txp_desc __iomem *txp_desc; > > + struct c2_element *elem; > > + int i; > > + > > + tx_ring->start = kmalloc(sizeof(*elem) * tx_ring->count, GFP_KERNEL); > > I would think this needs a dma_alloc_coherent() rather than a kmalloc... > No, this memory is used to describe the tx ring from the host's perspective. The HW never touches this memory. The HW's TX descriptor ring is in adapter memory and is mapped into host memory (see c2dev->mmio_txp_ring). > > > + > > +/* Free all buffers in RX ring, assumes receiver stopped */ > > +static void c2_rx_clean(struct c2_port *c2_port) > > +{ > > + struct c2_dev *c2dev = c2_port->c2dev; > > + struct c2_ring *rx_ring = &c2_port->rx_ring; > > + struct c2_element *elem; > > + struct c2_rx_desc *rx_desc; > > + > > + elem = rx_ring->start; > > + do { > > + rx_desc = elem->ht_desc; > > + rx_desc->len = 0; > > + > > + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); > > + __raw_writew(0, elem->hw_desc + C2_RXP_COUNT); > > + __raw_writew(0, elem->hw_desc + C2_RXP_LEN); > > you seem to be a fan of the __raw_write() functions... any reason why? > __raw_ is not a magic "go faster" prefix.... > In this particular case, I believe this is done to avoid a swap of '0' since its not necessary. In other places, __raw is used because the adapter needs the data in BE and we want to explicitly swap it using cpu_to_be* then raw_write it to the adapter memory... > Also on a related note, have you checked the driver for the needed PCI > posting flushes? > Um, what's a 'PCI posting flush'? Can you point me where its described/used so I can see if we need it? Thanx. > > + > > + /* Disable IRQs by clearing the interrupt mask */ > > + writel(1, c2dev->regs + C2_IDIS); > > + writel(0, c2dev->regs + C2_NIMR0); > > like here... 
> > + > > + elem = tx_ring->to_use; > > + elem->skb = skb; > > + elem->mapaddr = mapaddr; > > + elem->maplen = maplen; > > + > > + /* Tell HW to xmit */ > > + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_TXP_ADDR); > > + __raw_writew(cpu_to_be16(maplen), elem->hw_desc + C2_TXP_LEN); > > + __raw_writew(cpu_to_be16(TXP_HTXD_READY), elem->hw_desc + C2_TXP_FLAGS); > > or here > > > +static int c2_change_mtu(struct net_device *netdev, int new_mtu) > > +{ > > + int ret = 0; > > + > > + if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU) > > + return -EINVAL; > > + > > + netdev->mtu = new_mtu; > > + > > + if (netif_running(netdev)) { > > + c2_down(netdev); > > + > > + c2_up(netdev); > > + } > > this looks odd... > The 1100 hardware caches the dma address of the next skb that will be used to place data. When the MTU changes, we want to free the SKBs in the RX descriptor ring and get new ones that sufficient for the new MTU. To effectively flush that cached address of the old skb, we must quiesce the HW and firmware (via c2_down()), then reinitialize everything with skb's big enough for the new mtu. Steve. 
From arjan at infradead.org Wed Jun 21 10:13:33 2006 From: arjan at infradead.org (Arjan van de Ven) Date: Wed, 21 Jun 2006 19:13:33 +0200 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. In-Reply-To: <1150907571.31600.31.camel@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> <1150907571.31600.31.camel@stevo-desktop> Message-ID: <1150910013.3057.59.camel@laptopd505.fenrus.org> > 0; > > > + > > > + __raw_writew(0, elem->hw_desc + C2_RXP_STATUS); > > > + __raw_writew(0, elem->hw_desc + C2_RXP_COUNT); > > > + __raw_writew(0, elem->hw_desc + C2_RXP_LEN); > > > > you seem to be a fan of the __raw_write() functions... any reason why? > > __raw_ is not a magic "go faster" prefix.... 
> In this particular case, I believe this is done to avoid a swap of '0'
> since it's not necessary.

but.. shouldn't writew() and co just autodetect that (or do it at compile time)... (maybe it doesn't and we have an optimization opportunity here ;)

> > Also on a related note, have you checked the driver for the needed PCI
> > posting flushes?
>
> Um, what's a 'PCI posting flush'? Can you point me to where it's
> described/used so I can see if we need it? Thanx.

ok pci posting...

basically, if you use writel() and co, the PCI bridges in the middle are allowed to (and the more fancy ones do) cache the write, to see if more writes follow, so that the bridge can do the writes as a single burst to the device, rather than as individual writes. This is of course great... ... except when you really want the write to hit the device before the driver continues with other actions.

Now the PCI spec is set up such that any traffic in the other direction (basically readl() and co) will first flush the write through the system before the read is actually sent to the device, so doing a dummy readl() is a good way to flush any pending posted writes.

Where does this matter? It matters most at places such as irq enabling/disabling, IO submission and possibly IRQ acking, but also often in eeprom-like read/write logic (where you do manual clocking and need to do delays between the write()'s). But in general... any place where you do writel() without doing any readl() before doing nothing to the card for a long time, or where you are waiting for the card to do something (or want it done NOW, such as IRQ disabling), you need to issue a (dummy) readl() to flush pending writes out to the hardware.

does this explanation make any sense? if not please feel free to ask any questions, I know I'm not always very good at explaining things.
Greetings,
   Arjan van de Ven

From iod00d at hp.com Wed Jun 21 10:37:11 2006
From: iod00d at hp.com (Grant Grundler)
Date: Wed, 21 Jun 2006 10:37:11 -0700
Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver.
In-Reply-To: <1150907571.31600.31.camel@stevo-desktop>
References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> <1150907571.31600.31.camel@stevo-desktop>
Message-ID: <20060621173711.GG26637@esmail.cup.hp.com>

On Wed, Jun 21, 2006 at 11:32:51AM -0500, Steve Wise wrote:
> Um, what's a 'PCI posting flush'? Can you point me where its
> described/used so I can see if we need it? Thanx.

I've written this up before:
http://iou.parisc-linux.org/ols_2002/4Posted_vs_Non_Posted.html

grant

From swise at opengridcomputing.com Wed Jun 21 11:47:45 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 21 Jun 2006 13:47:45 -0500
Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver.
In-Reply-To: <1150910013.3057.59.camel@laptopd505.fenrus.org>
References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> <1150907571.31600.31.camel@stevo-desktop> <1150910013.3057.59.camel@laptopd505.fenrus.org>
Message-ID: <1150915665.20327.0.camel@stevo-desktop>

> ok pci posting...
>
> basically, if you use writel() and co, the PCI bridges in the middle are
> allowed (and the more fancy ones do) cache the write, to see if more
> writes follow, so that the bridge can do the writes as a single burst to
> the device, rather than as individual writes. This is of course great...
> ... except when you really want the write to hit the device before the
> driver continues with other actions.
> Now the PCI spec is set up such that any traffic in the other direction
> (basically readl() and co) will first flush the write through the system
> before the read is actually sent to the device, so doing a dummy readl()
> is a good way to flush any pending posted writes.
>
> Where does this matter?
> it matters most at places such as irq enabling/disabling, IO submission
> and possibly IRQ acking, but also often in eeprom-like read/write logic
> (where you do manual clocking and need to do delays between the
> write()'s). But in general... any place where you do writel() without
> doing any readl() before doing nothing to the card for a long time, or
> where you are waiting for the card to do something (or want it done NOW,
> such as IRQ disabling) you need to issue a (dummy) readl() to flush
> pending writes out to the hardware.
>
> does this explanation make any sense? if not please feel free to ask any
> questions, I know I'm not always very good at explaining things.

Yep. I get it. I believe we're OK in this respect, but I'll review the code again with an eye for this issue...

Steve.

From halr at voltaire.com Wed Jun 21 11:46:00 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 Jun 2006 14:46:00 -0400
Subject: [openib-general] [PATCH] osm: add release notes to doc dir
In-Reply-To: <86ac87gdxy.fsf@mtl066.yok.mtl.com>
References: <86ac87gdxy.fsf@mtl066.yok.mtl.com>
Message-ID: <1150915557.4391.184538.camel@hal.voltaire.com>

Hi Eitan,

On Wed, 2006-06-21 at 03:34, Eitan Zahavi wrote:
> Hi Hal
>
> Following the OFED 1.0 release I think it will be very handy for
> users to see OpenSM release notes accumulate in the doc dir.
>
> As release notes always refer back to old releases - by not specifying
> all features just new ones - I propose a file naming scheme that is
> based on the release date.
>
> This patch adds the two latest release notes:
> one from Jan 2006 and one from Jun 2006.
>
> Eitan
>
> Signed-off-by: Eitan Zahavi

Applied as opensm_release_notes_openib-1.2.1.txt and opensm_release_notes_ibg2-2.0.1.txt

-- Hal

From swise at opengridcomputing.com Wed Jun 21 12:48:16 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 21 Jun 2006 14:48:16 -0500
Subject: [openib-general] [PATCH 0/2][RFC] Network Event Notifier Mechanism
Message-ID: <20060621194816.4507.4090.stgit@stevo-desktop>

This patch implements a mechanism that allows interested clients to register for notification of certain network events. The intended use is to allow RDMA devices (linux/drivers/infiniband) to be notified of neighbour updates, ICMP redirects, path MTU changes, and route changes. These devices need update events because they typically cache this information in hardware and must be told when it has been updated.

This approach is one of many possibilities and may be preferred because it uses an existing notification mechanism that has precedent in the stack. An alternative would be to add a netdev method to notify affected devices of these events.

This code does not yet implement path MTU change events because the number of places in which this value is updated is large; if this mechanism seems reasonable, it would probably be best to funnel these updates through a single function.

We would like to get this or similar functionality included in 2.6.19 and request comments. This patchset consists of 2 patches:

1) New files implementing the Network Event Notifier
2) Core network changes to generate network event notifications

Signed-off-by: Tom Tucker
Signed-off-by: Steve Wise

From swise at opengridcomputing.com Wed Jun 21 12:48:21 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 21 Jun 2006 14:48:21 -0500
Subject: [openib-general] [PATCH 2/2] Core network changes to support network event notification.
In-Reply-To: <20060621194816.4507.4090.stgit@stevo-desktop> References: <20060621194816.4507.4090.stgit@stevo-desktop> Message-ID: <20060621194821.4507.70124.stgit@stevo-desktop> This patch adds event calls for neighbour change, route update, and routing redirect events. TODO: PMTU change events. --- net/core/Makefile | 2 +- net/core/neighbour.c | 8 ++++++++ net/ipv4/fib_semantics.c | 7 +++++++ net/ipv4/route.c | 6 ++++++ 4 files changed, 22 insertions(+), 1 deletions(-) diff --git a/net/core/Makefile b/net/core/Makefile index e9bd246..2645ba4 100644 --- a/net/core/Makefile +++ b/net/core/Makefile @@ -7,7 +7,7 @@ obj-y := sock.o request_sock.o skbuff.o obj-$(CONFIG_SYSCTL) += sysctl_net_core.o -obj-y += dev.o ethtool.o dev_mcast.o dst.o \ +obj-y += dev.o ethtool.o dev_mcast.o dst.o netevent.o \ neighbour.o rtnetlink.o utils.o link_watch.o filter.o obj-$(CONFIG_XFRM) += flow.o diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 50a8c73..c637897 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -30,9 +30,11 @@ #include #include #include #include +#include #include #include #include +#include #define NEIGH_DEBUG 1 @@ -755,6 +757,7 @@ #endif neigh->nud_state = NUD_STALE; neigh->updated = jiffies; neigh_suspect(neigh); + call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); } } else if (state & NUD_DELAY) { if (time_before_eq(now, @@ -763,6 +766,7 @@ #endif neigh->nud_state = NUD_REACHABLE; neigh->updated = jiffies; neigh_connect(neigh); + call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); next = neigh->confirmed + neigh->parms->reachable_time; } else { NEIGH_PRINTK2("neigh %p is probed.\n", neigh); @@ -783,6 +787,7 @@ #endif neigh->nud_state = NUD_FAILED; neigh->updated = jiffies; notify = 1; + call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); NEIGH_CACHE_STAT_INC(neigh->tbl, res_failed); NEIGH_PRINTK2("neigh %p is failed.\n", neigh); @@ -1056,6 +1061,9 @@ out: (neigh->flags | NTF_ROUTER) : (neigh->flags & ~NTF_ROUTER); } + + 
call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); + write_unlock_bh(&neigh->lock); #ifdef CONFIG_ARPD if (notify && neigh->parms->app_probes) diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 0f4145b..67a30af 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -45,6 +45,7 @@ #include #include #include #include +#include #include "fib_lookup.h" @@ -278,9 +279,15 @@ void rtmsg_fib(int event, u32 key, struc struct nlmsghdr *n, struct netlink_skb_parms *req) { struct sk_buff *skb; + struct netevent_route_change rev; + u32 pid = req ? req->pid : n->nlmsg_pid; int size = NLMSG_SPACE(sizeof(struct rtmsg)+256); + rev.event = event; + rev.fib_info = fa->fa_info; + call_netevent_notifiers(NETEVENT_ROUTE_UPDATE, &rev); + skb = alloc_skb(size, GFP_KERNEL); if (!skb) return; diff --git a/net/ipv4/route.c b/net/ipv4/route.c index cc9423d..e9ba831 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -105,6 +105,7 @@ #include #include #include #include +#include #ifdef CONFIG_SYSCTL #include #endif @@ -1120,6 +1121,7 @@ void ip_rt_redirect(u32 old_gw, u32 dadd struct rtable *rth, **rthp; u32 skeys[2] = { saddr, 0 }; int ikeys[2] = { dev->ifindex, 0 }; + struct netevent_redirect netevent; if (!in_dev) return; @@ -1211,6 +1213,10 @@ void ip_rt_redirect(u32 old_gw, u32 dadd rt_drop(rt); goto do_next; } + + netevent.old = &rth->u.dst; + netevent.new = &rt->u.dst; + call_netevent_notifiers(NETEVENT_REDIRECT, &netevent); rt_del(hash, rth); if (!rt_intern_hash(hash, rt, &rt)) From swise at opengridcomputing.com Wed Jun 21 12:48:19 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 21 Jun 2006 14:48:19 -0500 Subject: [openib-general] [PATCH 1/2] Network Event Notifier Mechanism. 
In-Reply-To: <20060621194816.4507.4090.stgit@stevo-desktop> References: <20060621194816.4507.4090.stgit@stevo-desktop> Message-ID: <20060621194818.4507.80455.stgit@stevo-desktop> This patch uses notifier blocks to implement a network event notifier mechanism. Clients register their callback function by calling register_netevent_notifier() like this: static struct notifier_block nb = { .notifier_call = my_callback_func }; ... register_netevent_notifier(&nb); --- include/net/netevent.h | 41 +++++++++++++++++++++++++++++ net/core/netevent.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 108 insertions(+), 0 deletions(-) diff --git a/include/net/netevent.h b/include/net/netevent.h new file mode 100644 index 0000000..9ceab27 --- /dev/null +++ b/include/net/netevent.h @@ -0,0 +1,41 @@ +#ifndef _NET_EVENT_H +#define _NET_EVENT_H + +/* + * Generic netevent notifiers + * + * Authors: + * Tom Tucker + * + * Changes: + */ + +#ifdef __KERNEL__ + +#include + +struct netevent_redirect { + struct dst_entry *old; + struct dst_entry *new; +}; + +struct netevent_route_change { + int event; + struct fib_info *fib_info; +}; + +enum netevent_notif_type { + NETEVENT_NEIGH_UPDATE = 1, /* arg is * struct neighbour */ + NETEVENT_ROUTE_UPDATE, /* arg is * netevent_route_change */ + NETEVENT_PMTU_UPDATE, + NETEVENT_REDIRECT, /* arg is * struct netevent_redirect */ +}; + +extern int register_netevent_notifier(struct notifier_block *nb); +extern int unregister_netevent_notifier(struct notifier_block *nb); +extern int call_netevent_notifiers(unsigned long val, void *v); + +#endif +#endif + + diff --git a/net/core/netevent.c b/net/core/netevent.c new file mode 100644 index 0000000..2261fb3 --- /dev/null +++ b/net/core/netevent.c @@ -0,0 +1,67 @@ +/* + * Network event notifiers + * + * Authors: + * Tom Tucker + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free 
Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + * Fixes: + */ + +#include +#include + +static struct atomic_notifier_head netevent_notif_chain; + +/** + * register_netevent_notifier - register a netevent notifier block + * @nb: notifier + * + * Register a notifier to be called when a netevent occurs. + * The notifier passed is linked into the kernel structures and must + * not be reused until it has been unregistered. A negative errno code + * is returned on a failure. + */ +int register_netevent_notifier(struct notifier_block *nb) +{ + int err; + + err = atomic_notifier_chain_register(&netevent_notif_chain, nb); + return err; +} + +/** + * netevent_unregister_notifier - unregister a netevent notifier block + * @nb: notifier + * + * Unregister a notifier previously registered by + * register_neigh_notifier(). The notifier is unlinked into the + * kernel structures and may then be reused. A negative errno code + * is returned on a failure. + */ + +int unregister_netevent_notifier(struct notifier_block *nb) +{ + return atomic_notifier_chain_unregister(&netevent_notif_chain, nb); +} + +/** + * call_netevent_notifiers - call all netevent notifier blocks + * @val: value passed unmodified to notifier function + * @v: pointer passed unmodified to notifier function + * + * Call all neighbour notifier blocks. Parameters and return value + * are as for notifier_call_chain(). 
+ */ + +int call_netevent_notifiers(unsigned long val, void *v) +{ + return atomic_notifier_call_chain(&netevent_notif_chain, val, v); +} + +EXPORT_SYMBOL_GPL(register_netevent_notifier); +EXPORT_SYMBOL_GPL(unregister_netevent_notifier); From davem at davemloft.net Wed Jun 21 13:40:46 2006 From: davem at davemloft.net (David Miller) Date: Wed, 21 Jun 2006 13:40:46 -0700 (PDT) Subject: [openib-general] [PATCH 0/2][RFC] Network Event Notifier Mechanism In-Reply-To: <20060621194816.4507.4090.stgit@stevo-desktop> References: <20060621194816.4507.4090.stgit@stevo-desktop> Message-ID: <20060621.134046.35016879.davem@davemloft.net> Most of the folks capable of reviewing networking changes listen in on the netdev at vger.kernel.org mailing list, not here. Thanks. From halr at voltaire.com Wed Jun 21 13:59:15 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Jun 2006 16:59:15 -0400 Subject: [openib-general] [PATCH] [TRIVIAL] librdmacm/examples/mckey.c: Fix example name in messages Message-ID: <1150923404.4391.189477.camel@hal.voltaire.com> librdmacm/examples/mckey.c: Fix example name in messages Signed-off-by: Hal Rosenstock Index: ../../librdmacm/examples/mckey.c =================================================================== --- ../../librdmacm/examples/mckey.c (revision 8166) +++ ../../librdmacm/examples/mckey.c (working copy) @@ -111,7 +111,7 @@ static int init_node(struct cmatest_node node->pd = ibv_alloc_pd(node->cma_id->verbs); if (!node->pd) { ret = -ENOMEM; - printf("cmatose: unable to allocate PD\n"); + printf("mckey: unable to allocate PD\n"); goto out; } @@ -119,7 +119,7 @@ static int init_node(struct cmatest_node node->cq = ibv_create_cq(node->cma_id->verbs, cqe, node, 0, 0); if (!node->cq) { ret = -ENOMEM; - printf("cmatose: unable to create CQ\n"); + printf("mckey: unable to create CQ\n"); goto out; } @@ -135,13 +135,13 @@ static int init_node(struct cmatest_node init_qp_attr.recv_cq = node->cq; ret = rdma_create_qp(node->cma_id, node->pd, 
&init_qp_attr); if (ret) { - printf("cmatose: unable to create QP: %d\n", ret); + printf("mckey: unable to create QP: %d\n", ret); goto out; } ret = create_message(node); if (ret) { - printf("cmatose: failed to create messages: %d\n", ret); + printf("mckey: failed to create messages: %d\n", ret); goto out; } out: @@ -230,7 +230,7 @@ static int addr_handler(struct cmatest_n ret = rdma_join_multicast(node->cma_id, test.dst_addr, node); if (ret) { - printf("cmatose: failure joining: %d\n", ret); + printf("mckey: failure joining: %d\n", ret); goto err; } return 0; @@ -279,7 +279,7 @@ static int cma_handler(struct rdma_cm_id case RDMA_CM_EVENT_ADDR_ERROR: case RDMA_CM_EVENT_ROUTE_ERROR: case RDMA_CM_EVENT_MULTICAST_ERROR: - printf("cmatose: event: %d, error: %d\n", event->event, + printf("mckey: event: %d, error: %d\n", event->event, event->status); connect_error(); ret = event->status; @@ -325,7 +325,7 @@ static int alloc_nodes(void) test.nodes = malloc(sizeof *test.nodes * connections); if (!test.nodes) { - printf("cmatose: unable to allocate memory for test nodes\n"); + printf("mckey: unable to allocate memory for test nodes\n"); return -ENOMEM; } memset(test.nodes, 0, sizeof *test.nodes * connections); @@ -366,7 +366,7 @@ static int poll_cqs(void) for (done = 0; done < message_count; done += ret) { ret = ibv_poll_cq(test.nodes[i].cq, 8, wc); if (ret < 0) { - printf("cmatose: failed polling CQ: %d\n", ret); + printf("mckey: failed polling CQ: %d\n", ret); return ret; } } @@ -415,7 +415,7 @@ static int run(char *dst, char *src) { int i, ret; - printf("cmatose: starting client\n"); + printf("mckey: starting client\n"); if (src) { ret = get_addr(src, &test.src_in); if (ret) @@ -428,13 +428,13 @@ static int run(char *dst, char *src) test.dst_in.sin_port = 7174; - printf("cmatose: joining\n"); + printf("mckey: joining\n"); for (i = 0; i < connections; i++) { ret = rdma_resolve_addr(test.nodes[i].cma_id, src ? 
test.src_addr : NULL, test.dst_addr, 2000); if (ret) { - printf("cmatose: failure getting addr: %d\n", ret); + printf("mckey: failure getting addr: %d\n", ret); connect_error(); return ret; } From bunk at stusta.de Wed Jun 21 15:54:58 2006 From: bunk at stusta.de (Adrian Bunk) Date: Thu, 22 Jun 2006 00:54:58 +0200 Subject: [openib-general] [-mm patch] drivers/scsi/qla2xxx/: make some functions static In-Reply-To: <20060621034857.35cfe36f.akpm@osdl.org> References: <20060621034857.35cfe36f.akpm@osdl.org> Message-ID: <20060621225458.GQ9111@stusta.de> On Wed, Jun 21, 2006 at 03:48:57AM -0700, Andrew Morton wrote: >... > Changes since 2.6.17-rc6-mm2: >... > git-infiniband.patch >... > git trees >... This patch makes some needlessly global functions static. Signed-off-by: Adrian Bunk --- drivers/scsi/qla2xxx/qla_gbl.h | 6 ------ drivers/scsi/qla2xxx/qla_init.c | 8 +++++--- drivers/scsi/qla2xxx/qla_iocb.c | 3 ++- 3 files changed, 7 insertions(+), 10 deletions(-) --- linux-2.6.17-mm1-full/drivers/scsi/qla2xxx/qla_gbl.h.old 2006-06-22 00:48:35.000000000 +0200 +++ linux-2.6.17-mm1-full/drivers/scsi/qla2xxx/qla_gbl.h 2006-06-22 00:50:32.000000000 +0200 @@ -31,13 +31,9 @@ extern void qla24xx_update_fw_options(scsi_qla_host_t *); extern int qla2x00_load_risc(struct scsi_qla_host *, uint32_t *); extern int qla24xx_load_risc(scsi_qla_host_t *, uint32_t *); -extern int qla24xx_load_risc_flash(scsi_qla_host_t *, uint32_t *); - -extern fc_port_t *qla2x00_alloc_fcport(scsi_qla_host_t *, gfp_t); extern int qla2x00_loop_resync(scsi_qla_host_t *); -extern int qla2x00_find_new_loop_id(scsi_qla_host_t *, fc_port_t *); extern int qla2x00_fabric_login(scsi_qla_host_t *, fc_port_t *, uint16_t *); extern int qla2x00_local_device_login(scsi_qla_host_t *, fc_port_t *); @@ -80,8 +76,6 @@ /* * Global Function Prototypes in qla_iocb.c source file. 
*/ -extern void qla2x00_isp_cmd(scsi_qla_host_t *); - extern uint16_t qla2x00_calc_iocbs_32(uint16_t); extern uint16_t qla2x00_calc_iocbs_64(uint16_t); extern void qla2x00_build_scsi_iocbs_32(srb_t *, cmd_entry_t *, uint16_t); --- linux-2.6.17-mm1-full/drivers/scsi/qla2xxx/qla_init.c.old 2006-06-22 00:48:58.000000000 +0200 +++ linux-2.6.17-mm1-full/drivers/scsi/qla2xxx/qla_init.c 2006-06-22 00:49:50.000000000 +0200 @@ -39,6 +39,8 @@ static int qla2x00_restart_isp(scsi_qla_host_t *); +static int qla2x00_find_new_loop_id(scsi_qla_host_t *ha, fc_port_t *dev); + /****************************************************************************/ /* QLogic ISP2x00 Hardware Support Functions. */ /****************************************************************************/ @@ -1701,7 +1703,7 @@ * * Returns a pointer to the allocated fcport, or NULL, if none available. */ -fc_port_t * +static fc_port_t * qla2x00_alloc_fcport(scsi_qla_host_t *ha, gfp_t flags) { fc_port_t *fcport; @@ -2497,7 +2499,7 @@ * Context: * Kernel context. */ -int +static int qla2x00_find_new_loop_id(scsi_qla_host_t *ha, fc_port_t *dev) { int rval; @@ -3472,7 +3474,7 @@ return (rval); } -int +static int qla24xx_load_risc_flash(scsi_qla_host_t *ha, uint32_t *srisc_addr) { int rval; --- linux-2.6.17-mm1-full/drivers/scsi/qla2xxx/qla_iocb.c.old 2006-06-22 00:50:42.000000000 +0200 +++ linux-2.6.17-mm1-full/drivers/scsi/qla2xxx/qla_iocb.c 2006-06-22 00:51:00.000000000 +0200 @@ -15,6 +15,7 @@ static inline cont_entry_t *qla2x00_prep_cont_type0_iocb(scsi_qla_host_t *); static inline cont_a64_entry_t *qla2x00_prep_cont_type1_iocb(scsi_qla_host_t *); static request_t *qla2x00_req_pkt(scsi_qla_host_t *ha); +static void qla2x00_isp_cmd(scsi_qla_host_t *ha); /** * qla2x00_get_cmd_direction() - Determine control_flag data direction. @@ -574,7 +575,7 @@ * * Note: The caller must hold the hardware lock before calling this routine. 
*/ -void +static void qla2x00_isp_cmd(scsi_qla_host_t *ha) { device_reg_t __iomem *reg = ha->iobase; From arlin.r.davis at intel.com Wed Jun 21 15:13:16 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 21 Jun 2006 15:13:16 -0700 Subject: [openib-general] [PATCH] uDAPL cma - event processing bug Message-ID: James, Fix bug in dapls_ib_get_dat_event() call after adding new unreachable event. -arlin Signed-off by: Arlin Davis ardavis at ichips.intel.com Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 8166) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -1092,9 +1092,6 @@ dapls_ib_get_dat_event(IN const ib_cm_ev active = active; - if (ib_cm_event > IB_CME_BROKEN) - return (DAT_EVENT_NUMBER) 0; - dat_event_num = 0; for(i = 0; i < DAPL_IB_EVENT_CNT; i++) { if (ib_cm_event == ib_cm_event_map[i].ib_cm_event) { From sean.hefty at intel.com Wed Jun 21 15:38:50 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 21 Jun 2006 15:38:50 -0700 Subject: [openib-general] [PATCH] [TRIVIAL] librdmacm/examples/mckey.c: Fix example name in messages In-Reply-To: <1150923404.4391.189477.camel@hal.voltaire.com> Message-ID: <000001c69583$78fa93e0$57781cac@amr.corp.intel.com> >librdmacm/examples/mckey.c: Fix example name in messages > >Signed-off-by: Hal Rosenstock Thanks, Hal. Do you mind committing this change? - Sean From ogerlitz at voltaire.com Wed Jun 21 22:11:35 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 22 Jun 2006 08:11:35 +0300 Subject: [openib-general] iSER updates In-Reply-To: <44977F81.9080206@voltaire.com> References: <44977F81.9080206@voltaire.com> Message-ID: <449A2687.9070400@voltaire.com> > Roland Dreier wrote: >> Linus pull my for-2.6.18 branch today, so the RDMA CM is upstream >> now. He still has not pulled scsi-misc-2.6 so AFAIK there is still >> more required before we can merge iSER. 
Roland, I see now that Linus has pulled the scsi-misc-2.6 updates for 2.6.18 - which means the door is open to push iSER... http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=28e4b224955cbe30275b2a7842e729023a4f4b03 Or. From bpradip at in.ibm.com Thu Jun 22 00:25:27 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Thu, 22 Jun 2006 12:55:27 +0530 Subject: [openib-general] [PATCH] [TRIVIAL] cma.c: Remove compiler warning Message-ID: <20060622072520.GA1393@harry-potter.in.ibm.com> This removes a compile warning : `ret' might be used uninitialized in this function This patch is against the IWARP branch of the code Signed-off-by: Pradipta Kumar Banerjee --- Index: core/cma.c ================================================================== --- cma.org 2006-06-22 12:45:33.000000000 +0530 +++ cma.c 2006-06-22 12:45:51.000000000 +0530 @@ -2066,6 +2066,7 @@ int rdma_disconnect(struct rdma_cm_id *i ret = iw_cm_disconnect(id_priv->cm_id.iw, 0); break; default: + ret = -ENOSYS; break; } out: From halr at voltaire.com Thu Jun 22 02:45:21 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Jun 2006 05:45:21 -0400 Subject: [openib-general] [PATCH] [TRIVIAL] librdmacm/examples/mckey.c: Fix example name in messages In-Reply-To: <000001c69583$78fa93e0$57781cac@amr.corp.intel.com> References: <000001c69583$78fa93e0$57781cac@amr.corp.intel.com> Message-ID: <1150969499.4391.218996.camel@hal.voltaire.com> On Wed, 2006-06-21 at 18:38, Sean Hefty wrote: > >librdmacm/examples/mckey.c: Fix example name in messages > > > >Signed-off-by: Hal Rosenstock > > Thanks, Hal. Do you mind committing this change? Sure; committed in r8170. 
-- Hal > - Sean From halr at voltaire.com Thu Jun 22 03:12:58 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Jun 2006 06:12:58 -0400 Subject: [openib-general] [PATCH] opensm: osm_pkey_tbl_make_block_pair() removal In-Reply-To: <20060621135238.GB24726@sashak.voltaire.com> References: <20060621135238.GB24726@sashak.voltaire.com> Message-ID: <1150971161.4391.219868.camel@hal.voltaire.com> On Wed, 2006-06-21 at 09:52, Sasha Khapyorsky wrote: > Since 'blocks' pkey vector is updated only by receiver, remove it from > osm_pkey_tbl_set_new_entry(), as well as osm_pkey_tbl_make_block_pair(). > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From eitan at mellanox.co.il Thu Jun 22 04:37:46 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 22 Jun 2006 14:37:46 +0300 Subject: [openib-general] [ibutils PATCH] osm.m4 fix for x86_64 machines Message-ID: <449A810A.3020709@mellanox.co.il> Hi The following patch "osm.m4 fix" changes the way ibutils packages auto-detect stack type and location of OpenSM libraries by scanning the lib and lib64 directories. Instead of hard-coding the dir name based on the uname -m we scan both giving the lib priority on the lib64. I applied it to the ibutils trunk. Please let me know see any issues with it. Eitan -------------- next part -------------- An embedded message was scrubbed... From: unknown sender Subject: no subject Date: no date Size: 96 URL: From eitan at mellanox.co.il Thu Jun 22 04:44:32 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 22 Jun 2006 14:44:32 +0300 Subject: [openib-general] [ibutils PATCH] automake required >= 1.9.2 Message-ID: <449A82A0.3020502@mellanox.co.il> Hi The following patch remove the requirement for automake version to be 1.9.3 and up and instead allows it to be 1.9.2. I have applied the patch. Please let me know if you find any issue with this change. Thanks EZ -------------- next part -------------- An embedded message was scrubbed... 
From: "Sasha Khapyorsky" Subject: [ibutils PATCH] automake required >= 1.9.2 Date: Wed, 21 Jun 2006 19:40:52 +0300 Size: 6453 URL: From mamidala at cse.ohio-state.edu Thu Jun 22 06:08:51 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Thu, 22 Jun 2006 09:08:51 -0400 (EDT) Subject: [openib-general] [librdmacm] rping In-Reply-To: <1150899700.4391.175085.camel@hal.voltaire.com> Message-ID: I was checking rping with the latest stack. The client exits normally, the server still hangs after printing the cq status. server ping data: rdma-ping-9: JKLMNOPQRSTU server DISCONNECT EVENT... wait for RDMA_READ_ADV state 9 cq completion failed status 5 When I kill the process and restart the server I get the following error: rdma_bind_addr error -1 Thanks, Amith From halr at voltaire.com Thu Jun 22 06:24:31 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Jun 2006 09:24:31 -0400 Subject: [openib-general] [PATCH][MINOR] OpenSM/SA client: In osm_vendor_ibumad_sa.c:osmv_query_sa, eliminate redundant code Message-ID: <1150982667.4391.226473.camel@hal.voltaire.com> OpenSM/SA client: In osm_vendor_ibumad_sa.c:osmv_query_sa, eliminate redundant code Signed-off-by: Hal Rosenstock Index: libvendor/osm_vendor_ibumad_sa.c =================================================================== --- libvendor/osm_vendor_ibumad_sa.c (revision 8174) +++ libvendor/osm_vendor_ibumad_sa.c (working copy) @@ -655,7 +655,6 @@ osmv_query_sa( case OSMV_QUERY_ALL_SVC_RECS: osm_log( p_log, OSM_LOG_DEBUG, "osmv_query_sa DBG:001 %s", "SVC_REC_BY_NAME\n" ); - sa_mad_data.method = IB_MAD_METHOD_GETTABLE; sa_mad_data.attr_id = IB_MAD_ATTR_SERVICE_RECORD; sa_mad_data.attr_offset = ib_get_attr_offset( sizeof( ib_service_record_t ) ); @@ -701,7 +700,6 @@ osmv_query_sa( case OSMV_QUERY_NODE_REC_BY_NODE_GUID: osm_log( p_log, OSM_LOG_DEBUG, "osmv_query_sa DBG:001 %s","NODE_REC_BY_NODE_GUID\n" ); - sa_mad_data.method = IB_MAD_METHOD_GETTABLE; sa_mad_data.attr_id = IB_MAD_ATTR_NODE_RECORD; 
sa_mad_data.attr_offset = ib_get_attr_offset( sizeof( ib_node_record_t ) ); From tziporet at mellanox.co.il Thu Jun 22 06:19:39 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 22 Jun 2006 16:19:39 +0300 Subject: [openib-general] SLES9 SP3 support was added Message-ID: <449A98EB.4050501@mellanox.co.il> Hi All, We have added support for SLES9 SP3 that can be used with OFED 1.0. The kernel modules supported are: * mthca * core * CM & CMA * IPoIB * SRP All user level apps and libraries are working too. CPU Architectures supported: * x86 * x86_64 * ia64 The backport patches are available at: https://openib.org/svn/gen2/branches/1.0/ofed/patches/2.6.5-7.244/ There is also a need to take the updated configure and install.sh that add SLES9 specific support. There are no other changes in the package beside these. Is there a need to create a package (1.0.1) with SLES9 support? Tziporet From tziporet at mellanox.co.il Thu Jun 22 05:53:37 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 22 Jun 2006 15:53:37 +0300 Subject: [openib-general] OFED 1.0 - Official Release (Tziporet Koren) In-Reply-To: <20060616100547.13864.qmail@web36915.mail.mud.yahoo.com> References: <20060616100547.13864.qmail@web36915.mail.mud.yahoo.com> Message-ID: <449A92D1.8090404@mellanox.co.il> zhu shi song wrote: > I'm sorry SDP is not in production state. SDP is very > important for our application and we are waiting it > mature enough to be used in our product. And do you > have any schedule to let SDP work ok(especially can > support many large concurrent connections just like > TCP)? I very appreciate I can test new SDP before end > of June. > tks > zhu > > The plan is to have a stable SDP in 1.1 release. The schedule of 1.1 is end of July in the best case (more likely it will be mid-Aug) However we will have RCs before this and we can let you know when many large concurrent connections are supported. 
Tziporet

From swise at opengridcomputing.com Thu Jun 22 08:30:53 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 22 Jun 2006 10:30:53 -0500
Subject: [openib-general] [librdmacm] rping
In-Reply-To: References: Message-ID: <1150990253.3040.7.camel@stevo-desktop>

On Thu, 2006-06-22 at 09:08 -0400, amith rajith mamidala wrote:
> The client exits normally, the
> server still hangs after printing the cq status.
>
> server ping data: rdma-ping-9: JKLMNOPQRSTU
> server DISCONNECT EVENT...
> wait for RDMA_READ_ADV state 9
> cq completion failed status 5
>
> When I kill the process and restart the server I get the following
> error:
>
> rdma_bind_addr error -1
>
From rdreier at cisco.com Thu Jun 22 09:22:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 22 Jun 2006 09:22:33 -0700 Subject: [openib-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This is mostly merging the new iSER (iSCSI over RDMA transport) initiator: Krishna Kumar: IB/uverbs: Don't free wr list when it's known to be empty Or Gerlitz: IB/iser: iSCSI iSER transport provider header file IB/iser: iSCSI iSER transport provider high level code IB/iser: iSER initiator iSCSI PDU and TX/RX IB/iser: iSER RDMA CM (CMA) and IB verbs interaction IB/iser: iSER handling of memory for RDMA IB/iser: iSER Kconfig and Makefile Roland Dreier: IB/uverbs: Remove unnecessary list_del()s drivers/infiniband/Kconfig | 2 drivers/infiniband/Makefile | 1 drivers/infiniband/core/uverbs_cmd.c | 2 drivers/infiniband/core/uverbs_main.c | 6 drivers/infiniband/ulp/iser/Kconfig | 11 drivers/infiniband/ulp/iser/Makefile | 4 drivers/infiniband/ulp/iser/iscsi_iser.c | 790 +++++++++++++++++++++++++ drivers/infiniband/ulp/iser/iscsi_iser.h | 354 +++++++++++ drivers/infiniband/ulp/iser/iser_initiator.c | 738 +++++++++++++++++++++++ drivers/infiniband/ulp/iser/iser_memory.c | 401 +++++++++++++ drivers/infiniband/ulp/iser/iser_verbs.c | 827 ++++++++++++++++++++++++++ drivers/scsi/Makefile | 1 12 files changed, 3130 insertions(+), 7 deletions(-) create mode 100644 drivers/infiniband/ulp/iser/Kconfig create mode 100644 drivers/infiniband/ulp/iser/Makefile create mode 100644 drivers/infiniband/ulp/iser/iscsi_iser.c create mode 100644 drivers/infiniband/ulp/iser/iscsi_iser.h create mode 100644 drivers/infiniband/ulp/iser/iser_initiator.c create mode 100644 drivers/infiniband/ulp/iser/iser_memory.c create mode 100644 
drivers/infiniband/ulp/iser/iser_verbs.c From pradeep at us.ibm.com Thu Jun 22 10:22:01 2006 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Thu, 22 Jun 2006 10:22:01 -0700 Subject: [openib-general] IPoIB multicast Message-ID: Can someone please explain the details of IPoIB multicast? Or, if there is some previous discussion or documentation about it, can I get a pointer? In particular, I am looking to understand the initiation of the multicast join through ipoib_send(); the join completion appears to happen through a MAD callback. How are the corresponding skbs freed? Why is the tx_ring used for send, and what is the mcast->pkt_queue used for? Thanks for all the help. Pradeep pradeep at us.ibm.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From pradeep at us.ibm.com Thu Jun 22 10:22:01 2006 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Thu, 22 Jun 2006 10:22:01 -0700 Subject: [openib-general] Fw: IPoIB multicast Message-ID: I am not sure if this mail got sent out. Please ignore if it is a duplicate. Pradeep pradeep at us.ibm.com ----- Forwarded by Pradeep Satyanarayana/Beaverton/IBM on 06/22/2006 08:50 AM ----- Pradeep Satyanarayana/Beaverton/IBM 06/21/2006 10:28 PM To openib-general at openib.org cc Subject IPoIB multicast Can someone please explain the details of IPoIB multicast? Or, if there is some previous discussion or documentation about it, can I get a pointer? In particular, I am looking to understand the initiation of the multicast join through ipoib_send(); the join completion appears to happen through a MAD callback. How are the corresponding skbs freed? Why is the tx_ring used for send, and what is the mcast->pkt_queue used for? Thanks for all the help. Pradeep pradeep at us.ibm.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ardavis at ichips.intel.com Thu Jun 22 11:12:25 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 22 Jun 2006 11:12:25 -0700 Subject: [openib-general] uCMA kernel slab corruption and oops Message-ID: <449ADD89.6080107@ichips.intel.com> Sean, I am running a couple of iMPI/uDAPL benchmarks at the same time and ran into this: (2.6.17 kernel and svn8112) Jun 22 10:46:51 localhost kernel: Slab corruption: start=ffff8100202458f8, len=512 Jun 22 10:46:51 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 10:46:51 localhost kernel: Last user: [](rdma_destroy_id+0x188/0x193 [rdma_cm]) Jun 22 10:46:51 localhost kernel: 0f0: 6b 6b 6b 6b 6b 6b 6b 6b 18 be 2d 37 00 81 ff ff Jun 22 10:46:51 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 10:46:51 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 10:46:51 localhost kernel: Last user: [](ucma_get_event+0x202/0x21f [rdma_ucm]) Jun 22 10:46:51 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 10:46:51 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 10:46:51 localhost kernel: Next obj: start=ffff810020245b10, len=512 Jun 22 10:46:51 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 10:46:51 localhost kernel: Last user: [](skb_release_data+0x92/0x97) Jun 22 10:46:51 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 10:46:51 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 10:46:53 localhost kernel: Slab corruption: start=ffff8100202458f8, len=512 Jun 22 10:46:53 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 10:46:53 localhost kernel: Last user: [](skb_release_data+0x92/0x97) Jun 22 10:46:53 localhost kernel: 0f0: 40 5c 3c 18 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 10:46:53 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 10:46:53 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. 
Jun 22 10:46:53 localhost kernel: Last user: [](ucma_get_event+0x202/0x21f [rdma_ucm]) Jun 22 10:46:53 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 10:46:53 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 10:46:53 localhost kernel: Next obj: start=ffff810020245b10, len=512 Jun 22 10:46:53 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 10:46:53 localhost kernel: Last user: [](skb_release_data+0x92/0x97) Jun 22 10:46:53 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 10:46:53 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:01:01 localhost kernel: Slab corruption: start=ffff8100202458f8, len=512 Jun 22 11:01:01 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:01:01 localhost kernel: Last user: [](ib_destroy_cm_id+0x23b/0x246 [ib_cm]) Jun 22 11:01:01 localhost kernel: 0f0: d0 79 4c 2d 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:01:01 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 11:01:01 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:01:01 localhost kernel: Last user: [](load_elf_interp+0x411/0x423) Jun 22 11:01:01 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:01:01 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:01:01 localhost kernel: Next obj: start=ffff810020245b10, len=512 Jun 22 11:01:01 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:01:01 localhost kernel: Last user: [](skb_release_data+0x92/0x97) Jun 22 11:01:01 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:01:01 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:33 localhost kernel: Slab corruption: start=ffff8100202458f8, len=512 Jun 22 11:22:33 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. 
Jun 22 11:22:33 localhost kernel: Last user: [](load_elf_interp+0x411/0x423) Jun 22 11:22:33 localhost kernel: 0f0: a0 83 9e 21 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:33 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 11:22:33 localhost kernel: Redzone: 0x170fc2a5/0x170fc2a5. Jun 22 11:22:33 localhost kernel: Last user: [](mthca_create_qp+0x48/0x275 [ib_mthca]) Jun 22 11:22:33 localhost kernel: 000: 00 40 6a 3d 00 81 ff ff 38 96 d4 3a 00 81 ff ff Jun 22 11:22:33 localhost kernel: 010: 48 15 64 29 00 81 ff ff 48 15 64 29 00 81 ff ff Jun 22 11:22:33 localhost kernel: Next obj: start=ffff810020245b10, len=512 Jun 22 11:22:33 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:33 localhost kernel: Last user: [](load_elf_interp+0x411/0x423) Jun 22 11:22:33 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:33 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:43 localhost kernel: Slab corruption: start=ffff8100202458f8, len=512 Jun 22 11:22:43 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:43 localhost kernel: Last user: [](mthca_destroy_qp+0x67/0x70 [ib_mthca]) Jun 22 11:22:43 localhost kernel: 0f0: d8 cd c0 36 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:43 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 11:22:43 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:43 localhost kernel: Last user: [](mthca_destroy_qp+0x67/0x70 [ib_mthca]) Jun 22 11:22:43 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:49 localhost kernel: Slab corruption: start=ffff8100202458f8, len=512 Jun 22 11:22:49 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. 
Jun 22 11:22:49 localhost kernel: Last user: [](ib_destroy_cm_id+0x23b/0x246 [ib_cm]) Jun 22 11:22:49 localhost kernel: 0f0: d8 cd c0 36 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:49 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 11:22:49 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:49 localhost kernel: Last user: [](mthca_destroy_qp+0x67/0x70 [ib_mthca]) Jun 22 11:22:49 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:49 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:49 localhost kernel: Next obj: start=ffff810020245b10, len=512 Jun 22 11:22:49 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:49 localhost kernel: Last user: [](mthca_destroy_qp+0x67/0x70 [ib_mthca]) Jun 22 11:22:49 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:49 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:51 localhost kernel: Slab corruption: start=ffff8100202458f8, len=512 Jun 22 11:22:51 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:51 localhost kernel: Last user: [](cm_free_work+0x23/0x2a [ib_cm]) Jun 22 11:22:51 localhost kernel: 0f0: d0 79 4c 2d 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:51 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 11:22:51 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:51 localhost kernel: Last user: [](mthca_destroy_qp+0x67/0x70 [ib_mthca]) Jun 22 11:22:51 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:51 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:51 localhost kernel: Next obj: start=ffff810020245b10, len=512 Jun 22 11:22:51 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. 
Jun 22 11:22:51 localhost kernel: Last user: [](mthca_destroy_qp+0x67/0x70 [ib_mthca]) Jun 22 11:22:51 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:51 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:53 localhost kernel: Slab corruption: start=ffff8100202458f8, len=512 Jun 22 11:22:53 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:53 localhost kernel: Last user: [](__ib_umem_release+0xac/0xd0 [ib_uverbs]) Jun 22 11:22:53 localhost kernel: 0f0: d0 79 4c 2d 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:53 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 11:22:53 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:53 localhost kernel: Last user: [](skb_release_data+0x92/0x97) Jun 22 11:22:53 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:53 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:53 localhost kernel: Next obj: start=ffff810020245b10, len=512 Jun 22 11:22:53 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:22:53 localhost kernel: Last user: [](mthca_destroy_qp+0x67/0x70 [ib_mthca]) Jun 22 11:22:53 localhost kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:22:53 localhost kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:23:04 localhost kernel: Slab corruption: start=ffff8100202458f8, len=512 Jun 22 11:23:04 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:23:04 localhost kernel: Last user: [](load_elf_binary+0xf11/0x16ef) Jun 22 11:23:04 localhost kernel: 0f0: e8 39 17 1c 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b Jun 22 11:23:04 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 11:23:04 localhost kernel: Redzone: 0x170fc2a5/0x170fc2a5. 
Jun 22 11:23:04 localhost kernel: Last user: [](cm_create_timewait_info+0x1b/0x6b [ib_cm]) Jun 22 11:23:04 localhost kernel: 000: 00 00 00 00 00 00 00 00 e8 56 24 20 00 81 ff ff Jun 22 11:23:04 localhost kernel: 010: e8 56 24 20 00 81 ff ff c5 a4 06 88 ff ff ff ff Jun 22 11:23:04 localhost kernel: Next obj: start=ffff810020245b10, len=512 Jun 22 11:23:04 localhost kernel: Redzone: 0x170fc2a5/0x170fc2a5. Jun 22 11:23:04 localhost kernel: Last user: [](cm_create_timewait_info+0x1b/0x6b [ib_cm]) Jun 22 11:23:04 localhost kernel: 000: 00 00 00 00 00 00 00 00 18 5b 24 20 00 81 ff ff Jun 22 11:23:04 localhost kernel: 010: 18 5b 24 20 00 81 ff ff c5 a4 06 88 ff ff ff ff Jun 22 11:23:23 localhost kernel: general protection fault: 0000 [1] SMP Jun 22 11:23:23 localhost kernel: CPU 0 Jun 22 11:23:23 localhost kernel: Modules linked in: rdma_ucm rdma_cm ib_addr ib_local_sa findex ib_ucm ib_cm ib_umad ib_uverbs ib_ipoib ib_multicast ib_sa ib_mthca ib_mad ib_core ixgb Jun 22 11:23:23 localhost kernel: Pid: 4078, comm: ib_cm/0 Not tainted 2.6.17 #1 Jun 22 11:23:23 localhost kernel: RIP: 0010:[] {rb_erase+465} Jun 22 11:23:23 localhost kernel: RSP: 0000:ffff810034ba3d58 EFLAGS: 00010002 Jun 22 11:23:23 localhost kernel: RAX: 6b6b6b6b6b6b6b6b RBX: ffff8100202459d0 RCX: ffff8100202459d0 Jun 22 11:23:23 localhost kernel: RDX: ffff810020245be8 RSI: 0000000000000000 RDI: 0000000000000000 Jun 22 11:23:23 localhost kernel: RBP: ffff810034ba3d68 R08: 0000000000000000 R09: 0000000000000000 Jun 22 11:23:23 localhost kernel: R10: ffff8100202458f8 R11: 0000000000000200 R12: ffffffff8806e750 Jun 22 11:23:23 localhost kernel: R13: ffff810020245b10 R14: ffff810020245b10 R15: 0000000000000282 Jun 22 11:23:23 localhost kernel: FS: 0000000000000000(0000) GS:ffffffff806ef000(0000) knlGS:0000000000000000 Jun 22 11:23:23 localhost kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b Jun 22 11:23:23 localhost kernel: CR2: 0000000000b373f0 CR3: 000000001ffaf000 CR4: 00000000000006e0 Jun 22 11:23:23 
localhost kernel: Process ib_cm/0 (pid: 4078, threadinfo ffff810034ba2000, task ffff810034fb03c0) Jun 22 11:23:23 localhost kernel: Stack: ffff810020245b10 0000000000000286 ffff810034ba3d88 ffffffff88067828 Jun 22 11:23:23 localhost kernel: ffff810020245b10 ffff810020245b18 ffff810034ba3e18 ffffffff8806b4f5 Jun 22 11:23:23 localhost kernel: ffff810034bf2b70 ffff810034ba2000 Jun 22 11:23:23 localhost kernel: Call Trace: {:ib_cm:cm_cleanup_timewait+101} Jun 22 11:23:23 localhost kernel: {:ib_cm:cm_work_handler+4144} {__wake_up+67} Jun 22 11:23:23 localhost kernel: {run_workqueue+184} {:ib_cm:cm_work_handler+0} Jun 22 11:23:23 localhost kernel: {worker_thread+313} {default_wake_function+0} Jun 22 11:23:23 localhost kernel: {default_wake_function+0} {worker_thread+0} Jun 22 11:23:23 localhost kernel: {kthread+215} {child_rip+8} Jun 22 11:23:23 localhost kernel: {kthread+0} {child_rip+0} Jun 22 11:23:23 localhost kernel: Jun 22 11:23:23 localhost kernel: Code: 44 8b 40 08 48 89 c7 45 85 c0 3e 75 1d c7 40 08 01 00 00 00 Jun 22 11:23:23 localhost kernel: RIP {rb_erase+465} RSP Jun 22 11:23:23 localhost kernel: <3>Slab corruption: start=ffff8100202458f8, len=512 Jun 22 11:23:23 localhost kernel: Redzone: 0x5a2cf071/0x5a2cf071. Jun 22 11:23:23 localhost kernel: Last user: [](cm_free_work+0x23/0x2a [ib_cm]) Jun 22 11:23:23 localhost kernel: 0e0: 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00 00 00 00 Jun 22 11:23:23 localhost kernel: Prev obj: start=ffff8100202456e0, len=512 Jun 22 11:23:23 localhost kernel: Redzone: 0x170fc2a5/0x170fc2a5. Jun 22 11:23:23 localhost kernel: Last user: [](rdma_create_id+0x25/0xf2 [rdma_cm]) Jun 22 11:23:23 localhost kernel: 000: 00 40 6a 3d 00 81 ff ff f0 88 d4 3a 00 81 ff ff Jun 22 11:23:23 localhost kernel: 010: 00 00 00 00 00 00 00 00 27 62 08 88 ff ff ff ff Jun 22 11:23:23 localhost kernel: Next obj: start=ffff810020245b10, len=512 Jun 22 11:23:23 localhost kernel: Redzone: 0x170fc2a5/0x170fc2a5. 
Jun 22 11:23:23 localhost kernel: Last user: [](cm_create_timewait_info+0x1b/0x6b [ib_cm]) Jun 22 11:23:23 localhost kernel: 000: 00 00 00 00 00 00 00 00 18 5b 24 20 00 81 ff ff Jun 22 11:23:23 localhost kernel: 010: 18 5b 24 20 00 81 ff ff c5 a4 06 88 ff ff ff ff Jun 22 11:23:29 localhost kernel: NMI Watchdog detected LOCKUP on CPU 1 Jun 22 11:23:29 localhost kernel: CPU 1 Jun 22 11:23:29 localhost kernel: Modules linked in: rdma_ucm rdma_cm ib_addr ib_local_sa findex ib_ucm ib_cm ib_umad ib_uverbs ib_ipoib ib_multicast ib_sa ib_mthca ib_mad ib_core ixgb Jun 22 11:23:29 localhost kernel: Pid: 4079, comm: ib_cm/1 Not tainted 2.6.17 #1 Jun 22 11:23:29 localhost kernel: RIP: 0010:[] {.text.lock.spinlock+14} Jun 22 11:23:29 localhost kernel: RSP: 0018:ffff810034989d10 EFLAGS: 00000086 Jun 22 11:23:29 localhost kernel: RAX: ffff81003d2035e0 RBX: 000000000005ac69 RCX: ffff810034989ed0 Jun 22 11:23:29 localhost kernel: RDX: ffff81003def5190 RSI: 000000000005ac69 RDI: ffffffff8806e720 Jun 22 11:23:29 localhost kernel: RBP: ffff810034989d18 R08: ffff810034988000 R09: 0000000000000001 Jun 22 11:23:29 localhost kernel: R10: 00000000ffffffff R11: 0000000000000003 R12: ffff81003d203680 Jun 22 11:23:29 localhost kernel: R13: 000000000005ac68 R14: ffff810019b0c4d0 R15: 0000000000000282 Jun 22 11:23:29 localhost kernel: FS: 0000000000000000(0000) GS:ffff810037e9e2a8(0000) knlGS:0000000000000000 Jun 22 11:23:29 localhost kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b Jun 22 11:23:29 localhost kernel: CR2: 00002b5009e20d10 CR3: 0000000021544000 CR4: 00000000000006e0 Jun 22 11:23:29 localhost kernel: Process ib_cm/1 (pid: 4079, threadinfo ffff810034988000, task ffff810035de50a0) Jun 22 11:23:29 localhost kernel: Stack: 0000000000000282 ffff810034989d48 ffffffff880673c2 ffff810021198a48 Jun 22 11:23:29 localhost kernel: ffff810019b0c4d0 ffff81003d203680 ffff810019b0c4d0 ffff810034989d88 Jun 22 11:23:29 localhost kernel: ffffffff88069944 0000000000000000 Jun 22 11:23:29 
localhost kernel: Call Trace: {:ib_cm:cm_acquire_id+30} Jun 22 11:23:29 localhost kernel: {:ib_cm:cm_dreq_handler+51} {:ib_cm:cm_work_handler+4051} Jun 22 11:23:29 localhost kernel: {run_workqueue+184} {:ib_cm:cm_work_handler+0} Jun 22 11:23:29 localhost kernel: {worker_thread+313} {default_wake_function+0} Jun 22 11:23:29 localhost kernel: {default_wake_function+0} {worker_thread+0} Jun 22 11:23:29 localhost kernel: {kthread+215} {child_rip+8} Jun 22 11:23:29 localhost kernel: {kthread+0} {child_rip+0} Jun 22 11:23:29 localhost kernel: Jun 22 11:23:29 localhost kernel: Code: 83 3f 00 7e f9 e9 99 fd ff ff e8 2a d1 e4 ff e9 bd fd ff ff Jun 22 11:23:29 localhost kernel: console shuts up ... Jun 22 11:25:13 localhost kernel: NMI Watchdog detected LOCKUP on CPU 0 Jun 22 11:43:23 localhost syslogd 1.4.1: restart. From arlin.r.davis at intel.com Thu Jun 22 11:17:54 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Thu, 22 Jun 2006 11:17:54 -0700 Subject: [openib-general] [PATCH] uDAPL dapl_evd_connection_callback does not support TIMED_OUT event Message-ID: James, Added support for active side TIMED_OUT event from a provider. 
Signed-off by: Arlin Davis ardavis at ichips.intel.com Index: dapl/common/dapl_evd_connection_callb.c =================================================================== --- dapl/common/dapl_evd_connection_callb.c (revision 8166) +++ dapl/common/dapl_evd_connection_callb.c (working copy) @@ -162,48 +162,15 @@ dapl_evd_connection_callback ( break; } case DAT_CONNECTION_EVENT_DISCONNECTED: - { - /* - * EP is now fully disconnected; initiate any post processing - * to reset the underlying QP and get the EP ready for - * another connection - */ - ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; - dapls_ib_disconnect_clean (ep_ptr, DAT_TRUE, ib_cm_event); - dapl_os_unlock (&ep_ptr->header.lock); - - break; - } case DAT_CONNECTION_EVENT_PEER_REJECTED: - { - ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; - dapls_ib_disconnect_clean (ep_ptr, DAT_TRUE, ib_cm_event); - dapl_os_unlock (&ep_ptr->header.lock); - - break; - } case DAT_CONNECTION_EVENT_UNREACHABLE: - { - ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; - dapls_ib_disconnect_clean (ep_ptr, DAT_TRUE, ib_cm_event); - dapl_os_unlock (&ep_ptr->header.lock); - - break; - } case DAT_CONNECTION_EVENT_NON_PEER_REJECTED: - { - ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; - dapls_ib_disconnect_clean (ep_ptr, DAT_TRUE, ib_cm_event); - dapl_os_unlock (&ep_ptr->header.lock); - - break; - } case DAT_CONNECTION_EVENT_BROKEN: + case DAT_CONNECTION_EVENT_TIMED_OUT: { ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; dapls_ib_disconnect_clean (ep_ptr, DAT_FALSE, ib_cm_event); dapl_os_unlock ( &ep_ptr->header.lock ); - break; } case DAT_CONNECTION_REQUEST_EVENT: From bpradip at in.ibm.com Thu Jun 22 11:36:49 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Fri, 23 Jun 2006 00:06:49 +0530 Subject: [openib-general] [librdmacm] rping In-Reply-To: References: Message-ID: <449AE341.7070809@in.ibm.com> amith rajith mamidala wrote: > I was checking rping with the latest stack. 
The client exits normally, the > server still hangs after printing the cq status. I have seen this happening in the following two scenarios: (1) server exits before the client - The client prints the following errors and stays there: client DISCONNECT EVENT... cq completion failed status 5 client: post send error 22 (2) client exits before the server - The o/p is the same as what you get. This behaviour is because of the way the cm_thread() and cq_thread() functions are written. I have coded a fix for this. Will send it tomorrow after some more testing. > > server ping data: rdma-ping-9: JKLMNOPQRSTU > server DISCONNECT EVENT... > wait for RDMA_READ_ADV state 9 > cq completion failed status 5 > > When I kill the process and restart the server I get the following error: > > rdma_bind_addr error -1 You will be able to kill only the rping process. If you look at the 'ps ax' output you will see that lt-rping is in the 'D' state. Hence the bind error. Only a reboot helps. Thanks, Pradipta Kumar. > > > Thanks, > Amith > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From bpradip at in.ibm.com Thu Jun 22 12:18:46 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Fri, 23 Jun 2006 00:48:46 +0530 Subject: [openib-general] [PATCH] rping.c: Fix hang if either the server or the client exits early Message-ID: <20060622191838.GA24554@harry-potter.ibm.com> Reply-To: bpradip at in.ibm.com This patch fixes the problem as reported by Amith. 
Signed-off-by: Pradipta Kumar Banerjee --- Index: rping.c ============================================================================= --- rping.c.org 2006-06-23 00:22:17.000000000 +0530 +++ rping.c 2006-06-23 00:39:06.000000000 +0530 @@ -215,6 +215,7 @@ static int rping_cma_event_handler(struc case RDMA_CM_EVENT_DISCONNECTED: fprintf(stderr, "%s DISCONNECT EVENT...\n", cb->server ? "server" : "client"); sem_post(&cb->sem); + ret = -1; break; case RDMA_CM_EVENT_DEVICE_REMOVAL: From bpradip at in.ibm.com Thu Jun 22 12:23:10 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Fri, 23 Jun 2006 00:53:10 +0530 Subject: [openib-general] resend [PATCH] rping.c: Fix hang if either the server or the client exits early Message-ID: <20060622192259.GA24588@harry-potter.ibm.com> Hi, Please ignore the earlier mail. There were some problems with the mailer. Here is the new one. This patch fixes the problem as reported by Amith. Signed-off-by: Pradipta Kumar Banerjee --- Index: rping.c ============================================================================= --- rping.c.org 2006-06-23 00:22:17.000000000 +0530 +++ rping.c 2006-06-23 00:39:06.000000000 +0530 @@ -215,6 +215,7 @@ static int rping_cma_event_handler(struc case RDMA_CM_EVENT_DISCONNECTED: fprintf(stderr, "%s DISCONNECT EVENT...\n", cb->server ? 
"server" : "client"); sem_post(&cb->sem); + ret = -1; break; case RDMA_CM_EVENT_DEVICE_REMOVAL: From halr at voltaire.com Thu Jun 22 12:46:15 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Jun 2006 15:46:15 -0400 Subject: [openib-general] [PATCH][TRIVIAL] librdmacm/examples/udaddy.c: Fix example name in messages Message-ID: <1151005558.4391.240388.camel@hal.voltaire.com> librdmacm/examples/udaddy.c: Fix example name in messages Signed-off-by: Hal Rosenstock Index: ../../librdmacm/examples/udaddy.c =================================================================== --- ../../librdmacm/examples/udaddy.c (revision 8166) +++ ../../librdmacm/examples/udaddy.c (working copy) @@ -47,8 +47,8 @@ /* * To execute: - * Server: rdma_cmatose - * Client: rdma_cmatose "dst_ip=ip" + * Server: udaddy + * Client: udaddy [server_addr [src_addr]] */ struct cmatest_node { @@ -116,7 +116,7 @@ static int init_node(struct cmatest_node node->pd = ibv_alloc_pd(node->cma_id->verbs); if (!node->pd) { ret = -ENOMEM; - printf("cmatose: unable to allocate PD\n"); + printf("udaddy: unable to allocate PD\n"); goto out; } @@ -124,7 +124,7 @@ static int init_node(struct cmatest_node node->cq = ibv_create_cq(node->cma_id->verbs, cqe, node, 0, 0); if (!node->cq) { ret = -ENOMEM; - printf("cmatose: unable to create CQ\n"); + printf("udaddy: unable to create CQ\n"); goto out; } @@ -140,13 +140,13 @@ static int init_node(struct cmatest_node init_qp_attr.recv_cq = node->cq; ret = rdma_create_qp(node->cma_id, node->pd, &init_qp_attr); if (ret) { - printf("cmatose: unable to create QP: %d\n", ret); + printf("udaddy: unable to create QP: %d\n", ret); goto out; } ret = create_message(node); if (ret) { - printf("cmatose: failed to create messages: %d\n", ret); + printf("udaddy: failed to create messages: %d\n", ret); goto out; } out: @@ -225,7 +225,7 @@ static int addr_handler(struct cmatest_n ret = rdma_resolve_route(node->cma_id, 2000); if (ret) { - printf("cmatose: resolve route failed: %d\n", 
ret); + printf("udaddy: resolve route failed: %d\n", ret); connect_error(); } return ret; @@ -250,7 +250,7 @@ static int route_handler(struct cmatest_ conn_param.retry_count = 5; ret = rdma_connect(node->cma_id, &conn_param); if (ret) { - printf("cmatose: failure connecting: %d\n", ret); + printf("udaddy: failure connecting: %d\n", ret); goto err; } return 0; @@ -287,7 +287,7 @@ static int connect_handler(struct rdma_c conn_param.qp_type = node->cma_id->qp->qp_type; ret = rdma_accept(node->cma_id, &conn_param); if (ret) { - printf("cmatose: failure accepting: %d\n", ret); + printf("udaddy: failure accepting: %d\n", ret); goto err2; } node->connected = 1; @@ -298,7 +298,7 @@ err2: node->cma_id = NULL; connect_error(); err1: - printf("cmatose: failing connection request\n"); + printf("udaddy: failing connection request\n"); rdma_reject(cma_id, NULL, 0); return ret; } @@ -351,7 +351,7 @@ static int cma_handler(struct rdma_cm_id case RDMA_CM_EVENT_CONNECT_ERROR: case RDMA_CM_EVENT_UNREACHABLE: case RDMA_CM_EVENT_REJECTED: - printf("cmatose: event: %d, error: %d\n", event->event, + printf("udaddy: event: %d, error: %d\n", event->event, event->status); connect_error(); ret = event->status; @@ -397,7 +397,7 @@ static int alloc_nodes(void) test.nodes = malloc(sizeof *test.nodes * connections); if (!test.nodes) { - printf("cmatose: unable to allocate memory for test nodes\n"); + printf("udaddy: unable to allocate memory for test nodes\n"); return -ENOMEM; } memset(test.nodes, 0, sizeof *test.nodes * connections); @@ -449,7 +449,7 @@ static int poll_cqs(void) for (done = 0; done < message_count; done += ret) { ret = ibv_poll_cq(test.nodes[i].cq, 8, wc); if (ret < 0) { - printf("cmatose: failed polling CQ: %d\n", ret); + printf("udaddy: failed polling CQ: %d\n", ret); return ret; } @@ -480,10 +480,10 @@ static int run_server(void) struct rdma_cm_id *listen_id; int i, ret; - printf("cmatose: starting server\n"); + printf("udaddy: starting server\n"); ret = 
rdma_create_id(test.channel, &listen_id, &test, RDMA_PS_UDP); if (ret) { - printf("cmatose: listen request failed\n"); + printf("udaddy: listen request failed\n"); return ret; } @@ -491,13 +491,13 @@ static int run_server(void) test.src_in.sin_port = 7174; ret = rdma_bind_addr(listen_id, test.src_addr); if (ret) { - printf("cmatose: bind address failed: %d\n", ret); + printf("udaddy: bind address failed: %d\n", ret); return ret; } ret = rdma_listen(listen_id, 0); if (ret) { - printf("cmatose: failure trying to listen: %d\n", ret); + printf("udaddy: failure trying to listen: %d\n", ret); goto out; } @@ -552,7 +552,7 @@ static int run_client(char *dst, char *s { int i, ret; - printf("cmatose: starting client\n"); + printf("udaddy: starting client\n"); if (src) { ret = get_addr(src, &test.src_in); if (ret) @@ -565,13 +565,13 @@ static int run_client(char *dst, char *s test.dst_in.sin_port = 7174; - printf("cmatose: connecting\n"); + printf("udaddy: connecting\n"); for (i = 0; i < connections; i++) { ret = rdma_resolve_addr(test.nodes[i].cma_id, src ? test.src_addr : NULL, test.dst_addr, 2000); if (ret) { - printf("cmatose: failure getting addr: %d\n", ret); + printf("udaddy: failure getting addr: %d\n", ret); connect_error(); return ret; } From swise at opengridcomputing.com Thu Jun 22 13:24:07 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 22 Jun 2006 15:24:07 -0500 Subject: [openib-general] resend [PATCH] rping.c: Fix hang if either the server or the client exits early In-Reply-To: <20060622192259.GA24588@harry-potter.ibm.com> References: <20060622192259.GA24588@harry-potter.ibm.com> Message-ID: <1151007847.3040.51.camel@stevo-desktop> The goal of adding the return codes was so that the rping program could exit with a status indicating success or failure. Every rping run results in a DISCONNECT event, so I don't think we want to treat that case as an error. 
Also, can you explain why this fixes Amith's problem, which sounded like a process was hanging? Thanks, Steve. On Fri, 2006-06-23 at 00:53 +0530, Pradipta Kumar Banerjee wrote: > Hi, > Please ignore the earlier mail. There were some problems with the mailer. > Here is the new one. > > This patch fixes the problem as reported by Amith. > > Signed-off-by: Pradipta Kumar Banerjee > > --- > > Index: rping.c > ============================================================================= > --- rping.c.org 2006-06-23 00:22:17.000000000 +0530 > +++ rping.c 2006-06-23 00:39:06.000000000 +0530 > @@ -215,6 +215,7 @@ static int rping_cma_event_handler(struc > case RDMA_CM_EVENT_DISCONNECTED: > fprintf(stderr, "%s DISCONNECT EVENT...\n", cb->server ? "server" : "client"); > sem_post(&cb->sem); > + ret = -1; > break; > > case RDMA_CM_EVENT_DEVICE_REMOVAL: From jlentini at netapp.com Thu Jun 22 13:58:57 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 22 Jun 2006 16:58:57 -0400 (EDT) Subject: [openib-general] [PATCH] uDAPL cma - event processing bug In-Reply-To: References: Message-ID: On Wed, 21 Jun 2006, Arlin Davis wrote: > James, > > Fix bug in dapls_ib_get_dat_event() call after adding new > unreachable event. Committed in revision 8180. From jlentini at netapp.com Thu Jun 22 14:13:35 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 22 Jun 2006 17:13:35 -0400 (EDT) Subject: [openib-general] [PATCH] uDAPL dapl_evd_connection_callback does not support TIMED_OUT event In-Reply-To: References: Message-ID: On Thu, 22 Jun 2006, Arlin Davis wrote: > James, > > Added support for active side TIMED_OUT event from a provider. 
Committed revision 8181, but with the different flag values retained: Index: dapl/common/dapl_evd_connection_callb.c =================================================================== --- dapl/common/dapl_evd_connection_callb.c (revision 8109) +++ dapl/common/dapl_evd_connection_callb.c (working copy) @@ -162,34 +162,8 @@ dapl_evd_connection_callback ( break; } case DAT_CONNECTION_EVENT_DISCONNECTED: - { - /* - * EP is now fully disconnected; initiate any post processing - * to reset the underlying QP and get the EP ready for - * another connection - */ - ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; - dapls_ib_disconnect_clean (ep_ptr, DAT_TRUE, ib_cm_event); - dapl_os_unlock (&ep_ptr->header.lock); - - break; - } case DAT_CONNECTION_EVENT_PEER_REJECTED: - { - ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; - dapls_ib_disconnect_clean (ep_ptr, DAT_TRUE, ib_cm_event); - dapl_os_unlock (&ep_ptr->header.lock); - - break; - } case DAT_CONNECTION_EVENT_UNREACHABLE: - { - ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; - dapls_ib_disconnect_clean (ep_ptr, DAT_TRUE, ib_cm_event); - dapl_os_unlock (&ep_ptr->header.lock); - - break; - } case DAT_CONNECTION_EVENT_NON_PEER_REJECTED: { ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; @@ -199,6 +173,7 @@ dapl_evd_connection_callback ( break; } case DAT_CONNECTION_EVENT_BROKEN: + case DAT_CONNECTION_EVENT_TIMED_OUT: { ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECTED; dapls_ib_disconnect_clean (ep_ptr, DAT_FALSE, ib_cm_event); From bos at pathscale.com Thu Jun 22 14:30:08 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 22 Jun 2006 14:30:08 -0700 Subject: [openib-general] ipath verbs does not compile against the latest SVN trunk verbs In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0008057B9B@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0008057B9B@orsmsx408> Message-ID: <1151011808.26502.19.camel@chalcedony.pathscale.com> On Tue, 2006-06-20 at 09:55 -0700, Woodruff, Robert J wrote: > When I 
try to build SVN 8112 I get the following errors trying > to build the ipath verbs. We're a bit out of date on the trunk. We'll be syncing it up RSN. James, Lower the reject debug message level so we don't see warnings when consumers reject. Signed-off by: Arlin Davis ardavis at ichips.intel.com Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 8166) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -359,7 +359,7 @@ static void dapli_cm_active_cb(struct da cm_event = IB_CME_DESTINATION_REJECT; dapl_dbg_log( - DAPL_DBG_TYPE_WARN, + DAPL_DBG_TYPE_CM, " dapli_cm_active_handler: REJECTED reason=%d\n", event->status); From dsnedigar at calpont.com Thu Jun 22 14:56:49 2006 From: dsnedigar at calpont.com (Don Snedigar) Date: Thu, 22 Jun 2006 16:56:49 -0500 Subject: [openib-general] OFED-1.0 fails install on AMD64 Message-ID: <8953B8331AA98041B0C11DBC678AFC0812C7B1@srvemail1.calpont.com> I just downloaded the OFED-1.0 and the install was going fine until ibutils. At that point, the install fails with : Open MPI RPM will be created during the installation process Building ibutils RPM. Please wait... 
Running rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed --mandir=/usr/local/ofed/share/man --cache-file=/var/tmp/OFED/ibutils.cache --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' --define '_mandir %{_prefix}/share/man' --define 'build_root /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm - ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed --mandir=/usr/local/ofed/share/man --cache-file=/var/tmp/OFED/ibutils.cache --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' --define '_mandir %{_prefix}/share/man' --define 'build_root /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" See log file: /tmp/OFED.28656.log I dug down into the log file it indicates and found : g++ -DHAVE_CONFIG_H -I. -I. -I.. -O2 -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc -fPIC -DPIC -o .libs/ibnl_scanner.o ibnl_scanner.ll: In function 'int ibnl_lex()': ibnl_scanner.ll:197: warning: ignoring return value of 'size_t fwrite(const void*, size_t, size_t, FILE*)', declared with attribute warn_unused_result g++ -DHAVE_CONFIG_H -I. -I. -I.. 
-O2 -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc -o ibnl_scanner.o >/dev/null 2>&1 /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo g++ -shared -nostdlib /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crti.o /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtbeginS.o .libs/Fabric.o .libs/SubnMgt.o .libs/TraceRoute.o .libs/CredLoops.o .libs/TopoMatch.o .libs/SysDef.o .libs/LinkCover.o .libs/Congestion.o .libs/ibnl_parser.o .libs/ibnl_scanner.o -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0 -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../.. 
-L/lib/../lib64 -L/usr/lib/../lib64 -lstdc++ -lm -lc -lgcc_s /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtendS.o /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crtn.o -m64 -mtune=nocona -Wl,-soname -Wl,libibdmcom.so.1 -o .libs/libibdmcom.so.1.1.1 /usr/bin/ld: /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a(mt_allocator.o): relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not be used when making a shared object; recompile with -fPIC /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a: could not read symbols: Bad value collect2: ld returned 1 exit status make[3]: *** [libibdmcom.la] Error 1 make[3]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm/datamodel' make[2]: *** [all-recursive] Error 1 make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' make[1]: *** [all] Error 2 make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' make: *** [all-recursive] Error 1 error: Bad exit status from /var/tmp/rpm-tmp.16738 (%install) RPM build errors: Bad exit status from /var/tmp/rpm-tmp.16738 (%install) ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed --mandir=/usr/local/ofed/share/man --cache-file=/var/tmp/OFED/ibutils.cache --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' --define '_mandir %{_prefix}/share/man' --define 'build_root /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" Can anyone shed any light on this ? Machine is dual Opteron, 2 gig memory, kernel 2.6.16 Don Snedigar Calpont Corp. 214-618-9516 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jlentini at netapp.com Thu Jun 22 14:56:03 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 22 Jun 2006 17:56:03 -0400 (EDT) Subject: [openib-general] [Bug 146] OFED-1.0 DAPL fails to build on SLES10 on IA64 with IA64_FETCHADD error In-Reply-To: <20060622215505.2F2CF22873D@openib.ca.sandia.gov> References: <20060622215505.2F2CF22873D@openib.ca.sandia.gov> Message-ID: On Thu, 22 Jun 2006, bugzilla-daemon at openib.org wrote: > http://openib.org/bugzilla/show_bug.cgi?id=146 > > > jlentini at netapp.com changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > Status|NEW |ASSIGNED > > > > > ------- Comment #1 from jlentini at netapp.com 2006-06-22 14:55 ------- > We have code in dapl/udapl/linux/dapl_osd.h that is supposed to handle this. > It looks like this broke when we moved to the autotools. I'll send you a patch > to test. Here's the patch. Thank you for offering to test this. Please let me know if it fixes the problem (I do not have an IA64 SLES system). Index: Makefile.am =================================================================== --- Makefile.am (revision 8109) +++ Makefile.am (working copy) @@ -1,10 +1,11 @@ # $Id: $ +OSFLAGS = -DOS_VERSION=$(shell expr `uname -r | cut -f1 -d.` \* 65536 + `uname -r | cut -f2 -d.`) # Check for RedHat, needed for ia64 udapl atomic operations (IA64_FETCHADD syntax) if OS_RHEL -OSFLAGS=-DREDHAT_EL4 +OSFLAGS += -DREDHAT_EL4 else -OSFLAGS= +OSFLAGS += endif if DEBUG From jlentini at netapp.com Thu Jun 22 15:02:23 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 22 Jun 2006 18:02:23 -0400 (EDT) Subject: [openib-general] [PATCH] uDAPL cma: lower debug level on consumer rejects In-Reply-To: References: Message-ID: On Thu, 22 Jun 2006, Arlin Davis wrote: > James, > > Lower the reject debug message level so we don't see warnings when > consumers reject. Committed in revision 8182.
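[Editor's note: the OS_VERSION macro introduced by the Makefile.am patch above packs the kernel's major and minor version numbers into one integer (major * 65536 + minor). A minimal standalone sketch of that same arithmetic, with the kernel string hardcoded for illustration instead of read from a live `uname -r`:]

```shell
# Sketch of the OS_VERSION encoding from the Makefile.am patch above.
# "2.6.16" is a hardcoded example; the patch uses `uname -r` at build time.
kernel="2.6.16"
major=$(echo "$kernel" | cut -f1 -d.)
minor=$(echo "$kernel" | cut -f2 -d.)
expr "$major" \* 65536 + "$minor"   # prints 131078 (2*65536 + 6)
```

[This gives a single number that version checks — presumably the code in dapl/udapl/linux/dapl_osd.h mentioned above — can compare numerically against a threshold.]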
From paul.lundin at gmail.com Thu Jun 22 15:16:17 2006 From: paul.lundin at gmail.com (Paul) Date: Thu, 22 Jun 2006 18:16:17 -0400 Subject: [openib-general] OFED-1.0 fails install on AMD64 In-Reply-To: <8953B8331AA98041B0C11DBC678AFC0812C7B1@srvemail1.calpont.com> References: <8953B8331AA98041B0C11DBC678AFC0812C7B1@srvemail1.calpont.com> Message-ID: Well taking a couple of stabs in the dark here. What version of redhat/fedora are you using ? I am using rhel 4 update 3 and it uses gcc version 3.4.5-2 by default. It appears as if your system is using 4.0.0. Also do you have any environment variables set ? Such as CFLAGS, CCFLAGS or the like ? For the record the only reason I mention gcc 4x is because it is the only time I have personally seen that error arise. On 6/22/06, Don Snedigar wrote: > > I just downloaded the OFED-1.0 and the install was going fine until > ibutils. At that point, the install fails with : > > Open MPI RPM will be created during the installation process > > > Building ibutils RPM. Please wait... 
> > Running rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define > 'configure_options --prefix=/usr/local/ofed > --mandir=/usr/local/ofed/share/man --cache-file=/var/tmp/OFED/ibutils.cache > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define '_prefix > /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' --define '_mandir > %{_prefix}/share/man' --define 'build_root /var/tmp/OFED' > /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm > - > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > --mandir=/usr/local/ofed/share/man --cache-file=/var/tmp/OFED/ibutils.cache > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define '_prefix > /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' --define '_mandir > %{_prefix}/share/man' --define 'build_root /var/tmp/OFED' > /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > > See log file: /tmp/OFED.28656.log > > I dug down into the log file it indicates and found : > > g++ -DHAVE_CONFIG_H -I. -I. -I.. -O2 > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT ibnl_scanner.lo > -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc -fPIC -DPIC -o > .libs/ibnl_scanner.o > ibnl_scanner.ll: In function 'int ibnl_lex()': > ibnl_scanner.ll:197: warning: ignoring return value of 'size_t > fwrite(const void*, size_t, size_t, FILE*)', declared with attribute > warn_unused_result > g++ -DHAVE_CONFIG_H -I. -I. -I.. 
-O2 > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT ibnl_scanner.lo > -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc -o ibnl_scanner.o > >/dev/null 2>&1 > /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 > -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" > Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo > LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo > g++ -shared -nostdlib > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crti.o > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtbeginS.o .libs/Fabric.o > .libs/SubnMgt.o .libs/TraceRoute.o .libs/CredLoops.o .libs/TopoMatch.o > .libs/SysDef.o .libs/LinkCover.o .libs/Congestion.o .libs/ibnl_parser.o > .libs/ibnl_scanner.o -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0 > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64 > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../..
-L/lib/../lib64 > -L/usr/lib/../lib64 -lstdc++ -lm -lc -lgcc_s > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtendS.o > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crtn.o -m64 > -mtune=nocona -Wl,-soname -Wl,libibdmcom.so.1 -o .libs/libibdmcom.so.1.1.1 > /usr/bin/ld: > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a(mt_allocator.o): > relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not be > used when making a shared object; recompile with -fPIC > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a: could not read > symbols: Bad value > collect2: ld returned 1 exit status > make[3]: *** [libibdmcom.la] Error 1 > make[3]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0 > /ibdm/datamodel' > make[2]: *** [all-recursive] Error 1 > make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' > make[1]: *** [all] Error 2 > make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' > make: *** [all-recursive] Error 1 > error: Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > > > RPM build errors: > Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > --mandir=/usr/local/ofed/share/man --cache-file=/var/tmp/OFED/ibutils.cache > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define '_prefix > /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' --define '_mandir > %{_prefix}/share/man' --define 'build_root /var/tmp/OFED' > /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > Can anyone shed any light on this ? > > Machine is dual Opteron, 2 gig memory, kernel 2.6.16 > > Don Snedigar > Calpont Corp. 
> 214-618-9516 > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dsnedigar at calpont.com Thu Jun 22 15:35:43 2006 From: dsnedigar at calpont.com (Don Snedigar) Date: Thu, 22 Jun 2006 17:35:43 -0500 Subject: [openib-general] OFED-1.0 fails install on AMD64 Message-ID: <8953B8331AA98041B0C11DBC678AFC0812C7C0@srvemail1.calpont.com> Actually, its FSM Labs v 2.2.3 with the 2.6.16 kernel. We had FC4 on the box, but then added RTLinuxPro on the box. Yes, gcc is version 4 (gcc --version gives 4.0.0 20050519 (Red Hat 4.0.0-8) Only environment variables set would be the ones that the install script sets itself.\ don ________________________________ From: Paul [mailto:paul.lundin at gmail.com] Sent: Thursday, June 22, 2006 5:16 PM To: Don Snedigar Cc: openib-general at openib.org Subject: Re: [openib-general] OFED-1.0 fails install on AMD64 Well taking a couple of stabs in the dark here. What version of redhat/fedora are you using ? I am using rhel 4 update 3 and it uses gcc version 3.4.5-2 by default. It appears as if your system is using 4.0.0. Also do you have any environment variables set ? Such as CFLAGS, CCFLAGS or the like ? For the record the only reason I mention gcc 4x is because it is the only time I have personally seen that error arise. On 6/22/06, Don Snedigar wrote: I just downloaded the OFED-1.0 and the install was going fine until ibutils. At that point, the install fails with : Open MPI RPM will be created during the installation process Building ibutils RPM. Please wait... 
Running rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed --mandir=/usr/local/ofed/share/man --cache-file=/var/tmp/OFED/ibutils.cache --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' --define '_mandir %{_prefix}/share/man' --define 'build_root /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm - ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed --mandir=/usr/local/ofed/share/man --cache-file=/var/tmp/OFED/ibutils.cache --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' --define '_mandir %{_prefix}/share/man' --define 'build_root /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" See log file: /tmp/OFED.28656.log I dug down into the log file it indicates and found : g++ -DHAVE_CONFIG_H -I. -I. -I.. -O2 -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc -fPIC -DPIC -o .libs/ibnl_scanner.o ibnl_scanner.ll: In function 'int ibnl_lex()': ibnl_scanner.ll:197: warning: ignoring return value of 'size_t fwrite(const void*, size_t, size_t, FILE*)', declared with attribute warn_unused_result g++ -DHAVE_CONFIG_H -I. -I. -I.. 
-O2 -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc -o ibnl_scanner.o >/dev/null 2>&1 /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo g++ -shared -nostdlib /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crti.o /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtbeginS.o .libs/Fabric.o .libs/SubnMgt.o .libs/TraceRoute.o .libs/CredLoops.o .libs/TopoMatch.o .libs/SysDef.o .libs/LinkCover.o .libs/Congestion.o .libs/ibnl_parser.o .libs/ibnl_scanner.o -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0 -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../.. 
-L/lib/../lib64 -L/usr/lib/../lib64 -lstdc++ -lm -lc -lgcc_s /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtendS.o /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crtn.o -m64 -mtune=nocona -Wl,-soname -Wl,libibdmcom.so.1 -o .libs/libibdmcom.so.1.1.1 /usr/bin/ld: /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a(mt_allocator.o): relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not be used when making a shared object; recompile with -fPIC /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a: could not read symbols: Bad value collect2: ld returned 1 exit status make[3]: *** [libibdmcom.la] Error 1 make[3]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm/datamodel' make[2]: *** [all-recursive] Error 1 make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' make[1]: *** [all] Error 2 make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' make: *** [all-recursive] Error 1 error: Bad exit status from /var/tmp/rpm-tmp.16738 (%install) RPM build errors: Bad exit status from /var/tmp/rpm-tmp.16738 (%install) ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed --mandir=/usr/local/ofed/share/man --cache-file=/var/tmp/OFED/ibutils.cache --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' --define '_mandir %{_prefix}/share/man' --define 'build_root /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" Can anyone shed any light on this ? Machine is dual Opteron, 2 gig memory, kernel 2.6.16 Don Snedigar Calpont Corp. 214-618-9516 _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From viswa.krish at gmail.com Thu Jun 22 16:18:17 2006 From: viswa.krish at gmail.com (Viswanath Krishnamurthy) Date: Thu, 22 Jun 2006 16:18:17 -0700 Subject: [openib-general] Disabling end-to-end flow control Message-ID: <4df28be40606221618o17bee45bg2289fab53985d168@mail.gmail.com> Is there a way to disable end-to-end flowcontrol using any of the API's ? Thanks, -Viswa -------------- next part -------------- An HTML attachment was scrubbed... URL: From dddownload at web.de Fri Jun 23 01:06:40 2006 From: dddownload at web.de (Torsten Boob) Date: Fri, 23 Jun 2006 10:06:40 +0200 Subject: [openib-general] NFS/RDMA Message-ID: <20060623100640.2119dc38@matrix.tuxianer.homelinux.net> Hello, i have Problems to set up a nfsrdma connection (nfsrdma update 5). ./nfsrdmamount -o rdma=192.168.99.1 192.168.99.1:/nfs /mnt/nfs has following output on the client. RPC: xprt_setup_rdma: 192.168.99.1:2049 unexpected event received for QP=ffff81013eaaea00, event =4 svc_rdma_recvfrom: transport ffff81013f54b600 is closing svc_rdma_recvfrom: transport ffff81013f54b600 is closing svc_rdma_put: Destroying transport ffff81013f54b600, cm_id=ffff81013f54b400, sk_flags=54, sk_inuse=0 nfs: RPC call returned error 103 nfsmount: Software caused connection abort Using normal nfs with mount -t nfs 192.168.99.1:/nfs /mnt/nfs results in mount: 192.168.99.1:/nfs: can't read superblock Same results with openib svn {20060516} Tested with Debian Sarge and Etch. Any ideas ? Torsten From eitan at mellanox.co.il Fri Jun 23 01:48:14 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 23 Jun 2006 11:48:14 +0300 Subject: [openib-general] OFED-1.0 fails install on AMD64 In-Reply-To: <8953B8331AA98041B0C11DBC678AFC0812C7B1@srvemail1.calpont.com> References: <8953B8331AA98041B0C11DBC678AFC0812C7B1@srvemail1.calpont.com> Message-ID: <449BAACE.6000609@mellanox.co.il> Hi Don, Sorry for my late response. 
ibutils compilation (of libibdmcom) is breaking with the error message: > relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not be > used when making a shared object; recompile with -fPIC For the command: > /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 > -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o > libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" > Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo > LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo > g++ -shared -nostdlib So obviously one has to figure out why -shared did not cause -fPIC ? Also not clear why this does not break on other machines. Anyways, reproducing the problem is my first target. One obvious thing to try is to set CFLAGS=-fPIC As I do not have access to the exact type of your machine : FSM Labs v 2.2.3 with the 2.6.16 kernel (as the weekend started over here) I guess I will be able to reproduce only Sun/Mon. Eitan Don Snedigar wrote: > I just downloaded the OFED-1.0 and the install was going fine until > ibutils. At that point, the install fails with : > > Open MPI RPM will be created during the installation process > > > Building ibutils RPM. Please wait...
> > Running rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define > 'configure_options --prefix=/usr/local/ofed > --mandir=/usr/local/ofed/share/man > --cache-file=/var/tmp/OFED/ibutils.cache > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > --define '_mandir %{_prefix}/share/man' --define 'build_root > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm > - > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > --mandir=/usr/local/ofed/share/man > --cache-file=/var/tmp/OFED/ibutils.cache > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > --define '_mandir %{_prefix}/share/man' --define 'build_root > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > > See log file: /tmp/OFED.28656.log > > > I dug down into the log file it indicates and found : > > g++ -DHAVE_CONFIG_H -I. -I. -I.. -O2 > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT > ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc > - -o .libs/ibnl_scanner.o > ibnl_scanner.ll: In function 'int ibnl_lex()': > ibnl_scanner.ll:197: warning: ignoring return value of 'size_t > fwrite(const void*, size_t, size_t, FILE*)', declared with attribute > warn_unused_result > g++ -DHAVE_CONFIG_H -I. -I. -I.. 
-O2 > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT > ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc -o > ibnl_scanner.o >/dev/null 2>&1 > /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 > -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o > libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" > Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo > LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo > g++ -shared -nostdlib > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crti.o > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtbeginS.o .libs/Fabric.o > .libs/SubnMgt.o .libs/TraceRoute.o .libs/CredLoops.o .libs/TopoMatch.o > .libs/SysDef.o .libs/LinkCover.o .libs/Congestion.o .libs/ibnl_parser.o > .libs/ibnl_scanner.o -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0 > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64 > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../.. 
-L/lib/../lib64 > -L/usr/lib/../lib64 -lstdc++ -lm -lc -lgcc_s > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtendS.o > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crtn.o -m64 > -mtune=nocona -Wl,-soname -Wl,libibdmcom.so.1 -o > .libs/libibdmcom.so.1.1.1 > /usr/bin/ld: > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a(mt_allocator.o): > relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not be > used when making a shared object; recompile with -fPIC > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a: could not read > symbols: Bad value > collect2: ld returned 1 exit status > make[3]: *** [libibdmcom.la] Error 1 > make[3]: Leaving directory > `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm/datamodel' > make[2]: *** [all-recursive] Error 1 > make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' > make[1]: *** [all] Error 2 > make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' > make: *** [all-recursive] Error 1 > error: Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > > > RPM build errors: > Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > --mandir=/usr/local/ofed/share/man > --cache-file=/var/tmp/OFED/ibutils.cache > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > --define '_mandir %{_prefix}/share/man' --define 'build_root > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > > Can anyone shed any light on this ? > > Machine is dual Opteron, 2 gig memory, kernel 2.6.16 > > Don Snedigar > Calpont Corp. 
> 214-618-9516 > > > > > ------------------------------------------------------------------------ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From zhushisongzhu at yahoo.com Fri Jun 23 03:01:46 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Fri, 23 Jun 2006 03:01:46 -0700 (PDT) Subject: [openib-general] OFED 1.0 - Official Release (Tziporet Koren) In-Reply-To: <449A92D1.8090404@mellanox.co.il> Message-ID: <20060623100146.49805.qmail@web36911.mail.mud.yahoo.com> thank you very much SDP is very good concept. We can port legacy applications to support infiniband and develop new applications easily and quickly. Good luck and waiting for your good news. I'm urgent to deploy infiniband cards for our real production system. zhu --- Tziporet Koren wrote: > zhu shi song wrote: > > I'm sorry SDP is not in production state. SDP is > very > > important for our application and we are waiting > it > > mature enough to be used in our product. And do > you > > have any schedule to let SDP work ok(especially > can > > support many large concurrent connections just > like > > TCP)? I very appreciate I can test new SDP before > end > > of June. > > tks > > zhu > > > > > The plan is to have a stable SDP in 1.1 release. > The schedule of 1.1 is end of July in the best case > (more likely it will > be mid-Aug) > However we will have RCs before this and we can let > you know when many > large concurrent connections are supported. > > Tziporet > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! 
Mail has the best spam protection around http://mail.yahoo.com From sean.hefty at intel.com Fri Jun 23 04:51:21 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 23 Jun 2006 04:51:21 -0700 Subject: [openib-general] Disabling end-to-end flow control In-Reply-To: <4df28be40606221618o17bee45bg2289fab53985d168@mail.gmail.com> Message-ID: <000401c696bb$5acd42d0$f0791cac@amr.corp.intel.com> Is there a way to disable end-to-end flowcontrol using any of the API's ? I believe that all of the APIs (verbs, ib_cm, rdma_cm) let the user specify whether flow control is enabled. - Sean -------------- next part -------------- An HTML attachment was scrubbed... URL: From sean.hefty at intel.com Fri Jun 23 04:52:46 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 23 Jun 2006 04:52:46 -0700 Subject: [openib-general] [PATCH][TRIVIAL] librdmacm/examples/udaddy.c: Fix example name in messages In-Reply-To: <1151005558.4391.240388.camel@hal.voltaire.com> Message-ID: <000901c696bb$8dd5af00$f0791cac@amr.corp.intel.com> >librdmacm/examples/udaddy.c: Fix example name in messages > >Signed-off-by: Hal Rosenstock Thanks - if you haven't, can you commit this as well? (My connection is _really_ slow at the moment...) - Sean From sean.hefty at intel.com Fri Jun 23 05:00:44 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 23 Jun 2006 05:00:44 -0700 Subject: [openib-general] uCMA kernel slab corruption and oops In-Reply-To: <449ADD89.6080107@ichips.intel.com> Message-ID: <000b01c696bc$aad5e380$f0791cac@amr.corp.intel.com> I will look into this next week. 
- Sean From halr at voltaire.com Fri Jun 23 05:15:05 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Jun 2006 08:15:05 -0400 Subject: [openib-general] [PATCH][TRIVIAL] librdmacm/examples/udaddy.c: Fix example name in messages In-Reply-To: <000901c696bb$8dd5af00$f0791cac@amr.corp.intel.com> References: <000901c696bb$8dd5af00$f0791cac@amr.corp.intel.com> Message-ID: <1151064898.4391.279481.camel@hal.voltaire.com> On Fri, 2006-06-23 at 07:52, Sean Hefty wrote: > >librdmacm/examples/udaddy.c: Fix example name in messages > > > >Signed-off-by: Hal Rosenstock > > Thanks - if you haven't, can you commit this as well? (My connection is > _really_ slow at the moment...) Sure; committed in r8187. -- Hal > - Sean From bpradip at in.ibm.com Fri Jun 23 05:50:27 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Fri, 23 Jun 2006 18:20:27 +0530 Subject: [openib-general] resend [PATCH] rping.c: Fix hang if either the server or the client exits early In-Reply-To: <1151007847.3040.51.camel@stevo-desktop> References: <20060622192259.GA24588@harry-potter.ibm.com> <1151007847.3040.51.camel@stevo-desktop> Message-ID: <449BE393.3020308@in.ibm.com> Steve Wise wrote: > The goal of adding the return codes was so that the rping program could > exit with a status indicating success or failure. Every rping run > results in a DISCONNECT event, so I don't think we want to treat that > case as an error. A DISCONNECT event will be generated when the connection is closed or in case of some error (like CCAE_LLP_CONNECTION_LOST or CCAE_BAD_CLOSE in the case of the Ammasso driver, etc.). > > > Also, can you explain why this fixes Amith's problem, which sounded like > a process was hanging? > On debugging I found that the main thread was blocked in ibv_destroy_cq(), cm_thread was blocked in rdma_get_cm_event->write() and cq_thread was blocked in ibv_get_cq_event->read. Taking the return value of the DISCONNECT event into consideration forcefully killed the process.
On delving deeper into this problem, I think that there is more to this rping hang. Let me work on this further. On a related note - I noticed another rping hang in the following case - Start the rping as a client without first starting an rping server - If you are lucky the first run itself will result in the 'lt-rping' process in 'D' state. If not, repeating the procedure will result in the hang. This is the output: cq completion failed status 5 wait for CONNECTED state 10 connect error -1 Thanks, Pradipta. > > Thanks, > > Steve. > > > > On Fri, 2006-06-23 at 00:53 +0530, Pradipta Kumar Banerjee wrote: >> Hi, >> Please ignore the earlier mail. There were some problems with the mailer. >> Here is the new one. >> >> This patch fixes the problem as reported by Amith. >> >> Signed-off-by: Pradipta Kumar Banerjee >> >> --- >> >> Index: rping.c >> ============================================================================= >> --- rping.c.org 2006-06-23 00:22:17.000000000 +0530 >> +++ rping.c 2006-06-23 00:39:06.000000000 +0530 >> @@ -215,6 +215,7 @@ static int rping_cma_event_handler(struc >> case RDMA_CM_EVENT_DISCONNECTED: >> fprintf(stderr, "%s DISCONNECT EVENT...\n", cb->server ? "server" : "client"); >> sem_post(&cb->sem); >> + ret = -1; >> break; >> >> case RDMA_CM_EVENT_DEVICE_REMOVAL: > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From swise at opengridcomputing.com Fri Jun 23 06:44:50 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 08:44:50 -0500 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver.
In-Reply-To: <1150836226.2891.231.camel@laptopd505.fenrus.org> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> Message-ID: <1151070290.7808.33.camel@stevo-desktop> > > Also on a related note, have you checked the driver for the needed PCI > posting flushes? > > > + > > + /* Disable IRQs by clearing the interrupt mask */ > > + writel(1, c2dev->regs + C2_IDIS); > > + writel(0, c2dev->regs + C2_NIMR0); > > like here... This code is followed by a call to c2_reset(), which interacts with the firmware on the adapter to quiesce the hardware. So I don't think we need to wait here for the posted writes to flush... > > + > > + elem = tx_ring->to_use; > > + elem->skb = skb; > > + elem->mapaddr = mapaddr; > > + elem->maplen = maplen; > > + > > + /* Tell HW to xmit */ > > + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_TXP_ADDR); > > + __raw_writew(cpu_to_be16(maplen), elem->hw_desc + C2_TXP_LEN); > > + __raw_writew(cpu_to_be16(TXP_HTXD_READY), elem->hw_desc + C2_TXP_FLAGS); > > or here > No need here. This logic submits the packet for transmission. We don't assume it is transmitted until we (after a completion interrupt usually) read back the HTXD entry and see the TXP_HTXD_DONE bit set (see c2_tx_interrupt()). Steve. From jlentini at netapp.com Fri Jun 23 06:48:59 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 23 Jun 2006 09:48:59 -0400 (EDT) Subject: [openib-general] NFS/RDMA In-Reply-To: <20060623100640.2119dc38@matrix.tuxianer.homelinux.net> References: <20060623100640.2119dc38@matrix.tuxianer.homelinux.net> Message-ID: Replies below: On Fri, 23 Jun 2006, Torsten Boob wrote: > Hello, > > I have problems setting up an nfsrdma connection (nfsrdma update 5). > > ./nfsrdmamount -o rdma=192.168.99.1 192.168.99.1:/nfs /mnt/nfs > > has the following output on the client. The first message is from the client code...
> RPC: xprt_setup_rdma: 192.168.99.1:2049 but these messages are from the server code... > unexpected event received for QP=ffff81013eaaea00, event =4 > svc_rdma_recvfrom: transport ffff81013f54b600 is closing > svc_rdma_recvfrom: transport ffff81013f54b600 is closing > svc_rdma_put: Destroying transport ffff81013f54b600, cm_id=ffff81013f54b400, sk_flags=54, sk_inuse=0 and these are from the client code. > nfs: RPC call returned error 103 > nfsmount: Software caused connection abort Are you trying to mount to and from the same host? > Using normal nfs with > > mount -t nfs 192.168.99.1:/nfs /mnt/nfs > > results in > > mount: 192.168.99.1:/nfs: can't read superblock It looks like you have a configuration error unrelated to RDMA. If you're looking for documentation on setting up NFS, I'd recommend this: http://nfs.sourceforge.net/nfs-howto/index.html > Same results with openib svn {20060516} > Tested with Debian Sarge and Etch. > > Any ideas ? We've seen the unexpected event received for QP=ffff81013eaaea00, event =4 message once before. There is a timing issue between the NFS-RDMA server and the RDMA stack that we've only seen on IA64 systems to date. What type of hardware are you using? We are working on a fix for this now. From arjan at infradead.org Fri Jun 23 06:48:52 2006 From: arjan at infradead.org (Arjan van de Ven) Date: Fri, 23 Jun 2006 15:48:52 +0200 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. 
In-Reply-To: <1151070290.7808.33.camel@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> <1151070290.7808.33.camel@stevo-desktop> Message-ID: <1151070532.3204.10.camel@laptopd505.fenrus.org> > > > + /* Tell HW to xmit */ > > > + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_TXP_ADDR); > > > + __raw_writew(cpu_to_be16(maplen), elem->hw_desc + C2_TXP_LEN); > > > + __raw_writew(cpu_to_be16(TXP_HTXD_READY), elem->hw_desc + C2_TXP_FLAGS); > > > > or here > > > > No need here. This logic submits the packet for transmission. We don't > assume it is transmitted until we (after a completion interrupt usually) > read back the HTXD entry and see the TXP_HTXD_DONE bit set (see > c2_tx_interrupt()). ... but will that interrupt happen at all if these 3 writes never hit the hardware? From swise at opengridcomputing.com Fri Jun 23 06:56:45 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 08:56:45 -0500 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. In-Reply-To: <1151070532.3204.10.camel@laptopd505.fenrus.org> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> <1151070290.7808.33.camel@stevo-desktop> <1151070532.3204.10.camel@laptopd505.fenrus.org> Message-ID: <1151071005.7808.39.camel@stevo-desktop> On Fri, 2006-06-23 at 15:48 +0200, Arjan van de Ven wrote: > > > > + /* Tell HW to xmit */ > > > > + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_TXP_ADDR); > > > > + __raw_writew(cpu_to_be16(maplen), elem->hw_desc + C2_TXP_LEN); > > > > + __raw_writew(cpu_to_be16(TXP_HTXD_READY), elem->hw_desc + C2_TXP_FLAGS); > > > > > > or here > > > > > > > No need here. This logic submits the packet for transmission. 
We don't > > assume it is transmitted until we (after a completion interrupt usually) > > read back the HTXD entry and see the TXP_HTXD_DONE bit set (see > > c2_tx_interrupt()). > > ... but will that interrupt happen at all if these 3 writes never hit > the hardware? > I thought the posted write WILL eventually get to adapter memory. Not stall forever cached in a bridge. I'm wrong? My point is that for a given HTXD entry, we write it to post a packet for transmission, then only free the packet memory and reuse this entry _after_ reading the HTXD and seeing the DONE bit set. So I still don't see a problem. But I've been wrong before ;-) Steve. From swise at opengridcomputing.com Fri Jun 23 07:02:24 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:02:24 -0500 Subject: [openib-general] resend [PATCH] rping.c: Fix hang if either the server or the client exits early In-Reply-To: <449BE393.3020308@in.ibm.com> References: <20060622192259.GA24588@harry-potter.ibm.com> <1151007847.3040.51.camel@stevo-desktop> <449BE393.3020308@in.ibm.com> Message-ID: <1151071344.7808.42.camel@stevo-desktop> On Fri, 2006-06-23 at 18:20 +0530, Pradipta Kumar Banerjee wrote: > Steve Wise wrote: > > The goal of adding the return codes was so that the rping program could > > exit with a status indicating success or failure. Every rping run > > results in a DISCONNECT event, so I don't think we want to treat that > > case as an error. > DISCONNECT event will be generated when the connection is closed or in case of > some error (like CCAE_LLP_CONNECTION_LOST, CCAE_BAD_CLOSE in case of Ammasso > driver etc). > > You'll also get the DISCONNECT event when one side finishes the rping loops and does rdma_disconnect(). So receiving that event isn't necessarily an error... > > > > Also, can you explain why this fixes Amith's problem, which sounded like > > a process was hanging?
> > > On debugging I found that the main thread was blocked in ibv_destroy_cq(), > cm_thread was blocked in rdma_get_cm_event->write() and cq_thread was blocked in > ibv_get_cq_event->read > Taking the return value of the DISCONNECT event into consideration forcefully > killed the process. > On delving deeper into this problem, I think that there is more to this rping > hang. Let me work on this further. > I think rping needs some coordination on these threads and when they should be killed. > On a related note - I noticed another rping hang in the following case > - Start the rping as a client without first starting an rping server > - If you are lucky the first run itself will result in the 'lt-rping' process in > 'D' state. If not repeating the procedure will result in the hang. > > This is the o/p. > > cq completion failed status 5 > wait for CONNECTED state 10 > connect error -1 > > Thanks, > Pradipta. > > > > > > > Thanks, > > > > Steve. > > > > > > > > On Fri, 2006-06-23 at 00:53 +0530, Pradipta Kumar Banerjee wrote: > >> Hi, > >> Please ignore the earlier mail. There were some problems with the mailer. > >> Here is the new one. > >> > >> This patch fixes the problem as reported by Amith. > >> > >> Signed-off-by: Pradipta Kumar Banerjee > >> > >> --- > >> > >> Index: rping.c > >> ============================================================================= > >> --- rping.c.org 2006-06-23 00:22:17.000000000 +0530 > >> +++ rping.c 2006-06-23 00:39:06.000000000 +0530 > >> @@ -215,6 +215,7 @@ static int rping_cma_event_handler(struc > >> case RDMA_CM_EVENT_DISCONNECTED: > >> fprintf(stderr, "%s DISCONNECT EVENT...\n", cb->server ? 
"server" : "client"); > >> sem_post(&cb->sem); > >> + ret = -1; > >> break; > >> > >> case RDMA_CM_EVENT_DEVICE_REMOVAL: > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From arjan at infradead.org Fri Jun 23 07:04:31 2006 From: arjan at infradead.org (Arjan van de Ven) Date: Fri, 23 Jun 2006 16:04:31 +0200 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. In-Reply-To: <1151071005.7808.39.camel@stevo-desktop> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> <1151070290.7808.33.camel@stevo-desktop> <1151070532.3204.10.camel@laptopd505.fenrus.org> <1151071005.7808.39.camel@stevo-desktop> Message-ID: <1151071471.3204.12.camel@laptopd505.fenrus.org> On Fri, 2006-06-23 at 08:56 -0500, Steve Wise wrote: > On Fri, 2006-06-23 at 15:48 +0200, Arjan van de Ven wrote: > > > > > + /* Tell HW to xmit */ > > > > > + __raw_writeq(cpu_to_be64(mapaddr), elem->hw_desc + C2_TXP_ADDR); > > > > > + __raw_writew(cpu_to_be16(maplen), elem->hw_desc + C2_TXP_LEN); > > > > > + __raw_writew(cpu_to_be16(TXP_HTXD_READY), elem->hw_desc + C2_TXP_FLAGS); > > > > > > > > or here > > > > > > > > > > No need here. This logic submits the packet for transmission. We don't > > > assume it is transmitted until we (after a completion interrupt usually) > > > read back the HTXD entry and see the TXP_HTXD_DONE bit set (see > > > c2_tx_interrupt()). > > > > ... but will that interrupt happen at all if these 3 writes never hit > > the hardware? > > > > I thought the posted write WILL eventually get to adapter memory. Not > stall forever cached in a bridge. I'm wrong? I'm not sure there is a theoretical upper bound.... 
(and if it's several msec per bridge, then you have a lot of latency anyway) From swise at opengridcomputing.com Fri Jun 23 07:29:24 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:29:24 -0500 Subject: [openib-general] [PATCH v2 00/14][RFC] Chelsio CXGB3 iWARP Driver Message-ID: <20060623142924.32410.7623.stgit@stevo-desktop> This patchset implements the iWARP provider driver for the Chelsio CXGB3 RNIC. It is dependent on the "iWARP Core Support" patch set. This is round 2 of the openib-general review. I'm requesting one more review from the rdma experts before widening the audience to the linux community in general. I believe I've addressed all the round 1 review comments. The entire subsystem is laid out as three modules: iw_cxgb3.ko - The main OpenIB Provider module. It depends on the other two modules. cxgb3c.ko - The cxgb3 core module that allows TCP connections to be manipulated. It depends on the LLD/NETDEV module. cxgb3.ko - the cxgb3 LLD/NETDEV driver with offload support. This driver is currently checked in to gen2/branches/iwarp/src/linux-kernel/net/cxgb3. Chelsio will eventually submit this driver to the kernel netdev group for inclusion into kernel.org. For now, I've placed it in the openib tree so the entire subsystem can be used. I'm only including patches for the .h files that define the interface used by the other modules. This StGIT patchset is cloned from Roland Dreier's infiniband.git for-2.6.19 branch.
The patchset consists of these patches: t3_provider - OpenIB Provider Driver t3_cq_qp - QP and CQ t3_mem - MR and MW t3_ae - Async and CQ events t3_cm - Connection Manager t3_rcore_dbg - RDMA Core Debug t3_rcore_hal - RDMA Core HAL t3_rcore_resource - RDMA Core Resource Manager t3_rcore_types - RDMA Core Types t3_core_reg - T3 Core Registration t3_core_demux - T3 Core Demuxer t3_core_l2t - T3 L2 Services t3_cfg - Makefiles t3_lld_ulp - LLD Interface Since round 1 review, the following has been done: - incorporated all the review feedback (thanks to all who reviewed it) - sparse clean - incorporated some of the ammasso review feedback (like use pr_debug()) - interoperability testing against Ammasso - iWARP conformance testing - NFSoRDMA testing (connectathon basic and general tests) - rping/krping testing - dapltest 1-6 testing - performance characterization Signed-off-by: Steve Wise From swise at opengridcomputing.com Fri Jun 23 07:29:29 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:29:29 -0500 Subject: [openib-general] [PATCH v2 01/14] CXGB3 OpenIB Driver In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623142929.32410.12997.stgit@stevo-desktop> This patch contains the cxgb3 device discovery and openib driver methods. The T3 openib driver discovers each T3 adapter by registering as a client with the cxgb3 "core" module, which will then call the provider module for each T3 adapter present. This is similar to the ib_client mechanism in openib.
--- drivers/infiniband/hw/cxgb3/iwch.c | 220 +++++ drivers/infiniband/hw/cxgb3/iwch.h | 130 +++ drivers/infiniband/hw/cxgb3/iwch_provider.c | 1097 +++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch_provider.h | 358 +++++++++ 4 files changed, 1805 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c new file mode 100644 index 0000000..20d9f1e --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch.c @@ -0,0 +1,220 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include +#include +#include + +#include +#include "iwch_provider.h" +#include "iwch_user.h" +#include "iwch.h" +#include "iwch_cm.h" + +MODULE_AUTHOR("Boyd Faulkner , " + "Steve Wise pdid2hlp = vzmalloc(sizeof(void*) * T3_MAX_NUM_PD); + if (!rnicp->pdid2hlp) + goto pdid_err; + rnicp->cqid2hlp = vzmalloc(sizeof(void*) * T3_MAX_NUM_CQ); + if (!rnicp->cqid2hlp) + goto cqid_err; + rnicp->qpid2hlp = vzmalloc(sizeof(void*) * T3_MAX_NUM_QP); + if (!rnicp->qpid2hlp) + goto qpid_err; + rnicp->stag2hlp = vzmalloc(sizeof(void*) * T3_MAX_NUM_STAG); + if (!rnicp->stag2hlp) + goto stag_err; + + spin_lock_init(&rnicp->lock); + + /* + * XXX get these from the hw! + */ + rnicp->attr.vendor_id = 0x168; + rnicp->attr.vendor_part_id = 7; + rnicp->attr.hw_version = 3; + rnicp->attr.addl_vendor_info = NULL; + rnicp->attr.addl_vendor_info_length = 0; + rnicp->attr.max_qps = T3_MAX_NUM_QP - 32; + rnicp->attr.max_wrs = (1UL << 24) - 1; + rnicp->attr.max_sge_per_wr = T3_MAX_SGE; + rnicp->attr.max_sge_per_rdma_write_wr = T3_MAX_SGE; + rnicp->attr.max_cqs = T3_MAX_NUM_CQ - 1; + rnicp->attr.max_cqes_per_cq = (1UL << 24) - 1; + rnicp->attr.max_cq_event_handlers = T3_MAX_NUM_CQ - 1; + rnicp->attr.max_mem_regs = T3_MAX_NUM_STAG; + rnicp->attr.max_phys_buf_entries = T3_MAX_PBL_SIZE; + rnicp->attr.max_pds = T3_MAX_NUM_PD - 1; + rnicp->attr.mem_pgsizes_bitmask = 0x7FFF; /* 4KB-128MB */ + rnicp->attr.can_resize_wq = 0; + rnicp->attr.max_rdma_reads_per_qp = 16; + rnicp->attr.max_rdma_read_resources = + rnicp->attr.max_rdma_reads_per_qp * rnicp->attr.max_qps; + rnicp->attr.max_rdma_read_qp_depth = 16; /* IRD */ + rnicp->attr.max_rdma_read_depth = + rnicp->attr.max_rdma_read_qp_depth * rnicp->attr.max_qps; + rnicp->attr.rq_overflow_handled = 0; + rnicp->attr.can_modify_ird = 0; + rnicp->attr.can_modify_ord = 0; + rnicp->attr.max_mem_windows = T3_MAX_NUM_STAG - 1;/* Shared with MR */ + rnicp->attr.stag0_value = 1; + rnicp->attr.zbva_support = 1; + rnicp->attr.local_invalidate_fence = 1; + 
rnicp->attr.cq_overflow_detection = 1; + return 0; + +stag_err: + vfree(rnicp->qpid2hlp); +qpid_err: + vfree(rnicp->cqid2hlp); +cqid_err: + vfree(rnicp->pdid2hlp); +pdid_err: + return -ENOMEM; +} + +static void open_rnic_toe(struct t3cdev *tdev) +{ + struct iwch_dev *rnicp; + + PDBG("%s line %d\n", __FUNCTION__, __LINE__); + rnicp = (struct iwch_dev *)ib_alloc_device(sizeof(*rnicp)); + if (!rnicp) { + printk(KERN_ERR PFX "cannot allocate ib device!\n"); + return; + } + rnicp->rdev.ulp = rnicp; + rnicp->rdev.t3cdev_p = tdev; + + if (cxio_rdev_open(&rnicp->rdev)) { + printk(KERN_ERR PFX "Unable to register with RDMA Core\n"); + ib_dealloc_device(&rnicp->ibdev); + return; + } + + if (open_rnic_init(rnicp)) { + printk(KERN_ERR PFX "Unable to initialize iwch_dev!\n"); + cxio_rdev_close(&rnicp->rdev); + ib_dealloc_device(&rnicp->ibdev); + return; + } + + mutex_lock(&dev_mutex); + list_add_tail(&rnicp->entry, &dev_list); + mutex_unlock(&dev_mutex); + + if (iwch_register_device(rnicp)) { + printk(KERN_ERR PFX "Unable to register with openib\n"); + close_rnic_toe(tdev); + } + return; +} + +static void close_rnic_toe(struct t3cdev *tdev) +{ + struct iwch_dev *dev, *tmp; + PDBG("%s line %d\n", __FUNCTION__, __LINE__); + mutex_lock(&dev_mutex); + list_for_each_entry_safe(dev, tmp, &dev_list, entry) { + if (dev->rdev.t3cdev_p == tdev) { + list_del(&dev->entry); + iwch_unregister_device(dev); + cxio_rdev_close(&dev->rdev); + vfree(dev->pdid2hlp); + vfree(dev->cqid2hlp); + vfree(dev->stag2hlp); + vfree(dev->qpid2hlp); + ib_dealloc_device(&dev->ibdev); + break; + } + } + mutex_unlock(&dev_mutex); +} + +extern void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb); + +static int __init iwch_init_module(void) +{ + int err; + + err = cxio_hal_init(); + if (err) + return err; + err = iwch_cm_init(); + if (err) + return err; + cxio_register_ev_cb(iwch_ev_dispatch); + t3c_register_client(&t3c_client); + return 0; +} + +static void __exit iwch_exit_module(void) +{ + 
t3c_unregister_client(&t3c_client); + cxio_unregister_ev_cb(iwch_ev_dispatch); + iwch_cm_term(); + cxio_hal_exit(); +} + +module_init(iwch_init_module); +module_exit(iwch_exit_module); diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h new file mode 100644 index 0000000..bf466a6 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch.h @@ -0,0 +1,130 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __IWCH_H__ +#define __IWCH_H__ + +#include + +#include +#include + +#include + +#include "cxio_hal.h" +#include "common.h" +#include "t3c.h" + +struct iwch_pd; +struct iwch_cq; +struct iwch_qp; +struct iwch_mr; + +struct iwch_rnic_attributes { + u32 vendor_id; + u32 vendor_part_id; + u32 hw_version; + char *addl_vendor_info; + u32 addl_vendor_info_length; + u32 max_qps; + u32 max_wrs; /* Max for any SQ/RQ */ + u32 max_sge_per_wr; + u32 max_sge_per_rdma_write_wr; /* for RDMA Write WR */ + u32 max_cqs; + u32 max_cqes_per_cq; + u32 max_cq_event_handlers; + u32 max_mem_regs; + u32 max_phys_buf_entries; /* for phys buf list */ + u32 max_pds; + + /* + * The memory page sizes supported by this RNIC. + * Bit position i in bitmap indicates page of + * size (4k)^i. Phys block list mode unsupported. + */ + u32 mem_pgsizes_bitmask; + u8 can_resize_wq; + + /* + * The maximum number of RDMA Reads that can be outstanding + * per QP with this RNIC as the target. + */ + u32 max_rdma_reads_per_qp; + + /* + * The maximum number of resources used for RDMA Reads + * by this RNIC with this RNIC as the target. + */ + u32 max_rdma_read_resources; + + /* + * The max depth per QP for initiation of RDMA Read + * by this RNIC. 
+ */ + u32 max_rdma_read_qp_depth; + + /* + * The maximum depth for initiation of RDMA Read + * operations by this RNIC on all QPs + */ + u32 max_rdma_read_depth; + u8 rq_overflow_handled; + u32 can_modify_ird; + u32 can_modify_ord; + u32 max_mem_windows; + u32 stag0_value; + u8 zbva_support; + u8 local_invalidate_fence; + u32 cq_overflow_detection; +}; + +struct iwch_dev { + struct ib_device ibdev; + struct cxio_rdev rdev; + u32 device_cap_flags; + struct iwch_rnic_attributes attr; + struct iwch_pd **pdid2hlp; + struct iwch_cq **cqid2hlp; + struct iwch_qp **qpid2hlp; + struct iwch_mr **stag2hlp; + spinlock_t lock; + struct list_head entry; +}; + +static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct iwch_dev, ibdev); +} + +extern struct t3c_client t3c_client; +extern t3c_cpl_handler_func t3c_handlers[NUM_CPL_CMDS]; +#endif diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c new file mode 100644 index 0000000..b38cd2e --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -0,0 +1,1097 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include +#include +#include + +#include +#include "iwch.h" +#include "iwch_provider.h" +#include "iwch_cm.h" +#include "iwch_user.h" + +static int iwch_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static struct ib_ah *iwch_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return ERR_PTR(-ENOSYS); +} + +static int iwch_ah_destroy(struct ib_ah *ah) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static int iwch_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static int iwch_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static int iwch_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + struct ib_wc *in_wc, + struct ib_grh *in_grh, + struct ib_mad *in_mad, struct 
ib_mad *out_mad) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return -ENOSYS; +} + +static int iwch_dealloc_ucontext(struct ib_ucontext *context) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + kfree(to_iwch_ucontext(context)); + return 0; +} + +static struct ib_ucontext *iwch_alloc_ucontext(struct ib_device *ibdev, + struct ib_udata *udata) +{ + struct iwch_ucontext *context; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + context = kmalloc(sizeof(*context), GFP_KERNEL); + if (!context) { + return ERR_PTR(-ENOMEM); + } + return &context->ibucontext; +} + +static int iwch_destroy_cq(struct ib_cq *ib_cq) +{ + struct iwch_cq *chp; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + chp = to_iwch_cq(ib_cq); + + spin_lock_irq(&chp->rhp->lock); + chp->rhp->cqid2hlp[chp->cqh] = NULL; + spin_unlock_irq(&chp->rhp->lock); + + atomic_dec(&chp->refcnt); + wait_event(chp->wait, !atomic_read(&chp->refcnt)); + + cxio_destroy_cq(&chp->rhp->rdev, &chp->cq); + kfree(chp); + return 0; +} + +static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + struct iwch_create_cq_resp uresp; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + rhp = to_iwch_dev(ibdev); + chp = kzalloc(sizeof(*chp), GFP_KERNEL); + if (!chp) + return ERR_PTR(-ENOMEM); + + /* + * Attempt to make the CQ big enough to handle the T3 + * additional CQE possibilities: + * TERMINATE, + * 2 CQES for each RDMA READ operation, + * incoming RDMA READ REQUEST FAILUREs + * We can make the CQ big enough to handle these for + * a single QP. But problems can arise if the CQ is shared... 
+ */ + entries = roundup_pow_of_two(entries + + 8 + /* max ORD */ + 8 + /* max IRRQ */ + 1 /* TERM */ + ); + chp->cq.size_log2 = long_log2(entries); + + if (cxio_create_cq(&rhp->rdev, &chp->cq)) { + kfree(chp); + return ERR_PTR(-ENOMEM); + } + chp->rhp = rhp; + chp->ibcq.cqe = (1 << chp->cq.size_log2) - 1; + spin_lock_init(&chp->lock); + atomic_set(&chp->refcnt, 1); + init_waitqueue_head(&chp->wait); + chp->cqh = chp->cq.cqid; + + spin_lock_irq(&rhp->lock); + rhp->cqid2hlp[chp->cq.cqid] = chp; + spin_unlock_irq(&rhp->lock); + + if (context) { + uresp.cqid = chp->cq.cqid; + uresp.entries = chp->ibcq.cqe; + uresp.physaddr = chp->cq.dma_addr; + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + iwch_destroy_cq(&chp->ibcq); + return ERR_PTR(-EFAULT); + } + } + PDBG("created cq_hdl(%0x) chp=%p size=0x%0x, dma_addr=0x%0llx\n", + chp->cq.cqid, chp, (1 << chp->cq.size_log2), + (u64)chp->cq.dma_addr); + return &chp->ibcq; +} + +static int iwch_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata) +{ + struct iwch_cq *chp = to_iwch_cq(cq); + struct t3_cq oldcq, newcq; + int ret; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + /* We don't downsize... 
*/ + if (cqe <= cq->cqe) + return 0; + + /* create new t3_cq with new size */ + cqe = roundup_pow_of_two(cqe+1); + newcq.size_log2 = long_log2(cqe); + + /* Don't allow resize to less than the current wce count */ + if (cqe < Q_COUNT(chp->cq.rptr, chp->cq.wptr)) { + return -ENOMEM; + } + + /* Quiesce all QPs using this CQ */ + ret = iwch_quiesce_qps(chp); + if (ret) { + return ret; + } + + /* XXX limit max based on rdev */ + ret = cxio_create_cq(&chp->rhp->rdev, &newcq); + if (ret) { + kfree(chp); + return ret; + } + + /* copy CQEs */ + memcpy(newcq.queue, chp->cq.queue, (1 << chp->cq.size_log2) * + sizeof(struct t3_cqe)); + + /* old iwch_qp gets new t3_cq but keeps old cqid */ + oldcq = chp->cq; + chp->cq = newcq; + chp->cq.cqid = oldcq.cqid; + + /* resize new t3_cq to update the HW context */ + ret = cxio_resize_cq(&chp->rhp->rdev, &chp->cq); + if (ret) { + chp->cq = oldcq; + return ret; + } + chp->ibcq.cqe = (1 << chp->cq.size_log2) - 1; + + /* destroy old t3_cq */ + oldcq.cqid = newcq.cqid; + ret = cxio_destroy_cq(&chp->rhp->rdev, &oldcq); + if (ret) { + printk(KERN_ERR MOD "%s - cxio_destroy_cq failed %d\n", + __FUNCTION__, ret); + } + + /* add user hooks here */ + + /* resume qps */ + ret = iwch_resume_qps(chp); + return ret; +} + +static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + enum t3_cq_opcode cq_op; + int err; + int flags; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + chp = to_iwch_cq(ibcq); + rhp = chp->rhp; + if (notify == IB_CQ_SOLICITED) + cq_op = CQ_ARM_SE; + else + cq_op = CQ_ARM_AN; + spin_lock_irqsave(&chp->lock, flags); + err = cxio_hal_cq_op(&rhp->rdev, &chp->cq, cq_op, 0); + spin_unlock_irqrestore(&chp->lock, flags); + if (err) + printk(KERN_ERR MOD "Error %d rearming CQ %llu\n", err, + chp->cqh); + return err; +} + +static int iwch_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + int len = vma->vm_end - vma->vm_start; + + PDBG("%s:%s:%u\n", __FILE__,
__FUNCTION__, __LINE__); + vma->vm_flags |= VM_RESERVED; + if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, + len, vma->vm_page_prot)) + return -EAGAIN; + return 0; +} + +static int iwch_deallocate_pd(struct ib_pd *pd) +{ + struct iwch_dev *rhp; + struct iwch_pd *php; + u64 pd_h; + + php = to_iwch_pd(pd); + rhp = php->rhp; + pd_h = (u64) php->pdid; + PDBG("iwch_deallocate_pd entry: hdl(%0llx)\n", pd_h); + rhp->pdid2hlp[pd_h] = NULL; + cxio_hal_put_pdid(rhp->rdev.rscp, php->pdid); + kfree(php); + return 0; +} + +static struct ib_pd *iwch_allocate_pd(struct ib_device *ibdev, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct iwch_pd *php; + u32 pdid; + struct iwch_dev *rhp; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + rhp = (struct iwch_dev *) ibdev; + pdid = cxio_hal_get_pdid(rhp->rdev.rscp); + if (!pdid) + return ERR_PTR(-EINVAL); + php = kzalloc(sizeof(*php), GFP_KERNEL); + if (!php) { + cxio_hal_put_pdid(rhp->rdev.rscp, pdid); + return ERR_PTR(-ENOMEM); + } + php->pdid = pdid; + php->rhp = rhp; + rhp->pdid2hlp[pdid] = php; + if (context) { + if (ib_copy_to_udata(udata, &php->pdid, sizeof (__u32))) { + iwch_deallocate_pd(&php->ibpd); + return ERR_PTR(-EFAULT); + } + } + PDBG("iwch_allocate_pd: pdid(0x%0x) hlp(0x%p)\n", pdid, php); + return &php->ibpd; +} + +static int iwch_dereg_mr(struct ib_mr *ib_mr) +{ + struct iwch_dev *rhp; + struct iwch_mr *mhp; + struct iwch_pd *php; + u64 mem_h; + + /* There can be no memory windows */ + if (atomic_read(&ib_mr->usecnt)) + return -EINVAL; + + mhp = to_iwch_mr(ib_mr); + rhp = mhp->rhp; + mem_h = mhp->attr.stag >> 8; + /* TBD: check dereg_mem return status: regreg mem with mw bound to it */ + cxio_dereg_mem(&rhp->rdev, mhp->attr.stag); + rhp->stag2hlp[mem_h] = NULL; + php = get_php(rhp, mhp->attr.pdid); + if (mhp->kva) + kfree((void *) (unsigned long) mhp->kva); + kfree(mhp); + PDBG("iwch_dereg_mem: mem_h(0x%0llx) hlp(%p)\n", mem_h, mhp); + return 0; +} + +static struct ib_mr 
*iwch_register_phys_mem(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + u64 *page_list; + int shift; + u64 total_size; + int npages; + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mr *mhp; + int ret; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + php = to_iwch_pd(pd); + rhp = php->rhp; + + acc = iwch_convert_access(acc); + + + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) { + ret = -EINVAL; + goto err; + } + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) { + ret = -EINVAL; + goto err; + } + + ret = build_phys_page_list(buffer_list, num_phys_buf, iova_start, + &total_size, &npages, &shift, &page_list); + if (ret) + goto err; + + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.zbva = 0; + + /* XXX TPT perms are backwards from BIND WR perms! 
*/ + mhp->attr.perms = (acc & 0x1) << 3; + mhp->attr.perms |= (acc & 0x2) << 1; + mhp->attr.perms |= (acc & 0x4) >> 1; + mhp->attr.perms |= (acc & 0x8) >> 3; + + mhp->attr.va_fbo = *iova_start; + mhp->attr.page_size = shift - 12; + + mhp->attr.len = (u32) total_size; + mhp->attr.pbl_size = npages; + ret = iwch_register_mem(rhp, php, mhp, shift, page_list); + kfree(page_list); + if (ret) { + goto err; + } + return &mhp->ibmr; +err: + kfree(mhp); + return ERR_PTR(ret); + +} + +static int iwch_reregister_phys_mem(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, u64 * iova_start) +{ + + struct iwch_mr mh, *mhp; + struct iwch_pd *php; + struct iwch_dev *rhp; + int new_acc; + u64 *page_list = NULL; + int shift = 0; + u64 total_size; + int npages; + int ret; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + /* There can be no memory windows */ + if (atomic_read(&mr->usecnt)) + return -EINVAL; + + mhp = to_iwch_mr(mr); + rhp = mhp->rhp; + php = to_iwch_pd(mr->pd); + + /* make sure we are on the same adapter */ + if (rhp != php->rhp) + return -EINVAL; + + new_acc = mhp->attr.perms; + + memcpy(&mh, mhp, sizeof *mhp); + + printk("%s: %d stag = 0x%x\n",__FUNCTION__, __LINE__,mh.attr.stag); + if (mr_rereg_mask & IB_MR_REREG_PD) + php = to_iwch_pd(pd); + if (mr_rereg_mask & IB_MR_REREG_ACCESS) + mh.attr.perms = iwch_convert_access(acc); + if (mr_rereg_mask & IB_MR_REREG_TRANS) + ret = build_phys_page_list(buffer_list, num_phys_buf, + iova_start, + &total_size, &npages, + &shift, &page_list); + + ret = iwch_reregister_mem(rhp, php, &mh, shift, page_list); + kfree(page_list); + if (ret) { + return ret; + } + if (mr_rereg_mask & IB_MR_REREG_PD) + mhp->attr.pdid = php->pdid; + if (mr_rereg_mask & IB_MR_REREG_ACCESS) + mhp->attr.perms = acc; + if (mr_rereg_mask & IB_MR_REREG_TRANS) { + mhp->attr.zbva = 0; + mhp->attr.va_fbo = *iova_start; + mhp->attr.page_size = shift - 12; + mhp->attr.len = (u32) 
total_size; + mhp->attr.pbl_size = npages; + } + + return 0; +} + + +struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, + int acc, struct ib_udata *udata) +{ + u64 *pages; + int shift, n, len; + int i, j, k; + int err = 0; + struct ib_umem_chunk *chunk; + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mr *mhp; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + shift = ffs(region->page_size) - 1; + + php = to_iwch_pd(pd); + rhp = php->rhp; + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + + n = 0; + list_for_each_entry(chunk, ®ion->chunk_list, list) + n += chunk->nents; + + pages = kmalloc(n * sizeof(u64), GFP_KERNEL); + if (!pages) { + err = -ENOMEM; + goto err; + } + + acc = iwch_convert_access(acc); + + i = n = 0; + + list_for_each_entry(chunk, ®ion->chunk_list, list) + for (j = 0; j < chunk->nmap; ++j) { + len = sg_dma_len(&chunk->page_list[j]) >> shift; + for (k = 0; k < len; ++k) { + pages[i++] = cpu_to_be64(sg_dma_address( + &chunk->page_list[j]) + + region->page_size * k); + } + } + + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.zbva = 0; + mhp->attr.perms = (acc & 0x1) << 3; + mhp->attr.perms |= (acc & 0x2) << 1; + mhp->attr.perms |= (acc & 0x4) >> 1; + mhp->attr.perms |= (acc & 0x8) >> 3; + mhp->attr.va_fbo = region->virt_base; + mhp->attr.page_size = shift - 12; + mhp->attr.len = (u32) region->length; + mhp->attr.pbl_size = i; + err = iwch_register_mem(rhp, php, mhp, shift, pages); + kfree(pages); + if (err) + goto err; + return &mhp->ibmr; + +err: + kfree(mhp); + return ERR_PTR(err); +} + +struct ib_mr *iwch_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct ib_phys_buf bl; + u64 kva; + struct ib_mr *ibmr; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + /* + * T3 only supports 32 bits of size. 
+ */ + bl.size = 0xffffffff; + bl.addr = 0; + kva = 0; + ibmr = iwch_register_phys_mem(pd, &bl, 1, acc, &kva); + return ibmr; +} + +struct ib_mw *iwch_alloc_mw(struct ib_pd *pd) +{ + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mw *mhp; + u64 win_h; + u32 stag = 0; + int ret; + + php = to_iwch_pd(pd); + rhp = php->rhp; + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + ret = cxio_allocate_window(&rhp->rdev, &stag, php->pdid); + if (ret) { + kfree(mhp); + return ERR_PTR(ret); + } + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.type = TPT_MW; + mhp->attr.stag = stag; + win_h = (stag) >> 8; + rhp->stag2hlp[win_h] = (struct iwch_mr *) mhp; + PDBG("iwch_allocate_window: win_h(0x%0llx) mhp(%p) stag(0x%x)\n", + win_h, mhp, stag); + return &(mhp->ibmw); +} + +int iwch_dealloc_mw(struct ib_mw *mw) +{ + struct iwch_dev *rhp; + struct iwch_mw *mhp; + struct iwch_pd *php; + u64 win_h; + + mhp = to_iwch_mw(mw); + rhp = mhp->rhp; + win_h = (mw->rkey) >> 8; + php = get_php(rhp, mhp->attr.pdid); + cxio_deallocate_window(&rhp->rdev, mhp->attr.stag); + rhp->stag2hlp[win_h] = NULL; + kfree(mhp); + PDBG("iwch_deallocate_window: win_h(0x%0llx) hlp(%p)\n", win_h, mhp); + return 0; +} + +static int iwch_destroy_qp(struct ib_qp *ib_qp) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + struct iwch_qp_attributes attrs; + + qhp = to_iwch_qp(ib_qp); + rhp = qhp->rhp; + + if (qhp->attr.state == IWCH_QP_STATE_RTS) { + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, &attrs, 0); + } + wait_event(qhp->wait, !qhp->ep); + + spin_lock_irq(&rhp->lock); + rhp->qpid2hlp[qhp->wq.qpid] = NULL; + spin_unlock_irq(&rhp->lock); + + atomic_dec(&qhp->refcnt); + wait_event(qhp->wait, !atomic_read(&qhp->refcnt)); + + cxio_destroy_qp(&rhp->rdev, &qhp->wq); + + PDBG("iwch_destroy_qp: qp_h(0x%0x) qhp(%p)\n", qhp->wq.qpid, qhp); + kfree(qhp); + return 0; +} + +static struct ib_qp *iwch_create_qp(struct ib_pd *pd, 
+ struct ib_qp_init_attr *attrs, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + struct iwch_pd *php; + struct iwch_cq *schp; + struct iwch_cq *rchp; + struct iwch_create_qp_resp uresp; + int wqsize, sqsize, rqsize; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + if (attrs->qp_type != IB_QPT_RC) + return ERR_PTR(-EINVAL); + php = to_iwch_pd(pd); + rhp = php->rhp; + schp = get_chp(rhp, ((struct iwch_cq *) attrs->send_cq)->cqh); + rchp = get_chp(rhp, ((struct iwch_cq *) attrs->recv_cq)->cqh); + if (!schp || !rchp) + return ERR_PTR(-EINVAL); + + /* The RQT size must be # of entries + 1 rounded up to a power of two */ + rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr); + if (rqsize == attrs->cap.max_recv_wr) + rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr+1); + + /* T3 doesn't support RQT depth < 16 */ + if (rqsize < 16) + rqsize = 16; + + if (rqsize >= T3_MAX_RQ_SIZE) + return ERR_PTR(-EINVAL); + + /* + * XXX the SQ and total WQ sizes don't need to be + * a power of two. However, all the code assumes + * they are. EG: Q_FREECNT() and friends. 
+ */ + sqsize = roundup_pow_of_two(attrs->cap.max_send_wr); + wqsize = roundup_pow_of_two(rqsize + sqsize); + PDBG("%s wqsize %d sqsize %d rqsize %d\n", __FUNCTION__, + wqsize, sqsize, rqsize); + qhp = kzalloc(sizeof(*qhp), GFP_KERNEL); + if (!qhp) + return ERR_PTR(-ENOMEM); + qhp->wq.size_log2 = long_log2(wqsize); + qhp->wq.rq_size_log2 = long_log2(rqsize); + qhp->wq.sq_size_log2 = long_log2(sqsize); + if (cxio_create_qp(&rhp->rdev, 1, &qhp->wq)) { + kfree(qhp); + return ERR_PTR(-ENOMEM); + } + attrs->cap.max_recv_wr = rqsize - 1; + attrs->cap.max_send_wr = sqsize; + qhp->rhp = rhp; + qhp->attr.pd = php->pdid; + qhp->attr.scq = ((struct iwch_cq *) attrs->send_cq)->cqh; + qhp->attr.rcq = ((struct iwch_cq *) attrs->recv_cq)->cqh; + qhp->attr.sq_num_entries = attrs->cap.max_send_wr; + qhp->attr.rq_num_entries = attrs->cap.max_recv_wr; + qhp->attr.sq_max_sges = attrs->cap.max_send_sge; + qhp->attr.sq_max_sges_rdma_write = attrs->cap.max_send_sge; + qhp->attr.rq_max_sges = attrs->cap.max_recv_sge; + qhp->attr.state = IWCH_QP_STATE_IDLE; + qhp->attr.next_state = IWCH_QP_STATE_IDLE; + + /* + * XXX - these don't get passed in from the openib user + * at create time. The CM sets them via a QP modify. + * Need to fix... 
I think the CM should + */ + qhp->attr.enable_rdma_read = 1; + qhp->attr.enable_rdma_write = 1; + qhp->attr.enable_bind = 1; + qhp->attr.max_ord = 1; + qhp->attr.max_ird = 1; + spin_lock_init(&qhp->lock); + init_waitqueue_head(&qhp->wait); + atomic_set(&qhp->refcnt, 1); + + spin_lock_irq(&rhp->lock); + rhp->qpid2hlp[qhp->wq.qpid] = qhp; + spin_unlock_irq(&rhp->lock); + + PDBG("iwch_create_qp: udata = 0x%p failed\n", udata); + if (udata) { + uresp.qpid = qhp->wq.qpid; + uresp.entries = qhp->attr.sq_num_entries + qhp->attr.rq_num_entries; + uresp.physaddr = qhp->wq.dma_addr; + uresp.physsize = (u64) uresp.entries * sizeof(union t3_wr); + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + PDBG("iwch_create_qp: ib_copy_to_udata failed\n"); + iwch_destroy_qp(&qhp->ibqp); + return ERR_PTR(-EFAULT); + } + } + qhp->ibqp.qp_num = qhp->wq.qpid; + init_timer(&(qhp->timer)); + PDBG("iwch_create_qp: sq_num_entries = %d, rq_num_entries = %d\n", + qhp->attr.sq_num_entries, qhp->attr.rq_num_entries); + PDBG("iwch_create_qp: qh_h(0x%0x) qhp=%p dma_addr=0x%llx size=%d\n", + (qhp->wq.qpid), qhp, (u64)qhp->wq.dma_addr, + (1 << qhp->wq.size_log2)); + return (&qhp->ibqp); +} + +static int iwch_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, + int attr_mask) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + enum iwch_qp_attr_mask mask = 0; + struct iwch_qp_attributes attrs; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + /* iwarp does not support the RTR state */ + if ((attr_mask & IB_QP_STATE) && (attr->qp_state == IB_QPS_RTR)) + attr_mask &= ~IB_QP_STATE; + + /* Make sure we still have something left to do */ + if (!attr_mask) + return 0; + + memset(&attrs, 0, sizeof attrs); + qhp = to_iwch_qp(ibqp); + rhp = qhp->rhp; + + attrs.next_state = iwch_convert_state(attr->qp_state); + attrs.enable_rdma_read = (attr->qp_access_flags & + IB_ACCESS_REMOTE_READ) ? 1 : 0; + attrs.enable_rdma_write = (attr->qp_access_flags & + IB_ACCESS_REMOTE_WRITE) ? 
1 : 0; + attrs.enable_bind = (attr->qp_access_flags & IB_ACCESS_MW_BIND) ? 1 : 0; + + + mask |= (attr_mask & IB_QP_STATE) ? IWCH_QP_ATTR_NEXT_STATE : 0; + mask |= (attr_mask & IB_QP_ACCESS_FLAGS) ? + (IWCH_QP_ATTR_ENABLE_RDMA_READ | + IWCH_QP_ATTR_ENABLE_RDMA_WRITE | + IWCH_QP_ATTR_ENABLE_RDMA_BIND) : 0; + + return iwch_modify_qp(rhp, qhp, mask, &attrs, 0); +} + +void iwch_qp_add_ref(struct ib_qp *qp) +{ + atomic_inc(&(to_iwch_qp(qp)->refcnt)); +} + +void iwch_qp_rem_ref(struct ib_qp *qp) +{ + if (atomic_dec_and_test(&(to_iwch_qp(qp)->refcnt))) + wake_up(&(to_iwch_qp(qp)->wait)); +} + +struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn) +{ + return (struct ib_qp *)get_qhp(to_iwch_dev(dev), qpn); +} + + +static int iwch_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 * pkey) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + *pkey = 0; + return 0; +} + +static int iwch_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct iwch_dev *dev; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + PDBG("ibdev %p, port %d, index %d, gid %p\n", + ibdev, port, index, gid); + dev = to_iwch_dev(ibdev); + BUG_ON(port == 0 || port > 2); + PDBG("dev %p port %d netdev %p\n", dev, port, + dev->rdev.rnic_info.lldevs[port-1]); + memset(&(gid->raw[0]), 0, sizeof(gid->raw)); + memcpy(&(gid->raw[0]), dev->rdev.rnic_info.lldevs[port-1]->dev_addr, 6); + return 0; +} + +static int iwch_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + + struct iwch_dev *dev; + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + dev = to_iwch_dev(ibdev); + memset(props, 0, sizeof *props); + memcpy(&props->sys_image_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6); + props->device_cap_flags = dev->device_cap_flags; +#if 0 + props->fw_ver = cht3dev->fw_ver; + props->hw_ver = dev->adapter->params->chip_version; +#endif + props->vendor_id = (u32)dev->rdev.rnic_info.pdev->vendor; + props->vendor_part_id = 
(u32)dev->rdev.rnic_info.pdev->device; + props->max_mr_size = ~0ull; + props->max_qp = dev->attr.max_qps; + props->max_qp_wr = dev->attr.max_wrs; + props->max_sge = dev->attr.max_sge_per_wr; + props->max_sge_rd = 1; + props->max_qp_rd_atom = dev->attr.max_rdma_reads_per_qp; + props->max_cq = dev->attr.max_cqs; + props->max_cqe = dev->attr.max_cqes_per_cq; + props->max_mr = dev->attr.max_mem_regs; + props->max_pd = dev->attr.max_pds; + props->local_ca_ack_delay = 0; + + return 0; +} + +static int iwch_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + props->max_mtu = IB_MTU_4096; + props->lid = 0; + props->lmc = 0; + props->sm_lid = 0; + props->sm_sl = 0; + props->state = IB_PORT_ACTIVE; + props->phys_state = 0; + props->port_cap_flags = + IB_PORT_CM_SUP | + IB_PORT_SNMP_TUNNEL_SUP | + IB_PORT_REINIT_SUP | + IB_PORT_DEVICE_MGMT_SUP | + IB_PORT_VENDOR_CLASS_SUP | IB_PORT_BOOT_MGMT_SUP; + props->gid_tbl_len = 1; + props->pkey_tbl_len = 1; + props->qkey_viol_cntr = 0; + props->active_width = 2; + props->active_speed = 2; + props->max_msg_sz = -1; + + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + struct ethtool_drvinfo info; + struct net_device *lldev = dev->rdev.t3cdev_p->lldev; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + lldev->ethtool_ops->get_drvinfo(lldev, &info); + return sprintf(buf, "%s\n", info.version); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + struct ethtool_drvinfo info; + struct net_device *lldev = dev->rdev.t3cdev_p->lldev; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + lldev->ethtool_ops->get_drvinfo(lldev, &info); + return sprintf(buf, "%s\n", info.fw_version); +} + +static ssize_t show_hca(struct class_device 
*cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + struct ethtool_drvinfo info; + struct net_device *lldev = dev->rdev.t3cdev_p->lldev; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + lldev->ethtool_ops->get_drvinfo(lldev, &info); + return sprintf(buf, "%s\n", info.driver); +} + +static ssize_t show_board(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + PDBG("%s:%s:%u dev = 0x%p\n", __FILE__, __FUNCTION__, __LINE__, dev); + return sprintf(buf, "%x.%x\n", dev->rdev.rnic_info.pdev->vendor, + dev->rdev.rnic_info.pdev->device); +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); +static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); + +static struct class_device_attribute *iwch_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type, + &class_device_attr_board_id +}; + +int iwch_register_device(struct iwch_dev *dev) +{ + int ret; + int i; + + PDBG("%s line %d\n", __FUNCTION__, __LINE__); + strlcpy(dev->ibdev.name, "cxgb3_%d", IB_DEVICE_NAME_MAX); + PDBG(" dev name = %s\n", dev->ibdev.name); + memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid)); + memcpy(&dev->ibdev.node_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6); + dev->ibdev.owner = THIS_MODULE; + dev->device_cap_flags = + (IB_DEVICE_ZERO_STAG | + IB_DEVICE_SEND_W_INV | IB_DEVICE_MEM_WINDOW); + + dev->ibdev.uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_CMD_QUERY_PORT) | + (1ull << IB_USER_VERBS_CMD_ALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_REG_MR) | + (1ull << IB_USER_VERBS_CMD_DEREG_MR) | + (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | + 
(1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | + (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_QP) | + (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | + (1ull << IB_USER_VERBS_CMD_POLL_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | + (1ull << IB_USER_VERBS_CMD_POST_SEND) | + (1ull << IB_USER_VERBS_CMD_POST_RECV); + dev->ibdev.node_type = RDMA_NODE_RNIC; + memcpy(dev->ibdev.node_desc, IWCH_NODE_DESC, sizeof(IWCH_NODE_DESC)); + dev->ibdev.phys_port_cnt = dev->rdev.rnic_info.nports; + dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev); + dev->ibdev.class_dev.dev = &(dev->rdev.rnic_info.pdev->dev); + dev->ibdev.query_device = iwch_query_device; + dev->ibdev.query_port = iwch_query_port; + dev->ibdev.modify_port = iwch_modify_port; + dev->ibdev.query_pkey = iwch_query_pkey; + dev->ibdev.query_gid = iwch_query_gid; + dev->ibdev.alloc_ucontext = iwch_alloc_ucontext; + dev->ibdev.dealloc_ucontext = iwch_dealloc_ucontext; + dev->ibdev.mmap = iwch_mmap; + dev->ibdev.alloc_pd = iwch_allocate_pd; + dev->ibdev.dealloc_pd = iwch_deallocate_pd; + dev->ibdev.create_ah = iwch_ah_create; + dev->ibdev.destroy_ah = iwch_ah_destroy; + dev->ibdev.create_qp = iwch_create_qp; + dev->ibdev.modify_qp = iwch_ib_modify_qp; + dev->ibdev.destroy_qp = iwch_destroy_qp; + dev->ibdev.create_cq = iwch_create_cq; + dev->ibdev.destroy_cq = iwch_destroy_cq; + dev->ibdev.resize_cq = iwch_resize_cq; + dev->ibdev.poll_cq = iwch_poll_cq; + dev->ibdev.get_dma_mr = iwch_get_dma_mr; + dev->ibdev.reg_phys_mr = iwch_register_phys_mem; + dev->ibdev.rereg_phys_mr = iwch_reregister_phys_mem; + dev->ibdev.reg_user_mr = iwch_reg_user_mr; + dev->ibdev.dereg_mr = iwch_dereg_mr; + dev->ibdev.alloc_mw = iwch_alloc_mw; + dev->ibdev.bind_mw = iwch_bind_mw; + dev->ibdev.dealloc_mw = iwch_dealloc_mw; + + dev->ibdev.attach_mcast = iwch_multicast_attach; + dev->ibdev.detach_mcast = iwch_multicast_detach; + dev->ibdev.process_mad = iwch_process_mad; + + 
dev->ibdev.req_notify_cq = iwch_arm_cq; + dev->ibdev.post_send = iwch_post_send; + dev->ibdev.post_recv = iwch_post_receive; + + + dev->ibdev.iwcm = + (struct iw_cm_verbs *) kmalloc(sizeof(struct iw_cm_verbs), + GFP_KERNEL); + dev->ibdev.iwcm->connect = iwch_connect; + dev->ibdev.iwcm->accept = iwch_accept_cr; + dev->ibdev.iwcm->reject = iwch_reject_cr; + dev->ibdev.iwcm->create_listen = iwch_create_listen; + dev->ibdev.iwcm->destroy_listen = iwch_destroy_listen; + dev->ibdev.iwcm->add_ref = iwch_qp_add_ref; + dev->ibdev.iwcm->rem_ref = iwch_qp_rem_ref; + dev->ibdev.iwcm->get_qp = iwch_get_qp; + + ret = ib_register_device(&dev->ibdev); + if (ret) + goto bail1; + + PDBG("%s line %d\n", __FUNCTION__, __LINE__); + for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i) { + ret = class_device_create_file(&dev->ibdev.class_dev, + iwch_class_attributes[i]); + if (ret) { + goto bail2; + } + } + PDBG("%s line %d\n", __FUNCTION__, __LINE__); + return 0; +bail2: + PDBG("%s line %d\n", __FUNCTION__, __LINE__); + ib_unregister_device(&dev->ibdev); +bail1: + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return ret; +} + +void iwch_unregister_device(struct iwch_dev *dev) +{ + int i; + + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i) + class_device_remove_file(&dev->ibdev.class_dev, + iwch_class_attributes[i]); + ib_unregister_device(&dev->ibdev); + return; +} diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.h b/drivers/infiniband/hw/cxgb3/iwch_provider.h new file mode 100644 index 0000000..3ceed66 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.h @@ -0,0 +1,358 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __IWCH_PROVIDER_H__ +#define __IWCH_PROVIDER_H__ + +#include +#include +#include +#include +#include "t3cdev.h" +#include "iwch.h" +#include "cxio_wr.h" +#include "cxio_hal.h" + + +struct iwch_pd { + struct ib_pd ibpd; + u32 pdid; + struct iwch_dev *rhp; +}; + +static inline struct iwch_pd *to_iwch_pd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct iwch_pd, ibpd); +} + +struct tpt_attributes { + u32 stag; + u32 state:1; + u32 type:2; + u32 rsvd:1; + enum tpt_mem_perm perms; + u32 remote_invaliate_disable:1; + u32 zbva:1; + u32 mw_bind_enable:1; + u32 page_size:5; + + u32 pdid; + u32 qpid; + u32 pbl_addr; + u32 len; + u64 va_fbo; + u32 pbl_size; +}; + +struct iwch_mr { + struct ib_mr ibmr; + struct iwch_dev *rhp; + u64 kva; + struct tpt_attributes attr; +}; + +typedef struct iwch_mw iwch_mw_handle; + +static inline struct iwch_mr *to_iwch_mr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct iwch_mr, ibmr); +} + +struct iwch_mw { + struct ib_mw ibmw; + struct iwch_dev *rhp; + u64 kva; + struct tpt_attributes attr; +}; + +static inline struct iwch_mw *to_iwch_mw(struct ib_mw *ibmw) +{ + return container_of(ibmw, struct iwch_mw, ibmw); +} + +struct iwch_cq { + struct ib_cq ibcq; + struct iwch_dev *rhp; + u64 cqh; + struct t3_cq cq; + spinlock_t lock; + atomic_t refcnt; + wait_queue_head_t wait; +}; + +static inline struct iwch_cq *to_iwch_cq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct iwch_cq, ibcq); +} + +enum IWCH_QP_FLAGS { + QP_QUIESCED = 0x01 +}; + +struct iwch_mpa_attributes { + u8 recv_marker_enabled; + u8 xmit_marker_enabled; /* iWARP: enable inbound Read Resp. */ + u8 crc_enabled; + u8 version; /* 0 or 1 */ +}; + +struct iwch_qp_attributes { + u64 scq; + u64 rcq; + u32 sq_num_entries; + u32 rq_num_entries; + u32 sq_max_sges; + u32 sq_max_sges_rdma_write; + u32 rq_max_sges; + u32 state; + u8 enable_rdma_read; + u8 enable_rdma_write; /* enable inbound Read Resp. 
*/ + u8 enable_bind; + u8 enable_stag0_fastreg; /* Enable STAG0 + Fast-register */ + /* + * Next QP state. If specify the current state, only the + * QP attributes will be modified. + */ + u32 max_ord; + u32 max_ird; + u64 pd; /* IN */ + u32 next_state; + char terminate_buffer[52]; + u32 terminate_msg_len; + u8 is_terminate_local; + struct iwch_mpa_attributes mpa_attr; /* IN-OUT */ + struct iwch_ep *llp_stream_handle; + char *stream_msg_buf; /* Last stream msg. before Idle -> RTS */ + u32 stream_msg_buf_len; /* Only on Idle -> RTS */ +}; + +struct iwch_qp { + struct ib_qp ibqp; + struct iwch_dev *rhp; + struct iwch_ep *ep; + struct iwch_qp_attributes attr; + struct t3_wq wq; + spinlock_t lock; + atomic_t refcnt; + wait_queue_head_t wait; + enum IWCH_QP_FLAGS flags; + struct timer_list timer; +}; + +static inline int qp_quiesced(struct iwch_qp *qhp) +{ + return (qhp->flags & QP_QUIESCED); +} + +static inline struct iwch_qp *to_iwch_qp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct iwch_qp, ibqp); +} + +void iwch_qp_add_ref(struct ib_qp *qp); +void iwch_qp_rem_ref(struct ib_qp *qp); +struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn); + +/* + * I'm anticipating we'll need something per user... 
+ */ +struct iwch_ucontext { + struct ib_ucontext ibucontext; +}; + +static inline struct iwch_ucontext *to_iwch_ucontext(struct ib_ucontext *c) +{ + return container_of(c, struct iwch_ucontext, ibucontext); +} + +enum iwch_qp_attr_mask { + IWCH_QP_ATTR_NEXT_STATE = 1 << 0, + IWCH_QP_ATTR_ENABLE_RDMA_READ = 1 << 7, + IWCH_QP_ATTR_ENABLE_RDMA_WRITE = 1 << 8, + IWCH_QP_ATTR_ENABLE_RDMA_BIND = 1 << 9, + IWCH_QP_ATTR_MAX_ORD = 1 << 11, + IWCH_QP_ATTR_MAX_IRD = 1 << 12, + IWCH_QP_ATTR_LLP_STREAM_HANDLE = 1 << 22, + IWCH_QP_ATTR_STREAM_MSG_BUFFER = 1 << 23, + IWCH_QP_ATTR_MPA_ATTR = 1 << 24, + IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE = 1 << 25, + IWCH_QP_ATTR_VALID_MODIFY = (IWCH_QP_ATTR_ENABLE_RDMA_READ | + IWCH_QP_ATTR_ENABLE_RDMA_WRITE | + IWCH_QP_ATTR_MAX_ORD | + IWCH_QP_ATTR_MAX_IRD | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | + IWCH_QP_ATTR_STREAM_MSG_BUFFER | + IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE) +}; + +int iwch_modify_qp(struct iwch_dev *rhp, + struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs, + int internal); + +enum iwch_qp_state { + IWCH_QP_STATE_IDLE, + IWCH_QP_STATE_RTS, + IWCH_QP_STATE_ERROR, + IWCH_QP_STATE_TERMINATE, + IWCH_QP_STATE_CLOSING, + IWCH_QP_STATE_TOT +}; + +static inline int iwch_convert_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: + case IB_QPS_INIT: + return IWCH_QP_STATE_IDLE; + case IB_QPS_RTS: + return IWCH_QP_STATE_RTS; + case IB_QPS_SQD: + return IWCH_QP_STATE_CLOSING; + case IB_QPS_SQE: + return IWCH_QP_STATE_TERMINATE; + case IB_QPS_ERR: + return IWCH_QP_STATE_ERROR; + default: + return -1; + } +} + +enum iwch_mem_perms { + IWCH_MEM_ACCESS_LOCAL_READ = 1 << 0, + IWCH_MEM_ACCESS_LOCAL_WRITE = 1 << 1, + IWCH_MEM_ACCESS_REMOTE_READ = 1 << 2, + IWCH_MEM_ACCESS_REMOTE_WRITE = 1 << 3, + IWCH_MEM_ACCESS_ATOMICS = 1 << 4, + IWCH_MEM_ACCESS_BINDING = 1 << 5, + IWCH_MEM_ACCESS_LOCAL = + (IWCH_MEM_ACCESS_LOCAL_READ | IWCH_MEM_ACCESS_LOCAL_WRITE), + 
IWCH_MEM_ACCESS_REMOTE = + (IWCH_MEM_ACCESS_REMOTE_WRITE | IWCH_MEM_ACCESS_REMOTE_READ) + /* cannot go beyond 1 << 31 */ +} __attribute__ ((packed)); + +static inline u32 iwch_convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_WRITE ? IWCH_MEM_ACCESS_REMOTE_WRITE : 0) + | (acc & IB_ACCESS_REMOTE_READ ? IWCH_MEM_ACCESS_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? IWCH_MEM_ACCESS_LOCAL_WRITE : 0) | + (acc & IB_ACCESS_MW_BIND ? IWCH_MEM_ACCESS_BINDING : 0) | + IWCH_MEM_ACCESS_LOCAL_READ; +} + +enum iwch_stag_state { + IWCH_STAG_STATE_VALID, + IWCH_STAG_STATE_INVALID +}; + +enum iwch_qp_query_flags { + IWCH_QP_QUERY_CONTEXT_NONE = 0x0, /* No ctx; Only attrs */ + IWCH_QP_QUERY_CONTEXT_GET = 0x1, /* Get ctx + attrs */ + IWCH_QP_QUERY_CONTEXT_SUSPEND = 0x2, /* Not Supported */ + + /* + * Quiesce QP context; Consumer + * will NOT replay outstanding WR + */ + IWCH_QP_QUERY_CONTEXT_QUIESCE = 0x4, + IWCH_QP_QUERY_CONTEXT_REMOVE = 0x8, + IWCH_QP_QUERY_TEST_USERWRITE = 0x32 /* Test special */ +}; + +static inline struct iwch_pd *get_php(struct iwch_dev *rhp, u64 pd_h) +{ + if (pd_h >= T3_MAX_NUM_PD) + return NULL; + return rhp->pdid2hlp[pd_h]; +} + +static inline struct iwch_cq *get_chp(struct iwch_dev *rhp, u64 cq_h) +{ + if (cq_h >= T3_MAX_NUM_CQ) + return NULL; + return rhp->cqid2hlp[cq_h]; +} + +static inline struct iwch_qp *get_qhp(struct iwch_dev *rhp, u64 qp_h) +{ + if (qp_h >= T3_MAX_NUM_QP) + return NULL; + return rhp->qpid2hlp[qp_h]; +} + +static inline struct iwch_mr *get_mhp(struct iwch_dev *rhp, + u64 mem_h) +{ + if (mem_h >= T3_MAX_NUM_STAG) + return NULL; + return rhp->stag2hlp[mem_h]; +} + +int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int iwch_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind); +int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc); +int 
iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg);
+int iwch_register_device(struct iwch_dev *dev);
+void iwch_unregister_device(struct iwch_dev *dev);
+int iwch_quiesce_qps(struct iwch_cq *chp);
+int iwch_resume_qps(struct iwch_cq *chp);
+void stop_read_rep_timer(struct iwch_qp *qhp);
+int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+                      struct iwch_mr *mhp,
+                      int shift,
+                      u64 *page_list);
+int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+                        struct iwch_mr *mhp,
+                        int shift,
+                        u64 *page_list);
+int build_phys_page_list(struct ib_phys_buf *buffer_list,
+                         int num_phys_buf,
+                         u64 *iova_start,
+                         u64 *total_size,
+                         int *npages,
+                         int *shift,
+                         u64 **page_list);
+
+#define MOD "iw_cxgb3:"
+#define PDBG(fmt, args...) pr_debug(MOD fmt, ##args)
+
+#define IWCH_NODE_DESC "cxgb3 Chelsio Communications"
+
+#endif

From swise at opengridcomputing.com Fri Jun 23 07:29:34 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 23 Jun 2006 09:29:34 -0500
Subject: [openib-general] [PATCH v2 02/14] CXGB3 QP and CQ.
In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop>
References: <20060623142924.32410.7623.stgit@stevo-desktop>
Message-ID: <20060623142934.32410.33916.stgit@stevo-desktop>

This patch contains qp and cq manipulation code.

ISSUE: CQs can overflow with the T3A hardware. There is no way around
this for now. The next spin of the T3 hardware will resolve this issue,
and the driver will be updated.

ISSUE: QP termination/WR flushing is not handled correctly; firmware
support is needed to finalize this.
--- drivers/infiniband/hw/cxgb3/iwch_cq.c | 228 +++++++ drivers/infiniband/hw/cxgb3/iwch_qp.c | 1006 +++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch_user.h | 62 ++ 3 files changed, 1296 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cq.c b/drivers/infiniband/hw/cxgb3/iwch_cq.c new file mode 100644 index 0000000..303b7f2 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cq.c @@ -0,0 +1,228 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include "iwch_provider.h" +#include "iwch.h" + +/* + * Get one cq entry from cxio and map it to openib. + * + * Returns: + * 0 EMPTY; + * 1 cqe returned + * -EAGAIN caller must try again + * any other -errno fatal error + */ +int iwch_poll_cq_one(struct iwch_dev *rhp, struct iwch_cq *chp, + struct ib_wc *wc) +{ + struct iwch_qp *qhp = NULL; + struct t3_cqe cqe, *rd_cqe; + struct t3_wq *wq; + u32 credit = 0; + u8 cqe_flushed; + u64 cookie; + int ret = 1; + + rd_cqe = cxio_next_cqe(&chp->cq); + + if (!rd_cqe) + return 0; + + qhp = get_qhp(rhp, CQE_QPID(*rd_cqe)); + if (!qhp) + wq = NULL; + else { + spin_lock(&qhp->lock); + wq = &(qhp->wq); + } + ret = cxio_poll_cq(wq, &(chp->cq), &cqe, &cqe_flushed, &cookie, + &credit); + if (credit) { + PDBG("%s updating %d cq credits on id %d\n", __FUNCTION__, + credit, chp->cq.cqid); + cxio_hal_cq_op(&rhp->rdev, &chp->cq, CQ_CREDIT_UPDATE, credit); + } + + if (ret) { + ret = -EAGAIN; + goto out; + } + ret = 1; + + BUG_ON(!qhp); + + wc->wr_id = cookie; + wc->qp_num = qhp->wq.qpid; + + PDBG("%s qpid 0x%x type %d opcode %d status 0x%d wrid hi 0x%x " + "lo %x cookie %llx\n", __FUNCTION__, CQE_QPID(cqe), CQE_TYPE(cqe), + CQE_OPCODE(cqe), CQE_STATUS(cqe), CQE_WRID_HI(cqe), + CQE_WRID_LOW(cqe), cookie); + + if (CQE_TYPE(cqe) == 0) { + if (!CQE_STATUS(cqe)) + wc->byte_len = CQE_LEN(cqe); + else + wc->byte_len = 0; + wc->opcode = IB_WC_RECV; + } else { + switch (CQE_OPCODE(cqe)) { + case T3_RDMA_WRITE: + wc->opcode = IB_WC_RDMA_WRITE; + break; + case T3_READ_REQ: + wc->opcode = IB_WC_RDMA_READ; + wc->byte_len = CQE_LEN(cqe); + break; + case T3_SEND: + case T3_SEND_WITH_SE: + wc->opcode = IB_WC_SEND; + break; + case T3_BIND_MW: + wc->opcode = IB_WC_BIND_MW; + break; + + /* these aren't supported yet */ + case T3_SEND_WITH_INV: + case T3_SEND_WITH_SE_INV: + case T3_LOCAL_INV: + case T3_FAST_REGISTER: + default: + PDBG("unexpected opcode(0x%0x) in the CQE received " + "for QPID=0x%0x\n", CQE_OPCODE(cqe), + CQE_QPID(cqe)); + ret = 
-EINVAL; + goto out; + } + } + + if (cqe_flushed) { + wc->status = IB_WC_WR_FLUSH_ERR; + } else { + + switch (CQE_STATUS(cqe)) { + case TPT_ERR_SUCCESS: + wc->status = IB_WC_SUCCESS; + break; + case TPT_ERR_STAG: + wc->status = IB_WC_LOC_ACCESS_ERR; + break; + case TPT_ERR_PDID: + wc->status = IB_WC_LOC_PROT_ERR; + break; + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + wc->status = IB_WC_LOC_ACCESS_ERR; + break; + case TPT_ERR_WRAP: + wc->status = IB_WC_GENERAL_ERR; + break; + case TPT_ERR_BOUND: + wc->status = IB_WC_LOC_LEN_ERR; + break; + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + wc->status = IB_WC_MW_BIND_ERR; + break; + case TPT_ERR_CRC: + case TPT_ERR_MARKER: + case TPT_ERR_PDU_LEN_ERR: + case TPT_ERR_OUT_OF_RQE: + case TPT_ERR_DDP_VERSION: + case TPT_ERR_RDMA_VERSION: + case TPT_ERR_DDP_QUEUE_NUM: + case TPT_ERR_MSN: + case TPT_ERR_TBIT: + case TPT_ERR_MO: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_IRD_OVERFLOW: + case TPT_ERR_OPCODE: + wc->status = IB_WC_FATAL_ERR; + break; + default: + PDBG("unexpected cqe_status(0x%0x) for QPID=0x(%0x)\n", + CQE_STATUS(cqe), CQE_QPID(cqe)); + ret = -EINVAL; + } + } +out: + if (wq) + spin_unlock(&qhp->lock); + return ret; +} + +int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + unsigned long flags; + int npolled; + int err = 0; + + chp = to_iwch_cq(ibcq); + rhp = chp->rhp; + + spin_lock_irqsave(&chp->lock, flags); + for (npolled = 0; npolled < num_entries; ++npolled) { +#ifdef DEBUG + int i=0; +#endif + + /* + * Because T3 can post CQEs that are _not_ associated + * with a WR, we might have to poll again after removing + * one of these. 
+ */ + do { + err = iwch_poll_cq_one(rhp, chp, wc + npolled); +#ifdef DEBUG + BUG_ON(++i > 1000); +#endif + } while (err == -EAGAIN); + if (err <= 0) + break; + } + spin_unlock_irqrestore(&chp->lock, flags); + + if (err < 0) + return err; + else { + return npolled; + } +} + +int iwch_modify_cq(struct ib_cq *cq, int cqe) +{ + PDBG("iwch_modify_cq: TBD\n"); + return 0; +} diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c new file mode 100644 index 0000000..f1136c1 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -0,0 +1,1006 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "iwch_provider.h" +#include "iwch.h" +#include "iwch_cm.h" + +#define NO_SUPPORT -1 + +static inline int iwch_build_rdma_send(union t3_wr *wqe, + struct ib_send_wr *wr, + u8 * flit_cnt) +{ + int i; + switch (wr->opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + if (wr->send_flags & IB_SEND_SOLICITED) + wqe->send.rdmaop = T3_SEND_WITH_SE; + else + wqe->send.rdmaop = T3_SEND; + wqe->send.rem_stag = 0; + break; +#if 0 /* Not currently supported */ + case TYPE_SEND_INVALIDATE: + case TYPE_SEND_INVALIDATE_IMMEDIATE: + wqe->send.rdmaop = T3_SEND_WITH_INV; + wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + break; + case TYPE_SEND_SE_INVALIDATE: + wqe->send.rdmaop = T3_SEND_WITH_SE_INV; + wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + break; +#endif + default: + break; + } + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + wqe->send.reserved = 0; + if (wr->opcode == IB_WR_SEND_WITH_IMM) { + wqe->send.plen = 4; + wqe->send.sgl[0].stag = wr->imm_data; + wqe->send.sgl[0].len = 0; + wqe->send.num_sgle = 0; + *flit_cnt = 5; + } else { + wqe->send.plen = 0; + for (i = 0; i < wr->num_sge; i++) { + if ((wqe->send.plen + wr->sg_list[i].length) < + wqe->send.plen) { + return -EMSGSIZE; + } + wqe->send.plen += wr->sg_list[i].length; + wqe->send.sgl[i].stag = + cpu_to_be32(wr->sg_list[i].lkey); + wqe->send.sgl[i].len = + cpu_to_be32(wr->sg_list[i].length); + wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr); + } + wqe->send.plen = cpu_to_be32(wqe->send.plen); + wqe->send.num_sgle = cpu_to_be32(wr->num_sge); + *flit_cnt = 4 + ((wr->num_sge) << 1); + } + return 0; +} + +static inline int iwch_build_rdma_write(union t3_wr *wqe, + struct ib_send_wr *wr, + u8 *flit_cnt) +{ + int 
i;
+        if (wr->num_sge > T3_MAX_SGE)
+                return -EINVAL;
+        wqe->write.rdmaop = T3_RDMA_WRITE;
+        wqe->write.reserved = 0;
+        wqe->write.stag_sink = cpu_to_be32(wr->wr.rdma.rkey);
+        wqe->write.to_sink = cpu_to_be64(wr->wr.rdma.remote_addr);
+
+        wqe->write.num_sgle = wr->num_sge;
+
+        if (wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) {
+                wqe->write.plen = cpu_to_be32(4);
+                wqe->write.sgl[0].stag = cpu_to_be32(wr->imm_data);
+                wqe->write.sgl[0].len = 0;
+                wqe->write.num_sgle = 0;
+                *flit_cnt = 6;
+        } else {
+                wqe->write.plen = 0;
+                for (i = 0; i < wr->num_sge; i++) {
+                        if ((wqe->write.plen + wr->sg_list[i].length) <
+                            wqe->write.plen) {
+                                return -EMSGSIZE;
+                        }
+                        wqe->write.plen += wr->sg_list[i].length;
+                        wqe->write.sgl[i].stag =
+                                cpu_to_be32(wr->sg_list[i].lkey);
+                        wqe->write.sgl[i].len =
+                                cpu_to_be32(wr->sg_list[i].length);
+                        wqe->write.sgl[i].to =
+                                cpu_to_be64(wr->sg_list[i].addr);
+                }
+                wqe->write.plen = cpu_to_be32(wqe->write.plen);
+                wqe->write.num_sgle = cpu_to_be32(wr->num_sge);
+                *flit_cnt = 5 + ((wr->num_sge) << 1);
+        }
+        return 0;
+}
+
+static inline int iwch_build_rdma_read(union t3_wr *wqe,
+                                       struct ib_send_wr *wr,
+                                       u8 *flit_cnt)
+{
+        if (wr->num_sge > 1)
+                return -EINVAL;
+        wqe->read.rdmaop = T3_READ_REQ;
+        wqe->read.reserved = 0;
+        wqe->read.rem_stag = cpu_to_be32(wr->wr.rdma.rkey);
+        wqe->read.rem_to = cpu_to_be64(wr->wr.rdma.remote_addr);
+        wqe->read.local_stag = cpu_to_be32(wr->sg_list[0].lkey);
+        wqe->read.local_len = cpu_to_be32(wr->sg_list[0].length);
+        wqe->read.local_to = cpu_to_be64(wr->sg_list[0].addr);
+        *flit_cnt = sizeof(struct t3_rdma_read_wr) >> 3;
+        return 0;
+}
+
+/*
+ * TBD: this is going to be moved to firmware. Missing pdid/qpid check for now.
+ */ +static inline int iwch_sgl2pbl_map(struct iwch_dev *rhp, + struct ib_sge *sg_list, u32 num_sgle, + u32 * pbl_addr, u8 * page_size) +{ + int i; + struct iwch_mr *mhp; + u32 offset; + for (i = 0; i < num_sgle; i++) { + mhp = get_mhp(rhp, (sg_list[i].lkey) >> 8); + if (!mhp) + return -EIO; + if (!mhp->attr.state) + return -EIO; + if (mhp->attr.zbva) + return -EIO; + if (sg_list[i].addr < mhp->attr.va_fbo) + return -EINVAL; + if (sg_list[i].addr + ((u64) sg_list[i].length) < + sg_list[i].addr) + return -EINVAL; + if (sg_list[i].addr + ((u64) sg_list[i].length) > + mhp->attr.va_fbo + ((u64) mhp->attr.len)) + return -EINVAL; + offset = sg_list[i].addr - mhp->attr.va_fbo; + offset += ((u32) mhp->attr.va_fbo) % + (1UL << (12 + mhp->attr.page_size)); + pbl_addr[i] = mhp->attr.pbl_addr + + (offset >> (12 + mhp->attr.page_size)); + page_size[i] = mhp->attr.page_size; + } + return 0; +} + +static inline int iwch_build_rdma_recv(struct iwch_dev *rhp, + union t3_wr *wqe, + struct ib_recv_wr *wr) +{ + int i, err = 0; + u32 pbl_addr[4]; + u8 page_size[4]; + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + err = iwch_sgl2pbl_map(rhp, wr->sg_list, wr->num_sge, pbl_addr, + page_size); + if (err) + return err; + wqe->recv.pagesz[0] = page_size[0]; + wqe->recv.pagesz[1] = page_size[1]; + wqe->recv.pagesz[2] = page_size[2]; + wqe->recv.pagesz[3] = page_size[3]; + wqe->recv.num_sgle = cpu_to_be32(wr->num_sge); + for (i = 0; i < wr->num_sge; i++) { + wqe->recv.sgl[i].stag = cpu_to_be32(wr->sg_list[i].lkey); + wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length); + + /* to in the WQE == the offset into the page */ + wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) % + (1UL << (12 + page_size[i]))); + + /* pbl_addr is the adapters address in the PBL */ + wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]); + } + for (; i < T3_MAX_SGE; i++) { + wqe->recv.sgl[i].stag = 0; + wqe->recv.sgl[i].len = 0; + wqe->recv.sgl[i].to = 0; + wqe->recv.pbl_addr[i] = 0; + } + return 0; 
+} + +int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + int err = 0; + u8 t3_wr_flit_cnt; + enum t3_wr_opcode t3_wr_opcode = 0; + enum t3_wr_flags t3_wr_flags; + struct iwch_qp *qhp; + u32 idx; + union t3_wr *wqe; + u32 num_wrs; + int flag; + + qhp = to_iwch_qp(ibqp); + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, + qhp->wq.sq_size_log2); + if (num_wrs <= 0) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -ENOMEM; + } + while (wr) { + if (num_wrs == 0) { + err = -ENOMEM; + *bad_wr = wr; + break; + } + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + t3_wr_flags = 0; + if (wr->send_flags & IB_SEND_SOLICITED) + t3_wr_flags |= T3_SOLICITED_EVENT_FLAG; + if (wr->send_flags & IB_SEND_FENCE) + t3_wr_flags |= T3_READ_FENCE_FLAG; + if (wr->send_flags & IB_SEND_SIGNALED) + t3_wr_flags |= T3_COMPLETION_FLAG; + switch (wr->opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + t3_wr_opcode = T3_WR_SEND; + err = iwch_build_rdma_send(wqe, wr, &t3_wr_flit_cnt); + break; + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + t3_wr_opcode = T3_WR_WRITE; + err = iwch_build_rdma_write(wqe, wr, &t3_wr_flit_cnt); + break; + case IB_WR_RDMA_READ: + PDBG("%s %d - read sq_wptr %u wptr %u cookie %llx\n", + __FUNCTION__, __LINE__, qhp->wq.sq_wptr, + qhp->wq.wptr, wr->wr_id); + t3_wr_opcode = T3_WR_READ; + t3_wr_flags = 0; /* XXX */ + err = iwch_build_rdma_read(wqe, wr, &t3_wr_flit_cnt); + break; + default: + PDBG("iwch_post_sendq: post of type=0x%0x TBD!\n", + wr->opcode); + err = -EINVAL; + } + if (err) { + *bad_wr = wr; + break; + } + wqe->send.wrid.id0.hi = qhp->wq.sq_wptr; + wqe->send.wrid.id0.low = qhp->wq.wptr; + wqe->flit[T3_SQ_COOKIE_FLIT] = wr->wr_id; + build_fw_riwrh((void *) wqe, t3_wr_opcode, t3_wr_flags, + 
Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), + 0, t3_wr_flit_cnt); + PDBG("%s %d cookie %llx idx 0x%x sq_wptr %x sw_rptr %x wqe %p opcode %d\n", + __FUNCTION__, __LINE__, wr->wr_id, idx, + qhp->wq.sq_wptr, qhp->wq.sq_rptr, wqe, t3_wr_opcode); + if (!qhp->wq.sq_oldest_wr && + ((wr->send_flags & IB_SEND_SIGNALED) || + (wr->opcode == IB_WR_RDMA_READ))) { + qhp->wq.sq_oldest_wr = wqe; + PDBG("%s %d sq_oldest_wr %p\n", __FUNCTION__, __LINE__, + qhp->wq.sq_oldest_wr); + } + wr = wr->next; + num_wrs--; + ++(qhp->wq.wptr); + ++(qhp->wq.sq_wptr); + } + spin_unlock_irqrestore(&qhp->lock, flag); + RING_DOORBELL(qhp->wq.doorbell, qhp->wq.qpid); + return err; +} + +int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + int err = 0; + struct iwch_qp *qhp; + u32 idx; + union t3_wr *wqe; + u32 num_wrs; + int flag; + + qhp = to_iwch_qp(ibqp); + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.rq_rptr, qhp->wq.rq_wptr, + qhp->wq.rq_size_log2) - 1; + if (!wr) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + while (wr) { + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + if (num_wrs) + err = iwch_build_rdma_recv(qhp->rhp, wqe, wr); + else + err = -ENOMEM; + if (err) { + *bad_wr = wr; + break; + } + qhp->wq.rq[Q_PTR2IDX(qhp->wq.rq_wptr, qhp->wq.rq_size_log2)] = + wr->wr_id; + build_fw_riwrh((void *) wqe, T3_WR_RCV, T3_COMPLETION_FLAG, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), + 0, sizeof(struct t3_receive_wr) >> 3); + PDBG("%s %d cookie %llx idx 0x%x rq_wptr %x rw_rptr %x " + "wqe %p \n", __FUNCTION__, __LINE__, wr->wr_id, idx, + qhp->wq.rq_wptr, qhp->wq.rq_rptr, wqe); + ++(qhp->wq.rq_wptr); + ++(qhp->wq.wptr); + wr = wr->next; + num_wrs--; + } + spin_unlock_irqrestore(&qhp->lock, flag); + RING_DOORBELL(qhp->wq.doorbell, qhp->wq.qpid); + return 
err; +} + +int iwch_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind) +{ + struct iwch_dev *rhp; + struct iwch_mw *mhp; + struct iwch_qp *qhp; + union t3_wr *wqe; + u32 pbl_addr; + u8 page_size; + u32 num_wrs; + int flag; + struct ib_sge sgl; + int err=0; + enum t3_wr_flags t3_wr_flags; + u32 idx; + + qhp = to_iwch_qp(qp); + mhp = to_iwch_mw(mw); + rhp = qhp->rhp; + + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, + qhp->wq.sq_size_log2); + if ((num_wrs) <= 0) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -ENOMEM; + } + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + PDBG("%s: idx=0x%0x, mw=0x%p, mw_bind=0x%p\n", __FUNCTION__, idx, + mw, mw_bind); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + wqe->send.wrid.id0.hi = qhp->wq.sq_wptr; + wqe->send.wrid.id0.low = qhp->wq.wptr; + + t3_wr_flags = 0; + if (mw_bind->send_flags & IB_SEND_SIGNALED) + t3_wr_flags = T3_COMPLETION_FLAG; + + sgl.addr = mw_bind->addr; + sgl.lkey = mw_bind->mr->lkey; + sgl.length = mw_bind->length; + wqe->bind.reserved = 0; + wqe->bind.type = T3_VA_BASED_TO; + + /* TBD: check perms */ + wqe->bind.perms = iwch_convert_access(mw_bind->mw_access_flags); + wqe->bind.mr_stag = cpu_to_be32(mw_bind->mr->lkey); + wqe->bind.mw_stag = cpu_to_be32(mw->rkey); + wqe->bind.mw_len = cpu_to_be32(mw_bind->length); + wqe->bind.mw_va = cpu_to_be64(mw_bind->addr); + err = iwch_sgl2pbl_map(rhp, &sgl, 1, &pbl_addr, &page_size); + if (err) { + spin_unlock_irqrestore(&qhp->lock, flag); + return err; + } + wqe->bind.mr_pbl_addr = cpu_to_be32(pbl_addr); + wqe->bind.mr_pagesz = page_size; + wqe->bind.reserved2 = 0; + wqe->flit[T3_SQ_COOKIE_FLIT] = mw_bind->wr_id; + build_fw_riwrh((void *)wqe, T3_WR_BIND, t3_wr_flags, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), 0, + sizeof(struct t3_bind_mw_wr) >> 3); + + if 
(!qhp->wq.sq_oldest_wr) { + qhp->wq.sq_oldest_wr = wqe; + PDBG("%s %d sq_oldest_wr %p\n", __FUNCTION__, __LINE__, + qhp->wq.sq_oldest_wr); + } + ++(qhp->wq.wptr); + ++(qhp->wq.sq_wptr); + spin_unlock_irqrestore(&qhp->lock, flag); + + RING_DOORBELL(qhp->wq.doorbell, qhp->wq.qpid); + + return err; +} + +int iwch_query_qp(u64 rh, u64 qp_h, enum iwch_qp_query_flags flags, + struct iwch_qp_attributes *attrs) +{ + return 0; +} + + +static inline void build_term_codes(int t3err, u8 *layer_type, u8 *ecode, + int tagged) +{ + switch (t3err) { + case TPT_ERR_STAG: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_INV_STAG; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_INV_STAG; + } + break; + case TPT_ERR_PDID: + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_STAG_NOT_ASSOC; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_STAG_NOT_ASSOC; + } + break; + case TPT_ERR_WRAP: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_TO_WRAP; + break; + case TPT_ERR_BOUND: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_BASE_BOUNDS; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_BASE_BOUNDS; + } else { + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_MSG_TOOBIG; + } + break; + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_CANT_INV_STAG; + break; + case TPT_ERR_ECC: + case TPT_ERR_ECC_PSTAG: + case TPT_ERR_INTERNAL_ERR: + *layer_type = LAYER_RDMAP|RDMAP_LOCAL_CATA; + *ecode = 0; + break; + case TPT_ERR_OUT_OF_RQE: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MSN_NOBUF; + break; + case TPT_ERR_PBL_ADDR_BOUND: + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_BASE_BOUNDS; + break; + 
case TPT_ERR_CRC: + *layer_type = LAYER_MPA|DDP_LLP; + *ecode = MPA_CRC_ERR; + break; + case TPT_ERR_MARKER: + *layer_type = LAYER_MPA|DDP_LLP; + *ecode = MPA_MARKER_ERR; + break; + case TPT_ERR_PDU_LEN_ERR: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_MSG_TOOBIG; + break; + case TPT_ERR_DDP_VERSION: + if (tagged) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; /* XXX */ + *ecode = DDPT_INV_VERS; + } else { + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; /* XXX */ + *ecode = DDPU_INV_VERS; + } + break; + case TPT_ERR_RDMA_VERSION: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_INV_VERS; + break; + case TPT_ERR_OPCODE: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_INV_OPCODE; + break; + case TPT_ERR_DDP_QUEUE_NUM: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_QN; + break; + case TPT_ERR_MSN: + case TPT_ERR_MSN_GAP: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_IRD_OVERFLOW: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MSN_RANGE; + break; + case TPT_ERR_TBIT: + *layer_type = LAYER_DDP|DDP_LOCAL_CATA; + *ecode = 0; + break; + case TPT_ERR_MO: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MO; + break; + default: + *layer_type = LAYER_RDMAP|DDP_LOCAL_CATA; + *ecode = 0; + break; + } +} + +/* + * This posts a TERMINATE with layer=RDMA, type=catastrophic. 
+ */ +int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg) +{ + int err = 0; + u32 idx; + union t3_wr *wqe; + int num_wrs; + int flag; + struct terminate_message *term; + int status; + int tagged = 0; + + PDBG("%s %d\n", __FUNCTION__, __LINE__); + spin_lock_irqsave(&qhp->lock, flag); + num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, + qhp->wq.sq_size_log2); + if (num_wrs <= 0) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EIO; + } + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + if (!qhp->wq.sq_oldest_wr) { + qhp->wq.sq_oldest_wr = wqe; + PDBG("%s %d sq_oldest_wr %p\n", __FUNCTION__, __LINE__, + qhp->wq.sq_oldest_wr); + } + wqe->send.wrid.id0.hi = qhp->wq.sq_wptr; + wqe->send.wrid.id0.low = qhp->wq.wptr; + wqe->send.rdmaop = T3_TERMINATE; + wqe->send.rem_stag = 0; + wqe->send.reserved = 0; + + /* indicate data is immediate. */ + wqe->send.num_sgle = 0; + + /* immediate data length */ + wqe->send.plen = htonl(4); + + /* immediate data starts here. */ + term = (struct terminate_message *)wqe->send.sgl; + status = rsp_msg ? CQE_STATUS(rsp_msg->cqe) : TPT_ERR_INTERNAL_ERR; + if (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE) + tagged = 1; + if ((CQE_OPCODE(rsp_msg->cqe) == T3_READ_REQ) || + (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) + tagged = 2; + build_term_codes(status, &term->layer_etype, &term->ecode, tagged); + term->hdrct_rsvd = 0; /* no header info */ + + wqe->flit[T3_SQ_COOKIE_FLIT] = ~0; + build_fw_riwrh((void *)wqe, T3_WR_SEND, + T3_COMPLETION_FLAG|T3_NOTIFY_FLAG, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), 0, 5); + ++(qhp->wq.wptr); + ++(qhp->wq.sq_wptr); + spin_unlock_irqrestore(&qhp->lock, flag); + RING_DOORBELL(qhp->wq.doorbell, qhp->wq.qpid); + return err; +} + +/* + * Assumes qhp lock is held. 
+ */
+static void flush_qp(struct iwch_qp *qhp, int *flag)
+{
+        struct iwch_cq *rchp, *schp;
+
+        rchp = qhp->rhp->cqid2hlp[qhp->attr.rcq];
+        schp = qhp->rhp->cqid2hlp[qhp->attr.scq];
+
+        /* take a ref on the qhp since we must release the lock */
+        atomic_inc(&qhp->refcnt);
+        spin_unlock_irqrestore(&qhp->lock, *flag);
+
+        /* locking hierarchy: cq lock first, then qp lock. */
+        spin_lock_irqsave(&rchp->lock, *flag);
+        spin_lock(&qhp->lock);
+        cxio_flush_rq(&qhp->rhp->rdev, &qhp->wq, &rchp->cq);
+        spin_unlock(&qhp->lock);
+        spin_unlock_irqrestore(&rchp->lock, *flag);
+
+        /* locking hierarchy: cq lock first, then qp lock. */
+        spin_lock_irqsave(&schp->lock, *flag);
+        spin_lock(&qhp->lock);
+        cxio_flush_sq(&qhp->rhp->rdev, &qhp->wq, &schp->cq);
+        spin_unlock(&qhp->lock);
+        spin_unlock_irqrestore(&schp->lock, *flag);
+
+        /* deref */
+        if (atomic_dec_and_test(&qhp->refcnt))
+                wake_up(&qhp->wait);
+
+        spin_lock_irqsave(&qhp->lock, *flag);
+}
+
+static int rdma_init(struct iwch_dev *rhp, struct iwch_qp *qhp,
+                     enum iwch_qp_attr_mask mask,
+                     struct iwch_qp_attributes *attrs)
+{
+        struct t3_rdma_init_attr init_attr;
+        int ret;
+
+        init_attr.tid = qhp->ep->hwtid;
+        init_attr.qpid = qhp->wq.qpid;
+        init_attr.pdid = qhp->attr.pd;
+        init_attr.scqid = qhp->attr.scq;
+        init_attr.rcqid = qhp->attr.rcq;
+
+        /* TBD!!! rq table slot allocation needs
+         * to be implemented in the core driver.
+         * For now, allocate 1Kx64B for each rq
+         */
+        init_attr.rq_addr = (qhp->ep->hwtid) << 16;
+        init_attr.rq_size = 1 << qhp->wq.rq_size_log2;
+
+        PDBG("%s init_attr.rq_size = %d\n", __FUNCTION__, init_attr.rq_size);
+        init_attr.mpaattrs = uP_RI_MPA_IETF_ENABLE |
+                             qhp->attr.mpa_attr.recv_marker_enabled |
+                             (qhp->attr.mpa_attr.xmit_marker_enabled << 1) |
+                             (qhp->attr.mpa_attr.crc_enabled << 2);
+
+        /*
+         * XXX - The IWCM doesn't quite handle getting these
+         * attrs set before going into RTS. For now, just turn
+         * them on always...
+ */ +#if 0 + init_attr.qpcaps = qhp->attr.enableRdmaRead | + (qhp->attr.enableRdmaWrite << 1) | + (qhp->attr.enableBind << 2) | + (qhp->attr.enable_stag0_fastreg << 3) | + (qhp->attr.enable_stag0_fastreg << 4); +#else + init_attr.qpcaps = 0x1f; +#endif + init_attr.tcp_emss = qhp->ep->emss; + init_attr.ord = qhp->attr.max_ord; + init_attr.ird = qhp->attr.max_ird; + init_attr.qp_dma_addr = qhp->wq.dma_addr; + init_attr.qp_dma_size = (1UL << qhp->wq.size_log2); + init_attr.rqes_posted = Q_EMPTY(qhp->wq.rq_rptr, qhp->wq.rq_wptr) ? + 0 : 1; + ret = cxio_rdma_init(&rhp->rdev, &init_attr); + PDBG("%s ret %d\n", __FUNCTION__, ret); + return ret; +} + +int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs, + int internal) +{ + int ret = 0; + struct iwch_qp_attributes newattr = qhp->attr; + int flag; + int disconnect = 0; + int terminate = 0; + int abort = 0; + int free = 0; + struct iwch_ep *ep = NULL; + + PDBG("%s %d qhp %p qpid %d ep %p state %d -> %d\n", __FUNCTION__, + __LINE__, qhp, qhp->wq.qpid, qhp->ep, qhp->attr.state, + (mask & IWCH_QP_ATTR_NEXT_STATE) ? 
attrs->next_state : -1); + + spin_lock_irqsave(&qhp->lock, flag); + + /* Process attr changes if in IDLE */ + if (mask & IWCH_QP_ATTR_VALID_MODIFY) { + if (qhp->attr.state != IWCH_QP_STATE_IDLE) { + ret = -EIO; + goto out; + } + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_READ) + newattr.enable_rdma_read = attrs->enable_rdma_read; + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_WRITE) + newattr.enable_rdma_write = attrs->enable_rdma_write; + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_BIND) + newattr.enable_bind = attrs->enable_bind; + if (mask & IWCH_QP_ATTR_MAX_ORD) { + if (attrs->max_ord > + rhp->attr.max_rdma_read_qp_depth) { + ret = -EINVAL; + goto out; + } + newattr.max_ord = attrs->max_ord; + } + if (mask & IWCH_QP_ATTR_MAX_IRD) { + if (attrs->max_ird > + rhp->attr.max_rdma_reads_per_qp) { + ret = -EINVAL; + goto out; + } + newattr.max_ird = attrs->max_ird; + } + qhp->attr = newattr; + } + + if (!(mask & IWCH_QP_ATTR_NEXT_STATE)) + goto out; + if (qhp->attr.state == attrs->next_state) + goto out; + + switch (qhp->attr.state) { + case IWCH_QP_STATE_IDLE: + switch (attrs->next_state) { + case IWCH_QP_STATE_RTS: + if (!(mask & IWCH_QP_ATTR_LLP_STREAM_HANDLE)) { + ret = -EINVAL; + goto out; + } + if (!(mask & IWCH_QP_ATTR_MPA_ATTR)) { + ret = -EINVAL; + goto out; + } + qhp->attr.mpa_attr = attrs->mpa_attr; + qhp->attr.llp_stream_handle = attrs->llp_stream_handle; + qhp->ep = qhp->attr.llp_stream_handle; + qhp->attr.state = IWCH_QP_STATE_RTS; + + /* + * Ref the endpoint here and deref when we + * disassociate the endpoint from the QP. This + * happens in CLOSING->IDLE transition or *->ERROR + * transition. 
+ */ + atomic_inc(&qhp->ep->com.refcnt); + spin_unlock_irqrestore(&qhp->lock, flag); + ret = rdma_init(rhp, qhp, mask, attrs); + spin_lock_irqsave(&qhp->lock, flag); + if (ret) + goto err; + break; + case IWCH_QP_STATE_ERROR: + qhp->attr.state = IWCH_QP_STATE_ERROR; + flush_qp(qhp, &flag); + break; + default: + ret = -EINVAL; + goto out; + } + break; + case IWCH_QP_STATE_RTS: + switch (attrs->next_state) { + case IWCH_QP_STATE_CLOSING: + BUG_ON(atomic_read(&qhp->ep->com.refcnt) < 2); + qhp->attr.state = IWCH_QP_STATE_CLOSING; + if (Q_EMPTY(qhp->wq.sq_rptr, qhp->wq.sq_wptr)) { + if (!internal) { + abort=0; + disconnect = 1; + ep = qhp->ep; + } + } else { + if (!internal) { + abort=1; + disconnect = 1; + ep = qhp->ep; + } + ret = -EINVAL; + goto err; + } + break; + case IWCH_QP_STATE_TERMINATE: + qhp->attr.state = IWCH_QP_STATE_TERMINATE; + if (!internal) + terminate = 1; + break; + case IWCH_QP_STATE_ERROR: + qhp->attr.state = IWCH_QP_STATE_ERROR; + if (!internal) { + abort=1; + disconnect = 1; + ep = qhp->ep; + } + goto err; + break; + default: + ret = -EINVAL; + goto out; + } + break; + case IWCH_QP_STATE_CLOSING: + if (!internal) { + ret = -EINVAL; + goto out; + } + switch (attrs->next_state) { + case IWCH_QP_STATE_IDLE: + qhp->attr.state = IWCH_QP_STATE_IDLE; + qhp->attr.llp_stream_handle = NULL; + free_ep(&qhp->ep->com); + qhp->ep = NULL; + wake_up(&qhp->wait); + break; + case IWCH_QP_STATE_ERROR: + goto err; + default: + ret = -EINVAL; + goto err; + } + break; + case IWCH_QP_STATE_ERROR: + if (attrs->next_state != IWCH_QP_STATE_IDLE) { + ret = -EINVAL; + goto out; + } + + if (!Q_EMPTY(qhp->wq.sq_rptr, qhp->wq.sq_wptr) || + !Q_EMPTY(qhp->wq.rq_rptr, qhp->wq.rq_wptr)) { + ret = -EINVAL; + goto out; + } + qhp->attr.state = IWCH_QP_STATE_IDLE; + memset(&qhp->attr, 0, sizeof(qhp->attr)); + break; + case IWCH_QP_STATE_TERMINATE: + if (!internal) { + ret = -EINVAL; + goto out; + } + goto err; + break; + default: + printk(KERN_ERR "%s in a bad state %d\n", + 
__FUNCTION__, qhp->attr.state); + ret = -EINVAL; + goto err; + break; + } + goto out; +err: + PDBG("%s disassociating LLP EP %p qpid %d\n", __FUNCTION__, qhp->ep, + qhp->wq.qpid); + + /* disassociate the LLP connection */ + qhp->attr.llp_stream_handle = NULL; + ep = qhp->ep; + qhp->ep = NULL; + qhp->attr.state = IWCH_QP_STATE_ERROR; + free=1; + wake_up(&qhp->wait); + BUG_ON(!ep); +#ifdef notyet + flush_qp(qhp, flag); +#endif +out: + spin_unlock_irqrestore(&qhp->lock, flag); + + if (terminate) + iwch_post_terminate(qhp, NULL); + + /* + * If disconnect is 1, then we need to initiate a disconnect + * on the EP. This can be a normal close (RTS->CLOSING) or + * an abnormal close (RTS/CLOSING->ERROR). + */ + if (disconnect) + iwch_ep_disconnect(ep, abort, GFP_KERNEL); + + /* + * If free is 1, then we've disassociated the EP from the QP + * and we need to dereference the EP. + */ + if (free) + free_ep(&ep->com); + + PDBG("%s %d state -> %d\n", __FUNCTION__, __LINE__, qhp->attr.state); + return ret; +} + +static int quiesce_qp(struct iwch_qp *qhp) +{ + spin_lock_irq(&qhp->lock); + iwch_quiesce_tid(qhp->ep); + qhp->flags |= QP_QUIESCED; + spin_unlock_irq(&qhp->lock); + return 0; +} + +static int resume_qp(struct iwch_qp *qhp) +{ + spin_lock_irq(&qhp->lock); + iwch_resume_tid(qhp->ep); + qhp->flags &= ~QP_QUIESCED; + spin_unlock_irq(&qhp->lock); + return 0; +} + +int iwch_quiesce_qps(struct iwch_cq *chp) +{ + int i; + struct iwch_qp *qhp; + + for (i=0; i < T3_MAX_NUM_QP; i++) { + qhp = chp->rhp->qpid2hlp[i]; + if (!qhp) + continue; + if ((qhp->attr.rcq == chp->cq.cqid) && !qp_quiesced(qhp)) { + quiesce_qp(qhp); + continue; + } + if ((qhp->attr.scq == chp->cq.cqid) && !qp_quiesced(qhp)) + quiesce_qp(qhp); + } + return 0; +} + +int iwch_resume_qps(struct iwch_cq *chp) +{ + int i; + struct iwch_qp *qhp; + + for (i=0; i < T3_MAX_NUM_QP; i++) { + qhp = chp->rhp->qpid2hlp[i]; + if (!qhp) + continue; + if ((qhp->attr.rcq == chp->cq.cqid) && qp_quiesced(qhp)) { + resume_qp(qhp); + 
continue; + } + if ((qhp->attr.scq == chp->cq.cqid) && qp_quiesced(qhp)) + resume_qp(qhp); + } + return 0; +} diff --git a/drivers/infiniband/hw/cxgb3/iwch_user.h b/drivers/infiniband/hw/cxgb3/iwch_user.h new file mode 100644 index 0000000..ab87f72 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_user.h @@ -0,0 +1,62 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __IWCH_USER_H__ +#define __IWCH_USER_H__ + +#define IWCH_UVERBS_ABI_VERSION 1 + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + * In particular do not use pointer types -- pass pointers in __u64 + * instead. + */ + +struct iwch_create_cq_resp { + __u32 cqid; + __u32 entries; /* actual number of entries after creation */ + __u64 physaddr; /* library mmaps this to get addressability */ + __u64 queue; +}; + +struct iwch_create_qp_resp { + __u32 qpid; + __u32 entries; /* actual number of entries after creation */ + __u64 physaddr; /* library mmaps this to get addressability */ + __u64 physsize; /* library mmaps this to get addressability */ + __u64 queue; + __u64 sq_db_page; + __u64 rq_db_page; +}; +#endif From swise at opengridcomputing.com Fri Jun 23 07:29:39 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:29:39 -0500 Subject: [openib-general] [PATCH v2 03/14] CXGB3 Memory Registration In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623142939.32410.29905.stgit@stevo-desktop> This patch contains the code to register memory regions and windows. --- drivers/infiniband/hw/cxgb3/iwch_mem.c | 171 ++++++++++++++++++++++++++++++++ 1 files changed, 171 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_mem.c b/drivers/infiniband/hw/cxgb3/iwch_mem.c new file mode 100644 index 0000000..68ed76a --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_mem.c @@ -0,0 +1,171 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include + +#include +#include + +#include "cxio_hal.h" +#include "iwch.h" +#include "iwch_provider.h" + +int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + u64 *page_list) +{ + u32 stag; + u64 mem_h; + + + if (cxio_register_phys_mem(&rhp->rdev, + &stag, mhp->attr.pdid, + mhp->attr.perms, + mhp->attr.zbva, + mhp->attr.va_fbo, + mhp->attr.len, + shift-12, + page_list, + &mhp->attr.pbl_size, &mhp->attr.pbl_addr)) { + return -ENOMEM; + } + mhp->attr.state = 1; + mhp->attr.stag = stag; + mem_h = stag >> 8; + mhp->ibmr.rkey = mhp->ibmr.lkey = stag; + rhp->stag2hlp[mem_h] = mhp; + PDBG("iwch_register_mem: mem_h(0x%0llx) mhp(%p)\n", mem_h, mhp); + return 0; +} + +int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + u64 *page_list) +{ + u32 stag; + u64 mem_h; + + + stag = mhp->attr.stag; + if (cxio_reregister_phys_mem(&rhp->rdev, + &stag, mhp->attr.pdid, + mhp->attr.perms, + mhp->attr.zbva, + mhp->attr.va_fbo, + mhp->attr.len, + shift-12, + page_list, + &mhp->attr.pbl_size, &mhp->attr.pbl_addr)) { + return -ENOMEM; + } + mhp->attr.state = 1; + mhp->attr.stag = stag; + mem_h = stag >> 8; + mhp->ibmr.rkey = mhp->ibmr.lkey = stag; + rhp->stag2hlp[mem_h] = mhp; + PDBG("iwch_reregister_mem: mem_h(0x%0llx) mhp(%p)\n", mem_h, mhp); + return 0; +} + +int build_phys_page_list(struct ib_phys_buf *buffer_list, + int num_phys_buf, + u64 *iova_start, + u64 *total_size, + int *npages, + int *shift, + u64 **page_list) +{ + u64 mask; + int i, j, n; + + mask = 0; + *total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (i != 0 && buffer_list[i].addr & ~PAGE_MASK) + return -EINVAL; + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return -EINVAL; + *total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + if (*total_size > 0xFFFFFFFFULL) + return -ENOMEM; + + /* Find largest page shift we can use to cover buffers */ 
+ for (*shift = PAGE_SHIFT; *shift < 27; ++(*shift)) + if (num_phys_buf > 1) { + if ((1ULL << *shift) & mask) + break; + } else { + if (1ULL << *shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << *shift) - 1))) + break; + } + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << *shift) - 1); + buffer_list[0].addr &= ~0ull << *shift; + + *npages = 0; + for (i = 0; i < num_phys_buf; ++i) + *npages += (buffer_list[i].size + + (1ULL << *shift) - 1) >> *shift; + + if (!*npages) { + return -EINVAL; + } + + *page_list = kmalloc(sizeof(u64) * *npages, GFP_KERNEL); + if (!*page_list) { + return -ENOMEM; + } + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << *shift) - 1) >> *shift; + ++j) + (*page_list)[n++] = cpu_to_be64(buffer_list[i].addr + + ((u64) j << *shift)); + + PDBG("%s va %llx mask %llx shift %d len %lld pbl_size %d\n", + __FUNCTION__, *iova_start, mask, *shift, *total_size, *npages); + PDBG("pa0 %llx\n", (*page_list)[0]); + + return 0; + +} From swise at opengridcomputing.com Fri Jun 23 07:29:44 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:29:44 -0500 Subject: [openib-general] [PATCH v2 04/14] CXGB3 Async Events In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623142944.32410.95234.stgit@stevo-desktop> This patch contains code to handle async and completion events. --- drivers/infiniband/hw/cxgb3/iwch_ev.c | 209 +++++++++++++++++++++++++++++++++ 1 files changed, 209 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_ev.c b/drivers/infiniband/hw/cxgb3/iwch_ev.c new file mode 100644 index 0000000..36837b1 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_ev.c @@ -0,0 +1,209 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include +#include +#include +#include "iwch_provider.h" +#include "iwch.h" +#include "iwch_cm.h" +#include "cxio_hal.h" +#include "cxio_wr.h" + +static void post_qp_event(struct iwch_dev *rnicp, struct iwch_cq *chp, + struct respQ_msg_t *rsp_msg, + enum ib_event_type ib_event, + int send_term) +{ + struct ib_event event; + struct iwch_qp_attributes attrs; + struct iwch_qp *qhp; + + printk(KERN_ERR "%s - AE qpid 0x%x opcode %d status 0x%x " + "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, + CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), + CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe), + CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); + + spin_lock(&rnicp->lock); + qhp = rnicp->qpid2hlp[CQE_QPID(rsp_msg->cqe)]; + if (!qhp) { + printk(KERN_ERR "%s unaffiliated error %d\n", + __FUNCTION__, CQE_STATUS(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + BUG_ON(1); + return; + } + atomic_inc(&qhp->refcnt); + spin_unlock(&rnicp->lock); + + event.event = ib_event; + event.device = chp->ibcq.device; + if (ib_event == IB_EVENT_CQ_ERR) + event.element.cq = &chp->ibcq; + else + event.element.qp = &qhp->ibqp; + + if (qhp->ibqp.event_handler) + (*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context); + attrs.next_state = IWCH_QP_STATE_TERMINATE; + if ((qhp->attr.state == IWCH_QP_STATE_RTS) && + !iwch_modify_qp(qhp->rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1) && send_term) + iwch_post_terminate(qhp, rsp_msg); + if (atomic_dec_and_test(&qhp->refcnt)) + wake_up(&qhp->wait); +} + +void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb) +{ + struct iwch_dev *rnicp; + struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) skb->data; + struct iwch_cq *chp; + struct iwch_qp *qhp; + + u64 cq_h = be16_to_cpu(rsp_msg->cq_id); + rnicp = (struct iwch_dev *) rdev_p->ulp; + + spin_lock(&rnicp->lock); + chp = rnicp->cqid2hlp[cq_h]; + qhp = rnicp->qpid2hlp[CQE_QPID(rsp_msg->cqe)]; + if (!chp || !qhp) { + printk(KERN_ERR MOD "Event for deleted cq or qp - " + 
"cqid %d qpid %d\n", (u32)cq_h, + (u32)CQE_QPID(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + goto out; + } + iwch_qp_add_ref(&qhp->ibqp); + atomic_inc(&chp->refcnt); + spin_unlock(&rnicp->lock); + + PDBG("%s - cq_h %lld\n", __FUNCTION__, cq_h); + + BUG_ON(!chp->ibcq.comp_handler); + + /* + * 1) incoming TERMINATE message. + * 2) completion of our sending a TERMINATE. + */ + if ((CQE_OPCODE(rsp_msg->cqe) == T3_TERMINATE) && + (CQE_STATUS(rsp_msg->cqe) == 0)) { + if (SQ_TYPE(rsp_msg->cqe)) { + PDBG("%s %d disconnecting\n", __FUNCTION__, __LINE__); + BUG_ON(!qhp->ep); + iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC); + } else { + PDBG("%s %d post REQ_ERR AE\n", __FUNCTION__, __LINE__); + post_qp_event(rnicp, chp, rsp_msg, + IB_EVENT_QP_REQ_ERR, 0); + iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC); + } + goto done; + } + + /* Bad incoming Read request */ + if (SQ_TYPE(rsp_msg->cqe) && + (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) { + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1); + goto done; + } + + /* Bad incoming write */ + if (RQ_TYPE(rsp_msg->cqe) && + (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE)) { + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1); + goto done; + } + + switch (CQE_STATUS(rsp_msg->cqe)) { + + /* Completion Events */ + case TPT_ERR_SUCCESS: + + /* + * Confirm the destination entry if this is a RECV completion. 
+ */ + if (qhp->ep && SQ_TYPE(rsp_msg->cqe)) + dst_confirm(qhp->ep->dst); + + case TPT_ERR_STAG: + case TPT_ERR_PDID: + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + case TPT_ERR_WRAP: + case TPT_ERR_BOUND: + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + (*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context); + break; + + /* Device Fatal Errors */ + case TPT_ERR_ECC: + case TPT_ERR_ECC_PSTAG: + case TPT_ERR_INTERNAL_ERR: + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_DEVICE_FATAL, 1); + break; + + /* QP Fatal Errors */ + case TPT_ERR_OUT_OF_RQE: + case TPT_ERR_PBL_ADDR_BOUND: + case TPT_ERR_CRC: + case TPT_ERR_MARKER: + case TPT_ERR_PDU_LEN_ERR: + case TPT_ERR_DDP_VERSION: + case TPT_ERR_RDMA_VERSION: + case TPT_ERR_OPCODE: + case TPT_ERR_DDP_QUEUE_NUM: + case TPT_ERR_MSN: + case TPT_ERR_TBIT: + case TPT_ERR_MO: + case TPT_ERR_MSN_GAP: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_RQE_ADDR_BOUND: + case TPT_ERR_IRD_OVERFLOW: + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1); + break; + + default: + PDBG("%s unknown T3 status 0x%x\n", __FUNCTION__, + CQE_STATUS(rsp_msg->cqe)); + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1); + break; + } +done: + if (atomic_dec_and_test(&chp->refcnt)) + wake_up(&chp->wait); + iwch_qp_rem_ref(&qhp->ibqp); +out: + dev_kfree_skb_irq(skb); +} From swise at opengridcomputing.com Fri Jun 23 07:30:31 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:30:31 -0500 Subject: [openib-general] [PATCH v2 13/14] CXGB3 Makefiles/Kconfig In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623143031.32410.45614.stgit@stevo-desktop> The cxgb3 rdma support is broken into 2 modules: iw_cxgb3.ko - the openib provider module. cxgb3c.ko - the cxgb3 "core" services module. 
--- drivers/infiniband/Kconfig | 1 + drivers/infiniband/Makefile | 1 + drivers/infiniband/hw/cxgb3/Kconfig | 14 ++++++++++++++ drivers/infiniband/hw/cxgb3/Makefile | 21 +++++++++++++++++++++ drivers/infiniband/hw/cxgb3/locking.txt | 25 +++++++++++++++++++++++++ 5 files changed, 62 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index 04e6d4f..7dcf976 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -37,6 +37,7 @@ config INFINIBAND_ADDR_TRANS source "drivers/infiniband/hw/mthca/Kconfig" source "drivers/infiniband/hw/ipath/Kconfig" source "drivers/infiniband/hw/amso1100/Kconfig" +source "drivers/infiniband/hw/cxgb3/Kconfig" source "drivers/infiniband/ulp/ipoib/Kconfig" diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile index e2b93f9..1a73af0 100644 --- a/drivers/infiniband/Makefile +++ b/drivers/infiniband/Makefile @@ -2,5 +2,6 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ obj-$(CONFIG_IPATH_CORE) += hw/ipath/ obj-$(CONFIG_INFINIBAND_AMSO1100) += hw/amso1100/ +obj-$(CONFIG_INFINIBAND_IWCH) += hw/cxgb3/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ diff --git a/drivers/infiniband/hw/cxgb3/Kconfig b/drivers/infiniband/hw/cxgb3/Kconfig new file mode 100644 index 0000000..156df63 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/Kconfig @@ -0,0 +1,14 @@ +config INFINIBAND_IWCH + tristate "Chelsio OpenIB module" + depends on CHELSIO_T3 && INFINIBAND + ---help--- + This is the Chelsio OpenIB provider module. + +config INFINIBAND_IWCH_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_IWCH + default n + ---help--- + This option causes the Chelsio OpenIB provider module to produce + a bunch of debug messages. Select this if you are developing the + driver or trying to diagnose a problem. 
diff --git a/drivers/infiniband/hw/cxgb3/Makefile b/drivers/infiniband/hw/cxgb3/Makefile new file mode 100644 index 0000000..ed72caa --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/Makefile @@ -0,0 +1,21 @@ +EXTRA_CFLAGS += \ + -DCONFIG_CHELSIO_T3_OFFLOAD \ + -I$(TOPDIR)/drivers/infiniband/include \ + -I$(TOPDIR)/drivers/net/cxgb3 \ + -I$(TOPDIR)/drivers/infiniband/hw/cxgb3/t3c \ + -I$(TOPDIR)/drivers/infiniband/hw/cxgb3/core + +obj-$(CONFIG_INFINIBAND_IWCH) += iw_cxgb3.o cxgb3c.o + +iw_cxgb3-y := iwch_cm.o iwch_ev.o iwch_cq.o iwch_qp.o iwch_mem.o \ + iwch_provider.o iwch.o core/cxio_hal.o core/cxio_resource.o + +ifdef CONFIG_INFINIBAND_IWCH_DEBUG +EXTRA_CFLAGS += -O1 -g -DDEBUG +iw_cxgb3-y += core/cxio_dbg.o +endif + +cxgb3c-y := \ + t3c/t3c.o \ + t3c/l2t.o \ + t3c/t3cdev.o diff --git a/drivers/infiniband/hw/cxgb3/locking.txt b/drivers/infiniband/hw/cxgb3/locking.txt new file mode 100644 index 0000000..e5e9991 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/locking.txt @@ -0,0 +1,25 @@ +cq lock: + - spin lock + - used to synchronize the t3_cq + +qp lock: + - spin lock + - used to synchronize updates to the qp state, attrs, and the t3_wq. + - touched on interrupt and process context + +rnicp lock: + - spin lock + - touched on interrupt and process context + - used around lookup tables mapping CQID and QPID to a structure. + - used also to bump the refcnt atomically with the lookup. + +poll: + lock+disable on cq lock + lock qp lock for each cqe that is polled around the call + to cxio_poll_cq(). + +post: + lock+disable qp lock + +global mutex iwch_mutex: + used to maintain global device list. 
From swise at opengridcomputing.com Fri Jun 23 07:30:36 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:30:36 -0500 Subject: [openib-general] [PATCH v2 14/14] CXGB3 Low Level Driver ULP Interface In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623143036.32410.98171.stgit@stevo-desktop> This is all I'm submitting from the LLD/NETDEV driver. These headers define the interface used by the other modules to discover devices and communicate with the device. The entire LLD driver can be found in gen2/branches/iwarp/src/linux-kernel/net/cxgb3 --- drivers/net/cxgb3/t3_core.h | 45 ++++++++++++++++++++++++++++ drivers/net/cxgb3/t3cdev.h | 69 +++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 114 insertions(+), 0 deletions(-) diff --git a/drivers/net/cxgb3/t3_core.h b/drivers/net/cxgb3/t3_core.h new file mode 100644 index 0000000..1ce076a --- /dev/null +++ b/drivers/net/cxgb3/t3_core.h @@ -0,0 +1,45 @@ +/* + * Copyright (C) 2003-2006 Chelsio Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _T3_CORE_H_ +#define _T3_CORE_H_ +#include + +struct t3cdev; +struct t3_core { + void (*add) (struct t3cdev *); + void (*remove) (struct t3cdev *); +}; + +extern struct t3_core *t3_core; +void t3_register_core(struct t3_core *core); +void t3_unregister_core(struct t3_core *core); +#endif diff --git a/drivers/net/cxgb3/t3cdev.h b/drivers/net/cxgb3/t3cdev.h new file mode 100644 index 0000000..7bc2df6 --- /dev/null +++ b/drivers/net/cxgb3/t3cdev.h @@ -0,0 +1,69 @@ +/* + * Copyright (C) 2003-2006 Chelsio Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _T3CDEV_H_ +#define _T3CDEV_H_ + +#include +#include +#include +#include +#include +#include +#include +#include + +#define T3CNAMSIZ 16 + +#define NETIF_F_TCPIP_OFFLOAD (1 << 16) + +/* Get the t3cdev associated with a net_device */ +#define T3CDEV(netdev) (*(struct t3cdev **)&(netdev)->ec_ptr) + +struct t3cdev { + char name[T3CNAMSIZ]; /* T3C device name */ + struct list_head t3c_list; /* for list linking */ + struct net_device *lldev; /* LL dev associated with T3C messages */ + struct proc_dir_entry *proc_dir; /* root of proc dir for this T3C */ + int (*open)(struct t3cdev *dev); + int (*close)(struct t3cdev *dev); + int (*send)(struct t3cdev *dev, struct sk_buff *skb); + int (*recv)(struct t3cdev *dev, struct sk_buff **skb, int n); + int (*ctl)(struct t3cdev *dev, unsigned int req, void *data); + void (*neigh_update)(struct t3cdev *dev, struct neighbour *neigh, + int fl, struct net_device *lldev); + void *priv; /* driver private data */ + void *l2opt; /* optional layer 2 data */ + void *l3opt; /* optional layer 3 data */ + void *l4opt; /* optional layer 4 data */ + void *ulp; /* ulp stuff */ +}; +#endif /* _T3CDEV_H_ */ From swise at opengridcomputing.com Fri Jun 23 07:29:55 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:29:55 -0500 Subject: [openib-general] [PATCH v2 06/14] CXGB3 RDMA Core Debug Code In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> 
Message-ID: <20060623142955.32410.44090.stgit@stevo-desktop> This patch implements debug code for the RDMA Core. --- drivers/infiniband/hw/cxgb3/core/cxio_dbg.c | 209 +++++++++++++++++++++++++++ 1 files changed, 209 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c new file mode 100644 index 0000000..4cc3e96 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c @@ -0,0 +1,209 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifdef DEBUG +#include +#include "common.h" +#include "cxgb3_ioctl.h" +#include "cxio_hal.h" +#include "cxio_wr.h" + +void cxio_dump_tpt(struct cxio_rdev *rdev, u32 stag) +{ + struct ch_mem_range *m; + u64 *data; + int rc; + int size = 32; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + DBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = (stag>>8) * 32 + rdev->rnic_info.tpt_base; + m->len = size; + DBG("%s TPT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + DBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + DBG("TPT %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_pbl(struct cxio_rdev *rdev, u32 pbl_addr, uint len, u8 shift) +{ + struct ch_mem_range *m; + u64 *data; + int rc; + int size, npages; + + shift += 12; + npages = (len + (1ULL << shift) - 1) >> shift; + size = npages * sizeof(u64); + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + DBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = (pbl_addr<<3) + rdev->rnic_info.pbl_base; + m->len = size; + DBG("%s PBL addr 0x%x len %d depth %d\n", + __FUNCTION__, m->addr, m->len, npages); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + DBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + DBG("PBL %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_wqe(union t3_wr *wqe) +{ + u64 *data = (u64 *)wqe; + uint size = (uint)(be64_to_cpu(*data) & 0xff); + + while (size > 0) { + DBG("WQE %p: %016llx\n", data, be64_to_cpu(*data)); + size--; + data++; + } +} + +void cxio_dump_wce(struct t3_cqe *wce) +{ + u64 
*data = (u64 *)wce; + int size = sizeof(*wce); + + while (size > 0) { + DBG("WCE %p: %016llx\n", data, be64_to_cpu(*data)); + size -= 8; + data++; + } +} + +void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents) +{ + struct ch_mem_range *m; + int size = nents * 64; + u64 *data; + int rc; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + DBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = ((hwtid)<<10) + rdev->rnic_info.rqt_base; + m->len = size; + DBG("%s RQT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + DBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + DBG("RQT %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid) +{ + struct ch_mem_range *m; + int size = TCB_SIZE; + u32 *data; + int rc; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + DBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_CM; + m->addr = hwtid * size; + m->len = size; + DBG("%s TCB %d len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + DBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u32 *)m->buf; + while (size > 0) { + printk("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", + m->addr, + *(data+2), *(data+3), *(data),*(data+1), + *(data+6), *(data+7), *(data+4), *(data+5)); + size -= 32; + data += 8; + m->addr += 32; + } + kfree(m); +} +EXPORT_SYMBOL(cxio_dump_tpt); +EXPORT_SYMBOL(cxio_dump_pbl); +EXPORT_SYMBOL(cxio_dump_wqe); +EXPORT_SYMBOL(cxio_dump_wce); +EXPORT_SYMBOL(cxio_dump_rqt); +EXPORT_SYMBOL(cxio_dump_tcb); +#endif From swise at opengridcomputing.com Fri Jun 23 07:30:00 2006 From: swise at 
opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:30:00 -0500 Subject: [openib-general] [PATCH v2 07/14] CXGB3 RDMA Core HAL Code. In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623143000.32410.17526.stgit@stevo-desktop> This code implements a HAL interface to the T3 hardware. --- drivers/infiniband/hw/cxgb3/core/cxio_hal.c | 1152 +++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/core/cxio_hal.h | 166 ++++ 2 files changed, 1318 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.c b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c new file mode 100644 index 0000000..e142e5f --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c @@ -0,0 +1,1152 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include "cxio_hal.h" +#include "sge_defs.h" +#include + +static struct cxio_rdev *rdev_tbl[T3_MAX_NUM_RNIC]; +static cxio_hal_ev_callback_func_t cxio_ev_cb = NULL; + +static inline struct cxio_rdev *cxio_hal_find_rdev_by_name(char *dev_name) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (rdev_tbl[i]) + if (!strcmp(rdev_tbl[i]->dev_name, dev_name)) + return rdev_tbl[i]; + return NULL; +} + +static inline struct cxio_rdev *cxio_hal_find_rdev_by_t3cdev(struct t3cdev + *tdev) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (rdev_tbl[i]) + if (rdev_tbl[i]->t3cdev_p == tdev) + return rdev_tbl[i]; + return NULL; +} + +static inline int cxio_hal_add_rdev(struct cxio_rdev *rdev_p) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) { + if (!rdev_tbl[i]) { + rdev_tbl[i] = rdev_p; + break; + } + } + return (i == T3_MAX_NUM_RNIC); +} + +static inline void cxio_hal_delete_rdev(struct cxio_rdev *rdev_p) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (rdev_tbl[i] == rdev_p) { + rdev_tbl[i] = NULL; + break; + } +} + +extern int cxio_hal_init_rhdl_resource(u32 nr_rhdl); +extern void cxio_hal_destroy_rhdl_resource(void); +extern int cxio_hal_init_resource(struct cxio_hal_resource **rscpp, + u32 nr_tpt, u32 nr_pbl, + u32 nr_rqt, u32 nr_qpid, u32 nr_cqid, + u32 nr_pdid); +extern u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp); +extern void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag); +extern u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp); +extern void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid); +extern u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp); +extern void 
cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid); +extern void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp); + +int cxio_hal_cq_op(struct cxio_rdev *rdev_p, struct t3_cq *cq, + enum t3_cq_opcode op, u32 credit) +{ + int ret; + struct t3_cqe *cqe; + u32 rptr; + + struct rdma_cq_op setup; + setup.id = cq->cqid; + setup.credits = (op == CQ_CREDIT_UPDATE) ? credit : 0; + setup.op = op; + ret = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_OP, &setup); + + if ((ret < 0) || (op == CQ_CREDIT_UPDATE)) + return ret; + + /* + * If the rearm returned an index other than our current index, + * then there might be CQE's in flight (being DMA'd). We must wait + * here for them to complete or the consumer can miss a notification. + */ + if (Q_PTR2IDX((cq->rptr), cq->size_log2) != ret) { + int i=0; + + rptr = cq->rptr; + + /* + * Keep the generation correct by bumping rptr until it + * matches the index returned by the rearm - 1. + */ + while (Q_PTR2IDX((rptr+1), cq->size_log2) != ret) + rptr++; + + /* + * Now rptr is the index for the (last) cqe that was + * in-flight at the time the HW rearmed the CQ. We + * spin until that CQE is valid. 
+ */ + cqe = cq->queue + Q_PTR2IDX(rptr, cq->size_log2); + while (!CQ_VLD_ENTRY(rptr, cq->size_log2, cqe)) { + udelay(1); + if (i++ > 1000000) { + BUG_ON(1); + printk(KERN_ERR "%s: stalled rnic\n", + rdev_p->dev_name); + return -EIO; + } + } + } + return 0; +} + +static inline int cxio_hal_clear_cq_ctx(struct cxio_rdev *rdev_p, u32 cqid) +{ + struct rdma_cq_setup setup; + setup.id = cqid; + setup.base_addr = 0; /* NULL address */ + setup.size = 0; /* disable the CQ */ + setup.credits = 0; + setup.credit_thres = 0; + setup.ovfl_mode = 0; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid) +{ + u64 sge_cmd; + struct t3_modify_qp_wr *wqe; + struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL); + if (!skb) { + DBG("failed in alloc_skb in cxio_hal_clear_qp_ctx\n"); + return -ENOMEM; + } + wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 0x3, 1, qpid, + 0x4); + sge_cmd = qpid << 8 | 3; + wqe->wrid.id1 = cpu_to_be64(sge_cmd); + wqe->ctx1 = 0ULL; + wqe->ctx0 = 0ULL; + skb->priority = CPL_PRIORITY_CONTROL; + return (t3c_send(rdev_p->t3cdev_p, skb)); +} + +int cxio_create_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) +{ + struct rdma_cq_setup setup; + int size = (1UL << (cq->size_log2)) * sizeof(struct t3_cqe); + + cq->cqid = cxio_hal_get_cqid(rdev_p->rscp); + if (!cq->cqid) + return -ENOMEM; + cq->sw_queue = kzalloc(size, GFP_KERNEL); + if (!cq->sw_queue) + return -ENOMEM; + cq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << (cq->size_log2)) * + sizeof(struct t3_cqe), + &(cq->dma_addr), GFP_KERNEL); + if (!cq->queue) { + kfree(cq->sw_queue); + return -ENOMEM; + } + pci_unmap_addr_set(cq, mapping, cq->dma_addr); + memset(cq->queue, 0, size); + setup.id = cq->cqid; + setup.base_addr = (u64) (cq->dma_addr); + setup.size = 1UL << cq->size_log2; + setup.credits = 65535; + setup.credit_thres = 1; + 
setup.ovfl_mode = 1; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +int cxio_resize_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) +{ + struct rdma_cq_setup setup; + setup.id = cq->cqid; + setup.base_addr = (u64) (cq->dma_addr); + setup.size = 1UL << cq->size_log2; + setup.credits = setup.size; + setup.credit_thres = setup.size; /* TBD: overflow recovery */ + setup.ovfl_mode = 1; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +int cxio_create_qp(struct cxio_rdev *rdev_p, u32 kernel_domain, + struct t3_wq *wq) +{ + int depth = 1UL << wq->size_log2; + wq->qpid = cxio_hal_get_qpid(rdev_p->rscp); + if (!wq->qpid) + return -ENOMEM; + + wq->rq = kzalloc(depth * sizeof(u64), GFP_KERNEL); + if (!wq->rq) { + cxio_hal_put_qpid(rdev_p->rscp, wq->qpid); + return -ENOMEM; + } + + wq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev), + depth * sizeof(union t3_wr), + &(wq->dma_addr), GFP_KERNEL); + if (!wq->queue) { + kfree(wq->rq); + cxio_hal_put_qpid(rdev_p->rscp, wq->qpid); + return -ENOMEM; + } + + pci_unmap_addr_set(wq, mapping, wq->dma_addr); +#ifdef USER_DOORBELL + if (kernel_domain) +#endif + wq->doorbell = rdev_p->rnic_info.kdb_addr; +#ifdef USER_DOORBELL + else + wq->doorbell = (void *)rdev_p->rnic_info.udbell_physbase + + (wq->qpid << PAGE_SHIFT); +#endif + return 0; +} + +int cxio_destroy_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) +{ + int err; + err = cxio_hal_clear_cq_ctx(rdev_p, cq->cqid); + kfree(cq->sw_queue); + dma_free_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << (cq->size_log2)) + * sizeof(struct t3_cqe), cq->queue, + pci_unmap_addr(cq, mapping)); + cxio_hal_put_cqid(rdev_p->rscp, cq->cqid); + return err; +} + +int cxio_destroy_qp(struct cxio_rdev *rdev_p, struct t3_wq *wq) +{ + int err; + err = cxio_hal_clear_qp_ctx(rdev_p, wq->qpid); + dma_free_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << (wq->size_log2)) + * sizeof(union t3_wr), wq->queue, + pci_unmap_addr(wq, 
mapping)); + kfree(wq->rq); + cxio_hal_put_qpid(rdev_p->rscp, wq->qpid); + return err; +} + +static void insert_recv_cqe(struct t3_wq *wq, struct t3_cq *cq) +{ + struct t3_cqe cqe; + + DBG("%s %d wq %p cq %p sw_rptr %x sw_wptr %x\n", __FUNCTION__, + __LINE__, wq, cq, cq->sw_rptr, cq->sw_wptr); + memset(&cqe, 0, sizeof(cqe)); + cqe.header = V_CQE_STATUS(1) | + V_CQE_OPCODE(T3_SEND) | + V_CQE_TYPE(0) | + V_CQE_SWCQE(1) | + V_CQE_QPID(wq->qpid) | + V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr, cq->size_log2)); + cqe.header = cpu_to_be32(cqe.header); + *(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe; + cq->sw_wptr++; +} + +void cxio_flush_rq(struct cxio_rdev *rdev_p, struct t3_wq *wq, + struct t3_cq *cq) +{ + u32 ptr; + + DBG("%s %d wq %p cq %p\n", __FUNCTION__, __LINE__, wq, cq); + + /* mark the wq in error so all CQEs will be completed as flushed */ + wq->error = 1; + + /* flush RQ */ + ptr = wq->rq_rptr; + while (ptr++ != wq->rq_wptr) { + insert_recv_cqe(wq, cq); + } +} + +static void insert_sq_cqe(struct t3_wq *wq, struct t3_cq *cq, union t3_wr *wr) +{ + struct t3_cqe cqe; + enum t3_rdma_opcode op; + + DBG("%s %d wq %p cq %p sw_rptr %x sw_wptr %x\n", __FUNCTION__, + __LINE__, wq, cq, cq->sw_rptr, cq->sw_wptr); + memset(&cqe, 0, sizeof(cqe)); + op = wr2opcode(G_FW_RIWR_OP(be32_to_cpu(wr->send.wrh.op_seop_flags))); + cqe.header = V_CQE_STATUS(1) | + V_CQE_OPCODE(op) | + V_CQE_TYPE(1) | + V_CQE_SWCQE(1) | + V_CQE_QPID(wq->qpid) | + V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr, cq->size_log2)); + cqe.header = cpu_to_be32(cqe.header); + *(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe; + CQE_WRID_SQ_WPTR(cqe) = wr->send.wrid.id0.hi; + CQE_WRID_WPTR(cqe) = wr->send.wrid.id0.low; + cq->sw_wptr++; +} + +void cxio_flush_sq(struct cxio_rdev *rdev_p, struct t3_wq *wq, + struct t3_cq *cq) +{ + u32 ptr; + union t3_wr *wr = wq->sq_oldest_wr; + + DBG("%s %d wq %p cq %p\n", __FUNCTION__, __LINE__, wq, cq); + + /* mark the wq in error so all CQEs will be completed as flushed 
*/ + wq->error = 1; + + /* flush SQ */ + ptr = wq->sq_rptr; + while (ptr++ != wq->sq_wptr) { + BUG_ON(!wr); + insert_sq_cqe(wq, cq, wr); + wr = next_sq_wr(wq); + + } +} + +static int cxio_hal_init_ctrl_cq(struct cxio_rdev *rdev_p) +{ + struct rdma_cq_setup setup; + setup.id = 0; + setup.base_addr = 0; /* NULL address */ + setup.size = 1; /* enable the CQ */ + setup.credits = 0; + + /* force SGE to redirect to RspQ and interrupt */ + setup.credit_thres = 0; + setup.ovfl_mode = 1; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p) +{ + int err; + u64 sge_cmd, ctx0, ctx1; + u64 base_addr; + struct t3_modify_qp_wr *wqe; + struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL); + if (!skb) { + DBG("failed in alloc_skb in init_ctrl_qp\n"); + return -ENOMEM; + } + err = cxio_hal_init_ctrl_cq(rdev_p); + if (err) { + DBG("err initializing ctrl_cq, err status = %d\n", err); + kfree_skb(skb); + return err; + } + rdev_p->ctrl_qp.workq = dma_alloc_coherent( + &(rdev_p->rnic_info.pdev->dev), + (1 << T3_CTRL_QP_SIZE_LOG2) * + sizeof(union t3_wr), + &(rdev_p->ctrl_qp.dma_addr), + GFP_KERNEL); + if (!rdev_p->ctrl_qp.workq) { + DBG("failed to allocate memory for ctrl QP\n"); + kfree_skb(skb); + return -ENOMEM; + } + pci_unmap_addr_set(&rdev_p->ctrl_qp, mapping, + rdev_p->ctrl_qp.dma_addr); + rdev_p->ctrl_qp.doorbell = rdev_p->rnic_info.kdb_addr; + memset(rdev_p->ctrl_qp.workq, 0, + (1 << T3_CTRL_QP_SIZE_LOG2) * sizeof(union t3_wr)); + + init_MUTEX(&rdev_p->ctrl_qp.sem); + init_waitqueue_head(&rdev_p->ctrl_qp.waitq); + + /* update HW Ctrl QP context */ + base_addr = rdev_p->ctrl_qp.dma_addr; + base_addr >>= 12; + ctx0 = (V_EC_SIZE((1 << T3_CTRL_QP_SIZE_LOG2)) | + V_EC_BASE_LO((u32) base_addr & 0xffff)); + ctx0 <<= 32; + ctx0 |= V_EC_CREDITS(FW_WR_NUM); + base_addr >>= 16; + ctx1 = (u32) base_addr; + base_addr >>= 32; + ctx1 |= ((u64) (V_EC_BASE_HI((u32) base_addr & 0xf) | V_EC_RESPQ(0) | + V_EC_TYPE(0) | V_EC_GEN(1) | + 
V_EC_UP_TOKEN(FW_RI_TID_START) | F_EC_VALID)) << 32; + wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 0x3, 1, + T3_CTRL_QP_ID, 0x4); + sge_cmd = (3ULL << 56) | FW_RI_SGEEC_START << 8 | 3; + wqe->wrid.id1 = cpu_to_be64(sge_cmd); + wqe->ctx1 = cpu_to_be64(ctx1); + wqe->ctx0 = cpu_to_be64(ctx0); + DBG("CtrlQP dma_addr=0x%llx kaddr=%p size=%d\n", + (u64) rdev_p->ctrl_qp.dma_addr, rdev_p->ctrl_qp.workq, + 1 << T3_CTRL_QP_SIZE_LOG2); + skb->priority = CPL_PRIORITY_CONTROL; + return (t3c_send(rdev_p->t3cdev_p, skb)); +} + +static int cxio_hal_destroy_ctrl_qp(struct cxio_rdev *rdev_p) +{ + dma_free_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << T3_CTRL_QP_SIZE_LOG2) + * sizeof(union t3_wr), rdev_p->ctrl_qp.workq, + pci_unmap_addr(&rdev_p->ctrl_qp, mapping)); + return cxio_hal_clear_qp_ctx(rdev_p, T3_CTRL_QP_ID); +} + +/* write len bytes of data into addr (32B aligned address) + * If data is NULL, clear len bytes of memory to zero. + * caller acquires the sem before the call + */ +static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr, + u32 len, void *data, int completion) +{ + u32 i, nr_wqe, copy_len; + u8 *copy_data; + u8 wr_len, utx_len; /* length in 8 byte flits */ + enum t3_wr_flags flag; + u64 *wqe; + u64 utx_cmd; + addr &= 0x7FFFFFF; + nr_wqe = len % 96 ? len / 96 + 1 : len / 96; /* 96B max per WQE */ + DBG("wptr=%d rptr=%d len=%d, nr_wqe=%d data=%p addr=0x%0x\n", + rdev_p->ctrl_qp.wptr, rdev_p->ctrl_qp.rptr, len, nr_wqe, data, + addr); + utx_len = 3; /* in 32B unit */ + for (i = 0; i < nr_wqe; i++) { + if (Q_FULL(rdev_p->ctrl_qp.rptr, rdev_p->ctrl_qp.wptr, + T3_CTRL_QP_SIZE_LOG2)) { + DBG("ctrl_qp full wptr=0x%0x rptr=0x%0x, " + "wait for more space i=%d\n", rdev_p->ctrl_qp.wptr, + rdev_p->ctrl_qp.rptr, i); + if (wait_event_interruptible(rdev_p->ctrl_qp.waitq, + !Q_FULL(rdev_p->ctrl_qp. + rptr, + rdev_p->ctrl_qp. 
+ wptr, + T3_CTRL_QP_SIZE_LOG2))) { + DBG("ctrl_qp workq wakeup due to interrupt\n"); + return -ERESTARTSYS; + } + DBG("ctrl_qp wakeup, continue posting work request " + "i=%d\n", i); + } + wqe = (u64 *) (rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr % + (1 << T3_CTRL_QP_SIZE_LOG2))); + flag = 0; + if (i == (nr_wqe - 1)) { + /* last WQE */ + flag = completion ? T3_COMPLETION_FLAG : 0; + if (len % 32) + utx_len = len / 32 + 1; + else + utx_len = len / 32; + } + + /* + * Force a CQE to return the credit to the workq in case + * we posted more than half the max QP size of WRs + */ + if ((i != 0) && + (i % (((1 << T3_CTRL_QP_SIZE_LOG2)) >> 1) == 0)) { + flag = T3_COMPLETION_FLAG; + DBG("force a completion at i=%d\n", i); + } + + /* build the utx mem command */ + wqe += (sizeof(struct t3_bypass_wr) >> 3); + utx_cmd = (T3_UTX_MEM_WRITE << 28) | (addr + i * 3); + utx_cmd <<= 32; + utx_cmd |= (utx_len << 28) | ((utx_len << 2) + 1); + *wqe = cpu_to_be64(utx_cmd); + wqe++; + copy_data = (u8 *) data + i * 96; + copy_len = len > 96 ? 
96 : len; + + /* clear memory content if data is NULL */ + if (data) + memcpy(wqe, copy_data, copy_len); + else + memset(wqe, 0, copy_len); + if (copy_len % 32) + memset(((u8 *) wqe) + copy_len, 0, + 32 - (copy_len % 32)); + wr_len = ((sizeof(struct t3_bypass_wr)) >> 3) + 1 + + (utx_len << 2); + wqe = (u64 *)(rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr % + (1 << T3_CTRL_QP_SIZE_LOG2))); + + /* wptr in the WRID[31:0] */ + *(wqe + 1) = cpu_to_be64((u64) rdev_p->ctrl_qp.wptr); + + /* + * This must be the last write with a memory barrier + * for the genbit + */ + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_BP, flag, + Q_GENBIT(rdev_p->ctrl_qp.wptr, + T3_CTRL_QP_SIZE_LOG2), T3_CTRL_QP_ID, + wr_len); + if (flag == T3_COMPLETION_FLAG) + RING_DOORBELL(rdev_p->ctrl_qp.doorbell, T3_CTRL_QP_ID); + len -= 96; + rdev_p->ctrl_qp.wptr++; + } + return 0; +} + +/* IN: stag key, pdid, perm, zbva, to, len, page_size, pbl, and pbl_size + * OUT: stag index, actual pbl_size, pbl_addr allocated. + * TBD: shared memory region support + */ +static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry, + u32 * stag, u8 stag_state, u32 pdid, + enum tpt_mem_type type, enum tpt_mem_perm perm, + u32 zbva, u64 to, u32 len, u8 page_size, u64 * pbl, + u32 * pbl_size, u32 * pbl_addr) +{ + int err; + struct tpt_entry tpt; + u32 stag_idx; + u32 wptr; + u32 pbl_size_save; + stag_state = stag_state > 0; + stag_idx = (*stag) >> 8; + pbl_size_save = reset_tpt_entry ? 0 : *pbl_size; + if ((!reset_tpt_entry) && !(*stag != T3_STAG_UNSET)) { + stag_idx = cxio_hal_get_stag(rdev_p->rscp); + if (!stag_idx) + return -ENOMEM; + *stag = (stag_idx << 8) | ((*stag) & 0xFF); + } + DBG("stag_state=%0x type=%0x pdid=%0x, stag_idx = 0x%x`\n", + stag_state, type, pdid, stag_idx); + + + /* allocate pbl entries if requested size >0 */ + if (pbl_size_save) { + + /* + * TBD: pbl resource management. + * For now, give each stag a 2KB pbl region, i.e. 
256 pages + */ + if ((*pbl_size) > 256) { + DBG("TBD: PBL allocation failure: fixed 256 entries " + "for now\n"); + return -ENOMEM; + } + *pbl_addr = (stag_idx << 8); + + /* update the actual pbl_size allocated */ + *pbl_size = 256; + } + down_interruptible(&rdev_p->ctrl_qp.sem); + + /* write PBL first if any - update pbl only if pbl list exists */ + if (pbl) { + + DBG("*pbl_addr %x, pbl_base %x, pbl_size_save %d\n", + *pbl_addr, rdev_p->rnic_info.pbl_base, pbl_size_save); + err = cxio_hal_ctrl_qp_write_mem(rdev_p, ((*pbl_addr) >> 2) + + (rdev_p->rnic_info.pbl_base >> 5), + (pbl_size_save << 3), pbl, 0); + if (err) + goto ret; + } + + /* write TPT entry */ + if (reset_tpt_entry) { + memset(&tpt, 0, sizeof(tpt)); + } else { + tpt.valid_stag_pdid = cpu_to_be32(F_TPT_VALID | + V_TPT_STAG_KEY((*stag) & M_TPT_STAG_KEY) | + V_TPT_STAG_STATE(stag_state) | + V_TPT_STAG_TYPE(type) | V_TPT_PDID(pdid)); + BUG_ON(page_size >= 28); + tpt.flags_pagesize_qpid = cpu_to_be32(V_TPT_PERM(perm) | + F_TPT_MW_BIND_ENABLE | + V_TPT_ADDR_TYPE((zbva ? TPT_ZBTO : TPT_VATO)) | + V_TPT_PAGE_SIZE(page_size)); + tpt.rsvd_pbl_addr = pbl_size_save ? + cpu_to_be32(V_TPT_PBL_ADDR(*pbl_addr)) : 0; + tpt.len = cpu_to_be32(len); + tpt.va_hi = cpu_to_be32((u32) (to >> 32)); + tpt.va_low_or_fbo = cpu_to_be32((u32) (to & 0xFFFFFFFFULL)); + tpt.rsvd_bind_cnt_or_pstag = 0; + tpt.rsvd_pbl_size = pbl_size_save ? + cpu_to_be32(V_TPT_PBL_SIZE((*pbl_size) >> 2)) : 0; + } + err = cxio_hal_ctrl_qp_write_mem(rdev_p, + stag_idx + + (rdev_p->rnic_info.tpt_base >> 5), + sizeof(tpt), &tpt, 1); + + /* release the stag index to free pool */ + if (reset_tpt_entry) + cxio_hal_put_stag(rdev_p->rscp, stag_idx); +ret: + wptr = rdev_p->ctrl_qp.wptr; + up(&rdev_p->ctrl_qp.sem); + if (!err) { + if (wait_event_interruptible(rdev_p->ctrl_qp.waitq, + SEQ32_GE(rdev_p->ctrl_qp.rptr, + wptr))) + return -ERESTARTSYS; + } + return err; +} + +/* IN : stag key, pdid, pbl_size + * Out: stag index, actual pbl_size, and pbl_addr allocated.
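A note for reviewers on the stag layout used throughout __cxio_tpt_op: the HAL-allocated index lives in the upper 24 bits of the stag and the caller's 8-bit key in the low byte, and for now each index also gets a fixed 256-entry (2KB) PBL region at stag_idx << 8. A stand-alone sketch of that arithmetic (helper names are illustrative, not from the patch):

```c
#include <stdint.h>

/* Illustrative helpers mirroring __cxio_tpt_op's stag handling:
 * *stag = (stag_idx << 8) | ((*stag) & 0xFF) packs index and key. */
static uint32_t make_stag(uint32_t stag_idx, uint8_t key)
{
	return (stag_idx << 8) | key;
}

/* Fixed 256-entry PBL region per index: pbl_addr = stag_idx << 8. */
static uint32_t stag_to_pbl_addr(uint32_t stag)
{
	return (stag >> 8) << 8;
}

/* One u64 page address per PBL entry, so 256 entries = 2KB. */
static uint32_t pbl_bytes(uint32_t pbl_entries)
{
	return pbl_entries * 8;
}
```

The same stag >> 8 recovery of the index is what cxio_dump_tpt in the debug patch uses to locate a TPT entry.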
+ */ +int cxio_allocate_stag(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr) +{ + *stag = T3_STAG_UNSET; + return (__cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_NON_SHARED_MR, + perm, 0, 0ULL, 0, 0, NULL, pbl_size, pbl_addr)); +} + +int cxio_register_phys_mem(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, u64 * pbl, u32 * pbl_size, + u32 * pbl_addr) +{ + *stag = T3_STAG_UNSET; + return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm, + zbva, to, len, page_size, pbl, pbl_size, pbl_addr); +} + +int cxio_reregister_phys_mem(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, u64 * pbl, u32 * pbl_size, + u32 * pbl_addr) +{ + return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm, + zbva, to, len, page_size, pbl, pbl_size, pbl_addr); +} + +int cxio_dereg_mem(struct cxio_rdev *rdev_p, u32 stag) +{ + /* TBD: check if there is any MW bound to the MR */ + return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL, + NULL, NULL); +} + +int cxio_allocate_window(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid) +{ + u32 pbl_size = 0; + *stag = T3_STAG_UNSET; + return __cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_MW, 0, 0, 0ULL, 0, 0, + NULL, &pbl_size, NULL); +} + +int cxio_deallocate_window(struct cxio_rdev *rdev_p, u32 stag) +{ + return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL, + NULL, NULL); +} + +int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr) +{ + struct t3_rdma_init_wr *wqe; + struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_ATOMIC); + if (!skb) + return -ENOMEM; + DBG("%s %d\n", __FUNCTION__, __LINE__); + wqe = (struct t3_rdma_init_wr *) __skb_put(skb, sizeof(*wqe)); + wqe->wrh.op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(T3_WR_INIT)); + wqe->wrh.gen_tid_len = 
cpu_to_be32(V_FW_RIWR_TID(attr->tid) | + V_FW_RIWR_LEN(sizeof(*wqe) >> 3)); + wqe->wrid.id1 = 0; + wqe->qpid = cpu_to_be32(attr->qpid); + wqe->pdid = cpu_to_be32(attr->pdid); + wqe->scqid = cpu_to_be32(attr->scqid); + wqe->rcqid = cpu_to_be32(attr->rcqid); + wqe->rq_addr = cpu_to_be32(attr->rq_addr); + wqe->rq_size = cpu_to_be32(attr->rq_size); + wqe->mpaattrs = attr->mpaattrs; + wqe->qpcaps = attr->qpcaps; + wqe->ulpdu_size = cpu_to_be16(attr->tcp_emss); + wqe->rqes_posted = cpu_to_be32(attr->rqes_posted); + wqe->ord = cpu_to_be32(attr->ord); + wqe->ird = cpu_to_be32(attr->ird); + wqe->qp_dma_addr = cpu_to_be64(attr->qp_dma_addr); + wqe->qp_dma_size = cpu_to_be32(attr->qp_dma_size); + wqe->rsvd = 0; + skb->priority = 0; /* 0=>ToeQ; 1=>CtrlQ */ + return (t3c_send(rdev_p->t3cdev_p, skb)); +} + +void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb) +{ + cxio_ev_cb = ev_cb; +} + +void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb) +{ + cxio_ev_cb = NULL; +} + +static int cxio_hal_ev_handler(struct t3cdev *t3cdev_p, struct sk_buff *skb) +{ + static int cnt; + struct cxio_rdev *rdev_p = NULL; + struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) skb->data; + DBG("%d: cxio_hal_ev_handler being called for CQ_ID(%d), " + "overflow=%0x, notify=%0x with CQE:\n", cnt, + be16_to_cpu(rsp_msg->cq_id), rsp_msg->cq_overflow, + rsp_msg->cq_notify); + DBG("QPID=%0x genbit=%0x type=%0x Status=%0x opcode=%0x " + "len=%0x wrid_hi_stag=%x wrid_low_msn=%x\n", + CQE_QPID(rsp_msg->cqe), CQE_GENBIT(rsp_msg->cqe), + CQE_TYPE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe), + CQE_OPCODE(rsp_msg->cqe), CQE_LEN(rsp_msg->cqe), + CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); + rdev_p = (struct cxio_rdev *)t3cdev_p->ulp; + if (!rdev_p) { + DBG("cxio_hal_ev_handler called by t3cdev (%p) with null!\n", + t3cdev_p); + return 0; + } + if (CQE_QPID(rsp_msg->cqe) == T3_CTRL_QP_ID) { + rdev_p->ctrl_qp.rptr = CQE_WRID_LOW(rsp_msg->cqe) + 1; + 
wake_up_interruptible(&rdev_p->ctrl_qp.waitq); + dev_kfree_skb_irq(skb); + } else if (cxio_ev_cb) { + (*cxio_ev_cb) (rdev_p, skb); + } else { + dev_kfree_skb_irq(skb); + } + DBG("ev call back wptr=%d rptr=%d\n", rdev_p->ctrl_qp.wptr, + rdev_p->ctrl_qp.rptr); + cnt++; + return 0; +} + +/* Caller takes care of locking if needed */ +int cxio_rdev_open(struct cxio_rdev *rdev_p) +{ + struct net_device *netdev_p = NULL; + int err = 0; + if (strlen(rdev_p->dev_name)) { + if (cxio_hal_find_rdev_by_name(rdev_p->dev_name)) { + return -EBUSY; + } + netdev_p = dev_get_by_name(rdev_p->dev_name); + if (!netdev_p) { + DBG("dev_get_by_name(%s) failed\n", rdev_p->dev_name); + return -EINVAL; + } + dev_put(netdev_p); + } else if (rdev_p->t3cdev_p) { + if (cxio_hal_find_rdev_by_t3cdev(rdev_p->t3cdev_p)) { + return -EBUSY; + } + netdev_p = rdev_p->t3cdev_p->lldev; + strncpy(rdev_p->dev_name, rdev_p->t3cdev_p->name, + T3_MAX_DEV_NAME_LEN); + } else { + DBG("t3cdev_p or dev_name must be set\n"); + return -EINVAL; + } + + if (cxio_hal_add_rdev(rdev_p)) { + DBG("max number of RNIC supported exceeded\n"); + return -ENOMEM; + } + + DBG("opening rnic dev %s\n", rdev_p->dev_name); + memset(&rdev_p->ctrl_qp, 0, sizeof(rdev_p->ctrl_qp)); + if (!rdev_p->t3cdev_p) + rdev_p->t3cdev_p = T3CDEV(netdev_p); + rdev_p->t3cdev_p->ulp = (void *) rdev_p; + err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_GET_PARAMS, + &(rdev_p->rnic_info)); + if (err) { + printk("%s t3cdev_p(%p)->ctl returned error %d.\n", + __FUNCTION__, rdev_p->t3cdev_p, err); + goto err1; + } + DBG("rnic %s info: tpt_base=0x%0x tpt_top=0x%0x pbl_base=0x%0x " + "pbl_top=0x%0x rqt_base=0x%0x, rqt_top=0x%0x\n", + rdev_p->dev_name, rdev_p->rnic_info.tpt_base, + rdev_p->rnic_info.tpt_top, rdev_p->rnic_info.pbl_base, + rdev_p->rnic_info.pbl_top, rdev_p->rnic_info.rqt_base, + rdev_p->rnic_info.rqt_top); + DBG("udbell_len=0x%0x udbell_physbase=0x%lx " + "kdb_addr=%p\n", rdev_p->rnic_info.udbell_len, + rdev_p->rnic_info.udbell_physbase, 
rdev_p->rnic_info.kdb_addr); + + err = cxio_hal_init_ctrl_qp(rdev_p); + if (err) { + printk("%s error %d initializing ctrl_qp.\n", + __FUNCTION__, err); + goto err1; + } + err = cxio_hal_init_resource(&rdev_p->rscp, T3_MAX_NUM_STAG, 0, + 0, T3_MAX_NUM_QP, T3_MAX_NUM_CQ, + T3_MAX_NUM_PD); + if (err) { + printk(KERN_ERR "%s error %d initializing hal resources.\n", + __FUNCTION__, err); + goto err2; + } + return 0; +err2: + cxio_hal_destroy_ctrl_qp(rdev_p); +err1: + cxio_hal_delete_rdev(rdev_p); + return err; +} + +void cxio_rdev_close(struct cxio_rdev *rdev_p) +{ + if (rdev_p) { + cxio_hal_delete_rdev(rdev_p); + rdev_p->t3cdev_p->ulp = NULL; + cxio_hal_destroy_ctrl_qp(rdev_p); + cxio_hal_destroy_resource(rdev_p->rscp); + } +} + +int __init cxio_hal_init(void) +{ + if (cxio_hal_init_rhdl_resource(T3_MAX_NUM_RI)) + return -ENOMEM; + memset(rdev_tbl, 0, T3_MAX_NUM_RNIC * sizeof(void *)); + t3_register_cpl_handler(CPL_ASYNC_NOTIF, cxio_hal_ev_handler); + return 0; +} + +void __exit cxio_hal_exit(void) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) { + cxio_rdev_close(rdev_tbl[i]); + } + cxio_hal_destroy_rhdl_resource(); +} + +int cxio_peek_cq(struct t3_wq *wq, struct t3_cq *cq, int cqe_opcode) +{ + struct t3_cqe *peek_cqe; + u32 peekptr; + + peekptr = cq->rptr; + peek_cqe = cq->queue + Q_PTR2IDX(peekptr, cq->size_log2); + + /* + * see if the cqe with the requested opcode is here already. 
+ */ + while (CQ_VLD_ENTRY(peekptr, cq->size_log2, peek_cqe)) { + if ((RQ_TYPE(*peek_cqe)) && + (CQE_OPCODE(*peek_cqe) == cqe_opcode) && + (CQE_QPID(*peek_cqe) == wq->qpid)) { + return 0; + } else { + ++(peekptr); + peek_cqe = cq->queue + + Q_PTR2IDX(peekptr, cq->size_log2); + } + if (peekptr == cq->rptr) { /* CQ full */ + /* Don't handle error here */ + /* Don't reset timer */ + return 0; + } + } + + /* + * The opcode was not found + */ + return -EAGAIN; +} + +static inline void create_read_req_cqe(struct t3_rdma_read_wr *wr, + struct t3_cqe *response_cqe, + struct t3_cqe *read_cqe) +{ + DBG("%s %d enter\n", __FUNCTION__, __LINE__); + + /* + * Now that we found the read response cqe, + * we build a proper read request sq cqe to + * return to the user, using the read request WR + * and bits of the read response cqe. + */ + read_cqe->header = + V_CQE_STATUS(CQE_STATUS(*response_cqe)) | + V_CQE_OPCODE(T3_READ_REQ) | + V_CQE_TYPE(1) | + V_CQE_QPID(CQE_QPID(*response_cqe)); + read_cqe->header = cpu_to_be32(read_cqe->header); + CQE_WRID_SQ_WPTR(*read_cqe) = wr->wrid.id0.hi; + CQE_WRID_WPTR(*read_cqe) = wr->wrid.id0.low; + read_cqe->len = wr->local_len; /* XXX Violates RDMAC but matches IB */ +} + +/* + * Slow path poll code. + */ +int __cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, + struct t3_cqe *cqe, u8 * cqe_flushed, + u64 * cookie, u32 * credit) +{ + int ret = 0; + struct t3_cqe *rd_cqe, *peek_cqe, read_cqe; + u32 peekptr; + + rd_cqe = cxio_next_cqe(cq); + + BUG_ON(!rd_cqe); + + /* + * skip cqe's not affiliated with a QP. + */ + if (wq == NULL) { + ret = -1; + goto skip_cqe; + } + + /* + * If this CQE was already returned (out of order completion) + * then silently toss it. 
+ */ + if (CQE_OPCODE(*rd_cqe) == T3_READ_RESP && + (!wq->sq_oldest_wr || + (wq->sq_oldest_wr->send.rdmaop != T3_READ_REQ))) { + DBG("%s %d dropping old read response cqe\n", + __FUNCTION__, __LINE__); + ret = -1; + goto skip_cqe; + } + + if (CQE_OPCODE(*rd_cqe) == T3_TERMINATE) { + ret = -1; + wq->error = 1; + goto skip_cqe; + } + + if (CQE_STATUS(*rd_cqe) || wq->error) { + ret = 0; + *cqe_flushed = wq->error; + wq->error = 1; + + /* + * T3A inserts errors into the CQE. We cannot return + * these as work completions. + */ + /* incoming write failures */ + if ((CQE_OPCODE(*rd_cqe) == T3_RDMA_WRITE) + && RQ_TYPE(*rd_cqe)) { + ret = -1; + goto skip_cqe; + } + /* incoming read request failures */ + if ((CQE_OPCODE(*rd_cqe) == T3_READ_RESP) && SQ_TYPE(*rd_cqe)) { + ret = -1; + goto skip_cqe; + } + + /* incoming SEND with no receive posted failures */ + if ((CQE_OPCODE(*rd_cqe) == T3_SEND) && RQ_TYPE(*rd_cqe) && + Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) { + ret = -1; + goto skip_cqe; + } + goto proc_cqe; + } + + /* + * If this WQ's oldest pending SQ WR is a read request, then we + * must try and find the RQ Read Response which might not + * be the next CQE for that WQ on the CQ (reads can complete + * out of order). If its not in the CQ yet, then we must return + * "empty". This ensures we don't complete a subsequent WR + * out of order... + */ + + /* + * XXX This stalls the CQ for all QPs. Need to redesign this later + * to only stall the WQ in question. + */ + if (wq->sq_oldest_wr && + (wq->sq_oldest_wr->send.rdmaop == T3_READ_REQ)) { + DBG("%s %d oldest wr is read!\n", __FUNCTION__, __LINE__); + peekptr = cq->rptr; + peek_cqe = cq->queue + Q_PTR2IDX(peekptr, cq->size_log2); + + /* + * see if the read response is here already. 
+ */ + while (CQ_VLD_ENTRY(peekptr, cq->size_log2, peek_cqe)) { + if ((RQ_TYPE(*peek_cqe)) && + (CQE_OPCODE(*peek_cqe) == T3_READ_RESP) && + (CQE_QPID(*peek_cqe) == wq->qpid)) { + create_read_req_cqe(&wq->sq_oldest_wr->read, + peek_cqe, &read_cqe); + rd_cqe = &read_cqe; + ret = 0; + goto proc_cqe; + } else { + ++peekptr; + peek_cqe = cq->queue + + Q_PTR2IDX(peekptr, cq->size_log2); + } + if (peekptr == cq->rptr) { /* CQ full */ + wq->error = 1; + *cqe_flushed = 1; + ret = 0; + goto proc_cqe; + } + } + + /* + * The read response hasn't happened, so we cannot return + * any other completion event for this WQ. + */ + ret = -1; + goto ret_cqe; + } + + /* + * HW only validates 4 bits of MSN. So we must validate that + * the MSN in the SEND is the next expected MSN. If its not, + * then we complete this with TPT_ERR_MSN and mark the wq in error. + */ + if (RQ_TYPE(*rd_cqe) && (CQE_WRID_MSN(*rd_cqe) != (wq->rq_rptr + 1))) { + ret = 0; + wq->error = 1; + (*rd_cqe).header = cpu_to_be32(cpu_to_be32((*rd_cqe).header) | + V_CQE_STATUS(TPT_ERR_MSN)); + goto proc_cqe; + } + +proc_cqe: + *cqe = *rd_cqe; + + /* + * Reap the associated WR(s) that are freed up with this + * completion. + */ + if (SQ_TYPE(*rd_cqe)) { + BUG_ON(!wq->sq_oldest_wr); + wq->sq_rptr = CQE_WRID_SQ_WPTR(*rd_cqe) + 1; + BUG_ON((wq->sq_oldest_wr-wq->queue) != + Q_PTR2IDX(CQE_WRID_WPTR(*rd_cqe), wq->size_log2)); + *cookie = wq->queue[Q_PTR2IDX(CQE_WRID_WPTR(*rd_cqe), + wq->size_log2) + ].flit[T3_SQ_COOKIE_FLIT]; + wq->sq_oldest_wr = next_sq_wr(wq); + } else { + *cookie = wq->rq[Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2)]; + ++(wq->rq_rptr); + } + + /* If we created a READ_REQ CQE, don't skip this one */ + if (rd_cqe == &read_cqe) + goto ret_cqe; +skip_cqe: + if (SW_CQE(*rd_cqe)) { + DBG("skip sw cqe sw_rptr %x\n", cq->sw_rptr); + ++cq->sw_rptr; + } else { + DBG("cq %p cqid %d skip hw cqe rptr %x\n", cq, cq->cqid, + cq->rptr); + ++cq->rptr; + + /* + * compute credits. 
+ */ + if (((cq->rptr - cq->wptr) > (1 << (cq->size_log2 - 1))) + || ((cq->rptr - cq->wptr) >= 128)) { + *credit = cq->rptr - cq->wptr; + cq->wptr = cq->rptr; + } + } + +ret_cqe: + return ret; +} + +EXPORT_SYMBOL(__cxio_poll_cq); +EXPORT_SYMBOL(cxio_peek_cq); +EXPORT_SYMBOL(cxio_hal_cq_op); +EXPORT_SYMBOL(cxio_hal_clear_qp_ctx); +EXPORT_SYMBOL(cxio_create_cq); +EXPORT_SYMBOL(cxio_destroy_cq); +EXPORT_SYMBOL(cxio_resize_cq); +EXPORT_SYMBOL(cxio_create_qp); +EXPORT_SYMBOL(cxio_destroy_qp); +EXPORT_SYMBOL(cxio_allocate_stag); +EXPORT_SYMBOL(cxio_register_phys_mem); +EXPORT_SYMBOL(cxio_reregister_phys_mem); +EXPORT_SYMBOL(cxio_dereg_mem); +EXPORT_SYMBOL(cxio_allocate_window); +EXPORT_SYMBOL(cxio_deallocate_window); +EXPORT_SYMBOL(cxio_rdma_init); +EXPORT_SYMBOL(cxio_hal_get_rhdl); +EXPORT_SYMBOL(cxio_hal_put_rhdl); +EXPORT_SYMBOL(cxio_hal_get_pdid); +EXPORT_SYMBOL(cxio_hal_put_pdid); +EXPORT_SYMBOL(cxio_register_ev_cb); +EXPORT_SYMBOL(cxio_unregister_ev_cb); +EXPORT_SYMBOL(cxio_rdev_open); +EXPORT_SYMBOL(cxio_rdev_close); diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.h b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h new file mode 100644 index 0000000..37db2b5 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h @@ -0,0 +1,166 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __CXIO_HAL_H__ +#define __CXIO_HAL_H__ + +#include "t3_cpl.h" +#include "defs.h" +#include "t3cdev.h" +#include "cxgb3_ctl_defs.h" +#include "cxio_wr.h" + +#define T3_CTRL_QP_ID FW_RI_SGEEC_START +#define T3_CTL_QP_TID FW_RI_TID_START +#define T3_CTRL_QP_SIZE_LOG2 10 +#define T3_CTRL_CQ_ID 0 + +/* TBD */ +#define T3_MAX_NUM_RNIC 8 +#define T3_MAX_NUM_RI (1<<15) +#define T3_MAX_NUM_QP (1<<15) +#define T3_MAX_NUM_CQ (1<<15) +#define T3_MAX_NUM_PD (1<<15) +#define T3_MAX_NUM_STAG (1<<13) +#define T3_MAX_PBL_SIZE 256 +#define T3_MAX_RQ_SIZE 1024 + +#define T3_STAG_UNSET 0xffffffff + +#define T3_MAX_DEV_NAME_LEN 32 + +struct cxio_hal_ctrl_qp { + u32 wptr; + u32 rptr; + struct semaphore sem; /* for the wtpr, can sleep */ + wait_queue_head_t waitq; /* wait for RspQ/CQE msg */ + union t3_wr *workq; /* the work request queue */ + dma_addr_t dma_addr; /* pci bus address of the workq */ + DECLARE_PCI_UNMAP_ADDR(mapping) + void __iomem *doorbell; +}; + +struct cxio_hal_resource { + struct kfifo *tpt_fifo; + spinlock_t tpt_fifo_lock; + struct kfifo *qpid_fifo; + spinlock_t qpid_fifo_lock; + struct kfifo *cqid_fifo; + spinlock_t cqid_fifo_lock; + struct kfifo *pdid_fifo; + spinlock_t pdid_fifo_lock; +}; + +struct cxio_rdev { + char dev_name[T3_MAX_DEV_NAME_LEN]; + struct t3cdev *t3cdev_p; + struct rdma_info rnic_info; + struct cxio_hal_resource *rscp; + struct cxio_hal_ctrl_qp ctrl_qp; + void *ulp; +}; + +typedef void (*cxio_hal_ev_callback_func_t) (struct cxio_rdev * rdev_p, + struct sk_buff * skb); + +struct respQ_msg_t { + u32 opaque0:32; + u32 opaque1:8; + u32 cq_overflow:1; /* bit 16 */ + u32 opaque2:7; + u32 opaque3:16; + + u32 opaque4:2; + u32 cq_notify:1; /* bit 58 */ + u32 opaque5:5; + u32 opaque6:24; + u32 opaque7:16; + u32 cq_id:16; /* bit [15:0] */ + + struct t3_cqe cqe; +}; + +enum t3_cq_opcode { + CQ_ARM_AN = 0x2, + CQ_ARM_SE = 0x6, + CQ_FORCE_AN = 0x3, + CQ_CREDIT_UPDATE = 0x7 +}; + +int cxio_rdev_open(struct cxio_rdev *rdev); +void 
cxio_rdev_close(struct cxio_rdev *rdev); +int cxio_hal_cq_op(struct cxio_rdev *rdev, struct t3_cq *cq, + enum t3_cq_opcode op, u32 credit); +int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev, u32 qpid); +int cxio_create_cq(struct cxio_rdev *rdev, struct t3_cq *cq); +int cxio_destroy_cq(struct cxio_rdev *rdev, struct t3_cq *cq); +int cxio_resize_cq(struct cxio_rdev *rdev, struct t3_cq *cq); +int cxio_create_qp(struct cxio_rdev *rdev, u32 kernel_domain, struct t3_wq *wq); +int cxio_destroy_qp(struct cxio_rdev *rdev, struct t3_wq *wq); +int cxio_peek_cq(struct t3_wq *wr, struct t3_cq *cq, int opcode); +int cxio_allocate_stag(struct cxio_rdev *rdev, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr); +int cxio_register_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, u64 * pbl, u32 * pbl_size, + u32 * pbl_addr); +int cxio_reregister_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, u64 * pbl, u32 * pbl_size, + u32 * pbl_addr); +int cxio_dereg_mem(struct cxio_rdev *rdev, u32 stag); +int cxio_allocate_window(struct cxio_rdev *rdev, u32 * stag, u32 pdid); +int cxio_deallocate_window(struct cxio_rdev *rdev, u32 stag); +int cxio_rdma_init(struct cxio_rdev *rdev, struct t3_rdma_init_attr *attr); +void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb); +void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb); +u32 cxio_hal_get_rhdl(void); +void cxio_hal_put_rhdl(u32 rhdl); +u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp); +void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid); +int __init cxio_hal_init(void); +void __exit cxio_hal_exit(void); +void cxio_flush_rq(struct cxio_rdev *dev, struct t3_wq *wq, struct t3_cq *cq); +void cxio_flush_sq(struct cxio_rdev *dev, struct t3_wq *wq, struct t3_cq *cq); + +#define DBG(fmt, args...) 
pr_debug("iw_cxgb3: " fmt, ## args) + +#ifdef DEBUG +void cxio_dump_tpt(struct cxio_rdev *rev, u32 stag); +void cxio_dump_pbl(struct cxio_rdev *rev, u32 pbl_addr, uint len, u8 shift); +void cxio_dump_wqe(union t3_wr *wqe); +void cxio_dump_wce(struct t3_cqe *wce); +void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents); +void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid); +#endif + +#endif From swise at opengridcomputing.com Fri Jun 23 07:30:05 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:30:05 -0500 Subject: [openib-general] [PATCH v2 08/14] CXGB3 RDMA Core Resource Allocation In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623143005.32410.4680.stgit@stevo-desktop> This patch implements resource allocation services for assigning unique IDs to the various objects. ISSUE: - this uses kfifos to manage what is basically a list of numbers to dish out as QPIDs, CQIDs, STAGs. A bitmap would be more efficient memory-wise, but there is an issue with STAG indices: They are supposed to be random. This code randomizes the stag kfifo. --- drivers/infiniband/hw/cxgb3/core/cxio_resource.c | 255 ++++++++++++++++++++++ 1 files changed, 255 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.c b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c new file mode 100644 index 0000000..8c8bfb5 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c @@ -0,0 +1,255 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +/* Crude resource management */ +#include +#include +#include +#include +#include +#include +#include "cxio_hal.h" + +static struct kfifo *rhdl_fifo; +static spinlock_t rhdl_fifo_lock; + +#define RANDOM_SIZE 16 + + +/* Loosely based on the Mersenne twister algorithm */ +static u32 next_random(u32 rand) +{ + u32 y, ylast; + + y = rand; + ylast = y; + y = (y * 69069) & 0xffffffff; + y = (y & 0x80000000) + (ylast & 0x7fffffff); + if ((y & 1)) + y = ylast ^ (y > 1) ^ (2567483615UL); + else + y = ylast ^ (y > 1); + y = y ^ (y >> 11); + y = y ^ ((y >> 7) & 2636928640UL); + y = y ^ ((y >> 15) & 4022730752UL); + y = y ^ (y << 18); + return y; +} +static int __cxio_init_resource_fifo(struct kfifo **fifo, + spinlock_t *fifo_lock, + u32 nr, u32 skip_low, + u32 skip_high, + int random) +{ + u32 i, j, entry = 0, idx; + u32 random_bytes; + u32 rarray[16]; + spin_lock_init(fifo_lock); + + *fifo = kfifo_alloc(nr * sizeof(u32), GFP_KERNEL, fifo_lock); + if (IS_ERR(*fifo)) + return -ENOMEM; + + for (i = 0; i < skip_low + skip_high; i++) + __kfifo_put(*fifo, (unsigned char *) &entry, sizeof(u32)); + if (random) { + j = 0; + get_random_bytes(&random_bytes,sizeof(random_bytes)); + for (i = 0; i < RANDOM_SIZE; i++) + rarray[i] = i + skip_low; + for (i = skip_low + RANDOM_SIZE; i < nr - skip_high; i++) { + if (j >= RANDOM_SIZE) { + j = 0; + random_bytes = next_random(random_bytes); + } + idx = (random_bytes >> (j * 2)) & 0xF; + __kfifo_put(*fifo, + (unsigned char *) &rarray[idx], + sizeof(u32)); + rarray[idx] = i; + j++; + } + for (i = 0; i < RANDOM_SIZE; i++) + __kfifo_put(*fifo, + (unsigned char *) &rarray[i], + sizeof(u32)); + } else + for (i = skip_low; i < nr - skip_high; i++) + __kfifo_put(*fifo, (unsigned char *) &i, sizeof(u32)); + + for (i = 0; i < skip_low + skip_high; i++) + kfifo_get(*fifo, (unsigned char *) &entry, sizeof(u32)); + return 0; +} + +static int cxio_init_resource_fifo(struct kfifo **fifo, spinlock_t * fifo_lock, + u32 nr, u32 skip_low, u32 skip_high) +{ + 
return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, + skip_high, 0)); +} + +static int cxio_init_resource_fifo_random(struct kfifo **fifo, + spinlock_t * fifo_lock, + u32 nr, u32 skip_low, u32 skip_high) +{ + + return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, + skip_high, 1)); +} + +int cxio_hal_init_rhdl_resource(u32 nr_rhdl) +{ + return cxio_init_resource_fifo(&rhdl_fifo, &rhdl_fifo_lock, nr_rhdl, 1, + 0); +} + +void cxio_hal_destroy_rhdl_resource(void) +{ + kfifo_free(rhdl_fifo); +} + +/* nr_* must be power of 2 */ +int cxio_hal_init_resource(struct cxio_hal_resource **rscpp, + u32 nr_tpt, u32 nr_pbl, + u32 nr_rqt, u32 nr_qpid, u32 nr_cqid, u32 nr_pdid) +{ + int err = 0; + struct cxio_hal_resource *rscp; + rscp = kmalloc(sizeof(*rscp), GFP_KERNEL); + if (!rscp) { + return -ENOMEM; + } + *rscpp = rscp; + err = cxio_init_resource_fifo_random(&rscp->tpt_fifo, + &rscp->tpt_fifo_lock, + nr_tpt, 1, 0); + if (err) + goto tpt_err; + err = cxio_init_resource_fifo(&rscp->qpid_fifo, &rscp->qpid_fifo_lock, + nr_qpid, 16, 16); + if (err) + goto qpid_err; + err = cxio_init_resource_fifo(&rscp->cqid_fifo, &rscp->cqid_fifo_lock, + nr_cqid, 1, 0); + if (err) + goto cqid_err; + err = cxio_init_resource_fifo(&rscp->pdid_fifo, &rscp->pdid_fifo_lock, + nr_pdid, 1, 0); + if (err) + goto pdid_err; + return 0; +pdid_err: + kfifo_free(rscp->cqid_fifo); +cqid_err: + kfifo_free(rscp->qpid_fifo); +qpid_err: + kfifo_free(rscp->tpt_fifo); +tpt_err: + return -ENOMEM; +} + +/* + * returns 0 if no resource available + */ +static inline u32 cxio_hal_get_resource(struct kfifo *fifo) +{ + u32 entry; + if (kfifo_get(fifo, (unsigned char *) &entry, sizeof(u32))) + return entry; + else + return 0; /* fifo empty */ +} + +static inline void cxio_hal_put_resource(struct kfifo *fifo, u32 entry) +{ + BUG_ON(kfifo_put(fifo, (unsigned char *) &entry, sizeof(u32)) == 0); +} + +u32 cxio_hal_get_rhdl(void) +{ + return cxio_hal_get_resource(rhdl_fifo); +} + +void
cxio_hal_put_rhdl(u32 rhdl) +{ + cxio_hal_put_resource(rhdl_fifo, rhdl); +} + +u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->tpt_fifo); +} + +void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag) +{ + cxio_hal_put_resource(rscp->tpt_fifo, stag); +} + +u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->qpid_fifo); +} + +void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid) +{ + cxio_hal_put_resource(rscp->qpid_fifo, qpid); +} + +u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->cqid_fifo); +} + +void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid) +{ + cxio_hal_put_resource(rscp->cqid_fifo, cqid); +} + +u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->pdid_fifo); +} + +void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid) +{ + cxio_hal_put_resource(rscp->pdid_fifo, pdid); +} + +void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp) +{ + kfifo_free(rscp->tpt_fifo); + kfifo_free(rscp->cqid_fifo); + kfifo_free(rscp->qpid_fifo); + kfifo_free(rscp->pdid_fifo); + kfree(rscp); +} From swise at opengridcomputing.com Fri Jun 23 07:30:10 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:30:10 -0500 Subject: [openib-general] [PATCH v2 09/14] CXGB3 RDMA Core Types. In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623143010.32410.83385.stgit@stevo-desktop> This patch contains all the HW-specific types. Also included is an inline fastpath cq_poll() function.
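The queue types in this patch (t3_wq/t3_cq) rely on free-running ring-pointer arithmetic via the Q_EMPTY/Q_FULL/Q_PTR2IDX/Q_GENBIT macros: the read and write pointers increment without bound, the array index is the pointer masked by (size - 1), and a generation bit flips on each wrap so valid entries can be told apart from stale ones. A minimal standalone sketch of that arithmetic (helper names are illustrative, not the driver's macros):

```c
#include <stdint.h>

/* rptr/wptr are free-running u32 counters; size is 1 << size_log2. */
static inline int q_empty(uint32_t rptr, uint32_t wptr)
{
	return rptr == wptr;
}

static inline int q_full(uint32_t rptr, uint32_t wptr, unsigned size_log2)
{
	/* full when the pointers are a whole queue-length apart */
	return ((wptr - rptr) >> size_log2) && (rptr != wptr);
}

static inline uint32_t q_count(uint32_t rptr, uint32_t wptr)
{
	return wptr - rptr;	/* correct even across u32 wraparound */
}

static inline uint32_t q_ptr2idx(uint32_t ptr, unsigned size_log2)
{
	/* index into the queue array: mask off everything above size */
	return ptr & ((1u << size_log2) - 1);
}

static inline unsigned q_genbit(uint32_t ptr, unsigned size_log2)
{
	/* flips each time ptr crosses a multiple of the queue size */
	return !((ptr >> size_log2) & 0x1);
}
```

Because the pointers are never masked until indexing, `wptr - rptr` gives the occupancy directly, and the bit just above the index field serves as the generation bit that hardware compares against the CQE genbit to validate entries.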
--- drivers/infiniband/hw/cxgb3/core/cxio_wr.h | 722 ++++++++++++++++++++++++++++ 1 files changed, 722 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_wr.h b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h new file mode 100644 index 0000000..7c78dee --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h @@ -0,0 +1,722 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __CXIO_WR_H__ +#define __CXIO_WR_H__ + +#include +#include +#include +#include "firmware_exports.h" + +#define T3_MAX_SGE 4 + +#define Q_EMPTY(rptr,wptr) ((rptr)==(wptr)) +#define Q_FULL(rptr,wptr,size_log2) ( (((wptr)-(rptr))>>(size_log2)) && \ + ((rptr)!=(wptr)) ) +#define Q_GENBIT(ptr,size_log2) (!(((ptr)>>size_log2)&0x1)) +#define Q_FREECNT(rptr,wptr,size_log2) ((1UL<> S_FW_RIWR_OP)) & M_FW_RIWR_OP) + +#define S_FW_RIWR_SOPEOP 22 +#define M_FW_RIWR_SOPEOP 0x3 +#define V_FW_RIWR_SOPEOP(x) ((x) << S_FW_RIWR_SOPEOP) + +#define S_FW_RIWR_FLAGS 8 +#define M_FW_RIWR_FLAGS 0x3fffff +#define V_FW_RIWR_FLAGS(x) ((x) << S_FW_RIWR_FLAGS) +#define G_FW_RIWR_FLAGS(x) ((((x) >> S_FW_RIWR_FLAGS)) & M_FW_RIWR_FLAGS) + +#define S_FW_RIWR_TID 8 +#define V_FW_RIWR_TID(x) ((x) << S_FW_RIWR_TID) + +#define S_FW_RIWR_LEN 0 +#define V_FW_RIWR_LEN(x) ((x) << S_FW_RIWR_LEN) + +#define S_FW_RIWR_GEN 31 +#define V_FW_RIWR_GEN(x) ((x) << S_FW_RIWR_GEN) + +struct t3_sge { + u32 stag; + u32 len; + u64 to; +}; + +/* If num_sgle is zero, flit 5+ contains immediate data.*/ +struct t3_send_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + + enum t3_rdma_opcode rdmaop:8; + u32 reserved:24; /* 2 */ + u32 rem_stag; /* 2 */ + u32 plen; /* 3 */ + u32 num_sgle; + struct t3_sge sgl[T3_MAX_SGE]; /* 4+ */ +}; + +struct t3_local_inv_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u32 stag; /* 2 */ + u32 reserved3; +}; + +struct t3_rdma_write_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + enum t3_rdma_opcode rdmaop:8; /* 2 */ + u32 reserved:24; /* 2 */ + u32 stag_sink; + u64 to_sink; /* 3 */ + u32 plen; /* 4 */ + u32 num_sgle; + struct t3_sge sgl[T3_MAX_SGE]; /* 5+ */ +}; + +struct t3_rdma_read_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + enum t3_rdma_opcode rdmaop:8; /* 2 */ + u32 reserved:24; + u32 rem_stag; + u64 rem_to; /* 3 */ + u32 local_stag; /* 4 */ + u32 local_len; + u64 local_to; /* 5 */ +}; + +enum 
t3_addr_type { + T3_VA_BASED_TO = 0x0, + T3_ZERO_BASED_TO = 0x1 +} __attribute__ ((packed)); + +enum t3_mem_perms { + T3_MEM_ACCESS_LOCAL_READ = 0x1, + T3_MEM_ACCESS_LOCAL_WRITE = 0x2, + T3_MEM_ACCESS_REM_READ = 0x4, + T3_MEM_ACCESS_REM_WRITE = 0x8 +} __attribute__ ((packed)); + +struct t3_bind_mw_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u32 reserved:16; + enum t3_addr_type type:8; + enum t3_mem_perms perms:8; /* 2 */ + u32 mr_stag; + u32 mw_stag; /* 3 */ + u32 mw_len; + u64 mw_va; /* 4 */ + u32 mr_pbl_addr; /* 5 */ + u32 reserved2:24; + u32 mr_pagesz:8; +}; + +struct t3_receive_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u8 pagesz[T3_MAX_SGE]; + u32 num_sgle; /* 2 */ + struct t3_sge sgl[T3_MAX_SGE]; /* 3+ */ + u32 pbl_addr[T3_MAX_SGE]; +}; + +struct t3_bypass_wr { + struct fw_riwrh wrh; + union t3_wrid wrid; /* 1 */ +}; + +struct t3_modify_qp_wr { + struct fw_riwrh wrh; + union t3_wrid wrid; + u64 ctx1; + u64 ctx0; +}; + +enum t3_mpa_attrs { + uP_RI_MPA_RX_MARKER_ENABLE = 0x1, + uP_RI_MPA_TX_MARKER_ENABLE = 0x2, + uP_RI_MPA_CRC_ENABLE = 0x4, + uP_RI_MPA_IETF_ENABLE = 0x8 +} __attribute__ ((packed)); + +enum t3_qp_caps { + uP_RI_QP_RDMA_READ_ENABLE = 0x01, + uP_RI_QP_RDMA_WRITE_ENABLE = 0x02, + uP_RI_QP_BIND_ENABLE = 0x04, + uP_RI_QP_FAST_REGISTER_ENABLE = 0x08, + uP_RI_QP_STAG0_ENABLE = 0x10 +} __attribute__ ((packed)); + +struct t3_rdma_init_attr { + u32 tid; + u32 qpid; + u32 pdid; + u32 scqid; + u32 rcqid; + u32 rq_addr; + u32 rq_size; + enum t3_mpa_attrs mpaattrs; + enum t3_qp_caps qpcaps; + u16 tcp_emss; + u32 ord; + u32 ird; + u64 qp_dma_addr; + u32 qp_dma_size; + u8 rqes_posted; +}; + +struct t3_rdma_init_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u32 qpid; /* 2 */ + u32 pdid; + u32 scqid; /* 3 */ + u32 rcqid; + u32 rq_addr; /* 4 */ + u32 rq_size; + enum t3_mpa_attrs mpaattrs:8; /* 5 */ + enum t3_qp_caps qpcaps:8; + u32 ulpdu_size:16; + u32 rqes_posted; /* bits 31-1 - reservered */ 
+ /* bit 0 - set if RECV posted */ + u32 ord; /* 6 */ + u32 ird; + u64 qp_dma_addr; /* 7 */ + u32 qp_dma_size; /* 8 */ + u32 rsvd; +}; + +union t3_wr { + struct t3_send_wr send; + struct t3_rdma_write_wr write; + struct t3_rdma_read_wr read; + struct t3_receive_wr recv; + struct t3_local_inv_wr local_inv; + struct t3_bind_mw_wr bind; + struct t3_bypass_wr bypass; + struct t3_rdma_init_wr init; + struct t3_modify_qp_wr qp_mod; + u64 flit[16]; +}; + +#define T3_SQ_CQE_FLIT 13 +#define T3_SQ_COOKIE_FLIT 14 + +#define T3_RQ_COOKIE_FLIT 13 +#define T3_RQ_CQE_FLIT 14 + +static inline void build_fw_riwrh(struct fw_riwrh *wqe, enum t3_wr_opcode op, + enum t3_wr_flags flags, u8 genbit, u32 tid, + u8 len) +{ + wqe->op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(op) | + V_FW_RIWR_SOPEOP(M_FW_RIWR_SOPEOP) | + V_FW_RIWR_FLAGS(flags)); + wmb(); + wqe->gen_tid_len = cpu_to_be32(V_FW_RIWR_GEN(genbit) | + V_FW_RIWR_TID(tid) | + V_FW_RIWR_LEN(len)); + /* 2nd gen bit... */ + ((union t3_wr *)wqe)->flit[15] = cpu_to_be64(genbit); +} + +/* + * T3 ULP2_TX commands + */ +enum t3_utx_mem_op { + T3_UTX_MEM_READ = 2, + T3_UTX_MEM_WRITE = 3 +}; + +/* T3 MC7 RDMA TPT entry format */ + +enum tpt_mem_type { + TPT_NON_SHARED_MR = 0x0, + TPT_SHARED_MR = 0x1, + TPT_MW = 0x2, + TPT_MW_RELAXED_PROTECTION = 0x3 +}; + +enum tpt_addr_type { + TPT_ZBTO = 0, + TPT_VATO = 1 +}; + +enum tpt_mem_perm { + TPT_LOCAL_READ = 0x8, + TPT_LOCAL_WRITE = 0x4, + TPT_REMOTE_READ = 0x2, + TPT_REMOTE_WRITE = 0x1 +}; + +struct tpt_entry { + u32 valid_stag_pdid; + u32 flags_pagesize_qpid; + + u32 rsvd_pbl_addr; + u32 len; + u32 va_hi; + u32 va_low_or_fbo; + + u32 rsvd_bind_cnt_or_pstag; + u32 rsvd_pbl_size; +}; +#define S_TPT_VALID 31 +#define V_TPT_VALID(x) ((x) << S_TPT_VALID) +#define F_TPT_VALID V_TPT_VALID(1U) + +#define S_TPT_STAG_KEY 23 +#define M_TPT_STAG_KEY 0xFF +#define V_TPT_STAG_KEY(x) ((x) << S_TPT_STAG_KEY) +#define G_TPT_STAG_KEY(x) (((x) >> S_TPT_STAG_KEY) & M_TPT_STAG_KEY) + +#define S_TPT_STAG_STATE 22 
+#define V_TPT_STAG_STATE(x) ((x) << S_TPT_STAG_STATE) +#define F_TPT_STAG_STATE V_TPT_STAG_STATE(1U) + +#define S_TPT_STAG_TYPE 20 +#define M_TPT_STAG_TYPE 0x3 +#define V_TPT_STAG_TYPE(x) ((x) << S_TPT_STAG_TYPE) +#define G_TPT_STAG_TYPE(x) (((x) >> S_TPT_STAG_TYPE) & M_TPT_STAG_TYPE) + +#define S_TPT_PDID 0 +#define M_TPT_PDID 0xFFFFF +#define V_TPT_PDID(x) ((x) << S_TPT_PDID) +#define G_TPT_PDID(x) (((x) >> S_TPT_PDID) & M_TPT_PDID) + +#define S_TPT_PERM 28 +#define M_TPT_PERM 0xF +#define V_TPT_PERM(x) ((x) << S_TPT_PERM) +#define G_TPT_PERM(x) (((x) >> S_TPT_PERM) & M_TPT_PERM) + +#define S_TPT_REM_INV_DIS 27 +#define V_TPT_REM_INV_DIS(x) ((x) << S_TPT_REM_INV_DIS) +#define F_TPT_REM_INV_DIS V_TPT_REM_INV_DIS(1U) + +#define S_TPT_ADDR_TYPE 26 +#define V_TPT_ADDR_TYPE(x) ((x) << S_TPT_ADDR_TYPE) +#define F_TPT_ADDR_TYPE V_TPT_ADDR_TYPE(1U) + +#define S_TPT_MW_BIND_ENABLE 25 +#define V_TPT_MW_BIND_ENABLE(x) ((x) << S_TPT_MW_BIND_ENABLE) +#define F_TPT_MW_BIND_ENABLE V_TPT_MW_BIND_ENABLE(1U) + +#define S_TPT_PAGE_SIZE 20 +#define M_TPT_PAGE_SIZE 0x1F +#define V_TPT_PAGE_SIZE(x) ((x) << S_TPT_PAGE_SIZE) +#define G_TPT_PAGE_SIZE(x) (((x) >> S_TPT_PAGE_SIZE) & M_TPT_PAGE_SIZE) + +#define S_TPT_PBL_ADDR 0 +#define M_TPT_PBL_ADDR 0x1FFFFFFF +#define V_TPT_PBL_ADDR(x) ((x) << S_TPT_PBL_ADDR) +#define G_TPT_PBL_ADDR(x) (((x) >> S_TPT_PBL_ADDR) & M_TPT_PBL_ADDR) + +#define S_TPT_QPID 0 +#define M_TPT_QPID 0xFFFFF +#define V_TPT_QPID(x) ((x) << S_TPT_QPID) +#define G_TPT_QPID(x) (((x) >> S_TPT_QPID) & M_TPT_QPID) + +#define S_TPT_PSTAG 0 +#define M_TPT_PSTAG 0xFFFFFF +#define V_TPT_PSTAG(x) ((x) << S_TPT_PSTAG) +#define G_TPT_PSTAG(x) (((x) >> S_TPT_PSTAG) & M_TPT_PSTAG) + +#define S_TPT_PBL_SIZE 0 +#define M_TPT_PBL_SIZE 0xFFFFF +#define V_TPT_PBL_SIZE(x) ((x) << S_TPT_PBL_SIZE) +#define G_TPT_PBL_SIZE(x) (((x) >> S_TPT_PBL_SIZE) & M_TPT_PBL_SIZE) + +/* + * CQE defs + */ +struct t3_cqe { + u32 header:32; + u32 len:32; + u32 wrid_hi_stag:32; + u32 wrid_low_msn:32; +}; + 
+#define S_CQE_QPID 12 +#define M_CQE_QPID 0xFFFFF +#define G_CQE_QPID(x) ((((x) >> S_CQE_QPID)) & M_CQE_QPID) +#define V_CQE_QPID(x) ((x)<> S_CQE_SWCQE)) & M_CQE_SWCQE) +#define V_CQE_SWCQE(x) ((x)<> S_CQE_GENBIT) & M_CQE_GENBIT) +#define V_CQE_GENBIT(x) ((x)<> S_CQE_STATUS)) & M_CQE_STATUS) +#define V_CQE_STATUS(x) ((x)<> S_CQE_TYPE)) & M_CQE_TYPE) +#define V_CQE_TYPE(x) ((x)<> S_CQE_OPCODE)) & M_CQE_OPCODE) +#define V_CQE_OPCODE(x) ((x)<sw_rptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2)); + return cqe; + } + cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2)); + if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe)) + return cqe; + return NULL; +} + +/* + * Return a ptr to the next signaled wr in the SQ or NULL. + */ +static inline union t3_wr *next_sq_wr(struct t3_wq *wq) +{ + union t3_wr *wr = wq->sq_oldest_wr; + int count = Q_COUNT(wq->sq_rptr, wq->sq_wptr); + u32 wptr = wr - wq->queue + 1; + + BUG_ON(!wr); + while (count) { + u32 opflags; + wr = (union t3_wr *)(wq->queue+Q_PTR2IDX(wptr, wq->size_log2)); + + opflags = be32_to_cpu(wr->recv.wrh.op_seop_flags); + + /* XXX Reads always generate a completion. */ + if (G_FW_RIWR_OP(opflags) == T3_WR_READ) + return wr; + + /* Skip (and don't count) receives */ + if (G_FW_RIWR_OP(opflags) == T3_WR_RCV) { + wptr++; + continue; + } + + /* If this WR is signaled, return it. */ + if (G_FW_RIWR_FLAGS(opflags) & T3_COMPLETION_FLAG) + return wr; + wptr++; + count--; + } + return NULL; +} + +int __cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, + struct t3_cqe *cqe, u8 * cqe_flushed, + u64 * cookie, u32 * credit); + +#define FASTPATH_POLL + +/* + * Fastpath poll. + * + * Caller must: + * check the validity of the first CQE, + * supply the wq assicated with the qpid. + * credit: cq credit to return to sge. + * cqe_flushed: 1 iff the CQE is flushed. + * cqe: copy of the polled CQE. + * + * return value: + * 0 CQE returned, + * -1 CQE skipped, try again. 
+ */ +static inline int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, + struct t3_cqe *cqe, u8 * cqe_flushed, + u64 * cookie, u32 * credit) +{ +#ifdef FASTPATH_POLL + struct t3_cqe *rd_cqe; + + rd_cqe = cxio_next_cqe(cq); + + /* fastpath: + * wq is valid + * wq not in error + * cqe status not error + * opcode not TERMINATE + * opcode not read response + * wq->sq_oldest_wr is not a read request + * either its a SQ CQE -or- the MSN is correct in the RQ CQE + */ + if (likely(wq && !wq->error && + !CQE_STATUS(*rd_cqe) && + (CQE_OPCODE(*rd_cqe) != T3_TERMINATE) && + (CQE_OPCODE(*rd_cqe) != T3_READ_RESP) && + (!wq->sq_oldest_wr || + ( wq->sq_oldest_wr->send.rdmaop != T3_READ_REQ)) && + (SQ_TYPE(*rd_cqe) || (RQ_TYPE(*rd_cqe) && + (CQE_WRID_MSN(*rd_cqe) == + (wq->rq_rptr + 1)))))) { + *cqe = *rd_cqe; + *cqe_flushed = 0; + *credit = 0; + + /* + * Reap the associated WR(s) that are freed up with this + * completion. + */ + if (SQ_TYPE(*rd_cqe)) { + wq->sq_rptr = CQE_WRID_SQ_WPTR(*rd_cqe) + 1; + BUG_ON(!wq->sq_oldest_wr); + *cookie = wq->queue[Q_PTR2IDX(CQE_WRID_WPTR(*rd_cqe), + wq->size_log2) + ].flit[T3_SQ_COOKIE_FLIT]; + wq->sq_oldest_wr = next_sq_wr(wq); + } else { + *cookie = wq->rq[Q_PTR2IDX(wq->rq_rptr, + wq->rq_size_log2)]; + ++(wq->rq_rptr); + } + + if (SW_CQE(*rd_cqe)) { + ++cq->sw_rptr; + } else { + ++cq->rptr; + + /* + * compute credits. 
+ */ + if (((cq->rptr-cq->wptr) > (1 << (cq->size_log2 - 1))) + || ((cq->rptr - cq->wptr) >= 128)) { + *credit = cq->rptr - cq->wptr; + cq->wptr = cq->rptr; + } + } + return 0; + } +#endif + *cqe_flushed = 0; + *credit = 0; + return __cxio_poll_cq(wq, cq, cqe, cqe_flushed, cookie, credit); +} +#endif From swise at opengridcomputing.com Fri Jun 23 07:29:50 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:29:50 -0500 Subject: [openib-general] [PATCH v2 05/14] CXGB3 Connection Manager In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623142950.32410.76113.stgit@stevo-desktop> This patch contains the code to manage TCP connections and do MPA negotiation. It implements the IWCM device-specific methods. ISSUES: - IWCM should pass down a dst entry or at least the next hop ipaddr/macaddr. Currently this code looks up this info based on the source and destination ipaddr. - port management isn't correct. This should be moved into the core IWCM or CMA. It's not trivial to support native stack TCP port allocation/reservation. --- drivers/infiniband/hw/cxgb3/iwch_cm.c | 2135 +++++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch_cm.h | 232 ++++ 2 files changed, 2367 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c new file mode 100644 index 0000000..897cb5e --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -0,0 +1,2135 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "iwch.h" +#include "iwch_provider.h" +#include "iwch_cm.h" + +#ifdef DEBUG +char *states[] = { + "idle", + "listen", + "connecting", + "mpa_wait_req", + "mpa_req_sent", + "mpa_req_rcvd", + "mpa_rep_sent", + "fpdu_mode", + "aborting", + "closing", + "moribund", + "dead", + NULL, +}; +#endif + +static int ep_timeout_secs = 10; +module_param(ep_timeout_secs, int, 0444); +MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout " + "in seconds (default=10)"); + +static int mpa_rev = 1; +module_param(mpa_rev, int, 0444); +MODULE_PARM_DESC(mpa_rev, "MPA Revision, 0 supports amso1100, " + "1 is spec compliant. 
(default=1)"); + +static int markers_enabled = 0; +module_param(markers_enabled, int, 0444); +MODULE_PARM_DESC(markers_enabled, "Enable MPA MARKERS (default(0)=disabled)"); + +static int crc_enabled = 1; +module_param(crc_enabled, int, 0444); +MODULE_PARM_DESC(crc_enabled, "Enable MPA CRC (default(1)=enabled)"); + +static u16 port_start = 32768; +module_param(port_start, ushort, 0444); +MODULE_PARM_DESC(port_start, + "Starting port for ephemeral ports. (default=32768)"); + +static u16 port_end = 65535; +module_param(port_end, ushort, 0444); +MODULE_PARM_DESC(port_end, + "Ending port for ephemeral ports. (default=65535)"); + +static int rcv_win = 512 * 1024; +module_param(rcv_win, int, 0444); +MODULE_PARM_DESC(rcv_win, "TCP receive window in bytes (default=512KB)"); + +static void process_work(void *ctx); +static struct workqueue_struct *workq; +DECLARE_WORK(skb_work, process_work, NULL); + +static struct sk_buff_head rxq; +static t3c_cpl_handler_func work_handlers[NUM_CPL_CMDS]; + +static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp); +static void ep_timeout(unsigned long arg); +static void connect_reply_upcall(struct iwch_ep *ep, int status); + +static void start_ep_timer(struct iwch_ep *ep) +{ + PDBG("%s enter (%s line %u) ep %p\n", + __FUNCTION__, __FILE__, __LINE__, ep); + if (timer_pending(&ep->timer)) { + PDBG("%s stopped and restarted timer (%s line %u) ep %p\n", + __FUNCTION__, __FILE__, __LINE__, ep); + del_timer_sync(&ep->timer); + } else + ep_atomic_inc(&ep->com.refcnt); + ep->timer.expires = jiffies + ep_timeout_secs * HZ; + ep->timer.data = (unsigned long)ep; + ep->timer.function = ep_timeout; + add_timer(&ep->timer); +} + +static void stop_ep_timer(struct iwch_ep *ep) +{ + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + del_timer_sync(&ep->timer); + free_ep(&ep->com); +} + +/* + * Port bitmap to track which ports are in use. This should be + * global to all openib rnic devices... 
+ */ +static DECLARE_BITMAP(portbits, 65536); +static DEFINE_SPINLOCK(portlock); + +static int get_port(u16 *portp) +{ + u32 port = (u32)ntohs(*portp); + int ret = 0; + PDBG("%s enter (%s line %u) inp port %d\n", __FUNCTION__, __FILE__, __LINE__, port); + spin_lock(&portlock); + if (port == 0) { + port = find_next_zero_bit(portbits, 65536, port_start); + if (port > port_end) + ret = 1; + else + set_bit(port, portbits); + } else + if (test_and_set_bit(port, portbits)) + ret = 1; + spin_unlock(&portlock); + if (!ret) { + *portp = htons(port); + PDBG("%s alloc port %d\n", __FUNCTION__, port); + } + return ret; +} + +static void free_port(u16 port) +{ + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + spin_lock(&portlock); + PDBG("%s free port %d\n", __FUNCTION__, ntohs(port)); + clear_bit((u32)ntohs(port), portbits); + spin_unlock(&portlock); +} + +int iwch_quiesce_tid(struct iwch_ep *ep) +{ + struct cpl_set_tcb_field *req; + struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + + if (!skb) { + return -ENOMEM; + } + req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid)); + req->reply = 0; + req->cpu_idx = 0; + req->word = htons(W_TCB_RX_QUIESCE); + req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE); + req->val = cpu_to_be64(1 << S_TCB_RX_QUIESCE); + + skb->priority = CPL_PRIORITY_DATA; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +int iwch_resume_tid(struct iwch_ep *ep) +{ + struct cpl_set_tcb_field *req; + struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + + if (!skb) { + return -ENOMEM; + } + req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid)); + 
req->reply = 0; + req->cpu_idx = 0; + req->word = htons(W_TCB_RX_QUIESCE); + req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE); + req->val = 0; + + skb->priority = CPL_PRIORITY_DATA; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static void set_emss(struct iwch_ep *ep, u16 opt) +{ + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + ep->emss = T3C_DATA(ep->com.tdev)->mtus[G_TCPOPT_MSS(opt)] - 40; + if (G_TCPOPT_TSTAMP(opt)) { + ep->emss -= 12; + } + if (ep->emss < 128) + ep->emss = 128; + PDBG("emss=%d\n", ep->emss); +} + +#if 0 +static int state_exch(struct iwch_ep_common *epc, enum iwch_ep_state exch) +{ + unsigned long flags; + int old; + + spin_lock_irqsave(&epc->lock, flags); + old = epc->state; + epc->state = exch; + spin_unlock_irqrestore(&epc->lock, flags); + return old; +} +#endif + +static int state_comp_exch(struct iwch_ep_common *epc, + enum iwch_ep_state comp, + enum iwch_ep_state exch) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&epc->lock, flags); + ret = (epc->state == comp); + if (ret) + epc->state = exch; + spin_unlock_irqrestore(&epc->lock, flags); + return ret; +} + +static enum iwch_ep_state state_read(struct iwch_ep_common *epc) +{ + unsigned long flags; + enum iwch_ep_state state; + + spin_lock_irqsave(&epc->lock, flags); + state = epc->state; + spin_unlock_irqrestore(&epc->lock, flags); + return state; +} + +static void state_set(struct iwch_ep_common *epc, enum iwch_ep_state new) +{ + unsigned long flags; + + spin_lock_irqsave(&epc->lock, flags); + PDBG(" %s - %s -> %s\n", __FUNCTION__, states[epc->state], + states[new]); + epc->state = new; + spin_unlock_irqrestore(&epc->lock, flags); + return; +} + +static void *alloc_ep(int size, gfp_t gfp) +{ + struct iwch_ep_common *epc; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + epc = kmalloc(size, gfp); + if (epc) { + memset(epc, 0, size); + atomic_set(&epc->refcnt, 1); + spin_lock_init(&epc->lock); + 
init_waitqueue_head(&epc->waitq); + } + PDBG("alloc ep %p\n", epc); + return (void *) epc; +} + +void __free_ep(struct iwch_ep_common *epc) +{ + PDBG("%s enter (%s line %u) ep %p, &refcnt %p state %s, refcnt %d\n", + __FUNCTION__, __FILE__, + __LINE__, epc, &epc->refcnt, + states[state_read(epc)], + atomic_read(&epc->refcnt)); + if (atomic_read(&epc->refcnt) == 1) { + goto out; + } + if (!atomic_dec_and_test(&epc->refcnt)) { + return; + } +out: + PDBG("free ep %p\n", epc); + free_port(epc->local_addr.sin_port); + kfree(epc); +} + +static void process_work(void *ctx) +{ + struct sk_buff *skb = NULL; + void *ep; + struct t3cdev *tdev; + int ret; + + while ((skb = skb_dequeue(&rxq))) { + ep = *((void **) (skb->cb)); + tdev = *((struct t3cdev **) (skb->cb + sizeof(void *))); + ret = work_handlers[G_OPCODE(ntohl(skb->csum))] + (tdev, skb, ep); + if (ret & CPL_RET_BUF_DONE) + kfree_skb(skb); + + /* + * ep was referenced in sched(), and is freed here. + */ + free_ep(ep); + } +} + +static int status2errno(int status) +{ + switch (status) { + case CPL_ERR_NONE: + return 0; + case CPL_ERR_CONN_RESET: + return -ECONNRESET; + case CPL_ERR_ARP_MISS: + return -EHOSTUNREACH; + case CPL_ERR_CONN_TIMEDOUT: + return -ETIMEDOUT; + case CPL_ERR_TCAM_FULL: + return -ENOMEM; + case CPL_ERR_CONN_EXIST: + return -EADDRINUSE; + default: + return -EIO; + } +} + +/* + * Try and reuse skbs already allocated... 
+ */ +static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp) +{ + if (skb) { + BUG_ON(skb_cloned(skb)); + skb_trim(skb, 0); + skb_get(skb); + } else { + skb = alloc_skb(len, gfp); + } + return skb; +} + +static struct rtable *find_route(struct t3cdev *dev, + u32 local_ip, u32 peer_ip, u16 local_port, + u16 peer_port, u8 tos) +{ + struct rtable *rt; + struct flowi fl = { + .oif = 0, + .nl_u = { + .ip4_u = { + .daddr = peer_ip, + .saddr = local_ip, + .tos = tos} + }, + .proto = IPPROTO_TCP, + .uli_u = { + .ports = { + .sport = local_port, + .dport = peer_port} + } + }; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + if (ip_route_output_flow(&rt, &fl, NULL, 0)) { + return NULL; + } + return rt; +} + +static unsigned int find_best_mtu(const struct t3c_data *d, unsigned short mtu) +{ + int i = 0; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + while (i < d->nmtus - 1 && d->mtus[i + 1] <= mtu) + ++i; + return i; +} + +/* + * XXX need to upcall the connection setup failure somehow! + */ +static void arp_failure_discard(struct t3cdev *dev, struct sk_buff *skb) +{ + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + kfree_skb(skb); +} + +/* + * Handle an ARP failure for an active open. + */ +static void act_open_req_arp_failure(struct t3cdev *dev, struct sk_buff *skb) +{ + printk(KERN_ERR MOD "ARP failure during connect\n"); + kfree_skb(skb); +} + +/* + * Handle an ARP failure for a CPL_ABORT_REQ. Change it into a no RST variant + * and send it along. 
+ */ +static void abort_arp_failure(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_abort_req *req = cplhdr(skb); + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + req->cmd = CPL_ABORT_NO_RST; + t3c_send(dev, skb); +} + +static int send_halfclose(struct iwch_ep *ep, gfp_t gfp) +{ + struct cpl_close_con_req *req; + struct sk_buff *skb; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + skb = get_skb(NULL, sizeof(*req), gfp); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, arp_failure_discard); + req = (struct cpl_close_con_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_TOE_CLOSE_CON)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, ep->hwtid)); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp) +{ + struct cpl_abort_req *req; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + skb = get_skb(skb, sizeof(*req), gfp); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb.\n", + __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, abort_arp_failure); + req = (struct cpl_abort_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_TOE_HOST_ABORT_CON_REQ)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ABORT_REQ, ep->hwtid)); + req->cmd = CPL_ABORT_SEND_RST; + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_connect(struct iwch_ep *ep) +{ + struct cpl_act_open_req *req; + struct sk_buff *skb; + u32 opt0h, opt0l, opt2; + unsigned int mtu_idx; + int wscale; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + skb = get_skb(NULL, sizeof(*req), 
GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb.\n", + __FUNCTION__); + return -ENOMEM; + } + mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst)); + wscale = compute_wscale(rcv_win); + opt0h = V_NAGLE(0) | + V_KEEP_ALIVE(1) | + F_TCAM_BYPASS | + V_WND_SCALE(wscale) | + V_MSS_IDX(mtu_idx) | + V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx); + opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10); + opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0); + skb->priority = CPL_PRIORITY_SETUP; + set_arp_failure_handler(skb, act_open_req_arp_failure); + + req = (struct cpl_act_open_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ACT_OPEN_REQ, ep->atid)); + req->local_port = ep->com.local_addr.sin_port; + req->peer_port = ep->com.remote_addr.sin_port; + req->local_ip = ep->com.local_addr.sin_addr.s_addr; + req->peer_ip = ep->com.remote_addr.sin_addr.s_addr; + req->opt0h = htonl(opt0h); + req->opt0l = htonl(opt0l); + req->params = 0; + req->opt2 = htonl(opt2); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + int len; + + PDBG("%s (%s line %u pd_len %d)\n", __FUNCTION__, __FILE__, __LINE__, ep->plen); + + BUG_ON(skb_cloned(skb)); + + mpalen = sizeof(*mpa) + ep->plen; + if (skb->data + mpalen + sizeof(*req) > skb->end) { + kfree_skb(skb); + skb=alloc_skb(mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + connect_reply_upcall(ep, -ENOMEM); + return; + } + } + skb_trim(skb, 0); + skb_reserve(skb, sizeof(*req)); + skb_put(skb, mpalen); + skb->priority = CPL_PRIORITY_DATA; + mpa = (struct mpa_message *) skb->data; + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REQ, sizeof(mpa->key)); + mpa->flags = (crc_enabled ? MPA_CRC : 0) | + (markers_enabled ? 
MPA_MARKERS : 0); + mpa->private_data_size = htons(ep->plen); + mpa->revision = mpa_rev; + + if (ep->plen) { + memcpy(mpa->private_data, ep->mpa_pkt + sizeof(*mpa), ep->plen); + } + + /* + * Reference the mpa skb. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. + */ + skb_get(skb); + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + len = skb->len; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_TOE_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(len); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx)); + req->flags = htonl(F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + BUG_ON(ep->mpa_skb); + ep->mpa_skb = skb; + l2t_send(ep->com.tdev, skb, ep->l2t); + start_ep_timer(ep); + state_set(&ep->com, MPA_REQ_SENT); + return; +} + +static int send_mpa_reject(struct iwch_ep *ep, const void *pdata, u8 plen) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + struct sk_buff *skb; + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + mpalen = sizeof(*mpa) + plen; + + skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__); + return -ENOMEM; + } + skb_reserve(skb, sizeof(*req)); + mpa = (struct mpa_message *) skb_put(skb, mpalen); + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key)); + mpa->flags = MPA_REJECT; + mpa->revision = mpa_rev; + mpa->private_data_size = htons(plen); + if (plen) { + memcpy(mpa->private_data, pdata, plen); + } + + /* + * Reference the mpa skb again. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. 
+ */ + skb_get(skb); + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_TOE_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(mpalen); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx)); + req->flags = htonl(F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + BUG_ON(ep->mpa_skb); + ep->mpa_skb = skb; + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + int len; + struct sk_buff *skb; + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + mpalen = sizeof(*mpa) + plen; + + skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + skb_reserve(skb, sizeof(*req)); + mpa = (struct mpa_message *) skb_put(skb, mpalen); + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key)); + mpa->flags = (ep->mpa_attr.crc_enabled ? MPA_CRC : 0) | + (markers_enabled ? MPA_MARKERS : 0); + mpa->revision = mpa_rev; + mpa->private_data_size = htons(plen); + if (plen) { + memcpy(mpa->private_data, pdata, plen); + } + + /* + * Reference the mpa skb. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. 
+ */ + skb_get(skb); + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + len = skb->len; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_TOE_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(len); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx)); + req->flags = htonl(F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + ep->mpa_skb = skb; + state_set(&ep->com, MPA_REP_SENT); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int act_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_act_establish *req = cplhdr(skb); + unsigned int tid = GET_TID(req); + + PDBG("%s (%s line %u) hwtid %d\n", __FUNCTION__, __FILE__, __LINE__, + tid); + + dst_confirm(ep->dst); + + /* setup the hwtid for this connection */ + ep->hwtid = tid; + t3c_insert_tid(ep->com.tdev, &t3c_client, ep, tid); + + ep->snd_seq = ntohl(req->snd_isn); + + set_emss(ep, ntohs(req->tcp_opt)); + + /* dealloc the atid */ + t3c_free_atid(ep->com.tdev, ep->atid); + + /* start MPA negotiation */ + send_mpa_req(ep, skb); + + return 0; +} + +static void abort_connection(struct iwch_ep *ep, struct sk_buff *skb) +{ + PDBG("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + state_set(&ep->com, ABORTING); + send_abort(ep, skb, GFP_KERNEL); +} + +static void close_complete_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CLOSE; + if (ep->com.cm_id) { + PDBG("close complete delivered ep %p cm_id %p hwtid %d\n", + ep, ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void peer_close_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s enter (%s line %u)\n", 
__FUNCTION__, __FILE__, __LINE__); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_DISCONNECT; + if (ep->com.cm_id) { + PDBG("peer close delivered ep %p cm_id %p hwtid %d\n", + ep, ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } +} + +static void peer_abort_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CLOSE; + event.status = -ECONNRESET; + if (ep->com.cm_id) { + PDBG("abort delivered ep %p cm_id %p hwtid %d\n", ep, + ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void connect_reply_upcall(struct iwch_ep *ep, int status) +{ + struct iw_cm_event event; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CONNECT_REPLY; + event.status = status; + event.local_addr = ep->com.local_addr; + event.remote_addr = ep->com.remote_addr; + + if ((status == 0) || (status == -ECONNREFUSED)) { + event.private_data_len = ep->plen; + event.private_data = ep->mpa_pkt + sizeof(struct mpa_message); + } + if (ep->com.cm_id) { + PDBG("%s ep %p tid %d status %d\n", __FUNCTION__, ep, ep->hwtid, status); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } + if (status < 0) { + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void connect_request_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CONNECT_REQUEST; + event.local_addr = ep->com.local_addr; + event.remote_addr = ep->com.remote_addr; + event.private_data_len = ep->plen; + event.private_data = ep->mpa_pkt + sizeof(struct mpa_message); + event.provider_data = ep; + PDBG("%s ep %p tid 
%d\n", __FUNCTION__, ep, ep->hwtid); + if (state_read(&ep->parent_ep->com) != DEAD) + ep->parent_ep->com.cm_id->event_handler( + ep->parent_ep->com.cm_id, + &event); + free_ep(&ep->parent_ep->com); + ep->parent_ep = NULL; +} + +static void established_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_ESTABLISHED; + if (ep->com.cm_id) { + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } +} + +static int update_rx_credits(struct iwch_ep *ep, u32 credits) +{ + struct cpl_rx_data_ack *req; + struct sk_buff *skb; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "update_rx_credits - cannot alloc skb!\n"); + return 0; + } + + req = (struct cpl_rx_data_ack *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, ep->hwtid)); + req->credit_dack = htonl(V_RX_CREDITS(credits) | V_RX_FORCE_ACK(1)); + skb->priority = CPL_PRIORITY_ACK; + ep->com.tdev->send(ep->com.tdev, skb); + return credits; +} + +static void process_mpa_reply(struct iwch_ep *ep, struct sk_buff *skb) +{ + struct mpa_message *mpa; + u16 plen; + struct iwch_qp_attributes attrs; + enum iwch_qp_attr_mask mask; + int err; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + /* + * Stop mpa timer. If it expired, then the state is + * CLOSING and we bail since ep_timeout already aborted + * the connection. + */ + stop_ep_timer(ep); + if (state_read(&ep->com) == CLOSING) { + return; + } + state_set(&ep->com, FPDU_MODE); + + /* + * If we get more than the supported amount of private data + * then we must fail this connection. 
+ */ + if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) { + err = -EINVAL; + goto err; + } + + /* + * Copy the new data into our accumulation buffer. + */ + memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len); + ep->mpa_pkt_len += skb->len; + + /* + * If we don't even have the mpa message, then bail. + */ + if (ep->mpa_pkt_len < sizeof(*mpa)) { + return; + } + mpa = (struct mpa_message *) ep->mpa_pkt; + + /* Validate MPA header. */ + if (mpa->revision != mpa_rev) { + err = -EPROTO; + goto err; + } + if (memcmp(mpa->key, MPA_KEY_REP, sizeof(mpa->key))) { + err = -EPROTO; + goto err; + } + + plen = ntohs(mpa->private_data_size); + + /* + * Fail if there's too much private data. + */ + if (plen > MPA_MAX_PRIVATE_DATA) { + err = -EPROTO; + goto err; + } + + /* + * Fail if plen does not account for the pkt size. + */ + if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) { + err = -EPROTO; + goto err; + } + + ep->plen = (u8) plen; + + /* + * If we don't have all the pdata yet, then bail. + * We'll continue processing when more data arrives. + */ + if (ep->mpa_pkt_len < (sizeof(*mpa) + plen)) { + return; + } + + if (mpa->flags & MPA_REJECT) { + err = -ECONNREFUSED; + goto err; + } + + /* + * If we get here we have accumulated the entire mpa + * start reply message including private data. And + * the MPA header is valid. + */ + + ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0; + ep->mpa_attr.recv_marker_enabled = markers_enabled; + ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0; + ep->mpa_attr.version = mpa_rev; + PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, " + "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__, + ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled, + ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version); + + /* + * Quiesce the TID here. The uP unquiesces the TID as + * part of the rdma_init operation. 
+ */ + err = iwch_quiesce_tid(ep); + if (err) { + goto err; + } + + attrs.mpa_attr = ep->mpa_attr; + attrs.max_ird = ep->ird; + attrs.max_ord = ep->ord; + attrs.llp_stream_handle = ep; + attrs.next_state = IWCH_QP_STATE_RTS; + + mask = IWCH_QP_ATTR_NEXT_STATE | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_MAX_IRD | IWCH_QP_ATTR_MAX_ORD; + + /* bind QP and TID with INIT_WR */ + err = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, mask, &attrs, 1); + if (!err) { + goto out; + } +err: + abort_connection(ep, skb); +out: + connect_reply_upcall(ep, err); + return; +} + +static void process_mpa_request(struct iwch_ep *ep, struct sk_buff *skb) +{ + struct mpa_message *mpa; + u16 plen; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + /* + * Stop mpa timer. If it expired, then the state is + * CLOSING and we bail since ep_timeout already aborted + * the connection. + */ + stop_ep_timer(ep); + if (state_read(&ep->com) == CLOSING) { + return; + } + + /* + * If we get more than the supported amount of private data + * then we must fail this connection. + */ + if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) { + abort_connection(ep, skb); + return; + } + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + /* + * Copy the new data into our accumulation buffer. + */ + memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len); + ep->mpa_pkt_len += skb->len; + + /* + * If we don't even have the mpa message, then bail. + * We'll continue processing when more data arrives. + */ + if (ep->mpa_pkt_len < sizeof(*mpa)) { + return; + } + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + mpa = (struct mpa_message *) ep->mpa_pkt; + + /* + * Validate MPA header. 
+ */ + if (mpa->revision != mpa_rev) { + abort_connection(ep, skb); + return; + } + + if (memcmp(mpa->key, MPA_KEY_REQ, sizeof(mpa->key))) { + abort_connection(ep, skb); + return; + } + + plen = ntohs(mpa->private_data_size); + + /* + * Fail if there's too much private data. + */ + if (plen > MPA_MAX_PRIVATE_DATA) { + abort_connection(ep, skb); + return; + } + + /* + * Fail if plen does not account for the pkt size. + */ + if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) { + abort_connection(ep, skb); + return; + } + ep->plen = (u8) plen; + + /* + * If we don't have all the pdata yet, then bail. + */ + if (ep->mpa_pkt_len < (sizeof(*mpa) + plen)) { + return; + } + + /* + * If we get here we have accumulated the entire mpa + * start request message including private data. + */ + ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0; + ep->mpa_attr.recv_marker_enabled = markers_enabled; + ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0; + ep->mpa_attr.version = mpa_rev; + PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, " + "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__, + ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled, + ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version); + + state_set(&ep->com, MPA_REQ_RCVD); + + /* drive upcall */ + connect_request_upcall(ep); + return; +} + +static int rx_data(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_rx_data *hdr = cplhdr(skb); + unsigned int dlen = ntohs(hdr->len); + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + skb_pull(skb, sizeof(*hdr)); + skb_trim(skb, dlen); + + switch (state_read(&ep->com)) { + case MPA_REQ_SENT: + process_mpa_reply(ep, skb); + break; + case MPA_REQ_WAIT: + process_mpa_request(ep, skb); + break; + case MPA_REP_SENT: + break; + default: + printk(KERN_ERR MOD "%s - unexpected streaming data." 
+ " ep %p state %d hwtid %d\n", + __FUNCTION__, ep, state_read(&ep->com), ep->hwtid); + + /* generate some kind of upcall if needed */ + BUG_ON(1); + abort_connection(ep, skb); + break; + } + + /* update RX credits */ + update_rx_credits(ep, dlen); + + return CPL_RET_BUF_DONE; +} + +/* + * Upcall from the adapter indicating data has been transmitted. + * For us it's just the single MPA request or reply. We can now free + * the skb holding the mpa message. + */ +static int tx_ack(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_wr_ack *hdr = cplhdr(skb); + unsigned int credits = ntohs(hdr->credits); + enum iwch_qp_attr_mask mask; + + PDBG("%s (%s line %u) credits %d\n", __FUNCTION__, __FILE__, + __LINE__, credits); + + /* XXX remove this once Felix fixes the FW. */ + if (credits == 0) { + return CPL_RET_BUF_DONE; + } + BUG_ON(credits != 1); + BUG_ON(ep->mpa_skb == NULL); + kfree_skb(ep->mpa_skb); + ep->mpa_skb = NULL; + dst_confirm(ep->dst); + if (state_read(&ep->com) == MPA_REP_SENT) { + struct iwch_qp_attributes attrs; + int err; + + /* bind QP to EP and move to RTS */ + attrs.mpa_attr = ep->mpa_attr; + attrs.max_ird = ep->ird; + attrs.max_ord = ep->ord; + attrs.llp_stream_handle = ep; + attrs.next_state = IWCH_QP_STATE_RTS; + + /* bind QP and TID with INIT_WR */ + mask = IWCH_QP_ATTR_NEXT_STATE | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | + IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_MAX_IRD | + IWCH_QP_ATTR_MAX_ORD; + + err = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, mask, &attrs, 1); + if (err) { + abort_connection(ep, skb); + return 0; + } + state_set(&ep->com, FPDU_MODE); + established_upcall(ep); + ep->com.rpl_done = 1; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + } + return CPL_RET_BUF_DONE; +} + +static int abort_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + state_set(&ep->com, DEAD); + 
close_complete_upcall(ep); + t3c_remove_tid(ep->com.tdev, ctx, ep->hwtid); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + free_ep(&ep->com); + return CPL_RET_BUF_DONE; +} + +static int act_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_act_open_rpl *rpl = cplhdr(skb); + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + t3c_free_atid(ep->com.tdev, ep->atid); + connect_reply_upcall(ep, status2errno(rpl->status)); + free_ep(&ep->com); + return CPL_RET_BUF_DONE; +} + +static int listen_start(struct iwch_listen_ep *ep) +{ + struct sk_buff *skb; + struct cpl_pass_open_req *req; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "t3c_listen_start failed to alloc skb!\n"); + return -ENOMEM; + } + + req = (struct cpl_pass_open_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, ep->stid)); + req->local_port = ep->com.local_addr.sin_port; + req->local_ip = ep->com.local_addr.sin_addr.s_addr; + req->peer_port = 0; + req->peer_ip = 0; + req->peer_netmask = 0; + req->opt0h = htonl(F_DELACK | F_TCAM_BYPASS); + req->opt0l = htonl(V_RCV_BUFSIZ(rcv_win>>10)); + req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK)); + + skb->priority = 1; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_listen_ep *ep = ctx; + struct cpl_pass_open_rpl *rpl = cplhdr(skb); + + PDBG("%s (%s line %u) errno %d\n", __FUNCTION__, __FILE__, __LINE__, + status2errno(rpl->status)); + ep->com.rpl_err = status2errno(rpl->status); + ep->com.rpl_done = 1; + wake_up(&ep->com.waitq); + + return CPL_RET_BUF_DONE; +} + +static int listen_stop(struct 
iwch_listen_ep *ep) +{ + struct sk_buff *skb; + struct cpl_close_listserv_req *req; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); + return -ENOMEM; + } + req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid)); + skb->priority = 1; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb, + void *ctx) +{ + struct iwch_listen_ep *ep = ctx; + struct cpl_close_listserv_rpl *rpl = cplhdr(skb); + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + ep->com.rpl_err = status2errno(rpl->status); + ep->com.rpl_done = 1; + wake_up(&ep->com.waitq); + return CPL_RET_BUF_DONE; +} + +static void accept_cr(struct iwch_ep *ep, u32 peer_ip, struct sk_buff *skb) +{ + struct cpl_pass_accept_rpl *rpl; + unsigned int mtu_idx; + u32 opt0h, opt0l, opt2; + int wscale; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + BUG_ON(skb_cloned(skb)); + skb_trim(skb, sizeof(*rpl)); + skb_get(skb); + mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst)); + wscale = compute_wscale(rcv_win); + opt0h = V_NAGLE(0) | + V_KEEP_ALIVE(1) | + F_TCAM_BYPASS | + V_WND_SCALE(wscale) | + V_MSS_IDX(mtu_idx) | + V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx); + opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10); + opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0); + + rpl = cplhdr(skb); + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, ep->hwtid)); + rpl->peer_ip = peer_ip; + rpl->opt0h = htonl(opt0h); + rpl->opt0l_status = htonl(opt0l | CPL_PASS_OPEN_ACCEPT); + rpl->opt2 = htonl(opt2); + rpl->rsvd = rpl->opt2; /* 
workaround for HW bug */ + skb->priority = CPL_PRIORITY_SETUP; + l2t_send(ep->com.tdev, skb, ep->l2t); + + return; +} + +static void reject_cr(struct t3cdev *tdev, u32 hwtid, u32 peer_ip, + struct sk_buff *skb) +{ + struct cpl_pass_accept_rpl *rpl; + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + BUG_ON(skb_cloned(skb)); + skb_trim(skb, sizeof(*rpl)); + skb_get(skb); + rpl = cplhdr(skb); + skb->priority = CPL_PRIORITY_SETUP; + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, hwtid)); + rpl->peer_ip = peer_ip; + rpl->opt0h = htonl(F_TCAM_BYPASS); + rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT); + rpl->opt2 = 0; + rpl->rsvd = rpl->opt2; + tdev->send(tdev, skb); +} + +static int pass_accept_req(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *child_ep, *parent_ep = ctx; + struct cpl_pass_accept_req *req = cplhdr(skb); + unsigned int hwtid = GET_TID(req); + struct dst_entry *dst; + struct l2t_entry *l2t; + struct rtable *rt; + struct iff_mac tim; + + PDBG("%s (%s line %u) - hwtid %u\n", __FUNCTION__, __FILE__, __LINE__, + hwtid); + + if (state_read(&parent_ep->com) != LISTEN) { + printk(KERN_ERR "%s - listening ep not in LISTEN\n", + __FUNCTION__); + goto reject; + } + + /* + * Find the netdev for this connection request. 
+ */ + tim.mac_addr = req->dst_mac; + tim.vlan_tag = ntohs(req->vlan_tag); + if (tdev->ctl(tdev, GET_IFF_FROM_MAC, &tim) < 0 || !tim.dev) { + printk(KERN_ERR + "%s bad dst mac %02x %02x %02x %02x %02x %02x\n", + __FUNCTION__, + req->dst_mac[0], + req->dst_mac[1], + req->dst_mac[2], + req->dst_mac[3], + req->dst_mac[4], + req->dst_mac[5]); + goto reject; + } + +#if 0 + if (ip_route_input(skb, req->peer_ip, req->local_ip, + G_PASS_OPEN_TOS(ntohl(req->tos_tid)), tim.dev)) { + + printk(KERN_ERR MOD "%s - failed to find input route\n", + __FUNCTION__); + goto reject; + } + PDBG("%s (%s line %u) - hwtid %u\n", + __FUNCTION__, __FILE__, __LINE__, hwtid); + BUG_TRAP(!skb->dst); + dst_release(skb->dst); + skb->dst = NULL; +#endif + + /* Find output route */ + rt = find_route(tdev, + req->local_ip, + req->peer_ip, + req->local_port, + req->peer_port, G_PASS_OPEN_TOS(ntohl(req->tos_tid))); + if (!rt) { + printk(KERN_ERR MOD "%s - failed to find dst entry!\n", + __FUNCTION__); + goto reject; + } + dst = &rt->u.dst; + l2t = t3_l2t_get(tdev, dst->neighbour, dst->neighbour->dev->if_port); + if (!l2t) { + printk(KERN_ERR MOD "%s - failed to allocate l2t entry!\n", + __FUNCTION__); + dst_release(dst); + goto reject; + } + child_ep = alloc_ep(sizeof(*child_ep), GFP_KERNEL); + if (!child_ep) { + printk(KERN_ERR MOD "%s - failed to allocate ep entry!\n", + __FUNCTION__); + l2t_release(L2DATA(tdev), l2t); + dst_release(dst); + goto reject; + } + state_set(&child_ep->com, CONNECTING); + child_ep->com.tdev = tdev; + child_ep->com.cm_id = NULL; + child_ep->com.local_addr.sin_family = PF_INET; + child_ep->com.local_addr.sin_port = req->local_port; + child_ep->com.local_addr.sin_addr.s_addr = req->local_ip; + child_ep->com.remote_addr.sin_family = PF_INET; + child_ep->com.remote_addr.sin_port = req->peer_port; + child_ep->com.remote_addr.sin_addr.s_addr = req->peer_ip; + ep_atomic_inc(&parent_ep->com.refcnt); + child_ep->parent_ep = parent_ep; + child_ep->tos = 
G_PASS_OPEN_TOS(ntohl(req->tos_tid)); + child_ep->l2t = l2t; + child_ep->dst = dst; + child_ep->hwtid = hwtid; + init_timer(&child_ep->timer); + t3c_insert_tid(tdev, &t3c_client, child_ep, hwtid); + accept_cr(child_ep, req->peer_ip, skb); + goto out; +reject: + reject_cr(tdev, hwtid, req->peer_ip, skb); +out: + return CPL_RET_BUF_DONE; +} + +static int pass_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_pass_establish *req = cplhdr(skb); + + PDBG("%s (%s line %u) ep %p\n", __FUNCTION__, __FILE__, __LINE__, ep); + ep->snd_seq = ntohl(req->snd_isn); + + set_emss(ep, ntohs(req->tcp_opt)); + + dst_confirm(ep->dst); + state_set(&ep->com, MPA_REQ_WAIT); + start_ep_timer(ep); + + return CPL_RET_BUF_DONE; +} + +static int peer_close(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct iwch_qp_attributes attrs; + int ret; + int abort = 0; + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + dst_confirm(ep->dst); + switch (state_read(&ep->com)) { + case MPA_REQ_WAIT: + state_set(&ep->com, CLOSING); + break; + case MPA_REQ_SENT: + state_set(&ep->com, CLOSING); + connect_reply_upcall(ep, -ECONNRESET); + break; + case MPA_REQ_RCVD: + + /* + * We're gonna mark this puppy DEAD, but keep + * the reference on it until the ULP accepts or + * rejects the CR. 
+ */ + state_set(&ep->com, CLOSING); + ep_atomic_inc(&ep->com.refcnt); + break; + case MPA_REP_SENT: + state_set(&ep->com, CLOSING); + ep->com.rpl_done = 1; + ep->com.rpl_err = -ECONNRESET; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + break; + case FPDU_MODE: + state_set(&ep->com, CLOSING); + peer_close_upcall(ep); + attrs.next_state = IWCH_QP_STATE_CLOSING; + ret = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + if (ret) { + printk(KERN_ERR MOD "%s - qp <- closing err!\n", + __FUNCTION__); + abort = 1; + } + break; + case ABORTING: + goto out; + case CLOSING: + start_ep_timer(ep); + state_set(&ep->com, MORIBUND); + goto out; + case MORIBUND: + stop_ep_timer(ep); + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_IDLE; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + state_set(&ep->com, DEAD); + close_complete_upcall(ep); + t3c_remove_tid(ep->com.tdev, ctx, ep->hwtid); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + free_ep(&ep->com); + goto out; + case DEAD: + goto out; + default: + BUG_ON(1); + } + iwch_ep_disconnect(ep, abort, GFP_KERNEL); +out: + return CPL_RET_BUF_DONE; +} + +/* + * Returns whether an ABORT_REQ_RSS message is a negative advice. 
+ */ +static inline int is_neg_adv_abort(unsigned int status) +{ + return status == CPL_ERR_RTX_NEG_ADVICE || + status == CPL_ERR_PERSIST_NEG_ADVICE; +} + +static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct cpl_abort_req_rss *req = cplhdr(skb); + struct iwch_ep *ep = ctx; + struct cpl_abort_rpl *rpl; + struct sk_buff *rpl_skb; + struct iwch_qp_attributes attrs; + int ret; + int state; + + if (is_neg_adv_abort(req->status)) { + PDBG("%s neg_adv_abort ep %p hwtid %d\n", __FUNCTION__, ep, + ep->hwtid); + t3_l2t_send_event(ep->com.tdev, ep->l2t); + return CPL_RET_BUF_DONE; + } + + state = state_read(&ep->com); + PDBG("%s (%s line %u) ep %p state %u\n", __FUNCTION__, __FILE__, + __LINE__, ep, state); + switch (state) { + case CONNECTING: + break; + case MPA_REQ_WAIT: + break; + case MPA_REQ_SENT: + connect_reply_upcall(ep, -ECONNRESET); + break; + case MPA_REP_SENT: + ep->com.rpl_done = 1; + ep->com.rpl_err = -ECONNRESET; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + break; + case MPA_REQ_RCVD: + + /* + * We're gonna mark this puppy DEAD, but keep + * the reference on it until the ULP accepts or + * rejects the CR. 
+ */ + ep_atomic_inc(&ep->com.refcnt); + break; + case MORIBUND: + stop_ep_timer(ep); + case FPDU_MODE: + case CLOSING: + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_ERROR; + ret = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + if (ret) { + printk(KERN_ERR MOD + "%s - qp <- error failed!\n", + __FUNCTION__); + } + } + peer_abort_upcall(ep); + break; + case ABORTING: + break; + case DEAD: + PDBG("%s PEER_ABORT IN DEAD STATE!!!!\n", __FUNCTION__); + return CPL_RET_BUF_DONE; + default: + BUG_ON(1); + break; + } + dst_confirm(ep->dst); + + rpl_skb = get_skb(skb, sizeof(*rpl), GFP_KERNEL); + if (!rpl_skb) { + printk(KERN_ERR MOD "%s - cannot allocate skb!\n", + __FUNCTION__); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + free_ep(&ep->com); + return CPL_RET_BUF_DONE; + } + rpl_skb->priority = CPL_PRIORITY_DATA; + rpl = (struct cpl_abort_rpl *) skb_put(rpl_skb, sizeof(*rpl)); + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_TOE_HOST_ABORT_CON_RPL)); + rpl->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_ABORT_RPL, ep->hwtid)); + rpl->cmd = CPL_ABORT_NO_RST; + ep->com.tdev->send(ep->com.tdev, rpl_skb); + if (state != ABORTING) { + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + t3c_remove_tid(ep->com.tdev, ctx, ep->hwtid); + state_set(&ep->com, DEAD); + free_ep(&ep->com); + } + return CPL_RET_BUF_DONE; +} + +static int close_con_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct iwch_qp_attributes attrs; + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + BUG_ON(!ep); + + /* The cm_id may be null if we failed to connect */ + switch (state_read(&ep->com)) { + case CLOSING: + start_ep_timer(ep); + state_set(&ep->com, MORIBUND); + break; + case MORIBUND: + stop_ep_timer(ep); + if ((ep->com.cm_id) && (ep->com.qp)) { + attrs.next_state = IWCH_QP_STATE_IDLE; + 
iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, + IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + state_set(&ep->com, DEAD); + close_complete_upcall(ep); + t3c_remove_tid(ep->com.tdev, ctx, ep->hwtid); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + free_ep(&ep->com); + break; + case DEAD: + default: + BUG_ON(1); + break; + } + + return CPL_RET_BUF_DONE; +} + +/* + * T3A does 3 things when a TERM is received: + * 1) send up a CPL_RDMA_TERMINATE message with the TERM packet + * 2) generate an async event on the QP with the TERMINATE opcode + * 3) post a TERMINATE opcode cqe into the associated CQ. + * + * For (1), we save the message in the qp for later consumption by the + * consumer. + * For (2), we move the QP into TERMINATE, post a QP event and disconnect. + * For (3), we toss the CQE in cxio_poll_cq(). + * + * terminate() handles case (1)... + */ +static int terminate(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + skb_pull(skb, sizeof(struct cpl_rdma_terminate)); + PDBG("%s saving %d bytes of term msg\n", __FUNCTION__, skb->len); + memcpy(ep->com.qp->attr.terminate_buffer, skb->data, skb->len); + ep->com.qp->attr.terminate_msg_len = skb->len; + ep->com.qp->attr.is_terminate_local = 0; + return CPL_RET_BUF_DONE; +} + +static void ep_timeout(unsigned long arg) +{ + struct iwch_ep *ep = (struct iwch_ep *)arg; + struct iwch_qp_attributes attrs; + + PDBG("%s enter (%s line %u) ep %p hwtid %d\n", __FUNCTION__, __FILE__, + __LINE__, ep, ep->hwtid); + if (state_comp_exch(&ep->com, MPA_REQ_SENT, CLOSING)) { + struct sk_buff *skb; + + connect_reply_upcall(ep, -ETIMEDOUT); + skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC); + if (skb) { + abort_connection(ep, skb); + } + } + if (state_comp_exch(&ep->com, MPA_REQ_WAIT, CLOSING)) { + struct sk_buff *skb; + + skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC); + if (skb) { + abort_connection(ep, 
skb); + } + } + if (state_comp_exch(&ep->com, MORIBUND, ABORTING)) { + struct sk_buff *skb; + + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC); + if (skb) { + abort_connection(ep, skb); + } + } + free_ep(&ep->com); +} + +int iwch_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len) +{ + int err; + struct iwch_ep *ep = to_ep(cm_id); + PDBG("%s enter (%s line %u) ep %p hwtid %d\n", __FUNCTION__, __FILE__, + __LINE__, ep, ep->hwtid); + + if (state_read(&ep->com) == DEAD) { + free_ep(&ep->com); + return -ECONNRESET; + } + BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD); + state_set(&ep->com, CLOSING); + if (mpa_rev == 0) { + abort_connection(ep, NULL); + } else { + err = send_mpa_reject(ep, pdata, pdata_len); + err = send_halfclose(ep, GFP_KERNEL); + } + return 0; +} + +int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) +{ + int err; + struct iwch_ep *ep = to_ep(cm_id); + struct iwch_dev *h = to_iwch_dev(cm_id->device); + + PDBG("%s enter (%s line %u) ep %p hwtid %d\n", __FUNCTION__, __FILE__, + __LINE__, ep, ep->hwtid); + + if (state_read(&ep->com) == DEAD) { + free_ep(&ep->com); + return -ECONNRESET; + } + BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD); + + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->com.qp = get_qhp(h, conn_param->qpn); + BUG_ON(!ep->com.qp); + + /* + * Quiesce the TID here. The uP unquiesces the TID as + * part of the rdma_init operation. 
+ */ + err = iwch_quiesce_tid(ep); + if (err) { + abort_connection(ep, NULL); + return err; + } + ep->com.rpl_done = 0; + ep->com.rpl_err = 0; + ep->ird = conn_param->ird; + ep->ord = conn_param->ord; + PDBG("%s %d ird %d ord %d\n", __FUNCTION__, __LINE__, ep->ird, ep->ord); + ep_atomic_inc(&ep->com.refcnt); + err = send_mpa_reply(ep, conn_param->private_data, + conn_param->private_data_len); + if (err) { + free_ep(&ep->com); + abort_connection(ep, NULL); + return err; + } + + /* wait until the MPA is transmitted. */ + PDBG("sleeping on ep %p\n", ep); + wait_event(ep->com.waitq, ep->com.rpl_done); + PDBG("awakened on ep %p\n", ep); + + err = ep->com.rpl_err; + if (err) { + ep->com.cm_id = NULL; + ep->com.qp = NULL; + cm_id->rem_ref(cm_id); + } + free_ep(&ep->com); + return err; +} + +int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) +{ + int err = 0; + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_ep *ep; + struct rtable *rt; + + ep = alloc_ep(sizeof(*ep), GFP_KERNEL); + if (!ep) { + printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__); + err = -ENOMEM; + goto out; + } + PDBG("%s ep %p\n", __FUNCTION__, ep); + init_timer(&ep->timer); + ep->plen = conn_param->private_data_len; + if (ep->plen) { + memcpy(ep->mpa_pkt + sizeof(struct mpa_message), + conn_param->private_data, ep->plen); + } + ep->ird = conn_param->ird; + ep->ord = conn_param->ord; + ep->com.tdev = h->rdev.t3cdev_p; + + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->com.qp = get_qhp(h, conn_param->qpn); + BUG_ON(!ep->com.qp); + + /* + * XXX. + */ + if (get_port(&cm_id->local_addr.sin_port)) { + err = -EADDRINUSE; + goto fail1; + } + + /* + * Allocate an active TID to initiate a TCP connection. + */ + ep->atid = t3c_alloc_atid(h->rdev.t3cdev_p, &t3c_client, ep); + if (ep->atid == -1) { + printk(KERN_ERR MOD "%s - cannot alloc atid.\n", __FUNCTION__); + err = -ENOMEM; + goto fail2; + } + + /* find a route */ + /* XXX Shouldn't need this. 
IWCM should pass down dst entry ptr */ + rt = find_route(h->rdev.t3cdev_p, + cm_id->local_addr.sin_addr.s_addr, + cm_id->remote_addr.sin_addr.s_addr, + cm_id->local_addr.sin_port, + cm_id->remote_addr.sin_port, IPTOS_LOWDELAY); + if (!rt) { + printk(KERN_ERR MOD "%s - cannot find route.\n", __FUNCTION__); + err = -EHOSTUNREACH; + goto fail3; + } + ep->dst = &rt->u.dst; + + /* get a l2t entry */ + ep->l2t = t3_l2t_get(ep->com.tdev, + ep->dst->neighbour, + ep->dst->neighbour->dev->if_port); + if (!ep->l2t) { + printk(KERN_ERR MOD "%s - cannot alloc l2e.\n", __FUNCTION__); + err = -ENOMEM; + goto fail4; + } + + state_set(&ep->com, CONNECTING); + ep->tos = IPTOS_LOWDELAY; /* XXX */ + ep->com.local_addr = cm_id->local_addr; + ep->com.remote_addr = cm_id->remote_addr; + + /* send connect request to rnic */ + err = send_connect(ep); + if (!err) { + goto out; + } + + l2t_release(L2DATA(h->rdev.t3cdev_p), ep->l2t); +fail4: + dst_release(ep->dst); +fail3: + t3c_free_atid(ep->com.tdev, ep->atid); +fail2: + free_port(cm_id->local_addr.sin_port); +fail1: + free_ep(&ep->com); +out: + return err; +} + +int iwch_create_listen(struct iw_cm_id *cm_id, int backlog) +{ + int err = 0; + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_listen_ep *ep; + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + might_sleep(); + + if (get_port(&cm_id->local_addr.sin_port)) { + err = -EADDRINUSE; + goto out; + } + ep = alloc_ep(sizeof(*ep), GFP_KERNEL); + if (!ep) { + printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__); + err = -ENOMEM; + goto fail1; + } + ep->com.tdev = h->rdev.t3cdev_p; + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->backlog = backlog; + ep->com.local_addr = cm_id->local_addr; + + /* + * Allocate a server TID. 
+ */ + ep->stid = t3c_alloc_stid(h->rdev.t3cdev_p, &t3c_client, ep); + if (ep->stid == -1) { + printk(KERN_ERR MOD "%s - cannot alloc stid.\n", __FUNCTION__); + err = -ENOMEM; + goto fail2; + } + + state_set(&ep->com, LISTEN); + err = listen_start(ep); + if (err) { + goto fail3; + } + + /* wait for pass_open_rpl */ + wait_event(ep->com.waitq, ep->com.rpl_done); + err = ep->com.rpl_err; + if (!err) { + cm_id->provider_data = ep; + goto out; + } +fail3: + t3c_free_stid(ep->com.tdev, ep->stid); +fail2: + free_ep(&ep->com); +fail1: + free_port(cm_id->local_addr.sin_port); +out: + return err; +} + +int iwch_destroy_listen(struct iw_cm_id *cm_id) +{ + int err; + struct iwch_listen_ep *ep = to_listen_ep(cm_id); + + PDBG("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + might_sleep(); + state_set(&ep->com, DEAD); + ep->com.rpl_done = 0; + ep->com.rpl_err = 0; + err = listen_stop(ep); + wait_event(ep->com.waitq, ep->com.rpl_done); + t3c_free_stid(ep->com.tdev, ep->stid); + err = ep->com.rpl_err; + cm_id->rem_ref(cm_id); + free_ep(&ep->com); + return err; +} + +int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp) +{ + int ret = 0; + int state; + + + state = state_read(&ep->com); + PDBG("%s enter (%s line %u) ep %p state %s, abrupt %d\n", + __FUNCTION__, __FILE__, __LINE__, ep, states[state], abrupt); + if (state == DEAD) { + PDBG("%s already dead ep %p\n", __FUNCTION__, ep); + return 0; + } + if (abrupt) { + if (state != ABORTING) { + state_set(&ep->com, ABORTING); + ret = send_abort(ep, NULL, gfp); + } + } else { + + if (state != CLOSING) { + state_set(&ep->com, CLOSING); + } else { + start_ep_timer(ep); + state_set(&ep->com, MORIBUND); + } + + ret = send_halfclose(ep, gfp); + } + return ret; +} + +int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new, + struct l2t_entry *l2t) +{ + struct iwch_ep *ep = ctx; + + if (ep->dst != old) + return 0; + + PDBG("%s ep %p redirect to dst %p l2t %p\n", __FUNCTION__, ep, new, l2t); + 
dst_hold(new); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + ep->l2t = l2t; + dst_release(old); + ep->dst = new; + return 1; +} + +/* + * All the CM events are handled on a work queue to have a safe context. + */ +static int sched(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep_common *epc = ctx; + + ep_atomic_inc(&epc->refcnt); + + /* + * Save ctx and tdev in the skb->cb area. + */ + *((void **) skb->cb) = ctx; + *((struct t3cdev **) (skb->cb + sizeof(void *))) = tdev; + + /* + * Queue the skb and schedule the worker thread. + */ + skb_queue_tail(&rxq, skb); + queue_work(workq, &skb_work); + return 0; +} + +int __init iwch_cm_init(void) +{ + skb_queue_head_init(&rxq); + + workq = create_singlethread_workqueue("iw_cxgb3"); + if (!workq) + return -ENOMEM; + + /* + * All upcalls from the T3 Core go to sched() to + * schedule the processing on a work queue. + */ + t3c_handlers[CPL_ACT_ESTABLISH] = sched; + t3c_handlers[CPL_ACT_OPEN_RPL] = sched; + t3c_handlers[CPL_RX_DATA] = sched; + t3c_handlers[CPL_TX_DMA_ACK] = sched; + t3c_handlers[CPL_ABORT_RPL_RSS] = sched; + t3c_handlers[CPL_ABORT_RPL] = sched; + t3c_handlers[CPL_PASS_OPEN_RPL] = sched; + t3c_handlers[CPL_CLOSE_LISTSRV_RPL] = sched; + t3c_handlers[CPL_PASS_ACCEPT_REQ] = sched; + t3c_handlers[CPL_PASS_ESTABLISH] = sched; + t3c_handlers[CPL_PEER_CLOSE] = sched; + t3c_handlers[CPL_CLOSE_CON_RPL] = sched; + t3c_handlers[CPL_ABORT_REQ_RSS] = sched; + t3c_handlers[CPL_RDMA_TERMINATE] = sched; + + /* + * These are the real handlers that are called from a + * work queue. 
+ */ + work_handlers[CPL_ACT_ESTABLISH] = act_establish; + work_handlers[CPL_ACT_OPEN_RPL] = act_open_rpl; + work_handlers[CPL_RX_DATA] = rx_data; + work_handlers[CPL_TX_DMA_ACK] = tx_ack; + work_handlers[CPL_ABORT_RPL_RSS] = abort_rpl; + work_handlers[CPL_ABORT_RPL] = abort_rpl; + work_handlers[CPL_PASS_OPEN_RPL] = pass_open_rpl; + work_handlers[CPL_CLOSE_LISTSRV_RPL] = close_listsrv_rpl; + work_handlers[CPL_PASS_ACCEPT_REQ] = pass_accept_req; + work_handlers[CPL_PASS_ESTABLISH] = pass_establish; + work_handlers[CPL_PEER_CLOSE] = peer_close; + work_handlers[CPL_ABORT_REQ_RSS] = peer_abort; + work_handlers[CPL_CLOSE_CON_RPL] = close_con_rpl; + work_handlers[CPL_RDMA_TERMINATE] = terminate; + return 0; +} + +void __exit iwch_cm_term(void) +{ + flush_workqueue(workq); + destroy_workqueue(workq); +} diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h new file mode 100644 index 0000000..0e26352 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h @@ -0,0 +1,232 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _IWCH_CM_H_ +#define _IWCH_CM_H_ + +#include +#include +#include +#include +#include "iwch_provider.h" +#include + +#include + +#define MPA_KEY_REQ "MPA ID Req Frame" +#define MPA_KEY_REP "MPA ID Rep Frame" + +#define MPA_MAX_PRIVATE_DATA 256 +#define MPA_REV 0 /* XXX - amso1100 uses rev 0 ! */ +#define MPA_REJECT 0x20 +#define MPA_CRC 0x40 +#define MPA_MARKERS 0x80 +#define MPA_FLAGS_MASK 0xE0 + +#define free_ep(A) { \ + PDBG("%s %d: Calling __free_ep\n",__FUNCTION__, __LINE__); \ + __free_ep(A); \ +} + +#define ep_atomic_inc(A) { \ + PDBG("%s enter (%s line %u) A %p, refcnt %d\n", \ + __FUNCTION__, __FILE__, \ + __LINE__, A, \ + atomic_read(A)); \ + atomic_inc(A); \ +} + +struct mpa_message { + u8 key[16]; + u8 flags; + u8 revision; + u16 private_data_size; + u8 private_data[0]; +}; + +struct terminate_message { + u8 layer_etype; + u8 ecode; + u16 hdrct_rsvd; + u8 len_hdrs[0]; +}; + +#define TERM_MAX_LENGTH (sizeof(struct terminate_message) + 2 + 18 + 28) + +enum iwch_term_layers { + LAYER_RDMAP = 0x00, + LAYER_DDP = 0x10, + LAYER_MPA = 0x20, +}; + +enum iwch_rdma_etypes { + RDMAP_LOCAL_CATA = 0x00, + RDMAP_REMOTE_PROT = 0x01, + RDMAP_REMOTE_OP = 0x02, +}; + +enum iwch_rdma_ecodes { + RDMAP_INV_STAG = 0x00, + RDMAP_BASE_BOUNDS = 0x01, + RDMAP_ACC_VIOL = 0x02, + RDMAP_STAG_NOT_ASSOC = 0x03, + RDMAP_TO_WRAP = 0x04, + RDMAP_INV_VERS = 0x05, + RDMAP_INV_OPCODE = 0x06, + RDMAP_STREAM_CATA = 0x07, + RDMAP_GLOBAL_CATA = 0x08, + RDMAP_CANT_INV_STAG = 0x09, + 
RDMAP_UNSPECIFIED = 0xff +}; + +enum iwch_ddp_etypes { + DDP_LOCAL_CATA = 0x00, + DDP_TAGGED_ERR = 0x01, + DDP_UNTAGGED_ERR = 0x02, + DDP_LLP = 0x03 +}; + +enum iwch_ddp_tagged_ecodes { + DDPT_INV_STAG = 0x00, + DDPT_BASE_BOUNDS = 0x01, + DDPT_STAG_NOT_ASSOC = 0x02, + DDPT_TO_WRAP = 0x03, + DDPT_INV_VERS = 0x04, +}; + +enum iwch_ddp_utagged_ecodes { + DDPU_INV_QN = 0x01, + DDPU_INV_MSN_NOBUF = 0x02, + DDPU_INV_MSN_RANGE = 0x03, + DDPU_INV_MO = 0x04, + DDPU_MSG_TOOBIG = 0x05, + DDPU_INV_VERS = 0x06 +}; + +enum iwch_mpa_ecodes { + MPA_CRC_ERR = 0x02, + MPA_MARKER_ERR = 0x03 +}; + + +enum iwch_ep_state { + IDLE = 0, + LISTEN, + CONNECTING, + MPA_REQ_WAIT, + MPA_REQ_SENT, + MPA_REQ_RCVD, + MPA_REP_SENT, + FPDU_MODE, + ABORTING, + CLOSING, + MORIBUND, + DEAD, +}; + +struct iwch_ep_common { + struct iw_cm_id *cm_id; + struct iwch_qp *qp; + struct t3cdev *tdev; + enum iwch_ep_state state; + atomic_t refcnt; + spinlock_t lock; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + wait_queue_head_t waitq; + int rpl_done; + int rpl_err; +}; + +struct iwch_listen_ep { + struct iwch_ep_common com; + unsigned int stid; + int backlog; +}; + +struct iwch_ep { + struct iwch_ep_common com; + struct iwch_ep *parent_ep; + struct timer_list timer; + unsigned int atid; + u32 hwtid; + u32 snd_seq; + struct l2t_entry *l2t; + struct dst_entry *dst; + struct sk_buff *mpa_skb; + struct iwch_mpa_attributes mpa_attr; + unsigned int mpa_pkt_len; + u8 mpa_pkt[sizeof(struct mpa_message) + MPA_MAX_PRIVATE_DATA]; + u8 tos; + u16 emss; + u16 plen; + u32 ird; + u32 ord; +}; + +static inline struct iwch_ep *to_ep(struct iw_cm_id *cm_id) +{ + return (struct iwch_ep *)cm_id->provider_data; +} + +static inline struct iwch_listen_ep *to_listen_ep(struct iw_cm_id *cm_id) +{ + return (struct iwch_listen_ep *)cm_id->provider_data; +} + +static inline int compute_wscale(int win) +{ + int wscale = 0; + + while (wscale < 14 && (65535 << wscale) < win) + wscale++; + return wscale; +} + +#endif /* _IWCH_CM_H_ */ References: <20060623142924.32410.7623.stgit@stevo-desktop>
Message-ID: <20060623143015.32410.11151.stgit@stevo-desktop> This patch contains device discovery and registration for the cxgb3 "core" module. The cxgb3 core module provides TCP connection management services. This module is needed to support multiple ULPs using the cxgb3 device for managing TCP connections. The OpenIB driver uses it to allocate and set up iWARP LLP connections and to pass data in streaming mode (for MPA negotiation) before going into RDMA mode and associating the LLP stream with an iWARP QP. It is separated from the LLD/NETDEV driver because it is not needed for a dumb-NIC-only installation. There will be other ULPs that use this interface. This patch also contains the first-level event handler functions that process L2/L3 events obtained via the Network Event Notifier mechanism. --- drivers/infiniband/hw/cxgb3/t3c/defs.h | 100 +++++ drivers/infiniband/hw/cxgb3/t3c/t3cdev.c | 570 ++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/t3c/tcb.h | 378 ++++++++++++++++++++ 3 files changed, 1048 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/t3c/defs.h b/drivers/infiniband/hw/cxgb3/t3c/defs.h new file mode 100644 index 0000000..3f9b9d3 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/t3c/defs.h @@ -0,0 +1,100 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _CHELSIO_DEFS_H +#define _CHELSIO_DEFS_H + +#include +#include + +#include + +#include "t3c.h" + +#define VALIDATE_TID 1 + +void *t3_alloc_mem(unsigned long size); +void t3_free_mem(void *addr); +void t3c_neigh_update(struct neighbour *neigh, int flags); +void t3c_redirect(struct dst_entry *old, struct dst_entry *new); + +/* + * Map an ATID or STID to their entries in the corresponding TID tables. + */ +static inline union active_open_entry *atid2entry(const struct tid_info *t, + unsigned int atid) +{ + return &t->atid_tab[atid - t->atid_base]; +} + + +static inline union listen_entry *stid2entry(const struct tid_info *t, + unsigned int stid) +{ + return &t->stid_tab[stid - t->stid_base]; +} + +/* + * Find the socket corresponding to a TID. + */ +static inline struct t3c_tid_entry *lookup_tid(const struct tid_info *t, + unsigned int tid) +{ + return tid < t->ntids ? &(t->tid_tab[tid]) : NULL; +} + +/* + * Find the socket corresponding to a server TID. + */ +static inline struct t3c_tid_entry *lookup_stid(const struct tid_info *t, + unsigned int tid) +{ + if (tid < t->stid_base || tid >= t->stid_base + t->nstids) + return NULL; + return &(stid2entry(t, tid)->t3c_tid); +} + +/* + * Find the socket corresponding to an active-open TID. 
+ */ +static inline struct t3c_tid_entry *lookup_atid(const struct tid_info *t, + unsigned int tid) +{ + if (tid < t->atid_base || tid >= t->atid_base + t->natids) + return NULL; + return &(atid2entry(t, tid)->t3c_tid); +} + +int process_rx(struct t3cdev *dev, struct sk_buff **skbs, int n); +int attach_t3cdev(struct t3cdev *dev); +void detach_t3cdev(struct t3cdev *dev); +#endif diff --git a/drivers/infiniband/hw/cxgb3/t3c/t3cdev.c b/drivers/infiniband/hw/cxgb3/t3c/t3cdev.c new file mode 100644 index 0000000..bec4d45 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/t3c/t3cdev.c @@ -0,0 +1,570 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include "l2t.h" +#include "defs.h" +#include "t3cdev.h" +#include "firmware_exports.h" + +DEFINE_MUTEX(t3cdev_db_lock); +LIST_HEAD(t3cdev_list); + +static const unsigned int MAX_ATIDS = 64 * 1024; +static const unsigned int ATID_BASE = 0x100000; + +#ifdef CONFIG_PROC_FS +#include + +static struct proc_dir_entry *t3cdev_proc_root; + +static int devices_read_proc(char *buf, char **start, off_t offset, + int length, int *eof, void *data) +{ + int len; + struct t3cdev *dev; + struct net_device *ndev; + + len = sprintf(buf, "Device Interfaces\n"); + + mutex_lock(&t3cdev_db_lock); + list_for_each_entry(dev, &t3cdev_list, t3c_list) { + len += sprintf(buf + len, "%-16s", dev->name); + read_lock(&dev_base_lock); + for (ndev = dev_base; ndev; ndev = ndev->next) { + if (T3CDEV(ndev) == dev) + len += sprintf(buf + len, " %s", ndev->name); + } + read_unlock(&dev_base_lock); + len += sprintf(buf + len, "\n"); + if (len >= length) + break; + } + mutex_unlock(&t3cdev_db_lock); + + if (len > length) + len = length; + *eof = 1; + return len; +} + +static void t3c_proc_cleanup(void) +{ + remove_proc_entry("devices", t3cdev_proc_root); + remove_proc_entry("net/cxgb3c", NULL); + t3cdev_proc_root = NULL; +} + +static struct proc_dir_entry *create_t3c_proc_dir(const char *name) +{ + struct proc_dir_entry *d; + + if (!t3cdev_proc_root) + return NULL; + + d = proc_mkdir(name, t3cdev_proc_root); + if (d) + d->owner = THIS_MODULE; + return d; +} + +static void delete_t3c_proc_dir(struct t3cdev *dev) +{ + if (dev->proc_dir) { + remove_proc_entry(dev->name, t3cdev_proc_root); + dev->proc_dir = NULL; + } 
+} + +static int __init t3c_proc_init(void) +{ + struct proc_dir_entry *d; + + t3cdev_proc_root = proc_mkdir("net/cxgb3c", NULL); + if (!t3cdev_proc_root) + return -ENOMEM; + t3cdev_proc_root->owner = THIS_MODULE; + + d = create_proc_read_entry("devices", 0, t3cdev_proc_root, + devices_read_proc, NULL); + if (!d) + goto cleanup; + d->owner = THIS_MODULE; + return 0; + +cleanup: + t3c_proc_cleanup(); + return -ENOMEM; +} +#else +#define t3c_proc_init() 0 +#define create_t3c_proc_dir(name) NULL +#define delete_t3c_proc_dir(dev) +#endif /* CONFIG_PROC_FS */ + +/* + * Unregister a T3C device and remove it from the device list. + */ +void unregister_t3cdev(struct t3cdev *dev) +{ + mutex_lock(&t3cdev_db_lock); + list_del(&dev->t3c_list); + delete_t3c_proc_dir(dev); + mutex_unlock(&t3cdev_db_lock); + return; +} + +/* + * Register a T3C device and try to attach an appropriate TCP offload module + * to it. 'name' is a template that may contain at most one %d format + * specifier. + */ +void register_t3cdev(struct t3cdev *dev, const char *name) +{ + static int unit; + + mutex_lock(&t3cdev_db_lock); + snprintf(dev->name, sizeof(dev->name), name, unit++); + dev->proc_dir = create_t3c_proc_dir(dev->name); + list_add_tail(&dev->t3c_list, &t3cdev_list); + mutex_unlock(&t3cdev_db_lock); + return; +} + +/* + * Sends an sk_buff to a T3C driver after dealing with any active network taps.
+ */ +int t3c_send(struct t3cdev *dev, struct sk_buff *skb) +{ + int r; + + local_bh_disable(); + r = dev->send(dev, skb); + local_bh_enable(); + return r; +} +EXPORT_SYMBOL(t3c_send); + +void t3c_neigh_update(struct neighbour *neigh, int flags) +{ + struct net_device *dev = neigh->dev; + + if (dev && (dev->features & NETIF_F_TCPIP_OFFLOAD)) { + struct t3cdev *tdev = T3CDEV(dev); + + BUG_ON(!tdev); + t3_l2t_update(tdev, neigh, flags, neigh->dev); + } +} + +static void set_l2t_ix(struct t3cdev *tdev, u32 tid, struct l2t_entry *e) +{ + struct sk_buff *skb; + struct cpl_set_tcb_field *req; + + skb = alloc_skb(sizeof(*req), GFP_ATOMIC); + if (!skb) { + printk(KERN_ERR "%s: cannot allocate skb!\n", __FUNCTION__); + return; + } + skb->priority = CPL_PRIORITY_CONTROL; + req = (struct cpl_set_tcb_field *)skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, tid)); + req->reply = 0; + req->cpu_idx = 0; + req->word = htons(W_TCB_L2T_IX); + req->mask = cpu_to_be64(V_TCB_L2T_IX(M_TCB_L2T_IX)); + req->val = cpu_to_be64(V_TCB_L2T_IX(e->idx)); + tdev->send(tdev, skb); +} + +void t3c_redirect(struct dst_entry *old, struct dst_entry *new) +{ + struct net_device *olddev, *newdev; + struct tid_info *ti; + struct t3cdev *tdev; + u32 tid; + int update_tcb; + struct l2t_entry *e; + struct t3c_tid_entry *te; + + olddev = old->neighbour->dev; + newdev = new->neighbour->dev; + if (!(olddev->features & NETIF_F_TCPIP_OFFLOAD)) + return; + if (!(newdev->features & NETIF_F_TCPIP_OFFLOAD)) { + printk(KERN_WARNING "%s: Redirect to non-offload " + "device ignored.\n", __FUNCTION__); + return; + } + tdev = T3CDEV(olddev); + BUG_ON(!tdev); + if (tdev != T3CDEV(newdev)) { + printk(KERN_WARNING "%s: Redirect to different " + "offload device ignored.\n", __FUNCTION__); + return; + } + + /* Add new L2T entry */ + e = t3_l2t_get(tdev, new->neighbour, new->neighbour->dev->if_port); + if (!e) { + printk(KERN_ERR "%s:
couldn't allocate new l2t entry!\n", + __FUNCTION__); + return; + } + + /* Walk tid table and notify clients of dst change. */ + ti = &(T3C_DATA(tdev))->tid_maps; + for (tid=0; tid < ti->ntids; tid++) { + te = lookup_tid(ti, tid); + BUG_ON(!te); + if (te->ctx && te->client && te->client->redirect) { + update_tcb = te->client->redirect(te->ctx, old, new, e); + if (update_tcb) { + l2t_hold(L2DATA(tdev), e); + set_l2t_ix(tdev, tid, e); + } + } + } + l2t_release(L2DATA(tdev), e); +} + +/* + * Allocate a chunk of memory using kmalloc or, if that fails, vmalloc. + * The allocated memory is cleared. + */ +void *t3_alloc_mem(unsigned long size) +{ + void *p = kmalloc(size, GFP_KERNEL); + + if (!p) + p = vmalloc(size); + if (p) + memset(p, 0, size); + return p; +} + +/* + * Free memory allocated through t3_alloc_mem(). + */ +void t3_free_mem(void *addr) +{ + unsigned long p = (unsigned long) addr; + + if (p >= VMALLOC_START && p < VMALLOC_END) + vfree(addr); + else + kfree(addr); +} + +/* + * Allocate and initialize the TID tables. Returns 0 on success. + */ +static int init_tid_tabs(struct tid_info *t, unsigned int ntids, + unsigned int natids, unsigned int nstids, + unsigned int atid_base, unsigned int stid_base) +{ + unsigned long size = ntids * sizeof(*t->tid_tab) + + natids * sizeof(*t->atid_tab) + nstids * sizeof(*t->stid_tab); + + t->tid_tab = t3_alloc_mem(size); + if (!t->tid_tab) + return -ENOMEM; + + t->stid_tab = (union listen_entry *)&t->tid_tab[ntids]; + t->atid_tab = (union active_open_entry *)&t->stid_tab[nstids]; + t->ntids = ntids; + t->nstids = nstids; + t->stid_base = stid_base; + t->sfree = NULL; + t->natids = natids; + t->atid_base = atid_base; + t->afree = NULL; + t->stids_in_use = t->atids_in_use = 0; + atomic_set(&t->tids_in_use, 0); + spin_lock_init(&t->stid_lock); + spin_lock_init(&t->atid_lock); + + /* + * Setup the free lists for stid_tab and atid_tab. 
+ */ + if (nstids) { + while (--nstids) + t->stid_tab[nstids - 1].next = &t->stid_tab[nstids]; + t->sfree = t->stid_tab; + } + if (natids) { + while (--natids) + t->atid_tab[natids - 1].next = &t->atid_tab[natids]; + t->afree = t->atid_tab; + } + return 0; +} + +static void free_tid_maps(struct tid_info *t) +{ + t3_free_mem(t->tid_tab); +} + +/* + * Process a received packet with an unknown/unexpected CPL opcode. + */ +static int do_bad_cpl(struct t3cdev *dev, struct sk_buff *skb) +{ + printk(KERN_ERR "%s: received bad CPL command 0x%x\n", dev->name, + *skb->data); + return CPL_RET_BUF_DONE | CPL_RET_BAD_MSG; +} + +/* + * Handlers for each CPL opcode + */ +static cpl_handler_func cpl_handlers[NUM_CPL_CMDS]; + +/* + * Add a new handler to the CPL dispatch table. A NULL handler may be supplied + * to unregister an existing handler. + */ +void t3_register_cpl_handler(unsigned int opcode, cpl_handler_func h) +{ + if (opcode < NUM_CPL_CMDS) + cpl_handlers[opcode] = h ? h : do_bad_cpl; + else + printk(KERN_ERR "T3C: handler registration for " + "opcode %x failed\n", opcode); +} +EXPORT_SYMBOL(t3_register_cpl_handler); + +/* + * T3CDEV's receive method. 
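The CPL handler table above (every slot initialized to the catch-all `do_bad_cpl`, with `t3_register_cpl_handler()` swapping real handlers in and NULL restoring the catch-all) can be sketched as standalone user-space C. All names below (`bad_op`, `register_handler`, `dispatch`, `echo_op`) are hypothetical stand-ins for the kernel code; sk_buffs and CPL structs are reduced to a bare opcode.

```c
#include <stdio.h>

#define NUM_OPS 16

typedef int (*op_handler)(unsigned int opcode);

/* Catch-all for unregistered opcodes, in the spirit of do_bad_cpl(). */
static int bad_op(unsigned int opcode)
{
	fprintf(stderr, "received bad opcode 0x%x\n", opcode);
	return -1;
}

static op_handler handlers[NUM_OPS];

/* Swap a handler in or out; a NULL handler restores the catch-all,
 * mirroring t3_register_cpl_handler(). */
static void register_handler(unsigned int opcode, op_handler h)
{
	if (opcode < NUM_OPS)
		handlers[opcode] = h ? h : bad_op;
}

/* Dispatch one message; never-registered slots fall back to the catch-all. */
static int dispatch(unsigned int opcode)
{
	op_handler h = (opcode < NUM_OPS && handlers[opcode]) ? handlers[opcode]
							      : bad_op;
	return h(opcode);
}

static int echo_op(unsigned int opcode)
{
	return (int)opcode;	/* trivially "handle" the message */
}
```

The point of the table is that the hot receive path (`process_rx` below) does a single indexed call with no opcode switch, and unknown opcodes are still funneled through one well-defined error path.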
+ */ +int process_rx(struct t3cdev *dev, struct sk_buff **skbs, int n) +{ + while (n--) { + struct sk_buff *skb = *skbs++; + unsigned int opcode = G_OPCODE(ntohl(skb->csum)); + int ret = cpl_handlers[opcode] (dev, skb); + +#if VALIDATE_TID + if (ret & CPL_RET_UNKNOWN_TID) { + union opcode_tid *p = cplhdr(skb); + + printk(KERN_ERR "%s: CPL message (opcode %u) had " + "unknown TID %u\n", dev->name, opcode, + G_TID(ntohl(p->opcode_tid))); + } +#endif + if (ret & CPL_RET_BUF_DONE) + kfree_skb(skb); + } + return 0; +} + +#ifdef CONFIG_PROC_FS +#include + +static int t3cdev_info_read_proc(char *buf, char **start, off_t offset, + int length, int *eof, void *data) +{ + struct t3c_data *d = data; + struct tid_info *t = &d->tid_maps; + int len; + + len = sprintf(buf, "TID range: 0..%d, in use: %u\n" + "STID range: %d..%d, in use: %u\n" + "ATID range: %d..%d, in use: %u\n" + "MSS: %u\n", + t->ntids - 1, atomic_read(&t->tids_in_use), t->stid_base, + t->stid_base + t->nstids - 1, t->stids_in_use, + t->atid_base, t->atid_base + t->natids - 1, + t->atids_in_use, d->tx_max_chunk); + if (len > length) + len = length; + *eof = 1; + return len; +} + +static int t3cdev_info_proc_setup(struct proc_dir_entry *dir, + struct t3c_data *d) +{ + struct proc_dir_entry *p; + + if (!dir) + return -EINVAL; + + p = create_proc_read_entry("info", 0, dir, t3cdev_info_read_proc, d); + if (!p) + return -ENOMEM; + + p->owner = THIS_MODULE; + return 0; +} + +static void t3cdev_proc_init(struct t3cdev *dev) +{ + t3_l2t_proc_setup(dev->proc_dir, L2DATA(dev)); + t3cdev_info_proc_setup(dev->proc_dir, T3C_DATA(dev)); +} + +static void t3cdev_info_proc_free(struct proc_dir_entry *dir) +{ + if (dir) + remove_proc_entry("info", dir); +} + +static void t3cdev_proc_cleanup(struct t3cdev *dev) +{ + t3_l2t_proc_free(dev->proc_dir); + t3cdev_info_proc_free(dev->proc_dir); +} + +#else +#define t3cdev_proc_init(dev) +#define t3cdev_proc_cleanup(dev) +#endif + +void detach_t3cdev(struct t3cdev *dev) +{ + struct 
t3c_data *t = T3C_DATA(dev); + t3cdev_proc_cleanup(dev); + dev->close(dev); + free_tid_maps(&t->tid_maps); + dev->recv = NULL; + dev->neigh_update = NULL; + T3C_DATA(dev) = NULL; + t3_free_l2t(L2DATA(dev)); + L2DATA(dev) = NULL; + kfree(t); +} + +int attach_t3cdev(struct t3cdev *dev) +{ + int natids, err; + struct t3c_data *t; + struct tid_range stid_range, tid_range; + struct ddp_params ddp; + struct mtutab mtutab; + unsigned int l2t_capacity; + + t = kcalloc(1, sizeof(*t), GFP_KERNEL); + if (!t) + return -ENOMEM; + + err = -EOPNOTSUPP; + if (dev->ctl(dev, GET_TX_MAX_CHUNK, &t->tx_max_chunk) < 0 || + dev->ctl(dev, GET_MAX_OUTSTANDING_WR, &t->max_wrs) < 0 || + dev->ctl(dev, GET_L2T_CAPACITY, &l2t_capacity) < 0 || + dev->ctl(dev, GET_MTUS, &mtutab) < 0 || + dev->ctl(dev, GET_DDP_PARAMS, &ddp) < 0 || + dev->ctl(dev, GET_TID_RANGE, &tid_range) < 0 || + dev->ctl(dev, GET_STID_RANGE, &stid_range) < 0) + goto out_free; + + err = -ENOMEM; + L2DATA(dev) = t3_init_l2t(l2t_capacity); + if (!L2DATA(dev)) + goto out_free; + + natids = min(tid_range.num / 2, MAX_ATIDS); + err = init_tid_tabs(&t->tid_maps, tid_range.num, natids, + stid_range.num, ATID_BASE, stid_range.base); + if (err) + goto out_free_l2t; + + t->mtus = mtutab.mtus; + t->nmtus = mtutab.size; + + t->ddp_llimit = ddp.llimit; + t->ddp_ulimit = ddp.ulimit; + t->ddp_tagmask = ddp.tag_mask; + + INIT_LIST_HEAD(&t->list_node); + t->dev = dev; + + T3C_DATA(dev) = t; + dev->recv = process_rx; + dev->neigh_update = t3_l2t_update; + + /* All setup completed, let the driver know. 
*/ + err = dev->open(dev); + if (err) + goto free_all; + + t3cdev_proc_init(dev); + return 0; + +free_all: + dev->recv = NULL; + dev->neigh_update = NULL; + T3C_DATA(dev) = NULL; + free_tid_maps(&t->tid_maps); +out_free_l2t: + t3_free_l2t(L2DATA(dev)); + L2DATA(dev) = NULL; +out_free: + kfree(t); + return err; +} + +void __init t3cdev_init(void) +{ + int i; + + if (t3c_proc_init()) + printk(KERN_WARNING "Unable to create /proc/net/t3c dir\n"); + + for (i = 0; i < NUM_CPL_CMDS; ++i) + cpl_handlers[i] = do_bad_cpl; + return; +} + +void __exit t3cdev_exit(void) +{ + t3c_proc_cleanup(); + return; +} + diff --git a/drivers/infiniband/hw/cxgb3/t3c/tcb.h b/drivers/infiniband/hw/cxgb3/t3c/tcb.h new file mode 100644 index 0000000..64d6e17 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/t3c/tcb.h @@ -0,0 +1,378 @@ +/* This file is automatically generated --- do not edit */ + +#ifndef _TCB_DEFS_H +#define _TCB_DEFS_H + +#define W_TCB_T_STATE 0 +#define S_TCB_T_STATE 0 +#define M_TCB_T_STATE 0xfULL +#define V_TCB_T_STATE(x) ((x) << S_TCB_T_STATE) + +#define W_TCB_TIMER 0 +#define S_TCB_TIMER 4 +#define M_TCB_TIMER 0x1ULL +#define V_TCB_TIMER(x) ((x) << S_TCB_TIMER) + +#define W_TCB_DACK_TIMER 0 +#define S_TCB_DACK_TIMER 5 +#define M_TCB_DACK_TIMER 0x1ULL +#define V_TCB_DACK_TIMER(x) ((x) << S_TCB_DACK_TIMER) + +#define W_TCB_DEL_FLAG 0 +#define S_TCB_DEL_FLAG 6 +#define M_TCB_DEL_FLAG 0x1ULL +#define V_TCB_DEL_FLAG(x) ((x) << S_TCB_DEL_FLAG) + +#define W_TCB_L2T_IX 0 +#define S_TCB_L2T_IX 7 +#define M_TCB_L2T_IX 0x7ffULL +#define V_TCB_L2T_IX(x) ((x) << S_TCB_L2T_IX) + +#define W_TCB_SMAC_SEL 0 +#define S_TCB_SMAC_SEL 18 +#define M_TCB_SMAC_SEL 0x3ULL +#define V_TCB_SMAC_SEL(x) ((x) << S_TCB_SMAC_SEL) + +#define W_TCB_TOS 0 +#define S_TCB_TOS 20 +#define M_TCB_TOS 0x3fULL +#define V_TCB_TOS(x) ((x) << S_TCB_TOS) + +#define W_TCB_MAX_RT 0 +#define S_TCB_MAX_RT 26 +#define M_TCB_MAX_RT 0xfULL +#define V_TCB_MAX_RT(x) ((x) << S_TCB_MAX_RT) + +#define W_TCB_T_RXTSHIFT 0 +#define 
S_TCB_T_RXTSHIFT 30 +#define M_TCB_T_RXTSHIFT 0xfULL +#define V_TCB_T_RXTSHIFT(x) ((x) << S_TCB_T_RXTSHIFT) + +#define W_TCB_T_DUPACKS 1 +#define S_TCB_T_DUPACKS 2 +#define M_TCB_T_DUPACKS 0xfULL +#define V_TCB_T_DUPACKS(x) ((x) << S_TCB_T_DUPACKS) + +#define W_TCB_T_MAXSEG 1 +#define S_TCB_T_MAXSEG 6 +#define M_TCB_T_MAXSEG 0xfULL +#define V_TCB_T_MAXSEG(x) ((x) << S_TCB_T_MAXSEG) + +#define W_TCB_T_FLAGS1 1 +#define S_TCB_T_FLAGS1 10 +#define M_TCB_T_FLAGS1 0xffffffffULL +#define V_TCB_T_FLAGS1(x) ((x) << S_TCB_T_FLAGS1) + +#define W_TCB_T_FLAGS2 2 +#define S_TCB_T_FLAGS2 10 +#define M_TCB_T_FLAGS2 0x7fULL +#define V_TCB_T_FLAGS2(x) ((x) << S_TCB_T_FLAGS2) + +#define W_TCB_SND_SCALE 2 +#define S_TCB_SND_SCALE 17 +#define M_TCB_SND_SCALE 0xfULL +#define V_TCB_SND_SCALE(x) ((x) << S_TCB_SND_SCALE) + +#define W_TCB_RCV_SCALE 2 +#define S_TCB_RCV_SCALE 21 +#define M_TCB_RCV_SCALE 0xfULL +#define V_TCB_RCV_SCALE(x) ((x) << S_TCB_RCV_SCALE) + +#define W_TCB_SND_UNA_RAW 2 +#define S_TCB_SND_UNA_RAW 25 +#define M_TCB_SND_UNA_RAW 0x7ffffffULL +#define V_TCB_SND_UNA_RAW(x) ((x) << S_TCB_SND_UNA_RAW) + +#define W_TCB_SND_NXT_RAW 3 +#define S_TCB_SND_NXT_RAW 20 +#define M_TCB_SND_NXT_RAW 0x7ffffffULL +#define V_TCB_SND_NXT_RAW(x) ((x) << S_TCB_SND_NXT_RAW) + +#define W_TCB_RCV_NXT 4 +#define S_TCB_RCV_NXT 15 +#define M_TCB_RCV_NXT 0xffffffffULL +#define V_TCB_RCV_NXT(x) ((x) << S_TCB_RCV_NXT) + +#define W_TCB_RCV_ADV 5 +#define S_TCB_RCV_ADV 15 +#define M_TCB_RCV_ADV 0xffffULL +#define V_TCB_RCV_ADV(x) ((x) << S_TCB_RCV_ADV) + +#define W_TCB_SND_MAX_RAW 5 +#define S_TCB_SND_MAX_RAW 31 +#define M_TCB_SND_MAX_RAW 0x7ffffffULL +#define V_TCB_SND_MAX_RAW(x) ((x) << S_TCB_SND_MAX_RAW) + +#define W_TCB_SND_CWND 6 +#define S_TCB_SND_CWND 26 +#define M_TCB_SND_CWND 0x7ffffffULL +#define V_TCB_SND_CWND(x) ((x) << S_TCB_SND_CWND) + +#define W_TCB_SND_SSTHRESH 7 +#define S_TCB_SND_SSTHRESH 21 +#define M_TCB_SND_SSTHRESH 0x7ffffffULL +#define V_TCB_SND_SSTHRESH(x) ((x) << 
S_TCB_SND_SSTHRESH) + +#define W_TCB_T_RTT_TS_RECENT_AGE 8 +#define S_TCB_T_RTT_TS_RECENT_AGE 16 +#define M_TCB_T_RTT_TS_RECENT_AGE 0xffffffffULL +#define V_TCB_T_RTT_TS_RECENT_AGE(x) ((x) << S_TCB_T_RTT_TS_RECENT_AGE) + +#define W_TCB_T_RTSEQ_RECENT 9 +#define S_TCB_T_RTSEQ_RECENT 16 +#define M_TCB_T_RTSEQ_RECENT 0xffffffffULL +#define V_TCB_T_RTSEQ_RECENT(x) ((x) << S_TCB_T_RTSEQ_RECENT) + +#define W_TCB_T_SRTT 10 +#define S_TCB_T_SRTT 16 +#define M_TCB_T_SRTT 0xffffULL +#define V_TCB_T_SRTT(x) ((x) << S_TCB_T_SRTT) + +#define W_TCB_T_RTTVAR 11 +#define S_TCB_T_RTTVAR 0 +#define M_TCB_T_RTTVAR 0xffffULL +#define V_TCB_T_RTTVAR(x) ((x) << S_TCB_T_RTTVAR) + +#define W_TCB_TS_LAST_ACK_SENT_RAW 11 +#define S_TCB_TS_LAST_ACK_SENT_RAW 16 +#define M_TCB_TS_LAST_ACK_SENT_RAW 0x7ffffffULL +#define V_TCB_TS_LAST_ACK_SENT_RAW(x) ((x) << S_TCB_TS_LAST_ACK_SENT_RAW) + +#define W_TCB_DIP 12 +#define S_TCB_DIP 11 +#define M_TCB_DIP 0xffffffffULL +#define V_TCB_DIP(x) ((x) << S_TCB_DIP) + +#define W_TCB_SIP 13 +#define S_TCB_SIP 11 +#define M_TCB_SIP 0xffffffffULL +#define V_TCB_SIP(x) ((x) << S_TCB_SIP) + +#define W_TCB_DP 14 +#define S_TCB_DP 11 +#define M_TCB_DP 0xffffULL +#define V_TCB_DP(x) ((x) << S_TCB_DP) + +#define W_TCB_SP 14 +#define S_TCB_SP 27 +#define M_TCB_SP 0xffffULL +#define V_TCB_SP(x) ((x) << S_TCB_SP) + +#define W_TCB_TIMESTAMP 15 +#define S_TCB_TIMESTAMP 11 +#define M_TCB_TIMESTAMP 0xffffffffULL +#define V_TCB_TIMESTAMP(x) ((x) << S_TCB_TIMESTAMP) + +#define W_TCB_TIMESTAMP_OFFSET 16 +#define S_TCB_TIMESTAMP_OFFSET 11 +#define M_TCB_TIMESTAMP_OFFSET 0xfULL +#define V_TCB_TIMESTAMP_OFFSET(x) ((x) << S_TCB_TIMESTAMP_OFFSET) + +#define W_TCB_TX_MAX 16 +#define S_TCB_TX_MAX 15 +#define M_TCB_TX_MAX 0xffffffffULL +#define V_TCB_TX_MAX(x) ((x) << S_TCB_TX_MAX) + +#define W_TCB_TX_HDR_PTR_RAW 17 +#define S_TCB_TX_HDR_PTR_RAW 15 +#define M_TCB_TX_HDR_PTR_RAW 0x1ffffULL +#define V_TCB_TX_HDR_PTR_RAW(x) ((x) << S_TCB_TX_HDR_PTR_RAW) + +#define W_TCB_TX_LAST_PTR_RAW 
18 +#define S_TCB_TX_LAST_PTR_RAW 0 +#define M_TCB_TX_LAST_PTR_RAW 0x1ffffULL +#define V_TCB_TX_LAST_PTR_RAW(x) ((x) << S_TCB_TX_LAST_PTR_RAW) + +#define W_TCB_TX_COMPACT 18 +#define S_TCB_TX_COMPACT 17 +#define M_TCB_TX_COMPACT 0x1ULL +#define V_TCB_TX_COMPACT(x) ((x) << S_TCB_TX_COMPACT) + +#define W_TCB_RX_COMPACT 18 +#define S_TCB_RX_COMPACT 18 +#define M_TCB_RX_COMPACT 0x1ULL +#define V_TCB_RX_COMPACT(x) ((x) << S_TCB_RX_COMPACT) + +#define W_TCB_RCV_WND 18 +#define S_TCB_RCV_WND 19 +#define M_TCB_RCV_WND 0x7ffffffULL +#define V_TCB_RCV_WND(x) ((x) << S_TCB_RCV_WND) + +#define W_TCB_RX_HDR_OFFSET 19 +#define S_TCB_RX_HDR_OFFSET 14 +#define M_TCB_RX_HDR_OFFSET 0x7ffffffULL +#define V_TCB_RX_HDR_OFFSET(x) ((x) << S_TCB_RX_HDR_OFFSET) + +#define W_TCB_RX_FRAG0_START_IDX_RAW 20 +#define S_TCB_RX_FRAG0_START_IDX_RAW 9 +#define M_TCB_RX_FRAG0_START_IDX_RAW 0x7ffffffULL +#define V_TCB_RX_FRAG0_START_IDX_RAW(x) ((x) << S_TCB_RX_FRAG0_START_IDX_RAW) + +#define W_TCB_RX_FRAG1_START_IDX_OFFSET 21 +#define S_TCB_RX_FRAG1_START_IDX_OFFSET 4 +#define M_TCB_RX_FRAG1_START_IDX_OFFSET 0x7ffffffULL +#define V_TCB_RX_FRAG1_START_IDX_OFFSET(x) ((x) << S_TCB_RX_FRAG1_START_IDX_OFFSET) + +#define W_TCB_RX_FRAG0_LEN 21 +#define S_TCB_RX_FRAG0_LEN 31 +#define M_TCB_RX_FRAG0_LEN 0x7ffffffULL +#define V_TCB_RX_FRAG0_LEN(x) ((x) << S_TCB_RX_FRAG0_LEN) + +#define W_TCB_RX_FRAG1_LEN 22 +#define S_TCB_RX_FRAG1_LEN 26 +#define M_TCB_RX_FRAG1_LEN 0x7ffffffULL +#define V_TCB_RX_FRAG1_LEN(x) ((x) << S_TCB_RX_FRAG1_LEN) + +#define W_TCB_NEWRENO_RECOVER 23 +#define S_TCB_NEWRENO_RECOVER 21 +#define M_TCB_NEWRENO_RECOVER 0x7ffffffULL +#define V_TCB_NEWRENO_RECOVER(x) ((x) << S_TCB_NEWRENO_RECOVER) + +#define W_TCB_PDU_HAVE_LEN 24 +#define S_TCB_PDU_HAVE_LEN 16 +#define M_TCB_PDU_HAVE_LEN 0x1ULL +#define V_TCB_PDU_HAVE_LEN(x) ((x) << S_TCB_PDU_HAVE_LEN) + +#define W_TCB_PDU_LEN 24 +#define S_TCB_PDU_LEN 17 +#define M_TCB_PDU_LEN 0xffffULL +#define V_TCB_PDU_LEN(x) ((x) << S_TCB_PDU_LEN) + +#define 
W_TCB_RX_QUIESCE 25 +#define S_TCB_RX_QUIESCE 1 +#define M_TCB_RX_QUIESCE 0x1ULL +#define V_TCB_RX_QUIESCE(x) ((x) << S_TCB_RX_QUIESCE) + +#define W_TCB_RX_PTR_RAW 25 +#define S_TCB_RX_PTR_RAW 2 +#define M_TCB_RX_PTR_RAW 0x1ffffULL +#define V_TCB_RX_PTR_RAW(x) ((x) << S_TCB_RX_PTR_RAW) + +#define W_TCB_CPU_NO 25 +#define S_TCB_CPU_NO 19 +#define M_TCB_CPU_NO 0x7fULL +#define V_TCB_CPU_NO(x) ((x) << S_TCB_CPU_NO) + +#define W_TCB_ULP_TYPE 25 +#define S_TCB_ULP_TYPE 26 +#define M_TCB_ULP_TYPE 0xfULL +#define V_TCB_ULP_TYPE(x) ((x) << S_TCB_ULP_TYPE) + +#define S_TF_DACK 10 +#define V_TF_DACK(x) ((x) << S_TF_DACK) + +#define S_TF_NAGLE 11 +#define V_TF_NAGLE(x) ((x) << S_TF_NAGLE) + +#define S_TF_RECV_SCALE 12 +#define V_TF_RECV_SCALE(x) ((x) << S_TF_RECV_SCALE) + +#define S_TF_RECV_TSTMP 13 +#define V_TF_RECV_TSTMP(x) ((x) << S_TF_RECV_TSTMP) + +#define S_TF_RECV_SACK 14 +#define V_TF_RECV_SACK(x) ((x) << S_TF_RECV_SACK) + +#define S_TF_TURBO 15 +#define V_TF_TURBO(x) ((x) << S_TF_TURBO) + +#define S_TF_KEEPALIVE 16 +#define V_TF_KEEPALIVE(x) ((x) << S_TF_KEEPALIVE) + +#define S_TF_TCAM_BYPASS 17 +#define V_TF_TCAM_BYPASS(x) ((x) << S_TF_TCAM_BYPASS) + +#define S_TF_CORE_FIN 18 +#define V_TF_CORE_FIN(x) ((x) << S_TF_CORE_FIN) + +#define S_TF_CORE_MORE 19 +#define V_TF_CORE_MORE(x) ((x) << S_TF_CORE_MORE) + +#define S_TF_MIGRATING 20 +#define V_TF_MIGRATING(x) ((x) << S_TF_MIGRATING) + +#define S_TF_ACTIVE_OPEN 21 +#define V_TF_ACTIVE_OPEN(x) ((x) << S_TF_ACTIVE_OPEN) + +#define S_TF_ASK_MODE 22 +#define V_TF_ASK_MODE(x) ((x) << S_TF_ASK_MODE) + +#define S_TF_NON_OFFLOAD 23 +#define V_TF_NON_OFFLOAD(x) ((x) << S_TF_NON_OFFLOAD) + +#define S_TF_MOD_SCHD 24 +#define V_TF_MOD_SCHD(x) ((x) << S_TF_MOD_SCHD) + +#define S_TF_MOD_SCHD_REASON0 25 +#define V_TF_MOD_SCHD_REASON0(x) ((x) << S_TF_MOD_SCHD_REASON0) + +#define S_TF_MOD_SCHD_REASON1 26 +#define V_TF_MOD_SCHD_REASON1(x) ((x) << S_TF_MOD_SCHD_REASON1) + +#define S_TF_MOD_SCHD_RX 27 +#define V_TF_MOD_SCHD_RX(x) ((x) << 
S_TF_MOD_SCHD_RX) + +#define S_TF_CORE_PUSH 28 +#define V_TF_CORE_PUSH(x) ((x) << S_TF_CORE_PUSH) + +#define S_TF_RCV_COALESCE_ENABLE 29 +#define V_TF_RCV_COALESCE_ENABLE(x) ((x) << S_TF_RCV_COALESCE_ENABLE) + +#define S_TF_RCV_COALESCE_PUSH 30 +#define V_TF_RCV_COALESCE_PUSH(x) ((x) << S_TF_RCV_COALESCE_PUSH) + +#define S_TF_RCV_COALESCE_LAST_PSH 31 +#define V_TF_RCV_COALESCE_LAST_PSH(x) ((x) << S_TF_RCV_COALESCE_LAST_PSH) + +#define S_TF_RCV_COALESCE_HEARTBEAT 32 +#define V_TF_RCV_COALESCE_HEARTBEAT(x) ((x) << S_TF_RCV_COALESCE_HEARTBEAT) + +#define S_TF_HALF_CLOSE 33 +#define V_TF_HALF_CLOSE(x) ((x) << S_TF_HALF_CLOSE) + +#define S_TF_DACK_MSS 34 +#define V_TF_DACK_MSS(x) ((x) << S_TF_DACK_MSS) + +#define S_TF_CCTRL_SEL0 35 +#define V_TF_CCTRL_SEL0(x) ((x) << S_TF_CCTRL_SEL0) + +#define S_TF_CCTRL_SEL1 36 +#define V_TF_CCTRL_SEL1(x) ((x) << S_TF_CCTRL_SEL1) + +#define S_TF_TCP_NEWRENO_FAST_RECOVERY 37 +#define V_TF_TCP_NEWRENO_FAST_RECOVERY(x) ((x) << S_TF_TCP_NEWRENO_FAST_RECOVERY) + +#define S_TF_TX_PACE_AUTO 38 +#define V_TF_TX_PACE_AUTO(x) ((x) << S_TF_TX_PACE_AUTO) + +#define S_TF_PEER_FIN_HELD 39 +#define V_TF_PEER_FIN_HELD(x) ((x) << S_TF_PEER_FIN_HELD) + +#define S_TF_CORE_URG 40 +#define V_TF_CORE_URG(x) ((x) << S_TF_CORE_URG) + +#define S_TF_RDMA_ERROR 41 +#define V_TF_RDMA_ERROR(x) ((x) << S_TF_RDMA_ERROR) + +#define S_TF_SSWS_DISABLED 42 +#define V_TF_SSWS_DISABLED(x) ((x) << S_TF_SSWS_DISABLED) + +#define S_TF_DUPACK_COUNT_ODD 43 +#define V_TF_DUPACK_COUNT_ODD(x) ((x) << S_TF_DUPACK_COUNT_ODD) + +#define S_TF_TX_CHANNEL 44 +#define V_TF_TX_CHANNEL(x) ((x) << S_TF_TX_CHANNEL) + +#define S_TF_RX_CHANNEL 45 +#define V_TF_RX_CHANNEL(x) ((x) << S_TF_RX_CHANNEL) + +#define S_TF_TX_PACE_FIXED 46 +#define V_TF_TX_PACE_FIXED(x) ((x) << S_TF_TX_PACE_FIXED) + +#define S_TF_RDMA_FLM_ERROR 47 +#define V_TF_RDMA_FLM_ERROR(x) ((x) << S_TF_RDMA_FLM_ERROR) + +#define S_TF_RX_FLOW_CONTROL_DISABLE 48 +#define V_TF_RX_FLOW_CONTROL_DISABLE(x) ((x) << 
S_TF_RX_FLOW_CONTROL_DISABLE) + +#endif /* _TCB_DEFS_H */ From swise at opengridcomputing.com Fri Jun 23 07:30:21 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:30:21 -0500 Subject: [openib-general] [PATCH v2 11/14] CXGB3 Core ULP Demux Code. In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623143021.32410.63281.stgit@stevo-desktop> This code demuxes connection data and events from the LLD driver to the various registered ULPs. It also has the cxgb3 core module init logic, which includes registering with the Network Event Notifier to obtain L2/L3 events. --- drivers/infiniband/hw/cxgb3/t3c/t3c.c | 504 +++++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/t3c/t3c.h | 188 ++++++++++++ 2 files changed, 692 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/t3c/t3c.c b/drivers/infiniband/hw/cxgb3/t3c/t3c.c new file mode 100644 index 0000000..53d978a --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/t3c/t3c.c @@ -0,0 +1,504 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
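The client/device demux described in the commit message (registered ULP clients are told about every existing device, and a newly attached device is announced to every registered client, as in `t3c_register_client()`/`add_t3cdev()`) can be sketched in plain user-space C. Everything here is a hypothetical simplification: fixed arrays instead of kernel lists, no locking, and a counter in place of real `add` callbacks.

```c
#define MAX_CLIENTS 4
#define MAX_DEVS    4

struct client {
	const char *name;
	int adds_seen;		/* how many device-add events this client saw */
};

static struct client *clients[MAX_CLIENTS];
static const char *devs[MAX_DEVS];
static int nclients, ndevs;

/* Register a client and replay all currently known devices to it,
 * as t3c_register_client() does by walking t3cdev_list. */
static void register_client(struct client *c)
{
	int i;

	clients[nclients++] = c;
	for (i = 0; i < ndevs; i++)
		c->adds_seen++;
}

/* Attach a new device and announce it to every registered client,
 * as add_t3cdev() does by walking client_list. */
static void add_device(const char *name)
{
	int i;

	devs[ndevs++] = name;
	for (i = 0; i < nclients; i++)
		clients[i]->adds_seen++;
}
```

The symmetry matters: whichever side arrives second (client or device) still sees the other, so registration order never loses an add notification.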
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include + +#include "defs.h" +#include "l2t.h" +#include +#include "t3c.h" +#include + +// #define T3C_DEBUG + +#define MOD "t3c: " +#ifdef T3C_DEBUG +#define assert(expr) \ + if(!(expr)) { \ + printk(KERN_ERR MOD "Assertion failed! %s, %s, %s, line %d\n",\ + #expr, __FILE__, __FUNCTION__, __LINE__); \ + } +#define dprintk(fmt, args...) do {printk(KERN_INFO MOD fmt, ##args);} while (0) +#else +#define assert(expr) do {} while (0) +#define dprintk(fmt, args...) 
do {} while (0) +#endif + +MODULE_AUTHOR("Steve Wise "); +MODULE_DESCRIPTION("Chelsio T3 Core Module"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION("1.0"); + +void __init t3cdev_init(void); +void __exit t3cdev_exit(void); +void unregister_t3cdev(struct t3cdev *dev); +void register_t3cdev(struct t3cdev *dev, const char *name); + +static LIST_HEAD(client_list); + +void t3c_register_client(struct t3c_client *client) +{ + struct t3cdev *tdev; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + mutex_lock(&t3cdev_db_lock); + list_add_tail(&client->client_list, &client_list); + list_for_each_entry(tdev, &t3cdev_list, t3c_list) { + if (client->add) { + dprintk("%s - calling %s add fn with t3cdev %s\n", + __FUNCTION__, client->name, tdev->name); + client->add(tdev); + } + } + mutex_unlock(&t3cdev_db_lock); +} +EXPORT_SYMBOL(t3c_register_client); + +void t3c_unregister_client(struct t3c_client *client) +{ + struct t3cdev *tdev; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + mutex_lock(&t3cdev_db_lock); + list_del(&client->client_list); + list_for_each_entry(tdev, &t3cdev_list, t3c_list) { + if (client->remove) { + dprintk("%s - calling %s remove fn with t3cdev %s\n", + __FUNCTION__, client->name, tdev->name); + client->remove(tdev); + } + } + mutex_unlock(&t3cdev_db_lock); +} +EXPORT_SYMBOL(t3c_unregister_client); + +/* + * Called by t3's pci add function. + */ +static void add_t3cdev(struct t3cdev *tdev) +{ + struct t3c_client *client; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + register_t3cdev(tdev, "cxgb3c%d"); + attach_t3cdev(tdev); + list_for_each_entry(client, &client_list, client_list) { + if (client->add) { + dprintk("%s - calling %s add fn with t3cdev %s\n", + __FUNCTION__, client->name, tdev->name); + client->add(tdev); + } + } +} + +/* + * Called by t3's pci remove function. 
+ */ +static void remove_t3cdev(struct t3cdev *tdev) +{ + struct t3c_client *client; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + list_for_each_entry(client, &client_list, client_list) { + if (client->remove) { + dprintk("%s - calling %s remove fn with t3cdev %s\n", + __FUNCTION__, client->name, tdev->name); + client->remove(tdev); + } + } + detach_t3cdev(tdev); + unregister_t3cdev(tdev); +} + +/* + * Free an active-open TID. + */ +void t3c_free_atid(struct t3cdev *tdev, int atid) +{ + struct tid_info *t = &(T3C_DATA(tdev))->tid_maps; + union active_open_entry *p = atid2entry(t, atid); + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + spin_lock_bh(&t->atid_lock); + p->next = t->afree; + t->afree = p; + t->atids_in_use--; + spin_unlock_bh(&t->atid_lock); +} +EXPORT_SYMBOL(t3c_free_atid); + +/* + * Free a server TID and return it to the free pool. + */ +void t3c_free_stid(struct t3cdev *tdev, int stid) +{ + struct tid_info *t = &(T3C_DATA(tdev))->tid_maps; + union listen_entry *p = stid2entry(t, stid); + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + spin_lock_bh(&t->stid_lock); + p->next = t->sfree; + t->sfree = p; + t->stids_in_use--; + spin_unlock_bh(&t->stid_lock); +} +EXPORT_SYMBOL(t3c_free_stid); + +void t3c_insert_tid(struct t3cdev *tdev, struct t3c_client *client, + void *ctx, unsigned int tid) +{ + struct tid_info *t = &(T3C_DATA(tdev))->tid_maps; + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + t->tid_tab[tid].client = client; + t->tid_tab[tid].ctx = ctx; + atomic_inc(&t->tids_in_use); +} +EXPORT_SYMBOL(t3c_insert_tid); + +/* + * Remove a TID from the TID table. A client may defer processing its last + * CPL message if it is locked at the time it arrives, and while the message + * sits in the client's backlog the TID may be reused for another connection.
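The atid/stid scheme used by `t3c_free_atid()`/`t3c_free_stid()` above and the matching alloc paths is a classic array-backed free list: each table slot doubles as a free-list link while unallocated, so alloc pops the head and free pushes the slot back in O(1). A hedged user-space sketch, with hypothetical names (`pool_init`, `alloc_id`, `free_id`) and the spinlocks and per-device state stripped out:

```c
#include <stddef.h>

#define NIDS    8
#define ID_BASE 0x100		/* IDs live in [ID_BASE, ID_BASE + NIDS) */

union entry {
	union entry *next;	/* valid while the entry is on the free list */
	void *ctx;		/* valid while the entry is allocated */
};

static union entry tab[NIDS];
static union entry *free_head;

/* Thread every slot onto the free list, as init_tid_tabs() does. */
static void pool_init(void)
{
	int i;

	for (i = 0; i < NIDS - 1; i++)
		tab[i].next = &tab[i + 1];
	tab[NIDS - 1].next = NULL;
	free_head = tab;
}

/* Pop the head of the free list; the ID is the slot index plus a base,
 * as in t3c_alloc_atid(). Returns -1 when the pool is exhausted. */
static int alloc_id(void *ctx)
{
	union entry *p = free_head;

	if (!p)
		return -1;
	free_head = p->next;
	p->ctx = ctx;
	return (int)(p - tab) + ID_BASE;
}

/* Push the slot back onto the free list, as in t3c_free_atid(). */
static void free_id(int id)
{
	union entry *p = &tab[id - ID_BASE];

	p->next = free_head;
	free_head = p;
}
```

Because a freed slot goes back at the head, the most recently freed ID is the next one handed out, which is exactly the reuse hazard the comment above (and `t3c_remove_tid()`'s `cmpxchg`) guards against.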
+ * To handle this we atomically switch the TID association if it still points + * to the original client context. + */ +void t3c_remove_tid(struct t3cdev *tdev, void *ctx, unsigned int tid) +{ + struct tid_info *t = &(T3C_DATA(tdev))->tid_maps; + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + cmpxchg(&t->tid_tab[tid].ctx, ctx, NULL); + atomic_dec(&t->tids_in_use); +} +EXPORT_SYMBOL(t3c_remove_tid); + +int t3c_alloc_atid(struct t3cdev *tdev, struct t3c_client *client, void *ctx) +{ + int atid = -1; + struct tid_info *t = &(T3C_DATA(tdev))->tid_maps; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + spin_lock_bh(&t->atid_lock); + if (t->afree) { + union active_open_entry *p = t->afree; + + atid = (p - t->atid_tab) + t->atid_base; + t->afree = p->next; + p->t3c_tid.ctx = ctx; + p->t3c_tid.client = client; + t->atids_in_use++; + } + spin_unlock_bh(&t->atid_lock); + return atid; +} +EXPORT_SYMBOL(t3c_alloc_atid); + +int t3c_alloc_stid(struct t3cdev *tdev, struct t3c_client *client, void *ctx) +{ + int stid = -1; + struct tid_info *t = &(T3C_DATA(tdev))->tid_maps; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + spin_lock_bh(&t->stid_lock); + if (t->sfree) { + union listen_entry *p = t->sfree; + + stid = (p - t->stid_tab) + t->stid_base; + t->sfree = p->next; + p->t3c_tid.ctx = ctx; + p->t3c_tid.client = client; + t->stids_in_use++; + } + spin_unlock_bh(&t->stid_lock); + return stid; +} +EXPORT_SYMBOL(t3c_alloc_stid); + +static int do_smt_write_rpl(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_smt_write_rpl *rpl = cplhdr(skb); + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + if (rpl->status != CPL_ERR_NONE) + printk(KERN_ERR + "Unexpected SMT_WRITE_RPL status %u for entry %u\n", + rpl->status, GET_TID(rpl)); + + return CPL_RET_BUF_DONE; +} + +static int do_l2t_write_rpl(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_l2t_write_rpl *rpl = 
cplhdr(skb); + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + if (rpl->status != CPL_ERR_NONE) + printk(KERN_ERR + "Unexpected L2T_WRITE_RPL status %u for entry %u\n", + rpl->status, GET_TID(rpl)); + + return CPL_RET_BUF_DONE; +} + +static int do_act_open_rpl(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_act_open_rpl *rpl = cplhdr(skb); + unsigned int atid = G_TID(ntohl(rpl->atid)); + struct t3c_tid_entry *t3c_tid; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + t3c_tid = lookup_atid(&(T3C_DATA(dev))->tid_maps, atid); + if (t3c_tid->ctx && t3c_tid->client && t3c_tid->client->handlers && + t3c_tid->client->handlers[CPL_ACT_OPEN_RPL]) { + return t3c_tid->client->handlers[CPL_ACT_OPEN_RPL] (dev, skb, + t3c_tid->ctx); + } else { + printk(KERN_ERR "%s: received clientless CPL command 0x%x\n", + dev->name, CPL_ACT_OPEN_RPL); + return CPL_RET_BUF_DONE | CPL_RET_BAD_MSG; + } +} + +static int do_stid_rpl(struct t3cdev *dev, struct sk_buff *skb) +{ + union opcode_tid *p = cplhdr(skb); + unsigned int stid = G_TID(ntohl(p->opcode_tid)); + struct t3c_tid_entry *t3c_tid; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + t3c_tid = lookup_stid(&(T3C_DATA(dev))->tid_maps, stid); + if (t3c_tid->ctx && t3c_tid->client->handlers && + t3c_tid->client->handlers[p->opcode]) { + return t3c_tid->client->handlers[p->opcode] (dev, skb, t3c_tid->ctx); + } else { + printk(KERN_ERR "%s: received clientless CPL command 0x%x\n", + dev->name, p->opcode); + return CPL_RET_BUF_DONE | CPL_RET_BAD_MSG; + } +} + +static int do_hwtid_rpl(struct t3cdev *dev, struct sk_buff *skb) +{ + union opcode_tid *p = cplhdr(skb); + unsigned int hwtid = G_TID(ntohl(p->opcode_tid)); + struct t3c_tid_entry *t3c_tid; + + dprintk("%s enter (%s line %u) opcode 0x%x tid %d\n", + __FUNCTION__, __FILE__, __LINE__, p->opcode, hwtid); + t3c_tid = lookup_tid(&(T3C_DATA(dev))->tid_maps, hwtid); + if (t3c_tid->ctx && 
t3c_tid->client->handlers && + t3c_tid->client->handlers[p->opcode]) { + return t3c_tid->client->handlers[p->opcode] + (dev, skb, t3c_tid->ctx); + } else { + printk(KERN_ERR "%s: received clientless CPL command 0x%x\n", + dev->name, p->opcode); + return CPL_RET_BUF_DONE | CPL_RET_BAD_MSG; + } +} + +static int do_cr(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_pass_accept_req *req = cplhdr(skb); + unsigned int stid = G_PASS_OPEN_TID(ntohl(req->tos_tid)); + struct t3c_tid_entry *t3c_tid; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + t3c_tid = lookup_stid(&(T3C_DATA(dev))->tid_maps, stid); + if (t3c_tid->ctx && t3c_tid->client->handlers && + t3c_tid->client->handlers[CPL_PASS_ACCEPT_REQ]) { + return t3c_tid->client->handlers[CPL_PASS_ACCEPT_REQ] + (dev, skb, t3c_tid->ctx); + } else { + printk(KERN_ERR "%s: received clientless CPL command 0x%x\n", + dev->name, CPL_PASS_ACCEPT_REQ); + return CPL_RET_BUF_DONE | CPL_RET_BAD_MSG; + } +} + +static int do_act_establish(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_act_establish *req = cplhdr(skb); + unsigned int atid = G_PASS_OPEN_TID(ntohl(req->tos_tid)); + struct t3c_tid_entry *t3c_tid; + + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + t3c_tid = lookup_atid(&(T3C_DATA(dev))->tid_maps, atid); + if (t3c_tid->ctx && t3c_tid->client->handlers && + t3c_tid->client->handlers[CPL_ACT_ESTABLISH]) { + return t3c_tid->client->handlers[CPL_ACT_ESTABLISH] + (dev, skb, t3c_tid->ctx); + } else { + printk(KERN_ERR "%s: received clientless CPL command 0x%x\n", + dev->name, CPL_ACT_ESTABLISH); + return CPL_RET_BUF_DONE | CPL_RET_BAD_MSG; + } +} + +static int do_set_tcb_rpl(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_set_tcb_rpl *rpl = cplhdr(skb); + + if (rpl->status != CPL_ERR_NONE) + printk(KERN_ERR + "Unexpected SET_TCB_RPL status %u for tid %u\n", + rpl->status, GET_TID(rpl)); + return CPL_RET_BUF_DONE; +} + +static int do_trace(struct t3cdev
*dev, struct sk_buff *skb) +{ + struct cpl_trace_pkt *p = cplhdr(skb); + + skb->protocol = htons(0xffff); + skb->dev = dev->lldev; + skb_pull(skb, sizeof(*p)); + skb->mac.raw = skb->data; + netif_receive_skb(skb); + return 0; +} + +static int do_term(struct t3cdev *dev, struct sk_buff *skb) +{ + unsigned int hwtid = ntohl(skb->priority) >> 8 & 0xfffff; + unsigned int opcode = G_OPCODE(ntohl(skb->csum)); + struct t3c_tid_entry *t3c_tid; + + dprintk("%s enter (%s line %u) opcode 0x%x tid %d\n", + __FUNCTION__, __FILE__, __LINE__, opcode, hwtid); + + t3c_tid = lookup_tid(&(T3C_DATA(dev))->tid_maps, hwtid); + if (t3c_tid->ctx && t3c_tid->client->handlers && + t3c_tid->client->handlers[opcode]) { + return t3c_tid->client->handlers[opcode](dev, skb, t3c_tid->ctx); + } else { + printk(KERN_ERR "%s: received clientless CPL command 0x%x\n", + dev->name, opcode); + return CPL_RET_BUF_DONE | CPL_RET_BAD_MSG; + } +} + +static int nb_callback(struct notifier_block *self, unsigned long event, + void *ctx) +{ + switch (event) { + case (NETEVENT_NEIGH_UPDATE): { + t3c_neigh_update((struct neighbour *)ctx, 0); + break; + } + case (NETEVENT_ROUTE_UPDATE): + dprintk("%s ROUTE_UPDATE\n", __FUNCTION__); + break; + case (NETEVENT_PMTU_UPDATE): + dprintk("%s PMTU_UPDATE\n", __FUNCTION__); + break; + case (NETEVENT_REDIRECT): { + struct netevent_redirect *nr = ctx; + dprintk("%s REDIRECT old dst %p new dst %p " + "old neigh %p new neigh %p old neigh key %x " + "new neigh key %x\n", __FUNCTION__, + nr->old, nr->new, + nr->old ? nr->old->neighbour : NULL, + nr->new ? nr->new->neighbour : NULL, + nr->old->neighbour ? + *(u32*)nr->old->neighbour->primary_key + : 0, + nr->new->neighbour ?
+ *(u32*)nr->new->neighbour->primary_key + : 0); + t3c_redirect(nr->old, nr->new); + t3c_neigh_update(nr->new->neighbour, 0); + break; + } + default: + printk(KERN_ERR "unknown net event notifier type %lu\n", + event); + break; + } + return 0; +} + +static struct notifier_block nb = { + .notifier_call = nb_callback +}; + + +/* + * upcall struct for the t3 module. + */ +static struct t3_core core = { + .add = add_t3cdev, + .remove = remove_t3cdev, +}; + +int __init t3c_init(void) +{ + t3cdev_init(); + register_netevent_notifier(&nb); + dprintk("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + t3_register_cpl_handler(CPL_SMT_WRITE_RPL, do_smt_write_rpl); + t3_register_cpl_handler(CPL_L2T_WRITE_RPL, do_l2t_write_rpl); + t3_register_cpl_handler(CPL_PASS_OPEN_RPL, do_stid_rpl); + t3_register_cpl_handler(CPL_CLOSE_LISTSRV_RPL, do_stid_rpl); + t3_register_cpl_handler(CPL_PASS_ACCEPT_REQ, do_cr); + t3_register_cpl_handler(CPL_PASS_ESTABLISH, do_hwtid_rpl); + t3_register_cpl_handler(CPL_ABORT_RPL_RSS, do_hwtid_rpl); + t3_register_cpl_handler(CPL_ABORT_RPL, do_hwtid_rpl); + t3_register_cpl_handler(CPL_RX_DATA, do_hwtid_rpl); + t3_register_cpl_handler(CPL_TX_DATA_ACK, do_hwtid_rpl); + t3_register_cpl_handler(CPL_TX_DMA_ACK, do_hwtid_rpl); + t3_register_cpl_handler(CPL_ACT_OPEN_RPL, do_act_open_rpl); + t3_register_cpl_handler(CPL_PEER_CLOSE, do_hwtid_rpl); + t3_register_cpl_handler(CPL_CLOSE_CON_RPL, do_hwtid_rpl); + t3_register_cpl_handler(CPL_ABORT_REQ_RSS, do_hwtid_rpl); + t3_register_cpl_handler(CPL_ACT_ESTABLISH, do_act_establish); + t3_register_cpl_handler(CPL_SET_TCB_RPL, do_set_tcb_rpl); + t3_register_cpl_handler(CPL_RDMA_TERMINATE, do_term); + t3_register_cpl_handler(CPL_TRACE_PKT, do_trace); + t3_register_core(&core); + + return 0; +} + +static void __exit t3c_exit(void) +{ + dprintk("%s (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + t3_unregister_core(&core); + t3cdev_exit(); + unregister_netevent_notifier(&nb); + return; +} 
+module_init(t3c_init); +module_exit(t3c_exit); diff --git a/drivers/infiniband/hw/cxgb3/t3c/t3c.h b/drivers/infiniband/hw/cxgb3/t3c/t3c.h new file mode 100644 index 0000000..fdc51a8 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/t3c/t3c.h @@ -0,0 +1,188 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _CHELSIO_T3C_H +#define _CHELSIO_T3C_H + +#include +#include + +#include +#include + +#include +#include + +/* + * Client registration. Users of the T3 Core driver must register themselves. 
+ * The T3 Core driver will call the add function of every client for each T3 + * PCI device probed, passing up the t3cdev ptr. Each client fills out an + * array of callback functions to process CPL messages. + */ +typedef int (*t3c_cpl_handler_func)(struct t3cdev *dev, + struct sk_buff *skb, void *ctx); + +struct t3c_client { + char *name; + void (*add) (struct t3cdev *); + void (*remove) (struct t3cdev *); + t3c_cpl_handler_func *handlers; + int (*redirect)(void *ctx, struct dst_entry *old, + struct dst_entry *new, + struct l2t_entry *l2t); + struct list_head client_list; +}; + +void t3c_register_client(struct t3c_client *); +void t3c_unregister_client(struct t3c_client *); + +/* + * TID allocation services. + */ +int t3c_alloc_atid(struct t3cdev *dev, struct t3c_client *client, void *ctx); +int t3c_alloc_stid(struct t3cdev *dev, struct t3c_client *client, void *ctx); +void t3c_free_atid(struct t3cdev *dev, int atid); +void t3c_free_stid(struct t3cdev *dev, int stid); +void t3c_insert_tid(struct t3cdev *dev, struct t3c_client *client, void *ctx, + unsigned int tid); +void t3c_remove_tid(struct t3cdev *dev, void *ctx, unsigned int tid); + +struct t3c_tid_entry { + struct t3c_client *client; + void *ctx; +}; + +/* CPL message priority levels */ +enum { + CPL_PRIORITY_DATA = 0, /* data messages */ + CPL_PRIORITY_SETUP = 1, /* connection setup messages */ + CPL_PRIORITY_TEARDOWN = 0, /* connection teardown messages */ + CPL_PRIORITY_LISTEN = 1, /* listen start/stop messages */ + CPL_PRIORITY_ACK = 1, /* RX ACK messages */ + CPL_PRIORITY_CONTROL = 1 /* TOE control messages */ +}; + +/* Flags for return value of CPL message handlers */ +enum { + CPL_RET_BUF_DONE = 1, // buffer processing done, buffer may be freed + CPL_RET_BAD_MSG = 2, // bad CPL message (e.g., unknown opcode) + CPL_RET_UNKNOWN_TID = 4 // unexpected unknown TID +}; + +typedef int (*cpl_handler_func)(struct t3cdev *dev, struct sk_buff *skb); + +/* + * Returns a pointer to the first byte of the CPL 
header in an sk_buff that + * contains a CPL message. + */ +static inline void *cplhdr(struct sk_buff *skb) +{ + return skb->data; +} + +void t3_register_cpl_handler(unsigned int opcode, cpl_handler_func h); + +union listen_entry { + struct t3c_tid_entry t3c_tid; + union listen_entry *next; +}; + +union active_open_entry { + struct t3c_tid_entry t3c_tid; + union active_open_entry *next; +}; + +/* + * Holds the size, base address, free list start, etc. of the TID, server TID, + * and active-open TID tables for a TOE. The tables themselves are allocated + * dynamically. + */ +struct tid_info { + struct t3c_tid_entry *tid_tab; + unsigned int ntids; + atomic_t tids_in_use; + + union listen_entry *stid_tab; + unsigned int nstids; + unsigned int stid_base; + + union active_open_entry *atid_tab; + unsigned int natids; + unsigned int atid_base; + + /* + * The following members are accessed R/W so we put them in their own + * cache lines. + * + * XXX We could combine the atid fields above with the lock here since + * atids are used once (unlike other tids). OTOH the above fields are + * usually in cache due to tid_tab. + */ + spinlock_t atid_lock ____cacheline_aligned_in_smp; + union active_open_entry *afree; + unsigned int atids_in_use; + + spinlock_t stid_lock ____cacheline_aligned; + union listen_entry *sfree; + unsigned int stids_in_use; +}; + +struct t3c_data { + struct list_head list_node; + struct t3cdev *dev; + unsigned int tx_max_chunk; /* max payload for TX_DATA */ + unsigned int max_wrs; /* max in-flight WRs per connection */ + unsigned int ddp_llimit; /* DDP parameters */ + unsigned int ddp_ulimit; + unsigned int ddp_tagmask; + unsigned int nmtus; + const unsigned short *mtus; + struct tid_info tid_maps; +}; + +/* + * t3cdev -> t3c_data accessor + */ +#define T3C_DATA(dev) (*(struct t3c_data **)&(dev)->l4opt) + +/* XXX REMOVE THIS HACK WHEN 2.6.16 is published!
*/ +#include +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,16) +#include +#else +#include +#endif /* XXX end of hack */ + +extern struct mutex t3cdev_db_lock; +extern struct list_head t3cdev_list; + +#endif From swise at opengridcomputing.com Fri Jun 23 07:30:26 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 09:30:26 -0500 Subject: [openib-general] [PATCH v2 12/14] CXGB3 Core L2 Management. In-Reply-To: <20060623142924.32410.7623.stgit@stevo-desktop> References: <20060623142924.32410.7623.stgit@stevo-desktop> Message-ID: <20060623143026.32410.24373.stgit@stevo-desktop> This code manages the hardware's neighbour table and thus hooks into the native L2/L3 stack. Currently we're using the Netevent Notifier Mechanism patch to detect neighbour changes. This needs more discussion on how RNICs should be made aware of next hop changes... ISSUE: The processing of notification of L2/L3 events should really be in the IWCM, and an interface defined between the IWCM and IW devices to pass pertinent events down to the device. Currently, this is all done in the cxgb3c module since there are no other open source iwarp drivers that need this. --- drivers/infiniband/hw/cxgb3/t3c/l2t.c | 616 +++++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/t3c/l2t.h | 147 ++++++++ 2 files changed, 763 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/t3c/l2t.c b/drivers/infiniband/hw/cxgb3/t3c/l2t.c new file mode 100644 index 0000000..7912950 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/t3c/l2t.c @@ -0,0 +1,616 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include +#include "t3cdev.h" +#include "defs.h" +#include "l2t.h" +#include "t3_cpl.h" +#include "firmware_exports.h" + +/* #define L2T_DEBUG */ + +#ifdef L2T_DEBUG +#define dprintk(fmt, args...) do {printk(KERN_INFO fmt, ##args);} while (0) +#else +#define dprintk(fmt, args...) do {} while (0) +#endif + + +#define VLAN_NONE 0xfff + +/* + * Module locking notes: There is a RW lock protecting the L2 table as a + * whole plus a spinlock per L2T entry. Entry lookups and allocations happen + * under the protection of the table lock, individual entry changes happen + * while holding that entry's spinlock. The table lock nests outside the + * entry locks. 
Allocations of new entries take the table lock as writers so + * no other lookups can happen while allocating new entries. Entry updates + * take the table lock as readers so multiple entries can be updated in + * parallel. An L2T entry can be dropped by decrementing its reference count + * and therefore can happen in parallel with entry allocation but no entry + * can change state or increment its ref count during allocation as both of + * these perform lookups. + * + * Note: We do not take references to net_devices in this module because both + * the TOE and the sockets already hold references to the interfaces and the + * lifetime of an L2T entry is fully contained in the lifetime of the TOE. + */ + +static inline unsigned int vlan_prio(const struct l2t_entry *e) +{ + return e->vlan >> 13; +} + +static inline unsigned int arp_hash(u32 key, int ifindex, + const struct l2t_data *d) +{ + return jhash_2words(key, ifindex, 0) & (d->nentries - 1); +} + +static inline void neigh_replace(struct l2t_entry *e, struct neighbour *n) +{ + dprintk("%s %d e %p e->neigh %p neigh %p\n", __FUNCTION__, __LINE__, + e, e->neigh, n); + neigh_hold(n); + if (e->neigh) + neigh_release(e->neigh); + e->neigh = n; +} + +/* + * Set up an L2T entry and send any packets waiting in the arp queue. The + * supplied skb is used for the CPL_L2T_WRITE_REQ. Must be called with the + * entry locked.
+ */ +static int setup_l2e_send_pending(struct t3cdev *dev, struct sk_buff *skb, + struct l2t_entry *e) +{ + struct cpl_l2t_write_req *req; + + if (!skb) { + skb = alloc_skb(sizeof(*req), GFP_ATOMIC); + if (!skb) + return -ENOMEM; + } + + dprintk("%s %d e %p neigh %p\n", __FUNCTION__, __LINE__, e, + e->neigh); + req = (struct cpl_l2t_write_req *)__skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_L2T_WRITE_REQ, e->idx)); + req->params = htonl(V_L2T_W_IDX(e->idx) | V_L2T_W_IFF(e->smt_idx) | + V_L2T_W_VLAN(e->vlan & VLAN_VID_MASK) | + V_L2T_W_PRIO(vlan_prio(e))); + memcpy(e->dmac, e->neigh->ha, sizeof(e->dmac)); + memcpy(req->dst_mac, e->dmac, sizeof(req->dst_mac)); + dprintk("%s updating HW idx %d with %02x:%02x:%02x:%02x:%02x:%02x\n", + __FUNCTION__, e->idx, + req->dst_mac[0], + req->dst_mac[1], + req->dst_mac[2], + req->dst_mac[3], + req->dst_mac[4], + req->dst_mac[5]); + skb->priority = CPL_PRIORITY_CONTROL; + t3c_send(dev, skb); + while (e->arpq_head) { + skb = e->arpq_head; + e->arpq_head = skb->next; + skb->next = NULL; + t3c_send(dev, skb); + } + e->arpq_tail = NULL; + e->state = L2T_STATE_VALID; + + return 0; +} + +/* + * Add a packet to an L2T entry's queue of packets awaiting resolution. + * Must be called with the entry's lock held.
+ */ +static inline void arpq_enqueue(struct l2t_entry *e, struct sk_buff *skb) +{ + dprintk("%s %d e %p neigh %p\n", __FUNCTION__, __LINE__, e, e->neigh); + skb->next = NULL; + if (e->arpq_head) + e->arpq_tail->next = skb; + else + e->arpq_head = skb; + e->arpq_tail = skb; +} + +int t3_l2t_send_slow(struct t3cdev *dev, struct sk_buff *skb, + struct l2t_entry *e) +{ +again: + switch (e->state) { + case L2T_STATE_STALE: /* entry is stale, kick off revalidation */ + dprintk("%s %d STALE - e %p neigh %p\n", __FUNCTION__, + __LINE__, e, e->neigh); + neigh_event_send(e->neigh, NULL); + spin_lock_bh(&e->lock); + if (e->state == L2T_STATE_STALE) + e->state = L2T_STATE_VALID; + spin_unlock_bh(&e->lock); + case L2T_STATE_VALID: /* fast-path, send the packet on */ + return t3c_send(dev, skb); + case L2T_STATE_RESOLVING: + spin_lock_bh(&e->lock); + if (e->state != L2T_STATE_RESOLVING) { // ARP already completed + spin_unlock_bh(&e->lock); + goto again; + } + dprintk("%s %d RESOLVING - e %p neigh %p\n", __FUNCTION__, + __LINE__, e, e->neigh); + arpq_enqueue(e, skb); + spin_unlock_bh(&e->lock); + + /* + * Only the first packet added to the arpq should kick off + * resolution. However, because the alloc_skb below can fail, + * we allow each packet added to the arpq to retry resolution + * as a way of recovering from transient memory exhaustion. + * A better way would be to use a work request to retry L2T + * entries when there's no memory. 
+ */ + if (!neigh_event_send(e->neigh, NULL)) { + skb = alloc_skb(sizeof(struct cpl_l2t_write_req), + GFP_ATOMIC); + if (!skb) + break; + + spin_lock_bh(&e->lock); + if (e->arpq_head) + setup_l2e_send_pending(dev, skb, e); + else /* we lost the race */ + __kfree_skb(skb); + spin_unlock_bh(&e->lock); + } + } + return 0; +} +EXPORT_SYMBOL(t3_l2t_send_slow); + +void t3_l2t_send_event(struct t3cdev *dev, struct l2t_entry *e) +{ + dprintk("%s l2t %p neigh %p nud_state %x\n", __FUNCTION__, e, + e->neigh, e->neigh->nud_state); + +again: + switch (e->state) { + case L2T_STATE_STALE: /* entry is stale, kick off revalidation */ + dprintk("%s %d STALE - e %p neigh %p\n", __FUNCTION__, + __LINE__, e, e->neigh); + neigh_event_send(e->neigh, NULL); + spin_lock_bh(&e->lock); + if (e->state == L2T_STATE_STALE) { + e->state = L2T_STATE_VALID; + dprintk("%s STALE->VALID!\n", __FUNCTION__); + } + spin_unlock_bh(&e->lock); + return; + case L2T_STATE_VALID: /* fast-path, send the packet on */ + dprintk("%s %d VALID - e %p neigh %p\n", __FUNCTION__, + __LINE__, e, e->neigh); + return; + case L2T_STATE_RESOLVING: + spin_lock_bh(&e->lock); + if (e->state != L2T_STATE_RESOLVING) { // ARP already completed + spin_unlock_bh(&e->lock); + goto again; + } + dprintk("%s %d RESOLVING - e %p neigh %p\n", __FUNCTION__, + __LINE__, e, e->neigh); + spin_unlock_bh(&e->lock); + + /* + * Only the first packet added to the arpq should kick off + * resolution. However, because the alloc_skb below can fail, + * we allow each packet added to the arpq to retry resolution + * as a way of recovering from transient memory exhaustion. + * A better way would be to use a work request to retry L2T + * entries when there's no memory. + */ + neigh_event_send(e->neigh, NULL); + } + return; +} +EXPORT_SYMBOL(t3_l2t_send_event); + +/* + * Allocate a free L2T entry. Must be called with l2t_data.lock held. 
+ */ +static struct l2t_entry *alloc_l2e(struct l2t_data *d) +{ + struct l2t_entry *end, *e, **p; + + if (!atomic_read(&d->nfree)) + return NULL; + + /* there's definitely a free entry */ + for (e = d->rover, end = &d->l2tab[d->nentries]; e != end; ++e) + if (atomic_read(&e->refcnt) == 0) + goto found; + + for (e = &d->l2tab[1]; atomic_read(&e->refcnt); ++e) ; +found: + d->rover = e + 1; + atomic_dec(&d->nfree); + + /* + * The entry we found may be an inactive entry that is + * presently in the hash table. We need to remove it. + */ + if (e->state != L2T_STATE_UNUSED) { + int hash = arp_hash(e->addr, e->ifindex, d); + + for (p = &d->l2tab[hash].first; *p; p = &(*p)->next) + if (*p == e) { + *p = e->next; + break; + } + e->state = L2T_STATE_UNUSED; + } + dprintk("%s %d e %p neigh %p\n", __FUNCTION__, __LINE__, e, e->neigh); + return e; +} + +/* + * Called when an L2T entry has no more users. The entry is left in the hash + * table since it is likely to be reused but we also bump nfree to indicate + * that the entry can be reallocated for a different neighbor. We also drop + * the existing neighbor reference in case the neighbor is going away and is + * waiting on our reference. + * + * Because entries can be reallocated to other neighbors once their ref count + * drops to 0 we need to take the entry's lock to avoid races with a new + * incarnation. + */ +void t3_l2e_free(struct l2t_data *d, struct l2t_entry *e) +{ + spin_lock_bh(&e->lock); + if (atomic_read(&e->refcnt) == 0) { /* hasn't been recycled */ + dprintk("%s %d e %p neigh %p\n", __FUNCTION__, __LINE__, + e, e->neigh); + if (e->neigh) { + neigh_release(e->neigh); + e->neigh = NULL; + } + /* + * Don't need to worry about the arpq, an L2T entry can't be + * released if any packets are waiting for resolution as we + * need to be able to communicate with the TOE to close a + * connection. 
+ */ + } + spin_unlock_bh(&e->lock); + atomic_inc(&d->nfree); +} +EXPORT_SYMBOL(t3_l2e_free); + +/* + * Update an L2T entry that was previously used for the same next hop as neigh. + * Must be called with softirqs disabled. + */ +static inline void reuse_entry(struct l2t_entry *e, struct neighbour *neigh) +{ + unsigned int nud_state; + + spin_lock(&e->lock); /* avoid race with t3_l2t_free */ + dprintk("%s %d e %p neigh %p\n", __FUNCTION__, __LINE__, e, neigh); + + if (neigh != e->neigh) + neigh_replace(e, neigh); + nud_state = neigh->nud_state; + if (memcmp(e->dmac, neigh->ha, sizeof(e->dmac)) || + !(nud_state & NUD_VALID)) + e->state = L2T_STATE_RESOLVING; + else if (nud_state & NUD_CONNECTED) + e->state = L2T_STATE_VALID; + else + e->state = L2T_STATE_STALE; + spin_unlock(&e->lock); +} + +struct l2t_entry *t3_l2t_get(struct t3cdev *dev, struct neighbour *neigh, + unsigned int smt_idx) +{ + struct l2t_entry *e; + struct l2t_data *d = L2DATA(dev); + u32 addr = *(u32 *) neigh->primary_key; + int ifidx = neigh->dev->ifindex; + int hash = arp_hash(addr, ifidx, d); + + write_lock_bh(&d->lock); + for (e = d->l2tab[hash].first; e; e = e->next) + if (e->addr == addr && e->ifindex == ifidx && + e->smt_idx == smt_idx) { + l2t_hold(d, e); + if (atomic_read(&e->refcnt) == 1) + reuse_entry(e, neigh); + goto done; + } + + /* Need to allocate a new entry */ + e = alloc_l2e(d); + if (e) { + spin_lock(&e->lock); /* avoid race with t3_l2t_free */ + e->next = d->l2tab[hash].first; + d->l2tab[hash].first = e; + e->state = L2T_STATE_RESOLVING; + e->addr = addr; + e->ifindex = ifidx; + e->smt_idx = smt_idx; + atomic_set(&e->refcnt, 1); + neigh_replace(e, neigh); + if (neigh->dev->priv_flags & IFF_802_1Q_VLAN) + e->vlan = VLAN_DEV_INFO(neigh->dev)->vlan_id; + else + e->vlan = VLAN_NONE; + spin_unlock(&e->lock); + } +done: + dprintk("%s %d e %p neigh %p\n", __FUNCTION__, __LINE__, e, neigh); + write_unlock_bh(&d->lock); + return e; +} +EXPORT_SYMBOL(t3_l2t_get); + +/* + * Called when 
address resolution fails for an L2T entry to handle packets + * on the arpq head. If a packet specifies a failure handler it is invoked, + * otherwise the packet is sent to the TOE. + * + * XXX: maybe we should abandon the latter behavior and just require a failure + * handler. + */ +static void handle_failed_resolution(struct t3cdev *dev, struct sk_buff *arpq) +{ + while (arpq) { + struct sk_buff *skb = arpq; + struct l2t_skb_cb *cb = L2T_SKB_CB(skb); + + arpq = skb->next; + skb->next = NULL; + if (cb->arp_failure_handler) + cb->arp_failure_handler(dev, skb); + else + t3c_send(dev, skb); + } +} + +/* + * Called when the host's ARP layer makes a change to some entry that is + * loaded into the HW L2 table. + */ +void t3_l2t_update(struct t3cdev *dev, struct neighbour *neigh, int flags, struct net_device *lldev) +{ + struct l2t_entry *e; + struct sk_buff *arpq = NULL; + struct l2t_data *d = L2DATA(dev); + u32 addr = *(u32 *) neigh->primary_key; + int ifidx = neigh->dev->ifindex; + int hash = arp_hash(addr, ifidx, d); + + read_lock_bh(&d->lock); + for (e = d->l2tab[hash].first; e; e = e->next) + if (e->addr == addr && e->ifindex == ifidx) { + spin_lock(&e->lock); + goto found; + } + read_unlock_bh(&d->lock); + return; + +found: + dprintk("%s l2t %p neigh %p nud_state %x\n", __FUNCTION__, e, + e->neigh, e->neigh ? e->neigh->nud_state : -1); + read_unlock(&d->lock); + if (atomic_read(&e->refcnt)) { + if (neigh != e->neigh) + neigh_replace(e, neigh); + + if (e->state == L2T_STATE_RESOLVING) { + dprintk("%s %d RESOLVING - e %p neigh %p\n", + __FUNCTION__, __LINE__, e, e->neigh); + if (neigh->nud_state & NUD_FAILED) { + arpq = e->arpq_head; + e->arpq_head = e->arpq_tail = NULL; + } else if (neigh_is_connected(neigh)) + setup_l2e_send_pending(dev, NULL, e); + } else { + e->state = neigh_is_connected(neigh) ?
+ L2T_STATE_VALID : L2T_STATE_STALE; + dprintk("%s %d state -> %d - e %p neigh %p\n", + __FUNCTION__, __LINE__, e->state, + e, e->neigh); + if (memcmp(e->dmac, neigh->ha, 6)) + setup_l2e_send_pending(dev, NULL, e); + } + } + spin_unlock_bh(&e->lock); + + if (arpq) + handle_failed_resolution(dev, arpq); +} + +struct l2t_data *t3_init_l2t(unsigned int l2t_capacity) +{ + struct l2t_data *d; + int i, size = sizeof(*d) + l2t_capacity * sizeof(struct l2t_entry); + + d = t3_alloc_mem(size); + if (!d) + return NULL; + + d->nentries = l2t_capacity; + d->rover = &d->l2tab[1]; /* entry 0 is not used */ + atomic_set(&d->nfree, l2t_capacity - 1); + rwlock_init(&d->lock); + + for (i = 0; i < l2t_capacity; ++i) { + d->l2tab[i].idx = i; + d->l2tab[i].state = L2T_STATE_UNUSED; + spin_lock_init(&d->l2tab[i].lock); + atomic_set(&d->l2tab[i].refcnt, 0); + } + return d; +} + +void t3_free_l2t(struct l2t_data *d) +{ + t3_free_mem(d); +} + +#ifdef CONFIG_PROC_FS +#include +#include +#include + +static inline void *l2t_get_idx(struct seq_file *seq, loff_t pos) +{ + struct l2t_data *d = seq->private; + + return pos >= d->nentries ? NULL : &d->l2tab[pos]; +} + +static void *l2t_seq_start(struct seq_file *seq, loff_t *pos) +{ + return *pos ? l2t_get_idx(seq, *pos) : SEQ_START_TOKEN; +} + +static void *l2t_seq_next(struct seq_file *seq, void *v, loff_t *pos) +{ + v = l2t_get_idx(seq, *pos + 1); + if (v) + ++*pos; + return v; +} + +static void l2t_seq_stop(struct seq_file *seq, void *v) +{ +} + +static char l2e_state(const struct l2t_entry *e) +{ + switch (e->state) { + case L2T_STATE_VALID: return 'V'; /* valid, fast-path entry */ + case L2T_STATE_STALE: return 'S'; /* needs revalidation, but usable */ + case L2T_STATE_RESOLVING: + return e->arpq_head ? 
+			'A' : 'R';
+	default:
+		return 'U';
+	}
+}
+
+static int l2t_seq_show(struct seq_file *seq, void *v)
+{
+	if (v == SEQ_START_TOKEN)
+		seq_puts(seq, "Index IP address Ethernet address VLAN "
+			 "Prio State Users SMTIDX Port\n");
+	else {
+		char ip[20];
+		struct l2t_entry *e = v;
+
+		spin_lock_bh(&e->lock);
+		sprintf(ip, "%u.%u.%u.%u", NIPQUAD(e->addr));
+		seq_printf(seq, "%-5u %-15s %02x:%02x:%02x:%02x:%02x:%02x %4d"
+			   " %3u %c %7u %4u %s\n",
+			   e->idx, ip, e->dmac[0], e->dmac[1], e->dmac[2],
+			   e->dmac[3], e->dmac[4], e->dmac[5],
+			   e->vlan & VLAN_VID_MASK, vlan_prio(e),
+			   l2e_state(e), atomic_read(&e->refcnt), e->smt_idx,
+			   e->neigh ? e->neigh->dev->name : "");
+		spin_unlock_bh(&e->lock);
+	}
+	return 0;
+}
+
+static struct seq_operations l2t_seq_ops = {
+	.start = l2t_seq_start,
+	.next = l2t_seq_next,
+	.stop = l2t_seq_stop,
+	.show = l2t_seq_show
+};
+
+static int l2t_seq_open(struct inode *inode, struct file *file)
+{
+	int rc = seq_open(file, &l2t_seq_ops);
+
+	if (!rc) {
+		struct proc_dir_entry *dp = PDE(inode);
+		struct seq_file *seq = file->private_data;
+
+		seq->private = dp->data;
+	}
+	return rc;
+}
+
+static struct file_operations l2t_seq_fops = {
+	.owner = THIS_MODULE,
+	.open = l2t_seq_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+/*
+ * Create the proc entries for the L2 table under dir.
+ */
+int t3_l2t_proc_setup(struct proc_dir_entry *dir, struct l2t_data *d)
+{
+	struct proc_dir_entry *p;
+
+	if (!dir)
+		return -EINVAL;
+
+	p = create_proc_entry("l2t", S_IRUGO, dir);
+	if (!p)
+		return -ENOMEM;
+
+	p->proc_fops = &l2t_seq_fops;
+	p->data = d;
+	return 0;
+}
+
+void t3_l2t_proc_free(struct proc_dir_entry *dir)
+{
+	if (dir)
+		remove_proc_entry("l2t", dir);
+}
+#endif
diff --git a/drivers/infiniband/hw/cxgb3/t3c/l2t.h b/drivers/infiniband/hw/cxgb3/t3c/l2t.h
new file mode 100644
index 0000000..48a247f
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/t3c/l2t.h
@@ -0,0 +1,147 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc.
All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef _CHELSIO_L2T_H
+#define _CHELSIO_L2T_H
+
+#include <linux/spinlock.h>
+#include <linux/skbuff.h>
+#include "t3cdev.h"
+#include <asm/atomic.h>
+
+enum {
+	L2T_STATE_VALID,	/* entry is up to date */
+	L2T_STATE_STALE,	/* entry may be used but needs revalidation */
+	L2T_STATE_RESOLVING,	/* entry needs address resolution */
+	L2T_STATE_UNUSED	/* entry not in use */
+};
+
+struct neighbour;
+struct sk_buff;
+
+/*
+ * Each L2T entry plays multiple roles.  First of all, it keeps state for the
+ * corresponding entry of the HW L2 table and maintains a queue of offload
+ * packets awaiting address resolution.
Second, it is a node of a hash table
+ * chain, where the nodes of the chain are linked together through their next
+ * pointer.  Finally, each node is a bucket of a hash table, pointing to the
+ * first element in its chain through its first pointer.
+ */
+struct l2t_entry {
+	u16 state;			/* entry state */
+	u16 idx;			/* entry index */
+	u32 addr;			/* dest IP address */
+	int ifindex;			/* neighbor's net_device's ifindex */
+	u16 smt_idx;			/* SMT index */
+	u16 vlan;			/* VLAN TCI (id: bits 0-11, prio: 13-15) */
+	struct neighbour *neigh;	/* associated neighbour */
+	struct l2t_entry *first;	/* start of hash chain */
+	struct l2t_entry *next;		/* next l2t_entry on chain */
+	struct sk_buff *arpq_head;	/* queue of packets awaiting resolution */
+	struct sk_buff *arpq_tail;
+	spinlock_t lock;
+	atomic_t refcnt;		/* entry reference count */
+	u8 dmac[6];			/* neighbour's MAC address */
+};
+
+struct l2t_data {
+	unsigned int nentries;		/* number of entries */
+	struct l2t_entry *rover;	/* starting point for next allocation */
+	atomic_t nfree;			/* number of free entries */
+	rwlock_t lock;
+	struct l2t_entry l2tab[0];
+};
+
+typedef void (*arp_failure_handler_func)(struct t3cdev *dev,
+					 struct sk_buff *skb);
+
+/*
+ * Callback stored in an skb to handle address resolution failure.
+ */
+struct l2t_skb_cb {
+	arp_failure_handler_func arp_failure_handler;
+};
+
+#define L2T_SKB_CB(skb) ((struct l2t_skb_cb *)(skb)->cb)
+
+static inline void set_arp_failure_handler(struct sk_buff *skb,
+					   arp_failure_handler_func hnd)
+{
+	L2T_SKB_CB(skb)->arp_failure_handler = hnd;
+}
+
+/*
+ * Getting to the L2 data from a toe device.
+ */
+#define L2DATA(dev) ((dev)->l2opt)
+
+void t3_l2e_free(struct l2t_data *d, struct l2t_entry *e);
+void t3_l2t_update(struct t3cdev *dev, struct neighbour *neigh, int flags, struct net_device *lldev);
+struct l2t_entry *t3_l2t_get(struct t3cdev *dev, struct neighbour *neigh,
+			     unsigned int smt_idx);
+int t3_l2t_send_slow(struct t3cdev *dev, struct sk_buff *skb,
+		     struct l2t_entry *e);
+void t3_l2t_send_event(struct t3cdev *dev, struct l2t_entry *e);
+struct l2t_data *t3_init_l2t(unsigned int l2t_capacity);
+void t3_free_l2t(struct l2t_data *d);
+
+#ifdef CONFIG_PROC_FS
+int t3_l2t_proc_setup(struct proc_dir_entry *dir, struct l2t_data *d);
+void t3_l2t_proc_free(struct proc_dir_entry *dir);
+#else
+#define l2t_proc_setup(dir, d) 0
+#define l2t_proc_free(dir)
+#endif
+
+int t3c_send(struct t3cdev *dev, struct sk_buff *skb);
+
+static inline int l2t_send(struct t3cdev *dev, struct sk_buff *skb,
+			   struct l2t_entry *e)
+{
+	if (likely(e->state == L2T_STATE_VALID))
+		return t3c_send(dev, skb);
+	return t3_l2t_send_slow(dev, skb, e);
+}
+
+static inline void l2t_release(struct l2t_data *d, struct l2t_entry *e)
+{
+	if (atomic_dec_and_test(&e->refcnt))
+		t3_l2e_free(d, e);
+}
+
+static inline void l2t_hold(struct l2t_data *d, struct l2t_entry *e)
+{
+	if (atomic_add_return(1, &e->refcnt) == 1)	/* 0 -> 1 transition */
+		atomic_dec(&d->nfree);
+}
+
+#endif

From mamidala at cse.ohio-state.edu  Fri Jun 23 07:28:15 2006
From: mamidala at cse.ohio-state.edu (amith rajith mamidala)
Date: Fri, 23 Jun 2006 10:28:15 -0400 (EDT)
Subject: [openib-general] mckey program
Message-ID: 

Hi,

I was checking the mckey.c program for IB. I did a quick check and found
that the rdma_resolve_addr function is invoking the cma_handler with an
erroneous event:

mckey: event: 1, error: -19

Is there any easy way to check what might be happening?
Thanks,
Amith

---------- Forwarded message ----------
Date: Thu, 22 Jun 2006 09:45:26 -0400 (EDT)
From: amith rajith mamidala
To: Hal Rosenstock
Cc: Sean Hefty
Subject: Re: Multicast Addresses

Hi Hal,

The IPoIB interface is started; I can ping other nodes using it. I have
also tried 224.0.0.1 but the error is still the same. Is there any other
set of programs that can verify the desired set-up is in place?

Thanks,
Amith

On 21 Jun 2006, Hal Rosenstock wrote:

> Hi again Amith,
>
> On Wed, 2006-06-21 at 17:40, amith rajith mamidala wrote:
> > Hi Sean,
> >
> > I did a quick test of the mckey program with the following inputs for the
> > receiver:
> > --> mckey recv 224.0.0.0
> >
> > I got the following error, with the receiver returning even though the
> > sender was not called. I wanted to double-check with you that I am giving
> > the correct options.
> >
> > cmatose: starting client
> > cmatose: joining
> > cmatose: event: 1, error: -19
> > test complete
> > return status -19
>
> -19 is -ENODEV. Do you have an IPoIB interface started? I see lots of
> other reasons in the library this might be returned too.
>
> 224.0.0.0 is a base reserved address.
>
> Can you try 224.0.0.1 (all systems on subnet)?
>
> -- Hal
>
> >
> > Thanks,
> > Amith
> >
> >
> > On 21 Jun 2006, Hal Rosenstock wrote:
> >
> > > Hi Amith,
> > >
> > > On Wed, 2006-06-21 at 16:41, amith rajith mamidala wrote:
> > > > Hi,
> > > >
> > > > I had a basic question. How do we specify the multicast addresses while
> > > > using rdma_cm? I am looking at the mckey program.
> > >
> > > The syntax is:
> > >
> > > mckey {s[end] | r[ecv]} mcast_addr [bind_addr]
> > >
> > > I think that mcast_addr is an IP address, as is the bind_addr. bind_addr
> > > is a unicast IP address whereas mcast_addr is a multicast one.
> > >
> > > It's Sean's test program so I added him to this.
> > >
> > > -- Hal
> > >
> > > >
> > > > Thanks,
> > > > Amith
> > > >
> > >

From paul.lundin at gmail.com  Fri Jun 23 07:44:46 2006
From: paul.lundin at gmail.com (Paul)
Date: Fri, 23 Jun 2006 10:44:46 -0400
Subject: [openib-general] OFED-1.0 fails install on AMD64
In-Reply-To: <449BAACE.6000609@mellanox.co.il>
References: <8953B8331AA98041B0C11DBC678AFC0812C7B1@srvemail1.calpont.com>
	<449BAACE.6000609@mellanox.co.il>
Message-ID: 

Eitan,

Anything using version 4 of gcc should (could?) have the same problem.
If you google the "relocation R_X86_64_32 against" section of the error
you will see a good many people with the same or similar issues (not on
OFED, but on many other things). I do not believe the issue lies with
OFED in this instance, though I could be wrong.

Regards.

On 6/23/06, Eitan Zahavi wrote:
>
> Hi Don,
>
> Sorry for my late response. ibutils compilation (of libibdmcom) is
> breaking with the error message:
>
> > relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not be
> > used when making a shared object; recompile with -fPIC
>
> For the command:
> > /bin/sh ../libtool --tag=CXX --mode=link g++ -O2
> > -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe
> > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o
> > libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1"
> > Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo
> > LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo
> > g++ -shared -nostdlib
>
> So obviously one has to figure out why -shared did not cause -fPIC.
> It is also not clear why this does not break on other machines. Anyway,
> reproducing the problem is my first target.
>
> One obvious thing to try is to set CFLAGS=-fPIC
>
> As I do not have access to the exact type of your machine (FSM Labs v
> 2.2.3 with the 2.6.16 kernel), and as the weekend has started over here,
> I guess I will be able to reproduce it only Sun/Mon.
> > Eitan > > Don Snedigar wrote: > > I just downloaded the OFED-1.0 and the install was going fine until > > ibutils. At that point, the install fails with : > > > > Open MPI RPM will be created during the installation process > > > > > > Building ibutils RPM. Please wait... > > > > Running rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define > > 'configure_options --prefix=/usr/local/ofed > > --mandir=/usr/local/ofed/share/man > > --cache-file=/var/tmp/OFED/ibutils.cache > > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > > --define '_mandir %{_prefix}/share/man' --define 'build_root > > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm > > - > > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > > --mandir=/usr/local/ofed/share/man > > --cache-file=/var/tmp/OFED/ibutils.cache > > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > > --define '_mandir %{_prefix}/share/man' --define 'build_root > > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > > > > See log file: /tmp/OFED.28656.log > > > > > > I dug down into the log file it indicates and found : > > > > g++ -DHAVE_CONFIG_H -I. -I. -I.. -O2 > > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT > > ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc > > - -o .libs/ibnl_scanner.o > > ibnl_scanner.ll: In function 'int ibnl_lex()': > > ibnl_scanner.ll:197: warning: ignoring return value of 'size_t > > fwrite(const void*, size_t, size_t, FILE*)', declared with attribute > > warn_unused_result > > g++ -DHAVE_CONFIG_H -I. -I. -I.. 
-O2 > > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT > > ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc -o > > ibnl_scanner.o >/dev/null 2>&1 > > /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 > > -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe > > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o > > libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" > > Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo > > LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo > > g++ -shared -nostdlib > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crti.o > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtbeginS.o .libs/Fabric.o > > .libs/SubnMgt.o .libs/TraceRoute.o .libs/CredLoops.o .libs/TopoMatch.o > > .libs/SysDef.o .libs/LinkCover.o .libs/Congestion.o .libs/ibnl_parser.o > > .libs/ibnl_scanner.o -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0 > > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64 > > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../.. 
-L/lib/../lib64 > > -L/usr/lib/../lib64 -lstdc++ -lm -lc -lgcc_s > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtendS.o > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crtn.o -m64 > > -mtune=nocona -Wl,-soname -Wl,libibdmcom.so.1 -o > > .libs/libibdmcom.so.1.1.1 > > /usr/bin/ld: > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a(mt_allocator.o): > > relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not be > > used when making a shared object; recompile with -fPIC > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a: could not read > > symbols: Bad value > > collect2: ld returned 1 exit status > > make[3]: *** [libibdmcom.la] Error 1 > > make[3]: Leaving directory > > `/var/tmp/OFEDRPM/BUILD/ibutils- 1.0/ibdm/datamodel' > > make[2]: *** [all-recursive] Error 1 > > make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' > > make[1]: *** [all] Error 2 > > make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils- 1.0/ibdm' > > make: *** [all-recursive] Error 1 > > error: Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > > > > > > RPM build errors: > > Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > > --mandir=/usr/local/ofed/share/man > > --cache-file=/var/tmp/OFED/ibutils.cache > > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > > --define '_mandir %{_prefix}/share/man' --define 'build_root > > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > > > > Can anyone shed any light on this ? > > > > Machine is dual Opteron, 2 gig memory, kernel 2.6.16 > > > > Don Snedigar > > Calpont Corp. 
> > 214-618-9516
> >
> >
> > ------------------------------------------------------------------------
> >
> > _______________________________________________
> > openib-general mailing list
> > openib-general at openib.org
> > http://openib.org/mailman/listinfo/openib-general
> >
> > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From dsnedigar at calpont.com  Fri Jun 23 07:51:05 2006
From: dsnedigar at calpont.com (Don Snedigar)
Date: Fri, 23 Jun 2006 09:51:05 -0500
Subject: [openib-general] OFED-1.0 fails install on AMD64
Message-ID: <8953B8331AA98041B0C11DBC678AFC0816AB2D@srvemail1.calpont.com>

Agreed, Paul. Google turns up hundreds, if not thousands, of hits about
this. It's not an OFED problem...

I was able to resolve the problem late last night by upgrading the
compiler to gcc-4.0.2.

Thanks for all the help though!

Don

________________________________

From: Paul [mailto:paul.lundin at gmail.com]
Sent: Friday, June 23, 2006 9:45 AM
To: Eitan Zahavi
Cc: Don Snedigar; openib-general at openib.org
Subject: Re: [openib-general] OFED-1.0 fails install on AMD64

Eitan,

Anything using version 4 of gcc should (could?) have the same problem.
If you google the "relocation R_X86_64_32 against" section of the error
you will see a good many people with the same or similar issues (not on
OFED, but on many other things). I do not believe the issue lies with
OFED in this instance, though I could be wrong.

Regards.

On 6/23/06, Eitan Zahavi <eitan at mellanox.co.il> wrote:

Hi Don,

Sorry for my late response.
ibutils compilation (of libibdmcom) is breaking with the error message: > relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not be > used when making a shared object; recompile with -fPIC For the command: > /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 > -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o > libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" > Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo > LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo > g++ -shared -nostdlib So obviously one has to figure out why -shared did not cause -fPIC ? Also not clear why this does not break on other machines. Anyways, reproducing the problem is my first target. One obvious thing to try is to set CFLAGS=-fPIC As I do not have access to the exact type of your machine : FSM Labs v 2.2.3 with the 2.6.16 kernel (as the weekend started over hear) I guess I will be able to reproduce only Sun/Mon. Eitan Don Snedigar wrote: > I just downloaded the OFED-1.0 and the install was going fine until > ibutils. At that point, the install fails with : > > Open MPI RPM will be created during the installation process > > > Building ibutils RPM. Please wait... 
> > Running rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define > 'configure_options --prefix=/usr/local/ofed > --mandir=/usr/local/ofed/share/man > --cache-file=/var/tmp/OFED/ibutils.cache > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > --define '_mandir %{_prefix}/share/man' --define 'build_root > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm > - > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > --mandir=/usr/local/ofed/share/man > --cache-file=/var/tmp/OFED/ibutils.cache > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > --define '_mandir %{_prefix}/share/man' --define 'build_root > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > > See log file: /tmp/OFED.28656.log > > > I dug down into the log file it indicates and found : > > g++ -DHAVE_CONFIG_H -I. -I. -I.. -O2 > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT > ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc > - -o .libs/ibnl_scanner.o > ibnl_scanner.ll: In function 'int ibnl_lex()': > ibnl_scanner.ll:197: warning: ignoring return value of 'size_t > fwrite(const void*, size_t, size_t, FILE*)', declared with attribute > warn_unused_result > g++ -DHAVE_CONFIG_H -I. -I. -I.. 
-O2 > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT > ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc -o > ibnl_scanner.o >/dev/null 2>&1 > /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 > -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o > libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" > Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo > LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo > g++ -shared -nostdlib > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crti.o > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtbeginS.o .libs/Fabric.o > .libs/SubnMgt.o .libs/TraceRoute.o .libs/CredLoops.o .libs/TopoMatch.o > .libs/SysDef.o .libs/LinkCover.o .libs/Congestion.o .libs/ibnl_parser.o > .libs/ibnl_scanner.o -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0 > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64 > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../.. 
-L/lib/../lib64 > -L/usr/lib/../lib64 -lstdc++ -lm -lc -lgcc_s > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtendS.o > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crtn.o -m64 > -mtune=nocona -Wl,-soname -Wl,libibdmcom.so.1 -o > .libs/libibdmcom.so.1.1.1 > /usr/bin/ld: > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a(mt_allocator.o): > relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not be > used when making a shared object; recompile with -fPIC > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a: could not read > symbols: Bad value > collect2: ld returned 1 exit status > make[3]: *** [libibdmcom.la] Error 1 > make[3]: Leaving directory > `/var/tmp/OFEDRPM/BUILD/ibutils- 1.0/ibdm/datamodel' > make[2]: *** [all-recursive] Error 1 > make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' > make[1]: *** [all] Error 2 > make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils- 1.0/ibdm' > make: *** [all-recursive] Error 1 > error: Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > > > RPM build errors: > Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > --mandir=/usr/local/ofed/share/man > --cache-file=/var/tmp/OFED/ibutils.cache > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > --define '_mandir %{_prefix}/share/man' --define 'build_root > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > > Can anyone shed any light on this ? > > Machine is dual Opteron, 2 gig memory, kernel 2.6.16 > > Don Snedigar > Calpont Corp. 
> 214-618-9516 > > > > > ------------------------------------------------------------------------ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From paul.lundin at gmail.com Fri Jun 23 07:54:48 2006 From: paul.lundin at gmail.com (Paul) Date: Fri, 23 Jun 2006 10:54:48 -0400 Subject: [openib-general] OFED-1.0 fails install on AMD64 In-Reply-To: <8953B8331AA98041B0C11DBC678AFC0816AB2D@srvemail1.calpont.com> References: <8953B8331AA98041B0C11DBC678AFC0816AB2D@srvemail1.calpont.com> Message-ID: Your welcome. Good to hear that you got it working. On 6/23/06, Don Snedigar wrote: > > Agreed Paul. Google turns up hundreds, if not thousands, of hits about > this. Its not an OFED problem... > > I was able to resolve the problem late last night by upgrading the > compiler to gcc-4.0.2. > > Thanks for all the help though! > > Don > > ------------------------------ > *From:* Paul [mailto:paul.lundin at gmail.com] > *Sent:* Friday, June 23, 2006 9:45 AM > *To:* Eitan Zahavi > *Cc:* Don Snedigar; openib-general at openib.org > > *Subject:* Re: [openib-general] OFED-1.0 fails install on AMD64 > > Eitan, > Anything using version 4 of gcc should (could ?) have the same problem. > If you google the "relocation R_X86_64_32 against" section of the error > you will see a good deal of people with the same/similar issues (not on > OFED, but on many other things). I do not belive the issue lies with OFED in > this instance. Though I could be wrong. > > Regards. 
> > On 6/23/06, Eitan Zahavi < eitan at mellanox.co.il> wrote: > > > > Hi Don, > > > > Sorry for my late response. ibutils compilation (of libibdmcom) is > > breaking with the > > error message: > > > > > relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not > > be > > > used when making a shared object; recompile with -fPIC > > > > For the command: > > > /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 > > > -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe > > > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o > > > libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" > > > Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo > > > LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo > > > g++ -shared -nostdlib > > > > So obviously one has to figure out why -shared did not cause -fPIC ? > > Also not clear why this does not break on other machines. Anyways, > > reproducing the problem is my first target. > > > > One obvious thing to try is to set CFLAGS=-fPIC > > > > As I do not have access to the exact type of your machine : FSM Labs v > > 2.2.3 with the 2.6.16 kernel > > (as the weekend started over hear) I guess I will be able to reproduce > > only Sun/Mon. > > > > Eitan > > > > Don Snedigar wrote: > > > I just downloaded the OFED-1.0 and the install was going fine until > > > ibutils. At that point, the install fails with : > > > > > > Open MPI RPM will be created during the installation process > > > > > > > > > Building ibutils RPM. Please wait... 
> > > > > > Running rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' > > --define > > > 'configure_options --prefix=/usr/local/ofed > > > --mandir=/usr/local/ofed/share/man > > > --cache-file=/var/tmp/OFED/ibutils.cache > > > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > > > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > > > --define '_mandir %{_prefix}/share/man' --define 'build_root > > > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm > > > - > > > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > > > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > > > > > --mandir=/usr/local/ofed/share/man > > > --cache-file=/var/tmp/OFED/ibutils.cache > > > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > > > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > > > --define '_mandir %{_prefix}/share/man' --define 'build_root > > > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > > > > > > See log file: /tmp/OFED.28656.log > > > > > > > > > I dug down into the log file it indicates and found : > > > > > > g++ -DHAVE_CONFIG_H -I. -I. -I.. -O2 > > > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > > > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT > > > ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc > > > - -o .libs/ibnl_scanner.o > > > ibnl_scanner.ll: In function 'int ibnl_lex()': > > > ibnl_scanner.ll:197: warning: ignoring return value of 'size_t > > > fwrite(const void*, size_t, size_t, FILE*)', declared with attribute > > > warn_unused_result > > > g++ -DHAVE_CONFIG_H -I. -I. -I.. 
-O2 > > > -DIBDM_IBNL_DIR=\"/usr/local/ofed/lib64\" -I/usr/include -O2 -g -pipe > > > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -MT > > > ibnl_scanner.lo -MD -MP -MF .deps/ibnl_scanner.Tpo -c ibnl_scanner.cc > > -o > > > ibnl_scanner.o >/dev/null 2>&1 > > > /bin/sh ../libtool --tag=CXX --mode=link g++ -O2 > > > -DIBDM_IBNL_DIR='"/usr/local/ofed/lib64"' -I/usr/include -O2 -g -pipe > > > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m64 -mtune=nocona -o > > > libibdmcom.la -rpath /usr/local/ofed/lib64 -version-info "2:1:1" > > > Fabric.lo SubnMgt.lo TraceRoute.lo CredLoops.lo TopoMatch.lo SysDef.lo > > > LinkCover.lo Congestion.lo ibnl_parser.lo ibnl_scanner.lo > > > g++ -shared -nostdlib > > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crti.o > > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtbeginS.o .libs/Fabric.o > > > .libs/SubnMgt.o .libs/TraceRoute.o .libs/CredLoops.o .libs/TopoMatch.o > > > .libs/SysDef.o .libs/LinkCover.o .libs/Congestion.o > > .libs/ibnl_parser.o > > > .libs/ibnl_scanner.o -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0 > > > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64 > > > -L/usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../.. 
-L/lib/../lib64 > > > -L/usr/lib/../lib64 -lstdc++ -lm -lc -lgcc_s > > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/crtendS.o > > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/../../../../lib64/crtn.o -m64 > > > -mtune=nocona -Wl,-soname -Wl,libibdmcom.so.1 -o > > > .libs/libibdmcom.so.1.1.1 > > > /usr/bin/ld: > > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a(mt_allocator.o): > > > relocation R_X86_64_32 against `__gnu_internal::freelist_key' can not > > be > > > used when making a shared object; recompile with -fPIC > > > /usr/lib/gcc/x86_64-redhat-linux/4.0.0/libstdc++.a: could not read > > > symbols: Bad value > > > collect2: ld returned 1 exit status > > > make[3]: *** [libibdmcom.la] Error 1 > > > make[3]: Leaving directory > > > `/var/tmp/OFEDRPM/BUILD/ibutils- 1.0/ibdm/datamodel' > > > make[2]: *** [all-recursive] Error 1 > > > make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils-1.0/ibdm' > > > make[1]: *** [all] Error 2 > > > make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ibutils- 1.0/ibdm' > > > make: *** [all-recursive] Error 1 > > > error: Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > > > > > > > > > RPM build errors: > > > Bad exit status from /var/tmp/rpm-tmp.16738 (%install) > > > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > > > /var/tmp/OFEDRPM' --define 'configure_options --prefix=/usr/local/ofed > > > --mandir=/usr/local/ofed/share/man > > > --cache-file=/var/tmp/OFED/ibutils.cache > > > --with-osm=/var/tmp/OFED/usr/local/ofed --enable-ibmgtsim' --define > > > '_prefix /usr/local/ofed' --define '_libdir /usr/local/ofed/lib64' > > > --define '_mandir %{_prefix}/share/man' --define 'build_root > > > /var/tmp/OFED' /home/snedigar/OFED-1.0/SRPMS/ibutils-1.0-0.src.rpm" > > > > > > Can anyone shed any light on this ? > > > > > > Machine is dual Opteron, 2 gig memory, kernel 2.6.16 > > > > > > Don Snedigar > > > Calpont Corp. 
> > > 214-618-9516 > > > > > > > > > > > > > > > > > ------------------------------------------------------------------------ > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bpradip at in.ibm.com Fri Jun 23 09:09:50 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Fri, 23 Jun 2006 21:39:50 +0530 Subject: [openib-general] resend [PATCH] rping.c: Fix hang if either the server or the client exits early In-Reply-To: <1151071344.7808.42.camel@stevo-desktop> References: <20060622192259.GA24588@harry-potter.ibm.com> <1151007847.3040.51.camel@stevo-desktop> <449BE393.3020308@in.ibm.com> <1151071344.7808.42.camel@stevo-desktop> Message-ID: <449C124E.4050308@in.ibm.com> Steve Wise wrote: > On Fri, 2006-06-23 at 18:20 +0530, Pradipta Kumar Banerjee wrote: >> Steve Wise wrote: >>> The goal of adding the return codes was so that the rping program could >>> exit with a status indicating success or failure. Every rping run >>> results in a DISCONNECT event, so I don't think we want to treat that >>> case as an error. >> DISCONNECT event will be generated when the connection is closed or in case of >> some error (like CCAE_LLP_CONNECTION_LOST, CCAE_BAD_CLOSE in case of Ammasso >> driver etc). > > You'll also get the DISCONNECT event when one side finished the rping > loops and does rdma_disconnect(). So receiving that event isn't > necessarily an error... 
Yes definitely, but this event can _also_ be received due to errors!! > > >>> Also, can you explain why this fixes Amith's problem, which sounded like >>> a process was hanging? >>> >> On debugging I found that the main thread was blocked in ibv_destroy_cq(), >> cm_thread was blocked in rdma_get_cm_event->write() and cq_thread was blocked in >> ibv_get_cq_event->read >> Taking the return value of the DISCONNECT event into consideration forcefully >> killed the process. >> On delving deeper into this problem, I think that there is more to this rping >> hang. Let me work on this further. >> > > I think rping needs some coordination on these threads and when they > should be killed. > Right.. Thanks, Pradipta >> On a related note - I noticed another rping hang in the following case >> - Start the rping as a client without first starting an rping server >> - If you are lucky the first run itself will result in the 'lt-rping' process in >> 'D' state. If not repeating the procedure will result in the hang. >> >> This is the o/p. >> >> cq completion failed status 5 >> wait for CONNECTED state 10 >> connect error -1 >> >> Thanks, >> Pradipta. >> >>> Thanks, >>> >>> Steve. >>> >>> >>> >>> On Fri, 2006-06-23 at 00:53 +0530, Pradipta Kumar Banerjee wrote: From iod00d at hp.com Fri Jun 23 10:14:57 2006 From: iod00d at hp.com (Grant Grundler) Date: Fri, 23 Jun 2006 10:14:57 -0700 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. 
In-Reply-To: <1151071471.3204.12.camel@laptopd505.fenrus.org> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> <1151070290.7808.33.camel@stevo-desktop> <1151070532.3204.10.camel@laptopd505.fenrus.org> <1151071005.7808.39.camel@stevo-desktop> <1151071471.3204.12.camel@laptopd505.fenrus.org> Message-ID: <20060623171457.GA3610@esmail.cup.hp.com> On Fri, Jun 23, 2006 at 04:04:31PM +0200, Arjan van de Ven wrote: > > I thought the posted write WILL eventually get to adapter memory. Not > > stall forever cached in a bridge. I'm wrong? > > I'm not sure there is a theoretical upper bound.... I'm not aware of one either since MMIO writes can travel across many other chips that are not constrained by PCI ordering rules (I'm thinking of SGI Altix...) > (and if it's several msec per bridge, then you have a lot of latency > anyway) That's what my original concern was when I saw you point this out. But MMIO reads here would be expensive and many drivers tolerate this latency in exchange for avoiding the MMIO read in the performance path. grant From Don.Albert at Bull.com Fri Jun 23 10:14:31 2006 From: Don.Albert at Bull.com (Don.Albert at Bull.com) Date: Fri, 23 Jun 2006 10:14:31 -0700 Subject: [openib-general] Stopping Infiniband kernel modules from loading at system boot Message-ID: Short of uninstalling the OFED-1.0 release, how can I stop the Infiniband related kernel modules from loading at system boot? I am trying to debug a problem with programs hanging in the kernel, so I thought that I would try manually loading the modules one at a time to see if I could isolate the problem. This is on a RHEL4 U3 system with the 2.6.16 kernel and the OFED-1.0 release installed. I used "/sbin/chkconfig" to turn off the "openibd" and "opensmd" services in the /etc/rc.d/ runlevel files. I even removed "ifcfg-ib0" and "ifcfg-ib1" from the /etc/sysconfig/networking-scripts directory. 
I don't see any other scripts that would cause these modules to be loaded. But every time I reboot, I get the following modules loaded, according to /sbin/lsmod: ib_mthca 117424 0 ib_mad 35896 1 ib_mthca ib_core 45952 2 ib_mthca,ib_mad What have I missed? -Don Albert- -------------- next part -------------- An HTML attachment was scrubbed... URL: From boris at mellanox.com Fri Jun 23 10:57:38 2006 From: boris at mellanox.com (Boris Shpolyansky) Date: Fri, 23 Jun 2006 10:57:38 -0700 Subject: [openib-general] Stopping Infiniband kernel modules from loading at system boot Message-ID: <1E3DCD1C63492545881FACB6063A57C13242D1@mtiexch01.mti.com> Hi Don, I believe you need to disable the "hotplug" loading of the Infiniband drivers by putting the modules you have listed into /etc/hotplug/blacklist file. Please, let me know if this helped. Regards, Boris Shpolyansky Application Engineer Mellanox Technologies Inc. 2900 Stender Way Santa Clara, CA 95054 Tel.: (408) 916 0014 Fax: (408) 970 3403 Cell: (408) 834 9365 www.mellanox.com ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Don.Albert at Bull.com Sent: Friday, June 23, 2006 10:15 AM To: openfabrics-ewg at openib.org; openib-general at openib.org Subject: [openib-general] Stopping Infiniband kernel modules from loading at system boot Short of uninstalling the OFED-1.0 release, how can I stop the Infiniband related kernel modules from loading at system boot? I am trying to debug a problem with programs hanging in the kernel, so I thought that I would try manually loading the modules one at a time to see if I could isolate the problem. This is on a RHEL4 U3 system with the 2.6.16 kernel and the OFED-1.0 release installed. I used "/sbin/chkconfig" to turn off the "openibd" and "opensmd" services in the /etc/rc.d/ runlevel files. I even removed "ifcfg-ib0" and "ifcfg-ib1" from the /etc/sysconfig/networking-scripts directory. 
I don't see any other scripts that would cause these modules to be loaded. But every time I reboot, I get the following modules loaded, according to /sbin/lsmod: ib_mthca 117424 0 ib_mad 35896 1 ib_mthca ib_core 45952 2 ib_mthca,ib_mad What have I missed? -Don Albert- -------------- next part -------------- An HTML attachment was scrubbed... URL: From pw at osc.edu Fri Jun 23 11:19:59 2006 From: pw at osc.edu (Pete Wyckoff) Date: Fri, 23 Jun 2006 14:19:59 -0400 Subject: [openib-general] amso userspace Message-ID: <20060623181959.GA21488@osc.edu> Having seen your four patchsets recently, I thought I'd give amso in openib another shot, r8187 is what I'm looking at now. Here's a few questions for you. (I did ccflash2 to the fw in ogc kit 20060308 already, and use the boot_image in there too.) Should I expect to be able to use the kernel directories in branches/iwarp directly with linux-2.6.17.1? It looks like your branch may be out of date with respect to trunk for a few files. I used it anyway and it does seem to build and run. In the userspace source, amso_create_qp limits max_send_sge and max_recv_sge to 4. Is this really the hardware limit? It seems quite low. Should I expect the examples in branches/iwarp/src/userspace/libibverbs/examples to work? I was hoping to use rc_pingpong.c as a way to understand what was going wrong with my code, but it exits when it finds that its local lid is zero (line 578). 
One spot in my code where I'm trying to understand why libamso errors is this transition to RTS (not using rdmacm, just bringing it up by hand): /* transition qp to ready-to-send */ mask = IBV_QP_STATE | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY; memset(&attr, 0, sizeof(attr)); attr.qp_state = IBV_QPS_RTS; attr.sq_psn = 0; attr.max_rd_atomic = 1; attr.timeout = 26; /* 4.096us * 2^26 = 5 min */ attr.retry_cnt = 20; attr.rnr_retry = 20; ret = ibv_modify_qp(qp, &attr, mask); if (ret) error_xerrno(ret, "%s: ibv_modify_qp RTR -> RTS", __func__); The return value is 11, EAGAIN. With C2_DEBUG on, the kernel says to the console: c2: c2_qp_modify:145 qp=ffff81003e71f180, IB_QPS_RTR --> IB_QPS_RTS c2: c2_qp_modify: c2_errno=-11 c2: c2_qp_modify:243 qp=ffff81003e71f180, cur_state=IB_QPS_RTR I'm guessing one of those values must be off, but can't see where anything is enforced in the lib or kernel driver. Some of these fields don't make sense for non-IB fabrics. Just using a mask of IBV_QP_STATE caused the same return value. Can you see the problem right off? (This code does work fine on mthca.) Thanks, -- Pete From swise at opengridcomputing.com Fri Jun 23 11:26:24 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 13:26:24 -0500 Subject: [openib-general] amso userspace In-Reply-To: <20060623181959.GA21488@osc.edu> References: <20060623181959.GA21488@osc.edu> Message-ID: <1151087184.7808.67.camel@stevo-desktop> > Should I expect to be able to use the kernel directories in branches/iwarp > directly with linux-2.6.17.1? It looks like your branch may be out > of date with respect to trunk for a few files. I used it anyway and > it does seem to build and run. I haven't tried branches/iwarp with a 2.6.17 kernel. It works fine with 2.6.16 though, and I expect it to work fine in 2.6.17. The branch is a snapshot of the main trunk and we only update it occasionally. 
> > In the userspace source, amso_create_qp limits max_send_sge and > max_recv_sge to 4. Is this really the hardware limit? It seems > quite low. > Yep. That's a HW limit. > Should I expect the examples in > branches/iwarp/src/userspace/libibverbs/examples to work? I was > hoping to use rc_pingpong.c as a way to understand what was going > wrong with my code, but it exits when it finds that its local lid is > zero (line 578). Those examples only work for IB transports. The examples in librdma/examples will run over iwarp because they utilize the RDMA CMA. > > One spot in my code where I'm trying to understand why libamso > errors is this transition to RTS (not using rdmacm, just bringing > it up by hand): > > /* transition qp to ready-to-send */ > mask = > IBV_QP_STATE > | IBV_QP_SQ_PSN > | IBV_QP_MAX_QP_RD_ATOMIC > | IBV_QP_TIMEOUT > | IBV_QP_RETRY_CNT > | IBV_QP_RNR_RETRY; > memset(&attr, 0, sizeof(attr)); > attr.qp_state = IBV_QPS_RTS; > attr.sq_psn = 0; > attr.max_rd_atomic = 1; > attr.timeout = 26; /* 4.096us * 2^26 = 5 min */ > attr.retry_cnt = 20; > attr.rnr_retry = 20; > ret = ibv_modify_qp(qp, &attr, mask); > if (ret) > error_xerrno(ret, "%s: ibv_modify_qp RTR -> RTS", __func__); > > The return value is 11, EAGAIN. > > With C2_DEBUG on, the kernel says to the console: > > c2: c2_qp_modify:145 qp=ffff81003e71f180, IB_QPS_RTR --> IB_QPS_RTS > c2: c2_qp_modify: c2_errno=-11 > c2: c2_qp_modify:243 qp=ffff81003e71f180, cur_state=IB_QPS_RTR > > I'm guessing one of those values must be off, but can't see where > anything is enforced in the lib or kernel driver. Some of these > fields don't make sense for non-IB fabrics. Just using a mask of > IBV_QP_STATE caused the same return value. Can you see the problem > right off? (This code does work fine on mthca.) > You need to use librdmacm to setup iwarp connections. That's the only way it will work for the amso device. See librdma/examples. 
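To make the distinction concrete for readers following the thread: with the RDMA CMA, the application never issues the RTR/RTS transitions itself the way the hand-rolled ibv_modify_qp() code above does — rdma_connect()/rdma_accept() drive the QP state machine from CM events. A rough pseudocode sketch of the client-side sequence, modeled loosely on the cmatose and rping examples (error handling, attribute setup, and event-channel plumbing omitted; treat the exact signatures as approximate rather than an API reference):

```
/* pseudocode -- see librdma/examples for the real, complete versions */
channel = rdma_create_event_channel();
rdma_create_id(channel, &cm_id, app_context);
rdma_resolve_addr(cm_id, NULL, server_sockaddr, timeout_ms);
wait_for_event(channel, RDMA_CM_EVENT_ADDR_RESOLVED);
rdma_resolve_route(cm_id, timeout_ms);
wait_for_event(channel, RDMA_CM_EVENT_ROUTE_RESOLVED);
/* allocate PD and CQ, then let the CMA own the QP transitions */
rdma_create_qp(cm_id, pd, &qp_init_attr);
rdma_connect(cm_id, &conn_param);
wait_for_event(channel, RDMA_CM_EVENT_ESTABLISHED);
/* ... post sends/recvs, poll the CQ ... */
rdma_disconnect(cm_id);
```

Because the transport-specific state transitions happen inside the library, the same application code can run unchanged over both mthca and the amso device.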
I also posted a patch to perftest/rdma_lat.c and rdma_bw.c that added a -c option to utilize the RDMA CMA. The patch didn't get pulled in, however... Steve. From krause at cup.hp.com Fri Jun 23 11:02:31 2006 From: krause at cup.hp.com (Michael Krause) Date: Fri, 23 Jun 2006 11:02:31 -0700 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. In-Reply-To: <20060623171457.GA3610@esmail.cup.hp.com> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> <1151070290.7808.33.camel@stevo-desktop> <1151070532.3204.10.camel@laptopd505.fenrus.org> <1151071005.7808.39.camel@stevo-desktop> <1151071471.3204.12.camel@laptopd505.fenrus.org> <20060623171457.GA3610@esmail.cup.hp.com> Message-ID: <6.2.0.14.2.20060623105755.0201e650@esmail.cup.hp.com> At 10:14 AM 6/23/2006, Grant Grundler wrote: >On Fri, Jun 23, 2006 at 04:04:31PM +0200, Arjan van de Ven wrote: > > > I thought the posted write WILL eventually get to adapter memory. Not > > > stall forever cached in a bridge. I'm wrong? > > > > I'm not sure there is a theoretical upper bound.... > >I'm not aware of one either since MMIO writes can travel >across many other chips that are not constrained by >PCI ordering rules (I'm thinking of SGI Altix...) It is processor / coherency backplane technology specific as to the number of outstanding writes. There is also no guarantee that such writes will hit the top of the PCI hierarchy in the order they were posted in a multi-core / processor system. Hence, it is up to software to guarantee that ordering is preserved and to not assume anything about ordering from a hardware perspective. Once a transaction hits the PCI hierarchy, then the PCI ordering rules apply and depending upon the transaction type and other rules, what is guaranteed is deterministic in nature. 
> > (and if it's several msec per bridge, then you have a lot of latency > > anyway) > >That's what my original concern was when I saw you point this out. >But MMIO reads here would be expensive and many drivers tolerate >this latency in exchange for avoiding the MMIO read in the >performance path. As the saying goes, MMIO Reads are "pure evil" and should be avoided at all costs if performance is the goal. Even in a relatively flat I/O hierarchy, the additional latency is non-trivial and can lead to a significant loss in performance for the system. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From dledford at redhat.com Fri Jun 23 11:44:46 2006 From: dledford at redhat.com (Doug Ledford) Date: Fri, 23 Jun 2006 14:44:46 -0400 Subject: [openib-general] [openfabrics-ewg] Stopping Infiniband kernel modules from loading at system boot In-Reply-To: References: Message-ID: <1151088287.22762.14.camel@fc5.xsintricity.com> On Fri, 2006-06-23 at 10:14 -0700, Don.Albert at Bull.com wrote: > > Short of uninstalling the OFED-1.0 release, how can I stop the > Infiniband related kernel modules from loading at system boot? > > I am trying to debug a problem with programs hanging in the kernel, > so I thought that I would try manually loading the modules one at a > time to see if I could isolate the problem. This is on a RHEL4 U3 > system with the 2.6.16 kernel and the OFED-1.0 release installed. > > I used "/sbin/chkconfig" to turn off the "openibd" and "opensmd" > services in the /etc/rc.d/runlevel files. I even removed "ifcfg-ib0" > and "ifcfg-ib1" from the /etc/sysconfig/networking-scriptsdirectory. > I don't see any other scripts that would cause these modules to be > loaded. But every time I reboot, I get the following modules loaded, > according to /sbin/lsmod: > > ib_mthca 117424 0 > ib_mad 35896 1 ib_mthca > ib_core 45952 2 ib_mthca,ib_mad > > What have I missed? 
/etc/rc.d/rc.sysinit In the sysinit we load all the modules required to support the hardware in the system (that's when it prints the Initializing hardware: storage network sound other [OK] message). In order to stop that you have to move the modules out of the way. But, I'm a bit surprised that ib_mad is loaded as that doesn't seem a hard dependency for ib_mthca (or more appropriately, I'm surprised to see ib_mad and not a bunch of other ib modules as well, check the /etc/modprobe.conf and /etc/modprobe.conf.dist to see if there are rules to force lots of ib modules to be loaded any time ib_core is loaded). -- Doug Ledford http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband From pw at osc.edu Fri Jun 23 11:56:25 2006 From: pw at osc.edu (Pete Wyckoff) Date: Fri, 23 Jun 2006 14:56:25 -0400 Subject: [openib-general] amso userspace In-Reply-To: <1151087184.7808.67.camel@stevo-desktop> References: <20060623181959.GA21488@osc.edu> <1151087184.7808.67.camel@stevo-desktop> Message-ID: <20060623185625.GB21488@osc.edu> swise at opengridcomputing.com wrote on Fri, 23 Jun 2006 13:26 -0500: > You need to use librdmacm to setup iwarp connections. That's the only > way it will work for the amso device. See librdma/examples. I also > posted a patch to perftest/rdma_lat.c and rdma_bw.c that added a -c > option to utilize the RDMA CMA. The patch didn't get pulled in, > however... Thanks for the clarification. rping and cmatose from the iwarp branch work fine. (The trunk versions are slightly different.) I'll have to think about whether I'm willing to switch over to rdmacm just yet. I was hoping to stick with my hand-rolled TCP-based connection setup, but understand why that is not possible if I want to support iwarp gear on the same code base. Having libamso and librdmacm show up in fedora-extras would definitely help us make the transition, if you want the nudging. :) Thanks for all the iwarp work. 
-- Pete From pradeep at us.ibm.com Fri Jun 23 12:26:56 2006 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Fri, 23 Jun 2006 12:26:56 -0700 Subject: [openib-general] Stopping Infiniband kernel modules from loading at system boot In-Reply-To: Message-ID: I was encountering some similar problems in the past and I put an entry into /etc/hotplug/blacklist like the following: # Mellanox InfiniBand ib_mthca This has worked for me. Pradeep pradeep at us.ibm.com Don.Albert at Bull.com Sent by: openib-general-bounces at openib.org 06/23/2006 10:14 AM To openfabrics-ewg at openib.org, openib-general at openib.org cc Subject [openib-general] Stopping Infiniband kernel modules from loading at system boot Short of uninstalling the OFED-1.0 release, how can I stop the Infiniband related kernel modules from loading at system boot? I am trying to debug a problem with programs hanging in the kernel, so I thought that I would try manually loading the modules one at a time to see if I could isolate the problem. This is on a RHEL4 U3 system with the 2.6.16 kernel and the OFED-1.0 release installed. I used "/sbin/chkconfig" to turn off the "openibd" and "opensmd" services in the /etc/rc.d/ runlevel files. I even removed "ifcfg-ib0" and "ifcfg-ib1" from the /etc/sysconfig/networking-scripts directory. I don't see any other scripts that would cause these modules to be loaded. But every time I reboot, I get the following modules loaded, according to /sbin/lsmod: ib_mthca 117424 0 ib_mad 35896 1 ib_mthca ib_core 45952 2 ib_mthca,ib_mad What have I missed? -Don Albert- _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From swise at opengridcomputing.com Fri Jun 23 13:31:37 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 15:31:37 -0500 Subject: [openib-general] [PATCH v2 2/2] iWARP changes to librdmacm. In-Reply-To: <20060620200312.20092.87834.stgit@stevo-desktop> References: <20060620200304.20092.44110.stgit@stevo-desktop> <20060620200312.20092.87834.stgit@stevo-desktop> Message-ID: <1151094697.7808.82.camel@stevo-desktop> Sean, Are these changes acceptable? Steve. On Tue, 2006-06-20 at 15:03 -0500, Steve Wise wrote: > For iWARP, rdma_disconnect() moves the QP to SQD instead of ERR. The > iWARP providers map SQD to the RDMAC verbs CLOSING state. > --- > > librdmacm/src/cma.c | 22 +++++++++++++++++++++- > 1 files changed, 21 insertions(+), 1 deletions(-) > > diff --git a/librdmacm/src/cma.c b/librdmacm/src/cma.c > index e99d15c..a250f69 100644 > --- a/librdmacm/src/cma.c > +++ b/librdmacm/src/cma.c > @@ -633,6 +633,17 @@ static int ucma_modify_qp_rts(struct rdm > return ibv_modify_qp(id->qp, &qp_attr, qp_attr_mask); > } > > +static int ucma_modify_qp_sqd(struct rdma_cm_id *id) > +{ > + struct ibv_qp_attr qp_attr; > + > + if (!id->qp) > + return 0; > + > + qp_attr.qp_state = IBV_QPS_SQD; > + return ibv_modify_qp(id->qp, &qp_attr, IBV_QP_STATE); > +} > + > static int ucma_modify_qp_err(struct rdma_cm_id *id) > { > struct ibv_qp_attr qp_attr; > @@ -881,7 +892,16 @@ int rdma_disconnect(struct rdma_cm_id *i > void *msg; > int ret, size; > > - ret = ucma_modify_qp_err(id); > + switch (ibv_get_transport_type(id->verbs)) { > + case IBV_TRANSPORT_IB: > + ret = ucma_modify_qp_err(id); > + break; > + case IBV_TRANSPORT_IWARP: > + ret = ucma_modify_qp_sqd(id); > + break; > + default: > + ret = -EINVAL; > + } > if (ret) > return ret; > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit 
http://openib.org/mailman/listinfo/openib-general From swise at opengridcomputing.com Fri Jun 23 13:32:33 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 Jun 2006 15:32:33 -0500 Subject: [openib-general] [PATCH v2 1/2] iWARP changes to libibverbs. In-Reply-To: References: <20060620200304.20092.44110.stgit@stevo-desktop> <20060620200308.20092.76324.stgit@stevo-desktop> Message-ID: <1151094753.7808.84.camel@stevo-desktop> On Tue, 2006-06-20 at 15:27 -0700, Roland Dreier wrote: > Looks pretty good. I'll get this into the libibverbs development tree > soon (I'm working on the MADV_DONTFORK stuff right now). > > - R. Sounds good. Once you commit the libibverbs changes, we can commit the librdma changes that depend on them (assume everyone agrees to the changes). Stevo. From Don.Albert at Bull.com Fri Jun 23 13:35:53 2006 From: Don.Albert at Bull.com (Don.Albert at Bull.com) Date: Fri, 23 Jun 2006 13:35:53 -0700 Subject: [openib-general] Stopping Infiniband kernel modules from loading at system boot In-Reply-To: Message-ID: Thanks to Pradeep Satyanarayana and Boris Shpolyansky for suggesting that I add an entry to /etc/hotplug/blacklist, but I thought that the "/etc/hotplug" stuff was replaced in the latest kernels with "/etc/udev" functionality. Is this not true? -Don Albert- -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Fri Jun 23 14:22:14 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Jun 2006 17:22:14 -0400 Subject: [openib-general] [openfabrics-ewg] Stopping Infiniband kernel modules from loading at system boot In-Reply-To: <1151088287.22762.14.camel@fc5.xsintricity.com> References: <1151088287.22762.14.camel@fc5.xsintricity.com> Message-ID: <1151097726.4391.300562.camel@hal.voltaire.com> Hi Doug, On Fri, 2006-06-23 at 14:44, Doug Ledford wrote: > On Fri, 2006-06-23 at 10:14 -0700, Don.Albert at Bull.com wrote: > > > > Short of uninstalling the OFED-1.0 release, how can I stop the > > Infiniband related kernel modules from loading at system boot? > > > > I am trying to debug a problem with programs hanging in the kernel, > > so I thought that I would try manually loading the modules one at a > > time to see if I could isolate the problem. This is on a RHEL4 U3 > > system with the 2.6.16 kernel and the OFED-1.0 release installed. > > > > I used "/sbin/chkconfig" to turn off the "openibd" and "opensmd" > > services in the /etc/rc.d/runlevel files. I even removed "ifcfg-ib0" > > and "ifcfg-ib1" from the /etc/sysconfig/networking-scriptsdirectory. > > I don't see any other scripts that would cause these modules to be > > loaded. But every time I reboot, I get the following modules loaded, > > according to /sbin/lsmod: > > > > ib_mthca 117424 0 > > ib_mad 35896 1 ib_mthca > > ib_core 45952 2 ib_mthca,ib_mad > > > > What have I missed? > > /etc/rc.d/rc.sysinit > > In the sysinit we load all the modules required to support the hardware > in the system (that's when it prints the Initializing hardware: storage > network sound other [OK] message). In order to stop that you have to > move the modules out of the way. 
But, I'm a bit surprised that ib_mad > is loaded as that doesn't seem a hard dependancy for ib_mthca (or more > appropriately, I'm surprised to see ib_mad ib_mthca.ko has a number of unresolved symbols which are in the MAD module (ib_mad.ko) as well as IB core (ib_core.ko) so those are loaded when mthca is. If you look in /lib/modules/2.6.n/modules.dep, you should see these dependencies for ib_mthca. -- Hal > and not a bunch of other ib > modules as well, check the /etc/modprobe.conf > and /etc/modprobe.conf.dist to see if there are rules to force lots of > ib modules to be loaded any time ib_core is loaded). From Don.Albert at Bull.com Fri Jun 23 14:36:13 2006 From: Don.Albert at Bull.com (Don.Albert at Bull.com) Date: Fri, 23 Jun 2006 14:36:13 -0700 Subject: [openib-general] Link Initialization problem and hangs in MTHCA on OFED-1.0 Message-ID: I was corresponding with Hal Rosenstock about this problem, but he suggested that I resubmit to a wider audience. The previous messages are under the subject of "How do I use "madeye" to diagnose a problem?". I was trying to use "madeye" to find out if any MAD packets were being received by a node in which the link fails to initialize. I have a small two-node testbed system which consists of two EM64T machines ("koa" and "jatoba") cabled back-to-back with two Mellanox MT25204 (4x DDR) HCAs. This configuration worked with a backported 2.6.11-34 kernel and revision 6500 from the OpenIB svn trunk. I was able to run basic tests and several sets of MPI benchmarks. Since moving to a "2.6.16" kernel and the OFED-1.0 release, we cannot get the link on the "jatoba" machine to come up. The "madeye" module seems to show that no MAD packets are being received when the Subnet Manager is run on the other machine. When I try to run SM on "jatoba", or try to run any other program that uses MAD, I get process hangs. 
Here is a portion of the stack traces for one of the hung processes, obtained by doing "echo t > /proc/sysrq-trigger" and looking at the dmesg output. ibis D 0000000000000003 0 5489 5097 5522 (NOTLB) ffff8100788c7d28 ffff810037cb9030 ffff8100788c7c78 ffff81007c606640 ffffffff803c1b65 0000000000000001 ffffffff801350ce ffff810003392418 ffff8100788c6000 ffff8100788c7cb8 Call Trace: {_spin_lock_irqsave+14} {lock_timer_base+27} {:ib_mthca:mthca_table_put+65} {_spin_unlock_irq+9} {wait_for_completion+179} {default_wake_function+0} {default_wake_function+0} {:ib_mad:ib_cancel_rmpp_recvs+144} {:ib_mad:ib_unregister_mad_agent+1019} {:ib_umad:ib_umad_ioctl+564} {autoremove_wake_function+0} {do_ioctl+45} {vfs_ioctl+658} {mntput_no_expire+28} {sys_ioctl+60} {system_call+126} It seems to be a lock or mutex problem, but I don't know how to proceed from here. Some things I have tried are: Connecting the two machines to a switch instead of back-to-back, to use the SM in the switch. The link to "koa" comes up, but the link to "jatoba" does not. Physically swapping the two HCAs between the two machines: the problem stays on the "jatoba" side. Turning on "debug_level" traces with "modprobe ib_mthca debug_level=1" on both machines. The traces seem to be identical on both, except for the actual PCI bus location and the memory addresses being mapped. No additional traces are generated when the hangs occur. The machines are both EM64T but are not identical. The "koa" side has the HCA on PCI "06:00.0", and the "jatoba" side has the HCA on "03:00.0". The two machines are: koa (the working one) is an Intel SE7520BD2 motherboard (7520 chip set). jatoba (the bad one) is an Intel SE7525GP2 motherboard (7525 chip set). Can anyone suggest what to try or look at next? -Don Albert- -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dledford at redhat.com Fri Jun 23 14:52:46 2006 From: dledford at redhat.com (Doug Ledford) Date: Fri, 23 Jun 2006 17:52:46 -0400 Subject: [openib-general] [openfabrics-ewg] Stopping Infiniband kernel modules from loading at system boot In-Reply-To: <1151097726.4391.300562.camel@hal.voltaire.com> References: <1151088287.22762.14.camel@fc5.xsintricity.com> <1151097726.4391.300562.camel@hal.voltaire.com> Message-ID: <1151099566.22762.21.camel@fc5.xsintricity.com> On Fri, 2006-06-23 at 17:22 -0400, Hal Rosenstock wrote: > > > ib_mthca 117424 0 > > > ib_mad 35896 1 ib_mthca ^^^^^^^^ > > > ib_core 45952 2 ib_mthca,ib_mad > > In the sysinit we load all the modules required to support the hardware > > in the system (that's when it prints the Initializing hardware: storage > > network sound other [OK] message). In order to stop that you have to > > move the modules out of the way. But, I'm a bit surprised that ib_mad > > is loaded as that doesn't seem a hard dependency for ib_mthca (or more > > appropriately, I'm surprised to see ib_mad > > ib_mthca.ko has a number of unresolved symbols which are in the MAD > module (ib_mad.ko) as well as IB core (ib_core.ko) so those are loaded > when mthca is. If you look in /lib/modules/2.6.n/modules.dep, you should > see these dependencies for ib_mthca. Yeah, if I hadn't been spacing during my reply I would have noticed what I highlighted above.... 
-- Doug Ledford http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband From sean.hefty at intel.com Fri Jun 23 16:33:19 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 23 Jun 2006 16:33:19 -0700 Subject: [openib-general] mckey program In-Reply-To: Message-ID: <000001c6971d$69cdab50$1c781cac@amr.corp.intel.com> >I was checking the mckey.c program for IB. >I did some quick check and found that the rdma_resolve_addr function >is invoking the cma_handler with erroneous event. > >mckey: event: 1, error: -19 > >Is there any easy way to check what might be happening? Try adding a route for 224.0.0.1 to the ipoib dev. - Sean From sean.hefty at intel.com Fri Jun 23 16:40:11 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 23 Jun 2006 16:40:11 -0700 Subject: [openib-general] [PATCH v2 2/2] iWARP changes to librdmacm. In-Reply-To: <1151094697.7808.82.camel@stevo-desktop> Message-ID: <000601c6971e$5fbc0b10$1c781cac@amr.corp.intel.com> >Are these changes acceptable? These look fine to commit by me. - Sean From pradeep at us.ibm.com Fri Jun 23 16:45:36 2006 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Fri, 23 Jun 2006 16:45:36 -0700 Subject: [openib-general] Stopping Infiniband kernel modules from loading at system boot In-Reply-To: Message-ID: I am using a slightly older kernel -2.6.16-rc2 and it works for me. Pradeep pradeep at us.ibm.com Don.Albert at Bull.com Sent by: openib-general-bounces at openib.org 06/23/2006 01:35 PM To openfabrics-ewg at openib.org, openib-general at openib.org cc Subject Re: [openib-general] Stopping Infiniband kernel modules from loading at system boot Thanks to Pradeep Satyanarayana and Boris Shpolyansky for suggesting that I add an entry to /etc/hotplug/blacklist, but I thought that the "/etc/hotplug" stuff was replaced in the latest kernels with "/etc/udev" functionality. Is this not true? 
-Don Albert- _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From iod00d at hp.com Fri Jun 23 10:14:57 2006 From: iod00d at hp.com (Grant Grundler) Date: Fri, 23 Jun 2006 10:14:57 -0700 Subject: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver. In-Reply-To: <1151071471.3204.12.camel@laptopd505.fenrus.org> References: <20060620203050.31536.5341.stgit@stevo-desktop> <20060620203055.31536.15131.stgit@stevo-desktop> <1150836226.2891.231.camel@laptopd505.fenrus.org> <1151070290.7808.33.camel@stevo-desktop> <1151070532.3204.10.camel@laptopd505.fenrus.org> <1151071005.7808.39.camel@stevo-desktop> <1151071471.3204.12.camel@laptopd505.fenrus.org> Message-ID: <20060623171457.GA3610@esmail.cup.hp.com> On Fri, Jun 23, 2006 at 04:04:31PM +0200, Arjan van de Ven wrote: > > I thought the posted write WILL eventually get to adapter memory. Not > > stall forever cached in a bridge. I'm wrong? > > I'm not sure there is a theoretical upper bound.... I'm not aware of one either since MMIO writes can travel across many other chips that are not constrained by PCI ordering rules (I'm thinking of SGI Altix...) > (and if it's several msec per bridge, then you have a lot of latency > anyway) That's what my original concern was when I saw you point this out. But MMIO reads here would be expensive and many drivers tolerate this latency in exchange for avoiding the MMIO read in the performance path. 
grant - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo at vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ From halr at voltaire.com Sat Jun 24 05:39:15 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Jun 2006 08:39:15 -0400 Subject: [openib-general] mckey program In-Reply-To: <000001c6971d$69cdab50$1c781cac@amr.corp.intel.com> References: <000001c6971d$69cdab50$1c781cac@amr.corp.intel.com> Message-ID: <1151152754.4482.7620.camel@hal.voltaire.com> On Fri, 2006-06-23 at 19:33, Sean Hefty wrote: > >I was checking the mckey.c program for IB. > >I did some quick check and found that the rdma_resolve_addr function > >is invoking the cma_handler with erroneous event. > > > >mckey: event: 1, error: -19 > > > >Is there any easy way to check what might be happening? > > Try adding a route for 224.0.0.1 to the ipoib dev. Could it also be done without adding the route but using the bind_addr option on mckey? mckey {s[end] | r[ecv]} mcast_addr [bind_addr] -- Hal > > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Sun Jun 25 12:29:34 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 25 Jun 2006 22:29:34 +0300 Subject: [openib-general] [PATCH] osm: pkey manager fail to clear unused block Message-ID: <86hd296nkx.fsf@mtl066.yok.mtl.com> Hi Hal Did some basic testing with the simulator. It seems that the manager misses all the blocks that need to be cleared (as they used to have some pkeys that are not part of the policy). The following trivial patch fixes the issue. The bug sneaked in through the last few patches.
The fix avoids skipping cases where the new pkey block is null and for that sake requires an empty block to be used in the lower level function. Eitan Signed-off-by: Eitan Zahavi Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 8189) +++ opensm/osm_pkey_mgr.c (working copy) @@ -194,10 +194,12 @@ pkey_mgr_update_pkey_entry( IN const ib_pkey_table_t *block, IN const uint16_t block_index ) { + ib_pkey_table_t empty_block = {0, }; osm_madw_context_t context; osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); uint32_t attr_mod; + if (!block) block = &empty_block; context.pkey_context.node_guid = osm_node_get_node_guid( p_node ); context.pkey_context.port_guid = osm_physp_get_port_guid( p_physp ); context.pkey_context.set_method = TRUE; @@ -360,7 +362,7 @@ static boolean_t pkey_mgr_update_port( block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - if (block && (!new_block || !memcmp( new_block, block, sizeof( *block ) ))) + if (block && new_block && !memcmp( new_block, block, sizeof( *block ) )) continue; status = pkey_mgr_update_pkey_entry( p_req, p_physp , new_block, block_index ); From eitan at mellanox.co.il Mon Jun 26 00:00:48 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 26 Jun 2006 10:00:48 +0300 Subject: [openib-general] [PATCHv2] osm: pkey manager fail to clear unused block Message-ID: <86fyhs765b.fsf@mtl066.yok.mtl.com> Hi Hal Had some second thought (slept on it) about this patch. It has a problem as it would continuously set a block if it is all empty (since the new block does not exist) The new patch fixes it by catching the case of null new_block and still comparing to the old block. I also hope I did better job on the indentation (at least I used untabify). > Done some basic test with the simulator. 
Seems that the manager miss all > blocks needed to be cleared (as they used to have some pkeys that are not part of > the policy). The following trivial patch fixes the issue. The bug sneaked in > through the last few patched. Eitan Signed-off-by: Eitan Zahavi Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 8189) +++ opensm/osm_pkey_mgr.c (working copy) @@ -276,6 +276,7 @@ static boolean_t pkey_mgr_update_port( boolean_t ret_val = FALSE; osm_pending_pkey_t *p_pending; boolean_t found; + ib_pkey_table_t empty_block = {.pkey_entry = {0}, }; p_physp = osm_port_get_default_phys_ptr( p_port ); if ( !osm_physp_is_valid( p_physp ) ) @@ -360,7 +361,8 @@ static boolean_t pkey_mgr_update_port( block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); - if (block && (!new_block || !memcmp( new_block, block, sizeof( *block ) ))) + if (!new_block) new_block = &empty_block; + if (block && !memcmp( new_block, block, sizeof( *block ) )) continue; status = pkey_mgr_update_pkey_entry( p_req, p_physp , new_block, block_index ); From jackm at mellanox.co.il Mon Jun 26 00:51:12 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Mon, 26 Jun 2006 10:51:12 +0300 Subject: [openib-general] Kernel Oops related to IPoIB (multicast module?) Message-ID: <200606261051.12515.jackm@mellanox.co.il> Problem in main trunk (SVN 8189): The following Oops occurred upon unloading the openib driver. I unloaded the driver immediately following a reboot (the driver had been loaded during the boot sequence). I did NOT run opensm before unloading the driver. Evidently, ipoib was still attempting to connect with an SA, when the ipoib module was unloaded (modprobe -r). 
After the ipoib module was unloaded (or at least rendered inaccessible), the ib_sa module attempted to invoke "ib_sa_mcmember_rec_callback" (for a callback address that was part of the unloaded ipoib module). Hence, the Oops below. The "modprobe" process in the trace below is "modprobe -r ib_sa" (After unloading ib_ipoib, we attempt to unload ib_sa). Following the Oops, I've included info on the running environment. Jack =============================================== Jun 26 10:19:56 sw134 ifdown: ib0 device: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20) Jun 26 10:19:58 sw134 kernel: Unable to handle kernel paging request at ffffffff883219dd RIP: Jun 26 10:19:58 sw134 kernel: [] Jun 26 10:19:58 sw134 kernel: PGD 103027 PUD 105027 PMD 7bd53067 PTE 0 Jun 26 10:19:58 sw134 kernel: Oops: 0010 [1] SMP Jun 26 10:19:58 sw134 kernel: last sysfs file: /devices/pci0000:00/0000:00:00.0/irq Jun 26 10:19:58 sw134 kernel: CPU 2 Jun 26 10:19:58 sw134 kernel: Modules linked in: autofs4 ipv6 ib_sa ib_uverbs ib_umad nfs lockd nfs_acl sunrpc ib_mthca ib_mad ib_core af_ packet button battery ac apparmor aamatch_pcre loop dm_mod hw_random shpchp ehci_hcd uhci_hcd i8xx_tco usbcore pci_hotplug e1000 i2c_i801 i2c_core ide_cd cdrom floppy ext3 jbd sg edd fan thermal processor ata_piix libata piix sd_mod scsi_mod ide_disk ide_core Jun 26 10:19:58 sw134 kernel: Pid: 4457, comm: modprobe Tainted: G U 2.6.16.16-1.6-smp #1 Jun 26 10:19:58 sw134 kernel: RIP: 0010:[] [] Jun 26 10:19:58 sw134 kernel: RSP: 0018:ffff81007163dd90 EFLAGS: 00010246 Jun 26 10:19:58 sw134 kernel: RAX: 0000000000000005 RBX: ffff81007d78be00 RCX: ffffffff8831747f Jun 26 10:19:58 sw134 kernel: RDX: ffff81007dec3000 RSI: 0000000000000000 RDI: 00000000fffffffc Jun 26 10:19:58 sw134 kernel: RBP: ffff810079960fd0 R08: 0000000000000206 R09: 0000000000000002 Jun 26 10:19:58 sw134 kernel: R10: ffff810001029400 R11: 0000000000000000 R12: 00000000fffffffc Jun 26 10:19:58 sw134 kernel: R13: 0000000000000000 R14: 
00000000005182a8 R15: 0000000000000000 Jun 26 10:19:58 sw134 kernel: FS: 00002ba7037ef6d0(0000) GS:ffff81007e3ab340(0000) knlGS:0000000000000000 Jun 26 10:19:58 sw134 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jun 26 10:19:58 sw134 kernel: CR2: ffffffff883219dd CR3: 0000000072da0000 CR4: 00000000000006e0 Jun 26 10:19:58 sw134 ifdown: ib0 Jun 26 10:19:58 sw134 kernel: Process modprobe (pid: 4457, threadinfo ffff81007163c000, task ffff81006fcb7040) Jun 26 10:19:58 sw134 ifdown: Interface not available and no configuration found. Jun 26 10:19:58 sw134 kernel: Stack: ffffffff883174bf 0000000000000bd4 000000027163de78 ffff81007163de80 Jun 26 10:19:58 sw134 kernel: ffff81007163de78 ffff81007d810790 ffff81007163de68 0000000000000001 Jun 26 10:19:59 sw134 kernel: 0000000000000000 ffff81007d78be00 Jun 26 10:19:59 sw134 kernel: Call Trace: {:ib_sa:ib_sa_mcmember_rec_callback+64} Jun 26 10:19:59 sw134 kernel: {:ib_sa:send_handler+72} {:ib_mad:ib_unregister_mad_agent+345} Jun 26 10:19:59 sw134 kernel: {wait_for_completion+155} {find_next_bit+85} Jun 26 10:19:59 sw134 kernel: {:ib_sa:ib_sa_remove_one+58} {:ib_core:ib_unregister_client+47} Jun 26 10:19:59 sw134 kernel: {:ib_sa:ib_sa_cleanup+16} {sys_delete_module+540} Jun 26 10:19:59 sw134 kernel: {do_munmap+619} {__up_write+33} Jun 26 10:19:59 sw134 kernel: {system_call+126} Jun 26 10:19:59 sw134 kernel: Jun 26 10:19:59 sw134 kernel: Code: Bad RIP value. 
Jun 26 10:19:59 sw134 kernel: RIP [] RSP Jun 26 10:19:59 sw134 kernel: CR2: ffffffff883219dd Jun 26 10:20:01 sw134 /usr/sbin/cron[4615]: (root) CMD (/mswg/projects/test_suite2/etc/check_daemon.csh >/dev/null) =================================== Host information given below: ************************************************************* Host Architecture : x86_64 Linux Distribution: SUSE Linux Enterprise Server 10 (x86_64) VERSION = 10 Kernel Version : 2.6.16.16-1.6-smp Memory size : 2060956 kB Driver Version : openib_gen2-20060625-1800 (REV=8189) HCA ID(s) : mthca0 HCA model(s) : 25204 FW version(s) : 1.0.800 Board(s) : MT_0230000001 ************************************************************* From bugzilla-daemon at openib.org Mon Jun 26 01:15:05 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 26 Jun 2006 01:15:05 -0700 (PDT) Subject: [openib-general] [Bug 148] New: WSD: When connecting to a remote host, with no socket listening, time out is returned Message-ID: <20060626081505.AA5A2228735@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=148 Summary: WSD: When connecting to a remote host, with no socket listening, time out is returned Product: OpenFabrics Windows Version: unspecified Platform: Other OS/Version: Other Status: NEW Severity: major Priority: P2 Component: WSD AssignedTo: bugzilla at openib.org ReportedBy: tzachid at mellanox.co.il As a result, it takes about 20 seconds for the connection to fall back to IPOIB. On TCP, the remote side will send a reset, and the connection will end in about a second. One more consequence of this problem is that when there is a fallback to IPOIB, it takes ~20 seconds to realize that no one is listening there instead of less than 1 ms. This also causes one of the WHQL tests to fail (waiting for the timeout takes too long). An investigation by Yossi showed that this issue is related to the CM. It seems that Invalid_sid was returned from the remote side and was ignored.
------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Mon Jun 26 01:16:25 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 26 Jun 2006 01:16:25 -0700 (PDT) Subject: [openib-general] [Bug 148] WSD: When connecting to a remote host, with no socket listening, time out is returned Message-ID: <20060626081625.A5C87228735@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=148 ------- Comment #1 from tzachid at mellanox.co.il 2006-06-26 01:16 ------- Created an attachment (id=27) --> (http://openib.org/bugzilla/attachment.cgi?id=27&action=view) Suggested fix The following patch by Yossi solves this problem. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Mon Jun 26 01:19:04 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 26 Jun 2006 01:19:04 -0700 (PDT) Subject: [openib-general] [Bug 148] WSD: When connecting to a remote host, with no socket listening, time out is returned Message-ID: <20060626081904.04507228738@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=148 tzachid at mellanox.co.il changed: What |Removed |Added ---------------------------------------------------------------------------- AssignedTo|bugzilla at openib.org |ftillier at silverstorm.com ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From ogerlitz at voltaire.com Mon Jun 26 03:01:18 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 26 Jun 2006 13:01:18 +0300 Subject: [openib-general] [GIT PULL] please pull infiniband.git In-Reply-To: References: Message-ID: <449FB06E.3020709@voltaire.com> Roland Dreier wrote: > Linus, please pull from > > master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus > > This tree is also available from kernel.org mirrors at: > > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus > > This is mostly merging the new iSER (iSCSI over RDMA transport) initiator: Hi Roland, Following the merge done by Linus yesterday, I have cloned, built, installed and booted the linux-2.6 tree and am now running iSER over it! Thanks a lot for all your help and guidance (&& compile error findings...) through the upstream push cycle. I'd like to thank Mike Christie for his cooperation (and patience) in integrating iSER within the open-iscsi framework and much help along the push cycle, especially for working on the (now upstream) libiscsi. iSER is the first consumer of the (now upstream) RDMA CM; I'd like to thank Sean Hefty for his cooperation in the CMA design cycle and the very fast and robust coding while implementing it. Or. From bpradip at in.ibm.com Mon Jun 26 03:24:19 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Mon, 26 Jun 2006 15:54:19 +0530 Subject: [openib-general] [PATCH 0/2] perftest: Modified perftest utils to work with new stack and libraries Message-ID: <20060626102410.GA17835@harry-potter.ibm.com> Modified the perftest utilities to work with the latest stack and libraries. This patchset consists of changes for rdma_lat and rdma_bw only.
1 - rdma_lat.c changes 2 - rdma_bw.c changes -- Thanks, Pradipta Kumar From bpradip at in.ibm.com Mon Jun 26 03:27:16 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Mon, 26 Jun 2006 15:57:16 +0530 Subject: [openib-general] [PATCH 1/2] perftest: Modified perftest utils to work with new stack and libraries Message-ID: <20060626102715.GB17835@harry-potter.ibm.com> This is the patch for rdma_lat.c Signed-off-by: Pradipta Kumar Banerjee --- Index: rdma_lat.c ============================================================================= --- ../perftest-org/rdma_lat.c 2006-06-22 18:28:13.000000000 +0530 +++ rdma_lat.c 2006-06-22 18:36:12.000000000 +0530 @@ -51,6 +51,7 @@ #include #include #include +#include #include #include @@ -83,6 +84,7 @@ struct pingpong_context { struct ibv_sge list; struct ibv_send_wr wr; struct rdma_cm_id *cm_id; + struct rdma_event_channel *cm_channel; }; struct pingpong_dest { @@ -612,11 +614,12 @@ static void pp_close_cma(struct pingpong } } - rdma_get_cm_event(&event); + rdma_get_cm_event(ctx->cm_channel, &event); if (event->event != RDMA_CM_EVENT_DISCONNECTED) printf("unexpected event during disconnect %d\n", event->event); rdma_ack_cm_event(event); rdma_destroy_id(ctx->cm_id); + rdma_destroy_event_channel(ctx->cm_channel); } static struct pingpong_context *pp_server_connect_cma(unsigned short port, int size, int tx_depth, @@ -629,17 +632,26 @@ static struct pingpong_context *pp_serve int ret; struct sockaddr_in sin; struct rdma_cm_id *child_cm_id; + struct rdma_event_channel *channel; struct pingpong_context *ctx; - + printf("%s starting server\n", __FUNCTION__); - ret = rdma_create_id(&listen_id, NULL); - if (ret) { - fprintf(stderr, "%s rdma_create_id failed %d\n", __FUNCTION__, ret); + channel = rdma_create_event_channel(); + if (!channel) { + ret = errno; + fprintf(stderr, "%s rdma_create_event_channel failed with error %d\n", + __FUNCTION__, ret); return NULL; } + ret = rdma_create_id(channel, &listen_id, NULL); + if 
(ret) { + fprintf(stderr, "%s rdma_create_id failed %d\n", __FUNCTION__, ret); + goto err3; + } + memset(&sin, 0, sizeof(sin)); sin.sin_addr.s_addr = 0; - sin.sin_family = PF_INET; + sin.sin_family = AF_INET; sin.sin_port = htons(port); ret = rdma_bind_addr(listen_id, (struct sockaddr *)&sin); if (ret) { @@ -653,7 +665,7 @@ static struct pingpong_context *pp_serve goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -678,7 +690,8 @@ static struct pingpong_context *pp_serve fprintf(stderr,"%s pp_init_cma_ctx failed\n", __FUNCTION__); goto err0; } - + + ctx->cm_channel = channel; my_dest->qpn = 0; my_dest->psn = 0xbb; my_dest->rkey = ctx->mr->rkey; @@ -694,7 +707,7 @@ static struct pingpong_context *pp_serve goto err0; } rdma_ack_cm_event(event); - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) { fprintf(stderr,"rdma_get_cm_event error %d\n", ret); rdma_destroy_id(child_cm_id); @@ -713,8 +726,10 @@ err0: err1: rdma_ack_cm_event(event); err2: - rdma_destroy_id(listen_id); fprintf(stderr,"%s NOT connected!\n", __FUNCTION__); + rdma_destroy_id(listen_id); +err3: + rdma_destroy_event_channel(channel); return NULL; } @@ -750,6 +765,7 @@ static struct pingpong_context *pp_clien int ret; struct sockaddr_in sin; struct rdma_cm_id *cm_id; + struct rdma_event_channel *channel; struct pingpong_context *ctx; fprintf(stderr,"%s starting client\n", __FUNCTION__); @@ -758,10 +774,18 @@ static struct pingpong_context *pp_clien return NULL; } - ret = rdma_create_id(&cm_id, NULL); + channel = rdma_create_event_channel(); + if (!channel) { + ret = errno; + fprintf(stderr, "%s rdma_create_event_channel failed with error %d\n", + __FUNCTION__, ret); + return NULL; + } + + ret = rdma_create_id(channel, &cm_id, NULL); if (ret) { fprintf(stderr,"%s rdma_create_id failed %d\n", __FUNCTION__, ret); - return NULL; + goto err3; } sin.sin_family = PF_INET; @@ -772,7 +796,7 @@ static struct 
pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -789,7 +813,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -806,7 +830,8 @@ static struct pingpong_context *pp_clien fprintf(stderr,"%s pp_init_cma_ctx failed\n", __FUNCTION__); goto err2; } - + + ctx->cm_channel = channel; my_dest->qpn = 0; my_dest->psn = 0xaa; my_dest->rkey = ctx->mr->rkey; @@ -823,7 +848,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -845,8 +870,10 @@ static struct pingpong_context *pp_clien err1: rdma_ack_cm_event(event); err2: - fprintf(stderr,"NOT connected!\n"); + fprintf(stderr,"%s NOT connected!\n", __FUNCTION__); rdma_destroy_id(cm_id); +err3: + rdma_destroy_event_channel(channel); return NULL; } From bpradip at in.ibm.com Mon Jun 26 03:29:28 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Mon, 26 Jun 2006 15:59:28 +0530 Subject: [openib-general] [PATCH 2/2] perftest: Modified perftest utils to work with new stack and libraries Message-ID: <20060626102926.GC17835@harry-potter.ibm.com> This is the patch for rdma_bw.c Signed-off-by: Pradipta Kumar Banerjee --- Index: rdma_bw.c ============================================================================= --- ../perftest-org/rdma_bw.c 2006-06-22 18:28:13.000000000 +0530 +++ rdma_bw.c 2006-06-22 18:40:01.000000000 +0530 @@ -51,6 +51,7 @@ #include #include #include +#include #include #include @@ -75,6 +76,7 @@ struct pingpong_context { struct ibv_sge list; struct ibv_send_wr wr; struct rdma_cm_id *cm_id; + struct rdma_event_channel *cm_channel; }; struct pingpong_dest { @@ -545,11 +547,12 @@ static void pp_close_cma(struct pingpong } } - rdma_get_cm_event(&event); + rdma_get_cm_event(ctx->cm_channel, &event); if 
(event->event != RDMA_CM_EVENT_DISCONNECTED) printf("unexpected event during disconnect %d\n", event->event); rdma_ack_cm_event(event); rdma_destroy_id(ctx->cm_id); + rdma_destroy_event_channel(ctx->cm_channel); } static struct pingpong_context *pp_server_connect_cma(unsigned short port, int size, int tx_depth, @@ -562,13 +565,22 @@ static struct pingpong_context *pp_serve int ret; struct sockaddr_in sin; struct rdma_cm_id *child_cm_id; + struct rdma_event_channel *channel; struct pingpong_context *ctx; printf("%s starting server\n", __FUNCTION__); - ret = rdma_create_id(&listen_id, NULL); + channel = rdma_create_event_channel(); + if (!channel) { + ret = errno; + fprintf(stderr, "%s rdma_create_event_channel failed with error %d\n", + __FUNCTION__, ret); + return NULL; + } + + ret = rdma_create_id(channel, &listen_id, NULL); if (ret) { fprintf(stderr, "%s rdma_create_id failed %d\n", __FUNCTION__, ret); - return NULL; + goto err3; } sin.sin_addr.s_addr = 0; @@ -586,7 +598,7 @@ static struct pingpong_context *pp_serve goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -612,6 +624,7 @@ static struct pingpong_context *pp_serve goto err0; } + ctx->cm_channel = channel; my_dest->qpn = 0; my_dest->psn = 0xbb; my_dest->rkey = ctx->mr->rkey; @@ -627,7 +640,7 @@ static struct pingpong_context *pp_serve goto err0; } rdma_ack_cm_event(event); - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) { fprintf(stderr,"rdma_get_cm_event error %d\n", ret); rdma_destroy_id(child_cm_id); @@ -646,8 +659,10 @@ err0: err1: rdma_ack_cm_event(event); err2: - rdma_destroy_id(listen_id); fprintf(stderr,"%s NOT connected!\n", __FUNCTION__); + rdma_destroy_id(listen_id); +err3: + rdma_destroy_event_channel(channel); return NULL; } @@ -683,6 +698,7 @@ static struct pingpong_context *pp_clien int ret; struct sockaddr_in sin; struct rdma_cm_id *cm_id; + struct rdma_event_channel *channel; struct 
pingpong_context *ctx; fprintf(stderr,"%s starting client\n", __FUNCTION__); @@ -691,10 +707,18 @@ static struct pingpong_context *pp_clien return NULL; } - ret = rdma_create_id(&cm_id, NULL); + channel = rdma_create_event_channel(); + if (!channel) { + ret = errno; + fprintf(stderr, "%s rdma_create_event_channel failed with error %d\n", + __FUNCTION__, ret); + return NULL; + } + + ret = rdma_create_id(channel, &cm_id, NULL); if (ret) { fprintf(stderr,"%s rdma_create_id failed %d\n", __FUNCTION__, ret); - return NULL; + goto err3; } sin.sin_family = PF_INET; @@ -705,7 +729,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -722,7 +746,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -740,6 +764,7 @@ static struct pingpong_context *pp_clien goto err2; } + ctx->cm_channel = channel; my_dest->qpn = 0; my_dest->psn = 0xaa; my_dest->rkey = ctx->mr->rkey; @@ -756,7 +781,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -779,6 +804,8 @@ err1: err2: fprintf(stderr,"NOT connected!\n"); rdma_destroy_id(cm_id); +err3: + rdma_destroy_event_channel(channel); return NULL; } From halr at voltaire.com Mon Jun 26 03:43:48 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Jun 2006 06:43:48 -0400 Subject: [openib-general] [PATCHv2] osm: pkey manager fail to clear unused block In-Reply-To: <86fyhs765b.fsf@mtl066.yok.mtl.com> References: <86fyhs765b.fsf@mtl066.yok.mtl.com> Message-ID: <1151318627.4482.119971.camel@hal.voltaire.com> Hi Eitan, On Mon, 2006-06-26 at 03:00, Eitan Zahavi wrote: > Hi Hal > > Had some second thought (slept on it) about this patch. 
> It has a problem as it would continuously set a block if it is all empty (since the new > block does not exist) > > The new patch fixes it by catching the case of null new_block and > still comparing to the old block. > > I also hope I did better job on the indentation (at least I used untabify). > > > Done some basic test with the simulator. Seems that the manager miss all > > blocks needed to be cleared (as they used to have some pkeys that are not part of > > the policy). The following trivial patch fixes the issue. The bug sneaked in > > through the last few patched. > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied. -- Hal From sashak at voltaire.com Mon Jun 26 07:42:43 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 26 Jun 2006 17:42:43 +0300 Subject: [openib-general] [PATCHv2] osm: pkey manager fail to clear unused block In-Reply-To: <86fyhs765b.fsf@mtl066.yok.mtl.com> References: <86fyhs765b.fsf@mtl066.yok.mtl.com> Message-ID: <20060626144243.GF16738@sashak.voltaire.com> Hi Eitan, On 10:00 Mon 26 Jun 2006, Eitan Zahavi wrote: > Hi Hal > > Had some second thought (slept on it) about this patch. > It has a problem as it would continuously set a block if it is all empty (since the new > block does not exist) > > The new patch fixes it by catching the case of null new_block and > still comparing to the old block. > > I also hope I did better job on the indentation (at least I used untabify). > > > Done some basic test with the simulator. Seems that the manager miss all > > blocks needed to be cleared (as they used to have some pkeys that are not part of > > the policy). The following trivial patch fixes the issue. The bug sneaked in > > through the last few patched. What about the peer port's pkey table update? Is there the same problem?
Sasha > > Eitan > > Signed-off-by: Eitan Zahavi > Index: opensm/osm_pkey_mgr.c > =================================================================== > --- opensm/osm_pkey_mgr.c (revision 8189) > +++ opensm/osm_pkey_mgr.c (working copy) > @@ -276,6 +276,7 @@ static boolean_t pkey_mgr_update_port( > boolean_t ret_val = FALSE; > osm_pending_pkey_t *p_pending; > boolean_t found; > + ib_pkey_table_t empty_block = {.pkey_entry = {0}, }; > > p_physp = osm_port_get_default_phys_ptr( p_port ); > if ( !osm_physp_is_valid( p_physp ) ) > @@ -360,7 +361,8 @@ static boolean_t pkey_mgr_update_port( > block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); > new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > > - if (block && (!new_block || !memcmp( new_block, block, sizeof( *block ) ))) > + if (!new_block) new_block = &empty_block; > + if (block && !memcmp( new_block, block, sizeof( *block ) )) > continue; > > status = pkey_mgr_update_pkey_entry( p_req, p_physp , new_block, block_index ); > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From iod00d at hp.com Mon Jun 26 08:15:06 2006 From: iod00d at hp.com (Grant Grundler) Date: Mon, 26 Jun 2006 08:15:06 -0700 Subject: [openib-general] [PATCH 0/2] perftest: Modified perftest utils to work with new stack and libraries In-Reply-To: <20060626102410.GA17835@harry-potter.ibm.com> References: <20060626102410.GA17835@harry-potter.ibm.com> Message-ID: <20060626151506.GA14684@esmail.cup.hp.com> On Mon, Jun 26, 2006 at 03:54:19PM +0530, Pradipta Kumar Banerjee wrote: > modified perftest utilities to work with the latest stack and libraries. > This patchset consists changes for rdma_lat and rdma_bw only. 
> > 1 - rdma_lat.c changes > 2 - rdma_bw.c changes Pradipta, thanks for posting the patches...but could you do us a favor and provide a useful changelog entry? We can see it's a patch and which files the patch modifies. The changelog should summarize "what problem does this patch fix?". thanks again, grant From halr at voltaire.com Mon Jun 26 08:15:53 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Jun 2006 11:15:53 -0400 Subject: [openib-general] [PATCHv2] osm: pkey manager fail to clear unused block In-Reply-To: <20060626144243.GF16738@sashak.voltaire.com> References: <86fyhs765b.fsf@mtl066.yok.mtl.com> <20060626144243.GF16738@sashak.voltaire.com> Message-ID: <1151334761.4482.130837.camel@hal.voltaire.com> On Mon, 2006-06-26 at 10:42, Sasha Khapyorsky wrote: > Hi Eitan, > > On 10:00 Mon 26 Jun , Eitan Zahavi wrote: > > Hi Hal > > > > Had some second thought (slept on it) about this patch. > > It has a problem as it would continuously set a block if it is all empty (since the new > > block does not exist) > > > > The new patch fixes it by catching the case of null new_block and > > still comparing to the old block. > > > > I also hope I did better job on the indentation (at least I used untabify). > > > > > Done some basic test with the simulator. Seems that the manager miss all > > > blocks needed to be cleared (as they used to have some pkeys that are not part of > > > the policy). The following trivial patch fixes the issue. The bug sneaked in > > > through the last few patched. > > And what with peer port's pkey table update. Is there the same problem? Looks to me like the same logic is there. 
-- Hal > > Sasha > > > > > Eitan > > > > Signed-off-by: Eitan Zahavi > > Index: opensm/osm_pkey_mgr.c > > =================================================================== > > --- opensm/osm_pkey_mgr.c (revision 8189) > > +++ opensm/osm_pkey_mgr.c (working copy) > > @@ -276,6 +276,7 @@ static boolean_t pkey_mgr_update_port( > > boolean_t ret_val = FALSE; > > osm_pending_pkey_t *p_pending; > > boolean_t found; > > + ib_pkey_table_t empty_block = {.pkey_entry = {0}, }; > > > > p_physp = osm_port_get_default_phys_ptr( p_port ); > > if ( !osm_physp_is_valid( p_physp ) ) > > @@ -360,7 +361,8 @@ static boolean_t pkey_mgr_update_port( > > block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); > > new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); > > > > - if (block && (!new_block || !memcmp( new_block, block, sizeof( *block ) ))) > > + if (!new_block) new_block = &empty_block; > > + if (block && !memcmp( new_block, block, sizeof( *block ) )) > > continue; > > > > status = pkey_mgr_update_pkey_entry( p_req, p_physp , new_block, block_index ); > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From sashak at voltaire.com Mon Jun 26 08:35:00 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 26 Jun 2006 18:35:00 +0300 Subject: [openib-general] [PATCH] opensm: libibmad: match MAD TransactionID Message-ID: <20060626153500.18078.85785.stgit@sashak.voltaire.com> Match MAD TransactionID on receiving. This prevents request/response MADs mixing - reproducible when poll() (in libibumad) returns timeout. 
Signed-off-by: Sasha Khapyorsky --- libibmad/src/rpc.c | 66 ++++++++++++++++++++++++---------------------------- 1 files changed, 31 insertions(+), 35 deletions(-) diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index e929ba4..d9dc407 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -105,57 +105,54 @@ madrpc_portid(void) } static int -_do_madrpc(void *umad, int agentid, int len, int timeout) +_do_madrpc(void *sndbuf, void *rcvbuf, int agentid, int len, int timeout) { + uint32_t trid; /* only low 32 bits */ int retries; int length, status; - ib_user_mad_t *mad; - ib_mad_addr_t addr; if (!timeout) timeout = def_madrpc_timeout; if (ibdebug > 1) { IBWARN(">>> sending: len %d pktsz %d", len, umad_size() + len); - xdump(stderr, "send buf\n", umad, umad_size() + len); + xdump(stderr, "send buf\n", sndbuf, umad_size() + len); } - /* Save user MAD header in case of retry */ - mad = umad; - memcpy(&addr, &mad->addr, sizeof addr); - if (save_mad) { - memcpy(save_mad, umad_get_mad(umad), + memcpy(save_mad, umad_get_mad(sndbuf), save_mad_len < len ? save_mad_len : len); save_mad = 0; } + trid = mad_get_field64(umad_get_mad(sndbuf), 0, IB_MAD_TRID_F); + for (retries = 0; retries < madrpc_retries; retries++) { if (retries) { ERRS("retry %d (timeout %d ms)", retries, timeout); - /* Restore user MAD header */ - memcpy(&mad->addr, &addr, sizeof addr); } length = len; - if (umad_send(mad_portid, agentid, umad, length, timeout, 0) < 0) { + if (umad_send(mad_portid, agentid, sndbuf, length, timeout, 0) < 0) { IBWARN("send failed; %m"); return -1; } /* Use same timeout on receive side just in case */ /* send packet is lost somewhere. 
*/ - if (umad_recv(mad_portid, umad, &length, timeout) < 0) { - IBWARN("recv failed: %m"); - return -1; - } - - if (ibdebug > 1) { - IBWARN("rcv buf:"); - xdump(stderr, "rcv buf\n", umad_get_mad(umad), IB_MAD_SIZE); - } - - status = umad_status(umad); + do { + if (umad_recv(mad_portid, rcvbuf, &length, timeout) < 0) { + IBWARN("recv failed: %m"); + return -1; + } + + if (ibdebug > 1) { + IBWARN("rcv buf:"); + xdump(stderr, "rcv buf\n", umad_get_mad(rcvbuf), IB_MAD_SIZE); + } + } while ((uint32_t)mad_get_field64(umad_get_mad(rcvbuf), 0, IB_MAD_TRID_F) != trid); + + status = umad_status(rcvbuf); if (!status) return length; /* done */ if (status == ENOMEM) @@ -170,19 +167,19 @@ void * madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void *rcvdata) { int status, len; - uint8_t pktbuf[1024], *mad; - void *umad = pktbuf; + uint8_t sndbuf[1024], rcvbuf[1024], *mad; - memset(pktbuf, 0, umad_size() + IB_MAD_SIZE); + len = 0; + memset(sndbuf, 0, umad_size() + IB_MAD_SIZE); - if ((len = mad_build_pkt(umad, rpc, dport, 0, payload)) < 0) + if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0) return 0; - if ((len = _do_madrpc(umad, mad_class_agent(rpc->mgtclass), + if ((len = _do_madrpc(sndbuf, rcvbuf, mad_class_agent(rpc->mgtclass), len, rpc->timeout)) < 0) return 0; - mad = umad_get_mad(umad); + mad = umad_get_mad(rcvbuf); if ((status = mad_get_field(mad, 0, IB_DRSMP_STATUS_F)) != 0) { ERRS("MAD completed with error status 0x%x", status); @@ -204,21 +201,20 @@ void * madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data) { int status, len; - uint8_t pktbuf[1024], *mad; - void *umad = pktbuf; + uint8_t sndbuf[1024], rcvbuf[1024], *mad; - memset(pktbuf, 0, umad_size() + IB_MAD_SIZE); + memset(sndbuf, 0, umad_size() + IB_MAD_SIZE); DEBUG("rmpp %p data %p", rmpp, data); - if ((len = mad_build_pkt(umad, rpc, dport, rmpp, data)) < 0) + if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0) return 0; - if ((len = _do_madrpc(umad, 
mad_class_agent(rpc->mgtclass), + if ((len = _do_madrpc(sndbuf, rcvbuf, mad_class_agent(rpc->mgtclass), len, rpc->timeout)) < 0) return 0; - mad = umad_get_mad(umad); + mad = umad_get_mad(rcvbuf); if ((status = mad_get_field(mad, 0, IB_MAD_STATUS_F)) != 0) { ERRS("MAD completed with error status 0x%x", status); From rdreier at cisco.com Mon Jun 26 09:27:34 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 Jun 2006 09:27:34 -0700 Subject: [openib-general] it's a girl... Message-ID: Hi, just quick note to let everyone know that my daughter was born last week. So please don't expect me to do anything, read anything, think about anything, or accomplish anything at all for a while... - Roland From mshefty at ichips.intel.com Mon Jun 26 09:44:31 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 26 Jun 2006 09:44:31 -0700 Subject: [openib-general] ucma into kernel.org In-Reply-To: <001e01c69300$b9020c00$020010ac@haggard> References: <1150465355.29508.4.camel@stevo-desktop> <4492D706.4060106@ichips.intel.com> <15ddcffd0606180435g366a6effs4d4826c8b3fbbd4f@mail.gmail.com> <001e01c69300$b9020c00$020010ac@haggard> Message-ID: <44A00EEF.702@ichips.intel.com> Steve Wise wrote: > I agree that it would be nice to get this into 2.6.18. It seems stable > enough IMO. It's not a stability issue. We wanted to make sure that the user to kernel interface was correct before pushing anything upstream. At the time the decision was made (a couple of months ago), this made sense, and the ABI has changed since that time. It would be nice to know that there are at least a couple of applications using the userspace library before trying to push anything upstream. I know that DAPL is using it. Are there any others? - Sean From mst at mellanox.co.il Mon Jun 26 10:46:11 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 26 Jun 2006 20:46:11 +0300 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: References: <20060530183454.GH10234@mellanox.co.il> Message-ID: <20060626174611.GB19929@mellanox.co.il> Quoting r. Sean Hefty : > >Yes, that was my thinking. To avoid touching all users, maybe the simplest way > >is to make ib_cm discard the new cm_id without reject if the client callback > >returned -ENOMEM? > > > >If you consider that in out of memory situation sending reject will also likely > >fail, this might be a good idea, regardless. > > > >Sounds good? > > I'd like to get some other feedback, but this approach sounds reasonable. Here's an untested patch that does this. Comments? Signed-off-by: Jack Morgenstein Index: src/drivers/infiniband/core/cma.c =================================================================== --- src.orig/drivers/infiniband/core/cma.c 2006-06-07 11:33:04.359936000 +0300 +++ src/drivers/infiniband/core/cma.c 2006-06-15 13:44:07.030643000 +0300 @@ -118,7 +118,8 @@ struct rdma_id_private { wait_queue_head_t wait_remove; atomic_t dev_remove; int backlog; + atomic_t curr_backlog; int timeout_ms; struct ib_sa_query *query; int query_id; @@ -328,6 +329,7 @@ struct rdma_cm_id* rdma_create_id(rdma_c atomic_set(&id_priv->dev_remove, 0); INIT_LIST_HEAD(&id_priv->listen_list); get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num); + atomic_set(&id_priv->curr_backlog, 0); return &id_priv->id; } @@ -1022,6 +1024,9 @@ static int cma_listen_handler(struct rdm { struct rdma_id_private *id_priv = id->context; + if (atomic_read(&id_priv->curr_backlog) > id_priv->backlog) + return -ENOMEM; + id->context = id_priv->id.context; id->event_handler = id_priv->id.event_handler; return id_priv->id.event_handler(id, event); @@ -1870,6 +1875,25 @@ out: } EXPORT_SYMBOL(rdma_disconnect); + +void rdma_backlog_added_one(struct rdma_cm_id *id) +{ + struct rdma_id_private *id_priv; + + id_priv = container_of(id, struct rdma_id_private, 
id); + atomic_inc(&id_priv->curr_backlog); +} +EXPORT_SYMBOL(rdma_backlog_added_one); + +void rdma_backlog_removed_one(struct rdma_cm_id *id) +{ + struct rdma_id_private *id_priv; + + id_priv = container_of(id, struct rdma_id_private, id); + atomic_dec(&id_priv->curr_backlog); +} +EXPORT_SYMBOL(rdma_backlog_removed_one); + static void cma_add_one(struct ib_device *device) { struct cma_device *cma_dev; Index: src/drivers/infiniband/include/rdma/rdma_cm.h =================================================================== --- src.orig/drivers/infiniband/include/rdma/rdma_cm.h 2006-05-10 11:18:37.538572000 +0300 +++ src/drivers/infiniband/include/rdma/rdma_cm.h 2006-06-15 15:49:37.708725000 +0300 @@ -252,5 +252,21 @@ int rdma_reject(struct rdma_cm_id *id, c */ int rdma_disconnect(struct rdma_cm_id *id); +/** + * rdma_backlog_added_one - This function is called by the passive side to + * notify cma that one connection request has been added to backlog queue. + * + * No error checking is done here (e.g., if backlog is already at max, etc) + */ +void rdma_backlog_added_one(struct rdma_cm_id *id); + +/** + * rdma_backlog_added_one - This function is called by the passive side to + * notify cma that one connection request has been added to backlog queue. 
+ * + * No error checking is done here (e.g., if queue was already empty) + */ +void rdma_backlog_removed_one(struct rdma_cm_id *id); + #endif /* RDMA_CM_H */ Index: src/drivers/infiniband/core/cm.c =================================================================== --- src.orig/drivers/infiniband/core/cm.c 2006-06-07 11:33:04.109937000 +0300 +++ src/drivers/infiniband/core/cm.c 2006-06-15 15:50:34.222140000 +0300 @@ -701,6 +701,19 @@ static void cm_reset_to_idle(struct cm_i } } +static void ib_destroy_cm_id_no_resp(struct ib_cm_id *cm_id) +{ + struct cm_id_private *cm_id_priv; + + cm_id_priv = container_of(cm_id, struct cm_id_private, id); + + cm_free_id(cm_id->local_id); + cm_deref_id(cm_id_priv); + kfree(cm_id_priv->compare_data); + kfree(cm_id_priv->private_data); + kfree(cm_id_priv); +} + void ib_destroy_cm_id(struct ib_cm_id *cm_id) { struct cm_id_private *cm_id_priv; @@ -1162,7 +1177,10 @@ static void cm_process_work(struct cm_id cm_free_work(work); } cm_deref_id(cm_id_priv); - if (ret) + + if (ret == -ENOMEM) + ib_destroy_cm_id_no_resp(&cm_id_priv->id); + else if (ret) ib_destroy_cm_id(&cm_id_priv->id); } -- MST From mst at mellanox.co.il Mon Jun 26 10:41:17 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 26 Jun 2006 20:41:17 +0300 Subject: [openib-general] ucma into kernel.org In-Reply-To: <44A00EEF.702@ichips.intel.com> References: <44A00EEF.702@ichips.intel.com> Message-ID: <20060626174117.GA19929@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: ucma into kernel.org > > Steve Wise wrote: > > I agree that it would be nice to get this into 2.6.18. It seems stable > > enough IMO. > > It's not a stability issue. We wanted to make sure that the user to kernel > interface was correct before pushing anything upstream. How about the cma changes required by ucma to get/set options? I think they are not upstream yet. Could these go upstream, to make building ucma out-of-kernel possible, without kernel patches? 
-- MST From halr at voltaire.com Mon Jun 26 11:04:17 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Jun 2006 14:04:17 -0400 Subject: [openib-general] [PATCH] opensm: libibmad: match MAD TransactionID In-Reply-To: <20060626153500.18078.85785.stgit@sashak.voltaire.com> References: <20060626153500.18078.85785.stgit@sashak.voltaire.com> Message-ID: <1151345056.4482.137712.camel@hal.voltaire.com> On Mon, 2006-06-26 at 11:35, Sasha Khapyorsky wrote: > Match MAD TransactionID on receiving. This prevents request/response MADs > mixing - reproducible when poll() (in libibumad) returns timeout. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From pradeep at us.ibm.com Mon Jun 26 11:19:55 2006 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Mon, 26 Jun 2006 11:19:55 -0700 Subject: [openib-general] bug #33 Message-ID: I am curious - was the root cause of bug #33 determined? Which of the fixes between OFED RC4 and RC5 closed this bug? Pradeep pradeep at us.ibm.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Mon Jun 26 11:21:54 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 26 Jun 2006 11:21:54 -0700 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: <20060626174611.GB19929@mellanox.co.il> References: <20060530183454.GH10234@mellanox.co.il> <20060626174611.GB19929@mellanox.co.il> Message-ID: <44A025C2.2070204@ichips.intel.com> Michael S. Tsirkin wrote: > Here's an untested patch that does this. Comments? Rather than exporting wrapper functions around atomic inc/dec, I would rather the user just maintain the current backlog themselves, with the patch limited to the cm.c file only.
> Index: src/drivers/infiniband/core/cm.c > =================================================================== > --- src.orig/drivers/infiniband/core/cm.c 2006-06-07 11:33:04.109937000 +0300 > +++ src/drivers/infiniband/core/cm.c 2006-06-15 15:50:34.222140000 +0300 > @@ -701,6 +701,19 @@ static void cm_reset_to_idle(struct cm_i > } > } > > +static void ib_destroy_cm_id_no_resp(struct ib_cm_id *cm_id) > +{ > + struct cm_id_private *cm_id_priv; > + > + cm_id_priv = container_of(cm_id, struct cm_id_private, id); > + > + cm_free_id(cm_id->local_id); > + cm_deref_id(cm_id_priv); > + kfree(cm_id_priv->compare_data); > + kfree(cm_id_priv->private_data); > + kfree(cm_id_priv); > +} I think that we need to dequeue and free any additional work items as well here. See the bottom of ib_destroy_cm_id(). (It may make sense for ib_destroy_cm_id() to call the new routine, but I'm not sure about that yet.) We will also need to wait for all references on the cm_id to go to 0. (Incoming MADs could be accessing the cm_id, such as receiving a REJ while we're processing a REQ.) There are likely some additional race conditions / cleanup not handled here as well. We may still need to perform some state checking to ensure that the cm_id is not in any lists / trees, and that there are no outstanding MADs associated with the id. (A user could have sent an MRA or other CM MAD from their callback, before returning an error.) - Sean From mshefty at ichips.intel.com Mon Jun 26 11:25:47 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 26 Jun 2006 11:25:47 -0700 Subject: [openib-general] ucma into kernel.org In-Reply-To: <20060626174117.GA19929@mellanox.co.il> References: <44A00EEF.702@ichips.intel.com> <20060626174117.GA19929@mellanox.co.il> Message-ID: <44A026AB.8090607@ichips.intel.com> Michael S. Tsirkin wrote: > How about the cma changes required by ucma to get/set options? I think they are > not upstream yet.
Could these go upstream, to make building ucma out-of-kernel > possible, without kernel patches? Wouldn't you have to patch the kernel to include the kernel ucma anyway? - Sean From narravul at cse.ohio-state.edu Mon Jun 26 11:22:03 2006 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Mon, 26 Jun 2006 14:22:03 -0400 (EDT) Subject: [openib-general] Interface for getting RNIC's IP address Message-ID: Is there any s/w interface to obtain the local RNIC's IP address? The current rdma cm examples, rping and cmatose, require the user to enter the ip address as a command line parameter. I am currently looking for a way to get this programmatically. Thanks, --Sundeep. From mst at mellanox.co.il Mon Jun 26 11:37:11 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 26 Jun 2006 21:37:11 +0300 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: <44A025C2.2070204@ichips.intel.com> References: <44A025C2.2070204@ichips.intel.com> Message-ID: <20060626183711.GA20281@mellanox.co.il> Sean, thanks for comments. Quoting r. Sean Hefty : > It may makes sense for > ib_destroy_cm_id() to call the new routine, but I'm not sure about that yet. Maybe add a new routine getting a response flag, and use that from ib_destroy_cm_id? -- MST From mst at mellanox.co.il Mon Jun 26 11:28:30 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 26 Jun 2006 21:28:30 +0300 Subject: [openib-general] ucma into kernel.org In-Reply-To: <44A026AB.8090607@ichips.intel.com> References: <44A026AB.8090607@ichips.intel.com> Message-ID: <20060626182830.GD19929@mellanox.co.il> Quoting r. Sean Hefty : > > How about the cma changes required by ucma to get/set options? I think they > > are not upstream yet. Could these go upstream, to make building ucma > > out-of-kernel possible, without kernel patches? > > Wouldn't you have to patch the kernel to include the kernel ucma anyway? I would? Why can't it be compiled as an out of kernel module?
-- MST From mshefty at ichips.intel.com Mon Jun 26 12:04:45 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 26 Jun 2006 12:04:45 -0700 Subject: [openib-general] ucma into kernel.org In-Reply-To: <20060626182830.GD19929@mellanox.co.il> References: <44A026AB.8090607@ichips.intel.com> <20060626182830.GD19929@mellanox.co.il> Message-ID: <44A02FCD.8030404@ichips.intel.com> Michael S. Tsirkin wrote: >>Wouldn't you have to patch the kernel to include the kernel ucma anyway? > > > I would? Why can't it be compiled as an out of kernel module? I understand you now. UD QP and multicast support were also recently added. I don't think that we want to risk pushing them upstream for 2.6.18 as well, since it requires adding the ib_multicast module. - Sean From mshefty at ichips.intel.com Mon Jun 26 12:13:34 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 26 Jun 2006 12:13:34 -0700 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: <20060626183711.GA20281@mellanox.co.il> References: <44A025C2.2070204@ichips.intel.com> <20060626183711.GA20281@mellanox.co.il> Message-ID: <44A031DE.6030905@ichips.intel.com> Michael S. Tsirkin wrote: >>It may makes sense for >>ib_destroy_cm_id() to call the new routine, but I'm not sure about that yet. > > > Maybe add a new routine getting a response flag, and use that from > ib_destroy_cm_id? I'm not following what you mean here. Originally, I was suggesting taking the bottom portion of ib_destroy_cm_id() and making it the "destroy no response" call. But after thinking about it more, I don't believe that the cleanup is that easy. We still need to check and modify the cm_id state to ensure that newly received MADs are handled correctly, plus remove the cm_id from any trees used to track the connection.
- Sean From swise at opengridcomputing.com Mon Jun 26 12:19:26 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 26 Jun 2006 14:19:26 -0500 Subject: [openib-general] Interface for getting RNIC's IP address In-Reply-To: References: Message-ID: <1151349566.2398.59.camel@stevo-desktop> On Mon, 2006-06-26 at 14:22 -0400, Sundeep Narravula wrote: > Is there any s/w interface to obtain the local RNIC's IP address? > > The current rdma cm examples, rping and cmatose, require the user to enter > the ip address as a command line parameter. I am currently looking for a > way to get this programatically. > You can use 0.0.0.0 which will allow you to listen across all rdma devices. Otherwise, you have to "know" which ip address is bound to the device you wish to listen on. > Thanks, > --Sundeep. > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Mon Jun 26 12:24:08 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 26 Jun 2006 22:24:08 +0300 Subject: [openib-general] ucma into kernel.org In-Reply-To: <44A02FCD.8030404@ichips.intel.com> References: <44A02FCD.8030404@ichips.intel.com> Message-ID: <20060626192408.GA20568@mellanox.co.il> Quoting r. Sean Hefty : > UD QP and multicast support were also recently added. These options are slightly different however - kernel ULPs I think will also want to set the number of retries/timeout (SDP needs it). So you can look it as a kind of fix, not a new feature. And, the change is I think smaller. No? > I don't think that we want to risk pushing them upstream for 2.6.18 as well, > since it requires adding the ib_multicast module. Yes, we still see crashes with the new ib_multicast. 
-- MST From caitlinb at broadcom.com Mon Jun 26 12:39:30 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 26 Jun 2006 12:39:30 -0700 Subject: [openib-general] Interface for getting RNIC's IP address Message-ID: <54AD0F12E08D1541B826BE97C98F99F15F58B7@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > On Mon, 2006-06-26 at 14:22 -0400, Sundeep Narravula wrote: >> Is there any s/w interface to obtain the local RNIC's IP address? >> >> The current rdma cm examples, rping and cmatose, require the user to >> enter the ip address as a command line parameter. I am currently >> looking for a way to get this programatically. >> > > You can use 0.0.0.0 which will allow you to listen across all > rdma devices. Otherwise, you have to "know" which ip address > is bound to the device you wish to listen on. > > > True, but if you know what netdevice the rdma device you want to listen on is associated with then you can work from the IP Address to the netdevice. From mst at mellanox.co.il Mon Jun 26 12:40:10 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 26 Jun 2006 22:40:10 +0300 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: <44A031DE.6030905@ichips.intel.com> References: <44A031DE.6030905@ichips.intel.com> Message-ID: <20060626194010.GB20568@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] RFC: CMA backlog (was Re: CMA backlog) > > Michael S. Tsirkin wrote: > >>It may makes sense for > >>ib_destroy_cm_id() to call the new routine, but I'm not sure about that yet. > > > > > > Maybe add a new routine getting a response flag, and use that from > > ib_destroy_cm_id? > > I'm not following what you mean here. 
I'm just saying that we can use exactly the code in ib_destroy_cm_id, but avoid calling ib_send_cm_rej in this one case: case IB_CM_REQ_RCVD: case IB_CM_MRA_REQ_SENT: case IB_CM_REP_RCVD: case IB_CM_MRA_REP_SENT: + if (noresponse) + cm_reset_to_idle(cm_id_priv); spin_unlock_irqrestore(&cm_id_priv->lock, flags); + if (noresponse) ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0); So we get all the handling for free, just avoid sending out the MAD. -- MST From mshefty at ichips.intel.com Mon Jun 26 13:17:07 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 26 Jun 2006 13:17:07 -0700 Subject: [openib-general] ucma into kernel.org In-Reply-To: <20060626192408.GA20568@mellanox.co.il> References: <44A02FCD.8030404@ichips.intel.com> <20060626192408.GA20568@mellanox.co.il> Message-ID: <44A040C3.1060600@ichips.intel.com> Michael S. Tsirkin wrote: >>UD QP and multicast support were also recently added. > > These options are slightly different however - kernel ULPs I think will also > want to set the number of retries/timeout (SDP needs it). So you can look it as > a kind of fix, not a new feature. And, the change is I think smaller. > > No? I agree that they're different. I was merely pointing out that the ucma has those changes too. You would need a special out of kernel version of the ucma and compatible librdmacm to talk with whatever kernel cma is upstream. - Sean From mshefty at ichips.intel.com Mon Jun 26 13:19:38 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 26 Jun 2006 13:19:38 -0700 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: <20060626194010.GB20568@mellanox.co.il> References: <44A031DE.6030905@ichips.intel.com> <20060626194010.GB20568@mellanox.co.il> Message-ID: <44A0415A.9000105@ichips.intel.com> Michael S. Tsirkin wrote: > I'm just saying that we can use exactly the code in ib_destroy_cm_id, but > avoid calling ib_send_cm_rej in this one case: Ah... 
yes, something like that should work. - Sean From pw at osc.edu Mon Jun 26 14:53:19 2006 From: pw at osc.edu (Pete Wyckoff) Date: Mon, 26 Jun 2006 17:53:19 -0400 Subject: [openib-general] max_send_sge < max_sge Message-ID: <20060626215319.GA9291@osc.edu> Using stock 2.6.17.1, with verbs 1.0.3-1.fc4 and mthca 1.0.2-1.fc4 with MT25204, this line: ret = ibv_query_device(ctx, &hca_cap); tells me that hca_cap.max_sge = 30. However, this code fails, with the last kernel write returning EINVAL: memset(&att, 0, sizeof(att)); att.send_cq = 1024; att.recv_cq = 1024; att.cap.max_recv_wr = 512; att.cap.max_send_wr = 512; att.cap.max_recv_sge = 30; att.cap.max_send_sge = 30; att.qp_type = IBV_QPT_RC; qp = ibv_create_qp(pd, &att); But if I set: att.cap.max_recv_sge = 30; att.cap.max_send_sge = 29; /* hca_cap.max_sge - 1 */ the QP create succeeds. Is this a known issue? Should I always subtract 1 from the reported max on the send side? Just for this hardware? -- Pete From rjwalsh at pathscale.com Mon Jun 26 15:53:07 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Mon, 26 Jun 2006 15:53:07 -0700 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060626215319.GA9291@osc.edu> References: <20060626215319.GA9291@osc.edu> Message-ID: <1151362387.20061.0.camel@hematite.internal.keyresearch.com> On Mon, 2006-06-26 at 17:53 -0400, Pete Wyckoff wrote: > Using stock 2.6.17.1, with verbs 1.0.3-1.fc4 and mthca 1.0.2-1.fc4 > with MT25204, this line: > > ret = ibv_query_device(ctx, &hca_cap); > > tells me that hca_cap.max_sge = 30. 
> > However, this code fails, with the last kernel write returning EINVAL: > > memset(&att, 0, sizeof(att)); > att.send_cq = 1024; > att.recv_cq = 1024; > att.cap.max_recv_wr = 512; > att.cap.max_send_wr = 512; > att.cap.max_recv_sge = 30; > att.cap.max_send_sge = 30; > att.qp_type = IBV_QPT_RC; > qp = ibv_create_qp(pd, &att); > > But if I set: > > att.cap.max_recv_sge = 30; > att.cap.max_send_sge = 29; /* hca_cap.max_sge - 1 */ > > the QP create succeeds. > > Is this a known issue? Should I always subtract 1 from the reported > max on the send side? Just for this hardware? Probably something else has a QP allocated? Like the SMA, maybe? -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From rjwalsh at pathscale.com Mon Jun 26 15:53:48 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Mon, 26 Jun 2006 15:53:48 -0700 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <1151362387.20061.0.camel@hematite.internal.keyresearch.com> References: <20060626215319.GA9291@osc.edu> <1151362387.20061.0.camel@hematite.internal.keyresearch.com> Message-ID: <1151362428.20061.2.camel@hematite.internal.keyresearch.com> On Mon, 2006-06-26 at 15:53 -0700, Robert Walsh wrote: > On Mon, 2006-06-26 at 17:53 -0400, Pete Wyckoff wrote: > > Using stock 2.6.17.1, with verbs 1.0.3-1.fc4 and mthca 1.0.2-1.fc4 > > with MT25204, this line: > > > > ret = ibv_query_device(ctx, &hca_cap); > > > > tells me that hca_cap.max_sge = 30. 
> > > > However, this code fails, with the last kernel write returning EINVAL: > > > > memset(&att, 0, sizeof(att)); > > att.send_cq = 1024; > > att.recv_cq = 1024; > > att.cap.max_recv_wr = 512; > > att.cap.max_send_wr = 512; > > att.cap.max_recv_sge = 30; > > att.cap.max_send_sge = 30; > > att.qp_type = IBV_QPT_RC; > > qp = ibv_create_qp(pd, &att); > > > > But if I set: > > > > att.cap.max_recv_sge = 30; > > att.cap.max_send_sge = 29; /* hca_cap.max_sge - 1 */ > > > > the QP create succeeds. > > > > Is this a known issue? Should I always subtract 1 from the reported > > max on the send side? Just for this hardware? > > Probably something else has a QP allocated? Like the SMA, maybe? Doh - never mind. SGE's, not QPs. Wasn't paying attention :-) -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From mshefty at ichips.intel.com Mon Jun 26 17:03:14 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 26 Jun 2006 17:03:14 -0700 Subject: [openib-general] Kernel Oops related to IPoIB (multicast module?) In-Reply-To: <200606261051.12515.jackm@mellanox.co.il> References: <200606261051.12515.jackm@mellanox.co.il> Message-ID: <44A075C2.6060409@ichips.intel.com> Jack Morgenstein wrote: > The following Oops occurred upon unloading the openib driver. I unloaded the
I unloaded the > driver immediately following a reboot (the driver had been loaded during the > boot sequence). I did NOT run opensm before unloading the driver. > > Evidently, ipoib was still attempting to connect with an SA, when the ipoib > module was unloaded (modprobe -r). After the ipoib module was unloaded (or at > least rendered inaccessible), the ib_sa module attempted to invoke > "ib_sa_mcmember_rec_callback" (for a callback address that was part of the > unloaded ipoib module). Hence, the Oops below. > > The "modprobe" process in the trace below is "modprobe -r ib_sa" (After > unloading ib_ipoib, we attempt to unload ib_sa). Following the Oops, I've > included info on the running environment. Thanks for the additional information. I've been trying to reproduce this, but haven't been able to yet. I did notice that there's a several second delay when calling modprobe -r ip_iboib, but only if I've tried to configure ib0 first. (No SM was running.) I am confused on one area. After executing modprobe -r ib_ipoib, what kept ib_sa loaded? (Why was modprobe -r ib_sa necessary?) I would have expected it to be unloaded at the same time. - Sean From mst at mellanox.co.il Mon Jun 26 23:42:34 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 27 Jun 2006 09:42:34 +0300 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060626215319.GA9291@osc.edu> References: <20060626215319.GA9291@osc.edu> Message-ID: <20060627064234.GG19300@mellanox.co.il> Quoting r. Pete Wyckoff : > Subject: max_send_sge < max_sge > > Using stock 2.6.17.1, with verbs 1.0.3-1.fc4 and mthca 1.0.2-1.fc4 > with MT25204, this line: > > ret = ibv_query_device(ctx, &hca_cap); > > tells me that hca_cap.max_sge = 30. 
> > However, this code fails, with the last kernel write returning EINVAL: > > memset(&att, 0, sizeof(att)); > att.send_cq = 1024; > att.recv_cq = 1024; > att.cap.max_recv_wr = 512; > att.cap.max_send_wr = 512; > att.cap.max_recv_sge = 30; > att.cap.max_send_sge = 30; > att.qp_type = IBV_QPT_RC; > qp = ibv_create_qp(pd, &att); Some Mellanox HCAs support different max sge values for send queue versus receive queue, or for different QP types. ibv_query_device returns the maximum value hardware can support. > Is this a known issue? Yes. The fact that ibv_query_device returns some value in hca_cap can not guarantee that ibv_create_qp with these parameters will succeed. For example, system administrator might have imposed a limit on the amount of memory you can pin down, and you will get ENOMEM. > Should I always subtract 1 from the reported max on the send side? Just for > this hardware? Unless you use it, passing the absolute maximum value supported by hardware does not seem, to me, to make sense - it will just slow you down, and waste resources. Is there a protocol out there that actually has a use for 30 sge? In my opinion, for the application to be robust it has to either use small values that empirically work on most systems, or be able to scale down to require less resources if an allocation fails. -- MST From mst at mellanox.co.il Mon Jun 26 23:49:01 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 27 Jun 2006 09:49:01 +0300 Subject: [openib-general] [git pull] please pull infiniband.git In-Reply-To: References: Message-ID: <20060627064901.GH19300@mellanox.co.il> Quoting r. Roland Dreier : > Subject: [git pull] please pull infiniband.git > > Linus, please pull from > > master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus > > This tree is also available from kernel.org mirrors at: > > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus > > This has a couple of mthca driver bug fixes: > > Michael S. 
Tsirkin: > IB/mthca: restore missing PCI registers after reset > IB/mthca: memfree completion with error FW bug workaround > > drivers/infiniband/hw/mthca/mthca_cq.c | 11 +++++ > drivers/infiniband/hw/mthca/mthca_reset.c | 59 +++++++++++++++++++++++++++++ > 2 files changed, 69 insertions(+), 1 deletions(-) These two patches didn't seem to make it to 2.6.17, did they? Is there support for their inclusion in -stable? These both actually fix stability issues for our customers. -- MST From eitan at mellanox.co.il Mon Jun 26 23:48:37 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 27 Jun 2006 09:48:37 +0300 Subject: [openib-general] [PATCH] opensm: libibmad: match MAD TransactionID In-Reply-To: <20060626153500.18078.85785.stgit@sashak.voltaire.com> References: <20060626153500.18078.85785.stgit@sashak.voltaire.com> Message-ID: <44A0D4C5.4060705@mellanox.co.il> Hi Sasha Can you provide a little more info on the cause and impact of the issue you are solving with this patch? How is it related to work on the thread: "mad: add GID/class checking for matching received to sent MADs"? Thanks Sasha Khapyorsky wrote: > Match MAD TransactionID on receiving. This prevents request/response MADs > mixing - reproducible when poll() (in libibumad) returns timeout. > From bpradip at in.ibm.com Mon Jun 26 23:56:25 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Tue, 27 Jun 2006 12:26:25 +0530 Subject: [openib-general] [PATCH 0/2] perftest: Modified perftest utils to work with new stack and libraries In-Reply-To: <20060626151506.GA14684@esmail.cup.hp.com> References: <20060626102410.GA17835@harry-potter.ibm.com> <20060626151506.GA14684@esmail.cup.hp.com> Message-ID: <44A0D699.1030808@in.ibm.com> Grant Grundler wrote: > On Mon, Jun 26, 2006 at 03:54:19PM +0530, Pradipta Kumar Banerjee wrote: >> modified perftest utilities to work with the latest stack and libraries. >> This patchset consists changes for rdma_lat and rdma_bw only. 
>> >> 1 - rdma_lat.c changes >> 2 - rdma_bw.c changes > > Pradipta, > thanks for posting the patches...but could you do us a favor > and provide a useful changelog entry? > > We can see it's a patch and which files the patch modifies. > The changelog should summarize "what problem does this patch fix?". > Grant, Will repost the patches again with the changelog. Thanks, Pradipta > thanks again, > grant > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From halr at voltaire.com Tue Jun 27 04:15:05 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Jun 2006 07:15:05 -0400 Subject: [openib-general] [PATCH] OpenSM/SA: Eliminate some no longer needed code Message-ID: <1151406904.4482.179805.camel@hal.voltaire.com> OpenSM/SA: Eliminate some no longer needed code No longer a need to check whether the LID is beyond the vector table size. In fact, this turns an edge case into an error (when LMC > 0 and a non base LID is requested which is above the last base LID but within that port's LID range). In any case, osm_get_port_by_base_lid uses cl_ptr_vector_get_at which does this check at the proper time. 
Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_pkey_record.c =================================================================== --- opensm/osm_sa_pkey_record.c (revision 8236) +++ opensm/osm_sa_pkey_record.c (working copy) @@ -419,25 +419,14 @@ osm_pkey_rec_rcv_process( CL_ASSERT( cl_ptr_vector_get_size(p_tbl) < 0x10000 ); - if ((uint16_t)cl_ptr_vector_get_size(p_tbl) > cl_ntoh16(p_rcvd_rec->lid)) + status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); + if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) { - status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); - if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) - { - status = IB_NOT_FOUND; - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pkey_rec_rcv_process: ERR 460B: " - "No port found with LID 0x%x\n", - cl_ntoh16(p_rcvd_rec->lid) ); - } - } - else - { /* LID out of range */ status = IB_NOT_FOUND; osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pkey_rec_rcv_process: ERR 4609: " - "Given LID (0x%X) is out of range:0x%X\n", - cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl) ); + "osm_pkey_rec_rcv_process: ERR 460B: " + "No port found with LID 0x%x\n", + cl_ntoh16(p_rcvd_rec->lid) ); } } Index: opensm/osm_sa_portinfo_record.c =================================================================== --- opensm/osm_sa_portinfo_record.c (revision 8236) +++ opensm/osm_sa_portinfo_record.c (working copy) @@ -677,25 +677,14 @@ osm_pir_rcv_process( */ if( comp_mask & IB_PIR_COMPMASK_LID ) { - if ((uint16_t)cl_ptr_vector_get_size(p_tbl) > cl_ntoh16(p_rcvd_rec->lid)) - { - status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); - if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) - { - status = IB_NOT_FOUND; - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pir_rcv_process: ERR 2109: " - "No port found with LID 0x%x\n", - cl_ntoh16(p_rcvd_rec->lid) ); - } - } - else + status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); 
+ if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) { status = IB_NOT_FOUND; osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_pir_rcv_process: ERR 2101: " - "Given LID (0x%X) is out of range:0x%X\n", - cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl) ); + "osm_pir_rcv_process: ERR 2109: " + "No port found with LID 0x%x\n", + cl_ntoh16(p_rcvd_rec->lid) ); } } else Index: opensm/osm_sa_slvl_record.c =================================================================== --- opensm/osm_sa_slvl_record.c (revision 8236) +++ opensm/osm_sa_slvl_record.c (working copy) @@ -387,25 +387,14 @@ osm_slvl_rec_rcv_process( CL_ASSERT( cl_ptr_vector_get_size(p_tbl) < 0x10000 ); - if ((uint16_t)cl_ptr_vector_get_size(p_tbl) > cl_ntoh16(p_rcvd_rec->lid)) + status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); + if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) { - status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); - if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) - { - status = IB_NOT_FOUND; - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_slvl_rec_rcv_process: ERR 2608: " - "No port found with LID 0x%x\n", - cl_ntoh16(p_rcvd_rec->lid) ); - } - } - else - { /* LID out of range */ status = IB_NOT_FOUND; osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_slvl_rec_rcv_process: ERR 2601: " - "Given LID (0x%X) is out of range:0x%X\n", - cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl)); + "osm_slvl_rec_rcv_process: ERR 2608: " + "No port found with LID 0x%x\n", + cl_ntoh16(p_rcvd_rec->lid) ); } } Index: opensm/osm_sa_vlarb_record.c =================================================================== --- opensm/osm_sa_vlarb_record.c (revision 8236) +++ opensm/osm_sa_vlarb_record.c (working copy) @@ -407,25 +407,14 @@ osm_vlarb_rec_rcv_process( CL_ASSERT( cl_ptr_vector_get_size(p_tbl) < 0x10000 ); - if ((uint16_t)cl_ptr_vector_get_size(p_tbl) > cl_ntoh16(p_rcvd_rec->lid)) + status = osm_get_port_by_base_lid( p_rcv->p_subn, 
p_rcvd_rec->lid, &p_port ); + if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) { - status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); - if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) - { - status = IB_NOT_FOUND; - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_vlarb_rec_rcv_process: ERR 2A09: " - "No port found with LID 0x%x\n", - cl_ntoh16( p_rcvd_rec->lid ) ); - } - } - else - { /* LID out of range */ status = IB_NOT_FOUND; osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "osm_vlarb_rec_rcv_process: ERR 2A01: " - "Given LID (0x%X) is out of range:0x%X\n", - cl_ntoh16(p_rcvd_rec->lid), cl_ptr_vector_get_size(p_tbl) ); + "osm_vlarb_rec_rcv_process: ERR 2A09: " + "No port found with LID 0x%x\n", + cl_ntoh16(p_rcvd_rec->lid) ); } } From rkuchimanchi at silverstorm.com Tue Jun 27 05:16:36 2006 From: rkuchimanchi at silverstorm.com (Ramachandra K) Date: Tue, 27 Jun 2006 17:46:36 +0530 Subject: [openib-general] Local QP operation error Message-ID: <44A121A4.8090509@silverstorm.com> In a kernel module, on polling the CQ, I am getting a local QP operation error (IB_WC_LOC_QP_OP_ERR). Work request posted was of type IB_WR_SEND and the QP was moved to IB_QPS_RTS state before posting the send work request. The IB specification says that this error indicates an internal QP consistency error. What are the possible reasons for this and is there any way I can pinpoint the inconsistency? I would appreciate any hints to resolve this error. Regards, Ram From tziporet at mellanox.co.il Tue Jun 27 05:39:41 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 27 Jun 2006 15:39:41 +0300 Subject: [openib-general] Kernel Oops related to IPoIB (multicast module?) In-Reply-To: <44A075C2.6060409@ichips.intel.com> References: <200606261051.12515.jackm@mellanox.co.il> <44A075C2.6060409@ichips.intel.com> Message-ID: <44A1270D.2070109@mellanox.co.il> Sean Hefty wrote: > Thanks for the additional information.
I've been trying to reproduce this, but > haven't been able to yet. I did notice that there's a several second delay when > calling modprobe -r ib_ipoib, but only if I've tried to configure ib0 first. > (No SM was running.) > > I am confused on one area. After executing modprobe -r ib_ipoib, what kept > ib_sa loaded? (Why was modprobe -r ib_sa necessary?) I would have expected it > to be unloaded at the same time. > > - Sean > > Hi Sean, Resolving this issue is critical for us since it prevents us from any usage of the new multicast module. An easy way to reproduce it is to use the OFED "openibd" script. Just run "openibd start" and then "openibd stop" and you will see the problem. This script is available within the OFED release. Thanks, Tziporet From mst at mellanox.co.il Tue Jun 27 05:45:05 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 27 Jun 2006 15:45:05 +0300 Subject: [openib-general] Local QP operation error In-Reply-To: <44A121A4.8090509@silverstorm.com> References: <44A121A4.8090509@silverstorm.com> Message-ID: <20060627124505.GL19300@mellanox.co.il> Quoting r. Ramachandra K : > Subject: Local QP operation error > > In a kernel module, on polling the CQ, I am getting a local QP > operation error (IB_WC_LOC_QP_OP_ERR). Work request > posted was of type IB_WR_SEND and the QP was moved to > IB_QPS_RTS state before posting the send work request. > > The IB specification says that this error indicates an internal QP consistency > error. What are the possible reasons for this and is there any way I can > pinpoint the inconsistency? This normally indicates some kind of driver bug, or memory corruption. What is the value of the vendor_err field?
-- MST From Thomas.Talpey at netapp.com Tue Jun 27 06:06:17 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 27 Jun 2006 09:06:17 -0400 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060627064234.GG19300@mellanox.co.il> References: <20060626215319.GA9291@osc.edu> <20060627064234.GG19300@mellanox.co.il> Message-ID: <7.0.1.0.2.20060627090204.04471ba0@netapp.com> At 02:42 AM 6/27/2006, Michael S. Tsirkin wrote: >Unless you use it, passing the absolute maximum value supported by >hardware does >not seem, to me, to make sense - it will just slow you down, and waste >resources. Is there a protocol out there that actually has a use for 30 sge? It's not a protocol thing, it's a memory registration thing. But I agree, that's a huge number of segments for send and receive. 2-4 is more typical. I'd be interested to know what wants 30 as well... Tom. From tziporet at mellanox.co.il Tue Jun 27 06:04:42 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 27 Jun 2006 16:04:42 +0300 Subject: [openib-general] ucma into kernel.org In-Reply-To: <20060626174117.GA19929@mellanox.co.il> References: <44A00EEF.702@ichips.intel.com> <20060626174117.GA19929@mellanox.co.il> Message-ID: <44A12CEA.6010204@mellanox.co.il> > > How about the cma changes required by ucma to get/set options? I think they are > not upstream yet. Could these go upstream, to make building ucma out-of-kernel > possible, without kernel patches? > > Hi Sean, These features are needed for uDAPL and were requested by Woody and Arlin for Intel MPI scalability. Since in OFED 1.1 we are going to take CMA from kernel 2.6.18, we need them upstream. Can you drive these enhancements only to 2.6.18?
Thanks, Tziporet From rkuchimanchi at silverstorm.com Tue Jun 27 06:21:19 2006 From: rkuchimanchi at silverstorm.com (Ramachandra K) Date: Tue, 27 Jun 2006 18:51:19 +0530 Subject: [openib-general] Local QP operation error In-Reply-To: <20060627124505.GL19300@mellanox.co.il> References: <44A121A4.8090509@silverstorm.com> <20060627124505.GL19300@mellanox.co.il> Message-ID: <44A130CF.2060908@silverstorm.com> Michael S. Tsirkin wrote: >>The IB specifcation says that this error indicates an internal QP consistency >>error. What are the possible reasons for this and is there any way I can pin >>point the inconsistency ? >> >> > >This normally indicates some kind of driver bug, or memory corruption. >What is the value of the vendor_err field? > > > The vendor_err field value is 115 (0x73). Just to clarify, I am writing the kernel module that is getting the local QP operation error. I guess I am missing something in my code that is causing the error. But I am unable to pinpoint the cause of the error. Does this error point to some issue with the DMA address specified in the work request SGE ? Regards, Ram -------------- next part -------------- An HTML attachment was scrubbed... URL: From Thomas.Talpey at netapp.com Tue Jun 27 06:29:53 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 27 Jun 2006 09:29:53 -0400 Subject: [openib-general] Local QP operation error In-Reply-To: <44A130CF.2060908@silverstorm.com> References: <44A121A4.8090509@silverstorm.com> <20060627124505.GL19300@mellanox.co.il> <44A130CF.2060908@silverstorm.com> Message-ID: <7.0.1.0.2.20060627092733.04471ce8@netapp.com> At 09:21 AM 6/27/2006, Ramachandra K wrote: >Does this error point to some issue with the DMA address specified >in the work request SGE ? Ding Ding Ding Ding! :-) We recently identified the exact issue in the NFS/RDMA server, which happened only when running on ia64. If you're not using the dma_map_* api, that's maybe something to look at. ;-) Tom. 
From mst at mellanox.co.il Tue Jun 27 06:31:19 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 27 Jun 2006 16:31:19 +0300 Subject: [openib-general] Local QP operation error In-Reply-To: <44A130CF.2060908@silverstorm.com> References: <44A130CF.2060908@silverstorm.com> Message-ID: <20060627133119.GO19300@mellanox.co.il> Quoting r. Ramachandra K : > Just to clarify, I am writing the kernel module that is getting the local > QP operation error. I guess I am missing something in my code that > is causing the error. But I am unable to pinpoint the cause of the error. > > Does this error point to some issue with the DMA address specified > in the work request SGE ? Yes, it seems hardware could not read (gather) data when executing the work request SGE. -- MST From mshefty at ichips.intel.com Tue Jun 27 08:45:52 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 27 Jun 2006 08:45:52 -0700 Subject: [openib-general] Kernel Oops related to IPoIB (multicast module?) In-Reply-To: <44A1270D.2070109@mellanox.co.il> References: <200606261051.12515.jackm@mellanox.co.il> <44A075C2.6060409@ichips.intel.com> <44A1270D.2070109@mellanox.co.il> Message-ID: <44A152B0.3000007@ichips.intel.com> Tziporet Koren wrote: > Resolving this issue is critical for us since it prevent us from any > usage of the new multicsat module. > An easy way to reproduce it is to use the OFED "openibd" script. Just > run "openibd start" and than "openibd stop" and you will see the > problem. This script is available within OFED release. I am working on trying to resolve this as my top priority at the moment, but I have not been able to reproduce this on my systems. I want to understand why ib_sa was not unloaded as part of modprobe -r ib_ipoib, but why ib_multicast apparently was. I will examine the script that you mentioned, but I typically do not run the OFED release. 
- Sean From sashak at voltaire.com Tue Jun 27 10:07:20 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 27 Jun 2006 20:07:20 +0300 Subject: [openib-general] [PATCH] opensm: libibmad: match MAD TransactionID In-Reply-To: <44A0D4C5.4060705@mellanox.co.il> References: <20060626153500.18078.85785.stgit@sashak.voltaire.com> <44A0D4C5.4060705@mellanox.co.il> Message-ID: <20060627170720.GO16738@sashak.voltaire.com> Hi Eitan, On 09:48 Tue 27 Jun , Eitan Zahavi wrote: > Hi Sasha > > Can you provide a little more info on the cause and impact of the > issue you are solving with this patch? umad_recv() uses poll(); when it times out, umad_recv() returns an error and _do_madrpc() returns with an error too. The next _do_madrpc() session will get the previous response MAD. And so on. > > How is it related to work on the thread: > "mad: add GID/class checking for matching received to sent MADs"? It is not related. Sasha > > Thanks > > Sasha Khapyorsky wrote: > >Match MAD TransactionID on receiving. This prevents request/response MADs > >mixing - reproducible when poll() (in libibumad) returns timeout.
> > From halr at voltaire.com Tue Jun 27 10:25:31 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Jun 2006 13:25:31 -0400 Subject: [openib-general] [PATCH][TRIVIAL] OpenSM/osm_pkey_mgr.c: In pkey_mgr_get_physp_max_blocks, use routine rather than accessing structure member directly Message-ID: <1151429130.4482.194685.camel@hal.voltaire.com> OpenSM/osm_pkey_mgr.c: In pkey_mgr_get_physp_max_blocks, use routine rather than accessing structure member directly Signed-off-by: Hal Rosenstock Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 8220) +++ opensm/osm_pkey_mgr.c (working copy) @@ -81,7 +81,7 @@ pkey_mgr_get_physp_max_blocks( num_pkeys = cl_ntoh16( p_node->node_info.partition_cap ); else { - p_sw = osm_get_switch_by_guid( p_subn, p_node->node_info.node_guid ); + p_sw = osm_get_switch_by_guid( p_subn, osm_node_get_node_guid( p_node ) ); if (p_sw) num_pkeys = cl_ntoh16( p_sw->switch_info.enforce_cap ); } From halr at voltaire.com Tue Jun 27 10:29:10 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Jun 2006 13:29:10 -0400 Subject: [openib-general] [PATCH] OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_switch_port, better BSP0 handling Message-ID: <1151429320.4482.194834.camel@hal.voltaire.com> OpenSM/osm_port_info_rcv.c: In __osm_pi_rcv_process_switch_port, better BSP0 handling In __osm_pi_rcv_process_switch_port, if base switch port 0, then copy the received PortInfo attribute into the physp structure regardless of the port state. On BSP0, the port state is not used so this protects against an SMA which set this to LINK_DOWN. This makes the code for BSP0 more similar to how it originally was at the cost of an extra copy of the PortInfo attribute. 
Signed-off-by: Hal Rosenstock Index: opensm/osm_port_info_rcv.c =================================================================== --- opensm/osm_port_info_rcv.c (revision 8252) +++ opensm/osm_port_info_rcv.c (working copy) @@ -239,6 +239,8 @@ __osm_pi_rcv_process_switch_port( uint8_t port_num; uint8_t remote_port_num; osm_dr_path_t path; + osm_switch_t *p_sw; + ib_switch_info_t *p_si; OSM_LOG_ENTER( p_rcv->p_log, __osm_pi_rcv_process_switch_port ); @@ -350,6 +352,15 @@ __osm_pi_rcv_process_switch_port( "__osm_pi_rcv_process_switch_port: ERR 0F04: " "Invalid base LID 0x%x corrected\n", cl_ntoh16( orig_lid ) ); + /* Determine if base switch port 0 */ + p_sw = osm_get_switch_by_guid(p_rcv->p_subn, + osm_node_get_node_guid( p_node )); + if (p_sw && (p_si = osm_switch_get_si_ptr(p_sw)) && + !ib_switch_info_is_enhanced_port0(p_si)) + { + /* PortState is not used on BSP0 but just in case it is DOWN */ + p_physp->port_info = *p_pi; + } __osm_pi_rcv_process_endport(p_rcv, p_physp, p_pi); } From halr at voltaire.com Tue Jun 27 10:32:23 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Jun 2006 13:32:23 -0400 Subject: [openib-general] [PATCH] opensm: libibmad: match MAD TransactionID In-Reply-To: <20060627170720.GO16738@sashak.voltaire.com> References: <20060626153500.18078.85785.stgit@sashak.voltaire.com> <44A0D4C5.4060705@mellanox.co.il> <20060627170720.GO16738@sashak.voltaire.com> Message-ID: <1151429349.4482.194836.camel@hal.voltaire.com> On Tue, 2006-06-27 at 13:07, Sasha Khapyorsky wrote: > Hi Eitan, > > On 09:48 Tue 27 Jun , Eitan Behave wrote: > > Hi Sasha > > > > Can you provide a little more info on the cause and impact of the > > issue you are solving with this patch? > > umad_recv() uses poll(), when it is timeouted umad_recv() returns error > and _do_madrpc() returns with error too. The next _do_madrpc() session > will got the previous response MAD. And so on. 
One more note to add to this: This only affects the OpenIB diagnostics and not OpenSM, as the latter does not use this library; it uses umad directly, not via rpc. -- Hal > > How is it related to work on the thread: > > "mad: add GID/class checking for matching received to sent MADs"? > > It is not related. > > Sasha > > > > > > Thanks > > > > > > Sasha Khapyorsky wrote: > > >Match MAD TransactionID on receiving. This prevents request/response MADs > > >mixing - reproducible when poll() (in libibumad) returns timeout. > > > From bpradip at in.ibm.com Tue Jun 27 10:51:49 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Tue, 27 Jun 2006 23:21:49 +0530 Subject: [openib-general] [IWARP BRANCH] [PATCH 0/3] Fix rdma_lat and rdma_bw to work with the new stack and libraries Message-ID: <20060627175141.GA9249@harry-potter.ibm.com> The present rdma_lat and rdma_bw utilizing the RDMA CM are broken and don't work with the latest libraries. The present code breaks because of using the old signature for the function rdma_get_cm_event. old function signature - int rdma_get_cm_event(struct rdma_cm_event **event) new function signature - int rdma_get_cm_event(struct rdma_event_channel *channel, struct rdma_cm_event **event) This patchset consists of changes for rdma_lat, rdma_bw and Makefile. 1 - rdma_lat.c changes 2 - rdma_bw.c changes 3 - Makefile changes Signed-off-by: Pradipta Kumar Banerjee --- Thanks, Pradipta Kumar. From bpradip at in.ibm.com Tue Jun 27 10:56:26 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Tue, 27 Jun 2006 23:26:26 +0530 Subject: [openib-general] [IWARP BRANCH] [PATCH 1/3] Fix rdma_lat and rdma_bw to work with the new stack and libraries Message-ID: <20060627175624.GB9249@harry-potter.ibm.com> This patch fixes the broken rdma_lat by using the correct function signature for rdma_get_cm_event.
old function signature - int rdma_get_cm_event(struct rdma_cm_event **event) new function signature - int rdma_get_cm_event(struct rdma_event_channel *channel, struct rdma_cm_event **event) Signed-off-by: Pradipta Kumar Banerjee --- Index: rdma_lat.c ============================================================================= --- ../perftest-org/rdma_lat.c 2006-06-22 18:28:13.000000000 +0530 +++ rdma_lat.c 2006-06-22 18:36:12.000000000 +0530 @@ -51,6 +51,7 @@ #include #include #include +#include #include #include @@ -83,6 +84,7 @@ struct pingpong_context { struct ibv_sge list; struct ibv_send_wr wr; struct rdma_cm_id *cm_id; + struct rdma_event_channel *cm_channel; }; struct pingpong_dest { @@ -612,11 +614,12 @@ static void pp_close_cma(struct pingpong } } - rdma_get_cm_event(&event); + rdma_get_cm_event(ctx->cm_channel, &event); if (event->event != RDMA_CM_EVENT_DISCONNECTED) printf("unexpected event during disconnect %d\n", event->event); rdma_ack_cm_event(event); rdma_destroy_id(ctx->cm_id); + rdma_destroy_event_channel(ctx->cm_channel); } static struct pingpong_context *pp_server_connect_cma(unsigned short port, int size, int tx_depth, @@ -629,17 +632,26 @@ static struct pingpong_context *pp_serve int ret; struct sockaddr_in sin; struct rdma_cm_id *child_cm_id; + struct rdma_event_channel *channel; struct pingpong_context *ctx; - + printf("%s starting server\n", __FUNCTION__); - ret = rdma_create_id(&listen_id, NULL); - if (ret) { - fprintf(stderr, "%s rdma_create_id failed %d\n", __FUNCTION__, ret); + channel = rdma_create_event_channel(); + if (!channel) { + ret = errno; + fprintf(stderr, "%s rdma_create_event_channel failed with error %d\n", + __FUNCTION__, ret); return NULL; } + ret = rdma_create_id(channel, &listen_id, NULL); + if (ret) { + fprintf(stderr, "%s rdma_create_id failed %d\n", __FUNCTION__, ret); + goto err3; + } + memset(&sin, 0, sizeof(sin)); sin.sin_addr.s_addr = 0; - sin.sin_family = PF_INET; + sin.sin_family = AF_INET; sin.sin_port = 
htons(port); ret = rdma_bind_addr(listen_id, (struct sockaddr *)&sin); if (ret) { @@ -653,7 +665,7 @@ static struct pingpong_context *pp_serve goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -678,7 +690,8 @@ static struct pingpong_context *pp_serve fprintf(stderr,"%s pp_init_cma_ctx failed\n", __FUNCTION__); goto err0; } - + + ctx->cm_channel = channel; my_dest->qpn = 0; my_dest->psn = 0xbb; my_dest->rkey = ctx->mr->rkey; @@ -694,7 +707,7 @@ static struct pingpong_context *pp_serve goto err0; } rdma_ack_cm_event(event); - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) { fprintf(stderr,"rdma_get_cm_event error %d\n", ret); rdma_destroy_id(child_cm_id); @@ -713,8 +726,10 @@ err0: err1: rdma_ack_cm_event(event); err2: - rdma_destroy_id(listen_id); fprintf(stderr,"%s NOT connected!\n", __FUNCTION__); + rdma_destroy_id(listen_id); +err3: + rdma_destroy_event_channel(channel); return NULL; } @@ -750,6 +765,7 @@ static struct pingpong_context *pp_clien int ret; struct sockaddr_in sin; struct rdma_cm_id *cm_id; + struct rdma_event_channel *channel; struct pingpong_context *ctx; fprintf(stderr,"%s starting client\n", __FUNCTION__); @@ -758,10 +774,18 @@ static struct pingpong_context *pp_clien return NULL; } - ret = rdma_create_id(&cm_id, NULL); + channel = rdma_create_event_channel(); + if (!channel) { + ret = errno; + fprintf(stderr, "%s rdma_create_event_channel failed with error %d\n", + __FUNCTION__, ret); + return NULL; + } + + ret = rdma_create_id(channel, &cm_id, NULL); if (ret) { fprintf(stderr,"%s rdma_create_id failed %d\n", __FUNCTION__, ret); - return NULL; + goto err3; } sin.sin_family = PF_INET; @@ -772,7 +796,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -789,7 +813,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = 
rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -806,7 +830,8 @@ static struct pingpong_context *pp_clien fprintf(stderr,"%s pp_init_cma_ctx failed\n", __FUNCTION__); goto err2; } - + + ctx->cm_channel = channel; my_dest->qpn = 0; my_dest->psn = 0xaa; my_dest->rkey = ctx->mr->rkey; @@ -823,7 +848,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -845,8 +870,10 @@ static struct pingpong_context *pp_clien err1: rdma_ack_cm_event(event); err2: - fprintf(stderr,"NOT connected!\n"); + fprintf(stderr,"%s NOT connected!\n", __FUNCTION__); rdma_destroy_id(cm_id); +err3: + rdma_destroy_event_channel(channel); return NULL; } From bpradip at in.ibm.com Tue Jun 27 10:58:28 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Tue, 27 Jun 2006 23:28:28 +0530 Subject: [openib-general] [IWARP BRANCH] [PATCH 2/3] Fix rdma_lat and rdma_bw to work with the new stack and libraries Message-ID: <20060627175826.GC9249@harry-potter.ibm.com> This patch fixes the broken rdma_bw by using the correct function signature for rdma_get_cm_event. 
old function signature - int rdma_get_cm_event(struct rdma_cm_event **event) new function signature - int rdma_get_cm_event(struct rdma_event_channel *channel, struct rdma_cm_event **event) Signed-off-by: Pradipta Kumar Banerjee --- Index: rdma_bw.c ============================================================================= --- ../perftest-org/rdma_bw.c 2006-06-22 18:28:13.000000000 +0530 +++ rdma_bw.c 2006-06-22 18:40:01.000000000 +0530 @@ -51,6 +51,7 @@ #include #include #include +#include #include #include @@ -75,6 +76,7 @@ struct pingpong_context { struct ibv_sge list; struct ibv_send_wr wr; struct rdma_cm_id *cm_id; + struct rdma_event_channel *cm_channel; }; struct pingpong_dest { @@ -545,11 +547,12 @@ static void pp_close_cma(struct pingpong } } - rdma_get_cm_event(&event); + rdma_get_cm_event(ctx->cm_channel, &event); if (event->event != RDMA_CM_EVENT_DISCONNECTED) printf("unexpected event during disconnect %d\n", event->event); rdma_ack_cm_event(event); rdma_destroy_id(ctx->cm_id); + rdma_destroy_event_channel(ctx->cm_channel); } static struct pingpong_context *pp_server_connect_cma(unsigned short port, int size, int tx_depth, @@ -562,13 +565,22 @@ static struct pingpong_context *pp_serve int ret; struct sockaddr_in sin; struct rdma_cm_id *child_cm_id; + struct rdma_event_channel *channel; struct pingpong_context *ctx; printf("%s starting server\n", __FUNCTION__); - ret = rdma_create_id(&listen_id, NULL); + channel = rdma_create_event_channel(); + if (!channel) { + ret = errno; + fprintf(stderr, "%s rdma_create_event_channel failed with error %d\n", + __FUNCTION__, ret); + return NULL; + } + + ret = rdma_create_id(channel, &listen_id, NULL); if (ret) { fprintf(stderr, "%s rdma_create_id failed %d\n", __FUNCTION__, ret); - return NULL; + goto err3; } sin.sin_addr.s_addr = 0; @@ -586,7 +598,7 @@ static struct pingpong_context *pp_serve goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -612,6 
+624,7 @@ static struct pingpong_context *pp_serve goto err0; } + ctx->cm_channel = channel; my_dest->qpn = 0; my_dest->psn = 0xbb; my_dest->rkey = ctx->mr->rkey; @@ -627,7 +640,7 @@ static struct pingpong_context *pp_serve goto err0; } rdma_ack_cm_event(event); - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) { fprintf(stderr,"rdma_get_cm_event error %d\n", ret); rdma_destroy_id(child_cm_id); @@ -646,8 +659,10 @@ err0: err1: rdma_ack_cm_event(event); err2: - rdma_destroy_id(listen_id); fprintf(stderr,"%s NOT connected!\n", __FUNCTION__); + rdma_destroy_id(listen_id); +err3: + rdma_destroy_event_channel(channel); return NULL; } @@ -683,6 +698,7 @@ static struct pingpong_context *pp_clien int ret; struct sockaddr_in sin; struct rdma_cm_id *cm_id; + struct rdma_event_channel *channel; struct pingpong_context *ctx; fprintf(stderr,"%s starting client\n", __FUNCTION__); @@ -691,10 +707,18 @@ static struct pingpong_context *pp_clien return NULL; } - ret = rdma_create_id(&cm_id, NULL); + channel = rdma_create_event_channel(); + if (!channel) { + ret = errno; + fprintf(stderr, "%s rdma_create_event_channel failed with error %d\n", + __FUNCTION__, ret); + return NULL; + } + + ret = rdma_create_id(channel, &cm_id, NULL); if (ret) { fprintf(stderr,"%s rdma_create_id failed %d\n", __FUNCTION__, ret); - return NULL; + goto err3; } sin.sin_family = PF_INET; @@ -705,7 +729,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -722,7 +746,7 @@ static struct pingpong_context *pp_clien goto err2; } - ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -740,6 +764,7 @@ static struct pingpong_context *pp_clien goto err2; } + ctx->cm_channel = channel; my_dest->qpn = 0; my_dest->psn = 0xaa; my_dest->rkey = ctx->mr->rkey; @@ -756,7 +781,7 @@ static struct pingpong_context *pp_clien goto err2; } - 
ret = rdma_get_cm_event(&event); + ret = rdma_get_cm_event(channel, &event); if (ret) goto err2; @@ -779,6 +804,8 @@ err1: err2: fprintf(stderr,"NOT connected!\n"); rdma_destroy_id(cm_id); +err3: + rdma_destroy_event_channel(channel); return NULL; } From bpradip at in.ibm.com Tue Jun 27 11:01:22 2006 From: bpradip at in.ibm.com (Pradipta Kumar Banerjee) Date: Tue, 27 Jun 2006 23:31:22 +0530 Subject: [openib-general] [IWARP BRANCH] [PATCH 3/3] Fix rdma_lat and rdma_bw to work with the new stack and libraries Message-ID: <20060627180120.GD9249@harry-potter.ibm.com> This fixes the Makefile to properly build rdma_lat and rdma_bw Includes the librdmacm library. Signed-off-by: Pradipta Kumar Banerjee --- Index: Makefile ============================================================================= --- bkp/Makefile 2006-06-22 10:18:58.000000000 +0530 +++ Makefile 2006-06-22 10:26:55.000000000 +0530 @@ -10,7 +10,7 @@ EXTRA_HEADERS = get_clock.h LOADLIBES += LDFLAGS += -${TESTS}: LOADLIBES += -libverbs +${TESTS}: LOADLIBES += -libverbs -lrdmacm ${TESTS} ${UTILS}: %: %.c ${EXTRA_FILES} ${EXTRA_HEADERS} $(CC) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $< ${EXTRA_FILES} $(LOADLIBES) $(LDLIBS) -o $@ From swise at opengridcomputing.com Tue Jun 27 11:07:27 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 Jun 2006 13:07:27 -0500 Subject: [openib-general] [IWARP BRANCH] [PATCH 0/3] Fix rdma_lat and rdma_bw to work with the new stack and libraries In-Reply-To: <20060627175141.GA9249@harry-potter.ibm.com> References: <20060627175141.GA9249@harry-potter.ibm.com> Message-ID: <1151431647.3207.47.camel@stevo-desktop> Committed in the iwarp branch. r8254. Thanks, Steve. On Tue, 2006-06-27 at 23:21 +0530, Pradipta Kumar Banerjee wrote: > The present rdma_lat and rdma_bw utilizing the RDMA CM is broken and doesn't > work with the latest libraries. The present code breaks because of using the old > signature for the function rdma_get_cm_event. 
> > old function signature - int rdma_get_cm_event(struct rdma_cm_event **event) > new function signature - int rdma_get_cm_event(struct rdma_event_channel *channel, > struct rdma_cm_event **event) > > This patchset consists of changes for rdma_lat, rdma_bw and Makefile. > > 1 - rdma_lat.c changes > 2 - rdma_bw.c changes > 3 - Makefile changes > > Signed-off-by: Pradipta Kumar Banerjee > > --- > > Thanks, > Pradipta Kumar. From mst at mellanox.co.il Tue Jun 27 11:18:27 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 27 Jun 2006 21:18:27 +0300 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: <44A0415A.9000105@ichips.intel.com> References: <44A0415A.9000105@ichips.intel.com> Message-ID: <20060627181827.GD4896@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: RFC: CMA backlog (was Re: CMA backlog) > > Michael S. Tsirkin wrote: > > I'm just saying that we can use exactly the code in ib_destroy_cm_id, but > > avoid calling ib_send_cm_rej in this one case: > > Ah... yes, something like that should work. Like this then (untested)? Signed-off-by: Michael S.
Tsirkin Index: linux-2.6.17-2.6.18/drivers/infiniband/core/cm.c =================================================================== --- linux-2.6.17-2.6.18.orig/drivers/infiniband/core/cm.c 2006-06-27 12:21:34.000000000 +0300 +++ linux-2.6.17-2.6.18/drivers/infiniband/core/cm.c 2006-06-27 21:16:49.000000000 +0300 @@ -701,7 +701,7 @@ static void cm_reset_to_idle(struct cm_i } } -void ib_destroy_cm_id(struct ib_cm_id *cm_id) +static void cm_destroy_id(struct ib_cm_id *cm_id, int reject) { struct cm_id_private *cm_id_priv; struct cm_work *work; @@ -731,9 +731,9 @@ retest: ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); spin_unlock_irqrestore(&cm_id_priv->lock, flags); ib_send_cm_rej(cm_id, IB_CM_REJ_TIMEOUT, - &cm_id_priv->av.port->cm_dev->ca_guid, - sizeof cm_id_priv->av.port->cm_dev->ca_guid, - NULL, 0); + &cm_id_priv->av.port->cm_dev->ca_guid, + sizeof cm_id_priv->av.port->cm_dev->ca_guid, + NULL, 0); break; case IB_CM_MRA_REQ_RCVD: case IB_CM_REP_SENT: @@ -744,9 +744,14 @@ retest: case IB_CM_MRA_REQ_SENT: case IB_CM_REP_RCVD: case IB_CM_MRA_REP_SENT: - spin_unlock_irqrestore(&cm_id_priv->lock, flags); - ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED, - NULL, 0, NULL, 0); + if (reject) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED, + NULL, 0, NULL, 0); + } else { + cm_reset_to_idle(cm_id_priv); + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + } break; case IB_CM_ESTABLISHED: spin_unlock_irqrestore(&cm_id_priv->lock, flags); @@ -775,6 +780,12 @@ retest: kfree(cm_id_priv->private_data); kfree(cm_id_priv); } + +void ib_destroy_cm_id(struct ib_cm_id *cm_id) +{ + cm_destroy_id(cm_id, 1); +} + EXPORT_SYMBOL(ib_destroy_cm_id); int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, @@ -1163,7 +1174,7 @@ static void cm_process_work(struct cm_id } cm_deref_id(cm_id_priv); if (ret) - ib_destroy_cm_id(&cm_id_priv->id); + cm_destroy_id(&cm_id_priv->id, ret != -ENOMEM); } 
static void cm_format_mra(struct cm_mra_msg *mra_msg, -- MST From halr at voltaire.com Tue Jun 27 12:22:54 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Jun 2006 15:22:54 -0400 Subject: [openib-general] [PATCH]OpenSM/osm_lid_mgr.c: In __osm_lid_mgr_init_sweep, support enhanced switch port 0 for LMC > 0 Message-ID: <1151436172.4482.199561.camel@hal.voltaire.com> OpenSM/osm_lid_mgr.c: In __osm_lid_mgr_init_sweep, support enhanced switch port 0 for LMC > 0 Base port 0 is constrained to have LMC of 0 whereas enhanced switch port 0 is not. Support enhanced switch port 0 is more like CA and router ports in terms of this. Signed-off-by: Hal Rosenstock Index: opensm/osm_lid_mgr.c =================================================================== --- opensm/osm_lid_mgr.c (revision 8239) +++ opensm/osm_lid_mgr.c (working copy) @@ -94,6 +94,7 @@ #include #include #include +#include #include #include #include @@ -351,6 +352,8 @@ __osm_lid_mgr_init_sweep( osm_lid_mgr_range_t *p_range = NULL; osm_port_t *p_port; cl_qmap_t *p_port_guid_tbl; + osm_switch_t *p_sw; + ib_switch_info_t *p_si; uint8_t lmc_num_lids = (uint8_t)(1 << p_mgr->p_subn->opt.lmc); uint16_t lmc_mask; uint16_t req_lid, num_lids; @@ -436,7 +439,20 @@ __osm_lid_mgr_init_sweep( IB_NODE_TYPE_SWITCH ) num_lids = lmc_num_lids; else - num_lids = 1; + { + /* Determine if enhanced switch port 0 */ + p_sw = osm_get_switch_by_guid(p_mgr->p_subn, + osm_node_get_node_guid(osm_port_get_parent_node(p_port))); + if (p_sw && (p_si = osm_switch_get_si_ptr(p_sw)) && + ib_switch_info_is_enhanced_port0(p_si)) + { + num_lids = lmc_num_lids; + } + else + { + num_lids = 1; + } + } if ((num_lids != 1) && (((db_min_lid & lmc_mask) != db_min_lid) || From mshefty at ichips.intel.com Tue Jun 27 12:36:30 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 27 Jun 2006 12:36:30 -0700 Subject: [openib-general] Kernel Oops related to IPoIB (multicast module?) 
In-Reply-To: <200606261051.12515.jackm@mellanox.co.il> References: <200606261051.12515.jackm@mellanox.co.il> Message-ID: <44A188BE.1080207@ichips.intel.com> Jack Morgenstein wrote: > Evidently, ipoib was still attempting to connect with an SA, when the ipoib > module was unloaded (modprobe -r). After the ipoib module was unloaded (or at > least rendered inaccessible), the ib_sa module attempted to invoke > "ib_sa_mcmember_rec_callback" (for a callback address that was part of the > unloaded ipoib module). Hence, the Oops below. I still haven't been able to reproduce this, but I _think_ I understand what's likely happening. The SA query interface always invokes a callback, regardless if a call succeeds. So if a call to ib_sa_mcmmember_rec_set() fails (which happens in this case because the SM is down), the user's callback is still invoked. The multicast module is coded assuming that an immediate failure does not result in a callback, so the callback is unexpected, which throws off the reference counting. I should have a patch for this shortly, but since I can't reproduce the problem, my testing of it will be limited. - Sean From jlentini at netapp.com Tue Jun 27 13:00:05 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 27 Jun 2006 16:00:05 -0400 (EDT) Subject: [openib-general] new uDAPL co-maintainer Message-ID: In recognition of his many contributions to the DAPL project, Arlin Davis is joining the project as an official co-maintainer. Arlin and I will collaborate on DAPL maintenance and development decisions. 
james -- James Lentini | Network Appliance | 781-768-5359 | jlentini at netapp.com From pw at osc.edu Tue Jun 27 13:21:03 2006 From: pw at osc.edu (Pete Wyckoff) Date: Tue, 27 Jun 2006 16:21:03 -0400 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060627064234.GG19300@mellanox.co.il> References: <20060626215319.GA9291@osc.edu> <20060627064234.GG19300@mellanox.co.il> Message-ID: <20060627202103.GA10737@osc.edu> mst at mellanox.co.il wrote on Tue, 27 Jun 2006 09:42 +0300: > Quoting r. Pete Wyckoff : > > Is this a known issue? > > Yes. The fact that ibv_query_device returns some value in hca_cap can not > guarantee that ibv_create_qp with these parameters will succeed. For example, > system administrator might have imposed a limit on the amount of memory you can > pin down, and you will get ENOMEM. I was hoping to get a guaranteed maximum number from ibv_query_device so that I would know that calls to ibv_create_qp would not fail due to my asking for too many CQ entries. My code has some idea of how many it wants (16), and compares that to the hca_cap values to settle for what it can get. I only happened to notice that 30 wouldn't work even though it was so claimed when debugging. > > Should I always subtract 1 from the reported max on the send side? Just for > > this hardware? > > Unless you use it, passing the absolute maximum value supported by hardware does > not seem, to me, to make sense - it will just slow you down, and waste > resources. Is there a protocol out there that actually has a use for 30 sge? Perhaps I don't understand what is more resource-costly about using 29 sge when they are supported by the hardware. I'm using them on the send side to avoid having to either: 1. memcpy 29 little buffers into one big buffer or 2. send 29 rdma writes instead of a single rdma write with 29 sges The buffer on the receiver is contiguous and big enough to hold everything. 
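Pete's approach (one RDMA write whose gather list covers all the scattered client buffers) comes down to filling an SGE array and summing the lengths so the contiguous receive side is known to be large enough. A minimal sketch of that bookkeeping follows; the struct is a stand-in for ibv_sge, and nothing below is taken from the pvfs2 code or from <infiniband/verbs.h>:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct ibv_sge; the real definition lives in
 * <infiniband/verbs.h>. Field names mirror it for readability only. */
struct sge {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

/* Build one gather list covering n discontiguous source buffers so a
 * single RDMA write can land them in one contiguous remote region.
 * Returns the total byte count the remote buffer must hold. */
size_t build_gather_list(struct sge *sg, int n,
                         void *const bufs[], const size_t lens[],
                         uint32_t lkey)
{
    size_t total = 0;
    for (int i = 0; i < n; i++) {
        sg[i].addr   = (uint64_t)(uintptr_t)bufs[i];
        sg[i].length = (uint32_t)lens[i];
        sg[i].lkey   = lkey;    /* each buffer must already be registered */
        total += lens[i];
    }
    return total;
}
```

The resulting array would be attached to a single work request (sg_list/num_sge), which is exactly the alternative to either memcpy into one big buffer or posting one write per fragment.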
> In my opinion, for the application to be robust it has to either use small > values that empirically work on most systems, or be able to scale down to > require less resources if an allocation fails. Scale down? So if ibv_create_qp fails, you think I should look at the return value (which is NULL, not ENOMEM or EINVAL or anything informative), and then gradually reduce the values for max_recv_sge, max_send_sge, max_recv_wr, max_send_wr, max_inline_data below the reported HCA maximum until I find something that works? I'll subtract 1 from the hca_cap.max_sge for Mellanox hardware before doing the comparison against how many SGEs I'd like to get. Otherwise I can't see much alternative to trusting the hca_cap values that are returned. -- Pete From pw at osc.edu Tue Jun 27 13:34:33 2006 From: pw at osc.edu (Pete Wyckoff) Date: Tue, 27 Jun 2006 16:34:33 -0400 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <7.0.1.0.2.20060627090204.04471ba0@netapp.com> References: <20060626215319.GA9291@osc.edu> <20060627064234.GG19300@mellanox.co.il> <7.0.1.0.2.20060627090204.04471ba0@netapp.com> Message-ID: <20060627203433.GB10737@osc.edu> Thomas.Talpey at netapp.com wrote on Tue, 27 Jun 2006 09:06 -0400: > At 02:42 AM 6/27/2006, Michael S. Tsirkin wrote: > >Unless you use it, passing the absolute maximum value supported by > >hardware does > >not seem, to me, to make sense - it will just slow you down, and waste > >resources. Is there a protocol out there that actually has a use for 30 sge? > > It's not a protocol thing, it's a memory registration thing. But I agree, > that's a huge number of segments for send and receive. 2-4 is more > typical. I'd be interested to know what wants 30 as well... This is the OpenIB port of pvfs2: http://www.pvfs.org/pvfs2/download.html See pvfs2/src/io/bmi/bmi_ib/openib.c for the bottom of the transport stack. The max_sge-1 aspect I'm complaining about isn't checked in yet. It's a file system application. 
The MPI-IO interface provides datatypes and file views that let a client write complex subsets of the in-memory data to a file with a single call. One case that happens is contiguous-in-file but discontiguous-in-memory, where the file system client writes data from multiple addresses to a single region in a file. The application calls MPI_File_write or a variant, and this complex buffer description filters all the way down to the OpenIB transport, which then has to figure out how to get the data to the server. These separate data regions may have been allocated all at once using MPI_Alloc_mem (rarely), or may have been used previously for file system operations so are already pinned in the registration cache. Are you implying there is more memory registration work that has to happen beyond making sure each of the SGE buffers is pinned and has a valid lkey? It would not be a major problem to avoid using more than a couple of SGEs; however, I didn't see any reason to avoid them. Please let me know if you see a problem with this approach. -- Pete From ralphc at pathscale.com Tue Jun 27 14:06:02 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Tue, 27 Jun 2006 14:06:02 -0700 Subject: [openib-general] [PATCH] change libipathverbs to use the new initialization convention Message-ID: <1151442362.4572.67.camel@brick.pathscale.com> The libibverbs.so.2 has a different device plug-in module initialization convention from libibverbs.so.1. This patch updates the InfiniPath libipathverbs module to conform to the new convention.
Signed-off-by Ralph Campbell Index: src/userspace/libipathverbs/src/ipathverbs.map =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.map (revision 8255) +++ src/userspace/libipathverbs/src/ipathverbs.map (working copy) @@ -1,4 +1,4 @@ { - global: openib_driver_init; + global: ibv_driver_init; local: *; }; Index: src/userspace/libipathverbs/src/ipathverbs.c =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.c (revision 8255) +++ src/userspace/libipathverbs/src/ipathverbs.c (working copy) @@ -145,30 +145,24 @@ .free_context = ipath_free_context }; -struct ibv_device *openib_driver_init(struct sysfs_class_device *sysdev) +struct ibv_device *ibv_driver_init(const char *uverbs_sys_path, + int abi_version) { - struct sysfs_device *pcidev; - struct sysfs_attribute *attr; + char value[8]; struct ipath_device *dev; - unsigned vendor, device; - int i; + unsigned vendor, device; + int i; - pcidev = sysfs_get_classdev_device(sysdev); - if (!pcidev) + if (ibv_read_sysfs_file(uverbs_sys_path, "device/vendor", + value, sizeof value) < 0) return NULL; + sscanf(value, "%i", &vendor); - attr = sysfs_get_device_attr(pcidev, "vendor"); - if (!attr) + if (ibv_read_sysfs_file(uverbs_sys_path, "device/device", + value, sizeof value) < 0) return NULL; - sscanf(attr->value, "%i", &vendor); - sysfs_close_attribute(attr); + sscanf(value, "%i", &device); - attr = sysfs_get_device_attr(pcidev, "device"); - if (!attr) - return NULL; - sscanf(attr->value, "%i", &device); - sysfs_close_attribute(attr); - for (i = 0; i < sizeof hca_table / sizeof hca_table[0]; ++i) if (vendor == hca_table[i].vendor && device == hca_table[i].device) @@ -180,13 +174,12 @@ dev = malloc(sizeof *dev); if (!dev) { fprintf(stderr, PFX "Fatal: couldn't allocate device for %s\n", - sysdev->name); - abort(); + uverbs_sys_path); + return NULL; } dev->ibv_dev.ops = ipath_dev_ops; dev->hca_type = 
hca_table[i].type; - dev->page_size = sysconf(_SC_PAGESIZE); return &dev->ibv_dev; } Index: src/userspace/libipathverbs/src/ipathverbs.h =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.h (revision 8255) +++ src/userspace/libipathverbs/src/ipathverbs.h (working copy) @@ -57,7 +57,6 @@ struct ipath_device { struct ibv_device ibv_dev; enum ipath_hca_type hca_type; - int page_size; }; struct ipath_context { From mshefty at ichips.intel.com Tue Jun 27 14:13:17 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 27 Jun 2006 14:13:17 -0700 Subject: [openib-general] Kernel Oops related to IPoIB (multicast module?) In-Reply-To: <44A188BE.1080207@ichips.intel.com> References: <200606261051.12515.jackm@mellanox.co.il> <44A188BE.1080207@ichips.intel.com> Message-ID: <44A19F6D.7060008@ichips.intel.com> Sean Hefty wrote: > The SA query interface always invokes a callback, regardless if a call succeeds. > So if a call to ib_sa_mcmmember_rec_set() fails (which happens in this case > because the SM is down), the user's callback is still invoked. The multicast > module is coded assuming that an immediate failure does not result in a > callback, so the callback is unexpected, which throws off the reference counting. I've committed a patch that should hopefully fix this problem. The problem was that a return code of 0 from the SA query calls should have been treated as valid, rather than an error. - Sean From mst at mellanox.co.il Tue Jun 27 14:28:51 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 28 Jun 2006 00:28:51 +0300 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060627203433.GB10737@osc.edu> References: <20060627203433.GB10737@osc.edu> Message-ID: <20060627212851.GB5398@mellanox.co.il> Quoting r. Pete Wyckoff : > It would not be a major problem to avoid using more than a couple of > SGEs; however, I didn't see any reason to avoid them. 
Please let me > know if you see a problem with this approach. A QP with a large number of SGEs per WQE enabled uses up more resources and might also be slower if typical WR has a small number of SGEs. So you should anticipate the typical number of SGEs for best performance. As I mentioned previously, even if you do want a large number of SGEs, but you want your application to be robust and scalable, you should scale your parameters down if QP allocation fails since device query does not guarantee the allocation will always succeed. -- MST From ralphc at pathscale.com Tue Jun 27 15:02:18 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Tue, 27 Jun 2006 15:02:18 -0700 Subject: [openib-general] [PATCH] add support for ibv_query_qp(), ibv_query_srq() to libipathverbs Message-ID: <1151445738.4572.73.camel@brick.pathscale.com> This patch adds support for ibv_query_qp() and ibv_query_srq() to libipathverbs which are new in libibverbs.so.2. Note that it layers on top of my previous patch. 
Signed-off-by: Ralph Campbell Index: src/userspace/libipathverbs/src/ipathverbs.h =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.h (old) +++ src/userspace/libipathverbs/src/ipathverbs.h (new) @@ -96,6 +96,10 @@ struct ibv_qp *ipath_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr); +int ipath_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + enum ibv_qp_attr_mask attr_mask, + struct ibv_qp_init_attr *init_attr); + int ipath_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask); @@ -114,6 +118,8 @@ struct ibv_srq_attr *attr, enum ibv_srq_attr_mask attr_mask); +int ipath_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr); + int ipath_destroy_srq(struct ibv_srq *srq); Index: src/userspace/libipathverbs/src/verbs.c =================================================================== --- libipathverbs/src/verbs.c (old) +++ libipathverbs/src/verbs.c (new) @@ -40,7 +40,7 @@ #include #include -#include +#include #include #include @@ -193,6 +193,16 @@ return qp; } +int ipath_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + enum ibv_qp_attr_mask attr_mask, + struct ibv_qp_init_attr *init_attr) +{ + struct ibv_query_qp cmd; + + return ibv_cmd_query_qp(qp, attr, attr_mask, init_attr, + &cmd, sizeof cmd); +} + int ipath_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask) { @@ -244,6 +254,13 @@ return ibv_cmd_modify_srq(srq, attr, attr_mask, &cmd, sizeof cmd); } +int ipath_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr) +{ + struct ibv_query_srq cmd; + + return ibv_cmd_query_srq(srq, attr, &cmd, sizeof cmd); +} + int ipath_destroy_srq(struct ibv_srq *srq) { int ret; From ralphc at pathscale.com Tue Jun 27 15:17:43 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Tue, 27 Jun 2006 15:17:43 -0700 Subject: [openib-general] [PATCH] trivial white space clean up in libipathverbs Message-ID: 
<1151446663.4572.79.camel@brick.pathscale.com> This patch just corrects some white space code conventions. Signed-off-by: Ralph Campbell Index: src/userspace/libipathverbs/src/ipathverbs.h =================================================================== --- src/userspace/libipathverbs/src/ipathverbs.h (old) +++ src/userspace/libipathverbs/src/ipathverbs.h (new) @@ -122,7 +122,6 @@ int ipath_destroy_srq(struct ibv_srq *srq); - struct ibv_ah *ipath_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr); int ipath_destroy_ah(struct ibv_ah *ah); Index: src/userspace/libipathverbs/src/verbs.c =================================================================== --- src/userspace/libipathverbs/src/verbs.c (old) +++ src/userspace/libipathverbs/src/verbs.c (new) @@ -83,11 +83,11 @@ struct ibv_pd *pd; pd = malloc(sizeof *pd); - if(!pd) + if (!pd) return NULL; - if(ibv_cmd_alloc_pd(context, pd, &cmd, sizeof cmd, - &resp, sizeof resp)) { + if (ibv_cmd_alloc_pd(context, pd, &cmd, sizeof cmd, + &resp, sizeof resp)) { free(pd); return NULL; } @@ -232,7 +232,7 @@ int ret; srq = malloc(sizeof *srq); - if(srq == NULL) + if (srq == NULL) return NULL; ret = ibv_cmd_create_srq(pd, srq, attr, &cmd, sizeof cmd, @@ -278,10 +278,10 @@ struct ibv_ah *ah; ah = malloc(sizeof *ah); - if(ah == NULL) + if (ah == NULL) return NULL; - if(ibv_cmd_create_ah(pd, ah, attr)) { + if (ibv_cmd_create_ah(pd, ah, attr)) { free(ah); return NULL; } From sean.hefty at intel.com Tue Jun 27 15:21:05 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 27 Jun 2006 15:21:05 -0700 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: <20060627181827.GD4896@mellanox.co.il> Message-ID: <000101c69a37$fb0deb30$e598070a@amr.corp.intel.com> If a user of the IB CM returns -ENOMEM from their connection callback, simply drop the incoming REQ. Do not send a reject, which should allow the sender to retry the request. This is necessary for SDP to support a backlog. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Sean Hefty --- This is a slightly modified version of the patch. I passed the return code directly to the destroy function for future flexibility, and limited the behavior change to REQ processing only. I ran some basic tests to make sure that this didn't break anything. If this looks okay to you, I can commit this to SVN. Index: cm.c =================================================================== --- cm.c (revision 8224) +++ cm.c (working copy) @@ -702,7 +702,7 @@ static void cm_reset_to_idle(struct cm_i } } -void ib_destroy_cm_id(struct ib_cm_id *cm_id) +static void cm_destroy_id(struct ib_cm_id *cm_id, int err) { struct cm_id_private *cm_id_priv; struct cm_work *work; @@ -736,12 +736,22 @@ retest: sizeof cm_id_priv->av.port->cm_dev->ca_guid, NULL, 0); break; + case IB_CM_REQ_RCVD: + if (err == -ENOMEM) { + /* Do not reject to allow future retries. */ + cm_reset_to_idle(cm_id_priv); + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + } else { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED, + NULL, 0, NULL, 0); + } + break; case IB_CM_MRA_REQ_RCVD: case IB_CM_REP_SENT: case IB_CM_MRA_REP_RCVD: ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); /* Fall through */ - case IB_CM_REQ_RCVD: case IB_CM_MRA_REQ_SENT: case IB_CM_REP_RCVD: case IB_CM_MRA_REP_SENT: @@ -776,6 +786,11 @@ retest: kfree(cm_id_priv->private_data); kfree(cm_id_priv); } + +void ib_destroy_cm_id(struct ib_cm_id *cm_id) +{ + cm_destroy_id(cm_id, 0); +} EXPORT_SYMBOL(ib_destroy_cm_id); int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, @@ -1164,7 +1179,7 @@ static void cm_process_work(struct cm_id } cm_deref_id(cm_id_priv); if (ret) - ib_destroy_cm_id(&cm_id_priv->id); + cm_destroy_id(&cm_id_priv->id, ret); } static void cm_format_mra(struct cm_mra_msg *mra_msg, From mst at mellanox.co.il Tue Jun 27 15:38:26 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 28 Jun 2006 01:38:26 +0300 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060627202103.GA10737@osc.edu> References: <20060627202103.GA10737@osc.edu> Message-ID: <20060627223826.GC5398@mellanox.co.il> Quoting r. Pete Wyckoff : > Subject: Re: max_send_sge < max_sge > > mst at mellanox.co.il wrote on Tue, 27 Jun 2006 09:42 +0300: > > Quoting r. Pete Wyckoff : > > > Is this a known issue? > > > > Yes. The fact that ibv_query_device returns some value in hca_cap can not > > guarantee that ibv_create_qp with these parameters will succeed. For > > example, system administrator might have imposed a limit on the amount of > > memory you can pin down, and you will get ENOMEM. > > I was hoping to get a guaranteed maximum number from ibv_query_device so that > I would know that calls to ibv_create_qp would not fail due to my asking for > too many CQ entries. My code has some idea of how many it wants (16), and > compares that to the hca_cap values to settle for what it can get. I only > happened to notice that 30 wouldn't work even though it was so claimed when > debugging. Ah. I see. Unfortunately I don't think ibv_query_device currently provides this guarantee, and it's not something easy to fix. What are you doing if the hca cap is below the values you want? Also, please see below for ideas about extending the API in a way that might be useful to you. > > > Should I always subtract 1 from the reported max on the send side? Just > > > for this hardware? > > > > Unless you use it, passing the absolute maximum value supported by hardware > > does not seem, to me, to make sense - it will just slow you down, and waste > > resources. Is there a protocol out there that actually has a use for 30 > > sge? > > Perhaps I don't understand what is more resource-costly about using > 29 sge when they are supported by the hardware. Well, more SGEs per WR does mean more resources are consumed for the same number of WRs per QP. OK?
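To make that cost concrete: the QP must reserve room in every send WQE for max_send_sge gather entries, so the per-WR footprint grows with the SGE limit and is typically rounded up to a power-of-two stride. The numbers below are purely illustrative assumptions (a 16-byte gather segment and a 64-byte control header), not Mellanox specifics:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: assume each SGE occupies a 16-byte data segment in
 * the WQE and the control header costs 64 bytes; real sizes are
 * hardware-specific. WQEs are assumed rounded up to a power-of-two
 * stride, as is common for queue layouts. */
size_t wqe_stride(int max_sge)
{
    size_t sz = 64 + (size_t)max_sge * 16;
    size_t stride = 64;
    while (stride < sz)
        stride *= 2;
    return stride;
}
```

With these assumed numbers, 4 SGEs fit a 128-byte stride while 29 SGEs force 1024 bytes per WQE: an 8x memory cost on every posted WR, which is why sizing for the typical SGE count rather than the device maximum matters.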
> I'm using them on the send side to avoid having to either: > 1. memcpy 29 little buffers into one big buffer > or > 2. send 29 rdma writes instead of a single rdma write with 29 sges > The buffer on the receiver is contiguous and big enough to hold > everything. It's the same thing. Seems I'm not being clear. I was just saying that large SGE and WR values have cost so one should use the smallest SGE and WR numbers that still give good performance, not maximum thinkable values. But you probably know this :) > > In my opinion, for the application to be robust it has to either use small > > values that empirically work on most systems, or be able to scale down to > > require less resources if an allocation fails. > > Scale down? So if ibv_create_qp fails, you think I should look at > the return value (which is NULL, not ENOMEM or EINVAL or anything > informative), and then gradually reduce the values for max_recv_sge, > max_send_sge, max_recv_wr, max_send_wr, max_inline_data below the > reported HCA maximum until I find something that works? Well, if there's no bug I see no reason for ibv_create_qp to fail except that you are asking for too many WRs/SGEs. So yes, the trick you describe will work I think. At some point, I tried to think about extending the API in such a way that verbs like ibv_create_qp would round the parameters down to whatever does work. Would something like this be useful to you? Further, if the given SGE/WR pair can't be satisfied, will you want to scale down the number of SGEs or the number of WRs? > I'll subtract 1 from the hca_cap.max_sge for Mellanox hardware > before doing the comparison against how many SGEs I'd like to get. > Otherwise I can't see much alternative to trusting the hca_cap > values that are returned. If this works for you, great. I was just trying to point out query device can not guarantee that QP allocation will always succeed even if you stay within limits it reports.
For example, are you using a large number of WRs per QP as well? If so after alocating a couple of QPs you might run out of locked memory limit allowed per-user, depending on your system setup. QP allocation will then fail, even if you use the hcacap - 1 heuristic. -- MST From mshefty at ichips.intel.com Tue Jun 27 15:40:55 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 27 Jun 2006 15:40:55 -0700 Subject: [openib-general] ucma into kernel.org In-Reply-To: <44A12CEA.6010204@mellanox.co.il> References: <44A00EEF.702@ichips.intel.com> <20060626174117.GA19929@mellanox.co.il> <44A12CEA.6010204@mellanox.co.il> Message-ID: <44A1B3F7.7090504@ichips.intel.com> Tziporet Koren wrote: > These features are needed for uDAPL and were requested by Woody and > Arlin for Intel MPI scalability. > Since in OFED 1.1 we are going to take CMA from kernel 2.6.18 we need > them upstream. > > Can you drive these enhancements only to 2.6.18. I would like these features in OFED 1.1 as well. However, there are no users of those new interfaces in 2.6.18 that would justify their inclusion. I can target userspace support of the RDMA CM for 2.6.19, but I don't think it makes sense to try for 2.6.18. - Sean From mst at mellanox.co.il Tue Jun 27 15:42:57 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 28 Jun 2006 01:42:57 +0300 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: <000101c69a37$fb0deb30$e598070a@amr.corp.intel.com> References: <000101c69a37$fb0deb30$e598070a@amr.corp.intel.com> Message-ID: <20060627224257.GD5398@mellanox.co.il> Quoting r. Sean Hefty : > This is a slightly modified version of the patch. I passed the return > code directly to the destroy function for future flexibility, and > limited the behavior change to REQ processing only. > > I ran some basic tests to make sure that this didn't break anything. > If this looks okay to you, I can commit this to SVN. Looks good to me. 
Please go ahead, then I'll use this in SDP and test this way. -- MST From mshefty at ichips.intel.com Tue Jun 27 15:51:56 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 27 Jun 2006 15:51:56 -0700 Subject: [openib-general] RFC: CMA backlog (was Re: CMA backlog) In-Reply-To: <20060627224257.GD5398@mellanox.co.il> References: <000101c69a37$fb0deb30$e598070a@amr.corp.intel.com> <20060627224257.GD5398@mellanox.co.il> Message-ID: <44A1B68C.9030806@ichips.intel.com> Michael S. Tsirkin wrote: > Looks good to me. Please go ahead, then I'll use this in SDP and test this way. Committed in 8261. From mst at mellanox.co.il Tue Jun 27 15:48:57 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 28 Jun 2006 01:48:57 +0300 Subject: [openib-general] ucma into kernel.org In-Reply-To: <44A1B3F7.7090504@ichips.intel.com> References: <44A1B3F7.7090504@ichips.intel.com> Message-ID: <20060627224857.GE5398@mellanox.co.il> Quoting r. Sean Hefty : > > Can you drive these enhancements only to 2.6.18. > > I would like these features in OFED 1.1 as well. However, there are no users > of those new interfaces in 2.6.18 that would justify their inclusion. I think setting the number of retries and timeout in CMA might be useful for iSER as well. Or, what do you think? -- MST From mst at mellanox.co.il Tue Jun 27 16:04:20 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 28 Jun 2006 02:04:20 +0300 Subject: [openib-general] ucma into kernel.org In-Reply-To: <44A1B3F7.7090504@ichips.intel.com> References: <44A1B3F7.7090504@ichips.intel.com> Message-ID: <20060627230420.GF5398@mellanox.co.il> Quoting r. Sean Hefty : > > Can you drive these enhancements only to 2.6.18. > > I would like these features in OFED 1.1 as well. Would you consider making a git repository available with just the CMA code appropriate for OFED 1.1? Mixing git and SVN code to build OFED is really painful for us. 
-- MST From mshefty at ichips.intel.com Tue Jun 27 16:20:00 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 27 Jun 2006 16:20:00 -0700 Subject: [openib-general] ucma into kernel.org In-Reply-To: <20060627230420.GF5398@mellanox.co.il> References: <44A1B3F7.7090504@ichips.intel.com> <20060627230420.GF5398@mellanox.co.il> Message-ID: <44A1BD20.1090009@ichips.intel.com> Michael S. Tsirkin wrote: > Would you consider making a git repository available with just > the CMA code appropriate for OFED 1.1? Mixing git and SVN code > to build OFED is really painful for us. Sure, I can consider doing that. There would just be some logistics to work out, like the location of the git tree. Would a patch series in Roland's git tree work? Once he returns, we can start queuing up patches for 2.6.19, which could include any or all of the following: userspace support for the RDMA CM iWarp support latest changes for IB (UD QP and multicast) - Sean From mst at mellanox.co.il Tue Jun 27 16:45:31 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 28 Jun 2006 02:45:31 +0300 Subject: [openib-general] ucma into kernel.org In-Reply-To: <44A1BD20.1090009@ichips.intel.com> References: <44A1BD20.1090009@ichips.intel.com> Message-ID: <20060627234531.GG5398@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: ucma into kernel.org > > Michael S. Tsirkin wrote: > > Would you consider making a git repository available with just > > the CMA code appropriate for OFED 1.1? Mixing git and SVN code > > to build OFED is really painful for us. > > Sure, I can consider doing that. There would just be some logistics to work > out, like the location of the git tree. Oh, there's no reason to decide this up front: as I learned, hosting a clone of a git tree is *really* trivial. For example, we can arrange to host a clone of your tree at mellanox.co.il if you like, and let you push there. And it's also trivial to clone and switch to another location whenever you like.
> Would a patch series in Roland's git tree work? You mean a head there, like for-ofed-1.1? Why not. But it does mean you'll need Roland to apply your patches to his tree. > Once he returns, we can start > queuing up patches for 2.6.19, which could include any or all of the following: > > userspace support for the RDMA CM > iWarp support > latest changes for IB (UD QP and multicast) And hopefully the retry/timeout options which started this discussion? :) It is probably best to take whatever is needed in OFED and have a branch with these things, separate from for-2.6.19. -- MST From sean.hefty at intel.com Tue Jun 27 17:21:53 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 27 Jun 2006 17:21:53 -0700 Subject: [openib-general] [PATCH] ib_addr: fix get/set gid alignment issues Message-ID: <000001c69a48$db8e3290$e598070a@amr.corp.intel.com> The device address contains unsigned character arrays, which contain raw GID addresses. The GIDs may not be naturally aligned, so do not cast them to structures or unions. Signed-off-by: Sean Hefty --- This fixes an alignment issue pointed out by Michael when adding MGID support to the ib_addr module.
Index: include/rdma/ib_addr.h =================================================================== --- include/rdma/ib_addr.h (revision 8224) +++ include/rdma/ib_addr.h (working copy) @@ -89,14 +89,16 @@ static inline void ib_addr_set_pkey(stru dev_addr->broadcast[9] = (unsigned char) pkey; } -static inline union ib_gid *ib_addr_get_mgid(struct rdma_dev_addr *dev_addr) +static inline void ib_addr_get_mgid(struct rdma_dev_addr *dev_addr, + union ib_gid *gid) { - return (union ib_gid *) (dev_addr->broadcast + 4); + memcpy(gid, dev_addr->broadcast + 4, sizeof *gid); } -static inline union ib_gid *ib_addr_get_sgid(struct rdma_dev_addr *dev_addr) +static inline void ib_addr_get_sgid(struct rdma_dev_addr *dev_addr, + union ib_gid *gid) { - return (union ib_gid *) (dev_addr->src_dev_addr + 4); + memcpy(gid, dev_addr->src_dev_addr + 4, sizeof *gid); } static inline void ib_addr_set_sgid(struct rdma_dev_addr *dev_addr, @@ -105,9 +107,10 @@ static inline void ib_addr_set_sgid(stru memcpy(dev_addr->src_dev_addr + 4, gid, sizeof *gid); } -static inline union ib_gid *ib_addr_get_dgid(struct rdma_dev_addr *dev_addr) +static inline void ib_addr_get_dgid(struct rdma_dev_addr *dev_addr, + union ib_gid *gid) { - return (union ib_gid *) (dev_addr->dst_dev_addr + 4); + memcpy(gid, dev_addr->dst_dev_addr + 4, sizeof *gid); } static inline void ib_addr_set_dgid(struct rdma_dev_addr *dev_addr, Index: core/ucma_ib.c =================================================================== --- core/ucma_ib.c (revision 8224) +++ core/ucma_ib.c (working copy) @@ -40,27 +40,27 @@ static int ucma_get_paths(struct rdma_cm struct ib_sa_cursor *cursor; struct ib_sa_path_rec *path; struct ib_user_path_rec user_path; - union ib_gid *gid; + union ib_gid gid; int left, ret = 0; u16 pkey; if (!id->device) return -ENODEV; - gid = ib_addr_get_dgid(&id->route.addr.dev_addr); + ib_addr_get_dgid(&id->route.addr.dev_addr, &gid); pkey = ib_addr_get_pkey(&id->route.addr.dev_addr); - cursor = 
ib_create_path_cursor(id->device, id->port_num, gid); + cursor = ib_create_path_cursor(id->device, id->port_num, &gid); if (IS_ERR(cursor)) return PTR_ERR(cursor); - gid = ib_addr_get_sgid(&id->route.addr.dev_addr); + ib_addr_get_sgid(&id->route.addr.dev_addr, &gid); left = *len; *len = 0; for (path = ib_get_next_sa_attr(&cursor); path; path = ib_get_next_sa_attr(&cursor)) { if (pkey == path->pkey && - !memcmp(gid, path->sgid.raw, sizeof *gid)) { + !memcmp(&gid, path->sgid.raw, sizeof gid)) { if (paths) { ib_copy_path_rec_to_user(&user_path, path); if (copy_to_user(paths, &user_path, Index: core/cma.c =================================================================== --- core/cma.c (revision 8224) +++ core/cma.c (working copy) @@ -278,14 +278,14 @@ static void cma_detach_from_dev(struct r static int cma_acquire_ib_dev(struct rdma_id_private *id_priv) { struct cma_device *cma_dev; - union ib_gid *gid; + union ib_gid gid; int ret = -ENODEV; - gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); + ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr, &gid); mutex_lock(&lock); list_for_each_entry(cma_dev, &dev_list, list) { - ret = ib_find_cached_gid(cma_dev->device, gid, + ret = ib_find_cached_gid(cma_dev->device, &gid, &id_priv->id.port_num, NULL); if (!ret) { cma_attach_to_dev(id_priv, cma_dev); @@ -1266,8 +1266,8 @@ static int cma_query_ib_route(struct rdm struct ib_sa_path_rec path_rec; memset(&path_rec, 0, sizeof path_rec); - path_rec.sgid = *ib_addr_get_sgid(addr); - path_rec.dgid = *ib_addr_get_dgid(addr); + ib_addr_get_sgid(addr, &path_rec.sgid); + ib_addr_get_dgid(addr, &path_rec.dgid); path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); path_rec.numb_path = 1; @@ -1326,8 +1326,10 @@ static int cma_resolve_ib_route(struct r goto err1; } + ib_addr_get_sgid(addr, &route->path_rec->sgid); + ib_addr_get_dgid(addr, &route->path_rec->dgid); ret = ib_get_path_rec(id_priv->id.device, id_priv->id.port_num, - ib_addr_get_sgid(addr), ib_addr_get_dgid(addr), +
&route->path_rec->sgid, &route->path_rec->dgid, ib_addr_get_pkey(addr), route->path_rec); if (!ret) { route->num_paths = 1; @@ -1463,7 +1465,7 @@ static int cma_bind_loopback(struct rdma { struct cma_device *cma_dev; struct ib_port_attr port_attr; - union ib_gid *gid; + union ib_gid gid; u16 pkey; int ret; u8 p; @@ -1484,8 +1486,7 @@ static int cma_bind_loopback(struct rdma } port_found: - gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); - ret = ib_get_cached_gid(cma_dev->device, p, 0, gid); + ret = ib_get_cached_gid(cma_dev->device, p, 0, &gid); if (ret) goto out; @@ -1493,6 +1494,7 @@ port_found: if (ret) goto out; + ib_addr_set_sgid(&id_priv->id.route.addr.dev_addr, &gid); ib_addr_set_pkey(&id_priv->id.route.addr.dev_addr, pkey); id_priv->id.port_num = p; cma_attach_to_dev(id_priv, cma_dev); @@ -1539,6 +1541,7 @@ static int cma_resolve_loopback(struct r { struct cma_work *work; struct sockaddr_in *src_in, *dst_in; + union ib_gid gid; int ret; work = kzalloc(sizeof *work, GFP_KERNEL); @@ -1551,8 +1554,8 @@ static int cma_resolve_loopback(struct r goto err; } - ib_addr_set_dgid(&id_priv->id.route.addr.dev_addr, - ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr)); + ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr, &gid); + ib_addr_set_dgid(&id_priv->id.route.addr.dev_addr, &gid); if (cma_zero_addr(&id_priv->id.route.addr.src_addr)) { src_in = (struct sockaddr_in *)&id_priv->id.route.addr.src_addr; @@ -2153,8 +2156,9 @@ static int cma_join_ib_multicast(struct ib_sa_comp_mask comp_mask; int ret; + ib_addr_get_mgid(dev_addr, &rec.mgid); ret = ib_get_mcmember_rec(id_priv->id.device, id_priv->id.port_num, - ib_addr_get_mgid(dev_addr), &rec); + &rec.mgid, &rec); if (ret) return ret; @@ -2163,8 +2167,8 @@ static int cma_join_ib_multicast(struct mc_map[8] = ib_addr_get_pkey(dev_addr) >> 8; mc_map[9] = (unsigned char) ib_addr_get_pkey(dev_addr); - rec.mgid = *(union ib_gid *) (mc_map + 4); - rec.port_gid = *ib_addr_get_sgid(dev_addr); + memcpy(&rec.mgid.raw, 
mc_map + 4, sizeof rec.mgid); + ib_addr_get_sgid(dev_addr, &rec.port_gid); rec.pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr)); rec.join_state = 1; rec.qkey = sin->sin_addr.s_addr; Index: core/ucma.c =================================================================== --- core/ucma.c (revision 8224) +++ core/ucma.c (working copy) @@ -453,10 +453,10 @@ static void ucma_copy_ib_route(struct rd switch (route->num_paths) { case 0: dev_addr = &route->addr.dev_addr; - memcpy(&resp->ib_route[0].dgid, ib_addr_get_dgid(dev_addr), - sizeof(union ib_gid)); - memcpy(&resp->ib_route[0].sgid, ib_addr_get_sgid(dev_addr), - sizeof(union ib_gid)); + ib_addr_get_dgid(dev_addr, + (union ib_gid *) &resp->ib_route[0].dgid); + ib_addr_get_sgid(dev_addr, + (union ib_gid *) &resp->ib_route[0].sgid); resp->ib_route[0].pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr)); break; case 2: From tziporet at mellanox.co.il Tue Jun 27 22:53:43 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Wed, 28 Jun 2006 08:53:43 +0300 Subject: [openib-general] ucma into kernel.org In-Reply-To: <44A1BD20.1090009@ichips.intel.com> References: <44A1B3F7.7090504@ichips.intel.com> <20060627230420.GF5398@mellanox.co.il> <44A1BD20.1090009@ichips.intel.com> Message-ID: <44A21967.7040907@mellanox.co.il> Sean Hefty wrote: > Sure, I can consider doing that. There would just be some logistics > to work out, like the location of the git tree. > > Would a patch series in Roland's git tree work? Once he returns, we > can start queuing up patches for 2.6.19, which could include any or > all of the following: > > userspace support for the RDMA CM > iWarp support > latest changes for IB (UD QP and multicast) > > - Sean > For OFED 1.1 we need only userspace support for the RDMA CM Tziporet From tziporet at mellanox.co.il Tue Jun 27 23:33:05 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Wed, 28 Jun 2006 09:33:05 +0300 Subject: [openib-general] Kernel Oops related to IPoIB (multicast module?) 
In-Reply-To: <44A152B0.3000007@ichips.intel.com> References: <200606261051.12515.jackm@mellanox.co.il> <44A075C2.6060409@ichips.intel.com> <44A1270D.2070109@mellanox.co.il> <44A152B0.3000007@ichips.intel.com> Message-ID: <44A222A1.1020502@mellanox.co.il> Sean Hefty wrote: > > I am working on trying to resolve this as my top priority at the > moment, but I have not been able to reproduce this on my systems. I > want to understand why ib_sa was not unloaded as part of modprobe -r > ib_ipoib, but why ib_multicast apparently was. I will examine the > script that you mentioned, but I typically do not run the OFED release. > > - Sean > No need to run the OFED release, just take the openibd script from https://openib.org/svn/gen2/branches/1.0/ofed/openib/scripts/ and use it: openibd start and openibd stop. In order for it to load/unload modules you also need to have the file openib.conf under the /etc/infiniband directory with this content: # Start HCA driver upon boot ONBOOT=yes # Load MTHCA MTHCA_LOAD=yes # Load IPoIB IPOIB_LOAD=yes Tziporet From erezz at voltaire.com Wed Jun 28 04:41:35 2006 From: erezz at voltaire.com (Erez Zilber) Date: Wed, 28 Jun 2006 14:41:35 +0300 Subject: [openib-general] [PATCH] iser: fix iSER description in Kconfig Message-ID: <44A26AEF.6090204@voltaire.com> An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed...
Name: iser_description.diff URL: From Thomas.Talpey at netapp.com Wed Jun 28 05:36:51 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 28 Jun 2006 08:36:51 -0400 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060627203433.GB10737@osc.edu> References: <20060626215319.GA9291@osc.edu> <20060627064234.GG19300@mellanox.co.il> <7.0.1.0.2.20060627090204.04471ba0@netapp.com> <20060627203433.GB10737@osc.edu> Message-ID: <7.0.1.0.2.20060628082216.04740028@netapp.com> Yep, you're confirming my comment that the sge size is dependent on the memory registration strategy (and not the protocol itself). Because you have a pool approach, you potentially have a lot of discontiguous regions. Therefore, you need more sge's. (You could have the same issue with large preregistrations, etc.) If it's just for RDMA Write, the penalty really isn't that high - you can easily break the i/o up into separate RDMA Write ops and pump them out in a sequence. The HCA streams them, and using unsignalled completion on the WRs means the host overhead can be low. For sends, it's more painful. You have to "pull them up". Do you really need send inlines to be that big? I guess if you're supporting a writev() api over inline you don't have much control, but even writev has a maxiov. The approach the NFS/RDMA client takes is basically to have a pool of dedicated buffers for headers, with a certain amount of space for "small" sends. This maximum inline size is typically 1K or maybe 4K (it's configurable), and it copies send data into them if it fits. All other operations are posted as "chunks", which are explicit protocol objects corresponding to { mr, offset, length } triplets. The protocol supports an arbitrary number of them, but typically 8 is plenty. Each chunk results in an RDMA op from the server. If the server is coded well, the RDMA streams beautifully and there is no bandwidth issue. Just some ideas. I feel your pain. Tom. 
At 04:34 PM 6/27/2006, Pete Wyckoff wrote: >Thomas.Talpey at netapp.com wrote on Tue, 27 Jun 2006 09:06 -0400: >> At 02:42 AM 6/27/2006, Michael S. Tsirkin wrote: >> >Unless you use it, passing the absolute maximum value supported by >> >hardware does >> >not seem, to me, to make sense - it will just slow you down, and waste >> >resources. Is there a protocol out there that actually has a use >for 30 sge? >> >> It's not a protocol thing, it's a memory registration thing. But I agree, >> that's a huge number of segments for send and receive. 2-4 is more >> typical. I'd be interested to know what wants 30 as well... > >This is the OpenIB port of pvfs2: http://www.pvfs.org/pvfs2/download.html >See pvfs2/src/io/bmi/bmi_ib/openib.c for the bottom of the transport >stack. The max_sge-1 aspect I'm complaining about isn't checked in yet. > >It's a file system application. The MPI-IO interface provides >datatypes and file views that let a client write complex subsets of >the in-memory data to a file with a single call. One case that >happens is contiguous-in-file but discontiguous-in-memory, where the >file system client writes data from multiple addresses to a single >region in a file. The application calls MPI_File_write or a >variant, and this complex buffer description filters all the way >down to the OpenIB transport, which then has to figure out how to >get the data to the server. > >These separate data regions may have been allocated all at once >using MPI_Alloc_mem (rarely), or may have been used previously for >file system operations so are already pinned in the registration >cache. Are you implying there is more memory registration work that >has to happen beyond making sure each of the SGE buffers is pinned >and has a valid lkey? > >It would not be a major problem to avoid using more than a couple of >SGEs; however, I didn't see any reason to avoid them. Please let me >know if you see a problem with this approach. 
> > -- Pete From mst at mellanox.co.il Wed Jun 28 05:42:07 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 28 Jun 2006 15:42:07 +0300 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <7.0.1.0.2.20060628082216.04740028@netapp.com> References: <7.0.1.0.2.20060628082216.04740028@netapp.com> Message-ID: <20060628124207.GZ19300@mellanox.co.il> Quoting r. Talpey, Thomas : > Just some ideas. I feel your pain. Is there something that would make life easier for you? -- MST From Thomas.Talpey at netapp.com Wed Jun 28 05:51:57 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 28 Jun 2006 08:51:57 -0400 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060628124207.GZ19300@mellanox.co.il> References: <7.0.1.0.2.20060628082216.04740028@netapp.com> <20060628124207.GZ19300@mellanox.co.il> Message-ID: <7.0.1.0.2.20060628084952.042a2d90@netapp.com> At 08:42 AM 6/28/2006, Michael S. Tsirkin wrote: >Quoting r. Talpey, Thomas : >> Just some ideas. I feel your pain. > >Is there something that would make life easier for you? A work-request-based IBTA1.2/iWARP-compliant FMR implementation. Please. :-) Tom. From pw at osc.edu Wed Jun 28 07:21:21 2006 From: pw at osc.edu (Pete Wyckoff) Date: Wed, 28 Jun 2006 10:21:21 -0400 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060627223826.GC5398@mellanox.co.il> References: <20060627202103.GA10737@osc.edu> <20060627223826.GC5398@mellanox.co.il> Message-ID: <20060628142121.GA11906@osc.edu> mst at mellanox.co.il wrote on Wed, 28 Jun 2006 01:38 +0300: > If this works for you, great. I was just trying to point out query device can > not guarantee that QP allocation will always succeed even if you stay within > limits it reports. > > For example, are you using a large number of WRs per QP as well? If so after > allocating a couple of QPs you might run out of locked memory limit allowed > per-user, depending on your system setup.
QP allocation will then fail, even if > you use the hcacap - 1 heuristic. Thanks for all the comments. I'm not specifically trying to be a pain here. The bit I was failing to notice was that when considering many QP allocations, the resource demands add up faster when using more SGEs each. Still find it odd that the very first QP created can not achieve the maximum-reported values, but understand your general argument. Regarding the API, some interfaces I've seen will do the equivalent of putting the "max currently available" values in ibv_qp_init_attr so userspace can reconsider and try again. I never liked that very much, and it doesn't help much in this multi-dimensional space where WRs and SGEs apparently share the same overall constraints. Plus the returned values aren't guaranteed to be valid next time an attempt is made anyway, so don't do that. :) It may make people realize what's going on faster to get an actual return value somewhere. Right now many failure conditions are lumped into the returned NULL pointer: attr->cap values are bigger than HCA max, a library malloc failed, the HCA is out of new QP resources, the HCA is on fire. That said, an API that returns an explicit error code is clumsy: int ibv_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *qp_init_attr, struct ibv_qp **newqp); struct ibv_qp *qp; int ret = ibv_create_qp(pd, &attr, &qp); if (ret < 0) printf("create qp failed: %s", strerror(-ret)); So I'll have to vote against that bad idea too. It would be possible but odd to store the return code in errno. I.e., use the current API, but augmented to stick the return value in the (thread-private) errno. I'm not sure if I've seen anything outside of libc use errno. Having "ibv_errno" would be icky. Thanks, -- Pete From mst at mellanox.co.il Wed Jun 28 07:37:56 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 28 Jun 2006 17:37:56 +0300 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <7.0.1.0.2.20060628084952.042a2d90@netapp.com> References: <7.0.1.0.2.20060628084952.042a2d90@netapp.com> Message-ID: <20060628143755.GC19300@mellanox.co.il> Quoting r. Talpey, Thomas : > Subject: Re: max_send_sge < max_sge > > At 08:42 AM 6/28/2006, Michael S. Tsirkin wrote: > >Quoting r. Talpey, Thomas : > >> Just some ideas. I feel your pain. > > > >Is there something that would make life easier for you? > > A work-request-based IBTA1.2/iWARP-compliant FMR implementation. Hmm. Not an easy one :) Just to clarify: what feature exactly do you want to use? The spec has 3 relevant compliance statements as far as I can see. With respect to fast registration: o10-37.2.6: If the HCA supports the Base Memory Management Extensions, the Fast Registration must take place before any subsequent Work Request on the same Send Queue is started. It looks like Fast Registration can bypass previous work requests, so the existing FMR implementation already has this property, I think. So I am guessing that what you want is one of the o10-37.2.19: Relaxed ordered, o10-37.2.20: Local Invalidate Fencing or o10-37.2.21: Send with Invalidate. Which one is it then? -- MST From mst at mellanox.co.il Wed Jun 28 07:51:02 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 28 Jun 2006 17:51:02 +0300 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060628142121.GA11906@osc.edu> References: <20060628142121.GA11906@osc.edu> Message-ID: <20060628145102.GD19300@mellanox.co.il> Quoting r. Pete Wyckoff : > Subject: Re: max_send_sge < max_sge > > mst at mellanox.co.il wrote on Wed, 28 Jun 2006 01:38 +0300: > > If this works for you, great. I was just trying to point out query device > > can not guarantee that QP allocation will always succeed even if you stay > > within limits it reports. > > > > For example, are you using a large number of WRs per QP as well?
If so > > after alocating a couple of QPs you might run out of locked memory limit > > allowed per-user, depending on your system setup. QP allocation will then > > fail, even if you use the hcacap - 1 heuristic. > > Thanks for all the comments. I'm not specifically trying to be a > pain here. The bit I was failing to notice was that when > considering many QP allocations, the resource demands add up faster > when using more SGEs each. Still find it odd that the very first > QP created can not achieve the maximum-reported values, but > understand your general argument. Yea, that's because the API only can report 1 max value. But when this was considered the concensus was its not worth extending the API because of the other issues you mention. > Regarding the API, some interfaces I've seen will do the equivalent > of putting the "max currently available" values in ibv_qp_init_attr > so userspace can reconsider and try again. I never liked that very > much, and it doesn't help much in this multi-dimensional space where > WRs and SGEs apparently share the same overall constraints. Plus > the returned values aren't guaranteed to be valid next time an > attempt is made anyway, so don't do that. :) Yep. We could have an option to have the stack scale the requested values down to some legal set instead of failing an allocation. But we couldn't come up with a clean way to tell the stack e.g. what should it round down: the SGE or WR value. Do you think selecting something arbitrarily might still be a good idea? So in the end we are back to either using low numbers that just work empirically, or starting with some value and going down till it succeeds. -- MST From caitlinb at broadcom.com Wed Jun 28 09:52:35 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 28 Jun 2006 09:52:35 -0700 Subject: [openib-general] max_send_sge < max_sge Message-ID: <54AD0F12E08D1541B826BE97C98F99F15F5B4D@NT-SJCA-0751.brcm.ad.broadcom.com> > > Yep. 
We could have an option to have the stack scale the > requested values down to some legal set instead of failing an > allocation. But we couldn't come up with a clean way to tell > the stack e.g. what should it round down: the SGE or WR > value. Do you think selecting something arbitrarily might still be a > good idea? > Having a "query only" option might help here. With this size SGE, what is the largest number of SGEs that I could currently get? (but don't actually allocate that yet) From mst at mellanox.co.il Wed Jun 28 10:14:28 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 28 Jun 2006 20:14:28 +0300 Subject: [openib-general] [PATCH -stable] IB/mthca: restore missing PCI registers after reset Message-ID: <20060628171428.GF19300@mellanox.co.il> Hello, stable team! The pull of the following fix was requested by Roland Dreier just a couple of days before 2.6.17 came out, and so it seems it missed 2.6.17 by a narrow margin: http://lkml.org/lkml/2006/6/13/164 It is now upstream: http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=13aa6ecb47990cfc78e20e347fdd3f1df6189426 As I hear from users about systems where mthca does not work at all without this patch, please consider it for -stable. Note: Roland Dreier is currently unavailable, and said he will be for a while. I am assuming since he ACKed this for 2.6.17 it's good for -stable as well as far as he's concerned. --- mthca does not restore the following PCI-X/PCI Express registers after reset: PCI-X device: PCI-X command register PCI-X bridge: upstream and downstream split transaction registers PCI Express : PCI Express device control and link control registers This causes instability and/or bad performance on systems where one of these registers is set to a non-default value by BIOS. Signed-off-by: Michael S.
Tsirkin diff --git a/drivers/infiniband/hw/mthca/mthca_reset.c b/drivers/infiniband/hw/mthca/mthca_reset.c index df5e494..f4fddd5 100644 --- a/drivers/infiniband/hw/mthca/mthca_reset.c +++ b/drivers/infiniband/hw/mthca/mthca_reset.c @@ -49,6 +49,12 @@ int mthca_reset(struct mthca_dev *mdev) u32 *hca_header = NULL; u32 *bridge_header = NULL; struct pci_dev *bridge = NULL; + int bridge_pcix_cap = 0; + int hca_pcie_cap = 0; + int hca_pcix_cap = 0; + + u16 devctl; + u16 linkctl; #define MTHCA_RESET_OFFSET 0xf0010 #define MTHCA_RESET_VALUE swab32(1) @@ -110,6 +116,9 @@ #define MTHCA_RESET_VALUE swab32(1) } } + hca_pcix_cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + hca_pcie_cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (bridge) { bridge_header = kmalloc(256, GFP_KERNEL); if (!bridge_header) { @@ -129,6 +138,13 @@ #define MTHCA_RESET_VALUE swab32(1) goto out; } } + bridge_pcix_cap = pci_find_capability(bridge, PCI_CAP_ID_PCIX); + if (!bridge_pcix_cap) { + err = -ENODEV; + mthca_err(mdev, "Couldn't locate HCA bridge " + "PCI-X capability, aborting.\n"); + goto out; + } } /* actually hit reset */ @@ -178,6 +194,20 @@ #define MTHCA_RESET_VALUE swab32(1) good: /* Now restore the PCI headers */ if (bridge) { + if (pci_write_config_dword(bridge, bridge_pcix_cap + 0x8, + bridge_header[(bridge_pcix_cap + 0x8) / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge Upstream " + "split transaction control, aborting.\n"); + goto out; + } + if (pci_write_config_dword(bridge, bridge_pcix_cap + 0xc, + bridge_header[(bridge_pcix_cap + 0xc) / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge Downstream " + "split transaction control, aborting.\n"); + goto out; + } /* * Bridge control register is at 0x3e, so we'll * naturally restore it last in this loop. 
@@ -203,6 +233,35 @@ good: } } + if (hca_pcix_cap) { + if (pci_write_config_dword(mdev->pdev, hca_pcix_cap, + hca_header[hca_pcix_cap / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA PCI-X " + "command register, aborting.\n"); + goto out; + } + } + + if (hca_pcie_cap) { + devctl = hca_header[(hca_pcie_cap + PCI_EXP_DEVCTL) / 4]; + if (pci_write_config_word(mdev->pdev, hca_pcie_cap + PCI_EXP_DEVCTL, + devctl)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA PCI Express " + "Device Control register, aborting.\n"); + goto out; + } + linkctl = hca_header[(hca_pcie_cap + PCI_EXP_LNKCTL) / 4]; + if (pci_write_config_word(mdev->pdev, hca_pcie_cap + PCI_EXP_LNKCTL, + linkctl)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA PCI Express " + "Link control register, aborting.\n"); + goto out; + } + } + for (i = 0; i < 16; ++i) { if (i * 4 == PCI_COMMAND) continue; -- MST From bugzilla-daemon at openib.org Wed Jun 28 12:17:33 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Wed, 28 Jun 2006 12:17:33 -0700 (PDT) Subject: [openib-general] [Bug 159] New: OFED1.0: Missing interfaces Message-ID: <20060628191733.B7E3922873F@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=159 Summary: OFED1.0: Missing interfaces Product: OpenFabrics Linux Version: gen2 Platform: Other OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Verbs AssignedTo: bugzilla at openib.org ReportedBy: venkatesh.babu at 3leafnetworks.com I was looking for the Gen2 equivalent of the Gen1 Access Layer interfaces tsIbInServiceNoticeHandler() and ib_cm_path_migrate(), but I could not find any that provide similar functionality. I was wondering if you can give any comments on why it was omitted in Gen2 and/or if there are any plans of implementing it in future releases. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From pw at osc.edu Wed Jun 28 12:29:39 2006 From: pw at osc.edu (Pete Wyckoff) Date: Wed, 28 Jun 2006 15:29:39 -0400 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060628145102.GD19300@mellanox.co.il> References: <20060628142121.GA11906@osc.edu> <20060628145102.GD19300@mellanox.co.il> Message-ID: <20060628192939.GA12298@osc.edu> mst at mellanox.co.il wrote on Wed, 28 Jun 2006 17:51 +0300: > Yea, that's because the API only can report 1 max value. But when > this was considered the concensus was its not worth extending the API > because of the other issues you mention. Maybe you should report min(max_recv_sge, max_send_sge) instead of max(). In this case I don't care because I currently need fewer SGEs than either limit. I'm just worried you're going to get the same complaint by newbie IB users later. > Yep. We could have an option to have the stack scale the requested values down > to some legal set instead of failing an allocation. But we couldn't come up > with a clean way to tell the stack e.g. what should it round down: the SGE or > WR value. Do you think selecting something arbitrarily might still be a good > idea? No. If I get fewer WRs than requested, the app would break. If I get fewer SGs, things would work for this particular app with some more infrastructure to check for that, but I don't see how that could be a general rule. I like the model where the app can provision itself by querying the NIC before opening any QPs, then get the same settings for every QP, until the maximum number of QPs is reached. We already have a way in PVFS2 to close "idle" connections, but it isn't hooked up into QP allocation failure yet. I prefer to do that than to limp along on certain connections with fewer WRs or SGs, along with all the code that would have to be added to handle that situation. > So in the end we are back to either using low numbers that just work > empirically, or starting with some value and going down till it succeeds. Yep. 
Thanks for the insight. It'll be fun when I try to get this to work on amso with only 4 SGEs per QP. -- Pete From Thomas.Talpey at netapp.com Wed Jun 28 12:31:34 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 28 Jun 2006 15:31:34 -0400 Subject: [openib-general] max_send_sge < max_sge In-Reply-To: <20060628145102.GD19300@mellanox.co.il> References: <20060628142121.GA11906@osc.edu> <20060628145102.GD19300@mellanox.co.il> Message-ID: <7.0.1.0.2.20060628152709.04471ce8@netapp.com> At 10:51 AM 6/28/2006, Michael S. Tsirkin wrote: >Yep. We could have an option to have the stack scale the requested values down >to some legal set instead of failing an allocation. But we couldn't come up >with a clean way to tell the stack e.g. what should it round down: the SGE or >WR value. Do you think selecting something arbitrarily might still be a good >idea? No! Well, not as the default. Otherwise, the consumer has to go back and check what happened even on success, which is a royal pain and highly inefficient. Maybe we should pass in an optional attribute structure, that is returned with the granted attributes on success, or the would-have-been attributes on failure? Tom. From bugzilla-daemon at openib.org Wed Jun 28 12:55:00 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Wed, 28 Jun 2006 12:55:00 -0700 (PDT) Subject: [openib-general] [Bug 160] New: OFED1.0: ib_modify_qp() of RC QP fails with -EINVAL Message-ID: <20060628195500.DE33A22873F@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=160 Summary: OFED1.0: ib_modify_qp() of RC QP fails with -EINVAL Product: OpenFabrics Linux Version: gen2 Platform: All OS/Version: All Status: NEW Severity: major Priority: P2 Component: Verbs AssignedTo: bugzilla at openib.org ReportedBy: venkatesh.babu at 3leafnetworks.com I have created an RC QP and am establishing a connection with a remote RC QP using the interfaces defined in ib_cm.h.
I was loading the alternate_path before calling ib_send_cm_req(). To transition the RC QP to IB_QPS_RTR state, I was calling ib_cm_init_qp_attr() to initialize the struct ib_qp_attr and calling ib_modify_qp(). It failed with -EINVAL. I found that this problem is due to a bug in ib_cm_init_qp_attr() which was not initializing the struct ib_qp_attr fields correctly. I made the following changes in cm_init_qp_rtr_attr() of openib-1.0/src/linux-kernel/infiniband/core/cm.c if (cm_id_priv->alt_av.ah_attr.dlid) { *qp_attr_mask |= IB_QP_ALT_PATH; + qp_attr->alt_port_num = + cm_id_priv->alt_av.port->port_num; qp_attr->alt_ah_attr = cm_id_priv->alt_av.ah_attr; } With this patch ib_modify_qp() worked fine and I was able to establish the connection with the remote RC QP. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Wed Jun 28 15:52:59 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Wed, 28 Jun 2006 15:52:59 -0700 (PDT) Subject: [openib-general] [Bug 160] OFED1.0: ib_modify_qp() of RC QP fails with -EINVAL Message-ID: <20060628225259.3BFC922873F@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=160 ------- Comment #1 from sean.hefty at intel.com 2006-06-28 15:52 ------- Thanks for the info. I have committed the fix to SVN revision 8267. The OFED release will need to be updated separately. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
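The effect of the committed fix above can be seen in a reduced model of the structures involved. This is a sketch with abbreviated, hypothetical type names (ah_attr, cm_av, qp_attr, and QP_ALT_PATH stand in for the kernel's ib_ah_attr, cm_av, ib_qp_attr, and IB_QP_ALT_PATH), and it flattens the real alt_av.port->port_num indirection into a plain field; it is not the cm.c code itself.

```c
#include <assert.h>
#include <stdint.h>

#define QP_ALT_PATH 0x1  /* hypothetical stand-in for IB_QP_ALT_PATH */

struct ah_attr { uint16_t dlid; };
struct cm_av   { struct ah_attr ah_attr; uint8_t port_num; };
struct qp_attr { uint8_t alt_port_num; struct ah_attr alt_ah_attr; };

/* Mirrors the fixed branch of cm_init_qp_rtr_attr(): when an alternate
 * path has been loaded (dlid != 0), the alternate port number must be
 * filled in along with the address handle attributes. */
static void init_alt_path(const struct cm_av *alt_av,
                          struct qp_attr *qp_attr, int *qp_attr_mask)
{
    if (alt_av->ah_attr.dlid) {
        *qp_attr_mask |= QP_ALT_PATH;
        qp_attr->alt_port_num = alt_av->port_num;  /* the added line */
        qp_attr->alt_ah_attr = alt_av->ah_attr;
    }
}
```

Without the marked line, the mask still requests the alternate path while alt_port_num is left uninitialized — per the report above, exactly the inconsistency ib_modify_qp() rejected with -EINVAL.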
From mshefty at ichips.intel.com Wed Jun 28 16:24:05 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 28 Jun 2006 16:24:05 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: References: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> Message-ID: <44A30F95.2050408@ichips.intel.com> Roland Dreier wrote: >>I suggest the following design: the CMA would replace the event handler >>provided with the qp_init_attr struct with a callback of its own and >>keep the original handler/context on a private structure. > > > This is probably fine. There is one further situation where the > connection needs to be established, beyond RTU and the communication > established async event. Namely, if a receive completion is polled. > Since async events are, well, asynchronous, there's no guarantee that > the communication established event will be reported any time soon... This brings up a good point. Even if a user gets a communication established event, the IB CM could have already timed out and failed the connection. I don't think that we can do anything about this. I should also point out that the proposed design will not work for userspace. I'm hesitant to make this change until a solution for userspace can also be found, in the hope that a common fix can be shared. - Sean From bos at pathscale.com Wed Jun 28 16:54:53 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Jun 2006 16:54:53 -0700 Subject: [openib-general] ipath patch series a-comin', but no IB maintainer to shepherd them Message-ID: <1151538893.13430.43.camel@obsidian> Hi, Andrew - I have a pile of patches for the ipath driver that I'd like to get in during the "open season" window. Roland has his hands full with diapers and other sprog paraphernalia as of a few days ago, so I doubt he'll see this message soon, much less care about the patches. 
Given Roland's presumed unavailability, would the appropriate thing be to drop the patches into -mm and then push them along to Linus, or what? From akpm at osdl.org Wed Jun 28 17:13:18 2006 From: akpm at osdl.org (Andrew Morton) Date: Wed, 28 Jun 2006 17:13:18 -0700 Subject: [openib-general] ipath patch series a-comin', but no IB maintainer to shepherd them In-Reply-To: <1151538893.13430.43.camel@obsidian> References: <1151538893.13430.43.camel@obsidian> Message-ID: <20060628171318.7d97d617.akpm@osdl.org> "Bryan O'Sullivan" wrote: > > Hi, Andrew - > > I have a pile of patches for the ipath driver that I'd like to get in > during the "open season" window. Roland has his hands full with diapers > and other sprog paraphernalia as of a few days ago, so I doubt he'll see > this message soon, much less care about the patches. > > Given Roland's presumed unavailability, would the appropriate thing be > to drop the patches into -mm and then push them along to Linus, or what? > We can do that, sure. Please cc openib and lkml and netdev and whatever-else-you-can-think of when you send them over. From mst at mellanox.co.il Wed Jun 28 22:33:01 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 29 Jun 2006 08:33:01 +0300 Subject: [openib-general] [Bug 160] OFED1.0: ib_modify_qp() of RC QP fails with -EINVAL In-Reply-To: <20060628225259.3BFC922873F@openib.ca.sandia.gov> References: <20060628225259.3BFC922873F@openib.ca.sandia.gov> Message-ID: <20060629053301.GA5127@mellanox.co.il> Quoting r. bugzilla-daemon at openib.org : > Subject: [openib-general] [Bug 160] OFED1.0: ib_modify_qp() of RC QP fails with -EINVAL > > http://openib.org/bugzilla/show_bug.cgi?id=160 > > > > > > ------- Comment #1 from sean.hefty at intel.com 2006-06-28 15:52 ------- > Thanks for the info. I have committed the fix to SVN revision 8267. The OFED > release will need to be updated separately. OFED is tracking 2.6.18 so to get things there they need to be submitted to Roland's for-2.6.18 tree. -- MST From mst at mellanox.co.il Wed Jun 28 22:45:24 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Thu, 29 Jun 2006 08:45:24 +0300 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: References: <000001c690c7$bd5231d0$62268686@amr.corp.intel.com> Message-ID: <20060629054524.GC5127@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: design for communication established affiliated asynchronous event handling > > >I suggest the following design: the CMA would replace the event handler > >provided with the qp_init_attr struct with a callback of its own and > >keep the original handler/context on a private structure. > > This is probably fine. There is one further situation where the > connection needs to be established, beyond RTU and the communication > established async event. Namely, if a receive completion is polled. > Since async events are, well, asynchronous, there's no guarantee that > the communication established event will be reported any time soon... How about user taking this into account and not arming the CQ / not polling it until the established event? -- MST From sean.hefty at intel.com Wed Jun 28 22:50:38 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 28 Jun 2006 22:50:38 -0700 Subject: [openib-general] [Bug 160] OFED1.0: ib_modify_qp() of RC QP fails with -EINVAL In-Reply-To: <20060629053301.GA5127@mellanox.co.il> Message-ID: <000001c69b3f$f3045470$e7d8180a@amr.corp.intel.com> >OFED is tracking 2.6.18 so to get things there they need to be submitted to >Roland's for-2.6.18 tree. I downloaded Linus' latest tree today, and will submit a patch tomorrow. 
- Sean From sean.hefty at intel.com Wed Jun 28 22:52:28 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 28 Jun 2006 22:52:28 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <20060629054524.GC5127@mellanox.co.il> Message-ID: <000101c69b40$3463b640$e7d8180a@amr.corp.intel.com> >How about user taking this into account and not arming the CQ / >not polling it until the established event? The CQ could be in use by other QPs. - Sean From halr at voltaire.com Thu Jun 29 04:10:53 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Jun 2006 07:10:53 -0400 Subject: [openib-general] [PATCH][MINOR] OpenSM/osm_inform.c: In __dump_all_informs, don't scan inform list unless logging has the debug level turned on Message-ID: <1151579451.4541.55584.camel@hal.voltaire.com> OpenSM/osm_inform.c: In __dump_all_informs, don't scan inform list unless logging has the debug level turned on Signed-off-by: Hal Rosenstock Index: opensm/osm_inform.c =================================================================== --- opensm/osm_inform.c (revision 8274) +++ opensm/osm_inform.c (working copy) @@ -179,6 +179,9 @@ __dump_all_informs( OSM_LOG_ENTER( p_log, __dump_all_informs ); + if( !
osm_log_is_active( p_log, OSM_LOG_DEBUG ) ) + goto Exit; + p_list_item = cl_qlist_head( &p_subn->sa_infr_list ); while (p_list_item != cl_qlist_end( &p_subn->sa_infr_list )) { @@ -188,6 +191,7 @@ __dump_all_informs( p_list_item = cl_qlist_next( p_list_item ); } + Exit: OSM_LOG_EXIT( p_log ); } From bugzilla-daemon at openib.org Thu Jun 29 07:34:34 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Thu, 29 Jun 2006 07:34:34 -0700 (PDT) Subject: [openib-general] [Bug 163] New: ibv_ack_async_event seg-fault when requested event is SRQ limit Message-ID: <20060629143434.B011022873F@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=163 Summary: ibv_ack_async_event seg-fault when requested event is SRQ limit Product: OpenFabrics Linux Version: gen2 Platform: Other OS/Version: Other Status: NEW Severity: blocker Priority: P2 Component: IB Core AssignedTo: bugzilla at openib.org ReportedBy: amip at mellanox.co.il CC: ziv at mellanox.co.il the event->element.srq returned from read() in ibv_get_async_event() is NULL ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From sean.hefty at intel.com Thu Jun 29 10:07:38 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 29 Jun 2006 10:07:38 -0700 Subject: [openib-general] ipath patch series a-comin', but no IB maintainer to shepherd them In-Reply-To: <20060629163857.GT19300@mellanox.co.il> Message-ID: <000001c69b9e$86268fd0$8698070a@amr.corp.intel.com> >This currently includes a single patch from Venkatesh Babu: > IB/core: Set alternate port number when initializing QP attributes. > >that has been checked into openib svn by Sean. Thanks Michael. I will assume that you will push this change in through Roland when he's back. 
- Sean From mshefty at ichips.intel.com Thu Jun 29 09:48:40 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 29 Jun 2006 09:48:40 -0700 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: References: Message-ID: <44A40468.9070600@ichips.intel.com> Rimmer, Todd wrote: > The CM would open the CA, provide its async event callback routine and > perform a special register_cm() verbs call. Of course most CM traffic > would occur on the GSI QP, so this open CA instance was only for this > purpose. This special verb was only available in kernel space (avoiding > security issue of application stealing CM interface and because our CM > was in the kernel anyway). Thanks for the info. I'm considering this sort of approach. - Sean From mst at mellanox.co.il Thu Jun 29 09:38:57 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 29 Jun 2006 19:38:57 +0300 Subject: [openib-general] ipath patch series a-comin', but no IB maintainer to shepherd them In-Reply-To: <20060628171318.7d97d617.akpm@osdl.org> References: <20060628171318.7d97d617.akpm@osdl.org> Message-ID: <20060629163857.GT19300@mellanox.co.il> Quoting r. Andrew Morton : > > Hi, Andrew - > > > > I have a pile of patches for the ipath driver that I'd like to get in > > during the "open season" window. Roland has his hands full with diapers > > and other sprog paraphernalia as of a few days ago, so I doubt he'll see > > this message soon, much less care about the patches. > > > > Given Roland's presumed unavailability, would the appropriate thing be > > to drop the patches into -mm and then push them along to Linus, or what? > > > > We can do that, sure. Please cc openib and lkml and netdev and > whatever-else-you-can-think of when you send them over. Yes, -mm seems like a good way to get more review. 
Further, in the hope that this will help keep things reasonably stable till Roland comes back, and help everyone see what's being merged, I have created a git branch for all things infiniband going into 2.6.18. You can get at it here: git://www.mellanox.co.il/~git/infiniband mst-for-2.6.18 This currently includes a single patch from Venkatesh Babu: IB/core: Set alternate port number when initializing QP attributes. that has been checked into openib svn by Sean. Please Cc me on infiniband patches that are going to be merged and I'll do my best to compile, test and if it works put them there. If everyone does this, I also hope this will help Roland when he's back to figure out where things stand. Thanks, -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From halr at voltaire.com Thu Jun 29 08:19:03 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Jun 2006 11:19:03 -0400 Subject: [openib-general] [PATCH] [TRIVIAL] OpenSM/Remote SM: Eliminate some unneeded status checking Message-ID: <1151593785.4541.65570.camel@hal.voltaire.com> OpenSM/Remote SM: Eliminate some unneeded status checking Since osm_remote_sm.c:osm_remote_sm_init cannot fail, don't return any status and don't check it in any consumers of this API Signed-off-by: Hal Rosenstock Index: include/opensm/osm_remote_sm.h =================================================================== --- include/opensm/osm_remote_sm.h (revision 8288) +++ include/opensm/osm_remote_sm.h (working copy) @@ -188,7 +188,7 @@ osm_remote_sm_destroy( * * SYNOPSIS */ -ib_api_status_t +void osm_remote_sm_init( IN osm_remote_sm_t* const p_sm, IN const osm_port_t* const p_port, @@ -205,7 +205,7 @@ osm_remote_sm_init( * [in] Pointer to the SMInfo attribute for this SM. * * RETURN VALUES -* IB_SUCCESS if the SM object was initialized successfully. +* This function does not return a value. * * NOTES * Allows calling other Remote SM methods.
Index: opensm/osm_remote_sm.c =================================================================== --- opensm/osm_remote_sm.c (revision 8277) +++ opensm/osm_remote_sm.c (working copy) @@ -74,7 +74,7 @@ osm_remote_sm_destroy( /********************************************************************** **********************************************************************/ -ib_api_status_t +void osm_remote_sm_init( IN osm_remote_sm_t* const p_sm, IN const osm_port_t* const p_port, @@ -87,5 +87,5 @@ osm_remote_sm_init( p_sm->p_port = p_port; p_sm->smi = *p_smi; - return( IB_SUCCESS ); + return; } Index: opensm/osm_sminfo_rcv.c =================================================================== --- opensm/osm_sminfo_rcv.c (revision 8287) +++ opensm/osm_sminfo_rcv.c (working copy) @@ -568,7 +568,6 @@ __osm_sminfo_rcv_process_get_response( osm_port_t* p_port; ib_net64_t port_guid; osm_remote_sm_t* p_sm; - ib_api_status_t status; osm_signal_t process_get_sm_ret_val = OSM_SIGNAL_NONE; OSM_LOG_ENTER( p_rcv->p_log, __osm_sminfo_rcv_process_get_response ); @@ -647,15 +646,7 @@ __osm_sminfo_rcv_process_get_response( goto Exit; } - status = osm_remote_sm_init( p_sm, p_port, p_smi ); - if( status != IB_SUCCESS ) - { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_sminfo_rcv_process_get_response: ERR 2F15: " - "Other SM object initialization failed (%s)\n", - ib_get_err_str( status ) ); - goto Exit; - } + osm_remote_sm_init( p_sm, p_port, p_smi ); cl_qmap_insert( p_sm_tbl, port_guid, &p_sm->map_item ); } From trimmer at silverstorm.com Thu Jun 29 05:48:25 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Thu, 29 Jun 2006 08:48:25 -0400 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <44A30F95.2050408@ichips.intel.com> Message-ID: > -----Original Message----- > From: openib Sean Hefty > Sent: Wednesday, June 28, 2006 7:24 PM > > Roland Dreier wrote: > >>I suggest the following design: the CMA would replace the 
event handler > >>provided with the qp_init_attr struct with a callback of its own and > >>keep the original handler/context on a private structure. > > I should also point out that the proposed design will not work for > userspace. > I'm hesitant to make this change until a solution for userspace can also > be > found, in the hope that a common fix can be shared. > > - Sean The approach we took in our proprietary stack was to provide a verbs driver interface for the CM to register itself with the verbs driver. The CM would open the CA, provide its async event callback routine and perform a special register_cm() verbs call. Of course most CM traffic would occur on the GSI QP, so this open CA instance was only for this purpose. This special verb was only available in kernel space (avoiding security issue of application stealing CM interface and because our CM was in the kernel anyway). When the CA got an Async Event for a Communication Established event, it would deliver it to both the CM (regardless of which QP it was for) and to the open instance owning the QP. All other async events were only delivered to the appropriate open instance. This put the handling in the kernel and at a low level where it would not impact handling of other async events and avoided complications of user vs kernel async event filters. Depending on the design of APM, the CM might also be interested in APM related Async Events (in our design the application had an opportunity to select a new alternate path, so it was more appropriate to let the ULP handle these events directly). 
Todd Rimmer From halr at voltaire.com Thu Jun 29 08:09:33 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Jun 2006 11:09:33 -0400 Subject: [openib-general] [PATCHv2] OpenSM/osm_lid_mgr.c: Support enhanced switch port 0 for LMC > 0 Message-ID: <1151593772.4541.65566.camel@hal.voltaire.com> OpenSM/osm_lid_mgr.c: Support enhanced switch port 0 for LMC > 0 Base port 0 is constrained to have LMC of 0 whereas enhanced switch port 0 is not. Enhanced switch port 0 is handled more like CA and router ports in terms of LMC. Signed-off-by: Hal Rosenstock Index: opensm/osm_lid_mgr.c =================================================================== --- opensm/osm_lid_mgr.c (revision 8277) +++ opensm/osm_lid_mgr.c (working copy) @@ -94,6 +94,7 @@ #include #include #include +#include #include #include #include @@ -351,6 +352,8 @@ __osm_lid_mgr_init_sweep( osm_lid_mgr_range_t *p_range = NULL; osm_port_t *p_port; cl_qmap_t *p_port_guid_tbl; + osm_switch_t *p_sw; + ib_switch_info_t *p_si; uint8_t lmc_num_lids = (uint8_t)(1 << p_mgr->p_subn->opt.lmc); uint16_t lmc_mask; uint16_t req_lid, num_lids; @@ -436,7 +439,20 @@ __osm_lid_mgr_init_sweep( IB_NODE_TYPE_SWITCH ) num_lids = lmc_num_lids; else - num_lids = 1; + { + /* Determine if enhanced switch port 0 */ + p_sw = osm_get_switch_by_guid(p_mgr->p_subn, + osm_node_get_node_guid(osm_port_get_parent_node(p_port))); + if (p_sw && (p_si = osm_switch_get_si_ptr(p_sw)) && + ib_switch_info_is_enhanced_port0(p_si)) + { + num_lids = lmc_num_lids; + } + else + { + num_lids = 1; + } + } if ((num_lids != 1) && (((db_min_lid & lmc_mask) != db_min_lid) || @@ -539,7 +555,18 @@ __osm_lid_mgr_init_sweep( } else { - num_lids = 1; + /* Determine if enhanced switch port 0 */ + p_sw = osm_get_switch_by_guid(p_mgr->p_subn, + osm_node_get_node_guid(osm_port_get_parent_node(p_port))); + if (p_sw && (p_si = osm_switch_get_si_ptr(p_sw)) && + ib_switch_info_is_enhanced_port0(p_si)) + { + num_lids = lmc_num_lids; + } + else + { + num_lids = 1; + } }
/* Make sure the lid is aligned */ @@ -798,6 +825,8 @@ __osm_lid_mgr_get_port_lid( uint8_t num_lids = (1 << p_mgr->p_subn->opt.lmc); int lid_changed = 0; uint16_t lmc_mask; + osm_switch_t *p_sw; + ib_switch_info_t *p_si; OSM_LOG_ENTER( p_mgr->p_log, __osm_lid_mgr_get_port_lid ); @@ -809,10 +838,19 @@ __osm_lid_mgr_get_port_lid( /* get the lid from the guid2lid */ guid = cl_ntoh64( osm_port_get_guid( p_port ) ); - /* if the port is a switch then we only need one lid */ + /* if the port is a switch with base switch port 0 then we only need one lid */ if( osm_node_get_type( osm_port_get_parent_node( p_port ) ) == IB_NODE_TYPE_SWITCH ) - num_lids = 1; + { + /* Determine if base switch port 0 */ + p_sw = osm_get_switch_by_guid(p_mgr->p_subn, + osm_node_get_node_guid(osm_port_get_parent_node(p_port))); + if (p_sw && (p_si = osm_switch_get_si_ptr(p_sw)) && + !ib_switch_info_is_enhanced_port0(p_si)) + { + num_lids = 1; + } + } /* if the port matches the guid2lid */ if (!osm_db_guid2lid_get( p_mgr->p_g2l, guid, &min_lid, &max_lid)) From halr at voltaire.com Thu Jun 29 08:15:49 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Jun 2006 11:15:49 -0400 Subject: [openib-general] [PATCH] OpenSM/osm_sa_portinfo_record.c: Support enhanced switch port 0 for LMC > 0 Message-ID: <1151593780.4541.65568.camel@hal.voltaire.com> OpenSM/osm_sa_portinfo_record.c: Support enhanced switch port 0 for LMC > 0 In __osm_sa_pir_create, handle enhanced switch port 0 (and the possibility that it's LMC > 0) Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_portinfo_record.c =================================================================== --- opensm/osm_sa_portinfo_record.c (revision 8277) +++ opensm/osm_sa_portinfo_record.c (working copy) @@ -60,6 +60,7 @@ #include #include #include +#include #include #include #include @@ -197,24 +198,34 @@ __osm_sa_pir_create( uint16_t max_lid_ho; uint16_t base_lid_ho; uint16_t match_lid_ho; + osm_physp_t *p_node_physp; + osm_switch_t *p_sw; + 
ib_switch_info_t *p_si; OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_pir_create ); - if(p_physp->p_node->node_info.node_type == IB_NODE_TYPE_SWITCH) + if (p_physp->p_node->node_info.node_type == IB_NODE_TYPE_SWITCH) { - lmc = 0; - base_lid_ho = cl_ntoh16( - osm_physp_get_base_lid( - osm_node_get_physp_ptr(p_physp->p_node, 0)) - ); - max_lid_ho = base_lid_ho; + p_node_physp = osm_node_get_physp_ptr( p_physp->p_node, 0 ); + base_lid_ho = cl_ntoh16( osm_physp_get_base_lid( p_node_physp ) ); + p_sw = osm_get_switch_by_guid( p_rcv->p_subn, + osm_physp_get_port_guid( p_node_physp ) ); + if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || + !ib_switch_info_is_enhanced_port0( p_si )) + { + lmc = 0; + } + else + { + lmc = osm_physp_get_lmc( p_node_physp ); + } } else { lmc = osm_physp_get_lmc( p_physp ); base_lid_ho = cl_ntoh16( osm_physp_get_base_lid( p_physp ) ); - max_lid_ho = (uint16_t)( base_lid_ho + (1 << lmc) - 1 ); } + max_lid_ho = (uint16_t)( base_lid_ho + (1 << lmc) - 1 ); if( p_ctxt->comp_mask & IB_PIR_COMPMASK_LID ) { From trimmer at silverstorm.com Thu Jun 29 05:12:45 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Thu, 29 Jun 2006 08:12:45 -0400 Subject: [openib-general] design for communication established affiliated asynchronous event handling In-Reply-To: <20060629054524.GC5127@mellanox.co.il> Message-ID: > -----Original Message----- > From: Michael S. Tsirkin > Sent: Thursday, June 29, 2006 1:45 AM > > Quoting r. Roland Dreier : > > Subject: Re: design for communication established affiliated > asynchronous event handling > > > > >I suggest the following design: the CMA would replace the event handler > > >provided with the qp_init_attr struct with a callback of its own and > > >keep the original handler/context on a private structure. > > > > This is probably fine. There is one further situation where the > > connection needs to be established, beyond RTU and the communication > > established async event. Namely, if a receive completion is polled. 
> > Since async events are, well, asynchronous, there's no guarantee that > > the communication established event will be reported any time soon... > > How about user taking this into account and not arming the CQ / > not polling it until the established event? If the ULP is properly designed, the asynchronous-ness of the event (or RTU for that matter) should not be an issue. Per the IBTA CM state machine, the passive side upon sending the REP should move its endpoint (the QP and the ULPs state machine) state to Ready to Receive. QPs in RTR can have send WQEs posted to them, however they will not be sent until the QP is moved to RTS. This means the ULP while in RTR can perform its normal receive completion handling and even build and post send requests in response to such received messages. Such sends will be queued until the QP later moves to RTS. Most ULPs have some sort of application level flow control. This may be simply RNR NAK or it could be a credit system (such as SRP) or an additional application initialization protocol (such as SDP). Hence the active side will generally perform limited sends (typically one) to the passive side until it gets a response from the passive side (which won't happen until the QP is in RTS). Hence for a good ULP protocol, there is no risk of overflowing the send Q while waiting to move to RTS. The only thing the passive side ULP should not do until in RTS is any sort of "periodic status messages which don't require active side acknowledgement". Since the RTS state could be delayed, the ULP should not risk overflowing its send Q with such messages. Most of the standard ULP protocols (SDP, etc) do not have such messages or they require ULP level protocol negotiation before they are activated. Hence if this is all properly handled, the passive side's RTU/Async Event handling sequence will merely move the QP to RTS and notify the ULP. 
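The RTR-versus-RTS queuing behavior described above can be modeled in a few lines of C. This is a toy model of the rule — send WQEs may be posted while the QP is still in RTR, but none are processed until the transition to RTS — with invented names (toy_qp, post_send, move_to_rts), not the verbs API:

```c
#include <assert.h>

enum qp_state { RTR, RTS };

/* Toy model of the queuing rule: work requests posted in RTR are held
 * on the send queue; moving to RTS starts draining them. */
struct toy_qp {
    enum qp_state state;
    int queued;  /* posted but not yet sent (QP still in RTR) */
    int sent;    /* processed once the QP is in RTS */
};

static void post_send(struct toy_qp *qp)
{
    if (qp->state == RTR)
        qp->queued++;          /* held back until RTS */
    else
        qp->sent++;            /* RTS: goes out immediately */
}

static void move_to_rts(struct toy_qp *qp)
{
    qp->state = RTS;
    qp->sent += qp->queued;    /* send queue starts draining */
    qp->queued = 0;
}
```

The point of the model is the one made above: nothing is lost by posting early, so the passive side can do all its real work before sending the REP and treat the RTU/async event as little more than a state transition.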
The ULP will likely do very limited work for this notification (perhaps just a state transition) as all the real work should have been done before sending the REP. The movement to RTS will enable the QP to start processing its Send Q and everything will be good. Taking this approach keeps the CM/CMA and ULP simpler in design and merely allows the RTS/RTU/Async Event handling to be another event in a state machine. Todd Rimmer From eitan at mellanox.co.il Thu Jun 29 12:51:12 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 29 Jun 2006 22:51:12 +0300 Subject: [openib-general] [PATCH] [TRIVIAL] OpenSM/Remote SM: Eliminate some unneeded status checking Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236891B@mtlexch01.mtl.com> sure > OpenSM/Remote SM: Eliminate some unneeded status checking > > Since osm_remote_sm.c:osm_remote_sm_init cannot fail, don't return any > status and don't check it in any consumers of this API > > Signed-off-by: Hal Rosenstock > From eitan at mellanox.co.il Thu Jun 29 12:52:18 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 29 Jun 2006 22:52:18 +0300 Subject: [openib-general] [PATCH] OpenSM/osm_sa_portinfo_record.c: Support enhanced switch port 0 for LMC > 0 Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236891C@mtlexch01.mtl.com> Right !
> > OpenSM/osm_sa_portinfo_record.c: Support enhanced switch port 0 for LMC > > 0 > > In __osm_sa_pir_create, handle enhanced switch port 0 (and the > possibility that it's LMC > 0) > > Signed-off-by: Hal Rosenstock > > Index: opensm/osm_sa_portinfo_record.c > =================================================================== > --- opensm/osm_sa_portinfo_record.c (revision 8277) > +++ opensm/osm_sa_portinfo_record.c (working copy) > @@ -60,6 +60,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -197,24 +198,34 @@ __osm_sa_pir_create( > uint16_t max_lid_ho; > uint16_t base_lid_ho; > uint16_t match_lid_ho; > + osm_physp_t *p_node_physp; > + osm_switch_t *p_sw; > + ib_switch_info_t *p_si; > > OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_pir_create ); > > - if(p_physp->p_node->node_info.node_type == IB_NODE_TYPE_SWITCH) > + if (p_physp->p_node->node_info.node_type == IB_NODE_TYPE_SWITCH) > { > - lmc = 0; > - base_lid_ho = cl_ntoh16( > - osm_physp_get_base_lid( > - osm_node_get_physp_ptr(p_physp->p_node, 0)) > - ); > - max_lid_ho = base_lid_ho; > + p_node_physp = osm_node_get_physp_ptr( p_physp->p_node, 0 ); > + base_lid_ho = cl_ntoh16( osm_physp_get_base_lid( p_node_physp ) ); > + p_sw = osm_get_switch_by_guid( p_rcv->p_subn, > + osm_physp_get_port_guid( p_node_physp ) ); > + if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || > + !ib_switch_info_is_enhanced_port0( p_si )) > + { > + lmc = 0; > + } > + else > + { > + lmc = osm_physp_get_lmc( p_node_physp ); > + } > } > else > { > lmc = osm_physp_get_lmc( p_physp ); > base_lid_ho = cl_ntoh16( osm_physp_get_base_lid( p_physp ) ); > - max_lid_ho = (uint16_t)( base_lid_ho + (1 << lmc) - 1 ); > } > + max_lid_ho = (uint16_t)( base_lid_ho + (1 << lmc) - 1 ); > > if( p_ctxt->comp_mask & IB_PIR_COMPMASK_LID ) > { > From eitan at mellanox.co.il Thu Jun 29 12:54:33 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 29 Jun 2006 22:54:33 +0300 Subject: [openib-general] [PATCHv2] 
OpenSM/osm_lid_mgr.c: Support enhanced switch port 0 for LMC > 0 Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30236891D@mtlexch01.mtl.com> Hi Hal, I think the check for num lids is so similar it deserves an inline function. What do you say? I refer to: > + if (p_sw && (p_si = osm_switch_get_si_ptr(p_sw)) && > + ib_switch_info_is_enhanced_port0(p_si)) > + { > + num_lids = lmc_num_lids; > + } > + else > + { > + num_lids = 1; > + } > + } > From bos at pathscale.com Thu Jun 29 14:40:51 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:40:51 -0700 Subject: [openib-general] [PATCH 0 of 39] ipath - bug fixes, performance enhancements, and portability improvements Message-ID: Hi, Andrew - These patches bring the ipath driver up to date with a number of bug fixes, performance improvements, and better PowerPC support. There are a few whitespace and formatting patches in the series, but they're all self-contained. The patches have been tested internally, and shouldn't contain anything controversial. My hope is that they'll sit in -mm for a little bit, and make it into an early 2.6.18 -rc kernel. Thanks, Message-ID: Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r ebf646d10db0 -r c93c2b42d279 drivers/infiniband/hw/ipath/ipath_rc.c --- a/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:25 2006 -0700 @@ -1053,32 +1053,32 @@ static inline void ipath_rc_rcv_resp(str goto ack_done; } rdma_read: - if (unlikely(qp->s_state != OP(RDMA_READ_REQUEST))) - goto ack_done; - if (unlikely(tlen != (hdrsize + pmtu + 4))) - goto ack_done; - if (unlikely(pmtu >= qp->s_len)) - goto ack_done; - /* We got a response so update the timeout.
*/ - if (unlikely(qp->s_last == qp->s_tail || - get_swqe_ptr(qp, qp->s_last)->wr.opcode != - IB_WR_RDMA_READ)) - goto ack_done; - spin_lock(&dev->pending_lock); - if (qp->s_rnr_timeout == 0 && !list_empty(&qp->timerwait)) - list_move_tail(&qp->timerwait, - &dev->pending[dev->pending_index]); - spin_unlock(&dev->pending_lock); - /* - * Update the RDMA receive state but do the copy w/o holding the - * locks and blocking interrupts. XXX Yet another place that - * affects relaxed RDMA order since we don't want s_sge modified. - */ - qp->s_len -= pmtu; - qp->s_last_psn = psn; - spin_unlock_irqrestore(&qp->s_lock, flags); - ipath_copy_sge(&qp->s_sge, data, pmtu); - goto bail; + if (unlikely(qp->s_state != OP(RDMA_READ_REQUEST))) + goto ack_done; + if (unlikely(tlen != (hdrsize + pmtu + 4))) + goto ack_done; + if (unlikely(pmtu >= qp->s_len)) + goto ack_done; + /* We got a response so update the timeout. */ + if (unlikely(qp->s_last == qp->s_tail || + get_swqe_ptr(qp, qp->s_last)->wr.opcode != + IB_WR_RDMA_READ)) + goto ack_done; + spin_lock(&dev->pending_lock); + if (qp->s_rnr_timeout == 0 && !list_empty(&qp->timerwait)) + list_move_tail(&qp->timerwait, + &dev->pending[dev->pending_index]); + spin_unlock(&dev->pending_lock); + /* + * Update the RDMA receive state but do the copy w/o holding the + * locks and blocking interrupts. XXX Yet another place that + * affects relaxed RDMA order since we don't want s_sge modified. + */ + qp->s_len -= pmtu; + qp->s_last_psn = psn; + spin_unlock_irqrestore(&qp->s_lock, flags); + ipath_copy_sge(&qp->s_sge, data, pmtu); + goto bail; case OP(RDMA_READ_RESPONSE_LAST): /* ACKs READ req. 
*/ From bos at pathscale.com Thu Jun 29 14:40:52 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:40:52 -0700 Subject: [openib-general] [PATCH 1 of 39] IB/ipath - Name zero counter offsets so it's clear they aren't counters In-Reply-To: Message-ID: Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r 28e3d8204fdb -r addf90abc724 drivers/infiniband/hw/ipath/ipath_mad.c --- a/drivers/infiniband/hw/ipath/ipath_mad.c Fri Jun 23 22:47:27 2006 +0700 +++ b/drivers/infiniband/hw/ipath/ipath_mad.c Thu Jun 29 14:33:25 2006 -0700 @@ -215,7 +215,7 @@ static int recv_subn_get_portinfo(struct /* P_KeyViolations are counted by hardware. */ pip->pkey_violations = cpu_to_be16((ipath_layer_get_cr_errpkey(dev->dd) - - dev->n_pkey_violations) & 0xFFFF); + dev->z_pkey_violations) & 0xFFFF); pip->qkey_violations = cpu_to_be16(dev->qkey_violations); /* Only the hardware GUID is supported for now */ pip->guid_cap = 1; @@ -389,7 +389,7 @@ static int recv_subn_set_portinfo(struct * later. */ if (pip->pkey_violations == 0) - dev->n_pkey_violations = + dev->z_pkey_violations = ipath_layer_get_cr_errpkey(dev->dd); if (pip->qkey_violations == 0) @@ -844,18 +844,18 @@ static int recv_pma_get_portcounters(str ipath_layer_get_counters(dev->dd, &cntrs); /* Adjust counters for any resets done. 
*/ - cntrs.symbol_error_counter -= dev->n_symbol_error_counter; + cntrs.symbol_error_counter -= dev->z_symbol_error_counter; cntrs.link_error_recovery_counter -= - dev->n_link_error_recovery_counter; - cntrs.link_downed_counter -= dev->n_link_downed_counter; + dev->z_link_error_recovery_counter; + cntrs.link_downed_counter -= dev->z_link_downed_counter; cntrs.port_rcv_errors += dev->rcv_errors; - cntrs.port_rcv_errors -= dev->n_port_rcv_errors; - cntrs.port_rcv_remphys_errors -= dev->n_port_rcv_remphys_errors; - cntrs.port_xmit_discards -= dev->n_port_xmit_discards; - cntrs.port_xmit_data -= dev->n_port_xmit_data; - cntrs.port_rcv_data -= dev->n_port_rcv_data; - cntrs.port_xmit_packets -= dev->n_port_xmit_packets; - cntrs.port_rcv_packets -= dev->n_port_rcv_packets; + cntrs.port_rcv_errors -= dev->z_port_rcv_errors; + cntrs.port_rcv_remphys_errors -= dev->z_port_rcv_remphys_errors; + cntrs.port_xmit_discards -= dev->z_port_xmit_discards; + cntrs.port_xmit_data -= dev->z_port_xmit_data; + cntrs.port_rcv_data -= dev->z_port_rcv_data; + cntrs.port_xmit_packets -= dev->z_port_xmit_packets; + cntrs.port_rcv_packets -= dev->z_port_rcv_packets; memset(pmp->data, 0, sizeof(pmp->data)); @@ -928,10 +928,10 @@ static int recv_pma_get_portcounters_ext &rpkts, &xwait); /* Adjust counters for any resets done. 
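For readers following the rename in this patch: the chip's counters are monotonic and cannot be reset, so the driver emulates a PMA "set to zero" by snapshotting the current hardware value into a z_*-prefixed baseline field and reporting the difference on every read. That is why the fields are offsets, not counters. A minimal standalone sketch of the pattern (the struct and function names here are invented for illustration and are not the driver's API):

```c
#include <stdint.h>

/* A monotonic hardware counter paired with its "zero" baseline.
 * The z_ prefix marks a snapshot taken at the last emulated reset. */
struct port_counters {
	uint64_t hw_port_xmit_packets;	/* only ever increases */
	uint64_t z_port_xmit_packets;	/* baseline at last reset */
};

/* Emulate a PMA reset: remember where the hardware counter was. */
static void reset_xmit_packets(struct port_counters *c)
{
	c->z_port_xmit_packets = c->hw_port_xmit_packets;
}

/* Value reported to the agent: growth since the last reset. */
static uint64_t read_xmit_packets(const struct port_counters *c)
{
	return c->hw_port_xmit_packets - c->z_port_xmit_packets;
}
```

The subtraction is exactly what the recv_pma_get_portcounters() hunks above do for each counter the PMA exposes.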
*/ - swords -= dev->n_port_xmit_data; - rwords -= dev->n_port_rcv_data; - spkts -= dev->n_port_xmit_packets; - rpkts -= dev->n_port_rcv_packets; + swords -= dev->z_port_xmit_data; + rwords -= dev->z_port_rcv_data; + spkts -= dev->z_port_xmit_packets; + rpkts -= dev->z_port_rcv_packets; memset(pmp->data, 0, sizeof(pmp->data)); @@ -967,37 +967,37 @@ static int recv_pma_set_portcounters(str ipath_layer_get_counters(dev->dd, &cntrs); if (p->counter_select & IB_PMA_SEL_SYMBOL_ERROR) - dev->n_symbol_error_counter = cntrs.symbol_error_counter; + dev->z_symbol_error_counter = cntrs.symbol_error_counter; if (p->counter_select & IB_PMA_SEL_LINK_ERROR_RECOVERY) - dev->n_link_error_recovery_counter = + dev->z_link_error_recovery_counter = cntrs.link_error_recovery_counter; if (p->counter_select & IB_PMA_SEL_LINK_DOWNED) - dev->n_link_downed_counter = cntrs.link_downed_counter; + dev->z_link_downed_counter = cntrs.link_downed_counter; if (p->counter_select & IB_PMA_SEL_PORT_RCV_ERRORS) - dev->n_port_rcv_errors = + dev->z_port_rcv_errors = cntrs.port_rcv_errors + dev->rcv_errors; if (p->counter_select & IB_PMA_SEL_PORT_RCV_REMPHYS_ERRORS) - dev->n_port_rcv_remphys_errors = + dev->z_port_rcv_remphys_errors = cntrs.port_rcv_remphys_errors; if (p->counter_select & IB_PMA_SEL_PORT_XMIT_DISCARDS) - dev->n_port_xmit_discards = cntrs.port_xmit_discards; + dev->z_port_xmit_discards = cntrs.port_xmit_discards; if (p->counter_select & IB_PMA_SEL_PORT_XMIT_DATA) - dev->n_port_xmit_data = cntrs.port_xmit_data; + dev->z_port_xmit_data = cntrs.port_xmit_data; if (p->counter_select & IB_PMA_SEL_PORT_RCV_DATA) - dev->n_port_rcv_data = cntrs.port_rcv_data; + dev->z_port_rcv_data = cntrs.port_rcv_data; if (p->counter_select & IB_PMA_SEL_PORT_XMIT_PACKETS) - dev->n_port_xmit_packets = cntrs.port_xmit_packets; + dev->z_port_xmit_packets = cntrs.port_xmit_packets; if (p->counter_select & IB_PMA_SEL_PORT_RCV_PACKETS) - dev->n_port_rcv_packets = cntrs.port_rcv_packets; + dev->z_port_rcv_packets = 
cntrs.port_rcv_packets; return recv_pma_get_portcounters(pmp, ibdev, port); } @@ -1014,16 +1014,16 @@ static int recv_pma_set_portcounters_ext &rpkts, &xwait); if (p->counter_select & IB_PMA_SELX_PORT_XMIT_DATA) - dev->n_port_xmit_data = swords; + dev->z_port_xmit_data = swords; if (p->counter_select & IB_PMA_SELX_PORT_RCV_DATA) - dev->n_port_rcv_data = rwords; + dev->z_port_rcv_data = rwords; if (p->counter_select & IB_PMA_SELX_PORT_XMIT_PACKETS) - dev->n_port_xmit_packets = spkts; + dev->z_port_xmit_packets = spkts; if (p->counter_select & IB_PMA_SELX_PORT_RCV_PACKETS) - dev->n_port_rcv_packets = rpkts; + dev->z_port_rcv_packets = rpkts; if (p->counter_select & IB_PMA_SELX_PORT_UNI_XMIT_PACKETS) dev->n_unicast_xmit = 0; @@ -1285,18 +1285,18 @@ int ipath_process_mad(struct ib_device * ipath_layer_get_counters(to_idev(ibdev)->dd, &cntrs); dev->rcv_errors++; - dev->n_symbol_error_counter = cntrs.symbol_error_counter; - dev->n_link_error_recovery_counter = + dev->z_symbol_error_counter = cntrs.symbol_error_counter; + dev->z_link_error_recovery_counter = cntrs.link_error_recovery_counter; - dev->n_link_downed_counter = cntrs.link_downed_counter; - dev->n_port_rcv_errors = cntrs.port_rcv_errors + 1; - dev->n_port_rcv_remphys_errors = + dev->z_link_downed_counter = cntrs.link_downed_counter; + dev->z_port_rcv_errors = cntrs.port_rcv_errors + 1; + dev->z_port_rcv_remphys_errors = cntrs.port_rcv_remphys_errors; - dev->n_port_xmit_discards = cntrs.port_xmit_discards; - dev->n_port_xmit_data = cntrs.port_xmit_data; - dev->n_port_rcv_data = cntrs.port_rcv_data; - dev->n_port_xmit_packets = cntrs.port_xmit_packets; - dev->n_port_rcv_packets = cntrs.port_rcv_packets; + dev->z_port_xmit_discards = cntrs.port_xmit_discards; + dev->z_port_xmit_data = cntrs.port_xmit_data; + dev->z_port_rcv_data = cntrs.port_rcv_data; + dev->z_port_xmit_packets = cntrs.port_xmit_packets; + dev->z_port_rcv_packets = cntrs.port_rcv_packets; } switch (in_mad->mad_hdr.mgmt_class) { case 
IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: diff -r 28e3d8204fdb -r addf90abc724 drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Fri Jun 23 22:47:27 2006 +0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:25 2006 -0700 @@ -646,7 +646,7 @@ static int ipath_query_port(struct ib_de props->max_msg_sz = 4096; props->pkey_tbl_len = ipath_layer_get_npkeys(dev->dd); props->bad_pkey_cntr = ipath_layer_get_cr_errpkey(dev->dd) - - dev->n_pkey_violations; + dev->z_pkey_violations; props->qkey_viol_cntr = dev->qkey_violations; props->active_width = IB_WIDTH_4X; /* See rate_show() */ diff -r 28e3d8204fdb -r addf90abc724 drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Fri Jun 23 22:47:27 2006 +0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Jun 29 14:33:25 2006 -0700 @@ -442,17 +442,17 @@ struct ipath_ibdev { u64 n_unicast_rcv; /* total unicast packets received */ u64 n_multicast_xmit; /* total multicast packets sent */ u64 n_multicast_rcv; /* total multicast packets received */ - u64 n_symbol_error_counter; /* starting count for PMA */ - u64 n_link_error_recovery_counter; /* starting count for PMA */ - u64 n_link_downed_counter; /* starting count for PMA */ - u64 n_port_rcv_errors; /* starting count for PMA */ - u64 n_port_rcv_remphys_errors; /* starting count for PMA */ - u64 n_port_xmit_discards; /* starting count for PMA */ - u64 n_port_xmit_data; /* starting count for PMA */ - u64 n_port_rcv_data; /* starting count for PMA */ - u64 n_port_xmit_packets; /* starting count for PMA */ - u64 n_port_rcv_packets; /* starting count for PMA */ - u32 n_pkey_violations; /* starting count for PMA */ + u64 z_symbol_error_counter; /* starting count for PMA */ + u64 z_link_error_recovery_counter; /* starting count for PMA */ + u64 z_link_downed_counter; /* starting count for PMA */ + u64 z_port_rcv_errors; /* starting count for PMA */ + u64 z_port_rcv_remphys_errors; /* starting count 
for PMA */ + u64 z_port_xmit_discards; /* starting count for PMA */ + u64 z_port_xmit_data; /* starting count for PMA */ + u64 z_port_rcv_data; /* starting count for PMA */ + u64 z_port_xmit_packets; /* starting count for PMA */ + u64 z_port_rcv_packets; /* starting count for PMA */ + u32 z_pkey_violations; /* starting count for PMA */ u32 n_rc_resends; u32 n_rc_acks; u32 n_rc_qacks; From bos at pathscale.com Thu Jun 29 14:40:54 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:40:54 -0700 Subject: [openib-general] [PATCH 3 of 39] IB/ipath - Share more common code between RC and UC protocols In-Reply-To: Message-ID: Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r f7c82500b9c7 -r ebf646d10db0 drivers/infiniband/hw/ipath/ipath_qp.c --- a/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:25 2006 -0700 @@ -709,9 +709,7 @@ struct ib_qp *ipath_create_qp(struct ib_ spin_lock_init(&qp->r_rq.lock); atomic_set(&qp->refcount, 0); init_waitqueue_head(&qp->wait); - tasklet_init(&qp->s_task, - init_attr->qp_type == IB_QPT_RC ? - ipath_do_rc_send : ipath_do_uc_send, + tasklet_init(&qp->s_task, ipath_do_ruc_send, (unsigned long)qp); INIT_LIST_HEAD(&qp->piowait); INIT_LIST_HEAD(&qp->timerwait); @@ -896,9 +894,9 @@ void ipath_get_credit(struct ipath_qp *q * as many packets as we like. Otherwise, we have to * honor the credit field. 
*/ - if (credit == IPS_AETH_CREDIT_INVAL) { + if (credit == IPS_AETH_CREDIT_INVAL) qp->s_lsn = (u32) -1; - } else if (qp->s_lsn != (u32) -1) { + else if (qp->s_lsn != (u32) -1) { /* Compute new LSN (i.e., MSN + credit) */ credit = (aeth + credit_table[credit]) & IPS_MSN_MASK; if (ipath_cmp24(credit, qp->s_lsn) > 0) diff -r f7c82500b9c7 -r ebf646d10db0 drivers/infiniband/hw/ipath/ipath_rc.c --- a/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:25 2006 -0700 @@ -73,9 +73,9 @@ static void ipath_init_restart(struct ip * Return bth0 if constructed; otherwise, return 0. * Note the QP s_lock must be held. */ -static inline u32 ipath_make_rc_ack(struct ipath_qp *qp, - struct ipath_other_headers *ohdr, - u32 pmtu) +u32 ipath_make_rc_ack(struct ipath_qp *qp, + struct ipath_other_headers *ohdr, + u32 pmtu) { struct ipath_sge_state *ss; u32 hwords; @@ -96,8 +96,7 @@ static inline u32 ipath_make_rc_ack(stru if (len > pmtu) { len = pmtu; qp->s_ack_state = OP(RDMA_READ_RESPONSE_FIRST); - } - else + } else qp->s_ack_state = OP(RDMA_READ_RESPONSE_ONLY); qp->s_rdma_len -= len; bth0 = qp->s_ack_state << 24; @@ -177,9 +176,9 @@ static inline u32 ipath_make_rc_ack(stru * Return 1 if constructed; otherwise, return 0. * Note the QP s_lock must be held. */ -static inline int ipath_make_rc_req(struct ipath_qp *qp, - struct ipath_other_headers *ohdr, - u32 pmtu, u32 *bth0p, u32 *bth2p) +int ipath_make_rc_req(struct ipath_qp *qp, + struct ipath_other_headers *ohdr, + u32 pmtu, u32 *bth0p, u32 *bth2p) { struct ipath_ibdev *dev = to_idev(qp->ibqp.device); struct ipath_sge_state *ss; @@ -497,160 +496,33 @@ done: return 0; } -static inline void ipath_make_rc_grh(struct ipath_qp *qp, - struct ib_global_route *grh, - u32 nwords) -{ - struct ipath_ibdev *dev = to_idev(qp->ibqp.device); - - /* GRH header size in 32-bit words. 
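The credit handling in ipath_get_credit() above leans on ipath_cmp24() to order 24-bit PSN/MSN values despite wraparound. A hedged sketch of how such a comparison is commonly implemented (an assumption based on standard serial-number arithmetic, not necessarily the driver's exact code): shift the 24-bit difference up and arithmetically back down so it is sign-extended, making values that wrapped past 0xFFFFFF still compare correctly.

```c
#include <stdint.h>

/* Compare two 24-bit serial numbers a and b, ignoring upper bits.
 * Returns <0 if a precedes b, 0 if equal, >0 if a follows b, even
 * across the 0xFFFFFF -> 0x000000 wrap.  Relies on arithmetic right
 * shift of signed values, as kernel code conventionally does. */
static int cmp24(uint32_t a, uint32_t b)
{
	return ((int32_t)((a - b) << 8)) >> 8;
}
```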
*/ - qp->s_hdrwords += 10; - qp->s_hdr.u.l.grh.version_tclass_flow = - cpu_to_be32((6 << 28) | - (grh->traffic_class << 20) | - grh->flow_label); - qp->s_hdr.u.l.grh.paylen = - cpu_to_be16(((qp->s_hdrwords - 12) + nwords + - SIZE_OF_CRC) << 2); - /* next_hdr is defined by C8-7 in ch. 8.4.1 */ - qp->s_hdr.u.l.grh.next_hdr = 0x1B; - qp->s_hdr.u.l.grh.hop_limit = grh->hop_limit; - /* The SGID is 32-bit aligned. */ - qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix; - qp->s_hdr.u.l.grh.sgid.global.interface_id = - ipath_layer_get_guid(dev->dd); - qp->s_hdr.u.l.grh.dgid = grh->dgid; -} - /** - * ipath_do_rc_send - perform a send on an RC QP - * @data: contains a pointer to the QP + * send_rc_ack - Construct an ACK packet and send it + * @qp: a pointer to the QP * - * Process entries in the send work queue until credit or queue is - * exhausted. Only allow one CPU to send a packet per QP (tasklet). - * Otherwise, after we drop the QP s_lock, two threads could send - * packets out of order. + * This is called from ipath_rc_rcv() and only uses the receive + * side QP state. + * Note that RDMA reads are handled in the send side QP state and tasklet. */ -void ipath_do_rc_send(unsigned long data) -{ - struct ipath_qp *qp = (struct ipath_qp *)data; - struct ipath_ibdev *dev = to_idev(qp->ibqp.device); - unsigned long flags; - u16 lrh0; - u32 nwords; - u32 extra_bytes; - u32 bth0; - u32 bth2; - u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); - struct ipath_other_headers *ohdr; - - if (test_and_set_bit(IPATH_S_BUSY, &qp->s_flags)) - goto bail; - - if (unlikely(qp->remote_ah_attr.dlid == - ipath_layer_get_lid(dev->dd))) { - struct ib_wc wc; - - /* - * Pass in an uninitialized ib_wc to be consistent with - * other places where ipath_ruc_loopback() is called. - */ - ipath_ruc_loopback(qp, &wc); - goto clear; - } - - ohdr = &qp->s_hdr.u.oth; - if (qp->remote_ah_attr.ah_flags & IB_AH_GRH) - ohdr = &qp->s_hdr.u.l.oth; - -again: - /* Check for a constructed packet to be sent. 
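The GRH construction being deleted here (and re-added as the shared ipath_make_grh() later in this patch) packs the IPv6-style version/traffic-class/flow-label word and computes paylen as the byte count of everything after the GRH: the header words minus the two LRH words, plus the payload words and the one-word ICRC, times four bytes per word. A small sketch of that arithmetic (helper names are hypothetical; SIZE_OF_CRC of one 32-bit word matches the driver's usage):

```c
#include <stdint.h>

#define SIZE_OF_CRC 1	/* ICRC occupies one 32-bit word */

/* IB GRH first word: IP version 6, traffic class, 20-bit flow label. */
static uint32_t grh_version_tclass_flow(uint32_t tclass, uint32_t flow)
{
	return (6u << 28) | (tclass << 20) | flow;
}

/* Bytes following the GRH: header words excluding the 2 LRH words,
 * plus payload words and the ICRC word, scaled to bytes. */
static uint32_t grh_paylen_bytes(uint32_t hwords, uint32_t nwords)
{
	return (hwords - 2 + nwords + SIZE_OF_CRC) << 2;
}
```

For the ACK case in send_rc_ack(), hwords is 6 (LRH+BTH+AETH = 24 bytes) with no payload, so paylen covers BTH, AETH, and the ICRC.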
*/ - if (qp->s_hdrwords != 0) { - /* - * If no PIO bufs are available, return. An interrupt will - * call ipath_ib_piobufavail() when one is available. - */ - _VERBS_INFO("h %u %p\n", qp->s_hdrwords, &qp->s_hdr); - _VERBS_INFO("d %u %p %u %p %u %u %u %u\n", qp->s_cur_size, - qp->s_cur_sge->sg_list, - qp->s_cur_sge->num_sge, - qp->s_cur_sge->sge.vaddr, - qp->s_cur_sge->sge.sge_length, - qp->s_cur_sge->sge.length, - qp->s_cur_sge->sge.m, - qp->s_cur_sge->sge.n); - if (ipath_verbs_send(dev->dd, qp->s_hdrwords, - (u32 *) &qp->s_hdr, qp->s_cur_size, - qp->s_cur_sge)) { - ipath_no_bufs_available(qp, dev); - goto bail; - } - dev->n_unicast_xmit++; - /* Record that we sent the packet and s_hdr is empty. */ - qp->s_hdrwords = 0; - } - - /* - * The lock is needed to synchronize between setting - * qp->s_ack_state, resend timer, and post_send(). - */ - spin_lock_irqsave(&qp->s_lock, flags); - - /* Sending responses has higher priority over sending requests. */ - if (qp->s_ack_state != OP(ACKNOWLEDGE) && - (bth0 = ipath_make_rc_ack(qp, ohdr, pmtu)) != 0) - bth2 = qp->s_ack_psn++ & IPS_PSN_MASK; - else if (!ipath_make_rc_req(qp, ohdr, pmtu, &bth0, &bth2)) - goto done; - - spin_unlock_irqrestore(&qp->s_lock, flags); - - /* Construct the header. 
*/ - extra_bytes = (4 - qp->s_cur_size) & 3; - nwords = (qp->s_cur_size + extra_bytes) >> 2; - lrh0 = IPS_LRH_BTH; - if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { - ipath_make_rc_grh(qp, &qp->remote_ah_attr.grh, nwords); - lrh0 = IPS_LRH_GRH; - } - lrh0 |= qp->remote_ah_attr.sl << 4; - qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); - qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); - qp->s_hdr.lrh[2] = cpu_to_be16(qp->s_hdrwords + nwords + - SIZE_OF_CRC); - qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->dd)); - bth0 |= ipath_layer_get_pkey(dev->dd, qp->s_pkey_index); - bth0 |= extra_bytes << 20; - ohdr->bth[0] = cpu_to_be32(bth0); - ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); - ohdr->bth[2] = cpu_to_be32(bth2); - - /* Check for more work to do. */ - goto again; - -done: - spin_unlock_irqrestore(&qp->s_lock, flags); -clear: - clear_bit(IPATH_S_BUSY, &qp->s_flags); -bail: - return; -} - static void send_rc_ack(struct ipath_qp *qp) { struct ipath_ibdev *dev = to_idev(qp->ibqp.device); u16 lrh0; u32 bth0; + u32 hwords; + struct ipath_ib_header hdr; struct ipath_other_headers *ohdr; /* Construct the header. */ - ohdr = &qp->s_hdr.u.oth; + ohdr = &hdr.u.oth; lrh0 = IPS_LRH_BTH; /* header size in 32-bit words LRH+BTH+AETH = (8+12+4)/4. 
*/ - qp->s_hdrwords = 6; + hwords = 6; if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { - ipath_make_rc_grh(qp, &qp->remote_ah_attr.grh, 0); - ohdr = &qp->s_hdr.u.l.oth; + hwords += ipath_make_grh(dev, &hdr.u.l.grh, + &qp->remote_ah_attr.grh, + hwords, 0); + ohdr = &hdr.u.l.oth; lrh0 = IPS_LRH_GRH; } bth0 = ipath_layer_get_pkey(dev->dd, qp->s_pkey_index); @@ -658,15 +530,14 @@ static void send_rc_ack(struct ipath_qp if (qp->s_ack_state >= OP(COMPARE_SWAP)) { bth0 |= IB_OPCODE_ATOMIC_ACKNOWLEDGE << 24; ohdr->u.at.atomic_ack_eth = cpu_to_be64(qp->s_ack_atomic); - qp->s_hdrwords += sizeof(ohdr->u.at.atomic_ack_eth) / 4; - } - else + hwords += sizeof(ohdr->u.at.atomic_ack_eth) / 4; + } else bth0 |= OP(ACKNOWLEDGE) << 24; lrh0 |= qp->remote_ah_attr.sl << 4; - qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); - qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); - qp->s_hdr.lrh[2] = cpu_to_be16(qp->s_hdrwords + SIZE_OF_CRC); - qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->dd)); + hdr.lrh[0] = cpu_to_be16(lrh0); + hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); + hdr.lrh[2] = cpu_to_be16(hwords + SIZE_OF_CRC); + hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->dd)); ohdr->bth[0] = cpu_to_be32(bth0); ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); ohdr->bth[2] = cpu_to_be32(qp->s_ack_psn & IPS_PSN_MASK); @@ -674,8 +545,7 @@ static void send_rc_ack(struct ipath_qp /* * If we can send the ACK, clear the ACK state. */ - if (ipath_verbs_send(dev->dd, qp->s_hdrwords, (u32 *) &qp->s_hdr, - 0, NULL) == 0) { + if (ipath_verbs_send(dev->dd, hwords, (u32 *) &hdr, 0, NULL) == 0) { qp->s_ack_state = OP(ACKNOWLEDGE); dev->n_rc_qacks++; dev->n_unicast_xmit++; @@ -805,7 +675,7 @@ bail: * @qp: the QP * @psn: the packet sequence number to restart at * - * This is called from ipath_rc_rcv() to process an incoming RC ACK + * This is called from ipath_rc_rcv_resp() to process an incoming RC ACK * for the given QP. * Called at interrupt level with the QP s_lock held. 
*/ @@ -1231,18 +1101,12 @@ static inline void ipath_rc_rcv_resp(str * ICRC (4). */ if (unlikely(tlen <= (hdrsize + pad + 8))) { - /* - * XXX Need to generate an error CQ - * entry. - */ + /* XXX Need to generate an error CQ entry. */ goto ack_done; } tlen -= hdrsize + pad + 8; if (unlikely(tlen != qp->s_len)) { - /* - * XXX Need to generate an error CQ - * entry. - */ + /* XXX Need to generate an error CQ entry. */ goto ack_done; } if (!header_in_data) @@ -1384,7 +1248,7 @@ static inline int ipath_rc_rcv_error(str case OP(COMPARE_SWAP): case OP(FETCH_ADD): /* - * Check for the PSN of the last atomic operations + * Check for the PSN of the last atomic operation * performed and resend the result if found. */ if ((psn & IPS_PSN_MASK) != qp->r_atomic_psn) { @@ -1454,11 +1318,6 @@ void ipath_rc_rcv(struct ipath_ibdev *de } else psn = be32_to_cpu(ohdr->bth[2]); } - /* - * The opcode is in the low byte when its in network order - * (top byte when in host order). - */ - opcode = be32_to_cpu(ohdr->bth[0]) >> 24; /* * Process responses (ACKs) before anything else. Note that the @@ -1466,6 +1325,7 @@ void ipath_rc_rcv(struct ipath_ibdev *de * queue rather than the expected receive packet sequence number. * In other words, this QP is the requester. */ + opcode = be32_to_cpu(ohdr->bth[0]) >> 24; if (opcode >= OP(RDMA_READ_RESPONSE_FIRST) && opcode <= OP(ATOMIC_ACKNOWLEDGE)) { ipath_rc_rcv_resp(dev, ohdr, data, tlen, qp, opcode, psn, diff -r f7c82500b9c7 -r ebf646d10db0 drivers/infiniband/hw/ipath/ipath_ruc.c --- a/drivers/infiniband/hw/ipath/ipath_ruc.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c Thu Jun 29 14:33:25 2006 -0700 @@ -32,6 +32,7 @@ */ #include "ipath_verbs.h" +#include "ips_common.h" /* * Convert the AETH RNR timeout code into the number of milliseconds. 
@@ -188,7 +189,6 @@ bail: /** * ipath_ruc_loopback - handle UC and RC lookback requests * @sqp: the loopback QP - * @wc: the work completion entry * * This is called from ipath_do_uc_send() or ipath_do_rc_send() to * forward a WQE addressed to the same HCA. @@ -197,13 +197,14 @@ bail: * receive interrupts since this is a connected protocol and all packets * will pass through here. */ -void ipath_ruc_loopback(struct ipath_qp *sqp, struct ib_wc *wc) +static void ipath_ruc_loopback(struct ipath_qp *sqp) { struct ipath_ibdev *dev = to_idev(sqp->ibqp.device); struct ipath_qp *qp; struct ipath_swqe *wqe; struct ipath_sge *sge; unsigned long flags; + struct ib_wc wc; u64 sdata; qp = ipath_lookup_qpn(&dev->qp_table, sqp->remote_qpn); @@ -234,8 +235,8 @@ again: wqe = get_swqe_ptr(sqp, sqp->s_last); spin_unlock_irqrestore(&sqp->s_lock, flags); - wc->wc_flags = 0; - wc->imm_data = 0; + wc.wc_flags = 0; + wc.imm_data = 0; sqp->s_sge.sge = wqe->sg_list[0]; sqp->s_sge.sg_list = wqe->sg_list + 1; @@ -243,8 +244,8 @@ again: sqp->s_len = wqe->length; switch (wqe->wr.opcode) { case IB_WR_SEND_WITH_IMM: - wc->wc_flags = IB_WC_WITH_IMM; - wc->imm_data = wqe->wr.imm_data; + wc.wc_flags = IB_WC_WITH_IMM; + wc.imm_data = wqe->wr.imm_data; /* FALLTHROUGH */ case IB_WR_SEND: spin_lock_irqsave(&qp->r_rq.lock, flags); @@ -255,7 +256,7 @@ again: if (qp->ibqp.qp_type == IB_QPT_UC) goto send_comp; if (sqp->s_rnr_retry == 0) { - wc->status = IB_WC_RNR_RETRY_EXC_ERR; + wc.status = IB_WC_RNR_RETRY_EXC_ERR; goto err; } if (sqp->s_rnr_retry_cnt < 7) @@ -270,8 +271,8 @@ again: break; case IB_WR_RDMA_WRITE_WITH_IMM: - wc->wc_flags = IB_WC_WITH_IMM; - wc->imm_data = wqe->wr.imm_data; + wc.wc_flags = IB_WC_WITH_IMM; + wc.imm_data = wqe->wr.imm_data; spin_lock_irqsave(&qp->r_rq.lock, flags); if (!ipath_get_rwqe(qp, 1)) goto rnr_nak; @@ -285,20 +286,20 @@ again: wqe->wr.wr.rdma.rkey, IB_ACCESS_REMOTE_WRITE))) { acc_err: - wc->status = IB_WC_REM_ACCESS_ERR; + wc.status = IB_WC_REM_ACCESS_ERR; err: - 
wc->wr_id = wqe->wr.wr_id; - wc->opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - wc->vendor_err = 0; - wc->byte_len = 0; - wc->qp_num = sqp->ibqp.qp_num; - wc->src_qp = sqp->remote_qpn; - wc->pkey_index = 0; - wc->slid = sqp->remote_ah_attr.dlid; - wc->sl = sqp->remote_ah_attr.sl; - wc->dlid_path_bits = 0; - wc->port_num = 0; - ipath_sqerror_qp(sqp, wc); + wc.wr_id = wqe->wr.wr_id; + wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; + wc.vendor_err = 0; + wc.byte_len = 0; + wc.qp_num = sqp->ibqp.qp_num; + wc.src_qp = sqp->remote_qpn; + wc.pkey_index = 0; + wc.slid = sqp->remote_ah_attr.dlid; + wc.sl = sqp->remote_ah_attr.sl; + wc.dlid_path_bits = 0; + wc.port_num = 0; + ipath_sqerror_qp(sqp, &wc); goto done; } break; @@ -374,22 +375,22 @@ again: goto send_comp; if (wqe->wr.opcode == IB_WR_RDMA_WRITE_WITH_IMM) - wc->opcode = IB_WC_RECV_RDMA_WITH_IMM; + wc.opcode = IB_WC_RECV_RDMA_WITH_IMM; else - wc->opcode = IB_WC_RECV; - wc->wr_id = qp->r_wr_id; - wc->status = IB_WC_SUCCESS; - wc->vendor_err = 0; - wc->byte_len = wqe->length; - wc->qp_num = qp->ibqp.qp_num; - wc->src_qp = qp->remote_qpn; + wc.opcode = IB_WC_RECV; + wc.wr_id = qp->r_wr_id; + wc.status = IB_WC_SUCCESS; + wc.vendor_err = 0; + wc.byte_len = wqe->length; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = qp->remote_qpn; /* XXX do we know which pkey matched? Only needed for GSI. */ - wc->pkey_index = 0; - wc->slid = qp->remote_ah_attr.dlid; - wc->sl = qp->remote_ah_attr.sl; - wc->dlid_path_bits = 0; + wc.pkey_index = 0; + wc.slid = qp->remote_ah_attr.dlid; + wc.sl = qp->remote_ah_attr.sl; + wc.dlid_path_bits = 0; /* Signal completion event if the solicited bit is set. 
*/ - ipath_cq_enter(to_icq(qp->ibqp.recv_cq), wc, + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, wqe->wr.send_flags & IB_SEND_SOLICITED); send_comp: @@ -397,19 +398,19 @@ send_comp: if (!test_bit(IPATH_S_SIGNAL_REQ_WR, &sqp->s_flags) || (wqe->wr.send_flags & IB_SEND_SIGNALED)) { - wc->wr_id = wqe->wr.wr_id; - wc->status = IB_WC_SUCCESS; - wc->opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - wc->vendor_err = 0; - wc->byte_len = wqe->length; - wc->qp_num = sqp->ibqp.qp_num; - wc->src_qp = 0; - wc->pkey_index = 0; - wc->slid = 0; - wc->sl = 0; - wc->dlid_path_bits = 0; - wc->port_num = 0; - ipath_cq_enter(to_icq(sqp->ibqp.send_cq), wc, 0); + wc.wr_id = wqe->wr.wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; + wc.vendor_err = 0; + wc.byte_len = wqe->length; + wc.qp_num = sqp->ibqp.qp_num; + wc.src_qp = 0; + wc.pkey_index = 0; + wc.slid = 0; + wc.sl = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + ipath_cq_enter(to_icq(sqp->ibqp.send_cq), &wc, 0); } /* Update s_last now that we are finished with the SWQE */ @@ -455,11 +456,11 @@ void ipath_no_bufs_available(struct ipat } /** - * ipath_post_rc_send - post RC and UC sends + * ipath_post_ruc_send - post RC and UC sends * @qp: the QP to post on * @wr: the work request to send */ -int ipath_post_rc_send(struct ipath_qp *qp, struct ib_send_wr *wr) +int ipath_post_ruc_send(struct ipath_qp *qp, struct ib_send_wr *wr) { struct ipath_swqe *wqe; unsigned long flags; @@ -534,13 +535,149 @@ int ipath_post_rc_send(struct ipath_qp * qp->s_head = next; spin_unlock_irqrestore(&qp->s_lock, flags); - if (qp->ibqp.qp_type == IB_QPT_UC) - ipath_do_uc_send((unsigned long) qp); - else - ipath_do_rc_send((unsigned long) qp); + ipath_do_ruc_send((unsigned long) qp); ret = 0; bail: return ret; } + +/** + * ipath_make_grh - construct a GRH header + * @dev: a pointer to the ipath device + * @hdr: a pointer to the GRH header being constructed + * @grh: the global route address to send to + * @hwords: the 
number of 32 bit words of header being sent + * @nwords: the number of 32 bit words of data being sent + * + * Return the size of the header in 32 bit words. + */ +u32 ipath_make_grh(struct ipath_ibdev *dev, struct ib_grh *hdr, + struct ib_global_route *grh, u32 hwords, u32 nwords) +{ + hdr->version_tclass_flow = + cpu_to_be32((6 << 28) | + (grh->traffic_class << 20) | + grh->flow_label); + hdr->paylen = cpu_to_be16((hwords - 2 + nwords + SIZE_OF_CRC) << 2); + /* next_hdr is defined by C8-7 in ch. 8.4.1 */ + hdr->next_hdr = 0x1B; + hdr->hop_limit = grh->hop_limit; + /* The SGID is 32-bit aligned. */ + hdr->sgid.global.subnet_prefix = dev->gid_prefix; + hdr->sgid.global.interface_id = ipath_layer_get_guid(dev->dd); + hdr->dgid = grh->dgid; + + /* GRH header size in 32-bit words. */ + return sizeof(struct ib_grh) / sizeof(u32); +} + +/** + * ipath_do_ruc_send - perform a send on an RC or UC QP + * @data: contains a pointer to the QP + * + * Process entries in the send work queue until credit or queue is + * exhausted. Only allow one CPU to send a packet per QP (tasklet). + * Otherwise, after we drop the QP s_lock, two threads could send + * packets out of order. + */ +void ipath_do_ruc_send(unsigned long data) +{ + struct ipath_qp *qp = (struct ipath_qp *)data; + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + unsigned long flags; + u16 lrh0; + u32 nwords; + u32 extra_bytes; + u32 bth0; + u32 bth2; + u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); + struct ipath_other_headers *ohdr; + + if (test_and_set_bit(IPATH_S_BUSY, &qp->s_flags)) + goto bail; + + if (unlikely(qp->remote_ah_attr.dlid == + ipath_layer_get_lid(dev->dd))) { + ipath_ruc_loopback(qp); + goto clear; + } + + ohdr = &qp->s_hdr.u.oth; + if (qp->remote_ah_attr.ah_flags & IB_AH_GRH) + ohdr = &qp->s_hdr.u.l.oth; + +again: + /* Check for a constructed packet to be sent. */ + if (qp->s_hdrwords != 0) { + /* + * If no PIO bufs are available, return. 
An interrupt will + * call ipath_ib_piobufavail() when one is available. + */ + if (ipath_verbs_send(dev->dd, qp->s_hdrwords, + (u32 *) &qp->s_hdr, qp->s_cur_size, + qp->s_cur_sge)) { + ipath_no_bufs_available(qp, dev); + goto bail; + } + dev->n_unicast_xmit++; + /* Record that we sent the packet and s_hdr is empty. */ + qp->s_hdrwords = 0; + } + + /* + * The lock is needed to synchronize between setting + * qp->s_ack_state, resend timer, and post_send(). + */ + spin_lock_irqsave(&qp->s_lock, flags); + + /* Sending responses has higher priority over sending requests. */ + if (qp->s_ack_state != IB_OPCODE_RC_ACKNOWLEDGE && + (bth0 = ipath_make_rc_ack(qp, ohdr, pmtu)) != 0) + bth2 = qp->s_ack_psn++ & IPS_PSN_MASK; + else if (!((qp->ibqp.qp_type == IB_QPT_RC) ? + ipath_make_rc_req(qp, ohdr, pmtu, &bth0, &bth2) : + ipath_make_uc_req(qp, ohdr, pmtu, &bth0, &bth2))) { + /* + * Clear the busy bit before unlocking to avoid races with + * adding new work queue items and then failing to process + * them. + */ + clear_bit(IPATH_S_BUSY, &qp->s_flags); + spin_unlock_irqrestore(&qp->s_lock, flags); + goto bail; + } + + spin_unlock_irqrestore(&qp->s_lock, flags); + + /* Construct the header. 
*/ + extra_bytes = (4 - qp->s_cur_size) & 3; + nwords = (qp->s_cur_size + extra_bytes) >> 2; + lrh0 = IPS_LRH_BTH; + if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { + qp->s_hdrwords += ipath_make_grh(dev, &qp->s_hdr.u.l.grh, + &qp->remote_ah_attr.grh, + qp->s_hdrwords, nwords); + lrh0 = IPS_LRH_GRH; + } + lrh0 |= qp->remote_ah_attr.sl << 4; + qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); + qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); + qp->s_hdr.lrh[2] = cpu_to_be16(qp->s_hdrwords + nwords + + SIZE_OF_CRC); + qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->dd)); + bth0 |= ipath_layer_get_pkey(dev->dd, qp->s_pkey_index); + bth0 |= extra_bytes << 20; + ohdr->bth[0] = cpu_to_be32(bth0); + ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); + ohdr->bth[2] = cpu_to_be32(bth2); + + /* Check for more work to do. */ + goto again; + +clear: + clear_bit(IPATH_S_BUSY, &qp->s_flags); +bail: + return; +} diff -r f7c82500b9c7 -r ebf646d10db0 drivers/infiniband/hw/ipath/ipath_uc.c --- a/drivers/infiniband/hw/ipath/ipath_uc.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_uc.c Thu Jun 29 14:33:25 2006 -0700 @@ -62,90 +62,40 @@ static void complete_last_send(struct ip } /** - * ipath_do_uc_send - do a send on a UC queue - * @data: contains a pointer to the QP to send on - * - * Process entries in the send work queue until the queue is exhausted. - * Only allow one CPU to send a packet per QP (tasklet). - * Otherwise, after we drop the QP lock, two threads could send - * packets out of order. - * This is similar to ipath_do_rc_send() below except we don't have - * timeouts or resends. + * ipath_make_uc_req - construct a request packet (SEND, RDMA write) + * @qp: a pointer to the QP + * @ohdr: a pointer to the IB header being constructed + * @pmtu: the path MTU + * @bth0p: pointer to the BTH opcode word + * @bth2p: pointer to the BTH PSN word + * + * Return 1 if constructed; otherwise, return 0. 
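The IPATH_S_BUSY test_and_set_bit() guard that both old send routines used, and that ipath_do_ruc_send() keeps, is a single-sender gate: whichever context wins the bit gets to build and post packets for the QP; every other context backs off, so packets cannot go out of order once s_lock is dropped. A user-space sketch of the idea (fake_qp and try_send are invented stand-ins, with a C11 atomic_flag in place of the kernel's bit operations):

```c
#include <stdatomic.h>

struct fake_qp {
	atomic_flag busy;	/* stands in for IPATH_S_BUSY */
	int packets_sent;
};

/* Returns 1 if this caller won the right to send, 0 if another
 * context already holds the busy bit and is sending. */
static int try_send(struct fake_qp *qp)
{
	if (atomic_flag_test_and_set(&qp->busy))
		return 0;		/* someone else is sending */
	qp->packets_sent++;		/* ...build and post packets... */
	atomic_flag_clear(&qp->busy);
	return 1;
}
```

Note how the patch also clears the bit before unlocking in the no-work path, to avoid racing with post_send() adding new work that would then never be processed.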
+ * Note the QP s_lock must be held and interrupts disabled.
  */
-void ipath_do_uc_send(unsigned long data)
+int ipath_make_uc_req(struct ipath_qp *qp,
+		      struct ipath_other_headers *ohdr,
+		      u32 pmtu, u32 *bth0p, u32 *bth2p)
 {
-	struct ipath_qp *qp = (struct ipath_qp *)data;
-	struct ipath_ibdev *dev = to_idev(qp->ibqp.device);
 	struct ipath_swqe *wqe;
-	unsigned long flags;
-	u16 lrh0;
 	u32 hwords;
-	u32 nwords;
-	u32 extra_bytes;
 	u32 bth0;
-	u32 bth2;
-	u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu);
 	u32 len;
-	struct ipath_other_headers *ohdr;
 	struct ib_wc wc;

-	if (test_and_set_bit(IPATH_S_BUSY, &qp->s_flags))
-		goto bail;
-
-	if (unlikely(qp->remote_ah_attr.dlid ==
-		     ipath_layer_get_lid(dev->dd))) {
-		/* Pass in an uninitialized ib_wc to save stack space. */
-		ipath_ruc_loopback(qp, &wc);
-		clear_bit(IPATH_S_BUSY, &qp->s_flags);
-		goto bail;
-	}
-
-	ohdr = &qp->s_hdr.u.oth;
-	if (qp->remote_ah_attr.ah_flags & IB_AH_GRH)
-		ohdr = &qp->s_hdr.u.l.oth;
-
-again:
-	/* Check for a constructed packet to be sent. */
-	if (qp->s_hdrwords != 0) {
-		/*
-		 * If no PIO bufs are available, return.
-		 * An interrupt will call ipath_ib_piobufavail()
-		 * when one is available.
-		 */
-		if (ipath_verbs_send(dev->dd, qp->s_hdrwords,
-				     (u32 *) &qp->s_hdr,
-				     qp->s_cur_size,
-				     qp->s_cur_sge)) {
-			ipath_no_bufs_available(qp, dev);
-			goto bail;
-		}
-		dev->n_unicast_xmit++;
-		/* Record that we sent the packet and s_hdr is empty. */
-		qp->s_hdrwords = 0;
-	}
-
-	lrh0 = IPS_LRH_BTH;
+	if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK))
+		goto done;
+
 	/* header size in 32-bit words LRH+BTH = (8+12)/4. */
 	hwords = 5;
-
-	/*
-	 * The lock is needed to synchronize between
-	 * setting qp->s_ack_state and post_send().
-	 */
-	spin_lock_irqsave(&qp->s_lock, flags);
-
-	if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK))
-		goto done;
-
-	bth0 = ipath_layer_get_pkey(dev->dd, qp->s_pkey_index);
-
-	/* Send a request. */
+	bth0 = 0;
+
+	/* Get the next send request. */
 	wqe = get_swqe_ptr(qp, qp->s_last);
 	switch (qp->s_state) {
 	default:
 		/*
-		 * Signal the completion of the last send (if there is
-		 * one).
+		 * Signal the completion of the last send
+		 * (if there is one).
 		 */
 		if (qp->s_last != qp->s_tail)
 			complete_last_send(qp, wqe, &wc);
@@ -258,61 +208,16 @@ again:
 		}
 		break;
 	}
-	bth2 = qp->s_next_psn++ & IPS_PSN_MASK;
 	qp->s_len -= len;
-	bth0 |= qp->s_state << 24;
-
-	spin_unlock_irqrestore(&qp->s_lock, flags);
-
-	/* Construct the header. */
-	extra_bytes = (4 - len) & 3;
-	nwords = (len + extra_bytes) >> 2;
-	if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) {
-		/* Header size in 32-bit words. */
-		hwords += 10;
-		lrh0 = IPS_LRH_GRH;
-		qp->s_hdr.u.l.grh.version_tclass_flow =
-			cpu_to_be32((6 << 28) |
-				    (qp->remote_ah_attr.grh.traffic_class
-				     << 20) |
-				    qp->remote_ah_attr.grh.flow_label);
-		qp->s_hdr.u.l.grh.paylen =
-			cpu_to_be16(((hwords - 12) + nwords +
-				     SIZE_OF_CRC) << 2);
-		/* next_hdr is defined by C8-7 in ch. 8.4.1 */
-		qp->s_hdr.u.l.grh.next_hdr = 0x1B;
-		qp->s_hdr.u.l.grh.hop_limit =
-			qp->remote_ah_attr.grh.hop_limit;
-		/* The SGID is 32-bit aligned. */
-		qp->s_hdr.u.l.grh.sgid.global.subnet_prefix =
-			dev->gid_prefix;
-		qp->s_hdr.u.l.grh.sgid.global.interface_id =
-			ipath_layer_get_guid(dev->dd);
-		qp->s_hdr.u.l.grh.dgid = qp->remote_ah_attr.grh.dgid;
-	}
 	qp->s_hdrwords = hwords;
 	qp->s_cur_sge = &qp->s_sge;
 	qp->s_cur_size = len;
-	lrh0 |= qp->remote_ah_attr.sl << 4;
-	qp->s_hdr.lrh[0] = cpu_to_be16(lrh0);
-	/* DEST LID */
-	qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid);
-	qp->s_hdr.lrh[2] = cpu_to_be16(hwords + nwords + SIZE_OF_CRC);
-	qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->dd));
-	bth0 |= extra_bytes << 20;
-	ohdr->bth[0] = cpu_to_be32(bth0);
-	ohdr->bth[1] = cpu_to_be32(qp->remote_qpn);
-	ohdr->bth[2] = cpu_to_be32(bth2);
-
-	/* Check for more work to do. */
-	goto again;
+	*bth0p = bth0 | (qp->s_state << 24);
+	*bth2p = qp->s_next_psn++ & IPS_PSN_MASK;
+	return 1;

 done:
-	spin_unlock_irqrestore(&qp->s_lock, flags);
-	clear_bit(IPATH_S_BUSY, &qp->s_flags);
-
-bail:
-	return;
+	return 0;
 }

 /**
@@ -536,12 +441,13 @@ void ipath_uc_rcv(struct ipath_ibdev *de
 		if (qp->r_len != 0) {
 			u32 rkey = be32_to_cpu(reth->rkey);
 			u64 vaddr = be64_to_cpu(reth->vaddr);
+			int ok;

 			/* Check rkey */
-			if (unlikely(!ipath_rkey_ok(
-					     dev, &qp->r_sge, qp->r_len,
-					     vaddr, rkey,
-					     IB_ACCESS_REMOTE_WRITE))) {
+			ok = ipath_rkey_ok(dev, &qp->r_sge, qp->r_len,
+					   vaddr, rkey,
+					   IB_ACCESS_REMOTE_WRITE);
+			if (unlikely(!ok)) {
 				dev->n_pkt_drops++;
 				goto done;
 			}
@@ -559,8 +465,7 @@ void ipath_uc_rcv(struct ipath_ibdev *de
 		}
 		if (opcode == OP(RDMA_WRITE_ONLY))
 			goto rdma_last;
-		else if (opcode ==
-			 OP(RDMA_WRITE_ONLY_WITH_IMMEDIATE))
+		else if (opcode == OP(RDMA_WRITE_ONLY_WITH_IMMEDIATE))
 			goto rdma_last_imm;
 		/* FALLTHROUGH */
 	case OP(RDMA_WRITE_MIDDLE):
@@ -593,9 +498,9 @@ void ipath_uc_rcv(struct ipath_ibdev *de
 			dev->n_pkt_drops++;
 			goto done;
 		}
-		if (qp->r_reuse_sge) {
+		if (qp->r_reuse_sge)
 			qp->r_reuse_sge = 0;
-		} else if (!ipath_get_rwqe(qp, 1)) {
+		else if (!ipath_get_rwqe(qp, 1)) {
 			dev->n_pkt_drops++;
 			goto done;
 		}
diff -r f7c82500b9c7 -r ebf646d10db0 drivers/infiniband/hw/ipath/ipath_verbs.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c	Thu Jun 29 14:33:25 2006 -0700
@@ -194,7 +194,7 @@ static int ipath_post_send(struct ib_qp
 	switch (qp->ibqp.qp_type) {
 	case IB_QPT_UC:
 	case IB_QPT_RC:
-		err = ipath_post_rc_send(qp, wr);
+		err = ipath_post_ruc_send(qp, wr);
 		break;

 	case IB_QPT_SMI:
diff -r f7c82500b9c7 -r ebf646d10db0 drivers/infiniband/hw/ipath/ipath_verbs.h
--- a/drivers/infiniband/hw/ipath/ipath_verbs.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.h	Thu Jun 29 14:33:25 2006 -0700
@@ -581,10 +581,6 @@ void ipath_sqerror_qp(struct ipath_qp *q
 void ipath_get_credit(struct ipath_qp *qp, u32 aeth);

-void ipath_do_rc_send(unsigned long data);
-
-void ipath_do_uc_send(unsigned long data);
-
 void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int sig);

 int ipath_rkey_ok(struct ipath_ibdev *dev, struct ipath_sge_state *ss,
@@ -597,7 +593,7 @@ void ipath_copy_sge(struct ipath_sge_sta
 void ipath_skip_sge(struct ipath_sge_state *ss, u32 length);

-int ipath_post_rc_send(struct ipath_qp *qp, struct ib_send_wr *wr);
+int ipath_post_ruc_send(struct ipath_qp *qp, struct ib_send_wr *wr);

 void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
 		  int has_grh, void *data, u32 tlen, struct ipath_qp *qp);
@@ -679,7 +675,19 @@ void ipath_insert_rnr_queue(struct ipath
 int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only);

-void ipath_ruc_loopback(struct ipath_qp *sqp, struct ib_wc *wc);
+u32 ipath_make_grh(struct ipath_ibdev *dev, struct ib_grh *hdr,
+		   struct ib_global_route *grh, u32 hwords, u32 nwords);
+
+void ipath_do_ruc_send(unsigned long data);
+
+u32 ipath_make_rc_ack(struct ipath_qp *qp, struct ipath_other_headers *ohdr,
+		      u32 pmtu);
+
+int ipath_make_rc_req(struct ipath_qp *qp, struct ipath_other_headers *ohdr,
+		      u32 pmtu, u32 *bth0p, u32 *bth2p);
+
+int ipath_make_uc_req(struct ipath_qp *qp, struct ipath_other_headers *ohdr,
+		      u32 pmtu, u32 *bth0p, u32 *bth2p);

 extern const enum ib_wc_opcode ib_ipath_wc_opcode[];

From bos at pathscale.com  Thu Jun 29 14:40:53 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:40:53 -0700
Subject: [openib-general] [PATCH 2 of 39] IB/ipath - update copyrights and other strings to reflect new company name
In-Reply-To: 
Message-ID: 

Signed-off-by: Bryan O'Sullivan 

diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/Kconfig
--- a/drivers/infiniband/hw/ipath/Kconfig	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/Kconfig	Thu Jun 29 14:33:25 2006 -0700
@@ -1,16 +1,16 @@ config IPATH_CORE
 config IPATH_CORE
-	tristate "PathScale InfiniPath Driver"
+	tristate "QLogic InfiniPath Driver"
 	depends on 64BIT && PCI_MSI && NET
 	---help---
-	This is a low-level driver for PathScale InfiniPath host channel
+	This is a low-level driver for QLogic InfiniPath host channel
 	adapters (HCAs) based on the HT-400 and PE-800 chips.

 config INFINIBAND_IPATH
-	tristate "PathScale InfiniPath Verbs Driver"
+	tristate "QLogic InfiniPath Verbs Driver"
 	depends on IPATH_CORE && INFINIBAND
 	---help---
 	This is a driver that provides InfiniBand verbs support for
-	PathScale InfiniPath host channel adapters (HCAs). This
+	QLogic InfiniPath host channel adapters (HCAs). This
 	allows these devices to be used with both kernel upper level
 	protocols such as IP-over-InfiniBand as well as with userspace
 	applications (in conjunction with InfiniBand userspace access).
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/Makefile
--- a/drivers/infiniband/hw/ipath/Makefile	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/Makefile	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,4 @@ EXTRA_CFLAGS += -DIPATH_IDSTR='"PathScal
-EXTRA_CFLAGS += -DIPATH_IDSTR='"PathScale kernel.org driver"' \
+EXTRA_CFLAGS += -DIPATH_IDSTR='"QLogic kernel.org driver"' \
 	-DIPATH_KERN_TYPE=0

 obj-$(CONFIG_IPATH_CORE) += ipath_core.o
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_common.h
--- a/drivers/infiniband/hw/ipath/ipath_common.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_common.h	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
@@ -38,7 +39,7 @@
 * to communicate between kernel and user code.
 */

-/* This is the IEEE-assigned OUI for PathScale, Inc. */
+/* This is the IEEE-assigned OUI for QLogic, Inc. InfiniPath */
 #define IPATH_SRC_OUI_1 0x00
 #define IPATH_SRC_OUI_2 0x11
 #define IPATH_SRC_OUI_3 0x75
@@ -342,9 +343,9 @@ struct ipath_base_info {
 /*
 * Similarly, this is the kernel version going back to the user. It's
 * slightly different, in that we want to tell if the driver was built as
- * part of a PathScale release, or from the driver from OpenIB, kernel.org,
+ * part of a QLogic release, or from the driver from OpenIB, kernel.org,
 * or a standard distribution, for support reasons. The high bit is 0 for
- * non-PathScale, and 1 for PathScale-built/supplied.
+ * non-QLogic, and 1 for QLogic-built/supplied.
 *
 * It's returned by the driver to the user code during initialization in the
 * spi_sw_version field of ipath_base_info, so the user code can in turn
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_cq.c
--- a/drivers/infiniband/hw/ipath/ipath_cq.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_cq.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_debug.h
--- a/drivers/infiniband/hw/ipath/ipath_debug.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_debug.h	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_diag.c
--- a/drivers/infiniband/hw/ipath/ipath_diag.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_diag.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_driver.c
--- a/drivers/infiniband/hw/ipath/ipath_driver.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
@@ -52,7 +53,7 @@ const char *ipath_get_unit_name(int unit
 EXPORT_SYMBOL_GPL(ipath_get_unit_name);

-#define DRIVER_LOAD_MSG "PathScale " IPATH_DRV_NAME " loaded: "
+#define DRIVER_LOAD_MSG "QLogic " IPATH_DRV_NAME " loaded: "
 #define PFX IPATH_DRV_NAME ": "

 /*
@@ -74,8 +75,8 @@ EXPORT_SYMBOL_GPL(ipath_debug);
 EXPORT_SYMBOL_GPL(ipath_debug);

 MODULE_LICENSE("GPL");
-MODULE_AUTHOR("PathScale ");
-MODULE_DESCRIPTION("Pathscale InfiniPath driver");
+MODULE_AUTHOR("QLogic ");
+MODULE_DESCRIPTION("QLogic InfiniPath driver");

 const char *ipath_ibcstatus_str[] = {
 	"Disabled",
@@ -452,7 +453,7 @@ static int __devinit ipath_init_one(stru
 		ipath_init_pe800_funcs(dd);
 		break;
 	default:
-		ipath_dev_err(dd, "Found unknown PathScale deviceid 0x%x, "
+		ipath_dev_err(dd, "Found unknown QLogic deviceid 0x%x, "
 			      "failing\n", ent->device);
 		return -ENODEV;
 	}
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_eeprom.c
--- a/drivers/infiniband/hw/ipath/ipath_eeprom.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_eeprom.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_file_ops.c
--- a/drivers/infiniband/hw/ipath/ipath_file_ops.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_fs.c
--- a/drivers/infiniband/hw/ipath/ipath_fs.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_fs.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_ht400.c
--- a/drivers/infiniband/hw/ipath/ipath_ht400.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_ht400.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_init_chip.c
--- a/drivers/infiniband/hw/ipath/ipath_init_chip.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_intr.c
--- a/drivers/infiniband/hw/ipath/ipath_intr.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_intr.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_kernel.h
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h	Thu Jun 29 14:33:25 2006 -0700
@@ -1,6 +1,7 @@
 #ifndef _IPATH_KERNEL_H
 #define _IPATH_KERNEL_H
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_keys.c
--- a/drivers/infiniband/hw/ipath/ipath_keys.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_keys.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_layer.c
--- a/drivers/infiniband/hw/ipath/ipath_layer.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_layer.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_layer.h
--- a/drivers/infiniband/hw/ipath/ipath_layer.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_layer.h	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_mad.c
--- a/drivers/infiniband/hw/ipath/ipath_mad.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_mad.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_mr.c
--- a/drivers/infiniband/hw/ipath/ipath_mr.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_mr.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_pe800.c
--- a/drivers/infiniband/hw/ipath/ipath_pe800.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_pe800.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
@@ -44,7 +45,7 @@
 /*
 * This file contains all the chip-specific register information and
- * access functions for the PathScale PE800, the PCI-Express chip.
+ * access functions for the QLogic InfiniPath PE800, the PCI-Express chip.
 *
 * This lists the InfiniPath PE800 registers, in the actual chip layout.
 * This structure should never be directly accessed.
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_qp.c
--- a/drivers/infiniband/hw/ipath/ipath_qp.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_rc.c
--- a/drivers/infiniband/hw/ipath/ipath_rc.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_rc.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_registers.h
--- a/drivers/infiniband/hw/ipath/ipath_registers.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_registers.h	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_ruc.c
--- a/drivers/infiniband/hw/ipath/ipath_ruc.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_ruc.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_srq.c
--- a/drivers/infiniband/hw/ipath/ipath_srq.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_srq.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_stats.c
--- a/drivers/infiniband/hw/ipath/ipath_stats.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_stats.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_sysfs.c
--- a/drivers/infiniband/hw/ipath/ipath_sysfs.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_uc.c
--- a/drivers/infiniband/hw/ipath/ipath_uc.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_uc.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_ud.c
--- a/drivers/infiniband/hw/ipath/ipath_ud.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_ud.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_user_pages.c
--- a/drivers/infiniband/hw/ipath/ipath_user_pages.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_user_pages.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_verbs.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
@@ -56,8 +57,8 @@ MODULE_PARM_DESC(debug, "Verbs debug mas
 MODULE_PARM_DESC(debug, "Verbs debug mask");

 MODULE_LICENSE("GPL");
-MODULE_AUTHOR("PathScale ");
-MODULE_DESCRIPTION("Pathscale InfiniPath driver");
+MODULE_AUTHOR("QLogic ");
+MODULE_DESCRIPTION("QLogic InfiniPath driver");

 const int ib_ipath_state_ops[IB_QPS_ERR + 1] = {
 	[IB_QPS_RESET] = 0,
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_verbs.h
--- a/drivers/infiniband/hw/ipath/ipath_verbs.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.h	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_verbs_mcast.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ipath_wc_x86_64.c
--- a/drivers/infiniband/hw/ipath/ipath_wc_x86_64.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_wc_x86_64.c	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/ips_common.h
--- a/drivers/infiniband/hw/ipath/ips_common.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ips_common.h	Thu Jun 29 14:33:25 2006 -0700
@@ -1,6 +1,7 @@
 #ifndef IPS_COMMON_H
 #define IPS_COMMON_H
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
diff -r addf90abc724 -r f7c82500b9c7 drivers/infiniband/hw/ipath/verbs_debug.h
--- a/drivers/infiniband/hw/ipath/verbs_debug.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/verbs_debug.h	Thu Jun 29 14:33:25 2006 -0700
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
 * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two

From bos at pathscale.com  Thu Jun 29 14:41:00 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:41:00 -0700
Subject: [openib-general] [PATCH 9 of 39] IB/ipath - don't allow resources to be created with illegal values
In-Reply-To: 
Message-ID: 

Signed-off-by: Robert Walsh 
Signed-off-by: Bryan O'Sullivan 

diff -r 081142011371 -r ac81d2563bba drivers/infiniband/hw/ipath/ipath_mr.c
--- a/drivers/infiniband/hw/ipath/ipath_mr.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_mr.c	Thu Jun 29 14:33:25 2006 -0700
@@ -169,6 +169,11 @@ struct ib_mr *ipath_reg_user_mr(struct i
 	struct ib_umem_chunk *chunk;
 	int n, m, i;
 	struct ib_mr *ret;
+
+	if (region->length == 0) {
+		ret = ERR_PTR(-EINVAL);
+		goto bail;
+	}

 	n = 0;
 	list_for_each_entry(chunk, &region->chunk_list, list)
diff -r 081142011371 -r ac81d2563bba drivers/infiniband/hw/ipath/ipath_qp.c
--- a/drivers/infiniband/hw/ipath/ipath_qp.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c	Thu Jun 29 14:33:25 2006 -0700
@@ -667,6 +667,14 @@ struct ib_qp *ipath_create_qp(struct ib_
 		goto bail;
 	}
+
+	if (init_attr->cap.max_send_sge +
+	    init_attr->cap.max_recv_sge +
+	    init_attr->cap.max_send_wr +
+	    init_attr->cap.max_recv_wr == 0) {
+		ret = ERR_PTR(-EINVAL);
+		goto bail;
+	}

 	switch (init_attr->qp_type) {
 	case IB_QPT_UC:
 	case IB_QPT_RC:
diff -r 081142011371 -r ac81d2563bba drivers/infiniband/hw/ipath/ipath_verbs.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c	Thu Jun 29 14:33:25 2006 -0700
@@ -788,6 +788,17 @@ static struct ib_ah *ipath_create_ah(str
 	if (ah_attr->dlid >= IPS_MULTICAST_LID_BASE &&
 	    ah_attr->dlid != IPS_PERMISSIVE_LID &&
 	    !(ah_attr->ah_flags & IB_AH_GRH)) {
+		ret = ERR_PTR(-EINVAL);
+		goto bail;
+	}
+
+	if (ah_attr->dlid == 0) {
+		ret = ERR_PTR(-EINVAL);
+		goto bail;
+	}
+
+	if (ah_attr->port_num != 1 ||
+	    ah_attr->port_num > pd->device->phys_port_cnt) {
 		ret = ERR_PTR(-EINVAL);
 		goto bail;
 	}

From bos at pathscale.com  Thu Jun 29 14:40:59 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:40:59 -0700
Subject: [openib-general] [PATCH 8 of 39] IB/ipath - remove some duplicate code
In-Reply-To: 
Message-ID: <08114201137114764a83.1151617259@eng-12.pathscale.com>

Signed-off-by: Robert Walsh 
Signed-off-by: Bryan O'Sullivan 

diff -r 8f08597cacd2 -r 081142011371 drivers/infiniband/hw/ipath/ipath_qp.c
--- a/drivers/infiniband/hw/ipath/ipath_qp.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c	Thu Jun 29 14:33:25 2006 -0700
@@ -511,9 +511,6 @@ int ipath_modify_qp(struct ib_qp *ibqp,
 	if (attr_mask & IB_QP_QKEY)
 		qp->qkey = attr->qkey;

-	if (attr_mask & IB_QP_PKEY_INDEX)
-		qp->s_pkey_index = attr->pkey_index;
-
 	qp->state = new_state;
 	spin_unlock(&qp->s_lock);
 	spin_unlock_irqrestore(&qp->r_rq.lock, flags);

From bos at pathscale.com  Thu Jun 29 14:41:01 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:41:01 -0700
Subject: [openib-general] [PATCH 10 of 39] IB/ipath - fix some memory leaks on failure paths
In-Reply-To: 
Message-ID: <160e5cf91761a2daf6db.1151617261@eng-12.pathscale.com>

Signed-off-by: Robert Walsh 
Signed-off-by: Bryan O'Sullivan 

diff -r ac81d2563bba -r 160e5cf91761 drivers/infiniband/hw/ipath/ipath_init_chip.c
--- a/drivers/infiniband/hw/ipath/ipath_init_chip.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c	Thu Jun 29 14:33:25 2006 -0700
@@ -115,6 +115,7 @@ static int create_port0_egr(struct ipath
 			  "eager TID %u\n", e);
 		while (e != 0)
 			dev_kfree_skb(skbs[--e]);
+		vfree(skbs);
 		ret = -ENOMEM;
 		goto bail;
 	}
diff -r ac81d2563bba -r 160e5cf91761 drivers/infiniband/hw/ipath/ipath_qp.c
--- a/drivers/infiniband/hw/ipath/ipath_qp.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c	Thu Jun 29 14:33:25 2006 -0700
@@ -692,6 +692,7 @@ struct ib_qp *ipath_create_qp(struct ib_
 	case IB_QPT_GSI:
 		qp = kmalloc(sizeof(*qp), GFP_KERNEL);
 		if (!qp) {
+			vfree(swq);
 			ret = ERR_PTR(-ENOMEM);
 			goto bail;
 		}
@@ -702,6 +703,7 @@ struct ib_qp *ipath_create_qp(struct ib_
 		qp->r_rq.wq = vmalloc(qp->r_rq.size * sz);
 		if (!qp->r_rq.wq) {
 			kfree(qp);
+			vfree(swq);
 			ret = ERR_PTR(-ENOMEM);
 			goto bail;
 		}

From bos at pathscale.com  Thu Jun 29 14:41:05 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:41:05 -0700
Subject: [openib-general] [PATCH 14 of 39] IB/ipath - removed unused field ipath_kregvirt from struct ipath_devdata
In-Reply-To: 
Message-ID: 

Signed-off-by: Dave Olson 
Signed-off-by: Bryan O'Sullivan 

diff -r a94e9f9c9c23 -r e43b4df874a9 drivers/infiniband/hw/ipath/ipath_driver.c
--- a/drivers/infiniband/hw/ipath/ipath_driver.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c	Thu Jun 29 14:33:25 2006 -0700
@@ -496,10 +496,8 @@ static int __devinit ipath_init_one(stru
 		((void __iomem *)dd->ipath_kregbase + len);
 	dd->ipath_physaddr = addr; /* used for io_remap, etc. */
 	/* for user mmap */
-	dd->ipath_kregvirt = (u64 __iomem *) phys_to_virt(addr);
-	ipath_cdbg(VERBOSE, "mapped io addr %llx to kregbase %p "
-		   "kregvirt %p\n", addr, dd->ipath_kregbase,
-		   dd->ipath_kregvirt);
+	ipath_cdbg(VERBOSE, "mapped io addr %llx to kregbase %p\n",
+		   addr, dd->ipath_kregbase);

 	/*
 	 * clear ipath_flags here instead of in ipath_init_chip as it is set
@@ -1809,7 +1807,6 @@ static void cleanup_device(struct ipath_
 	 * re-init
 	 */
 	dd->ipath_kregbase = NULL;
-	dd->ipath_kregvirt = NULL;
 	dd->ipath_uregbase = 0;
 	dd->ipath_sregbase = 0;
 	dd->ipath_cregbase = 0;
diff -r a94e9f9c9c23 -r e43b4df874a9 drivers/infiniband/hw/ipath/ipath_kernel.h
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h	Thu Jun 29 14:33:25 2006 -0700
@@ -158,11 +158,6 @@ struct ipath_devdata {
 	unsigned long ipath_physaddr;
 	/* base of memory alloced for ipath_kregbase, for free */
 	u64 *ipath_kregalloc;
-	/*
-	 * version of kregbase that doesn't have high bits set (for 32 bit
-	 * programs, so mmap64 44 bit works)
-	 */
-	u64 __iomem *ipath_kregvirt;
 	/*
 	 * virtual address where port0 rcvhdrqtail updated for this unit.
 	 * only written to by the chip, not the driver.

From bos at pathscale.com  Thu Jun 29 14:40:58 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:40:58 -0700
Subject: [openib-general] [PATCH 7 of 39] IB/ipath - update some comments and fix typos
In-Reply-To: 
Message-ID: <8f08597cacd2a9dcea28.1151617258@eng-12.pathscale.com>

Signed-off-by: Robert Walsh 
Signed-off-by: Bryan O'Sullivan 

diff -r 600ceb6aeb8c -r 8f08597cacd2 drivers/infiniband/hw/ipath/ipath_kernel.h
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h	Thu Jun 29 14:33:25 2006 -0700
@@ -723,13 +723,8 @@ u64 ipath_read_kreg64_port(const struct
 * @port: port number
 *
 * Return the contents of a register that is virtualized to be per port.
- * Prints a debug message and returns -1 on errors (not distinguishable from
- * valid contents at runtime; we may add a separate error variable at some
- * point).
- *
- * This is normally not used by the kernel, but may be for debugging, and
- * has a different implementation than user mode, which is why it's not in
- * _common.h.
+ * Returns -1 on errors (not distinguishable from valid contents at
+ * runtime; we may add a separate error variable at some point).
 */
 static inline u32 ipath_read_ureg32(const struct ipath_devdata *dd,
 				    ipath_ureg regno, int port)
diff -r 600ceb6aeb8c -r 8f08597cacd2 drivers/infiniband/hw/ipath/ipath_layer.c
--- a/drivers/infiniband/hw/ipath/ipath_layer.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_layer.c	Thu Jun 29 14:33:25 2006 -0700
@@ -885,7 +885,7 @@ static void copy_io(u32 __iomem *piobuf,
 /**
 * ipath_verbs_send - send a packet from the verbs layer
 * @dd: the infinipath device
- * @hdrwords: the number of works in the header
+ * @hdrwords: the number of words in the header
 * @hdr: the packet header
 * @len: the length of the packet in bytes
 * @ss: the SGE to send

From bos at pathscale.com  Thu Jun 29 14:41:02 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:41:02 -0700
Subject: [openib-general] [PATCH 11 of 39] IB/ipath - return an error for unknown multicast GID
In-Reply-To: 
Message-ID: <1e1f3da0e78d32f2a733.1151617262@eng-12.pathscale.com>

Signed-off-by: Robert Walsh 
Signed-off-by: Bryan O'Sullivan 

diff -r 160e5cf91761 -r 1e1f3da0e78d drivers/infiniband/hw/ipath/ipath_verbs_mcast.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c	Thu Jun 29 14:33:25 2006 -0700
@@ -273,7 +273,7 @@ int ipath_multicast_detach(struct ib_qp
 	while (1) {
 		if (n == NULL) {
 			spin_unlock_irqrestore(&mcast_lock, flags);
-			ret = 0;
+			ret = -EINVAL;
 			goto bail;
 		}

From bos at pathscale.com  Thu Jun 29 14:41:03 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:41:03 -0700
Subject: [openib-general] [PATCH 12 of 39] IB/ipath - report correct device identification information in /sys
In-Reply-To: 
Message-ID: <21d5d64750acfd45f537.1151617263@eng-12.pathscale.com>

Signed-off-by: Robert Walsh 
Signed-off-by: Bryan O'Sullivan 

diff -r 1e1f3da0e78d -r 21d5d64750ac drivers/infiniband/hw/ipath/ipath_layer.c
--- a/drivers/infiniband/hw/ipath/ipath_layer.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_layer.c	Thu Jun 29 14:33:25 2006 -0700
@@ -341,18 +341,26 @@ u32 ipath_layer_get_nguid(struct ipath_d
 EXPORT_SYMBOL_GPL(ipath_layer_get_nguid);

-int ipath_layer_query_device(struct ipath_devdata *dd, u32 * vendor,
-			     u32 * boardrev, u32 * majrev, u32 * minrev)
-{
-	*vendor = dd->ipath_vendorid;
-	*boardrev = dd->ipath_boardrev;
-	*majrev = dd->ipath_majrev;
-	*minrev = dd->ipath_minrev;
-
-	return 0;
-}
-
-EXPORT_SYMBOL_GPL(ipath_layer_query_device);
+u32 ipath_layer_get_majrev(struct ipath_devdata *dd)
+{
+	return dd->ipath_majrev;
+}
+
+EXPORT_SYMBOL_GPL(ipath_layer_get_majrev);
+
+u32 ipath_layer_get_minrev(struct ipath_devdata *dd)
+{
+	return dd->ipath_minrev;
+}
+
+EXPORT_SYMBOL_GPL(ipath_layer_get_minrev);
+
+u32 ipath_layer_get_pcirev(struct ipath_devdata *dd)
+{
+	return dd->ipath_pcirev;
+}
+
+EXPORT_SYMBOL_GPL(ipath_layer_get_pcirev);

 u32 ipath_layer_get_flags(struct ipath_devdata *dd)
 {
@@ -374,6 +382,13 @@ u16 ipath_layer_get_deviceid(struct ipat
 }

 EXPORT_SYMBOL_GPL(ipath_layer_get_deviceid);
+
+u32 ipath_layer_get_vendorid(struct ipath_devdata *dd)
+{
+	return dd->ipath_vendorid;
+}
+
+EXPORT_SYMBOL_GPL(ipath_layer_get_vendorid);

 u64 ipath_layer_get_lastibcstat(struct ipath_devdata *dd)
 {
diff -r 1e1f3da0e78d -r 21d5d64750ac drivers/infiniband/hw/ipath/ipath_layer.h
--- a/drivers/infiniband/hw/ipath/ipath_layer.h	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_layer.h	Thu Jun 29 14:33:25 2006 -0700
@@ -144,11 +144,13 @@ int ipath_layer_set_guid(struct ipath_de
 int ipath_layer_set_guid(struct ipath_devdata *, __be64 guid);
 __be64 ipath_layer_get_guid(struct ipath_devdata *);
 u32 ipath_layer_get_nguid(struct ipath_devdata *);
-int ipath_layer_query_device(struct ipath_devdata *, u32 * vendor,
-			     u32 * boardrev, u32 * majrev, u32 * minrev);
+u32 ipath_layer_get_majrev(struct ipath_devdata *);
+u32 ipath_layer_get_minrev(struct ipath_devdata *);
+u32 ipath_layer_get_pcirev(struct ipath_devdata *);
 u32 ipath_layer_get_flags(struct ipath_devdata *dd);
 struct device *ipath_layer_get_device(struct ipath_devdata *dd);
 u16 ipath_layer_get_deviceid(struct ipath_devdata *dd);
+u32 ipath_layer_get_vendorid(struct ipath_devdata *);
 u64 ipath_layer_get_lastibcstat(struct ipath_devdata *dd);
 u32 ipath_layer_get_ibmtu(struct ipath_devdata *dd);
 int ipath_layer_enable_timer(struct ipath_devdata *dd);
diff -r 1e1f3da0e78d -r 21d5d64750ac drivers/infiniband/hw/ipath/ipath_mad.c
--- a/drivers/infiniband/hw/ipath/ipath_mad.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_mad.c	Thu Jun 29 14:33:25 2006 -0700
@@ -85,7 +85,7 @@ static int recv_subn_get_nodeinfo(struct
 {
 	struct nodeinfo *nip = (struct nodeinfo *)&smp->data;
 	struct ipath_devdata *dd = to_idev(ibdev)->dd;
-	u32 vendor, boardid, majrev, minrev;
+	u32 vendor, majrev, minrev;

 	if (smp->attr_mod)
 		smp->status |= IB_SMP_INVALID_FIELD;
@@ -105,9 +105,11 @@ static int recv_subn_get_nodeinfo(struct
 	nip->port_guid = nip->sys_guid;
 	nip->partition_cap = cpu_to_be16(ipath_layer_get_npkeys(dd));
 	nip->device_id = cpu_to_be16(ipath_layer_get_deviceid(dd));
-	ipath_layer_query_device(dd, &vendor, &boardid, &majrev, &minrev);
+	majrev = ipath_layer_get_majrev(dd);
+	minrev = ipath_layer_get_minrev(dd);
 	nip->revision = cpu_to_be32((majrev << 16) | minrev);
 	nip->local_port_num = port;
+	vendor = ipath_layer_get_vendorid(dd);
 	nip->vendor_id[0] = 0;
 	nip->vendor_id[1] = vendor >> 8;
 	nip->vendor_id[2] = vendor;
diff -r 1e1f3da0e78d -r 21d5d64750ac drivers/infiniband/hw/ipath/ipath_verbs.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c	Thu Jun 29 14:33:25 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c	Thu Jun 29 14:33:25 2006 -0700
@@ -568,18 +568,15 @@ static int ipath_query_device(struct ib_
 			      struct ib_device_attr *props)
 {
 	struct ipath_ibdev *dev = to_idev(ibdev);
-	u32 vendor, boardrev, majrev, minrev;

 	memset(props, 0, sizeof(*props));
props->device_cap_flags = IB_DEVICE_BAD_PKEY_CNTR | IB_DEVICE_BAD_QKEY_CNTR | IB_DEVICE_SHUTDOWN_PORT | IB_DEVICE_SYS_IMAGE_GUID; - ipath_layer_query_device(dev->dd, &vendor, &boardrev, - &majrev, &minrev); - props->vendor_id = vendor; - props->vendor_part_id = boardrev; - props->hw_ver = boardrev << 16 | majrev << 8 | minrev; + props->vendor_id = ipath_layer_get_vendorid(dev->dd); + props->vendor_part_id = ipath_layer_get_deviceid(dev->dd); + props->hw_ver = ipath_layer_get_pcirev(dev->dd); props->sys_image_guid = dev->sys_image_guid; @@ -1121,11 +1118,8 @@ static ssize_t show_rev(struct class_dev { struct ipath_ibdev *dev = container_of(cdev, struct ipath_ibdev, ibdev.class_dev); - int vendor, boardrev, majrev, minrev; - - ipath_layer_query_device(dev->dd, &vendor, &boardrev, - &majrev, &minrev); - return sprintf(buf, "%d.%d\n", majrev, minrev); + + return sprintf(buf, "%x\n", ipath_layer_get_pcirev(dev->dd)); } static ssize_t show_hca(struct class_device *cdev, char *buf) From bos at pathscale.com Thu Jun 29 14:41:04 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:04 -0700 Subject: [openib-general] [PATCH 13 of 39] IB/ipath - enforce device resource limits In-Reply-To: Message-ID: These limits are somewhat artificial in that we don't actually have any device limits. However, the verbs layer expects that such limits exist and are enforced, so we make up arbitrary (but sensible) limits. 
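The limit-enforcement pattern this patch introduces — a per-device counter checked against a configurable ceiling, with allocation failing as -ENOMEM once the advertised maximum is reached — can be sketched in userspace C. The struct and function names below are illustrative stand-ins, not the driver's actual types:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Userspace sketch of the pattern added by this patch: count live
 * objects per device and refuse new allocations at a configurable
 * maximum, the way ipath_alloc_pd() checks n_pds_allocated against
 * ib_ipath_max_pds.  All names here are illustrative. */

static unsigned int max_pds = 4;        /* stands in for the module param */

struct fake_dev {
        unsigned int n_pds_allocated;
};

struct fake_pd {
        struct fake_dev *dev;
};

static struct fake_pd *alloc_pd(struct fake_dev *dev)
{
        /* Fail once the advertised maximum is reached, even though
         * nothing "physical" actually runs out. */
        if (dev->n_pds_allocated == max_pds) {
                errno = ENOMEM;
                return NULL;
        }
        struct fake_pd *pd = malloc(sizeof(*pd));
        if (!pd)
                return NULL;
        dev->n_pds_allocated++;
        pd->dev = dev;
        return pd;
}

static void dealloc_pd(struct fake_pd *pd)
{
        pd->dev->n_pds_allocated--;     /* mirror the dealloc path */
        free(pd);
}
```

The same counter/ceiling shape recurs throughout the patch for AHs, CQs, SRQs, and multicast groups; only the counter field and module parameter differ.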
Signed-off-by: Robert Walsh Signed-off-by: Bryan O'Sullivan diff -r 21d5d64750ac -r a94e9f9c9c23 drivers/infiniband/hw/ipath/ipath_cq.c --- a/drivers/infiniband/hw/ipath/ipath_cq.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_cq.c Thu Jun 29 14:33:25 2006 -0700 @@ -158,9 +158,20 @@ struct ib_cq *ipath_create_cq(struct ib_ struct ib_ucontext *context, struct ib_udata *udata) { + struct ipath_ibdev *dev = to_idev(ibdev); struct ipath_cq *cq; struct ib_wc *wc; struct ib_cq *ret; + + if (entries > ib_ipath_max_cqes) { + ret = ERR_PTR(-EINVAL); + goto bail; + } + + if (dev->n_cqs_allocated == ib_ipath_max_cqs) { + ret = ERR_PTR(-ENOMEM); + goto bail; + } /* * Need to use vmalloc() if we want to support large #s of @@ -197,6 +208,8 @@ struct ib_cq *ipath_create_cq(struct ib_ ret = &cq->ibcq; + dev->n_cqs_allocated++; + bail: return ret; } @@ -211,9 +224,11 @@ bail: */ int ipath_destroy_cq(struct ib_cq *ibcq) { + struct ipath_ibdev *dev = to_idev(ibcq->device); struct ipath_cq *cq = to_icq(ibcq); tasklet_kill(&cq->comptask); + dev->n_cqs_allocated--; vfree(cq->queue); kfree(cq); diff -r 21d5d64750ac -r a94e9f9c9c23 drivers/infiniband/hw/ipath/ipath_qp.c --- a/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:25 2006 -0700 @@ -661,8 +661,10 @@ struct ib_qp *ipath_create_qp(struct ib_ size_t sz; struct ib_qp *ret; - if (init_attr->cap.max_send_sge > 255 || - init_attr->cap.max_recv_sge > 255) { + if (init_attr->cap.max_send_sge > ib_ipath_max_sges || + init_attr->cap.max_recv_sge > ib_ipath_max_sges || + init_attr->cap.max_send_wr > ib_ipath_max_qp_wrs || + init_attr->cap.max_recv_wr > ib_ipath_max_qp_wrs) { ret = ERR_PTR(-ENOMEM); goto bail; } diff -r 21d5d64750ac -r a94e9f9c9c23 drivers/infiniband/hw/ipath/ipath_srq.c --- a/drivers/infiniband/hw/ipath/ipath_srq.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_srq.c Thu Jun 29 14:33:25 2006 
-0700 @@ -126,11 +126,23 @@ struct ib_srq *ipath_create_srq(struct i struct ib_srq_init_attr *srq_init_attr, struct ib_udata *udata) { + struct ipath_ibdev *dev = to_idev(ibpd->device); struct ipath_srq *srq; u32 sz; struct ib_srq *ret; - if (srq_init_attr->attr.max_sge < 1) { + if (dev->n_srqs_allocated == ib_ipath_max_srqs) { + ret = ERR_PTR(-ENOMEM); + goto bail; + } + + if (srq_init_attr->attr.max_wr == 0) { + ret = ERR_PTR(-EINVAL); + goto bail; + } + + if ((srq_init_attr->attr.max_sge > ib_ipath_max_srq_sges) || + (srq_init_attr->attr.max_wr > ib_ipath_max_srq_wrs)) { ret = ERR_PTR(-EINVAL); goto bail; } @@ -165,6 +177,8 @@ struct ib_srq *ipath_create_srq(struct i ret = &srq->ibsrq; + dev->n_srqs_allocated++; + bail: return ret; } @@ -182,24 +196,26 @@ int ipath_modify_srq(struct ib_srq *ibsr unsigned long flags; int ret; - if (attr_mask & IB_SRQ_LIMIT) { - spin_lock_irqsave(&srq->rq.lock, flags); - srq->limit = attr->srq_limit; - spin_unlock_irqrestore(&srq->rq.lock, flags); - } + if (attr_mask & IB_SRQ_MAX_WR) + if ((attr->max_wr > ib_ipath_max_srq_wrs) || + (attr->max_sge > srq->rq.max_sge)) { + ret = -EINVAL; + goto bail; + } + + if (attr_mask & IB_SRQ_LIMIT) + if (attr->srq_limit >= srq->rq.size) { + ret = -EINVAL; + goto bail; + } + if (attr_mask & IB_SRQ_MAX_WR) { - u32 size = attr->max_wr + 1; struct ipath_rwqe *wq, *p; - u32 n; - u32 sz; - - if (attr->max_sge < srq->rq.max_sge) { - ret = -EINVAL; - goto bail; - } + u32 sz, size, n; sz = sizeof(struct ipath_rwqe) + attr->max_sge * sizeof(struct ipath_sge); + size = attr->max_wr + 1; wq = vmalloc(size * sz); if (!wq) { ret = -ENOMEM; @@ -243,6 +259,11 @@ int ipath_modify_srq(struct ib_srq *ibsr spin_unlock_irqrestore(&srq->rq.lock, flags); } + if (attr_mask & IB_SRQ_LIMIT) { + spin_lock_irqsave(&srq->rq.lock, flags); + srq->limit = attr->srq_limit; + spin_unlock_irqrestore(&srq->rq.lock, flags); + } ret = 0; bail: @@ -266,7 +287,9 @@ int ipath_destroy_srq(struct ib_srq *ibs int ipath_destroy_srq(struct 
ib_srq *ibsrq) { struct ipath_srq *srq = to_isrq(ibsrq); - + struct ipath_ibdev *dev = to_idev(ibsrq->device); + + dev->n_srqs_allocated--; vfree(srq->rq.wq); kfree(srq); diff -r 21d5d64750ac -r a94e9f9c9c23 drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:25 2006 -0700 @@ -55,6 +55,59 @@ unsigned int ib_ipath_debug; /* debug ma unsigned int ib_ipath_debug; /* debug mask */ module_param_named(debug, ib_ipath_debug, uint, S_IWUSR | S_IRUGO); MODULE_PARM_DESC(debug, "Verbs debug mask"); + +static unsigned int ib_ipath_max_pds = 0xFFFF; +module_param_named(max_pds, ib_ipath_max_pds, uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_pds, + "Maximum number of protection domains to support"); + +static unsigned int ib_ipath_max_ahs = 0xFFFF; +module_param_named(max_ahs, ib_ipath_max_ahs, uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_ahs, "Maximum number of address handles to support"); + +unsigned int ib_ipath_max_cqes = 0x2FFFF; +module_param_named(max_cqes, ib_ipath_max_cqes, uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_cqes, + "Maximum number of completion queue entries to support"); + +unsigned int ib_ipath_max_cqs = 0x1FFFF; +module_param_named(max_cqs, ib_ipath_max_cqs, uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_cqs, "Maximum number of completion queues to support"); + +unsigned int ib_ipath_max_qp_wrs = 0x3FFF; +module_param_named(max_qp_wrs, ib_ipath_max_qp_wrs, uint, + S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_qp_wrs, "Maximum number of QP WRs to support"); + +unsigned int ib_ipath_max_sges = 0x60; +module_param_named(max_sges, ib_ipath_max_sges, uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_sges, "Maximum number of SGEs to support"); + +unsigned int ib_ipath_max_mcast_grps = 16384; +module_param_named(max_mcast_grps, ib_ipath_max_mcast_grps, uint, + S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_mcast_grps, + "Maximum 
number of multicast groups to support"); + +unsigned int ib_ipath_max_mcast_qp_attached = 16; +module_param_named(max_mcast_qp_attached, ib_ipath_max_mcast_qp_attached, + uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_mcast_qp_attached, + "Maximum number of attached QPs to support"); + +unsigned int ib_ipath_max_srqs = 1024; +module_param_named(max_srqs, ib_ipath_max_srqs, uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_srqs, "Maximum number of SRQs to support"); + +unsigned int ib_ipath_max_srq_sges = 128; +module_param_named(max_srq_sges, ib_ipath_max_srq_sges, + uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_srq_sges, "Maximum number of SRQ SGEs to support"); + +unsigned int ib_ipath_max_srq_wrs = 0x1FFFF; +module_param_named(max_srq_wrs, ib_ipath_max_srq_wrs, + uint, S_IWUSR | S_IRUGO); +MODULE_PARM_DESC(max_srq_wrs, "Maximum number of SRQ WRs support"); MODULE_LICENSE("GPL"); MODULE_AUTHOR("QLogic "); @@ -581,24 +634,25 @@ static int ipath_query_device(struct ib_ props->sys_image_guid = dev->sys_image_guid; props->max_mr_size = ~0ull; - props->max_qp = 0xffff; - props->max_qp_wr = 0xffff; - props->max_sge = 255; - props->max_cq = 0xffff; - props->max_cqe = 0xffff; - props->max_mr = 0xffff; - props->max_pd = 0xffff; + props->max_qp = dev->qp_table.max; + props->max_qp_wr = ib_ipath_max_qp_wrs; + props->max_sge = ib_ipath_max_sges; + props->max_cq = ib_ipath_max_cqs; + props->max_ah = ib_ipath_max_ahs; + props->max_cqe = ib_ipath_max_cqes; + props->max_mr = dev->lk_table.max; + props->max_pd = ib_ipath_max_pds; props->max_qp_rd_atom = 1; props->max_qp_init_rd_atom = 1; /* props->max_res_rd_atom */ - props->max_srq = 0xffff; - props->max_srq_wr = 0xffff; - props->max_srq_sge = 255; + props->max_srq = ib_ipath_max_srqs; + props->max_srq_wr = ib_ipath_max_srq_wrs; + props->max_srq_sge = ib_ipath_max_srq_sges; /* props->local_ca_ack_delay */ props->atomic_cap = IB_ATOMIC_HCA; props->max_pkeys = ipath_layer_get_npkeys(dev->dd); - props->max_mcast_grp = 0xffff; - 
props->max_mcast_qp_attach = 0xffff; + props->max_mcast_grp = ib_ipath_max_mcast_grps; + props->max_mcast_qp_attach = ib_ipath_max_mcast_qp_attached; props->max_total_mcast_qp_attach = props->max_mcast_qp_attach * props->max_mcast_grp; @@ -741,8 +795,21 @@ static struct ib_pd *ipath_alloc_pd(stru struct ib_ucontext *context, struct ib_udata *udata) { + struct ipath_ibdev *dev = to_idev(ibdev); struct ipath_pd *pd; struct ib_pd *ret; + + /* + * This is actually totally arbitrary. Some correctness tests + * assume there's a maximum number of PDs that can be allocated. + * We don't actually have this limit, but we fail the test if + * we allow allocations of more than we report for this value. + */ + + if (dev->n_pds_allocated == ib_ipath_max_pds) { + ret = ERR_PTR(-ENOMEM); + goto bail; + } pd = kmalloc(sizeof *pd, GFP_KERNEL); if (!pd) { @@ -750,6 +817,8 @@ static struct ib_pd *ipath_alloc_pd(stru goto bail; } + dev->n_pds_allocated++; + /* ib_alloc_pd() will initialize pd->ibpd. */ pd->user = udata != NULL; @@ -762,6 +831,9 @@ static int ipath_dealloc_pd(struct ib_pd static int ipath_dealloc_pd(struct ib_pd *ibpd) { struct ipath_pd *pd = to_ipd(ibpd); + struct ipath_ibdev *dev = to_idev(ibpd->device); + + dev->n_pds_allocated--; kfree(pd); @@ -780,6 +852,12 @@ static struct ib_ah *ipath_create_ah(str { struct ipath_ah *ah; struct ib_ah *ret; + struct ipath_ibdev *dev = to_idev(pd->device); + + if (dev->n_ahs_allocated == ib_ipath_max_ahs) { + ret = ERR_PTR(-ENOMEM); + goto bail; + } /* A multicast address requires a GRH (see ch. 8.4.1). */ if (ah_attr->dlid >= IPS_MULTICAST_LID_BASE && @@ -794,7 +872,7 @@ static struct ib_ah *ipath_create_ah(str goto bail; } - if (ah_attr->port_num != 1 || + if (ah_attr->port_num < 1 || ah_attr->port_num > pd->device->phys_port_cnt) { ret = ERR_PTR(-EINVAL); goto bail; @@ -806,6 +884,8 @@ static struct ib_ah *ipath_create_ah(str goto bail; } + dev->n_ahs_allocated++; + /* ib_create_ah() will initialize ah->ibah. 
*/ ah->attr = *ah_attr; @@ -823,7 +903,10 @@ bail: */ static int ipath_destroy_ah(struct ib_ah *ibah) { + struct ipath_ibdev *dev = to_idev(ibah->device); struct ipath_ah *ah = to_iah(ibah); + + dev->n_ahs_allocated--; kfree(ah); diff -r 21d5d64750ac -r a94e9f9c9c23 drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Jun 29 14:33:25 2006 -0700 @@ -149,6 +149,7 @@ struct ipath_mcast { struct list_head qp_list; wait_queue_head_t wait; atomic_t refcount; + int n_attached; }; /* Memory region */ @@ -432,6 +433,11 @@ struct ipath_ibdev { __be64 sys_image_guid; /* in network order */ __be64 gid_prefix; /* in network order */ __be64 mkey; + u32 n_pds_allocated; /* number of PDs allocated for device */ + u32 n_ahs_allocated; /* number of AHs allocated for device */ + u32 n_cqs_allocated; /* number of CQs allocated for device */ + u32 n_srqs_allocated; /* number of SRQs allocated for device */ + u32 n_mcast_grps_allocated; /* number of mcast groups allocated */ u64 ipath_sword; /* total dwords sent (sample result) */ u64 ipath_rword; /* total dwords received (sample result) */ u64 ipath_spkts; /* total packets sent (sample result) */ @@ -697,6 +703,24 @@ extern const int ib_ipath_state_ops[]; extern unsigned int ib_ipath_lkey_table_size; +extern unsigned int ib_ipath_max_cqes; + +extern unsigned int ib_ipath_max_cqs; + +extern unsigned int ib_ipath_max_qp_wrs; + +extern unsigned int ib_ipath_max_sges; + +extern unsigned int ib_ipath_max_mcast_grps; + +extern unsigned int ib_ipath_max_mcast_qp_attached; + +extern unsigned int ib_ipath_max_srqs; + +extern unsigned int ib_ipath_max_srq_sges; + +extern unsigned int ib_ipath_max_srq_wrs; + extern const u32 ib_ipath_rnr_table[]; #endif /* IPATH_VERBS_H */ diff -r 21d5d64750ac -r a94e9f9c9c23 drivers/infiniband/hw/ipath/ipath_verbs_mcast.c --- a/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c Thu Jun 29 14:33:25 
2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs_mcast.c Thu Jun 29 14:33:25 2006 -0700 @@ -93,6 +93,7 @@ static struct ipath_mcast *ipath_mcast_a INIT_LIST_HEAD(&mcast->qp_list); init_waitqueue_head(&mcast->wait); atomic_set(&mcast->refcount, 0); + mcast->n_attached = 0; bail: return mcast; @@ -158,7 +159,8 @@ bail: * the table but the QP was added. Return ESRCH if the QP was already * attached and neither structure was added. */ -static int ipath_mcast_add(struct ipath_mcast *mcast, +static int ipath_mcast_add(struct ipath_ibdev *dev, + struct ipath_mcast *mcast, struct ipath_mcast_qp *mqp) { struct rb_node **n = &mcast_tree.rb_node; @@ -189,16 +191,28 @@ static int ipath_mcast_add(struct ipath_ /* Search the QP list to see if this is already there. */ list_for_each_entry_rcu(p, &tmcast->qp_list, list) { if (p->qp == mqp->qp) { - spin_unlock_irqrestore(&mcast_lock, flags); ret = ESRCH; goto bail; } } + if (tmcast->n_attached == ib_ipath_max_mcast_qp_attached) { + ret = ENOMEM; + goto bail; + } + + tmcast->n_attached++; + list_add_tail_rcu(&mqp->list, &tmcast->qp_list); - spin_unlock_irqrestore(&mcast_lock, flags); ret = EEXIST; goto bail; } + + if (dev->n_mcast_grps_allocated == ib_ipath_max_mcast_grps) { + ret = ENOMEM; + goto bail; + } + + dev->n_mcast_grps_allocated++; list_add_tail_rcu(&mqp->list, &mcast->qp_list); @@ -206,17 +220,18 @@ static int ipath_mcast_add(struct ipath_ rb_link_node(&mcast->rb_node, pn, n); rb_insert_color(&mcast->rb_node, &mcast_tree); + ret = 0; + +bail: spin_unlock_irqrestore(&mcast_lock, flags); - ret = 0; - -bail: return ret; } int ipath_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) { struct ipath_qp *qp = to_iqp(ibqp); + struct ipath_ibdev *dev = to_idev(ibqp->device); struct ipath_mcast *mcast; struct ipath_mcast_qp *mqp; int ret; @@ -236,7 +251,7 @@ int ipath_multicast_attach(struct ib_qp ret = -ENOMEM; goto bail; } - switch (ipath_mcast_add(mcast, mqp)) { + switch (ipath_mcast_add(dev, mcast, mqp)) { 
case ESRCH: /* Neither was used: can't attach the same QP twice. */ ipath_mcast_qp_free(mqp); @@ -246,6 +261,12 @@ int ipath_multicast_attach(struct ib_qp case EEXIST: /* The mcast wasn't used */ ipath_mcast_free(mcast); break; + case ENOMEM: + /* Exceeded the maximum number of mcast groups. */ + ipath_mcast_qp_free(mqp); + ipath_mcast_free(mcast); + ret = -ENOMEM; + goto bail; default: break; } @@ -259,6 +280,7 @@ int ipath_multicast_detach(struct ib_qp int ipath_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) { struct ipath_qp *qp = to_iqp(ibqp); + struct ipath_ibdev *dev = to_idev(ibqp->device); struct ipath_mcast *mcast = NULL; struct ipath_mcast_qp *p, *tmp; struct rb_node *n; @@ -297,6 +319,7 @@ int ipath_multicast_detach(struct ib_qp * link until we are sure there are no list walkers. */ list_del_rcu(&p->list); + mcast->n_attached--; /* If this was the last attached QP, remove the GID too. */ if (list_empty(&mcast->qp_list)) { @@ -320,6 +343,7 @@ int ipath_multicast_detach(struct ib_qp atomic_dec(&mcast->refcount); wait_event(mcast->wait, !atomic_read(&mcast->refcount)); ipath_mcast_free(mcast); + dev->n_mcast_grps_allocated--; } ret = 0; From bos at pathscale.com Thu Jun 29 14:40:56 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:40:56 -0700 Subject: [openib-general] [PATCH 5 of 39] IB/ipath - fix shared receive queues for RC In-Reply-To: Message-ID: Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r c93c2b42d279 -r e4f29a4e0c0f drivers/infiniband/hw/ipath/ipath_rc.c --- a/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:25 2006 -0700 @@ -257,7 +257,7 @@ int ipath_make_rc_req(struct ipath_qp *q break; case IB_WR_RDMA_WRITE: - if (newreq) + if (newreq && qp->s_lsn != (u32) -1) qp->s_lsn++; /* FALLTHROUGH */ case IB_WR_RDMA_WRITE_WITH_IMM: @@ -283,8 +283,7 @@ int ipath_make_rc_req(struct ipath_qp *q else { 
qp->s_state = OP(RDMA_WRITE_ONLY_WITH_IMMEDIATE); - /* Immediate data comes - * after RETH */ + /* Immediate data comes after RETH */ ohdr->u.rc.imm_data = wqe->wr.imm_data; hwords += 1; if (wqe->wr.send_flags & IB_SEND_SOLICITED) @@ -304,7 +303,8 @@ int ipath_make_rc_req(struct ipath_qp *q qp->s_state = OP(RDMA_READ_REQUEST); hwords += sizeof(ohdr->u.rc.reth) / 4; if (newreq) { - qp->s_lsn++; + if (qp->s_lsn != (u32) -1) + qp->s_lsn++; /* * Adjust s_next_psn to count the * expected number of responses. @@ -335,7 +335,8 @@ int ipath_make_rc_req(struct ipath_qp *q wqe->wr.wr.atomic.compare_add); hwords += sizeof(struct ib_atomic_eth) / 4; if (newreq) { - qp->s_lsn++; + if (qp->s_lsn != (u32) -1) + qp->s_lsn++; wqe->lpsn = wqe->psn; } if (++qp->s_cur == qp->s_size) @@ -553,6 +554,88 @@ static void send_rc_ack(struct ipath_qp } /** + * reset_psn - reset the QP state to send starting from PSN + * @qp: the QP + * @psn: the packet sequence number to restart at + * + * This is called from ipath_rc_rcv() to process an incoming RC ACK + * for the given QP. + * Called at interrupt level with the QP s_lock held. + */ +static void reset_psn(struct ipath_qp *qp, u32 psn) +{ + u32 n = qp->s_last; + struct ipath_swqe *wqe = get_swqe_ptr(qp, n); + u32 opcode; + + qp->s_cur = n; + + /* + * If we are starting the request from the beginning, + * let the normal send code handle initialization. + */ + if (ipath_cmp24(psn, wqe->psn) <= 0) { + qp->s_state = OP(SEND_LAST); + goto done; + } + + /* Find the work request opcode corresponding to the given PSN. */ + opcode = wqe->wr.opcode; + for (;;) { + int diff; + + if (++n == qp->s_size) + n = 0; + if (n == qp->s_tail) + break; + wqe = get_swqe_ptr(qp, n); + diff = ipath_cmp24(psn, wqe->psn); + if (diff < 0) + break; + qp->s_cur = n; + /* + * If we are starting the request from the beginning, + * let the normal send code handle initialization. 
+ */ + if (diff == 0) { + qp->s_state = OP(SEND_LAST); + goto done; + } + opcode = wqe->wr.opcode; + } + + /* + * Set the state to restart in the middle of a request. + * Don't change the s_sge, s_cur_sge, or s_cur_size. + * See ipath_do_rc_send(). + */ + switch (opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + qp->s_state = OP(RDMA_READ_RESPONSE_FIRST); + break; + + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + qp->s_state = OP(RDMA_READ_RESPONSE_LAST); + break; + + case IB_WR_RDMA_READ: + qp->s_state = OP(RDMA_READ_RESPONSE_MIDDLE); + break; + + default: + /* + * This case shouldn't happen since its only + * one PSN per req. + */ + qp->s_state = OP(SEND_LAST); + } +done: + qp->s_psn = psn; +} + +/** * ipath_restart_rc - back up requester to resend the last un-ACKed request * @qp: the QP to restart * @psn: packet sequence number for the request @@ -564,7 +647,6 @@ void ipath_restart_rc(struct ipath_qp *q { struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); struct ipath_ibdev *dev; - u32 n; /* * If there are no requests pending, we are done. @@ -606,130 +688,13 @@ void ipath_restart_rc(struct ipath_qp *q else dev->n_rc_resends += (int)qp->s_psn - (int)psn; - /* - * If we are starting the request from the beginning, let the normal - * send code handle initialization. - */ - qp->s_cur = qp->s_last; - if (ipath_cmp24(psn, wqe->psn) <= 0) { - qp->s_state = OP(SEND_LAST); - qp->s_psn = wqe->psn; - } else { - n = qp->s_cur; - for (;;) { - if (++n == qp->s_size) - n = 0; - if (n == qp->s_tail) { - if (ipath_cmp24(psn, qp->s_next_psn) >= 0) { - qp->s_cur = n; - wqe = get_swqe_ptr(qp, n); - } - break; - } - wqe = get_swqe_ptr(qp, n); - if (ipath_cmp24(psn, wqe->psn) < 0) - break; - qp->s_cur = n; - } - qp->s_psn = psn; - - /* - * Reset the state to restart in the middle of a request. - * Don't change the s_sge, s_cur_sge, or s_cur_size. - * See ipath_do_rc_send(). 
- */ - switch (wqe->wr.opcode) { - case IB_WR_SEND: - case IB_WR_SEND_WITH_IMM: - qp->s_state = OP(RDMA_READ_RESPONSE_FIRST); - break; - - case IB_WR_RDMA_WRITE: - case IB_WR_RDMA_WRITE_WITH_IMM: - qp->s_state = OP(RDMA_READ_RESPONSE_LAST); - break; - - case IB_WR_RDMA_READ: - qp->s_state = - OP(RDMA_READ_RESPONSE_MIDDLE); - break; - - default: - /* - * This case shouldn't happen since its only - * one PSN per req. - */ - qp->s_state = OP(SEND_LAST); - } - } + reset_psn(qp, psn); done: tasklet_hi_schedule(&qp->s_task); bail: return; -} - -/** - * reset_psn - reset the QP state to send starting from PSN - * @qp: the QP - * @psn: the packet sequence number to restart at - * - * This is called from ipath_rc_rcv_resp() to process an incoming RC ACK - * for the given QP. - * Called at interrupt level with the QP s_lock held. - */ -static void reset_psn(struct ipath_qp *qp, u32 psn) -{ - struct ipath_swqe *wqe; - u32 n; - - n = qp->s_cur; - wqe = get_swqe_ptr(qp, n); - for (;;) { - if (++n == qp->s_size) - n = 0; - if (n == qp->s_tail) { - if (ipath_cmp24(psn, qp->s_next_psn) >= 0) { - qp->s_cur = n; - wqe = get_swqe_ptr(qp, n); - } - break; - } - wqe = get_swqe_ptr(qp, n); - if (ipath_cmp24(psn, wqe->psn) < 0) - break; - qp->s_cur = n; - } - qp->s_psn = psn; - - /* - * Set the state to restart in the middle of a - * request. Don't change the s_sge, s_cur_sge, or - * s_cur_size. See ipath_do_rc_send(). - */ - switch (wqe->wr.opcode) { - case IB_WR_SEND: - case IB_WR_SEND_WITH_IMM: - qp->s_state = OP(RDMA_READ_RESPONSE_FIRST); - break; - - case IB_WR_RDMA_WRITE: - case IB_WR_RDMA_WRITE_WITH_IMM: - qp->s_state = OP(RDMA_READ_RESPONSE_LAST); - break; - - case IB_WR_RDMA_READ: - qp->s_state = OP(RDMA_READ_RESPONSE_MIDDLE); - break; - - default: - /* - * This case shouldn't happen since its only - * one PSN per req. 
- */ - qp->s_state = OP(SEND_LAST); - } } /** @@ -738,7 +703,7 @@ static void reset_psn(struct ipath_qp *q * @psn: the packet sequence number of the ACK * @opcode: the opcode of the request that resulted in the ACK * - * This is called from ipath_rc_rcv() to process an incoming RC ACK + * This is called from ipath_rc_rcv_resp() to process an incoming RC ACK * for the given QP. * Called at interrupt level with the QP s_lock held. * Returns 1 if OK, 0 if current operation should be aborted (NAK). @@ -877,22 +842,12 @@ static int do_rc_ack(struct ipath_qp *qp if (qp->s_last == qp->s_tail) goto bail; - /* The last valid PSN seen is the previous request's. */ - qp->s_last_psn = wqe->psn - 1; + /* The last valid PSN is the previous PSN. */ + qp->s_last_psn = psn - 1; dev->n_rc_resends += (int)qp->s_psn - (int)psn; - /* - * If we are starting the request from the beginning, let - * the normal send code handle initialization. - */ - qp->s_cur = qp->s_last; - wqe = get_swqe_ptr(qp, qp->s_cur); - if (ipath_cmp24(psn, wqe->psn) <= 0) { - qp->s_state = OP(SEND_LAST); - qp->s_psn = wqe->psn; - } else - reset_psn(qp, psn); + reset_psn(qp, psn); qp->s_rnr_timeout = ib_ipath_rnr_table[(aeth >> IPS_AETH_CREDIT_SHIFT) & @@ -1070,9 +1025,10 @@ static inline void ipath_rc_rcv_resp(str &dev->pending[dev->pending_index]); spin_unlock(&dev->pending_lock); /* - * Update the RDMA receive state but do the copy w/o holding the - * locks and blocking interrupts. XXX Yet another place that - * affects relaxed RDMA order since we don't want s_sge modified. + * Update the RDMA receive state but do the copy w/o + * holding the locks and blocking interrupts. + * XXX Yet another place that affects relaxed RDMA order + * since we don't want s_sge modified. */ qp->s_len -= pmtu; qp->s_last_psn = psn; @@ -1119,9 +1075,12 @@ static inline void ipath_rc_rcv_resp(str if (do_rc_ack(qp, aeth, psn, OP(RDMA_READ_RESPONSE_LAST))) { /* * Change the state so we contimue - * processing new requests. 
+ * processing new requests and wake up the + * tasklet if there are posted sends. */ qp->s_state = OP(SEND_LAST); + if (qp->s_tail != qp->s_head) + tasklet_hi_schedule(&qp->s_task); } goto ack_done; } From bos at pathscale.com Thu Jun 29 14:41:09 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:09 -0700 Subject: [openib-general] [PATCH 18 of 39] IB/ipath - use vmalloc to allocate struct ipath_devdata In-Reply-To: Message-ID: <9c072f8e7e68131f1c7e.1151617269@eng-12.pathscale.com> This is not a DMA target, so no need to use dma_alloc_coherent on it. Signed-off-by: Bryan O'Sullivan diff -r 9d943b828776 -r 9c072f8e7e68 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 @@ -171,14 +171,13 @@ static void ipath_free_devdata(struct pc list_del(&dd->ipath_list); spin_unlock_irqrestore(&ipath_devs_lock, flags); } - dma_free_coherent(&pdev->dev, sizeof(*dd), dd, dd->ipath_dma_addr); + vfree(dd); } static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev) { unsigned long flags; struct ipath_devdata *dd; - dma_addr_t dma_addr; int ret; if (!idr_pre_get(&unit_table, GFP_KERNEL)) { @@ -186,15 +185,12 @@ static struct ipath_devdata *ipath_alloc goto bail; } - dd = dma_alloc_coherent(&pdev->dev, sizeof(*dd), &dma_addr, - GFP_KERNEL); - + dd = vmalloc(sizeof(*dd)); if (!dd) { dd = ERR_PTR(-ENOMEM); goto bail; } - - dd->ipath_dma_addr = dma_addr; + memset(dd, 0, sizeof(*dd)); dd->ipath_unit = -1; spin_lock_irqsave(&ipath_devs_lock, flags); diff -r 9d943b828776 -r 9c072f8e7e68 drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:25 2006 -0700 @@ -163,7 +163,6 @@ struct ipath_devdata { * only written to by the chip, not the driver. 
*/ volatile __le64 *ipath_hdrqtailptr; - dma_addr_t ipath_dma_addr; /* ipath_cfgports pointers */ struct ipath_portdata **ipath_pd; /* sk_buffs used by port 0 eager receive queue */ From bos at pathscale.com Thu Jun 29 14:41:06 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:06 -0700 Subject: [openib-general] [PATCH 15 of 39] IB/ipath - print better debug info when handling 32/64-bit DMA mask problems In-Reply-To: Message-ID: <125471ee6c6863fbfa35.1151617266@eng-12.pathscale.com> Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r e43b4df874a9 -r 125471ee6c68 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 @@ -425,12 +425,29 @@ static int __devinit ipath_init_one(stru */ ret = pci_set_dma_mask(pdev, DMA_32BIT_MASK); if (ret) { - dev_info(&pdev->dev, "pci_set_dma_mask unit %u " - "fails: %d\n", dd->ipath_unit, ret); + dev_info(&pdev->dev, + "Unable to set DMA mask for unit %u: %d\n", + dd->ipath_unit, ret); goto bail_regions; } - else + else { ipath_dbg("No 64bit DMA mask, used 32 bit mask\n"); + ret = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + if (ret) + dev_info(&pdev->dev, + "Unable to set DMA consistent mask " + "for unit %u: %d\n", + dd->ipath_unit, ret); + + } + } + else { + ret = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + if (ret) + dev_info(&pdev->dev, + "Unable to set DMA consistent mask " + "for unit %u: %d\n", + dd->ipath_unit, ret); } pci_set_master(pdev); From bos at pathscale.com Thu Jun 29 14:41:14 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:14 -0700 Subject: [openib-general] [PATCH 23 of 39] IB/ipath - disallow send of invalid packet sizes over UD In-Reply-To: Message-ID: <8e39364c2402304872e6.1151617274@eng-12.pathscale.com> Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r 
811021b6c112 -r 8e39364c2402 drivers/infiniband/hw/ipath/ipath_ud.c --- a/drivers/infiniband/hw/ipath/ipath_ud.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_ud.c Thu Jun 29 14:33:26 2006 -0700 @@ -274,6 +274,11 @@ int ipath_post_ud_send(struct ipath_qp * } len += wr->sg_list[i].length; ss.num_sge++; + } + /* Check for invalid packet size. */ + if (len > ipath_layer_get_ibmtu(dev->dd)) { + ret = -EINVAL; + goto bail; } extra_bytes = (4 - len) & 3; nwords = (len + extra_bytes) >> 2; From bos at pathscale.com Thu Jun 29 14:41:15 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:15 -0700 Subject: [openib-general] [PATCH 24 of 39] IB/ipath - don't confuse the max message size with the MTU In-Reply-To: Message-ID: Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r 8e39364c2402 -r e952aedb0e94 drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:26 2006 -0700 @@ -695,7 +695,7 @@ static int ipath_query_port(struct ib_de ipath_layer_get_lastibcstat(dev->dd) & 0xf]; props->port_cap_flags = dev->port_cap_flags; props->gid_tbl_len = 1; - props->max_msg_sz = 4096; + props->max_msg_sz = 0x80000000; props->pkey_tbl_len = ipath_layer_get_npkeys(dev->dd); props->bad_pkey_cntr = ipath_layer_get_cr_errpkey(dev->dd) - dev->z_pkey_violations; From bos at pathscale.com Thu Jun 29 14:41:19 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:19 -0700 Subject: [openib-general] [PATCH 28 of 39] IB/ipath - Fixes a bug where our delay for EEPROM no longer works due to compiler reordering In-Reply-To: Message-ID: <5f3c0b2d446d78e3327f.1151617279@eng-12.pathscale.com> The mb() prevents the compiler from reordering on this function, with some versions of gcc and -Os optimization. The result is random failures in the EEPROM read without this change. 
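The hazard this mb() closes can be sketched in userspace: without a barrier, the compiler is free to hoist the dummy flush-read above the writes that came before it. A minimal sketch, assuming a GCC-style full barrier; the register variables and function names here are illustrative stand-ins, not the driver's real API.

```c
#include <stdint.h>

/* Stand-ins for chip registers; illustrative only. */
static volatile uint32_t gpio_out;
static volatile uint32_t scratch;

/* Full memory barrier, roughly what mb() expands to. */
#define mb() __sync_synchronize()

static void i2c_gpio_set(uint32_t v)
{
	gpio_out = v;		/* write that must reach the chip first */
}

static uint32_t i2c_wait_for_writes(void)
{
	/*
	 * The mb() keeps the compiler (and CPU) from reordering the
	 * dummy read above earlier writes; the read itself flushes
	 * posted writes out to the device.
	 */
	mb();
	return scratch;
}
```

With -Os, some gcc versions would otherwise move the read, which matches the random EEPROM read failures described above.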
Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r 7d22a8963bda -r 5f3c0b2d446d drivers/infiniband/hw/ipath/ipath_eeprom.c --- a/drivers/infiniband/hw/ipath/ipath_eeprom.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_eeprom.c Thu Jun 29 14:33:26 2006 -0700 @@ -186,6 +186,7 @@ bail: */ static void i2c_wait_for_writes(struct ipath_devdata *dd) { + mb(); (void)ipath_read_kreg32(dd, dd->ipath_kregs->kr_scratch); } From bos at pathscale.com Thu Jun 29 14:41:08 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:08 -0700 Subject: [openib-general] [PATCH 17 of 39] IB/ipath - use more appropriate gfp flags In-Reply-To: Message-ID: <9d943b828776136a2bb7.1151617268@eng-12.pathscale.com> This helps us to survive better when memory is fragmented. Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r fd5e733f02ac -r 9d943b828776 drivers/infiniband/hw/ipath/ipath_file_ops.c --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:25 2006 -0700 @@ -705,6 +705,15 @@ static int ipath_create_user_egr(struct unsigned e, egrcnt, alloced, egrperchunk, chunk, egrsize, egroff; size_t size; int ret; + gfp_t gfp_flags; + + /* + * GFP_USER, but without GFP_FS, so buffer cache can be + * coalesced (we hope); otherwise, even at order 4, + * heavy filesystem activity makes these fail, and we can + * use compound pages. + */ + gfp_flags = __GFP_WAIT | __GFP_IO | __GFP_COMP; egrcnt = dd->ipath_rcvegrcnt; /* TID number offset for this port */ @@ -721,10 +730,8 @@ static int ipath_create_user_egr(struct * memory pressure (creating large files and then copying them over * NFS while doing lots of MPI jobs), we hit some allocation * failures, even though we can sleep... (2.6.10) Still get - * failures at 64K. 32K is the lowest we can go without waiting - * more memory again. 
It seems likely that the coalescing in - * free_pages, etc. still has issues (as it has had previously - * during 2.6.x development). + * failures at 64K. 32K is the lowest we can go without wasting + * additional memory. */ size = 0x8000; alloced = ALIGN(egrsize * egrcnt, size); @@ -745,12 +752,6 @@ static int ipath_create_user_egr(struct goto bail_rcvegrbuf; } for (e = 0; e < pd->port_rcvegrbuf_chunks; e++) { - /* - * GFP_USER, but without GFP_FS, so buffer cache can be - * coalesced (we hope); otherwise, even at order 4, - * heavy filesystem activity makes these fail - */ - gfp_t gfp_flags = __GFP_WAIT | __GFP_IO | __GFP_COMP; pd->port_rcvegrbuf[e] = dma_alloc_coherent( &dd->pcidev->dev, size, &pd->port_rcvegrbuf_phys[e], @@ -1167,9 +1168,10 @@ static int ipath_mmap(struct file *fp, s ureg = dd->ipath_uregbase + dd->ipath_palign * pd->port_port; - ipath_cdbg(MM, "ushare: pgaddr %llx vm_start=%lx, vmlen %lx\n", + ipath_cdbg(MM, "pgaddr %llx vm_start=%lx len %lx port %u:%u\n", (unsigned long long) pgaddr, vma->vm_start, - vma->vm_end - vma->vm_start); + vma->vm_end - vma->vm_start, dd->ipath_unit, + pd->port_port); if (pgaddr == ureg) ret = mmap_ureg(vma, dd, ureg); From bos at pathscale.com Thu Jun 29 14:41:16 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:16 -0700 Subject: [openib-general] [PATCH 25 of 39] IB/ipath - removed redundant statements In-Reply-To: Message-ID: <4c581c37bb95ad3abb6d.1151617276@eng-12.pathscale.com> The tail register read became redundant as the result of earlier receive interrupt bug fixes. Drop another unneeded register read. And another line that got duplicated. 
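The pattern being removed is re-reading a status register whose read is itself a bus transaction. A userspace sketch of the read-once idiom, with a counter standing in for the cost of each chip read; the names are made up for illustration:

```c
#include <stdint.h>

static unsigned reads;			/* counts simulated bus transactions */

static uint32_t read_intstatus(void)
{
	reads++;			/* each call is a chip register read */
	return 0x4;			/* pretend one interrupt bit is set */
}

static uint32_t handle_interrupt(void)
{
	/* Read the status once and reuse the value, rather than
	 * issuing the duplicated read the patch deletes. */
	uint32_t istat = read_intstatus();

	return istat;
}
```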
Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r e952aedb0e94 -r 4c581c37bb95 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 @@ -890,9 +890,6 @@ void ipath_kreceive(struct ipath_devdata goto done; reloop: - /* read only once at start for performance */ - hdrqtail = (u32)le64_to_cpu(*dd->ipath_hdrqtailptr); - for (i = 0; l != hdrqtail; i++) { u32 qp; u8 *bthbytes; diff -r e952aedb0e94 -r 4c581c37bb95 drivers/infiniband/hw/ipath/ipath_ht400.c --- a/drivers/infiniband/hw/ipath/ipath_ht400.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_ht400.c Thu Jun 29 14:33:26 2006 -0700 @@ -1573,7 +1573,6 @@ void ipath_init_ht400_funcs(struct ipath dd->ipath_f_reset = ipath_setup_ht_reset; dd->ipath_f_get_boardname = ipath_ht_boardname; dd->ipath_f_init_hwerrors = ipath_ht_init_hwerrors; - dd->ipath_f_init_hwerrors = ipath_ht_init_hwerrors; dd->ipath_f_early_init = ipath_ht_early_init; dd->ipath_f_handle_hwerrors = ipath_ht_handle_hwerrors; dd->ipath_f_quiet_serdes = ipath_ht_quiet_serdes; diff -r e952aedb0e94 -r 4c581c37bb95 drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:26 2006 -0700 @@ -824,7 +824,6 @@ irqreturn_t ipath_intr(int irq, void *da ipath_stats.sps_fastrcvint++; goto done; } - istat = ipath_read_kreg32(dd, dd->ipath_kregs->kr_intstatus); } istat = ipath_read_kreg32(dd, dd->ipath_kregs->kr_intstatus); From bos at pathscale.com Thu Jun 29 14:41:24 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:24 -0700 Subject: [openib-general] [PATCH 33 of 39] IB/ipath - read/write correct sizes through diag interface In-Reply-To: Message-ID: We must increment uaddr by size we are reading or writing, since it's passed as a 
char *, not a pointer to the appropriate size. Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r 8fbb5d71823a -r a7c1ad1e090b drivers/infiniband/hw/ipath/ipath_diag.c --- a/drivers/infiniband/hw/ipath/ipath_diag.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_diag.c Thu Jun 29 14:33:26 2006 -0700 @@ -115,7 +115,7 @@ static int ipath_read_umem64(struct ipat goto bail; } reg_addr++; - uaddr++; + uaddr += sizeof(u64); } ret = 0; bail: @@ -154,7 +154,7 @@ static int ipath_write_umem64(struct ipa writeq(data, reg_addr); reg_addr++; - uaddr++; + uaddr += sizeof(u64); } ret = 0; bail: @@ -192,7 +192,8 @@ static int ipath_read_umem32(struct ipat } reg_addr++; - uaddr++; + uaddr += sizeof(u32); + } ret = 0; bail: @@ -231,7 +232,7 @@ static int ipath_write_umem32(struct ipa writel(data, reg_addr); reg_addr++; - uaddr++; + uaddr += sizeof(u32); } ret = 0; bail: From bos at pathscale.com Thu Jun 29 14:41:17 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:17 -0700 Subject: [openib-general] [PATCH 26 of 39] IB/ipath - check for valid LID and multicast LIDs In-Reply-To: Message-ID: Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r 4c581c37bb95 -r eef7f8021500 drivers/infiniband/hw/ipath/ipath_sysfs.c --- a/drivers/infiniband/hw/ipath/ipath_sysfs.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c Thu Jun 29 14:33:26 2006 -0700 @@ -280,7 +280,7 @@ static ssize_t store_lid(struct device * if (ret < 0) goto invalid; - if (lid == 0 || lid >= 0xc000) { + if (lid == 0 || lid >= IPS_MULTICAST_LID_BASE) { ret = -EINVAL; goto invalid; } @@ -314,7 +314,7 @@ static ssize_t store_mlid(struct device int ret; ret = ipath_parse_ushort(buf, &mlid); - if (ret < 0) + if (ret < 0 || mlid < IPS_MULTICAST_LID_BASE) goto invalid; unit = dd->ipath_unit; From bos at pathscale.com Thu Jun 29 14:41:12 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:12 
-0700 Subject: [openib-general] [PATCH 21 of 39] IB/ipath - fixed bug 9776 for real. The problem was that I was updating In-Reply-To: Message-ID: <1a4350d895c9a673c98e.1151617272@eng-12.pathscale.com> the head register multiple times in the rcvhdrq processing loop, and setting the counter on each update. Since that meant that the tail register was ahead of head for all but the last update, we would get extra interrupts. The fix was to not write the counter value except on the last update. I also changed to update rcvhdrhead and rcvegrindexhead at most every 16 packets, if there were lots of packets in the queue (and of course, on the last packet, regardless). I also made some small cleanups while debugging this. With these changes, xeon/monty typically sees two openib packets per interrupt on sdp and ipoib, opteron/monty is about 1.25 pkts/intr. I'm seeing about 3800 Mbit/s monty/xeon, and 5000-5100 opteron/monty with netperf sdp. Netpipe doesn't show as good as that, peaking at about 4400 on opteron/monty sdp. 
Plain ipoib xeon is about 2100+ netperf, opteron 2900+, at 128KB Signed-off-by: olson at eng-12.pathscale.com Signed-off-by: Bryan O'Sullivan diff -r 8bc865893a11 -r 1a4350d895c9 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 @@ -870,7 +870,7 @@ void ipath_kreceive(struct ipath_devdata const u32 maxcnt = dd->ipath_rcvhdrcnt * rsize; /* words */ u32 etail = -1, l, hdrqtail; struct ips_message_header *hdr; - u32 eflags, i, etype, tlen, pkttot = 0; + u32 eflags, i, etype, tlen, pkttot = 0, updegr=0; static u64 totcalls; /* stats, may eventually remove */ char emsg[128]; @@ -884,14 +884,14 @@ void ipath_kreceive(struct ipath_devdata if (test_and_set_bit(0, &dd->ipath_rcv_pending)) goto bail; - if (dd->ipath_port0head == - (u32)le64_to_cpu(*dd->ipath_hdrqtailptr)) + l = dd->ipath_port0head; + if (l == (u32)le64_to_cpu(*dd->ipath_hdrqtailptr)) goto done; /* read only once at start for performance */ hdrqtail = (u32)le64_to_cpu(*dd->ipath_hdrqtailptr); - for (i = 0, l = dd->ipath_port0head; l != hdrqtail; i++) { + for (i = 0; l != hdrqtail; i++) { u32 qp; u8 *bthbytes; @@ -1002,15 +1002,26 @@ void ipath_kreceive(struct ipath_devdata l += rsize; if (l >= maxcnt) l = 0; + if (etype != RCVHQ_RCV_TYPE_EXPECTED) + updegr = 1; /* - * update for each packet, to help prevent overflows if we - * have lots of packets. + * update head regs on last packet, and every 16 packets. 
+ * Reduce bus traffic, while still trying to prevent + * rcvhdrq overflows, for when the queue is nearly full */ - (void)ipath_write_ureg(dd, ur_rcvhdrhead, - dd->ipath_rhdrhead_intr_off | l, 0); - if (etype != RCVHQ_RCV_TYPE_EXPECTED) - (void)ipath_write_ureg(dd, ur_rcvegrindexhead, - etail, 0); + if (l == hdrqtail || (i && !(i&0xf))) { + u64 lval; + if (l == hdrqtail) /* want interrupt only on last */ + lval = dd->ipath_rhdrhead_intr_off | l; + else + lval = l; + (void)ipath_write_ureg(dd, ur_rcvhdrhead, lval, 0); + if (updegr) { + (void)ipath_write_ureg(dd, ur_rcvegrindexhead, + etail, 0); + updegr = 0; + } + } } pkttot += i; diff -r 8bc865893a11 -r 1a4350d895c9 drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:26 2006 -0700 @@ -383,7 +383,7 @@ static unsigned handle_frequent_errors(s return supp_msgs; } -static void handle_errors(struct ipath_devdata *dd, ipath_err_t errs) +static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs) { char msg[512]; u64 ignore_this_time = 0; @@ -480,7 +480,7 @@ static void handle_errors(struct ipath_d INFINIPATH_E_IBSTATUSCHANGED); } if (!errs) - return; + return 0; if (!noprint) /* @@ -604,9 +604,7 @@ static void handle_errors(struct ipath_d wake_up_interruptible(&ipath_sma_state_wait); } - if (chkerrpkts) - /* process possible error packets in hdrq */ - ipath_kreceive(dd); + return chkerrpkts; } /* this is separate to allow for better optimization of ipath_intr() */ @@ -765,10 +763,10 @@ irqreturn_t ipath_intr(int irq, void *da irqreturn_t ipath_intr(int irq, void *data, struct pt_regs *regs) { struct ipath_devdata *dd = data; - u32 istat; + u32 istat, chk0rcv = 0; ipath_err_t estat = 0; irqreturn_t ret; - u32 p0bits; + u32 p0bits, oldhead; static unsigned unexpected = 0; static const u32 port0rbits = (1U<ipath_port0head != - (u32)le64_to_cpu(*dd->ipath_hdrqtailptr)) { - u32 oldhead = 
dd->ipath_port0head; + oldhead = dd->ipath_port0head; + if (oldhead != (u32) le64_to_cpu(*dd->ipath_hdrqtailptr)) { if (dd->ipath_flags & IPATH_GPIO_INTR) { ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_clear, (u64) (1 << 2)); @@ -830,6 +827,8 @@ irqreturn_t ipath_intr(int irq, void *da } istat = ipath_read_kreg32(dd, dd->ipath_kregs->kr_intstatus); + p0bits = port0rbits; + if (unlikely(!istat)) { ipath_stats.sps_nullintr++; ret = IRQ_NONE; /* not our interrupt, or already handled */ @@ -867,10 +866,11 @@ irqreturn_t ipath_intr(int irq, void *da ipath_dev_err(dd, "Read of error status failed " "(all bits set); ignoring\n"); else - handle_errors(dd, estat); - } - - p0bits = port0rbits; + if (handle_errors(dd, estat)) + /* force calling ipath_kreceive() */ + chk0rcv = 1; + } + if (istat & INFINIPATH_I_GPIO) { /* * Packets are available in the port 0 rcv queue. @@ -892,8 +892,10 @@ irqreturn_t ipath_intr(int irq, void *da ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_clear, (u64) (1 << 2)); p0bits |= INFINIPATH_I_GPIO; - } - } + chk0rcv = 1; + } + } + chk0rcv |= istat & p0bits; /* * clear the ones we will deal with on this round @@ -905,18 +907,16 @@ irqreturn_t ipath_intr(int irq, void *da ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, istat); /* - * we check for both transition from empty to non-empty, and urgent - * packets (those with the interrupt bit set in the header), and - * if enabled, the GPIO bit 2 interrupt used for port0 on some - * HT-400 boards. - * Do this before checking for pio buffers available, since - * receives can overflow; piobuf waiters can afford a few - * extra cycles, since they were waiting anyway. - */ - if (istat & p0bits) { + * handle port0 receive before checking for pio buffers available, + * since receives can overflow; piobuf waiters can afford a few + * extra cycles, since they were waiting anyway, and user's waiting + * for receive are at the bottom. 
+ */ + if (chk0rcv) { ipath_kreceive(dd); istat &= ~port0rbits; } + if (istat & ((infinipath_i_rcvavail_mask << INFINIPATH_I_RCVAVAIL_SHIFT) | (infinipath_i_rcvurg_mask << From bos at pathscale.com Thu Jun 29 14:41:13 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:13 -0700 Subject: [openib-general] [PATCH 22 of 39] IB/ipath - fix lost interrupts on HT-400 In-Reply-To: Message-ID: <811021b6c112f8616d73.1151617273@eng-12.pathscale.com> Do an extra check to see if in-memory tail changed while processing packets, and if so, going back through the loop again (but only once per call to ipath_kreceive()). In practice, this seems to be enough to guarantee that if we crossed the clearing of an interrupt at start of ipath_intr with a scheduled tail register update, that we'll process the "extra" packet that lost the interrupt because we cleared it just as it was about to arrive. Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r 1a4350d895c9 -r 811021b6c112 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 @@ -870,7 +870,7 @@ void ipath_kreceive(struct ipath_devdata const u32 maxcnt = dd->ipath_rcvhdrcnt * rsize; /* words */ u32 etail = -1, l, hdrqtail; struct ips_message_header *hdr; - u32 eflags, i, etype, tlen, pkttot = 0, updegr=0; + u32 eflags, i, etype, tlen, pkttot = 0, updegr=0, reloop=0; static u64 totcalls; /* stats, may eventually remove */ char emsg[128]; @@ -885,9 +885,11 @@ void ipath_kreceive(struct ipath_devdata goto bail; l = dd->ipath_port0head; - if (l == (u32)le64_to_cpu(*dd->ipath_hdrqtailptr)) + hdrqtail = (u32) le64_to_cpu(*dd->ipath_hdrqtailptr); + if (l == hdrqtail) goto done; +reloop: /* read only once at start for performance */ hdrqtail = (u32)le64_to_cpu(*dd->ipath_hdrqtailptr); @@ -1011,7 +1013,7 @@ void ipath_kreceive(struct ipath_devdata */ if (l 
== hdrqtail || (i && !(i&0xf))) { u64 lval; - if (l == hdrqtail) /* want interrupt only on last */ + if (l == hdrqtail) /* PE-800 interrupt only on last */ lval = dd->ipath_rhdrhead_intr_off | l; else lval = l; @@ -1021,6 +1023,23 @@ void ipath_kreceive(struct ipath_devdata etail, 0); updegr = 0; } + } + } + + if (!dd->ipath_rhdrhead_intr_off && !reloop) { + /* HT-400 workaround; we can have a race clearing chip + * interrupt with another interrupt about to be delivered, + * and can clear it before it is delivered on the GPIO + * workaround. By doing the extra check here for the + * in-memory tail register updating while we were doing + * earlier packets, we "almost" guarantee we have covered + * that case. + */ + u32 hqtail = (u32)le64_to_cpu(*dd->ipath_hdrqtailptr); + if (hqtail != hdrqtail) { + hdrqtail = hqtail; + reloop = 1; /* loop 1 extra time at most */ + goto reloop; } } diff -r 1a4350d895c9 -r 811021b6c112 drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:26 2006 -0700 @@ -766,7 +766,7 @@ irqreturn_t ipath_intr(int irq, void *da u32 istat, chk0rcv = 0; ipath_err_t estat = 0; irqreturn_t ret; - u32 p0bits, oldhead; + u32 oldhead, curtail; static unsigned unexpected = 0; static const u32 port0rbits = (1U<ipath_port0head; - if (oldhead != (u32) le64_to_cpu(*dd->ipath_hdrqtailptr)) { + curtail = (u32)le64_to_cpu(*dd->ipath_hdrqtailptr); + if (oldhead != curtail) { if (dd->ipath_flags & IPATH_GPIO_INTR) { ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_clear, (u64) (1 << 2)); - p0bits = port0rbits | INFINIPATH_I_GPIO; + istat = port0rbits | INFINIPATH_I_GPIO; } else - p0bits = port0rbits; - ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, p0bits); + istat = port0rbits; + ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, istat); ipath_kreceive(dd); if (oldhead != dd->ipath_port0head) { ipath_stats.sps_fastrcvint++; @@ -827,7 
+828,6 @@ irqreturn_t ipath_intr(int irq, void *da } istat = ipath_read_kreg32(dd, dd->ipath_kregs->kr_intstatus); - p0bits = port0rbits; if (unlikely(!istat)) { ipath_stats.sps_nullintr++; @@ -890,19 +890,19 @@ irqreturn_t ipath_intr(int irq, void *da else { /* Clear GPIO status bit 2 */ ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_clear, - (u64) (1 << 2)); - p0bits |= INFINIPATH_I_GPIO; + (u64) (1 << 2)); chk0rcv = 1; } } - chk0rcv |= istat & p0bits; - - /* - * clear the ones we will deal with on this round - * We clear it early, mostly for receive interrupts, so we - * know the chip will have seen this by the time we process - * the queue, and will re-interrupt if necessary. The processor - * itself won't take the interrupt again until we return. + chk0rcv |= istat & port0rbits; + + /* + * Clear the interrupt bits we found set, unless they are receive + * related, in which case we already cleared them above, and don't + * want to clear them again, because we might lose an interrupt. + * Clear it early, so we "know" know the chip will have seen this by + * the time we process the queue, and will re-interrupt if necessary. + * The processor itself won't take the interrupt again until we return. */ ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, istat); From bos at pathscale.com Thu Jun 29 14:41:22 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:22 -0700 Subject: [openib-general] [PATCH 31 of 39] IB/ipath - drop the "stats" sysfs attribute group In-Reply-To: Message-ID: <21378f21e091f6fc81fc.1151617282@eng-12.pathscale.com> This attribute group made it into the original driver, but should not have. 
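The cleanup also collapses the registration path: with two attribute groups, a failure of the second had to unwind the first; with one group, create and remove are each a single call. A hedged userspace sketch of that unwind pattern, where create_group()/remove_group() stand in for the sysfs calls:

```c
static int fail_second;			/* test knob: make the 2nd group fail */
static int registered[2];

static int create_group(int i)
{
	if (i == 1 && fail_second)
		return -1;
	registered[i] = 1;
	return 0;
}

static void remove_group(int i)
{
	registered[i] = 0;
}

static int driver_create_groups(void)
{
	int ret = create_group(0);
	if (ret)
		return ret;
	ret = create_group(1);
	if (ret)
		remove_group(0);	/* unwind on partial failure */
	return ret;
}
```

Dropping the stats group removes the second create_group() call and the unwind branch entirely.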
Signed-off-by: Bryan O'Sullivan diff -r 3ceb73f8bde0 -r 21378f21e091 drivers/infiniband/hw/ipath/ipath_sysfs.c --- a/drivers/infiniband/hw/ipath/ipath_sysfs.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c Thu Jun 29 14:33:26 2006 -0700 @@ -84,81 +84,6 @@ static ssize_t show_num_units(struct dev return scnprintf(buf, PAGE_SIZE, "%d\n", ipath_count_units(NULL, NULL, NULL)); } - -#define DRIVER_STAT(name, attr) \ - static ssize_t show_stat_##name(struct device_driver *dev, \ - char *buf) \ - { \ - return scnprintf( \ - buf, PAGE_SIZE, "%llu\n", \ - (unsigned long long) ipath_stats.sps_ ##attr); \ - } \ - static DRIVER_ATTR(name, S_IRUGO, show_stat_##name, NULL) - -DRIVER_STAT(intrs, ints); -DRIVER_STAT(err_intrs, errints); -DRIVER_STAT(errs, errs); -DRIVER_STAT(pkt_errs, pkterrs); -DRIVER_STAT(crc_errs, crcerrs); -DRIVER_STAT(hw_errs, hwerrs); -DRIVER_STAT(ib_link, iblink); -DRIVER_STAT(port0_pkts, port0pkts); -DRIVER_STAT(ether_spkts, ether_spkts); -DRIVER_STAT(ether_rpkts, ether_rpkts); -DRIVER_STAT(sma_spkts, sma_spkts); -DRIVER_STAT(sma_rpkts, sma_rpkts); -DRIVER_STAT(hdrq_full, hdrqfull); -DRIVER_STAT(etid_full, etidfull); -DRIVER_STAT(no_piobufs, nopiobufs); -DRIVER_STAT(ports, ports); -DRIVER_STAT(pkey0, pkeys[0]); -DRIVER_STAT(pkey1, pkeys[1]); -DRIVER_STAT(pkey2, pkeys[2]); -DRIVER_STAT(pkey3, pkeys[3]); - -DRIVER_STAT(nports, nports); -DRIVER_STAT(null_intr, nullintr); -DRIVER_STAT(max_pkts_call, maxpkts_call); -DRIVER_STAT(avg_pkts_call, avgpkts_call); -DRIVER_STAT(page_locks, pagelocks); -DRIVER_STAT(page_unlocks, pageunlocks); -DRIVER_STAT(krdrops, krdrops); - -static struct attribute *driver_stat_attributes[] = { - &driver_attr_intrs.attr, - &driver_attr_err_intrs.attr, - &driver_attr_errs.attr, - &driver_attr_pkt_errs.attr, - &driver_attr_crc_errs.attr, - &driver_attr_hw_errs.attr, - &driver_attr_ib_link.attr, - &driver_attr_port0_pkts.attr, - &driver_attr_ether_spkts.attr, - &driver_attr_ether_rpkts.attr, - 
&driver_attr_sma_spkts.attr, - &driver_attr_sma_rpkts.attr, - &driver_attr_hdrq_full.attr, - &driver_attr_etid_full.attr, - &driver_attr_no_piobufs.attr, - &driver_attr_ports.attr, - &driver_attr_pkey0.attr, - &driver_attr_pkey1.attr, - &driver_attr_pkey2.attr, - &driver_attr_pkey3.attr, - &driver_attr_nports.attr, - &driver_attr_null_intr.attr, - &driver_attr_max_pkts_call.attr, - &driver_attr_avg_pkts_call.attr, - &driver_attr_page_locks.attr, - &driver_attr_page_unlocks.attr, - &driver_attr_krdrops.attr, - NULL -}; - -static struct attribute_group driver_stat_attr_group = { - .name = "stats", - .attrs = driver_stat_attributes -}; static ssize_t show_status(struct device *dev, struct device_attribute *attr, @@ -716,20 +641,12 @@ int ipath_driver_create_group(struct dev int ret; ret = sysfs_create_group(&drv->kobj, &driver_attr_group); - if (ret) - goto bail; - - ret = sysfs_create_group(&drv->kobj, &driver_stat_attr_group); - if (ret) - sysfs_remove_group(&drv->kobj, &driver_attr_group); - -bail: + return ret; } void ipath_driver_remove_group(struct device_driver *drv) { - sysfs_remove_group(&drv->kobj, &driver_stat_attr_group); sysfs_remove_group(&drv->kobj, &driver_attr_group); } From bos at pathscale.com Thu Jun 29 14:41:23 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:23 -0700 Subject: [openib-general] [PATCH 32 of 39] IB/ipath - support more models of InfiniPath hardware In-Reply-To: Message-ID: <8fbb5d71823abafe963a.1151617283@eng-12.pathscale.com> We do a few more explicit checks for specific models, and now also support the old PathScale serial number style, or new QLogic style. This is backwards compatible with previous versions of software and hardware. That is, older software will see a plausible serial number and correct GUID when used with a new board, while newer software will correctly handle an older board. 
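The backward-compatible layout can be sketched as follows: the 13-character QLogic serial is rebuilt from a 4-byte prefix stored in a formerly unused flash field plus the original 12-byte serial field, so old software reading only if_serial still sees a plausible (truncated) number. Field sizes follow struct ipath_flash as shown in the diff below; the function and struct names here are illustrative.

```c
#include <string.h>

struct flash {
	char if_sprefix[4];		/* new: serial-number prefix */
	char if_serial[12];		/* original serial field */
};

static void get_serial(const struct flash *ifp, char out[16])
{
	char *snp = out;
	size_t len;

	memcpy(snp, ifp->if_sprefix, sizeof ifp->if_sprefix);
	snp[sizeof ifp->if_sprefix] = '\0';	/* prefix may be shorter */
	len = strlen(snp);
	snp += len;
	len = 16 - len;				/* space left in out[] */
	if (len > sizeof ifp->if_serial)
		len = sizeof ifp->if_serial;
	memcpy(snp, ifp->if_serial, len);
}
```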
Signed-off-by: Mike Albaugh Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r 21378f21e091 -r 8fbb5d71823a drivers/infiniband/hw/ipath/ipath_common.h --- a/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:26 2006 -0700 @@ -476,7 +476,7 @@ struct ipath_sma_pkt * Data layout in I2C flash (for GUID, etc.) * All fields are little-endian binary unless otherwise stated */ -#define IPATH_FLASH_VERSION 1 +#define IPATH_FLASH_VERSION 2 struct ipath_flash { /* flash layout version (IPATH_FLASH_VERSION) */ __u8 if_fversion; @@ -484,14 +484,14 @@ struct ipath_flash { __u8 if_csum; /* * valid length (in use, protected by if_csum), including - * if_fversion and if_sum themselves) + * if_fversion and if_csum themselves) */ __u8 if_length; /* the GUID, in network order */ __u8 if_guid[8]; /* number of GUIDs to use, starting from if_guid */ __u8 if_numguid; - /* the board serial number, in ASCII */ + /* the (last 10 characters of) board serial number, in ASCII */ char if_serial[12]; /* board mfg date (YYYYMMDD ASCII) */ char if_mfgdate[8]; @@ -503,8 +503,10 @@ struct ipath_flash { __u8 if_powerhour[2]; /* ASCII free-form comment field */ char if_comment[32]; - /* 78 bytes used, min flash size is 128 bytes */ - __u8 if_future[50]; + /* Backwards compatible prefix for longer QLogic Serial Numbers */ + char if_sprefix[4]; + /* 82 bytes used, min flash size is 128 bytes */ + __u8 if_future[46]; }; /* diff -r 21378f21e091 -r 8fbb5d71823a drivers/infiniband/hw/ipath/ipath_eeprom.c --- a/drivers/infiniband/hw/ipath/ipath_eeprom.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_eeprom.c Thu Jun 29 14:33:26 2006 -0700 @@ -602,8 +602,31 @@ void ipath_get_eeprom_info(struct ipath_ guid = *(__be64 *) ifp->if_guid; dd->ipath_guid = guid; dd->ipath_nguid = ifp->if_numguid; - memcpy(dd->ipath_serial, ifp->if_serial, - sizeof(ifp->if_serial)); + /* + * Things are 
slightly complicated by the desire to transparently + * support both the Pathscale 10-digit serial number and the QLogic + * 13-character version. + */ + if ((ifp->if_fversion > 1) && ifp->if_sprefix[0] + && ((u8 *)ifp->if_sprefix)[0] != 0xFF) { + /* This board has a Serial-prefix, which is stored + * elsewhere for backward-compatibility. + */ + char *snp = dd->ipath_serial; + int len; + memcpy(snp, ifp->if_sprefix, sizeof ifp->if_sprefix); + snp[sizeof ifp->if_sprefix] = '\0'; + len = strlen(snp); + snp += len; + len = (sizeof dd->ipath_serial) - len; + if (len > sizeof ifp->if_serial) { + len = sizeof ifp->if_serial; + } + memcpy(snp, ifp->if_serial, len); + } else + memcpy(dd->ipath_serial, ifp->if_serial, + sizeof ifp->if_serial); + ipath_cdbg(VERBOSE, "Initted GUID to %llx from eeprom\n", (unsigned long long) be64_to_cpu(dd->ipath_guid)); diff -r 21378f21e091 -r 8fbb5d71823a drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:26 2006 -0700 @@ -491,8 +491,11 @@ struct ipath_devdata { u16 ipath_lid; /* list of pkeys programmed; 0 if not set */ u16 ipath_pkeys[4]; - /* ASCII serial number, from flash */ - u8 ipath_serial[12]; + /* + * ASCII serial number, from flash, large enough for original + * all digit strings, and longer QLogic serial number format + */ + u8 ipath_serial[16]; /* human readable board version */ u8 ipath_boardversion[80]; /* chip major rev, from ipath_revision */ diff -r 21378f21e091 -r 8fbb5d71823a drivers/infiniband/hw/ipath/ipath_pe800.c --- a/drivers/infiniband/hw/ipath/ipath_pe800.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_pe800.c Thu Jun 29 14:33:26 2006 -0700 @@ -533,7 +533,7 @@ static int ipath_pe_boardname(struct ipa if (n) snprintf(name, namelen, "%s", n); - if (dd->ipath_majrev != 4 || dd->ipath_minrev != 1) { + if (dd->ipath_majrev != 4 || !dd->ipath_minrev || 
dd->ipath_minrev>2) { ipath_dev_err(dd, "Unsupported PE-800 revision %u.%u!\n", dd->ipath_majrev, dd->ipath_minrev); ret = 1; From bos at pathscale.com Thu Jun 29 14:41:25 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:25 -0700 Subject: [openib-general] [PATCH 34 of 39] IB/ipath - fix a bug that results in addresses near 0 being written via DMA In-Reply-To: Message-ID: We can't tell for sure if any packets are in the infinipath receive buffer when we shut down a chip port. Normally this is taken care of by orderly shutdown, but when processes are terminated, or sending process has a bug, we can continue to receive packets. So rather than writing zero to the address registers for the closing port, we point it at a dummy memory. Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r a7c1ad1e090b -r b6ebaf2dd2fd drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 @@ -1824,6 +1824,12 @@ static void cleanup_device(struct ipath_ dd->ipath_pioavailregs_phys); dd->ipath_pioavailregs_dma = NULL; } + if (dd->ipath_dummy_hdrq) { + dma_free_coherent(&dd->pcidev->dev, + dd->ipath_pd[0]->port_rcvhdrq_size, + dd->ipath_dummy_hdrq, dd->ipath_dummy_hdrq_phys); + dd->ipath_dummy_hdrq = NULL; + } if (dd->ipath_pageshadow) { struct page **tmpp = dd->ipath_pageshadow; diff -r a7c1ad1e090b -r b6ebaf2dd2fd drivers/infiniband/hw/ipath/ipath_file_ops.c --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:26 2006 -0700 @@ -1486,41 +1486,50 @@ static int ipath_close(struct inode *in, } if (dd->ipath_kregbase) { - ipath_write_kreg_port( - dd, dd->ipath_kregs->kr_rcvhdrtailaddr, - port, 0ULL); - ipath_write_kreg_port( - dd, dd->ipath_kregs->kr_rcvhdraddr, - pd->port_port, 0); + int i; + /* atomically clear 
receive enable port. */ + clear_bit(INFINIPATH_R_PORTENABLE_SHIFT + port, + &dd->ipath_rcvctrl); + ipath_write_kreg( dd, dd->ipath_kregs->kr_rcvctrl, + dd->ipath_rcvctrl); + /* and read back from chip to be sure that nothing + * else is in flight when we do the rest */ + (void)ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); /* clean up the pkeys for this port user */ ipath_clean_part_key(pd, dd); - if (port < dd->ipath_cfgports) { - int i = dd->ipath_pbufsport * (port - 1); - ipath_disarm_piobufs(dd, i, dd->ipath_pbufsport); - - /* atomically clear receive enable port. */ - clear_bit(INFINIPATH_R_PORTENABLE_SHIFT + port, - &dd->ipath_rcvctrl); - ipath_write_kreg( - dd, - dd->ipath_kregs->kr_rcvctrl, - dd->ipath_rcvctrl); - - if (dd->ipath_pageshadow) - unlock_expected_tids(pd); - ipath_stats.sps_ports--; - ipath_cdbg(PROC, "%s[%u] closed port %u:%u\n", - pd->port_comm, pd->port_pid, - dd->ipath_unit, port); - } + + /* + * be paranoid, and never write 0's to these, just use an + * unused part of the port 0 tail page. Of course, + * rcvhdraddr points to a large chunk of memory, so this + * could still trash things, but at least it won't trash + * page 0, and by disabling the port, it should stop "soon", + * even if a packet or two is in already in flight after we + * disabled the port. 
+ */ + ipath_write_kreg_port(dd, + dd->ipath_kregs->kr_rcvhdrtailaddr, port, + dd->ipath_dummy_hdrq_phys); + ipath_write_kreg_port(dd, dd->ipath_kregs->kr_rcvhdraddr, + pd->port_port, dd->ipath_dummy_hdrq_phys); + + i = dd->ipath_pbufsport * (port - 1); + ipath_disarm_piobufs(dd, i, dd->ipath_pbufsport); + + if (dd->ipath_pageshadow) + unlock_expected_tids(pd); + ipath_stats.sps_ports--; + ipath_cdbg(PROC, "%s[%u] closed port %u:%u\n", + pd->port_comm, pd->port_pid, + dd->ipath_unit, port); + + dd->ipath_f_clear_tids(dd, pd->port_port); } pd->port_cnt = 0; pd->port_pid = 0; - - dd->ipath_f_clear_tids(dd, pd->port_port); dd->ipath_pd[pd->port_port] = NULL; /* before releasing mutex */ mutex_unlock(&ipath_mutex); diff -r a7c1ad1e090b -r b6ebaf2dd2fd drivers/infiniband/hw/ipath/ipath_init_chip.c --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c Thu Jun 29 14:33:26 2006 -0700 @@ -647,6 +647,7 @@ int ipath_init_chip(struct ipath_devdata u32 val32, kpiobufs; u64 val; struct ipath_portdata *pd = NULL; /* keep gcc4 happy */ + gfp_t gfp_flags = GFP_USER | __GFP_COMP; ret = init_housekeeping(dd, &pd, reinit); if (ret) @@ -833,6 +834,22 @@ int ipath_init_chip(struct ipath_devdata "rcvhdrq and/or egr bufs\n"); else enable_chip(dd, pd, reinit); + + + if (!ret && !reinit) { + /* used when we close a port, for DMA already in flight at close */ + dd->ipath_dummy_hdrq = dma_alloc_coherent( + &dd->pcidev->dev, pd->port_rcvhdrq_size, + &dd->ipath_dummy_hdrq_phys, + gfp_flags); + if (!dd->ipath_dummy_hdrq ) { + dev_info(&dd->pcidev->dev, + "Couldn't allocate 0x%lx bytes for dummy hdrq\n", + pd->port_rcvhdrq_size); + /* fallback to just 0'ing */ + dd->ipath_dummy_hdrq_phys = 0UL; + } + } /* * cause retrigger of pending interrupts ignored during init, diff -r a7c1ad1e090b -r b6ebaf2dd2fd drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:26 2006 
-0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:26 2006 -0700 @@ -352,6 +352,8 @@ struct ipath_devdata { /* check for stale messages in rcv queue */ /* only allow one intr at a time. */ unsigned long ipath_rcv_pending; + void *ipath_dummy_hdrq; /* used after port close */ + dma_addr_t ipath_dummy_hdrq_phys; /* * Shadow copies of registers; size indicates read access size. From bos at pathscale.com Thu Jun 29 14:41:21 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:21 -0700 Subject: [openib-general] [PATCH 30 of 39] IB/ipath - purge sps_lid and sps_mlid arrays In-Reply-To: Message-ID: <3ceb73f8bde0e0335b54.1151617281@eng-12.pathscale.com> The two arrays only had space for 4 units. Also renamed ipath_set_sps_lid() to ipath_set_lid(); the sps prefix was a leftover. Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r 1bef8244297a -r 3ceb73f8bde0 drivers/infiniband/hw/ipath/ipath_common.h --- a/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:26 2006 -0700 @@ -122,8 +122,7 @@ struct infinipath_stats { __u64 sps_ports; /* list of pkeys (other than default) accepted (0 means not set) */ __u16 sps_pkeys[4]; - /* lids for up to 4 infinipaths, indexed by infinipath # */ - __u16 sps_lid[4]; + __u16 sps_unused16[4]; /* available; maintaining compatible layout */ /* number of user ports per chip (not IB ports) */ __u32 sps_nports; /* not our interrupt, or already handled */ @@ -141,10 +140,8 @@ struct infinipath_stats { * packets if ipath not configured, sma/mad, etc.)
*/ __u64 sps_krdrops; - /* mlids for up to 4 infinipaths, indexed by infinipath # */ - __u16 sps_mlid[4]; /* pad for future growth */ - __u64 __sps_pad[45]; + __u64 __sps_pad[46]; }; /* diff -r 1bef8244297a -r 3ceb73f8bde0 drivers/infiniband/hw/ipath/ipath_init_chip.c --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c Thu Jun 29 14:33:26 2006 -0700 @@ -811,8 +811,6 @@ int ipath_init_chip(struct ipath_devdata /* clear any interrups up to this point (ints still not enabled) */ ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, -1LL); - ipath_stats.sps_lid[dd->ipath_unit] = dd->ipath_lid; - /* * Set up the port 0 (kernel) rcvhdr q and egr TIDs. If doing * re-init, the simplest way to handle this is to free diff -r 1bef8244297a -r 3ceb73f8bde0 drivers/infiniband/hw/ipath/ipath_layer.c --- a/drivers/infiniband/hw/ipath/ipath_layer.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_layer.c Thu Jun 29 14:33:26 2006 -0700 @@ -300,9 +300,8 @@ bail: EXPORT_SYMBOL_GPL(ipath_layer_set_mtu); -int ipath_set_sps_lid(struct ipath_devdata *dd, u32 arg, u8 lmc) -{ - ipath_stats.sps_lid[dd->ipath_unit] = arg; +int ipath_set_lid(struct ipath_devdata *dd, u32 arg, u8 lmc) +{ dd->ipath_lid = arg; dd->ipath_lmc = lmc; @@ -316,7 +315,7 @@ int ipath_set_sps_lid(struct ipath_devda return 0; } -EXPORT_SYMBOL_GPL(ipath_set_sps_lid); +EXPORT_SYMBOL_GPL(ipath_set_lid); int ipath_layer_set_guid(struct ipath_devdata *dd, __be64 guid) { @@ -632,9 +631,9 @@ int ipath_layer_open(struct ipath_devdat if (*dd->ipath_statusp & IPATH_STATUS_IB_READY) intval |= IPATH_LAYER_INT_IF_UP; - if (ipath_stats.sps_lid[dd->ipath_unit]) + if (dd->ipath_lid) intval |= IPATH_LAYER_INT_LID; - if (ipath_stats.sps_mlid[dd->ipath_unit]) + if (dd->ipath_mlid) intval |= IPATH_LAYER_INT_BCAST; /* * do this on open, in case low level is already up and diff -r 1bef8244297a -r 3ceb73f8bde0 
drivers/infiniband/hw/ipath/ipath_layer.h --- a/drivers/infiniband/hw/ipath/ipath_layer.h Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_layer.h Thu Jun 29 14:33:26 2006 -0700 @@ -129,7 +129,7 @@ u32 ipath_layer_get_cr_errpkey(struct ip u32 ipath_layer_get_cr_errpkey(struct ipath_devdata *dd); int ipath_layer_set_linkstate(struct ipath_devdata *dd, u8 state); int ipath_layer_set_mtu(struct ipath_devdata *, u16); -int ipath_set_sps_lid(struct ipath_devdata *, u32, u8); +int ipath_set_lid(struct ipath_devdata *, u32, u8); int ipath_layer_send_hdr(struct ipath_devdata *dd, struct ether_header *hdr); int ipath_verbs_send(struct ipath_devdata *dd, u32 hdrwords, diff -r 1bef8244297a -r 3ceb73f8bde0 drivers/infiniband/hw/ipath/ipath_mad.c --- a/drivers/infiniband/hw/ipath/ipath_mad.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_mad.c Thu Jun 29 14:33:26 2006 -0700 @@ -308,7 +308,7 @@ static int recv_subn_set_portinfo(struct /* Must be a valid unicast LID address. 
*/ if (lid == 0 || lid >= IPS_MULTICAST_LID_BASE) goto err; - ipath_set_sps_lid(dev->dd, lid, pip->mkeyprot_resv_lmc & 7); + ipath_set_lid(dev->dd, lid, pip->mkeyprot_resv_lmc & 7); event.event = IB_EVENT_LID_CHANGE; ib_dispatch_event(&event); } diff -r 1bef8244297a -r 3ceb73f8bde0 drivers/infiniband/hw/ipath/ipath_sysfs.c --- a/drivers/infiniband/hw/ipath/ipath_sysfs.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c Thu Jun 29 14:33:26 2006 -0700 @@ -115,11 +115,6 @@ DRIVER_STAT(pkey1, pkeys[1]); DRIVER_STAT(pkey1, pkeys[1]); DRIVER_STAT(pkey2, pkeys[2]); DRIVER_STAT(pkey3, pkeys[3]); -/* XXX fix the following when dynamic table of devices used */ -DRIVER_STAT(lid0, lid[0]); -DRIVER_STAT(lid1, lid[1]); -DRIVER_STAT(lid2, lid[2]); -DRIVER_STAT(lid3, lid[3]); DRIVER_STAT(nports, nports); DRIVER_STAT(null_intr, nullintr); @@ -128,11 +123,6 @@ DRIVER_STAT(page_locks, pagelocks); DRIVER_STAT(page_locks, pagelocks); DRIVER_STAT(page_unlocks, pageunlocks); DRIVER_STAT(krdrops, krdrops); -/* XXX fix the following when dynamic table of devices used */ -DRIVER_STAT(mlid0, mlid[0]); -DRIVER_STAT(mlid1, mlid[1]); -DRIVER_STAT(mlid2, mlid[2]); -DRIVER_STAT(mlid3, mlid[3]); static struct attribute *driver_stat_attributes[] = { &driver_attr_intrs.attr, @@ -155,10 +145,6 @@ static struct attribute *driver_stat_att &driver_attr_pkey1.attr, &driver_attr_pkey2.attr, &driver_attr_pkey3.attr, - &driver_attr_lid0.attr, - &driver_attr_lid1.attr, - &driver_attr_lid2.attr, - &driver_attr_lid3.attr, &driver_attr_nports.attr, &driver_attr_null_intr.attr, &driver_attr_max_pkts_call.attr, @@ -166,10 +152,6 @@ static struct attribute *driver_stat_att &driver_attr_page_locks.attr, &driver_attr_page_unlocks.attr, &driver_attr_krdrops.attr, - &driver_attr_mlid0.attr, - &driver_attr_mlid1.attr, - &driver_attr_mlid2.attr, - &driver_attr_mlid3.attr, NULL }; @@ -273,7 +255,7 @@ static ssize_t store_lid(struct device * size_t count) { struct ipath_devdata *dd = 
dev_get_drvdata(dev); - u16 lid; + u16 lid = 0; int ret; ret = ipath_parse_ushort(buf, &lid); @@ -285,11 +267,11 @@ static ssize_t store_lid(struct device * goto invalid; } - ipath_set_sps_lid(dd, lid, 0); + ipath_set_lid(dd, lid, 0); goto bail; invalid: - ipath_dev_err(dd, "attempt to set invalid LID\n"); + ipath_dev_err(dd, "attempt to set invalid LID 0x%x\n", lid); bail: return ret; } @@ -320,7 +302,6 @@ static ssize_t store_mlid(struct device unit = dd->ipath_unit; dd->ipath_mlid = mlid; - ipath_stats.sps_mlid[unit] = mlid; ipath_layer_intr(dd, IPATH_LAYER_INT_BCAST); goto bail; From bos at pathscale.com Thu Jun 29 14:41:11 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:11 -0700 Subject: [openib-general] [PATCH 20 of 39] IB/ipath - reduce overhead on receive interrupts In-Reply-To: Message-ID: <8bc865893a11e5c8772c.1151617271@eng-12.pathscale.com> Also count the number of interrupts where the fast path works (fastrcvint). On any interrupt where the port0 head and tail registers are not equal, just call the ipath_kreceive code without reading the interrupt status, saving the roughly 0.25 usec processor stall spent waiting for the read to return. If any other interrupt bits are set, or head==tail, take the normal path, but that has been reordered to handle receive ahead of pioavail. Also no longer call ipath_kreceive() from ipath_qcheck(), because that just seems to make things worse and isn't really buying us anything these days. Also no longer loop in ipath_kreceive(); better not to hold things off too long (I saw many cases where we would loop 4-8 times and handle thousands of packets (up to 3500) in a single call).
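The fast-path idea described above can be sketched in isolation. The following is a hypothetical host-side model, not the driver's real code: fake_chip, fast_rcv_intr and the counters are invented for illustration. The point it preserves is the ordering: if the port-0 head and the DMA'd tail disagree, ack the receive bits and process packets without ever reading the interrupt status register, and only pay for the intstat read when the head fails to advance.

```c
#include <stdint.h>

/* Illustrative stand-in for the device state; all names are made up. */
struct fake_chip {
	uint32_t port0_head;      /* software head index */
	uint32_t port0_tail;      /* tail the chip DMA'd into memory */
	uint32_t intclear_writes; /* cheap writes to the intclear register */
	uint32_t intstat_reads;   /* expensive PIO reads we want to avoid */
};

/* pretend to drain the port-0 receive queue */
static void fake_kreceive(struct fake_chip *c)
{
	c->port0_head = c->port0_tail;
}

/* returns 1 when the fast path handled the interrupt */
static int fast_rcv_intr(struct fake_chip *c)
{
	if (c->port0_head != c->port0_tail) {
		uint32_t oldhead = c->port0_head;

		/* blindly ack the port-0 receive bits first, so packets
		 * that arrive while we process re-trigger the interrupt */
		c->intclear_writes++;
		fake_kreceive(c);
		if (oldhead != c->port0_head)
			return 1; /* fast path: intstat never read */
	}
	/* slow path: head == tail, or nothing was processed */
	c->intstat_reads++;
	return 0;
}
```

In the real driver the tail comes from le64_to_cpu(*dd->ipath_hdrqtailptr) and the ack is a write to kr_intclear; the sketch keeps only the comparison, the ack-before-process ordering, and the fallback.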
Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r 1e8837473193 -r 8bc865893a11 drivers/infiniband/hw/ipath/ipath_common.h --- a/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:25 2006 -0700 @@ -97,8 +97,8 @@ struct infinipath_stats { __u64 sps_hwerrs; /* number of times IB link changed state unexpectedly */ __u64 sps_iblink; - /* no longer used; left for compatibility */ - __u64 sps_unused3; + /* kernel receive interrupts that didn't read intstat */ + __u64 sps_fastrcvint; /* number of kernel (port0) packets received */ __u64 sps_port0pkts; /* number of "ethernet" packets sent by driver */ diff -r 1e8837473193 -r 8bc865893a11 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 @@ -888,12 +888,7 @@ void ipath_kreceive(struct ipath_devdata (u32)le64_to_cpu(*dd->ipath_hdrqtailptr)) goto done; -gotmore: - /* - * read only once at start. If in flood situation, this helps - * performance slightly. 
If more arrive while we are processing, - * we'll come back here and do them - */ + /* read only once at start for performance */ hdrqtail = (u32)le64_to_cpu(*dd->ipath_hdrqtailptr); for (i = 0, l = dd->ipath_port0head; l != hdrqtail; i++) { @@ -1022,10 +1017,6 @@ gotmore: dd->ipath_port0head = l; - if (hdrqtail != (u32)le64_to_cpu(*dd->ipath_hdrqtailptr)) - /* more arrived while we handled first batch */ - goto gotmore; - if (pkttot > ipath_stats.sps_maxpkts_call) ipath_stats.sps_maxpkts_call = pkttot; ipath_stats.sps_port0pkts += pkttot; diff -r 1e8837473193 -r 8bc865893a11 drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:25 2006 -0700 @@ -539,10 +539,10 @@ static void handle_errors(struct ipath_d continue; if (hd == (tl + 1) || (!hd && tl == dd->ipath_hdrqlast)) { + if (i == 0) + chkerrpkts = 1; dd->ipath_lastrcvhdrqtails[i] = tl; pd->port_hdrqfull++; - if (i == 0) - chkerrpkts = 1; } } } @@ -724,7 +724,12 @@ set: dd->ipath_sendctrl); } -static void handle_rcv(struct ipath_devdata *dd, u32 istat) +/* + * Handle receive interrupts for user ports; this means a user + * process was waiting for a packet to arrive, and didn't want + * to poll + */ +static void handle_urcv(struct ipath_devdata *dd, u32 istat) { u64 portr; int i; @@ -734,22 +739,17 @@ static void handle_rcv(struct ipath_devd infinipath_i_rcvavail_mask) | ((istat >> INFINIPATH_I_RCVURG_SHIFT) & infinipath_i_rcvurg_mask); - for (i = 0; i < dd->ipath_cfgports; i++) { + for (i = 1; i < dd->ipath_cfgports; i++) { struct ipath_portdata *pd = dd->ipath_pd[i]; - if (portr & (1 << i) && pd && - pd->port_cnt) { - if (i == 0) - ipath_kreceive(dd); - else if (test_bit(IPATH_PORT_WAITING_RCV, - &pd->port_flag)) { - int rcbit; - clear_bit(IPATH_PORT_WAITING_RCV, - &pd->port_flag); - rcbit = i + INFINIPATH_R_INTRAVAIL_SHIFT; - clear_bit(1UL << rcbit, &dd->ipath_rcvctrl); - 
wake_up_interruptible(&pd->port_wait); - rcvdint = 1; - } + if (portr & (1 << i) && pd && pd->port_cnt && + test_bit(IPATH_PORT_WAITING_RCV, &pd->port_flag)) { + int rcbit; + clear_bit(IPATH_PORT_WAITING_RCV, + &pd->port_flag); + rcbit = i + INFINIPATH_R_INTRAVAIL_SHIFT; + clear_bit(1UL << rcbit, &dd->ipath_rcvctrl); + wake_up_interruptible(&pd->port_wait); + rcvdint = 1; } } if (rcvdint) { @@ -767,14 +767,17 @@ irqreturn_t ipath_intr(int irq, void *da struct ipath_devdata *dd = data; u32 istat; ipath_err_t estat = 0; + irqreturn_t ret; + u32 p0bits; static unsigned unexpected = 0; - irqreturn_t ret; - - if(!(dd->ipath_flags & IPATH_PRESENT)) { - /* this is mostly so we don't try to touch the chip while - * it is being reset */ - /* - * This return value is perhaps odd, but we do not want the + static const u32 port0rbits = (1U<ipath_flags & IPATH_PRESENT)) { + /* + * This return value is not great, but we do not want the * interrupt core code to remove our interrupt handler * because we don't appear to be handling an interrupt * during a chip reset. @@ -782,6 +785,50 @@ irqreturn_t ipath_intr(int irq, void *da return IRQ_HANDLED; } + /* + * this needs to be flags&initted, not statusp, so we keep + * taking interrupts even after link goes down, etc. + * Also, we *must* clear the interrupt at some point, or we won't + * take it again, which can be real bad for errors, etc... + */ + + if (!(dd->ipath_flags & IPATH_INITTED)) { + ipath_bad_intr(dd, &unexpected); + ret = IRQ_NONE; + goto bail; + } + + /* + * We try to avoid reading the interrupt status register, since + * that's a PIO read, and stalls the processor for up to about + * ~0.25 usec. The idea is that if we processed a port0 packet, + * we blindly clear the port 0 receive interrupt bits, and nothing + * else, then return. If other interrupts are pending, the chip + * will re-interrupt us as soon as we write the intclear register. 
+ * We then won't process any more kernel packets (if not the 2nd + * time, then the 3rd or 4th) and we'll then handle the other + * interrupts. We clear the interrupts first so that we don't + * lose intr for later packets that arrive while we are processing. + */ + if (dd->ipath_port0head != + (u32)le64_to_cpu(*dd->ipath_hdrqtailptr)) { + u32 oldhead = dd->ipath_port0head; + if (dd->ipath_flags & IPATH_GPIO_INTR) { + ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_clear, + (u64) (1 << 2)); + p0bits = port0rbits | INFINIPATH_I_GPIO; + } + else + p0bits = port0rbits; + ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, p0bits); + ipath_kreceive(dd); + if (oldhead != dd->ipath_port0head) { + ipath_stats.sps_fastrcvint++; + goto done; + } + istat = ipath_read_kreg32(dd, dd->ipath_kregs->kr_intstatus); + } + istat = ipath_read_kreg32(dd, dd->ipath_kregs->kr_intstatus); if (unlikely(!istat)) { ipath_stats.sps_nullintr++; @@ -795,31 +842,17 @@ irqreturn_t ipath_intr(int irq, void *da goto bail; } - ipath_stats.sps_ints++; - - /* - * this needs to be flags&initted, not statusp, so we keep - * taking interrupts even after link goes down, etc. - * Also, we *must* clear the interrupt at some point, or we won't - * take it again, which can be real bad for errors, etc... 
- */ - - if (!(dd->ipath_flags & IPATH_INITTED)) { - ipath_bad_intr(dd, &unexpected); - ret = IRQ_NONE; - goto bail; - } if (unexpected) unexpected = 0; - ipath_cdbg(VERBOSE, "intr stat=0x%x\n", istat); - - if (istat & ~infinipath_i_bitsextant) + if (unlikely(istat & ~infinipath_i_bitsextant)) ipath_dev_err(dd, "interrupt with unknown interrupts %x set\n", istat & (u32) ~ infinipath_i_bitsextant); - - if (istat & INFINIPATH_I_ERROR) { + else + ipath_cdbg(VERBOSE, "intr stat=0x%x\n", istat); + + if (unlikely(istat & INFINIPATH_I_ERROR)) { ipath_stats.sps_errints++; estat = ipath_read_kreg64(dd, dd->ipath_kregs->kr_errorstatus); @@ -837,7 +870,14 @@ irqreturn_t ipath_intr(int irq, void *da handle_errors(dd, estat); } + p0bits = port0rbits; if (istat & INFINIPATH_I_GPIO) { + /* + * Packets are available in the port 0 rcv queue. + * Eventually this needs to be generalized to check + * IPATH_GPIO_INTR, and the specific GPIO bit, if + * GPIO interrupts are used for anything else. + */ if (unlikely(!(dd->ipath_flags & IPATH_GPIO_INTR))) { u32 gpiostatus; gpiostatus = ipath_read_kreg32( @@ -851,14 +891,7 @@ irqreturn_t ipath_intr(int irq, void *da /* Clear GPIO status bit 2 */ ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_clear, (u64) (1 << 2)); - - /* - * Packets are available in the port 0 rcv queue. - * Eventually this needs to be generalized to check - * IPATH_GPIO_INTR, and the specific GPIO bit, if - * GPIO interrupts are used for anything else. - */ - ipath_kreceive(dd); + p0bits |= INFINIPATH_I_GPIO; } } @@ -871,6 +904,25 @@ irqreturn_t ipath_intr(int irq, void *da */ ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, istat); + /* + * we check for both transition from empty to non-empty, and urgent + * packets (those with the interrupt bit set in the header), and + * if enabled, the GPIO bit 2 interrupt used for port0 on some + * HT-400 boards. 
+ * Do this before checking for pio buffers available, since + * receives can overflow; piobuf waiters can afford a few + * extra cycles, since they were waiting anyway. + */ + if (istat & p0bits) { + ipath_kreceive(dd); + istat &= ~port0rbits; + } + if (istat & ((infinipath_i_rcvavail_mask << + INFINIPATH_I_RCVAVAIL_SHIFT) + | (infinipath_i_rcvurg_mask << + INFINIPATH_I_RCVURG_SHIFT))) + handle_urcv(dd, istat); + if (istat & INFINIPATH_I_SPIOBUFAVAIL) { clear_bit(IPATH_S_PIOINTBUFAVAIL, &dd->ipath_sendctrl); ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, @@ -882,17 +934,7 @@ irqreturn_t ipath_intr(int irq, void *da handle_layer_pioavail(dd); } - /* - * we check for both transition from empty to non-empty, and urgent - * packets (those with the interrupt bit set in the header) - */ - - if (istat & ((infinipath_i_rcvavail_mask << - INFINIPATH_I_RCVAVAIL_SHIFT) - | (infinipath_i_rcvurg_mask << - INFINIPATH_I_RCVURG_SHIFT))) - handle_rcv(dd, istat); - +done: ret = IRQ_HANDLED; bail: diff -r 1e8837473193 -r 8bc865893a11 drivers/infiniband/hw/ipath/ipath_stats.c --- a/drivers/infiniband/hw/ipath/ipath_stats.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_stats.c Thu Jun 29 14:33:25 2006 -0700 @@ -186,7 +186,6 @@ static void ipath_qcheck(struct ipath_de dd->ipath_port0head, (unsigned long long) ipath_stats.sps_port0pkts); - ipath_kreceive(dd); } dd->ipath_lastport0rcv_cnt = ipath_stats.sps_port0pkts; } From bos at pathscale.com Thu Jun 29 14:41:10 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:10 -0700 Subject: [openib-general] [PATCH 19 of 39] IB/ipath - memory management cleanups In-Reply-To: Message-ID: <1e88374731937c2d4379.1151617270@eng-12.pathscale.com> Made the in-memory rcvhdrq tail update live in dma_alloc'ed memory, not random user or special kernel memory (needed for ppc; also "just the right thing to do").
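The dma_alloc'ed tail update reduces to a simple pattern: allocate coherent memory once, let the chip DMA the tail index into it, and have the hot path poll plain memory instead of doing a register read. Below is a hedged, host-only simulation; fake_port, chip_dma_tail and port_pending are invented names, and calloc() merely stands in for the dma_alloc_coherent() call the driver actually makes.

```c
#include <stdint.h>
#include <stdlib.h>

/* Host-side simulation of a DMA-written tail pointer; names invented. */
struct fake_port {
	volatile uint64_t *tail_kvaddr; /* stand-in for port_rcvhdrtail_kvaddr */
	uint32_t head;                  /* software head index */
};

static int port_init(struct fake_port *p)
{
	/* stand-in for dma_alloc_coherent(&dev, PAGE_SIZE, &phys, GFP_KERNEL) */
	p->tail_kvaddr = calloc(1, sizeof(uint64_t));
	p->head = 0;
	return p->tail_kvaddr ? 0 : -1;
}

/* "chip" side: the DMA write of the new tail index */
static void chip_dma_tail(struct fake_port *p, uint64_t tail)
{
	*p->tail_kvaddr = tail;
}

/* hot path: how many entries are pending, with no register read at all */
static uint32_t port_pending(const struct fake_port *p)
{
	return (uint32_t)*p->tail_kvaddr - p->head;
}
```

In the real driver the chip writes the value little-endian, which is why reads of port_rcvhdrtail_kvaddr go through le64_to_cpu(); the simulation skips that detail.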
Some cleanups to make unexpected link transitions less likely to produce complaints about packet errors, and also to not leave SMA packets stuck and unable to go out. A few other random debug and comment cleanups. Always init rcvhdrq head/tail registers to 0, to avoid race conditions (should have been that way some time ago). Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r 9c072f8e7e68 -r 1e8837473193 drivers/infiniband/hw/ipath/ipath_common.h --- a/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:25 2006 -0700 @@ -311,6 +311,9 @@ struct ipath_base_info { __u32 spi_rcv_egrchunksize; /* total size of mmap to cover full rcvegrbuffers */ __u32 spi_rcv_egrbuftotlen; + __u32 spi_filler_for_align; + /* address of readonly memory copy of the rcvhdrq tail register. */ + __u64 spi_rcvhdr_tailaddr; } __attribute__ ((aligned(8))); @@ -380,13 +383,7 @@ struct ipath_user_info { */ __u32 spu_rcvhdrsize; - /* - * cache line aligned (64 byte) user address to - * which the rcvhdrtail register will be written by infinipath - * whenever it changes, so that no chip registers are read in - * the performance path. - */ - __u64 spu_rcvhdraddr; + __u64 spu_unused; /* kept for compatible layout */ /* * address of struct base_info to write to diff -r 9c072f8e7e68 -r 1e8837473193 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 @@ -131,14 +131,6 @@ static struct pci_driver ipath_driver = .id_table = ipath_pci_tbl, }; -/* - * This is where port 0's rcvhdrtail register is written back; we also - * want nothing else sharing the cache line, so make it a cache line - * in size. Used for all units. 
- */ -volatile __le64 *ipath_port0_rcvhdrtail; -dma_addr_t ipath_port0_rcvhdrtail_dma; -static int port0_rcvhdrtail_refs; static inline void read_bars(struct ipath_devdata *dd, struct pci_dev *dev, u32 *bar0, u32 *bar1) @@ -268,47 +260,6 @@ int ipath_count_units(int *npresentp, in return nunits; } -static int init_port0_rcvhdrtail(struct pci_dev *pdev) -{ - int ret; - - mutex_lock(&ipath_mutex); - - if (!ipath_port0_rcvhdrtail) { - ipath_port0_rcvhdrtail = - dma_alloc_coherent(&pdev->dev, - IPATH_PORT0_RCVHDRTAIL_SIZE, - &ipath_port0_rcvhdrtail_dma, - GFP_KERNEL); - - if (!ipath_port0_rcvhdrtail) { - ret = -ENOMEM; - goto bail; - } - } - port0_rcvhdrtail_refs++; - ret = 0; - -bail: - mutex_unlock(&ipath_mutex); - - return ret; -} - -static void cleanup_port0_rcvhdrtail(struct pci_dev *pdev) -{ - mutex_lock(&ipath_mutex); - - if (!--port0_rcvhdrtail_refs) { - dma_free_coherent(&pdev->dev, IPATH_PORT0_RCVHDRTAIL_SIZE, - (void *) ipath_port0_rcvhdrtail, - ipath_port0_rcvhdrtail_dma); - ipath_port0_rcvhdrtail = NULL; - } - - mutex_unlock(&ipath_mutex); -} - /* * These next two routines are placeholders in case we don't have per-arch * code for controlling write combining. 
If explicit control of write @@ -333,20 +284,12 @@ static int __devinit ipath_init_one(stru u32 bar0 = 0, bar1 = 0; u8 rev; - ret = init_port0_rcvhdrtail(pdev); - if (ret < 0) { - printk(KERN_ERR IPATH_DRV_NAME - ": Could not allocate port0_rcvhdrtail: error %d\n", - -ret); - goto bail; - } - dd = ipath_alloc_devdata(pdev); if (IS_ERR(dd)) { ret = PTR_ERR(dd); printk(KERN_ERR IPATH_DRV_NAME ": Could not allocate devdata: error %d\n", -ret); - goto bail_rcvhdrtail; + goto bail; } ipath_cdbg(VERBOSE, "initializing unit #%u\n", dd->ipath_unit); @@ -574,9 +517,6 @@ bail_devdata: bail_devdata: ipath_free_devdata(pdev, dd); -bail_rcvhdrtail: - cleanup_port0_rcvhdrtail(pdev); - bail: return ret; } @@ -608,7 +548,6 @@ static void __devexit ipath_remove_one(s pci_disable_device(pdev); ipath_free_devdata(pdev, dd); - cleanup_port0_rcvhdrtail(pdev); } /* general driver use */ @@ -1383,26 +1322,20 @@ bail: * @dd: the infinipath device * @pd: the port data * - * this *must* be physically contiguous memory, and for now, - * that limits it to what kmalloc can do. + * this must be contiguous memory (from an i/o perspective), and must be + * DMA'able (which means for some systems, it will go through an IOMMU, + * or be forced into a low address range). */ int ipath_create_rcvhdrq(struct ipath_devdata *dd, struct ipath_portdata *pd) { - int ret = 0, amt; - - amt = ALIGN(dd->ipath_rcvhdrcnt * dd->ipath_rcvhdrentsize * - sizeof(u32), PAGE_SIZE); + int ret = 0; + if (!pd->port_rcvhdrq) { - /* - * not using REPEAT isn't viable; at 128KB, we can easily - * fail this. The problem with REPEAT is we can block here - * "forever". There isn't an inbetween, unfortunately. We - * could reduce the risk by never freeing the rcvhdrq except - * at unload, but even then, the first time a port is used, - * we could delay for some time... 
- */ + dma_addr_t phys_hdrqtail; gfp_t gfp_flags = GFP_USER | __GFP_COMP; + int amt = ALIGN(dd->ipath_rcvhdrcnt * dd->ipath_rcvhdrentsize * + sizeof(u32), PAGE_SIZE); pd->port_rcvhdrq = dma_alloc_coherent( &dd->pcidev->dev, amt, &pd->port_rcvhdrq_phys, @@ -1415,6 +1348,16 @@ int ipath_create_rcvhdrq(struct ipath_de ret = -ENOMEM; goto bail; } + pd->port_rcvhdrtail_kvaddr = dma_alloc_coherent( + &dd->pcidev->dev, PAGE_SIZE, &phys_hdrqtail, GFP_KERNEL); + if (!pd->port_rcvhdrtail_kvaddr) { + ipath_dev_err(dd, "attempt to allocate 1 page " + "for port %u rcvhdrqtailaddr failed\n", + pd->port_port); + ret = -ENOMEM; + goto bail; + } + pd->port_rcvhdrqtailaddr_phys = phys_hdrqtail; pd->port_rcvhdrq_size = amt; @@ -1424,20 +1367,28 @@ int ipath_create_rcvhdrq(struct ipath_de (unsigned long) pd->port_rcvhdrq_phys, (unsigned long) pd->port_rcvhdrq_size, pd->port_port); - } else { - /* - * clear for security, sanity, and/or debugging, each - * time we reuse - */ - memset(pd->port_rcvhdrq, 0, amt); - } + + ipath_cdbg(VERBOSE, "port %d hdrtailaddr, %llx physical\n", + pd->port_port, + (unsigned long long) phys_hdrqtail); + } + else + ipath_cdbg(VERBOSE, "reuse port %d rcvhdrq @%p %llx phys; " + "hdrtailaddr@%p %llx physical\n", + pd->port_port, pd->port_rcvhdrq, + pd->port_rcvhdrq_phys, pd->port_rcvhdrtail_kvaddr, + (unsigned long long)pd->port_rcvhdrqtailaddr_phys); + + /* clear for security and sanity on each use */ + memset(pd->port_rcvhdrq, 0, pd->port_rcvhdrq_size); + memset((void *)pd->port_rcvhdrtail_kvaddr, 0, PAGE_SIZE); /* * tell chip each time we init it, even if we are re-using previous - * memory (we zero it at process close) - */ - ipath_cdbg(VERBOSE, "writing port %d rcvhdraddr as %lx\n", - pd->port_port, (unsigned long) pd->port_rcvhdrq_phys); + * memory (we zero the register at process close) + */ + ipath_write_kreg_port(dd, dd->ipath_kregs->kr_rcvhdrtailaddr, + pd->port_port, pd->port_rcvhdrqtailaddr_phys); ipath_write_kreg_port(dd, 
dd->ipath_kregs->kr_rcvhdraddr, pd->port_port, pd->port_rcvhdrq_phys); @@ -1525,15 +1476,27 @@ void ipath_set_ib_lstate(struct ipath_de [INFINIPATH_IBCC_LINKCMD_ARMED] = "ARMED", [INFINIPATH_IBCC_LINKCMD_ACTIVE] = "ACTIVE" }; + int linkcmd = (which >> INFINIPATH_IBCC_LINKCMD_SHIFT) & + INFINIPATH_IBCC_LINKCMD_MASK; + ipath_cdbg(SMA, "Trying to move unit %u to %s, current ltstate " "is %s\n", dd->ipath_unit, - what[(which >> INFINIPATH_IBCC_LINKCMD_SHIFT) & - INFINIPATH_IBCC_LINKCMD_MASK], + what[linkcmd], ipath_ibcstatus_str[ (ipath_read_kreg64 (dd, dd->ipath_kregs->kr_ibcstatus) >> INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT) & INFINIPATH_IBCS_LINKTRAININGSTATE_MASK]); + /* flush all queued sends when going to DOWN or INIT, to be sure that + * they don't block SMA and other MAD packets */ + if (!linkcmd || linkcmd == INFINIPATH_IBCC_LINKCMD_INIT) { + ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, + INFINIPATH_S_ABORT); + ipath_disarm_piobufs(dd, dd->ipath_lastport_piobuf, + (unsigned)(dd->ipath_piobcnt2k + + dd->ipath_piobcnt4k) - + dd->ipath_lastport_piobuf); + } ipath_write_kreg(dd, dd->ipath_kregs->kr_ibcctrl, dd->ipath_ibcctrl | which); @@ -1681,60 +1644,54 @@ void ipath_shutdown_device(struct ipath_ /** * ipath_free_pddata - free a port's allocated data * @dd: the infinipath device - * @port: the port - * @freehdrq: free the port data structure if true - * - * when closing, free up any allocated data for a port, if the - * reference count goes to zero - * Note: this also optionally frees the portdata itself! - * Any changes here have to be matched up with the reinit case - * of ipath_init_chip(), which calls this routine on reinit after reset. 
- */ -void ipath_free_pddata(struct ipath_devdata *dd, u32 port, int freehdrq) -{ - struct ipath_portdata *pd = dd->ipath_pd[port]; - + * @pd: the portdata structure + * + * free up any allocated data for a port + * This should not touch anything that would affect a simultaneous + * re-allocation of port data, because it is called after ipath_mutex + * is released (and can be called from reinit as well). + * It should never change any chip state, or global driver state. + * (The only exception to global state is freeing the port0 port0_skbs.) + */ +void ipath_free_pddata(struct ipath_devdata *dd, struct ipath_portdata *pd) +{ if (!pd) return; - if (freehdrq) - /* - * only clear and free portdata if we are going to also - * release the hdrq, otherwise we leak the hdrq on each - * open/close cycle - */ - dd->ipath_pd[port] = NULL; - if (freehdrq && pd->port_rcvhdrq) { + + if (pd->port_rcvhdrq) { ipath_cdbg(VERBOSE, "free closed port %d rcvhdrq @ %p " "(size=%lu)\n", pd->port_port, pd->port_rcvhdrq, (unsigned long) pd->port_rcvhdrq_size); dma_free_coherent(&dd->pcidev->dev, pd->port_rcvhdrq_size, pd->port_rcvhdrq, pd->port_rcvhdrq_phys); pd->port_rcvhdrq = NULL; - } - if (port && pd->port_rcvegrbuf) { - /* always free this */ - if (pd->port_rcvegrbuf) { - unsigned e; - - for (e = 0; e < pd->port_rcvegrbuf_chunks; e++) { - void *base = pd->port_rcvegrbuf[e]; - size_t size = pd->port_rcvegrbuf_size; - - ipath_cdbg(VERBOSE, "egrbuf free(%p, %lu), " - "chunk %u/%u\n", base, - (unsigned long) size, - e, pd->port_rcvegrbuf_chunks); - dma_free_coherent( - &dd->pcidev->dev, size, base, - pd->port_rcvegrbuf_phys[e]); - } - vfree(pd->port_rcvegrbuf); - pd->port_rcvegrbuf = NULL; - vfree(pd->port_rcvegrbuf_phys); - pd->port_rcvegrbuf_phys = NULL; - } + if (pd->port_rcvhdrtail_kvaddr) { + dma_free_coherent(&dd->pcidev->dev, PAGE_SIZE, + (void *)pd->port_rcvhdrtail_kvaddr, + pd->port_rcvhdrqtailaddr_phys); + pd->port_rcvhdrtail_kvaddr = NULL; + } + } + if (pd->port_port && 
pd->port_rcvegrbuf) { + unsigned e; + + for (e = 0; e < pd->port_rcvegrbuf_chunks; e++) { + void *base = pd->port_rcvegrbuf[e]; + size_t size = pd->port_rcvegrbuf_size; + + ipath_cdbg(VERBOSE, "egrbuf free(%p, %lu), " + "chunk %u/%u\n", base, + (unsigned long) size, + e, pd->port_rcvegrbuf_chunks); + dma_free_coherent(&dd->pcidev->dev, size, + base, pd->port_rcvegrbuf_phys[e]); + } + vfree(pd->port_rcvegrbuf); + pd->port_rcvegrbuf = NULL; + vfree(pd->port_rcvegrbuf_phys); + pd->port_rcvegrbuf_phys = NULL; pd->port_rcvegrbuf_chunks = 0; - } else if (port == 0 && dd->ipath_port0_skbs) { + } else if (pd->port_port == 0 && dd->ipath_port0_skbs) { unsigned e; struct sk_buff **skbs = dd->ipath_port0_skbs; @@ -1746,10 +1703,8 @@ void ipath_free_pddata(struct ipath_devd dev_kfree_skb(skbs[e]); vfree(skbs); } - if (freehdrq) { - kfree(pd->port_tid_pg_list); - kfree(pd); - } + kfree(pd->port_tid_pg_list); + kfree(pd); } static int __init infinipath_init(void) @@ -1874,10 +1829,14 @@ static void cleanup_device(struct ipath_ /* * free any resources still in use (usually just kernel ports) - * at unload - */ - for (port = 0; port < dd->ipath_cfgports; port++) - ipath_free_pddata(dd, port, 1); + * at unload; we do for portcnt, not cfgports, because cfgports + * could have changed while we were loaded. 
+ */ + for (port = 0; port < dd->ipath_portcnt; port++) { + struct ipath_portdata *pd = dd->ipath_pd[port]; + dd->ipath_pd[port] = NULL; + ipath_free_pddata(dd, pd); + } kfree(dd->ipath_pd); /* * debuggability, in case some cleanup path tries to use it diff -r 9c072f8e7e68 -r 1e8837473193 drivers/infiniband/hw/ipath/ipath_file_ops.c --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:25 2006 -0700 @@ -123,6 +123,7 @@ static int ipath_get_base_info(struct ip * on to yet another method of dealing with this */ kinfo->spi_rcvhdr_base = (u64) pd->port_rcvhdrq_phys; + kinfo->spi_rcvhdr_tailaddr = (u64)pd->port_rcvhdrqtailaddr_phys; kinfo->spi_rcv_egrbufs = (u64) pd->port_rcvegr_phys; kinfo->spi_pioavailaddr = (u64) dd->ipath_pioavailregs_phys; kinfo->spi_status = (u64) kinfo->spi_pioavailaddr + @@ -785,11 +786,12 @@ static int ipath_create_user_egr(struct bail_rcvegrbuf_phys: for (e = 0; e < pd->port_rcvegrbuf_chunks && - pd->port_rcvegrbuf[e]; e++) + pd->port_rcvegrbuf[e]; e++) { dma_free_coherent(&dd->pcidev->dev, size, pd->port_rcvegrbuf[e], pd->port_rcvegrbuf_phys[e]); + } vfree(pd->port_rcvegrbuf_phys); pd->port_rcvegrbuf_phys = NULL; bail_rcvegrbuf: @@ -804,10 +806,7 @@ static int ipath_do_user_init(struct ipa { int ret = 0; struct ipath_devdata *dd = pd->port_dd; - u64 physaddr, uaddr, off, atmp; - struct page *pagep; u32 head32; - u64 head; /* for now, if major version is different, bail */ if ((uinfo->spu_userversion >> 16) != IPATH_USER_SWMAJOR) { @@ -831,54 +830,6 @@ static int ipath_do_user_init(struct ipa } /* for now we do nothing with rcvhdrcnt: uinfo->spu_rcvhdrcnt */ - - /* set up for the rcvhdr Q tail register writeback to user memory */ - if (!uinfo->spu_rcvhdraddr || - !access_ok(VERIFY_WRITE, (u64 __user *) (unsigned long) - uinfo->spu_rcvhdraddr, sizeof(u64))) { - ipath_dbg("Port %d rcvhdrtail addr %llx not valid\n", - pd->port_port, - (unsigned long 
long) uinfo->spu_rcvhdraddr); - ret = -EINVAL; - goto done; - } - - off = offset_in_page(uinfo->spu_rcvhdraddr); - uaddr = PAGE_MASK & (unsigned long) uinfo->spu_rcvhdraddr; - ret = ipath_get_user_pages_nocopy(uaddr, &pagep); - if (ret) { - dev_info(&dd->pcidev->dev, "Failed to lookup and lock " - "address %llx for rcvhdrtail: errno %d\n", - (unsigned long long) uinfo->spu_rcvhdraddr, -ret); - goto done; - } - ipath_stats.sps_pagelocks++; - pd->port_rcvhdrtail_uaddr = uaddr; - pd->port_rcvhdrtail_pagep = pagep; - pd->port_rcvhdrtail_kvaddr = - page_address(pagep); - pd->port_rcvhdrtail_kvaddr += off; - physaddr = page_to_phys(pagep) + off; - ipath_cdbg(VERBOSE, "port %d user addr %llx hdrtailaddr, %llx " - "physical (off=%llx)\n", - pd->port_port, - (unsigned long long) uinfo->spu_rcvhdraddr, - (unsigned long long) physaddr, (unsigned long long) off); - ipath_write_kreg_port(dd, dd->ipath_kregs->kr_rcvhdrtailaddr, - pd->port_port, physaddr); - atmp = ipath_read_kreg64_port(dd, - dd->ipath_kregs->kr_rcvhdrtailaddr, - pd->port_port); - if (physaddr != atmp) { - ipath_dev_err(dd, - "Catastrophic software error, " - "RcvHdrTailAddr%u written as %llx, " - "read back as %llx\n", pd->port_port, - (unsigned long long) physaddr, - (unsigned long long) atmp); - ret = -EINVAL; - goto done; - } /* for right now, kernel piobufs are at end, so port 1 is at 0 */ pd->port_piobufs = dd->ipath_piobufbase + @@ -898,26 +849,18 @@ static int ipath_do_user_init(struct ipa ret = ipath_create_user_egr(pd); if (ret) goto done; - /* enable receives now */ - /* atomically set enable bit for this port */ - set_bit(INFINIPATH_R_PORTENABLE_SHIFT + pd->port_port, - &dd->ipath_rcvctrl); /* - * set the head registers for this port to the current values + * set the eager head register for this port to the current values * of the tail pointers, since we don't know if they were * updated on last use of the port. 
*/ - head32 = ipath_read_ureg32(dd, ur_rcvhdrtail, pd->port_port); - head = (u64) head32; - ipath_write_ureg(dd, ur_rcvhdrhead, head, pd->port_port); head32 = ipath_read_ureg32(dd, ur_rcvegrindextail, pd->port_port); ipath_write_ureg(dd, ur_rcvegrindexhead, head32, pd->port_port); dd->ipath_lastegrheads[pd->port_port] = -1; dd->ipath_lastrcvhdrqtails[pd->port_port] = -1; - ipath_cdbg(VERBOSE, "Wrote port%d head %llx, egrhead %x from " - "tail regs\n", pd->port_port, - (unsigned long long) head, head32); + ipath_cdbg(VERBOSE, "Wrote port%d egrhead %x from tail regs\n", + pd->port_port, head32); pd->port_tidcursor = 0; /* start at beginning after open */ /* * now enable the port; the tail registers will be written to memory @@ -926,13 +869,62 @@ static int ipath_do_user_init(struct ipa * transition from 0 to 1, so clear it first, then set it as part of * enabling the port. This will (very briefly) affect any other * open ports, but it shouldn't be long enough to be an issue. + * We explicitly set the in-memory copy to 0 beforehand, so we don't + * have to wait to be sure the DMA update has happened. 
*/ + *pd->port_rcvhdrtail_kvaddr = 0ULL; + set_bit(INFINIPATH_R_PORTENABLE_SHIFT + pd->port_port, + &dd->ipath_rcvctrl); ipath_write_kreg(dd, dd->ipath_kregs->kr_rcvctrl, dd->ipath_rcvctrl & ~INFINIPATH_R_TAILUPD); ipath_write_kreg(dd, dd->ipath_kregs->kr_rcvctrl, dd->ipath_rcvctrl); - done: + return ret; +} + + +/* common code for the mappings on dma_alloc_coherent mem */ +static int ipath_mmap_mem(struct vm_area_struct *vma, + struct ipath_portdata *pd, unsigned len, + int write_ok, dma_addr_t addr, char *what) +{ + struct ipath_devdata *dd = pd->port_dd; + unsigned pfn = (unsigned long)addr >> PAGE_SHIFT; + int ret; + + if ((vma->vm_end - vma->vm_start) > len) { + dev_info(&dd->pcidev->dev, + "FAIL on %s: len %lx > %x\n", what, + vma->vm_end - vma->vm_start, len); + ret = -EFAULT; + goto bail; + } + + if (!write_ok) { + if (vma->vm_flags & VM_WRITE) { + dev_info(&dd->pcidev->dev, + "%s must be mapped readonly\n", what); + ret = -EPERM; + goto bail; + } + + /* don't allow them to later change with mprotect */ + vma->vm_flags &= ~VM_MAYWRITE; + } + + ret = remap_pfn_range(vma, vma->vm_start, pfn, + len, vma->vm_page_prot); + if (ret) + dev_info(&dd->pcidev->dev, + "%s port%u mmap of %lx, %x bytes r%c failed: %d\n", + what, pd->port_port, (unsigned long)addr, len, + write_ok?'w':'o', ret); + else + ipath_cdbg(VERBOSE, "%s port%u mmaped %lx, %x bytes r%c\n", + what, pd->port_port, (unsigned long)addr, len, + write_ok?'w':'o'); +bail: return ret; } @@ -942,8 +934,11 @@ static int mmap_ureg(struct vm_area_stru unsigned long phys; int ret; - /* it's the real hardware, so io_remap works */ - + /* + * This is real hardware, so use io_remap. This is the mechanism + * for the user process to update the head registers for their port + * in the chip. 
+ */ if ((vma->vm_end - vma->vm_start) > PAGE_SIZE) { dev_info(&dd->pcidev->dev, "FAIL mmap userreg: reqlen " "%lx > PAGE\n", vma->vm_end - vma->vm_start); @@ -969,10 +964,11 @@ static int mmap_piobufs(struct vm_area_s int ret; /* - * When we map the PIO buffers, we want to map them as writeonly, no - * read possible. + * When we map the PIO buffers in the chip, we want to map them as + * writeonly, no read possible. This prevents access to previous + * process data, and catches users who might try to read the i/o + * space due to a bug. */ - if ((vma->vm_end - vma->vm_start) > (dd->ipath_pbufsport * dd->ipath_palign)) { dev_info(&dd->pcidev->dev, "FAIL mmap piobufs: " @@ -983,11 +979,10 @@ static int mmap_piobufs(struct vm_area_s } phys = dd->ipath_physaddr + pd->port_piobufs; + /* - * Do *NOT* mark this as non-cached (PWT bit), or we don't get the + * Don't mark this as non-cached, or we don't get the * write combining behavior we want on the PIO buffers! - * vma->vm_page_prot = - * pgprot_noncached(vma->vm_page_prot); */ if (vma->vm_flags & VM_READ) { @@ -999,8 +994,7 @@ static int mmap_piobufs(struct vm_area_s } /* don't allow them to later change to readable with mprotect */ - - vma->vm_flags &= ~VM_MAYWRITE; + vma->vm_flags &= ~VM_MAYREAD; vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND; ret = io_remap_pfn_range(vma, vma->vm_start, phys >> PAGE_SHIFT, @@ -1018,11 +1012,6 @@ static int mmap_rcvegrbufs(struct vm_are size_t total_size, i; dma_addr_t *phys; int ret; - - if (!pd->port_rcvegrbuf) { - ret = -EFAULT; - goto bail; - } size = pd->port_rcvegrbuf_size; total_size = pd->port_rcvegrbuf_chunks * size; @@ -1041,12 +1030,11 @@ static int mmap_rcvegrbufs(struct vm_are ret = -EPERM; goto bail; } + /* don't allow them to later change to writeable with mprotect */ + vma->vm_flags &= ~VM_MAYWRITE; start = vma->vm_start; phys = pd->port_rcvegrbuf_phys; - - /* don't allow them to later change to writeable with mprotect */ - vma->vm_flags &= ~VM_MAYWRITE; for (i = 0; i 
< pd->port_rcvegrbuf_chunks; i++, start += size) { ret = remap_pfn_range(vma, start, phys[i] >> PAGE_SHIFT, @@ -1056,78 +1044,6 @@ static int mmap_rcvegrbufs(struct vm_are } ret = 0; -bail: - return ret; -} - -static int mmap_rcvhdrq(struct vm_area_struct *vma, - struct ipath_portdata *pd) -{ - struct ipath_devdata *dd = pd->port_dd; - size_t total_size; - int ret; - - /* - * kmalloc'ed memory, physically contiguous; this is from - * spi_rcvhdr_base; we allow user to map read-write so they can - * write hdrq entries to allow protocol code to directly poll - * whether a hdrq entry has been written. - */ - total_size = ALIGN(dd->ipath_rcvhdrcnt * dd->ipath_rcvhdrentsize * - sizeof(u32), PAGE_SIZE); - if ((vma->vm_end - vma->vm_start) > total_size) { - dev_info(&dd->pcidev->dev, - "FAIL on rcvhdrq: reqlen %lx > actual %lx\n", - vma->vm_end - vma->vm_start, - (unsigned long) total_size); - ret = -EFAULT; - goto bail; - } - - ret = remap_pfn_range(vma, vma->vm_start, - pd->port_rcvhdrq_phys >> PAGE_SHIFT, - vma->vm_end - vma->vm_start, - vma->vm_page_prot); -bail: - return ret; -} - -static int mmap_pioavailregs(struct vm_area_struct *vma, - struct ipath_portdata *pd) -{ - struct ipath_devdata *dd = pd->port_dd; - int ret; - - /* - * when we map the PIO bufferavail registers, we want to map them as - * readonly, no write possible. 
- * - * kmalloc'ed memory, physically contiguous, one page only, readonly - */ - - if ((vma->vm_end - vma->vm_start) > PAGE_SIZE) { - dev_info(&dd->pcidev->dev, "FAIL on pioavailregs_dma: " - "reqlen %lx > actual %lx\n", - vma->vm_end - vma->vm_start, - (unsigned long) PAGE_SIZE); - ret = -EFAULT; - goto bail; - } - - if (vma->vm_flags & VM_WRITE) { - dev_info(&dd->pcidev->dev, - "Can't map pioavailregs as writable (flags=%lx)\n", - vma->vm_flags); - ret = -EPERM; - goto bail; - } - - /* don't allow them to later change with mprotect */ - vma->vm_flags &= ~VM_MAYWRITE; - - ret = remap_pfn_range(vma, vma->vm_start, - dd->ipath_pioavailregs_phys >> PAGE_SHIFT, - PAGE_SIZE, vma->vm_page_prot); bail: return ret; } @@ -1151,6 +1067,7 @@ static int ipath_mmap(struct file *fp, s pd = port_fp(fp); dd = pd->port_dd; + /* * This is the ipath_do_user_init() code, mapping the shared buffers * into the user process. The address referred to by vm_pgoff is the @@ -1160,29 +1077,59 @@ static int ipath_mmap(struct file *fp, s pgaddr = vma->vm_pgoff << PAGE_SHIFT; /* - * note that ureg does *NOT* have the kregvirt as part of it, to be - * sure that for 32 bit programs, we don't end up trying to map a > - * 44 address. Has to match ipath_get_base_info() code that sets - * __spi_uregbase + * Must fit in 40 bits for our hardware; some checked elsewhere, + * but we'll be paranoid. Check for 0 is mostly in case one of the + * allocations failed, but user called mmap anyway. We want to catch + * that before it can match. 
*/ - + if (!pgaddr || pgaddr >= (1ULL<<40)) { + ipath_dev_err(dd, "Bad phys addr %llx, start %lx, end %lx\n", + (unsigned long long)pgaddr, vma->vm_start, vma->vm_end); + return -EINVAL; + } + + /* just the offset of the port user registers, not physical addr */ ureg = dd->ipath_uregbase + dd->ipath_palign * pd->port_port; - ipath_cdbg(MM, "pgaddr %llx vm_start=%lx len %lx port %u:%u\n", + ipath_cdbg(MM, "ushare: pgaddr %llx vm_start=%lx, vmlen %lx\n", (unsigned long long) pgaddr, vma->vm_start, - vma->vm_end - vma->vm_start, dd->ipath_unit, - pd->port_port); - - if (pgaddr == ureg) + vma->vm_end - vma->vm_start); + + if (vma->vm_start & (PAGE_SIZE-1)) { + ipath_dev_err(dd, + "vm_start not aligned: %lx, end=%lx phys %lx\n", + vma->vm_start, vma->vm_end, (unsigned long)pgaddr); + ret = -EINVAL; + } + else if (pgaddr == ureg) ret = mmap_ureg(vma, dd, ureg); else if (pgaddr == pd->port_piobufs) ret = mmap_piobufs(vma, dd, pd); else if (pgaddr == (u64) pd->port_rcvegr_phys) ret = mmap_rcvegrbufs(vma, pd); - else if (pgaddr == (u64) pd->port_rcvhdrq_phys) - ret = mmap_rcvhdrq(vma, pd); + else if (pgaddr == (u64) pd->port_rcvhdrq_phys) { + /* + * The rcvhdrq itself; readonly except on HT-400 (so have + * to allow writable mapping), multiple pages, contiguous + * from an i/o perspective. 
+ */ + unsigned total_size = + ALIGN(dd->ipath_rcvhdrcnt * dd->ipath_rcvhdrentsize + * sizeof(u32), PAGE_SIZE); + ret = ipath_mmap_mem(vma, pd, total_size, 1, + pd->port_rcvhdrq_phys, + "rcvhdrq"); + } + else if (pgaddr == (u64)pd->port_rcvhdrqtailaddr_phys) + /* in-memory copy of rcvhdrq tail register */ + ret = ipath_mmap_mem(vma, pd, PAGE_SIZE, 0, + pd->port_rcvhdrqtailaddr_phys, + "rcvhdrq tail"); else if (pgaddr == dd->ipath_pioavailregs_phys) - ret = mmap_pioavailregs(vma, pd); + /* in-memory copy of pioavail registers */ + ret = ipath_mmap_mem(vma, pd, PAGE_SIZE, 0, + dd->ipath_pioavailregs_phys, + "pioavail registers"); else ret = -EINVAL; @@ -1539,14 +1486,6 @@ static int ipath_close(struct inode *in, } if (dd->ipath_kregbase) { - if (pd->port_rcvhdrtail_uaddr) { - pd->port_rcvhdrtail_uaddr = 0; - pd->port_rcvhdrtail_kvaddr = NULL; - ipath_release_user_pages_on_close( - &pd->port_rcvhdrtail_pagep, 1); - pd->port_rcvhdrtail_pagep = NULL; - ipath_stats.sps_pageunlocks++; - } ipath_write_kreg_port( dd, dd->ipath_kregs->kr_rcvhdrtailaddr, port, 0ULL); @@ -1583,9 +1522,9 @@ static int ipath_close(struct inode *in, dd->ipath_f_clear_tids(dd, pd->port_port); - ipath_free_pddata(dd, pd->port_port, 0); - + dd->ipath_pd[pd->port_port] = NULL; /* before releasing mutex */ mutex_unlock(&ipath_mutex); + ipath_free_pddata(dd, pd); /* after releasing the mutex */ return ret; } @@ -1905,3 +1844,4 @@ bail: bail: return; } + diff -r 9c072f8e7e68 -r 1e8837473193 drivers/infiniband/hw/ipath/ipath_init_chip.c --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c Thu Jun 29 14:33:25 2006 -0700 @@ -411,17 +411,8 @@ static int init_pioavailregs(struct ipat /* and its length */ dd->ipath_freezelen = L1_CACHE_BYTES - sizeof(dd->ipath_statusp[0]); - if (dd->ipath_unit * 64 > (IPATH_PORT0_RCVHDRTAIL_SIZE - 64)) { - ipath_dev_err(dd, "unit %u too large for port 0 " - "rcvhdrtail buffer size\n", 
dd->ipath_unit); - ret = -ENODEV; - } - else - ret = 0; - - /* so we can get current tail in ipath_kreceive(), per chip */ - dd->ipath_hdrqtailptr = &ipath_port0_rcvhdrtail[ - dd->ipath_unit * (64 / sizeof(*ipath_port0_rcvhdrtail))]; + ret = 0; + done: return ret; } @@ -654,7 +645,7 @@ int ipath_init_chip(struct ipath_devdata { int ret = 0, i; u32 val32, kpiobufs; - u64 val, atmp; + u64 val; struct ipath_portdata *pd = NULL; /* keep gcc4 happy */ ret = init_housekeeping(dd, &pd, reinit); @@ -777,24 +768,6 @@ int ipath_init_chip(struct ipath_devdata goto done; } - val = ipath_port0_rcvhdrtail_dma + dd->ipath_unit * 64; - - /* verify that the alignment requirement was met */ - ipath_write_kreg_port(dd, dd->ipath_kregs->kr_rcvhdrtailaddr, - 0, val); - atmp = ipath_read_kreg64_port( - dd, dd->ipath_kregs->kr_rcvhdrtailaddr, 0); - if (val != atmp) { - ipath_dev_err(dd, "Catastrophic software error, " - "RcvHdrTailAddr0 written as %llx, " - "read back as %llx from %x\n", - (unsigned long long) val, - (unsigned long long) atmp, - dd->ipath_kregs->kr_rcvhdrtailaddr); - ret = -EINVAL; - goto done; - } - ipath_write_kreg(dd, dd->ipath_kregs->kr_rcvbthqp, IPATH_KD_QP); /* @@ -845,12 +818,18 @@ int ipath_init_chip(struct ipath_devdata * re-init, the simplest way to handle this is to free * existing, and re-allocate. 
*/ - if (reinit) - ipath_free_pddata(dd, 0, 0); + if (reinit) { + struct ipath_portdata *pd = dd->ipath_pd[0]; + dd->ipath_pd[0] = NULL; + ipath_free_pddata(dd, pd); + } dd->ipath_f_tidtemplate(dd); ret = ipath_create_rcvhdrq(dd, pd); - if (!ret) + if (!ret) { + dd->ipath_hdrqtailptr = + (volatile __le64 *)pd->port_rcvhdrtail_kvaddr; ret = create_port0_egr(dd); + } if (ret) ipath_dev_err(dd, "failed to allocate port 0 (kernel) " "rcvhdrq and/or egr bufs\n"); diff -r 9c072f8e7e68 -r 1e8837473193 drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:25 2006 -0700 @@ -37,6 +37,7 @@ #include "ips_common.h" #include "ipath_layer.h" +/* These are all rcv-related errors which we want to count for stats */ #define E_SUM_PKTERRS \ (INFINIPATH_E_RHDRLEN | INFINIPATH_E_RBADTID | \ INFINIPATH_E_RBADVERSION | INFINIPATH_E_RHDR | \ @@ -45,12 +46,25 @@ INFINIPATH_E_RFORMATERR | INFINIPATH_E_RUNSUPVL | \ INFINIPATH_E_RUNEXPCHAR | INFINIPATH_E_REBP) +/* These are all send-related errors which we want to count for stats */ #define E_SUM_ERRS \ (INFINIPATH_E_SPIOARMLAUNCH | INFINIPATH_E_SUNEXPERRPKTNUM | \ INFINIPATH_E_SDROPPEDDATAPKT | INFINIPATH_E_SDROPPEDSMPPKT | \ INFINIPATH_E_SMAXPKTLEN | INFINIPATH_E_SUNSUPVL | \ INFINIPATH_E_SMINPKTLEN | INFINIPATH_E_SPKTLEN | \ INFINIPATH_E_INVALIDADDR) + +/* + * These are errors that can occur when the link changes state while + * a packet is being sent or received. This doesn't cover things + * like EBP or VCRC that can be the result of the sender having the + * link change state, so we receive a "known bad" packet. 
+ */ +#define E_SUM_LINK_PKTERRS \ + (INFINIPATH_E_SDROPPEDDATAPKT | INFINIPATH_E_SDROPPEDSMPPKT | \ + INFINIPATH_E_SMINPKTLEN | INFINIPATH_E_SPKTLEN | \ + INFINIPATH_E_RSHORTPKTLEN | INFINIPATH_E_RMINPKTLEN | \ + INFINIPATH_E_RUNEXPCHAR) static u64 handle_e_sum_errs(struct ipath_devdata *dd, ipath_err_t errs) { @@ -101,9 +115,7 @@ static u64 handle_e_sum_errs(struct ipat if (ipath_debug & __IPATH_PKTDBG) printk("\n"); } - if ((errs & (INFINIPATH_E_SDROPPEDDATAPKT | - INFINIPATH_E_SDROPPEDSMPPKT | - INFINIPATH_E_SMINPKTLEN)) && + if ((errs & E_SUM_LINK_PKTERRS) && !(dd->ipath_flags & IPATH_LINKACTIVE)) { /* * This can happen when SMA is trying to bring the link @@ -112,11 +124,9 @@ static u64 handle_e_sum_errs(struct ipat * valid. We don't want to confuse people, so we just * don't print them, except at debug */ - ipath_dbg("Ignoring pktsend errors %llx, because not " - "yet active\n", (unsigned long long) errs); - ignore_this_time = INFINIPATH_E_SDROPPEDDATAPKT | - INFINIPATH_E_SDROPPEDSMPPKT | - INFINIPATH_E_SMINPKTLEN; + ipath_dbg("Ignoring packet errors %llx, because link not " + "ACTIVE\n", (unsigned long long) errs); + ignore_this_time = errs & E_SUM_LINK_PKTERRS; } return ignore_this_time; @@ -157,7 +167,29 @@ static void handle_e_ibstatuschanged(str */ val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_ibcstatus); lstate = val & IPATH_IBSTATE_MASK; - if (lstate == IPATH_IBSTATE_INIT || lstate == IPATH_IBSTATE_ARM || + + /* + * this is confusing enough when it happens that I want to always put it + * on the console and in the logs. If it was a requested state change, + * we'll have already cleared the flags, so we won't print this warning + */ + if ((lstate != IPATH_IBSTATE_ARM && lstate != IPATH_IBSTATE_ACTIVE) + && (dd->ipath_flags & (IPATH_LINKARMED | IPATH_LINKACTIVE))) { + dev_info(&dd->pcidev->dev, "Link state changed from %s to %s\n", + (dd->ipath_flags & IPATH_LINKARMED) ? 
"ARM" : "ACTIVE", + ib_linkstate(lstate)); + /* + * Flush all queued sends when link went to DOWN or INIT, + * to be sure that they don't block SMA and other MAD packets + */ + ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, + INFINIPATH_S_ABORT); + ipath_disarm_piobufs(dd, dd->ipath_lastport_piobuf, + (unsigned)(dd->ipath_piobcnt2k + + dd->ipath_piobcnt4k) - + dd->ipath_lastport_piobuf); + } + else if (lstate == IPATH_IBSTATE_INIT || lstate == IPATH_IBSTATE_ARM || lstate == IPATH_IBSTATE_ACTIVE) { /* * only print at SMA if there is a change, debug if not @@ -380,6 +412,19 @@ static void handle_errors(struct ipath_d if (errs & E_SUM_ERRS) ignore_this_time = handle_e_sum_errs(dd, errs); + else if ((errs & E_SUM_LINK_PKTERRS) && + !(dd->ipath_flags & IPATH_LINKACTIVE)) { + /* + * This can happen when SMA is trying to bring the link + * up, but the IB link changes state at the "wrong" time. + * The IB logic then complains that the packet isn't + * valid. We don't want to confuse people, so we just + * don't print them, except at debug + */ + ipath_dbg("Ignoring packet errors %llx, because link not " + "ACTIVE\n", (unsigned long long) errs); + ignore_this_time = errs & E_SUM_LINK_PKTERRS; + } if (supp_msgs == 250000) { /* diff -r 9c072f8e7e68 -r 1e8837473193 drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:25 2006 -0700 @@ -62,9 +62,7 @@ struct ipath_portdata { /* rcvhdrq base, needs mmap before useful */ void *port_rcvhdrq; /* kernel virtual address where hdrqtail is updated */ - u64 *port_rcvhdrtail_kvaddr; - /* page * used for uaddr */ - struct page *port_rcvhdrtail_pagep; + volatile __le64 *port_rcvhdrtail_kvaddr; /* * temp buffer for expected send setup, allocated at open, instead * of each setup call @@ -79,11 +77,7 @@ struct ipath_portdata { dma_addr_t port_rcvegr_phys; /* mmap of hdrq, must fit in 44 bits */ dma_addr_t 
port_rcvhdrq_phys; - /* - * the actual user address that we ipath_mlock'ed, so we can - * ipath_munlock it at close - */ - unsigned long port_rcvhdrtail_uaddr; + dma_addr_t port_rcvhdrqtailaddr_phys; /* * number of opens on this instance (0 or 1; ignoring forks, dup, * etc. for now) @@ -515,11 +509,6 @@ struct ipath_devdata { u8 ipath_lmc; }; -extern volatile __le64 *ipath_port0_rcvhdrtail; -extern dma_addr_t ipath_port0_rcvhdrtail_dma; - -#define IPATH_PORT0_RCVHDRTAIL_SIZE PAGE_SIZE - extern struct list_head ipath_dev_list; extern spinlock_t ipath_devs_lock; extern struct ipath_devdata *ipath_lookup(int unit); @@ -579,7 +568,7 @@ void ipath_disarm_piobufs(struct ipath_d unsigned cnt); int ipath_create_rcvhdrq(struct ipath_devdata *, struct ipath_portdata *); -void ipath_free_pddata(struct ipath_devdata *, u32, int); +void ipath_free_pddata(struct ipath_devdata *, struct ipath_portdata *); int ipath_parse_ushort(const char *str, unsigned short *valp); From bos at pathscale.com Thu Jun 29 14:41:18 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:18 -0700 Subject: [openib-general] [PATCH 27 of 39] IB/ipath - fixes to performance get counters for IB compliance In-Reply-To: Message-ID: <7d22a8963bdaca778b13.1151617278@eng-12.pathscale.com> This patch fixes some problems uncovered during IB compliance testing to return the right values for error counters returned by the Performance Get Counters packet. Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r eef7f8021500 -r 7d22a8963bda drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 @@ -460,6 +460,8 @@ static int __devinit ipath_init_one(stru * by ipath_setup_htconfig. 
*/ dd->ipath_flags = 0; + dd->ipath_lli_counter = 0; + dd->ipath_lli_errors = 0; if (dd->ipath_f_bus(dd, pdev)) ipath_dev_err(dd, "Failed to setup config space; " @@ -942,6 +944,18 @@ reloop: "tlen=%x opcode=%x egridx=%x: %s\n", eflags, l, etype, tlen, bthbytes[0], ips_get_index((__le32 *) rc), emsg); + /* Count local link integrity errors. */ + if (eflags & (INFINIPATH_RHF_H_ICRCERR | + INFINIPATH_RHF_H_VCRCERR)) { + u8 n = (dd->ipath_ibcctrl >> + INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT) & + INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK; + + if (++dd->ipath_lli_counter > n) { + dd->ipath_lli_counter = 0; + dd->ipath_lli_errors++; + } + } } else if (etype == RCVHQ_RCV_TYPE_NON_KD) { int ret = __ipath_verbs_rcv(dd, rc + 1, ebuf, tlen); @@ -949,6 +963,9 @@ reloop: ipath_cdbg(VERBOSE, "received IB packet, " "not SMA (QP=%x)\n", qp); + if (dd->ipath_lli_counter) + dd->ipath_lli_counter--; + } else if (etype == RCVHQ_RCV_TYPE_EAGER) { if (qp == IPATH_KD_QP && bthbytes[0] == ipath_layer_rcv_opcode && diff -r eef7f8021500 -r 7d22a8963bda drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:26 2006 -0700 @@ -262,6 +262,7 @@ static void handle_e_ibstatuschanged(str | IPATH_LINKACTIVE | IPATH_LINKARMED); *dd->ipath_statusp &= ~IPATH_STATUS_IB_READY; + dd->ipath_lli_counter = 0; if (!noprint) { if (((dd->ipath_lastibcstat >> INFINIPATH_IBCS_LINKSTATE_SHIFT) & diff -r eef7f8021500 -r 7d22a8963bda drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Jun 29 14:33:26 2006 -0700 @@ -507,6 +507,11 @@ struct ipath_devdata { u8 ipath_pci_cacheline; /* LID mask control */ u8 ipath_lmc; + + /* local link integrity counter */ + u32 ipath_lli_counter; + /* local link integrity errors */ + u32 ipath_lli_errors; }; extern struct list_head 
ipath_dev_list; diff -r eef7f8021500 -r 7d22a8963bda drivers/infiniband/hw/ipath/ipath_layer.c --- a/drivers/infiniband/hw/ipath/ipath_layer.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_layer.c Thu Jun 29 14:33:26 2006 -0700 @@ -1032,19 +1032,22 @@ int ipath_layer_get_counters(struct ipat ipath_snap_cntr(dd, dd->ipath_cregs->cr_ibsymbolerrcnt); cntrs->link_error_recovery_counter = ipath_snap_cntr(dd, dd->ipath_cregs->cr_iblinkerrrecovcnt); + /* + * The link downed counter counts when the other side downs the + * connection. We add in the number of times we downed the link + * due to local link integrity errors to compensate. + */ cntrs->link_downed_counter = ipath_snap_cntr(dd, dd->ipath_cregs->cr_iblinkdowncnt); cntrs->port_rcv_errors = ipath_snap_cntr(dd, dd->ipath_cregs->cr_rxdroppktcnt) + ipath_snap_cntr(dd, dd->ipath_cregs->cr_rcvovflcnt) + ipath_snap_cntr(dd, dd->ipath_cregs->cr_portovflcnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_errrcvflowctrlcnt) + ipath_snap_cntr(dd, dd->ipath_cregs->cr_err_rlencnt) + ipath_snap_cntr(dd, dd->ipath_cregs->cr_invalidrlencnt) + ipath_snap_cntr(dd, dd->ipath_cregs->cr_erricrccnt) + ipath_snap_cntr(dd, dd->ipath_cregs->cr_errvcrccnt) + ipath_snap_cntr(dd, dd->ipath_cregs->cr_errlpcrccnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_errlinkcnt) + ipath_snap_cntr(dd, dd->ipath_cregs->cr_badformatcnt); cntrs->port_rcv_remphys_errors = ipath_snap_cntr(dd, dd->ipath_cregs->cr_rcvebpcnt); @@ -1058,6 +1061,8 @@ int ipath_layer_get_counters(struct ipat ipath_snap_cntr(dd, dd->ipath_cregs->cr_pktsendcnt); cntrs->port_rcv_packets = ipath_snap_cntr(dd, dd->ipath_cregs->cr_pktrcvcnt); + cntrs->local_link_integrity_errors = dd->ipath_lli_errors; + cntrs->excessive_buffer_overrun_errors = 0; /* XXX */ ret = 0; diff -r eef7f8021500 -r 7d22a8963bda drivers/infiniband/hw/ipath/ipath_layer.h --- a/drivers/infiniband/hw/ipath/ipath_layer.h Thu Jun 29 14:33:26 2006 -0700 +++ 
b/drivers/infiniband/hw/ipath/ipath_layer.h Thu Jun 29 14:33:26 2006 -0700 @@ -55,6 +55,8 @@ struct ipath_layer_counters { u64 port_rcv_data; u64 port_xmit_packets; u64 port_rcv_packets; + u32 local_link_integrity_errors; + u32 excessive_buffer_overrun_errors; }; /* diff -r eef7f8021500 -r 7d22a8963bda drivers/infiniband/hw/ipath/ipath_mad.c --- a/drivers/infiniband/hw/ipath/ipath_mad.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_mad.c Thu Jun 29 14:33:26 2006 -0700 @@ -613,6 +613,9 @@ struct ib_pma_portcounters { #define IB_PMA_SEL_PORT_RCV_ERRORS __constant_htons(0x0008) #define IB_PMA_SEL_PORT_RCV_REMPHYS_ERRORS __constant_htons(0x0010) #define IB_PMA_SEL_PORT_XMIT_DISCARDS __constant_htons(0x0040) +#define IB_PMA_SEL_LOCAL_LINK_INTEGRITY_ERRORS __constant_htons(0x0200) +#define IB_PMA_SEL_EXCESSIVE_BUFFER_OVERRUNS __constant_htons(0x0400) +#define IB_PMA_SEL_PORT_VL15_DROPPED __constant_htons(0x0800) #define IB_PMA_SEL_PORT_XMIT_DATA __constant_htons(0x1000) #define IB_PMA_SEL_PORT_RCV_DATA __constant_htons(0x2000) #define IB_PMA_SEL_PORT_XMIT_PACKETS __constant_htons(0x4000) @@ -859,6 +862,10 @@ static int recv_pma_get_portcounters(str cntrs.port_rcv_data -= dev->z_port_rcv_data; cntrs.port_xmit_packets -= dev->z_port_xmit_packets; cntrs.port_rcv_packets -= dev->z_port_rcv_packets; + cntrs.local_link_integrity_errors -= + dev->z_local_link_integrity_errors; + cntrs.excessive_buffer_overrun_errors -= + dev->z_excessive_buffer_overrun_errors; memset(pmp->data, 0, sizeof(pmp->data)); @@ -896,6 +903,16 @@ static int recv_pma_get_portcounters(str else p->port_xmit_discards = cpu_to_be16((u16)cntrs.port_xmit_discards); + if (cntrs.local_link_integrity_errors > 0xFUL) + cntrs.local_link_integrity_errors = 0xFUL; + if (cntrs.excessive_buffer_overrun_errors > 0xFUL) + cntrs.excessive_buffer_overrun_errors = 0xFUL; + p->lli_ebor_errors = (cntrs.local_link_integrity_errors << 4) | + cntrs.excessive_buffer_overrun_errors; + if 
(dev->n_vl15_dropped > 0xFFFFUL) + p->vl15_dropped = __constant_cpu_to_be16(0xFFFF); + else + p->vl15_dropped = cpu_to_be16((u16)dev->n_vl15_dropped); if (cntrs.port_xmit_data > 0xFFFFFFFFUL) p->port_xmit_data = __constant_cpu_to_be32(0xFFFFFFFF); else @@ -989,6 +1006,17 @@ static int recv_pma_set_portcounters(str if (p->counter_select & IB_PMA_SEL_PORT_XMIT_DISCARDS) dev->z_port_xmit_discards = cntrs.port_xmit_discards; + + if (p->counter_select & IB_PMA_SEL_LOCAL_LINK_INTEGRITY_ERRORS) + dev->z_local_link_integrity_errors = + cntrs.local_link_integrity_errors; + + if (p->counter_select & IB_PMA_SEL_EXCESSIVE_BUFFER_OVERRUNS) + dev->z_excessive_buffer_overrun_errors = + cntrs.excessive_buffer_overrun_errors; + + if (p->counter_select & IB_PMA_SEL_PORT_VL15_DROPPED) + dev->n_vl15_dropped = 0; if (p->counter_select & IB_PMA_SEL_PORT_XMIT_DATA) dev->z_port_xmit_data = cntrs.port_xmit_data; @@ -1275,32 +1303,8 @@ int ipath_process_mad(struct ib_device * struct ib_wc *in_wc, struct ib_grh *in_grh, struct ib_mad *in_mad, struct ib_mad *out_mad) { - struct ipath_ibdev *dev = to_idev(ibdev); int ret; - /* - * Snapshot current HW counters to "clear" them. - * This should be done when the driver is loaded except that for - * some reason we get a zillion errors when brining up the link. 
- */ - if (dev->rcv_errors == 0) { - struct ipath_layer_counters cntrs; - - ipath_layer_get_counters(to_idev(ibdev)->dd, &cntrs); - dev->rcv_errors++; - dev->z_symbol_error_counter = cntrs.symbol_error_counter; - dev->z_link_error_recovery_counter = - cntrs.link_error_recovery_counter; - dev->z_link_downed_counter = cntrs.link_downed_counter; - dev->z_port_rcv_errors = cntrs.port_rcv_errors + 1; - dev->z_port_rcv_remphys_errors = - cntrs.port_rcv_remphys_errors; - dev->z_port_xmit_discards = cntrs.port_xmit_discards; - dev->z_port_xmit_data = cntrs.port_xmit_data; - dev->z_port_rcv_data = cntrs.port_rcv_data; - dev->z_port_xmit_packets = cntrs.port_xmit_packets; - dev->z_port_rcv_packets = cntrs.port_rcv_packets; - } switch (in_mad->mad_hdr.mgmt_class) { case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: case IB_MGMT_CLASS_SUBN_LID_ROUTED: diff -r eef7f8021500 -r 7d22a8963bda drivers/infiniband/hw/ipath/ipath_ud.c --- a/drivers/infiniband/hw/ipath/ipath_ud.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_ud.c Thu Jun 29 14:33:26 2006 -0700 @@ -560,7 +560,16 @@ void ipath_ud_rcv(struct ipath_ibdev *de spin_lock_irqsave(&rq->lock, flags); if (rq->tail == rq->head) { spin_unlock_irqrestore(&rq->lock, flags); - dev->n_pkt_drops++; + /* + * Count VL15 packets dropped due to no receive buffer. + * Otherwise, count them as buffer overruns since usually, + * the HW will be able to receive packets even if there are + * no QPs with posted receive buffers. + */ + if (qp->ibqp.qp_num == 0) + dev->n_vl15_dropped++; + else + dev->rcv_errors++; goto bail; } /* Silently drop packets which are too big. 
*/ diff -r eef7f8021500 -r 7d22a8963bda drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:26 2006 -0700 @@ -981,6 +981,7 @@ static int ipath_verbs_register_sysfs(st */ static void *ipath_register_ib_device(int unit, struct ipath_devdata *dd) { + struct ipath_layer_counters cntrs; struct ipath_ibdev *idev; struct ib_device *dev; int ret; @@ -1030,6 +1031,25 @@ static void *ipath_register_ib_device(in idev->pma_counter_select[3] = IB_PMA_PORT_RCV_PKTS; idev->pma_counter_select[5] = IB_PMA_PORT_XMIT_WAIT; idev->link_width_enabled = 3; /* 1x or 4x */ + + /* Snapshot current HW counters to "clear" them. */ + ipath_layer_get_counters(dd, &cntrs); + idev->z_symbol_error_counter = cntrs.symbol_error_counter; + idev->z_link_error_recovery_counter = + cntrs.link_error_recovery_counter; + idev->z_link_downed_counter = cntrs.link_downed_counter; + idev->z_port_rcv_errors = cntrs.port_rcv_errors; + idev->z_port_rcv_remphys_errors = + cntrs.port_rcv_remphys_errors; + idev->z_port_xmit_discards = cntrs.port_xmit_discards; + idev->z_port_xmit_data = cntrs.port_xmit_data; + idev->z_port_rcv_data = cntrs.port_rcv_data; + idev->z_port_xmit_packets = cntrs.port_xmit_packets; + idev->z_port_rcv_packets = cntrs.port_rcv_packets; + idev->z_local_link_integrity_errors = + cntrs.local_link_integrity_errors; + idev->z_excessive_buffer_overrun_errors = + cntrs.excessive_buffer_overrun_errors; /* * The system image GUID is supposed to be the same for all diff -r eef7f8021500 -r 7d22a8963bda drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Jun 29 14:33:26 2006 -0700 @@ -460,6 +460,8 @@ struct ipath_ibdev { u64 z_port_xmit_packets; /* starting count for PMA */ u64 z_port_rcv_packets; /* starting count for PMA */ u32 z_pkey_violations; /* 
starting count for PMA */ + u32 z_local_link_integrity_errors; /* starting count for PMA */ + u32 z_excessive_buffer_overrun_errors; /* starting count for PMA */ u32 n_rc_resends; u32 n_rc_acks; u32 n_rc_qacks; @@ -469,6 +471,7 @@ struct ipath_ibdev { u32 n_other_naks; u32 n_timeouts; u32 n_pkt_drops; + u32 n_vl15_dropped; u32 n_wqe_errs; u32 n_rdma_dup_busy; u32 n_piowait; From bos at pathscale.com Thu Jun 29 14:41:07 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:07 -0700 Subject: [openib-general] [PATCH 16 of 39] IB/ipath - enable freeze mode when shutting down device In-Reply-To: Message-ID: Signed-off-by: Dave Olson Signed-off-by: Bryan O'Sullivan diff -r 125471ee6c68 -r fd5e733f02ac drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:25 2006 -0700 @@ -1656,7 +1656,7 @@ void ipath_shutdown_device(struct ipath_ /* disable IBC */ dd->ipath_control &= ~INFINIPATH_C_LINKENABLE; ipath_write_kreg(dd, dd->ipath_kregs->kr_control, - dd->ipath_control); + dd->ipath_control | INFINIPATH_C_FREEZEMODE); /* * clear SerdesEnable and turn the leds off; do this here because From bos at pathscale.com Thu Jun 29 14:41:20 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:20 -0700 Subject: [openib-general] [PATCH 29 of 39] IB/ipath - RC receive interrupt performance changes In-Reply-To: Message-ID: <1bef8244297aef83d9a6.1151617280@eng-12.pathscale.com> This patch separates QP state used for sending and receiving RC packets so the processing in the receive interrupt handler can be done mostly without locks being held. ACK packets are now sent without requiring synchronization with the send tasklet. 
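The state split the description refers to can be sketched in miniature. This is a hypothetical, greatly simplified model, not the driver's API: the names `qp_sketch`, `stage_ack`, and `handoff_ack` are illustrative. The idea matching the patch is that the receive handler records the ACK it wants to send in its own r_* fields (which only the receive path touches, so no lock is needed), and only a short handoff copies them into the s_* fields the send tasklet consumes.

```c
#include <assert.h>

/* Opcodes borrowed loosely from the RC opcode space for illustration. */
enum { OP_ACKNOWLEDGE = 0x11, OP_SEND_ONLY = 0x04 };

struct qp_sketch {
	/* s_* fields: consumed by the send tasklet; in the driver these
	 * are guarded by qp->s_lock. */
	unsigned s_ack_state;
	unsigned s_ack_psn;
	/* r_* fields: written only by the receive interrupt handler,
	 * so they need no lock of their own. */
	unsigned r_ack_state;
	unsigned r_ack_psn;
};

/* Receive path: note a pending ACK without taking the send lock. */
static void stage_ack(struct qp_sketch *qp, unsigned opcode, unsigned psn)
{
	qp->r_ack_state = opcode;
	qp->r_ack_psn = psn;
}

/* Handoff: in the driver this short section runs under
 * spin_lock_irq(&qp->s_lock); here the locking is elided.
 * Returns 1 if the pending ACK was moved to the send side. */
static int handoff_ack(struct qp_sketch *qp)
{
	int moved = 0;

	/* Only hand off if the send side has no ACK of its own queued. */
	if (qp->s_ack_state == OP_ACKNOWLEDGE) {
		qp->s_ack_state = qp->r_ack_state;
		qp->s_ack_psn = qp->r_ack_psn;
		qp->r_ack_state = OP_ACKNOWLEDGE;
		moved = 1;
	}
	return moved;
}
```

The payoff is that the common receive path never contends on s_lock; the lock is only taken for the brief r_*-to-s_* copy, which is essentially what the patch's send_rc_ack() fallback does when it has to defer the ACK to the send tasklet.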
Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r 5f3c0b2d446d -r 1bef8244297a drivers/infiniband/hw/ipath/ipath_keys.c --- a/drivers/infiniband/hw/ipath/ipath_keys.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_keys.c Thu Jun 29 14:33:26 2006 -0700 @@ -121,6 +121,7 @@ int ipath_lkey_ok(struct ipath_lkey_tabl struct ib_sge *sge, int acc) { struct ipath_mregion *mr; + unsigned n, m; size_t off; int ret; @@ -152,20 +153,22 @@ int ipath_lkey_ok(struct ipath_lkey_tabl } off += mr->offset; + m = 0; + n = 0; + while (off >= mr->map[m]->segs[n].length) { + off -= mr->map[m]->segs[n].length; + n++; + if (n >= IPATH_SEGSZ) { + m++; + n = 0; + } + } isge->mr = mr; - isge->m = 0; - isge->n = 0; - while (off >= mr->map[isge->m]->segs[isge->n].length) { - off -= mr->map[isge->m]->segs[isge->n].length; - isge->n++; - if (isge->n >= IPATH_SEGSZ) { - isge->m++; - isge->n = 0; - } - } - isge->vaddr = mr->map[isge->m]->segs[isge->n].vaddr + off; - isge->length = mr->map[isge->m]->segs[isge->n].length - off; + isge->vaddr = mr->map[m]->segs[n].vaddr + off; + isge->length = mr->map[m]->segs[n].length - off; isge->sge_length = sge->length; + isge->m = m; + isge->n = n; ret = 1; @@ -190,6 +193,7 @@ int ipath_rkey_ok(struct ipath_ibdev *de struct ipath_lkey_table *rkt = &dev->lk_table; struct ipath_sge *sge = &ss->sge; struct ipath_mregion *mr; + unsigned n, m; size_t off; int ret; @@ -207,20 +211,22 @@ int ipath_rkey_ok(struct ipath_ibdev *de } off += mr->offset; + m = 0; + n = 0; + while (off >= mr->map[m]->segs[n].length) { + off -= mr->map[m]->segs[n].length; + n++; + if (n >= IPATH_SEGSZ) { + m++; + n = 0; + } + } sge->mr = mr; - sge->m = 0; - sge->n = 0; - while (off >= mr->map[sge->m]->segs[sge->n].length) { - off -= mr->map[sge->m]->segs[sge->n].length; - sge->n++; - if (sge->n >= IPATH_SEGSZ) { - sge->m++; - sge->n = 0; - } - } - sge->vaddr = mr->map[sge->m]->segs[sge->n].vaddr + off; - sge->length = 
mr->map[sge->m]->segs[sge->n].length - off; + sge->vaddr = mr->map[m]->segs[n].vaddr + off; + sge->length = mr->map[m]->segs[n].length - off; sge->sge_length = len; + sge->m = m; + sge->n = n; ss->sg_list = NULL; ss->num_sge = 1; diff -r 5f3c0b2d446d -r 1bef8244297a drivers/infiniband/hw/ipath/ipath_qp.c --- a/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:26 2006 -0700 @@ -333,10 +333,11 @@ static void ipath_reset_qp(struct ipath_ qp->remote_qpn = 0; qp->qkey = 0; qp->qp_access_flags = 0; + clear_bit(IPATH_S_BUSY, &qp->s_flags); qp->s_hdrwords = 0; qp->s_psn = 0; qp->r_psn = 0; - atomic_set(&qp->msn, 0); + qp->r_msn = 0; if (qp->ibqp.qp_type == IB_QPT_RC) { qp->s_state = IB_OPCODE_RC_SEND_LAST; qp->r_state = IB_OPCODE_RC_SEND_LAST; @@ -345,7 +346,8 @@ static void ipath_reset_qp(struct ipath_ qp->r_state = IB_OPCODE_UC_SEND_LAST; } qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; - qp->s_nak_state = 0; + qp->r_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; + qp->r_nak_state = 0; qp->s_rnr_timeout = 0; qp->s_head = 0; qp->s_tail = 0; @@ -363,10 +365,10 @@ static void ipath_reset_qp(struct ipath_ * @qp: the QP to put into an error state * * Flushes both send and receive work queues. - * QP r_rq.lock and s_lock should be held. - */ - -static void ipath_error_qp(struct ipath_qp *qp) + * QP s_lock should be held and interrupts disabled. 
+ */ + +void ipath_error_qp(struct ipath_qp *qp) { struct ipath_ibdev *dev = to_idev(qp->ibqp.device); struct ib_wc wc; @@ -409,12 +411,14 @@ static void ipath_error_qp(struct ipath_ qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; wc.opcode = IB_WC_RECV; + spin_lock(&qp->r_rq.lock); while (qp->r_rq.tail != qp->r_rq.head) { wc.wr_id = get_rwqe_ptr(&qp->r_rq, qp->r_rq.tail)->wr_id; if (++qp->r_rq.tail >= qp->r_rq.size) qp->r_rq.tail = 0; ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); } + spin_unlock(&qp->r_rq.lock); } /** @@ -434,8 +438,7 @@ int ipath_modify_qp(struct ib_qp *ibqp, unsigned long flags; int ret; - spin_lock_irqsave(&qp->r_rq.lock, flags); - spin_lock(&qp->s_lock); + spin_lock_irqsave(&qp->s_lock, flags); cur_state = attr_mask & IB_QP_CUR_STATE ? attr->cur_qp_state : qp->state; @@ -506,31 +509,19 @@ int ipath_modify_qp(struct ib_qp *ibqp, } if (attr_mask & IB_QP_MIN_RNR_TIMER) - qp->s_min_rnr_timer = attr->min_rnr_timer; + qp->r_min_rnr_timer = attr->min_rnr_timer; if (attr_mask & IB_QP_QKEY) qp->qkey = attr->qkey; qp->state = new_state; - spin_unlock(&qp->s_lock); - spin_unlock_irqrestore(&qp->r_rq.lock, flags); - - /* - * If QP1 changed to the RTS state, try to move to the link to INIT - * even if it was ACTIVE so the SM will reinitialize the SMA's - * state. 
- */ - if (qp->ibqp.qp_num == 1 && new_state == IB_QPS_RTS) { - struct ipath_ibdev *dev = to_idev(ibqp->device); - - ipath_layer_set_linkstate(dev->dd, IPATH_IB_LINKDOWN); - } + spin_unlock_irqrestore(&qp->s_lock, flags); + ret = 0; goto bail; inval: - spin_unlock(&qp->s_lock); - spin_unlock_irqrestore(&qp->r_rq.lock, flags); + spin_unlock_irqrestore(&qp->s_lock, flags); ret = -EINVAL; bail: @@ -564,7 +555,7 @@ int ipath_query_qp(struct ib_qp *ibqp, s attr->sq_draining = 0; attr->max_rd_atomic = 1; attr->max_dest_rd_atomic = 1; - attr->min_rnr_timer = qp->s_min_rnr_timer; + attr->min_rnr_timer = qp->r_min_rnr_timer; attr->port_num = 1; attr->timeout = 0; attr->retry_cnt = qp->s_retry_cnt; @@ -591,16 +582,12 @@ int ipath_query_qp(struct ib_qp *ibqp, s * @qp: the queue pair to compute the AETH for * * Returns the AETH. - * - * The QP s_lock should be held. */ __be32 ipath_compute_aeth(struct ipath_qp *qp) { - u32 aeth = atomic_read(&qp->msn) & IPS_MSN_MASK; - - if (qp->s_nak_state) { - aeth |= qp->s_nak_state << IPS_AETH_CREDIT_SHIFT; - } else if (qp->ibqp.srq) { + u32 aeth = qp->r_msn & IPS_MSN_MASK; + + if (qp->ibqp.srq) { /* * Shared receive queues don't generate credits. * Set the credit field to the invalid value. diff -r 5f3c0b2d446d -r 1bef8244297a drivers/infiniband/hw/ipath/ipath_rc.c --- a/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:26 2006 -0700 @@ -42,7 +42,7 @@ * @qp: the QP who's SGE we're restarting * @wqe: the work queue to initialize the QP's SGE from * - * The QP s_lock should be held. + * The QP s_lock should be held and interrupts disabled. 
*/ static void ipath_init_restart(struct ipath_qp *qp, struct ipath_swqe *wqe) { @@ -77,7 +77,6 @@ u32 ipath_make_rc_ack(struct ipath_qp *q struct ipath_other_headers *ohdr, u32 pmtu) { - struct ipath_sge_state *ss; u32 hwords; u32 len; u32 bth0; @@ -91,7 +90,7 @@ u32 ipath_make_rc_ack(struct ipath_qp *q */ switch (qp->s_ack_state) { case OP(RDMA_READ_REQUEST): - ss = &qp->s_rdma_sge; + qp->s_cur_sge = &qp->s_rdma_sge; len = qp->s_rdma_len; if (len > pmtu) { len = pmtu; @@ -108,7 +107,7 @@ u32 ipath_make_rc_ack(struct ipath_qp *q qp->s_ack_state = OP(RDMA_READ_RESPONSE_MIDDLE); /* FALLTHROUGH */ case OP(RDMA_READ_RESPONSE_MIDDLE): - ss = &qp->s_rdma_sge; + qp->s_cur_sge = &qp->s_rdma_sge; len = qp->s_rdma_len; if (len > pmtu) len = pmtu; @@ -127,41 +126,50 @@ u32 ipath_make_rc_ack(struct ipath_qp *q * We have to prevent new requests from changing * the r_sge state while a ipath_verbs_send() * is in progress. - * Changing r_state allows the receiver - * to continue processing new packets. - * We do it here now instead of above so - * that we are sure the packet was sent before - * changing the state. - */ - qp->r_state = OP(RDMA_READ_RESPONSE_LAST); + */ qp->s_ack_state = OP(ACKNOWLEDGE); - return 0; + bth0 = 0; + goto bail; case OP(COMPARE_SWAP): case OP(FETCH_ADD): - ss = NULL; + qp->s_cur_sge = NULL; len = 0; - qp->r_state = OP(SEND_LAST); - qp->s_ack_state = OP(ACKNOWLEDGE); - bth0 = IB_OPCODE_ATOMIC_ACKNOWLEDGE << 24; + /* + * Set the s_ack_state so the receive interrupt handler + * won't try to send an ACK (out of order) until this one + * is actually sent. + */ + qp->s_ack_state = OP(RDMA_READ_RESPONSE_LAST); + bth0 = OP(ATOMIC_ACKNOWLEDGE) << 24; ohdr->u.at.aeth = ipath_compute_aeth(qp); - ohdr->u.at.atomic_ack_eth = cpu_to_be64(qp->s_ack_atomic); + ohdr->u.at.atomic_ack_eth = cpu_to_be64(qp->r_atomic_data); hwords += sizeof(ohdr->u.at) / 4; break; default: /* Send a regular ACK. 
*/ - ss = NULL; + qp->s_cur_sge = NULL; len = 0; - qp->s_ack_state = OP(ACKNOWLEDGE); - bth0 = qp->s_ack_state << 24; - ohdr->u.aeth = ipath_compute_aeth(qp); + /* + * Set the s_ack_state so the receive interrupt handler + * won't try to send an ACK (out of order) until this one + * is actually sent. + */ + qp->s_ack_state = OP(RDMA_READ_RESPONSE_LAST); + bth0 = OP(ACKNOWLEDGE) << 24; + if (qp->s_nak_state) + ohdr->u.aeth = cpu_to_be32((qp->r_msn & IPS_MSN_MASK) | + (qp->s_nak_state << + IPS_AETH_CREDIT_SHIFT)); + else + ohdr->u.aeth = ipath_compute_aeth(qp); hwords++; } qp->s_hdrwords = hwords; - qp->s_cur_sge = ss; qp->s_cur_size = len; +bail: return bth0; } @@ -174,7 +182,7 @@ u32 ipath_make_rc_ack(struct ipath_qp *q * @bth2p: pointer to the BTH PSN word * * Return 1 if constructed; otherwise, return 0. - * Note the QP s_lock must be held. + * Note the QP s_lock must be held and interrupts disabled. */ int ipath_make_rc_req(struct ipath_qp *qp, struct ipath_other_headers *ohdr, @@ -356,6 +364,11 @@ int ipath_make_rc_req(struct ipath_qp *q bth2 |= qp->s_psn++ & IPS_PSN_MASK; if ((int)(qp->s_psn - qp->s_next_psn) > 0) qp->s_next_psn = qp->s_psn; + /* + * Put the QP on the pending list so lost ACKs will cause + * a retry. More than one request can be pending so the + * QP may already be on the dev->pending list. + */ spin_lock(&dev->pending_lock); if (list_empty(&qp->timerwait)) list_add_tail(&qp->timerwait, @@ -365,8 +378,8 @@ int ipath_make_rc_req(struct ipath_qp *q case OP(RDMA_READ_RESPONSE_FIRST): /* - * This case can only happen if a send is restarted. See - * ipath_restart_rc(). + * This case can only happen if a send is restarted. + * See ipath_restart_rc(). 
*/ ipath_init_restart(qp, wqe); /* FALLTHROUGH */ @@ -526,11 +539,17 @@ static void send_rc_ack(struct ipath_qp ohdr = &hdr.u.l.oth; lrh0 = IPS_LRH_GRH; } + /* read pkey_index w/o lock (its atomic) */ bth0 = ipath_layer_get_pkey(dev->dd, qp->s_pkey_index); - ohdr->u.aeth = ipath_compute_aeth(qp); - if (qp->s_ack_state >= OP(COMPARE_SWAP)) { - bth0 |= IB_OPCODE_ATOMIC_ACKNOWLEDGE << 24; - ohdr->u.at.atomic_ack_eth = cpu_to_be64(qp->s_ack_atomic); + if (qp->r_nak_state) + ohdr->u.aeth = cpu_to_be32((qp->r_msn & IPS_MSN_MASK) | + (qp->r_nak_state << + IPS_AETH_CREDIT_SHIFT)); + else + ohdr->u.aeth = ipath_compute_aeth(qp); + if (qp->r_ack_state >= OP(COMPARE_SWAP)) { + bth0 |= OP(ATOMIC_ACKNOWLEDGE) << 24; + ohdr->u.at.atomic_ack_eth = cpu_to_be64(qp->r_atomic_data); hwords += sizeof(ohdr->u.at.atomic_ack_eth) / 4; } else bth0 |= OP(ACKNOWLEDGE) << 24; @@ -541,15 +560,36 @@ static void send_rc_ack(struct ipath_qp hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->dd)); ohdr->bth[0] = cpu_to_be32(bth0); ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); - ohdr->bth[2] = cpu_to_be32(qp->s_ack_psn & IPS_PSN_MASK); + ohdr->bth[2] = cpu_to_be32(qp->r_ack_psn & IPS_PSN_MASK); /* * If we can send the ACK, clear the ACK state. */ if (ipath_verbs_send(dev->dd, hwords, (u32 *) &hdr, 0, NULL) == 0) { - qp->s_ack_state = OP(ACKNOWLEDGE); + qp->r_ack_state = OP(ACKNOWLEDGE); + dev->n_unicast_xmit++; + } else { + /* + * We are out of PIO buffers at the moment. + * Pass responsibility for sending the ACK to the + * send tasklet so that when a PIO buffer becomes + * available, the ACK is sent ahead of other outgoing + * packets. + */ dev->n_rc_qacks++; - dev->n_unicast_xmit++; + spin_lock_irq(&qp->s_lock); + /* Don't coalesce if a RDMA read or atomic is pending. 
*/ + if (qp->s_ack_state == OP(ACKNOWLEDGE) || + qp->s_ack_state < OP(RDMA_READ_REQUEST)) { + qp->s_ack_state = qp->r_ack_state; + qp->s_nak_state = qp->r_nak_state; + qp->s_ack_psn = qp->r_ack_psn; + qp->r_ack_state = OP(ACKNOWLEDGE); + } + spin_unlock_irq(&qp->s_lock); + + /* Call ipath_do_rc_send() in another thread. */ + tasklet_hi_schedule(&qp->s_task); } } @@ -641,7 +681,7 @@ done: * @psn: packet sequence number for the request * @wc: the work completion request * - * The QP s_lock should be held. + * The QP s_lock should be held and interrupts disabled. */ void ipath_restart_rc(struct ipath_qp *qp, u32 psn, struct ib_wc *wc) { @@ -705,7 +745,7 @@ bail: * * This is called from ipath_rc_rcv_resp() to process an incoming RC ACK * for the given QP. - * Called at interrupt level with the QP s_lock held. + * Called at interrupt level with the QP s_lock held and interrupts disabled. * Returns 1 if OK, 0 if current operation should be aborted (NAK). */ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode) @@ -1126,18 +1166,16 @@ static inline int ipath_rc_rcv_error(str * Don't queue the NAK if a RDMA read, atomic, or * NAK is pending though. */ - spin_lock(&qp->s_lock); - if ((qp->s_ack_state >= OP(RDMA_READ_REQUEST) && - qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) || - qp->s_nak_state != 0) { - spin_unlock(&qp->s_lock); + if (qp->s_ack_state != OP(ACKNOWLEDGE) || + qp->r_nak_state != 0) goto done; - } - qp->s_ack_state = OP(SEND_ONLY); - qp->s_nak_state = IB_NAK_PSN_ERROR; - /* Use the expected PSN. */ - qp->s_ack_psn = qp->r_psn; - goto resched; + if (qp->r_ack_state < OP(COMPARE_SWAP)) { + qp->r_ack_state = OP(SEND_ONLY); + qp->r_nak_state = IB_NAK_PSN_ERROR; + /* Use the expected PSN. */ + qp->r_ack_psn = qp->r_psn; + } + goto send_ack; } /* @@ -1151,33 +1189,29 @@ static inline int ipath_rc_rcv_error(str * send the earliest so that RDMA reads can be restarted at * the requester's expected PSN. 
*/ - spin_lock(&qp->s_lock); - if (qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE && - ipath_cmp24(psn, qp->s_ack_psn) >= 0) { - if (qp->s_ack_state < IB_OPCODE_RDMA_READ_REQUEST) - qp->s_ack_psn = psn; - spin_unlock(&qp->s_lock); - goto done; - } - switch (opcode) { - case OP(RDMA_READ_REQUEST): - /* - * We have to be careful to not change s_rdma_sge - * while ipath_do_rc_send() is using it and not - * holding the s_lock. - */ - if (qp->s_ack_state != OP(ACKNOWLEDGE) && - qp->s_ack_state >= IB_OPCODE_RDMA_READ_REQUEST) { - spin_unlock(&qp->s_lock); - dev->n_rdma_dup_busy++; - goto done; - } + if (opcode == OP(RDMA_READ_REQUEST)) { /* RETH comes after BTH */ if (!header_in_data) reth = &ohdr->u.rc.reth; else { reth = (struct ib_reth *)data; data += sizeof(*reth); + } + /* + * If we receive a duplicate RDMA request, it means the + * requester saw a sequence error and needs to restart + * from an earlier point. We can abort the current + * RDMA read send in that case. + */ + spin_lock_irq(&qp->s_lock); + if (qp->s_ack_state != OP(ACKNOWLEDGE) && + (qp->s_hdrwords || ipath_cmp24(psn, qp->s_ack_psn) >= 0)) { + /* + * We are already sending earlier requested data. + * Don't abort it to send later out of sequence data. 
+ */ + spin_unlock_irq(&qp->s_lock); + goto done; } qp->s_rdma_len = be32_to_cpu(reth->length); if (qp->s_rdma_len != 0) { @@ -1192,8 +1226,10 @@ static inline int ipath_rc_rcv_error(str ok = ipath_rkey_ok(dev, &qp->s_rdma_sge, qp->s_rdma_len, vaddr, rkey, IB_ACCESS_REMOTE_READ); - if (unlikely(!ok)) + if (unlikely(!ok)) { + spin_unlock_irq(&qp->s_lock); goto done; + } } else { qp->s_rdma_sge.sg_list = NULL; qp->s_rdma_sge.num_sge = 0; @@ -1202,25 +1238,44 @@ static inline int ipath_rc_rcv_error(str qp->s_rdma_sge.sge.length = 0; qp->s_rdma_sge.sge.sge_length = 0; } - break; - + qp->s_ack_state = opcode; + qp->s_ack_psn = psn; + spin_unlock_irq(&qp->s_lock); + tasklet_hi_schedule(&qp->s_task); + goto send_ack; + } + + /* + * A pending RDMA read will ACK anything before it so + * ignore earlier duplicate requests. + */ + if (qp->s_ack_state != OP(ACKNOWLEDGE)) + goto done; + + /* + * If an ACK is pending, don't replace the pending ACK + * with an earlier one since the later one will ACK the earlier. + * Also, if we already have a pending atomic, send it. + */ + if (qp->r_ack_state != OP(ACKNOWLEDGE) && + (ipath_cmp24(psn, qp->r_ack_psn) <= 0 || + qp->r_ack_state >= OP(COMPARE_SWAP))) + goto send_ack; + switch (opcode) { case OP(COMPARE_SWAP): case OP(FETCH_ADD): /* * Check for the PSN of the last atomic operation * performed and resend the result if found. 
*/ - if ((psn & IPS_PSN_MASK) != qp->r_atomic_psn) { - spin_unlock(&qp->s_lock); + if ((psn & IPS_PSN_MASK) != qp->r_atomic_psn) goto done; - } - qp->s_ack_atomic = qp->r_atomic_data; break; } - qp->s_ack_state = opcode; - qp->s_nak_state = 0; - qp->s_ack_psn = psn; -resched: + qp->r_ack_state = opcode; + qp->r_nak_state = 0; + qp->r_ack_psn = psn; +send_ack: return 0; done: @@ -1248,7 +1303,6 @@ void ipath_rc_rcv(struct ipath_ibdev *de u32 hdrsize; u32 psn; u32 pad; - unsigned long flags; struct ib_wc wc; u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); int diff; @@ -1289,10 +1343,8 @@ void ipath_rc_rcv(struct ipath_ibdev *de opcode <= OP(ATOMIC_ACKNOWLEDGE)) { ipath_rc_rcv_resp(dev, ohdr, data, tlen, qp, opcode, psn, hdrsize, pmtu, header_in_data); - goto bail; - } - - spin_lock_irqsave(&qp->r_rq.lock, flags); + goto done; + } /* Compute 24 bits worth of difference. */ diff = ipath_cmp24(psn, qp->r_psn); @@ -1300,7 +1352,7 @@ void ipath_rc_rcv(struct ipath_ibdev *de if (ipath_rc_rcv_error(dev, ohdr, data, qp, opcode, psn, diff, header_in_data)) goto done; - goto resched; + goto send_ack; } /* Check for opcode sequence errors. */ @@ -1312,22 +1364,19 @@ void ipath_rc_rcv(struct ipath_ibdev *de opcode == OP(SEND_LAST_WITH_IMMEDIATE)) break; nack_inv: - /* - * A NAK will ACK earlier sends and RDMA writes. Don't queue the - * NAK if a RDMA read, atomic, or NAK is pending though. - */ - spin_lock(&qp->s_lock); - if (qp->s_ack_state >= OP(RDMA_READ_REQUEST) && - qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) { - spin_unlock(&qp->s_lock); - goto done; - } - /* XXX Flush WQEs */ - qp->state = IB_QPS_ERR; - qp->s_ack_state = OP(SEND_ONLY); - qp->s_nak_state = IB_NAK_INVALID_REQUEST; - qp->s_ack_psn = qp->r_psn; - goto resched; + /* + * A NAK will ACK earlier sends and RDMA writes. + * Don't queue the NAK if a RDMA read, atomic, or NAK + * is pending though. 
+ */ + if (qp->r_ack_state >= OP(COMPARE_SWAP)) + goto send_ack; + /* XXX Flush WQEs */ + qp->state = IB_QPS_ERR; + qp->r_ack_state = OP(SEND_ONLY); + qp->r_nak_state = IB_NAK_INVALID_REQUEST; + qp->r_ack_psn = qp->r_psn; + goto send_ack; case OP(RDMA_WRITE_FIRST): case OP(RDMA_WRITE_MIDDLE): @@ -1336,20 +1385,6 @@ void ipath_rc_rcv(struct ipath_ibdev *de opcode == OP(RDMA_WRITE_LAST_WITH_IMMEDIATE)) break; goto nack_inv; - - case OP(RDMA_READ_REQUEST): - case OP(COMPARE_SWAP): - case OP(FETCH_ADD): - /* - * Drop all new requests until a response has been sent. A - * new request then ACKs the RDMA response we sent. Relaxed - * ordering would allow new requests to be processed but we - * would need to keep a queue of rwqe's for all that are in - * progress. Note that we can't RNR NAK this request since - * the RDMA READ or atomic response is already queued to be - * sent (unless we implement a response send queue). - */ - goto done; default: if (opcode == OP(SEND_MIDDLE) || @@ -1359,6 +1394,11 @@ void ipath_rc_rcv(struct ipath_ibdev *de opcode == OP(RDMA_WRITE_LAST) || opcode == OP(RDMA_WRITE_LAST_WITH_IMMEDIATE)) goto nack_inv; + /* + * Note that it is up to the requester to not send a new + * RDMA read or atomic operation before receiving an ACK + * for the previous operation. + */ break; } @@ -1375,17 +1415,12 @@ void ipath_rc_rcv(struct ipath_ibdev *de * Don't queue the NAK if a RDMA read or atomic * is pending though. 
*/ - spin_lock(&qp->s_lock); - if (qp->s_ack_state >= - OP(RDMA_READ_REQUEST) && - qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) { - spin_unlock(&qp->s_lock); - goto done; - } - qp->s_ack_state = OP(SEND_ONLY); - qp->s_nak_state = IB_RNR_NAK | qp->s_min_rnr_timer; - qp->s_ack_psn = qp->r_psn; - goto resched; + if (qp->r_ack_state >= OP(COMPARE_SWAP)) + goto send_ack; + qp->r_ack_state = OP(SEND_ONLY); + qp->r_nak_state = IB_RNR_NAK | qp->r_min_rnr_timer; + qp->r_ack_psn = qp->r_psn; + goto send_ack; } qp->r_rcv_len = 0; /* FALLTHROUGH */ @@ -1442,7 +1477,7 @@ void ipath_rc_rcv(struct ipath_ibdev *de if (unlikely(wc.byte_len > qp->r_len)) goto nack_inv; ipath_copy_sge(&qp->r_sge, data, tlen); - atomic_inc(&qp->msn); + qp->r_msn++; if (opcode == OP(RDMA_WRITE_LAST) || opcode == OP(RDMA_WRITE_ONLY)) break; @@ -1486,29 +1521,8 @@ void ipath_rc_rcv(struct ipath_ibdev *de ok = ipath_rkey_ok(dev, &qp->r_sge, qp->r_len, vaddr, rkey, IB_ACCESS_REMOTE_WRITE); - if (unlikely(!ok)) { - nack_acc: - /* - * A NAK will ACK earlier sends and RDMA - * writes. Don't queue the NAK if a RDMA - * read, atomic, or NAK is pending though. 
- */ - spin_lock(&qp->s_lock); - if (qp->s_ack_state >= - OP(RDMA_READ_REQUEST) && - qp->s_ack_state != - IB_OPCODE_ACKNOWLEDGE) { - spin_unlock(&qp->s_lock); - goto done; - } - /* XXX Flush WQEs */ - qp->state = IB_QPS_ERR; - qp->s_ack_state = OP(RDMA_WRITE_ONLY); - qp->s_nak_state = - IB_NAK_REMOTE_ACCESS_ERROR; - qp->s_ack_psn = qp->r_psn; - goto resched; - } + if (unlikely(!ok)) + goto nack_acc; } else { qp->r_sge.sg_list = NULL; qp->r_sge.sge.mr = NULL; @@ -1535,12 +1549,10 @@ void ipath_rc_rcv(struct ipath_ibdev *de reth = (struct ib_reth *)data; data += sizeof(*reth); } - spin_lock(&qp->s_lock); - if (qp->s_ack_state != OP(ACKNOWLEDGE) && - qp->s_ack_state >= IB_OPCODE_RDMA_READ_REQUEST) { - spin_unlock(&qp->s_lock); - goto done; - } + if (unlikely(!(qp->qp_access_flags & + IB_ACCESS_REMOTE_READ))) + goto nack_acc; + spin_lock_irq(&qp->s_lock); qp->s_rdma_len = be32_to_cpu(reth->length); if (qp->s_rdma_len != 0) { u32 rkey = be32_to_cpu(reth->rkey); @@ -1552,7 +1564,7 @@ void ipath_rc_rcv(struct ipath_ibdev *de qp->s_rdma_len, vaddr, rkey, IB_ACCESS_REMOTE_READ); if (unlikely(!ok)) { - spin_unlock(&qp->s_lock); + spin_unlock_irq(&qp->s_lock); goto nack_acc; } /* @@ -1569,21 +1581,25 @@ void ipath_rc_rcv(struct ipath_ibdev *de qp->s_rdma_sge.sge.length = 0; qp->s_rdma_sge.sge.sge_length = 0; } - if (unlikely(!(qp->qp_access_flags & - IB_ACCESS_REMOTE_READ))) - goto nack_acc; /* * We need to increment the MSN here instead of when we * finish sending the result since a duplicate request would * increment it more than once. */ - atomic_inc(&qp->msn); + qp->r_msn++; + qp->s_ack_state = opcode; - qp->s_nak_state = 0; qp->s_ack_psn = psn; + spin_unlock_irq(&qp->s_lock); + qp->r_psn++; qp->r_state = opcode; - goto rdmadone; + qp->r_nak_state = 0; + + /* Call ipath_do_rc_send() in another thread. 
*/ + tasklet_hi_schedule(&qp->s_task); + + goto done; case OP(COMPARE_SWAP): case OP(FETCH_ADD): { @@ -1612,7 +1628,7 @@ void ipath_rc_rcv(struct ipath_ibdev *de goto nack_acc; /* Perform atomic OP and save result. */ sdata = be64_to_cpu(ateth->swap_data); - spin_lock(&dev->pending_lock); + spin_lock_irq(&dev->pending_lock); qp->r_atomic_data = *(u64 *) qp->r_sge.sge.vaddr; if (opcode == OP(FETCH_ADD)) *(u64 *) qp->r_sge.sge.vaddr = @@ -1620,8 +1636,8 @@ void ipath_rc_rcv(struct ipath_ibdev *de else if (qp->r_atomic_data == be64_to_cpu(ateth->compare_data)) *(u64 *) qp->r_sge.sge.vaddr = sdata; - spin_unlock(&dev->pending_lock); - atomic_inc(&qp->msn); + spin_unlock_irq(&dev->pending_lock); + qp->r_msn++; qp->r_atomic_psn = psn & IPS_PSN_MASK; psn |= 1 << 31; break; @@ -1633,44 +1649,39 @@ void ipath_rc_rcv(struct ipath_ibdev *de } qp->r_psn++; qp->r_state = opcode; + qp->r_nak_state = 0; /* Send an ACK if requested or required. */ if (psn & (1 << 31)) { /* * Coalesce ACKs unless there is a RDMA READ or * ATOMIC pending. */ - spin_lock(&qp->s_lock); - if (qp->s_ack_state == OP(ACKNOWLEDGE) || - qp->s_ack_state < IB_OPCODE_RDMA_READ_REQUEST) { - qp->s_ack_state = opcode; - qp->s_nak_state = 0; - qp->s_ack_psn = psn; - qp->s_ack_atomic = qp->r_atomic_data; - goto resched; - } - spin_unlock(&qp->s_lock); - } + if (qp->r_ack_state < OP(COMPARE_SWAP)) { + qp->r_ack_state = opcode; + qp->r_ack_psn = psn; + } + goto send_ack; + } + goto done; + +nack_acc: + /* + * A NAK will ACK earlier sends and RDMA writes. + * Don't queue the NAK if a RDMA read, atomic, or NAK + * is pending though. + */ + if (qp->r_ack_state < OP(COMPARE_SWAP)) { + /* XXX Flush WQEs */ + qp->state = IB_QPS_ERR; + qp->r_ack_state = OP(RDMA_WRITE_ONLY); + qp->r_nak_state = IB_NAK_REMOTE_ACCESS_ERROR; + qp->r_ack_psn = qp->r_psn; + } +send_ack: + /* Send ACK right away unless the send tasklet has a pending ACK. 
*/ + if (qp->s_ack_state == OP(ACKNOWLEDGE)) + send_rc_ack(qp); + done: - spin_unlock_irqrestore(&qp->r_rq.lock, flags); - goto bail; - -resched: - /* - * Try to send ACK right away but not if ipath_do_rc_send() is - * active. - */ - if (qp->s_hdrwords == 0 && - (qp->s_ack_state < IB_OPCODE_RDMA_READ_REQUEST || - qp->s_ack_state >= IB_OPCODE_COMPARE_SWAP)) - send_rc_ack(qp); - -rdmadone: - spin_unlock(&qp->s_lock); - spin_unlock_irqrestore(&qp->r_rq.lock, flags); - - /* Call ipath_do_rc_send() in another thread. */ - tasklet_hi_schedule(&qp->s_task); - -bail: return; } diff -r 5f3c0b2d446d -r 1bef8244297a drivers/infiniband/hw/ipath/ipath_ruc.c --- a/drivers/infiniband/hw/ipath/ipath_ruc.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c Thu Jun 29 14:33:26 2006 -0700 @@ -113,20 +113,23 @@ void ipath_insert_rnr_queue(struct ipath * * Return 0 if no RWQE is available, otherwise return 1. * - * Called at interrupt level with the QP r_rq.lock held. + * Can be called from interrupt level. 
*/ int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only) { + unsigned long flags; struct ipath_rq *rq; struct ipath_srq *srq; struct ipath_rwqe *wqe; - int ret; + int ret = 1; if (!qp->ibqp.srq) { rq = &qp->r_rq; + spin_lock_irqsave(&rq->lock, flags); + if (unlikely(rq->tail == rq->head)) { ret = 0; - goto bail; + goto done; } wqe = get_rwqe_ptr(rq, rq->tail); qp->r_wr_id = wqe->wr_id; @@ -138,17 +141,16 @@ int ipath_get_rwqe(struct ipath_qp *qp, } if (++rq->tail >= rq->size) rq->tail = 0; - ret = 1; - goto bail; + goto done; } srq = to_isrq(qp->ibqp.srq); rq = &srq->rq; - spin_lock(&rq->lock); + spin_lock_irqsave(&rq->lock, flags); + if (unlikely(rq->tail == rq->head)) { - spin_unlock(&rq->lock); ret = 0; - goto bail; + goto done; } wqe = get_rwqe_ptr(rq, rq->tail); qp->r_wr_id = wqe->wr_id; @@ -170,18 +172,18 @@ int ipath_get_rwqe(struct ipath_qp *qp, n = rq->head - rq->tail; if (n < srq->limit) { srq->limit = 0; - spin_unlock(&rq->lock); + spin_unlock_irqrestore(&rq->lock, flags); ev.device = qp->ibqp.device; ev.element.srq = qp->ibqp.srq; ev.event = IB_EVENT_SRQ_LIMIT_REACHED; srq->ibsrq.event_handler(&ev, srq->ibsrq.srq_context); - } else - spin_unlock(&rq->lock); - } else - spin_unlock(&rq->lock); - ret = 1; - + goto bail; + } + } + +done: + spin_unlock_irqrestore(&rq->lock, flags); bail: return ret; } @@ -248,10 +250,8 @@ again: wc.imm_data = wqe->wr.imm_data; /* FALLTHROUGH */ case IB_WR_SEND: - spin_lock_irqsave(&qp->r_rq.lock, flags); if (!ipath_get_rwqe(qp, 0)) { rnr_nak: - spin_unlock_irqrestore(&qp->r_rq.lock, flags); /* Handle RNR NAK */ if (qp->ibqp.qp_type == IB_QPT_UC) goto send_comp; @@ -263,20 +263,17 @@ again: sqp->s_rnr_retry--; dev->n_rnr_naks++; sqp->s_rnr_timeout = - ib_ipath_rnr_table[sqp->s_min_rnr_timer]; + ib_ipath_rnr_table[sqp->r_min_rnr_timer]; ipath_insert_rnr_queue(sqp); goto done; } - spin_unlock_irqrestore(&qp->r_rq.lock, flags); break; case IB_WR_RDMA_WRITE_WITH_IMM: wc.wc_flags = IB_WC_WITH_IMM; wc.imm_data = wqe->wr.imm_data; 
- spin_lock_irqsave(&qp->r_rq.lock, flags); if (!ipath_get_rwqe(qp, 1)) goto rnr_nak; - spin_unlock_irqrestore(&qp->r_rq.lock, flags); /* FALLTHROUGH */ case IB_WR_RDMA_WRITE: if (wqe->length == 0) diff -r 5f3c0b2d446d -r 1bef8244297a drivers/infiniband/hw/ipath/ipath_uc.c --- a/drivers/infiniband/hw/ipath/ipath_uc.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_uc.c Thu Jun 29 14:33:26 2006 -0700 @@ -241,7 +241,6 @@ void ipath_uc_rcv(struct ipath_ibdev *de u32 hdrsize; u32 psn; u32 pad; - unsigned long flags; struct ib_wc wc; u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); struct ib_reth *reth; @@ -279,8 +278,6 @@ void ipath_uc_rcv(struct ipath_ibdev *de wc.imm_data = 0; wc.wc_flags = 0; - spin_lock_irqsave(&qp->r_rq.lock, flags); - /* Compare the PSN verses the expected PSN. */ if (unlikely(ipath_cmp24(psn, qp->r_psn) != 0)) { /* @@ -537,15 +534,11 @@ void ipath_uc_rcv(struct ipath_ibdev *de default: /* Drop packet for unknown opcodes. */ - spin_unlock_irqrestore(&qp->r_rq.lock, flags); dev->n_pkt_drops++; - goto bail; + goto done; } qp->r_psn++; qp->r_state = opcode; done: - spin_unlock_irqrestore(&qp->r_rq.lock, flags); - -bail: return; } diff -r 5f3c0b2d446d -r 1bef8244297a drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Jun 29 14:33:26 2006 -0700 @@ -307,32 +307,34 @@ struct ipath_qp { u32 s_next_psn; /* PSN for next request */ u32 s_last_psn; /* last response PSN processed */ u32 s_psn; /* current packet sequence number */ + u32 s_ack_psn; /* PSN for RDMA_READ */ u32 s_rnr_timeout; /* number of milliseconds for RNR timeout */ - u32 s_ack_psn; /* PSN for next ACK or RDMA_READ */ - u64 s_ack_atomic; /* data for atomic ACK */ + u32 r_ack_psn; /* PSN for next ACK or atomic ACK */ u64 r_wr_id; /* ID for current receive WQE */ u64 r_atomic_data; /* data for last atomic op */ u32 r_atomic_psn; /* PSN of last atomic op */ 
 	u32 r_len;		/* total length of r_sge */
 	u32 r_rcv_len;		/* receive data len processed */
 	u32 r_psn;		/* expected rcv packet sequence number */
+	u32 r_msn;		/* message sequence number */
 	u8 state;		/* QP state */
 	u8 s_state;		/* opcode of last packet sent */
 	u8 s_ack_state;		/* opcode of packet to ACK */
 	u8 s_nak_state;		/* non-zero if NAK is pending */
 	u8 r_state;		/* opcode of last packet received */
+	u8 r_ack_state;		/* opcode of packet to ACK */
+	u8 r_nak_state;		/* non-zero if NAK is pending */
+	u8 r_min_rnr_timer;	/* retry timeout value for RNR NAKs */
 	u8 r_reuse_sge;		/* for UC receive errors */
 	u8 r_sge_inx;		/* current index into sg_list */
+	u8 qp_access_flags;
 	u8 s_max_sge;		/* size of s_wq->sg_list */
-	u8 qp_access_flags;
 	u8 s_retry_cnt;		/* number of times to retry */
 	u8 s_rnr_retry_cnt;
-	u8 s_min_rnr_timer;
 	u8 s_retry;		/* requester retry counter */
 	u8 s_rnr_retry;		/* requester RNR retry counter */
 	u8 s_pkey_index;	/* PKEY index to use */
 	enum ib_mtu path_mtu;
-	atomic_t msn;		/* message sequence number */
 	u32 remote_qpn;
 	u32 qkey;		/* QKEY for this QP (for UD or RD) */
 	u32 s_size;		/* send work queue size */

From bos at pathscale.com  Thu Jun 29 14:41:27 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:41:27 -0700
Subject: [openib-general] [PATCH 36 of 39] IB/ipath - Ignore receive queue size if SRQ is specified
In-Reply-To: 
Message-ID: <31c382d8210a80c37278.1151617287@eng-12.pathscale.com>

According to the IB spec, the receive work queue size should be ignored
if the QP is created to use a shared receive queue.
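The sizing rule this patch implements can be sketched in a few lines of userspace C. This is a simplified, hypothetical model (not the driver's real API): when the consumer attaches a shared receive queue, the QP allocates no receive queue of its own and the requested size is ignored; otherwise one extra slot is reserved so a full ring buffer can be distinguished from an empty one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of the QP creation attributes that
 * matter for receive-queue sizing. */
struct qp_caps {
	int max_recv_wr;	/* requested receive work requests */
	int max_recv_sge;	/* requested scatter/gather entries */
	int has_srq;		/* non-zero if an SRQ was supplied */
};

/* Number of receive WQE slots the QP itself must allocate.  With an
 * SRQ, the SRQ owns the receive queue, so the answer is zero; without
 * one, allocate one extra slot so head == tail can mean "empty" while
 * a completely full ring remains representable. */
int qp_recv_queue_size(const struct qp_caps *caps)
{
	if (caps->has_srq)
		return 0;
	return caps->max_recv_wr + 1;
}
```

The `+ 1` mirrors the driver's `init_attr->cap.max_recv_wr + 1` in `ipath_create_qp()`; the SRQ branch mirrors the `if (init_attr->srq)` case the patch adds.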
Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r 9b423c45af8b -r 31c382d8210a drivers/infiniband/hw/ipath/ipath_qp.c --- a/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:26 2006 -0700 @@ -685,16 +685,22 @@ struct ib_qp *ipath_create_qp(struct ib_ ret = ERR_PTR(-ENOMEM); goto bail; } - qp->r_rq.size = init_attr->cap.max_recv_wr + 1; - sz = sizeof(struct ipath_sge) * - init_attr->cap.max_recv_sge + - sizeof(struct ipath_rwqe); - qp->r_rq.wq = vmalloc(qp->r_rq.size * sz); - if (!qp->r_rq.wq) { - kfree(qp); - vfree(swq); - ret = ERR_PTR(-ENOMEM); - goto bail; + if (init_attr->srq) { + qp->r_rq.size = 0; + qp->r_rq.max_sge = 0; + qp->r_rq.wq = NULL; + } else { + qp->r_rq.size = init_attr->cap.max_recv_wr + 1; + qp->r_rq.max_sge = init_attr->cap.max_recv_sge; + sz = (sizeof(struct ipath_sge) * qp->r_rq.max_sge) + + sizeof(struct ipath_rwqe); + qp->r_rq.wq = vmalloc(qp->r_rq.size * sz); + if (!qp->r_rq.wq) { + kfree(qp); + vfree(swq); + ret = ERR_PTR(-ENOMEM); + goto bail; + } } /* @@ -713,7 +719,6 @@ struct ib_qp *ipath_create_qp(struct ib_ qp->s_wq = swq; qp->s_size = init_attr->cap.max_send_wr + 1; qp->s_max_sge = init_attr->cap.max_send_sge; - qp->r_rq.max_sge = init_attr->cap.max_recv_sge; qp->s_flags = init_attr->sq_sig_type == IB_SIGNAL_REQ_WR ? 
 		1 << IPATH_S_SIGNAL_REQ_WR : 0;
 	dev = to_idev(ibpd->device);

From bos at pathscale.com  Thu Jun 29 14:41:26 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:41:26 -0700
Subject: [openib-general] [PATCH 35 of 39] IB/ipath - remove some #if 0 code related to lockable memory
In-Reply-To: 
Message-ID: <9b423c45af8b2eb98562.1151617286@eng-12.pathscale.com>

Signed-off-by: Dave Olson
Signed-off-by: Bryan O'Sullivan

diff -r b6ebaf2dd2fd -r 9b423c45af8b drivers/infiniband/hw/ipath/ipath_user_pages.c
--- a/drivers/infiniband/hw/ipath/ipath_user_pages.c	Thu Jun 29 14:33:26 2006 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_user_pages.c	Thu Jun 29 14:33:26 2006 -0700
@@ -57,17 +57,6 @@ static int __get_user_pages(unsigned lon
 	unsigned long lock_limit;
 	size_t got;
 	int ret;
-
-#if 0
-	/*
-	 * XXX - causes MPI programs to fail, haven't had time to check
-	 * yet
-	 */
-	if (!capable(CAP_IPC_LOCK)) {
-		ret = -EPERM;
-		goto bail;
-	}
-#endif
 
 	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >>
 		PAGE_SHIFT;

From bos at pathscale.com  Thu Jun 29 14:41:29 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:41:29 -0700
Subject: [openib-general] [PATCH 38 of 39] IB/ipath - More changes to support InfiniPath on PowerPC 970 systems
In-Reply-To: 
Message-ID: 

The ordering of writethrough store buffers needs to be forced, and we
need an #ifdef to get writethrough behavior for the InfiniPath buffers,
because there is currently no generic way to specify that (similar to
code in char/drm/drm_vm.c and block/z2ram.c).
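Besides the #ifdef, this patch's new per-architecture file relies on the linker's weak-symbol override: generic code supplies a weak default, and the PowerPC-only object, when the Makefile links it in, replaces it with a strong definition. A minimal userspace sketch of that mechanism (assuming GCC/clang on an ELF target; this is not the kernel build itself):

```c
/* Weak default for ipath_unordered_wc(): on most architectures write
 * combining preserves the ordering the driver needs, so the fallback
 * reports "ordered" (0).  A strong definition in an arch-specific
 * object file -- like the one ipath_wc_ppc64.c provides, returning 1 --
 * silently wins over this one at link time. */
int __attribute__((weak)) ipath_unordered_wc(void)
{
	return 0;
}
```

Linked on its own, the weak default is used; add an object containing a plain (strong) `ipath_unordered_wc` and the linker picks that instead, with no #ifdef at the call sites.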
Signed-off-by: John Gregor Signed-off-by: Bryan O'Sullivan diff -r 2a721e1f490b -r c22b6c244d5d drivers/infiniband/hw/ipath/Makefile --- a/drivers/infiniband/hw/ipath/Makefile Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/Makefile Thu Jun 29 14:33:26 2006 -0700 @@ -20,6 +20,7 @@ ipath_core-y := \ ipath_user_pages.o ipath_core-$(CONFIG_X86_64) += ipath_wc_x86_64.o +ipath_core-$(CONFIG_PPC64) += ipath_wc_ppc64.o ib_ipath-y := \ ipath_cq.o \ diff -r 2a721e1f490b -r c22b6c244d5d drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 @@ -440,7 +440,13 @@ static int __devinit ipath_init_one(stru } dd->ipath_pcirev = rev; +#if defined(__powerpc__) + /* There isn't a generic way to specify writethrough mappings */ + dd->ipath_kregbase = __ioremap(addr, len, + (_PAGE_NO_CACHE|_PAGE_WRITETHRU)); +#else dd->ipath_kregbase = ioremap_nocache(addr, len); +#endif if (!dd->ipath_kregbase) { ipath_dbg("Unable to map io addr %llx to kvirt, failing\n", diff -r 2a721e1f490b -r c22b6c244d5d drivers/infiniband/hw/ipath/ipath_file_ops.c --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:26 2006 -0700 @@ -985,6 +985,13 @@ static int mmap_piobufs(struct vm_area_s * write combining behavior we want on the PIO buffers! 
*/ +#if defined(__powerpc__) + /* There isn't a generic way to specify writethrough mappings */ + pgprot_val(vma->vm_page_prot) |= _PAGE_NO_CACHE; + pgprot_val(vma->vm_page_prot) |= _PAGE_WRITETHRU; + pgprot_val(vma->vm_page_prot) &= ~_PAGE_GUARDED; +#endif + if (vma->vm_flags & VM_READ) { dev_info(&dd->pcidev->dev, "Can't map piobufs as readable (flags=%lx)\n", diff -r 2a721e1f490b -r c22b6c244d5d drivers/infiniband/hw/ipath/ipath_wc_ppc64.c --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_wc_ppc64.c Thu Jun 29 14:33:26 2006 -0700 @@ -0,0 +1,52 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */
+
+/*
+ * This file is conditionally built on PowerPC only.  Otherwise weak symbol
+ * versions of the functions exported from here are used.
+ */
+
+#include "ipath_kernel.h"
+
+/**
+ * ipath_unordered_wc - indicate whether write combining is ordered
+ *
+ * PowerPC systems (at least those in the 970 processor family)
+ * write partially filled store buffers in address order, but will write
+ * completely filled store buffers in "random" order, and therefore must
+ * have serialization for correctness with current InfiniPath chips.
+ */
+int ipath_unordered_wc(void)
+{
+	return 1;
+}

From bos at pathscale.com  Thu Jun 29 14:41:30 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 29 Jun 2006 14:41:30 -0700
Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss
In-Reply-To: 
Message-ID: <1b00209ef20a0e7893d8.1151617290@eng-12.pathscale.com>

When a large incoming RDMA is being received, we have to copy data
inside the interrupt handler before we can ACK each packet.  The source
is DMAed to by the hardware, which means the CPU won't have it cached,
and we read it only this once; using normal load instructions pollutes
the dcache with useless data, reducing performance to the point where we
can lose a significant number of packets.  By using a (memcpy-compatible)
copy routine that loads with streaming instructions, we avoid filling
the dcache with data we will never read again.  Avoiding the cache
refill penalty lets us keep up better with the sender, resulting in many
fewer dropped packets.  We use normal stores to the destination, because
the copied-to data will be used soon after the interrupt handler
completes.
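The shape of the copy routine this description calls for can be sketched portably in C. This is an illustrative userspace analog, not the driver's hand-written x86-64 assembly: it copies in 64-byte blocks with a low-temporal-locality software prefetch on the source (standing in for `prefetchnta`), mops up a 0..63 byte tail, and returns the destination pointer so it stays memcpy-compatible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of a streaming-flavored, memcpy-compatible copy.  The third
 * argument of __builtin_prefetch is the temporal-locality hint; 0 asks
 * the CPU to avoid keeping the fetched source line in cache, which is
 * the point of the real ipath_memcpy_nc: the source is read exactly
 * once, so caching it only evicts useful data. */
void *memcpy_stream_sketch(void *dest, const void *src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	while (n >= 64) {
		/* Prefetch well ahead of the current read position.
		 * A prefetch is only a hint, so reading "past the end"
		 * of the source buffer here is harmless. */
		__builtin_prefetch(s + 128, 0, 0);
		memcpy(d, s, 64);	/* compilers expand this inline */
		d += 64;
		s += 64;
		n -= 64;
	}
	if (n)
		memcpy(d, s, n);	/* 0..63 byte tail */
	return dest;
}
```

The real routine also ends its streaming loop with `sfence` and uses a jump table for tiny sizes; those are omitted here since plain C stores need no explicit fence.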
Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r c22b6c244d5d -r 1b00209ef20a drivers/infiniband/hw/ipath/Makefile --- a/drivers/infiniband/hw/ipath/Makefile Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/Makefile Thu Jun 29 14:33:26 2006 -0700 @@ -35,3 +35,5 @@ ib_ipath-y := \ ipath_ud.o \ ipath_verbs.o \ ipath_verbs_mcast.o + +ib_ipath-$(CONFIG_X86_64) += ipath_memcpy_x86_64.o diff -r c22b6c244d5d -r 1b00209ef20a drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:26 2006 -0700 @@ -159,7 +159,7 @@ void ipath_copy_sge(struct ipath_sge_sta BUG_ON(len == 0); if (len > length) len = length; - memcpy(sge->vaddr, data, len); + ipath_memcpy_nc(sge->vaddr, data, len); sge->vaddr += len; sge->length -= len; sge->sge_length -= len; diff -r c22b6c244d5d -r 1b00209ef20a drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Jun 29 14:33:26 2006 -0700 @@ -728,4 +728,14 @@ extern unsigned int ib_ipath_max_srq_wrs extern const u32 ib_ipath_rnr_table[]; +/* + * Copy data. Try not to pollute the dcache with the source data, + * because we won't be reading it again. + */ +#if defined(CONFIG_X86_64) +void *ipath_memcpy_nc(void *dest, const void *src, size_t n); +#else +#define ipath_memcpy_nc(dest, src, n) memcpy(dest, src, n) +#endif + #endif /* IPATH_VERBS_H */ diff -r c22b6c244d5d -r 1b00209ef20a drivers/infiniband/hw/ipath/ipath_memcpy_x86_64.S --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_memcpy_x86_64.S Thu Jun 29 14:33:26 2006 -0700 @@ -0,0 +1,157 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +/* + * ipath_memcpy_nc - memcpy-compatible copy routine, using streaming loads + * @dest: destination address + * @src: source address + * @count: number of bytes to copy + * + * Use streaming loads and normal stores for a special-case copy where + * we know we won't be reading the source again, but will be reading the + * destination again soon. 
+ */ + .text + .p2align 4,,15 + /* rdi destination, rsi source, rdx count */ + .globl ipath_memcpy_nc + .type ipath_memcpy_nc, @function +ipath_memcpy_nc: + movq %rdi, %rax +.L5: + cmpq $15, %rdx + ja .L34 +.L3: + cmpl $8, %edx /* rdx is 0..15 */ + jbe .L9 +.L6: + testb $8, %dxl /* rdx is 3,5,6,7,9..15 */ + je .L13 + movq (%rsi), %rcx + addq $8, %rsi + movq %rcx, (%rdi) + addq $8, %rdi +.L13: + testb $4, %dxl + je .L15 + movl (%rsi), %ecx + addq $4, %rsi + movl %ecx, (%rdi) + addq $4, %rdi +.L15: + testb $2, %dxl + je .L17 + movzwl (%rsi), %ecx + addq $2, %rsi + movw %cx, (%rdi) + addq $2, %rdi +.L17: + testb $1, %dxl + je .L33 +.L1: + movzbl (%rsi), %ecx + movb %cl, (%rdi) +.L33: + ret +.L34: + cmpq $63, %rdx /* rdx is > 15 */ + ja .L64 + movl $16, %ecx /* rdx is 16..63 */ +.L25: + movq 8(%rsi), %r8 + movq (%rsi), %r9 + addq %rcx, %rsi + movq %r8, 8(%rdi) + movq %r9, (%rdi) + addq %rcx, %rdi + subq %rcx, %rdx + cmpl %edx, %ecx /* is rdx >= 16? */ + jbe .L25 + jmp .L3 /* rdx is 0..15 */ + .p2align 4,,7 +.L64: + movl $64, %ecx +.L42: + prefetchnta 128(%rsi) + movq (%rsi), %r8 + movq 8(%rsi), %r9 + movq 16(%rsi), %r10 + movq 24(%rsi), %r11 + subq %rcx, %rdx + movq %r8, (%rdi) + movq 32(%rsi), %r8 + movq %r9, 8(%rdi) + movq 40(%rsi), %r9 + movq %r10, 16(%rdi) + movq 48(%rsi), %r10 + movq %r11, 24(%rdi) + movq 56(%rsi), %r11 + addq %rcx, %rsi + movq %r8, 32(%rdi) + movq %r9, 40(%rdi) + movq %r10, 48(%rdi) + movq %r11, 56(%rdi) + addq %rcx, %rdi + cmpq %rdx, %rcx /* is rdx >= 64? 
*/ + jbe .L42 + sfence + orl %edx, %edx + je .L33 + jmp .L5 +.L9: + jmp *.L12(,%rdx,8) /* rdx is 0..8 */ + .section .rodata + .align 8 + .align 4 +.L12: + .quad .L33 + .quad .L1 + .quad .L2 + .quad .L6 + .quad .L4 + .quad .L6 + .quad .L6 + .quad .L6 + .quad .L8 + .text +.L2: + movzwl (%rsi), %ecx + movw %cx, (%rdi) + ret +.L4: + movl (%rsi), %ecx + movl %ecx, (%rdi) + ret +.L8: + movq (%rsi), %rcx + movq %rcx, (%rdi) + ret From bos at pathscale.com Thu Jun 29 14:41:28 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:41:28 -0700 Subject: [openib-general] [PATCH 37 of 39] IB/ipath - namespace cleanup: replace ips with ipath In-Reply-To: Message-ID: <2a721e1f490b74df3737.1151617288@eng-12.pathscale.com> Remove ips namespace from infinipath drivers. This renames ips_common.h to ipath_common.h. Definitions, data structures, etc. that were not used by kernel modules have moved to user-only headers. All names including ips have been renamed to ipath. Some names have had an ipath prefix added. Signed-off-by: Christian Bell Signed-off-by: Bryan O'Sullivan diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_common.h --- a/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_common.h Thu Jun 29 14:33:26 2006 -0700 @@ -39,7 +39,8 @@ * to communicate between kernel and user code. */ -/* This is the IEEE-assigned OUI for QLogic, Inc. InfiniPath */ + +/* This is the IEEE-assigned OUI for QLogic Inc. InfiniPath */ #define IPATH_SRC_OUI_1 0x00 #define IPATH_SRC_OUI_2 0x11 #define IPATH_SRC_OUI_3 0x75 @@ -343,9 +344,9 @@ struct ipath_base_info { /* * Similarly, this is the kernel version going back to the user. It's * slightly different, in that we want to tell if the driver was built as - * part of a QLogic release, or from the driver from OpenIB, kernel.org, - * or a standard distribution, for support reasons. 
The high bit is 0 for - * non-QLogic, and 1 for QLogic-built/supplied. + * part of a QLogic release, or from the driver from openfabrics.org, + * kernel.org, or a standard distribution, for support reasons. + * The high bit is 0 for non-QLogic and 1 for QLogic-built/supplied. * * It's returned by the driver to the user code during initialization in the * spi_sw_version field of ipath_base_info, so the user code can in turn @@ -600,14 +601,118 @@ struct infinipath_counters { #define INFINIPATH_KPF_INTR 0x1 /* SendPIO per-buffer control */ -#define INFINIPATH_SP_LENGTHP1_MASK 0x3FF -#define INFINIPATH_SP_LENGTHP1_SHIFT 0 -#define INFINIPATH_SP_INTR 0x80000000 -#define INFINIPATH_SP_TEST 0x40000000 -#define INFINIPATH_SP_TESTEBP 0x20000000 +#define INFINIPATH_SP_TEST 0x40 +#define INFINIPATH_SP_TESTEBP 0x20 /* SendPIOAvail bits */ #define INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT 1 #define INFINIPATH_SENDPIOAVAIL_CHECK_SHIFT 0 +/* infinipath header format */ +struct ipath_header { + /* + * Version - 4 bits, Port - 4 bits, TID - 10 bits and Offset - + * 14 bits before ECO change ~28 Dec 03. After that, Vers 4, + * Port 3, TID 11, offset 14. + */ + __le32 ver_port_tid_offset; + __le16 chksum; + __le16 pkt_flags; +}; + +/* infinipath user message header format. + * This structure contains the first 4 fields common to all protocols + * that employ infinipath. 
+ */ +struct ipath_message_header { + __be16 lrh[4]; + __be32 bth[3]; + /* fields below this point are in host byte order */ + struct ipath_header iph; + __u8 sub_opcode; +}; + +/* infinipath ethernet header format */ +struct ether_header { + __be16 lrh[4]; + __be32 bth[3]; + struct ipath_header iph; + __u8 sub_opcode; + __u8 cmd; + __be16 lid; + __u16 mac[3]; + __u8 frag_num; + __u8 seq_num; + __le32 len; + /* MUST be of word size due to PIO write requirements */ + __le32 csum; + __le16 csum_offset; + __le16 flags; + __u16 first_2_bytes; + __u8 unused[2]; /* currently unused */ +}; + + +/* IB - LRH header consts */ +#define IPATH_LRH_GRH 0x0003 /* 1. word of IB LRH - next header: GRH */ +#define IPATH_LRH_BTH 0x0002 /* 1. word of IB LRH - next header: BTH */ + +/* misc. */ +#define SIZE_OF_CRC 1 + +#define IPATH_DEFAULT_P_KEY 0xFFFF +#define IPATH_PERMISSIVE_LID 0xFFFF +#define IPATH_AETH_CREDIT_SHIFT 24 +#define IPATH_AETH_CREDIT_MASK 0x1F +#define IPATH_AETH_CREDIT_INVAL 0x1F +#define IPATH_PSN_MASK 0xFFFFFF +#define IPATH_MSN_MASK 0xFFFFFF +#define IPATH_QPN_MASK 0xFFFFFF +#define IPATH_MULTICAST_LID_BASE 0xC000 +#define IPATH_MULTICAST_QPN 0xFFFFFF + +/* Receive Header Queue: receive type (from infinipath) */ +#define RCVHQ_RCV_TYPE_EXPECTED 0 +#define RCVHQ_RCV_TYPE_EAGER 1 +#define RCVHQ_RCV_TYPE_NON_KD 2 +#define RCVHQ_RCV_TYPE_ERROR 3 + + +/* sub OpCodes - ith4x */ +#define IPATH_ITH4X_OPCODE_ENCAP 0x81 +#define IPATH_ITH4X_OPCODE_LID_ARP 0x82 + +#define IPATH_HEADER_QUEUE_WORDS 9 + +/* functions for extracting fields from rcvhdrq entries for the driver. 
+ */ +static inline __u32 ipath_hdrget_err_flags(const __le32 * rbuf) +{ + return __le32_to_cpu(rbuf[1]); +} + +static inline __u32 ipath_hdrget_rcv_type(const __le32 * rbuf) +{ + return (__le32_to_cpu(rbuf[0]) >> INFINIPATH_RHF_RCVTYPE_SHIFT) + & INFINIPATH_RHF_RCVTYPE_MASK; +} + +static inline __u32 ipath_hdrget_length_in_bytes(const __le32 * rbuf) +{ + return ((__le32_to_cpu(rbuf[0]) >> INFINIPATH_RHF_LENGTH_SHIFT) + & INFINIPATH_RHF_LENGTH_MASK) << 2; +} + +static inline __u32 ipath_hdrget_index(const __le32 * rbuf) +{ + return (__le32_to_cpu(rbuf[0]) >> INFINIPATH_RHF_EGRINDEX_SHIFT) + & INFINIPATH_RHF_EGRINDEX_MASK; +} + +static inline __u32 ipath_hdrget_ipath_ver(__le32 hdrword) +{ + return (__le32_to_cpu(hdrword) >> INFINIPATH_I_VERS_SHIFT) + & INFINIPATH_I_VERS_MASK; +} + #endif /* _IPATH_COMMON_H */ diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_diag.c --- a/drivers/infiniband/hw/ipath/ipath_diag.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_diag.c Thu Jun 29 14:33:26 2006 -0700 @@ -44,10 +44,9 @@ #include #include +#include "ipath_kernel.h" +#include "ipath_layer.h" #include "ipath_common.h" -#include "ipath_kernel.h" -#include "ips_common.h" -#include "ipath_layer.h" int ipath_diag_inuse; static int diag_set_link; diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Jun 29 14:33:26 2006 -0700 @@ -39,8 +39,8 @@ #include #include "ipath_kernel.h" -#include "ips_common.h" #include "ipath_layer.h" +#include "ipath_common.h" static void ipath_update_pio_bufs(struct ipath_devdata *); @@ -823,7 +823,8 @@ static void ipath_rcv_layer(struct ipath u8 pad, *bthbytes; struct sk_buff *skb, *nskb; - if (dd->ipath_port0_skbs && hdr->sub_opcode == OPCODE_ENCAP) { + if (dd->ipath_port0_skbs && + hdr->sub_opcode == IPATH_ITH4X_OPCODE_ENCAP) { /* * Allocate a new 
sk_buff to replace the one we give * to the network stack. @@ -854,7 +855,7 @@ static void ipath_rcv_layer(struct ipath /* another ether packet received */ ipath_stats.sps_ether_rpkts++; } - else if (hdr->sub_opcode == OPCODE_LID_ARP) + else if (hdr->sub_opcode == IPATH_ITH4X_OPCODE_LID_ARP) __ipath_layer_rcv_lid(dd, hdr); } @@ -871,7 +872,7 @@ void ipath_kreceive(struct ipath_devdata const u32 rsize = dd->ipath_rcvhdrentsize; /* words */ const u32 maxcnt = dd->ipath_rcvhdrcnt * rsize; /* words */ u32 etail = -1, l, hdrqtail; - struct ips_message_header *hdr; + struct ipath_message_header *hdr; u32 eflags, i, etype, tlen, pkttot = 0, updegr=0, reloop=0; static u64 totcalls; /* stats, may eventually remove */ char emsg[128]; @@ -897,7 +898,7 @@ reloop: u8 *bthbytes; rc = (u64 *) (dd->ipath_pd[0]->port_rcvhdrq + (l << 2)); - hdr = (struct ips_message_header *)&rc[1]; + hdr = (struct ipath_message_header *)&rc[1]; /* * could make a network order version of IPATH_KD_QP, and * do the obvious shift before masking to speed this up. @@ -905,10 +906,10 @@ reloop: qp = ntohl(hdr->bth[1]) & 0xffffff; bthbytes = (u8 *) hdr->bth; - eflags = ips_get_hdr_err_flags((__le32 *) rc); - etype = ips_get_rcv_type((__le32 *) rc); + eflags = ipath_hdrget_err_flags((__le32 *) rc); + etype = ipath_hdrget_rcv_type((__le32 *) rc); /* total length */ - tlen = ips_get_length_in_bytes((__le32 *) rc); + tlen = ipath_hdrget_length_in_bytes((__le32 *) rc); ebuf = NULL; if (etype != RCVHQ_RCV_TYPE_EXPECTED) { /* @@ -918,7 +919,7 @@ reloop: * set ebuf (so we try to copy data) unless the * length requires it. 
*/ - etail = ips_get_index((__le32 *) rc); + etail = ipath_hdrget_index((__le32 *) rc); if (tlen > sizeof(*hdr) || etype == RCVHQ_RCV_TYPE_NON_KD) ebuf = ipath_get_egrbuf(dd, etail, 0); @@ -930,7 +931,7 @@ reloop: */ if (etype != RCVHQ_RCV_TYPE_NON_KD && etype != - RCVHQ_RCV_TYPE_ERROR && ips_get_ipath_ver( + RCVHQ_RCV_TYPE_ERROR && ipath_hdrget_ipath_ver( hdr->iph.ver_port_tid_offset) != IPS_PROTO_VERSION) { ipath_cdbg(PKT, "Bad InfiniPath protocol version " @@ -943,7 +944,7 @@ reloop: ipath_cdbg(PKT, "RHFerrs %x hdrqtail=%x typ=%u " "tlen=%x opcode=%x egridx=%x: %s\n", eflags, l, etype, tlen, bthbytes[0], - ips_get_index((__le32 *) rc), emsg); + ipath_hdrget_index((__le32 *) rc), emsg); /* Count local link integrity errors. */ if (eflags & (INFINIPATH_RHF_H_ICRCERR | INFINIPATH_RHF_H_VCRCERR)) { diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_file_ops.c --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:26 2006 -0700 @@ -39,8 +39,8 @@ #include #include "ipath_kernel.h" -#include "ips_common.h" #include "ipath_layer.h" +#include "ipath_common.h" static int ipath_open(struct inode *, struct file *); static int ipath_close(struct inode *, struct file *); @@ -458,7 +458,7 @@ static int ipath_set_part_key(struct ipa u16 lkey = key & 0x7FFF; int ret; - if (lkey == (IPS_DEFAULT_P_KEY & 0x7FFF)) { + if (lkey == (IPATH_DEFAULT_P_KEY & 0x7FFF)) { /* nothing to do; this key always valid */ ret = 0; goto bail; diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_init_chip.c --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c Thu Jun 29 14:33:26 2006 -0700 @@ -36,7 +36,7 @@ #include #include "ipath_kernel.h" -#include "ips_common.h" +#include "ipath_common.h" /* * min buffers we want to have per port, after driver @@ -277,7 +277,7 @@ static int 
init_chip_first(struct ipath_ pd->port_port = 0; pd->port_cnt = 1; /* The port 0 pkey table is used by the layer interface. */ - pd->port_pkeys[0] = IPS_DEFAULT_P_KEY; + pd->port_pkeys[0] = IPATH_DEFAULT_P_KEY; dd->ipath_rcvtidcnt = ipath_read_kreg32(dd, dd->ipath_kregs->kr_rcvtidcnt); dd->ipath_rcvtidbase = diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Thu Jun 29 14:33:26 2006 -0700 @@ -34,8 +34,8 @@ #include #include "ipath_kernel.h" -#include "ips_common.h" #include "ipath_layer.h" +#include "ipath_common.h" /* These are all rcv-related errors which we want to count for stats */ #define E_SUM_PKTERRS \ diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_layer.c --- a/drivers/infiniband/hw/ipath/ipath_layer.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_layer.c Thu Jun 29 14:33:26 2006 -0700 @@ -41,8 +41,8 @@ #include #include "ipath_kernel.h" -#include "ips_common.h" #include "ipath_layer.h" +#include "ipath_common.h" /* Acquire before ipath_devs_lock. 
*/ static DEFINE_MUTEX(ipath_layer_mutex); @@ -622,7 +622,7 @@ int ipath_layer_open(struct ipath_devdat goto bail; } - ret = ipath_setrcvhdrsize(dd, NUM_OF_EXTRA_WORDS_IN_HEADER_QUEUE); + ret = ipath_setrcvhdrsize(dd, IPATH_HEADER_QUEUE_WORDS); if (ret < 0) goto bail; @@ -1106,10 +1106,10 @@ int ipath_layer_send_hdr(struct ipath_de } vlsllnh = *((__be16 *) hdr); - if (vlsllnh != htons(IPS_LRH_BTH)) { + if (vlsllnh != htons(IPATH_LRH_BTH)) { ipath_dbg("Warning: lrh[0] wrong (%x, not %x); " "not sending\n", be16_to_cpu(vlsllnh), - IPS_LRH_BTH); + IPATH_LRH_BTH); ret = -EINVAL; } if (ret) diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_mad.c --- a/drivers/infiniband/hw/ipath/ipath_mad.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_mad.c Thu Jun 29 14:33:26 2006 -0700 @@ -35,7 +35,7 @@ #include "ipath_kernel.h" #include "ipath_verbs.h" -#include "ips_common.h" +#include "ipath_common.h" #define IB_SMP_UNSUP_VERSION __constant_htons(0x0004) #define IB_SMP_UNSUP_METHOD __constant_htons(0x0008) @@ -306,7 +306,7 @@ static int recv_subn_set_portinfo(struct lid = be16_to_cpu(pip->lid); if (lid != ipath_layer_get_lid(dev->dd)) { /* Must be a valid unicast LID address. */ - if (lid == 0 || lid >= IPS_MULTICAST_LID_BASE) + if (lid == 0 || lid >= IPATH_MULTICAST_LID_BASE) goto err; ipath_set_lid(dev->dd, lid, pip->mkeyprot_resv_lmc & 7); event.event = IB_EVENT_LID_CHANGE; @@ -316,7 +316,7 @@ static int recv_subn_set_portinfo(struct smlid = be16_to_cpu(pip->sm_lid); if (smlid != dev->sm_lid) { /* Must be a valid unicast LID address. 
*/ - if (smlid == 0 || smlid >= IPS_MULTICAST_LID_BASE) + if (smlid == 0 || smlid >= IPATH_MULTICAST_LID_BASE) goto err; dev->sm_lid = smlid; event.event = IB_EVENT_SM_CHANGE; diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_qp.c --- a/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_qp.c Thu Jun 29 14:33:26 2006 -0700 @@ -35,7 +35,7 @@ #include #include "ipath_verbs.h" -#include "ips_common.h" +#include "ipath_common.h" #define BITS_PER_PAGE (PAGE_SIZE*BITS_PER_BYTE) #define BITS_PER_PAGE_MASK (BITS_PER_PAGE-1) @@ -450,7 +450,7 @@ int ipath_modify_qp(struct ib_qp *ibqp, if (attr_mask & IB_QP_AV) if (attr->ah_attr.dlid == 0 || - attr->ah_attr.dlid >= IPS_MULTICAST_LID_BASE) + attr->ah_attr.dlid >= IPATH_MULTICAST_LID_BASE) goto inval; if (attr_mask & IB_QP_PKEY_INDEX) @@ -585,14 +585,14 @@ int ipath_query_qp(struct ib_qp *ibqp, s */ __be32 ipath_compute_aeth(struct ipath_qp *qp) { - u32 aeth = qp->r_msn & IPS_MSN_MASK; + u32 aeth = qp->r_msn & IPATH_MSN_MASK; if (qp->ibqp.srq) { /* * Shared receive queues don't generate credits. * Set the credit field to the invalid value. */ - aeth |= IPS_AETH_CREDIT_INVAL << IPS_AETH_CREDIT_SHIFT; + aeth |= IPATH_AETH_CREDIT_INVAL << IPATH_AETH_CREDIT_SHIFT; } else { u32 min, max, x; u32 credits; @@ -622,7 +622,7 @@ __be32 ipath_compute_aeth(struct ipath_q else min = x; } - aeth |= x << IPS_AETH_CREDIT_SHIFT; + aeth |= x << IPATH_AETH_CREDIT_SHIFT; } return cpu_to_be32(aeth); } @@ -888,18 +888,18 @@ void ipath_sqerror_qp(struct ipath_qp *q */ void ipath_get_credit(struct ipath_qp *qp, u32 aeth) { - u32 credit = (aeth >> IPS_AETH_CREDIT_SHIFT) & IPS_AETH_CREDIT_MASK; + u32 credit = (aeth >> IPATH_AETH_CREDIT_SHIFT) & IPATH_AETH_CREDIT_MASK; /* * If the credit is invalid, we can send * as many packets as we like. Otherwise, we have to * honor the credit field. 
*/ - if (credit == IPS_AETH_CREDIT_INVAL) + if (credit == IPATH_AETH_CREDIT_INVAL) qp->s_lsn = (u32) -1; else if (qp->s_lsn != (u32) -1) { /* Compute new LSN (i.e., MSN + credit) */ - credit = (aeth + credit_table[credit]) & IPS_MSN_MASK; + credit = (aeth + credit_table[credit]) & IPATH_MSN_MASK; if (ipath_cmp24(credit, qp->s_lsn) > 0) qp->s_lsn = credit; } diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_rc.c --- a/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_rc.c Thu Jun 29 14:33:26 2006 -0700 @@ -32,7 +32,7 @@ */ #include "ipath_verbs.h" -#include "ips_common.h" +#include "ipath_common.h" /* cut down ridiculously long IB macro names */ #define OP(x) IB_OPCODE_RC_##x @@ -49,7 +49,7 @@ static void ipath_init_restart(struct ip struct ipath_ibdev *dev; u32 len; - len = ((qp->s_psn - wqe->psn) & IPS_PSN_MASK) * + len = ((qp->s_psn - wqe->psn) & IPATH_PSN_MASK) * ib_mtu_enum_to_int(qp->path_mtu); qp->s_sge.sge = wqe->sg_list[0]; qp->s_sge.sg_list = wqe->sg_list + 1; @@ -159,9 +159,9 @@ u32 ipath_make_rc_ack(struct ipath_qp *q qp->s_ack_state = OP(RDMA_READ_RESPONSE_LAST); bth0 = OP(ACKNOWLEDGE) << 24; if (qp->s_nak_state) - ohdr->u.aeth = cpu_to_be32((qp->r_msn & IPS_MSN_MASK) | + ohdr->u.aeth = cpu_to_be32((qp->r_msn & IPATH_MSN_MASK) | (qp->s_nak_state << - IPS_AETH_CREDIT_SHIFT)); + IPATH_AETH_CREDIT_SHIFT)); else ohdr->u.aeth = ipath_compute_aeth(qp); hwords++; @@ -361,7 +361,7 @@ int ipath_make_rc_req(struct ipath_qp *q if (qp->s_tail >= qp->s_size) qp->s_tail = 0; } - bth2 |= qp->s_psn++ & IPS_PSN_MASK; + bth2 |= qp->s_psn++ & IPATH_PSN_MASK; if ((int)(qp->s_psn - qp->s_next_psn) > 0) qp->s_next_psn = qp->s_psn; /* @@ -387,7 +387,7 @@ int ipath_make_rc_req(struct ipath_qp *q qp->s_state = OP(SEND_MIDDLE); /* FALLTHROUGH */ case OP(SEND_MIDDLE): - bth2 = qp->s_psn++ & IPS_PSN_MASK; + bth2 = qp->s_psn++ & IPATH_PSN_MASK; if ((int)(qp->s_psn - qp->s_next_psn) > 0) qp->s_next_psn 
= qp->s_psn; ss = &qp->s_sge; @@ -429,7 +429,7 @@ int ipath_make_rc_req(struct ipath_qp *q qp->s_state = OP(RDMA_WRITE_MIDDLE); /* FALLTHROUGH */ case OP(RDMA_WRITE_MIDDLE): - bth2 = qp->s_psn++ & IPS_PSN_MASK; + bth2 = qp->s_psn++ & IPATH_PSN_MASK; if ((int)(qp->s_psn - qp->s_next_psn) > 0) qp->s_next_psn = qp->s_psn; ss = &qp->s_sge; @@ -466,7 +466,7 @@ int ipath_make_rc_req(struct ipath_qp *q * See ipath_restart_rc(). */ ipath_init_restart(qp, wqe); - len = ((qp->s_psn - wqe->psn) & IPS_PSN_MASK) * pmtu; + len = ((qp->s_psn - wqe->psn) & IPATH_PSN_MASK) * pmtu; ohdr->u.rc.reth.vaddr = cpu_to_be64(wqe->wr.wr.rdma.remote_addr + len); ohdr->u.rc.reth.rkey = @@ -474,7 +474,7 @@ int ipath_make_rc_req(struct ipath_qp *q ohdr->u.rc.reth.length = cpu_to_be32(qp->s_len); qp->s_state = OP(RDMA_READ_REQUEST); hwords += sizeof(ohdr->u.rc.reth) / 4; - bth2 = qp->s_psn++ & IPS_PSN_MASK; + bth2 = qp->s_psn++ & IPATH_PSN_MASK; if ((int)(qp->s_psn - qp->s_next_psn) > 0) qp->s_next_psn = qp->s_psn; ss = NULL; @@ -529,7 +529,7 @@ static void send_rc_ack(struct ipath_qp /* Construct the header. */ ohdr = &hdr.u.oth; - lrh0 = IPS_LRH_BTH; + lrh0 = IPATH_LRH_BTH; /* header size in 32-bit words LRH+BTH+AETH = (8+12+4)/4. 
*/ hwords = 6; if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { @@ -537,14 +537,14 @@ static void send_rc_ack(struct ipath_qp &qp->remote_ah_attr.grh, hwords, 0); ohdr = &hdr.u.l.oth; - lrh0 = IPS_LRH_GRH; + lrh0 = IPATH_LRH_GRH; } /* read pkey_index w/o lock (its atomic) */ bth0 = ipath_layer_get_pkey(dev->dd, qp->s_pkey_index); if (qp->r_nak_state) - ohdr->u.aeth = cpu_to_be32((qp->r_msn & IPS_MSN_MASK) | + ohdr->u.aeth = cpu_to_be32((qp->r_msn & IPATH_MSN_MASK) | (qp->r_nak_state << - IPS_AETH_CREDIT_SHIFT)); + IPATH_AETH_CREDIT_SHIFT)); else ohdr->u.aeth = ipath_compute_aeth(qp); if (qp->r_ack_state >= OP(COMPARE_SWAP)) { @@ -560,7 +560,7 @@ static void send_rc_ack(struct ipath_qp hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->dd)); ohdr->bth[0] = cpu_to_be32(bth0); ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); - ohdr->bth[2] = cpu_to_be32(qp->r_ack_psn & IPS_PSN_MASK); + ohdr->bth[2] = cpu_to_be32(qp->r_ack_psn & IPATH_PSN_MASK); /* * If we can send the ACK, clear the ACK state. @@ -890,8 +890,8 @@ static int do_rc_ack(struct ipath_qp *qp reset_psn(qp, psn); qp->s_rnr_timeout = - ib_ipath_rnr_table[(aeth >> IPS_AETH_CREDIT_SHIFT) & - IPS_AETH_CREDIT_MASK]; + ib_ipath_rnr_table[(aeth >> IPATH_AETH_CREDIT_SHIFT) & + IPATH_AETH_CREDIT_MASK]; ipath_insert_rnr_queue(qp); goto bail; @@ -899,8 +899,8 @@ static int do_rc_ack(struct ipath_qp *qp /* The last valid PSN seen is the previous request's. */ if (qp->s_last != qp->s_tail) qp->s_last_psn = wqe->psn - 1; - switch ((aeth >> IPS_AETH_CREDIT_SHIFT) & - IPS_AETH_CREDIT_MASK) { + switch ((aeth >> IPATH_AETH_CREDIT_SHIFT) & + IPATH_AETH_CREDIT_MASK) { case 0: /* PSN sequence error */ dev->n_seq_naks++; /* @@ -1268,7 +1268,7 @@ static inline int ipath_rc_rcv_error(str * Check for the PSN of the last atomic operation * performed and resend the result if found. 
*/ - if ((psn & IPS_PSN_MASK) != qp->r_atomic_psn) + if ((psn & IPATH_PSN_MASK) != qp->r_atomic_psn) goto done; break; } @@ -1638,7 +1638,7 @@ void ipath_rc_rcv(struct ipath_ibdev *de *(u64 *) qp->r_sge.sge.vaddr = sdata; spin_unlock_irq(&dev->pending_lock); qp->r_msn++; - qp->r_atomic_psn = psn & IPS_PSN_MASK; + qp->r_atomic_psn = psn & IPATH_PSN_MASK; psn |= 1 << 31; break; } diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_ruc.c --- a/drivers/infiniband/hw/ipath/ipath_ruc.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c Thu Jun 29 14:33:26 2006 -0700 @@ -32,7 +32,7 @@ */ #include "ipath_verbs.h" -#include "ips_common.h" +#include "ipath_common.h" /* * Convert the AETH RNR timeout code into the number of milliseconds. @@ -632,7 +632,7 @@ again: /* Sending responses has higher priority over sending requests. */ if (qp->s_ack_state != IB_OPCODE_RC_ACKNOWLEDGE && (bth0 = ipath_make_rc_ack(qp, ohdr, pmtu)) != 0) - bth2 = qp->s_ack_psn++ & IPS_PSN_MASK; + bth2 = qp->s_ack_psn++ & IPATH_PSN_MASK; else if (!((qp->ibqp.qp_type == IB_QPT_RC) ? ipath_make_rc_req(qp, ohdr, pmtu, &bth0, &bth2) : ipath_make_uc_req(qp, ohdr, pmtu, &bth0, &bth2))) { @@ -651,12 +651,12 @@ again: /* Construct the header. 
*/ extra_bytes = (4 - qp->s_cur_size) & 3; nwords = (qp->s_cur_size + extra_bytes) >> 2; - lrh0 = IPS_LRH_BTH; + lrh0 = IPATH_LRH_BTH; if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { qp->s_hdrwords += ipath_make_grh(dev, &qp->s_hdr.u.l.grh, &qp->remote_ah_attr.grh, qp->s_hdrwords, nwords); - lrh0 = IPS_LRH_GRH; + lrh0 = IPATH_LRH_GRH; } lrh0 |= qp->remote_ah_attr.sl << 4; qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_sysfs.c --- a/drivers/infiniband/hw/ipath/ipath_sysfs.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c Thu Jun 29 14:33:26 2006 -0700 @@ -35,8 +35,8 @@ #include #include "ipath_kernel.h" -#include "ips_common.h" #include "ipath_layer.h" +#include "ipath_common.h" /** * ipath_parse_ushort - parse an unsigned short value in an arbitrary base @@ -187,7 +187,7 @@ static ssize_t store_lid(struct device * if (ret < 0) goto invalid; - if (lid == 0 || lid >= IPS_MULTICAST_LID_BASE) { + if (lid == 0 || lid >= IPATH_MULTICAST_LID_BASE) { ret = -EINVAL; goto invalid; } @@ -221,7 +221,7 @@ static ssize_t store_mlid(struct device int ret; ret = ipath_parse_ushort(buf, &mlid); - if (ret < 0 || mlid < IPS_MULTICAST_LID_BASE) + if (ret < 0 || mlid < IPATH_MULTICAST_LID_BASE) goto invalid; unit = dd->ipath_unit; diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_uc.c --- a/drivers/infiniband/hw/ipath/ipath_uc.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_uc.c Thu Jun 29 14:33:26 2006 -0700 @@ -32,7 +32,7 @@ */ #include "ipath_verbs.h" -#include "ips_common.h" +#include "ipath_common.h" /* cut down ridiculously long IB macro names */ #define OP(x) IB_OPCODE_UC_##x @@ -213,7 +213,7 @@ int ipath_make_uc_req(struct ipath_qp *q qp->s_cur_sge = &qp->s_sge; qp->s_cur_size = len; *bth0p = bth0 | (qp->s_state << 24); - *bth2p = qp->s_next_psn++ & IPS_PSN_MASK; + *bth2p = qp->s_next_psn++ & IPATH_PSN_MASK; return 1; done: diff -r 
31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_ud.c --- a/drivers/infiniband/hw/ipath/ipath_ud.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_ud.c Thu Jun 29 14:33:26 2006 -0700 @@ -34,7 +34,7 @@ #include #include "ipath_verbs.h" -#include "ips_common.h" +#include "ipath_common.h" /** * ipath_ud_loopback - handle send on loopback QPs @@ -289,8 +289,8 @@ int ipath_post_ud_send(struct ipath_qp * ret = -EINVAL; goto bail; } - if (ah_attr->dlid >= IPS_MULTICAST_LID_BASE) { - if (ah_attr->dlid != IPS_PERMISSIVE_LID) + if (ah_attr->dlid >= IPATH_MULTICAST_LID_BASE) { + if (ah_attr->dlid != IPATH_PERMISSIVE_LID) dev->n_multicast_xmit++; else dev->n_unicast_xmit++; @@ -310,7 +310,7 @@ int ipath_post_ud_send(struct ipath_qp * if (ah_attr->ah_flags & IB_AH_GRH) { /* Header size in 32-bit words. */ hwords = 17; - lrh0 = IPS_LRH_GRH; + lrh0 = IPATH_LRH_GRH; ohdr = &qp->s_hdr.u.l.oth; qp->s_hdr.u.l.grh.version_tclass_flow = cpu_to_be32((6 << 28) | @@ -336,7 +336,7 @@ int ipath_post_ud_send(struct ipath_qp * } else { /* Header size in 32-bit words. */ hwords = 7; - lrh0 = IPS_LRH_BTH; + lrh0 = IPATH_LRH_BTH; ohdr = &qp->s_hdr.u.oth; } if (wr->opcode == IB_WR_SEND_WITH_IMM) { @@ -367,18 +367,18 @@ int ipath_post_ud_send(struct ipath_qp * if (wr->send_flags & IB_SEND_SOLICITED) bth0 |= 1 << 23; bth0 |= extra_bytes << 20; - bth0 |= qp->ibqp.qp_type == IB_QPT_SMI ? IPS_DEFAULT_P_KEY : + bth0 |= qp->ibqp.qp_type == IB_QPT_SMI ? IPATH_DEFAULT_P_KEY : ipath_layer_get_pkey(dev->dd, qp->s_pkey_index); ohdr->bth[0] = cpu_to_be32(bth0); /* * Use the multicast QP if the destination LID is a multicast LID. */ - ohdr->bth[1] = ah_attr->dlid >= IPS_MULTICAST_LID_BASE && - ah_attr->dlid != IPS_PERMISSIVE_LID ? - __constant_cpu_to_be32(IPS_MULTICAST_QPN) : + ohdr->bth[1] = ah_attr->dlid >= IPATH_MULTICAST_LID_BASE && + ah_attr->dlid != IPATH_PERMISSIVE_LID ? 
+ __constant_cpu_to_be32(IPATH_MULTICAST_QPN) : cpu_to_be32(wr->wr.ud.remote_qpn); /* XXX Could lose a PSN count but not worth locking */ - ohdr->bth[2] = cpu_to_be32(qp->s_next_psn++ & IPS_PSN_MASK); + ohdr->bth[2] = cpu_to_be32(qp->s_next_psn++ & IPATH_PSN_MASK); /* * Qkeys with the high order bit set mean use the * qkey from the QP context instead of the WR (see 10.2.5). @@ -469,7 +469,7 @@ void ipath_ud_rcv(struct ipath_ibdev *de src_qp = be32_to_cpu(ohdr->u.ud.deth[1]); } } - src_qp &= IPS_QPN_MASK; + src_qp &= IPATH_QPN_MASK; /* * Check that the permissive LID is only used on QP0 @@ -627,7 +627,7 @@ void ipath_ud_rcv(struct ipath_ibdev *de /* * Save the LMC lower bits if the destination LID is a unicast LID. */ - wc.dlid_path_bits = dlid >= IPS_MULTICAST_LID_BASE ? 0 : + wc.dlid_path_bits = dlid >= IPATH_MULTICAST_LID_BASE ? 0 : dlid & ((1 << (dev->mkeyprot_resv_lmc & 7)) - 1); /* Signal completion event if the solicited bit is set. */ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:26 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Jun 29 14:33:26 2006 -0700 @@ -37,7 +37,7 @@ #include "ipath_kernel.h" #include "ipath_verbs.h" -#include "ips_common.h" +#include "ipath_common.h" /* Not static, because we don't want the compiler removing it */ const char ipath_verbs_version[] = "ipath_verbs " IPATH_IDSTR; @@ -429,7 +429,7 @@ static void ipath_ib_rcv(void *arg, void /* Check for a valid destination LID (see ch. 7.11.1). 
*/ lid = be16_to_cpu(hdr->lrh[1]); - if (lid < IPS_MULTICAST_LID_BASE) { + if (lid < IPATH_MULTICAST_LID_BASE) { lid &= ~((1 << (dev->mkeyprot_resv_lmc & 7)) - 1); if (unlikely(lid != ipath_layer_get_lid(dev->dd))) { dev->rcv_errors++; @@ -439,9 +439,9 @@ static void ipath_ib_rcv(void *arg, void /* Check for GRH */ lnh = be16_to_cpu(hdr->lrh[0]) & 3; - if (lnh == IPS_LRH_BTH) + if (lnh == IPATH_LRH_BTH) ohdr = &hdr->u.oth; - else if (lnh == IPS_LRH_GRH) + else if (lnh == IPATH_LRH_GRH) ohdr = &hdr->u.l.oth; else { dev->rcv_errors++; @@ -453,8 +453,8 @@ static void ipath_ib_rcv(void *arg, void dev->opstats[opcode].n_packets++; /* Get the destination QP number. */ - qp_num = be32_to_cpu(ohdr->bth[1]) & IPS_QPN_MASK; - if (qp_num == IPS_MULTICAST_QPN) { + qp_num = be32_to_cpu(ohdr->bth[1]) & IPATH_QPN_MASK; + if (qp_num == IPATH_MULTICAST_QPN) { struct ipath_mcast *mcast; struct ipath_mcast_qp *p; @@ -465,7 +465,7 @@ static void ipath_ib_rcv(void *arg, void } dev->n_multicast_rcv++; list_for_each_entry_rcu(p, &mcast->qp_list, list) - ipath_qp_rcv(dev, hdr, lnh == IPS_LRH_GRH, data, + ipath_qp_rcv(dev, hdr, lnh == IPATH_LRH_GRH, data, tlen, p->qp); /* * Notify ipath_multicast_detach() if it is waiting for us @@ -477,7 +477,7 @@ static void ipath_ib_rcv(void *arg, void qp = ipath_lookup_qpn(&dev->qp_table, qp_num); if (qp) { dev->n_unicast_rcv++; - ipath_qp_rcv(dev, hdr, lnh == IPS_LRH_GRH, data, + ipath_qp_rcv(dev, hdr, lnh == IPATH_LRH_GRH, data, tlen, qp); /* * Notify ipath_destroy_qp() if it is waiting @@ -860,8 +860,8 @@ static struct ib_ah *ipath_create_ah(str } /* A multicast address requires a GRH (see ch. 8.4.1). 
*/ - if (ah_attr->dlid >= IPS_MULTICAST_LID_BASE && - ah_attr->dlid != IPS_PERMISSIVE_LID && + if (ah_attr->dlid >= IPATH_MULTICAST_LID_BASE && + ah_attr->dlid != IPATH_PERMISSIVE_LID && !(ah_attr->ah_flags & IB_AH_GRH)) { ret = ERR_PTR(-EINVAL); goto bail; diff -r 31c382d8210a -r 2a721e1f490b drivers/infiniband/hw/ipath/ips_common.h --- a/drivers/infiniband/hw/ipath/ips_common.h Thu Jun 29 14:33:26 2006 -0700 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,264 +0,0 @@ -#ifndef IPS_COMMON_H -#define IPS_COMMON_H -/* - * Copyright (c) 2006 QLogic, Inc. All rights reserved. - * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- */ - -#include "ipath_common.h" - -struct ipath_header { - /* - * Version - 4 bits, Port - 4 bits, TID - 10 bits and Offset - - * 14 bits before ECO change ~28 Dec 03. After that, Vers 4, - * Port 3, TID 11, offset 14. - */ - __le32 ver_port_tid_offset; - __le16 chksum; - __le16 pkt_flags; -}; - -struct ips_message_header { - __be16 lrh[4]; - __be32 bth[3]; - /* fields below this point are in host byte order */ - struct ipath_header iph; - __u8 sub_opcode; - __u8 flags; - __u16 src_rank; - /* 24 bits. The upper 8 bit is available for other use */ - union { - struct { - unsigned ack_seq_num:24; - unsigned port:4; - unsigned unused:4; - }; - __u32 ack_seq_num_org; - }; - __u8 expected_tid_session_id; - __u8 tinylen; /* to aid MPI */ - union { - __u16 tag; /* to aid MPI */ - __u16 mqhdr; /* for PSM MQ */ - }; - union { - __u32 mpi[4]; /* to aid MPI */ - __u32 data[4]; - __u64 mq[2]; /* for PSM MQ */ - struct { - __u16 mtu; - __u8 major_ver; - __u8 minor_ver; - __u32 not_used; //free - __u32 run_id; - __u32 client_ver; - }; - }; -}; - -struct ether_header { - __be16 lrh[4]; - __be32 bth[3]; - struct ipath_header iph; - __u8 sub_opcode; - __u8 cmd; - __be16 lid; - __u16 mac[3]; - __u8 frag_num; - __u8 seq_num; - __le32 len; - /* MUST be of word size due to PIO write requirements */ - __le32 csum; - __le16 csum_offset; - __le16 flags; - __u16 first_2_bytes; - __u8 unused[2]; /* currently unused */ -}; - -/* - * The PIO buffer used for sending infinipath messages must only be written - * in 32-bit words, all the data must be written, and no writes can occur - * after the last word is written (which transfers "ownership" of the buffer - * to the chip and triggers the message to be sent). - * Since the Linux sk_buff structure can be recursive, non-aligned, and - * any number of bytes in each segment, we use the following structure - * to keep information about the overall state of the copy operation. 
- * This is used to save the information needed to store the checksum - * in the right place before sending the last word to the hardware and - * to buffer the last 0-3 bytes of non-word sized segments. - */ -struct copy_data_s { - struct ether_header *hdr; - /* addr of PIO buf to write csum to */ - __u32 __iomem *csum_pio; - __u32 __iomem *to; /* addr of PIO buf to write data to */ - __u32 device; /* which device to allocate PIO bufs from */ - __s32 error; /* set if there is an error. */ - __s32 extra; /* amount of data saved in u.buf below */ - __u32 len; /* total length to send in bytes */ - __u32 flen; /* frament length in words */ - __u32 csum; /* partial IP checksum */ - __u32 pos; /* position for partial checksum */ - __u32 offset; /* offset to where data currently starts */ - __s32 checksum_calc; /* set to 1 when csum has been calculated */ - struct sk_buff *skb; - union { - __u32 w; - __u8 buf[4]; - } u; -}; - -/* IB - LRH header consts */ -#define IPS_LRH_GRH 0x0003 /* 1. word of IB LRH - next header: GRH */ -#define IPS_LRH_BTH 0x0002 /* 1. 
word of IB LRH - next header: BTH */ - -#define IPS_OFFSET 0 - -/* - * defines the cut-off point between the header queue and eager/expected - * TID queue - */ -#define NUM_OF_EXTRA_WORDS_IN_HEADER_QUEUE \ - ((sizeof(struct ips_message_header) - \ - offsetof(struct ips_message_header, iph)) >> 2) - -/* OpCodes */ -#define OPCODE_IPS 0xC0 -#define OPCODE_ITH4X 0xC1 - -/* OpCode 30 is use by stand-alone test programs */ -#define OPCODE_RAW_DATA 0xDE -/* last OpCode (31) is reserved for test */ -#define OPCODE_TEST 0xDF - -/* sub OpCodes - ips */ -#define OPCODE_SEQ_DATA 0x01 -#define OPCODE_SEQ_CTRL 0x02 - -#define OPCODE_SEQ_MQ_DATA 0x03 -#define OPCODE_SEQ_MQ_CTRL 0x04 - -#define OPCODE_ACK 0x10 -#define OPCODE_NAK 0x11 - -#define OPCODE_ERR_CHK 0x20 -#define OPCODE_ERR_CHK_PLS 0x21 - -#define OPCODE_STARTUP 0x30 -#define OPCODE_STARTUP_ACK 0x31 -#define OPCODE_STARTUP_NAK 0x32 - -#define OPCODE_STARTUP_EXT 0x34 -#define OPCODE_STARTUP_ACK_EXT 0x35 -#define OPCODE_STARTUP_NAK_EXT 0x36 - -#define OPCODE_TIDS_RELEASE 0x40 -#define OPCODE_TIDS_RELEASE_CONFIRM 0x41 - -#define OPCODE_CLOSE 0x50 -#define OPCODE_CLOSE_ACK 0x51 -/* - * like OPCODE_CLOSE, but no complaint if other side has already closed. - * Used when doing abort(), MPI_Abort(), etc. - */ -#define OPCODE_ABORT 0x52 - -/* sub OpCodes - ith4x */ -#define OPCODE_ENCAP 0x81 -#define OPCODE_LID_ARP 0x82 - -/* Receive Header Queue: receive type (from infinipath) */ -#define RCVHQ_RCV_TYPE_EXPECTED 0 -#define RCVHQ_RCV_TYPE_EAGER 1 -#define RCVHQ_RCV_TYPE_NON_KD 2 -#define RCVHQ_RCV_TYPE_ERROR 3 - -/* misc. 
*/ -#define SIZE_OF_CRC 1 - -#define EAGER_TID_ID INFINIPATH_I_TID_MASK - -#define IPS_DEFAULT_P_KEY 0xFFFF - -#define IPS_PERMISSIVE_LID 0xFFFF -#define IPS_MULTICAST_LID_BASE 0xC000 - -#define IPS_AETH_CREDIT_SHIFT 24 -#define IPS_AETH_CREDIT_MASK 0x1F -#define IPS_AETH_CREDIT_INVAL 0x1F - -#define IPS_PSN_MASK 0xFFFFFF -#define IPS_MSN_MASK 0xFFFFFF -#define IPS_QPN_MASK 0xFFFFFF -#define IPS_MULTICAST_QPN 0xFFFFFF - -/* functions for extracting fields from rcvhdrq entries */ -static inline __u32 ips_get_hdr_err_flags(const __le32 * rbuf) -{ - return __le32_to_cpu(rbuf[1]); -} - -static inline __u32 ips_get_index(const __le32 * rbuf) -{ - return (__le32_to_cpu(rbuf[0]) >> INFINIPATH_RHF_EGRINDEX_SHIFT) - & INFINIPATH_RHF_EGRINDEX_MASK; -} - -static inline __u32 ips_get_rcv_type(const __le32 * rbuf) -{ - return (__le32_to_cpu(rbuf[0]) >> INFINIPATH_RHF_RCVTYPE_SHIFT) - & INFINIPATH_RHF_RCVTYPE_MASK; -} - -static inline __u32 ips_get_length_in_bytes(const __le32 * rbuf) -{ - return ((__le32_to_cpu(rbuf[0]) >> INFINIPATH_RHF_LENGTH_SHIFT) - & INFINIPATH_RHF_LENGTH_MASK) << 2; -} - -static inline void *ips_get_first_protocol_header(const __u32 * rbuf) -{ - return (void *)&rbuf[2]; -} - -static inline struct ips_message_header *ips_get_ips_header(const __u32 * - rbuf) -{ - return (struct ips_message_header *)&rbuf[2]; -} - -static inline __u32 ips_get_ipath_ver(__le32 hdrword) -{ - return (__le32_to_cpu(hdrword) >> INFINIPATH_I_VERS_SHIFT) - & INFINIPATH_I_VERS_MASK; -} - -#endif /* IPS_COMMON_H */ From davem at davemloft.net Thu Jun 29 14:50:27 2006 From: davem at davemloft.net (David Miller) Date: Thu, 29 Jun 2006 14:50:27 -0700 (PDT) Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <1b00209ef20a0e7893d8.1151617290@eng-12.pathscale.com> References: <1b00209ef20a0e7893d8.1151617290@eng-12.pathscale.com> Message-ID: <20060629.145027.41636491.davem@davemloft.net> From: Bryan 
O'Sullivan Date: Thu, 29 Jun 2006 14:41:30 -0700 > +/* > + * Copy data. Try not to pollute the dcache with the source data, > + * because we won't be reading it again. > + */ > +#if defined(CONFIG_X86_64) > +void *ipath_memcpy_nc(void *dest, const void *src, size_t n); > +#else > +#define ipath_memcpy_nc(dest, src, n) memcpy(dest, src, n) > +#endif A facility like this doesn't belong in some arbitrary driver layer. It belongs as a generic facility the whole kernel could make use of. Please stop polluting the infiniband drivers with Opteron crap. From davem at davemloft.net Thu Jun 29 14:53:19 2006 From: davem at davemloft.net (David Miller) Date: Thu, 29 Jun 2006 14:53:19 -0700 (PDT) Subject: [openib-general] [PATCH 38 of 39] IB/ipath - More changes to support InfiniPath on PowerPC 970 systems In-Reply-To: References: Message-ID: <20060629.145319.71091846.davem@davemloft.net> From: Bryan O'Sullivan Date: Thu, 29 Jun 2006 14:41:29 -0700 > ipath_core-$(CONFIG_X86_64) += ipath_wc_x86_64.o > +ipath_core-$(CONFIG_PPC64) += ipath_wc_ppc64.o Again, don't put these kinds of cpu specific functions into the infiniband driver. They are potentially globally useful, not something only Infiniband might want to do. From bos at pathscale.com Thu Jun 29 14:59:37 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 14:59:37 -0700 Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <20060629.145027.41636491.davem@davemloft.net> References: <1b00209ef20a0e7893d8.1151617290@eng-12.pathscale.com> <20060629.145027.41636491.davem@davemloft.net> Message-ID: <1151618377.10886.23.camel@chalcedony.pathscale.com> On Thu, 2006-06-29 at 14:50 -0700, David Miller wrote: > A facility like this doesn't belong in some arbitrary driver layer. > It belongs as a generic facility the whole kernel could make use > of. It could, indeed. In fact, we had that discussion here before I sent this patch in. 
It presumably wants to live in lib/, and acquire a more generic name. What name will capture the uncached-read-but-cached-write semantics in a useful fashion? memcpy_nc? References: <20060629.145319.71091846.davem@davemloft.net> Message-ID: <1151618499.10886.26.camel@chalcedony.pathscale.com> On Thu, 2006-06-29 at 14:53 -0700, David Miller wrote: > From: Bryan O'Sullivan > Date: Thu, 29 Jun 2006 14:41:29 -0700 > > > ipath_core-$(CONFIG_X86_64) += ipath_wc_x86_64.o > > +ipath_core-$(CONFIG_PPC64) += ipath_wc_ppc64.o > > Again, don't put these kinds of cpu specific functions > into the infiniband driver. They are potentially globally > useful, not something only Infiniband might want to do. The support for write combining in the kernel is not in a state where that makes any sense at the moment. Also, this is a single-statement function. References: <20060629.145319.71091846.davem@davemloft.net> <1151618499.10886.26.camel@chalcedony.pathscale.com> Message-ID: <20060629.150417.78710870.davem@davemloft.net> From: Bryan O'Sullivan Date: Thu, 29 Jun 2006 15:01:39 -0700 > The support for write combining in the kernel is not in a state where > that makes any sense at the moment. Please fix the generic code if it doesn't provide the facility you need at the moment. Don't shoe horn it into your driver just to make up for that. From davem at davemloft.net Thu Jun 29 15:03:19 2006 From: davem at davemloft.net (David Miller) Date: Thu, 29 Jun 2006 15:03:19 -0700 (PDT) Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <1151618377.10886.23.camel@chalcedony.pathscale.com> References: <1b00209ef20a0e7893d8.1151617290@eng-12.pathscale.com> <20060629.145027.41636491.davem@davemloft.net> <1151618377.10886.23.camel@chalcedony.pathscale.com> Message-ID: <20060629.150319.104035601.davem@davemloft.net> From: Bryan O'Sullivan Date: Thu, 29 Jun 2006 14:59:37 -0700 > It could, indeed. 
In fact, we had that discussion here before I sent > this patch in. It presumably wants to live in lib/, and acquire a more > generic name. What name will capture the uncached-read-but-cached-write > semantics in a useful fashion? memcpy_nc? I'm not good with names :-) Note that there also might be cases where using such a memcpy variant might be the wrong thing to do. For example, for a very tightly coupled CMT cpu implementation which has the memory controller, L2 cache, PCI controller, etc. all on the same die and the PCI controller makes use of the L2 cache just like the cpu threads do, using this kind of memcpy would always be the wrong thing to do. From afriedle at indiana.edu Thu Jun 29 15:43:47 2006 From: afriedle at indiana.edu (Andrew Friedley) Date: Thu, 29 Jun 2006 15:43:47 -0700 Subject: [openib-general] thread safety Message-ID: <44A457A3.60001@indiana.edu> Hello, I'm working with Matt Leininger this summer on developing support for UD in Open MPI, and eventually multicast collectives - he suggested I ask my question here. Is there any documentation available on thread safety (i.e., what is (non-)reentrant) with the openib verbs? I've had trouble finding anything more than vague hints with google. Thanks, Andrew From venkatesh.babu at 3leafnetworks.com Thu Jun 29 16:51:17 2006 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Thu, 29 Jun 2006 16:51:17 -0700 Subject: [openib-general] Reloading of partition policy Message-ID: <44A46775.1000507@3leafnetworks.com> I was reviewing partition-config.txt and OpenSM_PKey_Mgr.txt and had the following comment - If we need to add/delete a node to/from a partition we need to update the file /etc/osm-partitions.txt and restart the OpenSM. According to the docs there no way we can do this without restarting the OpenSM. It would be useful to add new feature to reload the partition table after making the changes. 
VBabu From bos at pathscale.com Thu Jun 29 16:34:23 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Jun 2006 16:34:23 -0700 Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <20060629.150319.104035601.davem@davemloft.net> References: <1b00209ef20a0e7893d8.1151617290@eng-12.pathscale.com> <20060629.145027.41636491.davem@davemloft.net> <1151618377.10886.23.camel@chalcedony.pathscale.com> <20060629.150319.104035601.davem@davemloft.net> Message-ID: <1151624063.10886.34.camel@chalcedony.pathscale.com> On Thu, 2006-06-29 at 15:03 -0700, David Miller wrote: > I'm not good with names :-) Heh. I'll call it memcpy_nc for now, then, and people can retch all over the name as they please when I submit a more suitably generic patch. > Note that there also might be cases where using such a memcpy > variant might be the wrong thing to do. For example, for a very > tightly coupled CMT cpu implementation which has the memory controller, > L2 cache, PCI controller, etc. all on the same die and the PCI controller > makes use of the L2 cache just like the cpu threads do, using this > kind of memcpy would always be the wrong thing to do. I'm not quite following you, though I assume you're referring to Niagara or Rock :-) Are you saying a memcpy_nc would do worse than plain memcpy, or worse than some other memcpy-like routine? References: <44A457A3.60001@indiana.edu> Message-ID: <44A465FC.4070803@ichips.intel.com> Andrew Friedley wrote: > I'm working with Matt Leininger this summer on developing support for UD > in Open MPI, and eventually multicast collectives - he suggested I ask > my question here. > > Is there any documentation available on thread safety (i.e., what is > (non-)reentrant) with the openib verbs? I've had trouble finding > anything more than vague hints with google. Some kernel information is available in gen2/trunk/src/linux-kernel/docs. See core_locking.txt. 
Some of the information applies to userspace as well, such as all verbs being fully reentrant. - Sean From davem at davemloft.net Thu Jun 29 16:46:23 2006 From: davem at davemloft.net (David Miller) Date: Thu, 29 Jun 2006 16:46:23 -0700 (PDT) Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <1151624063.10886.34.camel@chalcedony.pathscale.com> References: <1151618377.10886.23.camel@chalcedony.pathscale.com> <20060629.150319.104035601.davem@davemloft.net> <1151624063.10886.34.camel@chalcedony.pathscale.com> Message-ID: <20060629.164623.59469884.davem@davemloft.net> From: Bryan O'Sullivan Date: Thu, 29 Jun 2006 16:34:23 -0700 > I'm not quite following you, though I assume you're referring to Niagara > or Rock :-) Are you saying a memcpy_nc would do worse than plain > memcpy, or worse than some other memcpy-like routine? It would do worse than memcpy. If you bypass the L2 cache, it's pointless because the next agent (PCI controller, CPU thread, etc.) is going to need the data in the L2 cache. It's better in that kind of setup to eat the L2 cache miss overhead in memcpy since memcpy can usually prefetch and store buffer in order to absorb some of the L2 miss costs. 
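The "uncached-read-but-cached-write" copy being debated in this thread can be sketched in portable userspace C using GCC's `__builtin_prefetch` with locality 0 (the non-temporal hint). This is purely illustrative — `memcpy_nc_sketch` is a hypothetical name, and it is not the hand-tuned x86_64 assembly the ipath driver actually shipped:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative sketch only: copy while hinting to the CPU that the
 * source has no temporal locality (read once, don't keep it cached),
 * while the destination is written normally and so lands in cache.
 * The third argument of __builtin_prefetch (locality 0) is the
 * non-temporal hint; prefetching past the end of the buffer is safe
 * because prefetch is only a hint and never faults.
 */
static void *memcpy_nc_sketch(void *dest, const void *src, size_t n)
{
	const char *s = src;
	char *d = dest;
	size_t i = 0;

	for (; i + 64 <= n; i += 64) {
		/* hint: fetch the next chunk with no temporal locality */
		__builtin_prefetch(s + i + 64, 0, 0);
		memcpy(d + i, s + i, 64);
	}
	memcpy(d + i, s + i, n - i);	/* 0-63 byte tail */
	return dest;
}
```

On the x86_64 parts of the era the real win came from prefetchnta and streaming stores; on a part whose I/O controller shares the L2, as David Miller notes above, a plain memcpy can be the better choice.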
From ralphc at pathscale.com Thu Jun 29 16:55:41 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Thu, 29 Jun 2006 16:55:41 -0700 Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <20060629.164623.59469884.davem@davemloft.net> References: <1151618377.10886.23.camel@chalcedony.pathscale.com> <20060629.150319.104035601.davem@davemloft.net> <1151624063.10886.34.camel@chalcedony.pathscale.com> <20060629.164623.59469884.davem@davemloft.net> Message-ID: <1151625341.4572.133.camel@brick.pathscale.com> This is intended to be an architecture specific function so if the CPU does support HW dma to the CPU's L2 cache, the architecture specific version of memcpy_nc() would not replace the default definition which maps memcpy_nc() to memcpy(). For CPUs like the vast majority currently available, there is a performance benefit by not reading data into the cache that won't be read a second time. On Thu, 2006-06-29 at 16:46 -0700, David Miller wrote: > From: Bryan O'Sullivan > Date: Thu, 29 Jun 2006 16:34:23 -0700 > > > I'm not quite following you, though I assume you're referring to Niagara > > or Rock :-) Are you saying a memcpy_nc would do worse than plain > > memcpy, or worse than some other memcpy-like routine? > > It would do worse than memcpy. > > If you bypass the L2 cache, it's pointless because the next > agent (PCI controller, CPU thread, etc.) is going to need the > data in the L2 cache. > > It's better in that kind of setup to eat the L2 cache miss overhead in > memcpy since memcpy can usually prefetch and store buffer in order to > absorb some of the L2 miss costs. 
> > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From akpm at osdl.org Thu Jun 29 17:02:55 2006 From: akpm at osdl.org (Andrew Morton) Date: Thu, 29 Jun 2006 17:02:55 -0700 Subject: [openib-general] [PATCH 17 of 39] IB/ipath - use more appropriate gfp flags In-Reply-To: <9d943b828776136a2bb7.1151617268@eng-12.pathscale.com> References: <9d943b828776136a2bb7.1151617268@eng-12.pathscale.com> Message-ID: <20060629170255.028d7a90.akpm@osdl.org> "Bryan O'Sullivan" wrote: > > diff -r fd5e733f02ac -r 9d943b828776 drivers/infiniband/hw/ipath/ipath_file_ops.c > --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:25 2006 -0700 > +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c Thu Jun 29 14:33:25 2006 -0700 > @@ -705,6 +705,15 @@ static int ipath_create_user_egr(struct > unsigned e, egrcnt, alloced, egrperchunk, chunk, egrsize, egroff; > size_t size; > int ret; > + gfp_t gfp_flags; > + > + /* > + * GFP_USER, but without GFP_FS, so buffer cache can be > + * coalesced (we hope); otherwise, even at order 4, > + * heavy filesystem activity makes these fail, and we can > + * use compound pages. > + */ > + gfp_flags = __GFP_WAIT | __GFP_IO | __GFP_COMP; Yes, GFP_NOFS|_GFP_COMP is reasonably strong - we can do swapout but not file pageout. I expect you'll find that a full GFP_KERNEL is OK here. The ~__GFP_FS is used to prevent the vm scanner from calling into ->writepage() and getting stuck on locks which the __alloc_pages() caller already holds. But ipathfs doesn't even implement ->writepage(), so I don't see any problem with setting __GFP_FS. If you're getting into trouble there then I'd recommend giving it a try - it will make memory reclaim more successful, especially with ext3, where a ->writepage often cleans the page synchronously without doing any IO. 
That being said, order-4 allocations will be fairly reliably unreliable. From akpm at osdl.org Thu Jun 29 17:07:11 2006 From: akpm at osdl.org (Andrew Morton) Date: Thu, 29 Jun 2006 17:07:11 -0700 Subject: [openib-general] [PATCH 28 of 39] IB/ipath - Fixes a bug where our delay for EEPROM no longer works due to compiler reordering In-Reply-To: <5f3c0b2d446d78e3327f.1151617279@eng-12.pathscale.com> References: <5f3c0b2d446d78e3327f.1151617279@eng-12.pathscale.com> Message-ID: <20060629170711.757a97d2.akpm@osdl.org> "Bryan O'Sullivan" wrote: > > The mb() prevents the compiler from reordering on this function, with some versions > of gcc and -Os optimization. The result is random failures in the EEPROM read > without this change. > > > Signed-off-by: Dave Olson > Signed-off-by: Bryan O'Sullivan > > diff -r 7d22a8963bda -r 5f3c0b2d446d drivers/infiniband/hw/ipath/ipath_eeprom.c > --- a/drivers/infiniband/hw/ipath/ipath_eeprom.c Thu Jun 29 14:33:26 2006 -0700 > +++ b/drivers/infiniband/hw/ipath/ipath_eeprom.c Thu Jun 29 14:33:26 2006 -0700 > @@ -186,6 +186,7 @@ bail: > */ > static void i2c_wait_for_writes(struct ipath_devdata *dd) > { > + mb(); > (void)ipath_read_kreg32(dd, dd->ipath_kregs->kr_scratch); > } > That's a bit weird. I wouldn't have expected the compiler to muck around with a readl(). From rick.jones2 at hp.com Thu Jun 29 17:28:50 2006 From: rick.jones2 at hp.com (Rick Jones) Date: Thu, 29 Jun 2006 17:28:50 -0700 Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <20060629.164623.59469884.davem@davemloft.net> References: <1151618377.10886.23.camel@chalcedony.pathscale.com> <20060629.150319.104035601.davem@davemloft.net> <1151624063.10886.34.camel@chalcedony.pathscale.com> <20060629.164623.59469884.davem@davemloft.net> Message-ID: <44A47042.8060203@hp.com> > If you bypass the L2 cache, it's pointless because the next > agent (PCI controller, CPU thread, etc.) 
is going to need the > data in the L2 cache. > > It's better in that kind of setup to eat the L2 cache miss overhead in > memcpy since memcpy can usually prefetch and store buffer in order to > absorb some of the L2 miss costs. I thought that most PCI controllers (that is to say the things bridging PCI to the rest of the system) could do prefetching and/or that PCI-X (if not PCI, no idea about PCI-e) cards could issue multiple transactions anyway? rick jones From olson at unixfolk.com Thu Jun 29 17:28:51 2006 From: olson at unixfolk.com (Dave Olson) Date: Thu, 29 Jun 2006 17:28:51 -0700 (PDT) Subject: [openib-general] [PATCH 38 of 39] IB/ipath - More changes to support InfiniPath on PowerPC 970 systems In-Reply-To: References: Message-ID: On Thu, 29 Jun 2006, David Miller wrote: | From: Bryan O'Sullivan | Date: Thu, 29 Jun 2006 14:41:29 -0700 | | > ipath_core-$(CONFIG_X86_64) += ipath_wc_x86_64.o | > +ipath_core-$(CONFIG_PPC64) += ipath_wc_ppc64.o | | Again, don't put these kinds of cpu specific functions | into the infiniband driver. They are potentially globally | useful, not something only Infiniband might want to do. The new code simply sets a flag as to whether instruction-level write barriers need to be used or not; it doesn't contain actual code. The older file (already accepted) does have some setup code, as well as code setting flags, for the reason Bryan mentioned in his reply: this stuff simply doesn't yet exist in a generic form. It's not clear to me that it can ever be made to exist in a generic form that will actually work on multiple architectures (or that there are enough users to be worth trying). We can make the attempt, but so far it's pretty non-generic in its very nature.
Dave Olson olson at unixfolk.com http://www.unixfolk.com/dave From davem at davemloft.net Thu Jun 29 17:32:06 2006 From: davem at davemloft.net (David Miller) Date: Thu, 29 Jun 2006 17:32:06 -0700 (PDT) Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <44A47042.8060203@hp.com> References: <1151624063.10886.34.camel@chalcedony.pathscale.com> <20060629.164623.59469884.davem@davemloft.net> <44A47042.8060203@hp.com> Message-ID: <20060629.173206.48800902.davem@davemloft.net> From: Rick Jones Date: Thu, 29 Jun 2006 17:28:50 -0700 > I thought that most PCI controllers (that is to say the things bridging > PCI to the rest of the system) could do prefetching and/or that PCI-X > (if not PCI, no idea about PCI-e) cards could issue multiple > transactions anyway? People doing deep CMT chips have found out that all of that prefetching and store buffering is unnecessary when everything is so tightly integrated. All of the previous UltraSPARC boxes before Niagara had a streaming cache sitting on the PCI controller. It basically prefetched for reads and collected writes from PCI devices into cacheline sized chunks. The PCI controller in the current Niagara systems has none of that stuff. 
From rick.jones2 at hp.com Thu Jun 29 17:44:05 2006 From: rick.jones2 at hp.com (Rick Jones) Date: Thu, 29 Jun 2006 17:44:05 -0700 Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <20060629.173206.48800902.davem@davemloft.net> References: <1151624063.10886.34.camel@chalcedony.pathscale.com> <20060629.164623.59469884.davem@davemloft.net> <44A47042.8060203@hp.com> <20060629.173206.48800902.davem@davemloft.net> Message-ID: <44A473D5.70809@hp.com> David Miller wrote: > From: Rick Jones > Date: Thu, 29 Jun 2006 17:28:50 -0700 > > >>I thought that most PCI controllers (that is to say the things bridging >>PCI to the rest of the system) could do prefetching and/or that PCI-X >>(if not PCI, no idea about PCI-e) cards could issue multiple >>transactions anyway? > > > People doing deep CMT chips have found out that all of that > prefetching and store buffering is unnecessary when everything is so > tightly integrated. Then is prefetching in memcpy really that important to them? (BTW, besides Sun/Niagara, who else is doing "deep CMT"?) > All of the previous UltraSPARC boxes before Niagara had a > streaming cache sitting on the PCI controller. It basically > prefetched for reads and collected writes from PCI devices > into cacheline sized chunks. > > The PCI controller in the current Niagara systems has none of that > stuff. Relying on PCI-X devices to issue multiple requests then?
rick jones From davem at davemloft.net Thu Jun 29 17:47:38 2006 From: davem at davemloft.net (David Miller) Date: Thu, 29 Jun 2006 17:47:38 -0700 (PDT) Subject: [openib-general] [PATCH 39 of 39] IB/ipath - use streaming copy in RDMA interrupt handler to reduce packet loss In-Reply-To: <44A473D5.70809@hp.com> References: <44A47042.8060203@hp.com> <20060629.173206.48800902.davem@davemloft.net> <44A473D5.70809@hp.com> Message-ID: <20060629.174738.85688575.davem@davemloft.net> From: Rick Jones Date: Thu, 29 Jun 2006 17:44:05 -0700 > Then is prefetching in memcpy really that important to them. Not really, the thread just blocks while waiting for memory. On stores they do a cacheline fill optimization similar to the powerpc. > Relying on PCI-X devices to issue multiple requests then? Perhaps :) From chrisw at sous-sol.org Thu Jun 29 18:39:05 2006 From: chrisw at sous-sol.org (Chris Wright) Date: Thu, 29 Jun 2006 18:39:05 -0700 Subject: [openib-general] [stable] [PATCH -stable] IB/mthca: restore missing PCI registers after reset In-Reply-To: <20060628171428.GF19300@mellanox.co.il> References: <20060628171428.GF19300@mellanox.co.il> Message-ID: <20060630013905.GF11588@sequoia.sous-sol.org> * Michael S. Tsirkin (mst at mellanox.co.il) wrote: > Hello, stable team! > The pull of the following fix was requested by Roland Dreier just a couple of > days before 2.6.17 came out, and so it seems it missed 2.6.17 by a narrow > margin: > > http://lkml.org/lkml/2006/6/13/164 Thanks, queued for the next -stable. 
-chris From halr at voltaire.com Thu Jun 29 20:50:21 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Jun 2006 23:50:21 -0400 Subject: [openib-general] Reloading of partition policy In-Reply-To: <44A46775.1000507@3leafnetworks.com> References: <44A46775.1000507@3leafnetworks.com> Message-ID: <1151639421.4478.746.camel@hal.voltaire.com> On Thu, 2006-06-29 at 19:51, Venkatesh Babu wrote: > I was reviewing partition-config.txt and OpenSM_PKey_Mgr.txt and had the > following comment - > > If we need to add/delete a node to/from a partition we need to update > the file > > /etc/osm-partitions.txt > > and restart the OpenSM. According to the docs there no way we can do > this without restarting the OpenSM. > > It would be useful to add new feature to reload the partition table > after making the changes. Partitions can be deleted and the new partitions enforced via issuing kill -HUP to the OpenSM without restarting now. The document is (already) out of date :-( I will update it shortly. -- Hal > > VBabu > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Fri Jun 30 02:38:29 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Jun 2006 05:38:29 -0400 Subject: [openib-general] [PATCHv2] OpenSM/osm_lid_mgr.c: Support enhanced switch port 0 forLMC > 0 In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30236891D@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30236891D@mtlexch01.mtl.com> Message-ID: <1151660308.4478.14933.camel@hal.voltaire.com> Hi Eitan, On Thu, 2006-06-29 at 15:54, Eitan Zahavi wrote: > Hi Hal, > > I think the check for num lids is so similar it deserves an inline > function. > What do you say? Does the function need to be inlined ? 
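The reload workflow Hal describes, edit the partition file and then signal OpenSM, might look like the fragment below in practice. The partition names, PKeys, and port GUIDs are made-up examples, and the exact path and syntax should be checked against the partition-config.txt shipped with your OpenSM; per Hal's note, kill -HUP <opensm-pid> then re-reads the file without a restart.

```
# /etc/osm-partitions.conf (illustrative only)
# Default partition: all ports are full members, IPoIB enabled
Default=0x7fff, ipoib : ALL=full;

# A restricted partition with two example port GUIDs
Storage=0x8001 : 0x0002c90200001234=full, 0x0002c90200005678=full;
```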
-- Hal > I refer to: > > > + if (p_sw && (p_si = osm_switch_get_si_ptr(p_sw)) && > > + ib_switch_info_is_enhanced_port0(p_si)) > > + { > > + num_lids = lmc_num_lids; > > + } > > + else > > + { > > + num_lids = 1; > > + } > > + } > > > > From halr at voltaire.com Fri Jun 30 03:11:52 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Jun 2006 06:11:52 -0400 Subject: [openib-general] [PATCHv3] OpenSM/osm_lid_mgr.c: Support enhanced switch port 0 for LMC > 0 Message-ID: <1151662311.4478.16276.camel@hal.voltaire.com> OpenSM/osm_lid_mgr.c: Support enhanced switch port 0 for LMC > 0 Base port 0 is constrained to have an LMC of 0 whereas enhanced switch port 0 is not. Support for enhanced switch port 0 is more like CA and router ports in terms of LMC. Signed-off-by: Hal Rosenstock Index: include/opensm/osm_switch.h =================================================================== --- include/opensm/osm_switch.h (revision 8296) +++ include/opensm/osm_switch.h (working copy) @@ -702,6 +702,42 @@ osm_switch_get_si_ptr( * Switch object *********/ +/****f* OpenSM: Switch/osm_switch_is_sp0_enhanced +* NAME +* osm_switch_is_sp0_enhanced +* +* DESCRIPTION +* Returns whether switch port 0 (SP0) is enhanced or base +* +*/ +static inline uint16_t +osm_switch_is_sp0_enhanced( + IN const osm_switch_t* const p_sw ) +{ + ib_switch_info_t *p_si; + + if (p_sw && (p_si = osm_switch_get_si_ptr(p_sw)) && + ib_switch_info_is_enhanced_port0(p_si)) + { + return 1; /* enhanced SP0 */ + } + + return 0; /* base SP 0 */ +} +/* +* PARAMETERS +* p_sw +* [in] Pointer to an osm_switch_t object. +* +* RETURN VALUES +* TRUE if SP0 is enhanced. FALSE otherwise.
+* +* NOTES +* +* SEE ALSO +* Switch object +*********/ + /****f* OpenSM: Switch/osm_switch_get_max_block_id * NAME * osm_switch_get_max_block_id Index: opensm/osm_lid_mgr.c =================================================================== --- opensm/osm_lid_mgr.c (revision 8296) +++ opensm/osm_lid_mgr.c (working copy) @@ -94,6 +94,7 @@ #include #include #include +#include #include #include #include @@ -351,6 +352,7 @@ __osm_lid_mgr_init_sweep( osm_lid_mgr_range_t *p_range = NULL; osm_port_t *p_port; cl_qmap_t *p_port_guid_tbl; + osm_switch_t *p_sw; uint8_t lmc_num_lids = (uint8_t)(1 << p_mgr->p_subn->opt.lmc); uint16_t lmc_mask; uint16_t req_lid, num_lids; @@ -436,7 +438,19 @@ __osm_lid_mgr_init_sweep( IB_NODE_TYPE_SWITCH ) num_lids = lmc_num_lids; else - num_lids = 1; + { + /* Determine if enhanced switch port 0 */ + p_sw = osm_get_switch_by_guid(p_mgr->p_subn, + osm_node_get_node_guid(osm_port_get_parent_node(p_port))); + if (osm_switch_is_sp0_enhanced(p_sw)) + { + num_lids = lmc_num_lids; + } + else + { + num_lids = 1; + } + } if ((num_lids != 1) && (((db_min_lid & lmc_mask) != db_min_lid) || @@ -539,7 +553,17 @@ __osm_lid_mgr_init_sweep( } else { - num_lids = 1; + /* Determine if enhanced switch port 0 */ + p_sw = osm_get_switch_by_guid(p_mgr->p_subn, + osm_node_get_node_guid(osm_port_get_parent_node(p_port))); + if (osm_switch_is_sp0_enhanced(p_sw)) + { + num_lids = lmc_num_lids; + } + else + { + num_lids = 1; + } } /* Make sure the lid is aligned */ @@ -798,6 +822,7 @@ __osm_lid_mgr_get_port_lid( uint8_t num_lids = (1 << p_mgr->p_subn->opt.lmc); int lid_changed = 0; uint16_t lmc_mask; + osm_switch_t *p_sw; OSM_LOG_ENTER( p_mgr->p_log, __osm_lid_mgr_get_port_lid ); @@ -809,10 +834,18 @@ __osm_lid_mgr_get_port_lid( /* get the lid from the guid2lid */ guid = cl_ntoh64( osm_port_get_guid( p_port ) ); - /* if the port is a switch then we only need one lid */ + /* if the port is a switch with base switch port 0 then we only need one lid */ if( osm_node_get_type( 
osm_port_get_parent_node( p_port ) ) == IB_NODE_TYPE_SWITCH ) - num_lids = 1; + { + /* Determine if base switch port 0 */ + p_sw = osm_get_switch_by_guid(p_mgr->p_subn, + osm_node_get_node_guid(osm_port_get_parent_node(p_port))); + if (!osm_switch_is_sp0_enhanced(p_sw)) + { + num_lids = 1; + } + } /* if the port matches the guid2lid */ if (!osm_db_guid2lid_get( p_mgr->p_g2l, guid, &min_lid, &max_lid)) From eitan at mellanox.co.il Fri Jun 30 03:55:20 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 30 Jun 2006 13:55:20 +0300 Subject: [openib-general] [PATCHv2] OpenSM/osm_lid_mgr.c: Support enhanced switch port 0forLMC > 0 Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302368920@mtlexch01.mtl.com> No not really Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Friday, June 30, 2006 12:38 PM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: RE: [PATCHv2] OpenSM/osm_lid_mgr.c: Support enhanced switch port > 0forLMC > 0 > > Hi Eitan, > > On Thu, 2006-06-29 at 15:54, Eitan Zahavi wrote: > > Hi Hal, > > > > I think the check for num lids is so similar it deserves an inline > > function. > > What do you say? > > Does the function need to be inlined ? 
> > -- Hal > > > I refer to: > > > > > + if (p_sw && (p_si = osm_switch_get_si_ptr(p_sw)) && > > > + ib_switch_info_is_enhanced_port0(p_si)) > > > + { > > > + num_lids = lmc_num_lids; > > > + } > > > + else > > > + { > > > + num_lids = 1; > > > + } > > > + } > > > > > > > From svenar at simula.no Fri Jun 30 06:30:22 2006 From: svenar at simula.no (Sven-Arne Reinemo) Date: Fri, 30 Jun 2006 15:30:22 +0200 Subject: [openib-general] A few questions about IBMgtSim In-Reply-To: <4496F9F9.90101@mellanox.co.il> References: <44968BEF.9030401@simula.no> <4496F9F9.90101@mellanox.co.il> Message-ID: <44A5276E.9010001@simula.no> Anno Domini 19-06-2006 21:24, Eitan Zahavi wrote: > Hi Sven, > > Please see my response below: Thanks for your help. I have another question regarding time scales in simulations. When the SM is used with the simulator how do I find the simulated time for events? I.e. if I run a simulation where it takes the SM 1 hour to get to subnet up (wallclock time) how do I find/calculate the time it took according to the simulator clock? Best regards, -- Sven-Arne Reinemo [simula.research laboratory] http://www.simula.no/ ++++ GnuPG public key - http://home.simula.no/~svenar/gpg.asc ++++ From halr at voltaire.com Fri Jun 30 08:10:19 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Jun 2006 11:10:19 -0400 Subject: [openib-general] Reloading of partition policy In-Reply-To: <1151639421.4478.746.camel@hal.voltaire.com> References: <44A46775.1000507@3leafnetworks.com> <1151639421.4478.746.camel@hal.voltaire.com> Message-ID: <1151680218.4478.28538.camel@hal.voltaire.com> On Thu, 2006-06-29 at 23:50, Hal Rosenstock wrote: > On Thu, 2006-06-29 at 19:51, Venkatesh Babu wrote: > > I was reviewing partition-config.txt and OpenSM_PKey_Mgr.txt and had the > > following comment - > > > > If we need to add/delete a node to/from a partition we need to update > > the file > > > > /etc/osm-partitions.txt > > > > and restart the OpenSM. 
According to the docs there no way we can do > > this without restarting the OpenSM. I just looked at those documents and couldn't find what you were referring to. Can you be more specific ? -- Hal > > > > It would be useful to add new feature to reload the partition table > > after making the changes. > > Partitions can be deleted and the new partitions enforced via issuing > kill -HUP to the OpenSM without restarting now. > > The document is (already) out of date :-( I will update it shortly. > > -- Hal > > > > > VBabu > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Fri Jun 30 08:04:35 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 30 Jun 2006 18:04:35 +0300 Subject: [openib-general] A few questions about IBMgtSim Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E302368923@mtlexch01.mtl.com> Hi Sven, Currently there is no way to scale simulation time to real time. The main reason is that the time scale is mixed: * OpenSM calculation time is about the same (if you run the simulator on a remote node) * SMA time and packet traversal time is not scaling at all, and the larger the fabric the larger the scaling factor. In real life the hardware handles the packets; in simulation it is a single CPU. EZ Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O.
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Sven-Arne Reinemo [mailto:svenar at simula.no] > Sent: Friday, June 30, 2006 4:30 PM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: Re: [openib-general] A few questions about IBMgtSim > > Anno Domini 19-06-2006 21:24, Eitan Zahavi wrote: > > Hi Sven, > > > > Please see my response below: > > Thanks for your help. I have another question regarding time scales in > simulations. When the SM is used with the simulator how do I find the > simulated time for events? I.e. if I run a simulation where it takes the > SM 1 hour to get to subnet up (wallclock time) how do I find/calculate > the time it took according to the simulator clock? > > Best regards, > > -- > Sven-Arne Reinemo > [simula.research laboratory] http://www.simula.no/ > ++++ GnuPG public key - http://home.simula.no/~svenar/gpg.asc ++++ From bos at pathscale.com Fri Jun 30 10:00:31 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 30 Jun 2006 10:00:31 -0700 Subject: [openib-general] [PATCH 0 of 39] ipath - bug fixes, performance enhancements, and portability improvements In-Reply-To: <20060630163108.GA24882@mellanox.co.il> References: <20060630163108.GA24882@mellanox.co.il> Message-ID: <1151686831.2194.7.camel@localhost.localdomain> On Fri, 2006-06-30 at 19:31 +0300, Michael S. Tsirkin wrote: > OK, next week I'll put these into my tree, too. Thanks. The first 37 are in -mm; the last two you can drop until I sort them out. References: <000001c69b9e$86268fd0$8698070a@amr.corp.intel.com> Message-ID: <20060630163345.GB24882@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: ipath patch series a-comin', but no IB maintainer to shepherd them > > >This currently includes a single patch from Venkatesh Babu: > > IB/core: Set alternate port number when initializing QP attributes. > > > >that has been checked into openib svn by Sean. > > Thanks Michael. 
I will assume that you will push this change in through Roland > when he's back. Sure. -- MST From mst at mellanox.co.il Fri Jun 30 09:31:08 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 30 Jun 2006 19:31:08 +0300 Subject: [openib-general] [PATCH 0 of 39] ipath - bug fixes, performance enhancements, and portability improvements In-Reply-To: References: Message-ID: <20060630163108.GA24882@mellanox.co.il> Quoting r. Bryan O'Sullivan : > Subject: [PATCH 0 of 39] ipath - bug fixes, performance enhancements,and portability improvements > > Hi, Andrew - > > These patches bring the ipath driver up to date with a number of bug fixes, > performance improvements, and better PowerPC support. There are a few > whitespace and formatting patches in the series, but they're all self- > contained. The patches have been tested internally, and shouldn't contain > anything controversial. > > My hope is that they'll sit in -mm for a little bit, and make it into > an early 2.6.18 -rc kernel. OK, next week I'll put these into my tree, too. Bryan, as far as I can see there were some comments with regard to patches 38 and 39 in the series. Will you be sending updated revisions of these? -- MST From viswa.krish at gmail.com Fri Jun 30 10:26:11 2006 From: viswa.krish at gmail.com (Viswanath Krishnamurthy) Date: Fri, 30 Jun 2006 10:26:11 -0700 Subject: [openib-general] CM and REP handling Message-ID: <4df28be40606301026g715df953v3676ed292662c694@mail.gmail.com> In the current communication manager (CM) implementation how is the REP MAD getting lost handled. When the REP gets lost, the cm_dup_req_handler gets called which currently enters the default condition and does nothing. The client retries the number of timers it is configured to and fails. If the first REP gets lost, the connection never gets established. So what should be the behavior ? -Viswa -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From venkatesh.babu at 3leafnetworks.com Fri Jun 30 11:08:40 2006 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Fri, 30 Jun 2006 11:08:40 -0700 Subject: [openib-general] Reloading of partition policy In-Reply-To: <1151680218.4478.28538.camel@hal.voltaire.com> References: <44A46775.1000507@3leafnetworks.com> <1151639421.4478.746.camel@hal.voltaire.com> <1151680218.4478.28538.camel@hal.voltaire.com> Message-ID: <44A568A8.1020908@3leafnetworks.com> The document doesn't describe the scenario where nodes are added/deleted from the partition table. I raised this issue because it could be an important use case. If this can be achieved without restarting the OpenSM, it is good. Just one more clarification - sending HUP signal doesn't cause OpenSM failover to other standby one right ? VBabu Hal Rosenstock wrote: > On Thu, 2006-06-29 at 23:50, Hal Rosenstock wrote: > >> On Thu, 2006-06-29 at 19:51, Venkatesh Babu wrote: >> >>> I was reviewing partition-config.txt and OpenSM_PKey_Mgr.txt and had the >>> following comment - >>> >>> If we need to add/delete a node to/from a partition we need to update >>> the file >>> >>> /etc/osm-partitions.txt >>> >>> and restart the OpenSM. According to the docs there no way we can do >>> this without restarting the OpenSM. >>> > > I just looked at those documents and couldn't find what you were > referring to. Can you be more specific ? > > -- Hal > > >>> It would be useful to add new feature to reload the partition table >>> after making the changes. >>> >> Partitions can be deleted and the new partitions enforced via issuing >> kill -HUP to the OpenSM without restarting now. >> >> The document is (already) out of date :-( I will update it shortly. 
>> >> -- Hal >> >> >>> VBabu >>> >>> _______________________________________________ >>> openib-general mailing list >>> openib-general at openib.org >>> http://openib.org/mailman/listinfo/openib-general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>> >>> >> _______________________________________________ >> openib-general mailing list >> openib-general at openib.org >> http://openib.org/mailman/listinfo/openib-general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> >> > > From trimmer at silverstorm.com Fri Jun 30 11:00:44 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Fri, 30 Jun 2006 14:00:44 -0400 Subject: [openib-general] CM and REP handling In-Reply-To: <4df28be40606301026g715df953v3676ed292662c694@mail.gmail.com> Message-ID: ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Viswanath Krishnamurthy Sent: Friday, June 30, 2006 1:26 PM To: openib-general at openib.org Subject: [openib-general] CM and REP handling In the current communication manager (CM) implementation how is the REP MAD getting lost handled. When the REP gets lost, the cm_dup_req_handler gets called which currently enters the default condition and does nothing. The client retries the number of timers it is configured to and fails. If the first REP gets lost, the connection never gets established. So what should be the behavior ? -Viswa -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mshefty at ichips.intel.com Fri Jun 30 11:09:06 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 30 Jun 2006 11:09:06 -0700 Subject: [openib-general] CM and REP handling In-Reply-To: <4df28be40606301026g715df953v3676ed292662c694@mail.gmail.com> References: <4df28be40606301026g715df953v3676ed292662c694@mail.gmail.com> Message-ID: <44A568C2.7000502@ichips.intel.com> Viswanath Krishnamurthy wrote: > In the current communication manager (CM) implementation how is the REP MAD > getting lost handled. When the REP gets lost, the cm_dup_req_handler > gets called > which currently enters the default condition and does nothing. The > client retries > the number of timers it is configured to and fails. If the first REP > gets lost, the connection > never gets established. So what should be the behavior ? The REP will be resent until an RTU is received. Repeated REQs can be dropped in cm_dup_req_handler() because the initial REQ has been received and a REP generated. That is, cm_dup_req_handler() is called on the side sending the REP. Are you seeing an issue with the code when the first REP is lost? - Sean From trimmer at silverstorm.com Fri Jun 30 11:12:02 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Fri, 30 Jun 2006 14:12:02 -0400 Subject: [openib-general] CM and REP handling In-Reply-To: <4df28be40606301026g715df953v3676ed292662c694@mail.gmail.com> Message-ID: > From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Viswanath Krishnamurthy > Sent: Friday, June 30, 2006 1:26 PM > In the current communication manager (CM) implementation how is the REP MAD > getting lost handled. When the REP gets lost, the cm_dup_req_handler gets called > which currently enters the default condition and does nothing. The client retries > the number of timers it is configured to and fails. If the first REP gets lost, the connection > never gets established. So what should be the behavior ? 
The IBTA standard in section 12.9.7 defines this situation in the state machine. In this case the Active side will have sent a REQ. It will be in REQ Sent state (or REP Wait in passive side sent an MRA). In these states the Active side will have a timer running. If the REP is lost, the Active side will timeout and move to the "Timeout" state. In this state, the active side has the option of resending the REQ or sending a REJ and giving up on the connection attempt. In general it is best for the active side to perform a few retries before it gives up. During this sequence the passive side will think it has sent its REP (eg. the one which was lost) so it will be in the REP Sent state (see 12.9.7.2). In this state if it receives another matching REQ, it is to resend its REP. There is also a timer on the passive side in this state (waiting for the RTU). If the passive side times out it will move to RTU Timeout and has the option to resent its REP or send a REJ and give up the connection attempt. Here too it is best for the passive side to perform a few retries before giving up. Todd Rimmer -------------- next part -------------- An HTML attachment was scrubbed... URL: From trimmer at silverstorm.com Fri Jun 30 11:18:12 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Fri, 30 Jun 2006 14:18:12 -0400 Subject: [openib-general] CM and REP handling In-Reply-To: <44A568C2.7000502@ichips.intel.com> Message-ID: > From: Sean Hefty > Sent: Friday, June 30, 2006 2:09 PM > > The REP will be resent until an RTU is received. Repeated REQs can be > dropped > in cm_dup_req_handler() because the initial REQ has been received and a > REP > generated. That is, cm_dup_req_handler() is called on the side sending > the REP. > Shouldn't the cm_dup_req_handler in this case also resend the REP per the IBTA passive side state machine "REP Sent" state? 
Todd Rimmer From mshefty at ichips.intel.com Fri Jun 30 11:28:28 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 30 Jun 2006 11:28:28 -0700 Subject: [openib-general] CM and REP handling In-Reply-To: References: Message-ID: <44A56D4C.70704@ichips.intel.com> Rimmer, Todd wrote: > Shouldn't the cm_dup_req_handler in this case also resend the REP per > the IBTA passive side state machine "REP Sent" state? The REP is already being retried based on a timeout. It could be resent immediately in response to a duplicate REQ as well, but that shouldn't be necessary, and actually makes things more complex, since coordination must be done between sending based on a timeout, versus receiving a duplicate REQ. - Sean From trimmer at silverstorm.com Fri Jun 30 12:46:40 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Fri, 30 Jun 2006 15:46:40 -0400 Subject: [openib-general] CM and REP handling In-Reply-To: <44A56D4C.70704@ichips.intel.com> Message-ID: > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Friday, June 30, 2006 2:28 PM > > Rimmer, Todd wrote: > > Shouldn't the cm_dup_req_handler in this case also resend the REP per > > the IBTA passive side state machine "REP Sent" state? > > The REP is already being retried based on a timeout. It could be resent > immediately in response to a duplicate REQ as well, but that shouldn't be > necessary, and actually makes things more complex, since coordination must > be > done between sending based on a timeout, versus receiving a duplicate REQ. I would recommend implementing the state machine as defined in the spec for the following reasons: 1. it will be necessary to pass any future IBTA CIWG compliance tests for the CM 2. I would need to think about it, but the lost REP case may not be the only situation where a duplicate REQ can be received. 3.
depending on RTU timeout on the passive side as the only means for resending the REP reduces the retries attempted in a "lossy" fabric for REP and RTU loss (eg. if you have 8 RTU timeout retries on passive side, and many REPs are lost followed by many RTUs, you get a total of 8 lost REPs+RTUs before you give up; managing the counters separately will tend to allow for more retries). In our proprietary stack we implemented the defined state machine and have stressed it for 1000s of concurrent connections (including various Chariot SDP connect/disconnect stress tests and Oracle uDAPL stress tests plus our use of the CM to establish connections when running MPI on 1000s of nodes) in various real-world and contrived situations of packet loss and slow responsiveness, and the defined state machine has worked very well for all these situations. Todd Rimmer From rdreier at cisco.com Fri Jun 30 13:56:18 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 30 Jun 2006 13:56:18 -0700 Subject: [openib-general] ipath patch series a-comin', but no IB maintainer to shepherd them In-Reply-To: <20060629163857.GT19300@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 29 Jun 2006 19:38:57 +0300") References: <20060628171318.7d97d617.akpm@osdl.org> <20060629163857.GT19300@mellanox.co.il> Message-ID: > Further, in the hope that this will help keep things reasonably stable till > Roland comes back, and help everyone see what's being merged, I have > created a git branch for all things infiniband going into 2.6.18. > > You can get at it here: > git://www.mellanox.co.il/~git/infiniband mst-for-2.6.18 Thanks for doing this ... however www.mellanox.co.il doesn't seem to have the git port open: fatal: unable to connect a socket (Connection refused) fetch-pack from 'git://www.mellanox.co.il/~git/infiniband' failed. - R.
From rdreier at cisco.com Fri Jun 30 14:08:19 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 30 Jun 2006 14:08:19 -0700 Subject: [openib-general] [PATCH 28 of 39] IB/ipath - Fixes a bug where our delay for EEPROM no longer works due to compiler reordering In-Reply-To: <20060629170711.757a97d2.akpm@osdl.org> (Andrew Morton's message of "Thu, 29 Jun 2006 17:07:11 -0700") References: <5f3c0b2d446d78e3327f.1151617279@eng-12.pathscale.com> <20060629170711.757a97d2.akpm@osdl.org> Message-ID: > > static void i2c_wait_for_writes(struct ipath_devdata *dd) > > { > > + mb(); > > (void)ipath_read_kreg32(dd, dd->ipath_kregs->kr_scratch); > > } > That's a bit weird. I wouldn't have expected the compiler to muck around > with a readl(). I never liked this patch. The last time it came up there were conflicting answers about whether it was a code generation bug or a real issue talking to hardware or what. At the least I think this merits a big comment explain what's going on -- and even better would be really understanding the bug that's being fixed so that we're confident it is indeed a real fix. - R. From mshefty at ichips.intel.com Fri Jun 30 14:52:07 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 30 Jun 2006 14:52:07 -0700 Subject: [openib-general] CM and REP handling In-Reply-To: References: Message-ID: <44A59D07.3060303@ichips.intel.com> Rimmer, Todd wrote: > I would recommend implementing the state machine as defined in the spec > for the following reasons: Technically, I believe that this follows the state machine. After receiving a duplicate REQ, a REP will be resent. The only difference is that there is a delay in resending the REP. > 1. it will be necessary to pass any future IBTA CIWG compliance tests > for the CM I don't believe that a compliance test would detect any issue. > 2. I would need to think about it, but the lost REP case may not be the > only situation where a duplicate REQ can be received. 
Note that the IB CM handles duplicate REQs differently based on the
current state.

> 3. depending on RTU timeout on the passive side as the only means for
> resending the REP reduces the retries attempted in a "lossy" fabric for
> REP and RTU loss (e.g. if you have 8 RTU timeout retries on the passive
> side, and many REPs are lost followed by many RTUs, you get a total of 8
> lost REPs+RTUs before you give up; managing the counters separately will
> tend to allow for more retries).

The number of retries cannot exceed the maximum CM retries that was
specified in the REQ.  Resending a REP immediately after receiving a
duplicate REQ needs to check against this and increment the number of
REPs that have been sent.  The result is that the connection timeout
actually decreases for every duplicate REQ that is received.

- Sean

From sean.hefty at intel.com  Fri Jun 30 20:58:33 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 30 Jun 2006 20:58:33 -0700
Subject: [openib-general] [PATCH] RMPP: add Dual-sided RMPP support
Message-ID: <000001c69cc2$9f7ac710$4e268686@amr.corp.intel.com>

Add support for dual-sided RMPP transfers.

The implementation assumes that any RMPP request that requires a
response uses DS RMPP.  Based on the RMPP start-up scenarios defined by
the spec, this should be a valid assumption.  That is, there is no
start-up scenario defined where an RMPP request is followed by a
non-RMPP response.  By having this assumption, we avoid any API changes.

In order for a node that supports DS RMPP to communicate with one that
does not, RMPP responses assume a new window size of 1 if a DS ACK has
not been received.  (By DS ACK, I'm referring to the ACK of the final
ACK to the request.)  This is a slight spec deviation, but is necessary
to allow communication with nodes that do not generate the DS ACK.  It
also handles the case when a response is sent after the request state
has been discarded.
Signed-off-by: Sean Hefty
---
This was tested by running grmpp between OpenFabric nodes running with
and without DS RMPP support.  Additional testing is desirable before
committing, since it affects all MADs using RMPP.

Index: mad_rmpp.c
===================================================================
--- mad_rmpp.c	(revision 8224)
+++ mad_rmpp.c	(working copy)
@@ -60,6 +60,7 @@ struct mad_rmpp_recv {
 	int last_ack;
 	int seg_num;
 	int newwin;
+	int repwin;
 
 	__be64 tid;
 	u32 src_qp;
@@ -170,6 +171,32 @@ static struct ib_mad_send_buf *alloc_res
 	return msg;
 }
 
+static void ack_ds_ack(struct ib_mad_agent_private *agent,
+		       struct ib_mad_recv_wc *recv_wc)
+{
+	struct ib_mad_send_buf *msg;
+	struct ib_rmpp_mad *rmpp_mad;
+	int ret;
+
+	msg = alloc_response_msg(&agent->agent, recv_wc);
+	if (IS_ERR(msg))
+		return;
+
+	rmpp_mad = msg->mad;
+	memcpy(rmpp_mad, recv_wc->recv_buf.mad, msg->hdr_len);
+
+	rmpp_mad->mad_hdr.method ^= IB_MGMT_METHOD_RESP;
+	ib_set_rmpp_flags(&rmpp_mad->rmpp_hdr, IB_MGMT_RMPP_FLAG_ACTIVE);
+	rmpp_mad->rmpp_hdr.seg_num = 0;
+	rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(1);
+
+	ret = ib_post_send_mad(msg, NULL);
+	if (ret) {
+		ib_destroy_ah(msg->ah);
+		ib_free_send_mad(msg);
+	}
+}
+
 void ib_rmpp_send_handler(struct ib_mad_send_wc *mad_send_wc)
 {
 	struct ib_rmpp_mad *rmpp_mad = mad_send_wc->send_buf->mad;
@@ -271,6 +298,7 @@ create_rmpp_recv(struct ib_mad_agent_pri
 	rmpp_recv->newwin = 1;
 	rmpp_recv->seg_num = 1;
 	rmpp_recv->last_ack = 0;
+	rmpp_recv->repwin = 1;
 
 	mad_hdr = &mad_recv_wc->recv_buf.mad->mad_hdr;
 	rmpp_recv->tid = mad_hdr->tid;
@@ -591,6 +619,16 @@ static inline void adjust_last_ack(struc
 		break;
 	}
 
+static void process_ds_ack(struct ib_mad_agent_private *agent,
+			   struct ib_mad_recv_wc *mad_recv_wc, int newwin)
+{
+	struct mad_rmpp_recv *rmpp_recv;
+
+	rmpp_recv = find_rmpp_recv(agent, mad_recv_wc);
+	if (rmpp_recv && rmpp_recv->state == RMPP_STATE_COMPLETE)
+		rmpp_recv->repwin = newwin;
+}
+
 static void process_rmpp_ack(struct ib_mad_agent_private *agent,
 			     struct ib_mad_recv_wc *mad_recv_wc)
 {
@@ -616,8 +654,18 @@ static void process_rmpp_ack(struct ib_m
 	spin_lock_irqsave(&agent->lock, flags);
 	mad_send_wr = ib_find_send_mad(agent, mad_recv_wc);
-	if (!mad_send_wr)
-		goto out;	/* Unmatched ACK */
+	if (!mad_send_wr) {
+		if (!seg_num)
+			process_ds_ack(agent, mad_recv_wc, newwin);
+		goto out;	/* Unmatched or DS RMPP ACK */
+	}
+
+	if ((mad_send_wr->last_ack == mad_send_wr->send_buf.seg_count) &&
+	    (mad_send_wr->timeout)) {
+		spin_unlock_irqrestore(&agent->lock, flags);
+		ack_ds_ack(agent, mad_recv_wc);
+		return;	/* Repeated ACK for DS RMPP transaction */
+	}
 
 	if ((mad_send_wr->last_ack == mad_send_wr->send_buf.seg_count) ||
 	    (!mad_send_wr->timeout) || (mad_send_wr->status != IB_WC_SUCCESS))
@@ -656,6 +704,9 @@ static void process_rmpp_ack(struct ib_m
 		if (mad_send_wr->refcount == 1)
 			ib_reset_mad_timeout(mad_send_wr,
 					     mad_send_wr->send_buf.timeout_ms);
+		spin_unlock_irqrestore(&agent->lock, flags);
+		ack_ds_ack(agent, mad_recv_wc);
+		return;
 	} else if (mad_send_wr->refcount == 1 &&
 		   mad_send_wr->seg_num < mad_send_wr->newwin &&
 		   mad_send_wr->seg_num < mad_send_wr->send_buf.seg_count) {
@@ -772,6 +823,39 @@ out:
 	return NULL;
 }
 
+static int init_newwin(struct ib_mad_send_wr_private *mad_send_wr)
+{
+	struct ib_mad_agent_private *agent = mad_send_wr->mad_agent_priv;
+	struct ib_mad_hdr *mad_hdr = mad_send_wr->send_buf.mad;
+	struct mad_rmpp_recv *rmpp_recv;
+	struct ib_ah_attr ah_attr;
+	unsigned long flags;
+	int newwin = 1;
+
+	if (!(mad_hdr->method & IB_MGMT_METHOD_RESP))
+		goto out;
+
+	spin_lock_irqsave(&agent->lock, flags);
+	list_for_each_entry(rmpp_recv, &agent->rmpp_list, list) {
+		if (rmpp_recv->tid != mad_hdr->tid ||
+		    rmpp_recv->mgmt_class != mad_hdr->mgmt_class ||
+		    rmpp_recv->class_version != mad_hdr->class_version ||
+		    (rmpp_recv->method & IB_MGMT_METHOD_RESP))
+			continue;
+
+		if (ib_query_ah(mad_send_wr->send_buf.ah, &ah_attr))
+			continue;
+
+		if (rmpp_recv->slid == ah_attr.dlid) {
+			newwin = rmpp_recv->repwin;
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&agent->lock, flags);
+out:
+	return newwin;
+}
+
 int ib_send_rmpp_mad(struct ib_mad_send_wr_private *mad_send_wr)
 {
 	struct ib_rmpp_mad *rmpp_mad;
@@ -787,7 +871,7 @@ int ib_send_rmpp_mad(struct ib_mad_send_
 		return IB_RMPP_RESULT_INTERNAL;
 	}
 
-	mad_send_wr->newwin = 1;
+	mad_send_wr->newwin = init_newwin(mad_send_wr);
 
 	/* We need to wait for the final ACK even if there isn't a response */
 	mad_send_wr->refcount += (mad_send_wr->timeout == 0);